Advanced R

R
Author

Shitao5

Published

2023-08-30

Modified

2023-09-14

Progress

Learning Progress: 40.8%.

Learning Source

Foundations

1 Functions

R uses lexical scoping: it looks up the values of names based on how a function is defined, not how it is called. “Lexical” here is not the English adjective that means relating to words or a vocabulary. It’s a technical CS term that tells us that the scoping rules use a parse-time, rather than a run-time structure.

R’s lexical scoping follows four primary rules:

  • Name masking
  • Functions versus variables
  • A fresh start
  • Dynamic lookup

R使用词法作用域(lexical scoping):它根据函数的定义方式查找名称的值,而不是根据它的调用方式。“词法”在这里不是指与单词或词汇相关的英语形容词。它是一个技术性的计算机科学术语,告诉我们作用域规则使用的是解析时的结构,而不是运行时的结构。

R的词法作用域遵循四条主要规则:

  • 名称屏蔽(Name masking)

  • 函数与变量

  • 全新的起点

  • 动态查找

Lexical scoping determines where, but not when to look for values. R looks for values when the function is run, not when the function is created. Together, these two properties tell us that the output of a function can differ depending on the objects outside the function’s environment.

词法作用域决定了在哪里查找值,但并不决定何时查找值。R在运行函数时查找值,而不是在创建函数时。这两个属性共同告诉我们,函数的输出可能会因函数环境外部的对象而异。

Lazy evaluation is powered by a data structure called a promise, or (less commonly) a thunk. It’s one of the features that makes R such an interesting programming language.

惰性求值是由一个叫做 promise 的数据结构支持的,或者更少见的叫做 thunk。它是使R成为一个如此有趣的编程语言的特性之一。

You cannot manipulate promises with R code. Promises are like a quantum state: any attempt to inspect them with R code will force an immediate evaluation, making the promise disappear. Later, you’ll learn about quosures, which convert promises into an R object where you can easily inspect the expression and the environment.

您不能使用R代码操作 promises。Promise 就像一个量子状态:任何试图用R代码检查它们的尝试都会立即导致求值,使 promise 消失。稍后,您将了解到 quosures,它们可以将 promises 转换为R对象,您可以轻松检查表达式和环境。

An error indicates that something has gone wrong, and forces the user to deal with the problem. Some languages (like C, Go, and Rust) rely on special return values to indicate problems, but in R you should always throw an error.

错误表示出现了问题,并迫使用户处理这个问题。一些语言(如C、Go和Rust)依赖于特殊的返回值来表示问题,但在R中,您应该始终抛出一个错误。

2 Environments

The job of an environment is to associate, or bind, a set of names to a set of values. You can think of an environment as a bag of names, with no implied order (i.e. it doesn’t make sense to ask which is the first element in an environment).

环境的作用是将一组名称与一组值关联或绑定在一起。您可以将环境视为一个名称的集合,没有隐含的顺序(即在环境中询问哪个元素是第一个元素没有意义)。

e1 <- env(
  a = FALSE,
  b = "a",
  c = 2.3,
  d = 1:3
)

e1
#> <environment: 0x0000025d81de3928>
env_print(e1)
#> <environment: 0x0000025d81de3928>
#> Parent: <environment: global>
#> Bindings:
#> • a: <lgl>
#> • b: <chr>
#> • c: <dbl>
#> • d: <int>
env_names(e1)
#> [1] "a" "b" "c" "d"

To compare environments, you need to use identical() and not ==. This is because == is a vectorised operator, and environments are not vectors.

identical(global_env(), current_env())
#> [1] TRUE

global_env() == current_env()
#> Error in global_env() == current_env(): comparison (==) is possible only for atomic and list types
# Parents
e2a <- env(d = 4, e = 5)
e2b <- env(e2a, a = 1, b = 2, c = 3)

e2a
#> <environment: 0x0000025d800ce000>
e2b
#> <environment: 0x0000025dff217980>
# find the parent of an environment with env_parent()
env_parent(e2b)
#> <environment: 0x0000025d800ce000>
env_parent(e2a)
#> <environment: R_GlobalEnv>

Only one environment doesn’t have a parent: the empty environment.

The immediate parent of the global environment is the last package you attached, the parent of that package is the second to last package you attached, …

3 Conditions

Every condition has default behaviour: errors stop execution and return to the top level, warnings are captured and displayed in aggregate, and messages are immediately displayed. Condition handlers allow us to temporarily override or supplement the default behaviour.

每个条件都有默认行为:错误会停止执行并返回到顶层,警告会被捕获并按聚合方式显示,消息会立即显示。条件处理程序允许我们临时覆盖或补充默认行为。

tryCatch() registers exiting handlers, and is typically used to handle error conditions. It allows you to override the default error behaviour. For example, the following code will return NA instead of throwing an error:

f3 <- function(x) {
  tryCatch(
    error = function(cnd) NA,
    log(x)
  )
}

f3(3)
#> [1] 1.098612
f3("x")
#> [1] NA

The handlers set up by tryCatch() are called exiting handlers because after the condition is signalled, control passes to the handler and never returns to the original code, effectively meaning that the code exits.

tryCatch() 设置的处理程序被称为退出处理程序,因为在条件被发出后,控制权传递给处理程序,不再返回到原始代码,实际上意味着代码退出执行。

Warning

这章有点看不懂,暂缓。

Functional Progarmming

4 Functionals

A functional is a function that takes a function as an input and returns a vector as output. Here’s a simple functional: it calls the function provided as input with 1000 random uniform numbers.

一个函数式是一个接受函数作为输入并返回向量作为输出的函数。这是一个简单的函数式示例:它使用 1000 个随机均匀数调用提供的输入函数。

randomise <- function(f) f(runif(1e3))
randomise(mean)
#> [1] 0.512291
randomise(mean)
#> [1] 0.5115914
randomise(sum)
#> [1] 501.9704

The map functions also have shortcuts for extracting elements from a vector, powered by purrr::pluck(). You can use a character vector to select elements by name, an integer vector to select by position, or a list to select by both name and position. These are very useful for working with deeply nested lists, which often arise when working with JSON.

map 函数还具有从向量中提取元素的快捷方式,由 purrr::pluck() 提供支持。你可以使用字符向量按名称选择元素,使用整数向量按位置选择元素,或者使用列表同时按名称和位置选择元素。这在处理深层嵌套的列表时非常有用,这种情况在处理 JSON 数据时经常出现。

x <- list(
  list(-1, x = 1, y = c(2), z = "a"),
  list(-2, x = 4, y = c(5, 6), z = "b"),
  list(-3, x = 8, y = c(9, 10, 11))
)

# Select by name
map_dbl(x, "x")
#> [1] 1 4 8

# Or by position
map_dbl(x, 1)
#> [1] -1 -2 -3

# Or both
map_dbl(x, list("y", 1))
#> [1] 2 5 9

# You'll get an error if a component doesn't exist
map_chr(x, "z")
#> Error in `map_chr()`:
#> ℹ In index: 3.
#> Caused by error:
#> ! Result must be length 1, not 0.

# Unless you supply a .default value
map_chr(x, "z", .default = NA)
#> [1] "a" "b" NA

Note there’s a subtle difference between placing extra arguments inside an anonymous function compared with passing them to map(). Putting them in an anonymous function means that they will be evaluated every time f() is executed, not just once when you call map(). This is easiest to see if we make the additional argument random:

需要注意的是,在匿名函数中放置额外的参数与将它们传递给 map() 之间存在微妙的差异。将它们放在匿名函数中意味着它们将在每次执行 f() 时被评估,而不仅仅是在调用 map() 时评估一次。如果我们将额外的参数设置为随机值,这一点将变得最容易理解:

plus <- function(x, y) round(x + y, 2)

x <- rep(0, 4)
map_dbl(x, plus, runif(1))
#> [1] 0.68 0.68 0.68 0.68
map_dbl(x, ~ plus(.x, runif(1)))
#> [1] 0.29 0.02 0.91 0.01
# Purrr style
by_cyl <- split(mtcars, mtcars$cyl)

by_cyl %>% 
  map(~ lm(mpg ~ wt, data = .x)) %>% 
  map(coef) %>% 
  map_dbl(2)
#>         4         6         8 
#> -5.647025 -2.780106 -2.192438

There are three basic ways to loop over a vector with a for loop:

  • Loop over the elements: for (x in xs)

  • Loop over the numeric indices: for (i in seq_along(xs))

  • Loop over the names: for (nm in names(xs))

The first form is analogous to the map() family. The second and third forms are equivalent to the imap() family which allows you to iterate over the values and the indices of a vector in parallel.

imap() is like map2() in the sense that your .f gets called with two arguments, but here both are derived from the vector. imap(x, f) is equivalent to map2(x, names(x), f) if x has names, and map2(x, seq_along(x), f) if it does not.

有三种基本方法可以使用for循环遍历向量:

遍历元素:for (x in xs)

遍历数值索引:for (i in seq_along(xs))

遍历名称:for (nm in names(xs))

第一种形式类似于 map() 家族。第二和第三种形式等同于 imap() 家族,它允许你同时迭代向量的值和索引。

imap() 类似于 map2(),因为你的 .f 会被调用两次,但这里的两个参数都来自向量。如果 x 有名称,imap(x, f) 等同于 map2(x, names(x), f),如果没有名称,就等同于 map2(x, seq_along(x), f)

reduce() is a useful way to generalise a function that works with two inputs (a binary function) to work with any number of inputs.

reduce() 是一种有用的方式,可以将一个适用于两个输入(二进制函数)的函数泛化为适用于任意数量的输入。

set.seed(1231)
l <- map(1:4, ~ sample(1:10, 15, replace = TRUE))
str(l)
#> List of 4
#>  $ : int [1:15] 10 10 4 5 8 10 10 9 10 3 ...
#>  $ : int [1:15] 4 9 5 4 1 5 8 9 9 10 ...
#>  $ : int [1:15] 7 7 1 3 1 3 5 5 7 2 ...
#>  $ : int [1:15] 7 9 4 8 7 10 4 5 6 10 ...

# 查找出现在每个元素中的值
reduce(l, intersect)
#> [1] 10  4  5  9  7

# 查找所有出现的值
reduce(l, union)
#>  [1] 10  4  5  8  9  3  6  7  1  2
# accumulate 返回中间结果
accumulate(l, intersect)
#> [[1]]
#>  [1] 10 10  4  5  8 10 10  9 10  3  3  4  6  7 10
#> 
#> [[2]]
#> [1] 10  4  5  8  9  6  7
#> 
#> [[3]]
#> [1] 10  4  5  9  7
#> 
#> [[4]]
#> [1] 10  4  5  9  7

If you’re using reduce() in a function, you should always supply .init. Think carefully about what your function should return when you pass a vector of length 0 or 1, and make sure to test your implementation.

如果你在函数中使用 reduce(),你应该始终提供 .init 参数。仔细考虑当你传递长度为 0 或 1 的向量时,你的函数应该返回什么,并确保测试你的实现。

A predicate functional applies a predicate to each element of a vector. purrr provides seven useful functions which come in three groups:

  • some(.x, .p) returns TRUE if any element matches;

  • every(.x, .p) returns TRUE if all elements match;

  • none(.x, .p) returns TRUE if no element matches.

These are similar to any(map_lgl(.x, .p)), all(map_lgl(.x, .p)) and all(map_lgl(.x, negate(.p))) but they terminate early: some() returns TRUE when it sees the first TRUE, and every() and none() return FALSE when they see the first FALSE or TRUE respectively.

  • detect(.x, .p) returns the value of the first match; detect_index(.x, .p) returns the location of the first match.

  • keep(.x, .p) keeps all matching elements; discard(.x, .p) drops all matching elements.

谓词函数将谓词应用于向量的每个元素。purrr 提供了七个有用的函数,分为三组:

  • some(.x, .p):如果任何元素匹配,则返回 TRUE

  • every(.x, .p):如果所有元素都匹配,则返回 TRUE

  • none(.x, .p):如果没有元素匹配,则返回 TRUE

这些函数类似于 any(map_lgl(.x, .p))all(map_lgl(.x, .p))all(map_lgl(.x, negate(.p))),但它们会提前终止:some() 在看到第一个 TRUE 时返回 TRUEevery()none() 在看到第一个 FALSETRUE 时分别返回 FALSE

  • detect(.x, .p):返回第一个匹配的值;detect_index(.x, .p):返回第一个匹配的位置。

  • keep(.x, .p):保留所有匹配的元素;discard(.x, .p):删除所有匹配的元素。

df = data.frame(x = 1:3, y = letters[1:3])
detect(df, is.factor)
#> NULL
detect_index(df, is.factor)
#> [1] 0

str(keep(df, is.factor))
#> 'data.frame':    3 obs. of  0 variables
str(discard(df, is.factor))
#> 'data.frame':    3 obs. of  2 variables:
#>  $ x: int  1 2 3
#>  $ y: chr  "a" "b" "c"

map() and modify() come in variants that also take predicate functions, transforming only the elements of .x where .p is TRUE.

map()modify() 有一些变体,它们还接受谓词函数,只会在 .pTRUE 的情况下转换 .x 的元素。

df = data.frame(
  num1 = c(0, 10, 20),
  num2 = c(5, 6, 7),
  chr1 = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

str(map_if(df, is.numeric, mean))
#> List of 3
#>  $ num1: num 10
#>  $ num2: num 6
#>  $ chr1: chr [1:3] "a" "b" "c"
str(modify_if(df, is.numeric, mean))
#> 'data.frame':    3 obs. of  3 variables:
#>  $ num1: num  10 10 10
#>  $ num2: num  6 6 6
#>  $ chr1: chr  "a" "b" "c"
str(map(keep(df, is.numeric), mean))
#> List of 2
#>  $ num1: num 10
#>  $ num2: num 6

5 Function Factories

A function factory is a function that makes functions. Here’s a very simple example: we use a function factory (power1()) to make two child functions (square() and cube()):

power1 = function(exp) {
  function(x) {
    x ^ exp
  }
}

square = power1(2)
cube = power1(3)

square(3)
#> [1] 9
cube(3)
#> [1] 27
square
#> function(x) {
#>     x ^ exp
#>   }
#> <environment: 0x0000025dfed55000>
cube
#> function(x) {
#>     x ^ exp
#>   }
#> <bytecode: 0x0000025dfd0dc978>
#> <environment: 0x0000025dfe94cb28>

It’s obvious where x comes from, but how does R find the value associated with exp? Simply printing the manufactured functions is not revealing because the bodies are identical; the contents of the enclosing environment are the important factors. We can get a little more insight by using rlang::env_print(). That shows us that we have two different environments (each of which was originally an execution environment of power1()). The environments have the same parent, which is the enclosing environment of power1(), the global environment.

显然,x 的值是明显的,但 R 如何找到与 exp 相关联的值呢?仅仅打印制造出的函数并不能揭示这一点,因为它们的主体是相同的;封闭环境的内容才是重要因素。我们可以通过使用 rlang::env_print() 来获得更多的见解。这将显示我们有两个不同的环境(每个最初都是 power1() 的执行环境)。这些环境具有相同的父级,即 power1() 的封闭环境,也就是全局环境。

env_print(square)
#> <environment: 0x0000025dfed55000>
#> Parent: <environment: global>
#> Bindings:
#> • exp: <dbl>
env_print(cube)
#> <environment: 0x0000025dfe94cb28>
#> Parent: <environment: global>
#> Bindings:
#> • exp: <dbl>
fn_env(square)$exp
#> [1] 2
fn_env(cube)$exp
#> [1] 3
Back to top