Part 7 Random numbers and reproducibility

When we draw random number in R, e.g.

x <- rnorm(10)

the internals of R makes sure to follow best practices so that we get numbers that have “random” properties, e.g. knowing x[1:9] does not help us predict x[10]. R uses a so called random number generator (RNG) to generate random numbers. More precisely, it uses a pseudo RNG, because 100% true random numbers are really hard to get by. An pseudo RNG is a deterministic RNG algorithm that produces a sequence of numbers that have good “random” properties. Developing good pseudo RNGs is a research field in itself, which we won’t cover in this tutorial, but what is worth mentioning is that R support several different kinds of RNGs, and the default one has carefully been chosen for us;

> RNGkind()
[1] "Mersenne-Twister" "Inversion"        "Rejection"

We can get reproducible random numbers by setting the RNG seed, e.g.

set.seed(42)
rnorm(4)
#> [1]  1.3709584 -0.5646982  0.3631284  0.6328626
rnorm(4)
#> [1]  0.40426832 -0.10612452  1.51152200 -0.09465904

If we redo the above, we get an identical sequence of “random” numbers:

set.seed(42)
rnorm(4)
#> [1]  1.3709584 -0.5646982  0.3631284  0.6328626
rnorm(4)
#> [1]  0.40426832 -0.10612452  1.51152200 -0.09465904

7.1 Why is this important?

Random number generation (RNG) is important, because:

many statistical methods rely on it for correctness,
science rely on it for reproducibility.

If RNG produce “poor” random numbers, there is a risk we get invalid results.

When performing calculations and analyses in parallel, we cannot use the default RNG (Mersenne-Twister). If we do, there is a risk that we end up with correlated random numbers across parallel workers, which then results in biased results and incorrect conclusions. To solve this, other pseudo RNGs have been developed specifically for parallel processing:

L’Ecuyer is a RNG that works well for parallel processing

> RNGkind("L'Ecuyer-CMRG")
> RNGkind()
[1] "L'Ecuyer-CMRG" "Inversion"     "Rejection"

7.2 Futures use proper parallel RNG

The future framework uses the L’Ecuyer parallel RNG. It’s all taken care of automatically. All you need to remember is to declare that your parallel code uses random numbers. This you can do by specifying seed = TRUE, e.g.

f <- future(rnorm(4), seed = TRUE)

Futures respects the random seed, and can therefore produce reproducible random numbers, e.g.

set.seed(42)
f <- future(rnorm(4), seed = TRUE)
value(f)
#> [1] -1.6430042 -1.0003081 -0.3619149  0.1475366

set.seed(42)
f <- future(rnorm(4), seed = TRUE)
value(f)
#> [1] -1.6430042 -1.0003081 -0.3619149  0.1475366

Comment: The random numbers produced with the same seed (here 42) are not identical when using the L’Ecuyer RNG as when using the Mersenne-Twister RNG.

7.2.1 What happens if you forget to declare seed = TRUE?

If you don’t specify seed = TRUE, and your future code end up using random numbers, then the future framework detects this and produces an informative warning that is hard to miss:

f <- future(rnorm(4))
x <- value(f)
#> Warning message:
#> UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random numbers 
without specifying argument 'seed'. There is a risk that those random numbers
are not statistically sound and the overall results might be invalid. To fix
this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random
numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use
'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".

7.2.2 Random numbers with parallel map-reduce functions

The future.apply package use argument future.seed = TRUE, and the furrr package argument .options = furrr_options(seed = TRUE) for declaring that RNG is used. For example,

library(future.apply)
plan(multisession, workers = 2)

X <- 2:4
y <- future_lapply(X, rnorm, future.seed = TRUE)
str(y)
#> List of 3
#>  $ : num [1:2] -0.0265 -1.7324
#>  $ : num [1:3] 2.5456 -0.6127 -0.0912
#>  $ : num [1:4] -2.107 -1.304 0.229 0.775

It does not matter what parallel backend you use, or number of parallel workers, the future framework guarantees numerically identical random numbers with the same random seed, e.g.

library(future.apply)

X <- 2:4

plan(sequential)
set.seed(42)
y <- future_lapply(X, rnorm, future.seed = TRUE)
str(y)
#> List of 3
#>  $ : num [1:2] -0.169 -1.642
#>  $ : num [1:3] 0.649 1.193 0.542
#>  $ : num [1:4] 1.36 -1.85 -1.28 0.42

plan(multisession, workers = 4)
set.seed(42)
y <- future_lapply(X, rnorm, future.seed = TRUE)
str(y)
#> List of 3
#>  $ : num [1:2] -0.169 -1.642
#>  $ : num [1:3] 0.649 1.193 0.542
#>  $ : num [1:4] 1.36 -1.85 -1.28 0.42

plan(cluster, workers = c("pi", "remote.server.org"))
set.seed(42)
y <- future_lapply(X, rnorm, future.seed = TRUE)
str(y)
#> List of 3
#>  $ : num [1:2] -0.169 -1.642
#>  $ : num [1:3] 0.649 1.193 0.542
#>  $ : num [1:4] 1.36 -1.85 -1.28 0.42

As before, if you forget to declare your need for RNG, you’ll get informative warnings about it;

y <- future_lapply(X, rnorm)
Warning message:
UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-1')
unexpectedly generated random numbers without declaring so. There is a risk
that those random numbers are not statistically sound and the overall results
might be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that
proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG 
method. To disable this check, use 'future.seed = NULL', or set option 
'future.rng.onMisuse' to "ignore".