8 foreach() is not a for-loop
For-loops are special in the way they can assign values to objects outside of the for-loop. For example,
xs <- list()
ys <- list()
last_idx <- 0
for (idx in 1:3) {
  xs[[idx]] <- letters[idx]
  ys[[idx]] <- LETTERS[idx]
  last_idx <- idx
}
assigns to both xs and ys. We also see that last_idx is updated in every iteration, and, when the for-loop completes, it holds:
last_idx
[1] 3
In contrast, we cannot do the same with map-reduce calls, such as lapply(), because they return their results but cannot assign to variables outside the function they call.
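To make "cannot assign outside" concrete, here is a minimal base-R sketch. The variable name total is made up for this illustration and does not appear elsewhere in this chapter.

```r
# Hypothetical example: 'total' is an illustrative name only
total <- 0

res <- lapply(1:3, function(idx) {
  total <- total + idx  # creates a *local* 'total' inside the function
  total                 # the value is returned, not assigned outside
})

str(res)  # the three returned values
total     # still 0: the outer 'total' was never modified
```

Each call reads the outer total, but the plain <- assignment only ever writes to a variable that lives inside the function call and disappears when the call returns.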
8.1 Super assignment (<<-) is not a solution
Warning: using “super” assignments (<<-), as in:
xs <- list()
ys <- list()
last_idx <- 0
void <- lapply(1:3, function(idx) {
  xs[[idx]] <<- letters[idx]
  ys[[idx]] <<- LETTERS[idx]
  last_idx <<- idx
})
or, similarly, assign(..., envir = parent.frame()), is considered bad practice for many reasons. Please do not use such hacks! (They will come back and bite you if you try. Trust me.)
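One such reason, sketched here as an illustration of my own choosing: <<- assigns to the nearest enclosing environment in which the variable already exists, so which variable gets modified depends on where the code happens to live. The function name fill() is hypothetical.

```r
xs <- list()  # a global 'xs'

# Hypothetical wrapper: the same <<- code, now placed inside a function
fill <- function() {
  xs <- list()  # a local 'xs' that shadows the global one
  void <- lapply(1:3, function(idx) {
    xs[[idx]] <<- letters[idx]  # updates the 'xs' inside fill(), not the global
  })
  length(xs)
}

n_local <- fill()
n_local     # 3: the local 'xs' was filled
length(xs)  # 0: the global 'xs' was never touched
```

Moving the code into a function silently changed its effect, which is the kind of surprise that makes <<- hard to reason about.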
Previously, I said that any lapply() call can be replaced with a future_lapply() call such that it can run in parallel. What would happen if we went ahead and used the above <<- hack? Let us try:
library(future.apply)
plan(multisession)
xs <- list()
ys <- list()
last_idx <- 0
void <- future_lapply(1:3, function(idx) {
  xs[[idx]] <<- letters[idx]
  ys[[idx]] <<- LETTERS[idx]
  last_idx <<- idx
})
If we check xs, ys, and last_idx afterward:
str(xs)
list()
str(ys)
list()
last_idx
[1] 0
we find that xs and ys are still empty and last_idx is still zero.
Q. Why is that?
The reason is that the expressions:
xs[[idx]] <<- letters[idx]
ys[[idx]] <<- LETTERS[idx]
last_idx <<- idx
are evaluated in another R process. The assignment to xs, ys, and last_idx is done to the global environment of that R process, which is not the same as the global environment of our main R session. In our main R session, the only assignment to xs, ys, and last_idx was from our initial:
xs <- list()
ys <- list()
last_idx <- 0
assignments, which is why they are still the same.
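As a rough analogy only, the situation can be mimicked in a single R session by using a separate environment as a stand-in for the worker's global environment. The name worker_env is hypothetical, and a real worker is a separate R process, not an environment in our session.

```r
last_idx <- 0  # lives in our main session

# Hypothetical stand-in for the global environment of a parallel worker
worker_env <- new.env()

# The "worker" performs its assignment in its own environment
evalq(last_idx <- 3, envir = worker_env)

get("last_idx", envir = worker_env)  # 3: the worker's copy was updated
last_idx                             # 0: the main session's copy is unchanged
```

The two variables share a name but live in different places, so updating one has no effect on the other.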
Now, assume for a moment that it were indeed possible to use <<- to assign to the main R session from parallel processes as well. If so, what value should last_idx have at the very end? That would depend on the order in which the parallel tasks complete. For instance, imagine the first iteration (idx = 1) is very slow and therefore finishes last. Would you then expect last_idx to be 1 or 3?
Conclusion: It is not possible, and it does not make sense, to assign to the global environment when running in parallel!
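The order-dependence can be illustrated without any parallel processing. This is a sequential simulation with made-up task durations, assuming all three tasks start at the same time and that whichever task finishes last is the last one to write to a shared last_idx.

```r
# Hypothetical task durations in seconds; iteration 1 is the slowest
durations <- c(3.0, 1.0, 2.0)

# The order in which the three tasks would finish
completion_order <- order(durations)
completion_order  # 2 3 1

# The value a shared 'last_idx' would end up with: the last writer wins
last_idx <- completion_order[length(completion_order)]
last_idx  # 1, not 3
```

With different durations the answer changes, which is why a shared last_idx has no well-defined value in a parallel setting.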
8.2 Return instead of assign in map-reduce calls
The solution for map-reduce functions, such as lapply(), is to return all results and split afterward, e.g.
res <- lapply(1:3, function(idx) {
  data.frame(x = letters[idx], y = LETTERS[idx], idx = idx)
})
xs <- lapply(res, `[[`, "x")
ys <- lapply(res, `[[`, "y")
last_idx <- res[[length(res)]][["idx"]]
rm(res)
str(xs)
List of 3
$ : chr "a"
$ : chr "b"
$ : chr "c"
str(ys)
List of 3
$ : chr "A"
$ : chr "B"
$ : chr "C"
last_idx
[1] 3
This strategy works in parallel too:
library(future.apply)
plan(multisession)
res <- future_lapply(1:3, function(idx) {
  list(x = letters[idx], y = LETTERS[idx], idx = idx)
})
xs <- lapply(res, `[[`, "x")
ys <- lapply(res, `[[`, "y")
last_idx <- res[[length(res)]][["idx"]]
rm(res)
str(xs)
List of 3
$ : chr "a"
$ : chr "b"
$ : chr "c"
str(ys)
List of 3
$ : chr "A"
$ : chr "B"
$ : chr "C"
last_idx
[1] 3
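If each iteration returns many fields, the splitting step can be wrapped in a small helper. This is only a sketch: split_fields() is a hypothetical helper, not part of future.apply, and lapply() is used here as a stand-in for future_lapply(), which returns a list of the same shape.

```r
# Hypothetical helper: turn a list of per-iteration results into
# one list per field
split_fields <- function(res, fields) {
  out <- lapply(fields, function(f) lapply(res, `[[`, f))
  names(out) <- fields
  out
}

# Stand-in for the future_lapply() call; same return structure
res <- lapply(1:3, function(idx) {
  list(x = letters[idx], y = LETTERS[idx], idx = idx)
})

parts <- split_fields(res, c("x", "y"))
str(parts$x)
str(parts$y)
```

The helper does nothing that the explicit lapply(res, `[[`, "x") calls do not; it just keeps the split in one place.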
8.3 foreach() is a map-reduce function
The main thing to understand is that foreach() does not work like a for-loop. If you try, say,
library(doFuture)
registerDoFuture()
plan(multisession)
xs <- list()
ys <- list()
last_idx <- 0
void <- foreach(idx = 1:3, .export = c("xs", "ys")) %dopar% {
  xs[[idx]] <- letters[idx]
  ys[[idx]] <- LETTERS[idx]
  last_idx <- idx
}
you’ll find that:
str(xs)
list()
str(ys)
list()
last_idx
[1] 0
This is because foreach() is a map-reduce function. It is only its name and the %dopar% operator that make it visually resemble a for-loop, although it isn’t one. To further clarify this, if it were not for the %dopar% operator, the original creator would probably have designed foreach() to take a function, just like lapply(), e.g.
void <- foreach(idx = 1:3, function(idx) {
  ...
})
If that had been the case, it would be clear that foreach() is just another map-reduce function, just like lapply() and map() of the purrr package.
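To make this thought experiment tangible, here is a base-R sketch of what such a function-taking interface could look like. The function foreach_like() is entirely hypothetical; the real foreach() does not accept a function this way.

```r
# Hypothetical: NOT the real foreach() interface, just a thin wrapper
# around lapply() to show the map-reduce shape
foreach_like <- function(idx, FUN) {
  lapply(idx, FUN)
}

res <- foreach_like(idx = 1:3, function(idx) {
  list(x = letters[idx], idx = idx)
})

str(res[[3]])
```

Written like this, nobody would expect assignments inside the function body to leak out into the calling environment.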
To conclude, we should always use foreach() as a map-reduce function, e.g.
library(doFuture)
plan(multisession)
res <- foreach(idx = 1:3) %dofuture% {
  list(x = letters[idx], y = LETTERS[idx], idx = idx)
}
xs <- lapply(res, `[[`, "x")
ys <- lapply(res, `[[`, "y")
last_idx <- res[[length(res)]][["idx"]]
rm(res)
str(xs)
List of 3
$ : chr "a"
$ : chr "b"
$ : chr "c"
str(ys)
List of 3
$ : chr "A"
$ : chr "B"
$ : chr "C"
last_idx
[1] 3