3 Parallelizing map-reduce calls
Next, assume we have four sets of numeric vectors, and we want to calculate slow_sum() for each of them. We have them in a list, e.g.
xs <- list(1:25, 26:50, 51:75, 76:100)
We could keep doing what we did in the previous section;
ys <- list()
ys[[1]] <- slow_sum(xs[[1]])
ys[[2]] <- slow_sum(xs[[2]])
ys[[3]] <- slow_sum(xs[[3]])
ys[[4]] <- slow_sum(xs[[4]])
This will give us the results in a list ys of the same length as xs, e.g.
str(ys)
List of 4
$ : num 325
$ : num 950
$ : num 1575
$ : num 2200
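As a reminder, slow_sum() was defined in the previous section. A minimal sketch consistent with the timings reported below (roughly one second per value) could look like this; the actual definition is the one given earlier:

```r
# Hypothetical sketch of slow_sum(): sums a numeric vector slowly,
# pausing about one second per element (the real definition is in
# the previous section)
slow_sum <- function(x) {
  total <- 0
  for (value in x) {
    Sys.sleep(1.0)        # simulate an expensive computation step
    total <- total + value
  }
  total
}
```

With this sketch, slow_sum(1:25) returns 325 after about 25 seconds, matching the output above.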
This approach becomes very tedious when there are more sets, i.e. when length(xs) is large. It is also error prone; it's too easy to introduce a silent bug from a single typo, e.g.
ys <- list()
ys[[1]] <- slow_sum(xs[[1]])
ys[[2]] <- slow_sum(xs[[2]])
ys[[3]] <- slow_sum(xs[[2]])
ys[[4]] <- slow_sum(xs[[4]])
Whenever you find yourself repeating code by cut'n'paste from previous lines, it's a good indicator to stop and think. There's almost always a better way to do this - you just have to find what it is!
R is designed to simplify this type of task. In this case we can use lapply() to achieve the same:
ys <- lapply(xs, slow_sum)
str(ys)
List of 4
$ : num 325
$ : num 950
$ : num 1575
$ : num 2200
3.1 Parallelizing a map-reduce call using the ‘future.apply’ package
Since there are four sets of data, each comprising 25 values, and each value takes about one second to process, processing all of the data takes about 100 seconds;
tic()
ys <- lapply(xs, slow_sum)
toc()
Time difference of 100.6 secs
Can we speed this up by processing the different elements in xs concurrently?
Yes, we can. Unfortunately, the built-in lapply() function is not implemented to run in parallel. However, the future.apply package provides the future_lapply() function that can run in parallel. It is designed to be a plug-and-play replacement for lapply(). We just have to prepend future_ to the lapply name.
library(future.apply)
plan(multisession)
ys <- future_lapply(xs, slow_sum)
str(ys)
List of 4
$ : num 325
$ : num 950
$ : num 1575
$ : num 2200
By design, this gives identical results to lapply(), but it performs slow_sum(xs[[1]]), slow_sum(xs[[2]]), …, in parallel.
To convince ourselves it runs in parallel, we can measure the processing time:
tic()
ys <- future_lapply(xs, slow_sum)
toc()
Time difference of 26.2 secs
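The roughly four-fold speedup suggests four parallel workers were used. By default, plan(multisession) uses all cores that the future framework detects on the machine; if you want to inspect or control the number of workers explicitly, a sketch could be:

```r
library(future.apply)

# How many cores will the future framework use by default?
availableCores()

# Explicitly request four background R sessions as workers
plan(multisession, workers = 4)

# Confirm the number of parallel workers in the current plan
nbrOfWorkers()
```

With four workers and four elements in xs, each worker processes one element, so the total time is roughly that of a single slow_sum() call (~25 seconds) plus some parallelization overhead.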
3.2 Parallelizing a map-reduce call using the ‘furrr’ package
If you use the Tidyverse framework, you might already be aware of the purrr package. It provides an alternative to the built-in lapply() function called map(). It works very similarly. Our
ys <- lapply(xs, slow_sum)
can be written as:
library(purrr)
ys <- map(xs, slow_sum)
It gives identical results. To run this in parallel, you can use future_map() of the furrr package. Just as future_lapply() can replace lapply() as-is, future_map() replaces map() as-is:
library(furrr)
plan(multisession)
ys <- future_map(xs, slow_sum)
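Just like purrr, furrr also provides type-stable variants of its map functions. For example, since each slow_sum() call returns a single number, future_map_dbl() gives us a numeric vector instead of a list (a sketch, assuming the same xs and slow_sum() as above):

```r
library(furrr)
plan(multisession)

# future_map_dbl() is the parallel analogue of purrr::map_dbl();
# it returns a numeric vector rather than a list
ys <- future_map_dbl(xs, slow_sum)
ys
#> [1]  325  950 1575 2200
```

This saves you a separate unlist() step when the per-element results are scalars.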
3.3 Comment about pipes
All of the above also works with pipes. You can use the semi-legacy magrittr %>% pipe operator popularized by the Tidyverse, or the zero-cost |> pipe operator that is now built into R. Just like the Tidyverse maintainers, I recommend using the latter. There is zero overhead added when using it, and there is truly no extra code being executed behind the scenes. Instead, it is just a different way that R parses your code - after that, everything is the same. The following two R expressions are identical from R's perspective:
y <- g(f(x))
and
y <- x |> f() |> g()
The analogue in mathematics is that the following expressions are equivalent:
\[ h(x) = g(f(x)) \]
\[ h(x) = (g \circ f)(x) \]
Thus, when programming in R, we can use either of:
ys <- lapply(xs, slow_sum)
and
ys <- xs |> lapply(slow_sum)
Same for
ys <- future_lapply(xs, slow_sum)
and
ys <- xs |> future_lapply(slow_sum)