Beyond Sequential: Scaling Existing Medical Pipelines with ‘futurize’

- Easy!

Henrik Bengtsson

University of California, San Francisco
R Foundation, R Consortium
@HenrikBengtsson

Medical research is powered by trusted R packages

boot Bootstrap resampling for robust confidence intervals
lme4 Mixed-effects models for longitudinal patient data
survival Time-to-event analysis for clinical endpoints
DESeq2 Differential expression in RNA-seq data
scater Single-cell RNA-seq analysis, e.g. PCA, t-SNE, UMAP

Packages are distributed via the highly trusted CRAN and Bioconductor repositories, with supporting repositories such as R-universe, Pharmaverse, and R-multiverse.

Growing datasets make sequential analysis a bottleneck

  • Bootstrapping confidence intervals - 1,000 replicates × large cohort = hours
  • Mixed models across patient subgroups - dozens of subgroup fits, one after another
  • Simulation studies for sample-size planning - 10,000 iterations on a laptop overnight





32× CPU cores sitting idle

Parallelization can help →

Barrier is friction!

Parallelizing used to require abandoning existing code

Before: sequential, readable, auditable

res <- lapply(patients, fit_model)






Refactoring tax

  • Many parallelization APIs
  • Complicated to test
  • Expensive to maintain

After: parallel, hard to read, hard to maintain

library(parallel)

if (parallel) {
  if (.Platform$OS.type == "windows") {
    cl <- makeCluster(8)
    res <- parLapply(cl, patients, fit_model)
    stopCluster(cl)
  } else {
    options(mc.cores = 8)
    res <- mclapply(patients, fit_model)
  }
} else {
  res <- lapply(patients, fit_model)
}

Futureverse was introduced to lower refactoring tax

Before: sequential, readable, auditable

res <- lapply(patients, fit_model)

library(purrr)
res <- map(patients, fit_model)

library(foreach)
res <- foreach(p = patients) %do% {
  fit_model(p)
}

After: parallel, readable, maintainable

library(future.apply)
plan(multisession)
res <- future_lapply(patients, fit_model)

library(furrr)
plan(multisession)
res <- future_map(patients, fit_model)

library(doFuture)
plan(multisession)
res <- foreach(p = patients) %dofuture% {
  fit_model(p)
}
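
To make the before/after comparison concrete, here is a minimal, self-contained sketch. The simulated `cohort` data and the toy `fit_model()` are hypothetical stand-ins for the slide's `patients` and `fit_model`, and `workers = 2` is an arbitrary choice. It runs the same per-patient fit sequentially and in parallel so the two results can be compared.

```r
library(future.apply)  # also attaches 'future', which provides plan()

# Hypothetical stand-in data: 6 patients with 10 measurements each
set.seed(1)
cohort <- data.frame(id = rep(1:6, each = 10), dose = runif(60))
cohort$response <- 2 * cohort$dose + rnorm(60)
patients <- split(cohort, cohort$id)

# Hypothetical stand-in model: one linear fit per patient
fit_model <- function(df) coef(lm(response ~ dose, data = df))

# Sequential baseline
res_seq <- lapply(patients, fit_model)

# Parallel version: only the plan and the function name change
plan(multisession, workers = 2)
res_par <- future_lapply(patients, fit_model)
plan(sequential)  # shut down the background R sessions

# Both versions return the same list of coefficients
same <- isTRUE(all.equal(res_seq, res_par))
```

Because `fit_model()` is deterministic, `same` should be `TRUE`; functions that draw random numbers additionally need `future.seed = TRUE` in `future_lapply()` to be reproducible.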

The R community has embraced the Futureverse



+30% reverse dependencies yearly


Top 0.7% most downloaded

… but, we can simplify it further with ‘futurize’






All you need to remember is futurize()

New package futurize preserves your original code

  • Universal adapter: One unifying function futurize()
  • Zero rewrites: Original logic unchanged

res <- lapply(patients, fit_model) |>
         futurize()

library(purrr)
res <- map(patients, fit_model) |>
         futurize()

library(foreach)
res <- foreach(p = patients) %do% {
  fit_model(p)
} |> futurize()

library(plyr)
res <- llply(patients, fit_model) |>
         futurize()

library(BiocParallel)
res <- bplapply(patients, fit_model) |>
         futurize()

library(crossmap)
res <- xmap(x, ~ .y * .x) |>
         futurize()

Easy!
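
As an end-to-end illustration of the pattern above, the following sketch fills in the pieces the slide leaves out. The simulated data and `fit_model()` are hypothetical, `workers = 2` is an arbitrary choice, and `plan()` is assumed to be available after `library(futurize)`, as the slides use it. The analysis line itself is identical in both runs apart from the trailing `futurize()`.

```r
library(futurize)

# Hypothetical stand-in data: 6 patients with 10 measurements each
set.seed(1)
cohort <- data.frame(id = rep(1:6, each = 10), dose = runif(60))
cohort$response <- 2 * cohort$dose + rnorm(60)
patients <- split(cohort, cohort$id)

# Hypothetical stand-in model: one linear fit per patient
fit_model <- function(df) coef(lm(response ~ dose, data = df))

# Original, sequential call
res_seq <- lapply(patients, fit_model)

# Same call, piped into futurize() and run on two background R sessions
plan(multisession, workers = 2)
res_par <- lapply(patients, fit_model) |> futurize()
plan(sequential)  # shut down the background R sessions

same <- isTRUE(all.equal(res_seq, res_par))
```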

Same code scales from laptop to cloud to HPC

Without changing any code, you can switch from local parallel processing, to remote parallel processing, to large-scale high-performance computing (HPC):

plan()              Environment     Use case
sequential          Single machine  Sequential (default, debugging)
multisession        Single machine  Parallel across multiple cores
mirai_multisession  Single machine  Same as above; powered by mirai
cluster             Many machines   Parallel across many machines (desktops, cloud)
batchtools_*        Slurm/SGE/LSF   Scheduler-based HPC clusters

library(futurize)
plan(future.batchtools::batchtools_slurm)
res <- patients |> purrr::map(fit_model) |> futurize()
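
One way to keep the backend choice out of the analysis script entirely is to select the plan from configuration. The sketch below is only an illustration: the helper `set_backend()` and the environment variable `PIPELINE_BACKEND` are hypothetical names, not part of any package, and only two of the backends from the table are wired up.

```r
library(future)

# Hypothetical helper: pick the backend from an environment variable so
# that the analysis code never mentions a specific backend
set_backend <- function(backend = Sys.getenv("PIPELINE_BACKEND", "sequential")) {
  switch(backend,
    sequential   = plan(sequential),
    multisession = plan(multisession, workers = 2),
    stop("Unknown backend: ", backend)
  )
  invisible(backend)
}

# Debugging on a laptop: everything runs in the current R session
set_backend("sequential")
n_seq <- nbrOfWorkers()

# Same script, now backed by two background R sessions
set_backend("multisession")
n_par <- nbrOfWorkers()

plan(sequential)  # shut down the background R sessions
```

Additional branches, for example for `cluster` or a `batchtools_*` backend, would follow the same shape, with machine names and scheduler templates supplied by the site.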

… we can do even more with ‘futurize’






All you need to remember is futurize()

futurize() works with a growing set of domain-specific packages

CRAN package          Use
boot                  Bootstrap resampling, confidence intervals
caret                 Classification and regression training
fwb                   Bootstrap resampling, confidence intervals
gamlss                Generalized additive models (GAMLSS)
glmnet                Lasso and elastic-net regularization
glmmTMB               Generalized linear mixed models (GLMMs)
kernelshap            Kernel SHAP (Shapley Additive Explanations)
lme4                  Linear and non-linear mixed-effects models
metafor               Meta-analysis models
mgcv                  Generalized additive models (GAMs)
partykit              Recursive partitioning (trees)
riskRegression        Risk regression for survival analysis
seriation             Data ordering (seriation)
stars                 Spatiotemporal data cubes
structchange          Testing for structural changes
tm                    Text mining
vegan                 Community ecology

Bioconductor package  Use
BiocParallel          Map-reduce and parallel infrastructure
DESeq2                Differential gene expression analysis
GenomicAlignments     Genomic alignments (BAM/CRAM)
GSVA                  Gene set variation analysis
Rsamtools             Binary alignment (BAM) and tabix utilities
scater                Single-cell transformations
scuttle               Single-cell analysis utilities
SingleCellExperiment  Single-cell data containers
sva                   Surrogate variable analysis

Bootstrap simulations accelerated with futurize()

Sequential:

library(boot)
b <- boot(data = cohort, statistic = cox_stat, R = 100e3)

100,000 bootstrap replicates take hours on large cohorts!

A single worker

Parallel:

plan(future.mirai::mirai_multisession)
library(boot)
b <- boot(data = cohort, statistic = cox_stat, R = 100e3) |>
     futurize()

Faster when distributed across parallel workers.
Identical results.

32 parallel workers
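
The slides leave `cohort` and `cox_stat` undefined. A minimal sketch of what they could look like is given below, assuming a simulated cohort and a Cox model with two covariates; the variable names, effect sizes, and the small `R = 200` are illustrative choices, not the speaker's setup. The `boot()` call has the same shape as on the slide, so it can be piped into `futurize()` unchanged.

```r
library(boot)
library(survival)

# Hypothetical cohort, simulated so that the example is self-contained
set.seed(42)
n <- 300
cohort <- data.frame(
  age     = rnorm(n, mean = 60, sd = 10),
  treated = rbinom(n, size = 1, prob = 0.5)
)
# Event times with an assumed treatment effect (log hazard ratio of -0.5)
event_time  <- rexp(n, rate = 0.1 * exp(-0.5 * cohort$treated))
censor_time <- rexp(n, rate = 0.05)
cohort$time   <- pmin(event_time, censor_time)
cohort$status <- as.integer(event_time <= censor_time)

# boot() calls the statistic with the data and a vector of resampled
# row indices; here it refits the Cox model and returns the coefficients
cox_stat <- function(data, indices) {
  fit <- coxph(Surv(time, status) ~ age + treated, data = data[indices, ])
  coef(fit)
}

# A small number of replicates keeps this sketch fast
b <- boot(data = cohort, statistic = cox_stat, R = 200)

# Percentile confidence interval for the second coefficient ('treated')
ci <- boot.ci(b, type = "perc", index = 2)
```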

Traditional parallelization is more cumbersome and less robust

library(boot)

library(parallel)
cl <- makeCluster(32)

b <- boot(data = cohort, statistic = cox_stat, R = 100e3,
          parallel = "snow", ncpus = length(cl), cl = cl)

stopCluster(cl)
  • Parallelization arguments blur the bootstrapping logic
  • All three parallelization arguments must be specified
  • Does not interrupt nicely
  • Does not handle crashed parallel workers
  • Not easy to scale to cloud or HPC job schedulers

Yes, we can do progress reporting too






All you need to remember is progressify()

Progress reporting with ‘progressify’


Vanilla call:

res <- lapply(patients, fit_model)


With progress reporting:

library(progressify)
handlers(global = TRUE)
handlers("cli")

res <- lapply(patients, fit_model) |> 
         progressify()
         

■■■■■■■■■■■■■■■■80% | ETA: 12m

In parallel with progress reporting:

library(progressify)
handlers(global = TRUE)
handlers("cli")

library(futurize)
plan(multisession)

res <- lapply(patients, fit_model) |> 
         progressify() |> futurize()

■■■■■■■■■■■■■■■■80% | ETA: 23s

Easy!
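
For comparison, this is roughly the bookkeeping that has to be written by hand when using the progressr package directly. Whether progressify() generates exactly this code is an assumption; the sketch only shows the manual equivalent, with a hypothetical `fit_model()` and `patients`.

```r
library(progressr)

# Hypothetical stand-ins for the slide's 'patients' and 'fit_model'
patients <- 1:20
fit_model <- function(p) {
  Sys.sleep(0.01)  # pretend this is an expensive model fit
  p^2
}

res <- with_progress({
  # One progress step per patient
  step <- progressor(along = patients)
  lapply(patients, function(p) {
    out <- fit_model(p)
    step()  # signal that one more patient is done
    out
  })
})
```

The manual version requires wrapping the function and touching the loop body, which is the kind of rewrite the piped form avoids.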

Structured concurrency allows for automatic optimization

Because futurize() limits the lifespan of the parallel tasks, it can:

  • cancel remaining parallel tasks
    • if there is an error
    • if the user or the operating system requests an interrupt
  • estimate efficiency of parallelization
    • is it worth it?
    • suggest a better parallel backend
  • optimize distribution of objects to parallel workers
    • by chunking
    • by remote caching
    • via shared memory (e.g. new mori package by C. Gao 2026)
  • adapt to resource specifications
    • memory, run-time, GPU, …
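
The chunking idea from the list above can be illustrated with base R alone. This is a conceptual sketch, not how futurize() is implemented: elements are grouped into one chunk per worker so that each worker receives a single task instead of many small ones. The number of workers and the toy function are arbitrary.

```r
library(parallel)  # ships with R; used here only for splitIndices()

xs <- 1:10
fcn <- function(x) x^2
n_workers <- 3

# Split the element indices into one contiguous chunk per worker
chunks <- splitIndices(length(xs), n_workers)

# Each chunk is processed as a single task; with a parallel backend,
# the outer lapply() is the part that would be distributed
res_chunks <- lapply(chunks, function(idx) lapply(xs[idx], fcn))

# Flatten the per-chunk results back into one list, in the original order
res <- unlist(res_chunks, recursive = FALSE)
```

Fewer, larger tasks reduce the communication overhead per element, which matters most when each call to the function is cheap.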

Go compute and may the future be with you!

Easy to install:

install.packages(c("futurize", "progressify"))

Easy to use:

ys <- lapply(xs, fcn) |> progressify() |> futurize()

Stay with your favorite coding style:

ys <- xs |> map(fcn) |> progressify() |> futurize()

Available elsewhere too:

ys <- glmnet::cv.glmnet(x, y) |> futurize()

https://www.futureverse.org