7  Non-exportable objects

We already know that some algorithms cannot be parallelized, e.g. when the next iteration depends on the result of the previous one. Another obstacle to parallelization is that some types of R objects only work in the R session where they were created; they cannot be exported to, and used by, a parallel worker. This chapter is about such non-exportable objects.

For example, assume we set up a database connection to a MariaDB database called sakila hosted on a remote server using DBI:

library(DBI)
con <- dbConnect(RMariaDB::MariaDB(),
  host = "relational.fit.cvut.cz", port = 3306,
  dbname = "sakila",
  username = "guest", password = "relational"
)
con
<MariaDBConnection>
  Host:    relational.fit.cvut.cz
  Server:  
  Client:  

We happen to know there is a table called film in this database. We can read this table into R as a tibble data frame using:

library(tibble)

film <- dbReadTable(con, "film")
film <- as_tibble(film)
film
# A tibble: 1,000 × 13
   film_id title       description release_year language_id original_language_id
     <int> <chr>       <chr>              <int>       <int>                <int>
 1       1 ACADEMY DI… A Epic Dra…         2006           1                   NA
 2       2 ACE GOLDFI… A Astoundi…         2006           1                   NA
 3       3 ADAPTATION… A Astoundi…         2006           1                   NA
 4       4 AFFAIR PRE… A Fanciful…         2006           1                   NA
 5       5 AFRICAN EGG A Fast-Pac…         2006           1                   NA
 6       6 AGENT TRUM… A Intrepid…         2006           1                   NA
 7       7 AIRPLANE S… A Touching…         2006           1                   NA
 8       8 AIRPORT PO… A Epic Tal…         2006           1                   NA
 9       9 ALABAMA DE… A Thoughtf…         2006           1                   NA
10      10 ALADDIN CA… A Action-P…         2006           1                   NA
# ℹ 990 more rows
# ℹ 7 more variables: rental_duration <int>, rental_rate <dbl>, length <int>,
#   replacement_cost <dbl>, rating <chr>, special_features <chr>,
#   last_update <dttm>

7.1 A database connection is only valid in the current R session

Now, say we wish to do this in parallel instead. If we attempt to do:

library(future)
plan(multisession)

f <- future({
  df <- dbReadTable(con, "film")
  as_tibble(df)
})

it will not work:

film <- value(f)
Error: external pointer is not valid

The reason is that the database connection (con) only works in the R session where it was created. When we try to use it in a parallel worker’s R process, it is invalid there. This is certainly not obvious from that error message!

Technically, this has to do with pointers, a concept from low-level programming languages such as C and C++. In this case, we can inspect con to see that it indeed holds an external pointer:

str(con)
Formal class 'MariaDBConnection' [package "RMariaDB"] with 7 slots
  ..@ ptr                   :<externalptr> 
  ..@ host                  : chr "relational.fit.cvut.cz"
  ..@ db                    : chr "sakila"
  ..@ load_data_local_infile: logi FALSE
  ..@ bigint                : chr "integer64"
  ..@ timezone              : chr "UTC"
  ..@ timezone_out          : chr "UTC"

If we dig deeper,

con@ptr
<pointer: 0x555dce911df0>

it reveals that it is a pointer to a specific address in memory, which is exactly how pointers are used in C and C++. Because it is a memory pointer, that piece of memory does not exist in the parallel worker. The good thing is that R detects when we send over an object with an external pointer (here con). When it detects that, it invalidates the pointer by setting it to null (memory address zero) as part of sending it over. We can see this if we do:

f <- future(con@ptr)
ptr <- value(f)
ptr
<pointer: (nil)>

So, when we use con in a parallel worker, the external pointer con@ptr is no longer useful. DBI detects this invalid pointer when we call dbReadTable(con, "film") and throws the error.
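
Under the hood, this nulling is a property of R’s serialization mechanism, which is what futures use to ship globals to parallel workers: whenever an object holding an external pointer is serialized, the pointer is replaced by a null pointer. We can sketch the same effect with base R’s serialize() and unserialize(), without involving any workers:

## Serializing to a raw vector and back nulls the external pointer
bytes <- serialize(con, connection = NULL)
con2 <- unserialize(bytes)
con2@ptr
## <pointer: (nil)>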

7.2 Same problem when saving to file

Note that you run into the exact same problem if you try to save the database connection to file,

saveRDS(con, "db_con.rds")

and then load it back in again:

con2 <- readRDS("db_con.rds")

The con2 object represents a non-working database connection:

film <- dbReadTable(con2, "film")
Error: external pointer is not valid

This makes sense: at the end of the day, it is a live connection over the internet to a remote database, involving authentication and more. Being able to save its state to file would be a lot to ask of the DBI package, but also of the remote database server.

7.3 Workaround

A workaround is to create a new database connection in the new R session, or in the parallel worker:

library(DBI)
library(future)
library(tibble)
plan(multisession)

f <- future({
  con <- dbConnect(RMariaDB::MariaDB(),
    host = "relational.fit.cvut.cz", port = 3306,
    dbname = "sakila",
    username = "guest", password = "relational"
  )
  
  df <- dbReadTable(con, "film")
  as_tibble(df)
})

film <- value(f)
film
# A tibble: 1,000 × 13
   film_id title       description release_year language_id original_language_id
     <int> <chr>       <chr>              <int>       <int>                <int>
 1       1 ACADEMY DI… A Epic Dra…         2006           1                   NA
 2       2 ACE GOLDFI… A Astoundi…         2006           1                   NA
 3       3 ADAPTATION… A Astoundi…         2006           1                   NA
 4       4 AFFAIR PRE… A Fanciful…         2006           1                   NA
 5       5 AFRICAN EGG A Fast-Pac…         2006           1                   NA
 6       6 AGENT TRUM… A Intrepid…         2006           1                   NA
 7       7 AIRPLANE S… A Touching…         2006           1                   NA
 8       8 AIRPORT PO… A Epic Tal…         2006           1                   NA
 9       9 ALABAMA DE… A Thoughtf…         2006           1                   NA
10      10 ALADDIN CA… A Action-P…         2006           1                   NA
# ℹ 990 more rows
# ℹ 7 more variables: rental_duration <int>, rental_rate <dbl>, length <int>,
#   replacement_cost <dbl>, rating <chr>, special_features <chr>,
#   last_update <dttm>
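
This per-worker-connection pattern extends naturally to parallel map-reduce calls. Below is a minimal sketch using future.apply::future_lapply(), where each parallel task opens its own connection and closes it again when done. The table names actor and city are our assumption here; they exist in the standard Sakila schema:

library(DBI)
library(future.apply)
plan(multisession)

read_table <- function(name) {
  ## Each parallel task gets its own short-lived connection
  con <- dbConnect(RMariaDB::MariaDB(),
    host = "relational.fit.cvut.cz", port = 3306,
    dbname = "sakila",
    username = "guest", password = "relational"
  )
  on.exit(dbDisconnect(con))
  dbReadTable(con, name)
}

dfs <- future_lapply(c("film", "actor", "city"), read_table)

Opening a connection per task adds overhead, so if you have many small queries, it can pay off to chunk them such that each worker connects only once.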

7.4 Futureverse can help us detect this before it happens

The problem of non-exportable objects is not limited to database connections; it happens for a large number of other classes of objects. Most of them have one thing in common: they hold a “reference” to some external resource, e.g. a file connection, a website connection, a database connection, or a handle to an in-memory object living in a Python or Java process running in the background. However, there are also cases where the reference is an external pointer to a piece of memory on the current machine.

For a list of known cases, see https://future.futureverse.org/articles/future-4-non-exportable-objects.html.

The future package can scan for external pointers and other types of “references”. We can use this to help protect against these types of mistakes:

library(future)
## R option telling future to give an error if an object holding an
## external pointer is about to be exported to a parallel worker
options(future.globals.onReference = "error")
plan(multisession)

f <- future({
  df <- dbReadTable(con, "film")
  as_tibble(df)
})
Error: Detected a non-exportable reference ('externalptr') in one of the globals ('con' of class 'MariaDBConnection') used in the future expression
Timing stopped at: 0 0 0.001
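
This scan adds a little overhead to each future, so one approach is to enable it only while developing. According to the future documentation, there is also a softer setting that warns instead of errors; see help("future.options", package = "future") for the currently supported values:

## Warn, rather than error, when a non-exportable reference is detected
options(future.globals.onReference = "warning")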

This is not the default setting, because there exist objects with external pointers that can indeed be exported. For example, data.table objects carry an external pointer, but the data.table package is clever enough to ignore it if the pointer is invalid:

library(data.table)

dt <- as.data.table(iris)
dt
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---                                                            
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

We can see the external pointer if we use:

str(dt)
Classes 'data.table' and 'data.frame':  150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 

The pointer is:

attr(dt, ".internal.selfref")
<pointer: 0x555dc84474c0>

If we set the above R option, we would get an error when trying to send dt to a parallel worker. We do not want that, because it actually works:

f <- future(summary(dt))
value(f)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

If we inspect the pointer of dt on the parallel worker, we find that it is null, as expected:

f <- future(attr(dt, ".internal.selfref"))
ptr <- value(f)
ptr
<pointer: (nil)>

If we round-trip via the file system, we see the familiar behavior of external pointers being set to null by R:

saveRDS(dt, "dt.rds")
attr(dt, ".internal.selfref")
<pointer: 0x555dc84474c0>
dt2 <- readRDS("dt.rds")
attr(dt2, ".internal.selfref")
<pointer: (nil)>
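
Despite the null pointer, dt2 still behaves like a regular data.table: ordinary queries just work, and if we later modify it by reference, data.table detects the invalid self-reference and falls back to taking a shallow copy. A small sketch (output omitted):

## Grouped aggregation works fine on the deserialized copy
dt2[, .(mean_petal_length = mean(Petal.Length)), by = Species]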

7.5 What about forked parallelization?

One may think that forked parallel processing could be a workaround. When using forks, the operating system will “clone” our main R session and perfectly replicate everything in the child parallel process.

Let’s try with our database example:

library(DBI)
con <- dbConnect(RMariaDB::MariaDB(),
  host = "relational.fit.cvut.cz", port = 3306,
  dbname = "sakila",
  username = "guest", password = "relational"
)
con
<MariaDBConnection>
  Host:    relational.fit.cvut.cz
  Server:  
  Client:  
library(future)
plan(multicore)   ## forked parallelization

f <- future({
  df <- dbReadTable(con, "film")
  as_tibble(df)
})

film <- value(f)
film
# A tibble: 1,000 × 13
   film_id title       description release_year language_id original_language_id
     <int> <chr>       <chr>              <int>       <int>                <int>
 1       1 ACADEMY DI… A Epic Dra…         2006           1                   NA
 2       2 ACE GOLDFI… A Astoundi…         2006           1                   NA
 3       3 ADAPTATION… A Astoundi…         2006           1                   NA
 4       4 AFFAIR PRE… A Fanciful…         2006           1                   NA
 5       5 AFRICAN EGG A Fast-Pac…         2006           1                   NA
 6       6 AGENT TRUM… A Intrepid…         2006           1                   NA
 7       7 AIRPLANE S… A Touching…         2006           1                   NA
 8       8 AIRPORT PO… A Epic Tal…         2006           1                   NA
 9       9 ALABAMA DE… A Thoughtf…         2006           1                   NA
10      10 ALADDIN CA… A Action-P…         2006           1                   NA
# ℹ 990 more rows
# ℹ 7 more variables: rental_duration <int>, rental_rate <dbl>, length <int>,
#   replacement_cost <dbl>, rating <chr>, special_features <chr>,
#   last_update <dttm>

It certainly looks like it worked! However, while doing this, we managed to confuse DBI and the MariaDB server: the forked child process shares the very same socket connection as the parent process, so when both sides use it, the protocol state gets corrupted. If we try to use con again in our main R session, we get:

film <- dbReadTable(con, "film")
Error: Lost connection to MySQL server during query [2013]
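
In other words, forks do not make a connection shareable. Also under plan(multicore), the safe pattern is the one from Section 7.3: open, use, and close the connection inside the future. A sketch:

f <- future({
  ## A connection private to this forked worker
  con <- dbConnect(RMariaDB::MariaDB(),
    host = "relational.fit.cvut.cz", port = 3306,
    dbname = "sakila",
    username = "guest", password = "relational"
  )
  on.exit(dbDisconnect(con))
  dbReadTable(con, "film")
})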

In conclusion, it is tempting to think that forked processing can solve things that other parallelization backends cannot handle, but it is often the devil in disguise.