We already know that some algorithms cannot be parallelized, e.g. when the next iteration depends on the result of the previous one. Similarly, some types of R objects only work in the R session where they were created, which prevents them from being used in parallel workers.
For example, assume we set up a database connection to a MariaDB database called sakila hosted on a remote server using DBI:
library(DBI)
con <- dbConnect(
  RMariaDB::MariaDB(),
  ## https://relational-data.org/dataset/Sakila
  host = "db.relational-data.org", port = 3306,
  dbname = "sakila",
  username = "guest", password = "relational"
)
con
<MySQLConnection>
Connection: guest@db.relational-data.org<sakila>[529701] via TCP/IP over SSL
We happen to know there is a table called film in this database. We can read this table into R as a tibble data frame using:
library(tibble)
film <- dbReadTable(con, "film")
film <- as_tibble(film)
film
# A tibble: 1,000 × 13
film_id title description release_year language_id original_language_id
<int> <chr> <chr> <int> <int> <int>
1 1 ACADEMY DI… A Epic Dra… 2006 1 NA
2 2 ACE GOLDFI… A Astoundi… 2006 1 NA
3 3 ADAPTATION… A Astoundi… 2006 1 NA
4 4 AFFAIR PRE… A Fanciful… 2006 1 NA
5 5 AFRICAN EGG A Fast-Pac… 2006 1 NA
6 6 AGENT TRUM… A Intrepid… 2006 1 NA
7 7 AIRPLANE S… A Touching… 2006 1 NA
8 8 AIRPORT PO… A Epic Tal… 2006 1 NA
9 9 ALABAMA DE… A Thoughtf… 2006 1 NA
10 10 ALADDIN CA… A Action-P… 2006 1 NA
# ℹ 990 more rows
# ℹ 7 more variables: rental_duration <int>, rental_rate <dbl>, length <int>,
# replacement_cost <dbl>, rating <chr>, special_features <chr>,
# last_update <dttm>
7.1 A database connection is only valid in the current R session
Now, say we wish to do this in parallel instead. If we attempt to do:
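For example, with a multisession backend (a sketch of the failing attempt; the exact expression may differ):

library(future)
plan(multisession)
f <- future(dbReadTable(con, "film"))
film <- value(f)

Error: bad_weak_ptr

it fails.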
The reason is that the database connection (con) only works in the R session where it was created. When we try to use it in a parallel worker's R process, it is invalid there. This is certainly not obvious from that error message!
Technically, this has to do with pointers, a concept used in low-level programming languages such as C and C++. In this case, we can inspect con to see that it indeed holds an external pointer:
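For instance (the printed address is illustrative and varies between sessions):

con@ptr

<pointer: 0x5631be7f4a10>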
it reveals that it is a pointer to a specific address in memory, which is exactly how pointers are used in C and C++. Because it is a memory pointer, that piece of memory does not exist in the parallel worker. The good thing is that R detects when we send over an object with an external pointer (here con). When it detects that, it invalidates the pointer by setting it to null (memory address zero) before sending it over. We can see this if we do:
f <- future(con@ptr)
ptr <- value(f)
ptr
<pointer: (nil)>
So, when we try to use con in a parallel worker, the external pointer con@ptr is no longer useful. DBI detects this invalid pointer when we call dbReadTable(con, "film") and throws the error.
7.2 Same problem when saving to file
Note that you have the exact same problem if you try to save the database connection to file,
saveRDS(con, "db_con.rds")
and then load it back in again:
con2 <- readRDS("db_con.rds")
The con2 object represents a non-working database connection:
film <- dbReadTable(con2, "film")
Error: bad_weak_ptr
This makes sense because, at the end of the day, it is a connection to a remote database, involving a live connection over the internet, authentication, and more. Being able to save its state to file would be a lot to ask of the DBI package, and also of the remote database server.
7.3 Workaround
A workaround is to create a new database connection in the new R session, or in the parallel worker:
library(DBI)
library(future)
plan(multisession)

f <- future({
  con <- dbConnect(
    RMariaDB::MariaDB(),
    host = "db.relational-data.org", port = 3306,
    dbname = "sakila",
    username = "guest", password = "relational"
  )
  df <- dbReadTable(con, "film")
  as_tibble(df)
})
film <- value(f)
film
# A tibble: 1,000 × 13
film_id title description release_year language_id original_language_id
<int> <chr> <chr> <int> <int> <int>
1 1 ACADEMY DI… A Epic Dra… 2006 1 NA
2 2 ACE GOLDFI… A Astoundi… 2006 1 NA
3 3 ADAPTATION… A Astoundi… 2006 1 NA
4 4 AFFAIR PRE… A Fanciful… 2006 1 NA
5 5 AFRICAN EGG A Fast-Pac… 2006 1 NA
6 6 AGENT TRUM… A Intrepid… 2006 1 NA
7 7 AIRPLANE S… A Touching… 2006 1 NA
8 8 AIRPORT PO… A Epic Tal… 2006 1 NA
9 9 ALABAMA DE… A Thoughtf… 2006 1 NA
10 10 ALADDIN CA… A Action-P… 2006 1 NA
# ℹ 990 more rows
# ℹ 7 more variables: rental_duration <int>, rental_rate <dbl>, length <int>,
# replacement_cost <dbl>, rating <chr>, special_features <chr>,
# last_update <dttm>
7.4 Futureverse can help us detect this before it happens
The problem of non-exportable objects is not limited to database connections. It happens for a large number of other classes of objects. Most of them have one thing in common: they hold a "reference" to some external resource, e.g. a file connection, a website connection, a database connection, or a handle to an in-memory object living in a Python or Java process running in the background. However, there are also cases where the reference is an external pointer to a piece of memory on the current machine.
Luckily, there is an R option that tells future to prevent objects with external pointers from being exported to a parallel worker.
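This is the future.globals.onReference option; setting it to "error" makes future scan the globals of a future for non-exportable references before they are sent off. Revisiting our database example (a sketch):

## Make future refuse to export objects holding external references
options(future.globals.onReference = "error")
f <- future(dbReadTable(con, "film"))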
Error: Detected a non-exportable reference ('externalptr') in one of the globals ('con' of class 'MySQLConnection') used in the future expression
This is not the default setting, because there exist objects with external pointers that can indeed be exported. For example, data.table objects have external pointers, but the data.table package is clever enough to ignore the pointer if it is invalid. For example,
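assume dt is the iris data set converted to a data.table (an illustrative choice that matches the summary output below):

library(data.table)
dt <- as.data.table(iris)  ## illustrative; any data.table carries an .internal.selfref pointer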
If we set the above R option, we would get an error when trying to send dt to a parallel worker. We don't want that, because it works:
f <- future(summary(dt))
value(f)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
If we inspect the pointer of dt on the parallel worker, we'll find that it is null (as expected):
f <- future(attr(dt, ".internal.selfref"))
ptr <- value(f)
ptr
<pointer: (nil)>
If we round trip to the file system, we see this familiar behavior of external pointers being set to null by R:
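For example (a sketch, assuming the dt object from above and a temporary file):

saveRDS(dt, "dt.rds")
dt2 <- readRDS("dt.rds")
attr(dt2, ".internal.selfref")  ## nulled out during serialization

<pointer: (nil)>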
7.5 Forked processing is not a workaround
One may think that forked parallel processing could be a workaround. When using forks, the operating system "clones" our main R session and perfectly replicates everything in the child parallel process. Let's try with our database example:
library(DBI)
con <- dbConnect(
  RMariaDB::MariaDB(),
  host = "db.relational-data.org", port = 3306,
  dbname = "sakila",
  username = "guest", password = "relational"
)
con
<MySQLConnection>
Connection: guest@db.relational-data.org<sakila>[529704] via TCP/IP over SSL
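Next, we read the table in a forked parallel worker (a sketch; assuming the multicore backend, which uses forked processing):

plan(multicore)
f <- future({
  df <- dbReadTable(con, "film")
  as_tibble(df)
})
film <- value(f)
film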
# A tibble: 1,000 × 13
film_id title description release_year language_id original_language_id
<int> <chr> <chr> <int> <int> <int>
1 1 ACADEMY DI… A Epic Dra… 2006 1 NA
2 2 ACE GOLDFI… A Astoundi… 2006 1 NA
3 3 ADAPTATION… A Astoundi… 2006 1 NA
4 4 AFFAIR PRE… A Fanciful… 2006 1 NA
5 5 AFRICAN EGG A Fast-Pac… 2006 1 NA
6 6 AGENT TRUM… A Intrepid… 2006 1 NA
7 7 AIRPLANE S… A Touching… 2006 1 NA
8 8 AIRPORT PO… A Epic Tal… 2006 1 NA
9 9 ALABAMA DE… A Thoughtf… 2006 1 NA
10 10 ALADDIN CA… A Action-P… 2006 1 NA
# ℹ 990 more rows
# ℹ 7 more variables: rental_duration <int>, rental_rate <dbl>, length <int>,
# replacement_cost <dbl>, rating <chr>, special_features <chr>,
# last_update <dttm>
It certainly looks like it worked! However, while doing this, we managed to confuse DBI and the MariaDB server. If we try to use con again in the main R session, we get:
film <-dbReadTable(con, "film")
Error: Lost connection to MySQL server during query [2013]
In conclusion, it is tempting to think that forked processing can solve things that other parallelization backends cannot handle, but it is often the devil in disguise.