Crashing Hard Drives with data.table

Anecdotal evidence for why you should read R vignettes

Author

Erle Holgersen

Published

November 11, 2017

A few weeks ago, I accidentally generated 126TB worth of data. Before going home on a Thursday, I submitted a few R jobs to the cluster. When I checked up on them Friday after lunch, I discovered they had generated 126TB worth of text files.

After I had killed the running jobs and deleted the files, I launched a mini-investigation. By some miracle I avoided getting a stern email from IT, but I figured I should avoid making hard drive hogging a habit.

The culprit turned out to be some data.table code I had written and never properly tested. In my eagerness to get results, I had run the same script on several datasets before testing it on one. The script in question was one of the first ones I’d written using the data.table package, and some of my assumptions about its syntax turned out to be horribly, horribly wrong.

To avoid similar incidents in the future, I have gone back to basics and learned the syntax properly. In the process I also recreated my mistake for fun, further driving home the point.

Introduction to data.table

data.table is an R package for handling large datasets. It extends the built-in data.frame structure to enable faster operations and updating columns by reference.

The introduction vignette is a great guide for getting started. There are also several vignettes that focus on summarizing and manipulating data, but for this short introduction I’m going to focus on selecting rows and columns.

While the data.table syntax seems similar to the data.frame one, there are a few obvious differences. Firstly, data.table treats column names as variables within the [...] operator. Instead of using the dollar sign to select columns, we can refer to them directly by name. Secondly, row selection takes precedence over column selection. For a data.frame, the statement [1:4] selects the first four columns, whereas for a data.table it selects the first four rows.
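A toy comparison makes both differences concrete. This is only a sketch on made-up data: toy.df, toy.dt, and their contents are hypothetical and not part of the flights dataset.

```r
library(data.table);

# Hypothetical toy data: 6 rows, 5 columns
toy.df <- data.frame(
    a = 1:6,
    b = letters[1:6],
    c = 6:1,
    d = rep(0, 6),
    e = rep(1, 6)
    );
toy.dt <- as.data.table(toy.df);

# A single index selects columns of a data.frame, but rows of a data.table
df.subset <- toy.df[1:4];
dt.subset <- toy.dt[1:4];

print( dim(df.subset) );  # 6 rows, 4 columns
print( dim(dt.subset) );  # 4 rows, 5 columns

# Inside [...], data.table evaluates column names as variables
big.a.dt <- toy.dt[a > 3];
big.a.df <- toy.df[toy.df$a > 3, ];
```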

To see these differences more clearly, consider the NYC flights data from the vignette. The example below shows how to select all flights out of JFK with data.table and data.frame.

library(data.table);

dt <- fread('flights14.csv');
df <- read.csv('flights14.csv');

jfk.dt <- dt['JFK' == origin];
jfk.df <- df['JFK' == df$origin, ];

print(jfk.dt);

        year month   day dep_delay arr_delay carrier origin   dest air_time
       <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
    1:  2014     1     1        14        13      AA    JFK    LAX      359
    2:  2014     1     1        -3        13      AA    JFK    LAX      363
    3:  2014     1     1         2         9      AA    JFK    LAX      351
    4:  2014     1     1         2         1      AA    JFK    LAX      350
    5:  2014     1     1        -2       -18      AA    JFK    LAX      338
   ---                                                                     
81479:  2014    10    31        -4       -21      UA    JFK    SFO      337
81480:  2014    10    31        -2       -37      UA    JFK    SFO      344
81481:  2014    10    31         0       -33      UA    JFK    LAX      320
81482:  2014    10    31        -6       -38      UA    JFK    SFO      343
81483:  2014    10    31        -6       -38      UA    JFK    LAX      323
       distance  hour
          <int> <int>
    1:     2475     9
    2:     2475    11
    3:     2475    19
    4:     2475    13
    5:     2475    21
   ---               
81479:     2586    17
81480:     2586    18
81481:     2475    17
81482:     2586     9
81483:     2475    11

Conveniently, data.table also prints differently. By default, for tables with more than 100 rows, only the first and last five rows are printed to screen, which removes much of the need for the head function.

The learning curve gets steeper when you start manipulating columns. data.table has a column assignment operator, :=, for adding, updating, and deleting columns by reference. The operator is used directly within the [...], and there’s no need to reassign the object. For example, we can use the following code to add a column indicating if the flight occurred in the morning or afternoon.

dt[, ampm := ifelse(hour < 12, 'am', 'pm')];
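Because := modifies the table in place, any other name bound to the same object sees the change too. The sketch below illustrates this on made-up data; toy, alias, and snapshot are hypothetical names, while copy() is the data.table function for taking an independent copy.

```r
library(data.table);

# Hypothetical toy data
toy <- data.table(hour = c(9, 13, 21));

alias <- toy;           # a second name for the same table, not a copy
snapshot <- copy(toy);  # an independent copy

# Add a column by reference; no reassignment needed
toy[, ampm := ifelse(hour < 12, 'am', 'pm')];

# The alias sees the new column, the copy does not
alias.has.ampm <- 'ampm' %in% names(alias);
snapshot.has.ampm <- 'ampm' %in% names(snapshot);
print( c(alias = alias.has.ampm, snapshot = snapshot.has.ampm) );

# Assigning NULL deletes the column, again by reference
toy[, ampm := NULL];
```

If a table needs to stay untouched, taking a copy() before the first := is the safer habit.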

Syntax can also be different when selecting columns – and this turned out to be my 126TB downfall. When selecting a single column of a data.frame, base R will simplify to a vector by default. With data.table… it depends.

If you select a column using list notation (dt$year or dt[[ 'year' ]]), data.table will return a vector. Similarly, dt[, year] will simplify to a vector. However, dt[, 'year'] will return a single-column data.table.

str( dt[, year] );

 int [1:253316] 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...

dt[, 'year'];

         year
        <int>
     1:  2014
     2:  2014
     3:  2014
     4:  2014
     5:  2014
    ---      
253312:  2014
253313:  2014
253314:  2014
253315:  2014
253316:  2014
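When in doubt, checking the class of the result is cheap. Here is a minimal sketch on a made-up two-row table (toy is hypothetical) that puts the vector-returning and table-returning forms side by side:

```r
library(data.table);

# Hypothetical toy data
toy <- data.table(year = c(2014L, 2014L), month = c(1L, 10L));

# These three forms return a plain integer vector
v1 <- toy$year;
v2 <- toy[['year']];
v3 <- toy[, year];

# These two forms return a one-column data.table
t1 <- toy[, 'year'];
t2 <- toy[, .(year)];

print( class(v3) );  # "integer"
print( class(t1) );  # "data.table" "data.frame"
```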

Unfamiliar with the syntax, I inadvertently used paste to join two single-column data.table objects together. The result is a monster of a character string. Rather than pasting the vectors element-wise, the whole columns are pasted together into a single string. In the dummy example below, combining three columns of a ~250,000 row data.table results in a string with over 3 million characters.

test <- paste0(dt[, 'year'], '-', dt[, 'month'], '-', dt[, 'day']);

print( nchar(test) );

[1] 3252502

To make matters worse, I assigned the resulting string to a column. Instead of being stored once, the string became every element of a column in a large data.table. To data.table’s credit, it was able to deal with my accidental monster. R ran as normal, and I only noticed the problems when I checked the available space on disk.
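The collapse happens because paste0 converts each argument to character, and a one-column table is a list of length one, so the whole column is deparsed into a single string. Below is a minimal sketch on made-up data (toy and the variable names are hypothetical). It shows the buggy call next to an element-wise version that refers to the columns by name, plus a cheap length check before assigning.

```r
library(data.table);

# Hypothetical toy data
toy <- data.table(
    year = c(2014L, 2014L, 2014L),
    month = 1:3,
    day = c(1L, 15L, 31L)
    );

# Buggy: each argument is a one-column data.table, so the result is ONE string
collapsed <- paste0(toy[, 'year'], '-', toy[, 'month'], '-', toy[, 'day']);
print( length(collapsed) );  # 1

# Element-wise: bare column names are ordinary vectors inside [...]
dates <- toy[, paste0(year, '-', month, '-', day)];

# Sanity check: one value per row before assigning to a column
stopifnot( length(dates) == nrow(toy) );
toy[, date := dates];

print( toy$date );  # "2014-1-1" "2014-2-15" "2014-3-31"
```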

Recreating my Bug

Intrigued by the large consequences of a tiny bug in notation, I wanted to see the effect it would have on a smaller dataset. The flights14.csv data from the vignette is 15MB. I used this file and the bug shown above as a minimal example of my original problem.

To run this experiment without actually crashing my computer, I used a size-restricted Docker container. Docker can be run with different storage drivers and supporting backend filesystems. In order to use the --storage-opt size option, we need to use the overlay storage driver with the XFS filesystem.

The overlay storage driver can be used for Docker on both Mac and Linux, but the XFS filesystem cannot be mounted on Mac. I had to run Linux on a VirtualBox virtual machine to get it to work on my laptop. I installed CentOS on the virtual machine. Its default filesystem is XFS, and I wanted to minimize setup for the Docker container. In the end I had to do three things to get Docker up and running:

1. Download the minimal CentOS distribution and install it on the virtual machine.
2. Install Docker community edition by following the official instructions. I skipped the part about devicemapper drivers as I needed to use overlay anyways.
3. Add the pquota mount option to XFS. This was the trickiest part, and I had a few unsuccessful attempts. In the end I got it to work by following these instructions.

Afterwards, I was ready to launch Docker containers with the --storage-opt size option. An easy way to check if everything works as expected is to run df -h after launching a container. When I did not restrict the size, I got the following output (the virtual machine itself is restricted to 20GB).

[Figure: Unrestricted Docker container]

Setting --storage-opt size=5G restricts the space available to 5GB.

[Figure: 5G Docker container]

Just what we wanted! With my safe space set up, I moved on to setting up the container itself. I first wrote a minimal version of the culprit R script that could be run from the container, and saved it as crash.R.

library(data.table);

dt <- fread('flights14.csv');

bug <- paste0(dt[, 'year'], '-', dt[, 'month'], '-', dt[, 'day']);
dt[, bug := bug];

write.table(dt, 'monster.txt');
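A back-of-the-envelope estimate shows why this script cannot finish on a small disk. The sketch below takes the row count and string length from the str() and nchar() outputs shown earlier. It assumes roughly one byte per character and ignores every other column, so it is a rough lower bound rather than an exact file size.

```r
# Numbers taken from the outputs shown earlier in the post
n.rows <- 253316;
string.length <- 3252502;

# Assumption: ~1 byte per character; other columns and quoting are ignored
bug.column.bytes <- n.rows * string.length;

print( bug.column.bytes / 1e9 );  # size in decimal GB, roughly 824
```

In other words, a 15MB input file is enough to ask for hundreds of gigabytes of output, far more than any of the container sizes used here.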

To execute this within a Docker container, we need to create a Dockerfile with setup instructions. The Dockerfile specifies which base image the new image is built on, and what software should be installed. The r-base Docker image contains the latest version of R. Additionally, we need to install the data.table package from CRAN. The COPY command copies the script and data file into the image, and CMD sets a default command to be executed when the container is launched.

FROM r-base
LABEL maintainer="Erle Holgersen"

RUN Rscript -e "install.packages('data.table', repos='https://cloud.R-project.org');"

COPY flights14.csv crash.R src/
WORKDIR src/

CMD ["Rscript", "crash.R"]

From here on out things are fairly straightforward. We first need to build an image from our Dockerfile by running

docker build -t crash .

This will most likely take a few minutes, depending on what Docker images you have built in the past. Afterwards we can launch the container in detached mode and with a 3GB size restriction with the command

docker run --storage-opt size=3G -d -t crash

Detached mode runs the container in the background, which allows us to keep monitoring its state through docker ps. The -s flag of docker ps displays the size of the container. My container took about two and a half minutes to grow to 3.22GB, and then exited. When I tried to re-enter the container, I got a “disk limit exceeded” error. On the bright side, deleting the Docker container was trivial, and instantly recovered the 3GB of space!

[Figure: Docker crash]

Conclusion

Reading documentation and testing your code are good ideas. And if you really want to overload a hard drive, you can use a Docker container.