One of the cool features of the HDF5 file format is the ability to read subsets of the data without (necessarily) having to read the entire file, keeping both the memory usage and execution time of these operations to a minimum. However, this is not always as performant as one might hope. The cause may be a bottleneck inherent in working with data on disk rather than in memory, or an idiosyncrasy in either the HDF5 library itself or the rhdf5 package. Here we investigate some of the possible bottlenecks.
I just got back from a great week at the CZI meeting/workshop/hackathon to mark the start of the ‘Collaborative Computational Tools for the Human Cell Atlas’ project. One topic that came up frequently was the suitability of various file formats for storing single-cell data. Of particular interest to me was whether it is practical (or indeed possible) to perform parallel processing on data stored in HDF5 files from within R.
Earlier this year, 10X Genomics released a single-cell RNA-sequencing dataset containing data from 1.3 million mouse brain cells. The blog post accompanying the release contained the provocative statement “We do not recommend loading the file into R, due to the file size and the lack of 64 bit integers support in R.” This is a bit of a non sequitur, and naturally there has been a push within the Bioconductor community to address such concerns and show how to work with datasets like this efficiently. Here we look at some basic benchmarks of R & Bioconductor’s performance on this dataset.