Exploring performance when extracting subsets from HDF5

One of the cool features of the HDF5 file format is the ability to read subsets of the data without (necessarily) having to read the entire file, keeping both the memory usage and execution time of these operations to a minimum. However, this is not always as performant as one might hope. This may be due to bottlenecks when working with data on disk rather than in memory, or to idiosyncrasies in either the HDF5 library itself or the rhdf5 package. Here we investigate some of the possible bottlenecks.

Read More

Parallel processing with R and HDF5

I just got back from a great week at the CZI meeting/workshop/hackathon to mark the start of the ‘Collaborative Computational Tools for the Human Cell Atlas’ project. One topic that came up frequently was the suitability of various file formats for storing single-cell data. Of particular interest to me was whether it is practical (or indeed possible) to perform parallel processing on data stored in HDF5 files from within R.

Read More

10X single-cell data & HDF5Array performance

Earlier this year, 10X Genomics released a single-cell RNA-sequencing dataset containing data from 1.3 million mouse brain cells. The blog post accompanying the release contained the provocative statement “We do not recommend loading the file into R, due to the file size and the lack of 64 bit integers support in R.” This is a bit of a non sequitur, and naturally there has been a push within the Bioconductor community to address such concerns and show how to work with such datasets efficiently. Here we look at some basic benchmarks of R and Bioconductor’s performance on this dataset.

Read More