One of the cool features of the HDF5 file format is the ability to read subsets of the data without (necessarily) having to read the entire file, keeping both the memory usage and execution time of these operations to a minimum. However, this is not always as performant as one might hope. This may be due to bottlenecks when working with data on-disk rather than in memory, or idiosyncrasies in either the HDF5 library itself or the rhdf5 package. Here we investigate some of the possible bottlenecks.
I just got back from a great week at the CZI meeting/workshop/hackathon to mark the start of the ‘Collaborative Computational Tools for the Human Cell Atlas’ project. One topic that came up frequently was the suitability of various file formats for storing single-cell data. Of particular interest to me was whether it is practical (or indeed possible) to perform parallel processing on data stored in HDF5 files from within R.
Earlier this year 10X Genomics released a single-cell RNA-sequencing dataset containing data from 1.3 million mouse brain cells. The blog post accompanying the release contained the provocative statement “We do not recommend loading the file into R, due to the file size and the lack of 64 bit integers support in R.” This is a bit of a non-sequitur, and naturally there has been a push within the Bioconductor community to address such concerns and show how to work with such datasets efficiently. Here we look at some basic benchmarks of R & Bioconductor’s performance on this dataset.
I recently responded to this post on the Bioconductor forum regarding a problem with reading an HDF5 file using the rhdf5 package. I was initially unable to reproduce the problem until I tried on Windows, where it failed immediately. Here’s an examination of why.
My previous post here discussed how to build recent versions of R on an old machine that has outdated system libraries. Similarly, sometimes you can get the core parts of R to work, but one of the packages you’re trying to use fails to work as you’d expect. One such example was posted on the Bioconductor support […]
If you’re a user of R and would like to build a recent version for yourself, but you’re working on a fairly old Linux operating system, you may encounter some issues regarding various libraries that don’t meet R’s minimum version requirements. This is my guide to getting everything you need installed.