The original comment was, "CSV parsing is relatively slow." Relative to what? Th...

roel_v · on March 30, 2016

I guess in re-reading my comment, I phrased it in a tone that was more combative than I meant, so don't take the GP as an attack.

My point was that all conversions from text into numbers are slow (I'm assuming here that the target use case is reading large amounts of numeric data), and that the slow part isn't the IO but the conversion itself. In that light, it isn't a surprise that SQLite isn't much faster than CSV, because SQLite doesn't have 'strong typing' itself. Any storage format that is concerned with speed will store data in binary format. But of course text formats are a lot easier to work with, and to interface between programs.

When I have a lot of data that I know needs multiple passes of reading, I sometimes build quick and dirty 'caches': I read file.csv and do what is basically a memory dump of the parsed data into file.csv.bin. My read functions can then check whether that file exists and skip the parsing step. In my experience, this can easily be an order of magnitude (10x) faster. It's not portable or even elegant, of course.
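For illustration, the file.csv to file.csv.bin trick could look roughly like the Python sketch below. This is not the commenter's actual code. The function name `read_numeric_csv` is hypothetical, and the sketch assumes a purely numeric, comma-separated file with no header. It uses only the standard library and returns a flat array of doubles, so the row/column shape is not preserved.

```python
import os
from array import array


def read_numeric_csv(csv_path):
    """Return the numbers in csv_path as a flat array of doubles.

    The first call parses the text and dumps the raw doubles to
    csv_path + ".bin"; later calls read that dump and skip parsing.
    Assumptions: numeric fields only, comma-separated, no header row.
    The cache is never invalidated if the CSV changes afterwards.
    """
    cache_path = csv_path + ".bin"  # mirrors the file.csv.bin naming
    values = array("d")

    if os.path.exists(cache_path):
        # Fast path: reinterpret the raw bytes as doubles, no text parsing.
        with open(cache_path, "rb") as fh:
            values.frombytes(fh.read())
        return values

    # Slow path: convert every text field to a float.
    with open(csv_path, newline="") as fh:
        for line in fh:
            line = line.strip()
            if line:
                values.extend(float(field) for field in line.split(","))

    # Dump the in-memory representation. The bytes are in native byte
    # order, which is why such a cache is not portable between machines.
    with open(cache_path, "wb") as fh:
        values.tofile(fh)
    return values
```

Calling `read_numeric_csv("file.csv")` twice parses the text only on the first call. Any actual speedup depends on the data and the machine.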

Apart from that - when speed is of importance, one wouldn't use R in the first place, of course (I say that as someone who likes R for what it is and for the things it does well).

I don't do write-ups of speed analyses, nor do I know of any, so I have to cop out on that one.

rspeer · on March 30, 2016

The fact that storing data in binary format gives you more speed is exactly why you'd want to use Feather. It means you don't have to translate things into an unspecified .bin format. This is the answer to the original question about "why not use CSV".

I don't know about R, but in Python, operations on objects such as NumPy arrays and Pandas DataFrames are all implemented using fast C code, and so is Feather. So you can be concerned about speed and still use these tools.

roel_v · on March 31, 2016

Yes, of course, and I'd much rather use something like it; once it supports matrices with more than 2 dimensions, I will (or at least, I will look into it). In cases where I need more robust storage I already use HDF5 or NetCDF, but they're a PITA to work with.

Of course in-memory operations can be implemented efficiently; R does that too. But Python needs to parse CSV into numbers just like everybody else, and even if it's done in C underneath, it'll still be 'slow' (for some values of that word).

rspeer · on March 30, 2016

> Relative to what?

Relative to HDF5, and relative to Feather if it's doing its job right.