#Regular reminder that almost every
FWIW, I deal a lot with transferring large data files between organizations and I find CSV is often the best suited format.
It's streamable, contains little overhead, forces the data to be tabular which constrains exotic issues and only allows one dataset per file, and even in the situation of very sparse data the compressed form will be very small.
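To illustrate the "streamable" point, here is a minimal sketch using Python's standard `csv` module. The column names and the in-memory sample are made up for illustration; in practice `lines` would be an open file handle or a decompression stream.

```python
import csv
import io


def stream_row_count_and_sum(lines, column):
    """Consume CSV rows one at a time; only the current row is held in memory."""
    reader = csv.DictReader(lines)
    count = 0
    total = 0.0
    for row in reader:
        count += 1
        total += float(row[column])
    return count, total


# Hypothetical sample data standing in for a large file handle.
sample = io.StringIO("id,value\n1,2.5\n2,4.0\n3,1.5\n")
count, total = stream_row_count_and_sum(sample, "value")
print(count, total)
```

Because the reader pulls one line at a time from the iterable, memory use stays roughly constant regardless of file size.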
Every other data format I've used for this use case has produced issues that are significantly more complex to deal with.
Usually the big problem with csv is storing it somewhere where an end user can access and mutate it, because they will try and open it with Excel and save changes to it. But the industry I work in always backs up the source data, so it just becomes a question of how to provide data to the user in a sensible way that fits their use case.
If you control both sides and have extensive tests covering all data that goes in (escapes and quoting and such) you’re fine.
But you could just use a structured format instead (e.g. JSON Lines) and not rely on a fragile pile of parameters matching up just right.
Handing CSV to users is doing them a disservice, because of the exact reasons I listed above. Context about which exact parameters are used gets lost and data corruption sneaks in.
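A small sketch of the "parameters matching up" concern, using Python's standard `csv` module. The sample row is invented; the point is only that a field with commas, quotes, and a line break round-trips when reader and writer agree on the dialect, and silently comes out wrong when they do not.

```python
import csv
import io

# Hypothetical row with a comma, embedded quotes, and a line break in one field.
rows = [["id", "comment"], ["1", 'He said "hi", then left\nNew line here']]

# Write with explicit parameters; both sides must agree on these.
buf = io.StringIO(newline="")
writer = csv.writer(
    buf,
    delimiter=",",
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,
    lineterminator="\n",
)
writer.writerows(rows)
text = buf.getvalue()

# Reader with matching parameters: the data round-trips unchanged.
matching = list(
    csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"')
)

# Reader that assumes a different delimiter: no error is raised,
# but the rows no longer match what was written.
mismatched = list(
    csv.reader(io.StringIO(text, newline=""), delimiter=";", quotechar='"')
)

print(matching == rows)
print(mismatched == rows)
```

Nothing in the file itself records which delimiter or quoting rule was used, which is how that context gets lost once the file changes hands.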
JSON doesn't support line breaks, can unexpectedly have randomly missing fields, nested data, and it often doubles the size of your files.
Also because JSON lines can support arbitrary fields per line someone can decide to put in multiple datasets in a single file and then it becomes a real mess.
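The size claim is easy to check on your own data. A rough sketch, assuming a made-up three-column dataset, that serializes the same records as CSV and as JSON Lines and compares the uncompressed byte counts:

```python
import csv
import io
import json

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"sample_id": i, "temperature": 20.5 + i, "status": "ok"}
    for i in range(1000)
]

# CSV: the column names appear once, in the header.
csv_buf = io.StringIO(newline="")
writer = csv.DictWriter(
    csv_buf,
    fieldnames=["sample_id", "temperature", "status"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(records)
csv_bytes = len(csv_buf.getvalue().encode("utf-8"))

# JSON Lines: every record repeats its keys.
jsonl_text = "".join(json.dumps(r) + "\n" for r in records)
jsonl_bytes = len(jsonl_text.encode("utf-8"))

ratio = jsonl_bytes / csv_bytes
print(csv_bytes, jsonl_bytes, round(ratio, 2))
```

The exact ratio depends on how long the keys are relative to the values, and the gap typically narrows once both files are compressed, since the repeated keys compress well.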
Sometimes that's what users ask for and can best interact with, so whatcha going to do.
JSON doesn't support line breaks
\n?
Yeah, you have to encode and decode the string
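For reference, this is what the encode/decode step looks like with Python's standard `json` module (the record is a made-up example). The line break is stored as the two-character escape `\n`, so the serialized record stays on one physical line, and decoding restores the original string.

```python
import json

# Hypothetical record whose text field contains a real line break.
record = {"id": 1, "note": "first line\nsecond line"}

# Encoding escapes the line break as backslash + n.
encoded = json.dumps(record)

# Decoding turns the escape back into a real line break.
decoded = json.loads(encoded)

print(encoded)
print(decoded == record)
```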
I don’t get it, every language supports JSON.
and JSONL is a streamable format for JSON records
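A minimal sketch of streaming JSON Lines with the standard library, assuming one JSON object per line. The helper name `iter_jsonl` and the sample data are hypothetical. It also shows the missing-field concern raised earlier: nothing forces each record to carry the same keys.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one decoded record per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample standing in for a large file handle.
# The second record has no "value" field, which JSON Lines permits.
sample = io.StringIO('{"id": 1, "value": 2.5}\n{"id": 2}\n')
records = list(iter_jsonl(sample))
missing = [r["id"] for r in records if "value" not in r]

print(records)
print(missing)
```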
That step makes it slower and use more memory, in every language
I've had datasets that take 3 days just to read
lol, then CSV is definitely a horrible idea as well
Steps like that can end up adding a day, or 2, or 3
I think that’s probably a case for TileDB or so.
or zarr
definitely nothing text-based
It may be anecdotal, but my real world experience of thousands of datasets has proven the opposite. Problems I've had with CSV have been trivial to solve. Problems I've had with other formats have been painful or completely intractable
You don’t consider 3 days of reading time a problem?
Between software companies, sure. Anything non-text-based between non-software companies, I've found, can result in failure or be prohibitively expensive for at least one party. For example, just getting clearance to use a piece of software may take one side 3+ months and many days of people's hours. CSV is almost always built into every language, tool, import and export
“worse is better”
Trust me, if you can actually go for well-specified formats, everything is so much nicer.
It's a trivial problem, you can wait it out
A worse general format can produce a better, more successful project, yes, because it is better in specific ways that are more useful to your project
I created an industry standard instead, which you can read to memory, use with out-of-core solutions, or over the internet: https://anndata.readthedocs.io/en/latest/fileformat-prose.html
The majority of my scientific field switched from R to Python because of it.
I don't need to trust you, I worked with thousands of datasets. It's great when everything is well specified, but it isn't worth making the project 10x more expensive for.
And it opens up a new vector of problems. What happens when your data doesn't meet the specification and your counterparty doesn't assist you? You now need to either do something to the data or ignore the spec. I've experienced that one a bunch of times.
You can design data structure so it has an escape hatch. Works really well.
You can design all you want, but if your counterparty won't or can't, it doesn't matter
I like working in open source. Nobody told me I shouldn’t make the project take a month longer and now every single research institute in my field that works with Python provides data sets in my format.
The fact that CSV doesn't let you design anything nested or variable is a huge benefit when dealing with external parties with wide ranging technical, corporate, and regulatory restrictions
In my experience this doesn’t stop people from defining ad-hoc formats for nested or variable-length data.
For sure, I deal with a lot of those too, definite downgrade on CSV