#Regular reminder that almost every

untold kernel
#

FWIW, I deal a lot with transferring large data files between organizations and I find CSV is often the best suited format.

It's streamable, carries little overhead, and forces the data to be tabular, which rules out exotic structural issues and allows only one dataset per file. Even for very sparse data, the compressed form is very small.
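
A minimal sketch of the streaming and compression points above, using only the Python standard library. The table shape, the column names, and the helper names `write_sparse_csv` and `stream_rows` are illustrative assumptions, not anything from the thread.

```python
import csv
import gzip
import io


def write_sparse_csv(buf, n_rows=1000, n_cols=20):
    """Write a mostly-empty table as gzip-compressed CSV into a binary buffer."""
    with gzip.open(buf, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow([f"col{i}" for i in range(n_cols)])
        for r in range(n_rows):
            row = [""] * n_cols
            row[r % n_cols] = str(r)  # exactly one populated cell per row
            writer.writerow(row)


def stream_rows(buf):
    """Yield one row at a time; the table is never fully held in memory."""
    with gzip.open(buf, "rt", newline="", encoding="utf-8") as fh:
        yield from csv.DictReader(fh)


buf = io.BytesIO()          # stands in for a file on disk or a network stream
write_sparse_csv(buf)
compressed_size = buf.tell()  # bytes written after compression
buf.seek(0)

# Count populated cells while streaming, row by row.
non_empty = sum(1 for row in stream_rows(buf) for v in row.values() if v)
```

The same generator works unchanged on a real file path passed to `gzip.open`, which is the property that matters for multi-gigabyte transfers.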

Every other data format I've used for this use case has produced issues that are significantly more complex to deal with.

Usually the big problem with CSV is storing it somewhere an end user can access and mutate it, because they will try to open it with Excel and save changes to it. But the industry I work in always backs up the source data, so it just becomes a question of how to provide data to the user in a sensible way that fits their use case.

foggy marsh
#

If you control both sides and have extensive tests covering all data that goes in (escapes and quoting and such) you’re fine.
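
A rough sketch of what such a test could look like, assuming Python's built-in `csv` module on both sides. The sample rows and the `round_trip` helper are made up for illustration; a real suite would cover the actual data and the actual dialect parameters in use.

```python
import csv
import io

# Values chosen to exercise quoting and escaping edge cases.
TRICKY_ROWS = [
    ["plain", "with,comma", 'with "quotes"'],
    ["line\nbreak", "", " leading space"],
    ["semi;colon", "tab\there", "unicode é"],
]


def round_trip(rows, **dialect):
    """Write rows and read them back with the same dialect parameters."""
    buf = io.StringIO(newline="")
    csv.writer(buf, **dialect).writerows(rows)
    buf.seek(0)
    return list(csv.reader(buf, **dialect))


# Matching parameters on both sides: the data survives unchanged.
same = round_trip(TRICKY_ROWS) == TRICKY_ROWS

# Mismatched parameters: the writer uses ';' but the reader assumes ','.
buf = io.StringIO(newline="")
csv.writer(buf, delimiter=";").writerows(TRICKY_ROWS)
buf.seek(0)
mismatched = list(csv.reader(buf))
```

Note that the mismatched read raises no error. It silently returns differently shaped rows, which is the kind of corruption that goes unnoticed without a test.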

#

But you could just use a structured format instead (e.g. JSON Lines) and not rely on a fragile pile of parameters matching up just right.

#

Handing CSV to users is doing them a disservice, because of the exact reasons I listed above. Context about which exact parameters are used gets lost and data corruption sneaks in.

untold kernel
#

JSON doesn't support line breaks, can have unexpectedly missing fields or nested data, and often doubles the size of your files.

Also, because JSON Lines supports arbitrary fields per line, someone can decide to put multiple datasets in a single file, and then it becomes a real mess.
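
One way to guard against the problems described above is to check every JSON Lines record against a fixed, flat set of fields. This is only a sketch: `EXPECTED_FIELDS` and `find_schema_violations` are hypothetical names, and the sample records are invented.

```python
import io
import json

EXPECTED_FIELDS = {"id", "name", "value"}  # hypothetical flat schema


def find_schema_violations(lines, expected=EXPECTED_FIELDS):
    """Return (line number, details) for records that are not flat and complete."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            problems.append((lineno, {"error": "not an object"}))
            continue
        missing = sorted(expected - set(record))
        extra = sorted(set(record) - expected)
        nested = sorted(k for k, v in record.items() if isinstance(v, (dict, list)))
        if missing or extra or nested:
            problems.append((lineno, {"missing": missing, "extra": extra, "nested": nested}))
    return problems


sample = io.StringIO(
    '{"id": 1, "name": "a", "value": 2.5}\n'       # fine
    '{"id": 2, "name": "b"}\n'                      # missing field
    '{"id": 3, "name": "c", "value": {"x": 1}}\n'   # nested data
    '{"sensor": "t1", "reading": 7}\n'              # a second dataset mixed in
)
violations = find_schema_violations(sample)
```

This only helps when the receiving side is allowed to reject files, which is not always the case between organizations.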

untold kernel
foggy marsh
#

JSON doesn't support line breaks
\n?

untold kernel
#

Yeah, you have to encode and decode the string
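
To make the point under discussion concrete: a raw line break is not allowed inside a JSON string, so an encoder writes it as the two-character escape `\n`, and a decoder turns it back. A small sketch with Python's standard `json` module (the record contents are invented):

```python
import json

record = {"id": 1, "note": "first line\nsecond line"}

# Encoding escapes the newline, so the record occupies one physical line.
encoded = json.dumps(record)
has_raw_newline = "\n" in encoded

# Decoding restores the original string, line break included.
decoded = json.loads(encoded)
```

The encode and decode steps are the extra work being debated here; the data itself survives the round trip.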

foggy marsh
#

I don’t get it, every language supports JSON.

#

and JSONL is a streamable format for JSON records
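
A minimal sketch of streaming JSON Lines in Python, assuming one JSON object per line. The helper name `iter_jsonl` and the sample fields are assumptions for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(fh: TextIO) -> Iterator[dict]:
    """Yield one decoded record per non-empty line, without loading the whole file."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for an open file or a network stream.
stream = io.StringIO('{"id": 1, "v": 10}\n{"id": 2, "v": 32}\n')
total = sum(rec["v"] for rec in iter_jsonl(stream))
```

Each record is decoded independently, so memory use stays bounded by the largest single line.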

untold kernel
#

That step makes it slower and use more memory, in every language

#

I've had datasets that take 3 days just to read

foggy marsh
#

lol, then CSV is definitely a horrible idea as well

untold kernel
#

Steps like that can end up adding a day, or 2, or 3

foggy marsh
#

I think that’s probably a case for TileDB or so.

#

or zarr

#

definitely nothing text-based

untold kernel
foggy marsh
#

You don’t consider 3 days of reading time a problem?

untold kernel
# foggy marsh I think that’s probably a case for TileDB or so.

Between software companies, sure. But anything non-text-based, I've found, can result in failure or be prohibitively expensive for at least one party. For example, just the clearance to use a piece of software may take one side 3+ months and many days of people's time. CSV support is almost always built into every language, tool, import and export.

foggy marsh
#

“worse is better”

#

Trust me, if you can actually go for well-specified formats, everything is so much nicer.

untold kernel
# foggy marsh “worse is better”

A worse general format can produce a better, more successful project, yes, because it is better in specific ways that are more useful to your project

foggy marsh
#

I created an industry standard instead, which you can read to memory, use with out-of-core solutions, or over the internet: https://anndata.readthedocs.io/en/latest/fileformat-prose.html

#

The majority of my scientific field switched from R to Python because of it.

untold kernel
# foggy marsh Trust me, if you can actually go for well-specified formats, everthing is so muc...

I don't need to trust you, I've worked with thousands of datasets. It's great when everything is well specified, but it isn't worth making the project 10x more expensive for.

And it opens up a new vector of problems. What happens when your data doesn't meet the specification and your counterparty doesn't assist you? You now need to either do something to the data or ignore the spec. I've experienced that one a bunch of times.

foggy marsh
#

You can design a data structure so it has an escape hatch. Works really well.
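
One possible reading of "escape hatch", sketched under assumptions: a record type with a few strictly typed fields plus a free-form slot for anything that does not fit the specification. The `Record` class, its field names, and `parse_record` are hypothetical and are not taken from any format mentioned in this thread.

```python
from dataclasses import dataclass, field
from typing import Any

KNOWN_FIELDS = {"sample_id", "value"}  # hypothetical specified fields


@dataclass
class Record:
    sample_id: str
    value: float
    # Escape hatch: anything outside the spec is kept here instead of rejected.
    extras: dict[str, Any] = field(default_factory=dict)


def parse_record(raw: dict[str, Any]) -> Record:
    """Validate the specified fields and preserve the rest untouched."""
    extras = {k: v for k, v in raw.items() if k not in KNOWN_FIELDS}
    return Record(
        sample_id=str(raw["sample_id"]),
        value=float(raw["value"]),
        extras=extras,
    )


rec = parse_record({"sample_id": "s1", "value": "3.5", "batch": "B7"})
```

The trade-off is that consumers can rely on the typed fields while still receiving data the spec did not anticipate.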

untold kernel
#

You can design all you want, but if your counterparty won't or can't, it doesn't matter

foggy marsh
#

I like working in open source. Nobody told me I shouldn’t make the project take a month longer and now every single research institute in my field that works with Python provides data sets in my format.

untold kernel
#

The fact that CSV doesn't let you design anything nested or variable is a huge benefit when dealing with external parties with wide-ranging technical, corporate, and regulatory restrictions

foggy marsh
#

In my experience this doesn’t stop people from defining ad-hoc formats for nested or variable-length data.

untold kernel
#

For sure, I deal with a lot of those too, a definite downgrade from CSV