Regular reminder that almost every | PyPA

untold kernel Aug 25, 2025, 4:08 PM

#

FWIW, I deal a lot with transferring large data files between organizations and I find CSV is often the best suited format.

It's streamable, has little overhead, forces the data to be tabular (which rules out exotic structural issues and allows only one dataset per file), and even for very sparse data the compressed form is very small.
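To illustrate the "streamable" point, here is a minimal sketch using Python's standard `csv` module. The column names and the helper `stream_column_sum` are made up for illustration; the idea is only that rows can be processed one at a time without loading the whole file.

```python
import csv
import io


def stream_column_sum(lines, column):
    """Sum one numeric column while holding only one row in memory at a time.

    `lines` can be any iterable of text lines, e.g. an open file object.
    """
    reader = csv.DictReader(lines)
    total = 0.0
    for row in reader:
        total += float(row[column])
    return total


# Hypothetical sample data; in practice this would be a (possibly compressed) file.
sample = io.StringIO("id,amount\n1,2.5\n2,4.0\n3,1.5\n")
total = stream_column_sum(sample, "amount")
```

The same function works on a file opened with `gzip.open(path, "rt", newline="")`, which is one way the compressed form can be read without unpacking it to disk first.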

Every other data format I've used for this use case has produced issues that are significantly more complex to deal with.

Usually the big problem with CSV is storing it somewhere an end user can access and mutate it, because they will try to open it with Excel and save changes to it. But the industry I work in always backs up the source data, so it just becomes a question of how to provide data to the user in a sensible way that fits their use case.

foggy marsh Aug 26, 2025, 7:30 AM

#

If you control both sides and have extensive tests covering all data that goes in (escapes and quoting and such) you’re fine.
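A round-trip test of the kind mentioned here might look like the following sketch. The sample rows are invented, and it assumes both sides use Python's `csv` module with the same quoting setting; a real test would need to use whatever writer and reader each party actually runs.

```python
import csv
import io

# Hypothetical rows containing the usual troublemakers:
# embedded quotes, embedded delimiters, and an embedded line break.
rows = [
    ["id", "comment"],
    ["1", 'said "hi", then left'],
    ["2", "line one\nline two"],
]

# newline="" keeps the csv module in charge of line endings.
buffer = io.StringIO(newline="")
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerows(rows)

buffer.seek(0)
round_tripped = list(csv.reader(buffer))
```

If `round_tripped` differs from `rows`, the writer and reader parameters do not match for that input.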

#

But you could just use a structured format instead (e.g. JSON Lines) and not rely on a fragile pile of parameters matching up just right.
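For reference, JSON Lines is one JSON value per line. A minimal sketch of writing and reading it with the standard `json` module follows; the helper names `write_jsonl` and `read_jsonl` and the sample records are assumptions for illustration.

```python
import io
import json


def write_jsonl(records, stream):
    """Write each record as one JSON document followed by a newline."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


def read_jsonl(stream):
    """Yield records one line at a time, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# Hypothetical sample records.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

stream = io.StringIO()
write_jsonl(records, stream)
stream.seek(0)
loaded = list(read_jsonl(stream))
```

Because `read_jsonl` is a generator, records can be consumed one at a time rather than all at once.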

#

Handing CSV to users is doing them a disservice, because of the exact reasons I listed above. Context about which exact parameters are used gets lost and data corruption sneaks in.

untold kernel Aug 26, 2025, 12:55 PM

#

JSON doesn't support line breaks, can unexpectedly have randomly missing fields, nested data, and it often doubles the size of your files.

Also, because JSON Lines can support arbitrary fields per line, someone can decide to put multiple datasets in a single file, and then it becomes a real mess.
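One way to detect that situation is to check every line against an expected set of fields. This is only a sketch: the function `find_schema_mismatches`, the field names, and the sample lines are hypothetical, and it assumes every line is a JSON object.

```python
import json


def find_schema_mismatches(lines, expected_fields):
    """Return 1-based line numbers whose top-level keys differ from expected_fields."""
    expected = set(expected_fields)
    mismatches = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if set(record) != expected:
            mismatches.append(number)
    return mismatches


# Hypothetical file content: line 2 is missing a field,
# line 3 looks like it belongs to a different dataset.
sample = [
    '{"id": 1, "amount": 2.5}',
    '{"id": 2}',
    '{"sensor": "a", "reading": 7}',
]
bad_lines = find_schema_mismatches(sample, ["id", "amount"])
```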

untold kernel Aug 26, 2025, 12:56 PM

#

foggy marsh Handing CSV to users is doing them a disservice, because of the exact reasons I ...

Sometimes that's what users ask for and can best interact with, so whatcha going to do.

foggy marsh Aug 26, 2025, 1:53 PM

#

JSON doesn't support line breaks
\n?

untold kernel Aug 26, 2025, 1:54 PM

#

Yeah, you have to encode and decode the string
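The exchange above is about this behavior: a raw line break cannot appear inside a JSON string, so it is escaped on encoding and restored on decoding. A small sketch with the standard `json` module, using an invented record:

```python
import json

# Hypothetical record with a line break inside a string value.
record = {"id": 1, "note": "line one\nline two"}

# Encoding escapes the line break, so the result stays on a single line.
encoded = json.dumps(record)

# Decoding restores the original string, including the line break.
decoded = json.loads(encoded)
```

The escape and unescape step is the extra work being discussed; whether it matters in practice depends on the data volume.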

foggy marsh Aug 26, 2025, 1:55 PM

#

I don’t get it, every language supports JSON.

#

and JSONL is a streamable format for JSON records

untold kernel Aug 26, 2025, 1:56 PM

#

That step makes it slower and use more memory, in every language

#

I've had datasets that take 3 days just to read

foggy marsh Aug 26, 2025, 1:58 PM

#

lol, then CSV is definitely a horrible idea as well

untold kernel Aug 26, 2025, 1:58 PM

#

Steps like that can end up adding a day, or 2, or 3

foggy marsh Aug 26, 2025, 1:58 PM

#

I think that’s probably a case for TileDB or so.

#

or zarr

#

definitely nothing text-based

untold kernel Aug 26, 2025, 2:03 PM

#

foggy marsh lol, then CSV is definitely a horrible idea as well

It may be anecdotal, but my real-world experience with thousands of datasets has shown the opposite. Problems I've had with CSV have been trivial to solve. Problems I've had with other formats have been painful or completely intractable.

foggy marsh Aug 26, 2025, 2:03 PM

#

You don’t consider 3 days of reading time a problem?

untold kernel Aug 26, 2025, 2:05 PM

#

foggy marsh I think that’s probably a case for TileDB or so.

Between software companies, sure. But I've found that anything non-text-based can result in failure or be prohibitively expensive for at least one party. For example, just the clearance to use a piece of software may take one side 3+ months and many days of people's hours. CSV is almost always built into every language, tool, import, and export.

foggy marsh Aug 26, 2025, 2:06 PM

#

“worse is better”

#

Trust me, if you can actually go for well-specified formats, everything is so much nicer.

untold kernel Aug 26, 2025, 2:09 PM

#

foggy marsh You don’t consider 3 days of reading time a problem?

It's a trivial problem, you can wait it out

untold kernel Aug 26, 2025, 2:09 PM

#

foggy marsh “worse is better”

A worse general format can produce a better, more successful project, yes, because it is better in specific ways that are more useful to your project.

foggy marsh Aug 26, 2025, 2:11 PM

#

I created an industry standard instead, which you can read to memory, use with out-of-core solutions, or over the internet: https://anndata.readthedocs.io/en/latest/fileformat-prose.html

anndata

On-disk format

AnnData objects are saved on disk to hierarchical array stores like HDF5(via H5py) and Zarr-Python. This allows us to have very similar structures in disk and on memory. As an example we’ll look in...

#

The majority of my scientific field switched from R to Python because of it.

untold kernel Aug 26, 2025, 2:12 PM

#

foggy marsh Trust me, if you can actually go for well-specified formats, everything is so muc...

I don't need to trust you, I've worked with thousands of datasets. It's great when everything is well specified, but it isn't worth making the project 10x more expensive for.

And it opens up a new vector of problems. What happens when your data doesn't meet the specification and your counterparty doesn't assist you? You now need to either do something to the data or ignore the spec. I've experienced that one a bunch of times.

foggy marsh Aug 26, 2025, 2:16 PM

#

You can design the data structure so it has an escape hatch. Works really well.

untold kernel Aug 26, 2025, 2:17 PM

#

You can design all you want, but if your counterparty won't or can't, it doesn't matter

foggy marsh Aug 26, 2025, 2:17 PM

#

I like working in open source. Nobody told me I shouldn’t make the project take a month longer and now every single research institute in my field that works with Python provides data sets in my format.

untold kernel Aug 26, 2025, 2:18 PM

#

The fact that CSV doesn't let you design anything nested or variable is a huge benefit when dealing with external parties with wide-ranging technical, corporate, and regulatory restrictions

foggy marsh Aug 26, 2025, 2:19 PM

#

In my experience this doesn’t stop people from defining ad-hoc formats for nested or variable-length data.

untold kernel Aug 26, 2025, 2:20 PM

#

For sure, I deal with a lot of those too, definite downgrade on CSV
