#Email Metadata pain points

1 messages · Page 1 of 1 (latest)

lethal aspen
#

So far I have:

  • it is difficult to properly handle line endings and "unfolding" (many tools generate text and don't handle newlines correctly)
  • it is unspecified which RFC is used; until a few years ago, encoding non-ASCII text was ambiguous
  • generating proper metadata is subtly difficult, and requires configuring the email module properly
  • difficult (impossible?) to properly validate given no schema
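
A minimal sketch of the line-ending and "unfolding" point, using only the standard library `email` package. The `RAW` fragment is a made-up example, not taken from a real package. It shows that the same folded header reads differently depending on which policy the parser is configured with, which is one way the "configuring the email module properly" problem shows up.

```python
from email import message_from_string
from email.policy import compat32, default

# Hypothetical METADATA fragment with a folded (multi-line) Summary header.
RAW = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Summary: first line\n"
    "        second line\n"
    "\n"
    "Long description body.\n"
)

# compat32 is the parser's default policy; it hands back the raw folded value.
legacy = message_from_string(RAW, policy=compat32)
# The modern policy strips the line breaks when the header is accessed.
modern = message_from_string(RAW, policy=default)

legacy_summary = legacy["Summary"]
modern_summary = str(modern["Summary"])

print(repr(legacy_summary))  # still contains the embedded newline
print(repr(modern_summary))  # no newline characters
```

A consumer that assumes one behavior while the producer assumed the other ends up with stray newlines or indentation in single-line fields.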
livid ferry
#

cc @hearty bough

languid shard
#

My main concern would be with the concept of "replace". I think it's fine if packages eventually provide JSON-only metadata, but installer tools and the like SHOULD (but not MUST) continue to support the current metadata standard

lethal aspen
#

I'm thinking the JSON metadata file would live alongside the email metadata file in a wheel minor version bump and probably indefinitely (or at least a very very long time) in sdists. Then in wheel 2.0 the email format metadata file could be dropped. Tools would still need to know how to read the email format to install wheel 1.0 packages.

hearty bough
#

here's my list:

#

I've written a prototype that adds METADATA.json and WHEEL.json in the uv build backend
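
A rough sketch of how a reader could cope with both files living side by side, as proposed earlier in the thread. The `METADATA.json` file name comes from the prototype mentioned above; the JSON layout (`name`, `version` keys), the `read_metadata` and `make_wheel` helpers, and the `demo-1.0.dist-info` directory are assumptions made up for illustration, not a specified format.

```python
import io
import json
import zipfile
from email import message_from_string


def read_metadata(wheel: zipfile.ZipFile, dist_info: str) -> dict:
    """Prefer the JSON file if present, else fall back to the email format."""
    json_name = f"{dist_info}/METADATA.json"  # name taken from the prototype
    if json_name in wheel.namelist():
        return json.loads(wheel.read(json_name))
    # Fallback: only two fields are extracted to keep the sketch short.
    raw = wheel.read(f"{dist_info}/METADATA").decode("utf-8")
    msg = message_from_string(raw)
    return {"name": msg["Name"], "version": msg["Version"]}


def make_wheel(files: dict) -> zipfile.ZipFile:
    """Build an in-memory archive so the example needs no files on disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


EMAIL = "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n"
old_wheel = make_wheel({"demo-1.0.dist-info/METADATA": EMAIL})
new_wheel = make_wheel({
    "demo-1.0.dist-info/METADATA": EMAIL,
    "demo-1.0.dist-info/METADATA.json": json.dumps(
        {"name": "demo", "version": "1.0"}
    ),
})

print(read_metadata(old_wheel, "demo-1.0.dist-info"))
print(read_metadata(new_wheel, "demo-1.0.dist-info"))
```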

lethal aspen
#

Oh wow thank you! This list is super helpful!

languid shard
#

I will say that one issue with lots of JSON files for pip is the amount of memory that it ends up consuming.

This appears to be a CPython issue, and I'm not knowledgeable enough about memory consumption to deep dive into this. But I plan to post a reproducible example (without any pip code) on DPO soon.

lethal aspen
#

Would be very interested in seeing that!

gleaming shale
#

This reminded me of this old post from Armin Ronacher about the overheads of parsing JS source maps in Python (and Sentry adopting Rust to eliminate that overhead): https://blog.sentry.io/fixing-python-performance-with-rust/

I wouldn't expect the JSON overhead to be noticeably higher than the email module overhead, though.


#

The other key piece I'd mention is the key-value-header-to-json translation that was defined in metadata 2.1: https://peps.python.org/pep-0566/#json-compatible-metadata

We did miss some items when defining that, though (like keywords remaining a comma separated field instead of becoming an actual list).

lethal aspen
#

Yeah I will definitely profile json vs email, I have got to think email will have more overhead than json
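
One possible shape for that profiling, as a small `timeit` harness. The sample documents are invented (a fake package with 50 requirements), the JSON key names are placeholders, and no result is implied here; real numbers would need real METADATA files and a memory profiler as well.

```python
import json
import timeit
from email import message_from_string

# Invented sample data: the same fields in both formats.
REQUIREMENTS = [f"dep{i}>=1.0" for i in range(50)]
EMAIL_TEXT = (
    "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n"
    + "".join(f"Requires-Dist: {req}\n" for req in REQUIREMENTS)
)
JSON_TEXT = json.dumps(
    {
        "metadata_version": "2.1",
        "name": "demo",
        "version": "1.0",
        "requires_dist": REQUIREMENTS,
    }
)


def parse_email():
    return message_from_string(EMAIL_TEXT)


def parse_json():
    return json.loads(JSON_TEXT)


# Time 1000 parses of each format; only wall-clock time is measured here.
email_seconds = timeit.timeit(parse_email, number=1000)
json_seconds = timeit.timeit(parse_json, number=1000)
print(f"email: {email_seconds:.4f}s  json: {json_seconds:.4f}s")
```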

#

@gleaming shale PEP 566 says keywords become a list, perhaps that was added later? Or maybe I misunderstand what you mean?

#

I will also go through the current core metadata specification and make sure any newer metadata items will be covered

hearty bough
#

for uv, the format is perf neutral, as the network latency is more significant than the parsing time here, and if we're reading from the cache we're using a binary format (rkyv)

lethal aspen
#

Interesting! I didn't realize you cached in a different format

languid shard
# gleaming shale This reminded me of this old post from Armin Ronacher about the overheads of par...

When I monitor memory in pip for a large resolution and use an index with a JSON API instead of an HTML API, it uses hundreds of MB of memory. The issue appears to be reading the JSON into memory, then dropping it from memory and reading more into memory. I don't know enough about memory to understand why this happens, but lots and lots of JSON reads seem to cause excessive memory bloat.

Maybe the OS automatically cleans this up? Because I seem to be the only one that has noticed this issue. Anyway I'm working on a reproducible example.

hearty bough
#

All transformed keys should be reduced to lower case. Hyphens should be replaced with underscores, but otherwise should retain all other characters;
@gleaming shale is it intentional that it's `-` -> `_` and not the other way round, given that PEP 691 uses hyphens?
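
For reference, the key transformation quoted above can be sketched roughly like this. The `MULTIPLE_USE` set is only an illustrative subset of the multiple-use fields, `to_json_compatible` is a hypothetical helper name, and storing the message body under `description` is my reading of PEP 566 rather than something confirmed in this thread.

```python
from email import message_from_string

# Illustrative subset of multiple-use fields, already in transformed spelling.
MULTIPLE_USE = {"classifier", "requires_dist", "project_url", "provides_extra"}


def to_json_compatible(raw: str) -> dict:
    msg = message_from_string(raw)
    result = {}
    for key in msg.keys():
        # Lower-case the key and replace hyphens with underscores.
        json_key = key.lower().replace("-", "_")
        if json_key in result:
            continue  # repeated header, already collected via get_all()
        values = msg.get_all(key)
        result[json_key] = values if json_key in MULTIPLE_USE else values[0]
    body = msg.get_payload()
    if body:
        result["description"] = body
    return result


RAW = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Requires-Dist: a\n"
    "Requires-Dist: b\n"
    "Home-page: https://example.invalid\n"
    "\n"
    "Body\n"
)
converted = to_json_compatible(RAW)
print(converted)
```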

hearty bough
#

the frameworks i worked with support mapping between hyphen json fields and underscore class fields, working around many rest apis using hyphens

gleaming shale
#

Yeah, I don't think it's a big deal in practice. I just know that I have a preference to keep the two spellings the same, so it seems plausible to me that Dustin and/or Daniel might share that preference.

We'd have to either ask them or go diving into the distutils-sig archives to get a more definitive answer, since the PEP itself doesn't give a rationale for that bit.

tawny stump
#

Without digging through the Git history or issues when packaging.metadata came into existence, the big issues were multi-line entries (there's no standard way to parse that), field name case variability (having to normalize is annoying and a bug waiting to happen), field names that aren't a valid Python identifier name (so much mapping back and forth), and probably other pain that I've blocked in my memory 😅 ( @bleak ravine might have his own list).

#

And thanks for trying to tackle this!

bleak ravine
#

technically the README doesn't have to be at the end

#

but in practice it almost always is

#

there are definitely things on PyPI where it's not at the end though

#

the lack of "advanced" data types is an issue as well. something like Project-URL is naturally a key->value mapping, but since the email format doesn't really have that, it had to make its own custom mini format for parsing
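
To make the mini format concrete: each Project-URL value is a label and a URL separated by a comma, so every consumer has to rebuild the mapping itself. A small sketch, where `parse_project_urls` is a hypothetical helper and the URLs are placeholders:

```python
def parse_project_urls(values: list[str]) -> dict[str, str]:
    """Turn repeated 'Label, URL' header values into a label -> URL mapping."""
    urls = {}
    for value in values:
        # Split on the first comma only; a URL may itself contain commas.
        label, separator, url = value.partition(",")
        if not separator:
            raise ValueError(f"missing comma in Project-URL value: {value!r}")
        urls[label.strip()] = url.strip()
    return urls


project_urls = parse_project_urls(
    [
        "Bug Tracker, https://example.invalid/issues",
        "Documentation, https://example.invalid/docs?a=1,2",
    ]
)
print(project_urls)
```

In a JSON document the same data could simply be an object, with no per-field parsing rule.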

#

one thing to be careful of is that there is the possibility of some sort of confused deputy attack. This is true with the email format as well, but packaging.metadata was, IIRC, written to avoid that

#

where you can repeat keys and if the same key is repeated it's implementation specific whether the first or last occurrence will be used
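
A sketch of the repeated-key problem on the JSON side, using only the standard library. Python's `json` module silently keeps the last value for a repeated name, so a strict reader has to opt in to rejecting duplicates; `reject_duplicates` is a hypothetical hook name and the document is a made-up example.

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that refuses objects with repeated keys."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


DOC = '{"name": "good", "name": "evil"}'

# Default behavior: no error, the last occurrence wins.
default_result = json.loads(DOC)
print(default_result)

# Strict behavior: the repeated key is reported instead of silently resolved.
try:
    json.loads(DOC, object_pairs_hook=reject_duplicates)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
print(strict_error)
```

Two tools that disagree on which occurrence counts can be shown different metadata by the same file, which is why rejecting duplicates outright is the safer choice.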

#

it's probably less of a concern with JSON, but there was a desire for the ability to do "lenient" parsing (e.g. parse what we can, and ignore the rest), which I don't think most JSON libraries allow, but also it's a lot harder/unlikely to produce structurally invalid json so 🤷‍♂️

#

both of those GH links have a lot of brain dumps in them, the first link in particular has information of what went wrong when I was trying to actually parse every METADATA file on PyPI

#

oh yea, the Keyword metadata is conceptually a list, but for hysterical raisins instead of using a repeated Keyword field, it used a single Keywords field, and depending on what era of packaging and what tool you used, that field was either space delimited, comma delimited, or YOLO delimited
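
A guess at what a tolerant reader has to do for that field today. The heuristic (commas win if any are present, otherwise split on whitespace) is an assumption for illustration and not what any particular tool does; `split_keywords` is a hypothetical name.

```python
def split_keywords(value: str) -> list[str]:
    """Best-effort split of a Keywords value with an unknown delimiter."""
    if "," in value:
        parts = value.split(",")
    else:
        parts = value.split()
    # Drop surrounding whitespace and empty entries left by stray delimiters.
    return [part.strip() for part in parts if part.strip()]


print(split_keywords("packaging, metadata,json"))
print(split_keywords("packaging metadata  json"))
```

The heuristic cannot tell a single multi-word keyword from several space-delimited ones, which is the ambiguity a real list type would remove.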

lethal aspen
#

ah YOLO delimited, my favorite!

bleak ravine
#

that's mostly because I think most (or all?) build backends took the keywords input as a string and it was up to the user to format it correctly

lethal aspen
#

Ah yeah

lethal aspen
#

ohh I didn't realize that, definitely will go in the list :P

bleak ravine
#

tfw parsing a format is so weird that there's more comments than there is code

grim cedar
#

But pylock.toml and other machine generated files are in toml for the human readability and editability, from the peps:

lethal aspen
#

Package metadata is also not meant to be human editable, and only incidentally human readable (aka the vast majority of users will never look at it). There's also a canonicalization from the email format to json

bleak ravine
#

I think you could make a reasonable argument for either toml or json, ultimately it'll probably be up to whoever writes a PEP (hey there Emma 😄 ) which arguments they think make the most sense