#Email Metadata pain points
So far I have:
- it is difficult to properly handle line endings and "unfolding" (many tools generate text and don't handle newlines correctly)
- it is unspecified what RFC is used, until a few years ago encoding non-ascii text was ambiguous
- generating proper metadata is subtly difficult, and requires configuring the email module properly
- difficult (impossible?) to properly validate given no schema
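A minimal sketch of the folding and generation problems listed above, using only the standard library. The sample text and the `unfold` helper are hypothetical; collapsing continuation lines into single spaces is one possible choice, not a rule taken from any spec.

```python
from email import message_from_string
from email.message import EmailMessage

# Hypothetical METADATA-like input with a folded (multi-line) License header.
RAW = (
    "Metadata-Version: 2.1\n"
    "Name: example\n"
    "License: line one\n"
    "        line two\n"
    "\n"
    "Long description body\n"
)


def unfold(value: str) -> str:
    """Hypothetical helper: join continuation lines with single spaces."""
    return " ".join(part.strip() for part in value.splitlines())


# message_from_string uses the legacy compat32 policy by default, which
# hands back the folded value verbatim, newline and indentation included.
msg = message_from_string(RAW)
raw_license = msg["License"]
clean_license = unfold(raw_license)


def modern_policy_rejects_newlines() -> bool:
    """EmailMessage uses the modern default policy, which refuses header
    values containing line breaks, so a writer has to fold them itself."""
    out = EmailMessage()
    try:
        out["License"] = "line one\nline two"
    except ValueError:
        return True
    return False
```

The two policies behave differently on the same data, which is part of why tools that build the text by hand get it subtly wrong.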
cc @hearty bough
My main concern would be with the concept of "replace". I think it's fine if packages eventually provide JSON-only metadata, but installer tools and the like SHOULD (but not MUST) continue to support the current metadata standard
I'm thinking the JSON metadata file would live alongside the email metadata file in a wheel minor version bump and probably indefinitely (or at least a very very long time) in sdists. Then in wheel 2.0 the email format metadata file could be dropped. Tools would still need to know how to read the email format to install wheel 1.0 packages.
here's my list:
- the email format requires a parser separate from everything else. this is unlike json and toml, with their (often built-in) parsers that serialize and deserialize from and into classes/structs. we're only using the email headers here, but json and toml everywhere.
- in uv, https://github.com/astral-sh/uv/blob/4c2fc5490d67e8e56ac3f263ca1bb4726539e6f3/crates/uv-pypi-types/src/metadata/metadata23.rs#L109-L181 and https://github.com/astral-sh/uv/blob/c5032aee80d5a666a45ec5e095b8c8abeffc11fb/crates/uv-pypi-types/src/metadata/metadata_resolver.rs#L90-L165 are our METADATA parsing code, while https://github.com/astral-sh/uv/blob/4c2fc5490d67e8e56ac3f263ca1bb4726539e6f3/crates/uv-pypi-types/src/metadata/metadata23.rs#L13-L14 is the json parsing code
- In PEP 691, there's a media break going from a normal rest json api to getting .metadata, which isn't
- the formatting of multiline fields, or rather the multiline field license, is unclear. hatchling for example doesn't use the email module but rolls its own: https://github.com/pypa/hatch/blob/1867a9062ad3c8348857edf1ca8f945ca118e8b1/backend/src/hatchling/metadata/spec.py#L546-L555
- the one thing that's better about the header format is that the readme is at the end, so it's possible to parse the relevant metadata while ignoring the long readme text (while we still have to download it).
i've written a prototype that adds METADATA.json and WHEEL.json in the uv build backend
Oh wow thank you! This list is super helpful!
I will say that one issue with lots of JSON files for pip is the amount of memory that it ends up consuming.
This appears to be a CPython issue, and I'm not knowledgeable enough about memory consumption to deep dive into this. But I plan to post a reproducible example (without any pip code) on DPO soon.
Would be very interested in seeing that!
This reminded me of this old post from Armin Ronacher about the overheads of parsing JS source maps in Python (and Sentry adopting Rust to eliminate that overhead): https://blog.sentry.io/fixing-python-performance-with-rust/
I wouldn't expect the JSON overhead to be noticeably higher than the email module overhead, though.
The other key piece I'd mention is the key-value-header-to-json translation that was defined in metadata 2.1: https://peps.python.org/pep-0566/#json-compatible-metadata
We did miss some items when defining that, though (like keywords remaining a comma separated field instead of becoming an actual list).
Yeah I will definitely profile json vs email, I have got to think email will have more overhead than json
@gleaming shale PEP 566 says keywords become a list, perhaps that was added later? Or maybe I misunderstand what you mean?
I will also go through the current core metadata specification and make sure any newer metadata items will be covered
for uv, the format is perf neutral, as the network latency is more significant than the parsing time here, and if we're reading from the cache we're using a binary format (rkyv)
Interesting! I didn't realize you cached in a different format
When I monitor memory in pip for a large resolution and use an index with a JSON API instead of an HTML API, it uses 100s of MBs of memory. The issue appears to be reading the JSON into memory, then dropping it from memory and reading more into memory. I don't know enough about memory to understand why this happens, but lots and lots of JSON reads seem to cause excessive memory bloat.
Maybe the OS automatically cleans this up? Because I seem to be the only one that has noticed this issue. Anyway I'm working on a reproducible example.
All transformed keys should be reduced to lower case. Hyphens should be replaced with underscores, but otherwise should retain all other characters;
@gleaming shale is it intentional that it's - --> _ and not the other way round, where PEP 691 uses hyphens?
I checked the PEP history, and I think I'm just wrong. I'm pretty sure we made a mistake like that somewhere along the line, but apparently that wasn't it.
For the key names? I don't really remember, but I assume it was just because using hyphens doesn't give valid Python identifiers, while underscores do.
Folks presumably weren't as concerned about supporting attribute based APIs when designing 691.
the frameworks i worked with support mapping between hyphen json fields and underscore class fields, working around many rest apis using hyphens
Yeah, I don't think it's a big deal in practice. I just know that I have a preference to keep the two spellings the same, so it seems plausible to me that Dustin and/or Daniel might share that preference.
We'd have to either ask them or go diving into the distutils-sig archives to get a more definitive answer, since the PEP itself doesn't give a rationale for that bit.
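For reference, the key transformation quoted from PEP 566 can be sketched as below, together with the PEP's rule that multiple-use fields become lists. The `MULTIPLE_USE` set is only an illustrative subset, and `json_key` / `to_json_compatible` are hypothetical names, not an existing API.

```python
from email.parser import HeaderParser

# Assumption: an illustrative subset of the multiple-use fields,
# already written in their transformed (lowercase, underscore) form.
MULTIPLE_USE = {"classifier", "requires_dist", "project_url", "provides_extra"}


def json_key(field_name: str) -> str:
    """Lowercase the field name and replace hyphens with underscores."""
    return field_name.lower().replace("-", "_")


def to_json_compatible(raw: str) -> dict:
    """Hypothetical converter from header text to a JSON-ready dict."""
    msg = HeaderParser().parsestr(raw)
    result = {}
    for name in msg.keys():
        key = json_key(name)
        if key in result:
            continue  # repeated header, already collected via get_all()
        values = msg.get_all(name)
        # Single-use fields keep only the first value here; a stricter
        # converter might treat a repeated single-use field as an error.
        result[key] = values if key in MULTIPLE_USE else values[0]
    return result


SAMPLE = (
    "Metadata-Version: 2.1\n"
    "Name: example\n"
    "Requires-Dist: a\n"
    "Requires-Dist: b\n"
    "Requires-Python: >=3.8\n"
)
converted = to_json_compatible(SAMPLE)
```

The underscore keys map directly onto Python attribute names, which matches the rationale guessed at above.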
Without digging through the Git history or issues when packaging.metadata came into existence, the big issues were multi-line entries (there's no standard way to parse that), field name case variability (having to normalize is annoying and a bug waiting to happen), field names that aren't a valid Python identifier name (so much mapping back and forth), and probably other pain that I've blocked in my memory
( @bleak ravine might have his own list).
And thanks for trying to tackle this!
technically the README doesn't have to be at the end
but in practice it almost always is
there are definitely things on PyPI where it's not at the end though
the lack of "advanced" data types is an issue as well. things like Project-URL are naturally a key->value mapping, but since the email format doesn't really have that, it had to make its own custom mini format for parsing
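A rough illustration of that mini format: each Project-URL value is a label and a URL separated by a comma. The helper below is a hypothetical sketch, not how packaging.metadata does it, and it lets the last duplicate label win, which a real tool may not want.

```python
def parse_project_urls(values: list[str]) -> dict[str, str]:
    """Hypothetical helper: turn 'Label, URL' strings into a mapping."""
    urls = {}
    for value in values:
        # Split on the first comma only, so commas inside the URL survive.
        label, sep, url = value.partition(",")
        if not sep:
            raise ValueError(f"no comma separator in {value!r}")
        urls[label.strip()] = url.strip()
    return urls


project_urls = parse_project_urls(
    [
        "Homepage, https://example.org",
        "Bug Tracker, https://example.org/issues?labels=a,b",
    ]
)
```

In a JSON file the same data would simply be an object, with no custom splitting step.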
one thing to be careful of is there is possibility of some sort of confused deputy attack. This is true with the email format as well, but packaging.metadata was, IIRC, written to avoid that
where you can repeat keys and if the same key is repeated it's implementation specific whether the first or last occurrence will be used
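A small sketch of the repeated-key problem in both formats, using only the standard library. The function names are hypothetical. The email module documents that which value you get for a repeated header is undefined, and `json.loads` keeps only the last pair for a repeated name unless told otherwise.

```python
import json
from email import message_from_string

RAW = "Name: good\nName: evil\n\n"
DOC = '{"name": "good", "name": "evil"}'


def duplicated_headers(msg, single_use_fields):
    """Return the single-use field names that occur more than once."""
    return [f for f in single_use_fields if len(msg.get_all(f, [])) > 1]


def reject_duplicates(pairs):
    """object_pairs_hook that refuses repeated keys within one JSON object."""
    keys = [key for key, _ in pairs]
    dupes = sorted({key for key in keys if keys.count(key) > 1})
    if dupes:
        raise ValueError(f"duplicate keys: {dupes}")
    return dict(pairs)


def strict_loads(text: str):
    """json.loads variant that raises instead of silently keeping one value."""
    return json.loads(text, object_pairs_hook=reject_duplicates)


msg = message_from_string(RAW)
repeated = duplicated_headers(msg, ["Name", "Version"])
lenient_name = json.loads(DOC)["name"]  # default: the last pair wins
```

Two tools that disagree on first-versus-last would see different package names for the same file, which is the confused-deputy risk described here.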
https://github.com/pypa/packaging/pull/574 and https://github.com/pypa/packaging/issues/570 are probably useful things to read
it's probably less of a concern with JSON, but there was a desire for the ability to do "lenient" parsing (e.g. parse what we can, and ignore the rest), which I don't think most JSON libraries allow, but also it's a lot harder/unlikely to produce structurally invalid json so 🤷‍♂️
both of those GH links have a lot of brain dumps in them, the first link in particular has information of what went wrong when I was trying to actually parse every METADATA file on PyPI
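One way "lenient" parsing could look for a JSON file is to separate fields that pass a check from those that don't, rather than failing outright. This is only a sketch: the `EXPECTED` table is a made-up stand-in for a real schema, and structurally invalid JSON would still raise from `json.loads`.

```python
import json

# Assumption: a tiny hand-written table of expected types, for illustration.
EXPECTED = {"name": str, "version": str, "keywords": list}


def parse_lenient(text: str):
    """Return (parsed, unparsed): keep what validates, set aside the rest."""
    parsed, unparsed = {}, {}
    for key, value in json.loads(text).items():
        expected = EXPECTED.get(key)
        if expected is not None and isinstance(value, expected):
            parsed[key] = value
        else:
            # Unknown field or wrong type: keep it for the caller to inspect.
            unparsed[key] = value
    return parsed, unparsed


parsed, unparsed = parse_lenient('{"name": "x", "version": 1, "extra": true}')
```

Returning the leftovers instead of discarding them lets a strict caller turn a non-empty `unparsed` into an error.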
oh yea, the Keyword metadata is conceptually a list, but for hysterical raisins instead of using a repeated Keyword field, it used a single Keywords field, and depending on what era of packaging and what tool you used, that field was either space delimited, comma delimited, or YOLO delimited
ah YOLO delimited, my favorite!
oh interesting, I didn't realize the field names weren't standardized!
that's mostly because I think most (or all?) build backends took the keywords input as a string and it was up to the user to format it correctly
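To show why the Keywords field is awkward, here is a deliberately naive splitter. The heuristic (commas if any are present, otherwise whitespace) is an assumption for illustration, not a standard, and it cannot tell a multi-word keyword from two space-delimited ones.

```python
def split_keywords(value: str) -> list[str]:
    """Heuristic sketch: prefer commas, fall back to whitespace."""
    if "," in value:
        parts = value.split(",")
    else:
        parts = value.split()
    return [part.strip() for part in parts if part.strip()]


comma_style = split_keywords("packaging, metadata,json")
space_style = split_keywords("packaging metadata  json")
ambiguous = split_keywords("machine learning")  # one keyword or two?
```

A JSON list removes the guessing, since the producer has to state the boundaries explicitly.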
Ah yeah
yeah I will probably need to specify this is an error to be safe here
field names in rfc822 are not case sensitive IIRC, but dictionary access is ofc, case sensitive
ohh I didn't realize that, definitely will go in the list :P
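The case-sensitivity mismatch is easy to reproduce with the standard library: lookups on the parsed message ignore case, but a plain dict built from it does not. The sample headers are made up.

```python
from email import message_from_string

msg = message_from_string("name: example\nVERSION: 1.0\n\n")

# Message lookups are case-insensitive.
by_message = (msg["Name"], msg["version"])

# Copying into a dict keeps the original spelling and loses that property.
as_dict = dict(msg.items())
found_in_dict = "Name" in as_dict

# Normalizing once, up front, is one way to avoid the mismatch.
normalized = {key.lower(): value for key, value in msg.items()}
```

Any tool that copies headers into a dict without normalizing will miss fields that another tool wrote with different casing.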
https://github.com/pypa/packaging/blob/main/src/packaging/metadata.py has a bunch of comments too explaining why different things exist
Heh, I forgot we commented that much, but it makes sense as otherwise who could remember why the heck some things are so wonky
tfw parsing a format is so weird that there's more comments than there is code
At one point I needed to parse an email like header file for something else and found it easier to patch packaging than I did for using email.parser myself.....
With the advent of pyproject.toml and pylock.toml, which already require builders to read toml files, what's the rationale for writing metadata with json and not toml?
It's a better format for exchanging data written by machines
But pylock.toml and other machine generated files are in toml for the human readability and editability, from the peps:
Package metadata is also not meant to be human editable, and only incidentally human readable (aka the vast majority of users will never look at it). There's also a canonicalization from the email format to json
I think you could make a reasonable argument for either toml or json, ultimately it'll probably be up to whoever writes a PEP (hey there Emma) which arguments they think make the most sense