Email Metadata pain points | PyPA | Page 1

lethal aspen Aug 22, 2025, 9:52 PM

#

So far I have:

- it is difficult to properly handle line endings and "unfolding" (many tools generate text and don't handle newlines correctly)
- it is unspecified which RFC is used; until a few years ago, encoding non-ASCII text was ambiguous
- generating proper metadata is subtly difficult, and requires configuring the email module properly
- difficult (impossible?) to properly validate given no schema
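
A minimal sketch of the "unfolding" point, assuming a made-up METADATA snippet with a folded License field. It only shows that the same text yields different values depending on which policy the standard library `email` module is given; it is not how any particular tool reads metadata.

```python
import email
import email.policy

# Hypothetical METADATA text with a folded (multi-line) License header.
RAW = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "License: First line\n"
    "        second line\n"
    "\n"
    "Long description body.\n"
)

# Default (compat32) policy: the line break of the folded header
# stays embedded in the returned value.
legacy = email.message_from_string(RAW)
legacy_value = legacy["License"]

# email.policy.default: CR/LF characters are removed when the value is fetched.
modern = email.message_from_string(RAW, policy=email.policy.default)
modern_value = str(modern["License"])

print(repr(legacy_value))
print(repr(modern_value))
```

A reader that forgets to pick a policy, or picks a different one than the writer assumed, gets a differently shaped string for the same file.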

livid ferry Aug 23, 2025, 12:23 AM

#

cc @hearty bough

languid shard Aug 23, 2025, 2:49 PM

#

My main concern would be with the concept of "replace". I think it's fine if packages eventually provide JSON-only metadata, but installer tools and the like SHOULD (but not MUST) continue to support the current metadata standard

lethal aspen Aug 23, 2025, 3:09 PM

#

I'm thinking the JSON metadata file would live alongside the email metadata file in a wheel minor version bump and probably indefinitely (or at least a very very long time) in sdists. Then in wheel 2.0 the email format metadata file could be dropped. Tools would still need to know how to read the email format to install wheel 1.0 packages.

hearty bough Aug 25, 2025, 8:33 AM

#

here's my list:

- the email format requires a parser separate from everything else. this is unlike json and toml, with their (often built-in) parsers that serialize and deserialize from and into classes/structs. we're only using the email headers here, but json and toml everywhere.
  - in uv, https://github.com/astral-sh/uv/blob/4c2fc5490d67e8e56ac3f263ca1bb4726539e6f3/crates/uv-pypi-types/src/metadata/metadata23.rs#L109-L181 and https://github.com/astral-sh/uv/blob/c5032aee80d5a666a45ec5e095b8c8abeffc11fb/crates/uv-pypi-types/src/metadata/metadata_resolver.rs#L90-L165 are our METADATA parsing code, while https://github.com/astral-sh/uv/blob/4c2fc5490d67e8e56ac3f263ca1bb4726539e6f3/crates/uv-pypi-types/src/metadata/metadata23.rs#L13-L14 is the json parsing code
  - in PEP 691, there's a media break going from a normal REST JSON API to fetching .metadata, which isn't JSON
- the formatting of multiline fields, or rather the multiline field license, is unclear. hatchling for example doesn't use the email module but rolls its own: https://github.com/pypa/hatch/blob/1867a9062ad3c8348857edf1ca8f945ca118e8b1/backend/src/hatchling/metadata/spec.py#L546-L555
- the one thing that's better about the header format is that the readme is at the end, so it's possible to parse the relevant metadata while ignoring the long readme text (while we still have to download it).
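
The last point can be sketched roughly like this, assuming the long description sits in the message body (after the first blank line) rather than in a Description header. `read_headers_only` is a hypothetical helper name, not something from uv or packaging.

```python
import email
import io


def read_headers_only(stream):
    """Consume lines up to the first blank line and parse them as headers."""
    lines = []
    for line in stream:
        if line.strip() == "":
            break  # the blank line separates the headers from the body
        lines.append(line)
    return email.message_from_string("".join(lines))


# Hypothetical METADATA content with the README as the body.
stream = io.StringIO(
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Version: 1.0\n"
    "\n"
    "# Very long README\n"
    "more text\n"
)
headers = read_headers_only(stream)
remaining = stream.read()  # the README was never handed to the parser
print(headers["Name"], len(remaining))
```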

#

i've written a prototype that adds METADATA.json and WHEEL.json in the uv build backend

#

https://github.com/astral-sh/uv/pull/15510

lethal aspen Aug 25, 2025, 2:32 PM

#

Oh wow thank you! This list is super helpful!

languid shard Aug 25, 2025, 3:54 PM

#

I will say that one issue with lots of JSON files for pip is the amount of memory that it ends up consuming.

This appears to be a CPython issue, and I'm not knowledgeable enough about memory consumption to deep dive into this. But I plan to post a reproducible example (without any pip code) on DPO soon.

lethal aspen Aug 25, 2025, 4:15 PM

#

Would be very interested in seeing that!

gleaming shale Aug 26, 2025, 3:16 PM

#

This reminded me of this old post from Armin Ronacher about the overheads of parsing JS source maps in Python (and Sentry adopting Rust to eliminate that overhead): https://blog.sentry.io/fixing-python-performance-with-rust/

I wouldn't expect the JSON overhead to be noticeably higher than the email module overhead, though.

#

The other key piece I'd mention is the key-value-header-to-json translation that was defined in metadata 2.1: https://peps.python.org/pep-0566/#json-compatible-metadata

We did miss some items when defining that, though (like keywords remaining a comma separated field instead of becoming an actual list).

lethal aspen Aug 26, 2025, 3:24 PM

#

Yeah I will definitely profile json vs email, I have got to think email will have more overhead than json
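
A rough starting point for that comparison could look like the following. The sample fields are invented and tiny, so the numbers only hint at relative parser overhead and say nothing about memory use or real-world METADATA files.

```python
import email
import json
import timeit

# Invented sample data: the same four fields in both formats.
FIELDS = {
    "metadata_version": "2.1",
    "name": "demo",
    "version": "1.0",
    "summary": "A demo package",
}
EMAIL_TEXT = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Version: 1.0\n"
    "Summary: A demo package\n"
    "\n"
)
JSON_TEXT = json.dumps(FIELDS)


def parse_email():
    return email.message_from_string(EMAIL_TEXT)


def parse_json():
    return json.loads(JSON_TEXT)


# Time 1000 parses of each format.
email_seconds = timeit.timeit(parse_email, number=1000)
json_seconds = timeit.timeit(parse_json, number=1000)
print(f"email: {email_seconds:.4f}s  json: {json_seconds:.4f}s")
```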

#

@gleaming shale PEP 566 says keywords become a list, perhaps that was added later? Or maybe I misunderstand what you mean?

#

I will also go through the current core metadata specification and make sure any newer metadata items will be covered

hearty bough Aug 26, 2025, 3:27 PM

#

for uv, the format is perf neutral, as the network latency is more significant than the parsing time here, and if we're reading from the cache we're using a binary format (rkyv)

lethal aspen Aug 26, 2025, 3:27 PM

#

Interesting! I didn't realize you cached in a different format

languid shard Aug 26, 2025, 3:28 PM

#

gleaming shale This reminded me of this old post from Armin Ronacher about the overheads of par...

When I monitor memory in pip for a large resolution and use an index with a JSON API instead of an HTML API, it uses hundreds of MB of memory. The issue appears to be reading the JSON into memory, then dropping it from memory and reading more into memory. I don't know enough about memory to understand why this happens, but lots and lots of JSON reads seem to cause excessive memory bloat.

Maybe the OS automatically cleans this up? Because I seem to be the only one that has noticed this issue. Anyway I'm working on a reproducible example.

hearty bough Aug 26, 2025, 3:30 PM

#

> All transformed keys should be reduced to lower case. Hyphens should be replaced with underscores, but otherwise should retain all other characters;

@gleaming shale is it intentional that it's `-` -> `_` and not the other way round, where PEP 691 uses hyphens?
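
For reference, the PEP 566 translation being discussed can be sketched as below. This is a simplified reading of the PEP, not a complete implementation: `MULTIPLE_USE` is an illustrative subset of the multiple-use fields, and repeated single-use fields are silently overwritten here.

```python
import email

# Illustrative subset only; the core metadata spec marks every multiple-use field.
MULTIPLE_USE = {"classifier", "requires_dist", "project_url", "provides_extra"}


def to_json_compatible(text):
    """Convert email-style metadata to a dict, loosely following PEP 566."""
    msg = email.message_from_string(text)
    result = {}
    for name, value in msg.items():
        # Keys are lowercased and hyphens become underscores.
        key = name.lower().replace("-", "_")
        if key in MULTIPLE_USE:
            result.setdefault(key, []).append(value)
        elif key == "keywords":
            # PEP 566 splits Keywords on commas.
            result[key] = value.split(",")
        else:
            result[key] = value
    body = msg.get_payload()
    if body:
        # The message body becomes the description.
        result["description"] = body
    return result


# Hypothetical METADATA text.
RAW = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Keywords: a,b\n"
    "Classifier: X\n"
    "Classifier: Y\n"
    "Requires-Dist: foo\n"
    "\n"
    "Readme text\n"
)
converted = to_json_compatible(RAW)
print(converted)
```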

gleaming shale Aug 26, 2025, 3:35 PM

#

lethal aspen Yeah I will definitely profile json vs email, I have got to think email will hav...

I checked the PEP history, and I think I'm just wrong. I'm pretty sure we made a mistake like that somewhere along the line, but apparently that wasn't it.

gleaming shale Aug 26, 2025, 3:41 PM

#

hearty bough > All transformed keys should be reduced to lower case. Hyphens should be replac...

For the key names? I don't really remember, but I assume it was just because using hyphens doesn't give valid Python identifiers, while underscores do.

Folks presumably weren't as concerned about supporting attribute based APIs when designing 691.

hearty bough Aug 26, 2025, 3:43 PM

#

the frameworks i worked with support mapping between hyphen json fields and underscore class fields, working around many rest apis using hyphens

gleaming shale Aug 26, 2025, 4:00 PM

#

Yeah, I don't think it's a big deal in practice. I just know that I have a preference to keep the two spellings the same, so it seems plausible to me that Dustin and/or Daniel might share that preference.

We'd have to either ask them or go diving into the distutils-sig archives to get a more definitive answer, since the PEP itself doesn't give a rationale for that bit.

tawny stump Aug 26, 2025, 4:46 PM

#

Without digging through the Git history or issues when packaging.metadata came into existence, the big issues were multi-line entries (there's no standard way to parse that), field name case variability (having to normalize is annoying and a bug waiting to happen), field names that aren't a valid Python identifier name (so much mapping back and forth), and probably other pain that I've blocked in my memory 😅 ( @bleak ravine might have his own list).

#

And thanks for trying to tackle this!

bleak ravine Aug 26, 2025, 4:48 PM

#

technically the README doesn't have to be at the end

#

but in practice it almost always is

#

there are definitely things on PyPI where it's not at the end though

#

the lack of "advanced" data types is an issue as well. things like Project-URL are naturally a key->value mapping, but since the email format doesn't really have that, it had to make its own custom mini format for parsing
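
To illustrate that mini format: each Project-URL value is a label and a URL separated by a comma. A minimal sketch, assuming well-formed values (`parse_project_urls` is a hypothetical helper and the URLs are placeholders; real parsers need to decide what to do with missing commas or repeated labels):

```python
def parse_project_urls(values):
    """Turn 'Label, URL' strings into a label -> URL mapping."""
    urls = {}
    for value in values:
        # Split on the first comma only, since URLs may contain commas.
        label, _, url = value.partition(",")
        urls[label.strip()] = url.strip()
    return urls


project_urls = parse_project_urls(
    [
        "Homepage, https://example.com/",
        "Bug Tracker, https://example.com/issues?labels=a,b",
    ]
)
print(project_urls)
```

In a JSON file the same data could simply be an object, with no custom splitting rule to agree on.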

#

one thing to be careful of is the possibility of some sort of confused deputy attack. This is true with the email format as well, but packaging.metadata was, IIRC, written to avoid that

#

where you can repeat keys and if the same key is repeated it's implementation specific whether the first or last occurrence will be used
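
A small sketch of the repeated-key problem in both formats, using only the standard library. `reject_duplicates` is a hypothetical hook name; it shows one way a JSON reader could treat duplicates as an error instead of picking a winner.

```python
import email
import json


def reject_duplicates(pairs):
    """object_pairs_hook that raises instead of silently keeping one value."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


# Email format: both values are kept, so each tool chooses which one it trusts.
msg = email.message_from_string("Name: safe\nName: evil\n\n")
all_names = msg.get_all("Name")

# JSON with the stdlib parser: the last occurrence silently wins.
DUPLICATED = '{"name": "safe", "name": "evil"}'
lenient = json.loads(DUPLICATED)

# Strict variant: duplicates become a hard error.
try:
    json.loads(DUPLICATED, object_pairs_hook=reject_duplicates)
    rejected = False
except ValueError:
    rejected = True

print(all_names, lenient, rejected)
```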

#

https://github.com/pypa/packaging/pull/574 and https://github.com/pypa/packaging/issues/570 are probably useful things to read

#

it's probably less of a concern with JSON, but there was a desire for the ability to do "lenient" parsing (e.g. parse what we can, and ignore the rest), which I don't think most JSON libraries allow, but also it's a lot harder/unlikely to produce structurally invalid json so 🤷‍♂️

#

both of those GH links have a lot of brain dumps in them, the first link in particular has information on what went wrong when I was trying to actually parse every METADATA file on PyPI

#

oh yea, the Keyword metadata is conceptually a list, but for hysterical raisins instead of using a repeated Keyword field, it used a single Keywords field, and depending on what era of packaging and what tool you used, that field was either space delimited, comma delimited, or YOLO delimited
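
Because the delimiter varies, any reader of old Keywords values ends up guessing. The heuristic below is purely an assumption for illustration (commas if any are present, otherwise whitespace) and is not taken from any existing tool:

```python
def split_keywords(value):
    """Guess the delimiter of a legacy Keywords value (heuristic only)."""
    if "," in value:
        parts = value.split(",")
    else:
        parts = value.split()
    # Drop surrounding whitespace and empty entries.
    return [part.strip() for part in parts if part.strip()]


print(split_keywords("packaging, metadata, email"))
print(split_keywords("packaging metadata email"))
```

A keyword that itself contains a space is ambiguous under the whitespace rule, which is exactly the kind of guess a real list type in JSON would remove.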

lethal aspen Aug 26, 2025, 5:24 PM

#

ah YOLO delimited, my favorite!

lethal aspen Aug 26, 2025, 5:26 PM

#

tawny stump Without digging through the Git history or issues when `packaging.metadata` came...

oh interesting, I didn't realize the field names weren't standardized!

bleak ravine Aug 26, 2025, 5:26 PM

#

that's mostly because I think most (or all?) build backends took the keywords input as a string and it was up to the user to format it correctly

lethal aspen Aug 26, 2025, 5:27 PM

#

Ah yeah

lethal aspen Aug 26, 2025, 5:27 PM

#

bleak ravine where you can repeat keys and if the same key is repeated it's implementation sp...

yeah I will probably need to specify this is an error to be safe here

bleak ravine Aug 26, 2025, 5:28 PM

#

lethal aspen oh interesting, I didn't realize the field names weren't standardized!

field names in rfc822 are not case sensitive IIRC, but dictionary access is ofc, case sensitive
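
A quick sketch of that mismatch with the standard library `email` module: lookups on the message object ignore case, but as soon as the headers are copied into a plain dict the original spelling matters. The oddly cased field names are made up for the example.

```python
import email

# Hypothetical metadata written by a tool that used unusual casing.
msg = email.message_from_string("NAME: demo\nversion: 1.0\n\n")

# Message lookups are case-insensitive.
via_message = msg["Name"]

# A plain dict keeps the original spelling, so "Name" is missing.
as_dict = dict(msg.items())

# Normalizing the keys up front avoids the mismatch.
normalized = {key.lower(): value for key, value in msg.items()}

print(via_message, "Name" in as_dict, normalized)
```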

lethal aspen Aug 26, 2025, 5:28 PM

#

ohh I didn't realize that, definitely will go in the list :P

bleak ravine Aug 26, 2025, 5:30 PM

#

https://github.com/pypa/packaging/blob/main/src/packaging/metadata.py has a bunch of comments too explaining why different things exist

tawny stump Aug 26, 2025, 5:31 PM

#

bleak ravine https://github.com/pypa/packaging/blob/main/src/packaging/metadata.py has a bunc...

Heh, I forgot we commented that much, but it makes sense as otherwise who could remember why the heck some things are so wonky

bleak ravine Aug 26, 2025, 5:32 PM

#

tfw parsing a format is so weird that there's more comments than there is code

grim cedar Sep 10, 2025, 3:15 PM

#

bleak ravine https://github.com/pypa/packaging/blob/main/src/packaging/metadata.py has a bunc...

At one point I needed to parse an email-like header file for something else and found it easier to patch packaging than to use email.parser myself.....

grim cedar Sep 10, 2025, 3:20 PM

#

lethal aspen So far I have: - it is difficult to properly handle line endings and "unfolding"...

With the advent of pyproject.toml and pylock.toml, which already require builders to read toml files, what's the rationale for writing metadata with json and not toml?

lethal aspen Sep 10, 2025, 3:51 PM

#

grim cedar With the advent of pyproject.toml and pylock.toml, which already require builder...

It's a better format for exchanging data written by machines

grim cedar Sep 10, 2025, 3:55 PM

#

But pylock.toml and other machine generated files are in toml for the human readability and editability, from the peps:

Python Enhancement Proposals (PEPs)

PEP 518 – Specifying Minimum Build System Requirements for Python...

This PEP specifies how Python software packages should specify what build dependencies they have in order to execute their chosen build system. As part of this specification, a new configuration file is introduced for software packages to use to specify...

lethal aspen Sep 10, 2025, 3:58 PM

#

Package metadata is also not meant to be human editable, and only incidentally human readable (aka the vast majority of users will never look at it). There's also a canonicalization from the email format to json

bleak ravine Sep 10, 2025, 4:59 PM

#

I think you could make a reasonable argument for either toml or json, ultimately it'll probably be up to whoever writes a PEP (hey there Emma 😄 ) which arguments they think make the most sense
