#pip
Hahaha, oops, the new spinners from nested builds are leaking out on my inprocess-build-deps branch.
It's actually pretty funny to watch
--use-feature=inprocess-build-deps is also quite a bit faster. 20s -> 12s. That's good to see!
Yeah, doesn't hatchling have some insane nested build bootstrapping process? I imagine that would be significantly faster.
(are there any known specific cases where a separate process would be needed for correctness?)
(or clear, concrete theoretical reasons?)
I'm not aware of any scenarios where a separate process is required for correctness, although the implicit isolation gained by having a separate process definitely makes it easy to avoid certain bugs.
For example, pip has quite a few in-memory caches. The resolver is a good example. If the resolver was shared across the main install and the build environment install, there would likely be buggy behaviour/"interesting" solves.
ah, I imagine it isn't designed to instantiate separate caches...
Well, for basically all of pip's existence, it was fair to assume that the install logic was linear and wouldn't be called more than once (not literally, but in one shot for the same environment).
In general, the codebase is structured well enough to support concurrent installs (I mean, keeping some separation is necessary for unit testing), but you do have to be careful around caches (which means generally not sharing them at all).
multithreaded pip on the free-threaded build will be “fun”
could make the caches thread-local I guess, no idea if that has any performance impact
also no idea if that will ever be a goal for you all
I expect it would have some penalties since the caches are in the hot loops (for things like requirements and tags, which definitely take longer to parse and validate than a dict lookup on a str to get an existing object).
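To make that trade-off concrete, here is a minimal sketch of this kind of cache. It is not pip's actual code: `parse_requirement` is a hypothetical toy parser, and `functools.lru_cache` stands in for the "dict lookup on a str" described above. A repeated string returns the already-built object instead of being parsed and validated again.

```python
import functools
import re

# Toy pattern: a project name followed by an optional version specifier.
_REQ_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


@functools.lru_cache(maxsize=None)
def parse_requirement(text: str) -> tuple[str, str]:
    """Hypothetical stand-in for an expensive parse-and-validate step."""
    match = _REQ_RE.match(text)
    if match is None:
        raise ValueError(f"invalid requirement: {text!r}")
    name, specifier = match.groups()
    # Normalise the name so equivalent spellings compare equal.
    return name.lower().replace("_", "-"), specifier.strip()


first = parse_requirement("Requests >=2.0")
second = parse_requirement("Requests >=2.0")  # served from the cache
```

A thread-local variant would need one such cache per thread, so every thread would pay the parsing cost at least once per distinct string.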
We don't use multithreading directly. The only threading pip does is through Rich which uses threads to manage live updates.
i'm imagining a future where you might want to use threads to speed up CPU-bound tasks
Yea, gonna say that's probably not going to happen any time soon.
The main thing that would be interesting for that is actually the resolver and... yea, I don't think we have anyone interested in doing parallelization of that.
I did publish a PR to parallelize .pyc compilation but that got bogged down in a discussion about whether it even makes sense to continue compiling .pyc files in the first place
(which fair enough, no hard feelings there!)
Last time I'd benchmarked it, pip was mostly I/O bound except for when handling wheel unpacking + all the metadata parsing. It's been a while tho, and I know more about software benchmarks/profiling now than I did back then lol.
remembers the time someone broke a parallel filesystem on a supercomputer via the import system. Lots and lots of nodes simultaneously trying to read and write pyc files…
and .pyc compilation. That's also a major bottleneck, esp. on an SSD where the I/O penalty is not as bad.
As long as Rich is free-threading safe, pip will be fine for the time being.
I was focused on the resolve side, hence why I missed all the pyc thingies.
ah, sorry
does wheel unpacking benefit from multiprocessing somehow? (does the standard library make that feasible?)
see, I refuse to touch the resolver with a 10ft pole.
that thing is a magical black box that I still haven't learned about
Thou ist smarter than I.
I think someone did an experiment with FT for the install step, and yes, there seems to be some benefit (although most of the benefit likely came from the .pyc compilation being parallelized)
It could. The tricky bit is dealing with the race condition problem of multiple wheels providing the same file.
FT?
(I haven't benchmarked anything)
Free threading
ah
anyway, my current priority (after managing the 25.2 release) is to get inprocess-build-deps landed.
I still have a few more hours of work left, mostly related to writing more tests, but it is close to being ready for review.
well if you're grabbing multiple things and parallelizing at the level of entire wheels, I can certainly imagine it
also I would have thought that if multiple wheels are providing the same file you are going to have a bad time no matter what
It would be very nice if there was an accelerated version of the email parser, it would legitimately speed up so many pip operations. Alas, that's just a pipe dream
from my testing, the email parser just does a lot of stuff that's irrelevant to how the packaging formats use it
Huh, I didn't think of that.
That's actually a good idea.
how so?
Yea, I have a wish to see JSON serialisation of metadata one day.
(maybe switching to JSON would make everything faster since CPython has a C implementation of json IIRC?)
yes and yes
for one thing, it sets up a "streaming" interface with a small page size and adds complexity around chunking the file, when most metadata files are fairly short anyway
I am thinking I will re-target PEP 777 to focus just on better compression and JSON metadata
and then after it's parsed all the fields, it does a bunch of validation that's based on expecting certain fields to have specific meaning in the context of an actual email
you forget that it adds readme of the project and A LOT of projects have pretty long readmes
where "pretty big" is still tiny compared to the rest of the wheel, though?
time for packaging-tailored email parser package?
I mean, there's also a world where someone contributes faster email parsing in the stdlib.
I mean, I would call it a parser for the specific format used by python packaging core metadata, which simply happens to be based on old email standards
I did imagine that I might do something like that as part of PAPER, though.
(looks at his calendar to see if he can slot an entire "write an email parser in C" project in there)
honestly, I wonder how many people are actually using the email standard library outside of packaging
why do I feel like I now know what I will be doing this weekend lol
boromir wants to have a word about simply writing something in C
I honestly think it would be better if we simply added a METADATA.json file as an acceptable entry into dist-info. More people can upgrade their installer and take advantage of a fast JSON parser than their Python and take advantage of a fast email parser
(I mean, outside of making the installer release that was due 6 months ago)
I can contribute all of my hopes and wishes and motivational talk
who writes C these days? Rust or Zig my friend
been like a year since I wrote any C
Idk which one has better timelines.
Not in CPython.
oh, you wanted to add that to the stdlib... I thought you meant a 3rd party package
pip would be a very different project if we could have compiled code in it/its dependencies.
but also just as a quick check, the total source code in the email package is something like 8 times as much as configparser.py, and I don't think that what packaging does with those files is that much more complex than an ordinary config file
I would expect email to be way more code than configparser, the email message format has a lot of hidden complexity compared to ini, and a lot more considerations around encodings and such
yeah, I guess if we were to "just" take stuff that is needed for METADATA parsing, that would be pretty small and (relatively) fast package
but then we have to go through months of figuring out a transition plan and years of the transition actually occurring...
david hewitt gave a talk at the language summit about adding rust code to CPython and sadly it's a nonstarter it seems like
the email message format does have that hidden complexity. Core metadata, however, isn't expected to and IMO shouldn't exploit that.
we could just support email/json based metadata forever ...
at least until rust is better supported on obscure targets
Outside of the actual spec change/adoption story..
The transition would happen package by package, rather than by Python version.
yes, it would.
that would be on the build backends IIUC
- release of the packages that use them
Honestly though. As long as Python keeps its email package and the old email-based metadata doesn't get updated after the json version is the default, keeping a legacy email based metadata backend wouldn't be that bad.
but /me gestures wildly at all the packages that are still not using pyproject.toml for anything sensible, or are using setup.py for pure Python projects without a real build step
Not at all! If we put a METADATA.json alongside a METADATA file, it can be entirely opt-in and eagerly used. I would argue this is one of the few cases where it makes sense to have a minor version bump to the wheel version.
my experience has been that people often don't use new things unless they're forced to, and grumble about that force.
(FWIW, that's what I had in mind, and I'm realising that maybe not everyone else)
but yes, there's a clear enough way to do it and the opt-in shouldn't cause real friction for anyone. I'm nevertheless pessimistic about the process of getting a PEP through
in fact I, too, assumed pretty much that was what you had in mind
There will be enough transition with the new packaging governance :P
but for example, people are going to argue about what should happen if both files are present and they disagree.
if JSON file present use JSON, simple
^ I would also say the PEP should require the contents between them to be equivalent
even if that isn't practically enforceable
well, an installer could verify it by just parsing both files and raising an error if they differ. But now the experience is slower and less reliable for everyone.
But I don't want to turn this into a PEP discussion channel
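The "use JSON if present, optionally verify against the email file" idea can be sketched roughly as follows. This is an illustration under loud assumptions: `METADATA.json` is not part of any accepted spec, the JSON layout used here (header name mapped to a list of values) is invented for the example, and `read_metadata` is a hypothetical helper, not something pip provides.

```python
import json
import tempfile
from email import message_from_string
from pathlib import Path


def _read_email_metadata(path: Path) -> dict[str, list[str]]:
    """Parse the existing email-style METADATA file into {header: [values]}."""
    message = message_from_string(path.read_text(encoding="utf-8"))
    fields: dict[str, list[str]] = {}
    for key, value in message.items():
        fields.setdefault(key, []).append(value)
    return fields


def read_metadata(dist_info: Path, verify: bool = False) -> dict[str, list[str]]:
    """Prefer the (hypothetical) METADATA.json and fall back to METADATA."""
    json_path = dist_info / "METADATA.json"  # assumption: not a real spec'd file
    email_path = dist_info / "METADATA"
    if not json_path.exists():
        return _read_email_metadata(email_path)
    fields = json.loads(json_path.read_text(encoding="utf-8"))
    # Optional cross-check: this costs a second, slower parse.
    if verify and fields != _read_email_metadata(email_path):
        raise ValueError("METADATA.json and METADATA disagree")
    return fields


with tempfile.TemporaryDirectory() as tmp:
    dist_info = Path(tmp)
    (dist_info / "METADATA").write_text(
        "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n", encoding="utf-8"
    )
    email_only = read_metadata(dist_info)  # no JSON file yet, so email is used

    (dist_info / "METADATA.json").write_text(
        json.dumps({"Metadata-Version": ["2.1"], "Name": ["demo"], "Version": ["2.0"]}),
        encoding="utf-8",
    )
    json_preferred = read_metadata(dist_info)  # JSON wins when present
    try:
        read_metadata(dist_info, verify=True)  # the two files disagree on Version
    except ValueError:
        mismatch_detected = True
    else:
        mismatch_detected = False
```

With `verify=False` the email parser is never touched once the JSON file exists, which is where the speed-up would come from. With `verify=True` both files are parsed every time, which is the slower path mentioned above.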
I really should move this task up my TODO list enough to pop it off the stack...
good luck, and thanks for considering the issue at all
uv doesn't parallelize the resolver, so pip definitely doesn't need to
uv does parallelize pyc compilation and it does create small reproducibility issues
(is there a good reason, in uv, that the compiled bytecode can't also be cached and hard-linked like the source files? All I can think of is that it'd have the paths to the cache versions of the files, used in tracebacks, but I'm not really seeing that as a problem)
This comes up with docker images
Congrats on the 25.2 release by the way, even if you don't think it looks like anything groundbreaking compared to the previous one
If it makes you feel any better, pip 25.3 is (in theory) going to be a massive release with numerous deprecations being finalized, not to mention other significant changes.
Currently scheduled for 25.3 includes removing the setup.py bdist_wheel and setup.py develop mechanisms, plus two smaller deprecations.
But thank you! it's appreciated 
(actually, I had been wondering if it makes sense to expect alternate small and big releases going forward...)
I do recall the notes about 25.3 plans on the 25.1 release.
I don't think so, I think big deprecation releases are more about did some particularly invested maintainer have some free time 6 to 9 months ago
And similar for features, I have two to three mid size features I would like to land in pip, but I've had no time these past few months
i've asked the cpython upstream about that and they said that bytecode compilation is non-deterministic, so they say this is not a bug
Yeah I followed along, it's certainly an interesting problem
btw if y'all have questions about parallelism in uv, hmu
the AI world has arrived at pip
@finite perch hmm, I don't like the benchmark killerdog proposed:
class MockDistribution:
    ...

    def iter_dependencies(
        self, extras: list[str] | None = None
    ) -> Iterator[PackagingRequirement]:
        """Simulate expensive dependency parsing operation."""
        # Simulate some processing time for parsing dependencies
        time.sleep(0.001)  # 1ms per call to simulate parsing overhead
        return iter(self._dependencies)

    def iter_provided_extras(self) -> Iterator[str]:
        """Simulate expensive extras parsing operation."""
        # Simulate some processing time for parsing extras
        time.sleep(0.0005)  # 0.5ms per call to simulate parsing overhead
        return iter(self._extras)
They're using time.sleep instead of using actual distributions with actual dependencies.
I wouldn't be surprised if dependency parsing did indeed take a non trivial amount of time, but I would like to see actual scenarios. You're welcome to review it, but I'm ambivalent about spending my time on the PR since it seems to be heavily AI-assisted. The work is probably fine, but there is a distinct lack of "wait, is this change even warranted?" IMO
I will do a proper review, their benchmark basically just shows caching is faster than calculating, which is not by itself interesting
I agree that it is likely AI assisted, but it does look about correct, and I know some real world scenarios to see if it really does help
Yeah, it's really a matter of patience. I do not have the patience to review work with this level of AI assistance because I don't want to discuss/argue (the technical kind, not the social kind) with an AI, even indirectly.
I don't know if this comment is a reflection of my lack of patience or my best attempt at showing patience. I hope I worded it in a reasonably nice way. https://github.com/pypa/pip/pull/13518#issuecomment-3146854654
yeah a priori someone who submits a benchmark of a simulation seems not in control of the PR
if the goal is to demonstrate something about how often cacheable work is repeated, it should instead just... count the times that code is hit
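A rough sketch of what "count the times that code is hit" could look like, as opposed to timing a simulated sleep. Everything here is hypothetical: `count_calls`, the toy `iter_dependencies`, and the tiny dependency graph are made up for illustration and are not taken from the PR or from pip.

```python
import functools
from collections import Counter

# Maps (function name, arguments) to the number of calls seen.
call_counts: Counter = Counter()


def count_calls(func):
    """Record how often func is called with each set of positional arguments."""
    @functools.wraps(func)
    def wrapper(*args):
        call_counts[(func.__name__, args)] += 1
        return func(*args)
    return wrapper


# Toy dependency graph: "c" is reachable both from "a" and from "b".
_GRAPH = {"a": ["b", "c"], "b": ["c"]}


@count_calls
def iter_dependencies(name: str) -> list[str]:
    return _GRAPH.get(name, [])


def walk(name: str) -> None:
    """Naive traversal that re-asks for the same dependencies."""
    for dep in iter_dependencies(name):
        walk(dep)


walk("a")
# Only calls that happened more than once are candidates for caching.
repeated = {key: n for key, n in call_counts.items() if n > 1}
```

Run against a real resolve instead of a toy graph, the `repeated` mapping would show whether the same work is actually being redone often enough to justify a cache.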
I just read this now, it comes off as very patient to me
It has empathy, evidence, and reasoning for what the issues are.
I'm hopeful some people will read that and learn from it, even if we never directly see the fruits of that learning.
I was dealing with an OSS maintainer just recently who was not willing to engage at all in discussion of a feature. I don't blame them but they have to accept that's why they get a high % of frustrated comments on their repo.
(I'm a little scared that the AI can generate valid prior issue numbers, even if they're nonsense in the current context)
(also, the reply you got is textbook AI, right down to the exact tone of contrition)
in the future I'd recommend cutting this sort of thing off sooner. @shy echo absolutely made the right call there. AI users who are willing to switch away (or think they have a specific, reasonable justification for using it) will change their tone (very literally, since their natural writing style might be anything, but is almost certainly radically different) in the first reply. This one doubled down, and IMO became more obvious in using AI as the thread progressed.
I'm likely going to reuse my long reply as a saved reply. I'll tweak it to be more compact so it can be used whenever I need to close issues/PRs for being useless AI garbage.
I was hoping that they'd clean up their act, but they didn't, alas.
Lesson learned, I suppose. It's a shame that is the takeaway from this, but my (previously unreasoned) hatred for AI was right. ¯\_(ツ)_/¯
At the risk of turning this into the #bashing-ai-and-maybe-pip-discussion channel, man, it is actually so frustrating to read this: https://github.com/mlpack/mlpack/pull/3979#issuecomment-3148780562
This is not how anyone talks.
FYI, I'm going to take a break from pip (or at least typical maintainer duties) for a little bit. I'll be around to manage the 25.2 release cycle, but I do think I'm getting a bit close to burning out on pip, so I'll step away now to avoid hitting that valley.
(I know that I can do this at any time, but saying it out loud makes it easier to actually follow through on taking a break, haha)
Makes sense, take a break, no expectations!
(FWIW, I do the "To clarify," thing personally. It's also a little strange and un-AI-like for "Time" to be capitalized at the end. But that could have been something manually written in after the AI content, especially since it's so redundant. ... I should probably not devote more thinking time to this.)
error: unsupported-wheel
× Wheel black-25.1.0-cp312-cp312-manylinux2014_arm64.whl is unsupported on this platform
╰─> You're on Linux x86_64, but wheel requires Linux arm64
I'm using my break from maintainership to work on error messages (which is always good fun). Currently working on improving unsupported wheel errors.
The tricky part is handling wheels that support several tags, which is quite common with mac/linux wheels.
I'm pretending that they don't exist for the time being, but it would be good to do better.
Trying to figure out a fix for a pip-tools bug (ref), I come to a question about pip.
How is pip handling -r and -c when the input is some URI-looking thing? It seems like http/https might be the only supported schemes, and everything else gets treated as a filesystem path?
Oh, I went down a bad path while spelunking! Just doubled back and tried tracing it out again, found it in _internal/req/req_file.py
SCHEME_RE = re.compile(r"^(http|https|file):", re.I)
tells me exactly what I wanted to know.
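For illustration, here is how that pattern splits inputs into "URL-like" and "everything else". The regex is copied from the line quoted above; `looks_like_url` is a hypothetical helper name for this sketch, not pip's function.

```python
import re

# Pattern as quoted from pip's _internal/req/req_file.py.
SCHEME_RE = re.compile(r"^(http|https|file):", re.I)


def looks_like_url(filename: str) -> bool:
    """Hypothetical helper: True if the input starts with a recognised scheme."""
    return SCHEME_RE.search(filename) is not None


examples = {
    "https://example.com/requirements.txt": looks_like_url("https://example.com/requirements.txt"),
    "FILE:///tmp/requirements.txt": looks_like_url("FILE:///tmp/requirements.txt"),
    "s3://bucket/requirements.txt": looks_like_url("s3://bucket/requirements.txt"),
    "requirements.txt": looks_like_url("requirements.txt"),
}
```

Only `http`, `https`, and `file` match (case-insensitively), so something like an `s3://` URI does not match the pattern and would be handled like a plain path.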
https://github.com/pypa/pip/blob/20b39ec104b94181c01731192d03cbe0d1f9f3a7/src/pip/_internal/req/req_file.py#L564-L591 this is the relevant function FYI
The fact that it's being checked in two different contexts in two different ways sort of feels like maybe there's some datastructure which could keep track of this info. But it's not obvious that there's anything worth adding. And I've now got confidence to go make pip-tools ape internal pip details more.
The answer is almost certainly: pip is a 15+ year old project, the codebase is a mess
I don't believe that critical global infrastructure maintained by a team of unpaid volunteers could be a mess! You can't fool me! There's some deep logic behind using a regex and url splitting for this, I know it.
Although it feels instinctively messy, there's no easy win I see here. Maybe some abstraction which gets attached to PipSession like session.path_analysis[filename] could work, but I'd have to actually try it to see if it just makes things too opaque
Hmmm, I accidentally created the branch for https://github.com/pypa/pip/pull/13520 on https://github.com/pypa/pip instead of https://github.com/notatallshaw/pip, not sure what to do now..
What's the problem with that?
I mean, I guess it's fine, the branch will be deleted once the pull request is closed
mhmm
Just was unintentional
Dependabot creates new branches on our repository and then deletes 'em. All is fine.
It's only if external users start to depend on our branches that we'd have problems.
@hidden flame I've been thinking about installing build dependencies in-process & build/constraints. And I think I've got a good way forward. I will independently create a new "use-feature" called "build-constraint" that works in the new way: constraint only affects runtime dependencies, build-constraint will affect all build processes.
Then inprocess-build-deps will match this same behaviour (which I think will be much simpler for inprocess-build-deps than trying to match the existing behaviour), so in effect inprocess-build-deps will imply build-constraint
This will allow build-constraint to be accepted on its merits and not be smuggled in via inprocess-build-deps, and will allow for an appropriate deprecation period for the existing behaviour.
I'm speaking my mind 100% here, but didn't we reject such a design for --resume-retries? I proposed that we use --use-feature to roll it out which was rejected for being too complicated. Looking back, I agree with that decision
resume-retries didn't break existing workflows, changing the behaviour of constraints will
I've written out a backwards compatible way to implement --build-constraint here but it's non trivial: https://github.com/pypa/pip/issues/13300#issuecomment-2787887526
And I think it would be tricky to implement the existing behaviour of PIP_CONSTRAINT \ constraint with inprocess-build-deps? But if you plan to preserve compatibility for all uses of constraints with inprocess-build-deps then I can go back to that table.
It would be easy to get inprocess build deps to support PIP_CONSTRAINTS, but it would mean --constraints is also supported.
I could probably hack the CLI code to disable the forwarding if given as a flag, but I would 100% want that to be temporary as that's an awful hack.
That breaks the existing behaviour that you pass runtime constraints with --constraint and build time constraints with PIP_CONSTRAINT
Yup.
Our options are:
- Add inprocess-build-deps w/o constraint support at all and wait until --build-constraints is added
- Add inprocess-build-deps w/ backwards incompatible constraint support (and then I guess go through a deprecation cycle once --build-constraints is available)
- Add inprocess-build-deps w/ equivalent constraint support so it doesn't affect the constraints deprecation cycle at all
Lemme see how easy it would be to hack in support for detecting where constraints come from.
It's probably not too bad. Envvars are supported by setting their values as optparse defaults. I could track the value of PIP_CONSTRAINTS somewhere and then see whether the final constraints value matches during command initialization. It wouldn't be perfect as you could pass the same constraints file via the flag and envvar at the same time and get a false positive, but that seems fine?
I'd rather you not have to do any of that and just use a new --build-constraint but I don't know if the timing will work
We'll see what needs to be done once my PR leaves draft status.
Okay, I have a local branch that implements --build-constraint as I described above, I need to do a bit of clean up but if I get some time tomorrow I should have a PR out
Ended up being not as complex as I feared: https://github.com/pypa/pip/pull/13534
patch_check_externally_managed() in test_pep668.py doesn't seem to be working as expected (the externally managed error is not being raised) when using 3.12. Think I've got a fix for it, just waiting for the tests to run through.
Maybe I was missing some nuance here, as I sunk more time into it than I was expecting to. I'm fairly new to pytest, as work just uses unittest. Will take a better look later; ended up calling it for the moment earlier.
Thanks for looking, if you make a PR I'd be happy to try and review
I'll be back from vacation starting tomorrow, although I will have limited availability.
update @finite perch was actually related to an issue you raised back in 2023, (https://github.com/pypa/pip/issues/12329)
it appears I am blind and missed the note pointing this out on the docs here -> https://pip.pypa.io/en/latest/development/getting-started/#running-tests
how to force pip use UTF-8?
This might be helpful: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUTF8
thx
No, it's not exactly easy to spot
"use" in what sense?
the data formats all mandate utf-8 in all the places where a text encoding would be relevant as far as I can think; and Pip's own output is plain ASCII except where it interpolates package-dependent stuff as far as I can think. So that leaves... file names?
setup.py is not required to be UTF-8 on all platforms IIRC
(or generally any python code executed as part of the build backend execution)
ah, true
but idk why you'd want to force the encoding there; if it doesn't declare one then Python makes it utf-8 in 3.x and if it does then overriding that would probably break it
@finite perch are you using 👀 as an indication that you've read my review comment?
I'm a little confused with all of the 👀 you're leaving
Oh, ahaha, I use that in my dev workflow to mean "looking at"
Is that not what that emoji generally means?
FYI, I'm not going to be near a computer till at least Monday (I meant to grab my laptop so I could do a little work this weekend while I ride trains, and I completely forgot), so I was using this emoji as a place marker for me that meant "probably correct but I need to look at it when I have access to a computer and code again"
I think in the GH context, it usually indicates surprise or interest.
Interest is similar to looking at, but there's an implication that the thing of interest is notable.
Oh, whoops...
It makes sense, it's just not an interpretation I'd ever encountered on GH. Languages are hard :)
π I definitely don't hate you at all π
Possibly the worst emoji for having one universal meaning.
FWIW, I have no issue with it. If it helps you, I don't mind it.
this is btw also a widespread interpretation that people use for this one
issue is that GitHub only has a handful of emoji, so people overload them
@finite perch if you have any PRs you very much want me to look at, let me know
I have limited availability on an ongoing basis, but I have some time tomorrow if you have anything that's high priority.
I would look through my github notifications, but it's already a mess. I'm fine with being told what I need to look at :)
@hidden flame thanks for letting me know but don't worry about it, no pressure from me, you get on with whatever life stuff you need to get on with.
My build constraints and "uploaded prior to" PRs are ready but I wasn't expecting them to be approved soon. I have more PRs down the pipeline, but nothing that would be ready by tomorrow.
Actually thinking about it a very quick look at this one would be helpful: https://github.com/pypa/pip/pull/13549
And there may be a few other small PRs, if you get to them, but otherwise I will try and make some time before 25.3
Technical details question: I just noticed that pip writes RECORD files with CRLF line endings on Linux, rather than with LF line endings. Is there any particular reason for that, or just an accident of history?
It's likely due to how the csv module works.
I don't see how it would though, pip specifies the arguments as suggested by cpython doc afaict
It's pretty easy to reproduce:
import csv

with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b'])
This is because csv defaults to the "excel" dialect and this models Excel which always writes CRLF
I think it would be fine to specify the line terminator as \n in the writer constructor
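A small sketch of the difference, using an in-memory buffer so the exact bytes are easy to inspect. The `write_rows` helper and the sample row (including the placeholder hash value) are made up for this example; `lineterminator` is a standard `csv.writer` formatting parameter.

```python
import csv
import io


def write_rows(rows, **writer_kwargs) -> str:
    """Write rows with csv.writer and return the raw text that was produced."""
    # newline="" mirrors open(..., newline=""): no newline translation at all.
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, **writer_kwargs)
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative RECORD-style row; the hash is a placeholder, not a real digest.
rows = [("pkg/__init__.py", "sha256=abc", "12")]

default_output = write_rows(rows)                   # "excel" dialect default
lf_output = write_rows(rows, lineterminator="\n")   # explicit LF terminator
```

The default output ends each row with `\r\n` on every platform, while passing `lineterminator="\n"` yields plain LF endings.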
Regular reminder that almost every language existed before CSV was “standardized”, so nobody can rely on anything and using CSV will destroy data.
For RECORD, it's not that bad because at this point we have a lot of data and tests, but going forward I strongly recommend not using CSV for anything.
I've seen so many PhD students despair because they found out months after they started using a dataset that they (or someone before them) irrevocably destroyed some of their data, just because of CSV.
it honestly somehow hadn't occurred to me that the RECORD format is CSV o_O
you know, just a bunch of lines with values that are separated by commas, nothing that merits a special name or anything clearly
The exact format doesn't come up very often, but I don't recall Tarek getting much pushback when proposing CSV, since it's a pretty natural fit for recording the size and hash of a whole pile of installed files (the PEP specifies the field delimiter, the quote character, and the use of universal newlines when reading, so the most egregious causes of CSV incompatibility were addressed up front).
It's awful when the field inclusions are more variable though, hence the use of other formats everywhere else.
I think if we were designing RECORD today we probably wouldn't use csv, but not for any reason other than it's kinda nice to minimize the number of different file formats you have to deal with
Reminds me that PEP 262 actually proposed the format to be tab delimited instead of comma. Things could be even weirder if that somehow stuck. https://peps.python.org/pep-0262/#files-section
And now I'm reading it, PEP 376 actually said RECORD should use os.linesep so technically pip is not implementing it correctly on non-Windows platforms.
IIRC I also noticed that several tools did not support paths with backslashes even though absolute paths with "the local platform separator" are listed as valid in PEP 376
... but that would make none-any wheels impossible, surely? since the RECORD file can't be formatted both ways simultaneously
It would mean packaging tools would need to handle both cases regardless of platform
It's irrelevant anyway, as the current spec does not say this.
This is why pylock uses toml, I presume? Because everything already needs a toml parser anyhow to parse pyproject.toml?
PEP 751 briefly went over the file formats: https://peps.python.org/pep-0751/#id1
The more in-depth review of file formats happened in PEP 518 which introduced pyproject.toml https://peps.python.org/pep-0518/#other-file-formats
The format of the file is TOML.
Tools SHOULD write their lock files in a consistent way to minimize noise in diff output. Keys in tables – including the top-level table – SHOULD be recorded in a consistent order (if inspiration is desired, this PEP has tried to write down keys in a logical order). As well, tools SHOULD sort arrays in consistent order. Usage of inline tables SHOULD also be kept consistent.
Fair, but pylock doesn't need to be human editable or readable, does it?
The rationale mentions human readability as an important goal
After all, it's a file that will be reviewed by humans in PRs
it needs to be human readable:
https://peps.python.org/pep-0751/#rationale
The file format proposed by this PEP is designed to be human-readable. This is so that the contents of the file can be audited by a human to make sure no undesired dependencies end up being included in the lock file.
I know several parts of pip are separated into tiny projects: is there a project that implements this in order to respect pip configuration for another project that downloads wheels/sdists? https://pip.pypa.io/en/stable/topics/configuration/#location
You talking about how pip vendors other projects that implement parts of the logic? If so, no, the configuration is internal to pip and doesn't use a vendored project
Ah, thank you
other installers would have quite different needs. Many of pip's dependencies don't really relate to the task of actually obtaining or installing a package; e.g. rich used for pretty terminal output
I ended up finding that someone published a library anyhow
ah, I think I misunderstood what you were hoping to accomplish, then.
to be more precise: several parts of pip were copied to different projects (installer, build and others), but pip itself remains with its own implementations
Doesn't pip rely on build/installer now?
Ah no, it uses resolvelib, dependency-groups, pyproject-hooks, and distlib though
All libraries I've looked into using myself for my own tool
especially since pip often has some special cases (for historical reasons) that are non-standard now
I think resolvelib was made with pip in mind and distlib predates most modern packaging libs
TIL there is a dependency-groups lib
Yeah it's dead simple and I think people most likely use it because of include-group subtable processing for dependency-groups
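For anyone who, like me, hadn't run into include-group before: the processing is essentially a recursive expansion of a `[dependency-groups]` table. The sketch below is a hand-written illustration of that idea, not the dependency-groups library's API; `expand_group` is a hypothetical name, and it skips details such as group-name normalisation.

```python
def expand_group(groups: dict, name: str, _seen: tuple = ()) -> list[str]:
    """Flatten one dependency group, following {"include-group": ...} entries."""
    if name in _seen:
        cycle = " -> ".join(_seen + (name,))
        raise ValueError(f"cyclic include-group: {cycle}")
    if name not in groups:
        raise LookupError(f"unknown dependency group: {name!r}")
    requirements: list[str] = []
    for item in groups[name]:
        if isinstance(item, str):
            # Plain requirement string: keep as-is.
            requirements.append(item)
        elif isinstance(item, dict) and set(item) == {"include-group"}:
            # Pull in another group's contents at this position.
            requirements.extend(
                expand_group(groups, item["include-group"], _seen + (name,))
            )
        else:
            raise ValueError(f"invalid entry in group {name!r}: {item!r}")
    return requirements


# Shape of a parsed [dependency-groups] table (example data, made up).
groups = {
    "test": ["pytest"],
    "docs": ["sphinx"],
    "dev": [{"include-group": "test"}, {"include-group": "docs"}, "ruff"],
}
dev_requirements = expand_group(groups, "dev")
```

The `_seen` tuple tracks the chain of groups currently being expanded, so a group that includes itself (directly or indirectly) raises an error instead of recursing forever.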
Heck I didn't even know about that until recently
It's interesting to see pip rely on both distlib and packaging
Yeah, distlib does the whole entry point script stuff, packaging doesn't do that sort of thing
And what does packaging offer that distlib doesn't?
Thanks
No idea, I never looked at the history of why pip vendored what for what purpose. I only learn this stuff when I hit issues.
distlib is indeed older afaik. I think resolvelib is one of the factored-out parts.
kinda but not really. resolvelib was made from scratch and was slowly brought into pip, with the old resolver still available as an alternative
distlib is more like a rewrite that (mostly) never caught on
Everything newly introduced into pip after a certain point was done in a new lib and vendored into pip, but things that existed in pip prior were kept there and some of them have been duplicated into new standalone libs
i have just spent the past month rewriting package finding in several phases and produced a rewrite of several parts of the packaging library which is surprisingly inefficient (and i haven't yet addressed marker evaluation which is performed in this nested loop each time even though like the rest it should really be parsing from a string once and then sticking to it)
was just about to turn to resolution so very good to hear this about distlib. resolution is funny because unlike package finding you can give different yet not incorrect answers
people also will state that package resolution is NP complete but i've never seen a proof for that being the minimal upper bound or what refinements of the problem would be required to enforce that. which means SMT or ASP solvers would be less justified and an approach with greater explainability like resolvelib's could be weighed more easily against potential performance concerns
https://research.swtch.com/version-sat ? I've not read through it and checked if it applies to Python package dependency resolution.
I wasn't sure if there was a formal connection beyond the informal one: dependency resolution can be reformulated as Boolean satisfiability, and that is NP-complete.
thanks so much for this link! i was sad to see russ cox contribute to go's intentionally limited update mechanism and this is a very fun post.
spack (which designs its own dependency specification language) uses an ASP solver which is a very fascinating technique that i suspect to be more appropriate for highly sparse problems with mostly wrong answers like packaging tends to be due to the initial grounding phase. but i didn't have time to work on that when i worked on spack
while in a narrow view package resolution is SAT, in my experience this doesn't map to the problem we're solving for users https://docs.astral.sh/uv/reference/internals/resolver/#resolver
i'm not sure that i agree with the boolean parametrization being the most general as versions have an ordering to them which is the kind of structure that boolean satisfiability elides by assuming the booleans are unrelated.
pubgrub is exactly what we were looking at for spack
i would go even further and say there's an ordering not just in the variables, but also in the solutions
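To make the "ordering in the solutions" point concrete, here is a minimal sketch of a backtracking resolver that tries the newest candidate first. Everything in it is an assumption for illustration: the `INDEX` data is a made-up toy index with integer versions, and `resolve` is a hypothetical helper. It is not how resolvelib, pubgrub, or spack work internally.

```python
from typing import Dict, List, Optional, Tuple

Requirement = Tuple[str, List[int]]  # (package name, allowed versions)

# Hypothetical toy index: package -> version -> {dependency: allowed versions}.
# Versions are plain ints so that the ordering is obvious.
INDEX: Dict[str, Dict[int, Dict[str, List[int]]]] = {
    "app": {1: {"lib": [1, 2]}, 2: {"lib": [2, 3], "util": [1]}},
    "lib": {1: {}, 2: {}, 3: {"util": [2]}},
    "util": {1: {}, 2: {}},
}


def resolve(
    todo: List[Requirement], pinned: Dict[str, int]
) -> Optional[Dict[str, int]]:
    """Backtracking search that prefers the newest allowed version."""
    if not todo:
        return pinned
    (name, allowed), rest = todo[0], todo[1:]
    if name in pinned:
        # Already chosen: the earlier pin must satisfy this requirement too.
        return resolve(rest, pinned) if pinned[name] in allowed else None
    for version in sorted(allowed, reverse=True):
        if version not in INDEX[name]:
            continue
        deps = list(INDEX[name][version].items())
        result = resolve(rest + deps, {**pinned, name: version})
        if result is not None:
            return result
    return None  # every candidate failed, so the caller must backtrack


solution = resolve([("app", [1, 2])], {})
print(solution)  # {'app': 2, 'lib': 2, 'util': 1}
```

Note that `{'app': 1, 'lib': 1}` would also satisfy every constraint in this toy index. The search order (newest first) is what picks one valid answer over another, which a plain yes/no satisfiability formulation does not express.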
we do something very interesting with spack by encoding error facts into an optimization solve
very very impressed at that page from astral
Modern package managers often use logic solvers (SAT, ASP, SMT, CDCL, etc) for dependency resolution. Logic solvers are highly efficient at solving NP-complete problems, but often give very little information when a solve is impossible. This talk explains the solver methods used in Spack to introduce legible error messages for users, including g...
i was actually considering whether to develop a separate rust tool specifically for package finding since i'm now finding parsing to be a significant contributor to runtime and i think the caching assumptions you can make for the process might be useful to standardize.
remarkably good page on resolution from astral, better than any i've possibly ever seen. the particular focus on runtime as a matter of i/o is really important but it also treats the reader as someone who might be interested in the semantics. i doubt it would be useful to standardize resolver behavior but this has me thinking it could be possible and useful
also realizing that spack in particular would be able to make use of package finding as distinct from solving so it can bridge pypi and friends into its universe
i was very dissatisfied with PEP 751's lack of mentioning the years of work that went into install --report while introducing new syntax to marker evaluation but explicitly unspecified semantics which any subsequent specification would then necessarily be breaking so am hoping to propose a specification for that. but will be looking to expand its output to cover the cases of multiple interpreters and environments since that is also where i'm hoping to make use of it.
Sort of. distlib is of the same vintage as distribute and distutils2 (the former defunct after it merged back into setuptools itself, the latter defunct after the rewrite failed to land in Python 3.3), but it represented Vinay's (successful) effort to decouple some nice setuptools and distutils2 features from wholesale adoption of those libraries (since neither of them worked very well as "toolkit" libraries). It was one of the earliest examples of a "not pip or setuptools" consumer of the metadata interoperability standards.
I'm guessing Anaconda has made an update recently where their base environment includes a package with an invalid version, we've got a few reports all within a few days
When I have time this week I'm going to investigate, shame I no longer have a paid support line through my employer
Vinay's the same person that started virtualenv, right?
@native obsidian ^
Checking the PEP history:
- logging was Vinay (https://peps.python.org/pep-0282/)
- venv was Carl Meyer (https://peps.python.org/pep-0405/)
- virtualenv was Ian Bicking
Pretty sure Vinay was one of the early venv maintainers, though (I don't know if he contributed to virtualenv, but it wouldn't surprise me given he did enough packaging related work to want to create distlib)
yeah, I probably just saw the name in source code or something
oh wow virtualenv is older than I thought
i looked at the commit history - always fun finding '00s vintage commits in Python projects
yep!
can confirm, was there at the time
TBH I'm a little sad that all the work we've put in at the time got lost when virtualenv was rewritten, https://github.com/pypa/virtualenv/tree/legacy contains the history going back, but it doesn't show up in the contribution graph
I was looking through distlib and felt like it had a lot of features or implementations I would need multiple libraries to get; though I see the double edge of that.
Yep, it's a big toolkit, so it's always felt like a "heavy" dependency, and that sometimes scares potential users off.
I'm probably going to use it for a tool I'm working on (once I have the time again)
Quick question, is including wheels/artifacts like this normal? I was under the impression that pip handles this.
https://github.com/unitaryfoundation/pyqrack/releases/tag/v1.69.0
Building wheels is normal, it greatly speeds up installation and allows users to install without needing external build dependencies such as compilers
But isn't that what pip is supposed to do?
If you have pip, why manually add these wheels for every release.
Pip is supposed to install your package, it doesn't have to build it if it's already built. If a package has a million users, it makes sense to build it once for each platform rather than having each user build it individually, a million times.
If you want to build all your own packages you can pass pip --no-binary ":all:" in your install commands, it will be a lot slower and you will need to do a lot more prep work in your environments.
When installing with pip (or from PyPI in general), wheels are much faster than source distributions, even for pure-Python projects.
Packages with native code are a clearer win, because the wheel file will contain pre-compiled binaries for the platform you're installing on. This means that you don't need to have a compiler and non-Python bui...
I thought we could supply a single wheel to pip.
I remember it made one when I use twine build dist (was it sdist or bdist).
The only binary distribution in the Python ecosystem is wheels at this point.
You (ideally) generated both a source distribution and a wheel, which is the default behaviour for most of the build tools as well.
So, can we supply multiple wheels to pip?
Or do you HAVE to manually add it to releases and force people to download it manually to use it (not using pip).
You can, assuming they're for different packages.
No no, same package.
Example of what I'm asking about.
You see how they have wheels there as opposed to just a .zip and a .tar.gz.
They're all wheels for different platforms -- you'll install only the one relevant to your OS + Python version.
(OS + Python version ≈ platform, for this purpose)
See https://pypi.org/project/numpy/#files for another example, that has a lot more detail about how things work.
Pip doesn't automatically generate these wheels from the sdist?
You have to manually generate them yourself one by one?
It can, but that takes more time since the C build process has to be done. Wheels ship precompiled code
Pip can build a wheel from an sdist if it doesn't find a compatible wheel
But, for example, NumPy includes code in C, C++, Fortran, and more
To install NumPy from an sdist, you need to have all of those languages installed on your computer
I see.
Is there a way to automate the generation of these and uploading them to pip in a workflow?
Yeah but that one I noticed only builds a single wheel.
If you want to build for many platforms, you replace hynek/build-and-inspect-python-package with https://cibuildwheel.pypa.io/en/stable/
Ooh I see.
Thank you so much Doggo!
I finally have fixes for https://github.com/pypa/pip/issues/13568, but I have to do some real gymnastics to get the pep668 working on Ubuntu 24.04+, does anyone know who I should ping when I raise the PR to fix these tests?
I think someone with pip-maintainer tag?
Oh nvrmind, you have it too.
Thanks, I am a pip maintainer, this channel is largely used for dev discussion
This is #pip
Ask cibuildwheel questions in #cibuildwheel
Copy that, sorry for the unrelated discussion. I'll move it there.
if you have only Python code (it's okay if your dependencies have non-Python code), then you can. Multiple wheels happen when you need to compile code for multiple platforms/ABIs, or when you really need to make different releases that are specific to a Python version (maybe because you're doing something advanced with Python bytecode)
We don't have .zip distributions any more.
I see. For my case I do python only, and nothing advanced with the bytecode.
Is there any harm in including them?
Or any advantages?
Your build tool will normally produce a single wheel, with "tags" that indicate that every system can use it (subject to the python version restrictions that you put in metadata).
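For anyone following along, the "tags" live in the wheel filename itself. Below is a minimal sketch that splits a wheel filename into its parts, assuming the standard `{distribution}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl` layout. The `parse_wheel_filename` helper and the example filenames are made up for illustration; real tools should use the `packaging` library instead.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of components: {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)


def is_pure_python(wheel: WheelName) -> bool:
    """A wheel tagged none/any is usable on every platform."""
    return wheel.abi_tag == "none" and wheel.platform_tag == "any"


# Hypothetical filenames, not taken from any real project.
pure = parse_wheel_filename("example_pkg-1.0-py3-none-any.whl")
native = parse_wheel_filename("example_pkg-1.0-cp312-cp312-manylinux_2_17_x86_64.whl")
print(is_pure_python(pure), is_pure_python(native))  # True False
```

A pure-Python project ends up with one `py3-none-any` wheel, while a project with compiled code needs one wheel per interpreter/ABI/platform combination, which is why release pages like the one linked earlier list so many files.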
I wonder
Will cibuildwheel produce a single wheel for a pure-Python codebase?
Distributing the wheel has the advantages described in ^
Indeed, I'd go as far as to suggest that folks go to the Python discord or to discuss.python.org for help understanding how stuff works.
agreed, can the above get moved perhaps, sorry for distraction I just like to answer stuff
(They are asking about advantages of making multiple platform-specific wheels for a pure-Python codebase)
Hi everyone, I've been contributing to pip since mid-July (a handful of PRs and some issues so far), and I'd love to get more involved by helping with issue triage. I'm not sure if now is the right time to ask, or if I should keep contributing a bit longer first, so I'd really appreciate any guidance from the maintainers.
Tagging @finite perch and @hidden flame since you've kindly reviewed my PRs from the start, hope you don't mind the ping
.
For reference, my github is https://github.com/sepehr-rs
@unreal jungle just a heads up, the pip core team has very limited availability. I'll bring this up with the team and I'll let you know what we think!
Oh yeah, I meant to say the same, I've been ill this weekend
Thanks a lot @hidden flame and @finite perch, appreciate you both taking the time. No rush at all, I'll keep contributing in the meantime. And hope you're feeling better soon, @finite perch!
I'm now quite ill, too
i found cargo's version of the simple repository API yesterday and it has some fascinating points https://www.ncameron.org/rfcs/2789.html
currently working on i think a sqlite db to store the result of finding python packages in a queryable way and it aligns a lot with some of the design ideas here
the incremental changelog part is super interesting. it was rejected for cargo in 2020 bc it didn't help cargo performance much. but i would be very curious about its effect on bandwidth usage
tensorflow's index page is 1.1M, setuptools is 766k. using http caching headers we can avoid refetching unless there's something new, but if so we still have to fetch the whole megabyte
the json API puts new entries at the bottom, so it could be a range request but that's so hacky
are you considering compression in that
that's plaintext json. you're totally right
I've been poking at adding zstd, and maybe zstd with a shared dictionary to the compression too
I don't think those would help pip though until it can upgrade to a newer requests
and the latter would require work in urllib3 to add support for it
IIRC the JSON compresses fairly well precisely because it's very repetitive/compressible.
ok the gzipped tensorflow index is 198K (compared to original 1.1M)
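A quick way to see why index JSON compresses so well is to build a synthetic page and gzip it. This is only a sketch under stated assumptions: `build_fake_index` is a hypothetical helper, the project name and URLs are invented, and the field names just mimic the general shape of a JSON index response. Real ratios depend on the real data.

```python
import gzip
import hashlib
import json


def build_fake_index(n_files: int) -> bytes:
    """Build a synthetic, repetitive JSON index page (made-up data)."""
    files = []
    for i in range(n_files):
        filename = f"example_pkg-1.{i}.0-py3-none-any.whl"
        files.append(
            {
                "filename": filename,
                "url": f"https://files.example.invalid/packages/{filename}",
                # Hashes are effectively random, so they compress poorly;
                # the surrounding structure is what compresses well.
                "hashes": {"sha256": hashlib.sha256(filename.encode()).hexdigest()},
                "requires-python": ">=3.9",
                "yanked": False,
            }
        )
    page = {"meta": {"api-version": "1.1"}, "name": "example-pkg", "files": files}
    return json.dumps(page).encode("utf-8")


raw = build_fake_index(2000)
compressed = gzip.compress(raw)
ratio = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.2f}")
```

The repeated keys, URL prefixes, and filename patterns are cheap for gzip to encode, which is consistent with the 1.1M to 198K figure quoted above.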
unless it's changed pip does cache those too (though it uses a max-age of 0 so it does a conditional get each time rather than just blindly using the cache)
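The "conditional get each time" behaviour can be sketched without touching the network. The names here (`CachedPage`, `conditional_headers`, `apply_response`) are hypothetical and only model the HTTP semantics of `If-None-Match` and `304 Not Modified`; this is not pip's actual caching adapter.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedPage:
    etag: str
    body: bytes


def conditional_headers(cached: Optional[CachedPage]) -> Dict[str, str]:
    """Headers to send when revalidating a cached page."""
    if cached is None or not cached.etag:
        return {}
    return {"If-None-Match": cached.etag}


def apply_response(
    cached: Optional[CachedPage], status: int, headers: Dict[str, str], body: bytes
) -> CachedPage:
    """Decide what the cache holds after the server replies."""
    if status == 304:
        if cached is None:
            raise ValueError("304 received without a cached copy")
        return cached  # unchanged: no body was transferred
    if status == 200:
        return CachedPage(etag=headers.get("ETag", ""), body=body)
    raise ValueError(f"unhandled status: {status}")


# First fetch has nothing cached, so no validator is sent.
page = apply_response(None, 200, {"ETag": '"abc"'}, b"index v1")
# Later fetches send the validator; a 304 means the cached body is reused.
print(conditional_headers(page))
```

The catch raised in this thread still applies: when the page has changed, the server replies 200 with the whole body, so the validator only saves bandwidth when nothing is new.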
my pip fork has a separate adapter class for index pages
the builtin one has semantics which i think are wrong
it overwrites caching headers for you and that's fine for wheel downloads but not when we're doing fancy footwork
i also rewrote package finding. i think it's really good and makes sense but it's a big change so i'm not trying to get it merged atm
i do however have some PRs to make to the packaging lib. it's very slow
my current idea learning from that work is to separate package finding from resolving entirely. so i'm making a rust tool for that which manages the index fetching and caching and exposes an IPC protocol
only problem is that if i want it to be a database of package metadata, it also needs to be able to download wheels etc. so that's a scope increase but i'm still hopeful
i noticed that pypi provides upload-time for each version and it would be so cute if we could just tell pypi to only send me uploads newer than a given timestamp. seems too easy
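Absent server support, the same filter can only be applied client-side after fetching the full page. A minimal sketch, assuming each file entry carries an ISO 8601 `upload-time` string ending in `Z`; the sample entries and the `newer_than` helper are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Dict, List


def parse_upload_time(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # Older Python versions of fromisoformat() reject 'Z', so normalise it.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def newer_than(files: List[Dict[str, str]], cutoff: datetime) -> List[str]:
    """Return filenames uploaded strictly after the cutoff."""
    return [
        f["filename"]
        for f in files
        if "upload-time" in f and parse_upload_time(f["upload-time"]) > cutoff
    ]


# Hypothetical index entries.
files = [
    {"filename": "example_pkg-1.0-py3-none-any.whl", "upload-time": "2024-01-15T10:00:00.000000Z"},
    {"filename": "example_pkg-1.1-py3-none-any.whl", "upload-time": "2025-03-02T08:30:00.000000Z"},
    {"filename": "example_pkg-0.9.tar.gz"},  # entry without an upload time
]
cutoff = datetime(2025, 1, 1, tzinfo=timezone.utc)
recent = newer_than(files, cutoff)
print(recent)  # ['example_pkg-1.1-py3-none-any.whl']
```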
the problem with things like that is it requires dynamic computation
right now we rely heavily on CDN caching to scale pypi
The last time I looked, the PyPI backend was handling ~1000 req/min, while the CDN was multiple orders of magnitude more.
FWIW, I'm willing to make time to cut a security bugfix pip release tomorrow. I'd literally just cherry-pick the PR on top of the 25.2 tag and cut that.
does zstd being in the stdlib help with that?
yes
Currently approximately 600 requests per second for index API hitting the backend vs roughly 60k rps at the edge
And that's only for the index, not for files
I guess that's the exponential growth everyone keeps talking about with PyPI?
grumble grumble CVE grumble grumble
If you want to do that I'll approve any PRs or workflows, but don't feel pressured
"Number go up" is good, right?
(sorry, what are the bugfixes that would motivate a security release?)
I'm of the opinion that the current issue is worth cutting a security bugfix. The impact is not huge, but it's also annoying when the literal package installer has a CVE and is breaking your builds.
The main deciding factor though is that the pip project has extremely limited maintainer resources. We can't commit to always cutting security releases.
At that point, it also becomes a balancing act of whether the maintainer time spent on a security release is worth it or would have been better used on development or maintenance.
3.14.0 final is due on Tuesday, will CVE scanners complain about that too?
AFAIU, the CVE is linked to pip only. Even if you're using a modern enough CPython version that has a hardened zipfile tarfile implementation that isn't vulnerable, CVE scanning will still flag and fail the build.
There isn't a way to declare that the CVE is applicable to "pip but only on these python versions"
So in short, yes. Every user of pip that uses CVE scanning can be affected since there is no currently released version that has the fix on pip's end.
I still think the CVE is mostly meaningless though. Yes, it is a valid attack vector, but if you're using source distributions, there are so many other better and easier ways to get compromised. They can include a setup.py that deletes everything if someone wanted to.
Every CVE detector is very clear that it's not a perfect system and users will need to review with nuance. That's just the nature of using a CVE detector.
I agree it's annoying for some users, though we've actually not got that much feedback all things considered, but that's what you sign up for by blocking your CI pipeline with CVE detection.
It's really a question of how much effort it is for you, because so far you're the only one who has volunteered to do a release for it.
Every bundled version of pip into CPython is technically affected.
I dunno. I guess I've just read too many threads in the JS ecosystem where people are freaking out over CVEs
New "ReDoS" every five minutes
The JS ecosystem has different expectations and preconceptions. JS code at runtime is in a sandbox, so not much can go wrong; build time is where the pain is because it's often not sandboxed.
Python code is rarely sandboxed anywhere, so any malicious code can affect every stage of the lifecycle.
(Do containers count as sandboxes?)
No, lol

They help, but they shouldn't be your only defense.
...is it actually valid by the standard to have symlinks in an sdist in the first place?
(and what would happen on Windows without the symlink-creation permission bit?)
IIRC no
but don't quote me on that. I just remember symlinks being a problem in packaging
mm
Symlinks work in sdists, but are not encodable in wheels.
Tools MAY unpack links (symbolic or hard) as regular files, using content from the archive.
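To see what "links in an sdist" look like in practice, here is a small sketch that builds an in-memory tarball containing a symlink and then lists every link member. The archive contents and the `find_links` helper are invented for the demo; this only inspects the archive and is not how pip decides what to extract.

```python
import io
import tarfile
from typing import List, Tuple


def find_links(tar: tarfile.TarFile) -> List[Tuple[str, str]]:
    """Return (member name, link target) for every symlink or hard link."""
    return [(m.name, m.linkname) for m in tar.getmembers() if m.issym() or m.islnk()]


def build_demo_sdist() -> bytes:
    """Create a tiny made-up sdist with one regular file and one symlink."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"[project]\nname = 'demo'\n"
        info = tarfile.TarInfo("demo-1.0/pyproject.toml")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

        link = tarfile.TarInfo("demo-1.0/passwd")
        link.type = tarfile.SYMTYPE
        link.linkname = "/etc/passwd"  # points outside the archive
        tar.addfile(link)
    return buf.getvalue()


with tarfile.open(fileobj=io.BytesIO(build_demo_sdist()), mode="r:gz") as tar:
    links = find_links(tar)

print(links)  # [('demo-1.0/passwd', '/etc/passwd')]
```

Nothing is extracted here, so the link is never created on disk. The check only shows that the tar format can carry a link whose target lies outside the archive root.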
ah, that might be it
ah, .tar represents them because it's based on ancient unix standards, but .zip wouldn't... that much makes sense I suppose
that sounds like the CVE just risks copying a system file into a venv... ?
but that still entails giving access to it that may not be desirable
not sure that actually escalates from what setup.py can already do... ?
It doesn't. If you're using a source distribution, pip can and will run arbitrary code by design in many cases.
right, but my point is also that "arbitrary code" run as a user can also access the same files that the symlink extraction process (run as the same user) can. Which I guess is obvious, but.
It would be some really convoluted scenario to be a security risk, like you've done a static analysis of all the package, and the build backend, but you didn't consider it could symlink to outside the archive, and that somehow started a chain of unexpected behaviors that is exploited
There may be organisations that have their own known-good wheelhouse of build backends (none of which include ACE) so they can safely use source distributions, but I consider that unlikely given it'd be easier to build/distribute their own internal wheels and just ban sdists outright.
Anyway, that would be a weird place to draw your security boundary.
Perhaps an organization builds an sdist in a sandbox VM, they only allow the Python process to access a fixed number of directories, but the link that points outside the archive sets off a chain that allows the build process to escape the sandbox VM?
Like it points to a vulnerable implementation of sudo or some system util that implicitly sudos (because stuff like that happens in OSes for some reason, looking at you ping), and the process is able to escalate itself, etc. etc.
I have a little bit of free time for a few days. I will try my best to review as much as I can to unblock the release.
I've reviewed the build constraints PR again tonight. It looks pretty good, but there are some issues with the tests.
Oh, thanks for being so thorough
I honestly struggled making tests for that one initially
Hi @hidden flame and @finite perch, just wanted to follow up on my earlier message about helping with issue triage
No rush at all if things are still busy, I've continued contributing and have informally triaged a few issues and PRs. I just wanted to see if there's anything else I should do or keep in mind for when the team's ready to discuss it.
Hey sorry, very busy weekend, it's not been forgotten, communication however is a bit slow between maintainers as we all have busy schedules, you will be contacted by someone soon (if not already) to discuss further
I've sent you a DM.
Thanks a lot to both of you, @finite perch and @hidden flame :), really appreciate your support.
@Daylily I messaged this screenshot because I was blown away by the amazing traceback and was told you're the one to thank for it. Incredible work! Definitely going to make debugging easier for me.
@shy echo someone likes your --debug flag that you added.
Not sure how I got the compliments, but consider this my 307 response to redirect to you :)
Reading over https://github.com/pypa/pip/issues/12712, I gather that zip extraction is somewhat of a bottleneck in installs given faster bytecode compilation. I recently landed a change to 3.15 that should make decompression 10+% faster for files >1MB (https://github.com/python/cpython/pull/139976). I wanted to offer I'm happy to work with folks on changes in core Python that would help with pip performance
Awesome, yes, you've probably already seen this PR that optimized zip extraction time: https://github.com/pypa/pip/pull/12826, but not something we could accept in pip directly, upstream speed up will directly speed up installs, especially ones dominated by extracting wheels
Yeah I'll look into performance improvements in the zipfile code
I hope the person behind that PR did get in touch with CPython.
I don't know about that one specifically but looking at their profile I do see they got at least one performance related PR landed on CPython: https://github.com/python/cpython/pull/119783
Though I will say, it is a common pattern to see that people give up after receiving their first rejection, even when the rejection is "you should discuss this over on Y, or raise a PR on Z instead". I sympathize, but I don't know what else I can do about it.
Probably nothing, realistically.
So I have sketched a refactor of zipfile (under the hood) which should make it easier to reduce locking and minimize the seeking done during decompression https://github.com/python/cpython/issues/136741#issuecomment-3413333521
A couple of questions I had while designing this is if stdlib allowed overriding the default decompression methods (i.e. libraries could register for ZIP_DEFLATED their own code to run)
- Would you consider using such a mechanism? I could imagine using zlib-ng or zlib-rs would provide a significant speedup to install speed vs normal zlib. Allowing an optional add-on speed up pip would be interesting
- On the flip side, would pip need to protect against 3rd parties modifying the default deflate decompression for security purposes?
So pip is not likely to write its own specific decompression related code, beyond adding a few lines to turn on or off flags.
If there was a well maintained third party library that is pure Python and licence compatible we could vendor it.
And as for security, I'm not sure I completely follow, but pip is not in the business of protecting a user from their own environment.
Gotcha. I was imagining a third party could replace the default deflate decompressor with a zlib-ng decompressor which would immediately speed up pip. This could be done transparently to pip
meaning, the system library that zipfile interfaces to? π or else how would it be transparent to pip
oh, you mean, declare a dependency, but no code change?
Right so basically users would be able to install e.g. zlib-ng and zlib-ng would be able to do something like zipfile.register_decompressor(ZIP_DEFLATED,...) and zipfile would use zlib-ng from the zlib-ng package going forward.
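To be clear, `zipfile.register_decompressor` is only the idea being floated here and does not exist in the standard library today. The sketch below is a standalone toy registry that shows the shape of such a hook: a table from compression method to a decompressor factory, with the stock `zlib` raw-deflate decompressor as the default. All names in it are hypothetical.

```python
import zipfile
import zlib
from typing import Callable, Dict

# Hypothetical registry: compression method id -> factory for a decompressor
# object exposing decompress() and flush(), like zlib's decompress objects.
_DECOMPRESSORS: Dict[int, Callable[[], object]] = {}


def register_decompressor(method: int, factory: Callable[[], object]) -> None:
    """Replace the decompressor used for a given compression method."""
    _DECOMPRESSORS[method] = factory


def decompress(method: int, data: bytes) -> bytes:
    """Decompress data with whatever is currently registered for the method."""
    obj = _DECOMPRESSORS[method]()
    return obj.decompress(data) + obj.flush()


# Default: raw deflate (negative wbits means no zlib header), as used in ZIP files.
register_decompressor(zipfile.ZIP_DEFLATED, lambda: zlib.decompressobj(wbits=-15))

# Round-trip demo with a raw deflate stream.
_compressor = zlib.compressobj(wbits=-15)
payload = _compressor.compress(b"wheel contents " * 100) + _compressor.flush()
restored = decompress(zipfile.ZIP_DEFLATED, payload)
print(len(payload), len(restored))
```

A third-party package could then call `register_decompressor` with a faster drop-in factory, and callers of `decompress` would pick it up without any code change on their side. That is the "transparent to pip" property being discussed.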
ooh
Why couldn't Python be compiled against zlib-ng in the first place?
It can! (and is on Windows and I believe Fedora)
But that isn't true across all distros
Windows is not happy with time zone datetime objects from the year 3000 onward, and this broke a few of my tests on Windows. Glad to see Microsoft is already planning to have Y3K support contracts
Just saw the notifications re #13482 and #13534. Congrats! These were important issues to solve and I've been "following" the corresponding bug reports the whole time
Definitely has me excited for 25.3 (along with the other stuff mentioned in the 25.1 release)
(also, why expand year fields to four decimal digits to solve y2k when clearly you only need three ;) )
Yeah, I'm glad I finally decided that breaking backwards compatibility with PIP_CONSTRAINT being used to define build constraints was the way forward, it makes the UX a lot simpler, and allows us to improve stuff behind the scenes with a lot less headaches.
I almost implemented it in a backwards compatible way and I'm not sure anyone other than me would have understood how every combination of constraints and build constraints would have worked.
Sadly almost nothing I'd been working on will make it for 25.3 or possibly 26.0 either.
I really do want to get in-process build dependencies in at some point, though.
:(
Richard says that but if it wasn't for him we probably wouldn't have got build constraints over the line for this release, reviewing is as important as authoring
@finite perch congratulations on the release! (this time in the right channel)
I'm sad that I can't write a post for this release since it contains some noteworthy changes, but alas, I'm so busy
Thanks! Pretty harrowing there for a moment as I messed up the commit order and accidentally published 26.0dev0
I suppose we'll find out who is monitoring pip releases like a hawk sooner or later, hah.
Stuff happens. You got it fixed promptly. That's what is most important in the end.
Yeah, fortunately I'm used to high stress deployments, I've done a lot of manually copying files in place and running a series of 40 steps for multi-billion dollar trading systems before, ahaha
I hope those 40 steps were documented at the very least.
Yeah your posts are great, I might take this as a moment to sort my own blog out and make a post, we'll see
well, there's no 26.0dev0 on the PyPI index (even yanked) so it certainly could have been worse
anyway congrats. I recall from the 25.1 post that this one was expected to have quite a bit of interesting stuff
Yeah, I directly deleted it, I think less than 3 mins after it was uploaded, I made the decision that was the better choice than yanking if I did it quick enough.
In most cases I would prefer yanking, but because this had a higher release number than the intended one (even if it's a pre-release) and because I was catching it so early, quicker than hopefully PyPI mirrors would notice, I went that route
Plus, it's unlikely folks would complain that they can't use a dev release. And, "don't do that" is a reasonable answer for us to give IMO.
Use a temporary directory in the wheel cache to build wheels, so the built wheel is always on the same filesystem as the wheel cache, and can be atomically moved into the cache.
Is this just a performance concern, or... ?
Concurrency concern: https://github.com/pypa/pip/issues/13540
Although performance should be improved in those cases also
... shutil.copy2 doesn't lock the file it's writing to? o_O
... anyway I notice there's new debug output, too
old:
× Encountered error while generating package metadata.
╰─> See above for output.
new:
× Encountered error while generating package metadata.
╰─> from file:///path/to/<name>-<version>.tar.gz
it's a little strange to see a file URI (complete with percent-encoding), though; would it actually be possible that the source package isn't stored locally?
IIRC, pip represents requirements by package name or file URIs internally
There are other places where you'll see a file URI.
Like this error message.
And yes, pip can fetch source distributions from a remote URL. I'm not sure whether pip would present the URI of the downloaded copy or the remote URL, but there's nothing requiring a local source distribution.
I don't remember the details, but back in the day when I used to do a lot of copying across Windows and/or network drives I had some utility functions, and they always ensured that first the file was written to a temporary directory somewhere on the same drive, with the same filename and permissions it was intended to have in the final location, and then atomically moved after that. This approach had the least number of edge cases in my experience.
interesting
I guess the easiest way to be sure of "on the same drive" is to make a temp folder inside the destination folder, then move up a level...
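The "temp folder inside the destination folder, then move" approach can be sketched with the standard library. This is only an illustration of the pattern under the assumption that the temporary directory and the destination share a filesystem; `atomic_write` is a hypothetical helper, not pip's wheel cache code.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(dest: Path, data: bytes) -> None:
    """Stage the file next to its destination, then move it into place."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Creating the temp dir inside dest.parent keeps both paths on one
    # filesystem, which is what makes the final rename atomic.
    with tempfile.TemporaryDirectory(dir=dest.parent) as tmp:
        staged = Path(tmp) / dest.name
        staged.write_bytes(data)
        os.replace(staged, dest)  # readers see the old file or the new one


with tempfile.TemporaryDirectory() as cache_root:
    target = Path(cache_root) / "wheels" / "example_pkg-1.0-py3-none-any.whl"
    atomic_write(target, b"first build")
    atomic_write(target, b"second build")  # overwrites in one step
    final_bytes = target.read_bytes()
    leftovers = sorted(p.name for p in target.parent.iterdir())

print(final_bytes, leftovers)
```

A concurrent reader never observes a half-written file at the destination path, which is the concurrency concern mentioned above. A plain copy into place gives no such guarantee.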
anyway I guess that won't cause a problem for my own project but I'll keep it in mind
I am a little worried about edge cases from that change, but I don't have a concrete example to draw from so I decided not to block it.
If there are we'll likely see early next week when people's CIs start running all sorts of weird configurations
FWIW, people still think our deprecations are going to break them even though they really are designed to not do that. https://github.com/casadi/casadi/issues/4098
We're mostly done with the deprecations, so this doesn't matter anymore, but the standard deprecation message is possibly a bit too crude.
I've seen that a few times, language quality could definitely be improved
I wonder if we could implement a lazy importer for pip's own use. Regular lazy imports are unsafe, but if we use a custom importlib finder/loader that scans and reads all of pip's source code on startup, but defers the actual module execution until use, that should be safe?
My thoughts on this are to use regular lazy import and then before running any install command scan sys.lazy_modules, import them all, and then set all imports to be eager after that
The cost should then only be incurred right before an install
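For reference, the standard library already has one deferral mechanism, `importlib.util.LazyLoader`, which postpones module execution until the first attribute access. The sketch below follows the recipe from the `importlib` documentation. It is not the custom finder/loader idea described above, and it is not the proposed language-level lazy import either. `colorsys` is just an arbitrary small stdlib module chosen for the demo.

```python
import importlib.util
import sys
from types import ModuleType


def lazy_import(name: str) -> ModuleType:
    """Return a module whose body only runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot find module {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up the lazy module, does not run it yet
    return module


colorsys = lazy_import("colorsys")
# The module body executes here, on first attribute access.
hsv = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
print(hsv)  # (0.0, 1.0, 1.0)
```

The safety concern from this thread applies to this mechanism too: the source file is read when the attribute is first touched, so if the files on disk change in between (for example, pip upgrading itself mid-install), the deferred execution can pick up the wrong code.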
I'm honestly pessimistic about the lazy import proposal.
I guess match case did get approved, so anything is possible, but I'm not planning on it.
Yeah, I'm not doing any work yet, I'm not too pessimistic, there's a lot of pressure, there are multiple forks of CPython now because of no good support here
I was going to say that I'd like to also maintain lazy imports even during install, but that would be unsafe with our vendored dependencies. Even with a custom finder/loader as described above, our vendored dependencies could load/exec other Python code w/o import on module exec.
That's bad practice, but I also don't feel like assuming that our vendored dependencies are doing the right thing.
I'll experiment with a custom loader when I get some time. I'm curious to see if lazy imports are even helpful for our codebase. It's hard to optimise startup time nowadays since a huge portion comes from our dependencies, especially rich + packaging.
My assumption is that things like pip --help could be optimized and everything else would basically have negligible impact on overall experience
I'm surprised that no one is complaining about pip 25.3 yet, I did see a few projects have pinned to 25.2, but none of them seemed to mind and understood that one of their dependencies was very legacy
I'm surprised too. It's a Monday and no one has raised an issue complaining their $very-important-work-ci is blowing up.
I have actually been thinking about a tool that pre-packs existing .pyc files into an atlas that can be read into memory ahead of time, and then a loader that retrieves the needed bytes from there for the marshal.loads call
not specifically for pip, mind, but pip with its current architecture seems like a pretty good test case
uv has changed things. I rarely use pip anymore, if ever.
It's quantitatively changed things, it's not qualitatively changed things, there is a large pip competitor but pip is still the main installer and in millions and millions of pipelines.
I wonder what it would actually take to change that.
(for that matter, there are probably all kinds of places where pip is repeatedly running to re-install the same stuff into new containers, without a cache, rather than having those packages built into the container image....)
I assume that's why PyPI download stats are so high for popular packages, if caching was properly implemented I can imagine those numbers being 10x smaller
it's my best working theory, too.
and that seems all kinds of wasteful, and unfair to Fastly
and I don't know of a good way to track down major culprits, either.
is there any syntax for specifying extras with a plain URI in a requirements.txt file?
or does it require the standard foo[extra] @ https://something/something format?
The standard one, yea.
I checked and with a filesystem path it stops parsing the path at [ and starts parsing an extras list
context: I'm rewriting the requirements.txt parser for PyCharm
and it's not clear what things the pip parser actually supports
I think packaging/requirements is pretty much a reference implementation (pip has its own for legacy reasons)
I've been looking at that but it doesn't really help me with this particular question
as Pradyun said, pip supports the standard syntax
yes, but it also supports other things like just giving a path or a URI
and I found that support was a bit inconsistent as plain URIs cannot have extras, but plain filesystem paths can
I'd say going with the standard specified in the link I said would be the best path
how will that help me with writing a requirements.txt parser?
I need to handle everything pip supports
if pip is doing anything special that is not in there, I am sure sooner or later it will be removed in favour of the standard spec
requirements.txt contains dependency specifiers, right?
and that is what your original question was about
and aside from pep 508 specifiers it also supports plenty of other things
like pip options, plain filesystem paths and plain URIs
my question related to whether these last two oddities support things like extras (and environment markers)
Let me look.
from my testing it looks like plain URIs do not support extras but they do support environment markers
so if the parser encounters a semicolon it stops parsing the URI
Well pip's dependency specifier features were written before the standardization of PEP 508 (right number?) and for backwards compatibility, I sincerely doubt we'll remove stuff like a bare URI for local installs.
yeah which is why I need to support them properly
I just need to figure out what exactly is supported by pip
The answer is that it changes. pip 25.3 actually just changed the parser again to support Direct URL editables, but that is a standard syntax.
Sorry, I am a bit busy atm so it may take a bit to dig through the code.
yeah I'm fine with the standard syntax, that's pretty well documented
@hazy glen Actually, do you mind I get back to you in an hour or so?
sure, np
@hazy glen I'm guessing you've seen https://pip.pypa.io/en/stable/reference/requirements-file-format/ by now? I'm erring on the side of overcommunicating -- that was written a little while back, when I was trying to decouple the documentation -- part of which ended up being trying to make it easier to figure out what exactly this format was.
yup, that's what I based my work on
In general, https://github.com/pypa/pip/blob/main/src/pip/_internal/req/constructors.py contains our requirement parsing code.
There are three entrypoints (generally):
- install_req_from_line, which is used in most cases and allows for pip's custom syntax
- install_req_from_req_string, used where standard compliance is expected/required (it delegates to packaging's parser)
- install_req_from_editable, for editables. There are two flows in this function, one for standard syntax and one for pip's extensions.
I'll take a look at the tests for those.
def test_extras_for_line_url_requirement(self) -> None:
    line = "git+https://url#egg=SomeProject[ex1,ex2]"
    filename = "filename"
    comes_from = f"-r {filename} (line 1)"
    req = install_req_from_line(line, comes_from=comes_from)
    assert len(req.extras) == 2
    assert req.extras == {"ex1", "ex2"}
There's this example. (The comes_from is irrelevant for you.)
okay, so the extras can also be defined in #egg=...
I'd say in general, you should look at https://github.com/pypa/pip/blob/main/tests/unit/test_req.py
Ctrl-F for the function names I mentioned above can narrow down your search for examples (you can also try parse_editable and parse_req)
I will note that the plan is to phase out egg fragments. I'm not sure when we'll get around to doing that, but the egg fragment can be wholly replaced by the standard Direct URL syntax nowadays.
Editable VCS requirements used to need the egg fragment syntax to request extras, but that was fixed in pip 25.3.
if is_url(name):
    link = Link(name)
else:
    p, extras_as_string = _strip_extras(path)
    url = _get_url_from_path(p, name)
    if url is not None:
        link = Link(url)
This seems to imply that pip will accept extras for a bare URI, although I can't find a test case for that.
Correction: bare filesystem path, URIs don't actually accept extras due to that is_url check (well directly, anyway, egg fragments are a thing).
I'm going to stop there. Let me know if you need anything else @hazy glen.
I already tested and bare filesystem paths do support extras
Idk if this would be viable for pip, but there's a world where you go a step farther and avoid the cost of finding, opening, and reading each individual source file by taking a leaf from oxidized_importer: consolidate (& serialize) all the source code into a single file as a build step so that it can be mmapped or read into memory or whatever all at once.
I literally had been thinking of that independently...
although I didn't have this much architecture in mind, nor Rust.
My bad, somehow didn't see that message. It's a neat idea.
Gonna say that's probably sketchy.
Also not going to fly with our redistributors.
It may actually be more reliable to grab a page from the video game industry and create an atlas and use a custom finder/importer that pulls from said atlas. (Unless that is the idea?)
Fair enough.
Not familiar with “atlas” in this context, so I don’t know, sorry.
I think this is roughly the same idea: store all of the data in a single file and pull the data out piecemeal
in games an atlas has all of the e.g. texture assets in one giant PNG (or whatever file type) and you load the whole PNG then actually use it by saying "this item has texture in this area of the atlas"
Ah. Yeah, that’s pretty close.
that was precisely my idea, yes. Although 1-dimensional.
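A rough sketch of the 1-dimensional atlas idea, under some loud assumptions: the blob layout (4-byte header length, JSON index, then marshalled code objects), the names pack and AtlasFinder, and the flat-modules-only restriction are all made up for illustration. A real tool would read existing .pyc files and handle packages; this one compiles from source strings to stay self-contained.

```python
import importlib
import importlib.abc
import importlib.util
import json
import marshal
import sys


def pack(sources):
    """Build one blob: [4-byte header length][JSON index][marshalled code objects]."""
    index = {}
    body = bytearray()
    for name, source in sources.items():
        data = marshal.dumps(compile(source, f"<atlas:{name}>", "exec"))
        index[name] = [len(body), len(data)]  # offset and length inside the body
        body += data
    header = json.dumps(index).encode("utf-8")
    return len(header).to_bytes(4, "big") + header + bytes(body)


class AtlasFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader in one: serves flat modules out of an in-memory atlas."""

    def __init__(self, blob):
        header_len = int.from_bytes(blob[:4], "big")
        self._index = json.loads(blob[4:4 + header_len])
        self._body = memoryview(blob)[4 + header_len:]

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self._index:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # let the normal import system handle everything else

    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        offset, length = self._index[module.__name__]
        code = marshal.loads(self._body[offset:offset + length])
        exec(code, module.__dict__)


blob = pack({"atlas_demo_mod": "VALUE = 41 + 1\n"})
sys.meta_path.insert(0, AtlasFinder(blob))
atlas_demo_mod = importlib.import_module("atlas_demo_mod")
```

In practice the blob would be written to disk at build time and read (or mmapped) in one go at startup, which is where the saving over many small file opens would come from.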
... who are major redistributors of pip, aside from ensurepip?
Just the linux distros? and they presumably would just reject such a binary blob
but what if they ran the (FOSS of course) packing tool themselves?
Having just reviewed a PR for our autocompletion, I realized that our autocompletion would likely benefit from lazy importing, too.
Our autocompletion involves calling a Python script which can be quite slow.
Oh yeah, that's true
Although I haven't looked into whether our autocompletion could be easily optimized. My laptop has always been fast enough so it's really not a concern for me.
Do you use autocompletion? I've been unable to get any of them working on my machine
The only thing I wanted to do with auto completion is move the scripts into their own file and run linting on them, because I'm often not even sure if the syntax is valid for the versions of the shells we want to support, but given the recent stalling of multiple auto completion PRs I've held off
progress!
(I'm still waiting for the first PR against one of my own projects... 🥹 )
(please don't artificially resolve that.)
What if I resolve it sincerely?
You mean so high or so low? (Assuming the latter, given pip's scale)
lowest since 2022: https://www.repotrends.com/pypa/pip
Analyze trends in pypa/pip issues over time
there are so many of these analysis sites that I only ever know about because someone links them...
I use my own :P
My goal is hit 90 open PRs by the end of the year. I doubt that'll happen since I'll be busy for a few weeks soon, but hey a maintainer can dream.
I've just found a subtle bug in how binary/no-binary are passed to the build requirement installer. I'm just going to pretend I didn't see it and keep my fingers crossed that @hidden flame eventually has the time to do an in-process installer that will remove the bug
uhhhhhh
I think in-process build deps will remove the bug since I am reusing the package finder.
Yeah, exactly, the way the arguments are reconstructed to pass to pip isn't right, because the arguments are actually order dependent
Oh, amazing.
TIL.
Timeframe-wise, I'll probably aim to get it ready for review during the holidays.
It's been long enough since I've last worked on it that I don't remember what needs to be done, but I think they are:
- Tweaking the error reporting
- Ensuring nested builds aren't horribly broken
- Writing moar tests
The logic itself should be solid. Although I do need to fix it post the removal of legacy setup.py builds.
Yeah, no rush on my end, but I am about to open a new PR that passes more arguments to the subprocess build installer, hence why I noticed this
Lovely.
I'd like to replace our janky build isolation mechanism with venv (which is something @shy echo was working on ages ago) after I get in-process deps landed, but this may be a case of me biting off more than I can chew.
In theory, it shouldn't be that bad since venv is battle tested, but I'm sure there are horrifying fun ways it can go wrong or break someone.
Yeah, every time I look at venv logic I am totally confused on what is going on, I'll stick to the simple world of package selection and dependency resolvers
"simple"
I gave a private talk about pip not too long ago and I immediately said "the one thing I am not qualified to talk to y'all about is the dependency resolver in pip."
Step 1: Take Requirements
Step 2: Resolve (???)
Step 3: Profit
Step 2.5: get lost in the math
I basically summed the process up as solving a bunch of systems of equations that only get progressively and exponentially more complicated/interwoven.
I mean, SAT solvers are a thing and I'm pretty sure used in ecosystems where dependency metadata is available all at once.
I guess it's really more boolean logic, but ¯\_(ツ)_/¯
*Actually I remember saying systems of inequalities. That is a bit more accurate.
(can you tell I know nothing about dependency resolution?)
At a super-duper high level, pip's dependency resolver (i.e. pip + resolvelib) is quite basic:
1. Collect user requirements into "all known requirements"
2. Iterate through requirements in a DFS pattern
3. For each requirement, iterate through each possible candidate and check that it satisfies "all known requirements" and that its requirements don't break a previously found candidate
   a. If good, "pin" the candidate and add its requirements to all known requirements
      i. If there are unsatisfied requirements, repeat from 2
      ii. If all requirements are satisfied, SUCCESS
   b. If bad, move on to the next candidate for that requirement
      i. If all candidates for that requirement are exhausted, "backtrack" and find a different candidate to fulfil the parent requirement
      ii. If all parent requirements have been tried, FAIL
Pip and resolvelib aren't doing smart SAT solving things, for example, at step 3, if A is already pinned with "foo>3", and B has a requirement "foo<3", then the resolver will happily pin B and not hit a problem until it tries to pin foo, which might be MUCH later, causing catastrophic backtracking
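To make the shape of that loop concrete, here is a toy backtracking resolver. It is only a sketch under heavy assumptions: the in-memory INDEX, integer versions, and requirements expressed as sets of allowed versions are all invented for illustration, and this is not resolvelib's actual code or API. It reproduces the foo example: the conflict only surfaces once foo itself is reached.

```python
# Hypothetical index: name -> {version: [(dependency name, allowed versions), ...]}
INDEX = {
    "a": {2: [("foo", {4, 5})], 1: [("foo", {1, 2})]},
    "b": {1: [("foo", {1, 2})]},
    "foo": {1: [], 2: [], 3: [], 4: [], 5: []},
}


def resolve(index, requirements, pins=None):
    """Return {name: version} satisfying all requirements, or None if impossible."""
    pins = dict(pins or {})
    # A newly added requirement must not break a previously pinned candidate.
    for name, allowed in requirements:
        if name in pins and pins[name] not in allowed:
            return None
    unpinned = [name for name, _ in requirements if name not in pins]
    if not unpinned:
        return pins  # everything satisfied: SUCCESS
    name = unpinned[0]
    # A candidate has to satisfy every known requirement on this name.
    allowed = set.intersection(*[set(a) for n, a in requirements if n == name])
    for version in sorted(index.get(name, {}), reverse=True):  # newest first
        if version not in allowed:
            continue
        deps = index[name][version]
        result = resolve(index, requirements + deps, {**pins, name: version})
        if result is not None:
            return result
        # Otherwise this pin failed somewhere deeper: try the next candidate.
    return None  # candidates exhausted: the caller backtracks


ROOT = [("a", {1, 2}), ("b", {1})]
solution = resolve(INDEX, ROOT)
```

Here a=2 and b=1 are both pinned happily before the resolver notices that no foo can satisfy both of them, at which point it unwinds and tries a=1.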
I would like to implement a more intelligent SAT solver, but one of the problems with that is it's not actually always clear whether two requirements are contradictory or not, because the Python version specifier spec is non-trivial (to say the least). I am trying to lay the groundwork for this in packaging by adding an unsatisfiability check (https://github.com/pypa/packaging/issues/940). I have most of the code ready and will try to raise a PR before the end of the year, but in implementing it I found a bunch of edge cases that I've been fixing in packaging first.
FWIW, incremental SAT solvers that don't need all of the information up front exist, and you can usually hack around non-incremental solvers to use them for incremental solves as well.
One approach is basically have variables representing the unknown things, and if the solver needs to set that to true, you fetch that and do another round with the unknown variable replaced with these known quantities, and the solver suggested to start with the assumptions+clauses from the earlier resolve step.
Yeah, and my intuition is it's pretty doable to build a cost into looking up unknown things, so looking up unknown things is avoided until necessary, and some unknown things are more expensive than others
I can only review a certain number of 700+ LOC PRs
--uploaded-prior-to is next on my review queue, but the prerelease/final PR will take some time.
I didn't realize that we had a PR for colours in pip help until recently. This is cool. I'd like to clean it up and get it merged.
(if pip was using argparse you'd get this for free (in 3.14+))
Yeah sorry, --uploaded-prior-to has a lot more design choices and things to work through. The pre/final release is actually very simple if you accept it uses the existing no/binary semantics; most of the code is tests
I think there were some good reasons for why pip is not using argparse
btw I know that optparse was "undeprecated", but is it now accepting PRs etc?
There's an issue about moving off of optparse that probably has all the context.
Geez, 8 years!? I... I'm one of the old people now.
From a quick scroll, the blocker was dropping Python 3.6. And of course someone doing all the work...
One of those happened... two years ago.
Closer to four years ago! https://pip.pypa.io/en/stable/news/#v22-0
but now that optparse was "undeprecated", is there really a need to switch?
Even when optparse was deprecated, there were no plans to remove it from the stdlib. And I get the feeling from the thread that the former deprecated status wasn't the main reason for a change.
tbh I don't see what the original intention was
Out of everything we could work on, the value add seems to be minimal if not zero
Due to argparse design issues it could actually end up being negative, depending how well a PR was designed and the author understood all the reasons argparse stalled out
My main issue with the current design is the options object is statically a black box, I would want any replacement to improve on that. And I'm not sure there are any good options right now, for the scale of options that pip has
have you played with click? IMO it’s pretty nice
I use click myself, but for pip, there isn't much value add from switching CLI libraries. If anything, things may break due to subtly different parsing behaviours.
Personally I don't like the decorator-based approach, but internally click is pretty solid
...isn't click built on top of argparse?
Optparse. And it used to be. Now they have a new, hand-crafted parser (although I don't remember if it was merged already)
Huh. I should look into it more closely.
the rewrite branch (planned for 9.0 IIRC): https://github.com/pallets/click/compare/main...parser-rewrite-1
rewrite issue: https://github.com/pallets/click/issues/2205
docs: https://click.palletsprojects.com/en/stable/why/#why-not-argparse
(I was in that rabbit hole before lol)
Click breaks stuff too often for me. It's worse if you use Typer, but even plain click breaks stuff in non-major releases. Pip would vendor it, I suppose, so that might not be an issue. I personally like argparse pretty well - it's not elegant but you can provide your own namespace for it to fill. I usually only use click if I want its chaining context feature.
even as a huge fan of click, I agree that it's too breaky for pip. Plus it does all kinds of stuff with std streams which I don't think it should be doing. I use click preferentially if I want to just have fun, but argparse for stability and having fewer deps.
Back to looking at call graphs to find performance improvements: https://github.com/pypa/pip/pull/13656
Was hoping to find a new heuristic improvement with recent grpc issues but nothing obvious, so just seeing where I can find redundant work in pip.
Had an idea this morning that, if it pans out, could significantly improve long resolving times. Probably won't have time to test it out today
What's the idea?
It's a caching strategy for candidate lookups. Once you get into long resolutions, most of the time is currently spent on checking what candidates are available for requirements that have already been checked.
That's a set that'll keep changing, no?
Unless it's "list of distributions available" rather than "list of viable options left" that you're referring to here. π
If "numpy>100" returns no candidates it won't suddenly start to return candidates later, so we don't need to keep checking over and over again if each numpy version matches that requirement or not. But that's what happens because in a pathological resolution it keeps getting stuck on the same requirements over and over again.
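A minimal sketch of that kind of cache, with assumptions stated up front: CandidateCache, its "greater than a minimum" requirement form, and integer versions are all hypothetical simplifications, not pip's finder or the actual proposed change. The point is only that a repeated (name, requirement) lookup stops re-checking every version.

```python
class CandidateCache:
    """Remember which versions matched a requirement so repeat lookups are free."""

    def __init__(self, available):
        self._available = available  # name -> iterable of known versions
        self._cache = {}             # (name, minimum) -> tuple of matching versions
        self.checks = 0              # per-version checks performed, for illustration

    def find(self, name, minimum):
        key = (name, minimum)
        if key not in self._cache:
            matched = []
            for version in self._available.get(name, ()):
                self.checks += 1
                if version > minimum:
                    matched.append(version)
            self._cache[key] = tuple(matched)
        return self._cache[key]


cache = CandidateCache({"numpy": (1, 2, 3)})
first = cache.find("numpy", 100)   # checks all three versions, finds nothing
second = cache.find("numpy", 100)  # answered from the cache, no new checks
```

The open question raised in the chat (memory growth) shows up here as an unbounded dict; a real version would need some policy for what to keep.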
cc @thorny crypt (Poetry's solver master)
I'm confident in the idea because I know pip's bottlenecks, I just don't know until I start working on it if it would require a large refactor or if my idea to not explode memory consumption is viable
This will surely speed up some cases significantly. I think such a cache has been introduced in Poetry as part of https://github.com/python-poetry/poetry/pull/5335 (I do not assume that this will help you due to too different code bases but anyway.)
So this is a bit of an odd question, but I am hoping that some people in here might have some ideas. Our internal build system where I work has generally encouraged our packages to take the shape of only ever having a single version available. We have started a move towards more idiomatic tooling, i.e. hatch with pip or uv. One of the areas we are concerned about is resolving conflicting version constraints, where I as a consumer consume two of these single-version-available style packages that declare PyPI dependencies that cannot be resolved easily. An example would be Package Foo declaring a requirement on Numpy > 2 and Package Bar declaring a dependency on Numpy < 2. I need both Foo and Bar, but I don't have the option to select different versions where the constraints would then line up. Are there any solutions to this problem that do not require changing the Single Version Available of first-party packages?
nope, all your dependencies need to have a consistent resolution; python isn’t like node where different dependencies can have independent chains of dependencies
as a person who is partially responsible for numpy 2, sorry!
uv has a couple of different options, either globally overriding dependencies, or individually overriding dependency metadata.
For pip you would need to define a full requirements file and then install it with --no-deps, it has no override functionality.
I'd love to discuss this more generally (what happens on the Python side, rather than with a specific install tool) over on the Python server, perhaps in #packaging-and-distribution there
Sent it in packaging-and-distribution
To be fair, that was just an off-the-top-of-my-head example of what I was dealing with, not an actual real-world one I have seen. It just got the overall idea across of incompatible constraints that are not solvable.
It has been an intentional choice in pip not to support this beyond installing Foo and Bar one at a time, or using --no-deps.
I would be open to the idea of a global override but it's not high enough on my list to implement it myself
(it does seem like Numpy gets used as an example for this problem disproportionately often. Maybe just because it's easily accessible/understandable as an example, regardless of how often people are actually bit by it)
(I do also support a lot of scientists so the chances of them getting bit by something like numpy in this type of case does increase)
So pip's .pre-commit-config.yaml isn't actually a valid yaml file
I was trying to see if we could easily more strictly lint it, or auto format it to be valid without it being ugly, but I dunno, I have a fairly minimal yamlfmt pre-commit working
Experimented a little with performance improvements tonight, really getting the number of Version objects being created down quite a bit, need to raise a fairly simple PR with packaging for quite a big win.
One of the big issues though is because packaging is a vendored independent library it makes a lot of design choices that are not well suited for pip. We could make the API simpler and much faster if it only served pip, while keeping standards compliance.
I get why it's better for it to be independent, but really sways me to the idea for a truly good tool you want to reduce the number of external components.
i had to fork, patch and re-publish some scientific packages for a project i worked
Slaying those Version objects during heavy backtracking:
https://github.com/pypa/pip/pull/13660
https://github.com/pypa/packaging/pull/985
(I'm interested in your benchmark, btw.)
It's the first 2500 rounds of Apache Airflow 3.1.3 with the extra all. The full resolution doesn't currently solve on my computer, I left it for over 24 hours and it was still going.
I wasn't able to find a magic fix so I'm starting with a small number of rounds and seeing where the hot spots are. Next I'll increase the number of rounds and see what starts to dominate, and see if I can find new fixes.
I finally have no more unread notifications in my github inbox
Another simple packaging PR that more than halves the number of Version objects created in heavy backtracking situations. I'm starting to find it difficult to find the Version.__init__ call on the call graph diagrams.
I think I need to give looking at call graphs a break for now, the four PRs I've raised are all quite simple but I spent a long time trying different things and wondering why they weren't working to get to those four, it's a time consuming process of trying something, running a 5 min benchmark, seeing what happened, thinking on it for awhile, trying again. That said, I still have some more hunches for non-trivial improvements, hopefully I can work on them before 26.0.
wow, I forgot to paste the link: https://github.com/pypa/packaging/pull/986
Try and get some PRs reviewed next
what do you use to get call graph diagrams, btw?
gprof2dot, and then dot to turn into a png
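For anyone wanting to reproduce that workflow, a small sketch: dump a cProfile stats file from Python, then feed it to gprof2dot and dot on the command line. The function names and the output path are placeholders, and the workload is a stand-in for whatever benchmark is being profiled.

```python
import cProfile
import os
import pstats
import tempfile


def workload():
    """Stand-in for the real benchmark being profiled."""
    return sum(i * i for i in range(50_000))


def profile_to_file(func, path):
    """Run func under cProfile and write the stats in pstats format."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    profiler.dump_stats(path)
    return path


stats_path = profile_to_file(workload, os.path.join(tempfile.mkdtemp(), "out.pstats"))

# Then, outside Python, turn the stats file into a call graph image:
#   gprof2dot -f pstats out.pstats | dot -Tpng -o callgraph.png
```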
i wish i had noticed these optimization efforts for packaging earlier
@willow flicker applied an optimization that the thing i am building, codeflash, could've found automatically
this is the optimization codeflash came up with, which looks very similar to what henry came up with
i will now run codeflash on the entire packaging repo with the latest main and see what it can come up with
If it helps, you can run a statistical profiler (3.15a2 has one) and then give it the functions that are the slowest
Also, I bet it won't find the atomic/possessive re speedup since models aren't likely to know those were added in 3.11.
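For context, a tiny illustration of what possessive quantifiers and atomic groups do in Python 3.11+: they refuse to give back what they matched, which is what cuts out backtracking. The patterns here are toy examples and have nothing to do with the actual regexes changed in packaging; the demo function is a hypothetical helper that returns None on older interpreters.

```python
import re
import sys

GREEDY = r"a*a"         # a* can give back one "a" so the final "a" matches
POSSESSIVE = r"a*+a"    # a*+ keeps everything it matched, so the final "a" fails
ATOMIC = r"(?>a*)a"     # atomic group: same no-backtracking behaviour


def demo(text="aaa"):
    """Return (greedy, possessive, atomic) full-match results, or None before 3.11."""
    if sys.version_info < (3, 11):
        return None  # the syntax is a compile error on older versions
    return tuple(
        re.fullmatch(pattern, text) is not None
        for pattern in (GREEDY, POSSESSIVE, ATOMIC)
    )


result = demo()
```

On 3.11+ only the greedy pattern matches "aaa"; the other two fail fast instead of retrying shorter prefixes.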
We run our own profiles and traces to determine those sort of things, but im not going to do that yet, we typically run it at the function level and see what comes up
Depends on the model, we’ve found similar optimizations in the past
it's neat that tools are at this level of sophistication now but I still hate the writing style they use in describing the effort
awesome work. did something similar a few months ago and never got around to it. have been deep in the job search
i was unable to get to optimality without breaking the API to explicitly parse Version and Specifier instead of automatically coercing. if performance is a goal, i think that will be necessary. it will also likely help to avoid accidentally using string comparison vs version/specifier's comparison operator methods
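A small illustration of the string-comparison pitfall mentioned above. parse_release is a hypothetical stand-in for a real version parser (it only handles dotted integers), used here so the example needs nothing outside the stdlib; the idea is to parse explicitly once and compare parsed values, never the raw strings.

```python
def parse_release(text):
    """Hypothetical minimal parser: '1.10' -> (1, 10). Not PEP 440 compliant."""
    return tuple(int(part) for part in text.split("."))


def newest(versions):
    """Pick the newest version by comparing parsed keys rather than raw strings."""
    return max(versions, key=parse_release)


versions = ["1.9", "1.10", "1.2"]
by_string = max(versions)      # lexicographic comparison
by_version = newest(versions)  # numeric comparison of the parsed tuples
```

An API that only accepts already-parsed objects makes the first kind of comparison impossible to write by accident, which is the correctness benefit on top of skipping repeated parsing.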
a corresponding optimization is for url strings in the pip codebase itself
I got more improvements coming without breaking the API, but yeah, there's a lot that could be done if the API was internal
that one is less helpful for performance but much more so for correctness. i was also able to improve url % quoting very slightly but i believe the stdlib needs a change to do that better which i've been working on
part of the reason i didn't push was bc i wasn't sure how to handle that. super glad you're doing the work
will keep an eye out for further changes
i also did a much bigger set of changes which performs package finding as a distinct phase before resolution. think this is a great architecture and potentially worth some sort of PEP to represent the normalized result of package finding so that resolvers can be swapped out etc. mostly because package finding is the one place where a compiled language becomes legitimately useful vs python alone to parse the simple repository API
do wish it were possible to get signoff on changes that have undergone multiple rounds of review over years and have become standard practice in other python tools like the fast-deps impl. pip was the first to support zip file metadata fetching after my prototype but after a version with poor performance was merged all my attempts to contribute one with the perf i demonstrated have been rebuffed. astral now claims to have invented the technique.
glad to see more perf work is happening and i would contribute to it if i expected it to ever be accepted
@finite perch do you have any sort of end to end benchmark i can use as a baseline for my optimization efforts?
uv's benchmarking setup might be a helpful starting point
it has warm / cold resolves and installs
thanks!
np! Let me know if you have questions
I don't have anything formal for the stuff you're looking at, I suspect. All my recent stuff has been by hand
I have this but it's very specifically measuring properties of the resolver that aren't related to wall clock time: https://github.com/notatallshaw/Pip-Resolution-Scenarios-and-Benchmarks. Like how many packages it had to visit and how many rounds it took, it's only useful for when checking the behavior of the resolver itself.
Unfortunately, pip, as a project, isn't really in the position to accept major changes. I'd love to get your PRs in, but I have no ability to review PRs of that size that touch parts of pip I've basically never worked in.
We used to have people who were experts in those segments of the codebase and could review large changes targeting those segments, but now there is essentially zero review capacity.
Also, additional complexity for performance wins is not as clear-cut a choice as it may seem. I had my own idea of parallelizing the pyc compilation process fail because it wasn't clear whether the complexity was worth the performance uplift.
I recognise that your PRs are (I think?) moreso refactorings, but then again, the problem of how do we review such large changes comes up again.
It's a difficult problem. I realize that as a maintainer, I'm in a privileged position to accept/deny changes and also know what is likely to be merged or not.
something which is only in my head so far, but which I hope turns out to be feasible and helps pip maintainers with the review load, is that I'd like to evolve pip-tools to one day not cross the private API boundary. I worry that if that doesn't happen, sooner or later it will become unmaintainable. But to make that change, I need to build and get buy-in on proposed new component libraries to vendor, as replacements for things that are currently inside of pip. I don't even know enough at this stage to know if it's feasible, so I'm probably getting ahead of myself even mentioning it aloud. So far I just have toy projects to start messing around with this idea. But I think if there were some kind of generic "dependency finder", for example, with two tools consuming it, that would make some things easier. (It also adds challenges, of course.)
unearth, build, resolvelib and installer are basically all you need to build a simple pip replacement, maybe some of those can be of help?
(wait, what's unearth?)
give dependency constraint, get index link
ooh
Readme says it all
also my plan is to buckle down and get the first unit of work out for PAPER in December (basically, full polish on the actual "install single wheel" part, and the self-install zipapp, and the associated API)
and I'm hoping I'll be able to provide a base that does accommodate more significant changes, the sort of thing that works when the field is greener
full polish on the actual "install single wheel" part
installercode as a prose? (is CaaP even a thing? 🤣 )
I lost several months this year to my own struggles :/
I'm not actually using installer because it no longer seemed helpful when I considered the other things I wanted to fit into it (but also just due to my architectural taste)
I might come to regret that, idk
Yeah, build and resolvelib are exactly the sorts of things I had in mind. I wasn't aware of unearth (I don't think pip uses/vendors that?). The goal I have in mind is to get pip-tools onto a more sustainable architecture. But I also think that there's some pathway to "do it right" that also delivers benefit to pip. TBH I don't think I have enough knowledge yet to work on it -- I still have way too many blind spots within pip when I go source diving.
No pip doesn't vendor unearth, it'd probably be a big project to migrate, if so it would require a maintainer to spearhead it
As Richard was saying, with the amount of review capacity pip has anything other than a maintainer PR needs to be relatively small and/or simple. Even for maintainers it would be difficult to have capacity to do a large refactor.
That's not to say non maintainers couldn't effect a large change in pip, I just think the most likely way to do it would be to start small without large refactoring and keep building on it and perhaps become a maintainer in the process
IMO for core logic like dependency resolution, distribution installation and package finding/selecting, the benefits of pushing responsibility to a vendored library are limited.
Yes, pip maintainers stop being directly responsible for it, but now we're at the whim of the maintainers of those dependencies. This has typically worked out fine in practice since there is overlap between the maintainers of pip and such projects, but we can't really have pip maintainers take on more sub-projects.
Yeah, the difference with a PR raised by a maintainer is that A) I know the code quality is already of a reasonable level, and B) the idea has merit since a maintainer is clearly already on board. That makes it a lot easier to review.
This is exactly why I've been harboring these thoughts mostly in secret: my worry is that it's a bad idea to ask pip maintainers to help split things up into more sub-projects. (Especially since I haven't come back to try to help in months, despite my best intentions.) At the same time, build feels like a great success in reducing complexity for pip and making it possible to share out the maintenance workload. ... Maybe this is all just wishful thinking.
We don't even use build.
We use pyproject-hooks to manage our PEP 517/660 interfaces.
Fun fact, build uses pip
Oh, of course. Because you need pip to install the build backend
Now I feel silly. Thanks for educating me though!
Actually, now I'm really curious how the installation path for a build backend works out inside of pip. I think I ought to read this a bit on my own; it will be good for me.
(you can choose to have build use uv now. But that selection is hard-coded. If you want to use some other custom installer it seems you'll have to setup the environment yourself first and use -n. At which point maybe you only need pyproject-hooks anyway.)
@hidden flame re the #general discussion I appreciate the thought but to the extent that pip is involved in the problem (in turn, to the extent that I understood it!) I suspect the problems are too fundamental to pip's design, and to its default status (the ensurepip bootstrap etc.)
because ultimately, the distro is providing pip with a site-packages folder (assuming sysconfig even works properly) and as I see it, the only sensible way for a package manager to work around that is to make venvs itself
I do kinda regret saying anything in there, though. I hoped I could contribute something useful but it's just getting too heated now
and re the pip zipapp, just for illustration
$ time ./pip.pyz --version > /dev/null
real 0m0.705s
user 0m0.668s
sys 0m0.037s
$ time pip --version > /dev/null
real 0m0.200s
user 0m0.184s
sys 0m0.016s
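(A minimal sketch for reproducing this kind of startup comparison from Python rather than the shell. `time_command` is a hypothetical helper, not something pip ships; the `pip.pyz` path in the comment is taken from the message above and assumed to be in the current directory.)

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Return the best wall-clock time in seconds over `runs` executions."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Baseline: bare interpreter startup, which is always available.
    print("python -c pass:", time_command([sys.executable, "-c", "pass"]))
    # To compare the zipapp against an installed pip, swap in for example
    # [sys.executable, "pip.pyz", "--version"] and
    # [sys.executable, "-m", "pip", "--version"].
```

Taking the best of several runs reduces noise from a cold filesystem cache, which matters when the difference being measured is a few hundred milliseconds.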
It's tempting to just close all AI issues/PRs outright.
What's stopping you?
Some of them seem to be worth something although Damian is the only one willing to put up with them.
I'm not :)
Hmm, my inprocess build deps branch is now erroring with the latest main. It seems like the removal of legacy builds broke the preparer.
Nevermind, the Avoid pip install --dry-run downloading full wheels commit changed the contract with preparation.
it was previously pointing out something about how --dry-run will build sdists, yes? I saw that and thought "oh, another duplicate"
that's been an issue for basically the entire history of pip afaict
Yes.
(Ah I was thinking of pip download actually... as in issues 7995, 1884 etc)
I feel empathy towards wanting to help but not really understanding how, and so doing it wrong. If they can go from the step of posting something not helpful to learning how to be helpful that'd be great, I'm not sure how likely that is of ever happening, but I try and keep an optimistic view
Yea, I agree that we've all been there. On this front, you're more reasonable than me.
It also depends on how junky the issue is, in my experience. Someone showed up on one of my projects, opened a PR, which failed linting, and then opened a slop issue to say "linting rejects this code".
Yeah, it's on a case by case basis, but that is hilarious
Does pip have a stated policy? I keep putting it off but I need to write one for pip-tools
I like how succinct that is. I probably won't be able to keep myself that brief on it, but that's great. Thanks for sharing the pointer
Well, as with everything in pip there was a bit of back and forth on exact wording, but really thanks to Richard for sitting down and writing the final PR wording
sigh, I wish requirement files did not support CLI flags
I was shocked when I found that out
or at least, I very much hope that constraint files do not support CLI flags because I'm not supporting that
But then slowly I found there are entire ecosystems built on it
Largely hidden from public consumption
And inprocess build dependencies and build constraints are now working together
I wonder if people pass --hash in their constraints? Wouldn't be surprised
I guess. We'll find out if it's a problem at some point I suppose.
In-process build dependencies PR is now ready for review: https://github.com/pypa/pip/pull/13450. I hope it doesn't take months to review
We can't remove CLI flags from constraint files, they're too widely used
People don't use --hash in constraints files because it's broken and doesn't work
Great. Well, I think the PR will handle them just fine since I'm just reusing the constraint file parsing code.
Yeah
Clarification! People don't use --hash in constraints files with the "new" resolver, because they're broken and don't work; they might be using them with the legacy resolver
Hell yeah, I like how these tests are looking with some of the new test helpers.
ooh, sophisticated.
I spent a bunch of time on rewriting tests to stop depending on the network. For my personal sanity, it needs to be easier to write tests that "do the right thing" (tm) so we don't regress.
@shy echo while you're here, do you have any opinions/notes on switching pip to use venv for its build isolation? I'm looking to pick up your old PR after in-process build dependencies (hopefully) land.
I did at-mention you on the PR, but I understand your GH notifications are a 
Yup. Slowly catching up on OSS stuff, instead of doing my end of year cleanup.
It was 4am when I got that mention.
It's disabled by default, no?
It might actually make sense to enable it on 3.15+ so that we at least have a 5 year deadline on that front and to make sure that it's a non breaking change for people.
Idk if we should change the default on older Python versions - that'll depend on what portion of package builds would be broken by the change and I don't have a sense of what that looks like nowadays.
A good check might be to see if anything in the top N packages breaks with that change?
Back when I was looking into this, we didn't have uv or even build fully stabilised -- the odds of things being broken in some weird way are definitely lower now.
goes back to preparing for family staying over the holidays
I don't think anything in the top N packages should break with changing to venvs
We use real virtual environments in uv and have never had any problems
We've seen more weird problems with pip's approach
I've seen at least one issue reported to uv where the build depended on the way pip "isolates" Python, but I've seen other issues where it causes problems getting builds to work
I think the main problem for pip is going to be the performance cost
Wait, perf cost? How so?
Huh, I thought all it did was dump a couple of files on disk. Is the stdlib venv doing more stuff still?
I'm very much out of the loop TBH, so I'd absolutely not be surprised if what I'm remembering is completely wrong/outdated.
No idea, never investigated it, I just know python -m venv .venv can be noticeably slow, like sometimes a couple of seconds on Windows
You could use our rust impl
Not unless it was integrated into the standard library, I'm quite supportive of the pure Python only stance
It's the env seeding
Making venv without pip (and Setuptools in older versions) is pretty fast
Oh I see, well, we wouldn't need either of those
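(A small sketch of the "venv without pip" case using only the standard library. `create_bare_venv` is a hypothetical helper name; the timing it prints will vary by platform and is not a claim about how pip's build isolation would perform.)

```python
import tempfile
import time
import venv
from pathlib import Path


def create_bare_venv(target):
    """Create a virtual environment without seeding pip into it."""
    builder = venv.EnvBuilder(with_pip=False)
    builder.create(target)
    # pyvenv.cfg is the marker file that makes the directory a venv.
    return Path(target) / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        start = time.perf_counter()
        cfg = create_bare_venv(Path(tmp) / "env")
        elapsed = time.perf_counter() - start
        print(f"created {cfg.name} in {elapsed:.3f}s")
```

Passing `with_pip=True` instead runs ensurepip inside the new environment, which is the seeding step described above as the slow part.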
I was joking
Were you considering publishing it as a separate package?
The crate is available, but I hadn't considered making it a PyPI package no
We could, but I'm not sure to what end
Wait, removing PYTHONDONTWRITEBYTECODE=1 from our test isolation code is actually resulting in faster test times. Are we not precompiling pip's bytecode before running tests?
I believe it
Oh wow, letting bytecode be written shaved a full minute from a full test suite run locally (3:45 -> 2:45) with only two failing tests.
We spawn a lot of pip subprocesses in our test suite, if Python has to generate bytecode every time, that would explain why our pip startup times are abysmal in CI.
Hmm, we are precompiling pip's bytecode: https://github.com/pypa/pip/blob/e5fd9e4e818a7ae5e4c67a01f938651c92ff1e0c/tests/conftest.py#L376-L381. I
Also PYTHONDONTWRITEBYTECODE=1 is set in the pip script runner, so this only takes effect after test setup (I think?)
Hmm, actually, having recorded filesystem events. I suspect that we aren't compiling the common wheels we're linking into every test environment. This is bad because coverage and pytest_subket run on Python startup.
It may also be beneficial to also not inject coverage unless we're actually collecting coverage.
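(A sketch of precompiling a tree of unpacked packages ahead of time with the standard library's `compileall`. `precompile` is a hypothetical helper and not pip's conftest code. The relevant fact is that PYTHONDONTWRITEBYTECODE=1 only stops Python from writing new .pyc files; existing ones are still loaded.)

```python
import compileall
import tempfile
from pathlib import Path


def precompile(tree):
    """Byte-compile every .py file under `tree`; True if all files compiled."""
    return bool(compileall.compile_dir(str(tree), quiet=1, force=True))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "example_mod.py").write_text("VALUE = 1\n")
        ok = precompile(root)
        cached = sorted(p.name for p in (root / "__pycache__").glob("*.pyc"))
        print("compiled ok:", ok, "cached files:", cached)
```

Running something like this once over the shared wheels before linking them into each test environment would let every spawned subprocess reuse the same cached bytecode.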
We are not collecting coverage, it's still configured to use setuptools as our build backend. I've mused about fixing it, but I don't have the energy to increase test coverage, so didn't seem worthwhile
"real virtual environment" is an interesting turn of phrase
I actually had quite a bit of detail on this in a mostly-done blog post, before I abandoned it for months (and want to refactor the content into other posts when I do start writing again)
(and I do mention it briefly in my pipx tricks post, but the fun part is why the seeding takes so long)
(and actually I can probably give more detail than I originally had if I think about how to collect more evidence)
the pex project has been naming and solving these problems for years. when pants runs tests it generates a pex with all the deps and then a pex with all the sources and composes them via PEX_PATH. this retains eager bytecode compilation in the cached subtasks and is much more friendly to the OS filesystem cache
not at all picking on you; i am trying to say that i do very much want to see your blog post on this topic
one of the reasons i started contributing to pip so much was because i very specifically wanted to try to move more of our work outside of pants itself in order to benefit the general python community
ah well that's part of why I deferred it
there is a lot of material about venv creation being slow, but the original article was about why pip is bootstrapped into venvs in the first place
so I also ended up reaching into topics like the shortcomings of --python
ultimately it felt like too many things to explain in one place; I want to have separate detailed pieces and then something that's a summary with references
and the specific goal i had in mind while hammering away at pip for so many years (especially through install --report) was to enable pex to provide extremely powerful and truly instantaneous venv creation
... well hopefully the shortcoming of --python will stop being a thing at some point
maybe in 2030
i decided to do this for my phd thesis work
heh
i have a pip fork which separates package finding from resolution and it is much easier to understand and to optimize this way
but yeah these things are why I didn't throw in to contribute to pip, and sorry if I come across as trashing the project and not being helpful but the ideas I have just make more sense in a greenfield project
I have some plans to make some tangible progress on big ticket items next year but there's nothing firm and I can't promise anything.
I do greatly appreciate that effort, especially since I recognize the legacy commitments impeding you
oh uh
that's not even really what I had in mind
It is so much easier to start from scratch and build new tooling than write and maintain for years.
pip is old.
yes sorry that was about --python
yes and I had different limitations of --python in mind
i unfortunately did the actual work though over multiple years so this argument unfortunately does not apply here
multiple years on the metadata resolve work https://github.com/pypa/pip/issues/12921 which was going great until people just stopped reviewing my PRs
I'm not actually very familiar with pants but the pex concept is pretty neat
pip is actually really good about PEP conformance!
pants is a really cool build tool which supports recursive build tasks. it's great for corporate monorepos. very novel research. not optimized for general python community
but pex is extremely powerful
the maintainer john sirois is great to work with
one issue i have with uv is that unlike pip or especially pex its little venvs aren't intended to be something other tools can then consume
I'm not surprised!
also they don't use the normal rust zip crate for some reason
Yes, sorry. I don't mean to discount the work you've put in. I was just responding to @timid stag's point.
it took me like three weeks of intense effort but the result was immense. i think the result of package finding as retrieved from the simple repo API should actually be its own PEP file format
Uhh, I find it hard to immediately see how such a thing would be standardized?
What's difficult about doing so?
well my initial thinking was not pip but cargo, and particularly establishing a protocol like pants v1 had for build tasks, where the results of all kinds of build tasks can be audited and transferred across systems. we would frequently be able to provide the .pants.d/ directory as a shared volume to docker builds
impl at https://github.com/pantsbuild/pants/blob/1.23.x/src/python/pants/invalidation/cache_manager.py
we have a specific format like this for what pypi provides over HTTP, but that's not at all the same thing as how to transfer the result of parsing that json or html to another python packaging tool
#12257 and #12258 (linked in https://github.com/pypa/pip/issues/12921) demonstrate exactly how you could make a standardized protocol for that. it's a huge amount of repeated work
i also want pypi to let me limit entries to specific time ranges
oh you could definitely do it
it's just not engineered to serve that purpose is all
because the tool is oriented towards managing everything itself? or because of something specific to what it puts in the environment
compare to e.g. my pex proposal here, which takes advantage of the zip format to move around zip entries without any compression activity. just pure i/o https://github.com/pex-tool/pex/pull/2175
i haven't tried it and i suspect it could be done and would work
using a venv was proposed multiple times throughout the development of this feature https://github.com/pantsbuild/pants/pull/8793 but a venv is very specifically about the act of making an environment for a python interpreter imo
Problem
See pex-tool/pex#789 for a description of the issue, and https://docs.google.com/document/d/1B_g0Ofs8aQsJtrePPR1PCtSAKgBG1o59AhS_NwfFnbI/edit for a google doc with pros and cons of differen...
it's a tool-to-user protocol rather than tool-to-tool
oh that is cool.
(I've just been putting wheels into my self-extracting zipapp as ordinary files. But everything is small, so)
(and also it works better for the initial state I want to set up, including a seeded cache)
this PR is unfortunately my first attempt at describing package finding, then filtering, then resolving as distinct phases of the pip invocation https://github.com/pypa/pip/pull/12258 my current pip fork does more than just tricks to achieve caching of pypi simple api responses like here, but completely separates the implementation of package finding from anything specific to the current invocation
Closes #12184.
This change is on top of #12257, see the +264/-116 diff against at https://github.com/cosmicexplorer/pip/compare/link-parsing-cache...cosmicexplorer:pip:interpreter-compatibility-cac...
cc @hidden flame this is how i devised a long-term fix to the scourge of the --python
... is your package finding separate to the extent that you could make a separate library wheel for it, because I could probably use that
yes!!
i also made a separate fork of the packaging crate because it has absolutely no distinction between parsed and unparsed objects like Version or SpecifierSet at the type level, so you have to check for errors that were already validated earlier. in particular, checking if a Version satisfies a SpecifierSet is fucking cubic due to pathological (not like an edge case, it's just that the impl is pathological) round trip parsing performed upon every single inner comparison
and btw this is very specifically why i mentioned making it a protocol bc parsing massive pypi simple API responses in python itself is the very first time in recorded history that i have encountered a performance issue in pip resolves that python the language was not able to overcome easily
... simple API responses are large? even for a specific package?
the simple API is super overloaded and offers absolutely no form of filtering
the unavoidable act of json parsing itself is a bottleneck especially until free threading is complete
obv you can multiprocess but at this point the ability to tightly control bits and bytes retrieved from a streaming network response makes a dedicated tool in rust or c++ more viable
rust has bootstrapping issues which i'm hoping to work with a rustc team member to radically improve in the next several months
it turns out parsing is a bottleneck in some other places too. my pip fork achieves most parsing speedup by making it a hard runtime error whenever a string is received when a parsed representation was expected
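(A toy sketch of the "parse once, reject raw strings" idea. `ParsedVersion` and `is_at_least` are hypothetical names, and the parser only handles dotted integers rather than full PEP 440 versions; this is not the code from the fork being described.)

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ParsedVersion:
    """A version that has already been validated; comparison never re-parses."""

    release: tuple

    @classmethod
    def parse(cls, text):
        # Parsing (and any validation error) happens exactly once, here.
        return cls(tuple(int(part) for part in text.split(".")))


def is_at_least(candidate, minimum):
    """Compare two parsed versions; passing a raw string is a hard error."""
    for value in (candidate, minimum):
        if not isinstance(value, ParsedVersion):
            raise TypeError(
                f"expected ParsedVersion, got {type(value).__name__}"
            )
    return candidate >= minimum


if __name__ == "__main__":
    newer = ParsedVersion.parse("1.10")
    older = ParsedVersion.parse("1.9")
    print(is_at_least(newer, older))
```

Raising instead of silently parsing the string means a missed conversion shows up as an immediate failure at the call site rather than as repeated hidden parsing in a hot loop.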
but url quoting and unquoting in cpython relies upon a pretty grotesque uncommented hack that works because it triggers a memchr() call
i have a cpython dev branch that tests for sse/avx, then for avx2 support on the host, and would provide very basic methods for searching a single literal byte sequence or finding match positions for a set of bytes across a string or a stream
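(A pure-Python reference for the second operation mentioned above, finding match positions for a set of bytes. `find_any` is a hypothetical name used for illustration; it leans on `re` and says nothing about how a SIMD-backed version would be implemented or exposed.)

```python
import re


def find_any(data, needles):
    """Return the offsets in `data` of every byte that appears in `needles`."""
    if not needles:
        return []
    # Build a byte character class; re.escape handles ], ^, - and backslash.
    pattern = re.compile(b"[" + re.escape(needles) + b"]")
    return [match.start() for match in pattern.finditer(data)]


if __name__ == "__main__":
    # Offsets of the bytes that a URL quoter would need to rewrite.
    print(find_any(b"a b%c", b" %"))
```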
... hold on, we're talking about optimizing things like str.__contains__ behind the scenes now? o_O
it turns out btw that cpython reimplements this pattern (find specific byte(s) to quote or remove within a buffer) several times across the repo for encoding and decoding tasks
(oh, str.replace then?)
actually one big problem is that str.__contains__ is not sufficiently generic and cannot work with streaming inputs (or parallelize across the input, for that matter)
str.replace can perform quoting. it is also very extremely noticeably slower than the uncommented hack cpython currently does
the cpython re interface for the _sre C string search impl is very effective at being robust over decades of python development. in this way it is deeply similar to the emacs regex engine
interesting
but it absolutely does not attempt to optimize too hard
(I've long wanted a python-level way to manipulate the underlying DFAs or whatever beyond regex syntax)
emacs has rx to do that with sexps. i think it's outrageous that regex engines force me to serialize the AST by hand before processing it
this is who i rly rly wanna do a phd under https://jamiejennings.com/posts/2021-09-23-dont-look-back-2/
A \10 in a regex is sometimes a back reference to capture group 10, but not necessarily. It could be an octal literal, depending on if there are 10 or more capture groups.
This post is part of a series, though it can be read independently of the others:
Part 1: Regex and Unicode Part 2: The Four Eras of Regex (this post) Part 3: Back references in regex and grammars (forthcoming) Part 4: Why is everything on fire? (forthcoming) In Part 1 of this series, I explored a number of obstacles to using regex reliably in o...
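(For comparison, a sketch of how Python's own `re` module treats the \10 case from the preview above. This only shows Python's behaviour; other engines resolve the ambiguity differently, which is the point of the linked post. `describe_backslash_10` is a hypothetical helper.)

```python
import re


def describe_backslash_10():
    """Collect how Python's re handles \\010 and \\10 in three situations."""
    results = {}

    # A leading zero makes it an octal escape: \010 is the character 0x08.
    results["octal"] = re.fullmatch(r"\010", "\x08") is not None

    # With ten groups defined, \10 is a back reference to group 10.
    ten_groups = "(a)" * 9 + "(b)" + r"\10"
    results["backref"] = re.fullmatch(ten_groups, "a" * 9 + "bb") is not None

    # With fewer than ten groups, Python raises instead of guessing octal.
    try:
        re.compile(r"(a)\10")
        results["rejected"] = False
    except re.error:
        results["rejected"] = True

    return results


if __name__ == "__main__":
    print(describe_backslash_10())
```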
also see my emacsconf talk which describes some fallacies with user interfaces for text search https://emacsconf.org/2024/talks/regex/
the result of this work would be less about the SIMD instructions and more about an interface for composable search/match/replace operations instead of needing to create bespoke C impls for every new type of string parsing
but for package finding itself, i think there is a really strong case for protocolizing that in a json format
https://docs.astral.sh/uv/reference/internals/resolver/
astral has this page (their only "internals" doc lol):
finding a set of version to install from a given set of requirements, is equivalent to the SAT problem and thereby NP-complete
this is extremely widely believed in the industry but is actually false. you can transform it to an instance of a SAT problem, just like every other problem in NP, but the converse reduction is extremely tricky. there's also a closed-world assumption being made that is not actually a feature of the problem itself but of the aforementioned transformation into boolean SAT
uv is an extremely fast Python package and project manager, written in Rust.
this part is good to hear someone else say though
The slowest part of resolution in uv is loading package and version metadata, even if it's cached.
(someday i will shame astral into crediting me for all their caching techniques which charlie marsh shamelessly copy/pasted from the very detailed explanations and implementations i provided to pip over the course of several years, after not hiring me then ghosting me when i tried to keep in touch)
For most resolutions, the resolver doesn't need to backtrack, picking versions iteratively is sufficient. If there are version preferences from a previous resolution, barely any work needs to be done.
making use of cached results like this means you risk choosing different versions nondeterministically depending upon the state of the local cache, which is radioactively dangerous in every way
that's why protocolizing intermediate phases of resolution is important, so you can achieve strong guarantees (e.g. that no new versions of any relevant package have been published since the last resolve) which enables very powerful inferences like this
some reasons we currently run into issues with simple repository API performance:
(1) the API https://packaging.python.org/en/latest/specifications/simple-repository-api goes into great depth on Accept header negotiation, but there is currently no requirement to respect any form of Cache-Control. fastly currently does this really cute thing where they won't return a simple 304 if it was in fact cached but had to bounce across more than one server for reasons which employ load-bearing internal jargon
(2) no form of filtering is supported at all. pip's package finding (except in my fork) does client-side filtering by matching python while streaming the results because there are quite a lot of them. this is a significant bandwidth cost that could be improved in particular if we could support time range queries (or even just one-sided ranges like "everything since this utc timestamp"). pypi currently unofficially supports Changed-Since with Cache-Control, but that only produces the boolean 304 (short) vs 200 (long) cases
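(A sketch of the two points above using only the standard library: content negotiation plus a conditional request on the client side, and a time-range filter done locally because the index offers none. `build_request` and `files_since` are hypothetical helpers; whether a given index honours `If-None-Match` is an assumption, and the `upload-time` field is only present when the index serves it in its JSON responses.)

```python
import urllib.request
from datetime import datetime, timezone

SIMPLE_JSON = "application/vnd.pypi.simple.v1+json"


def build_request(project, etag=None, index="https://pypi.org/simple"):
    """Build (but do not send) a request for a project's file listing."""
    headers = {"Accept": SIMPLE_JSON}
    if etag is not None:
        # Ask the server to answer 304 if nothing changed since `etag`.
        headers["If-None-Match"] = etag
    return urllib.request.Request(f"{index}/{project}/", headers=headers)


def files_since(payload, cutoff):
    """Client-side stand-in for a 'everything since this timestamp' query."""
    kept = []
    for entry in payload.get("files", []):
        stamp = entry.get("upload-time")
        if stamp is None:
            continue  # no timestamp served for this file
        uploaded = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if uploaded >= cutoff:
            kept.append(entry["filename"])
    return kept


if __name__ == "__main__":
    # Hand-written sample data, not a real index response.
    sample = {
        "files": [
            {"filename": "demo-1.0.tar.gz",
             "upload-time": "2023-01-01T00:00:00.000000Z"},
            {"filename": "demo-2.0.tar.gz",
             "upload-time": "2024-06-01T12:30:00.000000Z"},
        ]
    }
    cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(files_since(sample, cutoff))
    print(build_request("demo", etag='"abc"').header_items())
```

The filter still requires downloading and parsing the whole listing first, which is the bandwidth and parsing cost a server-side time range query would avoid.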
i suspect this is something astral supports in their proprietary repository offering
i don't think filtering on python version or impl is anywhere near as much of a slam dunk, especially because
(a) it would be much harder for pypi to implement, and they already took almost two years to support PEP 658 after it was accepted. that PEP was specifically intended to address FUD raised about the http range request technique i invented, which relies upon HTTP standards that pypi also suddenly dropped support for and explicitly refused to engage with me about.
(b) it's not a simple database query that corresponds to a canonical totally-ordered timeline so much more work would be necessary to agree on how to serialize a general query over the space of python compatibility.
in general, i want pypi to declare the specific set of HTTP standards it implements; there is NO!!! specification for what pypi has to support or what it requires for HTTP requests of e.g. specific wheels and that's just very unserious.
pip implements a workaround for the lack of a separate metadata API => pypi breaks negative byte range requests and everyone i ping says they can't comment on that and furthermore indicates that pypi can't be expected to support any particular http standard
a pip maintainer takes this at face value and makes a PEP exposing a separate endpoint for metadata => pypi drags their feet and again no one can comment and i am told this is because they have no engineering headcount for things like implementing accepted PEPs we specifically created to circumvent pypi's refusal to entertain any discussion about supporting standard HTTP features
but pypi does have a fully funded engineering position posting pdfs like this https://alpha-omega.dev/wp-content/uploads/sites/22/2025/10/ao_wp_102725a.pdf which specifically describes the existence of nonconformant implementations of the zip spec as if astral intentionally choosing to disregard the zip spec is anyone else's problem but astral's
ZIP even supports deleting files within an archive by rewriting the Central Directory to remove the reference to a Local File.
no it fucking doesn't!!! it absolutely fucking doesn't!!! the actual spec https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
4.3.2 Each file placed into a ZIP file MUST be preceded by a "local file header" record for that file. Each "local file header" MUST be accompanied by a corresponding "central directory header" record within the central directory section of the ZIP file.
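(A sketch of one direction of the rule quoted above, using the standard library's `zipfile`: for every central directory entry, check that a local file header signature sits at the recorded offset. `entries_with_local_headers` is a hypothetical helper; it does not scan for local headers that lack a central directory entry, and it is not a full conformance check.)

```python
import io
import zipfile

LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"


def entries_with_local_headers(archive_bytes):
    """Map each central directory entry to whether its local header exists."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            info.filename: (
                archive_bytes[info.header_offset:info.header_offset + 4]
                == LOCAL_HEADER_SIGNATURE
            )
            for info in zf.infolist()
        }


def build_demo_archive():
    """Build a tiny in-memory archive to run the check against."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("demo/__init__.py", "")
        zf.writestr("demo/data.txt", "hello")
    return buffer.getvalue()


if __name__ == "__main__":
    print(entries_with_local_headers(build_demo_archive()))
```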
the CVE assigned to uv was specifically regarding uv failing to correctly implement the zip file specification. the wheel format is explicitly specified to require conformance to the zip file specification. this would have been immediately caught by fuzzing, but uv uses a mysterious rust crate that doesn't fuzz anything at all, and not the rust crate named zip, which i have contributed a great degree of performance and security improvements to and is by far the most trustworthy implementation.
The implementation with the fewest differentials was "ronomon/zip," which is implemented explicitly to reject many sources of differentials at the expense of compatibility with all ZIP archives.
the zip crate also rejects "differentials" which are explicitly forbidden by the ZIP spec. it does not lose compatibility with any ZIP archives, because these differentials arise from individual implementations that don't employ industry-standard protections like compiler-enforced memory safety and several forms of randomized testing i.e. fuzzing.
Packaging ecosystems like Python's often have many tools with different organizational structures and roadmaps, and therefore cannot easily coordinate on ecosystem-wide challenges like archive format features.
translation:
- pypi stonewalls any attempt from pip developers to establish a point of contact to discuss features we need
- pypi will suddenly drop support for http standards like negative byte ranges which are actively used in pip to achieve its own performance goals, and continues to stonewall an increasingly public sequence of attempts to establish a point of contact regarding pip's expectations about archive formats
- a PEP which was explicitly designed to circumvent reliance upon specifics of any archive format mysteriously takes years to go live, with no corresponding announcement or indeed any explicit guarantee of support for the accepted PEP
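(A sketch of why the negative byte ranges mentioned in the list above are useful for archives: a ZIP's end-of-central-directory record sits at the tail of the file, so a suffix range request can fetch the index without the payload. `tail_request` and `tail_has_central_directory` are hypothetical helpers, the URL is a placeholder, and nothing here is pip's actual implementation.)

```python
import io
import urllib.request
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # end of central directory record


def tail_request(url, nbytes):
    """Build (but do not send) a request for the last `nbytes` of a file."""
    return urllib.request.Request(url, headers={"Range": f"bytes=-{nbytes}"})


def tail_has_central_directory(archive_bytes, nbytes):
    """Check whether the final `nbytes` contain the end-of-directory record."""
    return EOCD_SIGNATURE in archive_bytes[-nbytes:]


def build_demo_wheel():
    """Build a tiny wheel-shaped archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("demo-1.0.dist-info/METADATA", "Name: demo\nVersion: 1.0\n")
    return buffer.getvalue()


if __name__ == "__main__":
    wheel = build_demo_wheel()
    # With no archive comment, the record is exactly the last 22 bytes.
    print(tail_has_central_directory(wheel, 22))
    print(tail_request("https://example.invalid/demo.whl", 1024).header_items())
```

A server that ignores or rejects the suffix range form forces the client to download the whole file just to read a few kilobytes of metadata.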
This is especially true when the issue arises from an ambiguity in a packaging standard, which can take months to fix and often requires public correspondence, meaning users are left to fend for themselves from exploits while standards are fixed.
there is absolutely zero ambiguity in the packaging standards. there was also zero ambiguity in PEP 658, but pypi did indeed force users to fend for themselves from exploits during that time.
Applying protections at the public package repository level would mean that would-be attackers can't exploit public information about a vulnerability at scale;
hence PEP 658, which circumvents this issue entirely yet pypi still doesn't actually guarantee support for and actively ignores the existence of
However, applying protections like rejecting archives from a large public package repository like PyPI is not without challenge.
rejecting uploads is the only solution pypi will accept, and they demonstrate active contempt for the entire PEP process.
This was done after noticing that, despite the wheel package specification requiring installers to check the contents of the archive with the filenames listed in "RECORD", no popular installer was taking this step.
that's because this is actively false in two separate ways: https://packaging.python.org/en/latest/specifications/binary-distribution-format/
Although a specialized installer is recommended, a wheel file may be installed by simply unpacking into site-packages with the standard "unzip" tool
so this has now escalated beyond simply ignoring the existence of accepted PEPs like 658, but to unambiguously fabricate "requirements" and accuse installers of nonconformance. but there's more:
Update distribution-1.0.dist-info/RECORD with the installed paths.
so apparently, correctly installing a wheel file actually requires the installer to completely ignore the contents of RECORD and accept whatever the zip file contains
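(For concreteness, a sketch of the check being argued about: comparing the filenames in a wheel archive against the paths listed in its RECORD. `record_mismatches` is a hypothetical helper; it takes no position on what the specification requires of installers and ignores hashes and sizes.)

```python
import csv
import io
import zipfile


def record_mismatches(wheel_bytes, record_path):
    """Return (files missing from RECORD, RECORD rows missing from archive)."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        archive_names = {n for n in zf.namelist() if not n.endswith("/")}
        record_text = zf.read(record_path).decode("utf-8")
    rows = csv.reader(io.StringIO(record_text))
    recorded = {row[0] for row in rows if row}
    return archive_names - recorded, recorded - archive_names


def build_demo_wheel():
    """Build a tiny wheel-shaped archive with one file left out of RECORD."""
    record = "demo/__init__.py,,\ndemo-1.0.dist-info/RECORD,,\n"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("demo/__init__.py", "")
        zf.writestr("demo/extra.py", "")  # deliberately not listed in RECORD
        zf.writestr("demo-1.0.dist-info/RECORD", record)
    return buffer.getvalue()


if __name__ == "__main__":
    unlisted, missing = record_mismatches(
        build_demo_wheel(), "demo-1.0.dist-info/RECORD"
    )
    print("in archive but not in RECORD:", sorted(unlisted))
    print("in RECORD but not in archive:", sorted(missing))
```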
seth larson produced a document with multiple outright falsehoods regarding spec conformance and ignored the PEP which expressly avoids this issue that pypi still refuses to guarantee support for. i don't understand what the point of the PEP process is or why pypi ignores anything pip needs but will shift into crisis mode to uncritically repeat anything william woodruff says while ignoring the code i implemented in pip and rust zip to solve these precise problems.
anyway that's actually a third vulnerability disclosure in packaging standards from this weekend: simply lie in RECORD, and pypi won't actually protect users from your exploits. the other two are in PEP 751 and in the METADATA format itself, both traceable to brett cannon: https://github.com/pypa/packaging/issues/570#issuecomment-3678243940
anyway, after seeing pypi stonewall any attempt to codify support for http standards, suddenly break support for negative range requests that pip used to avoid any archive confusion, then slow-walk PEP 658, then claim that nonconforming zip implementations which don't even do fuzz testing make zip impossible to secure, then repeatedly state unambiguous falsehoods about the requirements for spec conformance in an official security update, asking pypi to support something like a time range seems like a waste of my time
the other interesting thing about supporting time interval queries is that it allows for checksum-based verification (e.g. via merkle tree) that a simple repository API hasn't retroactively modified its response at all. this is also something that a protocol for parsed package versions from a simple API response would achieve, enabling users to perform complex queries with entirely local information. this is what we would need in order to do the approach astral's resolver documentation describes, which reuses a previous output
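(A minimal sketch of the checksum idea: a Merkle root over the ordered file entries of an index response, so two parties can confirm they saw the same listing. `merkle_root` and the leaf encoding are assumptions made for illustration; no existing index publishes such a value.)

```python
import hashlib


def merkle_root(leaves):
    """Return a hex Merkle root over an ordered list of byte strings."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    # Prefix bytes keep leaf hashes and interior hashes in separate domains.
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


if __name__ == "__main__":
    # Hypothetical leaf encoding: "filename:digest" per published file.
    entries = [b"demo-1.0.tar.gz:aaaa", b"demo-2.0.tar.gz:bbbb"]
    before = merkle_root(entries)
    after = merkle_root(entries + [b"demo-3.0.tar.gz:cccc"])
    print("unchanged listing reproduces root:", before == merkle_root(entries))
    print("new upload changes root:", before != after)
```

Combined with a time interval query, a client could recompute the root for "everything up to timestamp T" and detect whether earlier entries were modified after the fact.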
i was really disappointed about the response to the telemetry too. making telemetry a json format and then making it a single unambiguous override in pip (which also makes opt-out a special case of an empty override) is clearly better for data quality and unifies that data across alternate clients such as uv. the open-source user-agent parsing code is described as poorly maintained and has a very complex optimization system. i cannot identify any engineering-based rationale against standardizing this information, and particularly in the lack of any opt-out.
in https://discuss.python.org/t/pre-pep-user-agent-schema-for-http-requests-against-remote-package-indices/104006 i described a very specific scenario i have performed and which occurs in every corporate package index: downloading a copy of some package in order to move it across a security boundary into our trusted environment. this is where i saw a full opt-out being irreplaceable. this was simply ignored
(This is my first attempt to propose a packaging standard in this forum. I am basing this off the instructions at PyPA Specifications - PyPA documentation. Those instructions seem to indicate that a PR against GitHub - pypa/packaging.python.org: Python Packaging User Guide should be provided at the same time, but I'm not seeing many examples...
i have wasted multiple years of my life repeatedly getting to final-stage interviews and then getting rejected. during and after my unfortunately brief tenure at LLNL, i realized that everything i'd worked on at twitter could be generalized to make python packaging awesome. i wasted so much time assuming good faith, only to find people actively misrepresent my ideas without my consent:
- fast-deps assigned to a GSoC student and shipped to prod without consulting me, then smeared as "broken", while my implementation is simply never accepted
- astral claims to have invented http range requests on zip files after rejecting me. no one corrects them or cares that charlie marsh's VC investment is predicated upon a misrepresentation
- pep 751 just ignores pip install --report
it's been such an actively contemptuous waste of my time and energy to contribute to a community who will concern troll me, stonewall any attempt to escalate, then just take credit for what i invented
i respect william woodruff's pride in his work and his patience with the PEP process. seth larsen just writes easily-disproved falsehoods in official documents and says out loud that having public discussions is problematic. that's worthy of a monty python skit
i hope my experience may be useful to others. thanks tzu-ping chung and daniel holth for your work and for your reviews and i hope you can achieve more here but i think pip and pypi and the PEP process are deeply compromised.
the credit stealing hurt the most
Fyi, I think that should now be fixed in main from my PRs:
https://github.com/pypa/packaging/pull/985
https://github.com/pypa/packaging/pull/986
https://github.com/pypa/packaging/pull/989
https://github.com/pypa/packaging/pull/999
https://github.com/pypa/packaging/pull/1005
And Henry did a bunch of performance work on versions: https://github.com/pypa/packaging/pulls?q=is%3Apr+author%3Ahenryiii+is%3Aclosed+perf+
I can only speak for myself, but personally I found this proposal to be tackling too much at once.
I would love to discuss a dedicated proposal for a telemetry spec, but how a tool overrides telemetry data I think should be left up to tools and not specified in that specification.
I think this creates way more points to discuss, and cross-discussing them multiplies the number of things that can be discussed: not only do we have to agree on a good spec proposal, but all tools also have to agree to the prescribed behavior.
Maybe I get too overwhelmed too quickly, but I largely sat out that discussion because of this.
The personal attacks here seem quite inappropriate.
Community health files for the Python Packaging Authority - pypa/.github
I've locked this channel for the time being since the topic of conversation has veered heavily into off-topic category.
Threads?
I've pre-written most of the 26.0 release post. This way, I have a post ready to go for pip 26.0 even though I am going to have basically no free time near the release date.
Hmm, 6% of our total test runtime still consists of one massively parametrized keyring test. There is some caching/environment reusing that could be added, but I haven't been able to get it working yet.
I'm flying back home now, eager to do a few bigger reviews and get some PRs out the door later this week, assuming my jet lag gives me more time to work on stuff, not less
Hi, FYI, I did unlock this channel a few days ago. Feel free to go back to spending too much of your free time on managing digital packages
Roughly when would packaging 26.0 need to be released to be included in pip 26.0? I forget when in January pip is aiming for.
If I'm doing the release I will aim to release on the 30th Jan, but please don't rush to get things out for pip, we'll just get it early in the next release cycle
I'd like to get an RC out in a day or two, and full release in a week or so, but it could always end up delayed by standardization questions (or if people hit issues with the RC, hoping it will get some exposure)
You've been testing pip with current main, I think?
I've not run a full test suite yet, I will tonight or tomorrow, and if packaging publishes an RC I will open a draft PR with that vendored and keep it up to date
Nice, thanks!
I've removed all "good first issue" labels, there's been a recent pattern of users posting almost identical questions, I think they are doing it as part of some course?
I have no problem guiding students how to submit a PR, but all the actual issues with that label have some nuanced questions that need answering before submitting a PR.
Agreed.
urllib3 vulnerabilities may force our hand and force us to drop Python 3.9 support soon: https://github.com/pypa/pip/issues/13745
I've responded, I don't see any reason these are special
would be so cool if pypi supported a timespan (or rolling hash) range for querying package links. that's the only part i can't optimize away. no clue how likely it would be to get accepted since it took several years for PEP 658 support and nobody could be reached about it. package uploads should be architected as an explicitly append-only log and it would be super cool to work with someone on that
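A minimal sketch of the append-only idea, assuming a plain hash chain rather than a full merkle tree, and assuming the log entries are just filenames in upload order. None of this is an existing PyPI or pip API; `chain_digest`, `log_digests`, and `verify_extension` are hypothetical names. A client that remembers the digest of the log it saw last time can check that a newer response only appended entries and did not rewrite history:

```python
import hashlib

# Fixed starting value for the chain (an arbitrary choice for this sketch).
GENESIS = b"\x00" * 32


def chain_digest(previous: bytes, entry: str) -> bytes:
    """Fold one log entry into the running digest."""
    return hashlib.sha256(previous + entry.encode("utf-8")).digest()


def log_digests(entries):
    """Return the running digest after each entry, in order."""
    digests = []
    current = GENESIS
    for entry in entries:
        current = chain_digest(current, entry)
        digests.append(current)
    return digests


def verify_extension(known_digest: bytes, known_length: int, entries) -> bool:
    """Check that `entries` still begins with the log the client saw before.

    `known_digest` is the digest the client stored after `known_length` entries.
    """
    if known_length == 0:
        return True
    if len(entries) < known_length:
        # The log got shorter, so something was removed.
        return False
    return log_digests(entries[:known_length])[-1] == known_digest
```

With a chain like this, a "give me everything since digest X" query only needs to return the new suffix, and the client can verify it locally against the digest it already holds.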
You'd need to discuss this in #pypi or on the warehouse repo, I believe there are many ways to query their data outside python packaging standards. Certainly lots of corporations just mirror the whole thing and query it locally. But I've never spent any time looking at PyPI implementation details.
I'mma try to get some reviews done either tonight or tomorrow
I think I've found a bunch of subtle bugs to do with yanking, that may require me to make a DPO post and ask for a spec clarification
The inconsistencies and gaps are limitless!
I think I've pinned down all inconsistencies to do with versions and specifiers.
The big source of inconsistencies now are to do with versions/releases vs. distributions.
Yanking is defined on the distribution level, but it has behavior defined on the version/release level.
I need to write some tests, but I'm pretty sure no package tool fully respects the ability to yank a single distribution from a release, but leave others unyanked. And PyPI has no functionality to do this anyway.
But I need to spend some time looking at pip's implementation, it might be possible to make it spec compliant without too much complexity.
I'm going to keep a close eye on this distribution definition vs. release behaviour for future packaging PEPs
Today PyPI will allow you to delete a distribution file, but only allows yank at the release level, and archive at the project level
Actually, I'm just reading PEP 592 for the first time in a long time, and I have to say I don't think this went through the same level of scrutiny as a packaging PEP would receive today. This is the first paragraph under the "installers" section, and I do not understand the intent of this paragraph:
The desirable experience for users is that once a file is yanked, when a human being is currently trying to directly install a yanked file, that it fails as if that file had been deleted. However, when a human did that a while ago, and now a computer is just continuing to mechanically follow the original order to install the now yanked file, then it acts as if it had not been yanked.
As an installer, like pip, how am I supposed to know if a "human being" or a "computer is just continuing to mechanically follow the original order" is calling me?
Annoyingly the PEP repeatedly refers to the "yanked release" everywhere except the specification
This is definitely going to need a DPO thread
Is there not a presupposition of a lock file? I think that's the intention, if you've "locked" to a yanked release, you'll still get it, but if you're trying to resolve a set of constraints, a yanked release shouldn't be used in that equation
That makes sense
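A rough sketch of that reading, following the approach PEP 592 suggests for installers: yanked files are ignored unless they are the only match for a specifier that pins an exact version with `==` (no `.*`) or `===`. This is not pip's actual code; `Candidate`, `is_exact_pin`, and `select_candidates` are hypothetical names, and `matches` stands in for whatever specifier matching the tool already has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    version: str
    yanked: bool = False


def is_exact_pin(specifier: str) -> bool:
    """True for '==1.2.3' or '===1.2.3', but not for a range like '==1.2.*'."""
    spec = specifier.strip()
    if spec.startswith("==="):
        return True
    return spec.startswith("==") and not spec.endswith(".*")


def select_candidates(candidates, specifier, matches):
    """Return the candidates a resolver may consider for `specifier`.

    `matches(version, specifier)` is a caller-supplied predicate (an
    assumption of this sketch, not a real library function).
    """
    matching = [c for c in candidates if matches(c.version, specifier)]
    usable = [c for c in matching if not c.yanked]
    if usable:
        # Non-yanked matches exist, so yanked ones are ignored entirely.
        return usable
    if is_exact_pin(specifier):
        # Everything matching is yanked, but the request pins an exact
        # version (the "mechanically following the original order" case).
        return matching
    return []
```

The exact pin is the only signal the installer gets that it is replaying an earlier decision rather than resolving fresh constraints.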
This might be helpful https://docs.pypi.org/project-management/yanking/
Not from a specification point of view, but good to know there are pypi docs on it
I don't think we should support yanking at the distribution level
I don't think any index supports that?
I agree, but in the specification part of the PEP it just says that the file can be yanked
And while the PEP refers to yanked releases, it never makes an attempt to define how that relates to a file being yanked.
So the thread would be to confirm that the expectation is releases are yanked, and the specification is how that's represented in the simple API, so if an installer sees any file yanked it can and should assume the whole release is yanked.
Otherwise the whole section on what installers should do doesn't make any sense.
I'm not even sure what would happen if something was partially yanked in uv
In Poetry, partial yanks are supported and a release is considered as yanked only if all files of it are yanked. (That's how I interpreted the spec.)
Yes, I would say it's interpretable that way, but I don't think that was the intent
Okay, I've re-re-read the PEP, I think what Poetry has implemented is correct. Specifically because of the lines:
In Warehouse, the user experience will be implemented in terms of yanking or unyanking an entire release, rather than as an operation on individual files, which will then be exposed via the API as individual files being yanked.
Other repository implementations may choose to expose this capability in a different way, or not expose it at all.
Huh!
Going to make a pip issue to track supporting yanking files vs. yanking releases
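To make the two readings concrete, here is a small sketch that derives release-level yank status from per-file flags. The "all" policy matches the Poetry behaviour described above (a release counts as yanked only if every file is yanked); the "any" policy is the alternative reading where a single yanked file marks the whole release. The function name and the `(version, filename, yanked)` tuple shape are assumptions for illustration, not any tool's real data model.

```python
from collections import defaultdict


def yanked_releases(files, policy="all"):
    """Return the set of versions considered yanked.

    `files` is an iterable of (version, filename, yanked) tuples.
    policy="all": yanked only if every file of the release is yanked.
    policy="any": yanked if at least one file of the release is yanked.
    """
    if policy not in ("all", "any"):
        raise ValueError(f"unknown policy: {policy!r}")
    flags_by_version = defaultdict(list)
    for version, _filename, yanked in files:
        flags_by_version[version].append(yanked)
    combine = all if policy == "all" else any
    return {v for v, flags in flags_by_version.items() if combine(flags)}
```

The two policies only disagree on partially yanked releases, which Warehouse never produces since it yanks whole releases, so the difference would only show up against other index implementations.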
I will try to squeeze in a PR review of the --uploaded-prior-to PR this week, but most likely it will have to wait until this weekend.
Thanks, but as always, don't do work you can't afford to do, we're all volunteers here and I get that for my PRs
I've learnt a lot about CI best practises in the past week or so for packaging improvements, once pip 26.0 is out the door I'm going to look to make some improvements
LMK if it'd be helpful to bounce ideas on this, i've been (very slowly) using zizmor to clean up CI on a few different PyPA projects over the last year
I'm going to start curtailing discussions on PRs that aren't about the implementation, having lengthy discussions about if the goal of the PR is valid makes it difficult to review the code, I'm going to start pushing those discussions into issues (or even DPO threads where that's appropriate). I'm very guilty of this myself, so I'm not throwing any blame or shade to particular commenters, but I want to reduce the barriers to PR review to a minimum.
Is pip going to eventually drop support for uninstalling eggs?
There is a bunch of complicated code to uninstall various kinds of non-standard installs, so I guess that at some point we'll want to clean the room.
Why do you ask?
Mostly curiosity. I was examining the uninstall code and was surprised to see how complex it was.
I don't think uv supports any of this non standard stuff, and no one appears to be demanding they do, so that's been a good indicator that pip can probably deprecate and remove it without too much trouble
I haven't seen any load-bearing eggs in projects in quite a while
I think only old projects use eggs and pypi stopped accepting the uploads some time ago
I think the last time I've seen eggs is on projects stuck on a very old setuptools doing development builds. But that probably doesn't work with modern pip
We do have some support, e.g., https://github.com/astral-sh/uv/pull/4082
Interesting
We've had some problems with it though
Legacy editables (via easy_install.pth and .egg-link files) have only been removed from pip in the last release. So we may want to keep supporting uninstalling those for a little while.
In general, yes, I would be open to removing support for the ancient distutils egg installation format, but it's also not really a huge amount of tech debt.
So, at some point, someone may decide to put in the time to clean things up, but it's not exactly a pressing concern.
@finite perch in terms of what's left from me for pip 26.0, I would like to get my egg fragment removal PR in and possibly a small follow up PR for inprocess build deps. If I get around to the latter, I'll likely merge it without review since the feature is experimental anyway.
My week is pretty busy so I'm not sure when I will have time, but I'll slot it in somewhere (my to-do list is a living creature and evolves day to day).
@hidden flame noted, there's a couple of comments on the egg removal that need addressing, please don't merge anything immediately, I'll probably have time to quickly look over a PR
CI is failing, taking a look