#internals-and-peps
1 messages Β· Page 5 of 1
what did?
the first did less loops
actually faster
so, the conversation was good but i've gotta go.
bye
in timeit the less loops produced by the command means the slower it is
just so it won't take too long to time something
Try it again with bigger initial fast_thing:
fast_thing=(1,)*1000
testing performance with your shell time is like measuring the temperature of tea by tossing the thing into lava and seeing how quickly it evaporates
but also starting the stopwatch the moment you toss the liquid
it tells you something, and it might even be related to what you want to know!
attempting to redefine function with modified ast; should this be enough?
func.__code__ = compile(tree, '<preprocessed>', 'exec')
wait no that's dumb
or not
not sure if im losing info like this or what
maybe i need a way to compile only the FunctionDef node
is that possible
^ solution: compile(tree, '<preprocessed>', 'exec').co_consts[0]
does python have a frozen dict type
no
so if i have to store dicts in sets i have to convert it into an immutable type like tuple or frozen dataclass?
yes
how about making a hashable dict subclass
I'm not sure - that might work
i think there's a frozendict that uses a HAMT on pypi
also its weird to think that not all collections in python are recursive data structures
sets cannot contain sets
ah, it's immutables on pypi. there's also frozendict which i think is just an immutable hashmap
well, sets can only contain immutable/hashable types. A set is neither of those
ik
still weird and unintuitive to think about
hello everyone
AS a full stack developer, I have 6 years exp in python
If you have any problem, just feel free to ask
not here
is there a reason why reduce is tucked in functools while map and filter isnβt
it was a builtin in Python 2 but got demoted because it was rarely used
isnβt map and filter just as rarely used as reduce since comprehensions exist
not in my experience. I see map() and (less so) filter() used pretty regularly, reduce() very rarely
i use map sometimes, filter and reduce never.
e.g. the mypy source code has about 11 uses of map(), none of filter and reduce
i guess comprehensions and genexprs become more unwieldy when you introduce multiple collections so there is still merit in using map?
if you want to map/filter using a named function, it's shorter to use map/filter instead of a comprehension
e.g. this one in mypy str_ver = ".".join(map(str, python_version))
the alternative would be something like ".".join(str(part) for part in python_version)
which is longer and requires you to come up with a variable name
str(_) for _ in python_version π
does python use chaining or open addressing for hash collisions
Open addressing
!pep 603
that looks interesting
There is also builtin HAMT class, but you cant access it
I think the most rare builtin is "ascii"
Also "oct" and "memoryview"
also id
id is useful for debugging
why shouldn't set be hashable? it's the only built-in iterable that automatically removes duplicates and stores only hashable elements
i get that it's mutable but that doesn't give a good enough reason as to why it shouldn't be hashable
can we not do dynamic hashing like ```py
python pseudocode; added .last_hashed_size field
def set_hash(self):
if self.hash == -1 or len(self) != self.last_hashed_size:
... # hash the set
self.last_hashed_size = len(self)
return self.hash
there is frozenset
Hashing set by contents doesn't make sense for the same reason hashing a list by contents doesn't make sense
yeah but i'll have to unfreeze and freeze it back again to put it in a set
That's a good thing, because now it is clear in the code that you'll need to reinsert it wherever it is stored
it is more or less impossible to write a hash-based data structure that can handle the hash of its contained elements changing
the hash being stable is fairly important since that's what decides where in the datastructure the thing goes
Hash of object must not change during lifetime of object.
Also equality must imply equality of hashes.
Also hashes should be good to get fewer hash collisions (so hash===42 is not a good idea)
Can you satisfy this requirements for set hashes?
If hash chnages, you will get weird behaviour (segfault, memory leak, idk)
assuming people don't go through the proper APIs, yes
after considering the implementation of set and dict i've concluded that the worst that can probably happen is duplication of set-type elements
I doubt you can segfault, but it will not actually work
!e syntactic sugar omitted for brevity
from functools import reduce
from operator import xor
class Set:
def __init__(self, elements):
self.elements = set(elements)
def __hash__(self):
return reduce(xor, self.elements)
d = {}
s = Set((2,3,4))
d[s] = 42
s.elements.add(1)
print(d[s])
@quick snow :x: Your 3.11 eval job has completed with return code 1.
001 | Traceback (most recent call last):
002 | File "<string>", line 13, in <module>
003 | KeyError: <__main__.Set object at 0x7f2ae91b05d0>
I don't really know why PEP-8 specifies that the space between function has to be 2 lines. I don't like it
!e
from functools import reduce
from operator import xor
class Set:
def __init__(self, elements):
self.elements = set(elements)
def __hash__(self):
return reduce(xor, self.elements)
d = {}
s = Set((2,3,4))
d[s] = 42
s.elements.add(1)
s2 = Set((1,2,3,4))
print(d[s2])
```this also won't work
@flat gazelle :x: Your 3.11 eval job has completed with return code 1.
001 | Traceback (most recent call last):
002 | File "<string>", line 14, in <module>
003 | KeyError: <__main__.Set object at 0x7f7cfdc80690>
IMO, it is good unless you have a lot of similar short methods
Is dict and set behaviour proper? What I said is written in the documentation
so far i don't see the possibility of crashes
Idk if crashing is possible but you can simply screw up the data structure
You won't be able to retrieve the object, which makes the data structure useless
wdym by "screw up the data structure"
is that the issue i said earlier
after considering the implementation of
setanddicti've concluded that the worst that can probably happen is duplication ofset-type elements
If you mutate a key while it's in a dict
You can duplicate but obviously part of that is also that you won't ever find the original element
i forgot to mention that
Well that defeats the whole purpose of the data structure
the only correct way to do this operation is to remove the element and rehash and reinsert it.
Seems like a good enough reason to disallow mutable keys
which you can already do with frozenset
Right
In languages with mutation control like C++ and rust
Mutable types can be keys
They just prevent mutation while it's in the dict
Python doesn't have that option
that does require unfreezing and refreezing which is not optimal for large sets
you could do this and manually make sure you do the operations correctly
that is, remove, mutate, readd
if you can't afford the copies
yeah that's probably faster
I think Java and Kotlin don't try to enforce this the way python does
I'm not sure if it's that big a minefield
But it is definitely a minefield. Only the size is in question π
yeah, java gives you the same behaviour as the above, just for standard collections. Different ways, the java option is a hair more powerful, but also more error prone.
If you can crash CPython this way that would be a bug. I believe the effect is simply that you can't retrieve the element by key and checks like in could be wrong
(because those are going to look in the wrong hash table bucket)
Right
I guess crashing cpython with python code is always a bug
Hard for me to get used to the managed language mentality π
with some exceptions, e.g. ctypes and constructing code objects manually
and sys.setrecursionlimit(100000000)
I'm not familiar with code objects
For setrecursion limits, sure, same with forcing enormous allocations, manually sending in signals into the python interpreter
enormous allocations should just fail with MemoryError
it's probably impossible to crash with enormous allocations due to bounds checks in every part of the interpreter that allocates memory
!e ```py
exec((lambda:0).code.replace(co_code=b'\1\0'))
@rose schooner :warning: Your 3.11 eval job has completed with return code 139 (SIGSEGV).
[No output]
I don't see how you can avoid memory failures
Linux doesn't tell you that you allocate more memory than is available by default
Allocate a lot of memory and the OS will just eventually kill your program
Windows is different
if creating a MemoryError fails with oom you're probably in trouble and are about to get sniped by the OS
But I'm not sure that counts as crashing as opposed to getting killed externally
The way the Linux kernel works by default is by letting you allocate more memory than is available, and if you use too much of it, you get killed. To my knowledge you can't actually stop that from happening as a process
check it
The kernel won't tell you that you are overallocating and will die soon.
At least afaik
You can definitely get MemoryErrors on Linux. malloc() can return NULL, and when the OOM killer gets you is a configuration thing IIRC
You'll raise it if you try to allocate something silly
When does python raise MemoryError
Is it based on something other than malloc returning null
@rose schooner et al.: It just so happens dabeaz just posted on his mailing list about just this topic (hashability of mutable data structures): https://tinyletter.com/dabeaz/letters/dangerous-flexibility
In the recent book "Software Design for Flexibility ", the authors make reference to the fact that they're going to explore ...
I get a 406 Not Acceptable on the stylesheet π¦
You can configure the kernel to not overcommit memory, and I believe if you do then the OOM killer is never invoked because there's never a need to free memory owned by a userspace process
indeed
you can get a memory error on linux by default via something silly like [0] * 2 ** 40
Does Windows die along with the process? π€
Is there any function related programing in python please suggest me best online website
No, Windows just refuses to allocate memory for you if there isn't enough space. It doesn't kill the process either.
that's opt-in behavior on Linux, too. https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
it's usually disabled in favor of overcommitting, though, because fork makes it extremely easy for a child process to inherit many gigabytes of memory that it has no intention of actually using. In practice, overcommitting is usually the better call for a POSIX system. That rationale doesn't apply to Windows, though, because it has CreateProcess instead of fork+exec
functional programming is kinda discouraged in python
but its still an option
RealPython also gives you some free courses if you sign up and one of them is a functional programming course
Speaking of realpython and internals, I'm not sure if this is the case in other countries but the CPython internals book is currently 10 dollars for a physical copy on Amazon Australia https://www.amazon.com.au/CPython-Internals-Guide-Python-Interpreter/dp/1775093344/ref=asc_df_1775093344/, very good deal if it's available to you as well
who buys a book about something that can change from time to time
3.9 is like 2 years ago
For 10 bucks I think that's a very solid deal regardless, yeah the interpreter changes but a lot of the fundamentals are remaining the same
10 bucks is like a month's worth of lunch money for me
Different economies I suppose π
Hey
this is always my hangup about programming books
i'm happy to spend $15 on an e-book or donate to someone making good blog content, but i don't really want to purchase a physical book that will go out of date soon
Is there a forum here to discuss small change proposals to cpython internals, or even python's requirements as a language? I have a change I want to have an online discussion about, the scope is quite small though so posting a PEP would seem like too much red tape for what it does
are there core devs present on discord?
This is the right place for discussing something like that. A non-zero amount of core devs at least occasionally look into this channel.
You can also make a post here https://discuss.python.org/ as something in between Discord and writing a PEP
I'll make a publicl shared doc - where I could keep track of all the counterpoints made, and respond to them just once. Let's try that at least ... I will post the link to the doc when I'm done writing it
https://docs.google.com/document/d/1et5x5HckTJhUQsz2lcC1avQrgDufXFnHMin7GlI5XPI/edit?usp=sharing
It is shared for viewing and commenting
I'm open for a discussion in here or comments on the doc... thanks!
Preface / meta First I will make the case for the proposal in the title, out of first principles, then Iβll list in a separate section the counter-arguments Iβve seen so far, and my replies to them. Managing this as a doc Iβve implemented this change as a PR to CPython and that was promptly clo...
I mean it's.very surprising the setting python hash seed doesn't leave.the hash of None deterministic
I would just figure out if there's a reason for that
The whole purpose of that variable is to make hashing deterministic for testing or reproducibility purposes
No obvious reason to exclude None
for the same reason object() and function hashes are non-deterministic, it's just computed off of the memory address, rather than any deterministic value. Which I do agree is kind of dumb.
I do mention the possibility of making the hash not constant, but rather a deterministic function of the hash secret (which makes it so if you specify PYTHONHASHSEED, it will be constant across your runs).
That's arguably a better choice from a practical POV
Ah, interesting
It didn't even occur to me they would apply that to None
Im not really a fan of allowing hash by address.to start with
Most languages don't allow it
im getting same results in every interpreter launch
im not using PYTHONHASHSEED
3.11
that's because None happens to be on the same address every time
Yes. the values are different every run only on systems that apply ASLR
It's not cooncidence
For me the values are different, I'm on Linux
Those sessions are open at the same time it looks like
The binary only gets loaded into memory once when you.run an executable multiple times
but ASLR is a pretty important infosec feature, so "disable it if you want your hashes to be stable" is a pretty bad argument
.wiki ASLR
Address space layout randomization
Address space layout randomization (ASLR) is a computer security technique involved in preventing exploitation of memory corruption vulnerabilities. In
IOS
Darwin 21. iOS 16 is based on Darwin 22. In iOS 6 the kernel is subject to ASLR, similar to that of OS X Mountain Lion. This makes exploit possibilities
Yeah no argument
ah icic
>py -c "print(hash(None))"
-9223363242347854132
ok, now it is different
Did you me close all the sessions and relaunch?
how would you do it for an arbitrary object() though?
store some random piece of data?
__hash__ = id
You wouldn't, folks who use address hashing deserve what they get π
I meant specifically for None, of course object() will have a non-deterministic hash no matter what
that's literally what lakmatiol said is bad
ah
I misunderstood then
same with user-defined functions
>>> x = object()
>>> hex(id(x))
'0x189ab5d5070'
>>> hex(hash(x))
'0x189ab5d507'
>>> assert hash(x) == id(x) // 0x10
same with classes
id and object.__hash__ are both deterministically calculated from the object's memory address, but not in the same way
I guess it could be problematic if CPython wanted to move objects around in memory
but that would break like... everything innit
yeah, that's probably not on the table
considering all of CPython is built on passing pointers around
your table is very non-deterministic
Py_hash_t
_Py_HashPointerRaw(const void *p)
{
size_t y = (size_t)p;
/* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
excessive hash collisions for dicts and sets */
y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
return (Py_hash_t)y;
}
and I think id simply returns the address itself, but I'm not sure
@grave jolt : if CPython moved objects in memory it would alter their id and hash under current implementation, thus break correctness
In languages where objects are allowed to move, something like id has to be stored inside the object (and even then, taking the initial address as the id is wrong, because something else could be allocated there after the original is moved away)
PyPy uses something other than address for id() iirc
where hash(None) is implemented? i cant find it
there tp_hash is 0: https://github.com/python/cpython/blob/main/Objects/object.c#L1678-L1692
It uses the same hash as object, if tp_hash is 0, it will call Py_HashPointer (from what I can tell)
So assuming there is merit to my proposal, how am I supposed to actually make it? It got shot down horribly on the forum, maybe because my initial arguments for it were weaker.
On github, a core developer just closed my PR and the issue, and that was it.
I don't know what else can be done at this point
this issue? https://github.com/python/cpython/issues/99540
Yes
@broken sluice I think the general objection will be that hash() was never intended to be useful beyond the current process. If you need something like that, you need to implement it with the guarantees you want.
But if that is true, why does the PYTHONHASHSEED feature exist? and it is not deprecated ... and people rely on it for debug/research purposes
and as I said in the doc, you can't implement your own hashing strategy in Python
PYTHONHASHSEED might have been a stop-gap when randomization was introduced, to make dict iteration consistent.
the builtin containers do not accept a hasher as an outside parameter
sorry, i haven't read the doc, so I'm not sure what your use case is. I hash data structures, but not for insertion into dicts: https://github.com/nedbat/coveragepy/blob/master/coverage/misc.py#L227-L264
besides, is it really a counter argument?
I mean sure, Python isn't obliged to support this, but this seems like going out of its way needlessly to break something, at face value
not sure what you mean by "going out of its way": None inherits the hash from object.
Consider my example:
KeyType1 = tuple[int] | tuple[int, int]
@dataclass(frozen=True)
class KeyType2:
foo_id: int
bar_id: Optional[int]
does it make sense that one of these key types hashes deterministically and one doesn't?
well, the docs explains why I think that hashing a monostate variable to a constant is the "common sense" thing to do. That's why I said going out of its way. But in terms of implementation you are right, it simply inherits from object
generally, changes to Python need to have real-world use cases rather than "it makes sense" arguments. Again, you might have them in the doc, I haven't read it.
Start a discussion on discuss.python.org, referencing the closed issue and other previous discussion. But I can't guarantee there will be any more enthusiasm for the idea than before
The use case for something like this is to make regression tests reproducible
I'm a bit confused because this is such a standard technique
Regression tests and even unit tests commonly use a randomized seed for say rng which they log. If you have a failure you want to reproduce, you rerun it feeding the logged seed
Granted that things which use address.hashing will not be reproducible regardless, but many things will be. Anything that has a notion of value semantics, structural equality, etc
And optional fits into that
(rather, none fits into that but that's usually seen in an "optional" context)
just to be clear, I am not arguing against this proposal. I'm trying to help explain the core dev mindset, to help @broken sluice
Sure, I'm just saying, the use case seems pretty clear, no?
Reproducibility always matters
Salting hashes makes sense for prod, in testing it's ok to salt hashes but you need to be able to reproduce any test run as closely as possible
I would have assumed that that's the purpose of pythonhashseed rather than a stopgap
Sets for example still don't have defined iteration order, it's easy to write code that accidentally depends on set iteration order. Say you do and your tests usually pass, now one night they happen to fail. It's not great if you can't reproduce that quickly and easily
I'd be in favor of making the change
making only hash(None) reproducible isn't a very thorough solution for that problem though. Hashes of e.g. types and function objects are still nondeterministic
Sure, you can't solve that problem completely if you've.chosen to rely on identity based hashing
But it's pretty debatable to say that None is an identity based type
To put it mildly
None being a Singleton is an implementation detail
it's not an implementation detail, it's important that x is None behave the right way.
but being a singleton is a good reason for it to have a fixed hash
It's a monostate type
how is that different than singleton?
Singleton is an implementation detail of being a monostate type, in most cases.
Python chose to make is None idiomatic
can you explain the difference between singleton and monostate?
A monostate is a type with only value
A Singleton is a type with only one instance
ok, do we agree that you can't have a Python where None isn't a singleton?
Well, because of breaking is None checks, sure
Not for any other reason I can think of
If python just did == None which is generally what you see, then it would truly be an implementation detail
My main point I guess is that None usually shows up as a value as part of an implied Optional semantic
Since always having None isn't very interesting
And Optional[T] is something that has structural equality if T does, without fail
Hashes makes sense only in the same process. If you are saving hash and using it in other process, hashes are not obliged to mean anything useful
And structural hashing, generally, as well
actually, why did is None become idiomatic? 
Idk. The funny thing is that it was encouraged, but it's arguably pretty terrible
Is None cannot be overloaded
But I don't think that argument is compelling
there are types like numpy arrays which implement == elementwise
so probably for those
Fair point I suppose
I was going to say that performance reasons are the main reason to use identity rather than structural equality for None
can you explain the difference between singleton and monostate?
A monostate type is a type whose instances can have only a single possible state
In Python since all values are held by references, the only thing that makes sense is to store such a type in a singleton. But that's not generally the case in all other languages.
Regarding hashing, the common sense thing to do when hashing a monostate type is to return a constant. That's what I'm arguing, at least.
I am saying that the Optional "None" type (i.e. a disengaged Optional) is a monotype variable - hence the proposal. Even though in Python, None has other meanings, but we are kind of stuck with None representing a disengaged optional
again, just to give you some perspective on the core devs' mindset: they aren't interested in "other languages do it this way", and using terms from other languages and cultures (like monostate or disengaged optional) aren't likely to convince them either. They want to hear, "Here's a thing I want to do with Python, and it would be much better if we made a change"
If it were practical to separate the Optional None from the "null reference" None (or whatever otehr meanings people ascribe to it) that would be a great solution, unfortunately it doesn't seem practical at this point
If Python implemented Optional[T] as a Union[T, Unit] and had Unit hash to a constant in the first place, we wouldn't be having this discussion right now
There was a choice to overload None which caused this complication, it didn't have to be this way...
and the concerns of "value types" are universal to programming, I don't think they become irrelevant just because we are writing in Python
even if Python puts less emphasis on these ideas
That seems unfortunate to me. All languages can learn from one another,.both because of intrinsically good ideas and because of things simply becoming consensus across many languages
@broken sluice I don't know what Unit is in this case, and I don't understand what you mean by "overload None".
corak is trying to convince people. You need to take your audience into account. How you craft a persuasive argument depends on who you are trying to persuade.
So maybe I can't convince Python devs if that's really the prevailing attitude
We say reproducibility is important sometimes, you say it never is, so there's a stalemate
Sure I don't disagree with that and I understand you're only trying to help
Just think the attitude is a bit unfortunate. No language is an island after all.
Also I don't think ned said reproducibility never matters? Maybe I missed it
denball did:
Hashes makes sense only in the same process. If you are saving hash and using it in other process, hashes are not obliged to mean anything useful
which is effectively the same thing, worded differently
if hashes are different, your second run won't be the same as the first. It is enough that iterate a set anywhere in the program, and it will diverge
Creating more None's in interpreter is bad idea. It already have None, Ellipsis and NotImplemented singletons/sentinels for different purposes.
If you need one more sentinel, it is better to create it yourself: MISSING=object() instead of adding it to interpreter
I did that, but it's a lot of trouble for devs to maintain such code
having to say Union[T, Unit] instead of Optional[T] everywhere
having to convert None to Unit and back
and all that, for what? what is the gain from that approach over modifying None?
Why would you need to convert some sentinel to None back?
let's say you write an optional to JSON. Must convert from Unit to None
it gets read back into None, must convert back to Unit
it is a hassle
You're not being honest if you claim it is trivial to sanitize None out of an entire large program, and keep it that way
and again, is it a good idea to ask people to do that? only because they want reproducible runs?
I understand you are frustrated, but starting to accuse people of dishonesty is not going to help.
You are right, but like I said, if people really want to think a certain way, nothing I say will ever convince them otherwise
and that's what it looks like to me :/
it might be that they need more help to see the advantages, or how little it costs to make the change.
so let's discuss the cost ... what is the actual cost?
I mean, apart from CR
Python devs will know more than I do about that for sure
Is there a pull request that implements the change?
it sounds to me like it would be a dozen lines for the change, and perhaps a dozen for a test.
Yeah, that makes sense. Seems very low cost, even if the benefits arent going to be seen as huge
I think the main point is not the hashes being "meaningful" across processes, but reproducibility across processes. Which is very much a real thing.
See some of the above examples I gave around testing
Why would you need hash(None)? Are you using it as key in dict? Item of set? Are you hashing tuple with None?
A data class with an optional field?
Scroll up, there is an example where the hash of None injects non determinism into seemingly harmless key types
KeyType1 = tuple[int] | tuple[int, int]
@dataclass(frozen=True)
class KeyType2:
foo_id: int
bar_id: Optional[int]
Anyway I made a PR for this and it was closed. I think if I try to open another one, it might be viewed as trying to spam. and I can't re-open the closed PR AFAIK
Also in the implmentation, we need to choose between two options
- None hashes to a constant
- None hashes to a deterministic function of the hash secret, so it only stays constant if
PYTHONHASHSEEDis used
I would go with constant tbh, PYTHONHASHSEED matters for hash collision attacks, which is impossible with None
get ready for the which constant bikeshedding
I put 0xbadcab1e in my PR ... but I don't mind changing that π
What of the fact that https://github.com/python/cpython/pull/99541 is closed?
Yeah this is definitely the way to go
Also as to why it was closed , did you read the GitHub thread?
Looks like it was closed by a bot because you didn't update news (patch notes roughly speaking)
I did - rhettinger explained himself there
No, the bot did not close it; I did update the news, via blurb
and the bot had green status on my change
rhettinger then posted the following on the issue
Thanks for the suggestion but this doesn't make sense. The default hash for every object is its object id. There is nothing special about None in this regard. Also, hash randomization was added intentionally for strings and bytes β we're definitely not in business of trying to make hashes constant and we don't want people to come to rely on a particular hash order or value.
and closed my PR
Ah sorry
That's incredibly lame, fwiw
I can respond fwiw not that I'm anybody
I can't seem to actually find rhettingers comment
my opinion doesn't count for anything either
so I don't know, maybe we should leave it for someone who has the cred to push such a change
you don't see it there?
Ah there we go
I can respond there, probably won't help. Maybe on a mailing list you need to raise this first? Happy to support you there as well
I could try that
I sent the mail already
to which list?
Now also opened another thread in the forum, hopefully I don't get an infraction for that or something
for what it's worth, Raymond is often quite close-happy when he feels a change is not correct. I don't really like it, but he is a very experienced core dev and educator
so his opinion should have some weight
Fair point, however, it's one thing to think this change is wrong, another to say this: "There is nothing special about None in this regard".
He could say "catering to this use case is not worth the effort"
or "this will hinder planned changes x/y/z" or all kinds of other reasons. But saying something like that just makes it seem like he gave it only a moment's thought at best. Even if he in reality has extremely good reason for what he's saying
Are they saying they want the hash of None to be constant throughout multiple sessions? I thought it's hash was based on id but it's a singleton so that's constant for the lifetime of the program.
yes, multiple sessions.
Why
I don't understand the issue
Or should I say the benefit
I'm not seeing it in the issue
You are debugging a program, or there is a unit test failure.
The test runs a lot of code. Somewhere the program computes a set of keys, then iterates on it and does more things under the loop. Let's say the keys are frozen dataclasses with optionals in them. because of the non-determinism of the hash, the set are organized differently each run. So anything downstream from that point diverges every run. Now it might cause flaky tests, failure to repeat problem cases even if you log the inputs, etc.
and all of that for what? No one ever told me who actually benefits from hash(None) changing every run
well this has been going on for the duration of my sleeping time
wouldn't iterating over a set be non-deterministic anyways? because sets are unordered?
In theory, yes. In practice, given the exact same operations history and same hash values, the set will be in a perfectly identical state each run, and thus will iterate its contents the same order. You don't know which order and don't care, but it will be the same
- my point is that code that relies on deterministic set order is already a bug
but the code doesn't rely on it
i have no idea what you're talking about ```py
py -c "print([*{'afsaf', 'blak', 'clf', 'hae', '01s'}])"
['clf', 'afsaf', '01s', 'blak', 'hae']
py -c "print([*{'afsaf', 'blak', 'clf', 'hae', '01s'}])"
['blak', 'hae', 'clf', 'afsaf', '01s']
py -c "print([*{'afsaf', 'blak', 'clf', 'hae', '01s'}])"
['hae', 'afsaf', 'blak', 'clf', '01s']
py -c "print([*{'afsaf', 'blak', 'clf', 'hae', '01s'}])"
['afsaf', 'blak', 'clf', 'hae', '01s']
yes, try that with PYTHONHASHSEED=constant
your bug was that the hash values weren't the same
ok
read carefully the scenario I mentioned. I did not say the code needs to assume anything about the order in which it will read things from the set. Any order is legal
well i'm not too careful of a reader or too expert of a developer to understand what's even being talked about here
because of the non-determinism of the hash, the set are organized differently each run. So anything downstream from that point diverges every run. Now it might cause flaky tests,
^ that is stating a reliance on set order
I can just paste a small example then, few mins
ok it works now
code:
from dataclasses import dataclass
from typing import Optional
@dataclass(frozen=True)
class Key:
foo_id: int
bar_id: Optional[int]
set_of_keys = {
Key(foo_id=i, bar_id=None)
for i in range(10)
}
for key in set_of_keys:
# If we perform any downstream logic here based on keys, behavior will diverge
# # between subsequent runs, even though set_of_keys is a constant input
print(key)
run 1:
Key(foo_id=9, bar_id=None)
Key(foo_id=4, bar_id=None)
Key(foo_id=7, bar_id=None)
Key(foo_id=2, bar_id=None)
Key(foo_id=0, bar_id=None)
Key(foo_id=3, bar_id=None)
Key(foo_id=8, bar_id=None)
Key(foo_id=5, bar_id=None)
Key(foo_id=6, bar_id=None)
Key(foo_id=1, bar_id=None)
run 2:
Key(foo_id=6, bar_id=None)
Key(foo_id=1, bar_id=None)
Key(foo_id=9, bar_id=None)
Key(foo_id=4, bar_id=None)
Key(foo_id=0, bar_id=None)
Key(foo_id=7, bar_id=None)
Key(foo_id=2, bar_id=None)
Key(foo_id=8, bar_id=None)
Key(foo_id=3, bar_id=None)
Key(foo_id=5, bar_id=None)
how so?
I can write code that iterates the set and does something, and the code can be correct for any order. Right? it happens all the time
and I can still want reproducible behavior when I'm debugging something
these things are not at odds with one another
im saying that if the code is non-deterministic when running normally, it should either be configured to enforce order explicitly when debugging if that is what you want
imagine the set is a set of legal choices for some search algorithm
it might be correct to go over them all in any order
but maybe something goes wrong and you want the entire thing to take the exact same steps at each point
it generally makes life easier when you're debugging
I thought it's kind of obvious, but maybe not
in that example, my test would use a list, not a set, as I want exact order
I know there are workarounds
very often people use sets to deduplicate for example
now sure they can do such a thing without sets
but it's very easy to fall into the trap
function_to_produce_consistent_order(list(set(items)))
and again, for what purpose do you want to make our lives harder
what is the external cost ..
If you did list(set(items)) you broke it
if your items can be sorted, you can stick a sorted at the end and you're OK. But not all keys are comparable like that
and again, this is placing the burden on researchers who might not be too keen on preserving determinism
set order is non-deterministic by design, it is not something that can/should be able to disabled
it has undefined order. No requirement on the order. That is not the same thing as non-deterministic
It's a crucial point
how are those different?
non-determinism is a behavior, not a requirement
If i run code in IronPython, PyPy, or CPython that uses sets, it can act differently
by design
try to create a set of ints, that will show non deterministic behavior, being fed the same data/operations
really try it
something like, starting from these fixed inputs and these fixed operations, I run it once and get one order, I then run it again and get another order
again, i'm not saying anything about what the order is, any requirement at all
I don't care if it's different in another Python runtime
If all you're doing is debugging something, you don't care about that
tbh i dont care enough about this to do any of that, just if you are debugging something that loops through a set like that where a specific order of items breaks something you should build a proper test harness for that code to test your possibilites
not depend on the language doing it for you when it explicitly says it doesnt
because relying on undefined order means that any update (including minor versions) that changes something about sets can break your tests
if the tests fail it's an indication the code or the test is wrong
I agree that tests should be extensive and then they'll catch order dependency bugs
and you know what? maybe for UTs this is good enough
but what about running the entire thing on some fixed input? let's say you have an input where the program did something that doesn't make sense and broke on an assert
what if the odds of that happening again are 1:1000, and it takes 30 min of compute time to get to that point
I mean, reproducible behavior has value, of course you can get by without it. You can get by without a lot of things
look, if you think reproducible behavior has no value at all, the discussion can end there
my point is that if you have code that is that complex, and has the potential to fail in odd edgecases, it should be possible to debug post mortem without needing to rerun the code
I understand that, and maybe if I wrote all the code that I'm responsible for, the situation was better. It's a lot of code that was written in a hurry, not everyone writing the operations research code we have is even an engineer.
It's a nightmare to debug it, and non determinism isn't helping
Yes, it would be nice if all of the complex behavior was broken down to small component and tested extremely well in isolation
In reality it isn't, though
you can just say, "well sucks to be you" but I am just asking who's actually benefitting from the non-determinism
if the answer is no one, then why have it
again note that if one day someone makes sets iterate completely at random every time they can. My change does not contain any contractual guarantee for sets. It can break at any time in the future and I am OK with that
afaik, the non-determinism of sets is a speed optimization
no, I don't mean that (sets are still deterministic today!) I meant the non deterministic hash of None
again I could be wrong - show me a history of operations on a set that takes in only constant data with constant hashes and ends up iterating its data in a different order every run. That would be non-deterministic sets
^ probably possible as hash(obj) relies on id(obj) which is the address of obj which is non-deterministic from python
but then it is not the set that is non-deterministic, it is the hash
that is the point
thats like me saying "its not the door that opens, its the door knob"
No, it is not some philosophical statement...
my point was that if a dependency of an operation is non-deterministic, then the operation is non-derministic
the set inherits it
sets being non-deterministic means you can take things with fixed hashes, say construct the set {1, 2, 42} and iterate them, and let's say it returns: 2, 1, 42. Then you create another set exactly the same way and it iterates them in a different order
but you can make sets out of things that do not have fixed hashes
so yea, a subset of sets can be deterministic, but all sets cannot be deterministic
right. I agree. then all bets are off.
but I usually avoid doing that. The researchers do too
I am trying to point out that what you are asking for is not just a deterministic hash of None
they tend to use ints enums and such in keys. and if we set PYTHONHASHSEED we are generally ok.
Expect when they try to use Optional[int] and then we're not
its a deterministic hash of everything
not everything.
that can't be done anyway.
what types are used as keys? tuples of ints, strs, maybe bool, enum, maybe even frozensets of those things
!e If it is truely justNone then just do this locally and call it a day ```py
from fishhook import hook
@hook(type(None))
def hash(self):
return 0xdeadbeef
print(hex(hash(None)))```
all can be hashed deterministically, if only optional None didn't screw us over
@pliant tusk :white_check_mark: Your 3.11 eval job has completed with return code 0.
0xdeadbeef
I thought of that.
Won't that break the world though?
if there is any set or dict anywhere in the Python runtime that was already created based on the default hash of None we are screwed
yea it would probably break things
you can also use LDPRELOAD
but at that point just define your own class inplace of None and use that
A C extension to patch the tp_hash descriptor?
it would work, but again, its a bad idea
I guess I might do that. it is better than asking people to use non standard idioms for Optional
after all what they mean is Optional
not Union[T, SomeMadeUpSentinelJustSoICanHashItToZero]
you can use a .pth file to hook Optional and swap out None with your custom class
but Optional isn't actually a class
there is no such type
It is in fact, just a T | None
I'm warming up to the C extension idea
after all nothing in the code will rely on it. we can always throw it away if we want
then hook types.UnionType with fishhook and swap out the None there
the static typecheckers won't care about whats in a pth file
let's say your dataclass has a field that says
x: Optional[int] = None
A type checker will see that and consider it legit
and it is
how can your hook transform it into
x: Union[int, SentinelType] = Sentinel
seems really difficult to rewrite the program in such manner with no help from the programmer, in the general case at least
maybe with descriptors on all fields
actually
maybe the solution is to patch the hash function generated for datacalss
(but there's also NamedTuple...tuples, etc)
C extension is less headache I think
In fact, easiest is to just backport my PR into whatever version of Python I'm using
This whole line of discussion seems to miss the point behind this. The point isn't to get deterministic behavior generally, the point is to get reproducible behavior when you need it
Probably you missed it because it was discussed earlier on, but it's pretty similar to how test frameworks that work with RNG start with a singular seed that may default to being taken from e.g. an OS source of randomness, but can actually be fed in at the command line
The idea being that if your tests fail sporadically, you can look at a failing test, look at the logged seed, and feed it back in to reproduce it exactly
Nobody should be depending on set iteration order on purpose, the point is that if you accidentally depend on it subtly, when you get a test failing you want to reproduce it exactly
I just found the discuss.python.org thread about this. It got very heated it looks like
https://discuss.python.org/t/constant-hash-for-none/21110 but the first post is gone
Hmm why is it gone?
I mean with respect people who didn't seem to understand the main ideas got into writing very long posts that were mostly irrelevant and conversation just got derailed
I think the origin proposal needs to be stated in a more tightly focused way
could be
Like the whole discussion of ordered set
That's totally irrelevant here. Not sure if the original post by corak didn't do s good enough job steering away from that
my recommendation generally is to try to say yes as much as you can, and avoid saying no. It will keep the discussion where you want it.
this proposal only helps in the case where the set was full of only elements that have consistent hashes across runs or None's, though. Surely that's quite a rare case
but i also know it can be very frustrating to try to convince them, and sometimes you just can't.
setting the hash seed is a key element of making it deterministic.
that still doesn't make it deterministic for most objects, just a for those from a very small handful of types. So, yes, if you depended on iteration order of a set, and the only things in the set were either things that already have consistent hashes across runs or None's, then this proposal would allow you to reproduce the behavior. But if the set contained any other element that has inconsistent hashes across runs, you've still got the same problem as you had before.
none, int, float, strings would get you a long way.
and none is the only non-deterministic hash there.
those might get you a long way, but they don't get you all the way. Is something that makes a test reproducible only in very specific circumstances really that valuable?
Not really, I don't think it's very small
Optional is a member of many value semantic data classes
for me, the cost to do this is tiny, so if it helps someone, let's do it.
Yes, exactly
Hettinger is wrong when he says that None is no different than any other type. None is one of a handful of basic building blocks for creating types that people use
And it's used very extensively in types with value semantics
Just like strings and numeric types
you tell them I'm in favor! (this means nothing)
Obviously if you have classes that use identity semantics as keys things will be non reproducible, but that's not very surprising, and using identity semantic classes as keys is a choice and the impossibility to reproduce results is just one of the downsides
Personally I never use identity semantics in hash keys
The question I guess is how to reopen this discussion without people getting so upset
I don't know why that thread escalated so quickly
i think conflating set iteration order and making a consistent hash for None blew it up pretty fast
yoni went down the ordered set rabbit hole: https://discuss.python.org/t/constant-hash-for-none/21110/14. Better to keep the discussion on track.
Yeah that's very unfortunate
Online discussions you have to be laser precise
Also from the get go it's better to suggest that None has a salted hash the same way strings do. It maximizes security at no real performance or reproducibility cost
meh, all the ints and floats have predictable hashes, i don't see why None needs it salted. But ok.
they do, but that's an implementation detail... I think folks were bristling at the idea of promoting that from implementation detail to supported feature
I think simply because there's almost no downside you should salt it. For it's, the performance hit was significant
It seems honestly pretty obvious to me that languages that salt their hashes should have a reproducibility mechanism built in
that's the kind of language you have to avoid though
Python didn't salt its hashes until very recently, and then they added salting only as a mitigation of a security vulnerability
I'm genuinely surprised to see e.g. a cpython core dev refer to "unnamed use cases for reproducibility between runs"
Yes but this is just between friends π
sounds like it wasn't explained fully enough
Salting in general is a security mitigation though. Also, python did actually include a way to make hashing reproducible between runs, right?
Not sure what the motivation was, but it seems possible this need had already been recognized
of the one specific type they added a salt to, yes.
I mean presumably if anything else is ever salted it will be using that same value to seed the salt
I.e. that environmental variable will always allow for reproducibility of any value semantic hashing
you don't consider ```py
class SomeClass:
pass
SINGLETON = SomeClass()
hash(SINGLETON)
that's correct, but it supports both equality and hash
A monostate Singleton is something of a degenerate case
That's a big part of why None creates confusion
I'd call it value-semantic because it does allow == - so if this degenerate case is value-semantic, then no, PYTHONHASHSEED does not make all value-semantic hashes reproducible.
For a monostate Singleton, all instances are equal and hash the same so it value and reference semantics aren't exactly distinguishable
I would say it's clearly a corner case but not very likely to come up much
It comes up a lot for None in practice because it's part of Optional, in practice
And Optional is a pretty common building block in value semantic types
well, that's probably the most convincing way to formulate this argument.
- Reproducible hashes are useful for reproducing test failures (which is why pytest defaults to printing out PYTHONHASHSEED, for instance)
- The non-reproducibility of hash(None) even when PYTHONHASHSEED is set is the only thing making the hashes of many simple dataclasses non-reproducible.
when will python have symbol type ||have i asked it before||
Help me
@rich cradle I've always thought the evils of inheritance were overstated, but that's probably because I rarely actually make subclasses, and when I do, it's specifically to avoid boilerplate.
i've just never written code that really needs that kind of structure. i dunno why.
i tend to abuse things like protocols though. holdover from using typeclasses in haskell and rust.
the entire inheritance model seems strange to me, architecting shared behavior as a tree, but Β―_(γ)_/Β― i don't actually use it where possible
well, that's why we have duck typing 
well, now that i think about it, it's not even a tree, it's a... directed acyclic graph? maybe? which is even more wild.
When you say protocol you mean dunders or structural protocols
whatever typing.Protocol is, so probably the latter
How do you abuse that
i use it in places where inheritance probably would be more appropriate, that's all
Oh
I mean isn't using a protocol vs a abc like functional vs oop
Just different paradigms
it might be a different take on OOP. it might even be a different paradigm. but if it is, functional isn't the one
!docs typing.Protocol
class typing.Protocol(Generic)```
Base class for protocol classes. Protocol classes are defined like this:
```py
class Proto(Protocol):
def meth(self) -> int:
...
``` Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example...
And then you can type hint saying your callable takes in an instance matching that protocol
Yeah yeah ok
i dunno if it has a name. it's a style that originated (afaik) with haskell typeclasses.
well it's different to some extent, but it's the most similar there is in python that i know of
well, i think it is. it's been months since i wrote sane python code. the past few months have been largely random things to test one of my projects.
I don't think I write generic enough to really use protocols
And while I've used ABCs, it's honestly unnecessary
I feel like ABCs are only there to appease people from languages that have them
it's more consistent with Python's philosophy to just... not instantiate the class.
Yep, I completely agree. @broken sluice if you haven't given up, i think this may be the purest distillation of the idea I've seen.
Not exactly, ABCs in conjunction with mypy also enforce that things are overriden and overriden correctly,.e.g. without typos in the name or signature
That's very useful and a very easy mistake to make by accident
Fwiw, protocols are pretty fundamentally different from type classes or traits because the former are satisfied implicitly. The latter, explicitly
Python protocols are more like Go's interfaces
yes
but i think they're the closest usable equivalent we can have in python, at least in my usage of them
Idk, they are more or less close depending on what you look at
ABCs are explicit, like typeclasses
In that sense, ABCs are closer
right, hence my "usable"
you can't add ABCs to random stdlib types or things from other packages
...i think
I think you can but I'm not sure if mypy recognizes it
I just see Python's Protocols as formalized duck-typing.
Protocol lets you describe what a duck looks like
that's exactly what they are. but i'm fine with that, personally.
Fwiw in 95 percent of cases, ABCs are very easy to use.
It's compile time duck typing more or less
Which is what structural typing is
In that sense it matches well with python
But there's a fair amount of criticism (that I agree with) in just implicitly satisfying a constraint because you have the right API
i absolutely would prefer to tack on a ridiculously powerful type system to python. i just don't think it's fundamentally possible, and would break a lot of things.
That's why structural typing is very rare, out of popular static languages mostly just Go uses it
It's not really about the power of the type system per se
isn't it? python has built a lot of its type system by shoving things into the class model, even if they don't necessarily fit.
No? A specific choice of the type system isn't the same as it being more powerful
no, wait. ignore my previous statement. that was another thing i somewhat disagree with, but not what i meant to say.
It's just worth trying to use ABCs if you havent. Using ABCs has very little to do with the inheritance rabbit hole
i think what i'm really getting at is "i want static typing, and i want language features that only make sense with it," but there's no chance in hell that's happening. that arguably wouldn't even be python anymore.
IME most usages of polymorphism don't actually require the loose coupling provided by protocols
A class that implements an ABC is explicit about it, which is nice, and it also means you get errors early rather than later, like with protocols
Well, sure,.but I'm talking out of options available in python
right. i probably shouldn't have even brought that up.
I forget how
But almost sure it can be done
There ya go
Basically for many people this would be the ideal in a static type system
Explicit but non intrusive
ABCs are explicit and intrusive
Protocols are non-intrusive because they're implicit
Haskell type classes and rust traits have this nice explicit but non intrusive property
https://discuss.python.org/t/hash-none-mk-2/21465/16
if you care at all to write anything there
I think there's no use tbh
I've made my case, both sides are repeating the same arguments
At least you are making some sort of argument that I can respond to. You really should have started with that. If an operation returns a constant result (as can be observed from the source code, which is open), running it by definition confers no information to an attacker. I donβt need to be a security expert to know that. If anything, it is ...
Feeling odd deja vu here
Have you looked into faking /dev/random? That should account for ASLR and anything else, fixing not just your specific usecase, but any hashes of arbitrary objects (I think): https://stackoverflow.com/a/26067735/1016216
The source of the non-determinism is the memory location of None (since that is what the hash function is based on). It is not due to input from RNGs
Can I get a Tl;Dr? hash(None) was X and now they want to change hash(None) to 123?
Am I understanding things correctly that hash(None) was the default implementation hash(id(self)) and they want to change it to 123 or whatever?
!e print(0xBADCAB1E)
@elder blade :white_check_mark: Your 3.11 eval job has completed with return code 0.
3135023902
yup, the goal is to make hashes deterministic for things that make sense to use as keys, using similar reasoning to PYTHONHASHSEED being a thing instead of just leaving it random and not letting the user set it.
1 in 1 chance random
I think OP want this behavior:
# pseudocode
def get_none_hash() -> int:
if USES_PYTHONHASHSEED:
return RNG(PYTHONHASHSEED).random() # get random number from seeded RNG
else:
return RNG().random() # get random number from RNG seeded by random seed
That would make more sense, but unfortunately the PR they submitted is weird and the behaviour they seem to want even weirder
I'm not sure which makes more sense than returning a constant, there are arguments and opinions both ways
it does seem to make a lot of sense to me for a monostate type to hash to a constant, but what do I know
What's weird about the behavior they want?
I couldn't quite figure it out of all the memes
I posted in that thread trying to support it, fwiw
The biggest argument against it, is the false premise that set iteration is dependent purely on hashes and not history of the set, Steve D provided that counter example in first reply of OG thread
the history of the set doesnt change over multiple runs
After thinking about it more, I get what the OP wants but the reason for their wanting it, is just wrong and could be found for any number of classes, even after fixing None
like what though
Pick virtually any object whose hash is based on id and u r back to square one
the argument is that all other objects commonly used as dict keys dont do that
That's weak imo
str, bool, float, int, tuples of those
Str do not give constant hash
they do if you set the seed with the flag
type objects are reasonable choices as dict keys
and they also have non-reproducible hashes
Exactly
fair
I've used registry patterns where the key is the type
I don't want to beat a dead horse, I just think time would be better spent refactoring their need to iterate a set in specific order
Pretty sure the help or repl hashes types, into a set no less. I've got weird errors when I messing up custom hash implementations and all of a sudden repr broke in the repl
you could use the same argument against providing PYTHONHASHSEED, and yet it exists, despite being more complex
Didn't someone show a version of python compiled without the randomization and None was providing a consistent hash?
None uses an id-based hash. Which is mostly going to be stable due to the way modern OSs work with memory. But it is nevertheless a non-deterministic value
That's not an argument against at it all
Just a misunderstanding
Because that was never a premise
The point is reproducibility, reproducibility just requires eliminating actual sources of randomness
Whether the sets history affects iteration order doesn't matter, because it's not magically randomized
Nones hash is randomized. Just like strings hash is. The latter provides a way to make it non random. The former doesn't.
That's probably the best example I've seen of a reasonable use case for identity based hashing in python, one I admit I've actually used
The example should be kept in mind but to use it as a basis to reject improvements in reproducibility would IMHO be Nirvana fallacy
Set's history affects iteration order and that's perfectly fine, because the next time you run the program on the same input it will create the same set in the same way
It's a misconception they keep repeating over and over
the only thing the set itself can do to break reproducibility is if it reorganizes its internal structure in a manner that isn't deterministic by its input commands
for example, a set that has a thread that concurrently rehashes it or something
I don't know of a single programming language that offers only non-deterministic sets. It's a nightmare honestly, and concurrent hashmaps are used only for super high perf applications, that probably don't want to use Python anyway
is it me or @ is the most underused operator in python
its only use case is in numpy for matrix multiplications
even then dot gives the same functionality
the @ binary operator goes literally unused in the stdlib AFAIK, it really was just added to greatly help out all the numerical libraries like numpy and such, https://peps.python.org/pep-0465/ lists out a bunch of motivations for adding it to the language
Python Enhancement Proposals (PEPs)
also, its recommended to use @ over np.dot when possible, since then expressions appear to translate over to mathematical equations much clearer (plus a little boost in performance i think)
Unary @ is used a lot.
Less used operators, IMO: ~ and + (unary)
unary @?
you mean decorators?
also: not not x
and __divmod__
Doesn't count, otherwise I nominate not + ~ + + - ~x
Modules/_datetimemodule.c line 2173
/* Could optimize this (by returning self) if this isn't a```
time for a PR
Makes me want to implement date @ time -> datetime, it's a bad idea but would be cute.
I'd also nominate @ as the function composition operator (by exact analogy to matrix multiplication!) but we don't even have functools.compose(), so.
You can fishhook FunctionType.__matmul__
i've been wanting this for a long time. i think calling @ "matmul" was a huge mistake, they should have just left it open as a "do what you want" operator
i think other mathematical/array libraries support it now
yeah xarray has it
My guess is that they probably don't want people randomly abusing operators just to get infix
probably true, "here's a random operator have fun" would be pretty un-pythonic
In 99 percent of cases if an operator isn't already familiar in a context then overloading it is the wrong call
(pun intended?)
/ for paths and urls was kinda strange tbh
but I think I got used to it
Haskell libraries used all kinds of custom operators with urls
It's not really strange
It's the character separator and it's what you type in to combine paths in bash
It's certainly less strange than + for strings.
C++ also uses operator /, just like python
Another reality here is that standard library just has more leeway. They can pick something semi reasonable and everyone will learn it pretty fast. For a third party library it's more annoying really to use operators in obscure ways
Julia moment π₯΄
It is very natural to me
What does Julia do?
* for concatenation, ^ for repeating
something something monoid
Yeah I've seen people make this argument before
It's ridiculous
Ironically, * was of course used for multiplying reals, integers etc long before scalars
Which is commutative
- was selected for matrix multiplication because it's conceptually similar to multiplication
???
they could have quite literally picked any operator in existence, it's julia, they support all of latex.
Mathematicians weren't slaves to the fact that they were using a previously commutative operator for a non commutative operation
but eh, if someone wants to go all math nerd for string concat, sure
integer subtraction is noncommutative
guess * should be for subtraction then
When you put higher value on formalism than concepts relative to mathematicians you know you're in bad shape
well, subtraction is not asociative
the convention that * is for associative operations is a fairly new one
Following this logic, string repetition should be +, because it is commutative operation: 5+'abc' == 'abc'+5
^ is power isn't it?
power is right-associative and non-commutative though?
I think intuitively it's pretty clear that conceptually string concatenation is like adding. Each item present in each of the arguments shows up exactly once in the final string
And with multiplication the things in the collection are multiplied
In other words, if z = x + y, then len(z) = len(x) + len(y)
And the same relationship holds for string multiplication
Cannot agree more. I would like function composition too
Wait, what does Julia use for scalar multiplication then? β’?
How could they use *, that denotes a noncommutative operation, while scalar multiplication is commutative! Should have used + for multiplication, clearly
In fairness matrix multiplication very common in the sciences! Fortunately it's also just a special case of function composition (where the functions are affine transformations), so that makes sense. And like (matrix) multiplication, function composition is often represented by an infix dot operator π
__matmul__ is not always a matrix multiplication
__add__ is not always a addition
__truediv__ is not always a division (fpr example pathlib.Path)
So, "matmul" is not bad name for that dunder. Different operators have different names (+ addition, - substraction, * multiplication, / division, @ matrix multiplication), and the operator names are irrelevant to what the operators do. So I don't see it (matmul being a name for operator) as a problem
There is also z3 lib (iirc), that have some placeholder variables, and X+Y becomes not result of addition, but some expression object, that can be evaluated at given X and Y
Also there is some lib (i forgot name), that have some "magic" var (iirc it is stored at lambda or phi symbol name), and var+1 becomes lambda x: x+1
So, you can do whatever you want with operators until you and your users understand what's happening
sympy?
For the dunder methods __getattr__, __setattr__, and __delattr__ a difference that emerges between them is that __getattr__ is called only if looking up an attribute in an object dictionary fails but for __setattr__ and __delattr__ are called regardless of whether the attribute is present in the objects dictionary but why is it that __getattr__ is handled differently from __setattr__ and __delattr__ rather than handling them all the same?
__getattribute__ does get called for all attributes
I think the default __getattr__ behavior is useful because you often want it as a fallback for attributes that aren't explicitly defined, while attributes that are defined normally can just use the normal system
Yeah the naming isn't great, probably a historical accident
thank you!
mixing up getattr, getattribute, getitem and get
I like this asymmetry. When you want to customize item access, you define __setitem__/__getitem__. When you want to customize attribute access you define __setattr__/__getattr__, almost always. When you define __getattribute__, you have to be extremely careful, you almost never want it.
get and set are extremely ambiguous and overused words
well, in this context it's probably appropriate
Question about Python dictionaries
I'm learning about hashing in my data structures class
I just learned about how, as the load factor of the hash table approaches 1, the time to find an unoccupied slot (or, a the time to perform an unsuccessful search) approaches linear time
Generally speaking. But I've heard for most of my python-using life that dictionaries are lightning fast, at least in Python terms, and that dictionary operations are in more or less constant time
How does Python handle this? Amortized resizing of the table?
Python automatically resizes dictionaries as they grow yes
Not too familiar with the details, but that should keep access times amortized constant
It is possible to get bad behavior if you have many keys that happen to hash to the same bucket
XD There's no winning, is there?
It's fairly unlikely in practice. String hashes are randomized to avoid DoS attacks where many keys map to the same bucket
It's probably still possible with ints (which have very predictable hashes) but that's rarely relevant in practice
i think they get resized by a factor of 9/8 when they get 2/3 full or something like that
To avoid slowing down lookups on a near-full table, we resize the table when
it's USABLE_FRACTION (currently two-thirds) full.
load factor ^
- Currently set to used*3.
how much to expand by ^
i think 9/8 is for lists
Is the hash() of int and floats constant? If yes is it stable or just an implementation detail?
it's an implementation detail. the only thing that must be satisfied is that equal ints/floats have the same hash
is this just for curiosity?
Ints hash to themselves
Floats have to be compatible with that, I think
Although as a general rule of thumb you just don't want to use floats near hash tables
^ integral floats hash to an int equal to themselves, but non-integral floats hash to...something
Statically typed languages almost all just disallow using floats as keys or for lookup
(at least by default)
!e Not all of them. ```py
print(hash(-1))
@raven ridge :white_check_mark: Your 3.11 eval job has completed with return code 0.
-2
mostly ints hash to themselves π
no, just -1
no, it's a funny implementation detail of the hash function, lol
Please tell me you guys know why this happens
it's because returning -1 indicates an error
also, hash(i) == (i % 2**61) (I think)
π€¦ββοΈ
with a -1 in there somewhere
Under what circumstance does hashing return an error
def __hash__(self): 1/0
https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_hash
The value -1 should not be returned as a normal return value; when an error occurs during the computation of the hash value, the function should set an exception and return -1.
That doesn't answer my question though
it's guaranteed isn't it ```py
class A:
... def hash(self):
... return -1
...
a = A()
hash(a)
-2
I'd have to check the data model docs, but I doubt that's guaranteed...
If the user hash throws.then I suppose you can just propagate it. I don't see any legitimate reason for it to throw though,.so I'm surprised that they reserved a sentinel for this
maybe only for user-created classes
I've never seen hashing reserve an error channel in another language, I think
__hash__ can be Python code and Python code can always throw
Yes, it can, but you can also propagate that exception
there's no note about -1 in https://docs.python.org/3/reference/datamodel.html#object.__hash__ - so it doesn't seem to be guaranteed.
the CPython C API generally uses sentinel return values to indicate that an error occurred
The point is that I don't see why it would be a use case to care about it, so I don't see to do it any favors
efficiency
Efficiency in the error path shouldn't be a consideration here
otherwise you'd have to call PyErr_Occurred() after every call to tp_hash
so the non-error path would be slow
Is it more efficient in the happy path?
no, efficiency in the non-error path.
why? It's an implementation detail
and it doesn't really affect you if you're working at the Python level
unless you go out of your way to check the value of hash(-1) or something
seems like the slot for __hash__ turns -1 into -2 if there's no error https://github.com/python/cpython/blob/main/Objects/typeobject.c#L8125-L8141
I can't be sad for things happening in the implementation?
Python code thinks that __hash__ returns a Python int. For efficiency, the actual C code doesn't use Python ints, it uses int64_t's. So there's always got to be some conversion happening to get from one to the other.
If my hash returns -1 though, how does that work? Probably I don't fully understand where the sentinel is checked
check the code cereal just linked
that happens here
if your __hash__ returns -1 as a Python int, the code that turns that into an int64_t (aka Py_hash_t) returns -2
wait no that's the other direction, https://github.com/python/cpython/blob/main/Objects/typeobject.c#L7500 is for when a C tp_hash throws
Objects/typeobject.c line 7500
wrap_hashfunc(PyObject *self, PyObject *args, void *wrapped)```
I doubt that microoptimization saves very much, honestly - but it lets the error handling path be c if (hash_code == -1) { // propagate exception } instead of c if (hash_code == -1 && error_occurred()) { // propagate exception }
allowing -1 as a hash that isn't just a sentinel for an error would mean that everything that hashes to -1 needs an extra check to see if -1 is or isn't an error indicator.
I don't think it improves the performance of things that hash to the other 2^64-1 values, though - so maybe in practice it doesn't save too much.
I wonder if any other mainstream GC language has something like this, and I just didn't know about it
my gut feeling is that this is probably a microoptimization that makes pretty little difference given today's branch predictors.
regardless, it's an implementation detail that's invisible to everyone except for those working at the C API level, so π€·
If there was no sentinel then error_occured would always have to be called right? That's what was discussed previously
right
error_occured is expensive, it was alleged
but some functions in the C API do return -1 as both a sentinel and a real return value - if they return -1 you need to check the error-occurred function as well
So the branch predictor wouldn't really save you
Yes, this sort of thing makes me wince
shouldn't it get optimized to the equivalent of c if (hash_code != -1 || !PyErr_Occurred()) { goto resume_path; } /* exception */ resume_path: /* continue */ or does that require __builtin_expect
not if you unconditionally called the error-occurred function, only if you conditionally called it to confirm or deny a sentinel
I was comparing to not using a sentinel at all
that's literally what the code would be - we can't talk about what it would be optimized to by showing C code
I still don't think I understand though, Jelle s example doesn't return -1, it just throws
So there must be some C code that catches that exception, and then returns -1
seems like it's just a few function calls and attribute accesses though
So this is written this way based on being old and/or wanting to be efficient on older machine
On x86 64 returning a two word trivial object has been basically free for ages
Or 32 bit, I suppose
if we had the sentinel but you needed to confirm it by calling PyErr_Occurred, then in practice the branch predictor would assume that the return value wasn't -1, and a misprediction would mean it would need to flush the instruction cache and then call PyErr_Occurred to decide whether to go down the branch it had originally assumed it would go down or the exception handling branch
that's my guess, yeah. I doubt this makes much difference on modern CPUs.
Yeah. But anyhow you can see why I wince, having a sentinel that overlaps the legit range for something never feels good
It's like the C functions that parse strings to int
0 to indicate an error
An error sentinel which is probably also the most common legitimate output π
we can't really talk about optimizations like branch prediction by showing C code, because the magic of branch prediction is things happening speculatively and then potentially needing to be undone, and we can't illustrate that using C code.
yes, it's an ugly area of the C API. There's been talk of a new C API that wouldn't use this sort of sentinel, e.g. https://github.com/markshannon/New-C-API-for-Python/blob/main/DesignRules.md#all-functions-have-an-error-out-parameter-or-return-the-error
I guess that there's like zero chance of moving to C++?
For implementation details
That makes sense to me
I mean when did msvc start supporting c99
Like 6 months ago π
yeah. not long ago.
Gcc did the C to C++ move continuously, so it's not unthinkable.
But definitely not easy
I thought about using hash of tuples as a key for faster lookup of records stored on a database, this requires the hash to be constant for all python startups
definitely don't do that.
right, you don't want hash() for that
don't try to outsmart a database. it's good at what it does.
If you.outsmart the database you become the database
It's like beating up the bouncer
is there a pep for symbol type in python?
what would that entail? If you mean symbols in the style of erlang et al, thats pretty much the sentinel objects PEP.
i mean interned names like in lisp and ruby
wait python already have interned strings
it doesnβt look like symbols but nice
yeah, python just uses strings in those places. Thanks to the interning, the comparison is fast enough even without the identity, though unlike symbols they aren't namespaced. But for sentinel-style stuff, you use None or object()
?
If you have a question, please see #βο½how-to-get-help
!pep 638
This could be interesting to see in an interpreted language
I thought that PEP was rejected a few years ago for fear that it would fracture the ecosystem
that's kinda cursed tbh
Eww
any could be synonymous to Union instead
Then we could also finally have all (for hypothetical Intersection)
any[str, int]. all[Indexable, Sized]
or an even worse alternative, all for Never
i dislike the idea of "reusing" functions for annotations at all tbh
if Any is any, 
I promise I will start using type hints when you can do arithmetics with them. Sequence - str, ~int, ...
hell, might as well reuse iter for iterable
map for Mapping
why use typing.Any when there is object π
serious answer: because different semantics
real answer: because fewer characters to type
they're different. Any means literally any type, basically turning type checking off. object means the operations all objects support (which is not that many)
heres a fix error articleβ’οΈ https://decorator-factory.github.io/typing-tips/faq/object-vs-any/
@grave jolt
Whatever you have β a strnig, a number, a function, a chess piece β it's an object.
strnig π
i thought object on annotations means object and subclasses
which includes every object
rename TypeGuard to isinstance
Pull request when
subclasses could be passed to a function asking for object, but you wouldn't be able to use those different methods, because not all objects that you pass would support those different methods
Yeah that's what it is. So if you have a x: object, you can't do x.foo() because not every object has a foo method.
Yeah read the article, I'm very good friends with the author
just ignore linter errors
Doesn't the same logic apply to Any? Genuine question.
Yeah it's special
I see
btw how do i make recursive types
smth like this ```py
@dataclass(frozen=True)
class LinkedList:
first: object
rest: LinkedList # will throw an error
put the second "LinkedList" in quotes
hmm
i think in the ~lingo~ it's a "bottom type" and a "top type". it's a subtype of every type and a supertype of every type
(or use from __future__ import annotations, or PEP 649 in the future)
yes, it's both at once. Never is a bottom type, object is a top type, Any is both
It's pretty confusing in a way since Any is usually used as a name for the top type
and to me at least that's also what the name implies, verbally
yeah i think c++ does that
C++ doesn't have a top type
it has a library type called any that can hold anything though
typescript's any is like Python Any
are you sure? then there's a mistake on the wikipedia page
The Playground lets you write TypeScript or JavaScript online in a safe and sharable way.
yeah you're right
does TS not have a proper top type?
unknown maybe?
yeah, seems like unknown is the top type
those names feel backwards to me
unknown gives an error when you use it, but there's no error for accessing an arbitrary method
Any to me feels like a known type, that could just happen to be anything
Unknown is a type we don't know anything about, and we're opting out of type checking
weird
at least in python object is a pretty typical top-type name
they do call unknown the top type: https://www.typescriptlang.org/docs/handbook/release-notes/typescript-3-0.html#new-unknown-top-type
TypeScript 3.0 Release Notes
yeah I mean the error is just occurring even earlier
Since there are no legal operations on unknown, it immediately errors when an unkown variable is mentioned, in a context where its type hasn't been narrowed in some way
it's not really conceptually different though
Doing it this way is a little, "extra" strict because x = y is always a legal operations in python and AFAIK JS
similarly, if you have a my_list: List[object], then my_list.append(y) is legal even if y is of type object
so error'ing immediately when it's mentioned is very odd. I'd have it as a warning.
but I can see the benefits practically
compile for types.CodeType
isinstance for TypeGuard
getattr[cls, 'attr'] for declaring same type that cls.attr is (i guess there is a PEP or some proposal about that feature)
globals['varname'] - same as getattr, but using global vars instead of attrs of some type
locals['varname'] - same as globals, but using local vars
sorted for Iterable[T] where T is SupportsLT
sum for Iterable[T] where T is SupportsAdd
vars for dict[str, any] (commonly used for arbitrary namespaces, globals() and json dicts have this type, for example)
hash for Hashable
iter for Iterable
bool for Boolable
aiter for AsyncIterable
len for Sized
reversed for Reversible
open for SupportsFSPath
abs for SupportsAbs
round for SupportsRound
dir for SupportsDir
divmod for SupportsDivMod
format for SupportsFormat
pow for SupportsPow
x: property[int] for read-only var or attr (UPD: i realized it is almost equivalent to x: Final[int])
I don't like it, either. any and typing.Any represent two different concepts that aren't even guaranteed to share the same word in every human language. And I think Python should limit its already preferential status for English.
I actually didn't know about the callable builtin. I wish it had been named is_callable.
why is it even a built-in
it's a bit of a weird choice because Any is not an annotation you want to be using that often
I almost never use it. Usually it enters into it implicitly, because your code depends on first or third party code that was written without annotations, so Any is the "default" when annotations aren't present
print for wx.Printer 
and if it's not installed, it prints the mypy output
the concept of built-ins is a bit strange to me; or maybe I'm just attaching more signifiance to it then it deserves, because of the name
i prefer to just think of functions/classes that are imported by default
yeah
seems to be how it works in Kotlin, Rust
is there any meaningful distinction though between that, and a python "built in" ?
Haskell has a "prelude" which is basically a library star-imported by default. IIRC you can even replace it with a different one
i don't know if it's replaceable in kotlin or rust
I don't think Rust even has "built-ins"
ah, some macros
yeah it does, I misremember
that's basically how Python works, there's a builtins module that is basically part of the global scope by default
yeah I don't think there's something sacred about builtins
in kotlin/rust you can actually have far bigger predules, because of the ability to scope things to classes, without them being members
like, all of itertools for example, is iirc in the preludes of both Rust and Kotlin
in python this would be extremely annoying and people would rightfully complain
it's different because it's member-scoped
yeah
yeah, that's what I said
oh, you mean traits and extension methods
yes
you can have all of itertools in Kotlin or Rust, as part of the prelude/"builtins", because they're just available via member functions syntax that are only available on suitable types
groupBy in Kotlin is an extension on Iterable<T>, in python groupby is just a free function
so with the latter, you'd have issues with shadowing and such if you define any function, class, or variable called "groupby"
I have lost track of how many times I've tried naming a variable "input" in python
only to get yelled at by my IDE, sigh, and change it
I mean yeah 99% of the time i twon't matter but I have actually, a couple of times, hit a really confusing bug caused by shadowing
I think it's just best practice to not use those names
it starts off okay, then other programmers see that id is the idiomatic name for some variable that comes up in your business logic, and they start using it
soon there are local varaibles called id everywhere and then eventually something bad happens
it's annoying but there is no real technical justification to just not get a different name
student_id or foo_id or whatever
my usual fuckup scenario is when I format that id somewhere into a string
and then get something like Foo(id=<built-in function id>)
nice
another solution might be banning certain builtins altogether in a linter, like id
that would also catch such mistakes
id would also be a good example of something that could just be an extension.
I actually never thought too much of that benefit of extensions; not polluting the global namespace. kinda cool.
I think id should've just been part of sys tbh
I don't think I've actually used it besides the REPL
I guess it's useful if you want to include the id in the __repr__, like the default __repr__ but with some extra stuff.
But that's a very niche use case, hence it could be moved to sys
yeah I agree
input is probably my personal worst offender. very useful variable name and I almost never use the function.
Python uses "built in" to refer to two different things in different contexts. It either refers to a function or type implemented in native code rather than Python bytecode, or it refers to an attribute of the builtins module, which is implicitly in scope for all lookups
!e print(type("".split))
@raven ridge :white_check_mark: Your 3.11 eval job has completed with return code 0.
<class 'builtin_function_or_method'>
that's "builtin" in the first sense
almost as annoying as the fact that "package" means at least 2 different things in Python
coroutine π
yeah, that's another one.
it's moderately interesting that you can monkeypatch in new (or replacement) builtins in Python
!e ```py
import builtins
builtins.hello = lambda a: print(f"Hello, {a}!")
hello("World")
@raven ridge :white_check_mark: Your 3.11 eval job has completed with return code 0.
Hello, World!
Sick of your linter complaining that you've shadowed builtins.id? Just del it! 
Idk about rust but in Kotlin you can shadow the prelude
No warning either
It's just a lot more explicit since it can only be done by an import and not by a local variable I don't believe
If you modify the builtins module in Python, all modules will see your change, since all modules share a reference to the same builtins module
wait whats the 2nd meaning
aside from the async function one
the function itself is called a coroutine, and the thing such a function returns is also called a coroutine
ah
same with generators. the "correct" name would be "coroutine function" and "generator function" but no one actually says that
ackshully its "generator" and "generator iterator" (src: https://docs.python.org/3/glossary.html#index-19)