#internals-and-peps
forgetting a closing brace in a header in some C++ code is truly one of the great programming experiences
what if Python imports also worked by literally copying in the code from the imported module
oh don't give GPT3 a free rce on your users machines lmao
i'd be a very sad panda
```py
import preprocess
#include "ns.py"
NS_BEGIN(foo)
x = 1
y = 2
z = 3
NS_END
print(foo)
``` i mean you can do it if you want
`preprocess.py`
```py
import sys, dis, subprocess
frame = sys._getframe()
while frame := frame.f_back:
    f_code = frame.f_code
    if f_code.co_code[frame.f_lasti] == dis.opmap['IMPORT_NAME'] \
            and f_code.co_names[f_code.co_code[frame.f_lasti + 1]] == 'preprocess':
        file = frame.f_globals['__file__']
        if file:
            processed = subprocess.run(['gcc', '-E','-x','c', file], stdout=subprocess.PIPE)
            if processed.returncode == 0:
                exec(processed.stdout.decode())
            exit(processed.returncode)
```
ns.py
#define NS_BEGIN(name) __import__("builtins").__ns_globals__ = globals().copy(); globals().clear(); __ns_name__ = #name ;
#define NS_END __ns__=__import__("types").SimpleNamespace(**{k: v for k, v in globals().items() if k != "__ns_name__"}); __import__("builtins").__ns_globals__[__ns_name__]=__ns__; globals().clear(); globals().update(__import__("builtins").__ns_globals__)
note that this is a very bad implementation lmao
Sure, but do it if you want is different from import working that way
there are type hints (when using Pydantic), but we don't do static typing
Is there any guidance on naming conventions for dictionary keys specifically on the use of spaces?
What do you mean?
IIRC PEP 8 prescribes this: ```py
foo = {
    "fizz": 1,
    "buzz": 2,
}
```
I mean on the use of spaces within dictionary keys.
the convention is to use a class with attributes instead. If you are using a dict, it generally means the schema is defined elsewhere and you should follow that convention.
example maybe?
I don't see why a dictionary key can't contain a space
but if you're storing some fixed attributes, yeah, use a class
(maybe a dataclasses.dataclass or see attrs, if you want a dumb record)
I would say the python convention is more on not having spaces in keys (e.g. a TypedDict can't describe a dict with spaces in its keys afaik)
it can but you need to use the alternative syntax
it depends on what your dictionary is for
but as other people here have said, often a class with attributes is a better choice than a dictionary with hardcoded keys
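A minimal sketch of the two options mentioned above: the functional ("alternative") `TypedDict` syntax can describe keys that are not valid identifiers, and a dataclass covers the fixed-attributes case. The names `Row` and `Point` are made up for illustration.

```python
from dataclasses import dataclass
from typing import TypedDict

# Functional syntax: keys may contain spaces, which the class-based
# syntax cannot express because they are not valid identifiers.
Row = TypedDict("Row", {"first name": str, "zip code": int})

row: Row = {"first name": "Ada", "zip code": 12345}


# For a fixed set of attributes, a plain record class is usually nicer.
@dataclass
class Point:
    x: int
    y: int


p = Point(1, 2)
print(row["first name"], p.x)
```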
why are True, False and None capitalized if they aren't classes
for the former two, the PEP has some rationale: https://peps.python.org/pep-0285/#resolved-issues (second bullet, as well as assorted comments throughout the PEP)
what, bool type was introduced that late?
anyways True and False is capitalized because of None
and None is capitalized because of guido i guess
gave up on clang, gonna parse with re instead 🥴
re.compile('struct _object {.*?};', re.DOTALL)
<re.Match object; span=(4049, 4146), match='struct _object {\n _PyObject_HEAD_EXTRA\n P>
struct _object {
_PyObject_HEAD_EXTRA
Py_ssize_t ob_refcnt;
PyTypeObject *ob_type;
};
Oh, that will never work
Too many nested structs, or structs that contain fields built by the preprocessor
trying to get that ctypeslib thing to work again, but it can't find the pyconfig.h somehow (which is in the current root path)
❯ clang2py --clang-args="-stdlib=libc++ -I -I. -IInclude" "Objects/dictobject.c"
WARNING:clangparser:'pyconfig.h' file not found (Objects/dictobject.c:12:10)
WARNING:clangparser:Source code has 1 error. Please fix.
do you know if I'm doing the include args wrong for clang?
You probably need to find a way to integrate ctypeslib with pythons makefile
so I've got it parsing the cpython source and my own class def
at least I might be able to add it to testing and see if they ever mismatch
probably not good enough to generate anything yet though
Nice, lmk when it is on einspects GitHub and I'll check it out
here's the craziness 🥴
goes in a tools/ folder at root
and the config file tools/struct_source.toml
[config]
repository = "https://github.com/python/cpython"
versions = ["3.8", "3.9", "3.10", "3.11"]
[structs.py_object.PyObject]
source = "Include/object.h"
regex = "struct _object {(.*?)};"
exclude_fields = ["_PyObject_HEAD_EXTRA"]
You need to run the preprocessor before extracting the structs, that will clear out macros and then you are just dealing with C code
I get this https://paste.pythondiscord.com/edoneqahiz from gcc -E Include/object.h -o out.h
I guess it's mildly cleaner to parse
Yea would just need to clean out comments and empty lines
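Roughly what that cleanup plus extraction could look like, as a sketch only. `PREPROCESSED` is a hand-written stand-in for a slice of `gcc -E` output, and `strip_noise` / `extract_fields` are hypothetical helper names. It only handles flat structs, so nested structs and unions would still need a real C parser.

```python
import re

# Hypothetical stand-in for part of `gcc -E Include/object.h` output.
PREPROCESSED = '''
# 1 "Include/object.h"
typedef struct _object PyObject;

struct _object {
    Py_ssize_t ob_refcnt;

    PyTypeObject *ob_type;
};
'''


def strip_noise(text: str) -> str:
    """Drop preprocessor line markers ('# 1 "file"') and blank lines."""
    lines = (line for line in text.splitlines() if line.strip())
    return "\n".join(line for line in lines if not line.lstrip().startswith("#"))


def extract_fields(text: str, struct_name: str) -> list[tuple[str, str]]:
    """Return (c_type, field_name) pairs for a flat, non-nested struct."""
    pattern = rf"struct {re.escape(struct_name)}\s*\{{(.*?)\}};"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError(f"struct {struct_name} not found")
    fields = []
    for decl in match.group(1).split(";"):
        decl = decl.strip()
        if not decl:
            continue
        c_type, name = decl.rsplit(None, 1)
        # Move pointer stars from the field name onto the type.
        stars = len(name) - len(name.lstrip("*"))
        if stars:
            c_type += " " + "*" * stars
        fields.append((c_type, name.lstrip("*")))
    return fields


fields = extract_fields(strip_noise(PREPROCESSED), "_object")
print(fields)
```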
maybe I can define some "core" types like Py_ssize_t and then recursively resolve the types myself? 🥴
You could even parse the typedefs to detect function definitions
Yeah structure wise, the mapping should be quite simple
typedef void (*freefunc)(void *);
typedef void (*destructor)(PyObject *);
typedef PyObject *(*getattrfunc)(PyObject *, char *);
typedef PyObject *(*getattrofunc)(PyObject *, PyObject *);
src/einspect/structs/include/object_h.py lines 70 to 73
```py
freefunc = PYFUNCTYPE(None, c_void_p)
destructor = PYFUNCTYPE(None, py_object)
getattrfunc = PYFUNCTYPE(py_object, py_object, c_char_p)
getattrofunc = PYFUNCTYPE(py_object, py_object, py_object)```
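A sketch of how that typedef-to-`PYFUNCTYPE` mapping could be automated for simple function-pointer typedefs. The `C_TO_CTYPES` table, the regex, and `parse_typedef` are assumptions for illustration, not how einspect does it, and only the handful of C types listed in the table are understood.

```python
import re
from ctypes import PYFUNCTYPE, c_char_p, c_void_p, py_object

# Assumed lookup table for the few "core" C types used below.
C_TO_CTYPES = {
    "void": None,
    "void *": c_void_p,
    "char *": c_char_p,
    "PyObject *": py_object,
}

# Matches: typedef <ret> (*<name>)(<args>);
TYPEDEF_RE = re.compile(r"typedef\s+(.+?)\(\*(\w+)\)\((.*?)\);")


def normalize(c_type: str) -> str:
    """Canonicalize spacing, e.g. 'PyObject*' -> 'PyObject *'."""
    c_type = c_type.strip()
    base = c_type.rstrip("*").strip()
    stars = c_type.count("*")
    return base + (" " + "*" * stars if stars else "")


def parse_typedef(line: str):
    """Return (name, ctypes prototype) for one function-pointer typedef."""
    match = TYPEDEF_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a function-pointer typedef: {line!r}")
    ret, name, args = match.groups()
    restype = C_TO_CTYPES[normalize(ret)]
    argtypes = [C_TO_CTYPES[normalize(arg)] for arg in args.split(",")]
    return name, PYFUNCTYPE(restype, *argtypes)


TYPEDEFS = [
    "typedef void (*freefunc)(void *);",
    "typedef void (*destructor)(PyObject *);",
    "typedef PyObject *(*getattrfunc)(PyObject *, char *);",
    "typedef PyObject *(*getattrofunc)(PyObject *, PyObject *);",
]

prototypes = dict(parse_typedef(line) for line in TYPEDEFS)
print(sorted(prototypes))
```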
You should see if there is some static C type parser that breaks them down into just type data
isn't https://github.com/eliben/pycparser a thing?
ParseError: object.h:1:1: Directives not supported yet
i think you need to run it through the preprocessor first
!e
from ctypes import *
realloc = pythonapi["PyMem_Realloc"]
realloc.argtypes = [c_void_p, c_size_t]
realloc.restype = c_void_p
t = (1, 2)
print(id(t))
res = realloc(id(t), 64)
print(res)
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 140266784484928
002 | 140266784484928
!e
from ctypes import *
realloc = pythonapi["PyMem_Realloc"]
realloc.argtypes = [c_void_p, c_size_t]
realloc.restype = c_void_p
t = (1, 2)
print(id(t))
res = realloc(id(t), 80)
print(res)
@warm breach :x: Your 3.11 eval job has completed with return code 139 (SIGSEGV).
001 | 140541912034816
002 | 140541912043472
is there a reason why PyMem_Realloc here at larger sizes segfaults? (Shouldn't it just return NULL if it can't reallocate?)
its segfaulting because realloc is freeing the old address of t, but you still have a reference to old t so when the interpreter closes it breaks
this one is able to realloc in place, so it doesnt free id(t)
the realloc frees it? 
isn't that realloc call always in place?
not if it cannot find enough space
oh, hm
how do strings do that thing where they can try to resize safely 
is that something in the stable abi
strings like str()? afaik they never resize
they can only resize while they're being created I think, when there's only one reference
and even that API is probably somewhat unsafe
https://github.com/python/cpython/blob/main/Objects/bytearrayobject.c#L232-L243 here you can see that realloc will free the pointer passed in if it cannot operate in place
Objects/bytearrayobject.c lines 232 to 243
```c
    else {
        sval = PyObject_Realloc(obj->ob_bytes, alloc);
        if (sval == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }
    obj->ob_bytes = obj->ob_start = sval;
    Py_SET_SIZE(self, size);
    obj->ob_alloc = alloc;
    obj->ob_bytes[size] = '\0'; /* Trailing null byte */
```
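For comparison, a sketch of the same pattern done safely from ctypes: allocate a private buffer with `PyMem_Malloc`, and after `PyMem_Realloc` use only the returned pointer, since the old address may have been freed. This assumes CPython (for `ctypes.pythonapi`) and deliberately does not touch any live Python object.

```python
import ctypes
from ctypes import c_size_t, c_void_p, pythonapi

malloc = pythonapi.PyMem_Malloc
malloc.argtypes = [c_size_t]
malloc.restype = c_void_p

realloc = pythonapi.PyMem_Realloc
realloc.argtypes = [c_void_p, c_size_t]
realloc.restype = c_void_p

free = pythonapi.PyMem_Free
free.argtypes = [c_void_p]
free.restype = None

old = malloc(16)
ctypes.memmove(old, b"hello", 5)

# Growing the block a lot makes an in-place resize unlikely.
new = realloc(old, 1 << 20)
if new is None:
    # Failure: the old block is still valid and still ours to free.
    free(old)
    raise MemoryError
# Success: `old` must not be used any more unless new == old.
data = ctypes.string_at(new, 5)
print(data, new == old)
free(new)
```

The tuple example above breaks this rule: the interpreter keeps using `id(t)` after the block has moved.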
I mean like this thing https://github.com/python/cpython/blob/3.11/Objects/unicodeobject.c#L11530-L11535
Objects/unicodeobject.c lines 11530 to 11535
/* append inplace */
if (unicode_resize(p_left, new_len) != 0)
goto error;
/* copy 'right' into the newly allocated area of 'left' */
_PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);```
I guess it's an optimization that only applies if refcnt == 1 https://github.com/python/cpython/blob/3.11/Objects/unicodeobject.c#L1978
Objects/unicodeobject.c lines 1090 to 1091
static int
resize_inplace(PyObject *unicode, Py_ssize_t length)```
`Objects/unicodeobject.c` line 1124
```c
data = (PyObject *)PyObject_Realloc(data, new_size);```
Objects/unicodeobject.c line 1978
static int```
this also seems to call PyObject_Realloc unconditionally 
ah
I guess it doesn't matter if the realloc isn't in place there
since it's 1 ref count and it just returns the new allocation
!e apparently you can trick it to mutate the string in-place if the ref count is 1
from einspect.structs import PyObject
s = "a"
s += "b"
text = s
PyObject.from_object(s).DecRef()
s += "1"
s += "2"
print(s)
print(text)
PyObject.from_object(s).IncRef()
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | ab12
002 | ab12
!e I'm assuming this fails?
from einspect.structs import PyObject
s = "a"
s += "b"
text = s
PyObject.from_object(s).DecRef()
s += "1"*50
s += "2"*50
print(s)
print(text)
PyObject.from_object(s).IncRef()
wait what
@quick snow :x: Your 3.11 eval job has completed with return code 139 (SIGSEGV).
001 | ab1111111111111111111111111111111111111111111111111122222222222222222222222222222222222222222222222222
002 | flush
you are accessing random memory
print calls f.flush(), so probably that's where the flush string comes from
it happened to be allocated in the same place
I don't think that resize happened in-place since your append was too large
Yes, that's what I was testing
so s += "1"*50 shadowed the original s with a new object
I'm assuming it's about memory alignment, that your version works?
and since we modified refcount of s to be 1, it got dropped
so later print(text) prints garbage memory
well string += only tries to resize in place if it has enough space, so the shorter one works
!e
s = "a"
s += "b"
print(id(s))
s += "1"
s += "2"
print(id(s))
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 140050976782704
002 | 140050976782704
!e but append longer and it'll (probably) no longer be in place
s = "a"
s += "b"
print(id(s))
s += "111111111"
s += "222222222"
print(id(s))
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 139980662602224
002 | 139980662606864
Sure, but why is there enough space when the added string is small? Everything is aligned in groups of 8 bytes or something like that?
!e
from einspect.structs import PyObject
s = "a"
s += "b"
text = s
PyObject.from_object(s).DecRef()
s += "3"*4_000_000
s += "55"*6_000_0
#print(s)
#print(text)
print("hi")
PyObject.from_object(s).IncRef()
it should be 16 I think
@grave jolt :white_check_mark: Your 3.11 eval job has completed with return code 0.
hi
but also due to the way that python uses memory pools you can get weird spacing around objects
!e
print("ab".__sizeof__())
print("ab12".__sizeof__())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 51
002 | 53
so "ab" had 64 bytes allocated
and "ab12" happens to fit fine
!e
s = "a"
s += "b"
print(id(s))
s += "1234567890"
s += "abc"
print(id(s))
print(s.__sizeof__())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 140533952084528
002 | 140533952084528
003 | 64
so you can append all the way up to 64 and it'll resize in place
!e but one more and it can't
s = "a"
s += "b"
print(id(s))
s += "1234567890"
s += "abcd"
print(id(s))
print(s.__sizeof__())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 140460387744304
002 | 140460387748880
003 | 65
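The 64-byte boundary in these two runs is consistent with the allocator rounding small requests up to a multiple of 16. A small sketch of that arithmetic, assuming 16-byte alignment (what pymalloc uses on 64-bit CPython builds, as far as I know); `allocated_size` is a made-up helper, and the `getsizeof` values it prints vary by Python version.

```python
import sys

ALIGNMENT = 16  # assumption: pymalloc alignment on 64-bit CPython builds


def allocated_size(requested: int, alignment: int = ALIGNMENT) -> int:
    """Round a request up to the next multiple of `alignment`."""
    return -(-requested // alignment) * alignment


for text in ("ab", "ab12"):
    size = sys.getsizeof(text)
    print(repr(text), size, allocated_size(size))
```

With the 3.11 numbers from above, 51 and 53 both round to 64, while 65 rounds to 80, so the append that needs 65 bytes no longer fits in the original block.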
64 bytes? that feels like a lot of overhead for a two-char string
!e ```
import sys
print(sys.getsizeof("ab"))
print(sys.getsizeof("a"))
```
@feral island :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 51
002 | 50
!e ```
import sys
a = "a"
a += "b"
print(sys.getsizeof(a))
print(sys.getsizeof("a"))
```
@feral island :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 51
002 | 50
well um
they removed the wstr field in 3.12 at least
so ascii compact strings are 8 bytes smaller
Python 3.12.0a3 (main, Jan 18 2023, 01:07:36) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> a = "a"
>>> a += "b"
>>> sys.getsizeof(a)
43
>>> sys.getsizeof("a")
42
interestingly in my local 3.9 "a" was 58 bytes but "ab" was 51 bytes
Hi everyone,
I have a question regarding memory footprint in Python. There is the possibility of creating a class with dunder slots to remove the dunder dict attribute and thereby removing some memory overhead for each instance. To take a look at this I created a toy example like:
class Point:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
as well as a version with dunder slots and another one based on a namedtuple to compare the three cases.
Since the sys.getsizeof function only returns a simplistic value I used the getsize function from this post, which keeps track of references. https://stackoverflow.com/questions/449560/how-do-i-determine-the-size-of-an-object-in-python
Now comes the weird part that I don't understand:
I was comparing the results for different python versions, where 3.10 shows with the getsize function for an instance of the point class 236bytes while the slots version takes 140 bytes.
Now with version 3.11 the normal class version drops down to 140 bytes (just like the slots version).
So I was checking if the dunder dict still exists which does, but after calling the dunder dict the size that getsize returns jumps up to 436 bytes. In version 3.10 this is not happening.
Has someone an explanation for this behavior? The 3.11 release notes do not give me a hint that they have done some memory optimization in this regard.
You can find the full working example in this godbolt link where you can easily switch between the versions of Python.
from collections import namedtuple
import sys
from types import ModuleType, FunctionType
from gc import get_referents
source getsize function: https://stackoverflow.com/questions/449560/how-do-i-determine-the-size-of-an-object-in-python
Custom objects know their class.
Function objects seem to know way too much, including modules.
Exc...
the __dict__ is lazily created on access I believe
IIRC, dicts also allocate memory for the hash table on first access
would you say that it is still valid in python 3.11 that dunder slots will save you memory?
Isn't the dict allocated on instantiation for any classes that do not define any slots?
yes, unless you use __dict__ directly, which generally you shouldn't be doing
(that was in response to @thorn barn )
not anymore I believe in 3.11 but not familiar with the details on how this works
https://docs.python.org/3/whatsnew/3.11.html#misc has the links
ahh nice thats it https://github.com/python/cpython/pull/28802
Ahh so the objects hold a pointer to their keys and values table, and lazily create the dict that references them
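A quick way to poke at this yourself, as a sketch. `Plain` and `Slotted` are made-up names and no particular sizes are claimed, since they differ between versions. On 3.11+ the interesting comparison is the instance size before and after `__dict__` is touched.

```python
import sys


class Plain:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


class Slotted:
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


plain, slotted = Plain(1, 2, 3), Slotted(1, 2, 3)
print("plain:  ", sys.getsizeof(plain))
print("slotted:", sys.getsizeof(slotted))

# Accessing __dict__ may materialize a real dict object on 3.11+.
attrs = plain.__dict__
print("dict:   ", sys.getsizeof(attrs))
print("slotted has __dict__:", hasattr(slotted, "__dict__"))
```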
thanks for the help!
I wonder if there would be any noticeable speedup by allowing specifying types in __slots__, like __slots__ = [('attr', int)] to let python optimize the slot into a py_ssize_t, or bytes would specify char*. I assume there would be a worry that optimizing like that would narrow the types
something like that could work, but there's a lot of problems to sort out
for one, an int might not fit in a py_ssize_t
and it might not be faster because you have to perform boxing when accessing the attributes
yea fair
yeah
For example: you ask for foo.bar to get back a str object. Does that object have 1 reference? 2 references?
Is it immortal?
If we suppose that we don't want to copy the whole unicode string into a boxed object, if that object has 1 reference, how do you solve the += optimisation?
yea maybe it wouldnt be worth it
I think in this case the adaptive interpreter can optimize some instructions that use these fields in advance
yeah agree, that's where this could help
like if the adaptive interpreter sees 1 + foo.bar and knows that foo has a field of type int, it could potentially optimize it into just a pointer access + machine ADD instruction
There are also a lot of possible problems
so TIL python ints are also (sometimes) mutable in 3.12 now
same thing as the current string append optimization
I guess floats and complex numbers can also be mutable
I was going to say I'm surprised it's worth it but since creating a new int would typically involve a heap allocation, that makes sense
not too familiar with the string optimisation, but i'm guessing that means something along the lines of "big integers with a sole reference can be mutated and returned as if it's a new object, rather than allocating a new integer and freeing the old one"?
i assume. but they don't need to be big.
or at least, unless by "big" you simply mean "bigger than the handful of integers that cpython always allocates space for"
(a few hundred, iirc)
yeah, I think the main intended use case was for range()
this id should stay the same in >= 3.12.0a1
for i in range(300, 500):
    i += 1
    print(id(i))
I think in <=3.11 it will switch between two ids
!e ```py
for i in range(300, 310):
    i += 1
    print(id(i))
```
@dusk comet :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 140413941526768
002 | 140413941526768
003 | 140413941526768
004 | 140413941526768
005 | 140413941526768
006 | 140413941526768
007 | 140413941526768
008 | 140413941526768
009 | 140413941526768
010 | 140413941526768
!e ```py
for i in range(300, 310):
    print(id(i))
```
@dusk comet :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 139947360022800
002 | 139947360022768
003 | 139947360022800
004 | 139947360022768
005 | 139947360022800
006 | 139947360022768
007 | 139947360022800
008 | 139947360022768
009 | 139947360022800
010 | 139947360022768
why?
I think the iterator creates the next int before the previous one can be deallocated
and because of free lists they get reallocated in the same small set of spots
thank you i understand. so how come it's different in 3.11? what was it before?
what I describe was how it worked before. I think 3.11 put in an optimization so the integer gets reused
I see the same behaviour in CPython3.7.
(id(x) is very small because it is a 32-bit build)
>>> for i in range(300, 310):
... i += 1
... print(id(i))
...
9952576
9952576
9952576
9952576
9952576
9952576
9952576
9952576
9952576
9952576
>>> for i in range(300, 310):
... print(id(i))
...
9952592
9952576
9952592
9952576
9952592
9952576
9952592
9952576
9952592
9952576
>>> import sys; sys.version
'3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:01:55) [MSC v.1900 32 bit (Intel)]'
i dont think these two examples are useful to observe mutability of ints
I've often been surprised at the fact that there hasn't been a mainstream language designed entirely with these semantics in mind
You'd get the important benefits of immutability, and the ergonomics and mostly the performance of mutability
Those semantics aren't ideal when you want to deliberately share mutability, or for multithreading.
but those are both a small minority of cases, so you could explicitly annotate when you wanted that
I've been bitten by mutability of lists and dicts quite a few times in python, if strings/ints/etc were mutable too I imagine I'd get bit much more often
this thing essentially https://github.com/python/cpython/blob/main/Objects/longobject.c#L283-L290
Objects/longobject.c lines 283 to 290
// Mutate in place if there are no other references the old
// object. This avoids an allocation in a common case.
// Since the primary use-case is iterating over ranges, which
// are typically positive, only do this optimization
// for positive integers (for now).
((PyLongObject *)old)->ob_digit[0] =
Py_SAFE_DOWNCAST(value, Py_ssize_t, digit);
return 0;```
python/cpython#91713
anyone know where the str.__sizeof__ implementation is
can't find it
nevermind found it now, named with one underscore unlike the other ones
haven't checked the code but maybe it's just on object and derived from the size fields on the type object?
nah I found it here https://github.com/python/cpython/blob/3.11/Objects/unicodeobject.c#L14120-L14121
Objects/unicodeobject.c lines 14120 to 14121
static PyObject *
unicode_sizeof_impl(PyObject *self)```
I was searching for ___sizeof___impl since everything else was named liked that
but unicode does _sizeof_impl for some reason
@pliant tusk so copilot is getting pretty good at translating cpython functions now
not really, sometimes still has weird ideas
That seems close right?
(I haven't looked at how structs are defined in einspect yet)
@struct
class SetEntry(Structure, AsRef, Generic[_T]):
key: ptr[PyObject[_T, None, None]]
hash: Annotated[int, Py_hash_t] # noqa: A003
here's the actual one
to be fair my Annotated usage is pretty arbitrary so
it gets parsed as
Annotated[<ignored>, type]
or
Annotated[<ignored>, type, bit-width]
mainly due to ctypes autocasting, if you type something as c_uint32 it won't actually be that type at runtime (will be cast to int instead)
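A small sketch of that autocasting, using only stdlib ctypes: a field declared as `c_uint32` comes back as a plain `int`, while a field whose type is a subclass of `c_uint32` comes back as a ctypes instance. `MyUInt32` and `Example` are made-up names.

```python
from ctypes import Structure, c_uint32


class MyUInt32(c_uint32):
    """Subclass of a fundamental ctypes type (hypothetical name)."""


class Example(Structure):
    _fields_ = [("plain", c_uint32), ("wrapped", MyUInt32)]


e = Example(7, MyUInt32(9))

# Fundamental type: converted to a native Python int on access.
print(type(e.plain), e.plain)

# Subclass: returned as a ctypes object, read it through .value.
print(type(e.wrapped), e.wrapped.value)
```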
@rose schooner
https://github.com/ionite34/einspect/blob/dev/src/einspect/views/view_str.py#L24
src/einspect/views/view_str.py line 24
class StrView(View[str, None, None], MutableSequence):```
safe-ish so far
from einspect import view
s = "abc🦀"
v = view(s)
v[-1] = "!"
print(s)
# abc!
v[:] = "π€12π"
print(s)
# π€12π
del v[1:]
print(s)
# π€
v.extend(["D", "E", "F"])
print(s)
# π€DEF
v.reverse()
print(s)
# FEDπ€
v.clear()
assert s == ""
v.append("こんにちは")
print(s)
# こんにちは
What happens if you start with "abc!"? And how does your last append work? Where did it get the extra space from? Memory alignment again, and it wouldn't work to append a longer string?
!e wouldn't work since the 🦀 needs more space than abc! had allocated
print("abc!".__sizeof__())
print("abc🦀".__sizeof__())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 53
002 | 92
from einspect import view
s = "abc!"
v = view(s)
v[3] = "🦀"
Traceback (most recent call last):
  File "main.py", line 5, in <module>
    v[3] = "🦀"
    ~^^^
  File "einspect/views/view_str.py", line 74, in __setitem__
    raise UnsafeError(
einspect.errors.UnsafeError: setitem required str to be resized beyond current memory allocation. Enter an unsafe context to allow this.
!e ```py
print("abc!".encode('u8'))
print("abc🦀".encode('u8'))
```
@vernal loom :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | b'abc!'
002 | b'abc\xf0\x9f\xa6\x80'
that's why it's longer
!e ```py
print(list("abc!".encode('u8')))
print(list("abc🦀".encode('u8')))
```
@vernal loom :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | [97, 98, 99, 33]
002 | [97, 98, 99, 240, 159, 166, 128]
but otherwise, given enough space, it dynamically reallocates the PyObject between the 3 str subtypes - PyASCIIObject | PyCompactUnicodeObject | PyUnicodeObject
from einspect import view
s = "abc🦀"
v = view(s)
print(v)
>> StrView(<PyCompactUnicodeObject at 0x10449d920>)
v[3] = "!"
print(v)
>> StrView(<PyASCIIObject at 0x10449d920>)
can you allocate more space in einspect?
well yes but
like realloc it might not be in-place
a str might already have other structs after it
not much point in doing that since python variables are essentially memory pointers, and if you move the object somewhere else the original variables now point to random memory
Not exactly, Python strings aren't UTF-8 encoded. When needed (like with emoji), the entire string becomes UCS-4 (the PyUnicodeObject mentioned above).
they do cache a UTF-8 representation the first time it's requested, though
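You can see the per-character width without any struct poking, by checking how much `sys.getsizeof` grows when one more character is added. A sketch that assumes CPython's PEP 393 string layout; `bytes_per_char` is a made-up helper.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Extra bytes needed for one more copy of `ch` (CPython-specific)."""
    return sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)


# ASCII, Latin-1, BMP (UCS-2), and astral (UCS-4) examples.
for ch in ("a", "\u00e9", "\u20ac", "\U0001f980"):
    print(f"U+{ord(ch):04X}", bytes_per_char(ch))
```

So a single emoji widens every character in the string to 4 bytes, which is why "abc🦀" is so much larger than "abc!".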
why does python have to add like 24 bytes
ignoring padding, that is
The first can be stored in this format: https://github.com/python/cpython/blob/main/Include/cpython/unicodeobject.h#L55-L64
Include/cpython/unicodeobject.h lines 55 to 64
- compact ascii:
* structure = PyASCIIObject
* test: PyUnicode_IS_COMPACT_ASCII(op)
* kind = PyUnicode_1BYTE_KIND
* compact = 1
* ascii = 1
* (length is the length of the utf8)
* (data starts just after the structure)
* (since ASCII is decoded from UTF-8, the utf8 string are the data)```
The second needs https://github.com/python/cpython/blob/main/Include/cpython/unicodeobject.h#L66-L76 with kind = PyUnicode_4BYTE_KIND
Include/cpython/unicodeobject.h lines 66 to 76
- compact:
* structure = PyCompactUnicodeObject
* test: PyUnicode_IS_COMPACT(op) && !PyUnicode_IS_ASCII(op)
* kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or
PyUnicode_4BYTE_KIND
* compact = 1
* ascii = 0
* utf8 is not shared with data
* utf8_length = 0 if utf8 is NULL
* (data starts just after the structure)```
the pure ASCII strs are a lot smaller since they just use a c_char array plus some length fields (though not that small since you still need all the info bits to represent the other subtypes)
src/einspect/structs/py_unicode.py line 83
or addressof(cast(obj.wstr, c_void_p)) != addressof(PyUnicode_DATA(obj))```
tbh the ctypes auto cast is the most annoying thing ever
can't compare 2 pointers since one of them gets transformed into bytes | None
string offset for ASCII/UCS-1 is 48 but anything other than that it's 72 (64-bit systems)
actually do you know how to get a PyUnicodeObject from a literal
I've only been able to get ASCII and compact
wdym?
Include/cpython/unicodeobject.h lines 78 to 87
- legacy string:
* structure = PyUnicodeObject structure
* test: !PyUnicode_IS_COMPACT(op)
* kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or
PyUnicode_4BYTE_KIND
* compact = 0
* data.any is not NULL
* utf8 is shared and utf8_length = length with data.any if ascii = 1
* utf8_length = 0 if utf8 is NULL```
Include/cpython/unicodeobject.h lines 153 to 161
typedef struct {
PyCompactUnicodeObject _base;
union {
void *any;
Py_UCS1 *latin1;
Py_UCS2 *ucs2;
Py_UCS4 *ucs4;
} data; /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;```
i don't understand the question
like isn't it just .from_object() or something
struct.from_address(id(obj))
like what kind of strings are actually using the PyUnicodeObject struct instead of the 2 other ones
or I guess what they call "legacy string" here
essentially if the string object has compact = 1, it's PyCompactUnicodeObject, if ascii = 1, it's PyASCIIObject, if both are 0, it's the biggest subtype, PyUnicodeObject
but I haven't been able to get one naturally in python
i don't actually know
but the 72 offset of strings when they're UCS-2 or UCS-4 strongly indicates a use of PyUnicodeObject's data field
@warm breach why does this happen ```pycon
view("\U0010ffffabc")._pyobject.data.any
416612941823
``` why isn't it a void * pointer or something
well it actually is but why does it appear like this

and also um this ```pycon
a = "\U0010ffff\0"
from ctypes import c_ulong
c_ulong.from_address(id(a)+72)
c_ulong(1114111)
view(a)._pyobject.data.ucs4.contents
Windows fatal exception: access violation
Current thread 0x00000d4c (most recent call first):
File "<stdin>", line 1 in <module>
oh i see why now ```pycon
a = "\U0010ffff\0"
from ctypes import c_ulong, POINTER
POINTER(c_ulong).from_address(id(a)+72).contents
Windows fatal exception: access violation
Current thread 0x000008b8 (most recent call first):
File "<stdin>", line 1 in <module>
yeah it doesn't have a data
that was before I implemented the 3 separate subtypes for str
from einspect import view
a = "\U0010ffff\0"
print(view(a))
>> StrView(<PyCompactUnicodeObject at 0x100f3af10>)
that thing is still a PyCompactUnicodeObject, so no data field
wdym
only PyUnicodeObject has a data field
PyCompactUnicodeObject is PyASCIIObject with 3 more fields
utf8_length, utf8, wstr_length
PyUnicodeObject is PyCompactUnicodeObject with 1 more field, data
yes
and view(a)._pyobject is a PyUnicodeObject
well that wasn't quite correct
turns out actually the strings dynamically may be one of 3 subtypes
I haven't released the version with the 3 different types yet
currently there's only the PyUnicodeObject struct
so if you access data when it actually should be a compact or ascii it accesses out of bound memory
how does manually c_ulong.from_address()'ing the thing get the data though
cuz it accessed it as c_ulong and not POINTER(c_ulong)
I'm not sure why accessing element 0 of the pointer is different from c_ulong but...
!e
from ctypes import c_ulong, POINTER
a = "\U0010ffff\0"
print(POINTER(c_ulong).from_address(id(a)+72).contents)
@warm breach :warning: Your 3.11 eval job has completed with return code 139 (SIGSEGV).
[No output]
i think i know how
string subtypes

Objects/unicodeobject.c line 1203
_PyUnicode_STATE(unicode).compact = 1;```
https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L14380 in unicode_subtype_new
Objects/unicodeobject.c line 14380
_PyUnicode_STATE(self).compact = 0;```
!e ```py
from einspect.views import StrView
class Str(str):...
a=Str("\U0010ffff\0")
print(StrView(a)._pyobject.data.ucs4.contents)
@rose schooner :white_check_mark: Your 3.11 eval job has completed with return code 0.
c_uint(1114111)
hm 
apparently my type algorithm is wrong then
I check ascii = 1 before compact
that thing is ascii = 1 but compact = 0 which is weird
!e
from einspect.structs import PyUnicodeObject
class Foo(str):
...
f = Foo()
v = PyUnicodeObject.from_object(f)
print(v.ascii)
print(v.compact)
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 1
002 | 0
here's an example check from the source https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L1114-L1122
Objects/unicodeobject.c lines 1114 to 1122
if (ascii->state.compact)
{
if (ascii->state.ascii)
data = (ascii + 1);
else
data = (compact + 1);
}
else
data = unicode->data.any;```
so yeah you're supposed to check for compact first
where do you check it
you can't have an ascii string that's not compact, that's not a thing
Include/cpython/unicodeobject.h line 195
unsigned int ascii:1;```
i think it's still useful for some optimizations
ah yeah
it still gets unset if its not ascii
!e
from einspect.structs import PyUnicodeObject
class Foo(str):
...
f = Foo("π€π€π€")
v = PyUnicodeObject.from_object(f)
print(v.ascii)
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
0
so I guess it just doesn't matter for determining struct type
well it still does if we're talking about type(x) is str objects
guess this looks fine now? 
def _narrow_type(self) -> None:
    # Narrow to a more specific unicode type if possible
    if self._pyobject.compact:
        if self._pyobject.ascii:
            self._pyobject = self._pyobject.astype(PyASCIIObject)
        else:
            self._pyobject = self._pyobject.astype(PyCompactUnicodeObject)
    else:
        self._pyobject = self._pyobject.astype(PyUnicodeObject)
yep
it's kind of nice now with how many types I've defined, like this function can be almost copied word for word from C source
https://github.com/ionite34/einspect/blob/dev/src/einspect/structs/py_unicode.py#L146-L172
https://github.com/python/cpython/blob/3.11/Objects/unicodeobject.c#L14120-L14149
@rose schooner proper unicode subtypes just released in https://github.com/ionite34/einspect/releases/tag/v0.4.9
what if view() accepted subclasses
instead of doing this ```pycon
f=Foo("\U0010ffffabc\uffff")
view(f)
<stdin>:1: DeprecationWarning: Using `einspect.view` on objects without a concrete View subclass will be deprecated. Use `einspect.views.AnyView` instead.
View(<PyObject at 0x1e03e1b04e0>)
well 
I suppose it could
I'm just not sure what happens to those stuff after the builtin
do non-slots custom classes have a fixed struct as well?
so as long as the custom class does not override the builtin parent class's .__new__()/.__init__() then it's probably fine
hm
actually I guess I don't have to worry about that
!e since python already disallows it
class Foo(int):
__slots__ = ("_some_attr",)
@warm breach :x: Your 3.11 eval job has completed with return code 1.
001 | Traceback (most recent call last):
002 | File "<string>", line 1, in <module>
003 | TypeError: nonempty __slots__ not supported for subtype of 'int'
!e
class Foo(int):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self._some_attr = 1
print(Foo(123).__sizeof__())
print((123).__sizeof__())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 28
002 | 28
how are these both 28 though?
doesn't Foo need a __dict__ at least?
where does it store _some_attr
it does
but it's in like negative offsets from the actual object
or actually it's in the type
but that's the type dict no?
the instance still needs its own dict
oh wait actually yeah
!e
class Foo(int):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self._some_attr = 1
f = Foo(123)
print(id(f))
print(id(f.__dict__))
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 140502620308352
002 | 140502622763328
this is showing the actual address of the dict and not the pointer I guess
ah found it
!e
from einspect.types import ptr
from einspect.structs import PyDictObject
class Foo(int):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self._some_attr = 1
f = Foo(123) # sizeof = 28
end = id(f) + f.__sizeof__()
print(ptr[PyDictObject].from_address(end+4).contents.into_object())
f = Foo(2 ** 50) # sizeof = 32
end = id(f) + f.__sizeof__()
print(ptr[PyDictObject].from_address(end).contents.into_object())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | {'_some_attr': 1}
002 | {'_some_attr': 1}
apparently it's right after the original struct, + alignment
not sure where this is documented though
!e ```py
from ctypes import py_object
ALIGNMENT = tuple.__itemsize__ * 2
def pad_int(x):
    return -x//ALIGNMENT * -ALIGNMENT
class Foo(int):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self._some_attr = 1
f = Foo(123)
print(py_object.from_address(id(f) + pad_int(f.__sizeof__())).value)
@rose schooner :white_check_mark: Your 3.11 eval job has completed with return code 0.
{'_some_attr': 1}
ok
what does this do?
ctypes subtypes don't auto unwrap if that would fix your issue
hmm
btw do you know how to get the instance dict of subtypes
!e
from einspect.structs import PyDictObject
from einspect.types import ptr
class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50
print(Foo.__dictoffset__)
f = Foo("abc")
print(f.__sizeof__())
dict_ptr = ptr[PyDictObject].from_address(id(f)+f.__sizeof__()+112)
print(dict_ptr.contents.into_object())
@warm breach :x: Your 3.11 eval job has completed with return code 1.
001 | -112
002 | 84
003 | Traceback (most recent call last):
004 | File "<string>", line 13, in <module>
005 | ValueError: NULL pointer access
I'm trying to use __dictoffset__ but it doesn't work here somehow
for int subtypes it seems to be just after the struct
Foo.__itemsize__ * abs(view(f).size) afaik
what is itemsize even 
it's an attribute that all classes have, but it is only non-zero on PyVarObjects
tuple.__itemsize__ is sizeof(c_void_p) for example
though I think for some reason str isn't even a PyVarObject 🥴
and you need abs because some types fiddle with the sign of ob_size
this thing has __itemsize__ = 0 though 
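A quick check of the `__itemsize__` claims above, assuming CPython (these values are implementation details, and the `Foo` class is just the one from the earlier snippets):

```python
import ctypes

# tuple stores one PyObject* per item, so its item size is the pointer size.
pointer_size = ctypes.sizeof(ctypes.c_void_p)
tuple_itemsize = tuple.__itemsize__

# str reports 0 even though its data length varies.
str_itemsize = str.__itemsize__


class Foo(str):
    pass


# A subclass inherits the item size of its base type.
foo_itemsize = Foo.__itemsize__

print(tuple_itemsize == pointer_size, str_itemsize, foo_itemsize)
```

So any formula based on `__itemsize__` contributes nothing for a str subclass, which matches the 0 seen here.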
this should work for all subtypes to get the location of the instance dict
def get_inst_dict_offset(i):
    return type(i).__basicsize__ + (type(i).__itemsize__ * view(i).size)
(but note that it is created lazily as of 3.11 @warm breach )
so it can be null?
lemme check
that's fine but I'm just trying to get the pointer which doesn't seem to work
ah that function won't work because view doesnt accept subclasses
!e
from einspect.structs import PyDictObject
from einspect.types import ptr
class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50
f = Foo("abc")
print(f.__dict__)
p = ptr[PyDictObject].from_address(id(f) + Foo.__basicsize__)
print(p.contents.into_object())
@warm breach :x: Your 3.11 eval job has completed with return code 1.
001 | {'x': 50}
002 | Traceback (most recent call last):
003 | File "<string>", line 13, in <module>
004 | ValueError: NULL pointer access
well I'm trying to implement that
so firstly finding the additional dict pointer subtypes have
weird
!e is this supposed to be 0
class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50
print(Foo.__itemsize__)
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
0
!e ```py
from einspect.structs import PyDictObject
from einspect.types import ptr
class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50
f = Foo("abc")
print(f.__dict__)
p = ptr[PyDictObject].from_address(id(f) + Foo.__dictoffset__)
print(p.contents.into_object())```
@pliant tusk :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | {'x': 50}
002 | 0
o.O
yes, str isnt a PyVarObject, it does not have a variably sized struct
I read somewhere that a negative dictoffset means an offset from the end of the struct, fml
been lied to
i don't know why it is 0 tho
!e
from einspect.structs import PyDictObject
from einspect.types import ptr
class Foo(int):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50
f = Foo(123)
print(f.__dict__)
print(Foo.__dictoffset__)
p = ptr[PyDictObject].from_address(id(f) + Foo.__dictoffset__)
print(p.contents.into_object())
@warm breach :x: Your 3.11 eval job has completed with return code 139 (SIGSEGV).
001 | {'x': 50}
002 | -8
also this doesn't work for int subclasses somehow
also how is that offset -8 anyways, isn't that where the gc header is??
!e
from einspect.structs import PyDictObject
from einspect.types import ptr
class Foo(list):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50
f = Foo((1, 2))
print(f.__dict__)
print(Foo.__dictoffset__)
p = ptr[PyDictObject].from_address(id(f) + Foo.__dictoffset__)
print(p.contents.into_object())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | {'x': 50}
002 | -72
003 | type_
list also fails
tfw __dictoffset__ isn't actually dict offset
!e
from einspect.structs import PyDictObject
from einspect.types import ptr
class Foo(list):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50
f = Foo((1, 2))
print(f.__dict__)
print(Foo.__dictoffset__)
p = ptr[PyDictObject].from_address(id(f) + f.__sizeof__() + Foo.__dictoffset__)
print(p.contents.into_object())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | {'x': 50}
002 | -72
003 | {'x': 50}
so list apparently works if you add offset after sizeof
but this isn't the case with int where you should ignore dictoffset and just find the first aligned byte after the struct
@warm breach look at this https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dictoffset
If the value is less than zero, it specifies the offset from the end of the instance structure.
how did this work then 
you did id(f) + Foo.__dictoffset__
i think that one just got lucky and found a pointer to 0
!e aha I got it
from einspect.structs import PyDictObject
from einspect.types import ptr
def align(size: int, alignment: int) -> int:
    return (size + alignment - 1) & ~(alignment - 1)
class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50
f = Foo("abc")
addr = id(f) + align(f.__sizeof__(), 8) + Foo.__dictoffset__
p = ptr[PyDictObject].from_address(addr)
print(p.contents.into_object())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
{'x': 50}
seems to be sizeof aligned to 8 bytes then dictoffset for negative
i would calculate sizeof manually using basicsize itemsize and ob_size, because some __sizeof__s are not exactly correct
@warm breach but you'll need a way to check if a given type has the ob_size field
hm why
isn't calculating struct size and getting __dictoffset__ enough
__sizeof__ isn't always a bare struct size, sometimes it includes nested fields
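The "compute the size manually" suggestion can be sketched as plain arithmetic, with no memory reads. This is only a sketch of what was observed on 3.11 in this thread; the function names are made up for illustration, and the `basicsize`/`itemsize` figures in the example are assumed values for a 64-bit build (only the `-8` dictoffset was actually printed above).

```python
def align_up(size: int, alignment: int) -> int:
    """Round size up to a multiple of alignment (alignment is a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


def struct_size(basicsize: int, itemsize: int, ob_size: int, pointer_size: int = 8) -> int:
    """Bare struct size from type fields, ignoring what __sizeof__ reports."""
    # abs() because some types (int, for one) keep a sign in ob_size.
    return align_up(basicsize + itemsize * abs(ob_size), pointer_size)


def dict_pointer_offset(dictoffset, basicsize, itemsize, ob_size, pointer_size=8):
    """Offset of the instance dict pointer from the object's address, or None."""
    if dictoffset == 0:
        return None            # type has no instance dict
    if dictoffset > 0:
        return dictoffset      # measured from the start of the object
    # Negative: measured back from the end of the aligned struct.
    return struct_size(basicsize, itemsize, ob_size, pointer_size) + dictoffset


# Assumed numbers for the int subclass above: basicsize 32, itemsize 4, dictoffset -8.
one_digit = dict_pointer_offset(-8, 32, 4, 1)      # like Foo(123)
two_digits = dict_pointer_offset(-8, 32, 4, 2)     # like Foo(2 ** 50)
negative_int = dict_pointer_offset(-8, 32, 4, -1)  # like Foo(-123)
print(one_digit, two_digits, negative_int)
```

With those assumed numbers every case lands at offset 32, which agrees with the `end + 4` and `end` addresses that worked for the two int examples earlier.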
how this look 
def instance_dict(self) -> ptr[PyObject[dict, Any, Any]] | None:
    """Return the instance dict of the PyObject."""
    # Get the tp_dictoffset of the type
    offset = self.ob_type.contents.tp_dictoffset
    # If 0, the type does not have a dict
    if offset == 0:
        return None
    # For > 0, start from the address of the PyObject
    if offset > 0:
        addr = self.address + offset
    # For < 0, start after the struct
    else:
        # align size to pointer size (8)
        size = align_size(self.mem_size, ctypes.sizeof(c_void_p))
        addr = self.address + size + offset
    # Return the pointer
    return POINTER(PyObject).from_address(addr)
afaik that should work
you can stress test it with gc.get_objects and just see if it crashes on any of them
import gc
from einspect.structs import PyObject
for obj in gc.get_objects():
    st = PyObject.from_object(obj)
    d = st.instance_dict()
    if d is not None:
        print(repr(obj))
        print(obj.__dict__)
        pydict = d.contents.into_object()
        print(pydict)
seems fine mostly, but a few still seem to have a null pointer despite a non-zero tp_dictoffset, even after the dict has been accessed
<_frozen_importlib_external.SourceFileLoader object at 0x103626a40>
{'name': 'einspect.views.view_tuple', 'path': '/Users/ionite/repos/Python/einspect/src/einspect/views/view_tuple.py'}
Traceback (most recent call last):
File "scratch.py", line 10, in <module>
pydict = d.contents.into_object()
^^^^^^^^^^
ValueError: NULL pointer access
some _frozen_importlib_external.SourceFileLoader thing apparently 
Odd
it may be stupid
but why is it showing none?
please guys thinking for a long time but
cant understand
i wanna print pink to yellow in reverse order
@warm breach pls help
srsly cant figure out whats wrong in such an easy code
list.reverse() is in-place, which means it reverses the list and returns None. You need to use the list again separately
wdym
cant it be in the print function?
ls.reverse()
print(ls)
what is wrong in this
but then how do i reverse 1:5
assign it to something first
or just use the reversed() function instead, which isn't in place
print(list(reversed(ls[1:5])))
or use negative step
yea thanks thats working too
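To put the three options from this exchange side by side, here is a small sketch. The colour list is made up for illustration (the original list was not posted), with pink at index 1 and yellow at index 4.

```python
colors = ["red", "pink", "green", "blue", "yellow", "black"]  # hypothetical data

# list.reverse() mutates the list and returns None, so printing its
# return value shows None rather than the reversed list.
scratch = colors.copy()
in_place_result = scratch.reverse()

# reversed() returns an iterator and leaves the original alone;
# wrap it in list() to get something printable.
via_reversed = list(reversed(colors[1:5]))

# A negative step on the slice does the same in one expression.
via_negative_step = colors[1:5][::-1]

print(in_place_result)
print(via_reversed)
print(via_negative_step)
```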
anyone know anything about Cython internals? Cython + coverage.py has long been an uneasy alliance, and I would love to get it smoothed out.
I know a bit...
Cython generates normal C extension modules, so it's not particularly special. The only weird thing it does that affects profiling/tracing is this stuff: https://cython.readthedocs.io/en/stable/src/tutorial/profiling_tutorial.html - if you ask it to, it generates fake frames for Cython functions and calls the installed tracing or profiling function explicitly when entering and exiting those functions (passing those fake frames along)
we eventually gave up on supporting Cython functions built with profiling support in Memray. Those fake frames caused too much trouble... though I'm happy coverage.py doesn't do the same, since I do like to see coverage stats for my Cython code
what sort of info are you looking for, in particular?
thanks. this change https://github.com/nedbat/coveragepy/pull/1347 made this problem: https://github.com/nedbat/coveragepy/issues/1538, and i don't know anything about why we needed the change, or whether to change it back, or what.
i'm not in a place to look at the code right now, i was hoping to find someone who could take a look over the next few weeks or something.
well, that's an interesting one.
unfortunately, i am blessed with many interesting ones
what are the values of that dict? The PR title was "Map also empty dictionaries to file" - are the values always dicts? Did this maybe affect other falsy things?
is it possible that the performance difference is explained entirely by extra data being (unnecessarily) processed?
but it wasn't extra data: we were skipping empty values, and now we aren't skipping empty values. How can that matter!?
maybe there's extra work being done now, that used to be skipped
tbh, the caching done by cached_mapped_values got changed recently, and I borked it (maxsize=0 is different than maxsize=None!), but maybe there's still something wrong with it.
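For anyone puzzled by the maxsize remark: assuming the cache in question is built on `functools.lru_cache` (that is a guess, the thread doesn't say), the two settings behave very differently. The toy functions below are illustrative only.

```python
from functools import lru_cache


@lru_cache(maxsize=0)      # size zero: nothing is ever stored
def no_cache(x):
    return x * 2


@lru_cache(maxsize=None)   # None: unbounded cache
def unbounded(x):
    return x * 2


for _ in range(3):
    no_cache(1)
    unbounded(1)

no_cache_info = no_cache.cache_info()
unbounded_info = unbounded.cache_info()
print(no_cache_info)
print(unbounded_info)
```

With `maxsize=0` every call is a miss, so the wrapped function runs every time; with `maxsize=None` only the first call runs it.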
the "slowdown" issue mentions "We can notice a lot more SQL queries (both selects and inserts attempts with 0 rows) in 6.4.3, that we don't have on 6.4.2."
ideally, someone who knows a bit about Cython could go back to https://github.com/cython/cython/issues/3515 and see if there's a better fix.
the Cython coverage plugin, as I understand it, is basically only concerned with 1 thing: mapping coverage reports from the generated .c files to the .pyx/.pxi files that the .c file was transpiled from
right, that's the classic use-case for coverage plugins (originally developed for the django template plugin)
I know a fair bit about Cython, but much less about the Cython coverage plugin in particular - I've only had to dig into it once, and I had a pretty poor understanding at the time... There were some recent fixes to it that are only applied to the current development version, and not to the stable version...
i just commented on the original Cython issue. just getting clear reproduction instructions would help.
https://github.com/cython/cython/pull/3831 went into the 3.x branch only...
the rabbit hole goes deeper... Thanks for talking it through with me at least... I have to bounce.
You work on memray?
Have y'all seen this? Seems like a great bunch of ideas. https://discuss.python.org/t/announce-pybi-and-posy/23021/26
Great initiative! Looks very promising @njs I really like that you've taken a holistic view of a larger scope of the problem, but not too large to make it impractical to solve. Fwiw, I think it's probably best to keep the discussion focussed on pybi and posy. The project is new, it's explicitly stated in the OP that external pythons (such as a ...
'make' is not recognized as an internal or external command,
operable program or batch file.
can anyone help me to solve this error please
Yep, I'm one of a team of two developing it.
Legendary status
speaking of, does memray have some way to inspect the allocated memory block of an object
No, it really doesn't even know what object owns a memory block, or even whether any object owns it. It works at a lower level than that.
"all" that it knows is the full call stack at which every block of memory was allocated, and when (and whether) that memory block was deallocated. (OK, and a few other, less important things. The total RSS is tracked over time, and we know the name of each thread, if one was set...)
was trying to find some way to get that info from python, didn't seem to be a stable c API for it
there absolutely will never be. The stable C API is designed to abstract away implementation details, and what memory is held by an arbitrary object is basically the definition of an implementation detail.
currently I'm calculating the struct size aligned to 16, plus a GC header if supported by the type, and the position of the instance dict pointer
not sure if that covers everything
the extra stuff for variable-sized objects?
why are you calculating this size? What do you do with it?
✨ I am trying to create and store Google credentials in token.pickle from an auth code, but I am not able to create token.pickle
to like see if copying an object into another object will stay within owned memory 🥴
why bother?
just because you're copying a number of bytes <= to the size of the structure doesn't mean that things are left in a sane state. Hell, you can't even copy one list into another list. Adding one safety check to a fundamentally unsafe operation doesn't make much sense to me.
!e
from einspect import view
v = view(2**60)
with v.unsafe():
    v <<= (1, 2, 3)
print(2**60)
@warm breach :warning: Your 3.11 eval job has completed with return code 139 (SIGSEGV).
[No output]
sure, you can segfault if you don't check this. You can also segfault if you do check this. So...
should copying a list into a list always be fine? since the struct is always 40 bytes
the malloced array will just stay where it was
no, because you screw up the reference counts.
assuming we call clear() on the list about to get overwritten 🥴
and now there's two different objects referring to the same malloc'ed array, and the first one of them to be destroyed will free it, and any later attempt to access it by the second one will be a use-after-free bug.
!e I IncRef the source to make it stay alive here, not sure why that solves the freeing segfault really
from einspect import view
x = [1, 2]
with view(x).unsafe() as v:
    y = [*range(18)]
    v <<= y
print(x)
del y
print(x)
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
002 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
"not sure why that solves the segfault" usually means you have either a memory leak or some other memory safety bug now
will objects remaining with refcounts still get freed during interpreter shutdown?
best effort, yeah.
shouldn't the double free happen there then 
you would probably have memory leaks if this code was being called from an embedded interpreter
If you're on Linux, try running that with export MALLOC_CHECK_=3 PYTHONMALLOC=malloc and see if you get a crash.
do you know if macOS supports something similar?
would removing the list object from the GC linked list do anything
no clue. But you can get something sort of similar from pymalloc itself, if you run with python -Xdev
I guess it still gets freed on 0 ref-count? And that just disables cyclic GC?
at best it would leak memory. at worst it'd crash.
honestly, "corrupts memory in a manner that may eventually lead to a crash" is just about the worst sort of memory bug.
both because that's usually a security vulnerability, and because it can be quite annoying to track down if the crash location is far removed from where the memory corruption occurred.
seems fine with those 
maybe I should valgrind it
!e what's going on here: ```py
from einspect import view
x = [1, 2]
with view(x).unsafe() as v:
    y = [*range(18)]
    v <<= y
print(x)
y[0] = 42
print(x)
@raven ridge :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
002 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
why isn't the first element changed?
oh uh I make a deep copy of the source object and IncRef that 🥴
which I should probably stop doing
then I'm guessing you're leaking that copy.
and if you stopped doing that, then I'm guessing it would be pretty easy to get this to segv.
mainly the deepcopy was there to incref all the members I think
since otherwise when y gets dropped and calls clear, x might now have members pointing to non-existent objects
the array isn't a member with a reference count - you must actually be creating a copy of that array, or the first element would have changed to 42
src/einspect/views/view_base.py line 263
other = deepcopy(other)```
and the fact that you're leaking that copy is the only reason this doesn't wind up with a double free.
you've got two lists that each think they're responsible for freeing that array. The only way that can possibly not result in a double free is if one of them never gets destroyed, and so never tries to free its array.
I guess a possibility would be to maintain a list of "shared array" list objects and override list type's tp_free to only free the array when it is the last reference 
you'd wind up needing to do something like that for every type of object.
eh yeah doesn't seem worth it
memcpy'ing into the struct of some arbitrary object is fundamentally unsafe. It can't be made safe, short of special casing how it behaves for each different type of object.
!e
from einspect import view, unsafe
x = [1, 2]
y = [*range(18)]
with unsafe():
    view(y).move_to(view(x))
print(x)
y[0] = 100
print(x)
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
002 | [100, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
making a copy of a Python object with memcpy breaks that object's invariants. On a case-by-case basis, you can fix up the copy to make the invariants hold - but there's no general way to do that.
so move_to doesn't make a deepcopy unlike move_from
wonder why this doesn't double free
!e ```py
from einspect import view, unsafe
x = [1, 2]
y = [*range(18)]
with unsafe():
    view(y).move_to(view(x))
print(x)
y.clear()
print(x)
@raven ridge :x: Your 3.11 eval job has completed with return code 139 (SIGSEGV).
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
I assume that's because the array was modified by clear but x still has a size of 18?
clear freed the array, x is still holding a pointer to it.
print(x) tries to access it after it's been freed, which explodes.
yeah
ooh man, both of those comments
huh
so a list that gets cleared has NULL ob_item but a list that gets popped to 0 doesn't...?
!e
from einspect.structs import PyListObject
x = [1]
x.pop()
ls = PyListObject.from_object(x)
print(ls.ob_item[0].contents.into_object())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
1
I guess it just leaves the object pointer 
looks like popping to 0 also frees the array. https://github.com/python/cpython/blob/7b20a0f55a16b3e2d274cc478e4d04bd8a836a9f/Objects/listobject.c#L1029-L1031
Objects/listobject.c lines 1029 to 1031
if (size_after_pop == 0) {
Py_INCREF(v);
status = _list_clear(self);```
ah interesting
I guess not yet in 3.11 https://github.com/python/cpython/blob/3.11/Objects/listobject.c#L1032
Objects/listobject.c line 1032
list_pop_impl(PyListObject *self, Py_ssize_t index)```
ah damn that is gonna break one of my bug prep strategies
deling the last item already frees the array in 3.11 it seems
!e
from einspect.structs import PyListObject
x = [1]
del x[0]
ls = PyListObject.from_object(x)
print(ls.ob_item[0])
@warm breach :x: Your 3.11 eval job has completed with return code 1.
001 | Traceback (most recent call last):
002 | File "<string>", line 7, in <module>
003 | ValueError: NULL pointer access
!e unrelated, but I still feel stuff like this is unintuitive ```py
class MyList(list):pass
l = MyList()
n = l + MyList([1])
print(type(n))```
@pliant tusk :white_check_mark: Your 3.11 eval job has completed with return code 0.
<class 'list'>
I guess list.__add__ just doesn't know about your subtype at all
and just operates on the front PyListObject part
it doesn't (and can't) know how to construct objects of an arbitrary subclass
there was a big change to datetime to fix that for the methods of datetime.datetime
to get them to return subclass instances rather than base class instances, I mean
though in that case, there's already an inherited classmethod for creating derived class instances.
it would be nice if there was a way for a developer to either signal that constructing subclasses are the same (could check if Py_TYPE(obj)->tp_init == list->tp_init?), or provide a method to provide the arguments to construct the object
checking tp_init isn't enough, since there's also tp_new
could check both, but this kind of thing would probably be better as an opt-in instead of an opt-out
maybe via a metaclass that sets some type flag
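Until something like that exists, the subclass has to opt in by hand. A minimal sketch, assuming the subclass constructor accepts a single iterable (which is exactly the thing `list.__add__` cannot know in general):

```python
class MyList(list):
    def __add__(self, other):
        # list.__add__ always builds a plain list, so re-wrap the result.
        return type(self)(super().__add__(other))

    def __radd__(self, other):
        # Handles plain_list + MyList, where the left operand is a base list.
        return type(self)(list(other) + list(self))


a = MyList([1])
both = a + MyList([2])
right = a + [2]
left = [0] + a
plain = list.__add__(a, [2])   # the unwrapped builtin behaviour

print(type(both), type(right), type(left), type(plain))
```

`collections.UserList` does essentially this wrapping for you, which is why it behaves more predictably under subclassing.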
@feral island @raven ridge would either of you happen to know in what order tp_del is called in subclasses with multiple bases?
can you inherit from two classes with different tp_del?
!e
class UserTuple(tuple): pass
print(().__sizeof__())
print(UserTuple().__sizeof__())
class UserInt(int): pass
print((0).__sizeof__())
print(UserInt().__sizeof__())
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 24
002 | 32
003 | 24
004 | 24
is this a bug
why does int subclass sizeof not include the instance dict but tuple does
from collections import UserList
class MyList(UserList):pass
l = MyList()
n = l + MyList([1])
print(type(n))
UserList and UserDict have more predictable behaviour for subclassing
yea ik, just feels like it would be unintuitive for beginners
ah, I see. I wonder why it was even allowed to inherit from dict in the first place
>>> type(l + [1])
<class '__main__.MyList'>
>>> type([1] + l)
<class '__main__.MyList'>
``` this is also cool because builtin types do not care about subclasses: ```py
>>> class X(list): ...
...
>>> type(X() + X())
<class 'list'>
Nice method name 🥴
https://github.com/python/cpython/blob/main/Objects/listobject.c#L622
Objects/listobject.c line 622
list_ass_slice(PyListObject *a, Py_ssize_t ilow, Py_ssize_t ihigh, PyObject *v)```
Impressive my friend. I've always found your conversation inputs useful so it's no surprise
so uh, for instance dicts, would it make more sense for the move target to get the same dict as the source or a copy 
from einspect import view, unsafe
class Foo:
    pass
class Bar:
    pass
f = Foo()
b = Bar()
b.x = 100
with unsafe():
    view(f) << b
print(f.x)
>> 100
it's "move", not "copy"
so the same dict should make more sense
so it would be 🥴
print(f.x) # 100
b.x = "hi"
print(f.x) # hi
yep
@pliant tusk do you know how I can get this to allocate correctly on init?
https://github.com/ionite34/einspect/blob/main/src/einspect/structs/py_long.py#L15-L23
src/einspect/structs/py_long.py lines 15 to 23
@struct
class PyLongObject(PyVarObject[int, None, None]):
    """
    Defines a PyLongObject Structure.
    https://github.com/python/cpython/blob/3.11/Include/cpython/longintrepr.h#L79-L82
    """
    _ob_digit_0: ctypes.c_uint32 * 0```
since if I do PyLongObject(...) to make a struct instance it won't have enough allocation to fit my actual ob_digit array
hm I guess I could use _PyObject_GC_NewVar
Are ints garbage collected?
hm, no
I guess _PyObject_NewVar then
PyVarObject *
_PyObject_NewVar(PyTypeObject *tp, Py_ssize_t nitems)
{
    PyVarObject *op;
    const size_t size = _PyObject_VAR_SIZE(tp, nitems);
    op = (PyVarObject *) PyObject_Malloc(size);
    if (op == NULL) {
        return (PyVarObject *)PyErr_NoMemory();
    }
    _PyObject_InitVar(op, tp, nitems);
    return op;
}
Could alloc it normally then use reallocate to size up
but wouldn't the init already be writing to unowned memory
from einspect.structs import PyTypeObject, PyLongObject
x = PyLongObject(
    ob_refcnt=1,
    ob_type=PyTypeObject.from_object(int).as_ref(),
    ob_size=2,
    ob_digit=[1, 1],
)
print(x.into_object())
this seems fine
from einspect.structs import *
obj = PyTypeObject.from_object(int).NewVar(1)
obj = obj.contents.astype(PyLongObject)
obj.ob_digit[0] = 123
print(obj.into_object()) # 123
i meant to overload the PyVarObject init to malloc type(o).__basicsize__ + ob_size * type(o).__itemsize__
ah hm, or maybe I can override PyVarObject's __new__ to use _PyObject_NewVar instead? 
not sure which one seems safer
eh though __new__ would be annoying to type hint since I can't use Self
no idea, ctypes does not have a mechanism for variable sized structures
.__dict__ is writable (it is copying dict pointer, not copying dict content), so you can do that without einspect
!e
class A:
    pass
x = A()
y = A()
x.__dict__ = y.__dict__
x.foo = 42
print(y.foo)
@grave jolt :white_check_mark: Your 3.11 eval job has completed with return code 0.
42
it's mainly for copying the instance dict along with everything else for subclasses of builtin types
from einspect import view, unsafe
class UserList(list):
    pass
ls = UserList()
x = UserList([1, 2, 3])
with unsafe():
    view(ls) << x
print(ls)
# [1, 2, 3]
x.foo = "bar"
print(ls.foo)
# bar
x[1] = "hi"
print(ls)
# [1, 'hi', 3]
Best modules in python??
Wrong channel, and depends on what you are doing.
what does a __dictoffset__ of -1 mean in python 3.12?
https://docs.python.org/3.12/c-api/typeobj.html#Py_TPFLAGS_MANAGED_DICT I'm observing that it means the class uses the new Managed Dict feature, but I don't see the -1 being documented anywhere
Can anyone help me in this GIT command. I am new to coding.
update to a newer version of git iirc
I downloaded git yesterday bruh.
run git --version
hmm odd that seems like a new release
yup
this isn't really the right channel. #❓｜how-to-get-help
maybe we could talk in unix channel
#tools-and-devops is right
ok
Question about python's parser
When I look at Python's grammar, I see that there are actual function calls embedded in it
Is this purely notational? Does the grammar exist simply as a map of the parser, which is itself hand coded?
While I'm here: what kind of parser does CPython use? Are there any interesting features or implementation details I should know about?
Shortly, I'll be trying to build an equivalent
There is a lovely talk on the new PEG parser. https://youtu.be/QppWTvh7_sI
Parsing Expression Grammars (PEGs) are a relatively new formalism for describing grammars suitable for automatically generating efficient parsers. I've become interested in using a PEG-generated parser as an alternative to CPython's nearly 30 year old "pgen" parser generator. This poses some interesting problems. I've also come up with a neat wa...
Now
PEG grammars are context free grammars with short-circuited alternation, right?
And beyond that, PEG parsers are recursive descent packrat parsers designed to support linear time parsing with infinite lookahead, at the cost of more memory consumption?
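A toy sketch of those two ideas, ordered choice and packrat memoization. The grammar here is made up for illustration and has nothing to do with CPython's actual grammar or generated parser; `lru_cache` stands in for the memo table.

```python
from functools import lru_cache

DIGITS = "0123456789"


def make_parser(text: str):
    """Toy packrat parser for:  expr <- term '+' expr / term ;  term <- [0-9]+"""

    @lru_cache(maxsize=None)   # memo table: one stored result per position
    def term(pos: int):
        end = pos
        while end < len(text) and text[end] in DIGITS:
            end += 1
        if end == pos:
            return None        # failure: no digits here
        return int(text[pos:end]), end

    @lru_cache(maxsize=None)
    def expr(pos: int):
        # First alternative: term '+' expr
        left = term(pos)
        if left is not None:
            value, after = left
            if after < len(text) and text[after] == "+":
                right = expr(after + 1)
                if right is not None:
                    return value + right[0], right[1]
        # Ordered choice: the second alternative is tried only if the
        # first one failed, and term(pos) is answered from the memo table.
        return term(pos)

    return expr, term


expr, term = make_parser("1+22+3")
result = expr(0)               # (value, position after the match)
print(result, term.cache_info())
```

The memo table is what buys linear time under backtracking: the final `3` is needed by both alternatives of `expr`, but it is scanned only once.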
Wow
What an incredible invention
Is there any reason that leading zeroes are prohibited for integer literals, but are permitted for based integers?
00001 # invalid
0x001 # valid
Is it just a quirk in the parser?
Also
In lexing and parsing integers, you might have something that looks a bit like this...
integer ::= dec_integer ( 'E' | 'e' ) ( dec_integer + )
| dec_integer ( 'I' | 'i' )
| dec_integer | hex_integer | oct_integer | bin_integer
dec_integer := dec_digit +
hex_integer := '0' ( 'X' | 'x' ) ( ( dec_digit | hex_glyph ) + )
oct_integer := '0' ( 'O' | 'o' ) ( oct_digit + )
bin_integer := '0' ( 'B' | 'b' ) ( bin_digit + )
This is great and all. It's very concise. But it occurs to me that a more granular approach might be more suitable. For example, let's say I allowed the 0x to be a token, and then the number that followed it to be a token. Later, during parsing, I could stitch them together into a hex integer
This would give me access to the number part without having to call str.split() on the literal. In a nutshell, I'm going to need to split the thing anyway, so why not just do it at the first step instead of a later step
a leading 0 in Python 2 indicated that the integer literal was base 8, like it does in C. That feature was dropped for Python 3, and so it was much safer to change it to be an error rather than to silently have the integer literals evaluate to a different value.
Safer with respect to legacy code?
right. given the choice, it's nicer for users who are trying to upgrade their code if running old code with a new interpreter fails with an obvious error than if it seems to work but silently does something different.
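The "fails with an obvious error" behaviour is easy to see from Python 3 itself. The `compiles` helper below is just for illustration:

```python
def compiles(source: str) -> bool:
    """Return True if source is a valid Python 3 expression."""
    try:
        compile(source, "<literal>", "eval")
    except SyntaxError:
        return False
    return True


old_style_octal = compiles("00001")   # Python 2 octal spelling
padded_hex = compiles("0x001")        # leading zeros after a base prefix
new_style_octal = eval("0o17")        # explicit octal prefix

print(old_style_octal, padded_hex, new_style_octal)
```

The old spelling is rejected outright instead of quietly being read as decimal 1, while zeros after an explicit prefix are harmless because the base is unambiguous.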
arguably the single reason that Python 2 still exists in some places is because of the failure to do that with string literals - changing string literals from byte strings to unicode strings makes for a porting nightmare, because things fail in strange and surprising ways, potentially in locations far away from where the bug was.
I assume that prepending all string literals in py2 code with a b to port it to py3 is not a viable strategy?
not really. You still need to transform them back and forth to unicode strings whenever you're passing them to most any library function, save some of the ones in os I suppose
to nitpick a bit, you don't need to split, just slice. You know how many characters 0x or 0b are, so you just ignore that many characters from the literal when computing the value for the token.
That first line shouldn't have a + on it, FWIW - the repetition is already handled inside of dec_integer. In case that came from a real grammar.
I'm pretty sure they'd be equivalent in the long run. Slicing might be a shade faster, but it creates a copy of the string and so is worse on memory. It's a moot point though, because I'm trying to do this with as little reliance on built in string manipulation as possible
C style, baby
A work in progress
Anyway, the long and the short of the answer is that in building a new language, I can allow leading zeroes if I so choose
Would there be any reason to disallow it?
Other than the fact that it's pointless?
if by "splitting" you meant str.split("x", 1), then that creates two partial strings.
compared to slicing, which creates one.
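For comparison, here is roughly what the slicing approach looks like. It is a minimal sketch that assumes the lexer has already validated the literal; `PREFIX_BASES` and `literal_value` are made-up names.

```python
PREFIX_BASES = {"0x": 16, "0o": 8, "0b": 2}


def literal_value(text):
    """Compute the value of an integer literal by skipping its 2-character prefix."""
    base = PREFIX_BASES.get(text[:2].lower())
    if base is None:
        return int(text, 10)      # no prefix: plain decimal
    return int(text[2:], base)    # one slice, no intermediate "0" piece


print("0x1F".split("x", 1))   # split produces two partial strings
print(literal_value("0x1F"))  # slicing only produces the digits
```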
confusion, I suppose. For some languages it means octal, for some it doesn't, so allowing it means an ambiguity for human readers.
if you haven't seen https://devguide.python.org/internals/compiler/#compiler it's probably up your alley, @deep nova
bookmarked
So
There are many ways to handle numbers
(There are many benefits to being a marine biologist)
Python doesn't do this, and it does seem a bit excessive, but in planning all of this out I do have to ask myself a few questions
For example: binary, octal, and hex integers are just integers. There's no reason they can't be raised using e notation
And there's no reason they can't be the power, either
And, there's no reason they can't be imaginary
Any thoughts?
octal literals are rarely used, but when they are it's as a collection of 3-bit flags (like Unix mode masks).
hex literals are either used for a collection of 8-bit values (rgb or rgba, for instance), or for masks used for bitwise arithmetic, or to make special power-of-two values easier to recognize in code (0xFFFF vs 65535, etc).
binary literals are virtually never used, but if someone did use them I'd imagine it must only be for a mask used for bitwise arithmetic.
And for any of those things, raising another number to that power or raising them to some power or making them imaginary just isn't useful. If someone has chosen one of those 3 representations, they did it for a reason - and that reason likely doesn't recommend those operations.
why didn't they just warn?
then after a few minor versions the warning would be removed
What about binary, octal, and hex floats? Same story?
hex floats are a thing, actually - there's a representation for floats where you can specify the sign, mantissa, and exponent independently
What would make a warning better than an error? Warnings are easier to ignore. I guess one advantage is that you can collect multiple warnings in one run of the process, but 🤷‍♂️
I think I'll allow non-base-10 floats for now. It seems like it might be useful
But I'll probably prohibit other-base numbers from being bases or powers in 'e' notation, or from being imaginary
check that link, it shows Python's syntax for hex floats. C99 allows it as well.
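As far as I know, Python has no hex-float literal in source code; the hex form is only available through `float.fromhex()` and `float.hex()`. A small sketch of that round trip:

```python
# "0x1.8p1" means (1 + 8/16) * 2**1
value = float.fromhex("0x1.8p1")
print(value)

# float.hex() gives an exact textual form that round-trips losslessly
text = (0.1).hex()
print(text)
print(float.fromhex(text) == 0.1)
```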
I'm going to have an imaginary type as well as a complex type
what's the difference?
The former being any number postfixed by an i, and the latter being the sum of an imaginary and a real
what would 1+1i - 1 be?
Complex has both a real and an imaginary part. The key difference being that you can write a complex number without needing to include the real part. Just shorthand, really
0+1i in traditional form
Sorry, let me rephrase: python's syntax requires specifying both the real and imaginary parts of a complex literal. I'm going to allow for omitting the real part, which will default to zero
it doesn't, 1j is a valid complex literal
1+1j is syntactically just a binop of two nums
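The `ast` module makes the difference visible. This sketch just parses the two expressions and prints what kind of node each one becomes:

```python
import ast

imag = ast.parse("1j", mode="eval").body
summed = ast.parse("1+1j", mode="eval").body

# A lone imaginary literal is a single constant node
print(type(imag).__name__, repr(imag.value))

# "1+1j" is an addition whose operands are two separate constants
print(type(summed).__name__, type(summed.op).__name__)
print(repr(summed.left.value), repr(summed.right.value))
```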
Now, another question. What about allowing the imaginary postfix on other-base integers?
Again, largely pointless, but it's a question I have to ask myself
I think it's the same as what @raven ridge said above: the contexts where you'd use non-base 10 numbers aren't contexts where complex numbers are likely to come up
then again I'm not sure I've ever used complex numbers in Python other than when writing tests for things that need to support the whole language
(people keep using them in AoC for representing essentially two-tuples of integers that you can apply mathematical operations to, like doubling)
ok so it seems like my programming language does that fine
well the concept says so at least
When it comes to language design, rather than allowing anything that pops into your head as a feature that someone might one day want to use, it's much more reasonable to look for things that people commonly want to do, and make sure that there's a succinct way to represent those things.
Nice
Take a large code base, and see if you can find any place where someone raises a hex literal to a power of 10, for instance.
if you find people doing that, maybe the e syntax would be a helpful shortcut for those people.
if no one is doing it, then you'd be building a feature that no one needs, which costs you work and maintenance effort and doesn't buy users of your language anything.
basically the u strings i put in my programming language
So
Tomorrow
I'm going to need to come back here and ask about lexing/parsing F strings
Another thing I'll need to ask about is how PEG parsers can automatically match parentheses, but others can't
I don't think I can brain any more, though
currently python parses the string and then converts it into a sequence of constant strings and formatting values
there's a PEP to use the new PEG parser to parse the beginning (e.g. f"), its contents, then the end (basically the ending quote)
Hi
https://docs.python.org/3/reference/lexical_analysis.html#imaginary-literals
So, according to the language specification, there are imaginary number literals, but there isn't any mention of complex number literals. At the same time though, according to dis, no addition actually has to occur when executing, because such literals get optimised during bytecode compilation.
Does this mean that, in a way, complex number literals are actually an implementation-specific thing?
Well, 1+5 will also be optimised
I suppose, though complex numbers feel different to me for some reason
like i suppose it doesn't matter since it gets optimised when dealing with arithmetic between constants anyway, but it sorta sounds strange to say that python doesn't actually have complex literals, despite them existing for all intents and purposes
there's also the slightly odd behaviour of 5+3j.imag not doing what you'd expect it to if someone were under the impression complex literals were a thing (though 5+3j.real coincidentally ends up working :P)
7j+3j.real does not "work" the way it should
it's sort of weird
but that to me actually reads as 7j + 3j.real, since seeing two js indicates two separate imaginary numbers, and . has higher precedence than +
yet the + in 3+4j doesn't feel like an actual +, it feels more like just part of the syntax, a bit like the - in 1e-10 doesn't actually invoke unary negation
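The precedence point is easy to check directly. Attribute access binds tighter than `+`, so without parentheses the attribute is read from the imaginary literal alone:

```python
a = 5+3j.imag      # 5 + (3j).imag
b = (5+3j).imag    # what a "complex literal" reading would suggest
c = 5+3j.real      # 5 + (3j).real, which happens to match (5+3j).real
d = 7j+3j.real     # 7j + (3j).real
e = (7j+3j).real

print(a, b, c, d, e)
```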
yeah, there aren't any complex literals in the language spec, they are just optimised in cpython
!e it's not always inlined anyways, there's a size limit as well
from dis import dis
dis("2**65+5j")
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 0 0 RESUME 0
002 |
003 | 1 2 LOAD_CONST 0 (2)
004 | 4 LOAD_CONST 1 (65)
005 | 6 BINARY_OP 8 (**)
006 | 10 LOAD_CONST 2 (5j)
007 | 12 BINARY_OP 0 (+)
008 | 16 RETURN_VALUE
indeed
!e
import time
from ctypes import cast
from einspect import impl, ptr
from einspect.api import Py
from einspect.structs import *
@impl(list)
@classmethod
def with_capacity(cls, n: int) -> list:
    return PyListObject(
        ob_refcnt=1,
        ob_type=PyTypeObject(list).as_ref(),
        ob_size=0,
        ob_item=cast(Py.Mem.Malloc(n * 8), ptr[ptr[PyObject]]),
        allocated=n,
    ).into_object()
ls = []
s = time.perf_counter()
ls.extend(range(9_000))
print((time.perf_counter() - s) * 1000, "ms")
ls = list.with_capacity(9_000)
s = time.perf_counter()
ls.extend(range(9_000))
print((time.perf_counter() - s) * 1000, "ms")
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 0.22995099425315857 ms
002 | 0.14029024168848991 ms
could we have a list.with_capacity?
Anyone have an opinion on how code objects are hashed and compared? https://github.com/python/cpython/issues/101346
https://grep.app/search?q=hash((.*)%2B(__code__))&regexp=true&filter[lang][0]=Python seems like a decent number of rewrite / compilation libraries rely on the hash of __code__
if we'd just limit the compared fields of __code__ I think __eq__ is still helpful
otherwise it'd be difficult to do that comparison
if the __eq__ and __hash__ methods are removed, it would break any tool that uses code objects as keys in a dictionary
I vaguely remember running into importlib hashing a code object, but I didn't spend too much time thinking about why that happens.
no, they'd just be hashed by identity
would those libraries break though if code objects were hashed by identity instead?
that's probably hard to answer unfortunately, but I suspect in most cases the answer is "no"
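To make the question concrete, here is what value-based comparison looks like in practice. This is a sketch that assumes a CPython version where code objects still compare and hash by value (true of 3.11, the version used by the bot in this channel); it would behave differently if they were hashed by identity.

```python
first = compile("x + 1", "<demo>", "eval")
second = compile("x + 1", "<demo>", "eval")

# Two distinct objects that nevertheless compare equal
print(first is second)
print(first == second)

# So a dictionary keyed on one code object also finds the other
cache = {first: "cached result"}
print(second in cache)
```

A tool that relies on the last line would stop getting cache hits under identity hashing, while a tool that only ever looks up the same object it stored would not notice.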
hm... how does the current hash work?
does it just go through each field
also are there ways the CodeTypes can mutate?
Quick question about Python's escape characters
\N{name}
Matches a named character. Are the braces actually there, or are they part of the notation
\Ncolon or \N{colon}
yes I think so, except for a few included fields. see https://github.com/python/cpython/issues/94155
in 3.11 yes, because of the specializing adaptive interpreter
Stupid question...nvm
Question about tokenizing strings
I'm shocked to learn that they were ever treated as a value-semantic type. Hashing by identity seems to obviously be the correct thing to have done in the first place. Whether that's a safe change to make now, though, I don't have an informed opinion on.
!e Easy enough to check for yourself:
```py
print("\Ncolon")
```
@raven ridge :x: Your 3.11 eval job has completed with return code 1.
001 | File "<string>", line 1
002 | print("\Ncolon")
003 | ^
004 | SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape
they're a required part of the syntax.
```py
>>> '\n{colon}'
'\n{colon}'
>>> '\N{colon}'
':'
>>> f'\N{colon}'
':'
>>> f'\N{{colon}}'
  File "<stdin>", line 1
    f'\N{{colon}}'
    ^
SyntaxError: f-string: single '}' is not allowed
```
interesting
looks like that's treating {colon as the name of the character to look up, then.
!e Even weirder:
```py
print(1+1j * 2)
```
If 1+1j were really a "complex literal", then that would give 2+2j instead.
@raven ridge :white_check_mark: Your 3.11 eval job has completed with return code 0.
(1+2j)
I guess that's why the repr includes parentheses, come to think of it.
What function is called when in is used, i.e. if my_number in my_list? Is it possible to have a custom class for my_list to override the in check?
__contains__
There's actually 3 different checks that are attempted one after another when you try to use in:
- It checks if the object has a `__contains__` method, and if so, attempts to use it
- Otherwise, it checks if the object has an `__iter__` method, and tries to use that to iterate through and find the item
- If the above two checks fail, it then checks if the object has a `__getitem__` method, and if so, tries indexing it by 0, 1, 2, 3, etc. until either the item is found or an exception occurs
- If all 3 checks fail, then a `TypeError` is raised
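The fallback chain above can be exercised with one tiny class per step. The class names here are invented for illustration:

```python
class WithContains:
    def __contains__(self, item):
        return item == "magic"


class WithIter:
    def __iter__(self):
        return iter([1, 2, 3])


class WithGetitem:
    def __getitem__(self, index):
        # Old-style sequence protocol: IndexError signals the end
        if index >= 3:
            raise IndexError(index)
        return index * 10


class Plain:
    pass


print("magic" in WithContains())  # uses __contains__
print(2 in WithIter())            # falls back to iterating
print(20 in WithGetitem())        # falls back to indexing 0, 1, 2, ...
try:
    1 in Plain()
except TypeError as exc:
    print("TypeError:", exc)
```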
I'm at a bit of a crossroads, and I'd love some input. I'm writing my first proper lexical (tokenizer) grammar for my language, and I'm considering my options with respect to strings.
Strings are a rabbit hole. Even regular strings contain escape characters (of numerous varieties) which will need to be interpreted at some point. Raw strings require special handling. Fstrings have replacement fields which will demand, at some point, that the string be properly parsed. As such, the problem at its simplest is this: when and how to parse strings?
The way I see it, I have four options:
1. Lex strings as atomic tokens. This makes the lexer a bit more complicated, but not unmanageably so. Later, when the parser encounters the string token, have it pass the string to a secondary parser designed specifically for them. This is probably the simplest option.
2. Have the lexer, upon encountering an open quote, create a new instance of itself in a different mode (string-mode). Using a secondary DFA, break the string up into runs of normal characters and escape characters. For f-strings, when the lexer encounters a replacement field, have it create yet another lexer instance in regular-mode to tokenize the expression. Break out upon the appropriate cues.
   This option is recursive, which is good. It's complicated and hacky, which is bad.
3. Don't lex strings at all. Instead, just have a `"` operator which is treated the same as an opening parenthesis. Everything after this operator is parsed as normal: characters become identifiers, symbols become operators, numbers become integers, and so on. Escape characters will be recognized as their own tokens. Whitespace will be explicitly tokenized. Eventually another `"` will be encountered (or not) but the lexer doesn't blink or care.
   Eventually, during parsing, the parser will handle assembly of the string. This is my favorite solution, as it requires only a single pass and no secondary parser or lexer, or mode of operation. But it's still a bit ugly. This approach also places string recognition and validation upon the parser instead of the lexer; this is good because the parser is simply more powerful and can do things the lexer can't.
   This option is also nice because replacement fields will be built during the outer string construction. There is also potential for arbitrarily nested strings in replacement fields, which I think is kinda cool.
4. Don't lex at all. Just use a scannerless parser which treats characters as the tokens, and builds primitives on its own. I shy away from this option because, well, I put a lot of work into my DFA algorithm and I don't want it to go to waste. It might be the most graceful solution, though.
At the end of the day, all I know is that I don't want any secondary parsers, secondary operational modes, or hacky lexer recursion. People keep telling me I'm overthinking it, but I want to do this job right without taking the lazy way out.
I don't know the right answer, but I can say that (1) is how CPython currently works. It works fine for normal strings, but creates limitations for f-strings, so we're likely going to switch to (2) for f-strings.
(3) is an interesting idea that I haven't seen before. I feel like it would make your parser quite complex
I feel like you'll likely run into issues with (3) that are hard to solve. The language fundamentally works quite differently within and outside strings, so getting things like escaping to work properly will be hard
I'm assuming the language you're parsing is Python or close to it
so Python plus more syntax? You'll want to support \N{} and \U and all that
Python, now with more syntax!
Yeah, lots of different escape options. I'm thinking of throwing out the bytes type (I've never once used it) but I'm not ready to think about that yet
I suppose, then, that option 4 is the most direct and graceful. When the parser encounters an open quote it will explicitly enter into a "new environment" in that only those methods appropriate for handling characters inside strings are called. Via the call stack, open and close quotes are handled trivially. I'm almost certainly going PEG, which will make lookahead an option and that too will be useful for detecting things like unescaped backslashes
option 4 strikes me as the hardest of those to implement by far. And the option most likely to have a negative effect on code quality, by making the parser do two jobs instead of one. As a thought exercise, imagine the language didn't have raw string literals: what changes would be required to add them?
Well, this is why I'm writing a parser generator. In theory, modifying the language's syntax would mean modifying the grammar, and working in significantly more abstract terms
In theory
But I understand where you're coming from. Lexers are good at lexing. Parsers are okay at lexing, but not great
something else to think about is how these decisions will affect your ability to recognize and report syntax errors. Option 3 strikes me as making good error reporting much more difficult.
Though it's perfectly doable, the amount of code required to make a parser work on that fine of a grain is probably massive
Indeed. The ambiguity is palpable
Surely there must be a solution. I'd like to walk through the logic step by step
consider that "10e3" and 10e3 are very differing things - lexing one as OPEN_QUOTE NUMBER CHARACTER NUMBER CLOSE_QUOTE might make sense, but lexing the other one as NUMBER CHARACTER NUMBER absolutely doesn't, and would make your life much harder.
Strings are, in modern programming languages, quite complex. They are composite objects which contain different types of characters, may contain expressions and, potentially, other strings. Thus they are recursive as well.
It follows from this that strings are simply irregular, and cannot be lexed to satisfaction (not without gross hacks anyway)
Does this check out?
"to satisfaction" is ambiguous - to whose satisfaction?
To the satisfaction of the definition of a lexer. You can't lex an unlexable object. Trying to do so will require state and procedure that goes beyond what "a lexer" by the strict definition is capable of
that depends on what the lexemes are.
option 1 - lexing it as just STRING_LITERAL and then figuring out the rest later - is still lexing it
that option at least requires figuring out where it starts and stops, and what characters are inside the literal
If you're willing to accept strings being un-nestable (well, strings using the same quotes) then I suppose you're right. It does severely limit what you're able to put inside the string though
you don't need to accept strings being unnestable. You can nest this way as well by just making the lexer hold some counters, right?
I'm not willing to ad-hoc a solution
it's more annoying, but it's certainly not impossible for the lexer to figure out where the string ends.
I want an academically substantiated, theory based approach
in the same way as it's not impossible for a parser to do all the lexing.
what does python do right now? 
lex the entire string literal as one token
julia moment
i don't get why julia has to be so "purist"
eh, it's not a big deal
I'd state this differently. String literals are quite complex, and the contents of string literals follows an entirely different grammar than is used anywhere else in the language. While it's certainly possible to define one grammar and one set of tokens that encompasses both stuff-inside-strings and stuff-outside-strings, it's not clear to me that it makes things more readable, or maintainable, or more performant, etc.
I just reference it for the memes, no offence to julia,
If the grammar inside the string and grammar outside are fundamentally different, and, I want to maintain low-level control over construction in both environments
By which I mean, I want the escape characters in strings to be their own tokens, runs of unescaped characters to be their own tokens, and so on
Then it stands to reason I'll need two machines (or else one machine operable in two modes)
yeah.
Fair enough. Questions abound though. What about error propagation? Toggling from inside a string to outside (potentially recursively). And, as soon as you start switching between environments you need to start carrying context from one environment to the other and back again (for example, whether the string is raw or not, or whether it uses single or double quotes)
yes, you do. That could be one class with a stack of contexts, though.
So, a recursive lexer
I've heard someone seriously suggest that as a way to invoke str.split() or maybe str.partition()
Badass!!!!!
I'm totally stealing that
!e
from einspect import view
view(str)["__truediv__"] = str.partition
print("hello+world" / "+")
@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.
('hello', '+', 'world')
one major issue with trying to lex both inside and outside of a string using the same lexer is that whitespace is significant in one and not in the other. You want to tokenize 1 +5 as NUMBER("1") OPERATOR("+") NUMBER("5"), but you can't tokenize "1 +5" as BEGIN_QUOTE('"') NUMBER("1") OPERATOR("+") NUMBER("5") END_QUOTE('"') or you've lost syntactically significant whitespace.
You've stated your piece on this approach I know, but I think it's worth mentioning again
In theory, one could have a single grammar which tokenizes everything from both grammars. Whether or not this is possible would depend on the nature of the differences between them. I'm not sure either of us can say for sure whether it would be impossible in a python-like language, though we can agree it would probably be difficult and potentially be quite ugly
Either case, the theory remains the same: kill them all and let the parser sort them out.
> one could have a single grammar which tokenizes everything from both grammars
You absolutely can. It seems like a lot of extra complexity to me, but it's clearly possible to lex my first example as `NUMBER("1") OPERATOR("+") NUMBER("5")` and my second as `BEGIN_QUOTE('"') LITERAL("1 +5") END_QUOTE('"')`. The question is just whether that makes implementing the parser easier or harder. My intuition is that it would be harder, but 🤷‍♂️
my current approach in my parser is to lex f-strings not unlike parsing. it's something like (it's been a while) FStart, FText, FExprStart, FExprEnd, and FEnd. then the parser actually handles putting them together as necessary. they just turn into their own kind of ast node. it's significantly more permissive than what python currently allows, though.
if i wanted to support exactly what cpython does, i'd just handle it all in the lexer stage, just like it does.
you don't need as much "deep" context
Why would you want such a nested f-string? it sounds like a nightmare to read for a human
i think my impl is more like pep 701
I'm going to need some context. Do you lex fstrings literals as big atomic chunks of characters, lex again in a second step, and then parse?
i don't. but i do want f"{d["key"]}" to be valid.
I personally find f"{d['key']}" easier to parse as a human
!pep 703
f"{d[" + key + "]}" will definitely take me a while
^ this is likely to make all @grave jolt 's fears come true
o no
i keep track of how nested in delimiters i am in the lexer. fstrings and their inner expressions are another part of that. there's only one lexing pass. the example i gave above would end up lexed as something like FStart, FExprStart, (just lex like normal tokens now) Ident, OpenBracket, Str, CloseBracket, (we came to a brace and aren't nested in any other delimiter pairs, we're done)FExprEnd, FEnd.
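The single-pass, delimiter-tracking approach described in that message could look roughly like the sketch below. It is not anyone's actual implementation: the token names follow the ones mentioned above, while `CODE_RE`, `CLOSERS` and `lex` are invented here, only double-quoted strings are handled, and `{{` / `}}` escapes are ignored.

```python
import re

# Token patterns used while lexing ordinary code (outside f-string text).
CODE_RE = re.compile(r"""
    (?P<Ws>\s+)
  | (?P<FStart>f")
  | (?P<Str>"(?:[^"\\]|\\.)*")
  | (?P<Number>\d+)
  | (?P<Ident>[A-Za-z_]\w*)
  | (?P<Open>[\[(\{])
  | (?P<Close>[\])\}])
  | (?P<Op>[-+*/.,:=])
""", re.VERBOSE)

CLOSERS = {"(": ")", "[": "]", "{": "}"}


def lex(src):
    """Single pass; `stack` records every open delimiter, f-string, or field."""
    tokens, stack, i = [], [], 0
    while i < len(src):
        top = stack[-1] if stack else None
        if top == "fstring":
            # Inside f-string text only a quote or an opening brace is special.
            if src[i] == '"':
                tokens.append(("FEnd", '"'))
                stack.pop()
                i += 1
            elif src[i] == "{":
                tokens.append(("FExprStart", "{"))
                stack.append("fexpr")
                i += 1
            else:
                j = i
                while j < len(src) and src[j] not in '"{':
                    j += 2 if src[j] == "\\" else 1  # keep escapes in the text
                tokens.append(("FText", src[i:j]))
                i = j
            continue
        if top == "fexpr" and src[i] == "}":
            # A brace with no other delimiter open ends the replacement field.
            tokens.append(("FExprEnd", "}"))
            stack.pop()
            i += 1
            continue
        match = CODE_RE.match(src, i)
        if match is None:
            raise SyntaxError(f"unexpected character {src[i]!r} at offset {i}")
        kind, text, i = match.lastgroup, match.group(), match.end()
        if kind == "Ws":
            continue
        if kind == "FStart":
            stack.append("fstring")
        elif kind == "Open":
            stack.append(text)
        elif kind == "Close":
            if CLOSERS.get(top) != text:
                raise SyntaxError(f"unmatched {text!r} at offset {match.start()}")
            stack.pop()
        tokens.append((kind, text))
    if stack:
        raise SyntaxError("unterminated string or delimiter")
    return tokens


for token in lex('f"{d["key"]}"'):
    print(token)
```

Because the quote inside the replacement field is seen while the top of the stack is a field or a bracket rather than f-string text, it starts an ordinary `Str` token instead of ending the f-string. The same stack lets an f-string nest inside another f-string's field.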
referential transparency for one. The rule that you can refactor
```py
greeting = "Hello, y'all"
print(greeting)
```
into
```py
print("Hello, y'all")
```
without changing the meaning makes it easier to reason about both the behavior of function calls and the meaning of variables. Currently, you can't do the same thing with f-strings, though - you can't refactor
```py
greeting = "Hello, y'all"
print(f'greeting={greeting}')
```
into
```py
print(f'greeting={"Hello, y'all"}')
```
people who maintain code generators also said that the arbitrary restriction on not being able to reuse quotes within different levels of an f-string makes it much harder to generate code, because you need to carry extra context down the stack
I don't have a formal rebuttal but that sounds a bit contrived 
that sounds like a good argument ^
https://discuss.python.org/t/pep-701-syntactic-formalization-of-f-strings/22046 had a bunch of arguments, both for and against, if you want to see some of the discussion.
Hi! I am very excited to share with you a PEP that @isidentical, @lys.nikolaou and myself have been working on recently: PEP 701 – Syntactic formalization of f-strings. We believe this will be a great improvement in both the maintainability of CPython and the usability of f-strings. We look forward to hear what you think about this ...
it also had a poll; 2/3rds of respondents prefer to lift the restriction and allow arbitrary nesting.
Referential transparency is kind of lacking in python tbh, in some places
like the closure gotcha
From a purist theoretical perspective the answer is (4) because there's really no such thing as lexing; there are just sequences of characters and grammar production rules and parse trees. The whole reason for introducing lexing as a separate step is for convenience. So in that sense, the very fact that you've introduced a lexer is some kind of flaw.
So, I appreciate the sentiment, but I'm not sure I'm 100% on the implementation they are proposing
I suppose that's true
They say they're adding new token types as well as new protocols into the grammar. Cool. They don't mention anything about how the non-replacement-field parts of the string are being lexed
hmm, maybe lexing and parsing are just parsing, but at different levels of abstraction
or rather, with different 'elements'
(characters vs 'tokens')
The FSTRING_MIDDLE parts
We like lexers for good practical reasons. But if you think about them as a convenience, then (1), (2), and (3) are all equally acceptable. It's just a matter of what you think is convenient for the language you want to support.
I think that if you want to support nested f-strings, then (2) is probably the best route. While if you don't, then you can't beat the ease of coding (1).
PEPs specify the behavior and contracts of the language, not the implementation. The existence or absence of a lexer in any particular implementation is an implementation detail that the PEP rightly doesn't address.
Ahh, I see
the ability to make good error reports on syntax errors is often the biggest practical difference between different approaches to parsing. It's very easy to say "the character on line 25, character 4 is unacceptable", but different approaches will make it very difficult or impossible for you to explain why that character isn't allowed there.
Well, if that's the case, then option 4 is by far the best approach
It's the option with the fewest question marks, and the least demand for ad-hoc problem solving
If it's a question of "switching contexts between two lexical grammars" and propagating those contexts as well as errors correctly, well, that's a whole can of worms
I think that rather than thinking of it as "switching contexts" you could think of it as "recursing into a grammar rule with a different lexer". To me that makes it feel cleaner.
As an aside question, shouldn't that sort of reporting be shunted forward to the semantic analyzer?
Also if you keep an explicit stack of lexers then that probably helps error reporting.
In my mind, the only things a lexer and a parser should be reporting on are those things that unambiguously fall within their domain
What is in their domain is not always obvious, though.
Much to think about
imagine you've got a file in a Python-like language whose first line is ff"". That's a syntax error. Discovering that error with a 1-character-at-a-time parser requires look-behind. ff is allowed as the first two characters. When you hit the ", you need to look backwards and say that a " preceded by an f is valid (an f-string), but a " preceded by 2 f's is not valid. And then you need to figure out what type of thing the two f's are (a name, I suppose) and then explain to the user that their error was putting a string literal after a name without an operator in between them.
Honestly... I'm so very torn
A bottom-up parser does all of this without even blinking. The problem is constructing the grammar for it. That grammar would have to say, "okay that first f could be part of a format string or a variable name or a from or ... oh now I saw a second f, it must be a variable name, wait the " is no good here." If you can build that grammar then reporting the location of an error is easy. (Reporting on the type of error, though, is really hard.)
Well, hold up
If you want to be able to solve this problem...
f"this is a {"test"}"
You simply can't do this with a lexer, not without attaching a stack and some custom logic
who says a lexer can't have a stack?
grumbles uncomfortably
I really don't want an ad-hoc solution. I want something grounded in theory.
Theory says you can't do it with a DFA. But you can with a context-free grammar (even an LL grammar). So you could just write a lexer that's not a DFA, declare yourself happy, and move on.
Well, I want to talk about option 3
A single grammar for both environments. What really are the differences between the two environments?
Outside of a string you have identifiers, numbers, <strings>, keywords, and operators
Implicitly, you've also got whitespace, newlines, tabs, and comments
Inside a string you've got all of these things as well. You'd need to be able to recognize words which didn't qualify as identifiers or number literals (maybe just default to a catch-all "blob" token). You'd need to recognize escape characters. Once you entered into a replacement field you'd be back in normal territory again
they're extremely different. Basically nothing in common.
the tokenization for x [ (1+2) * 3 ] and the tokenization for "x [ (1+2) * 3 ]" should almost certainly not have a single token in common.
it's not uncommon for lexers to have a delimiter stack. i believe cpython does - or does something similar - to handle "ignoring" indentation within delimiter pairs.
i don't know about theory, but there are years of historical precedent for doing this.
But there's no reason the parser can't sort that out after the fact.
"x [ (1+2) * 3 ]" becomes (QUOTE ") (IDENT x) (WS) (L_BRACK) (WS) (L_PAREN) (INT 1) (OP +) (INT 2) (R_PAREN) (WS) (STAR) (WS) (INT 3) (WS) (R_BRACK) (QUOTE ")
The literals are all still there. The parser can stitch a string together from these tokens quite easily
XD Yeah, pretty much
one of the major things that a lexer is doing is figuring out where one "thing" stops and the next "thing" starts. Making the parser care about whitespace between tokens seems to largely defeat the point of lexing.
Alright, so I'll concede on that.
you can also think of a lexer as something that turns source code into the smallest coherent units. when you're running a string through the rest of the process, you don't really care what content structure the string has (barring escapes). you just care that it's a string, and that you know what it has. there's no need to dissect it. (massive generalization here, but you get the point)
Here's what I want, and maybe you guys can tell me what I need: I want the individual components of strings to be their own tokens. I do not want to tokenize strings as atomic literals and then run through a second machine. I do want to be able to use the same quotes as bookend the string within the string's replacement fields. I want to move through the input in a single pass (or two passes, if lexing once and then parsing once)
Option 1 is out by definition. Option 3 is out for reasons discussed above
f-strings in particular, yes? i know my approach works in practice (f-strings as what're ultimately delimiter pairs).
Option 2 depends on one thing: can recursive/stacked lexers handle strings which use the same quotes as enclose them within the replacement fields? Would an inner-lexer, upon finding said quote, not need to escape immediately, or else perform costly lookahead?
option 4 also works. it seems like option 2 is also ruled out by virtue of "duplicating" the lexer. it appears i misunderstood there.
I'm not opposed to recursive lexing so long as it doesn't require any complicated glue
> I want the individual components of strings to be their own tokens.
That implies that raw strings need to be lexed differently than regular strings, right off the bat. Because `\n` in one should be lexed as `NEWLINE_ESCAPE` or something, but not in the other.
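A rough illustration of that point: the same string body produces different token streams depending on whether the literal was raw. This is only a sketch; `ESCAPE_RE` recognizes a simplified subset of Python's escapes, and the `TEXT` / `ESCAPE` token names are made up.

```python
import re

# Simplified: \N{...}, \xHH, or a backslash followed by any single character.
ESCAPE_RE = re.compile(r"\\(?:N\{[^}]*\}|x[0-9A-Fa-f]{2}|.)", re.DOTALL)


def string_body_tokens(body, raw=False):
    """Split the text between the quotes into TEXT and ESCAPE tokens."""
    if raw:
        # In a raw string a backslash is just another character.
        return [("TEXT", body)] if body else []
    tokens, pos = [], 0
    for match in ESCAPE_RE.finditer(body):
        if match.start() > pos:
            tokens.append(("TEXT", body[pos:match.start()]))
        tokens.append(("ESCAPE", match.group()))
        pos = match.end()
    if pos < len(body):
        tokens.append(("TEXT", body[pos:]))
    return tokens


body = r"tab\there \N{colon}"
print(string_body_tokens(body))
print(string_body_tokens(body, raw=True))
```

The lexer therefore has to carry the "raw or not" flag into whatever mode handles the string body, which is part of the context-passing cost mentioned earlier.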
I think recursive lexing should be able to do this as long as it behaves as a coroutine. When the parser detects a nested f-string it will have to send into the lexer the fact that it needs to recurse.
Question
The issue with scannerless parsing is that parsers aren't very good at lexing, correct?
The issue that I'm facing with f strings relates to an inability to recurse and an inability to "switch contexts" as needed, correct?
Could I build a parser that lazy-lexed?
I.E. A parser/lexer combo which automatically generated a new token via DFA whenever one is requested by the parser but does not already exist?
Something to think about :3
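A rough sketch of the lazy, context-switching idea, under heavy assumptions: the lexer is a generator, so a token only gets produced when the parser asks for one, and a mode stack lets `"` mean "end of f-string" in string mode but "start of a plain string" inside a replacement field. All names here (`lex`, `CODE_TOKEN`, `FSTRING_START`, `FSTRING_MIDDLE`, `FSTRING_END`) are made up for illustration, and escapes, `{{`, format specs, single quotes and dict literals are not handled.

```python
import re

# Tokens recognised in "code" mode. Names are illustrative, not CPython's.
CODE_TOKEN = re.compile(r"""
    (?P<FSTRING_START>f")
  | (?P<STRING>"[^"\n]*")
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<NUMBER>\d+)
  | (?P<OP>[-+*/()\[\],.])
  | (?P<RBRACE>\})
  | (?P<WS>\s+)
""", re.VERBOSE)


def lex(source):
    """Lazily yield (kind, text) pairs in a single pass over source."""
    modes = ["code"]  # stack of "code" / "string"; top entry is the current mode
    pos = 0
    while pos < len(source):
        if modes[-1] == "string":
            ch = source[pos]
            if ch == '"':  # closing quote of the innermost f-string
                modes.pop()
                pos += 1
                yield ("FSTRING_END", ch)
            elif ch == "{":  # replacement field: switch back to code mode
                modes.append("code")
                pos += 1
                yield ("LBRACE", ch)
            else:  # literal text up to the next quote or brace
                end = pos
                while end < len(source) and source[end] not in '"{':
                    end += 1
                text = source[pos:end]
                pos = end
                yield ("FSTRING_MIDDLE", text)
            continue

        match = CODE_TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind == "WS":
            continue
        if kind == "FSTRING_START":
            modes.append("string")
        elif kind == "RBRACE":
            if len(modes) == 1:
                raise SyntaxError("'}' outside of a replacement field")
            modes.pop()  # leave the field, back to the enclosing string
        yield (kind, match.group())
```

The parser can pull tokens with `next(tokens)` one at a time; no lookahead is needed for the reused quote because the mode on top of the stack already says what a `"` means at that point.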
Heya - I'm writing a module and it relies on an outstanding PR for core Python (asyncio). I'm not really aware of how to go about including core Python in a pip install. I don't see this PR being accepted any time soon (it was opened in 2019). I would appreciate your advice on how to proceed. https://github.com/python/cpython/pull/16429 is the PR. Is it best practice to just import the class and overwrite the method with the patch?
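For what the "import the class and overwrite the method" approach could look like, here is a minimal sketch. `Connection`, `close` and `apply_patch` are hypothetical stand-ins (the real class and the real diff from the PR are not reproduced here); the point is only the shape: keep a reference to the original, wrap it, and make the patch idempotent.

```python
import functools


class Connection:
    """Hypothetical stand-in for the stdlib class the PR modifies."""

    def close(self):
        return "closed"


def apply_patch(cls=Connection):
    """Replace cls.close with a patched version, at most once."""
    original = cls.close
    if getattr(original, "_is_backport", False):
        return  # already patched, do nothing

    @functools.wraps(original)
    def close(self):
        # Hypothetical change standing in for whatever the PR actually does.
        result = original(self)
        return f"{result} (patched)"

    close._is_backport = True  # marker so repeated imports don't double-wrap
    cls.close = close
```

Monkeypatching the stdlib affects every other library in the same process, so it is probably worth scoping the patch as narrowly as possible (for example, patching a subclass you own instead of the stdlib class, where that is an option).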
So, I noticed recently, the functools.partial is incompatible with the inspect module. I have built a workaround.
# a proper partial implementation that works with ParamSpec etc.
from functools import wraps

def partial(f, *args, **kwargs):
    return wraps(f)(lambda *a, **kw: f(*(args + a), **{**kwargs, **kw}))
this works because it returns a standard Python function with the attributes of the wrapped function applied using wraps. Similar functionality could be implemented in the default partial to allow this; given that wraps and partial live in the same module, this should be a fairly simple fix
probably because the actual partial() is an object implemented in C, specifically from _functools
it's not a function
Yeah, it's a class
So it breaks inspect
because it obviously no longer represents a function
the above version applies the same functionality, while preserving the function
how does it "break"?
```py
class A:
    def __init__(self):
        x = 2

from inspect import getsource
print(getsource(A))
``` works
you can't use things like ParamSpec etc.; anything that inspect uses to inspect functions doesn't work for partials, because partial is a class
that's only one of the factors
so if you, say, have a pipeline that uses these details, it will break if a partial is ever included
the other one is that it's implemented in C
I know
but my point is
with the wraps decorator
the C object is redundant
because you can do it functionally, and not break things
(3 lines over 100 in the functools module)
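To make the disagreement concrete, a small comparison of what `inspect` reports for a `functools.partial` object versus the `wraps`-based version. `greet` and `wrapped_partial` are names made up for this sketch. One caveat worth noting: `inspect.signature` does understand `functools.partial` and drops the bound parameter, while the wrapped lambda reports the original signature (via `__wrapped__`), which still lists the already-bound argument.

```python
import functools
import inspect


def greet(greeting, name):
    return f"{greeting}, {name}"


def wrapped_partial(f, *args, **kwargs):
    # Same idea as the custom partial discussed above, under a different name.
    return functools.wraps(f)(lambda *a, **kw: f(*(args + a), **{**kwargs, **kw}))


std = functools.partial(greet, "hi")
fn = wrapped_partial(greet, "hi")

report = {
    "std_isfunction": inspect.isfunction(std),    # partial objects are not functions
    "fn_isfunction": inspect.isfunction(fn),      # the lambda is a real function
    "std_has_name": hasattr(std, "__name__"),     # partial has no __name__ by default
    "fn_name": fn.__name__,                       # copied over by wraps
    "std_signature": str(inspect.signature(std)), # bound argument removed
    "fn_signature": str(inspect.signature(fn)),   # original signature, via __wrapped__
}
```

So tools that key off `inspect.isfunction` or `__name__` will trip over a plain partial, but a tool that only calls `inspect.signature` arguably gets the more accurate answer from the stdlib object.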
it does work
tested with the same A class above
i think the only problem is that it's in C
Have been trying to put together a visual scripting tool that uses the inspect module to gather the function details to assemble the nodes, and there are a bunch that will raise exceptions automatically
it seems to just get A.__init__()
wraps() uses partial() btw
try it with partials
def wraps(wrapped,
          assigned = WRAPPER_ASSIGNMENTS,
          updated = WRAPPER_UPDATES):
    """Decorator factory to apply update_wrapper() to a wrapper function
       Returns a decorator that invokes update_wrapper() with the decorated
       function as the wrapper argument and the arguments to wraps() as the
       remaining arguments. Default arguments are as for update_wrapper().
       This is a convenience function to simplify applying partial() to
       update_wrapper().
    """
    return partial(update_wrapper, wrapped=wrapped,
                   assigned=assigned, updated=updated)
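A quick check of what that source implies, assuming CPython's `functools.py` as quoted above: the decorator returned by `wraps()` is itself a `partial` of `update_wrapper`, so `wraps(f)(g)` and `update_wrapper(g, f)` end up doing the same thing. The function names below are placeholders.

```python
import functools


def original(x):
    """Docstring of original."""
    return x


def wrapper_a(x):
    return original(x)


def wrapper_b(x):
    return original(x)


decorator = functools.wraps(original)

# Two spellings of the same metadata copy.
functools.wraps(original)(wrapper_a)
functools.update_wrapper(wrapper_b, original)

same_metadata = (
    wrapper_a.__name__ == wrapper_b.__name__ == "original"
    and wrapper_a.__doc__ == wrapper_b.__doc__ == original.__doc__
    and wrapper_a.__wrapped__ is wrapper_b.__wrapped__ is original
)
```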
so um
why use wraps() again?
You are missing what's actually happening there though
it's just passing the stuff on here
update_wrapper()?
It will still just crash with plain partial
because partial itself does not apply the update wrapper functionality
so, when inspected
does not appear as the function it is a partial of
wraps pulls up that data
and then the only solution is to try and filter and update these in whatever system you are working with
by checking if it is a partial
and then doing something janky to try and update that
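For reference, the "check if it is a partial" workaround does not have to be very janky. A minimal sketch, assuming the consuming tool only needs a display name and a signature; `describe_callable` is a hypothetical helper, while `.func` is a documented attribute of `functools.partial` objects.

```python
import functools
import inspect


def describe_callable(obj):
    """Return (name, signature string) for plain functions and partials."""
    target = obj
    # Walk down to the underlying function to find a usable name.
    while isinstance(target, functools.partial):
        target = target.func
    name = getattr(target, "__name__", repr(target))
    # inspect.signature accepts partial objects and omits bound positionals.
    return name, str(inspect.signature(obj))


def scale(value, factor):
    return value * factor
```

Calling `describe_callable(functools.partial(scale, 10))` gives the name of the underlying function together with the signature of what is still left to pass.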
it seems update_wrapper also works here
it's probably better to just define and use the custom partial() in user code instead of adding it to the stdlib
so, its not like it couldn't be added to the partial
there's still the other features of partial() that a simple function doesn't replace
but shouldn't something that literally behaves like a function in this case be treatable as one in all cases?
because it's used as such in all cases
for example you can pickle a partial() of a module-level function, but you can't pickle the lambda that the custom version returns
there's also the useful representation output
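A small sketch of those two points, using `operator.mul` so the example does not depend on where it is run. `lambda_partial` and `can_pickle` are helper names invented here; the exact exception type raised for the lambda may vary between Python versions, so the helper catches a few.

```python
import functools
import operator
import pickle


def lambda_partial(f, *args, **kwargs):
    # The wraps-based variant from earlier in the discussion.
    return functools.wraps(f)(lambda *a, **kw: f(*(args + a), **{**kwargs, **kw}))


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


double = functools.partial(operator.mul, 2)
restored = pickle.loads(pickle.dumps(double))  # round-trips func and bound args

double_lambda = lambda_partial(operator.mul, 2)

# The repr of a partial names the function and the bound arguments;
# the lambda's repr is the generic "<function ... at 0x...>" form.
partial_repr = repr(double)
```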
why are you like this lol
I mean, yes
but make it compatible with inspect
the same way funcs are
this way
best of both
!e and a hidden feature too ```py
from functools import partial

def a(y):
    pass

def b(x, z):
    return x * z + 2

a.func = b
print(partial(a, 5)(2))
```
@rose schooner :x: Your 3.11 eval job has completed with return code 1.
001 | Traceback (most recent call last):
002 | File "<string>", line 9, in <module>
003 | TypeError: a() takes 0 positional arguments but 1 was given
it's such an annoying little gotcha
ok maybe it doesn't work that way
partial = lambda f, *a, **k: update_wrapper(lambda *_a, **_k: f(*(a + _a), **{**k, **_k}), f)
partialmethod = lambda f, *a, **k: update_wrapper(lambda self, *_a, **_k: f(self, *(a + _a), **{**k, **_k}), f)
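The same two one-liners spelled out as `def`s, plus a usage example, as a sketch. The names `simple_partial`, `simple_partialmethod` and the `Cell` class are made up for illustration. Because the result is a plain function, it binds as a method when stored on a class, which is why the method variant only needs to thread `self` through.

```python
import inspect
from functools import update_wrapper


def simple_partial(f, *a, **k):
    def bound(*_a, **_k):
        return f(*(a + _a), **{**k, **_k})
    return update_wrapper(bound, f)  # update_wrapper returns the wrapper


def simple_partialmethod(f, *a, **k):
    def bound(self, *_a, **_k):
        return f(self, *(a + _a), **{**k, **_k})
    return update_wrapper(bound, f)


class Cell:
    def __init__(self):
        self.alive = False

    def set_state(self, state):
        self.alive = state

    # Plain functions stored on the class become methods on instances.
    set_alive = simple_partialmethod(set_state, True)
    set_dead = simple_partialmethod(set_state, False)


cell = Cell()
cell.set_alive()
```

One side effect of copying metadata: `Cell.set_alive.__name__` is `"set_state"`, so anything that displays names will show the wrapped function's name, not the attribute name.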
works
as a way to define em
it's in the pure python implementation in functools
I think the update_wrapper functionality could technically be applied in the class
oh wait it's to chain partial()s effectively
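What the chaining looks like from the outside, as far as I know for current CPython: wrapping a partial in another partial flattens into a single object that points at the original function with the arguments concatenated. `volume` is just a placeholder function.

```python
from functools import partial


def volume(length, width, height):
    return length * width * height


base = partial(volume, 2)   # binds length
slab = partial(base, 3)     # binds width on top of that

flattened = slab.func is volume  # the inner partial is not kept as a layer
bound_args = slab.args           # positional arguments are concatenated in order
result = slab(4)
```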