#internals-and-peps


pliant tusk
#

ah interdependencies

grave jolt
#

headers 🥴

#

probably the feature of C that aged very poorly

warm breach
#

wait

#

what if I make an api call and just ask GPT3 to translate it

#

👀

halcyon trail
feral island
#

what if Python imports also worked by literally copying in the code from the imported module

pliant tusk
pliant tusk
# halcyon trail i'd be a very sad panda
```py
import preprocess
#include "ns.py"

NS_BEGIN(foo)
x = 1
y = 2
z = 3
NS_END

print(foo)
```
i mean you can do it if you want

`preprocess.py`
```py
import sys, dis, subprocess

# Walk up the stack to the frame that is currently executing `import preprocess`.
frame = sys._getframe()
while frame := frame.f_back:
    f_code = frame.f_code
    if f_code.co_code[frame.f_lasti] == dis.opmap['IMPORT_NAME'] \
        and f_code.co_names[f_code.co_code[frame.f_lasti + 1]] == 'preprocess':
        file = frame.f_globals['__file__']
        if file:
            # Run the importing file through the C preprocessor, then exec the result.
            processed = subprocess.run(['gcc', '-E','-x','c', file], stdout=subprocess.PIPE)
            if processed.returncode == 0:
                exec(processed.stdout.decode())
            exit(processed.returncode)
```

`ns.py`
```py
#define NS_BEGIN(name) __import__("builtins").__ns_globals__ = globals().copy(); globals().clear(); __ns_name__ = #name ;
#define NS_END __ns__=__import__("types").SimpleNamespace(**{k: v for k, v in globals().items() if k != "__ns_name__"}); __import__("builtins").__ns_globals__[__ns_name__]=__ns__; globals().clear(); globals().update(__import__("builtins").__ns_globals__)
```
#

note that this is a very bad implementation lmao

halcyon trail
#

Sure, but do it if you want is different from import working that way

quick snow
#

there are type hints (when using Pydantic), but we don't do static typing

sinful saddle
#

Is there any guidance on naming conventions for dictionary keys specifically on the use of spaces?

grave jolt
#

IIRC PEP 8 prescribes this:
```py
foo = {
    "fizz": 1,
    "buzz": 2,
}
```

sinful saddle
#

I mean on the use of spaces within dictionary keys.

flat gazelle
grave jolt
#

I don't see why a dictionary key can't contain a space

#

but if you're storing some fixed attributes, yeah, use a class

#

(maybe a dataclasses.dataclass or see attrs, if you want a dumb record)

flat gazelle
#

I would say the python convention is more on not having spaces in keys (e.g. a TypedDict can't describe a dict with spaces in its keys afaik)
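
A small sketch on the TypedDict point above: as far as I know, only the class-based syntax is limited to identifier-like keys, while the functional form takes an ordinary dict of key names. The `Movie` type and its keys are made up for illustration.

```python
from typing import TypedDict

# Functional syntax: keys are plain strings, so a space is accepted.
Movie = TypedDict("Movie", {"title": str, "release year": int})

movie: Movie = {"title": "Brazil", "release year": 1985}
print(movie["release year"])
```

Whether a key like this is a good idea is a separate style question; the sketch only shows that the type system can describe it.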

feral island
feral island
#

but as other people here have said, often a class with attributes is a better choice than a dictionary with hardcoded keys
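
To make the "use a class instead of hardcoded keys" suggestion concrete, here is a minimal sketch with `dataclasses`. The `Server` record and its fields are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class Server:  # hypothetical record type, not from the discussion
    host: str
    port: int = 8080


server = Server(host="example.org")
as_dict = asdict(server)  # still convertible to a dict when one is needed
print(server.port, as_dict)
```

Attribute names are checked by linters and type checkers, which a misspelled string key is not.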

gray galleon
#

why are True, False and None capitalized if they aren't classes

rich cradle
gray galleon
#

what, bool type was introduced that late?

#

anyways True and False are capitalized because of None
and None is capitalized because of guido i guess

warm breach
#
re.compile('struct _object {.*?};', re.DOTALL)
<re.Match object; span=(4049, 4146), match='struct _object {\n    _PyObject_HEAD_EXTRA\n    P>

struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
};
pliant tusk
#

Oh, that will never work

#

Too many nested structs, or structs that contain fields built by the preprocessor

warm breach
# pliant tusk Oh, that will never work

trying to get that ctypeslib thing to work again, but it can't find the pyconfig.h somehow (which is in the current root path)

❯ clang2py --clang-args="-stdlib=libc++ -I -I. -IInclude" "Objects/dictobject.c"
WARNING:clangparser:'pyconfig.h' file not found (Objects/dictobject.c:12:10)
WARNING:clangparser:Source code has 1 error. Please fix.
#

do you know if I'm doing the include args wrong for clang 😔

pliant tusk
#

You probably need to find a way to integrate ctypeslib with Python's Makefile

warm breach
#

at least I might be able to add it to testing and see if they ever mismatch

#

probably not good enough to generate anything yet though

pliant tusk
warm breach
#

here's the craziness 🥴

#

goes in a tools/ folder at root

#

and the config file tools/struct_source.toml

[config]
repository = "https://github.com/python/cpython"
versions = ["3.8", "3.9", "3.10", "3.11"]

[structs.py_object.PyObject]
source = "Include/object.h"
regex = "struct _object {(.*?)};"
exclude_fields = ["_PyObject_HEAD_EXTRA"]
pliant tusk
warm breach
#

I guess it's mildly cleaner to parse

pliant tusk
warm breach
#

maybe I can define some "core" types like Py_ssize_t and then recursively resolve the types myself? 🥴

pliant tusk
warm breach
fallen slateBOT
#

src/einspect/structs/include/object_h.py lines 70 to 73

```py
freefunc = PYFUNCTYPE(None, c_void_p)
destructor = PYFUNCTYPE(None, py_object)
getattrfunc = PYFUNCTYPE(py_object, py_object, c_char_p)
getattrofunc = PYFUNCTYPE(py_object, py_object, py_object)
```
pliant tusk
sacred yew
warm breach
sacred yew
warm breach
#

!e

from ctypes import *

realloc = pythonapi["PyMem_Realloc"]
realloc.argtypes = [c_void_p, c_size_t]
realloc.restype = c_void_p

t = (1, 2)
print(id(t))

res = realloc(id(t), 64)
print(res)
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 140266784484928
002 | 140266784484928
warm breach
#

!e

from ctypes import *

realloc = pythonapi["PyMem_Realloc"]
realloc.argtypes = [c_void_p, c_size_t]
realloc.restype = c_void_p

t = (1, 2)
print(id(t))

res = realloc(id(t), 80)
print(res)
fallen slateBOT
#

@warm breach :x: Your 3.11 eval job has completed with return code 139 (SIGSEGV).

001 | 140541912034816
002 | 140541912043472
warm breach
#

is there a reason why PyMem_Realloc here at larger sizes segfaults? (Shouldn't it just return NULL if it can't reallocate?)

pliant tusk
pliant tusk
warm breach
pliant tusk
#

not if it cannot find enough space

warm breach
#

oh, hm

#

how do strings do that thing where they can try to resize safely

#

is that something in the stable abi

pliant tusk
feral island
#

and even that API is probably somewhat unsafe

pliant tusk
fallen slateBOT
#

Objects/bytearrayobject.c lines 232 to 243

```c
else {
    sval = PyObject_Realloc(obj->ob_bytes, alloc);
    if (sval == NULL) {
        PyErr_NoMemory();
        return -1;
    }
}

obj->ob_bytes = obj->ob_start = sval;
Py_SET_SIZE(self, size);
obj->ob_alloc = alloc;
obj->ob_bytes[size] = '\0'; /* Trailing null byte */
```
fallen slateBOT
#

Objects/unicodeobject.c lines 11530 to 11535

```c
/* append inplace */
if (unicode_resize(p_left, new_len) != 0)
    goto error;

/* copy 'right' into the newly allocated area of 'left' */
_PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
```
feral island
fallen slateBOT
#

Objects/unicodeobject.c lines 1090 to 1091

```c
static int
resize_inplace(PyObject *unicode, Py_ssize_t length)
```
`Objects/unicodeobject.c` line 1124
```c
data = (PyObject *)PyObject_Realloc(data, new_size);
```
#

Objects/unicodeobject.c line 1978

```c
static int
```
warm breach
#

this also seems to call PyObject_Realloc unconditionally

#

ah

#

I guess it doesn't matter if the realloc isn't in place there

#

since it's 1 ref count and it just returns the new allocation

warm breach
# pliant tusk strings like `str()`? afaik they never resize

!e apparently you can trick it to mutate the string in-place if the ref count is 1 👀

from einspect.structs import PyObject

s = "a"
s += "b"
text = s

PyObject.from_object(s).DecRef()

s += "1"
s += "2"
print(s)
print(text)

PyObject.from_object(s).IncRef()
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | ab12
002 | ab12
quick snow
#

!e I'm assuming this fails?

from einspect.structs import PyObject

s = "a"
s += "b"
text = s

PyObject.from_object(s).DecRef()

s += "1"*50
s += "2"*50
print(s)
print(text)

PyObject.from_object(s).IncRef()
grave jolt
#

wait what

fallen slateBOT
#

@quick snow :x: Your 3.11 eval job has completed with return code 139 (SIGSEGV).

001 | ab1111111111111111111111111111111111111111111111111122222222222222222222222222222222222222222222222222
002 | flush
feral island
#

you are accessing random memory

grave jolt
#

ah right

#

😳

feral island
#

print calls f.flush(), so probably that's where the flush string comes from

#

it happened to be allocated in the same place

warm breach
quick snow
warm breach
#

so s += "1"*50 shadowed the original s with a new object

quick snow
#

I'm assuming it's about memory alignment, that your version works?

warm breach
#

and since we modified refcount of s to be 1, it got dropped

#

so later print(text) prints garbage memory
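
For contrast, a minimal sketch of the same sequence without the `DecRef` trick. The assumption here is that CPython only resizes a string in place when nothing else references it, so an alias like `text` forces a fresh object and keeps the old value intact.

```python
s = "a"
s += "b"      # built at runtime, so not a shared constant
text = s      # second reference: in-place resize is not allowed now
s += "1"      # CPython has to build a new string here

print(s, text, s is text)
```

Lowering the refcount by hand removes exactly this protection, which is why `text` ended up pointing at freed memory in the eval above.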

warm breach
#

!e

s = "a"
s += "b"
print(id(s))

s += "1"
s += "2"
print(id(s))
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 140050976782704
002 | 140050976782704
warm breach
#

!e but append longer and it'll (probably) no longer be in place

s = "a"
s += "b"
print(id(s))

s += "111111111"
s += "222222222"
print(id(s))
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 139980662602224
002 | 139980662606864
quick snow
grave jolt
#

!e

from einspect.structs import PyObject

s = "a"
s += "b"
text = s

PyObject.from_object(s).DecRef()

s += "3"*4_000_000
s += "55"*6_000_0
#print(s)
#print(text)
print("hi")
PyObject.from_object(s).IncRef()
fallen slateBOT
#

@grave jolt :white_check_mark: Your 3.11 eval job has completed with return code 0.

hi
pliant tusk
#

but also due to the way that python uses memory pools you can get weird spacing around objects

warm breach
#

!e

print("ab".__sizeof__())
print("ab12".__sizeof__())
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 51
002 | 53
warm breach
#

so "ab" had 64 bytes allocated

#

and "ab12" happens to fit fine

#

!e

s = "a"
s += "b"
print(id(s))

s += "1234567890"
s += "abc"

print(id(s))
print(s.__sizeof__())
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 140533952084528
002 | 140533952084528
003 | 64
warm breach
#

so you can append all the way up to 64 and it'll resize in place

#

!e but one more and it can't

s = "a"
s += "b"
print(id(s))

s += "1234567890"
s += "abcd"

print(id(s))
print(s.__sizeof__())
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 140460387744304
002 | 140460387748880
003 | 65
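
A rough sketch of the arithmetic behind the 64-byte boundary. It assumes the small-object allocator hands out blocks in multiples of 16 bytes on 64-bit builds; `ALIGNMENT` and `size_class` are names made up for this example, not CPython APIs.

```python
import sys

ALIGNMENT = 16  # assumption: allocator granularity on 64-bit builds


def size_class(nbytes: int) -> int:
    """Round a request up to the next multiple of ALIGNMENT."""
    return -(-nbytes // ALIGNMENT) * ALIGNMENT


s = "a"
s += "b"
requested = sys.getsizeof(s)
print(requested, size_class(requested))
```

Under that assumption a 51-byte request occupies a 64-byte block, so appends can stay in place until the reported size passes 64, and a 65-byte string needs an 80-byte block.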
feral island
#

64 bytes? that feels like a lot of overhead for a two-char string

#

!e ```
import sys
print(sys.getsizeof("ab"))
print(sys.getsizeof("a"))
```

fallen slateBOT
#

@feral island :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 51
002 | 50
feral island
#

!e ```
import sys
a = "a"
a += "b"
print(sys.getsizeof(a))
print(sys.getsizeof("a"))
```

fallen slateBOT
#

@feral island :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 51
002 | 50
warm breach
#

they removed the wstr field in 3.12 at least

#

so ascii compact strings are 8 bytes smaller

#
Python 3.12.0a3 (main, Jan 18 2023, 01:07:36) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> a = "a"
>>> a += "b"
>>> sys.getsizeof(a)
43
>>> sys.getsizeof("a")
42
feral island
#

interestingly in my local 3.9 "a" was 58 bytes but "ab" was 51 bytes

thorn barn
#

Hi everyone,
I have a question regarding memory footprint in Python. There is the possibility of creating a class with dunder slots to remove the dunder dict attribute and thereby remove some memory overhead for each instance. To take a look at this I created a toy example like:

class Point:

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

as well as a version with dunder slots and another one based on a namedtuple to compare the three cases.
Since the sys.getsizeof function only returns a simplistic value I used the getsize function from this post, which keeps track of references. https://stackoverflow.com/questions/449560/how-do-i-determine-the-size-of-an-object-in-python

Now comes the weird part that I don't understand:
I was comparing the results for different Python versions. In 3.10, the getsize function reports 236 bytes for an instance of the Point class, while the slots version takes 140 bytes.
Now with version 3.11 the normal class version drops down to 140 bytes (just like the slots version).
So I checked whether the dunder dict still exists, which it does, but after accessing the dunder dict the size that getsize returns jumps up to 436 bytes. In version 3.10 this does not happen.

Does anyone have an explanation for this behavior? The 3.11 release notes do not give me a hint that any memory optimization was done in this regard.

You can find the full working example in this godbolt link where you can easily switch between the versions of Python.

https://godbolt.org/z/PsYYKxb49

feral island
dusk comet
#

IIRC, dicts also allocate memory for the hash table on first access

thorn barn
#

would you say that it is still valid in python 3.11 that dunder slots will save you memory?

pliant tusk
feral island
#

yes, unless you use __dict__ directly, which generally you shouldn't be doing

#

(that was in response to @thorn barn )

feral island
thorn barn
pliant tusk
#

Ahh so the objects hold a pointer to their keys and values table, and lazily create the dict that references them
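
A minimal sketch of the two variants being compared. Class names are illustrative, and per the discussion above, merely looking at `__dict__` may materialise the dict on 3.11, so measure sizes before doing that.

```python
class Point:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z


class SlotPoint:
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z


p = Point(1, 2, 3)
sp = SlotPoint(1, 2, 3)

has_dict = hasattr(p, "__dict__")        # note: this lookup touches the dict
slot_has_dict = hasattr(sp, "__dict__")  # slots instances have no dict at all

try:
    sp.w = 4  # not listed in __slots__
    rejected = False
except AttributeError:
    rejected = True

print(has_dict, slot_has_dict, rejected)
```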

thorn barn
#

thanks for the help!

pliant tusk
feral island
#

for one, an int might not fit in a py_ssize_t

#

and it might not be faster because you have to perform boxing when accessing the attributes

grave jolt
#

yeah

#

For example: you ask for foo.bar to get back a str object. Does that object have 1 reference? 2 references?

#

Is it immortal?

#

If we suppose that we don't want to copy the whole unicode string into a boxed object, if that object has 1 reference, how do you solve the += optimisation?

pliant tusk
#

yea maybe it wouldnt be worth it

dusk comet
feral island
#

yeah agree, that's where this could help

#

like if the adaptive interpreter sees 1 + foo.bar and knows that foo has a field of type int, it could potentially optimize it into just a pointer access + machine ADD instruction

dusk comet
#

There are also a lot of possible problems

warm breach
#

so TIL python ints are also (sometimes) mutable in 3.12 now

#

same thing as the current string append optimization

dusk comet
#

I guess floats and complex numbers can also be mutable

halcyon trail
#

I was going to say I'm surprised it's worth it, but since creating a new int would typically involve a heap allocation, that makes sense

umbral plume
#

not too familiar with the string optimisation, but i'm guessing that means something along the lines of "big integers with a sole reference can be mutated and returned as if it's a new object, rather than allocating a new integer and freeing the old one"?

halcyon trail
#

i assume. but they don't need to be big.

#

or at least, unless by "big" you simply mean "bigger than the handful of integers that cpython always allocates space for"

#

(a few hundred, iirc)

spice pecan
#

-5 through 256 I think

#

Most common ones
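
A quick way to probe the preallocated range, as a sketch that relies on CPython behaviour and is not a language guarantee. `is_cached` is a helper invented for this example; it builds the value twice at runtime and checks object identity.

```python
def is_cached(n: int) -> bool:
    """True if two independently built ints with value n are the same object."""
    a = int(str(n))
    b = int(str(n))
    return a is b


cached = [n for n in (-6, -5, 0, 256, 257) if is_cached(n)]
print(cached)
```

Going through `int(str(n))` avoids constant folding, which can otherwise make two equal literals share one object regardless of the cache.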

warm breach
#

this id should stay the same in >= 3.12.0a1

for i in range(300, 500):
    i += 1
    print(id(i))
dusk comet
#

I think in <=3.11 it will switch between two ids

#

!e ```py
for i in range(300, 310):
    i += 1
    print(id(i))
```

fallen slateBOT
#

@dusk comet :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 140413941526768
002 | 140413941526768
003 | 140413941526768
004 | 140413941526768
005 | 140413941526768
006 | 140413941526768
007 | 140413941526768
008 | 140413941526768
009 | 140413941526768
010 | 140413941526768
dusk comet
#

!e ```py
for i in range(300, 310):
    print(id(i))
```

fallen slateBOT
#

@dusk comet :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 139947360022800
002 | 139947360022768
003 | 139947360022800
004 | 139947360022768
005 | 139947360022800
006 | 139947360022768
007 | 139947360022800
008 | 139947360022768
009 | 139947360022800
010 | 139947360022768
quick trellis
feral island
# quick trellis why?

I think the iterator creates the next int before the previous one can be deallocated

#

and because of free lists they get reallocated in the same small set of spots

quick trellis
feral island
dusk comet
#

I see the same behaviour in CPython3.7.
(id(x) is very small because it is a 32-bit build)

>>> for i in range(300, 310):
...     i += 1
...     print(id(i))
...
9952576
9952576
9952576
9952576
9952576
9952576
9952576
9952576
9952576
9952576
>>> for i in range(300, 310):
...     print(id(i))
...
9952592
9952576
9952592
9952576
9952592
9952576
9952592
9952576
9952592
9952576
>>> import sys; sys.version
'3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:01:55) [MSC v.1900 32 bit (Intel)]'
#

i dont think these two examples are useful to observe mutability of ints

halcyon trail
#

I've often been surprised at the fact that there hasn't been a mainstream language designed entirely with these semantics in mind

#

You'd get the important benefits of immutability, and the ergonomics and mostly the performance of mutability

#

Those semantics aren't ideal when you want to deliberately share mutability, or for multithreading.
but those are both a small minority of cases, so you could explicitly annotate when you wanted that

#

I've been bitten by mutability of lists and dicts quite a few times in python, if strings/ints/etc were mutable too I imagine I'd get bit much more often

warm breach
fallen slateBOT
#

Objects/longobject.c lines 283 to 290

```c
// Mutate in place if there are no other references the old
// object.  This avoids an allocation in a common case.
// Since the primary use-case is iterating over ranges, which
// are typically positive, only do this optimization
// for positive integers (for now).
((PyLongObject *)old)->ob_digit[0] =
    Py_SAFE_DOWNCAST(value, Py_ssize_t, digit);
return 0;
```
warm breach
#

python/cpython#91713

neon troutBOT
warm breach
#

anyone know where the str.__sizeof__ implementation is

#

can't find 😔

#

nevermind found it now, named with one underscore unlike the other ones

feral island
fallen slateBOT
#

Objects/unicodeobject.c lines 14120 to 14121

```c
static PyObject *
unicode_sizeof_impl(PyObject *self)
```
warm breach
#

I was searching for ___sizeof___impl since everything else was named like that

#

but unicode does _sizeof_impl for some reason

#

@pliant tusk so copilot is getting pretty good at translating cpython functions now 👀

pliant tusk
#

Impressive

#

Can it translate the structs too?

warm breach
warm breach
pliant tusk
#

That seems close right?

#

(I haven't looked at how structs are defined in einspect yet)

warm breach
#
@struct
class SetEntry(Structure, AsRef, Generic[_T]):
    key: ptr[PyObject[_T, None, None]]
    hash: Annotated[int, Py_hash_t]  # noqa: A003
#

here's the actual one

#

to be fair my Annotated usage is pretty arbitrary so

#

it gets parsed as

Annotated[<ignored>, type]

or

Annotated[<ignored>, type, bit-width]
#

mainly due to ctypes autocasting, if you type something as c_uint32 it won't actually be that type at runtime (will be cast to int instead)
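
A small sketch of both halves of that remark, using only the standard library. `Entry` and `Spec` are hypothetical stand-ins and do not reproduce how einspect declares its structs.

```python
from ctypes import Structure, c_uint32
from typing import Annotated, get_args, get_type_hints


class Entry(Structure):  # hypothetical struct for illustration
    _fields_ = [("key_hash", c_uint32)]


e = Entry(key_hash=7)
field_type = type(e.key_hash)  # ctypes hands back a plain int, not c_uint32


class Spec:  # hypothetical annotation-only class
    key_hash: Annotated[int, c_uint32]


# include_extras=True keeps the Annotated metadata instead of stripping it.
hint = get_type_hints(Spec, include_extras=True)["key_hash"]
static_type, c_type = get_args(hint)
print(field_type, static_type, c_type)
```

The first `Annotated` argument is what a type checker sees, and the second can carry the real C type for whatever builds the struct at runtime.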

warm breach
fallen slateBOT
#

src/einspect/views/view_str.py line 24

```py
class StrView(View[str, None, None], MutableSequence):
```
warm breach
#

safe-ish so far

from einspect import view

s = "abc🦀"
v = view(s)

v[-1] = "!"
print(s)
# abc!

v[:] = "🤔12🐍"
print(s)
# 🤔12🐍

del v[1:]
print(s)
# 🤔

v.extend(["D", "E", "F"])
print(s)
# 🤔DEF

v.reverse()
print(s)
# FED🤔

v.clear()
assert s == ""

v.append("こんにちは")
print(s)
# こんにちは
quick snow
warm breach
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 53
002 | 92
warm breach
#
from einspect import view

s = "abc!"
v = view(s)
v[3] = "🦀"
Traceback (most recent call last):
  File "main.py", line 5, in <module>
    v[3] = "🦀"
    ~^^^
  File "einspect/views/view_str.py", line 74, in __setitem__
    raise UnsafeError(
einspect.errors.UnsafeError: setitem required str to be resized beyond current memory allocation. Enter an unsafe context to allow this.
vernal loom
#

!e ```py
print("abc!".encode('u8'))
print("abc🦀".encode('u8'))
```

fallen slateBOT
#

@vernal loom :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | b'abc!'
002 | b'abc\xf0\x9f\xa6\x80'
vernal loom
#

that's why it's longer

#

!e ```py
print(list("abc!".encode('u8')))
print(list("abc🦀".encode('u8')))
```

fallen slateBOT
#

@vernal loom :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [97, 98, 99, 33]
002 | [97, 98, 99, 240, 159, 166, 128]
warm breach
#

but otherwise, given enough space, it dynamically reallocates the PyObject between the 3 str subtypes - PyASCIIObject | PyCompactUnicodeObject | PyUnicodeObject

#
from einspect import view

s = "abc🦀"
v = view(s)
print(v)
>> StrView(<PyCompactUnicodeObject at 0x10449d920>)

v[3] = "!"
print(v)
>> StrView(<PyASCIIObject at 0x10449d920>)
vernal loom
#

can you allocate more space in einspect?

warm breach
#

well yes but

#

like realloc it might not be in-place

#

a str might already have other structs after it

#

not much point in doing that since python variables are essentially memory pointers, and if you move the object somewhere else the original variables now point to random memory

quick snow
raven ridge
#

they do cache a UTF-8 representation the first time it's requested, though

rose schooner
#

ignoring padding, that is
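
A sketch of how the in-memory size relates to the widest character, assuming CPython's compact string layout where every character in a string uses the same 1, 2 or 4 byte width. `bytes_per_char` is a helper made up for this example.

```python
import sys


def bytes_per_char(s: str) -> int:
    """Extra bytes reported when one more ASCII character is appended to s."""
    return sys.getsizeof(s + "x") - sys.getsizeof(s)


ascii_cost = bytes_per_char("abc!")
bmp_cost = bytes_per_char("abc\u20ac")         # euro sign, needs the 2-byte kind
astral_cost = bytes_per_char("abc\U0001F980")  # crab emoji, needs the 4-byte kind
print(ascii_cost, bmp_cost, astral_cost)
```

So the crab string grows because all of its characters are stored 4 bytes wide, which is separate from the length of its UTF-8 encoding.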

fallen slateBOT
#

Include/cpython/unicodeobject.h lines 55 to 64

```c
- compact ascii:

  * structure = PyASCIIObject
  * test: PyUnicode_IS_COMPACT_ASCII(op)
  * kind = PyUnicode_1BYTE_KIND
  * compact = 1
  * ascii = 1
  * (length is the length of the utf8)
  * (data starts just after the structure)
  * (since ASCII is decoded from UTF-8, the utf8 string are the data)
```
raven ridge
fallen slateBOT
#

Include/cpython/unicodeobject.h lines 66 to 76

```c
- compact:

  * structure = PyCompactUnicodeObject
  * test: PyUnicode_IS_COMPACT(op) && !PyUnicode_IS_ASCII(op)
  * kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or
    PyUnicode_4BYTE_KIND
  * compact = 1
  * ascii = 0
  * utf8 is not shared with data
  * utf8_length = 0 if utf8 is NULL
  * (data starts just after the structure)
```
warm breach
fallen slateBOT
#

src/einspect/structs/py_unicode.py line 83

```py
or addressof(cast(obj.wstr, c_void_p)) != addressof(PyUnicode_DATA(obj))
```
warm breach
#

tbh the ctypes auto cast is the most annoying thing ever

#

can't compare 2 pointers since one of them gets transformed into bytes | None

rose schooner
warm breach
#

I've only been able to get ASCII and compact

fallen slateBOT
#

Include/cpython/unicodeobject.h lines 78 to 87

```c
- legacy string:

  * structure = PyUnicodeObject structure
  * test: !PyUnicode_IS_COMPACT(op)
  * kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or
    PyUnicode_4BYTE_KIND
  * compact = 0
  * data.any is not NULL
  * utf8 is shared and utf8_length = length with data.any if ascii = 1
  * utf8_length = 0 if utf8 is NULL
```
fallen slateBOT
#

Include/cpython/unicodeobject.h lines 153 to 161

```c
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;
```
rose schooner
#

i don't understand the question

#

like isn't it just .from_object() or something

#

struct.from_address(id(obj))

warm breach
warm breach
#

essentially if the string object has compact = 1, it's PyCompactUnicodeObject, if ascii = 1, it's PyASCIIObject, if both are 0, it's the biggest subtype, PyUnicodeObject

#

but I haven't been able to get one naturally in python

rose schooner
rose schooner
#

well it actually is but why does it appear like this

warm breach
rose schooner
#

and also um this
```pycon
>>> a = "\U0010ffff\0"
>>> from ctypes import c_ulong
>>> c_ulong.from_address(id(a)+72)
c_ulong(1114111)
>>> view(a)._pyobject.data.ucs4.contents
Windows fatal exception: access violation

Current thread 0x00000d4c (most recent call first):
  File "<stdin>", line 1 in <module>
```

#

oh i see why now
```pycon
>>> a = "\U0010ffff\0"
>>> from ctypes import c_ulong, POINTER
>>> POINTER(c_ulong).from_address(id(a)+72).contents
Windows fatal exception: access violation

Current thread 0x000008b8 (most recent call first):
  File "<stdin>", line 1 in <module>
```

warm breach
#

that was before I implemented the 3 separate subtypes for str

#
from einspect import view

a = "\U0010ffff\0"
print(view(a))
>> StrView(<PyCompactUnicodeObject at 0x100f3af10>)
#

that thing is still a PyCompactUnicodeObject, so no data field

rose schooner
warm breach
#

only PyUnicodeObject has a data field

#

PyCompactUnicodeObject is PyASCIIObject with 3 more fields

utf8_length, utf8, wstr_length
#

PyUnicodeObject is PyCompactUnicodeObject with 1 more field, data

rose schooner
#

and view(a)._pyobject is a PyUnicodeObject

warm breach
#

well that wasn't quite correct

#

turns out actually the strings dynamically may be one of 3 subtypes

#

I haven't released the version with the 3 different types yet

#

currently there's only the PyUnicodeObject struct

#

so if you access data when it actually should be a compact or ascii it accesses out of bound memory

rose schooner
warm breach
#

I'm not sure why accessing element 0 of the pointer is different from c_ulong but...

#

!e

from ctypes import c_ulong, POINTER

a = "\U0010ffff\0"

print(POINTER(c_ulong).from_address(id(a)+72).contents)
fallen slateBOT
#

@warm breach :warning: Your 3.11 eval job has completed with return code 139 (SIGSEGV).

[No output]
rose schooner
#

string subtypes

warm breach
rose schooner
fallen slateBOT
#

Objects/unicodeobject.c line 1203

```c
_PyUnicode_STATE(unicode).compact = 1;
```
rose schooner
fallen slateBOT
#

Objects/unicodeobject.c line 14380

```c
_PyUnicode_STATE(self).compact = 0;
```
rose schooner
#

!e ```py
from einspect.views import StrView
class Str(str):...

a=Str("\U0010ffff\0")
print(StrView(a)._pyobject.data.ucs4.contents)
```

fallen slateBOT
#

@rose schooner :white_check_mark: Your 3.11 eval job has completed with return code 0.

c_uint(1114111)
rose schooner
#

@warm breach

#

it works

warm breach
#

hm

#

apparently my type algorithm is wrong then

#

I check ascii = 1 before compact

#

that thing is ascii = 1 but compact = 0 which is weird

#

!e

from einspect.structs import PyUnicodeObject

class Foo(str):
    ...

f = Foo()
v = PyUnicodeObject.from_object(f)
print(v.ascii)
print(v.compact)
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 1
002 | 0
fallen slateBOT
#

Objects/unicodeobject.c lines 1114 to 1122

```c
if (ascii->state.compact)
{
    if (ascii->state.ascii)
        data = (ascii + 1);
    else
        data = (compact + 1);
}
else
    data = unicode->data.any;
```
rose schooner
#

so yeah you're supposed to check for compact first

warm breach
#

so...

#

wtf is ascii = 1 doing there

#

bug?

rose schooner
#

where do you check it

warm breach
#

you can't have an ascii string that's not compact, that's not a thing

fallen slateBOT
#

Include/cpython/unicodeobject.h line 195

```c
unsigned int ascii:1;
```
rose schooner
warm breach
#

ah yeah

#

it still gets unset if it's not ascii

#

!e

from einspect.structs import PyUnicodeObject

class Foo(str):
    ...

f = Foo("🤔🤔🤔")
v = PyUnicodeObject.from_object(f)
print(v.ascii)
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

0
warm breach
#

so I guess it just doesn't matter for determining struct type
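
The selection rule from the `unicodeobject.c` excerpt, written out as a tiny sketch. `struct_for` is a name invented here and the return values are just the struct names as strings.

```python
def struct_for(compact: bool, ascii: bool) -> str:
    """Pick the struct layout from the two state bits, checking compact first."""
    if compact:
        return "PyASCIIObject" if ascii else "PyCompactUnicodeObject"
    # Non-compact strings (such as str subclass instances) use the full
    # struct whatever the ascii bit says.
    return "PyUnicodeObject"


for bits in [(True, True), (True, False), (False, True), (False, False)]:
    print(bits, struct_for(*bits))
```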

rose schooner
warm breach
warm breach
warm breach
rose schooner
#

instead of doing this
```pycon
>>> f=Foo("\U0010ffffabc\uffff")
>>> view(f)
<stdin>:1: DeprecationWarning: Using einspect.view on objects without a concrete View subclass will be deprecated. Use einspect.views.AnyView instead.
View(<PyObject at 0x1e03e1b04e0>)
```

warm breach
#

I suppose it could

#

I'm just not sure what happens to that stuff after the builtin

#

do non-slots custom classes have a fixed struct as well?

rose schooner
#

so as long as the custom class does not override the builtin parent class's .__new__()/.__init__() then it's probably fine

warm breach
#

hm

warm breach
#

!e since python already disallows it

class Foo(int):
    __slots__ = ("_some_attr",)
fallen slateBOT
#

@warm breach :x: Your 3.11 eval job has completed with return code 1.

001 | Traceback (most recent call last):
002 |   File "<string>", line 1, in <module>
003 | TypeError: nonempty __slots__ not supported for subtype of 'int'
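
A sketch that probes which builtin bases accept non-empty `__slots__`. The assumption, based on the error above, is that CPython rejects it for variable-size bases such as `int` while fixed-size bases such as `float` are fine. `can_define` is a helper invented for this example.

```python
def can_define(base: type, slots: tuple) -> bool:
    """Try to create a subclass of base with the given __slots__."""
    try:
        type("Sub", (base,), {"__slots__": slots})
    except TypeError:
        return False
    return True


results = {
    "int, non-empty": can_define(int, ("x",)),
    "int, empty": can_define(int, ()),
    "float, non-empty": can_define(float, ("x",)),
}
print(results)
```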
warm breach
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 28
002 | 28
warm breach
#

how are these both 28 though?

#

doesn't Foo need a __dict__ at least?

#

where does it store _some_attr

rose schooner
#

but it's in like negative offsets from the actual object

#

or actually it's in the type

warm breach
#

the instance still needs its own dict

rose schooner
#

oh wait actually yeah

warm breach
#

!e

class Foo(int):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self._some_attr = 1

f = Foo(123)
print(id(f))
print(id(f.__dict__))
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 140502620308352
002 | 140502622763328
warm breach
#

this is showing the actual address of the dict and not the pointer I guess

warm breach
#

!e

from einspect.types import ptr
from einspect.structs import PyDictObject

class Foo(int):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self._some_attr = 1

f = Foo(123)  # sizeof = 28
end = id(f) + f.__sizeof__()
print(ptr[PyDictObject].from_address(end+4).contents.into_object())

f = Foo(2 ** 50)  # sizeof = 32
end = id(f) + f.__sizeof__()
print(ptr[PyDictObject].from_address(end).contents.into_object())
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | {'_some_attr': 1}
002 | {'_some_attr': 1}
warm breach
#

apparently it's right after the original struct, + alignment

#

not sure where this is documented though

rose schooner
#

!e ```py
from ctypes import py_object

ALIGNMENT = tuple.__itemsize__ * 2

def pad_int(x):
    return -x//ALIGNMENT * -ALIGNMENT

class Foo(int):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self._some_attr = 1

f = Foo(123)
print(py_object.from_address(id(f) + pad_int(f.__sizeof__())).value)
```

fallen slateBOT
#

@rose schooner :white_check_mark: Your 3.11 eval job has completed with return code 0.

{'_some_attr': 1}
rose schooner
#

ok

near ocean
#

what does this do?

pliant tusk
warm breach
#

hmm

warm breach
#

!e

from einspect.structs import PyDictObject
from einspect.types import ptr

class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50

print(Foo.__dictoffset__)
f = Foo("abc")
print(f.__sizeof__())
dict_ptr = ptr[PyDictObject].from_address(id(f)+f.__sizeof__()+112)
print(dict_ptr.contents.into_object())
fallen slateBOT
#

@warm breach :x: Your 3.11 eval job has completed with return code 1.

001 | -112
002 | 84
003 | Traceback (most recent call last):
004 |   File "<string>", line 13, in <module>
005 | ValueError: NULL pointer access
warm breach
#

I'm trying to use __dictoffset__ but it doesn't work here somehow

warm breach
pliant tusk
#

Foo.__itemsize__ * abs(view(f).size) afaik

warm breach
#

what is itemsize even

pliant tusk
#

it's an attribute that all classes have; it is only non-zero on PyVarObjects

#

tuple.__itemsize__ is sizeof(c_void_p) for example

warm breach
#

though I think for some reason str isn't even a PyVarObject 🥴

pliant tusk
#

and you need abs because some types fiddle with the sign of ob_size

warm breach
pliant tusk
#

(but note that it is created lazily as of 3.11 @warm breach )

warm breach
#

so it can be null?

pliant tusk
#

lemme check

warm breach
#

that's fine but I'm just trying to get the pointer which doesn't seem to work

pliant tusk
warm breach
#

!e

from einspect.structs import PyDictObject
from einspect.types import ptr

class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50

f = Foo("abc")
print(f.__dict__)

p = ptr[PyDictObject].from_address(id(f) + Foo.__basicsize__)
print(p.contents.into_object())
fallen slateBOT
#

@warm breach :x: Your 3.11 eval job has completed with return code 1.

001 | {'x': 50}
002 | Traceback (most recent call last):
003 |   File "<string>", line 13, in <module>
004 | ValueError: NULL pointer access
warm breach
#

so firstly finding the additional dict pointer subtypes have

warm breach
#

!e is this supposed to be 0

class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50

print(Foo.__itemsize__)
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

0
pliant tusk
#

!e ```py
from einspect.structs import PyDictObject
from einspect.types import ptr

class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50

f = Foo("abc")
print(f.__dict__)

p = ptr[PyDictObject].from_address(id(f) + Foo.__dictoffset__)
print(p.contents.into_object())
```

fallen slateBOT
#

@pliant tusk :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | {'x': 50}
002 | 0
warm breach
#

o.O

pliant tusk
warm breach
#

I read somewhere negative dictoffset meant from end of struct somehow fml

#

been lied to 😔

pliant tusk
#

i don't know why it is 0 tho

warm breach
fallen slateBOT
#

@warm breach :x: Your 3.11 eval job has completed with return code 139 (SIGSEGV).

001 | {'x': 50}
002 | -8
warm breach
#

also this doesn't work for int subclasses somehow 😔

#

also how is that offset -8 anyways, isn't that where the gc header is??

pliant tusk
#

there is more stuff in the header now

#

it changed in 3.11

warm breach
#

!e

from einspect.structs import PyDictObject
from einspect.types import ptr

class Foo(list):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50

f = Foo((1, 2))
print(f.__dict__)
print(Foo.__dictoffset__)

p = ptr[PyDictObject].from_address(id(f) + Foo.__dictoffset__)
print(p.contents.into_object())
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | {'x': 50}
002 | -72
003 | type_
warm breach
#

list also fails

#

tfw __dictoffset__ isn't actually dict offset 😩

#

!e

from einspect.structs import PyDictObject
from einspect.types import ptr

class Foo(list):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50

f = Foo((1, 2))
print(f.__dict__)
print(Foo.__dictoffset__)

p = ptr[PyDictObject].from_address(id(f) + f.__sizeof__() + Foo.__dictoffset__)
print(p.contents.into_object())
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | {'x': 50}
002 | -72
003 | {'x': 50}
warm breach
#

so list apparently works if you add offset after sizeof

#

but this isn't the case with int where you should ignore dictoffset and just find the first aligned byte after the struct

pliant tusk
warm breach
warm breach
#

you did id(f) + Foo.__dictoffset__

pliant tusk
#

i think that one just got lucky and found a pointer to 0

warm breach
# pliant tusk i think that one just got lucky and found a pointer to `0`

!e aha I got it

from einspect.structs import PyDictObject
from einspect.types import ptr

def align(size: int, alignment: int) -> int:
    return (size + alignment - 1) & ~(alignment - 1)

class Foo(str):
    def __init__(self, *args, **kwargs):
        super().__init__()
        self.x = 50

f = Foo("abc")

addr = id(f) + align(f.__sizeof__(), 8) + Foo.__dictoffset__
p = ptr[PyDictObject].from_address(addr)
print(p.contents.into_object())
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

{'x': 50}
warm breach
#

seems to be sizeof aligned to 8 bytes then dictoffset for negative
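
A minimal sketch of the arithmetic found above, kept free of einspect so it runs anywhere. `align_up` and `dict_slot_offset` are hypothetical helper names, and the rule itself (align `__sizeof__()` to 8 bytes, then add a negative `__dictoffset__`) is only what was observed on this 3.11 build, not a documented guarantee.

```python
def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


def dict_slot_offset(sizeof: int, dictoffset: int, alignment: int = 8) -> int:
    """Offset of the instance-dict pointer relative to id(obj).

    A positive dictoffset counts from the start of the object; a negative
    one counts back from the aligned end of the struct (as observed above).
    """
    if dictoffset == 0:
        raise ValueError("type has no instance dict")
    if dictoffset > 0:
        return dictoffset
    return align_up(sizeof, alignment) + dictoffset


# Values printed earlier for Foo(str): __sizeof__() == 84, __dictoffset__ == -112
print(align_up(84, 8))
print(dict_slot_offset(84, -112))
```

The pointer itself would then be read from `id(obj) + dict_slot_offset(...)`, which is the part that needs ctypes or einspect and can crash if the assumption does not hold.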

pliant tusk
#

i would calculate sizeof manually using basicsize itemsize and ob_size, because some __sizeof__s are not exactly correct

#

@warm breach but you'll need a way to check if a given type has the ob_size field

warm breach
#

isn't calculating struct size and getting __dictoffset__ enough

pliant tusk
#

__sizeof__ isn't always a bare struct size, sometimes it includes nested fields
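
A rough sketch of the "calculate it manually" suggestion, assuming the size is `basicsize + itemsize * abs(ob_size)` rounded up to pointer alignment, in the spirit of CPython's `_PyObject_VAR_SIZE` macro. `var_object_size` is a hypothetical name, and using `len(t)` as a stand-in for `ob_size` only holds for types like tuple.

```python
import ctypes

POINTER_SIZE = ctypes.sizeof(ctypes.c_void_p)


def var_object_size(basicsize: int, itemsize: int, nitems: int,
                    alignment: int = POINTER_SIZE) -> int:
    """Struct size for a variable-sized object, rounded up to alignment.

    abs() is applied because some types store a sign in ob_size.
    """
    raw = basicsize + itemsize * abs(nitems)
    return (raw + alignment - 1) & ~(alignment - 1)


t = (1, 2, 3)
print(var_object_size(tuple.__basicsize__, tuple.__itemsize__, len(t)))
print(t.__sizeof__())
```

For a plain tuple the two printed numbers should agree on CPython; the point of computing it by hand is the types where `__sizeof__` also counts separately allocated buffers.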

warm breach
# pliant tusk `__sizeof__` isn't always a bare struct size, sometimes it includes nested field...

how this look firHmm

def instance_dict(self) -> ptr[PyObject[dict, Any, Any]] | None:
    """Return the instance dict of the PyObject."""
    # Get the tp_dictoffset of the type
    offset = self.ob_type.contents.tp_dictoffset
    # If 0, the type does not have a dict
    if offset == 0:
        return None
    # For > 0, start from the address of the PyObject
    if offset > 0:
        addr = self.address + offset
    # For < 0, start after the struct
    else:
        # align size to pointer size (8)
        size = align_size(self.mem_size, ctypes.sizeof(c_void_p))
        addr = self.address + size + offset
    # Return the pointer
    return POINTER(PyObject).from_address(addr)
pliant tusk
#

afaik that should work

pliant tusk
warm breach
#

seems mostly fine, but there are a few that still seem to have a null pointer despite a non-zero tp_dictoffset, even after the dict has been accessed

#
<_frozen_importlib_external.SourceFileLoader object at 0x103626a40>
{'name': 'einspect.views.view_tuple', 'path': '/Users/ionite/repos/Python/einspect/src/einspect/views/view_tuple.py'}
Traceback (most recent call last):
  File "scratch.py", line 10, in <module>
    pydict = d.contents.into_object()
             ^^^^^^^^^^
ValueError: NULL pointer access
#

some _frozen_importlib_external.SourceFileLoader thing apparently pithink

pliant tusk
#

Odd

copper ice
#

it may be stupid

#

but why is it showing none?

#

please guys thinking for a long time but

#

cant understand

#

i wanna print pink to yellow in reverse order

#

@warm breach pls help

#

srsly cant figure out whats wrong in such an easy code

warm breach
copper ice
#

cant it be in the print function?

warm breach
#
ls.reverse()
print(ls)
copper ice
#

what is wrong in this

copper ice
warm breach
#

so c is None

copper ice
warm breach
#

assign it to something first

#

or just use the reversed() function instead, which isn't in place

#

print(list(reversed(ls[1:2])))
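
A small illustration of the difference being described. The `colors` list is made up for the example, since the original code was not posted.

```python
colors = ["red", "pink", "green", "yellow", "blue"]  # hypothetical data

# list.reverse() works in place and returns None, so assigning its
# result (or printing it directly) gives None.
in_place_result = colors.copy().reverse()
print(in_place_result)

# reversed() leaves the list alone and returns an iterator,
# so wrap it in list() before printing.
subset = colors[1:4]  # "pink" through "yellow"
via_reversed = list(reversed(subset))
print(via_reversed)

# A negative step on a slice gives the same result.
via_slice = subset[::-1]
print(via_slice)
```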

copper ice
#

coming now

#

thanks man

feral cedar
#

or use negative step

copper ice
spark magnet
#

anyone know anything about Cython internals? Cython + coverage.py has long been an uneasy alliance, and I would love to get it smoothed out.

raven ridge
#

I know a bit...

#

Cython generates normal C extension modules, so it's not particularly special. The only weird thing it does that affects profiling/tracing is this stuff: https://cython.readthedocs.io/en/stable/src/tutorial/profiling_tutorial.html - if you ask it to, it generates fake frames for Cython functions and calls the installed tracing or profiling function explicitly when entering and exiting those functions (passing those fake frames along)

#

we eventually gave up on supporting Cython functions built with profiling support in Memray. Those fake frames caused too much trouble... though I'm happy coverage.py doesn't do the same, since I do like to see coverage stats for my Cython code 😆

raven ridge
spark magnet
#

i'm not in a place to look at the code right now, i was hoping to find someone who could take a look over the next few weeks or something.

raven ridge
#

well, that's an interesting one.

spark magnet
#

unfortunately, i am blessed with many interesting ones 🙂

raven ridge
#

what are the values of that dict? The PR title was "Map also empty dictionaries to file" - are the values always dicts? Did this maybe affect other falsy things?

#

is it possible that the performance difference is explained entirely by extra data being (unnecessarily) processed?

spark magnet
raven ridge
#

maybe there's extra work being done now, that used to be skipped

spark magnet
#

tbh, the caching done by cached_mapped_values got changed recently, and I borked it (maxsize=0 is different than maxsize=None!), but maybe there's still something wrong with it.
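
A quick sketch of why that distinction matters, assuming the cache in question is built on `functools.lru_cache` (the message does not say so explicitly). The function names here are made up.

```python
from functools import lru_cache

calls = {"off": 0, "unbounded": 0}


@lru_cache(maxsize=0)      # maxsize=0: nothing is ever stored
def cache_off(x):
    calls["off"] += 1
    return x * 2


@lru_cache(maxsize=None)   # maxsize=None: cache grows without bound
def cache_unbounded(x):
    calls["unbounded"] += 1
    return x * 2


for _ in range(3):
    cache_off(21)
    cache_unbounded(21)

print(calls)
print(cache_off.cache_info())
print(cache_unbounded.cache_info())
```

With `maxsize=0` the wrapped function runs on every call, which would be consistent with the extra work described in the slowdown report.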

#

the "slowdown" issue mentions "We can notice a lot more SQL queries (both selects and inserts attempts with 0 rows) in 6.4.3, that we don't have on 6.4.2."

raven ridge
#

the Cython coverage plugin, as I understand it, is basically only concerned with 1 thing: mapping coverage reports from the generated .c files to the .pyx/.pyi files that the .c file was transpiled from

spark magnet
raven ridge
#

I know a fair bit about Cython, but much less about the Cython coverage plugin in particular - I've only had to dig into it once, and I had a pretty poor understanding at the time... There were some recent fixes to it that are only applied to the current development version, and not to the stable version...

spark magnet
#

i just commented on the original Cython issue. just getting clear reproduction instructions would help.

raven ridge
spark magnet
#

the rabbit hole goes deeper... 😦 Thanks for talking it through with me at least... I have to bounce.

quick snow
#

Have y'all seen this? Seems like a great bunch of ideas. https://discuss.python.org/t/announce-pybi-and-posy/23021/26

thorn oasis
#

'make' is not recognized as an internal or external command,
operable program or batch file.
can anyone help me to solve this error please

raven ridge
unkempt rock
warm breach
raven ridge
#

No, it really doesn't even know what object owns a memory block, or even whether any object owns it. It works at a lower level than that.

#

"all" that it knows is the full call stack at which every block of memory was allocated, and when (and whether) that memory block was deallocated. (OK, and a few other, less important things. The total RSS is tracked over time, and we know the name of each thread, if one was set...)

warm breach
#

was trying to find some way to get that info from python, didn't seem to be a stable c API for it

raven ridge
warm breach
#

currently I'm calculating the struct size aligned to 16, plus a GC header if supported by the type, and the position of the instance dict pointer

#

not sure if that covers everything

feral island
#

the extra stuff for variable-sized objects?

raven ridge
#

why are you calculating this size? What do you do with it?

pulsar ridge
#

✨ I am trying to create and store Google credentials in token.pickle from an auth code 😔 but I'm not able to create token.pickle

warm breach
raven ridge
#

why bother?

#

just because you're copying a number of bytes <= to the size of the structure doesn't mean that things are left in a sane state. Hell, you can't even copy one list into another list. Adding one safety check to a fundamentally unsafe operation doesn't make much sense to me.

warm breach
#

!e

from einspect import view

v = view(2**60)

with v.unsafe():
    v <<= (1, 2, 3)
    
print(2**60)
fallen slateBOT
#

@warm breach :warning: Your 3.11 eval job has completed with return code 139 (SIGSEGV).

[No output]
raven ridge
#

sure, you can segfault if you don't check this. You can also segfault if you do check this. So...

warm breach
#

the malloced array will just stay where it was

raven ridge
#

no, because you screw up the reference counts.

warm breach
#

assuming we call clear() on the list about to get overwritten 🥴

raven ridge
#

and now there's two different objects referring to the same malloc'ed array, and the first one of them to be destroyed will free it, and any later attempt to access it by the second one will be a use-after-free bug.

warm breach
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
002 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
feral island
#

"not sure why that solves the segfault" usually means you have either a memory leak or some other memory safety bug now

warm breach
#

will objects remaining with refcounts still get freed during interpreter shutdown?

raven ridge
#

best effort, yeah.

warm breach
#

shouldn't the double free happen there then pithink

pliant tusk
#

you would probably have memory leaks if this code was being called from an embedded interpreter

raven ridge
#

If you're on Linux, try running that with export MALLOC_CHECK_=3 PYTHONMALLOC=malloc and see if you get a crash.

pliant tusk
warm breach
#

would removing the list object from the GC linked list do anything

raven ridge
#

no clue. But you can get something sort of similar from pymalloc itself, if you run with python -Xdev

warm breach
#

I guess it still gets freed on 0 ref-count? And that just disables cyclic GC?

raven ridge
#

at best it would leak memory. at worst it'd crash.

#

honestly, "corrupts memory in a manner that may eventually lead to a crash" is just about the worst sort of memory bug.

#

both because that's usually a security vulnerability, and because it can be quite annoying to track down if the crash location is far removed from where the memory corruption occurred.

warm breach
#

maybe I should valgrind it

raven ridge
#

!e what's going on here: ```py
from einspect import view

x = [1, 2]

with view(x).unsafe() as v:
    y = [*range(18)]
    v <<= y

print(x)
y[0] = 42
print(x)
```

fallen slateBOT
#

@raven ridge :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
002 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
raven ridge
#

why isn't the first element changed?

warm breach
#

oh uh I make a deep copy of the source object and IncRef that 🥴

#

which I should probably stop doing

raven ridge
#

then I'm guessing you're leaking that copy.

#

and if you stopped doing that, then I'm guessing it would be pretty easy to get this to segv.

warm breach
#

mainly the deepcopy was incref all the members I think

#

since otherwise when y gets dropped and calls clear x might now have members pointing to non existent objects

raven ridge
#

the array isn't a member with a reference count - you must actually be creating a copy of that array, or the first element would have changed to 42

fallen slateBOT
#

src/einspect/views/view_base.py line 263

```py
other = deepcopy(other)
```
raven ridge
#

and the fact that you're leaking that copy is the only reason this doesn't wind up with a double free.

#

you've got two lists that each think they're responsible for freeing that array. The only way that can possibly not result in a double free is if one of them never gets destroyed, and so never tries to free its array.

warm breach
#

I guess a possibility would be to maintain a list of "shared array" list objects and override list type's tp_free to only free the array when it is the last reference pithink

raven ridge
#

you'd wind up needing to do something like that for every type of object.

warm breach
#

eh yeah doesn't seem worth it

raven ridge
#

memcpy'ing into the struct of some arbitrary object is fundamentally unsafe. It can't be made safe, short of special casing how it behaves for each different type of object.

warm breach
#

!e

from einspect import view, unsafe

x = [1, 2]
y = [*range(18)]

with unsafe():
    view(y).move_to(view(x))

print(x)
y[0] = 100
print(x)
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
002 | [100, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
raven ridge
#

making a copy of a Python object with memcpy breaks that object's invariants. On a case-by-case basis, you can fix up the copy to make the invariants hold - but there's no general way to do that.

warm breach
#

so move_to doesn't make a deepcopy unlike move_from

#

wonder why this doesn't double free

raven ridge
#

!e ```py
from einspect import view, unsafe

x = [1, 2]
y = [*range(18)]

with unsafe():
    view(y).move_to(view(x))

print(x)
y.clear()
print(x)
```

fallen slateBOT
#

@raven ridge :x: Your 3.11 eval job has completed with return code 139 (SIGSEGV).

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
warm breach
raven ridge
#

clear freed the array, x is still holding a pointer to it.

#

print(x) tries to access it after it's been freed, which explodes.

warm breach
#

oh clear frees the array? pithink

#

does it allocate a new one when you append

raven ridge
#

yeah
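
One way to watch this from pure Python, without touching the struct: on CPython, `sys.getsizeof` for a list includes the space reserved for the item array, so it shrinks back to the empty-list size after `clear()` and grows again on the next `append`. This is a sketch of CPython behaviour, not a language guarantee.

```python
import sys

x = list(range(18))
full = sys.getsizeof(x)      # header plus room for the item pointers

x.clear()
cleared = sys.getsizeof(x)   # item array released

x.append(0)
regrown = sys.getsizeof(x)   # a fresh array was allocated

print(full, cleared, regrown)
print(cleared == sys.getsizeof([]))
```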

raven ridge
#

ooh man, both of those comments 😄

warm breach
#

huh

#

so a list that gets cleared has NULL ob_item but a list that gets popped to 0 doesn't...?

#

!e

from einspect.structs import PyListObject

x = [1]
x.pop()

ls = PyListObject.from_object(x)
print(ls.ob_item[0].contents.into_object())
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

1
warm breach
#

I guess it just leaves the object pointer pithink

raven ridge
fallen slateBOT
#

Objects/listobject.c lines 1029 to 1031

```c
if (size_after_pop == 0) {
    Py_INCREF(v);
    status = _list_clear(self);
```
fallen slateBOT
#

Objects/listobject.c line 1032

```c
list_pop_impl(PyListObject *self, Py_ssize_t index)
```
pliant tusk
warm breach
#

deling the last item already frees the array in 3.11 it seems

#

!e

from einspect.structs import PyListObject

x = [1]
del x[0]

ls = PyListObject.from_object(x)
print(ls.ob_item[0])
fallen slateBOT
#

@warm breach :x: Your 3.11 eval job has completed with return code 1.

001 | Traceback (most recent call last):
002 |   File "<string>", line 7, in <module>
003 | ValueError: NULL pointer access
pliant tusk
#

!e unrelated, but I still feel stuff like this is unintuitive ```py
class MyList(list):pass

l = MyList()
n = l + MyList([1])
print(type(n))```

fallen slateBOT
#

@pliant tusk :white_check_mark: Your 3.11 eval job has completed with return code 0.

<class 'list'>
warm breach
#

I guess list.__add__ just doesn't know about your subtype at all

#

and just operates on the front PyListObject part

feral island
#

it doesn't (and can't) know how to construct objects of an arbitrary subclass
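
A short sketch of what opting in looks like today: the subclass has to say how to rebuild itself. `MyList` here is illustrative and assumes its constructor accepts a single iterable, which is exactly the knowledge `list.__add__` does not have.

```python
class MyList(list):
    def __add__(self, other):
        # Delegate the concatenation to list, then rewrap in our own type.
        return type(self)(list.__add__(self, other))

    def __radd__(self, other):
        # Called for `plain_list + MyList(...)`.
        return type(self)(list(other) + list(self))


l = MyList([0])
left = l + [1]
right = [1] + l
raw = list.__add__(l, l)   # the base implementation still returns a list

print(type(left).__name__, left)
print(type(right).__name__, right)
print(type(raw).__name__, raw)
```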

raven ridge
#

there was a big change to datetime to fix that for the methods of datetime.datetime

#

to get them to return subclass instances rather than base class instances, I mean

#

though in that case, there's already an inherited classmethod for creating derived class instances.

pliant tusk
raven ridge
#

checking tp_init isn't enough, since there's also tp_new

pliant tusk
#

could check both, but this kind of thing would probably be better as an opt-in instead of an opt-out

#

maybe via a metaclass that sets some type flag

#

@feral island @raven ridge would either of you happen to know in what order tp_del is called in subclasses with multiple bases?

raven ridge
#

can you inherit from two classes with different tp_del?

warm breach
#

!e

class UserTuple(tuple): pass

print(().__sizeof__())
print(UserTuple().__sizeof__())

class UserInt(int): pass

print((0).__sizeof__())
print(UserInt().__sizeof__())
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | 24
002 | 32
003 | 24
004 | 24
warm breach
#

is this a bug

#

why does int subclass sizeof not include the instance dict but tuple does

flat gazelle
#

UserList and UserDict have more predictable behaviour for subclassing

pliant tusk
flat gazelle
#

ah, I see. I wonder why it was even allowed to inherit from dict in the first place

dusk comet
#
```py
>>> type(l + [1])
<class '__main__.MyList'>
>>> type([1] + l)
<class '__main__.MyList'>
```
this is also cool because builtin types do not care about subclasses:
```py
>>> class X(list): ...
...
>>> type(X() + X())
<class 'list'>
```
fallen slateBOT
#

Objects/listobject.c line 622

```c
list_ass_slice(PyListObject *a, Py_ssize_t ilow, Py_ssize_t ihigh, PyObject *v)
```
swift imp
warm breach
#
from einspect import view, unsafe

class Foo:
    pass

class Bar:
    pass

f = Foo()
b = Bar()
b.x = 100

with unsafe():
    view(f) << b

print(f.x)
>> 100
rose schooner
#

so the same dict should make more sense

warm breach
warm breach
fallen slateBOT
#

src/einspect/structs/py_long.py lines 15 to 23

```py
@struct
class PyLongObject(PyVarObject[int, None, None]):
    """
    Defines a PyLongObject Structure.

    https://github.com/python/cpython/blob/3.11/Include/cpython/longintrepr.h#L79-L82
    """

    _ob_digit_0: ctypes.c_uint32 * 0
```
warm breach
#

since if I do PyLongObject(...) to make a struct instance it won't have enough allocation to fit my actual ob_digit array

#

hm I guess I could use _PyObject_GC_NewVar

pliant tusk
warm breach
#

I guess _PyObject_NewVar then

#
PyVarObject *
_PyObject_NewVar(PyTypeObject *tp, Py_ssize_t nitems)
{
    PyVarObject *op;
    const size_t size = _PyObject_VAR_SIZE(tp, nitems);
    op = (PyVarObject *) PyObject_Malloc(size);
    if (op == NULL) {
        return (PyVarObject *)PyErr_NoMemory();
    }
    _PyObject_InitVar(op, tp, nitems);
    return op;
}
pliant tusk
warm breach
#
from einspect.structs import PyTypeObject, PyLongObject

x = PyLongObject(
    ob_refcnt=1,
    ob_type=PyTypeObject.from_object(int).as_ref(),
    ob_size=2,
    ob_digit=[1, 1],
)

print(x.into_object())
#

this seems fine

from einspect.structs import *

obj = PyTypeObject.from_object(int).NewVar(1)
obj = obj.contents.astype(PyLongObject)
obj.ob_digit[0] = 123

print(obj.into_object())  # 123
pliant tusk
warm breach
#

not sure which one seems safer

#

eh though __new__ would be annoying to type hint since I can't use Self

pliant tusk
#

no idea, ctypes does not have a mechanism for variable sized structures

dusk comet
grave jolt
#

!e

class A:
    pass

x = A()
y = A()
x.__dict__ = y.__dict__
x.foo = 42
print(y.foo)
fallen slateBOT
#

@grave jolt :white_check_mark: Your 3.11 eval job has completed with return code 0.

42
grave jolt
#

WTF

#

cursed

warm breach
dull oxide
#

Best modules in python??

fickle ferry
warm breach
#

what does a __dictoffset__ of -1 mean in python 3.12?

median umbra
#

Can anyone help me in this GIT command. I am new to coding.

warm breach
median umbra
#

I downloaded git yesterday bruh.

unkempt rock
#

run git --version

median umbra
unkempt rock
#

hmm odd that seems like a new release

median umbra
#

yup

spark magnet
unkempt rock
#

maybe we could talk in unix channel

spark magnet
median umbra
#

ok

deep nova
#

Question about python's parser

#

When I look at Python's grammar, I see that there are actual function calls embedded in it

#

Is this purely notational? Does the grammar exist simply as a map of the parser, which is itself hand coded?

#

While I'm here: what kind of parser does CPython use? Are there any interesting features or implementation details I should know about?

#

Shortly, I'll be trying to build an equivalent

flat gazelle
#

There is a lovely talk on the new PEG parser. https://youtu.be/QppWTvh7_sI

Parsing Expression Grammars (PEGs) are a relatively new formalism for describing grammars suitable for automatically generating efficient parsers. I've become interested in using a PEG-generated parser as an alternative to CPython's nearly 30 year old "pgen" parser generator. This poses some interesting problems. I've also come up with a neat wa...

deep nova
#

Now

#

PEG grammars are context free grammars with short-circuited alternation, right?

#

And beyond that, PEG parsers are recursive descent packrat parsers designed to support linear time parsing with infinite lookahead, at the cost of more memory consumption?
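
To make the two ideas concrete (ordered choice and memoizing on position), here is a toy hand-written packrat-style evaluator. It is only a sketch of the technique for a made-up two-rule grammar, not how CPython's generated parser is structured.

```python
from functools import lru_cache


def make_parser(text: str):
    """Memoizing recursive-descent evaluator for a tiny PEG.

    Grammar ('/' is ordered choice):
        expr <- term '+' expr / term
        term <- NUMBER / '(' expr ')'
    Each rule returns (value, next_pos) on success or None on failure.
    """

    @lru_cache(maxsize=None)  # memoize per position: the "packrat" part
    def expr(pos):
        left = term(pos)
        if left is None:
            return None
        value, nxt = left
        # First alternative: term '+' expr
        if nxt < len(text) and text[nxt] == "+":
            right = expr(nxt + 1)
            if right is not None:
                return value + right[0], right[1]
        # The second alternative is only used when the first one failed.
        return left

    @lru_cache(maxsize=None)
    def term(pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        if end > pos:
            return int(text[pos:end]), end
        if pos < len(text) and text[pos] == "(":
            inner = expr(pos + 1)
            if inner is not None and inner[1] < len(text) and text[inner[1]] == ")":
                return inner[0], inner[1] + 1
        return None

    return expr


def evaluate(text: str) -> int:
    result = make_parser(text)(0)
    if result is None or result[1] != len(text):
        raise SyntaxError(f"cannot parse {text!r}")
    return result[0]


print(evaluate("1+(2+3)"))
```

The cache is what trades memory for time: each rule is evaluated at most once per input position, however often backtracking revisits it.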

deep nova
#

What an incredible invention

deep nova
#

Is there any reason that leading zeroes are prohibited for integer literals, but are permitted for based integers?

#
00001 # invalid
0x001 # valid
#

Is it just a quirk in the parser?

#

Also

#

In lexing and parsing integers, you might have something that looks a bit like this...

#
integer     ::= dec_integer ( 'E' | 'e' ) ( dec_integer + )
              | dec_integer ( 'I' | 'i' )
              | dec_integer | hex_integer | oct_integer | bin_integer

dec_integer ::= dec_digit +
hex_integer ::= '0' ( 'X' | 'x' ) ( ( dec_digit | hex_glyph ) + )
oct_integer ::= '0' ( 'O' | 'o' ) ( oct_digit + )
bin_integer ::= '0' ( 'B' | 'b' ) ( bin_digit + )
#

This is great and all. It's very concise. But it occurs to me that a more granular approach might be more suitable. For example, let's say I allowed the 0x to be a token, and then the number that followed it to be a token. Later, during parsing, I could stitch them together into a hex integer

#

This would give me access to the number part without having to call str.split() on the literal. In a nutshell, I'm going to need to split the thing anyway, so why not just do it at the first step instead of a later step
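
A sketch of that two-token idea, with the prefix and the digit run emitted separately and stitched together later. The token names and helper functions are invented for illustration, and it uses slicing for brevity rather than a character-by-character scanner.

```python
PREFIXES = {"0x": 16, "0o": 8, "0b": 2}
DIGITS = {16: "0123456789abcdef", 10: "0123456789", 8: "01234567", 2: "01"}


def lex_integer(text: str):
    """Split an integer literal into (base, tokens)."""
    prefix = text[:2].lower()
    if prefix in PREFIXES:
        base, body = PREFIXES[prefix], text[2:]
        tokens = [("PREFIX", text[:2]), ("DIGITS", body)]
    else:
        base, body = 10, text
        tokens = [("DIGITS", body)]
    # Reject empty digit runs and digits that do not belong to the base.
    if not body or any(ch not in DIGITS[base] for ch in body.lower()):
        raise ValueError(f"bad integer literal: {text!r}")
    return base, tokens


def parse_integer(text: str) -> int:
    """The 'stitching' step: the digit token is already isolated."""
    base, tokens = lex_integer(text)
    return int(tokens[-1][1], base)


print(lex_integer("0x1F"))
print(parse_integer("0x1F"), parse_integer("0b101"), parse_integer("42"))
```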

raven ridge
deep nova
#

Safer with respect to legacy code?

raven ridge
#

right. given the choice, it's nicer for users who are trying to upgrade their code if running old code with a new interpreter fails with an obvious error than if it seems to work but silently does something different.
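
For reference, the behaviour in question can be checked with `compile()`. In Python 2 a literal like `017` was octal (15), so Python 3 rejects it outright instead of silently reading it as 17. `literal_is_valid` is just a helper name for this sketch.

```python
def literal_is_valid(source: str) -> bool:
    """Return True if `source` compiles as a Python 3 expression."""
    try:
        compile(source, "<literal>", "eval")
    except SyntaxError:
        return False
    return True


for src in ("00001", "017", "000", "0x001", "0o17"):
    print(src, literal_is_valid(src))
```

Note that a run of zeros on its own (`000`) is still accepted, since it was never ambiguous.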

#

arguably the single reason that Python 2 still exists in some places is because of the failure to do that with string literals - changing string literals from byte strings to unicode strings makes for a porting nightmare, because things fail in strange and surprising ways, potentially in locations far away from where the bug was.

boreal umbra
#

I assume that prepending all string literals in py2 code with a b to port it to py3 is not a viable strategy?

raven ridge
#

not really. You still need to transform them back and forth to unicode strings whenever you're passing them to most any library function, save some of the ones in os I suppose
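
A few lines showing why a blanket `b` prefix does not get you far in Python 3: bytes and str do not mix, and indexing bytes gives integers. The sample data is made up.

```python
data = b"caf\xc3\xa9"          # UTF-8 encoded bytes
text = data.decode("utf-8")    # explicit conversion to str

print(len(data), len(text))    # byte count vs character count
print(data[0], text[0])        # bytes index to ints, str to 1-char strings

try:
    "prefix-" + data           # mixing the two types fails loudly
except TypeError as exc:
    mixing_error = exc
    print("TypeError:", exc)
```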

raven ridge
raven ridge
deep nova
#

I'm pretty sure they'd be equivalent in the long run. Slicing might be a shade faster, but it creates a copy of the string and so is worse on memory. It's a moot point though, because I'm trying to do this with as little reliance on built in string manipulation as possible

#

C style, baby

deep nova
#

Anyway, the long and the short of the answer is that in building a new language, I can allow leading zeroes if I so choose

#

Would there be any reason to disallow it?

#

Other than the fact that it's pointless?

raven ridge
#

compared to slicing, which creates one.

raven ridge
deep nova
#

My dudes

#

I love grammars

#

I haven't had this much fun in ages

raven ridge
deep nova
#

bookmarked

#

So

#

There are many ways to handle numbers

#

(There are many benefits to being a marine biologist)

#

Python doesn't do this, and it does seem a bit excessive, but in planning all of this out I do have to ask myself a few questions

#

For example: binary, octal, and hex integers are just integers. There's no reason they can't be raised using e notation

#

And there's no reason they can't be the power, either

#

And, there's no reason they can't be imaginary

#

Any thoughts?

raven ridge
#

octal literals are rarely used, but when they are it's as a collection of 3-bit fields (like Unix mode masks).
hex literals are either used for a collection of 8-bit values (rgb or rgba, for instance), or for masks used for bitwise arithmetic, or to make special power-of-two values easier to recognize in code (0xFFFF vs 65535, etc).
binary literals are virtually never used, but if someone did use them I'd imagine it must only be for a mask used for bitwise arithmetic.

And for any of those things, raising another number to that power or raising them to some power or making them imaginary just isn't useful. If someone has chosen one of those 3 representations, they did it for a reason - and that reason likely doesn't recommend those operations.
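
Two typical uses, as a quick illustration of the point: an octal file mode and a hex colour being picked apart with masks. The specific values are arbitrary examples.

```python
import stat

# Octal: each digit is one 3-bit permission group (owner, group, other).
mode = 0o100755                      # regular file, rwxr-xr-x
mode_string = stat.filemode(mode)
permission_bits = mode & 0o777
print(mode_string, oct(permission_bits))

# Hex: each pair of digits is one 8-bit channel.
rgb = 0xFF8800
red, green, blue = (rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF
print(red, green, blue)
```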

rose schooner
#

then after a few minor versions the warning would be removed

deep nova
#

What about binary, octal, and hex floats? Same story?

raven ridge
#

hex floats are a thing, actually - there's a representation for floats where you can specify the sign, mantissa, and exponent independently

raven ridge
# rose schooner why didn't they just warn?

What would make a warning better than an error? Warnings are easier to ignore. I guess one advantage is that you can collect multiple warnings in one run of the process, but 🤷‍♂️

deep nova
#

I think I'll allow non-base-10 floats for now. It seems like it might be useful

#

But I'll probably prohibit other-base numbers from being bases or powers in 'e' notation, or from being imaginary

raven ridge
#

check that link, it shows Python's syntax for hex floats. C99 allows it as well.
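
For a concrete feel of the format: in Python it is a string representation handled by `float.fromhex` and `float.hex`, not a source-level literal. A minimal example:

```python
value = float.fromhex("0x1.8p1")   # mantissa 1.5 (hex 1.8) times 2**1
print(value)

as_hex = value.hex()
print(as_hex)

# The representation is exact, so it round-trips values like 0.1 losslessly.
round_trip = float.fromhex((0.1).hex())
print(round_trip == 0.1)
```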

deep nova
#

I'm going to have an imaginary type as well as a complex type

rose schooner
deep nova
#

The former being any number postfixed by an i, and the latter being the sum of an imaginary and a real

deep nova
#

Complex has both a real and an imaginary part. The key difference being that you can write a complex number without needing to include the real part. Just shorthand, really

#

0+1i in traditional form

#

Sorry, let me rephrase: python's syntax requires specifying both the real and imaginary parts of a complex literal. I'm going to allow for omitting the real part, which will default to zero

feral island
deep nova
#

Oh!

#

Well, there we go then

feral island
#

1+1j is syntactically just a binop of two nums
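
This is easy to confirm with the `ast` module, which shows the expression before any constant folding:

```python
import ast

tree = ast.parse("1+1j", mode="eval")
node = tree.body

# The expression is an addition of two separate constants.
print(type(node).__name__, type(node.op).__name__)
print(repr(node.left.value), repr(node.right.value))

# A bare imaginary literal is a single constant of type complex.
lone = ast.parse("1j", mode="eval").body
print(type(lone).__name__, type(lone.value).__name__)
```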

deep nova
#

Now, another question. What about allowing the imaginary postfix on other-base integers?

#

Again, largely pointless, but its a question I have to ask myself

feral island
#

I think it's the same as what @raven ridge said above: the contexts where you'd use non-base 10 numbers aren't contexts where complex numbers are likely to come up

#

then again I'm not sure I've ever used complex numbers in Python other than when writing tests for things that need to support the whole language

raven ridge
#

(people keep using them in AoC for representing essentially two-tuples of integers that you can apply mathematical operations to, like doubling)

rose schooner
#

well the concept says so at least

raven ridge
#

When it comes to language design, rather than allowing anything that pops into your head as a feature that someone might one day want to use, it's much more reasonable to look for things that people commonly want to do, and make sure that there's a succinct way to represent those things.

raven ridge
#

Take a large code base, and see if you can find any place where someone raises a hex literal to a power of 10, for instance.

#

if you find people doing that, maybe the e syntax would be a helpful shortcut for those people.

#

if no one is doing it, then you'd be building a feature that no one needs, which costs you work and maintenance effort and doesn't buy users of your language anything.

rose schooner
deep nova
#

So

#

Tomorrow

#

I'm going to need to come back here and ask about lexing/parsing F strings

#

Another thing I'll need to ask about is how PEG parsers can automatically match parentheses, but others can't

#

I don't think I can brain any more, though

rose schooner
final elk
#

Hi

umbral plume
#

https://docs.python.org/3/reference/lexical_analysis.html#imaginary-literals
So, according to the language specification, there are imaginary number literals, but there isn't any mention of complex number literals. At the same time though, according to dis, no addition actually has to occur when executing, because such literals get optimised during bytecode compilation.
Does this mean that, in a way, complex number literals are actually an implementation-specific thing?

grave jolt
umbral plume
#

I suppose, though complex numbers feel different to me for some reason

#

like i suppose it doesn't matter since it gets optimised when dealing with arithmetic between constants anyway, but it sorta sounds strange to say that python doesn't actually have complex literals, despite them existing for all intents and purposes

#

there's also the slightly odd behaviour of 5+3j.imag not doing what you'd expect it to if someone were under the impression complex literals were a thing (though 5+3j.real coincidentally ends up working :P)

rose schooner
#

it's sort of weird

umbral plume
#

but that to me actually reads as 7j + 3j.real, since seeing two js indicates two separate imaginary numbers, and . has higher precedence than +

#

yet the + in 3+4j doesn't feel like an actual +, it feels more like just part of the syntax, a bit like the - in 1e-10 doesn't actually invoke unary negation

flat gazelle
#

yeah, there aren't any complex literals in the language spec, they are just optimised in cpython
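
A quick way to see this for yourself (a minimal sketch; the exact `ast.dump` text and the contents of `co_consts` vary between versions, and the folding is a CPython detail rather than something the language guarantees):

```python
import ast

# The parser sees `1+2j` as an addition of an int and an imaginary literal.
tree = ast.parse("1+2j", mode="eval")
print(type(tree.body).__name__)   # BinOp
print(ast.dump(tree.body))

# After compilation CPython usually has folded the addition into one constant.
print(compile("1+2j", "<demo>", "eval").co_consts)

# Attribute access binds tighter than +, so these differ:
print(5+3j.imag)     # 8.0  (5 + (3j).imag)
print((5+3j).imag)   # 3.0
print(5+3j.real)     # 5.0  (5 + (3j).real, which happens to match)
```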

warm breach
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (2)
              4 LOAD_CONST               1 (65)
              6 BINARY_OP                8 (**)
             10 LOAD_CONST               2 (5j)
             12 BINARY_OP                0 (+)
             16 RETURN_VALUE
flat gazelle
#

indeed

warm breach
#

!e

import time
from ctypes import cast
from einspect import impl, ptr
from einspect.api import Py
from einspect.structs import *

@impl(list)
@classmethod
def with_capacity(cls, n: int) -> list:
    return PyListObject(
        ob_refcnt=1,
        ob_type=PyTypeObject(list).as_ref(),
        ob_size=0,
        ob_item=cast(Py.Mem.Malloc(n * 8), ptr[ptr[PyObject]]),
        allocated=n,
    ).into_object()

ls = []

s = time.perf_counter()
ls.extend(range(9_000))
print((time.perf_counter() - s) * 1000, "ms")

ls = list.with_capacity(9_000)

s = time.perf_counter()
ls.extend(range(9_000))
print((time.perf_counter() - s) * 1000, "ms")
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

0.22995099425315857 ms
0.14029024168848991 ms
warm breach
#

could we have a list.with_capacity 👀

spark magnet
warm breach
#

if we'd just limit the compared fields of __code__ I think __eq__ is still helpful

#

otherwise it'd be difficult to do that comparison

pliant tusk
flat gazelle
#

I vaguely remember running into importlib hashing a code object, but I didn't spend too much time thinking about why that happens.

feral island
feral island
#

that's probably hard to answer unfortunately, but I suspect in most cases the answer is "no"

warm breach
#

hm... how does the current hash work?

#

does it just go through each field

#

also are there ways the CodeTypes can mutate?

deep nova
#

Quick question about Python's escape characters

#

\N{name}

#

Matches a named character. Are the braces actually there, or are they part of the notation

#

\Ncolon or \N{colon}

feral island
feral island
swift imp
#

Stupid question...nvm

deep nova
#

Question about tokenizing strings

raven ridge
raven ridge
fallen slateBOT
#

@raven ridge :x: Your 3.11 eval job has completed with return code 1.

  File "<string>", line 1
    print("\Ncolon")
                   ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape
raven ridge
#

they're a required part of the syntax.
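
For reference, a small sketch of how the named escape behaves (the braces are part of the escape, and as far as I know the name is looked up in the Unicode character database, the same data `unicodedata` exposes):

```python
import unicodedata

colon = "\N{COLON}"                  # braces are mandatory
print(colon)                         # :
print(unicodedata.lookup("COLON"))   # :
print(unicodedata.name(":"))         # COLON

# In a raw string nothing is interpreted, so all nine characters survive.
print(len(r"\N{COLON}"))             # 9
```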

dusk comet
#
```py
>>> '\n{colon}'
'\n{colon}'
>>> '\N{colon}'
':'
>>> f'\N{colon}'
':'
>>> f'\N{{colon}}'
  File "<stdin>", line 1
    f'\N{{colon}}'
                  ^
SyntaxError: f-string: single '}' is not allowed
```
interesting
raven ridge
#

looks like that's treating {colon as the name of the character to look up, then.

raven ridge
fallen slateBOT
#

@raven ridge :white_check_mark: Your 3.11 eval job has completed with return code 0.

(1+2j)
raven ridge
#

I guess that's why the repr includes parentheses, come to think of it.

bright aspen
#

What function is called when in is used, i.e. if my_number in my_list? Is it possible to have a custom class for my_list that overrides the in check?

umbral plume
#

There's actually 3 different checks that are attempted one after another when you try to use in:

  • It checks if the object has a __contains__ method, and if so, attempts to use it
  • Otherwise, it checks if the object has an __iter__ method, and tries to use that to iterate through and find the item
  • If the above two checks fail, it then checks if the object has a __getitem__ method, and if so, tries indexing it by 0, 1, 2, 3, etc. until either the item is found or an exception occurs
  • If all 3 checks fail, then a TypeError is raised
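
The fallbacks above can be sketched like this (the class names are made up for illustration; in the `__getitem__` case an `IndexError` just ends the search with `False`):

```python
class HasContains:
    def __contains__(self, item):          # tried first
        return item == "magic"

class OnlyIter:
    def __iter__(self):                    # used when there is no __contains__
        return iter([1, 2, 3])

class OnlyGetitem:
    def __getitem__(self, index):          # last resort: indexed with 0, 1, 2, ...
        if index >= 3:
            raise IndexError(index)
        return index * 10

class Nothing:
    pass

print("magic" in HasContains())   # True
print(2 in OnlyIter())            # True
print(20 in OnlyGetitem())        # True
print(99 in OnlyGetitem())        # False

try:
    1 in Nothing()
except TypeError as exc:
    print("TypeError:", exc)
```

So a custom container only needs to define `__contains__` to control what `in` does.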
deep nova
#

I'm at a bit of a crossroads, and I'd love some input. I'm writing my first proper lexical (tokenizer) grammar for my language, and I'm considering my options with respect to strings.

Strings are a rabbit hole. Even regular strings contain escape characters (of numerous varieties) which will need to be interpreted at some point. Raw strings require special handling. F-strings have replacement fields which will demand, at some point, that the string be properly parsed. As such, the problem at its simplest is this: when and how to parse strings?

The way I see it, I have four options:

  1. Lex strings as atomic tokens. This makes the lexer a bit more complicated, but not unmanageably so. Later, when the parser encounters the string token, have it pass the string to a secondary parser designed specifically for them. This is probably the simplest option.

  2. Have the lexer, upon encountering an open quote, create a new instance of itself in a different mode (string-mode). Using a secondary DFA, break the string up into runs of normal characters and escape characters. For f-strings, when the lexer encounters a replacement field, have it create yet another lexer instance in regular-mode to tokenize the expression. Break out upon the appropriate cues.

This option is recursive, which is good. It's complicated and hacky, which is bad.

#
  3. Don't lex strings at all. Instead, just have a " operator which is treated the same as an opening parenthesis. Everything after this operator is parsed as normal - characters become identifiers, symbols become operators, numbers become integers, and so on. Escape characters will be recognized as their own tokens. Whitespace will be explicitly tokenized. Eventually another " will be encountered (or not) but the lexer doesn't blink or care.

Eventually, during parsing, the parser will handle assembly of the string. This is my favorite solution, as it requires only a single pass and no secondary parser or lexer, or mode of operation. But it's still a bit ugly. This approach also places string recognition and validation upon the parser instead of the lexer - this is good because the parser is simply more powerful and can do things the lexer can't.

This option is also nice because replacement fields will be built during the outer string construction. There is also potential for arbitrarily nested strings in replacement fields, which I think is kinda cool.

  4. Don't lex at all. Just use a scannerless parser which treats characters as the tokens, and builds primitives on its own. I shy away from this option because, well, I put a lot of work into my DFA algorithm and I don't want it to go to waste. It might be the most graceful solution, though.

At the end of the day, all I know is that I don't want any secondary parsers, secondary operational modes, or hacky lexer recursion. People keep telling me I'm overthinking it, but I want to do this job right without taking the lazy way out.

feral island
#

I don't know the right answer, but I can say that (1) is how CPython currently works. It works fine for normal strings, but creates limitations for f-strings, so we're likely going to switch to (2) for f-strings.

#

(3) is an interesting idea that I haven't seen before. I feel like it would make your parser quite complex

deep nova
#

Well, first things first

#

Sanity check. Are any of these options completely insane?

feral island
#

I feel like you'll likely run into issues with (3) that are hard to solve. The language fundamentally works quite differently within and outside strings, so getting things like escaping to work properly will be hard

#

I'm assuming the language you're parsing is Python or close to it

deep nova
#

You might call it a python derivative

#

So, pretty close

feral island
#

so Python plus more syntax? You'll want to support \N{} and \U and all that

deep nova
#

Python, now with more syntax!

#

Yeah, lots of different escape options. I'm thinking of throwing out the bytes type (I've never once used it) but I'm not ready to think about that yet

#

I suppose, then, that option 4 is the most direct and graceful. When the parser encounters an open quote it will explicitly enter into a "new environment" in that only those methods appropriate for handling characters inside strings are called. Via the call stack, open and close quotes are handled trivially. I'm almost certainly going PEG, which will make lookahead an option and that too will be useful for detecting things like unescaped backslashes

raven ridge
# deep nova Sanity check. Are any of these options completely insane?

option 4 strikes me as the hardest of those to implement by far. And the option most likely to have a negative effect on code quality, by making the parser do two jobs instead of one. As a thought exercise, imagine the language didn't have raw string literals: what changes would be required to add them?

deep nova
#

In theory

#

But I understand where you're coming from. Lexers are good at lexing. Parsers are okay at lexing, but not great

raven ridge
#

something else to think about is how these decisions will affect your ability to recognize and report syntax errors. Option 3 strikes me as making good error reporting much more difficult.

deep nova
#

Though it's perfectly doable, the amount of code required to make a parser work on that fine of a grain is probably massive

deep nova
#

Surely there must be a solution. I'd like to walk through the logic step by step

raven ridge
#

consider that "10e3" and 10e3 are very different things - lexing one as OPEN_QUOTE NUMBER CHARACTER NUMBER CLOSE_QUOTE might make sense, but lexing the other one as NUMBER CHARACTER NUMBER absolutely doesn't, and would make your life much harder.

deep nova
#

Strings are, in modern programming languages, quite complex. They are composite objects which contain different types of characters and may contain expressions and, potentially, other strings. Thus they are recursive as well.

It follows from this that strings are simply irregular, and cannot be lexed to satisfaction (not without gross hacks anyway)

#

Does this check out?

raven ridge
#

"to satisfaction" is ambiguous - to whose satisfaction?

deep nova
#

To the satisfaction of the definition of a lexer. You can't lex an unlexable object. Trying to do so will require state and procedure that goes beyond what "a lexer" by the strict definition is capable of

raven ridge
#

that depends on what the lexemes are.

#

option 1 - lexing it as just STRING_LITERAL and then figuring out the rest later - is still lexing it

#

that option at least requires figuring out where it starts and stops, and what characters are inside the literal

deep nova
#

If you're willing to accept strings being un-nestable (well, strings using the same quotes) then I suppose you're right. It does severely limit what you're able to put inside the string though

raven ridge
deep nova
#

I'm not willing to ad-hoc a solution

raven ridge
#

it's more annoying, but it's certainly not impossible for the lexer to figure out where the string ends.

deep nova
#

I want an academically substantiated, theory based approach

raven ridge
#

in the same way as it's not impossible for a parser to do all the lexing.

grave jolt
#

what does python do right now? pithink

raven ridge
#

lex the entire string literal as one token
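
You can watch this with the stdlib `tokenize` module, which mirrors CPython's tokenizer closely (a sketch; the `kinds` helper is just for illustration, and on 3.12+ f-strings are split into several tokens while plain strings still come out as one):

```python
import io
import tokenize

SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

def kinds(src):
    """Return (token name, text) pairs for one line of source."""
    readline = io.StringIO(src + "\n").readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in SKIP]

print(kinds('1 +5'))      # [('NUMBER', '1'), ('OP', '+'), ('NUMBER', '5')]
print(kinds('"1 +5"'))    # [('STRING', '"1 +5"')]
print(kinds(r'"a\nb"'))   # still one STRING token; the escape is not interpreted yet
```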

deep nova
grave jolt
#

julia moment

feral cedar
#

i don't get why julia has to be so "purist"

grave jolt
#

eh, it's not a big deal

raven ridge
# deep nova Strings are, in modern programming languages, quite complex. They are composite ...

I'd state this differently. String literals are quite complex, and the contents of string literals follow an entirely different grammar than is used anywhere else in the language. While it's certainly possible to define one grammar and one set of tokens that encompasses both stuff-inside-strings and stuff-outside-strings, it's not clear to me that it makes things more readable, or maintainable, or more performant, etc.

grave jolt
#

I just reference it for the memes, no offence to julia,

deep nova
#

By which I mean, I want the escape characters in strings to be their own tokens, runs of unescaped characters to be their own tokens, and so on

#

Then it stands to reason I'll need two machines (or else one machine operable in two modes)

raven ridge
#

yeah.

deep nova
#

Fair enough. Questions abound though. What about error propagation? Toggling from inside a string to outside (potentially recursively). And, as soon as you start switching between environments you need to start carrying context from one environment to the other and back again (for example, whether the string is raw or not, or whether it uses single or double quotes)

raven ridge
#

yes, you do. That could be one class with a stack of contexts, though.

warm breach
# deep nova

what if we had string division as well 👀

deep nova
#

So, a recursive lexer

raven ridge
deep nova
#

I'm totally stealing that

warm breach
#

!e

from einspect import view

view(str)["__truediv__"] = str.partition

print("hello+world" / "+")
fallen slateBOT
#

@warm breach :white_check_mark: Your 3.11 eval job has completed with return code 0.

('hello', '+', 'world')
raven ridge
#

one major issue with trying to lex both inside and outside of a string using the same lexer is that whitespace is significant in one and not in the other. You want to tokenize 1 +5 as NUMBER("1") OPERATOR("+") NUMBER("5"), but you can't tokenize "1 +5" as BEGIN_QUOTE('"') NUMBER("1") OPERATOR("+") NUMBER("5") END_QUOTE('"') or you've lost syntactically significant whitespace.

deep nova
#

You've stated your piece on this approach I know, but I think it's worth mentioning again

#

In theory, one could have a single grammar which tokenizes everything from both grammars. Whether or not this is possible would depend on the nature of the differences between them. I'm not sure either of us can say for sure whether it would be impossible in a python-like language, though we can agree it would probably be difficult and potentially be quite ugly

grave jolt
#

what if

#

a high-level language without string literals

deep nova
#

Either case, the theory remains the same: kill them all and let the parser sort them out.

raven ridge
#

one could have a single grammar which tokenizes everything from both grammars
You absolutely can. It seems like a lot of extra complexity to me, but it's clearly possible to lex my first example as NUMBER("1") OPERATOR("+") NUMBER("5") and my second as BEGIN_QUOTE('"') LITERAL("1 +5") END_QUOTE('"'). The question is just whether that makes implementing the parser easier or harder. My intuition is that it would be harder, but 🤷‍♂️

rich cradle
#

if i wanted to support exactly what cpython does, i'd just handle it all in the lexer stage, just like it does.

#

you don't need as much "deep" context

grave jolt
#

Why would you want such a nested f-string? it sounds like a nightmare to read for a human

rich cradle
#

i think my impl is more like pep 701

deep nova
rich cradle
deep nova
#

My dude!

grave jolt
#

I personally find f"{d['key']}" easier to parse as a human

feral island
#

!pep 703

grave jolt
#

f"{d[" + key + "]}" will definitely take me a while

fallen slateBOT
#
**PEP 703 - Making the Global Interpreter Lock Optional in CPython**
Status: Draft
Python-Version: 3.12
Created: 09-Jan-2023
Type: Standards Track

feral island
#

sorry wrong one

#

!pep 701

fallen slateBOT
#
**PEP 701 - Syntactic formalization of f-strings**
Status: Draft
Python-Version: 3.12
Created: 15-Nov-2022
Type: Standards Track

feral island
#

^ this is likely to make all @grave jolt's fears come true

grave jolt
#

o no

rich cradle
# deep nova I'm going to need some context. Do you lex fstrings literals as big atomic chunk...

i keep track of how nested in delimiters i am in the lexer. fstrings and their inner expressions are another part of that. there's only one lexing pass. the example i gave above would end up lexed as something like FStart, FExprStart, (just lex like normal tokens now) Ident, OpenBracket, Str, CloseBracket, (we came to a brace and aren't nested in any other delimiter pairs, we're done)FExprEnd, FEnd.
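
A rough sketch of that single-pass, stack-of-delimiters idea. It borrows the token names from the message above; `FMiddle`, `OpenBrace`, `CloseBrace`, `Number` and `Op` are made up here, and escapes, `{{`, format specs and triple quotes are all ignored, so treat it as an illustration rather than anyone's actual implementation:

```python
BRACKETS = {"[": "OpenBracket", "]": "CloseBracket"}

def lex(src):
    """Tokenize a tiny Python-like subset in one pass with a stack of contexts."""
    tokens = []
    stack = []  # open contexts: ("str", quote), ("expr", None) or ("brace", None)
    i, n = 0, len(src)
    while i < n:
        mode, quote = stack[-1] if stack else ("code", None)
        ch = src[i]
        if mode == "str":
            if ch == quote:                    # closing quote of this f-string
                tokens.append(("FEnd", ch))
                stack.pop()
                i += 1
            elif ch == "{":                    # a replacement field begins
                tokens.append(("FExprStart", ch))
                stack.append(("expr", None))
                i += 1
            else:                              # run of literal text
                j = i
                while j < n and src[j] not in (quote, "{"):
                    j += 1
                tokens.append(("FMiddle", src[i:j]))
                i = j
        elif ch == "f" and src[i + 1:i + 2] in ('"', "'"):
            tokens.append(("FStart", src[i:i + 2]))
            stack.append(("str", src[i + 1]))
            i += 2
        elif ch in ('"', "'"):                 # plain string stays one token
            j = src.find(ch, i + 1)
            if j < 0:
                raise SyntaxError("unterminated string literal")
            tokens.append(("Str", src[i:j + 1]))
            i = j + 1
        elif ch == "{":                        # ordinary brace, e.g. a dict display
            tokens.append(("OpenBrace", ch))
            stack.append(("brace", None))
            i += 1
        elif ch == "}":
            if mode == "expr":
                kind = "FExprEnd"              # not nested in any other pair: field ends
            elif mode == "brace":
                kind = "CloseBrace"
            else:
                raise SyntaxError("unmatched '}'")
            tokens.append((kind, ch))
            stack.pop()
            i += 1
        elif ch.isspace():                     # whitespace only matters inside strings
            i += 1
        elif ch.isalnum() or ch == "_":
            j = i
            while j < n and (src[j].isalnum() or src[j] == "_"):
                j += 1
            tokens.append(("Number" if ch.isdigit() else "Ident", src[i:j]))
            i = j
        else:
            tokens.append((BRACKETS.get(ch, "Op"), ch))
            i += 1
    if stack:
        raise SyntaxError("unterminated f-string or brace")
    return tokens

print([kind for kind, _ in lex('f"{d["key"]}"')])
```

The stack is the only state carried between characters, which is what lets the same quote character be reused inside a replacement field.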

raven ridge
# grave jolt Why would you want such a nested f-string? it sounds like a nightmare to read fo...

referential transparency for one. The rule that you can refactor
```py
greeting = "Hello, y'all"
print(greeting)
```
into
```py
print("Hello, y'all")
```
without changing the meaning makes it easier to reason about both the behavior of function calls and the meaning of variables. Currently, you can't do the same thing with f-strings, though - you can't refactor
```py
greeting = "Hello, y'all"
print(f'greeting={greeting}')
```
into
```py
print(f'greeting={"Hello, y'all"}')
```

#

people who maintain code generators also said that the arbitrary restriction on not being able to reuse quotes within different levels of an f-string makes it much harder to generate code, because you need to carry extra context down the stack

grave jolt
#

I don't have a formal rebuttal but that sounds a bit contrived pithink

#

that sounds like a good argument ^

raven ridge
#

https://discuss.python.org/t/pep-701-syntactic-formalization-of-f-strings/22046 had a bunch of arguments, both for and against, if you want to see some of the discussion.

deep nova
#

Func fact - I actually look a little bit like him

#

Same forehead

raven ridge
grave jolt
#

Referential transparency is kind of lacking in python tbh, in some places

#

like the closure gotcha

lone sun
#

From a purist theoretical perspective the answer is (4) because there's really no such thing as lexing; there are just sequences of characters and grammar production rules and parse trees. The whole reason for introducing lexing as a separate step is for convenience. So in that sense, the very fact that you've introduced a lexer is some kind of flaw.

deep nova
#

So, I appreciate the sentiment, but I'm not sure I'm 100% on the implementation they are proposing

deep nova
#

They say they're adding new token types as well as new protocols into the grammar. Cool. They don't mention anything about how the non-replacement-field parts of the string are being lexed

grave jolt
#

hmm, maybe lexing and parsing are just parsing, but at different levels of abstraction

#

or rather, with different 'elements'

#

(characters vs 'tokens')

deep nova
#

The FSTRING_MIDDLE parts

lone sun
#

We like lexers for good practical reasons. But if you think about them as a convenience, then (1), (2), and (3) are all equally acceptable. It's just a matter of what you think is convenient for the language you want to support.

#

I think that if you want to support nested f-strings, then (2) is probably the best route. While if you don't, then you can't beat the ease of coding (1).

raven ridge
# deep nova The FSTRING_MIDDLE parts

PEPs specify the behavior and contracts of the language, not the implementation. The existence or absence of a lexer in any particular implementation is an implementation detail that the PEP rightly doesn't address.

deep nova
#

Ahh, I see

raven ridge
#

the ability to make good error reports on syntax errors is often the biggest practical difference between different approaches to parsing. It's very easy to say "the character on line 25, character 4 is unacceptable", but different approaches will make it very difficult or impossible for you to explain why that character isn't allowed there.

deep nova
#

Well, if that's the case, then option 4 is by far the best approach

#

It's the option with the fewest question marks, and the least demand for ad-hoc problem solving

#

If it's a question of "switching contexts between two lexical grammars" and propagating those contexts as well as errors correctly, well, that's a whole can of worms

lone sun
#

I think that rather than thinking of it as "switching contexts" you could think of it as "recursing into a grammar rule with a different lexer". To me that makes it feel cleaner.

deep nova
#

As an aside question, shouldn't that sort of reporting be shunted forward to the semantic analyzer?

lone sun
#

Also if you keep an explicit stack of lexers then that probably helps error reporting.

deep nova
#

In my mind, the only things a lexer and a parser should be reporting on are those things that unambiguously fall within their domain

lone sun
#

What is in their domain is not always obvious, though.

raven ridge
#

imagine you've got a file in a Python-like language whose first line is ff"". That's a syntax error. Discovering that error with a 1-character-at-a-time parser requires look-behind. ff is allowed as the first two characters. When you hit the ", you need to look backwards and say that a " preceded by an f is valid (an f-string), but a " preceded by 2 f's is not valid. And then you need to figure out what type of thing the two f's are (a name, I suppose) and then explain to the user that their error was putting a string literal after a name without an operator in between them.

deep nova
#

Honestly... I'm so very torn

lone sun
#

A bottom-up parser does all of this without even blinking. The problem is constructing the grammar for it. That grammar would have to say, "okay that first f could be part of a format string or a variable name or a from or ... oh now I saw a second f, it must be a variable name, wait the " is no good here." If you can build that grammar then reporting the location of an error is easy. (Reporting on the type of error, though, is really hard.)

deep nova
#

Well, hold up

#

If you want to be able to solve this problem...

#
f"this is a {"test"}"
#

You simply can't do this with a lexer - not without attaching a stack and some custom logic

raven ridge
#

who says a lexer can't have a stack?

deep nova
#

grumbles uncomfortably

#

I really don't want an ad-hoc solution. I want something grounded in theory.

lone sun
#

Theory says you can't do it with a DFA. But you can with a context-free grammar (even an LL grammar). So you could just write a lexer that's not a DFA, declare yourself happy, and move on.

deep nova
#

Well - I want to talk about option 3

#

A single grammar for both environments. What really are the differences between the two environments?

#

Outside of a string you have identifiers, numbers, <strings>, keywords, and operators

#

Implicitly, you've also got whitespace, newlines, tabs, and comments

#

Inside a string you've got all of these things as well. You'd need to be able to recognize words which didn't qualify as identifiers or number literals (maybe just default to a catch-all "blob" token). You'd need to recognize escape characters. Once you entered into a replacement field you'd be back in normal territory again

raven ridge
#

they're extremely different. Basically nothing in common.

#

the tokenization for x [ (1+2) * 3 ] and the tokenization for "x [ (1+2) * 3 ]" should almost certainly not have a single token in common.

rich cradle
#

i don't know about theory, but there are years of historical precedent for doing this.

deep nova
#

The literals are all still there. The parser can stitch a string together from these tokens quite easily

rich cradle
#

it can, but why should it?

#

you're just going back to lexerless parsing, really

deep nova
#

XD Yeah, pretty much

raven ridge
#

one of the major things that a lexer is doing is figuring out where one "thing" stops and the next "thing" starts. Making the parser care about whitespace between tokens seems to largely defeat the point of lexing.

deep nova
#

Alright, so I'll concede on that.

rich cradle
#

you can also think of a lexer as something that turns source code into the smallest coherent units. when you're running a string through the rest of the process, you don't really care what content structure the string has (barring escapes). you just care that it's a string, and that you know what it has. there's no need to dissect it. (massive generalization here, but you get the point)

deep nova
#

Here's what I want, and maybe you guys can tell me what I need: I want the individual components of strings to be their own tokens. I do not want to tokenize strings as atomic literals and then run through a second machine. I do want to be able to use the same quotes as bookend the string within the string's replacement fields. I want to move through the input in a single pass (or two passes, if lexing once and then parsing once)

#

Option 1 is out by definition. Option 3 is out for reasons discussed above

rich cradle
#

f-strings in particular, yes? i know my approach works in practice (f-strings as what're ultimately delimiter pairs).

deep nova
#

Option 2 depends on one thing: can recursive/stacked lexers handle strings which use the same quotes as enclose them within the replacement fields? Would an inner-lexer, upon finding said quote, not need to escape immediately, or else perform costly lookahead?

rich cradle
#

option 4 also works. it seems like option 2 is also ruled out by virtue of "duplicating" the lexer. it appears i misunderstood there.

deep nova
#

I'm not opposed to recursive lexing so long as it doesn't require any complicated glue

raven ridge
#

I want the individual components of strings to be their own tokens.
That implies that raw strings need to be lexed differently than regular strings, right off the bat. Because \n in one should be lexed as NEWLINE_ESCAPE or something, but not in the other.
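
The difference in Python itself, as a tiny sketch (same source characters between the quotes, different result depending on the prefix):

```python
regular = "a\nb"    # backslash-n is one newline character
raw = r"a\nb"       # backslash and n stay as two separate characters

print(len(regular))       # 3
print(len(raw))           # 4
print(raw == "a\\nb")     # True
```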

lone sun
deep nova
#

Question

#

The issue with scannerless parsing is that parsers aren't very good at lexing, correct?

#

The issue that I'm facing with f strings relates to an inability to recurse and an inability to "switch contexts" as needed, correct?

#

Could I build a parser that lazy-lexed?

#

I.E. A parser/lexer combo which automatically generated a new token via DFA whenever one is requested by the parser but does not already exist?

#

Something to think about :3

steel chasm
#

Heya - I'm writing a module and it relies on an outstanding PR for core python (asyncio). I'm not really aware of how to go about including core python in a pip install. I don't see this PR being accepted any time soon (was shipped in 2019). I would appreciate your advice on how to proceed. https://github.com/python/cpython/pull/16429 is the PR. Is it best practice to just import the class and overwrite the method with the patch?

GitHub

Allow Stream.readuntil to take an iterable of separators and match any
of them. The earliest match endpoint wins (which ensures that results
are dependent on the chunking) and on ties shortest sepa...

rare lantern
#

So, I noticed recently that functools.partial is incompatible with the inspect module. I have built a workaround.

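from functools import wraps

# a proper Partial implementation that works with paramspec etc
def partial(f, *args, **kwargs):
    # wraps(f) copies f's metadata (__name__, __doc__, __wrapped__, ...) onto the lambda
    return wraps(f)(lambda *a, **kw: f(*(args + a), **{**kwargs, **kw}))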
#

this works because it returns a standard python function with the attributes of the wrapped function applied using wraps. Similar functionality could be implemented in the default partial to allow this. Especially given that wraps and partial live in the same module, this should be a fairly simple fix

rose schooner
#

it's not a function

rare lantern
#

So it breaks the inspect

#

cause obviously it's no longer representing a function

#

the above version applies the same functionality, while preserving the function

rose schooner
#

how does it "break"?

rose schooner
rare lantern
#

you can't use things like paramspec etc etc, anything that inspect has to inspect functions doesn't work for partials, because partial is a class
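
For comparison, a small sketch of what the stdlib does today with a partial object (`send` is a made-up function; as far as I know `inspect.signature` special-cases `functools.partial`, while checks like `inspect.isfunction` do fail because the object is not a plain function):

```python
import functools
import inspect

def send(host, port, timeout=5):
    return (host, port, timeout)

bound = functools.partial(send, "localhost")

print(inspect.signature(bound))      # (port, timeout=5)
print(inspect.isfunction(bound))     # False
print(bound.func is send, bound.args, bound.keywords)
print(bound(8080))                   # ('localhost', 8080, 5)
```

So which tools "break" depends on whether they ask for a signature or insist on a real function object.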

rose schooner
rare lantern
#

so if you say, have a pipeline to use these details, it will break if a partial is ever included

rose schooner
#

the other one is that it's implemented in C

rare lantern
#

but my point is

#

with the wraps decorator

#

the c object is redundant

#

because you can do it functionally, and not break things

#

(3 lines over 100 in the functools module)

rose schooner
#

tested with the same A class above

#

i think the only problem is that it's in C

rare lantern
rose schooner
#

wraps() uses partial() btw

rare lantern
rose schooner
#
def wraps(wrapped,
          assigned = WRAPPER_ASSIGNMENTS,
          updated = WRAPPER_UPDATES):
    """Decorator factory to apply update_wrapper() to a wrapper function

       Returns a decorator that invokes update_wrapper() with the decorated
       function as the wrapper argument and the arguments to wraps() as the
       remaining arguments. Default arguments are as for update_wrapper().
       This is a convenience function to simplify applying partial() to
       update_wrapper().
    """
    return partial(update_wrapper, wrapped=wrapped,
                   assigned=assigned, updated=updated)
#

so um

#

why use wraps() again?

rare lantern
#

it's just passing the stuff on here

rose schooner
#

update_wrapper()?

rare lantern
#

It will still just crash with plain partial

#

because partial itself does not apply the update wrapper functionality

#

so, when inspected

#

does not appear as the function it is a partial of

#

wraps pulls up that data

#

and then the only solution is to try and filter and update these in whatever system you are working with

#

by checking if is a partial

#

and then doing something janky to try and update that

#

it seems update_wrapper also works here

rose schooner
#

it's probably better to just define and use the custom partial() in user code instead of adding it to the stdlib

rare lantern
#

so, it's not like it couldn't be added to the partial

rose schooner
rare lantern
#

because it's used as such in all cases

rose schooner
rare lantern
#

and can be used in place in all places

#

Oh python

rose schooner
#

there's also the useful representation output

rare lantern
#

why are you like this lol

rare lantern
#

but make it compatible with inspect

#

the same way funcs are

#

this way

#

best of both

rose schooner
#

!e and a hidden feature too ```py
from functools import partial
def a(y):
    pass

def b(x, z):
    return x * z + 2

a.func = b
print(partial(a, 5)(2))
```

fallen slateBOT
#

@rose schooner :x: Your 3.11 eval job has completed with return code 1.

Traceback (most recent call last):
  File "<string>", line 9, in <module>
TypeError: a() takes 0 positional arguments but 1 was given
rare lantern
#

it's such an annoying little gotcha

rose schooner
#

ok maybe it doesn't work that way

rare lantern
#
from functools import update_wrapper
partial = lambda f, *a, **k: update_wrapper(lambda *_a, **_k: f(*(a + _a), **{**k, **_k}), f)
partialmethod = lambda f, *a, **k: update_wrapper(lambda self, *_a, **_k: f(self, *(a + _a), **{**k, **_k}), f)

works

#

as a way to define em

rose schooner
rare lantern
#

I think the update wrapper functionality could technically be applied in class

rose schooner
rare lantern
#

My method works with that too lol

#

cause purely functional