#algos-and-data-structs | Python | Page 70

stiff quartz Jul 26, 2025, 8:57 PM

#

that is immediately discarded

regal spoke Jul 26, 2025, 8:57 PM

#

Doesn't matter. Counts still need to be updated

stiff quartz Jul 26, 2025, 8:57 PM

#

Couldn't this be optimized away when we know it'll be gone?

#

Idk how the interpreter would know I guess

regal spoke Jul 26, 2025, 8:58 PM

#

I think pypy can be smarter, because pypy has a special optimization for pure integer lists

#

So I think it completely drops the reference counting in that case

#

But to my knowledge, CPython doesn't have any features like that

#

Btw if you read this code, you can see the reference counts being updated

#

        while (src < src_end) {
            _Py_RefcntAdd(*src, n);
            *dest++ = *src++;
        }

#

Here it adds n to the count of *src

#

then places source in *dest, and increments both pointers

stiff quartz Jul 26, 2025, 9:00 PM

#

why does it

#

*dest doesn't point to *src?

regal spoke Jul 26, 2025, 9:01 PM

#

(n is number of copies)

regal spoke Jul 26, 2025, 9:01 PM

#

stiff quartz `*dest` doesn't point to `*src`?

dest is the newly allocated list used for the output

stiff quartz Jul 26, 2025, 9:01 PM

#

yes

#

But why would it reference *src

regal spoke Jul 26, 2025, 9:02 PM

#

src is the input list

stiff quartz Jul 26, 2025, 9:02 PM

#

It's itw own independent list

#

It should be able to live with src garbage collected

regal spoke Jul 26, 2025, 9:02 PM

#

What we are looking at corresponds to B[:n] = A in my code

stiff quartz Jul 26, 2025, 9:03 PM

#

Ah so dest starts by just pointing at src?

regal spoke Jul 26, 2025, 9:03 PM

#

Nooo

#

src is the original list

stiff quartz Jul 26, 2025, 9:04 PM

#

yes

regal spoke Jul 26, 2025, 9:04 PM

#

dest is the newly allocated list that should be used for output

stiff quartz Jul 26, 2025, 9:04 PM

#

yes

regal spoke Jul 26, 2025, 9:04 PM

#

They are two seperate pointers

#

Completely seperate

regal spoke Jul 26, 2025, 9:05 PM

#

regal spoke ```c while (src < src_end) { _Py_RefcntAdd(*src, n); ...

This while loops start transfering elements from src into dest

stiff quartz Jul 26, 2025, 9:05 PM

#

Ok I understand

#

Yes I understand

regal spoke Jul 26, 2025, 9:05 PM

#

While doing that, it also increments every elements reference count by n (because in the end, each element will appear n times)

stiff quartz Jul 26, 2025, 9:06 PM

#

That's cool

#

and it seems so simple when we understand it

#

Wait, so Python on Linux uses GCC?

#

If that's the case, and our bug is consistent with the MSVC I had here

#

This means, that if there is actually a bug with GCC (the one I had here)

#

and if Python on Linux uses GCC's memcpy

#

Then we would know how to trigger the bug on Linux, we have to use 2^18-1

regal spoke Jul 26, 2025, 9:09 PM

#

stiff quartz Then we would know how to trigger the bug on Linux, we have to use 2^18-1

Doesn't your benchmark indicate 2^18?

stiff quartz Jul 26, 2025, 9:09 PM

#

my bad

#

2^18 instead of 2^18+1 yes

regal spoke Jul 26, 2025, 9:09 PM

#

You could try if ([0]*2**18)*20 is slow

stiff quartz Jul 26, 2025, 9:09 PM

#

yeah I'm trying right now

#

so it does use GCC

#

now, can I reproduce the bug

#

probably not

#

that'd be too nice

#

#

#

I mean

#

Idk, it's beautiful but suspiciously too beautiful and so it might not just be a Windows bug?

#

@regal path could you try to reproduce on Linux?

regal path Jul 26, 2025, 9:21 PM

#

stiff quartz

this one?

stiff quartz Jul 26, 2025, 9:21 PM

#

Yes

#

Both are the same

#

I just changed the order in case there was a warmup thing

regal path Jul 26, 2025, 9:25 PM

#

2**18+1 47.59964680671692
2**18 49.571892976760864

stiff quartz Jul 26, 2025, 9:25 PM

#

Nice

stiff quartz Jul 26, 2025, 9:29 PM

#

regal path ``` 2**18+1 47.59964680671692 2**18 49.571892976760864 ```

Do you have gcc installed?

regal path Jul 26, 2025, 9:29 PM

#

I do

stiff quartz Jul 26, 2025, 9:30 PM

#

Can you run this?

regal path Jul 26, 2025, 9:31 PM

#

COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.742536 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.556608 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.552967 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.555692 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.566795 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.595249 sec
14

stiff quartz Jul 26, 2025, 9:31 PM

#

ah rip

#

what is wrong with my computer

regal path Jul 26, 2025, 9:33 PM

#

clang, for good measure

COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.705290 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.583754 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.583081 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.617412 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.659714 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.558543 sec
14

stiff quartz Jul 26, 2025, 9:36 PM

#

regal path clang, for good measure ``` COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 1...

and MSVC? 😄

#

ah welp you're on Linux my bad

regal spoke Jul 26, 2025, 11:16 PM

#

I posted a suggestion to use this on the issue

#

Idk if they will like it

#

But it should make python's performance far more reliable.

vernal girder Jul 27, 2025, 12:19 AM

#

Is this graph Hamiltonian? Meaning, can you trace a cycle along the edges which passes through every point exactly once?

main barn Jul 27, 2025, 7:32 AM

#

hello

stiff quartz Jul 27, 2025, 8:10 AM

#

regal spoke Idk if they will like it

Idk either, if they have a specific fix here they need to find all usages of memcpy in CPython to implement similar things I guess

regal spoke Jul 27, 2025, 8:16 AM

#

Thing is, copying is one of few cases where memcpy performance is really noticeable

stiff quartz Jul 27, 2025, 8:16 AM

#

It's weird that you can't make the issue happening when you only do [0] * (2^18 + 1)

#

If the issue really lies in memcpy performance

#

That should happen there too

#

ah no, my bad

#

It would only call memcpy with powers of 2 and then with size = 1

stiff quartz Jul 27, 2025, 8:18 AM

#

regal spoke Thing is, copying is one of few cases where memcpy performance is really noticea...

I think the only way they would "accept" a "hacky change" is if you can find an example where it's [0] * something

#

because that is a very common thing to do (compared to ([0] * a) * b

regal spoke Jul 27, 2025, 8:18 AM

#

stiff quartz It's weird that you can't make the issue happening when you only do `[0] * (2^18...

Read the source code. Memcpy is not used in those cases

#

It is only used if the list being copied has size > 1

stiff quartz Jul 27, 2025, 8:19 AM

#

Ah yes

#

I remember

regal spoke Jul 27, 2025, 8:21 AM

#

stiff quartz Idk either, if they have a specific fix here they need to find all usages of `me...

Yes and no. Repeated copies is one of few cases where the performance of memcpy is noticeable (i.e. isn't being overshadowed by anything else)

stiff quartz Jul 27, 2025, 8:21 AM

#

Fair enough

#

I did make an issue to MSVC repo directly (https://github.com/microsoft/STL/issues/5658)

#

since fixing it upstream is the best thing

regal spoke Jul 27, 2025, 8:23 AM

#

My guess is that *2 is badly affected by this bug, but you don't really see it because of the reference counting taking more time

#

Btw you should change your benchmark to just have one src and one dst

stiff quartz Jul 27, 2025, 8:26 AM

#

You do see it with *2 I think

regal spoke Jul 27, 2025, 8:27 AM

#

Just make one pair of src + dst of size 2**18+10 and use it for all benchmarks

#

That way the code is easier to read

stiff quartz Jul 27, 2025, 8:27 AM

#

regal spoke Just make one pair of src + dst of size 2**18+10 and use it for all benchmarks

Could do

regal spoke Jul 27, 2025, 8:28 AM

#

and it will make it more obvious that memcpy is at fault

#

Otherwise one could think that using different arrays is what is causing the performance bug

stiff quartz Jul 27, 2025, 8:30 AM

#

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COUNT1 262144   // 2^18
#define COUNT2 262145   // 2^18 + 1
#define COUNT3 262154   // 2^18 + 10
#define REPEATS 10000

void benchmark_copy(const int *src, int *dst, size_t count, const char *label) {
    size_t size_bytes = count * sizeof(int);

    clock_t start = clock();
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, size_bytes);
    }
    clock_t end = clock();

    double total_sec = (double)(end - start) / CLOCKS_PER_SEC;
    printf("%s: Copied %zu bytes (%zu ints) %d times\n", label, size_bytes, count, REPEATS);
    printf("%s: Total time: %.6f sec\n", label, total_sec);

    // Prevent optimizing away some stuff
    printf("%d \n", dst[14]);
}

int main(void) {
    static int src[COUNT3], dst[COUNT3];

    for (size_t i = 0; i < COUNT3; ++i) src[i] = (int)i;
    
    // Do all benchmarks twice just in case
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src, dst, COUNT3, "COUNT3 (2^18 + 10)");
    benchmark_copy(src, dst, COUNT3, "COUNT3 (2^18 + 10)");

    return 0;
}

#

You meant that?

regal spoke Jul 27, 2025, 8:31 AM

#

stiff quartz You do see it with `*2` I think

You definitely do, but it is more noticeable for *20 because for *20 less time is spent on ref counting

stiff quartz Jul 27, 2025, 8:32 AM

#

regal spoke Otherwise one could think that using different arrays is what is causing the per...

that is what is causing the performance bug

#

Using the same array doesn't work

#

lemon_angrysad

regal spoke Jul 27, 2025, 8:32 AM

#

stiff quartz Using the same array doesn't work

Really

stiff quartz Jul 27, 2025, 8:33 AM

#

stiff quartz ```c #include <stdio.h> #include <stdlib.h> #include <string.h> #include <time.h...

Well I just ran that

#

And I get the same results for all benchmarks

#

Is there something wrong with it? I don't think so

regal spoke Jul 27, 2025, 8:34 AM

#

Make your computer go to sleep thinkies

stiff quartz Jul 27, 2025, 8:34 AM

#

the problem is deeper than that

#

Because I still have the other executable

#

Lemme check

#

I'm going crazy

#

So these are the results

#

I compile both with MSVC

C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCSAMEARRAY.exe testSameArray.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30159 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

testSameArray.c
Microsoft (R) Incremental Linker Version 14.29.30159.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:mainMSVCSAMEARRAY.exe
testSameArray.obj

C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCDIFFARRAYS.exe testDiffArrays.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30159 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

testDiffArrays.c
Microsoft (R) Incremental Linker Version 14.29.30159.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:mainMSVCDIFFARRAYS.exe
testDiffArrays.obj

#

And then, the results are:

C:\Users\lhott\Desktop\temp>mainMSVCSAMEARRAY.exe
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.960000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.951000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.948000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.951000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.950000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.947000 sec
14

C:\Users\lhott\Desktop\temp>mainMSVCDIFFARRAYS.exe
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.216000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.216000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 2.903000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 2.851000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.231000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.221000 sec
14

#

Same array: it's homogeneous BUT SLOW FOR ALL??????

regal spoke Jul 27, 2025, 8:39 AM

#

How lovely

stiff quartz Jul 27, 2025, 8:39 AM

#

🙂

#

Great, I got a warning for trying to share a .c file

#

for you to run with gcc

regal spoke Jul 27, 2025, 8:40 AM

#

This stuff reminds me of https://codeforces.com/blog/entry/126654

Codeforces

Slowdown bug affecting C++ (64 bit) on Codeforces - Codeforces

stiff quartz Jul 27, 2025, 8:41 AM

#

👀

#

isn't codeforces fully running with GCC?

#

https://ideone.com/nsCS2n
https://ideone.com/2jp9Hc
for the source codes if you wanna investigate

#

But the more we do the less it makes sense, so is that a good idea

regal spoke Jul 27, 2025, 8:46 AM

#

stiff quartz And then, the results are: ``` C:\Users\lhott\Desktop\temp>mainMSVCSAMEARRAY.exe...

What if you keep the three but give src2 and dst2 to all of em

stiff quartz Jul 27, 2025, 8:46 AM

#

How are you going to memcpy COUNT3 integers

#

if you give it src2 and dst2

#

—
We should look at the old source code when it did O(n) memcpy

#

And try to hack the old versions of CPython/PyPy

regal spoke Jul 27, 2025, 8:51 AM

#

stiff quartz isn't codeforces fully running with GCC?

https://codeforces.com/blog/entry/126654?#comment-1124500 this guy found the underlying bug. It is caused by a stupidly implemented memory allocator that keeps allocating and deallocating large chunks of memory every time cin is called

Codeforces

Slowdown bug affecting C++ (64 bit) on Codeforces - Codeforces

regal spoke Jul 27, 2025, 8:52 AM

#

stiff quartz How are you going to memcpy COUNT3 integers

Oh I meant give all 3 src3 and dst3

regal spoke Jul 27, 2025, 8:53 AM

#

stiff quartz — We should look at the old source code when it did O(n) memcpy

To my knowledge that isn't "hackable".

#

Why? I've got no clue

stiff quartz Jul 27, 2025, 8:54 AM

#

regal spoke https://codeforces.com/blog/entry/126654?#comment-1124500 this guy found the und...

stiff quartz Jul 27, 2025, 8:55 AM

#

regal spoke Oh I meant give all 3 src3 and dst3

Compiler would just optimise away src1, src2, …

#

And we’d have the same exact code

stiff quartz Jul 27, 2025, 8:55 AM

#

stiff quartz ```c #include <stdio.h> #include <stdlib.h> #include <string.h> #include <time.h...

As that

#

Wouldn’t we?

#

At least that’s what I, from my limited C knowledge, would think happens

regal spoke Jul 27, 2025, 8:57 AM

#

stiff quartz Compiler would just optimise away src1, src2, …

Thats a big assumption...

#

You could also just make some global arrays

#

Adding volatile might work too

stiff quartz Jul 27, 2025, 9:10 AM

#

regal spoke Oh I meant give all 3 src3 and dst3

mmm

#

so I did:

    benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");

    benchmark_copy(src3, dst3, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src3, dst3, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src3, dst3, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src3, dst3, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");

#

and we get:

COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.222000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.255000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 3.900000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 3.884000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.255000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.268000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.344000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.381000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.368000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.312000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.394000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.336000 sec
14

still have the slowdown I had, but everything is homogeneous
but less of a slowdown than earlier

#

worth noting i put my computer to sleep since then ducky_beer

regal spoke Jul 27, 2025, 9:14 AM

#

Try

    benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");

    benchmark_copy(src2, dst2, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src2, dst2, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");

stiff quartz Jul 27, 2025, 9:16 AM

#

You're going to laugh

#

C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCDIFFARRAYSPAJENEGOD.exe testDiffArraysPajenegod.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30159 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

testDiffArraysPajenegod.c
Microsoft (R) Incremental Linker Version 14.29.30159.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:mainMSVCDIFFARRAYSPAJENEGOD.exe
testDiffArraysPajenegod.obj

C:\Users\lhott\Desktop\temp>mainMSVCDIFFARRAYSPAJENEGOD.exe
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.361000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.365000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 7.027000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 4.624000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.258000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.276000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 3.960000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 4.156000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 4.127000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 4.214000 sec
14

regal spoke Jul 27, 2025, 9:17 AM

#

Clearly we have not understood what is causing this bug

stiff quartz Jul 27, 2025, 9:17 AM

#

I can't find anywhere online where you can run C code on an online platform that lets you choose the compiler

#

godbolt lets you choose but you only see the assembly code, you can't run it, it seems

regal spoke Jul 27, 2025, 9:18 AM

#

Windows is very uncommon for servers to run

#

Pretty much codeforces is the only one out there

stiff quartz Jul 27, 2025, 9:18 AM

#

I tried on codeforces

#

No bug

#

Because they use gcc

regal spoke Jul 27, 2025, 9:19 AM

#

You can pick msvc

#

At least you used to be able to

#

Maybe they removed it

stiff quartz Jul 27, 2025, 9:20 AM

#

I couldn’t

#

Or I’m blind

#

One of the two

regal spoke Jul 27, 2025, 9:20 AM

#

Seems they have removed it

#

It used to be possible

stiff quartz Jul 27, 2025, 9:20 AM

#

😮‍💨

regal spoke Jul 27, 2025, 9:22 AM

#

stiff quartz ``` C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCDIFFARRAYSPAJENEGOD.exe testDiffA...

A theory about this. Maybe it is caused by alignment of pointers?

#

Here is something you could test. Allocate a bigger src and dst

#

Then play around with benchmark_copy(src + i, dst + j, COUNT1, "COUNT1 (2^18)"); for different values of i and j

#

It could be that the size given to memcpy isn't the issue

regal spoke Jul 27, 2025, 9:25 AM

#

stiff quartz ``` C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCDIFFARRAYSPAJENEGOD.exe testDiffA...

That is what this implies ^

stiff quartz Jul 27, 2025, 9:44 AM

#

regal spoke Then play around with `benchmark_copy(src + i, dst + j, COUNT1, "COUNT1 (2^18)")...

Agreed

#

Will do when I come back

stiff quartz Jul 27, 2025, 10:55 AM

#

@regal spoke do you have Visual Studio?

regal spoke Jul 27, 2025, 10:55 AM

#

stiff quartz <@475657117509091329> do you have Visual Studio?

If I do, it is old

stiff quartz Jul 27, 2025, 11:03 AM

#

Nevermind then

#

But it comes with msvc

regal spoke Jul 27, 2025, 11:03 AM

#

I have Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x86

#

$ ./slowtest.exe
Benchmark COUNT1 (2^18): 0.621000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 0.622000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.356000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.356000 seconds over 10000 repeats
Benchmark COUNT3 (2^18 + 10): 0.682000 seconds over 10000 repeats
Benchmark COUNT3 (2^18 + 10): 0.684000 seconds over 10000 repeats

#

So I'm able to reproduce the slowdown locally

#

@stiff quartz Try this

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COUNT1 262144   // 2^18
#define COUNT2 262145   // 2^18 + 1
#define COUNT3 262154   // 2^18 + 10
#define REPEATS 10000

void benchmark_copy(const int* src, int* dst, size_t count, const char* label) {
    size_t size_bytes = count * sizeof(int);
    
    clock_t start = clock();
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, size_bytes);
    }
    clock_t end = clock();
    
    double elapsed_secs = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Benchmark %s: %.6f seconds over %d repeats\n", label, elapsed_secs, REPEATS);
}

int main() {
    static int src[COUNT2];
    static int dst[COUNT2];

    for (size_t i = 0; i < COUNT2; ++i) src[i] = i;

    // Do all benchmarks twice just in case
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");

    return 0;
}

stiff quartz Jul 27, 2025, 11:20 AM

#

regal spoke So I'm able to reproduce the slowdown locally

Cool

#

Because a guy couldn’t reproduce

#

On MSVC github

regal spoke Jul 27, 2025, 11:21 AM

#

Maybe we should get latest version

stiff quartz Jul 27, 2025, 11:21 AM

#

Im doing that rn

#

Microsoft has a weird way of doing things

#

With their Visual Studio Installer

#

It’s always so confusing if you’re not familiar with it

#

Which I’m not

regal spoke Jul 27, 2025, 11:24 AM

#

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COUNT 100000
#define COUNT1 262144   // 2^18
#define COUNT2 262145   // 2^18 + 1
#define COUNT3 262154   // 2^18 + 10
#define REPEATS 10000

void benchmark_copy(const int* src, int* dst, size_t count, const char* label) {
    size_t size_bytes = count * sizeof(int);
    
    clock_t start = clock();
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, size_bytes);
    }
    clock_t end = clock();
    
    double elapsed_secs = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Benchmark %s: %.6f seconds over %d repeats\n", label, elapsed_secs, REPEATS);
}

int main() {
    static int src[COUNT2];
    static int dst[COUNT2];
    
    for (int i = 0; i < COUNT2; ++i) src[i] = 0;

    // Do all benchmarks twice just in case
    benchmark_copy(src, dst, COUNT, "COUNT (100k)");
    benchmark_copy(src, dst, COUNT, "COUNT (100k)");
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");

    return 0;
}

Btw the slowdown doesn't seem to have anything to do with the count you give memcpy

#

For this one slowdown hits on all 3

stiff quartz Jul 27, 2025, 11:50 AM

#

Ok

#

I downloaded a few recent versions

#

includin the latest

#

The bug is reproducible on all MSVC versions

#

Which makes sense because Python 3.14 preview is compiled with the latest MSVC version anyway.

stiff quartz Jul 27, 2025, 11:51 AM

#

regal spoke For this one slowdown hits on all 3

if src and dst are of size COUNT2

C:\Users\lhott\Desktop\CP\c\untitled>mainMSPajenegod.exe
Benchmark COUNT (100k): 0.786000 seconds over 10000 repeats
Benchmark COUNT (100k): 0.788000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 2.057000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 2.060000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.055000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.065000 seconds over 10000 repeats

#

if src and dst are of size COUNT2+100

#

C:\Users\lhott\Desktop\CP\c\untitled>mainMSPajenegod.exe
Benchmark COUNT (100k): 0.064000 seconds over 10000 repeats
Benchmark COUNT (100k): 0.067000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 0.162000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 0.162000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 0.163000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 0.168000 seconds over 10000 repeats

regal spoke Jul 27, 2025, 11:54 AM

#

Here is something you could try

#

#include <string.h>

// #define BADSIZE (262145)
#define BADSIZE (262145 - 1)
#define REPEATS 100000

int main() {
    static int src[BADSIZE];
    static int dst[BADSIZE];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, 100000);
    }

    return 0;
}

#

This one should run fast

#

but switch to

#

#define BADSIZE (262145)

#

and it should run sloww

stiff quartz Jul 27, 2025, 11:57 AM

#

so it's an allocator problem

#

I guess

regal spoke Jul 27, 2025, 11:57 AM

#

no

stiff quartz Jul 27, 2025, 11:57 AM

#

like the memory allocated is bad/misaligned?

regal spoke Jul 27, 2025, 11:58 AM

#

I think the values dst and src is what makes memcpy go bananas

stiff quartz Jul 27, 2025, 12:01 PM

#

Can we find other places in CPython source code that call memcpy?

stiff quartz Jul 27, 2025, 12:09 PM

#

regal spoke and it should run sloww

From 134 ms to 1963 ms yes

#

I guess that's a smaller example but it doesn't give us any particular insight right?

#

Someone said that it's the same bug as https://github.com/microsoft/STL/issues/5506

#

Idk if that's the case since they said the bug hits clang on Linux and we could not reproduce on clang on Linux

regal spoke Jul 27, 2025, 12:19 PM

#

stiff quartz Someone said that it's the same bug as https://github.com/microsoft/STL/issues/5...

I dont think it is the same.

#

I dont have intel

stiff quartz Jul 27, 2025, 12:20 PM

#

I don't either

#

and I can't try the person's example, I don't have the benchmark.h

regal spoke Jul 27, 2025, 12:23 PM

#

#include <string.h>
#define BADSIZE ((1 << 15) | 24)
#define REPEATS 100000

int main() {
    static char src[BADSIZE];
    static char dst[BADSIZE];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, 25000);
    }

    return 0;
}

Try this one

#

About the 15, seems any number works

stiff quartz Jul 27, 2025, 12:24 PM

#

you didn't define REPEATS

regal spoke Jul 27, 2025, 12:24 PM

#

stiff quartz you didn't define REPEATS

fixed

stiff quartz Jul 27, 2025, 12:25 PM

#

so that should be a BADSIZE

#

what's a good size

#

if anything is bad around 15

regal spoke Jul 27, 2025, 12:25 PM

#

It seems that whatever triggers it needs BADSIZE between 1 to 24 bytes above a power of two

#

Thats the only rule

woven ferry Jul 27, 2025, 12:25 PM

#

hi

stiff quartz Jul 27, 2025, 12:26 PM

#

regal spoke It seems that whatever triggers it needs BADSIZE between 1 to 24 bytes above a p...

((1 << 15) | 128) = 36 ms

#

((1 << 15) | 24) = 44 ms

regal spoke Jul 27, 2025, 12:26 PM

#

try rerunning the program

#

many times

stiff quartz Jul 27, 2025, 12:27 PM

#

I multiplied REPEATS by 10

regal spoke Jul 27, 2025, 12:27 PM

#

stiff quartz ((1 << 15) | 24) = 44 ms

Also, did you compile exactly what I sent?

stiff quartz Jul 27, 2025, 12:27 PM

#

regal spoke Also, did you compile exactly what I sent?

yes

regal spoke Jul 27, 2025, 12:27 PM

#

note that I have char and not int

stiff quartz Jul 27, 2025, 12:27 PM

#

#include <string.h>
#define BADSIZE ((1 << 15) | 128)
#define REPEATS 1000000

int main() {
    static char src[BADSIZE];
    static char dst[BADSIZE];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, 25000);
    }

    return 0;
}

#

and

#

#include <string.h>
#define BADSIZE ((1 << 15) | 24)
#define REPEATS 1000000

int main() {
    static char src[BADSIZE];
    static char dst[BADSIZE];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, 25000);
    }

    return 0;
}

#

With REPEATS multiplied I have 391-396 ms for | 24

#

and 304-305 ms for | 128

regal spoke Jul 27, 2025, 12:28 PM

#

stiff quartz ```c #include <string.h> #define BADSIZE ((1 << 15) | 128) #define REPEATS 10000...

This runs instant for me

regal spoke Jul 27, 2025, 12:28 PM

#

stiff quartz ```c #include <string.h> #define BADSIZE ((1 << 15) | 24) #define REPEATS 100000...

This one doesn't

stiff quartz Jul 27, 2025, 12:28 PM

#

I mean

#

I use Measure-Command { ./smallerExample.exe }

#

on Powershell

#

the std seems pretty low

#

when I re-run

#

both are pretty much instantaneous

regal spoke Jul 27, 2025, 12:29 PM

#

Run what I sent but lower the 24

#

until it slows down

#

For me anything between 1 and 24 slows it down

stiff quartz Jul 27, 2025, 12:30 PM

#

for 16 I go up to 448-450 ms

#

for (1 << 15) | 1

#

I get 5 seconds

#

lol

regal spoke Jul 27, 2025, 12:31 PM

#

Go up from 1

stiff quartz Jul 27, 2025, 12:31 PM

#

2 gives me 4.9 s

#

3 gives me 4.9 s

#

5 gives me 4.9s

#

10 gives me 446 ms

#

7 gives me 449 s

#

6 gives me 441 ms

#

so it seems that for me anything between 1 and 5 slows it down

#

4 gives me 4.9 s

#

..

regal spoke Jul 27, 2025, 12:33 PM

#

I'm assuming 8 is also slow?

stiff quartz Jul 27, 2025, 12:33 PM

#

Yes.

#

9 isn't.

#

16 isn't.

#

Oh I had tested 16 already. Well it definitely isn't.

#

https://github.com/python/cpython/blob/2bd4ff07001038fada17b6dbcb5ab5aabd27c26e/Objects/dictobject.c#L911C1-L928C44

#

Has a memcpy as well

#

So we can probably do something with dictionaries in Python

regal spoke Jul 27, 2025, 12:46 PM

#

@stiff quartz try this

#include <string.h>

#define BADSIZE ((1 << 15) | 1)
#define REPEATS 100000


int main() { 
    char src[1000000];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + BADSIZE, src, 25000);
    }

    return 0;
}

#

Seems that it might just be that dst - src = slightly larger than a power of 2 is what triggers the bug

regal spoke Jul 27, 2025, 1:26 PM

#

https://github.com/microsoft/STL/issues/5658#issuecomment-3124413440

GitHub

`memcopy` 10x slower for sizes 4*(2^18+1) · Issue #5658 · microso...

Describe the bug Using memcpy is ten times slower for an array of integers of size $2^{18}+1$ when compared to the negative controls $2^{18}$ and $2^{18}+10$. Command-line test case MRE in C #inclu...

stiff quartz Jul 27, 2025, 1:29 PM

#

That’s almost a factor of 20

#

Crazy when you think about it

stiff quartz Jul 27, 2025, 2:29 PM

#

@regal spoke Other parts of Python are affected

regal spoke Jul 27, 2025, 2:30 PM

#

stiff quartz <@475657117509091329> Other parts of Python are affected

how big is the difference between fast and slow?

stiff quartz Jul 27, 2025, 2:30 PM

#

Factor 10-20

#

This is more than what we had with lists

#

If I'm not mistaken:
https://github.com/python/cpython/blob/2bd4ff07001038fada17b6dbcb5ab5aabd27c26e/Objects/unicodeobject.c#L12782
is called

halcyon plankBOT Jul 27, 2025, 2:30 PM

#

Objects/unicodeobject.c line 12782

unicode_repeat(PyObject *str, Py_ssize_t len)```

regal spoke Jul 27, 2025, 2:31 PM

#

Oh it is because it doesnt need to do any ref courning for strings

stiff quartz Jul 27, 2025, 2:31 PM

#

Which calls this
https://github.com/python/cpython/blob/2bd4ff07001038fada17b6dbcb5ab5aabd27c26e/Objects/bytesobject.c#L3713

halcyon plankBOT Jul 27, 2025, 2:31 PM

#

Objects/bytesobject.c line 3713

_PyBytes_Repeat(char* dest, Py_ssize_t len_dest,```

stiff quartz Jul 27, 2025, 2:31 PM

#

And that has a memcpy

#

        while (copied < len_dest) {
            Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
            memcpy(dest + copied, dest, bytes_to_copy);
            copied += bytes_to_copy;
        }

The same log2(n) one

#

!e

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 14
print(time.time() - start_time)

halcyon plankBOT Jul 27, 2025, 2:39 PM

#

stiff quartz !e ```py import time start_time = time.time() for _ in range(500000): ("s" * ...

:white_check_mark: Your 3.13 eval job has completed with return code 0.

1.5913984775543213

stiff quartz Jul 27, 2025, 2:39 PM

#

!e

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 14
print(time.time() - start_time)

halcyon plankBOT Jul 27, 2025, 2:39 PM

#

stiff quartz !e ```py import time start_time = time.time() for _ in range(500000): ("s" * ...

:white_check_mark: Your 3.13 eval job has completed with return code 0.

2.787262201309204

stiff quartz Jul 27, 2025, 2:39 PM

#

@regal spoke I can reproduce the slowdown with Discord bot?!!!

haughty mountain Jul 27, 2025, 2:42 PM

#

!e

import platform
print(platform.libc_ver())

halcyon plankBOT Jul 27, 2025, 2:42 PM

#

haughty mountain !e ``` import platform print(platform.libc_ver()) ```

:white_check_mark: Your 3.13 eval job has completed with return code 0.

('glibc', '2.36')

stiff quartz Jul 27, 2025, 2:44 PM

#

This is a completely unrelated bug then, the list bug you found was Windows specific, that string bug seems to work on Linux

#

On a random online compiler:

#

haughty mountain Jul 27, 2025, 2:45 PM

#

!e

import platform

print(platform.platform())
print(platform.machine())

halcyon plankBOT Jul 27, 2025, 2:46 PM

#

haughty mountain !e ``` import platform print(platform.platform()) print(platform.machine()) ```

:white_check_mark: Your 3.13 eval job has completed with return code 0.

001 | Linux-6.1.0-30-cloud-amd64-x86_64-with-glibc2.36
002 | x86_64

regal spoke Jul 27, 2025, 2:47 PM

#

stiff quartz <@475657117509091329> I can reproduce the slowdown with Discord bot?!!!

Thats weird. If I compile with GCC, then I do not see any slowdowns

#

So it doesn't seem to be entirely hardware related either

stiff quartz Jul 27, 2025, 2:48 PM

#

Honestly at this point I don't know

#

Isn't that ironic

#

!e

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 14
print(time.time() - start_time)

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 14
print(time.time() - start_time)

halcyon plankBOT Jul 27, 2025, 2:51 PM

#

stiff quartz !e ```py import time start_time = time.time() for _ in range(500000): ("s" * ...

:white_check_mark: Your 3.14 pre-release eval job has completed with return code 0.

001 | 2.9183502197265625
002 | 1.5649194717407227

stiff quartz Jul 27, 2025, 2:52 PM

#

!e

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 14
print(time.time() - start_time)

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 14
print(time.time() - start_time)

halcyon plankBOT Jul 27, 2025, 2:52 PM

#

stiff quartz !e ```py import time start_time = time.time() for _ in range(500000): ("s" * ...

:white_check_mark: Your 3.13 eval job has completed with return code 0.

001 | 1.558946132659912
002 | 2.7390034198760986

stiff quartz Jul 27, 2025, 3:57 PM

#

stiff quartz !e ```py import time start_time = time.time() for _ in range(500000): ("s" * ...

I could reproduce on another computer running Python 3.11 on Linux (GCC)

vocal gorge Jul 27, 2025, 4:04 PM

#

wild, I'm getting 0.46 vs 6.38 on linux, Python 3.10.12 GCC 11.3.0. Same for 3.12 GCC, and same for uv-installed 3.11.3 and 3.13.5 (Clang 20.1.4).

#

so neither version nor compiler seem to matter much.

#

...my CPU is an AMD. On a different computer (Windows, Intel) there's no slowdown. As the last message in that thread suggests, it may be CPU-dependent.

stiff quartz Jul 27, 2025, 4:39 PM

#

What does uv-installed mean?

stiff quartz Jul 27, 2025, 4:39 PM

#

vocal gorge ...my CPU is an AMD. On a different computer (Windows, Intel) there's no slowdow...

The string one or the list one? So both are AMD specific bugs?

vocal gorge Jul 27, 2025, 4:41 PM

#

I tested the string one. Presumably they are both caused by malloc

vocal gorge Jul 27, 2025, 4:42 PM

#

stiff quartz What does uv-installed mean?

Obtained by uv python install: https://docs.astral.sh/uv/concepts/python-versions/. This downloads a prebuilt binary.

stiff quartz Jul 27, 2025, 4:43 PM

#

vocal gorge wild, I'm getting 0.46 vs 6.38 on linux, Python 3.10.12 GCC 11.3.0. Same for 3.1...

The difference is wild

stiff quartz Jul 27, 2025, 5:14 PM

#

stiff quartz The string one or the list one? So both are AMD specific bugs?

Someone with Intel managed to reproduce somehow

#

https://tenor.com/view/why-monkey-what-huh-confused-gif-3006908007927048834

Tenor

halcyon plankBOT Jul 27, 2025, 5:38 PM

#

:white_check_mark: Your 3.14 pre-release eval job has completed with return code 0.

001 | Power of 2: 0.1374994833022356
002 | Not power of 2: 3.3911833791062236

regal spoke Jul 27, 2025, 6:42 PM

#

stiff quartz I could reproduce on another computer running Python 3.11 on Linux (GCC)

I think you have to be careful about what is the same issue or not

#

Could be that there are different issues

#

Seems people are mixing memset stuff and memcopy

stiff quartz Jul 27, 2025, 6:55 PM

#

regal spoke I think you have to be careful about what is the same issue or not

https://github.com/python/cpython/issues/137121#issuecomment-3124504591

exactly what I said here but I guess it went ignored

GitHub

Calls to `memset` and `memcpy` are slow for `2^k+1` elements · Iss...

Bug report Bug description: Consider the following two Python programs that make 14 copies of a big list ([0] * 262145) * 14 ([0] * 265665) * 14 On my windows computers, running the latest release ...

stiff quartz Jul 27, 2025, 6:56 PM

#

regal spoke Seems people are mixing memset stuff and memcopy

Yeah I also fell into the trap of a benchmark that was just flawed and seemed to show performance slowdown without the *14

regal spoke Jul 27, 2025, 6:58 PM

#

@stiff quartz Why do you think memcpy issue is not windows specific?

#

Have you been able to make memcpy slow in C without using windows?

stiff quartz Jul 27, 2025, 6:59 PM

#

stiff quartz !e ```py import time start_time = time.time() for _ in range(500000): ("s" * ...

@regal spoke

stiff quartz Jul 27, 2025, 7:00 PM

#

vocal gorge wild, I'm getting 0.46 vs 6.38 on linux, Python 3.10.12 GCC 11.3.0. Same for 3.1...

And here @regal spoke

#

Thèse are the only reasons I have for now why I think memcpy is not windows specific

stiff quartz Jul 27, 2025, 7:00 PM

#

regal spoke Have you been able to make memcpy slow in C without using windows?

I have not

regal spoke Jul 27, 2025, 7:00 PM

#

stiff quartz <@475657117509091329>

Thats not C

stiff quartz Jul 27, 2025, 7:01 PM

#

regal spoke Thats not C

Indeed, I can’t reproduce anything with C on Linux for now related to the string benchmark

#

I probably shouldn’t blame memcpy - I don’t see anything else in CPython source code for strings that could be the cause of this though

regal spoke Jul 27, 2025, 7:03 PM

#

I think confirmation is important

stiff quartz Jul 27, 2025, 7:03 PM

#

Yes, I haven’t been able to / haven’t had the time to dig deeper

regal spoke Jul 27, 2025, 7:03 PM

#

#include <string.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + (1 << 15) + 1, src, 4096);
    }

    return 0;
}

On my computer, this very precisely triggers the bug

stiff quartz Jul 27, 2025, 7:04 PM

#

regal spoke ```c #include <string.h> #define REPEATS 4000000 int main() { char src[100...

Only on MSVC (Windows) + AMD from my understanding; did you try that on GCC?

#

Has anyone run this on Linux?

regal spoke Jul 27, 2025, 7:08 PM

#

stiff quartz Only on MSVC (Windows) + AMD from my understanding; did you try that on GCC?

I've only gotten it to work on MSVC

#

I've tried with gcc on my windows computer, and a friend on their linux computer (with AMD processor). Nothing interesting happens, they both run equally fast

stiff quartz Jul 27, 2025, 7:11 PM

#

Could they reproduce the Python string slowdown on their Linux computer? Not all Linux computers can it seems, maybe only the Linux computers that show the Python string slowdown would show anything with the C memcpy benchmark.

regal spoke Jul 27, 2025, 7:13 PM

#

Thing is, my computer is clearly affect by these slowdown. But I'm not able to make gcc slowdown

stiff quartz Jul 27, 2025, 7:13 PM

#

But your computer uses MSVC for Python

regal spoke Jul 27, 2025, 7:13 PM

#

yes

stiff quartz Jul 27, 2025, 7:13 PM

#

Why would you expect gcc to show anything

regal spoke Jul 27, 2025, 7:13 PM

#

The question is, is the issue with memcopy windows specific or not

stiff quartz Jul 27, 2025, 7:14 PM

#

Ah, I see

regal spoke Jul 27, 2025, 7:14 PM

#

You are claiming it is not

stiff quartz Jul 27, 2025, 7:14 PM

#

My gcc benchmarks on my computer make absolutely no sense, I’ve given up on trying to understand them.

stiff quartz Jul 27, 2025, 7:15 PM

#

regal spoke You are claiming it is not

Well yes that was not phrased well

#

I edited it my bad

regal spoke Jul 27, 2025, 7:18 PM

#

stiff quartz My gcc benchmarks on my computer make absolutely no sense, I’ve given up on tryi...

Let me see if I can make a minimal working example out of that

stiff quartz Jul 27, 2025, 7:22 PM

#

regal spoke Let me see if I can make a minimal working example out of that

Just to confirm the behaviour I see with gcc can’t be reproduced on your machine?

regal spoke Jul 27, 2025, 7:22 PM

#

This one does run slow, but only if I don't have any O-flags

#

With O2 it runs fast

stiff quartz Jul 27, 2025, 7:23 PM

#

Isnt O2 the default

#

I didnt set any flags

regal spoke Jul 27, 2025, 7:23 PM

#

no

stiff quartz Jul 27, 2025, 7:23 PM

#

Argh

regal spoke Jul 27, 2025, 7:23 PM

#

No optimization is on by default

stiff quartz Jul 27, 2025, 7:24 PM

#

At least I learn a lot about C & C++ & reading CPython code

#

ah you're right

#

I'm back to having a normal-behaving GCC with the optimization flags

#

I wonder if the MSVC problem we have disappears with https://learn.microsoft.com/en-us/cpp/build/reference/o-options-optimize-code?view=msvc-170

regal spoke Jul 27, 2025, 7:27 PM

#

stiff quartz I wonder if the MSVC problem we have disappears with https://learn.microsoft.com...

No my example work with /O2

stiff quartz Jul 27, 2025, 7:27 PM

#

#include <string.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + (1 << 15) + 1, src, 4096);
    }

    return 0;
}

This is supposed to be the slow one, correct?

regal spoke Jul 27, 2025, 7:27 PM

#

yes

#

It should be quire apperent if it is slow or not

#

like 0.1s vs 4s

stiff quartz Jul 27, 2025, 7:28 PM

#

it's fast

#

with O2

#

dsfkfiqsjdfiqjds

#

it's fast without

#

what is that

#

AH no, I'm the same than the guy on the C issue

#

I have to compile for x86 to see the slowdown (with or without O2)

#

Are you compiling for x64 or x86

regal spoke Jul 27, 2025, 7:32 PM

#

Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x86

stiff quartz Jul 27, 2025, 7:33 PM

#

Yes, you probably wouldn't see anything either on x64

#

can you try?

stiff quartz Jul 27, 2025, 7:34 PM

#

stiff quartz Yes, you probably wouldn't see anything either on x64

which is annoying for our bug hunt because Python 3.14 is compiled with MSC v.1944 64 bit (AMD64)

regal spoke Jul 27, 2025, 7:45 PM

#

stiff quartz Yes, you probably wouldn't see anything either on x64

#include <string.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + (1 << 18) + 1, src, 8193);
    }

    return 0;
}

Try this

#

This one is slow for me on Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x64

stiff quartz Jul 27, 2025, 7:59 PM

#

Yep

#

Why isn't my initial one working

#

with x64

#

https://github.com/microsoft/STL/issues/5658

#

It's also 2^18+1

#

and i'm clearly moving more than 2*4096 +1

regal spoke Jul 27, 2025, 8:02 PM

#

stiff quartz Why isn't my initial one working

Could be because I have control over the pointers

#

Unlike your test

stiff quartz Jul 27, 2025, 8:02 PM

#

What do you mean sorry

regal spoke Jul 27, 2025, 8:02 PM

#

src + (1 << 18) + 1, src I specially make the pointers be at a bad distance for dst and src

#

You get whatever the compiler decides to give you

stiff quartz Jul 27, 2025, 8:03 PM

#

mmhh

#

I see

regal spoke Jul 27, 2025, 8:03 PM

#

I really think this is the trigger

stiff quartz Jul 27, 2025, 8:03 PM

#

lemme test

#

#include <string.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + (1 << 18) + 1, src, 2*4097 + 16);
    }

    return 0;
}

This is slow

#

16 => 161 is slow too

regal spoke Jul 27, 2025, 8:04 PM

#

the reason I put 2*4096 + 1

#

is that anything smaller seems to run fast

#

you could probably put 100k if you wanted

stiff quartz Jul 27, 2025, 8:05 PM

#

In my initial case, dst and src are two distinct objects

#

the pointers are far from eacher other

#

at least by a distance equal their size

regal spoke Jul 27, 2025, 8:05 PM

#

Here is the thing, I suspect the reason why copying in python is slow

#

Is that the pointers to dst and src is (power of 2) times n

stiff quartz Jul 27, 2025, 8:06 PM

#

I disagree, if we copy list or strings of size 2^n + 1
They are initially at a distance 2^n + 1 then 2^(n+1) + 2

regal spoke Jul 27, 2025, 8:07 PM

#

yes

#

and

#

(1 << 18) + 1 try changing this +1 to +2

#

still slow as heck

stiff quartz Jul 27, 2025, 8:08 PM

#

yes but then my initial case

#

src and dest are far, and like in your example, distance isn't a power of 2

regal spoke Jul 27, 2025, 8:08 PM

#

stiff quartz I disagree, if we copy list or strings of size 2^n + 1 They are initially at a ...

If initial size is 2^i + 1, then every call to memcpy will be some power of 2 + something small

stiff quartz Jul 27, 2025, 8:09 PM

#

do you suspect the something small is the problem?

regal spoke Jul 27, 2025, 8:09 PM

#

Read my comment earlier on stl github

#

I think that if dst - src = +-(power of 2) + something small (thats > 0)

#

then the bug triggers

regal spoke Jul 27, 2025, 8:10 PM

#

regal spoke ```cpp #include <string.h> #define REPEATS 4000000 int main() { char src[1...

Thats why I'm coding it like this

#

If you add the same value x to both dsc and src, then the bug always happens

stiff quartz Jul 27, 2025, 8:10 PM

#

Reading your comment, I thought you wanted dst and src to be of a distance power of 2 + 1 with that 1 being important

regal spoke Jul 27, 2025, 8:10 PM

#

Well to guarantee a trigger, yes.

#

But I've gotten it to work with something as big as +24

stiff quartz Jul 27, 2025, 8:11 PM

#

Ah yes that was our discussion

#

I could only get up to 7-8

regal spoke Jul 27, 2025, 8:11 PM

#

yes

#

Thats why I only said +1

stiff quartz Jul 27, 2025, 8:11 PM

#

I definitely need to add a lot of context

#

to the issue

#

I initially didn't even mention flags

#

and didn't mention the difference between x86 and x64

regal spoke Jul 27, 2025, 8:12 PM

#

regal spoke ```cpp #include <string.h> #define REPEATS 4000000 int main() { char src[1...

I really think this is the key. Could we make this work on gcc too?

stiff quartz Jul 27, 2025, 8:12 PM

#

Ah yes lemme check on gcc

regal spoke Jul 27, 2025, 8:12 PM

#

The problem is that my gcc seems to just ignore everything

stiff quartz Jul 27, 2025, 8:12 PM

#

do you mean it optimizes away the copy?

regal spoke Jul 27, 2025, 8:13 PM

#

yes

#

Everything is gone

#

I need to add something to force it to not do that

stiff quartz Jul 27, 2025, 8:13 PM

#

#

The assembly is just return 0 lol

regal spoke Jul 27, 2025, 8:13 PM

#

yep

stiff quartz Jul 27, 2025, 8:13 PM

#

when you set the flag

regal spoke Jul 27, 2025, 8:14 PM

#

Thats what I'm trying to avoid

stiff quartz Jul 27, 2025, 8:14 PM

#

what if you add printf(%c, src[40]);

regal spoke Jul 27, 2025, 8:16 PM

#

Hmm I just tried with O0, and then it doesnt optimize it all away

#

But I sadly dont seem to trigger the slowdown bug

stiff quartz Jul 27, 2025, 8:16 PM

#

yes my screenshot is from O2

#

What about if we use ints instead of char for your example

regal spoke Jul 27, 2025, 8:18 PM

#

That wont matter

#

Memcpy doesn't know what it is copying

#

It just works off bytes

stiff quartz Jul 27, 2025, 8:19 PM

#

Ah yes that's the whole point i forgot

regal spoke Jul 27, 2025, 8:20 PM

#

#include <string.h>
#include <stdio.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];

    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + (1 << 18) + 1, src, 4*4096 + 1);
    }

    return 0;
}

Should I maybe just post this on stl issue?

stiff quartz Jul 27, 2025, 8:21 PM

#

I was rewriting the initial post

#

By editing it

regal spoke Jul 27, 2025, 8:21 PM

#

ok

stiff quartz Jul 27, 2025, 8:21 PM

#

To add information about x86 & x64

#

and the flags

#

I think I'm going to erase my example and just say that you found better shorter more consistent examples

#

Because they're really are all that lol

regal spoke Jul 27, 2025, 8:21 PM

#

I think fundamentally, flags shouldnt matter since that shouldnt affect the internals of memcpy.

stiff quartz Jul 27, 2025, 8:22 PM

#

Fair enough

regal spoke Jul 27, 2025, 8:22 PM

#

but 32bit vs 64bit could definitely make a big difference

stiff quartz Jul 27, 2025, 8:22 PM

#

That's the super important bit that I completely overviewed

#

and to be fair I didn't even know what I was using to compile

#

until the guy said that

#

I'm alson including the cpp version:

#include <iostream>
#include <cstring>

constexpr int REPEATS = 4000000;
constexpr int SRC_SIZE = 1000000;

int main() {
    char src[SRC_SIZE] = {};

    for (int i = 0; i < REPEATS; ++i) {
        std::memcpy(src + (1 << 18) + 1, src, 2 * 4096 + 1);
    }

    return 0;
}

vocal gorge Jul 27, 2025, 8:26 PM

#

stiff quartz ```c #include <string.h> #define REPEATS 4000000 int main() { char src[100...

The C example reproduces the bug for me too (linux, AMD). 4s with the +1, 0.15s without it. GCC vs Clang have the same behaviour.

stiff quartz Jul 27, 2025, 8:26 PM

#

???

#

That's crazy

vocal gorge Jul 27, 2025, 8:27 PM

#

As in in both compilers, memcpy(src + (1 << 15) + 1, src, 4096); slows down and memcpy(src + (1 << 15) + 0, src, 4096); doesn't.

regal spoke Jul 27, 2025, 8:27 PM

#

eeek

stiff quartz Jul 27, 2025, 8:27 PM

#

Although @vocal gorge is one of the people that could see the slowdowns in Python on Linux

#

So it makes sense @vocal gorge can see the slowdowns in C

vocal gorge Jul 27, 2025, 8:27 PM

#

Ehh, I suspect that was a pyenv thing

#

whereas this is just C, nothing interesting

#

(My testing approach is gcc -Wall -o test1 test1.c && time ./test1, and similarly for test2, and similarly with clang instead of gcc.)

regal spoke Jul 27, 2025, 8:28 PM

#

regal spoke ```cpp #include <string.h> #define REPEATS 4000000 int main() { char src[1...

@vocal gorge Try this one ^. This is the one that we got to trigger on both msvc 32 bit and msvc 64 bit.

vocal gorge Jul 27, 2025, 8:29 PM

#

regal spoke <@266216750876459008> Try this one ^. This is the one that we got to trigger on ...

this takes 8s for me, on both gcc and clang

regal spoke Jul 27, 2025, 8:29 PM

#

And if you remove the +1 after 1<<18?

vocal gorge Jul 27, 2025, 8:30 PM

#

~200ms, on both gcc and clang

regal spoke Jul 27, 2025, 8:30 PM

#

Ok nice

#

That is even the same amount of slowdown as I get

stiff quartz Jul 27, 2025, 8:31 PM

#

Well hopefully I included everything relevant here:
https://github.com/microsoft/STL/issues/5658

regal spoke Jul 27, 2025, 8:32 PM

#

stiff quartz Well hopefully I included everything relevant here: https://github.com/microsoft...

Be even more specific.

Say something like:

On Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x64:

stiff quartz Jul 27, 2025, 8:33 PM

#

well I listed all the versions

#

below

#

I could reproduce the issues on 19.29.30157. It is pretty old but I could reproduce with very similar results on 19.29.30159, 19.38.33135, 19.44.35211, and 19.44.35213 (latest available).

vocal gorge Jul 27, 2025, 8:33 PM

#

My versions are gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Target: x86_64-pc-linux-gnu Thread model: posix and Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144) if it ends up mattering.

#

I suspect it's not particularly a compiler thing if it's this consistent, though

stiff quartz Jul 27, 2025, 8:33 PM

#

vocal gorge My versions are `gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Target: x86_64-pc-lin...

thanks

regal spoke Jul 27, 2025, 8:34 PM

#

You could also add the note about memmove also having the same exact issue, expect for if the intervals intersect. (This is to be expected since memcpy is basically the special case of memmove with non intersection).
From my testing, if they intersect, memmove does not get the slowdown bug.

#

Btw instead of // this also can be int src[...];, just say something like // Allocate a sufficiently large array

#

That is all it is

stiff quartz Jul 27, 2025, 8:40 PM

#

I think so

#

Im afraid this will all come down to

#

It’s AMD specific, a part that is closed source

#

Or it’s OS specific and not really fixable

#

Idk

#

I know nothing about syscalls

#

Let alone CPU hardware stuff

regal spoke Jul 27, 2025, 8:41 PM

#

It sounds more like hardware lvl now

stiff quartz Jul 27, 2025, 8:41 PM

#

Yes agreed

regal spoke Jul 27, 2025, 8:41 PM

#

Unless someone finds that it can be triggered on intel too

stiff quartz Jul 27, 2025, 8:41 PM

#

I have yet to see any of those bugs on intel

regal spoke Jul 27, 2025, 8:41 PM

#

That said, I think it makes a lot of sense to keep this open for a while

#

But idk

#

btw I dont like char src[1000000]; // this also can be int src[...]; - any allocation of a big array works

#

Using int src[] instead of char src[] would not only make a larger array

#

it would also mean that src + (1 << 18) + 1 goes 4 times further

stiff quartz Jul 27, 2025, 8:44 PM

#

Does it?

regal spoke Jul 27, 2025, 8:44 PM

#

yes

stiff quartz Jul 27, 2025, 8:44 PM

#

When you do src +

#

You count in Ute’s

#

Bytes

#

Does it matter than ints are 4 bytes

regal spoke Jul 27, 2025, 8:45 PM

#

I've been taught that
A[i] is same as *(A + i)

stiff quartz Jul 27, 2025, 8:45 PM

#

Huh

regal spoke Jul 27, 2025, 8:45 PM

#

Its a thing

#

https://www.educative.io/answers/why-is-ai-ia-in-c

Educative

Why is a[i] == i[a] in C?

Contributor: Saadullah Khan

#

It means you can even switch the order

#

Isnt C lovely

stiff quartz Jul 27, 2025, 8:46 PM

#

You’re right

regal spoke Jul 27, 2025, 8:46 PM

#

Because of this, I think using an int array is bad

stiff quartz Jul 27, 2025, 8:46 PM

#

I thought pointer arithmetic in c was in terms of bytes

regal spoke Jul 27, 2025, 8:46 PM

#

The count you give memcpy is in bytes

#

But any pointer arithmetic depends on the size of the datatype

#

To keep it simple, just dont use int src[]

stiff quartz Jul 27, 2025, 8:47 PM

#

Yeah I edited

stiff quartz Jul 27, 2025, 8:48 PM

#

regal spoke The count you give memcpy is in bytes

Yeah I assumed pointer arithmetic was in bytes too

#

Good thing I’ve never been asked to code in C or C++ in my life

#

🤡

regal spoke Jul 27, 2025, 8:49 PM

#

You dont even need to say char is nice because it's one byte so we can get to 2^k+1 distance

stiff quartz Jul 27, 2025, 8:49 PM

#

Also if it is a AMD issue I don’t even know where this should be discussed

regal spoke Jul 27, 2025, 8:49 PM

#

stiff quartz Also if it is a AMD issue I don’t even know where this should be discussed

Up to the point we are really sure it is an AMD problem

#

It is nice to keep things open

stiff quartz Jul 27, 2025, 8:49 PM

#

I agree

#

But if we are sure it is

#

I wouldn’t even know what the move would be

#

Im not closing or porting any issue today on the Developer community thing they mentioned ill do it tmro (or you can do it if you want)

stiff quartz Jul 27, 2025, 8:52 PM

#

regal spoke Isnt C lovely

I hate i[a] I know it makes sense that addition is commutative but I hate it I hope I never see someone write code with i[a]

regal spoke Jul 27, 2025, 8:53 PM

#

You should see my usage of (-1)[A]

#

back in the day

#

Lovely pointer arithmetics

stiff quartz Jul 27, 2025, 8:53 PM

#

Actually @regal path who couldn’t reproduce anything

#

I assume you don’t have AMD

regal path Jul 27, 2025, 8:54 PM

#

nope

stiff quartz Jul 27, 2025, 8:54 PM

#

Boom

#

One step closer

regal path Jul 27, 2025, 8:54 PM

#

intel 6th and 8th gen

regal path Jul 27, 2025, 8:54 PM

#

regal spoke You should see my usage of (-1)[A]

but, why?

#

I'm all for pointer witchery but why in gods name would you ever go before start of array with that

#

(<datatype>)a-- should work no?

regal spoke Jul 27, 2025, 8:56 PM

#

@stiff quartz Btw fix two things. Make 2*4096 + 1 be just 8193. I originally wrote 2*4096 + 1 because I was lazy when I was coding. And probably remove the "char is nice because it's one byte so we can get to 2^k+1 distance"

regal spoke Jul 27, 2025, 8:58 PM

#

regal path but, why?

I think the story was that I implemented an algorithm from a paper for some kind of shortest distance algorithm in 3D for voxels. This was many years back so I dont really remember now.

#

In the algorithm you naturally do pointer arithmetics with lots of ++ on pointers everywhere

#

But I realized in some edge cases that I kind of needed to undo some of the ++ to my pointers

#

My quick fix was to index with a negative number

#

In some sense it was super nice. But it also broke every sensible rule for working with pointers

#

I don't know if I still have the code somewhere

#

I just remember that @haughty mountain really didn't like my code

#

Maybe he remember more what it was about

stiff quartz Jul 28, 2025, 6:24 AM

#

regal spoke <@338742178618671106> Btw fix two things. Make `2*4096 + 1` be just 8193. I orig...

Done

stiff quartz Jul 28, 2025, 8:40 AM

#

!e

import timeit
setup_code = '''
ba = bytearray(b"A" * 100000)
insert_bytes = b"B" * (2**14 + 1)
'''
stmt_code = 'ba[:1] = insert_bytes'
timer = timeit.Timer(stmt=stmt_code, setup=setup_code)
print(timer.timeit(number=1500))

import timeit
setup_code = '''
ba = bytearray(b"A" * 100000)
insert_bytes = b"B" * (2**14 + 50)
'''
stmt_code = 'ba[:1] = insert_bytes'
timer = timeit.Timer(stmt=stmt_code, setup=setup_code)
print(timer.timeit(number=1500))

#

@regal spoke that example is beautiful because much more common imo

halcyon plankBOT Jul 28, 2025, 8:45 AM

#

stiff quartz !e ```py import timeit setup_code = ''' ba = bytearray(b"A" * 100000) insert_byt...

:white_check_mark: Your 3.14 pre-release eval job has completed with return code 0.

001 | 3.254694360308349
002 | 2.5861790208145976

regal spoke Jul 28, 2025, 9:06 AM

#

the issue is not with any C compiler (GCC/Clang/MSVC Python versions on AMD can (almost always) reproduce)

#

You could argue that the issue lies with all of them

stiff quartz Jul 28, 2025, 9:06 AM

#

Yes sure

#

I don't know where the responsibility of the compiler ends the responsibility of the hardware begins

stiff quartz Jul 28, 2025, 9:07 AM

#

regal spoke You could argue that the issue lies with all of them

I'm surprised no one grasps that the last example is 10x more important to fix than the others

regal spoke Jul 28, 2025, 9:07 AM

#

https://x.com/disconnect3d_pl/status/1909918427084452076 As disconnected mentions in the github issue, there is already tons of case handeling

Disconnect3d (@disconnect3d_pl)

A call to memcpy() in a single binary that uses glibc may behave in 12 different ways depending on the features of the specific x86-64 CPU you run it on.

Here is a list of those impls in glibc:
https://t.co/jz8ttYUNTw

regal spoke Jul 28, 2025, 10:27 AM

#

I made a report now to developer community https://developercommunity.visualstudio.com/t/memcpy-running-10---40-times-slower-when/10943533

Developer Community

#

Seems it needs to get approved by moderators before it gets shown publicly

#

Should probably also make a report to bugzilla https://sourceware.org/bugzilla/

stiff quartz Jul 28, 2025, 10:36 AM

#

regal spoke Seems it needs to get approved by moderators before it gets shown publicly

Yep I can't see it yet

regal spoke Jul 28, 2025, 10:36 AM

#

stiff quartz Yep I can't see it yet

yep

#

Here is what I wrote

[severity:It bothers me. A fix would be nice]
It was recently found that calling memcpy(dst, src, count) where dst-src is close to a power of 2 causes significant performance degragation (up to 20-40 times). It was originally discovered in Python because Python’s internal list copying algorithm happens to repeatedly trigger this slowdown bug.

Related issues:
https://github.com/python/cpython/issues/137121
https://github.com/microsoft/STL/issues/5658

MRE in C used to trigger the bug

#include <string.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];  // any allocation of a big array works
    
    for (int i = 0; i < REPEATS; ++i) {
        // Call memcpy with dst - src = +-(power of 2) + (something small > 0)
        memcpy(src + (1 << 18) + 1, src, 8193);
    }

    return 0;
}
This takes 4 s on my Windows desktop computer (compiled using Microsoft (R) C/C++ Optimizing Compiler Version 19.44.35213 for x64). But if I change (1 << 18) + 1 to (1 << 18) it takes 0.15 s.

About which CPU models are affected. I have desktop and laptop at home, and both experience the slowdown bug. Desktop CPU: AMD Ryzen™ 9 5950X. Laptop CPU: AMD Ryzen™ 9 5900HS. There has been a single report of someone reproducing this with an intel processor + clang (https://github.com/python/cpython/issues/137121#issuecomment-3126294652). Otherwise the reports have been conclusively with people running AMD processors.

Compilers confirmed to be able to reproduce the slowdown bug

Microsoft (R) C/C++ Optimizing Compiler Version XXX for x64
Microsoft (R) C/C++ Optimizing Compiler Version XXX for x86
Where XXX is 19.29.30157, 19.29.30159, 19.38.33135, 19.44.35211, and 19.44.35213 (latest available)

Linux GCC (gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0)
Linux Clang (Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144)

#

I would like mark the severity higher, but there was no good option

stiff quartz Jul 28, 2025, 10:40 AM

#

I wonder if mentioning that python consequence https://github.com/python/cpython/issues/137121#issuecomment-3126183984 (inserting/deleting any 2^k+1 portion of any Python objects that is basically a C array) would help the "severity" case.

#

You can also mention my CPU if you want: AMD Ryzen 9 7900 12-Core Processor

regal spoke Jul 28, 2025, 10:40 AM

#

Ryzen 9 too huh

stiff quartz Jul 28, 2025, 10:40 AM

#

ducky_concerned

regal spoke Jul 28, 2025, 10:41 AM

#

stiff quartz I wonder if mentioning that python consequence https://github.com/python/cpython...

I think the severity is kind of high, but it is not complete broken lvls

#

Which is the next step in their severity scale

#

In the end, this is just a slowdown issue (as far as we know)

#

Which is not the end of the world

stiff quartz Jul 28, 2025, 10:47 AM

#

Would be funny if you could hack people on codeforces because of it but that's very unlikely

regal spoke Jul 28, 2025, 11:00 AM

#

stiff quartz Would be funny if you could hack people on codeforces because of it but that's v...

I think cf runs intel

#

Or at least they used to

stiff quartz Jul 28, 2025, 11:00 AM

#

Ah yes there's that

regal spoke Jul 28, 2025, 11:20 AM

#

Btw the reason I found this bug was that I was benchmarking sparse tables / disjoint sparse table in python/pypy. The disjoint sparse table starts with A *= len(A).bit_length()

#

That is where the original 22 came from. It's the bit length of the size of the list

#

At first I thought pypy was the culprit, but then I realized cpython was equally slow.

stiff quartz Jul 28, 2025, 11:24 AM

#

regal spoke Btw the reason I found this bug was that I was benchmarking sparse tables / disj...

ohh yes makes sense

regal spoke Jul 28, 2025, 11:42 AM

#

~~Maybe I can find more bugs benchmarking them~~

#

This is how long it takes CPython to create the sparse table. There is this weird bump that I don't know where it comes from.

#

This is the same plot, but for PyPy instead

#

That bump is now a cliff

#

Something very funky is going on between n = 6.7e7 and n = 6.8e7

haughty mountain Jul 28, 2025, 11:45 AM

#

cpu cache effects maybe?

#

or these values might be too large for that

regal spoke Jul 28, 2025, 11:47 AM

#

Something else funky is that n=3e7 is always below the expected curve for PyPy

#

Also, the reason I happened to benchmark with powers of 2 plus 1 is that I used this to generate my benchmark

for i in range(40):
    n = int(2**(i*0.5) + 1)
    benchmark(n)

#

Originally I had just 2**i, but I felt like the size increased too fast

#

So I changed it to int(2**(i*0.5) + 1)

#

Then I saw that it somehow alternated between running fast and slow

stiff quartz Jul 28, 2025, 11:53 AM

#

regal spoke ~~Maybe I can find more bugs benchmarking them~~

The easiest way to find bugs is to hunt down everything using memcopy or memmove like I did
Unless you want to find unrelated performance bugs

regal spoke Jul 28, 2025, 11:54 AM

#

~~I just want to know why my sparse table is slow~~

stiff quartz Jul 28, 2025, 11:54 AM

#

Ah welp

#

Will the underlying array ‘s size ever reach higher than 0.5*10^8 though?

#

Seems that it’s already kinda big

regal spoke Jul 28, 2025, 11:56 AM

#

Ofc, but it is still weird

stiff quartz Jul 28, 2025, 11:56 AM

#

Maybe your ram is full

regal spoke Jul 28, 2025, 11:56 AM

#

No

stiff quartz Jul 28, 2025, 11:57 AM

#

regal spoke Something very funky is going on between n = 6.7e7 and n = 6.8e7

Did you try testing more values there?

#

If you pinpoint the exact value at which the cliff happens

#

Could help

#

You could binary search the location of the cliff based on the time it takes to make your sparse table 😳

regal spoke Jul 28, 2025, 12:12 PM

#

Ye I'll more into it when I have time

#

Btw the reason I opened this is also because of my sparse table benchmark https://github.com/pypy/pypy/issues/5314

GitHub

int.bit_length() is slow for non-big integers · Issue #5314 · pyp...

This is a mix of an issue and a feature request. I have noticed that calling .bit_length() is significantly slower in PyPy than any other basic operation on an integer. For example, this benchmark ...

#

I realized that .bit_length() in pypy is O(64)

#

Which kind of defeats the purpose of a sparse table

#

I think this one algorithm is pretty cool for bit length

table3 = [0, 1, 39, 2, 15, 40, 23, 3, 
          12, 16, 59, 41, 19, 24, 54, 4, 
          -1, 13, 10, 17, 62, 60, 28, 42, 
          30, 20, 51, 25, 44, 55, 47, 5, 
          32, -1, 38, 14, 22, 11, 58, 18, 
          53, 63, 9, 61, 27, 29, 50, 43, 
          46, 31, 37, 21, 57, 52, 8, 26,
          49, 45, 36, 56, 7, 48, 35, 6, 
          34, 33]

def bitlength64_4(x):
    # Assumes x >= 0
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    x |= x >> 32
    return table[x % 67]

#

I haven't seen it anywhere online

#

Even though it is so clean

stiff quartz Jul 28, 2025, 12:32 PM

#

Is the goal to put ones everywhere

#

On the right of the highest bit

regal spoke Jul 28, 2025, 12:32 PM

#

stiff quartz Is the goal to put ones everywhere

The end goal is to compute the bit length

stiff quartz Jul 28, 2025, 12:32 PM

#

And then do a look up

stiff quartz Jul 28, 2025, 12:32 PM

#

regal spoke The end goal is to compute the bit length

Yes yes I got that

#

Im talking about the OR operations

#

You transform 0101010 in 0111111?

regal spoke Jul 28, 2025, 12:32 PM

#

The or operations is to make x be on the form 2^i - 1, yes

stiff quartz Jul 28, 2025, 12:33 PM

#

Ah ok I get the algo

#

That’s nice

regal spoke Jul 28, 2025, 12:33 PM

#

So after the or operations, there are only 64 possible values of x (assuming the starting x is a int64)

stiff quartz Jul 28, 2025, 12:34 PM

#

There are some values that are useless in the table then i assume

#

Or is it length 64

regal spoke Jul 28, 2025, 12:34 PM

#

it is 66 long

stiff quartz Jul 28, 2025, 12:35 PM

#

Shouldn’t it be 67 long?
Or maybe no one in the 64 values end up on index 66?

regal spoke Jul 28, 2025, 12:35 PM

#

I ended up being lucky

regal spoke Jul 28, 2025, 12:36 PM

#

stiff quartz Shouldn’t it be 67 long? Or maybe no one in the 64 values end up on index 66?

yep

stiff quartz Jul 28, 2025, 12:37 PM

#

Did you check if you were luckier with 71? 😅

#

Theoretically the stars could align

#

And give you a 64 long table

#

Even if that’s unlikely

regal spoke Jul 28, 2025, 12:38 PM

#

stiff quartz Did you check if you were luckier with 71? 😅

You are trying to win the lottery, I see

#

Havent tried that

stiff quartz Jul 28, 2025, 12:38 PM

#

Well o mean

#

If we’re trying to micro optimise bit length

#

We may as well try

#

😅

regal spoke Jul 28, 2025, 12:39 PM

#

If we really want to optimize, then this is probably better

import __pypy__
mult = __pypy__.intop.int_mul

# Magic constant for de bruijn sequence with the "Dickinson property" taken from https://cstheory.stackexchange.com/questions/19524/using-the-de-bruijn-sequence-to-find-the-lceil-log-2-v-rceil-of-an-integer
magic = -0x3f08a4c6acb9dbd

table2 = [0, 6, 7, 58, 8, 51, 59, 46,
          9, 55, 52, 33, 60, 47, 40, 28, 
          10, 56, 44, 53, 38, 36, 34, 20,
          61, 48, 25, 41, 22, 29, 16, 11, 
          63, 5, 57, 50, 45, 54, 32, 39, 
          27, 43, 37, 35, 19, 24, 21, 15, 
          62, 4, 49, 31, 26, 42, 18, 23, 
          14, 3, 30, 17, 13, 2, 12, 1]

def bitlength64_3(x):
   # Assumes x >= 0
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    x |= x >> 32
    return table2[mult(x, magic) >> 58]

#

This one is kind of standard. There is lots of info about this method online

stiff quartz Jul 28, 2025, 12:39 PM

#

I never heard of it

regal spoke Jul 28, 2025, 12:40 PM

#

The idea is similar

stiff quartz Jul 28, 2025, 12:40 PM

#

Is it faster?

regal spoke Jul 28, 2025, 12:40 PM

#

Probably yes

stiff quartz Jul 28, 2025, 12:40 PM

#

On your GitHub issue you mention other methods are a bit faster than your in house one

regal spoke Jul 28, 2025, 12:40 PM

#

I dont know if the compiler can be smart about % 67

#

There might be some funny method to quickly compute it

stiff quartz Jul 28, 2025, 12:41 PM

#

Maybe you can find a series of operations to do on your 64 values

#

That map them to an array of reasonable size

#

That’s faster than a mod

regal spoke Jul 28, 2025, 12:42 PM

#

Thats the goal, yes

stiff quartz Jul 28, 2025, 12:42 PM

#

I don’t really know where to start to find it though

regal spoke Jul 28, 2025, 12:42 PM

#

% 67 was my idea

#

The one unfortunate thing is that you wont be able to use only bit operations

#

Like you cant do something like x & (64 - 1)

#

You need something that kind of mixes up the bits

#

The de bruijn method uses multiplication mod 2^64 to do it

#

And my method uses % 67

stiff quartz Jul 28, 2025, 12:46 PM

#

What’s the CPython algorithm

#

I don’t understand it

#

It looks like depending on the compiler the algorithm is different

regal spoke Jul 28, 2025, 12:49 PM

#

stiff quartz It looks like depending on the compiler the algorithm is different

CPython does the sensible thing of using count leading zeros

#

Which is built into gcc/clang/msvc

stiff quartz Jul 28, 2025, 12:50 PM

#

regal spoke CPython does the sensible thing of using count leading zeros

Isn’t that O(64)

#

If your integer is 0

regal spoke Jul 28, 2025, 12:50 PM

#

No like, it uses something built in. True O(1) on most systems

stiff quartz Jul 28, 2025, 12:50 PM

#

Ah ok

regal spoke Jul 28, 2025, 12:50 PM

#

If you try to compile on some other compiler, then it falls back to

#

    const int BIT_LENGTH_TABLE[32] = {
        0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
        5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
    };
    int msb = 0;
    while (x >= 32) {
        msb += 6;
        x >>= 6;
    }
    msb += BIT_LENGTH_TABLE[x];
    return msb;

stiff quartz Jul 28, 2025, 12:51 PM

#

Yeah that one I couldn’t understand

regal spoke Jul 28, 2025, 12:51 PM

#

This case should practically never ever happen

stiff quartz Jul 28, 2025, 12:52 PM

#

stiff quartz Yeah that one I couldn’t understand

Oh nevermind I think I understand

#

Probably not as good as the ones proposed in the pypy issue because of the jmp statement it will generate?

regal spoke Jul 28, 2025, 1:02 PM

#

Obviously using the built in count leading zeros is better

stiff quartz Jul 28, 2025, 2:12 PM

#

regal spoke ```c const int BIT_LENGTH_TABLE[32] = { 0, 1, 2, 2, 3, 3, 3, 3, 4, 4...

No i meant what about this and de bruijn version

regal spoke Jul 28, 2025, 2:29 PM

#

stiff quartz Isn’t that O(64)

Are you asking if de bruijn is O(64)?

stiff quartz Jul 28, 2025, 2:30 PM

#

What no

#

Im just wondering if the algorithm CPython falls back to on other compilers is any good

regal spoke Jul 28, 2025, 2:30 PM

#

Their algorithm is not great (but better than what pypy currently uses)

#

But like it really doesn't matter

stiff quartz Jul 28, 2025, 2:31 PM

#

Well if it was great pypy could’ve copied it

#

That’s why I was wondering

regal spoke Jul 28, 2025, 2:31 PM

#

They could tbh. But I think the other ones I posted should be better.

#

PyPy's current algorithm for bit length is probably the slowest alg out there (except for algorithm intentionally written to be slow)

#

Just try to benchmark it yourself

#

import __pypy__
mult = __pypy__.intop.int_mul

# Magic constant for de bruijn sequence with the "Dickinson property" taken from https://cstheory.stackexchange.com/questions/19524/using-the-de-bruijn-sequence-to-find-the-lceil-log-2-v-rceil-of-an-integer
magic = -0x3f08a4c6acb9dbd

table2 = [0, 6, 7, 58, 8, 51, 59, 46,
          9, 55, 52, 33, 60, 47, 40, 28, 
          10, 56, 44, 53, 38, 36, 34, 20,
          61, 48, 25, 41, 22, 29, 16, 11, 
          63, 5, 57, 50, 45, 54, 32, 39, 
          27, 43, 37, 35, 19, 24, 21, 15, 
          62, 4, 49, 31, 26, 42, 18, 23, 
          14, 3, 30, 17, 13, 2, 12, 1]

def bitlength64_3(x):
   # Assumes x >= 0
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    x |= x >> 32
    return table2[mult(x, magic) >> 58]

vs

def bit_length_naive(x):
    cnt = 0
    while x:
        x >>= 1
        cnt += 1
    return cnt

#

My benchmark shows bit_length_naive runs 27 times slower than bitlength64_3 for numbers around 10^18.

stiff quartz Jul 28, 2025, 2:44 PM

#

even for small numbers, it's significantly faster

#

>>>> timeit.timeit("bit_length_naive(10)", setup="from __main__ import bit_length_naive", number=10000000)
0.13675410000000454
>>>> timeit.timeit("bitlength64_3(10)", setup="from __main__ import bitlength64_3", number=10000000)
0.004600799999991523
>>>> 0.1367541/0.0046008
29.723982785602498
>>>>

regal spoke Jul 28, 2025, 2:49 PM

#

It should shine the most if given random numbers with bitlength 63

stiff quartz Jul 28, 2025, 2:49 PM

#

yeah obviously

stiff quartz Jul 28, 2025, 3:48 PM

#

regal spoke You are trying to win the lottery, I see

I have tried every number up to 1 000 000 out of curiosity (i know the probability decreases massively as the mod increases but you never know), no one fits 🙂

#

I guess I could've stopped at 127

#

lol

regal spoke Jul 28, 2025, 3:49 PM

#

no lottery win sadcat

stiff quartz Jul 28, 2025, 3:49 PM

#

stiff quartz I guess I could've stopped at 127

but my computer is faster than my brain

#

argh I wish I had the emojis you have

#

I was thinking about that speedboat thing

stiff quartz Jul 28, 2025, 3:53 PM

#

stiff quartz but my computer is faster than my brain

to complement that sentence

#

and this one for obvious reasons

#

why do you have so many of them

regal spoke Jul 28, 2025, 3:58 PM

#

People made it in 2019

#

Then whenever someone made a new server they included it

#

There are lots of these thinking emojis thonk thonkery thonkeyes thonkcool thinkies pajenethonk mikethonk

#

The mikethonk one was made when MikeMirzayanov told people to think wider on codeforces

#

So we made a wider version of thonk

stiff quartz Jul 28, 2025, 4:02 PM

#

lmfao

stiff quartz Jul 28, 2025, 5:55 PM

#

regal spoke Seems it needs to get approved by moderators before it gets shown publicly

It's still undergoing a very careful investigation

regal spoke Jul 28, 2025, 5:56 PM

#

An issue tracking system like github is definitely nicer

stiff quartz Jul 28, 2025, 5:57 PM

#

I can upvote though

distant quiver Jul 28, 2025, 6:53 PM

#


class Solution(object):
    def minimumDifference(self, nums, k):
        """
        :type nums: List[int]
        :type k: int
        :rtype: int
        """
        n = len(nums)
        left =  0 
        right = 1
        min_diff = float('inf') 

        if n == 1:
            return 0

        while right < n :
            if right == n - 1:
                # if right pointer is at the end of the list 
                # reset left back
                diff = max(nums[right], nums[left])  - min(nums[right], nums[left])
                min_diff = abs(min(min_diff, diff))
                left += 1
                right = left + 1
                continue
            diff = max(nums[right], nums[left])  - min(nums[right], nums[left])

            right +=1
        return min_diff

for leet code

https://leetcode.com/problems/minimum-difference-between-highest-and-lowest-of-k-scores/

what is the time complexity this is my solution how is this O(n^2)

#

its in one pass

#

Oh nvm

#

im not sure

#

but the efficient way is to sort and use two pointer

chrome pewter Jul 29, 2025, 3:25 AM

#

distant quiver ```py class Solution(object): def minimumDifference(self, nums, k): ...

It isn't n^2?

haughty mountain Jul 29, 2025, 7:49 AM

#

distant quiver ```py class Solution(object): def minimumDifference(self, nums, k): ...

this doesn't solve the problem unless k=2

regal spoke Jul 29, 2025, 8:42 AM

#

stiff quartz It's still undergoing a very careful investigation

Its up now.

#

About opening up a similar issue on gcc bugzilla

stiff quartz Jul 29, 2025, 8:43 AM

#

pydis_strong

stiff quartz Jul 29, 2025, 8:43 AM

#

regal spoke About opening up a similar issue on gcc bugzilla

Neither of us could reproduce on gcc though, correct?

#

Was it ConfusedReptile that could?

regal spoke Jul 29, 2025, 8:43 AM

#

Yes, but only tried it on windows

stiff quartz Jul 29, 2025, 8:43 AM

#

Ah yes.

regal spoke Jul 29, 2025, 8:43 AM

#

I could try it through wsl

stiff quartz Jul 29, 2025, 8:43 AM

#

Tbf I so have a VM

#

On Ubuntu

#

With gcc

#

I could try when I come back home

stiff quartz Jul 29, 2025, 8:44 AM

#

regal spoke I could try it through wsl

You could as well I guess

#

We should also make sure the assembly code doesn’t change

#

https://godbolt.org

Compiler Explorer

#

Except for the +1 added obviously

regal spoke Jul 29, 2025, 9:03 AM

#

According to what I can see there, using -O0 should work

#

But both gcc and clang optimize it all away if using >= O1. But I think I found something that works

#

#include <string.h>
#define REPEATS 4000000

char src[1000000];  // any allocation of a big array works
int main() { 
    for (int i = 0; i < REPEATS; ++i) {
        // Call memcpy with dst - src = +-(power of 2) + (something small > 0)
        memcpy(src + (1 << 18) + 1, src, 8193);
    }
    return 0;
}

Here is something that should work on clang or gcc, independent of O flag. According to godbolt memcpy is always called

#

Btw if I understand it correctly, both clang and gcc should be calling the same exact memcpy from glibc. So there shouldn't be any difference between the two

#

https://sourceware.org/glibc/wiki/FilingBugs So this is probably where to report it

#

I wonder if there is an amd bug tracker too. All I could find was emailing their support

regal spoke Jul 29, 2025, 11:02 AM

#

stiff quartz You could as well I guess

I just tested on wsl with ubuntu. I'm getting that sweet factor 40 slowdown with both clang and gcc.

#

But my wsl python3 (version 3.12.3) does not have the copy slowdown bug

#

Is 3.12.3 too old to have the bug?

#

https://github.com/python/cpython/pull/91482 this pr is the one that introduced the algorithm that triggers the slowdown bug

GitHub

gh-91247: improve performance of list and tuple repeat (with specia...

Compared to current main branch there are 3 improvements

For the list inplace repeat there is an optimized copy method
This PR solves the inefficient pointer chasing for small (size 2 to 7) lists ...

regal spoke Jul 29, 2025, 11:19 AM

#

regal spoke Is 3.12.3 too old to have the bug?

It looks like 3.12 was the one who introduced it. But I'm not getting the slowdown. This is the 3.12 branch https://github.com/python/cpython/blob/f66c75f11d3aeeb614600251fd5d3fe1a34b5ff1/Include/internal/pycore_list.h#L59-L70

halcyon plankBOT Jul 29, 2025, 11:19 AM

#

Include/internal/pycore_list.h lines 59 to 70

// Repeat the bytes of a buffer in place
static inline void
_Py_memory_repeat(char* dest, Py_ssize_t len_dest, Py_ssize_t len_src)
{
    assert(len_src > 0);
    Py_ssize_t copied = len_src;
    while (copied < len_dest) {
        Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
        memcpy(dest + copied, dest, bytes_to_copy);
        copied += bytes_to_copy;
    }
}```

stiff quartz Jul 29, 2025, 11:22 AM

#

regal spoke Is 3.12.3 too old to have the bug?

I could reproduce on 3.12.5

regal spoke Jul 29, 2025, 11:22 AM

#

on windows?

stiff quartz Jul 29, 2025, 11:22 AM

#

Yes

#

I couldn’t on Linux

regal spoke Jul 29, 2025, 11:22 AM

#

but what about wsl

stiff quartz Jul 29, 2025, 11:22 AM

#

Try the string example

#

For some reasons I don’t understand

#

This one is easier to reproduce

regal spoke Jul 29, 2025, 11:23 AM

#

regal spoke ```c #include <string.h> #define REPEATS 4000000 char src[1000000]; // any all...

This definitely reproduces it on wsl

#

But I'm unnable to get python/pypy on wsl to bug out for some reason

stiff quartz Jul 29, 2025, 11:23 AM

#

#algos-and-data-structs message

#

Im talking about this ^

stiff quartz Jul 29, 2025, 11:24 AM

#

stiff quartz This one is easier to reproduce

Maybe the breaking change was introduced earlier

regal spoke Jul 29, 2025, 11:24 AM

#

Ok that one is death on wsl

python3 testare2.py
0.43964314460754395
6.476124286651611

stiff quartz Jul 29, 2025, 11:24 AM

#

Yep

regal spoke Jul 29, 2025, 11:25 AM

#

Why is it different?

stiff quartz Jul 29, 2025, 11:25 AM

#

I don’t know

#

No reference counting that’s the only difference

#

If I remember correctly

#

Looking at the source code

#

But maybe the list change came about AFTER this one

#

Can you upgrade your WSL python version?

#

Can you also try that #algos-and-data-structs message ?

#

This one is a memmove bug

regal spoke Jul 29, 2025, 11:26 AM

#

Could it be that the x in version matters, like 3.12.x

stiff quartz Jul 29, 2025, 11:26 AM

#

When you insert at index 1 it moves stuff

stiff quartz Jul 29, 2025, 11:26 AM

#

regal spoke Could it be that the x in version matters, like 3.12.x

Yep

stiff quartz Jul 29, 2025, 11:28 AM

#

regal spoke https://github.com/python/cpython/pull/91482 this pr is the one that introduced ...

That PR was closed in July 2022 and 3.12 was released end of 2023 so idk

stiff quartz Jul 29, 2025, 11:29 AM

#

regal spoke I just tested on wsl with ubuntu. I'm getting that sweet factor 40 slowdown with...

I’ll try in a bit btw

regal spoke Jul 29, 2025, 11:29 AM

#

stiff quartz That PR was closed in July 2022 and 3.12 was released end of 2023 so idk

We know enough know to where I doubt that it is actually calling memcpy in 3.12.3

stiff quartz Jul 29, 2025, 11:34 AM

#

I have 3.12.3 on my Ubuntu

#

What a coincidence

#

I can't even reproduce the string example ... (neither with 3.12.3 neither with 3.14.0 preview)

#

regal spoke Jul 29, 2025, 11:40 AM

#

regal spoke ```c #include <string.h> #define REPEATS 4000000 char src[1000000]; // any all...

@stiff quartz try this

stiff quartz Jul 29, 2025, 11:41 AM

#

No slowdown (on top is +1, bottom is +100)

#

Makes sense

#

For reference:
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

#

(It's a virtual machine though, idk if that's relevant)

regal spoke Jul 29, 2025, 11:43 AM

#

It might be. Seems like you dont have the bug at all there

stiff quartz Jul 29, 2025, 11:43 AM

#

Which is crazy because it's the same CPU

#

Literally the same

regal spoke Jul 29, 2025, 11:43 AM

#

I dont know what the VM does

stiff quartz Jul 29, 2025, 11:43 AM

#

Yeah me neither

regal spoke Jul 29, 2025, 11:44 AM

#

I find it stranger that strings slow down for me, but not lists

stiff quartz Jul 29, 2025, 11:44 AM

#

It might just be that the string change was older

#

https://docs.python.org/3/whatsnew/changelog.html

Python documentation

Changelog

Python next: Release date: XXXX-XX-XX Tools/Demos: gh-135968: Stubs for strip are now provided as part of an iOS install.. Tests: gh-135966: The iOS testbed now handles the app_packages folder as a...

#

you have everything here

#

oh wait no

regal spoke Jul 29, 2025, 11:45 AM

#

stiff quartz It might just be that the string change was older

Im sure of that

stiff quartz Jul 29, 2025, 11:45 AM

#

you have everything for 3.13 but not for 3.12

#

https://docs.python.org/3.12/whatsnew/changelog.html

Python documentation

Changelog

Python 3.12.10 final: Release date: 2025-04-08 macOS: gh-124111: Update macOS installer to use Tcl/Tk 8.6.16., gh-131423: Update macOS installer to use OpenSSL 3.0.16. Patch by Bénédikt Tran., gh-1...

#

ok here you have it all

#

no mention of gh-91247

regal spoke Jul 29, 2025, 11:46 AM

#

It is under Python 3.12.0 alpha 1

stiff quartz Jul 29, 2025, 11:46 AM

#

so 3.12.3 should have memcpy

regal spoke Jul 29, 2025, 11:46 AM

#

stiff quartz Jul 29, 2025, 11:47 AM

#

why is it not called gh-91247

#

There are some core developers here, maybe it'd be easier to ask them lol

regal spoke Jul 29, 2025, 11:47 AM

#

Its given as a bugs.python.org id

#

instead of pr

#

https://bugs.python.org/issue47091

stiff quartz Jul 29, 2025, 11:48 AM

#

Fair

#

So your 3.12.3 should call memcpy

regal spoke Jul 29, 2025, 11:48 AM

#

probably

stiff quartz Jul 29, 2025, 11:49 AM

#

Try the tuple example

#

Supposedly the change touched both and I could reproduce both

#

It’s on the issue that you opened somewhere

#

python3 -m timeit -r 20 "((0,) * 2**18) * 14"

#

Run something like this

#

And with +1 and +100

#

If that works and the list doesn’t work 🤯

regal spoke Jul 29, 2025, 11:54 AM

#

waiit

#

nvm

stiff quartz Jul 29, 2025, 11:57 AM

#

https://tenor.com/view/disappointment-gif-21343121

Tenor

regal spoke Jul 29, 2025, 11:57 AM

#

Just a thing

#

Remember that the calls to memcpy working with strings is different

#

Cause strings are bytes

#

And lists works off 8 bytes elements instead

stiff quartz Jul 29, 2025, 11:58 AM

#

Try

#

On gcc

#

+8

regal spoke Jul 29, 2025, 11:58 AM

#

ye exactly

stiff quartz Jul 29, 2025, 11:59 AM

#

Thats actually the reason why the string example is so much easier to reproduce

regal spoke Jul 29, 2025, 11:59 AM

#

probably, ye

#

also no reference count makes the intial copy faster

stiff quartz Jul 29, 2025, 11:59 AM

#

stiff quartz On a random online compiler:

You’re in the same position as this one random compiler

regal spoke Jul 29, 2025, 12:03 PM

#

+16 is slow, +32 is fast

stiff quartz Jul 29, 2025, 12:03 PM

#

And +8 is fast?

#

Id assume?

regal spoke Jul 29, 2025, 12:03 PM

#

I read it wrong

#

Seems <= 24 is slow

#

otherwise fast

stiff quartz Jul 29, 2025, 12:04 PM

#

But if +8 is slow

#

Then ..

regal spoke Jul 29, 2025, 12:04 PM

#

Which is same as msvc

stiff quartz Jul 29, 2025, 12:04 PM

#

List should be slow

regal spoke Jul 29, 2025, 12:04 PM

#

yes

stiff quartz Jul 29, 2025, 12:04 PM

#

An int is 8 bytes right

haughty mountain Jul 29, 2025, 12:04 PM

#

4

stiff quartz Jul 29, 2025, 12:04 PM

#

In CPython

#

A python int is a int64

#

No?

regal spoke Jul 29, 2025, 12:05 PM

#

stiff quartz An int is 8 bytes right

You are thinking about it wrong

#

A python list only stores pointers

#

It doesnt care what it is storing

stiff quartz Jul 29, 2025, 12:05 PM

#

Ah yes, but the pointer is 8 bytes

haughty mountain Jul 29, 2025, 12:05 PM

#

a list is an array of pointers

haughty mountain Jul 29, 2025, 12:05 PM

#

stiff quartz Ah yes, but the pointer is 8 bytes

on 64-bit, yea

stiff quartz Jul 29, 2025, 12:05 PM

#

So i was right for the wrong reason, this still doesn’t make any sense

regal spoke Jul 29, 2025, 12:05 PM

#

yes

#

I should be seeing the bug

#

but I'm not thonk

stiff quartz Jul 29, 2025, 12:06 PM

#

Are you sure +8 is slow? Maybe +7 and +9 are but not +8 on your computer

regal spoke Jul 29, 2025, 12:07 PM

#

+8, yes

stiff quartz Jul 29, 2025, 12:08 PM

#

haughty mountain a list is an array of pointers

It’s hard being the worst student in class 😳

regal spoke Jul 29, 2025, 12:11 PM

#

Because +16 is bad for me, and +32 isn't

#

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 32
print(time.time() - start_time)

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 32
print(time.time() - start_time)

This is a killer
0.9612233638763428
15.320869445800781
But change to * 64 and the difference get cut in half
1.9017343521118164
16.154651641845703

stiff quartz Jul 29, 2025, 12:12 PM

#

The first copy is at a distance 2^k+8

#

The second is at a distance of 2^j+16

#

No?

#

The third +32

#

So you get out of the bad zone

#

Maybe fast enough

#

With a list

#

But it’s super slow to get out of the bad zone with a string

#

Since it goes 1 2 4 8 16 32

stiff quartz Jul 29, 2025, 12:14 PM

#

regal spoke ```py import time start_time = time.time() for _ in range(500000): ("s" * (2*...

Yeah exactly

regal spoke Jul 29, 2025, 12:14 PM

#

By this logic, shouldn't * 4 be best for me for lists?

stiff quartz Jul 29, 2025, 12:14 PM

#

So maybe for the array benchmark you should just do *2

#

Ah yes *4

#

Since 16 is bad

#

Yes try *4

regal spoke Jul 29, 2025, 12:15 PM

#

Still nothing

#

on wsl

stiff quartz Jul 29, 2025, 12:16 PM

#

Is it drowning in noise

#

Weird though

#

The memcpy is like a x20 slowdown

regal spoke Jul 29, 2025, 12:16 PM

#

no it is just not triggering I think

regal spoke Jul 29, 2025, 12:16 PM

#

stiff quartz The memcpy is like a x20 slowdown

Actually it should be x40

stiff quartz Jul 29, 2025, 12:17 PM

#

stiff quartz python3 -m timeit -r 20 "((0,) * 2**18) * 14"

Can you try the tuple example?

regal spoke Jul 29, 2025, 12:17 PM

#

I see no change with tuples

stiff quartz Jul 29, 2025, 12:18 PM

#

And the list and tuple changes were shipped together

#

Are we reading the changelog wrong lol

#

Do you have another version of python installed on your wsl ?

#

Also, you should try your GCC benchmark with a bigger count

#

Maybe it doesn’t trigger if the count is too large

#

We have lower bounds and always assumed there was no upper bounds

haughty mountain Jul 29, 2025, 12:22 PM

#

what are you checking? repro'ing the different slowdowns on wsl?

regal spoke Jul 29, 2025, 12:29 PM

#

haughty mountain what are you checking? repro'ing the different slowdowns on wsl?

Here is a summary:

This C code triggers slowdown bug for me in:

msvc (32 bit and 64 bit)
wsl gcc
wsl clang
but no slowdown in
mingw gcc

#include <string.h>
#define REPEATS 4000000

char src[1000000];  // any allocation of a big array works
int main() { 
    for (int i = 0; i < REPEATS; ++i) {
        // Call memcpy with dst - src = +-(power of 2) + (something small > 0)
        memcpy(src + (1 << 18) + 1, src, 8193);
    }
    return 0;
}

This python script triggers for me in all suffciently new CPython and PyPy versions (both on windows and wsl).

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 32
print(time.time() - start_time)

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 32
print(time.time() - start_time)

The weird thing is that the list equivalence does not trigger on wsl + python (version 3.12.3). Switching to *4 does not fix that.

#

The switch of list copy algorithm was supposedly added in 3.12.0 alpha. But since I'm unnable to trigger the slowdown in 3.12.3, I'm starting to think it might not be in 3.12.3.

haughty mountain Jul 29, 2025, 12:33 PM

#

did you try other versions?

regal spoke Jul 29, 2025, 12:37 PM

#

haughty mountain did you try other versions?

How do I do that in wsl?

#

@haughty mountain Btw have you tried running the string example

#

The one just above with *32

#

Hmm in >= 3.13 they added this to the repeat code

#

I don't know what Py_BEGIN_CRITICAL_SECTION does, but it was not there in 3.12

haughty mountain Jul 29, 2025, 12:53 PM

#

regal spoke How do I do that in wsl?

you can try pyenv

haughty mountain Jul 29, 2025, 12:54 PM

#

regal spoke I don't know what `Py_BEGIN_CRITICAL_SECTION` does, but it was not there in 3.12

I think this has to do with the gil removal

regal spoke Jul 29, 2025, 12:54 PM

#

haughty mountain I think this has to do with the gil removal

https://github.com/python/cpython/issues/129069 it is listed as a bug report

GitHub

Race in concurrent list mutation and item retrieval · Issue #12906...

Bug report Bug description: A data race exists in lists when they are concurrently updated and indexed (or otherwise accessed): for slice assignment and other ways of adding or removing multiple it...

haughty mountain Jul 29, 2025, 12:55 PM

#

regal spoke https://github.com/python/cpython/issues/129069 it is listed as a bug report

yeah, it mentions "free threading"

#

which is the no gil thing

#

it's basically grabbing a mutex for the object

regal spoke Jul 29, 2025, 12:56 PM

#

haughty mountain yeah, it mentions "free threading"

So you dont think the critical section thing should matter?

haughty mountain Jul 29, 2025, 12:57 PM

#

It shouldn't

#

not sure where the mutex stuff is stored though

stiff quartz Jul 29, 2025, 1:00 PM

#

What is dest a char*?

https://github.com/python/cpython/blob/a852c7bdd48979218a0c756ff1a5586d91cff607/Include/internal/pycore_list.h

GitHub

cpython/Include/internal/pycore_list.h at a852c7bdd48979218a0c756ff...

The Python programming language. Contribute to python/cpython development by creating an account on GitHub.

#

Ah it didn’t link well

#

On line 56

#

That’s what’s used to copy the array right

regal spoke Jul 29, 2025, 1:02 PM

#

stiff quartz That’s what’s used to copy the array right

Ye

#

well list

#

not array

stiff quartz Jul 29, 2025, 1:02 PM

#

Yes sorry

regal spoke Jul 29, 2025, 1:03 PM

#

https://github.com/python/cpython/blob/d7e12a362a2a4a4b7d52a343ab5940be2cbcc909/Objects/bytesobject.c#L3713C1-L3733C2 this should be the one for (byte)-strings

stiff quartz Jul 29, 2025, 1:03 PM

#

Yep

#

That’s the one I linked at least

#

@regal spoke I'm the same as you on WSL

#

lhott@JasonTower:~/tmp$ python3 testString.py
0.7817449569702148
6.544427871704102

#

lhott@JasonTower:~/tmp$ vim testArray.py
lhott@JasonTower:~/tmp$ python3 testArray.py
0.8847634792327881
0.8824248313903809

#

Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] on linux

regal spoke Jul 29, 2025, 1:14 PM

#

Are those lists?

#

or arrays

stiff quartz Jul 29, 2025, 1:14 PM

#

yes lists

#

python lists

#

~~there are no python arrays anyway~~

regal spoke Jul 29, 2025, 1:14 PM

#

there is something called arrays too

stiff quartz Jul 29, 2025, 1:14 PM

#

I only knew about bytearrays

#

but anyway it was lists

#

lhott@JasonTower:~/tmp$ cat testArray.py
import time

start_time = time.time()
for _ in range(500):
  ([0] * (2**18 + 100)) * 14
print(time.time() - start_time)

start_time = time.time()
for _ in range(500):
  ([0] * (2**18 + 1)) * 14
print(time.time() - start_time)
lhott@JasonTower:~/tmp$

regal spoke Jul 29, 2025, 1:15 PM

#

~~what if they forgot to add it in the update~~

stiff quartz Jul 29, 2025, 1:16 PM

#

what is the password

#

of wsl

#

i'm trying to install 3.13

#

it's asking me a password

#

ok i found it

#

i just cycled through the passwords i usually use

#

So I did already use WSL?

#

I'm getting old, my memory is failing me

#

ok i'm installing a newer version of python

stiff quartz Jul 29, 2025, 1:17 PM

#

regal spoke ~~what if they forgot to add it in the update~~

we'll be able to verify that very quickly

regal spoke Jul 29, 2025, 1:18 PM

#

I'm technically using wsl2

stiff quartz Jul 29, 2025, 1:22 PM

#

regal spoke ~~what if they forgot to add it in the update~~

you can downlaod the non-compiled source code of each minor version

#

https://www.python.org/downloads/source/

Python.org

Python Source Releases for Source release

The official home of the Python Programming Language

#

here

#

if you really wanna verify

regal spoke Jul 29, 2025, 1:24 PM

#

I'm currently uppdating my ubuntu version on wsl.

stiff quartz Jul 29, 2025, 1:26 PM

#

Nop, still no bug in 3.13 + WSL

#

yes I know it should be named testList.py

regal spoke Jul 29, 2025, 1:30 PM

#

-# then rename it to testList.py

stiff quartz Jul 29, 2025, 1:31 PM

#

🚔

#

i can reproduce the slowdown on my WSL with GCC

#

even when count = 2^18

#

actually

#

nvm I'm stupid I can reproduce no matter the value of count if >= 8193

regal spoke Jul 29, 2025, 1:42 PM

#

stiff quartz i can reproduce the slowdown on my WSL with GCC

can or cant?

stiff quartz Jul 29, 2025, 1:42 PM

#

i can

#

with C

#

no with Python + List

#

No breaking news here

regal spoke Jul 29, 2025, 1:43 PM

#

After upgrading, I no longer get a slowdown in wsl, python3 (3.13.3), with strings

#

holyfuck

stiff quartz Jul 29, 2025, 1:43 PM

#

what

#

but do you get it with gcc

#

if you get it with gcc but not with Python there are things that I don't understand

#

I'm on

lhott@JasonTower:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.2 LTS
Release:        24.04
Codename:       noble

what version are you on?

regal spoke Jul 29, 2025, 1:45 PM

#

stiff quartz but do you get it with `gcc`

Seems, I no longer get the slowdowns.

#

On wsl

regal spoke Jul 29, 2025, 1:45 PM

#

stiff quartz I'm on ```py lhott@JasonTower:~$ lsb_release -a No LSB modules are available. Di...

root@MegaBj8rn:~/gitproj/slowdown_memcpy# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 25.04
Release:        25.04
Codename:       plucky

#

Could there have been some kind of patch?