#algos-and-data-structs
1 messages · Page 70 of 1
Doesn't matter. Counts still need to be updated
Couldn't this be optimized away when we know it'll be gone?
Idk how the interpreter would know I guess
I think pypy can be smarter, because pypy has a special optimization for pure integer lists
So I think it completely drops the reference counting in that case
But to my knowledge, CPython doesn't have any features like that
Btw if you read this code, you can see the reference counts being updated
while (src < src_end) {
_Py_RefcntAdd(*src, n);
*dest++ = *src++;
}
Here it adds n to the count of *src
then places source in *dest, and increments both pointers
(n is number of copies)
dest is the newly allocated list used for the output
src is the input list
What we are looking at corresponds to B[:n] = A in my code
Ah so dest starts by just pointing at src?
yes
dest is the newly allocated list that should be used for output
yes
This while loops start transfering elements from src into dest
While doing that, it also increments every elements reference count by n (because in the end, each element will appear n times)
That's cool
and it seems so simple when we understand it
Wait, so Python on Linux uses GCC?
If that's the case, and our bug is consistent with the MSVC I had here
This means, that if there is actually a bug with GCC (the one I had here)
and if Python on Linux uses GCC's memcpy
Then we would know how to trigger the bug on Linux, we have to use 2^18-1
Doesn't your benchmark indicate 2^18?
You could try if ([0]*2**18)*20 is slow
yeah I'm trying right now
so it does use GCC
now, can I reproduce the bug
probably not
that'd be too nice
I mean
Idk, it's beautiful but suspiciously too beautiful and so it might not just be a Windows bug?
@regal path could you try to reproduce on Linux?
this one?
2**18+1 47.59964680671692
2**18 49.571892976760864
Nice
Do you have gcc installed?
I do
Can you run this?
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.742536 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.556608 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.552967 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.555692 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.566795 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.595249 sec
14
clang, for good measure
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.705290 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.583754 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.583081 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.617412 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.659714 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.558543 sec
14
and MSVC? 😄
ah welp you're on Linux my bad
I posted a suggestion to use this on the issue
Idk if they will like it
But it should make python's performance far more reliable.
Is this graph Hamiltonian? Meaning, can you trace a cycle along the edges which passes through every point exactly once?
hello
Idk either, if they have a specific fix here they need to find all usages of memcpy in CPython to implement similar things I guess
Thing is, copying is one of few cases where memcpy performance is really noticeable
It's weird that you can't make the issue happening when you only do [0] * (2^18 + 1)
If the issue really lies in memcpy performance
That should happen there too
ah no, my bad
It would only call memcpy with powers of 2 and then with size = 1
I think the only way they would "accept" a "hacky change" is if you can find an example where it's [0] * something
because that is a very common thing to do (compared to ([0] * a) * b
Read the source code. Memcpy is not used in those cases
It is only used if the list being copied has size > 1
Yes and no. Repeated copies is one of few cases where the performance of memcpy is noticeable (i.e. isn't being overshadowed by anything else)
Fair enough
I did make an issue to MSVC repo directly (https://github.com/microsoft/STL/issues/5658)
since fixing it upstream is the best thing
My guess is that *2 is badly affected by this bug, but you don't really see it because of the reference counting taking more time
Btw you should change your benchmark to just have one src and one dst
You do see it with *2 I think
Just make one pair of src + dst of size 2**18+10 and use it for all benchmarks
That way the code is easier to read
Could do
and it will make it more obvious that memcpy is at fault
Otherwise one could think that using different arrays is what is causing the performance bug
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#define COUNT1 262144 // 2^18
#define COUNT2 262145 // 2^18 + 1
#define COUNT3 262154 // 2^18 + 10
#define REPEATS 10000
void benchmark_copy(const int *src, int *dst, size_t count, const char *label) {
size_t size_bytes = count * sizeof(int);
clock_t start = clock();
for (int i = 0; i < REPEATS; ++i) {
memcpy(dst, src, size_bytes);
}
clock_t end = clock();
double total_sec = (double)(end - start) / CLOCKS_PER_SEC;
printf("%s: Copied %zu bytes (%zu ints) %d times\n", label, size_bytes, count, REPEATS);
printf("%s: Total time: %.6f sec\n", label, total_sec);
// Prevent optimizing away some stuff
printf("%d \n", dst[14]);
}
int main(void) {
static int src[COUNT3], dst[COUNT3];
for (size_t i = 0; i < COUNT3; ++i) src[i] = (int)i;
// Do all benchmarks twice just in case
benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src, dst, COUNT3, "COUNT3 (2^18 + 10)");
benchmark_copy(src, dst, COUNT3, "COUNT3 (2^18 + 10)");
return 0;
}
You meant that?
You definitely do, but it is more noticeable for *20 because for *20 less time is spent on ref counting
that is what is causing the performance bug
Using the same array doesn't work

Really
Well I just ran that
And I get the same results for all benchmarks
Is there something wrong with it? I don't think so
Make your computer go to sleep 
the problem is deeper than that
Because I still have the other executable
Lemme check
I'm going crazy
So these are the results
I compile both with MSVC
C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCSAMEARRAY.exe testSameArray.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30159 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
testSameArray.c
Microsoft (R) Incremental Linker Version 14.29.30159.0
Copyright (C) Microsoft Corporation. All rights reserved.
/out:mainMSVCSAMEARRAY.exe
testSameArray.obj
C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCDIFFARRAYS.exe testDiffArrays.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30159 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
testDiffArrays.c
Microsoft (R) Incremental Linker Version 14.29.30159.0
Copyright (C) Microsoft Corporation. All rights reserved.
/out:mainMSVCDIFFARRAYS.exe
testDiffArrays.obj
And then, the results are:
C:\Users\lhott\Desktop\temp>mainMSVCSAMEARRAY.exe
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.960000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.951000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.948000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.951000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.950000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.947000 sec
14
C:\Users\lhott\Desktop\temp>mainMSVCDIFFARRAYS.exe
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.216000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.216000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 2.903000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 2.851000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.231000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.221000 sec
14
Same array: it's homogeneous BUT SLOW FOR ALL??????
How lovely
This stuff reminds me of https://codeforces.com/blog/entry/126654
👀
isn't codeforces fully running with GCC?
https://ideone.com/nsCS2n
https://ideone.com/2jp9Hc
for the source codes if you wanna investigate
But the more we do the less it makes sense, so is that a good idea
What if you keep the three but give src2 and dst2 to all of em
How are you going to memcpy COUNT3 integers
if you give it src2 and dst2
—
We should look at the old source code when it did O(n) memcpy
And try to hack the old versions of CPython/PyPy
https://codeforces.com/blog/entry/126654?#comment-1124500 this guy found the underlying bug. It is caused by a stupidly implemented memory allocator that keeps allocating and deallocating large chunks of memory every time cin is called
Oh I meant give all 3 src3 and dst3
To my knowledge that isn't "hackable".
Why? I've got no clue
Compiler would just optimise away src1, src2, …
And we’d have the same exact code
As that
Wouldn’t we?
At least that’s what I, from my limited C knowledge, would think happens
Thats a big assumption...
You could also just make some global arrays
Adding volatile might work too
mmm
so I did:
benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
benchmark_copy(src3, dst3, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src3, dst3, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src3, dst3, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src3, dst3, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
and we get:
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.222000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.255000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 3.900000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 3.884000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.255000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.268000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.344000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.381000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.368000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.312000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.394000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.336000 sec
14
still have the slowdown I had, but everything is homogeneous
but less of a slowdown than earlier
worth noting i put my computer to sleep since then 
Try
benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
benchmark_copy(src2, dst2, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src2, dst2, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
You're going to laugh
C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCDIFFARRAYSPAJENEGOD.exe testDiffArraysPajenegod.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30159 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
testDiffArraysPajenegod.c
Microsoft (R) Incremental Linker Version 14.29.30159.0
Copyright (C) Microsoft Corporation. All rights reserved.
/out:mainMSVCDIFFARRAYSPAJENEGOD.exe
testDiffArraysPajenegod.obj
C:\Users\lhott\Desktop\temp>mainMSVCDIFFARRAYSPAJENEGOD.exe
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.361000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.365000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 7.027000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 4.624000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.258000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.276000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 3.960000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 4.156000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 4.127000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 4.214000 sec
14
Clearly we have not understood what is causing this bug
I can't find anywhere online where you can run C code on an online platform that lets you choose the compiler
godbolt lets you choose but you only see the assembly code, you can't run it, it seems
Windows is very uncommon for servers to run
Pretty much codeforces is the only one out there
😮💨
A theory about this. Maybe it is caused by alignment of pointers?
Here is something you could test. Allocate a bigger src and dst
Then play around with benchmark_copy(src + i, dst + j, COUNT1, "COUNT1 (2^18)"); for different values of i and j
It could be that the size given to memcpy isn't the issue
That is what this implies ^
Agreed
Will do when I come back
@regal spoke do you have Visual Studio?
If I do, it is old
I have Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x86
$ ./slowtest.exe
Benchmark COUNT1 (2^18): 0.621000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 0.622000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.356000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.356000 seconds over 10000 repeats
Benchmark COUNT3 (2^18 + 10): 0.682000 seconds over 10000 repeats
Benchmark COUNT3 (2^18 + 10): 0.684000 seconds over 10000 repeats
So I'm able to reproduce the slowdown locally
@stiff quartz Try this
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#define COUNT1 262144 // 2^18
#define COUNT2 262145 // 2^18 + 1
#define COUNT3 262154 // 2^18 + 10
#define REPEATS 10000
void benchmark_copy(const int* src, int* dst, size_t count, const char* label) {
size_t size_bytes = count * sizeof(int);
clock_t start = clock();
for (int i = 0; i < REPEATS; ++i) {
memcpy(dst, src, size_bytes);
}
clock_t end = clock();
double elapsed_secs = (double)(end - start) / CLOCKS_PER_SEC;
printf("Benchmark %s: %.6f seconds over %d repeats\n", label, elapsed_secs, REPEATS);
}
int main() {
static int src[COUNT2];
static int dst[COUNT2];
for (size_t i = 0; i < COUNT2; ++i) src[i] = i;
// Do all benchmarks twice just in case
benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
return 0;
}
Cool
Because a guy couldn’t reproduce
On MSVC github
Maybe we should get latest version
Im doing that rn
Microsoft has a weird way of doing things
With their Visual Studio Installer
It’s always so confusing if you’re not familiar with it
Which I’m not
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#define COUNT 100000
#define COUNT1 262144 // 2^18
#define COUNT2 262145 // 2^18 + 1
#define COUNT3 262154 // 2^18 + 10
#define REPEATS 10000
void benchmark_copy(const int* src, int* dst, size_t count, const char* label) {
size_t size_bytes = count * sizeof(int);
clock_t start = clock();
for (int i = 0; i < REPEATS; ++i) {
memcpy(dst, src, size_bytes);
}
clock_t end = clock();
double elapsed_secs = (double)(end - start) / CLOCKS_PER_SEC;
printf("Benchmark %s: %.6f seconds over %d repeats\n", label, elapsed_secs, REPEATS);
}
int main() {
static int src[COUNT2];
static int dst[COUNT2];
for (int i = 0; i < COUNT2; ++i) src[i] = 0;
// Do all benchmarks twice just in case
benchmark_copy(src, dst, COUNT, "COUNT (100k)");
benchmark_copy(src, dst, COUNT, "COUNT (100k)");
benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
return 0;
}
Btw the slowdown doesn't seem to have anything to do with the count you give memcpy
For this one slowdown hits on all 3
Ok
I downloaded a few recent versions
includin the latest
The bug is reproducible on all MSVC versions
Which makes sense because Python 3.14 preview is compiled with the latest MSVC version anyway.
if src and dst are of size COUNT2
C:\Users\lhott\Desktop\CP\c\untitled>mainMSPajenegod.exe
Benchmark COUNT (100k): 0.786000 seconds over 10000 repeats
Benchmark COUNT (100k): 0.788000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 2.057000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 2.060000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.055000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.065000 seconds over 10000 repeats
if src and dst are of size COUNT2+100
C:\Users\lhott\Desktop\CP\c\untitled>mainMSPajenegod.exe
Benchmark COUNT (100k): 0.064000 seconds over 10000 repeats
Benchmark COUNT (100k): 0.067000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 0.162000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 0.162000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 0.163000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 0.168000 seconds over 10000 repeats
Here is something you could try
#include <string.h>
// #define BADSIZE (262145)
#define BADSIZE (262145 - 1)
#define REPEATS 100000
int main() {
static int src[BADSIZE];
static int dst[BADSIZE];
for (int i = 0; i < REPEATS; ++i) {
memcpy(dst, src, 100000);
}
return 0;
}
This one should run fast
but switch to
#define BADSIZE (262145)
and it should run sloww
no
like the memory allocated is bad/misaligned?
I think the values dst and src is what makes memcpy go bananas
Can we find other places in CPython source code that call memcpy?
From 134 ms to 1963 ms yes
I guess that's a smaller example but it doesn't give us any particular insight right?
Someone said that it's the same bug as https://github.com/microsoft/STL/issues/5506
Idk if that's the case since they said the bug hits clang on Linux and we could not reproduce on clang on Linux
I dont think it is the same.
I dont have intel
#include <string.h>
#define BADSIZE ((1 << 15) | 24)
#define REPEATS 100000
int main() {
static char src[BADSIZE];
static char dst[BADSIZE];
for (int i = 0; i < REPEATS; ++i) {
memcpy(dst, src, 25000);
}
return 0;
}
Try this one
About the 15, seems any number works
you didn't define REPEATS
fixed
It seems that whatever triggers it needs BADSIZE between 1 to 24 bytes above a power of two
Thats the only rule
hi
((1 << 15) | 128) = 36 ms
((1 << 15) | 24) = 44 ms
I multiplied REPEATS by 10
Also, did you compile exactly what I sent?
yes
note that I have char and not int
#include <string.h>
#define BADSIZE ((1 << 15) | 128)
#define REPEATS 1000000
int main() {
static char src[BADSIZE];
static char dst[BADSIZE];
for (int i = 0; i < REPEATS; ++i) {
memcpy(dst, src, 25000);
}
return 0;
}
and
#include <string.h>
#define BADSIZE ((1 << 15) | 24)
#define REPEATS 1000000
int main() {
static char src[BADSIZE];
static char dst[BADSIZE];
for (int i = 0; i < REPEATS; ++i) {
memcpy(dst, src, 25000);
}
return 0;
}
With REPEATS multiplied I have 391-396 ms for | 24
and 304-305 ms for | 128
This runs instant for me
This one doesn't
I mean
I use Measure-Command { ./smallerExample.exe }
on Powershell
the std seems pretty low
when I re-run
both are pretty much instantaneous
Run what I sent but lower the 24
until it slows down
For me anything between 1 and 24 slows it down
Go up from 1
2 gives me 4.9 s
3 gives me 4.9 s
5 gives me 4.9s
10 gives me 446 ms
7 gives me 449 s
6 gives me 441 ms
so it seems that for me anything between 1 and 5 slows it down
4 gives me 4.9 s
..
I'm assuming 8 is also slow?
Yes.
9 isn't.
16 isn't.
Oh I had tested 16 already. Well it definitely isn't.
Has a memcpy as well
So we can probably do something with dictionaries in Python
@stiff quartz try this
#include <string.h>
#define BADSIZE ((1 << 15) | 1)
#define REPEATS 100000
int main() {
char src[1000000];
for (int i = 0; i < REPEATS; ++i) {
memcpy(src + BADSIZE, src, 25000);
}
return 0;
}
Seems that it might just be that dst - src = slightly larger than a power of 2 is what triggers the bug
@regal spoke Other parts of Python are affected
how big is the difference between fast and slow?
Factor 10-20
This is more than what we had with lists
If I'm not mistaken:
https://github.com/python/cpython/blob/2bd4ff07001038fada17b6dbcb5ab5aabd27c26e/Objects/unicodeobject.c#L12782
is called
Objects/unicodeobject.c line 12782
unicode_repeat(PyObject *str, Py_ssize_t len)```
Oh it is because it doesnt need to do any ref courning for strings
Objects/bytesobject.c line 3713
_PyBytes_Repeat(char* dest, Py_ssize_t len_dest,```
And that has a memcpy
while (copied < len_dest) {
Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
memcpy(dest + copied, dest, bytes_to_copy);
copied += bytes_to_copy;
}
The same log2(n) one
!e
import time
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 100)) * 14
print(time.time() - start_time)
:white_check_mark: Your 3.13 eval job has completed with return code 0.
1.5913984775543213
!e
import time
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 1)) * 14
print(time.time() - start_time)
:white_check_mark: Your 3.13 eval job has completed with return code 0.
2.787262201309204
@regal spoke I can reproduce the slowdown with Discord bot?!!!
!e
import platform
print(platform.libc_ver())
:white_check_mark: Your 3.13 eval job has completed with return code 0.
('glibc', '2.36')
This is a completely unrelated bug then, the list bug you found was Windows specific, that string bug seems to work on Linux
On a random online compiler:
!e
import platform
print(platform.platform())
print(platform.machine())
:white_check_mark: Your 3.13 eval job has completed with return code 0.
001 | Linux-6.1.0-30-cloud-amd64-x86_64-with-glibc2.36
002 | x86_64
Thats weird. If I compile with GCC, then I do not see any slowdowns
So it doesn't seem to be entirely hardware related either
Honestly at this point I don't know
Isn't that ironic
!e
import time
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 1)) * 14
print(time.time() - start_time)
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 100)) * 14
print(time.time() - start_time)
:white_check_mark: Your 3.14 pre-release eval job has completed with return code 0.
001 | 2.9183502197265625
002 | 1.5649194717407227
!e
import time
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 100)) * 14
print(time.time() - start_time)
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 1)) * 14
print(time.time() - start_time)
:white_check_mark: Your 3.13 eval job has completed with return code 0.
001 | 1.558946132659912
002 | 2.7390034198760986
I could reproduce on another computer running Python 3.11 on Linux (GCC)
wild, I'm getting 0.46 vs 6.38 on linux, Python 3.10.12 GCC 11.3.0. Same for 3.12 GCC, and same for uv-installed 3.11.3 and 3.13.5 (Clang 20.1.4).
so neither version nor compiler seem to matter much.
...my CPU is an AMD. On a different computer (Windows, Intel) there's no slowdown. As the last message in that thread suggests, it may be CPU-dependent.
What does uv-installed mean?
The string one or the list one? So both are AMD specific bugs?
I tested the string one. Presumably they are both caused by malloc
Obtained by uv python install: https://docs.astral.sh/uv/concepts/python-versions/. This downloads a prebuilt binary.
The difference is wild
Someone with Intel managed to reproduce somehow
:white_check_mark: Your 3.14 pre-release eval job has completed with return code 0.
001 | Power of 2: 0.1374994833022356
002 | Not power of 2: 3.3911833791062236
I think you have to be careful about what is the same issue or not
Could be that there are different issues
Seems people are mixing memset stuff and memcopy
https://github.com/python/cpython/issues/137121#issuecomment-3124504591
exactly what I said here but I guess it went ignored
Yeah I also fell into the trap of a benchmark that was just flawed and seemed to show performance slowdown without the *14
@stiff quartz Why do you think memcpy issue is not windows specific?
Have you been able to make memcpy slow in C without using windows?
@regal spoke
And here @regal spoke
Thèse are the only reasons I have for now why I think memcpy is not windows specific
I have not
Thats not C
Indeed, I can’t reproduce anything with C on Linux for now related to the string benchmark
I probably shouldn’t blame memcpy - I don’t see anything else in CPython source code for strings that could be the cause of this though
I think confirmation is important
Yes, I haven’t been able to / haven’t had the time to dig deeper
#include <string.h>
#define REPEATS 4000000
int main() {
char src[1000000];
for (int i = 0; i < REPEATS; ++i) {
memcpy(src + (1 << 15) + 1, src, 4096);
}
return 0;
}
On my computer, this very precisely triggers the bug
Only on MSVC (Windows) + AMD from my understanding; did you try that on GCC?
Has anyone run this on Linux?
I've only gotten it to work on MSVC
I've tried with gcc on my windows computer, and a friend on their linux computer (with AMD processor). Nothing interesting happens, they both run equally fast
Could they reproduce the Python string slowdown on their Linux computer? Not all Linux computers can it seems, maybe only the Linux computers that show the Python string slowdown would show anything with the C memcpy benchmark.
Thing is, my computer is clearly affect by these slowdown. But I'm not able to make gcc slowdown
But your computer uses MSVC for Python
yes
Why would you expect gcc to show anything
The question is, is the issue with memcopy windows specific or not
Ah, I see
You are claiming it is not
My gcc benchmarks on my computer make absolutely no sense, I’ve given up on trying to understand them.
Well yes that was not phrased well
I edited it my bad
Let me see if I can make a minimal working example out of that
Just to confirm the behaviour I see with gcc can’t be reproduced on your machine?
no
Argh
No optimization is on by default
At least I learn a lot about C & C++ & reading CPython code
ah you're right
I'm back to having a normal-behaving GCC with the optimization flags
I wonder if the MSVC problem we have disappears with https://learn.microsoft.com/en-us/cpp/build/reference/o-options-optimize-code?view=msvc-170
No my example work with /O2
#include <string.h>
#define REPEATS 4000000
int main() {
char src[1000000];
for (int i = 0; i < REPEATS; ++i) {
memcpy(src + (1 << 15) + 1, src, 4096);
}
return 0;
}
This is supposed to be the slow one, correct?
it's fast
with O2
dsfkfiqsjdfiqjds
it's fast without
what is that
AH no, I'm the same than the guy on the C issue
I have to compile for x86 to see the slowdown (with or without O2)
Are you compiling for x64 or x86
Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x86
which is annoying for our bug hunt because Python 3.14 is compiled with MSC v.1944 64 bit (AMD64)
#include <string.h>
#define REPEATS 4000000
int main() {
char src[1000000];
for (int i = 0; i < REPEATS; ++i) {
memcpy(src + (1 << 18) + 1, src, 8193);
}
return 0;
}
Try this
This one is slow for me on Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x64
Yep
Why isn't my initial one working
with x64
It's also 2^18+1
and i'm clearly moving more than 2*4096 +1
Could be because I have control over the pointers
Unlike your test
What do you mean sorry
src + (1 << 18) + 1, src I specially make the pointers be at a bad distance for dst and src
You get whatever the compiler decides to give you
I really think this is the trigger
lemme test
#include <string.h>
#define REPEATS 4000000
int main() {
char src[1000000];
for (int i = 0; i < REPEATS; ++i) {
memcpy(src + (1 << 18) + 1, src, 2*4097 + 16);
}
return 0;
}
This is slow
16 => 161 is slow too
the reason I put 2*4096 + 1
is that anything smaller seems to run fast
you could probably put 100k if you wanted
In my initial case, dst and src are two distinct objects
the pointers are far from eacher other
at least by a distance equal their size
Here is the thing, I suspect the reason why copying in python is slow
Is that the pointers to dst and src is (power of 2) times n
I disagree, if we copy list or strings of size 2^n + 1
They are initially at a distance 2^n + 1 then 2^(n+1) + 2
yes but then my initial case
src and dest are far, and like in your example, distance isn't a power of 2
If initial size is 2^i + 1, then every call to memcpy will be some power of 2 + something small
do you suspect the something small is the problem?
Read my comment earlier on stl github
I think that if dst - src = +-(power of 2) + something small (thats > 0)
then the bug triggers
Thats why I'm coding it like this
If you add the same value x to both dsc and src, then the bug always happens
Reading your comment, I thought you wanted dst and src to be of a distance power of 2 + 1 with that 1 being important
Well to guarantee a trigger, yes.
But I've gotten it to work with something as big as +24
I definitely need to add a lot of context
to the issue
I initially didn't even mention flags
and didn't mention the difference between x86 and x64
I really think this is the key. Could we make this work on gcc too?
Ah yes lemme check on gcc
The problem is that my gcc seems to just ignore everything
do you mean it optimizes away the copy?
yep
when you set the flag
Thats what I'm trying to avoid
what if you add printf(%c, src[40]);
Hmm I just tried with O0, and then it doesnt optimize it all away
But I sadly dont seem to trigger the slowdown bug
yes my screenshot is from O2
What about if we use ints instead of char for your example
Ah yes that's the whole point i forgot
#include <string.h>
#include <stdio.h>
#define REPEATS 4000000
int main() {
char src[1000000];
for (int i = 0; i < REPEATS; ++i) {
memcpy(src + (1 << 18) + 1, src, 4*4096 + 1);
}
return 0;
}
Should I maybe just post this on stl issue?
ok
To add information about x86 & x64
and the flags
I think I'm going to erase my example and just say that you found better shorter more consistent examples
Because they're really are all that lol
I think fundamentally, flags shouldnt matter since that shouldnt affect the internals of memcpy.
Fair enough
but 32bit vs 64bit could definitely make a big difference
That's the super important bit that I completely overviewed
and to be fair I didn't even know what I was using to compile
until the guy said that
I'm alson including the cpp version:
#include <iostream>
#include <cstring>
constexpr int REPEATS = 4000000;
constexpr int SRC_SIZE = 1000000;
int main() {
char src[SRC_SIZE] = {};
for (int i = 0; i < REPEATS; ++i) {
std::memcpy(src + (1 << 18) + 1, src, 2 * 4096 + 1);
}
return 0;
}
The C example reproduces the bug for me too (linux, AMD). 4s with the +1, 0.15s without it. GCC vs Clang have the same behaviour.
As in in both compilers, memcpy(src + (1 << 15) + 1, src, 4096); slows down and memcpy(src + (1 << 15) + 0, src, 4096); doesn't.
eeek
Although @vocal gorge is one of the people that could see the slowdowns in Python on Linux
So it makes sense @vocal gorge can see the slowdowns in C
Ehh, I suspect that was a pyenv thing
whereas this is just C, nothing interesting
(My testing approach is gcc -Wall -o test1 test1.c && time ./test1, and similarly for test2, and similarly with clang instead of gcc.)
@vocal gorge Try this one ^. This is the one that we got to trigger on both msvc 32 bit and msvc 64 bit.
this takes 8s for me, on both gcc and clang
And if you remove the +1 after 1<<18?
~200ms, on both gcc and clang
Well hopefully I included everything relevant here:
https://github.com/microsoft/STL/issues/5658
Be even more specific.
Say something like:
On Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x64:
well I listed all the versions
below
I could reproduce the issues on 19.29.30157. It is pretty old but I could reproduce with very similar results on 19.29.30159, 19.38.33135, 19.44.35211, and 19.44.35213 (latest available).
My versions are gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Target: x86_64-pc-linux-gnu Thread model: posix and Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144) if it ends up mattering.
I suspect it's not particularly a compiler thing if it's this consistent, though
thanks
You could also add the note about memmove also having the same exact issue, expect for if the intervals intersect. (This is to be expected since memcpy is basically the special case of memmove with non intersection).
From my testing, if they intersect, memmove does not get the slowdown bug.
Btw instead of // this also can be int src[...];, just say something like // Allocate a sufficiently large array
That is all it is
I think so
Im afraid this will all come down to
It’s AMD specific, a part that is closed source
Or it’s OS specific and not really fixable
Idk
I know nothing about syscalls
Let alone CPU hardware stuff
It sounds more like hardware lvl now
Yes agreed
Unless someone finds that it can be triggered on intel too
I have yet to see any of those bugs on intel
That said, I think it makes a lot of sense to keep this open for a while
But idk
btw I dont like char src[1000000]; // this also can be int src[...]; - any allocation of a big array works
Using int src[] instead of char src[] would not only make a larger array
it would also mean that src + (1 << 18) + 1 goes 4 times further
Does it?
yes
When you do src +
You count in Ute’s
Bytes
Does it matter than ints are 4 bytes
I've been taught that
A[i] is same as *(A + i)
Huh
You’re right
Because of this, I think using an int array is bad
I thought pointer arithmetic in c was in terms of bytes
The count you give memcpy is in bytes
But any pointer arithmetic depends on the size of the datatype
To keep it simple, just dont use int src[]
Yeah I edited
Yeah I assumed pointer arithmetic was in bytes too
Good thing I’ve never been asked to code in C or C++ in my life
🤡
You dont even need to say char is nice because it's one byte so we can get to 2^k+1 distance
Also if it is a AMD issue I don’t even know where this should be discussed
Up to the point we are really sure it is an AMD problem
It is nice to keep things open
I agree
But if we are sure it is
I wouldn’t even know what the move would be
Im not closing or porting any issue today on the Developer community thing they mentioned ill do it tmro (or you can do it if you want)
I hate i[a] I know it makes sense that addition is commutative but I hate it I hope I never see someone write code with i[a]
nope
intel 6th and 8th gen
but, why?
I'm all for pointer witchery but why in gods name would you ever go before start of array with that
(<datatype>)a-- should work no?
@stiff quartz Btw fix two things. Make 2*4096 + 1 be just 8193. I originally wrote 2*4096 + 1 because I was lazy when I was coding. And probably remove the "char is nice because it's one byte so we can get to 2^k+1 distance"
I think the story was that I implemented an algorithm from a paper for some kind of shortest distance algorithm in 3D for voxels. This was many years back so I dont really remember now.
In the algorithm you naturally do pointer arithmetics with lots of ++ on pointers everywhere
But I realized in some edge cases that I kind of needed to undo some of the ++ to my pointers
My quick fix was to index with a negative number
In some sense it was super nice. But it also broke every sensible rule for working with pointers
I don't know if I still have the code somewhere
I just remember that @haughty mountain really didn't like my code
Maybe he remember more what it was about
Done
!e
import timeit
setup_code = '''
ba = bytearray(b"A" * 100000)
insert_bytes = b"B" * (2**14 + 1)
'''
stmt_code = 'ba[:1] = insert_bytes'
timer = timeit.Timer(stmt=stmt_code, setup=setup_code)
print(timer.timeit(number=1500))
import timeit
setup_code = '''
ba = bytearray(b"A" * 100000)
insert_bytes = b"B" * (2**14 + 50)
'''
stmt_code = 'ba[:1] = insert_bytes'
timer = timeit.Timer(stmt=stmt_code, setup=setup_code)
print(timer.timeit(number=1500))
@regal spoke that example is beautiful because much more common imo
:white_check_mark: Your 3.14 pre-release eval job has completed with return code 0.
001 | 3.254694360308349
002 | 2.5861790208145976
the issue is not with any C compiler (GCC/Clang/MSVC Python versions on AMD can (almost always) reproduce)
You could argue that the issue lies with all of them
Yes sure
I don't know where the responsibility of the compiler ends the responsibility of the hardware begins
I'm surprised no one grasps that the last example is 10x more important to fix than the others
https://x.com/disconnect3d_pl/status/1909918427084452076 As disconnected mentions in the github issue, there is already tons of case handeling
A call to memcpy() in a single binary that uses glibc may behave in 12 different ways depending on the features of the specific x86-64 CPU you run it on.
Here is a list of those impls in glibc:
https://t.co/jz8ttYUNTw
I made a report now to developer community https://developercommunity.visualstudio.com/t/memcpy-running-10---40-times-slower-when/10943533
Seems it needs to get approved by moderators before it gets shown publicly
Should probably also make a report to bugzilla https://sourceware.org/bugzilla/
Yep I can't see it yet
yep
Here is what I wrote
[severity:It bothers me. A fix would be nice]
It was recently found that calling memcpy(dst, src, count) where dst-src is close to a power of 2 causes significant performance degragation (up to 20-40 times). It was originally discovered in Python because Python’s internal list copying algorithm happens to repeatedly trigger this slowdown bug.
Related issues:
https://github.com/python/cpython/issues/137121
https://github.com/microsoft/STL/issues/5658
MRE in C used to trigger the bug
#include <string.h>
#define REPEATS 4000000
int main() {
char src[1000000]; // any allocation of a big array works
for (int i = 0; i < REPEATS; ++i) {
// Call memcpy with dst - src = +-(power of 2) + (something small > 0)
memcpy(src + (1 << 18) + 1, src, 8193);
}
return 0;
}
This takes 4 s on my Windows desktop computer (compiled using Microsoft (R) C/C++ Optimizing Compiler Version 19.44.35213 for x64). But if I change (1 << 18) + 1 to (1 << 18) it takes 0.15 s.
About which CPU models are affected. I have desktop and laptop at home, and both experience the slowdown bug. Desktop CPU: AMD Ryzen™ 9 5950X. Laptop CPU: AMD Ryzen™ 9 5900HS. There has been a single report of someone reproducing this with an intel processor + clang (https://github.com/python/cpython/issues/137121#issuecomment-3126294652). Otherwise the reports have been conclusively with people running AMD processors.
Compilers confirmed to be able to reproduce the slowdown bug
Microsoft (R) C/C++ Optimizing Compiler Version XXX for x64
Microsoft (R) C/C++ Optimizing Compiler Version XXX for x86
Where XXX is 19.29.30157, 19.29.30159, 19.38.33135, 19.44.35211, and 19.44.35213 (latest available)
Linux GCC (gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0)
Linux Clang (Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144)
I would like mark the severity higher, but there was no good option
I wonder if mentioning that python consequence https://github.com/python/cpython/issues/137121#issuecomment-3126183984 (inserting/deleting any 2^k+1 portion of any Python objects that is basically a C array) would help the "severity" case.
You can also mention my CPU if you want: AMD Ryzen 9 7900 12-Core Processor
Ryzen 9 too huh

I think the severity is kind of high, but it is not complete broken lvls
Which is the next step in their severity scale
In the end, this is just a slowdown issue (as far as we know)
Which is not the end of the world
Would be funny if you could hack people on codeforces because of it but that's very unlikely
I think cf runs intel
Or at least they used to
Ah yes there's that
Btw the reason I found this bug was that I was benchmarking sparse tables / disjoint sparse table in python/pypy. The disjoint sparse table starts with A *= len(A).bit_length()
That is where the original 22 came from. It's the bit length of the size of the list
At first I thought pypy was the culprit, but then I realized cpython was equally slow.
ohh yes makes sense
Maybe I can find more bugs benchmarking them
This is how long it takes CPython to create the sparse table. There is this weird bump that I don't know where it comes from.
This is the same plot, but for PyPy instead
That bump is now a cliff
Something very funky is going on between n = 6.7e7 and n = 6.8e7
Something else funky is that n=3e7 is always below the expected curve for PyPy
Also, the reason I happened to benchmark with powers of 2 plus 1 is that I used this to generate my benchmark
for i in range(40):
n = int(2**(i*0.5) + 1)
benchmark(n)
Originally I had just 2**i, but I felt like the size increased too fast
So I changed it to int(2**(i*0.5) + 1)
Then I saw that it somehow alternated between running fast and slow
The easiest way to find bugs is to hunt down everything using memcopy or memmove like I did
Unless you want to find unrelated performance bugs
I just want to know why my sparse table is slow
Ah welp
Will the underlying array ‘s size ever reach higher than 0.5*10^8 though?
Seems that it’s already kinda big
Ofc, but it is still weird
Maybe your ram is full
No
Did you try testing more values there?
If you pinpoint the exact value at which the cliff happens
Could help
You could binary search the location of the cliff based on the time it takes to make your sparse table 😳
Ye I'll more into it when I have time
Btw the reason I opened this is also because of my sparse table benchmark https://github.com/pypy/pypy/issues/5314
I realized that .bit_length() in pypy is O(64)
Which kind of defeats the purpose of a sparse table
I think this one algorithm is pretty cool for bit length
table3 = [0, 1, 39, 2, 15, 40, 23, 3,
12, 16, 59, 41, 19, 24, 54, 4,
-1, 13, 10, 17, 62, 60, 28, 42,
30, 20, 51, 25, 44, 55, 47, 5,
32, -1, 38, 14, 22, 11, 58, 18,
53, 63, 9, 61, 27, 29, 50, 43,
46, 31, 37, 21, 57, 52, 8, 26,
49, 45, 36, 56, 7, 48, 35, 6,
34, 33]
def bitlength64_4(x):
# Assumes x >= 0
x |= x >> 1
x |= x >> 2
x |= x >> 4
x |= x >> 8
x |= x >> 16
x |= x >> 32
return table[x % 67]
I haven't seen it anywhere online
Even though it is so clean
The end goal is to compute the bit length
And then do a look up
Yes yes I got that
Im talking about the OR operations
You transform 0101010 in 0111111?
The or operations is to make x be on the form 2^i - 1, yes
So after the or operations, there are only 64 possible values of x (assuming the starting x is a int64)
There are some values that are useless in the table then i assume
Or is it length 64
it is 66 long
Shouldn’t it be 67 long?
Or maybe no one in the 64 values end up on index 66?
I ended up being lucky
yep
Did you check if you were luckier with 71? 😅
Theoretically the stars could align
And give you a 64 long table
Even if that’s unlikely
You are trying to win the lottery, I see
Havent tried that
If we really want to optimize, then this is probably better
import __pypy__
mult = __pypy__.intop.int_mul
# Magic constant for de bruijn sequence with the "Dickinson property" taken from https://cstheory.stackexchange.com/questions/19524/using-the-de-bruijn-sequence-to-find-the-lceil-log-2-v-rceil-of-an-integer
magic = -0x3f08a4c6acb9dbd
table2 = [0, 6, 7, 58, 8, 51, 59, 46,
9, 55, 52, 33, 60, 47, 40, 28,
10, 56, 44, 53, 38, 36, 34, 20,
61, 48, 25, 41, 22, 29, 16, 11,
63, 5, 57, 50, 45, 54, 32, 39,
27, 43, 37, 35, 19, 24, 21, 15,
62, 4, 49, 31, 26, 42, 18, 23,
14, 3, 30, 17, 13, 2, 12, 1]
def bitlength64_3(x):
# Assumes x >= 0
x |= x >> 1
x |= x >> 2
x |= x >> 4
x |= x >> 8
x |= x >> 16
x |= x >> 32
return table2[mult(x, magic) >> 58]
This one is kind of standard. There is lots of info about this method online
I never heard of it
The idea is similar
Is it faster?
Probably yes
On your GitHub issue you mention other methods are a bit faster than your in house one
I dont know if the compiler can be smart about % 67
There might be some funny method to quickly compute it
Maybe you can find a series of operations to do on your 64 values
That map them to an array of reasonable size
That’s faster than a mod
Thats the goal, yes
I don’t really know where to start to find it though
% 67 was my idea
The one unfortunate thing is that you wont be able to use only bit operations
Like you cant do something like x & (64 - 1)
You need something that kind of mixes up the bits
The de bruijn method uses multiplication mod 2^64 to do it
And my method uses % 67
What’s the CPython algorithm
I don’t understand it
It looks like depending on the compiler the algorithm is different
CPython does the sensible thing of using count leading zeros
Which is built into gcc/clang/msvc
Isn’t that O(64)
If your integer is 0
No like, it uses something built in. True O(1) on most systems
Ah ok
If you try to compile on some other compiler, then it falls back to
const int BIT_LENGTH_TABLE[32] = {
0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
};
int msb = 0;
while (x >= 32) {
msb += 6;
x >>= 6;
}
msb += BIT_LENGTH_TABLE[x];
return msb;
Yeah that one I couldn’t understand
This case should practically never ever happen
Oh nevermind I think I understand
Probably not as good as the ones proposed in the pypy issue because of the jmp statement it will generate?
Obviously using the built in count leading zeros is better
No i meant what about this and de bruijn version
Are you asking if de bruijn is O(64)?
What no
Im just wondering if the algorithm CPython falls back to on other compilers is any good
Their algorithm is not great (but better than what pypy currently uses)
But like it really doesn't matter
They could tbh. But I think the other ones I posted should be better.
PyPy's current algorithm for bit length is probably the slowest alg out there (except for algorithm intentionally written to be slow)
Just try to benchmark it yourself
import __pypy__
mult = __pypy__.intop.int_mul
# Magic constant for de bruijn sequence with the "Dickinson property" taken from https://cstheory.stackexchange.com/questions/19524/using-the-de-bruijn-sequence-to-find-the-lceil-log-2-v-rceil-of-an-integer
magic = -0x3f08a4c6acb9dbd
table2 = [0, 6, 7, 58, 8, 51, 59, 46,
9, 55, 52, 33, 60, 47, 40, 28,
10, 56, 44, 53, 38, 36, 34, 20,
61, 48, 25, 41, 22, 29, 16, 11,
63, 5, 57, 50, 45, 54, 32, 39,
27, 43, 37, 35, 19, 24, 21, 15,
62, 4, 49, 31, 26, 42, 18, 23,
14, 3, 30, 17, 13, 2, 12, 1]
def bitlength64_3(x):
# Assumes x >= 0
x |= x >> 1
x |= x >> 2
x |= x >> 4
x |= x >> 8
x |= x >> 16
x |= x >> 32
return table2[mult(x, magic) >> 58]
vs
def bit_length_naive(x):
cnt = 0
while x:
x >>= 1
cnt += 1
return cnt
My benchmark shows bit_length_naive runs 27 times slower than bitlength64_3 for numbers around 10^18.
even for small numbers, it's significantly faster
>>>> timeit.timeit("bit_length_naive(10)", setup="from __main__ import bit_length_naive", number=10000000)
0.13675410000000454
>>>> timeit.timeit("bitlength64_3(10)", setup="from __main__ import bitlength64_3", number=10000000)
0.004600799999991523
>>>> 0.1367541/0.0046008
29.723982785602498
>>>>
It should shine the most if given random numbers with bitlength 63
yeah obviously
I have tried every number up to 1 000 000 out of curiosity (i know the probability decreases massively as the mod increases but you never know), no one fits 🙂
I guess I could've stopped at 127
lol
no lottery win 
but my computer is faster than my brain
argh I wish I had the emojis you have
I was thinking about that speedboat thing
to complement that sentence
and this one for obvious reasons
why do you have so many of them
People made it in 2019
Then whenever someone made a new server they included it
There are lots of these thinking emojis

The mikethonk one was made when MikeMirzayanov told people to think wider on codeforces
So we made a wider version of 
lmfao
It's still undergoing a very careful investigation
An issue tracking system like github is definitely nicer
I can upvote though
class Solution(object):
def minimumDifference(self, nums, k):
"""
:type nums: List[int]
:type k: int
:rtype: int
"""
n = len(nums)
left = 0
right = 1
min_diff = float('inf')
if n == 1:
return 0
while right < n :
if right == n - 1:
# if right pointer is at the end of the list
# reset left back
diff = max(nums[right], nums[left]) - min(nums[right], nums[left])
min_diff = abs(min(min_diff, diff))
left += 1
right = left + 1
continue
diff = max(nums[right], nums[left]) - min(nums[right], nums[left])
right +=1
return min_diff
for leet code
https://leetcode.com/problems/minimum-difference-between-highest-and-lowest-of-k-scores/
what is the time complexity this is my solution how is this O(n^2)
its in one pass
Oh nvm
im not sure
but the efficient way is to sort and use two pointer
It isn't n^2?
this doesn't solve the problem unless k=2
Its up now.
About opening up a similar issue on gcc bugzilla

Neither of us could reproduce on gcc though, correct?
Was it ConfusedReptile that could?
Yes, but only tried it on windows
Ah yes.
I could try it through wsl
You could as well I guess
We should also make sure the assembly code doesn’t change
Except for the +1 added obviously
According to what I can see there, using -O0 should work
But both gcc and clang optimize it all away if using >= O1. But I think I found something that works
#include <string.h>
#define REPEATS 4000000
char src[1000000]; // any allocation of a big array works
int main() {
for (int i = 0; i < REPEATS; ++i) {
// Call memcpy with dst - src = +-(power of 2) + (something small > 0)
memcpy(src + (1 << 18) + 1, src, 8193);
}
return 0;
}
Here is something that should work on clang or gcc, independent of O flag. According to godbolt memcpy is always called
Btw if I understand it correctly, both clang and gcc should be calling the same exact memcpy from glibc. So there shouldn't be any difference between the two
https://sourceware.org/glibc/wiki/FilingBugs So this is probably where to report it
I wonder if there is an amd bug tracker too. All I could find was emailing their support
I just tested on wsl with ubuntu. I'm getting that sweet factor 40 slowdown with both clang and gcc.
But my wsl python3 (version 3.12.3) does not have the copy slowdown bug
Is 3.12.3 too old to have the bug?
https://github.com/python/cpython/pull/91482 this pr is the one that introduced the algorithm that triggers the slowdown bug
It looks like 3.12 was the one who introduced it. But I'm not getting the slowdown. This is the 3.12 branch https://github.com/python/cpython/blob/f66c75f11d3aeeb614600251fd5d3fe1a34b5ff1/Include/internal/pycore_list.h#L59-L70
Include/internal/pycore_list.h lines 59 to 70
// Repeat the bytes of a buffer in place
static inline void
_Py_memory_repeat(char* dest, Py_ssize_t len_dest, Py_ssize_t len_src)
{
assert(len_src > 0);
Py_ssize_t copied = len_src;
while (copied < len_dest) {
Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
memcpy(dest + copied, dest, bytes_to_copy);
copied += bytes_to_copy;
}
}```
I could reproduce on 3.12.5
on windows?
but what about wsl
Try the string example
For some reasons I don’t understand
This one is easier to reproduce
This definitely reproduces it on wsl
But I'm unnable to get python/pypy on wsl to bug out for some reason
Maybe the breaking change was introduced earlier
Ok that one is death on wsl
python3 testare2.py
0.43964314460754395
6.476124286651611
Yep
Why is it different?
I don’t know
No reference counting that’s the only difference
If I remember correctly
Looking at the source code
But maybe the list change came about AFTER this one
Can you upgrade your WSL python version?
Can you also try that #algos-and-data-structs message ?
This one is a memmove bug
Could it be that the x in version matters, like 3.12.x
When you insert at index 1 it moves stuff
Yep
That PR was closed in July 2022 and 3.12 was released end of 2023 so idk
I’ll try in a bit btw
We know enough know to where I doubt that it is actually calling memcpy in 3.12.3
I have 3.12.3 on my Ubuntu
What a coincidence
I can't even reproduce the string example ... (neither with 3.12.3 neither with 3.14.0 preview)
@stiff quartz try this
No slowdown (on top is +1, bottom is +100)
Makes sense
For reference:
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
(It's a virtual machine though, idk if that's relevant)
It might be. Seems like you dont have the bug at all there
I dont know what the VM does
Yeah me neither
I find it stranger that strings slow down for me, but not lists
It might just be that the string change was older
you have everything here
oh wait no
Im sure of that
you have everything for 3.13 but not for 3.12
ok here you have it all
no mention of gh-91247
It is under Python 3.12.0 alpha 1
so 3.12.3 should have memcpy
why is it not called gh-91247
There are some core developers here, maybe it'd be easier to ask them lol
probably
Try the tuple example
Supposedly the change touched both and I could reproduce both
It’s on the issue that you opened somewhere
python3 -m timeit -r 20 "((0,) * 2**18) * 14"
Run something like this
And with +1 and +100
If that works and the list doesn’t work 🤯
Just a thing
Remember that the calls to memcpy working with strings is different
Cause strings are bytes
And lists works off 8 bytes elements instead
ye exactly
Thats actually the reason why the string example is so much easier to reproduce
You’re in the same position as this one random compiler
+16 is slow, +32 is fast
Which is same as msvc
List should be slow
yes
An int is 8 bytes right
4
You are thinking about it wrong
A python list only stores pointers
It doesnt care what it is storing
Ah yes, but the pointer is 8 bytes
a list is an array of pointers
on 64-bit, yea
So i was right for the wrong reason, this still doesn’t make any sense
Are you sure +8 is slow? Maybe +7 and +9 are but not +8 on your computer
+8, yes
It’s hard being the worst student in class 😳
Because +16 is bad for me, and +32 isn't
import time
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 100)) * 32
print(time.time() - start_time)
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 1)) * 32
print(time.time() - start_time)
This is a killer
0.9612233638763428
15.320869445800781
But change to * 64 and the difference get cut in half
1.9017343521118164
16.154651641845703
The first copy is at a distance 2^k+8
The second is at a distance of 2^j+16
No?
The third +32
So you get out of the bad zone
Maybe fast enough
With a list
But it’s super slow to get out of the bad zone with a string
Since it goes 1 2 4 8 16 32
Yeah exactly
By this logic, shouldn't * 4 be best for me for lists?
So maybe for the array benchmark you should just do *2
Ah yes *4
Since 16 is bad
Yes try *4
no it is just not triggering I think
Actually it should be x40
Can you try the tuple example?
I see no change with tuples
And the list and tuple changes were shipped together
Are we reading the changelog wrong lol
Do you have another version of python installed on your wsl ?
Also, you should try your GCC benchmark with a bigger count
Maybe it doesn’t trigger if the count is too large
We have lower bounds and always assumed there was no upper bounds
what are you checking? repro'ing the different slowdowns on wsl?
Here is a summary:
This C code triggers slowdown bug for me in:
msvc (32 bit and 64 bit)
wsl gcc
wsl clang
but no slowdown in
mingw gcc
#include <string.h>
#define REPEATS 4000000
char src[1000000]; // any allocation of a big array works
int main() {
for (int i = 0; i < REPEATS; ++i) {
// Call memcpy with dst - src = +-(power of 2) + (something small > 0)
memcpy(src + (1 << 18) + 1, src, 8193);
}
return 0;
}
This python script triggers for me in all suffciently new CPython and PyPy versions (both on windows and wsl).
import time
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 100)) * 32
print(time.time() - start_time)
start_time = time.time()
for _ in range(500000):
("s" * (2**12 + 1)) * 32
print(time.time() - start_time)
The weird thing is that the list equivalence does not trigger on wsl + python (version 3.12.3). Switching to *4 does not fix that.
The switch of list copy algorithm was supposedly added in 3.12.0 alpha. But since I'm unnable to trigger the slowdown in 3.12.3, I'm starting to think it might not be in 3.12.3.
did you try other versions?
How do I do that in wsl?
@haughty mountain Btw have you tried running the string example
The one just above with *32
Hmm in >= 3.13 they added this to the repeat code
I don't know what Py_BEGIN_CRITICAL_SECTION does, but it was not there in 3.12
you can try pyenv
I think this has to do with the gil removal
https://github.com/python/cpython/issues/129069 it is listed as a bug report
yeah, it mentions "free threading"
which is the no gil thing
it's basically grabbing a mutex for the object
So you dont think the critical section thing should matter?
What is dest a char*?
Ah it didn’t link well
On line 56
That’s what’s used to copy the array right
Yes sorry
https://github.com/python/cpython/blob/d7e12a362a2a4a4b7d52a343ab5940be2cbcc909/Objects/bytesobject.c#L3713C1-L3733C2 this should be the one for (byte)-strings
Yep
That’s the one I linked at least
@regal spoke I'm the same as you on WSL
lhott@JasonTower:~/tmp$ python3 testString.py
0.7817449569702148
6.544427871704102
lhott@JasonTower:~/tmp$ vim testArray.py
lhott@JasonTower:~/tmp$ python3 testArray.py
0.8847634792327881
0.8824248313903809
Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] on linux
there is something called arrays too
I only knew about bytearrays
but anyway it was lists
lhott@JasonTower:~/tmp$ cat testArray.py
import time
start_time = time.time()
for _ in range(500):
([0] * (2**18 + 100)) * 14
print(time.time() - start_time)
start_time = time.time()
for _ in range(500):
([0] * (2**18 + 1)) * 14
print(time.time() - start_time)
lhott@JasonTower:~/tmp$
what if they forgot to add it in the update
what is the password
of wsl
i'm trying to install 3.13
it's asking me a password
ok i found it
i just cycled through the passwords i usually use
So I did already use WSL?
I'm getting old, my memory is failing me
ok i'm installing a newer version of python
we'll be able to verify that very quickly
I'm technically using wsl2
you can downlaod the non-compiled source code of each minor version
here
if you really wanna verify
I'm currently uppdating my ubuntu version on wsl.
-# then rename it to testList.py
🚔
i can reproduce the slowdown on my WSL with GCC
even when count = 2^18
actually
nvm I'm stupid I can reproduce no matter the value of count if >= 8193
can or cant?
what
but do you get it with gcc
if you get it with gcc but not with Python there are things that I don't understand
I'm on
lhott@JasonTower:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.2 LTS
Release: 24.04
Codename: noble
what version are you on?
Seems, I no longer get the slowdowns.
On wsl
root@MegaBj8rn:~/gitproj/slowdown_memcpy# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 25.04
Release: 25.04
Codename: plucky
Could there have been some kind of patch?
