#algos-and-data-structs

1 messages · Page 70 of 1

stiff quartz
#

that is immediately discarded

regal spoke
#

Doesn't matter. Counts still need to be updated

stiff quartz
#

Couldn't this be optimized away when we know it'll be gone?

#

Idk how the interpreter would know I guess

regal spoke
#

I think pypy can be smarter, because pypy has a special optimization for pure integer lists

#

So I think it completely drops the reference counting in that case

#

But to my knowledge, CPython doesn't have any features like that

#

Btw if you read this code, you can see the reference counts being updated

#
        while (src < src_end) {
            _Py_RefcntAdd(*src, n);
            *dest++ = *src++;
        }
#

Here it adds n to the count of *src

#

then places source in *dest, and increments both pointers

stiff quartz
#

why does it

#

*dest doesn't point to *src?

regal spoke
#

(n is number of copies)

regal spoke
stiff quartz
#

yes

#

But why would it reference *src

regal spoke
#

src is the input list

stiff quartz
#

It's itw own independent list

#

It should be able to live with src garbage collected

regal spoke
#

What we are looking at corresponds to B[:n] = A in my code

stiff quartz
#

Ah so dest starts by just pointing at src?

regal spoke
#

Nooo

#

src is the original list

stiff quartz
#

yes

regal spoke
#

dest is the newly allocated list that should be used for output

stiff quartz
#

yes

regal spoke
#

They are two seperate pointers

#

Completely seperate

regal spoke
stiff quartz
#

Ok I understand

#

Yes I understand

regal spoke
#

While doing that, it also increments every elements reference count by n (because in the end, each element will appear n times)

stiff quartz
#

That's cool

#

and it seems so simple when we understand it

#

Wait, so Python on Linux uses GCC?

#

If that's the case, and our bug is consistent with the MSVC I had here

#

This means, that if there is actually a bug with GCC (the one I had here)

#

and if Python on Linux uses GCC's memcpy

#

Then we would know how to trigger the bug on Linux, we have to use 2^18-1

regal spoke
stiff quartz
#

my bad

#

2^18 instead of 2^18+1 yes

regal spoke
#

You could try if ([0]*2**18)*20 is slow

stiff quartz
#

yeah I'm trying right now

#

so it does use GCC

#

now, can I reproduce the bug

#

probably not

#

that'd be too nice

#

I mean

#

Idk, it's beautiful but suspiciously too beautiful and so it might not just be a Windows bug?

#

@regal path could you try to reproduce on Linux?

regal path
stiff quartz
#

Yes

#

Both are the same

#

I just changed the order in case there was a warmup thing

regal path
#
2**18+1 47.59964680671692
2**18 49.571892976760864
stiff quartz
#

Nice

stiff quartz
regal path
#

I do

stiff quartz
#

Can you run this?

regal path
#
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.742536 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.556608 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.552967 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.555692 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.566795 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.595249 sec
14
stiff quartz
#

ah rip

#

what is wrong with my computer

regal path
#

clang, for good measure

COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.705290 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.583754 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.583081 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.617412 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.659714 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.558543 sec
14
stiff quartz
#

ah welp you're on Linux my bad

regal spoke
#

I posted a suggestion to use this on the issue

#

Idk if they will like it

#

But it should make python's performance far more reliable.

vernal girder
#

Is this graph Hamiltonian? Meaning, can you trace a cycle along the edges which passes through every point exactly once?

main barn
#

hello

stiff quartz
regal spoke
#

Thing is, copying is one of few cases where memcpy performance is really noticeable

stiff quartz
#

It's weird that you can't make the issue happening when you only do [0] * (2^18 + 1)

#

If the issue really lies in memcpy performance

#

That should happen there too

#

ah no, my bad

#

It would only call memcpy with powers of 2 and then with size = 1

stiff quartz
#

because that is a very common thing to do (compared to ([0] * a) * b

regal spoke
#

It is only used if the list being copied has size > 1

stiff quartz
#

Ah yes

#

I remember

regal spoke
stiff quartz
#

Fair enough

#

since fixing it upstream is the best thing

regal spoke
#

My guess is that *2 is badly affected by this bug, but you don't really see it because of the reference counting taking more time

#

Btw you should change your benchmark to just have one src and one dst

stiff quartz
#

You do see it with *2 I think

regal spoke
#

Just make one pair of src + dst of size 2**18+10 and use it for all benchmarks

#

That way the code is easier to read

regal spoke
#

and it will make it more obvious that memcpy is at fault

#

Otherwise one could think that using different arrays is what is causing the performance bug

stiff quartz
#
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COUNT1 262144   // 2^18
#define COUNT2 262145   // 2^18 + 1
#define COUNT3 262154   // 2^18 + 10
#define REPEATS 10000

void benchmark_copy(const int *src, int *dst, size_t count, const char *label) {
    size_t size_bytes = count * sizeof(int);

    clock_t start = clock();
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, size_bytes);
    }
    clock_t end = clock();

    double total_sec = (double)(end - start) / CLOCKS_PER_SEC;
    printf("%s: Copied %zu bytes (%zu ints) %d times\n", label, size_bytes, count, REPEATS);
    printf("%s: Total time: %.6f sec\n", label, total_sec);

    // Prevent optimizing away some stuff
    printf("%d \n", dst[14]);
}

int main(void) {
    static int src[COUNT3], dst[COUNT3];

    for (size_t i = 0; i < COUNT3; ++i) src[i] = (int)i;
    
    // Do all benchmarks twice just in case
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src, dst, COUNT3, "COUNT3 (2^18 + 10)");
    benchmark_copy(src, dst, COUNT3, "COUNT3 (2^18 + 10)");

    return 0;
}
#

You meant that?

regal spoke
stiff quartz
#

Using the same array doesn't work

regal spoke
stiff quartz
#

And I get the same results for all benchmarks

#

Is there something wrong with it? I don't think so

regal spoke
#

Make your computer go to sleep thinkies

stiff quartz
#

the problem is deeper than that

#

Because I still have the other executable

#

Lemme check

#

I'm going crazy

#

So these are the results

#

I compile both with MSVC

C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCSAMEARRAY.exe testSameArray.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30159 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

testSameArray.c
Microsoft (R) Incremental Linker Version 14.29.30159.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:mainMSVCSAMEARRAY.exe
testSameArray.obj

C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCDIFFARRAYS.exe testDiffArrays.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30159 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

testDiffArrays.c
Microsoft (R) Incremental Linker Version 14.29.30159.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:mainMSVCDIFFARRAYS.exe
testDiffArrays.obj
#

And then, the results are:

C:\Users\lhott\Desktop\temp>mainMSVCSAMEARRAY.exe
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.960000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.951000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.948000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.951000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.950000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.947000 sec
14

C:\Users\lhott\Desktop\temp>mainMSVCDIFFARRAYS.exe
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.216000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.216000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 2.903000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 2.851000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.231000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.221000 sec
14
#

Same array: it's homogeneous BUT SLOW FOR ALL??????

regal spoke
#

How lovely

stiff quartz
#

🙂

#

Great, I got a warning for trying to share a .c file

#

for you to run with gcc

regal spoke
stiff quartz
#

👀

#

isn't codeforces fully running with GCC?

#

But the more we do the less it makes sense, so is that a good idea

regal spoke
stiff quartz
#

How are you going to memcpy COUNT3 integers

#

if you give it src2 and dst2

#


We should look at the old source code when it did O(n) memcpy

#

And try to hack the old versions of CPython/PyPy

regal spoke
regal spoke
regal spoke
#

Why? I've got no clue

stiff quartz
#

And we’d have the same exact code

stiff quartz
#

Wouldn’t we?

#

At least that’s what I, from my limited C knowledge, would think happens

regal spoke
#

You could also just make some global arrays

#

Adding volatile might work too

stiff quartz
#

so I did:

    benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");

    benchmark_copy(src3, dst3, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src3, dst3, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src3, dst3, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src3, dst3, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
#

and we get:

COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.222000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.255000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 3.900000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 3.884000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.255000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.268000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.344000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.381000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.368000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 0.312000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.394000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.336000 sec
14

still have the slowdown I had, but everything is homogeneous
but less of a slowdown than earlier

#

worth noting i put my computer to sleep since then ducky_beer

regal spoke
#

Try

    benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src1, dst1, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");
    benchmark_copy(src3, dst3, COUNT3, "COUNT3 (2^18 + 10)");

    benchmark_copy(src2, dst2, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src2, dst2, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src2, dst2, COUNT2, "COUNT2 (2^18 + 1)");
stiff quartz
#

You're going to laugh

#
C:\Users\lhott\Desktop\temp>CL /Fe:mainMSVCDIFFARRAYSPAJENEGOD.exe testDiffArraysPajenegod.c
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30159 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

testDiffArraysPajenegod.c
Microsoft (R) Incremental Linker Version 14.29.30159.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:mainMSVCDIFFARRAYSPAJENEGOD.exe
testDiffArraysPajenegod.obj

C:\Users\lhott\Desktop\temp>mainMSVCDIFFARRAYSPAJENEGOD.exe
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.361000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 0.365000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 7.027000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 4.624000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.258000 sec
14
COUNT3 (2^18 + 10): Copied 1048616 bytes (262154 ints) 10000 times
COUNT3 (2^18 + 10): Total time: 0.276000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 3.960000 sec
14
COUNT1 (2^18): Copied 1048576 bytes (262144 ints) 10000 times
COUNT1 (2^18): Total time: 4.156000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 4.127000 sec
14
COUNT2 (2^18 + 1): Copied 1048580 bytes (262145 ints) 10000 times
COUNT2 (2^18 + 1): Total time: 4.214000 sec
14
regal spoke
#

Clearly we have not understood what is causing this bug

stiff quartz
#

I can't find anywhere online where you can run C code on an online platform that lets you choose the compiler

#

godbolt lets you choose but you only see the assembly code, you can't run it, it seems

regal spoke
#

Windows is very uncommon for servers to run

#

Pretty much codeforces is the only one out there

stiff quartz
#

I tried on codeforces

#

No bug

#

Because they use gcc

regal spoke
#

You can pick msvc

#

At least you used to be able to

#

Maybe they removed it

stiff quartz
#

I couldn’t

#

Or I’m blind

#

One of the two

regal spoke
#

Seems they have removed it

#

It used to be possible

stiff quartz
#

😮‍💨

regal spoke
#

Here is something you could test. Allocate a bigger src and dst

#

Then play around with benchmark_copy(src + i, dst + j, COUNT1, "COUNT1 (2^18)"); for different values of i and j

#

It could be that the size given to memcpy isn't the issue

stiff quartz
#

Will do when I come back

stiff quartz
#

@regal spoke do you have Visual Studio?

regal spoke
stiff quartz
#

Nevermind then

#

But it comes with msvc

regal spoke
#

I have Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x86

#
$ ./slowtest.exe
Benchmark COUNT1 (2^18): 0.621000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 0.622000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.356000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.356000 seconds over 10000 repeats
Benchmark COUNT3 (2^18 + 10): 0.682000 seconds over 10000 repeats
Benchmark COUNT3 (2^18 + 10): 0.684000 seconds over 10000 repeats
#

So I'm able to reproduce the slowdown locally

#

@stiff quartz Try this

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COUNT1 262144   // 2^18
#define COUNT2 262145   // 2^18 + 1
#define COUNT3 262154   // 2^18 + 10
#define REPEATS 10000

void benchmark_copy(const int* src, int* dst, size_t count, const char* label) {
    size_t size_bytes = count * sizeof(int);
    
    clock_t start = clock();
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, size_bytes);
    }
    clock_t end = clock();
    
    double elapsed_secs = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Benchmark %s: %.6f seconds over %d repeats\n", label, elapsed_secs, REPEATS);
}

int main() {
    static int src[COUNT2];
    static int dst[COUNT2];

    for (size_t i = 0; i < COUNT2; ++i) src[i] = i;

    // Do all benchmarks twice just in case
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");

    return 0;
}
stiff quartz
#

Because a guy couldn’t reproduce

#

On MSVC github

regal spoke
#

Maybe we should get latest version

stiff quartz
#

Im doing that rn

#

Microsoft has a weird way of doing things

#

With their Visual Studio Installer

#

It’s always so confusing if you’re not familiar with it

#

Which I’m not

regal spoke
#
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COUNT 100000
#define COUNT1 262144   // 2^18
#define COUNT2 262145   // 2^18 + 1
#define COUNT3 262154   // 2^18 + 10
#define REPEATS 10000

void benchmark_copy(const int* src, int* dst, size_t count, const char* label) {
    size_t size_bytes = count * sizeof(int);
    
    clock_t start = clock();
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, size_bytes);
    }
    clock_t end = clock();
    
    double elapsed_secs = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Benchmark %s: %.6f seconds over %d repeats\n", label, elapsed_secs, REPEATS);
}

int main() {
    static int src[COUNT2];
    static int dst[COUNT2];
    
    for (int i = 0; i < COUNT2; ++i) src[i] = 0;

    // Do all benchmarks twice just in case
    benchmark_copy(src, dst, COUNT, "COUNT (100k)");
    benchmark_copy(src, dst, COUNT, "COUNT (100k)");
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT1, "COUNT1 (2^18)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");
    benchmark_copy(src, dst, COUNT2, "COUNT2 (2^18 + 1)");

    return 0;
}

Btw the slowdown doesn't seem to have anything to do with the count you give memcpy

#

For this one slowdown hits on all 3

stiff quartz
#

Ok

#

I downloaded a few recent versions

#

includin the latest

#

The bug is reproducible on all MSVC versions

#

Which makes sense because Python 3.14 preview is compiled with the latest MSVC version anyway.

stiff quartz
# regal spoke For this one slowdown hits on all 3

if src and dst are of size COUNT2

C:\Users\lhott\Desktop\CP\c\untitled>mainMSPajenegod.exe
Benchmark COUNT (100k): 0.786000 seconds over 10000 repeats
Benchmark COUNT (100k): 0.788000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 2.057000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 2.060000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.055000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 2.065000 seconds over 10000 repeats
#

if src and dst are of size COUNT2+100

#
C:\Users\lhott\Desktop\CP\c\untitled>mainMSPajenegod.exe
Benchmark COUNT (100k): 0.064000 seconds over 10000 repeats
Benchmark COUNT (100k): 0.067000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 0.162000 seconds over 10000 repeats
Benchmark COUNT1 (2^18): 0.162000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 0.163000 seconds over 10000 repeats
Benchmark COUNT2 (2^18 + 1): 0.168000 seconds over 10000 repeats
regal spoke
#

Here is something you could try

#
#include <string.h>

// #define BADSIZE (262145)
#define BADSIZE (262145 - 1)
#define REPEATS 100000

int main() {
    static int src[BADSIZE];
    static int dst[BADSIZE];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, 100000);
    }

    return 0;
}
#

This one should run fast

#

but switch to

#

#define BADSIZE (262145)

#

and it should run sloww

stiff quartz
#

so it's an allocator problem

#

I guess

regal spoke
#

no

stiff quartz
#

like the memory allocated is bad/misaligned?

regal spoke
#

I think the values dst and src is what makes memcpy go bananas

stiff quartz
#

Can we find other places in CPython source code that call memcpy?

stiff quartz
#

I guess that's a smaller example but it doesn't give us any particular insight right?

#

Idk if that's the case since they said the bug hits clang on Linux and we could not reproduce on clang on Linux

regal spoke
#

I dont have intel

stiff quartz
#

I don't either

#

and I can't try the person's example, I don't have the benchmark.h

regal spoke
#
#include <string.h>
#define BADSIZE ((1 << 15) | 24)
#define REPEATS 100000

int main() {
    static char src[BADSIZE];
    static char dst[BADSIZE];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, 25000);
    }

    return 0;
}

Try this one

#

About the 15, seems any number works

stiff quartz
#

you didn't define REPEATS

regal spoke
stiff quartz
#

so that should be a BADSIZE

#

what's a good size

#

if anything is bad around 15

regal spoke
#

It seems that whatever triggers it needs BADSIZE between 1 to 24 bytes above a power of two

#

Thats the only rule

woven ferry
#

hi

stiff quartz
#

((1 << 15) | 24) = 44 ms

regal spoke
#

try rerunning the program

#

many times

stiff quartz
#

I multiplied REPEATS by 10

regal spoke
stiff quartz
regal spoke
#

note that I have char and not int

stiff quartz
#
#include <string.h>
#define BADSIZE ((1 << 15) | 128)
#define REPEATS 1000000

int main() {
    static char src[BADSIZE];
    static char dst[BADSIZE];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, 25000);
    }

    return 0;
}
#

and

#
#include <string.h>
#define BADSIZE ((1 << 15) | 24)
#define REPEATS 1000000

int main() {
    static char src[BADSIZE];
    static char dst[BADSIZE];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(dst, src, 25000);
    }

    return 0;
}
#

With REPEATS multiplied I have 391-396 ms for | 24

#

and 304-305 ms for | 128

stiff quartz
#

I mean

#

I use Measure-Command { ./smallerExample.exe }

#

on Powershell

#

the std seems pretty low

#

when I re-run

#

both are pretty much instantaneous

regal spoke
#

Run what I sent but lower the 24

#

until it slows down

#

For me anything between 1 and 24 slows it down

stiff quartz
#

for 16 I go up to 448-450 ms

#

for (1 << 15) | 1

#

I get 5 seconds

#

lol

regal spoke
#

Go up from 1

stiff quartz
#

2 gives me 4.9 s

#

3 gives me 4.9 s

#

5 gives me 4.9s

#

10 gives me 446 ms

#

7 gives me 449 s

#

6 gives me 441 ms

#

so it seems that for me anything between 1 and 5 slows it down

#

4 gives me 4.9 s

#

..

regal spoke
#

I'm assuming 8 is also slow?

stiff quartz
#

Yes.

#

9 isn't.

#

16 isn't.

#

Oh I had tested 16 already. Well it definitely isn't.

#

Has a memcpy as well

#

So we can probably do something with dictionaries in Python

regal spoke
#

@stiff quartz try this

#include <string.h>

#define BADSIZE ((1 << 15) | 1)
#define REPEATS 100000


int main() { 
    char src[1000000];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + BADSIZE, src, 25000);
    }

    return 0;
}
#

Seems that it might just be that dst - src = slightly larger than a power of 2 is what triggers the bug

regal spoke
stiff quartz
#

That’s almost a factor of 20

#

Crazy when you think about it

stiff quartz
#

@regal spoke Other parts of Python are affected

regal spoke
stiff quartz
#

Factor 10-20

#

This is more than what we had with lists

halcyon plankBOT
#

Objects/unicodeobject.c line 12782

unicode_repeat(PyObject *str, Py_ssize_t len)```
regal spoke
#

Oh it is because it doesnt need to do any ref courning for strings

halcyon plankBOT
#

Objects/bytesobject.c line 3713

_PyBytes_Repeat(char* dest, Py_ssize_t len_dest,```
stiff quartz
#

And that has a memcpy

#
        while (copied < len_dest) {
            Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
            memcpy(dest + copied, dest, bytes_to_copy);
            copied += bytes_to_copy;
        }

The same log2(n) one

#

!e

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 14
print(time.time() - start_time)
halcyon plankBOT
stiff quartz
#

!e

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 14
print(time.time() - start_time)
halcyon plankBOT
stiff quartz
#

@regal spoke I can reproduce the slowdown with Discord bot?!!!

haughty mountain
#

!e

import platform
print(platform.libc_ver())
halcyon plankBOT
stiff quartz
#

This is a completely unrelated bug then, the list bug you found was Windows specific, that string bug seems to work on Linux

#

On a random online compiler:

haughty mountain
#

!e

import platform

print(platform.platform())
print(platform.machine())
halcyon plankBOT
regal spoke
#

So it doesn't seem to be entirely hardware related either

stiff quartz
#

Honestly at this point I don't know

#

Isn't that ironic

#

!e

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 14
print(time.time() - start_time)

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 14
print(time.time() - start_time)
halcyon plankBOT
stiff quartz
#

!e

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 14
print(time.time() - start_time)

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 14
print(time.time() - start_time)
halcyon plankBOT
stiff quartz
vocal gorge
#

wild, I'm getting 0.46 vs 6.38 on linux, Python 3.10.12 GCC 11.3.0. Same for 3.12 GCC, and same for uv-installed 3.11.3 and 3.13.5 (Clang 20.1.4).

#

so neither version nor compiler seem to matter much.

#

...my CPU is an AMD. On a different computer (Windows, Intel) there's no slowdown. As the last message in that thread suggests, it may be CPU-dependent.

stiff quartz
#

What does uv-installed mean?

stiff quartz
vocal gorge
#

I tested the string one. Presumably they are both caused by malloc

vocal gorge
halcyon plankBOT
#

:white_check_mark: Your 3.14 pre-release eval job has completed with return code 0.

001 | Power of 2: 0.1374994833022356
002 | Not power of 2: 3.3911833791062236
regal spoke
#

Could be that there are different issues

#

Seems people are mixing memset stuff and memcopy

stiff quartz
stiff quartz
regal spoke
#

@stiff quartz Why do you think memcpy issue is not windows specific?

#

Have you been able to make memcpy slow in C without using windows?

stiff quartz
#

Thèse are the only reasons I have for now why I think memcpy is not windows specific

regal spoke
stiff quartz
#

I probably shouldn’t blame memcpy - I don’t see anything else in CPython source code for strings that could be the cause of this though

regal spoke
#

I think confirmation is important

stiff quartz
#

Yes, I haven’t been able to / haven’t had the time to dig deeper

regal spoke
#
#include <string.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + (1 << 15) + 1, src, 4096);
    }

    return 0;
}

On my computer, this very precisely triggers the bug

stiff quartz
#

Has anyone run this on Linux?

regal spoke
#

I've tried with gcc on my windows computer, and a friend on their linux computer (with AMD processor). Nothing interesting happens, they both run equally fast

stiff quartz
#

Could they reproduce the Python string slowdown on their Linux computer? Not all Linux computers can it seems, maybe only the Linux computers that show the Python string slowdown would show anything with the C memcpy benchmark.

regal spoke
#

Thing is, my computer is clearly affect by these slowdown. But I'm not able to make gcc slowdown

stiff quartz
#

But your computer uses MSVC for Python

regal spoke
#

yes

stiff quartz
#

Why would you expect gcc to show anything

regal spoke
#

The question is, is the issue with memcopy windows specific or not

stiff quartz
#

Ah, I see

regal spoke
#

You are claiming it is not

stiff quartz
#

My gcc benchmarks on my computer make absolutely no sense, I’ve given up on trying to understand them.

stiff quartz
#

I edited it my bad

regal spoke
stiff quartz
regal spoke
#

This one does run slow, but only if I don't have any O-flags

#

With O2 it runs fast

stiff quartz
#

Isnt O2 the default

#

I didnt set any flags

regal spoke
#

no

stiff quartz
#

Argh

regal spoke
#

No optimization is on by default

stiff quartz
#

At least I learn a lot about C & C++ & reading CPython code

#

ah you're right

#

I'm back to having a normal-behaving GCC with the optimization flags

stiff quartz
#
#include <string.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + (1 << 15) + 1, src, 4096);
    }

    return 0;
}

This is supposed to be the slow one, correct?

regal spoke
#

yes

#

It should be quire apperent if it is slow or not

#

like 0.1s vs 4s

stiff quartz
#

it's fast

#

with O2

#

dsfkfiqsjdfiqjds

#

it's fast without

#

what is that

#

AH no, I'm the same than the guy on the C issue

#

I have to compile for x86 to see the slowdown (with or without O2)

#

Are you compiling for x64 or x86

regal spoke
#

Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x86

stiff quartz
#

Yes, you probably wouldn't see anything either on x64

#

can you try?

stiff quartz
regal spoke
#

This one is slow for me on Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x64

stiff quartz
#

Yep

#

Why isn't my initial one working

#

with x64

#

It's also 2^18+1

#

and i'm clearly moving more than 2*4096 +1

regal spoke
#

Unlike your test

stiff quartz
#

What do you mean sorry

regal spoke
#

src + (1 << 18) + 1, src I specially make the pointers be at a bad distance for dst and src

#

You get whatever the compiler decides to give you

stiff quartz
#

mmhh

#

I see

regal spoke
#

I really think this is the trigger

stiff quartz
#

lemme test

#
#include <string.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];
    
    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + (1 << 18) + 1, src, 2*4097 + 16);
    }

    return 0;
}

This is slow

#

16 => 161 is slow too

regal spoke
#

the reason I put 2*4096 + 1

#

is that anything smaller seems to run fast

#

you could probably put 100k if you wanted

stiff quartz
#

In my initial case, dst and src are two distinct objects

#

the pointers are far from eacher other

#

at least by a distance equal their size

regal spoke
#

Here is the thing, I suspect the reason why copying in python is slow

#

Is that the pointers to dst and src is (power of 2) times n

stiff quartz
#

I disagree, if we copy list or strings of size 2^n + 1
They are initially at a distance 2^n + 1 then 2^(n+1) + 2

regal spoke
#

yes

#

and

#

(1 << 18) + 1 try changing this +1 to +2

#

still slow as heck

stiff quartz
#

yes but then my initial case

#

src and dest are far, and like in your example, distance isn't a power of 2

regal spoke
stiff quartz
#

do you suspect the something small is the problem?

regal spoke
#

Read my comment earlier on stl github

#

I think that if dst - src = +-(power of 2) + something small (thats > 0)

#

then the bug triggers

regal spoke
#

If you add the same value x to both dsc and src, then the bug always happens

stiff quartz
#

Reading your comment, I thought you wanted dst and src to be of a distance power of 2 + 1 with that 1 being important

regal spoke
#

Well to guarantee a trigger, yes.

#

But I've gotten it to work with something as big as +24

stiff quartz
#

Ah yes that was our discussion

#

I could only get up to 7-8

regal spoke
#

yes

#

Thats why I only said +1

stiff quartz
#

I definitely need to add a lot of context

#

to the issue

#

I initially didn't even mention flags

#

and didn't mention the difference between x86 and x64

regal spoke
stiff quartz
#

Ah yes lemme check on gcc

regal spoke
#

The problem is that my gcc seems to just ignore everything

stiff quartz
#

do you mean it optimizes away the copy?

regal spoke
#

yes

#

Everything is gone

#

I need to add something to force it to not do that

stiff quartz
#

The assembly is just return 0 lol

regal spoke
#

yep

stiff quartz
#

when you set the flag

regal spoke
#

Thats what I'm trying to avoid

stiff quartz
#

what if you add printf(%c, src[40]);

regal spoke
#

Hmm I just tried with O0, and then it doesnt optimize it all away

#

But I sadly dont seem to trigger the slowdown bug

stiff quartz
#

yes my screenshot is from O2

#

What about if we use ints instead of char for your example

regal spoke
#

That wont matter

#

Memcpy doesn't know what it is copying

#

It just works off bytes

stiff quartz
#

Ah yes that's the whole point i forgot

regal spoke
#
#include <string.h>
#include <stdio.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];

    for (int i = 0; i < REPEATS; ++i) {
        memcpy(src + (1 << 18) + 1, src, 4*4096 + 1);
    }

    return 0;
}

Should I maybe just post this on stl issue?

stiff quartz
#

I was rewriting the initial post

#

By editing it

regal spoke
#

ok

stiff quartz
#

To add information about x86 & x64

#

and the flags

#

I think I'm going to erase my example and just say that you found better shorter more consistent examples

#

Because they're really are all that lol

regal spoke
#

I think fundamentally, flags shouldnt matter since that shouldnt affect the internals of memcpy.

stiff quartz
#

Fair enough

regal spoke
#

but 32bit vs 64bit could definitely make a big difference

stiff quartz
#

That's the super important bit that I completely overviewed

#

and to be fair I didn't even know what I was using to compile

#

until the guy said that

#

I'm alson including the cpp version:

#include <iostream>
#include <cstring>

constexpr int REPEATS = 4000000;
constexpr int SRC_SIZE = 1000000;

int main() {
    char src[SRC_SIZE] = {};

    for (int i = 0; i < REPEATS; ++i) {
        std::memcpy(src + (1 << 18) + 1, src, 2 * 4096 + 1);
    }

    return 0;
}
vocal gorge
stiff quartz
#

???

#

That's crazy

vocal gorge
#

As in in both compilers, memcpy(src + (1 << 15) + 1, src, 4096); slows down and memcpy(src + (1 << 15) + 0, src, 4096); doesn't.

regal spoke
#

eeek

stiff quartz
#

Although @vocal gorge is one of the people that could see the slowdowns in Python on Linux

#

So it makes sense @vocal gorge can see the slowdowns in C

vocal gorge
#

Ehh, I suspect that was a pyenv thing

#

whereas this is just C, nothing interesting

#

(My testing approach is gcc -Wall -o test1 test1.c && time ./test1, and similarly for test2, and similarly with clang instead of gcc.)

regal spoke
vocal gorge
regal spoke
#

And if you remove the +1 after 1<<18?

vocal gorge
#

~200ms, on both gcc and clang

regal spoke
#

Ok nice

#

That is even the same amount of slowdown as I get

stiff quartz
regal spoke
stiff quartz
#

well I listed all the versions

#

below

#

I could reproduce the issues on 19.29.30157. It is pretty old but I could reproduce with very similar results on 19.29.30159, 19.38.33135, 19.44.35211, and 19.44.35213 (latest available).

vocal gorge
#

My versions are gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Target: x86_64-pc-linux-gnu Thread model: posix and Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144) if it ends up mattering.

#

I suspect it's not particularly a compiler thing if it's this consistent, though

regal spoke
#

You could also add the note about memmove also having the same exact issue, expect for if the intervals intersect. (This is to be expected since memcpy is basically the special case of memmove with non intersection).
From my testing, if they intersect, memmove does not get the slowdown bug.

#

Btw instead of // this also can be int src[...];, just say something like // Allocate a sufficiently large array

#

That is all it is

stiff quartz
#

I think so

#

Im afraid this will all come down to

#

It’s AMD specific, a part that is closed source

#

Or it’s OS specific and not really fixable

#

Idk

#

I know nothing about syscalls

#

Let alone CPU hardware stuff

regal spoke
#

It sounds more like hardware lvl now

stiff quartz
#

Yes agreed

regal spoke
#

Unless someone finds that it can be triggered on intel too

stiff quartz
#

I have yet to see any of those bugs on intel

regal spoke
#

That said, I think it makes a lot of sense to keep this open for a while

#

But idk

#

btw I dont like char src[1000000]; // this also can be int src[...]; - any allocation of a big array works

#

Using int src[] instead of char src[] would not only make a larger array

#

it would also mean that src + (1 << 18) + 1 goes 4 times further

stiff quartz
#

Does it?

regal spoke
#

yes

stiff quartz
#

When you do src +

#

You count in Ute’s

#

Bytes

#

Does it matter than ints are 4 bytes

regal spoke
#

I've been taught that
A[i] is same as *(A + i)

stiff quartz
#

Huh

regal spoke
#

Its a thing

#

It means you can even switch the order

#

Isnt C lovely

stiff quartz
#

You’re right

regal spoke
#

Because of this, I think using an int array is bad

stiff quartz
#

I thought pointer arithmetic in c was in terms of bytes

regal spoke
#

The count you give memcpy is in bytes

#

But any pointer arithmetic depends on the size of the datatype

#

To keep it simple, just dont use int src[]

stiff quartz
#

Yeah I edited

stiff quartz
#

Good thing I’ve never been asked to code in C or C++ in my life

#

🤡

regal spoke
#

You dont even need to say char is nice because it's one byte so we can get to 2^k+1 distance

stiff quartz
#

Also if it is a AMD issue I don’t even know where this should be discussed

regal spoke
#

It is nice to keep things open

stiff quartz
#

I agree

#

But if we are sure it is

#

I wouldn’t even know what the move would be

#

Im not closing or porting any issue today on the Developer community thing they mentioned ill do it tmro (or you can do it if you want)

stiff quartz
# regal spoke Isnt C lovely

I hate i[a] I know it makes sense that addition is commutative but I hate it I hope I never see someone write code with i[a]

regal spoke
#

You should see my usage of (-1)[A]

#

back in the day

#

Lovely pointer arithmetics

stiff quartz
#

Actually @regal path who couldn’t reproduce anything

#

I assume you don’t have AMD

regal path
#

nope

stiff quartz
#

Boom

#

One step closer

regal path
#

intel 6th and 8th gen

regal path
#

I'm all for pointer witchery but why in gods name would you ever go before start of array with that

#

(<datatype>)a-- should work no?

regal spoke
#

@stiff quartz Btw fix two things. Make 2*4096 + 1 be just 8193. I originally wrote 2*4096 + 1 because I was lazy when I was coding. And probably remove the "char is nice because it's one byte so we can get to 2^k+1 distance"

regal spoke
# regal path but, why?

I think the story was that I implemented an algorithm from a paper for some kind of shortest distance algorithm in 3D for voxels. This was many years back so I dont really remember now.

#

In the algorithm you naturally do pointer arithmetics with lots of ++ on pointers everywhere

#

But I realized in some edge cases that I kind of needed to undo some of the ++ to my pointers

#

My quick fix was to index with a negative number

#

In some sense it was super nice. But it also broke every sensible rule for working with pointers

#

I don't know if I still have the code somewhere

#

I just remember that @haughty mountain really didn't like my code

#

Maybe he remember more what it was about

stiff quartz
#

!e

import timeit
setup_code = '''
ba = bytearray(b"A" * 100000)
insert_bytes = b"B" * (2**14 + 1)
'''
stmt_code = 'ba[:1] = insert_bytes'
timer = timeit.Timer(stmt=stmt_code, setup=setup_code)
print(timer.timeit(number=1500))

import timeit
setup_code = '''
ba = bytearray(b"A" * 100000)
insert_bytes = b"B" * (2**14 + 50)
'''
stmt_code = 'ba[:1] = insert_bytes'
timer = timeit.Timer(stmt=stmt_code, setup=setup_code)
print(timer.timeit(number=1500))
#

@regal spoke that example is beautiful because much more common imo

halcyon plankBOT
regal spoke
#

the issue is not with any C compiler (GCC/Clang/MSVC Python versions on AMD can (almost always) reproduce)

#

You could argue that the issue lies with all of them

stiff quartz
#

Yes sure

#

I don't know where the responsibility of the compiler ends the responsibility of the hardware begins

stiff quartz
regal spoke
regal spoke
#

Seems it needs to get approved by moderators before it gets shown publicly

stiff quartz
regal spoke
#

Here is what I wrote

[severity:It bothers me. A fix would be nice]
It was recently found that calling memcpy(dst, src, count) where dst-src is close to a power of 2 causes significant performance degragation (up to 20-40 times). It was originally discovered in Python because Python’s internal list copying algorithm happens to repeatedly trigger this slowdown bug.

Related issues:
https://github.com/python/cpython/issues/137121
https://github.com/microsoft/STL/issues/5658

MRE in C used to trigger the bug

#include <string.h>
#define REPEATS 4000000

int main() { 
    char src[1000000];  // any allocation of a big array works
    
    for (int i = 0; i < REPEATS; ++i) {
        // Call memcpy with dst - src = +-(power of 2) + (something small > 0)
        memcpy(src + (1 << 18) + 1, src, 8193);
    }

    return 0;
}
This takes 4 s on my Windows desktop computer (compiled using Microsoft (R) C/C++ Optimizing Compiler Version 19.44.35213 for x64). But if I change (1 << 18) + 1 to (1 << 18) it takes 0.15 s.

About which CPU models are affected. I have desktop and laptop at home, and both experience the slowdown bug. Desktop CPU: AMD Ryzen™ 9 5950X. Laptop CPU: AMD Ryzen™ 9 5900HS. There has been a single report of someone reproducing this with an intel processor + clang (https://github.com/python/cpython/issues/137121#issuecomment-3126294652). Otherwise the reports have been conclusively with people running AMD processors.

Compilers confirmed to be able to reproduce the slowdown bug

Microsoft (R) C/C++ Optimizing Compiler Version XXX for x64
Microsoft (R) C/C++ Optimizing Compiler Version XXX for x86
Where XXX is 19.29.30157, 19.29.30159, 19.38.33135, 19.44.35211, and 19.44.35213 (latest available)

Linux GCC (gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0)
Linux Clang (Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144)
#

I would like mark the severity higher, but there was no good option

stiff quartz
#

You can also mention my CPU if you want: AMD Ryzen 9 7900 12-Core Processor

regal spoke
#

Ryzen 9 too huh

stiff quartz
regal spoke
#

Which is the next step in their severity scale

#

In the end, this is just a slowdown issue (as far as we know)

#

Which is not the end of the world

stiff quartz
#

Would be funny if you could hack people on codeforces because of it but that's very unlikely

regal spoke
#

Or at least they used to

stiff quartz
#

Ah yes there's that

regal spoke
#

Btw the reason I found this bug was that I was benchmarking sparse tables / disjoint sparse table in python/pypy. The disjoint sparse table starts with A *= len(A).bit_length()

#

That is where the original 22 came from. It's the bit length of the size of the list

#

At first I thought pypy was the culprit, but then I realized cpython was equally slow.

regal spoke
#

Maybe I can find more bugs benchmarking them

#

This is how long it takes CPython to create the sparse table. There is this weird bump that I don't know where it comes from.

#

This is the same plot, but for PyPy instead

#

That bump is now a cliff

#

Something very funky is going on between n = 6.7e7 and n = 6.8e7

haughty mountain
#

cpu cache effects maybe?

#

or these values might be too large for that

regal spoke
#

Something else funky is that n=3e7 is always below the expected curve for PyPy

#

Also, the reason I happened to benchmark with powers of 2 plus 1 is that I used this to generate my benchmark

for i in range(40):
    n = int(2**(i*0.5) + 1)
    benchmark(n)
#

Originally I had just 2**i, but I felt like the size increased too fast

#

So I changed it to int(2**(i*0.5) + 1)

#

Then I saw that it somehow alternated between running fast and slow

stiff quartz
regal spoke
#

I just want to know why my sparse table is slow

stiff quartz
#

Ah welp

#

Will the underlying array ‘s size ever reach higher than 0.5*10^8 though?

#

Seems that it’s already kinda big

regal spoke
#

Ofc, but it is still weird

stiff quartz
#

Maybe your ram is full

regal spoke
#

No

stiff quartz
#

If you pinpoint the exact value at which the cliff happens

#

Could help

#

You could binary search the location of the cliff based on the time it takes to make your sparse table 😳

regal spoke
#

Ye I'll more into it when I have time

#

I realized that .bit_length() in pypy is O(64)

#

Which kind of defeats the purpose of a sparse table

#

I think this one algorithm is pretty cool for bit length

table3 = [0, 1, 39, 2, 15, 40, 23, 3, 
          12, 16, 59, 41, 19, 24, 54, 4, 
          -1, 13, 10, 17, 62, 60, 28, 42, 
          30, 20, 51, 25, 44, 55, 47, 5, 
          32, -1, 38, 14, 22, 11, 58, 18, 
          53, 63, 9, 61, 27, 29, 50, 43, 
          46, 31, 37, 21, 57, 52, 8, 26,
          49, 45, 36, 56, 7, 48, 35, 6, 
          34, 33]

def bitlength64_4(x):
    # Assumes x >= 0
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    x |= x >> 32
    return table[x % 67]
#

I haven't seen it anywhere online

#

Even though it is so clean

stiff quartz
#

Is the goal to put ones everywhere

#

On the right of the highest bit

regal spoke
stiff quartz
#

And then do a look up

stiff quartz
#

Im talking about the OR operations

#

You transform 0101010 in 0111111?

regal spoke
#

The or operations is to make x be on the form 2^i - 1, yes

stiff quartz
#

Ah ok I get the algo

#

That’s nice

regal spoke
#

So after the or operations, there are only 64 possible values of x (assuming the starting x is a int64)

stiff quartz
#

There are some values that are useless in the table then i assume

#

Or is it length 64

regal spoke
#

it is 66 long

stiff quartz
#

Shouldn’t it be 67 long?
Or maybe no one in the 64 values end up on index 66?

regal spoke
#

I ended up being lucky

stiff quartz
#

Did you check if you were luckier with 71? 😅

#

Theoretically the stars could align

#

And give you a 64 long table

#

Even if that’s unlikely

regal spoke
#

Havent tried that

stiff quartz
#

Well o mean

#

If we’re trying to micro optimise bit length

#

We may as well try

#

😅

regal spoke
#

If we really want to optimize, then this is probably better

import __pypy__
mult = __pypy__.intop.int_mul

# Magic constant for de bruijn sequence with the "Dickinson property" taken from https://cstheory.stackexchange.com/questions/19524/using-the-de-bruijn-sequence-to-find-the-lceil-log-2-v-rceil-of-an-integer
magic = -0x3f08a4c6acb9dbd

table2 = [0, 6, 7, 58, 8, 51, 59, 46,
          9, 55, 52, 33, 60, 47, 40, 28, 
          10, 56, 44, 53, 38, 36, 34, 20,
          61, 48, 25, 41, 22, 29, 16, 11, 
          63, 5, 57, 50, 45, 54, 32, 39, 
          27, 43, 37, 35, 19, 24, 21, 15, 
          62, 4, 49, 31, 26, 42, 18, 23, 
          14, 3, 30, 17, 13, 2, 12, 1]

def bitlength64_3(x):
   # Assumes x >= 0
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    x |= x >> 32
    return table2[mult(x, magic) >> 58]
#

This one is kind of standard. There is lots of info about this method online

stiff quartz
#

I never heard of it

regal spoke
#

The idea is similar

stiff quartz
#

Is it faster?

regal spoke
#

Probably yes

stiff quartz
#

On your GitHub issue you mention other methods are a bit faster than your in house one

regal spoke
#

I dont know if the compiler can be smart about % 67

#

There might be some funny method to quickly compute it

stiff quartz
#

Maybe you can find a series of operations to do on your 64 values

#

That map them to an array of reasonable size

#

That’s faster than a mod

regal spoke
#

Thats the goal, yes

stiff quartz
#

I don’t really know where to start to find it though

regal spoke
#

% 67 was my idea

#

The one unfortunate thing is that you wont be able to use only bit operations

#

Like you cant do something like x & (64 - 1)

#

You need something that kind of mixes up the bits

#

The de bruijn method uses multiplication mod 2^64 to do it

#

And my method uses % 67

stiff quartz
#

What’s the CPython algorithm

#

I don’t understand it

#

It looks like depending on the compiler the algorithm is different

regal spoke
#

Which is built into gcc/clang/msvc

stiff quartz
#

If your integer is 0

regal spoke
#

No like, it uses something built in. True O(1) on most systems

stiff quartz
#

Ah ok

regal spoke
#

If you try to compile on some other compiler, then it falls back to

#
    const int BIT_LENGTH_TABLE[32] = {
        0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
        5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
    };
    int msb = 0;
    while (x >= 32) {
        msb += 6;
        x >>= 6;
    }
    msb += BIT_LENGTH_TABLE[x];
    return msb;
stiff quartz
#

Yeah that one I couldn’t understand

regal spoke
#

This case should practically never ever happen

stiff quartz
#

Probably not as good as the ones proposed in the pypy issue because of the jmp statement it will generate?

regal spoke
#

Obviously using the built in count leading zeros is better

stiff quartz
regal spoke
stiff quartz
#

What no

#

Im just wondering if the algorithm CPython falls back to on other compilers is any good

regal spoke
#

Their algorithm is not great (but better than what pypy currently uses)

#

But like it really doesn't matter

stiff quartz
#

Well if it was great pypy could’ve copied it

#

That’s why I was wondering

regal spoke
#

They could tbh. But I think the other ones I posted should be better.

#

PyPy's current algorithm for bit length is probably the slowest alg out there (except for algorithm intentionally written to be slow)

#

Just try to benchmark it yourself

#
import __pypy__
mult = __pypy__.intop.int_mul

# Magic constant for de bruijn sequence with the "Dickinson property" taken from https://cstheory.stackexchange.com/questions/19524/using-the-de-bruijn-sequence-to-find-the-lceil-log-2-v-rceil-of-an-integer
magic = -0x3f08a4c6acb9dbd

table2 = [0, 6, 7, 58, 8, 51, 59, 46,
          9, 55, 52, 33, 60, 47, 40, 28, 
          10, 56, 44, 53, 38, 36, 34, 20,
          61, 48, 25, 41, 22, 29, 16, 11, 
          63, 5, 57, 50, 45, 54, 32, 39, 
          27, 43, 37, 35, 19, 24, 21, 15, 
          62, 4, 49, 31, 26, 42, 18, 23, 
          14, 3, 30, 17, 13, 2, 12, 1]

def bitlength64_3(x):
   # Assumes x >= 0
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    x |= x >> 32
    return table2[mult(x, magic) >> 58]

vs

def bit_length_naive(x):
    cnt = 0
    while x:
        x >>= 1
        cnt += 1
    return cnt
#

My benchmark shows bit_length_naive runs 27 times slower than bitlength64_3 for numbers around 10^18.

stiff quartz
#

even for small numbers, it's significantly faster

#
>>>> timeit.timeit("bit_length_naive(10)", setup="from __main__ import bit_length_naive", number=10000000)
0.13675410000000454
>>>> timeit.timeit("bitlength64_3(10)", setup="from __main__ import bitlength64_3", number=10000000)
0.004600799999991523
>>>> 0.1367541/0.0046008
29.723982785602498
>>>>
regal spoke
#

It should shine the most if given random numbers with bitlength 63

stiff quartz
#

yeah obviously

stiff quartz
#

I guess I could've stopped at 127

#

lol

regal spoke
#

no lottery win sadcat

stiff quartz
#

argh I wish I had the emojis you have

#

I was thinking about that speedboat thing

stiff quartz
#

and this one for obvious reasons

#

why do you have so many of them

regal spoke
#

People made it in 2019

#

Then whenever someone made a new server they included it

#

There are lots of these thinking emojis thonk thonkery thonkeyes thonkcool thinkies pajenethonk mikethonk

#

The mikethonk one was made when MikeMirzayanov told people to think wider on codeforces

#

So we made a wider version of thonk

stiff quartz
#

lmfao

stiff quartz
regal spoke
#

An issue tracking system like github is definitely nicer

stiff quartz
#

I can upvote though

distant quiver
#

class Solution(object):
    def minimumDifference(self, nums, k):
        """
        :type nums: List[int]
        :type k: int
        :rtype: int
        """
        n = len(nums)
        left =  0 
        right = 1
        min_diff = float('inf') 

        if n == 1:
            return 0

        while right < n :
            if right == n - 1:
                # if right pointer is at the end of the list 
                # reset left back
                diff = max(nums[right], nums[left])  - min(nums[right], nums[left])
                min_diff = abs(min(min_diff, diff))
                left += 1
                right = left + 1
                continue
            diff = max(nums[right], nums[left])  - min(nums[right], nums[left])

            right +=1
        return min_diff

for leet code

https://leetcode.com/problems/minimum-difference-between-highest-and-lowest-of-k-scores/

what is the time complexity this is my solution how is this O(n^2)

#

its in one pass

#

Oh nvm

#

im not sure

#

but the efficient way is to sort and use two pointer

haughty mountain
regal spoke
#

About opening up a similar issue on gcc bugzilla

stiff quartz
stiff quartz
#

Was it ConfusedReptile that could?

regal spoke
#

Yes, but only tried it on windows

stiff quartz
#

Ah yes.

regal spoke
#

I could try it through wsl

stiff quartz
#

Tbf I so have a VM

#

On Ubuntu

#

With gcc

#

I could try when I come back home

stiff quartz
#

We should also make sure the assembly code doesn’t change

#

Except for the +1 added obviously

regal spoke
#

According to what I can see there, using -O0 should work

#

But both gcc and clang optimize it all away if using >= O1. But I think I found something that works

#
#include <string.h>
#define REPEATS 4000000

char src[1000000];  // any allocation of a big array works
int main() { 
    for (int i = 0; i < REPEATS; ++i) {
        // Call memcpy with dst - src = +-(power of 2) + (something small > 0)
        memcpy(src + (1 << 18) + 1, src, 8193);
    }
    return 0;
}

Here is something that should work on clang or gcc, independent of O flag. According to godbolt memcpy is always called

#

Btw if I understand it correctly, both clang and gcc should be calling the same exact memcpy from glibc. So there shouldn't be any difference between the two

#

I wonder if there is an amd bug tracker too. All I could find was emailing their support

regal spoke
#

But my wsl python3 (version 3.12.3) does not have the copy slowdown bug

#

Is 3.12.3 too old to have the bug?

regal spoke
halcyon plankBOT
#

Include/internal/pycore_list.h lines 59 to 70

// Repeat the bytes of a buffer in place
static inline void
_Py_memory_repeat(char* dest, Py_ssize_t len_dest, Py_ssize_t len_src)
{
    assert(len_src > 0);
    Py_ssize_t copied = len_src;
    while (copied < len_dest) {
        Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
        memcpy(dest + copied, dest, bytes_to_copy);
        copied += bytes_to_copy;
    }
}```
stiff quartz
regal spoke
#

on windows?

stiff quartz
#

Yes

#

I couldn’t on Linux

regal spoke
#

but what about wsl

stiff quartz
#

Try the string example

#

For some reasons I don’t understand

#

This one is easier to reproduce

regal spoke
#

But I'm unnable to get python/pypy on wsl to bug out for some reason

stiff quartz
#

Im talking about this ^

stiff quartz
regal spoke
#

Ok that one is death on wsl

python3 testare2.py
0.43964314460754395
6.476124286651611
stiff quartz
#

Yep

regal spoke
#

Why is it different?

stiff quartz
#

I don’t know

#

No reference counting that’s the only difference

#

If I remember correctly

#

Looking at the source code

#

But maybe the list change came about AFTER this one

#

Can you upgrade your WSL python version?

#

This one is a memmove bug

regal spoke
#

Could it be that the x in version matters, like 3.12.x

stiff quartz
#

When you insert at index 1 it moves stuff

stiff quartz
regal spoke
stiff quartz
#

I have 3.12.3 on my Ubuntu

#

What a coincidence

#

I can't even reproduce the string example ... (neither with 3.12.3 neither with 3.14.0 preview)

stiff quartz
#

No slowdown (on top is +1, bottom is +100)

#

Makes sense

#

For reference:
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

#

(It's a virtual machine though, idk if that's relevant)

regal spoke
#

It might be. Seems like you dont have the bug at all there

stiff quartz
#

Which is crazy because it's the same CPU

#

Literally the same

regal spoke
#

I dont know what the VM does

stiff quartz
#

Yeah me neither

regal spoke
#

I find it stranger that strings slow down for me, but not lists

stiff quartz
#

It might just be that the string change was older

#

you have everything here

#

oh wait no

regal spoke
stiff quartz
#

you have everything for 3.13 but not for 3.12

#

ok here you have it all

#

no mention of gh-91247

regal spoke
#

It is under Python 3.12.0 alpha 1

stiff quartz
#

so 3.12.3 should have memcpy

regal spoke
stiff quartz
#

why is it not called gh-91247

#

There are some core developers here, maybe it'd be easier to ask them lol

regal spoke
#

instead of pr

stiff quartz
#

Fair

#

So your 3.12.3 should call memcpy

regal spoke
#

probably

stiff quartz
#

Try the tuple example

#

Supposedly the change touched both and I could reproduce both

#

It’s on the issue that you opened somewhere

#

python3 -m timeit -r 20 "((0,) * 2**18) * 14"

#

Run something like this

#

And with +1 and +100

#

If that works and the list doesn’t work 🤯

regal spoke
#

waiit

#

nvm

stiff quartz
regal spoke
#

Just a thing

#

Remember that the calls to memcpy working with strings is different

#

Cause strings are bytes

#

And lists works off 8 bytes elements instead

stiff quartz
#

Try

#

On gcc

#

+8

regal spoke
#

ye exactly

stiff quartz
#

Thats actually the reason why the string example is so much easier to reproduce

regal spoke
#

probably, ye

#

also no reference count makes the intial copy faster

stiff quartz
regal spoke
#

+16 is slow, +32 is fast

stiff quartz
#

And +8 is fast?

#

Id assume?

regal spoke
#

I read it wrong

#

Seems <= 24 is slow

#

otherwise fast

stiff quartz
#

But if +8 is slow

#

Then ..

regal spoke
#

Which is same as msvc

stiff quartz
#

List should be slow

regal spoke
#

yes

stiff quartz
#

An int is 8 bytes right

haughty mountain
#

4

stiff quartz
#

In CPython

#

A python int is a int64

#

No?

regal spoke
#

A python list only stores pointers

#

It doesnt care what it is storing

stiff quartz
#

Ah yes, but the pointer is 8 bytes

haughty mountain
#

a list is an array of pointers

haughty mountain
stiff quartz
#

So i was right for the wrong reason, this still doesn’t make any sense

regal spoke
#

yes

#

I should be seeing the bug

#

but I'm not thonk

stiff quartz
#

Are you sure +8 is slow? Maybe +7 and +9 are but not +8 on your computer

regal spoke
#

+8, yes

stiff quartz
regal spoke
#

Because +16 is bad for me, and +32 isn't

#
import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 32
print(time.time() - start_time)

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 32
print(time.time() - start_time)

This is a killer
0.9612233638763428
15.320869445800781
But change to * 64 and the difference get cut in half
1.9017343521118164
16.154651641845703

stiff quartz
#

The first copy is at a distance 2^k+8

#

The second is at a distance of 2^j+16

#

No?

#

The third +32

#

So you get out of the bad zone

#

Maybe fast enough

#

With a list

#

But it’s super slow to get out of the bad zone with a string

#

Since it goes 1 2 4 8 16 32

regal spoke
#

By this logic, shouldn't * 4 be best for me for lists?

stiff quartz
#

So maybe for the array benchmark you should just do *2

#

Ah yes *4

#

Since 16 is bad

#

Yes try *4

regal spoke
#

Still nothing

#

on wsl

stiff quartz
#

Is it drowning in noise

#

Weird though

#

The memcpy is like a x20 slowdown

regal spoke
#

no it is just not triggering I think

regal spoke
stiff quartz
regal spoke
#

I see no change with tuples

stiff quartz
#

And the list and tuple changes were shipped together

#

Are we reading the changelog wrong lol

#

Do you have another version of python installed on your wsl ?

#

Also, you should try your GCC benchmark with a bigger count

#

Maybe it doesn’t trigger if the count is too large

#

We have lower bounds and always assumed there was no upper bounds

haughty mountain
#

what are you checking? repro'ing the different slowdowns on wsl?

regal spoke
# haughty mountain what are you checking? repro'ing the different slowdowns on wsl?

Here is a summary:

This C code triggers slowdown bug for me in:

msvc (32 bit and 64 bit)
wsl gcc
wsl clang
but no slowdown in
mingw gcc

#include <string.h>
#define REPEATS 4000000

char src[1000000];  // any allocation of a big array works
int main() { 
    for (int i = 0; i < REPEATS; ++i) {
        // Call memcpy with dst - src = +-(power of 2) + (something small > 0)
        memcpy(src + (1 << 18) + 1, src, 8193);
    }
    return 0;
}

This python script triggers for me in all suffciently new CPython and PyPy versions (both on windows and wsl).

import time

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 100)) * 32
print(time.time() - start_time)

start_time = time.time()
for _ in range(500000):
  ("s" * (2**12 + 1)) * 32
print(time.time() - start_time)

The weird thing is that the list equivalence does not trigger on wsl + python (version 3.12.3). Switching to *4 does not fix that.

#

The switch of list copy algorithm was supposedly added in 3.12.0 alpha. But since I'm unnable to trigger the slowdown in 3.12.3, I'm starting to think it might not be in 3.12.3.

haughty mountain
#

did you try other versions?

regal spoke
#

@haughty mountain Btw have you tried running the string example

#

The one just above with *32

#

Hmm in >= 3.13 they added this to the repeat code

#

I don't know what Py_BEGIN_CRITICAL_SECTION does, but it was not there in 3.12

haughty mountain
haughty mountain
regal spoke
haughty mountain
#

which is the no gil thing

#

it's basically grabbing a mutex for the object

regal spoke
haughty mountain
#

It shouldn't

#

not sure where the mutex stuff is stored though

stiff quartz
#

Ah it didn’t link well

#

On line 56

#

That’s what’s used to copy the array right

regal spoke
#

well list

#

not array

stiff quartz
#

Yes sorry

regal spoke
stiff quartz
#

Yep

#

That’s the one I linked at least

#

@regal spoke I'm the same as you on WSL

#
lhott@JasonTower:~/tmp$ python3 testString.py
0.7817449569702148
6.544427871704102
#
lhott@JasonTower:~/tmp$ vim testArray.py
lhott@JasonTower:~/tmp$ python3 testArray.py
0.8847634792327881
0.8824248313903809
#

Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] on linux

regal spoke
#

Are those lists?

#

or arrays

stiff quartz
#

yes lists

#

python lists

#

there are no python arrays anyway

regal spoke
#

there is something called arrays too

stiff quartz
#

I only knew about bytearrays

#

but anyway it was lists

#
lhott@JasonTower:~/tmp$ cat testArray.py
import time

start_time = time.time()
for _ in range(500):
  ([0] * (2**18 + 100)) * 14
print(time.time() - start_time)

start_time = time.time()
for _ in range(500):
  ([0] * (2**18 + 1)) * 14
print(time.time() - start_time)
lhott@JasonTower:~/tmp$
regal spoke
#

what if they forgot to add it in the update

stiff quartz
#

what is the password

#

of wsl

#

i'm trying to install 3.13

#

it's asking me a password

#

ok i found it

#

i just cycled through the passwords i usually use

#

So I did already use WSL?

#

I'm getting old, my memory is failing me

#

ok i'm installing a newer version of python

stiff quartz
regal spoke
#

I'm technically using wsl2

stiff quartz
#

here

#

if you really wanna verify

regal spoke
#

I'm currently uppdating my ubuntu version on wsl.

stiff quartz
#

Nop, still no bug in 3.13 + WSL

regal spoke
stiff quartz
#

🚔

#

i can reproduce the slowdown on my WSL with GCC

#

even when count = 2^18

#

actually

#

nvm I'm stupid I can reproduce no matter the value of count if >= 8193

regal spoke
stiff quartz
#

i can

#

with C

#

no with Python + List

#

No breaking news here

regal spoke
#

After upgrading, I no longer get a slowdown in wsl, python3 (3.13.3), with strings

stiff quartz
#

what

#

but do you get it with gcc

#

if you get it with gcc but not with Python there are things that I don't understand

#

I'm on

lhott@JasonTower:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.2 LTS
Release:        24.04
Codename:       noble

what version are you on?

regal spoke
#

On wsl

regal spoke
#

Could there have been some kind of patch?