#Optimizing my collatz code
1875 messages · Page 2 of 2 (latest)
but it was always the same numpy array, no?
I'll sent the code and I'll explain
int basic_collatz(uint64_t n) {
uint64_t count = 0;
uint64_t arr_counter = 0;
uint64_t size = n;
uint64_t* numbers;
uint64_t* counters;
numbers = malloc(size * sizeof(uint64_t));
counters = malloc(size * sizeof(uint64_t));
while(n != 1) {
if (n % 2 == 0) {
if (arr_counter >= size) {
size *= 2;
numbers = realloc(numbers, size * sizeof(uint64_t));
counters = realloc(counters, size * sizeof(uint64_t));
}
assert(arr_counter < size);
numbers[arr_counter] = n;
counters[arr_counter] = count;
n /= 2;
count++;
arr_counter++;
} else {
if (arr_counter >= size) {
size *= 2;
numbers = realloc(numbers, size * sizeof(uint64_t));
counters = realloc(counters, size * sizeof(uint64_t));
}
assert(arr_counter < size);
numbers[arr_counter] = n;
counters[arr_counter] = count;
n = ((3 * n) + 1) / 2;
count+=2;
arr_counter++;
}
}
for (uint64_t i = 0; i < arr_counter; i++) {
giant_buffer[numbers[i]] = count - counters[i];
}
free(numbers);
free(counters);
return count;
}```
do you want me to send the python code that is easier to read?
yes
def basic_algo(n):
global mem
c = 0
numbers = []
counters = []
while n > 1:
if n % 2 == 0:
numbers.append(n)
counters.append(c)
n = n // 2
c+=1
else:
numbers.append(n)
counters.append(c)
n = (3 * n + 1) // 2
c+=2
for i in range(len(numbers)):
mem[numbers[i]] = c - counters[i]
return c```
thx!
numbers and counters
I cache all the numbers that were computed in the end
I do it at the end because I don't know how far they are from 1 (collatz..)
okay but that's where your problem is
the program is on numbers[i]
mem has been defined with capacity end-start
so it is too short to hold all the results basic_algo might need
no
you can cache what basic_algo does, but you have no proof it fits into mem
yeah that's what I mean
@stone plover on the call `basic_collatz(103561)`, loop iteration `i=23` from the final loop is causing `numbers[i]` to result in `129140162```
mem is not meant to track an entire orbit
(it just cann't do it, by definition)
numbers and counters are tracking a full orbit
(the sequence of numbers that are hit by the basic algo, from a giving starting point)
but those numbers can because extremely big compared to your problem range
oh
so I can just add a check
that if the number is bigger than the range
I don't cache it
yes
Wonderful!
Thanks!
I thought the number that numbers[i] was returning wasn't a computed number
However I verified it is
Thanks for helping me understand it
you're welcome!
don't thx me, thx python and its verbose high level decent sane-for-the-mind style 🤣 hahahaha
and because you now know what you're going to cache, fits into mem,
please remove those realloc ; it's totally unnecessary now 😉
but cann't you mutate mem while iterating?
mutate?
ah, maybe it will become silly
:_:
is mem an array of signed integers?
confusion immerses
no?
mem is the giant buffer
that we cache all numbers
we would compute in the future
okay nvm 😄 I had an idea but, I think it will get too mystic
curiosity hits
I don't know I want to hear it I have wasted a lot of my time on this project but curiosity hits hard
I made some slight modifications should be all good now
//create simple collatz algorithm
int basic_collatz(uint64_t n) {
uint64_t count = 0;
uint64_t arr_counter = 0;
uint64_t size = n;
uint64_t* numbers;
uint64_t* counters;
numbers = malloc(size * sizeof(uint64_t));
counters = malloc(size * sizeof(uint64_t));
while(n != 1) {
if (n % 2 == 0) {
if (arr_counter >= size) {
size *= 2;
numbers = realloc(numbers, size * sizeof(uint64_t));
counters = realloc(counters, size * sizeof(uint64_t));
}
if (n > max_number) {
n/=2;
count++;
}
else {
numbers[arr_counter] = n;
counters[arr_counter] = count;
n /= 2;
count++;
arr_counter++;
}
} else {
if (arr_counter >= size) {
size *= 2;
numbers = realloc(numbers, size * sizeof(uint64_t));
counters = realloc(counters, size * sizeof(uint64_t));
}
if (n > max_number) {
n = ((3 * n) + 1) / 2;
count+=2;
}
else {
numbers[arr_counter] = n;
counters[arr_counter] = count;
n = ((3 * n) + 1) / 2;
count+=2;
arr_counter++;
}
}
}
for (uint64_t i = 0; i < arr_counter; i++) {
giant_buffer[numbers[i]] = count - counters[i];
}
free(numbers);
free(counters);
return count;
}```
I added that if ( n > max_number )
Alright it works now 🙂
a wonderful number of 2.7 seconds on O0
def basic_algo(n):
global mem
# `mem` will get mutated by this method. We keep track of the mutations
# by using the sign as a flag.
c = 0
while n > 1:
if n < mem_size and mem[n] == 0:
mem[n] = -c
if n % 2 == 0:
n = n // 2
c+=1
else:
n = (3 * n + 1) // 2
c+=2
for i in range(len(mem)):
if mem[i] < 0: mem[i] = c + mem[i]
return c
lets see the O3
would that work ?
I use the sign as a way to keep track of the indices I'm mutating
so after you remove it from mem[i] = c - mem[i]
ah yeah sorry
but yeah
fixed
That should probably work perfectly
and I remove the need of buffers
let me backup this code that I have right now
and I'll try to implement that
sounds easy
(I haven't added the check on n! so beware
this saves all the buffer and the reallocations
thats perfect!
anyways lets see some beautiful O3 performance 🙂
maybe measure it before checking if it's really nicer 😄 hahaha
it's a bit hacky.. maybe a comment would be useful to understand why we use negative numbers
I still don't get why you use negative
I think it should be like this:
def basic_algo(n):
global mem
c = 0
while n > 1:
if n < mem_size and mem[n] == 0:
mem[n] = c # CHANGED THIS
if n % 2 == 0:
n = n // 2
c+=1
else:
n = (3 * n + 1) // 2
c+=2
for i in range(len(mem)):
if mem[i] < 0: mem[i] = c - mem[i] #CHANGED THIS AS WELL
return c```
Then mem can be unsigned
I changed the + to -
next time you'll need to increment this counter (the old counters buffer), you won't see it
because it has positive sign
wait
it really must stay negative, with a + at the end. otherwise you keep track of what entry is in "mutation" state
no? I'm saying something wrong?
I think?
aaaaaaah, maybe I'm too prudent yes : because you have no cycles
so you don't come back on the same entry
on the while loop
okay fair
it will count how many iterations it has done until now
yeah okay okay fair to me! I trust
(I mean, let's just run the python code and see if it dramatically fails 🤣
oh right
haha
yeah let me fix it and ill run it
before refactoring the C one, maybe playing with Python is helpful 😄
prototyping and shell-with-classes : where python is not that bad, I guess 😄
oh wait
Oh I got what you meant about the plus
yeah you're right
I hadn't notice the check you have down there
is it really worth to do that
I mean
I guess ?
it should 🙂
yeah
however
because otherwise you don't know where you need to do + c
yea that
however is it faster
to run all over the mem
with a loop
or two loop
two custom buffers
with x specific iterations
idk
buffers seem more optimal
to be honest, I think both are good in their own ways
we could only use one buffer to keep track of what numbers to change
and then use your approach
because this last loop, you see,
to change those numbers
maybe writing it with a ternary would make it more obvious for the compiler ; idk
have an array
that holds the numbers
and store the number of iterations on mem
and then just
it also depends on how you realloc, and how much this method is effectively called
change the numbers
because for some cases, you might only hit it rarely
(when you start low)
on other cases maybe you hit it every time
hm
well the implementation of the two sounds better
don't you think?
use realloc only once for one buffer
store the counters in the mem
and then with a for loop
change the numbers to their supposed iterations
tbh I don't know
it's difficult to reason here because the complexity of the basic algorithm is difficult to comprehend
so maybe O(n) is "ridiculously small" compared to the main loop anyway
I would just check both implementations (realloc as you have, and no realloc at all),
in different scenarios
check if there is really a winner in all cases
if not, keep the one that is easier to understand
(assuming both are bugfree, ofc...)
One quick question just wondering
why if we put as a max number: 1.000.000.000 it gets a segmentation fault?
it can't handle such big numbers?
(nvm I think its because of the 32 bit compiler settings)
!close
Message is already solved
yeah I guess it's related ; not an expert but ig on 32 bits, pointers have 32 bits lengths, but you send 64 bits integers
that's actually a limitation of the cache approach
(you can never hope to cache integers that go beyond the size_t number
maybe you should disallow caching for those cases
(the algo will be slower... yes. but it will remain correct.
(note that you cannot just check the program inputs... because the orbits are not bounded by the inputs (as your previous issue showed))
actually, more generally, I think you should get more picky on the cache
- first bound the size of the cache by a number you're sure pointers can reprsent
- try to calloc until it succeeds ; if not, retry with a smaller size
- add a guard everytime you try to access the buffer (IMO that might be a good case to write a structure or a macro, so you'll have a unified way to access this buffer)
you could also check if malloc couldn't be used, if calloc fails (say: your stack is much smaller than your heap) ; idk what others would think about it
It's okay
It's an assignment after all
Not an actual program I'll use in my daily life or distribute xd
All the test cases work just fine
So there's no problem
okay 😄
I hadn't read the assignment right
It's true after all, that whoever writes the fastest code, gets bonus
💀
Maybe after all I'll do more optimizations and more micro optimizations
I asked my professor, he said some students are fighting in milliseconds
I'm done
you can probably reuse the numbers and counters buffers, instead of having to allocate for each number
Yeah I guess that's possible
Okay that's one optimization
NOTED
If you have any other ideas lmk
Alright I fixed that
run some test cases
everything is working fine
@prime ravine
My teacher said something about bss section and that i could utilize that maybe in some way
Any ideas what that is?
bss section is basically where statically allocated variables (that haven't been assigned values) get stored
so if you know the size of some buffer, for example, you can store it in the bss section and avoid an allocation at runtime
I see
I could
that's not true, you don't know the size at compile- time
I can guess it
then you lose program correctness
hear me out
the assignment says
the numbers will be given from the range (-100m, 100m)
(all negative numbers return 0)
so I can assume the maximum number size to be 100m + 1
yep, i guess that's true
you just statically declare giant_buffer
how do I do that?
the actual usage of .bss is an implementation detail
just declare at the top of your file uint64_t giant_buffer[100000001]
just declare it as an array
Okay
(this will create a p large binary though, like 1GB)
the .bss section is part of the binary itself
that's how you avoid the allocation
(you don't really avoid the allocation, it just gets mapped into memory as part of the image load)
If it doesn't really give any performance boost
I'll remove it then
time to benchmark
warning: assignment to ‘uint64_t’ {aka ‘long unsigned int’} from ‘uint64_t *’ {aka ‘long unsigned int
’} makes integer from pointer without a cast```
what is this error?
oh mb
collatz.c: In function ‘basic_collatz’:
collatz.c:84:34: warning: assignment to ‘uint64_t *’ {aka ‘long unsigned int *’} from ‘uint64_t’ {aka ‘long unsigned in
’} makes pointer from integer without a cast [-Wint-conversion]
84 | giant_buffer[numbers[i]] = count - counters[i];```
There are tons of warnings
in the same format
you've declared giant_buffer wrong
uint64_t* giant_buffer[100000001];
have you declared it as an array of uint64_t *s instead of uint64_ts?
that's an array of pointers
hmm
ohhhh i was wrong about bss
it does get allocated at runtime
the size is simply recorded in the object file, that makes sense
i'm dubious this will give you a performance speedup
i am doubtful that you'll get one
oh alright
well
its still a micro optimization
collatz.c:8:10: error: variably modified ‘giant_buffer’ at file scope
8 | uint64_t giant_buffer[BUFFER_SIZE];
| ^~~~~~~~~~~~
collatz.c: In function ‘main’:```
I tried making a const value BUFFER_SIZE
and set that to the giant_buffer
what's the error?
what's a macro?
#define
no idea
oh okay I went to read the docs
so I can just basically:
#define BUFFER_SIZE 100000000;```
I don't need any extras or something
nope
sounds good
so the compiler actually sees uint64_t buffer[100000000], not uint64_t buffer[BUF_SIZE]
aaa
noice
alright
trilon@M1911:~/hw0-TR1LON/collatz/src$ gcc -O0 -m32 -Wall -Wextra -Werror -pedantic -o collatz collatz.c
collatz.c:8:10: error: size ‘3200000032’ of array ‘giant_buffer’ exceeds maximum object size ‘2147483647’
8 | uint64_t giant_buffer[BUFFER_SIZE];```
Bruh moment
Macro is 400.000.004
so I can cache more values that will be calculated
...but you said the max was 100 million
your original code doesn't account for that
it does
basically
my original code says:
if the number that is calculated is bigger than the cache:
dont cache it
However
since I have a static cache now
I said let's make it bigger
if there's no easy fix
I'll just make it smaller and not cache big numbers
i really doubt that using bss will give you a performance increase
but try it and see what happens
how do I fix that error
you reduce the size of the buffer?
yup, thought so
I'll leave it like that
if you think of anything else
don't hesitate
Current time is at:
trilon@M1911:~/hw0-TR1LON/collatz/src$ time ./collatz 100 100000000
950
real 0m4.581s
user 0m4.360s
sys 0m0.220s```
i wonder... how slow does it run without memoising?
hmm
im wondering if you could get some performance boosts by multithreading
(but then you have issues with sharing the same cache)
I guess
I thought of it too
however I think we aren't even allowed to use more than one thread
the system we run the tests on I think limits it
okay, that's fair
Also
why would we have issues with the cache?
Threads can't access eachother's memory?
yeah so where's the problem?
... so then the threads can access each other's memory
which means you're at risk of race conditions
can't we split jobs equally from the start?
no we can't
because things arent cached
;/
well
there's only one solution
GPU computing
xD
nvm I need CUDA which I can't install to the main system
Time on the uni's machine 😄
linux12:/home/users/sdi23xxxxx/program>time ./collatz 100 100000000
950
3.929u 0.275s 0:04.20 99.7% 0+0k 0+0io 0pf+0w```
have you tried my naive idea of never allocate extra buffers?
No I haven't
But since I allocate them only once
on the main() function now
it shouldn't really be a performance difference
right?
I use realloc
but its being used very rare
wait I'll implement a counter
to see how many times it's being used
do you think the if statement is causing bottleneck?
wait if you send me the code, I could compare both on my machine
so I'm not bothering you
and how do you compile it eventually?
yes
okay, 957 for the alternative approach ; so I guess not better if the competition already happens at the millis!
ah no wait
what command do you use to time it?
are you on linux
if yes
time ./collatz 100 100000000
If you are on windows
you can use a program called ptime
nah it's fine, I can use time
or use the untrustworthy powershell command: ps Measure-Command { ./collatz.exe 100 100000000 }
Alright
okay so the alternative approach takes much more time! so not good
looping through the giant buffer?
(don"t especially surprise me that it's not the same percentage anymore ; likely due to vectorization)
oh yeah that's maybe a machine difference
Yeah I guess
I was looking at the proportion only
ah! also
my compiler doesn't not understand m32
so I had to remove that flag
maybe that's also why something that can impact the measured millis
oh yeah
you need an addon for m32
it impacts more things
so careful
gcc by default complies a program to 64 bit, if you compile it after in 32 bit there may be errors
yeah
(it's funny, I'm trying your code without realloc ; just don't realloc the buffer and if it's filled, stop adding a cache)
intuitivelt one could expect the number of holes to get lower, so the needs to realloc would be small
but okay anyway!
no idk about gpu things
wait
I have forgotten a check
I just noticed
else {
numbers[arr_counter] = n;
counters[arr_counter] = count;```
I don't check if the n is already computed
ah oops
but
hm
I don't really know if I can skip anything here
well
I think I can just add a check up on the while loop
that should work just fine actually
btw, if I may , I think you have some duplicated code
Where?
basic_collatz
if (giant_buffer[n] != 0) {
break;
}``` I think this should be good on `basic_collatz`
//create simple collatz algorithm
int basic_collatz(uint64_t n) {
uint64_t count = 0;
uint64_t arr_counter = 0;
while(n != 1) {
if (n % 2 == 0) {
if (arr_counter >= SIZE) {
SIZE *= 2;
numbers = realloc(numbers, SIZE * sizeof(uint64_t));
counters = realloc(counters, SIZE * sizeof(uint64_t));
}
if (n > max_number) {
n/=2;
count++;
}
else {
numbers[arr_counter] = n;
counters[arr_counter] = count;
n /= 2;
count++;
arr_counter++;
}
} else {
if (arr_counter >= SIZE) {
SIZE *= 2;
numbers = realloc(numbers, SIZE * sizeof(uint64_t));
counters = realloc(counters, SIZE * sizeof(uint64_t));
}
if (n > max_number) {
n = ((3 * n) + 1) / 2;
count+=2;
}
else {
numbers[arr_counter] = n;
counters[arr_counter] = count;
n = ((3 * n) + 1) / 2;
count+=2;
arr_counter++;
}
}
}
for (uint64_t i = 0; i < arr_counter; i++) {
giant_buffer[numbers[i]] = count - counters[i];
}
return count;
}
can likely become
//create simple collatz algorithm
int basic_collatz(uint64_t n) {
uint64_t count = 0;
uint64_t arr_counter = 0;
while(n != 1) {
if(n <= max_number) {
if (arr_counter >= SIZE) {
SIZE *= 2;
numbers = realloc(numbers, SIZE * sizeof(uint64_t));
counters = realloc(counters, SIZE * sizeof(uint64_t));
}
numbers[arr_counter++] = n;
counters[arr_counter] = count;
}
if (n % 2 == 0) {
n /= 2;
count++;
} else {
n = ((3 * n) + 1) / 2;
count+=2;
}
}
for (uint64_t i = 0; i < arr_counter; i++) {
giant_buffer[numbers[i]] = count - counters[i];
}
return count;
}
(if I don't mistake)
well I guess it seems good
take the arr_counter++ outside from the buffer
so its more readable
imo the code as it is now, is highly unsafe : if your realloc fails, you'll get a segfault
I would add some checks there
I'll see what I can do
Professor said if I have the fastest code in class I need to do a presentation
Presentation to 200 people xd
(also : since you keep track of the indices that you mutate in numbers, I'm not sure qhy you actually need counters:
can you just print the count in giant_buffer and rectify later?
Oh
save the numbers in giant_buffer
and have only numbers
If I understood correctly
yeah ; since you know (in numbers) what you are mutating
you can save them and rectify later
yeah yeah
alright got it
I'll go eat now
I'll be back and I'll apply the changes
it's kind of frustrating to get a competitive thing with bonus, based on "one time millis" 😄 if ever you're not first, don't hesitate to take a step back 😄
(an approach that works well on (start,end) might not work quite as well on different inputs 😉 since I see you're quite investing time into it, don't take it too personnally if it doesn't pay off 🙂 what you'd have learned might be the best take-away)
Well I'm pretty happy that I have learned so many about C in such a small period
I mean obviously I won't be as excited if I'm not the first, but it's not that bad
there's a problem with this one
trilon@AK-47:~/collatz$ time ./collatz 100 100000000
952
real 0m2.465s
user 0m2.365s
sys 0m0.100s
I got a value of 952 instead of 950
Alright nvm fixed it
The
numbers[arr_counter++] = n;
``` caused the error
I changed it to
arr_counter++;
numbers[arr_counter] = n;```
ah yeah sorry, I haven't really tested before submitting
Also I'll remove the numbers[]
and use the giant_buffer
i'll also add the check to see if its computed already
break; ``` exists in C right?
yup
Alright thanks!
hmm
for some reason now it has incorrect values
Oh I know why wait
Well I don't see any performance difference
But I guess its good practice to have a more simple and compact code
o oh
its not good
it's producing false results
I think the problem is that check
it's producing wrong results 💀
Alright I got the backup
I'll add some extra variables to check how many times other things get accessed
I'll make a python script that runs test cases
and checks if the output of the program was correct
hm
@torn quarry I think we still call the basic function a lot of times
trilon@AK-47:~/collatz$ ./collatz 23504 100000000
950
Called 7301 times.
trilon@AK-47:~/collatz$ ./collatz 1 100000000
950
Called 2 times.
trilon@AK-47:~/collatz$ ./collatz 3000 100000000
950
Called 935 times.```
let me test the code at the python code to see if we access it the same amount
(it should be)
for now in the basic function, you add in cache if the entry was not computed yet
what happens if you just, exit the function if the entry has been discovered already ? (because it means you're now in an orbit you've discovered)
(that is suppose to "return fast" in case you eventually meet a trajectory you've discovered before)
for some reason if I do that it gives incorrect results
(no idea why actually)
ah because you cannot rectify or something?
hm
idk
let me think about it a bit
if (giant_buffer[n] != 0) { //If it's calculated:
c += giant_buffer[n];
}```
for example ; 5 goes to 16 -> 8 -> 4 -> 2 -> 1
so when you're at 5, you actually don't have to go to 1, because you might already have computed 4 from a previous computation
yeah I got it
^^
I had implemented it
but it gave incorrect values
ah
maybe those cases are also too rare to really matter, idk
maybe more like a score I think :
counter before early return / counter returned
the closest to 1, the closest to 1 you went
I found the error
I had incorrect brackets I THINK
lemme try to implement it again
wait
I think you're right,there is something that is not supposed to be working
because,
It's working
trilon@AK-47:~/collatz$ ./collatz 14040 100000000
950
4366```
ah, interesting
4366 times was giant_buffer[n] used
because, iirc, but maybe I've stopped followng at some point,
the entire goal of this basic function was to never actually read from the cache
(compared to the main loop)
but ! I don't remember why
maybe I'm mistaking
Oh
I think you're right
yeah
Maybe now we are calling the function even more times
but actually, it was two days ago already
so maybe the code had changed since then, and it's now completely useless to keep having this basic function 😄 hahaha
wait I'll compile an old backup and time them
tbh I haven't followed entirely 🤣
¯_(ツ)_/¯
yeah no I cann't remember
I do remember we restarted from [some state], then wanted to introduce cache ; then there was somthing in the code that was not friendly to handle reading caches in some cases, because the cache was not set yet ;
but it was sooo long ago 😄 like sunday morning? and we're tuesday afternoon here 😄 no chance my memory is large enough to handle that
haha
uhh
trilon@AK-47:~/collatz$ time ./collatz 14043 100000000
950
CALLED: 4366
GIANT: 4365
real 0m1.699s
user 0m1.578s
sys 0m0.120s
trilon@AK-47:~/collatz$ time ./test 13043 100000000
950
BASIC: 4064
real 0m1.698s
user 0m1.655s
sys 0m0.040s```
the first one is with the giant_buffer[n]
the second one is without
they achieve literally the same time
do we care about user or sys?
oh wait
wrong number on the 2nd one
OOPS
retry test
trilon@AK-47:~/collatz$ time ./collatz 14043 100000000
950
CALLED: 4366
GIANT: 4365
real 0m1.641s
user 0m1.550s
sys 0m0.090s
trilon@AK-47:~/collatz$ time ./test 14043 100000000
950
BASIC: 4366
real 0m1.647s
user 0m1.487s
sys 0m0.160s
trilon@AK-47:~/collatz$```
No difference
The function is called the same exact amount
however there's no performance difference
how is that possible
Can you check if I have done any mistakes xD
I think the code is good
//create simple collatz algorithm
int basic_collatz(uint64_t n) {
uint64_t count = 0;
uint64_t arr_counter = 0;
times_basic_called++;
while(n != 1) {
if (giant_buffer[n] != 0) {
times_giant_used++;
count+=giant_buffer[n];
n = 1;
}
else {
if (n % 2 == 0) {
if (arr_counter >= SIZE) {
SIZE *= 2;
numbers = realloc(numbers, SIZE * sizeof(uint64_t));
}
if (n > max_number) {
n/=2;
count++;
}
else {
numbers[arr_counter] = n;
giant_buffer[n] = count;
n /= 2;
count++;
arr_counter++;
}
} else {
if (arr_counter >= SIZE) {
SIZE *= 2;
numbers = realloc(numbers, SIZE * sizeof(uint64_t));
}
if (n > max_number) {
n = ((3 * n) + 1) / 2;
count+=2;
}
else {
numbers[arr_counter] = n;
giant_buffer[n] = count;
n = ((3 * n) + 1) / 2;
count+=2;
arr_counter++;
}
}
}
}
for (uint64_t i = 0; i < arr_counter; i++) {
giant_buffer[numbers[i]] = count - giant_buffer[numbers[i]];
}
return count;
}```
I think I should do something about the if's and else
we have kinda alot
@prime ravine sorry to bother you
do you know assembly?
https://godbolt.org/z/sd9fhvEas <---- I want to apply any optimizations that I can apply manually that -O3 applies
#define BUFFER_SIZE 200000000
uint64_t giant_buffer[BUFFER_SIZE];
uint64_t max_number;
uint64_t* numbers;
uint64_t SIZE = 16432;
uint64_t times_basic_called = 0;
uint64_t times_giant_used = 0;
//create fast_atoi, exits if number is <= 0
int fast_atoi( const char * str )
{
uint64_t val = 0;
if (str[0] == '-' || str[0] == '0') {
...
are you allowed to inline assembly for your solution?
I don't think so...
Well it doesn't say I am not allowed to
but I don't think
I'm allowed either
oh so you're trying to make the C code have the optimisations in it?
you won't be able to do it entirely, since the optimiser simply has more power than you have
you could try just unrolling some loops by hand
seeing if that gets you anywhere
I mean yeah
That's what I mean
The thing is I don't really understand assembly
I just know some uhh functions idk how u call them
the mov,push etc
cmp
Can you point out some obvious ones?
I noticed there's no .main on the O3
oh there is mb
if I may, I think there are other measures that would be relevant
Like?
for example: can you print the numbers of elements in the buffer that are never actually filled
(the ones smaller than start)
Well to be honest
I don't know if that's faster or not
we Could:
collatz(2, start)
collatz(start, end)
xD
but that ruins the whole point
of the assignment so ig not
And why would I do that?
but maybe the point is not the right one, yes
because take for example this path:
5 -> 16 -> 8 -> 4 -> 2 -> 1
and stay you start at 3
3 will give you the path
3 -> 10 -> 5 ...
(wait I try to formulate the observation in my mind 😄 hahaha)
We need an algorithm to choose the best numbers first
And then guess for the others
I'm wondering if it wouldn't be possible to actually buffer a bit more than numbers, but more like, a kind of dependency graph
BTW
The professor told me another guy is talking about a guessing algorithm (that guesses which number has the biggest sequence without actually calculating it), but neither my professor trusts it as much until he sees it
you see in the current approach, when you hit 3, you're going to fill up the full path
3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1
but maybe you could just go to 1, and while iterating , you store information like:
entries[10] = { .pivot = 3, distance = 1 }
so that you don't have to go back and fill elements in your path
next time you hit 10, you know you only have to resolve it ;
I'm not sure that can be better, that said
the point is that if you allow pivot to be always smaller than the entry index, you can maybe will that buffer "very" fast,
and then re-traverse it while finding the maximal value
what's the size of giant_buffer?
epty entries are { pivot = -1 , distance = 0 } ; resolved ones are { -1 , > 0 }
you could get away with storing uint32_t, now?
I could?
When I tried doing so I can weird values
and errors
you have burned my mind
I'm giving it a try and I'll see
@torn quarry
I found some useful information in a research paper
Page 10
Has all the tricks proven by math to filter out numbers
I think
can you tell me @stone plover
what are the maximal values for start 1 and
end = 10
end = 15
end = 25
oh
sorry
I had gone to eat
yeah I will
20, 20, 24
if your numbers are
19, 19, 23
it is correct
I add +1 in the end
@torn quarry
ok !
we can't change in different systems
judge*
compare**
we can't compare in different systems
I picked that one ^
It's the latest one
yes
okay so unfortunately, my idea didn't turn to be fast enough
I think I make cache misses way too much
oof
Should we stop bothering with this
I mean probably
there are optimizations that we could find
hmmmm, conceptually, I'm not quite sure to understand why my code is so slower
but I guess it has to do with the fact I traverse the numbers once without actually resolving anything
(and then I have to perform very long jumps, I guess)
hmmmmmm, no, nop nop nop --- there is something else I do wrong
because the last O(n) loop that resolves everything, takes 1% of the time -- I'll check a bit later, I must have forgotten a check
Oh I see
aaaaaah okay I think I get it! with my approach, there is a significiant draw back, that I have to wait my cursor comes back at an index below the starting point
for example, s = 7, I will run the algo but I need to wait to come below 7 to stop
while in your case, you stop as soon as you have something in the cache (I guess?)
I think that's where I'm loosing all my time
so for example, when I hit 7,
I hit then 22 and I claim
value(22) = value(7) + 1
I continue and hit 11, so I claim
value(11) = value(7) + 2
and so on;
and I can never stop while I'm not below 7 ; even if I already discovered 11, I need to wait to come back below 7 because I need the dependencies to be always below the index ; hu hu , unfortunate
yeah yeah yeah, that's it I think
I just removed the check I made and performed one computation "too much", and I just made +50% perf decrease. so I guess that's why I'm so slow 😄 hahaha
I see
I guess then this project is finished
I'll try to make the code a bit clearer
Write some README.md
Explain the code
And it's finished
hello
What exactly is the goal of the assignment?
Calculate the fastest
Count the number of steps each number takes to reach 1?
Are start and end bounded?
Bounded meaning?
Specific?
The range of numbers is [-100m, +100m]
The fastest time test will run (probably) on 100 -> 100.000.000
Yes
The start will never be bigger than end
And the numbers will be valid from the range I mentioned
I would give you the assignment paper but it's in Greek so...
Then why do you have error checking?
I have only for negative numbers
The range supports negative numbers
However
The assignment says if the number is negative print 0 and exit
So
Yeah ig
Jude has helped me as well but we can't find anything more than this to optimize except using the O3 optimizations
Which makes it run about x2.1 times after
I think
Lmk if you have any ideas or questions @formal wyvern
Have you tried SIMD?
No idea what that is
If it requires threads we are not allowed to use
Oh alright got it
However how would I apply that in my code? I need to calculate all the numbers in order since I use memoization
@stone plover You've been looking at this for weeks has your teacher not shown you what he did yet
I have achieved a better time than my teacher
However
The first place (There's a leaderboard apparently) gets a bonus on his grade
The assignment is due in 3 days
What did you do to get a better score than your teacher
The code is here
Why is your code so hideous?
Hideous?
That's a pretty rude thing to say in a help channel
Not really something we want to see
That's true, my bad.
Well my experience in writing C is about 3-3,5 weeks sorry if my code is not very clear and well written
It's not bad
if (arr_counter >= SIZE) {
SIZE *= 2;
numbers = realloc(numbers, SIZE * sizeof(uint32_t));
}
if (n <= max_number) {
numbers[arr_counter++] = n;
giant_buffer[n] = count;
}
```this is repeated twice, this could be abstracted to a function to deal with reallocating your buffer and setting stuff
I suppose I could do that
I don't have a computer rn
I'll fix it in a bit thanks!
what's the difference between numbers and giant_buffer
numbers save temporarily numbers
The reason I save them is because I don't know the sequence for such number
So in the end I add them to the giant buffer after I have reached 0
Hmm interesting
I didn't explain it very well let me rephrase it
As I'm calculating for n, I cache every number I calculate in the process. (If I haven't already calculated obviously). After I finish calculating for n, I add to the cache all the numbers the giant buffer with the difference of the biggest count to the smaller
I hope I made some sense
Yep
Hurubon mentioned SMID but I can't see any way that I can apply that in my code
Since I don't calculate many values in the same time
I calculate one each loop
How does this compare with your solution?
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <limits.h>
int buffer[1 << 28] = { 1, 1 };
int collatz_length(int n)
{
if (buffer[n] == 0 && n % 2 == 0)
{
buffer[n] = 1 + collatz_length(n / 2);
}
else if (buffer[n] == 0 && n % 2 == 1)
{
buffer[n] = 2 + collatz_length((3 * n + 1) / 2);
}
return buffer[n];
}
int main(int argc, char** argv)
{
if (argc != 3)
{
printf("Usage: collatz <start> <end>");
return 1;
}
int start = atoi(argv[1]);
int end = atoi(argv[2]);
if (start <= 0 || end <= 0)
{
printf("0");
return 1;
}
int max = INT_MIN;
for (; start <= end; start += 1)
{
int const result = collatz_length(start);
if (result > max)
{
max = result;
}
}
printf("%d", max);
}
It's verry difficult to try to apply simd to collatz
Let me give it a try
It might not work, change the 1 << 16 to something like 1 << 28.
What's the verdict?
Sorry family needed me for a bit
Im running the test now
test2.c:52:18: error: ‘result’ undeclared (first use in this function)
52 | printf("%d", result);
| ^~~~~~
test2.c:52:18: note: each undeclared identifier is reported only once for each function it appears in```
I think you meant max let me change it
trilon@AK-47:~/collatz$ time ./test2 100 100000000
Segmentation fault
real 0m0.001s
user 0m0.001s
sys 0m0.000s```
I changed the code a bit.
Try to run it again.
segfault
trilon@AK-47:~/collatz$ time ./test2 100 100000000
Segmentation fault
real 0m0.007s
user 0m0.007s
sys 0m0.000s
compiler settings
gcc -O0 -m32 -Wall -Wextra -Werror -pedantic -o test2 test2.c
Hmm, let me see why it segfaults.
1.1G file?
The way the array is initialized is problematic
If you write int buffer[1 << 28]; and then set buffer[0] = 1; buffer[1] = 1; in main then the array can live in .bss and the file won't be 1.1GB
Also
Yes, but it won't be zero-initialized.
Which is what I want.
Oh no
It will be zero initialized
I'm not sure why it's segfaulting though. Could try running in a debugger to see why though
I tried using godbolt
Working on it.
it didn't like the 1.1gb
Yeah CE won't like that
visual studio code debugger might work let me check
The C standard sets some minimums the compiler must support, e.g. 64 arguments in a function all and whatnot
In practice compilers support more
Alright
I can't get it to run bruh
where does it say the error is
It should be either a stack overflow from recursion or out of bounds buffer access
yeah I can't even run the code because my vs code is broken
hmm
trying to fix it give me a sec
@stone plover I tried giving it 1 << 29 space and it seems not to crash on godbolt.
Try (1 << 29) - 1.
arguments*
Yes.
yeah I noticed that
it still gives me a seg fault
wait
I tried gcc test2.c
/usr/lib/gcc/x86_64-linux-gnu/11/crtbeginS.o: in function `deregister_tm_clones':
crtstuff.c:(.text+0x3): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
crtstuff.c:(.text+0xa): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .data section in a.out
/usr/lib/gcc/x86_64-linux-gnu/11/crtbeginS.o: in function `register_tm_clones':
crtstuff.c:(.text+0x33): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
crtstuff.c:(.text+0x3a): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .data section in a.out
/usr/lib/gcc/x86_64-linux-gnu/11/crtbeginS.o: in function `__do_global_dtors_aux':
crtstuff.c:(.text+0x76): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x9e): relocation truncated to fit: R_X86_64_PC32 against `.bss'
collect2: error: ld returned 1 exit status```
1<<29 is too big
Do we even need such a big cache
The reason is the memory model of the binary itself is limited to a couple gigabytes, unless you use a different model which may be slower
lets just calloc it and test it
it shouldn't have such a huge difference anyways I think
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <limits.h>
static inline int collatz_length(int* const cache, int n)
{
printf("Input: %d\n", n);
if (cache[n] == 0 && n % 2 == 0)
{
cache[n] = 1 + collatz_length(cache, n / 2);
}
else if (cache[n] == 0 && n % 2 == 1)
{
cache[n] = 2 + collatz_length(cache, (3 * n + 1) / 2);
}
return cache[n];
}
int main(int argc, char** argv)
{
if (argc != 3)
{
printf("Usage: collatz <start> <end>");
return 1;
}
int start = atoi(argv[1]);
int end = atoi(argv[2]);
int* const cache = calloc(1U << 31, sizeof(int));
cache[1] = 1;
if (start <= 0 || end <= 0)
{
printf("0");
return 1;
}
int max = INT_MIN;
for (; start <= end; start += 1)
{
int const result = collatz_length(cache, start);
if (result > max)
{
max = result;
}
}
printf("%d", max);
}
test2.c: In function ‘collatz_length’:
test2.c:13:41: error: passing argument 1 of ‘collatz_length’ makes pointer from integer without a cast [-Werror=int-conversion]
13 | cache[n] = 1 + collatz_length(n / 2);
| ~~^~~
| |
| int
test2.c:8:45: note: expected ‘int * const’ but argument is of type ‘int’
8 | static inline int collatz_length(int* const cache, int n)
| ~~~~~~~~~~~^~~~~
test2.c:13:24: error: too few arguments to function ‘collatz_length’
13 | cache[n] = 1 + collatz_length(n / 2);
| ^~~~~~~~~~~~~~
test2.c:8:19: note: declared here
8 | static inline int collatz_length(int* const cache, int n)
| ^~~~~~~~~~~~~~
test2.c:17:51: error: passing argument 1 of ‘collatz_length’ makes pointer from integer without a cast [-Werror=int-conversion]
17 | cache[n] = 2 + collatz_length((3 * n + 1) / 2);
| ~~~~~~~~~~~~^~~
| |
| int
test2.c:8:45: note: expected ‘int * const’ but argument is of type ‘int’
8 | static inline int collatz_length(int* const cache, int n)
| ~~~~~~~~~~~^~~~~
test2.c:17:24: error: too few arguments to function ‘collatz_length’
17 | cache[n] = 2 + collatz_length((3 * n + 1) / 2);
| ^~~~~~~~~~~~~~
test2.c:8:19: note: declared here
8 | static inline int collatz_length(int* const cache, int n)
| ^~~~~~~~~~~~~~
test2.c: In function ‘main’:
test2.c:34:33: error: left shift count >= width of type [-Werror=shift-count-overflow]
34 | int* const cache = calloc(1 << 32, sizeof(int));
| ```
Fixed.
just do this https://godbolt.org/z/4949MYx3f
test2.c:34:24: error: argument 1 value ‘2147483648’ exceeds maximum object size 2147483647 [-Werror=alloc-size-larger-than=]
34 | int* const cache = calloc(1 << 31, sizeof(int));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from test2.c:2:
/usr/include/stdlib.h:543:14: note: in a call to allocation function ‘calloc’ declared here
543 | extern void *calloc (size_t __nmemb, size_t __size)
| ^~~~~~
cc1: all warnings being treated as errors```
Alright
test2.c: In function ‘main’:
test2.c:35:13: error: product ‘1073741824 * 4’ of arguments 1 and 2 exceeds ‘SIZE_MAX’ [-Werror=alloc-size-larger-than=]
35 | cache = calloc(1 << 30, sizeof(int));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from test2.c:2:
/usr/include/stdlib.h:543:14: note: in a call to allocation function ‘calloc’ declared here
543 | extern void *calloc (size_t __nmemb, size_t __size)
| ^~~~~~
cc1: all warnings being treated as errors```
That'd be the OOM protection
isn't 1<<30 huge
try reducing the cache size then
Yes.
I put it back down to 16 and I just get [truncated].
Which is kind of sus, though.
seg fault
with 100 10000
Input: 148958
also why you assume the value is cached?
I think I expect the max recursion depth to be exceeded
You see sequence lengths in the thousands
oh that
yeah
@formal wyvern Why would this approach work better?
It wouldn't necessarily, I was just wondering how the simple case compares to your solution.
Also that solution would necesarily work
Or maybe it would hmm
SMID isn't applicable here right?
Not particularly
It’s theoretically possible to use SIMD but the benefit is very questionable
It’d be complex at the very least
Thanks a ton for your help! Really appreciate it!
Haha