#How can I optimize my code further?
404 messages · Page 1 of 1 (latest)
what is the code meant to do
like, what's your intent
fn main() {
let mut s = String::new();
for mut i in 0_u128.. {
s.clear();
while i > 0 {
s.push(char::from((i % 94) as u8 + 32));
i /= 94;
}
println!("{}", s);
}
}
how's this
yes, which made it slower
... are you not benchmarking this?
Why are you using tokio here @misty vector ?
Look into rayon
Tokio is giving you nothing here
rayon probably wont' help much either because stdout has a mutex on it
and i highly doubt anything but the printing is the expensive part
Yeah very true, for the code you've sent it'd be very fast just as a normal iterator
Tokio runs async code across multiple threads, but your code is not async at all
Is this just a thought exercise, or is your code noticeably slow and you'd like to improve the speed?
If your number of iterations is high enough. You should benchmark this kind of thing for sure
Building the string is going to be super fast compared to the println
For lower iterations just a single threaded iteration would be faster
How are you measuring it to be "insanely slow"?
It takes a second for each println?
Your biggest bottleneck is just the println
But also make sure you're running it in release mode
Just printing to stdout is much slower than any of the computations your code is doing
I'm not sure what help i can give, since i don't really know your ultimate goal
You're pretty much benchmarking printing to std out using your eyes as a measurement
Let start = Instant::now()
...
let ns = start.elapsed().as_nanos();
Print ns
The workaround is simple: Don't print
That's all there is to it
The alternatives are:
- Print less, e.g. only every 1000th iteration
- Get something that can dump to files really fast
Also to come back to this:
If your loop could run at 10GHz (completely unrealistic), a u64 would overflow after 58 years of runtime. You don't plan to run this for 100s of years, do you? No? Then maybe use an u64, it's enough
edit: woops, I misread something, sorry for the ping
i ran a flamegraph on that code (literally just cargo flamegraph > /dev/null)
i use flamegraphs for like 80% of my optimization work, generally, they're great
i factored stdout locking outside the loop, which surprisingly hardly improved performance; imma try replacing the format machinery next
circumventing format machinery also hardly improves performance (it's visible though)
flamegraph looks like this now
(i changed the loop signature to for mut i in 0u128..1000000 { btw to have a finite runtime to compare with)
replacing String with Vec<u8> doesnt change performance whatsoever, surprinslgy
imma try replacing the Vec with an array
that brought down runtime by another ~3%
hmm what now
oh of course
buffering
how did i only think of that now
LMAO
0.7s to 0.05s
buffering is overpowered
i 100x'ed the loop limit and it takes just 0.99s
ill play around with the buffer size a bit (i started with my go-to size of 50KB)
hm doesnt change very much, ill go with 1MB, it hovers around 0.97s
kinda sad i only tried the buffering thing now; the buffering issue presumably overshadowed the real improvements by all the ideas i had before
anyways heres the flamegraph now
around half of runtime is copying bytes into libc; that's kind of a dead end (if you don't venture into obscure kernel features)
the other half is everything that's not function calls, so, presumably the straight math done in the main function
we can try to calculate two digits per loop cycle
by dividing through 94² instead of 94
before i do that ill check the assembly of the core loop
?godbolt ```rust
pub fn foo(buf: &mut [u8], mut n: u64) {
let mut i = buf.len() - 1;
while n > 0 {
i -= 1;
if i >= buf.len() {
unsafe { std::hint::unreachable_unchecked() }
}
buf[i] = (n % 94) as u8 + 32;
n /= 94;
}
}
example::foo:
test rdx, rdx
je .LBB0_3
mov rcx, rdx
add rsi, -2
movabs r8, 6279742663390485657
.LBB0_2:
mov rax, rcx
shr rax
mul r8
shr rdx, 4
imul eax, edx, 94
mov r9d, ecx
sub r9d, eax
add r9b, 32
mov byte ptr [rdi + rsi], r9b
dec rsi
cmp rcx, 94
mov rcx, rdx
jae .LBB0_2
.LBB0_3:
ret
```Note: only public functions (`pub fn`) are shown
(i changed it to u64 because with u128 it just called into a stdlib function which doesn't optimize nicely; and i added the unreachable_unchecked to add the context which the full program probably provides, and to remove clutter from the asm)
ok i cant tell much from the asm
nice, unrolling the loop to two digits per cycle actually helped rust let mut i = buf.len() - 1; while n > 0 { i -= 2; buf[i + 1] = (n % 94) as u8 + 32; n /= 94; buf[i] = (n % 94) as u8 + 32; n /= 94; } if buf[i] == 32 { i += 1; }
~0.88s -> ~0.84s
scaling up to 3 digits per cycle made it much worse (>0.9s)
changing u64 down to u32 chops off 15% of runtime but i cant get away with that because it would overflow in like a few dozen seconds
hmm one idea i still have is a lookup table
mapping n % 94² to the two ascii digits it produces
that would be slightly messy to implement
#![allow(unused_must_use)]
use std::io::Write as _;
fn main() {
let stdout = std::io::stdout();
let stdout = stdout.lock();
let mut stdout = std::io::BufWriter::with_capacity(1000000, stdout);
let mut buf = [0; 21]; // 20 characters + newline to fit all base-94 u128's
buf[buf.len() - 1] = 10; // newline
for mut n in 0u64..100000000 {
let mut i = buf.len() - 1;
while n > 0 {
i -= 2;
buf[i + 1] = (n % 94) as u8 + 32;
n /= 94;
buf[i] = (n % 94) as u8 + 32;
n /= 94;
}
if buf[i] == 32 {
i += 1;
}
stdout.write_all(&buf[i..]);
}
}
i changed it because u128 were strictly useless for this specific program
sure
i realize you probably wanna convert number above u64 irl
but for benchmarking, you kinda need a specific piece of code to compare
and this specific piece of code just never reached the number ranges that u128 supports
u64 holds up to 2^64-1, u128 holds up to 2^128-1
?play ```rust
dbg!(u64::MAX);
dbg!(u128::MAX);
[src/main.rs:2] u64::MAX = 18446744073709551615
[src/main.rs:3] u128::MAX = 340282366920938463463374607431768211455```
length of what characters?
ah ok
when i replace the loop header with for mut n in [u64::MAX as u128, u128::MAX] {, it prints ```rust
@22>%,ipPg
+"G}PV:g:h%6edu))p+-
so that's 10 and 20
sure we can do u128
what im saying with u64 is u128 is, currently the code just loops upwards from 0
and when doing literally that, you'll never reach generated strings above 10 chars
so doing hardcore optimization and benchmarking on u128 is kinda pointless
we'd need a more realistic benchmark
for example one that samples n uniformly-ish from the whole u128 range, not just counts upwards
the latter
also, i unrolled the inner loop, not the outer loop
i made it convert two digits at a time, not to numbers at a time
where
ok i can explain both
this is part of a more efficient printing system
wait actually, it was just an oversight by me
i assumed we're converting to base94 big endian (i.e. the most significant base94 digit coming first)
hm okay
i would guess it's pretty similarly fast
well anyways the reason i did - instead of + because i start at the very right, at the least significant, and then write the digits towards the left, going ever more significant
because i thought we should be most significant first
but you're right, it should have been +, i.e. going more significant digits to the right, because that's what the original code did
thats part of the unrolling
for example if we wanna convert 123456 to base94, the code would convert the first two digits to 34 and 91, and then the next two to 13 and 0
but now you see we overshot by one digit
we should've stopped before the 0 because we were already done at that point, but we couldn't stop because we did 2 digits at a time
lol you must be on linux
I'm on Windows and, not sure how close it is but after several minutes it's still at 4 letters
I took a stab at writing it faster too but I can't get it to finish because of printing
so with ```rust
if buf[i] == 32 {
i += 1;
}
if i let it print to terminal it takes ages for me too
i benchmarked everything while routing to /dev/null
Aha, makes sense
You can use b' ' instead of 32 by the way to avoid such magic constants
If 0123456789abcdef are the characters then your program is not behaving the same way?
I'm getting stuff like
!:E?!
!:E?"
!:E?#
!:E?$
!:E?%
!:E?&
!:E?'
!:E?(
!:E?)
!:E?*
!:E?+
!:E?,
!:E?-
!:E?.
!:E?/```
could explain the speed difference, this is doing a lot more work
you mean know when the program should stop?
that's dictated by the loop condition at the top
in my current code, the loop just stops at 100000000
~0.85s
there is no mutex right now
i dont know
it may
it may not
i got another optimization idea, from the newly added context
of those i only expect codegen-units to be able to make a difference in this case
ill try
mayybe
although, not really actually
we are only using stdlib
and i dont think lto works across stdlib if you use the default precompiled stldib
i see no clear difference
not really
the only significant stdlib call here is write_all
and its actually marked #[inline]
so it can be inlined even if lto is off, if i remember correctly
as far as I remember, setting lto = true enables inlining functions that aren't marked inline
but all relevant functions in this code are already marked inline (write_all, and slice.len() i suppose)
this is a bit adventurous but here goes:
instead of incrementing an integer and converting it to base94 every time, we could increment the base94 directly
either naively (i.e. loop through, least significant to most significant, keeping track of carry), which could also already be faster tbh
or smartly, by taking advantage of native integer addition and carry (something that the 31GB/s FizzBuzz impl also did iirc)
not really, it's probably a decent chunk of work to get all working
we probably could.. :p
before attempting, it would be a wise idea to formalize the requirements though
for example if it's ok to start with leading zeroes, and/or if it's ok to hardcode to a certain character length, the programming would get much easier and probably faster
ok so the output must be identical to your original posted code EXCEPT the base94 numbers can also be big endian instead of little endian? @misty vector
no updates
i havent started
no promises i will
im just here for fun ^^
why there?
interesting
i mean if you want i can do some coding and explain
idk how far ill get
although that would be easier in a voice channel imo
thats ok
i was too, for the longest time
still am a bit but much less
https://prod.liveshare.vsengsaas.visualstudio.com/join?CDE34FC13B724209DD24189712FF4BB491FE @misty vector
Build with Visual Studio Code, anywhere, anytime, entirely in your browser.
@balmy bison
Build with Visual Studio Code, anywhere, anytime, entirely in your browser.
ok
(we continued this conversation in voice channel, and then DMs)
current state, if anyone is interested
without the stripping, the output is 1.6GB/s
at this rate, it will take roughly 370 days until the u64 accumulator is overflowed
thats enough for my standards
how are you measuring, and how are you invoking crunch?
prooobably not, because in the end the data still needs to be outputted sequentially
at least if we're using stdout?
see the command visible in the screenshot
when i do that, it has quite some delay
what are you doing about that
oh
and youre usin literally 1 4 as arguments?
bc that doesn't do the same thing as our rust code
crunch seems to default to a-z as the charset
and 1 4 prints only 475254 lines
crunch 1 4 prints 2357264 bytes, taking 3s + 0.085s for me
that's 28 MB/s
lol
yeah
yeah
it would probably be a tiny bit slower with big endian
the space stripping can probably be faster
and we can probably calculate more per loop cycle
for example we can have an array of 8 accumulators representing the first 8 strings. In each loop cycle we'd batch increment them all by 8 and and roll around them all, to get the next 8 strings
the compiler will probably help us with autovectorization
sorry just arrived home
yeah
it keeps track of how long the generated strings get
checkpoint is the index where the string length is gonna get one bigger
that is the question
find some way to not have to compare i to checkpoint every itreation
or make the check less expensive
the length doesnt increase every 93 iterations
not even every 94
it grows exponentiallyy
think about base10
there's 9 single digit numbers (1-9)
then 90 two digit numbers (10-99)
then 900 three digit numbers (100-999)
and so on
i dont know
im not the computer
ask the computer :p
why?
relative timings should be consistent
you can measure the old one and then the new one
itll be interesting to see if your baseline is faster or slower than mine ^^
cpu compare time
your cpu has a higher passmark score than mine
although my single thread rating is slightly higher
which is what counts here
so lets see
share your troubles
you always need to leave so suddenly
ok i made the batch thing
sweet spot seems to be 8 x u64
but even with that, i get only ~1.5% improvement (using target-cpu=native)
so thats kinda sad
?godbolt flags="-C opt-level=3 -C target-cpu=native" ```rust
pub fn increment(acc: &mut [u64; 8]) {
for acc in acc {
*acc += 8;
*acc += ((!*acc & 0x8080808080808080) >> 7) * 0xA2;
}
}
.LCPI0_0:
.quad -9
.LCPI0_1:
.quad 72340172838076673
.LCPI0_2:
.quad 162
.LCPI0_3:
.quad 8
example::increment:
vpbroadcastq ymm0, qword ptr [rip + .LCPI0_0]
vmovdqu ymm1, ymmword ptr [rdi]
vpbroadcastq ymm4, qword ptr [rip + .LCPI0_1]
vpbroadcastq ymm5, qword ptr [rip + .LCPI0_2]
vmovdqu ymm2, ymmword ptr [rdi + 32]
vpbroadcastq ymm7, qword ptr [rip + .LCPI0_3]
vpsubq ymm3, ymm0, ymm1
vpsubq ymm0, ymm0, ymm2
vpaddq ymm1, ymm1, ymm7
vpaddq ymm2, ymm2, ymm7
vpsrlq ymm3, ymm3, 7
vpsrlq ymm0, ymm0, 7
vpand ymm3, ymm3, ymm4
vpand ymm0, ymm0, ymm4
vpmuludq ymm6, ymm3, ymm5
vpsrlq ymm3, ymm3, 32
vpmuludq ymm3, ymm3, ymm5
vpsllq ymm3, ymm3, 32
vpor ymm3, ymm6, ymm3
vpaddq ymm1, ymm1, ymm3
vpmuludq ymm3, ymm0, ymm5
vpsrlq ymm0, ymm0, 32
vpmuludq ymm0, ymm0, ymm5
vmovdqu ymmword ptr [rdi], ymm1
vpsllq ymm0, ymm0, 32
vpor ymm0, ymm3, ymm0
vpaddq ymm0, ymm2, ymm0
vmovdqu ymmword ptr [rdi + 32], ymm0
vzeroupper
ret
```Note: only public functions (`pub fn`) are shown
considering how vectorized the assembly looks
This is where i got the idea of that carry trick from btw
500+ chars?
idk i’m probably getting something wrong
but you do not need to use 128u for 500+ chars
i’m probably just being stupid
yeah youre missing context but you cant be blamed
500+ chars in the base94 encoding of the number
although u128 would not be enough for that either lol
that's more like u4096
would you not want to store that as a string lol
the encoded version sure
but the source number
also needs to be stored
anyhow
since the task is to count upwards, even 20+ chars in the base94 encoding will never be reached
@misty vector btw i have another idea for quicker IO
Linux vmsplice syscall
its a rabbit hole i already tried to venture into, once, but didnt really manage
Windows doesn't have an equivalent afaik
Bad luck for windows
The compiler generates simd instructions automatically
That's called auto vectorization
The code needs to be suitable for it
Like in the godbolt code i posted above
warning: function `main` is never used
--> <source>:3:4
|
3 | fn main() {
| ^^^^
|
= note: `#[warn(dead_code)]` on by default
warning: 1 warning emitted
fn main() {
let mut stdout = BufWriter::with_capacity(1000000, stdout().lock());
const BATCH_SIZE: usize = 8;
let mut acc: [u64; BATCH_SIZE] = std::array::from_fn(|i| 0xA2A2A2A2A2A2A2A2_u64 + i as u64);
let mut i = 0;
while i < 100000000_u64 {
for &acc in &acc {
let _ = stdout.write_all(&(acc - 0x8282828282828282).to_le_bytes());
let _ = stdout.write_all(&[10]);
}
for acc in &mut acc {
*acc += BATCH_SIZE as u64;
*acc += ((!*acc & 0x8080808080808080) >> 7) * 0xA2;
}
i += BATCH_SIZE as u64;
}
}
possibly
but that much lowers the chance of the compiler recognizing the autovectorizable code
think like a compiler
what SIMD is, apply the same math instruction on multiple values at once, basically
when the compiler sees this, it's a very easy deduction that those loop iterations can be done in parallel using SIMD instructions
however, if those math statements were interleaved with stdout.write_all(), the compiler would first have to undo the interleaving and prove that it is allowed to batch the math instructions together without changing behavior
so we precisely do not want to combine the for's
to incentivize the autovectorization
because i removed the space stripping again
because my sense of aesthethic hates its presence in the code
you can add the previous code back
that keeps track of the length
why btw
is this not all for fun to beat crunch?
i suppose
altho i would argue crunch has a slightly different model
it doesnt have dynamic lengths
you can tell it fixed lengths that it will all go through
you just said we have to do the same behavior of crunch
now you're implying we have to do better behavior than crunch? :p
what?
is ammount from your code?
Is this for cracking passwords? 😅
Just wondering why it was relevant that most people have passwords longer than 10 characters
better = worse in every single way, but faster
cant configure it at all
everything hardcoded
incomprehensible code
no help page
lol
with what even
i often need to puzzle together a lot of missing context from your messages
you need help with.. itself?
what do you mean
uh
is that a question?
is it a statement?
oh no
stop it
ripping out meaning of code by evaluating it manually is the compilers job, not yours :(
anyways
didnt mean to insult you, it's just kinda frustrating xD
ok i see an issue with the approach..
we're batch cycling through 8 strings at a time now
so at one point, we are at the 88th string, and the next cycle we're at the 96th string
and we skipped the 94th
where len would have had to be incremented
that ofc a bit rip
id restructure the whole code a bit
to work more like the crunch model
i.e. not have dynamic lengths
instead, have a single fixed length
but then just run that entire fixed length procedure multiple times, for every length we want
yea
change batch size
to.. 2
everything else, 94 is not divisible by
mh
i suppose
you could stop the batching right before reaching the 94
switch to a slow path
after passing the boundary, switch to fast path again
but thats really complciated
btw in my benchmarks, 2 was literally slower than 1
and batch size 1 is the same as not batching at all
🙃
the slow path is slow, indeed
but the slow path would only be executed for the numbers from 88 to 96, in this example
because we need to go one by one in order to catch the 94 threshold
all in all, most numbers are still executed with the fast path
yeah
exactly
thats why we should change structure
says who
its a mii mario
to me, static feels better
also feels better fitting to the actual task, generating wordlists
well in the end the two approaches are equivalent
both can do what the other can
they are just structured differently
and the static one works better in this case i think
what
please make your question easier to unerstand
my understanding aint understandingin
what do you wanna do
what is there to make work
std::io::buffered::bufwriter::BufWriter<W>::write_all_cold:
push r15
push r14
push r12
push rbx
push rax
mov r14, rdx
mov r15, rsi
mov rbx, rdi
mov rax, qword ptr [rdi + 8]
mov rcx, rax
sub rcx, qword ptr [rdi + 16]
cmp rcx, rdx
jae .LBB0_3
mov rdi, rbx
call std::io::buffered::bufwriter::BufWriter<W>::flush_buf
test rax, rax
jne .LBB0_6
mov rax, qword ptr [rbx + 8]
.LBB0_3:
cmp rax, r14
jbe .LBB0_5
mov r12, qword ptr [rbx + 16]
mov rdi, qword ptr [rbx]
add rdi, r12
mov rsi, r15
mov rdx, r14
call qword ptr [rip + memcpy@GOTPCREL]
add r12, r14
mov qword ptr [rbx + 16], r12
xor eax, eax
jmp .LBB0_6
.LBB0_5:
mov byte ptr [rbx + 24], 1
lea rdi, [rbx + 32]
mov rsi, r15
mov rdx, r14
call qword ptr [rip + <std::io::stdio::StdoutLock as std::io::Write>::write_all@GOTPCREL]
mov byte ptr [rbx + 24], 0
.LBB0_6:
add rsp, 8
pop rbx
pop r12
pop r14
```Note: only public functions (`pub fn`) are shown
Output too large. Godbolt link: <https://godbolt.org/z/6vjhh134G>
Crates:
/crate Lookup crates on crates.io
/doc Lookup documentation
Godbolt:
?godbolt View assembly using Godbolt
?mca Run performance analysis using llvm-mca
?llvmir View LLVM IR using Godbolt
/targets Lists all available godbolt rustc targets
Utilities:
/go Evaluates Go code
/source Links to the bot GitHub repo
/help Show this menu
/uptime Tells you how long the bot has been up for
/cleanup Deletes the bot's messages for cleanup
/ban Bans another person
/selftimeout Self-timeout yourself.
Modmail:
/modmail Send a private message to the moderators of the server.
Playground:
?play Compile and run Rust code in a playground
?eval Evaluate a single Rust expression
?miri Run code and detect undefined behavior using Miri
?expand Expand macros to their raw desugared form
?clippy Catch common mistakes using the Clippy linter
?fmt Format code using rustfmt
?microbench Benchmark small snippets of code
?procmacro Compile and use a procedural macro
You can still use all commands with `?`, even if it says `/` above.
Type ?help command for more info on a command.
You can edit your message to the bot and the bot will edit its response.
thread 'main' panicked at src/main.rs:10:80:
range end index 17 out of range for slice of length 16
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
A
B
C
D
E
F
G
H```Output too large. Playground link: <https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=8309171b7f9625f7f44536aa49f6e5a8>
is this an attempt at the static structure?
if so, the loop header should be more like for _ in 94.pow(len), since if a single digit has 94 possible values, n digits have 94^n possible values
ah ok
i forgot what amount is
btw the word amount has just a single m
maybe you remember a certain way
that i mentioned
that includes a word that starts with s and ends with tatic
no, that would be super tatic
for any given length, it would generate strings of that length, and those would just not contain spaces
in the code, there'd be [..len]
cutting off the acc
imma be honest, tough luck then
i feel like the dynamic thing is kinda doomed
the problem is, like
there needs to be a check on every loop iteration
yk
if the length can dynamically change
i would guess so
yes
look, the pseudocode is:
dynamic ```
for every value {
generate string
find how many spaces to remove, and remove em
}
static ```
for every string len {
for every value {
generate string
remove spaces (we know how many, because we know len
}
}
i dont care tbh, i just say what i think will have better performance / lend itself to better performance in the future
i will venture to bed, maybe ill even stay there for the night, so you might not hear anything from me in the next 8 hours or so
what does this do?
not a particularly helpful answer tbh
this thread is gigantic
prints base94 numbers (alphabet from ascii 32 (space) to 126), starting from 0 upwards
i think a baby got into your computer
NONBLOCK is a recipe for disaster
in this case
probably because it's overwriting buffers that still have to be read, one way or another
NONBLOCK makes it inevitable to overwrite pending buffers
but even without it, it's still easy to accidentally do it
im not sure, but i believe the way to go is:
- decide on a certain buffer size N
- set the linux pipe buffer size to N*2
- create two [u8; N] in the program
- fill them and expose them to the pipe using vmsplice, taking turns (without NONBLOCK!)
no
there is no way to use nonblock
nonblock implies not waiting for the receiving end of the pipe
which is cheating
not waiting for the receiving end of the pipe is basically discarding data
it's like skipping chunks of output
why is it not
no
it will not
no
do you understand how vmsplice works, how it avoids copying by exposing just a buffer reference to the pipe receiver?
yes
also, the way i understand it, if the receiving end is fast enough, vmsplice doesn't block even without NONBLOCK
so for any performance testing, where the receiving end is always fast enough, it should make no difference
improvement of what exactly?
im not sure
maybe, maybe not