Ridiculously fragile performance | Rust Programming Language Community | Page 1

gloomy jungle Feb 17, 2026, 2:23 PM

#

I have a (decoding) function which I've been trying to optimize as best I can for a while.
I finally managed to get it to what I believe to be the 'best' possible, at ~4.1Gib/s.
I then went to actually clean up my crate, splitting things into modules and such.
However, I noticed that performance had effectively halved after doing the organization, despite no actual code changing.
I set about fixing this, and it turned out to be inlining (which was my suspicion, since I had moved away from the 'everything in one module' approach). I added some #[inline]s to certain methods and got back to my original performance.
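
For context, a minimal sketch of the kind of change described above. All names here (the `reader` and `decode` modules, `Reader`, `next_byte`, `sum_bytes`) are made up for illustration and are not from the crate. As far as I understand, `#[inline]` is only a hint: it makes the helper's body available for inlining in callers elsewhere, but the compiler still decides whether to actually inline it.

```rust
// Hypothetical module layout; none of these names come from the real crate.
mod reader {
    pub struct Reader<'a> {
        data: &'a [u8],
        pos: usize,
    }

    impl<'a> Reader<'a> {
        pub fn new(data: &'a [u8]) -> Self {
            Self { data, pos: 0 }
        }

        /// Small hot-path helper living in a different module than its caller.
        /// `#[inline]` is a hint, not a guarantee.
        #[inline]
        pub fn next_byte(&mut self) -> Option<u8> {
            let byte = *self.data.get(self.pos)?;
            self.pos += 1;
            Some(byte)
        }
    }
}

mod decode {
    use super::reader::Reader;

    /// Stand-in for a real decoding loop: it just sums all input bytes.
    pub fn sum_bytes(data: &[u8]) -> u64 {
        let mut reader = Reader::new(data);
        let mut total = 0u64;
        while let Some(byte) = reader.next_byte() {
            total += u64::from(byte);
        }
        total
    }
}

fn main() {
    println!("{}", decode::sum_bytes(&[1, 2, 3]));
}
```

Whether the hint took effect can be checked by looking for a remaining call to the helper in the generated assembly of the hot function.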

I thought that was where this issue would end, but then I added encoding to my crate, and despite nothing in it even remotely interacting with the decoder function, the decoder's performance halved again.
If I comment out the encoding module in my lib.rs (leaving just the decoding module), performance is back to full.
Uncomment it, and performance is halved.
At this point I'm a bit lost: since nothing actually interacts with my decoder function, I don't even have a starting point to work with.

Why on earth could performance be so fragile?
I genuinely don't understand how methods that don't even interact with the decoder function could impact performance so much.
I even tried using a single codegen unit, thinking the change in layout may have resulted in an unfortunate assignment.
I also tried LTO, thinking extra inlining might help.
However, neither of those helped (in fact, they even made things worse, see #1470786138179371124, but that is not what this issue is about).

dim vault Feb 17, 2026, 2:28 PM

#

it may still be inlining. for instance, if you have a function that's called in only one place right now, the compiler is more likely to inline it. if the decoder also calls it, now it's 2 calls and it might not get inlined

#

there's also the possibility that it's down to how the compiler orders things in memory, but that's very niche and unlikely to affect performance to this degree

gloomy jungle Feb 17, 2026, 2:31 PM

#

hm, given that the encoder and decoder modules are entirely independent (each can be commented out without affecting the other), the issue is reduced to the two remaining shared modules, one of which just contains consts, the second only has an enum definition

#

the only actual methods there are the Debug, Clone, PartialEq and Display implementations of said enum

#

none of which are actually called in the decoder

gloomy jungle Feb 17, 2026, 3:14 PM

#

the generated assembly is quite drastically different

#

the outgoing references are actually slightly different

#

bad one

#

good one

#

though notably this is a level deeper than the function itself

#

also, it seems the bad version returns undefined1 [16], while the good one returns undefined8

#

could this be #1470536083421921362 recurring?

#

I could send the cpp that ghidra decompiles them to if someone could read anything from that

gloomy jungle Feb 17, 2026, 4:25 PM

#

ok, I think I may have found one potential reason
my error enum is as follows:

pub enum Error {
    UnknownTag(u8),
    UnexpectedEnd,
    InvalidSmallBigLen(u8),
}

there are some niches here though: the payloads of both UnknownTag and InvalidSmallBigLen can never take some (different) subset of values
it seems that for some reason, without the other modules the compiler makes use of these and thus returns a single u8, but when they are present it falls back to using two registers for the discriminant and value (which for some reason is presented as a [u8; 16] by ghidra)
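
One way to check the layout part of this theory directly is to print the sizes the compiler picks, in both the 'good' and the 'bad' build. A rough sketch follows. The `Result<u64, Error>` alias is an assumption for illustration, since the real return type of fill_tape isn't shown here. Also note that, as far as the compiler is concerned, a plain u8 payload can hold all 256 values, so values the decoder never produces are not niches it can see. The default enum layout is unspecified, so the printed numbers may differ between compiler versions.

```rust
use std::mem::size_of;

// Copy of the enum from the message above.
#[allow(dead_code)]
pub enum Error {
    UnknownTag(u8),
    UnexpectedEnd,
    InvalidSmallBigLen(u8),
}

// Hypothetical return type, for illustration only; the real signature
// of `fill_tape` is not shown in this thread.
type DecodeResult = Result<u64, Error>;

fn main() {
    // Size of the error on its own (discriminant plus payload).
    println!("Error: {} bytes", size_of::<Error>());
    // Size when wrapped in a Result with no success payload.
    println!("Result<(), Error>: {} bytes", size_of::<Result<(), Error>>());
    // Size when the success value is a full 8-byte integer.
    println!("Result<u64, Error>: {} bytes", size_of::<DecodeResult>());
}
```

If the sizes come out identical in both builds, the difference in the return convention more likely comes from how the function was optimized than from the enum's layout.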

#

it makes no sense why it can use those niches without the other modules, but not with them tho

#

none of the encoding code touches this error enum, as it's exclusive to just decoding

#

maybe the compiler gets 'scared' and takes a safer approach?

dusk grove Feb 17, 2026, 4:51 PM

#

how do you benchmark?

gloomy jungle Feb 17, 2026, 4:53 PM

#

criterion's bench_with_input, with this as the actual benchmark function:

#[expect(clippy::panic)]
fn decode(b: &mut Bencher<'_>, data: &[u8]) {
    let mut tape = Tape::with_capacity(4096);
    b.iter(|| {
        if let Err(err) = fill_tape(&mut tape, data) {
            panic!("Error: {err:?}");
        }
        tape.clear();
    });
}

-# note that fill_tape has #[inline(never)]
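
A possible sanity check on the harness itself: the same loop with only the standard library, passing the input through `std::hint::black_box` so the optimizer cannot specialize on it. This is a sketch under assumptions, not the real benchmark: the `fill_tape` below is a hypothetical stand-in that just copies bytes, a `Vec<u8>` stands in for Tape, and `time_decode` is a made-up helper.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical stand-in for the real `fill_tape`; it only copies bytes.
#[inline(never)]
fn fill_tape(tape: &mut Vec<u8>, data: &[u8]) -> Result<(), ()> {
    tape.extend_from_slice(data);
    Ok(())
}

/// Runs `iters` iterations and returns the average nanoseconds per iteration.
fn time_decode(data: &[u8], iters: u32) -> f64 {
    let mut tape = Vec::with_capacity(4096);
    let start = Instant::now();
    for _ in 0..iters {
        // Hide the input from the optimizer on every iteration.
        fill_tape(&mut tape, black_box(data)).expect("decode failed");
        // Treat the filled tape as used so the work is not discarded.
        let _ = black_box(&tape);
        tape.clear();
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

fn main() {
    let data = vec![0u8; 1024];
    println!("{:.1} ns/iter", time_decode(&data, 100_000));
}
```

This is far less rigorous than criterion (no warm-up, no statistics), so it is only useful for confirming that a 2x gap shows up outside the benchmark framework too.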

gloomy jungle Feb 17, 2026, 4:54 PM

#

gloomy jungle ok, I think I may have found one potential reason my error enum is as follows: `...

maybe this is not it, I tried making Error a ZST and while it did improve performance (duh), it didn't bring it back to original numbers

dusk grove Feb 17, 2026, 4:55 PM

#

could just be your computer fluctuating

gloomy jungle Feb 17, 2026, 4:55 PM

#

in fact, with it as a ZST, performance is still heavily reduced when the encoder modules are commented out

gloomy jungle Feb 17, 2026, 4:55 PM

#

dusk grove could just be your computer fluctuating

I'd accept that if the change was minor, but not in this case

dusk grove Feb 17, 2026, 4:55 PM

#

fair

#

have you tried with different opt-levels?

gloomy jungle Feb 17, 2026, 4:56 PM

#

my code is fairly susceptible to fluctuations, to the extent I have a 5% threshold instead of the normal 1% lol

gloomy jungle Feb 17, 2026, 4:56 PM

#

dusk grove have you tried with different opt-levels?

can try, which would be best to try?

#

I can try them all but it'll take a bit

dusk grove Feb 17, 2026, 4:57 PM

#

-O2 for sure, maybe -Os too

gloomy jungle Feb 17, 2026, 5:06 PM

#

O2 is 106% worse for small inputs and 82% worse for large inputs
Os is 88% worse for small inputs and 74% worse for large inputs

dusk grove Feb 17, 2026, 5:08 PM

#

interesting, must be some pretty vital optimization that you get from O3, like maybe loop unrolling, which I know has pretty sensitive heuristics

gloomy jungle Feb 17, 2026, 5:09 PM

#

ah, this is compared to the good version

#

the one with encoding commented out

dusk grove Feb 17, 2026, 5:09 PM

#

ah

gloomy jungle Feb 17, 2026, 5:09 PM

#

I believe Os is roughly on-par with the bad version

dusk grove Feb 17, 2026, 5:09 PM

#

I'd probably try with O1 too and see if you can narrow down which LLVM pass it is

gloomy jungle Feb 17, 2026, 5:11 PM

#

gloomy jungle could this be <#1470536083421921362> recurring?

-# note that the code is still roughly the same as the 'good' variant mentioned in here, should you want to see it

#

the cpp shown by ghidra does show two other notable differences in addition to the return type being [u8; 16]:

the 'good' version is decompiled to a while (true) {}, while the 'bad' version is a do {} while (true)
the 'bad' version seems to be doing.. something before the actual switch statement (which I imagine to be the main reason for the reduction)

dusk grove Feb 17, 2026, 5:14 PM

#

maybe it hampers autovectorization

gloomy jungle Feb 17, 2026, 5:14 PM

#

O1 is 109% worse for small inputs, and 99% worse for large inputs

#

again, compared to the ideal result

#

wait

#

I might be misunderstanding what you want me to do

#

do you want me to compare O1, O2 and Os with encoding commented out to O1, O2 and Os with encoding uncommented?

#

my comparisons have been to O3 with encoding commented out

dusk grove Feb 17, 2026, 5:16 PM

#

I wanted the comparison to O3, yeah, but it could be interesting to see compared to uncommented too

gloomy jungle Feb 17, 2026, 5:40 PM

#

without encoding modules:

O3 is 281ns for small inputs, 76us for large inputs
O2 is 285ns for small inputs, 76us for large inputs
O1 is 654ns for small inputs, 158us for large inputs
Os is 705ns for small inputs, 164us for large inputs
with encoding modules:
O3 is 519ns for small inputs, 131us for large inputs
O2 is 559ns for small inputs, 134us for large inputs
O1 is 558ns for small inputs, 147us for large inputs
Os is 517ns for small inputs, 128us for large inputs
-# all results are fresh
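
To put the numbers above on one scale, here is a small sketch that turns each pair into a relative slowdown for the small inputs. `slowdown_percent` is a made-up helper; the timings are copied from the lists above. For O3, 281ns to 519ns works out to roughly 85% slower with encoding present, while at O1 and Os the build with encoding comes out as the faster one (negative values).

```rust
/// Percentage by which `measured` is slower than `baseline`
/// (both in the same unit). Negative means `measured` is faster.
fn slowdown_percent(baseline: f64, measured: f64) -> f64 {
    (measured - baseline) / baseline * 100.0
}

fn main() {
    // (opt level, ns without encoding modules, ns with encoding modules),
    // small inputs only, taken from the results above.
    let small = [
        ("O3", 281.0, 519.0),
        ("O2", 285.0, 559.0),
        ("O1", 654.0, 558.0),
        ("Os", 705.0, 517.0),
    ];
    for (level, without_enc, with_enc) in small {
        println!("{level}: {:+.1}%", slowdown_percent(without_enc, with_enc));
    }
}
```

Read this way, the encoding modules seem to push the O2/O3 builds toward roughly the same speed as the lower opt levels, rather than slowing every level down equally.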

dusk grove Feb 17, 2026, 5:41 PM

#

so there's something enabled in O2 but not in O1 that causes that massive difference

gloomy jungle Feb 17, 2026, 5:45 PM

#

here's a link to the real latest code: https://rust.godbolt.org/z/jec1hMcjh

gloomy jungle Feb 17, 2026, 5:46 PM

#

dusk grove so there's something enabled in O2 but not in O1 that causes that massive differ...

could be just inlining since I'm using a few helper functions
