#Ridiculously fragile performance


gloomy jungle
#

I have a (decoding) function which I've been trying to optimize the best I can for a while.
I finally managed to get it to what I believe to be the 'best' possible, at ~4.1Gib/s.
I then went to actually clean up my crate, splitting things into modules and such.
However, I noticed that performance had effectively halved after doing the organization, despite no actual code changing.
I went to fixing this issue, which turned out to be inlining (which was my suspicion, since I had moved away from the 'everything in one module' approach). I added some #[inline]s to certain methods and got back to my original performance.
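A minimal sketch of the kind of fix described here. The module and function names (`tag::is_small`, `decode::count_small`) are hypothetical and not taken from the crate. `#[inline]` is only a hint: it makes the function body available for inlining in other crates and codegen units, but the optimizer still decides. If the benchmark lives under `benches/`, it is compiled as a separate crate from the library, so this can matter there too.

```rust
mod tag {
    /// Hypothetical small helper that used to live next to its caller.
    /// `#[inline]` keeps its body available to callers in other
    /// codegen units and crates; it does not force inlining.
    #[inline]
    pub fn is_small(tag: u8) -> bool {
        tag < 0x80
    }
}

mod decode {
    use super::tag;

    /// Hypothetical hot loop. `#[inline(never)]` mirrors what the thread
    /// says about `fill_tape`, so the function stays visible as a unit.
    #[inline(never)]
    pub fn count_small(data: &[u8]) -> usize {
        data.iter().filter(|&&b| tag::is_small(b)).count()
    }
}

fn main() {
    let n = decode::count_small(&[0x01, 0x7f, 0x80, 0xff]);
    println!("{n} small tags");
}
```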

I thought that that was where this issue would end, but I added encoding to my crate, and despite nothing even remotely interacting with the decoder function, it halved in performance again.
If I comment out the encoding part of my crate, performance is back to full.
At this point I'm a bit lost, since nothing actually interacts with my decoder function, so I don't even have a starting point to work with.
If I comment out the encoding module in my lib.rs (leaving just the decoding module), performance is back to full.
Uncomment it, and performance is halved.

Why on earth could performance be so fragile?
I genuinely don't understand how methods that don't even interact with the decoder function could impact performance so much.
I even tried using a single codegen unit, thinking the change in layout may have resulted in an unfortunate assignment.
I also tried LTO, thinking extra inlining might help.
However, trying those didn't help (in fact, it even worsened things, #1470786138179371124, but that is not what this issue is about)
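For reference, the two settings mentioned here are normally set in the Cargo profile. This is a sketch, assuming the benchmarks run under the default `bench` profile, which inherits from `release`. The actual profile used in the crate is not shown in the thread.

```toml
# Cargo.toml (assumed layout, not taken from the crate)
[profile.release]
opt-level = 3        # the default for release builds
codegen-units = 1    # compile the crate as a single codegen unit
lto = "fat"          # whole-program LTO; "thin" is the cheaper variant
```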

dim vault
#

it may still be inlining. for instance, if you have a function that's called in only one place right now, the compiler might be more inclined to inline it. if the decoder also calls it, now it's 2 calls and it might not get inlined

#

there's also the possibility that it's the way the compiler orders things in memory, but that's very niche and unlikely to affect it to this degree

gloomy jungle
#

hm, given that the encoder and decoder modules are entirely independent (each can be commented out without affecting the other), the issue is reduced to the two remaining shared modules, one of which just contains consts, the second only has an enum definition

#

the only actual methods there are the Debug, Clone, PartialEq and Display implementations of said enum

#

none of which are actually called in the decoder

gloomy jungle
#

the generated assembly is quite drastically different

#

the outgoing references are actually slightly different

#

bad one

#

good one

#

though notably this is a level deeper than the function itself

#

also, it seems the bad version returns undefined1 [16], while the good one returns undefined8

#

could this be #1470536083421921362 recurring?

#

I could send the cpp that ghidra decompiles them to if someone could read anything from that

gloomy jungle
#

ok, I think I may have found one potential reason
my error enum is as follows:

pub enum Error {
    UnknownTag(u8),
    UnexpectedEnd,
    InvalidSmallBigLen(u8),
}

there are some niches here though: the values of both UnknownTag and InvalidSmallBigLen can never take some (different) subset of values
it seems that for some reason, without the other modules the compiler makes use of these and thus returns a single u8, but when they are present it falls back to using two registers for the discriminant and value (which for some reason is presented as a [u8; 16] by ghidra)
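One way to check the niche theory without a decompiler is to print the type sizes. rustc can only use niches that are visible in the type itself (for example `NonZeroU8`, `bool`, or unused enum tag values). A plain `u8` payload has none, even if the program never produces certain values. A minimal sketch follows. The `Result<usize, Error>` line is an assumed stand-in, since the real return type of the decoder is not shown in the thread. Only the first two sizes are asserted in comments, because general enum layout is not guaranteed.

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

#[allow(dead_code)]
pub enum Error {
    UnknownTag(u8),
    UnexpectedEnd,
    InvalidSmallBigLen(u8),
}

fn main() {
    // `u8` uses all 256 bit patterns, so `Option<u8>` needs a separate tag: 2 bytes.
    println!("Option<u8>:           {}", size_of::<Option<u8>>());
    // `NonZeroU8` declares 0 as invalid, so `None` can be stored as 0: 1 byte.
    println!("Option<NonZeroU8>:    {}", size_of::<Option<NonZeroU8>>());
    // The sizes below depend on the layout rustc picks; compare them
    // between the 'good' and 'bad' builds rather than trusting a guess.
    println!("Error:                {}", size_of::<Error>());
    println!("Result<(), Error>:    {}", size_of::<Result<(), Error>>());
    // Assumed stand-in for the real return type of the decoder.
    println!("Result<usize, Error>: {}", size_of::<Result<usize, Error>>());
}
```

If the sizes are identical in both builds, the difference seen in ghidra is more likely about how the value is passed back than about the enum layout itself.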

#

it makes no sense why it can use those niches without the other modules, but not with them tho

#

none of the encoding code touches this error enum, as it's exclusive to just decoding

#

maybe the compiler gets 'scared' and takes a safer approach?

dusk grove
#

how do you benchmark?

gloomy jungle
#

criterion's bench_with_input, with this as the actual method:

#[expect(clippy::panic)]
fn decode(b: &mut Bencher<'_>, data: &[u8]) {
    let mut tape = Tape::with_capacity(4096);
    b.iter(|| {
        if let Err(err) = fill_tape(&mut tape, data) {
            panic!("Error: {err:?}");
        }
        tape.clear();
    });
}

-# note that fill_tape has #[inline(never)]
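As a cross-check that does not depend on criterion, a std-only timing loop can be sketched like this. Everything here is an assumption for illustration: `fill_tape` is a trivial stand-in with a made-up signature, and `Vec<u8>` replaces the `Tape` type. `std::hint::black_box` keeps the optimizer from treating the input as a known constant or dropping the result.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the thread's `fill_tape`; the real one is not shown.
#[inline(never)]
fn fill_tape(tape: &mut Vec<u8>, data: &[u8]) -> Result<(), u8> {
    for &b in data {
        if b == 0xff {
            return Err(b); // treat 0xff as an invalid tag in this sketch
        }
        tape.push(b);
    }
    Ok(())
}

/// Average time per call over `iters` iterations (`iters` must be non-zero).
fn time_decode(data: &[u8], iters: u32) -> Duration {
    let mut tape = Vec::with_capacity(4096);
    let start = Instant::now();
    for _ in 0..iters {
        fill_tape(&mut tape, black_box(data)).expect("decode failed");
        black_box(&tape); // keep the result observable
        tape.clear();
    }
    start.elapsed() / iters
}

fn main() {
    let data = vec![0x01u8; 1024];
    let per_iter = time_decode(&data, 10_000);
    println!("{per_iter:?} per iteration");
}
```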

dusk grove
#

could just be your computer fluctuating sadfist

gloomy jungle
#

in fact, with it as ZST, performance is still heavily reduced when the encoder modules are commented out

dusk grove
#

fair

#

have you tried with different opt-levels?

gloomy jungle
#

my code is fairly susceptible to fluctuations, to the extent I have a 5% threshold instead of the normal 1% lol

gloomy jungle
#

I can try them all but it'll take a bit

dusk grove
#

-O2 for sure, maybe -Os too

gloomy jungle
#

O2 is 106% worse for small inputs and 82% worse for large inputs
Os is 88% worse for small inputs and 74% worse for large inputs

dusk grove
#

interesting, must be some pretty vital optimization that you get from O3, like maybe loop unrolling, which I know has pretty sensitive heuristics

gloomy jungle
#

ah, this is compared to the good version

#

the one with encoding commented out

dusk grove
#

ah

gloomy jungle
#

I believe Os is roughly on-par with the bad version

dusk grove
#

I'd probably try with O1 too and see if you can narrow down which LLVM pass it is

gloomy jungle
#

the cpp shown by ghidra does show two other notable differences in addition to the return type being [u8; 16]:

  • the 'good' version is decompiled to a while (true) {}, while the 'bad' version is a do {} while (true)
  • the 'bad' version seems to be doing.. something before the actual switch statement (which I imagine to be the main reason for the reduction)
dusk grove
#

maybe it hampers autovectorization

gloomy jungle
#

O1 is 109% worse for small inputs, and 99% worse for large inputs

#

again, compared to the ideal result

#

wait

#

I might be misunderstanding what you want me to do

#

do you want me to compare O1, O2 and Os with encoding commented out to O1, O2 and Os with encoding uncommented?

#

my comparisons have been to O3 with encoding commented out

dusk grove
#

I wanted the comparison to O3, yeah, but it could be interesting to see compared to uncommented too

gloomy jungle
#

without encoding modules:

  • O3 is 281ns for small inputs, 76us for large inputs
  • O2 is 285ns for small inputs, 76us for large inputs
  • O1 is 654ns for small inputs, 158us for large inputs
  • Os is 705ns for small inputs, 164us for large inputs

with encoding modules:

  • O3 is 519ns for small inputs, 131us for large inputs
  • O2 is 559ns for small inputs, 134us for large inputs
  • O1 is 558ns for small inputs, 147us for large inputs
  • Os is 517ns for small inputs, 128us for large inputs

-# all results are fresh
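To put these timings on the same "% worse" scale used earlier in the thread, the slowdown can be computed as (measured - baseline) / baseline. A small sketch using the O3 numbers from the list above. The helper name `percent_worse` is made up for illustration.

```rust
/// Percentage by which `measured` is slower than `baseline`
/// (both values must use the same unit).
fn percent_worse(baseline: f64, measured: f64) -> f64 {
    (measured - baseline) / baseline * 100.0
}

fn main() {
    // O3 without encoding modules vs. O3 with encoding modules.
    let small = percent_worse(281.0, 519.0); // ns
    let large = percent_worse(76.0, 131.0); // us
    println!("small inputs: {small:.1}% worse");
    println!("large inputs: {large:.1}% worse");
}
```

With these inputs the O3 build with the encoding modules comes out roughly 85% worse for small inputs and 72% worse for large inputs.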
dusk grove
#

so there's something enabled in O2 but not in O1 that causes that massive difference
