#pretty iterator usage

49 messages · Page 1 of 1 (latest)

shell mirage
#

anyone have an idea how I can make the looping in this code prettier (without making the code slower (it's currently very well vectorized by the compiler)

fn fetch_packages_apk() -> Result<String, Error> {
    //commented out because slow
    //let buf = read("/lib/apk/db/installed")?;
    //let packages = find_iter(&buf, b"\nP").count();

    let buf = read_fast("/lib/apk/db/installed")?;
    let mut packages = 0;

    let mut chunks0 = buf[..buf.len()-1].chunks_exact(8192);
    let mut chunks1 = buf[1..].chunks_exact(8192);

    while let (Some(chunk0), Some(chunk1)) = (chunks0.next(), chunks1.next()) {
        let mut counters: [u8; 32] = [0; 32];
        for j in 0..256 {
            for i in 0..32 {
                counters[i] += (chunk0[j * 32 + i] == b'\n') as u8 & (chunk1[j * 32 + i] == b'P') as u8;
            }
            // inner loop should be vectorized as 4 instructions
            //compare
            //compare
            //and
            //sub
        }
        packages += counters.iter().map(|&x| x as usize).sum::<usize>();
    }

    for (a,b) in chunks0.remainder().iter().zip(chunks1.remainder()) {
        packages += (*a == b'\n') as usize & (*b == b'P') as usize;
    }

    Ok(packages.to_string())
}
#

tbh I might just make a find_count function lol

heavy pulsar
shell mirage
#

you using -C opt-level=3 -C target-cpu=native?

heavy pulsar
#

Ah, native CPU

#

I was only using the opt level

#

Yeah, now it vectorizes

#

Ok, let's see if I can keep that happening as I add iterators

shell mirage
#

oh even in your example it compiles to ```x86asm
pcmpeqb xmm5, xmm0
pcmpeqb xmm4, xmm1
pand xmm4, xmm5
psubb xmm3, xmm4
pcmpeqb xmm7, xmm0
pcmpeqb xmm6, xmm1
pand xmm6, xmm7
psubb xmm2, xmm6

heavy pulsar
#

Also, I was assuming the inner loop was the one that goes to 32 😄

#

Did you mean the 256 one?

shell mirage
#

ah no the unrolled loop is just adding the bytes together

#

speaking of, that's not very optimized

#

doesn't matter cause it's not a particularly hot loop

heavy pulsar
#

That was without a native target, tho

#

With it, it turns into scary simd

shell mirage
#

oh ok nice

#

maybe there's no way to horizontal sum in sse

stone wolf
stone wolf
stone wolf
#

So in addition to the small counters inside the loop, you might consider having bigger counters outside the loop, so after the inner loop you simd add the small counters into the bigger ones, and thus you only need to horizontally reduce once at the end

#

Also, you probably don't need to use 8192 as your chunk size -- that's going to leave a huge remainder pretty often. u8x64 is the biggest vector you can realistically get on current chips (and most modern ones only go up to u8x32), so .chunks_exact(64) ought to be plenty.

stone wolf
#

For fun, here's a std::simd version: https://godbolt.org/z/q1Eo3zbne

shell mirage
stone wolf
#

I guess I'm just not convinced that it's worth having the scalar part be so much longer in order to do that. Because you can still do only a single horizontal sum even with the smaller chunk size, by doing it after the chunks loop.

shell mirage
#

the hot loop affects the performance a lot though

stone wolf
#

I'd be curious to hear how to the simd version I put above performs for you, though it needs nightly. It does the horizontal sum using the one horizontal sum instructions that's been around since ≈2009: POPCNT :)

shell mirage
#

I'll test when I get home

shell mirage
#
shell mirage
#

about 5GB/s

#

which is super fast

stone wolf
shell mirage
#

benchmarked my ram

#

8GB/s

#

so I'm pretty close to peak performance

stone wolf
shell mirage
#

the needle is 2 bytes

#

so there can never be more than 256 in 512 bytes

#

that's where the 512 comes from

stone wolf
#

Out of curiosity, did you ever bench the "obvious" versions like https://godbolt.org/z/PWsz95dvr ? I'd be curious how much slower they were.

shell mirage
#

first one is 2.5x slower

#

second one is 3x slower