pretty iterator usage | Rust Programming Language Community | Page 1

shell mirage May 14, 2023, 3:37 PM

#

anyone have an idea how I can make the looping in this code prettier without making it slower? (it's currently very well vectorized by the compiler)

fn fetch_packages_apk() -> Result<String, Error> {
    //commented out because slow
    //let buf = read("/lib/apk/db/installed")?;
    //let packages = find_iter(&buf, b"\nP").count();

    let buf = read_fast("/lib/apk/db/installed")?;
    let mut packages = 0;

    let mut chunks0 = buf[..buf.len()-1].chunks_exact(8192);
    let mut chunks1 = buf[1..].chunks_exact(8192);

    while let (Some(chunk0), Some(chunk1)) = (chunks0.next(), chunks1.next()) {
        let mut counters: [u8; 32] = [0; 32];
        for j in 0..256 {
            for i in 0..32 {
                counters[i] += (chunk0[j * 32 + i] == b'\n') as u8 & (chunk1[j * 32 + i] == b'P') as u8;
            }
            // inner loop should be vectorized as 4 instructions
            //compare
            //compare
            //and
            //sub
        }
        packages += counters.iter().map(|&x| x as usize).sum::<usize>();
    }

    for (a,b) in chunks0.remainder().iter().zip(chunks1.remainder()) {
        packages += (*a == b'\n') as usize & (*b == b'P') as usize;
    }

    Ok(packages.to_string())
}

#

tbh I might just make a find_count function lol
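A minimal sketch of what such a helper might look like, using iterator zips instead of index arithmetic. `find_count` and its parameters are hypothetical names, not an existing API. It uses 255 rows per block rather than 256, since a `u8` counter tops out at 255. Whether this shape still vectorizes as well as the indexed loop has not been checked here and would need a look on godbolt.

```rust
/// Counts occurrences of the two-byte needle `[first, second]` in `haystack`.
/// `find_count` is a hypothetical helper name, not an existing API.
pub fn find_count(haystack: &[u8], first: u8, second: u8) -> usize {
    const LANES: usize = 32;
    // 255 rows per block, so a u8 lane counter cannot wrap.
    const ROWS: usize = 255;
    const BLOCK: usize = LANES * ROWS;

    if haystack.len() < 2 {
        return 0;
    }
    // `lead[i]` and `trail[i]` are the two adjacent bytes starting at position i.
    let lead = &haystack[..haystack.len() - 1];
    let trail = &haystack[1..];

    let blocks0 = lead.chunks_exact(BLOCK);
    let blocks1 = trail.chunks_exact(BLOCK);
    // The remainders are fixed at construction, so grab them before iterating.
    let tail0 = blocks0.remainder();
    let tail1 = blocks1.remainder();

    let mut total = 0usize;
    for (block0, block1) in blocks0.zip(blocks1) {
        let mut counters = [0u8; LANES];
        for (row0, row1) in block0.chunks_exact(LANES).zip(block1.chunks_exact(LANES)) {
            for ((c, &a), &b) in counters.iter_mut().zip(row0).zip(row1) {
                *c += ((a == first) & (b == second)) as u8;
            }
        }
        total += counters.iter().map(|&c| c as usize).sum::<usize>();
    }

    // Scalar pass over whatever did not fill a whole block.
    total
        + tail0
            .iter()
            .zip(tail1)
            .filter(|&(&a, &b)| (a == first) & (b == second))
            .count()
}
```

The call site would then shrink to something like `find_count(&buf, b'\n', b'P')`.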

heavy pulsar May 14, 2023, 3:44 PM

#

shell mirage anyone have an idea how I can make the looping in this code prettier (without ma...

Based on a quick godbolt, your inner loop is definitely unrolled, but I don't think it's vectorizing

#

https://godbolt.org/z/EG8GxP491

shell mirage May 14, 2023, 3:44 PM

#

you using -C opt-level=3 -C target-cpu=native?

heavy pulsar May 14, 2023, 3:45 PM

#

Ah, native CPU

#

I was only using the opt level

#

Yeah, now it vectorizes

#

Ok, let's see if I can keep that happening as I add iterators

shell mirage May 14, 2023, 3:46 PM

#

oh even in your example it compiles to
```x86asm
pcmpeqb xmm5, xmm0
pcmpeqb xmm4, xmm1
pand xmm4, xmm5
psubb xmm3, xmm4
pcmpeqb xmm7, xmm0
pcmpeqb xmm6, xmm1
pand xmm6, xmm7
psubb xmm2, xmm6
```

heavy pulsar May 14, 2023, 3:46 PM

#

shell mirage oh even in your example it compiles to ```x86asm pcmpeqb xmm5, xmm0 ...

Wait hang on, then what's the unrolled sequence doing

#

Also, I was assuming the inner loop was the one that goes to 32 😄

#

Did you mean the 256 one?

shell mirage May 14, 2023, 3:48 PM

#

ah no the unrolled loop is just adding the bytes together

#

speaking of, that's not very optimized

#

doesn't matter cause it's not a particularly hot loop

heavy pulsar May 14, 2023, 3:52 PM

#

That was without a native target, tho

#

With it, it turns into scary simd

shell mirage May 14, 2023, 3:53 PM

#

oh ok nice

#

maybe there's no way to horizontal sum in sse

stone wolf May 14, 2023, 7:03 PM

#

Well, I think the thing that would make it nicest would be https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.as_chunks, but sadly that's unstable :(

stone wolf May 14, 2023, 7:03 PM

#

shell mirage maybe there's no way to horizontal sum in sse

horizontal stuff is often just bad in general for SIMD.

stone wolf May 14, 2023, 7:05 PM

#

shell mirage maybe there's no way to horizontal sum in sse

Note that https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=5365&text=_mm256_reduce_add describes it as "Sequence" because there's not actually an instruction for it. So even the intel intrinsic just generates a bunch of shuffles or whatever to add things together horizontally.
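To make "a bunch of shuffles" concrete, here is a scalar model of a log-step reduction: fold the upper half of the lanes onto the lower half until one lane is left. This is only an illustration of the general idea. `horizontal_sum` is a made-up name, and the exact sequence an intrinsic or the compiler emits is not claimed here.

```rust
/// Sums 32 lanes by repeatedly adding the upper half onto the lower half.
/// Takes log2(32) = 5 folding steps instead of one dedicated instruction.
pub fn horizontal_sum(lanes: [u32; 32]) -> u32 {
    let mut v = lanes;
    let mut width = v.len() / 2;
    while width > 0 {
        // One "shuffle + vertical add" step in the SIMD analogy.
        for i in 0..width {
            v[i] += v[i + width];
        }
        width /= 2;
    }
    v[0]
}
```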

#

So in addition to the small counters inside the loop, you might consider having bigger counters outside the loop, so after the inner loop you simd add the small counters into the bigger ones, and thus you only need to horizontally reduce once at the end

#

Also, you probably don't need to use 8192 as your chunk size -- that's going to leave a huge remainder pretty often. u8x64 is the biggest vector you can realistically get on current chips (and most modern ones only go up to u8x32), so .chunks_exact(64) ought to be plenty.

#

That gets me to something like this: https://godbolt.org/z/1xPTohao4
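The linked code is not reproduced here, so the following is only a sketch of the two-level counter idea described above, under some assumptions: 64 lanes, `u8` counters flushed into `u64` counters every 255 rows, and a single reduction at the very end. `count_newline_p` is a hypothetical name.

```rust
/// Counts "\nP" occurrences using small per-lane counters that are
/// periodically added into wide per-lane counters.
pub fn count_newline_p(buf: &[u8]) -> usize {
    const LANES: usize = 64;
    // A u8 lane counter can absorb at most 255 rows before it must be flushed.
    const ROWS: usize = 255;

    if buf.len() < 2 {
        return 0;
    }
    let lead = &buf[..buf.len() - 1];
    let trail = &buf[1..];

    // Only a tail shorter than LANES is left for the scalar loop.
    let full = lead.len() / LANES * LANES;
    let (body0, tail0) = lead.split_at(full);
    let (body1, tail1) = trail.split_at(full);

    let mut wide = [0u64; LANES];
    for (block0, block1) in body0.chunks(LANES * ROWS).zip(body1.chunks(LANES * ROWS)) {
        let mut small = [0u8; LANES];
        for (row0, row1) in block0.chunks_exact(LANES).zip(block1.chunks_exact(LANES)) {
            for ((c, &a), &b) in small.iter_mut().zip(row0).zip(row1) {
                *c += ((a == b'\n') & (b == b'P')) as u8;
            }
        }
        // Lane-wise (vertical) add of the small counters into the wide ones.
        for (w, &s) in wide.iter_mut().zip(&small) {
            *w += u64::from(s);
        }
    }

    let scalar_tail = tail0
        .iter()
        .zip(tail1)
        .filter(|&(&a, &b)| (a == b'\n') & (b == b'P'))
        .count();

    // The only horizontal reduction, done once after all chunks.
    wide.iter().sum::<u64>() as usize + scalar_tail
}
```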

stone wolf May 14, 2023, 7:55 PM

#

For fun, here's a std::simd version: https://godbolt.org/z/q1Eo3zbne

Compiler Explorer - Rust (rustc nightly)

#![feature(slice_as_chunks)]
#![feature(portable_simd)]

use std::simd::*;

#[inline]
fn do_chunk(a: u8x32, b: u8x32) -> u32 {
    let found = a.simd_eq(u8x32::splat(b'\n')) & b.simd_eq(u8x32::splat(b'P'));
    found.to_bitmask().count_ones()
}

pub fn fetch_packages_apk(buf: &[u8]) -> usize {
    assert!(buf.len() > 0);

    let (ac, at) = buf[...

shell mirage May 14, 2023, 8:55 PM

#

stone wolf Also, you probably don't need to use 8192 as your chunk size -- that's going to ...

8k is just the biggest I can go without having to do a horizontal sum

stone wolf May 14, 2023, 9:43 PM

#

I guess I'm just not convinced that it's worth having the scalar part be so much longer in order to do that. Because you can still do only a single horizontal sum even with the smaller chunk size, by doing it after the chunks loop.

shell mirage May 14, 2023, 9:53 PM

#

stone wolf I guess I'm just not convinced that it's worth having the scalar part be *so* mu...

my particular use case is processing a very large file, so the slow part at the end is a drop in the ocean

#

the hot loop affects the performance a lot though

stone wolf May 14, 2023, 10:10 PM

#

I'd be curious to hear how the simd version I put above performs for you, though it needs nightly. It does the horizontal sum using the one horizontal sum instruction that's been around since ≈2009: POPCNT :)
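For anyone without nightly, the mask-then-popcount idea can be modeled on stable Rust roughly like this. It is a scalar sketch, not the `std::simd` code from the link, and `count_in_chunk` is a made-up name. Whether LLVM lowers it to a movemask plus `popcnt` is an open question that would need checking in the assembly.

```rust
/// Builds a 32-bit mask with bit `i` set when `a[i] == b'\n'` and
/// `b[i] == b'P'`, then counts the set bits in one step.
pub fn count_in_chunk(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for i in 0..32 {
        let hit = (a[i] == b'\n') & (b[i] == b'P');
        mask |= (hit as u32) << i;
    }
    // count_ones is the "horizontal sum" over the 32 one-bit lanes.
    mask.count_ones()
}
```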

shell mirage May 14, 2023, 10:12 PM

#

I'll test when I get home

shell mirage May 15, 2023, 4:34 PM

#

https://godbolt.org/z/Kq3xhT5MT

Compiler Explorer - Rust (rustc 1.69.0)

use std::io::Error;

pub fn fetch_packages_apk(buf: &[u8]) -> Result<String, Error> {
    let mut packages = 0;

    let buf0 = &buf[..buf.len() - 1];
    let buf1 = &buf[1..];

    let mut chunks0 = buf0.chunks_exact(512);
    let mut chunks1 = buf1.chunks_exact(512);

    while let (Some(chunk0), Some(chunk1)) = (chunks0.next(), chunks1.next()) {
        let mut counter...

shell mirage May 15, 2023, 4:39 PM

#

stone wolf I'd be curious to hear how the simd version I put above performs for you, tho...

performs exactly the same as mine lol

#

about 5GB/s

#

which is super fast

stone wolf May 15, 2023, 4:51 PM

#

shell mirage https://godbolt.org/z/Kq3xhT5MT

Ah, the classic "LLVM sees you're doing a sum and will autovectorize it for you" approach :) Always a good one.

shell mirage May 15, 2023, 4:52 PM

#

benchmarked my ram

#

8GB/s

#

so I'm pretty close to peak performance

stone wolf May 15, 2023, 5:02 PM

#

shell mirage https://godbolt.org/z/Kq3xhT5MT

Actually, isn't this one possibly overflowing the local counter? Since it's a chunk of 512? Or I guess it can't actually find a match at every position so that's not a concern?

shell mirage May 15, 2023, 5:02 PM

#

the needle is 2 bytes

#

so there can never be more than 256 in 512 bytes

#

that's where the 512 comes from
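The 256-in-512 bound is on total matches. What decides whether a `u8` counter can wrap is how many hits a single counter can see. A rough way to check that, assuming the 32-counter layout from the first snippet (the 512-byte version is cut off here, so its layout is an assumption): counter `i` sees every 32nd position, and those positions are never adjacent, so all of them can match at once. Both function names below are hypothetical.

```rust
/// Largest number of hits one lane counter can see in a chunk, assuming
/// counter `i` only sees positions congruent to `i` modulo `lanes`,
/// `lanes >= 2`, and `chunk_len` is a multiple of `lanes`.
pub const fn worst_case_per_lane(chunk_len: usize, lanes: usize) -> usize {
    chunk_len / lanes
}

/// Per-lane hit counts for one chunk, using u16 counters so nothing wraps.
pub fn lane_hits(chunk0: &[u8], chunk1: &[u8], lanes: usize) -> Vec<u16> {
    let mut counters = vec![0u16; lanes];
    for (i, (&a, &b)) in chunk0.iter().zip(chunk1).enumerate() {
        counters[i % lanes] += ((a == b'\n') & (b == b'P')) as u16;
    }
    counters
}
```

Under that assumption, a 512-byte chunk gives at most 16 hits per counter, which is comfortably inside a `u8`. An 8192-byte chunk gives at most 256, one more than a `u8` can hold, though only on input that puts a match at every 32nd byte.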

stone wolf May 15, 2023, 5:07 PM

#

Out of curiosity, did you ever bench the "obvious" versions like https://godbolt.org/z/PWsz95dvr ? I'd be curious how much slower they were.

Compiler Explorer - Rust

pub fn fetch_packages_apk(buf: &[u8]) -> usize {
    let buf0 = &buf[..buf.len() - 1];
    let buf1 = &buf[1..];

    std::iter::zip(buf0, buf1)
        .filter(|(&a, &b)| (a == b'\n') & (b == b'P'))
        .count()
}

#

Or the most direct one of all, which does still appear to get lightly vectorized: https://godbolt.org/z/dPdKnPefW

Compiler Explorer - Rust

pub fn fetch_packages_apk(buf: &[u8]) -> usize {
    buf.windows(2)
        .filter(|w| w == b"\nP")
        .count()
}

shell mirage May 15, 2023, 5:10 PM

#

first one is 2.5x slower

#

second one is 3x slower
