Safety | Bevy | Page 1

acoustic spruce Dec 21, 2022, 7:28 PM

#

macro_rules! memcmp_impl {
    ($ty:ty) => (
        impl Memcmp for [$ty] {
            #[inline(always)]
            fn memcmp(&self, lhs: &[$ty]) -> bool {
                let len = self.len();
                debug_assert_eq!(len, lhs.len());
                // SAFETY any bytes would be zero initialized and the slices will be of equal type
                // and length
                unsafe {
                    std::slice::from_raw_parts(self.as_ptr() as *const $ty, len)                          == std::slice::from_raw_parts(lhs.as_ptr() as *const $ty, len)
                }
            }
        }
    );
}

memcmp_impl!(u8);
memcmp_impl!(u16);
memcmp_impl!(u32);
memcmp_impl!(u64);
``` Any issues with this given correct args are supplied?

#

(I'll avoid actually using it except in a few isolated cases), If it's a workable solution can it be improved at all WRT potential performance?

queen hare Dec 21, 2022, 7:32 PM

#

what is this even for? It seems like a convoluted and unsafe wrapper for <[T]>::eq

acoustic spruce Dec 21, 2022, 7:32 PM

#

I want exact length memcmp

#

as in a constrained to length version of C's memcmp, it's not even required, I've since completely removed the method in the codebase I'm working on.

#

they had ```rust
#[inline]
pub fn is_clear(&self) -> bool {
self.data.iter().all(|block| *block == 0)
}

#

I wanted one call instead of iter

queen hare Dec 21, 2022, 7:39 PM

#

wouldn't that require an allocation then? To allocate the zero-alloc'd comparion slice?
What does exact length memcmp mean? Why can't you just use <[T]>::eq?

acoustic spruce Dec 21, 2022, 7:40 PM

#

I probably could and didn't know it existed

queen hare Dec 21, 2022, 7:40 PM

#

i mean, you are using it already, but why do you need to have this unsafe stuff to deconstruct and reconstruct the slices?

acoustic spruce Dec 21, 2022, 7:41 PM

#

I was replacing the above with

#

self.data.as_slice().memcmp([0].as_slice())

#

the entire method calling that is gone now, so it's moot

#

They probably should have done what you suggested in the first place, I'll file a PR on that

queen hare Dec 21, 2022, 7:43 PM

#

oh no that's deeply unsafe with what you're doing above

acoustic spruce Dec 21, 2022, 7:43 PM

#

I just wanted memcmp

#

cause I seen what the code was doing

#

and its memcmp

queen hare Dec 21, 2022, 7:43 PM

#

well, it's unsound unless self.data.len() == 1

acoustic spruce Dec 21, 2022, 7:44 PM

#

fair, that's why I asked about it 🙂

queen hare Dec 21, 2022, 7:45 PM

#

but yeah I think if you really want to optimize the above function you should just try unrolling into a for loop and benchmark because frankly I'd expect the existing function to be just about as fast as it can possibly get

acoustic spruce Dec 21, 2022, 7:46 PM

#

it was slow

#

I changed the implementation because of that

queen hare Dec 21, 2022, 7:46 PM

#

release mode?

acoustic spruce Dec 21, 2022, 7:46 PM

#

the iterator is the slowness

#

it should be/can be one memcmp

queen hare Dec 21, 2022, 7:47 PM

#

the thing is you need an existing buffer of at least self.data.len() length in order for the memcmp to ever succeed, which means unless you have an upper bound for how long self.data will get, you will need to allocate to do a memcmp which I'd expect to be much slower than iterating

acoustic spruce Dec 21, 2022, 7:48 PM

#

This would check 10 million elements = 0 vs one call to memcmp

#

(it's highly perf sensitive code)

queen hare Dec 21, 2022, 7:48 PM

#

ok so self.data.len() == 10mil?

acoustic spruce Dec 21, 2022, 7:48 PM

#

the goal is low-level optimization

#

self.data.iter().all(|block| *block == 0)

#

that was the slow part

#

I wanted to replace that with one call originally.

queen hare Dec 21, 2022, 7:50 PM

#

i don't think a memcmp is going to beat that though
memcmp means iterate over 2 slices comparing each element. the iterator iterates over 1 slice comparing each element to 0. I'd expect the iterator to beat memcmp, and that's ignoring that you'd need to allocate a second 10 million element slice to do memcmp

acoustic spruce Dec 21, 2022, 7:51 PM

#

ya if that allocates it's totally bunk, agreed

queen hare Dec 21, 2022, 7:52 PM

#

now you've got me curious, gonna try comparing the iterator and memcmp (excluding the allocation)

acoustic spruce Dec 21, 2022, 7:52 PM

#

I want to compare existing memory to other existing memory

queen hare Dec 21, 2022, 7:52 PM

#

i thought you just wanted to check if the entire slice is 0?

acoustic spruce Dec 21, 2022, 7:53 PM

#

it's checking if all elements in a vec are 0

#

original code is not mine

#

My idea which is now obsolete for the intended use was one call to see if the entire memory block is 0

queen hare Dec 21, 2022, 7:56 PM

#

yeah that doesn't really exist afaik

#

since memcpy under the hood has to iterate the exact same

#

i think there's like maybe a chance that memory provides some sort of interface to access an actual zero-check but that'd be nearly inaccessible from explicit code

acoustic spruce Dec 21, 2022, 7:59 PM

#

the macro use of it was out of sync also, what I posted was not the intent, the length check I added in after already removing the is_clear call with my invocation above, macro is never actually used.

#

I know memcpy/memcmp etc can be SSE accelerated, so I was going towards that as a last-level optimization

#

there are lots of places I could used an optimized memcmp/memcpy

queen hare Dec 21, 2022, 8:01 PM

#

iterators are probably more likely to get those optimizations than anything except explicit SSE ops

acoustic spruce Dec 21, 2022, 8:01 PM

#

yup, that's acknowledged

queen hare Dec 21, 2022, 8:02 PM

#

got the benchmarks up, time to see what happens

acoustic spruce Dec 21, 2022, 8:02 PM

#

that's the game plan itself probably

#

SSE/SIMD optimizing it

queen hare Dec 21, 2022, 8:02 PM

#

I also don't trust benchmarks ever so take this with many grains of salt but I think it's interesting

#

huh...

acoustic spruce Dec 21, 2022, 8:02 PM

#

ya, I was more curious, and thanks for taking the time

queen hare Dec 21, 2022, 8:02 PM

#

#

time to take a look at the asm cause I'm confused

acoustic spruce Dec 21, 2022, 8:05 PM

#

queen hare I also don't trust benchmarks ever so take this with many grains of salt but I t...

ya, I'm being careful here, I'm planning on looking for real world improvement in bevy_stress tests, not so much micro-benchmarks of fixedbitset itself

#

I'd be a little more confident if there's real gain in the ECS

#

thankfully bevy never calls the offending function anyways

queen hare Dec 21, 2022, 8:15 PM

#

eugh, can't figure out how to properly view the asm in godbolt. Produces SO MUCH asm

#

maybe playground will be more useful

acoustic spruce Dec 21, 2022, 8:17 PM

#

      pub fn count_ones(&self, range: std::ops::Range<usize>) -> usize {
          Masks::new(range, self.length)
              .map(|(block, mask)| {
                  let value = unsafe {
                      *self.data.get_unchecked(block)
                  };
                  (value & mask).count_ones() as usize
              })
              .sum()
      }
``` the actualy optimization target will be this

#

That is already partially diverging from upstream by the bounded Range change I've made so that get_unchecked can be used.

#

(which in turn allows implementing ExactSizeIterator) which will lead to further optimizations being possible

#

I also plan on making an ActuallyFixedSizeFixedBitSet

#

which will lead to more optimization possibilities

queen hare Dec 21, 2022, 8:29 PM

#

ok interesting! There is a difference between the asm output for them. The iterator very clearly loops, doing a cmpb each iteration. Whereas the direct slice comparison (bytes == zeros where both are a slice of 10 million zeros) seems to access *bcmp@GOTPCREL(%rip), which seems like a dedicated byte comparison function

acoustic spruce Dec 21, 2022, 8:30 PM

#

which is probably very optimized like how we described above

#

almost certainly SSE+

#

at least on linux

queen hare Dec 21, 2022, 8:31 PM

#

yeah, I didn't expect both of these. I expected either they both get bcmp or they both loop

#

(I'm on windows)

acoustic spruce Dec 21, 2022, 8:33 PM

#

Thanks again for going through this with me, it's appreciated, I'm way too new to Rust.

#

I know what I'm trying to do usually, but translating how i'd do things in C to Rust WRT perf is not my forte yet

#Safety