#Safety

76 messages · Page 1 of 1 (latest)

acoustic spruce
#
macro_rules! memcmp_impl {
    ($ty:ty) => (
        impl Memcmp for [$ty] {
            #[inline(always)]
            fn memcmp(&self, lhs: &[$ty]) -> bool {
                let len = self.len();
                debug_assert_eq!(len, lhs.len());
                // SAFETY any bytes would be zero initialized and the slices will be of equal type
                // and length
                unsafe {
                    std::slice::from_raw_parts(self.as_ptr() as *const $ty, len)                          == std::slice::from_raw_parts(lhs.as_ptr() as *const $ty, len)
                }
            }
        }
    );
}

memcmp_impl!(u8);
memcmp_impl!(u16);
memcmp_impl!(u32);
memcmp_impl!(u64);
``` Any issues with this given correct args are supplied?
#

(I'll avoid actually using it except in a few isolated cases), If it's a workable solution can it be improved at all WRT potential performance?

queen hare
#

what is this even for? It seems like a convoluted and unsafe wrapper for <[T]>::eq

acoustic spruce
#

I want exact length memcmp

#

as in a constrained to length version of C's memcmp, it's not even required, I've since completely removed the method in the codebase I'm working on.

#

they had ```rust
#[inline]
pub fn is_clear(&self) -> bool {
self.data.iter().all(|block| *block == 0)
}

#

I wanted one call instead of iter

queen hare
#

wouldn't that require an allocation then? To allocate the zero-alloc'd comparion slice?
What does exact length memcmp mean? Why can't you just use <[T]>::eq?

acoustic spruce
#

I probably could and didn't know it existed

queen hare
#

i mean, you are using it already, but why do you need to have this unsafe stuff to deconstruct and reconstruct the slices?

acoustic spruce
#

I was replacing the above with

#

self.data.as_slice().memcmp([0].as_slice())

#

the entire method calling that is gone now, so it's moot

#

They probably should have done what you suggested in the first place, I'll file a PR on that

queen hare
#

oh no that's deeply unsafe with what you're doing above

acoustic spruce
#

I just wanted memcmp

#

cause I seen what the code was doing

#

and its memcmp

queen hare
#

well, it's unsound unless self.data.len() == 1

acoustic spruce
#

fair, that's why I asked about it 🙂

queen hare
#

but yeah I think if you really want to optimize the above function you should just try unrolling into a for loop and benchmark because frankly I'd expect the existing function to be just about as fast as it can possibly get

acoustic spruce
#

it was slow

#

I changed the implementation because of that

queen hare
#

release mode?

acoustic spruce
#

the iterator is the slowness

#

it should be/can be one memcmp

queen hare
#

the thing is you need an existing buffer of at least self.data.len() length in order for the memcmp to ever succeed, which means unless you have an upper bound for how long self.data will get, you will need to allocate to do a memcmp which I'd expect to be much slower than iterating

acoustic spruce
#

This would check 10 million elements = 0 vs one call to memcmp

#

(it's highly perf sensitive code)

queen hare
#

ok so self.data.len() == 10mil?

acoustic spruce
#

the goal is low-level optimization

#

self.data.iter().all(|block| *block == 0)

#

that was the slow part

#

I wanted to replace that with one call originally.

queen hare
#

i don't think a memcmp is going to beat that though
memcmp means iterate over 2 slices comparing each element. the iterator iterates over 1 slice comparing each element to 0. I'd expect the iterator to beat memcmp, and that's ignoring that you'd need to allocate a second 10 million element slice to do memcmp

acoustic spruce
#

ya if that allocates it's totally bunk, agreed

queen hare
#

now you've got me curious, gonna try comparing the iterator and memcmp (excluding the allocation)

acoustic spruce
#

I want to compare existing memory to other existing memory

queen hare
#

i thought you just wanted to check if the entire slice is 0?

acoustic spruce
#

it's checking if all elements in a vec are 0

#

original code is not mine

#

My idea which is now obsolete for the intended use was one call to see if the entire memory block is 0

queen hare
#

yeah that doesn't really exist afaik

#

since memcpy under the hood has to iterate the exact same

#

i think there's like maybe a chance that memory provides some sort of interface to access an actual zero-check but that'd be nearly inaccessible from explicit code

acoustic spruce
#

the macro use of it was out of sync also, what I posted was not the intent, the length check I added in after already removing the is_clear call with my invocation above, macro is never actually used.

#

I know memcpy/memcmp etc can be SSE accelerated, so I was going towards that as a last-level optimization

#

there are lots of places I could used an optimized memcmp/memcpy

queen hare
#

iterators are probably more likely to get those optimizations than anything except explicit SSE ops

acoustic spruce
#

yup, that's acknowledged

queen hare
#

got the benchmarks up, time to see what happens

acoustic spruce
#

that's the game plan itself probably

#

SSE/SIMD optimizing it

queen hare
#

I also don't trust benchmarks ever so take this with many grains of salt but I think it's interesting

#

huh...

acoustic spruce
#

ya, I was more curious, and thanks for taking the time

queen hare
#

time to take a look at the asm cause I'm confused

acoustic spruce
#

I'd be a little more confident if there's real gain in the ECS

#

thankfully bevy never calls the offending function anyways

queen hare
#

eugh, can't figure out how to properly view the asm in godbolt. Produces SO MUCH asm

#

maybe playground will be more useful

acoustic spruce
#
      pub fn count_ones(&self, range: std::ops::Range<usize>) -> usize {
          Masks::new(range, self.length)
              .map(|(block, mask)| {
                  let value = unsafe {
                      *self.data.get_unchecked(block)
                  };
                  (value & mask).count_ones() as usize
              })
              .sum()
      }
``` the actualy optimization target will be this
#

That is already partially diverging from upstream by the bounded Range change I've made so that get_unchecked can be used.

#

(which in turn allows implementing ExactSizeIterator) which will lead to further optimizations being possible

#

I also plan on making an ActuallyFixedSizeFixedBitSet

#

which will lead to more optimization possibilities

queen hare
#

ok interesting! There is a difference between the asm output for them. The iterator very clearly loops, doing a cmpb each iteration. Whereas the direct slice comparison (bytes == zeros where both are a slice of 10 million zeros) seems to access *bcmp@GOTPCREL(%rip), which seems like a dedicated byte comparison function

acoustic spruce
#

which is probably very optimized like how we described above

#

almost certainly SSE+

#

at least on linux

queen hare
#

yeah, I didn't expect both of these. I expected either they both get bcmp or they both loop

#

(I'm on windows)

acoustic spruce
#

Thanks again for going through this with me, it's appreciated, I'm way too new to Rust.

#

I know what I'm trying to do usually, but translating how i'd do things in C to Rust WRT perf is not my forte yet