macro_rules! memcmp_impl {
($ty:ty) => (
impl Memcmp for [$ty] {
#[inline(always)]
fn memcmp(&self, lhs: &[$ty]) -> bool {
let len = self.len();
debug_assert_eq!(len, lhs.len());
// SAFETY any bytes would be zero initialized and the slices will be of equal type
// and length
unsafe {
std::slice::from_raw_parts(self.as_ptr() as *const $ty, len) == std::slice::from_raw_parts(lhs.as_ptr() as *const $ty, len)
}
}
}
);
}
memcmp_impl!(u8);
memcmp_impl!(u16);
memcmp_impl!(u32);
memcmp_impl!(u64);
``` Any issues with this given correct args are supplied?
#Safety
76 messages · Page 1 of 1 (latest)
(I'll avoid actually using it except in a few isolated cases), If it's a workable solution can it be improved at all WRT potential performance?
what is this even for? It seems like a convoluted and unsafe wrapper for <[T]>::eq
I want exact length memcmp
as in a constrained to length version of C's memcmp, it's not even required, I've since completely removed the method in the codebase I'm working on.
they had ```rust
#[inline]
pub fn is_clear(&self) -> bool {
self.data.iter().all(|block| *block == 0)
}
I wanted one call instead of iter
wouldn't that require an allocation then? To allocate the zero-alloc'd comparion slice?
What does exact length memcmp mean? Why can't you just use <[T]>::eq?
I probably could and didn't know it existed
i mean, you are using it already, but why do you need to have this unsafe stuff to deconstruct and reconstruct the slices?
I was replacing the above with
self.data.as_slice().memcmp([0].as_slice())
the entire method calling that is gone now, so it's moot
They probably should have done what you suggested in the first place, I'll file a PR on that
oh no that's deeply unsafe with what you're doing above
well, it's unsound unless self.data.len() == 1
fair, that's why I asked about it 🙂
but yeah I think if you really want to optimize the above function you should just try unrolling into a for loop and benchmark because frankly I'd expect the existing function to be just about as fast as it can possibly get
release mode?
the thing is you need an existing buffer of at least self.data.len() length in order for the memcmp to ever succeed, which means unless you have an upper bound for how long self.data will get, you will need to allocate to do a memcmp which I'd expect to be much slower than iterating
This would check 10 million elements = 0 vs one call to memcmp
(it's highly perf sensitive code)
ok so self.data.len() == 10mil?
the goal is low-level optimization
self.data.iter().all(|block| *block == 0)
that was the slow part
I wanted to replace that with one call originally.
i don't think a memcmp is going to beat that though
memcmp means iterate over 2 slices comparing each element. the iterator iterates over 1 slice comparing each element to 0. I'd expect the iterator to beat memcmp, and that's ignoring that you'd need to allocate a second 10 million element slice to do memcmp
ya if that allocates it's totally bunk, agreed
now you've got me curious, gonna try comparing the iterator and memcmp (excluding the allocation)
I want to compare existing memory to other existing memory
i thought you just wanted to check if the entire slice is 0?
it's checking if all elements in a vec are 0
original code is not mine
My idea which is now obsolete for the intended use was one call to see if the entire memory block is 0
yeah that doesn't really exist afaik
since memcpy under the hood has to iterate the exact same
i think there's like maybe a chance that memory provides some sort of interface to access an actual zero-check but that'd be nearly inaccessible from explicit code
the macro use of it was out of sync also, what I posted was not the intent, the length check I added in after already removing the is_clear call with my invocation above, macro is never actually used.
I know memcpy/memcmp etc can be SSE accelerated, so I was going towards that as a last-level optimization
there are lots of places I could used an optimized memcmp/memcpy
iterators are probably more likely to get those optimizations than anything except explicit SSE ops
yup, that's acknowledged
got the benchmarks up, time to see what happens
I also don't trust benchmarks ever so take this with many grains of salt but I think it's interesting
huh...
ya, I was more curious, and thanks for taking the time
ya, I'm being careful here, I'm planning on looking for real world improvement in bevy_stress tests, not so much micro-benchmarks of fixedbitset itself
I'd be a little more confident if there's real gain in the ECS
thankfully bevy never calls the offending function anyways
eugh, can't figure out how to properly view the asm in godbolt. Produces SO MUCH asm
maybe playground will be more useful
pub fn count_ones(&self, range: std::ops::Range<usize>) -> usize {
Masks::new(range, self.length)
.map(|(block, mask)| {
let value = unsafe {
*self.data.get_unchecked(block)
};
(value & mask).count_ones() as usize
})
.sum()
}
``` the actualy optimization target will be this
That is already partially diverging from upstream by the bounded Range change I've made so that get_unchecked can be used.
(which in turn allows implementing ExactSizeIterator) which will lead to further optimizations being possible
I also plan on making an ActuallyFixedSizeFixedBitSet
which will lead to more optimization possibilities
ok interesting! There is a difference between the asm output for them. The iterator very clearly loops, doing a cmpb each iteration. Whereas the direct slice comparison (bytes == zeros where both are a slice of 10 million zeros) seems to access *bcmp@GOTPCREL(%rip), which seems like a dedicated byte comparison function
which is probably very optimized like how we described above
almost certainly SSE+
at least on linux