#A Better Chunk Entity Enumerator (16ms -> 4ms for 100,000 chunks)

1 messages · Page 1 of 1 (latest)

rain aspen
#

Digging around my code to figure out locations for optimization, I timed the execution of my jobs with enabled components to be far worse than ones without. Not just fractions of a millisecond worse, but multiple full milliseconds longer execution than other jobs. Enabled components obviously have a performance cost since the sequential iteration is broken and burst is unable to vectorize across entities. But it shouldn't be that bad since I rarely have cross-entity vectorization even for my normal components.

So, after timing my job structs, it seems like a major contributor to the bad performance is the ChunkEntityEnumerator that unity provides with the package. It's horrible. A massive while loop that iteratively bit-shifts the bit mask to find the next enabled bit. You know what can do that in a single operation? Tzcnt().

Presenting the superior BetterEntityEnumerator based off tzcnt(). Half the struct memory size, 4x the performance. With only one "breaking" change for the final returned index once the mask is depleted.

ChunkEntityEnumerator returns the chunk entity count, my version returns -1.
It should be undefined behavior so don't rely on this. Just use chunk-entity-count.

BetterEntityEnumerator performs worse (2ms vs 5ms) when use enabled mask is false since I eliminated that branch.
The cost of reducing the struct memory size in half. Just use a for loop in the main IJobChunk if it's guaranteed that there are no enabled components in the query.

Otherwise, drop in and replace. @glossy coral what do you think? Please improve ChunkEntityEnumerator in base package, enabled components shouldn't be penalized more simply because of bad coding. Maybe not with my version but at least something better.

#
private ref struct BetterEntityEnumerator
{
    private ulong _uLong0, _uLong1;

    public BetterEntityEnumerator(bool useEnabledMask, in v128 chunkEnabledMask, int chunkEntityCount)
    {
        _uLong0 = useEnabledMask ? chunkEnabledMask.ULong0 : ~0ul;
        _uLong1 = useEnabledMask ? chunkEnabledMask.ULong1 : ~0ul;

        if (chunkEntityCount < 64)
        {
            _uLong0 &= ~(~0ul << chunkEntityCount);
            _uLong1 = 0;
        }
        else if (chunkEntityCount is not 128)
            _uLong1 &= ~(~0ul << (chunkEntityCount - 64));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public bool NextEntityIndex(out int nextIndex)
    {
        if (_uLong0 != 0)
        {
            int j = math.tzcnt(_uLong0);
            nextIndex = j;
            _uLong0 ^= 1ul << j;
            return true;
        }

        if (_uLong1 != 0)
        {
            int j = math.tzcnt(_uLong1);
            nextIndex = j + 64;
            _uLong1 ^= 1ul << j;
            return true;
        }

        // Breaking change, unknown what last returned iteration is.
        // ChunkEntityEnumerator returns chunk entity count, this returns -1.
        nextIndex = -1;
        return false;
    }
}```
#

On the topic of bad performance for false useEnabledMask, fixing this without increasing struct size is possible:

  1. Just add a bool parameter to NextEntityIndex that accepts useEnabledMask.
  2. In the constructor, if useEnabledMask is false, use _uLong0 instead as chunkEntityCount.
  3. Add a branch inside NextEntityIndex that activates with true useEnabledMask and decrements _uLong0 and returns _uLong0 > 0.
public bool NextEntityIndex(bool useEnabledMask, out int nextIndex)
{
  if (useEnabledMask)
  {
    // UseEnabledMask true in constructor results in _uLong0 becoming just chunkEntityCount.
    nextIndex = _uLong0;
    return _uLong0-- > 0;
  }
  // ... Rest of the method uses mask and tzcnt shifting.
}```
It's a change to the interface of this struct so I didn't include it. Also relies on useEnabledMask not changing between construction and calling of the method which should normally be true, but those assumptions are where bugs appear. So I wouldn't recommend this in a production code. Or the struct can just add the bool as a field, but I like perfect 16 byte sized struct....
scenic sand
#

As already discussed with you, i personally don't think the false case should be considered a breaking change as i would have have considered that value undefined

rain aspen
#

Performance test benchmark on 100,000 randomized enabled masks and randomized entity counts from 1 to 128. Objectively better performance.

glossy coral
#

I think it's such a great idea, that one of my colleagues implemented a similar optimization weeks ago and has already merged it! I thought it would be in the recent 1.4.0 preview release, but...maybe not? Let me track it down and see where it landed.

#

Woops, it landed further away than I thought, as part of a more ambitious feature branch. I'll see what it would take to get it into a 1.x release. I'm personally not too concerned about any breaking changes; in the worst case, we can just add the tzcnt-based version as a parallel implementation in 1.4, leave the old one in place for now (deprecated), and switch IJE/IFE sourcegen and existing package code to use the faster version by default.
So, consider this a full-throated endorsement of this optimization!

south tangle
rain aspen
novel scaffold
#

hi folks, @glossy coral 's colleague here. I confirm that I've landed something almost identical to your suggestion a couple weeks ago 😆 but not in a 1.x release branch indeed. I'll have a look at porting this. I confirm that EFE/IJE don't use ChunkEntityEnumerator at the moment, but could be based on it. This will need some testing/profiling on a 1.x release branch before I can comment on this further. Thanks for provinding your implementation. As I said it looks really close, but it's nice being able to measure your approach against what I've got.

glossy coral
cerulean trail
#

Please apply to ChunkEntityEnumerator as well. We use it a lot since we avoid code-gen stuff.

novel scaffold
#

@rain aspen would you mind sharing your performance tests? I'm seeing improvements in some cases, but nothing comparable to what you described.

novel scaffold
#

Alright, now everything makes sense and it’s quite interesting. First, thanks again for your code.

Our current loop-based ChunkEntityEnumerator is bad, your optimized version works great. It’s nearly identical to what I’ve made, but in some respects yours is better and in some mine is. The good news is that I combined them and I got the best of both, I’ll include that in a future release. It’s an uncontroversial improvement, but that being said I remain unconvinced that anyone will really notice outside of micro-benchmarking and some edge cases. Did you see a performance improvement in your project by replacing the enumerator? If you did, do you confirm that it’s in cases where only a very few bits are on in the masks on average?

An important consideration is that as soon as ChunkEntityEnumerator is involved, it robs the compiler of some optimization opportunities. In particular, there’s no auto-vectorization. Which is actually quite impactful when the job is all about branchless math stuff.

Also, as soon as I’m using a non-trivial job that actually accesses component data, this completely dwarfs the cost of iteration. So my take is that the new ChunkEntityEnumerator is only noticeable when the job does almost no per-entity processing and when the bitfields become particularly sparse, because it’s just so much better at skipping zeroes.

When using your performance tests, which are only measuring enumerator performance and aren’t doing anything about per-entity data, I’m seeing a 3x performance improvement. If I increase the amount of enabled bits it goes as low as 2x and it can get arbitrarily high as I reduce it. All of this holds true with Burst on or off. That’s likely machine dependent though, I’m running an M1 here. This will go through cross-platform perf testing before it lands anyway.

#

But initially, your version performed better than mine. I noticed that your constructor was a little more efficient, so I switched to it and… then my version became faster… Why so? Well, well.

Your version does: _uLong0 ^= 1ul << tzcnt(_uLong0);
Whereas my version does: mask64_0 &= mask64_0 - 1;

And since the tzcnt is computed anyway, it certainly looks like it should be perfectly equivalent. But here’s the catch: with burst off, tzcnt is not an intrinsic, so the compiler doesn’t know that the range is [0-63] (63 and not 64 because there’s a condition around that piece of code that guarantees it cannot be 64). But a left shift in C# only uses the lowest bits of the shift value corresponding to the range of the value being shifted, in this case the lower 6 bits. Since the compiler doesn’t know the range is already valid, it has to bit mask the shift value. So it loads the constant 63 and does an and, and those extra steps are not required when burst compiled and are also not required when subtracting 1. So what’s going on if burst is on? will you say. Well, I’m not completely sure in this case. Here’s what your code compiles into (arm assembly):

    rbit        x10, x9
    clz        x10, x10
    lsl        x11, x24, x10
    eor        x9, x11, x9

Whereas mine is this:

    rbit        x10, x9
    sub        x11, x9, #1
    and        x9, x11, x9
    clz        x10, x10

Notice that tzcnt turns into rbit/clz (reverse + count leading zeros) because arm doesn’t have a proper tzcnt. And it looks like having the mask update not depend on the tzcnt result allows a more efficient execution ordering? I’m speculating at this point, but all I can tell for sure is that it’s faster.

#

Finally, you said that your version was slower when useEnabledMask is false. I found this to only be the case with burst off. With burst on, they both run at the same speed. And with the &= approach I described above it’s only about 2x slower, which sounds reasonable enough.

Anyway, thanks for this! You prompted me to investigate that topic again and I discovered new things, that was fun.

neat scroll
#

o, that's an interesting optimization

rain aspen
#

But any optimization is good optimization.