While digging through my code for optimization targets, I found that my jobs with enabled components run far slower than the ones without. Not just fractions of a millisecond slower, but multiple full milliseconds longer than other jobs. Enabled components obviously have a performance cost, since sequential iteration is broken and Burst can't vectorize across entities. But it shouldn't be that bad, since I rarely get cross-entity vectorization even with my normal components.
So, after timing my job structs, it looks like a major contributor to the bad performance is the ChunkEntityEnumerator that Unity ships with the package. It's horrible: a massive while loop that repeatedly bit-shifts the mask to find the next enabled bit. You know what can do that in a single operation? tzcnt().
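To make the difference concrete, here is a minimal sketch of the two ways to find the next set bit. It is not Unity's code or mine from the job: it's plain C on a single 64-bit word, the function names are made up for illustration, and `__builtin_ctzll` is the GCC/Clang builtin standing in for tzcnt().

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-test scan: inspects one bit per iteration.
   Returns the index of the lowest set bit at or after `start`,
   or 64 if there is none. */
static int next_set_bit_loop(uint64_t mask, int start)
{
    assert(start >= 0);
    for (int i = start; i < 64; i++) {
        if ((mask >> i) & 1u)
            return i;
    }
    return 64;
}

/* Trailing-zero-count scan: masks off the bits below `start`, then
   finds the lowest remaining set bit with a single count instruction.
   Same return convention as the loop version. */
static int next_set_bit_tzcnt(uint64_t mask, int start)
{
    assert(start >= 0);
    if (start >= 64)
        return 64;
    uint64_t remaining = mask & (~0ULL << start);
    if (remaining == 0)
        return 64;              /* ctz of 0 is undefined, so guard it */
    return __builtin_ctzll(remaining);
}
```

Both functions return the same answers; the loop version just does work proportional to the gap between enabled bits, while the count version does a fixed amount of work per lookup.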
Presenting the superior BetterEntityEnumerator, based on tzcnt(). Half the struct memory size, 4x the performance. There is only one "breaking" change: the final index returned once the mask is depleted.
ChunkEntityEnumerator returns the chunk entity count; my version returns -1.
Treat that final value as undefined behavior and don't rely on it. Just use the chunk entity count.
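For anyone curious what a tzcnt-based enumerator looks like in principle, here is a rough sketch in plain C. This is not the actual BetterEntityEnumerator source: the struct, the function names, and the two-word layout for a 128-bit mask are all assumptions for illustration, and `__builtin_ctzll` (GCC/Clang) again stands in for tzcnt(). The only state is the remaining mask, and it hands back -1 once the mask is depleted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enumerator state: just the bits that are still pending. */
typedef struct {
    uint64_t lo;   /* entities 0..63   */
    uint64_t hi;   /* entities 64..127 */
} EnabledMaskEnumerator;

static EnabledMaskEnumerator enumerator_init(uint64_t lo, uint64_t hi)
{
    EnabledMaskEnumerator e = { lo, hi };
    return e;
}

/* Returns the index of the next enabled entity and consumes it,
   or -1 once every set bit has been visited. */
static int enumerator_next(EnabledMaskEnumerator *e)
{
    assert(e != NULL);
    if (e->lo != 0) {
        int index = __builtin_ctzll(e->lo);
        e->lo &= e->lo - 1;     /* clear the lowest set bit */
        return index;
    }
    if (e->hi != 0) {
        int index = 64 + __builtin_ctzll(e->hi);
        e->hi &= e->hi - 1;     /* clear the lowest set bit */
        return index;
    }
    return -1;                  /* mask depleted */
}
```

Callers would loop with something like `while ((i = enumerator_next(&e)) >= 0)`, so the sentinel never gets used as an index.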
BetterEntityEnumerator performs worse (2 ms vs 5 ms) when useEnabledMask is false, since I eliminated that branch.
That's the cost of cutting the struct memory size in half. If the query is guaranteed to have no enabled components, just use a plain for loop in the main IJobChunk.
Otherwise, it's a drop-in replacement. @glossy coral what do you think? Please improve ChunkEntityEnumerator in the base package; enabled components shouldn't be penalized further simply because of bad coding. Maybe not with my version, but at least with something better.