#Attempt at hiding AoS vs SoA comes to an early halt

lone solstice
#

Not directly DOTS, but definitely data-oriented!

A couple of days ago I started trying to hide AoS vs SoA implementations behind a common wrapper, so that I can experiment with both layouts more easily during development (because it's a real pain in the ass to change such code). The idea was very simple: wrap the data accesses in a structure that fetches the data from the correct place, and rely on the compiler to optimize the wrapper away.
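To make the idea concrete, here is a minimal sketch of what such a wrapper could look like. It is not the actual implementation from this thread: `Tile`, `ITileStore`, `TileStoreAoS`, `TileStoreSoA` and `TileMath` are hypothetical names, and it uses plain managed arrays instead of unsafe pointers so it runs anywhere. The stores are structs and the consumer is generic over the store type, which is the usual way to give a JIT the chance to specialize and inline the accessors.

```csharp
using System;

// Hypothetical tile payload; the fields are illustrative only.
public struct Tile
{
    public int Type;
    public float Height;
}

// Common accessor contract, so calling code does not depend on the layout.
public interface ITileStore
{
    int Count { get; }
    int GetTileType(int index);
    void SetTileType(int index, int value);
    float GetHeight(int index);
    void SetHeight(int index, float value);
}

// Array of structures: all fields of one tile sit next to each other.
public readonly struct TileStoreAoS : ITileStore
{
    private readonly Tile[] _tiles;
    public TileStoreAoS(int count) { _tiles = new Tile[count]; }
    public int Count => _tiles.Length;
    public int GetTileType(int index) => _tiles[index].Type;
    public void SetTileType(int index, int value) => _tiles[index].Type = value;
    public float GetHeight(int index) => _tiles[index].Height;
    public void SetHeight(int index, float value) => _tiles[index].Height = value;
}

// Structure of arrays: each field lives in its own contiguous array.
public readonly struct TileStoreSoA : ITileStore
{
    private readonly int[] _types;
    private readonly float[] _heights;
    public TileStoreSoA(int count) { _types = new int[count]; _heights = new float[count]; }
    public int Count => _types.Length;
    public int GetTileType(int index) => _types[index];
    public void SetTileType(int index, int value) => _types[index] = value;
    public float GetHeight(int index) => _heights[index];
    public void SetHeight(int index, float value) => _heights[index] = value;
}

public static class TileMath
{
    // Generic over a struct store: the same loop works for both layouts.
    public static long SumTypes<TStore>(TStore store) where TStore : struct, ITileStore
    {
        long sum = 0;
        for (int i = 0; i < store.Count; i++) sum += store.GetTileType(i);
        return sum;
    }
}
```

Swapping layouts then only means changing the store type at the call site, e.g. `TileMath.SumTypes(new TileStoreSoA(64))` instead of `TileMath.SumTypes(new TileStoreAoS(64))`. Whether the wrapper is actually free depends on the compiler, which is exactly what the rest of this thread is about.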

I ran into trouble earlier than expected!

In a first attempt, I have some (non-bursted) code where I changed the return value of a collection: from a pointer, to a struct containing a pointer.

That's the only meaningful difference in the code, yet my performance tests show I lose around 5% (and up to 20% if I also access neighbouring data).

I have added aggressive inlining to all the relevant functions, so I do not understand where this performance difference comes from. Any ideas about what I could've missed? Maybe some ideas to figure out more?

The struct is just this.

    public unsafe readonly struct TileDataRef
    {
        public readonly TestTileData* _tileData;
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public TileDataRef(TestDataBlock8x8 block, int localFlatIndex)
        {
            // of course this was done in the containing data structure before, but it's constructed in the same place
            _tileData = block.GetDataPtr() + localFlatIndex;
        }

        public bool IsValid => _tileData != null;
        public static implicit operator bool(TileDataRef t) => t._tileData != null;

        public ETileType Type
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => _tileData->Type;
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            set => _tileData->Type = value;
        }
        
        // more accessors here ..
    }

lone solstice
#

I've done a lot of digging, and I have no explanation for this other than to say the compiler used by unity isn't perfect.

I tried porting my code to godbolt, but I don't see any difference in the compiled end-result, so I'm not sure what is going on.

#

~~I've found that removing some force-inlining only in some places helps performance, though I don't understand why those places in particular.~~
For example, force-inlining the biggest function in the getter, but not its wrapper function, is better for performance than doing the inverse. I expected it to not make a difference (the code path taken is the same for all test cases). 🤷

edit: force-inlining various things doesn't make a difference in proper release builds

#

Another surprise: simply adding an extra pointer to the wrapper structure, even if it's completely unused, increases my test's runtime from 460ms to 490ms.

edit: 220ms to 240ms in a proper release configuration
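One thing that might be worth ruling out here (a guess, not something confirmed in this thread): the unused pointer doubles the size of the value that gets constructed, returned and copied on every access, unless the JIT eliminates it. The sketch below only demonstrates the size difference. `OnePointerRef`, `TwoPointerRef` and `WrapperSizes` are hypothetical stand-ins, and `IntPtr` replaces the typed pointer so no unsafe context is needed.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for a wrapper that holds a single pointer.
public readonly struct OnePointerRef
{
    public readonly IntPtr Data;
    public OnePointerRef(IntPtr data) { Data = data; }
}

// Same wrapper with an extra pointer that is never read.
public readonly struct TwoPointerRef
{
    public readonly IntPtr Data;
    public readonly IntPtr Unused;
    public TwoPointerRef(IntPtr data, IntPtr unused) { Data = data; Unused = unused; }
}

public static class WrapperSizes
{
    // One pointer wide: 8 bytes in a 64-bit process, 4 bytes in a 32-bit one.
    public static int SizeOfOne() => Marshal.SizeOf<OnePointerRef>();

    // Two pointers wide, even though the second field is unused.
    public static int SizeOfTwo() => Marshal.SizeOf<TwoPointerRef>();
}
```

This says nothing about how Mono actually passes either struct around; it only shows that the "unused" field is still part of every copy the runtime has to make when the wrapper is not optimized away.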

#

Lastly, I found that passing the return value as out TileDataRef outRef (instead of as the function return value) consistently saves me around 10ms. 🤔

edit: return value and out are similar (in release builds)
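For reference, the two variants being compared look roughly like this. This is a sketch with hypothetical names (`TileRef`, `TileBlock`, `GetTile`, `GetTileOut`), and it uses an array plus index as a safe stand-in for the raw pointer in the real `TileDataRef`.

```csharp
using System;
using System.Runtime.CompilerServices;

// Safe stand-in for a pointer wrapper: backing array plus index.
public readonly struct TileRef
{
    private readonly int[] _types;
    private readonly int _index;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public TileRef(int[] types, int index) { _types = types; _index = index; }

    public bool IsValid => _types != null;

    public int Type
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => _types[_index];
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        set => _types[_index] = value;
    }
}

public sealed class TileBlock
{
    private readonly int[] _types;
    public TileBlock(int count) { _types = new int[count]; }

    // Variant 1: the wrapper is the function's return value.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public TileRef GetTile(int index) => new TileRef(_types, index);

    // Variant 2: the wrapper is written into caller-provided storage.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void GetTileOut(int index, out TileRef outRef)
    {
        outRef = new TileRef(_types, index);
    }
}
```

Both variants hand back the same value, so once everything is inlined in a release build there is no obvious reason for them to differ, which matches the corrected measurement.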

heavy tide
#

Is this profiled in Burst, Mono, or both?

#

I would not expect a difference in Burst, but I would not be surprised at all with Mono.

#

(I know you said non-Burst originally, just curious if you also tested Burst)

#

Have you tried how it does in an il2cpp build?

lone solstice
#

I haven't tried il2cpp yet, nor burst. However, I did make a stupid mistake: the implementation is in a separate assembly that wasn't compiled in release 🤦

The tests were executed in a unity assembly of course, but in spite of all the inlining (which clearly helps a lot even in debug), the debug assembly obviously messes up the measurements (asserts and the like).

#

So, for example, using out is in fact similar to using a return value directly (which makes more sense).

However, using a wrapping structure instead of a raw pointer maintains a consistent 8-10% perf difference in this case, and the addition of a pointer to the wrapping structure does the same

lone solstice
#

@heavy tide Do you know if it's possible to burst compile the unit tests in the unity testing framework? Or do you not use that package?

It's a bit harder to test in a real environment because that requires adapting a lot of code to use the test containers that I wrote ^^

#

I tried adding the BurstCompile attribute of course, but it doesn't seem to affect the results in any way (or I'm really unlucky and the results stay the same, but somehow I doubt that).

#

Final test results, just looping over the data in a linear fashion and summing a single value. For now I'm concluding that this approach just doesn't work well in Mono, and that we should avoid wrapping things as much as we can in performance-sensitive contexts.
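For anyone who wants to reproduce this kind of measurement outside the original test suite, a minimal sketch of a linear sum benchmark could look like the following. It is not the test code used above: `ValueRef`, `SumBenchmark` and `BestOfMs` are hypothetical names, the data is a plain `int[]` rather than the tile blocks, and timing uses `System.Diagnostics.Stopwatch`.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Hypothetical wrapper: backing array plus index, standing in for a pointer wrapper.
public readonly struct ValueRef
{
    private readonly int[] _values;
    private readonly int _index;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ValueRef(int[] values, int index) { _values = values; _index = index; }

    public int Value
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => _values[_index];
    }
}

public static class SumBenchmark
{
    // Baseline: direct access to the data.
    public static long SumRaw(int[] values)
    {
        long sum = 0;
        for (int i = 0; i < values.Length; i++) sum += values[i];
        return sum;
    }

    // Same loop, but every access goes through the wrapper struct.
    public static long SumWrapped(int[] values)
    {
        long sum = 0;
        for (int i = 0; i < values.Length; i++) sum += new ValueRef(values, i).Value;
        return sum;
    }

    // Runs the body once as warm-up, then returns the best of `runs` timings in milliseconds.
    public static double BestOfMs(Func<long> body, int runs)
    {
        body();
        double best = double.MaxValue;
        for (int r = 0; r < runs; r++)
        {
            var sw = Stopwatch.StartNew();
            body();
            sw.Stop();
            best = Math.Min(best, sw.Elapsed.TotalMilliseconds);
        }
        return best;
    }
}
```

Comparing `BestOfMs(() => SumBenchmark.SumRaw(data), 20)` against the wrapped variant on a large array gives a rough idea of the overhead. As the earlier mistake in this thread shows, the numbers only mean something when every involved assembly is built in release.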

I think testing in il2cpp would be interesting but I don't have the time to dig into that area as well (especially because it requires making a build and I can't use the same test framework)

#

Iterators are helping a lot here, because they know more about the memory layout.
I don't have an iterator equivalent for the raw version in this implementation. I implemented it to see if separating the data helps in this case (it doesn't, unless I add a lot of data to the structure to force more cache misses).
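As an illustration of what "knowing more about the memory layout" can mean, here is a minimal iterator sketch. It is an assumption about the general shape, not the iterator from this implementation: `TileTypeIterator` and `IteratorDemo` are hypothetical names, and a flat `int[]` stands in for the pointer-based block. The point is that the iterator resolves the backing storage once and then only advances a cursor, instead of building a fresh wrapper from a block lookup for every tile.

```csharp
using System;

// Hypothetical iterator over one block's tile types.
public struct TileTypeIterator
{
    private readonly int[] _types;
    private readonly int _end;
    private int _cursor;

    public TileTypeIterator(int[] types)
    {
        _types = types;
        _end = types.Length;
        _cursor = -1; // positioned before the first tile
    }

    // Advances to the next tile; returns false once the block is exhausted.
    public bool MoveNext() => ++_cursor < _end;

    // Reference into the backing array, so callers can read and write in place.
    public ref int Current => ref _types[_cursor];
}

public static class IteratorDemo
{
    public static long Sum(int[] types)
    {
        long sum = 0;
        var it = new TileTypeIterator(types);
        while (it.MoveNext()) sum += it.Current;
        return sum;
    }
}
```

For an SoA layout the same interface could hold one cursor shared by several arrays, so the calling loop would not need to change.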