#Why can't this loop be vectorized? Also, at what point will vectorization improve performance?

pine vine
#

Here is my loop:

        // cap a
        var capACenter = vertOffset;
        verts[vertOffset++] = new Vertex { position = a, uv = new float2(0, 0.5f) };
        var capAFirst = vertOffset;
        var capAOffset = vertOffset;

        for (var i = 0; i <= capSegments; i++)
        {
            var t = i / (float)capSegments;
            var angle = math.PI * t;
            var cosA = math.cos(angle);
            var sinA = math.sin(angle);
            var posA = a - rightA * cosA - segDirA * halfWidth * sinA;
            var uv = new float2(-0.5f * sinA, 0.5f - 0.5f * cosA); // calculate UVs 'correctly'
            
            verts[capAOffset + i] = new Vertex { position = posA, uv = uv };
        }

        // write back offset that should have been incremented inside loop
        vertOffset += capSegments + 1;

Here is the reason:

---------------------------
Remark Type: Analysis
Message:     BuildBillboardQuadJob.cs:186:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
Pass:        loop-vectorize
Remark:      NonReductionValueUsedOutsideLoop
Function:    BuildBillboardQuadJob, Tracing, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null.GenerateCaps(BuildBillboardQuadJob*, Tracing, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null this, Unity.Collections.NativeArray`1[[BuildBillboardQuadJob+Vertex, Tracing, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null]]&, UnityEngine.CoreModule, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null verts, Unity.Collections.NativeArray`1[[System.UInt16, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]&, UnityEngine.CoreModule, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null indices, System.Int32&, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089 vertOffset, System.Int32&, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089 indexOffset) -> System.Void, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089_9723e40b7a4e00b747f85c6c9abcaf1c from Tracing, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null
---------------------------

Is it the Vertex struct? (just a float3 and a float2). Or the verts native array? Everything else is floats, float3s, and ints.

Also, should I be trying to vectorize this? Just want to see how fast I can push things tbh. It's a quick loop that runs between 2 and 255 times, with an expected value of ~12 (although it will happen a lot). Will vectorization provide any benefit?

humble yacht
#

soon as you go to anything but float/float4 you basically can't completely autovectorize a loop

#

certain steps might be, but not the full thing

pine vine
#

ohh, so I'd need to flatten the array to 3 arrays of floats or make it float4s?

short cipher
#

I would give the AI an assembler and maybe I would get an accurate answer

short cipher
pine vine
#

oh yea. But what do you mean give the AI an assembler?

humble yacht
#

actually i just read it again, your float2 is only used for writing at the end so it's probably not a problem [though it might be]

#

but just go look at your assembly in the burst inspector

pine vine
#

position is a float3

humble yacht
#

it'll probably be pretty nice

pine vine
#

ok let me look, i'll have to look up the instruction sets to see if they're vectorized or not

humble yacht
#

just post a screenshot

#

tldr: if the instruction ends in ps you are happy

pine vine
#

i need to find it first haha...

short cipher
pine vine
#

ohhh i see what you mean now

#

I tried to, the output was too big lol

#

maybe if I run it from the CLI, but I'm not paying for any right now outside of the bundled JetBrains AI I get for having dotUltimate

short cipher
pine vine
#

I think this is the right stuff? Second screenshot is a bit above the first.

#

I see some things ending in ps lol

#

vs this which I think is fully vectorized? At least there's nothing in the diagnostics about it not being vectorized.

short cipher
pine vine
#

yea maybe. There is some simple math. but I might have also sent the wrong screenshot lol.

short cipher
pine vine
#

yea makes sense. Well at least my main loop is vectorized and this one uses some vectorized instructions.

worthy crane
#

using sincos would maybe help

When Burst compiled, this method is faster than calling sin() and cos() separately.

maybe something like

var capACenter = vertOffset;
verts[vertOffset++] = new Vertex { position = a, uv = new float2(0f, 0.5f) };
var capAFirst = vertOffset;
var capAOffset = vertOffset;

var angleStep = math.PI / capSegments;
var dirScale = segDirA * halfWidth;

for (var i = 0; i <= capSegments; i++)
{
    var angle = i * angleStep;
    math.sincos(angle, out var sinA, out var cosA);

    var posA = a - rightA * cosA - dirScale * sinA;
    var uv = new float2(-0.5f * sinA, 0.5f - 0.5f * cosA);

    verts[capAOffset + i] = new Vertex { position = posA, uv = uv };
}

vertOffset += capSegments + 1;

edit: my bad didn't see it was an old thread

short cipher
#

In theory, this should remove 1 call

#

But it's funny that the compiler doesn't even know about this optimization, and you need to apply it manually

worthy crane
#

next step would be like I guess running 4 cap segments at a time maybe

#

but that's so annoying to write lol
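
Rough sketch of the "four cap segments at a time" idea, in case it helps anyone landing here. Assumptions: `CapSketch` and `WriteCap` are made-up names, `Vertex` is the float3 + float2 struct described at the top of the thread, `dirScale` is `segDirA * halfWidth` as in the snippet above, and `capSegments` is at least 1. It leans on the `float4` overload of `math.sincos`. Untested and unprofiled, so check the Burst inspector before trusting it.

```csharp
using Unity.Collections;
using Unity.Mathematics;

public static class CapSketch
{
    // Assumed layout, per the thread: "just a float3 and a float2".
    public struct Vertex
    {
        public float3 position;
        public float2 uv;
    }

    // Hypothetical helper (not from the original job): writes capSegments + 1
    // cap vertices starting at verts[offset], evaluating four angles per iteration.
    public static void WriteCap(NativeArray<Vertex> verts, int offset, int capSegments,
        float3 a, float3 rightA, float3 dirScale)
    {
        var count = capSegments + 1;
        var angleStep = math.PI / capSegments;
        var lane = new float4(0f, 1f, 2f, 3f);

        for (var i = 0; i < count; i += 4)
        {
            // Angles for segments i .. i + 3, one per lane.
            var angle = (lane + i) * angleStep;
            math.sincos(angle, out float4 sinA, out float4 cosA);

            // Same math as the scalar loop, one float4 per position component.
            var px = a.x - rightA.x * cosA - dirScale.x * sinA;
            var py = a.y - rightA.y * cosA - dirScale.y * sinA;
            var pz = a.z - rightA.z * cosA - dirScale.z * sinA;
            var u = -0.5f * sinA;
            var v = 0.5f - 0.5f * cosA;

            // Scatter the lanes back into the Vertex array; the last batch may be partial.
            var n = math.min(4, count - i);
            for (var k = 0; k < n; k++)
            {
                verts[offset + i + k] = new Vertex
                {
                    position = new float3(px[k], py[k], pz[k]),
                    uv = new float2(u[k], v[k])
                };
            }
        }
    }
}
```

The scatter at the end is still scalar, so whether this beats the plain `sincos` version at ~12 segments is something only profiling will tell.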

lunar forge
#

This package transcribed Burst's mathematics sin-cos implementation. It's for v256, but the same logic applies to standard v128 vectors.