#Burst slower than regular c#?


viscid jackal
#

Anyone notice burst being kind of... slow? Specifically when straight up iterating over arrays of TransformHandles? I know there's a performance hit when first calling burst code, but I'm using the performance package. I made a forum post about it: https://discussions.unity.com/t/burst-slower-than-non-burst/1710309 but I'm sure that will be lost, might get an answer here.

Was wondering if someone could sanity check me? Don't use burst that often. This is the code I'm using to test burst, which is slower than the code that just iterates over a normal array:

    [Test, Performance]
    [TestCase(1)]
    [TestCase(100)]
    [TestCase(250)]
    [TestCase(500)]
    [TestCase(1000)]
    [TestCase(10000)]
    public void SetPosition_TransformHandle_Burst(int objectCount)
    {
        var objects = new GameObject[objectCount];
        var handles = default(NativeArray<TransformHandle>);

        Measure.Method(() => { TransformHandle_Burst.Compute(ref handles, 0.1f); })
            .SetUp(() =>
            {
                handles = new NativeArray<TransformHandle>(objectCount, Allocator.Persistent);
                for (var i = 0; i < objectCount; i++)
                {
                    objects[i] = new GameObject();
                    handles[i] = objects[i].transformHandle;
                }
            })
            .CleanUp(() =>
            {
                handles.Dispose();
                foreach (var obj in objects)
                    Object.DestroyImmediate(obj);
            })
            .WarmupCount(k_WARMUP_COUNT)
            .MeasurementCount(k_MEASUREMENT_COUNT)
            .Run();
    }

    [BurstCompile(OptimizeFor = OptimizeFor.Performance, CompileSynchronously = true)]
    private static class TransformHandle_Burst
    {
        [BurstCompile(OptimizeFor = OptimizeFor.Performance, CompileSynchronously = true)]
        public static void Compute(ref NativeArray<TransformHandle> handles, in float delta)
        {
            for (var i = 0; i < handles.Length; i++)
            {
                var transformHandle = handles[i];
                transformHandle.position += new Vector3(i * delta, 0, 0);
            }
        }
    }

For 10,000 transforms, I get a median time of 1.02 ms with burst, where without burst I get 0.84 ms. Is 10,000 not enough to see a performance increase from using burst? I do see a performance increase when using IJobParallelForTransform and the TransformAccess struct, but I thought that burst iteration over 10k objects would be faster than non-burst iteration over 10k objects.
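For comparison, the job-based variant mentioned above might look roughly like the following. This is a minimal sketch, not the poster's actual test code: the names `MoveJob` and `TransformJobExample.MoveAll` are hypothetical, and it assumes the standard `IJobParallelForTransform` / `TransformAccessArray` API from `UnityEngine.Jobs`.

```csharp
using Unity.Burst;
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

// Hypothetical job: applies the same per-index offset as the test above,
// but through TransformAccess instead of TransformHandle.
[BurstCompile]
public struct MoveJob : IJobParallelForTransform
{
    public float Delta;

    public void Execute(int index, TransformAccess transform)
    {
        transform.position += new Vector3(index * Delta, 0f, 0f);
    }
}

public static class TransformJobExample
{
    // Hypothetical helper: schedules the job over the given transforms
    // and blocks until it has finished.
    public static void MoveAll(Transform[] transforms, float delta)
    {
        var access = new TransformAccessArray(transforms);
        try
        {
            JobHandle handle = new MoveJob { Delta = delta }.Schedule(access);
            handle.Complete();
        }
        finally
        {
            // TransformAccessArray is a native container and must be disposed.
            access.Dispose();
        }
    }
}
```

In a real benchmark the `TransformAccessArray` would normally be created once in setup and reused, so that only the scheduling and execution are measured.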

sharp bobcat
#

Tbh I would expect no real difference here. All your logic is occurring within the native engine so it's not executing under mono or burst.

sharp bobcat
#

    public struct TransformHandle : IEquatable<TransformHandle>, IComparable<TransformHandle>
    {
      internal IntPtr pTransformData;
      [SerializeField]
      internal EntityId id;

#

transform handle is just a pointer into the engine

#

    public Vector3 position
    {
      get
      {
        TransformHandle.AssertHandleIsValid(this);
        Vector3 p;
        this.Internal_GetPosition(out p);
        return p;
      }
      set
      {
        TransformHandle.AssertHandleIsValid(this);
        this.Internal_SetPosition(value);
      }
    }

viscid jackal
#

shouldn't iteration be like 'lightning quick' or something in burst?

sharp bobcat
#

all the properties are just extern calls to the engine

  [NativeMethod(Name = "GetPosition")]
  [MethodImpl(MethodImplOptions.InternalCall)]
  private extern void Internal_GetPosition(out Vector3 p);

  [NativeMethod(Name = "SetPosition")]
  private void Internal_SetPosition(Vector3 p)
  {
    TransformHandle.Internal_SetPosition_Injected(ref this, ref p);
  }

  [MethodImpl(MethodImplOptions.InternalCall)]
  private static extern void Internal_SetPosition_Injected(
    ref TransformHandle _unity_self,
    [In] ref Vector3 p);
sharp bobcat
#

because this code isn't in burst

#

it's c++

#

the overhead is the extern call

#

without looking too closely at your test, it is suggesting this extern call with burst is somehow more costly than from c#

#

which i agree is odd and i'd have to test to confirm

#

but all your benchmarking really is this native call to a c++ library

#

from mono vs burst

sharp bobcat
#

UnityEngine is written in c++ with a light c# layer on top

viscid jackal
#

vs repeated access from burst, which should be faster.

#

¯_(ツ)_/¯

copper oxide
#

it's worth noting that non-burst code is not unoptimized (unless you make a debug build): you still have either il2cpp and the c++ static optimizer, or the mono jit compiler's optimizations being applied

#

and the very first thing even the most basic optimizer does is make sure loops run fast at a basic level, and if you just call external functions from a simple loop like that, then even simple optimizers will quickly reach the best possible loop code for your cpu architecture

#

where you start to see big differences is when you do more interesting inlinable things within the loop that would allow various transformations like vectorization, pulling repeated computations out of the loop, etc, and that's the situation where Burst shines
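To illustrate the kind of loop described above, here is a minimal sketch in which all the work is plain, inlinable math on native arrays, so the compiler has something to vectorize. The job name `IntegrateJob` and its fields are hypothetical; the sketch assumes the `Unity.Mathematics`, `Unity.Collections`, and `Unity.Jobs` packages.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical job: no calls into the engine, only arithmetic on
// contiguous native memory, which is where Burst can optimize the loop body.
[BurstCompile]
public struct IntegrateJob : IJob
{
    public NativeArray<float3> Positions;
    [ReadOnly] public NativeArray<float3> Velocities;
    public float DeltaTime;

    public void Execute()
    {
        for (int i = 0; i < Positions.Length; i++)
        {
            Positions[i] += Velocities[i] * DeltaTime;
        }
    }
}
```

The results would then be copied to the transforms in a separate step, so the engine calls are not mixed into the math loop.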

viscid jackal
#

looks like it was slower because of the editor anyway. Also, not all my tests used +=, which affected things slightly. But running the test as a player test rather than a play mode test changed it a ton; burst is way faster in a build.

#

The reason I was testing this in the first place is that I want to be able to move at most 500 trail renderers per frame.. although I might also need to make a custom trail renderer system that also occurs in the loop. Guess we'll see how it handles it.

viscid jackal
#

yea?

paper mortar
# viscid jackal i dont have any debug settings on

Burst will be notably faster in the build no matter what your editor settings say, as there is an O3-level optimization applied only on builds. I have a case where a parallel job in the editor with burst on takes 1 ms summed across all threads, while in a build it takes 0.1 ms summed - pretty wild difference

empty otter
#

I did my own mini benchmarks and compared the difference in both the build and the editor, and the compiled Burst code didn't change the execution time by even a single percent.

#

Well, I just analyzed whether OptimizeFor.Performance works, and it does.

#

Optimized assembly

paper mortar
#

Then it has to be the directives. The code is semantically complex and uses collections a lot so could be

empty otter
#

Balanced (Default) assembly

#

    public static int TestOpt(in NativeArray<int> array)
    {
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
        {
            sum += array[i] * 2;
        }
        return sum;
    }

    [BurstCompile]
    public static int TestNoOpt(in NativeArray<int> array)
    {
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
        {
            sum += array[i] * 4;
        }
        return sum;
    }

empty otter
# empty otter Optimized assembly

I'll explain the difference: the optimized code doesn't insert the number 2 directly into the code, but stores it in a register, which is supposed to be faster.

#

Anyway, wait

#

OptimizeFor.Performance not working

#

It might be better to try with more complex code

empty otter
#

Well, it looks like Burst Inspector only shows the most optimized version