#EntityQuery ToEntityArray crashing, out of leads... Any tips on debugging?

1 messages · Page 1 of 1 (latest)

mystic olive
#

I'm able to replicate a crash in my game and the logs point to

0x00007FFF6DBF6FF8 (lib_burst_generated) [C:\Users\Bilal\Dev\frankenstorm\Library\PackageCache\com.unity.burst@1.8.8.Runtime\unknown\unknown:0] Unity.Entities.ChunkIterationUtility.Unity.Entities.GatherEntitiesWithoutFilter_00000A34$BurstDirectCall.Invoke
(dump and log attached)

Using Unity 2022.3.6f1. Error was occurring on Burst 1.8.7 and Entities 1.0.11 so I tried upgrading to the latest Burst (1.8.8) and Entities (1.0.14) packages, but it still occurs there.

This crash is occurring when the query is executed in a MonoBehaviour, while the ECS world is not doing anything (in between when the last ECS tick finished and the next one will begin) - so there should be no jobs or anything running to cause collisions. I've got safety checks enabled and nothing relevant popped up there. I've spent quite a while trying to confirm this assumption, and yeah AFAIK there's really no sign of any concurrency going on at this point.

Not really sure what to look for at this point. Any advice on how to proceed?

The crash is coming up when a particular enemy in my game dies. When it dies, it creates a status effect on nearby towers. These operations are executed via command buffers which are played back before that MonoBehaviour ticks. I can't see why any of this would be relevant though... things dying and things creating status effects are very common in my game so this is nothing new 🤔

Also, this crash does not occur in editor, only in a build, so replication is quite slow...

short mantle
#

Is this repoable?

mystic olive
#

I am able to repo it consistently-ish

#

Takes a few minutes each time though

short mantle
#

im curious what happens if you call
query.CompleteDependency() beforehand

mystic olive
#

Interesting

#

Will try

#

I'm curious, if this does happen to fix it, then that means my assumption about nothing running in the meantime is wrong. Is there a way I can poke at the query.Dependency to see what it's depending on? Would help me fix the root cause

#

Well, I failed to replicate

#

Might've fixed it!

#

But yeah I'd love to be able to debug what's actually running behind it so I can fix the root cause

#

Any way to do something like this?

if (!query.Dependency.IsCompleted)
    foreach (var handle in query.Dependency.DependentHandles)
        UnityEngine.Debug.Log($"{handle.DebugName}");
short mantle
#

This is only happening in a build right?

mystic olive
#

Yep

short mantle
#

My guess is it's just running much faster in a build

#

And the system/jobs are still running

#

Or the timings are different in a build (this happens)

mystic olive
#

I have an explicit .Complete() call before this monobehaviour ticks, which is supposed to complete everything

#

so it's still unexpected for me

#

Otherwise... I'd pretty much have to put a query.CompleteDependency() behind like every single query? 😆 Pretty dodgy

short mantle
#

if your outside of systems

#

yeah pretty much

#

actually hmm

#

i feel like this behaviour changed at some point

#

but pretty much if you access data on main thread, you need to complete dependency

#

i swear it used to do it for you in 0.51 and changed in 1.0 or something

#

but i use main thread access so infrequent these days i honestly can't remember

mystic olive
#

Actually after running it for a lot more minutes, I even got one inside of a system.

I'm curious about how queries work now. If I call query.ToEntityArray, does this create a new job handle that is not completed immediately? Should I be manually completing queries?

short mantle
#

i would not expect ToEntityArray to need a complete because the data can't change in a job but ToComponentDataArray should
HOWEVER
I can clearly see it does a complete in there already

        public NativeArray<T> ToComponentDataArray<T>(AllocatorManager.AllocatorHandle allocator, EntityQuery outer)
            where T : unmanaged, IComponentData
        {
            // CalculateEntityCount() syncs any jobs that could affect the filtering results for this query
            int entityCount = CalculateEntityCount();
            // We also need to complete any jobs writing to the component we're gathering.
            var typeIndex = TypeManager.GetTypeIndex<T>();
            _Access->DependencyManager->CompleteWriteDependency(typeIndex);```
#

so i don't know what's going on

#

my best guess is you have safety off somewhere

mystic olive
#

e.g. my new crash was in this code somewhere:

            if (_pendingDestructionQuery.CalculateEntityCountWithoutFiltering() == 0)
                return;

            var ctx = _tcbSystem.PrepareTCB();

            var job = new JibJob
            {
                Ctx = ctx,
                EntitiesPendingDestruction = _pendingDestructionQuery.ToEntityArray(Allocator.TempJob),
                EntitiesWithDependencies = _dependsOnQuery.ToEntityArray(Allocator.TempJob),
                DependsOnDatas = _dependsOnQuery.ToComponentDataArray<ExistenceDependsOnEntityData>(Allocator.TempJob),
                ValuesThatDependOnKey = ValuesThatDependOnKey,
                EntitiesToDestroyIterCurrent = EntitiesToDestroyIterCurrent,
                EntitiesToDestroyIterNext = EntitiesToDestroyIterNext,
                EntitiesToDestroyAtDepth = EntitiesToDestroyAtDepth,
                DestroyDepthForEntity = DestroyDepthForEntity,
            };
            job.Run();

(running in a System's Update)

Either _pendingDestructionQuery or _dependsOnQuery are having this crash. All jobs that have run before this point in the world tick were Schedule().Complete()d or Run()d, so I would again not expect anything to be running concurrently in the background. But I didn't Complete any queries 🤔

short mantle
#

you should need to call dependency.Complete() before doing this type of thing

#

though it depends because you have ToComponentDataArray

#

i dont' know what the other 5 fields are

mystic olive
#

They're native containers

short mantle
#

just local to system?

#

no other dependency in system?

#

if so no idea

mystic olive
#

Yeah only this system, and the job it's launching, use those containers

short mantle
#

not sure what ctx is either, looks fishy because it's another system reference

mystic olive
#

Ctx has safety disabled lol, that one's a doozy, but my assumptions have been that if I enforce that 0 or 1 jobs are using Ctx at a time then it'll be ok 🤔

#

That's why I'm worried about what's been left uncompleted when these things happen, because of that unsafety there. I've been trying to control the execution order pretty tightly but something's slipped through.

Any ideas on how to do something like this to help track it down? #1143006413510168597 message

#

Ctx has a mishmash of native containers, but AFAIK has nothing that can create or hold up dependencies from being completed, so it's still weird to me that these queries on the System Update have uncompleted dependencies

#

For now I'll query.CompleteDependency on all main thread queries... Would be great to be able to dig into what those dependencies actually are though

#

Thanks for the help!

#

Although it seems there's no CompleteDependency option for this:

            foreach (var waypoint in SystemAPI.Query<RefRW<WaypointData>>())
                waypoint.ValueRW.ReachedThisWave = false;
short mantle
#

will be fine

mystic olive
#

Ah cool

#

Hmm, although I'm still getting the same crash from this code #1143006413510168597 message
even after .CompleteDependency()ing the two queries 🤔

short mantle
#

well the problem with just passing in ctx between systems

#

is it doesn't add a dependency between the systems

#

this is why it's generally a really bad idea for 1 system to reference another

#

if you put your data on an entity and just call GetSingleton it solves like 90% of issues with cross system stuff

mystic olive
#

Mm I've tried to force systems to basically run one by one in a line to avoid issues with that, e.g. just avoid cross-system stuff altogether.

There are some systems which batch schedule jobs that run in the background while rendering and such go ahead. Even in that case, the scheduled jobs will run one at a time anyway, just not on main thread. Anyway, the crashes aren't occurring in this context.

If my systems aren't running 1 by 1 like I've tried then yeah I'd expect all hell to break loose lol...

short mantle
#

So you're saying nothing that uses ctx uses threads?

mystic olive
#

A system may launch a job that uses Ctx, but then nothing else will happen until that job is completed (even if it's threaded, the system will .Complete() the job as part of its Update). So there should never be competition for Ctx

The exception are those systems I mentioned that batch schedule jobs that run in the background later. Those jobs use Ctx, but are scheduled to depend on all other jobs that were scheduled before them, (I literally append each job handle into a list, and combine them as a dependency when scheduling the next job), so they run 1 by 1 in the background lol. This seems to be working fine as I've had no crashes in this area

#

IJobParallelFor doesn't use Ctx

#

IJobEntity with Ctx are not scheduled as parallel

#

I'm wondering if I can use Entities Journaling to look into the dependencies at the time the crash happens

#

although it doesn't look like it shows anything to do with that, looking at the docs

#

Hmm, but manually completing all of the query dependencies didn't fix it anyway.. So that's probably not the right way forward.

mystic olive
#

After testing more, turns out it's still happening in the original spot on that MonoBehaviour that's executed in-between ECS Updates, so must've just been flakiness that made it look like query.CompleteDependency() improved the situation 😦

mystic olive
#

Hmm still stuck on this. I looked at the situation safety-wise to see if I could get any more information by making sure safety checks are running, and I'm not actually using [NativeDisableContainerSafetyRestriction] anywhere 🤔 so I would've expected appropriate warnings if I'm doing something wrong.

I use [NativeDisableParallelForRestriction] sometimes as needed in individual IJobParallelFors, but I don't think that will be related in this instance, considering that I'm using job handles to force jobs to be executed one at a time.

mystic olive
#

Lol this is some crazy shit. I've been very slowly bisecting to find the tipping point between this crash happening and not happening.

With this code, the crash occurs straight away:

            if (target != Entity.Null)
            {
                bool didAvoid;
                if (statusEffectSpec.Type is not StatusEffectType.Debuff)
                    didAvoid = false;
                else
                {
                    var avoidanceChance = Stats.I.Data.Get(ref ctx, Stat.PctDebuffAvoidance, target);
                }
            }

With this code, the crash does not occur:

            if (target != Entity.Null)
            {
                if (statusEffectSpec.Type is not StatusEffectType.Debuff)
                    Stats.I.Data.Get(ref ctx, Stat.PctDebuffAvoidance, target);
            }

I wonder what the compiler's doing!

#

This code change above is clearly not the root cause of the issue (since there's functionally no difference), it's just the straw that breaks the camel's back.

But... I'm at a loss as to how to actually find the root cause at this stage.

Safety checks are enabled except in some standard parallel writing cases. All dependencies are being completed. Everything's running pretty much sequentially.

mystic olive
#

Hmm when I think about it, this kind of behaviour seems in line with what you'd expect from "undefined behaviour" that could arise from e.g. inappropriate parallel writing to containers like hashsets or something - buffer overflows or what not 🤔

I'll carefully review the 6 cases that I've used NativeDisableParallelFor stuff - I know there's at least one that I can remove that attribute for if I split up the logic better - and see if that helps.

#

Does the safety system handle usages of shared statics like Stats.I.Data in the snippet above to warn me of misuse? I didn't put any NativeDisable attributes on any of that 🤔 but I wonder if the safety system just doesn't follow into those

short mantle
#

Shared statics have no safety

mystic olive
#

I see

woven skiff
#

There is only native debugging route I think.

mystic olive
#

I've been using a shared static with a whole bunch of entity queries defined in it. If shared statics have no safety, then maybe that's hiding some problems away from me. I'm going to break that apart and hopefully something will come up.

#

I've audited all my NativeDisableParallelForRestriction, there's only one sketchy one left but AFAIK it's (a) actually safe; and (b) I dunno any other way to handle it that allows me to remove the attribute:

        partial struct JibJob : IJobEntity
        {
            // We only write to the execution entity (the projectile) (and note that this data already exists),
            // and read from another entity (the target) that is not a projectile,
            // so there can be no read/write overlaps.
            [NativeDisableParallelForRestriction] public ComponentLookup<GridPresenceData> GridPrezs;

            [UsedImplicitly]
            public void Execute(Entity entity, ref ProjectileData projectile)
            {
                ref var projectileGridPrez = ref GridPrezs.GetRefRW(entity).ValueRW;

                if (projectile.Target != Entity.Null)
                {
                    if (GridPrezs.HasComponent(projectile.Target))
                        projectile.TargetGridPos = GridPrezs[projectile.Target].Pos;
                    else projectile.Target = Entity.Null;
                }
  • RW ref to the IJobEntity entity
  • Reads from other entities, but which the IJobEntity will never itself iterate over

I can't put the projectileGridPrez as an Execute parameter because the IJobEntity will internally define a component lookup or handle thing that'll raise an error with the one I'm passing in to read from other entities 🤔

Any other way I can handle that?

plucky quest
#

using shared statics outside the main thread is definitely taking the railings off of everything and i would expect crashes and race conditions just like any c++ multithreaded code. so i highly recommend against doing that, because you can get speed and safety at the same time using normal containers and such 🙂

also, just because a monobehaviour is running doesn't at all guarantee that no jobs are running. jobs scheduled at any time before that that nobody called complete() on could well be running.

mystic olive
#

I thought shared statics were exclusively for accessing things off of main thread, otherwise you'd use regular statics 🤔 If not for sharing across threads, then what are shared statics actually for?

#

also, just because a monobehaviour is running doesn't at all guarantee that no jobs are running. jobs scheduled at any time before that that nobody called complete() on could well be running.
Yep I understand this. My updates follow a general structure of:

  1. Thing calls Complete on previous tick's scheduled jobs
  2. MonoBehaviours update
  3. New tick begins that schedules new jobs that'll run during rendering and what not

So when I mention that MonoBehaviour, it's there in 2 where everything's supposedly completed

#

The reason I'm using shared statics myself is basically to avoid passing dozens of arguments all around my codebase 😅 I've got that magic Ctx struct there which holds a whole bunch of native containers and I pass as a ref everywhere, but I started getting errors because it became too large (past the 10k byte limit) and so started splitting it off into shared statics

#

I've got a generic trigger / action system running where anything can kinda cause anything to happen. There are a few different systems which go off and do these action / trigger processing chains, and that's why this Ctx / shared statics are used - otherwise they'd need to have like... everything in the game as arguments lol. These systems do not run concurrently - just sequentially in Burst, and off the main thread.

mystic olive
#

Example of what I mean:

  • projectile hits enemy
  • does a hit trigger
  • tower has a special effect that causes hits to reduce the enemy's damage resistance. This is applied before the damage calculation
  • player had a global effect that gives hits a chance to kill enemies under a 50% life threshold. Thanks to that damage resistance effect, the enemy drops under the threshold from the hit and this kills them
  • that enemy had a debuff that causes its corpse to explode on death, and so it does
  • the explosion kills another 2 enemies, which each had the same debuff, which chain and cause another 4 enemies to die
  • one of those enemies was a bomb-bot who has an on death effect that stuns nearby towers

This all happens "instantly" when that projectile hits the enemy. So this trigger / action system needs to have arbitrary access to game state, pretty much, which is why it's forced to run sequentially. But I've got it running in burst and off main for performance reasons

plucky quest
#

shared statics are for passing between mono and burst on the main thread for the most part

#

for multithreading, we have the job system

#

which, you know, has safety

mystic olive
#

I'm not actually using it with multithreading, so to speak. No parallel jobs use any shared statics

#

But yeah if it doesn't have safety at all 🙁 bah...

#

What about with purely readonly stuff? I've got some shared static native hashmaps which are initialised at startup and then never modified - I'd hope that'd be safe even in a multi-threaded environment?