# What causes a Burst-compiled job to be pushed to the main thread (freezing the game), and what is the fix?

unborn ginkgo
#

This is a jobs + Burst question, no Entities. I schedule long-running jobs that normally do their thing, which is path finding, on one of the 19 available worker threads. Most of the time a job carries across a frame or two, but sometimes I see jobs that freeze the editor and the build for 5000 ms. Yes, these jobs move to the main thread and freeze everything.
Now, my understanding is that a job can jump to the main thread when no worker thread is available but, looking back in time to before the freeze, I see that the worker threads are very light on work. There may be 6 busy with path finding at all times; the remaining 10 or so do basic physics and whatnot that's Unity's domain, a total of at most 1 ms/frame.
So we're trying to find other causes, and ways to prevent this.

kindred cloak
#

Make sure nothing on the main thread depends on what your jobs are doing, or the main thread will have to wait for your jobs to complete, which can result in it pulling them onto the main thread. Ideally you would only be checking the job status.
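A minimal sketch of the "only check the job status" pattern, not taken from the thread. FindPathJob and PathRequester are hypothetical names, and the job body is a placeholder for the real search. The idea is that Update polls JobHandle.IsCompleted and only calls Complete() once the job has already finished, so the main thread never waits on it.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical stand-in for the real path-finding job.
[BurstCompile]
struct FindPathJob : IJob
{
    public NativeArray<int> Result;

    public void Execute()
    {
        Result[0] = 42; // placeholder for the actual search
    }
}

public class PathRequester : MonoBehaviour
{
    NativeArray<int> result;
    JobHandle handle;
    bool running;

    void Start()
    {
        result = new NativeArray<int>(1, Allocator.Persistent);
        handle = new FindPathJob { Result = result }.Schedule();
        JobHandle.ScheduleBatchedJobs(); // make sure the job is handed to the workers now
        running = true;
    }

    void Update()
    {
        // Poll only. Calling Complete() on an unfinished job would make the
        // main thread wait, which is where stealing can happen.
        if (!running || !handle.IsCompleted) return;

        handle.Complete(); // already finished; still needed before reading Result
        running = false;
        Debug.Log(result[0]);
    }

    void OnDestroy()
    {
        handle.Complete(); // the job must be done before its container is disposed
        if (result.IsCreated) result.Dispose();
    }
}
```

Note that this only covers waits you issue yourself. Touching a container the job owns, or scheduling other work that depends on the handle and then completing that, still forces a wait.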

ashen arch
#

It’s very hard to guarantee that this will never happen, which is why we don’t recommend having such long-running jobs today, unfortunately. It’s on the todo to enable that use case, but in the meantime, manual time-slicing is the way to go
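One way manual time-slicing could look, as a rough sketch under stated assumptions rather than a recommended implementation. SlicedSearchJob, SlicedSearchDriver, and the step counts are all hypothetical; a real search would keep its open list and visited set in persistent native containers instead of a single counter. Each job run does a bounded amount of work, stores its progress, and is rescheduled until finished.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical resumable job: each run processes at most MaxSteps units of
// work, then returns, leaving its progress in State for the next run.
[BurstCompile]
struct SlicedSearchJob : IJob
{
    public NativeArray<int> State; // [0] = next step to process, [1] = 1 once finished
    public int TotalSteps;         // stand-in for "the search has nothing left to expand"
    public int MaxSteps;           // per-slice budget (assumed value, tune by profiling)

    public void Execute()
    {
        int next = State[0];
        int end = next + MaxSteps;
        if (end > TotalSteps) end = TotalSteps;

        for (int i = next; i < end; i++)
        {
            // expand one node of the search here
        }

        State[0] = end;
        State[1] = end == TotalSteps ? 1 : 0;
    }
}

public class SlicedSearchDriver : MonoBehaviour
{
    NativeArray<int> state;
    JobHandle handle;
    bool done;

    void Start()
    {
        state = new NativeArray<int>(2, Allocator.Persistent);
        ScheduleSlice();
    }

    void ScheduleSlice()
    {
        handle = new SlicedSearchJob
        {
            State = state,
            TotalSteps = 1000000,
            MaxSteps = 10000
        }.Schedule();
        JobHandle.ScheduleBatchedJobs();
    }

    void Update()
    {
        if (done || !handle.IsCompleted) return;

        handle.Complete(); // slice already finished; hands State back to the main thread
        if (state[1] == 1)
        {
            done = true;
            Debug.Log("search finished");
        }
        else
        {
            ScheduleSlice(); // resume where the previous slice stopped
        }
    }

    void OnDestroy()
    {
        handle.Complete();
        if (state.IsCreated) state.Dispose();
    }
}
```

With this shape, even if the main thread does end up running a slice, the stall is bounded by one slice rather than the whole search.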

nova halo
#

Which Unity version are you using?

When the main thread waits it will steal jobs; however, in 2022.2 the logic was changed to make it a bit kinder when there might be long-running jobs.

If the main thread is told to wait on a parallel job, and there is a portion of that parallel work available, then it will run that work. So long running parallel jobs could cause a stall on the main thread -- it's best to break up long running jobs in general to be pieces of work that can resume across a series of jobs.

If the job being waited on isn't a parallel job, then the main thread will look for a job to do while waiting, unless there are fewer jobs than worker threads in the system. In that case it will do nothing, letting the worker threads do the work instead.

If long-running jobs cannot be adjusted for your game, the hard switch is to force the main thread to never steal, via the boot.config entry no-main-thread-job-stealing=true or the command-line switch --no-main-thread-job-stealing=true

The main thread will never help in such cases which could slightly hurt perf in cases like waiting on a parallel job that the main thread could have helped with. But if you find you're more often than not waiting on long running work scheduled on the main thread then this might help.

We intend to offer better granularity for job affinity but I have no timelines to provide at this moment. Hope this helps

echo cedar
#

@true parcel you asked a while ago about never using the main thread for jobs; apparently there's a command-line argument to turn it off ^

true parcel
#

I can't find anything in docs about bootconfig/commandline 🤷‍♀️

#

(this deserves more than cmd arg, maybe an option in project settings?)

unborn ginkgo
#

wonderful, thank you for all the explanations and help 🙏

#

@nova halo can you give an example of how to use the command line? in a build and editor.

#

@ashen arch such a long time for pathfinding is possibly a bug, even with a 130K-triangle navmesh. But until this is clarified we'd like to move on with gameplay and put timeouts in the AI (aka hacks that end up shipping)

unborn ginkgo
#

I use 2021.3.22. What is the heuristic change in 2022? Shouldn't the decision to steal jobs depend on how busy the worker threads are? E.g. 80% utilization = steal a job which, on average (mean), lasts less than the most common (mode) main thread idle time

nova halo
#

On 2021.3 there is no heuristic or option to make the main thread not steal.

boot.config is an ini file that can be deployed to configure the engine. It's not documented since it is meant to be allowed to change as the engine changes. It should be in your project settings and be deployed with your app (it might be omitted if empty). Options in boot.config can be passed to a Unity player or Unity editor exe via -option=value
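For illustration only, this is roughly how the switch discussed in this thread might be applied. The option name is spelled exactly as quoted earlier in the thread and is not verified against any documentation; the thread gives both a single-dash (-option=value) and a double-dash form, and the double-dash form is used here. MyGame.exe and the project path are placeholders, -projectPath is the usual editor argument for opening a project, and per the thread the option does not exist on 2021.3.

As a line in boot.config:

```
no-main-thread-job-stealing=true
```

On the command line, for a player build and for the editor:

```
MyGame.exe --no-main-thread-job-stealing=true
Unity.exe -projectPath "C:\Projects\MyGame" --no-main-thread-job-stealing=true
```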

"shouldn't the decision to steal jobs depend on how busy worker threads are"
In 2022.2 it does. Generally it's better to not have a heuristic at all, but it seems many folks end up having long-running jobs that stall their main thread due to bad stealing when there are idle threads. So to help avoid that, we loosely check whether the workers are busy and steal only if they are. If the workers could do the work instead of the main thread, the main thread will prefer to yield execution completely and wait to be signaled when the job it's waiting on is done.

nova halo
#

Are you using 2021.3?

unborn ginkgo
#

yes

#

was having trouble with 2022 and the new physics

#

ok so confirming that main thread is still stealing jobs

unborn ginkgo
#

Just tried 2022 and unfortunately still getting main thread stealing jobs.

#

I see that boot.config no longer exists in 2022; it gets deleted

#

deep profiler if that helps:

frank zenith
# ashen arch It’s very hard to guarantee that this will never happen, which is why we don’t r...

Maybe not the best idea, but for the sake of discussion, an idea on the topic of time-slicing..

As a compromise between moving code around and introducing thread preemption issues, did you consider a job type that can yield its execution (as in fibers/coroutines)? Something like myJob.ScheduleYieldable(), and a yield instruction that can be called from the job.

There's a slew of difficulties with the idea, but I imagine you already solved most of them (seeing as how there are C# coroutines as well).

Of course this wouldn't do anything for calls to external APIs that take too long, so perhaps it'd be a mediocre compromise in the end.

ashen arch
#

c# coroutines are horrifyingly slow and main thread only

#

but i would not say that yielding is super high on the list of candidate solutions to this problem

hushed basin
#

I recently got rid of my only long running job. I never had issues described in this thread (it would have been noticeable as it took 3+ seconds to run) but I had a lot of other little annoyances I had to work around that were bothering me.

#

I will say it now runs like 30% slower sadly, because it's hard to manually slice work efficiently

nova halo
#

i.e. a naive approach would be for JobYield() to simply steal a job that is ready for execution and run it in place. However, if you steal other jobs that also yield, you could recurse until the stack blows with such an approach. But this approach is simple, since you don't need to persist job state between jobs explicitly

frank zenith
#

the naive approach doesn't solve the issue at hand though, where the main thread is occupied by a job that runs too long; for the idea to work I don't think you can avoid having a stack / state with the yieldable job, because it must be able to migrate away from the main thread

thinking about it like that, I suppose an easier "solution" is to be able to mark a job as slow and prevent it from being grabbed by the main thread in the first place

ashen arch
#

this ^ is the more likely idea we have at the moment. haven't started on it yet though

kindred cloak
#

I would love to see "background" threads paired with each job thread, so you could create long-running jobs that absorb all the idle time from job threads (which would otherwise take a lot of effort to get even 50% utilisation) and just use the operating system to context switch between them (e.g. background threads at lower priority)

frank zenith
#

I think they're trying to avoid that solution at all costs, because that will always introduce extra thread context switches at some point. And while that may be acceptable on PC in many cases, I think it will not be very likely to behave nicely on many consoles where the performance constraints are tighter.

kindred cloak
#

I don't think a context switch by itself for <100 jobs on a single thread per frame is going to be that much (they could even spinlock a short time to make sure the background thread doesn't get switched to until there's idle time, so more like 50 "idle zones"), a problem could be cache misses but that would only affect the background thread