This is a jobs+burst question, no entities. I schedule long-running jobs that normally do their thing, which is pathfinding, on one of the 19 available worker threads. Most of the time the jobs carry across a frame or two, but sometimes I see jobs that freeze the editor and the build for 5000 ms. Yes, these jobs move to the main thread and freeze everything.
Now my understanding is that a job can jump to the main thread when no worker thread is available, but looking back in time to before the freeze, I see that the worker threads are very light on work. There may be 6 busy with pathfinding at all times; the remaining 10 or so do basic physics and whatnot that's Unity's domain, for a total of at most 1 ms/frame.
So we're trying to find other causes, and ways to prevent it.
#What causes a bursted job to be pushed to the main thread (freezing the game), and what is the fix?
Make sure nothing you do on the main thread depends on these jobs, or the main thread will have to wait for them to complete, which can result in it pulling the job onto the main thread. Ideally you would only be checking the job status.
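For illustration, here is a minimal sketch of "only checking the job status" rather than blocking on the result. It is plain Python with a thread pool standing in for the job system, not Unity code; `long_running_job` and `run_frames_until_done` are hypothetical names. In Unity terms this corresponds roughly to polling `JobHandle.IsCompleted` each frame and only calling `Complete()` once it reports true.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def long_running_job(n):
    """Stand-in for an expensive job such as a pathfinding query."""
    return sum(range(n))


def run_frames_until_done(executor, n, frame_seconds=0.001):
    """Submit the job, then poll its status once per simulated frame.

    Returns (result, frames_rendered). The main loop never blocks on the
    job, so a slow job costs extra frames of latency instead of a stall.
    """
    future = executor.submit(long_running_job, n)
    frames = 0
    while True:
        frames += 1
        time.sleep(frame_seconds)  # stand-in for the rest of the frame's work
        if future.done():          # non-blocking status check
            # The job has already finished, so result() returns immediately.
            return future.result(), frames
```

The trade-off is latency: the result arrives some frames later, so the calling code has to tolerate not having a path yet.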
It’s very hard to guarantee that this will never happen, which is why we don’t recommend having such long-running jobs today, unfortunately. It’s on the todo to enable that use case, but in the meantime, manual time-slicing is the way to go
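As a rough sketch of what manual time-slicing can look like: the search keeps its state between calls and expands only a bounded number of nodes per slice, so each slice could be scheduled as its own short job. This is plain Python on a toy grid with breadth-first search, not Unity or navmesh code; `SlicedGridSearch`, `run_slice` and `find_path_sliced` are hypothetical names.

```python
from collections import deque


class SlicedGridSearch:
    """Breadth-first grid search that can pause and resume between slices."""

    def __init__(self, grid, start, goal):
        self.grid = grid                  # list of strings, '#' marks a wall
        self.goal = goal
        self.frontier = deque([start])    # persisted between slices
        self.came_from = {start: None}    # persisted between slices
        self.done = False
        self.path = None                  # stays None if no path exists

    def run_slice(self, max_nodes):
        """Expand at most max_nodes nodes; return True once the search is over."""
        expanded = 0
        while self.frontier and expanded < max_nodes:
            node = self.frontier.popleft()
            expanded += 1
            if node == self.goal:
                self.path = self._rebuild(node)
                self.done = True
                return True
            r, c = node
            for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                nr, nc = nxt
                in_bounds = 0 <= nr < len(self.grid) and 0 <= nc < len(self.grid[0])
                if in_bounds and self.grid[nr][nc] != '#' and nxt not in self.came_from:
                    self.came_from[nxt] = node
                    self.frontier.append(nxt)
        if not self.frontier:
            self.done = True              # frontier exhausted: no path exists
        return self.done

    def _rebuild(self, node):
        """Walk the came_from links back to the start and reverse them."""
        path = []
        while node is not None:
            path.append(node)
            node = self.came_from[node]
        return path[::-1]


def find_path_sliced(grid, start, goal, nodes_per_slice):
    """Drive the search one slice per 'frame'; returns (path, slices_used)."""
    search = SlicedGridSearch(grid, start, goal)
    slices = 0
    while not search.done:
        search.run_slice(nodes_per_slice)
        slices += 1
    return search.path, slices
```

The budget per slice is the tuning knob: a smaller budget means shorter jobs but more scheduling overhead and more frames until the path is ready.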
Which Unity version are you using?
When the main thread waits, it will steal jobs; however, in 2022.2 the logic was changed to make it a bit kinder when there might be long-running jobs.
If the main thread is told to wait on a parallel job, and there is a portion of that parallel work available, then it will run that work. So long running parallel jobs could cause a stall on the main thread -- it's best to break up long running jobs in general to be pieces of work that can resume across a series of jobs.
If the job being waited on isn't a parallel job, then the main thread will look for a job to do while waiting, unless there are fewer jobs than worker threads in the system, in which case it will do nothing and let the worker threads do the work instead.
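The two rules described above for 2022.2 can be restated as a small decision function. This is only a model of the description in this thread, written in plain Python; it is not Unity's implementation, and `WaitAction` and `main_thread_wait_action` are hypothetical names. The case of a parallel job with no portion left to pick up is not covered by the description, so treating it as idle is an assumption.

```python
from enum import Enum


class WaitAction(Enum):
    RUN_PARALLEL_PORTION = "run a portion of the awaited parallel job"
    STEAL_OTHER_JOB = "run some other queued job while waiting"
    IDLE = "do nothing and let the worker threads do the work"


def main_thread_wait_action(awaited_is_parallel, parallel_work_available,
                            queued_jobs, worker_threads):
    """Model of what the main thread does while waiting on a job."""
    if awaited_is_parallel:
        if parallel_work_available:
            # Rule 1: help with the parallel job being waited on.
            return WaitAction.RUN_PARALLEL_PORTION
        # Assumption: not described in the thread, modelled here as idle.
        return WaitAction.IDLE
    # Rule 2: non-parallel job. Steal unless the workers can cover the queue.
    if queued_jobs < worker_threads:
        return WaitAction.IDLE
    return WaitAction.STEAL_OTHER_JOB
```

Read this way, a long-running job only ends up on the main thread if it is either a portion of the parallel job being waited on, or it is picked up while at least as many jobs are queued as there are worker threads.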
If long running jobs cannot be adjusted for your game, the hard switch is to force the main thread to never steal via the bootconfig/commandline switch of no-main-thread-job-stealing=true/--no-main-thread-job-stealing=true
The main thread will never help in such cases which could slightly hurt perf in cases like waiting on a parallel job that the main thread could have helped with. But if you find you're more often than not waiting on long running work scheduled on the main thread then this might help.
We intend to offer better granularity for job affinity but I have no timelines to provide at this moment. Hope this helps
@true parcel you asked a while ago about never using the main thread for jobs. Apparently there's a command line argument to turn it off ^
but wait a minute, where do I insert these arguments?🤔 Editor? will it go to build too?
I can't find anything in docs about bootconfig/commandline 🤷♀️
(this deserves more than cmd arg, maybe an option in project settings?)
wonderful, thank you for all the explanations and help 🙏
@nova halo can you give an example of how to use the command line? in a build and editor.
@ashen arch such a long time for pathfinding is possibly a bug, even with a 130K-triangle navmesh. But until this is clarified we'd like to move on with gameplay and put timeouts in the AI (aka hacks that end up shipping)
I use 2021.3.22. What is the heuristic change in 2022? Shouldn't the decision to steal jobs depend on how busy worker threads are? E.g. at 80% utilization, steal a job which, on average (mean), lasts less than the typical (modal) main thread idle time
On 2021.3 there is no heuristic or option to make the main thread not steal.
boot.config is an ini file that can be deployed to configure the engine. It's not documented since it's meant to be allowed to change as the engine changes. It should be in your project settings and be deployed with your app (might be omitted if empty). Options in the bootconfig can be passed to a Unity player or Unity editor exe via -option=value
shouldn't the decision to steal jobs depend on how busy worker threads are
In 2022.2 it is. Generally it's better to not have a heuristic at all, but it seems many folks end up having long-running jobs that stall their main thread due to bad stealing when there are idle threads. So to help avoid that, we loosely check whether the workers are busy and don't steal if the workers could do the work instead of the main thread. In that case, the main thread will prefer to yield execution completely and wait to be signaled when the job it's waiting on is done.
In the existing, empty boot.config I added the line -option=--no-main-thread-job-stealing=true, and as soon as I started the editor the boot.config file turned out empty again. Then I added -option=no-main-thread-job-stealing=true and it stayed there, but then this happens on a rather empty worker thread
Are you using 2021.3?
yes
was having trouble with 2022 and the new physics
ok so confirming that main thread is still stealing jobs
Just tried 2022 and unfortunately still getting main thread stealing jobs.
I see that boot.config no longer exists in 2022; it gets deleted
deep profiler if that helps:
Maybe not the best idea, but for the sake of discussion, an idea on the topic of time-slicing..
As a compromise between moving code around and introducing thread preemption issues, did you consider a job type that can yield its execution (as in fibers/coroutines)? Something like myJob.ScheduleYieldable(), and a yield instruction that can be called from the job.
There's a slew of difficulties with the idea, but I imagine you already solved most of them (seeing as how there are C# coroutines as well).
Of course this wouldn't do anything for calls to external APIs that take too long, so perhaps it'd be a mediocre compromise in the end.
c# coroutines are horrifyingly slow and main thread only
but i would not say that yielding is super high on the list of candidate solutions to this problem
I recently got rid of my only long running job. I never had issues described in this thread (it would have been noticeable as it took 3+ seconds to run) but I had a lot of other little annoyances I had to work around that were bothering me.
I will say it now runs like 30% slower sadly, because it's hard to manually slice work efficiently
Of course I was not suggesting to use c# coroutines. 😅 I've worked on native context switching, it is not slow.
Indeed, hence my suggestion, which could make that easier. Not saying it is "the" solution, just wanted to throw it out there. I know from experience that there are many implications and consequences.
Yes, it's something we've discussed before and it would be easy to implement with some restrictions, but there is a risk of stack overflows without some additional work. There are a handful of other improvements that we think will be more impactful that have taken priority, but I don't think we've removed the option from the ones to consider
i.e. a naive approach would be that JobYield() is just a means to steal a job that is ready for execution and run it in place. However, if you steal other jobs that yield, you could recurse until the stack blows with such an approach. But this approach is simple since you don't need to persist job state between jobs explicitly
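To make the stack concern concrete, here is a toy model of that naive approach in plain Python. `JobYield()` does not exist as an API; the model only assumes that each yielding job yields once and that a yield pops the next ready job and runs it in place, on top of the caller's stack frame. `max_stack_depth_with_naive_yield` is a hypothetical name.

```python
def max_stack_depth_with_naive_yield(ready_jobs):
    """Return the deepest job nesting reached on one thread.

    ready_jobs: list of booleans in queue order, True if that job
    yields once while it runs, False if it runs straight through.
    """
    queue = list(ready_jobs)
    max_depth = 0

    def run(job_yields, depth):
        nonlocal max_depth
        max_depth = max(max_depth, depth)
        if job_yields and queue:
            # Naive yield: steal the next ready job and run it in place.
            # The yielding job's frame stays on the stack until that job,
            # and anything it yields to in turn, has returned.
            run(queue.pop(0), depth + 1)

    while queue:
        run(queue.pop(0), 1)
    return max_depth
```

In this model, a queue of N jobs that all yield nests N deep, while jobs that never yield stay at depth 1, which is why the approach needs a depth limit or explicit per-job state to be safe.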
the naive approach doesn't solve the issue at hand though, where the main thread is occupied by a job that runs too long; for the idea to work I don't think you can avoid having a stack / state with the yieldable job, because it must be able to migrate away from the main thread
thinking about it like that, I suppose an easier "solution" is to be able to mark a job as slow and prevent it from being grabbed by the main thread in the first place
this ^ is the more likely idea we have at the moment. haven't started on it yet though
I would love to see "background" threads paired with each job thread, so you could create long-running jobs that absorb all the idle time from job threads (which would otherwise take a lot of effort to get even 50% utilisation) and just use the operating system to context switch between them (e.g. background threads at lower priority)
I think they're trying to avoid that solution at all costs, because that will always introduce extra thread context switches at some point. And while that may be acceptable on PC in many cases, I think it will not be very likely to behave nicely on many consoles where the performance constraints are tighter.
I don't think a context switch by itself for <100 jobs on a single thread per frame is going to be that much (they could even spinlock a short time to make sure the background thread doesn't get switched to until there's idle time, so more like 50 "idle zones"), a problem could be cache misses but that would only affect the background thread