#Performance issue from idle job worker threads - 75% CPU usage from worker job overhead


grim haven
#

I'm working on Final Factory, and we're running into a weird performance overhead with the DOTS job worker threads. We're seeing that each thread is consuming significant CPU just for existing--the more job worker threads we have, the more CPU the system consumes. The CPU is a Ryzen 5800x3D (8C/16T) and we are using IL2CPP (but the same behavior occurs with Mono). The game is rendering windowed at 1080p, but it happens in all rendering modes.

Test Setup
We started the game with a nearly empty scene (so not many jobs or systems are doing much activity) with vsync @ 60hz and getting 60 fps.

Initial State:
CPU usage is around 15%. As expected, there are 15 job worker threads (16 logical cores - 1). Looking into the CPU usage with Visual Studio's profiler, I see heavy usage in the job worker threads (see screenshot #1). 81% of the total CPU time was from the job worker threads, with only 6% of that actually running our jobs. The overhead is from RtlQueryPerformanceCounter and lane_try_steal.

Optimization:
I made a one-line change and set the number of job worker threads to 1 with this command:
JobsUtility.JobWorkerCount = 1;

Now, the game still runs at 60 fps, but CPU usage drops to 2.8%. We see a dramatic drop in the job overhead, although it's still present (see screenshot #2)
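For reference, a minimal sketch of how that one-line change might be applied at startup and undone later. The class name `JobWorkerStartupCap` and the constant `LowLoadWorkerCount` are hypothetical; `JobsUtility.JobWorkerCount`, `JobsUtility.JobWorkerMaximumCount`, and `JobsUtility.ResetJobWorkerCount()` are the Unity APIs assumed here.

```csharp
using Unity.Jobs.LowLevel.Unsafe;
using UnityEngine;

// Hypothetical helper: caps the number of job workers before the first scene loads.
public static class JobWorkerStartupCap
{
    // Assumed value for a near-empty scene; choose your own from profiling.
    const int LowLoadWorkerCount = 1;

    [RuntimeInitializeOnLoadMethod(RuntimeInitializeLoadType.BeforeSceneLoad)]
    static void ApplyCap()
    {
        // JobWorkerMaximumCount is how many workers the engine created at startup;
        // JobWorkerCount must not exceed it.
        int max = JobsUtility.JobWorkerMaximumCount;
        JobsUtility.JobWorkerCount = Mathf.Min(LowLoadWorkerCount, max);
        Debug.Log($"Job workers in use: {JobsUtility.JobWorkerCount} of {max}");
    }

    // Call this when the workload grows and all workers should be used again.
    public static void Restore()
    {
        JobsUtility.ResetJobWorkerCount();
    }
}
```

This only changes how many of the existing workers are used, so it can be raised again at runtime without restarting the game.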

We are not calling JobHandle.ScheduleBatchedJobs(). It looks like there's just pure overhead in the worker job threads, even though they are mostly idle. On systems with weaker CPUs, the overhead is even greater (on my laptop with a Ryzen 3750H, it's consuming ~50% CPU) and it's making the fans really run hard due to the excess CPU usage.

Any idea why this is happening or how we can eliminate this overhead?

summer marsh
#

I'm no expert in this, but I believe this is what happens when you use a lot of ScheduleParallel with really tiny jobs

#

They discuss this here a bit
https://learn.unity.com/tutorial/part-3-2-managing-the-data-transformation-pipeline#639b0909edbc2a778f40dbcf

ScheduleParallel() overhead
You can reduce the cost of parallel scheduling by tuning the number of worker threads used by your application. Do this by setting JobsUtility.JobWorkerCount so that your application uses enough worker threads to perform the work it requires without introducing CPU bottlenecks, but not so many worker threads that they spend a lot of time idle. In other words, in the Timeline view of the Unity Profiler’s CPU Module, make sure your worker threads are spending as much time as possible running jobs.

#

However your case seems extreme so there may be more going on here

grim haven
#

Thanks tertle. That actually makes sense. There are probably 50 jobs scheduled per frame, so 3000 jobs/second. With an empty base, they're each not doing much of anything.

We'll take a look at whether we can reschedule some of them as non-parallel, which ones we could join together, and which ones don't need to run every frame.

It is exceedingly hard to follow the recommendation to have all the threads spending time running jobs, as the workload is extremely variable. When the game is new, the factory is empty and there's little to do. Even 1 thread is overkill for such a base.

But later on, when a player has hundreds of thousands of entities in play, 16 threads is very reasonable. We could manually control the number of threads in our game, but it's going to be a balancing act to make sure we're not under- or over-allocating threads.

summer marsh
#

50 isn't really that high

#

we schedule 175-200/frame at work

#

and this is on 0.51 in 2020 so much less efficient scheduling

vale wyvern
#

There's actually about 100 jobs that are scheduled parallel, and another 120 that are single threaded.

ruby nest
#

Which version of Unity are you using? In earlier versions of 2022.2 desktop platforms would spin for a bit looking for work, but that has been reduced in an upcoming release of 2022.2 (2022.2.11f1)

#

That call to the QueryPerformanceCounter in your screenshot for example will not exist in that later release

grim haven
#

All profiles were taken from 2022.2.9f1, but we observed the same behavior in 2022.2.0b16 as well.

#

Happy to test with 11f1, but I am using the latest release available in unity hub.

summer marsh
#

2022.2.11 is out. out of interest, did these improvements make it in? i didn't really see anything related in the changelog

ruby nest
#

The changes are in, however there wasn't an explicit bug for reducing spinning so I hadn't called it out specifically.

If you notice issues with the job system let me know

grim haven
#

@summer marsh @ruby nest Sorry, I was out on vacation and didn't manage to compile until now. I was waiting eagerly for 11f1.

I can report success. 2022.2.11f1 drops CPU usage from 15% to 10% overall (including main thread) at 60 FPS and the method in question is gone. There is still some significant overhead in the worker threads, but it's MUCH better.

lane_try_steal is now 26% CPU, actually running our jobs is 18%, and WaitOnAddress is 10%.

So overall, yes, there's still a lot of overhead (36% overhead vs 18% actually running jobs), but it's much better.

I'm going to try to change some of our job scheduling to single instead of parallel, and look at doing some of the simplest ones on the mainthread itself.
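A rough sketch of what that could look like with the core job system's `IJobFor`, choosing between main-thread, single-worker, and parallel execution by workload size. The job `ScaleJob`, the class `AdaptiveScheduling`, and both thresholds are hypothetical and would need tuning against a profiler; the project's real jobs (for example Entities jobs) would use their own scheduling calls.

```csharp
using Unity.Collections;
using Unity.Jobs;

// Hypothetical job used only for illustration.
public struct ScaleJob : IJobFor
{
    public NativeArray<float> Values;
    public float Factor;

    public void Execute(int index)
    {
        Values[index] *= Factor;
    }
}

public static class AdaptiveScheduling
{
    // Assumed thresholds; measure before relying on them.
    const int MainThreadThreshold = 64;
    const int ParallelThreshold = 4096;

    public static JobHandle Schedule(ScaleJob job, int count, JobHandle dependency)
    {
        if (count <= MainThreadThreshold)
        {
            // Tiny workload: run immediately on the calling thread, no worker wake-up.
            dependency.Complete();
            job.Run(count);
            return default;
        }

        if (count < ParallelThreshold)
        {
            // Medium workload: one job on one worker thread.
            return job.Schedule(count, dependency);
        }

        // Large workload: split into batches of 64 items across workers.
        return job.ScheduleParallel(count, 64, dependency);
    }
}
```

Whether this beats simply lowering the worker count is something only profiling can answer.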

Here's the breakdown of lane_try_steal:

summer marsh
#

nice to see some numbers, i didn't really have a viable project to properly test this on atm

#

that is a significant improvement

grim haven
#

Very happy to see the improvement 🙂 Thank you to both of you for the help!

There is still a lot of overhead, and we'll see where we can optimize by making parallel jobs non-parallel. If anyone's interested in working on trying to optimize the lane_try_steal, I would be happy to help out, provide example binaries, etc.

We're also seeing lots of overhead from EntityCommandBuffers, specifically array resizing when they're being destroyed and allocation of them. That's not new in 11f1 or anything, but these are a couple of high overhead points that I see in DOTS and the job system.

summer marsh
#

For the ecb is this in editor or builds

#

They're about 5x faster to dispose in builds

ruby nest
#

> There is still a lot of overhead, and we'll see where we can optimize by making parallel jobs non-parallel. If anyone's interested in working on trying to optimize the lane_try_steal, I would be happy to help out, provide example binaries, etc

If you're concerned about additional power consumption due to job workers scanning for work, but you don't have that much work in the job system, I'd recommend you use fewer worker threads. This will cut down on the WaitOnAddress calls (these won't result in wake-ups, since workers should be awake more often), and job workers will have fewer neighbours to steal from, which will reduce the try_steal overhead. That would be a better choice than making your parallel jobs single-threaded.

I am more than happy to look at use cases to help with job tuning.

#

You can tune the worker count via the boot.cfg setting `job-worker-count=x` or the command-line switch `--job-worker-count=x` (these are engine-wide switches that put a hard cap on the number of job workers created), or you can set `JobsUtility.JobWorkerCount = x` to dynamically reduce how many workers are used (all workers will still be created, but we won't bother waking up more than x).

grim haven
# ruby nest You can tune the worker count via the boot.cfg `job-worker-count=x`/ commandline...

Ah! So lane_try_steal is trying to get jobs from other worker threads that are still busy? I didn't realize that. So yes, this profile is a bunch of idle threads, for the most part.

In this case, there is very little work to do because I've been profiling the simple case (new game). Once the game is played for a while, there will be a lot of work for the jobs to do.

The key would be, given the variable workload and the wide variety of CPUs, how would we figure out how many threads we need?

I can confirm that setting the JobWorkerCount to 1 reduces the CPU overhead (and power consumption).

ruby nest
#

> lane_try_steal is trying to get jobs from other worker threads

Indeed. We use a work-stealing design, so all work on the main thread is submitted to a single queue and then we wake workers (the WaitOnAddress calls you see) to run work in their own queue, and if that is empty they steal work. They have a hint of where to find work, but if they can't find anything they will look at other workers. If they come up empty-handed they go to sleep. If they find work they run it, and repeat the process. If occasional parallel jobs come in, we can wake up multiple workers, but when the workload is a handful of small jobs, most of them fail to find work and spend more time scanning than running jobs. As soon as you have more jobs in the system this time is reduced (since there is more work to find).
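To make the scanning cost concrete, here is a deliberately simplified, single-threaded simulation of the loop described above. It is not Unity's implementation: there are no hints, no real threads, and no WaitOnAddress. All names (`WorkStealingSketch`, `SimulateFrame`) are hypothetical. It only counts how many queue probes come up empty for a given number of workers and jobs.

```csharp
using System;
using System.Collections.Generic;

public static class WorkStealingSketch
{
    // Simulates one burst of work and returns how many jobs ran and how many
    // queue probes found nothing.
    public static (int jobsRun, int failedProbes) SimulateFrame(int workerCount, int jobCount)
    {
        // Queue 0 stands in for the main thread's submission queue.
        // Queues 1..workerCount belong to the workers and start empty.
        var queues = new Queue<int>[workerCount + 1];
        for (int i = 0; i < queues.Length; i++) queues[i] = new Queue<int>();
        for (int j = 0; j < jobCount; j++) queues[0].Enqueue(j);

        // Assume every worker is woken when the work is submitted.
        var awake = new bool[workerCount + 1];
        for (int w = 1; w <= workerCount; w++) awake[w] = true;

        int jobsRun = 0, failedProbes = 0, awakeCount = workerCount;
        while (awakeCount > 0)
        {
            for (int w = 1; w <= workerCount; w++)
            {
                if (!awake[w]) continue;

                bool found = false;
                // Probe the worker's own queue first, then every other queue in turn.
                for (int step = 0; step < queues.Length && !found; step++)
                {
                    int victim = (w + step) % queues.Length;
                    if (queues[victim].Count > 0)
                    {
                        queues[victim].Dequeue(); // "run" the job
                        jobsRun++;
                        found = true;
                    }
                    else
                    {
                        failedProbes++;
                    }
                }

                // A full scan that finds nothing puts the worker to sleep.
                if (!found)
                {
                    awake[w] = false;
                    awakeCount--;
                }
            }
        }
        return (jobsRun, failedProbes);
    }
}
```

Calling `SimulateFrame(15, 4)` and `SimulateFrame(1, 4)` runs the same four jobs in both cases, but the 15-worker case records far more empty probes, which is the effect being described for a near-empty scene.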

If you anticipate the common case to be that there are many jobs, you may not need to worry about tuning how many workers there are, as it sounds like you anticipate using them. Especially if you intend play sessions to be very long, where players build up big complex worlds, having few jobs may be a rare scenario for users. That isn't to say we should sweep unnecessary job overhead under the carpet; we'll take a look to ensure stealing costs are reasonable for low and high job volume. But if you want to tune worker usage, you could determine a heuristic for complexity in your game and adjust worker counts as you see fit, e.g. a gameplay setting for a power-saving mode, or measuring sim time and adjusting workers if you fall behind some target. Heuristics are tricky though, so if you can avoid them that is preferred, but they could help with power consumption if that is a priority.
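One way the "measure sim time and adjust workers" idea could be sketched is below. Everything here is an assumption for illustration: the class name `WorkerCountGovernor`, the smoothing factor, and the rule of adding a worker above the target and removing one below half the target. The class has no Unity dependency; the caller supplies the measured simulation time.

```csharp
using System;

// Hypothetical heuristic: nudges the worker count up when the simulation falls
// behind a target time, and down when there is plenty of headroom.
public sealed class WorkerCountGovernor
{
    readonly int _min;
    readonly int _max;
    readonly double _targetMs;
    double _smoothedMs;

    public int Current { get; private set; }

    public WorkerCountGovernor(int min, int max, double targetMs)
    {
        if (min < 0 || max < min) throw new ArgumentOutOfRangeException(nameof(max));
        _min = min;
        _max = max;
        _targetMs = targetMs;
        Current = min; // start conservative, grow on demand
    }

    // Feed one frame's simulation time in milliseconds; returns the suggested count.
    public int Update(double simMs)
    {
        // Exponential moving average so a single slow frame does not dominate.
        _smoothedMs = _smoothedMs == 0 ? simMs : 0.9 * _smoothedMs + 0.1 * simMs;

        if (_smoothedMs > _targetMs && Current < _max)
            Current++;
        else if (_smoothedMs < 0.5 * _targetMs && Current > _min)
            Current--;

        return Current;
    }
}
```

In a Unity project this might be wired up once per frame as `JobsUtility.JobWorkerCount = governor.Update(simMs);`, with `max` taken from `JobsUtility.JobWorkerMaximumCount`. The gap between the two thresholds is there to avoid flip-flopping between counts when the load sits near the target.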

grim haven
#

Yeah, we'll consider this. We definitely could define some heuristics to set the thread count. We're focused on launch, so we might do something very basic for now and then scale it up.

Thanks for the clarity on how this works. That helps us better tune. Half the CPU right now is going to overhead, so especially for slower laptops, etc... it'd be nice to not spin up the fans and drain the battery if it isn't needed.

We definitely are anticipating complex worlds where all the job threads are needed, so we'll have to strike a careful balance on how many threads we run.