#Job Timing without Profiler
I'm getting Burst.Compiler Unable to load plugin
DLL built with Visual Studio and __declspec(dllexport)
I manually placed the DLL in the project root dir
must be in Assets/Plugins
Weird, it took like 5 editor restarts and 10 script changes to stop giving that error after I moved it into Assets/Plugins
I'm measuring 44, sometimes 62 cycles per QPC call (12ns according to QPC)
Calling 1000x and dividing of course
@lofty solstice one method is to prewarm the caller inside the job, i.e. for 1 to 100 GetTicks(dummy), then GetTicks(start), do stuff, GetTicks(end). Also ignore the first 100 or so job calls in the measurement
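A rough sketch of the prewarm idea, written in plain C++ rather than as a Burst job. GetTicks here is a hypothetical stand-in that reads std::chrono::steady_clock instead of calling a native plugin, and MeasureWithPrewarm is a made-up helper name; the warmup count of 100 is the figure suggested above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the GetTicks call mentioned above. It reads
// std::chrono::steady_clock (in nanoseconds) instead of a native plugin.
inline std::int64_t GetTicks() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}

// Calls the timer a number of times first so its code path is warm,
// then times a single run of `work`. Returns the elapsed ticks.
template <typename Work>
std::int64_t MeasureWithPrewarm(Work&& work, int warmupCalls = 100) {
    volatile std::int64_t dummy = 0;
    for (int i = 0; i < warmupCalls; ++i) {
        dummy = GetTicks();  // result is discarded on purpose
    }
    (void)dummy;

    const std::int64_t start = GetTicks();
    work();
    const std::int64_t end = GetTicks();
    return end - start;
}
```

Something like MeasureWithPrewarm([&] { DoStuff(); }) would be the call site, with DoStuff being whatever you want to time. Dropping the first 100 or so measurements is left to the caller.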
Interesting, this helped me call native code wherever I want
This would allow me to accurately measure any particular bursted function call without the profiler, I suppose
however my goal would be to measure all calls of a job with minimal overhead, do you have any approach for this?
Measuring every chunk execution is not optimal and summing the values via atomics isn't either I suspect
would be nice if we knew exactly when a job first starts executing and finally ends on any particular thread
Why do you want to measure it without the profiler?
well you could use the index to measure each thread
use a resultsTable[indices]
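A minimal sketch of the resultsTable[indices] idea, using plain std::thread workers instead of Unity jobs. WorkerTiming, NowNs and RunAndTime are hypothetical names, and the 64-byte padding assumes a typical cache line size. Each worker only writes to its own slot, so no atomics are needed, and the slots are summed after all workers have finished.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// One slot per worker, padded to an assumed 64-byte cache line so that
// workers writing to neighbouring slots do not false-share.
struct alignas(64) WorkerTiming {
    std::int64_t totalNs = 0;
    std::int64_t calls = 0;
};

inline std::int64_t NowNs() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}

// Runs `callsPerWorker` dummy work items on each of `workerCount` threads.
// Every thread accumulates into resultsTable[index] only.
inline std::vector<WorkerTiming> RunAndTime(int workerCount, int callsPerWorker) {
    std::vector<WorkerTiming> resultsTable(workerCount);
    std::vector<std::thread> workers;
    for (int index = 0; index < workerCount; ++index) {
        workers.emplace_back([&resultsTable, index, callsPerWorker] {
            WorkerTiming& slot = resultsTable[index];
            for (int call = 0; call < callsPerWorker; ++call) {
                const std::int64_t start = NowNs();
                volatile int sink = 0;  // placeholder for the real work
                for (int i = 0; i < 100; ++i) sink = sink + i;
                slot.totalNs += NowNs() - start;
                ++slot.calls;
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return resultsTable;
}
```

In a Unity job the index would presumably come from the worker thread index rather than a loop variable, but that mapping is an assumption and not something shown here.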
anyway im just testing a few things on the overhead
Because it's simply convenient to have a rough breakdown of overall performance without needing the profiler (just like people use fps counters)
Both for me as the dev and for power users who want to better understand the performance they are getting
E.g. Factorio has a breakdown per type of simulation system and people use it to optimize their megabases for simulation speed
I remember looking into this once, but I can't for the life of me remember what I found. I just remember that each job ended up being shorter than the minimum measurable time, so it didn't manage to get an accurate measurement when measuring within the job
Sadly, that's possible with QueryPerformanceCounter, which I believe is the usual high-resolution timer on Windows
afaik __rdtsc() is a raw cpu cycle counter, which I think means if your cpu boosts its clock the value now represents a shorter time
(though your code presumably could also run faster then, so cycle counts may actually be more useful when microoptimizing)
doesn't look like rdtsc is in burst intrinsics at least
https://gitlab.com/tertle/com.bovinelabs.core/-/blob/master/BovineLabs.Core/Jobs/IJobChunkWorkerBeginEnd.cs?ref_type=heads#L22
You'd need to switch to a custom job like this
Or fork entities and add the timing to the base ijobchunk processor
Which probably makes more sense, it'd work then with the codegen ijobentity as well
It's literally a CPU instruction even, I called it earlier, wrapped in a native plugin
tbh, does sound like a niche use that would make sense to modify entities directly for
modern cpus have invariant rdtsc so it doesn't change when the cpu clock speed changes https://randomascii.wordpress.com/2011/07/29/rdtsc-in-the-age-of-sandybridge/
if you're using rdtsc you'll also want either something like lfence; rdtsc or rdtscp to ensure the cpu doesn't reorder the counter instruction
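For reference, a sketch of what the lfence; rdtsc and rdtscp variants could look like as C++ compiler intrinsics (the kind of thing you would export from the native plugin). ReadTscStart and ReadTscEnd are hypothetical names. The code assumes an x86-64 CPU that supports rdtscp, and on other architectures it falls back to std::chrono purely so that it still compiles. The values are TSC ticks, not nanoseconds.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// lfence; rdtsc: the fence makes earlier instructions finish before the
// counter is read, so the measured region does not start "early".
inline std::uint64_t ReadTscStart() {
    _mm_lfence();
    return __rdtsc();
}

// rdtscp waits for earlier instructions on its own. The trailing lfence
// keeps later instructions from starting before the counter was read.
inline std::uint64_t ReadTscEnd() {
    unsigned int aux = 0;  // receives IA32_TSC_AUX, unused here
    const std::uint64_t ticks = __rdtscp(&aux);
    _mm_lfence();
    return ticks;
}
#else
#include <chrono>

// Fallback for non-x86 targets (assumption: only there so the sketch builds).
inline std::uint64_t ReadTscStart() {
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}
inline std::uint64_t ReadTscEnd() { return ReadTscStart(); }
#endif
```

Usage would be start = ReadTscStart(), the code to measure, then end = ReadTscEnd(), with end - start as the tick count. Turning ticks into time needs the TSC frequency, which this sketch does not try to determine.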
but if you don't want to dive that deep into the rabbit hole, QueryPerformanceCounter ultimately simplifies all of this nonsense for you at the cost of some overhead
Can you actually override the standard job by just putting this code into your project?
No
This is a custom job, you'd have to modify the entities package if you want all jobs to have custom code
Which to me seems like the best, maybe only, solution to this
I mean I'm fine with only getting timing for any jobs I've written
does this depend on any internal APIs I can't usually access?
Oh, it would not work with IJobEntity is one problem I gather
Otherwise this is perfect, thanks a lot for the insight
Yeah if you only want it for a few custom jobs, that you don't mind writing without code gen, then that custom job is perfect for you
Gives start and end method of the worker per thread
From what I can understand from the code, this also tells me that it's guaranteed that scheduled jobs always execute in one go per thread, and in exactly the scheduled order
Now I'm curious how dependencies are handled
Yes, once a job starts on a worker it'll end before another job is scheduled in that worker
This is important for things like temp memory as well
The dependency logic runs independently of the main thread, right?
Lock a semaphore, update any dependencies based on the job a thread may have just completed, figure out if any jobs in the queue are able to execute and sleep on the semaphore if not?
If you're aiming for absolute maximum performance then converting your IJobEntity jobs to IJobChunk probably is a good bet anyways
Currently working on a custom renderer with per-chunk culling, so I'm already using it plenty
But I think I can also use IJobEntity with change filters to skip work for a lot of my simulation stuff later
going back to the precision timing, i tried QPC and it gives oddly low values. I assume it outputs the current ticks timestamp or does it give something else?
It's a timestamp since boot or something, check QueryPerformanceFrequency(), I think it returned 10 million on my machine
LARGE_INTEGER freq, start, end; // needs <windows.h>
QueryPerformanceFrequency(&freq); // ticks per second, fixed at boot, so usually cached somewhere
QueryPerformanceCounter(&start);
// code to be measured
QueryPerformanceCounter(&end);
double duration = (double)(end.QuadPart - start.QuadPart) / freq.QuadPart; // in seconds
I always used it like this for all my timing needs in c++
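On the "oddly low values": if the frequency is 10 million as reported above, one tick is 100 ns, so short spans come out as very small tick counts. Here is a sketch of a tick-to-nanosecond conversion in integer math, which avoids the rounding you get from float. TicksToNanoseconds is a hypothetical helper, and it assumes the frequency is small enough that remainder * 1e9 fits in 64 bits (true for 10 MHz).

```cpp
#include <cassert>
#include <cstdint>

// Converts a tick delta to nanoseconds. Whole seconds and the remainder are
// handled separately so that large tick counts do not overflow 64 bits.
// Assumes ticksPerSecond is well below ~9.2e9 (e.g. the 10 MHz quoted above).
inline std::int64_t TicksToNanoseconds(std::int64_t ticks,
                                       std::int64_t ticksPerSecond) {
    const std::int64_t nsPerSecond = 1000000000LL;
    const std::int64_t wholeSeconds = ticks / ticksPerSecond;
    const std::int64_t remainder = ticks % ticksPerSecond;
    return wholeSeconds * nsPerSecond +
           remainder * nsPerSecond / ticksPerSecond;
}
```

With a 10 MHz counter, TicksToNanoseconds(end - start, freq) can only ever return multiples of 100 ns, which is why timing a single very short job tends to read as zero or one tick.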