#Job Timing without Profiler


lofty solstice
#

How to measure time for bursted jobs

#

I'm getting Burst.Compiler Unable to load plugin

#

Dll built with Visual Studio and __declspec(dllexport)
I manually placed the dll in the project root dir

hybrid nest
#

must be in Assets/Plugins

lofty solstice
#

Weird, it took like 5 editor restarts and 10 script changes to stop giving that error after I moved it into Assets/Plugins

lofty solstice
#

I'm measuring 44, sometimes 62 cycles per QPC call (12ns according to QPC)
Calling 1000x and dividing of course
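A minimal sketch of the "call 1000x and divide" measurement, for anyone who wants to reproduce it outside Unity. It assumes `std::chrono::steady_clock` as a portable stand-in for QueryPerformanceCounter, and `timer_overhead_ns` is a hypothetical helper name, not something from the thread.

```cpp
#include <chrono>
#include <cstdint>

// Estimates the average cost of one timer read, in nanoseconds, by reading
// the clock `calls` times back to back and dividing the elapsed time.
// Assumption: steady_clock stands in for QueryPerformanceCounter.
double timer_overhead_ns(int calls = 1000) {
    using clock = std::chrono::steady_clock;
    std::int64_t sink = 0;  // accumulates the reads so they are not optimized out
    const auto begin = clock::now();
    for (int i = 0; i < calls; ++i) {
        sink += clock::now().time_since_epoch().count();
    }
    const auto end = clock::now();
    volatile std::int64_t keep = sink;  // forces the compiler to keep `sink`
    (void)keep;
    const double total_ns =
        std::chrono::duration<double, std::nano>(end - begin).count();
    return total_ns / calls;
}
```

The result is only an average; individual calls can be slower, especially the first few.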

hybrid nest
#

@lofty solstice one method is to prewarm the caller inside the job, ie. for 1 to 100 GetTicks(dummy), then GetTicks(start), dostuff, GetTicks(end). also ignore the first 100 or so job calls in measurement
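The prewarm idea above could look roughly like this in plain C++. This is a sketch under assumptions: `get_ticks` stands in for the native GetTicks call (here backed by `std::chrono::steady_clock`), and `timed_run` / `average_ticks` are hypothetical helper names.

```cpp
#include <chrono>
#include <cstdint>

// Assumption: stand-in for the native GetTicks call used in the job.
inline std::int64_t get_ticks() {
    return std::chrono::steady_clock::now().time_since_epoch().count();
}

// Times `work` once, after calling the timer a few times so that the
// timer path itself is warm before the real start/end reads.
template <typename Work>
std::int64_t timed_run(Work&& work, int prewarm_calls = 100) {
    volatile std::int64_t dummy = 0;
    for (int i = 0; i < prewarm_calls; ++i) dummy = get_ticks();
    const std::int64_t start = get_ticks();
    work();
    const std::int64_t end = get_ticks();
    return end - start;
}

// Runs `work` `runs` times and averages the tick deltas,
// ignoring the first `skip` runs as warm-up.
template <typename Work>
double average_ticks(Work&& work, int runs, int skip = 100) {
    std::int64_t sum = 0;
    int counted = 0;
    for (int i = 0; i < runs; ++i) {
        const std::int64_t t = timed_run(work);
        if (i < skip) continue;  // warm-up run, not counted
        sum += t;
        ++counted;
    }
    return counted > 0 ? static_cast<double>(sum) / counted : 0.0;
}
```

If `runs` is not larger than `skip`, nothing is counted and the function returns 0.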

lofty solstice
#

Interesting, this allowed me to call native code wherever I want

#

This would allow me to accurately measure any particular bursted function call without the profiler, I suppose

#

however my goal would be to measure all calls of a job with minimal overhead, do you have any approach for this?

#

Measuring every chunk execution is not optimal and summing the values via atomics isn't either I suspect

#

would be nice if we knew exactly when a job first starts executing and finally ends on any particular thread

lofty heath
#

Why do you want to measure it without the profiler?

hybrid nest
#

well you could use the index to measure each thread

#

use a resultsTable[indices]
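A rough sketch of the per-thread results table idea, written with plain `std::thread` rather than Unity jobs. Everything named here (`ThreadTiming`, `run_timed`, the striding scheme) is an assumption for illustration; in a real job the slot index would come from the job system's thread index. Each worker writes only to its own slot, so the timing data needs no atomics, and the clock is read twice per worker instead of twice per chunk. The `alignas(64)` padding assumes a 64-byte cache line and C++17 for aligned allocation inside `std::vector`.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// One slot per worker, aligned to an assumed 64-byte cache line
// so that neighbouring workers do not share a line.
struct alignas(64) ThreadTiming {
    std::int64_t start_ns = 0;  // when this worker began its first chunk
    std::int64_t end_ns = 0;    // when this worker finished its last chunk
    int chunks = 0;             // how many chunks this worker processed
};

inline std::int64_t now_ns() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}

// Runs `chunks` work items over `workers` threads. Worker t handles chunks
// t, t + workers, t + 2 * workers, ... and records into results[t] only.
template <typename Work>
std::vector<ThreadTiming> run_timed(int workers, int chunks, Work work) {
    std::vector<ThreadTiming> results(workers);
    std::vector<std::thread> pool;
    for (int t = 0; t < workers; ++t) {
        pool.emplace_back([&, t] {
            ThreadTiming& slot = results[t];
            slot.start_ns = now_ns();
            for (int c = t; c < chunks; c += workers) {
                work(c);
                ++slot.chunks;
            }
            slot.end_ns = now_ns();
        });
    }
    for (auto& th : pool) th.join();
    return results;
}
```

After the join, the total job span is the largest `end_ns` minus the smallest `start_ns`, and per-thread busy time is `end_ns - start_ns` for each slot.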

#

anyway im just testing a few things on the overhead

lofty solstice
#

Eg. factorio has a breakdown per type of simulation system and people use it to optimize their megabases for simulation speed

lofty heath
#

I remember looking into this once, but I can't for the life of me remember what I found. I just remember that each job ended up being shorter than the minimum measurable time, so it didn't manage to get an accurate measurement when measuring within the job

lofty solstice
#

Sadly possible with QueryPerformanceCounter, which I believe is the usual high resolution timer used on windows
afaik __rdtsc() is a raw cpu cycle counter, which I think means if your cpu boosts its clock the value now represents a shorter time

#

(though your code presumably could also run faster then, so cycle counts may actually be more useful when microoptimizing)

lofty heath
#

doesn't look like rdtsc is in burst intrinsics at least

daring storm
#

Or fork entities and add the timing to the base ijobchunk processor

#

Which probably makes more sense, it'd work then with the codegen ijobentity as well

lofty solstice
lofty heath
#

tbh, does sound like a niche use that would make sense to modify entities directly for

somber gale
# lofty solstice Sadly possible with QueryPerformanceCounter, which I believe is the usual high r...

modern cpus have invariant rdtsc so it doesn't change when the cpu clock speed changes https://randomascii.wordpress.com/2011/07/29/rdtsc-in-the-age-of-sandybridge/

#

if you're using rdtsc you'll also want either something like lfence; rdtsc or rdtscp to ensure the cpu doesn't reorder the counter instruction
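For reference, a fenced TSC read might be sketched like this in C++. The function names `cycles_begin` and `cycles_end` are made up for this example, and the non-x86-64 fallback to `std::chrono::steady_clock` is only there so the snippet compiles elsewhere (it then returns clock ticks, not cycles). The intrinsics `_mm_lfence`, `__rdtsc` and `__rdtscp` come from `<x86intrin.h>` on GCC/Clang and `<intrin.h>` on MSVC.

```cpp
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
  #if defined(_MSC_VER)
    #include <intrin.h>
  #else
    #include <x86intrin.h>
  #endif
  #define HAS_RDTSC 1
#else
  #define HAS_RDTSC 0
#endif

// Start of a measured region: lfence waits for earlier instructions to
// complete before the counter is read.
inline std::uint64_t cycles_begin() {
#if HAS_RDTSC
    _mm_lfence();
    return __rdtsc();
#else
    // Fallback: steady_clock ticks instead of TSC cycles.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// End of a measured region: rdtscp waits for earlier instructions to
// complete; the trailing lfence keeps later ones from starting early.
inline std::uint64_t cycles_end() {
#if HAS_RDTSC
    unsigned int aux = 0;  // receives IA32_TSC_AUX, unused here
    const std::uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```

The difference `cycles_end() - cycles_begin()` is in TSC ticks, which on invariant-TSC CPUs run at a fixed rate rather than at the current core clock.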

#

but if you don't want to dive that deep into the rabbit hole, QueryPerformanceCounter ultimately simplifies all of this nonsense for you at the cost of some overhead

lofty solstice
daring storm
#

No

#

This is a custom job, you'd have to modify the entities package if you want all jobs to have custom code

#

Which to me seems like the best, maybe only, solution to this

lofty solstice
#

I mean I'm fine with only getting timing for any jobs I've written
does this depend on any internal APIs I can't usually access?

#

Oh, it would not work with IJobEntity is one problem I gather

#

Otherwise this is perfect, thanks a lot for the insight

daring storm
#

Yeah if you only want it for a few custom jobs, that you don't mind writing without code gen, then that custom job is perfect for you

#

Gives start and end method of the worker per thread

lofty solstice
#

From what I can understand from the code, this also tells me that it's guaranteed that scheduled jobs always execute in one go per thread, and exactly the scheduled order
Now I'm curious how dependencies are handled

daring storm
#

Yes, once a job starts on a worker it'll end before another job is scheduled in that worker

#

This is important for things like temp memory as well

lofty solstice
#

The dependency logic runs independently of the main thread, right?
Lock a semaphore, update any dependencies based on the job a thread may have just completed, figure out if any jobs in the queue are able to execute and sleep on the semaphore if not?

lofty heath
lofty solstice
#

Currently working on a custom renderer with per-chunk culling, so I'm already using it plenty

#

But I think I can also use IJobEntity with change filters to skip work for a lot of my simulation stuff later

hybrid nest
#

going back to the precision timing, i tried QPC and it gives oddly low values. I assume it outputs the current ticks timestamp or does it give something else?

lofty solstice
#

It's a timestamp since boot or something, check QueryPerformanceFrequency(), I think it returned 10 million on my machine

#
LARGE_INTEGER freq, start, end;          // needs <windows.h>
QueryPerformanceFrequency(&freq);        // ticks per second, usually cached somewhere

QueryPerformanceCounter(&start);
// code to be measured
QueryPerformanceCounter(&end);

double duration = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart; // in seconds
#

I always used it like this for all my timing needs in c++
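To make the tick-to-time conversion concrete: with the 10 million ticks per second mentioned above, one tick is 100 ns, so a very short job can show a delta of only a handful of ticks. A small integer-only conversion could look like this (`ticks_to_ns` is a hypothetical helper name; the overflow note assumes frequencies below roughly 9 GHz):

```cpp
#include <cstdint>

// Converts a tick delta to nanoseconds, given the counter frequency in Hz.
// Whole seconds and the remainder are converted separately so that
// ticks * 1e9 does not overflow for large tick counts.
inline std::int64_t ticks_to_ns(std::int64_t ticks, std::int64_t freq_hz) {
    const std::int64_t whole_seconds = ticks / freq_hz;
    const std::int64_t remainder = ticks % freq_hz;
    return whole_seconds * 1000000000LL + remainder * 1000000000LL / freq_hz;
}
```

For example, `ticks_to_ns(1, 10000000)` gives 100, matching the 100 ns per tick figure.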