#How do I truly profile compute shaders?


oak stump
#

Hello, I am trying to profile the read speed of different memory layouts and access types (global vs. thread, more cache-friendly layouts, etc.).
For that I have made two compute shaders that use a for loop to read whatever memory I am testing X times (always adding to a sum so the compiler hopefully doesn't remove it), and then I dispatch a single thread to run it using the following code:

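        var stopWatch = Stopwatch.StartNew();        // System.Diagnostics.Stopwatch, CPU wall-clock time
        testCase.shader.Dispatch(kernel, 1, 1, 1);   // queue one thread group
        resultBuf.GetData(result);                   // blocking readback: waits for the GPU to finish
        stopWatch.Stop();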

Result in this case is the sum (so a single float), which I read back to force Unity to wait until the dispatch completes.
The issue is that a lot of external factors beyond the actual shader code influence the time. For example, every consecutive test seems to take a lot less time: a test might take 50% longer just because it was moved from second to first in the frame.
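
One common way to reduce the ordering effect described above is to run a few untimed warm-up iterations and then report the median of several timed runs. A minimal, language-neutral sketch in Python, where `run_once` is a hypothetical stand-in for the `Dispatch` plus `GetData` pair and `clock` is injectable only so the sketch can be checked deterministically. It still measures end-to-end CPU wall time, not GPU execution time:

```python
import statistics
import time


def measure(run_once, warmup=3, samples=10, clock=time.perf_counter):
    """Return the median wall-clock duration of run_once, in clock units."""
    # Untimed warm-up runs absorb one-time costs (for example first-use
    # setup or cold caches) that would otherwise inflate the first test.
    for _ in range(warmup):
        run_once()

    durations = []
    for _ in range(samples):
        start = clock()
        run_once()  # hypothetical: dispatch the kernel, then block on readback
        durations.append(clock() - start)

    # The median is less sensitive to occasional spikes than the mean.
    return statistics.median(durations)


if __name__ == "__main__":
    # Stand-in CPU workload; replace with the real dispatch-and-readback call.
    print(measure(lambda: sum(range(100_000))))
```

Running every test case through the same helper, and shuffling the order of the cases between runs, makes it easier to see whether a difference comes from the shader or from its position in the frame.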

So I was wondering if there is a more standardized method or tool for profiling compute shaders.

Thanks in advance

balmy sable
#

You need to use vendor-specific profilers: NVIDIA Nsight, AMD GPU PerfStudio (since superseded by Radeon GPU Profiler)

ruby socket
#

what are you trying to write?

ruby socket
#

it's kind of pointless to test async memory streaming, for example, if your application is using the shader for game logic, such as a fluid simulation / buoyancy game

oak stump
#

It's mostly curiosity

ruby socket
#

gotcha

#

okay, and in your opinion, is async buffer copying slow?

#

do you see how this is subjective?

#

it's fast in the sense that occupied time on the GPU is low, but it's slow in the sense that you had to wait

#

you don't need to benchmark to know that it's the least expensive in terms of GPU occupancy (on modern GPUs)

ruby socket
#

100% of the latency is from what you're going to do with the buffer. GPUs have hardware just for copying buffers (the so-called copy engine)

oak stump
#

I am writing a small utility library with various data structures implemented on the GPU, and I'd like the functions themselves and the internal data to be as efficient as possible

#

So for example, if I implement a dictionary for the GPU, I'd like the reads to the internal hash table to be as fast as possible, regardless of how the dictionary is used

ruby socket
#

okay which of these do you want to measure? all of them?

  • Placing Your Order (API Call): You tell the cashier what you want. This is very quick.

  • Waiting in Line (Command Queue & Latency): The cashier puts your order slip in a queue. Your order waits for the barista to be free. This wait time is Submission Latency.

  • Making the Coffee (GPU Execution): The barista finally gets to your order and makes the coffee. The time this takes is the Execution Duration.

  • Getting Your Coffee (End-to-End Latency): The total time from when you first spoke to the cashier until the coffee is in your hand.
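
The stages above can be told apart with four timestamps. A small sketch in Python, where the `DispatchTimeline` type and its field names are hypothetical; in practice the GPU start and end times would have to come from GPU timestamp queries or a profiler, not from a CPU stopwatch:

```python
from dataclasses import dataclass


@dataclass
class DispatchTimeline:
    """Hypothetical timestamps (milliseconds) for one compute dispatch."""
    api_call: float      # CPU issues the dispatch ("placing your order")
    gpu_start: float     # GPU begins executing the kernel
    gpu_end: float       # GPU finishes the kernel
    result_ready: float  # result is readable on the CPU

    @property
    def submission_latency(self) -> float:
        # Time the work spent waiting in the queue.
        return self.gpu_start - self.api_call

    @property
    def execution_duration(self) -> float:
        # Time the GPU actually spent running the kernel.
        return self.gpu_end - self.gpu_start

    @property
    def end_to_end_latency(self) -> float:
        # Everything from the API call until the result is in hand.
        return self.result_ready - self.api_call
```

A stopwatch wrapped around a dispatch and a blocking readback observes only `end_to_end_latency`, which is why queueing effects show up in that kind of measurement.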

oak stump
#

GPU execution

#

For example, if I have a function within my compute shader, ReadDictionary(key), I'd like to measure how long that function takes to finish in the worst case

ruby socket
#

it doesn't matter

oak stump
#

what do you mean?

ruby socket
#

the way your shader is written has the biggest impact

#

on gpu execution

#

then the size of the buffers

oak stump
#

Well, but if I have a data structure that makes 10 accesses to VRAM, that is significant, right?

ruby socket
#

you would need benchmarks to compare subtle differences between implementations, but for that you can use any of the built-in Unity profiling tools; they can measure GPU execution times just fine

#

the other stuff is a little harder to measure

ruby socket
#

you don't have a specific application in mind

#

there's a reason this stuff isn't generally done at a micro level

#

the whole of the shader matters

oak stump
# ruby socket you don't have a specific application in mind

But there are still ways of doing things that are generally faster than others. For example, if I find that reading two adjacent values in the same buffer is faster, I could restructure my data to use a single buffer instead of two separate ones. I also want to build a bit of intuition
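
As a way to reason about the single-buffer idea before measuring it, the sketch below (Python, purely illustrative) counts how many cache lines one key/value lookup touches under each layout. The 128-byte line size, the 4-byte element size, and the buffer base addresses are assumptions chosen for the example, not properties of any particular GPU:

```python
LINE_SIZE = 128  # assumed cache-line size in bytes
ELEM_SIZE = 4    # assumed size of one float/uint in bytes


def lines_touched(addresses, line_size=LINE_SIZE):
    """Count distinct cache lines covered by the given byte addresses."""
    return len({addr // line_size for addr in addresses})


def interleaved(index, base=0):
    """Byte addresses of a key and value stored side by side in one buffer."""
    record = base + index * 2 * ELEM_SIZE
    return [record, record + ELEM_SIZE]


def separate(index, keys_base=0, values_base=4096):
    """Byte addresses of a key and value stored in two different buffers."""
    return [keys_base + index * ELEM_SIZE, values_base + index * ELEM_SIZE]


if __name__ == "__main__":
    print(lines_touched(interleaved(5)))  # both values share a line
    print(lines_touched(separate(5)))     # each buffer contributes a line
```

This models a single thread only, so it says nothing about how reads from neighbouring threads combine; a profiler is still needed to confirm whether the layout change helps.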

#

But I guess that profiling the GPU like that is not as easy as it would be on the CPU

#

I'll look into the tools that have been mentioned, thanks to both of you 👍