Is my rewrite "better"? | Together C & C++ | Page 1

steady scroll Aug 4, 2024, 8:14 PM

#

I'm looking at some performance critical open source code for educational purposes and wanted to make sure my intuitions are correct

//Original function
void histogramScatterAdd2D(float* histogram, int *index1, int *index2, float *src, int maxidx1, int n)
{
  int threads = 512;
  int num_blocks = n/threads;
  num_blocks = n % threads == 0 ? num_blocks : num_blocks + 1;
  kHistogramScatterAdd2D<<<num_blocks, 512>>>(histogram, index1, index2, src, maxidx1, n);
  CUDA_CHECK_RETURN(cudaPeekAtLastError());
}


//My rewrite
void histogramScatterAdd2D(float* histogram, int *index1, int *index2, float *src, int maxidx1, int n)
{
  kHistogramScatterAdd2D<<<n + 511 >> 9, 512>>>(histogram, index1, index2, src, maxidx1, n);
  CUDA_CHECK_RETURN(cudaPeekAtLastError());
}

humble warrenBOT Aug 4, 2024, 8:14 PM

#

When your question is answered use !solved to mark the question as resolved.

Remember to ask specific questions, provide necessary details, and reduce your question to its simplest form. For tips on how to ask a good question use !howto ask.

steady scroll Aug 4, 2024, 8:17 PM

#

Also I'm a monkey for some reason, nice. Oh turns out I surrendered to the call of the void and clicked the monkey button when I joined.

pseudo goblet Aug 4, 2024, 10:31 PM

#

"Performance critical... Intuition"
Rookie mistake.

#

Measure it

distant flume Aug 4, 2024, 11:15 PM

#

^

distant flume Aug 4, 2024, 11:21 PM

#

steady scroll I'm looking at some performance critical open source code for educational purpos...

It's easy in C++ to get into a really egregious micro-optimization mindset where we obsess over every little thing, even if it doesn't matter at all. Here, this won't matter in the slightest. Basic arithmetic is completely insignificant compared to the cost of launching a CUDA kernel and waiting for that computation to be done 😛

#

Also, the compiler is really good at optimizing this sort of thing

steady scroll Aug 4, 2024, 11:23 PM

#

distant flume It's easy in C++ to get into a really egregious micro-optimization mindset where...

I'm only really paying attention to it because the use of ternaries and such makes it look like they cared about the performance here. I appreciate the point though.

distant flume Aug 4, 2024, 11:24 PM

#

Ternaries don't necessarily imply some performance concern, it's just a reasonable way of writing ceiling division
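
For illustration, here is a minimal sketch of the two ceiling-division spellings from this thread, side by side. The helper names `ceil_div_ternary` and `ceil_div_add` are made up for this example and are not from the code being discussed. The sketch uses `unsigned` like the godbolt snippet further down, and the second form assumes `n + d - 1` does not wrap around.

```cpp
#include <limits>

// Hypothetical helper names, for illustration only.

// Ceiling division the way the original function spells it:
// divide, then add one if there was a remainder.
constexpr unsigned ceil_div_ternary(unsigned n, unsigned d) {
    return n % d == 0 ? n / d : n / d + 1;
}

// The "add d - 1 first" spelling. Assumes n + d - 1 does not wrap around.
constexpr unsigned ceil_div_add(unsigned n, unsigned d) {
    return (n + d - 1) / d;
}

static_assert(ceil_div_ternary(1024, 512) == 2, "exact multiple");
static_assert(ceil_div_ternary(1025, 512) == 3, "one extra element needs one extra block");
static_assert(ceil_div_add(1025, 512) == ceil_div_ternary(1025, 512), "both agree here");
static_assert(ceil_div_add(0, 512) == 0, "no work, no blocks");

// Near the top of the range the two differ, because the addition wraps.
constexpr unsigned big = std::numeric_limits<unsigned>::max();
static_assert(ceil_div_ternary(big, 512) == big / 512 + 1, "ternary form is still right");
static_assert(ceil_div_add(big, 512) == 0, "big + 511 wrapped around to 510");
```

For element counts that realistically reach a kernel launch, both forms give the same block count, so which one to use is mostly a readability call.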

steady scroll Aug 4, 2024, 11:25 PM

#

Yeah I suppose it is readable

distant flume Aug 4, 2024, 11:25 PM

#

When in doubt about optimizations like this, always look at godbolt.org

#

;asm -O3

unsigned foo(unsigned n) {
    int threads = 512;
    int num_blocks = n/threads;
    num_blocks = n % threads == 0 ? num_blocks : num_blocks + 1;
    return num_blocks;
}
unsigned bar(unsigned n) {
    return n + 511 >> 9;
}

soft ledgeBOT Aug 4, 2024, 11:25 PM

#

Assembly Output

foo(unsigned int):
  mov eax, edi
  and edi, 511
  shr eax, 9
  cmp edi, 1
  sbb eax, -1
  ret
bar(unsigned int):
  lea eax, [rdi+511]
  shr eax, 9
  ret

distant flume Aug 4, 2024, 11:26 PM

#

It's a common pitfall to eyeball x86 assembly; however, bar should have slightly better reciprocal throughput (if it matters, which it doesn't here)

#

https://uica.uops.info/ is a good tool to use to look at that, but it's down right now

steady scroll Aug 4, 2024, 11:27 PM

#

distant flume ;asm -O3 ```cpp unsigned foo(unsigned n) { int threads = 512; int num_bl...

Maybe this is also nitpicking, but other parts of the code make it clear they only support powers of 2 for threads. So I'd prefer bit shifts here to communicate that constraint? The downstream cuda kernels seem to use bit shifts for thread count stuff?

#

In isolation it looks like you could pick a different magic number with this function and it's fine, but it isn't fine; things break.

distant flume Aug 4, 2024, 11:28 PM

#

https://godbolt.org/z/7cs9e7rEr

#

Dispatch Width:    6
uOps Per Cycle:    3.31
IPC:               3.31
Block RThroughput: 1.5

Dispatch Width:    4
uOps Per Cycle:    3.85
IPC:               2.88
Block RThroughput: 1.0

distant flume Aug 4, 2024, 11:30 PM

#

steady scroll Maybe this is also nitpicking, but other parts of the code make it clear they on...

Bit shifts don't really communicate this constraint, you can bitshift non-powers of two
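
If the goal is to make a power-of-two requirement visible, one option is to state it next to the constant with a `static_assert`. This is only a sketch and not something the original code is known to do. `kThreads`, `kLog2Threads`, `is_power_of_two` and `blocks_for` are hypothetical names for this example.

```cpp
// Hypothetical names; a sketch of stating the constraint explicitly.

// True for 1, 2, 4, 8, ...: a power of two has exactly one bit set,
// so clearing the lowest set bit (x & (x - 1)) leaves zero.
constexpr bool is_power_of_two(unsigned x) {
    return x != 0 && (x & (x - 1)) == 0;
}

constexpr unsigned kLog2Threads = 9;
constexpr unsigned kThreads = 1u << kLog2Threads;  // 512

// If someone later replaces kThreads with, say, 500, this fails to
// compile instead of quietly breaking code that assumes a power of two.
static_assert(is_power_of_two(kThreads), "kThreads must be a power of two");

// Block count via shift. This equals ceiling division only because
// kThreads == 1 << kLog2Threads. Assumes n + kThreads - 1 does not wrap.
constexpr unsigned blocks_for(unsigned n) {
    return (n + kThreads - 1) >> kLog2Threads;
}

// A shift compiles fine on non-powers of two as well, which is why the
// shift alone does not document anything.
static_assert((500u >> 2) == 125u, "shifting a non-power-of-two is legal");
static_assert(!is_power_of_two(500u), "500 is not a power of two");
static_assert(blocks_for(513) == 2, "513 elements need 2 blocks");
```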

distant flume Aug 4, 2024, 11:30 PM

#

steady scroll In isolation it looks like you could pick a different magic number with this fu...

It depends on how the kernel is written, but usually the way the grids and blocks are set up here should be flexible as long as the total thread count covers the work to be done
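
To make the "covers the work" point concrete, here is a CPU-side sketch (plain C++14 or later, not CUDA) that imitates how a 1-D launch maps block and thread indices to elements. It assumes the kernel guards its body with an `i < n` check; whether `kHistogramScatterAdd2D` does so is not shown in this thread. `covered_elements` is a hypothetical name.

```cpp
// CPU-side sketch with hypothetical names; it imitates how a 1-D CUDA
// launch maps (block, thread) pairs to element indices.

// Counts how many of the n elements get visited when each simulated
// thread handles index block * threads + thread and skips indices >= n
// (the usual bounds guard in a kernel). Distinct (block, thread) pairs
// map to distinct indices, so a result of n means every element is hit once.
constexpr unsigned covered_elements(unsigned n, unsigned threads) {
    unsigned blocks = (n + threads - 1) / threads;  // ceiling division
    unsigned covered = 0;
    for (unsigned block = 0; block < blocks; ++block) {
        for (unsigned thread = 0; thread < threads; ++thread) {
            unsigned i = block * threads + thread;
            if (i < n) {
                ++covered;  // stands in for the kernel body touching element i
            }
        }
    }
    return covered;
}

static_assert(covered_elements(1000, 512) == 1000, "power-of-two block size");
static_assert(covered_elements(1000, 500) == 1000, "non-power-of-two works too");
static_assert(covered_elements(1000, 96) == 1000, "as does any other size");
```

Under that assumption, the launch arithmetic itself does not care whether the block size is a power of two. Any such constraint would come from what the kernel does internally.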

steady scroll Aug 4, 2024, 11:48 PM

#

distant flume ``` Dispatch Width: 6 uOps Per Cycle: 3.31 IPC: 3.31 Block R...

thanks for that! I'll look at that more often especially if it has CUDA stuff

humble warrenBOT Aug 4, 2024, 11:53 PM

#

@steady scroll Has your question been resolved? If so, type !solved :)

steady scroll Aug 4, 2024, 11:55 PM

#

!solved

humble warrenBOT Aug 4, 2024, 11:55 PM

#

Thank you and let us know if you have any more questions!

This thread is now set to auto-hide after an hour of inactivity

distant flume Aug 5, 2024, 12:12 AM

#

Oh

#

@steady scroll there's a bug

#

ope ignore me
