#Is my rewrite "better"?


steady scroll
#

I'm looking at some performance-critical open source code for educational purposes and wanted to make sure my intuitions are correct.

//Original function
void histogramScatterAdd2D(float* histogram, int *index1, int *index2, float *src, int maxidx1, int n)
{
  int threads = 512;
  int num_blocks = n/threads;
  num_blocks = n % threads == 0 ? num_blocks : num_blocks + 1;
  kHistogramScatterAdd2D<<<num_blocks, 512>>>(histogram, index1, index2, src, maxidx1, n);
  CUDA_CHECK_RETURN(cudaPeekAtLastError());
}


//My rewrite
void histogramScatterAdd2D(float* histogram, int *index1, int *index2, float *src, int maxidx1, int n)
{
  kHistogramScatterAdd2D<<<n + 511 >> 9, 512>>>(histogram, index1, index2, src, maxidx1, n);
  CUDA_CHECK_RETURN(cudaPeekAtLastError());
}

steady scroll
#

Also I'm a monkey for some reason, nice. Oh turns out I surrendered to the call of the void and clicked the monkey button when I joined.

pseudo goblet
#

"Performance critical... Intuition"
Rookie mistake.

#

Measure it

distant flume
#

^

distant flume
#

Also, the compiler is really good at optimizing this sort of thing

distant flume
#

Ternaries don't necessarily imply some performance concern, it's just a reasonable way of writing ceiling division
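
For reference, a minimal sketch of the ceiling-division spellings being compared in this thread. The function names are made up for illustration and are not from the code under discussion. It assumes unsigned 32-bit inputs, like the godbolt snippet further down.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division written three ways; the names here are illustrative only.

// 1. Divide, then add one when there is a remainder (the original's approach).
constexpr std::uint32_t ceil_div_ternary(std::uint32_t n, std::uint32_t d) {
    return n % d == 0 ? n / d : n / d + 1;
}

// 2. Add (d - 1) before dividing. Wraps around if n + d - 1 exceeds UINT32_MAX.
constexpr std::uint32_t ceil_div_add(std::uint32_t n, std::uint32_t d) {
    return (n + d - 1) / d;
}

// 3. Same as (2) with d fixed at 512 = 2^9, so the divide becomes a shift.
constexpr std::uint32_t ceil_div_512_shift(std::uint32_t n) {
    return (n + 511) >> 9;
}

// Compile-time spot checks: the three forms agree on ordinary inputs.
static_assert(ceil_div_ternary(1024, 512) == 2, "exact multiple");
static_assert(ceil_div_add(1025, 512) == 3, "rounds up");
static_assert(ceil_div_512_shift(1) == 1, "one partial block");
```

The forms only disagree when `n + 511` wraps past the top of the type, which the ternary form avoids because it never adds to `n` before dividing.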

steady scroll
#

Yeah I suppose it is readable

distant flume
#

When in doubt about optimizations like this, always look at godbolt.org

#

;asm -O3

unsigned foo(unsigned n) {
    int threads = 512;
    int num_blocks = n/threads;
    num_blocks = n % threads == 0 ? num_blocks : num_blocks + 1;
    return num_blocks;
}
unsigned bar(unsigned n) {
    return n + 511 >> 9;
}
soft ledgeBOT
#
Assembly Output
foo(unsigned int):
  mov eax, edi
  and edi, 511
  shr eax, 9
  cmp edi, 1
  sbb eax, -1
  ret
bar(unsigned int):
  lea eax, [rdi+511]
  shr eax, 9
  ret

distant flume
#

Eyeballing x86 assembly is a common pitfall; however, bar should have slightly better reciprocal throughput (if it matters, which it doesn't here)

steady scroll
#

In isolation it looks like you could pick a different magic number with this function and it'd be fine, but it isn't fine; things break.
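
One possible way to keep the magic numbers from drifting apart is to derive everything from a single constant. This is only a sketch: `kThreadsPerBlock` and `blocks_for` are hypothetical names, not from the project, and it does nothing about a kernel that itself assumes 512 threads.

```cpp
#include <cassert>

// Hypothetical constant; the code in this thread hardcodes 512, 511 and 9 separately.
constexpr int kThreadsPerBlock = 512;

static_assert(kThreadsPerBlock > 0, "block size must be positive");

// Number of blocks needed to cover n elements (assumes n >= 0), rounding up.
// It never adds to n before dividing, so it cannot overflow for large n.
constexpr int blocks_for(int n) {
    return n / kThreadsPerBlock + (n % kThreadsPerBlock != 0 ? 1 : 0);
}

// Compile-time spot checks around the block-size boundary.
static_assert(blocks_for(512) == 1, "exact multiple needs one block");
static_assert(blocks_for(513) == 2, "one extra element needs one extra block");
```

The launch would then read something like `kHistogramScatterAdd2D<<<blocks_for(n), kThreadsPerBlock>>>(...)`, so changing the block size is a one-line edit on the host side.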

distant flume
#
Dispatch Width:    6
uOps Per Cycle:    3.31
IPC:               3.31
Block RThroughput: 1.5

Dispatch Width:    4
uOps Per Cycle:    3.85
IPC:               2.88
Block RThroughput: 1.0
humble warrenBOT
#

@steady scroll Has your question been resolved? If so, type !solved :)

steady scroll
#

!solved

humble warrenBOT
#

Thank you and let us know if you have any more questions!

This thread is now set to auto-hide after an hour of inactivity

distant flume
#

Oh

#

@steady scroll there's a bug

#

ope ignore me