#7 mads and 1 mul vs 2 dot4s and 1 add?
There isn't going to be a universal answer to this, it will probably vary across platforms
what do you think the dots expand to?
So you believe that this is going to be very vendor and driver dependent and there is no clear answer here? I have only one GPU I can test on rn sadly.
yes
I find it hard to believe that this exact operation is the main bottleneck of your application, hardware aside
possibly even GPU architecture dependent
Not saying it's impossible but you should try and find out
there should be a :that: emoji which points down
arithmetic is very rarely the biggest bottleneck in anything, since it's absurdly fast
make it :frog_whip:
There's a fun analogy about memory vs. arithmetic latency that goes something like, doing arithmetic on two values from main memory and writing them back is comparable to someone mailing you a letter with two numbers, you adding them in your head in a few seconds, and mailing the response back to them
Okay, I have no idea why people are suddenly assuming I have a bottleneck and where I supposedly have it. I wanted to know if there is an answer to the performance difference, and there is none. Please do not give me unrelated, unsolicited advice. It is needlessly hurtful and pressuring for me personally.
why are you asking this question if you don’t have a bottleneck?
Ok well the answer to the question is that there isn't really a clear one, so if there isn't any empirical evidence for us to reason about I don't think there's much more that can be said
So that people do not provide unsolicited advice.
It's not unsolicited, it's relevant to the question
Thank you very much. That is what I wanted to know. I will profile it as well on my gpu.
performance doesn't exist in a vacuum, so it's natural for a question about performance to pull in broader context
It is unsolicited to push me to fix a nonexistent bottleneck.
I'm not pushing you to fix a nonexistent bottleneck, I'm explaining part of why the question is more complex than it seems
Sure, just please do not push me without at least questioning me first... it is hurtful to me personally.
the correct advice for this kind of question is literally to profile both and see which is faster on your target(s)
I'm sorry if I hurt your feelings but we asked the sequence of questions you're going to get in response to literally any question about code optimization so just be prepared to see it again
I find it hard to believe that this exact operation is the main bottleneck of your application, hardware aside
There is a lot of assumption here that I do not want to encounter anymore, please.
"Is this really a bottleneck" is the first question anyone is going to ask if you solicit advice about microoptimizations, for better or worse
Okay, then next time I will mention, that I am asking a purely theoretical question. And that I would only want an answer to that specific question.
anyways, pretty sure both will perform identically since the two options you present are the same thing
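A minimal CPU-side sketch of why the two options can come out as nearly the same arithmetic. It assumes each dot4 expands to one mul plus three fused multiply-adds; that expansion is an assumption here, and the real one depends on vendor, driver, and architecture. The function names are hypothetical, and this is plain C++ rather than shader code:

```cpp
#include <cmath>

// Option A: 1 mul followed by 7 fused multiply-adds (8 operations).
float mad_chain(const float a[8], const float b[8]) {
    float acc = a[0] * b[0];
    for (int i = 1; i < 8; ++i) {
        acc = std::fma(a[i], b[i], acc);  // acc = a[i] * b[i] + acc
    }
    return acc;
}

// Assumed expansion of one dot4: 1 mul + 3 fused multiply-adds.
float dot4(const float* a, const float* b) {
    float acc = a[0] * b[0];
    for (int i = 1; i < 4; ++i) {
        acc = std::fma(a[i], b[i], acc);
    }
    return acc;
}

// Option B: 2 dot4s and 1 add (2 mul + 6 fma + 1 add = 9 operations
// under the assumed expansion).
float two_dots(const float a[8], const float b[8]) {
    return dot4(a, b) + dot4(a + 4, b + 4);
}
```

Both functions compute the same 8-term sum of products. Under these assumptions they differ by a single instruction, and their results can differ slightly in rounding for general inputs because the additions are grouped differently.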
here's the diff for rdna2 (right = two dots, left = fma spam)
Interesting, so if I replace the mad version with a raw (*)+(*)+(*)+(*)+(*)+(*)+(*) version it will lose the mov and be one operation less than the dot version.. looks like the dot version would run ~1.125x slower in that specific disassembly.
I doubt that
In that specific disassembly, I think most of the latency would be centered around the s_waitcnt instructions
Which both sides have an equal number of?
yes
therefore I'd guess they perform the same
(though it does not depend on the number of waitcnt instructions alone, it also depends on how the memory is accessed by buffer_load_dwordx4)
So you are implying that in, as an example, a loop of 1000, the difference will be negligible because one more mul is cheap in comparison to synchronization opcodes, anyways?
cheap in comparison to the memory load it synchronizes yeah
Hmm well, I finally woke up so now I can profile my code.
I will test 1024x1024 with random numbers in loops of 1000.
as in you load, apply arithmetic on, then store 1000 different numbers in each invocation?
No. I will in a loop of 1000:
generate a b c d e f g h through a naive sine hash and then perform the divergent computations; repeat.
And that for 2^20 threads.
a hash of what?
I will not be loading a resource 1000 times in a loop, I know that is going to obfuscate the performance difference too much because that is slowwww.
also, how will you store the results or otherwise make sure the compiler doesn't just optimize the calculation out?
Thread id uint plus loop index uint to float hashed to another float.
Disassembly.
if you don't store any results the compiler will optimize it out
If the code I care about differs only in the divergent part.
I can tell you that without the disasm
I will fma into an aggregate, aggregate += (0.01 * result), to prevent any memory latency obfuscation inside the loop itself, and then output the aggregate for each uint index into a separate output buffer index to prevent serialization.
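The benchmark structure described above could be sketched roughly like this, as a CPU-side C++ stand-in for the shader. The hash constants, the `salt` parameter used to get eight distinct inputs, and the names `sine_hash` and `run_benchmark` are all assumptions for illustration; the thread does not give them:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Assumed form of the "naive sine hash": fract(sin(x) * large constant).
// Input is thread id plus loop index, converted to float; salt is hypothetical.
float sine_hash(std::uint32_t thread_id, std::uint32_t loop_index, std::uint32_t salt) {
    float x = static_cast<float>(thread_id + loop_index) + 0.1f * static_cast<float>(salt);
    float s = std::sin(x) * 43758.5453f;
    return s - std::floor(s);  // fractional part
}

// Runs `iterations` loop steps per simulated thread. `variant` is the
// computation under test (for example the mad chain or the two-dot version).
template <typename Variant>
std::vector<float> run_benchmark(std::uint32_t thread_count,
                                 std::uint32_t iterations,
                                 Variant variant) {
    std::vector<float> output(thread_count);
    for (std::uint32_t tid = 0; tid < thread_count; ++tid) {
        float aggregate = 0.0f;
        for (std::uint32_t i = 0; i < iterations; ++i) {
            float v[8];  // a b c d e f g h
            for (std::uint32_t k = 0; k < 8; ++k) {
                v[k] = sine_hash(tid, i, k);
            }
            // aggregate += 0.01 * result, as one fused multiply-add
            aggregate = std::fma(0.01f, variant(v), aggregate);
        }
        // One write per thread, each to its own index, so the result is
        // observable and the loop cannot be optimized away.
        output[tid] = aggregate;
    }
    return output;
}
```

Keeping the inputs in registers and writing once per thread is what keeps memory traffic out of the inner loop; the hashing itself still costs arithmetic in both variants, so it dilutes whatever difference is being measured.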
Alternatively, I can use the precise keyword.
precise shouldn't matter here at all
I will try the precise keyword first anyways.
Don't want any random outside force to obfuscate what I really am curious about anyways.
are you planning to apply the knowledge gained through this benchmark anywhere?
No.
huh
I also can clearly sense what you were trying to achieve with this sequence of questioning me about memory latency again.
By diverting me from the question that truly interests me, sure. I don't have an interest in being helped by talking about memory latency, again. I am fully aware of how bad that thing can get and I literally work as a GPU programmer. I don't need to be constantly reminded about memory latency. I did not ask anything about it.
I literally work as a GPU driver programmer
anyways, the reason I'm asking/mentioning this is that I doubt that this benchmark is of any use (and if it isn't useful at all, why invest the effort in the first place?)
So you are pushing me because in your personal vision anything that does not have clear utility should not be done by others?
No, I'm trying to save you from wasting your time
Why are you trying to affect my interests or curiosities?
Are people not allowed to have "useless" interests?
There are "useless" interests that are still fun (why do we kick a ball towards two goals for 90 minutes? it does not affect anything in our lives), but I doubt you'd get your enjoyment out of writing benchmarks without any merit
Hey, do you live inside of my body and brain?
printf("%d", rand()) would give you about the same amount of useful information
Why keep projecting and pushing lol?
no, but I do claim to know how people usually work
nobody is stopping you from doing this benchmark
if you insist, go ahead
Yeah, which is really weird, to generalize me to be the same.
I am just telling you that you are wasting your time
If someone came up to you and started insisting that you don't really want to test something that you feel really excited about testing, and then kept pushing you, would you enjoy that?
I am literally not forcing anyone to keep pushing me.
so we are asking "is this a good question"
Which I have never forced you to do. You are doing it to yourself.
so what do you think you are testing here btw
This entire continuous breach of a boundary I asserted is entirely your choice, to try to control how another wants to spend their free time at this point. I am done here.
I would maybe like to crawl inside
but there's already a worm in there
