7 mads and 1 mul vs 2 dot4s and 1 add? | Graphics Programming | Page 1

unborn sluice Jan 7, 2023, 9:13 AM

#

profile it and see

#

There isn't going to be a universal answer to this, it will probably vary across platforms

storm obsidian Jan 7, 2023, 9:15 AM

#

what do you think the dots expand to?

cosmic python Jan 7, 2023, 9:15 AM

#

unborn sluice There isn't going to be a universal answer to this, it will probably vary across...

So you believe that this is going to be very vendor and driver dependent and there is no clear answer here? I have only one GPU I can test on rn sadly.

restive shoal Jan 7, 2023, 9:15 AM

#

yes

unborn sluice Jan 7, 2023, 9:15 AM

#

I find it hard to believe that this exact operation is the main bottleneck of your application, hardware aside

restive shoal Jan 7, 2023, 9:15 AM

#

possibly even GPU architecture dependent

unborn sluice Jan 7, 2023, 9:16 AM

#

Not saying it's impossible but you should try and find out

restive shoal Jan 7, 2023, 9:16 AM

#

also this

storm obsidian Jan 7, 2023, 9:16 AM

#

there should be a :that: emoji which points down

unborn sluice Jan 7, 2023, 9:16 AM

#

arithmetic is very rarely the biggest bottleneck in anything, since it's absurdly fast

restive shoal Jan 7, 2023, 9:16 AM

#

storm obsidian there should be a :that: emoji which points down

make it :frog_whip:

unborn sluice Jan 7, 2023, 9:18 AM

#

There's a fun analogy about memory vs. arithmetic latency that goes something like this: doing arithmetic on two values from main memory and writing the result back is comparable to someone mailing you a letter with two numbers, you adding them in your head in a few seconds, and mailing the response back to them

cosmic python Jan 7, 2023, 9:18 AM

#

Okay, I have no idea why people are suddenly assuming I have a bottleneck and where I supposedly have it. I wanted to know if there is an answer about the performance difference, and there is none. Please do not give me unrelated unsolicited advice. It is needlessly hurtful and pressuring for me personally.

restive shoal Jan 7, 2023, 9:19 AM

#

why are you asking this question if you don’t have a bottleneck?

cosmic python Jan 7, 2023, 9:19 AM

#

Theoretical questions are allowed, I believe.

#

Please add a tag for it.

unborn sluice Jan 7, 2023, 9:20 AM

#

Ok well the answer to the question is that there isn't really a clear one, so if there isn't any empirical evidence for us to reason about I don't think there's much more that can be said

cosmic python Jan 7, 2023, 9:20 AM

#

So that people do not provide unsolicited advice.

unborn sluice Jan 7, 2023, 9:20 AM

#

It's not unsolicited, it's relevant to the question

cosmic python Jan 7, 2023, 9:20 AM

#

unborn sluice Ok well the answer to the question is that there isn't really a clear one, so if...

Thank you very much. That is what I wanted to know. I will profile it as well on my gpu.

unborn sluice Jan 7, 2023, 9:20 AM

#

performance doesn't exist in a vacuum, so it's natural for a question about performance to pull in broader context

cosmic python Jan 7, 2023, 9:20 AM

#

unborn sluice It's not unsolicited, it's relevant to the question

It is unsolicited to push me to fix a nonexistent bottleneck.

unborn sluice Jan 7, 2023, 9:21 AM

#

I'm not pushing you to fix a nonexistent bottleneck, I'm explaining part of why the question is more complex than it seems

cosmic python Jan 7, 2023, 9:21 AM

#

unborn sluice performance doesn't exist in a vacuum, so it's natural for a question about perf...

Sure, just please do not push me without at least questioning me first... it is hurtful to me personally.

storm obsidian Jan 7, 2023, 9:21 AM

#

the correct advice for this kind of question is literally to profile both and see which is faster on your target(s)

unborn sluice Jan 7, 2023, 9:22 AM

#

I'm sorry if I hurt your feelings but we asked the sequence of questions you're going to get in response to literally any question about code optimization so just be prepared to see it again

cosmic python Jan 7, 2023, 9:22 AM

#

I find it hard to believe that this exact operation is the main bottleneck of your application, hardware aside

There is a lot of assumption here that I do not want to encounter anymore, please.

unborn sluice Jan 7, 2023, 9:22 AM

#

"Is this really a bottleneck" is the first question anyone is going to ask if you solicit advice about microoptimizations, for better or worse

cosmic python Jan 7, 2023, 9:23 AM

#

Okay, then next time I will mention that I am asking a purely theoretical question and that I only want an answer to that specific question.

storm obsidian Jan 7, 2023, 9:26 AM

#

anyways, pretty sure both will perform identically since the two options you present are the same thing

#

here's the diff for rdna2 (right = two dots, left = fma spam)
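A minimal CPU sketch of why the two options can be called "the same thing". It assumes the expression in question is a sum of eight products, which the transcript does not spell out, and it uses plain Python floats in place of shader scalars. The names `mad_chain`, `dot4`, and `two_dots` are made up for illustration. The input values are chosen so every intermediate is exactly representable; with arbitrary inputs the two forms can differ in the last bits because the additions happen in a different order.

```python
# CPU sketch only: plain Python floats stand in for shader scalars.

def mad_chain(a, b):
    """1 mul followed by 7 multiply-adds over eight pairs of scalars."""
    acc = a[0] * b[0]                  # the single mul
    for x, y in zip(a[1:], b[1:]):
        acc = x * y + acc              # one mad per remaining pair
    return acc

def dot4(u, v):
    """Four-component dot product."""
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2] + u[3] * v[3]

def two_dots(a, b):
    """2 dot4s and 1 add over the same eight pairs."""
    return dot4(a[:4], b[:4]) + dot4(a[4:], b[4:])

# Illustrative inputs, chosen to be exact in binary floating point.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [0.5, 0.25, 2.0, 1.0, 4.0, 0.5, 2.0, 0.125]

print(mad_chain(a, b), two_dots(a, b))  # 49.0 49.0
```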

cosmic python Jan 7, 2023, 9:58 AM

#

storm obsidian here's the diff for rdna2 (right = two dots, left = fma spam)

Interesting, so if I replace the mad version with a raw (*)+(*)+(*)+(*)+(*)+(*)+(*) version it will lose the mov and be one operation less than the dot version.. looks like the dot version would run ~1.125x slower in that specific disassembly.
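One possible reading of where the ~1.125x figure comes from, as a worked count. The disassembly itself is not part of this transcript, so the per-instruction breakdown below is an assumption: a dot4 is taken to lower to 1 mul plus 3 mads, and every instruction is counted as equal cost.

```python
# Assumption: a dot4 lowers to 1 mul + 3 mads; real compilers may differ.
DOT4_OPS = 1 + 3

mad_version = 1 + 7              # 1 mul + 7 mads, no extra mov
dot_version = 2 * DOT4_OPS + 1   # 2 dot4s + 1 add

ratio = dot_version / mad_version
print(mad_version, dot_version, ratio)  # 8 9 1.125
```

Counting instructions this way ignores scheduling, latency hiding, and memory waits, so it is only an upper-level estimate of the arithmetic cost.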

restive shoal Jan 7, 2023, 10:39 AM

#

I doubt that

#

In that specific disassembly, I think most of the latency would be centered around the s_waitcnt instructions

cosmic python Jan 7, 2023, 3:51 PM

#

restive shoal In that specific disassembly, I think most of the latency would be centered arou...

Which both sides have an equal number of?

restive shoal Jan 7, 2023, 3:52 PM

#

yes

#

therefore I'd guess they perform the same

#

(though it does not depend on the number of waitcnt instructions alone, it also depends on how the memory is accessed by buffer_load_dwordx4)

cosmic python Jan 7, 2023, 3:54 PM

#

So you are implying that in, as an example, a loop of 1000, the difference will be negligible because one more mul is cheap in comparison to synchronization opcodes, anyways?

restive shoal Jan 7, 2023, 3:54 PM

#

cheap in comparison to the memory load it synchronizes yeah
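A toy model of that argument, assuming each ALU instruction costs one cycle and the memory wait is a fixed stall shared by both variants. The function name `relative_cost` is hypothetical, the 8 and 9 follow the ~1.125x instruction-count estimate from earlier in the thread, and the wait of 300 cycles is a placeholder chosen for illustration, not a measurement of any GPU.

```python
def relative_cost(alu_ops_a, alu_ops_b, wait_cycles):
    """Ratio of (wait + ALU) cost for variant b over variant a.

    Toy model: one cycle per ALU op plus a fixed memory stall.
    Real hardware overlaps memory waits with other work.
    """
    return (wait_cycles + alu_ops_b) / (wait_cycles + alu_ops_a)

no_wait = relative_cost(8, 9, 0)       # pure arithmetic: 9 / 8
with_wait = relative_cost(8, 9, 300)   # 300 is a placeholder, not measured

print(no_wait)    # 1.125
print(with_wait)  # a little over 1.003
```

Under these assumptions the one extra instruction moves the total by well under one percent once any sizeable wait is included, which is the point being made about `s_waitcnt`.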

cosmic python Jan 7, 2023, 3:55 PM

#

Hmm well, I finally woke up so now I can profile my code.

#

I will test 1024x1024 with random numbers in loops of 1000.

restive shoal Jan 7, 2023, 3:56 PM

#

as in you load, apply arithmetic on, then store 1000 different numbers in each invocation?

cosmic python Jan 7, 2023, 4:01 PM

#

No. In a loop of 1000, I will generate a, b, c, d, e, f, g, h through a naive sine hash and then perform the divergent computations; repeat.

#

And that for 2^20 threads.

restive shoal Jan 7, 2023, 4:03 PM

#

a hash of what?

cosmic python Jan 7, 2023, 4:04 PM

#

I will not be loading a resource 1000 times in a loop, I know that is going to obfuscate the performance difference too much because that is slowwww.

restive shoal Jan 7, 2023, 4:05 PM

#

also, how will you store the results or otherwise make sure the compiler doesn't just optimize the calculation out?

cosmic python Jan 7, 2023, 4:05 PM

#

restive shoal a hash of what?

Thread id uint plus loop index uint to float hashed to another float.

cosmic python Jan 7, 2023, 4:05 PM

#

restive shoal also, how will you store the results or otherwise make sure the compiler doesn't...

Disassembly.

restive shoal Jan 7, 2023, 4:06 PM

#

if you don't store any results the compiler will optimize it out

cosmic python Jan 7, 2023, 4:06 PM

#

If the code I care about differs only in the divergent part.

restive shoal Jan 7, 2023, 4:06 PM

#

I can tell you that without the disasm

cosmic python Jan 7, 2023, 4:08 PM

#

restive shoal if you don't store any results the compiler will optimize it out

I will FMA into an aggregate, aggregate += (0.01 * result), to prevent any memory latency obfuscation inside the loop itself, and then output the aggregate for each uint index into a separate output buffer index to prevent serialization.
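The benchmark shape described here (hashed inputs, a loop per thread, an accumulated result, one output slot per thread) could be sketched on the CPU roughly as follows. Everything named below is hypothetical: the constant in `sine_hash` is a common shader idiom rather than anything stated in the thread, the way eight inputs are derived from one seed is a guess, and `stand_in` is a placeholder expression, not the one being compared. Sizes are shrunk so the sketch runs instantly; the thread proposes 2^20 threads and 1000 iterations.

```python
import math

def sine_hash(x):
    """Naive sine hash: fractional part of a scaled sine, roughly in [0, 1)."""
    s = math.sin(x) * 43758.5453   # assumed constant, common in shader code
    return s - math.floor(s)

def run_thread(thread_id, compute, iterations=1000):
    """Per-thread work: hash inputs, run the variant, accumulate the result."""
    aggregate = 0.0
    for i in range(iterations):
        seed = float(thread_id + i)                 # thread id plus loop index
        inputs = [sine_hash(seed * 8.0 + k) for k in range(8)]  # a..h (assumed scheme)
        aggregate += 0.01 * compute(inputs)         # keeps the result live
    return aggregate

def stand_in(v):
    """Placeholder for the variant under test, not the thread's expression."""
    a, b, c, d, e, f, g, h = v
    return a * e + b * f + c * g + d * h

num_threads = 4  # tiny for illustration
output = [run_thread(t, stand_in, iterations=10) for t in range(num_threads)]
print(len(output))  # 4
```

Writing each aggregate to its own output slot is what keeps a compiler from discarding the loop as dead code, since the final value depends on every iteration.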

#

Alternatively, I can use the precise keyword.

restive shoal Jan 7, 2023, 4:09 PM

#

precise shouldn't matter here at all

cosmic python Jan 7, 2023, 4:09 PM

#

I will try the precise keyword first anyways.

#

Don't want any random outside force to obfuscate what I really am curious about anyways.

restive shoal Jan 7, 2023, 4:11 PM

#

are you planning to apply the knowledge gained through this benchmark anywhere?

cosmic python Jan 7, 2023, 4:11 PM

#

No.

restive shoal Jan 7, 2023, 4:11 PM

#

huh

cosmic python Jan 7, 2023, 4:12 PM

#

I also can clearly sense what you were trying to achieve with this sequence of questioning me about memory latency again.

restive shoal Jan 7, 2023, 4:12 PM

#

what do you think I was trying to achieve?

#

I am literally trying to help you

cosmic python Jan 7, 2023, 4:14 PM

#

By diverting me from the question that truly interests me, sure. I don't have an interest in being helped by talking about memory latency, again. I am fully aware of how bad that thing can get and I literally work as a GPU programmer. I don't need to be constantly reminded about memory latency. I did not ask anything about it.

restive shoal Jan 7, 2023, 4:15 PM

#

I literally work as a GPU driver programmer

#

anyways, the reason I'm asking/mentioning this is that I doubt that this benchmark is of any use (and if it isn't useful at all, why invest the effort in the first place?)

cosmic python Jan 7, 2023, 4:17 PM

#

restive shoal anyways, the reason I'm asking/mentioning this is that I doubt that this benchma...

So you are pushing me because in your personal vision anything that does not have clear utility should not be done by others?

restive shoal Jan 7, 2023, 4:17 PM

#

No, I'm trying to save you from wasting your time

cosmic python Jan 7, 2023, 4:17 PM

#

Why are you trying to affect my interests or curiosities?

#

Are people not allowed to have "useless" interests?

restive shoal Jan 7, 2023, 4:19 PM

#

There are "useless" interests that are still fun (why do we kick a ball towards two goals for 90 minutes? it does not affect anything in our lives), but I doubt you'd get your enjoyment out of writing benchmarks without any merit

cosmic python Jan 7, 2023, 4:19 PM

#

restive shoal There are "useless" interests that are still fun (why do we kick a ball towards ...

Hey, do you live inside of my body and brain?

restive shoal Jan 7, 2023, 4:19 PM

#

printf("%d", rand()) would give you about the same amount of useful information

cosmic python Jan 7, 2023, 4:20 PM

#

Why keep projecting and pushing lol?

restive shoal Jan 7, 2023, 4:20 PM

#

no, but I do claim to know how people usually work

#

nobody is stopping you from doing this benchmark

#

if you insist, go ahead

cosmic python Jan 7, 2023, 4:20 PM

#

Yeah, which is really weird, to generalize me to be the same.

restive shoal Jan 7, 2023, 4:20 PM

#

I am just telling you that you are wasting your time

tall skiff Jan 7, 2023, 4:21 PM

#

also

#

there is no problem with you doing this on your own free time

#

but understand

cosmic python Jan 7, 2023, 4:21 PM

#

If someone came up to you and started insisting that you don't really want to test something that you feel really excited about testing and then kept pushing you, would you enjoy that?

tall skiff Jan 7, 2023, 4:21 PM

#

you are taking our time

#

you are asking us

cosmic python Jan 7, 2023, 4:21 PM

#

tall skiff you are taking our time

I am literally not forcing anyone to keep pushing me.

tall skiff Jan 7, 2023, 4:21 PM

#

so we are asking "is this a good question"

cosmic python Jan 7, 2023, 4:22 PM

#

tall skiff so we are asking "is this a good question"

Which I have never forced you to do. You are doing it to yourself.

restive shoal Jan 7, 2023, 4:22 PM

#

so what do you think you are testing here btw

cosmic python Jan 7, 2023, 4:22 PM

#

This entire continuous breach of a boundary I asserted is entirely your choice to try to control how another wants to spend their free time at this point. I am done here.

tall skiff Jan 7, 2023, 4:22 PM

#

cosmic python Hey, do you live inside of my body and brain?

I would maybe like to crawl inside

#

but there's already a worm in there
