#7 mads and 1 mul vs 2 dot4s and 1 add?

unborn sluice
#

profile it and see

#

There isn't going to be a universal answer to this, it will probably vary across platforms

storm obsidian
#

what do you think the dots expand to?

cosmic python
restive shoal
#

yes

unborn sluice
#

I find it hard to believe that this exact operation is the main bottleneck of your application, hardware aside

restive shoal
#

possibly even GPU architecture dependent

unborn sluice
#

Not saying it's impossible but you should try and find out

restive shoal
#

also this

storm obsidian
#

there should be a :that: emoji which points down

unborn sluice
#

arithmetic is very rarely the biggest bottleneck in anything, since it's absurdly fast

restive shoal
unborn sluice
#

There's a fun analogy about memory vs. arithmetic latency that goes something like: doing arithmetic on two values from main memory and writing the result back is comparable to someone mailing you a letter with two numbers, you adding them in your head in a few seconds, and mailing the response back to them

cosmic python
#

Okay, I have no idea why people are suddenly assuming I have a bottleneck and where I supposedly have it. I wanted to know if there is an answer to performance difference and there is none. Please do not give me unrelated unsolicited advice. It is needlessly hurtful and pressuring for me personally.

restive shoal
#

why are you asking this question if you don’t have a bottleneck?

cosmic python
#

Theoretical questions are allowed, I believe.

#

Please add a tag for it.

unborn sluice
#

Ok well the answer to the question is that there isn't really a clear one, so if there isn't any empirical evidence for us to reason about I don't think there's much more that can be said

cosmic python
#

So that people do not provide unsolicited advice.

unborn sluice
#

It's not unsolicited, it's relevant to the question

cosmic python
unborn sluice
#

performance doesn't exist in a vacuum, so it's natural for a question about performance to pull in broader context

cosmic python
unborn sluice
#

I'm not pushing you to fix a nonexistent bottleneck, I'm explaining part of why the question is more complex than it seems

cosmic python
storm obsidian
#

the correct advice for this kind of question is literally to profile both and see which is faster on your target(s)

unborn sluice
#

I'm sorry if I hurt your feelings but we asked the sequence of questions you're going to get in response to literally any question about code optimization so just be prepared to see it again

cosmic python
#
> I find it hard to believe that this exact operation is the main bottleneck of your application, hardware aside

There is a lot of assumption here that I do not want to encounter anymore, please.

unborn sluice
#

"Is this really a bottleneck" is the first question anyone is going to ask if you solicit advice about microoptimizations, for better or worse

cosmic python
#

Okay, then next time I will mention that I am asking a purely theoretical question, and that I would only want an answer to that specific question.

storm obsidian
#

anyways, pretty sure both will perform identically since the two options you present are the same thing
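
A minimal sketch of one reading of "the same thing", assuming the operation is a sum of eight products and that each dot4 lowers to 1 mul followed by 3 mads. Both are assumptions: the thread never states the exact expression, and lowering varies by compiler and target. The function names are hypothetical. Under those assumptions the mad chain is 8 instructions and the two-dot4 form is 9 (2 mul, 6 mad, 1 add), and both compute the same sum:

```python
def dot4(a, b):
    """dot4 written as the 1 mul + 3 mad sequence it is assumed to lower to."""
    acc = a[0] * b[0]              # mul
    for i in range(1, 4):
        acc = a[i] * b[i] + acc    # mad
    return acc


def mad_chain(a, b):
    """Variant 1: 1 mul followed by 7 mads over all eight products."""
    acc = a[0] * b[0]              # mul
    for i in range(1, 8):
        acc = a[i] * b[i] + acc    # mad
    return acc


def two_dot4s(a, b):
    """Variant 2: two dot4s joined by one add."""
    return dot4(a[:4], b[:4]) + dot4(a[4:], b[4:])


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
print(mad_chain(a, b), two_dot4s(a, b))  # prints "120.0 120.0"
```

With arbitrary floats the two orderings can differ in the last bits, because floating-point addition is not associative; the inputs here are small integers, so both sums are exact.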

#

here's the diff for rdna2 (right = two dots, left = fma spam)

cosmic python
restive shoal
#

I doubt that

#

In that specific disassembly, I think most of the latency would be centered around the s_waitcnt instructions

cosmic python
restive shoal
#

yes

#

therefore I'd guess they perform the same

#

(though it does not depend on the number of waitcnt instructions alone, it also depends on how the memory is accessed by buffer_load_dwordx4)

cosmic python
#

So you are implying that in, as an example, a loop of 1000, the difference will be negligible because one more mul is cheap in comparison to synchronization opcodes, anyways?

restive shoal
#

cheap in comparison to the memory load it synchronizes yeah

cosmic python
#

Hmm well, I finally woke up so now I can profile my code.

#

I will test 1024x1024 with random numbers in loops of 1000.

restive shoal
#

as in you load, apply arithmetic on, then store 1000 different numbers in each invocation?

cosmic python
#

No. I will in a loop of 1000:
generate a b c d e f g h through a naive sine hash and then perform the divergent computations; repeat.

#

And that for 2^20 threads.

restive shoal
#

a hash of what?

cosmic python
#

I will not be loading a resource 1000 times in a loop, I know that is going to obfuscate the performance difference too much because that is slowwww.

restive shoal
#

also, how will you store the results or otherwise make sure the compiler doesn't just optimize the calculation out?

cosmic python
restive shoal
#

if you don't store any results the compiler will optimize it out
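
A rough sketch of the benchmark shape described above, written in Python only to show the structure, not to measure GPU behaviour. The hash constants, the sum-of-squares workload, and every function name are assumptions for illustration, not the poster's code. CPython does not eliminate unused work, so writing into `out` here only mirrors what a shader would need to do (store each invocation's result to a buffer) to keep a compiler from removing the loop:

```python
import math
import time


def sine_hash(x):
    """Naive shader-style hash: fractional part of a scaled sine (constants assumed)."""
    s = math.sin(x * 12.9898) * 43758.5453
    return s - math.floor(s)


def variant_mads(v):
    """Hypothetical workload: 1 mul + 7 mads over the eight values a..h."""
    acc = v[0] * v[0]
    for x in v[1:]:
        acc = x * x + acc
    return acc


def variant_dots(v):
    """Same hypothetical workload as two 4-wide dot products plus one add."""
    lo = sum(x * x for x in v[:4])
    hi = sum(x * x for x in v[4:])
    return lo + hi


def run(variant, thread_id, iterations):
    """One 'thread': hash eight inputs per iteration and accumulate the result."""
    total = 0.0
    for i in range(iterations):
        seed = thread_id * iterations + i
        vals = [sine_hash(seed * 8 + k) for k in range(8)]  # a..h
        total += variant(vals)
    return total


def benchmark(variant, threads, iterations):
    """Time all threads and keep every result in an output list."""
    out = [0.0] * threads  # stands in for the output buffer
    start = time.perf_counter()
    for t in range(threads):
        out[t] = run(variant, t, iterations)
    return time.perf_counter() - start, out


elapsed, out = benchmark(variant_mads, threads=16, iterations=100)
print(f"{elapsed:.6f} s, first result {out[0]:.6f}")
```

The thread's real sizes (2^20 threads, 1000 iterations) are shrunk here so the sketch finishes quickly on a CPU.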

cosmic python
#

If the code I care about differs only in the divergent part.

restive shoal
#

I can tell you that without the disasm

cosmic python
#

Alternatively, I can use the precise keyword.

restive shoal
#

precise shouldn't matter here at all

cosmic python
#

I will try the precise keyword first anyways.

#

Don't want any random outside force to obfuscate what I really am curious about anyways.

restive shoal
#

are you planning to apply the knowledge gained through this benchmark anywhere?

cosmic python
#

No.

restive shoal
#

huh

cosmic python
#

I also can clearly sense what you were trying to achieve with this sequence of questioning me about memory latency again.

restive shoal
#

what do you think I was trying to achieve?

#

I am literally trying to help you

cosmic python
#

By diverting me from the question that truly interests me, sure. I don't have an interest in being helped by talking about memory latency, again. I am fully aware of how bad that thing can get and I literally work as a GPU programmer. I don't need to be constantly reminded about memory latency. I did not ask anything about it.

restive shoal
#

I literally work as a GPU driver programmer

#

anyways, the reason I'm asking/mentioning this is that I doubt that this benchmark is of any use (and if it isn't useful at all, why invest the effort in the first place?)

cosmic python
restive shoal
#

No, I'm trying to save you from wasting your time

cosmic python
#

Why are you trying to affect my interests or curiosities?

#

Are people not allowed to have "useless" interests?

restive shoal
#

There are "useless" interests that are still fun (why do we kick a ball towards two goals for 90 minutes? it does not affect anything in our lives), but I doubt you'd get your enjoyment out of writing benchmarks without any merit

cosmic python
restive shoal
#

printf("%d", rand()) would give you about the same amount of useful information

cosmic python
#

Why keep projecting and pushing lol?

restive shoal
#

no, but I do claim to know how people usually work

#

nobody is stopping you from doing this benchmark

#

if you insist, go ahead

cosmic python
#

Yeah, which is really weird, to generalize me to be the same.

restive shoal
#

I am just telling you that you are wasting your time

tall skiff
#

also

#

there is no problem with you doing this on your own free time

#

but understand

cosmic python
#

If someone came up to you and started insisting that you don't really want to test something that you feel really excited about testing and then kept pushing you, would you enjoy that?

tall skiff
#

you are taking our time

#

you are asking us

cosmic python
tall skiff
#

so we are asking "is this a good question"

cosmic python
restive shoal
#

so what do you think you are testing here btw

cosmic python
#

This entire continuous breach of a boundary I asserted is entirely your choice to try to control how another wants to spend their free time at this point. I am done here.

tall skiff
#

but there's already a worm in there