#HIP support for threadfence/atomics with memory ordering

eager lance
#

Implementing some data structures and algorithms becomes easier if atomics with memory ordering are supported. Currently we have to add a full threadfence wherever ordering is required, but in some cases a weaker fence would suffice, say acquire-only or release-only.
I saw some references to __builtin_amdgcn_fence in the LLVM repos, but I'm not sure what its status is.
So my question is: is it possible to use weaker threadfences, and if so, what is the recommended way to do this?

white veldt
#

Hey, it might be helpful to share a minimal CUDA equivalent of this, to understand exactly what's missing on the HIP side.

misty tulip
#

We have support for __atomic builtins that take a memory ordering parameter. For an example in the wild, see https://github.com/vllm-project/vllm/blob/6d0cf239c66936ff52582042698fd1aeb2a73bb6/csrc/custom_all_reduce.cuh#L250. I think these are the GCC built-ins, or at least the syntax is the same; I'm trying to find some documentation for them.

#

So you should be able to call these directly with your preferred memory ordering.
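
For reference, here is a minimal host-side C++ sketch of what that syntax looks like with GCC or Clang. It is not HIP device code: the names `payload`, `ready`, `producer`, `consumer`, and `run_handoff` are made up for illustration, and the assumption (not verified here) is that the same `__atomic_*` calls can be used inside a HIP kernel as described above. It shows a release store paired with an acquire load, plus a variant that uses a standalone release fence followed by a relaxed store.

```cpp
#include <cassert>
#include <thread>

// Hypothetical names for illustration; none of these come from HIP itself.
static int payload = 0;  // plain data handed from producer to consumer
static int ready = 0;    // flag, only touched through __atomic builtins

// Publish `value`. With use_fence == false the flag store itself carries
// release ordering; with use_fence == true a standalone release fence is
// followed by a relaxed store, which gives the same guarantee.
void producer(int value, bool use_fence) {
    payload = value;  // ordinary store, ordered before the flag store
    if (use_fence) {
        __atomic_thread_fence(__ATOMIC_RELEASE);
        __atomic_store_n(&ready, 1, __ATOMIC_RELAXED);
    } else {
        __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
    }
}

// Spin until the flag is set. The acquire load pairs with the release
// above, so the read of `payload` afterwards sees the published value.
int consumer() {
    while (__atomic_load_n(&ready, __ATOMIC_ACQUIRE) == 0) {
        // busy-wait
    }
    return payload;
}

// Run one producer/consumer handoff on two host threads.
int run_handoff(int value, bool use_fence) {
    payload = 0;
    __atomic_store_n(&ready, 0, __ATOMIC_RELAXED);
    int seen = -1;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer, value, use_fence);
    p.join();
    c.join();
    assert(__atomic_load_n(&ready, __ATOMIC_RELAXED) == 1);
    return seen;
}
```

Calling `run_handoff(42, false)` exercises the release/acquire pair, and `run_handoff(42, true)` exercises the fence variant. On the GPU side, which scope these orderings apply to (wavefront, workgroup, device, system) is something to check against the compiler documentation rather than assume from this sketch.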

eager lance
#

Oh this is perfect