Implementing some data structures and algorithms becomes easier if atomics with memory ordering are supported. Currently we have to add a full threadfence wherever ordering is required, but in some cases a weaker fence would suffice, say acquire-only or release-only.
I saw some references to __builtin_amdgcn_fence in the LLVM repos, but I'm not sure what its status is.
So my question is: is it possible to use weaker threadfences, and if so, what is the recommended way to do this?
#HIP support for threadfence/atomics with memory ordering
Hey, it might be helpful to share a minimal CUDA equivalent of this, to understand exactly what's missing on the HIP side.
We have support for the __atomic builtins, which take a memory ordering parameter; for an example in the wild, see https://github.com/vllm-project/vllm/blob/6d0cf239c66936ff52582042698fd1aeb2a73bb6/csrc/custom_all_reduce.cuh#L250. I think these are the GCC built-ins, or at least the syntax is the same. I'm still trying to find some documentation for them.
Actually, if you look at the source for the HIP atomic* functions, they call our own builtins as well: https://github.com/ROCm/clr/blob/amd-staging/hipamd/include/hip/amd_detail/amd_hip_atomic.h. For example, atomicAdd(address, val) resolves to __hip_atomic_fetch_add(address, val, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_AGENT), and atomicAdd_system uses __HIP_MEMORY_SCOPE_SYSTEM instead
so you should be able to call these directly with your preferred memory ordering
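To make that concrete, here is a rough sketch of a release/acquire handoff using the __atomic builtins mentioned above. The Mailbox struct, the HD_FN macro, and all function names are made up for illustration and are not part of HIP. It assumes the __atomic builtins and __atomic_thread_fence behave in device code the way they do on the host, which is worth checking against your ROCm version. The macro expands to nothing outside hipcc, so the same file also builds as plain C++ with GCC or Clang.

```cpp
#include <cassert>

#if defined(__HIPCC__)
#include <hip/hip_runtime.h>
#define HD_FN __host__ __device__   // hypothetical helper macro
#else
#define HD_FN                       // plain C++ build: no qualifiers
#endif

// Hypothetical example type: a payload guarded by a ready flag.
struct Mailbox {
  int payload;
  int ready;
};

// Producer: write the payload, then publish it with a release store.
// The release ordering keeps the payload write from moving after the flag.
HD_FN inline void publish(Mailbox* box, int value) {
  box->payload = value;
  __atomic_store_n(&box->ready, 1, __ATOMIC_RELEASE);
}

// Same idea with an explicit release fence and a relaxed store,
// i.e. the "weaker threadfence" asked about in the question.
HD_FN inline void publish_with_fence(Mailbox* box, int value) {
  box->payload = value;
  __atomic_thread_fence(__ATOMIC_RELEASE);
  __atomic_store_n(&box->ready, 1, __ATOMIC_RELAXED);
}

// Consumer: acquire load of the flag. If the flag is set, the payload
// written before the matching release is visible.
HD_FN inline bool try_consume(Mailbox* box, int* out) {
  if (__atomic_load_n(&box->ready, __ATOMIC_ACQUIRE) == 0) return false;
  *out = box->payload;
  return true;
}

// Single-threaded sanity check of the logic only; it does not
// exercise any actual cross-thread ordering.
HD_FN inline int roundtrip(int value, bool use_fence) {
  Mailbox box = {0, 0};
  int out = -1;
  if (try_consume(&box, &out)) return -1;  // nothing published yet
  if (use_fence) publish_with_fence(&box, value);
  else publish(&box, value);
  return try_consume(&box, &out) ? out : -1;
}
```

If you need to control the scope as well as the ordering, the __hip_atomic_* builtins shown above take the scope as an extra argument, so in principle the same pattern applies with __ATOMIC_RELEASE / __ATOMIC_ACQUIRE in place of __ATOMIC_RELAXED and whichever __HIP_MEMORY_SCOPE_* value fits your use case.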
Oh this is perfect