Rocshmem put hang / sync | AMD Developer Community | Page 1

I'm trying rocshmem for inter-gpu communication and have written a basic code to check p2p put between two ranks sending data to each other simultaneously. I can't seem to find the correct way to sync the host with the device after the other rank's puts have finished. I tried two ideas:

Basic puts from one device to the other using syncs (rocshmem_quiet, rocshmem_barrier_all, and hipDeviceSynchronize). CPU check fails due to elements still holding the initial value - probably due to the sync being device level, not accounting for incomplete puts.

__global__ void p2p_shmem_single_kernel(int* send_tensor, int* recv_tensor, int size, int peer) {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < size) {
        rocshmem_int_put(recv_tensor + tid, send_tensor + tid, 1, peer);
    }
    if (tid == 0) {
        rocshmem_quiet();
    }
}

Wave puts with signaling between the two ranks and appropriate wait calls - which hang.

__global__ void p2p_shmem_single_kernel(int* send_tensor, int* recv_tensor, size_t size, uint64_t* sig_addr, int peer) {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    const int lane_id = tid & (WAVE_SIZE - 1);
    const int w_start = tid - lane_id;
    uint64_t* wave_sig = sig_addr + (tid % WAVE_SIZE);

    if (lane_id == 0 && tid < size) {
        const size_t wave_count = min(WAVE_SIZE, size - w_start);
        rocshmem_int_put_signal_wave(recv_tensor + w_start, send_tensor + w_start, wave_count, wave_sig, 1, ROCSHMEM_SIGNAL_SET, peer);
    }
    if (tid == 0) {
        rocshmem_quiet();
    }

    if (lane_id == 0 and tid < size) {
        rocshmem_ulong_wait_until(wave_sig, ROCSHMEM_CMP_EQ, 1);
    }
}

Both are launched in a similar fashion with rocshmem barrier and device synchronization immediately after kernel launch. The program is run with mpirun -np 2.

Does anyone know to correct way to perform one side communication? Or a relevant rocshmem guide?
Thank you!

#Rocshmem put hang / sync