#Rocshmem put hang / sync

1 messages · Page 1 of 1 (latest)

ocean beacon
#

I'm trying rocshmem for inter-gpu communication and have written a basic code to check p2p put between two ranks sending data to each other simultaneously. I can't seem to find the correct way to sync the host with the device after the other rank's puts have finished. I tried two ideas:

  • Basic puts from one device to the other using syncs (rocshmem_quiet, rocshmem_barrier_all, and hipDeviceSynchronize). CPU check fails due to elements still holding the initial value - probably due to the sync being device level, not accounting for incomplete puts.
__global__ void p2p_shmem_single_kernel(int* send_tensor, int* recv_tensor, int size, int peer) {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < size) {
        rocshmem_int_put(recv_tensor + tid, send_tensor + tid, 1, peer);
    }
    if (tid == 0) {
        rocshmem_quiet();
    }
}
  • Wave puts with signaling between the two ranks and appropriate wait calls - which hang.
__global__ void p2p_shmem_single_kernel(int* send_tensor, int* recv_tensor, size_t size, uint64_t* sig_addr, int peer) {
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    const int lane_id = tid & (WAVE_SIZE - 1);
    const int w_start = tid - lane_id;
    uint64_t* wave_sig = sig_addr + (tid % WAVE_SIZE);

    if (lane_id == 0 && tid < size) {
        const size_t wave_count = min(WAVE_SIZE, size - w_start);
        rocshmem_int_put_signal_wave(recv_tensor + w_start, send_tensor + w_start, wave_count, wave_sig, 1, ROCSHMEM_SIGNAL_SET, peer);
    }
    if (tid == 0) {
        rocshmem_quiet();
    }

    if (lane_id == 0 and tid < size) {
        rocshmem_ulong_wait_until(wave_sig, ROCSHMEM_CMP_EQ, 1);
    }
}

Both are launched in a similar fashion with rocshmem barrier and device synchronization immediately after kernel launch. The program is run with mpirun -np 2.

Does anyone know to correct way to perform one side communication? Or a relevant rocshmem guide?
Thank you!

valid sinew
#

Hi @ocean beacon . Have you looked through example code in https://github.com/ROCm/rocSHMEM/tree/develop/examples ? If you're stuck after that, I'd suggest opening an issue in the GitHub repo with your question - https://github.com/ROCm/rocSHMEM/issues

GitHub

rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. - ROCm/rocSHMEM

GitHub

rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. - ROCm/rocSHMEM