I'm trying rocshmem for inter-gpu communication and have written a basic code to check p2p put between two ranks sending data to each other simultaneously. I can't seem to find the correct way to sync the host with the device after the other rank's puts have finished. I tried two ideas:
- Basic puts from one device to the other using syncs (rocshmem_quiet, rocshmem_barrier_all, and hipDeviceSynchronize). CPU check fails due to elements still holding the initial value - probably due to the sync being device level, not accounting for incomplete puts.
__global__ void p2p_shmem_single_kernel(int* send_tensor, int* recv_tensor, int size, int peer) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < size) {
rocshmem_int_put(recv_tensor + tid, send_tensor + tid, 1, peer);
}
if (tid == 0) {
rocshmem_quiet();
}
}
- Wave puts with signaling between the two ranks and appropriate wait calls - which hang.
__global__ void p2p_shmem_single_kernel(int* send_tensor, int* recv_tensor, size_t size, uint64_t* sig_addr, int peer) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
const int lane_id = tid & (WAVE_SIZE - 1);
const int w_start = tid - lane_id;
uint64_t* wave_sig = sig_addr + (tid % WAVE_SIZE);
if (lane_id == 0 && tid < size) {
const size_t wave_count = min(WAVE_SIZE, size - w_start);
rocshmem_int_put_signal_wave(recv_tensor + w_start, send_tensor + w_start, wave_count, wave_sig, 1, ROCSHMEM_SIGNAL_SET, peer);
}
if (tid == 0) {
rocshmem_quiet();
}
if (lane_id == 0 and tid < size) {
rocshmem_ulong_wait_until(wave_sig, ROCSHMEM_CMP_EQ, 1);
}
}
Both are launched in a similar fashion with rocshmem barrier and device synchronization immediately after kernel launch. The program is run with mpirun -np 2.
Does anyone know to correct way to perform one side communication? Or a relevant rocshmem guide?
Thank you!