I'm trying to fine-tune Qwen 2.5 14B with Unsloth + GRPO on two GPUs, since a single GPU doesn't have enough VRAM. I'm using the default Colab/Jupyter training script with only a few parts modified to fit my needs. However, training fails with the following CUDA error:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with 'TORCH_USE_CUDA_DSA' to enable device-side assertions.
I've attached the notebook so you can see what I changed and what I tried as a fix.
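For reference, the CUDA_LAUNCH_BLOCKING=1 hint in the message can be applied from inside the notebook. This is only a minimal sketch, assuming it runs in the very first cell, before torch or Unsloth is imported; the helper name `enable_sync_cuda_errors` is made up for illustration and is not part of any library.

```python
import os


def enable_sync_cuda_errors() -> None:
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Hypothetical helper (not from PyTorch or Unsloth). It only sets the
    environment variable named in the error message.
    """
    # Must be set before the first CUDA call; safest is before `import torch`.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_sync_cuda_errors()

# Import torch / unsloth only after this point, then rerun the failing cell.
# With synchronous launches the traceback should point much closer to the
# operation that actually tripped the assert, at the cost of slower training.
```

This does not fix anything by itself; it should only make the stack trace more trustworthy. Device-side asserts are often caused by an out-of-range index (for example a token ID larger than the embedding table), so that may be worth checking once the failing line is known.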