When you use vLLM (fast_inference=True) with Unsloth for GRPO, VRAM usage can spike because both the inference engine (vLLM) and the training engine must stay in memory, especially with long context and multiple generations per prompt. Even with gpu_memory_utilization set low, vLLM may allocate more VRAM than expected. Unsloth's standby/sleep mode is designed to help, but out-of-memory (OOM) errors can still occur, particularly with large models or when resuming from checkpoints. This is a known challenge, and lowering the context length, batch size, or num_generations is sometimes necessary. Also make sure you are on the latest Unsloth and vLLM versions, since recent updates have improved memory handling, and try enabling Unsloth Standby by setting os.environ["UNSLOTH_VLLM_STANDBY"] = "1" before importing Unsloth to further reduce VRAM usage during RL training.
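As a rough sketch of the ordering described above, the environment variable is set first and Unsloth is imported only afterwards. The model name, the helper names (build_model_kwargs, load_model), and the numeric values are placeholders chosen for illustration, not recommended settings; the keyword arguments follow the ones commonly passed to FastLanguageModel.from_pretrained in Unsloth's GRPO examples, so check them against your installed version.

```python
import os

# Set this BEFORE importing unsloth, as advised above.
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"

# Placeholder: substitute the model you are actually training.
MODEL_NAME = "your-model-name-here"


def build_model_kwargs(max_seq_length=2048,
                       gpu_memory_utilization=0.5,
                       fast_inference=True):
    """Collect the memory-related loading options in one place.

    The defaults are illustrative starting points, not tuned values.
    """
    return {
        "model_name": MODEL_NAME,
        "max_seq_length": max_seq_length,            # shorter context -> less VRAM
        "load_in_4bit": True,
        "fast_inference": fast_inference,            # True enables vLLM generation
        "gpu_memory_utilization": gpu_memory_utilization,
    }


def load_model(**overrides):
    """Import unsloth lazily so the env var above is already in place."""
    from unsloth import FastLanguageModel  # requires a GPU environment
    return FastLanguageModel.from_pretrained(**build_model_kwargs(**overrides))
```

Calling load_model(gpu_memory_utilization=0.4) would then load the model with a lower vLLM memory share; whether vLLM stays under that share is exactly the behavior to watch for.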
If you still hit OOM, consider running without vLLM (fast_inference=False) for more stable memory usage, or reduce the context length and batch size further. vLLM and Unsloth share memory space, yet high VRAM usage is still expected with long context and multiple generations, and some users have reported that vLLM may not strictly respect a low gpu_memory_utilization limit in standby mode. For more details and troubleshooting, see the official Unsloth RL memory guide and the related GitHub issues: Unsloth RL Memory Guide, Issue 3328, Issue 2632.
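One way to organize these fallbacks is as an ordered list of progressively lighter settings to try after each OOM. The sketch below is hypothetical: RLMemorySettings and fallback_ladder are names invented here, the starting values and floors are arbitrary, and the order (generations, then context, then batch size, then disabling vLLM) is just one reasonable choice. It only produces candidate settings and does not call Unsloth or vLLM.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RLMemorySettings:
    """Hypothetical bundle of the knobs mentioned above."""
    max_seq_length: int = 4096
    per_device_train_batch_size: int = 4
    num_generations: int = 8
    fast_inference: bool = True


def fallback_ladder(start, min_seq_length=1024, min_generations=2):
    """Yield settings from heaviest to lightest; try the next one after an OOM."""
    current = start
    yield current

    # 1. Fewer generations per prompt.
    while current.num_generations > min_generations:
        halved = max(min_generations, current.num_generations // 2)
        current = replace(current, num_generations=halved)
        yield current

    # 2. Shorter context.
    while current.max_seq_length > min_seq_length:
        halved = max(min_seq_length, current.max_seq_length // 2)
        current = replace(current, max_seq_length=halved)
        yield current

    # 3. Smaller per-device batch. Some trainer versions may require the
    #    effective batch size to be a multiple of num_generations, so
    #    gradient accumulation may need adjusting (assumption; verify).
    while current.per_device_train_batch_size > 1:
        halved = max(1, current.per_device_train_batch_size // 2)
        current = replace(current, per_device_train_batch_size=halved)
        yield current

    # 4. Last resort: turn vLLM off.
    if current.fast_inference:
        yield replace(current, fast_inference=False)


for step in fallback_ladder(RLMemorySettings()):
    print(step)
```

Each yielded object can be mapped onto the corresponding model-loading and trainer arguments; the final entry is the fast_inference=False configuration.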
Would you like a step-by-step breakdown or code example for your setup?