#CUDA OOM When trying to load Qwen2.5-7B instruct in GRPO notebook


muted island
#

Can't load Qwen2.5-7B-Instruct in the GRPO notebook:

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 512 # Can increase for longer reasoning traces
lora_rank = 16 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-7B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.42,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 8487 has 14.74 GiB memory in use. Of the allocated memory 14.61 GiB is allocated by PyTorch, and 15.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I tried decreasing the rank, using the default sequence length, decreasing gpu_memory_utilization, and removing QKVO.

And it's still in a retrying loop. Any idea how to fix this?
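The error message above suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of how that could be done from a notebook cell is below. It assumes the variable is set before torch (and therefore unsloth) is imported, since the allocator reads it when CUDA is first initialised. Note that the error reports only 15.65 MiB reserved but unallocated, so fragmentation may not be the main cause here and this setting may not help on its own.

```python
import os

# Set the allocator option suggested by the error message.
# Assumption: this cell runs first, before torch or unsloth are imported,
# so the CUDA caching allocator picks the setting up on initialisation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Only import torch / unsloth in a later cell, after this one has run.
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

If the runtime has already imported torch, restarting it before running this cell is probably the safest way to make sure the setting is applied.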

foggy sable
#

Yes, you need to lower gpu_memory_utilization, which I see you already did. It's strangely low, though. Have you tried raising it to 0.7?

muted island
#

@foggy sable
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:15<00:00, 20.57s/it]
INFO 02-08 15:29:34 model_runner.py:1115] Loading model weights took 14.5950 GB
INFO 02-08 15:29:34 punica_selector.py:18] Using PunicaWrapperGPU.
Unsloth: Retrying vLLM to process 96 sequences and 256 tokens in tandem.
Error: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 2599 has 14.74 GiB memory in use. Of the allocated memory 14.61 GiB is allocated by PyTorch, and 15.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
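The log says the model weights took 14.5950 GB on a GPU with 14.74 GiB in total. A rough back-of-the-envelope check can show how that compares with what 4-bit loading would be expected to need. The sketch below rests on two assumptions that are not from the thread: a parameter count of about 7.6 billion for the model, and treating gpu_memory_utilization as a simple fraction of total GPU memory. It also ignores quantization overhead, the KV cache, and activations.

```python
GIB = 2**30

def weight_gib(num_params, bits_per_param):
    """Approximate size of the raw weights in GiB."""
    return num_params * bits_per_param / 8 / GIB

ASSUMED_PARAMS = 7.6e9   # assumption: approximate parameter count, not from the thread
TOTAL_GIB = 14.74        # total GPU capacity reported in the error message

fp16_gib = weight_gib(ASSUMED_PARAMS, 16)  # 16-bit weights
int4_gib = weight_gib(ASSUMED_PARAMS, 4)   # 4-bit weights, overhead ignored

# Simplification: budget = fraction * total GPU memory.
budget_042 = 0.42 * TOTAL_GIB
budget_070 = 0.70 * TOTAL_GIB

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{int4_gib:.1f} GiB")
print(f"budget at 0.42: ~{budget_042:.1f} GiB")
print(f"budget at 0.70: ~{budget_070:.1f} GiB")
```

If these estimates are in the right ballpark, the logged figure looks closer to 16-bit weights than to 4-bit ones, so it may be worth checking whether the 4-bit setting is actually taking effect for this model name. That is only a guess from the numbers, not something confirmed in the thread.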

foggy sable
#

That's interesting. Can you destroy the environment, delete the notebook (if you don't have too much custom code), and recreate the environment with a new notebook? Then just click Run all, or run each cell once.

#

I had this issue where a cell failed but did not release memory.
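One way to check whether a failed cell is still holding GPU memory is to look at the free memory before re-running the load cell. The following is a minimal sketch, assuming PyTorch is installed in the runtime; `report_gpu_memory` is a hypothetical helper name, not part of Unsloth or the notebook.

```python
import gc

def report_gpu_memory():
    """Return (free_gib, total_gib) for the current GPU, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return None  # no CUDA device visible
    gc.collect()                # drop unreachable Python objects first
    torch.cuda.empty_cache()    # return unused cached blocks to the driver
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 2**30, total_bytes / 2**30

print(report_gpu_memory())
```

This only releases memory that the current process has cached but is no longer using. If live objects or another process still hold the memory, restarting the runtime, as suggested above, is likely the more reliable option.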

muted island
#

Sure, I'll try

foggy sable
#

I've got no clue, man, sorry. Can't think of anything else that might help.

muted island
#

It's ok, thanks 😅

tight hazel
#

I use the Llama notebook and just change the model to Qwen 7B, and it works .-.

lucid merlin
#

You may have had bad luck with the instance. Google does partition T4s virtually at times.

muted island
#

CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 4.12 MiB is free. Process 14873 has 14.73 GiB memory in use. Of the allocated memory 14.62 GiB is allocated by PyTorch, and 1.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)