CUDA OOM When trying to load Qwen2.5-7B instruct in GRPO notebook | Unsloth AI

muted island Feb 8, 2025, 1:58 PM

I can't load Qwen2.5-7B-Instruct in the GRPO notebook:

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 512 # Can increase for longer reasoning traces
lora_rank = 16 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-7B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.42,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 8487 has 14.74 GiB memory in use. Of the allocated memory 14.61 GiB is allocated by PyTorch, and 15.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
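The error text itself suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A minimal way to try it from a notebook is to set the environment variable in the very first cell, before `torch` (or anything that imports it, such as `unsloth`) is loaded, since PyTorch reads the setting when its CUDA caching allocator is initialized. Note that this message reports only 15.65 MiB reserved-but-unallocated, so fragmentation does not look like the main problem, and whether vLLM's own allocations benefit is not guaranteed.

```python
import os

# Run this in the first notebook cell, before `import torch` or `from unsloth import ...`;
# PyTorch reads the setting when its CUDA caching allocator is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print("PYTORCH_CUDA_ALLOC_CONF =", os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```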

I tried decreasing the rank, using the default sequence length, lowering gpu_memory_utilization, and removing the QKVO target modules.

It's still stuck in a retry loop. Any idea how to fix this?
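For a sense of scale, the sketch below counts the parameters LoRA adds: each adapted linear layer gets two small matrices holding r × (d_in + d_out) values in total. The model dimensions are assumptions taken from Qwen2.5-7B's published config (hidden size 3584, MLP size 18944, 28 layers, 512-wide K/V projections), and only adapter weights are counted, not gradients or optimizer state. Under those assumptions the adapters are tens of MiB, small next to the base model weights, which would explain why lowering the rank or dropping QKVO did not change the outcome.

```python
def lora_params(r: int, shapes: dict, layers: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to every adapted linear layer."""
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes.values())


# Assumed Qwen2.5-7B dimensions (hidden, MLP, K/V width, number of decoder layers).
H, I, KV, LAYERS = 3584, 18944, 512, 28
ALL_MODULES = {
    "q_proj": (H, H), "k_proj": (H, KV), "v_proj": (H, KV), "o_proj": (H, H),
    "gate_proj": (H, I), "up_proj": (H, I), "down_proj": (I, H),
}
# Equivalent to removing QKVO from target_modules.
MLP_ONLY = {k: v for k, v in ALL_MODULES.items() if k in ("gate_proj", "up_proj", "down_proj")}

for label, shapes in (("all 7 modules", ALL_MODULES), ("MLP only", MLP_ONLY)):
    for r in (16, 8):
        n = lora_params(r, shapes, LAYERS)
        # 2 bytes per parameter for 16-bit adapter weights
        print(f"{label}, r={r}: {n / 1e6:.1f}M params, about {n * 2 / 2**20:.0f} MiB in 16-bit")
```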

foggy sable Feb 8, 2025, 3:16 PM

Yes, you need to lower gpu_memory_utilization, which I see you already did, but 0.42 is strangely low. Have you tried raising it to 0.7?
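In vLLM, `gpu_memory_utilization` is the fraction of total GPU memory the engine is allowed to use for weights, activations, and KV cache together; this sketch assumes Unsloth's `fast_inference` path forwards the value to vLLM unchanged. The arithmetic below, using the 14.74 GiB capacity from the OOM message, shows roughly how much room each setting leaves. It is a rough estimate, not how vLLM computes its exact limit.

```python
def vllm_budget_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """Rough upper bound on what vLLM may claim: total memory x utilization fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_gib * gpu_memory_utilization


TOTAL_GIB = 14.74  # total capacity reported in the CUDA OOM message
for util in (0.42, 0.7):
    print(f"gpu_memory_utilization={util}: about {vllm_budget_gib(TOTAL_GIB, util):.2f} GiB for vLLM")
```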

muted island Feb 8, 2025, 3:26 PM

I'm trying it now; hopefully it works 😅

@foggy sable
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:15<00:00, 20.57s/it]
INFO 02-08 15:29:34 model_runner.py:1115] Loading model weights took 14.5950 GB
INFO 02-08 15:29:34 punica_selector.py:18] Using PunicaWrapperGPU.
Unsloth: Retrying vLLM to process 96 sequences and 256 tokens in tandem.
Error: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 2599 has 14.74 GiB memory in use. Of the allocated memory 14.61 GiB is allocated by PyTorch, and 15.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
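The log line "Loading model weights took 14.5950 GB" is nearly the whole card. A back-of-the-envelope estimate of raw weight storage (parameters × bits ÷ 8) helps interpret it; the parameter count of roughly 7.6B for Qwen2.5-7B is an assumption, and quantization metadata, KV cache, and activations are ignored. Under those assumptions 16-bit weights come to about 14 GiB and 4-bit weights to about 3.5 GiB, so the logged figure looks closer to a 16-bit load than a 4-bit one.

```python
def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Raw weight storage only: ignores quantization scales, KV cache, and activations."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7.6e9  # assumption: approximate parameter count of Qwen2.5-7B
for label, bits in (("16-bit", 16), ("4-bit", 4)):
    print(f"{label} weights: about {weight_gib(NUM_PARAMS, bits):.2f} GiB")
```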

foggy sable Feb 8, 2025, 3:32 PM

That's interesting. Can you destroy the environment, delete the notebook (if you don't have too much custom code), and recreate the environment with a new notebook? Then just click Run all, or run each cell only once.

I had this issue where a cell failed but did not release its memory.
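If a failed cell is still holding memory, one thing to try before a full restart is dropping the Python references and asking PyTorch to release its cache. The helper below is hypothetical (`release_gpu_memory` is a name made up for this sketch): it deletes the named globals, runs the garbage collector, and calls `torch.cuda.empty_cache()` when PyTorch and a GPU are available. Memory held by a vLLM engine or by references elsewhere (for example IPython's output history) may not be freed this way, so restarting the runtime remains the more reliable option.

```python
import gc


def release_gpu_memory(namespace: dict, *names: str) -> list:
    """Hypothetical helper: delete named variables, collect garbage, empty PyTorch's CUDA cache.

    In a notebook, pass `globals()` as `namespace`. Returns the names actually removed.
    """
    removed = []
    for name in names:
        if name in namespace:
            del namespace[name]  # drop the reference so the objects can be freed
            removed.append(name)
    gc.collect()  # collect reference cycles that may still point at tensors
    try:
        import torch
    except ImportError:
        return removed  # nothing GPU-related to release without PyTorch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    return removed


# Usage in a notebook cell after a failed load:
# release_gpu_memory(globals(), "model", "tokenizer")
```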

muted island Feb 8, 2025, 3:39 PM

Sure, I'll try

muted island Feb 8, 2025, 4:17 PM

Just tried it; it still gives me the same error.

foggy sable Feb 8, 2025, 4:28 PM

I've got no clue, man, sorry. I can't think of anything else that might help.

muted island Feb 8, 2025, 4:45 PM

It's ok, thanks 😅

tight hazel Feb 8, 2025, 4:59 PM

I used the Llama notebook and just changed the model to Qwen 7B, and it works .-.

muted island Feb 8, 2025, 5:01 PM

Hmm, I'll try the Llama notebook then, thanks.

lucid merlin Feb 8, 2025, 5:02 PM

You may have had bad luck with the instance. Google does sometimes partition T4s virtually.
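One way to check what the instance actually provides is to query PyTorch in a fresh runtime, before loading any model. This sketch (`describe_gpu` is a hypothetical helper) reports the device name, total memory, and currently free memory via `torch.cuda.get_device_properties` and `torch.cuda.mem_get_info`, and returns `None` when PyTorch or a CUDA device is unavailable. For comparison, the OOM messages in this thread all report a total capacity of 14.74 GiB.

```python
def describe_gpu(device: int = 0):
    """Return name, total and free memory (GiB) for a CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch not installed
    if not torch.cuda.is_available():
        return None  # no CUDA device visible to this process
    props = torch.cuda.get_device_properties(device)
    # Free memory as seen by the CUDA driver, so usage by other processes is accounted for.
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return {
        "name": props.name,
        "total_gib": props.total_memory / 2**30,
        "free_gib": free_bytes / 2**30,
    }


info = describe_gpu()
print(info if info is not None else "No CUDA device available")
```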

muted island Feb 8, 2025, 7:15 PM

I guess I'm really unlucky: I tried running it in the Llama GRPO notebook and it still doesn't work.

CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 4.12 MiB is free. Process 14873 has 14.73 GiB memory in use. Of the allocated memory 14.62 GiB is allocated by PyTorch, and 1.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
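The OOM messages in this thread share the same shape, so a small parser makes them easier to compare. `parse_cuda_oom` and its field names are hypothetical, written for this sketch; it pulls each size out with a regular expression and converts it to GiB. Applied to the message above, it shows that both the free memory and the reserved-but-unallocated memory are a few MiB while PyTorch has about 14.6 GiB allocated, which points to live allocations filling the card rather than fragmentation.

```python
import re

# Conversion factors to GiB for the units PyTorch prints.
_TO_GIB = {"KiB": 1 / 2**20, "MiB": 1 / 2**10, "GiB": 1.0}
_SIZE = r"([\d.]+) (KiB|MiB|GiB)"

# Hypothetical field names; each pattern matches one phrase of the OOM message.
_FIELDS = {
    "requested": rf"Tried to allocate {_SIZE}",
    "total": rf"total capacity of {_SIZE}",
    "free": rf"of which {_SIZE} is free",
    "allocated": rf"{_SIZE} is allocated by PyTorch",
    "reserved_unallocated": rf"{_SIZE} is reserved by PyTorch but unallocated",
}


def parse_cuda_oom(message: str) -> dict:
    """Extract the sizes (in GiB) from a PyTorch 'CUDA out of memory' message."""
    sizes = {}
    for field, pattern in _FIELDS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            sizes[field] = float(value) * _TO_GIB[unit]
    return sizes


oom = ("CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.74 GiB "
       "of which 4.12 MiB is free. Process 14873 has 14.73 GiB memory in use. Of the allocated memory "
       "14.62 GiB is allocated by PyTorch, and 1.74 MiB is reserved by PyTorch but unallocated.")
for field, gib in parse_cuda_oom(oom).items():
    print(f"{field:>22}: {gib * 1024:10.2f} MiB")
```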
