# Load LoRA Adapters and use for inference

fiery rock Feb 23, 2025, 1:33 PM

I trained a LoRA and have the saved checkpoint.
When I try to load it like this, I get an OOM error:

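import torch  # needed for torch.bfloat16 below
from unsloth import FastLanguageModel

# Load the saved LoRA checkpoint directory (the base model is resolved from adapter_config.json)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "outputs/Falcon3-7B/20250223_000456/checkpoint-1210",
    max_seq_length = 1024*4,
    dtype = torch.bfloat16,
    load_in_4bit = True,
)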

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA GeForce RTX 4070 SUPER. Max memory: 11.994 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 100%|| 4/4 [00:11<00:00,  2.86s/it]
Traceback (most recent call last):
  File "/workspace/recipes/test-checkpoint.py", line 32, in <module>
    model, tokenizer = FastLanguageModel.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/unsloth/models/loader.py", line 354, in from_pretrained
    model = PeftModel.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/peft/peft_model.py", line 581, in from_pretrained
    load_result = model.load_adapter(
                  ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/peft/peft_model.py", line 1235, in load_adapter
    adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/peft/utils/save_and_load.py", line 571, in load_peft_weights
    adapters_weights = safe_load_file(filename, device=device)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/safetensors/torch.py", line 315, in load_file
    result[k] = f.get_tensor(k)
                ^^^^^^^^^^^^^^^
RuntimeError: CUDA driver error: out of memory

It seems the LoRA was saved in fp32 although I didn't want it to be; a similar Llama-3.1-8B LoRA, also with rank 64, was just half the size.
There was a previous post with the same problem, but it had no solution.
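One way to check the fp32 suspicion without touching the GPU is to read only the header of `adapter_model.safetensors`. A `.safetensors` file starts with an 8-byte little-endian header length followed by a JSON header that lists each tensor's dtype, shape, and byte offsets. The sketch below uses only the standard library; the helper names `read_safetensors_header` and `summarize` are made up for this example and are not part of Unsloth or PEFT.

```python
import json
import struct
from collections import defaultdict


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header, top=5):
    """Return (bytes per dtype, the `top` largest tensors as (name, dtype, shape, nbytes))."""
    per_dtype = defaultdict(int)
    tensors = []
    for name, info in header.items():
        start, end = info["data_offsets"]
        nbytes = end - start
        per_dtype[info["dtype"]] += nbytes
        tensors.append((name, info["dtype"], info["shape"], nbytes))
    # Largest tensors first, so oversized entries stand out.
    tensors.sort(key=lambda t: t[3], reverse=True)
    return dict(per_dtype), tensors[:top]
```

Calling `summarize(read_safetensors_header("outputs/Falcon3-7B/20250223_000456/checkpoint-1210/adapter_model.safetensors"))` would show how many bytes are stored as `F32` versus `BF16` and which tensors dominate the file.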

outputs/Falcon3-7B/
└── 20250223_000456
    └── checkpoint-1210
        ├── README.md
        ├── adapter_config.json
        ├── adapter_model.safetensors
        ├── optimizer.pt
        ├── rng_state.pth
        ├── scheduler.pt
        ├── special_tokens_map.json
        ├── tokenizer.json
        ├── tokenizer_config.json
        ├── trainer_state.json
        └── training_args.bin

total 8343908
drwxr-xr-x 1 root root        512 Feb 23 05:17 .
drwxr-xr-x 1 root root        512 Feb 23 12:52 ..
-rw-r--r-- 1 root root       5096 Feb 23 05:16 README.md
-rw-r--r-- 1 root root        828 Feb 23 05:17 adapter_config.json
-rw-r--r-- 1 root root 5540241880 Feb 23 05:17 adapter_model.safetensors
-rw-r--r-- 1 root root 2993698684 Feb 23 05:17 optimizer.pt
-rw-r--r-- 1 root root      14244 Feb 23 05:17 rng_state.pth
-rw-r--r-- 1 root root       1064 Feb 23 05:17 scheduler.pt
-rw-r--r-- 1 root root        718 Feb 23 05:17 special_tokens_map.json
-rw-r--r-- 1 root root    9780960 Feb 23 05:17 tokenizer.json
-rw-r--r-- 1 root root     362767 Feb 23 05:17 tokenizer_config.json
-rw-r--r-- 1 root root      24071 Feb 23 05:17 trainer_state.json
-rw-r--r-- 1 root root       5752 Feb 23 05:17 training_args.bin

I did a fresh install with the latest versions just yesterday.

While we're at it, how would I load the LoRA for further finetuning on new data?

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "tiiuae/Falcon3-7B-Base",
  "bias": "none",
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 64,
  "lora_bias": false,
  "lora_dropout": 0,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": [
    "lm_head",
    "embed_tokens"
  ],
  "peft_type": "LORA",
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "q_proj",
    "k_proj",
    "o_proj",
    "gate_proj",
    "down_proj",
    "up_proj",
    "v_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
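A note on the config above: with `modules_to_save` set to `lm_head` and `embed_tokens`, PEFT stores full copies of those two modules in the adapter file next to the LoRA matrices, so they may account for much of the adapter's size. The back-of-the-envelope sketch below shows the arithmetic. The vocabulary size, hidden size, and layer shape are placeholder values chosen for illustration, not Falcon3-7B's actual dimensions, and the function names are made up for this example.

```python
def lora_param_count(r, shapes):
    """Parameters added by LoRA: each adapted (d_in, d_out) linear layer
    gets A with shape (r, d_in) and B with shape (d_out, r)."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)


def full_module_param_count(vocab_size, hidden_size, n_modules=2):
    """Parameters of full module copies kept via modules_to_save
    (here assumed to be embed_tokens and lm_head, each vocab_size x hidden_size)."""
    return n_modules * vocab_size * hidden_size


def size_gb(n_params, bytes_per_param):
    """Storage size in GB (10^9 bytes) for a given parameter count and dtype width."""
    return n_params * bytes_per_param / 1e9


# Placeholder dimensions (hypothetical, NOT read from the real model config).
VOCAB, HIDDEN, RANK = 100_000, 4_000, 64

full = full_module_param_count(VOCAB, HIDDEN)        # embed_tokens + lm_head
lora = lora_param_count(RANK, [(HIDDEN, HIDDEN)])    # one square projection

print(f"full copies: {size_gb(full, 4):.2f} GB in fp32, {size_gb(full, 2):.2f} GB in bf16")
print(f"one LoRA pair at r={RANK}: {size_gb(lora, 4) * 1000:.2f} MB in fp32")
```

With the real values from the base model's `config.json`, the same arithmetic indicates whether the saved embeddings and output head, rather than the rank-64 matrices, explain the multi-gigabyte file.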

I also tried to load it with just HF, but it says it's incompatible because of the added pad_token in the embeddings.

hasty panther Feb 24, 2025, 12:18 PM

Apart from the OOM: is there a reason why you want to train a very old model?

Even Llama 3.2 at a lower parameter count should already be way stronger.