#Facing Error during inference: Cache only has 0 layers, attempted to access layer with index 0


proud silo Mar 22, 2025, 8:40 AM

Hi, can anyone help me, please?

I fine-tuned Llama 3.2 3B Instruct (4-bit) with Unsloth in a Jupyter notebook.
I saved the LoRA adapter and tested inference in the same notebook and the same environment. Inference worked with FastLanguageModel.
But when I deploy it on a standalone Linux server, inference raises this error. There are no code changes; everything is the same as in Jupyter.
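
Since the code is identical in both places, one thing that may be worth ruling out is a difference in installed package versions between the notebook environment and the server environment. The snippet below is only a minimal sketch for collecting those versions so the two outputs can be compared. The package list is an assumption based on the libraries that appear in the traceback (plus bitsandbytes, because the model is 4-bit), and `installed_versions` is a hypothetical helper name, not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed list: libraries seen in the traceback, plus bitsandbytes for 4-bit loading.
PACKAGES = ["unsloth", "transformers", "peft", "torch", "bitsandbytes"]


def installed_versions(packages=PACKAGES):
    """Return a dict mapping each package name to its installed version.

    A package that is not installed in the current environment maps to None.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per package so the output is easy to diff between machines.
    for name, ver in installed_versions().items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Running this once in the Jupyter environment and once on the server, then diffing the two outputs, would show whether the environments really match.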

My inference code:
import unsloth
from unsloth import FastLanguageModel
import torch
import time
import json

max_seq_length = 2048
dtype = None
load_in_4bit = True

BASE_MODEL_ID = "unsloth/llama-3.2-3b-instruct-bnb-4bit"
ADAPTER = "model/adapter"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = ADAPTER,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

FastLanguageModel.for_inference(model) # Enable native 2x faster in
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        #use_cache = True,
        max_length=2048,
        temperature=0.1,
        top_p=0.85,
        top_k=50,
        repetition_penalty=1.2,
        do_sample=True,
    )

THE ERROR IS:
Cache only has 0 layers, attempted to access layer with index 0

Full traceback:

Traceback (most recent call last):
  File "/home/fastuser/api/api/test.py", line 69, in <module>
    outputs = model.generate(
  File "/home/fastuser/anaconda3/envs/infer_env/lib/python3.10/site-packages/peft/peft_model.py", line 1874, in generate
    outputs = self.base_model.generate(*args, **kwargs)
  File "/home/fastuser/anaconda3/envs/infer_env/lib/python3.10/site-packages/unsloth/models/llama.py", line 1579, in unsloth_fast_generate
    output = self._old_generate(*args, **kwargs)
  File "/home/fastuser/anaconda3/envs/infer_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/fastuser/anaconda3/envs/infer_env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2326, in generate
    result = self._sample(
  File "/home/fastuser/anaconda3/envs/infer_env/lib/python3.10/site-packages/transformers/generation/utils.py", line 3286, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/home/fastuser/anaconda3/envs/infer_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/fastuser/anaconda3/envs/infer_env/lib/python3.10/site-p