Hi, can anyone help me, please?
I fine-tuned Llama 3.2 3B Instruct (4-bit) with Unsloth in a Jupyter notebook.
I saved the LoRA adapter and tested inference in the same notebook and the same environment. Inference worked with FastLanguageModel.
But when I deploy it on a standalone Linux server, inference raises the error shown at the end of this post. There are no code changes; everything is the same as in the notebook.
My inference code:
import unsloth
from unsloth import FastLanguageModel
import torch
import time
import json
max_seq_length = 2048
dtype = None
load_in_4bit = True
BASE_MODEL_ID = "unsloth/llama-3.2-3b-instruct-bnb-4bit"
ADAPTER = "model/adapter"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = ADAPTER,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        #use_cache = True,
        max_length=2048,
        temperature=0.1,
        top_p=0.85,
        top_k=50,
        repetition_penalty=1.2,
        do_sample=True,
    )
The error is:
Cache only has 0 layers, attempted to access layer with index 0
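
Since the code is identical in both places, my guess is that the notebook environment and the server have different package versions installed. That is an assumption; I have not confirmed it. Below is a minimal sketch I can run on both machines to compare them. The package list and the helper names `installed_versions` and `diff_versions` are my own choices for illustration, not part of Unsloth.

```python
from importlib import metadata

# Packages assumed to be relevant to this setup; adjust the list as needed.
PACKAGES = [
    "unsloth", "unsloth_zoo", "transformers", "peft",
    "torch", "bitsandbytes", "accelerate", "trl",
]


def installed_versions(packages=PACKAGES):
    """Return {package: version or 'not installed'} for the current environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def diff_versions(a, b):
    """Return {package: (version_in_a, version_in_b)} for entries that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Print in a pip-like format so the output of both machines can be compared.
    for name, version in installed_versions().items():
        print(f"{name}=={version}")
```

Running this in the notebook kernel and on the server, then passing the two resulting dictionaries to `diff_versions`, should show which packages differ between the two environments, if any.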