#Gemma support


silver locust
#

I fine-tuned a Gemma model and I am trying to build a Gradio inference app on Hugging Face. When I load the model I get this error: AttributeError: 'Gemma3Config' object has no attribute 'max_position_embeddings'

It seems unsloth.FastLanguageModel currently only supports LLaMA-style architectures, but I could be wrong. Any solutions, please?

trail cragBOT
#

The error AttributeError: 'Gemma3Config' object has no attribute 'max_position_embeddings' means the loader tried to read max_position_embeddings from the top level of the Gemma 3 config and did not find it there. This is a known issue and not specific to Unsloth; it can occur when the config or the Transformers version is out of sync with what the model requires. Updating to the latest Unsloth, unsloth_zoo, and Transformers (preferably 4.50.0+ or the Gemma-3 branch) usually resolves it. If the error persists, check that your config.json includes the max_position_embeddings key, or add it manually to match your model's context length (e.g., 131072 for the 128K-context Gemma 3 models).
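A quick way to check a config for the key is sketched below. It assumes that multimodal Gemma 3 configs keep the text settings in a nested "text_config" block, so the key may exist there even when it is absent at the top level. The helper name and the sample dictionaries are hypothetical and are not copied from a real checkpoint.

```python
def find_max_position_embeddings(config: dict):
    """Return max_position_embeddings from a config dict, or None if absent.

    Checks the top level first, then a nested "text_config" block
    (assumption: composite configs store text settings there).
    """
    if "max_position_embeddings" in config:
        return config["max_position_embeddings"]
    text_config = config.get("text_config") or {}
    return text_config.get("max_position_embeddings")


# Illustrative fragments only; load your real file with json.load(open("config.json"))
flat = {"model_type": "llama", "max_position_embeddings": 4096}
nested = {"model_type": "gemma3", "text_config": {"max_position_embeddings": 131072}}
missing = {"model_type": "gemma3"}

for cfg in (flat, nested, missing):
    print(cfg["model_type"], find_max_position_embeddings(cfg))
```

If the helper returns None for your config.json, the key is missing everywhere and the file (or the repo it came from) is worth a closer look.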

Unsloth's FastLanguageModel does support Gemma 3 models, but earlier versions had patching issues. Make sure you are using the latest Unsloth and unsloth_zoo, and reinstall if needed. For Gradio inference on Hugging Face, ensure your environment uses the correct versions and that the model config is up to date. If you still encounter issues, see the official Gemma 3 fine-tuning guide and related GitHub issues for troubleshooting steps.

Would you like a step-by-step guide to update your environment and config?


silver locust
#

@trail crag when saving the model it doesn't come with a config file. Here is the error:

File "/home/user/app/app.py", line 28, in <module>
model, tokenizer = FastLanguageModel.from_pretrained(
File "/home/user/.pyenv/versions/3.10.18/lib/python3.10/site-packages/unsloth/models/loader.py", line 412, in from_pretrained
model, tokenizer = dispatch_model.from_pretrained(
File "/home/user/.pyenv/versions/3.10.18/lib/python3.10/site-packages/unsloth/models/llama.py", line 1930, in from_pretrained
model_max_seq_length = model_config.max_position_embeddings
File "/home/user/.pyenv/versions/3.10.18/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
return super().__getattribute__(key)
AttributeError: 'Gemma3Config' object has no attribute 'max_position_embeddings'

earnest bobcat
#

ah, because you uploaded only the LoRA here
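One way to confirm this on a local copy of the repo is sketched below. It relies on the PEFT convention that an adapter folder contains an adapter_config.json whose "base_model_name_or_path" field names the base model, while a full model folder has its own config.json. The function name and the returned labels are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def describe_repo(folder):
    """Classify a model folder as adapter-only, full model, or both."""
    folder = Path(folder)
    adapter_cfg = folder / "adapter_config.json"
    has_full_config = (folder / "config.json").exists()
    if adapter_cfg.exists():
        # PEFT records the base model the adapter was trained on
        base = json.loads(adapter_cfg.read_text()).get("base_model_name_or_path")
        kind = "adapter_and_model" if has_full_config else "adapter_only"
        return {"kind": kind, "base_model": base}
    return {"kind": "full_model" if has_full_config else "unknown", "base_model": None}


# Demo on a temporary folder that mimics a LoRA-only upload
with tempfile.TemporaryDirectory() as tmp:
    sample = {"base_model_name_or_path": "unsloth/gemma-3-4b-it-unsloth-bnb-4bit"}
    (Path(tmp) / "adapter_config.json").write_text(json.dumps(sample))
    info = describe_repo(tmp)

print(info)
```

An "adapter_only" result means the base model named in "base_model" has to be loaded first and the LoRA applied on top of it.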

silver locust
#

Ooh, thank you for pointing this out... Please, how do I solve this 😢 Do I need to retrain?

silver locust
#

Oh okay, I have the model but not the notebooks. I think I will have to figure out how to either merge or use direct PEFT loading. I would appreciate a blog post or a video on this. Thank you

earnest bobcat
#

you might find something on youtube

#
```python
import threading

import torch
from unsloth import FastLanguageModel
from transformers import TextIteratorStreamer
from peft import PeftModel

# Load the base model the LoRA was trained on
base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it-unsloth-bnb-4bit",  # <-- Replace with your base model (it will be in your adapter_config.json)
    max_seq_length=2048,
    dtype=torch.float16,
    load_in_4bit=True,
)

# Use PEFT to apply the LoRA adapter on top of the base model
lora_model = PeftModel.from_pretrained(
    base_model,
    "tieubaoca/gemma-3-4b-it-unsloth-bnb-4bit-finetune-vi-alpaca-lora",  # <-- Replace with your LoRA
)

FastLanguageModel.for_inference(lora_model)


def generate_streaming(model, tokenizer, message):
    """Stream a single-turn reply to stdout and return the full text."""
    # Chat template input: one user turn with a text part
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": message}],
    }]

    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        tokenize=True,
        return_dict=True,
    ).to("cuda")

    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Sampler settings
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        top_k=40,
        use_cache=False,
    )

    # Run generation in a background thread so the streamer can be read here
    def generate():
        model.generate(**generation_kwargs)

    thread = threading.Thread(target=generate)
    thread.start()

    print(f"User: {message}")
    print("Assistant: ", end="", flush=True)

    full_response = ""
    for new_token in streamer:
        if new_token:
            print(new_token, end="", flush=True)
            full_response += new_token

    thread.join()
    print()
    return full_response


# Inference
user_message = "What is the meaning of life?"
response = generate_streaming(lora_model, tokenizer, user_message)
```
#

here is a quick vibe-coded script that I found to work

#

it's not merging the base model but using PEFT to load the LoRA
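Since the original goal was a Gradio app, here is a minimal sketch of how the streaming loop from the script above could be turned into a chat callback. It is untested against a real Space and makes a few assumptions: the model and tokenizer are already loaded as in the script, a GPU is available, and only the latest message is sent to the model (the chat history is ignored). The names accumulate and make_respond are hypothetical helpers.

```python
import threading


def accumulate(chunks):
    """Yield the growing response text as chunks arrive, skipping empty ones."""
    text = ""
    for chunk in chunks:
        if chunk:
            text += chunk
            yield text


def make_respond(model, tokenizer):
    """Build a (message, history) generator callback around a loaded model."""
    # Imported lazily so this file can be imported without transformers installed
    from transformers import TextIteratorStreamer

    def respond(message, history):
        # Single-turn prompt; history is intentionally ignored in this sketch
        messages = [{"role": "user", "content": [{"type": "text", "text": message}]}]
        inputs = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_tensors="pt",
            return_dict=True,
        ).to("cuda")
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        worker = threading.Thread(
            target=model.generate,
            kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256),
        )
        worker.start()
        # Each yield replaces the partial reply shown in the chat window
        yield from accumulate(streamer)
        worker.join()

    return respond


# The accumulator can be tried without a model by feeding it plain strings
partials = list(accumulate(["Hel", "", "lo", " world"]))
print(partials)
```

With the model loaded, the callback could be wired up with something like `gr.ChatInterface(make_respond(lora_model, tokenizer)).launch()`, since Gradio's ChatInterface accepts a generator function that yields partial responses.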

silver locust
#

Thank you very much Lee! I appreciate your efforts to help. Thank you