#Issue with GGUF Conversion for Merged 4-bit Gemma 3 Model


frigid ridge
#

I'm trying to convert a finetuned Gemma 3 27B model (originally loaded as 4-bit with Unsloth) to Q8_0 GGUF format using the manual llama.cpp conversion script, but I'm running into an error.

Here's my setup and what I'm doing:

  1. Base Model: "unsloth/gemma-3-27b-it-unsloth-bnb-4bit"

  2. Initial Loading: FastModel.from_pretrained with load_in_4bit=True and max_seq_length=13500.

  3. LoRA Setup: FastModel.get_peft_model with r=8.

  4. Training: SFTTrainer.train() completed successfully (with correct max_seq_length=13500).

  5. Adapter Save: model.save_pretrained("gemma-3-27b-quant-4bit-adapters") and tokenizer.save_pretrained(...) completed successfully.

  6. Load Saved Adapters: In a new session, I loaded the finetuned model using FastModel.from_pretrained("gemma-3-27b-quant-4bit-adapters", ...) with max_seq_length=13500 and load_in_4bit=True. This also worked.

  7. Manual Merge & Save to 16-bit: I successfully used merged_model = model.merge_and_unload() followed by merged_model.save_pretrained("gemma-3-27b-quant-4bit-merged-16bit"). This directory contains config.json and the merged safetensor shards.

  8. Copied Tokenizer: Copied the necessary tokenizer files from gemma-3-27b-quant-4bit-adapters/ to gemma-3-27b-quant-4bit-merged-16bit/. (Confirmed they are there).

  9. Manual GGUF Conversion Attempt: Ran the llama.cpp/convert_hf_to_gguf.py script pointing to the merged 16-bit directory:
    python llama.cpp/convert_hf_to_gguf.py "gemma-3-27b-quant-4bit-merged-16bit" --outfile "gemma-3-27b-gguf-q8_0.gguf" --outtype Q8_0

Error Encountered:

ValueError: Can not map tensor 'model.layers.0.mlp.down_proj.weight.absmax'

Environment:

Unsloth/Unsloth-Zoo Source: Installed from git+https://github.com/unslothai/unsloth.git (most recent version as of yesterday).
PyTorch: 2.4.1+cu124
CUDA: 12.4
GPU: NVIDIA A100-SXM4-80GB
OS: Linux (RunPod)
Transformers: 4.51.3

Is this a known bug, and are there any workarounds or fixes available in the latest code?
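A note on the error itself: an `.absmax` suffix is typical of bitsandbytes 4-bit quantization state stored next to a packed weight, which suggests the saved directory may still hold quantized tensors rather than plain 16-bit weights. Below is a minimal sketch for spotting such names. The suffix list is an assumption based on common bitsandbytes checkpoints and may not be exhaustive, and `find_bnb_state_tensors` is a hypothetical helper, not part of any library.

```python
# Name fragments that bitsandbytes 4-bit checkpoints commonly attach to a weight.
# Assumption: illustrative list only, not an exhaustive specification.
BNB_STATE_MARKERS = (
    ".absmax",
    ".quant_map",
    ".nested_absmax",
    ".nested_quant_map",
    ".quant_state.",
)


def find_bnb_state_tensors(tensor_names):
    """Return the tensor names that look like bitsandbytes quantization state."""
    return [n for n in tensor_names if any(m in n for m in BNB_STATE_MARKERS)]


# Example names; the second one is the tensor from the error message above.
names = [
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.mlp.down_proj.weight.absmax",
    "model.layers.0.input_layernorm.weight",
]
print(find_bnb_state_tensors(names))
```

If this returns anything for the tensor names of a supposedly merged checkpoint, the conversion script is probably being fed a still-quantized model.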

cunning sundial
#

Hello yes

#
  1. Note that if you already had unsloth and unsloth_zoo installed and you're installing directly from the GitHub repo, you need to pass --force-reinstall, or else pip won't overwrite your current installation
#

therefore

#
pip install --force-reinstall git+https://github.com/unslothai/unsloth.git
pip install --force-reinstall git+https://github.com/unslothai/unsloth-zoo.git
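To confirm the reinstall actually took effect, one option is to print the installed versions from a fresh kernel. This is a small sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package names passed in are assumed to match the installed distribution names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed distribution names; adjust if pip lists them differently.
print(installed_versions(["unsloth", "unsloth_zoo"]))
```

Restart the notebook kernel before checking, since an already-imported module keeps the old code in memory.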
#

2- The issue was with llama.cpp. While they've updated the conversion script, support for Gemma3 is still experimental. That's why the manual conversion still fails

#

3- As a temporary workaround, follow the step-by-step instructions below

#

After saving the merged model (save_pretrained_merged), place your Gemma3 Ollama Modelfile in the same folder as the model.
Here's an example of what the Modelfile might look like:

FROM ./

PARAMETER temperature 1
PARAMETER top_k 64
PARAMETER top_p 0.95
PARAMETER min_p 0.0
PARAMETER num_ctx 8192
PARAMETER stop "<end_of_turn>"

TEMPLATE """
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if or (eq .Role "user") (eq .Role "system") }}<start_of_turn>user
{{ .Content }}<end_of_turn>
{{ if $last }}<start_of_turn>model
{{ end }}
{{- else if eq .Role "assistant" }}<start_of_turn>model
{{ .Content }}{{ if not $last }}<end_of_turn>
{{ end }}
{{- end }}
{{- end }}"""

Adjust the file as necessary.
Then, inside the same folder, run the following (assuming you want to call the model "gemma3-finetune" in Ollama):

ollama create --quantize q5_K_M gemma3-finetune -f ./Modelfile

it should now be available in ollama

cunning sundial
#

i just tried converting a non quantized model using llama.cpp/convert_hf...... and it worked out

#

wondering if it has to do with the quantization though

#

it's not working manually because you need to dequantize the model first

#

that's what unsloth did under the hood when save_gguf (forgot the name of the method) was called

#

give us a bit of time. we might actually solve this earlier than initially thought

#

I'll notify you when we do

frigid ridge
#

Awesome thanks very much! This was the method I used to get the model to merge and save in 16bit:


merged_model.save_pretrained("gemma-3-27b-quant-4bit-merged-16bit")

It looks like some of the 16bit model files actually contain 8bit integers after the merged_model.save_pretrained process, causing an error when trying to use the workaround (ollama create...) you supplied. Please see this error towards the bottom of the screenshot:

'time=2025-04-24T15:23:30.370Z level=ERROR source=create.go:162 msg="error converting from safetensors" error="unknown data type: U8"'

Am I correct in diagnosing this here? 

Or have I incorrectly merged the model to 16bit before running the 'ollama create' stuff?
cunning sundial
#

mm

#

no

#
save_pretrained

only saves the adapter; it doesn't merge it with the base_model
you'll need to follow it up with

save_pretrained_merged()

to merge the adapter and save the model

frigid ridge
#

So the bit I used here:

merged_model = model.merge_and_unload()

Does that do the merging correctly before saving with 'save_pretrained'?

Maybe 'merge_and_unload' is the method that is incorrectly dequantizing from 4bit to 16bit?

Apologies for the confusion!

cunning sundial
#

not incorrectly

#

it does dequantize so it can merge

#

cause your adapter is in 16bits

#

your base_model in nf4 or something

#

they have to be the same dtype to merge

#

i think maybe when you merge you need to specify F32 or f16

#

ollama create --quantize can only accept f32 or f16 models

frigid ridge
#

Ok I will try that, thank you 🙂

frigid ridge
# cunning sundial i think maybe when you merge you need to specify F32 or f16

I tried specifying as fp16 during the save of the merged model, but unfortunately I am getting the same error:

"error converting from safetensors" error="unknown data type: U8"

This is the code I used to specify fp16:

# Perform the merge operation in memory
merged_model = model.merge_and_unload()

# Added torch_dtype=torch.float16 to specify the output format
merged_model.save_pretrained("gemma-3-27b-quant-4bit-merged-16bit", tokenizer, torch_dtype=torch.float16)

Are there any other methods to get the model merged correctly?

cunning sundial
#

use save_pretrained_merged instead of what you're doing now.

don't use merge_and_unload, don't use save_pretrained

#

and i assume you're loading from disk, so i am not sure what the model is loading as, cause it highly depends on what happened when you last saved

#

load the model as merged_model or whatever you wanna call it , then

print(type(merged_model))

what does it give you?

frigid ridge
cunning sundial
#

oh it's loading the adapter, that's not a merged model

#

do you mind, in a cell by itself, just typing

model
#

and run it

#

i mean merged_model or whatever you called it

#

see if there are layers with the words

lora_A
lora_B

or anything lora related

frigid ridge
cunning sundial
#

yes this is not merged

#

that's why you see in the model structure the lora_A and lora_B as separate layers

#

once you merge, they'd be gone
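The same check can be done programmatically instead of reading the printed model structure. This is a minimal sketch: `lora_param_names` is a hypothetical helper, and the example list stands in for `model.named_parameters()`, with parameter names that are illustrative rather than copied from a real run.

```python
def lora_param_names(named_parameters):
    """Return names of parameters that still belong to LoRA layers."""
    return [name for name, _ in named_parameters if "lora_" in name]


# Stand-in for model.named_parameters(); with a real model, pass that instead.
example = [
    ("model.layers.0.self_attn.q_proj.base_layer.weight", None),
    ("model.layers.0.self_attn.q_proj.lora_A.default.weight", None),
    ("model.layers.0.self_attn.q_proj.lora_B.default.weight", None),
]

leftover = lora_param_names(example)
if leftover:
    print(f"not merged, {len(leftover)} LoRA tensors remain")
else:
    print("no LoRA tensors found")
```

An empty result only says the LoRA layers are gone; it does not say anything about the dtype of the remaining weights.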

frigid ridge
#

so the next step here is to use model.save_pretrained_merged? apologies for my confusion

cunning sundial
#

give me a sec , brb

#

what files do you see in the directory where the model is saved?

frigid ridge
cunning sundial
#

yes that's the adapter

#

every time you see adapter_config, adapter_model, that's just the LoRA adapter

frigid ridge
#

you mean the merged model i saved using merge_and_unload?

cunning sundial
#

no , it wasn't merged

#

If it was merged, you'd see

  • a config.json
  • model.safetensors
  • no adapter_config.json
  • no adapter_model.safetensors
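The file checklist above can be sketched as a small directory check. `classify_checkpoint_dir` is a hypothetical helper; it also treats sharded weights such as `model-00001-of-00002.safetensors` as a full model, which is an assumption about how large checkpoints are usually split.

```python
import tempfile
from pathlib import Path


def classify_checkpoint_dir(path):
    """Guess whether a saved directory holds a LoRA adapter or a full model."""
    files = {p.name for p in Path(path).iterdir() if p.is_file()}
    has_adapter = "adapter_config.json" in files or any(
        f.startswith("adapter_model") for f in files
    )
    has_weights = any(
        f.startswith("model") and f.endswith(".safetensors") for f in files
    )
    if has_adapter:
        return "adapter"
    if "config.json" in files and has_weights:
        return "full model"
    return "unknown"


# Demo on a throwaway directory that mimics a sharded full-model save.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "model-00001-of-00002.safetensors"):
        (Path(tmp) / name).touch()
    print(classify_checkpoint_dir(tmp))  # full model
```

This only separates adapter folders from full-model folders. A "full model" folder can still contain quantized tensors, so it is not proof of a clean 16-bit merge.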
frigid ridge
#

this is the merged model i saved using merged_model = model.merge_and_unload() -> merged_model.save_pretrained:

#

the Modelfile.txt was created per your suggestion

cunning sundial
#

just Modelfile

frigid ridge
#

oh ok, I'll try that for the workaround solution, thanks 🙂

cunning sundial
#

i don't think that's the issue. it's just a technicality. but yes try

#

i think you need to reload and remerge the model differently <--- this might work . that's what i am suggesting

frigid ridge
frigid ridge
cunning sundial
#

yes save_pretrained_merged

frigid ridge
#

model.save_pretrained_merged gives me this error:

merge_and_overwrite_lora() got an unexpected keyword argument 'save_method'

This was also happening yesterday, so I tried the workaround with 'merged_model = model.merge_and_unload() -> merged_model.save_pretrained'

cunning sundial
#

maybe if you actually follow what i'm telling you , we'd solve this faster?

frigid ridge
#

here is the screenshot for more context, if it is helpful:

cunning sundial
#

did i say

merge_and_overwrite_lora

?
#

ah ok

#

damn it

#

don't just show me how you're saving

#

how are you loading

#

...

#

😐

frigid ridge
#

like this (i commented out the inferencing part, just kept the loading part provided by Unsloth):

cunning sundial
#

ok i'll be back

#

also, when you initially loaded the model before finetuning, did you load in 4 bits?

#

and when you initially saved it using save_pretrained? did you specify any specific dtype

#

?

frigid ridge
cunning sundial
#

what was the base model you started from?

#

exactly

#

at the beginning

frigid ridge
#

unsloth/gemma-3-27b-it-unsloth-bnb-4bit

cunning sundial
#

ok

cunning sundial
#

i am trying to reproduce your actual steps

#

and potentially the same error

#

and a fix

#

unrelated to your original question, but did you use the latest notebook where the formatting func applies

removeprefix()

?

frigid ridge
#

thanks so much for your ongoing help by the way, i'll look into that and get back to you

cunning sundial
#

did you train on text or vision?

frigid ridge
#

Yes I am using the latest notebook version where the formatting function includes .removeprefix('<bos>').

frigid ridge
#

text

cunning sundial
#

ok

#

so weirdly enough , i tried on gemma-4b-bnb-4bits

#

trained

#

repeated your steps

#

merge_and_unload then save_pretrained should work, because that's what the unsloth save_pretrained_merged seems to be calling anyway

#

i also tried reloading from the adapter and doing save_pretrained_merged, it also worked

#

and gemma3-4b-it-bnb-4b <-- -sorry forgot the exact name,

#

There is one more thing i'm trying.... trying to see if FastModel behaves differently than FastLanguageModel

I used FastLanguageModel in my trial run

frigid ridge
#

merge_and_unload then save_pretrained works for me too, but I think it is still saving some 8bit information where it should just be creating a purely 16bit merged model. That's my guess, because when I use that 16bit merged, saved model as the input to your 'ollama create' workaround, it gives an error saying 8bit datatypes are present:

"error converting from safetensors" error="unknown data type: U8"

cunning sundial
#

yes just waiting on my wireless mouse to recharge

#

just ran out of battery

frigid ridge
#

no problem 🙂

cunning sundial
#

we can actually figure it out. i'll tell you how when i get back online.

#

we'll see which layers are being saved in 8 bits

frigid ridge
#

awesome sounds good

cunning sundial
#

it might be that the base_model you're using, the bnb 4-bit quantized one, saves certain layers in 8 bits

#

originally

#

but we'll figure that out

#

😄

frigid ridge
#

The merged, saved 16bit model that I think might be 'corrupted' with 8bit information - I have it saved locally (big ~22GB folder).

Would this help at all if I sent it to you to take a look at? Or can I simply run commands in my remote RunPod notebook to search through the model and check for U8 dtypes?
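Checking for U8 tensors does not require loading the model at all, since a .safetensors file starts with an 8-byte little-endian header length followed by a JSON header that lists each tensor's dtype. The sketch below reads only that header using the standard library; `safetensors_dtypes` and `scan_dir` are hypothetical helper names.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def safetensors_dtypes(path):
    """Read only the JSON header of a .safetensors file: tensor name -> dtype."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian size
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: info["dtype"]
        for name, info in header.items()
        if name != "__metadata__"
    }


def scan_dir(model_dir):
    """Count dtypes across all shards and list the tensors stored as U8."""
    counts, u8_names = Counter(), []
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        for name, dtype in safetensors_dtypes(shard).items():
            counts[dtype] += 1
            if dtype == "U8":
                u8_names.append(name)
    return counts, u8_names
```

Calling `scan_dir("gemma-3-27b-quant-4bit-merged-16bit")` in the RunPod notebook would return the per-dtype tensor counts plus the names of any U8 tensors, without uploading the ~22GB folder anywhere.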

viscid lotus
#

(simple message to pin this for later reference)

cunning sundial
#

ok so if we start from the top

#

this model

google/gemma-3-4b-it

loaded using transformers without any quantization, all the layers are in 32 bits
#

makes sense so far

#

if i quantize it using BitsandBytes with this configuration

model_kwargs = {
    "quantization_config": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    "torch_dtype": torch.bfloat16,
    "attn_implementation": "flash_attention_2",
    "device_map": "auto",
}
#

basically loading it in bnb-4bits. (specifically NF4)

#

this is what happens to the parameters

#
Parameter counts by dtype:
  torch.bfloat16: 680,366,448 parameters
  torch.uint8: 1,809,856,512 parameters
#

some layers are in uint8

#

If now instead i load it using FastModel and load_in_4bit=True

#

and wait for my slow GPU to finish ohreally

frigid ridge
#

ok interesting! just to confirm, you are mindful i was using this 27B model yes? 'unsloth/gemma-3-27b-it-unsloth-bnb-4bit'

cunning sundial
#

yes , but i only have 12GB VRAM right now, so i'm using the gemma-3-4b-it

#

it shouldn't make a difference. it's the same architecture. they both have vision capabilities

cunning sundial
#

Parameter counts by dtype:
torch.bfloat16: 1,579,024,752 parameters
torch.uint8: 1,360,527,360 parameters

#

but still the point is that the presence of uint8 parameters is due to 4bit quantization, not any other action you were taking at a later stage

#

i am doing a quick training run

#

i am now following the same steps you took to merge_and_unload then save_pretrained

frigid ridge
#

ok thank you! it looks like these models do indeed include 8bit information somehow at the start

cunning sundial
#

yes now i'm seeing what happens when i save in different dtypes

#

just running out of disk space, so saving that mess rip

frigid ridge
#

i feel you! had to edit my instance config many times 🤣

cunning sundial
#

damn it ran out of disk space again

#

how's the A40 btw, what's the memory bandwidth?

#

if you have plenty of disk space can you try this:

run 2 separate runs.

1- load the model as you would with FastModel in 4 bits, load the peft model
2- do a very small training run... max 5 mins
3- merge_and_unload
4- save_pretrained()

In the first run for save_pretrained choose torch_dtype=torch.float16

Then shutdown the notebook kernel. Turn the kernel back up

load the model you just saved with FastModel.from_pretrained, but with load_in_4bit=False

then run this

# Check parameter dtypes
dtype_counts = {}
for name, param in model.named_parameters():
    dtype = str(param.dtype)
    if dtype not in dtype_counts:
        dtype_counts[dtype] = 0
    dtype_counts[dtype] += param.numel()

print("Parameter counts by dtype:")
for dtype, count in dtype_counts.items():
    print(f"  {dtype}: {count:,} parameters")

see if there are uint8 parameters

#

repeat the run a second time but instead use torch_dtype=torch.float32 when calling save_pretrained

#

don't forget the step of shutting down the kernel to make sure the GPU memory is cleared.

#

then reload the new model, run the code for checking parameter types

#

my suspicion is :

If you save pretrained in float16 , you might still be left with uint8 parameters
while if you save pretrained in float32, the parameters should all be float32

#

and since ollama only accepts either float16 or float32, then you're left with no option but to use torch_dtype=torch.float32 when saving

#

This whole issue is about ollama going in a different direction than llama.cpp
using their own quantizer, etc....
it isn't helping

frigid ridge
#

ok this is very clear thank you!

#

appreciate you laying all of that out, i will try those two runs and get back to you

frigid ridge
#

A40 bandwidth is 696 GB/s pretty sure

#

i will attempt your fixes on Monday and get back to you, thanks for today, it's super helpful!

frigid ridge
#

Hi Roland,

I wanted to give a quick update on the merging issue. I did significant debugging today and found the core problems and a working path.

Merge & Save Issue Diagnosis:
Using merged_model = model.merge_and_unload() -> merged_model.save_pretrained(..., torch_dtype=torch.float16), the saved file raised RuntimeError: size mismatch when loaded (shape torch.Size([57802752, 1]) for MLP weights). This shows shape corruption during saving with this method.
Attempting to save explicitly as FP16 using model.save_pretrained_merged(..., torch_dtype=torch.float16) failed with TypeError: unsloth_generic_save_pretrained_merged() got an unexpected keyword argument 'torch_dtype'. This points to a bug in the explicit dtype path of save_pretrained_merged.
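One possible reading of the [57802752, 1] shape is that it is a 4-bit weight packed two values per uint8 byte, not a randomly corrupted tensor. The arithmetic below is a sketch under the assumption that the 27B model uses a hidden size of 5376 and an MLP intermediate size of 21504. Those numbers are not stated in this thread and should be confirmed against config.json.

```python
# Assumed values for the 27B checkpoint; confirm against its config.json.
hidden_size = 5376
intermediate_size = 21504

# One MLP projection matrix holds hidden_size * intermediate_size weights.
logical_weights = hidden_size * intermediate_size

# 4-bit storage packs two weights into each uint8 byte.
packed_bytes = logical_weights // 2

print(logical_weights)  # 115605504
print(packed_bytes)     # 57802752
```

Under those assumptions the packed byte count matches the reported shape exactly, which would be consistent with the saved file still holding quantized MLP weights.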

Working Solution Found:
The default save path of model.save_pretrained_merged (without specifying torch_dtype, which results in torch.bfloat16 on my A100 RunPod instance) successfully saved a merged model that loads perfectly using FastModel without shape corruption and contains no uint8 parameters.

Summary:

  • The shape corruption bug affects the merge_and_unload() + save_pretrained() method. The explicit FP16 save in save_pretrained_merged doesn't work for me.
  • However, the default model.save_pretrained_merged works correctly and produces a clean, loadable file without uint8s.
  • My plan is to use this bfloat16 merged model (saved via model.save_pretrained_merged without torch_dtype) as the source for GGUF conversion - it works nicely with the Unsloth-provided code:
    !python llama.cpp/convert_hf_to_gguf.py gemma-3-unsloth-merged-test-bf16 --outfile gemma-3-unsloth-merged-test-gguf-q8_0.gguf --outtype q8_0
  • I will proceed with the full training run, save the adapters separately, save the merged model using the save_pretrained_merged default in bfloat16, and then attempt the GGUF conversion from that output directory.

Thanks again for your guidance on diagnosing this!

#

Also, I noticed an error when trying to get trainer.train() to run:

I got this:
TypeError: scaled_dot_product_attention() got an unexpected keyword argument 'enable_gqa'

During trainer.train(). Turns out this was caused by a temporary patch (patch_Gemma3Attention). Correcting the patch file (temporary_patches.py in unsloth_zoo) by commenting out only the list append line and restarting the kernel fixed this, allowing training to proceed.
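A likely contributing factor is the PyTorch version: the `enable_gqa` argument of `scaled_dot_product_attention` is assumed here to have been introduced in PyTorch 2.5, while the environment above reports 2.4.1+cu124. The sketch below is a hypothetical version check built on that assumption; it parses a version string and does not import torch.

```python
import re


def supports_enable_gqa(torch_version):
    """True if the version is at least 2.5, where enable_gqa is assumed to exist."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognized version: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (2, 5)


print(supports_enable_gqa("2.4.1+cu124"))  # False
print(supports_enable_gqa("2.5.0"))        # True
```

In a notebook, `supports_enable_gqa(torch.__version__)` would give the answer for the running environment. If it returns False, upgrading PyTorch may be a cleaner fix than editing the patch file.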

cunning sundial
cunning sundial
#

just finishing up some coursework first and some stuff for llamacon tomorrow

frigid ridge
#

thats good to hear 🙂 really appreciate your time on this!

cunning sundial
#

Appreciate your patience!

cunning sundial
#

Hello gents!

#

We pushed and merged a PR that should resolve your save_pretrained_merged() and push_to_hub_merged() issues. Please make sure to run:

pip install --force-reinstall git+https://github.com/unslothai/unsloth.git
frigid ridge
#

thanks very much @cunning sundial !

cunning sundial
#

note that this works for vision models. we'll push for text models in the next two days