GGUF Quantizing

hallow olive

I've been trying to quantize to GGUF using Unsloth. My llama.cpp is already compiled, but I still get this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 17
      6 load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
      9 model, tokenizer = FastLanguageModel.from_pretrained(
     10     model_name = "lora_model_llama-2-13B",
     11     #model_name = "unsloth/llama-2-13b-bnb-4bit",
   (...)
     15     # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
     16 )
---> 17 model.save_pretrained_gguf("qr_k_m_gguf", tokenizer, quantization_method = "q4_k_m")

File ~/.local/lib/python3.10/site-packages/unsloth/save.py:1340, in unsloth_save_pretrained_gguf(self, save_directory, tokenizer, quantization_method, first_conversion, push_to_hub, token, private, is_main_process, state_dict, save_function, max_shard_size, safe_serialization, variant, save_peft_format, tags, temporary_location, maximum_memory_usage)
   1337     gc.collect()
   1339 model_type = self.config.model_type
-> 1340 file_location = save_to_gguf(model_type, new_save_directory, quantization_method, first_conversion, makefile)
   1342 if push_to_hub:
   1343     print("Unsloth: Uploading GGUF to Huggingface Hub...")

File ~/.local/lib/python3.10/site-packages/unsloth/save.py:964, in save_to_gguf(model_type, model_directory, quantization_method, first_conversion, _run_installer)
    955         raise RuntimeError(
    956             f"Unsloth: Quantization failed for {final_location}\n"\
    957             "You are in a Kaggle environment, which might be the reason this is failing.\n"\
   (...)
    961             "I suggest you to save the 16bit model first, then use manual llama.cpp conversion."
    962         )
    963     else:
--> 964         raise RuntimeError(
    965             f"Unsloth: Quantization failed for {final_location}\n"\
    966             "You might have to compile llama.cpp yourself, then run this again.\n"\
    967             "You do not need to close this Python program. Run the following commands in a new terminal:\n"\
    968             "You must run this in the same folder as you're saving your model.\n"\
    969             "git clone https://github.com/ggerganov/llama.cpp\n"\
    970             "cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j\n"\
    971             "Once that's done, redo the quantization."
    972         )
    973     pass
    974 pass

RuntimeError: Unsloth: Quantization failed for ./qr_k_m_gguf-unsloth.F16.gguf
You might have to compile llama.cpp yourself, then run this again.
You do not need to close this Python program. Run the following commands in a new terminal:
You must run this in the same folder as you're saving your model.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j
Once that's done, redo the quantization.
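The error text says the llama.cpp commands must be run "in the same folder as you're saving your model", which suggests that a llama.cpp compiled somewhere else may not be picked up. A quick sanity check before calling `save_pretrained_gguf` is to look for the quantize binary under a `llama.cpp` folder next to where you run. This is a minimal sketch: the binary names (`llama-quantize` for newer builds, `quantize` for older Makefile builds) and locations (repo root or `build/bin`) are assumptions about llama.cpp's layout, and `find_quantize_binary` is a hypothetical helper, not an Unsloth function.

```python
from pathlib import Path
from typing import Optional

# Assumed binary names: newer llama.cpp builds produce "llama-quantize",
# older Makefile builds produced "quantize" (add ".exe" on Windows).
CANDIDATE_NAMES = ("llama-quantize", "quantize")
# Assumed locations: repo root for Makefile builds, build/bin for CMake builds.
CANDIDATE_SUBDIRS = ("", "build/bin")


def find_quantize_binary(save_parent: Path) -> Optional[Path]:
    """Hypothetical helper: return the first quantize binary found in
    save_parent/llama.cpp, or None. It only checks paths; nothing is run."""
    repo = Path(save_parent) / "llama.cpp"
    for subdir in CANDIDATE_SUBDIRS:
        for name in CANDIDATE_NAMES:
            # Joining an empty subdir leaves the path unchanged (repo / name).
            candidate = repo / subdir / name
            if candidate.is_file():
                return candidate
    return None


if __name__ == "__main__":
    found = find_quantize_binary(Path.cwd())
    if found is None:
        print("No llama.cpp quantize binary found next to this folder; build llama.cpp here first.")
    else:
        print(f"Found quantize binary: {found}")
```

If this prints nothing useful from the directory your notebook runs in, the "already compiled" copy is probably in a different folder from the one the error message is asking about.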
dark elm

Seems to be a Hugging Face issue; we're trying to solve it now.

Oh wait, never mind, this is a different issue.

quasi zephyr

Hi dude, have you figured out how to solve this issue?
I compiled llama.cpp on Colab as the error message told me to, but I still got the same error afterwards.
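Because the error insists the clone and build happen in the same folder the model is saved in, one way to rule out a working-directory mix-up on Colab is to drive the same commands from Python with explicit paths instead of a separate terminal. This is a minimal sketch, assuming your llama.cpp checkout still has the Makefile the error message refers to (newer llama.cpp releases build with CMake instead, so the `make` steps may need replacing); `build_steps` and `run_steps` are hypothetical helpers, not part of Unsloth.

```python
import os
import shlex
import subprocess
from pathlib import Path
from typing import Dict, List, Tuple

# Repository URL as given in the Unsloth error message.
LLAMA_CPP_REPO = "https://github.com/ggerganov/llama.cpp"


def build_steps(save_parent: Path, use_cuda: bool = True) -> Tuple[List[List[str]], Dict[str, str]]:
    """Hypothetical helper: the error's clone/clean/build commands as argv lists.

    Mirrors `git clone ...` and `cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j`,
    but uses `make -C <dir>` so the working directory cannot silently differ.
    """
    repo_dir = Path(save_parent) / "llama.cpp"
    steps = [
        ["git", "clone", LLAMA_CPP_REPO, str(repo_dir)],
        ["make", "-C", str(repo_dir), "clean"],
        ["make", "-C", str(repo_dir), "all", "-j"],
    ]
    # LLAMA_CUDA=1 is the environment variable from the error message;
    # it is passed to every step but only affects the build.
    extra_env = {"LLAMA_CUDA": "1"} if use_cuda else {}
    return steps, extra_env


def run_steps(steps: List[List[str]], extra_env: Dict[str, str], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    env = {**os.environ, **extra_env}
    for argv in steps:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing step instead of
            # continuing with a half-built llama.cpp.
            subprocess.run(argv, check=True, env=env)


if __name__ == "__main__":
    steps, extra_env = build_steps(Path.cwd())
    run_steps(steps, extra_env, dry_run=True)  # set dry_run=False to actually build
```

If a `llama.cpp` folder already exists there, drop the clone step, since `git clone` refuses to write into a non-empty directory.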

dark elm

@quick siren

quick siren

Maybe that would work better?

Do you have a screenshot?

quasi zephyr
Replying to quick siren: "Oh, did you use our new Colab notebooks?"

Hi, thank you so much for the reply!
I solved this by downloading the PyTorch model and then quantizing it on my own computer.
I'll try the save_to_gguf function the next time I fine-tune a model, and if anything goes wrong I'll take screenshots of the problem. Really appreciate your help!
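The workaround above, which matches the fallback the Kaggle branch of the error also suggests ("save the 16bit model first, then use manual llama.cpp conversion"), boils down to two llama.cpp steps: convert the Hugging Face model to an F16 GGUF, then quantize that file. This is a minimal sketch that only builds the commands. It assumes a recent llama.cpp checkout where the converter is `convert_hf_to_gguf.py` and the quantizer binary is `llama-quantize` (older checkouts used `convert-hf-to-gguf.py` and `quantize`, and CMake builds put binaries under `build/bin`). The directory names and `manual_gguf_commands` are hypothetical. The merged 16-bit model would come from something like Unsloth's `save_pretrained_merged(..., save_method="merged_16bit")`, if your version provides it.

```python
import shlex
import sys
from pathlib import Path
from typing import List, Optional


def manual_gguf_commands(
    llama_cpp_dir: Path,
    merged_model_dir: Path,
    out_dir: Path,
    quant_type: str = "q4_k_m",
    quantize_bin: Optional[Path] = None,
) -> List[List[str]]:
    """Hypothetical helper: argv lists for HF model -> F16 GGUF -> quantized GGUF.

    Script and binary names are assumptions about a recent llama.cpp checkout.
    """
    llama_cpp_dir = Path(llama_cpp_dir)
    out_dir = Path(out_dir)
    f16_path = out_dir / "model-f16.gguf"
    quant_path = out_dir / f"model-{quant_type.lower()}.gguf"
    if quantize_bin is None:
        quantize_bin = llama_cpp_dir / "llama-quantize"

    # Step 1: convert the merged 16-bit Hugging Face model to one F16 GGUF file.
    convert = [
        sys.executable,
        str(llama_cpp_dir / "convert_hf_to_gguf.py"),
        str(merged_model_dir),
        "--outfile", str(f16_path),
        "--outtype", "f16",
    ]
    # Step 2: quantize the F16 file; the quantizer takes <input> <output> <type>.
    quantize = [str(quantize_bin), str(f16_path), str(quant_path), quant_type.upper()]
    return [convert, quantize]


if __name__ == "__main__":
    for argv in manual_gguf_commands(Path("llama.cpp"), Path("merged_16bit"), Path(".")):
        print(shlex.join(argv))
```

Printing rather than running keeps the sketch side-effect free; once the paths match your setup, each list can be passed to `subprocess.run(argv, check=True)`. Step 1 typically also needs the Python packages from llama.cpp's `requirements.txt` installed.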