I've been trying to quantize a model to GGUF using Unsloth. My llama.cpp is already compiled, but I still get this error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[2], line 17
6 load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
9 model, tokenizer = FastLanguageModel.from_pretrained(
10 model_name = "lora_model_llama-2-13B",
11 #model_name = "unsloth/llama-2-13b-bnb-4bit",
(...)
15 # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
16 )
---> 17 model.save_pretrained_gguf("qr_k_m_gguf", tokenizer, quantization_method = "q4_k_m")
File ~/.local/lib/python3.10/site-packages/unsloth/save.py:1340, in unsloth_save_pretrained_gguf(self, save_directory, tokenizer, quantization_method, first_conversion, push_to_hub, token, private, is_main_process, state_dict, save_function, max_shard_size, safe_serialization, variant, save_peft_format, tags, temporary_location, maximum_memory_usage)
1337 gc.collect()
1339 model_type = self.config.model_type
-> 1340 file_location = save_to_gguf(model_type, new_save_directory, quantization_method, first_conversion, makefile)
1342 if push_to_hub:
1343 print("Unsloth: Uploading GGUF to Huggingface Hub...")
File ~/.local/lib/python3.10/site-packages/unsloth/save.py:964, in save_to_gguf(model_type, model_directory, quantization_method, first_conversion, _run_installer)
955 raise RuntimeError(
956 f"Unsloth: Quantization failed for {final_location}\n"\
957 "You are in a Kaggle environment, which might be the reason this is failing.\n"\
(...)
961 "I suggest you to save the 16bit model first, then use manual llama.cpp conversion."
962 )
963 else:
--> 964 raise RuntimeError(
965 f"Unsloth: Quantization failed for {final_location}\n"\
966 "You might have to compile llama.cpp yourself, then run this again.\n"\
967 "You do not need to close this Python program. Run the following commands in a new terminal:\n"\
968 "You must run this in the same folder as you're saving your model.\n"\
969 "git clone https://github.com/ggerganov/llama.cpp\n"\
970 "cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j\n"\
971 "Once that's done, redo the quantization."
972 )
973 pass
974 pass
RuntimeError: Unsloth: Quantization failed for ./qr_k_m_gguf-unsloth.F16.gguf
You might have to compile llama.cpp yourself, then run this again.
You do not need to close this Python program. Run the following commands in a new terminal:
You must run this in the same folder as you're saving your model.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j
Once that's done, redo the quantization.
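
Since the message says the commands must be run in the same folder the model is being saved from, one thing worth checking is whether a compiled `llama.cpp` directory is actually visible from the notebook's working directory. Below is a minimal diagnostic sketch, not Unsloth's own check. The helper name `find_quantize_binaries`, the binary names (`llama-quantize`, `quantize`) and the `build/bin` subfolder are assumptions based on common llama.cpp build layouts, so adjust them to match your build.

```python
from pathlib import Path

# Assumed names: llama.cpp builds have shipped the quantizer as "quantize"
# and, in later versions, "llama-quantize". Adjust if your build differs.
CANDIDATE_BINARIES = ("llama-quantize", "quantize")
# Assumed locations relative to the llama.cpp folder (make vs. CMake builds).
CANDIDATE_SUBDIRS = ("", "build/bin")


def find_quantize_binaries(workdir="."):
    """Return candidate quantize binaries found under <workdir>/llama.cpp."""
    root = Path(workdir) / "llama.cpp"
    if not root.is_dir():
        return []
    found = []
    for sub in CANDIDATE_SUBDIRS:
        for name in CANDIDATE_BINARIES:
            candidate = root / sub / name
            if candidate.is_file():
                found.append(candidate)
    return found


if __name__ == "__main__":
    print("Working directory:", Path.cwd())
    hits = find_quantize_binaries()
    if hits:
        for path in hits:
            print("Found:", path)
    else:
        print("No quantize binary found under ./llama.cpp")
```

If this prints nothing under `./llama.cpp`, the compiled copy probably lives somewhere other than the folder the notebook is running in, which would be consistent with the error above.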