#Can't launch any models. I get an error that llama.cpp is not installed. How do I install it?


runic spindle
#

Hi everyone,

I have oobabooga installed. Some models work using the Transformers loader. When I try llama.cpp I get this error:

Traceback (most recent call last):

File “C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py”, line 182, in load_model_wrapper

shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File “C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\text-generation-webui\modules\models.py”, line 79, in load_model

output = load_func_map[loader](model_name)
File “C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\text-generation-webui\modules\models.py”, line 238, in llamacpp_loader

from modules.llamacpp_model import LlamaCppModel
File “C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py”, line 11, in <module>

import llama_cpp
ModuleNotFoundError: No module named ‘llama_cpp’

Thanks for any help!

quiet raptor
runic spindle
#

I read your post, thanks for the info. However, I need to get llama_cpp working if I want to use any GGML models. How do I install it? I updated and installed it using CMD, and it says llama_cpp is installed, but when I try to load models in oobabooga it gives me that error, so I'm not sure if I'm doing anything wrong.

quiet raptor
#

If you used the one-click-installer, then it should have already been installed. Something must have gone wrong during the installation.
You can manually install it by running cmd_windows.bat and entering these commands:

python -m pip install https://github.com/abetlen/llama-cpp-python/releases/download/v0.1.78/llama_cpp_python-0.1.78-cp310-cp310-win_amd64.whl

python -m pip install llama-cpp-python-cuda --no-deps --index-url==https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117
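
A package can be installed for one Python interpreter and still be missing from the one the webui uses. As a minimal sketch (not part of the webui; the helper name describe_module is made up for illustration), running something like this inside cmd_windows.bat shows which interpreter is active and whether it can see the llama.cpp bindings:

```python
import importlib.util
import sys


def describe_module(name: str) -> str:
    """Report whether `name` is importable by the interpreter running this script."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: NOT importable from {sys.executable}"
    return f"{name}: found at {spec.origin} (interpreter: {sys.executable})"


if __name__ == "__main__":
    # llama_cpp is the regular build; llama_cpp_cuda is the module name that
    # appears in the CUDA tracebacks later in this thread.
    for module_name in ("llama_cpp", "llama_cpp_cuda"):
        print(describe_module(module_name))
```

If the interpreter path printed is not under installer_files\env, the commands were probably run in a plain CMD window rather than in cmd_windows.bat.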
runic spindle
#

I get this error when I try the above :

Downloading diskcache-5.6.1-py3-none-any.whl (45 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.6/45.6 kB 452.7 kB/s eta 0:00:00
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.1 llama-cpp-python-0.1.78

(C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\installer_files\env) C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows>
(C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\installer_files\env) C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows>python -m pip install llama-cpp-python-cuda --no-deps --index-url==https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117
WARNING: The index url "=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117" seems invalid, please provide a scheme.
Looking in indexes: =https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117
WARNING: Location '=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117/llama-cpp-python-cuda/' is ignored: it is either a non-existing path or lacks a specific scheme.
ERROR: Could not find a version that satisfies the requirement llama-cpp-python-cuda (from versions: none)
ERROR: No matching distribution found for llama-cpp-python-cuda

#

I tried loading the model using llama_cpp and I now get this error :

Traceback (most recent call last):

File “C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py”, line 182, in load_model_wrapper

shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File “C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\text-generation-webui\modules\models.py”, line 79, in load_model

output = load_func_map[loader](model_name)
File “C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\text-generation-webui\modules\models.py”, line 247, in llamacpp_loader

model, tokenizer = LlamaCppModel.from_pretrained(model_file)
File “C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py”, line 74, in from_pretrained

result.model = Llama(**params)
File “C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py”, line 328, in __init__

assert self.model is not None
AssertionError

runic spindle
#

Apparently cuda is installed :

Collecting environment information...
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.12 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 19:01:18) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22621-SP0
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Ti
Nvidia driver version: 536.67
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=3401
DeviceID=CPU0
Family=107
L2CacheSize=8192
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3401
Name=AMD Ryzen 9 5950X 16-Core Processor
ProcessorType=3
Revision=8448

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] torch==2.0.1+cu117
[pip3] torchaudio==2.0.2+cu117
[pip3] torchvision==0.15.2+cu117
[conda] numpy 1.24.1 pypi_0 pypi
[conda] torch 2.0.1+cu117 pypi_0 pypi
[conda] torchaudio 2.0.2+cu117 pypi_0 pypi
[conda] torchvision 0.15.2+cu117 pypi_0 pypi

quiet raptor
#

I messed up the second command (there was an extra = after --index-url). Corrected:

python -m pip install llama-cpp-python-cuda --no-deps --index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117

This error indicates that you may not be loading a GGML model:

assert self.model is not None
AssertionError

What model are you trying to load?
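
One rough way to tell what kind of file is being handed to llama.cpp is to look at its first four bytes. The sketch below is an illustration only, not what the webui does; the magic values reflect my understanding of the GGUF and older GGML-family headers as they appear on disk, and the helper name sniff_model_format is made up:

```python
from pathlib import Path

# First four bytes of the file as stored on disk (assumed values, see lead-in).
MAGICS = {
    b"GGUF": "GGUF",
    b"tjgg": "GGML (ggjt)",
    b"fmgg": "GGML (ggmf)",
    b"lmgg": "GGML (unversioned)",
}


def sniff_model_format(path) -> str:
    """Return a best-guess label for the model file based on its magic bytes."""
    with open(Path(path), "rb") as f:
        head = f.read(4)
    return MAGICS.get(head, f"unknown (first bytes: {head!r})")


# Usage (path is a placeholder): print(sniff_model_format("models/your-model.bin"))
```

An "unknown" result suggests the file is not a llama.cpp model at all (for example a Transformers checkpoint) or that the download is incomplete.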

runic spindle
#

I'm getting about 1 token a second using Open-Orca_OpenOrca-Platypus2-13B. Is this right for my hardware (3080 Ti)?

Output generated in 1226.93 seconds (0.13 tokens/s, 160 tokens, context 752, seed 202979456)

That is with the Transformers loader, which is the only one that seems to work. For the GGML models I try llama_cpp, which is apparently the right loader for them, but I can't seem to get llama_cpp working.

runic spindle
quiet raptor
#

That model is outdated. llama.cpp devs have changed their model format several times in the past.

runic spindle
quiet raptor
runic spindle
#

Im trying to run this model :

TheBloke_OpenOrca-Platypus2-13B-GGML

using the model loader llama_cpp.

runic spindle
#

The other models all give errors.

runic spindle
#

I now get this error after installing the other pip installers ;

#

(C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows\installer_files\env) C:\Users\Greg\Downloads>pip install llama_cpp_python_cuda-0.1.79+cu117-cp310-cp310-win_amd64.wh
ERROR: Invalid requirement: 'llama_cpp_python_cuda-0.1.79+cu117-cp310-cp310-win_amd64.wh'

I think I need the nvidia cuda?

runic spindle
#

I can't seem to get it working. I've now tried installing the CUDA Toolkit and everything else, but I keep hitting the same issue.

quiet raptor
#

My best guess as to what is going wrong is that C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows is too long of a path for the installer to work.
Windows has a default limit on how long file paths can be (260 characters, unless long-path support is enabled). Most Python packages are developed on Linux and ported to Windows afterwards.
Since Linux allows far longer paths, devs often create lengthy file paths that cause issues on Windows.
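
To get a feel for how much room an install location leaves, a quick calculation like the following can help. It is only a sketch: 260 is the legacy Windows MAX_PATH value, the helper name headroom is made up, and a comfortable result does not prove path length is or isn't the cause of a failed install.

```python
WINDOWS_MAX_PATH = 260  # legacy default limit; long-path support can raise it


def headroom(install_dir: str, limit: int = WINDOWS_MAX_PATH) -> int:
    """Characters left for nested folders and file names under install_dir."""
    return limit - len(install_dir)


if __name__ == "__main__":
    for folder in (
        r"C:\Users\Greg\OneDrive\Desktop\BusinessIdeas\LawChatbot\LLama2\oobabooga_windows",
        r"H:\oobabooga_windows",
    ):
        print(f"{headroom(folder):4d} characters left under {folder}")
```

Packages inside installer_files\env\Lib\site-packages can nest many levels deep, so a short root folder leaves the most margin.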

runic spindle
#

OK, I'll try copying the folder to a shorter path.

quiet raptor
#

It will need to be reinstalled since Conda doesn't support moving its installation.

runic spindle
#

OK, I'll move it and then reinstall Conda. What is the install command for Conda? pip install conda?

#

And I'll install it all by launching the cmd_windows.bat file?

quiet raptor
#

It is part of the installer, installed by start_windows.bat.
There just isn't an easy way to move it without a lot of manual installation. I always just do a full reinstall when I need to move it.

runic spindle
#

OK, I launched the start_windows.bat file.

In the new shortened path.

Same error :

#

How would I do a full reinstall?

quiet raptor
#

That error you got seems to indicate that you have the Linux version of bitsandbytes installed. A reinstall will get the right one, otherwise run cmd_windows.bat and enter this command to install the Windows version:

python -m pip install bitsandbytes --force-reinstall --no-deps --index-url=https://jllllll.github.io/bitsandbytes-windows-webui
runic spindle
#

If I try to run the model you posted earlier:

Traceback (most recent call last):

File “H:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py”, line 182, in load_model_wrapper

shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 79, in load_model

output = load_func_map[loader](model_name)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 247, in llamacpp_loader

model, tokenizer = LlamaCppModel.from_pretrained(model_file)
File “H:\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py”, line 74, in from_pretrained

result.model = Llama(**params)
File “H:\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp_cuda\llama.py”, line 323, in __init__

assert self.model is not None
AssertionError

#

If I use the Transformers loader:

Traceback (most recent call last):

File “H:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\configuration_utils.py”, line 702, in _get_config_dict

config_dict = cls._dict_from_json_file(resolved_config_file)
File “H:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\configuration_utils.py”, line 793, in _dict_from_json_file

text = reader.read()
File “H:\oobabooga_windows\installer_files\env\lib\codecs.py”, line 322, in decode

(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x80 in position 28: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File “H:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py”, line 182, in load_model_wrapper

#

shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 79, in load_model

output = load_func_map[loader](model_name)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 140, in huggingface_loader

config = AutoConfig.from_pretrained(path_to_model, trust_remote_code=shared.args.trust_remote_code)
File “H:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\auto\configuration_auto.py”, line 983, in from_pretrained

config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File “H:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\configuration_utils.py”, line 617, in get_config_dict

config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File “H:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\configuration_utils.py”, line 705, in _get_config_dict

raise EnvironmentError(
OSError: It looks like the config file at ‘models\GPT4All-13B-snoozy.ggmlv3.q4_K_S.bin’ is not a valid JSON file.

#

But that's for the model you gave me. With the other models I get the assertion error.

quiet raptor
#

Ah, crap. I forgot that llama-cpp-python updated. Use this instead:

python -m pip install llama-cpp-python-cuda==0.1.78 --no-deps --index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117

Also, GPT4All-13B-snoozy.ggmlv3.q4_K_S.bin is a GGML model, not a transformers model. So, it needs to be loaded with llama.cpp.

runic spindle
#

Wow that worked!!! You are amazing. Thank you! let me test this out a bit!!!!!!

#

Hmm it seems to be running on my cpu though? How do I make it use my gpu?

#

I dont know if this is useful :

2023-08-26 22:00:00 INFO:Loading GPT4All-13B-snoozy.ggmlv3.q4_K_S.bin...
2023-08-26 22:00:00 INFO:llama.cpp weights detected: models\GPT4All-13B-snoozy.ggmlv3.q4_K_S.bin
2023-08-26 22:00:00 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models\GPT4All-13B-snoozy.ggmlv3.q4_K_S.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 14 (mostly Q4_K - Small)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 7477.72 MB (+ 1600.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/43 layers to GPU
llama_model_load_internal: total VRAM used: 480 MB
llama_new_context_with_model: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
2023-08-26 22:00:00 INFO:Loaded the model in 0.77 seconds.

#

llama_print_timings: load time = 3176.75 ms
llama_print_timings: sample time = 1.90 ms / 11 runs ( 0.17 ms per token, 5786.43 tokens per second)
llama_print_timings: prompt eval time = 3176.70 ms / 66 tokens ( 48.13 ms per token, 20.78 tokens per second)
llama_print_timings: eval time = 1865.49 ms / 10 runs ( 186.55 ms per token, 5.36 tokens per second)
llama_print_timings: total time = 5061.70 ms
Output generated in 5.42 seconds (1.84 tokens/s, 10 tokens, context 66, seed 1220684351)
Llama.generate: prefix-match hit

#

I can still only load this one using the Transformers loader and not llama_cpp: 4bit_Llama-2-13b-chat-hf

quiet raptor
#

When loading a GGML model, you can change the n-gpu-layers setting to control how much of the model to put on your GPU.
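
For anyone calling llama-cpp-python directly instead of going through the webui, the same setting is the n_gpu_layers argument of its Llama class. The following is a hedged sketch, not the webui's own loader code: build_llama_kwargs and load_model are made-up helper names, the model path is a placeholder, and the import is done lazily so the snippet can be read without the package installed.

```python
def build_llama_kwargs(model_path: str, n_gpu_layers: int, n_ctx: int = 2048) -> dict:
    """Collect the constructor arguments for llama_cpp.Llama.

    n_gpu_layers=0 keeps every layer on the CPU, which matches the
    "offloaded 0/43 layers to GPU" line in the log above.
    """
    return {"model_path": model_path, "n_gpu_layers": n_gpu_layers, "n_ctx": n_ctx}


def load_model(model_path: str, n_gpu_layers: int):
    """Load a model with some layers offloaded to the GPU (needs llama-cpp-python)."""
    from llama_cpp import Llama  # imported here so the helper above works without it

    return Llama(**build_llama_kwargs(model_path, n_gpu_layers))


# Usage (placeholder path): llm = load_model("models/your-model.bin", n_gpu_layers=20)
```

Raising the number moves more of the model into VRAM; if VRAM fills up, lower it again.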

runic spindle
#

Oh I see. ok let me play with the n-gpu-layers.

So you're saying the 4-bit Llama model I'm running can only be run with the Transformers loader? OK, I'll get the GGML version. So I must use GGML versions in order to run under the llama_cpp loader?

#

OK, thanks for the links to your models. I'm getting around 10 tokens a second with this model: GPT4All-13B-snoozy.ggmlv3.q4_K_S.bin. I'll try out the other model. Where else can I find good models? The models I've been downloading don't all work that well.

#

Output generated in 5.89 seconds (23.08 tokens/s, 136 tokens, context 43, seed 981438056)

#

Yay, 23 tokens a second. Looks like my PC is up to the task!

quiet raptor
#

For the GGML models, it is best to manually download the model file you want to use as he usually uploads multiple versions to the same repo, which the webui's model downloader doesn't handle well.

runic spindle
#

ok thanks , ill try them!

runic spindle
#

I'm thinking of getting another graphics card. Is this something that would work? I have a 3080 Ti 12GB. I'm thinking of getting a 3090 24GB, which should then give me 36GB VRAM. Is this needed, or do you think I'm wasting money and 1 x 3080 Ti is fine? I want to run other models, not just text-to-text or LLMs. I want to run the other models on Hugging Face as well...

quiet raptor
#

It's largely a matter of convenience from what I've seen. I have 11GB VRAM and 32GB RAM and can run most models, with the largest ones requiring some work to get running.
Stable Diffusion XL can be a pain to run when upscaling to large resolutions, but the tools available generally are written to make it easier by upscaling the image in chunks rather than all at once.

runic spindle
quiet raptor
#

My experience with Stable Diffusion is pretty much just casual experimentation. I don't have any experience with making videos with it.
Additionally, I quit using the 1111 webui a while back and switched to ComfyUI, so I'm not familiar with the current state of the webui.

runic spindle
#

Hmm, I'm trying to learn more about running these models. I get the feeling that doing it only through the web UIs is limiting the selection of models I can run. I am guessing that if I knew how to run them in my CMD or in a virtual Linux environment I would have a larger selection of models to play with. Can you point me to any tutorials you think I should learn from, so I become more familiar with running LLMs and other models in a console? Cheers!

quiet raptor
#

I don't think I've ever seen a guide on running LLMs in console. It isn't really a command-line task, more of a programming one.
You would need to learn both Python and all of the relevant APIs for running LLMs like Transformers and llama-cpp-python.

People use webuis for a reason. All of the code is already written and packaged with a convenient UI.
Just running it in CMD is technically possible, but it would be so unwieldy that you definitely wouldn't like it.
Even the researchers developing AI models will write their own UI programs for using their models because anything else just isn't practical.

runic spindle
#

Hmm, maybe I'm not explaining correctly. If I wanted to run the LLM and then host it online, how would I do that? Oobabooga makes it really easy to launch a model, as you say, but I'm trying to figure out how to load and run a model on my computer / server / VM and then host it online. I don't know if oobabooga would allow that or how you do it. So I'm thinking I need to run it more on a console basis. I'm not sure though, just trying to figure this all out.

quiet raptor
#

The webui has an API, if that is what you are wanting. Use this flag in CMD_FLAGS.txt to enable it: --api
For online hosting, use --listen and port-forwarding to access it from the web.
With --listen you can also access the UI from other systems.
If you want to host it through a public URL, rather than connecting directly to your IP, then use --share and/or --public-api. This will create a public URL hosted by Cloudflare that will last 72 hours.

You can read more about the various flags and what they do here:
https://github.com/oobabooga/text-generation-webui#starting-the-web-ui

runic spindle
#

great let me go through this all thank you!

#

My CMD_FLAGS.txt is empty. Must I just add the lines you posted and save it?

quiet raptor
#

They are entered into it on a single line: --api --listen --share --public-api
Eventually I'll work out a good way to have the installer read flags from the file regardless of how they are entered.
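
If the flags are kept in a script or list somewhere, a small helper can write them out in the single-line form described above. This is just an illustrative sketch; write_cmd_flags is a made-up name and the file path is whatever your install uses for CMD_FLAGS.txt.

```python
from pathlib import Path


def write_cmd_flags(path, flags) -> str:
    """Write the given flags to `path` as one space-separated line and return it."""
    line = " ".join(flag.strip() for flag in flags if flag.strip())
    Path(path).write_text(line, encoding="utf-8")
    return line


# Usage (placeholder path):
# write_cmd_flags("CMD_FLAGS.txt", ["--api", "--listen"])
```

Only enable --share or --public-api if you actually want the instance reachable from outside your network.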

runic spindle
#

I tried running a 34B model. It used all my RAM (96GB). How much RAM do I need to run larger models?

quiet raptor
#

Quite a lot if it is an HF model. If the model loads through the Transformers loader, then you should select one of the load-in-*bit options to reduce memory usage.
Beyond that, you should prefer quantized GGUF/GGML models. 8-bit models use roughly half of the memory of full 16-bit HF models and have pretty much the same quality output.
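
The "roughly half" figure follows from simple arithmetic: weight memory is about parameter count times bits per weight. A back-of-the-envelope sketch (weights only; it ignores context/KV cache, activations and loader overhead, and real quantized files carry a little extra metadata):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"34B parameters at {bits}-bit: ~{weight_memory_gb(34e9, bits):.0f} GB")
```

So a 34B model needs on the order of 68 GB at 16-bit, 34 GB at 8-bit and 17 GB at 4-bit before any overhead.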

runic spindle
#

It only runs with the Transformers loader. I'll try the 8-bit option. It is this model: WizardLM_WizardCoder-Python-34B-V1.0

runic spindle
#

I tried loading in 4 bit and 8 bit.

It loads but doesn't work.

I get this error :

RuntimeError: value cannot be converted to type at::Half without overflow
Output generated in 0.59 seconds (0.00 tokens/s, 0 tokens, context 49, seed 87160459)

Any ideas?

quiet raptor
#

Here is a GGUF version of that model:
https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-GGUF#provided-files

Select the model version that you want to use based on how much VRAM and RAM you have.
In the list of files, you can see the amount of RAM each version needs.
You can use the n-gpu-layers setting when loading the model to offload some of the model from RAM to VRAM.
This reduces the amount of RAM needed and speeds up the model as well by using the GPU.

With your amount of RAM and VRAM, you can comfortably use any of the provided versions.
With the Q5_K_M version, you can probably set n-gpu-layers to around 18 to load 18 of the model's 48 layers onto the GPU.
You can tweak that number as desired if you think you can fit more into VRAM. The more layers loaded into VRAM, the faster the model will be.
Q5_K_M: https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-GGUF/resolve/main/wizardcoder-python-34b-v1.0.Q5_K_M.gguf
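
A rough starting value for n-gpu-layers can be estimated by dividing the file size evenly across the layers and seeing how many fit in the VRAM you are willing to give the model. This is a heuristic sketch, not necessarily how the number above was chosen: the 24 GB file size and the 3 GB held back for context and the desktop are assumptions for illustration, so read the real size from the file list.

```python
import math


def layers_that_fit(model_file_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Assumed: ~24 GB file, 48 layers, 12 GB card, 3 GB held in reserve.
    print(layers_that_fit(24.0, 48, vram_gb=12.0, reserve_gb=3.0))  # prints 18
```

Treat the result as a first guess, then watch VRAM usage while the model loads and adjust up or down.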

quiet raptor
#

If you haven't updated the webui since GGUF support was implemented, then you will need to do so to use this model.

runic spindle
#

Ok thanks let me try this!

#

I'm downloading the GGUF version. How would I update the webui? I launched the update.bat file, so I think I am up to date?

A few of my models are GGUF. What model loader must I use for GGUF? llama.cpp?

quiet raptor
#

llama.cpp will load it.

runic spindle
#

Ahh, I was playing around with my CUDA installations today, trying to get around an assertion error.

I may have messed up my CUDA installation. When I try to launch llama.cpp I get this error now (it was working earlier!):

Traceback (most recent call last):

File “H:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py”, line 180, in load_model_wrapper

unload_model()
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 362, in unload_model

clear_torch_cache()
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 355, in clear_torch_cache

torch.cuda.empty_cache()
File “H:\oobabooga_windows\installer_files\env\lib\site-packages\torch\cuda\memory.py”, line 133, in empty_cache

torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal instruction was encountered

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

quiet raptor
#

Not sure what the cause of that is. Doesn't seem to have anything to do with llama.cpp.

#

Maybe caused by trying to unload a model when there isn't one loaded?

#

Not sure.

runic spindle
#

It was working fine earlier but I installed a cuda package, pytorch, anaconda and a bunch of other software. let me reboot my pc maybe that will fix it.

#

Hmm I just opened and closed the cmd panel and relaunched it and now llama.cpp is working.

runic spindle
#

I downloaded this model. Tried running it with model loader llama.cpp.

I get this error :

Traceback (most recent call last):

File “H:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py”, line 182, in load_model_wrapper

shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 79, in load_model

output = load_func_map[loader](model_name)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 247, in llamacpp_loader

model, tokenizer = LlamaCppModel.from_pretrained(model_file)
File “H:\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py”, line 74, in from_pretrained

result.model = Llama(**params)
File “H:\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp_cuda\llama.py”, line 328, in __init__

assert self.model is not None
AssertionError

quiet raptor
#

I'm guessing that llama-cpp-python is not updated then.
Run cmd_windows.bat and enter these commands to update the webui if the update_windows.bat script isn't working:

cd text-generation-webui

git pull

python -m pip install -r requirements.txt --upgrade

python -m pip install llama-cpp-python llama-cpp-python-cuda llama-cpp-python-ggml llama-cpp-python-ggml-cuda --force-reinstall --no-deps --index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cpu --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117

The last long command is to ensure that llama-cpp-python gets updated.

runic spindle
#

Thanks, I updated it, now I get this error :

Traceback (most recent call last):

File “H:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py”, line 182, in load_model_wrapper

shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 79, in load_model

output = load_func_map[loader](model_name)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 244, in llamacpp_loader

model_file = list(Path(f'{shared.args.model_dir}/{model_name}').glob('*ggml*.bin'))[0]
IndexError: list index out of range

quiet raptor
#

model_file = list(Path(f'{shared.args.model_dir}/{model_name}').glob('*ggml*.bin'))[0]

This indicates that the webui was not updated as that code is old and no longer used.

runic spindle
#

Ok I see, I just launched my CMD and ran the code to update. I never ran the cmd_windows.bat file.

I now have launched the cmd_windows.bat file. I ran the code you posted, however I get this error in red :

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'H:\oobabooga_windows\installer_files\env\Lib\site-packages\numpy\.libs\libopenblas64__v0.3.23-246-g3d31191b-gcc_10_3_0.dll'
Consider using the --user option or check the permissions.

(H:\oobabooga_windows\installer_files\env) H:\oobabooga_windows\text-generation-webui>
(H:\oobabooga_windows\installer_files\env) H:\oobabooga_windows\text-generation-webui>python -m pip install llama-cpp-python llama-cpp-python-cuda llama-cpp-python-ggml llama-cpp-python-ggml-cuda --force-reinstall --no-deps --index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cpu --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117

#

I tried opening the bat file as administrator but I get the same error.

#

It gets stuck here :

from H:\oobabooga_windows\installer_files\pip-uninstall-b7mdmsy9\f2py.exe
ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'H:\oobabooga_windows\installer_files\env\Lib\site-packages\numpy\.libs\libopenblas64__v0.3.23-246-g3d31191b-gcc_10_3_0.dll'
Consider using the --user option or check the permissions.

quiet raptor
runic spindle
#

Ok I ran your latest command. It did a lot , I got this on the last line :

processed file: H:\oobabooga_windows\installer_files\env\Lib\site-packages\numpy\doc
H:\oobabooga_windows\installer_files\env\Lib\site-packages\numpy\dtypes.py: Access is denied.
Successfully processed 57056 files; Failed processing 1 files

Anyway , let me try run this command now :

cd text-generation-webui

git pull

python -m pip install -r requirements.txt --upgrade

python -m pip install llama-cpp-python llama-cpp-python-cuda llama-cpp-python-ggml llama-cpp-python-ggml-cuda --force-reinstall --no-deps --index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cpu --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117

I ran the command above and I get this at the end ;

Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.24.3
Uninstalling numpy-1.24.3:
Successfully uninstalled numpy-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.57.0 requires numpy<1.25,>=1.21, but you have numpy 1.25.2 which is incompatible.
Successfully installed numpy-1.25.2

(H:\oobabooga_windows\installer_files\env) H:\oobabooga_windows\text-generation-webui>
(H:\oobabooga_windows\installer_files\env) H:\oobabooga_windows\text-generation-webui>python -m pip install llama-cpp-python llama-cpp-python-cuda llama-cpp-python-ggml llama-cpp-python-ggml-cuda --force-reinstall --no-deps --index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cpu --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX2/cu117

I'm going to try launching the webui anyway to see if it's working.

#

I'm hitting the same error in the webui when I try to load the model:

Traceback (most recent call last):

File “H:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py”, line 182, in load_model_wrapper

shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 79, in load_model

output = load_func_map[loader](model_name)
File “H:\oobabooga_windows\text-generation-webui\modules\models.py”, line 244, in llamacpp_loader

model_file = list(Path(f'{shared.args.model_dir}/{model_name}').glob('*ggml*.bin'))[0]
IndexError: list index out of range

quiet raptor
#

Try running the update_windows.bat script now that the permissions are fixed.

runic spindle
#

ok

#

I launched the update_windows.bat file.

It displays this in the CMD :

error: Your local changes to the following files would be overwritten by merge:
requirements.txt
Please commit your changes or stash them before you merge.
Aborting
Updating eaf5f0f..4affa08
Command '"H:\oobabooga_windows\installer_files\conda\condabin\conda.bat" activate "H:\oobabooga_windows\installer_files\env" >nul && git pull' failed with exit status code '1'. Exiting...

Done!
Press any key to continue . . .

I then launched the start_windows.bat file.

I then tried to load the model. Same error. Here is the error :

File "H:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py", line 182, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "H:\oobabooga_windows\text-generation-webui\modules\models.py", line 79, in load_model
output = load_func_map[loader](model_name)
File "H:\oobabooga_windows\text-generation-webui\modules\models.py", line 244, in llamacpp_loader
model_file = list(Path(f'{shared.args.model_dir}/{model_name}').glob('*ggml*.bin'))[0]
IndexError: list index out of range

quiet raptor
#

Run these commands in cmd_windows.bat, then run the update script:

cd text-generation-webui
git reset --hard
runic spindle
#

Thank you, this worked. I can see the GUI is updated in the webui! And I can now run the GGUF models. Wow! 34B models running on my PC, no sweat! What would I need to run a 70B model? Or is that out of my depth?

Anyway ,

I loaded this model fine : Successfully loaded codellama-34b.Q5_K_S.gguf.

But when I try the model you suggested earlier (TheBloke_WizardCoder-Python-34B-V1.0-GGUF) I get this:

Traceback (most recent call last):
File "H:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py", line 196, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "H:\oobabooga_windows\text-generation-webui\modules\models.py", line 79, in load_model
output = load_func_map[loader](model_name)
File "H:\oobabooga_windows\text-generation-webui\modules\models.py", line 244, in llamacpp_loader
model_file = (list(Path(f'{shared.args.model_dir}/{model_name}').glob('*.gguf*')) + list(Path(f'{shared.args.model_dir}/{model_name}').glob('*ggml*.bin')))[0]
IndexError: list index out of range

But I'll try different models. Now that GGUF is working I'm going to have a field day!!

Thanks for your help again!!!!

#

Hmm, strangely this is now running super slow: codellama-34b.Q5_K_S.gguf.

It's using my PC and GPU to the max but it's not outputting anything, it just says typing. The first time I tried it, it worked so fast. Odd. Will try a reboot.

#

I waited it out:

llama_print_timings: load time = 322392.79 ms
llama_print_timings: sample time = 1.59 ms / 10 runs ( 0.16 ms per token, 6305.17 tokens per second)
llama_print_timings: prompt eval time = 322392.75 ms / 39 tokens ( 8266.48 ms per token, 0.12 tokens per second)
llama_print_timings: eval time = 3090.42 ms / 9 runs ( 343.38 ms per token, 2.91 tokens per second)
llama_print_timings: total time = 325501.62 ms
Output generated in 325.85 seconds (0.03 tokens/s, 9 tokens, context 39, seed 1733623714)

Seems too slow, let me reboot.

#

Rebooted, no difference. In Task Manager I can see my GPU memory is showing 13/60GB. I only have 12GB of VRAM, so maybe I'm running out of VRAM with this model.

quiet raptor
#

You can try lowering n-gpu-layers until your VRAM is no longer maxed out. That alone should speed it up.

#

NVIDIA drivers on Windows automatically spill VRAM over into system RAM when it fills up, which slows everything down with GGML/GGUF.
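
As a rough way to pick a starting value for n-gpu-layers, you can divide the model file size by its layer count and see how many layers fit in the VRAM you are willing to spend. The sketch below is back-of-the-envelope Python, not something the webui does itself. The 23.2 GB file size, the 48-layer count, and the 2 GB held back for context and the desktop are all assumptions, so check them for your model.

```python
def layers_that_fit(model_size_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_size_gb / n_layers
    # Hold some VRAM back for the context cache and whatever else uses the GPU.
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Assumed numbers: roughly 23.2 GB file, 48 layers, 12 GB card.
print(layers_that_fit(23.2, 48, 12.0))
```

Treat the result as a first guess, then nudge n-gpu-layers up or down while watching dedicated GPU memory in Task Manager.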

runic spindle
#

Going to try this one, it's a bit smaller: wizardcoder-python-34b-v1.0.Q4_K_S.gguf

#

So I tried lowering n-gpu-layers. I set it to 15 and then my GPU memory stayed under 12GB. Now the model is outputting characters quickly, but it's talking gibberish. Very odd. I'll try with some other models! It worked the first time. So weird.

runic spindle
#

How would I make my model available to someone else online? If I wanted to access it remotely? Can I do that? Cheers

quiet raptor
#

The easiest way is to put this in CMD_FLAGS.txt: --share --gradio-auth username:password
This will create a public Gradio share URL that can be used to access the webui over the internet. However, the URL is temporary and only available for around 72 hours.

A more long-term solution would be to use these flags instead: --listen --listen-port 28756 --gradio-auth username:password
Then you would forward that port on your router and access the webui using your public IP and that port, like this: 11.111.11.11:28756

The --gradio-auth and --listen-port flags are to marginally strengthen security as you will be exposing your system to the public internet.
The port number I chose is random and not commonly used. You can use a site like this to get another one: https://it-tools.tech/random-port-generator
You can set whatever username and password you want and can set multiple like this: --gradio-auth username1:password1,username2:password2,username3:password3
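
If you end up with more than a couple of users, it may be easier to generate the line than to type it. Here is a minimal Python sketch that builds the CMD_FLAGS.txt line from username/password pairs. The function name is made up, and the check that rejects ':' and ',' in credentials is an assumption based on those characters being the separators in the flag value.

```python
from typing import Dict, Optional


def build_flags(users: Dict[str, str], port: Optional[int] = None) -> str:
    """Build a CMD_FLAGS.txt line from username/password pairs.

    Without a port this produces the --share form; with a port it
    produces the --listen / --listen-port form.
    """
    for name, password in users.items():
        # Assumption: ':' and ',' separate the pairs, so they cannot
        # appear inside a username or password in this sketch.
        if not name or not password or any(ch in name + password for ch in ":,"):
            raise ValueError(f"unsupported credential for user {name!r}")
    auth = ",".join(f"{name}:{password}" for name, password in users.items())
    if port is None:
        return f"--share --gradio-auth {auth}"
    return f"--listen --listen-port {port} --gradio-auth {auth}"


print(build_flags({"username1": "password1", "username2": "password2"}))
print(build_flags({"username1": "password1"}, port=28756))
```

Paste whichever printed line you want into CMD_FLAGS.txt, and obviously pick real passwords rather than the placeholders.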

runic spindle
#

Ok thanks, I'll try to get that working later this week and will keep you posted.

runic spindle
#

Tried this to start :

The easiest way is to put this in CMD_FLAGS.txt: --share --gradio-auth username:password

I get a login screen. What is the password? I have tried password for both fields but I'm getting incorrect credentials :/

#

Ok I got it. Username = username
Password = password

runic spindle
#

If I had my own app with my own GUI, how would I be able to feed its input into my machine and then feed the output back into the app? Would that be possible? Right now it just hosts the webui online. I want to host a large language model that can then feed another app with content.

quiet raptor
runic spindle
#

Ok thanks, I'll do some more research! I'll let you know what I find.

runic spindle
#

Is it possible for me to store my models in a folder that is NOT in the oobabooga folder?

#

If I want to store models on a separate drive, for example

quiet raptor
#

Use the --model-dir flag like so:

--model-dir D:\models
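
If models then fail to show up or load, it is worth checking that the folder really contains model files. A small Python sketch along these lines lists anything that looks like a GGUF or GGML file under a folder. The two name patterns and the D:\models path are assumptions for illustration, not the webui's own lookup code.

```python
from pathlib import Path
from typing import List


def find_model_files(model_dir: str) -> List[str]:
    """List GGUF/GGML-looking files under model_dir, including subfolders."""
    root = Path(model_dir)
    if not root.is_dir():
        return []
    patterns = ("*.gguf", "*ggml*.bin")  # assumed naming patterns
    found = {p for pattern in patterns for p in root.rglob(pattern)}
    # Paths are reported relative to model_dir, with forward slashes.
    return sorted(p.relative_to(root).as_posix() for p in found)


print(find_model_files(r"D:\models"))
```

An empty list means the folder is missing or holds no matching files, which is the kind of situation that can end in an IndexError: list index out of range when loading.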
runic spindle
#

I wonder if you can help me, I'm stuck trying to run Falcon 40B. Here is the error message:

File "H:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\auto\configuration_auto.py", line 1010, in from_pretrained
trust_remote_code = resolve_trust_remote_code(
File "H:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\dynamic_module_utils.py", line 618, in resolve_trust_remote_code
raise ValueError(
ValueError: Loading models\Falcon40B requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.

runic spindle
#

I found a button in the gui I can click let me try...

#

I clicked the "Trust remote code" button and now I get this error :/

File "H:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 491, in load_state_dict
with open(checkpoint_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'models\Falcon40B\pytorch_model-00004-of-00009.bin'

#

Oh, I'm missing a file. Never mind, I'll figure it out :D

runic spindle
#

yay got it working:D

runic spindle
#

I tried running one of my models and I am now getting this error:

File "H:\oobabooga_windows\text-generation-webui\modules\metadata_gguf.py", line 75, in load_metadata
raise Exception('You are using an outdated GGUF, please download a new one.')
Exception: You are using an outdated GGUF, please download a new one.

How do I update my GGUF?

runic spindle
#

I tried uninstalling and reinstalling llama.cpp with "pip install llama-cpp-python", but this did not solve the issue.

runic spindle
#

Seems like after I updated oobabooga none of my models are working :/

#

I downloaded a new GGUF model and new models work, but my old ones don't. Any tips to get my older models working?

quiet raptor
#

#windows-setup message

runic spindle
#

Hmm, I'm stuck here:

PS C:\Users\Greg> CD J:\text-generation-webui\models
PS J:\text-generation-webui\models> cd text-generation-webui\models
cd : Cannot find path 'J:\text-generation-webui\models\text-generation-webui\models' because it does not exist.
At line:1 char:1
+ cd text-generation-webui\models
    + CategoryInfo          : ObjectNotFound: (J:\text-generat...on-webui\models:String) [Set-Location], ItemNotFoundException
    + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.SetLocationCommand

PS J:\text-generation-webui\models> quantize.exe wizardcoder-python-34b-v1.0.Q4_K_M.gguf COPY

#

PS J:\text-generation-webui\models> quantize.exe wizardcoder-python-34b-v1.0.Q4_K_M.gguf COPY
quantize.exe : The term 'quantize.exe' is not recognized as the name of a cmdlet, function, script file, or operable
program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ quantize.exe wizardcoder-python-34b-v1.0.Q4_K_M.gguf COPY
    + CategoryInfo          : ObjectNotFound: (quantize.exe:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Suggestion [3,General]: The command quantize.exe was not found, but does exist in the current location. Windows PowerShell does not load commands from the current location by default. If you trust this command, instead type: ".\quantize.exe". See "get-help about_Command_Precedence" for more details.
PS J:\text-generation-webui\models>

#

PS J:\text-generation-webui\models> ".\quantize.exe" wizardcoder-python-34b-v1.0.Q4_K_M.gguf COPY
At line:1 char:18
+ ".\quantize.exe" wizardcoder-python-34b-v1.0.Q4_K_M.gguf COPY
+                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unexpected token 'wizardcoder-python-34b-v1.0.Q4_K_M.gguf' in expression or statement.
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : UnexpectedToken

PS J:\text-generation-webui\models>

#

Hmm, when you say:

"Run this command. Change file names and paths as needed."

Where must I run the command? In CMD? I used Windows PowerShell.

#

If I use my normal CMD and enter cd J:\text-generation-webui\models, it just shows I'm still in my C drive folder. It doesn't go to my J drive.

quiet raptor
#

The cd command was written before the one-click-installer was merged into the main webui repo, so the text-generation-webui\ part isn't relevant anymore.

This is the command to run the quantize program and convert the model with CMD:

quantize.exe wizardcoder-python-34b-v1.0.Q4_K_M.gguf wizardcoder-python-34b-v1.0.ggufv2.Q4_K_M.gguf COPY

This will create a new, updated copy of the model with the name wizardcoder-python-34b-v1.0.ggufv2.Q4_K_M.gguf

#

For simplicity, run cmd_windows.bat and enter these commands after placing quantize.exe in the models folder:

cd models

quantize.exe wizardcoder-python-34b-v1.0.Q4_K_M.gguf wizardcoder-python-34b-v1.0.ggufv2.Q4_K_M.gguf COPY
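
If you have a whole folder of old models, a short Python sketch like this can print one such command per file so you don't have to type them by hand. It only builds the command lines following the naming pattern above; the helper names are made up, and it will also list .gguf files that already load fine, so skip those.

```python
from pathlib import Path
from typing import List


def upgraded_name(filename: str) -> str:
    """Insert 'ggufv2' before the quantization tag of a .gguf file name."""
    stem = filename[: -len(".gguf")]
    base, dot, quant = stem.rpartition(".")
    if not dot:  # no quantization tag in the name
        return f"{stem}.ggufv2.gguf"
    return f"{base}.ggufv2.{quant}.gguf"


def quantize_commands(models_dir: str) -> List[List[str]]:
    """Build one `quantize.exe <in> <out> COPY` command per .gguf file."""
    commands = []
    for path in sorted(Path(models_dir).glob("*.gguf")):
        if ".ggufv2." in path.name:
            continue  # looks like it was already converted
        commands.append(["quantize.exe", path.name, upgraded_name(path.name), "COPY"])
    return commands


# Run this from inside the models folder, next to quantize.exe.
for cmd in quantize_commands("."):
    print(" ".join(cmd))
```

Each printed line can be pasted into the cmd_windows.bat prompt after cd models. Every conversion writes a full copy of the model, so make sure there is enough disk space.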
runic spindle
#

Thank you, this worked! I'll do this with all my outdated models. Cheers

runic spindle
#

Hmm,

I ran the update_windows.bat file.

Then I tried to launch the start_windows.bat file.

I get this error:

Traceback (most recent call last):
File "J:\text-generation-webui\server.py", line 31, in <module>
from modules import (
File "J:\text-generation-webui\modules\chat.py", line 18, in <module>
from modules.text_generation import (
File "J:\text-generation-webui\modules\text_generation.py", line 21, in <module>
from modules.grammar import GrammarLogitsProcessor
File "J:\text-generation-webui\modules\grammar.py", line 1, in <module>
from torch_grammar import GrammarSampler
ModuleNotFoundError: No module named 'torch_grammar'
Press any key to continue . . .

Seems like I have missing files?

Any idea how to fix this?

quiet raptor
#
python -m pip install git+https://github.com/oobabooga/torch-grammar.git
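
To confirm the install landed in the webui's environment, you can run a quick check from the same cmd_windows.bat prompt. This is just a sketch using the standard library; the helper name is made up, and the two module names are the ones from the errors in this thread.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the ModuleNotFoundError messages in this thread.
print(missing_modules(["torch_grammar", "llama_cpp"]))
```

An empty list means both modules are importable from that environment. Anything still listed needs to be installed there, not in your system Python.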