#cuda out of memory


urban rune
#

off topic:
also trying ggml, i get this error


```
File "/home/nako/text-generation-webui/modules/ui_model_menu.py", line 190, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "/home/nako/text-generation-webui/modules/models.py", line 79, in load_model
    output = load_func_map[loader](model_name)
File "/home/nako/text-generation-webui/modules/models.py", line 244, in llamacpp_loader
    model_file = (list(Path(f'{shared.args.model_dir}/{model_name}').glob('*.gguf*')) + list(Path(f'{shared.args.model_dir}/{model_name}').glob('*ggml*.bin')))[0]
IndexError: list index out of range
```
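
The `IndexError` above comes from taking `[0]` of an empty list: neither glob pattern matched a file in the model folder. A minimal sketch of the same lookup with a clearer failure, assuming the model sits in its own folder under the model directory. `find_model_file` is a hypothetical helper for illustration, not part of text-generation-webui:

```python
from pathlib import Path


def find_model_file(model_dir: str, model_name: str) -> Path:
    """Return the first GGUF/GGML file found in model_dir/model_name.

    Hypothetical helper: it reuses the two glob patterns from the traceback,
    but raises a descriptive error instead of IndexError when nothing matches.
    """
    folder = Path(model_dir) / model_name
    # Same patterns as the failing line: GGUF files first, then GGML .bin files.
    # sorted() is an addition here so the result is deterministic.
    candidates = sorted(folder.glob("*.gguf*")) + sorted(folder.glob("*ggml*.bin"))
    if not candidates:
        raise FileNotFoundError(f"no *.gguf* or *ggml*.bin file found in {folder}")
    return candidates[0]
```

Calling it with the model directory and model name used by the web UI would show right away whether the folder contains a file the loader can pick up.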
crude kiln
#

Correct, CUDA is what's used to run calculations on the GPU, so when it's out of memory it refers to your VRAM. You can indeed split the work between GPU and CPU by switching the loader from ExLlama to llama.cpp. However, GGML has been superseded by GGUF, so download that kind of model if possible.

#

For reference I can just barely run ExLlama with a 13B GPTQ model and I have 12GB VRAM.

wispy pike
#

6GB of vram is too little to run the gptq model. What quant method is the ggml model, and which model loader are you using for it?

urban rune
#

i thought gptq might be better but ig i was wrong

#

also asked about ggml with gpu and cpu, but people say it's actually slower

#

so i gave up entirely and am sticking with what i have. if there is any "setting" tweaking to make it faster (without losing too much quality) i would like to know

urban rune
#

python3 server.py --api --auto-devices --disk --load-in-4bit --cache-capacity 10GiB --model models/mythomax-l2-13b.ggml.q4_K_M.bin --mul_mat_q

#

this is the best i can get, 2 tokens per sec
Output generated in 88.67 seconds (2.26 tokens/s, 200 tokens, context 1062, seed 2074356039)
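
When comparing settings it helps to pull the numbers out of that summary line instead of eyeballing them. A small sketch, assuming the line always has the shape shown above. `parse_generation_line` is a hypothetical helper name:

```python
import re

# Pattern for the "Output generated in ..." summary line quoted above.
LINE_RE = re.compile(
    r"Output generated in (?P<seconds>[\d.]+) seconds "
    r"\((?P<rate>[\d.]+) tokens/s, (?P<tokens>\d+) tokens, "
    r"context (?P<context>\d+), seed (?P<seed>\d+)\)"
)


def parse_generation_line(line: str) -> dict:
    """Extract timing fields and recompute tokens per second."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("line does not look like a generation summary")
    seconds = float(match["seconds"])
    tokens = int(match["tokens"])
    return {
        "seconds": seconds,
        "tokens": tokens,
        "context": int(match["context"]),
        # Recomputed from the raw fields rather than trusting the printed rate.
        "tokens_per_second": tokens / seconds,
    }


stats = parse_generation_line(
    "Output generated in 88.67 seconds (2.26 tokens/s, 200 tokens, "
    "context 1062, seed 2074356039)"
)
print(round(stats["tokens_per_second"], 2))  # 2.26
```

Running the same prompt with a fixed seed before and after a flag change gives two such lines that can be compared directly.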

crude kiln
#

With GPTQ it always tries to put the entire model on the GPU afaik. But with GGUF/GGML it defaults to running all layers on the CPU, and you need to specify how many of them should be run on the GPU.
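
A rough way to pick a starting value for the number of GPU layers is to split the model file size evenly across its layers and see how many fit in the VRAM you are willing to give up. This is only a sketch under loud assumptions: layers are treated as equal in size, the reserve for everything else that needs VRAM is a guess, and every number in the example is made up for illustration. `estimate_gpu_layers` is a hypothetical helper:

```python
def estimate_gpu_layers(
    model_file_mb: float,
    total_layers: int,
    vram_budget_mb: float,
    reserve_mb: float = 1024.0,
) -> int:
    """Rough starting guess for how many layers to offload to the GPU.

    Assumes every layer takes the same share of the model file, which is
    only approximately true. reserve_mb is VRAM left free for other uses.
    """
    if model_file_mb <= 0 or total_layers <= 0:
        raise ValueError("model_file_mb and total_layers must be positive")
    per_layer_mb = model_file_mb / total_layers
    usable_mb = max(vram_budget_mb - reserve_mb, 0.0)
    # Never report more layers than the model has.
    return min(total_layers, int(usable_mb // per_layer_mb))


# Hypothetical numbers: a 9000 MB file with 43 layers on a 6 GB card.
print(estimate_gpu_layers(9000, 43, 6144))
```

The result is a place to start, not a guarantee. If loading still runs out of memory, lower the layer count and try again.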

urban rune
#

but feels like the generate startup is a bit faster

crude kiln
#

If you launch text-generation-webui from a terminal, does it write something along these lines?

text-generation-webui-text-generation-webui-1  | llm_load_tensors: ggml ctx size =    0.12 MB
text-generation-webui-text-generation-webui-1  | llm_load_tensors: using CUDA for GPU acceleration
text-generation-webui-text-generation-webui-1  | llm_load_tensors: mem required  = 1973.50 MB (+ 3200.00 MB per state)
text-generation-webui-text-generation-webui-1  | llm_load_tensors: offloading 32 repeating layers to GPU
text-generation-webui-text-generation-webui-1  | llm_load_tensors: offloaded 32/43 layers to GPU
text-generation-webui-text-generation-webui-1  | llm_load_tensors: VRAM used: 6829 MB
#

Or does it not mention CUDA and GPU at all?

urban rune
# crude kiln Or does it not mention CUDA and GPU at all?

yes, i fixed my issue

python3 server.py --api --auto-devices --mul_mat_q --disk --thread 12 --gpu-memory 6 --cpu-memory 16 --n_ctx 4096 --llama_cpp_seed 0 --cache-capacity 2GiB --model models/mythomax-l2-13b.ggmlv3.q5_K_M.bin --n-gpu-layers 16

but got new issues regarding gguf in new thread

urban rune
#

my older message:

sooo, i installed cuda via the wrong repository (using the stable repo on a rolling release -_-), hoping nothing breaks, lol
https://www.reddit.com/r/openSUSE/comments/gaihe9/cuda_on_tumbleweed/
power of entewnet, i'll break my system
#

it loads fast now, like 2 sec, still going to try it though. i didn't have cuda and i didn't know i didn't, because i had the closed driver there, how annoying!

crude kiln
#

I didn't even bother with the cuda toolkit.. If I need it, e.g. for llama.cpp, I run it in docker instead

urban rune
#
python3 server.py --api --auto-devices --mul_mat_q --disk --n_ctx 4096 --no-mmap --mlock --no-cache --model models/remm-slerp-l2-13b.Q5_K_M.gguf --n-gpu-layers 18
#

i found out that this setting is perfection

crude kiln
#

Well, there are different parts of CUDA. I think I have the CUDA runtime 11 installed natively just because I use the proprietary Nvidia drivers. To compile CUDA apps I need the dev kit, which I didn't install because it sounded like it might mess up my system (they don't officially support Fedora 38 anyway). I can use the pre-built wheel for llama-cpp-python CUDA 12 by installing nvidia/cuda_runtime and nvidia/cublas Python packages which provide a bunch of .so-files, though I needed to manually set LD_LIBRARY_PATH for my Python program since pip doesn't do that.
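
The manual `LD_LIBRARY_PATH` step can be sketched like this, assuming the pip packages unpack their `.so` files into `nvidia/<component>/lib` folders inside site-packages (as with `nvidia/cuda_runtime` and `nvidia/cublas`). The helper names are hypothetical, and the variable generally has to be exported before Python starts, so the script only prints the line to use:

```python
import os
import sysconfig
from pathlib import Path


def nvidia_lib_dirs(site_packages: Path) -> list[str]:
    """List the nvidia/<component>/lib folders under a site-packages directory."""
    return sorted(str(p) for p in site_packages.glob("nvidia/*/lib") if p.is_dir())


def build_ld_library_path(site_packages: Path, existing: str = "") -> str:
    """Join the NVIDIA lib folders, keeping any existing value at the end."""
    parts = nvidia_lib_dirs(site_packages)
    if existing:
        parts.append(existing)
    return os.pathsep.join(parts)


if __name__ == "__main__":
    # Assumption: binary wheels were installed into this interpreter's platlib.
    site_packages = Path(sysconfig.get_paths()["platlib"])
    value = build_ld_library_path(site_packages, os.environ.get("LD_LIBRARY_PATH", ""))
    print(f"export LD_LIBRARY_PATH={value}")
```

The printed line can be pasted into the shell, or put in a small launcher script, before starting the program that imports llama-cpp-python.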

#

But there are nvidia/cuda docker images for all the different versions, both devel images for compiling and more lightweight runtime images for running stuff. So as long as you can expose your GPU to docker you can run everything in there.

#

And to do that you need the NVIDIA Container Toolkit, but since I use docker a lot I had that installed already.

urban rune
#

makes sense....

#

ig it's skill issue on my part, but meh

urban rune