cuda out of memory | Text Generation WebUI | Page 1

urban rune Aug 29, 2023, 8:41 AM

#

off topic:
also trying ggml, and i get this error:


```
File "/home/nako/text-generation-webui/modules/ui_model_menu.py", line 190, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "/home/nako/text-generation-webui/modules/models.py", line 79, in load_model
    output = load_func_map[loader](model_name)
File "/home/nako/text-generation-webui/modules/models.py", line 244, in llamacpp_loader
    model_file = (list(Path(f'{shared.args.model_dir}/{model_name}').glob('*.gguf*')) + list(Path(f'{shared.args.model_dir}/{model_name}').glob('*ggml*.bin')))[0]
IndexError: list index out of range
```
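
The IndexError in that traceback means neither glob pattern matched anything under the model folder, so indexing `[0]` on the empty list failed. A minimal sketch of the same lookup with a clearer error is below. The function name `find_llamacpp_model_file` is made up for illustration and is not part of text-generation-webui; only the two glob patterns are taken from the traceback.

```python
from pathlib import Path


def find_llamacpp_model_file(model_dir: str, model_name: str) -> Path:
    """Return the first GGUF/GGML file inside model_dir/model_name.

    Hypothetical helper: it reuses the two glob patterns from the
    traceback, but raises a descriptive error instead of IndexError
    when nothing matches.
    """
    folder = Path(model_dir) / model_name
    candidates = list(folder.glob('*.gguf*')) + list(folder.glob('*ggml*.bin'))
    if not candidates:
        raise FileNotFoundError(
            f"no '*.gguf*' or '*ggml*.bin' file found under {folder}"
        )
    return candidates[0]
```

If this raises, the thing to check is whether the file name really matches one of the two patterns and whether it sits where the loader is looking.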

crude kiln Aug 29, 2023, 10:38 AM

#

Correct, CUDA is what's used to run calculations on the GPU, so when it's out of memory it refers to your VRAM. You can indeed split the work between GPU and CPU by switching the loader from ExLlama to llama.cpp. However, GGML has been superseded by GGUF, so download that kind of model if possible.

#

For reference I can just barely run ExLlama with a 13B GPTQ model and I have 12GB VRAM.

wispy pike Aug 29, 2023, 11:19 AM

#

6GB of vram is too little to run the gptq model. What quant method is the ggml model, and which model loader are you using for it?

urban rune Aug 29, 2023, 3:40 PM

#

crude kiln Correct, CUDA is what's used to run calculations on the GPU, so when it's out of...

makes sense. already fixed the ggml error by removing v3 from the name (it was like ggmlv3)

urban rune Aug 29, 2023, 3:42 PM

#

wispy pike 6GB of vram is too little to run the gptq model. What quant method is the ggml m...

used bloke's 13b mythomax-l2-13b.ggml.q4_K_M.bin now, speed is slow (but bearable)

#

i thought gptq might be better but ig i was wrong

#

https://tenor.com/view/dark-souls-bonfire-fantasy-fromsoft-video-games-gif-18642052

#

also asked about ggml with gpu and cpu, but people say it's actually slower

#

so i gave up entirely and am sticking with what i have. if there is any setting i can tweak to make it faster (without losing too much quality) i would like to know

urban rune Aug 29, 2023, 6:10 PM

#

python3 server.py --api --auto-devices --disk --load-in-4bit --cache-capacity 10GiB --model models/mythomax-l2-13b.ggml.q4_K_M.bin --mul_mat_q

#

this is the best i can get, about 2 tokens per sec
Output generated in 88.67 seconds (2.26 tokens/s, 200 tokens, context 1062, seed 2074356039)

crude kiln Aug 31, 2023, 6:25 AM

#

You need to tell it how many layers you want to offload to the GPU, see this section if you want to do it using command line args https://github.com/oobabooga/text-generation-webui#ggmlgguf-for-llamacpp-and-ctransformers

#

With GPTQ it always tries to put the entire model on the GPU afaik. But with GGUF/GGML it defaults to running all layers on the CPU, and you need to specify how many of them should be run on the GPU.
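
As a rough sketch of what "specifying the layers" looks like on the command line, the snippet below assembles a `server.py` invocation. The helper `build_server_command` is hypothetical; the flags `--model`, `--n_ctx` and `--n-gpu-layers` are the ones that show up in the commands posted in this thread, and the right layer count depends on your VRAM.

```python
import shlex


def build_server_command(model_path: str, n_gpu_layers: int, n_ctx: int = 4096) -> list[str]:
    """Assemble a server.py command that offloads n_gpu_layers layers to the GPU.

    Hypothetical helper for illustration; it only builds the argument
    list and does not start the web UI.
    """
    return [
        "python3", "server.py",
        "--model", model_path,
        "--n_ctx", str(n_ctx),
        "--n-gpu-layers", str(n_gpu_layers),
    ]


# Example values taken from the thread; tune the layer count for your own card.
cmd = build_server_command("models/mythomax-l2-13b.ggmlv3.q5_K_M.bin", n_gpu_layers=16)
print(shlex.join(cmd))
```

Starting low and raising the number until VRAM runs out is probably the simplest way to find a workable value.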

urban rune Aug 31, 2023, 2:57 PM

#

crude kiln You need to tell it how many layers you want to offload to the GPU, see this sec...

didn't have any noticeable change

#

but it feels like generation starts up a bit faster

#

thank you

crude kiln Sep 1, 2023, 7:14 PM

#

If you launch text-generation-webui from a terminal, does it write something along these lines?

text-generation-webui-text-generation-webui-1  | llm_load_tensors: ggml ctx size =    0.12 MB
text-generation-webui-text-generation-webui-1  | llm_load_tensors: using CUDA for GPU acceleration
text-generation-webui-text-generation-webui-1  | llm_load_tensors: mem required  = 1973.50 MB (+ 3200.00 MB per state)
text-generation-webui-text-generation-webui-1  | llm_load_tensors: offloading 32 repeating layers to GPU
text-generation-webui-text-generation-webui-1  | llm_load_tensors: offloaded 32/43 layers to GPU
text-generation-webui-text-generation-webui-1  | llm_load_tensors: VRAM used: 6829 MB

#

Or does it not mention CUDA and GPU at all?
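
If you save the terminal output to a file, a small script can answer that question for you. This is only a sketch: the regexes assume the log lines look like the ones pasted above (other llama.cpp versions may word them differently), and `summarize_offload` is a made-up name. The per-layer figure is a crude average, since the reported VRAM also covers buffers that are not layers.

```python
import re

# Patterns assume the wording of the log lines quoted in this thread.
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")
VRAM_RE = re.compile(r"VRAM used:\s*(\d+) MB")


def summarize_offload(log_text: str) -> dict:
    """Extract GPU offload numbers from llama.cpp load output.

    Returns an empty dict when the log never mentions offloading,
    which may mean the model is running on the CPU only.
    """
    summary = {}
    offload = OFFLOAD_RE.search(log_text)
    if offload:
        summary["gpu_layers"] = int(offload.group(1))
        summary["total_layers"] = int(offload.group(2))
    vram = VRAM_RE.search(log_text)
    if vram:
        summary["vram_mb"] = int(vram.group(1))
    if summary.get("gpu_layers", 0) > 0 and "vram_mb" in summary:
        # Rough average only: VRAM use is not purely per-layer.
        summary["mb_per_layer"] = summary["vram_mb"] / summary["gpu_layers"]
    return summary
```

For the log above this gives 32 of 43 layers on the GPU and roughly 213 MB per offloaded layer, which is a starting point for guessing how many layers a smaller card could hold.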

urban rune Sep 3, 2023, 7:37 AM

#

crude kiln Or does it not mention CUDA and GPU at all?

yes, i fixed my issue

python3 server.py --api --auto-devices --mul_mat_q --disk --thread 12 --gpu-memory 6 --cpu-memory 16 --n_ctx 4096 --llama_cpp_seed 0 --cache-capacity 2GiB --model models/mythomax-l2-13b.ggmlv3.q5_K_M.bin --n-gpu-layers 16

but i got new issues with gguf, posted in a new thread

urban rune Sep 5, 2023, 11:03 AM

#

crude kiln Or does it not mention CUDA and GPU at all?

ty it was helpful

#

my older message:

sooo, i installed cuda via the wrong repository (using the stable repo on a rolling release -_-), hoping nothing breaks, lol
https://www.reddit.com/r/openSUSE/comments/gaihe9/cuda_on_tumbleweed/
power of entewnet, i'll break my system

#

it loads fast now, like 2 sec. going to try it out though. i didn't have cuda, and i didn't know i didn't because i had the closed driver there, how annoying!

crude kiln Sep 5, 2023, 6:40 PM

#

I didn't even bother with the cuda toolkit.. If I need it, e.g. for llama.cpp, I run it in docker instead

urban rune Sep 6, 2023, 6:12 AM

#

crude kiln I didn't even bother with the cuda toolkit.. If I need it, e.g. for llama.cpp, I...

wait, you can use cuda without having it on your system?
painful noises

#

python3 server.py --api --auto-devices --mul_mat_q --disk --n_ctx 4096 --no-mmap --mlock --no-cache --model models/remm-slerp-l2-13b.Q5_K_M.gguf --n-gpu-layers 18

#

i found out that this setting is perfection

crude kiln Sep 6, 2023, 7:56 AM

#

Well, there are different parts of CUDA. I think I have the CUDA runtime 11 installed natively just because I use the proprietary Nvidia drivers. To compile CUDA apps I need the dev kit, which I didn't install because it sounded like it might mess up my system (they don't officially support Fedora 38 anyway). I can use the pre-built wheel for llama-cpp-python CUDA 12 by installing nvidia/cuda_runtime and nvidia/cublas Python packages which provide a bunch of .so-files, though I needed to manually set LD_LIBRARY_PATH for my Python program since pip doesn't do that.
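
For the LD_LIBRARY_PATH part, something along these lines is one way to do it. It is a sketch that assumes the pip packages unpack their .so files into `site-packages/nvidia/<package>/lib`; check your own install, since that layout is an assumption here. Both function names are made up for the example.

```python
import os
from pathlib import Path


def nvidia_lib_dirs(site_packages: str) -> list[str]:
    """List lib directories of pip-installed nvidia/* packages.

    Assumes the layout site-packages/nvidia/<package>/lib.
    """
    root = Path(site_packages) / "nvidia"
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.glob("*/lib") if p.is_dir())


def env_with_cuda_libs(site_packages: str, base_env=None) -> dict:
    """Copy the environment and prepend those directories to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    parts = nvidia_lib_dirs(site_packages)
    if not parts:
        return env  # nothing found, leave the environment untouched
    if env.get("LD_LIBRARY_PATH"):
        parts.append(env["LD_LIBRARY_PATH"])
    env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return env
```

The dynamic loader reads LD_LIBRARY_PATH when a process starts, so the returned dict is meant to be passed to a child process, for example `subprocess.run([...], env=env)`, rather than set inside the already running Python program.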

#

But there are nvidia/cuda docker images for all the different versions, both devel images for compiling and more lightweight runtime images for running stuff. So as long as you can expose your GPU to docker you can run everything in there.

#

And to do that you need the NVIDIA Container Toolkit, but since I use docker a lot I had that installed already.
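
A quick way to check that docker can see the GPU is to run `nvidia-smi` inside one of those images. The sketch below only builds the command. The image tag is an assumption picked for illustration (use whichever nvidia/cuda tag matches your setup), and `--gpus all` needs the NVIDIA Container Toolkit mentioned above to be installed.

```python
import shutil


def docker_gpu_check_command(image: str = "nvidia/cuda:12.2.0-runtime-ubuntu22.04") -> list[str]:
    """Command that runs nvidia-smi in a CUDA container with all GPUs exposed.

    The default image tag is an assumption, not something from this thread.
    """
    return ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"]


def docker_available() -> bool:
    """True when a docker executable is found on PATH."""
    return shutil.which("docker") is not None
```

If `docker_available()` is true, you could pass the list to `subprocess.run`; seeing the usual nvidia-smi table with your card listed suggests the GPU is exposed correctly.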

urban rune Sep 6, 2023, 9:24 AM

#

makes sense....

#

ig it's a skill issue on my part, but meh
