Can't load any model | Text Generation WebUI | Page 1

thick moat Nov 5, 2024, 3:30 PM

I am on Windows. I downloaded "text generation webui" and ran start_windows.bat and update_wizard_windows.bat (only the first line).
I have tried to load 5 different models, but I always get an error during the "Load".
What have I done wrong?
(Windows 10, i7, 16 GB RAM, 3080 with 16 GB but only 8 GB shareable)

trail falcon Nov 5, 2024, 5:14 PM

thick moat I am on windows, i have downloaded "text generation webui", run the start_window...

It really depends on the model. But I think you'll likely have to use tensor_split for most models, and for some bigger models even reduce n-gpu-layers.

I don't know what settings could work well for this model and your setup but I'd try this:

Reduce n-gpu-layers to 40.
Put 50,50 in the tensor_split field (50% on the first GPU, 50% on the 2nd).

Then try to load the model and check how much VRAM both cards use.
If there's free space left, you can increase these values a bit and check again.
If it still doesn't load, decrease them.

I don't know if it's required for llama.cpp, but for the exllamav2 loader the first value in tensor_split has to be smaller than the 2nd. For example, you could also try something like n-gpu-layers=35, tensor_split=40,60 (40% on the 1st, 60% on the 2nd).
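
To make the tensor_split numbers concrete, here is a rough Python sketch of how a percentage split maps onto layer counts. It is only an illustration: the function name split_layers is made up for this example, and the real loaders decide placement internally and may round differently.

```python
def split_layers(n_gpu_layers, tensor_split):
    """Divide n_gpu_layers across GPUs in proportion to a tensor_split string.

    Hypothetical helper for illustration only; it is not part of any loader.
    """
    ratios = [float(part) for part in tensor_split.split(",")]
    total = sum(ratios)
    if total <= 0:
        raise ValueError("tensor_split must contain positive numbers")

    counts = []
    assigned = 0
    cumulative = 0.0
    for ratio in ratios:
        # Round the running total so the counts always sum to n_gpu_layers.
        cumulative += ratio
        target = round(n_gpu_layers * cumulative / total)
        counts.append(target - assigned)
        assigned = target
    return counts


print(split_layers(40, "50,50"))
print(split_layers(35, "40,60"))
```

With a single GPU there is nothing to split, so only n-gpu-layers matters in that case.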

thick moat Nov 5, 2024, 6:11 PM

Thank you for your help. I have only one GPU, a 3080 with 16 GB, but only 8 GB are shareable (maybe because I am on a desktop).
I have tried lowering the "n_ctx" value to 4000 and 1000, but it doesn't work.
The "model loader" and all the values under it were selected automatically when I chose this model (which works with LM Studio).

I have tried lowering "n-gpu-layers" to 40, then 20, 10 and 0, but nothing works!
Do I need to update the extensions (line 2) in update_wizard_windows.bat?

I read that I am not the only one who has problems loading models. Maybe I should wait for this software to mature before using it?

trail falcon Nov 5, 2024, 6:37 PM

Hm, you edited your PC specs.
Does that mean you have a 3080 with 8 GB of dedicated VRAM? Is it a laptop graphics card?
I think with 16 GB RAM + 8 GB GPU this model is too big for your PC.
Try something like Qwen2.5-7B or Llama3-8B.
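
As a back-of-the-envelope check, you can estimate a quantized model's size from its parameter count and bits per weight. This is a rough sketch, not an exact rule: the function names are made up, the 1.5 GB overhead for context and buffers is a guess, and 4.5 bits per weight is the figure for Q4_0 (other 4-bit quants come out slightly different).

```python
def estimate_model_size_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters * bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead_gb=1.5):
    """True if the weights plus an assumed overhead fit entirely in VRAM.

    overhead_gb is a placeholder guess for context cache and buffers.
    """
    needed = estimate_model_size_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb


# A 22B model at roughly 4.5 bits per weight versus 8 GB of VRAM.
print(estimate_model_size_gb(22, 4.5), fits_in_vram(22, 4.5, 8))
# A 7B model at the same quantization level.
print(estimate_model_size_gb(7, 4.5), fits_in_vram(7, 4.5, 8))
```

A model that fails this check can still run with only part of its layers on the GPU and the rest in system RAM, but it will be slower.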

Don't bother with the update wizard; it isn't going to help here, and extensions have nothing to do with loading a model.

If you find this app complicated, yeah, it's not very intuitive.
You can try KoboldCpp or LM Studio. They support fewer model types but are a lot easier to use. They also come prepackaged, so they are more lightweight.

Also, I just noticed you're using a Q4_0_4_8.gguf model. It isn't meant for PCs; it's for high-end ARM devices.
I think it should still load, but next time use Q4_K_M or some IQ4 quant. These should be noticeably faster than Q4_0_4_8.

thick moat Nov 5, 2024, 6:45 PM

Thank you again. I will try another model.
Do you have a link explaining what Q4, O, ... and the other letters in the model description mean?

trail falcon Nov 5, 2024, 6:57 PM

Hmmm... it's a bit awkward to explain.
This is the naming style for compressed (quantized) models in GGUF format.

The number in Q2/3/4/5/6/8 roughly represents the quality of the model.
Q8 is 8-bit precision, nearly perfect quality, almost as good as the original uncompressed model.
Q6 is worse but still almost flawless.
Q5 is ok.
Q4 is usually the smallest usable size; in rare cases it might even work poorly, especially for non-English languages.
Q3 is usually trash.
Q2 is even worse.

Q4_0/Q4_1 use the oldest compression method; don't bother with them.
Q4_K (Q4_K_M, Q4_K_S, etc.) use a newer method; these are okay.
IQ4_XS/NL use an even newer method. These are supposed to be okay too, but honestly I don't know if they are worth using.
Q4_0_4_4 is optimized for low-end mobile phones, Q4_0_4_8 for high-end phones, and Q4_0_8_8 for the newest high-end phones, but very few support it. These can also work on Mac computers. They are noticeably slower on PC but somewhat faster on ARM devices (phones, tablets, Mac computers).
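
If it helps, here is a small Python sketch that pulls the quant tag out of a GGUF filename and sorts it into the families described above. It is only an illustration: the regex, the function name and the family labels are my own, and the example filenames are hypothetical.

```python
import re

# Matches tags such as Q4_K_M, IQ4_XS or Q4_0_4_8 inside a filename.
QUANT_RE = re.compile(
    r"(?<![a-z0-9])(i?q)([2-8])((?:_[a-z0-9]+)*)(?![a-z0-9])",
    re.IGNORECASE,
)

ARM_SUFFIXES = ("_0_4_4", "_0_4_8", "_0_8_8")


def describe_quant(filename):
    """Return the quant tag, bit count and a rough family label, or None."""
    match = QUANT_RE.search(filename)
    if match is None:
        return None
    prefix = match.group(1).upper()
    bits = int(match.group(2))
    suffix = match.group(3).upper()

    if prefix == "IQ":
        family = "iq"        # newest method
    elif suffix.startswith("_K"):
        family = "k-quant"   # newer method
    elif suffix in ARM_SUFFIXES:
        family = "arm"       # optimized for ARM devices
    elif suffix in ("_0", "_1"):
        family = "legacy"    # oldest method
    else:
        family = "unknown"
    return {"tag": prefix + str(bits) + suffix, "bits": bits, "family": family}


for name in ("example-22B-Q4_K_M.gguf", "example-7B-IQ4_XS.gguf", "example-Q4_0_4_8.gguf"):
    print(name, describe_quant(name))
```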

Usually, Q4_K_M is considered the best balance of size and quality, so if you're unsure, try it.

I've seen in some recent tests that Q3_K_L or Q5_K_S might have unusually high quality, but I haven't tested them, and usually it's hard to tell the difference unless something breaks.
For example, Q2/Q3 models tend to repeat words and phrases, write something like "Hello, userrrrrrrrrrrr", or fail to write numbers and dates properly.

thick moat Nov 5, 2024, 7:15 PM

Thank you for your explanations. I will look for Q4_K_M now.
I tried with "Codestral-22B-v0.1-GGUF", which works for me in LM Studio.
