Hello! I'm really new to LLaMA and related stuff. I tried to download wizard-vicuna and couldn't get it to work, so I downloaded normal Vicuna instead and got the same result. In the end I tried AlekseyKorshuk_vicuna-7b and it finally worked. But it is REALLY slow: I'm talking around 0.9 tokens a second on a good day. Any idea how to fix it or what might be going on?
#low token (SOLVED)
What's your hardware?
I am getting anywhere between 0.8 to 1.2 tokens/s on RX 7900 XT, with model GPT-4-x-Vicuna-13b-4bit.
Would also appreciate any advice
My hardware: RTX 3060 Ti, i5-10600K, 16 GB RAM
Have you tried running 4bit models? It could be the case of not having enough VRAM
I'm gonna try this one: TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ
Gonna see if it helps
Also, is there a way to use both CPU and GPU at the same time?
I believe it does offload the model to RAM if there's not enough VRAM
To actually use the CPU for processing, I think llama.cpp does that
But I was not able to get both CPU and GPU working with it, and CPU only is even slower for me
Could be due to having only 16 GB RAM
I've ordered a 32 GB kit, will try again once I get it
GGML models support using the CPU and GPU together to accelerate generation. But you have to compile llama-cpp-python with cuBLAS support for that to work, as the version installed by the webui is CPU-only.
GPTQ models run on the GPU only. You can offload some of the model to RAM, but that data has to be swapped into the GPU for processing which greatly slows the model down.
Currently, the webui is set to use AutoGPTQ to load GPTQ models by default. If you enable auto-devices, the webui will automatically offload some of the model to RAM if you run out of VRAM.
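For a rough sense of whether a model fits in VRAM at all, the arithmetic can be sketched like this. It is only a back-of-the-envelope estimate: it counts the weights alone and ignores the context cache, activations, and quantization overhead, and the function name is made up for illustration.

```python
def weight_size_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width and nothing
    else (cache, activations, quantization scales) is counted.
    """
    return n_params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for name, params, bits in [
        ("7B at 16-bit", 7, 16),
        ("13B at 4-bit", 13, 4),
        ("30B at 4-bit", 30, 4),
    ]:
        print(f"{name}: ~{weight_size_gb(params, bits):.1f} GB of weights")
```

By this estimate a 7B model loaded unquantized at 16-bit would not fit in the 8 GB of an RTX 3060 Ti, while a 13B model at 4-bit might, which is one plausible reason the 4-bit suggestion keeps coming up.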
If VRAM is not an issue, what could cause low speed?
(This is with model loaded)
Possibly AutoGPTQ itself. I've noticed slower speeds with that on my 1080ti.
the alternative is GPTQ-for-LLaMa, right?
Yes
Just keep in mind that GPTQ-for-LLaMa can't automatically detect GPTQ model parameters, so you will need to either set the wbits and groupsize params in the webui or add something like -4bit-128g to the model folder's name, according to what the model uses.
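To illustrate what a suffix like -4bit-128g encodes, here is a small sketch that pulls the two values out of a folder name. This is a hypothetical helper written for this example, not the webui's actual detection code, and the -1 for "no group size found" is just a convention chosen here.

```python
import re


def guess_gptq_params(folder_name: str):
    """Hypothetical helper: read wbits and groupsize from a name like 'model-4bit-128g'.

    Returns (wbits, groupsize), or None if no '<N>bit' marker is present.
    groupsize is -1 when the name has no '-<N>g' part (a convention of this sketch).
    """
    match = re.search(r"(\d+)bit(?:-(\d+)g)?", folder_name.lower())
    if match is None:
        return None
    wbits = int(match.group(1))
    groupsize = int(match.group(2)) if match.group(2) else -1
    return wbits, groupsize


if __name__ == "__main__":
    print(guess_gptq_params("Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g"))
    print(guess_gptq_params("Wizard-Vicuna-13B-Uncensored-GPTQ"))
```

The second name carries no bit width or group size, which is the situation where the values have to be set by hand in the webui.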
Do you know of any forks of GPTQ-for-LLaMa-ROCm that support pytorch 2.0 and later?
I know how to compile ROCm GPTQ-for-LLaMa using GitHub Actions, so I can try that and see if it works for you.
Do you want me to compile Ooba's fork for ROCm or the latest version from qwopqwop200's repo?
It seems like there is a guide specifically for RDNA3, so I will try it
https://are-we-gfx1100-yet.github.io/post/text-gen-webui/
Are we gfx1100 yet?
Prerequisites Install AMDGPU driver with ROCm Download the following prebuilt wheels into ~/Downloads torch https://github.com/evshiron/rocm_lab/releases/download/v1.14.514/torch-2.0.1+gite19229c-cp310-cp310-linux_x86_64.whl bitsandbytes https://github.com/evshiron/rocm_lab/releases/download/v1.14.514/bitsandbytes-0.37.2-py3-none-any.whl If some...
Don't know why you would need a fork for it though it is nice to see that there is one. Last time I compiled gptq with ROCm, it automatically converted the code to HIP.
I tried downloading TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ as you told me and tried to run it, but I get the same thing as with the last model: I don't get any response
I even followed these steps
No errors? The error message might be getting hidden. Try running cmd_windows.bat and enter these commands to launch the webui:
cd text-generation-webui
python server.py
nothing is happening still
Is there nothing in the CMD window?
Try loading it with gptq-for-llama checked, wbits set to 4 and groupsize set to 128
I did that and now it is just repeating what I'm saying. It is also taking a lot of time
All right, after following this guide I have significantly increased the speed
However this guide is only related to RDNA3, so probably not useful for OP
Most likely the difference is due to custom bitsandbytes
It's giving me output, but at 0.4 tokens/s
I have no idea why it would be that slow. I get significantly faster speeds on my 1080ti.
The fastest I've gotten just now is 1.08 tokens a second
I'm getting ~2.36 tokens/s. Only thing I can think of to try is GGML.
u mean this one?
Yeah
Should I put it into the model folder?
I would give it its own folder, though you can just put it into \text-generation-webui\models
GGML models are pretty good and are what I use exclusively as it is the only way I can load 30B models on my system.
It wouldn't surprise me at all if GGML models eventually become the standard consumer AI model format due to their better memory efficiency. Once the GPU acceleration support is more fleshed out, I imagine most people will switch to them regardless.
With GGML I can run 30B models at slightly faster speeds than 13B gptq models.
I'm getting this
when I put it in its own folder or just in \text-generation-webui\models
I guess that is still an issue. Move the .bin file to a folder with -ggml in the name. Something like Wizard-Vicuna-13B-Uncensored-ggml.
Now I'm getting this
Do you still have gptq-for-llama option checked or does the folder name have gptq or 4bit in it?
Something is causing the webui to think the model is a gptq model.
I did have the gptq-for-llama option checked
Once I unchecked it, I got this instead
wait a min
now it's working
I reset the settings, including these
I'm also getting 2.7 tokens a second
Is that good?
It's ok. You would likely need a better CPU for faster speeds.
You can try GPU acceleration, but you will need to compile llama-cpp-python to support it.
Is it hard to do?
Not really, just takes a bit of time to set up Visual Studio C++ Build Tools.
Everything else needed to compile should be already installed by the one-click-installer.
Do I download it inside the model folder or the whole oobabooga one?
Or run the compile in there?
Once VS C++ is installed, run cmd_windows.bat and enter these commands:
set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
python -m pip install git+https://github.com/abetlen/llama-cpp-python --force-reinstall --no-deps
If it fails to find Visual Studio automatically, use these instead:
call "C:\Program Files\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
set DISTUTILS_USE_SDK=1
set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
python -m pip install git+https://github.com/abetlen/llama-cpp-python --force-reinstall --no-deps
It's done
Now just set the n-gpu-layers option in the webui to something like 30 and it should put 30 layers of the model onto GPU. The number you can use depends on the VRAM required to run each layer. That 13B model has 40 layers. 30 should provide plenty of acceleration if you can't fit all 40.
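As a rough way to pick a starting value for n-gpu-layers, the estimate could be sketched like this. It assumes all layers are the same size and that the model file is made up entirely of layers, which is only approximately true. The 8.0 GB file size and the reserve value are placeholders, not measurements, and the function name is hypothetical.

```python
def layers_that_fit(model_size_gb: float, n_layers: int,
                    free_vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many layers can be offloaded to the GPU.

    Assumptions (all simplifications): layers are equally sized, the whole
    file consists of layers, and reserve_gb covers the context cache and
    whatever else is already using VRAM.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Placeholder numbers: a 40-layer model in an 8.0 GB file on an 8 GB card.
    print(layers_that_fit(model_size_gb=8.0, n_layers=40,
                          free_vram_gb=8.0, reserve_gb=1.5))
```

In practice it is probably easier to start near the estimate and lower the value if loading fails with an out-of-memory error.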
The latest llama-cpp-python must have broken the webui's support. If you want to continue, repeat the last commands but use this instead of the last one:
python -m pip install git+https://github.com/abetlen/llama-cpp-python@71f4582d4469ba74529386abb66a835e3ad1c374 --force-reinstall --no-deps
All of this would be a lot more convenient if the llama-cpp-python devs would distribute versions with GPU support instead of CPU only.
Do I copy this code and just paste it into cmd_windows?
set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
python -m pip install git+https://github.com/abetlen/llama-cpp-python@71f4582d4469ba74529386abb66a835e3ad1c374 --force-reinstall --no-deps
Ah, now it's working
cmd was not running
Oh wow, I'm now getting 4.5 tokens a second
Nice!
With this prompt I can get it to 5.66
You have a much newer CPU and GPU than I do, so the benefits of GPU acceleration will be a lot better for you.
Larger prompt means slower generation. There are some other nuances, but I don't understand them.
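One likely reason is that the prompt has to be processed before the first new token appears. Assuming the displayed rate is new tokens divided by total wall time (prompt processing included), the effect can be sketched as follows. The speeds are made-up numbers and the function name is hypothetical.

```python
def reported_tokens_per_second(prompt_tokens: int, new_tokens: int,
                               prompt_speed: float, gen_speed: float) -> float:
    """Tokens/s as a UI might report it: new tokens over total elapsed time.

    Assumption: elapsed time = prompt processing time + generation time,
    with both speeds given in tokens per second.
    """
    prompt_time = prompt_tokens / prompt_speed
    gen_time = new_tokens / gen_speed
    return new_tokens / (prompt_time + gen_time)


if __name__ == "__main__":
    # Same generation speed, different prompt lengths (illustrative values).
    short = reported_tokens_per_second(0, 100, prompt_speed=50, gen_speed=5)
    long = reported_tokens_per_second(1000, 100, prompt_speed=50, gen_speed=5)
    print(f"short prompt: {short:.2f} tokens/s, long prompt: {long:.2f} tokens/s")
```

Under this assumption the raw generation speed is unchanged, but a long prompt pulls the reported number down because its processing time is counted in the total.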