#low token (SOLVED)


forest raft
#

Hello! I'm really new to LLaMA and stuff. I tried to download wizard-vicuna and I can't get it to work. So instead I downloaded normal Vicuna and still the same. In the end I tried AlekseyKorshuk_vicuna-7b and it finally worked. Yet it is REALLY slow, I'm talking around 0.9 tokens a second on a good day. Any idea how to fix it or what might be going on?

stoic moat
#

What's your hardware?
I am getting anywhere between 0.8 and 1.2 tokens/s on an RX 7900 XT, with the model GPT-4-x-Vicuna-13b-4bit.
Would also appreciate any advice

forest raft
#

gonna see if it will help

#

also is there a way to use both CPU & GPU at the same time?

stoic moat
#

To actually use the CPU for processing, I think llama.cpp does that

#

But I was not able to get both CPU and GPU working with it, and CPU only is even slower for me

#

Could be due to having only 16 GB RAM

#

I've ordered a 32 GB kit, will try again once I get it

fervent acorn
#

GPTQ models run on the GPU only. You can offload some of the model to RAM, but that data has to be swapped into the GPU for processing which greatly slows the model down.

#

Currently, the webui is set to use AutoGPTQ to load gptq models by default. If you enable auto-devices then the webui will automatically offload some of the model to RAM if you run out of VRAM.

stoic moat
#

(This is with model loaded)

stoic moat
#

the alternative is GPTQ-for-LLaMa, right?

fervent acorn
#

Just keep in mind that GPTQ-for-LLaMa can't automatically detect gptq model parameters, so you will need to either set the wbits and groupsize params in the webui or add something like -4bit-128g to the model folder's name according to what the model uses.
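
As a rough illustration of the folder-naming convention mentioned above, here is a minimal sketch of how the two values could be read out of a name such as `vicuna-13b-4bit-128g`. This is not the webui's actual code; `guess_gptq_params` is a hypothetical helper and the regular expressions are assumptions made for the example.

```python
import re


def guess_gptq_params(folder_name):
    """Guess (wbits, groupsize) from a model folder name.

    Hypothetical helper for illustration only. A value that cannot be
    found in the name is returned as None, which is the case where the
    parameters would have to be set by hand.
    """
    name = folder_name.lower()
    # "4bit" -> wbits = 4
    wbits_match = re.search(r"(\d+)bit", name)
    # "128g" -> groupsize = 128 (digits, then "g", then no further letters/digits)
    group_match = re.search(r"(\d+)g(?![a-z0-9])", name)
    wbits = int(wbits_match.group(1)) if wbits_match else None
    groupsize = int(group_match.group(1)) if group_match else None
    return wbits, groupsize


print(guess_gptq_params("vicuna-13b-4bit-128g"))
print(guess_gptq_params("Wizard-Vicuna-13B-Uncensored-GPTQ"))
```

A name without those suffixes, like the second one, gives no hint at all, which matches the advice to set wbits and groupsize manually in that case.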

stoic moat
#

Do you know of any forks of GPTQ-for-LLaMa-ROCm that support pytorch 2.0 and later?

fervent acorn
#

I know how to compile ROCm GPTQ-for-LLaMa using GitHub Actions, so I can try that and see if it works for you.

stoic moat
#

It seems like there is a guide specifically for RDNA3, so I will try it
https://are-we-gfx1100-yet.github.io/post/text-gen-webui/

fervent acorn
#

Don't know why you would need a fork for it, though it is nice to see that there is one. Last time I compiled GPTQ with ROCm, it automatically converted the code to HIP.

forest raft
#

I tried downloading TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ as you told me and I tried to run it, but I get the same thing as with the last model, where I don't get any response

#

i even followed these steps

forest raft
#

nothing is happening still

fervent acorn
# forest raft

Try loading it with gptq-for-llama checked, wbits set to 4 and groupsize set to 128

stoic moat
#

However this guide is only related to RDNA3, so probably not useful for OP

stoic moat
#

Most likely the difference is due to custom bitsandbytes

forest raft
#

it's giving me output, but at 0.4 tokens/s

forest raft
#

the fastest I've gotten it just now is 1.08 tokens a second

forest raft
#

Should i put it into the model folder ?

fervent acorn
#

GGML models are pretty good and are what I use exclusively as it is the only way I can load 30B models on my system.
It wouldn't surprise me at all if GGML models eventually become the standard consumer AI model format due to their better memory efficiency. Once the GPU acceleration support is more fleshed out, I imagine most people will switch to them regardless.

#

With GGML I can run 30B models at slightly faster speeds than 13B gptq models.

forest raft
#

im getting this

#

when i put it in its own folder or just in the \text-generation-webui\models

fervent acorn
# forest raft im getting this

I guess that is still an issue. Move the .bin file to a folder with -ggml in the name. Something like Wizard-Vicuna-13B-Uncensored-ggml.

fervent acorn
# forest raft now im getting this

Do you still have gptq-for-llama option checked or does the folder name have gptq or 4bit in it?
Something is causing the webui to think the model is a gptq model.

forest raft
#

I did have the gptq-for-llama option checked

#

once i unchecked this i got this instead

#

wait a min

#

now it's working

#

I reset the settings, including these

#

im also getting 2.7 tokens a second

#

is that good?

fervent acorn
# forest raft is that good?

It's ok. You would likely need a better CPU for faster speeds.
You can try GPU acceleration, but you will need to compile llama-cpp-python to support it.

fervent acorn
#

Everything else needed to compile should be already installed by the one-click-installer.

forest raft
#

Do I download it inside the model folder or the whole oobabooga one

#

or run the compile in there

fervent acorn
# forest raft or run the compile in there

Once VS C++ is installed, run cmd_windows.bat and enter these commands:

set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
python -m pip install git+https://github.com/abetlen/llama-cpp-python --force-reinstall --no-deps
#

If it fails to find Visual Studio automatically, use these instead:

call "C:\Program Files\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
set DISTUTILS_USE_SDK=1
set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
python -m pip install git+https://github.com/abetlen/llama-cpp-python --force-reinstall --no-deps
forest raft
#

its done

fervent acorn
# forest raft its done

Now just set the n-gpu-layers option in the webui to something like 30 and it should put 30 layers of the model onto GPU. The number you can use depends on the VRAM required to run each layer. That 13B model has 40 layers. 30 should provide plenty of acceleration if you can't fit all 40.
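
To make the "depends on the VRAM required to run each layer" point concrete, here is a back-of-the-envelope sketch for picking a starting value. It assumes every layer takes an equal share of the model file and that a fixed reserve is kept free for the context and scratch buffers; both are simplifications, `estimate_gpu_layers` is a hypothetical helper, and the sizes in the example are made-up numbers rather than measurements.

```python
def estimate_gpu_layers(model_size_mb, total_layers, free_vram_mb, reserve_mb=1024):
    """Estimate how many layers fit in VRAM.

    Assumptions (illustrative only): all layers are the same size, and
    reserve_mb of VRAM is left free for everything that is not weights.
    Integer arithmetic is used so the result is not affected by rounding.
    """
    usable_mb = max(free_vram_mb - reserve_mb, 0)
    # layers that fit = usable / (model_size / total_layers)
    fit = (usable_mb * total_layers) // model_size_mb
    return min(total_layers, fit)


# Made-up example: an 8000 MB model with 40 layers and 7024 MB of free VRAM.
print(estimate_gpu_layers(8000, 40, 7024))
```

Treat the result as a first guess for n-gpu-layers: if loading fails or VRAM fills up, lower it, and if there is room to spare, raise it.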

forest raft
#

Now whenever I launch the LLaMA I get this instead

#

and CMD says this

fervent acorn
# forest raft and CMD says this

The latest llama-cpp-python must have broken the webui's support. If you want to continue, repeat the last commands but use this instead of the final one:

python -m pip install git+https://github.com/abetlen/llama-cpp-python@71f4582d4469ba74529386abb66a835e3ad1c374 --force-reinstall --no-deps
#

All of this would be a lot more convenient if the llama-cpp-python devs would distribute versions with GPU support instead of CPU only.

forest raft
#

Do I copy this code and just paste it into cmd_windows?

set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
python -m pip install git+https://github.com/abetlen/llama-cpp-python@71f4582d4469ba74529386abb66a835e3ad1c374 --force-reinstall --no-deps

#

ah now its working

#

cmd was not running

#

oh wow, im now getting 4.5 tokens a second

fervent acorn
#

Nice!

forest raft
#

with this prompt i can get it to 5.66

fervent acorn
#

You have a much newer cpu+gpu than I do, so the benefits of gpu acceleration will be a lot better for you.

forest raft
#

does the prompt play a big role?

#

some of them give better tokens per second

fervent acorn
#

Larger prompt means slower generation. There are some other nuances, but I don't understand them.
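
One possible way to see why a larger prompt lowers the number is the sketch below. It assumes the reported tokens/s is the count of newly generated tokens divided by the total time, including the time spent reading the prompt; whether the webui measures it exactly this way is an assumption, and the speeds used are made-up numbers.

```python
def effective_tokens_per_second(prompt_tokens, new_tokens, prompt_tps, gen_tps):
    """Reported speed if prompt processing time is counted in the total.

    prompt_tps: tokens/s while reading the prompt (assumed value)
    gen_tps:    tokens/s while generating new tokens (assumed value)
    """
    prompt_seconds = prompt_tokens / prompt_tps
    gen_seconds = new_tokens / gen_tps
    return new_tokens / (prompt_seconds + gen_seconds)


# Same generation speed, same reply length, only the prompt size changes.
short_prompt = effective_tokens_per_second(100, 40, prompt_tps=50, gen_tps=5)
long_prompt = effective_tokens_per_second(400, 40, prompt_tps=50, gen_tps=5)
print(short_prompt, long_prompt)
```

Under these assumptions the longer prompt reports a lower figure even though the model generates at the same rate, which would explain why some prompts look faster than others.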

forest raft
#

I see, because now I see that some prompts can reach 6.34 or so

#

but thank you so so so much for your help, I really appreciate it!

#

I wonder if it's possible to save this chat, because it helped me a lot
