low token (SOLVED) | Text Generation WebUI | Page 1

forest raft Jun 8, 2023, 3:23 PM

#

Hello! I'm really new to LLaMA and related stuff. I tried to download wizard-vicuna and couldn't get it to work, so I downloaded normal Vicuna instead, with the same result. In the end I tried AlekseyKorshuk_vicuna-7b and it finally worked. Yet it is REALLY slow: I'm talking around 0.9 tokens a second on a good day. Any idea how to fix it or what might be going on?

stoic moat Jun 8, 2023, 7:02 PM

#

What's your hardware?
I am getting anywhere between 0.8 and 1.2 tokens/s on an RX 7900 XT, with the model GPT-4-x-Vicuna-13b-4bit.
Would also appreciate any advice.

forest raft Jun 8, 2023, 7:07 PM

#

stoic moat What's your hardware? I am getting anywhere between 0.8 to 1.2 tokens/s on RX 7...

My hardware: RTX 3060 Ti, i5-10600K, 16 GB RAM

stoic moat Jun 8, 2023, 7:08 PM

#

forest raft my hardware RTX 3060 ti, i5 10600k, 16 ram

Have you tried running 4-bit models? It could be a case of not having enough VRAM.
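For a rough sense of why 4-bit models matter for the VRAM question above, here is a back-of-the-envelope sketch. It counts weight storage only (nominal parameter count times bits per weight) and ignores activations, the context cache, and framework overhead, so real usage will be higher. The function name and the 8 GiB VRAM figure assumed for the RTX 3060 Ti are additions for illustration, not something stated in the thread.

```python
def weight_memory_gib(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


VRAM_GIB = 8.0  # assumed VRAM of an RTX 3060 Ti (not stated in the thread)

for name, params, bits in [
    ("7B at 16-bit", 7, 16),
    ("7B at 4-bit", 7, 4),
    ("13B at 4-bit", 13, 4),
]:
    need = weight_memory_gib(params, bits)
    verdict = "fits" if need < VRAM_GIB else "does not fit"
    print(f"{name}: ~{need:.1f} GiB of weights, {verdict} in {VRAM_GIB:.0f} GiB")
```

Under these assumptions a 7B model at 16-bit needs roughly 13 GiB for weights alone, more than the card has, while 4-bit versions of 7B and 13B come in at roughly 3 GiB and 6 GiB.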

forest raft Jun 8, 2023, 7:09 PM

#

stoic moat Have you tried running 4bit models? It could be the case of not having enough VR...

I'm gonna try this one: TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ

#

gonna see if it will help

#

Also, is there a way to use both CPU & GPU at the same time?

stoic moat Jun 8, 2023, 7:11 PM

#

forest raft also is there a way to use both CPU & GPU at the same time?

I believe it does offload the model to RAM if there's not enough VRAM

#

To actually use the CPU for processing, I think llama.cpp does that.

#

But I was not able to get both CPU and GPU working with it, and CPU only is even slower for me

#

Could be due to having only 16 GB RAM

#

I've ordered a 32 GB kit, will try again once I get it

fervent acorn Jun 8, 2023, 7:15 PM

#

forest raft also is there a way to use both CPU & GPU at the same time?

GGML models support using the CPU and GPU together to speed up generation. But you have to compile llama-cpp-python with cuBLAS support for that to work, as the version installed by the webui is CPU only.

#

GPTQ models run on the GPU only. You can offload some of the model to RAM, but that data has to be swapped into the GPU for processing which greatly slows the model down.

#

Currently, the webui is set to use AutoGPTQ to load GPTQ models by default. If you enable auto-devices, the webui will automatically offload some of the model to RAM if you run out of VRAM.

stoic moat Jun 8, 2023, 7:19 PM

#

fervent acorn Currently, the webui is set to use AutoGPTQ to load gptq models by default. If y...

If VRAM is not an issue, what could cause low speed?

#

(This is with model loaded)

fervent acorn Jun 8, 2023, 7:19 PM

#

stoic moat If VRAM is not an issue, what could cause low speed?

Possibly AutoGPTQ itself. I've noticed slower speeds with that on my 1080ti.

stoic moat Jun 8, 2023, 7:20 PM

#

the alternative is GPTQ-for-LLaMa, right?

fervent acorn Jun 8, 2023, 7:20 PM

#

stoic moat the alternative is GPTQ-for-LLaMa, right?

Yes

#

Just keep in mind that GPTQ-for-LLaMa can't automatically detect gptq model parameters, so you will need to either set the wbits and groupsize params in the webui or add something like -4bit-128g to the model folder's name according to what the model uses.
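To illustrate the folder-naming convention mentioned above, here is a minimal sketch of how a suffix like -4bit-128g could be read back into wbits and groupsize values. The helper name and the regular expressions are hypothetical and written for this example; this is not the webui's or GPTQ-for-LLaMa's actual detection code.

```python
import re
from typing import Optional, Tuple


def parse_gptq_suffix(folder_name: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (wbits, groupsize) found in a folder name; None for parts not found.

    Hypothetical helper for illustration only.
    """
    # "4bit" -> wbits 4
    wbits_match = re.search(r"(\d+)bit", folder_name, flags=re.IGNORECASE)
    # "128g" -> groupsize 128; the lookahead keeps "13gb"-style text from matching
    group_match = re.search(r"(\d+)g(?![a-z0-9])", folder_name, flags=re.IGNORECASE)
    wbits = int(wbits_match.group(1)) if wbits_match else None
    groupsize = int(group_match.group(1)) if group_match else None
    return wbits, groupsize


print(parse_gptq_suffix("Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g"))
print(parse_gptq_suffix("Wizard-Vicuna-13B-Uncensored-GPTQ"))
```

A folder name without the suffix yields no values in this sketch, which corresponds to the case where the params have to be set by hand in the webui.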

stoic moat Jun 8, 2023, 7:23 PM

#

Do you know of any forks of GPTQ-for-LLaMa-ROCm that support pytorch 2.0 and later?

fervent acorn Jun 8, 2023, 7:25 PM

#

I know how to compile ROCm GPTQ-for-LLaMa using GitHub Actions, so I can try that and see if it works for you.

fervent acorn Jun 8, 2023, 7:28 PM

#

stoic moat Do you know of any forks of GPTQ-for-LLaMa-ROCm that support pytorch 2.0 and lat...

Do you want me to compile Ooba's fork for ROCm or the latest version from qwopqwop200's repo?

stoic moat Jun 8, 2023, 7:28 PM

#

It seems like there is a guide specifically for RDNA3, so I will try it
https://are-we-gfx1100-yet.github.io/post/text-gen-webui/

Are we gfx1100 yet?

Run Text Generation WebUI on RX 7900 XTX

Prerequisites
Install AMDGPU driver with ROCm
Download the following prebuilt wheels into ~/Downloads
torch https://github.com/evshiron/rocm_lab/releases/download/v1.14.514/torch-2.0.1+gite19229c-cp310-cp310-linux_x86_64.whl
bitsandbytes https://github.com/evshiron/rocm_lab/releases/download/v1.14.514/bitsandbytes-0.37.2-py3-none-any.whl
If some...

fervent acorn Jun 8, 2023, 7:32 PM

#

Don't know why you would need a fork for it, though it is nice to see that there is one. Last time I compiled GPTQ with ROCm, it automatically converted the code to HIP.

forest raft Jun 8, 2023, 7:34 PM

#

I tried downloading TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ as you told me and tried to run it, but I get the same thing as with the last model, where I don't get any response.

#

I even followed these steps:

#

fervent acorn Jun 8, 2023, 7:38 PM

#

forest raft i tried downloading TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ as u told me and ...

No errors? The error message might be getting hidden. Try running cmd_windows.bat and enter these commands to launch the webui:

cd text-generation-webui
python server.py

forest raft Jun 8, 2023, 7:46 PM

#

nothing is happening still

fervent acorn Jun 8, 2023, 7:47 PM

#

forest raft nothing is happening still

Is there nothing in the CMD window?

forest raft Jun 8, 2023, 7:48 PM

#

fervent acorn Jun 8, 2023, 7:52 PM

#

forest raft

Try loading it with gptq-for-llama checked, wbits set to 4 and groupsize set to 128

forest raft Jun 8, 2023, 7:55 PM

#

fervent acorn Try loading it with `gptq-for-llama` checked, wbits set to `4` and groupsize set...

I did that and now it is just repeating what I'm saying. It is also taking a lot of time.

stoic moat Jun 8, 2023, 7:55 PM

#

stoic moat It seems like there is a guide specifically for RDNA3, so I will try it https://...

All right, after following this guide I have significantly increased the speed

#

However, this guide is specific to RDNA3, so it's probably not useful for OP.

forest raft Jun 8, 2023, 7:56 PM

#

stoic moat Jun 8, 2023, 7:56 PM

#

Most likely the difference is due to the custom bitsandbytes build.

forest raft Jun 8, 2023, 7:58 PM

#

It's giving me output, but at 0.4 tokens/s.

fervent acorn Jun 8, 2023, 7:59 PM

#

forest raft its giving me output but at 0.4 tokens/s

I have no idea why it would be that slow. I get significantly faster speeds on my 1080ti.

forest raft Jun 8, 2023, 8:00 PM

#

The fastest I've gotten just now is 1.08 tokens a second.

fervent acorn Jun 8, 2023, 8:05 PM

#

forest raft the fastest ive gotten it just now is 1.08 tokens a second

I'm getting ~2.36 tokens/s. Only thing I can think of to try is GGML.

forest raft Jun 8, 2023, 8:05 PM

#

fervent acorn I'm getting ~2.36 tokens/s. Only thing I can think of to try is GGML.

You mean this one?

fervent acorn Jun 8, 2023, 8:05 PM

#

forest raft u mean this one?

Yeah

forest raft Jun 8, 2023, 8:06 PM

#

Should I put it into the model folder?

fervent acorn Jun 8, 2023, 8:06 PM

#

forest raft Should i put it into the model folder ?

I would give it its own folder, though you can just put it into \text-generation-webui\models

#

GGML models are pretty good and are what I use exclusively as it is the only way I can load 30B models on my system.
It wouldn't surprise me at all if GGML models eventually become the standard consumer AI model format due to their better memory efficiency. Once the GPU acceleration support is more fleshed out, I imagine most people will switch to them regardless.

#

With GGML I can run 30B models at slightly faster speeds than 13B GPTQ models.

forest raft Jun 8, 2023, 8:13 PM

#

I'm getting this

#

whether I put it in its own folder or just in \text-generation-webui\models

fervent acorn Jun 8, 2023, 8:14 PM

#

forest raft im getting this

I guess that is still an issue. Move the .bin file to a folder with -ggml in the name. Something like Wizard-Vicuna-13B-Uncensored-ggml.

forest raft Jun 8, 2023, 8:18 PM

#

fervent acorn I guess that is still an issue. Move the `.bin` file to a folder with `-ggml` in...

Now I'm getting this

fervent acorn Jun 8, 2023, 8:21 PM

#

forest raft now im getting this

Do you still have the gptq-for-llama option checked, or does the folder name have gptq or 4bit in it?
Something is causing the webui to think the model is a gptq model.
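The checks suggested in this exchange can be written down as a small checklist. This is a sketch based only on what the thread reports (the gptq-for-llama checkbox, and gptq, 4bit, or ggml appearing in the folder name); the function is hypothetical and does not reproduce the webui's real model-type detection.

```python
from typing import List


def ggml_load_problems(folder_name: str, gptq_for_llama_checked: bool) -> List[str]:
    """List settings that, per this thread, can make a GGML model load as GPTQ.

    Hypothetical checklist for illustration only.
    """
    name = folder_name.lower()
    problems = []
    if gptq_for_llama_checked:
        problems.append("uncheck the gptq-for-llama option")
    if "gptq" in name or "4bit" in name:
        problems.append("folder name contains 'gptq' or '4bit'")
    if "ggml" not in name:
        problems.append("folder name lacks 'ggml'")
    return problems


print(ggml_load_problems("Wizard-Vicuna-13B-Uncensored-ggml", True))
```

An empty list means none of the conditions mentioned in the thread apply; it does not guarantee the model will load.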

forest raft Jun 8, 2023, 8:21 PM

#

I did have the gptq-for-llama option checked.

#

Once I unchecked it I got this instead

#

Wait a min

#

Now it's working

#

I reset the settings, including these

#

I'm also getting 2.7 tokens a second

#

is that good?

fervent acorn Jun 8, 2023, 8:26 PM

#

forest raft is that good?

It's ok. You would likely need a better CPU for faster speeds.
You can try GPU acceleration, but you will need to compile llama-cpp-python to support it.

forest raft Jun 8, 2023, 8:28 PM

#

fervent acorn It's ok. You would likely need a better CPU for faster speeds. You can try GPU a...

is it hard to do?

fervent acorn Jun 8, 2023, 8:31 PM

#

forest raft is it hard to do?

Not really, just takes a bit of time to set up Visual Studio C++ Build Tools.

#

Everything else needed to compile should be already installed by the one-click-installer.

forest raft Jun 8, 2023, 8:42 PM

#

Do I download it inside the model folder or the whole oobabooga one?

#

Or run the compile in there?

fervent acorn Jun 8, 2023, 8:48 PM

#

forest raft or run the compile in there

Once VS C++ is installed, run cmd_windows.bat and enter these commands:

set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
python -m pip install git+https://github.com/abetlen/llama-cpp-python --force-reinstall --no-deps

#

If it fails to find Visual Studio automatically, use these instead:

call "C:\Program Files\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
set DISTUTILS_USE_SDK=1
set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
python -m pip install git+https://github.com/abetlen/llama-cpp-python --force-reinstall --no-deps

forest raft Jun 8, 2023, 9:02 PM

#

its done

fervent acorn Jun 8, 2023, 9:03 PM

#

forest raft its done

Now just set the n-gpu-layers option in the webui to something like 30 and it should put 30 layers of the model onto GPU. The number you can use depends on the VRAM required to run each layer. That 13B model has 40 layers. 30 should provide plenty of acceleration if you can't fit all 40.
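As a rough way to pick a starting value for n-gpu-layers, here is a minimal sketch that assumes every layer takes an equal share of the model file and that some VRAM must stay free for everything else. The function name, the 1 GiB reserve, and the example sizes are assumptions for illustration; actual per-layer VRAM use depends on the quantization and context size, so treat the result as a first guess and adjust.

```python
def layers_that_fit(model_file_gib: float, n_layers: int,
                    free_vram_gib: float, reserve_gib: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gib = model_file_gib / n_layers
    # Keep some VRAM back for buffers and the desktop (assumed reserve)
    usable_gib = max(free_vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable_gib / per_layer_gib))


# Illustrative numbers only: a 7.3 GiB quantized 13B file with 40 layers, 8 GiB of VRAM
print(layers_that_fit(model_file_gib=7.3, n_layers=40, free_vram_gib=8.0))
```

If the model then fails to load or runs out of memory, lowering the value a few layers at a time is the safer direction.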

forest raft Jun 8, 2023, 9:08 PM

#

Now whenever I launch LLaMA I get this instead

#

and CMD says this

fervent acorn Jun 8, 2023, 9:11 PM

#

forest raft and CMD says this

The latest llama-cpp-python must have broken the webui's support. If you want to continue, repeat the last commands, but use this in place of the final pip install command:

python -m pip install git+https://github.com/abetlen/llama-cpp-python@71f4582d4469ba74529386abb66a835e3ad1c374 --force-reinstall --no-deps

#

All of this would be a lot more convenient if the llama-cpp-python devs would distribute versions with GPU support instead of CPU only.

forest raft Jun 8, 2023, 9:14 PM

#

Do I copy this code and just paste it into cmd_windows?

set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
python -m pip install git+https://github.com/abetlen/llama-cpp-python@71f4582d4469ba74529386abb66a835e3ad1c374 --force-reinstall --no-deps

#

Ah, now it's working

#

cmd was not running

#

Oh wow, I'm now getting 4.5 tokens a second

fervent acorn Jun 8, 2023, 9:18 PM

#

Nice!

forest raft Jun 8, 2023, 9:18 PM

#

With this prompt I can get it to 5.66

fervent acorn Jun 8, 2023, 9:18 PM

#

You have a much newer CPU+GPU than I do, so the benefits of GPU acceleration will be a lot greater for you.

forest raft Jun 8, 2023, 9:19 PM

#

does the prompt play a big role?

#

some of them give better tokens per second

fervent acorn Jun 8, 2023, 9:20 PM

#

A larger prompt means slower generation. There are some other nuances, but I don't understand them.
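To compare prompts on equal footing, a tokens-per-second figure can be measured the same way each time. The sketch below times one call and divides the number of generated tokens by the elapsed time, prompt processing included. The thread does not say how the webui computes its own figure, so this is only one plausible definition, and fake_generate is a hypothetical stand-in whose cost is assumed to grow with prompt length.

```python
import time
from typing import Callable, List


def tokens_per_second(generate: Callable[[str], List[str]], prompt: str) -> float:
    """Time one call to `generate` and return generated tokens per second."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompt: str) -> List[str]:
    """Stand-in for a real model: delay grows with prompt length (an assumption)."""
    time.sleep(0.001 * len(prompt.split()))
    return ["tok"] * 20


short = tokens_per_second(fake_generate, "hello there")
long = tokens_per_second(fake_generate, "hello there " * 50)
print(f"short prompt: {short:.0f} tokens/s, long prompt: {long:.0f} tokens/s")
```

With a real model, swapping fake_generate for the actual generation call and running the same prompt several times gives a fairer comparison than a single run.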

forest raft Jun 8, 2023, 9:24 PM

#

I see, because now I see that some prompts can reach 6.34-ish

#

But thank you so, so, so much for your help. I really appreciate it!

#

I wonder if it's possible to save this chat, because it helped me a lot
