llama.cpp not using GPU despite having BLAS = 1 (Linux, GGML) | Text Generation WebUI

sharp sapphire Jul 15, 2023, 2:35 AM

#

Part 2 of my struggles on trying to get a model running on an AMD GPU. This time, I tried using a GGML model. The problem is, the GPU isn't being utilized. n-gpu-layers is set to 40. I'm on an AMD RX 6800 XT with ROCm 5.4.2 enabled. llama-cpp-python shouldn't be the problem since BLAS is showing up as 1 already. However, when I try to set up bitsandbytes, it doesn't say successful. Check images 4 and 5 to see the command line output when doing python -m bitsandbytes.

As for how I built bitsandbytes-rocm (maybe this is where it went wrong?), I first did ROCM_HOME=/opt/rocm-5.4.2, then I built bitsandbytes-rocm from https://github.com/agrocylo/bitsandbytes-rocm. After that, I ran python setup.py install, then did python -m bitsandbytes (which again, shows an error).

halcyon crown Jul 15, 2023, 3:06 AM

#

Oh, if you are on Linux, can you test my PR for the one-click-installer here:
https://github.com/oobabooga/one-click-installers/pull/98

I don't have an AMD GPU, which is the main thing slowing me down.

GitHub

Add AMD GPU support for Linux by jllllll · Pull Request #98 · oobab...

Requires the ROCm SDK version 5.4.2 or 5.4.3. I do not have an AMD GPU, so community testing will be needed.
Things added:

Install ROCm version of Pytorch
Install ROCm ExLlama module for AMD GPUs
...

#

Specifically, if you are on Linux and not WSL. ROCm doesn't work on WSL.

sharp sapphire Jul 15, 2023, 3:08 AM

#

halcyon crown Oh, if you are on Linux, can you test my PR for the one-click-installer here: ht...

Sure! I'll test it after I'm done installing VS C++

#

and yep, I have ubuntu dual boot

sharp sapphire Jul 15, 2023, 4:03 AM

#

halcyon crown Oh, if you are on Linux, can you test my PR for the one-click-installer here: ht...

So I tried it, llama.cpp still doesn't use the GPU (BLAS is now 0, and bitsandbytes is showing "No GPU support").

ExLlama works, as in, it uses the GPU. But it still generates gibberish

halcyon crown Jul 15, 2023, 4:05 AM

#

What model is that?

sharp sapphire Jul 15, 2023, 4:05 AM

#

Wizard-Vicuna-13B-Uncensored-GPTQ

#

buuuuuuuuuuuuuut its the act order version

#

im gonna try the no act order one

halcyon crown Jul 15, 2023, 4:07 AM

#

Maybe try downloading the llama tokenizer if that doesn't work: https://huggingface.co/oobabooga/llama-tokenizer
You can download it like other models with: oobabooga/llama-tokenizer

sharp sapphire Jul 15, 2023, 4:08 AM

#

Alright I'll test that one out as well

halcyon crown Jul 15, 2023, 4:08 AM

#

As for llama.cpp, run cmd_linux.sh and enter this command:

CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python

The needed dependencies for it should come with ROCm.

There is also a ROCm-specific port of llama.cpp here: https://github.com/ggerganov/llama.cpp/pull/1087
Using it with the webui will require manually building llama-cpp-python with that fork of llama.cpp.

sharp sapphire Jul 15, 2023, 4:09 AM

#

sharp sapphire Alright I'll test that one out as well

--in a moment, i'll be off for an hour or two, ill be updating this thread as soon as i get back to work

sharp sapphire Jul 15, 2023, 4:41 AM

#

I'm so close to just nuking my current Ubuntu install and starting over from scratch lol

sharp sapphire Jul 15, 2023, 4:41 AM

#

halcyon crown As for llama.cpp, run `cmd_linux.sh` and enter this command: ``` CMAKE_ARGS="-DL...

I did this as well

#

I should've written down how I got BLAS to work darn

#

And yeah, as always, that bitsandbytes thing

#

This is what it looks like on my manual install of the webui

#

no bitsandbytes "no GPU support" thingy and BLAS = 1, but it still doesn't work

#

which I highly suspect is because of

#

#

#

this

#

In the video i saw, it's supposed to say something like setup successful

#

Mine just ends in an error

#

I'm gonna try using that bitsandbytes-rocm repository from the video i saw

halcyon crown Jul 15, 2023, 4:55 AM

#

All of the bitsandbytes rocm forks I've seen are pretty outdated. Not sure if any of them will work. This one is most likely to, I think: https://github.com/agrocylo/bitsandbytes-rocm

sharp sapphire Jul 15, 2023, 4:56 AM

#

Yup, that's the one I used

#

#

woah, it worked

#

well, i havent tested if the webui works

halcyon crown Jul 15, 2023, 4:58 AM

#

That is what needs to be tested, simply because the webui has updated several times in response to newer versions of bitsandbytes.

sharp sapphire Jul 15, 2023, 4:59 AM

#

Yep. i was gonna try running it now except where did all the .py files in text-generation-webui go

#

my govd

#

lmao

#

ohmygod i rm *-ed the directory thats why

#

ooooooooooooooooookay it still does not want to use the gpu

#

halcyon crown Jul 15, 2023, 5:11 AM

#

BLAS=1 is proof that it is set up correctly. Make sure you are setting the n-gpu-layers option before loading the model.

I see that you are doing that now. That is odd that it isn't loading it at all.
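A minimal sketch for checking the BLAS flag from a saved startup log, rather than reading it by eye. The sample line below is hypothetical and only shaped like the `NAME = 0/1 | ...` banner that llama.cpp prints; `parse_system_info` is an illustrative helper, not part of llama.cpp or the webui. In this thread BLAS = 1 showed up while the GPU stayed idle, so it is worth checking alongside n-gpu-layers rather than on its own.

```python
import re


def parse_system_info(line: str) -> dict[str, int]:
    """Parse 'NAME = 0 | NAME = 1 | ...' pairs from a system-info style line."""
    return {name: int(value) for name, value in re.findall(r"(\w+)\s*=\s*(\d+)", line)}


# Hypothetical sample line; paste the real one from your own console output.
sample = "AVX = 1 | AVX2 = 1 | FMA = 1 | BLAS = 1 | SSE3 = 1 | VSX = 0 |"
flags = parse_system_info(sample)
print("BLAS enabled:", flags.get("BLAS") == 1)
```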

sharp sapphire Jul 15, 2023, 5:14 AM

#

Yeah I don't know what's up with it ahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

halcyon crown Jul 15, 2023, 5:19 AM

#

Worst case scenario is that you compile llama-cpp-python with the ROCm fork of llama.cpp:
https://github.com/ggerganov/llama.cpp/pull/1087

Essentially, this will involve cloning the llama-cpp-python repo and cloning the llama.cpp fork into llama-cpp-python/vendor/llama.cpp. Then you can build it using the environment variables mentioned on that PR, but with this command instead:

CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ CMAKE_ARGS="-DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_DMMV_X=128 -DLLAMA_CUDA_MMV_Y=8 -DLLAMA_CUDA_KQUANTS_ITER=1" FORCE_CMAKE=1 python -m pip install . -v

sharp sapphire Jul 15, 2023, 5:25 AM

#

halcyon crown Worst case scenario is that you compile llama-cpp-python with the ROCm fork of l...

How do you clone from that github link?

halcyon crown Jul 15, 2023, 5:26 AM

#

git clone https://github.com/SlyEcho/llama.cpp -b hipblas

sharp sapphire Jul 15, 2023, 5:27 AM

#

Thanks, I'll be trying that

#

Okay y'know what, I will nuke my current Ubuntu install and start over from the beginning, cause I feel like I have far too much conflicting stuff currently

sharp sapphire Jul 16, 2023, 6:19 AM

#

Update: I got back to trying text-generation-webui after trying KoboldAI and KoboldCPP. This time, I tried doing this, and after a bit of searching, I eventually got llama-cpp-python to compile with the ROCm port of llama.cpp. I had to put -DCMAKE_POSITION_INDEPENDENT_CODE=ON in CMAKE_ARGS as well. I haven't tested it out yet, so I'm not sure if it'll even run. More updates later

sharp sapphire Jul 16, 2023, 6:35 AM

#

That's new

halcyon crown Jul 16, 2023, 6:36 AM

#

Don't use a SuperHOT model. I don't think the ROCm fork has been updated to support it yet.

sharp sapphire Jul 16, 2023, 6:37 AM

#

Okay, thanks. I'll download another model

#

Used Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_1.bin and still got the same error

#

On another note, I tried ExLLaMa and it loads the model, but when I try to generate something, this happens

#

On ExLLaMa-HF, a different error shows up

#

The same error above also shows up when I try test_benchmark_inference.py

halcyon crown Jul 16, 2023, 6:48 AM

#

It seems like an issue with the LoRA code.
I've seen the probabilities error before, but I can't for the life of me remember what caused it.

#

What flags are you using to load the webui?

sharp sapphire Jul 16, 2023, 6:50 AM

#

No flags

halcyon crown Jul 16, 2023, 6:54 AM

#

Run python -m torch.utils.collect_env to ensure that you have the correct PyTorch version. Other than that, I'm not sure what is causing it.
At the very least, these are errors seen on many different systems, which means it may not be related to the ROCm build.
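For a quicker look than the full collect_env report, something like the sketch below may help. It assumes only that PyTorch might be installed; `describe_torch` is an illustrative helper name. On ROCm builds `torch.version.hip` should hold a version string, and the GPU is still reached through the `torch.cuda` API.

```python
def describe_torch() -> dict:
    """Summarize the installed PyTorch build; also works when torch is missing."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        # A version string on ROCm builds, None on CPU-only or CUDA builds.
        "hip": getattr(torch.version, "hip", None),
        # ROCm builds report the GPU through the torch.cuda API.
        "gpu_available": torch.cuda.is_available(),
    }


info = describe_torch()
print(info)
```

If `hip` comes back as None on an AMD machine, the installed wheel is probably not the ROCm one.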

sharp sapphire Jul 16, 2023, 6:55 AM

#

halcyon crown Jul 16, 2023, 6:55 AM

#

Yep, that's the right version at least.

sharp sapphire Jul 16, 2023, 6:56 AM

#

Also, I tried GPTQ-for-LLaMa and it yields the same error

#

at least it's using the VRAM I guess

#

Just gotta figure out that probability tensor thing

#

and this llama_cpp/libllama.so: undefined symbol: llama_n_vocab_from_model

halcyon crown Jul 16, 2023, 6:57 AM

#

What about AutoGPTQ? I have a ROCm wheel for it:

python -m pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/blob/Linux-x64/ROCm-5.4.2/auto_gptq-0.3.0.dev0%2Brocm5.4.2-cp310-cp310-linux_x86_64.whl --force-reinstall --no-deps

sharp sapphire Jul 16, 2023, 6:59 AM

#

halcyon crown Jul 16, 2023, 7:01 AM

#

???

sharp sapphire Jul 16, 2023, 7:01 AM

#

i have no idea

halcyon crown Jul 16, 2023, 7:01 AM

#

oops

#

Wrong URL.

#

python -m pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/Linux-x64/ROCm-5.4.2/auto_gptq-0.3.0.dev0%2Brocm5.4.2-cp310-cp310-linux_x86_64.whl --force-reinstall --no-deps
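The two install commands in this thread differ only in `/blob/` versus `/raw/`: the blob link points at GitHub's HTML page for the file, while the raw link serves the wheel itself. A small sketch of that rewrite, with `github_blob_to_raw` as an illustrative helper name:

```python
def github_blob_to_raw(url: str) -> str:
    """Turn a GitHub 'blob' page URL into the matching 'raw' file URL."""
    # Only the first occurrence is replaced, in case the file path contains 'blob' too.
    return url.replace("/blob/", "/raw/", 1)


wrong = (
    "https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/blob/Linux-x64/ROCm-5.4.2/"
    "auto_gptq-0.3.0.dev0%2Brocm5.4.2-cp310-cp310-linux_x86_64.whl"
)
print(github_blob_to_raw(wrong))
```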

sharp sapphire Jul 16, 2023, 7:03 AM

#

maaaaaaaaaaaybe the problem is in the model

halcyon crown Jul 16, 2023, 7:05 AM

#

Could be a corrupt download.

sharp sapphire Jul 16, 2023, 7:08 AM

#

So I tried redownloading the files (except for the model itself), loaded exllama, and this happened

#

#

but

#

it generated gibberish

#

halcyon crown Jul 16, 2023, 7:09 AM

#

Maybe it isn't detecting the correct GPU architecture?

#

Are you using the HCC_AMDGPU_TARGET and HSA_OVERRIDE_GFX_VERSION environment variables?

sharp sapphire Jul 16, 2023, 7:11 AM

#

Yep, I tried AutoGPTQ and GPTQ-for-LLaMa, they still result in the same error

#

https://github.com/oobabooga/text-generation-webui/issues/2840 and i found this

GitHub

exllama gibberish output · Issue #2840 · oobabooga/text-generation-...

Describe the bug I'm running into issues using exllama with the textgen UI. I have exllama running well using its webui, but when I load the same model into textgen UI, I only get gibberish as ...

#

there's a fix but I dunno how to apply it

#

https://github.com/oobabooga/text-generation-webui/pull/2912

GitHub

Disable half2 for ExLlama when using HIP by ardfork · Pull Request ...

Using kernels that rely on half2 produce gibberish output when using ROCm. ExLlama UI also disable half2 in the same way when using HIP.
Fix #2840

#

and it's supposed to have been merged but hhhhhhhhhhhmmmmmmmmmmm

#

Okay I'll stop working on this for now, have to go somewhere. I'll return as soon as I can, probably in 1-2 days

#

Thanks for all the help!

sharp sapphire Jul 16, 2023, 8:32 AM

#

Well, I managed to sneak in a bit of time to use the PC again. Will be going again in a bit though. So, I uninstalled my custom compile of llama-cpp-python and opted for the CLBlast install instead.

#

Aaaaand it worked

#

#

on the 30B model:

halcyon crown Jul 16, 2023, 8:34 AM

#

Oh yeah, llama-cpp-python was just updated with SuperHOT support.

sharp sapphire Jul 16, 2023, 8:35 AM

#

Yup. Maybe that's what's causing the error? Though the 30B model should've worked

#

I'll try to do some more fixing since I do want to use ROCm

#

Tried ExLLaMa, AutoGPTQ, and GPTQ-for-LLaMa again, still has that probability tensor error thing. When I'm able to, I'll download another model cause the current one I'm using might be scuffed

halcyon crown Jul 16, 2023, 8:50 AM

#

I'm rebuilding the ROCm GPTQ wheels using a potentially better method. May have been an issue with how I was specifying target architectures.

halcyon crown Jul 16, 2023, 9:05 AM

#

Wheels have been rebuilt:

python -m pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/Linux-x64/ROCm-5.4.2/auto_gptq-0.3.0.dev0%2Brocm5.4.2-cp310-cp310-linux_x86_64.whl https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/Linux-x64/ROCm-5.4.2/quant_cuda-0.0.0-cp310-cp310-linux_x86_64.whl --force-reinstall --no-deps

^ All one command.

sharp sapphire Jul 16, 2023, 9:10 AM

#

Will get back to you immediately when I'm able to use the PC again ^^

#

Oh yeah, just so I can try again immediately. Could you tell me, in detail, the steps on how to build llama-cpp-python with ROCm?

What I basically did was,

1. Clone the llama-cpp-python repository
2. Clone the llama-cpp ROCm fork into the vendor folder inside llama-cpp-python
3. In the llama-cpp folder, I run

CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -DLLAMA_HIPBLAS=ON
4. Then I run

make -j12 LLAMA_HIPBLAS=1
5. I went back to the llama-cpp-python directory, then I ran the command you gave me, but I added another argument.

CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ CMAKE_ARGS="-DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_DMMV_X=128 -DLLAMA_CUDA_MMV_Y=8 -DLLAMA_CUDA_KQUANTS_ITER=1 -DCMAKE_POSITION_INDEPENDENT_CODE=ON" FORCE_CMAKE=1 python -m pip install . -v

halcyon crown Jul 16, 2023, 5:24 PM

#

Running the commands in the llama.cpp folder is unnecessary. That last command for building llama-cpp-python handles everything.
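Put together, the build discussed above comes down to three commands. The sketch below only prints them as a dry run and runs nothing. The compiler paths, CMake flags, fork URL and branch are copied from this thread; the llama-cpp-python repository URL is an assumption (it is not named in the thread), and `build_plan` is an illustrative helper name.

```python
import shlex

# Compiler paths and CMake flags copied from the commands quoted in this thread.
BUILD_ENV = {
    "CC": "/opt/rocm/llvm/bin/clang",
    "CXX": "/opt/rocm/llvm/bin/clang++",
    "CMAKE_ARGS": " ".join([
        "-DLLAMA_HIPBLAS=ON",
        "-DLLAMA_CUDA_DMMV_X=128",
        "-DLLAMA_CUDA_MMV_Y=8",
        "-DLLAMA_CUDA_KQUANTS_ITER=1",
        "-DCMAKE_POSITION_INDEPENDENT_CODE=ON",
    ]),
    "FORCE_CMAKE": "1",
}


def build_plan(binding_repo: str, fork_repo: str, fork_branch: str) -> list[str]:
    """Return the shell commands for the build, in order, without running them."""
    env_prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in BUILD_ENV.items())
    return [
        shlex.join(["git", "clone", binding_repo, "llama-cpp-python"]),
        shlex.join(["git", "clone", "-b", fork_branch, fork_repo,
                    "llama-cpp-python/vendor/llama.cpp"]),
        f"cd llama-cpp-python && {env_prefix} python -m pip install . -v",
    ]


plan = build_plan(
    "https://github.com/abetlen/llama-cpp-python",  # assumed upstream repo, not named in the thread
    "https://github.com/SlyEcho/llama.cpp",
    "hipblas",
)
for command in plan:
    print(command)
```

Printing the plan first makes it easy to compare against what was actually typed before committing to a long compile.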

sharp sapphire Jul 16, 2023, 6:04 PM

#

Gotcha. I honestly just did some guesswork lol. Will try again tomorrow maybe, with hopefully better results.

sharp sapphire Jul 17, 2023, 9:32 AM

#

Right. I tried it again and I couldn't get it to work. Still had the same error. Something about libllama.so.

Next, I tried using AutoGPTQ, GPTQ-for-LLaMa, and ExLLaMa. None of 'em worked.

#

This is with GPTQ-for-LLaMa (yes, ignore the model, I don't know what else to try lol)

#

#

another model, still with GPTQ-for-LLaMa

#

using the updated AutoGPTQ version you sent, this is the error

#

ExLLaMa loads Vicuna 7B successfully, excepttttttttttt

#

#

yea

#

ExLLaMa-HF, still that cursed probability tensor error

sand valley Jul 18, 2023, 1:17 PM

#

On my laptop I can only use oobabooga in CPU mode. I installed llama.cpp with OpenBLAS, but on startup it reports BLAS = 0. What could be the problem?

halcyon crown Jul 18, 2023, 1:23 PM

#

You have to check the build output to determine if it actually used OpenBLAS or failed to find it. It will just continue without it if not found.
The build output can be hidden depending on how you are installing it.

sand valley Jul 18, 2023, 1:41 PM

#

I installed it in the usual way:
pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_OPENBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir

#

And I see this:

📎 message.txt

halcyon crown Jul 18, 2023, 1:58 PM

#

Use this to see the full build output:

pip uninstall -y llama-cpp-python
set "CMAKE_ARGS=-DLLAMA_OPENBLAS=on"
set FORCE_CMAKE=1
set VERBOSE=1
pip install llama-cpp-python --no-cache-dir -v

sand valley Jul 18, 2023, 2:10 PM

#

Everything seems to be fine with the installation. Or am I missing something?

📎 message.txt

halcyon crown Jul 18, 2023, 2:32 PM

#

@sand valley I found the issue. Quotes on CMAKE_ARGS were in Linux/Bash format. I have modified the commands in my previous comment to correct it.
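The quoting difference is easy to miss: in Windows cmd, `set NAME="value"` stores the quote characters as part of the value, while `set "NAME=value"` does not. In Bash the quotes are consumed by the shell. A rough sketch that renders the line for either shell, with `set_env_line` as an illustrative helper name:

```python
import shlex


def set_env_line(name: str, value: str, shell: str) -> str:
    """Render a line that sets an environment variable in the given shell."""
    if shell == "cmd":
        # Quotes wrap the whole NAME=value pair, so they are not stored in the value.
        return f'set "{name}={value}"'
    if shell == "bash":
        # shlex.quote adds single quotes only when the value needs them.
        return f"export {name}={shlex.quote(value)}"
    raise ValueError(f"unsupported shell: {shell}")


print(set_env_line("CMAKE_ARGS", "-DLLAMA_OPENBLAS=on", "cmd"))
print(set_env_line("CMAKE_ARGS", "-DLLAMA_OPENBLAS=on", "bash"))
```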

sand valley Jul 19, 2023, 6:12 AM

#

I also tried this version, but the Blas value is still zero when loading the model. :/ So I don't know if this is a bug in oobabooga or llama.cpp, or if I need some extra functionality from the processor.

halcyon crown Jul 19, 2023, 10:42 AM

#

Just found out that the CMAKE_ARGS needed to install with OpenBLAS were changed at some point:

pip uninstall -y llama-cpp-python
set "CMAKE_ARGS=-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
set FORCE_CMAKE=1
set VERBOSE=1
pip install llama-cpp-python --no-cache-dir -v

sand valley Jul 20, 2023, 6:14 AM

#

The llama.cpp build looks good; what's more interesting is that BLAS = 0 is still shown when loading the model. Maybe OpenBLAS is not implemented for loading?

📎 message.txt

sharp sapphire Jul 20, 2023, 2:23 PM

#

Are you on Linux?
