Tesla K80 help | Text Generation WebUI

gusty ocean Apr 8, 2023, 5:16 AM

#

I have CUDA installed, and it works with the Facebook models:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 3.7
CUDA SETUP: Detected CUDA version 112
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
Loading facebook_opt-2.7b...
Loaded the model in 11.84 seconds.

But not GPTQ:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 3.7
CUDA SETUP: Detected CUDA version 112
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
Loading llama-13b...
CUDA extension not installed.
Loading model ...
Killed

I reran the install, this time with no prefix install dir and after granting the right perms for my user (sudo chown -R sam:sam /usr/local/lib/python3.10/dist-packages/), and got the same error

python3 server.py --wbits 4  --model llama-13b --no-stream

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 3.7
CUDA SETUP: Detected CUDA version 112
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
Loading llama-13b...
Loading model ...
Killed

Nothing gets loaded onto the GPU, only into system memory.

#

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

#

ls -lah /usr/local/lib/python3.10/dist-packages/quant_cuda-0.0.0-py3.10-linux-x86_64.egg/
total 10M
drwxrwxr-x 4 sam sam 4.0K Apr  8 05:09 .
drwxr-xr-x 4 sam sam 4.0K Apr  8 05:09 ..
drwxrwxr-x 2 sam sam 4.0K Apr  8 05:09 EGG-INFO
drwxrwxr-x 2 sam sam 4.0K Apr  8 05:09 __pycache__
-rwxr-xr-x 1 sam sam  10M Apr  8 05:09 quant_cuda.cpython-310-x86_64-linux-gnu.so
-rw-rw-r-- 1 sam sam  436 Apr  8 05:09 quant_cuda.py

#

Had to build against https://github.com/qwopqwop200/GPTQ-for-LLaMa/commit/841feedde876785bc8022ca48fd9c3ff626587e2 due to https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/88

gusty goblet Apr 8, 2023, 5:19 AM

#

Maybe this https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/

cp libbitsandbytes_cuda117.so libbitsandbytes_cpu.so
conda install cudatoolkit

But cuda112_nocublaslt.so ?

gusty ocean Apr 8, 2023, 5:21 AM

#

libbitsandbytes had to be recompiled a while back to even get it to work at all

#

I followed https://github.com/TimDettmers/bitsandbytes/blob/main/compile_from_source.md after installing cuda from the helper script in the repo

#

Doing the cp trick did not work

#

Also, it looks like it's loading the right one (Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so), and I don't have cuda117

#

well, I guess it's more accurate to say it did work for some projects (I left a comment in the issue that the reddit thread references: https://github.com/TimDettmers/bitsandbytes/issues/156#issuecomment-1489953163), but when I wanted to compile alpaca-lora I had to build it from source

#

hmm wait

#

I'm only giving this vm 8G of memory

#

and I did just notice that it loads a lot in system memory before diving into the pool

#

is it possible....

#

I just noticed this same issue with the opt 6.7
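
A bare "Killed" with no Python traceback usually means the process received SIGKILL, and on Linux the out-of-memory killer is a common source when a checkpoint is read into system RAM before being moved to the GPU. A minimal sketch for checking that hypothesis before launching, assuming a Linux guest with /proc/meminfo. The 1.2 headroom factor and the function names are illustrative assumptions, not anything from text-generation-webui:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def fits_in_ram(checkpoint_bytes, meminfo_text, headroom=1.2):
    """Return True if the checkpoint plus headroom fits in free RAM plus free swap.

    headroom is an assumed safety factor, not a measured value.
    """
    info = parse_meminfo(meminfo_text)
    available_kb = info.get("MemAvailable", 0) + info.get("SwapFree", 0)
    return checkpoint_bytes * headroom <= available_kb * 1024


# Example use (the checkpoint path is a placeholder):
# import os
# with open("/proc/meminfo") as f:
#     print(fits_in_ram(os.path.getsize("path/to/checkpoint.pt"), f.read()))
```

If this returns False, giving the VM more memory or adding swap inside it is worth trying before digging into the CUDA side.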

glass marlin Apr 8, 2023, 5:42 AM

#

possible. can always allocate more or presumably allocate swap within the vm

#

weird manifestation tho

#

you might try --pre_layer caching first

gusty ocean Apr 8, 2023, 5:43 AM

#

What does that do? I saw it in a doc somewhere but didn't understand exactly what it did

#

ooo

#

almost

#

I saw it in nvidia-smi

glass marlin Apr 8, 2023, 5:44 AM

#

i'm a bit fuzzy on it too, but my understanding is that it caches some of the model's "layers", of which there are bajillions

#

https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#cpu-offloading

#

not actually bajillions but

#

if you load with --verbose you should see how many layers there are

#

anyway, i'd slash the gpu memory to validate the hypothesis in some form first

gusty ocean Apr 8, 2023, 5:46 AM

#

I don't see it getting killed or crashing

#

but it does crash

#

Just tried with python3 server.py --wbits 4 --model llama-13b --gpu-memory 11 11 --verbose --pre_layer 20

#

ah wait

#

stacktrace

#

Traceback (most recent call last):
  File "/home/sam/git/text-generation-webui/server.py", line 308, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/sam/git/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/home/sam/git/text-generation-webui/modules/GPTQ_loader.py", line 132, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
  File "/home/sam/git/text-generation-webui/repositories/GPTQ-for-LLaMa/llama_inference_offload.py", line 211, in load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "/home/sam/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros ..............

#

Did I not download this model correctly?...
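
A "Missing key(s) in state_dict" error means the checkpoint file and the model code disagree about parameter names, which can happen when the quantization code and the checkpoint come from different versions, not only when a download is broken. A rough sketch for comparing the two key sets, using only the standard library. The function names are made up for illustration, and the key names in the comments other than the `qzeros` one from the traceback are hypothetical:

```python
from collections import Counter


def diff_state_dict_keys(expected_keys, checkpoint_keys):
    """Return the keys the model wants but the file lacks, and vice versa."""
    expected = set(expected_keys)
    found = set(checkpoint_keys)
    return {
        "missing": sorted(expected - found),
        "unexpected": sorted(found - expected),
    }


def summarize_suffixes(keys):
    """Count keys by their last dotted component, e.g. 'qzeros'."""
    return dict(Counter(key.rsplit(".", 1)[-1] for key in keys))


# With PyTorch installed, the two key sets could come from:
#   expected_keys = model.state_dict().keys()
#   checkpoint_keys = torch.load(path, map_location="cpu").keys()
# A summary like {'qzeros': N} under "missing" would suggest a format
# mismatch across the whole model rather than a few corrupted tensors.
```

If every layer is missing the same kind of key, a version mismatch between the checkpoint and the GPTQ code is a more likely explanation than a bad download.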

#

ls
config.json                       pytorch_model-00013-of-00041.bin  pytorch_model-00030-of-00041.bin
generation_config.json            pytorch_model-00014-of-00041.bin  pytorch_model-00031-of-00041.bin
huggingface-metadata.txt          pytorch_model-00015-of-00041.bin  pytorch_model-00032-of-00041.bin
llama-13b-4bit.pt                 pytorch_model-00016-of-00041.bin  pytorch_model-00033-of-00041.bin
pytorch_model-00000-of-00041.bin  pytorch_model-00017-of-00041.bin  pytorch_model-00034-of-00041.bin
pytorch_model-00001-of-00041.bin  pytorch_model-00018-of-00041.bin  pytorch_model-00035-of-00041.bin
pytorch_model-00002-of-00041.bin  pytorch_model-00019-of-00041.bin  pytorch_model-00036-of-00041.bin
pytorch_model-00003-of-00041.bin  pytorch_model-00020-of-00041.bin  pytorch_model-00037-of-00041.bin
pytorch_model-00004-of-00041.bin  pytorch_model-00021-of-00041.bin  pytorch_model-00038-of-00041.bin
pytorch_model-00005-of-00041.bin  pytorch_model-00022-of-00041.bin  pytorch_model-00039-of-00041.bin
pytorch_model-00006-of-00041.bin  pytorch_model-00023-of-00041.bin  pytorch_model-00040-of-00041.bin
pytorch_model-00007-of-00041.bin  pytorch_model-00024-of-00041.bin  pytorch_model-00041-of-00041.bin
pytorch_model-00008-of-00041.bin  pytorch_model-00025-of-00041.bin  pytorch_model.bin.index.json
pytorch_model-00009-of-00041.bin  pytorch_model-00026-of-00041.bin  README.md
pytorch_model-00010-of-00041.bin  pytorch_model-00027-of-00041.bin  special_tokens_map.json
pytorch_model-00011-of-00041.bin  pytorch_model-00028-of-00041.bin  tokenizer_config.json
pytorch_model-00012-of-00041.bin  pytorch_model-00029-of-00041.bin  tokenizer.model

glass marlin Apr 8, 2023, 5:49 AM

#

woah, that's a funky one

gusty ocean Apr 8, 2023, 5:49 AM

#

https://huggingface.co/elinas/alpaca-30b-lora-int4/discussions/2#641a326ee15dc827d9abad2a found something

#

17 days ago

glass marlin Apr 8, 2023, 5:50 AM

#

yea, everything changed about gptq then

#

the dawn of groupsize

#

that indicates to me that it is indeed GPU VRAM

gusty ocean Apr 8, 2023, 5:50 AM

#

aaah but I can't go higher than that other commit due to m10 things...

glass marlin Apr 8, 2023, 5:50 AM

#

you have a secondary issue but

#

i think you can find a set of old commits that will work

#

if you go back to like 18 days ago

#

(unless other M10 things)

gusty ocean Apr 8, 2023, 5:51 AM

#

oh https://github.com/qwopqwop200/GPTQ-for-LLaMa/commit/841feedde876785bc8022ca48fd9c3ff626587e2 is newer than 468c47c01b4fe370616747b6d69a2d3f48bab5e4

#

I'll try this older commit now

#

man I was tempted to buy like 5 more k80s but

#

I think I might just save up

#

uh oh 4 errors detected in the compilation of "quant_cuda_kernel.cu".

#

Leaving this here for discord search


quant_cuda_kernel.cu(149): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double> *, c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>)
          detected during instantiation of "void VecQuant2MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>]" 
(87): here

quant_cuda_kernel.cu(261): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double> *, c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>)
          detected during instantiation of "void VecQuant3MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>]" 
(171): here

quant_cuda_kernel.cu(337): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double> *, c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>)
          detected during instantiation of "void VecQuant4MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>]" 
(283): here

quant_cuda_kernel.cu(409): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double> *, c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>)
          detected during instantiation of "void VecQuant8MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>]" 
(359): here

#

fuck man https://stackoverflow.com/a/37569519

Stack Overflow

CUDA atomicAdd for doubles definition error

In previous versions of CUDA, atomicAdd was not implemented for doubles, so it is common to implement this like here. With the new CUDA 8 RC, I run into troubles when I try to compile my code which

#

oh fuck who is this hero https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/58

GitHub

Add support for devices with compute capability < 6.0 by tobbez · P...

Without this change, building for devices with compute capability < 6.0
fails with:
quant_cuda_kernel.cu(149): error: no instance of overloaded function "atomicAdd" matches the argumen...
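
The native double-precision `atomicAdd` overload only exists on devices with compute capability 6.0 or higher, which is why the kernel fails to compile for a 3.7 card without a fallback. A small sketch that turns the capability string from the bitsandbytes log into the two thresholds seen in this thread (6.0 from the PR title, 7.5 from the bitsandbytes warning). The function names are illustrative, not part of any library:

```python
def parse_capability(text):
    """Turn a string like '3.7' into a (major, minor) tuple of ints."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def needs_double_atomic_add_fallback(capability):
    """True when the GPU predates native atomicAdd on doubles (below 6.0)."""
    return parse_capability(capability) < (6, 0)


def bitsandbytes_slow_8bit(capability):
    """True when bitsandbytes would warn about slow 8-bit matmul (below 7.5)."""
    return parse_capability(capability) < (7, 5)


# With PyTorch installed, torch.cuda.get_device_capability(0) returns the
# same (major, minor) tuple directly, so parse_capability could be skipped.
```

For the card in this thread, `needs_double_atomic_add_fallback("3.7")` is True, which matches the four compile errors above.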

#

ec1459998a39deb3e4abad5283b9616cfa18c9ee ?

#

Just warnings 🙌

#

hmmm

#

Traceback (most recent call last):
  File "/home/sam/git/text-generation-webui/server.py", line 308, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/sam/git/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/home/sam/git/text-generation-webui/modules/GPTQ_loader.py", line 132, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
  File "/home/sam/git/text-generation-webui/repositories/GPTQ-for-LLaMa/llama_inference_offload.py", line 211, in load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "/home/sam/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        Missing key(s) in state_dict

#

uhhh

#

the other one is older

#

oh

#

hold on

#

i gotta...

#

cherrypick

glass marlin Apr 8, 2023, 6:03 AM

#

the stream of consciousness i just can't

gusty ocean Apr 8, 2023, 6:03 AM

#

and by cherrypick I mean nano cause I'm lazy

#

So currently:

ec1459998a39deb3e4abad5283b9616cfa18c9ee (in other words, PR 58, tobbez/support-pre-6.0-compute-capability) allows this to compile, but
468c47c01b4fe370616747b6d69a2d3f48bab5e4 (in other words, the fix from https://huggingface.co/elinas/alpaca-30b-lora-int4/discussions/2) doesn't include this change

#

so I have to take the one fix and apply it to the other fix to fix the fix

#

¯\_(ツ)_/¯

#

Loading llama-13b...
Traceback (most recent call last):
  File "/home/sam/git/text-generation-webui/server.py", line 308, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/sam/git/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/home/sam/git/text-generation-webui/modules/GPTQ_loader.py", line 132, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
TypeError: load_quant() takes 4 positional arguments but 5 were given

hmmm

#

so the older version of quant...

#

needs to be matched with an older version of web-ui?

#

or this patched

#

oh

#

I removed --pre_layer 20
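
The TypeError above is an arity mismatch: the caller passed five positional arguments to a function that, at the checked-out commit, accepts four. A hedged sketch of how such a mismatch could be detected up front with `inspect.signature`. `load_quant_old` is a hypothetical stand-in with made-up parameter names, not the real GPTQ-for-LLaMa function:

```python
import inspect


def max_positional_args(func):
    """Return how many positional args func accepts, or None if it takes *args."""
    params = list(inspect.signature(func).parameters.values())
    if any(p.kind is inspect.Parameter.VAR_POSITIONAL for p in params):
        return None
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    return sum(p.kind in positional_kinds for p in params)


def accepts(func, n_args):
    """True if func can be called with n_args positional arguments at most."""
    limit = max_positional_args(func)
    return limit is None or n_args <= limit


# Hypothetical stand-in for an older four-argument loader.
def load_quant_old(model, checkpoint, wbits, groupsize):
    return (model, checkpoint, wbits, groupsize)


# accepts(load_quant_old, 5) is False, so a caller could raise a clear
# "this GPTQ version does not support pre_layer" message instead.
```

This only checks the upper bound on positional arguments; it does not validate required arguments or keyword names.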

#

  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

#

!!!!!!!!!!!!!!!!1

#

holfuckingshit we did it

#

ok i have some gh issues to make

#

or....

#

a blog?

glass marlin Apr 8, 2023, 6:16 AM

#

wherever really old GPU owners congregate. you now have this help-forum to memorialize your triumph, too

#

i'm honored to be a part of your memoirs

gusty ocean Apr 8, 2023, 6:17 AM

#

ty for your support. google definitely isn't going anywhere with its fine-grained error searching either lol
