#Tesla K80 help

gusty ocean
#

I have cuda installed, and it works with the facebook models

CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 3.7
CUDA SETUP: Detected CUDA version 112
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
Loading facebook_opt-2.7b...
Loaded the model in 11.84 seconds.

But not gptq

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 3.7
CUDA SETUP: Detected CUDA version 112
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
Loading llama-13b...
CUDA extension not installed.
Loading model ...
Killed

I reran the install, this time with no prefix install dir and after granting the right perms for my user (sudo chown -R sam:sam /usr/local/lib/python3.10/dist-packages/), and got the same error

python3 server.py --wbits 4  --model llama-13b --no-stream

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 3.7
CUDA SETUP: Detected CUDA version 112
/usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so...
Loading llama-13b...
Loading model ...
Killed

Nothing gets loaded into GPU, only system memory.

#
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
#
ls -lah /usr/local/lib/python3.10/dist-packages/quant_cuda-0.0.0-py3.10-linux-x86_64.egg/
total 10M
drwxrwxr-x 4 sam sam 4.0K Apr  8 05:09 .
drwxr-xr-x 4 sam sam 4.0K Apr  8 05:09 ..
drwxrwxr-x 2 sam sam 4.0K Apr  8 05:09 EGG-INFO
drwxrwxr-x 2 sam sam 4.0K Apr  8 05:09 __pycache__
-rwxr-xr-x 1 sam sam  10M Apr  8 05:09 quant_cuda.cpython-310-x86_64-linux-gnu.so
-rw-rw-r-- 1 sam sam  436 Apr  8 05:09 quant_cuda.py
gusty goblet
gusty ocean
#

libbitsandbytes had to be recompiled a while back to even get it to work at all

#

Doing the cp trick did not work

#

Also, it looks like it's loading the right one (Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so), and I don't have cuda117

#

hmm wait

#

I'm only giving this vm 8G of memory

#

and I did just notice that it loads a lot in system memory before diving into the pool

#

is it possible....

#

I just noticed this same issue with opt-6.7b

glass marlin
#

possible. can always allocate more or presumably allocate swap within the vm
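
One way to test the memory hypothesis before loading: a minimal Python sketch, assuming a Linux guest where `/proc/meminfo` is readable. It compares the checkpoint size against available RAM plus free swap. The function names and the `headroom` factor are hypothetical, chosen for illustration only.

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in KiB}."""
    info = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match:
            info[match.group(1)] = int(match.group(2))
    return info


def fits_in_host_memory(checkpoint_bytes, meminfo, headroom=1.2):
    """Return True if available RAM plus free swap covers the checkpoint.

    headroom is an assumed safety factor, not a measured value.
    """
    available_kib = meminfo.get("MemAvailable", 0) + meminfo.get("SwapFree", 0)
    return available_kib * 1024 >= checkpoint_bytes * headroom
```

To use it, pass the contents of `/proc/meminfo` to `parse_meminfo` and the result of `os.path.getsize` on the checkpoint file to `fits_in_host_memory`. A `False` result would be consistent with the process getting `Killed` while the model is still being read into system memory.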

#

weird manifestation tho

#

you might try --pre_layer caching first

gusty ocean
#

What does that do? I saw it in a doc somewhere but didn't understand exactly what it did

#

almost

#

I saw it in nvidia-smi

glass marlin
#

i'm a bit fuzzy on it too, but my understanding is that it caches some of the model's "layers", of which there are bajillions

#

not actually bajillions but

#

if you load with --verbose you should see how many layers there are
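
A rough Python sketch of what a layer split like this could look like. It assumes the flag keeps the first N layers on the GPU and leaves the rest on the CPU, which is an assumption rather than something confirmed in this thread. `split_layers` is a hypothetical helper, not part of the web UI.

```python
def split_layers(num_layers, pre_layer):
    """Return (gpu_layers, cpu_layers) as lists of layer indices.

    Assumption: the first `pre_layer` layers stay on the GPU and the
    remainder are offloaded to the CPU.
    """
    on_gpu = min(max(pre_layer, 0), num_layers)
    return list(range(on_gpu)), list(range(on_gpu, num_layers))
```

Under that assumption, a 40-layer model with a value of 20 would keep layers 0 to 19 on the GPU and hold layers 20 to 39 in system memory, so a lower value trades VRAM for host RAM.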

#

anyway, i'd slash the gpu memory to validate the hypothesis in some form first

gusty ocean
#

I don't see it getting killed or crashing

#

but it does crash

#

Just tried with python3 server.py --wbits 4 --model llama-13b --gpu-memory 11 11 --verbose --pre_layer 20

#

ah wait

#

stacktrace

#
Traceback (most recent call last):
  File "/home/sam/git/text-generation-webui/server.py", line 308, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/sam/git/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/home/sam/git/text-generation-webui/modules/GPTQ_loader.py", line 132, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
  File "/home/sam/git/text-generation-webui/repositories/GPTQ-for-LLaMa/llama_inference_offload.py", line 211, in load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "/home/sam/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros ..............
#

Did I not download this model correctly?...
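
The traceback above says the checkpoint lacks keys the model code expects (such as `...k_proj.qzeros`), which usually means the file and the loading code disagree about the format rather than that the download is corrupt. A minimal Python sketch for summarising such a mismatch follows. `diff_state_dict_keys` is a hypothetical helper, and it only compares key names.

```python
from collections import Counter


def diff_state_dict_keys(expected_keys, checkpoint_keys):
    """Summarise missing/unexpected keys by their final name component."""
    expected, actual = set(expected_keys), set(checkpoint_keys)

    def suffix(key):
        # "model.layers.0.self_attn.k_proj.qzeros" -> "qzeros"
        return key.rsplit(".", 1)[-1]

    missing = Counter(suffix(k) for k in expected - actual)
    unexpected = Counter(suffix(k) for k in actual - expected)
    return dict(missing), dict(unexpected)
```

The inputs could come from `model.state_dict().keys()` and `torch.load(path, map_location="cpu").keys()`, assuming the checkpoint is a plain state dict as the `load_state_dict(torch.load(checkpoint))` call in the traceback suggests. If one suffix shows up only in `missing` and a similar one only in `unexpected`, that points at a format change between the quantizer and the loader.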

#
ls
config.json                       pytorch_model-00013-of-00041.bin  pytorch_model-00030-of-00041.bin
generation_config.json            pytorch_model-00014-of-00041.bin  pytorch_model-00031-of-00041.bin
huggingface-metadata.txt          pytorch_model-00015-of-00041.bin  pytorch_model-00032-of-00041.bin
llama-13b-4bit.pt                 pytorch_model-00016-of-00041.bin  pytorch_model-00033-of-00041.bin
pytorch_model-00000-of-00041.bin  pytorch_model-00017-of-00041.bin  pytorch_model-00034-of-00041.bin
pytorch_model-00001-of-00041.bin  pytorch_model-00018-of-00041.bin  pytorch_model-00035-of-00041.bin
pytorch_model-00002-of-00041.bin  pytorch_model-00019-of-00041.bin  pytorch_model-00036-of-00041.bin
pytorch_model-00003-of-00041.bin  pytorch_model-00020-of-00041.bin  pytorch_model-00037-of-00041.bin
pytorch_model-00004-of-00041.bin  pytorch_model-00021-of-00041.bin  pytorch_model-00038-of-00041.bin
pytorch_model-00005-of-00041.bin  pytorch_model-00022-of-00041.bin  pytorch_model-00039-of-00041.bin
pytorch_model-00006-of-00041.bin  pytorch_model-00023-of-00041.bin  pytorch_model-00040-of-00041.bin
pytorch_model-00007-of-00041.bin  pytorch_model-00024-of-00041.bin  pytorch_model-00041-of-00041.bin
pytorch_model-00008-of-00041.bin  pytorch_model-00025-of-00041.bin  pytorch_model.bin.index.json
pytorch_model-00009-of-00041.bin  pytorch_model-00026-of-00041.bin  README.md
pytorch_model-00010-of-00041.bin  pytorch_model-00027-of-00041.bin  special_tokens_map.json
pytorch_model-00011-of-00041.bin  pytorch_model-00028-of-00041.bin  tokenizer_config.json
pytorch_model-00012-of-00041.bin  pytorch_model-00029-of-00041.bin  tokenizer.model
glass marlin
#

woah, that's a funky one

gusty ocean
#

17 days ago

glass marlin
#

yea, everything changed about gptq then

#

the dawn of groupsize

#

that indicates to me that it is indeed GPU VRAM

gusty ocean
#

aaah but I can't go higher than that other commit due to m10 things...

glass marlin
#

you have a secondary issue but

#

i think you can find a set of old commits that will work

#

if you go back to like 18 days ago

#

(unless other M10 things)

gusty ocean
#

ill try this older commit now

#

man I was tempted to buy like 5 more k80s but

#

I think I might just save up

#

uh oh 4 errors detected in the compilation of "quant_cuda_kernel.cu".

#

Leaving this here for discord search


quant_cuda_kernel.cu(149): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double> *, c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>)
          detected during instantiation of "void VecQuant2MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>]" 
(87): here

quant_cuda_kernel.cu(261): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double> *, c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>)
          detected during instantiation of "void VecQuant3MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>]" 
(171): here

quant_cuda_kernel.cu(337): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double> *, c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>)
          detected during instantiation of "void VecQuant4MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>]" 
(283): here

quant_cuda_kernel.cu(409): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double> *, c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>)
          detected during instantiation of "void VecQuant8MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const scalar_t *, int, int, int, int) [with scalar_t=c10::impl::ScalarTypeToCPPTypeT<c10::ScalarType::Double>]" 
(359): here
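
These errors most likely come from the card itself: the native `atomicAdd(double*, double)` overload only exists for compute capability 6.0 and newer, and the logs earlier in this thread report 3.7. A small Python sketch that checks this from the bitsandbytes log line follows. The helper names are hypothetical.

```python
import re

# Native atomicAdd(double*, double) needs compute capability 6.0 or newer.
MIN_CC_DOUBLE_ATOMIC_ADD = (6, 0)


def parse_compute_capability(log_line):
    """Extract (major, minor) from a line ending in e.g. 'detected: 3.7'."""
    match = re.search(r"(\d+)\.(\d+)\s*$", log_line)
    if match is None:
        raise ValueError(f"no compute capability found in: {log_line!r}")
    return int(match.group(1)), int(match.group(2))


def has_native_double_atomic_add(capability):
    """Return True if the device supports atomicAdd on doubles natively."""
    return tuple(capability) >= MIN_CC_DOUBLE_ATOMIC_ADD
```

Feeding it the `CUDA SETUP: Highest compute capability among GPUs detected: 3.7` line gives `(3, 7)`, which fails the check. That matches all four errors being raised for the `Double` instantiation of each kernel.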
#

ec1459998a39deb3e4abad5283b9616cfa18c9ee ?

#

Just warnings 🙌

#

hmmm

#
Traceback (most recent call last):
  File "/home/sam/git/text-generation-webui/server.py", line 308, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/sam/git/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/home/sam/git/text-generation-webui/modules/GPTQ_loader.py", line 132, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
  File "/home/sam/git/text-generation-webui/repositories/GPTQ-for-LLaMa/llama_inference_offload.py", line 211, in load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "/home/sam/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        Missing key(s) in state_dict
#

uhhh

#

the other one is older

#

oh

#

hold on

#

i gotta...

#

cherrypick

glass marlin
#

the stream of consciousness i just can't

gusty ocean
#

and by cherrypick I mean nano cause im lazy

#

so I have to take the fix and I apply it to the other fix to fix the fix

#

¯_(ツ)_/¯

#
Loading llama-13b...
Traceback (most recent call last):
  File "/home/sam/git/text-generation-webui/server.py", line 308, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/sam/git/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/home/sam/git/text-generation-webui/modules/GPTQ_loader.py", line 132, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
TypeError: load_quant() takes 4 positional arguments but 5 were given

hmmm

#

so the older version of quant...

#

needs to be matched with an older version of web-ui?

#

or this patched

#

oh

#

I removed --pre_layer 20
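
The `TypeError` above is a version mismatch: the web UI passes five positional arguments, while the older `load_quant` only declares four. A hedged Python sketch of a guard that adapts the call to whatever the function accepts follows. `call_with_supported_args` is a hypothetical helper, not something either repository provides.

```python
import inspect


def call_with_supported_args(func, *args):
    """Call func with only as many positional args as it declares.

    If func accepts *args, everything is passed through unchanged.
    """
    params = list(inspect.signature(func).parameters.values())
    if any(p.kind is inspect.Parameter.VAR_POSITIONAL for p in params):
        return func(*args)
    positional = [
        p for p in params
        if p.kind in (inspect.Parameter.POSITIONAL_ONLY,
                      inspect.Parameter.POSITIONAL_OR_KEYWORD)
    ]
    # Extra trailing arguments (here: pre_layer) are dropped silently.
    return func(*args[:len(positional)])
```

Dropping the extra argument has the same effect as removing `--pre_layer 20` by hand: the model loads, but no layer offloading is applied, so this only helps when the whole model fits on the GPU.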

#
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
#

!!!!!!!!!!!!!!!!1

#

holfuckingshit we did it

#

ok i have some gh issues to make

#

or....

#

a blog?

glass marlin
#

wherever really old GPU owners congregate. you now have this help-forum to memorialize your triumph, too

#

i'm honored to be a part of your memoirs

gusty ocean
#

ty for your support. google definitely isn't going anywhere with its fine grain error searching either lol