#GPTQ ExLlama Won't Split Between GPUs

bitter zephyr

I cannot for the life of me get ExLlama or ExLlama_HF to use both GPUs. I have a 3060 and a 2070 Super, 20 GB of VRAM total. I would like to use both GPUs so I can either run a 30b for better conversational skills or a 13b with higher context for story writing.

elder imp

you probably won't get 30b working, but 13b with higher context should. my cards are all 12gb, so they don't match your setup, but i found that a gpu split of 8,10 worked for 2 gpus and 5,5,10 works for 3 gpus when using ExLlama. most number combinations just don't work, so it took a lot of trial and error to get the 3 working together. so yours is just not loading anything into the second gpu?
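For anyone tuning these numbers, here is a minimal sketch of a sanity check for a split string before trying to load a model. It assumes the split is a comma-separated list of per-GPU VRAM budgets in GB, one entry per card in device order. The helper names `parse_gpu_split` and `headroom` are made up for illustration and are not part of ExLlama or the web UI. It only checks that each budget fits on its card and reports what is left over; it does not predict which combinations will actually load.

```python
def parse_gpu_split(split: str) -> list[float]:
    """Turn a comma-separated split such as "5,5,10" into per-GPU budgets in GB."""
    parts = [p.strip() for p in split.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty gpu split")
    return [float(p) for p in parts]


def headroom(split: str, card_vram_gb: list[float]) -> list[float]:
    """Return the VRAM (GB) left on each card after subtracting its budget.

    Hypothetical helper: it knows nothing about the loader itself, it only
    compares the requested budgets with the stated card sizes.
    """
    budgets = parse_gpu_split(split)
    if len(budgets) != len(card_vram_gb):
        raise ValueError(
            f"split has {len(budgets)} entries but {len(card_vram_gb)} cards were given"
        )
    for index, (budget, capacity) in enumerate(zip(budgets, card_vram_gb)):
        if budget > capacity:
            raise ValueError(
                f"GPU {index}: budget {budget} GB exceeds card size {capacity} GB"
            )
    return [capacity - budget for budget, capacity in zip(budgets, card_vram_gb)]


# The two splits quoted above, on 12 GB cards.
print(headroom("8,10", [12, 12]))        # [4.0, 2.0]
print(headroom("5,5,10", [12, 12, 12]))  # [7.0, 7.0, 2.0]
```

Both splits from the message above leave a few GB unassigned on every card, with the earlier cards getting the smaller budgets.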

bitter zephyr
elder imp

yes, i get 30b working fine with three 12gb cards, using a 5,5,10 split. i can't do 8k context, but 6k works great. i could run 13b on one 12gb card, but only at 2k context; it would run out of vram if i tried 8k.
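A rough back-of-the-envelope estimate shows why context length eats VRAM so quickly. This sketch assumes an fp16 key/value cache (2 bytes per value) and the published LLaMA-13B shape of 40 layers with hidden size 5120. It is an estimate only, not a measurement of what ExLlama actually allocates, and the function name `kv_cache_gib` is made up for illustration.

```python
def kv_cache_gib(n_layers: int, hidden_size: int, context_len: int,
                 bytes_per_value: int = 2) -> float:
    """Estimate key/value cache size in GiB for a given context length.

    Each layer stores one key tensor and one value tensor, each of shape
    (context_len, hidden_size). bytes_per_value=2 assumes fp16 storage.
    """
    total_bytes = 2 * n_layers * hidden_size * context_len * bytes_per_value
    return total_bytes / 2**30


# LLaMA-13B shape (40 layers, hidden size 5120) at the context sizes in the thread.
for context_len in (2048, 6144, 8192):
    print(f"{context_len} tokens: {kv_cache_gib(40, 5120, context_len):.2f} GiB")
```

Under these assumptions the cache alone grows from about 1.56 GiB at 2k to 6.25 GiB at 8k, on top of the model weights, which fits with a 13b running at 2k on a single 12gb card but running out of VRAM at 8k.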

mossy canyon

No matter what I try, I can't get ExLlama to split across my GPUs, an RTX 3090 and an RTX 4060 Ti 16GB. I've traced the numbers from Gradio all the way into the loader, and the split value is there, but it always uses device 0 and only that. I'm current as of commit 6cca8b8. It splits with llama, and it splits with AutoGPTQ. Using Python 3.10.12 and CUDA 12.2.