# GPTQ ExLlama Won't Split Between GPUs
You probably won't get 30B working, but 13B with a higher context should. My cards are all 12 GB, so they don't match your setup, but I found that a GPU split of 8,10 worked for 2 GPUs and 5,5,10 works for 3 GPUs when using ExLlama. Most number combinations just don't work, so it took a lot of trial and error to get all 3 working together. So yours just isn't loading anything onto the second GPU?
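The split values are per-GPU budgets. In text-generation-webui's ExLlama loader, `gpu-split` is described as a comma-separated list of VRAM (in GB) to use per GPU for model layers. A minimal sanity check before another round of trial and error, assuming that format, might look like this. `parse_gpu_split` and `check_split` are hypothetical helpers, not part of ExLlama or the webui, and passing the check does not guarantee a load succeeds, since the context cache and other buffers need VRAM too.

```python
def parse_gpu_split(split: str) -> list[float]:
    """Turn a string like "5,5,10" into per-GPU budgets in GB (hypothetical helper)."""
    parts = [p.strip() for p in split.split(",")]
    if any(p == "" for p in parts):
        raise ValueError(f"malformed gpu split: {split!r}")
    return [float(p) for p in parts]


def check_split(split: str, vram_gb: list[float]) -> list[str]:
    """List obvious problems with `split` for cards that have `vram_gb` GB each."""
    budgets = parse_gpu_split(split)
    problems = []
    if len(budgets) > len(vram_gb):
        problems.append(f"{len(budgets)} entries but only {len(vram_gb)} GPUs")
    # Each budget must fit on its own card; extra entries were reported above.
    for i, (budget, vram) in enumerate(zip(budgets, vram_gb)):
        if budget > vram:
            problems.append(f"GPU {i}: budget {budget} GB exceeds {vram} GB of VRAM")
    return problems


if __name__ == "__main__":
    print(check_split("5,5,10", [12, 12, 12]))
    print(check_split("8,14", [12, 12]))
```

An empty list means nothing is obviously wrong with the numbers themselves; it says nothing about whether the model plus its cache fits.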
Can you get 30B working with three 12 GB cards? I had to lower card 0's split value to 4 or 5 to get 13B to load with 8k context.
Yes, I get 30B working fine with three 12 GB cards using a 5,5,10 split. I can't do 8k context, but 6k works great. I could run 13B on a single 12 GB card, but only at 2k context; it ran out of VRAM when I tried 8k.
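Context length matters this much because the key/value cache grows linearly with it. A back-of-the-envelope sketch, assuming an fp16 cache (2 bytes per element) and the standard LLaMA shapes (13B: 40 layers, hidden size 5120; 30B/33B: 60 layers, hidden size 6656). This estimates the cache only; it is not how ExLlama itself budgets memory.

```python
def kv_cache_gib(n_layers: int, hidden_size: int, context_len: int,
                 bytes_per_elem: int = 2) -> float:
    """Estimated key/value cache size in GiB.

    Each layer keeps one key and one value vector of `hidden_size` elements
    per token, hence the leading factor of 2. bytes_per_elem=2 assumes fp16.
    """
    total_bytes = 2 * n_layers * hidden_size * context_len * bytes_per_elem
    return total_bytes / 2**30


# Assumed standard LLaMA shapes: (layers, hidden size)
MODELS = {"13B": (40, 5120), "30B": (60, 6656)}

if __name__ == "__main__":
    for name, ctx in (("13B", 2048), ("13B", 8192), ("30B", 6144), ("30B", 8192)):
        layers, hidden = MODELS[name]
        print(f"{name} @ {ctx} tokens: {kv_cache_gib(layers, hidden, ctx):.2f} GiB")
```

Under these assumptions the cache alone is 1.5625 GiB for 13B at 2k but 6.25 GiB at 8k, and about 9.14 GiB for 30B at 6k versus about 12.19 GiB at 8k. The quantized weights, scratch buffers and CUDA context come on top of that, and this sketch does not model them.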
No matter what I try, I can't get ExLlama to split across my GPUs, an RTX 3090 and an RTX 4060 Ti 16GB. I've traced the split values from the Gradio UI all the way into the loader, and they arrive intact, but it always uses device 0 and only device 0. I'm current as of commit 6cca8b8. It splits with llama, and it splits with AutoGPTQ. I'm using Python 3.10.12 and CUDA 12.2.
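One thing worth ruling out is a split whose first value already covers the whole model. The sketch below is a simplified fill-first model, an assumption rather than ExLlama's actual allocator: every layer is given the same hypothetical size, and layers go to device 0 until its budget runs out, then to device 1. Under that model a generous first value, such as the full 24 GB of a 3090, never spills anything onto the second card, which would match the reply above about having to lower card 0's value to 4 or 5 before the second card got used.

```python
def assign_layers(split_gb: list[float], n_layers: int, layer_gb: float) -> list[int]:
    """Return the device index chosen for each layer, filling device 0 first.

    Simplified model (an assumption, not ExLlama's code): every layer costs the
    same `layer_gb`, and a device is used until its budget can't hold another layer.
    """
    remaining = list(split_gb)  # budget left on each device, in GB
    placement = []
    device = 0
    for _ in range(n_layers):
        # Move to the next device once the current budget is used up.
        while device < len(remaining) and remaining[device] < layer_gb:
            device += 1
        if device == len(remaining):
            raise RuntimeError("split budgets are too small for the whole model")
        remaining[device] -= layer_gb
        placement.append(device)
    return placement


if __name__ == "__main__":
    # Hypothetical model: 40 layers of 0.25 GB each (10 GB of weights).
    for split in ([24, 16], [4, 16]):
        placement = assign_layers(split, n_layers=40, layer_gb=0.25)
        print(split, {d: placement.count(d) for d in range(len(split))})
```

With `[24, 16]` all 40 hypothetical layers land on device 0; with `[4, 16]` the first 16 go to device 0 and the remaining 24 to device 1.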