# Can't run a 70B Llama 3.1 model on 2 A100 80GB GPUs
Maybe 70B needs 192GB or something like that
Up until now, RunPod has only supported using a single GPU in Serverless, with the exception of using two 48GB cards (which honestly didn't help, given the overhead involved in multi-GPU setups for LLMs.) You were effectively limited to what you could fit in 80GB, so you would essentially be
This blogpost said that 2 80GB are enough
yeah im not sure about the minimum requirements, maybe let me check
alright, also how much network volume do you think I need for this?
maybe around 150~
can you try another GPU setup, 4x?
ok
np
It got stuck here again
It's always at this place
What do you think could be the problem @trail acorn
It went a bit further now
and now it just shifted to a different worker
still loading
Maybe loading took too long
just stop it first if you feel like it's taking too long
what gpu setup are you using?
It's 4x 48GB, not pro, just normal
I suggest opting for GPUs with 200GB+ of VRAM. You can try a lower option, but performance may suffer. 🥲
yeah it should fit, but it creates new worker?
why is that @vast solstice
But it says you need 140GB
I gave it 196
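For reference, the 140GB figure matches a back-of-the-envelope estimate of the weights alone at 16-bit precision: about 70 billion parameters times 2 bytes each. Here is a minimal sketch of that arithmetic, assuming a round 70e9 parameter count and ignoring KV cache, activations, and multi-GPU overhead (these are assumptions, not numbers measured in this thread):

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights only, in decimal GB."""
    return num_params * bytes_per_param / 1e9


def headroom_gb(num_gpus: int, vram_per_gpu_gb: float, weights_gb: float) -> float:
    """VRAM left over for KV cache and overhead after loading the weights."""
    return num_gpus * vram_per_gpu_gb - weights_gb


# Assumed round figure for a "70B" model; the exact count differs slightly.
PARAMS_70B = 70e9

weights = weight_memory_gb(PARAMS_70B)  # 2 bytes per parameter (fp16/bf16)
print(f"weights: {weights:.0f} GB")
print(f"2 x 80GB headroom: {headroom_gb(2, 80, weights):.0f} GB")
print(f"4 x 48GB headroom: {headroom_gb(4, 48, weights):.0f} GB")
```

So the weights fit on paper in both setups, but 2x80GB leaves much less room for everything else than 4x48GB does.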
@vast solstice how do you think we can fix this?
Do you mind trying more VRAM to see if that helps?
What do you think i should put?
😂 I usually start with the highest memory I can set and keep reducing it until it stops working.
ValueError: Total number of attention heads (64) must be divisible by tensor parallel size (6).
What does this mean
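The error says that tensor parallelism splits the model's attention heads evenly across the GPUs, so the GPU count has to divide the number of heads. A small sketch of that check, using the 64 heads from the error message (the helper name is just for illustration, not part of any library):

```python
def valid_tensor_parallel_sizes(num_heads: int, max_gpus: int) -> list[int]:
    """GPU counts up to max_gpus that divide the attention heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_heads % n == 0]


# 64 attention heads, as reported in the ValueError above.
print(valid_tensor_parallel_sizes(64, 8))  # [1, 2, 4, 8]
```

With 64 heads, 6 GPUs can't work because 64 / 6 isn't a whole number, while 2, 4, or 8 GPUs can.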
@vast solstice Now I have 384GB but it is still getting stuck there
Same logs output?
yea
same
Using model weights format ['*.safetensors']
it's always at this spot that it gets stuck
@vast solstice it worked
I have 4x48GB
I had to wait 6 minutes for the first time
and then now its working quickly
Nvm it became slow again
@vast solstice it becomes slow when it has to load after a while
Yeah, it's a cold start issue
Cool 👍🏻 Try setting 1 active worker; that makes sure there's no cold start when testing.
don't use 6 gpus
Totally unrelated, I've tried 4x48GB and it works, and this too proved it works
I strongly believe that this "creating new workers on model loading" thing has to do with RunPod's autoscaling
yeah
I was talking about this, sorry. Thankfully it works right now, nice
Yea
what token speed did you achieve?
what cost per token?