#can't run 70b


devout bobcat
#

any tips to run a 70b model, for example: mlabonne/Llama-3.1-70B-Instruct-lorablated

i tried that:
config
80GB GPU
2GPUs / Worker
container disk: 500 gb

env var:
MAX_MODEL_LEN 15000*
MODEL_NAME mlabonne/Llama-3.1-70B-Instruct-lorablated

but it doesn't work

without MAX_MODEL_LEN 15000, i got: "The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18368). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine."
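For context on that error, here is a rough sketch of the arithmetic behind it. It is not vLLM's actual code. The architecture numbers (80 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Llama 3.1 70B config, and the function names are made up for illustration.

```python
# Rough KV-cache sizing for a Llama-3.1-70B-style model.
# Assumed architecture (from the published Llama 3.1 70B config):
# 80 transformer layers, 8 key/value heads (grouped-query attention), head dim 128.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_bytes_per_token(dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys + values across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * dtype_bytes


def kv_cache_gib(num_tokens: int, dtype_bytes: int = 2) -> float:
    """KV cache size in GiB for num_tokens tokens (2 bytes = fp16/bf16, 1 = fp8)."""
    return num_tokens * kv_bytes_per_token(dtype_bytes) / 2**30


if __name__ == "__main__":
    for tokens in (15_000, 18_368, 131_072):
        print(f"{tokens:>7} tokens -> {kv_cache_gib(tokens):.1f} GiB of KV cache")
```

Under these assumptions a single 131072-token sequence needs 40 GiB of 16-bit KV cache, while the 18368 tokens reported in the error correspond to about 5.6 GiB. The 16-bit weights of a 70B model are roughly 140 GB on their own, so on 2x80GB there is little room left for the cache, which is why a smaller MAX_MODEL_LEN is needed.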

2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Pull complete
2024-08-08T12:44:26Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:26Z Status: Downloaded newer image for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:26Z worker is ready
2024-08-08T12:44:38Z create pod network
2024-08-08T12:44:38Z create container runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z stable-cuda12.1.0 Pulling from runpod/worker-v1-vllm
2024-08-08T12:44:38Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:38Z Status: Image is up to date for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z worker is ready
2024-08-08T12:44:39Z start container
2024-08-08T12:48:14Z start container

and nothing after

cosmic tinsel
# devout bobcat any tips to run a 70b model, for example: mlabonne/Llama-3.1-70B-Instruct-lorabl...

The logs you provided are the System Logs, which show how the container is launching. Have you checked your Container Logs? Those show what your application is doing. You can see both on the RunPod website: click on a running worker, then click Logs, and you will get a choice of System Logs and Container Logs. You have to be quick to catch them, because the Container Logs are no longer visible after the worker stops running.

devout bobcat
cosmic tinsel
#

I don't really do much with vLLM, so I cannot comment on your speed. To me 160GB seems like a lot and 7-8 seconds sounds fast, but I have never needed more than 48GB for any model I have run. What kind of speed are you looking for with 2000 input tokens and 200 output tokens? Are you trying to do something live?

devout bobcat
#

but i don't want to run vanilla model, that's why i need runpod.

are you using 70b model with 48gb? or you use tinier model?

cosmic tinsel
#

I am not running any vLLM. I do image generation, lipsync, voice cloning, along with some custom apps.

devout bobcat
#

ok!

cosmic tinsel
#

Good luck!

devout bobcat
#

thanks for your help

worn island
#

@devout bobcat you are talking about "pods", not serverless, right? Because I wonder how you would get two GPUs to run this model.

worn island
#

hmm, but you can't have more than one GPU / worker when using serverless. That's why I'm wondering what you are using to run this

#

ah forget it 😄 I just saw it in the UI

plush summit
#

You can

worn island
#

I have never used it yet

#

Yes totally, was my fault 😄

plush summit
#

But for me it's not working and idk why

worn island
#

@plush summit what is your problem? Also related to llama 3.1 70b?

plush summit
#

Yea

#

It keeps getting stuck at the same place

worn island
#

I will take a look at both things

plush summit
#

thanks

plush summit
#

@worn island it worked

#

just that it takes a bit of time to load for the first time in a while

worn island
#

perfect. You are also running on 2x80GB GPUs?

plush summit
#

80GB GPUs are hard to get

#

this is easier

#

just a problem with cold start right now

cosmic tinsel
#

Are you using a network volume? Doing so can add 30-60 seconds to delayTime.

plush summit
#

I wasn't using before

#

But i just tried setting it up

cosmic tinsel
#

You would think... but it does just the opposite. I have done direct comparisons of a baked-in model vs a network volume, and the network volume always has 30-60 seconds more delayTime. IMHO nobody should be using a network volume, unless you don't care about response time.

#

better off with a bigger image than using network volume.
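A minimal sketch of how such a comparison could be tallied, assuming each job response is a dict carrying a `delayTime` field in milliseconds (my assumption about the response shape, check it against what your endpoint actually returns). The function name is hypothetical and no real measurements are included.

```python
from statistics import mean, median


def summarize_delay(responses: list[dict]) -> dict:
    """Summarize delayTime across job responses, converted from ms to seconds."""
    # Skip responses that carry no delayTime (e.g. failed or cancelled jobs).
    delays = [r["delayTime"] / 1000 for r in responses if "delayTime" in r]
    if not delays:
        raise ValueError("no delayTime values found")
    return {
        "count": len(delays),
        "mean_s": mean(delays),
        "median_s": median(delays),
        "max_s": max(delays),
    }
```

Run it once over the responses collected from the baked-in-image endpoint and once over those from the network-volume endpoint, then compare the two summaries. Cold starts dominate the maximum, so a handful of requests after an idle period tells you more than many warm ones.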

plush summit
#

It takes 6 minutes without a network volume

#

for the first time

cosmic tinsel
#

Also, how many in/out tokens for those 6 min? Thibaud was saying it was taking him 7-8 sec for 2000 input tokens and 200 output tokens, on 2 GPUs (H100) per worker.

worn island
#

@cosmic tinsel would you mind sharing your test results with me? I was looking into this the other day with our engineers and they said that network volumes were only affected in EU-RO and EU-SE at the start of this week.

plush summit
cosmic tinsel
#

All my data comes from EU-RO as I have compute (non GPU) workloads as well.

plush summit
plush summit
#

I am confused

cosmic tinsel
worn island
# plush summit I am confused

yeah sorry for this. We had some users reporting that network volumes were slow. They were all from the two regions that I mentioned. And there was some maintenance happening on our end at the start of this week that would have explained the slow response times of the network volumes.

#

Right now I'm trying to figure out if this is still ongoing by talking with users that had problems with slow network volumes before

cosmic tinsel
#

I last tested it about 1 month ago; since then I no longer use network volumes. @still fog did some more recent testing with large diffusion models. She also came to the same conclusion, although I am not sure what region she was testing in.

worn island
#

@devout bobcat sorry for polluting your thread with the network volume stuff. I will open a new thread for this

#

@plush summit so when using the network volume, you don't have to download the model again, which reduces cold start time since the model already exists

#

@devout bobcat I can't seem to get the model you are using to work at all

plush summit
plush summit
worn island
#

@devout bobcat did you change anything else related to the env variables?

#

Because with the config you provided at the start, I can't get it to do anything

#

it will get my request and produce errors

devout bobcat
worn island
#

I thought that this is just controlling the maximum length of the context or what do you mean with "model have less memory"?

glacial fiber
devout bobcat
glacial fiber
#

More gpu, more optimization, more uses too maybe, idk, what do you think it can be?

devout bobcat
#

more gpu reduce cost per gpu ofc.
more optimization too.
i think we can get better setting at our level.

glacial fiber
#

What level?

covert mirage
covert mirage
# devout bobcat try 2 gpu / worker

The GPU instance is an H100 14 vCPU 80GB VRAM.

Runpod Serverless settings:


`MODEL = neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8

MAX_MODEL_LEN = 131072

GPU_MEMORY_UTILIZATION = 0.99

2024-08-14T21:49:06.860520596Z tokenizer_name_or_path: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8, tokenizer_revision: None, trust_remote_code: False

engine.py :113 2024-08-14 20:33:15,350 Error initializing vLLM engine: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (20736). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
`
I also tried:

KV_CACHE_DTYPE = fp8

devout bobcat
#

or use MAX_MODEL_LEN 20000
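Picking a context length that fits under the capacity reported in the error can be sketched like this. The helper name and the 5% headroom are hypothetical choices of mine, not part of vLLM or the RunPod worker. The regex only matches the error wording quoted in this thread.

```python
import re

# Matches the vLLM error text quoted in this thread and captures
# (requested max seq len, KV cache capacity in tokens).
ERROR_PATTERN = re.compile(
    r"max seq len \((\d+)\) is larger than the maximum number of tokens "
    r"that can be stored in KV cache \((\d+)\)"
)


def suggest_max_model_len(error_text: str, headroom: float = 0.95):
    """Return a context length below the reported KV capacity, or None if no match."""
    match = ERROR_PATTERN.search(error_text)
    if match is None:
        return None
    requested = int(match.group(1))
    capacity = int(match.group(2))
    # Leave a small margin below the reported capacity (assumed, tune as needed).
    return min(requested, int(capacity * headroom))
```

The reported capacity depends on the GPU, the model weights and GPU_MEMORY_UTILIZATION, so a suggested value only holds for the exact configuration that produced the error. Since one sequence at the full context length uses the whole cache, a much smaller value leaves room for concurrent requests.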

covert mirage