#Serverless VLLM concurrency issue


real sun
#

Hello everyone, I deployed a serverless vLLM endpoint (Gemma 12B model) through the RunPod UI, with 2 workers on A100 80GB VRAM.

If I send two requests at the same time, they both become IN PROGRESS, but I receive the output stream of one first. The second always waits for the first to finish, and only then do I start receiving its token stream. Why is it behaving like this?

grizzled rivet
#

Because it's processing them as a batch. Is it the same worker?

#

So both requests are in the same worker? If yes, then a possible explanation is that one gets processed first, then the second one, and your GPU isn't fast enough to process the inputs so that the delay between them is close to 0 ms.

#

Maybe your application is also configured to optimize for certain things, so it processes requests the way it does (one request, then another) across multiple stages of processing. I can't provide technical details right now, but that's my guess.

real sun
#

@grizzled rivet it is the same worker, is there any way i can make it respond to both of them at the same time?

real jay
#

you can't without changing code

#

because of the nature of LLMs

#

if you give them the same input, the output may vary in length, which affects generation time

grizzled rivet
#

Or set the batch size to 1, so it just spawns a new worker, which will eventually process the requests in parallel, but I think that would be a waste of resources.

By setting the environment variable:
BATCH_SIZE = 1

real sun
#

I am using an A100 with 80GB VRAM and it is supposed to be very fast!
Before, I deployed the same model on an A100 with 40GB VRAM on GCP with vLLM, and it had no problem handling concurrent requests.

#

DEFAULT_BATCH_SIZE or BATCH_SIZE ?

grizzled rivet
#

I was talking about BATCH_SIZE. Why?

grizzled rivet
#

Can you quantify how slow it is compared to when it was hosted on GCP?

#

Like the stream delay between the first and second request, on RunPod vs. GCP

real sun
#

my issue is not really the speed, the speed is decent when there is no cold start, my issue is handling more than one request at the same time

grizzled rivet
#

How long is the delay before the output stream starts for the second request vs. the first one?

#

Yes, I'm talking about the problem you described: the delay between the first and second request streams, isn't it?

real sun
#

yes
first request starts streaming, second request from another client always starts after the first one finishes
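One way to put a number on this symptom is to start both streams at the same instant and record when each one delivers its first chunk. The sketch below is only an illustration: `fake_stream` is a hypothetical stand-in for whatever streaming client is actually used, and the helper names are made up for this example.

```python
import asyncio
import time
from typing import AsyncIterator, Callable, List


async def fake_stream(n_tokens: int, delay_s: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for a real streaming client; swap in your own."""
    for i in range(n_tokens):
        await asyncio.sleep(delay_s)
        yield f"tok{i}"


async def time_to_first_token(
    make_stream: Callable[[], AsyncIterator[str]], t0: float
) -> float:
    """Seconds from the shared start time t0 until the first chunk arrives."""
    async for _ in make_stream():
        return time.perf_counter() - t0
    return float("inf")  # the stream ended without producing anything


async def measure(make_streams: List[Callable[[], AsyncIterator[str]]]) -> List[float]:
    """Start every stream at once and report each one's time to first token."""
    t0 = time.perf_counter()
    return await asyncio.gather(*(time_to_first_token(m, t0) for m in make_streams))


ttfts = asyncio.run(
    measure([lambda: fake_stream(5, 0.01), lambda: fake_stream(5, 0.01)])
)
print(ttfts)
```

If the two requests are really served concurrently, the two values should be close together; if the second value is roughly the first request's full generation time, the requests are being serialized somewhere.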

real jay
#

with two workers?

real sun
#

I'll do some benchmarks and provide you with the numbers

real sun
#

tried both

real jay
#

can you check vllm logs

#

it should say metrics like

#

current running req, waiting req

#

etc

#

and tok/s
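As a rough aid for reading those metrics, the sketch below pulls the running count, waiting count, and generation throughput out of a stats line. The sample line is an assumption modeled on vLLM's periodic stats output, not something copied from this endpoint, and the exact wording differs between vLLM versions (some print `Pending` instead of `Waiting`), so the pattern may need adjusting.

```python
import re
from typing import Dict, Optional

# Assumed sample line for illustration only; check your own worker logs
# for the exact format your vLLM version prints.
SAMPLE = (
    "Avg prompt throughput: 512.0 tokens/s, "
    "Avg generation throughput: 48.5 tokens/s, "
    "Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%"
)

PATTERN = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s.*?"
    r"Running: (?P<running>\d+) reqs.*?"
    r"(?:Waiting|Pending): (?P<waiting>\d+) reqs"
)


def parse_stats(line: str) -> Optional[Dict[str, float]]:
    """Return the parsed metrics, or None if the line is not a stats line."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "gen_tok_s": float(m["gen"]),
        "running": int(m["running"]),
        "waiting": int(m["waiting"]),
    }


print(parse_stats(SAMPLE))
```

If the running count never goes above 1 while two jobs are IN PROGRESS, the second request is probably not reaching the engine at the same time, which would point at something in front of vLLM rather than at vLLM's own scheduler.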

grizzled rivet
#

(replying to real sun: 2 and 3)

I don't know why your endpoint is set to run multiple workers for only 2 requests

real jay
#

vLLM intelligently does batching until its KV cache is full
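A toy model of that idea, under heavy simplifying assumptions: each request reserves its prompt plus its full output length up front, and a waiting request is admitted only when that reservation fits in the remaining cache budget. Real vLLM allocates cache blocks incrementally and can preempt requests, so this is a sketch of the scheduling idea, not of vLLM's implementation. The function name and the numbers are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


def simulate(requests: Dict[str, Tuple[int, int]], kv_budget: int) -> List[List[str]]:
    """requests maps name -> (prompt_tokens, output_tokens).

    Returns one entry per decode step, listing the requests that
    produced a token in that step.
    """
    waiting = deque(requests.items())
    remaining: Dict[str, int] = {}  # name -> tokens still to generate
    reserved: Dict[str, int] = {}   # name -> cache tokens reserved
    steps: List[List[str]] = []
    while waiting or remaining:
        # Admit waiting requests, in arrival order, while they fit in the cache.
        while waiting:
            name, (prompt, out) = waiting[0]
            if sum(reserved.values()) + prompt + out > kv_budget:
                break
            waiting.popleft()
            remaining[name] = out
            reserved[name] = prompt + out
        if not remaining:
            raise ValueError("a request does not fit in the cache budget at all")
        # One decode step: every running request emits one token.
        steps.append(sorted(remaining))
        for name in list(remaining):
            remaining[name] -= 1
            if remaining[name] <= 0:
                del remaining[name]
                del reserved[name]  # finishing frees its cache reservation
    return steps


print(simulate({"a": (4, 2), "b": (4, 3)}, kv_budget=100))
print(simulate({"a": (4, 2), "b": (4, 3)}, kv_budget=7))
```

With a large budget both requests appear in the same steps, so their tokens interleave. With a budget too small for both, `b` only starts after `a` finishes, which is the serialized pattern described at the top of the thread.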

grizzled rivet
#

No it's the default that I was talking about

#

In the endpoint image vllm worker

#

The default is 300 batched requests

grizzled rivet
#

Can you screenshot your whole edit endpoint details @real sun

real sun
#

right now it is configured to only have one worker

real jay
#

try setting the default batch size to 10

real sun
#

I am setting the default batch size to 1 because I noticed streaming used to send very big chunks of tokens

grizzled rivet
#

Oh.. It's because

real jay
#

lol

grizzled rivet
#

Wait the request is just one?

real sun
#

i tried it with 50 and 256

real jay
#

that setting means only 1 request should be processed concurrently

grizzled rivet
#

In your logs?

#

If you just remove the batch size setting, do they get processed in the same worker?

real sun
#

same behavior of not handling multiple requests with default batch size set to 50 and 256

grizzled rivet
#

And maybe there is an overhead of runpod's queue system

#

When you send a request it's going through runpod first then to the worker

#

That might introduce a bit of delay even if everything is set up right

real jay
#

no no

#

sorry for the misinformation

real sun
#

but both requests status appear as IN PROGRESS

real jay
#

it's the batch size for streaming tokens

#

This is the real one, but you didn't set it, so it should be fine

real sun
#

but let me try it one more time to confirm

real jay
#

because it is related to token streaming, not the actual requests
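To make the distinction concrete, here is a minimal sketch of what a token-streaming batch size does: it groups generated tokens into chunks before they are sent to the client, and has no effect on how many requests run at once. `batch_stream` is a hypothetical helper written for this illustration, and it uses a fixed chunk size; it is not the worker's actual code.

```python
from typing import Iterable, Iterator, List


def batch_stream(tokens: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a token stream into chunks of at most batch_size tokens."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for tok in tokens:
        batch.append(tok)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush whatever is left when generation ends
        yield batch


tokens = ["The", " quick", " brown", " fox", " jumps"]
print(list(batch_stream(tokens, 1)))  # one token per chunk
print(list(batch_stream(tokens, 2)))  # fewer, larger chunks
```

A batch size of 1 sends every token on its own, which matches the smoother streaming described earlier, while larger values trade latency per chunk for fewer HTTP responses.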

#

@real sun maybe can you try spamming requests? like 50+?

grizzled rivet
#

😅😅

real jay
#

set the max workers to 1 and then

#

spam requests

real sun
#
============ Serving Benchmark Result ============
Successful requests:                     857       
Benchmark duration (s):                  95.82     
Total input tokens:                      877568    
Total generated tokens:                  68965     
Request throughput (req/s):              8.94      
Output token throughput (tok/s):         719.70    
Total Token throughput (tok/s):          9877.74   
---------------Time to First Token----------------
Mean TTFT (ms):                          42451.61  
Median TTFT (ms):                        42317.61  
P99 TTFT (ms):                           77811.55  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          472.19    
Median TPOT (ms):                        190.87    
P99 TPOT (ms):                           3881.05   
---------------Inter-token Latency----------------
Mean ITL (ms):                           182.12    
Median ITL (ms):                         0.01      
P99 ITL (ms):                            4703.27   
==================================================

The configuration was max workers = 3,
and I was NOT setting the default batch size; it was left on the default, which I believe is 50.

#

Also, the script sent 1000 requests;
only 857 were successful.

#

Same model, same benchmark, but on a GCP A100 40GB VRAM machine:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  346.74    
Total input tokens:                      1024000   
Total generated tokens:                  70328     
Request throughput (req/s):              2.88      
Output token throughput (tok/s):         202.83    
Total Token throughput (tok/s):          3156.09   
---------------Time to First Token----------------
Mean TTFT (ms):                          172033.53 
Median TTFT (ms):                        178518.65 
P99 TTFT (ms):                           326714.81 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          357.45    
Median TPOT (ms):                        271.08    
P99 TPOT (ms):                           1728.97   
---------------Inter-token Latency----------------
Mean ITL (ms):                           263.52    
Median ITL (ms):                         151.98    
P99 ITL (ms):                            1228.35   
==================================================
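For a quick side-by-side reading, the figures from the two reports above can be compared with a few lines of arithmetic. The numbers are copied from the reports; the dictionary keys are just labels chosen for this sketch. Note that the comparison is not like-for-like: the RunPod run used max workers = 3 and completed only 857 of 1000 requests, while the GCP run was a single machine that completed all 1000.

```python
# Figures copied from the two benchmark reports above.
runpod = {"ok": 857, "sent": 1000, "input_tokens": 877568,
          "req_per_s": 8.94, "mean_ttft_ms": 42451.61}
gcp = {"ok": 1000, "sent": 1000, "input_tokens": 1024000,
       "req_per_s": 2.88, "mean_ttft_ms": 172033.53}

for name, run in (("runpod", runpod), ("gcp", gcp)):
    success_rate = run["ok"] / run["sent"]
    tokens_per_request = run["input_tokens"] / run["ok"]
    print(f"{name}: success {success_rate:.1%}, "
          f"{tokens_per_request:.0f} input tokens per request")

throughput_ratio = runpod["req_per_s"] / gcp["req_per_s"]
ttft_ratio = gcp["mean_ttft_ms"] / runpod["mean_ttft_ms"]
print(f"request throughput: {throughput_ratio:.2f}x higher on runpod")
print(f"mean TTFT: {ttft_ratio:.2f}x lower on runpod")
```

Both runs work out to 1024 input tokens per successful request, so the prompts were the same size. The request throughput ratio is close to the number of workers, which would be consistent with the load being spread across workers, though the reports alone do not prove that.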
austere pivot
#

When the vLLM engine initializes (on cold start), you should see a log line similar to this as part of vLLM's memory profiling: Maximum concurrency for 32768 tokens per request: 5.42x. Make sure the engine reports a concurrency > 2.
That being said, the official RunPod vLLM image unfortunately does not handle concurrency dynamically (it's hardcoded to 300 or another static value), which will bottleneck the jobs anyway. But it's definitely possible to stream multiple responses concurrently from a single serverless worker. Or at least it works in my implementation.
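A small sketch for checking that log line automatically. The pattern is written against the exact line quoted above; other vLLM versions may word or format it differently (for example with thousands separators), so treat the regex as an assumption. The cache-size estimate simply multiplies the two numbers, on the assumption that the factor is the cache capacity divided by the per-request token limit.

```python
import re
from typing import Optional, Tuple

LINE = "Maximum concurrency for 32768 tokens per request: 5.42x"

PATTERN = re.compile(
    r"Maximum concurrency for (?P<tokens>[\d,]+) tokens per request: "
    r"(?P<factor>[\d.]+)x"
)


def parse_max_concurrency(line: str) -> Optional[Tuple[int, float]]:
    """Return (tokens_per_request, concurrency_factor), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return int(m["tokens"].replace(",", "")), float(m["factor"])


tokens_per_request, factor = parse_max_concurrency(LINE)
approx_cache_tokens = int(tokens_per_request * factor)
print(tokens_per_request, factor, approx_cache_tokens)
print("enough for two full-length requests:", factor >= 2)
```

The factor is a worst case for requests that use the whole context window. Requests with roughly 1024 input tokens and short outputs, like the ones in the benchmarks above, take up far less of the cache each, so many more of them should fit at once.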

real jay
#

anyway, he has enough cache (the requests don't even use 5 percent of the cache)