#Streaming responses Serverless Endpoint

pine sequoia
#

I'm currently using a serverless endpoint for inference, and the streaming response doesn't seem to behave the same as on a dedicated endpoint. I have the same setup on both dedicated and serverless, but the responses don't stream in the same way or at the same speed as on the dedicated endpoint.

covert leaf
#

Do you mean the format is different, or just the speed behaviour?

pine sequoia
#

The speed behaviour. Can I control it somehow? In the dedicated environment the streaming response is normal; here it looks like it's working in batches.

covert leaf
#

What template/framework are you using? When you say "like working with batches", does it mean it has the same overall compute speed but it's sending the stream data less often? Can you describe the differences more specifically?

pine sequoia
#

I'm using the vLLM template. The responses come back in batches instead of streaming, by which I mean each character as soon as it's ready.

covert leaf
#

In the official vLLM template, there are environment variables DEFAULT_BATCH_SIZE, MIN_BATCH_SIZE and BATCH_SIZE_GROWTH_FACTOR. These control how often we send the streaming data. The whole logic of it is here

GitHub

The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm

#

These comments could also help you better understand what each value does

pine sequoia
#

So I assume setting DEFAULT_BATCH_SIZE lower will help?

covert leaf
#

Yes. The default value is 50. MIN_BATCH_SIZE should already default to 1.
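
To get a feel for why the stream looks "batched", here is a rough sketch of how a growing batch size could group tokens into chunks. It is an illustration only, not the template's actual code: the function names are made up, and the growth factor of 3 is an assumption (only the defaults of 50 and 1 are mentioned above).

```python
from typing import Iterator, List


def batch_sizes(min_size: int = 1, max_size: int = 50,
                growth_factor: float = 3.0) -> Iterator[int]:
    """Yield how many tokens go into each streamed chunk.

    Starts at min_size and multiplies by growth_factor after every
    chunk until max_size is reached. growth_factor=3.0 is an assumed
    value for illustration.
    """
    size = float(min_size)
    while True:
        yield min(int(size), max_size)
        if size < max_size:
            size = min(size * growth_factor, max_size)


def chunk_tokens(tokens: List[int], min_size: int = 1, max_size: int = 50,
                 growth_factor: float = 3.0) -> List[List[int]]:
    """Group tokens into the chunks a client would receive (hypothetical helper)."""
    sizes = batch_sizes(min_size, max_size, growth_factor)
    chunks = []
    i = 0
    while i < len(tokens):
        n = next(sizes)
        chunks.append(tokens[i:i + n])
        i += n
    return chunks


if __name__ == "__main__":
    tokens = list(range(200))
    # Maximum batch size of 50: a handful of large chunks.
    print([len(c) for c in chunk_tokens(tokens)])
    # Maximum batch size of 1: one chunk per token.
    print(len(chunk_tokens(tokens, max_size=1)))
```

Under these assumptions, a lower maximum batch size means more, smaller chunks, which is closer to token-by-token streaming at the cost of more HTTP calls.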

pine sequoia
#

ok thanks