Streaming responses Serverless Endpoint | Runpod

pine sequoia Jun 29, 2025, 9:31 PM

I'm currently using a serverless endpoint for inference, and the streaming response doesn't seem to behave the same as with a dedicated endpoint. I have the same setup for both dedicated and serverless, but I can see that the responses are not streamed in the same way or at the same speed as on the dedicated endpoint.

covert leaf Jun 29, 2025, 9:42 PM

Do you mean that it has something like a different format or just the speed behaviour?

pine sequoia Jun 29, 2025, 9:57 PM

The speed behaviour. Can I control it somehow? In the dedicated environment the streaming response is normal. Here it looks like it is working in batches.

covert leaf Jun 29, 2025, 10:02 PM

What template/framework are you using? When you say "like working with batches", does it mean it has the same overall compute speed but it's sending the stream data less often? Can you describe the differences more specifically?

pine sequoia Jun 29, 2025, 10:04 PM

I am using the vLLM template. The responses come back in batches instead of being streamed, I mean instead of each character arriving as soon as it is ready.

covert leaf Jun 29, 2025, 10:13 PM

In the official vLLM template, there are environment variables DEFAULT_BATCH_SIZE, MIN_BATCH_SIZE, and BATCH_SIZE_GROWTH_FACTOR. These control how often we send the streaming data. The whole logic is here:

GitHub

worker-vllm/src/utils.py at 4f22d6f10748382973908f20c83b28573848285...

The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm

The comments in that file could also help you better understand what each value does.
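
As a rough illustration of how those three settings could interact, here is a minimal sketch. It is not the code from utils.py. The function name `chunk_sizes` is made up for this example, the growth factor of 3 is an assumed value, and the rule "start at the minimum, multiply after every chunk, cap at the default" is a simplified reading of the variable names:

```python
def chunk_sizes(total_tokens, default_batch_size=50, min_batch_size=1, growth_factor=3):
    """Return how many tokens each streamed chunk would carry.

    Hypothetical helper: the first chunk holds min_batch_size tokens, each
    later chunk is growth_factor times larger, capped at default_batch_size.
    """
    sizes = []
    current = min(min_batch_size, default_batch_size)
    remaining = total_tokens
    while remaining > 0:
        n = min(current, remaining)      # the last chunk may be smaller
        sizes.append(n)
        remaining -= n
        # Grow the batch for the next chunk, never beyond the cap.
        current = min(current * growth_factor, default_batch_size)
    return sizes


# With a cap of 50, a 200-token answer arrives in a few large chunks.
print(chunk_sizes(200))                        # [1, 3, 9, 27, 50, 50, 50, 10]
# With a cap of 1, every token is sent on its own.
print(chunk_sizes(5, default_batch_size=1))    # [1, 1, 1, 1, 1]
```

Under these assumptions the stream starts fast and then settles into 50-token chunks, which would look like "working in batches" on the client side.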

pine sequoia Jun 29, 2025, 10:18 PM

So I assume setting DEFAULT_BATCH_SIZE lower will help?

covert leaf Jun 29, 2025, 10:20 PM

Yes. The default value of DEFAULT_BATCH_SIZE is 50. MIN_BATCH_SIZE should already default to 1.
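
To make the values concrete, here is a small sketch of reading these settings from the environment. The helper `load_batch_settings` is hypothetical and is not how the template itself parses them. The defaults of 50 and 1 come from this thread, while the growth factor default of 3 is an assumption:

```python
import os

# Defaults for the first two come from this thread; the growth factor is assumed.
ASSUMED_DEFAULTS = {
    "DEFAULT_BATCH_SIZE": 50,
    "MIN_BATCH_SIZE": 1,
    "BATCH_SIZE_GROWTH_FACTOR": 3,  # assumption, not confirmed in this thread
}


def load_batch_settings(env=None):
    """Read the batching variables from env (os.environ when env is None)."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in ASSUMED_DEFAULTS.items()}


# Setting only DEFAULT_BATCH_SIZE=1 on the endpoint would cap every chunk at one token.
per_token = load_batch_settings({"DEFAULT_BATCH_SIZE": "1"})
print(per_token)
```

Smaller chunks mean more frequent messages, so a value between 1 and 50 may be a reasonable middle ground if per-token streaming turns out to be too chatty.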

pine sequoia Jun 29, 2025, 10:22 PM

ok thanks
