#Streaming responses Serverless Endpoint
Do you mean that it has something like a different format or just the speed behaviour?
The speed behaviour. Can I control it somehow? In a dedicated environment the streaming response is normal. Here it is like working with batches.
What template/framework are you using? When you say "like working with batches", does it mean it has the same overall compute speed but it's sending the stream data less often? Can you describe the differences more specifically?
I am using the vLLM template. The responses come back in batches rather than streaming, I mean each character as soon as it is ready.
In the official vLLM template, there are environment variables DEFAULT_BATCH_SIZE, MIN_BATCH_SIZE and BATCH_SIZE_GROWTH_FACTOR. These control how often we send the streaming data. The whole logic of it is here
These comments could also help you better understand which value does what
So I assume setting DEFAULT_BATCH_SIZE lower will help?
Yes. The default value of DEFAULT_BATCH_SIZE is 50. MIN_BATCH_SIZE should already default to 1.
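To make the effect of these three variables easier to picture, here is a minimal Python sketch. It is not the template's actual code: the function `batch_sizes` is hypothetical, and it assumes the batch starts at MIN_BATCH_SIZE, is multiplied by BATCH_SIZE_GROWTH_FACTOR after each send, and is capped at DEFAULT_BATCH_SIZE. The growth factor of 3 used below is also an assumption, so check the template's own logic and comments for the real behaviour.

```python
def batch_sizes(num_tokens, min_size, max_size, growth_factor):
    """Yield how many tokens each streamed chunk would carry.

    Hypothetical model of the batching described in the thread:
    start small, grow after every send, never exceed max_size.
    """
    current = min_size
    remaining = num_tokens
    while remaining > 0:
        # Send at least one token so the loop always makes progress.
        size = max(1, min(int(current), remaining))
        yield size
        remaining -= size
        # Grow the next batch, capped at the configured maximum.
        current = min(current * growth_factor, max_size)


# Values from the thread: DEFAULT_BATCH_SIZE=50, MIN_BATCH_SIZE=1.
# The growth factor of 3 is an assumed value for illustration.
print(list(batch_sizes(120, 1, 50, 3)))  # [1, 3, 9, 27, 50, 30]

# With a lower cap of 5, chunks stay small and arrive more often.
print(list(batch_sizes(20, 1, 5, 3)))    # [1, 3, 5, 5, 5, 1]
```

Under these assumptions, a lower DEFAULT_BATCH_SIZE means more, smaller chunks, which feels closer to token-by-token streaming at the cost of more HTTP calls.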
ok thanks