Suitability for live TTS generation - general questions | Runpod | Page 1

candid basin Nov 26, 2025, 10:52 AM

We're completely new to Runpod; we're used to either running our own virtual machines or using pre-configured services on Azure.
We have an Orpheus TTS model which will have highly variable usage. Does anyone have experience of this kind of use on Runpod, and is Serverless suitable for meeting low-latency demands?

These are the questions I sent to runpod but they didn't bother to reply:
We provide AI telephony agents to a number of UK based SMBs. We have trained a new natural TTS voice based on Orpheus LLM 3B and it's this model we're interested in deploying on runpod.
Obviously for our customers the number of calls varies hugely throughout the day, so we need a solution that is real-time but will also scale quickly from zero to possibly hundreds.
Development tests in vLLM on a dedicated A100 show it supports 4.5 concurrent TTS requests while still streaming faster than real-time.
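As a rough illustration of what these figures imply for scaling, here is a minimal sketch that estimates how many A100-class workers a given peak would need. The peak of 200 simultaneous calls is a hypothetical stand-in for "possibly hundreds", and rounding 4.5 streams down to 4 whole streams per GPU is an assumption made here for headroom, not something measured.

```python
import math


def workers_needed(peak_concurrent_calls: int, streams_per_gpu: float) -> int:
    """Estimate the number of single-GPU workers needed at peak load.

    streams_per_gpu is the measured number of concurrent real-time TTS
    streams one GPU sustains (4.5 in the post above). It is floored because
    a worker cannot serve a fraction of a call (assumption for headroom).
    """
    if peak_concurrent_calls <= 0:
        return 0  # scale to zero when there are no calls
    usable_streams = math.floor(streams_per_gpu)
    if usable_streams < 1:
        raise ValueError("a worker must sustain at least one real-time stream")
    return math.ceil(peak_concurrent_calls / usable_streams)


# Hypothetical peak of 200 simultaneous calls, 4.5 streams per A100.
print(workers_needed(200, 4.5))
```

Whether that many workers can actually be started at once depends on the platform's worker limits and GPU availability, which this sketch does not model.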

I think your Serverless solutions are the way forward but I have questions about the architecture and pricing. I'm hoping you can answer these and recommend a cost effective solution.

Can Workers handle more than one request at a time and are we charged per Worker or just GPU usage?

When we need to scale, what is the start-up time of a Worker? And its corresponding GPU?

We currently have a time-to-first-token of 150ms (even with concurrent requests). Do Workers have instant real-time access to GPUs?

Are there plans to provide UK based servers to help reduce latency further?

twin abyss Nov 30, 2025, 1:27 AM

candid basin We're completely new to runpod but we are used to either having our own virtual ...

yes, serverless is great for this.

yes workers have real time access to gpus

startup time varies depending on GPU type and model size. it's basically the time to start the container (which only takes a bit), load the model, and execute your start command before the handler.
you're charged per worker for the seconds it is running, rounded to the nearest minute i think
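To make the startup breakdown above concrete, here is a minimal sketch of how a cold start adds to the latency a caller sees before the first audio arrives. All timings except the 150 ms time-to-first-token from the original question are hypothetical placeholders; real values would need to be measured on the actual endpoint.

```python
from dataclasses import dataclass


@dataclass
class ColdStart:
    """Components of worker startup, in seconds (all values are assumptions)."""
    container_start_s: float
    model_load_s: float
    init_command_s: float

    def total_s(self) -> float:
        # Startup is modelled as the plain sum of the three phases.
        return self.container_start_s + self.model_load_s + self.init_command_s


def first_audio_latency_s(cold: ColdStart, ttft_s: float, worker_is_warm: bool) -> float:
    """Latency until the first token: TTFT alone on a warm worker,
    cold start plus TTFT when a new worker has to be started."""
    return ttft_s if worker_is_warm else cold.total_s() + ttft_s


# Hypothetical timings: 2 s container start, 20 s model load, 3 s init command.
cold = ColdStart(container_start_s=2.0, model_load_s=20.0, init_command_s=3.0)
print(first_audio_latency_s(cold, ttft_s=0.150, worker_is_warm=True))
print(first_audio_latency_s(cold, ttft_s=0.150, worker_is_warm=False))
```

The gap between the two cases is why a live telephony workload would likely want some workers kept warm rather than relying on scaling from zero for every burst.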

for plans on UK-based servers, maybe contact support for an official answer