We're completely new to runpod; we're used to either running our own virtual machines or using pre-configured services on Azure.
We have an Orpheus TTS model which will have highly variable usage. Does anyone have experience of this kind of use on runpod and is Serverless suitable to meet low latency demands?
These are the questions I sent to runpod but they didn't bother to reply:
We provide AI telephony agents to a number of UK-based SMBs. We have trained a new natural TTS voice based on Orpheus LLM 3B and it's this model we're interested in deploying on runpod.
Obviously, for our customers the number of calls varies hugely throughout the day, so we need a solution that is real-time but will also scale quickly from zero to possibly hundreds.
Development tests in vLLM on a dedicated A100 show it supports 4.5 concurrent TTS requests while still streaming faster than real-time.
I think your Serverless solutions are the way forward, but I have questions about the architecture and pricing. I'm hoping you can answer these and recommend a cost-effective solution.
Can Workers handle more than one request at a time and are we charged per Worker or just GPU usage?
When we need to scale, what is the start-up time of a Worker and its corresponding GPU?
We currently have a time-to-first-token of 150ms (even with concurrent requests). Do Workers have instant real-time access to GPUs?
Are there plans to provide UK-based servers to help reduce latency further?
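For anyone trying to picture the scale involved, here is a rough sketch of the sizing arithmetic. It assumes each Worker gets one A100 and sustains the same 4.5 concurrent streams we measured in development, and that a Worker can only take whole streams, so the figure is rounded down to 4. The 200-call peak and the function name `workers_needed` are hypothetical, purely for illustration; we don't yet know how runpod Workers actually behave.

```python
import math


def workers_needed(concurrent_calls: int, streams_per_worker: float) -> int:
    """Estimate how many workers are needed to serve the given number of simultaneous calls."""
    if concurrent_calls <= 0:
        return 0
    # Assumption: a worker cannot serve a fraction of a stream, so round capacity down.
    usable_streams = math.floor(streams_per_worker)
    if usable_streams < 1:
        raise ValueError("each worker must sustain at least one stream")
    # Round up so that the last few calls still get a worker.
    return math.ceil(concurrent_calls / usable_streams)


# Hypothetical peak of 200 simultaneous calls, 4.5 streams per A100 from our vLLM tests.
print(workers_needed(200, 4.5))
```

If Workers can only handle one request at a time, the same peak would need one Worker per call, which is why the concurrency question matters so much for pricing.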