#Real-Time Inference Processing

hexed halo

Hello, I want to build a real-time talking chat app using self-hosted LLM and TTS models. How can I make them process requests in real time? My LLM combined with TTS takes around 60 s per request on an L40s GPU. I'm currently running 4 L40s GPUs, with each GPU handling one request at a time, but I don't think that can handle a hundred users at once. What do you think is the best practice?
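
For a rough sense of scale, the concern above can be checked with back-of-the-envelope arithmetic. This is only a sketch under simplifying assumptions: each GPU serves exactly one request at a time, every request takes the full ~60 s, and there is no batching or streaming. The function name `worst_case_wait_s` is hypothetical; the figures (4 GPUs, 60 s, 100 users) come from the message.

```python
import math


def worst_case_wait_s(users: int, gpus: int, seconds_per_request: float) -> float:
    """Time until the last of `users` simultaneous requests finishes.

    Assumes one request per GPU at a time, a fixed per-request latency,
    and no batching or streaming (simplifying assumptions).
    """
    # Requests are served in "waves" of at most `gpus` requests each.
    waves = math.ceil(users / gpus)
    return waves * seconds_per_request


if __name__ == "__main__":
    gpus = 4
    seconds_per_request = 60

    # Sustained throughput: each GPU completes one request per 60 s.
    requests_per_minute = gpus * 60 / seconds_per_request
    print(f"throughput: {requests_per_minute:.0f} requests/minute")

    # 100 users arriving at once: 25 waves of 4 requests, 60 s each.
    wait = worst_case_wait_s(users=100, gpus=gpus, seconds_per_request=seconds_per_request)
    print(f"last user waits: {wait} s ({wait / 60:.0f} min)")  # 1500 s (25 min)
```

Under these assumptions the setup sustains about 4 requests per minute, and the last of 100 simultaneous users would wait roughly 25 minutes, which supports the doubt that one request per GPU can serve that load.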

brisk marten

yes