I am working on deploying a multimodal model, and a question came up during deployment.
My question is as follows.
Assume three requests arrive concurrently. Each request comes from a different user and needs the LLM to generate something. I have deployed the LLM locally. Do I need to create three LLM instances to handle the three requests separately? What is the usual solution? The only approach I can think of is a pool with a fixed number of LLM instances, where the system picks an instance whenever a request arrives. However, my GPU is an A40 with only 48 GB of memory, so it is hard to create a pool holding ten instances of a model like Janus-7B. Are there other ways to deal with this?
#Deployment problem
You could try batching (if supported), queueing, or both: push requests onto a queue, process them in batches, then split the outputs and respond to each request.
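To make the "queue, batch, split, respond" idea concrete, here is a rough sketch using `asyncio`. It assumes your model can process several prompts in one call; `fake_generate_batch`, `BatchingWorker`, `max_batch_size` and `max_wait_s` are made-up names for illustration, not part of Janus or any serving library. Only one model instance is loaded, no matter how many users are waiting.

```python
import asyncio

# Hypothetical stand-in for a real batched model call (one forward pass
# over several prompts). Swap in your own inference code here.
def fake_generate_batch(prompts):
    return [f"response to: {p}" for p in prompts]

class BatchingWorker:
    """Collects queued requests into small batches for a single model."""

    def __init__(self, generate_batch, max_batch_size=4, max_wait_s=0.02):
        self.generate_batch = generate_batch
        self.max_batch_size = max_batch_size  # assumed limit, tune for your GPU
        self.max_wait_s = max_wait_s          # how long to wait to fill a batch
        self.queue = asyncio.Queue()

    async def submit(self, prompt):
        # Each request gets its own future; the caller waits on it.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request is waiting.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Briefly keep collecting so concurrent requests share a batch.
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
            prompts = [prompt for prompt, _ in batch]
            try:
                # Run the blocking model call off the event loop thread.
                outputs = await asyncio.to_thread(self.generate_batch, prompts)
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            # Split the batched output and answer each request individually.
            for (_, future), output in zip(batch, outputs):
                if not future.done():
                    future.set_result(output)

async def demo():
    worker = BatchingWorker(fake_generate_batch)
    task = asyncio.create_task(worker.run())
    # Three users post at the same time.
    results = await asyncio.gather(
        *(worker.submit(f"user {i}") for i in range(3))
    )
    task.cancel()
    return results

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

If you'd rather not write this yourself, some inference servers do this kind of batching for you, but whether one supports your particular model is something you'd have to check.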
Has bro not thought of queues?
I mean, an LLM doesn't take much time to generate if you have an A40 GPU.
No... I don't have any engineering experience. Thanks for your reply, it's very useful.
That's a great idea. I know about queues but never considered one as a solution. Thanks for your reply, it's useful!
Programming is mostly like engineering.
But a queue is simple to implement
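For reference, the plain (non-batching) version can be about this small. It's only a sketch with placeholder names: `fake_generate` stands in for your real model call, and `SingleModelQueue` is not from any library. One worker thread owns the single model and serves requests in arrival order, so GPU memory use stays at one instance.

```python
import queue
import threading
from concurrent.futures import Future

# Hypothetical stand-in for a real single-prompt model call.
def fake_generate(prompt):
    return f"response to: {prompt}"

class SingleModelQueue:
    """One model, one worker thread, requests served first-in first-out."""

    def __init__(self, generate):
        self._generate = generate
        self._requests = queue.Queue()
        # Daemon thread so it does not keep the process alive on exit.
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def submit(self, prompt):
        # Returns immediately; the caller reads the result from the future.
        future = Future()
        self._requests.put((prompt, future))
        return future

    def _worker(self):
        while True:
            prompt, future = self._requests.get()
            try:
                future.set_result(self._generate(prompt))
            except Exception as exc:
                future.set_exception(exc)

def demo_simple_queue():
    server = SingleModelQueue(fake_generate)
    # Three users post at once; each gets a future to wait on.
    futures = [server.submit(f"user {i}") for i in range(3)]
    return [f.result(timeout=5) for f in futures]

if __name__ == "__main__":
    print(demo_simple_queue())
```

Later requests wait for earlier ones to finish, which is fine if each generation is quick.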
Np mate :)