#Deployment problem

6 messages · Page 1 of 1 (latest)

hybrid breach
#

I am working on deploying a multimodal model, and a question came up during deployment.
My question is as follows.
Assume that three requests arrive concurrently. Each request comes from a different user and needs the LLM to generate something. I have deployed the LLM locally. Do I need to create three LLM instances to handle the three requests separately? What is the general solution? The only approach I can imagine is a pool with a fixed number of LLM instances, where the system picks an instance when a request arrives. However, my GPU is an A40 with only 48 GB of memory, so it is difficult to create a pool holding ten instances of a model like Janus-7B. Are there other ways to deal with this?

latent garden
#

You could try batching (if supported), queueing, or both: push requests onto a queue, process them in batches, then split the outputs and respond to each request.
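A rough sketch of the queue-plus-batching idea, in case it helps. It assumes a single loaded model that can process a list of prompts in one call. `fake_generate_batch`, `BatchingWorker`, `max_batch_size`, and `max_wait_s` are all made-up names for illustration, and the fake function stands in for whatever batched generate call your model actually exposes.

```python
import asyncio
from typing import Callable, List


def fake_generate_batch(prompts: List[str]) -> List[str]:
    # Hypothetical stand-in for one batched forward pass of the single model.
    return [f"reply to: {p}" for p in prompts]


class BatchingWorker:
    """One model instance, one queue; requests are grouped into small batches."""

    def __init__(self, generate_batch: Callable[[List[str]], List[str]],
                 max_batch_size: int = 8, max_wait_s: float = 0.02):
        self.generate_batch = generate_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded only for inspection

    async def submit(self, prompt: str) -> str:
        # Each request enqueues its prompt with a future and waits on it.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request is waiting.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Collect more requests until the batch is full or the wait expires.
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            prompts = [p for p, _ in batch]
            # Run the blocking model call off the event loop thread.
            outputs = await asyncio.to_thread(self.generate_batch, prompts)
            self.batch_sizes.append(len(batch))
            # Split the batched outputs and hand each one back to its request.
            for (_, fut), out in zip(batch, outputs):
                if not fut.done():
                    fut.set_result(out)


async def demo():
    worker = BatchingWorker(fake_generate_batch)
    task = asyncio.create_task(worker.run())
    # Three users post at the same time.
    results = await asyncio.gather(*(worker.submit(f"user {i}") for i in range(3)))
    task.cancel()
    return results, worker.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With this shape, GPU memory holds one copy of the weights no matter how many users are connected. The trade-off is a little added latency (up to `max_wait_s`) while the worker waits to fill a batch.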
