#Ray Serve integration
The core changes in this PR (besides the refactor itself) are:
- Loading models in a shared separate process so models can be completely freed from memory.
- Microbatching. This means multiple requests received in a short period of time are combined together, run through the model, and split back out into individual responses.
Notably for the latter, there has to be a 1:1 mapping between the list of requests and the list of responses for it to work. This is part of why the facial recognition response was changed from a bare list of faces to an ImageFaces object with a faces field.
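To make the microbatching idea and the 1:1 constraint concrete, here is a minimal standalone sketch using only asyncio. It is not the PR's code: Ray Serve provides its own batching through the `@serve.batch` decorator, and the names here (`MicroBatcher`, `detect_faces_batch`, the default sizes and timeouts) are hypothetical. The stand-in model takes a face count per "image" and returns one `{"faces": [...]}` object per input, so an image with no faces still produces exactly one response.

```python
import asyncio
from typing import Any, Callable, List, Optional


class MicroBatcher:
    """Collects requests that arrive close together and runs them as one batch."""

    def __init__(self, handler: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 8, wait_s: float = 0.01) -> None:
        self.handler = handler
        self.max_batch_size = max_batch_size
        self.wait_s = wait_s
        self._queue: Optional[asyncio.Queue] = None
        self._worker: Optional[asyncio.Task] = None

    async def submit(self, item: Any) -> Any:
        # Create the queue and worker lazily so they bind to the running loop.
        if self._queue is None:
            self._queue = asyncio.Queue()
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def _run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block for the first request, then gather more until the batch
            # is full or the wait window has elapsed.
            batch = [await self._queue.get()]
            deadline = loop.time() + self.wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            inputs = [item for item, _ in batch]
            try:
                outputs = self.handler(inputs)
                # Without one output per input, responses cannot be split back out.
                if len(outputs) != len(inputs):
                    raise ValueError("handler must return one output per input")
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), output in zip(batch, outputs):
                if not future.done():
                    future.set_result(output)


def detect_faces_batch(face_counts: List[int]) -> List[dict]:
    # Hypothetical stand-in for a model: one response object per image,
    # even when an image contains no faces.
    return [{"faces": [f"face{i}" for i in range(n)]} for n in face_counts]


async def demo() -> List[dict]:
    batcher = MicroBatcher(detect_faces_batch, max_batch_size=4, wait_s=0.05)
    return await asyncio.gather(*(batcher.submit(n) for n in [2, 0, 1]))
```

If the model returned a flat list of faces instead, the three faces above could not be attributed to their requests, which is the reason for wrapping them in a per-image object.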
This looks really interesting. Have you done any benchmarking to measure how performance and throughput change with this compared to what we have now?
I compared it with the numbers in the explicit batch PR.
It's not an exact comparison because the way they do requests is really different. The concurrency level basically determines the batch size in this PR, so I experimented with 8 and 16 (numbers are for 16)
But a quirk I noticed is that if you set a max batch size for the microbatching and set the concurrency equal, you're not actually going to get the max batch size each time. Not sure if that comes down to basic latency or if the requests are sent in a certain way.
The numbers from this test were set up like that, but from testing after that it seems setting the concurrency higher than the max batch size gives better throughput
oh also, a really cool thing with ray is that you can add a reconfigure method to any deployment so you can change settings live with a request. e.g. a json with min_score could just change a model's score threshold without doing anything else. https://docs.ray.io/en/latest/serve/production-guide/config.html#dynamically-adjusting-parameters-in-deployment
it could potentially be used for an ml dashboard in the future.
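A rough sketch of the shape this takes: in Ray Serve, a deployment class can define `reconfigure`, which is called with the deployment's `user_config` when that config changes. The example below leaves Ray out so it runs on its own. The class name, the default threshold, and the detection format are all hypothetical, not taken from the PR.

```python
from typing import Any, Dict, List


class FaceDetector:
    """Plain-Python stand-in for a deployment with a live-tunable threshold."""

    def __init__(self, min_score: float = 0.7) -> None:
        self.min_score = min_score

    def reconfigure(self, config: Dict[str, Any]) -> None:
        # With Ray Serve, this method would receive the deployment's user_config.
        # Here it is called directly to show the effect.
        if "min_score" in config:
            value = float(config["min_score"])
            if not 0.0 <= value <= 1.0:
                raise ValueError("min_score must be between 0 and 1")
            self.min_score = value

    def filter(self, detections: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Keep only detections at or above the current threshold.
        return [d for d in detections if d["score"] >= self.min_score]


detector = FaceDetector()
detections = [{"id": "a", "score": 0.9}, {"id": "b", "score": 0.5}]
kept_before = detector.filter(detections)   # only "a" passes at 0.7
detector.reconfigure({"min_score": 0.4})    # no model reload needed
kept_after = detector.filter(detections)    # both pass at 0.4
```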
Never thought of using ray for immich, really cool 🔥
I know there is a PyTorch equivalent (torchserve or something like that), are you familiar with that one?
I looked into it a bit but it's more opinionated and doesn't give as much flexibility for handling processes or loading and unloading models.
There's also Triton Inference Server. I've used it before and it's really fast and feature-rich, but it also expects you to know exactly which models you're using, their inputs and outputs and have them at the ready before it even starts up. The image for it is also massive, even the CPU-only version.
So those are good options if we can decide on a small number of models to support, but for being able to add, change or remove models easily the flexibility of Ray is more valuable.
This makes sense, thanks for your answer. Some time ago I tried hacking together a solution for loading/unloading models on the fly but it wasn't conclusive in the end, so I'm really glad to see that this new approach solves the problem + gives enough flexibility to implement other stuff. I'll try to find time to look into your PR and Ray Serve (only used Ray Tune till now) and see if I can be of any help 🙂
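For anyone following along, the reason a separate process helps with unloading can be shown with a small standard-library sketch: when the process that holds the model exits, the operating system reclaims all of its memory, which is hard to guarantee from inside a long-lived Python process. This is only an illustration with `subprocess`, not how Ray or the PR does it. The worker script, the `ModelProcess` class, and the fake "model" are hypothetical.

```python
import json
import subprocess
import sys

# Source for the child process. The dict stands in for a real loaded model.
WORKER_SOURCE = r"""
import json, sys
model = {"scale": 2}
for line in sys.stdin:
    request = json.loads(line)
    print(json.dumps({"output": request["input"] * model["scale"]}), flush=True)
"""


class ModelProcess:
    """Holds a model in a child process so unloading frees all of its memory."""

    def __init__(self) -> None:
        self._proc = None

    @property
    def loaded(self) -> bool:
        return self._proc is not None

    def load(self) -> None:
        if self._proc is None:
            self._proc = subprocess.Popen(
                [sys.executable, "-c", WORKER_SOURCE],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                text=True,
            )

    def predict(self, value: int) -> int:
        self.load()  # load on demand
        self._proc.stdin.write(json.dumps({"input": value}) + "\n")
        self._proc.stdin.flush()
        return json.loads(self._proc.stdout.readline())["output"]

    def unload(self) -> None:
        # Closing stdin ends the worker's read loop; once the process exits,
        # the OS reclaims everything it had allocated.
        if self._proc is not None:
            self._proc.stdin.close()
            self._proc.wait(timeout=10)
            self._proc.stdout.close()
            self._proc = None
```

Calling `predict` after `unload` simply starts a fresh process, so loading and unloading on the fly reduces to starting and stopping the worker.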