openvino automatically restart container on Non-zero status code? | Immich | Page 1

last crater Oct 13, 2024, 2:51 PM

#

Is there a way to automatically restart ML containers when they error out? I tried adding a try-catch to the main.py but was unsuccessful. The container responds with a 502 to new ML tasks, however the ping endpoint continues to pong.

It seems it slowly eats more and more ram when loading the gpu

Currently the containers hang until manually stopped and started. I have an external script watching it but it would be nice if there was a way I could incorporate it into the container.

I don't need a working solution, but providing a direction to know what would need to be modified (if at all possible) would be appreciated

First screenshot was taken at the beginning of typing, the 2nd was taken at the end

trim edgeBOT Oct 13, 2024, 2:51 PM

#

:wave: Hey @last crater,

Thanks for reaching out to us. Please follow the recommended actions below; this will help us be more effective in our support effort and leave more time for building Immich.

References

Container Logs: docker compose logs (docs)
Container Status: docker compose ps (docs)
Reverse Proxy: https://immich.app/docs/administration/reverse-proxy

Checklist

:ballot_box_with_check: I have verified I'm on the latest release(note that mobile app releases may take some time).
:ballot_box_with_check: I have read applicable release notes.
:ballot_box_with_check: I have reviewed the FAQs for known issues.
:ballot_box_with_check: I have reviewed Github for known issues.
:ballot_box_with_check: I have tried accessing Immich via local ip (without a custom reverse proxy).
:ballot_box_with_check: I have uploaded the relevant logs, docker compose, and .env files, making sure to use code formatting.
:blue_square: I have tried an incognito window, disabled extensions, cleared mobile app cache, logged out and back in, different browsers, etc. as applicable

(an item can be marked as "complete" by reacting with the appropriate number)

If this ticket can be closed you can use the /close command, and re-open it later if needed.

last crater Oct 13, 2024, 2:56 PM

#

storm radish Oct 13, 2024, 3:04 PM

#

In the try/catch, check if it’s a relevant error by looking at its .message property, and if so do os.kill(os.getpid(), signal.SIGINT)
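
A minimal sketch of the suggestion above, not Immich's actual code. Python 3 exceptions generally have no .message attribute, so the sketch inspects str(exc) instead. The marker strings and the names is_fatal and run_with_self_restart are hypothetical; the thread does not say which error text the failing container produces.

```python
import os
import signal

# Hypothetical substrings that mark an error as fatal; replace them with
# whatever the failing container actually logs.
FATAL_MARKERS = ("out of memory", "device lost")


def is_fatal(exc: BaseException, markers: tuple[str, ...] = FATAL_MARKERS) -> bool:
    """Return True if the exception text contains one of the marker substrings."""
    text = str(exc).lower()  # Python 3 exceptions have no .message attribute
    return any(marker in text for marker in markers)


def run_with_self_restart(task, *args, **kwargs):
    """Run task; on a fatal error, signal this process to stop, then re-raise."""
    try:
        return task(*args, **kwargs)
    except Exception as exc:
        if is_fatal(exc):
            # SIGINT asks the server process to shut down. Whether the
            # container is started again is up to the Docker restart policy.
            os.kill(os.getpid(), signal.SIGINT)
        raise
```

Whether the container comes back after the process exits depends on its Docker restart policy: on-failure only restarts after a non-zero exit code, while always and unless-stopped restart regardless of the exit code.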

#

How many requests does it generally take to get to this point?

last crater Oct 13, 2024, 3:09 PM

#

I'll turn off the load balancer and point all requests to this machine for testing. Think it's generally between 500-3000

#

(I am running facial detection only, with the improved model from message #1272383382487040020.)

storm radish Oct 13, 2024, 3:15 PM

#

What happens if you change this line https://github.com/immich-app/immich/blob/e183ff6feb2df3b7653f9ea25f056fb86ea1d256/machine-learning/app/models/facial_recognition/recognition.py#L25 to:

self.batch = False
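
To illustrate what a flag like this typically controls, here is a rough sketch, not the code in recognition.py. It assumes that batching means all detected faces go through the model in one call, while the unbatched path makes one call per face. The names embed_faces and run_model are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_faces(
    faces: Sequence[Vector],
    run_model: Callable[[List[Vector]], List[Vector]],
    batch: bool,
) -> List[Vector]:
    """Run a model over face crops, either in one call or one call per face."""
    if not faces:
        return []
    if batch:
        # One call whose input size varies with the number of faces found.
        return run_model(list(faces))
    # One call per face: every call sees an input of the same size.
    return [run_model([face])[0] for face in faces]
```

Both paths return the same embeddings; only the number and size of the model calls differ.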

last crater Oct 13, 2024, 3:17 PM

#

I'll give it a shot. Still waiting for the container to error out again, 370 requests and only at 2.2gb gpu memory used

#

memory still hasn't climbed significantly (1000 assets processed). I'll stop the container and change that line

#

I've never seen my GPU memory stay that low. It seems to never spike above 1gb now

storm radish Oct 13, 2024, 3:32 PM

#

Do you notice a difference in speed?

last crater Oct 13, 2024, 3:33 PM

#

let me stop it, change it back, and restart. I'll take another screenshot after a couple hundred requests are processed to compare

#

but I don't think so

#

this machine is running the vm from a nvme so it might not be the best to compare speeds with

#

storm radish Oct 13, 2024, 3:41 PM

#

It’d be perfect if you could run Locust on it for precise metrics:

1. Install Poetry: https://python-poetry.org/docs/#installation
2. git clone https://github.com/immich-app/immich.git
3. cd machine-learning
4. poetry install
5. In locustfile.py, comment out the CLIP endpoint calls so only face detection and recognition are tested
6. Run locust --web-host 127.0.0.1
7. Open the link
8. Customize the threshold (the default only returns 1 face; decreasing it will show more)
9. Set the number of users to 1 (or more if you want to see how it performs at higher concurrency)

last crater Oct 13, 2024, 3:42 PM

#

I'll give it a shot. All from the immich_machine_learning compose file directory?

storm radish Oct 13, 2024, 3:43 PM

#

You can clone the repo wherever you like

#

As long as it’s on the same server

#

Oh, and I forgot to mention that you need to expose port 3003 for machine learning

last crater Oct 13, 2024, 3:45 PM

#

yep, already exposed. No firewall active

#

between steps 2 and 3, am I supposed to make a directory or cd to the container's root

storm radish Oct 13, 2024, 3:49 PM

#

oops, cd to immich/machine-learning

#

I think 4 tests should be enough:

face min score 0.034 (default), self.batch patch
face min score 0.02, self.batch patch
face min score 0.034, original code
face min score 0.02, original code

Each test can run for 120s or so, then you can export the results as html for each before moving to the next
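
The four-run matrix above could be scripted roughly as follows. This is a sketch under assumptions: the thread uses the Locust web UI, whereas this builds headless command lines; the host URL is a placeholder; and --face-min-score is the custom option from locustfile.py, which would need to accept a float (the parser pasted later in the thread declares type=int, which would likely reject these values on the command line).

```python
from itertools import product

ML_HOST = "http://127.0.0.1:3003"  # assumption: ML service reachable on port 3003
VARIANTS = ("patched", "original")  # with and without the self.batch change
MIN_SCORES = (0.034, 0.02)


def build_runs(host: str = ML_HOST, run_time: str = "120s") -> list[list[str]]:
    """Return one locust command line per (variant, min score) combination."""
    runs = []
    for variant, score in product(VARIANTS, MIN_SCORES):
        report = f"report_{variant}_{score}.html"
        runs.append([
            "locust", "--headless", "--users", "1", "--spawn-rate", "1",
            "--run-time", run_time, "--host", host,
            "--face-min-score", str(score), "--html", report,
        ])
    return runs


# Print the commands instead of running them, so each one can be reviewed first.
for cmd in build_runs():
    print(" ".join(cmd))
```

The variant only appears in the report name: switching between patched and original code still means editing the file and restarting the container between runs.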

last crater Oct 13, 2024, 3:59 PM

#

is this the only comment necessary?

def _(parser: ArgumentParser) -> None:
    #parser.add_argument("--clip-model", type=str, default="ViT-B-32::openai")
    parser.add_argument("--face-model", type=str, default="buffalo_l")
    parser.add_argument(
        "--tag-min-score",
        type=int,
        default=0.0,
        help="Returns all tags at or above this score. The default returns all tags.",
    )
    parser.add_argument(
        "--face-min-score",
        type=int,
        default=0.034,
        help=(
            "Returns all faces at or above this score. The default returns 1 face per request; "
            "setting this to 0 blows up the number of faces to the thousands."
        ),
    )
    parser.add_argument("--image-size", type=int, default=1000)

storm radish Oct 13, 2024, 4:00 PM

#

No, just remove CLIPTextFormDataLoadTest and CLIPVisionFormDataLoadTest

last crater Oct 13, 2024, 4:06 PM

#

I must've done something wrong. I'm erroring out. Give me a few minutes to go through it all again (will test without editing files first)

last crater Oct 13, 2024, 4:57 PM

#

📎 report_original_0.034.html 📎 report_patched_0.02.html 📎 report_patched_0.034.html 📎 report_original_0.02.html

last crater Oct 13, 2024, 5:36 PM

#

I have no idea what those numbers mean. When you have a chance to look at them, I'd be interested in your thoughts on whether or not these numbers are acceptable.

Since I posted this I increased concurrency on my J4125 to 3, and now its GPU performance matches Oracle free tier's CPU for facial detection.

No more restarts!!! I had the concurrency set to 1 before and it would generally fault every couple hundred requests (I've pushed 750 through since). The container's RAM is now also hovering just below 2 GB, instead of steadily climbing.

storm radish Oct 13, 2024, 5:40 PM

#

Sweet! I’m on my phone but will take a look at the files later. I think if we defaulted openvino to not batch and added a max batch size env var (overriding the default of 1 for openvino), it should cover all scenarios
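
One way the idea above might look, as a sketch only and not the eventual PR. The environment variable name EXAMPLE_MAX_BATCH_SIZE and the helpers max_batch_size and chunked are hypothetical; the thread does not name the variable.

```python
import os
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical variable name, used only for this sketch.
ENV_VAR = "EXAMPLE_MAX_BATCH_SIZE"


def max_batch_size(provider: str) -> Optional[int]:
    """Return the batch size cap; None means no cap."""
    raw = os.environ.get(ENV_VAR)
    if raw is not None:
        return max(1, int(raw))  # an explicit setting wins over the default
    return 1 if provider == "openvino" else None


def chunked(items: Sequence[T], size: Optional[int]) -> Iterator[List[T]]:
    """Yield items in lists of at most `size` elements (all at once if size is None)."""
    if not items:
        return
    step = len(items) if size is None else size
    for start in range(0, len(items), step):
        yield list(items[start:start + step])
```

With no variable set, an openvino provider would process one face per call, and setting the variable to a larger number would opt back into batching on hardware where it helps.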

last crater Oct 13, 2024, 5:41 PM

#

I've got another 36k in the backlog of facial detection, I'll note how many crashes I have when it's complete (hopefully 0)

#

I really can't thank you enough for your help!

storm radish Oct 13, 2024, 5:46 PM

#

Thanks for taking the time to test all of this!

last crater Oct 13, 2024, 5:47 PM

#

anytime!

storm radish Oct 13, 2024, 8:42 PM

#

Looking at this, there seems to be a bug in the locustfile.py. It isn't setting the face threshold in the right place. Can you retest with this change? Also, since you're using the 34g detection model the threshold numbers are different. Run 0.1 (1 face) and 0.07 (5 faces) instead

"facial-recognition": {
                "recognition": {
                    "modelName": self.environment.parsed_options.face_model,
--                  "options": {"minScore": self.environment.parsed_options.face_min_score},
                },
                "detection": {
                    "modelName": self.environment.parsed_options.face_model,
++                  "options": {"minScore": self.environment.parsed_options.face_min_score},
                },
            }
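
For reference, the payload after this change might be built like the sketch below. The key names come from the diff above; the function name build_entries and the JSON serialization step are assumptions, since the rest of locustfile.py is not shown in the thread.

```python
import json


def build_entries(face_model: str, face_min_score: float) -> str:
    """Return the request entries as JSON, with minScore attached to detection."""
    entries = {
        "facial-recognition": {
            "recognition": {"modelName": face_model},
            "detection": {
                "modelName": face_model,
                # The threshold belongs to detection, not recognition.
                "options": {"minScore": face_min_score},
            },
        }
    }
    return json.dumps(entries)
```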

last crater Oct 13, 2024, 8:43 PM

#

yep. I'll be back at the computer within half an hour

last crater Oct 14, 2024, 2:08 AM

#

sorry it took a little longer than expected. Ended up taking the side-by-side in the bush with my nieces and then one thing led to another.

#

📎 report_.original_0.1.html 📎 report_patched_0.07.html 📎 report_patched_0.1.html 📎 report_original_0.07.html

storm radish Oct 14, 2024, 2:15 AM

#

Thanks! This looks a lot better

#

It seems like batching doesn't help at all in your case, at least with the current onnxruntime-openvino setup. What I find odd is that the average response size is different between the original and patched. 27319.74 is wrong and 29862 is right. I'm not sure what's causing that...

#

I should compare with the 155H

last crater Oct 14, 2024, 2:27 AM

#

Oh interesting. I am of no help there unless you want me to run some tests

155H?

storm radish Oct 14, 2024, 2:28 AM

#

The intel processor I use for testing

last crater Oct 14, 2024, 2:29 AM

#

would running this on my J4125 be beneficial for you or not?

storm radish Oct 14, 2024, 2:38 AM

#

Is that different from what you just ran it on?

last crater Oct 14, 2024, 2:39 AM

#

yep. I just ran it on an i7-8565U

storm radish Oct 14, 2024, 2:40 AM

#

Oh I see. I'd be interested in the J4125 results too

last crater Oct 14, 2024, 2:41 AM

#

Ok. Time to figure out how to run Locust in a container. I'll get results back to you in a while

#

https://docs.locust.io/en/stable/running-in-docker.html

how many workers should I launch it with?

storm radish Oct 14, 2024, 2:46 AM

#

Just 1

storm radish Oct 14, 2024, 3:11 AM

#

So for comparison, this is what I get on the 155H

📎 patched_0.07.html 📎 patched_0.1.html 📎 original_0.1.html 📎 original_0.07.html

#

I don't see the different response size in this case, and there is a positive impact of batching (7.22 vs 6.05)

last crater Oct 14, 2024, 3:16 AM

#

interesting. maybe it's the fact I'm running super old hardware?

#

I'm still working through getting my docker container setup. It might be a tomorrow thing

storm radish Oct 14, 2024, 3:20 AM

#

No rush!

#

I get 1 rps on cpu vs 8.57 on openvino for a threshold of 0.1, so openvino is definitely putting in work though
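
As a quick check of the ratios quoted in this exchange, a small calculation. It assumes the 7.22 and 6.05 figures are both requests per second; the thread does not label them.

```python
def speedup(new: float, baseline: float) -> float:
    """Return how many times larger `new` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return new / baseline


openvino_vs_cpu = speedup(8.57, 1.0)  # 155H, threshold 0.1
batching_ratio = speedup(7.22, 6.05)  # 155H, the two batching figures
print(f"OpenVINO vs CPU: {openvino_vs_cpu:.2f}x")
print(f"Batching figures ratio: {batching_ratio:.2f}x")
```

That works out to roughly 8.6 times the CPU throughput, and a difference of about 19% between the two batching figures.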

#

Oh, it could maybe be because I'm running on main with #13290, so the difference in your case might be another aspect of the recent openvino regression

trim edgeBOT Oct 14, 2024, 3:22 AM

#

storm radish Oh, it could maybe be because I'm running on main with #13290, so the difference...

[Pull Request] fix(ml): pin onnxruntime-openvino (immich-app/immich#13290)

last crater Oct 14, 2024, 3:39 AM

#

yep, I'm running the latest and greatest for a change 😂

I'm calling it a night, I'll jump back into this tomorrow

last crater Oct 14, 2024, 1:56 PM

#

last crater I've got another 36k in the backlog of facial detection, I'll note how many cras...

not one crash processing this backlog

last crater Oct 14, 2024, 3:57 PM

#

I seem to have gotten Locust running in a Docker container. For the host, would I set it to http://machine_IP:3003? Pointing it at :8089 results in failures

storm radish Oct 14, 2024, 4:07 PM

#

If Locust uses 3003, what's the ML service using?

last crater Oct 14, 2024, 4:08 PM

#

locust uses 8089. ML is 3003

storm radish Oct 14, 2024, 4:08 PM

#

Ah gotcha, that sounds good

last crater Oct 14, 2024, 4:09 PM

#

RPS is low though.... then again probably correct for the hardware

storm radish Oct 14, 2024, 4:10 PM

#

Is it using openvino?

last crater Oct 14, 2024, 4:10 PM

#

yep

#

I'll toggle it with cpu afterwards just because

#

📎 J4125_original_0.1.html 📎 J4125_patched_0.07.html 📎 J4125_patched_0.1.html 📎 J4125_original_0.07.html

storm radish Oct 14, 2024, 4:17 PM

#

The UHD 600 in the J4125 should be roughly 3x slower than the UHD 620 in your 8565U. Then it's probably also slower to decode and preprocess images before getting them to the GPU

last crater Oct 14, 2024, 4:18 PM

#

RPS is the same with CPU... 0.5-0.6 at 0.07. Maybe I didn't rebuild the container correctly?

storm radish Oct 14, 2024, 4:19 PM

#

Relative improvement over CPU depends on how strong the GPU is

last crater Oct 14, 2024, 4:19 PM

#

📎 J4125_original_0.07_possibly-cpu.html

storm radish Oct 14, 2024, 4:20 PM

#

If the logs don't mention OpenVINOExecutionProvider, it's definitely CPU-only
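
That check can be automated with a few lines. A minimal sketch: the function name mentions_openvino is made up, and the only string it relies on is the provider name quoted above.

```python
PROVIDER_NAME = "OpenVINOExecutionProvider"


def mentions_openvino(log_text: str) -> bool:
    """Return True if any log line names the OpenVINO execution provider."""
    return any(PROVIDER_NAME in line for line in log_text.splitlines())
```

Feed it the captured output of docker compose logs for the machine learning service; a False result would mean the container is running CPU-only, per the message above.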

last crater Oct 14, 2024, 4:20 PM

#

it seems commenting out the driver didn't work.

Will re-run with a cpu image

#

I'll let you decide whether or not openvino is worth the hassle on this Synology 😁
It was really only using 2 cores....

📎 J4125_original_0.07_cpu.html

#

I had to run it again to get an extra request in there

storm radish Oct 14, 2024, 5:17 PM

#

Geez lol

storm radish Oct 14, 2024, 5:18 PM

#

last crater I had to run it again to get an extra request in there

I didn't even understand what you meant by this at first lmao

last crater Oct 14, 2024, 5:30 PM

#

So now that this is all said and done, would you like me to run tests on a different version of the container?

storm radish Oct 14, 2024, 5:31 PM

#

No, I think this is fine

#

Thanks again for testing!

storm radish Oct 14, 2024, 5:53 PM

#

last crater I'll let you decide whether or not openvino is worth the hassle on this Synology...

Wait hold up. Are you sure you set this to 0.07 and were using the larger detection model?

#

The response size is completely wrong

last crater Oct 14, 2024, 5:56 PM

#

Actually that might've been the smaller model. Not sure I patched the cpu container come to think of it. I'll patch it and re-run the test just as soon as I get back to my pc

last crater Oct 14, 2024, 6:16 PM

#

I'm not sure what changed. I had my flag there stating that the model was replaced, but I ended up forcefully replacing it again and now got different results. The first time I ran it I also thought it was something in the 16s mark, but I stopped and restarted the test so I didn't need to wait for the container to load the model.

Maybe I need to look into my script

📎 J4125_original_0.07_cpu2.html

#

I suppose that aligns with your 155H as well, being roughly 8x faster with openvino

storm radish Oct 14, 2024, 6:23 PM

#

Nice, that looks better

last crater Oct 14, 2024, 6:27 PM

#

Thanks again for your time. I really was looking to put a band aid on a gushing wound at the beginning of this, but this solution is so much better.

I'm assuming there's no need for me to make a PR as you have your own thoughts about how you want to proceed with this?

storm radish Oct 14, 2024, 6:29 PM

#

Yeah, I can make the PR for it

last crater Oct 14, 2024, 6:30 PM

#

Awesome, thanks again!
