#openvino automatically restart container on Non-zero status code?

last crater
#

Is there a way to automatically restart ML containers when they error out? I tried adding a try-catch to the main.py but was unsuccessful. The container responds with a 502 to new ML tasks, however the ping endpoint continues to pong.

It seems it slowly eats more and more ram when loading the gpu

Currently the containers hang until manually stopped and started. I have an external script watching it but it would be nice if there was a way I could incorporate it into the container.

I don't need a working solution, but providing a direction to know what would need to be modified (if at all possible) would be appreciated

First screenshot was taken at the beginning of typing, the 2nd was taken at the end
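
Because the ping endpoint keeps responding while ML tasks return 502, a liveness check on ping alone would not catch this state. A minimal sketch of the decision logic for a watcher, assuming it can see the status codes of real task responses (the class name and the threshold of 3 are hypothetical, not part of Immich):

```python
from collections import deque


class FailureWatch:
    """Tracks recent task status codes and flags when a restart looks warranted."""

    def __init__(self, threshold=3):
        # threshold is a hypothetical tuning knob: consecutive 5xx responses tolerated.
        self.threshold = threshold
        self.recent = deque(maxlen=threshold)

    def record(self, status_code):
        """Record one status code; return True once the last `threshold` were all 5xx."""
        self.recent.append(status_code)
        window_full = len(self.recent) == self.threshold
        return window_full and all(500 <= code < 600 for code in self.recent)


watch = FailureWatch(threshold=3)
# One success followed by three 502s: only the last call reports True.
results = [watch.record(code) for code in (200, 502, 502, 502)]
```

What happens once `record` returns True is left open here. One option is to exit the process so the container's restart policy takes over.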

trim edgeBOT
#

:wave: Hey @last crater,

Thanks for reaching out to us. Please follow the recommended actions below; this will help us be more effective in our support effort and leave more time for building Immich.

Checklist

  1. :ballot_box_with_check: I have verified I'm on the latest release (note that mobile app releases may take some time).
  2. :ballot_box_with_check: I have read applicable release notes.
  3. :ballot_box_with_check: I have reviewed the FAQs for known issues.
  4. :ballot_box_with_check: I have reviewed Github for known issues.
  5. :ballot_box_with_check: I have tried accessing Immich via local ip (without a custom reverse proxy).
  6. :ballot_box_with_check: I have uploaded the relevant logs, docker compose, and .env files, making sure to use code formatting.
  7. :blue_square: I have tried an incognito window, disabled extensions, cleared mobile app cache, logged out and back in, different browsers, etc. as applicable

(an item can be marked as "complete" by reacting with the appropriate number)

If this ticket can be closed you can use the /close command, and re-open it later if needed.

storm radish
#

In the try/catch, check if it’s a relevant error by looking at its .message property, and if so do os.kill(os.getpid(), signal.SIGINT)
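
A minimal sketch of that suggestion, under stated assumptions. Python 3 exceptions have no `.message` attribute by default, so this version checks `str(exc)` instead. The marker strings and the `run_guarded` helper are hypothetical and would need to match whatever error the container actually logs:

```python
import os
import signal

# Hypothetical substrings that mark an error as unrecoverable; adjust to the real error text.
FATAL_MARKERS = ("out of memory", "device lost")


def is_fatal(exc):
    """Return True if the exception text contains one of the fatal markers."""
    text = str(exc).lower()  # str(exc) works where .message does not exist
    return any(marker in text for marker in FATAL_MARKERS)


def run_guarded(task):
    """Run task(); on a fatal error, send SIGINT to this process, then re-raise."""
    try:
        return task()
    except Exception as exc:
        if is_fatal(exc):
            os.kill(os.getpid(), signal.SIGINT)
        raise
```

Whether the container comes back up afterwards depends on its restart policy, which this sketch does not cover.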

#

How many requests does it generally take to get to this point?

last crater
#

I'll turn off the load balancer and point all requests to this machine for testing. Think it's generally between 500-3000

#

(I am running facial detection only, with the [improved model](#1272383382487040020))

last crater
#

I'll give it a shot. Still waiting for the container to error out again, 370 requests and only at 2.2gb gpu memory used

#

memory still hasn't climbed significantly (1000 assets processed). I'll stop the container and change that line

#

I've never seen my GPU memory stay that low. It seems to never spike above 1gb now

storm radish
#

Do you notice a difference in speed?

last crater
#

let me stop it, change it back, and restart. I'll take another screenshot after a couple hundred requests are processed to compare

#

but I don't think so

#

this machine is running the vm from a nvme so it might not be the best to compare speeds with

storm radish
#

It’d be perfect if you could run Locust on it for precise metrics:

  1. Install poetry https://python-poetry.org/docs/#installation
  2. git clone https://github.com/immich-app/immich.git
  3. cd machine-learning
  4. poetry install
  5. In locustfile.py, comment out the CLIP endpoint calls so only the face detection and recognition is tested
  6. Run locust --web-host 127.0.0.1
  7. Open the link
  8. Customize the threshold (default only returns 1 face, decreasing it will show more)
  9. Set the number of users to 1 (or more if you want to see how it performs at higher concurrency)
last crater
#

I'll give it a shot. All from the immich_machine_learning compose file directory?

storm radish
#

You can clone the repo wherever you like

#

As long as it’s on the same server

#

Oh, and I forgot to mention that you need to expose port 3003 for machine learning

last crater
#

yep, already exposed. No firewall active

#

between steps 2 and 3, am I supposed to make a directory or cd to the container's root

storm radish
#

oops, cd to immich/machine-learning

#

I think 4 tests should be enough:

  • face min score 0.034 (default), self.batch patch
  • face min score 0.02, self.batch patch
  • face min score 0.034, original code
  • face min score 0.02, original code

Each test can run for 120s or so, then you can export the results as html for each before moving to the next
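
To keep the four runs straight, the matrix above can be enumerated in a few lines. This is only a bookkeeping sketch; the report file names are made up for illustration:

```python
from itertools import product

MIN_SCORES = (0.034, 0.02)                       # thresholds from the list above
VARIANTS = ("self.batch patch", "original code")  # code variants from the list above
RUN_SECONDS = 120                                # suggested duration per run


def build_runs():
    """Return one dict per run, in the same order as the list above."""
    runs = []
    for variant, score in product(VARIANTS, MIN_SCORES):
        # Hypothetical report name, e.g. "patch_0.034.html" or "original_0.02.html".
        prefix = "patch" if "patch" in variant else "original"
        runs.append(
            {
                "variant": variant,
                "face_min_score": score,
                "run_seconds": RUN_SECONDS,
                "report": f"{prefix}_{score}.html",
            }
        )
    return runs


runs = build_runs()
```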

last crater
#

is this the only comment necessary?

```python
def _(parser: ArgumentParser) -> None:
    #parser.add_argument("--clip-model", type=str, default="ViT-B-32::openai")
    parser.add_argument("--face-model", type=str, default="buffalo_l")
    parser.add_argument(
        "--tag-min-score",
        type=int,
        default=0.0,
        help="Returns all tags at or above this score. The default returns all tags.",
    )
    parser.add_argument(
        "--face-min-score",
        type=int,
        default=0.034,
        help=(
            "Returns all faces at or above this score. The default returns 1 face per request; "
            "setting this to 0 blows up the number of faces to the thousands."
        ),
    )
    parser.add_argument("--image-size", type=int, default=1000)
```
storm radish
#

No, just remove CLIPTextFormDataLoadTest and CLIPVisionFormDataLoadTest
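
A side note on the parser pasted above: the score options are declared with `type=int` but have fractional defaults. The sketch below uses the standard-library `argparse` to show what that combination does; it assumes Locust's parser behaves the same way for command-line values, which the thread does not confirm:

```python
from argparse import ArgumentParser


def build_parser(score_type):
    """Build a parser with one score option, using the given type converter."""
    parser = ArgumentParser()
    parser.add_argument("--face-min-score", type=score_type, default=0.034)
    return parser


# Non-string defaults are not converted, so the default stays a float even with type=int.
default_score = build_parser(int).parse_args([]).face_min_score

# An explicit fractional value parses when the converter is float.
explicit_score = build_parser(float).parse_args(["--face-min-score", "0.02"]).face_min_score

# With type=int, the same value is rejected and argparse exits.
try:
    build_parser(int).parse_args(["--face-min-score", "0.02"])
    int_accepts_fraction = True
except SystemExit:
    int_accepts_fraction = False
```

So if the threshold is passed on the command line, declaring it with `type=float` may be necessary.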

last crater
#

I must've done something wrong. I'm erroring out. Give me a few minutes to go through it all again (will test without editing files first)

last crater
#

I have no idea what those numbers mean. When you have a chance to look at them, I'd be interested in your thoughts on whether or not these numbers are acceptable.

Since I posted this I increased concurrency on my J4125 to 3, and now its GPU performance matches Oracle free tier's CPU for facial detection.

No more restarts!!! I had the concurrency set to 1 before and it would generally fault every couple hundred requests (I've pushed 750 through since). The container's RAM is now also hovering just below 2 GB, instead of steadily climbing.

storm radish
#

Sweet! I’m on my phone but will take a look at the files later. I think if we defaulted openvino to not batch and added a max batch size env var (overriding the default of 1 for openvino), it should cover all scenarios
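
A rough sketch of that idea, not the eventual implementation: the environment variable name and the function are hypothetical, and "no limit" is represented here as `None`.

```python
import os

# Hypothetical variable name; the thread does not name the eventual setting.
ENV_VAR = "MACHINE_LEARNING_MAX_BATCH_SIZE"


def resolve_max_batch_size(provider, env=None):
    """Return the max batch size for a provider, or None for no limit."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is not None:
        # An explicit setting overrides the per-provider default.
        size = int(raw)
        if size < 1:
            raise ValueError(f"{ENV_VAR} must be >= 1, got {size}")
        return size
    # Default: OpenVINO does not batch (size 1); other providers are unrestricted.
    return 1 if provider == "openvino" else None
```

For example, `resolve_max_batch_size("openvino", {})` gives 1, while setting the variable to `"4"` would raise the limit to 4 for any provider.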

last crater
#

I've got another 36k in the backlog of facial detection, I'll note how many crashes I have when it's complete (hopefully 0)

#

I really can't thank you enough for your help!

storm radish
#

Thanks for taking the time to test all of this!

last crater
#

anytime!

storm radish
#

Looking at this, there seems to be a bug in the locustfile.py. It isn't setting the face threshold in the right place. Can you retest with this change? Also, since you're using the 34g detection model the threshold numbers are different. Run 0.1 (1 face) and 0.07 (5 faces) instead

```diff
             "facial-recognition": {
                 "recognition": {
                     "modelName": self.environment.parsed_options.face_model,
-                    "options": {"minScore": self.environment.parsed_options.face_min_score},
                 },
                 "detection": {
                     "modelName": self.environment.parsed_options.face_model,
+                    "options": {"minScore": self.environment.parsed_options.face_min_score},
                 },
             }
```
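
For reference, a standalone sketch of the payload after that change, with `minScore` attached to the detection task only. The function name is illustrative, and the surrounding request code from locustfile.py is not reproduced here:

```python
import json


def face_entries(face_model, face_min_score):
    """Serialize the facial-recognition entries with minScore on the detection task."""
    entries = {
        "facial-recognition": {
            # Recognition carries only the model name after the fix.
            "recognition": {"modelName": face_model},
            "detection": {
                "modelName": face_model,
                "options": {"minScore": face_min_score},
            },
        }
    }
    return json.dumps(entries)


payload = face_entries("buffalo_l", 0.1)
```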
last crater
#

yep. I'll be back at the computer within half an hour

last crater
#

sorry it took a little longer than expected. Ended up taking the side-by-side in the bush with my nieces and then one thing led to another.

storm radish
#

Thanks! This looks a lot better

#

It seems like batching doesn't help at all in your case, at least with the current onnxruntime-openvino setup. What I find odd is that the average response size is different between the original and patched. 27319.74 is wrong and 29862 is right. I'm not sure what's causing that...

#

I should compare with the 155H

last crater
#

Oh interesting. I am of no help there unless you want me to run some tests

155H?

storm radish
#

The intel processor I use for testing

last crater
#

would running this on my J4125 be beneficial for you or not?

storm radish
#

Is that different from what you just ran it on?

last crater
#

yep. I just ran it on an i7-8565U

storm radish
#

Oh I see. I'd be interested in the J4125 results too

last crater
#

Ok. Time to figure out how to run locust in a container. I'll get results back to you in a while

storm radish
#

Just 1

storm radish
#

I don't see the different response size in this case, and there is a positive impact of batching (7.22 vs 6.05)

last crater
#

interesting. maybe it's the fact I'm running super old hardware?

#

I'm still working through getting my docker container setup. It might be a tomorrow thing

storm radish
#

No rush!

#

I get 1 rps on cpu vs 8.57 on openvino for a threshold of 0.1, so openvino is definitely putting in work though

#

Oh, it could maybe be because I'm running on main with #13290, so the difference in your case might be another aspect of the recent openvino regression

last crater
#

yep, I'm running the latest and greatest for a change 😂

I'm calling it a night, I'll jump back into this tomorrow

last crater
#

I seem to have been able to get locust to run in a docker container. For the host, would I set it to http://machine_IP:3003? Pointing it at :8089 results in failures

storm radish
#

If Locust uses 3003, what's the ML service using?

last crater
#

locust uses 8089. ML is 3003

storm radish
#

Ah gotcha, that sounds good

last crater
#

RPS is low though.... then again probably correct for the hardware

storm radish
#

Is it using openvino?

last crater
#

yep

#

I'll toggle it with cpu afterwards just because

storm radish
#

The UHD 600 in the J4125 should be roughly 3x slower than the UHD 620 in your 8565U. Then it's probably also slower to decode and preprocess images before getting them to the GPU

last crater
#

rps is the same with cpu... 0.5-0.6 at a threshold of 0.07. Maybe I didn't rebuild the container correctly?

storm radish
#

Relative improvement over CPU depends on how strong the GPU is

storm radish
#

If the logs don't mention OpenVINOExecutionProvider, it's definitely CPU-only
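
That check is easy to script once the container logs are captured as text. A minimal sketch; how the log text is obtained is left to the reader, and the function name is illustrative:

```python
PROVIDER_NAME = "OpenVINOExecutionProvider"


def mentions_openvino(log_text):
    """Return True if any log line mentions the OpenVINO execution provider."""
    return any(PROVIDER_NAME in line for line in log_text.splitlines())
```

A False result means the service is running CPU-only, per the message above.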

last crater
#

it seems commenting out the driver didn't work.

Will re-run with a cpu image

#

I had to run it again to get an extra request in there

storm radish
#

Geez lol

last crater
#

So now that this is all said and done, would you like me to run tests on a different version of the container?

storm radish
#

No, I think this is fine

#

Thanks again for testing!

storm radish
#

The response size is completely wrong

last crater
#

Actually that might've been the smaller model. Not sure I patched the cpu container come to think of it. I'll patch it and re-run the test just as soon as I get back to my pc

last crater
#

I'm not sure what changed. I had my flag there stating that the model was replaced, but I ended up forcefully replacing it again and now got different results. The first time I ran it I also thought it was something in the 16s mark, but I stopped and restarted the test so I didn't need to wait for the container to load the model.

Maybe I need to look into my script

#

I suppose that aligns with your 155H as well, being roughly 8x faster with openvino

storm radish
#

Nice, that looks better

last crater
#

Thanks again for your time. I really was looking to put a band aid on a gushing wound at the beginning of this, but this solution is so much better.

I'm assuming there's no need for me to make a PR as you have your own thoughts about how you want to proceed with this?

storm radish
#

Yeah, I can make the PR for it

last crater
#

Awesome, thanks again!