connect ECONNREFUSED 172.23.0.6:3003 - ML not starting | Immich

harsh quiver Aug 25, 2023, 9:54 PM

#

It seems like the machine_learning container is not starting. Here is my docker-compose block for that service (it worked before updating):

  immich-machine-learning:
    image: ghcr.io/immich-app/immich-machine-learning:release
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      #- ./model-cache:/cache
      - ./model-cache/yolos-tiny:/usr/src/app/hustvl/yolos-tiny
      - ./model-cache/resnet-50:/usr/src/app/microsoft/resnet-50
      - ./model-cache/clip-ViT-B-32:/usr/src/app/clip-ViT-B-32
    env_file:
      - .env
    environment:
      - NODE_ENV=production
    depends_on:
      - database
    restart: always

Here are the logs from the container:

docker logs immich-immich-machine-learning
INFO:     Started server process [7]
INFO:     Waiting for application startup.

Those logs aren't super helpful. Is there a way to get more verbose logs, and has anyone run into this? Previously the container was not allowed to access the internet, so I had to bind mount the models into it.

#

Setting the container to development mode did not give any helpful logs. It seems like it may be a FastAPI app, but I am unsure what the correct way to debug/triage this is.

#

It seems like some models may have changed today, but in general the models haven't changed for a while: https://github.com/immich-app/immich/commit/165b91b068193db53b07cc4f265d11326530be3c

GitHub

feat(ml)!: switch image classification and CLIP models to ONNX (#38...

slow nimbus Aug 25, 2023, 10:20 PM

#

That PR hasn't made it into a release yet

#

The server won't be open to requests until you see "Application startup complete." in the logs. If there are no errors, it might just be loading the models. Do you see any IO or CPU activity in the container?
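
For anyone following along: ECONNREFUSED means nothing is listening on the port yet, which matches a server that has not finished starting. A minimal sketch for checking that from another machine or container is below. The helper name port_is_open is made up for illustration, and the address is the one from the thread title, not something to copy as-is.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (ECONNREFUSED) and timeouts are OSError subclasses.
        return False
```

Calling port_is_open("172.23.0.6", 3003) should keep returning False until the startup message appears. This only tests that the port accepts connections; it says nothing about whether the models loaded correctly.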

harsh quiver Aug 26, 2023, 12:59 AM

#

No CPU or IO usage after a few seconds. It's on a pretty large zfs array

harsh quiver Aug 26, 2023, 1:25 AM

#

Can you share the model cache directory on either a bind mount of the container or from a running container? I think the default now in the container might be /cache? It might be that there is a bug in the loader for at least bison_l where it does not raise an exception with missing files

slow nimbus Aug 26, 2023, 1:47 AM

#

The cache directory is set with MACHINE_LEARNING_CACHE_FOLDER and defaults to /cache. Are you setting this env?
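
As a rough illustration of that lookup (not the app's actual code, which may read its settings differently), the sketch below resolves the folder the same way and lists what is inside it. The function names resolve_cache_folder and list_cache are hypothetical.

```python
import os
from pathlib import Path


def resolve_cache_folder(env=None) -> Path:
    """Return MACHINE_LEARNING_CACHE_FOLDER if set, otherwise /cache."""
    env = os.environ if env is None else env
    return Path(env.get("MACHINE_LEARNING_CACHE_FOLDER", "/cache"))


def list_cache(folder: Path) -> list[str]:
    """List everything under folder as relative paths, or [] if it is missing."""
    if not folder.is_dir():
        return []
    return sorted(p.relative_to(folder).as_posix() for p in folder.rglob("*"))
```

Running list_cache(resolve_cache_folder()) inside the container is one way to produce the directory listing asked for above.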

#

there is a bug in the loader for at least bison_l where it does not raise an exception with missing files
If there's an error while loading a model, the app will delete the folder associated with that model and try again. It will only error if this second attempt also fails.
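
The retry behavior described here could be sketched roughly as follows. This is an illustration of the pattern only, under the assumption that loading is a function of the model's folder; load_with_one_retry and its load argument are hypothetical names, not the app's API.

```python
import shutil
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_one_retry(load: Callable[[Path], T], model_dir: Path) -> T:
    """Try load(model_dir); on failure, wipe model_dir and try exactly once more."""
    try:
        return load(model_dir)
    except Exception:
        # First failure: clear the model's folder so a fresh copy can be fetched.
        shutil.rmtree(model_dir, ignore_errors=True)
        model_dir.mkdir(parents=True, exist_ok=True)
        # A second failure is not caught here, so it propagates to the caller.
        return load(model_dir)
```

Note that the first failure empties the folder before the second attempt, so anything placed there by hand is gone by the time an error is finally raised.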

#

In general I don't recommend binding the model folders directly like this since there can be changes in the cache structure, and as mentioned, it can delete the contents of these folders on failure.

harsh quiver Aug 26, 2023, 4:17 AM

#

OK, I need to run the containers offline, so the models cannot be downloaded on startup. Bind mounting them directly, as in the docker compose above, worked. How can the models be downloaded ahead of time and mounted into the container?

#

And do exceptions not propagate to logs in the container? That's not totally clear to me

slow nimbus Aug 26, 2023, 4:23 AM

#

harsh quiver And do exceptions not propagate to logs in the container? That's not totally cle...

Exceptions definitely do show up in the logs

#

The model cache isn't terribly well documented at the moment, but if you can match the folder structure here then it should work. https://github.com/immich-app/immich/blob/60729a091ab0da18ec67d4a8d0ca8448715a91b6/machine-learning/app/config.py#L28C1-L29C69

(This cache structure will change in the upcoming release as a heads up)

#

Where model_type.value is image-classification, clip or facial-recognition
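
A small sketch for checking a pre-populated cache before starting the container offline is below. It assumes the layout is cache folder, then model type, then model name; that layout is an assumption drawn from this discussion, so compare it with the linked config.py lines first. The function names are hypothetical, and the model names in the usage note are taken from the compose file at the top of the thread.

```python
from pathlib import Path

# Model type folder names as listed in the thread.
MODEL_TYPES = ("image-classification", "clip", "facial-recognition")


def expected_model_dir(cache_folder: Path, model_type: str, model_name: str) -> Path:
    """Assumed layout: <cache_folder>/<model_type>/<model_name>."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type}")
    return cache_folder / model_type / model_name


def missing_model_dirs(cache_folder: Path, models: dict[str, str]) -> list[Path]:
    """Return the expected model folders that are absent or empty."""
    missing = []
    for model_type, model_name in models.items():
        path = expected_model_dir(cache_folder, model_type, model_name)
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(path)
    return missing
```

For example, missing_model_dirs(Path("./model-cache"), {"clip": "clip-ViT-B-32", "image-classification": "microsoft/resnet-50"}) returns the folders that still need to be filled. How a model name containing a slash maps to folders is also an assumption here.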

harsh quiver Aug 26, 2023, 4:29 AM

#

slow nimbus The model cache isn't terribly well documented at the moment, but if you can mat...

Previously the huggingface library was used I think? It did expect some subdirs, and would first try to serialize relative to the main script before loading from a configured folder

#

Kind of interesting behavior. I'll give that approach a try. It's odd that there aren't any exceptions after 6ish hours.

slow nimbus Aug 26, 2023, 4:33 AM

#

The first model that gets loaded is the HF image classification model. This model always outputs a log about configuring the feature extractor. Since you don't have that log, it's probably stuck trying to download it. But I'm not sure why it isn't erroring out

#

The feature extractor preprocesses the image for the model and is downloaded separately, so that might be causing you issues

harsh quiver Aug 26, 2023, 4:38 AM

#

clip, right?

slow nimbus Aug 26, 2023, 4:40 AM

#

Ah, CLIP also has two preprocessors: a tokenizer for the text model and a feature extractor for the vision model. But I was talking about the image classification model

harsh quiver Aug 26, 2023, 4:41 AM

#

Oh OK, cool, this gives me a lot to work with. I'll debug my setup tomorrow and keep a closer eye on the model structures.

#

thanks!
