#Search problem after upgrading to v1.91.4


upper dagger
#

After upgrading to v1.91.4 from 1.90 I have an issue with search (using clip):

  • Search fails (times out after 3 minutes)

I tried rebuilding smart search, but it's going very slowly, and when I check the logs for immich-microservices I see this:

[Nest] 1  - 12/22/2023, 11:12:31 PM   ERROR [JobService] Unable to run job handler (smartSearch/clip-encode): TypeError: fetch failed
[Nest] 1  - 12/22/2023, 11:12:31 PM   ERROR [JobService] TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11730:11)
    at async MachineLearningRepository.post (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:16:21)
    at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/domain/smart-info/smart-info.service.js:102:31)
    at async /usr/src/app/dist/domain/job/job.service.js:113:37
    at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:387:28)
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:574:24)
[Nest] 1  - 12/22/2023, 11:12:31 PM   ERROR [JobService] Object:
{
  "id": "fa877739-6e97-46af-a5f8-1a37df71e335"
}

stark mortar BOT
#

:wave: Hey @upper dagger,

Thanks for reaching out to us. Please follow the recommended actions below; this will help us be more effective in our support effort and leave more time for building Immich.

Checklist

  1. :blue_square: I have verified I'm on the latest release (note that mobile app releases may take some time).
  2. :blue_square: I have read applicable release notes.
  3. :blue_square: I have reviewed the FAQs for known issues.
  4. :blue_square: I have reviewed GitHub for known issues.
  5. :blue_square: I have tried accessing Immich via local ip (without a custom reverse proxy).
  6. :blue_square: I have uploaded the relevant logs, docker compose, and .env files using the buttons below or the /upload command.
  7. :blue_square: I have tried an incognito window, cleared mobile app cache, logged out and back in, different browsers, etc. as applicable.

(an item can be marked as "complete" by reacting with the appropriate number)

If this ticket can be closed you can use the /close command, and re-open it later if needed.

upper dagger
#

machine-learning container has this log:

[12/22/23 23:16:34] INFO     Booting worker with pid: 155199
[12/22/23 23:16:40] ERROR    Worker (pid:155199) was sent code 132!
[12/22/23 23:16:40] INFO     Booting worker with pid: 155204
[12/22/23 23:16:45] ERROR    Worker (pid:155204) was sent code 132!
[12/22/23 23:16:45] INFO     Booting worker with pid: 155209

and the immich-server container has this log:

[Nest] 1  - 12/22/2023, 11:19:02 PM   ERROR [TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11730:11)
    at async MachineLearningRepository.post (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:16:21)
    at async SearchService.search (/usr/src/app/dist/domain/search/search.service.js:63:35)] TypeError: fetch failed

covert lynx
#

The ML worker is being sent SIGKILL. How much RAM do you have? Edit: Actually it’s SIGILL, probably because of no AVX
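
For reference, reading `code 132` as SIGILL is consistent with the common convention of reporting a signal death as 128 plus the signal number. A minimal Python sketch of that decoding, assuming Linux signal numbering (the helper name `decode_exit_code` is made up for illustration):

```python
import signal


def decode_exit_code(code: int) -> str:
    """Map a 128+N style exit code to the name of signal N."""
    if code <= 128:
        return f"exit status {code} (not a signal)"
    try:
        # signal.Signals is the standard-library enum of signal numbers.
        return signal.Signals(code - 128).name
    except ValueError:
        return f"unknown signal {code - 128}"


if __name__ == "__main__":
    # 132 is the code from the machine-learning log; 137 is what a
    # SIGKILL (for example from the OOM killer) would look like.
    for code in (132, 137):
        print(code, decode_exit_code(code))
```

An out-of-memory kill would normally show up as 137 instead, which is why the RAM question gets dropped once the code is read as 132.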

upper dagger
#

16 GB, with 2 GB used on that node

#

but... dmesg does show something interesting: traps: gunicorn[3895862] trap invalid opcode ip:7fc6eb8045f2 sp:7fff88a6fb20 error:0 in libtorch_cpu.so[7fc6db12f000+16075000]
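
A rough sketch of how a `traps:` line like that one could be picked apart in Python, in case it helps when scanning dmesg on several nodes. The regex is written against this one line only, so the exact field layout is an assumption, and `parse_trap` is a hypothetical helper name:

```python
import re

# Pattern modelled on the single dmesg line quoted in this thread.
TRAP_RE = re.compile(
    r"traps: (?P<process>[^\[\s]+)\[(?P<pid>\d+)\] trap (?P<trap>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+) "
    r"in (?P<library>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_trap(line: str):
    """Return the interesting fields of a kernel trap line, or None."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "process": m["process"],
        "pid": int(m["pid"]),
        "trap": m["trap"],
        "library": m["library"],
        # Offset of the faulting instruction inside the mapped library.
        "offset": ip - base,
    }


if __name__ == "__main__":
    line = (
        "traps: gunicorn[3895862] trap invalid opcode ip:7fc6eb8045f2 "
        "sp:7fff88a6fb20 error:0 in libtorch_cpu.so[7fc6db12f000+16075000]"
    )
    info = parse_trap(line)
    print(info["process"], info["trap"], info["library"], hex(info["offset"]))
```

The useful parts are the trap type (`invalid opcode`) and the library name, which together point at an instruction in `libtorch_cpu.so` that this CPU cannot execute.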

covert lynx
#

Aha

upper dagger
#

The CPU on that node is a bit old. Lemme see if I can move it to a different node

covert lynx
#

Do you have Tag Images enabled by chance?

upper dagger
#

no, that's disabled

covert lynx
#

Hm that’s interesting, so torch itself can cause issues even when it isn’t being used

upper dagger
#

The CPU on that node is an AMD Phenom(tm) II X4 965 Processor
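
One way to confirm the "no AVX" guess on a given node is to look at the `flags` line in `/proc/cpuinfo`. A small Python sketch, assuming an x86 Linux host where that line exists (the function names are made up for illustration):

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # On x86 Linux the feature list lives on the "flags" line.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx(cpuinfo_text: str) -> bool:
    """True if the exact flag "avx" is listed."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX supported:", has_avx(f.read()))
    except FileNotFoundError:
        print("/proc/cpuinfo is not available on this platform")
```

Run on each node, this gives a quick yes/no for deciding which nodes can host the machine-learning container.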

covert lynx
#

The good news is that pytorch has been removed from ML entirely, so the next release won’t have this issue

upper dagger
#

I wonder why it didn't happen on earlier versions. Maybe they just randomly got assigned to nodes with more recent CPUs

#

(running it in Kubernetes, which has a bunch of used and on-sale nodes :D)

covert lynx
#

Maybe yeah, could also be from a dependency version bump

upper dagger
#

I labeled the newer nodes and set a required node affinity for that label on it. Let's see now
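
For anyone wanting to do the same, here is a sketch of what a required node affinity looks like in a pod spec, built as a Python dict and printed as JSON. The field names are the standard Kubernetes ones; the label key and value (`example.com/cpu-avx`, `true`) are placeholders, since the actual label used here was not shared:

```python
import json


def required_node_affinity(label_key: str, label_value: str) -> dict:
    """Build a pod-spec fragment that only schedules onto labeled nodes."""
    return {
        "affinity": {
            "nodeAffinity": {
                # "required" means the pod stays Pending if no node matches.
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": label_key,
                                    "operator": "In",
                                    "values": [label_value],
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical label; the nodes would need to carry it first.
    fragment = required_node_affinity("example.com/cpu-avx", "true")
    print(json.dumps(fragment, indent=2))
```

The printed fragment would be merged under the pod template's `spec` for the machine-learning deployment, after the newer nodes have been given the matching label.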

#

yeah, seems to work now! Thanks