#OCR Issues with NVIDIA Remote ML

median palm
#

There are no driver issues. nvidia-smi works.

I rented an H100 and it just offloaded to the CPU.
Tried an A2000 and it worked.

In the OCR settings it uses the _server version. I increased the resolution to 1280, but performance was bad, so I lowered it to 1080.

Agenda

  1. How to resolve this issue?
  2. What hardware do I need to run high-res OCR with the _server version? An estimate is enough, if you have one.
  3. Thanks for OCR!

name: immich_remote_ml

services:
  immich-machine-learning:
    container_name: immich-machine-learning
    # Use 'release-cuda' for compatibility with new GPUs like the H100
    image: ghcr.io/immich-app/immich-machine-learning:release-cuda
    hostname: immich-machine-learning
    restart: always
    environment:
      - MACHINE_LEARNING_DEVICE_IDS=0
      - MACHINE_LEARNING_WORKERS=1
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - type: bind
        source: /tmp/immich-cache
        target: /cache
        bind:
          create_host_path: true
    ports:
      - "3003:3003" # Port for the main server to connect to
    networks:
      - default
    privileged: false
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

networks:
  default:
    # If this network is managed by another docker-compose file,
    # you might need to mark it as external:
    # external: true
    # name: <name_of_your_immich_network>

lusty hingeBOT
#

:wave: Hey @median palm,

Thanks for reaching out to us. Please carefully read this message and follow the recommended actions. This will help us be more effective in our support effort and leave more time for building Immich.

Checklist

I have...

  1. :ballot_box_with_check: verified I'm on the latest release (note that mobile app releases may take some time).
  2. :ballot_box_with_check: read applicable release notes.
  3. :ballot_box_with_check: reviewed the FAQs for known issues.
  4. :ballot_box_with_check: reviewed Github for known issues.
  5. :ballot_box_with_check: tried accessing Immich via local ip (without a custom reverse proxy).
  6. :ballot_box_with_check: uploaded the relevant information (see below).
  7. :ballot_box_with_check: tried an incognito window, disabled extensions, cleared mobile app cache, logged out and back in, different browsers, etc. as applicable

(an item can be marked as "complete" by reacting with the appropriate number)

Information

In order to be able to effectively help you, we need you to provide clear information to show what the problem is. The exact details needed vary per case, but here is a list of things to consider:

  • Your docker-compose.yml and .env files.
  • Logs from all the containers and their status (see above).
  • All the troubleshooting steps you've tried so far.
  • Any recent changes you've made to Immich or your system.
  • Details about your system (both software/OS and hardware).
  • Details about your storage (filesystems, type of disks, output of commands like fdisk -l and df -h).
  • The version of the Immich server, mobile app, and other relevant pieces.
  • Any other information that you think might be relevant.

Please paste files and logs with proper code formatting, and especially avoid blurry screenshots.
Without the right information we can't work out what the problem is. Help us help you ;)

If this ticket can be closed you can use the /close command, and re-open it later if needed.

median palm
#

On the H100 I saw a spike in GPU usage, but it then settled to 0% while the CPU went to 100%. VRAM consumption remained significant, though.

#

nvidia-smi of H100 confirming requirements.

#

e00kjvzd0g6ya39w9f:/home/cloud# nvidia-smi
Fri Oct 31 11:43:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:8D:00.0 Off |                    0 |
| N/A   33C    P0            177W /  700W |    8572MiB /  81559MiB |     98%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2457      C   python                                       8562MiB |
+-----------------------------------------------------------------------------------------+
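A single nvidia-smi snapshot can miss the pattern described above (a brief GPU spike, then memory still allocated while the CPU does the work), so it may help to sample over time. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH. The function names and both thresholds are illustrative assumptions, not part of Immich or of the NVIDIA tooling.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` rows into dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        util, used, total = (int(f.strip()) for f in fields)
        rows.append({"util_pct": util, "used_mib": used, "total_mib": total})
    return rows


def looks_idle_but_loaded(row, util_threshold=5, min_used_mib=1024):
    """True when memory is allocated but the GPU is barely computing.

    Both thresholds are arbitrary illustrative values (assumptions).
    """
    return row["util_pct"] < util_threshold and row["used_mib"] >= min_used_mib


def sample_gpus():
    """Take one sample from nvidia-smi (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling `sample_gpus()` every few seconds during an OCR run and passing each row to `looks_idle_but_loaded` would show whether utilization really drops to near zero while the Python process keeps its VRAM.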

exotic sky
#

What CPU do you have in the server?

#

Also @daring jackal

sharp tapir
#

Does OCR need to run as privileged?

median palm
#

But it was thermally bottlenecked

#

For H100

median palm
#

Upgraded and jobs seem to deplete fast but have this error in logs (remote ML A2000)

#

And this on immich-server container
[Nest] 7 - 11/02/2025, 10:42:48 AM WARN [Microservices:MachineLearningRepository] Machine learning request to "http://10.0.0.202:3003" failed with status 500: Internal Server Error
[Nest] 7 - 11/02/2025, 10:42:48 AM LOG [Microservices:MachineLearningRepository] Machine learning server became unhealthy (http://10.0.0.202:3003).
[Nest] 7 - 11/02/2025, 10:42:48 AM ERROR [Microservices:{"id":"29f6c72b-ee8b-4d25-90e1-05d68fa396e9"}] Unable to run job handler (Ocr): Error: Machine learning request '{"ocr":{"detection":{"modelName":"PP-OCRv5_server","options":{"minScore":0.5,"maxResolution":1000}},"recognition":{"modelName":"PP-OCRv5_server","options":{"minScore":0.8}}}}' failed for all URLs
Error: Machine learning request '{"ocr":{"detection":{"modelName":"PP-OCRv5_server","options":{"minScore":0.5,"maxResolution":1000}},"recognition":{"modelName":"PP-OCRv5_server","options":{"minScore":0.8}}}}' failed for all URLs
    at MachineLearningRepository.predict (/usr/src/app/server/dist/repositories/machine-learning.repository.js:117:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async MachineLearningRepository.ocr (/usr/src/app/server/dist/repositories/machine-learning.repository.js:150:26)
    at async OcrService.handleOcr (/usr/src/app/server/dist/services/ocr.service.js:52:28)
    at async JobService.onJobRun (/usr/src/app/server/dist/services/job.service.js:199:30)
    at async EventRepository.onEvent (/usr/src/app/server/dist/repositories/event.repository.js:91:13)
    at async /usr/src/app/server/node_modules/.pnpm/[email protected]/node_modules/bullmq/dist/cjs/classes/worker.js:528:32
[Nest] 7 - 11/02/2025, 10:42:49 AM LOG [Microservices:MachineLearningRepository] Machine learning server became healthy (http://10.0.0.202:3003).
[Nest] 35 - 11/02/2025, 10:43:18 AM LOG [Api:WebsocketRepository] Websocket Connect: z5GIxoj0Mt32EoZKAAAB
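The error line above embeds the exact request the server sent, which records the model and resolution that were in effect when the job failed (here `PP-OCRv5_server` with `maxResolution` 1000). A minimal sketch for pulling those values out of a server log, assuming the message format shown in this thread; `extract_ocr_settings` is a hypothetical helper name, not an Immich function.

```python
import json
import re

# Matches the JSON payload quoted in "Machine learning request '<json>' failed".
REQUEST_RE = re.compile(r"Machine learning request '(\{.*?\})' failed")


def extract_ocr_settings(log_line):
    """Return OCR model names and max resolution from one log line, or None."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return None
    payload = json.loads(match.group(1))
    ocr = payload.get("ocr")
    if ocr is None:
        return None  # a different ML task, e.g. not an OCR request
    return {
        "detection_model": ocr["detection"]["modelName"],
        "max_resolution": ocr["detection"]["options"]["maxResolution"],
        "recognition_model": ocr["recognition"]["modelName"],
    }
```

Running this over every line of the immich-server log and counting the results per setting would show whether failures cluster at a particular resolution or model.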

#

It seems to work now..

#

Still what was the error about?

exotic sky
#

You should check the machine learning logs to see if there were issues there

#

Ah you sent them in message.txt

#

Looks like the GPU ran out of memory, what model are you running for OCR?

#

And what concurrency do you have set?

#

And how much memory does your version of the A2000 have? 6 or 12 GB?
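The three questions above (model, concurrency, card memory) fit together as a rough budget: the model is loaded once, and each concurrent job adds working memory on top. The sketch below only illustrates that arithmetic. The per-model and per-job figures are placeholder assumptions, not measurements of PP-OCRv5 or of Immich, and real usage also depends on image size and the runtime's allocator.

```python
def peak_vram_mib(model_mib, per_job_mib, concurrency):
    """Rough upper bound: model weights once, plus working memory per job."""
    return model_mib + per_job_mib * concurrency


def max_safe_concurrency(total_mib, model_mib, per_job_mib, reserve_mib=512):
    """Largest concurrency whose estimated peak stays under the card's memory.

    reserve_mib is headroom for the driver and other processes (assumed value).
    """
    available = total_mib - reserve_mib - model_mib
    if available < per_job_mib:
        return 0
    return available // per_job_mib


# Placeholder figures (assumptions): 2000 MiB model, 1500 MiB per job.
six_gb = max_safe_concurrency(6 * 1024, model_mib=2000, per_job_mib=1500)
twelve_gb = max_safe_concurrency(12 * 1024, model_mib=2000, per_job_mib=1500)
```

With these placeholder figures the 6 GB card fits 2 concurrent jobs and the 12 GB card fits 6, which is why the same concurrency setting can work on one A2000 variant and run out of memory on the other.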

median palm
#

It is working now though. Did 20000 images

#

55000 pending

timber jungle
#

I got almost the same error, but on an AMD GPU

docker compose:

name: immich_remote_ml

services:
  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, rocm, openvino, rknn] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-rocm
    extends:
        file: hwaccel.ml.yml
        service: rocm # set to one of [armnn, cuda, rocm, openvino, openvino-wsl, rknn] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - model-cache:/cache
    restart: always
    ports:
      - 3003:3003

volumes:
  model-cache:

median palm
#

Getting same error on latest version with all settings downgraded

median palm
#

So I solved this. I restart Immich and run only one task at a time, either OCR or Smart Search.

Then it works flawlessly.

sharp tapir
#

🤔 that doesn't sound like an amazing solution

median palm