#GPU is detected and working but ML doesn't use it at all for any of its related tasks.

vapid coyote
proven sparrowBOT
#

:wave: Hey @vapid coyote,

Thanks for reaching out to us. Please carefully read this message and follow the recommended actions. This will help us be more effective in our support effort and leave more time for building Immich.

References

#

Checklist

I have...

  1. :ballot_box_with_check: verified I'm on the latest release (note that mobile app releases may take some time).
  2. :ballot_box_with_check: read applicable release notes.
  3. :ballot_box_with_check: reviewed the FAQs for known issues.
  4. :ballot_box_with_check: reviewed Github for known issues.
  5. :ballot_box_with_check: tried accessing Immich via local ip (without a custom reverse proxy).
  6. :ballot_box_with_check: uploaded the relevant information (see below).
  7. :ballot_box_with_check: tried an incognito window, disabled extensions, cleared mobile app cache, logged out and back in, different browsers, etc. as applicable

(an item can be marked as "complete" by reacting with the appropriate number)

Information

In order to be able to effectively help you, we need you to provide clear information to show what the problem is. The exact details needed vary per case, but here is a list of things to consider:

  • Your docker-compose.yml and .env files.
  • Logs from all the containers and their status (see above).
  • All the troubleshooting steps you've tried so far.
  • Any recent changes you've made to Immich or your system.
  • Details about your system (both software/OS and hardware).
  • Details about your storage (filesystems, type of disks, output of commands like fdisk -l and df -h).
  • The version of the Immich server, mobile app, and other relevant pieces.
  • Any other information that you think might be relevant.

Please paste files and logs with proper code formatting, and especially avoid blurry screenshots.
Without the right information we can't work out what the problem is. Help us help you ;)

If this ticket can be closed you can use the /close command, and re-open it later if needed.

vapid coyote
#

the .env file

#

Setup: Proxmox as host, with Immich set up and installed in an LXC container (i5-6500, 16GB RAM, Quadro P1000).

I can confirm through nvidia-smi that the GPU, a Quadro P1000, gets detected in all 3 places mentioned above.

The mount is a Backblaze B2 bucket.

The container has 4vCPUs, 4GB RAM and 90GB of storage allotted to it.

#

The image is nvidia-smi output

#

I have played around as much as I can with the docker files, and hwaccel.

Finally found the discord and saw someone pointed to the immich_machine_learning container using the internal IP of docker on the web interface.

I have spent more time on this than I'd like to admit. Any help is appreciated.

proven sparrowBOT
hollow grove
#

Is a P1000 even compatible?

#

Ah yep, compute 6.1 is >= 5.2
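
For anyone wanting to repeat this check from a shell, here is a rough sketch. It assumes a driver recent enough that `nvidia-smi` supports the `compute_cap` query field. The `cc_at_least` helper is a made-up name, and the 5.2 minimum is simply the number quoted in the message above.

```shell
#!/bin/sh
# cc_at_least HAVE WANT: succeed if compute capability HAVE >= WANT.
# Hypothetical helper; compares major, then minor, as integers.
cc_at_least() {
    have_major=${1%%.*}; have_minor=${1#*.}
    want_major=${2%%.*}; want_minor=${2#*.}
    if [ "$have_major" -ne "$want_major" ]; then
        [ "$have_major" -gt "$want_major" ]
    else
        [ "$have_minor" -ge "$want_minor" ]
    fi
}

# Only query the GPU when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    case $cc in
        [0-9]*.[0-9]*)
            if cc_at_least "$cc" 5.2; then
                echo "compute capability $cc is sufficient"
            else
                echo "compute capability $cc is below 5.2"
            fi
            ;;
        *)
            echo "could not read compute capability from nvidia-smi"
            ;;
    esac
fi
```

With the P1000's 6.1, `cc_at_least 6.1 5.2` succeeds, which matches the conclusion above.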

#

Not sure what the NVIDIA env vars are supposed to achieve?

#

We don't have those

#

By installed in an LXC, do you mean directly, or is docker running in your LXC @vapid coyote ?

vapid coyote
hollow grove
#

Could try setting:

  cuda:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities:
                - gpu

In hwaccel instead @vapid coyote ?

vapid coyote
#

Ok will try and update here

vapid coyote
#

Didn't work

hollow grove
#

dang

#

maybe

group_add:
  - "xy"
  - "zw"

Where xy zw are the gid numbers for video and render 👀
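
The numeric group IDs can be looked up with `getent`. A minimal sketch, assuming the groups are named `video` and `render` on the machine running Docker (names vary by distribution, and `render` may not exist at all). `gids_from_getent` is a hypothetical helper that only extracts the third colon-separated field.

```shell
#!/bin/sh
# gids_from_getent: read `getent group` style lines (name:x:gid:members)
# on stdin and print "name gid" pairs, one per line.
gids_from_getent() {
    awk -F: 'NF >= 3 { print $1, $3 }'
}

# Look up the groups that commonly own GPU device nodes.
# A missing group simply produces no output line.
getent group video render 2>/dev/null | gids_from_getent || true
```

The numbers printed in the second column are what would replace `xy` and `zw` in the `group_add` list.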

#

But I'm just guessing at this point

vapid coyote
#

I assume that's where you wanted me to add it

hollow grove
#

indeed

vapid coyote
#

What I can't get out of my mind is the error log still mentions "device_id" I got rid of that but the error log still mentions it

hollow grove
#

It mentioned that before you switched to it

#

my guess is there is a rights issue preventing the GPU from being properly accessed

#

resulting in GPU=-1

vapid coyote
#

hmm, the error log at the top that I shared doesn't have the white error from now

#

I do agree it might just be too many virtualisation layers, proxmox>lxc>docker

#

causing permission issues

#

Installed immich on a vm first, worked great, then realised maybe adding a GPU for the ML loads would be a good idea considering the CPU is probably not the best for this stuff.

That was 3 days ago when I dropped $100 on the Quadro, first my wallet hurt, now my brain hurts.

hollow grove
#

The joke is your iGPU probably would've worked fine 👀

hollow grove
#

So to give you the real picture here

#

CPUs have their own accelerators

#

OpenVino for instance

#

And outside of the initial processing, you don't really need anything but RAM

#

usually loading the model is what takes the longest

#

using an external ML container for initial processing is very popular

vapid coyote
# hollow grove using an external ML container for initial processing is very popular

Yeah I have been thinking of 2 options -

  1. Use a spare disk and test run a desktop environment - probably mint. Try running immich on there and if it works well then maybe I move my entire stack there - HAos, SMB, Minecraft Servers.

  2. My main machine is a RTX 3060 Laptop, Windows. Run the ML container on there? The web UI allows me to control when the jobs are running and I can control when the container is running on the laptop.

#

Also to note - I tried sending through 7000 files and that started crashing the container (LXC). It seems I loaded the CPU too much, since I had all the ML stuff enabled because I thought the GPU was working.

#

But on the VM, the CPU seemed to handle it beautifully, didn't crash even though the CPU was going ham the same way - this was before the GPU install.

hollow grove
#

I ran my ML on windows as well, took a good few hours but not days

vapid coyote
#

Yeah, I did look into it. Nvidia and Microsoft have done some cool stuff to bring proper GPU access to WSL.

There's just so much conflicting and unclear information on the docs

hollow grove
#

Our docs?

vapid coyote
#

don't think I found anything about that on the immich docs

lunar idol
vapid coyote
#

So with a VM, one VM will have access to the GPU.

With LXC Containers, host keeps the GPU but multiple containers can use it

#

I first tried VM but in that I couldn’t even get the GPU to be detected properly, with LXC it detects just doesn’t get used.

lunar idol
#

With LXC you still need to write rules in the LXC config file and you'll likely have to mount the GPU

#

If you do ls /dev/nvidia* in the container does anything show up?

vapid coyote
#

@lunar idol mb don't know how I missed this message, but my config for the lxc has all these things in it

lunar idol
#

Please provide me with an ls /dev/nvidia* from both the Proxmox host and from inside the container
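
One way to compare the two listings is to save each to a file and look for nodes the container lacks. This is only a sketch: the file names `host.txt` and `ct.txt` are placeholders, and `missing_in_container` is a hypothetical helper.

```shell
#!/bin/sh
# missing_in_container HOST_LIST CT_LIST: print device nodes that appear in
# the host listing but not in the container listing, one path per line.
missing_in_container() {
    # comm needs sorted input; -23 keeps lines unique to the first file.
    sort "$1" > "$1.sorted"
    sort "$2" > "$2.sorted"
    comm -23 "$1.sorted" "$2.sorted"
    rm -f "$1.sorted" "$2.sorted"
}

# On the Proxmox host:   ls -1d /dev/nvidia* > host.txt
# Inside the container:  ls -1d /dev/nvidia* > ct.txt
# Then, with both files copied to one place:
#   missing_in_container host.txt ct.txt
```

Empty output would mean every node present on the host is also visible inside the container. It says nothing about permissions, which `ls -l` would show.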

vapid coyote
#

This is the Immich Container

#

That's Proxmox

lunar idol
#

Just to test the passthrough, if you download ffmpeg within the container and try to transcode a video, does it do that properly?

vapid coyote
#

I’ll try that now

lunar idol
#

(Make sure to use the nvenc encoder to trigger GPU usage)

vapid coyote
#

Yeah did that and seems it failed on the container

#

ffmpeg -hwaccel cuda -i P1066118.MOV test.mp4 that was my command
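
Worth noting: `-hwaccel cuda` only requests GPU-accelerated decoding. Without `-c:v`, ffmpeg uses its default encoder for `.mp4`, so this command does not exercise NVENC. Below is a hedged sketch of a test that does, assuming the installed ffmpeg build includes `h264_nvenc`. `has_nvenc` is a hypothetical helper, and the file names are the ones from the command above.

```shell
#!/bin/sh
# has_nvenc: read `ffmpeg -encoders` output on stdin and succeed if the
# h264_nvenc encoder is listed.
has_nvenc() {
    grep -q 'h264_nvenc'
}

# Only attempt the transcode if ffmpeg exists and the input file is present.
if command -v ffmpeg >/dev/null 2>&1 && [ -f P1066118.MOV ]; then
    if ffmpeg -hide_banner -encoders 2>/dev/null | has_nvenc; then
        # Decode with CUDA and encode with NVENC so the GPU is used for both.
        ffmpeg -hwaccel cuda -i P1066118.MOV -c:v h264_nvenc test.mp4
    else
        echo "this ffmpeg build has no h264_nvenc encoder"
    fi
fi
```

Watching `nvidia-smi` in a second terminal while this runs should show an ffmpeg process if the passthrough works.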

lunar idol
#

I just noticed you're passing through driver libraries from your host; is there any reason for this?

vapid coyote
#

That's just what I found when searching online

#

The way I set it up was -

  1. Got the driver file from Nvidia.com. Ran it on the proxmox host.

  2. Ran the same file on the container but with the parameter to exclude building kernels since it was erroring out when I tried to install the same file because the lxc was not detecting the driver at all.

#

Also my reason to use the file from Nvidia.com was because of version discrepancy between what the pve host was installing from apt and the lxc.

#

Plus this way I assume less chance of breaking when updates are run

lunar idol
#

The driver in the Debian repository is a bit older, it's from February 2024, but for something like Immich I wouldn't be chasing the newest version (especially with a Debian-based OS); I'd want stability instead

#

If I were you I'd install the Nvidia driver from the apt repository and then try this all again and avoid mounting driver directories; just install nvidia-smi inside the LXC container and it will pull along any necessary libraries it needs

#

Let me know if you need help

#

If you install the Nvidia drivers through apt, it will undo the .run installation itself, if I remember correctly

vapid coyote
#

I get that - PVE was installing 570.20 and immich was installing 570.06 something.

In my mind it made sense to have the same driver versions so I ran the .run
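
A mismatch like this can be inspected directly, since the kernel module and the user-space libraries each carry their own version. A rough sketch, assuming the NVIDIA kernel module is loaded (so `/proc/driver/nvidia/version` exists) and that `libcuda.so.1` is registered with `ldconfig`. `first_version` is a hypothetical helper.

```shell
#!/bin/sh
# first_version: print the first dotted version number found on stdin.
first_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Kernel side: in an LXC container this is the host's kernel module.
if [ -r /proc/driver/nvidia/version ]; then
    echo "kernel module: $(first_version < /proc/driver/nvidia/version)"
fi

# User-space side: resolve libcuda.so.1 to its versioned file name.
lib=$(ldconfig -p 2>/dev/null | awk '/libcuda\.so\.1/ { print $NF; exit }')
if [ -n "$lib" ]; then
    echo "libcuda: $(readlink -f "$lib")"
fi
```

If the version embedded in the resolved `libcuda` file name differs from the kernel module version, that difference is the kind of mismatch being described here.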

#

But you mean to say - let the host have the drivers or not? Because isn't the container dependent on the host for the drivers?

lunar idol
#

Immich comes with its own driver blobs?

vapid coyote
#

wait a min, you mean to say the docker container should have the drivers in there?

hollow grove
#

FYI new drivers often break ML, don't try to be clever here and stick with the "old" ones

lunar idol
#

Break ML or break Debian; there's no winning here

#

The issue is that installing Nvidia drivers from anything that is not the repository is a hole deeper than hell

lunar idol
#

Or am I reading it wrong and staying with old is good?

#

I'd think see if repo driver works and if so, don't change, right?

hollow grove
#

I specifically said new 👀

vapid coyote
#

Well, I have cleared out the driver install. Will reboot and try what's suggested here once the badblock test on one of my disks ends.

lunar idol
#

Make sure to get rid of the library directory mounts; that should not be necessary if you have nvidia-smi installed within the container.

vapid coyote
#

I am so effin happy

#

What I did was nothing short of a full system reset -

Proxmox 9 was recently released, and I still use Proxmox 8. Decided imma install Proxmox 9 on a spare disk that will have none of my stuff on it besides the default and try passing through the GPU to a Windows VM. WORKED RIGHT AWAY

So I tried to recreate it on my actual boot disk but failed, spent the day backing up and then restoring my containers and vms to the new install and finally voila!!!

Don't know what was wrong in my old install that was messing the passthrough up so bad, but having attempted this so many times now over the past 10 days, I have nailed down the process and know exactly what is needed.

#

Should it have taken this long? - NO.

Does this remind me of the covid lockdown days, where teenage me spent a lot of time understanding and figuring out how networking, reverse tunnels and java works, so I could play minecraft with friends? - Absolutely.

vapid coyote
#

Seems my video transcoding is failing now

lunar idol
#

You're missing libcuda

#

Is this still an LXC container?

vapid coyote
#

Nope, it’s an Ubuntu server VM now.

lunar idol
#

Make sure the GPU is blacklisted on the Proxmox host

vapid coyote
#

Oh hmm could that also be why I had to restart docker earlier.

It had stopped detecting a cuda device according to the errors

lunar idol
#

lspci -nn on the host and paste the output

vapid coyote
lunar idol
#

Sorry, -nnk instead of -nn

vapid coyote
lunar idol
#

Looks good, it's attached to vfio
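
The same check can be scripted rather than read by eye. A minimal sketch, assuming the usual `lspci -nnk` layout where each device starts on an unindented line followed by indented detail lines. `gpu_kernel_drivers` is a hypothetical helper.

```shell
#!/bin/sh
# gpu_kernel_drivers: read `lspci -nnk` output on stdin and print
# "<slot> <driver>" for each NVIDIA device that has a kernel driver bound.
gpu_kernel_drivers() {
    awk '
        # Unindented line: start of a new device entry.
        /^[^ \t]/ { slot = $1; nvidia = ($0 ~ /NVIDIA/) }
        # Indented detail line naming the bound driver.
        nvidia && /Kernel driver in use:/ { print slot, $NF }
    '
}

# Run on the Proxmox host, where the passthrough binding happens.
if command -v lspci >/dev/null 2>&1; then
    lspci -nnk | gpu_kernel_drivers
fi
```

For VM passthrough, the driver column would be expected to read `vfio-pci` for the GPU (and its audio function, if it has one) rather than `nvidia` or `nouveau`.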

#

And the GPU is properly visible in the Windows VM? In device management?

vapid coyote
#

Well I have since gotten rid of the Windows VM, but it shows up nicely in my Ubuntu Server VM now.

It's chugging through the files

lunar idol
#

Nice!

vapid coyote
#

Just not sure why ffmpeg is misbehaving, especially considering what I have found.

Immich uses the jellyfin fork of ffmpeg and I was gonna setup jellyfin next.

lunar idol
#

Well, it says right in the output

vapid coyote
#

The extends portion was commented out, let's see if it works now
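
A quick way to spot this in a compose file is to search for `extends:` keys that are still commented out. This is a sketch under assumptions: the compose file is taken to be `./docker-compose.yml`, and `commented_extends` is a hypothetical helper.

```shell
#!/bin/sh
# commented_extends: read a compose file on stdin and print, with line
# numbers, every `extends:` key that is still commented out.
commented_extends() {
    grep -nE '^[[:space:]]*#[[:space:]]*extends:'
}

# Adjust the path if the compose file lives elsewhere.
if [ -f docker-compose.yml ]; then
    commented_extends < docker-compose.yml || echo "no commented-out extends found"
fi
```

Any line this prints is a service whose hardware acceleration file is not being pulled in, regardless of what the hwaccel file itself contains.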