#GPU support troubleshooting


edgy birch
#

Hi everyone.

Forgive my confusion, but I'm a bit at a loss. I'm trying to get GPU support to work inside dagger...

TL;DR: runc run failed: unable to start container process: error during container init: error running prestart hook #0: fork/exec /usr/bin/nvidia-container-runtime-hook: no such file or directory

Where is /usr/bin/nvidia-container-runtime-hook expected? Is it:

  1. In the dagger-engine container? Because if yes, it's not in the registry.dagger.io/engine:v0.20.8-gpu image
  2. Supposed to be passed to the engine by my system podman?
  3. In my target SDK image?

I'm not pretending to know everything, but option 3 seems unlikely, as the hook is supposed to run before my target container starts.
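One way to narrow down option 1 is to look for the hook binary directly inside the running engine container. This is only a minimal sketch: it assumes the engine container is named dagger-engine-v0.20.8 (the name used in this post) and that the engine image ships a `test` utility. The function name check_hook and the variables are hypothetical helpers, not part of dagger or podman.

```shell
# Assumptions: container name from this post, podman as the CLI.
ENGINE_CONTAINER="${ENGINE_CONTAINER:-dagger-engine-v0.20.8}"
CONTAINER_CLI="${CONTAINER_CLI:-podman}"
# Path taken verbatim from the runc error message.
HOOK_PATH="/usr/bin/nvidia-container-runtime-hook"

# Report whether the prestart hook is present and executable in the engine.
check_hook() {
  if "$CONTAINER_CLI" exec "$ENGINE_CONTAINER" test -x "$HOOK_PATH"; then
    echo "hook present in $ENGINE_CONTAINER"
  else
    echo "hook missing in $ENGINE_CONTAINER"
    return 1
  fi
}

# Usage: check_hook
```

If this reports the hook as missing, the error comes from the engine container itself rather than from the host or the target SDK image.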

I've confirmed the GPU is correctly passed to my regular podman containers, thanks to the CDI spec generated by nvidia-ctk cdi generate --output /etc/cdi/nvidia.yaml
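For reference, a host-side check of this kind can be sketched as follows. It assumes the CDI spec lives at /etc/cdi/nvidia.yaml as stated above and requests the GPU through the CDI device name nvidia.com/gpu=all. The test image and the function name check_cdi_gpu are assumptions chosen for illustration, and on SELinux hosts an extra --security-opt=label=disable may be needed.

```shell
# Assumptions: podman as the CLI and a generic Ubuntu image for the test.
CONTAINER_CLI="${CONTAINER_CLI:-podman}"
TEST_IMAGE="${TEST_IMAGE:-docker.io/library/ubuntu}"

# Start a throwaway container with all GPUs from the CDI spec and list them.
check_cdi_gpu() {
  "$CONTAINER_CLI" run --rm --device nvidia.com/gpu=all "$TEST_IMAGE" nvidia-smi -L
}

# Usage: check_cdi_gpu
```

This only exercises the host's podman and CDI setup; it says nothing about what the dagger engine does internally when it starts a GPU container.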

Here is how I've created the engine: podman run --gpus all -v /var/lib/dagger -d --privileged -e _EXPERIMENTAL_DAGGER_GPU_SUPPORT=true --name dagger-engine-v0.20.8 registry.dagger.io/engine:v0.20.8-gpu -- --debug

Thanks in advance for any help.

edgy birch
#

Alright, some follow-up: after manually installing nvidia-ctk and its dependencies inside the dagger engine container and running "nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place" (something to do with podman running "rootless"), it works.

I don't know why registry.dagger.io/engine:v0.20.8-gpu doesn't include that, though, as the documentation says it should "just work".

#

Trying some bigger workloads than just "hasgpu" now.

zealous hollow
edgy birch
#

Currently, I think that installing nvidia-ctk and its deps (libcap and libseccomp) and disabling cgroup support in nvidia-ctk (nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place) is enough to get "has-gpu = true" and to have the cuda test report the GPU (see attached screenshot).
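The configuration half of this workaround can be sketched as a command run against the engine container from the host. It assumes nvidia-ctk and its dependencies have already been installed inside the engine container (the install step is left out because it depends on the image's package manager) and that the container is named dagger-engine-v0.20.8 as in the first message. The function name disable_hook_cgroups is hypothetical.

```shell
# Assumptions: container name from this thread, podman as the CLI.
ENGINE_CONTAINER="${ENGINE_CONTAINER:-dagger-engine-v0.20.8}"
CONTAINER_CLI="${CONTAINER_CLI:-podman}"

# Set nvidia-container-cli.no-cgroups in the toolkit config inside the engine.
# Prerequisite (not shown): nvidia-ctk, libcap and libseccomp installed there.
disable_hook_cgroups() {
  "$CONTAINER_CLI" exec "$ENGINE_CONTAINER" \
    nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place
}

# Usage: disable_hook_cgroups
```

Changes made this way live in the container's writable layer, so they would be lost if the engine container is recreated from the image.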

The ollama test doesn't work anymore as the ollama download link (https://ollama.com/download/ollama-linux-amd64.tgz) is dead (also see screenshot).

I don't know how hard it is for you to add these to your engine Docker image, but it would be great.

Maybe I can help if the engine Dockerfile is on GitHub? If so, can you point me to its location?

I will be doing more testing in the coming weeks and I'll keep in touch.