#ML pipelines (or: using a GPU)


pulsar notch
#

Hi everyone!
I'm hoping to use Dagger for an internal ML pipeline, which unsurprisingly requires a GPU.
We have a working POV on a VM running the following setup:

Now from everything I've seen, BUILDKIT_HOST isn't something we should depend on.
So the question(s) are:

  1. Is there a recommended way?
  2. All of our workloads are running in GKE, so I wonder if there's a way to achieve this without resorting to an out-of-cluster VM.

Many many thanks in advance!

glass light
# pulsar notch Hi everyone! I'm hoping to use Dagger for an internal ML pipeline, which unsurpr...

👋

  1. There's no recommended way so far. @sacred ether opened an issue for this some time ago (https://github.com/dagger/dagger/issues/4675), but we haven't had the time to think about the right way to plug this into Dagger.
  2. AFAIK GKE can run node pools with NVIDIA devices, right? (https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) In that case you could just spawn a custom Dagger engine DaemonSet onto those nodes with the configuration you mentioned before, and that should give you GPU capabilities, right? What else did you have in mind?
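For readers following along, here is a rough sketch of what such a DaemonSet could look like, written as a Python dict that can be dumped to JSON for `kubectl apply -f -`. The node label `cloud.google.com/gke-accelerator`, the `nvidia.com/gpu` resource name and the matching taint are standard on GKE GPU node pools. Everything else (the function name, the DaemonSet name, the namespace, the image tag placeholder) is an assumption made for illustration. Scheduling the engine onto a GPU node is only the first step: whether containers started by the engine can then actually see the GPU is the open question in this thread.

```python
import json


def gpu_engine_daemonset(engine_image, accelerator, namespace="dagger"):
    """Build a DaemonSet manifest pinning a Dagger engine to GKE GPU nodes.

    Hypothetical helper: names and namespace are illustrative only.
    """
    labels = {"app": "dagger-engine-gpu"}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "dagger-engine-gpu", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Only schedule onto nodes from the GPU node pool.
                    "nodeSelector": {"cloud.google.com/gke-accelerator": accelerator},
                    # GKE taints GPU nodes; tolerate that taint.
                    "tolerations": [
                        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
                    ],
                    "containers": [
                        {
                            "name": "engine",
                            "image": engine_image,
                            # The engine runs containers itself, so it needs privileges.
                            "securityContext": {"privileged": True},
                            # Ask the NVIDIA device plugin for one GPU.
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # The image tag is a placeholder; pin it to the engine version you actually use.
    manifest = gpu_engine_daemonset("registry.dagger.io/engine:vX.Y.Z", "nvidia-tesla-t4")
    print(json.dumps(manifest, indent=2))
```

Piping the printed JSON into `kubectl apply -f -` would create the DaemonSet; the accelerator value has to match the GPU type of your node pool.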
#

also cc @soft lion @tawdry terrace and @timid lava since you're probably interested as well

pulsar notch
#

I guess I'm unsure whether the BuildKit(?) used by Dagger has access to the GPU.
Otherwise I wouldn't have jumped through hoops with containerd. I got here (https://github.com/NVIDIA/nvidia-container-runtime/issues/153) by link jumping from @sacred ether's issue

GitHub

I'm trying to have nvidia driver available during build which works with the default build command but not when using buildkit. I have this minimal Dockerfile FROM nvidia/cuda:11.1-base RUN ls ...

#

Also, I'm not sure I can modify the containerd configuration on the nodes themselves, or even whether nvidia-container-runtime is available there

#

And can I assume that BUILDKIT_HOST can be used until...further notice?

glass light
#

@pulsar notch happy to provide pointers if you're willing to explore the path of modifying our runc shim to add NVIDIA's one and see if that works

pulsar notch
#

sure, I'll be happy to. though you'll need to take into account that I've been exploring container runtimes for about 1 day 🙂 so...no Fu whatsoever

sacred ether
#

We are discussing this live right now fyi 🙂

#

with @tawdry terrace

pulsar notch
#

😄

#

I must have jinxed it. I see severe performance degradation on that VM, as if there's no access to the GPU

sacred ether
#

It does require non-trivial engineering. We will do this. But I don’t think there’s an easy workaround (yet)

#

Bottom line it seems we need to add GPUs to the DAG

#

Possibly with a native GPU type in the Dagger API (like we have for secrets etc)

#

This is necessary since GPU access needs to be carefully synchronized, and it affects the sequencing and concurrency of steps in the DAG
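To illustrate the synchronization point, here is a minimal sketch in plain Python `asyncio`, not Dagger code. It assumes a single GPU modeled as a semaphore with one slot; the step names and the `run_step` helper are hypothetical. Steps that need the GPU are serialized, while CPU-only steps stay concurrent, which is the kind of scheduling constraint a DAG engine would have to know about.

```python
import asyncio


async def run_step(name, needs_gpu, gpu_slots, log):
    """Run one pipeline step; GPU steps must hold a GPU slot while running."""
    if needs_gpu:
        # Only one GPU step at a time may be inside this block.
        async with gpu_slots:
            log.append(f"start {name}")
            await asyncio.sleep(0.01)  # stand-in for GPU work
            log.append(f"end {name}")
    else:
        await asyncio.sleep(0.01)  # stand-in for CPU-only work
        log.append(f"cpu {name}")


async def main():
    gpu_slots = asyncio.Semaphore(1)  # assumption: exactly one GPU available
    log = []
    # All three steps are launched concurrently; the semaphore decides
    # which of them actually overlap.
    await asyncio.gather(
        run_step("train", True, gpu_slots, log),
        run_step("lint", False, gpu_slots, log),
        run_step("evaluate", True, gpu_slots, log),
    )
    return log


if __name__ == "__main__":
    for event in asyncio.run(main()):
        print(event)
```

In the resulting log, every `start` of a GPU step is followed by its own `end` before another GPU step starts, while the CPU-only step is free to run alongside them.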

pulsar notch
#

thanks @sacred ether!
What's your take on BUILDKIT_HOST? Since it's no longer in the docs, can I rely on it?
Is it meant to be used by the SDK, or by the engine itself when I spin up its container myself?

sacred ether
#

That variable is no longer supported, because the Dagger engine now bundles a customized version of BuildKit - swapping in a vanilla BuildKit daemon at runtime would break certain features.

However, we do support an equivalent feature, the custom runner (still experimental). See https://github.com/dagger/dagger/blob/main/core/docs/d7yxc-operator_manual.md#what-are-the-steps-for-using-a-custom-runner

GitHub

A programmable CI/CD engine that runs your pipelines in containers - dagger/core/docs/d7yxc-operator_manual.md at main · dagger/dagger
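As a rough sketch of the custom runner setup mentioned above: to my knowledge, the linked operator manual points the client at a self-managed engine through the `_EXPERIMENTAL_DAGGER_RUNNER_HOST` environment variable, with values such as `docker-container://<name>` or `kube-pod://<pod>?namespace=<ns>`. Treat both the variable name and the URL schemes as assumptions to verify against the manual for your engine version. The helper names and the pod name below are hypothetical.

```python
import os
from urllib.parse import urlencode

# Assumption: variable name as given in the operator manual for custom runners.
RUNNER_HOST_VAR = "_EXPERIMENTAL_DAGGER_RUNNER_HOST"


def runner_host(scheme, name, **params):
    """Build a runner address such as kube-pod://my-pod?namespace=dagger."""
    url = f"{scheme}://{name}"
    if params:
        url += "?" + urlencode(params)
    return url


def runner_env(host, base_env=None):
    """Return a copy of the environment with the custom runner configured."""
    env = dict(os.environ if base_env is None else base_env)
    env[RUNNER_HOST_VAR] = host
    return env


if __name__ == "__main__":
    # Hypothetical pod name; use the engine pod running in your own cluster.
    host = runner_host("kube-pod", "dagger-engine-gpu-abc12", namespace="dagger")
    env = runner_env(host)
    print(f"{RUNNER_HOST_VAR}={env[RUNNER_HOST_VAR]}")
```

The returned environment can be passed to whatever process launches the pipeline, for example via `subprocess.run(..., env=env)`. As noted in the thread, this only selects which engine is used; it does not by itself provide GPU access.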

pulsar notch
#

thanks, but unfortunately I still can't GPU 🙂

sacred ether
#

Yeah, that doesn't solve GPU access. But we're working on it!

glass light
#

@sacred ether do you think it's worth investing time in the NVIDIA container runtime path? Or do we know that's a dead end given GPU access sync?

#

I'm unfamiliar with the specifics of GPU access from multiple processes

pulsar notch
#

@glass light @sacred ether I'll be happy to try and contribute if you guys think it's worth exploring

tawdry terrace
#

@glass light @pulsar notch it's being discussed here: https://github.com/dagger/dagger/issues/4675 - @snow cypress is going to take a shot at implementing basic support. The main issue is handling some workarounds for multi-tenancy, as described in the comments.