#mount custom devices
Just to confirm, you're aware of existing GPU support, and want to generalize it beyond specific hardcoded devices?
Yes.
We use things like Intel GPUs and AMD GPUs that don't use CUDA.
We also have things like funky network cards that have CPUs in them and support remote direct memory access.
I'm looking for a path to add these.
in a way that isn't hard coded for each repo.
Our approach has been to specifically not expose general device access in the API (where the caller can just pass the path as-is) because I worry that it will introduce fragmentation: "this module works on machines of this type, make sure to run this out-of-platform command on the host system", stuff like that.
So my preferred approach would be to generalize gradually, by expanding the abstraction, without removing it
For example I wouldn't mind having a WithKVM call
Maybe my fragmentation fear is excessive. You could argue that /dev/xx is already a well-defined API that won't cause fragmentation, no different from e.g. Linux syscalls. But I don't have enough confidence that that is true.
Fair enough, but my understanding from reading the GPU implementation is that 1) it requires a custom engine build, 2) it's really hard-coded for NVIDIA, and 3) if we had the ability to mount devices (and some host driver libraries) from the host, we wouldn't have to wait on upstream to add these.
While I agree that Nvidia is the leader commercially, 4 of the top 5 most powerful supercomputers in the world (including the top 2) do not use Nvidia accelerators. They use AMD or Intel accelerators, and in the case of Fugaku in Japan, a custom home-grown accelerator. All of the top 10 use network interconnects that require certain devices to be mounted instead of traditional TCP/IP, and that doesn't include some of the other accelerator types like TPUs, Cerebras wafers, etc. And as much as people try to make codes interoperable (it's amazing how interoperable PyTorch is), it's not perfect.
While there is value in modules being supported everywhere, for users like me, there is value in being able to use the same modules on different systems which share a given hardware type because of how allocations work on these machines.
@candid spade is there any chance we could have our cake and eat it too? Meaning 1) Dagger supports the diversity of devices that you need, out of the box, and 2) we don't open the Pandora's box of each host system's device API leaking into the Dagger API?
Could Dagger just support whatever is needed to use https://github.com/cncf-tags/container-device-interface?
In other words, we configure device support in the container runtime, and pass a list of device vendors to load in Dagger as strings?
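To make the "list of strings" idea concrete, here is a minimal sketch of how such strings could be validated before being handed to a runtime. It assumes the fully qualified CDI device name form `vendor/class=name` (for example `nvidia.com/gpu=all`). The `DeviceRef` type and `ParseDeviceRef` function are hypothetical names for illustration only; they are not part of Dagger or of the CDI libraries, and `vendor.example/nic=rdma0` is a made-up device.

```go
package main

import (
	"fmt"
	"strings"
)

// DeviceRef is a hypothetical holder for the three parts of a
// CDI-style device name. It is not a Dagger or CDI type.
type DeviceRef struct {
	Vendor string // e.g. "nvidia.com"
	Class  string // e.g. "gpu"
	Name   string // e.g. "all"
}

// ParseDeviceRef splits a string of the form "vendor/class=name".
// It only checks the overall shape, not the allowed character set.
func ParseDeviceRef(s string) (DeviceRef, error) {
	kind, name, ok := strings.Cut(s, "=")
	if !ok || name == "" {
		return DeviceRef{}, fmt.Errorf("%q: expected vendor/class=name", s)
	}
	vendor, class, ok := strings.Cut(kind, "/")
	if !ok || vendor == "" || class == "" {
		return DeviceRef{}, fmt.Errorf("%q: expected vendor/class before '='", s)
	}
	return DeviceRef{Vendor: vendor, Class: class, Name: name}, nil
}

func main() {
	// The second entry is an invented example device, the third is malformed.
	requested := []string{"nvidia.com/gpu=all", "vendor.example/nic=rdma0", "not-a-device"}
	for _, s := range requested {
		ref, err := ParseDeviceRef(s)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("vendor=%s class=%s name=%s\n", ref.Vendor, ref.Class, ref.Name)
	}
}
```

The point of the sketch is that the caller only names a device; how that name maps to device nodes, mounts, and driver libraries stays in the host's CDI configuration rather than in the Dagger API.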
That looks like the kind of abstraction that could help, yes. I think we actually discussed this exact spec with @glass coyote @whole cradle @earnest hemlock at some point
and @sleek coyote who contributed GPU support in the first place
To your other point about GPU access requiring a custom engine build: that is just because we're being cautious in rolling out changes to the engine. The goal is to unify so that we have only one engine build for everyone, and it supports GPU out of the box.
I don't know if we are following a pre-established calendar for getting there, or if maybe we just collectively dropped the ball and forgot. @vital drift would know
It's great to see the GPU variant of the Engine gaining popularity! Now that we see growing interest from the Dagger community in this use-case, I am keen to spend some time on continuing down this path, and trying the first next step in the https://github.com/cncf-tags/container-device-interface integration. Would you be up for a 30 mins sync @candid spade sometime next week? cc @gritty smelt
I'm in ET, would 4pm or 4:30pm on Tuesday work? @vital drift
Following-up via DM
hey folks - new to dagger and was wanting to port some integration tests of some virtualization code and found this thread - has there been any movement on being able to support KVM inside dagger containers (i.e. passing host devices through to the container such as /dev/kvm)?
/dev/kvm is one of the "special" mountpoints that gets passed to the containers if you set the InsecureRootCapabilities flag in WithExec
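For integration tests that depend on this, it may help to check inside the container that the device node is actually visible before running anything that needs it. Below is a minimal sketch using only the Go standard library. The `isCharDevice` helper is a hypothetical name, and the sketch assumes /dev/kvm shows up as a character device, as it does on a typical Linux host. It does not call the Dagger SDK.

```go
package main

import (
	"fmt"
	"os"
)

// isCharDevice reports whether path exists and is a character device node.
// Hypothetical helper for this sketch, not part of any Dagger API.
func isCharDevice(path string) bool {
	info, err := os.Stat(path)
	if err != nil {
		return false // missing or not accessible
	}
	return info.Mode()&os.ModeCharDevice != 0
}

func main() {
	const kvm = "/dev/kvm"
	if !isCharDevice(kvm) {
		fmt.Println(kvm, "is not available; skipping virtualization tests")
		return
	}
	fmt.Println(kvm, "is present")
}
```

Note that this only confirms the node exists; opening it can still fail if the process lacks permission.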