So, I have a serverless endpoint that uses a customized image of runpod/worker-comfyui:5.7.1-base. I say “customized,” but basically I just modified the handler to return a different value and added my custom nodes/models.
I would like to use SageAttention 2.1 on this endpoint, but I am running into the following problem: SageAttention has to be compiled against a specific CUDA version before ComfyUI can use it, except that endpoints can run on machines with different CUDA versions.
I'm very new to this, but from what I understand, the runpod/worker-comfyui image itself is based on nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04. If I want to use SageAttention, do I have to compile it myself against CUDA 12.6.3 and then install the resulting .whl file with pip install filename.whl?
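To check my own reading of that base image tag, here is a minimal Python sketch. The helper names parse_cuda_tag and has_nvcc are hypothetical ones I made up. It assumes the tag layout is <major>.<minor>.<patch>-<flavor>-<os>, and it relies on the fact that in the official nvidia/cuda images only the devel flavors include the nvcc compiler. If that is right, the runtime image on its own could not compile SageAttention.

```python
from typing import NamedTuple


class CudaImageTag(NamedTuple):
    major: int
    minor: int
    patch: int
    flavor: str   # e.g. "base", "runtime", "devel", "cudnn-runtime", "cudnn-devel"
    os_name: str  # e.g. "ubuntu24.04"


def parse_cuda_tag(image: str) -> CudaImageTag:
    """Hypothetical helper: split an image reference such as
    'nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04' into its parts.
    Assumes the layout <major>.<minor>.<patch>-<flavor>-<os>."""
    _, sep, tag = image.rpartition(":")
    if not sep:
        raise ValueError(f"no tag in image reference: {image!r}")
    version, _, rest = tag.partition("-")      # "12.6.3", "cudnn-runtime-ubuntu24.04"
    flavor, _, os_name = rest.rpartition("-")  # "cudnn-runtime", "ubuntu24.04"
    parts = version.split(".")
    if len(parts) != 3 or not flavor:
        raise ValueError(f"unexpected nvidia/cuda tag: {tag!r}")
    major, minor, patch = (int(p) for p in parts)
    return CudaImageTag(major, minor, patch, flavor, os_name)


def has_nvcc(tag: CudaImageTag) -> bool:
    # In the official nvidia/cuda images, only the devel flavors ship nvcc.
    return tag.flavor.endswith("devel")


base = parse_cuda_tag("nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04")
print(base.major, base.minor, base.patch, base.flavor)  # 12 6 3 cudnn-runtime
print(has_nvcc(base))  # False
```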
And if that works, SageAttention will work in my image, but will it also work on a serverless endpoint that has, for example, CUDA 12.7 or 12.8?
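Here is how I currently picture the version question, as a hedged Python sketch with hypothetical helper names (parse_version, wheel_runs_on_host). It assumes NVIDIA's backward-compatibility rule: a binary built with a given CUDA toolkit runs on a host whose driver supports that CUDA version or a newer one, where "supports" means the "CUDA Version" that nvidia-smi shows on the host. The container brings its own CUDA runtime (12.6 in this image), and the host only supplies the driver. The sketch deliberately ignores CUDA minor version compatibility, so an older 12.x driver is reported as "not guaranteed" rather than "impossible". It also ignores GPU architecture, which the wheel has to be compiled for as well.

```python
from typing import Tuple

Version = Tuple[int, int]


def parse_version(text: str) -> Version:
    """Turn '12.6.3' or '12.8' into (major, minor); the patch level is ignored."""
    parts = text.strip().split(".")
    return int(parts[0]), int(parts[1])


def wheel_runs_on_host(built_with: str, driver_supports: str) -> bool:
    """Conservative check (a simplification, not NVIDIA's full rule set):
    the host driver's maximum supported CUDA version must be >= the CUDA
    version the wheel was compiled with. Tuples compare lexicographically,
    so (12, 8) >= (12, 6) is True and (12, 4) >= (12, 6) is False."""
    return parse_version(driver_supports) >= parse_version(built_with)


# Wheel compiled inside the 12.6.3 image, checked against different hosts.
for host in ("12.6", "12.7", "12.8", "12.4"):
    print(host, wheel_runs_on_host("12.6.3", host))
# 12.6 True
# 12.7 True
# 12.8 True
# 12.4 False
```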
Sorry to the experts if I'm talking nonsense; I'm just trying to understand how to do this ^^.
Thank you!