When encoding video with ffmpeg, nvenc does not work. | Runpod

mint fossil Sep 9, 2025, 4:32 AM

#

DC:US-NC-1
GPU:RTX 5090

I have switched data centers to US-IL-1 in addition to US-NC-1, but the results remain the same.

cmd
-f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y

ffmpeg version 7.1.1 Copyright (c) 2000-2025 the FFmpeg developers
  built with gcc 13 (Ubuntu 13.3.0-6ubuntu2~24.04)
  configuration: --disable-debug --disable-doc --disable-ffplay --enable-alsa --enable-cuda-llvm --enable-cuvid --enable-ffprobe --enable-gpl --enable-libaom --enable-libass --enable-libdav1d --enable-libfdk_aac --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-libkvazaar --enable-liblc3 --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-libplacebo --enable-librav1e --enable-librist --enable-libshaderc --enable-libsrt --enable-libsvtav1 --enable-libtheora --enable-libv4l2 --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpl --enable-libvpx --enable-libvvenc --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-nonfree --enable-nvdec --enable-nvenc --enable-opencl --enable-openssl --enable-stripping --enable-vaapi --enable-vdpau --enable-version3 --enable-vulkan
  libavutil      59. 39.100 / 59. 39.100
  libavcodec     61. 19.101 / 61. 19.101
  libavformat    61.  7.100 / 61.  7.100
  libavdevice    61.  3.100 / 61.  3.100
  libavfilter    10.  4.100 / 10.  4.100
  libswscale      8.  3.100 /  8.  3.100
  libswresample   5.  3.100 /  5.  3.100
  libpostproc    58.  3.100 / 58.  3.100
Input #0, lavfi, from 'testsrc=duration=5:size=1280x720:rate=30':
  Duration: N/A, start: 0.000000, bitrate: N/A
  Stream #0:0: Video: wrapped_avframe, rgb24, 1280x720 [SAR 1:1 DAR 16:9], 30 fps, 30 tbr, 30 tbn
Stream mapping:
  Stream #0:0 -> #0:0 (wrapped_avframe (native) -> h264 (h264_nvenc))
Press [q] to stop, [?] for help
[h264_nvenc @ 0x5b844ea0d440] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x5b844ea0d440] No capable devices found
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Error while opening encoder - maybe incorrect parameters such as bit_rate, rate, width or height.
[vf#0:0 @ 0x5b844ea2afc0] Error sending frames to consumers: Generic error in an external library
[vf#0:0 @ 0x5b844ea2afc0] Task finished with error code: -542398533 (Generic error in an external library)
[vf#0:0 @ 0x5b844ea2afc0] Terminating thread with return code -542398533 (Generic error in an external library)
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Could not open encoder before EOF
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Task finished with error code: -22 (Invalid argument)
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Terminating thread with return code -22 (Invalid argument)
[out#0/mp4 @ 0x5b844ea0f540] Nothing was written into output file, because at least one of its streams received no packets.
frame=    0 fps=0.0 q=0.0 Lsize=       0KiB time=N/A bitrate=N/A speed=N/A
Conversion failed!

This is the result I got running on my RTX 4090. There are no issues with the container image and command.

docker run --rm -it --gpus=all \
  -v "$(pwd)":/config \
  linuxserver/ffmpeg:7.1.1 \
  -hwaccel cuda -hwaccel_device 0 -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y
  ...
        encoder         : Lavc61.19.101 h264_nvenc
      Side data:
        cpb: bitrate max/min/avg: 0/0/2000000 buffer size: 4000000 vbv_delay: N/A
[out#0/mp4 @ 0x619f0fd1afc0] video:196KiB audio:0KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 1.333094%
frame=  150 fps=0.0 q=8.0 Lsize=     199KiB time=00:00:04.90 bitrate= 332.2kbits/s speed=27.8x

radiant prawn Sep 9, 2025, 5:43 AM

#

NVENC and ffmpeg are very sensitive to your node's specific driver version. When you're deploying your pod, use the advanced filter to select CUDA 12.8 and 12.9.

mint fossil Sep 9, 2025, 10:20 AM

#

radiant prawn NVENC and ffmpeg are very very sensitive to your nodes specific driver version. ...

I have already made that specification.
It clearly did not work with either 12.8 or 12.9.
I built ffmpeg from scratch inside the container, but that made no difference.
None of the ffmpeg versions I tried worked.

Please tell me which data centers have an RTX 5090 where NVENC works.
If it is my issue, I will gladly take care of it.

radiant prawn Sep 9, 2025, 4:03 PM

#

NVENC is a chip on the graphics card included on every NVIDIA device produced after 2006. Have you tried any other ffmpeg version (8.0+) or any of the mainline alternative builds?

https://github.com/BtbN/FFmpeg-Builds/releases/tag/latest

You'll want ffmpeg-master-latest-linux64-lgpl.tar.xz. You can extract this with

tar xf ffmpeg-master-latest-linux64-lgpl.tar.xz

GitHub

Release Latest Auto-Build (2025-09-09 13:41) · BtbN/FFmpeg-Builds

tardy verge Sep 9, 2025, 4:32 PM

#

radiant prawn NVENC is a chip on the graphics card included on every NVIDIA device produced af...

from what I know the AI training chips (A100, H100) don't have NVENC

#

only NVDEC

#

but all consumer & workstation grade cards do

#

https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new

NVIDIA Developer

Video Encode and Decode GPU Support Matrix

Find the related video encoding and decoding support for all NVIDIA GPU products.

#

works with RTX 2000 Ada

#

doesn't work with 5090

#

#

same error with community cloud

#

https://obsproject.com/forum/threads/obs-not-working-with-rtx-5090-nv_enc_err_invalid_device.184606/

OBS Forums

OBS not working with RTX 5090 (NV_ENC_ERR_INVALID_DEVICE)

Hey!
I'm not entirely sure if this is going to be an isolated case or a bigger issue across other RTX 5090 Cards.
I'll gladly help find a solution for this & hopefully will help find a fix for other 50 Cards users as well.

I have spent over a week working around with OBS in an attempt to fix...

#

looks like a driver issue

mint fossil Sep 9, 2025, 9:28 PM

#

radiant prawn NVENC is a chip on the graphics card included on every NVIDIA device produced af...

Yes. I have tried everything you pointed out and confirmed that none of it works.
Of course, I am also using ffmpeg-master-latest-linux64-lgpl.tar.xz.

mint fossil Sep 9, 2025, 9:32 PM

#

tardy verge looks like a driver issue

No, I don’t think so. This is a data center issue.
On the RTX 4090, the same error occurs in US-IL-1, EUR-NO-1, and EUR-IS-2, but encoding works normally on EU-RO-1.

Also, some RTX 5090s have been confirmed to work. However, it’s no longer possible to get assigned to that pod.

tardy verge Sep 10, 2025, 1:44 AM

#

That's strange

#

But i heard that runpod is using early driver versions

#

So I thought it was a driver issue

mint fossil Sep 10, 2025, 4:12 AM

#

tardy verge That's strange

I will present evidence that supports my claim.
This is a serverless endpoint that runs the command
-f lavfi -i testsrc=duration=600:size=1920x1080:rate=60 -c:v h264_nvenc -preset p1 -b:v 10M -pix_fmt yuv420p -f null - -benchmark -stats
using the linuxserver/ffmpeg:version-8.0-cli image on a serverless worker.

The bad workers show the same error I reported, while the properly functioning workers correctly display encoding speed logs.
Therefore, this cannot be concluded as a driver issue, since there are GPU servers that operate normally.

The four attached screenshots are logs from the workers that function correctly.
The fifth image shows a list of both healthy and unhealthy workers. (All unhealthy ones output the error I posted and then stopped processing.)

And as an important detail, I have confirmed that there is no difference in the Driver Version between the bad workers and the healthy ones.
Therefore, this is a data center issue.

mint fossil Sep 10, 2025, 4:16 AM

#

radiant prawn NVENC and ffmpeg are very very sensitive to your nodes specific driver version. ...

I have presented evidence that this is a data center issue. Please investigate. I would appreciate it if you could escalate the ticket.

tardy verge Sep 10, 2025, 4:31 AM

#

this is strange

#

NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9: work (EU-RO-1)
NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8: no work (EUR-IS-2)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: no work (EUR-IS-1)
NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9: no work (EU-RO-1)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: work (EUR-IS-1)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: work (EUR-IS-1)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: no work (EUR-IS-1)
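One way to check whether the outcome tracks the driver version is to tally the observations in the list above per driver. This is only a minimal Python sketch: the tuples are transcribed by hand from that list, and `tally_by_driver` and `mixed_drivers` are hypothetical helper names, not part of any tool mentioned in this thread.

```python
from collections import defaultdict

# (driver version, data center, NVENC worked?) transcribed from the list above.
OBSERVATIONS = [
    ("575.57.08", "EU-RO-1", True),
    ("570.172.08", "EUR-IS-2", False),
    ("570.144", "EUR-IS-1", False),
    ("575.57.08", "EU-RO-1", False),
    ("570.144", "EUR-IS-1", True),
    ("570.144", "EUR-IS-1", True),
    ("570.144", "EUR-IS-1", False),
]


def tally_by_driver(observations):
    """Return {driver: [worked_count, failed_count]}."""
    counts = defaultdict(lambda: [0, 0])
    for driver, _datacenter, worked in observations:
        counts[driver][0 if worked else 1] += 1
    return dict(counts)


def mixed_drivers(observations):
    """Return the drivers that were seen both working and failing."""
    tally = tally_by_driver(observations)
    return sorted(d for d, (ok, bad) in tally.items() if ok and bad)


if __name__ == "__main__":
    for driver, (ok, bad) in sorted(tally_by_driver(OBSERVATIONS).items()):
        print(f"{driver}: {ok} worked, {bad} failed")
```

With this small sample, a driver that shows up in `mixed_drivers` has both working and failing hosts, which is the pattern being questioned here. Seven data points are far too few to draw a firm conclusion.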

#

it doesn't depend on driver version?

#

seems like EU-RO-1 uses newer drivers

#

but it sometimes doesn't work there too

mint fossil Sep 10, 2025, 4:51 AM

#

tardy verge it doesn't depend on driver version?

Yes. NVENC reacts sensitively to things like driver version, but even with exactly the same version, in exactly the same DC, and with exactly the same configuration, differences in behavior can be observed.
As you know, EUR-IS-2 is 570.172.08, but there are workers operating normally on 570.172.08, and there are also bad workers on 570.172.08.
The attached image shows information from a worker that operated normally.

tardy verge Sep 10, 2025, 4:52 AM

#

Yeah I can confirm

#

And I didn't know that the Runpod dashboard shows the driver version

radiant prawn Sep 10, 2025, 9:37 AM

#

mint fossil I have presented evidence that this is a data center issue. Please investigate. ...

I really appreciate you looking into this. I'll have this escalated.

radiant prawn Sep 10, 2025, 9:37 AM

#

tardy verge NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 1...

Are you able to share the ids of these deployments or no? It's okay if not I can get it all manually.

tardy verge Sep 10, 2025, 9:47 AM

#

8sv5k7ublivjhq

#

should be the serverless endpoint ID

mint fossil Sep 12, 2025, 5:01 AM

#

radiant prawn I really appreciate you looking into this. I'll have this escalated.

Thank you very much. Since I would like to track the situation, please escalate from this thread and issue a ticket.

#

skwe5i0dbvkzeu
This is the serverless endpoint ID that I presented as evidence.

sick helmBOT Sep 12, 2025, 11:25 AM

#

mint fossil skwe5i0dbvkzeu This is the serverless endpoint ID that I presented as evidence.

@mint fossil

Escalated To Zendesk

The thread has been escalated to Zendesk!

rose shoal Sep 14, 2025, 10:02 PM

#

Any news on this issue?

mint fossil Sep 15, 2025, 3:31 AM

#

rose shoal Any news on this issue?

In the support ticket, they reported:
I’ve already escalated this case to our reliability team for deeper review.

It seems they are currently investigating.

I will share any new information here as soon as it becomes available.

mint fossil Sep 17, 2025, 3:14 AM

#

Runpod investigated further in good faith and provided support.
I will share the details of the ticket.

Following up on your request, we’ve reproduced the NVENC failure you reported across multiple regions and GPU types. After investigation, we’ve classified this as an upstream issue with FFmpeg/NVIDIA. Specifically, the problem appears to stem from how device indices are mapped inside containers (when /dev/nvidia* devices don’t align with nvidia-smi indices).
This behavior matches several active upstream reports:
https://trac.ffmpeg.org/ticket/11694
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1249
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1209
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1197
https://github.com/NVIDIA/k8s-device-plugin/issues/1282
Since the root cause lies upstream, a permanent fix will need to come from the FFmpeg/NVIDIA teams. That said, we’ll continue to monitor developments closely and will keep you updated on any relevant progress or workarounds.

quaint moss Oct 11, 2025, 6:49 AM

#

I am seeing the same issue @mint fossil. Thank god I found this. I thought I was going crazy.

Seems to be a random roll of the dice on whether a pod will work or not. That being said, I have not seen any of these errors on the serverless endpoints.

I wonder if it is safe to depend on the serverless endpoints for large batches of requests or if this only affects the pods?

#

I will read the issues mentioned in the runpod support response. Perhaps we can just ship our own binaries that have a fix.

quaint moss Oct 11, 2025, 7:10 AM

#

Seeing a lot of talk about the 570 driver being the culprit. When I launch RTX PRO 6000 pods I always get nvidia driver version >= 580 and have yet to see the issue on there. But that could be anecdotal.

It seems there is nothing we can do to fix the problem itself. But I am going to try getting the /dev/nvidia# and passing that into ffmpeg with ffmpeg -hwaccel_device 0

Maybe that will work. Going to bed but will continue tinkering with it tomorrow.

tardy verge Oct 11, 2025, 7:30 AM

#

tardy verge NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 1...

this might support that claim

#

only servers with driver version 570 didn't work

quaint moss Oct 11, 2025, 5:45 PM

#

Yeah as I understand the issue, it's with the way multiple gpu servers start the docker container. Whatever fancy way runpod spins them up, they are probably doing something like device=1 if the first gpu is already being used in another container.

As I found out last night, you can spin up the exact same image over and over on the same gpu type. Seems like unless you get gpu 0, you get this problem. The thing is, I have not seen this issue even once with RTX PRO 6000 machines.

#

I'm gonna see right now if I can reproduce the issue, and then work around it by specifying which device to enumerate in ffmpeg

quaint moss Oct 11, 2025, 6:16 PM

#

export CUDA_VISIBLE_DEVICES=0; ffmpeg ... seems to have some effect. At least I get a different error when I set different device ids.

Just for fun, I tried symlinking /dev/nvidia1 to /dev/nvidia0 and it had no effect.

I am kind of out of ideas on how to fix this. It feels like if you get a set of circumstances, you just can't work around it.

You select a gpu pod that isn't having all the gpus passed to it
Someone else is using gpu 0
You are on driver ~570 (although I haven't confirmed this)

#

Curious why I never saw this on the serverless endpoints. That's what I intend to use anyway, so if it works there, then no big deal. But I can't really finish my project if there's a random chance each serverless endpoint worker might fail too.

#

Btw. I just launched a pod and got the same exact scenario where I got /dev/nvidia1 but since it is driver 580, it seems to work just fine

tardy verge Oct 11, 2025, 6:37 PM

#

So the only workaround is using a different gpu?

#

Or maybe you can set higher cuda version

#

To avoid 570 drivers

quaint moss Oct 11, 2025, 9:45 PM

#

I guess we can try. But I didn't think they were related. Like if I use a container image that is 12.2, then I get into a pod using an RTX PRO 6000, it will tell me its running cuda 13

#

I think that filter when you are making a pod is exactly that. Just a filter, showing which gpus are compatible with that version of cuda you've selected. 13 isn't even selectable on there and no matter what I choose, it seems to just have whatever the host system gives you when it is launching the container. Which makes sense.

mint fossil Oct 11, 2025, 11:02 PM

#

quaint moss Curious why I never saw this on the serverless endpoints. That's what I intend t...

Curious why I never saw this on the serverless endpoints.

I just re-tested this issue on the serverless endpoints, and I can confirm that the same problem is still occurring as before.

quaint moss Oct 12, 2025, 12:52 AM

#

Oh really 🤔

#

Maybe because I only really tested on RTX PRO 6000’s

#

I’ve yet to see one of those gpu’s have this issue. And all have been on driver >= 580

tardy verge Oct 12, 2025, 6:18 AM

#

quaint moss I think that filter when you are making a pod is exactly that. Just a filter, sh...

it's the host cuda version

#

when you do nvidia-smi it will print out the cuda version and the driver version

#

and maybe RTX PRO 6000s don't support lower version of drivers so they don't get 570 driver version

quaint moss Oct 12, 2025, 7:01 AM

#

Yeah. It seems to print out whatever version the host uses. So your base image can be like 12.2 and the host can be 13 as reported by nvidia-smi. But the filter selector when creating a pod seems to have no bearing on what version of cuda the host system uses. Like if I select 12.4 I will get 12.2 installed by my image and 13 according to nvidia-smi

tardy verge Oct 12, 2025, 7:23 AM

#

that's weird

#

I always got what I asked for

#

but if you set it to 12.8 it should give 12.8+

#

at least

#

and that should avoid 570 drivers

quaint moss Oct 13, 2025, 11:10 PM

#

There's a chance I don't know what I am talking about. But I am pretty confident that the cuda filter does absolutely nothing besides filter which gpu's are compatible with the version of cuda you need.

#

Unrelated to that though, I just saw my first failure on the serverless endpoints because I got an RTX 5090 running driver 570.

quaint moss Oct 13, 2025, 11:39 PM

#

Weird. I just had an RTX PRO 6000 worker run one of my tasks and it was running driver 570.195.03. It worked

#

So I must have had gpu #0 I guess.

tardy verge Oct 14, 2025, 12:33 AM

#

Maybe related to this

#

https://discord.com/channels/912829806415085598/1427414138576961640

quaint moss Oct 14, 2025, 12:52 AM

#

Great find!

tardy verge Oct 14, 2025, 12:59 AM

#

can you try setting cuda version

#

just in case it works

quaint moss Oct 14, 2025, 2:33 AM

#

Where at? In my image or in the selector when editing the pod/worker?

tardy verge Oct 14, 2025, 4:21 AM

#

here

#

#

Probably to cuda 12.8

quaint moss Oct 14, 2025, 7:56 PM

#

That selector seems to somewhat help at choosing the host's cuda version. But there are times when I will choose a specific version and get back a different one.

I just tested three pods. Two gave me 12.8 and the last one gave me 13

tardy verge Oct 15, 2025, 2:36 AM

#

I think it's 12.8+

#

So you are getting 13 too

#

Since cuda should be backwards compatible

#

And newer cuda version == newer drivers

#

So this should help in getting newer drivers

quaint moss Oct 15, 2025, 3:28 AM

#

I think you were right that the selector works like that. I just noticed if I select 12.8 it tells me that the RTX PRO 6000 pods are unavailable, but if I select 12.9 they are available.

So that case is closed

#

But related to this issue. If I get driver 570 there’s a chance it doesn’t work.

But if I select 12.9 and get driver version 580 with the same gpu, it works.

#

I guess it’s possible that it’s anecdotal and I’ve just been lucky to get gpu number 0. But I am pretty confident driver 580 doesn’t have this issue

tardy verge Oct 15, 2025, 3:32 AM

#

Then selecting 12.9 will make you avoid 570 driver

#

Because 12.9 is 580+

tardy verge Oct 15, 2025, 3:33 AM

#

quaint moss I think you were right that the selector works like that. I just noticed if I se...

@radiant prawn is this a bug?

#

Either this is a bug or 12.8 selector giving cuda13 is a bug

quaint moss Oct 15, 2025, 3:59 AM

#

# 5090
🟥 Driver Version: 570.153.02 CUDA Version: 12.8
🟥 Driver Version: 575.57.08 CUDA Version: 12.9

# RTX PRO 6000
🟥 Driver Version: 575.57.08 CUDA Version: 12.9 /dev/nvidia5
✅ Driver Version: 575.57.08 CUDA Version: 12.9 /dev/nvidia0
✅ Driver Version: 570.195.03 CUDA Version: 12.8 /dev/nvidia1 <-- ??

# RTX PRO 6000 WK
✅ Driver Version: 575.51.03 CUDA Version: 12.9 /dev/nvidia2

# RTX PRO 6000 WK (no cuda selection)
✅ Driver Version: 575.51.03 CUDA Version: 12.9 /dev/nvidia2

# RTX PRO 6000 WK NORTH AMERICA (no cuda selection)
✅ Driver Version: 580.65.06 CUDA Version: 13.0 /dev/nvidia5

#

No clue how to get it to give me 580 with cuda 13

#

It seems random

#

I tried selecting nothing, which is what I've done in the past to get it to give me 580. It gave me 12.9. So maybe it's just random or a regional thing.

#

Boom

#

Selected north america

#

🇺🇸

#

Driver Version: 580.65.06 CUDA Version: 13.0

#

My serverless endpoint workers are all NA except one. But I don't think you can choose where your workers come from.

#

Oh yes you can. In the advanced section.

#

Perfect. So if we can get a list of what each regions pods are running for cuda, I could technically go to production with this. maybe

tardy verge Oct 15, 2025, 4:19 AM

#

https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-86-15/index.html

Version 570.86.15(Linux)/572.13(Windows) :: NVIDIA Data Center GPU ...

Release notes for the Release 570 family of NVIDIA® Data Center GPU Drivers for Linux and Windows.

#

So its cuda 12.x?

#

Im confused now

quaint moss Oct 15, 2025, 4:21 AM

#

I’m confused too 😂

#

My working theory is that driver 580 seems to not have this problem.

However, it seems to not happen at all on the RTX PRO 6000 WK

#

But the only way I can get driver 580 has been to use that card. So 🤷

tardy verge Oct 15, 2025, 4:23 AM

#

quaint moss But the only way I can get driver 580 has been to use that card. So 🤷

How about Cuda 13?

#

Does that give you driver 58X consistently?

quaint moss Oct 15, 2025, 4:24 AM

#

Every time so far, yes

#

If I see 580 I see cuda 13

tardy verge Oct 15, 2025, 4:26 AM

#

https://docs.nvidia.com/deeplearning/cudnn/backend/latest/reference/support-matrix.html

#

i think this is the answer

quaint moss Oct 15, 2025, 4:26 AM

#

If you look at my table above, some of it is a friggin mystery still.

Like that one time I got the 3rd nvidia card in my container, and it still worked

tardy verge Oct 15, 2025, 4:27 AM

#

#

so if you pick cuda 13

#

you get nvidia driver >= 580

quaint moss Oct 15, 2025, 4:27 AM

#

I wish I could pick cuda 13

#

Maybe runpod can tell us which regions are running 580/13.0

tardy verge Oct 15, 2025, 4:28 AM

#

yeah

quaint moss Oct 15, 2025, 4:29 AM

#

But right now if you select north America and don’t pick any cuda version, I have gotten 580/13 100% of the time. But that’s like 15 locations so I definitely haven’t confirmed if all of them have it.

So I will need runpod to confirm so I can filter my serverless endpoint to only select those.

#

But again, I still don’t know how some of those tests I ran worked. They were not getting the “default” gpu 0 and they were not on 580 and they worked.

What’s real? Is the sky blue? Are birds real?

tardy verge Oct 15, 2025, 4:30 AM

#

oh

#

wait

#

it works with API

#

@quaint moss

#

quaint moss Oct 15, 2025, 4:31 AM

#

I was gonna check that. See if I can use the api to specify cuda 13

tardy verge Oct 15, 2025, 4:31 AM

#

they just didn't give that option in the UI

quaint moss Oct 15, 2025, 4:31 AM

#

You are a beauty

tardy verge Oct 15, 2025, 4:32 AM

#

lol

quaint moss Oct 15, 2025, 4:32 AM

#

So you just set it to cuda: 13 and it made that?

tardy verge Oct 15, 2025, 4:32 AM

#

this

#

#

works

quaint moss Oct 15, 2025, 4:32 AM

#

Hell yeah

tardy verge Oct 15, 2025, 4:33 AM

#

and setting that to 14.0 fails

#

so they are doing some kind of checking

#

even if the option is invalid

#

13.1 also fails

#

like this

#

quaint moss Oct 15, 2025, 4:39 AM

#

That was you trying to make a cuda 13 instance or an invalid version number?

tardy verge Oct 15, 2025, 4:41 AM

#

invalid version number

#

cuda 13.1 or cuda 14 requested returns that error

#

cuda 13.0 (you need the .0) works fine

#

and pods spun up like that have cuda 13.0
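The point about needing the `.0` can be captured in a small helper that normalizes a version string before it goes into a request. This is a hedged sketch only: the thread does not show the actual request that was sent, so the field name `allowedCudaVersions` is an assumption to check against Runpod's API reference, and both function names are hypothetical. Nothing is sent to any API here.

```python
import json
import re


def normalize_cuda_version(version):
    """Return a 'major.minor' string; a bare major such as '13' becomes '13.0'."""
    text = str(version).strip()
    if re.fullmatch(r"\d+", text):
        text += ".0"
    if not re.fullmatch(r"\d+\.\d+", text):
        raise ValueError(f"not a major.minor CUDA version: {version!r}")
    return text


def cuda_filter_fragment(versions, field="allowedCudaVersions"):
    """Build the JSON fragment carrying the CUDA filter.

    Assumption: the field name is a guess at the request shape, not confirmed
    by this thread.
    """
    return {field: [normalize_cuda_version(v) for v in versions]}


if __name__ == "__main__":
    print(json.dumps(cuda_filter_fragment(["13"])))
```

This only checks the format. According to the messages above, values such as 13.1 and 14.0 are rejected on the server side, which a local check like this cannot reproduce.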

quaint moss Oct 15, 2025, 4:47 AM

#

This is great. This only leaves some type of confirmation of what is causing this.

Or not what is causing it, but what is causing it to work. I keep wondering… just because I haven’t seen the error on driver 580 doesn’t mean I won’t. I feel like I could just as easily say RTX PRO 6000 WK’s don’t have the issue either.

tardy verge Oct 15, 2025, 4:48 AM

#

yeah you have to test if it works in that driver version

quaint moss Oct 15, 2025, 4:49 AM

#

I guess I could automate a test for this and just smash the api with it

#

I guess I’ve been assuming that if I get a pod and see /dev/nvidia5, that means I’ve got GPU index 5 (the sixth GPU)

#

But in a few examples above when I was testing earlier I got some successes with /dev/nvidia1 and /dev/nvidia2

so maybe that device number doesn’t mean the index? If so, how could we tell?

tardy verge Oct 15, 2025, 4:56 AM

#

https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/336

GitHub

who creates /dev/nvidia0 · NVIDIA open-gpu-kernel-modules · Discu...

Thank you very much for your answer. The problem I encountered is: I can get the PCI device number of NVIDIA graphics card, such as (81:00.0). I want to use this device number to correspond to my l...

quaint moss Oct 15, 2025, 5:14 AM

#

Hmm. I wonder if that device minor number is visible from inside the container
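A quick way to see what the container actually exposes is to list the `/dev/nvidiaN` character devices and read their minor numbers. This is a minimal Python sketch with hypothetical helper names. It only reports what is visible inside the container; whether the minor number lines up with the host's `nvidia-smi` index is exactly the open question here, and this does not answer it.

```python
import os
import re
import stat

_GPU_NODE = re.compile(r"^nvidia(\d+)$")


def gpu_node_index(name):
    """Return N for a name like 'nvidia5', or None for 'nvidiactl', 'nvidia-uvm', etc."""
    match = _GPU_NODE.match(name)
    return int(match.group(1)) if match else None


def list_gpu_nodes(dev_dir="/dev"):
    """Return [(path, index_from_name, minor_number)] for GPU character devices."""
    nodes = []
    for name in os.listdir(dev_dir):
        index = gpu_node_index(name)
        if index is None:
            continue
        path = os.path.join(dev_dir, name)
        info = os.stat(path)
        # Skip anything that is not a character device (for example a stray file).
        if stat.S_ISCHR(info.st_mode):
            nodes.append((path, index, os.minor(info.st_rdev)))
    nodes.sort(key=lambda node: node[1])
    return nodes
```

Calling `list_gpu_nodes()` inside a pod and printing the tuples next to the `nvidia-smi` output would show whether the name index and the minor number agree on that host.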

tardy verge Oct 15, 2025, 5:47 AM

#

I made a script

tardy verge Oct 15, 2025, 6:07 AM

#

ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y

#

is this command right?

#

and it looks like not many machines with 5090s that have cuda 13 are available

tardy verge Oct 15, 2025, 6:29 AM

#

📎 ffmpeg_test_summary.csv

#

maybe I am wrong with the testing

tardy verge Oct 15, 2025, 6:30 AM

#

tardy verge ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix...

if this command is right this should be correct

#

#

#

#

idk at this point

#

note that 5090s with cuda 13 are rare

#

i think there is 1 machine available

quaint moss Oct 15, 2025, 6:35 AM

#

That command looks right.

tardy verge Oct 15, 2025, 6:35 AM

#

i can't get more than 4 concurrently

quaint moss Oct 15, 2025, 6:35 AM

#

How are you checking for pass/fail?

#

If ffmpeg returns anything but 0?

tardy verge Oct 15, 2025, 6:36 AM

#

this

#

#

Conversion failed!

#

if that's in stderr it's a failed run

#

📎 results.tar.gz

#

this is the raw output

quaint moss Oct 15, 2025, 6:37 AM

#

Gotcha. Yeah, I was checking with grep -q "No capable devices found"
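The two pass/fail checks discussed here (looking for "Conversion failed!" and for "No capable devices found" in stderr) can be combined so that an NVENC-specific failure is told apart from any other ffmpeg failure. A minimal Python sketch, assuming `ffmpeg` is on the PATH; the function names and the three result labels are hypothetical, and the marker strings are taken from the log at the top of this thread.

```python
import subprocess

# Marker strings copied from the failing log posted at the start of the thread.
NVENC_FAILURE_MARKERS = (
    "No capable devices found",
    "OpenEncodeSessionEx failed",
)


def classify_ffmpeg_run(returncode, stderr):
    """Return 'ok', 'nvenc_unavailable', or 'other_failure' for one ffmpeg run."""
    if any(marker in stderr for marker in NVENC_FAILURE_MARKERS):
        return "nvenc_unavailable"
    if returncode != 0 or "Conversion failed!" in stderr:
        return "other_failure"
    return "ok"


def run_nvenc_probe(ffmpeg="ffmpeg"):
    """Run the 5-second test encode used in this thread and classify the result."""
    cmd = [
        ffmpeg, "-f", "lavfi", "-i", "testsrc=duration=5:size=1280x720:rate=30",
        "-c:v", "h264_nvenc", "-pix_fmt", "yuv420p", "-t", "5", "/tmp/test.mp4", "-y",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return classify_ffmpeg_run(proc.returncode, proc.stderr)
```

Checking the exit code as well as the text guards against a run that fails with a message neither grep would catch.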

tardy verge Oct 15, 2025, 6:37 AM

#

#

doesn't this look off

#

no "No capable devices found"

#

oh there is

#

#

anyways I would like to test with RTX6000 blackwell

#

but im broke

#

😢

quaint moss Oct 15, 2025, 6:39 AM

#

I've been sitting here trying to patch ffmpeg.

#

tardy verge Oct 15, 2025, 6:40 AM

#

can i ask why you need ffmpeg?

#

at this point i think using other platforms to transcode is better

#

lol

quaint moss Oct 15, 2025, 6:40 AM

#

I'm trying to use nvenc to convert archives of videos to streamable formats.

#

I guess I could try gstreamer instead

tardy verge Oct 15, 2025, 6:41 AM

#

non-blackwell cards fail too?

quaint moss Oct 15, 2025, 6:42 AM

#

I dunno. I haven't tried them much because they're slower and don't have 9th gen nvenc

#

The newer blackwell have 4 nvenc and seem to blow the L4 gpu's from google cloud run out of the water. At least with this encoding stuff.

tardy verge Oct 15, 2025, 6:49 AM

#

Yeah ik

#

Community cloud seems to work better

#

Rtx pro 6000 maxqs were all working

quaint moss Oct 15, 2025, 6:53 AM

#

gstreamer seemed to work

#

I haven't thoroughly tested it but I think its slower than ffmpeg

tardy verge Oct 15, 2025, 6:56 AM

#

Is it using cpu?

quaint moss Oct 15, 2025, 7:01 AM

#

That’s what I’m trying to confirm. I ran an nvenc test on the gst-bad nvenc plugin and it seemed to spit out a video.

#

Can’t tell if it falls back to cpu

tardy verge Oct 15, 2025, 7:02 AM

#

maybe try seeing if nvidia card pulls more power when encoding

#

or if gstreamer uses more than 100% cpu while encoding
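Besides power draw and CPU usage, `nvidia-smi` can report encoder session statistics, which gives a more direct signal while an encode is running. A minimal Python sketch under stated assumptions: it assumes the installed `nvidia-smi` accepts the `encoder.stats.sessionCount`, `encoder.stats.averageFps`, and `power.draw` query fields, and that encoder sessions are visible from inside the container. The helper names are hypothetical.

```python
import subprocess

# Assumption: the installed nvidia-smi accepts these --query-gpu fields.
QUERY_FIELDS = "encoder.stats.sessionCount,encoder.stats.averageFps,power.draw"


def _num(text, cast):
    """Convert one CSV field, returning None for values such as '[N/A]'."""
    try:
        return cast(text.strip())
    except ValueError:
        return None


def parse_encoder_stats(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        sessions, avg_fps, power = line.split(",")
        rows.append({
            "sessions": _num(sessions, int),
            "avg_fps": _num(avg_fps, int),
            "power_w": _num(power, float),
        })
    return rows


def sample_encoder_stats():
    """Query nvidia-smi once; call this while the encode is running."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_encoder_stats(out)


def gpu_is_encoding(rows):
    """True if any GPU reports at least one active encoder session."""
    return any((row["sessions"] or 0) > 0 for row in rows)
```

If `gpu_is_encoding(sample_encoder_stats())` stays false for the whole run while CPU usage is high, a CPU fallback is the more likely explanation.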

quaint moss Oct 15, 2025, 7:03 AM

#

I was deleting a couple pods and accidentally deleted the one I compiled gstreamer on. So now I gotta go through all that mess again.

tardy verge Oct 15, 2025, 7:04 AM

#

why not binary releases?

quaint moss Oct 15, 2025, 7:04 AM

#

But I will. If gstreamer can do it, then ffmpeg clearly can be patched

quaint moss Oct 15, 2025, 7:04 AM

#

tardy verge why not binary releases?

I don’t know if there are any precompiled binaries for the gstreamer gst plugins for nvenc. I didn’t really look though

tardy verge Oct 15, 2025, 7:05 AM

#

https://stackoverflow.com/questions/75981119/how-to-install-gstreamer-nvcodec-vs-nvdec-nvenc-plugins-on-ubuntu-20-04

Stack Overflow

How to install gstreamer nvcodec vs nvdec/nvenc plugins on Ubuntu 2...

Installed gstreamer and gstreamer-plugins-bad on ubuntu 20.04 via the apt repo. I also installed the Video_Codec SDK 11.0 from Nvidia.
The gst-ispect command shows me nvenc and nvdec is installed ...

#

it says you'll get them automatically

#

try it on runpod pytorch official template

#

that has ubuntu 22 from what I know

quaint moss Oct 15, 2025, 7:06 AM

#

Oh cool

tardy verge Oct 15, 2025, 7:11 AM

#

hmm

quaint moss Oct 15, 2025, 7:13 AM

#

The gst-plugins-bad with Ubuntu doesn’t have nvenc. I’ll have to compile it again in the morning.

tardy verge Oct 15, 2025, 7:13 AM

#

it wont work

#

gst-inspect-1.0 nvcodec

#

try this

#

it does have nvcodec

#

#

but no nvh264enc?

quaint moss Oct 15, 2025, 7:14 AM

#

Yep

tardy verge Oct 15, 2025, 7:16 AM

#

https://github.com/jackersson/env-setup/tree/master/gst-nvidia-docker

GitHub

env-setup/gst-nvidia-docker at master · jackersson/env-setup

Useful scripts, docker containers. Contribute to jackersson/env-setup development by creating an account on GitHub.

#

looks like someone made a script

radiant prawn Oct 15, 2025, 2:11 PM

#

woah huge thread

#

Great to see you all working on it, but the 12.8 selector should not give you CUDA 13 machines. However you're correct in that a non zero amount of our servers are on 13.0 - but it's not an amount that I can guarantee will always be available.

#

And naturally we'll do what we can from the backend^ Just a little awkward while we're running maintenance.

tardy verge Oct 16, 2025, 12:50 AM

#

All community cloud instances I have tested worked

#

@radiant prawn This is strange

#

All instances with RTX PRO 6000

#

didn't try 5090s yet

radiant prawn Oct 16, 2025, 12:51 AM

#

tardy verge All instances with RTX PRO 6000

Work or don't work?

#

I think one of the facets of this is the host's operating system/kernel version. Let me take a look

tardy verge Oct 16, 2025, 12:51 AM

#

worked

#

all community cloud instances with RTX PRO

#

I wasn't able to spin up many because there were not many available

quaint moss Oct 16, 2025, 12:52 AM

#

Every RTX PRO 6000 WK I have tried worked no matter what version or driver or which enumerated gpu I got

tardy verge Oct 16, 2025, 12:52 AM

#

@radiant prawn would you like the pod ids?

tardy verge Oct 16, 2025, 12:53 AM

#

tardy verge

this is secure cloud

#

the folder names are pod ids

radiant prawn Oct 16, 2025, 12:53 AM

#

quaint moss Every RTX PRO 6000 WK I have tried worked no matter what version or driver or wh...

Interesting, on secure cloud all of these machines use Ubuntu 24.04.2 or 24.04.3.

#

The community cloud host uses Ubuntu 22.04.5.

#

So probably not the operating system.

quaint moss Oct 16, 2025, 12:54 AM

#

I’ll try the community cloud instances tonight.

radiant prawn Oct 16, 2025, 12:55 AM

#

There's not a lot and I can't guarantee the availability.

tardy verge Oct 16, 2025, 12:55 AM

#

yeah

#

i got total under 10 pods

radiant prawn Oct 16, 2025, 12:55 AM

#

I can tell you we have 1 machine on the community cloud with this GPU, and one machine physically cannot support more than 8 GPUs.

tardy verge Oct 16, 2025, 12:55 AM

#

and half of them didn't even work (probably pulling image)

radiant prawn Oct 16, 2025, 12:55 AM

#

maybe its not one machine but its 1 OS and that usually indicates one machine

#

it's 2 :)

#

@tardy verge What cuda version was the one you were on?

#

If you know, it could've only been 13.0 or 12.8.

tardy verge Oct 16, 2025, 12:57 AM

#

has 12.9

#

radiant prawn Oct 16, 2025, 12:58 AM

#

uh? nvcc --version

tardy verge Oct 16, 2025, 12:58 AM

#

is nvidia-smi inaccurate?

#

I can't ssh back because the test is automated

radiant prawn Oct 16, 2025, 12:59 AM

#

I learned recently nvidia-smi will show the highest cuda version the driver supports
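A minimal sketch for comparing the two numbers being discussed here. It assumes `nvidia-smi` prints its usual header containing `CUDA Version: X.Y` and that `nvcc --version` prints a `release X.Y,` line. The helper names `driver_cuda_version` and `toolkit_cuda_version` are made up for illustration, and on images without the CUDA toolkit `nvcc` will simply be absent.

```shell
# Extract "CUDA Version: X.Y" from nvidia-smi's header text on stdin
# (the highest CUDA version the installed driver supports).
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Extract "release X.Y" from `nvcc --version` output on stdin
# (the CUDA toolkit actually installed in the container image).
toolkit_cuda_version() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p' | head -n 1
}

# Only query the tools that are actually present.
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "driver supports up to CUDA: $(nvidia-smi | driver_cuda_version)"
fi
if command -v nvcc >/dev/null 2>&1; then
  echo "toolkit in container: $(nvcc --version | toolkit_cuda_version)"
fi
```

If the two values differ, that alone is not an error: the driver figure is an upper bound, not the installed toolkit version.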

tardy verge Oct 16, 2025, 12:59 AM

#

pod is already terminated

radiant prawn Oct 16, 2025, 12:59 AM

#

o7 thats fine

#

Would you happen to have the prompt from the pod?

#

root@12345678...

#

Or does the output only give you the result of nvidia-smi?

tardy verge Oct 16, 2025, 1:00 AM

#

8mmj1nmc6r2ksh

#

this is the podid

radiant prawn Oct 16, 2025, 1:00 AM

#

perfect

tardy verge Oct 16, 2025, 1:00 AM

#

i don't have the prompt

radiant prawn Oct 16, 2025, 1:00 AM

#

We have this machine listed as cuda 12.9

#

Weird when I queried for it it showed as 12.8

tardy verge Oct 16, 2025, 1:00 AM

#

drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 8mmj1nmc6r2ksh
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 mgrd9lo1q1bptd
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 mx91o0m3l84i0c
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 uirv051063dg6j
drwxr-xr-x@ 6 riverfog7 staff 192 Oct 16 09:58 yh7yf9vtb4o2k1
drwxr-xr-x@ 6 riverfog7 staff 192 Oct 16 09:58 yl08tpmbn6esh9

#

these are the ones i tested

radiant prawn Oct 16, 2025, 1:01 AM

#

I just opened this, this is excellent actually

tardy verge Oct 16, 2025, 1:01 AM

#

I do have some 12.8 ones too

radiant prawn Oct 16, 2025, 1:03 AM

#

If you do manage to find a correlation let me know, not that you're obligated to and I can very easily create (or run?) a script that simulates a bunch of different variables to pull details. I think we have this chalked up to these issues from the last time we got a report like this:

https://trac.ffmpeg.org/ticket/11694
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1249
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1209
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1197
https://github.com/NVIDIA/k8s-device-plugin/issues/1282

tardy verge Oct 16, 2025, 1:05 AM

#

OP said gstreamer worked so it might be a bug in ffmpeg

radiant prawn Oct 16, 2025, 1:06 AM

#

But this issue was reopened, by another? customer just yesterday with the following reproduction and we rolled that up into this too. We discussed a few workarounds, but aren't happy with any of them as they all have their own issues.

#

We know it's ffmpeg, we just don't know really about the details or why.

https://trac.ffmpeg.org/ticket/11694

tardy verge Oct 16, 2025, 1:06 AM

#

only correlation is this?

#

stupidly blue is failure

#

red is success

radiant prawn Oct 16, 2025, 1:08 AM

#

Does it help to know that after our maintenance the lowest driver version in the fleet will be 570.195.03?

tardy verge Oct 16, 2025, 1:08 AM

#

quaint moss Oct 16, 2025, 1:08 AM

#

That’s the driver I have the most issues with 😂

#

I wish I had escalated privs on one of these nodes so I could test a few things. For example I wonder if this could be fixed by running mknod with the major and minor from /dev/nvidiaX

MAJOR=$(stat -c '%t' /dev/nvidiaX)
MINOR=$(stat -c '%T' /dev/nvidiaX)

mknod /dev/nvidia0 c 0x$MAJOR 0x$MINOR
chmod 666 /dev/nvidia0

tardy verge Oct 16, 2025, 1:09 AM

#

tardy verge

these are community 5090s

#

forgot to mention it

radiant prawn Oct 16, 2025, 1:09 AM

#

I don't have ssh access to the hosts but I do have a lot of other permission.

#

@tardy verge I can message you a credit to continue your testing if you'd like.

tardy verge Oct 16, 2025, 1:10 AM

#

sure but its midterm season soon so i don't know if i can continue for long 😅

radiant prawn Oct 16, 2025, 1:10 AM

#

ah i understand

quaint moss Oct 16, 2025, 1:12 AM

#

I still wonder why the 6000 WK has worked no matter what version I get. I guess we need to check the index in nvidia-smi to see if it’s just luck
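One way to check whether the index matters is to log which `/dev/nvidiaN` nodes the container can see next to what the driver enumerates. This is only a sketch of that idea: `node_index` is a hypothetical helper, and the suspicion that a mismatch between the two lists correlates with the failure is a guess from this thread, not a confirmed cause.

```shell
# Extract the numeric index from a device node path such as /dev/nvidia3.
# Prints nothing for control nodes like /dev/nvidiactl or /dev/nvidia-uvm.
node_index() {
  printf '%s\n' "$1" | sed -n 's|^/dev/nvidia\([0-9][0-9]*\)$|\1|p'
}

# List the GPU device nodes visible inside the container.
for node in /dev/nvidia*; do
  [ -e "$node" ] || continue
  idx="$(node_index "$node")"
  if [ -n "$idx" ]; then
    echo "device node $node -> index $idx"
  fi
done

# What the driver itself enumerates, for comparison.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,uuid,name,driver_version --format=csv,noheader
fi
```

Saving this output together with the pass/fail result of each test encode would make it possible to see afterwards whether failures cluster on particular node numbers.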

#

This is the kind of bug I used to love working on when I was at nvidia. Don’t have access to those kind of testing rigs anymore though.

tardy verge Oct 16, 2025, 1:18 AM

#

are there any stats that are nice to have when debugging?

tardy verge Oct 16, 2025, 6:37 AM

#

what is this?

#

display attached = True?

#

quaint moss Oct 16, 2025, 6:46 AM

#

Is that the thingy you need to set if you want to do stuff like vnc or X11 forwarding?

tardy verge Oct 16, 2025, 6:47 AM

#

i don't know about that part well

#

📎 ffmpeg_test_summary.csv

#

this time its 280 instances

#

of 5090s mixed between community and secure cloud

quaint moss Oct 16, 2025, 6:49 AM

#

Is your test easily runable? I don’t mind to burn through credits testing other scenarios

#

Other gpu’s I should say

tardy verge Oct 16, 2025, 6:50 AM

#

its just doing this

#

in a template

tardy verge Oct 16, 2025, 7:12 AM

#

#

#

autogluon feature importance

#

it just identified every gpu by id

#

idk at this point

#

this is probably a software bug inside nvidia or ffmpeg

quaint moss Oct 16, 2025, 6:53 PM

#

Yeah I read all the tickets and associated links this morning. Nothing seems to be reliable in the reproduction. Some people say you need to be GPU 0, others say the last gpu is working or an odd number in between.

Some say the bug is a regression starting at driver 570. Others have reproduced it on 550 and lower.

Nobody seems to be focused on fixing it. Some ffmpeg references say its an issue with nvcodec itself.

quaint moss Oct 16, 2025, 7:38 PM

#

Just ran several simultaneous iterations of a quick encoding task on the PRO 6000 WK. Not a single one failed. All gave me 580/13

#

First test on a PRO 6000, 575/12.9 fail
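A rough sketch of how several simultaneous test encodes could be run and tallied. The function name `count_parallel_failures` is made up here, and the exact commands used in the tests above are not shown in the thread, so this is an approximation. The example ffmpeg call writes to the null muxer (`-f null -`) so parallel copies do not fight over one output file.

```shell
# Run the given command N times in parallel and print how many copies failed.
count_parallel_failures() {
  n="$1"; shift
  pids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 &      # start one copy in the background
    pids="$pids $!"             # remember its PID so we can collect its status
    i=$((i + 1))
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$failed"
}

# Example (needs ffmpeg and a GPU, so it is left commented out):
# count_parallel_failures 8 ffmpeg -f lavfi \
#   -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -f null -
```

A result of 0 means every copy exited successfully. Anything else is the number of encodes that returned a non-zero exit code.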

quaint moss Oct 17, 2025, 12:35 AM

#

Had a failure on a serverless worker. I didn't catch it in time to see the logs. But the only difference is that it was not in NA.

#

My_Endpoint_2025-10-16_at_7.35.48_PM.png

quaint moss Oct 17, 2025, 3:52 AM

#

I just had an idea to try and do a health check on a serverless endpoint. My question is, once a worker gets my image and goes idle, does it already have this issue or not?

Was hoping I could do a health check and if it fails, the worker terminates and a new one is created until I am left with nothing but workers without this issue.

#

But when my health check fails, whatever is orchestrating the containers just restarts it instead of terminating and launching on a new worker. Wonder if I can fail with a different error code to get it to terminate?

tardy verge Oct 17, 2025, 4:12 AM

#

you can self destruct with this in a pod

#

runpodctl remove pod ${RUNPOD_POD_ID}

#

I'm not sure about serverless

tardy verge Oct 17, 2025, 4:13 AM

#

tardy verge runpodctl remove pod ${RUNPOD_POD_ID}

they come with pod scoped api keys afaik

#

so no need to configure credentials

quaint moss Oct 17, 2025, 4:14 AM

#

I’ll give it a shot.

I don’t know if my concept is flawed though. Like, if a worker starts the container and doesn’t get the error in ffmpeg, does that mean when a request comes in hours later that it still won’t run into this bug?

I guess I don’t know how the serverless workers are orchestrated.

#

The question is, is the gpu already assigned when the worker goes idle? And if so, does it stay that way?

tardy verge Oct 17, 2025, 4:14 AM

#

@radiant prawn is there an api to terminate serverless workers individually?

#

its possible in serverless console thingey

quaint moss Oct 17, 2025, 4:15 AM

#

If so, this fixes everything for me. Just takes a little longer to deploy

tardy verge Oct 17, 2025, 4:15 AM

#

quaint moss If so, this fixes everything for me. Just takes a little longer to deploy

but you have to consider the possibility of getting the same host

#

you can see in my test that there are multiple overlapping gpu ids

quaint moss Oct 17, 2025, 4:16 AM

#

tardy verge but you have to consider the possibility of getting the same host

What do you mean?

tardy verge Oct 17, 2025, 4:16 AM

#

quaint moss Oct 17, 2025, 4:16 AM

#

I think I’m saying the same thing as you

tardy verge Oct 17, 2025, 4:17 AM

#

test ffmpeg fails -> terminate worker -> runpod spins up same worker

#

can happen

quaint moss Oct 17, 2025, 4:17 AM

#

Ohh

tardy verge Oct 17, 2025, 4:17 AM

#

probably not a problem if there are many GPUs

#

but since you are using the RTX PRO 6000 and they have limited supply

quaint moss Oct 17, 2025, 4:17 AM

#

Well when I manually terminate one I always get back a worker with a different id. But I don’t know if that’s unique or not

#

I’m using L40, L40S, RTX PRO 6000 and I think one other gpu

tardy verge Oct 17, 2025, 4:18 AM

#

quaint moss Well when I manually terminate one I always get back a worker with a different i...

the id should be different

#

#

pod id is different here

#

but GPU uuid is same for some pods
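To spot this kind of overlap in a batch of test results, something like the following could work. It assumes each run logged one `pod_id,gpu_uuid` line, which is a hypothetical format and not the layout of the CSV attached earlier; the UUID itself can be read with `nvidia-smi --query-gpu=uuid --format=csv,noheader`. The name `shared_gpu_uuids` is also made up.

```shell
# Read "pod_id,gpu_uuid" lines on stdin and print every GPU UUID that was
# handed to more than one pod, followed by the pod ids that received it.
shared_gpu_uuids() {
  awk -F, '
    { count[$2]++; pods[$2] = pods[$2] " " $1 }
    END {
      for (u in count)
        if (count[u] > 1) print u ":" pods[u]
    }' | sort
}

# Example with made-up ids:
printf 'podA,GPU-1\npodB,GPU-2\npodC,GPU-1\n' | shared_gpu_uuids
```

With the made-up input above, only `GPU-1` is reported because it is the only UUID that appears under two pod ids.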

quaint moss Oct 17, 2025, 4:23 AM

#

I guess I don’t know what a worker truly is. Is it a shared server? Is it a shared cluster? Etc.

Because it could totally just be a server rack that picks up requests from the queue and runs docker run -it … --gpus=5 (not really but you get the idea). That means it could work for one request then fail the next.

Otherwise if it’s consistent, then I’m ok with this janky solution.

tardy verge Oct 17, 2025, 4:23 AM

#

My opinion is a serverless worker is just a pod

#

and worker id = pod id

#

and works with the same infrastructure, hence can share network volumes

quaint moss Oct 17, 2025, 4:24 AM

#

That’s what I think too.

#

Pods + orchestration = serverless worker

tardy verge Oct 17, 2025, 4:25 AM

#

ECS in AWS terms

#

but with an API Gateway

#

and a queue

#

and cloudfront

quaint moss Oct 17, 2025, 4:25 AM

#

Yep.

#

So if I can kill a worker during the initial health check, it may be a workable solution. Provided I don’t get the same one over and over 🤔

tardy verge Oct 17, 2025, 4:26 AM

#

tardy verge test ffmpeg fails -> terminate worker -> runpod spins up same worker

so this point still stands

#

this is only a problem when there is like 10 GPUs available and 7 of them are not working

#

but you are trying to get 5 workers

#

and you get broke because you get billed for the ffmpeg health check time

quaint moss Oct 17, 2025, 4:28 AM

#

Oh I never thought about it billing me for the deploy time

tardy verge Oct 17, 2025, 4:29 AM

#

you should get billed for the health check

#

because the container is already started

quaint moss Oct 17, 2025, 4:30 AM

#

That’s a good point. I’ll have to see where I can run this in the lifecycle

#

I put the health check right before the serverless handler and deployed and noticed some of the instances kept initializing. So I assumed it was running and I just couldn’t see the logs

tardy verge Oct 17, 2025, 4:31 AM

#

i think its failing the health check and the host just restarts it

#

pods do the same thing when containers exit abnormally

#

host starts it until it works

#

do you have anything in serverless console-> logs instead of serverless console->workers->worker->logs

quaint moss Oct 17, 2025, 4:38 AM

#

Nothing from the deploy. So it’s either not running the health check or none of the logs it generates during a deploy are in those logs I can see

#

The only thing I changed was adding the health check and the only two outcomes I saw were workers initializing over and over or becoming ready and successfully handling requests.

tardy verge Oct 17, 2025, 4:44 AM

#

I just spun up a random serverless endpoint and it looks like "Initializing" is pulling images and extracting them

#

and "running" is the actual container running

#

so if worker is created
initialize -> running (load model in memory and health check, etc) -> idle (waits till request)

quaint moss Oct 17, 2025, 4:47 AM

#

I only ever get the running state when I send a request. I just get initializing -> idle

tardy verge Oct 17, 2025, 4:47 AM

#

oh I think i set this to get the worker count to go up

#

that makes sense i think im wrong then

#

my question is "Is anything happening after the container starts and before serverless start billed?"

#

its not really clear from this explanation

quaint moss Oct 17, 2025, 4:50 AM

#

What I thought was happening is deploy > workers are assigned and they all pull your image

Then request > container starts

#

But after deploy, does it run the container at all.

#

If so, health check + terminate on fail would work. Otherwise it won’t

#

I’ve never noticed it charging me for the deploy phases

quaint moss Oct 17, 2025, 5:13 AM

#

Hmm. I don’t think this will work now. There is no health check I can find in the docs for queue based serverless

quaint moss Oct 17, 2025, 5:29 AM

#

Ugh. I am spending too much time thinking about this each night. I gotta just implement my own queue and let the occasional failures retry. At most, terminate failed workers

tardy verge Oct 17, 2025, 5:37 AM

#

Do Dockerfile health checks work?

quaint moss Oct 17, 2025, 5:38 AM

#

I don’t think the worker is even running the container until you send a request to it.

#

At least that’s my theory

tardy verge Oct 17, 2025, 5:38 AM

#

Hmm

quaint moss Oct 17, 2025, 5:48 AM

#

I wonder how Google gets around this

tardy verge Oct 17, 2025, 5:48 AM

#

google?

quaint moss Oct 17, 2025, 5:49 AM

#

With cloud run gpu instances. We’ve processed lots of video using those and never ran into this problem. They’re on L4 gpus

tardy verge Oct 17, 2025, 5:49 AM

#

idk about cloud run but in AWS ECS containers run inside vms

#

not like runpod (shared host)

quaint moss Oct 17, 2025, 5:51 AM

#

Google cloud run is using docker. At least for their second gen runtimes

#

Maybe that’s the answer. Just run a docker container inside the docker container 😅

tardy verge Oct 17, 2025, 5:52 AM

#

good news

#

serverless workers count as pods ig

#

so runpodctl remove pod <workerid> works

#

# Try a 5-second NVENC test encode; if it fails, remove this pod/worker.
if ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y > /dev/null 2>&1; then
    echo "FFMPEG with NVENC encoding succeeded."
else
    echo "FFMPEG with NVENC encoding failed."
    runpodctl remove pod "${RUNPOD_POD_ID}"
fi

#

should work

#

if the env variable is correct and runpodctl is installed in the pod

quaint moss Oct 17, 2025, 6:27 AM

#

Sorry. Wife aggro. I'll check this out.

#

So with that scenario, the job will fail, but it will take the worker out with it.

radiant prawn Oct 17, 2025, 6:36 AM

#

tardy verge and worker id = pod id

Correct. To the API a serverless worker is a pod named after the endpoint id :')

quaint moss Oct 17, 2025, 6:40 AM

#

And there's no way during a deploy to tap into any of the health checks?

tardy verge Oct 17, 2025, 8:46 AM

#

That should depend on the docker cmd running at worker initialization

quaint moss Oct 18, 2025, 4:33 AM

#

Yeah I don’t think the container itself runs. I suspect the pod just fetches the image and makes sure everything is ready for when it receives a request

quaint moss Oct 19, 2025, 3:52 AM

#

Ok new tactic...

My API forms the request, sends it to the serverless endpoint, then receives the webhook back on success or fail. On fail, it retries.

Meanwhile, on the worker, if we get the bad cuda ffmpeg response about no supported devices, we terminate the worker.

Provided there is enough delay, the retry will go to the next worker while the last one was being terminated. So the system should only waste a few seconds on retries and terminating before eventually hitting a successful worker.

This also has the added side effect of constantly pruning bad workers.
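The worker-side half of that tactic might look roughly like this. It is a sketch, not the actual handler code: the function names are invented, the two error strings are the ones quoted in the repository description linked later in this thread, and it assumes `runpodctl` is installed and `RUNPOD_POD_ID` is set, as discussed above.

```shell
# Succeeds when stdin (ffmpeg's stderr) contains one of the NVENC
# device errors discussed in this thread.
is_nvenc_device_error() {
  grep -q -e 'OpenEncodeSessionEx' -e 'unsupported device'
}

# Run a short test encode; on an NVENC device error, remove this worker.
# Returns 0 when the encode works, 1 otherwise.
check_and_prune() {
  log="$(mktemp)"
  if ffmpeg -hide_banner -f lavfi -i testsrc=duration=1:size=1280x720:rate=30 \
       -c:v h264_nvenc -f null - 2>"$log"; then
    rm -f "$log"
    return 0
  fi
  if is_nvenc_device_error <"$log"; then
    echo "NVENC device error; removing worker ${RUNPOD_POD_ID}" >&2
    runpodctl remove pod "${RUNPOD_POD_ID}"
  fi
  rm -f "$log"
  return 1
}
```

Matching on the error text rather than on any non-zero exit code keeps the worker alive when ffmpeg fails for an unrelated reason, such as a bad input file.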

#

So far its working as expected

tardy verge Oct 19, 2025, 9:46 AM

#

Hmm

quaint moss Oct 20, 2025, 11:00 PM

#

Just noticing some things. If I setup my serverless workers to run on 5090's, it's like 9.5 out of 10 fail and have to be terminated.

L40 and L40S, maybe 1 out of 5 fail.

RTX PRO 6000, zero have been terminated by my script.

#

Totally usable now that I am having the worker terminate itself when it detects those failure modes. After a small amount of failures, we get 100% working workers. I have not seen a worker that worked the first time but then subsequently failed. So I think it does work the way I was thinking.

tardy verge Oct 21, 2025, 3:20 AM

#

Failure rate was about 75% in my testing

#

But it includes community cloud

jolly tapir Oct 23, 2025, 10:06 AM

#

Hey everyone, is this issue fixed yet?
I have an RTX 3060 Ti, and when I run my Docker image locally, it works perfectly and uses the GPU.
However, when I pull the same image on Serverless or RunPod, it doesn’t use the GPU at all.

It only works if I add a CPU fallback — then it runs on CPU.
But if I try running with GPU only, it throws this error:
"details": "Traceback (most recent call last):
File "/app/unified_r2_only.py", line 134, in compress_video_nvenc
raise RuntimeError(f"ffmpeg exit {proc.returncode}")
RuntimeError: ffmpeg exit 1 "

normal gate Oct 23, 2025, 11:37 AM

#

jolly tapir Hey everyone, is this issue fixed yet? I have an RTX 3060 Ti, and when I run my ...

thats probably the error

quaint moss Oct 27, 2025, 9:09 PM

#

Just for reference, the last couple deploys I did had to prune fewer workers before it was left with 5 unaffected ones. Possibly just luck of the draw.

#

I have all datacenters selected still though.

#

I suspect if I limited to NA it would be close to zero

tardy verge Oct 31, 2025, 12:47 PM

#

@radiant prawn Sorry to mention but can you check these Community cloud pods if they are working properly?

#

#

#

radiant prawn Oct 31, 2025, 3:47 PM

#

Working backwards:

radiant prawn Oct 31, 2025, 3:47 PM

#

tardy verge

DNS error, this usually self resolves

radiant prawn Oct 31, 2025, 3:50 PM

#

tardy verge

Machine error (both variations of the error shown here)

radiant prawn Oct 31, 2025, 3:50 PM

#

tardy verge

I'll aggregate these pod ids to unique machine ids and ideally point out the errors to this host.

jolly tapir Nov 2, 2025, 8:07 PM

#

Has anyone dealt with this issue? I get ffmpeg return code -1 on most cards on RunPod, and the ones that do work give this performance, while the same code on my local machine utilizes 100 percent.

quaint moss Nov 2, 2025, 10:16 PM

#

Ampere cards are meh with nvenc

#

Probably gonna have to find the actual error so we can see what’s really going on.

#

Even with 4 nvenc chips and running tons of nvenc sessions I only get like 15% usage

#

Gpu usage isn’t the same as nvenc usage

dense scroll Nov 19, 2025, 11:01 PM

#

Is there any paved road with this? I am having the same issue.

normal gate Nov 20, 2025, 3:37 AM

#

i dont think there's fix for this yet

dense scroll Nov 20, 2025, 3:37 PM

#

after a lot of debugging, NVIDIA_DRIVER_CAPABILITIES=compute,decode,encode,utility,video
is still not working. the ,decode,encode part is being blocked no matter what i do.

quaint moss May 7, 2026, 1:51 AM

#

I finally solved this. Going to write a paper describing it.

astral silo May 7, 2026, 5:26 AM

#

it probably really deserves a paper since it took over 6 months to solve

quaint moss May 7, 2026, 6:02 AM

#

For sure. Got a few more tests to run to make sure it doesn’t break on multi-gpu pods. But I wrote a shared library that patches the calls made between nvidia’s user-space driver libraries and the kernel.

Specifically the call NV0000_CTRL_CMD_GPU_GET_ATTACHED_IDS that goes to the nvidia resource manager.

#

libnvidia-encode.so and libnvcuvid.so

#

Totally worked for me earlier. I booted up a pod and ran ffmpeg as a test and got the error. Then added an env var to tell the system to use my library and boom. Success.

Tried it on 5 different pods and it fixed it on every gpu and cuda combo so far.
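For readers wondering what "added an env var" could look like in practice: the repository linked further down describes an LD_PRELOAD-based fix, so a wrapper along these lines is one plausible shape. The library path and the names `SHIM_LIB`, `preload_value` and `ffmpeg_with_shim` are hypothetical; the real file name depends on how the shim is built into the image.

```shell
# Hypothetical location of the compiled shim library.
SHIM_LIB="${SHIM_LIB:-/usr/local/lib/libgpu_enum_shim.so}"

# Print the LD_PRELOAD value to use: the shim (arg 1) prepended to any
# existing value (arg 2), or the existing value unchanged when the shim
# file does not exist.
preload_value() {
  if [ -f "$1" ]; then
    printf '%s\n' "${1}${2:+:$2}"
  else
    printf '%s\n' "$2"
  fi
}

# Run ffmpeg with the shim preloaded for this one process only.
ffmpeg_with_shim() {
  LD_PRELOAD="$(preload_value "$SHIM_LIB" "${LD_PRELOAD:-}")" ffmpeg "$@"
}
```

Setting the variable per command, instead of exporting it globally in the image, limits the patched behaviour to the ffmpeg process and leaves other programs untouched.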

#

The sad part is this is an nvidia bug in the driver but we could have fixed it from inside the container the whole time.

raven heath May 7, 2026, 7:51 AM

#

@quaint moss feel free to share your findings as I'm also interested

raven heath May 7, 2026, 3:12 PM

#

@quaint moss so I think I was able to fully nail the ffmpeg setup and make it work 🙂

raven heath May 7, 2026, 4:33 PM

#

@mint fossil can you give a try to https://github.com/MadiatorLabs/nvscope/releases/tag/v0.1.0

GitHub

Release Initial version · MadiatorLabs/nvscope

Full Changelog: https://github.com/MadiatorLabs/nvscope/commits/v0.1.0

quaint moss May 7, 2026, 5:22 PM

#

raven heath @mint fossil can you give a try to https://github.com/MadiatorLabs/nvsc...

Cool yes. Similar approach.

raven heath May 7, 2026, 5:24 PM

#

Feel free to use it it's all open source for all

quaint moss May 7, 2026, 6:24 PM

#

https://github.com/flexgrip/nvidia-gpu-enumeration/

GitHub

GitHub - flexgrip/nvidia-gpu-enumeration: Fixing NVIDIA's Broken GP...

Fixing NVIDIA's Broken GPU Encoding in Containers using LD_PRELOAD. Nvenc/cuvid solution for "OpenEncodeSessionEx", "unsupported device", and "N...

#

There's all of my notes and code. Although @raven heath's shim is probably more friendly. Mine was just a c file you compile during image build so you can include the library.

#

I have debugging flags you can pass in to have it send logs to stderr or out to a file. I had to use the crap out of those because sometimes the failures wouldn't get logged and I'd lose them when I destroyed the pod.

mint fossil May 8, 2026, 2:52 AM

#

raven heath @mint fossil can you give a try to https://github.com/MadiatorLabs/nvsc...

It is very good.

On my end, I performed dozens of encodes using this software across multiple data centers, specifically focusing on the RTX 5090, and did not encounter this issue. In other words, I was able to successfully perform high-speed encoding using NVENC.

I believe this problem has finally been resolved.

It seems like it would work well if I add a few layers to the Dockerfile and perhaps set up an alias for the ffmpeg command.
