#When encoding video with ffmpeg, nvenc does not work.


mint fossil
#

DC:US-NC-1
GPU:RTX 5090

I have switched data centers to US-IL-1 in addition to US-NC-1, but the results remain the same.

cmd
-f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y

ffmpeg version 7.1.1 Copyright (c) 2000-2025 the FFmpeg developers
  built with gcc 13 (Ubuntu 13.3.0-6ubuntu2~24.04)
  configuration: --disable-debug --disable-doc --disable-ffplay --enable-alsa --enable-cuda-llvm --enable-cuvid --enable-ffprobe --enable-gpl --enable-libaom --enable-libass --enable-libdav1d --enable-libfdk_aac --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-libkvazaar --enable-liblc3 --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-libplacebo --enable-librav1e --enable-librist --enable-libshaderc --enable-libsrt --enable-libsvtav1 --enable-libtheora --enable-libv4l2 --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpl --enable-libvpx --enable-libvvenc --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-nonfree --enable-nvdec --enable-nvenc --enable-opencl --enable-openssl --enable-stripping --enable-vaapi --enable-vdpau --enable-version3 --enable-vulkan
  libavutil      59. 39.100 / 59. 39.100
  libavcodec     61. 19.101 / 61. 19.101
  libavformat    61.  7.100 / 61.  7.100
  libavdevice    61.  3.100 / 61.  3.100
  libavfilter    10.  4.100 / 10.  4.100
  libswscale      8.  3.100 /  8.  3.100
  libswresample   5.  3.100 /  5.  3.100
  libpostproc    58.  3.100 / 58.  3.100
Input #0, lavfi, from 'testsrc=duration=5:size=1280x720:rate=30':
  Duration: N/A, start: 0.000000, bitrate: N/A
  Stream #0:0: Video: wrapped_avframe, rgb24, 1280x720 [SAR 1:1 DAR 16:9], 30 fps, 30 tbr, 30 tbn
Stream mapping:
  Stream #0:0 -> #0:0 (wrapped_avframe (native) -> h264 (h264_nvenc))
Press [q] to stop, [?] for help
[h264_nvenc @ 0x5b844ea0d440] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x5b844ea0d440] No capable devices found
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Error while opening encoder - maybe incorrect parameters such as bit_rate, rate, width or height.
[vf#0:0 @ 0x5b844ea2afc0] Error sending frames to consumers: Generic error in an external library
[vf#0:0 @ 0x5b844ea2afc0] Task finished with error code: -542398533 (Generic error in an external library)
[vf#0:0 @ 0x5b844ea2afc0] Terminating thread with return code -542398533 (Generic error in an external library)
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Could not open encoder before EOF
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Task finished with error code: -22 (Invalid argument)
[vost#0:0/h264_nvenc @ 0x5b844ea0fdc0] Terminating thread with return code -22 (Invalid argument)
[out#0/mp4 @ 0x5b844ea0f540] Nothing was written into output file, because at least one of its streams received no packets.
frame=    0 fps=0.0 q=0.0 Lsize=       0KiB time=N/A bitrate=N/A speed=N/A
Conversion failed!

This is the result I got running on my RTX 4090. There are no issues with the container image and command.

docker run --rm -it --gpus=all \
                     -v $(pwd):/config \
                     linuxserver/ffmpeg:7.1.1 \
                     -hwaccel cuda -hwaccel_device 0 -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y
  ...
        encoder         : Lavc61.19.101 h264_nvenc
      Side data:
        cpb: bitrate max/min/avg: 0/0/2000000 buffer size: 4000000 vbv_delay: N/A
[out#0/mp4 @ 0x619f0fd1afc0] video:196KiB audio:0KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 1.333094%
frame=  150 fps=0.0 q=8.0 Lsize=     199KiB time=00:00:04.90 bitrate= 332.2kbits/s speed=27.8x 

radiant prawn
#

NVENC and ffmpeg are very sensitive to your node's specific driver version. When you're deploying your pod, use the advanced filter to select CUDA 12.8 and 12.9.

mint fossil
radiant prawn
tardy verge
#

only NVDEC

#

but all consumer & workstation grade cards do

#

works with RTX 2000 Ada

#

doesn't work with 5090

#

same error with community cloud

#

looks like a driver issue

mint fossil
mint fossil
# tardy verge looks like a driver issue

No, I don’t think so. This is a data center issue.
On the RTX 4090, the same error occurs in US-IL-1, EUR-NO-1, and EUR-IS-2, but encoding works normally on EU-RO-1.

Also, some RTX 5090s have been confirmed to work. However, it’s no longer possible to get assigned to that pod.

tardy verge
#

That's strange

#

But I heard that runpod is using early driver versions

#

So I thought it was a driver issue

mint fossil
# tardy verge That's strange

I will present evidence that supports my claim.
This is a serverless endpoint that runs the command
-f lavfi -i testsrc=duration=600:size=1920x1080:rate=60 -c:v h264_nvenc -preset p1 -b:v 10M -pix_fmt yuv420p -f null - -benchmark -stats
using the linuxserver/ffmpeg:version-8.0-cli image on a serverless worker.

The bad workers show the same error I reported, while the properly functioning workers correctly display encoding speed logs.
Therefore, this cannot be concluded as a driver issue, since there are GPU servers that operate normally.

The four attached screenshots are logs from the workers that function correctly.
The fifth image shows a list of both healthy and unhealthy workers. (All unhealthy ones output the error I posted and then stopped processing.)

And as an important detail, I have confirmed that there is no difference in the Driver Version between the bad workers and the healthy ones.
Therefore, this is a data center issue.

mint fossil
tardy verge
#

this is strange

#

NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9: work (EU-RO-1)
NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 : no work (EUR-IS-2)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: no work (EUR-IS-1)
NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 : no work (EU-RO-1)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: work (EUR-IS-1)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8: work (EUR-IS-1)
NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8 : no work (EUR-IS-1)
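
A minimal sketch of how the list above could be tallied per driver version, to check whether any single version is consistently good or bad. The function name `tally_by_driver` and the `<driver> <work|nowork>` input format are assumptions made for illustration; the thread does not show how the results were collected.

```shell
# tally_by_driver: count NVENC results per driver version.
# Input (stdin): one "<driver_version> <work|nowork>" pair per line (hypothetical format).
tally_by_driver() {
  awk '
    { seen[$1] = 1 }
    $2 == "work"   { ok[$1]++ }
    $2 == "nowork" { bad[$1]++ }
    END {
      # Unset counters print as 0 with %d.
      for (d in seen) printf "%s work=%d nowork=%d\n", d, ok[d], bad[d]
    }
  ' | sort
}
```

Fed the seven results listed above, this reports both outcomes for 570.144 and for 575.57.08, which is the point being made: the driver version alone does not separate working hosts from failing ones.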

#

it doesn't depend on driver version?

#

seems like EU-RO-1 uses newer drivers

#

but it sometimes doesn't work there too

mint fossil
# tardy verge it doesn't depend on driver version?

Yes. NVENC reacts sensitively to things like driver version, but even with exactly the same version, in exactly the same DC, and with exactly the same configuration, differences in behavior can be observed.
As you know, EUR-IS-2 is 570.172.08, but there are workers operating normally on 570.172.08, and there are also bad workers on 570.172.08.
The attached image shows information from a worker that operated normally.

tardy verge
#

Yeah I can confirm

#

And I didn't know that the runpod dashboard shows the driver version

radiant prawn
radiant prawn
tardy verge
#

8sv5k7ublivjhq

#

should be the serverless endpoint ID

mint fossil
#

skwe5i0dbvkzeu
This is the serverless endpoint ID that I presented as evidence.

rose shoal
#

Any news on this issue?

mint fossil
# rose shoal Any news on this issue?

In the support ticket, they reported:
I’ve already escalated this case to our reliability team for deeper review.

It seems they are currently investigating.

I will share any new information here as soon as it becomes available.

mint fossil
#

Runpod sincerely conducted additional investigation and provided support.
I will share the details of the ticket.


Following up on your request, we’ve reproduced the NVENC failure you reported across multiple regions and GPU types. After investigation, we’ve classified this as an upstream issue with FFmpeg/NVIDIA. Specifically, the problem appears to stem from how device indices are mapped inside containers (when /dev/nvidia* devices don’t align with nvidia-smi indices).
This behavior matches several active upstream reports:
https://trac.ffmpeg.org/ticket/11694
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1249
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1209
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1197
https://github.com/NVIDIA/k8s-device-plugin/issues/1282
Since the root cause lies upstream, a permanent fix will need to come from the FFmpeg/NVIDIA teams. That said, we’ll continue to monitor developments closely and will keep you updated on any relevant progress or workarounds.


quaint moss
#

I am seeing the same issue @mint fossil. Thank god I found this. I thought I was going crazy.

Seems to be a random roll of the dice on whether a pod will work or not. That being said, I have not seen any of these errors on the serverless endpoints.

I wonder if it is safe to depend on the serverless endpoints for large batches of requests or if this only affects the pods?

#

I will read the issues mentioned in the runpod support response. Perhaps we can just ship our own binaries that have a fix.

quaint moss
#

Seeing a lot of talk about the 570 driver being the culprit. When I launch RTX PRO 6000 pods I always get nvidia driver version >= 580 and have yet to see the issue on there. But that could be anecdotal.

It seems there is nothing we can do to fix the problem itself. But I am going to try getting the /dev/nvidia# and passing that into ffmpeg with ffmpeg -hwaccel_device 0

Maybe that will work. Going to bed but will continue tinkering with it tomorrow.

tardy verge
#

only servers with driver version 570 didn't work

quaint moss
#

Yeah as I understand the issue, it's with the way multiple gpu servers start the docker container. Whatever fancy way runpod spins them up, they are probably doing something like device=1 if the first gpu is already being used in another container.

As I found out last night, you can spin up the exact same image over and over on the same gpu type. Seems like unless you get gpu 0, you get this problem. The thing is, I have not seen this issue even once with RTX PRO 6000 machines.

#

I'm gonna see right now if I can reproduce the issue, and then work around it by specifying which device to enumerate in ffmpeg

quaint moss
#

export CUDA_VISIBLE_DEVICES=0; ffmpeg ... seems to have some effect. At least I get a different error when I set different device ids.

Just for fun, I tried symlinking /dev/nvidia1 to /dev/nvidia0 and it had no effect.

I am kind of out of ideas on how to fix this. It feels like if you get a set of circumstances, you just can't work around it.

  1. You select a gpu pod that isn't having all the gpus passed to it
  2. Someone else is using gpu 0
  3. You are on driver ~570 (although I haven't confirmed this)
#

Curious why I never saw this on the serverless endpoints. That's what I intend to use anyway so if it works there, then no big deal. But I can't really finish my project if there's a random chance each serverless endpoint worker might fail too.

#

Btw. I just launched a pod and got the same exact scenario where I got /dev/nvidia1 but since it is driver 580, it seems to work just fine

tardy verge
#

So the only workaround is using a different gpu?

#

Or maybe you can set higher cuda version

#

To avoid 570 drivers

quaint moss
#

I guess we can try. But I didn't think they were related. Like if I use a container image that is 12.2, then I get into a pod using an RTX PRO 6000, it will tell me its running cuda 13

#

I think that filter when you are making a pod is exactly that. Just a filter, showing which gpus are compatible with that version of cuda you've selected. 13 isn't even selectable on there and no matter what I choose, it seems to just have whatever the host system gives you when it is launching the container. Which makes sense.

mint fossil
quaint moss
#

Oh really 🤔

#

Maybe because I only really tested on RTX PRO 6000’s

#

I’ve yet to see one of those gpu’s have this issue. And all have been on driver >= 580

tardy verge
#

when you do nvidia-smi it will print out the cuda version and the driver version

#

and maybe RTX PRO 6000s don't support lower version of drivers so they don't get 570 driver version

quaint moss
#

Yeah. It seems to print out whatever version the host uses. So your base image can be like 12.2 and the host can be 13 as reported by nvidia-smi. But the filter selector when creating a pod seems to have no bearing on what version of cuda the host system uses. Like if I select 12.4 I will get 12.2 installed by my image and 13 according to nvidia-smi

tardy verge
#

that's weird

#

I always got what I asked for

#

but if you set it to 12.8 it should give 12.8+

#

at least

#

and that should avoid 570 drivers

quaint moss
#

There's a chance I don't know what I am talking about. But I am pretty confident that the cuda filter does absolutely nothing besides filter which gpu's are compatible with the version of cuda you need.

#

Unrelated to that though, I just saw my first failure on the serverless endpoints because I got an RTX 5090 running driver 570.

quaint moss
#

Weird. I just had an RTX PRO 6000 worker run one of my tasks and it was running driver 570.195.03. It worked

#

So I must have had gpu #0 I guess.

tardy verge
#

Maybe related to this

quaint moss
#

Great find!

tardy verge
#

can you try setting cuda version

#

just in case it works

quaint moss
#

Where at? In my image or in the selector when editing the pod/worker?

tardy verge
#

here

#

Probably to cuda 12.8

quaint moss
#

That selector seems to somewhat help at choosing the host's cuda version. But there are times when I will choose a specific version and get back a different one.

I just tested three pods. Two gave me 12.8 and the last one gave me 13

tardy verge
#

I think its 12.8+

#

So you are getting 13 too

#

Since cuda should be backwards compatible

#

And newer cuda version == newer drivers

#

So this should help in getting newer drivers

quaint moss
#

I think you were right that the selector works like that. I just noticed if I select 12.8 it tells me that the RTX PRO 6000 pods are unavailable, but if I select 12.9 they are available.

So that case is closed

#

But related to this issue. If I get driver 570 there’s a chance it doesn’t work.

But if I select 12.9 and get driver version 580 with the same gpu, it works.

#

I guess it’s possible that it’s anecdotal and I’ve just been lucky to get gpu number 0. But I am pretty confident driver 580 doesn’t have this issue

tardy verge
#

Then selecting 12.9 will make you avoid 570 driver

#

Because 12.9 is 580+

tardy verge
#

Either this is a bug or 12.8 selector giving cuda13 is a bug

quaint moss
#

# 5090
🟥 Driver Version: 570.153.02 CUDA Version: 12.8
🟥 Driver Version: 575.57.08 CUDA Version: 12.9

# RTX PRO 6000
🟥 Driver Version: 575.57.08 CUDA Version: 12.9 /dev/nvidia5
✅ Driver Version: 575.57.08 CUDA Version: 12.9 /dev/nvidia0
✅ Driver Version: 570.195.03 CUDA Version: 12.8 /dev/nvidia1 <-- ??

# RTX PRO 6000 WK
✅ Driver Version: 575.51.03 CUDA Version: 12.9 /dev/nvidia2

# RTX PRO 6000 WK (no cuda selection)
✅ Driver Version: 575.51.03 CUDA Version: 12.9 /dev/nvidia2

# RTX PRO 6000 WK NORTH AMERICA (no cuda selection)
✅ Driver Version: 580.65.06 CUDA Version: 13.0 /dev/nvidia5

#

No clue how to get it to give me 580 with cuda 13

#

It seems random

#

I tried selecting nothing, which is what I've done in the past to get it to give me 580. It gave me 12.9. So maybe it's just random or a regional thing.

#

Boom

#

Selected north america

#

🇺🇸

#

Driver Version: 580.65.06 CUDA Version: 13.0

#

My serverless endpoint workers are all NA except one. But I don't think you can choose where your workers come from.

#

Oh yes you can. In the advanced section.

#

Perfect. So if we can get a list of what each regions pods are running for cuda, I could technically go to production with this. maybe

tardy verge
#

So its cuda 12.x?

#

Im confused now

quaint moss
#

I’m confused too 😂

#

My working theory is that driver 580 seems to not have this problem.

However, it seems to not happen at all on the RTX PRO 6000 WK

#

But the only way I can get driver 580 has been to use that card. So 🤷

tardy verge
#

How*

#

Does that give you driver 58* consistently?

#

58X

quaint moss
#

Every time so far, yes

#

If I see 580 I see cuda 13

tardy verge
#

i think this is the answer

quaint moss
#

If you look at my table above, some of it is a friggin mystery still.

Like that one time I got the 3rd nvidia card in my container, and it still worked

tardy verge
#

so if you pick cuda 13

#

you get nvidia driver >= 580

quaint moss
#

I wish I could pick cuda 13

#

Maybe runpod can tell us which regions are running 580/13.0

tardy verge
#

yeah

quaint moss
#

But right now if you select north America and don’t pick any cuda version, I have gotten 580/13 100% of the time. But that’s like 15 locations so I definitely haven’t confirmed if all of them have it.

So I will need runpod to confirm so I can filter my serverless endpoint to only select those.

#

But again, I still don’t know how some of those tests I ran worked. They were not getting the “default” gpu 0 and they were not on 580 and they worked.

What’s real? Is the sky blue? Are birds real?

tardy verge
#

oh

#

wait

#

it works with API

#

@quaint moss

quaint moss
#

I was gonna check that. See if I can use the api to specify cuda 13

tardy verge
#

they just didn't give that option in the UI

quaint moss
#

You are a beauty

tardy verge
#

lol

quaint moss
#

So you just set it to cuda: 13 and it made that?

tardy verge
#

this

#

works

quaint moss
#

Hell yeah

tardy verge
#

and setting that to 14.0 fails

#

so they are doing some kind of checking

#

even if the option is invalid

#

13.1 also fails

#

like this

quaint moss
#

That was you trying to make a cuda 13 instance or an invalid version number?

tardy verge
#

invalid version number

#

cuda 13.1 or cuda 14 requested returns that error

#

cuda 13.0 (you need the .0) works fine

#

and pods spun up like that have cuda 13.0

quaint moss
#

This is great. This only leaves some type of confirmation of what is causing this.

Or not what is causing it, but what is causing it to work. I keep wondering… just because I haven’t seen the error on driver 580 doesn’t mean I won’t. I feel like I could just as easily say RTX PRO 6000 WK’s don’t have the issue either.

tardy verge
#

yeah you have to test if it works in that driver version

quaint moss
#

I guess I could automate a test for this and just smash the api with it

#

I guess I’ve been assuming that if I get a pod and see /dev/nvidia5 that that means I’ve got gpu index 6

#

But in a few examples above when I was testing earlier I got some successes with /dev/nvidia1 and /dev/nvidia2

so maybe that device number doesn’t mean the index? If so, how could we tell?

tardy verge
quaint moss
#

Hmm. I wonder if that device minor number is visible from inside the container
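
One way to look at this from inside a container is to print the major and minor numbers of every `/dev/nvidiaN` node. This is only a sketch: the function name `list_nvidia_nodes` is made up here, GNU `stat` is assumed, and whether the minor number matches the index that `nvidia-smi` prints is exactly the open question in this thread, so treat the output as a data point rather than an answer.

```shell
# list_nvidia_nodes [dir]
# Prints "<path> major=<dec> minor=<dec>" for each character device
# named nvidia<N> in dir (default: /dev). Assumes GNU stat.
list_nvidia_nodes() {
  dir=${1:-/dev}
  for node in "$dir"/nvidia[0-9]*; do
    # Skips the unexpanded glob when nothing matches, and non-device files.
    [ -c "$node" ] || continue
    # %t and %T are the major and minor device numbers in hex.
    major_hex=$(stat -c '%t' "$node")
    minor_hex=$(stat -c '%T' "$node")
    printf '%s major=%d minor=%d\n' "$node" "0x$major_hex" "0x$minor_hex"
  done
}
```

Running `list_nvidia_nodes` next to `nvidia-smi -L` on a working pod and on a failing pod would show whether the node name, the minor number, and the listed GPU line up differently in the two cases.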

tardy verge
#

I made a script

tardy verge
#

ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y

#

is this command right?

#

and it looks like not many machines with 5090s that have cuda 13 are available

tardy verge
#

maybe I am wrong with the testing

tardy verge
#

idk at this point

#

note that 5090s with cuda 13 are rare

#

i think there is 1 machine available

quaint moss
#

That command looks right.

tardy verge
#

i can't get more than 4 concurrently

quaint moss
#

How are you checking for pass/fail?

#

If ffmpeg returns anything but 0?

tardy verge
#

this

#

Conversion failed!

#

if that's in stderr it's a failed run

#

this is the raw output

quaint moss
#

Gotcha. Yeah I was grepping with grep -q "No capable devices found"
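
The two checks mentioned here (exit status and log markers) can be combined. The sketch below is an assumption about how such a check might look, not the script either participant used; `classify_run` is a hypothetical name, and the two marker strings are taken from the error log in the first post.

```shell
# classify_run <exit_status> <log_file>
# Prints "fail" if ffmpeg exited non-zero or the log contains one of the
# failure markers seen in this thread, otherwise prints "pass".
classify_run() {
  status=$1
  log=$2
  if [ "$status" -ne 0 ] ||
     grep -q -e 'No capable devices found' -e 'Conversion failed!' "$log"; then
    echo fail
  else
    echo pass
  fi
}
```

A typical call would redirect ffmpeg's stderr to a file, for example `ffmpeg ... 2> /tmp/ffmpeg.log; classify_run $? /tmp/ffmpeg.log`.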

tardy verge
#

doesn't this look off

#

no "No capable devices found"

#

oh there is

#

anyways I would like to test with RTX6000 blackwell

#

but im broke

#

😢

quaint moss
#

I've been sitting here trying to patch ffmpeg.

tardy verge
#

can i ask why you need ffmpeg?

#

at this point i think using other platforms to transcode is better

#

lol

quaint moss
#

I'm trying to use nvenc to convert archives of videos to streamable formats.

#

I guess I could try gstreamer instead

tardy verge
#

non-blackwell cards fail too?

quaint moss
#

I dunno. I haven't tried them much because they're slower and don't have 9th gen nvenc

#

The newer blackwell have 4 nvenc and seem to blow the L4 gpu's from google cloud run out of the water. At least with this encoding stuff.

tardy verge
#

Yeah ik

#

Community cloud seems to work better

#

Rtx pro 6000 maxqs were all working

quaint moss
#

gstreamer seemed to work

#

I haven't thoroughly tested it but I think its slower than ffmpeg

tardy verge
#

Is it using cpu?

quaint moss
#

That’s what I’m trying to confirm. I ran an nvenc test on the gst-bad nvenc plugin and it seemed to spit out a video.

#

Can’t tell if it falls back to cpu

tardy verge
#

maybe try seeing if nvidia card pulls more power when encoding

#

or if gstreamer uses more than 100% cpu while encoding
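
A sketch of the first suggestion, assuming the `encoder.stats.sessionCount` and `power.draw` query fields are available on the installed driver (`nvidia-smi --help-query-gpu` lists what is supported). The function name `encoder_busy` is made up for illustration. It only parses the CSV text, so the `nvidia-smi` call itself stays in the usage comment.

```shell
# encoder_busy: reads "sessionCount, power" CSV lines (one per GPU) on stdin.
# Succeeds if any GPU reports at least one active encoder session.
encoder_busy() {
  awk -F', *' '$1 + 0 > 0 { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage while an encode is running in another shell:
#   nvidia-smi --query-gpu=encoder.stats.sessionCount,power.draw \
#     --format=csv,noheader,nounits | encoder_busy && echo "GPU encoder in use"
```

If the session count stays at 0 for the whole encode, that would suggest the pipeline has fallen back to a software encoder.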

quaint moss
#

I was deleting a couple pods and accidentally deleted the one I compiled gstreamer on. So now I gotta go through all that mess again.

tardy verge
#

why not binary releases?

quaint moss
#

But I will. If gstreamer can do it, then ffmpeg clearly can be patched

quaint moss
tardy verge
#

it says you'll get them automatically

#

try it on runpod pytorch official template

#

that has ubuntu 22 from what I know

quaint moss
#

Oh cool

tardy verge
#

hmm

quaint moss
#

The gst-plugins-bad with Ubuntu doesn’t have nvenc. I’ll have to compile it again in the morning.

tardy verge
#

it wont work

#

gst-inspect-1.0 nvcodec

#

try this

#

it does have nvcodec

#

but no nvh264enc?

quaint moss
#

Yep
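
To check for a specific element rather than the whole plugin, `gst-inspect-1.0` has an `--exists` option that reports the result through its exit status. A small sketch, with `has_gst_element` as a hypothetical helper name and the `GST_INSPECT` variable added only so the binary can be swapped out:

```shell
GST_INSPECT=${GST_INSPECT:-gst-inspect-1.0}

# has_gst_element <name>: succeeds if GStreamer can find the element.
has_gst_element() {
  "$GST_INSPECT" --exists "$1" >/dev/null 2>&1
}

# Usage:
#   has_gst_element nvh264enc && echo "nvh264enc available" || echo "nvh264enc missing"
```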

tardy verge
#

looks like someone made a script

radiant prawn
#

woah huge thread

#

Great to see you all working on it, but the 12.8 selector should not give you CUDA 13 machines. However you're correct in that a non zero amount of our servers are on 13.0 - but it's not an amount that I can guarantee will always be available.

#

And naturally we'll do what we can from the backend^ Just a little awkward while we're running maintenance.

tardy verge
#

All community cloud instances I have tested worked

#

@radiant prawn This is strange

#

All instances with RTX PRO 6000

#

didn't try 5090s yet

radiant prawn
#

I think one of the facets of this is the hosts operating system/kernel version. Let me take a look

tardy verge
#

worked

#

all community cloud instances with RTX PRO

#

I wasn't able to spin up many because there were not many available

quaint moss
#

Every RTX PRO 6000 WK I have tried worked no matter what version or driver or which enumerated gpu I got

tardy verge
#

@radiant prawn would you like the pod ids?

tardy verge
#

the folder names are pod ids

radiant prawn
#

The unsecure cloud host uses Ubuntu 22.04.5.

#

So probably not the operating system.

quaint moss
#

I’ll try the community cloud instances tonight.

radiant prawn
#

There's not a lot and I can't guarantee the availability.

tardy verge
#

yeah

#

i got total under 10 pods

radiant prawn
#

I can tell you we have 1 machine on the community cloud with this GPU, and one machine physically cannot support more than 8 GPUs.

tardy verge
#

and half of them didn't even work (probably pulling image)

radiant prawn
#

maybe it's not one machine but it's 1 OS and that usually indicates one machine

#

it's 2 :)

#

@tardy verge What cuda version was the one you were on?

#

If you know, it could've only been 13.0 or 12.8.

tardy verge
#

has 12.9

radiant prawn
#

uh? nvcc --version

tardy verge
#

is nvidia-smi inaccurate?

#

I can't ssh back because the test is automated

radiant prawn
#

I learned recently nvidia-smi will show the highest cuda version the driver supports
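
Since the two tools report different things, it can help to record both: the `nvidia-smi` header shows the highest CUDA version the host driver supports, while `nvcc --version` shows the toolkit installed in the image. A sketch for pulling the toolkit release out of the `nvcc` output; `toolkit_cuda_version` is a hypothetical helper name.

```shell
# toolkit_cuda_version: extract the toolkit release (e.g. 12.2) from
# `nvcc --version` output supplied on stdin.
toolkit_cuda_version() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Usage inside a pod:
#   nvcc --version | toolkit_cuda_version
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```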

tardy verge
#

pod is already terminated

radiant prawn
#

o7 thats fine

#

Would you happen to have the prompt from the pod?

#

root@12345678...

#

Or does the output only give you the result of nvidia-smi?

tardy verge
#

8mmj1nmc6r2ksh

#

this is the podid

radiant prawn
#

perfect

tardy verge
#

i don't have the prompt

radiant prawn
#

We have this machine listed as cuda 12.9

#

Weird when I queried for it it showed as 12.8

tardy verge
#

drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 8mmj1nmc6r2ksh
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 mgrd9lo1q1bptd
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 mx91o0m3l84i0c
drwxr-xr-x@ 8 riverfog7 staff 256 Oct 16 09:58 uirv051063dg6j
drwxr-xr-x@ 6 riverfog7 staff 192 Oct 16 09:58 yh7yf9vtb4o2k1
drwxr-xr-x@ 6 riverfog7 staff 192 Oct 16 09:58 yl08tpmbn6esh9

#

these are the ones i tested

radiant prawn
#

I just opened this, this is excellent actually

tardy verge
#

I do have some 12.8 ones too

radiant prawn
#

If you do manage to find a correlation let me know, not that you're obligated to and I can very easily create (or run?) a script that simulates a bunch of different variables to pull details. I think we have this chalked down to these issues from the last time we got a report like this:

https://trac.ffmpeg.org/ticket/11694
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1249
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1209
https://github.com/NVIDIA/nvidia-container-toolkit/issues/1197
https://github.com/NVIDIA/k8s-device-plugin/issues/1282

tardy verge
#

OP said gstreamer worked so it might be a bug in ffmpeg

radiant prawn
#

But this issue was reopened, by another? customer just yesterday with the following reproduction and we rolled that up into this too. We discussed a few workarounds, but aren't happy with any of them as they all have their own issues.

tardy verge
#

only correlation is this?

#

stupidly blue is failure

#

red is success

radiant prawn
#

Does it help to know that after our maintenance the lowest driver version in the fleet will be 570.195.03?

tardy verge
quaint moss
#

That’s the driver I have the most issues with 😂

#

I wish I had escalated privs on one of these nodes so I could test a few things. For example I wonder if this could be fixed by running mknod with the major and minor from /dev/nvidiaX

MAJOR=$(stat -c '%t' /dev/nvidiaX)
MINOR=$(stat -c '%T' /dev/nvidiaX)

mknod /dev/nvidia0 c 0x$MAJOR 0x$MINOR
chmod 666 /dev/nvidia0
tardy verge
#

forgot to mention it

radiant prawn
#

I don't have ssh access to the hosts but I do have a lot of other permission.

#

@tardy verge I can message you a credit to continue your testing if you'd like.

tardy verge
#

sure but its midterm season soon so i don't know if i can continue for long 😅

radiant prawn
#

ah i understand

quaint moss
#

I still wonder why the 6000 WK has worked no matter what version I get. I guess we need to check the index in nvidia-smi to see if it’s just luck

#

This is the kind of bug I used to love working on when I was at nvidia. Don’t have access to those kind of testing rigs anymore though.

tardy verge
#

are there any stats that are nice to have when debugging?

tardy verge
#

what is this?

#

display attached = True?

quaint moss
#

Is that the thingy you need to set if you want to do stuff like vnc or X11 forwarding?

tardy verge
#

i don't know about that part well

#

this time its 280 instances

#

of 5090s mixed between community and secure cloud

quaint moss
#

Is your test easily runable? I don’t mind to burn through credits testing other scenarios

#

Other gpu’s I should say

tardy verge
#

its just doing this

#

in a template

tardy verge
#

autogluon feature importance

#

it just identified every gpu by id

#

idk at this point

#

this is probably a software bug inside nvidia or ffmpeg

quaint moss
#

Yeah I read all the tickets and associated links this morning. Nothing seems to be reliable in the reproduction. Some people say you need to be GPU 0, others say the last gpu is working or an odd number in between.

Some say the bug is a regression starting at driver 570. Others have reproduced it on 550 and lower.

Nobody seems to be focused on fixing it. Some ffmpeg references say it's an issue with nvcodec itself.

quaint moss
#

Just ran several simultaneous iterations of a quick encoding task on the PRO 6000 WK. Not a single one failed. All gave me 580/13

#

First test on a PRO 6000, 575/12.9 fail

quaint moss
#

Had a failure on a serverless worker. I didn't catch it in time to see the logs. But the only difference is that it was not in NA.

quaint moss
#

I just had an idea to try and do a health check on a serverless endpoint. My question is, once a worker gets my image and goes idle, does it already have this issue or not?

Was hoping I could do a health check and if it fails, the worker terminates and a new one is created until I am left with nothing but workers without this issue.

#

But when my health check fails, whatever is orchestrating the containers just restarts it instead of terminating and launching on a new worker. Wonder if I can fail with a different error code to get it to terminate?

tardy verge
#

you can self destruct with this in a pod

#

runpodctl remove pod ${RUNPOD_POD_ID}

#

I'm not sure about serverless

tardy verge
#

so no need to configure credentials
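
Putting the pieces together for a pod, a startup check might look like the sketch below. It is an illustration under assumptions, not something confirmed to work on serverless workers: the function names are made up, the `FFMPEG` and `RUNPODCTL` variables exist only so the binaries can be swapped, and the encode is a shortened variant of the test command from the first post with the output discarded via `-f null -`.

```shell
FFMPEG=${FFMPEG:-ffmpeg}
RUNPODCTL=${RUNPODCTL:-runpodctl}

# nvenc_smoke_test: encode one second of test video with h264_nvenc and
# discard the output. Succeeds only if ffmpeg exits with status 0.
nvenc_smoke_test() {
  "$FFMPEG" -hide_banner -loglevel error \
    -f lavfi -i testsrc=duration=1:size=1280x720:rate=30 \
    -c:v h264_nvenc -pix_fmt yuv420p -f null - >/dev/null 2>&1
}

# check_or_remove_pod: run the smoke test; on failure, remove this pod
# using the command quoted above and return non-zero.
check_or_remove_pod() {
  if nvenc_smoke_test; then
    echo "nvenc ok"
  else
    echo "nvenc failed, removing pod ${RUNPOD_POD_ID}" >&2
    "$RUNPODCTL" remove pod "${RUNPOD_POD_ID}"
    return 1
  fi
}
```

Calling `check_or_remove_pod` at the top of the container's start script would gate the real workload on a passing encode. Time spent on the check is presumably billed like any other container time, and nothing here prevents the replacement pod from landing on the same GPU.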

quaint moss
#

I’ll give it a shot.

I don’t know if my concept is flawed though. Like, if a worker starts the container and doesn’t get the error in ffmpeg, does that mean when a request comes in hours later that it still won’t run into this bug?

I guess I don’t know how the serverless workers are orchestrated.

#

The question is, is the gpu already assigned when the worker goes idle? And if so, does it stay that way?

tardy verge
#

@radiant prawn is there an api to terminate serverless workers individually?

#

its possible in serverless console thingey

quaint moss
#

If so, this fixes everything for me. Just takes a little longer to deploy

tardy verge
#

you can see in my test that there are multiple overlapping gpu ids

tardy verge
quaint moss
#

I think I’m saying the same thing as you

tardy verge
#

test ffmpeg fails -> terminate worker -> runpod spins up same worker

#

can happen

quaint moss
#

Ohh

tardy verge
#

probably not a problem if there are many GPUs

#

but since you are using the RTX PRO 6000 and they have limited supply

quaint moss
#

Well when I manually terminate one I always get back a worker with a different id. But I don’t know if that’s unique or not

#

I’m using L40, L40S, RTX PRO 6000 and I think one other gpu

tardy verge
#

pod id is different here

#

but GPU uuid is same for some pods
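
A sketch of how repeated GPUs could be spotted across test runs, assuming each pod logged its GPU UUID (for example with `nvidia-smi --query-gpu=uuid --format=csv,noheader`). The `<pod_id> <gpu_uuid>` input format and the name `repeated_gpu_uuids` are assumptions for illustration.

```shell
# repeated_gpu_uuids: reads "<pod_id> <gpu_uuid>" lines on stdin and prints
# every UUID that was seen on more than one line.
repeated_gpu_uuids() {
  awk '{ count[$2]++ } END { for (u in count) if (count[u] > 1) print u }' | sort
}
```

Any UUID printed here was handed out to more than one pod, which matters when deciding whether "terminate and retry" actually lands on different hardware.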

quaint moss
#

I guess I don’t know what a worker truly is. Is it a shared server? Is it a shared cluster? Etc.

Because it could totally just be server rack that picks up requests from the queue and runs docker run -it … -gpus=5 (not really but you get the idea). That means it could work for one request then fail the next.

Otherwise if it’s consistent, then I’m ok with this janky solution.

tardy verge
#

My opinion is a serverless worker is just a pod

#

and worker id = pod id

#

and works with the same infrastructure, hence can share network volumes

quaint moss
#

That’s what I think too.

#

Pods + orchestration = serverless worker

tardy verge
#

ECS in AWS terms

#

but with an API Gateway

#

and a queue

#

and cloudfront

quaint moss
#

Yep.

#

So if I can kill a worker during the initial health check, it may be a workable solution. Provided I don’t get the same one over and over 🤔

tardy verge
#

this is only a problem when there are like 10 GPUs available and 7 of them are not working

#

but you are trying to get 5 workers

#

and you go broke because you get billed for the ffmpeg health check time

quaint moss
#

Oh I never thought about it billing me for the deploy time

tardy verge
#

you should get billed for the health check

#

because the container is already started

quaint moss
#

That’s a good point. I’ll have to see where I can run this in the lifecycle

#

I put the health check right before the serverless handler and deployed and noticed some of the instances kept initializing. So I assumed it was running and I just couldn’t see the logs

tardy verge
#

I think it's failing the health check and the host just restarts it

#

pods do the same thing when containers exit abnormally

#

host starts it until it works

#

do you have anything in serverless console -> logs instead of serverless console -> workers -> worker -> logs?

quaint moss
#

Nothing from the deploy. So it’s either not running the health check or none of the logs it generates during a deploy are in those logs I can see

#

The only thing I changed was adding the health check and the only two outcomes I saw were workers initializing over and over or becoming ready and successfully handling requests.

tardy verge
#

I just spun up a random serverless endpoint and it looks like "Initializing" is pulling images and extracting them

#

and "running" is the actual container running

#

so if worker is created
initialize -> running (load model in memory and health check, etc) -> idle (waits till request)

quaint moss
#

I only ever get the running state when I send a request. I just get initializing -> idle

tardy verge
#

oh I think i set this to get the worker count to go up

#

that makes sense, I think I'm wrong then

#

my question is "Is anything happening after the container starts and before serverless billing starts?"

#

it's not really clear from this explanation

quaint moss
#

What I thought was happening is deploy > workers are assigned and they all pull your image

Then request > container starts

#

But after deploy, does it run the container at all?

#

If so, health check + terminate on fail would work. Otherwise it won’t

#

I’ve never noticed it charging me for the deploy phases

quaint moss
#

Hmm. I don’t think this will work now. There is no health check I can find in the docs for queue based serverless

quaint moss
#

Ugh. I am spending too much time thinking about this each night. I gotta just implement my own queue and let the occasional failures retry. At most, terminate failed workers

tardy verge
#

Do Dockerfile health checks work?

quaint moss
#

I don’t think the worker is even running the container until you send a request to it.

#

At least that’s my theory

tardy verge
#

Hmm

quaint moss
#

I wonder how Google gets around this

tardy verge
#

google?

quaint moss
#

With cloud run gpu instances. We’ve processed lots of video using those and never ran into this problem. They’re on L4 gpus

tardy verge
#

idk about cloud run but in AWS ECS containers run inside vms

#

not like runpod (shared host)

quaint moss
#

Google cloud run is using docker. At least for their second gen runtimes

#

Maybe that’s the answer. Just run a docker container inside the docker container 😅

tardy verge
#

good news

#

serverless workers count as pods ig

#

so runpodctl remove pod <workerid> works

#
if ffmpeg -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -pix_fmt yuv420p -t 5 /tmp/test.mp4 -y > /dev/null 2>&1; then
    echo "FFMPEG with NVENC encoding succeeded."
else
    echo "FFMPEG with NVENC encoding failed."
    runpodctl remove pod "${RUNPOD_POD_ID}"
fi
#

should work

#

if the env variable is correct and runpodctl is installed in the pod

quaint moss
#

Sorry. Wife aggro. I'll check this out.

#

So with that scenario, the job will fail, but it will take the worker out with it.

radiant prawn
quaint moss
#

And there's no way during a deploy to tap into any of the health checks?

tardy verge
#

That should depend on the docker cmd running at worker initialization

quaint moss
#

Yeah I don’t think the container itself runs. I suspect the pod just fetches the image and makes sure everything is ready for when it receives a request

quaint moss
#

Ok new tactic...

My API forms the request, sends it to the serverless endpoint, then receives the webhook back on success or fail. On fail, it retries.

Meanwhile, on the worker, if we get the bad cuda ffmpeg response about no supported devices, we terminate the worker.

Provided there is enough delay, the retry will go to the next worker while the last one was being terminated. So the system should only waste a few seconds on retries and terminating before eventually hitting a successful worker.

This also has the added side effect of constantly pruning bad workers.
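The worker-side half of this tactic could look roughly like the following. It is a sketch under assumptions: the two error substrings are ones h264_nvenc commonly prints when it cannot open a device and may not match your logs exactly, the function names are illustrative, and `runpodctl` plus `RUNPOD_POD_ID` are assumed to be available in the container as discussed earlier in the thread.

```shell
# Returns 0 if the ffmpeg stderr text in $1 looks like the NVENC
# "no usable device" failure. The patterns are assumptions; adjust them
# to whatever your own logs show.
is_nvenc_device_failure() {
    case "$1" in
        *"unsupported device"*|*"No capable devices found"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Runs ffmpeg with the given arguments. On the NVENC device failure it
# removes this worker's pod so that a retried job lands on another worker.
run_encode_or_terminate() {
    # Capture stderr only; ffmpeg's stdout is discarded.
    if nvenc_err=$(ffmpeg "$@" 2>&1 >/dev/null); then
        return 0
    fi
    if is_nvenc_device_failure "$nvenc_err"; then
        echo "NVENC unavailable on this worker; terminating it." >&2
        runpodctl remove pod "${RUNPOD_POD_ID}"
    fi
    return 1
}
```

The job still returns a failure in that case, so the API-side retry is what moves it to a different worker.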

#

So far it's working as expected

tardy verge
#

Hmm

quaint moss
#

Just noticing some things. If I set up my serverless workers to run on 5090s, it's like 9.5 out of 10 fail and have to be terminated.

L40 and L40S, maybe 1 out of 5 fail.

RTX PRO 6000, zero have been terminated by my script.

#

Totally usable now that I am having the worker terminate itself when it detects those failure modes. After a small number of failures, we get 100% working workers. I have not seen a worker that worked the first time but then subsequently failed. So I think it does work the way I was thinking.

tardy verge
#

Failure rate was about 75% in my testing

#

But it includes Community Cloud

jolly tapir
#

Hey everyone, is this issue fixed yet?
I have an RTX 3060 Ti, and when I run my Docker image locally, it works perfectly and uses the GPU.
However, when I pull the same image on Serverless or RunPod, it doesn’t use the GPU at all.

It only works if I add a CPU fallback — then it runs on CPU.
But if I try running with GPU only, it throws this error:
"details": "Traceback (most recent call last):
  File "/app/unified_r2_only.py", line 134, in compress_video_nvenc
    raise RuntimeError(f"ffmpeg exit {proc.returncode}")
RuntimeError: ffmpeg exit 1 "

quaint moss
#

Just for reference, the last couple deploys I did had to prune fewer workers before it was left with 5 unaffected ones. Possibly just luck of the draw.

#

I have all datacenters selected still though.

#

I suspect if I limited to NA it would be close to zero

tardy verge
#

@radiant prawn Sorry to mention but can you check these Community cloud pods if they are working properly?

radiant prawn
#

Working backwards:

radiant prawn
radiant prawn
# tardy verge

Machine error (both variations of the error shown here)

radiant prawn
# tardy verge

I'll aggregate these pod ids to unique machine ids and ideally point out the errors to this host.

jolly tapir
#

Has anyone dealt with this issue? I get ffmpeg return -1 on most cards on RunPod, and the ones that do work give this performance, while the same code on my local machine utilizes 100 percent.

quaint moss
#

Ampere cards are meh with nvenc

#

Probably gonna have to find the actual error so we can see what’s really going on.

#

Even with 4 nvenc chips and running tons of nvenc sessions I only get like 15% usage

#

Gpu usage isn’t the same as nvenc usage
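To see encoder load separately from overall GPU load, `nvidia-smi dmon -s u` reports both. A small sketch that condenses its output, assuming the column order `gpu sm mem enc dec` shown in the header on recent drivers (check the header on your own system); `summarize_dmon` is an illustrative name.

```shell
# Reads `nvidia-smi dmon -s u` output on stdin and prints one short line
# per GPU. Assumes columns: gpu index, sm %, mem %, enc %, dec %.
summarize_dmon() {
    # Header lines start with '#'; data lines carry at least five fields.
    awk '!/^#/ && NF >= 5 { printf "gpu=%s sm=%s%% enc=%s%% dec=%s%%\n", $1, $2, $4, $5 }'
}

# Usage while an encode is running:
#   nvidia-smi dmon -s u -c 1 | summarize_dmon
```

A busy NVENC session shows up in the `enc` column even when `sm` stays low.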

dense scroll
#

Is there any paved road with this? I am having the same issue.

normal gate
#

I don't think there's a fix for this yet

dense scroll
#

After a lot of debugging, NVIDIA_DRIVER_CAPABILITIES=compute,decode,encode,utility,video
is still not working. The decode,encode part is being blocked no matter what I do.
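For reference, the capability values documented for the NVIDIA Container Toolkit include `compute`, `compat32`, `graphics`, `utility`, `video`, `display`, and `all`. As far as I know `encode` and `decode` are not among them, and `video` is the one tied to the Video Codec SDK. A quick sketch to confirm what the container actually requested (the function name is illustrative, and a passing check does not prove the host driver exposes NVENC):

```shell
# Returns 0 if the comma-separated capability list in $1 contains
# "video" or "all", the values that request the codec libraries.
caps_include_video() {
    case ",$1," in
        *,video,*|*,all,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage inside the container:
#   caps_include_video "${NVIDIA_DRIVER_CAPABILITIES:-}" && echo "video capability requested"
#   ldconfig -p | grep -E 'libnvidia-encode|libnvcuvid'   # were the codec libraries mounted?
```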

quaint moss
#

I finally solved this. Going to write a paper describing it.

astral silo
#

it probably really deserves a paper since it took over 6 months to solve

quaint moss
#

For sure. Got a few more tests to run to make sure it doesn’t break on multi-gpu pods. But I wrote a shared library that patches the calls made between nvidia’s user-space driver libraries and the kernel.

Specifically the call NV0000_CTRL_CMD_GPU_GET_ATTACHED_IDS that goes to the nvidia resource manager.

#

libnvidia-encode.so and libnvcuvid.so

#

Totally worked for me earlier. I booted up a pod and ran ffmpeg as a test and got the error. Then added an env var to tell the system to use my library and boom. Success.

Tried it on 5 different pods and it fixed it on every gpu and cuda combo so far.
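The message above does not name the env var. A common way to load a patching shared library into an existing binary on Linux is `LD_PRELOAD`, so the sketch below assumes that mechanism; the library path and the function names are hypothetical.

```shell
# Hypothetical install path; the thread does not say where the library lives.
NVENC_SHIM_PATH="${NVENC_SHIM_PATH:-/usr/local/lib/libnvenc_shim.so}"

# Prints the LD_PRELOAD value to use: the shim ($1) first, followed by
# any value that was already set ($2).
build_preload() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Runs ffmpeg with the shim preloaded for that one process only.
ffmpeg_with_shim() {
    LD_PRELOAD="$(build_preload "$NVENC_SHIM_PATH" "${LD_PRELOAD:-}")" ffmpeg "$@"
}
```

Scoping the variable to the single ffmpeg invocation keeps the shim out of every other process in the container.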

#

The sad part is this is an nvidia bug in the driver but we could have fixed it from inside the container the whole time.

raven heath
#

@quaint moss feel free to share your findings, as I'm also interested

raven heath
#

@quaint moss so I think I was able to fully nail the ffmpeg setup and make it work 🙂

raven heath
#

Feel free to use it, it's all open source.

quaint moss
#

There's all of my notes and code. Although @raven heath's shim is probably more friendly. Mine was just a C file you compile during image build so you can include the library.

#

I have debugging flags you can pass in to have it send logs to stderr or out to a file. I had to use the crap out of those because sometimes the failures wouldn't get logged and I'd lose them when I destroyed the pod.

mint fossil
# raven heath can you give a try to https://github.com/MadiatorLabs/nvsc...

It is very good.

On my end, I performed dozens of encodes using this software across multiple data centers, specifically focusing on the RTX 5090, and did not encounter this issue. In other words, I was able to successfully perform high-speed encoding using NVENC.

I believe this problem has finally been resolved.

It seems like it would work well if I add a few layers to the Dockerfile and perhaps set up an alias for the ffmpeg command.