#Serverless Requests Queuing Forever


fiery scaffold
#

Title says it all - I send a request to my serverless endpoint (just a test through the runpod website UI), and even though all of my workers are healthy, the request has just been sitting in the queue for over a minute.

Am I being charged for time spent in queue as well as time spent on actual inference? If that's the case, then I'm burning a lot of money very fast lol. Am I doing something wrong?

dusk river
#

U need to specify more info like the image and model u r using

ivory chasm
#

you're charged only while a worker is running, which includes the time it spends loading the model (so not for queue time as such, though the two can overlap). look at the workers tab: when a worker is green, it's running

#

Check a worker then check the log

fiery scaffold
ivory chasm
fiery scaffold
# ivory chasm Check a worker then check the log

the workers were all fully initialized and ready, but once I queued the request, they just didnt seem to do anything. I was getting a tokenizer error in the logs for the first model I tried running but didnt see any errors on the second

ivory chasm
#

and which gpu model are you using?

fiery scaffold
#

I believe I had selected an H100

ivory chasm
#

ic

#

no logs at all?

#

if you can please export or download the logs and just send it here

fiery scaffold
#

I will on my next attempt, had to move onto another project for a little while

dusk river
#

If you didnt specify a network volume, downloading models can take a long time

#

27b model is about 54GB
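
That 54GB figure matches a simple rule of thumb: parameter count times bytes per parameter. A minimal sketch, assuming 16-bit weights (2 bytes each) and decimal gigabytes; the function name is made up for illustration, and it ignores the KV cache and any runtime overhead:

```python
def weight_size_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# 27B parameters at 2 bytes each (fp16/bf16 is an assumption here).
print(weight_size_gb(27e9))  # 54.0
```

Without a network volume, roughly that much has to be downloaded before a worker can answer its first request.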

fiery scaffold
dusk river
#

Can you upload the logs here

fiery scaffold
#

Like I said, I will once I go to try again. I've already removed that endpoint unfortunately.

dusk river
#

The pr was merged 26 days ago

fiery scaffold
#

I was wondering about that - I just assumed that the default VLLM container on the serverless option was up to date though

#

Can I use any container off of a registry like you can with normal pods?

dusk river
#

From what i know it needs a handler for the requests

#

But you can always build the vllm container with the latest vllm

#

Dockerfile should be in runpod's official repo

fiery scaffold
#

haha yeah I've actually been trying to do that so I can build it with support for a 5090

#

which also is not going well

#

I can build the image just fine, it's just not compiling with the right CUDA version no matter the modifications I make to the dockerfile

dusk river
#

Wdym by support for 5090

fiery scaffold
#

anyway that's unrelated

fiery scaffold
dusk river
#

It should be fine tho, because no one uses cuda 12.8. It's too new

#

Cuda 12.4 should work fine with a 5090

#

@fiery scaffold what's ur max model len

ivory chasm
#

vllm 0.8.2 supports gemma3 already

ivory chasm
#

how did you check it, i thought you just choose a base image for that

dusk river
fiery scaffold
dusk river
#

Cuda related stuff always causes headaches 😪

ivory chasm
#

no really, how did you check it, i wanna know

fiery scaffold
#

The logs say as much in the container after powering it on once it's built.

ivory chasm
#

"It is not compiling with the right cuda version"

fiery scaffold
#

right - I changed the base image via the ARG for cuda version in the dockerfile. I'm going to go back at it later tonight but just havent had the chance yet.

#

ARG CUDA_VERSION=12.8.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
ARG CUDA_VERSION=12.8.1
ARG PYTHON_VERSION=3.12
ARG TARGETPLATFORM
ENV DEBIAN_FRONTEND=noninteractive

#

then later on there's an arg/env value called torch_cuda_arch_list which searches seem to indicate I should set to either 12.8 or 12.8.1. I believe this has something to do with how the flash attn modules are compiled, or rather which versions of cuda they compile for

fiery scaffold
#

but then after building the image with all of those changes, I still get the following when starting the container:
NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

More info in this thread: https://github.com/vllm-project/vllm/issues/14452
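
The error above comes down to the GPU's compute capability not being among the architectures the PyTorch build was compiled for. A rough sketch of that comparison, assuming a simplified rule (a build is usable if it contains any `sm_` entry with the same major version as the GPU); the helper names are made up, and on a real machine the inputs would come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`:

```python
import re


def parse_sm(arch: str) -> int:
    """Turn 'sm_86' into 86; raise ValueError for anything else."""
    match = re.fullmatch(r"sm_(\d+)", arch)
    if match is None:
        raise ValueError(f"not an sm_XY arch string: {arch!r}")
    return int(match.group(1))


def build_covers_gpu(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """Simplified check: does any compiled sm_ arch share the GPU's major version?"""
    major, _minor = capability
    sm_entries = [arch for arch in arch_list if arch.startswith("sm_")]
    return any(parse_sm(arch) // 10 == major for arch in sm_entries)


# Arch list copied from the error message above; (12, 0) is sm_120.
supported = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]
print(build_covers_gpu((12, 0), supported))  # False
```

Changing the CUDA base image alone does not change this list; it depends on how the installed PyTorch wheel itself was built.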

#

Lots of other apps and projects out there where people are having the same issue with blackwell compatibility

#

anyway, I just havent taken all the time needed to fully look into this nor is this what the current thread is about lol

ivory chasm
#

Try using 12.1-12.4

fiery scaffold
#

12.1-12.4 are not compatible with Blackwell cards though

#

it must be 12.8 or later

fiery scaffold
#

as far as I can tell yes

#

I'm not familiar enough with all of the intricate details but they don't support the new sm_120 compute capabilities of the blackwell cards just yet

#

I didn't know this until after I bought the card obviously, but tbh I'm still keeping it since I'm sure the support will come soon

dusk river
#

Did you try other versions?

#

Like 12.4

#

Cards with higher cuda compute capabilities should support lower cuda versions

#

It could be a version mismatch between your graphics driver and pytorch

fiery scaffold
#

hmm interesting. I'm on 570.124 on Linux so could be something there. Havent tried anything in windows but maybe I'll give that a shot next

dusk river
#

What does your nvcc -V

#

Print?

fiery scaffold
#

well my driver situation is pretty messed up, but I still don't think that's the issue exactly. CUDA applications that explicitly support the new compute capabilities and CUDA version seem to work just fine.

dusk river
#

It should print 12.8 something

ivory chasm
#

If their container image is 12.8 then it will be that, is the pod host 12.8(via pod create filter)?

fiery scaffold
#

It's all a bit more complex than I thought after more research. For now I'm just sticking with ollama locally until full explicit support for Blackwell is included in a vLLM release

dusk river
ivory chasm
#

huh requires custom vllm build and nightly packages nice

fiery scaffold
#

oh, this wasnt the issue/post I was following instructions from, this one is way better

#

damn thank you

#

this will probably work, now just need to find something similar for SGLang

dusk river
#

@fiery scaffold the issue mentions torch version upgrades so doing that may make sglang work too

fiery scaffold
#

thank you guys so much, seriously

dusk river
#

Btw can you give us an update if it works?

#

Just in case someone has to use a B200 or a RTX 5090 to deploy vllm

fiery scaffold
#

will do

#

oh you know what? SGLang's dockerfile uses triton server as its base image, so that's going to be inherently different. I have zero knowledge yet on Triton haha. Maybe it can be swapped for a different base image or the same that vLLM uses? Not too sure if there are specific dependencies there with triton.

dusk river
#

If its torch based wont that solution work?

#

@fiery scaffold why r u using sglang tho?

#

Im interested in building a dockerfile because i may use it in the future

fiery scaffold
#

honestly I'm not sure yet, but according to searches it's somehow better for tool usage, etc., whereas vLLM excels more at speeding up inference and serving simultaneous requests

#

SGLang is I think built from vLLM though

#

still pretty new to anything outside of Ollama so I'm probably not the best person to ask lol

fiery scaffold
dusk river
dusk river
#

but vllm is way better (like literally 5+ times better) if the requests can be batched

fiery scaffold
#

yeah, I'm preparing to deploy in a high traffic prod environment though lol. So I need something a little more robust

dusk river
#

for example

#

2xA40 with a 70B llama

fiery scaffold
dusk river
#

will get about 200tok/s with batched requests

dusk river
fiery scaffold
#

not sure quite yet which model but up to about 110b parameters as far as model sizes go. We'll be evaluating a bunch of different models initially before we decide on one

#

it'll likely be cloud hosted though

dusk river
#

lol its big

#

r u a startup?

fiery scaffold
#

no more details on that I wish to share at the moment haha

dusk river
#

anyways

#

110b at fp8 or int4?

fiery scaffold
#

very much looking forward to getting my hands on a DGX spark and/or station for this stuff soon though

fiery scaffold
#

smaller models that I'm looking at are in the 24-32b range and those I want to run at full fp16

#

I need to spend some time educating myself on the practical differences in accuracy between different quants

dusk river
fiery scaffold
#

the spark?

dusk river
#

you wont be able to batch request so much

#

yeah

fiery scaffold
#

yeah I'm going to probably get a spark for dev use, looking more at the station for potential prod stuff

#

the memory bandwidth on the spark is pretty low, but can't beat the 128GB available for loading models

#

not at that price anyway

dusk river
#

if you have many users you have to go cloud anyway

#

and you should have the latency & throughput requirements

#

cause more batched requests = more latency and more throughput

#

have to find a middle ground there

#

and determine the memory requirements based on that
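
To make that trade-off concrete, here is a toy model, not a measurement: assume each decode step costs a fixed overhead plus a small per-request cost, and every request in the batch gets one token per step. All names and numbers below are hypothetical placeholders.

```python
def decode_step_ms(batch_size: int, fixed_ms: float = 20.0, per_request_ms: float = 0.5) -> float:
    """Time for one decode step in the toy model, in milliseconds."""
    return fixed_ms + per_request_ms * batch_size


def throughput_tok_s(batch_size: int, fixed_ms: float = 20.0, per_request_ms: float = 0.5) -> float:
    """Tokens per second across the whole batch (one token per request per step)."""
    return batch_size * 1000.0 / decode_step_ms(batch_size, fixed_ms, per_request_ms)


# Larger batches: each user waits longer per token, but the server emits more tokens overall.
for batch in (1, 8, 32, 128):
    print(batch, decode_step_ms(batch), round(throughput_tok_s(batch), 1))
```

In this model both columns grow with batch size, which is the middle ground being described: pick the largest batch whose per-token latency is still acceptable, then size memory for that many concurrent requests.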

fiery scaffold
#

same actually, about to run the build

#

are you on docker?

dusk river
#

just ran the image on runpod

#

ill try to build vllm there and if it works write the dockerfile

#

but the terminal doesnt work 😦

fiery scaffold
#

ah bummer

dusk river
#

are you running it locally?

fiery scaffold
#

yeah

#

on build now I'm getting this:


Dockerfile:135

134 | ENV CCACHE_DIR=/root/.cache/ccache
135 | >>> RUN --mount=type=cache,target=/root/.cache/ccache
136 | >>> --mount=type=cache,target=/root/.cache/uv
137 | >>> --mount=type=bind,source=.git,target=.git
138 | >>> if [ "$USE_SCCACHE" != "1" ]; then
139 | >>> # Clean any existing CMake artifacts
140 | >>> rm -rf .deps &&
141 | >>> mkdir -p .deps &&
142 | >>> python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38;
143 | >>> fi

144
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref
#

I never modified this section though, nor do I quite understand what it means lol

dusk river
#

apt-get update && apt-get install -y --no-install-recommends \
    kmod \
    git \
    python3-pip \
    ccache

#

try this

#

installing ccache

fiery scaffold
#

ah nice ty

dusk river
fiery scaffold
#

still fairly new to docker tbh, only been working with it for like 3-4 months now

#

RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
    && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
    && apt-get update -y \
    && apt-get install -y ccache software-properties-common git curl wget sudo vim python3-pip \
    && apt-get install -y ffmpeg libsm6 libxext6 libgl1 \

ccache is installed right at the top of the dockerfile though

#

oh

#

my target was wrong

#

trying to eventually get to the openai server image from the base

dusk river
#

?

#

Are you building the image yourself or using the image from nvidia

fiery scaffold
#

nvidia

dusk river
#

Is it this one?

fiery scaffold
#

FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base

You talking about this?

dusk river
#

Uhh

fiery scaffold
#

I must be lost lol

dusk river
#

Yeah

dusk river
#

So the gh issue says just install vllm on top of it

fiery scaffold
#

I switched out the base image for the one you posted, but still getting that ccache issue

#

I'm still trying to modify the official dockerfile though

dusk river
#

I think you dont have to do that

#

In here it says that image has the torch and python stuff

#

So you have to clone vllm and then build it with the compiler supporting blackwell

#

And then its done

fiery scaffold
#

gonna try it

dusk river
#

try this

#
FROM nvcr.io/nvidia/pytorch:25.02-py3 as base
WORKDIR /tmp
RUN apt-get update && apt-get install -y --no-install-recommends \
    kmod \
    git \
    python3-pip \
    ccache \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

ENV VLLM_FLASH_ATTN_VERSION=2

RUN git clone https://github.com/vllm-project/vllm.git && cd vllm
RUN python3 use_existing_torch.py && pip install -r requirements/build.txt && pip install setuptools_scm
RUN --mount=type=cache,target=/home/root/.cache/ccache MAX_JOBS=10 CCACHE_DIR=/home/root/.cache/ccache python3 setup.py develop && cd /tmp/ && rm -r vllm
RUN python3 -c "import vllm; print(vllm.__version__)"

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
fiery scaffold
#

kk

#

what I'm not quite clear on, and this is just my lack of knowledge on the subject, is that as I watch the flash attn builds happen, it only appears to do sm80 and sm90?

dusk river
#

maybe its not ready for blackwell too

fiery scaffold
#

but those older compute versions should still work for flash attn?

#

well on blackwell

dusk river
#

hmm i dont know cuda well soo

fiery scaffold
#

building it now but it takes forever once it gets to the flash attn cmake steps. I did this once before and built a dockerfile based on that issue page, but I was missing the entrypoint line

#

I think that might be all I was missing, so I think this will do it hopefully

fiery scaffold
#

yeah just started building that

#

MAX_JOBS=10 should speed up the cmake steps I take it?

dusk river
#

if you have 10 cores yes

fiery scaffold
#

oh I do

dusk river
#

setting it much higher than the core count makes the machine sort of freeze

fiery scaffold
#

ah gotcha

dusk river
#

cuz it uses all da cores for building
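
A small sketch of how one might pick a `MAX_JOBS` value instead of hard-coding 10. The core count comes from `os.cpu_count()`; the RAM budget per compile job (`gb_per_job`) is an assumed placeholder, not a measured number, and the helper name is made up:

```python
import os


def pick_max_jobs(cpu_cores: int, ram_gb: float, gb_per_job: float = 4.0) -> int:
    """Largest job count that fits both the core count and the RAM budget."""
    by_ram = int(ram_gb // gb_per_job)
    return max(1, min(cpu_cores, by_ram))


cores = os.cpu_count() or 1  # cpu_count() can return None
print(f"MAX_JOBS={pick_max_jobs(cores, ram_gb=64)}")
```

Capping by RAM as well as cores matters because each parallel compile job holds its own memory, so a high job count can exhaust memory long before it runs out of cores.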

fiery scaffold
#

had to modify it a bit to change working dir after clone

dusk river
#

oof my bad

fiery scaffold
#

```dockerfile
# syntax=docker/dockerfile:1.4

FROM nvcr.io/nvidia/pytorch:25.02-py3 as base
WORKDIR /tmp

# Install required packages.
RUN apt-get update && apt-get install -y --no-install-recommends \
    kmod \
    git \
    python3-pip \
    ccache \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Set environment variable required by vLLM.
ENV VLLM_FLASH_ATTN_VERSION=2

# Clone the vLLM repository.
RUN git clone https://github.com/vllm-project/vllm.git

# Change working directory to the cloned repository.
WORKDIR /tmp/vllm

# Run the preparatory script and install build dependencies.
RUN python3 use_existing_torch.py && \
    pip install -r requirements/build.txt && \
    pip install setuptools_scm

# Build vLLM from source in develop mode.
RUN --mount=type=cache,target=/root/.cache/ccache \
    MAX_JOBS=10 CCACHE_DIR=/root/.cache/ccache \
    python3 setup.py develop && \
    cd /tmp && rm -rf vllm

# Test the installation by printing the vLLM version.
RUN python3 -c "import vllm; print(vllm.__version__)"

# Set the entrypoint to start the vLLM OpenAI-compatible API server.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
```

#

now we're cookin

dusk river
#

maybe if it works building sglang with that could work

fiery scaffold
#

yeah gonna try it

dusk river
#

also i found nvidia's official(idk but it says nvidia) image for triton inference server

#

so that's a candidate for sglang base image

fiery scaffold
#

nicee

dusk river
#

just got ssh working

#

it appears that torch is working with blackwell

fiery scaffold
#

very nice

#

about halfway done building my image

dusk river
#

good sign on my side too

fiery scaffold
#

what build command and args did you use?

dusk river
#

its the same (except the core count) as the dockerfile

#

CCACHE_DIR=/home/root/.cache/ccache python3 setup.py develop
just this

#

uhoh

#

shouldn't have deleted the code

#

lol

fiery scaffold
#

not following

#

"its setup.py develop
shouldn't have deleted the code"

What do you mean?

dusk river
#

Stackoverflow says develop links the code in the repo to site packages

#

So if i delete the repo it might break
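
One way to see this for yourself is to ask Python where a package would be imported from. A minimal sketch using only the standard library; `json` is just a stand-in that exists everywhere, and the helper name is made up. For a develop-mode install of vllm, the printed path would be expected to point into the cloned repo rather than into site-packages:

```python
import importlib.util
from typing import Optional


def module_location(name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


print(module_location("json"))  # stdlib package, used here as a stand-in
print(module_location("vllm"))  # None when vllm is not installed
```

If that path lives inside the clone, removing the clone removes the code Python is actually importing.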

fiery scaffold
#

oh the cloned repo in the container?

dusk river
#

yeah

fiery scaffold
#

ah gotcha

#

somehow that last build locked my PC up haha, had to start over 😦

dusk river
#

and dont clone at /tmp if you are not gonna delete it

fiery scaffold
#

where should I clone to then?

dusk river
#

/tmp gets removed (cuz obviously its temporary)

fiery scaffold
#

ah yeah

dusk river
#

home folder will be good

fiery scaffold
#

workspace will do

#

or even /app

dusk river
#

isnt it the network volume mount folder?

fiery scaffold
#

no idea, not that fluent in docker yet lol

dusk river
#

just go with /app or /vllm then

fiery scaffold
#

yeah I used /app

#

just restarted the build

dusk river
#

if mine finishes faster ill give you the wheel file (im building with setup.py bdist_wheel)

#

a wheel is just a prebuilt package, so it installs with pip without recompiling

fiery scaffold
#

nice

dusk river
#

it failed

#

while building

#

@fiery scaffold did u succeed

#

i think it needs a LOT of ram

fiery scaffold
#

had to restart my build, only halfway through the cmake steps

dusk river
#

me too

#

how much is ur ram?

#

mine failed with 96gigs in runpod's RTX5090

fiery scaffold
#

I've got plenty haha, more than that

dusk river
#

why did it fail tho

fiery scaffold
#

no idea, havent looked at the logs yet

#

sec

dusk river
#

u r rich lol

fiery scaffold
#

no, just irresponsible with what I buy haha

dusk river
#

i just tried to run it on a macbook

#

failed miserably

fiery scaffold
#

hahaha

dusk river
#

had only 48gigs

#

😦

fiery scaffold
#

oof

#

I've only got like 25GB of RAM used up at the moment, that's odd it failed with that much system RAM

dusk river
#

idk either

#

maybe because it had to run with rosetta

#

it uses almost 100gigs here XD

#

its hella fast tho

#

power of 32 vCPUs

#

it OOMed

fiery scaffold
#

oh wow lol, I wonder why the RAM usage is so high?

#

not having anything close to that building locally

dusk river
#

🥲

#

this one selected wrong region

fiery scaffold
#

oh wow haha

dusk river
#

that type is not supported lol

#

got one with 128vcpus and 1tb ram

fiery scaffold
#

FYI dont build it in develop mode, it failed on the last step, starting over again lol

dusk river
#

im building on wheels mode

#

it ooms cuz of many workers so im just building with single worker

#

probably finishes building tomorrow

ivory chasm
#

some say sglang is faster

#

in certain models

fiery scaffold
#

can vllm serve multiple models simultaneously or dynamically unload/load different models as needed?

ivory chasm
#

I think so with the vllm api, but it's not exposed in serverless, so you'll have to use pods and connect directly to vllm

#

Or use ports on serverless

fiery scaffold
#

gotcha

#

well I made some serious progress on getting vLLM working and built for blackwell, but after all that, it seems there's no way to compile xformers to work with torch 2.8.x dev builds so I wont be able to use models like gemma3

#

I realize that it takes time to develop this stuff but it's extremely frustrating that nvidia would release a new architecture, charge thousands of dollars for the GPU, and not support this part of the community especially to help develop support for sm120 and cuda 12.8

#

I'm beyond angry right now, I've been at this since this morning

dusk river
#

this is like deep into the rabbit hole

#

😄

#

have to compile every fking thing

dusk river
#

@fiery scaffold TORCH_CUDA_ARCH_LIST="12.0" pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
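
Note that the value here is the GPU's compute capability (12.0, matching the sm_120 in the PyTorch error quoted earlier in the thread), not the CUDA toolkit version (12.8). A tiny sketch of the two spellings of the same capability; the helper names are made up for illustration:

```python
def arch_list_entry(major: int, minor: int) -> str:
    """Compute capability as written in TORCH_CUDA_ARCH_LIST, e.g. '12.0'."""
    return f"{major}.{minor}"


def sm_name(major: int, minor: int) -> str:
    """The same capability as it appears in compiler and PyTorch messages, e.g. 'sm_120'."""
    return f"sm_{major}{minor}"


capability = (12, 0)  # what the RTX 5090 reports, per the sm_120 error message
print(f'TORCH_CUDA_ARCH_LIST="{arch_list_entry(*capability)}"', sm_name(*capability))
```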

dusk river
#

i cant test it cuz i dont have a blackwell gpu

#

ah i can run it on runpod maybe

ivory chasm
#

rtx5090 i mean you can use this gpu its blackwell too right?

dusk river
#

yeah

dusk river
#

can just upload the wheel instead of compiling from scratch

ivory chasm
#

this is weird im trying 5090 on community cloud, no system logs after like 15mins ish
and then suddenly web terminal also disconnected, but container logs are there

dusk river
#

lol i used it yesterday and it was fine

#

(with a custom image tho)

ivory chasm
#

i used runpod's pytorch

dusk river
#

its not supported

#

with blackwell

#

i mean it doesnt work for blackwell cuz

#

cuda capability issues

ivory chasm
#

hmm? oh

#

its 12.8 tho, doesnt that mean it supports blackwell too

dusk river
#

there is an image with cuda 12.8?

#

oh

ivory chasm
#

Yeah maybe it's a new one

dusk river
#

when did it pop up

dusk river
#

wtf why is there an image with a different torch version than what ive built the wheel to

ivory chasm
#

huh

ivory chasm
dusk river
#

i built the thing for torch 2.7.1 but the image is 2.8.0 so i probably cant use the wheel with that image
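
A quick pre-check along those lines, as a sketch only: compare the torch release the wheel was built against with the one in the image before trying to install it. This assumes compiled extensions need the same torch release they were built for, which is the conservative reading; the helper names are made up, and the suffixed version strings in the comments are hypothetical examples of what `torch.__version__` can look like.

```python
import re


def release_tuple(version: str) -> tuple[int, ...]:
    """'2.7.1+cu128' -> (2, 7, 1); drops local and dev suffixes."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:  # e.g. a 'dev20250301' segment
            break
        parts.append(int(match.group(0)))
    return tuple(parts)


def same_torch_release(built_for: str, installed: str) -> bool:
    """True when both version strings name the same numeric release."""
    return release_tuple(built_for) == release_tuple(installed)


print(same_torch_release("2.7.1", "2.8.0"))  # False
```

A matching release is only a first filter; the CUDA version and Python version the wheel was built with also have to line up.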

#

have to stick with nvidia's bloated

fiery scaffold
#

I realized though, I've learned a ton of good info I didn't know two days ago throughout trying to solve this problem lol. So not all bad.

dusk river
fiery scaffold
#

Hahahaha

dusk river
#

u have to compile triton and

#

sglang

#

😆