Serverless VLLM batching | Runpod | Page 1

hollow comet Jun 8, 2025, 9:27 PM

#

Hey, so every hour I have about 10k prompts I want to send to my serverless instance. I'm using vLLM, and my question is: does the batching that vLLM does out of the box work on the serverless instance? I send each prompt as its own request, not all of them in one request. I could not find anything about this in the docs or in this chat. Any help would be really appreciated, thanks.

hollow comet Jun 8, 2025, 9:29 PM

#

And am I doing it the right way by sending each job in a single request? With other APIs I send all the prompts in one request, like normal batching.

pulsar shoal Jun 8, 2025, 10:02 PM

#

Both work. #1371876938913677483 message
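For text-only prompts, the two options can be sketched as payload shapes. This is a minimal sketch, assuming the endpoint runs the RunPod vLLM worker and that the worker accepts `openai_route` and `openai_input` keys in the job input. Those key names, the model name, and the `max_tokens` value are assumptions, so check them against your worker version.

```python
from typing import Any, Dict, List

MODEL = "my-org/my-text-model"  # hypothetical model name


def single_requests(prompts: List[str], max_tokens: int = 256) -> List[Dict[str, Any]]:
    """Build one job payload per prompt (one request per job)."""
    return [
        {
            "input": {
                "openai_route": "/v1/completions",  # assumed worker input key
                "openai_input": {"model": MODEL, "prompt": p, "max_tokens": max_tokens},
            }
        }
        for p in prompts
    ]


def batched_request(prompts: List[str], max_tokens: int = 256) -> Dict[str, Any]:
    """Build one job payload that carries all prompts as an array of strings."""
    return {
        "input": {
            "openai_route": "/v1/completions",  # assumed worker input key
            "openai_input": {"model": MODEL, "prompt": list(prompts), "max_tokens": max_tokens},
        }
    }
```

With the first shape, batching happens inside the engine across requests that are running on the same worker at the same time. With the second shape, the whole list travels in a single job.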

hollow comet Jun 11, 2025, 4:55 AM

#

pulsar shoal Both work. https://discord.com/channels/912829806415085598/1371876938913677483/1...

So I prefer sending all prompts in one batch, in one request, because it's 10k prompts and I would be limited by RunPod rate limits.

This works fine with just text prompts, but with prompts for a multimodal model (InternVL3 14B) it sums up the tokens from all prompts in the batch and fails because it exceeds the context length. It makes no sense that it sums them up, as they are separate conversations.
The token count is also really high per prompt, like 4k, and the image is not that big.

Do you have any idea why this might be happening?

pulsar shoal Jun 11, 2025, 5:02 AM

#

In that thread, I also gave [an example](#1371876938913677483 message) of how you can overcome the RunPod API limits if you have huge amounts of data.

hollow comet Jun 11, 2025, 5:03 AM

#

Yeah, I read that, but I'd still prefer to send it all in one request. It works perfectly fine with just text prompts, but with image prompts it sums up the tokens across the conversations for some reason.

#

This is a batch request of 6 conversations with one image each; the images are around 600x600.
{
  "delayTime": 1078,
  "executionTime": 436,
  "id": "sync-109571ec-5527-4d2c-aa0a-b902c7d88df2-e1",
  "output": [
    {
      "code": 400,
      "message": "This model's maximum context length is 8192 tokens. However, you requested 81637 tokens (81381 in the messages, 256 in the completion). Please reduce the length of the messages or completion.",
      "object": "error",
      "param": null,
      "type": "BadRequestError"
    }
  ],
  "status": "COMPLETED",
  "workerId": "htcn0ja4fv4i18"
}
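Note that the job above reports `"status": "COMPLETED"` even though its output holds an error object, so a client that only checks the job status would miss the failure. A minimal sketch of a client-side check, assuming the error entries keep the shape shown above (`find_errors` and `context_overflow` are hypothetical helper names):

```python
import re
from typing import Any, Dict, List, Optional, Tuple


def find_errors(job: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return output entries that look like OpenAI-style error objects."""
    output = job.get("output") or []
    if isinstance(output, dict):
        output = [output]
    return [item for item in output if isinstance(item, dict) and item.get("object") == "error"]


_CTX = re.compile(r"maximum context length is (\d+) tokens\. However, you requested (\d+) tokens")


def context_overflow(message: str) -> Optional[Tuple[int, int]]:
    """Extract (context limit, requested tokens) from a context-length error message."""
    match = _CTX.search(message)
    return (int(match.group(1)), int(match.group(2))) if match else None


job = {
    "status": "COMPLETED",
    "output": [
        {
            "code": 400,
            "message": "This model's maximum context length is 8192 tokens. However, you requested 81637 tokens (81381 in the messages, 256 in the completion). Please reduce the length of the messages or completion.",
            "object": "error",
            "type": "BadRequestError",
        }
    ],
}

errors = find_errors(job)
print(context_overflow(errors[0]["message"]))  # (8192, 81637)
```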

#

If I use the normal /run endpoint, the token count is much lower: it's about 400 for one image.

pulsar shoal Jun 11, 2025, 5:08 AM

#

How are you even sending the images via the standard text completion endpoint? As far as I know, it's supposed to be just an array of strings.

hollow comet Jun 11, 2025, 5:14 AM

#

I convert them to base64

hollow comet Jun 11, 2025, 5:20 AM

#

pulsar shoal How are you even sending the images via standard text completion endpoint? As fa...

import base64
from typing import Any, Dict, List

def create_message_with_image(image_data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Create OpenAI message format with image for chat completions"""
    # Validate image data
    if not image_data.get('content'):
        raise Exception("No image content provided")

    # Check image size
    image_size = len(image_data['content'])
    if image_size > 20 * 1024 * 1024:  # 20MB limit
        raise Exception(f"Image too large: {image_size} bytes")

    image_base64 = base64.b64encode(image_data['content']).decode('utf-8')

    # Determine media type
    media_type = "image/jpeg"
    if image_data['content'].startswith(b'\x89PNG'):
        media_type = "image/png"
    elif image_data['content'].startswith(b'GIF'):
        media_type = "image/gif"

    # Create OpenAI message format
    messages = [
        {
            "role": "system",
            "content": get_system_prompt()
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": get_user_prompt()
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{media_type};base64,{image_base64}"
                    }
                }
            ]
        }
    ]

    return messages

#

It's also weird that the token count is so much higher via the OpenAI route than via the plain /run input.

#

This is one single request:
{
  "delayTime": 792,
  "executionTime": 9456,
  "id": "sync-0f22391b-2fcb-4589-99e8-01125d164425-e2",
  "output": [
    {
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": "json\n{\n \"cracked\": true,\n \"battery_health\": null,\n \"color\": null,\n \"condition_score\": 30\n}\n",
            "reasoning_content": null,
            "role": "assistant",
            "tool_calls": []
          },
          "stop_reason": null
        }
      ],
      "created": 1749619224,
      "id": "chatcmpl-bddf2e19a3204e8fbae98389292ee924",
      "kv_transfer_params": null,
      "model": "OpenGVLab/InternVL3-14B",
      "object": "chat.completion",
      "prompt_logprobs": null,
      "usage": {
        "completion_tokens": 36,
        "prompt_tokens": 3835,
        "prompt_tokens_details": null,
        "total_tokens": 3871
      }
    }
  ],
  "status": "COMPLETED",
  "workerId": "htcn0ja4fv4i18"
}
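To compare token counts across routes, the prompt token count and the model's answer can be pulled out of a wrapped response like the one above. A minimal sketch, assuming the response keeps this shape and that the answer is a single JSON object surrounded by extra text (`parse_model_json` and `summarize` are hypothetical helper names):

```python
import json
from typing import Any, Dict, Tuple


def parse_model_json(content: str) -> Dict[str, Any]:
    """Parse the JSON object between the first '{' and the last '}' in the text."""
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(content[start:end + 1])


def summarize(job: Dict[str, Any]) -> Tuple[Dict[str, Any], int]:
    """Return (parsed answer, prompt token count) for a single chat completion job."""
    completion = job["output"][0]
    answer = parse_model_json(completion["choices"][0]["message"]["content"])
    return answer, completion["usage"]["prompt_tokens"]


sample = {
    "output": [
        {
            "choices": [
                {"message": {"content": "json\n{\n \"cracked\": true,\n \"condition_score\": 30\n}\n"}}
            ],
            "usage": {"completion_tokens": 36, "prompt_tokens": 3835, "total_tokens": 3871},
        }
    ],
    "status": "COMPLETED",
}

answer, prompt_tokens = summarize(sample)
print(answer["condition_score"], prompt_tokens)
```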

pulsar shoal Jun 11, 2025, 5:25 AM

#

hollow comet ```py def create_message_with_image(image_data: Dict[str, Any]) -> List[Dict[str...

This is a chat completion format. You use that for single individual requests. Standard text completion can be used for sending a batch in one request but doesn't support multimodal input. So processing 10k individual requests is what you want to do.

#

Or you have to customize/make your own handler with offline inference. From vLLM docs:

For multimodal batch inference, you must use offline inference where you can pass a list of multimodal prompts to llm.generate. The online OpenAI-compatible server only supports multimodal input via the Chat Completions API, and only one prompt per request is allowed for multimodal data.
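The offline route described in that quote can be sketched as follows. This is a minimal sketch, not the RunPod worker's implementation: it assumes each prompt string is already formatted with the model's chat template and image placeholder, that each image is a PIL image, and that `max_model_len=8192` matches the context length from the error above. `build_offline_inputs` and `run_batch` are hypothetical helper names.

```python
from typing import Any, Dict, List, Sequence


def build_offline_inputs(prompts: Sequence[str], images: Sequence[Any]) -> List[Dict[str, Any]]:
    """Pair each already-templated prompt with its own image."""
    if len(prompts) != len(images):
        raise ValueError("need exactly one image per prompt")
    return [
        {"prompt": prompt, "multi_modal_data": {"image": image}}
        for prompt, image in zip(prompts, images)
    ]


def run_batch(
    prompts: Sequence[str],
    images: Sequence[Any],
    model: str = "OpenGVLab/InternVL3-14B",
    max_tokens: int = 256,
) -> List[str]:
    """Run one offline batch and return the generated text for each prompt."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, trust_remote_code=True, max_model_len=8192)
    outputs = llm.generate(
        build_offline_inputs(prompts, images),
        SamplingParams(max_tokens=max_tokens),
    )
    return [output.outputs[0].text for output in outputs]
```

Each entry in the list is its own sequence, so the context limit applies per prompt rather than to the batch as a whole.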

hollow comet Jun 11, 2025, 5:31 AM

#

pulsar shoal Or you have to customize/make your own handler with offline inference. From vLLM...

Ohh, I see, that makes sense.

#

I just don't like the logic of sending a single request for each job.

#

For 10k requests it's already hitting the rate limit, and that's not even counting the polling that comes after.

#

And I want to keep the queue filled all the time so the GPUs are used efficiently, but that's also not that easy to do.

#

I tried tracking the queue with the health endpoint and then sending new prompts when it's getting emptier.
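The top-up logic described here can be sketched as a small pure function. It assumes the health response exposes `jobs.inQueue` and `jobs.inProgress` counters (field names are an assumption, check them against the actual response), and `target_depth` is a hypothetical tuning knob.

```python
from typing import Any, Dict


def jobs_to_submit(health: Dict[str, Any], target_depth: int, remaining: int) -> int:
    """How many new jobs to send so queued plus running jobs reach target_depth."""
    jobs = health.get("jobs", {})
    outstanding = int(jobs.get("inQueue", 0)) + int(jobs.get("inProgress", 0))
    return max(0, min(remaining, target_depth - outstanding))


# Example: 12 queued and 8 running, aiming for 50 outstanding jobs.
health = {"jobs": {"inQueue": 12, "inProgress": 8}}
print(jobs_to_submit(health, target_depth=50, remaining=10_000))  # 30
```

Calling this on each poll and submitting only the returned number of jobs keeps the request rate tied to how fast the workers drain the queue.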

ornate pivot Jun 11, 2025, 5:34 AM

#

Do you hit the rate limit on runpod's /run or /runsync?

ornate pivot Jun 11, 2025, 5:35 AM

#

hollow comet For 10k requests its already hitting the rate limit and thats not even the polli...

can't you ask runpod support for more?

pulsar shoal Jun 11, 2025, 5:39 AM

#

hollow comet For 10k requests its already hitting the rate limit and thats not even the polli...

You're ignoring my recommendation to bypass the RunPod job API. You can make the engine requests internally from the handler and get the data somewhere else than from the RunPod job input. In fact, even normally the serverless handler itself in the worker is constructing and passing them to the engine like a proxy.
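A rough sketch of that idea: one job carries only a pointer to the data, and the handler loads the prompts itself and passes them to the engine. The `manifest_path` input field, the JSONL layout, and the injected `generate` callable are all hypothetical; only `runpod.serverless.start` comes from the RunPod SDK.

```python
import json
from typing import Any, Callable, Dict, Iterable, List


def load_prompts(lines: Iterable[str]) -> List[str]:
    """Parse a JSONL manifest where each non-empty line looks like {"prompt": "..."}."""
    return [json.loads(line)["prompt"] for line in lines if line.strip()]


def make_handler(generate: Callable[[List[str]], List[str]]) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Wrap an engine call in a handler that reads prompts from a file, not the job input."""

    def handler(job: Dict[str, Any]) -> Dict[str, Any]:
        path = job["input"]["manifest_path"]  # hypothetical input field
        with open(path, encoding="utf-8") as fh:
            prompts = load_prompts(fh)
        return {"results": generate(prompts)}

    return handler


def main(generate: Callable[[List[str]], List[str]]) -> None:
    """Start the serverless worker; call this from the image's entrypoint."""
    import runpod  # imported lazily; only needed inside the worker

    runpod.serverless.start({"handler": make_handler(generate)})
```

With this shape, 10k prompts need one job submission instead of 10k, so the job API rate limit stops being the bottleneck.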

hollow comet Jun 11, 2025, 5:54 AM

#

pulsar shoal You're ignoring my recommendation to bypass the RunPod job API. You can make the...

I appreciate your help, but I want to avoid using my own Docker image because the RunPod vLLM worker has faster cold-start times, since it's cached on all instances.

#

„Worker vLLM is now cached on all RunPod machines, resulting in near-instant deployment! Previously, downloading and extracting the image took 3-5 minutes on average.“

#

And I believe that it is more likely to have flashboot even after some time

hollow comet Jun 11, 2025, 5:56 AM

#

ornate pivot can't you ask runpod support for more?

Haven’t thought about that

hollow comet Jun 11, 2025, 5:56 AM

#

ornate pivot Do you hit the rate limit on runpod's /run or /runsync?

On /run

hollow comet Jun 11, 2025, 5:59 AM

#

hollow comet „Worker vLLM is now cached on all RunPod machines, resulting in near-instant dep...

I run burst tasks, so I need a lot of prompts processed every few hours. In my setup it takes about 60 seconds when warm, but when it's cold it takes an extra 60 seconds or more to start, and that makes it not that attractive.

pulsar shoal Jun 11, 2025, 6:10 AM

#

It should only have faster initialization (image download to the worker). Cold starts are not affected, or are actually slower than with my image. FlashBoot availability is not affected.

#

Unfortunately, there's no secret magic that would make the official vLLM template faster than a custom image, even though one might expect it.

hollow comet Jun 11, 2025, 6:16 AM

#

Hmm, but what about downloading the Docker image? Can it be saved on network storage like the LLM weights, so I can at least avoid the download time?

hollow comet Jun 11, 2025, 6:17 AM

#

pulsar shoal It should have only faster initialization (image download to the worker). Coldst...

Does it support the batching I need for multi modal input?

pulsar shoal Jun 11, 2025, 6:28 AM

#

For the fastest loading speeds, you want to bake the model (LLM weights) into the image itself. Cold-start time is the thing to worry about, not the worker initialization stage: that is not billed, and if you choose enough max workers on the endpoint, it's most likely not something that will bother you. Container images are hosted on a container registry like Docker Hub, or you can build the images right from the GitHub repo.
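One common way to bake weights into an image is to run a small download script during the image build, so the files end up in an image layer. A minimal sketch, assuming the weights live on the Hugging Face Hub and that `/models` is a free path inside the image (the directory layout and helper names are hypothetical):

```python
def target_dir(repo_id: str, root: str = "/models") -> str:
    """Directory inside the image where a repo's files are stored."""
    return f"{root.rstrip('/')}/{repo_id.replace('/', '--')}"


def download(repo_id: str, root: str = "/models") -> str:
    """Download every file of the repo into the image and return the local path."""
    # Imported lazily so target_dir can be used without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = target_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=path)
    return path
```

A build step would call `download("OpenGVLab/InternVL3-14B")` once, and the engine would then be pointed at the returned local path instead of the Hub repo ID.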

hollow comet Jun 11, 2025, 6:34 AM

#

pulsar shoal For the fastest loading speeds, you want to bake the model (LLM weights) into th...

What time difference does baking the LLM into the image make? I don't like baking it in because building and pushing takes forever.

ornate pivot Jun 11, 2025, 7:22 AM

#

pulsar shoal Unfortunately, there's no secret magic that would make the official vLLM templat...

Make a pr about your change too

#

To the vllm worker from runpod 😆

ornate pivot Jun 11, 2025, 8:51 AM

#

hollow comet What time difference does baking in the LLM in the image make? I dont like bakin...

A lot on cold starts, because disk speeds are higher and latency is lower.

pulsar shoal Jun 11, 2025, 2:23 PM

#

ornate pivot Make a pr about your change too

I'll keep working for RunPod for free after they stop ignoring the issues. Sounds fair? 😀

Because there's no reason to work on reducing the cold starts to the minimum if the queue delay on the platform is 10s.

ornate pivot Jun 11, 2025, 2:24 PM

#

Wow, they still haven't replied to you on your ticket?

pulsar shoal Jun 11, 2025, 2:26 PM

#

River just told me to use a different GPU and then ghosted me 😕 Both on Discord and email. The ticket is still open.

#

Meanwhile, users even DM me that they have the same problem and that they're leaving the platform because of it.

hollow comet Jun 17, 2025, 11:13 PM

#

ornate pivot Alot on cold starts because disk speeds are higher or also latency is less

The model files are almost 50GB. How do I deal with such a big Docker image layer?

#

Building fails on RunPod: when uploading the layer for the model, it restarts over and over again, and at some point it says the build failed.

ornate pivot Jun 18, 2025, 12:39 AM

#

hollow comet Building fails on runpod like when uploading the layer for the model it restarts...

The build failed because it takes too long or the image being built is too big.

hollow comet Jun 18, 2025, 2:20 AM

#

ornate pivot Build failed because it takes too long or image being built is too big

How should I deal with it? Build it on my machine and push it to Docker Hub? Or use Google Cloud Build? Any suggestions?

#

I want to bake in the model, as it's faster.

ornate pivot Jun 18, 2025, 3:16 AM

#

Yeah sure anything works

#

I'd suggest Docker Hub.
