#SGLang


vague knoll
#

SGLang works very well in a pod but is impossible to run in serverless.
The API route always returns error 404.
I use the exact same config (Docker image, command line, port) in pod and serverless.

vague knoll
#

yes. it launches, but a request from the RunPod UI or an OpenAI call gives zero results

exotic storm
#

Would you mind sending me a screenshot of a deployed worker with the request + result? Then I will open an issue on our repo and get an engineer looking at the problem.

vague knoll
#

some screenshots, I hope they help.

vague knoll
#

do you need something else?

exotic storm
#

Nope this looks fine, thank you very much!

vague knoll
#

thanks.
i hope your team will find a solution or a tutorial if the error is between the keyboard and the chair.

vague knoll
#

any news about that?

exotic storm
#

nope not yet, sorry! Will keep you updated once we have something 🙏

exotic storm
#

awesome, thank you very much!!!

vague knoll
#

so it's almost OK!
but the last issue I have may be a bad configuration.

tl;dr: I can't reach the cluster, only individual instances

My configuration:

container Image:
runpod/worker-sglang:preview-cuda12.1.0

Container Start Command:
python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-8B --context-length 8192 --host 0.0.0.0 --port 8000

Expose HTTP Ports: 8000,

once launched, if I click on the running instance, click connect (web) and use the URL (something like: https://xxx-8000.proxy.runpod.net/v1), it works.

but I don't get something like
OPENAI BASE URL
https://api.runpod.ai/v2/vllm-xxxx/openai/v1
to reach the cluster

junior plume
#

what if you build the url yourself

#

https://api.runpod.ai/v2/your-endpoint-id/openai/v1
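A minimal sketch of building that URL and pointing the OpenAI Python client at it. Only the URL pattern comes from the message above; the helper names, the placeholder endpoint ID, and the `RUNPOD_API_KEY` environment variable are assumptions made for illustration.

```python
import os

RUNPOD_API_BASE = "https://api.runpod.ai/v2"


def build_openai_base_url(endpoint_id: str) -> str:
    """Return the OpenAI-compatible base URL for a serverless endpoint ID."""
    return f"{RUNPOD_API_BASE}/{endpoint_id}/openai/v1"


def make_client(endpoint_id: str):
    """Create an OpenAI client aimed at the endpoint (needs `pip install openai`)."""
    from openai import OpenAI  # imported lazily so the URL helper has no dependencies

    return OpenAI(
        base_url=build_openai_base_url(endpoint_id),
        api_key=os.environ["RUNPOD_API_KEY"],  # assumed env var holding the RunPod API key
    )


print(build_openai_base_url("your-endpoint-id"))
# prints https://api.runpod.ai/v2/your-endpoint-id/openai/v1
```

With a client from `make_client(...)`, a call such as `client.chat.completions.create(model="NousResearch/Hermes-3-Llama-3.1-8B", messages=[...])` would then go through the endpoint rather than through a single instance's proxy URL.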

vague knoll
#

I think the endpoint is not correctly "connected" to its instances

junior plume
#

401 is unauthorized tho

#

does the /health also return 401?

#

maybe check if your API key is valid, and create a new endpoint
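One way to separate a key problem from a worker problem is to call the endpoint's `/health` route with the same key. The sketch below uses only the standard library; the helper names are made up for this example, and the bearer-token header format is an assumption about how the API expects the key.

```python
import urllib.error
import urllib.request


def build_health_request(endpoint_id: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the endpoint's /health route with a bearer token."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/health"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def check_health(endpoint_id: str, api_key: str):
    """Return (HTTP status, body). A 401 here points at the key, not at the worker."""
    request = build_health_request(endpoint_id, api_key)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read().decode()
    except urllib.error.HTTPError as err:
        # Non-2xx answers (401, 404, ...) arrive as exceptions; report them as data
        return err.code, err.read().decode()
```

If `check_health(...)` also returns 401, regenerating the API key is the cheaper thing to try before recreating the endpoint.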

exotic storm
#

@vague knoll thank you very much for testing this in depth, I will report this back to the team

junior plume
#

Are you using openai client?

#

Yeah maybe send the code

exotic storm
#

AHH so when you use temperature or any of the other params, it will fail?

vague knoll
#

yes

exotic storm
#

ok awesome. I reported this to the team and based on their feedback this will either be fixed today or I will create an issue on GitHub to not lose focus.

#

thanks for helping out!

exotic storm
#

We appreciate your debugging skills very much ❤️

#

@vague knoll May I ask what kind of use case you have for SGLang?

vague knoll
#

vLLM works well but it's slower than SGLang; I haven't found good (fast) settings to run a 70B model with vLLM. With SGLang it's a HUGE difference, at least on the pod (I don't have data for serverless yet).

junior plume
#

Expect like 2-3x faster, some docs say

junior plume
#

Ohh mb wrong read

vague knoll
#

@exotic storm to continue about SGLang:

I did a test

  • 2x h100 (2 GPUs / worker).
  • 10 concurrent users (each doing 3 requests of 2000 tokens in / 200 out)

Serverless:
worker-sglang
CONTEXT_LENGTH 8192
TENSOR_PARALLEL_SIZE 2
MODEL_PATH NousResearch/Hermes-3-Llama-3.1-70B-FP8

Average Inference time: 30s

Pod:
lmsysorg/sglang:latest
python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --context-length 8192 --host 0.0.0.0 --port 8000 --tp 2

Average Inference time: 3s
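A rough sketch of how a test like this could be driven: N users run concurrently, each sending its requests one after another, and every latency is recorded. The thread does not say how the original test was scripted, so the structure, the function names, and the `fake_send` placeholder (which stands in for a real chat-completion call) are all assumptions.

```python
import asyncio
import statistics
import time


async def run_user(send, n_requests: int) -> list:
    """One simulated user: send requests back to back, recording each latency."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        await send()
        latencies.append(time.perf_counter() - start)
    return latencies


async def collect_latencies(send, users: int = 10, requests_per_user: int = 3) -> list:
    """Run all users concurrently and return every measured latency in seconds."""
    per_user = await asyncio.gather(
        *(run_user(send, requests_per_user) for _ in range(users))
    )
    return [latency for user in per_user for latency in user]


async def fake_send() -> None:
    """Hypothetical placeholder for a real request to the pod or the endpoint."""
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    results = asyncio.run(collect_latencies(fake_send))
    print(f"{len(results)} requests, mean latency {statistics.mean(results):.3f}s")
```

Swapping `fake_send` for a coroutine that posts a 2000-token prompt and asks for 200 output tokens would reproduce the shape of the test described above against either deployment.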

vague knoll
#

(test done after the model was loaded in VRAM, ofc)

junior plume
#

How much faster than vllm in pod?

#

Wow, roughly 3 times faster on the average inference time

vague knoll
#

the cuda version is not the same
and flashinfer is outdated

exotic storm
#

you mean in the image from RunPod?

vague knoll
#

yes

junior plume
#

if you want you can make a pr to update that 👍

junior plume
#

oh I haven't seen it yet, but yeah, that might be worth trying

vague knoll
#

I'm trying it locally

junior plume
#

nc

#

or it might be other deps/packages affecting performance too, maybe from the Dockerfile or other scripts

vague knoll
#

can't say

#

btw, would it be possible to get some credits to make testing faster? (I stop/launch the instance every time to reduce cost, but it's time-consuming.)

I'm currently building a new Docker image with updated versions of some dependencies.

vague knoll
#

OK, I updated the Dockerfile with a more recent version of Python.
but no gain!

#

my next hypothesis is this one:

on serverless, the different requests don't use the same instance (process) of the SGLang server.

on pod, the server is launched like this:
python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --context-length 8192 --host 0.0.0.0 --port 8000 --tp 2

and keeps running, so when a new request comes in, it's handled by this SGLang server.

on serverless, the code is:

`

import requests
import runpod

# Assumed import path: SGlangEngine and OpenAIRequest come from the worker's own engine module
from engine import SGlangEngine, OpenAIRequest

# Initialize the engine once per worker, before any job is handled
engine = SGlangEngine()
engine.start_server()
engine.wait_for_server()


async def async_handler(job):
    """Handle the requests asynchronously."""
    job_input = job["input"]
    print(f"JOB_INPUT: {job_input}")

    if job_input.get("openai_route"):
        # OpenAI-compatible routes are forwarded to the local SGLang server
        openai_route, openai_input = job_input.get("openai_route"), job_input.get("openai_input")
        openai_request = OpenAIRequest()

        if openai_route == "/v1/chat/completions":
            async for chunk in openai_request.request_chat_completions(**openai_input):
                yield chunk
        elif openai_route == "/v1/completions":
            async for chunk in openai_request.request_completions(**openai_input):
                yield chunk
        elif openai_route == "/v1/models":
            models = await openai_request.get_models()
            yield models
    else:
        # Native SGLang /generate route
        generate_url = f"{engine.base_url}/generate"
        headers = {"Content-Type": "application/json"}
        generate_data = {
            "text": job_input.get("prompt", ""),
            "sampling_params": job_input.get("sampling_params", {})
        }
        # Note: requests.post is a blocking call inside an async handler
        response = requests.post(generate_url, json=generate_data, headers=headers)
        if response.status_code == 200:
            yield response.json()
        else:
            yield {"error": f"Generate request failed with status code {response.status_code}", "details": response.text}


runpod.serverless.start({"handler": async_handler, "return_aggregate_stream": True})
`

vague knoll
#

I just compared the handlers for SGLang and vLLM; one big difference is that the SGLang one doesn't have the concurrency_modifier param.
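For context, a sketch of what passing a `concurrency_modifier` to the RunPod SDK could look like for this handler. The `MAX_CONCURRENCY` variable, its default of 50, and the `build_start_config` helper are assumptions for illustration; this is not the actual worker-sglang or worker-vllm code, and the thread does not establish that this alone fixes the batching problem.

```python
import os

# Assumed env var and default; tune to what one SGLang server can batch
MAX_CONCURRENCY = int(os.getenv("MAX_CONCURRENCY", "50"))


def concurrency_modifier(current_concurrency: int) -> int:
    """Tell the SDK how many jobs this worker may have in progress at once."""
    return MAX_CONCURRENCY


def build_start_config(handler) -> dict:
    """Assemble the config dict that would be handed to runpod.serverless.start."""
    return {
        "handler": handler,
        "concurrency_modifier": concurrency_modifier,
        "return_aggregate_stream": True,
    }


# In the worker this would replace the last line of the handler file:
# runpod.serverless.start(build_start_config(async_handler))
```

Without such a modifier a worker is expected to take one job at a time, which would match a "queued 9, in progress 1" pattern; the handler itself also has to avoid blocking calls for concurrent jobs to overlap.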

vague knoll
#

ok, i know the reason now.
but i don't have any idea how to solve it. it's too related to runpod/serverless architecture for me.
maybe @exotic storm could help.

#

Serverless:

"message":"[gpu=0] Decode batch. #running-req: 1, #token: 2303, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0 "
Finished running generator.
2024-08-23 10:50:19.804 [96xlptgpmheem3] [info] _client.py :1026 2024-08-23 01:50:19,803 HTTP Request: POST http://0.0.0.0:30000/v1/chat/completions "HTTP/1.1 200 OK"
2024-08-23 10:50:19.803 [96xlptgpmheem3] [info] INFO: 127.0.0.1:55604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-08-23 10:50:19.635 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2294, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:18.544 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2254, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:17.454 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2214, token usage: 0.01, gen throughput (token/s): 16.40, #queue-req: 0
2024-08-23 10:50:17.273 [96xlptgpmheem3] [info] [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 95.59%, #running-req: 0, #queue-req: 0

Pod:

`2024-08-23T08:53:25.799936753Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 74.92%, #running-req: 0, #queue-req: 0
2024-08-23T08:53:25.856204372Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 83.26%, #running-req: 1, #queue-req: 0
2024-08-23T08:53:26.118457662Z [gpu=0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 6624, cache hit rate: 88.82%, #running-req: 3, #queue-req: 0
2024-08-23T08:53:26.148959295Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 89.94%, #running-req: 6, #queue-req: 0
2024-08-23T08:53:26.173732167Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 90.85%, #running-req: 7, #queue-req: 0
2024-08-23T08:53:26.418726222Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 92.25%, #running-req: 8, #queue-req: 0

[gpu=0] Decode batch. #running-req: 10, #token: 2799, token usage: 0.01, gen throughput (token/s): 397.74, #queue-req: 0`

#

on pod, requests are handled in batches.
on serverless, one after the other.

#

so it doesn't use its optimizations to decode/encode in batch. it's completely useless if we don't find the correct setup.
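The difference can be read straight out of the "Decode batch" lines in the two logs above. The sketch below pulls the running-request count and the generation throughput from one line of each log; the regular expression and function name are written for this example and assume the log format shown in this thread.

```python
import re

DECODE_RE = re.compile(
    r"Decode batch\. #running-req: (\d+),.*?gen throughput \(token/s\): ([\d.]+)"
)


def parse_decode_line(line: str):
    """Return (running requests, tokens/s) for a 'Decode batch' line, else None."""
    match = DECODE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


serverless = parse_decode_line(
    "[gpu=0] Decode batch. #running-req: 1, #token: 2303, token usage: 0.01, "
    "gen throughput (token/s): 36.67, #queue-req: 0"
)
pod = parse_decode_line(
    "[gpu=0] Decode batch. #running-req: 10, #token: 2799, token usage: 0.01, "
    "gen throughput (token/s): 397.74, #queue-req: 0"
)
ratio = pod[1] / serverless[1]
print(serverless, pod, f"{ratio:.1f}x")
# prints (1, 36.67) (10, 397.74) 10.8x
```

With 10 requests in the batch the pod generates roughly 10.8 times more tokens per second in aggregate than the serverless worker does with a single running request, which is in line with the 3s versus 30s averages reported earlier.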

exotic storm
#

Thanks for pointing this out, I will check this out myself and see if we can somehow get around doing this. I remember that I heard something about this at some point, but I need to dig deeper

junior plume
#

Yeah if you reproduce everything and it's slow it's probably serverless

junior plume
#

ooh nice

#

perhaps a little comparison with the vllm-worker can fix that

junior plume
#

Ic

junior plume
#

yup, I haven't tested it yet; that was before I saw your commit on the concurrency. if you tested it, I think it's on RunPod's side

vague knoll
#

I compared the code between vLLM and SGLang and I don't see what could be wrong in the SGLang one

vague knoll
#

when I launch, I have
queued 9 and in progress 1
never more than one in progress

junior plume
#

ahh i thought it was on the inference time, all this time hahah

#

yeah RunPod needs to fix their queue to be faster at assigning jobs

vague knoll
#

but the issue doesn't occur with vLLM

junior plume
#

yea

exotic storm
#

@vague knoll so we were talking about this internally and we are already working on some things to make this happen. Would you mind opening a new issue on the worker-sglang repo with your findings? We would love to keep track of where this came from. You did a great job on debugging already and we would love to keep you in the loop on that.


vague knoll
#

done!

vague knoll
#

i hope you'll find a way to solve the remaining issue

exotic storm
#

that would be really nice indeed.

vague knoll
#

I did some more tests today without much success... just burning a few dozen bucks.

#

coding with async seems a little helpful (I think I got 2 "in progress" at the same time, but never more, even when flooding with dozens of light OpenAI requests on 2x H100s with 2 GPUs/worker and an 8B model)
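The thread does not show what the async change was. As one illustration of the general idea, the sketch below moves a blocking call off the event loop with `asyncio.to_thread`, so several handler invocations can wait at the same time. `blocking_generate` is a hypothetical stand-in for a blocking HTTP call such as `requests.post`; the job shape mirrors the handler quoted earlier.

```python
import asyncio
import time


def blocking_generate(prompt: str) -> dict:
    """Hypothetical stand-in for a blocking HTTP call to the SGLang server."""
    time.sleep(0.1)
    return {"text": prompt.upper()}


async def async_handler(job: dict):
    """Async generator handler that keeps the event loop free while waiting."""
    result = await asyncio.to_thread(blocking_generate, job["input"]["prompt"])
    yield result


async def run_jobs(n: int) -> float:
    """Run n handler invocations concurrently and return the elapsed seconds."""
    async def consume(i: int):
        job = {"input": {"prompt": f"job {i}"}}
        return [chunk async for chunk in async_handler(job)]

    start = time.perf_counter()
    await asyncio.gather(*(consume(i) for i in range(n)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"5 jobs took {asyncio.run(run_jobs(5)):.2f}s")
```

Because the blocking calls run in worker threads, the five jobs overlap instead of running one after another. This only helps if the platform actually hands the worker more than one job at a time.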

vague knoll
#

any news with your internal team?

exotic storm
#

@vague knoll it is still under investigation! Possibly a first result this week.

wheat crescent
#

hi guys, can i run multiple nodes on runpod? example: two nodes with 4 GPUs on each node

junior plume
#

Yes, but private networking between them isn't available right now

wheat crescent
#

i ran with command: /bin/bash -c "python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-8B --context-length 8192 --quantization fp8 --host 0.0.0.0 --port 8000 --tp 8 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 0"

and got error message: [W909 10:25:24.935506246 socket.cpp:697] [c10d] The IPv6 network addresses of (sgl-dev-0, 50000) cannot be retrieved (gai error: -2 - Name or service not known)

#

run on community cloud

junior plume
#

Well it's not the correct IP is it
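The `gai error: -2 - Name or service not known` part means the hostname `sgl-dev-0` could not be resolved from inside the container. A small standard-library check like the one below can confirm whether a given init address resolves before launching; the function name is made up for this example.

```python
import socket


def can_resolve(host: str, port: int) -> bool:
    """Return True if host:port resolves to at least one address."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Same family of failure as the "Name or service not known" error above
        return False
```

Running `can_resolve("sgl-dev-0", 50000)` on each node would be expected to return `False` in the setup above; the address given to `--nccl-init` needs to be one that every node can actually resolve and reach.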