#Qwen/Qwen3-Omni-30B-A3B-Instruct

18 messages · Page 1 of 1 (latest)

prisma ruin
#

trying to get this running on runpod, but can't figure out the correct configuration. I'd like to run on a single 48gb A40 gpu if possible or whatever the minimum context needed for processing ~5min mp3 audio files would be. anyone have any ideas?

real relicBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

calm tinselBOT
pliant wigeon
#

With this models smaller cousin (Qwen/Qwen3-VL-30B-A3B-Thinking) I got away with these settings:
MODEL_MAX_LEN=131072
GPU_MEMORY_UTILIZATION=0.98

Let me know if that works 😄

prisma ruin
#

oh thanks, but I need help with specifically the omni qwen3 multimodal which uses vllm-omni i believe... Qwen/Qwen3-VL-30B-A3B-Thinking isn't omni with stages.

does the regular vllm serverless template (https://console.runpod.io/hub/runpod-workers/worker-vllm) support omni multimodal? i need to input text and audio and output text. i can ignore video/images to save ram

#

i think the worker handler the vllm template is based on needs to be made omni compatible... i tried having claude vibe code it for me, but i don't know what i'm doing when it comes to this inference stuff. i've burned through $14 of time just spinning up tests and failing lol...

https://github.com/runpod-workers/worker-vllm

GitHub

The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm

pliant wigeon
#

With all due respect I have no idea (if the worker version supports the omni models), I can go find out though. Just a moment.

prisma ruin
#

yes please... ask who ever you need to. thanks 🙏

pliant wigeon
#

I made the pod template version but they're very different 😛

#

I'm just gonna run it myself, one sec

prisma ruin
#

i've been trying on and off for a day and a half now lol... she's not easy to run and zero out there on the net about anyone else doing so either... just full blown instances which i've already got working before... but I don't want the overhead... just want serverless

pliant wigeon
#

I get what you mean, and tuning vllm to a specific model sucks a lot it took me maybe half an hour to get the settings I sent earlier for a much smaller model.

#

Maybe useful to check in the future? But trust me I know it takes a lot of back and forward

prisma ruin
#

yea, i figured november til now they would have more out there... qwen3 omni is new but its been months now... wth

#

and not sure if runpod's vllm is omni capable... it has a --omni cli arg... can i just have an env var as OMNI=1 or something to enable omni? cause as it sits i had to fork the worker-vllm github project and convert to using vllm-omni so it would launch vllm in omni mode properly