#RTX PRO 6000 - New Issues / VLLM failed to load

inland bison
#

Until last week I was able to use the following startup parameters without any issue:
--host 0.0.0.0 --port 8000 --model Qwen/Qwen3-Coder-Next-FP8 --dtype auto --gpu-memory-utilization 0.94 --api-key ###### --max-model-len 131072 --tensor-parallel-size 1 --enable-auto-tool-choice --tool-call-parser qwen3_xml

This was always my go-to setup for using the pods.
For the last few days, no matter what I do, once the model has loaded and is past the warm-up phase, it throws several errors and crashes, then tries to reload the shards.
I've tried this with 4 pods today and every single one of them failed. The AI bot thing tells me that hardware might have changed, but from the settings it's still the same.

I use the RTX Pro 6000 with 96 GB VRAM at $1.89/hr, with 100 GB storage.

Any help with this? :/ I've already wasted $2-3 just trying to get it to start up.

inland bison
#

Not sure if this helps, but it shows 79% VRAM utilization on startup, while it's clearly still downloading the model. I assume it's reserved, but it didn't use to show this until the GPU had actually loaded the model.
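
For anyone wanting to check the same thing from inside the pod, here is a minimal sketch that reads used and total VRAM straight from the driver. It assumes `nvidia-smi` is on the PATH; the helper names `parse_memory_line` and `gpu_memory_usage` are made up for illustration, and the figure may not match exactly what the pod dashboard calls "utilization".

```python
import subprocess


def parse_memory_line(line: str) -> tuple[int, int, float]:
    """Parse one 'used, total' line (values in MiB) from nvidia-smi CSV output."""
    used_str, total_str = (part.strip() for part in line.split(","))
    used, total = int(used_str), int(total_str)
    return used, total, used / total


def gpu_memory_usage() -> list[tuple[int, int, float]]:
    """Return (used MiB, total MiB, fraction used) for each visible GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi is missing or fails
    ).stdout
    return [parse_memory_line(line) for line in out.splitlines() if line.strip()]
```

Calling `gpu_memory_usage()` before starting the server and again while the weights download would show whether the memory is already taken before the model is loaded.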

#

(EngineCore pid=479) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=479) return func(*args, **kwargs)
(EngineCore pid=479) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=479) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=479) super().__init__(
(EngineCore pid=479) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=479) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=479) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=479) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=479) return func(*args, **kwargs)
(EngineCore pid=479) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=479) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=479)

#

(APIServer pid=51) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=51) async with build_async_engine_client(
(APIServer pid=51) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=51) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=51) return await anext(self.gen)
(APIServer pid=51) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=51) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=51) async with build_async_engine_client_from_engine_args(
(APIServer pid=51) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=51) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=51) return await anext(self.gen)

nimble badge
#

Will be way easier to enjoy the model

#

For tools I recommend adding --jinjatools to the KCPP_ARGS variable and changing the context size there, but aside from that all you need is the GGUF download link (of part 1 if it's a big model) and it will just set itself up

#

No network storage, no manual hassle, no dependency management

inland bison
#

Thank you, I'll definitely check it out 🙂 I was just confused about what changed, but it seems there was a change in vLLM and the mamba cache, so I had to add the --max-num-seqs 848 parameter to get it to run. Just unfortunate that it took me this long to figure out 🙁 I wasted money and GPUs others could have used. I'm just very much locked into the workflow I have with it. But I'll have a look at the koboldai setup, thanks so much.
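
For reference, a rough sketch of the startup parameters with that fix applied. The flags and the value 848 are taken from this thread; the `build_args` helper and the `REPLACE_ME` placeholder key are only for illustration, and the right `--max-num-seqs` value may well differ for other models, GPUs or vLLM versions.

```python
import shlex

# Flags copied from the original startup parameters in this thread.
BASE_ARGS = [
    "--host", "0.0.0.0",
    "--port", "8000",
    "--model", "Qwen/Qwen3-Coder-Next-FP8",
    "--dtype", "auto",
    "--gpu-memory-utilization", "0.94",
    "--max-model-len", "131072",
    "--tensor-parallel-size", "1",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "qwen3_xml",
]


def build_args(api_key: str, max_num_seqs: int = 848) -> list[str]:
    """Return the full argument list, including the --max-num-seqs fix."""
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be a positive integer")
    return [
        *BASE_ARGS,
        "--api-key", api_key,
        "--max-num-seqs", str(max_num_seqs),
    ]


# Print the arguments as one shell-safe string to paste into the pod settings.
print(shlex.join(build_args(api_key="REPLACE_ME")))
```

Keeping the cap as a parameter makes it easy to try a lower value if startup still fails.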

This can be closed now - I found a solution to my problem and will also check out the suggested runpod setup n.n