#GGUF in serverless vLLM


solar saddle
orchid cape
#

You will have to create your own serverless handler for it: the vLLM worker does not support GGUF because the underlying vLLM engine doesn't support it.

solar saddle
#

Such a shame that it doesn't. Can I run Ollama in serverless?

orchid cape
#

You can run whatever you want in serverless as long as you implement the RunPod serverless handler
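A minimal sketch of what "implementing the handler" can look like, assuming the `runpod` Python SDK is installed in the worker image. `make_handler` and `echo_backend` are hypothetical names used only for illustration; in a real GGUF worker the backend would call whatever engine you bundle (llama.cpp, Ollama, KoboldCpp, and so on). Returning a dict with an `error` key is also an assumption about how you want to report bad input.

```python
from typing import Any, Callable, Dict

Backend = Callable[[str, int], str]


def make_handler(generate: Backend) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Wrap a text-generation callable in a RunPod-style job handler."""

    def handler(job: Dict[str, Any]) -> Dict[str, Any]:
        # The request payload arrives under the job's "input" key.
        job_input = job.get("input") or {}
        prompt = job_input.get("prompt")
        if not isinstance(prompt, str) or not prompt:
            return {"error": "input.prompt must be a non-empty string"}
        max_tokens = int(job_input.get("max_tokens", 128))
        return {"text": generate(prompt, max_tokens)}

    return handler


def echo_backend(prompt: str, max_tokens: int) -> str:
    # Hypothetical placeholder: a real worker would run the GGUF model here.
    return prompt[:max_tokens]


def main() -> None:
    # Imported lazily so the handler can be tested without the SDK installed.
    import runpod

    runpod.serverless.start({"handler": make_handler(echo_backend)})
```

The worker image's entrypoint script would call `main()`. Keeping the backend injectable means the handler logic can be checked locally before building the image.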

crystal trail
solar saddle
#

By templates, do you mean the very limited ones from "quick deploy", or any template that can be run in a normal pod like on the screenshot? I can't input the ollama/ollama template on the serverless deployment

orchid cape
#

You can't use pod templates in serverless; they don't work the same way. You need to implement the serverless handler, as I mentioned above.

solar saddle
#

Where can I browse community templates for serverless? There has to be someone that already did this

orchid cape
#

The only serverless template available is the vLLM one that RunPod created.

orchid cape
solar saddle
#

I found this thread. I will probably need to configure it myself. Thank you for your help.

crystal trail
#

yep, there is no place for sharing community templates for serverless on the site yet

orchid cape
proven flicker
orchid cape
proven flicker
#

oh

undone folio
solar saddle
#

CPU inference isn't good enough, but thank you

crystal trail
undone folio
#

@solar saddle If you follow that guide, but just select GPU, you’ll get the same results.

kindred cloud
#

Seems vLLM will have GGUF support soon!

glacial shale
#

I just created an account today and am looking at the serverless vLLM quick deploy settings. If GGUF isn't supported, what's this thing? I don't see a bpw / quant level setting.

orchid cape
glacial shale
#

I'm asking what that dropdown menu does.

orchid cape
#

It does what it says: it selects the quantization type

#

It's not quant level, it's quant TYPE
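To illustrate the type-versus-level distinction, here is a rough sketch of how a quant type maps onto vLLM's `quantization` argument. The set of accepted values is taken from the formats named elsewhere in this thread (AWQ, GPTQ, SqueezeLLM) and may differ in your vLLM version; `vllm_engine_kwargs` and `load_engine` are hypothetical helper names.

```python
from typing import Any, Dict

# Quant types this thread names as supported by vLLM at the time (assumption).
SUPPORTED_QUANT_TYPES = {"awq", "gptq", "squeezellm"}


def vllm_engine_kwargs(model_repo: str, quant_type: str) -> Dict[str, Any]:
    """Build keyword arguments for vllm.LLM from a repo name and a quant type."""
    quant = quant_type.lower()
    if quant not in SUPPORTED_QUANT_TYPES:
        raise ValueError(f"unsupported quant type for vLLM: {quant_type!r}")
    # The bit width (the "level") is baked into the quantized repo itself;
    # the engine only needs to know which format the weights are stored in.
    return {"model": model_repo, "quantization": quant}


def load_engine(model_repo: str, quant_type: str):
    # Imported lazily: this needs a GPU machine with vLLM installed.
    from vllm import LLM

    return LLM(**vllm_engine_kwargs(model_repo, quant_type))
```

So the dropdown only tells the engine how to read the weights. To get a different bit width, you point it at a different pre-quantized repository.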

glacial shale
#

[redacted] I'm hoping not to need 360GB of VRAM to run an 8x22B.
Edit: Oh wait, that just means I can point a name/model-AWQ-or-GPTQ repository to serverless.
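A back-of-the-envelope sketch of the weight-size arithmetic behind that worry. The 8 x 22B = 176B parameter count is a naive upper bound assumed here for illustration (MoE models usually share some layers across experts, so the real count is typically lower), and the estimate covers weights only, not KV cache or activations.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


# Hypothetical upper bound: 8 experts x 22B parameters each.
naive_params_billion = 8 * 22

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = weight_size_gb(naive_params_billion, bits)
    print(f"{label}: ~{size:.0f} GB of weights")
```

Under these assumptions fp16 lands in the same range as the figure quoted above, while a 4-bit AWQ or GPTQ quant needs about a quarter of that.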

glacial shale
#

two
looks like I wouldn't be able to run it anyway
Edit: I'll have to quantize this obscure model...

orchid cape
#

Which model? Many models are already available as quantized versions

glacial shale
orchid cape
#

I see there are EXL2 quantized versions, but vLLM doesn't support the EXL2 quant type

#

Aphrodite Engine and TabbyAPI both support EXL2 tho.

sharp saffron
#

Until vLLM supports more quant formats, you'll have to have an AWQ, SqueezeLLM, or GPTQ quant of the model. I used a Jupyter pod to make an AWQ of the model I wanted. Or if Aphrodite-Engine ever works on serverless, that will be an option too.
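For reference, making an AWQ quant in a Jupyter pod might look roughly like the sketch below. It assumes the AutoAWQ library (`pip install autoawq`) and follows the pattern in its README. The function names, paths and default settings are placeholders, not the exact steps used above.

```python
from typing import Any, Dict


def awq_quant_config(w_bit: int = 4, group_size: int = 128) -> Dict[str, Any]:
    """Quantization settings in the shape AutoAWQ's quantize() expects."""
    return {
        "zero_point": True,
        "q_group_size": group_size,
        "w_bit": w_bit,
        "version": "GEMM",
    }


def quantize_to_awq(model_path: str, quant_path: str) -> None:
    # Imported lazily: these need a GPU pod with autoawq and transformers.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Calibrates on AutoAWQ's default dataset and quantizes the weights.
    model.quantize(tokenizer, quant_config=awq_quant_config())

    # Write the quantized weights and tokenizer side by side so the
    # directory can be uploaded as a single repository.
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```

The resulting directory can then be pushed to a model repository and referenced from the serverless vLLM worker with the AWQ quant type selected.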

glacial shale
#

I ended up using KoboldCpp's runpod template for gguf, lol. And sharing with some people to spend less time idling. (I'm being an idiot, yes.)

sharp saffron
#

If all else fails, just run the numbers to see if serverless will be better for your use cases. There's an amount of active time where pods become more cost efficient.

#

On last-gen 48GB GPUs, for example, the break-even is 40% active runtime (time spent actively processing requests)
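"Running the numbers" can be sketched as below. The hourly prices are made-up placeholders chosen to reproduce a 40% break-even, not actual RunPod rates. The model assumes a pod is billed for every hour while serverless is billed only for active time, and it ignores cold starts and idle timeouts.

```python
def break_even_active_fraction(pod_hourly: float, serverless_hourly: float) -> float:
    """Fraction of wall-clock time a worker must be busy for costs to match."""
    return pod_hourly / serverless_hourly


def cheaper_option(active_fraction: float, pod_hourly: float, serverless_hourly: float) -> str:
    """Compare the cost of one wall-clock hour under each billing model."""
    serverless_cost = active_fraction * serverless_hourly
    return "serverless" if serverless_cost < pod_hourly else "pod"


# Hypothetical prices in USD per hour of compute.
POD_HOURLY = 0.80
SERVERLESS_HOURLY = 2.00

print(f"break-even: {break_even_active_fraction(POD_HOURLY, SERVERLESS_HOURLY):.0%} active")
for fraction in (0.2, 0.6):
    choice = cheaper_option(fraction, POD_HOURLY, SERVERLESS_HOURLY)
    print(f"{fraction:.0%} active -> {choice}")
```

Below the break-even fraction serverless is cheaper. Above it, an always-on pod wins.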

main aspen
#

GGUF is a format for offline use on your own computer. It's not really meant for servers. Use AWQ or GPTQ until EXL2 is supported in vLLM.

orchid cape
#

GGUF supports both CPU and GPU, not just CPU

main aspen
#

Yeah, and it's not the fastest though.

orchid cape
#

EXL2 is fastest.

main aspen
#

Yeah I just hope vLLM would one day support EXL2. It would open up so many new opportunities.

orchid cape
#

aphrodite engine supports it

main aspen
#

Yes, but Aphrodite runs only on classic pods, and it's very expensive to run. 🙂 This is why I love serverless: it's cheap to begin with (but gets 3x more expensive if you have constant traffic). Serverless is great for starting a project with minimal traffic; only once the project is a success and generates money is it worth switching to a classic pod with Aphrodite.

orchid cape
#

You can port aphrodite to serverless too

main aspen
#

But it will be experimental right? I don't think it's that easy and I'm so busy as it is with coding. 🙂

orchid cape
#

He is probably referring to aphrodite-engine. It's not experimental; TabbyAPI is:
"TabbyAPI is a hobby project solely for a small amount of users. It is not meant to run on production servers. For that, please look at other backends that support those workloads."

main aspen
#

No, I mean right now I can create a vLLM serverless endpoint directly from the RunPod dashboard.

#

The same isn’t true about Aphrodite as serverless

#

Hence I assume the latter is experimental.

sharp saffron
#

As in, there is no turnkey solution or "Just Type This" solution to deploying Aphrodite Engine on serverless.

crystal trail
#

Ohhh ya, there's no quick deploy for it yet, but vLLM isn't really production-ready for big models either; it has some bugs too

sharp saffron
#

True enough. I guess the question is, can we really say that any FOSS solution is 'production ready' right now?