#Loading LLama 70b on vllm template serverless cant answer a simple question like "what is your name"

44 messages · Page 1 of 1 (latest)

novel sequoia
#

I am loading with 1 worker and 2 GPU's 80g

But the model just cant performance at all, it gives gibrish answers for simple prompts like "what is your name"

frigid thorn
#

I tried it and it works, just long load times

#

What config do you use? Just the default?

novel sequoia
#

I am just setting bfloat16 the rest i leave blank/default.

When i load with web-ui, getting completely different responses.

frigid thorn
#

Oh you tried with pods too?

frigid thorn
#

And used a network volume

frigid thorn
white flare
#

Is it llama instruct? i think i was told there was a difference between llama 70b and instruct

#

Instruct is more like an actual chat, respond and answer

#

while the llama 70b is like some weird completion thing. i had also gotten gibberish answers in the past

#

making me move to just using openllm

frigid thorn
#

Instruct is the completion thing?

white flare
#

Lol 👁️

frigid thorn
#

I thought it was only instruct

white flare
#

Haha maybe im wrong and to use chat model

frigid thorn
#

I didn't see the llama chat version hmm

#

Can you send the link here

#

I wanna see haha

white flare
#

Oof I dont remember. let me see if i can find my old post on this where i also asked about gibberish coming out of vllm

frigid thorn
white flare
# frigid thorn What's the open llm ?

It’s just another framework to run llm models easily - i prefer to runpod’s vllm solution which i just dont prefer. some reason couldn’t ever get the vllm to work nicely / easily as openllm i felt

https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless

https://github.com/bentoml/OpenLLM

GitHub

A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.

GitHub

Run any open-source LLMs, such as Llama 2, Mistral, as OpenAI compatible API endpoint in the cloud. - bentoml/OpenLLM

#

and also i could get openllm to work vs ollama which requires a whole background server etc

#

and couldn’t ever get ollama to preload models properly

frigid thorn
#

Hmm alright I should try that out haha

white flare
white flare
#

Which basically is unusable

frigid thorn
#

Unusable? Try it out 😂

white flare
#

Thxfully depot gave me free caches 🙏

frigid thorn
white flare
white flare
#

Oh yea depot usually cost money

frigid thorn
#

The plan you pay for the depot

white flare
#

But they gave me a sponsored account

#

So i use it for free lol

frigid thorn
#

I c that's cool 👍

#

So, what's the "gibrish" response like? @novel sequoia

novel sequoia
#

Im using the instruct version. Just feels like its x10 quantized like the model is very stupid.

frigid thorn
#

Yeah its not normal