#how to set a max output token

11 messages · Page 1 of 1 (latest)

tall lily
#

Hi, I deployed a finetuned llama 3 via vllm serverless on runpod. However, I'm getting limited output tokens everytime. Does anyone know if we can alter the max output tokens while sending the input prompt json?

sand axle
#

vllm does not support yet llama 3.1

topaz nimbus
#

they're working to update vllm-worker

tall lily
#

I'm not using llama 3.1, it's the old llama 3

topaz nimbus
#

hmm okay max output tokens?

#

i think limited output tokens can be modified if you use openai pip package to request to runpod serverless

#

try using that

#

you can modify max output tokens

delicate panther
#

Are you asking how to set the Max Model Length parameter inside the vLLM worker? It is under LLM Settings.

tall lily