#Qwen2 qwen2-7b-instruct-fp16 short responses (llama.cpp, langchain)

48 messages · Page 1 of 1 (latest)

fierce orbit
#

Hello, I hope you are well. I am new to the world of ML; I have been experimenting with models and understanding concepts for a few months. I wanted to understand how some models behave when attempting to do web scraping, so I decided to use Qwen2 for my test. Initially, I used the 'traditional' method, but I noticed there was a significant wait time for the model inference (which I still don't understand why it happens compared to those loaded using llama.cpp, even if they are not quantified). However, I find that using GGUF models, they give me a brief response which doesn't fully address my question prompted. I have tried using different prompts and some hyperparameters without success, a bit frustrated without understanding the reasons. I directed myself to the demo version at https://huggingface.co/spaces/Qwen/Qwen2-7b-instruct-demo which gives me the responses I expect, so somehow I understand that I am doing something wrong.

I want to continue growing in this area that I am very passionate about. If anyone can explain what I am doing wrong and guide me on the right path, I would be very grateful. I will provide the file I am working with, excuse the comments in the code and the phrases in Spanish as it is my native language.

#

For the record i have mac m1 studio 32gb ram

#

Here is the exact model i used:

"""huggingface-cli download Qwen/Qwen2-7B-Instruct-GGUF qwen2-7b-instruct-fp16.gguf --local-dir . --local-dir-use-symlinks False""""

fierce orbit
#

I was just testing things, but it didn't change the behavior.

#

I tried setting it to True and got the same results.

wide rover
#

in llm add max_tokens = -1 or whatever

#

the default is some tiny value

fierce orbit
#

thank you for your suggestion but i also did that i tried max_tokens, max_new_tokens to like 4096 - 9000 same result

#

i will try again just in case

#

my test parameters

#

result

#

it should show 20 as in the demo page

#

demo page same model same Q, same prompt

wide rover
#

also isn't it supposed to be "repeat_penalty", these docs are awful

fierce orbit
#

Ty again for your time this was the result of change temperature and repeat_penalty

#

It's kind of weird because it feels like the model is always answering with the same max_tokens or almost similar.

#

i tried also llama3

wide rover
#

or have you already tried that

fierce orbit
#

let me try

wide rover
#

I've never used Qwen before so I don't really know if it has a secret parameter sauce it needs to perform well

#

also for this task you could go with Phi3

#

it does great on thsi stuff

fierce orbit
#

i will try with Phi3 but with no params this is the result

#

do you know why with llama cpp the inference time is like 10x faster?

#

i left _ctx because i need more context window

wide rover
#

the basic pytorch transformers models are always slower than the llama cpp/gguf/onnx version

fierce orbit
#

mm ok ok thanks for that info, now im downloading de phi3-small

wide rover
#

use phi3 small instruct

fierce orbit
wide rover
#

yup all good

#

there's also a 128k version

#

i personally prefer that cause i like big ole context windows

#

and it runs fine full precision on a T4

fierce orbit
#

yeaaa me too i love big context windows beecause i dont have to split the data to much

wide rover
#

also openrouter lets you run all these models for super cheap, some of them for free, in case you wanna just try a better model and get the task done

fierce orbit
#

trying to solve this issue

fierce orbit
#

going to check

fierce orbit
#

done!! phi-3-small-128k Phi-3-medium-128k-instruct-Q8_0

#

i think the gguf file from qwen2 it has like limited prompt inside

#

for only respond short responses i know what do you thing?

#

BTW thank you so much for the help i appreciated