# Error generating text, getting no text in response
The vocab sizes of the 2 models don't match.
You might be able to get around it by changing "vocab_size": 32001 to 32000 in the lora's config.json.
That might also break things.
I never really used Loras, but this often fixes model issues for me.
I looked in the tokenizer_config file, and nothing seems to use 32001.
Both of them are mostly the same.
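If you'd rather check the mismatch than eyeball it, here is a minimal sketch that reads `vocab_size` from two `config.json` files and compares them. The folder layout (one `config.json` per folder) and the function names are assumptions for illustration, not something from the web UI itself.

```python
import json
from pathlib import Path


def read_vocab_size(config_path):
    """Return the "vocab_size" field from a config.json, or None if it is absent."""
    with open(config_path, "r", encoding="utf-8") as f:
        return json.load(f).get("vocab_size")


def vocab_sizes_match(model_dir, lora_dir):
    """Compare vocab_size between two folders that each contain a config.json.

    Returns (match, model_vocab_size, lora_vocab_size).
    """
    model_size = read_vocab_size(Path(model_dir) / "config.json")
    lora_size = read_vocab_size(Path(lora_dir) / "config.json")
    return model_size == lora_size, model_size, lora_size


# Usage (hypothetical paths, adjust to your install):
# print(vocab_sizes_match("models/my-model", "loras/my-lora"))
```

A result like `(False, 32000, 32001)` would confirm the two configs disagree.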
One message removed from a suspended account.
Probably best to ask someone more familiar with Loras!
Have you tested other ones?
Also what model was it trained with?
Perhaps other models use a slightly different shape that causes the mismatch.
One message removed from a suspended account.
I mean, using a LoRA trained for Llama 2 might not work for Mistral.
I didn't look into the structure of those models. But I think the internals are sufficiently different because of the sliding window approach Mistral takes to extend context.
I could be wrong.
13B models should be pretty standard; only Llama 1 and Llama 2 had models in the 13B range.
One message removed from a suspended account.
But also yes, using models closer to what the lora was finetuned on would give you better results.
A LoRA is a difference that's put on top of a model.
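To make "a difference on top of a model" concrete, here is a toy sketch of the usual low-rank idea: the adapter stores two small matrices whose product is added to a base weight matrix. The numbers and the helper names (`matmul`, `apply_lora`) are made up for illustration; real loaders do this per layer on tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) without modifying the inputs."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 update (B is 2x1, A is 1x2).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, 0.5]]
merged = apply_lora(base, lora_b, lora_a)
print(merged)
```

The product has to have the same shape as the weight it is added to, which is one reason an adapter trained against differently shaped weights may fail to load.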
One message removed from a suspended account.
You do have the right model it was trained on.
Hmm, that's odd that they're breaking like that.
Yup, just took a look
One message removed from a suspended account.
not sure why the error message isn't shown in there, but it's saying your context is too long
if you see a message like
```
"Failed to build the chat prompt. The input is too long for the available context length.
Truncation length: {state['truncation_length']}
max_new_tokens: {state['max_new_tokens']} (is it too high?)
Available context length: {max_length}"
```
that might give you insight into some settings to change
One message removed from a suspended account.
Check your console for information about the error
One message removed from a suspended account.
Well your context is 1731 tokens, that should fit into 2048/4096 fine
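For a rough feel of how those settings interact, here is a small sketch. It assumes the available context is simply the truncation length minus `max_new_tokens`, which is what the error text suggests but is not confirmed here; the `max_new_tokens` values are hypothetical.

```python
def available_prompt_tokens(truncation_length, max_new_tokens):
    """Tokens left for the prompt after reserving room for the reply (assumed formula)."""
    return truncation_length - max_new_tokens


def prompt_fits(prompt_tokens, truncation_length, max_new_tokens):
    """True if the prompt fits in the space left after reserving max_new_tokens."""
    return prompt_tokens <= available_prompt_tokens(truncation_length, max_new_tokens)


# 1731 is the prompt size from this thread; the max_new_tokens values are made up.
for truncation_length, max_new_tokens in [(2048, 200), (2048, 512), (4096, 512)]:
    fits = prompt_fits(1731, truncation_length, max_new_tokens)
    print(truncation_length, max_new_tokens, fits)
```

Under that assumption, a 1731-token prompt fits in 2048 only while `max_new_tokens` stays at 317 or below, so it is worth checking that setting too.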
One message removed from a suspended account.
make a backup of the tokenizer and config.yml
But try copying those 2 files from the lora to the model.
I feel like you would be getting different errors, but sure!
change the lora config.json back to 32001 as it was
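If you want to script the backup-then-copy step, something like this would do it. It is only a sketch: the folder paths and file names are whatever applies on your machine, and `backup_and_copy` is a hypothetical helper, not part of any tool mentioned here.

```python
import shutil
from pathlib import Path


def backup_and_copy(src_dir, dst_dir, filenames, backup_suffix=".bak"):
    """Back up each named file in dst_dir, then overwrite it with the copy from src_dir.

    Returns the list of backup paths that were created.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    backups = []
    for name in filenames:
        src, dst = src_dir / name, dst_dir / name
        if not src.is_file():
            raise FileNotFoundError(f"{src} does not exist")
        if dst.exists():
            # Keep the original next to the new file, e.g. config.json.bak
            backup = dst.with_name(dst.name + backup_suffix)
            shutil.copy2(dst, backup)
            backups.append(backup)
        shutil.copy2(src, dst)
    return backups


# Usage (hypothetical paths and file names):
# backup_and_copy("loras/my-lora", "models/my-model", ["config.json"])
```

To undo it, copy the `.bak` file back over the replaced one.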
One message removed from a suspended account.
Did you make sure that lora/config.json and model/config.json are both set to 32001?
One message removed from a suspended account.
That's confusing, wonder what's going on.
I did read some more about the model and it said it's a test version.
No clue if it's that or another issue
One message removed from a suspended account.
I would look around testing models and see what works best for you.
The last few models I've used and liked a lot were OpenHermes-2.5-Mistral-7B and SOLAR-10.7B-Instruct-v1.0
One message removed from a suspended account.
What kind of GPU are you using?
I heard some don't play well with ExLlama.
With Mistral I get similar speeds to a 7B model, and Solar is not far behind.
I load them with ExllamaV2_HF.
I'll share a link to the exact versions I use when I get on PC
One message removed from a suspended account.
That should definitely support it
I noticed you had context length really high.
Check whether your VRAM usage is spilling into shared VRAM, because that would also slow down your model.
One message removed from a suspended account.
Your 3060 has 12 GB of VRAM.
The GDDR6 vram in GPUs is significantly faster than normal ram in your pc.
For best performance you'd want the model entirely in vram, that way the GPU can do all the processing.
Otherwise it needs to swap layers in and out, or use the CPU to do some of the processing,
which can slow it down a huge amount.
For example, Solar10B at 4bit uses ~8GB vram on my GPU at a context length of 4096
One message removed from a suspended account.
When loading your model, you can set the max context length
This also affects how much VRAM/RAM your model will take up
in your screenshot it is set to 17664
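Here is a back-of-the-envelope sketch of why the context setting matters for memory. It only counts the quantized weights plus a 16-bit key/value cache, so treat the result as a lower bound; real usage is higher because of activations and loader overhead. The layer and head counts are my assumption for a SOLAR-10.7B-style model, so check the model's `config.json` before relying on them.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the weights alone at a given quantization width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(context_length, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor 2) stored for every layer, KV head and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_value


# Assumed architecture values, not taken from this thread.
ASSUMED = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (4096, 17664):
    total = weight_bytes(10.7e9, 4) + kv_cache_bytes(ctx, **ASSUMED)
    print(f"context {ctx}: at least ~{total / GIB:.1f} GiB")
```

The cache term grows linearly with context, so going from 4096 to 17664 makes it about 4.3 times larger while the weights stay the same size.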
I was suggesting it's worth checking task manager if on windows
to make sure you're within the gpu still
in task manager, performance tab
there are 2 graphs for gpu memory usage
the top one (dedicated) is the real VRAM, the bottom one (shared) is your PC kind of faking it by using system RAM as well (but this is slower)
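If you prefer a script over Task Manager, this sketch asks `nvidia-smi` for used and total memory on NVIDIA cards. As far as I know it reports dedicated VRAM only, not the shared memory graph, so the thing to watch for is usage sitting right at the total. The helper names are made up for illustration.

```python
import subprocess


def parse_memory_line(line):
    """Parse one 'used, total' line (values in MiB) from nvidia-smi's CSV output."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def query_gpu_memory():
    """Return a list of (used_mib, total_mib) tuples, one per NVIDIA GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_memory_line(line) for line in out.splitlines() if line.strip()]


# Usage (needs an NVIDIA driver with nvidia-smi on the PATH):
# for used, total in query_gpu_memory():
#     print(f"{used} / {total} MiB used")
```

Run it once after loading the model and again while generating, since usage goes up during generation.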
okay, I loaded the 10B model at 17664 max context on my 11GB gpu.
It doesn't fully fit and starts leaking into the "shared vram"
When you start generating text, the VRAM usage will go up a little more as well.
it's very possible this is the issue
One message removed from a suspended account.
That is odd, try saving the settings as well
One message removed from a suspended account.
Yea, the more layers of the model that are pushed into the CPU ram, the slower it will get
One message removed from a suspended account.