Error generating text, getting no text in response | Text Generation WebUI | Page 1

little flower Jun 13, 2024, 10:06 AM

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 10:10 AM

#

The vocab sizes of the 2 models don't match.
You might be able to get around it by changing "vocab_size": 32001 to 32000 in the lora's config.json.

That might also break things.
I've never really used LoRAs, but editing the config like this often fixes model issues for me.
I looked in the tokenizer_config file, and nothing seems to use 32001.

Both of them are mostly the same.
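
A quick way to check for this kind of mismatch is to load both config.json files and compare the field directly. This is a minimal sketch, assuming both the base model folder and the LoRA folder contain a config.json with a "vocab_size" key as described above; the folder paths are hypothetical placeholders.

```python
import json
from pathlib import Path


def read_vocab_size(config_path):
    """Return the "vocab_size" value from a config.json, or None if absent."""
    with open(config_path, encoding="utf-8") as f:
        return json.load(f).get("vocab_size")


def vocab_sizes_match(model_config_path, lora_config_path):
    """Compare vocab_size across the two configs and report the values found."""
    model_size = read_vocab_size(model_config_path)
    lora_size = read_vocab_size(lora_config_path)
    return model_size == lora_size, model_size, lora_size


if __name__ == "__main__":
    # Hypothetical paths; point these at your own model and LoRA folders.
    model_cfg = Path("models/my-model/config.json")
    lora_cfg = Path("loras/my-lora/config.json")
    if model_cfg.exists() and lora_cfg.exists():
        ok, model_size, lora_size = vocab_sizes_match(model_cfg, lora_cfg)
        print(f"model={model_size} lora={lora_size} match={ok}")
    else:
        print("Edit the paths above to point at real config.json files.")
```

Reading the values before editing anything makes it easier to put them back if the change breaks things.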

little flower Jun 13, 2024, 10:13 AM

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 10:28 AM

#

Probably best to ask someone more familiar with LoRAs!
Have you tested other ones?

Also, what model was it trained with?
Perhaps other models use a slightly different shape that causes the mismatch.

little flower Jun 13, 2024, 10:28 AM

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 10:34 AM

#

little flower One message removed from a suspended account.

I mean, using a LoRA trained for Llama 2 might not work for Mistral.
I didn't look into the structure of those models, but I think the internals are sufficiently different because of the sliding window approach Mistral takes to extend context.
I could be wrong.

#

13B models should be pretty standard; only Llama 1 and Llama 2 had models in the 13B range.

little flower Jun 13, 2024, 10:35 AM

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 10:36 AM

#

little flower One message removed from a suspended account.

But also yes, using models closer to what the LoRA was finetuned on would give you better results.
A LoRA is a difference that's put on top of a model.
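
To make "a difference on top of a model" concrete: a LoRA stores two small matrices per adapted layer, and their product is added to the base weight. Below is a minimal pure-Python sketch of that idea with toy sizes. The function names are made up for illustration, and real loaders do this with tensors rather than lists. It also shows why a shape mismatch, such as 32001 rows against 32000, cannot simply be added.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a).

    weight is out x in, lora_b is out x r, lora_a is r x in.
    """
    delta = matmul(lora_b, lora_a)
    if len(delta) != len(weight) or len(delta[0]) != len(weight[0]):
        raise ValueError(
            f"shape mismatch: base is {len(weight)}x{len(weight[0])}, "
            f"LoRA delta is {len(delta)}x{len(delta[0])}"
        )
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


base = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 base weight
b = [[1.0], [2.0]]               # 2x1, so the update has rank 1
a = [[0.5, 0.5]]                 # 1x2
print(apply_lora(base, b, a))    # base plus the low-rank update
```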

little flower Jun 13, 2024, 10:36 AM

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 10:37 AM

#

little flower One message removed from a suspended account.

You do have the right model it was trained on.
Hmm, that's odd that they're breaking like that.

valid walrus Jun 13, 2024, 10:37 AM

#

little flower One message removed from a suspended account.

Yup, just took a look

little flower Jun 13, 2024, 10:38 AM

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 10:40 AM

#

not sure why the error message isn't shown in there, but it's saying your context is too long

#

if you see a message like
```
"Failed to build the chat prompt. The input is too long for the available context length.

Truncation length: {state['truncation_length']}
max_new_tokens: {state['max_new_tokens']} (is it too high?)
Available context length: {max_length}"
```
that might give you insight into some settings to change
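
A rough sketch of how the three numbers in that message relate, assuming the available context length is the truncation length minus max_new_tokens (an assumption based on the wording of the message, not checked against the webui source). The function names and example values are illustrative only.

```python
def available_context(truncation_length, max_new_tokens):
    """Tokens left for the prompt after reserving room for the reply."""
    return truncation_length - max_new_tokens


def prompt_fits(prompt_tokens, truncation_length, max_new_tokens):
    """True if the prompt fits in what is left of the context window."""
    return prompt_tokens <= available_context(truncation_length, max_new_tokens)


# Illustrative values only.
print(available_context(4096, 512))   # 3584
print(prompt_fits(3000, 4096, 512))   # True
print(prompt_fits(3000, 4096, 2048))  # False: max_new_tokens is too high
```

Under that assumption, a prompt can be shorter than the truncation length and still fail to fit if max_new_tokens reserves too much of the window.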

little flower Jun 13, 2024, 10:42 AM

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 10:43 AM

#

Check your console for information about the error

little flower Jun 13, 2024, 10:43 AM

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 10:44 AM

#

Well your context is 1731 tokens, that should fit into 2048/4096 fine

little flower Jun 13, 2024, 10:45 AM

#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 10:47 AM

#

Make a backup of the tokenizer and config.yml first,
then try copying those 2 files from the LoRA to the model.

valid walrus Jun 13, 2024, 10:47 AM

#

little flower One message removed from a suspended account.

I feel like you would be getting different errors, but sure!

valid walrus Jun 13, 2024, 10:48 AM

#

valid walrus make a backup of the tokenizer and config.yml But try copying those 2 files from...

change the lora config.json back to 32001 as it was

little flower Jun 13, 2024, 10:49 AM

#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 11:04 AM

#

little flower One message removed from a suspended account.

Did you make sure that
lora/config.json and model/config.json are both set to 32001?

little flower Jun 13, 2024, 11:20 AM

#

One message removed from a suspended account.

little flower Jun 13, 2024, 11:39 AM

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 11:43 AM

#

That's confusing, wonder what's going on.
I did read some more about the model and it said it's a test version.
No clue if it's that or another issue

little flower Jun 13, 2024, 11:44 AM

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus Jun 13, 2024, 12:04 PM

#

I would look around testing models and see what works best for you.

The last few models I've used and liked a lot were OpenHermes-2.5-Mistral-7B and SOLAR-10.7B-Instruct-v1.0

little flower Jun 13, 2024, 12:18 PM

#

One message removed from a suspended account.

little flower Jun 14, 2024, 1:20 AM

#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus Jun 14, 2024, 3:50 AM

#

What kind of GPU are you using?
I heard some don't play well with Exllama.

With Mistral I get similar speeds to a 7B model, and Solar is not far behind.

I load them with ExllamaV2_HF.
I'll share a link to the exact versions I use when I get on PC

little flower Jun 14, 2024, 4:53 AM

#

One message removed from a suspended account.

valid walrus Jun 14, 2024, 4:54 AM

#

That should definitely support it

#

I noticed you had the context length set really high.
Check whether your VRAM is leaking into shared VRAM, because that would also slow down your model.

little flower Jun 14, 2024, 4:58 AM

#

One message removed from a suspended account.

valid walrus Jun 14, 2024, 5:05 AM

#

Your 3060 has 12 GB of VRAM.
The GDDR6 VRAM in GPUs is significantly faster than the normal RAM in your PC.

For best performance you'd want the model entirely in VRAM, so the GPU can do all the processing.

Otherwise it needs to swap layers in and out, or use the CPU for part of the processing,
which can slow it down a huge amount.

#

For example, SOLAR 10.7B at 4-bit uses ~8 GB of VRAM on my GPU at a context length of 4096

little flower Jun 14, 2024, 5:07 AM

#

One message removed from a suspended account.

valid walrus Jun 14, 2024, 5:07 AM

#

When loading your model, you can set the max context length

#

This also affects how much VRAM/RAM your model will take up

#

in your screenshot it is set to 17664
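
As a rough illustration of why the max context length matters for memory: the key/value cache grows linearly with context. This is a back-of-the-envelope sketch. The layer and head counts below are assumptions for a SOLAR-10.7B-style model (48 layers, 8 KV heads, head size 128) with a 16-bit cache, so check your model's config.json before trusting the numbers. It ignores the model weights and any other overhead.

```python
def kv_cache_bytes(context_length, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_length


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / (1024 ** 3)


# Assumed architecture values (not taken from this thread).
ASSUMED = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (4096, 17664):
    size = kv_cache_bytes(ctx, **ASSUMED)
    print(f"context {ctx}: ~{to_gib(size):.2f} GiB of KV cache")
```

With these assumed values the cache alone is about 0.75 GiB at 4096 and about 3.23 GiB at 17664, on top of the weights.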

#

I was suggesting it's worth checking task manager if on windows

#

to make sure you're within the gpu still

#

in task manager, performance tab

#

there are 2 graphs for gpu memory usage

#

the top one is the real (dedicated) VRAM, the bottom one is shared GPU memory, which is your PC kind of faking it by using your system RAM as well (but this is slower)
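
On NVIDIA cards the dedicated VRAM figure can also be read from the command line. Here is a small sketch that calls nvidia-smi and parses its CSV output. It assumes nvidia-smi is on the PATH, and it only reports dedicated VRAM, not the shared memory graph from Task Manager. The helper names are made up for illustration.

```python
import shutil
import subprocess


def parse_memory_csv(text):
    """Parse lines like '8123, 12288' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        try:
            used, total = (int(part.strip()) for part in line.split(","))
        except ValueError:
            continue  # skip lines that are not two integers
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...] per GPU, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_memory_csv(out)


for used, total in query_gpu_memory():
    print(f"{used} MiB used of {total} MiB ({100 * used / total:.0f}%)")
```

If the used figure sits right at the total while the model is loaded, that is a hint the rest has spilled into shared memory.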

#

okay, I loaded the 10B model at 17664 max context on my 11GB gpu.
It doesn't fully fit and starts leaking into the "shared vram"

#

When you start generating text, the VRAM usage will go up a little more as well.

#

it's very possible this is the issue

little flower Jun 14, 2024, 5:14 AM

#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus Jun 14, 2024, 5:15 AM

#

That is odd, try saving the settings as well

little flower Jun 14, 2024, 5:17 AM

#

One message removed from a suspended account.

valid walrus Jun 14, 2024, 5:18 AM

#

Yea, the more layers of the model that are pushed into the CPU ram, the slower it will get

little flower Jun 14, 2024, 5:26 AM

#

One message removed from a suspended account.

valid walrus Jun 14, 2024, 5:28 AM

#

Yea, you should be getting ~20+ tokens per sec if fully in GPU

#

if it's still in the 1-6 range you should try lowering the context length.

6k should fit perfectly though!
