#Error generating text, getting no text in response


little flower
#

One message removed from a suspended account.

valid walrus
#

The vocab sizes of the two models don't match.
You might be able to get around it by changing "vocab_size": 32001 to 32000 in the LoRA's config.json.

That might also break things.
I've never really used LoRAs, but this often fixes model issues for me.
I looked in the tokenizer_config file, and nothing seems to use 32001.

Both of them are mostly the same.
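
To check the mismatch before editing anything by hand, a minimal sketch like the one below could compare the two values. It assumes both folders contain a config.json with a "vocab_size" key, as described above; the folder paths are hypothetical placeholders.

```python
import json
from pathlib import Path


def read_vocab_size(config_path):
    """Return the "vocab_size" value from a JSON config file, or None if absent."""
    with open(config_path, "r", encoding="utf-8") as f:
        return json.load(f).get("vocab_size")


def vocab_sizes_match(model_config, lora_config):
    """Return (match, model_size, lora_size) for the two config files."""
    model_size = read_vocab_size(model_config)
    lora_size = read_vocab_size(lora_config)
    return model_size == lora_size, model_size, lora_size


if __name__ == "__main__":
    # Hypothetical paths; point these at your own model and LoRA folders.
    model_cfg = Path("models/my-model/config.json")
    lora_cfg = Path("loras/my-lora/config.json")
    if model_cfg.exists() and lora_cfg.exists():
        ok, m, l = vocab_sizes_match(model_cfg, lora_cfg)
        print(f"model vocab_size={m}, lora vocab_size={l}, match={ok}")
    else:
        print("Edit the paths above to point at real config.json files.")
```

This only reports the two numbers. Whether editing the value actually fixes loading is a separate question.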

little flower
#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus
#

Probably best to ask someone more familiar with LoRAs!
Have you tested other ones?

Also, what model was it trained with?
Perhaps other models use a slightly different shape that causes the mismatch.

little flower
#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus
#

I mean, using a LoRA trained for Llama 2 might not work for Mistral.
I didn't look into the structure of those models, but I think the internals are sufficiently different because of the sliding window approach Mistral takes to extend context.
I could be wrong.

#

13B models should be pretty standard; only Llama 1 and Llama 2 had models in the 13B range.

little flower
#

One message removed from a suspended account.

little flower
#

One message removed from a suspended account.

little flower
#

One message removed from a suspended account.

valid walrus
#

not sure why the error message isn't shown in there, but it's saying your context is too long

#

If you see a message like
```
"Failed to build the chat prompt. The input is too long for the available context length.

Truncation length: {state['truncation_length']}
max_new_tokens: {state['max_new_tokens']} (is it too high?)
Available context length: {max_length}"
```
that might give you insight into some settings to change.
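
As a rough illustration of how those three numbers relate, here is a small sketch. It assumes the available context is simply the truncation length minus max_new_tokens, which is my reading of the message and not a confirmed detail of the UI's code. The function names and example numbers are made up.

```python
def available_context(truncation_length, max_new_tokens):
    """Tokens left for the prompt after reserving room for the reply (assumed formula)."""
    return truncation_length - max_new_tokens


def prompt_fits(prompt_tokens, truncation_length, max_new_tokens):
    """True if the prompt fits into what is left of the context window."""
    return prompt_tokens <= available_context(truncation_length, max_new_tokens)


# Made-up numbers: a 3900-token prompt with a 4096 window and 512 reserved tokens.
print(available_context(4096, 512))
print(prompt_fits(3900, 4096, 512))
```

If the check fails, the options are lowering max_new_tokens, raising the truncation length (if the model supports it), or shortening the prompt.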

little flower
#

One message removed from a suspended account.

valid walrus
#

Check your console for information about the error

little flower
#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus
#

Well, your context is 1731 tokens; that should fit into 2048 or 4096 fine.

little flower
#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus
#

Make a backup of the model's tokenizer and config.yml first.
Then try copying those 2 files from the LoRA to the model.
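
If you'd rather script it than drag files around, something like this sketch would do the backup and the copy in one go. The helper name and the `.bak` suffix are my own choices, and you pass in whatever file names your folders actually contain.

```python
import shutil
from pathlib import Path


def backup_and_copy(model_dir, lora_dir, filenames, backup_suffix=".bak"):
    """Back up each named file in model_dir, then overwrite it with the LoRA's copy.

    Returns the list of file names that were actually copied.
    """
    model_dir, lora_dir = Path(model_dir), Path(lora_dir)
    copied = []
    for name in filenames:
        src = lora_dir / name
        dst = model_dir / name
        if not src.is_file():
            continue  # the LoRA folder has no such file; leave the model untouched
        if dst.is_file():
            # Keep the original next to it, e.g. "config.yml" -> "config.yml.bak".
            shutil.copy2(dst, dst.with_name(dst.name + backup_suffix))
        shutil.copy2(src, dst)
        copied.append(name)
    return copied
```

To undo it, copy each `.bak` file back over the file it was made from.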

little flower
#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

little flower
#

One message removed from a suspended account.

little flower
#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus
#

That's confusing, wonder what's going on.
I did read some more about the model and it said it's a test version.
No clue if it's that or another issue

little flower
#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus
#

I would look around testing models and see what works best for you.

The last few models I've used and liked a lot were OpenHermes-2.5-Mistral-7B and SOLAR-10.7B-Instruct-v1.0.

little flower
#

One message removed from a suspended account.

little flower
#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus
#

What kind of GPU are you using?
I heard some don't play well with Exllama.

With Mistral I get similar speeds to a 7B model, and Solar is not far behind.

I load them with ExllamaV2_HF.
I'll share a link to the exact versions I use when I get on my PC.

little flower
#

One message removed from a suspended account.

valid walrus
#

That should definitely support it

#

I noticed you had the context length set really high.
Check whether your VRAM usage is leaking into shared VRAM, because that would also slow down your model.

little flower
#

One message removed from a suspended account.

valid walrus
#

Your 3060 has 12 GB of VRAM.
The GDDR6 VRAM in GPUs is significantly faster than the normal RAM in your PC.

For best performance you'd want the model entirely in VRAM, so that the GPU can do all the processing.

Otherwise it needs to swap layers in and out, or use the CPU for part of the processing,
which can slow it down a huge amount.

#

For example, Solar 10B at 4-bit uses ~8 GB of VRAM on my GPU at a context length of 4096.

little flower
#

One message removed from a suspended account.

valid walrus
#

When loading your model, you can set the max context length

#

This also affects how much VRAM/RAM your model will take up.
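
To get a feel for why context length matters, here is a back-of-the-envelope sketch. It counts only the quantized weights and a 16-bit key/value cache, and ignores activations and loader overhead, so real usage will be higher. The architecture numbers (48 layers, 8 KV heads, head size 128) are assumptions for a 10.7B Llama-style model; check your model's own config before trusting them.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the weights alone at a given quantization."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Key/value cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len


# Assumed architecture values, not read from any real config file.
N_PARAMS, N_LAYERS, N_KV_HEADS, HEAD_DIM = 10_700_000_000, 48, 8, 128

print(f"weights at 4-bit: {weight_bytes(N_PARAMS, 4) / GIB:.2f} GiB")
for ctx in (4096, 16384):
    cache = kv_cache_bytes(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"KV cache at {ctx} tokens: {cache / GIB:.2f} GiB")
```

The cache term grows linearly with the max context length, so four times the context needs four times the cache on top of the fixed cost of the weights.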

#

in your screenshot it is set to 17664

#

I was suggesting it's worth checking task manager if on windows

#

to make sure you're within the gpu still

#

in task manager, performance tab

#

there are 2 graphs for gpu memory usage

#

the top one is the real (dedicated) VRAM, the bottom one is shared memory, where your PC is kind of faking extra VRAM by using system RAM as well (but this is slower)
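
If you prefer a script over Task Manager, the sketch below asks `nvidia-smi` for dedicated VRAM usage. It assumes an NVIDIA GPU with `nvidia-smi` on the PATH, and it only shows dedicated memory, not the shared graph, so treat usage close to the total as a hint that spilling is likely. The function names are my own.

```python
import shutil
import subprocess


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into two ints."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...] per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    readings = []
    for line in result.stdout.splitlines():
        try:
            readings.append(parse_memory_line(line))
        except ValueError:
            continue  # skip blank or non-numeric lines
    return readings


if __name__ == "__main__":
    for used, total in query_gpu_memory():
        print(f"{used} / {total} MiB dedicated VRAM in use")
```

Run it once after loading the model and again while generating, since usage goes up a bit during generation.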

#

okay, I loaded the 10B model at 17664 max context on my 11GB gpu.
It doesn't fully fit and starts leaking into the "shared vram"

#

When you start generating text, the VRAM usage will go up a little more as well.

#

it's very possible this is the issue

little flower
#

One message removed from a suspended account.

#

One message removed from a suspended account.

#

One message removed from a suspended account.

valid walrus
#

That is odd, try saving the settings as well

little flower
#

One message removed from a suspended account.

valid walrus
#

Yea, the more layers of the model that are pushed into the CPU ram, the slower it will get

little flower
#

One message removed from a suspended account.

valid walrus
#

Yea, you should be getting ~20+ tokens per sec if fully in GPU

#

if it's still in the 1-6 range you should try lowering the context length.

6k should fit perfectly though!