#noob question, ran out of tokens. What do I do? How do you get more tokens or how do you avoid that?
I think you might be misunderstanding what tokens are? They're basically word fragments
not 100% sure what you mean by ran out. Are you talking about on a web-hosted service?
nope, oobabooga. Yeah I did misunderstand, I guess I'll just tweak some settings instead.
I can get it to go for long conversations but eventually it just does that and it doesn't get past that
Yeah the error message above that says you've run out of memory
yeah, I guess all I can do is tweak settings or change to a smaller model
either smaller model, or reduce the context length (max_seq_len) on the model load page
not an option but I'll keep it in mind when I try other models. I usually set context length to the highest possible since I think it improves quality, I didn't know there was a downside other than performance/resource requirements
the context length affects how much of the context is stored in memory. The longer the length, the more memory it uses
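To put rough numbers on that: most of the memory that grows with context is the key/value cache, which scales linearly with the number of tokens. Here's a minimal sketch of the usual back-of-the-envelope estimate. The function name is made up for illustration, and the layer/head numbers are assumed to match a Mistral-7B-style model (32 layers, 8 KV heads, head size 128, 16-bit cache), so check your model's config before trusting them. It also ignores the model weights and any loader overhead.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV cache size: one key and one value vector per layer, per KV head, per token."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed Mistral-7B-style shape; swap in the values from your model's config.
for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB")
#   2048 tokens -> 0.25 GiB
#   8192 tokens -> 1.00 GiB
#  32768 tokens -> 4.00 GiB
```

So under those assumptions, going from 8k to 32k context costs about 3 GiB more on top of the weights, which is often the difference between fitting in VRAM and running out.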
so basically it's like memory, context is the whole conversation.
Is there a way to reduce it? Just so it forgets the older parts of it? without deleting the conversation
context is the amount of the conversation to keep in memory
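The "forget the older parts without deleting the conversation" idea can be sketched like this: keep the full history around, but only send the newest messages that fit in a token budget. This is just an illustration, not how oobabooga does it internally. `count_tokens` and `trim_history` are hypothetical helpers, and the whitespace split is a stand-in for a real tokenizer (real token counts will be higher, since tokens are word fragments).

```python
def count_tokens(text):
    # Stand-in tokenizer: counts whitespace-separated words.
    # A real tokenizer splits into word fragments and returns a larger number.
    return len(text.split())

def trim_history(messages, max_tokens):
    """Return the newest messages that fit in max_tokens, in original order."""
    kept = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()
    return kept

history = ["hello there", "hi how are you", "fine thanks", "tell me a story please"]
print(trim_history(history, max_tokens=8))
# ['fine thanks', 'tell me a story please']
```

The `history` list itself is never modified, so the whole conversation stays on disk or on screen; only what gets fed to the model shrinks.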
what type of model you are using? Just GPTQ?
I am trying a bunch,
TheBloke/PsyMedRP-v1-20B-GGUF psymedrp-v1-20b.Q4_K_M.gguf
mlabonne/NeuralHermes-2.5-Mistral-7B
Open-Orca/Mistral-7B-OpenOrca
TheBloke/WizardCoder-15B-1.0-GPTQ
and many more
I am mainly trying to understand how it works, what's available, and what I think is the best experience
So if you're choosing AutoGPTQ, the settings are pretty generic. I'd recommend you pick the actual loader instead for the appropriate model type. That should give you more settings to set things like context length
So e.g. for GPTQ, choose Exllamav2_HF
exllamav2 doesn't work for this one since it's not based on llama
gotcha
do transformers and autogptq have an unlimited context size? As in, they don't have a context size limit?
I'm not too familiar with either of those honestly. I'd have to assume it's based on the amount of VRAM that you specify
alright, I'll test that theory
the maximum context length is determined by the model, not the loader you use
you can just artificially limit it to fit on less capable systems
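In other words, whatever you type into the loader is a cap on top of the model's own limit. A minimal sketch of that logic, assuming the model ships a Hugging Face-style config.json that records its limit under `max_position_embeddings` (many Llama and Mistral models do, but not every architecture uses that key). `effective_context` is a made-up name for illustration.

```python
import json

def effective_context(config_json, requested):
    """Return the context length actually usable: never more than the model supports."""
    config = json.loads(config_json)
    # Assumed key; some architectures store the limit under a different name.
    model_max = config["max_position_embeddings"]
    return min(requested, model_max)

# Example config with an assumed 32768-token limit.
example_config = '{"max_position_embeddings": 32768}'
print(effective_context(example_config, 8192))    # 8192 (limited by you, to save memory)
print(effective_context(example_config, 100000))  # 32768 (limited by the model)
```

Lowering the requested value is the "artificial limit" for smaller systems; raising it past the model's maximum gains nothing.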
you're using a really odd mix of models and loaders. If you have a fairly capable gpu, just stick to .exl, otherwise stick to gguf and offload. There's no reason to use full precision models unless you're planning to train
is an RTX 3080 with 12GB of VRAM fairly capable?