#reset conversation id


raven valve
#

In the VPE log, I saw 'reset conversation ID.' How does the reset work, and what effects might it have? I noticed that when the gap between the first and second conversations is short, the LLM responds very quickly. However, when the gap is around 1 hour, the LLM response time doubles. Does anyone know why this happens?

#

I use the local Ollama.

cloud patio
#

Do you have the model set to never unload in Ollama? Otherwise, I think Ollama will unload the model after a while and only reload it when another request comes in, unless you configure it not to do that.
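One way to check whether unloading is the cause: the final JSON object returned by Ollama's `/api/generate` and `/api/chat` endpoints includes `load_duration` and `total_duration`, both in nanoseconds. A minimal sketch that turns those fields into seconds and shows how much of the wait was model loading. The sample values are made up for illustration, not measurements from this thread.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def summarize_timings(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into seconds."""

    def seconds(key: str) -> float:
        # Missing fields are treated as zero so partial responses do not crash.
        return response.get(key, 0) / NANOSECONDS_PER_SECOND

    load = seconds("load_duration")
    total = seconds("total_duration")
    return {
        "load_s": load,
        "total_s": total,
        # Fraction of the total request time spent loading the model.
        "load_share": load / total if total else 0.0,
    }


# Hypothetical response values, for illustration only.
sample = {"total_duration": 8_000_000_000, "load_duration": 4_000_000_000}
print(summarize_timings(sample))
```

If `load_share` is large on the slow request and close to zero on the fast one, the extra time is most likely the model being loaded back into memory.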

#

The conversation ID identifies the memory/context of the current conversation. I think by default HA will try to remember up to the last 20 messages, as long as they fit in the context window.

#

But after a while of inactivity it gets reset, since the conversation is assumed to be done.
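To illustrate the idea, here is a rough sketch of an idle-timeout reset. This is not Home Assistant's actual code: the `ConversationStore` class, its method names, and the timeout value are all hypothetical. Only the 20-message cap comes from the description above.

```python
import time
import uuid


class ConversationStore:
    """Hypothetical store that resets a conversation after an idle period."""

    def __init__(self, idle_timeout_s: float, max_messages: int = 20,
                 clock=time.monotonic):
        self._idle_timeout_s = idle_timeout_s  # assumed value, chosen by the caller
        self._max_messages = max_messages
        self._clock = clock
        self._conversations = {}

    def add_message(self, conversation_id, message: str) -> str:
        """Append a message and return the (possibly new) conversation ID."""
        now = self._clock()
        conv = self._conversations.get(conversation_id)
        expired = conv is not None and now - conv["last_used"] > self._idle_timeout_s
        if conv is None or expired:
            # Unknown or idle for too long: drop the old context, start fresh.
            self._conversations.pop(conversation_id, None)
            conversation_id = uuid.uuid4().hex
            conv = {"messages": [], "last_used": now}
            self._conversations[conversation_id] = conv
        conv["messages"].append(message)
        # Keep only the most recent messages.
        conv["messages"] = conv["messages"][-self._max_messages:]
        conv["last_used"] = now
        return conversation_id

    def history(self, conversation_id) -> list:
        return list(self._conversations[conversation_id]["messages"])
```

A new ID means the next prompt is sent without the earlier messages, so a reset by itself should shorten the prompt rather than slow the response.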

cloud patio
#

Yeah, I think Ollama itself has a setting for it too. I always make sure to have that one set to not unload as well 😅
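For reference, Ollama accepts a `keep_alive` field on `/api/generate` and `/api/chat` requests, and a negative value such as `-1` keeps the model loaded indefinitely. The server-wide equivalent is the `OLLAMA_KEEP_ALIVE` environment variable. A minimal sketch of such a request, assuming the server listens on the default `localhost:11434`. The helper names `build_request` and `send` are just for this example.

```python
import json
import urllib.request

# Assumes Ollama's default address; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, keep_alive=-1) -> dict:
    """Build a non-streaming generate payload; keep_alive=-1 means never unload."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }


def send(payload: dict, url: str = OLLAMA_URL) -> dict:
    """POST the payload to a running Ollama server and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_request("qwen2.5:7b", "Hello")
print(json.dumps(payload))
```

Calling `send(payload)` needs a running Ollama server, so it is left out of the snippet itself.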

cloud patio
#

You may want to try turning on flash attention and KV cache quantization if you haven't already. That can speed things up by reducing the memory the context window uses, though you may need to experiment and see how it affects the model's performance.
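In Ollama, both options are set through environment variables on the server process: `OLLAMA_FLASH_ATTENTION=1` and `OLLAMA_KV_CACHE_TYPE` (`f16`, `q8_0`, or `q4_0`). A quantized cache type only takes effect with flash attention enabled. A small sketch that builds such an environment. The `ollama_env` helper is a made-up name for this example.

```python
import os

ALLOWED_CACHE_TYPES = {"f16", "q8_0", "q4_0"}


def ollama_env(flash_attention: bool = True, kv_cache_type: str = "q8_0",
               base=None) -> dict:
    """Return a copy of the environment with the two Ollama settings applied."""
    if kv_cache_type not in ALLOWED_CACHE_TYPES:
        raise ValueError(f"unsupported KV cache type: {kv_cache_type}")
    if kv_cache_type != "f16" and not flash_attention:
        # A quantized KV cache relies on flash attention being enabled.
        raise ValueError("quantized KV cache requires flash attention")
    env = dict(os.environ if base is None else base)
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    return env


# Start from an empty base here just to show the two variables on their own.
print(ollama_env(base={}))
```

The result can be passed as `env=` to `subprocess.Popen(["ollama", "serve"], env=...)`. If Ollama runs as a service or in a container, set the same two variables in that service's configuration instead.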

raven valve
#

Hi @cloud patio, have you added both of these environment variables for qwen2.5:7b? What is your K/V cache quantization type? I can't enable either of them, because doing so increases the LLM response time by more than 10 seconds. I do see VRAM usage drop by nearly one-third, but the long wait for results makes it almost unusable.

cloud patio
#

Hm, no, it works just fine for me. It may depend on how well the GPU handles the quantization and flash attention; I think it does put more work on the GPU.

#

I keep my quantization at q8 if I can; quantize more aggressively than that and you start seeing more noticeable losses in precision.
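For a rough sense of what each cache type costs in memory, the KV cache holds one key and one value vector per layer, per KV head, per token. The sketch below estimates its size. The model shape (28 layers, 4 KV heads, head dimension 128) is an assumption for qwen2.5:7b and should be checked against the model card, and the 8192-token context is just an example value. The bytes-per-element figures follow the GGML block layouts: 32 values plus a 2-byte scale per block for `q8_0` and `q4_0`.

```python
# (numerator, denominator) of bytes per cached element for each cache type.
BYTES_PER_ELEMENT = {
    "f16": (2, 1),
    "q8_0": (34, 32),  # 32 int8 values plus one 2-byte scale per block
    "q4_0": (18, 32),  # 32 4-bit values plus one 2-byte scale per block
}


def kv_cache_bytes(context_tokens: int, cache_type: str,
                   n_layers: int = 28, n_kv_heads: int = 4,
                   head_dim: int = 128) -> int:
    """Estimate KV cache size; the default model shape is an assumption."""
    numerator, denominator = BYTES_PER_ELEMENT[cache_type]
    # Factor 2: one key vector and one value vector per token, layer, and head.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * numerator // denominator


for cache_type in BYTES_PER_ELEMENT:
    mib = kv_cache_bytes(8192, cache_type) / 2**20
    print(f"{cache_type}: {mib:.0f} MiB")
```

This counts only the cache, not the model weights, so the change in total VRAM will be smaller than the change in cache size.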