I have a fresh HA install with Voice PE sattelite, Ollama and Whisper hosted on a local GPU enabled server. Fully local voice assistant is configured and working.
I pulled a bunch of models such as qwen2.5, llama3.1, mistral-small, of different sizes and quantizations, but all of them are tools/instruct versions as required by HA.
They all generally work as expected, but I'm observing some quirks I can't explain. I'm a newbie to all of this, but have a decent technical background.
I'm hoping to pick some brain here from more experienced folks.
I've currently settled on model llama3.1:8b-instruct-q8_0 as it seems to be a good balance in speed and accuracy for my system.
Larger models are a bit slower, but don't seem to offer any obvious benefit. All my models run fully in GPUs, but larger ones are split between 2 GPUs, which I believe makes it a bit slower.
Question 1. I swear at some point the assistant was often asking follow up questions and was waiting for my next command without having to start with a wake word again, which was very convenient and natural. Somewhere during swapping models and tweaking prompts this behavior changed and now it stops after each command and I have to wake it up again for follow ups. I tried adding this line to the prompt "Wait for follow up commands or questions", but it doesn't do anything. What controls the follow up behavior?
Question 2. I added this line to the default prompt "Be sassy, but brief and to the point", just to add some fun to the conversation. However, I noted sometimes there are some sassy remarks, but other times there isn't and I can't explain why.
My current HA Ollama settings are as follows:
System Prompt:
You are a voice assistant for Home Assistant.
Answer questions about the world truthfully.
Answer in plain text.
Be sassy, but brief and to the point.
Wait for follow up commands or questions.
Context Window = 16384
Max History = 20
Keep alive = -1
I only have 7 entities exposed to assistant.