I also posted this on the HA subreddit.
Hey everyone,
I’ve got Home Assistant Voice up and running with a Preview Edition voice device on my desk. Everything is processed locally—speech-to-text via FasterWhisper, and text-to-speech via Kokoro (both running in containers outside of HA). The setup is working really well overall, but I’m running into some performance and behavior issues that I think are related to LLM context handling.
For the LLM, I’m using Gemma 3 (27B, tool-trained, 8k context). It’s running on an RTX A5000 (24GB VRAM). The “Assist” checkbox is enabled, and I am getting good responses, but:
Latency: If I say something concise like “Turn off the bedroom lights,” things are fast. But when I say a more natural sentence like “Hey, if there are any lights on in the bedroom, can you turn them off for me?”—the request hits the LLM, and it can take minutes to respond. I suspect it’s due to too much context being passed in each request.
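A rough way to sanity-check that suspicion is to estimate how much of the 8k window a single request might use. This is only a sketch, not something measured on my setup: the 4-characters-per-token figure is a common rule of thumb rather than Gemma's real tokenizer, the 8192 value is my reading of "8k", and the entity lines and function names are made up for illustration.

```python
import math

CONTEXT_WINDOW = 8192  # assumption: "8k context" means 8192 tokens
CHARS_PER_TOKEN = 4    # rough rule of thumb, NOT Gemma's actual tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def context_usage(system_prompt: str, entity_lines: list[str], user_text: str) -> float:
    """Return the estimated fraction of the context window one request uses."""
    total = (
        estimate_tokens(system_prompt)
        + sum(estimate_tokens(line) for line in entity_lines)
        + estimate_tokens(user_text)
    )
    return total / CONTEXT_WINDOW


if __name__ == "__main__":
    # Hypothetical example: 300 exposed entities, one descriptive line each.
    entities = ["light.bedroom_lamp 'Bedroom Lamp' = on; brightness=128; area=Bedroom"] * 300
    usage = context_usage(
        "You are a voice assistant for Home Assistant.",
        entities,
        "Hey, if there are any lights on in the bedroom, can you turn them off for me?",
    )
    print(f"Estimated context used: {usage:.0%}")
```

If the estimate lands near or above 100%, the number of exposed entities is a likely place to start trimming.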
Inconsistent Actions: After a few interactions, the assistant stops reliably performing actions. For example, I’ll ask it to turn off bedroom lights that are on, and it won’t actually do it—even though the same command worked earlier in the session. This seems like it could be another symptom of overloaded context or prompt degradation.
So, two key questions:
• Is there a smaller/faster model than Gemma 3 27B (ideally still with tool-use support) that would be better suited for this kind of natural command interpretation? I’d prefer to keep it local, so I’m not looking to offload to cloud APIs.
• Any tips for managing or trimming the context going to the model to keep things snappy and reduce weird behavior over multiple queries?
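To make the second question concrete, here is a minimal sketch of the kind of trimming I have in mind: keep the system prompt and drop the oldest turns until the conversation fits a token budget. It assumes chat-style messages stored as dicts with "role" and "content" keys and reuses the rough 4-characters-per-token estimate. `trim_history` is a hypothetical helper, not something Home Assistant provides.

```python
from collections import deque

CHARS_PER_TOKEN = 4  # rough rule of thumb, NOT the model's actual tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (ceiling of chars / 4)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep all system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    # Whatever the system prompt does not use is left for conversation turns.
    remaining = budget_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept: deque = deque()
    # Walk backwards from the newest message and stop at the first one
    # that no longer fits, so the kept turns stay contiguous.
    for msg in reversed(others):
        cost = estimate_tokens(msg["content"])
        if cost > remaining:
            break
        kept.appendleft(msg)
        remaining -= cost
    return system + list(kept)
```

Walking from the newest message backwards means the latest request always survives if it fits, and the oldest turns are dropped first.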
I feel like I’m super close to a great fully local voice pipeline—just need a little help fine-tuning performance.
Thanks in advance!