#Looking for Help Optimizing My Local Home Assistant Voice Setup (LLM Context + Speed Issues)

1 messages · Page 1 of 1 (latest)

night meadow
#

I also posted this on the HA subreddit,

Hey everyone,

I’ve got Home Assistant Voice up and running with a Preview Edition voice device on my desk. Everything is processed locally—speech-to-text via FasterWhisper, and text-to-speech via Kokoro (both running in containers outside of HA). The setup is working really well overall, but I’m running into some performance and behavior issues that I think are related to LLM context handling.

For the LLM, I’m using Gemma 3 (27B, tool-trained, 8k context). It’s running on an RTX A5000 (24GB VRAM). The “Assist” checkbox is enabled, and I am getting good responses, but:

Latency: If I say something concise like “Turn off the bedroom lights,” things are fast. But when I say a more natural sentence like “Hey, if there are any lights on in the bedroom, can you turn them off for me?”—the request hits the LLM, and it can take minutes to respond. I suspect it’s due to too much context being passed in each request.

Inconsistent Actions: After a few interactions, the assistant stops reliably performing actions. For example, I’ll ask it to turn off bedroom lights that are on, and it won’t actually do it—even though the same command worked earlier in the session. This seems like it could be another symptom of overloaded context or prompt degradation.

So, two key questions: • Is there a smaller/faster model than Gemma 3 27B (ideally still with tool-use support) that would be better suited for this kind of natural command interpretation? I’d prefer to keep it local, so I’m not looking to offload to cloud APIs. • Any tips for managing or trimming the context going to the model to keep things snappy and reduce weird behavior over multiple queries?
I feel like I’m super close to a great fully local voice pipeline—just need a little help fine-tuning performance.
Thanks in advance!

ripe smelt
night meadow
#

How many things are exposed to your Voice assistant and how quick are your replies? Mine are really slow.

ripe smelt
night meadow
#

Ok I'm definitely doing something wrong since I've got similar gpu horse power, I switched to the 12b Gemma tools model and it helps a bit but still like a minute to respond lol

ripe smelt
#

is it actually using your gpu?

#

watch nvtop when doing a request

night meadow
#

Yea absolutely I keep a close eye on ollama logs

ripe smelt
#

can watch the actual memory/gpu usage on nvtop and see if its constantly high during the request or if its doing something strange

twilit delta
#

Do you have Open WebUI as well? Curious what tokens per second you get generally.

night meadow
ripe smelt
night meadow
#

All layers are in gpu, gpu is ok fast slightly slower than a 3090 probably

#

Anyone know where I can view the input I'm sending the model with my call?

#

I just set up assistant through google generative AI API instead of my Ollama and she was pretty quick!

#

so it's the model, probably getting loaded with lots of context making it much slower than 8-9 tk/s

#

The google API obviously has the horsepower where that's not a problem

#

I'm using like 10,000 freaking tokens for my calls, this was like 2 calls to gemini

#

That's what's killing my speed on ollama, the context size of the calls, I must be doing something wrong, I only have like 30 or 40 devices exposed to assistant

#

☝️ Yeah its like 10K token requests, how is that normal lmao

ripe smelt
#

on the ollama integration you can hit "configure" and set the context window size

night meadow
#

True but if it needs 8-10k tokens to get the picture it needs of my home to make a decision, and I cap it at 4K well that's not going to work no?

ripe smelt
#

what do you have it at currently? I have mine at 32k (32768)

night meadow
#

This is me asking the weather and with ollama debug logs on, nothing crazy but still like 5,000 tokens

night meadow
#

I actually have a RTX 5000 ADA (32GB) in a machine right now but I can't commit it to being on 24/7 for this use case and it'd be kinda a waste of something so expensive. It'd be on machine that has faster ram and cpu too so tk/s would be much quicker...hmmmm

#

Perhaps run 2 ollamas on 2 different machines and use the fast one when I can. I can set an automation to switch voice assistants via a voice command and just use the fastest one available at the time.

twilit delta
#

You could use a ping sensor to detect when a server is running and use that to switch preemptively. You may also have to reload the Ollama integrations. That can also be done when the ping sensor switches.

#

I do something like this myself, but one Ollama server and using Wake on LAN.

night meadow
#

I set up the RTX 5000 Ada with its own ollama docker and it is on a more modern machine and it is much much faster

#

I think the 24gb rtx5000 (non ada) being on older server e52680 v2 CPUs and slow ecc ram plus pcie 3 8x etc is making it sluggish. The 32gb ada card is like a detuned 4090 and is on 13 gen Intel and fast ram etc so it's super fast.

twilit delta
#

That's great! My 16Gb 5070 Ti gets about 21 tps with Gemma3 (using Ollama 0.6.8, I think 0.7+ has some improvements for Gemma3).

It seems that Gemma3 is just slower than Qwen3. Using the 14b model I get 2-5x that. Even on CPU, I tend to get closer to 5-10tps on an AMD 9950X3D..

#

I've got my Home Assistant using Qwen3 for most things but then Gemma3 for vision.

twilit delta
#

12b. I have a lot less vram. Heh.

Trying to find the right model size to balance with context.