Local llm response speed | Home Assistant | Page 1

distant vine Dec 29, 2025, 4:53 PM

#

I am trying out some local llm integration with HA and voice PE. I have noticed that the time it takes to do things is incredibly long (sometimes 1 minute to turn on a switch).

So far I have tested the local llm by itself (using ollama) to see the token generation speed and response time, and it is pretty quick, with responses within 1-2 seconds.

I am not quite sure where the bottleneck might be in the stt -> ollama -> ha action -> tts pipeline, so I want a way to look at all the steps in order to try and figure out what is happening.

So should I just be looking to set up a debug log or trace so I can see all the input and output that is being generated at each step? If so, how would I do that?
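The standalone Ollama check described above can be made a little more precise by separating prompt processing from token generation. A minimal sketch, assuming a non-streaming response from Ollama's `/api/generate` or `/api/chat` endpoint has already been parsed into a dict; the timing fields it reads (`total_duration`, `load_duration`, `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`) are reported by Ollama in nanoseconds. The helper name `ollama_speeds` and the sample numbers are made up for illustration and are not measurements from this thread.

```python
def ollama_speeds(resp: dict) -> dict:
    """Convert Ollama timing fields (nanoseconds) into seconds and tokens/sec."""
    ns = 1e9
    prompt_s = resp.get("prompt_eval_duration", 0) / ns
    gen_s = resp.get("eval_duration", 0) / ns
    return {
        "load_s": resp.get("load_duration", 0) / ns,
        "prompt_s": prompt_s,
        "gen_s": gen_s,
        "total_s": resp.get("total_duration", 0) / ns,
        # Guard against division by zero when a field is missing.
        "prompt_tok_per_s": resp.get("prompt_eval_count", 0) / prompt_s if prompt_s else 0.0,
        "gen_tok_per_s": resp.get("eval_count", 0) / gen_s if gen_s else 0.0,
    }


# Placeholder values for illustration only, not real measurements.
sample = {
    "total_duration": 6_000_000_000,
    "load_duration": 500_000_000,
    "prompt_eval_count": 4000,
    "prompt_eval_duration": 4_000_000_000,
    "eval_count": 30,
    "eval_duration": 1_500_000_000,
}
speeds = ollama_speeds(sample)
print(speeds)
```

A short chat prompt typed by hand has very few prompt tokens, so it can look fast even when the much longer prompt sent by Home Assistant would spend most of its time in the prompt stage.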

dusky marlin Dec 29, 2025, 5:25 PM

#

distant vine I am trying out some local llm integration with HA and voice PE. I have noticed ...

on the voice assistant settings page you can click the 3 dot menu for your pipeline and select debug. this will show you the trace of the voice pipeline call and the timing of the steps.
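If the timings from the debug view are copied out, the per-stage durations can be tallied with a few lines of code. This is only a sketch: it assumes each event is a dict with a `type` such as `stt-start` or `stt-end` and an ISO 8601 `timestamp`, which mirrors the stage naming used by the Assist pipeline as far as I know, but the exact event shape and the sample timestamps here are assumptions for illustration.

```python
from datetime import datetime


def stage_durations(events: list[dict]) -> dict[str, float]:
    """Pair '<stage>-start' with '<stage>-end' events and return seconds per stage."""
    starts: dict[str, datetime] = {}
    durations: dict[str, float] = {}
    for event in events:
        # "stt-start" -> stage "stt", phase "start"
        stage, _, phase = event["type"].rpartition("-")
        when = datetime.fromisoformat(event["timestamp"])
        if phase == "start":
            starts[stage] = when
        elif phase == "end" and stage in starts:
            durations[stage] = (when - starts[stage]).total_seconds()
    return durations


# Hypothetical trace, not taken from a real pipeline run.
events = [
    {"type": "run-start", "timestamp": "2025-12-29T18:00:00.000+00:00"},
    {"type": "stt-start", "timestamp": "2025-12-29T18:00:00.100+00:00"},
    {"type": "stt-end", "timestamp": "2025-12-29T18:00:01.300+00:00"},
    {"type": "intent-start", "timestamp": "2025-12-29T18:00:01.300+00:00"},
    {"type": "intent-end", "timestamp": "2025-12-29T18:00:04.800+00:00"},
    {"type": "tts-start", "timestamp": "2025-12-29T18:00:04.800+00:00"},
    {"type": "tts-end", "timestamp": "2025-12-29T18:00:05.200+00:00"},
    {"type": "run-end", "timestamp": "2025-12-29T18:00:05.200+00:00"},
]
durations = stage_durations(events)
for stage, seconds in durations.items():
    print(f"{stage}: {seconds:.1f}s")
```

This only covers what the trace records. Audio playback on the device presumably happens after the pipeline run has finished, so a delay there would not show up in these numbers.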

distant vine Dec 29, 2025, 5:28 PM

#

dusky marlin on the voice assistant settings page you can click the 3 dot menu for your pipel...

I had checked there before but the debug never enabled and did not produce the expected log file for download. Most likely user error on my part so will try again.

Is the expectation that the debug log would also include the full text being sent to the llm and the full response returned?

dusky marlin Dec 29, 2025, 5:29 PM

#

distant vine I had checked there before but the debug never enabled and did not produce the e...

#

it's not something you "enable", it's just a trace log of pipeline calls.

distant vine Dec 29, 2025, 5:31 PM

#

dusky marlin it's not something you "enable", it's just a trace log of pipeline calls.

Oh, ok let me take a closer look. I was probably looking in the wrong place then. Thank you

dusky marlin Dec 29, 2025, 5:32 PM

#

distant vine Oh, ok let me take a closer look I was probably looking in the wrong place then....

distant vine Dec 29, 2025, 6:22 PM

#

I am seeing the debug log now and it makes some sense but does not show the full text that is being sent into the llm. I will have to poke around some more to see if I can get that on the ollama side.

It's interesting: according to the debug, the time it all takes for simple commands is ~5 seconds, but it is hit or miss whether the resulting text is announced through the PE. Maybe that is the part that takes up to 1 minute. I will have to check that out.

Also noticed from your screenshot that you too have a GLaDOS voice. I am actively working on that as well. I just can't quite figure out why it doesn't always output the audio on the voice PE.

dusky marlin Dec 29, 2025, 6:26 PM

#

distant vine I am seeing the debug log now and it makes some sense but does not show the full...

expand the green bit at the top for the full prompt.

#

also, how many entities are you exposing?

#

limiting how many you expose can help speed/accuracy

#

also what model are you running? different models will perform differently.

distant vine Dec 29, 2025, 6:41 PM

#

dusky marlin also, how many entities are you exposing?

Exposing 163 entities
I will try limiting that further

For the model I have tried a few different ones. Currently using Qwen2.5 7b, have also tried qwen 3 0.6b, qwen 3 14b, gemma 3 4b, gemma 3 1b. All with mixed results, but I think now that you pointed out a way to more clearly evaluate what is happening I can keep everything else the same and swap out the models to see what that does to performance.

The machine running the models is dedicated to just doing that so I leave the model loaded into ram pretty much all the time which helps on the llm processing side, but the main issue is the tts and the actual voice that is hit or miss. At least that is what I am noticing now from the debug view so at the very least I have a place to focus my attention.

dusky marlin Dec 29, 2025, 6:49 PM

#

distant vine Exposing 163 entities I will try limiting that further For the model I have tri...

try ensuring the VPE is up to date and try giving it a reboot. that sometimes helps random issues. also try standard piper voices instead of the custom. i mostly use stock ones. except the pipeline demo'ed above.

distant vine Dec 29, 2025, 6:56 PM

#

dusky marlin try ensuring the VPE is up to date and try giving it a reboot. that sometimes he...

I will give it a quick update. Although not sure how to make that happen. I am just about at the edge of my comfort zone: to make the glados voice work I had to "take control", so I am not sure how updating will go, but if it comes to it I can just redo everything.

dusky marlin Dec 29, 2025, 7:09 PM

#

distant vine I will give it a quick update. Although not sure how to make that happen. I am j...

you absolutely do not have to "take control" of the esphome firmware to make a custom TTS voice work.

#

unless you really need to customise the firmware you should leave it stock.

distant vine Dec 29, 2025, 7:18 PM

#

oh then maybe I did it wrong. I just could not find another way to have a custom voice model as an option. Did I do it entirely wrong then?

Edit:
I was looking up how to do that through documentation and other resources and the only way I found to work was the take control route

dusky marlin Dec 29, 2025, 7:18 PM

#

distant vine oh then maybe I did it wrong. I just could not find another way to have a custom...

TTS is not done by the device. it just outputs the audio file sent to it which is generated by piper (or another service).

distant vine Dec 29, 2025, 7:20 PM

#

dusky marlin TTS is not done by the device. it just outputs the audio file sent to it which i...

I think it is definitely a skill issue on my part in that case. I will reformat the device, add it again, and leave it managed by the default firmware, which can be updated more easily, and see if my custom voices stay available

dusky marlin Dec 29, 2025, 7:20 PM

#

custom wake words can require taking control and adding stuff. but tts voices are separate and are managed within HA

#

reflashing by plugging into a desktop and using THIS tool, following THIS guide, is the best way to return to stock firmware.

Home Assistant Voice PE

Install firmware on Home Assistant Voice PE.

Nabu Casa

Reinstalling the firmware on Home Assistant Voice Preview Edition

It is not usually necessary to reinstall the firmware. Only follow this procedure if you have a good reason to do so. Normally, you receive an update notification for the Home Assistant Voice Previ...

distant vine Dec 29, 2025, 7:26 PM

#

I did want to use some custom wakewords so that was the reason. Sorry, I had forgotten about that part of things. I think it would be great if there was a way to do custom wakewords without taking control, as it seems like one of the first things many people would want to do, but I am unsure if that is possible or even a planned feature.

#

For now I will probably reinstall the firmware based on the links you sent. Update it to the latest and then take control again in order to use my custom wakewords

#

Ultimate goal is to replace any amazon devices with 100% local stuff, but others are just more used to specific wake words, so that has to remain as an option, and the only way to accomplish that was to train a custom word and use that. But you are pointing out some solid stuff that is super helpful, and it is definitely appreciated because I have just been trying stuff and hoping for the best

dusky marlin Dec 29, 2025, 7:31 PM

#

distant vine For now I will probably reinstall the firmware based on the links you sent. Upda...

setting up custom wake words is not an easy thing to do. I would suggest leaving that for now and working on other stuff.
there is a plan that in the future you will be able to load custom wake words without taking control. it's a task in the backlog
i suggest moving back to stock, not taking control, and focusing on getting things running the way you want/need before trying to customise wake words.

distant vine Dec 29, 2025, 7:37 PM

#

dusky marlin setting up custom wake words is not an easy thing to do. I would suggest leavin...

That makes sense. I might have gone about it backwards. I did the custom wakeword part first. Currently that is the one super reliable thing as I got it working and have wakewords that are well trained and function as expected, but you definitely have a point especially when it comes to doing updates. That was something I was not quite sure how I was going to handle.

I will definitely focus on the current issue of the TTS speed and it taking a really long time to play the audio if it plays at all. That is the current thing that is confusing me, but I have a few ideas now that the debug stuff makes sense.

still brook Dec 30, 2025, 12:13 AM

#

@distant vine what gpu are you running that on?

#

I have messed with this too. I was able to get fast responses when doing it locally outside of home assistant, and even without the control enabled, with good results, but found that my gpu just did not have what it takes to process all of the context needed to control objects within a good time frame

#

I switched to just pumping it through ChatGPT for now, with fairly reasonable response times, especially when most of the commands are processed locally. It is pretty dang cheap running the 4o mini model, and I have only used like 10 cents worth of tokens in the last month

#

I haven't found wake word to be too bad to implement, the training environments are the things I have had issues with. that being said, if you are running micro wake word I'd be happy to assist with some wake word training, as I have an environment that I have done all of the headache to get working and would happily lend a hand

distant vine Dec 30, 2025, 5:37 AM

#

still brook I haven't found wake word to be too bad to implement, the training environments ...

That is wonderful to know, thanks. I found a Google Colab that was working for a while and used that to train the 2 wake words I would need, and then found a few others on Hugging Face and just kind of stuck to those for now. I think the hesitation I have with the wakeword is like I was mentioning before. It requires taking control, which is not all that big of a deal, but in my opinion seems like an oversight given the nature of the device and the audience the device is made for. The community is literally the people who want to customize everything, so having that not be part of it from day one without straying off the proverbial path is an interesting choice. But I am sure I do not understand the reasons behind it, and it's simple enough for me to complain about a feature not being built in if I am not the one making it functional, so I try to take my own opinion with a grain of salt.

distant vine Dec 30, 2025, 5:43 AM

#

still brook @distant vine what gpu are you running that on?

I am running on an m4 mac mini with 32gb of ram. It is only running ollama and nothing else, so there are minimal processes, and dumping all the resources into the model helps out a lot. It's not lightning fast at everything but it's fast enough for my purposes, as I have other things that I use the llm for, so it worked out.
I have thought about just paying for chatgpt or claude, but I am stubborn and wanted to jump in a little bit and learn more about what needs to happen to make things run faster and basically just tinker with things as it is interesting for the moment to have the tools and try to figure out what works and how to make it better. I would say I am at the baby homelab stage and trying desperately to keep it under wraps but I can see it going overboard very soon and my meager bank account already dislikes the future version of me.

ancient moat Dec 31, 2025, 12:15 PM

#

Just to jump in on the GPU front, if memory serves, the apple silicon GPUs suffer quite a bit at the Prompt Processing stage, so having 160 entities exposed is probably having a more substantial impact than it would on a dGPU.

Have you tested the response speed if you un-check the assist functionality in the ollama integration?
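The effect of the entity count on prompt processing can be estimated roughly before changing anything. A back-of-the-envelope sketch: every number below (tokens per exposed entity, base prompt size, prompt processing speed) is a placeholder assumption rather than a measurement, and `estimate_prompt_seconds` is a hypothetical helper. The real prompt speed would come from your own Ollama timing output.

```python
def estimate_prompt_seconds(
    entities: int,
    tokens_per_entity: float,
    base_prompt_tokens: int,
    prompt_tok_per_s: float,
) -> float:
    """Rough prompt processing time: total prompt tokens / prompt speed."""
    total_tokens = base_prompt_tokens + entities * tokens_per_entity
    return total_tokens / prompt_tok_per_s


# Placeholder assumptions: 25 tokens per entity, 500 tokens of instructions,
# 300 prompt tokens per second. Replace these with your own measurements.
for count in (163, 90, 30):
    seconds = estimate_prompt_seconds(count, 25, 500, 300)
    print(f"{count} entities: about {seconds:.1f}s of prompt processing")
```

The estimate is linear in the entity count, so halving the exposed entities roughly halves the entity portion of the prompt time, while the fixed instruction portion stays the same.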

teal geyser Jan 1, 2026, 4:40 PM

#

distant vine I am running on an m4 mac mini with 32gb of ram. It is only running ollama and no...

Hey 🙂 What kind of LLMs do you run on the m4? How good is the quality of the responses compared to commercial providers?

distant vine Jan 1, 2026, 5:05 PM

#

ancient moat Just to jump in on the GPU front, if memory serves, the apple silicon GPUs suffe...

I have reduced the exposed items to 90. Might reduce it further. I wanted to see how things would work with more entities and then reduce it to mark improvements.

summer geode Jan 1, 2026, 7:07 PM

#

distant vine I am trying out some local llm integration with HA and voice PE. I have noticed ...

try your questions using assist and verify whether it's the speech parts or the search.

distant vine Jan 1, 2026, 11:22 PM

#

teal geyser Hey 🙂 What kind of LLMs do you run on the m4? How good is the quality of the re...

I am using a range of models from qwen3 0.6b up to gemma 3 27b.

The quality is basically a crapshoot on the small models (as expected). On the large models it is nicer, but the tradeoff is the loading time. Somewhere in the middle it's pretty reliable. At present it isn't as reliable as the commercial providers; there is just no way to really compete with them on the same grounds, but I am not evaluating in those terms, so it is hard to say which is better. It is a matter of what you want to achieve, but objectively speaking the commercial models are always going to be "faster"

distant vine Jan 1, 2026, 11:23 PM

#

summer geode try your questions using assist and verify whether it's the speech parts or the ...

I am not familiar with that setting I will need to look around to find it and will definitely give it a try.

I have been noticing over the past day or two that I always get an appropriate text response but the TTS part doesn't always trigger. It is quite confusing.

summer geode Jan 2, 2026, 2:17 AM

#

distant vine I am not familiar with that setting I will need to look around to find it and wi...

From any of your dashboards it’s the square box in the upper right hand corner. If it’s slow responding to that, look into your ollama conversation agent. Try using Open WebUI too. That will eliminate HA from the equation completely. Those won’t fix anything but they might help focus your troubleshooting.

summer geode Jan 2, 2026, 2:21 AM

#

distant vine I am not familiar with that setting I will need to look around to find it and wi...

I ran into a problem if the tts was already saying something it didn’t say the second thing at all. Even though it thought it did. So any of my automation text goes through a loop now (max 10 times), delaying 10 seconds each loop, until I check the status of the PE and it’s not talking anymore. Then I send the next message.
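The wait-and-retry loop described above could look roughly like the following. In Home Assistant this would normally live in a script or automation; the Python here only sketches the logic. `is_speaking` and `send` are hypothetical callbacks standing in for "check the PE status" and "send the TTS message", and the defaults of 10 checks and 10 seconds follow the description.

```python
import time
from typing import Callable


def wait_until_idle(
    is_speaking: Callable[[], bool],
    max_checks: int = 10,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the device stops speaking; return True if it became idle."""
    for _ in range(max_checks):
        if not is_speaking():
            return True
        sleep(delay_s)
    # One final check after the last delay.
    return not is_speaking()


def announce_when_idle(
    message: str,
    is_speaking: Callable[[], bool],
    send: Callable[[str], None],
    **wait_kwargs,
) -> bool:
    """Send the message only once the device is idle; report whether it was sent."""
    if wait_until_idle(is_speaking, **wait_kwargs):
        send(message)
        return True
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Returning `False` instead of sending anyway makes a dropped announcement visible to the caller.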

distant vine Jan 2, 2026, 4:02 PM

#

summer geode I ran into a problem if the tts was already saying something it didn’t say the s...

For me it isn’t a matter of interrupting what is being said. Even when doing an inflection thing or asking a single question I can see (when using debug) that the process completes and the response goes through, but the speech part does not happen.

summer geode Jan 2, 2026, 4:16 PM

#

distant vine For me it isn’t a matter of interrupting what is being said. Even when doing an ...

Yea that started to happen to me when I upgraded to 2025.12.4. I've seen several other messages about it when I googled the errors I was getting and it seems like it's a bug that has just surfaced over the past few upgrades. Hopefully someone is working on it.
