# Local LLM response speed

distant vine
#

I am trying out local LLM integration with HA and the Voice PE. I have noticed that actions take an incredibly long time (sometimes 1 minute to turn on a switch).

So far I have tested the local LLM on its own (using Ollama) to check token generation speed and response time, and it is pretty quick, with responses within 1-2 seconds.
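For anyone wanting to repeat that standalone check, here is a minimal sketch. It assumes a default Ollama install listening on `http://localhost:11434` and relies on the timing fields that a non-streaming `/api/generate` response reports in nanoseconds. The function names, the model tag, and the example prompt are placeholders, not anything from the thread. Splitting prompt evaluation from generation matters because a short test prompt can look fast while a long Home Assistant prompt does not.

```python
import json
import urllib.request

NS_PER_SECOND = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize_ollama_metrics(resp: dict) -> dict:
    """Convert the timing fields of an Ollama response into seconds and tokens/s."""

    def seconds(key: str) -> float:
        return resp.get(key, 0) / NS_PER_SECOND

    def rate(count_key: str, duration_key: str) -> float:
        duration = seconds(duration_key)
        return resp.get(count_key, 0) / duration if duration > 0 else 0.0

    return {
        "load_s": seconds("load_duration"),
        "prompt_tokens": resp.get("prompt_eval_count", 0),
        "prompt_tok_per_s": rate("prompt_eval_count", "prompt_eval_duration"),
        "gen_tokens": resp.get("eval_count", 0),
        "gen_tok_per_s": rate("eval_count", "eval_duration"),
        "total_s": seconds("total_duration"),
    }


def query_ollama(prompt: str, model: str = "qwen2.5:7b",
                 host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request (model tag and host are assumptions)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


# Example usage (needs a running Ollama server):
#   print(summarize_ollama_metrics(query_ollama("Turn on the kitchen light.")))
```

A low `prompt_tok_per_s` with a healthy `gen_tok_per_s` would point at prompt size rather than generation as the slow part.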

I am not quite sure where the bottleneck might be in the STT -> Ollama -> HA action -> TTS pipeline, so I want a way to look at each step and figure out what is happening.

So should I just be looking to set up a debug log or trace so I can see the input and output generated at each step? If so, how would I do that?
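One way to turn a pipeline trace into a per-stage breakdown is sketched below. It assumes each event can be copied out as a dict with a `type` such as `stt-start` or `tts-end` and an ISO 8601 `timestamp`; that shape is an assumption about the debug output, and `stage_durations` is a hypothetical helper, not part of Home Assistant.

```python
from datetime import datetime


def stage_durations(events: list[dict]) -> dict[str, float]:
    """Pair '<stage>-start' with '<stage>-end' events and return seconds per stage."""
    started: dict[str, datetime] = {}
    durations: dict[str, float] = {}
    for event in events:
        # "stt-start" -> stage "stt", phase "start"; other event types are ignored.
        stage, _, phase = event["type"].rpartition("-")
        when = datetime.fromisoformat(event["timestamp"])
        if phase == "start":
            started[stage] = when
        elif phase == "end" and stage in started:
            durations[stage] = (when - started.pop(stage)).total_seconds()
    return durations


# Made-up timestamps purely for illustration.
example_events = [
    {"type": "run-start", "timestamp": "2025-01-01T12:00:00+00:00"},
    {"type": "stt-start", "timestamp": "2025-01-01T12:00:00+00:00"},
    {"type": "stt-end", "timestamp": "2025-01-01T12:00:01.500000+00:00"},
    {"type": "intent-start", "timestamp": "2025-01-01T12:00:01.500000+00:00"},
    {"type": "intent-end", "timestamp": "2025-01-01T12:00:04.500000+00:00"},
    {"type": "tts-start", "timestamp": "2025-01-01T12:00:04.500000+00:00"},
    {"type": "tts-end", "timestamp": "2025-01-01T12:00:05+00:00"},
    {"type": "run-end", "timestamp": "2025-01-01T12:00:05+00:00"},
]

for stage, secs in stage_durations(example_events).items():
    print(f"{stage:>8}: {secs:.1f} s")
```

A breakdown like this only covers what the pipeline itself reports. Time spent fetching and playing the audio on the device may fall outside it, so a short pipeline run and a long wait at the speaker are not necessarily contradictory.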

dusky marlin
#

it's not something you "enable", it's just a trace log of pipeline calls.

distant vine
#

I am seeing the debug log now and it makes some sense, but it does not show the full text that is being sent to the LLM. I will have to poke around some more to see if I can get that on the Ollama side.

It's interesting: according to the debug view, simple commands take ~5 seconds in total, but it is hit or miss whether the resulting text is announced through the PE. Maybe that is the part that takes up to 1 minute. I will have to check that out.

Also noticed from your screenshot that you too have a GLaDOS voice. I am actively working on that as well. I just can't quite figure out why the audio doesn't always play on the Voice PE.

dusky marlin
#

also, how many entities are you exposing?

#

limiting how many you expose can help speed/accuracy

#

also what model are you running? different models will perform differently.

distant vine
# dusky marlin also, how many entities are you exposing?

Exposing 163 entities
I will try limiting that further

For the model, I have tried a few different ones. Currently using Qwen2.5 7B; I have also tried Qwen3 0.6B, Qwen3 14B, Gemma 3 4B, and Gemma 3 1B. All with mixed results, but now that you have pointed out a way to evaluate more clearly what is happening, I can keep everything else the same and swap out the models to see what that does to performance.

The machine running the models is dedicated to just that, so I leave the model loaded in RAM pretty much all the time, which helps on the LLM processing side. The main issue is the TTS and the actual voice output, which is hit or miss. At least that is what I am noticing now from the debug view, so at the very least I have a place to focus my attention.

dusky marlin
#

unless you really need to customise the firmware you should leave it stock.

distant vine
#

oh then maybe I did it wrong. I just could not find another way to have a custom voice model as an option. Did I do it entirely wrong then?

Edit:
I was looking up how to do that through documentation and other resources and the only way I found to work was the take control route

dusky marlin
#

custom wake words can require taking control and adding stuff, but TTS voices are separate and are managed within HA

distant vine
#

I did want to use some custom wake words, so that was the reason. Sorry, I had forgotten about that part. I think it would be great if there were a way to use custom wake words without taking control, as it seems like one of the first things many people would want to do, but I am unsure if that is possible or even a planned feature.

#

For now I will probably reinstall the firmware based on the links you sent, update it to the latest version, and then take control again in order to use my custom wake words

#

The ultimate goal is to replace any Amazon devices with 100% local stuff, but others are just more used to specific wake words, so that has to remain an option, and the only way to accomplish that was to train a custom word and use that. You are pointing out some solid stuff that is super helpful, and it is definitely appreciated, because I have just been trying stuff and hoping for the best

dusky marlin
# distant vine For now I will probably reinstall the firmware based on the links you sent. Upda...

setting up custom wake words is not an easy thing to do. I would suggest leaving that for now and working on other stuff.
there is a plan that in the future you will be able to load custom wake words without taking control. it's a task in the backlog
i suggest moving back to stock, not taking control, and focusing on getting things running the way you want/need before trying to customise wake words.

distant vine
# dusky marlin setting up custom wake words is not an easy thing to dop. I would suggest leavin...

That makes sense. I might have gone about it backwards. I did the custom wakeword part first. Currently that is the one super reliable thing as I got it working and have wakewords that are well trained and function as expected, but you definitely have a point especially when it comes to doing updates. That was something I was not quite sure how I was going to handle.

I will definitely focus on the current issue of the TTS speed and it taking a really long time to play the audio if it plays at all. That is the current thing that is confusing me, but I have a few ideas now that the debug stuff makes sense.

still brook
#

@distant vine what gpu are you running that on?

#

I have messed with this too and was able to get fast responses when running it locally outside of Home Assistant, and even inside it with control disabled. But I found that my GPU just did not have what it takes to process all of the context needed to control objects within a reasonable time

#

I switched to just pumping it through ChatGPT for now, with fairly reasonable response times, especially since most of the commands are processed locally. It is pretty dang cheap running the 4o mini model: I have only used like 10 cents worth of tokens in the last month

#

I haven't found wake words to be too bad to implement; the training environments are what I have had issues with. That being said, if you are running microWakeWord I'd be happy to assist with some wake word training, as I have an environment that I have already gone through all of the headache to get working, and I would happily lend a hand

distant vine
# still brook I haven't found wake word to be to bad to implement, the traininng environments ...

That is wonderful to know, thanks. I found a Google Colab notebook that was working for a while and used that to train the 2 wake words I would need, then found a few others on Hugging Face and just kind of stuck to those for now. I think the hesitation I have with the wake word is what I was mentioning before. It requires taking control, which is not all that big of a deal, but in my opinion it seems like an oversight given the nature of the device and the audience it is made for. The community is literally the people who want to customize everything, so not having that available from day one without straying off the proverbial path is an interesting choice. But I am sure I do not understand the reasons behind it, and it's easy for me to complain about a feature not being built in when I am not the one making it functional, so I try to take my own opinion with a grain of salt.

distant vine
# still brook @distant vine what gpu are you running that on?

I am running on an M4 Mac mini with 32 GB of RAM. It is only running Ollama and nothing else, so there are minimal processes, and dumping all the resources into the model helps out a lot. It's not lightning fast at everything, but it's fast enough for my purposes, and I have other things that I use the LLM for, so it worked out.
I have thought about just paying for ChatGPT or Claude, but I am stubborn and wanted to jump in a little and learn more about what needs to happen to make things run faster. Basically I just want to tinker, as it is interesting for the moment to have the tools and try to figure out what works and how to make it better. I would say I am at the baby homelab stage and trying desperately to keep it under wraps, but I can see it going overboard very soon, and my meager bank account already dislikes the future version of me.

ancient moat
#

Just to jump in on the GPU front: if memory serves, Apple silicon GPUs suffer quite a bit at the prompt processing stage, so having 160 entities exposed is probably having a more substantial impact than it would on a dGPU.

Have you tested the response speed if you uncheck the Assist functionality in the Ollama integration?
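A rough back-of-the-envelope sketch of why the entity count matters: every exposed entity adds tokens to the prompt, and the whole prompt has to be processed before the first output token appears. All the numbers below (tokens per entity, base prompt size, prompt processing rate) are made-up placeholders, and `estimate_prompt_seconds` is a hypothetical helper. Measure your own values before drawing conclusions.

```python
def estimate_prompt_seconds(entities: int, tokens_per_entity: int,
                            base_tokens: int, prompt_tok_per_s: float) -> float:
    """Estimate the time spent on prompt processing before generation starts."""
    prompt_tokens = base_tokens + entities * tokens_per_entity
    return prompt_tokens / prompt_tok_per_s


# Placeholder assumptions: 20 tokens per entity, 500 tokens of instructions,
# and a prompt processing rate of 400 tokens/s.
for count in (163, 80, 40):
    secs = estimate_prompt_seconds(count, 20, 500, 400.0)
    print(f"{count:>3} entities -> about {secs:.1f} s of prompt processing")
```

Under these assumptions the cost grows linearly with the number of exposed entities, which is why trimming the list can help more on hardware with slower prompt processing.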

distant vine
# teal geyser Hey 🙂 What kind of LLMs do you run on the m4? How good is the quality of the re...

I am using a range of models, from Qwen3 0.6B up to Gemma 3 27B.

The quality is basically a crapshoot on the small models (as expected). On the large models it is nicer, but the tradeoff is the loading time. Somewhere in the middle it's pretty reliable. At present it isn't as reliable as the commercial providers, and there is just no way to really compete with them on the same grounds, but I am not evaluating in those terms, so it is hard to say which is better. It is a matter of what you want to achieve, but objectively speaking the commercial models are always going to be "faster"
