Ollama llama3.1 understands question but action is not executed | Home Assistant | Page 1

dreamy patio Mar 17, 2025, 11:25 PM

#

I've been trying to get a voice assistant based on llama3.1 to work but I can't get it to control any of my devices. I have HA running on a NAS and an ollama server running on a rig with RTX3090. The model is running on <IP>:11434 and when assist is ticked off I get a response I just the way I expect from a conversation agent. However when I turn assist on the speech reponse I get is something like the following.

Query: "I'd like to watch tv"
response: {"name": "MediaTurnOn", "parameters": {"area": "Woonkamer, Living room", "domain": ["media_player"], "name": "TV Woonkamer"}}.

It clearly has information about the entities I exposed but I expect a response more in the lines of: "I've turned on the TV for you" and it then actually happening.
After looking in the debug log I see that the speech indeed has the mediaplayer turn on command but the data target is empty. Is this correct behaviour and am I missing something obvious?

Raw debug response is attached to the post.

I've already tried the following:
I picked the llama3.1 model because it is the one that is suggested on the HA website but I have similar behaviour with other models that support tools.
I've reduced the amount of exposed entities to 25 but it makes no difference. Even with 100 entities exposed the response I get is snappy and from the information I do get back it seems to pick a correct action (or at least the right intent to said action). I've also tried to increase the context window size but that made no difference either.

I've also removed my custom sentences and tried with prefer local processing off and on. The local processing seems to work but I've turned it off for debugging purposes right now.
Hopefully someone can point me in the correct direction. I feel like I'm 95% of the way there.

📎 message.txt

tiny python Mar 18, 2025, 2:04 AM

#

Llama 3.1 is pretty low on the leaderboard for HA these days: https://github.com/allenporter/home-assistant-datasets/tree/main/reports

GitHub

home-assistant-datasets/reports at main · allenporter/home-assistan...

This package is a collection of datasets for evaluating AI Models in the context of Home Assistant. - allenporter/home-assistant-datasets

#

Seems if you have the vram, qwq would be the best local model at the moment

tiny python Mar 18, 2025, 2:34 AM

#

The responses look like the model attempting tool calling but not formatting it quite correctly and resulting in it coming back as a reply instead of a command. I've seen this happen when the model is either not smart enough to properly call the tools, or if the context window is too small to hold all the prompt, tool, and entity data as well as conversation history.

late shadow Mar 18, 2025, 8:56 AM

#

tiny python Seems if you have the vram, qwq would be the best local model at the moment

isn't that a reasoning model?

tiny python Mar 18, 2025, 11:01 AM

#

Yes, but if you look at the leaderboard it is one of the top performing models for HA activities

late shadow Mar 18, 2025, 5:22 PM

#

tiny python Yes, but if you look at the leaderboard it is one of the top performing models f...

I'm not sure I understand. If I ask my voice assistant for the weather, I don't want to listen to its thinking tokens (maybe they can be filterered out?) or wait 30 seconds for a response

tiny python Mar 18, 2025, 5:22 PM

#

Think it has been coded to exclude the thinking stuff from response

#

I mean people are using qwq, it's on the leaderboard so someone has tested 🤷

late shadow Mar 18, 2025, 5:23 PM

#

Makes sense, but it will still be like 10x slower

tiny python Mar 18, 2025, 5:23 PM

#

Yeah that's the balancing act

#

smaller models may be faster, but they won't be as smart

late shadow Mar 18, 2025, 5:23 PM

#

The leaderboard is a synthetic benchmark, not user reports. It doesn't take into account response time

tiny python Mar 18, 2025, 5:23 PM

#

I've tested several on the leaderboard, such as Qwen2.5, and it is decently fast on my hardware

#

but thing is exactly that, it really depends on the hardware

late shadow Mar 18, 2025, 5:24 PM

#

tiny python smaller models may be faster, but they won't be as smart

I'm saying that it's slow to the point of being 100% useless. I don't want to wait for more than 3 seconds for a response, and with qwq I'll have to wait fore like 30

tiny python Mar 18, 2025, 5:24 PM

#

Like for me Qwen responded in like a second or less, someone with a 3090 is gonna probably get it twice as fast as me, someone with an older card like a p40 will probabyl be twice as slow as me, etc

late shadow Mar 18, 2025, 5:24 PM

#

tiny python I've tested several on the leaderboard, such as Qwen2.5, and it is decently fast...

sure, I'm using Qwen 2.5 myself. the question was about using a reasoning model. imagine if qwen produced 10x the amount of text for you (would take 10x longer to respond)

tiny python Mar 18, 2025, 5:25 PM

#

even running gpt 4o-mini that can sometimes take a second or two to reply

#

Yeah I get ya

late shadow Mar 18, 2025, 5:25 PM

#

tiny python even running gpt 4o-mini that can sometimes take a second or two to reply

again, taking a second or two to reply is fine. taking 30 is not, it's 100% useless at that point

tiny python Mar 18, 2025, 5:25 PM

#

all I am saying is, llama 3.1 is the worst performing of the bunch at this point

#

I see a lot of people using qwen2.5 7b or 14b

#

some are trying out gemma3 with tool calling added as well

late shadow Mar 18, 2025, 5:26 PM

#

there doesn't really exist hardware you can run locally reasonably that would produce a response from qwq in 2 seconds

tiny python Mar 18, 2025, 5:26 PM

#

I mean a 5090 would probably do it, but they are hard to come by and of course pricy 😅

late shadow Mar 18, 2025, 5:26 PM

#

tiny python I mean a 5090 would probably do it, but they are hard to come by and of course p...

no, it won't

#

I have a 4090 myself and run qwen2.5:14b-instruct-q8_0, I get respones in 1-3 seconds

#

qwq runs at relatively the same speed, but outputs 10x more tokens due to reasoning, so would take 10x the time to resopnd

#

I mean, the version of qwq that I can fit into my VRAM. there are multiple versions

tiny python Mar 18, 2025, 5:28 PM

#

fair enough.

#

I am not GPU rich enough to try that stuff, so can't really comment 😄 all I can go by is stuff I see around from others. Not sure i have seen anyone really using qwq around here though to be fair. Mostly Qwen

late shadow Mar 18, 2025, 5:29 PM

#

I ran llama 3.1 8b at fp16 before Qwen. Qwen does seem quite a bit better, but has a tendency to slip into Chinese or Thai if it runs into dificulties

tiny python Mar 18, 2025, 5:30 PM

#

Yeah I noticed the same when I tried

#

Also isn't great in terms of information stuff

#

due to knowledge restrictions 😦

late shadow Mar 18, 2025, 5:30 PM

#

For information I've built a script that queries GPT-4.5.
I can instruct Qwen to use it by saying "Ask GPT ...", and it responds with "Here's what GPT said: ..."

tiny python Mar 18, 2025, 5:31 PM

#

been meaning to try gemma3 tools, just haven't had the time to tinker with that one, but think I saw someone in the ML channel saying it seemed to be working well, on par or maybe slightly better than Qwen

late shadow Mar 18, 2025, 5:31 PM

#

tiny python due to knowledge restrictions 😦

I didn't find it too dumb, but it heavily depends on the size of the model you run. I run the 17GB version

tiny python Mar 18, 2025, 5:32 PM

#

Yeah, I was trying it with music assistant, and I tried something simple like "Play 5 popular songs by System of a Down" and it gave me "Song 1, Song 2, Song 3" 😄

late shadow Mar 18, 2025, 5:33 PM

#

Oh yeah, I guess it wouldn't know which songs are popular. Music assistant is tricky with voice. they're working on it

tiny python Mar 18, 2025, 5:33 PM

#

Yeah I am one of the people working on that 😅

#

gpt 4o handles that one fine, I assume due to different knowledge domain and of course more params

late shadow Mar 18, 2025, 5:34 PM

#

awesome, thank you, keep it up

#

GPT-4o is like 100x bigger than the model you were running

tiny python Mar 18, 2025, 5:34 PM

#

yup

late shadow Mar 18, 2025, 5:34 PM

#

and GPT-4.5 is 10x bigger than that still

tiny python Mar 18, 2025, 5:35 PM

#

Yeah, that one is a bit more costly to run day-to-day I imagine

late shadow Mar 18, 2025, 5:35 PM

#

oh yeah, it's 30x more expensive than 4o

tiny python Mar 18, 2025, 5:36 PM

#

yeah, that's the big getcha sadly. Those models are excellent but will kill your wallet lol. 4o seems to do well with most things though. Ideally probably need to do some agentic stuff, where 4o can delegate to 4.5 for harder stuff

late shadow Mar 18, 2025, 5:36 PM

#

it's fine if you use it like once a day for a random query, still comes down like a dollar a month, but not as your main assistant

tiny python Mar 18, 2025, 5:36 PM

#

Yeah

late shadow Mar 18, 2025, 5:36 PM

#

4o-mini is honestly perfectly fine for an assistant

#

that's what I use as a fallback when my 4090 machine is offline

#

mostly for the speed

tiny python Mar 18, 2025, 5:37 PM

#

Yeah that's a good setup. I'm hoping to get there if I can snag a good GPU at some point. 4060ti works, but isn't the best for LLM especially witht he slower bus speed and limit of 16GB for the model I have.

#

and I lose like 4GB of that 16 running whisper locally

late shadow Mar 18, 2025, 5:38 PM

#

3090 is the sweet spot I believe, but not cheap either

tiny python Mar 18, 2025, 5:38 PM

#

Yeah probably cheaper than the 5090 though lol, just a matter of finding one

late shadow Mar 18, 2025, 5:39 PM

#

anything is cheaper than a 5090

tiny python Mar 18, 2025, 5:41 PM

#

That's a comment from someone on gemma3-12b tools, guess they are finding it performing better than Qwen in their testing at least

#

Think he is referring to https://ollama.com/PetrosStav/gemma3-tools:12b

PetrosStav/gemma3-tools:12b

Google gemma3 with added tools support.

dreamy patio Mar 18, 2025, 8:02 PM

#

Thanks for the help guys. I did get it to work but with mixed results.

#

When I ask to turn the kitchen lights on it works basically instantly wth qwq. It understands what to do right away and then executes the action with a concise speech confirmation

#

However when it needs to reiterate on it's thinking it has a response with [think] tags and then a command between [tool] tags. The answer takes up to 2 minutes in cases like that and the responses are very long because the model starts to reason/arguing with itself

#

It does derive the correct action but again fails to execute it. I'll try some different models to see what works best

#

I'm very impressed with the speech recognition/speech generation though. The whole thing feels snappy when it works and those are run on a I3 13100F cpu

proper bone Mar 18, 2025, 10:08 PM

#

I am a big fan of mistral-small-24b. It replaced Qwen2.5 which also had very reliable function calling in combination with HA

late shadow Mar 19, 2025, 9:09 AM

#

dreamy patio When I ask to turn the kitchen lights on it works basically instantly wth qwq. I...

I think this is not reaching qwq - you probably have "Prefer handling commands locally" and what you're seeing is the basic intent handling. Let me guess - the response is "Turned on the light"? That's hardcoded.

late shadow Mar 19, 2025, 9:10 AM

#

dreamy patio However when it needs to reiterate on it's thinking it has a response with [thin...

Yeah just don't bother with using a reasoning model for a voice assistant

late shadow Mar 19, 2025, 9:10 AM

#

proper bone I am a big fan of mistral-small-24b. It replaced Qwen2.5 which also had very rel...

what size are you running it at? and what was the size/quantization that you ran Qwen2.5 at? Did you run into issues with Qwen?

proper bone Mar 19, 2025, 1:08 PM

#

Mistral-small-24B Q4 and Qwen2.5-32B Q4. I was happy with Qwen but Mistral works perfectly the same way as Qwen do, but thanks to the smaller size, the response is even snappier. I let them run on of my RTX3090s

dreamy patio Mar 19, 2025, 8:59 PM

#

So turns out I'm an idiot and I didn't update ollama for a long while. I got so fed up with windows 11 last year that I installed linux on my main pc and I just learned that ollama does not automatically update on windows. This was what was causing all my problems

#

I've set the post to resolved.

#Ollama llama3.1 understands question but action is not executed