#Ollama llama3.1 understands question but action is not executed

1 messages ยท Page 1 of 1 (latest)

dreamy patio
#

I've been trying to get a voice assistant based on llama3.1 to work but I can't get it to control any of my devices. I have HA running on a NAS and an ollama server running on a rig with RTX3090. The model is running on <IP>:11434 and when assist is ticked off I get a response I just the way I expect from a conversation agent. However when I turn assist on the speech reponse I get is something like the following.

Query: "I'd like to watch tv"
response: {"name": "MediaTurnOn", "parameters": {"area": "Woonkamer, Living room", "domain": ["media_player"], "name": "TV Woonkamer"}}.

It clearly has information about the entities I exposed but I expect a response more in the lines of: "I've turned on the TV for you" and it then actually happening.
After looking in the debug log I see that the speech indeed has the mediaplayer turn on command but the data target is empty. Is this correct behaviour and am I missing something obvious?

Raw debug response is attached to the post.

I've already tried the following:
I picked the llama3.1 model because it is the one that is suggested on the HA website but I have similar behaviour with other models that support tools.
I've reduced the amount of exposed entities to 25 but it makes no difference. Even with 100 entities exposed the response I get is snappy and from the information I do get back it seems to pick a correct action (or at least the right intent to said action). I've also tried to increase the context window size but that made no difference either.

I've also removed my custom sentences and tried with prefer local processing off and on. The local processing seems to work but I've turned it off for debugging purposes right now.
Hopefully someone can point me in the correct direction. I feel like I'm 95% of the way there.

tiny python
#

Seems if you have the vram, qwq would be the best local model at the moment

tiny python
#

The responses look like the model attempting tool calling but not formatting it quite correctly and resulting in it coming back as a reply instead of a command. I've seen this happen when the model is either not smart enough to properly call the tools, or if the context window is too small to hold all the prompt, tool, and entity data as well as conversation history.

late shadow
tiny python
#

Yes, but if you look at the leaderboard it is one of the top performing models for HA activities

late shadow
tiny python
#

Think it has been coded to exclude the thinking stuff from response

#

I mean people are using qwq, it's on the leaderboard so someone has tested ๐Ÿคท

late shadow
#

Makes sense, but it will still be like 10x slower

tiny python
#

Yeah that's the balancing act

#

smaller models may be faster, but they won't be as smart

late shadow
#

The leaderboard is a synthetic benchmark, not user reports. It doesn't take into account response time

tiny python
#

I've tested several on the leaderboard, such as Qwen2.5, and it is decently fast on my hardware

#

but thing is exactly that, it really depends on the hardware

late shadow
tiny python
#

Like for me Qwen responded in like a second or less, someone with a 3090 is gonna probably get it twice as fast as me, someone with an older card like a p40 will probabyl be twice as slow as me, etc

late shadow
tiny python
#

even running gpt 4o-mini that can sometimes take a second or two to reply

#

Yeah I get ya

late shadow
tiny python
#

all I am saying is, llama 3.1 is the worst performing of the bunch at this point

#

I see a lot of people using qwen2.5 7b or 14b

#

some are trying out gemma3 with tool calling added as well

late shadow
#

there doesn't really exist hardware you can run locally reasonably that would produce a response from qwq in 2 seconds

tiny python
#

I mean a 5090 would probably do it, but they are hard to come by and of course pricy ๐Ÿ˜…

late shadow
#

I have a 4090 myself and run qwen2.5:14b-instruct-q8_0, I get respones in 1-3 seconds

#

qwq runs at relatively the same speed, but outputs 10x more tokens due to reasoning, so would take 10x the time to resopnd

#

I mean, the version of qwq that I can fit into my VRAM. there are multiple versions

tiny python
#

fair enough.

#

I am not GPU rich enough to try that stuff, so can't really comment ๐Ÿ˜„ all I can go by is stuff I see around from others. Not sure i have seen anyone really using qwq around here though to be fair. Mostly Qwen

late shadow
#

I ran llama 3.1 8b at fp16 before Qwen. Qwen does seem quite a bit better, but has a tendency to slip into Chinese or Thai if it runs into dificulties

tiny python
#

Yeah I noticed the same when I tried

#

Also isn't great in terms of information stuff

#

due to knowledge restrictions ๐Ÿ˜ฆ

late shadow
#

For information I've built a script that queries GPT-4.5.
I can instruct Qwen to use it by saying "Ask GPT ...", and it responds with "Here's what GPT said: ..."

tiny python
#

been meaning to try gemma3 tools, just haven't had the time to tinker with that one, but think I saw someone in the ML channel saying it seemed to be working well, on par or maybe slightly better than Qwen

late shadow
tiny python
#

Yeah, I was trying it with music assistant, and I tried something simple like "Play 5 popular songs by System of a Down" and it gave me "Song 1, Song 2, Song 3" ๐Ÿ˜„

late shadow
#

Oh yeah, I guess it wouldn't know which songs are popular. Music assistant is tricky with voice. they're working on it

tiny python
#

Yeah I am one of the people working on that ๐Ÿ˜…

#

gpt 4o handles that one fine, I assume due to different knowledge domain and of course more params

late shadow
#

awesome, thank you, keep it up

#

GPT-4o is like 100x bigger than the model you were running

tiny python
#

yup

late shadow
#

and GPT-4.5 is 10x bigger than that still

tiny python
#

Yeah, that one is a bit more costly to run day-to-day I imagine

late shadow
#

oh yeah, it's 30x more expensive than 4o

tiny python
#

yeah, that's the big getcha sadly. Those models are excellent but will kill your wallet lol. 4o seems to do well with most things though. Ideally probably need to do some agentic stuff, where 4o can delegate to 4.5 for harder stuff

late shadow
#

it's fine if you use it like once a day for a random query, still comes down like a dollar a month, but not as your main assistant

tiny python
#

Yeah

late shadow
#

4o-mini is honestly perfectly fine for an assistant

#

that's what I use as a fallback when my 4090 machine is offline

#

mostly for the speed

tiny python
#

Yeah that's a good setup. I'm hoping to get there if I can snag a good GPU at some point. 4060ti works, but isn't the best for LLM especially witht he slower bus speed and limit of 16GB for the model I have.

#

and I lose like 4GB of that 16 running whisper locally

late shadow
#

3090 is the sweet spot I believe, but not cheap either

tiny python
#

Yeah probably cheaper than the 5090 though lol, just a matter of finding one

late shadow
#

anything is cheaper than a 5090

tiny python
#

That's a comment from someone on gemma3-12b tools, guess they are finding it performing better than Qwen in their testing at least

dreamy patio
#

Thanks for the help guys. I did get it to work but with mixed results.

#

When I ask to turn the kitchen lights on it works basically instantly wth qwq. It understands what to do right away and then executes the action with a concise speech confirmation

#

However when it needs to reiterate on it's thinking it has a response with [think] tags and then a command between [tool] tags. The answer takes up to 2 minutes in cases like that and the responses are very long because the model starts to reason/arguing with itself

#

It does derive the correct action but again fails to execute it. I'll try some different models to see what works best

#

I'm very impressed with the speech recognition/speech generation though. The whole thing feels snappy when it works and those are run on a I3 13100F cpu

proper bone
#

I am a big fan of mistral-small-24b. It replaced Qwen2.5 which also had very reliable function calling in combination with HA

late shadow
late shadow
late shadow
proper bone
#

Mistral-small-24B Q4 and Qwen2.5-32B Q4. I was happy with Qwen but Mistral works perfectly the same way as Qwen do, but thanks to the smaller size, the response is even snappier. I let them run on of my RTX3090s

dreamy patio
#

So turns out I'm an idiot and I didn't update ollama for a long while. I got so fed up with windows 11 last year that I installed linux on my main pc and I just learned that ollama does not automatically update on windows. This was what was causing all my problems

#

I've set the post to resolved.