#What factors affect conversation agent latency?


dapper oxide
#

I've been trying different models using the ollama integration with a locally running instance of ollama. With gpt-oss-20b I get about 2 seconds of latency on a simple prompt, from pressing <return> until an answer starts to appear. With qwen3:30b-a3b-instruct-2507-q4_K_M I'm getting 4 seconds of latency. Other models take even longer.

When I run these models directly in ollama with ollama run there is practically 0 latency for either. So I'm wondering what HA does that causes the latency.

I've pruned my exposed entities down to just 10. I started at well over 100. This has not changed the latency in any meaningful way. So I'm wondering where the time is going...
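One way to put a number on "where the time is going" is to time how long a stream takes to produce its first chunk, once when talking to ollama directly and once through HA. The sketch below is only an illustration: time_to_first_chunk and fake_stream are made-up names, and the stream is simulated rather than coming from a real ollama or HA connection. Any iterator of text chunks could be passed in instead.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_chunk(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, str]:
    """Return (seconds until the first non-empty chunk, that chunk)."""
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive style chunks
            return clock() - start, chunk
    raise ValueError("stream ended without producing any output")


def fake_stream(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(delay_s)  # simulated prompt processing before the first token
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    latency, first = time_to_first_chunk(fake_stream(0.05))
    print(f"first chunk {first!r} after {latency:.3f}s")
```

Timing the first chunk rather than the whole reply separates prompt processing from generation speed, and prompt processing is the part in question here.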

tender prawn
#

There's also the context length, and the system prompt has to be processed too. It contains info about exposed entities and available tools.

dapper oxide
#

Right. I enabled debug logging in ollama so I could see the full system prompt. The instructions and tool parts aren't excessively long. There's something in the log that I don't understand though. There is an "ids=[...]" string that has more than 1800 numbers listed. Many of the numbers are duplicates. The list is there but much shorter if I disable assist in the agent settings. Anybody know what this ids list is?

sturdy cliff
#

Can you share what you see?

dapper oxide
#

Yes, but tomorrow. Machine is at the office...

dapper oxide
#

Speculating... Maybe the ids are being generated by ollama? Maybe that is the tokenized representation of the prompt?

tender prawn
#

Crap, I tried to look at the line. Then I realized that it's 21 KB in a single line.
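A line that long is easier to summarize than to read. Here is a rough sketch that pulls the numbers out of an "ids=[...]" list and reports how many there are and how many are distinct. The separator between the numbers is an assumption (spaces or commas both work), and count_ids is a made-up helper name, not something ollama provides.

```python
import re
from typing import Tuple


def count_ids(log_line: str) -> Tuple[int, int]:
    """Return (total ids, distinct ids) found in the first ids=[...] list."""
    match = re.search(r"ids=\[([^\]]*)\]", log_line)
    if match is None:
        raise ValueError("no ids=[...] list found in line")
    # Accept any separator between the numbers (assumed: spaces and/or commas).
    ids = [int(tok) for tok in re.findall(r"\d+", match.group(1))]
    return len(ids), len(set(ids))


if __name__ == "__main__":
    sample = "level=DEBUG ids=[1 2 2, 3 2] other=stuff"  # invented sample line
    print(count_ids(sample))  # (5, 3)
```

If the ids really are token IDs, duplicates are expected, since common tokens repeat throughout any prompt.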

dapper oxide
#

If those ids are the tokenized prompt, then it looks like qwen3 tokenizes the prompt to about 40% more tokens than gpt-oss, which would account for some of the lag. Even if the models processed tokens at the same rate (tps), there would be a 40% penalty. Switching between models changes the number of those ids significantly.
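The 40% estimate follows from simple arithmetic: at a fixed prompt-processing rate, time scales linearly with token count. A tiny sketch, where the token counts and the 900 tokens/s rate are invented for illustration and not measured on any model:

```python
def prompt_eval_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to process a prompt at a constant prompt-processing rate."""
    return prompt_tokens / tokens_per_second


if __name__ == "__main__":
    rate = 900.0                            # hypothetical tokens/s, same for both models
    base_tokens = 1800                      # hypothetical prompt size for one tokenizer
    more_tokens = round(base_tokens * 1.4)  # 40% more tokens for the other tokenizer

    base = prompt_eval_seconds(base_tokens, rate)
    more = prompt_eval_seconds(more_tokens, rate)
    print(f"{base:.2f}s vs {more:.2f}s ({more / base - 1:.0%} slower)")
```

In practice the two models will not process tokens at the same rate, so this only isolates the tokenizer's share of the difference.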

dapper oxide
#

So the gpt-oss prompt does not have the tools JSON section. Wonder why that is. Both models show "tools" under capabilities in ollama show.

dapper oxide
#

Sooo, in theory ollama caches the prompt up to the point where it changes between messages. But the current time changes with every message and comes before the tools in the prompt. So if the tools are present (as they are for qwen3), they get reprocessed with every message. I think this is what accounts for the difference in latency I am seeing between models.
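The caching theory can be shown with a toy model: a prefix cache only helps up to the first token that differs, so everything after the timestamp has to be processed again. The sketch below uses whitespace-split words as stand-in tokens and an invented prompt layout. It is not HA's real prompt or ollama's actual cache, and the "time after tools" variant is there only for contrast, not as a description of what any release changed.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(now: str, time_before_tools: bool) -> List[str]:
    """Hypothetical prompt: instructions, a clock line, and 50 fake tool tokens."""
    instructions = "You are a voice assistant."
    clock = f"Current time is {now}."
    tools = " ".join(f"tool_{i}" for i in range(50))
    if time_before_tools:
        parts = [instructions, clock, tools]
    else:
        parts = [instructions, tools, clock]
    return " ".join(parts).split()  # words stand in for tokens (assumption)


def tokens_to_reprocess(time_before_tools: bool) -> int:
    """Tokens after the cached prefix when only the timestamp changes."""
    first = build_prompt("10:00:01", time_before_tools)
    second = build_prompt("10:00:07", time_before_tools)
    return len(second) - shared_prefix_len(first, second)


if __name__ == "__main__":
    print(tokens_to_reprocess(time_before_tools=True))   # 51
    print(tokens_to_reprocess(time_before_tools=False))  # 1
```

With the timestamp ahead of the tools, nearly the whole prompt falls outside the cached prefix on every message. With it at the end, only the timestamp itself does.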

dapper oxide
#

Confirmed, the 2025.11 release eliminates the majority of the latency I was experiencing. Thanks to the developers! It's too bad this improvement didn't get a mention in the release announcement.

fickle walrus
#

I'm curious how much the 2025.11 release helped with latency in general, and how much lower we might be able to get things? The main reason I haven't purchased a VoicePE or started playing with Ollama locally is the amount of processing time I see responses taking in the YouTube videos I've watched. Even the latest video from 1 month ago shows a request to turn off the lights taking about 4 seconds.

sturdy cliff
#

I'm confused why there's no support for llama.cpp. It's been much faster than ollama in my tests.

#

By the way you don't need a Voice PE to test this yourself. Use your phone to speak to HA.

fickle walrus
#

Thanks @sturdy cliff. That's a good point about the phone. I really do want to replace all of the Alexa devices in my home with something that can run locally even when the internet is out. But, TBH, I've put off even starting to install things until I saw response times come down. I don't THINK I have other needs for Ollama since I don't mind using ChatGPT for most things. But I could certainly be missing valuable use cases since I don't currently have it.

dapper oxide
#

After the update, latency for a simple non-tool interaction (e.g. "tell me a joke") after initial loading is a fraction of a second. When a tool is used there is some additional time spent. I haven't measured, but it's less than 2 seconds for the things I've tried. Much more interactive than before the update.

#

I'm using a Strix Halo AMD APU. Better than your typical APU, but nowhere near what a decent discrete GPU would do. I run both STT and the conversation agent on the Strix. I'm very happy with the performance and accuracy for this particular use case.

dapper oxide
#

@sturdy cliff I would also love to see llama.cpp support. Getting ollama running on the Strix was a fair bit more effort than llama.cpp. But I can understand why they went with ollama first. It's easier to set up for people who have more mainstream (i.e. NVIDIA) platforms. And it makes experimenting with alternative LLMs easy. Hoping they add official support for a plain vanilla OpenAI API though.
