I've been trying different models using the Ollama integration with a locally running instance of Ollama. With gpt-oss-20b I get about 2 seconds of latency on a simple prompt, measured from pressing <return> until the answer starts to appear. With qwen3:30b-a3b-instruct-2507-q4_K_M I'm getting 4 seconds. Other models have even higher latency.
When I run these models directly in Ollama with ollama run, there is practically zero latency for either. So I'm wondering what HA does that causes the latency.
I've pruned my exposed entities from well over 100 down to just 10. This has not changed the latency in any meaningful way. So where is the time going?
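To narrow this down, here is a rough sketch of how the time to first token could be measured against the Ollama HTTP API directly, outside of HA. It assumes Ollama is listening on its default local address (http://localhost:11434) and uses the streaming /api/generate endpoint. The helper names (first_token_latency, ollama_stream, main) are made up for this example. The final streamed chunk from Ollama includes fields such as load_duration and prompt_eval_duration (in nanoseconds), which may help show whether the delay is model loading or prompt processing.

```python
import json
import time
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def first_token_latency(open_stream, clock=time.monotonic):
    """Time how long a streamed reply takes to produce its first text.

    open_stream: callable returning an iterable of NDJSON lines (bytes or str).
    Returns (seconds until the first non-empty "response" chunk or None,
             the final chunk with Ollama's timing stats or {}).
    """
    start = clock()  # start before the request is opened
    latency = None
    stats = {}
    for raw in open_stream():
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        # Only visible answer text counts; empty chunks are skipped.
        if latency is None and chunk.get("response"):
            latency = clock() - start
        if chunk.get("done"):
            stats = chunk
    return latency, stats


def ollama_stream(model, prompt):
    """Yield raw NDJSON lines from a streaming /api/generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        yield from resp


def main(model="qwen3:30b-a3b-instruct-2507-q4_K_M", prompt="Say hi."):
    latency, stats = first_token_latency(lambda: ollama_stream(model, prompt))
    print("seconds to first token:", latency)
    # Durations reported by Ollama are in nanoseconds.
    for key in ("load_duration", "prompt_eval_count", "prompt_eval_duration"):
        print(key, stats.get(key))
```

Calling main() with a short prompt, and then again with a much longer prompt, should give a feel for how much of the delay comes from prompt size rather than from the model itself. It does not reproduce whatever HA actually sends, so it is only a baseline for comparison.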