If I ask a query directly of a model running in the Ollama terminal, the response is not bad - but if I make the same query from Assist it takes 25 to 30 seconds longer to respond. Part of this might be because Assist dumps the entire answer at once, while the terminal builds up to the final answer as it goes. How would I go about determining what is slowing things down? In case there is interest - I am running Ollama in a Docker container under Unraid, on a Dell PowerEdge R730 using an Nvidia P40 with 24 GB of VRAM. I have used a number of different queries, but for today's discussion - "please write a one paragraph batman story" is what I am working with. Directly on Ollama it takes about 20 seconds. Through Assist it takes almost a minute.
#Assist response is SLOW when using LLM
I saw this thread (not sure why I did not see it before), "ollama veeeery slow when run from HA", which suggests I should be seeing better response times, but nothing other than context size was offered up as a possible cause of the slowness. Hopefully others might have something else to try... 🤞
There is a debug window in the voice assistant tab
https://www.home-assistant.io/voice_control/troubleshooting/
When you ask a question directly, the model context is clear. It is generating based on model data exclusively. It's fast.
When you do it with Assist control, your request contains the system prompt, as well as the rules of interaction with HA (list of tools), and all exposed entities data. Not only does all of that get loaded into the LLM's context, but your request is processed on top of all that data. It's slow.
If you put "no control" in LLM config in HA, it will be much faster - but the model won't know anything about your home.
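One way to check where the time goes is to look at the timing fields Ollama returns with a non-streaming `/api/generate` response (`load_duration`, `prompt_eval_duration`, `eval_duration`, `total_duration`, all in nanoseconds, plus the `prompt_eval_count` and `eval_count` token counts). A rough sketch, assuming Ollama is reachable at `http://localhost:11434` and that a model tagged `qwen2.5:7b` is pulled; both are placeholders, and the helper names `ask` and `timing_breakdown` are made up for this example:

```python
import json
import urllib.request

NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def timing_breakdown(resp: dict) -> dict:
    """Turn the timing fields of an Ollama response into seconds and tokens/s."""
    def secs(key: str) -> float:
        # Some fields can be missing (e.g. when the prompt was cached).
        return resp.get(key, 0) / NS_PER_S

    gen_tokens = resp.get("eval_count", 0)
    gen_s = secs("eval_duration")
    return {
        "load_s": secs("load_duration"),               # loading the model
        "prompt_tokens": resp.get("prompt_eval_count", 0),
        "prompt_s": secs("prompt_eval_duration"),      # reading the prompt
        "gen_tokens": gen_tokens,
        "gen_s": gen_s,                                # writing the answer
        "gen_tok_per_s": gen_tokens / gen_s if gen_s else 0.0,
        "total_s": secs("total_duration"),
    }


def ask(prompt: str, system: str = "",
        model: str = "qwen2.5:7b",                     # assumed model tag
        host: str = "http://localhost:11434") -> dict:  # assumed host/port
    """Send one non-streaming request to Ollama and return the parsed JSON."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if system:
        payload["system"] = system
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)


# Example (needs a running Ollama instance):
#   bare = timing_breakdown(ask("please write a one paragraph batman story"))
#   big = timing_breakdown(ask("please write a one paragraph batman story",
#                              system="...paste a long prompt here..."))
#   print(bare, big, sep="\n")
```

If `prompt_tokens` and `prompt_s` jump when a long system prompt is added while `gen_tok_per_s` stays about the same, the extra time is going into reading the context rather than writing the answer. This only approximates what Assist sends; the debug view in the voice assistant settings shows the actual pipeline run.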
Or run it on a proper GPU. That helps. 🙂
You also want to play around with different models
What 😩 ... are you saying an Nvidia P40 is no good? I know it is slower than modern cards, but most reviews seem to suggest it is serviceable. I have tried multiple models - with similar speed results (but worse accuracy in a lot of cases). I have settled on the qwen2.5 32G q2 (tried 16G and it was about the same speed).
I am using qwen2.5 7b on my 3060 12GB just fine
(can't run 14b because of vram)
You can always try reducing the amount of entities exposed to assist
Yeah, I have been on the reduce entities kick... I went from 160 things exposed to about 40.
Wonder if anyone else out there is running a p40, it would be a good data point.
@gleaming lake so I went ahead and retried 7b (I had it loaded and configured) - and it is definitely A LOT faster... Then I remembered why I went away from it. It is a lot less forgiving and would do things I had not told it to do. But that was with more entities... I will have to see if that is still the case.