#ollama veeeery slow when run from HA

1 messages · Page 1 of 1 (latest)

rigid current
#

I run ollama 3.2 in a separate VM (same box as HAOS). Running on ITX with 64G ram and AMD 4600G CPU.
When I run it from inside the VM, it's quite responsive and I get a reply almost immediately.
However, when run via HA assist, answer is taking at least 30 seconds and often I get an error "Timeout running pipeline".

What may be the reason for this? Too many exposed entities (~200)? Anything else?
Also sometimes I'm getting nonsense replies, for example I write "hello", and it replies it could not switch off lights.

alpine ruin
#

Possibly running into context window. With a lot of entities exposed you can get pretty close or even exceed the 8192 window. Some models even default to only a 4196 token window which fills up really fast. Once context gets too large the model will hallucinate

topaz rune
#

That's definitely context. When you run it in console, there's no context besides the info you've typing in. But every request from HA has all entities, areas, tools (e.g. scripts) added to the prompt system context message. So model gets overwhelmed.

I just put it in "no control", and use for general questions, or to refine already prepared info.

rigid current
#

so basically it needs beefy hardware to run and there is no way around it?

mortal rain
rigid current
#

just tried it. Removed all entities, left only 3 lights. It takes around 20 seconds to switch on

mortal rain
#

that's weird, then. I don't have any idea, but maybe someone else does

rigid current
#

how long does it take for you, and on what hardware?

#

I saw people running it on raspberries...

mortal rain
#

I haven't been runnning ollama for a while now (issues with my GPU cooling) and can't answer that question

rigid current
#

I've watched multiple reviews and found that even on decent hardware ollama hallucinates a lot, switches wrong lights etc. So I conlude that it is not ready yet and will be halting my integration for at least a year.

#

from my testing it's the same. I explicitly say to switch on "main light", but it switches on all of them ,etc

#

or says, bus does nothing

#

in current state, definitely not worth investing in hardware

frozen juniper
#

Actually, when I tried to configure the assistant without control for home assistant it works fast as normal, but if you set it to control home assistant especially with many exposed entities it takes too much time

#

However, I did an alternative method by setting custom sentences with the default home assistant conversation when they are triggered, then sends the conversation to ollama using conversation.process and responds back.

alpine ruin
#

The thing with Ollama is the model itself. For most people, unless you are running an $80k GPU, the largest model you can run with decent response times is a quantized 8b model. for the most part going to 8bit quantization doesn't have that big of a hit, but the further you quantize the more likely the model is to hallucinate. Add to that, that an 8B model is generally not going to perform as well as a model like gpt-4o (1.8 TRILLION parameters) and even gpt-4o-mini (think this is 8B, but trainied from the full 1.8 trillion params of gpt-4o, and probably running in raw fp16 unquantized) and you run into what you are seeing.
I tried playing around with llama3.2 with my context window set to 8196 and it was decent and fairly fast on my 4060ti, but still not as reliable as gpt-4o-mini (sometimes it would get things wrong, turn on the wrong light, not utilize tools I defined correctly, etc). The context window size for me was also around 6k if I remember correctly, so it was close to the max window already with just 150 entities and a handful of tools exposed. Note I have 150 exposed entities presently, and I get responses within a second or two.

topaz rune
short token
#

I just setup ollama on my desktop PC, along with a docker container for both wyoming-whisper and wyoming-piper.

Installed the add-ons for whisper, piper, and open wake word in HA Add-ons. configured the pipeline with qwen 2.5 as conversation agent with custom prompt (runs on my desktop - not on same box as HAOS), faster-whisper (running on desktop), and faster-piper (running on desktop).

Then configured an m5stack atom echo using tutorial: https://www.home-assistant.io/voice_control/thirteen-usd-voice-remote/

I have it working, response seems fairly quick (sub 10 seconds).

Issues are:

  • It doesn't know how to do some things that seem basic (switch source of Roku device).
  • Sometimes controls entities that have nothing to do with what I asked.

Wants: Want to speak to the atom echo, and have the response come out a different media_player. (can't figure it out).

Home Assistant

Open source home automation that puts local control and privacy first. Powered by a worldwide community of tinkerers and DIY enthusiasts. Perfect to run on a Raspberry Pi or a local server.

#

This was the video I watched to config all of this: https://www.youtube.com/watch?v=XvbVePuP7NY

Scam Copilot is available for new users of Bitdefender Premium Security and superior plans. Find out more here: https://bitdefend.me/SCNetworkChuck

I’m replacing Amazon Alexa with my own, completely local AI voice assistant!! The amazing part is that it cleanly integrates with my home automation system, Home Assistant. Also, it’s using local LL...

▶ Play video