Streaming API for LLMs like ollama | Home Assistant | Page 1

wind portal Jan 10, 2025, 9:59 PM

#

Hi, I've just started playing with local LLMs and spun up ollama on my Jetson Nano, now trying to integrate it to HA using API. First thing I tried is official Ollama integration, but quickly noticed its deficiencies and slowness to respond: it doesn't use streaming API:
https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-chat-completion

here is the piece of code that does (or does not) that
https://github.com/home-assistant/core/blob/dev/homeassistant/components/ollama/conversation.py#L265

are there any reasons for such choice? are there any options for faster response time?

GitHub

core/homeassistant/components/ollama/conversation.py at dev · home-...

:house_with_garden: Open source home automation that puts local control and privacy first. - home-assistant/core

GitHub

ollama/docs/api.md at main · ollama/ollama

Get up and running with Llama 3.3, Phi 4, Gemma 2, and other large language models. - ollama/ollama

near mulch Jan 10, 2025, 10:03 PM

#

Hmm, i cannot imagine streaming API working with Text To Speech engine.

wind portal Jan 10, 2025, 10:08 PM

#

had that thought as well, but was wondering if someone has started thinking of it

#

I think I saw google doing some steps towards that with their gemini

#

https://pypi.org/project/pysbd/
also this

near mulch Jan 10, 2025, 10:43 PM

#

wind portal had that thought as well, but was wondering if someone has started thinking of i...

There's your answer to the question "are there any reasons for such choice?".
If anything is in the works, I have no idea. But even giants like Google don't have solution to that so far. Generating speech with precise intonation from incomplete sentence is impossible even for human. When you're talking, you already have full sentence (if not next few) in your head, so you know where to put stress, and where to make your voice lower.
In short - there's no technology now, and probably it will be some new TTS LLM, that can generate not only token, but stress and intonation as it goes.

obsidian canopy Jan 10, 2025, 10:55 PM

#

Technicality there's tech to do it, but not in the open source world. For instance i believe the full 4o model of gpt actually does sts (speech to speech) where you say your question, the model understands the sound means text, and replies directly in the corresponding sound for the response text. Ie hears question directly outputs audio reply

#

That's how that realtime conversation stuff they demoed supposedly works for chatgpt

near mulch Jan 10, 2025, 11:01 PM

#

obsidian canopy Technicality there's tech to do it, but not in the open source world. For instan...

Oh okay, multimodal LLMs can do sts already? I thought it's just very fast inference and sentence-by-sentence TTS.

wind portal Jan 10, 2025, 11:18 PM

#

near mulch There's your answer to the question "are there any reasons for such choice?". If...

agree, I'm just trying to find ways on how to make llms useful on smaller chips. technically you can always use something like mac pro or wait for nvidia's project digits

#

seems like in the end of a day all of us are going to have a bulky ai server to play with

near mulch Jan 10, 2025, 11:19 PM

#

wind portal agree, I'm just trying to find ways on how to make llms useful on smaller chips....

Yeah, 3k$ for local GPT 🙂

obsidian canopy Jan 10, 2025, 11:21 PM

#

near mulch Oh okay, multimodal LLMs can do sts already? I thought it's just very fast infer...

The open ai one can. Gemini still uses tts i think

#Streaming API for LLMs like ollama