#Streaming API for LLMs like ollama

1 messages · Page 1 of 1 (latest)

wind portal
#

Hi, I've just started playing with local LLMs and spun up ollama on my Jetson Nano, now trying to integrate it to HA using API. First thing I tried is official Ollama integration, but quickly noticed its deficiencies and slowness to respond: it doesn't use streaming API:
https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-chat-completion

here is the piece of code that does (or does not) that
https://github.com/home-assistant/core/blob/dev/homeassistant/components/ollama/conversation.py#L265

are there any reasons for such choice? are there any options for faster response time?

GitHub

:house_with_garden: Open source home automation that puts local control and privacy first. - home-assistant/core

GitHub

Get up and running with Llama 3.3, Phi 4, Gemma 2, and other large language models. - ollama/ollama

near mulch
#

Hmm, i cannot imagine streaming API working with Text To Speech engine.

wind portal
#

had that thought as well, but was wondering if someone has started thinking of it

#

I think I saw google doing some steps towards that with their gemini

near mulch
# wind portal had that thought as well, but was wondering if someone has started thinking of i...

There's your answer to the question "are there any reasons for such choice?".
If anything is in the works, I have no idea. But even giants like Google don't have solution to that so far. Generating speech with precise intonation from incomplete sentence is impossible even for human. When you're talking, you already have full sentence (if not next few) in your head, so you know where to put stress, and where to make your voice lower.
In short - there's no technology now, and probably it will be some new TTS LLM, that can generate not only token, but stress and intonation as it goes.

obsidian canopy
#

Technicality there's tech to do it, but not in the open source world. For instance i believe the full 4o model of gpt actually does sts (speech to speech) where you say your question, the model understands the sound means text, and replies directly in the corresponding sound for the response text. Ie hears question directly outputs audio reply

#

That's how that realtime conversation stuff they demoed supposedly works for chatgpt

near mulch
wind portal
#

seems like in the end of a day all of us are going to have a bulky ai server to play with

obsidian canopy