#Streaming audio to voice assistant preview edition in real-time.


mortal stratus
#

Hello, I am trying to integrate a TTS API with the voice assistant. Even though I request a streamed response and forward it to the voice assistant in chunks, it still waits until the stream ends before playing anything. Is it possible to play the stream in real time?

I use logic like this in my custom MQTT Wyoming Python script:

await self.write_event(AudioStart(rate=self.rate, width=self.width, channels=self.channels).event())

async for chunk in bytes_iter:
  await self.write_event(AudioChunk(audio=bytes(chunk), rate=self.rate, width=self.width, channels=self.channels).event())

await self.write_event(AudioStop().event())

The voice assistant waits until the AudioStop() event is written. So even though the TTS starts sending audio within 200 ms, playback sometimes does not start for around 10 seconds, which is annoying.

mortal stratus
#

Also, for some reason the voice assistant LED ring stopped spinning during TTS playback after the last update. Previously there was a spinning white circle during TTS, but now it completes the task, turns the LED ring off, and only then plays the TTS audio.

#

Is there any documentation for voice assistant (preview edition) development?

hollow oasis
#

I don't have a test HA instance for convenient integration development, so I put together an external server layer.
There is quite a lot of room for improvement, but I hope it helps you. The main task is to split the text from the LLM into short sentences. Instead of implementing your own solution (sentence_boundary), you can use the ready-made sentence_stream library.
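
To illustrate the sentence-splitting idea, here is a minimal sketch in plain Python. It is not the sentence_stream library's API and not the sentence_boundary code from my server; the names `split_sentences` and `fake_llm` and the punctuation rule are assumptions made up for the example.

```python
import asyncio
import re
from typing import AsyncIterator

# Assumed boundary rule: '.', '!' or '?' followed by whitespace ends a sentence.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


async def split_sentences(text_chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield each complete sentence as soon as it appears in the text stream."""
    buffer = ""
    async for piece in text_chunks:
        buffer += piece
        parts = _BOUNDARY.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be incomplete, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


async def fake_llm() -> AsyncIterator[str]:
    """Hypothetical LLM output arriving in arbitrary pieces."""
    for piece in ["Hello the", "re. How are", " you? I am ", "fine"]:
        yield piece


async def collect() -> list:
    return [s async for s in split_sentences(fake_llm())]


sentences = asyncio.run(collect())
print(sentences)
```

Each sentence can then go to the TTS provider on its own, so synthesis of the first sentence can start before the LLM has finished the rest. A rule this simple will also split after abbreviations such as "e.g. ", which is one reason to prefer a ready-made library.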

#

As far as I could see, last month's update to the 11labs integration also uses streaming and smart text segmentation.

hollow oasis
mortal stratus
# hollow oasis switched from using the library to direct queries and added language settings

I can see the same implementation there.

            await self.write_event(
                AudioStart(
                    rate=self.speech_tts.sample_rate,
                    width=self.speech_tts.sample_width,
                    channels=self.speech_tts.channels,
                ).event(),
            )

            async for chunk in self.speech_tts.synthesize(text, voice_id=voice_name, language=language):
                if chunk:
                    await self.write_event(
                        AudioChunk(
                            audio=chunk,
                            rate=self.speech_tts.sample_rate,
                            width=self.speech_tts.sample_width,
                            channels=self.speech_tts.channels,
                        ).event(),
                    )
            
            await self.write_event(AudioStop().event())

The voice assistant waits for AudioStop().event() before it starts playing.

#

The homeassistant library's TTS source code even says this:

#

It doesn't iterate over chunks and expects all the data as bytes.

hollow oasis
#

You're a bit confused. HA fully supports streaming. However, because of limitations in local synthesis engines or providers, we have to build splitting mechanisms. If the provider can accumulate, split, and deliver audio chunks on its own, there is no problem using one generator to send text and another to receive audio. If you're interested, I can show you a project that uses gRPC in this mode.

As for my server implementation, it will appear as streaming to the end user. The AudioStop().event() wait is not global.
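
A rough sketch of what "not global" could look like, assuming each sentence is sent as its own start/chunk/stop sequence. The `Event` class, `fake_synthesize`, and `stream_sentences` are stand-ins invented for this example; they only mimic the shape of the Wyoming calls in the snippets above and are not the real library classes or my server code.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Awaitable, Callable, Iterable, List


@dataclass
class Event:
    """Stand-in for a protocol event (not the real Wyoming class)."""
    type: str          # "audio-start", "audio-chunk" or "audio-stop"
    audio: bytes = b""


async def fake_synthesize(sentence: str) -> AsyncIterator[bytes]:
    """Hypothetical TTS call: yields two dummy 'audio' chunks per sentence."""
    data = sentence.encode()
    mid = len(data) // 2
    for part in (data[:mid], data[mid:]):
        await asyncio.sleep(0)  # pretend to wait on the network
        yield part


async def stream_sentences(
    sentences: Iterable[str],
    write_event: Callable[[Event], Awaitable[None]],
) -> None:
    """Send one start/chunks/stop sequence per sentence, not one per reply."""
    for sentence in sentences:
        await write_event(Event("audio-start"))
        async for chunk in fake_synthesize(sentence):
            if chunk:
                await write_event(Event("audio-chunk", chunk))
        # The stop event closes this sentence only, so a client that waits
        # for it is delayed by one sentence rather than the whole reply.
        await write_event(Event("audio-stop"))


sent: List[Event] = []


async def record(event: Event) -> None:
    sent.append(event)


asyncio.run(stream_sentences(["Hi.", "Bye."], record))
event_types = [e.type for e in sent]
print(event_types)
```

If the client really does start playback on each stop event, the delay before the first audio is bounded by the length of the first sentence, which is why short sentences matter.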
