# Wyoming protocol implementation


light gale

Hey! Sorry if this is a bit technical or the wrong thread.
I want to dig into writing a Wyoming satellite. I found the protocol documentation here https://github.com/rhasspy/wyoming and the Python example of the default Wyoming satellite here https://github.com/rhasspy/wyoming-satellite/. It should be enough to get me started, but I'm curious whether there are any other projects or docs I can read to better understand the communication patterns between client and server, as well as the exact setup for HA auto-discovery and other important niche details.
Sorry for pinging @worthy granite, I know you're the main guru here. 🙂

Thanks!

worthy granite

Hey @light gale! Those are definitely the resources I would recommend (besides me of course 😄)

I know the docs aren't the greatest, unfortunately. Maybe we can improve them a bit together.

worthy granite

Something that isn't communicated well is which messages are required and which are optional in different contexts. The protocol fortunately isn't too complicated, but it can be tricky.

light gale

(I'm trying to bring this to Android, and will probably try to port openWakeWord (OWW) or microWakeWord (MWW) too, if it's not too hard...)

worthy granite

I'm not familiar with that project, but it almost surely implements things over MQTT. Wyoming is purely peer-to-peer TCP.

light gale

Good thing I just got some experience with websockets on the MA mobile client.

worthy granite

Nope, not even websocket. Just straight TCP. I had originally considered websocket, but I wanted it to be easy to implement on microcontrollers.

light gale

Oh geez! Okay, well, I guess a simple wrapper around a byte stream will do 🙂

worthy granite

The low-level protocol is extremely basic:

  1. Open a TCP connection and send a line of JSON with the event type, data_length, and optionally payload_length
  2. Write data_length bytes of UTF-8 encoded JSON with the event data
  3. Optionally write payload_length bytes with the binary payload (usually audio)

Everything else is just which events are expected when.
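
The three steps above can be sketched in Python as a pair of framing helpers. This is a minimal sketch using only the standard library, not the `wyoming` package's own API; the function names are hypothetical, and the default `version` string simply mirrors the satellite-side logs later in this thread.

```python
import io
import json
from typing import BinaryIO, Optional, Tuple

Event = Tuple[str, dict, bytes]  # (type, data, payload)


def encode_event(event_type: str, data: Optional[dict] = None,
                 payload: bytes = b"", version: str = "1.0.0") -> bytes:
    """Frame one event: JSON header line, then data JSON, then raw payload."""
    data_bytes = json.dumps(data).encode("utf-8") if data else b""
    header = {"type": event_type, "version": version}  # version value is an assumption
    if data_bytes:
        header["data_length"] = len(data_bytes)
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + data_bytes + payload


def _read_exactly(stream: BinaryIO, n: int) -> bytes:
    buf = stream.read(n)
    if len(buf) != n:
        raise EOFError(f"expected {n} bytes, got {len(buf)}")
    return buf


def read_event(stream: BinaryIO) -> Optional[Event]:
    """Read one framed event; returns None when the peer has closed the stream."""
    line = stream.readline()
    if not line:
        return None
    header = json.loads(line)
    data_length = header.get("data_length") or 0
    data = json.loads(_read_exactly(stream, data_length)) if data_length else {}
    payload = _read_exactly(stream, header.get("payload_length") or 0)
    return header["type"], data, payload


if __name__ == "__main__":
    frame = encode_event("audio-start", {"rate": 16000, "width": 2, "channels": 1})
    print(read_event(io.BytesIO(frame)))
```

Because `read_event` only needs `readline()` and `read()`, the same function works on `sock.makefile("rb")` and on an in-memory buffer in tests.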

light gale

Got it. No checksums?

worthy granite

Nope, I assume TCP is working 😄

light gale

Fingers crossed. 🙂
Thank you! I'll try to work it out. 🙂

Oh, one more thing: just from the perspective of getting it working, should I try openWakeWord or microWakeWord? They both use a hell of a lot of Python libraries that I might get stuck on, but maybe one is easier than the other? 🙂
Porcupine is too expensive for my goal...

(basically I'd like to make everything open source and readily available)

light gale

@worthy granite sorry for bothering you again, and thanks in advance for your help! 🙂
So far I've ported the events to Android and am starting to port the satellite itself. At the first stage I'll go with full streaming to get it working, and then decide what to port for wake word and VAD.

At this stage, it would be really helpful to have a test server running to test the communication. Is there anything like this, or should I actually spin up a Home Assistant instance in Docker? That would be alright, but it's easier to debug if something is running in a console... 🙂

Thanks!

Also, any hints on the correct event flow would be super cool. I see the Wyoming satellite code, but it's Python, which I read like a foreign language (enough to understand, not enough to dive in).

worthy granite

So many Wyoming things I've never seen 😄
Do you mean a test server that mimics HA's side of the communication with a Wyoming satellite?

light gale

Hey @worthy granite! I'm digging into it bit by bit. 🙂
I already have the satellite connected to Home Assistant, and I can send announcements to it. However, I'm stuck on streaming audio from the satellite.
(I'm doing this iteratively; the first step is a PoC for an always-streaming satellite.)

What I have:
16-bit PCM, 16 kHz, single channel. I write it to a WAV file in parallel for debugging, and I can hear the voice clearly.
The communication looks like the original satellite's:

  1. Receiving "run-satellite"
  2. Receiving "describe"
  3. Sending "info"
  4. Sending "streaming-started"
  5. Sending "audio-start" with rate=16000, width=2, channels=1
  6. Streaming "audio-chunk" continuously (around 2k bytes per chunk; the size is chosen by Android. Maybe I need to make that smaller?)

Where am I going wrong? Something obvious?
Sorry for bothering you, and thanks in advance for your help!

light gale

Sending a message means sending metadata -> newline -> data JSON bytes -> payload bytes.

Here's a log of the communication start and the first chunk (I've reduced the buffer to 256 bytes, no luck):

Received: {"type": "run-satellite", "version": "1.5.4"}
Received: {"type": "describe", "version": "1.5.4"}
Sending: {"type":"streaming-started","version":"1.0.0"}
Sending: {"type":"info","version":"1.0.0","data_length":223}
Sending data: {"satellite":{"name":"Android Wyoming Satellite","attribution":{"name":"formatbce","url":"https://github.com"},"installed":true,"description":"Wyoming satellite on Android platform","version":"0.0.1-alpha","area":"Office"}}
Sending: {"type":"audio-start","version":"1.0.0","data_length":37}
Sending data: {"rate":16000,"width":2,"channels":1}
Sending: {"type":"audio-chunk","version":"1.0.0","data_length":37,"payload_length":256}
Sending data: {"rate":16000,"width":2,"channels":1}
Sending 256 bytes
... (repeating with new data)

light gale

Well, believe it or not, the solution came to me in a dream. I'd missed sending the "run-pipeline" message.

Works now
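
For an always-streaming satellite, that fix amounts to answering `run-satellite` with a `run-pipeline` request before starting the audio stream. The sketch below only builds the event tuples (type, data, payload) and leaves sending to whatever framing code is in place. The `start_stage`/`end_stage` values are copied from the `run-pipeline` log later in this thread, the 1024-sample chunk size is an arbitrary assumption, and `streaming-started` (which the log above also shows being sent) is left out.

```python
from typing import Iterator, List, Tuple

Event = Tuple[str, dict, bytes]  # (type, data, payload)

# 16-bit mono PCM at 16 kHz, as used in this thread.
AUDIO_FORMAT = {"rate": 16000, "width": 2, "channels": 1}


def on_server_event(event_type: str) -> List[Event]:
    """Events an always-streaming satellite sends in reply to a server event.

    Only run-satellite is handled here; describe/ping need their own replies.
    """
    if event_type == "run-satellite":
        return [
            # Stage names copied from the run-pipeline log in this thread.
            ("run-pipeline", {"start_stage": "asr", "end_stage": "tts"}, b""),
            ("audio-start", dict(AUDIO_FORMAT), b""),
        ]
    return []


def chunk_events(pcm: bytes, samples_per_chunk: int = 1024) -> Iterator[Event]:
    """Splits captured PCM into audio-chunk events (chunk size is arbitrary)."""
    step = samples_per_chunk * AUDIO_FORMAT["width"] * AUDIO_FORMAT["channels"]
    for offset in range(0, len(pcm), step):
        yield ("audio-chunk", dict(AUDIO_FORMAT), pcm[offset:offset + step])
```

Each `audio-chunk` repeats the rate/width/channels data, matching the chunk headers in the log above.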

Now there's another problem: how to correctly organize the flow of voice data while TTS is playing, and why HA disconnects from the satellite after TTS...

lone tree

I would be very interested to learn from @worthy granite how streaming generation is planned to be implemented. I’ve reviewed all the related PRs, but I still don’t see an answer to this question. Will there be a function responsible for accumulating the LLM response, splitting it into groups of sentences, and passing them to existing TTS servers? Or is the plan to implement a TTS server that directly handles the stream of chunks?
What system components still need refinement, and is there an approximate release date for this feature?

light gale

Debugging the connection. It looks like the audio data is hanging the satellite (especially if I'm streaming continuously and HA starts announcing back). I'm not sure how to ease these flows. Also, for some reason it announces once, and after that HA just switches the satellite entity to "responding" and doesn't send any following announcements at all...

lone tree
> light gale: Sorry, I opened this thread for conversation about already existing implementati...

The Wyoming TTS client in the system still does not support async_supports_streaming_input() and related functionality. But we will probably have to wait for the release of at least one TTS service with the new function to find out which solution will be applied. By the way, have you seen this project? https://github.com/jeffc/hassmic/ The app itself is a standalone satellite that also integrates via the Wyoming protocol (unfortunately, it requires a pipeline with an external wake word service).

worthy granite

Sorry for the late reply, I've been out for over a week! @light gale are you sending "audio-stop" and "played" from the satellite? The "played" event is needed to tell HA when the TTS announcement has finished.
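
On the satellite side, that means buffering the TTS audio between `audio-start` and `audio-stop`, playing it, and only then answering with `played`. A sketch, assuming a hypothetical blocking `play(pcm, audio_format)` callback (on Android it would wrap whatever audio output is used). Buffering the whole response is a simplification; a real player would more likely stream chunks as they arrive.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, dict, bytes]  # (type, data, payload)


class TtsPlayback:
    """Collects TTS audio from HA and reports `played` once it has been heard."""

    def __init__(self, play: Callable[[bytes, dict], None]) -> None:
        self._play = play          # hypothetical; must block until playback ends
        self._format: dict = {}
        self._buffer = bytearray()

    def on_event(self, event_type: str, data: dict, payload: bytes) -> List[Event]:
        """Returns the events to send back to HA (possibly none)."""
        if event_type == "audio-start":
            self._format = dict(data)   # rate / width / channels of the TTS audio
            self._buffer.clear()
        elif event_type == "audio-chunk":
            self._buffer.extend(payload)
        elif event_type == "audio-stop":
            self._play(bytes(self._buffer), self._format)
            self._buffer.clear()
            return [("played", {}, b"")]  # tells HA the announcement has finished
        return []
```

Sending `played` only after `play` returns is what lets HA know the announcement is really over, rather than just fully received.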

light gale

Also, I guess I don't need to send audio-start and audio-stop to the server, and streaming-started and streaming-stopped seem to be single-use too? Because with VAD I wanted to stop sending audio properly, but after streaming-stopped and then streaming-started again, HA dropped the connection.

worthy granite

Are you sending the ping/pong messages?

light gale
> worthy granite: Are you sending the ping/pong messages?

Okay, well, it looks like the server doesn't respond to ping messages at all...?
Anyway, I don't care about it; I just have a watchdog now that considers the connection broken if there's no ping from the server for 10 seconds.
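
In this setup HA is the one sending `ping`, so the satellite's half is to answer each `ping` with a `pong`; a watchdog like the one described can live in the same handler. A sketch, assuming the `pong` should echo the `text` field of the `ping` when one is present (an assumption worth checking against the reference implementation). The 10-second timeout is the value from the message above, and the injectable clock exists only to make the sketch testable.

```python
import time
from typing import Callable, List, Optional, Tuple

Event = Tuple[str, dict, bytes]  # (type, data, payload)


class PingHandler:
    """Answers HA's pings and flags the connection as stale when they stop."""

    def __init__(self, timeout_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_s
        self._clock = clock
        self._last_ping: Optional[float] = None

    def on_event(self, event_type: str, data: dict) -> List[Event]:
        if event_type != "ping":
            return []
        self._last_ping = self._clock()
        text = data.get("text")
        # Assumption: the pong echoes the ping's text, if any.
        return [("pong", {"text": text} if text is not None else {}, b"")]

    def connection_stale(self) -> bool:
        if self._last_ping is None:
            return False  # HA may not have started pinging yet
        return self._clock() - self._last_ping > self._timeout
```

The socket loop would call `on_event` for every received event and poll `connection_stale()` periodically, closing and re-listening once it returns True.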

Now I have other things... 🙂

  1. I wanted to stop streaming audio from the satellite while the TTS is received back. But if I send streaming-stopped, and then streaming-started after played, the server breaks the connection...
    Should I just stop sending audio chunks instead? It seems to work that way. But then I don't understand what those streaming start/stop events are for...
  2. Is there a mechanism to tell the server that the satellite is disconnecting? When I close the connection on my device, HA still shows the satellite as available...
  3. The initial connection process isn't clear... I'm sending info for every describe and adding the satellite to HA, but I don't get run-satellite from the server until I physically restart my satellite (sometimes after 2 restarts). After that, it starts working properly (getting run-satellite on each restart). Should I explicitly reconnect after describe -> info?... I don't see that logic anywhere in other projects...

Thanks in advance, Mike!

light gale

@worthy granite For some reason I stopped receiving pings and run-satellite:

2025-06-12 16:03:16.610 DEBUG (MainThread) [homeassistant.components.wyoming.config_flow] Zeroconf discovery info: ZeroconfServiceInfo(ip_address=ZeroconfIPv4Address('192.168.1.118'), ip_addresses=[ZeroconfIPv4Address('192.168.1.118'), ZeroconfIPv6Address('fe80::bc90:edff:fec0:2fe0')], port=10700, hostname='Android_1VME4EDT.local.', type='_wyoming._tcp.local.', name='android-wyoming-satellite._wyoming._tcp.local.', properties={'': None})
2025-06-12 16:03:25.204 DEBUG (MainThread) [homeassistant.components.wyoming.assist_satellite] Connecting to satellite at 192.168.1.118:10700
2025-06-12 16:03:25.366 DEBUG (MainThread) [homeassistant.components.wyoming.assist_satellite] Connected to satellite
2025-06-12 16:03:30.370 DEBUG (MainThread) [homeassistant.components.wyoming.assist_satellite] TimeoutError: 
2025-06-12 16:03:30.370 WARNING (MainThread) [homeassistant.components.wyoming.assist_satellite] Satellite has been disconnected. Reconnecting in 10 second(s)
2025-06-12 16:03:33.371 DEBUG (MainThread) [homeassistant.components.wyoming.assist_satellite] Disconnecting from satellite
2025-06-12 16:03:33.372 DEBUG (MainThread) [homeassistant.components.wyoming.assist_satellite] Connecting to satellite at 192.168.1.118:10700
2025-06-12 16:03:33.410 DEBUG (MainThread) [homeassistant.components.wyoming.assist_satellite] Connected to satellite
2025-06-12 16:03:38.412 DEBUG (MainThread) [homeassistant.components.wyoming.assist_satellite] TimeoutError: 
2025-06-12 16:03:38.412 WARNING (MainThread) [homeassistant.components.wyoming.assist_satellite] Satellite has been disconnected. Reconnecting in 10 second(s)
...

It connects, sends describe, I respond with info, and that's it...

light gale

Could it be something with the newest HA version?

Because my logic didn't change at all, and yet I can't for the life of me get my HA to register the satellite.

It sees my info message, because without it the satellite wouldn't appear in Discovered devices.

But it can't add the satellite; it keeps trying to reconnect... and doesn't send anything to the satellite.

light gale

I moved back a couple of days in my code, and that doesn't work either, which probably means something changed on the HA side?...

light gale

Still struggling with it. My loop waiting on the socket input channel constantly returns nulls: no unknown data, nothing...

I actually can't remember whether anything else was there before. As far as I remember, the describe -> info pair was pretty much everything I had; after that there was just run-satellite...
But HA still says "Unable to connect". Which is ridiculous, because it just sent describe and received info on that socket...
Should I try resetting the connection right after sending info?...

light gale

Today I succeeded in connecting the satellite to HA, with no code changes on my side. The config entry had been added yesterday and kept trying to connect all night. Today I restarted the satellite app several times, and eventually the config entry connected and showed all corresponding entities. Then I launched the "Set up voice assistant" flow, setting up the pipeline with OWW.
However, run-satellite was sent only after another satellite restart, and I responded with run-pipeline.
But the pipeline still wasn't ready, I guess; the server was just sending pings, and that's it.
Only the next time I restarted the satellite did I get run-satellite, respond with run-pipeline and get detect back, so I was able to start streaming...

Well huh, after the next restart I didn't receive anything but describe again, so I'm back to square one.

light gale

Okay, I found one of the problems: it's mDNS.
It looks like if I have mDNS on, HA cannot add/connect to the satellite. After each restart, HA seems to treat the satellite as a completely new device.
I'm not sure how to avoid this, but for now I've just disabled zeroconf completely, and it sort of works.

Now another problem is that I have to restart the satellite at least 2 times to get it connected the first time:

  1. Launching the satellite and initiating a config entry on the HA side with IP/port. The describe->info exchange happens, and HA adds the config entry in a disabled "Unable to connect" state. No further Wyoming interaction from HA (no ping, nothing).
  2. Restarting the satellite. HA connects to the satellite, the describe->info exchange happens, and the config entry initializes its entities, but again there is no further Wyoming interaction from HA (no ping, nothing).
  3. Restarting the satellite again. HA connects to the satellite, the describe->info exchange happens, and HA starts sending pings and sends the run-satellite command. The flow initializes successfully, and everything works as expected afterwards.

I need to understand why these socket disconnections are happening.
It would also be nice to have mDNS working as expected.
Also (unrelated to the above), if the TTS response is long and I start streaming right after playing it, HA disconnects the socket too... Should I wait for some time before streaming again?..

lone tree
light gale
> lone tree: >if the TTS response is long, and i start streaming right after playing it, HA d...

Thanks, the diagram seems right, but my situation is about what happens after audio-stop is received. My streaming satellite stops sending audio-chunk while the pipeline is running, until it receives audio-stop and detect; then it waits for all the audio to be played by the local player, sends played, and starts sending the audio-chunk stream again (no local wake word).
It works if the speech was relatively short, but if it's fairly long, I get a "broken pipe" error after several packets, which means HA disconnected from the socket.
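
The flow described here (pause the mic stream while HA handles the request, resume after `played`) can be kept separate from the socket code as a small gate. A sketch under two assumptions: that `voice-stopped` is a reasonable trigger for pausing, and that simply withholding `audio-chunk` events is enough, which is what worked earlier in the thread instead of toggling `streaming-stopped`/`streaming-started`. It does not by itself address the disconnect after long responses.

```python
from typing import List, Tuple

Event = Tuple[str, dict, bytes]  # (type, data, payload)


class MicGate:
    """Decides whether captured mic audio should be forwarded to HA."""

    def __init__(self) -> None:
        self.forwarding = True

    def on_server_event(self, event_type: str) -> None:
        if event_type == "voice-stopped":
            self.forwarding = False  # user finished speaking; HA is now processing

    def on_local_playback_done(self) -> List[Event]:
        self.forwarding = True       # reopen the mic stream
        return [("played", {}, b"")]

    def filter_chunk(self, pcm: bytes, audio_format: dict) -> List[Event]:
        """Wraps a mic buffer as audio-chunk, or drops it while paused."""
        if not self.forwarding:
            return []
        return [("audio-chunk", dict(audio_format), pcm)]
```

Keeping the gate independent of the socket makes it easy to add other reopen triggers later, for example a timeout when no TTS follows.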

worthy granite

Is HA disconnecting because of a missing ping/pong? There is an overall timeout per pipeline too, so maybe this is being hit.

light gale
> worthy granite: Is HA disconnecting because of a missing ping/pong? There is an overall timeout ...

OK, I ripped through everything: apparently HA disconnects right after receiving "info" and expects the satellite to restart the socket. This happens several times during initial setup, but now I've managed it (even onboarding is showing).
Now I'm basically done with the implementation shenanigans; the only thing that's bothering me is the lack of communication from the server.
Can you help me, please?

First and foremost: how do I distinguish "nevermind"?
Normal communication from the server is:

"voice-started"
... voice...
"voice-stopped"
"transcript"
... intent, tts...
"synthesize"
"audio-start"
... chunks ...
"audio-stop"

But if I say "nevermind", it stops after transcript. And since there can be about 10 seconds between transcript and synthesize, I can't even set a decent timeout for going back to idle...
So the question is: is there some indication that the pipeline has ended?

Another thing: is it possible to get info about timers on start? Like, when I've restarted the satellite, can I get the active timers from HA? I don't know how, if at all... 🙂

Thank you in advance, Mike! I hope these are simple questions.

worthy granite

For the "nevermind" issue, you should get a run-end message. If no TTS events or audio have been sent, this means the pipeline ended early.

The audio start/stop events also end up being used for each audio chunk, which is why I added synthesize-stopped to indicate that the TTS system is completely done producing audio chunks.

lone tree

Two methods are used for compatibility; as far as I understand, this is not a mandatory requirement. In my alternative client, I simply check the server for streaming capability. This places a little more responsibility on the user (don't change the server type on the fly).

light gale
> worthy granite: For the "nevermind" issue, you should get a `run-end` message. If no TTS events ...

I'm not receiving run-end...

Sending: {"type":"run-pipeline","version":"1.5.4","data_length":39}
Sending data: {"start_stage":"asr","end_stage":"tts"}
Received: {"type": "transcribe", "version": "1.5.4", "data_length": 21}
Received: {"type": "voice-started", "version": "1.5.4", "data_length": 19}
Received: {"type": "voice-stopped", "version": "1.5.4", "data_length": 19}
Received: {"type": "transcript", "version": "1.5.4", "data_length": 23}

And that's it

light gale
> worthy granite: The audio shouldn't be sent twice, but the text to synthesize is: first in chunk...

I meant this part:

Streaming:

  1. → synthesize-start event (required)
  2. → synthesize-chunk event (required): text chunks are sent as they're produced
  3. ← audio-start, audio-chunk (one or more), audio-stop: audio chunks are sent as they're produced, with start/stop
  4. → synthesize event: sent for backwards compatibility
  5. → synthesize-stop event: end of text stream
  6. ← final audio must be sent: audio-start, audio-chunk (one or more), audio-stop
  7. ← synthesize-stopped: tells server that final audio has been sent

in here: https://github.com/OHF-Voice/wyoming
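
To make that ordering concrete, here is a sketch of a TTS service following it: text from `synthesize-chunk` is accumulated, each completed sentence is synthesized and sent as its own `audio-start`/`audio-chunk`/`audio-stop` group, the compatibility `synthesize` is ignored once streaming has started, and `synthesize-stop` flushes the rest before `synthesize-stopped`. The `synth` callback, the output format, the naive sentence splitter, and the assumption that chunk data carries a `text` field are all illustrative, not taken from the protocol docs.

```python
import re
from typing import Callable, List, Tuple

Event = Tuple[str, dict, bytes]  # (type, data, payload)
OUT_FORMAT = {"rate": 22050, "width": 2, "channels": 1}  # hypothetical output format


class StreamingTtsHandler:
    """TTS-service side of the streaming exchange, following the order above."""

    def __init__(self, synth: Callable[[str], bytes]) -> None:
        self._synth = synth      # hypothetical: text -> PCM in OUT_FORMAT
        self._pending = ""
        self._streaming = False

    def _speak(self, text: str) -> List[Event]:
        text = text.strip()
        if not text:
            return []
        return [("audio-start", dict(OUT_FORMAT), b""),
                ("audio-chunk", dict(OUT_FORMAT), self._synth(text)),
                ("audio-stop", {}, b"")]

    def on_event(self, event_type: str, data: dict) -> List[Event]:
        if event_type == "synthesize-start":
            self._streaming, self._pending = True, ""
            return []
        if event_type == "synthesize-chunk":
            self._pending += data.get("text", "")
            # Speak every finished sentence; keep the unfinished tail pending.
            *done, self._pending = re.split(r"(?<=[.!?])\s+", self._pending)
            return [ev for sentence in done for ev in self._speak(sentence)]
        if event_type == "synthesize":
            # Streaming clients send this only for backwards compatibility.
            return [] if self._streaming else self._speak(data.get("text", ""))
        if event_type == "synthesize-stop":
            # The quoted order requires final audio here; this sketch only
            # emits it when some text is still pending.
            events = self._speak(self._pending)
            self._pending, self._streaming = "", False
            return events + [("synthesize-stopped", {}, b"")]
        return []
```

Because each sentence gets its own start/stop group, a receiver cannot treat the first `audio-stop` as the end of the response; `synthesize-stopped` is what marks that.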

light gale

With Wyoming it's a bit confusing, eh?
Socket-wise, the satellite is the server and HA is the client. But logically, HA is the server.
So if I say "nevermind", that speech goes to HA, and then HA should send "run-end" to the satellite so it knows the pipeline is done. But I don't see run-end on the satellite side...
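
Since the satellite is the TCP server here, its accept loop can be sketched with `asyncio.start_server`. This is a minimal sketch, not the wyoming-satellite code: it answers `describe` with `info` (data trimmed from the satellite's log earlier in the thread) and `ping` with `pong` (assuming the `text` is echoed back), and leaves other events as a stub. Port 10700 is the one from the HA log above; the helper names are hypothetical.

```python
import asyncio
import json
from typing import Optional, Tuple

# Trimmed from the satellite's info log earlier in the thread.
INFO = {"satellite": {"name": "Android Wyoming Satellite", "installed": True,
                      "attribution": {"name": "formatbce", "url": "https://github.com"},
                      "description": "Wyoming satellite on Android platform",
                      "version": "0.0.1-alpha"}}


async def write_event(writer: asyncio.StreamWriter, event_type: str,
                      data: Optional[dict] = None, payload: bytes = b"") -> None:
    data_bytes = json.dumps(data).encode("utf-8") if data else b""
    header = {"type": event_type, "version": "1.0.0"}
    if data_bytes:
        header["data_length"] = len(data_bytes)
    if payload:
        header["payload_length"] = len(payload)
    writer.write(json.dumps(header).encode("utf-8") + b"\n" + data_bytes + payload)
    await writer.drain()


async def read_event(reader: asyncio.StreamReader) -> Optional[Tuple[str, dict, bytes]]:
    line = await reader.readline()
    if not line:
        return None  # HA closed the connection
    header = json.loads(line)
    n = header.get("data_length") or 0
    data = json.loads(await reader.readexactly(n)) if n else {}
    payload = await reader.readexactly(header.get("payload_length") or 0)
    return header["type"], data, payload


async def handle_connection(reader: asyncio.StreamReader,
                            writer: asyncio.StreamWriter) -> None:
    """Serves one HA connection until HA disconnects."""
    try:
        while (event := await read_event(reader)) is not None:
            event_type, data, _payload = event
            if event_type == "describe":
                await write_event(writer, "info", INFO)
            elif event_type == "ping":
                text = data.get("text")
                await write_event(writer, "pong", {"text": text} if text else None)
            # run-satellite, audio-start/chunk/stop, timers, ... would go here
    finally:
        writer.close()


async def serve(host: str = "0.0.0.0", port: int = 10700) -> None:
    server = await asyncio.start_server(handle_connection, host, port)
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(serve())` starts listening; each HA (re)connection gets its own `handle_connection` task, so a dropped socket does not take the listener down.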

light gale

I'm logging every line I receive on the satellite side.

worthy granite

Oh, I guess that's only internal to HA. Maybe I was expecting run-satellite to be sent multiple times 🤔
I should probably add some kind of end event there 😄
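
Until such an event exists, a satellite can combine signals: treat `audio-stop` (or a `run-end`, should one ever arrive) as the end, and otherwise time out if no `synthesize`/`audio-start` follows a `transcript`. A sketch; the 20-second grace period is an assumption chosen to exceed the roughly 10-second transcript-to-synthesize gap reported above, and if TTS audio arrives in several start/stop segments, ending on the first `audio-stop` would be too early.

```python
import time
from typing import Callable, Optional


class PipelineTracker:
    """Guesses when HA's pipeline is over, since run-end does not reach the satellite."""

    def __init__(self, tts_grace_s: float = 20.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._grace = tts_grace_s   # assumption: longer than the ~10 s gap seen here
        self._clock = clock
        self._transcript_at: Optional[float] = None
        self._ended = False

    def on_event(self, event_type: str) -> None:
        if event_type == "transcribe":          # a new pipeline run starts
            self._ended, self._transcript_at = False, None
        elif event_type == "transcript":
            self._transcript_at = self._clock()
        elif event_type in ("synthesize", "audio-start"):
            self._transcript_at = None          # TTS is coming; wait for audio-stop
        elif event_type in ("audio-stop", "run-end"):
            self._ended = True

    def finished(self) -> bool:
        if self._ended:
            return True
        if self._transcript_at is None:
            return False
        return self._clock() - self._transcript_at > self._grace
```

Polling `finished()` from the streaming loop gives a single place to decide when to return to idle, whichever signal arrives first.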

light gale

I'd also make timers a bit more robust... Other things seem to be working excellently.

light gale

There's also a lot of info that the satellite sends about wake words etc.; it would be nice to have those controls in HA actually working :))

worthy granite

Definitely! I also want to have different pipelines for different wake words, a "stop" like the VPE, etc. 😄

light gale
> worthy granite: The audio shouldn't be sent twice, but the text to synthesize is: first in chunk...

Hey Mike!
After moving to HA 2025.7.0, I'm getting this after audio-start:

Received: ����|�~�{�y�y�{�|�x�x�u�v�v�w�x�v�x�x�x�z�w�w�|�~�������}�~���... (raw binary, truncated)

And of course the parser fails.

light gale

Oh okay, it's actually happening because of an exception, which in turn happens because the payload length of the audio chunk is 700+ kilobytes. Is that normal?

@worthy granite I realized that for cached responses, a single audio-chunk containing the full response is now returned. Is that intended, or is it a bug? Because if it's intended, I'll have to rework the data logic (right now I'm using a byte array pool, but that has a restricted max size)..

What doesn't kill us makes us stronger, right? I'm now allocating pools of up to 1 MB... Hope that's enough.
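
The garbage printed above is what a reader typically shows once it falls out of sync: if any of the `payload_length` bytes are left unread, the next "header line" is raw PCM. A sketch of a reader that always consumes exactly `payload_length` bytes, allocating per event instead of from a fixed-size pool, and skipping (rather than failing on) payloads above a cap. The 8 MB cap and the helper names are assumptions.

```python
import io
import json
from typing import BinaryIO, Optional, Tuple

MAX_PAYLOAD = 8 * 1024 * 1024  # assumption: cached TTS replies here were 700+ KB


def read_exactly(stream: BinaryIO, n: int) -> bytes:
    """Reads exactly n bytes into a buffer sized for this event."""
    buf = bytearray(n)
    view = memoryview(buf)
    got = 0
    while got < n:
        k = stream.readinto(view[got:])
        if not k:
            raise EOFError(f"stream ended after {got} of {n} bytes")
        got += k
    return bytes(buf)


def skip_exactly(stream: BinaryIO, n: int, piece: int = 64 * 1024) -> None:
    """Discards n bytes in bounded pieces so the stream stays aligned."""
    while n > 0:
        chunk = stream.read(min(n, piece))
        if not chunk:
            raise EOFError("stream ended while skipping payload")
        n -= len(chunk)


def read_event(stream: BinaryIO, max_payload: int = MAX_PAYLOAD
               ) -> Optional[Tuple[str, dict, Optional[bytes]]]:
    line = stream.readline()
    if not line:
        return None
    header = json.loads(line)
    data_length = header.get("data_length") or 0
    data = json.loads(read_exactly(stream, data_length)) if data_length else {}
    size = header.get("payload_length") or 0
    if size > max_payload:
        skip_exactly(stream, size)   # drop it, but keep the next header readable
        return header["type"], data, None
    return header["type"], data, read_exactly(stream, size)


if __name__ == "__main__":
    pcm = bytes(700_000)
    head = json.dumps({"type": "audio-chunk", "payload_length": len(pcm)}).encode()
    stream = io.BytesIO(head + b"\n" + pcm + b'{"type": "audio-stop"}\n')
    print([ev[0] for ev in iter(lambda: read_event(stream), None)])
```

Whatever the buffering strategy, the key invariant is that every header is followed by exactly `data_length + payload_length` consumed bytes before the next `readline()`.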

lone tree
> light gale: @worthy granite i realized, that for cached responses single audio-chunk i...

The components can choose which method to request from TTS, depending on the type of data being sent. If it receives a string, full synthesis occurs; if it receives an asynchronous generator, streaming begins.
Currently there is no smart selection (as when working with an LLM) for announcement actions, tts.speak, and full responses, so a full message is always sent. This may change over time.
This is what the solution looks like: streaming is returned at the response stage when an automation passes the full text to set_conversation_response.
To maintain compatibility with older methods, you will probably have to use a solution with an increased buffer.

light gale