#"Sorry, I didn't understand that" should have a (configurable) number of retries


signal bloom
#

It is very unpleasant to have to say the wakeword over and over again when the instruction is not understood.

"<wakeword>" starts listening "do X with Y"
"Sorry, I didn't understand that"
"<wakeword>" starts listening "do X with Y"
"I don't know of a device called 'wise'"
"<wakeword>" starts listening "do X with Y"
"Okay, I did X with Y"

It would be far more natural for it to ask you to repeat and start listening immediately at least a couple of times. (I mean, it seems unlikely that people will just immediately give up on the first instance of it not understanding).

"<wakeword>" starts listening "do X with Y"
"Sorry, I didn't understand, can you repeat?" starts listening
"Do X with Y"
"I may have misheard as I don't know of a device called 'wise'. Can you say it clearer?" starts listening
"Do ecks with why"
"Okay, I did X with Y"

Obviously, a retry of 0 would result in the current behaviour and people can set their retry count to whatever suits them.
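The proposal above can be sketched as a small loop. This is only an illustration of the idea, not Home Assistant code: `run_with_retries`, `listen`, `handle` and `max_retries` are hypothetical names, with `listen` standing in for one STT capture and `handle` for the conversation agent (returning `None` when it did not understand).

```python
from typing import Callable, Optional, Tuple


def run_with_retries(
    listen: Callable[[], str],
    handle: Callable[[str], Optional[str]],
    max_retries: int = 2,
) -> Tuple[Optional[str], int]:
    """Listen once after the wake word, then re-listen up to max_retries times.

    Returns (response, number_of_listens). The response is None if every
    attempt failed. max_retries=0 reproduces the current single-shot behaviour.
    """
    listens = 0
    for listens in range(1, max_retries + 2):
        text = listen()            # hypothetical: one STT capture
        response = handle(text)    # hypothetical: None means "not understood"
        if response is not None:
            return response, listens
        # Here the assistant would say "Sorry, can you repeat?" and listen again.
    return None, listens


# Simulated session: the first transcript is misheard, the second is correct.
heard = iter(["do ex with wise", "do X with Y"])
known = {"do X with Y": "Okay, I did X with Y"}
response, listens = run_with_retries(lambda: next(heard), known.get, max_retries=2)
```

In this simulated session the wake word is spoken once and the instruction twice, which is the behaviour the second transcript above describes.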

frank marsh
#

What do you use for STT? Looks like it gives you quite a lot of trouble. 🙂

signal bloom
#

Whisper using the tiny model, but I don't see how that is relevant? When it doesn't understand, it's annoying to have to wakeword it again considering the likelihood of repeating the instruction is nearly 100% at least once.

frank marsh
#

Well, good STT would free you from this, regardless of the error handling (it's actually a good idea to have some action on_error).
I'm sure if you go to the traces, you will see that Assist was seeing a completely incorrect sentence. So having good STT is better than repeating the same sentence because it heard you wrong.

signal bloom
#

Well, yes, if the STT doesn't fail to understand me I don't have to worry about having to repeat myself, but again, that's irrelevant to the point I'm making.

#

When it doesn't understand you, pretty much everyone will repeat the instruction at least once, so it should automatically listen again.

#

Besides, if I ask it "Where is <xyems_name>" it says it doesn't understand, because the STT returns a different spelling of my name. So the STT was actually pretty much successful, and I still ended up repeating myself (because I didn't know it was getting the spelling wrong).

#

It was also strange that yesterday, I asked it to "turn off the door lights" and it responded with "Sorry, I don't know of any device named door lights". It got it right on the second try. Unfortunately, that has scrolled off the debug history so I can't see what it actually heard either time.

#

(As an aside, regarding "good STT," when I figure out how to pass STT and TTS to my machines that are actually set up with such services backed by decent GPUs, I may have that 🙂)

frank marsh
#

Regarding the latter - I use Piper and Whisper on Docker on my Proxmox server, running on a Lenovo ThinkCentre mini-PC.

On the topic: I believe it should be configurable, not just repeat. Repeat could be one of the options. But in general, it would be cool to have an error-catching callback, with the error type as a variable. E.g. I'd try sending the same invalid request to a different conversation agent (e.g. an LLM), which may be able to process the request. Or launch another script, if the error was "several entities with name .."
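A rough sketch of what such an error callback could look like, dispatching on the error type. Everything here is an assumption made for illustration: `AgentResult`, `process`, the handler mapping and the error strings (`"no_intent_match"`, `"duplicate_name"`) are hypothetical and are not taken from Home Assistant's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class AgentResult:
    response: Optional[str] = None
    error: Optional[str] = None  # hypothetical error type, e.g. "no_intent_match"


Agent = Callable[[str], AgentResult]
ErrorHandler = Callable[[str, AgentResult], AgentResult]


def process(text: str, agent: Agent, on_error: Dict[str, ErrorHandler]) -> AgentResult:
    """Run the agent; on failure, hand over to the handler registered for that error type."""
    result = agent(text)
    if result.error is None:
        return result
    handler = on_error.get(result.error)
    # With no handler registered, the original error is returned unchanged.
    return handler(text, result) if handler else result


def default_agent(text: str) -> AgentResult:
    # Stand-in for the built-in agent: only understands one exact sentence.
    if text == "turn off the door lights":
        return AgentResult(response="Turned off the door lights")
    return AgentResult(error="no_intent_match")


def fallback_agent(text: str) -> AgentResult:
    # Stand-in for a second conversation agent such as an LLM.
    return AgentResult(response=f"(fallback) handled: {text}")


handlers: Dict[str, ErrorHandler] = {
    "no_intent_match": lambda text, _result: fallback_agent(text),
}

result = process("switch the door lights off please", default_agent, handlers)
```

A "repeat and listen again" handler would be just another entry in the mapping, so repeating stays one option among several.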

signal bloom
#

Sure, that makes sense.

frank marsh
#

Yeah that's not enough juice to use higher beams.

signal bloom
#

Yes, hence why I want to offload it onto one of my other machines that is intended for AI, like the 256GB RAM+RTX3090 24GB machine I have 😄

lunar dirge
#

Just an idea: since we do have the HassNevermind intent, i think the re-triggering could either be turned off (like it is now) or go on forever until you say "nevermind", when it stops. I don't really see the need for a certain number of retries

signal bloom
#

Providing it understands you saying "nevermind"..

#

Given we are discussing it not understanding the user, perhaps we should bear in mind that it might not..

lunar dirge
#

"understanding" (=transcribing) is the job of the STT engine, of which HA/Nabu Casa controls exactly zero 🙂
what we are discussing here is a feature of the default conversation agent, which is entirely under HA/NC's control
if your STT does not understand "nevermind", then maybe you should optimize your usage of STT (bigger, better model, larger beam size, different service etc.). formatBCE already suggested that

signal bloom
#

I am aware that it is the STT that converts the audio to text.

#

That has nothing to do with making HA drop into an infinite loop because it never outputs "nevermind".

signal bloom
#

I know, which is why my suggestion was configurable "retries" to, at the very least, put a maximum cap on it which aligns with my personal tolerance for not being understood.

#

Being able to break with "nevermind" is nice too though and I wasn't aware that was available.

lunar dirge
#

but to be fair, i've not presented a polished feature with full corner case support. i've simply covered the happy path. you could either have a max number of retries (like you said) or simply interpret silence as a means of cancelling the interaction

signal bloom
#

You don't need to present such a thing. What you described was a potentially indefinite, user-facing loop, which would be a terrible experience.

#

I mean, more configurable is more good in my opinion. I prefer the predictability of "when woken, there are up to 3 attempts to understand the instruction which can be terminated early by 'nevermind'"

#

A "silence for X time" consuming an attempt is nicer to me than it terminating.
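The combination described in the last few messages (a cap on attempts, early exit on "nevermind", silence using up an attempt) could be sketched like this. The names `converse`, `CANCEL_PHRASES` and `understand` are hypothetical, and each item in `transcripts` stands in for one listening window, with `None` meaning the window timed out in silence.

```python
from itertools import islice
from typing import Callable, Iterable, Optional

CANCEL_PHRASES = {"nevermind", "never mind"}  # assumed cancel wording


def converse(
    transcripts: Iterable[Optional[str]],
    understand: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> str:
    """Return the outcome of one wake-word session with at most max_attempts listens."""
    for text in islice(transcripts, max_attempts):
        if text is None:
            continue  # silence consumes an attempt but does not end the session
        if text.strip().lower() in CANCEL_PHRASES:
            return "cancelled"
        response = understand(text)  # None means "not understood"
        if response is not None:
            return response
    return "gave up"


commands = {"turn off the door lights": "Turned off the door lights"}

after_silence = converse([None, "turn off the door lights"], commands.get)
cancelled = converse(["turn of the dor lights", "nevermind"], commands.get)
capped = converse(["a", "b", "c", "turn off the door lights"], commands.get)
```

Because the loop is bounded by `max_attempts`, an STT engine that never outputs "nevermind" cannot keep the session open indefinitely.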

#

Though I will admit, part of that is because I have found it to be incredibly unreliable so far. There are precious few instances of me actually knowing what state it is in..

lunar dirge
#

I mean, more configurable is more good in my opinion
No dev and few product managers will agree with that statement 😅

But please feel free to add a feature request to the community forums

signal bloom
#

I am a developer.

lunar dirge
#

Well then add the feature

sick pasture
#

From what I have read here, I'd say the root issue is you are trying to solve a problem that already has a solution. 😦 Whisper tiny is a very primitive model, so it's pretty much expected that it'd give you these kinds of problems where you are misunderstood. The solution is to use the larger model, where you are far less likely to be misunderstood and therefore far less likely to have to repeat yourself, eliminating the premise of this problem altogether. 🙂 Using the large model with beam 5, I can't remember the last time I had to repeat a command 😁

#

Though to be fair, getting the wyoming faster-whisper set up to use GPU requires more advanced knowledge and work, but it is possible. Can also make use of the STT/TTS services from Nabu if you have a subscription, though I can understand not wanting to use a cloud service.

signal bloom
#

It takes over a minute to get a response with the medium model with beam 1.

#

I would like it to respond to me this side of 2025.

sick pasture
#

On my setup responses on large beam 5 are within 200ms. Again, have to do the GPU setup to get to that 😉

signal bloom
#

Yes, and I don't have a GPU on the machine HA is running on..

sick pasture
#

Yeah, you mentioned another "AI" machine. You can install faster whisper on a separate machine running docker with a GPU, and connect it using the wyoming protocol

signal bloom
#

And, you know.. people run HA on Raspberry PIs.

sick pasture
#

Yes. But there's a reason giant datacenters run models like these for the masses 🙂

#

you'll never get Google/Alexa/OpenAI quality on an RPI

signal bloom
#

I don't care what they do because access to those can be lost at any time.. which is why I run these things locally.

sick pasture
#

It just doesn't have the horsepower

signal bloom
#

I don't want Google/Alexa/OpenAI quality on an RPi. What I want is for it to not be annoying when it doesn't understand.

sick pasture
#

What you are describing is something I think Nabu is looking into from what I have read, but I don't think it is there yet.

#

(continuous conversation)

#

Think someone in one of the posts managed to hack something together to make it sort of work

#

but requires a lot of footwork to get there.

signal bloom
#

That's something else. I just want to not have to repeat the wakeword over and over again because no-one converses that way.

sick pasture
#

Yeah, that's literally continuous conversation. The idea of initiating a conversation, and have a back and forth

#

instead of having to use wakeword on each interaction.

signal bloom
#

No, I don't want a back and forth. I just want it to relisten when it fails to understand up to X times.

#

There is no "continuation" between attempts.

sick pasture
#

Then I guess, as others have said, implement the feature as you see fit. Beauty of open source 🙂

signal bloom
#

If I continue to use HA, I'll contribute in such a way, but right now, that seems unlikely.

frank marsh
#

So all this conversation was with the background of you stopping using HA? 🙂

signal bloom
#

Hardly. The conversation was about making HA behave in a more pleasant way so I don't want to stop using it.

#

If you've read my general comments, I've had a rather unpleasant experience trying to get HA to work. I'd like to use that experience to make HA better, but if that's not wanted, that's fine.

lunar dirge
#

My dude, 3 people have independently told you that you're using crap technology not controlled by HA (due to hardware limitations, I understand). All 3 of these people have suggested how to get better results.
Still, you want to stop using HA because it doesn't behave well enough with your crap tech that they don't control and can't improve.
At the same time, despite your self-proclaimed development abilities, you think it's unlikely you'll go on using HA to make it worth the investment to contribute in order to have it behave the way you want.
That looks like a dead end to me, in which case... why are you still complaining instead of just waving it goodbye?

#

I'd like to use that experience to make HA better, but if that's not wanted, that's fine.
It's wanted as long as you actually contribute with more than "this is crap and here's how it should work, now run along and make it so"

signal bloom
#

The hardware is better than the minimum hardware officially supported and at no point have I asked anyone to improve it. My suggestion was to improve the end user experience by changing how HA behaves in situations that are more likely to occur at this "lower end".

lunar dirge
#

minimum hardware officially supported
officially supported by whom?

signal bloom
#

Perhaps "endorsed" would be a more appropriate term. I did have HA installed on an RPi 4, as it was listed on the website as being suitable and I had an issue where the Pi would power cycle using the voice assistant using only the official addons..

#

However, as I investigated that, I realised the hardware was insufficient and changed to the current equipment it is installed on.

graceful silo
#

This is a really poor argument

signal bloom
#

Sorry, what do you mean?

graceful silo
#

You're expecting minimum spec machines to perform high end functionality.

signal bloom
#

I am not though. Where have you got the idea that I am?

graceful silo
#

It's like expecting a base computer that can run windows would be built for autocad

#

or expecting to enter your ford focus into formula 1

#

Yeah, you can do it, but the results aren't going to be favorable

lunar dirge
#

Xyem, I get your argument: "make this stuff configurable so it would work on low-end hardware, which would also benefit me"
what you don't seem to get is that you're putting a band-aid on a broken bone here. and when I suggested an alternative to the band-aid (still nowhere near as good as a cast, but better than a band-aid in situations other than a broken bone), you keep mentioning the broken bone you don't want to fix, complaining about the fact that nowhere on the band-aid box does it say that it can't be placed over broken bones 🙂

signal bloom
#

And from my perspective, I have a graze, not a broken bone, so I don't know why everyone is suggesting I get a cast instead of the band-aid I'm asking for and would be quite happy with..

lunar dirge
#

so I don't know why everyone is suggesting I get a cast instead
because you don't seem to have the experience/understanding to diagnose yourself, while rejecting diagnostics from people who do. Sorry if that was too harsh, but it's the truth

signal bloom
#

Heh, I find this analogy quite amusing given some recent personal experience with a medical issue where my diagnosis was actually more accurate than a medical professional. I get what you mean though.

graceful silo
#

The analogy doesn't really fit because you know more about how you feel than your doctor does. The doctor is going to start with the "most likely" diagnosis and move on from there. It's quite the opposite with what we are discussing at hand.

signal bloom
#

Mhm, that's the problem with analogies, but I think I understood the point telele was making anyway.

sick pasture
#

I think where a lot of people are going here is that your solution could work, but if you think about this from a general design, it's not ideal. If someone has to repeat themselves multiple times like that just to turn on a light, they likely have either taken out their phone and turned on the light that way, or just walked over to the switch and turned it on. Most users don't want to sit there repeating themselves to get the light to turn on. So the ideal UX is that the command works on the first try, every try, without being misunderstood. To achieve that you just need to run a better model that is better at understanding speech. 🙂
Now I understand you are saying most/many people are running HA on a RPI, and aren't running special hardware to run larger models and such, totally get that. Those same users however are also likely running the Nabu Casa STT/TTS instead of trying to run faster-whisper tiny, which would be highly accurate and avoid this as well.

#

But, you could also just put in a feature request on the HA github I guess, and maybe they consider implementing something like this with the other work they are doing for voice? 🙂

signal bloom
#

Nabu Casa STT/TTS is cloud stuff, I presume?

#

In regards to "If someone has to repeat themselves multiple times like that just to turn on a light, they likely have either taken out their phone and turned on the light that way, or just walked over to the switch and turned it on", that's kind of the point I'm suggesting. It's making that situation, while not ideal, much less jarring and annoying. I'm fine with repeating myself a couple of times (because I know I am running on limited hardware that might struggle in non-ideal situations) but having to wakeword it every time feels like I'm talking to a child that isn't listening to me, rather than a helpful assistant. The "couple of times" is my threshold to then "go and do it manually", as you say.

#

I mean, I've played games at sub 8 frames a second because I couldn't just go and buy a computer that could run it faster.. still had plenty of fun.

#

And, of course, now I can make people laugh (or cringe?) when I tell them I did that 😄