Problem statement:
When collaborating with others, Neuro sometimes takes a long time to respond, which creates an awkward silence and disrupts the pace of the conversation. As I understand it, there may be several reasons for this:
1) Slow internet.
2) Incorrect detection of the end of a phrase.
3) Long response generation.
My assumption is that long response generation is the most common cause. I may be wrong; if the problem is actually 1 or 2, you can stop reading.
I want to suggest a simple heuristic for reducing delays between phrases: add pre-recorded phrases (interjections) that are played once the end of the interlocutor's phrase is detected. Either immediately or after a timer expires, a random pre-recorded phrase is played (hmm, mmm, oh, uh, er, um, erm, m-hm, yeah, etc.) while the answer is generated in parallel, and then the real answer is played. As far as I know, 400 ms is a comfortable response time for a user waiting on a system; I'm not sure what the typical delay is in human speech.
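To make the timing concrete, here is a minimal sketch of the idea in Python with asyncio. Everything in it is my assumption, not how Neuro actually works: `generate_reply` and `play` are hypothetical stand-ins for the real generation and TTS/playback calls, and the phrase list is just an example. The filler is only played if the real answer is not ready when the timer expires.

```python
import asyncio
import random

# Example interjections; the real set would be pre-recorded audio clips.
FILLERS = ["Hmm", "Mmm", "Oh", "Uh", "Um", "Yeah"]
FILLER_DELAY_S = 0.4  # the 400 ms figure mentioned above


async def respond(generate_reply, play, delay_s=FILLER_DELAY_S, rng=random):
    """Start generating the reply; if it is not ready after delay_s,
    play a random filler first, then play the real reply.

    generate_reply and play are hypothetical async callables.
    """
    # Generation starts right away and keeps running in the background.
    reply_task = asyncio.ensure_future(generate_reply())

    # Wait up to delay_s for the reply; asyncio.wait does not cancel on timeout.
    done, _pending = await asyncio.wait({reply_task}, timeout=delay_s)

    if not done:
        # Reply is still being generated: cover the silence with a filler.
        await play(rng.choice(FILLERS))

    reply = await reply_task
    await play(reply)
    return reply
```

If generation finishes within the delay, no filler is played at all, so fast answers are not cluttered.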
You can also improve this system by choosing the phrase based on the last punctuation mark of the interlocutor's phrase: "?" -> ["Hmm", "Let me think..."] or "!" -> ["Oh!", "What?"].
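The punctuation rule could look roughly like this. It is only a sketch and assumes the speech-to-text output already contains punctuation; the names `FILLERS_BY_PUNCTUATION`, `DEFAULT_FILLERS` and `pick_filler` are made up for illustration.

```python
import random

# Fallback interjections when the phrase ends with anything else.
DEFAULT_FILLERS = ["Hmm", "Mmm", "Uh", "Um", "Yeah"]

# Mapping from the last punctuation mark to suitable interjections.
FILLERS_BY_PUNCTUATION = {
    "?": ["Hmm", "Let me think..."],
    "!": ["Oh!", "What?"],
}


def pick_filler(transcript, rng=random):
    """Pick a random interjection based on the last character of the transcript."""
    stripped = transcript.rstrip()
    last_char = stripped[-1] if stripped else ""
    candidates = FILLERS_BY_PUNCTUATION.get(last_char, DEFAULT_FILLERS)
    return rng.choice(candidates)
```

For example, `pick_filler("What do you think?")` returns either "Hmm" or "Let me think...", while a phrase ending in a period falls back to the default list.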
Pros:
- The conversation will go more smoothly, because the interlocutor will understand that an answer is coming
- Her speech will sound more natural
- Almost instant responses
- In my opinion, it is a relatively easy solution compared to optimizing the code or buying faster hardware
Cons:
- I'm not sure this solution will work for regular Neuro because of her monotonous TTS, but for Evil it seems acceptable to me, because she can say a single "Hmm" in several ways.
- Neuro may look "stupider", and the interjections will clutter her speech
It would be interesting to hear criticism of this suggestion.
This is exactly how I phrased my first forum post suggestion too