#discuss: helping Neuro deal with people talking over each other in the call? what requires solutions?


keen grotto
#

This is especially relevant for games like Among Us, since a lot of people tend to talk over one another rather than one at a time, specifically during discussions in the emergency meetings.

I believe there was already a similar discussion thread to this, regarding ways Neuro could discern which person is saying what in future group collabs like the dating shows, and gameshow collabs etc. without needing to implement complicated stuff like voice recognition.

There was talk about many different ways to do it, like everyone being in 2 calls at once (1 for the group and 1 for Neuro to hear each person individually), which was apparently too complicated.

Another idea was to simply use Discord voice activity to detect which people are speaking at any given time. Nobody seemed to have a problem with that, and the discussion kinda ended there.

But i still don’t think that would adequately solve the problem. Because as soon as more than one person starts talking at once, then simply knowing who in the call is speaking at any given moment isn’t the same as knowing which words are coming from what person.

And also, if Neuro isn’t able to somehow “hear” every person individually, she might not even be able to pick up any coherent sentences at all whenever multiple people speak at once.
During an emergency meeting in an Among Us collab, for example: if two players are arguing over each other about which one is the imposter or something, it really doesn’t matter if Neuro can discern that it’s those two people speaking if it all goes through one audio channel anyway and all she “hears” is an amalgamation of noise rather than two separate arguments.

It’s just no good if two sentences turn into one scuffed amalgamation that retains none of the meaning from either.

Knowing where it came from means nothing at that point.

TLDR: people loud, what do?


tulip coral
#

The language model would probably have to be tweaked quite a bit for this specific situation like Among Us. I remember at one point someone asked in chat how neuro would know if she's the imposter or not, and Vedal's response was simply "she'll be told so". So it's possibly a similar situation here. Based on the data she has, she's going to have to pick specific people to call out and get a response from. She could continually ask someone for a response until she gets a coherent answer. I think this would naturally discourage people talking over one another.

The discord voice activity is a decent solution, I'm sure this would better allow her to identify players and people accordingly. But it's definitely not really possible for her to interpret every person's speech individually at the same time without over-complicating things.

Vedal mentioned something about improving her speech interpretation abilities. This applies here as well. For example, if 2 people are talking at the same time and it turns into 1 non-coherent sentence, then neuro should recognize this sentence doesn't make sense and ask for clarification rather than responding to it in a nonsensical manner. Furthermore, again she can use the discord voice activity data to realize that multiple people are talking at once.
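The "use voice activity data to realize that multiple people are talking at once" idea can be sketched roughly as below. This is only an illustration: `SpeakingSpan`, `overlapping_speakers`, `should_ask_for_clarification` and the 0.5 second threshold are made-up names and values, and the spans are assumed to come from whatever speaking start/stop events the voice client exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakingSpan:
    """One stretch of voice activity (hypothetical shape, not a Discord type)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlapping_speakers(spans, min_overlap=0.5):
    """Return the speakers whose speech overlaps someone else's
    for at least `min_overlap` seconds."""
    talking_over = set()
    ordered = sorted(spans, key=lambda s: s.start)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start, so no later span can overlap `a`
            if a.speaker == b.speaker:
                continue
            shared = min(a.end, b.end) - max(a.start, b.start)
            if shared >= min_overlap:
                talking_over.update({a.speaker, b.speaker})
    return talking_over


def should_ask_for_clarification(spans, min_overlap=0.5):
    """True when the transcript for this window is probably a mix of voices."""
    return bool(overlapping_speakers(spans, min_overlap))
```

If `should_ask_for_clarification` returns True for the window a transcript came from, the bot could ask for a repeat instead of answering the garbled sentence.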

keen grotto
#

I suppose if you want to go around the problem rather than through it, by looking for ways to discourage people from talking over each other in the first place rather than ways to handle it when they do, then it could be as simple as having Neuro interject with “One at a time please” or something whenever she detects several people talking over each other for too long.

But even then, that doesn’t really “solve” the problem, as much as try to move around it. While it is definitely a good way to minimize the issue, it only really works for people that somewhat understand how Neuro works. And not everyone will necessarily abide by it. Especially under pressure. Emergency meeting discussions in among us for example have a limited time for the players to make their arguments. People are generally very disorganized and a lot of people wouldn’t feel inclined to speak one at a time, and wait for Neuro to make her response before having their turn to say what they want to say.

Some people might even just find it annoying. Like if she kept constantly asking people to talk one at a time, or explain what they were trying to say on the Mizkif collab for example? Not a chance anybody would actually start talking one at a time in a clear way. Instead they would probably just get annoyed if they felt neuro kept interjecting with the same stuff constantly.

I don’t think the ideal path is to try to get everybody else to change how they speak to each other in a group to adapt to Neuro. If possible, it would be better to adapt Neuro’s way of understanding what people are saying, to reflect being part of a conversation with several people rather than just a 1-on-1. A step forward, rather than a step around, I guess.

#

Either way you go though, this is a difficult problem to solve,
And I definitely don’t envy the turtle on this one.

#

You never know. Maybe he’ll amaze us all with a much simpler solution than is immediately obvious, which doesn’t have any of these issues.

tulip coral
#

Totally, it's not a solid solution because it's not really replicating human communication that well. It would probably be annoying and inconsistent. But it's one of those things where you'll have to compromise for now just to have some type of solution. This would be an impressive speech recognition breakthrough if she could handle multiple people talking over each other without issue. Not sure if the tech is fully there yet.

keen grotto
#

Yeah definitely

naive raft
#

Send each speaker a unique url to a web app that captures Speech to text using browser's webkitSpeechRecognition:
https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API

var recognition = new webkitSpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;

recognition.onstart = function() {
  // If Neuro is talking: signal the Neuro API to stop once she finishes her current sentence
  // Get the unique ID from the URL that identifies the user
  // Optionally show the user the speech-to-text captured so far, i.e. subtitles of their own voice :3
};

recognition.onerror = function(event) {
  // Flash a message box telling the user to "Stop Speaking in British Accent mate!"
  // If Neuro is not talking: tell Neuro via the API to ask the user if he/she is British :3
};

recognition.onend = function() {
  // Send the captured speech-to-text and the unique ID from the URL that identifies the user to the Neuro API
};

Speech recognition involves receiving speech through a device's microphone, which is then checked by a speech recognition service against a list of grammar (basically, the vocabulary you want to have recognized in a particular app.) When a word or phrase is successfully recognized, it is returned as a result (or list of results) as a text string...

keen grotto
#

that is an idea

tardy pond
#

What is being asked for here is speaker diarization, or an AI to generate a speaker segmented transcription that could then be passed to Neuro's AI. I haven't seen an 'on the fly' model for that; even if one exists, the latency hit would probably be ridiculous. You would also have to solve for that AI transcribing Neuro back to herself(which would be hilarious).

Neuro being able to detect who is speaking in Discord is the most elegant solution. People will just have to learn how to play with her.

keen grotto
#

Yeah, if there really is no way for Neuro to hear everyone’s audio separately

fervent mural
#

I have the most pepega solution

Give Neuro the ability to do "random mutes", that way she's just receiving one input while the rest are muted, and do "exceptions" like if its a show, never mute the host
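A rough sketch of the "random mutes" idea, under one possible reading: if a protected speaker (for example the host) is talking, only they are kept, otherwise one active speaker is picked at random. `pick_unmuted` and its parameters are hypothetical names for illustration, not part of any Discord library.

```python
import random


def pick_unmuted(active_speakers, never_mute=(), rng=random):
    """Choose who Neuro listens to right now.

    Speakers in `never_mute` (e.g. the show host) are always kept; if none of
    them is talking, one active speaker is picked at random.
    Returns (listen_to, muted) as two sets.
    """
    active = set(active_speakers)
    protected = active & set(never_mute)
    if protected:
        listen_to = protected
    elif active:
        # sorted() only makes the choice reproducible for a seeded rng
        listen_to = {rng.choice(sorted(active))}
    else:
        listen_to = set()
    return listen_to, active - listen_to
```

Passing a seeded `random.Random` as `rng` makes the choice repeatable, which may help when debugging a recorded call.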

keen grotto
#

Hmm

keen grotto
#

I’m sure vedal is aware of the issue already, so I’m optimistic he’s thinking of solutions as well.

naive raft
#

well since i have tested each solution i can provide the main cons: the google one requires matching the speaker_tag to the user's name after diarization, not good for real time speaker id.

The microsoft one just requires training and testing which you can use with guest streamer's vods.

The webkitSpeechRecognition one is closest to zero shot approach, depending on the guest streamer's setup -- may or may not be obtrusive (for those who use voice changers)

keen grotto
#

How they would affect latency is also probably a big concern since the Tutel seems very interested in keeping latency to a minimum

naive raft
#

Azure solution's latency is similar to their speech to text; the only difference is you get the result back as a dictionary of word and speaker. So in the Azure solution you get the speaker ID based on your initial training.

https://youtu.be/qPB3ndV2jGY?t=543

In this session, we will have a quick look at the Conversation Transcription solution that helps you convert speech to text for multiple speakers as well as to identify the speakers!

#

The Google Cloud solution returns a similar dictionary but just tags the different users numerically, such as "speaker_tag: 1" https://youtu.be/8uVFbCrVEOA?t=690

Hello Everyone! My name is Andrew Fung, in this video, I will be showing you how you can upload an audio conversation file and get the transcription of using Google’s Speech-To-Text API and display the script on the web using Python Dash. In this first part of the video, we will be focusing on getting the audio transcription program working and ...
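To illustrate the extra step the numeric tags need, here is a minimal sketch that merges consecutive words with the same `speaker_tag` and swaps the tag for a username. The word list shape (plain dicts) is a simplified stand-in assumed for this example, and `group_by_speaker` and `tag_to_name` are hypothetical names; the mapping itself still has to be worked out by hand, which is the con mentioned above.

```python
def group_by_speaker(words, tag_to_name):
    """Merge consecutive words with the same speaker_tag into utterances.

    `words` is a list of {"word": str, "speaker_tag": int} dicts (an assumed,
    simplified stand-in for the diarized word list). `tag_to_name` is the
    manual mapping from numeric tag to username.
    """
    utterances = []
    for item in words:
        tag = item["speaker_tag"]
        # Unmapped tags keep a generic label instead of failing
        name = tag_to_name.get(tag, f"speaker_{tag}")
        if utterances and utterances[-1][0] == name:
            utterances[-1][1].append(item["word"])
        else:
            utterances.append((name, [item["word"]]))
    return [(name, " ".join(parts)) for name, parts in utterances]
```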

#

The webkitSpeechRecognition solution which I use on my own vtuber bot of course is even faster, you can compare the latency to
https://webcaptioner.com/ (Chrome only) but just imagine each caption is sent to the bot with your username attached to the unique link.
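The "unique link per speaker" bookkeeping could look something like the sketch below on the bot side. Everything here is an assumption made for illustration: the `CaptionLinks` class, the `example.invalid` base URL, and the idea that the web app sends back the token from its own URL along with each caption.

```python
import secrets


class CaptionLinks:
    """Hand out one unguessable caption link per speaker and label
    incoming captions with the matching username."""

    def __init__(self, base_url="https://example.invalid/caption"):
        self.base_url = base_url  # placeholder URL, not a real service
        self._token_to_user = {}

    def invite(self, username):
        """Create and remember a unique link for this speaker."""
        token = secrets.token_urlsafe(16)
        self._token_to_user[token] = username
        return f"{self.base_url}/{token}"

    def label(self, token, caption):
        """Turn a caption posted from a link into 'username: text'."""
        username = self._token_to_user.get(token)
        if username is None:
            raise KeyError("unknown caption link")
        return f"{username}: {caption.strip()}"
```

Because the speaker is known from the link, no diarization or voice recognition is needed on the bot side in this approach.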

tight current
#

this is just for among us, but maybe vedal can make a voice chat mod, instead of using discord, that automatically transcribes the voices for each speaker and sends it to neuro, maybe he can edit the pre-existing proximity chat mod or something

keen grotto
#

Yeah, there’s definitely some notable stuff on here regarding the pros and cons of varying potential solutions to this. I wonder what the turtle thinks, because I’m sure he’s definitely thought about this as well at some point. I doubt he’ll randomly stumble upon this discussion when he gets back tho, so I’ll try donating or something next dev stream if he ends up doing another question-answering segment again.

But with the debut coming up soonTM, I’d imagine this probably isn’t too high on the list of priorities at the moment.

naive raft
#

we can try tagging him directly here, but it's still ungodly early to ping him out of bed right now vedalHappi

keen grotto
#

Errr, probably not. People shouldn’t ping Vedal unless they either know him, or need to share vital or urgent information.
It’s discord etiquette. But it’s also just a rule here as well.
Besides, it’d be kinda rude to bring notifications his way while he’s on break.
Especially if it’s just to ask a question.

inner relic
#

I do read this stuff dw

keen grotto
#

Personally I’m excited to potentially hear about the specifics of what the Tutel is cooking in regards to this, and what his end-goal for handling multiple people speaking is, hopefully on the next dev stream.

Because addressing something like that will definitely be pretty important in regards to among us collabs. But any improvements that are made for that, will still be useful in the future for any potential gameshows or group collabs that may happen down the line.

naive raft
#

actually I also found another option, why not get it straight from discord as separate voices : https://v12.discordjs.guide/voice/receiving-audio.html#basic-usage


const fs = require('fs');

// Create a ReadableStream of s16le PCM audio
const audio = connection.receiver.createStream(user, { mode: 'pcm' });

// Write the raw PCM for this one user to a file
audio.pipe(fs.createWriteStream('user_audio.pcm'));

The CONS is you have to write your own threading and its Node js :3
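Raw PCM written like this has no header, so most speech recognizers will not know how to read it. A minimal sketch of wrapping the bytes in a WAV container with Python's standard library follows. The defaults (48 kHz, stereo, 16-bit) are an assumption about how the decoded Discord stream is laid out, and `pcm_to_wav_bytes` is a name made up for this example.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, sample_rate=48000, channels=2, sample_width=2):
    """Wrap raw s16le PCM bytes in a WAV container, in memory.

    The defaults are assumptions about the decoded voice stream; adjust them
    if the receiving library reports something different.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample, 2 = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

The returned bytes can be written to a `.wav` file or handed to a recognizer that accepts WAV input.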

So i tried to check the discord python API for same support and its RFC is a mess because they couldn't agree how to design it (threading, format, etc): https://github.com/Rapptz/discord.py/issues/1094

You can see this unofficial RFC fork used in (it seems stable): https://github.com/RobotCasserole1736/CasseroleDiscordBotPublic/tree/master

NOTE: In order to get audio streaming working effectively, I had to pull not the main discord.py, but made a local copy of github user imayhaveborkedit's fork of discord.py. Details about their changes are on this RFC.. Basically, new API's for voice client listen() and supporting classes were added to make incoming audio data available. Forward this on to the speakers, and we're golden! Despite the developmental nature of the library, seems to be quite stable and happy at the moment?

If Vedal happens to read this, ||sorry mate, I'm too lazy to write the actual code, I'll try to write a threaded listener for any user who joins the VC by default :3||

GitHub

Note: DO NOT use this in production. The code is messy (and possibly broken) and probably filled with debug prints. Use only with the intent to experiment or give feedback, although almost everythi...

GitHub

Public release of 1736's Discord bot. Contribute to RobotCasserole1736/CasseroleDiscordBotPublic development by creating an account on GitHub.

inner relic
#

honestly neurosama bot account is better endgame anyways because it supports stereo for when she plays music/sings and is much easier to automate

lethal panther
#

i heard discord bot

#

im making it

lethal panther
#

nevermind

#

anyways heres some information i can share:

Bots CANNOT join any kind of DM calls, leaving her only with server voice chats
Discord has a function that lets you check who is currently talking, which allows detecting multiple users
Playing songs and talking is easy, but if vedal needed neuro to use discord camera/stream, there doesn't appear to be anything for it, and it's against ToS for a bot to stream/use video

#

you can record people's voices using voice_client.start_recording

#

it returns an mp3 file, but can also be saved in bytes

lethal panther
#

when it comes to singing, discord.py allows audio to be passed, so there can be a command called "sing" and takes an mp3 file

#

it also can be used via the control panel, inserting a file there and sending it to the bot via the API itself

lethal panther
#

the bot isnt a confirmed feature, but if it is i wouldnt mind helping doing the development

naive raft
#

yes, it joins voice chats as stated:

In addition to sending audio over voice connections, you can also receive audio (i.e., listen to other users and bots in a voice channel) using discord.js.

Q: Can you even use discord.js with bearer tokens?
A: it authenticates like all discord via token:
https://discordjs.guide/preparations/setting-up-a-bot-application.html#creating-your-bot

Here is a complete Bot in discord.js implementation we can copy.

You can see the implementation of voice recording by bot here (index.js LINE:350): https://github.com/inevolin/DiscordSpeechBot/blob/master/index.js

function speak_impl(voice_Connection, mapKey) {
    voice_Connection.on('speaking', async (user, speaking) => {
        if (speaking.bitfield == 0 || user.bot) {
            return
        }

        //// Here we identify the speaker
        console.log(`I'm listening to ${user.username}`)

        //// This runs inside an async event callback, so each speaker is handled
        //// without blocking the others (event-driven, not actually threaded)
        const audioStream = voice_Connection.receiver.createStream(user, { mode: 'pcm' })
        audioStream.on('error', (e) => {
            console.log('audioStream: ' + e)
        });
        let buffer = [];
        audioStream.on('data', (data) => {
            buffer.push(data)
        })

        //// When the speaker stops, send the data and user to Neuro's API (REST/firebase/message queue)
        audioStream.on('end', () => {
            const pcm = Buffer.concat(buffer)
            // placeholder: forward `pcm` together with user.id / user.username here
        })
    })
}
#

for use of pycord heres a code i found: https://github.com/Pycord-Development/pycord/blob/master/examples/audio_recording.py

on line 23 you can see that it's possible to get a list of (user_id, audio) tuples:

files = [
    discord.File(audio.file, f"{user_id}.{sink.encoding}")
    for user_id, audio in sink.audio_data.items()
]

Note: The audio format is Opus so you might need to convert it before sending to speech recognizer :3

I guess ask #programming for help, I'm just really tight on schedule right now :3

GitHub

Pycord, a maintained fork of discord.py, is a python wrapper for the Discord API - pycord/audio_recording.py at master · Pycord-Development/pycord

lethal panther
#

You're seeing v12, which is an outdated version and pretty much dead

#

Vedal most likely needs a python bot since neuro's built on python

lethal panther
#

I was reading thru that issue yesterday, sounded good but nothing in the docs, meaning its either not implemented or just gone

#

I also saw that you can get data directly from the websocket, but decoding it is extremely hard

lethal panther
#

As i said discord.js is probably not what vedal wants

#

Neuro is coded in python, which is why it should be discord.py

naive raft
#

yeah there are python forks available, just note this may change when discord.py changes thats why these authors are pushing it on RFC :3

as for the voice data stream from discord, it's Opus format AFAIK; there are a lot of python bindings for libopus so it can be converted to pcm.

lethal panther
#

Well, the best way would be taking it directly from the websocket, but considering thats alot of work to decode data & transform it to input, we should wait for Vedal's opinion for now

#

the rest, like songs & playing audio is pretty basic and can be done easily

rugged pulsar
#

Hey just a horrendous ad-hoc solution but could Vedal make like 10 extra discord accounts for the call and have them mute everyone except 1 specific person each and then output to a virtual cable containing just that person's audio?

Then when 1 person speaks the Neuro model prioritises the sentence coming from the person who spoke first, you could even add a (This comes from person X) into the prompt to help with awareness

#

So let's say collab with person A,B and C + Neuro

Accounts:
A
A listener
B
B listener
C
C listener
Neuro

A listener mutes B, C and Neuro
B listener mutes A, C and Neuro
C listener mutes A, B and Neuro
Neuro mutes everyone

A, B and C listener output to A, B and C virtual cables

When 1 cable detects audio it mutes the other two until the audio cable is quiet again + 3 seconds

Inputted audio is sent to Neuro sama with:

Test, [from person A]
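The "first speaker keeps the floor until quiet + 3 seconds" rule and the prompt tag from the example above can be sketched in a few lines, regardless of whether the per-person audio comes from virtual cables or from a bot. `FloorArbiter` and `tag_for_prompt` are hypothetical names, and timestamps are assumed to be in seconds.

```python
class FloorArbiter:
    """Give the floor to whoever spoke first; ignore everyone else until
    the holder has been quiet for more than `hold_seconds`."""

    def __init__(self, hold_seconds=3.0):
        self.hold_seconds = hold_seconds
        self.holder = None
        self.last_heard = None

    def accept(self, speaker, now):
        """Return True if audio from `speaker` at time `now` should be passed on."""
        if self.holder is not None and now - self.last_heard > self.hold_seconds:
            self.holder = None  # holder went quiet long enough; floor is free
        if self.holder is None:
            self.holder = speaker
        if speaker == self.holder:
            self.last_heard = now
            return True
        return False


def tag_for_prompt(text, speaker):
    """Append the speaker tag in the format used in the example above."""
    return f"{text}, [from person {speaker}]"
```

One consequence of this rule is that anything said by the others while the floor is held is dropped entirely, which is the kind of mistake raised in the replies.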

#

obviously this would require some setup for each collab but Neuro sama collabs are never really spontaneous

lethal panther
#

that is not a good idea

lethal panther
#

wayyyy too complex

#

theres way too many easier things

#

too hard to setup

rugged pulsar
#

It's relatively simple isn't it?

lethal panther
#

can lead to many many many mistakes

rugged pulsar
#

I'm not really following

lethal panther
#

When 1 cable detects audio it mutes the other two until the audio cable is quiet again + 3 seconds

#

this

rugged pulsar
#

all this really does is split discord voice chats into individual separate audio cables

rugged pulsar
#

but the rest of it should work

lethal panther
#

theres easier methods with discord api

#

that can record each person individually

rugged pulsar
#

hmmm

#

well I'm approaching this as someone with very little experience coding

#

I still think the listening accounts and virtual audio cables would work well in at least splitting the audio into individual sections

#

this is really just an alternative to coding tbh

naive raft
#

What's the fuss? We already solved it above in code, the choices are just between the stable version on node.js or the forked version on Python :3

rugged pulsar
#

Ah I didn't realise it was solved sorry

keen grotto
#

Seems to make the most sense

keen grotto
#

On the same topic of helping Neuro communicate better. Yesterday in dev stream, Vedal raised the point of her responding to sentences that weren’t actually meant for her. Or more specifically, neuro responding when Vedal was trying to talk to chat.

This could probably be partially solved by simply having her not respond to sentences from Voice inputs that include “Chat” in the sentence. But that could lead to potential issues of Donowalling people.

Another way which someone in chat mentioned, could be to use something similar to sentiment analysis to see whether the statement seems like it’s directed at specifically somebody else, like “Chat” or any other name.

For example in a Collab like among us. If someone asks somebody a question like: “Filian, what were you doing in electrical?”
Neuro wouldn’t need to respond to that question like it was directed at her.
Because that would just interrupt the answer from the intended recipient.
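As a much simpler stand-in for the sentiment-analysis-style idea, a leading-name check already covers the "Filian, what were you doing in electrical?" case. This is a plain heuristic sketched for illustration, not something Vedal is known to use; `addressed_to`, `neuro_should_respond` and the name list are assumptions.

```python
import re


def addressed_to(sentence, known_names):
    """Return the known name the sentence opens with (e.g. "Filian, ..."), or None."""
    match = re.match(r"\s*([A-Za-z]+)\s*[,:]", sentence)
    if not match:
        return None
    first = match.group(1).lower()
    for name in known_names:
        if name.lower() == first:
            return name
    return None


def neuro_should_respond(sentence, known_names, own_name="Neuro"):
    """Respond unless the sentence is explicitly addressed to someone else."""
    target = addressed_to(sentence, list(known_names) + [own_name])
    return target is None or target.lower() == own_name.lower()
```

It only catches names at the very start of a sentence, so something like "what were you doing in electrical, Filian?" would still slip through, and it says nothing about the false-positive concern raised above.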

Something like this, if it were to actually work correctly, would be really helpful in collabs. And would resolve the issue of Vedal having to manually mute/unmute her like the mizkif collab.

So the benefit is there. But the problem with trying a method like that, is whether it’s even possible to do this. And if it’s potentially too complicated, to even be worth the time.
Then there’s the potential that it might lead to too many false positives detected from the AI, leading to her donowalling collab partners

Neuro’s current animations seem to work off of something like sentiment analysis AI, but if so, then that’s for the outputs,
and something like this would need to be done to the inputs.

Idk. I’m just a messenger here.

I just saw someone say it and figured i should relay it in here, since it is relevant

naive raft
#

Oops i made a separate thread, since this thread solves the VC problem. Vedal talking to chat, which is not on VC with no way for Neuro to know, is of a different nature. It's a matter of attention span; it could just be Vedal thinking out loud as well. But i guess it also works on VC mode :3 https://discord.com/channels/574720535888396288/1112984472741220412

keen grotto
#

Ah sorry

#

I also went over the VC suggestion stuff i’d seen from people for outside collabs,
Ones that aren’t just concerning Tutel —> chat/nwero

yeah. But sorry if i was clogging the discussion with duplicates from elsewhere

naive raft
#

on VC call I think Neuro should listen to all and then make a summary of all that it heard before creating a response - she should then respond until her name is called. The dev stream is different, Vedal wants to ~~exclude~~ separate neuro's attention from the conversation with chat.
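One possible reading of the "listen to all, then respond when her name is called" idea is a buffer that collects everything heard and only releases it as context once the name comes up. The sketch below assumes that reading; `ConversationBuffer` is a hypothetical name, and the summarizing step is left out (the lines are just joined).

```python
class ConversationBuffer:
    """Collect utterances and release them as context when the wake name is heard."""

    def __init__(self, wake_name="neuro"):
        self.wake_name = wake_name.lower()
        self.lines = []

    def hear(self, speaker, text):
        """Store one utterance; return the pending context once the wake
        name is mentioned, otherwise None."""
        self.lines.append(f"{speaker}: {text}")
        if self.wake_name in text.lower():
            context = "\n".join(self.lines)
            self.lines = []  # start fresh after handing the context over
            return context
        return None
```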

keen grotto
#

Ahh

#

Yeah you’re right. That’s a different thing

keen grotto
#

Hmm

inland rock
#

I think staying silent while others talk is solved; Vedal mentioned that feature a few times. Sorry if somebody already mentioned it. The only problem is when multiple bots or a lot of people talk. Humans have a (sub)unconscious speaking order, often socially defined. So the only solution I can see so far is adding another NN that grants Evil and Neuro the right to speak when they collab together with humans.

civic hamlet
#

I built a system for this actually.

#

Hasn't been updated in a while. Lots to improve, but basically each person on a discord call is separated and you can deal with each speaker independently.

#

It listens continuously for input. And plays an mp3 back. But there is a fork of the code that can play audio from an audio source instead.

#

Discord only allows it to work in a server though. So you need a collab server setup.

#

With something like Among Us with 8 people, it would be too slow to process everyone. But you could use a cloud service or get a ton of GPUs.

#

I hate whisper, so maybe someone has a better STT.

inland rock
#

I think you can also solve the performance problem by using C++. I switched from Python to C++ and the PyTorch C++ frontend. The improvement on small models was about 60x (the smaller the model, the bigger the gain). The only downside is that most external tools and libraries are Python-exclusive, which vendor-locks you. I mean, technically you can call Python code from C++, but this slows you down again. I'm currently going through the pain of trying to jit/script the NVIDIA voice TTS model to the C++ variant, and it only throws errors. The other problem I see is the m:n correlation (n bots, m humans), which drives your processing time up to insane numbers. IRL it also doesn't make sense to talk to that many people at once, so nothing to solve here.