#Working with BytesIO, and converting memory data to .wav file

ember night
tame eagle
# ember night Hello! I know this is a fairly old thread but I am doing something similar for a...

Hey Rio! Absolutely. Shoot me a message with your questions if you have something specific. Otherwise, it might be better for us to find a time to chat in voice, as it's a complex subject in the context of how we're implementing it. The project you're working on seems pretty similar to the one I've been developing, so I'm sure I could help give some guidance. Though for mine, I decided against using a third party api for the AI, because I'm setting the project up for commercial use, and instead set up my open NLU and NLP server and api to help with cost efficiency. Since then I've been working on a hotword detection model to further reduce irrelevant data and requests.

Anyway, all that to say, happy to help where I can - just shoot me a dm. 🙂

#

First off, great job with Pycord guys! Loving it.

Ok, so diving in:

The Goal: I'm attempting to create a speech recognition module that utilizes the voice data gathered by Pycord, which is then sent to the speech_recognition library, and then analyzed for qualified data which is then sent to dialogflow for responses and sentiment analysis.

The Problem: I think the main problem here is primarily my lack of understanding of memory buffers and BytesIO, having never worked with them before. I'm sure there has to be a simple solution that I'm just missing. But as far as I can tell, there's no way to handle BytesIO data objects directly with the speech_recognition library.

So my first attempted solution was to try to find a way to convert the BytesIO data contained within the WaveSink to a temporary .wav file, which could then be processed by speech_recognition. However, again due to my lack of knowledge of how that particular data type works, I'm not sure how to go about this. I've attempted to convert it with the scipy.io.wavfile module, with no success. I then took a look at the discord.File class within Pycord, to see what it was doing to convert it to the .wav file that is sent to Discord. However, that class isn't (to my understanding) really designed for the same context that I'm working within, and as a result it's hard for me to diagnose what it is that I'm missing.

The Assistance I Need: I mainly just need to understand how to interact with the BytesIO data stored by the Pycord library. Any additional info about working with that data type, and any insights you all may have into the issue, would be greatly appreciated.
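
For readers new to BytesIO, here is a minimal sketch of how such a buffer behaves. It acts like a binary file held in memory, with a cursor that moves as you read and write. The four bytes below are placeholder data standing in for recorded audio, and nothing here is specific to Pycord.

```python
import io
import os
import tempfile

# Placeholder bytes standing in for recorded audio (an assumption of this sketch).
buffer = io.BytesIO()
buffer.write(b"\x01\x02\x03\x04")

# After writing, the cursor sits at the end, so read() returns nothing.
at_end = buffer.read()

# Rewind to the start before reading, just like with a file on disk.
buffer.seek(0)
from_start = buffer.read()

# getvalue() returns the full contents regardless of the cursor position.
everything = buffer.getvalue()

# Dump the in-memory bytes to a temporary file on disk.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(everything)
    tmp_path = tmp.name

# Read the file back to confirm the round trip, then clean up.
with open(tmp_path, "rb") as fh:
    on_disk = fh.read()
os.remove(tmp_path)
```

A forgotten seek(0) is a common reason a buffer appears to be empty when it is read.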

native cobalt
#

What speech recognition library are you using

tame eagle
native cobalt
#

So what do you need to do to pass the audio

tame eagle
#

Exactly. To my understanding this library doesn't directly handle BytesIO as an input though, which is (to my understanding) how the data is stored within the sink. Is that correct?

#

Is there a better way of passing the audio directly as a source to the SR module?

tame eagle
#

For anyone interested in doing something similar, or running into the same issue, I did end up getting it figured out. I'll include the basics here, and if anyone needs more details let me know and I can give a more detailed explanation.

  1. Make sure you pass the VoiceClient instance of the channel that your bot connected with, to your finished_callback function. This will be used later, in step #3.

  2. You want to go ahead and grab the BytesIO object (bufferedIOBase), stored in the Sink that you used to store the recorded audio. I'm sure there's a more direct way to do this, but for my case, I got it by using the same function used for sending the audio files to discord:

    files = [discord.File(audio.file, f"{user_id}.{sink.encoding}") for user_id, audio in sink.audio_data.items()]

  3. Now you're going to want to re-add the audio data information to the buffer. I'm not really sure why it's necessary to do this again, because discord.File already performs this operation, but for whatever reason, it doesn't work without it. Here's an example of how to do this (this is also where you'll use the VoiceClient that you passed to the finished_callback):

#

    import wave

    for f in files:
        # Reset the buffer's seek position so that it reads from the start of the file
        f.fp.seek(0)
        data = f.fp
        # Open the buffer as a file-like object and write the WAV audio parameters to it
        with wave.open(data, "wb") as wav_file:
            wav_file.setnchannels(vc.decoder.CHANNELS)
            wav_file.setsampwidth(vc.decoder.SAMPLE_SIZE // vc.decoder.CHANNELS)
            wav_file.setframerate(vc.decoder.SAMPLING_RATE)
        # Reset the buffer's seek position again (not sure this is necessary, but used for safety)
        data.seek(0)

  4. Create a NumPy array:

    audio_buffer = data.read()
    audio_array = np.frombuffer(audio_buffer, dtype=np.int32, count=-1)

  5. Use the SoundFile library to convert to a .wav file.
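
As a rough illustration of the end result of these steps, here is a minimal sketch that wraps raw PCM bytes in a WAV container entirely in memory. It swaps in the standard-library wave module in place of NumPy and SoundFile, so it is not the exact method described above. The helper name pcm_to_wav and the audio parameters (48 kHz, stereo, 16-bit) are assumptions for illustration; in a real bot they would come from vc.decoder.

```python
import io
import wave

# Assumed audio parameters, not confirmed by the thread.
SAMPLE_RATE = 48000
CHANNELS = 2
SAMPLE_WIDTH = 2  # bytes per sample, per channel (16-bit)


def pcm_to_wav(pcm: bytes) -> io.BytesIO:
    """Hypothetical helper: wrap raw PCM bytes in a WAV container in memory."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm)  # the header is written along with the frames
    out.seek(0)  # rewind so the next reader starts at the RIFF header
    return out


# One second of silence as placeholder audio.
pcm = b"\x00" * (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)
wav_buffer = pcm_to_wav(pcm)

# Read the result back to check that it parses as a valid WAV stream.
with wave.open(wav_buffer, "rb") as reader:
    frames = reader.getnframes()
    rate = reader.getframerate()
```

To get a .wav file on disk, the returned buffer's getvalue() can be written to a file opened in "wb" mode.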

tame eagle
#

For anyone looking to do this in the future, see This Post, as the old approach is not at all optimal for live speech_recognition.

Optimized Solution Overview:
Note: I will not be posting details here. If you need any further details, you're welcome to comment here or shoot me a message

Essentially, the problem with accessing the audio data while the sink is recording is that the recording function runs in a separate thread. So in its natural state (without modification), the data won't be accessible until the thread completes (i.e. the stop_recording function is called) and the process navigates back to the main thread with the rest of your code.

The solution:
The solution I ended up implementing was slightly more complex, just as a result of the way the data is going to be utilized, but the essence lies in a slight modification to the base Sink class in the core.py file. Essentially, you add a queue (either Queue, which is FIFO, or LifoQueue) as a parameter of the base Sink class. You then go down to the write function and implement it however you want. I chose to wrap the user and BytesIO buffer in a dict, and then pass that to the queue via self.queue.put(dict_wrapper).

And that's pretty much it! After that, you just need to incorporate the receiving end of the pipe into your Bot's code via sink.queue.get().
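
The queue hand-off described above can be sketched roughly as follows. QueueSink is a hypothetical stand-in rather than Pycord's actual Sink class, and the record function only simulates the recording thread. The None sentinel used to signal the end of recording is also an assumption of this sketch.

```python
import io
import queue
import threading


class QueueSink:
    """Hypothetical stand-in for a Sink subclass; not Pycord's actual class."""

    def __init__(self, audio_queue: queue.Queue):
        self.queue = audio_queue

    def write(self, data: bytes, user: int) -> None:
        # Wrap the user and a buffer holding this chunk in a dict, then enqueue it.
        self.queue.put({"user": user, "buffer": io.BytesIO(data)})


def record(sink: QueueSink, chunks) -> None:
    """Simulates the recording thread calling sink.write() for each chunk."""
    for user, data in chunks:
        sink.write(data, user)
    sink.queue.put(None)  # sentinel marking the end of recording (assumption)


sink = QueueSink(queue.Queue())  # Queue is FIFO, so chunks arrive in order
chunks = [(1, b"\x00\x01"), (2, b"\x02\x03"), (1, b"\x04\x05")]
worker = threading.Thread(target=record, args=(sink, chunks))
worker.start()

# Receiving end: runs alongside the worker without waiting for it to finish.
received = []
while True:
    item = sink.queue.get()  # blocks until the recording thread supplies an item
    if item is None:
        break
    received.append((item["user"], item["buffer"].getvalue()))
worker.join()
```

Because queue.Queue is thread-safe, the consumer can process each chunk as soon as it is written, without waiting for the recording thread to complete.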

An additional note: I did create a custom sink, by subclassing the base Sink class. I also added the queue there as a parameter. I then added a custom sr_formatting definition designed to process the BytesIO buffer bytes, add wav formatting, convert to numpy array, and write back to a different memory buffer to prep it for speech recognition processing.

Hope this is useful to someone!