#ad_discordbot (Fork of Fork of xNul's bot)
🥳
I'm not sure about the encryption stuff, will have to test it
I think I spammed discord's ratelimits too much tonight.
I'm stripping out all the unnecessary stuff like external libs and complexity for an example to echo your voice back to you
The next code I’m going to add is to make it possible to run the bot in a non-TGWUI environment for the image generation capabilities only
Most of the code to allow this already exists in the bot
I basically just need to update the batch files that launch the bot so that they create a new virtual environment if the text generation one is not found, install the few requirements that the bot currently relies on from the text generation environment, and finally skip trying to import text generation modules when it’s configured for image generation only
i didn't have time today
still wait for it though
im getting it! a little slow, it should run on gpu though, but the model is the base whisper which shouldn't be too slow even on cpu...
Ok I see that this bot writes your text
that's awesome!
you got it hearing you so far
alr the thing works, it is pretty fast. it should detect 1s of silence to cut a chunk and 2s of silence to do the final stt and send the message. the timing works, but only once you speak again.
it's something like: i spoke, (3 years later),
a little bit before i speak again: "oh, it has been more than 2 s, send the message", then "yes yes marcos, im hearing you"
i spoke 2025-03-06 00:04:50,083 - INFO - Started hearing from Marcos
I spoke for 2.38s but the transcribe log is much after 2025-03-06 00:04:58,946 - INFO - Transcribing chunk for Marcos with audio length 2.38 seconds
It happened a couple of milliseconds before i spoke 2025-03-06 00:04:59,181
```
2025-03-06 00:04:58,946 - INFO - Transcribing chunk for Marcos with audio length 2.38 seconds
2025-03-06 00:04:58,946 - INFO - Starting transcription for temp_323088470241312774_199565.wav
2025-03-06 00:04:59,181 - INFO - Started hearing from Marcos
2025-03-06 00:05:00,388 - INFO - Transcription completed for temp_323088470241312774_199565.wav: Hello, hello, hello
2025-03-06 00:05:02,837 - INFO - Sending transcription for Marcos: Hello, hello, hello
```
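To put a number on that gap, here is a small sketch that parses the timestamps from the log lines quoted in this thread and computes the delay between events. It assumes the `YYYY-MM-DD HH:MM:SS,mmm` prefix seen in the log; the helper names `parse_stamp` and `gap_seconds` are made up for this example.

```python
from datetime import datetime

def parse_stamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")

def gap_seconds(earlier: str, later: str) -> float:
    """Seconds elapsed between two log lines."""
    return (parse_stamp(later) - parse_stamp(earlier)).total_seconds()

started = "2025-03-06 00:04:50,083 - INFO - Started hearing from Marcos"
transcribing = "2025-03-06 00:04:58,946 - INFO - Transcribing chunk for Marcos with audio length 2.38 seconds"
sent = "2025-03-06 00:05:02,837 - INFO - Sending transcription for Marcos: Hello, hello, hello"

print(f"heard -> transcribe: {gap_seconds(started, transcribing):.3f}s")
print(f"transcribe -> sent:  {gap_seconds(transcribing, sent):.3f}s")
```

For 2.38 s of speech, that is roughly 8.9 s before transcription even starts and another 3.9 s before the message is sent, which matches the "only when you speak again" symptom.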
and also the silence log disappeared here D:
it feels like if i dont speak it pauses
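One possible cause of "it pauses if i dont speak" is that the silence check only runs inside the packet callback, so nothing happens while no packets arrive. A minimal sketch of the alternative, assuming the 1 s / 2 s limits mentioned earlier: track the time of the last voice packet and poll it from a timer. `SilenceTracker` and `watchdog` are hypothetical names, not code from the asr bot.

```python
import asyncio
import time
from typing import Callable, Optional

CHUNK_SILENCE = 1.0   # seconds of silence before cutting a chunk
FINAL_SILENCE = 2.0   # seconds of silence before sending the message

class SilenceTracker:
    """Tracks time since the last voice packet for one speaker."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.last_voice: Optional[float] = None
        self.chunked = False

    def on_voice_packet(self) -> None:
        # Called from the audio callback whenever real speech arrives.
        self.last_voice = self.clock()
        self.chunked = False

    def poll(self) -> Optional[str]:
        """Call on a timer, not from the packet callback.
        Returns 'chunk', 'final', or None."""
        if self.last_voice is None:
            return None
        silent_for = self.clock() - self.last_voice
        if silent_for >= FINAL_SILENCE:
            self.last_voice = None  # wait for the next utterance
            return "final"
        if silent_for >= CHUNK_SILENCE and not self.chunked:
            self.chunked = True
            return "chunk"
        return None

async def watchdog(tracker: SilenceTracker, on_event: Callable[[str], None],
                   interval: float = 0.1) -> None:
    """Poll the tracker forever, independent of incoming audio."""
    while True:
        event = tracker.poll()
        if event:
            on_event(event)
        await asyncio.sleep(interval)
```

Because `poll` depends only on the clock, the "final" event fires about 2 s after the last packet even if the speaker never talks again.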
gonna continue tomorrow
i hope i get it done before sunday
have you looked into live whisper repos?
They do something like collecting audio as it comes in and adding it to a buffer (like 30s)
Process on that and return the output of high confidence tokens.
And trim the audio/text.
There are some issues with this too, like if it doesn't hear you finish your sentence it might continue retrying to process the same buffer over and over until it gets your final words.
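The buffer approach described above can be sketched roughly like this. It is a simplification under stated assumptions, not how the linked repos necessarily work: `transcribe` is a hypothetical callable returning `(word, confidence, end_time)` tuples, and only the leading run of confident words is committed before the matching audio is trimmed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (word, confidence 0..1, end time in seconds relative to buffer start)
Token = Tuple[str, float, float]

@dataclass
class RollingBuffer:
    transcribe: Callable[[List[float]], List[Token]]  # hypothetical STT callable
    sample_rate: int = 16000
    max_seconds: float = 30.0
    min_confidence: float = 0.8
    samples: List[float] = field(default_factory=list)

    def feed(self, new_samples: List[float]) -> List[str]:
        """Append audio, re-transcribe the buffer, return newly committed words."""
        self.samples.extend(new_samples)
        # Hard cap: drop the oldest audio if the buffer grows past max_seconds.
        limit = int(self.max_seconds * self.sample_rate)
        if len(self.samples) > limit:
            self.samples = self.samples[-limit:]

        committed: List[str] = []
        cut_time = 0.0
        # Commit the leading run of confident tokens; stop at the first shaky one.
        for word, confidence, end_time in self.transcribe(self.samples):
            if confidence < self.min_confidence:
                break
            committed.append(word)
            cut_time = end_time

        # Trim the audio that produced the committed words.
        self.samples = self.samples[int(cut_time * self.sample_rate):]
        return committed
```

The retry problem mentioned above shows up here too: a trailing low-confidence word stays in the buffer and is re-processed on every `feed` until it becomes confident or falls off the 30 s cap.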
What would be cool is if we can run the speech detection model that preprocesses before being sent to whisper to do the cutoffs more intelligently.
If you're going to do turn based voice chat,
I recommend starting with voice messages because there you don't have to worry about pauses and it could be a cool interface!
The user decides when they're done
didnt understand
i did think about realtime whisper, and instead of sending everything it hears, only send when there's a keyword like **hey**, and send the message when there's a **that's it**
if the problem is for the future me, it's not my problem 👍
gonna leave more features for the future me
I was thinking the idea is pretty similar, but just taking a slightly different route to achieve.
Also that's a good idea to use a trigger keyword, that can save a lot on processing.
There are some libraries dedicated for that too
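As a rough illustration of the trigger keyword idea, here is a text-level gate that ignores transcripts until it sees a start word and releases the collected message at an end phrase. The defaults `hey` and `that's it` come from the message above; the `KeywordGate` class itself is hypothetical, and the substring match is deliberately naive (it would also fire on "they").

```python
from typing import List, Optional

class KeywordGate:
    """Collects transcribed text between a start word and an end phrase."""

    def __init__(self, start_word: str = "hey", end_phrase: str = "that's it"):
        self.start_word = start_word.lower()
        self.end_phrase = end_phrase.lower()
        self.active = False
        self.parts: List[str] = []

    def feed(self, text: str) -> Optional[str]:
        """Feed one transcript chunk; returns the full message once it ends."""
        if not self.active:
            start = text.lower().find(self.start_word)
            if start == -1:
                return None  # nothing addressed to the bot yet
            self.active = True
            text = text[start + len(self.start_word):]
        end = text.lower().find(self.end_phrase)
        if end == -1:
            self.parts.append(text.strip(" ,"))
            return None
        self.parts.append(text[:end].strip(" ,"))
        message = " ".join(part for part in self.parts if part)
        self.active = False
        self.parts = []
        return message
```

A dedicated wake-word library would do the detection on audio rather than on text, which is where the processing savings come from; this sketch only shows the gating logic.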
getting a somehow working version but just takes random time to send the message
also gonna fix that
im dying
what's going on?
fixed audio processing getting paused, but takes random time around 1-12s
and i cant get it fixed
but anything else is pretty good
take a look if you want, gonna check https://github.com/davabase/whisper_real_time and https://github.com/ufal/whisper_streaming
I could take a look at it,
i have a few ideas, like how do you check that it has been silent for a certain amount of time with background noise
I think that 2nd one https://github.com/ufal/whisper_streaming is what I used
wow looks like a lot going on in the bot
one idea is that the audio processing stops when there's silence, have you tried printing when silence is detected?
sink_cb = voice_recv.BasicSink(callback)
sink_silence = voice_recv.SilenceGeneratorSink(sink_cb)
vc.listen(sink_silence)
Wrapping the voice sink in a silence generator will create silent packets when the user stops talking so the silence detection loop has silent packets to work with
i havent tested running it yet
i tried to implement silence detection, but there is something stopping it
the silence threshold i made is nearly useless
Silence thresholds arent easy, it will have to be tweaked for everyone's mic
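One way to avoid a hand-tuned threshold per microphone is to compare each frame against a slowly adapting noise floor. This is a minimal sketch assuming decoded 16-bit signed PCM frames; the `margin`, `smoothing`, and `initial_floor` values are arbitrary starting points, not tested settings.

```python
import array
import math

def rms(frame: bytes) -> float:
    """Root-mean-square level of a frame of 16-bit signed PCM (native byte order)."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

class AdaptiveSilenceDetector:
    """Treats a frame as silent when it is not clearly louder than the noise floor."""

    def __init__(self, margin: float = 3.0, smoothing: float = 0.05,
                 initial_floor: float = 100.0):
        self.noise_floor = initial_floor
        self.margin = margin
        self.smoothing = smoothing

    def is_silent(self, frame: bytes) -> bool:
        level = rms(frame)
        silent = level < self.noise_floor * self.margin
        if silent:
            # Learn the floor only from frames judged silent, so speech
            # does not drag the threshold upward.
            self.noise_floor += self.smoothing * (level - self.noise_floor)
        return silent
```

With background noise the floor drifts toward the room level, so "silent" means "not much louder than usual" rather than "below a fixed number".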
Since you're using the whisper-stream library, you can perhaps track based on whether the sentence has ended
Also whisper large v3 has an issue with never finishing sentence punctuation iirc.
large v2 works great
actually whisper base is really fast
the silence detection should work
but for some reason there is a 10s waiting time that i dont know why is it there
i think there's something about the way it retrieves voice data, processes it, etc. that makes it really slow?
i think that's it, i normally disable turbo to save energy and not lose power, but then the cpu runs at a low frequency
shouldn't be a big problem for gaming and other stuff but my python code 😓
i hate my life
me--> 😪
🤗
Running with the tiny model it seems fine, responds within a second of finishing talking
With the reasoning models, is it the stopping strings that need to be modified/updated so that the bot doesn't keep speaking beyond its initial response, as well as not outputting </think> at the end of its responses, or are those two separate issues?
Even going through deepseek r1's release, I haven't seen any examples of what would need to be specified in order to remove the </think> from the text output. I get that these models are meant to be run where you can see the thinking/context window so you otherwise wouldn't see that, but it outputs it in ooba and when speaking to the bot on discord.
Both deepseek r1 and now QwQ do it, and it makes sense if it's related to the thinking/reasoning tokens. Bartowski has both exl2 quants and ggml files on HF for QwQ if you haven't tried it out yet.
Of course it could be used as a stopping string if you wanted
If the responses are generally too verbose, there are other settings for that
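Apart from stopping strings, the reasoning markup can also be removed after generation. A small sketch, assuming the model uses literal `<think>` / `</think>` tags as mentioned above; `strip_reasoning` is a hypothetical helper, not a setting in TGWUI or the bot.

```python
import re

# Complete reasoning blocks, including newlines inside them.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# Any leftover opening or closing tag on its own.
STRAY_TAG = re.compile(r"</?think>")

def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> blocks and any stray think tags."""
    text = THINK_BLOCK.sub("", text)
    text = STRAY_TAG.sub("", text)
    return text.strip()

print(strip_reasoning("<think>plan the reply</think>Hello there!"))
print(strip_reasoning("Hello there!</think>"))
```

Both calls print `Hello there!`. This only cleans the visible text; it does nothing about the model continuing past its first response, which is the separate stopping-string question.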
👍
i think it wasnt my fault, the extension depends a lot on cpu to decrypt, making it slow if the cpu is running at low clock speed
but man, if i dont remove the cap for clock speed, 35w for doing nothing and 40~70w for moving my mouse is
these days my pc is crashing, i think my intel cpu is cooked
is it possible to make the bot capable of being user installed
to make it usable in dms
it is usable in dms
in your setting file you can turn on dm
config.yaml
although it is not installed but it can be used
@terse folio i added a new asr engine somehow and broke all my loggings, not fixing that as it works just fine i think :)
https://github.com/marcos33998/asr_discordbot/tree/MoreEngine
now the question is how do i plug it to https://github.com/altoiddealer/ad_discordbot
Go into your Discord Developer portal, then Apps, Bot > OAuth2. Be sure that you check off pretty much all the checkboxes when generating the bot invite URL.
And yes as Marcos said you’ll also need to enable that one setting in the config.yaml
By default most commands are hardcoded as disabled for users via DM, but the bot owner (you) can use almost all cmds via DMs
So this is working pretty much exactly how you want?
I could try to make some time to integrate the feature
works just fine, with some flaws that i dont think it is my problem
the time it takes to collect all the packets is demonic
im confused if it is my cpu, the extension or discord's fault
the water of retrieving audio is deeper than i thought
@halcyon quarry can you try those benchmark scripts and give the results? and is it possible to somehow use partially pycord? also how hard would it be to move to pycord?
requirements are pynacl for both i think
and discord-ext-voice-recv @ git+https://github.com/Aviana/discord-ext-voice-recv for discord.py
i got really good result with pycord and demonic with discord.py
Voice receive extension package for discord.py. Contribute to Aviana/discord-ext-voice-recv development by creating an account on GitHub.
both use import discord 😓
discord.py uses !benchmark and pycord uses /benchmark
ad_bot already installs pynacl btw
ad_bot cant have pycord 💀
pycord is a different discord bot library.
But discord.py was the first in python and probably still the most feature packed.
But yes there are other libraries in other languages that do voice receive out of the box
also looking briefly at the utils.asr module, I can't find where "asr_manager" is defined
forgot to upload that one 😱
should work now
pycord is made on top of discord.py, the voice recording was supported in discord.py but dev abandoned it while pycord even improved it
i dont want to modify libraries but, both using discord...
what happened to the previous extension on discord.py?
?
if you mean the original and not the fork, it was pretty much abandoned, and a few months later discord deprecated the decryption method it supported
though the fork added support for the new encryption
and it is super slow
at least for me
oh I see,
I guess I havent updated yet or something
I was trying to get an echo test version of voice receive working, but joined the voice channel too much and discord locked me out for a day Xd.
hadn't really gotten there to test latency
hmm
yeah my discord.py was outdated as well, I activated the TGWUI venv and used pip install discord.py -U which updated it to current, which includes the support for that encryption
So if we are to add this feature I'd just need to update requirements.txt to ensure it specifies discord.py >= 2.5
if the requirements.txt will replace the current installation, yea that should work ^^
yes my updater scripts first git pull, then install from requirements.txt
so discord.py restored support for voice receiving
?
You said it was aead_xchacha20_poly1305_rtpsize yes? That's the method that the changelog says was implemented in discord.py 2.5
pure chaos
That Issue is from 2023 my dude
Seems that this encryption method is implemented as of merely 3 weeks ago
why would a guy update an extension when the main project actually does what the extension does
😓
they are playing with my mind
Well let's say I made a bitching extension that does something that doesn't work in the main project
Now some time later, I imagine they want to just update a few blocks of code rather than rewrite or discontinue the extension
Anyway, the way to correctly capture / encrypt/decrypt voice channel data in discord.py is likely documented here:
https://discordpy.readthedocs.io/en/stable/index.html
the methods etc are probably very similar if not identical to how it works in this extension you found
discord.py doesn't have voice recv because he couldnt figure out a good standard for it yet. so someone else made an extension
researching again if it has support now and get rid of the extension
Just took a quick look through the asr bot extension and it's not a crapton of lines or anything...
asr bot extension?
that is my humble bot, using the extension to retrieve packets
Ok I see that there is not a method like receive_audio_packet only a send_audio_packet
could be encryption for sending
actually...
s
discord.py only has voice send (think music bots)
Danny (the author) has stated he's not adding voice receive yet.
There's a discord.py discord if you need more info ^^
a few years ago
Aight 🙂 Looks like the extension here (https://github.com/imayhaveborkedit/discord-ext-voice-recv) could just be added as a dependency if we implement whisper?
Can confirm this one works
dont work for me
i get 2025-03-08 22:31:46 - __main__ - ERROR - Benchmark failed: aead_xchacha20_poly1305_rtpsize
I made this image way way way back when TGWUI was juuuuuuust getting off the ground
okay, will do a sanity check as it's been a moment since i updated ^^
i never got my echo test working for other reasons, but I should be able to get it to write to a wav file
Here's a quick LTX-Video image2video I just executed in about 1 minute on my 4070ti (12gb vram)
the one that works for me is https://github.com/Aviana/discord-ext-voice-recv
XD
I'm extremely interested in actually getting the comfy UI / swarm support added in, for users to easily configure execution of various workflows via the bot and send to channel the expected output
The one thing that's a bit of a bummer about current video generation is that the models generate the whole video as essentially a full length diffusion process - not a sequential process that could be paused and resumed
That's fascinating to me too, i did a little research into it and found that you can upload custom comfyui nodes via api to be processed?
That would be cool for things like assigning entire workflows to tags
So those sort of requests would stall the bot bigtime in a busy server
Yes, I believe there's some extreme flexibility for using the Comfy API
maybe think about some distribution.
Like giving the option for people to add multiple api urls for multiple comfyui servers
And have the bot pick the least used/free one
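The "pick the least used one" idea could look something like this. It is a sketch only: `EndpointPool` is a hypothetical class, the URLs are placeholders, and it counts jobs locally rather than asking each ComfyUI server for its real queue length.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Endpoint:
    url: str
    active_jobs: int = 0

class EndpointPool:
    """Hands out the endpoint with the fewest jobs currently in flight."""

    def __init__(self, urls: List[str]):
        if not urls:
            raise ValueError("at least one API url is required")
        self.endpoints = [Endpoint(url) for url in urls]

    def acquire(self) -> Endpoint:
        # min() returns the first endpoint on ties, so earlier urls are preferred.
        endpoint = min(self.endpoints, key=lambda e: e.active_jobs)
        endpoint.active_jobs += 1
        return endpoint

    def release(self, endpoint: Endpoint) -> None:
        endpoint.active_jobs = max(0, endpoint.active_jobs - 1)

# Placeholder urls for two local servers.
pool = EndpointPool(["http://127.0.0.1:8188", "http://127.0.0.1:8189"])
first = pool.acquire()    # both idle, first url wins the tie
second = pool.acquire()   # first is busy, so the second url is chosen
pool.release(first)
print(first.url, second.url)
```

In a real bot the `release` call would go in a `finally` block around the generation request so a failed job does not leave an endpoint looking busy forever.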
can help there when the time comes
Some pretty ambitious thoughts going through my head lately in regards to managing the generative endpoints
I've currently got some pretty fixed and rigid definitions for what an image request payload entails -- I was thinking how it would be much better to make it so the current payload stuff I have is like an example template, but that the bot could accept, process and send whatever the user defined without any errors.
somehow pycord actually connected using xsalsa20_poly1305_lite, why this works???
Doing this wouldn't really be too hard but I'd want to make like, a dedicated method to filter some of the client specific features I've written for A1111 / Forge / Reforge
yea, apis change, people might want to use other tools.
Checking requests is useful if you know where it's going.
having the bot return the error the api returns should suffice ^^
No clue tbh, I never had to mess with encryption stuff
I think it already does this... I just have a lot of micromanagement going on that I could cut back on
But mainly, the overhaul I need would make it very very simple to manage payloads and settings for various APIs
like a user directory for storing payloads, then just having one setting in config/py to specify the one you'll be using, or something
interesting, be safe with that.
Storing payloads in json/yaml files could mess things up if they aren't escaped properly
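On the escaping worry: one way to sidestep it is to never build the payload by pasting user text into the file contents, and instead parse the stored template and set values as Python objects. A minimal sketch, assuming JSON templates in a user directory; `load_payload` and the keys `prompt`, `steps`, and `width` are illustrative, not the bot's actual schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

def load_payload(path: Path, overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Load a stored payload template and apply per-request overrides."""
    template = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(template, dict):
        raise ValueError(f"{path} must contain a JSON object")
    # Overrides are set as values, never spliced into the file text,
    # so quotes or newlines in a prompt cannot break the structure.
    return {**template, **overrides}

# Demo with a throwaway template file.
with tempfile.TemporaryDirectory() as tmp:
    template_path = Path(tmp) / "txt2img.json"
    template_path.write_text(
        json.dumps({"prompt": "", "steps": 20, "width": 512}), encoding="utf-8"
    )
    payload = load_payload(template_path, {"prompt": 'a "caveman" reading\nhis monitor'})
    body = json.dumps(payload)  # the serializer handles the escaping

print(body)
```

The same idea applies to YAML with a safe loader; the point is that serialization happens once, at the end, by the library.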
Well I currently already do something like this, with basesettings.yaml
BTW - I've been using a lot of ComfyUI workflows lately, there's a ton of cool crap you can do without having to actually play with the spaghetti
comfyui is cool
Followed by, lots of additional cool crap you can do once you do know how to play with the spaghetti
you tried right?
yes, yesterday or the day before, everything was working.
And my other test bot was working.
today I'm having the encryption issue
I'll try out the fork later
The default example i2v workflow for LTX-Video includes this prompt enhancing LLM that wrote a very good caption for this image
try the benchmark too, mine is very slow, now that you have to switch to the fork you will suffer like me
So my input was the image, and a prompt a caveman reading something on his computer monitor
here
when i use image generation i still use words,words,asdasd,asd,asd,asd
what i find interesting about comfyui is that you can plug in a lot more things and have super complex workflows, i even saw people selling comfyui workflows 💀
Yeah it’s the YouTubers all going full blown Patreon lol
It’s insanity really anyone who can invest an hour or so just tinkering can figure out how to connect different workflows together and stuff
It’s so simple now that you can create Groups of nodes
You can toggle a complex feature on and off by just Bypass Group now
I had a similar view, and still do kinda...
My family often told me I should create tutorials or classes for friends. (random topics like programming)
And I often replied like "anyone could figure this out with a few hours of research"
or "it's really nothing special"
But sometimes we forget how niche the things we know are.
Here for example, a lot of us have some programming knowledge, and the node networks of comfyui are reminiscent of programming where one thing pipes into another.
But, especially nowadays where AI tools are mainstream, everyone wants to do it.
I can understand why people sell shortcuts.
But I appreciate it when there's an open source solution that you can compile yourself for free or download a working version for a donation
people are getting dumber with tiktok
Yea, I can see that with ai tools as well, taking away reason to do your own problem solving 😭
Memebenders
like this?
it returns invalid scopes. im looking on the discord documentation what kinds of scopes i need but its a bit confusing
Make sure you did everything noted here in Step 2 of the installation instructions
https://github.com/altoiddealer/ad_discordbot?tab=readme-ov-file#installation
Also check your server settings - Integrations > Bot
And Roles. Make sure the bot has privileges
this is for making it userinstallable yes?
It could be installed without any intents scopes etc, but then it’s not very useful
i kinda got it to show up at all in another server its not installed in, albeit it didnt work fully
In the developer portal, Bot, OAuth2 (I mentioned this yesterday)
This is where you generate an invitation link for your bot to join a server
The link changes as you check off the various permissions you want to allow the bot to have
yeah
So check off Bot, which expands a new section, and check basically everything. Copy paste the link in browser and invite the bot to the server. Repeat for additional servers
You may also need to give the bot a Role with more permissions like, Send Messages etc
In the server settings, or channel settings, etc
you always can look for examples at ad_discordbot/settings_templates
or https://github.com/altoiddealer/ad_discordbot/tree/main/settings_templates
Discord bot which transforms your servers into hubs for limitless local AI-driven interaction and content creation. Features cutting-edge tools for professionals, and unlocks creative fun for casua...
He has the settings templates in correct place and all that, for sure
the bot copies them automatically if the user did not already
wanted to fork pycord and change its name and misclicked and did a PR 😓 never used codespace and do a merge with vs code :v
got blocked really quick XD
also i think i misclicked a lot of things 😦
how can i have pycord and discord.py at the same time?
this is killing me
i really dont want to use the extension of discord.py ;-;
actually i believe that they can be installed together
dont tell me that they really can be together .-.
dreaming too much f
actually got an idea, gonna try it tomorrow

@halcyon quarry is it okay for you to add another bot for stt?
although discord.py's extension can be added to the same bot, i'd prefer adding another bot with pycord as it is faster for me
Does the bot send the transcribed text to the channel? If so, then my bot should already be able to handle it... although I'd probably want to add a tag like "regex_text" that could be a regex string to update the user text. In this case, ignore or modify the prefix that other bot adds
what i was thinking is a deeper connection, but if a separate bot sending the transcription is good enough i will be happier, right now my bot sends displayname: transcribed text
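For the `displayname: transcribed text` format, the regex side of such a tag could be as simple as the sketch below. The pattern and the `split_transcript` helper are assumptions for illustration; how a `regex_text` tag would actually be wired into the bot is not decided in this thread.

```python
import re
from typing import Optional, Tuple

# A short name without colons or newlines, then ": ", then the rest of the message.
PREFIX = re.compile(r"^(?P<name>[^:\n]{1,64}):\s*(?P<text>.+)$", re.DOTALL)

def split_transcript(message: str) -> Tuple[Optional[str], str]:
    """Split 'displayname: text' into (name, text); (None, message) if no prefix."""
    match = PREFIX.match(message)
    if not match:
        return None, message
    return match.group("name").strip(), match.group("text")

print(split_transcript("Marcos: Hello, hello, hello"))
print(split_transcript("no prefix here"))
```

Keeping the name instead of discarding it would let the receiving bot attribute the text to the right speaker once multi speaker transcription exists.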
Just trying to stay flexible 🤗
besides a customizable Regex tag has been on my mind for some time
is there tag to stop tts?
actually, for voice chat anything should stop the tts
and to make it better, add "you were interrupted" to the context
Suppose I need a "should_tts" tag
i'll try to add multi speaker for now
had to move to pycord as the bot was made for discord.py so... multi speaker will be done before april 👍
also i'll have to move a lot of things to config but my bot is lightning fast now https://github.com/marcos33998/asr_discordbot/tree/pycord-MoreEngine
A speech-to-text (STT) bot for Discord that joins voice channels and transcribes conversations. - GitHub - marcos33998/asr_discordbot at pycord-MoreEngine
lightning fast in the transcriptions?
That's cool, I wonder what the difference was between pycord and the former?
At least for me discord.py was super slow
Before, it took me 5x the time to collect all the packets
My hypothesis is lack of optimisation
makes sense makes sense,
That's something i'm interested in testing timings with when i have free time
But i'm happy it's working good for you now 😸
Although the biggest problem it has now is recording the whole voice channel in a single file
I'm done with my video game vacation now so I'll be making progress again
Result of the recent new image command option, LLM gave a very good prompt for Flux
This NeuralBeagle model will never cease to amaze me
bot at the top :O
Interesting, I had made a PR to add the bot but I definitely did not put it at the top
Either ooba checked out the bot and thought it deserved the recognition? Or someone else sneaky moved it in a PR lol
Congrats, that's really cool :)
The best way to get beaten to something,
oh misread a little, I thought you were considering the PR but someone did it before you
I'm really stoked about it
That repository has instructions for requesting to have an extension added to the list, which is simply to clone the repo, update the list and send it as a PR.
I did this, but I had inserted my bot way down the list, just below Oobabot
It looks like shortly after, he decided to reorganize to sort what he thought to be most noteworthy to the top
that's epic!
I only feel bad you don't have more recognition, you wrote some of the most impressive features that it offers
Well that's in my control eh 😛
I was glad to make some friends here, and support something cool ^^
I’m grateful, very. I hope you’re also proud of this ascension of the bot in that list
Could not be in that coveted spot without you
I remember every contribution you made, notably you sorted out my chaotic single file making my life so much easier
Absolutely!
It's wonderful to see those parts still being used and expanded upon,
anyway, it's time for me to sleep for now, take care!
@valid crypt thanks for pointing this out I’m literally drinking champagne over this
Mind is blown
🫡
I think it could make sense to do the following:
- ship the bot with text and image generation each disabled by default.
- make it possible for the text generation to be run in 2 modes, as API mode, or the custom TGWUI integration
While the bot is configured for anything besides text gen enabled + TGWUI integration, do not attempt to activate and rely on the TGWUI venv. Instead create own venv.
I suppose these would be better controlled via CMD Flags
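The rule in the list above (only lean on the TGWUI venv when text generation is enabled in TGWUI integration mode) can be written down as a small decision function. Everything here is a hypothetical sketch: the enum values, `BotConfig`, and the returned plan strings are not the bot's real config keys or flags.

```python
from dataclasses import dataclass
from enum import Enum

class TextGenMode(Enum):
    DISABLED = "disabled"   # shipped default in this proposal
    API = "api"             # talk to a text generation API
    TGWUI = "tgwui"         # custom TGWUI integration

@dataclass
class BotConfig:
    textgen: TextGenMode = TextGenMode.DISABLED
    imggen_enabled: bool = False

def venv_plan(config: BotConfig, tgwui_venv_found: bool) -> str:
    """Decide which virtual environment the launcher should use."""
    if config.textgen is TextGenMode.TGWUI:
        # Only this mode imports TGWUI modules, so only it needs that venv.
        return "use-tgwui-venv" if tgwui_venv_found else "error-tgwui-venv-missing"
    # Image-only, API mode, or everything disabled: never touch the TGWUI venv.
    return "use-own-venv"

print(venv_plan(BotConfig(imggen_enabled=True), tgwui_venv_found=False))
```

Command line flags would then just populate `BotConfig` before this check runs in the launcher.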
Yes, making some good progress on this... borrowed a lot of code from TGWUI setup so that it will create a venv and install requirements
I’m going to have it detect and ask how to handle venv setup
got multi speaker working, https://github.com/marcos33998/asr_discordbot/tree/multiSpeaker , the next step would be filter tts from bot, or a more general blacklist, and packing responses, as i think it would be better on my side than in bot's side
i need a tag to stop tts on new message, and add "you were interrupted while speaking" or custom text to the context
wait a second, why bot is replying to my bot when i speak but not when bot speaks? :O magic
Actually, what needs to happen is to stop tts if the user speaks, so either you add voice detection or your bot has to communicate somehow with mine
after taking a shower, i think this can be done with tags!
a tag to pause tts, to continue and to abort
I feel a bit over my head with adjusting this installation procedure...
In this process, I noticed that a requirement of the bot, pydub, has been expected to be present in the TGWUI environment -- but TGWUI does not seem to install this by default. It seems to only get installed by some common extension we've been using
In any case, I figured out the few requirements needed if not relying on TGWUI environment, and they're all now in requirements.txt
I don’t think this is a viable solution
A simple stop tts would be fine for now
In my bot there are some false alarms although they are filtered afterwards, to make the process fast, pausing the tts is imo the best way
I’m sure someone has thought of this but I was thinking how it would be really cool if there was a such thing as future conditioning, when generating text “in the middle”
Something like this must already exist, but I just haven’t heard of it
The bot’s history management allows generating in the middle but only by omitting the future text. Would be neat if there was a mode where including it had an influence
Eh nvm this is achievable by summarizing the future text and just better prompting including it
at least a simple, fast tag to stop tts pls
Alright, I’ll see about doing this tonight
i also want this :p
Working on the "should_tts" tag now
I added this tag...
toggle_vc_playback: str
Changes playback in guild's voice channel where tag is triggered. Use with 'for_guild_ids_only' condition for selective control. Valid values: 'stop', 'pause', 'resume', 'toggle' (pauses or resumes)
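To make the four values concrete, here is a sketch of how they could map onto a voice client. The method names (`is_playing`, `is_paused`, `pause`, `resume`, `stop`) mirror those on discord.py's `VoiceClient` as far as I know; `FakeVoiceClient` and `apply_toggle` are stand-ins written for this example, not the bot's implementation.

```python
VALID_VALUES = {"stop", "pause", "resume", "toggle"}

class FakeVoiceClient:
    """Tiny stand-in for a voice client so the logic can run without Discord."""

    def __init__(self, state: str = "idle"):
        self.state = state  # "idle", "playing", or "paused"

    def is_playing(self) -> bool:
        return self.state == "playing"

    def is_paused(self) -> bool:
        return self.state == "paused"

    def pause(self) -> None:
        self.state = "paused"

    def resume(self) -> None:
        self.state = "playing"

    def stop(self) -> None:
        self.state = "idle"

def apply_toggle(vc, value: str) -> str:
    """Apply a toggle_vc_playback value; returns what actually happened."""
    if value not in VALID_VALUES:
        raise ValueError(f"invalid toggle_vc_playback value: {value!r}")
    if value == "toggle":
        # 'toggle' pauses if something is playing, otherwise tries to resume.
        value = "pause" if vc.is_playing() else "resume"
    if value == "stop":
        vc.stop()
        return "stopped"
    if value == "pause" and vc.is_playing():
        vc.pause()
        return "paused"
    if value == "resume" and vc.is_paused():
        vc.resume()
        return "resumed"
    return "unchanged"
```

Note that nothing in this function touches text or TTS generation; it only changes the state of whatever is already playing.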
Ok I just added a should_tts tag as well - which is only useful if using value false to prevent TTS on the current interaction
@valid crypt Please let me know if these additions work as expected, as I think they should!
ok updating now
toggle_vc_playback could have should_gen_text: false as default
actually with that off, it does not work
toggle_vc_playback is not intended to have any influence on TTS generation - simply affecting the current playback state in the voice channel
but it triggers generation
You mean, the absence of this tag does not result in TTS generation?
Adding this tag makes TTS generation happen explicitly?
I have the other should_tts tag which is intended to explicitly prevent TTS from generating
a message with the toggle tag does toggle the tts, which means the tag works, but the tag message itself also triggers text generation, and adding the tag that turns off generation makes the toggle stop working
Please tell me how you feel the logic should be applied
my intention was that with the tag the vc could be toggled, but the message that contains the tag should not generate text
What happens if you include both tags?
i thought it could be done with this tag but it makes the toggle stop working
that is something that i was going to try later
Right well I think you're expecting toggle_vc_playback to do more than I think it should do
😓 just it shouldn't take it as a message
Maybe you have it triggering on the wrong condition
there must be audio playing to be toggled, but by sending the toggle, you get a new audio (new response)
with the comment it works, removing the # it doesnt work
my intention was let my asr bot to pause the tts if it was receiving audio packets
should tts
oops
Sorry :< Just pushed the fix for that
Line 1673 in bot.py needed an await
tts_sw = await self.check_tts_before_llm_gen()
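Why the missing `await` matters: calling an async function without awaiting it returns a coroutine object, and a coroutine object is always truthy, so any `if` on the result takes the wrong branch. A standalone demonstration with a made-up `check_tts` function:

```python
import asyncio

async def check_tts() -> bool:
    """Stand-in for an async check that returns False."""
    return False

async def main():
    forgotten = check_tts()        # no await: this is a coroutine object
    is_truthy = bool(forgotten)    # always True, whatever the function returns
    forgotten.close()              # silence the "never awaited" warning
    awaited = await check_tts()    # the real return value
    return is_truthy, awaited

print(asyncio.run(main()))  # (True, False)
```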
i hope that tag can be triggered by the bot itself too :v
erm
It wouldn't make sense because the TTS would have already been generated
They are generated simultaneously
The tags system does not currently review and apply tags to every response "chunk" as it is generating text - it applies tags after the response has completed generating
oh
That'd be a bit of a tricky one to implement...
I do have a special system in place for reviewing response chunks for "censored" text
Could slip in the special handling for TTS here
at the expense of, even more code complexity
It's currently running this code here for every response chunk. When initially building tags, it looks specifically for censor tags and keeps separate tabs on them for this function
Would have to do the same - make a list of should_tts tags as they are being built, then look for trigger text in every response chunk
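A sketch of what looking for trigger text in every response chunk involves: since a trigger can be split across two chunks, the scanner keeps a short tail of previous text. `ChunkTriggerScanner` and the example trigger are hypothetical; this is not the bot's censor code.

```python
from typing import List

class ChunkTriggerScanner:
    """Finds trigger phrases in streamed text, even across chunk boundaries."""

    def __init__(self, triggers: List[str]):
        self.triggers = [t.lower() for t in triggers if t]
        # Keeping len(longest trigger) - 1 characters is enough to catch a split.
        self.keep = max((len(t) for t in self.triggers), default=1) - 1
        self.tail = ""

    def feed(self, chunk: str) -> List[str]:
        """Return the triggers seen once this chunk is added to the stream."""
        window = self.tail + chunk.lower()
        hits = [t for t in self.triggers if t in window]
        if hits or self.keep == 0:
            # Clear the tail after a hit so the same text is not reported twice.
            self.tail = ""
        else:
            self.tail = window[-self.keep:]
        return hits

scanner = ChunkTriggerScanner(["remain silent"])
for chunk in ["I will rem", "ain sil", "ent now"]:
    print(repr(chunk), scanner.feed(chunk))
```

Only the third chunk reports a hit. The cost is one substring search per trigger per chunk, which is the extra complexity being weighed above.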
no hurry :p
changed manually and there is tts, this is my second time booting it, the first time there was nothing.
Ok I solved one of 2 bugs there
this error here is due to the TTS streaming feature...
Try adding this little bit here which I think should resolve it
not sure if this will actually resolve it
Was TTS streaming working on the main branch?
with the new remote Alltalk v2?
It's possible that this feature is bugged for the new AllTalk (I haven't had time to test it out yet)
streaming works
ill try that
If that fails,
Add a print statement here and print chunk
print("chunk:", chunk)
Also one here - to print vis_resp_chunk
print("vis_resp_chunk:", vis_resp_chunk)
In any case I should probably add an `if vis_resp_chunk:` line here before searching for the audio pattern
ok
I did add this line now... which will prevent error but won't fix unexpected bug
It generated 0 tokens
Are you sure this only happens when using should_tts?
```
============================================================
C:\text-generation-webui-main\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py:1240: RuntimeWarning: Detected duplicate leading "<|begin_of_text|>" in prompt, this will likely reduce response quality, consider removing it...
  warnings.warn(
Llama.generate: 1334 prefix-match hit, remaining 99 prompt tokens to eval
llama_perf_context_print: load time = 2194.38 ms
llama_perf_context_print: prompt eval time = 769.07 ms / 99 tokens ( 7.77 ms per token, 128.73 tokens per second)
llama_perf_context_print: eval time = 4166.61 ms / 45 runs ( 92.59 ms per token, 10.80 tokens per second)
llama_perf_context_print: total time = 5032.58 ms / 144 tokens
Output generated in 5.04 seconds (8.93 tokens/s, 45 tokens, context 1465, seed 984739868)
vis_resp_chunk: <audio src="file/C:\text-generation-webui-main\extensions\alltalk_remote_tts\Ganyu_20250318-180936.wav" controls autoplay></audio>I've been thinking about our last conversation, and the way you described the view from the mountain still brings a smile to my face.
18:09:36.446 #3170 INFO [bot.__main__]: Marcos: "remain_silence"
```
everywhere that I have this code, I see no evidence that it should affect text generation at all
i typed correctly right?
tags don't update as you modify the file though - use a command like /character to refresh tags
Or reload the bot
You have that defined correctly
hmm..
turning off streaming? or changing tts?
try updating this block of code I have at line 1650
Replace with this block
async def check_tts_before_llm_gen(self:Union["Task","Tasks"]) -> bool:
    # Toggle TTS off if not sending text, or if triggered by Tags
    if (not self.params.should_send_text) or (self.params.should_tts == False and tts.enabled):
        return await tts.apply_toggle_tts(self.settings, toggle='off')
    # Conditions which are only valid for guild interactions
    if hasattr(self.ictx, 'guild') and getattr(self.ictx.guild, 'voice_client', None):
        # Toggle TTS off if interaction server is not connected to Voice Channel
        if not voice_clients.guild_vcs.get(self.ictx.guild.id) and int(tts.settings.get('play_mode', 0)) == 0:
            return await tts.apply_toggle_tts(self.settings, toggle='off')
    return False
I might know the real issue here...
I don't think the extension params are controlling AllTalk
What happens if you try using a different voice with the /speak command? Does it speak using a different voice?
Or, if you use a different voice filename in the character file?
i remember that you didnt add /speak for all talk, anyways ill go with kokoro
I did add /speak for alltalk
the remote
Does that still reside in a directory called alltalk_tts in the extensions folder?
it is called alltalk_remote
what i did was changing it to alltalk_remote_tts and it works
try changing it to alltalk_tts
kokoro, there is tts,
We didn't figure out how to control kokoro via extension params either
Try edge_tts or try renaming alltalk
These are the "supported" ones:
'alltalk_tts', 'coqui_tts', 'silero_tts', 'elevenlabs_tts', 'edge_tts', 'vits_api_tts'
hum
When I have time I need to try getting alltalk remote working on my end
see what's up with that...
I expect that the TTS will correctly be prevented when using tts apps that respect the extension parameters
alltalk TTS remote may have different labels for the extension params
i somehow killed vits-simple-api let me fix it
alr, idk why it died, i got to reinstall it, and i hope that everything will be fine
Same!
i dont know wt is going on with this
I'd be looking into this now if I wasn't super busy
alr vits working (just the tts)
actually we can delay should_tts , and make the toggle tag do not trigger a response :v
i think i didnt change anything but, does the bot play the tts locally???
it is a feature that i would like to have though
toggle_vc_playback applies specifically to the voice channel
should_tts applies to the TTS generation entirely
here buddy, python is playing the tts
i was wondering why i was hearing echo, it was discord and python
lmao
was that a new feature only with vits?
Probably
its your bot buddy
Ok I thought you meant, that the actual vits code when triggered by TGWUI, might open some python player
b/c tgwui is running vits code when TTS is triggered as an extension
vits is running on my main pc and bot on my remote pc
and the remote is playing locally the tts
for now should_tts is not really working
as it shouldn't generate the tts from the beginning
that was on me
typo
vits works
As in, the tag works correctly with vits yes?
The bot can only modify TTS behavior if the extension parameters are valid
as defined in base_settings or your character file
vits ✅
also tested that vc toggle gives error if there is no audio
vc toggle + should tts = 0 tokens
the toggle should be able to combine with this
Please elaborate
i should be able to pause the tts without triggering text generation
You can
how
You may be correct
this is what happened when i used should_gen_text: false
basically nothing
it does not trigger text generation but same for the tag to pause
i suppose that you will fix it, so i'll be adding those features to the asr bot
👍
ehhhh
DDD:
yeah I'll figure something out
👍
I need to review if there's any other tags irrelevant to the text generation, process them regardless
fixed and pushed?
i finished mine https://github.com/marcos33998/asr_discordbot
A speech-to-text (STT) bot for Discord that joins voice channels and transcribes conversations. - marcos33998/asr_discordbot
Probably tomorrow 😆
Still trying to make progress on “unrequire TGWUI” logic
Bat file is getting pretty complex but almost have the installation/launch logic figured out
oh
I'm taking steps towards the bot being used in either of 2 ways:
- With TGWUI integration (as it is currently)
- As a Standalone where TGWUI is not required, but can be used via API
- Image generation capabilities and other bot functions will work in either setup
The logic I'm adding into the launcher script:
1. It checks for a txt file that confirms whether the bot is installed, which will specify the conda environment.
2. If the file is not found, it assumes it is the first run of the bot.
2a. It will detect if the bot is nested in TGWUI. If so, it will have both install options.
2b. If TGWUI is not detected, it will mention that TGWUI was not found for an integration option, and only provide Standalone option.
3. Depending on the option, it will activate the appropriate environment, check for the bot's requirements there and install as necessary.
3a. For TGWUI integrated, the bot will not create its own environment.
3b. For Standalone, the bot will download git / Miniconda as necessary, install them automatically, and create the environment - in the same fashion that TGWUI does.
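A rough sketch of that decision flow, written in Python only for readability (the real launcher is a batch file). The function names and the dictionary shape are made up for illustration, and how the marker file or TGWUI nesting is actually detected is not shown here.

```python
from typing import List, Optional


def install_options(tgwui_detected: bool) -> List[str]:
    # 2a / 2b: both options when nested in TGWUI, otherwise Standalone only.
    return ["tgwui", "standalone"] if tgwui_detected else ["standalone"]


def plan_launch(marker_text: Optional[str], tgwui_detected: bool) -> dict:
    # A marker file that names the conda environment means "already installed".
    if marker_text and marker_text.strip():
        return {"first_run": False, "env": marker_text.strip()}
    # No marker: treat this as the first run and offer the install modes.
    return {"first_run": True, "options": install_options(tgwui_detected)}


def needs_own_env(choice: str) -> bool:
    # 3a / 3b: only the Standalone install creates its own environment.
    return choice == "standalone"
```

Keeping the three checks as separate pure functions makes each branch easy to test before porting the logic into the bat file.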
I plan on segregating the TGWUI integrated features such as Extension management, and anything else that won't be compatible with API.
Finally - I'm replacing the Updater scripts with update wizards just like TGWUI has. The wizard will have option to switch from TGWUI to Standalone (vice versa if TGWUI is detected)
if you are able to do that, as it can use any open ai compatible apis, something that would be amazing would be adding support for vlm :V
the only extension adding vlm in tgwui is https://github.com/RandomInternetPreson/Lucid_Vision which is not very cool :/
also i dont think it would work now
My ultimate vision is that the bot can host any number of APIs, so long as user wants to duplicate a block of code and fill in some lines
use the tags system to call whatever API and do whatever with the response
Need to inch my way in that direction and it starts with still recommending, but un-requiring TGWUI
i want to turn my lights on :p
ad_discordbot has ya covered 👍
i think there is a open source api for that 👍
idk if this works https://github.com/google-home/google-home-api-sample-app-android but the idea is definitely interesting and has a lot of potential
From the TGWUI installer:
@rem figure out whether git and conda needs to be installed
call "%CONDA_ROOT_PREFIX%\_conda.exe" --version >nul 2>&1
if "%ERRORLEVEL%" EQU "0" set conda_exists=T
@rem (if necessary) install git and conda into a contained environment
Realizing that this startup script actually doesn't install git though 😛
at the end it calls another script which does, but still
Sorry I didn't finish up the TTS stuff yet
no hurry
technically 2 bot can log into the same account but cant use the same voice channel
hmmmmm
why sst if no tts :(
bro is back after 1 year - 3 days https://github.com/imayhaveborkedit/discord-ext-voice-recv
Voice receive extension package for discord.py.
I'm actually glad to be splitting up these "generic tags" processing, the few things being added here are essentially duplicates in both llm tag processing and img tag processing functions
Will just call this ahead of those with the 'phase' as positional argument
@valid crypt I pushed the changes to the same branch
This now processes "generic" tag matches immediately after matching them, and before handling LLM and Img specific functions
So you should now be able to toggle the voice channel playback regardless of whether anything is being generated or not
this handles the following tags: flow, toggle_vc_playback, send_user_image and persist
If you confirm it handles the VC playback expectedly I'll push it to main
rats, missed it
at least with alltalk remote should tts doesnt work, and when a new message is sent the current audio cant be paused
basically if bot is generating the next message, the current tts cant be paused
right, the issue with that is the extension params must have changed or something, or the directory name is messing it up
the bot controls the TTS by hijacking TGWUI's extension loader and updating parameters
I should be able to figure that one out - this is an alltalk-remote specific issue though.
Thanks a lot for beta testing these improvements
And the suggestions, all good ones
@valid crypt if you want to help debug this alltalk thing
print("EXTENSION ARGS:", shared.args.extensions)
print("EXTENSION ARGS:", shared.settings)
def on_ready
Did you have to do anything special to get the remote thing working, or just follow the steps carefully?
nothing special
I'm going to go ahead and try getting things up and running on my end since I have a little bit of time on my fingertips
settings are changed either through webui or directly in alltalk
Ok so in your basesettings or in your character, etc
There is the bots custom extension support
If you have an alltalk_tts dictionary key just try renaming it to alltalk_remote_tts
try it
rats
These are the correct extension args Im pretty sure
But I need to update the bot to ensure certain keys behave correctly...
I’m working out some more kinks, like adding exceptions for when current method to “get voices” fails
Plan to add TTS API.
It seems like the remote extension ignores settings defined from settings.yaml, and only applies changes via gradio
I expect the author of alltalk will clarify whether the alltalk v2 remote extension can be controlled in the same way as other extensions (via TGWUI extension arguments).
:V
ive noticed something funny, at least for all talk remote, an api request to tgwui will trigger tts XD
I have API working for /speak command now
@valid crypt I just pushed changes to that tts branch, which prevents the script from crashing when trying to collect voices for the /speak command
Also a few api settings added to config.yaml
The bot can now use the /speak command with alltalk v2.
You can safely rename the extension folder to anything with the phrase 'alltalk' in it and it should behave the same (alltalk, alltalk_remote, etc)
These additions are kind of a hotfix - I have much bigger plans for API stuff, it's really going to be an overhaul on the bot
gotta change my focus back to the install process, update wizard scripts, etc
i have experienced some sudden crashes, but never thought that the command was the culprit
The one you shared in the past when the directory was named "alltalk_tts" occurred because it tried importing a function that didn't exist in the new alltalk v2, and I did not have it in a "try / except" block
I revised the logic so that if there is a specified tts voices endpoint, it will first attempt to collect the voices using it. If it fails, it will try using the original methods I had. If that fails, it now just disables the voices option in /speak cmd
rather than just crashing and burning
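The fallback order described above could look roughly like this. It is only a sketch: `collect_voices`, `from_endpoint` and `from_extension` are hypothetical names, and the two lookup functions are stand-ins that simulate failures rather than real AllTalk calls.

```python
from typing import Callable, List, Optional, Sequence


def collect_voices(methods: Sequence[Callable[[], List[str]]]) -> Optional[List[str]]:
    """Try each voice-listing method in order; return None if all of them fail."""
    for method in methods:
        try:
            voices = method()
            if voices:
                return voices
        except Exception as e:
            print(f"Voice lookup via {method.__name__} failed: {e}")
    return None  # caller disables the voices option instead of crashing


def from_endpoint() -> List[str]:
    # Stand-in for "ask the configured tts voices endpoint".
    raise ConnectionError("no voices endpoint reachable")


def from_extension() -> List[str]:
    # Stand-in for the original method (importing from the extension).
    raise ImportError("function not found in this extension version")


voices = collect_voices([from_endpoint, from_extension])
voices_option_enabled = voices is not None
```

When every method fails, `voices` is `None` and the /speak command would simply be built without a voices option.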
after some testing, i think if there is a message with no gen text, and gets deleted after some milliseconds, the input that comes afterward is not detected
let me test further, i got that by using my asr bot
example, my bot sends pause tag after receiving audio packets and deletes it
if the one that comes after is not a tag, then there is a cooldown of 1 seconds?
I don't really understand what you're describing
Whenever the bot is triggered to interact in any way, such as a message request, etc - it collects the information it needs, creates a task, and queues it.
It then processes each task sequentially
It can queue up new tasks while it is processing the current task
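As a minimal illustration of that queue behavior (not the bot's actual Task classes), here is an `asyncio.Queue` with a single worker. The task names and the `None` sentinel are assumptions for the demo.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list) -> None:
    # Process tasks strictly one at a time, in the order they were queued.
    while True:
        task = await queue.get()
        if task is None:  # sentinel: stop the worker
            break
        await asyncio.sleep(0.01)  # stand-in for generation / Discord I/O
        results.append(task)


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    runner = asyncio.create_task(worker(queue, results))
    for name in ("message 1", "pause tag", "message 2"):
        queue.put_nowait(name)  # new tasks can be queued while one is running
    queue.put_nowait(None)
    await runner
    return results
```

Running `asyncio.run(demo())` returns the tasks in the same order they were queued, which is why a tag message sent right before a normal message is still handled first.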
There is a behavior setting called "chance to reply to other bots"
maybe it's at 0.0?
i thought that too, i dont have anything so it should use default, the test is made after adding reply to 1
but if i speak longer which means sending the text later, works
weird
it also isn't consistent
that is weird too because before it was often working too
Here is the code that applies the vc playback tag
async def toggle_playback_in_voice_channel(self, guild_id, action='stop'):
    if self.guild_vcs.get(guild_id):
        guild_vc:discord.VoiceClient = self.guild_vcs[guild_id]
        if action == 'stop' and guild_vc.is_playing():
            guild_vc.stop()
            log.info(f"TTS playback was stopped for guild {guild_id}")
        elif (action == 'pause' or action == 'toggle') and guild_vc.is_playing():
            guild_vc.pause()
            log.info(f"TTS playback was paused in guild {guild_id}")
        elif (action == 'resume' or action == 'toggle') and guild_vc.is_paused():
            guild_vc.resume()
            log.info(f"TTS playback resumed in guild {guild_id}")
If the value is stop and something is currently playing, it will stop.
If the value is pause or toggle and it's currently playing, then it will pause.
If the value is resume or toggle and it is currently paused, then it will resume.
also i was trying to get as little latency as possible so i changed the stream chance to 2 and it is not always splitting (exclamation mark should split)
same here
I'm contemplating this now
i think is the text gen tag, so ill test deeper
The text splitting is super complicated btw lol
What makes it super complicated, is that longer syntax such as \n\n will never trigger without some very complicated logic
got you more example
I made a system where it creates a little "window of text" to check (it is only evaluating like 5 characters at a time).
If it matches on a shorter syntax (like ".") it will set a flag to not split the text, and wait one more iteration.
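To make the "wait one more iteration" idea concrete, here is a simplified sketch. It is not the bot's splitting code: `SPLIT_SYNTAX`, `stream_split` and the `pending` marker are assumptions, and it will happily split inside things like "3.14".

```python
from typing import Iterable, List, Optional

SPLIT_SYNTAX = ("\n\n", "\n", ".", "!", "?")  # assumed list of split syntax


def stream_split(tokens: Iterable[str], window: int = 5) -> List[str]:
    chunks: List[str] = []
    buffer = ""
    pending: Optional[int] = None  # index just past a match, held for one iteration
    for token in tokens:
        buffer += token
        if pending is not None:
            cut = pending
            # Absorb split characters that arrived late ("?" followed by "\n").
            while cut < len(buffer) and buffer[cut] in ".!?\n":
                cut += 1
            chunk = buffer[:cut].strip()
            if chunk:
                chunks.append(chunk)
            buffer = buffer[cut:]
            pending = None
        tail = buffer[-window:]  # only the last few characters are inspected
        if any(tail.endswith(s) for s in SPLIT_SYNTAX):
            pending = len(buffer)  # flag: do not split yet, wait one iteration
    if buffer.strip():
        chunks.append(buffer.strip())
    return chunks


demo_tokens = ["Hello", " there", "!", " How", " are", " you", "?", "\n", "\n", "Fine", "."]
demo_chunks = stream_split(demo_tokens)
```

The one-iteration delay is what lets a longer syntax (a newline after a question mark) be absorbed into the same split instead of producing an extra empty chunk.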
Test something for me, we'll just increase the window a little bit
print("matched syntax:", syntax, "window:", check_window)
Try increasing the window size here to 3 or 4
I can test this out myself in a bit but right now I'm working on something important work related
ok
didn't work, if it isn't for the perfection i dont think this is a must (at least not a priority)
I'll have to look into it a bit more, I thought I had it figured out 100% but apparently not
this time worked, looks like that it isn't very consistent
not actually, there is another .
i think ive discovered something
What did you discover? 😛
erm
if the sentence has rain or pain it doesnt work if the message was sent by my bot
what logic is this .-.
I suppose you have those associated with tags?
I don't think I have anything hardcoded for "ai"
This is where it decides whether it will reply to a message or not
Ahhh
i can, my bot cant
😱
discrimination
XD
I think I'm starting to understand what's happening here...
If you change reply_to_bots_when_addressed to 1.0 I believe it will solve this
I need to revise the logic to avoid this issue
I think this logic sort of makes sense if it is matching the whole word for the bot name, but it's not. It is triggering if the bot's name is anywhere in the text string at all
:O
Like if you only want it to reply to another bot which said "Hey ai, tell me about"
It's currently rolling probability for reply_to_bots_when_addressed if the bot's name (ai) is literally anywhere in the text
Could you please put it back to 0.0, and update line 6645 with this?
if message.author.bot and re.search(rf'\b{re.escape(last_character.lower())}\b', text) and main_condition:
this should now only trigger on whole words
You could test with rain, pain, etc - and also "Hey ai, what's up?"
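A standalone check of the whole-word idea, in case it helps with testing. `is_addressed` is a hypothetical helper, and using `re.IGNORECASE` is an assumption on my part: the line above lowercases the bot name, but it is not shown whether `text` is lowercased too.

```python
import re


def is_addressed(bot_name: str, text: str) -> bool:
    # \b only matches at a word boundary, so "ai" inside "rain" is ignored.
    pattern = rf"\b{re.escape(bot_name)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


rain = is_addressed("ai", "It may rain later")
pain = is_addressed("ai", "That was a pain")
greeting = is_addressed("ai", "Hey ai, what's up?")
upper = is_addressed("ai", "Hey AI dude")
```

If the name is lowercased but the message text is not, a case-sensitive search would never match "AI", which would look like "never triggers".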
ok
never triggers
Ok - I had it wrong 😛
nvm... ehhhh
I'll play around with this myself
@valid crypt Try with the bots actual casing like is it "AI" ?
how about Hey AI dude
you did change that setting back to 0.0 right?
i removed it
and you restarted the script?
i rebooted and reselected 👍 i dont think i missed
i didnt know that tags could take effect in real time
by tweaking to get maximum performance i've noticed that the token per second is wrong, it includes the time that the tts takes to generate
tts off
:P
average 3s to get the first stream tts, arrrgh, i want it to be faster and faster
i also have some suspicion of tts making the generation slower only because of those 0.5 seconds, im not that slow to take 2 more seconds, something is making everything slower
Pushed the recent changes to Main
The tokens / sec is printed from TGWUI code, and is not available for the bot to modify easily
i really think that it is being affected
likely!
well, just the TTS model being loaded means less ram and/or vram available to TGWUI (I think that's how it works)
although this is 33 but in reality is around 20
tts is on a different machine, so if the token/s reflects the real time it shouldnt be 6
also this one, if the generation is completed in 0.5 and tts completed in 0.2s there is no way to hear the tts after 2s
from tgwui, with tts, toggled off tts
if tts is turned off, there is some delay at sending the message to discord, if completely disabled it is lightning fast
ill dig deeper tomorrow
did you test it?
🤔
Renamed it to alltalk_remote and there's nothing about tts
No building for /speak or anything
Well I'm dying right 😪
Now
I just pushed an update
I did overlook a few little things 😛
Note that the bot currently requires the extension name to be in config.yaml - even if you have TGWUI flags to launch it
so maybe you updated the folder name but not the name in config.yaml
did both
There’s a key under “tts_settings” offhand I think it is “tts_client” which is expecting one string value
You can launch more extensions with the bot via CMD_FLAGS
lets say, streaming tts works, but it modifies the token/s from tgwui, and tts makes bot much slower
I think that typically, you'll generate all the text while it maximizes VRAM usage, then memory moves around as the TTS model gets to do its thing.
But with streaming we need to jump back and forth between both models so the memory gets shifted around more
in my case, i have two machines so there's no way to have bottle neck
,something that i've noticed is that pause tag works just fine but it takes around 0.05 to 0.2s, while my discord ping is 4ms
ah yeah derp
keep forgetting you have 2 machines
But yes, I'm sure TGWUI starts determining the tokens /sec at the beginning of text generation, and stops at the end of the entire job, but with the TTS streaming the bot hijacks normal behavior adding all the TTS time to the total time.
There's likely nothing I can do about this
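A quick bit of arithmetic showing how extra TTS time drags the reported number down. The 45 tokens and 5.04 seconds come from the "Output generated in 5.04 seconds" log line earlier in the thread; the 2.0 seconds of TTS time is a made-up figure for illustration.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    return tokens / seconds


tokens = 45
generation_time = 5.04  # from the "Output generated in 5.04 seconds" log line
tts_time = 2.0          # hypothetical time spent waiting on TTS during streaming

without_tts = tokens_per_second(tokens, generation_time)
with_tts = tokens_per_second(tokens, generation_time + tts_time)

print(f"without TTS time: {without_tts:.2f} tokens/s")
print(f"with TTS time:    {with_tts:.2f} tokens/s")
```

Same number of tokens, longer wall-clock window, lower tokens/s, even though the model itself did not get any slower.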
could be python or too much code, although i think that tts api could improve the speed
how do i log the time when the bot gets the tts?
There are limits to how fast discord can accept data from a single source - and these are applied automatically
So, when we are sending text as well as a "pause()" or "stop()" cmd, etc - these are all automatically throttled
If you search for " def apply_extensions" you'll find the function that applies the tts
could add a print statement there before and after the process is called
print("TIME START:", time.time())
print("TIME END:", time.time())
?
Yep thats good
might also want to add
print("TIME START:", start)
print("TIME END:", end)
print("TIME SPENT:", end - start)
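If the start/end prints get repetitive, a small helper like this does the same thing. `timed` and `timings` are names invented here, and `time.sleep` just stands in for whatever call is being measured.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds, kept so values can be compared later


@contextmanager
def timed(label: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = elapsed
        print(f"TIME SPENT ({label}): {elapsed:.3f}s")


with timed("apply_extensions"):
    time.sleep(0.01)  # stand-in for the call being measured
```

`time.perf_counter()` is better suited to measuring short intervals than `time.time()`, since it is monotonic and has higher resolution.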
the instance i hear the audio it gets printed, where 0.02s is from network (actually lower), 0.73s from alltalk server round it to 1.25s there is alot of time missing
hmmm
the math dont need to be very exact, by seeing it i know that it took around 2s :P
yep
api might help
if you didnt know what would happen when i try to use speak, i suppose that it is meant to work
last time when i made something to use all talk i was using /v1/audio/speech
I had successfully used the speak command with my setup
with alltalk remote
need to look into your error when I have a sec and see what's going on
Ok I see I do have an issue with some of the logic initializing the TTS extension
particularly when it does not end with _tts
fixing this
@valid crypt I have it fixed nice now
will be pushing it in a sec
I changed it so that whatever the heck extension is set in config.yaml as the tts client, it's gonna load it up.
If any additional TTS extensions try loading from flags, etc - it's going to warn that only one client can load and it will only load the configured tts extension
Might revisit this logic later when I get the API suite all worked out
discord just did an ui overhaul .-.
i think that it still doesnt work
how did you name your alltalk, did you put any params?
Just pushed the fixes to Main
All should be good in the update I pushed 6 mins ago
"Improve TTS / Extension loading"
main or not main
Where is that last screenshot from?
config
tgwui settings.yaml
and i dont have params in character
My setup is basically the same. Same, no params in character.
That log message isn't 100% accurate
Does your startup look similar to the screenshot I posted?
Loading your configured TTS extension "alltalk_remote"
It will only print this if the extension is tts.client, and the key that speak command is trying to access is that value
i just renamed it back with _tts
try as alltalk_remote
the first try got new error this time
but ended with the same
i missed something nothing
are you using the xtts?
got this
im not
also i gtg 💤 i wish, my bot could be added to your readme as temporal stt solution, i tried my best... it sends tts tags!
at least ill dream it
Yeah I’m using xtts so that could be part of the issue…
Upon reviewing the handling for speak cmd, that warning just means no voice was selected and there was no voice param. But this shouldn't yield an actual error if you have alltalk running with RVC or whatever
the api request will just be request = {'text_input': self.text} which is the minimum required information needed for a successful response
Do you know any very simple extensions that respect default args set in TGWUI’s settings.yaml? I think I just need to review how that’s applied, tweak alltalk remote code and send a PR
Ok offhand I think it was like setup() or atsetup() - just need to take a peek there in the remote…
never heard of a extension reading tgwui's setting.yaml
Could be the other way around > TGWUI sets the extensions parameters so long as they are formatted expectedly
in any case it's the reason why for alltalkv1, edge, vits, etc - the bot is able to update parameters on the fly
(because you can set parameter values in TGWUI settings.yaml and they actually take effect)
eh... I'll figure it out
@valid crypt Is the /speak command working, aside from that warning message?
No, but as far as I can tell it should be sending the minimal API request and should not yield an actual error
such as an invalid voice etc... if none selected
i wont be able to test anything for a few days as my intel cpu is definitely cooked...
actually i can :v
forgot that i have a second machine 😓
my second machine also gives the same error and it is not working, the no voice selected happens when the extension or the name is alltalk_remote without _tts
you could try downloading a vits model
same error
Can't reproduce on my end
The only thing you shared that was different from what I have, is I do not include the extension in TGWUI's settings.yaml
I only put it in the bot's config.yaml
ahh
let me try that
that might be the only difference, i just did a clean install and got the same error so...
I hadn't said anything because I was thinking if that could happen when I was looking at the code but it didn't seem likely...
failed ;-;
i literally did a clean install, entered the token set voice channel, add the extension name in config... hmmm
wait, i removed alltalk from settings but still loading it
from bot config
Yes, thats what I said - I only load the TTS from bot's config
If that's actually the solution to your issue I should be able to prevent that kind of conflict...
literally fresh install maybe i missed something that i have to do?
I'm super busy but Ill see if I can help for a sec
from the fresh install i only added the extension in config this time as it is on my second machine i didnt change the ip
loop = asyncio.get_event_loop()
if tts.api_mode == True:
request = {'text_input': self.text}
print("tts.client:", tts.client)
print("tts_args:", tts_args)
client_args:dict = tts_args[tts.client]
hmm
Alright, I see the problem finally
Honestly unsure why I didn't get error when I tested
Actually, the reason I didn't get error was because I chose a voice each time
the fix?
im lil dumb
I can test this myself later but I believe if you just replace the whole process_speak_args() function with this,
it should do the trick
async def process_speak_args(ctx: commands.Context, selected_voice=None, lang=None, user_voice=None):
    try:
        tts_args = {tts.client: {}}
        if lang:
            if tts.client == 'elevenlabs_tts':
                if lang != 'English':
                    tts_args[tts.client].setdefault('model', 'eleven_multilingual_v1')
                    # Currently no language parameter for elevenlabs_tts
            else:
                tts_args[tts.client].setdefault(tts.lang_key, lang)
                tts_args[tts.client][tts.lang_key] = lang
        if selected_voice or user_voice:
            tts_args[tts.client].setdefault(tts.voice_key, 'temp_voice.wav' if user_voice else selected_voice)
        elif tts.client == 'silero_tts' and lang:
            if lang != 'English':
                tts_args = await process_speak_silero_non_eng(ctx, lang) # returns complete args for silero_tts
                if selected_voice:
                    await ctx.send(f'Currently, non-English languages will use a default voice (not using "{selected_voice}")', ephemeral=True)
        elif tts.client in tgwui.last_extension_params and tts.voice_key in tgwui.last_extension_params[tts.client]:
            pass # Default to voice in last_extension_params
        elif f'{tts.client}-{tts.voice_key}' in shared.settings:
            pass # Default to voice in shared.settings
        else:
            await ctx.send("No voice was selected or provided, and a default voice was not found. Request will probably fail...", ephemeral=True)
        return tts_args
    except Exception as e:
        log.error(f"Error processing tts options: {e}")
        await ctx.send(f"Error processing tts options: {e}", ephemeral=True)
did it work?
Actually I just realized an even easier fix, I know it will work
I just pushed it now
it worked, and i trust your easier fix, how do you get the audio file with the api, same as capturing the scr?
hijacking the extension?
Nope
er
When using the /speak command the bot uses an API call and gets the response directly.
When chatting normally, TGWUI runs the remote extension which makes the API call which returns the audio file, and it's detected in the response from TGWUI same as all the others
Not really hijacking the extension per se - more like hijacking the entire TGWUI chatbot wrapper function
Normally it will only apply TTS extensions after the complete text generation
But I monkeypatch that function, and apply TTS extensions any time the bot wants to split text
hmmm, it feels slow for a direct call. the first video is raw; in the second one i added a black screen when the generation completes and when it sends the audio, which is when the warning appeared
What's the problem?
Enable Deepspeed 😛
Alright, maybe it's due to the discord message sending over and over
the generation part took 1.14s where 0.27s is used to generate, and i started the timer when bot received the /speak
Try deleting these 3 lines
in async def process_speak_args
Beyond that, there's nothing I can really do to improve the speed
sliming down >:D
well, until I have true API stuff implemented
The bot functions don't add any time
i have one written for all talk
it's just generational tasks and discord interactions that slow things down
import json
import requests
from threading import Thread
# Configure the OpenAI-compatible (AllTalk) TTS endpoint URL (change IP and port as needed)
openai_tts_url = f"http://192.168.1.14:7851/v1/audio/speech"
def generate_openai_tts(text: str, voice: str = "nova", speed: float = 1.0,
                        model: str = "any_model_name", response_format: str = "wav") -> bool:
    """
    Call the OpenAI-compatible (AllTalk) TTS endpoint to generate speech audio.
    Parameters:
        text: The text input (max 4096 characters).
        voice: The voice to use. Supported values: 'alloy', 'echo', 'fable', 'nova', 'onyx', 'shimmer'.
        speed: Playback speed (between 0.25 and 4.0; default 1.0).
        model: Model identifier (currently ignored but required).
        response_format: Audio format (e.g. 'wav').
    Returns:
        True if audio was successfully saved, False otherwise.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed
    }
    headers = {"Content-Type": "application/json"}
    try:
        resp = requests.post(openai_tts_url, data=json.dumps(payload), headers=headers)
        if resp.status_code == 200:
            # voice_path and notice() are defined elsewhere in the full script
            with open(voice_path, "wb") as f:
                f.write(resp.content)
            return True
        else:
            notice(f"Error: {resp.status_code} - {resp.text}")
            return False
    except Exception as e:
        notice(f"OpenAI TTS request error: {e}")
        return False
elif tts_menu.get() == "OpenAI TTS":
    # Use the new OpenAI-compatible API endpoint
    # You can customize the voice and speed here if desired.
    if generate_openai_tts(text, voice="nova", speed=1.0):
        play_voice()
i might have some import missing
def play_mp3_th():
    pg.init()
    try:
        pg.mixer.music.load(voice_path)
        pg.mixer.music.play()
        while pg.mixer.music.get_busy():
            pg.time.Clock().tick(1)
        pg.mixer.music.stop()
    except:
        pass
    pg.quit()
Thread(target=play_mp3_th).start()```
100% ~~not ~~stolen 👍
i dont have that but i suppose it is the same
oops yeah I started typing this warning
yeah clear those 2 lines
I'm deleting this message
I just deleted those 2 lines and pushed it
deleting those i get double warns but works just fine
the first is when bot receives the audio and the second one when bot sends the audio
ugh
there is definitely 0.5s of room to improve, if the second delay is not discord that would be another 1s
That error is not discord so there's no improvement
that's just a warning from the bot
i wrote delay, it took 1s to send the audio file
Ok - by "There is no improvement" I mean, when using TGWUI the way the bot is now
oh ok
One more thing....
I noticed that your print statements included that your character will go idle
Offhand I don't think the responsiveness setting / go idle crap would add delay to this processing
im using this one
it has no responsiveness setting
Trying to fix for one last, I messed up with my system and I only can boot into safe mode, hope I can fix it
lesson of the day, do restore point in different drives and never touch display drivers
When I have a chance, I’ll update the extension argument handling, so it doesn’t spam warnings when there’s really no problem
When I had coded this feature, the main focus was handling extensions with XTTS, And I was doing a lot of shooting from the hip to actually get it working fast
Still a few rough edges to smooth out
i was wondering because if it's using the api directly it didnt even need to load any extension
The bot code currently cannot directly support TTS API for normal requests - it's the remote extension for TGWUI that is doing the api calls
It is using the API for the /speak command
Need to update a crapton of logic in the bot and I don't have the time to do that at the moment
alr
Think about it though, it wouldn't save any time
Well
I'd have to include an option to simultaneously generate text and TTS
and the only sane reason for a user to do this would be if they have dedicated computer for each
what i was wondering is why is it taking so much (0.5s) to do a call, while by using the web (remote) it is almost instant
The bot could then go ahead and happily buzz along generating the text uninterrupted - and on the bot end, I'd need for the responses to wait for TTS gen to complete so it can deliver both chunks at once
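Very roughly, the "generate text uninterrupted, wait on TTS, deliver both together" idea could be sketched like this. Nothing here is the bot's real code: `fake_tts`, `generate_chunks` and `run` are placeholders, and the sleeps stand in for the LLM and the TTS server.

```python
import asyncio


async def fake_tts(chunk: str) -> str:
    await asyncio.sleep(0.02)  # stand-in for a remote TTS request
    return f"{len(chunk)}-chars.wav"  # placeholder "audio file" name


async def generate_chunks():
    # Stand-in for streamed text generation, already split into chunks.
    for chunk in ("First sentence.", "Second sentence.", "Third."):
        await asyncio.sleep(0.01)
        yield chunk


async def run() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    delivered = []

    async def deliver() -> None:
        # Deliver each text chunk together with its audio, in order.
        while True:
            item = await queue.get()
            if item is None:
                break
            chunk, tts_task = item
            delivered.append((chunk, await tts_task))

    consumer = asyncio.create_task(deliver())
    async for chunk in generate_chunks():
        # Start TTS right away; text generation keeps going meanwhile.
        queue.put_nowait((chunk, asyncio.create_task(fake_tts(chunk))))
    queue.put_nowait(None)
    await consumer
    return delivered
```

Text generation never blocks on TTS here; only the delivery side waits, which is the part that would need the bigger framework changes.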
same for the extension, the text is generated for 0.5s and then it send the call
Yes - that's because there is no stop and go
generates 100% text, generates TTS - sends both
Bot, back and forth back and forth
It will take a very sophisticated framework to allow both at once, with dedicated computers
due to the streaming responses feature
otherwise, very simple
The current framework is already very complicated - so it's going to be quite a can of worms overhauling it
Not saying it won't happen but it's going to be a little while - and I am moving in that direction
actually i can try all in the same machine, and then i'll tell if there's a lot of performance bottleneck or an actual problem
If you stick a print statement anywhere in the bot code, you'll see that code execution is almost instant, everywhere.
The only slowdowns are discord interaction and generative tasks
if i want to print when does it do the api call, where would it be, in the all talk extension? i want to generate text entirely and no streaming to test the latency
i also want to print when all talk receives the api call
search for chatbot_wrapper in TGWUI/modules/chat.py
wayyy at the bottom, it will trigger TTS immediately before returning the complete text response
ad discordbot applies this logic whenever the bot decides to split text, instead of waiting until 100% text generated
If you disable the bot's streaming text setting in config.yaml
it will behave the same as TGWUI
You can find the modified version in the bot by also searching chatbot_wrapper
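For the "stick a print statement" approach, a small timing helper can make the gaps easier to read. This is only a sketch: `timed` is a hypothetical helper, not something in TGWUI or the bot, and you would wrap it around whichever call you want to measure (for example the TTS request near the end of `chatbot_wrapper`).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log=print):
    """Log a wall-clock timestamp on entry and the elapsed time on exit."""
    start = time.perf_counter()
    log(f"[{time.strftime('%H:%M:%S')}] {label}: start")
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log(f"[{time.strftime('%H:%M:%S')}] {label}: done in {elapsed:.3f}s")


# Usage: wrap the suspect call, here a sleep standing in for a TTS request.
with timed("tts request"):
    time.sleep(0.05)
```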
right, before blaming your bot i should test tgwui first :P
where do i print when it finishes?
tgwui
in the screenshot I shared, the first print is when text is done generating and it is about to request TTS
the second print is when it's going to return the text and audio.
oh?
around 2.5s and the tts is done in 1.05 like with the bot
copy pasted the print :v
🤓
Bot checks out
the huge optimizations that you'll see in open source crap is like, different computational algorithms and stuff
the bot is just a whole bunch of normal simple braindead operations wrapped around these complex generative models
I've been messing around with it enough to know, there are no intentional pauses anywhere unless user wants to play around with responsiveness
i think that the text is generated and saved instantly, but the way it uses the extension is not very efficient
nope, alltalk is slow i think, the last thing is the time when it does the call
then there is something making everything slower, the first print is when it receives the api call, and the second one is when it sends
ayo i found a way to do this lightning fast but idk how
example of tgwui api call
the part that is taking extra time is completing the tts to send, or just "generation completed"
but this only happens using the tgwui api call
instead, if you use the alltalk :7851 to generate tts it is done instantly and skips the completing process
lightning fast
ok so I finally have the installer changes hammered out. With this framework the bot could theoretically support additional deep integrations like the TGWUI one
This bit was definitely not my forte
Wizard 🌈
OK - I can't say for sure if these are bug free or not but I updated all the launchers and replaced all the updaters with wizards
This is on the "unrequire TGWUI" branch https://github.com/altoiddealer/ad_discordbot/tree/unrequire_tgwui
Marcos you only use windows yes?
im a windows boy
✋
Tried linux in the past to more efficiently run a little budget HTPC (home theater PC) and it was a very painful experience
XD, i only use wsl if windows is not supported
Are you able to try running the WSL launcher on that branch?
That’s the one I most expect to be broken lol
my cpu is missing, so i would like to try when my cpu arrives
Ah yeah
but i can test with windows 🙂
The windows one will work up until it actually loads the bot haha
The hard part is done though I can probably finish the job tonight or tomorrow
On first run it determines if the parent directory is TGWUI. If it is, it gives 2 options: integrated install (uses TGWUI environment and will enable more features) or Standalone (create and use own env)
Also included a method to check if parent is a fork of TGWUI and allow that
The update wizard includes an option to switch the install from TGWUI/Standalone
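The first-run check could look roughly like this. It is a sketch under assumptions: the marker files (`server.py`, `modules/chat.py`) and both function names are hypothetical, and the actual launchers on the branch may detect TGWUI (or a fork of it) differently.

```python
from pathlib import Path

# Assumed marker files; the real check in the bot may use different ones.
TGWUI_MARKERS = ("server.py", "modules/chat.py")


def looks_like_tgwui(directory):
    """Return True if the directory contains all assumed TGWUI marker files."""
    root = Path(directory)
    return all((root / marker).is_file() for marker in TGWUI_MARKERS)


def choose_install_mode(bot_dir, prefer_integrated=True):
    """Pick 'integrated' only when the bot's parent directory looks like TGWUI."""
    parent = Path(bot_dir).resolve().parent
    if looks_like_tgwui(parent) and prefer_integrated:
        return "integrated"
    return "standalone"
```

`prefer_integrated` stands in for the user's answer to the wizard prompt; a standalone install is always allowed.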
The standalone install will not be able to do text generation until I slap in a TGWUI api method. The ultimate plan is for the bot to accept any configured software for api calls for a number of specific purposes
(text gen, img gen, video gen, tts gen, etc) so long as the API has a get method for it, or the user predefines the expected payload structure
Also planning to add a user “command builder” that can utilize configured apis
So a user could easily configure their own custom command like “/set_tts_voice” etc
Or a command for a specific comfyui workflow like “/comfy_wan_img2vid”
@valid crypt about recommending your fork of STT, if you could write something up like a word doc, I could probably do that. Needs to include how to install, troubleshooting, etc
In terms of usage with the bot, or any deviation from normal instructions from your project page
my readme is pretty complete, i dont think i missed anything, although i can improve a little
i could make a bat to install or something
ill look into those
cant be easier, and no troubleshooting it just works :p
some bugs
forgot to include usage but i think this is enough
also you have to mention that these are for those tts tags
Could you share some example tags, like the minimum for it to play nicely with my bot?
i only have that as it is a simple bot to do transcription...
as i use it with /main i didnt include ping or a field to add the name of the bot
i could add replacing to do some voice command, something like if i say "create image", replace it with the tag as no asr will include things like _ (create_image), but the tag can adapt to the transcription.
shut up --> stop tag
or
stop tag: 'shut up'
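The phrase-to-tag idea could be sketched like this. The tag strings (`[stop]`, `[create_image]`) and the mapping are hypothetical placeholders; the real tags would come from whatever the bot is configured to recognize.

```python
import re

# Hypothetical mapping from spoken phrases to bot tags.
VOICE_COMMANDS = {
    "shut up": "[stop]",
    "create image": "[create_image]",
}


def apply_voice_commands(transcript, commands=VOICE_COMMANDS):
    """Replace spoken phrases in an STT transcript with their tags."""
    for phrase, tag in commands.items():
        # Match whole words, case-insensitively, tolerating extra whitespace.
        words = r"\s+".join(re.escape(word) for word in phrase.split())
        pattern = re.compile(rf"\b{words}\b", re.IGNORECASE)
        transcript = pattern.sub(lambda _match, tag=tag: tag, transcript)
    return transcript
```

Matching on the spoken phrase rather than the tag name sidesteps the problem that no ASR will output an underscore.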
¯_(ツ)_/¯
i found out why my all talk is taking more time, all because of me being dumb
Do tell
tell?
Yeah, tell me 😛
didnt understand
Just curious what you did that caused slowdown with alltalk
rvc 💀
Seems like I'm picking a good time to try making TGWUI API possible.
That's cool!
Hit a bit of a snag with my new install method
On this branch
I'm not sure what exactly is causing this, but now when I import TGWUI's modules.shared - the args parser in that module is now parsing my bot's args
On the plus side, the bot will now successfully launch and handle image generation tasks etc - without TGWUI.
Just need to debug this particular issue here and the TGWUI integration should also be back in place
This has something to do with how the bot initializes... previously, os.getcwd() would return the directory to TGWUI, but now even running from TGWUI env os.getcwd() is returning root dir of the bot
I was also parsing args in bot.py but I moved that to utils_shared.py
Ah okay… the bot code was originally popping them as they were read in. I tweaked it, there’s probably a straggler.
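One common way to avoid this kind of clash is to parse the bot's own flags with `parse_known_args` and strip them from `sys.argv` before importing a module that runs its own parser at import time. This is a sketch of that general technique, not the bot's actual fix; the `--bot-config` flag is hypothetical.

```python
import argparse
import sys


def parse_bot_args(argv):
    """Split argv into the bot's own args and everything left over."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--bot-config", default=None)  # hypothetical bot flag
    known, remaining = parser.parse_known_args(argv)
    return known, remaining


def consume_bot_args():
    """Remove the bot's args from sys.argv so later parsers never see them."""
    known, remaining = parse_bot_args(sys.argv[1:])
    sys.argv = [sys.argv[0]] + remaining
    return known
```

Calling `consume_bot_args()` before the `modules.shared` import would leave only the leftover flags for the second parser; a straggler flag that is not registered here would still leak through.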
Oobabooga actually reopened 4 issues I opened that were closed as Stale
Ah - that’s very cool
i guess he noticed that a lot of the closed-as-stale issues were wrong and wanted to unfuck it a bit
just need somebody with a massive amount of patience given access to manually close them, to then go through and actually sort things out
I remember feeling a little dismayed when there was no response, they were legitimate issues
2.5k issues is 💀
Yeah at the time I think it was like > 1k as well
"victim of its own success" situation - repo gets far more issue posts than the one guy running it has time to manage
need either a team or a dedicated crazy person extremely high patience individual to sort it
ComfyUI has the same problem
-# (for a bit of time I was in charge of solving it re comfy but I wasn't enough and I'm not on that team anymore so rip)
Meanwhile I get excited when an issue is reported - evidence someone is using the bot lol
I was starting to feel something like a team member in Forge, I had solved a few medium difficulty issues. But interest waned when I couldn’t get Illyasviel to review something… the project is like dead to him immediately after he added flux support. I improved the logic of module handling for model changes, they would have just merged it but I changed enough that I felt it warranted the author’s blessing
I can’t even imagine how good Forge would be if Illyasviel kept plugging away at it from time to time, the guy’s a genius
I heard there’s this cool app called SwarmUI too

@calm rain just curious to know - did you come up with the idea to use the calculations to factor rounding precision of image resolutions? Or did you basically lift that from trainer code / stability / etc?
So simple yet such an effective method for applying nice res values. Love it
it's literally how the resolutions for training data in many models (including SDXL but also pretty much most models out there) get selected. The base set of resolutions available in the UI are literally just the SDXL trainset res list. The unique bit in Swarm is just the UI to select it, and reusing the same math to estimate good values for models that work at other scales (since they almost certainly calculate train res's the same way anyway)
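A rough sketch of that kind of calculation, under stated assumptions: keep the pixel count near a base area (1024×1024 here) and snap each side to a multiple of 64. The function name, the base size and the multiple are illustrative guesses, not SwarmUI's code or any trainer's exact bucket list.

```python
import math


def resolution_for_aspect(aspect, base=1024, multiple=64):
    """Estimate (width, height) for an aspect ratio at roughly base*base pixels."""
    target_area = base * base
    width = math.sqrt(target_area * aspect)
    height = width / aspect

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, int(round(value / multiple)) * multiple)

    return snap(width), snap(height)
```

With these assumed defaults, a 16:9 request comes out as 1344×768 and 1:1 as 1024×1024; a model that works at another scale would pass a different `base`.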
❤️