#🐣┃suno-showcase
prompt: "[laughter] [laughs] [sighs] [music] ♪ [gasps] [clears throat]"
Output: strange noise, and guy gets startled awake and attempts to answer the teachers language question.
i think thats his way of saying i aint got the capacity for dat
welcome to the machine
prompt: "[slide whistle] [music] [clears throat][laughter] [music] [laughs] [sighs] ♪ [gasps] — [slide whistle]"
output: 🔨 🔨 🔨 🔨 🔨 🔨 🔨
i cant stop thinking about the victim of the mosely
what was he trying to say
what the hell was that laff
haha
Discord rules
i got more lore on the mosley... 20 years
"its safe to waze your fuldore feelings"
freestyle friday
prompt?
(beat with lyrics) YEAH, YEAH, YEAH, YEAH, YEAH, YEAH, YEAH, YEAH, hey, hey, you, you, I don't like your girlfriend, no way, no way, I think you need a new one
didnt make it all the way
got hung up on the yeahs understandably
(beat with lyrics) seems good
i made some crazy tongue twisters with chatgpt4
i had to screenshot the prompt because it is blocked by discord, i suspect repeating the same words or all the brackets...
but then Jairo Correa posts audio of the rules...
and i googled "the mosley victim" and there was an event 10 days ago, which did involve some of the things said in the audio clips, lol
wut
a hearing, medical, etc...
Flendiferous plibber-klorping slazzles zlungled slebbidly dlorbitant blurking klentacles, gleebulating qlibber-mlungulated vlivvers strandiferously, jlurching qlabberwabbled lipty-lpotch plibberations, whilst trondiforously clonking plaggled qlibberwinks, plargulating slibberwocked bligglets, and dlalumphing glabberdoodling rlizzwinks in plibber-sprangled tlazmires, vlurbing qlibber-splattered qlabber-splonks, flewting slabber-splorped plarfiggles splatteringly splangled, plibber-splurting rlizzulated slonktacles plibberwabbled, and slurbblingly qlibber-slurgled slibberwobble-slazzled plappledapples, plibber-plonking qlibber-splorped glabber-slurgles, plibber-plungulating plibber-slazzled dlentiferous glabberwocked spleebulations in plibber-sprangled plazmires.
someone try that with a long prompt
i cant get it in 14 seconds
LMAO
enthusiasm
betty betty betty betty betty betty betty baby YEAH! YEAH! YEAH YEAH!
at least the mic wasnt turned up to 15 like it is most the time
to work with the blood soil as the blood prize?
I added this at the start and end [speaking fast]
are your temps set to .7?
I'm using WebUI, there is no temp setting
probably .7
just text + choose voice
[speaking fast] djfklglskgjhoewirjg;eiorjhwiqspteiucpnmxitzkqiaistuaiupjaeprfdijtgpetrjatgemverbzqyfvtdczsabqcwbsnudtfvybyghinjmokopls,cxpwkcmjvneuthybvruthvniejkmdicwopkslmxokwsnxjqiysgfoqwenwiuetywejnwfgosduhisugrpqueyrweiuqypqiupgiqusdgkfhgkgfvkbzcmvnncvmzfdvbjdfhglakdjghqeriutyqeoiuryqpiuweyqipwetogqeageveurfcsrdexredzaecryegbxruahwexrihjaeirjewrokmokg,oky,oukympjn,uypj,jukmjuhbndyouhdvnityirtiyegfjkgdskm,fbadfjygarfhnjalsruifagerkjmnlgurbhgliadnriluyghsilhgiyeahioaiyreotiuyerptuiyewriouytrfganfjc,bihn.,ihbx,rwihihbxvhmwvbxwrmfchmcsfdmhrmcfousdfuosrncirxgxvzyqzvqfnxurcfgncgmxbmxerxnqncfruoegmrvqeqvqpetewimtvwregbgjgvkfdvmgks [speaking fast]
what did you try to send me in general i just noticed
you got busted for sending porn again
haha
a link to the walmart music video
hilarious song
hopefully AI can make a whole video like that soon
❤️ AI ❤️
[rap song] she got that louisiana purchase, louisiana protocol [rap song]
[rap song] basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket basket [rap song]
i think when you make the prompt too big, it does whatever
[rap song] basket basket basket basket basket basket basket basket [rap song]
(beat with lyrics) basket basket basket basket basket basket basket basket (beat with lyrics)
girl... it's a song, not a task.
i made up a rap
Do you have the npz file for this voice? This sounds really good.
announcer
hello, is it too difficult for me to simply download this program from GitHub? Do i need to install pip first?
I'm going to make an installer over the weekend, it's a bit annoying
create a venv first to keep things clean.
Holy cow this is good.
haha, amazing clips 🙂 btw we set up a channel #🐶┃bark-technical to encourage better sharing of npz files. hopefully that helps with finding prompts that are fun and clone well into new clips like travolta, jane etc
I thought i'd try to see how far we can push this.... and I think I broke journalism. Here Ive brought together a few AIs - gpt3, Bark, etc. now I can give any 1000 word document to my code, and it will, in a single click, spit out a video.. here's my test: https://www.youtube.com/watch?v=hyi1CgXbZCg - let me know what you think.
Transcript...
VOICE1: Welcome everyone to the show. Today we are discussing a very timely and important topic - synthetic media and the potential harms it can cause. As many of you may know, synthetic media is media that has been produced with the help of artificial intelligence, and it includes things like deepfakes and AI-generated text, vide...
I am speechless .... story written by an AI ... read by an AI.
are you using adobe animate for autolipsync?
How do you do that ? just for one sentence it takes me dozens of attempts to get the audio just about right...
Are you using sad talker extension for this?
If you paste the whole deal into infinite it has greater consistency and chooses voice often based on entire text
Thanks
yo yo sunoheadz
dance beats or regular beats that is the question
both
oh shit
a lot of these have 3 second brilliant parts and then the other 8 are rough
cleean beat
this beat is so hard it said 14 seconds? nah i only need 7
lose controooooooLLLLLLLLLLLLLLLLLLLLLLLLLLLL
i cant blame him i put lose control in the prompt
wow i actually made a file that needed to be turned UP
actually i take that back it is pretty good at normal volume
Trying with the stitching, but many times it screws up and loses focus
Please explain your process so we can try to make it as good as possible. 😄
If anyone has had a hard day and needs some extra appreciation. 😄
no - I've written some python code to swap out lip positions based on audio volume
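A minimal sketch of what volume-driven lip positions could look like, not the poster's actual code. It assumes mono float audio, a fixed video frame rate, and a hypothetical set of `levels` mouth images numbered from 0 (closed) to `levels - 1` (wide open):

```python
import numpy as np

def mouth_frames(audio, sample_rate, fps=24, levels=4):
    """Return one mouth-shape index (0 = closed) per video frame."""
    audio = np.asarray(audio, dtype=np.float64)
    samples_per_frame = max(1, int(round(sample_rate / fps)))
    n_frames = int(np.ceil(len(audio) / samples_per_frame))
    if n_frames == 0:
        return []
    # Pad with silence so the audio divides evenly into frame-sized windows.
    padded = np.zeros(n_frames * samples_per_frame)
    padded[: len(audio)] = audio
    # RMS loudness of each window.
    windows = padded.reshape(n_frames, samples_per_frame)
    rms = np.sqrt((windows ** 2).mean(axis=1))
    peak = rms.max()
    if peak == 0:
        return [0] * n_frames
    # Scale loudness to 0..levels-1; louder windows get a wider mouth.
    idx = np.minimum((rms / peak * levels).astype(int), levels - 1)
    return idx.tolist()
```

Each index would then pick which mouth image to paste onto the character for that frame.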
no - just infinite speaker, gpt3 and some python stuff i've written with gpt4's help. next week, i'm going to try to swap out my manga actors with video...
hello
bark infinity? gotcha. I was wondering how the animation was being cued to the sound for the talking. sad talker is supposedly able to do what splines was doing a few months ago but as a wav file drag and drop into the stable diffusion extension. Havent run it yet.
which npz did u use?
Will try to post it here when I get home. 😉👍
I wrote python code to split the script by speaker, and run bark on each speaker in turn. Then used moviepy to create the video shots and cut them back together
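A rough sketch of the split-by-speaker step, assuming the script uses `NAME: text` lines like the transcript above. The function names, the `voices` mapping, and the `synthesize` callback are all hypothetical; the moviepy part is left out:

```python
import re

# A speaker label is assumed to be a single word followed by a colon.
SPEAKER_LINE = re.compile(r"^\s*([A-Za-z]\w*)\s*:\s*(.+)$")

def split_by_speaker(script):
    """Return a list of (speaker, text) turns from 'NAME: text' lines."""
    turns = []
    for line in script.splitlines():
        match = SPEAKER_LINE.match(line)
        if match:
            turns.append((match.group(1), match.group(2).strip()))
        elif line.strip() and turns:
            # A line without a label continues the previous speaker's turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, text + " " + line.strip())
    return turns

def render_turns(turns, voices, synthesize):
    """Call synthesize(text, voice) once per turn, in script order."""
    return [synthesize(text, voices[speaker]) for speaker, text in turns]
```

With Bark installed, `synthesize` could be something like `lambda text, voice: generate_audio(text, history_prompt=voice)`, with `voices` mapping each label to a speaker preset such as `"v2/en_speaker_6"`.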
Here you go my man!
thanks
I used bark to convert voice lines in a software called "Crew Chief" which is a program that connects with racing games to provide voice guidance while racing https://www.youtube.com/watch?v=mHiSSWUfDb0
very cool!!
man :hi loli how are you today?
girl: hi Petir I'm good
😭
Thanks! It's a lot of work, thousands of wav files, but this voice model is really impressive
psychobot laff
catchy
if you ask bark to drop the bass it really drops the bass
dat sine sweep at 8 seconds holy shit
female singer
lets see if history file works for that
you can do it allllllllllllllllllllllllllllllllllllllllllllllllllllllll by yourself
my favorite thing so far. it improvised a little ditty.
edm style
made with chatGPT4:
In the depths of the ocean, creatures glow,
Bioluminescence, a light show,
A jellyfish whispers, "I bet you can't see,
I'm 95% water, just like tea!" [laughs]
Volcanoes erupt, spewing lava and ash,
Their molten rock flows, in a fiery flash,
A mountain yells, "I don't mean to boast,
But when I blow my top, I make the best toast!" [laughs]
The Earth is round, spinning with grace,
A giant blue marble, floating in space,
A cheeky astronaut once said with a grin,
"Gravity's the reason we don't just fly off into the wind!" [laughs]
Einstein was brilliant, his theories profound,
E=mc², a formula renowned,
He quipped with a smile, "It's all relative, you see,
The faster I go, the younger I'll be!" [laughs]
In this world of wonders, mysteries, and jokes,
Nature and humor, together they coax,
A laugh and a lesson, they bring us delight,
In this beautiful world, we take flight.
I tried to copy Scatman John's voice, not yet successful. The line "Scatman, fatman, black and white man, tell me about the color of your soul" from Scatman's World ended a bit creepy in my opinion.
how do u run it with mps? (my mps is enabled, just wanna know how to make bark work with it)
-- extension bark_tts
I'd love more decent female voices, seems like bark is much better at male voices. This is the only one from the list I've found that actually produces decent sounding results
it's that infamous book story that Butters wrote, in South Park. this is an example of getting longer than a 13 second output by concatenating the output arrays into a single waveform before exporting that and compressing into MP3.
Roughly nine minutes of Bark voices being obsessed with mares: https://youtu.be/_tHnB4BpRRg
Over nine minutes of various voices expressing love and praise for mares using the latest broad text-to-audio AI: Bark. It's capable of all sorts of audio output from text, including speech, singing, music, sound effects, instrument samples, screeching, strangeness, etc.
https://github.com/suno-ai/bark | https://huggingface.co/spaces/suno/bark
All im...
britney not-spears - hit me baby, one more time as sung/read by suno-bark, concatenating on line with custom voice
I'm a big fan of this one-- #🐶┃bark-technical message
Really enjoying the different voices. This is the radio talk show.
used oobabooga, stable vicuna, bark, sd, and sad talker to make this:
#oobabooga #stablevicuna #stablediffusion #bark
bark's able to read stories quite naturally
I put together a video of some of the prompts I’ve tried with Bark:
See how weird and wonderful Bark can be with these experiments and prompting guide.
Try it out here:
https://replicate.com/suno-ai/bark
Code here, if you want to run locally:
https://github.com/suno-ai/bark
@quasi oyster it's better to just add the music on top using nonlinear editor 😄
made a longer form rendering of the rules for the server
mmkay lol
okay yeah the music functionality is not overly precise at the moment lmao
I was hoping this would be possible can you clarify how you got the sad talker working with ooba?
was a bit of work to get to this final result, I didn't link sad talker with oobabooga, here's my workflow
- generate a story from ooba
- copy the generated story into bark and generate the audio for it
- take a front view of one of your sd character generations and upscale it to 1536x1536
- generate the sad talker character using the bark audio, I tried using the long 2 minute audio, but my pc ran out of ram, so I did it in sections
- put it all back together in a video editing software
interesting, I put in "MAN: [laughs] How about we [beep] this place up!" and it said an actual swear instead of [beep]! Interesting... usually beeps actually work
same prompt, for the same speaker ("v2/en_speaker_1") hahaha these are so weird
this is an instance where the [beep] was generated correctly
also i really like this! how did you do it?
here's another raw infinite bark story result:
took it into adobe audio enhancer, but it changed up some words
@long egret how're you enhancing the audio?
i am not enhancing it other than editing it in Kdenlive
the radio one isn't enhanced or edited at all. it's just straight out of Bark using announcer voice
StableLM telling a story about hot dogs for some reason.
Finally we have a definitive answer on why six is afraid of seven @tardy topaz
An old man.
Story in the style of a redditor (by ChatGPT), using the long generation (advanced) method as explained by official Readme
Version 2 of the crew chief for racing games. I have converted over 5000 wav files into the en_speaker6 model. Here is the result https://youtu.be/lqCAgZaknDg
An updated example of AI-generated voices for crew chief
not sure what is wrong but audio tends to distort as it progresses
using smaller model btw
I'm curious how long did this take to generate
Hello ☺️ I'm completely obsessed with AI creations and Bark is becoming a firm favourite! I wanted to share this here with you. Made with Gen-2 and Bark. It's very silly, I hope you like it. https://twitter.com/emmacatnip/status/1651040268613763074?s=20
about 4 minutes
updated to latest countFloyd commit
Im Puerto Rican and speak mostly spanglish, the spanish accent cracks me up
Loving the new update!
@ebon widget aha i got it working
the prompt was:
{man} So, I was thinking I could come over around 3?
{woman} And what would we do?
...
An episode of "I Think You Should Leave" with Tim Robinson, as written by ChatGPT
If you liked that, here's another. Episode 2: The Talent Show.
hehe awesome!
Seems like you are using JonathanFly/bark. The technique it uses for longer generation just keeps amplifying the previous chunk's style. I would recommend taking a look at the official guide on longer generations (as described in the Readme); it produces better output.
Cool. It looks like the difference is mostly they use a fixed speaker prompt? On mine I defaulted to 100% feedback, but you can use --stability_mode to instead do that, FYI.
- Then they add .25 seconds of silence between prompts. Probably good for the non feedback case but maybe not for progressive?
- Semantic Temp lowered to 0.6
- min_eos_p=0.05
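The settings listed above (fixed speaker prompt, 0.25 seconds of silence between chunks, semantic temp 0.6, `min_eos_p=0.05`) can be sketched roughly like this. `generate_text_semantic` and `semantic_to_waveform` are Bark functions; the helper names, the default speaker, and the assumption that the text is already split into sentences are mine:

```python
import numpy as np

def join_with_silence(pieces, sample_rate, gap_seconds=0.25):
    """Concatenate audio chunks, inserting a short silence after each one."""
    silence = np.zeros(int(gap_seconds * sample_rate), dtype=np.float32)
    out = []
    for piece in pieces:
        out += [np.asarray(piece, dtype=np.float32), silence]
    return np.concatenate(out) if out else np.zeros(0, dtype=np.float32)

def generate_long(sentences, speaker="v2/en_speaker_6", temp=0.6, min_eos_p=0.05):
    """Render each sentence with the same speaker prompt, then join the chunks."""
    # Imported lazily so the helper above works without Bark installed.
    from bark import SAMPLE_RATE
    from bark.api import semantic_to_waveform
    from bark.generation import generate_text_semantic

    pieces = []
    for sentence in sentences:
        # A lower min_eos_p lets a chunk end sooner after its text is spoken.
        tokens = generate_text_semantic(
            sentence, history_prompt=speaker, temp=temp, min_eos_p=min_eos_p
        )
        pieces.append(semantic_to_waveform(tokens, history_prompt=speaker))
    return join_with_silence(pieces, SAMPLE_RATE), SAMPLE_RATE
```

Because every chunk starts from the same speaker prompt rather than from the previous chunk, drift in one chunk should not carry into the next.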
Bark, animov(modelscope), AudioLDM, Blender & Generative AI add-on used: https://www.youtube.com/watch?v=ZbVR4fknEys
Follow the instructions here for installing the add-on: https://github.com/tin2tin/Generative_AI
The narration is done with Bark and the music and sounds are done with AudioLDM. Video done with animov weights.
thats a nice suite you have
i might have to try blenderAI
I have yet to figure out how to use the sunoai jupyter notebook locally to setup a conversation, but was able to go about it using count floyd pretty easily
Thank you, i will take a look again
Got it in last minute, barely tested though
@proud yacht So my one-click installer went fine 4 days ago when I tested it, randomly doesn't work this week, lol. Might just be a conda update? AHAH
This sounds so good! What settings did you use?
what is this multimodel hub ?
see my post about TRON in #🪦┃community-updates
#oobabooga #stablevicuna #bark #jonnyquest
it seems that at the end of sentences after every period, the speaker seems to always choose to say either 'and' or 'umm'
haha, you can probably lower the threshold min_eos_p to help with that?
Very cool to see all the audio prompts, especially the music one like kpop_acoustic. I assume these are custom prompts that you created and use with npz files?
My first test 🙂 The poem/song is written by ChatGPT. At the end I had written "Thank you for listening!", which turned into "Thank you for, what's the name? ... ... ... John", haha.
how did you create such a long audio? i am only able to create 14 sec audio
I used this https://github.com/C0untFloyd/bark-gui it generates longer passages in chunks and combines them at the end
Hey everyone, welcome back to our channel! Today, we're going to explore sustainable living and resource management, and how small changes in our daily lives can make a big difference for our planet. If you're interested in making the world a greener place, then this video is for you! So, let's dive right in!
A politician trying to explain that the Egg came first, and not the Chicken.
Thanks for the update, it is working a lot better now.
{man} So, I was thinking I could come over around 3?
{woman} And what would we do?
{man} Let's just watch some anime memes [music] [songs]
{woman} [moan!]!
Hahaha, hello, the weather is really nice today
#🐣┃suno-showcase how could this be possible?! today is still a rainy day!
test for huangxiaofa
No, I don't
Do you know how to use it?
introduce yourself
how do you get such long outputs?
use the long form generation notebook in the repo
I tried but it doesnt seem to generate a sound file ):
there are multiple UIs that allow >15s generation
self plug https://github.com/JonathanFly/bark
can you send me an example of the complete file for long form generation please?
see JF's bark fork link there
well as long as we are self plugging https://github.com/rsxdalv/tts-generation-webui
though more people are familiar with JF's fork so you might get more support with it
Looks cleannnnnnn, and multiple models. An inspiration for when I clean things up
Thank you.
ngl gradio really has heavy limitations, I recommend doing research before investing too much time in a GUI with gradio
I finally went back and changed my text fields to number fields, like somebody who didn't discover a week ago
Yeah I know, I am dying here
Like for example, the dropdowns. I kept trying to make them show a different name to the user than the actual value. And apparently this isn't actually a feature. Like, I thought that was a fundamental definition of a dropdown lol
So you have to pass a function and process it, like what
#📚┃suno-school guys 😄
actually the amount of gradio pain is so large we could fill a channel with it, I'll create a thread in technical discussions
It's fascinating how if you use existing song lyrics, you don't need notes. This was sample #2 on a no-notes test, and it tracks the original melody pretty well!
Also I don't think line by line formatting makes a difference, or marginal if it does, it's just the lack of periods I believe
That acts like a music note
Made with @tardy topaz's repo
I'm probably gonna be tied up for a couple days and never put out anything new, but there is a dev branch if you crave something new. https://github.com/JonathanFly/bark/tree/dev has some cool user templating stuff, and the main functions should be fine unless I broke it right before I pushed.
how?
been having lots of fun converting short stories to audio
https://www.youtube.com/watch?v=H6aQ3NyPwPI
hello, how long does a short story take to generate? < 10 mins?
and, add this to the code, the result would be 10x better.
I didn't time each generation, but I would say on my WIN10 pc that has 2060 rtx super 8gb, and 32 gb ram, it took about 2-5 mins, the time traveling indiana jones one was probably between 5-8mins, using the large models with offloading
so it costs 15 min to generate 2 min of audio? maybe you need to update your software too
yes, solved, now it only takes 3.5 minutes
how would you get audio without using the notebook? i run plain python on pc and got output in bool format, playable with vlc but not a plain windows-format wav.
not sure why yet
just use ffmpeg to convert it to mp3 or something
import numpy as np
from scipy.io.wavfile import write as write_wav
# Audio(np.concatenate(pieces), rate=SAMPLE_RATE)  # notebook-only playback
audio_array = np.concatenate(pieces)
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)
makes it, but in bool form
as long as you get the audio, you can easily make changes to the audio file.
i think it is the encoding part
what does the encoding?
i get audio
but my other projects get a windows playable format
just trying to figure it out now
😹 i'm new too. don't have an answer yet
i'm new too
🙂
this is what i get but plays fine in vlc
not windows
download and try it
sounds perfect, are you using the bigger model?
no small models
small gpu
also plays in itunes
but not in my windows players lol
also got it to play from array only
like if you wanted a conversation with a chatbot
RTX 3060TI
Float values allow for more precise representations of the data, which can be important for maintaining fidelity during encoding and decoding. Additionally, some encoding algorithms may require float values as input to perform certain operations, such as normalization or feature extraction. Overall, using float values can help ensure that the encoded data is as accurate and representative of the original data as possible.
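One possible explanation for the "plays in VLC but not in Windows players" symptom, offered as a guess: if the array passed to scipy's `write_wav` is float32, the file is stored as a 32-bit float WAV, which some players don't handle. A minimal sketch that writes plain 16-bit PCM instead, assuming mono float audio in the range [-1, 1] (the function name is made up):

```python
import wave

import numpy as np

def write_pcm16_wav(path, audio, sample_rate):
    """Write mono float audio in [-1, 1] as a 16-bit PCM WAV file."""
    clipped = np.clip(np.asarray(audio, dtype=np.float64), -1.0, 1.0)
    pcm = (clipped * 32767).astype("<i2")  # little-endian int16 samples
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
```

Usage would be along the lines of `write_pcm16_wav("bark_generation.wav", np.concatenate(pieces), SAMPLE_RATE)`.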
Thanks for sharing, but please change the format of your git page, it literally damages my retinas to see that many emojis in one place, otherwise very nice
Probably not the readme file itself, (because as far as I know there's no way to have two versions/themes in github) but the actual app I plan on having a non-silly emoji-free display mode
Well I could link to the different readme, that'd be easy
Is your fork less GPU intensive? It seems to run faster
I just defaulted offloading to CPU on, but it's a setting in the regular Bark. It's hard to tell the difference in speed
It might even be faster, for some reason (more free GPU memory?)
Did you know? You can extract the source audio from a history prompt npz. here are some examples from the default history prompts
haven't been able to extract the source audio from the announcer yet though, as that one is in uint16 format, while the others are in int32 or int64 (although i did have to manually convert int64 since gradio didn't have it built in. (just do data / 4295229444))
nvm don't do that division lol
That quality is insane, have you trained that model for those voices?
these are the official voices, i just extracted the audio files they were based on
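A hedged sketch of how a history prompt could be inspected and decoded, not necessarily how it was done here. A Bark history prompt `.npz` stores `semantic_prompt`, `coarse_prompt` and `fine_prompt` arrays, and `codec_decode` from `bark.generation` turns fine tokens back into a waveform. The two helper names are hypothetical:

```python
import numpy as np

def describe_history_prompt(path):
    """Return {array_name: (dtype, shape)} for every array in an .npz file."""
    with np.load(path) as data:
        return {
            name: (str(data[name].dtype), data[name].shape) for name in data.files
        }

def decode_history_prompt(path):
    """Decode the fine tokens of a Bark history prompt back into a waveform."""
    # Imported lazily: this step needs Bark and its codec model.
    from bark.generation import codec_decode

    with np.load(path) as data:
        return codec_decode(data["fine_prompt"])
```

Checking the dtypes first with `describe_history_prompt` is a cheap way to spot prompts stored in an unexpected integer format before trying to decode them.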
You can do more than that! I made into a feature:
interesting
I turned the mutation up too high, but you can generate more subtle variations. Just access the semantic prompt, and treat it like a new sample. In this version I also chop it up, so it's super diverged
I went a little overboard on the RNG for that one, but the more moderate one is really good if you have a weird noise or hum
it'll usually generate a variant without it
i'm mainly looking into how bark history prompts work so i can attempt to make a voice cloner that generates the semantic prompt from the actual audio file instead of just generating one with the same text and then praying that it works lol
You'd have to train a model
here's a en_speaker_03 variant who doesn't hmmmm as much, probably
yeah i figured
creating the training data will be easy but time consuming. and i'd probably use a markov chain to just quickly create a bunch of text for it lol
also the bark in my webui is like a monkeypatched frankenstein's monster from what it originally was lol
mine too, it's such a mess i keep not integrating my actual new stuff
ugh
i've got some real smooth long clips now
funny to reuse semantics, you get a different voice. but same speech patterns (notice the "ssimilar" and the "like-")
Anyone knows how to create deep, rough voice like old man?
Probably trial and error. Although my voice cloning method will probably be easier, but i haven't really finished it yet and i don't want to release it while unfinished.
How do you use this again?
Same question, haha
There are some ways I have played with, but honestly the simplest way still works best.
1. Write a text prompt that sounds like a rough older male voice, something they would say
2. Generate 100 random voices, pick the best.
3. Save the best voice, use it for your actual text
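The steps above could be sketched like this. `generate_audio(..., output_full=True)` and `save_as_prompt` are Bark functions; the wrapper, its parameters, and the file naming are made up for illustration, and picking the best candidate is still done by ear:

```python
import os

def generate_voice_candidates(text, count, out_dir, generate=None, save=None):
    """Render `text` with `count` random voices and save each as a reusable prompt."""
    if generate is None or save is None:
        # Imported lazily so the loop can be exercised without Bark installed.
        from bark import generate_audio
        from bark.api import save_as_prompt

        generate = generate or (lambda t: generate_audio(t, output_full=True))
        save = save or save_as_prompt

    os.makedirs(out_dir, exist_ok=True)
    candidates = []
    for i in range(count):
        # No history prompt is passed, so each run should land on a new random voice.
        full_generation, audio = generate(text)
        prompt_path = os.path.join(out_dir, f"candidate_{i:03d}.npz")
        save(prompt_path, full_generation)
        candidates.append((prompt_path, audio))
    return candidates
```

After listening, the winning `.npz` path can be passed as `history_prompt` when generating the actual text, at least in recent Bark versions that accept file paths there.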
not sure if someone has tried this prompt before but mildly amusing
same prompt, second generation:
I used this speech too much as a test sample to make sure nothing was bugged, starting to hear it in my dreams
This seems like a Balenciaga video in the making
love that movie. The AI does a decent job but hard to match the real Fletcher
The very first music I tried was this silly Korean nonsense song and it's still one of the longest coherent clips, musically (also a bunch of others in the YouTube) https://www.youtube.com/watch?v=4pV9d25KqCE
A silly experiment with multi-lingual AI text, drawing, and music.
A silly experiment with multilingual AI text, music, ChatGPT drawing, and singing.
If you're seeing this silly experiment in your YouTube feed, I apologize, I checked the box that says "don't publish this video" and I thought that's all I have to do. But I haven't made a video in three years I forget how this works...
The second segment was one continuous run with full feedback, the last clip used as full history for the next clip, no fancy merging or any tricks, but it somehow stayed coherent. The trick is just a single guitar I guess, not too complicated
I wish I had been saving exact generation parameters but at this time was totally trying random things
The only times I've gotten similar coherent Bark output is when I used a very well known song, and it literally outputs an approximation of the melody and chords. But this is the best so far with novel text
Hey Im new
Just posted a song I made using bark!
https://www.instagram.com/p/CsKgu-5NWCh/?igshid=MzRlODBiNWFlZA==
Just started experimenting with bark, hope more sounds can be added
I got a 'video has no sound' error, FYI
search this Discord for .npz to find a bunch. Hm actually that doesn't return files, but just scroll up in #🐶┃bark-technical message
Can anyone tell me if i can use bark to create podcast audio and upload it to YouTube?
Damn really?? Not seeing that on my end. Thanks for the heads up
One shotting full music seems tough, but you can gen good beats to build from
very nice
Punjab Caretaker Chief Minister Mohsin Naqvi has said that staging protests is the right of every political party but “when those political workers reach Cantt, they convert into terrorists”.
“The worker of a political party cannot attack Jinnah House (Lahore Corps Commander House), a terrorist has done it,” he said, adding that around 400 people had gone inside the building while 3,400 were outside.
“No matter what happens, we will not sit idle until each and every person involved in this is arrested.”
He said that there was “no doubt” that these protests were “pre-planned”.
So, we did some benchmarking with Bark on an H100 and the results were very promising. Also, thanks @tardy topaz for the audio snippets. 😊
This is dope!!
Can you share the source code for processing in batches? My understanding is bark out of the box doesn't support batch inference. If you guys built this, it'd be awesome to take a look at how you did it!
We aren't doing batch inference, we're just batching up the requests in order.
What is batch inference in this case?
Is it just taking 10 sentences and running them at the same time or is it more advanced?
Source code for what we used either way can be found here: https://github.com/neocadia/bark/blob/feature/add-http-api/bark/serve.py
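To illustrate the distinction being asked about: batch inference usually means stacking several prompts into one forward pass of the model, while batching requests in order just means queueing them and running them one at a time. A generic sketch of the second idea follows. It is an assumption about the general shape, not the code in the linked `serve.py`, and the class name is invented:

```python
import queue
import threading

class SequentialWorker:
    """Run submitted jobs one at a time, in arrival order, on one thread."""

    def __init__(self, handler):
        self._handler = handler
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, payload):
        """Queue a job and return a one-slot queue that will receive its result."""
        result_box = queue.Queue(maxsize=1)
        self._jobs.put((payload, result_box))
        return result_box

    def _run(self):
        while True:
            payload, result_box = self._jobs.get()
            try:
                result_box.put(("ok", self._handler(payload)))
            except Exception as exc:  # report failures instead of killing the worker
                result_box.put(("error", exc))
```

Here `handler` would wrap the actual text-to-audio call, so the GPU only ever sees one request at a time.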
Sweeet thanks for sharing!
Made with Bark: https://www.youtube.com/watch?v=ep47PDXQvwk
Get the free text-to-video add-on for Blender here: https://github.com/tin2tin/Generative_AI
Voice by Bark: https://github.com/suno-ai/bark
Music by szegvari: https://freesound.org/people/szegvari/
Text by chatGPT
Gives me 90s radio vibes with the male voice
Bark, Animov(Modelscope) and chatGPT: https://www.youtube.com/watch?v=gm50m_yEyCQ
"What if Trevor Noah was French?" Preview of future Bark Infinity fun features. Don't hammer me with questions yet, still sorting it out and will do writeup later. I am big time under water with real life work so not till weekend at earliest - I need to chill on Bark experiments, like seriously seriously. But experiments like this is why Bark Infinity hasn't been updated. The future of Bark is bright. We haven't seen anything yet. 👀⏳➡️🌟
I love you
anyone here cloned any famous person voice?
like trump / elon musk / biden / david attenborough etc?
i need data
Tricky to thread the needle between 'speak with any accent you want' and 'speak with a random speech impediment'
The Trump voice clones pause in the middle of speech but also don't stop talking while doing it because the only data they have for the clone is the real Trump - who pauses every few words, lol
funny
what is your name
@tardy topaz check this out lol
I hear something like that a lot. One thing I always hear is sound like that instead of applause, like in a talk show or crowd. The crowd always sounds like static.
So many weird artifacts
These were being messed with not really natural, but still weird
This is more of what I was going for
I think it's tricky, whispering has a very specific like microphone tone
I think it'll work just be trickier. I managed to use sets of voices to influence other voices to have similar accents, but a whisper is probably a little more subtle effect, so I bet you have to do a bit more work to tease it out
Even the accents more often than not just cause weird speech problems
I bet you just need a lot of good samples. Like if you had 1000 clear non whispering voices, and 100 whispers, you could probably sort of take the difference between them, and get an idea for what tokens to push
I'm trying to find the right words for generating cuz there are certain things you can say to influence it, but 90% of the time it'll just make all the results bad
It's probably not really worth it honestly, versus just randomly finding some cool voices that sound like they are whispering
Like a specific sentence will get the results I want
But like, that sort of general workflow, it might work for any style. Set of voices like X, versus set like Y, take the difference, use that as a nudge
that's the long term idea
But right now it's like 1/3 almost passable french accents and 2/3 people who sound like their lips were tied together or get stuck on a syllable
I've also tried tongue twisters and one of the results paused, laughed, and gave up on it
Half of the others got mixed up
Interesting
That's a fun result
Like, the model is not reading your text. But really it's just actually being smart about what a real human would sound like
Shy sheep show sheepish smiles. <- that was the prompt
One of the results paused to get "smiles" right
a brief pause
Honestly that's a great little showcase for what makes Bark different. Bark will screw up if you give a tongue twister! That's why it's so cool.
If it wasn't way too late AM I might try it myself...
I did try and do Math using the force to keep going flag, you know, can bark add two numbers together? But it wasn't that interesting so far
Possibly
If you prompt something like, "You want me to shut up? Ok I'll be quiet. (Then some other sentence)
You might get like grumble or a whisper
Like imagine a TV show scene or something that would be half normal and then switch
I don't like using [whispers] and stuff, just feels like the overall quality is worse
Though if you DO get a good voice. that's probably one you can use that tag with
I usually have more success with throwing [yawn] or [yawns] somewhere in the middle of a sentence, but if it doesn't work it ruins it
cause it makes the result sleepy
Yeah I agree, if you HAVE to use a tag, put it in the middle, between two normal text blocks
An example.
Man that's hilarious
I'm actually gonna run like a million tongue twister samples, someday. Make a 10 hour youtube video
And this one gets it right.
Have you tried like, ridiculously large and complicated words to pronounce? If Bark is good, the person will like pause, maybe think for a sec, and then struggle through it?
Not yet.
I haven't tried, but hopefully it's like that, that'd be cool
I think I tried superscript or something and it ends up saying random numbers and letters instead
either that or small caps
Hah. I wonder if there are actually some speakers that can like, perfectly do math equations. Like it must be in the training data, math classes on youtube or something
But not sure what the subtitles look like there
Probably too inaccurate
this may be because this phrase is used too often
Maybe random can do it because that's the text that created them. And then an existing speaker might struggle?
Like if somebody chose to say that word on TV
it's probably like, not a problem for them
I'm not sure if it can get stuff like, "You want me to say what? Blah" and then it should have trouble, right?
That's probably a pretty common pattern
One thing I wanted to try is like, a prompt that is only used as setup. Like you say (I can't pronounce that!) and it gives that to the speaker, renders it, and then uses it in the next sample. But it's not part of your final clip.
So you just use it to try and change the audio style
Haven't tried anything like that yet
Not a high priority. I bet it works but just super randomly, so really, not that useful
He meditated so hard he escaped our universe, leaving a giant hole, causing a traffic accident as you can hear in the horn. A short story with just the letter M, never mind six words.
That en_fiery voice is such a trooper, no matter what you give it. Even sings Baby Shark with genuine enthusiasm.
I'm really starting to genuinely dig the not-quite-singing but not-quite-talking way Bark renders a lot of songs. It's musical, but not sung.
@tardy topaz I booted up your UI, looks pretty sweet. It's got only the Suno default voices in it though
are yours in there yet!
?*
@tardy topaz Best one so far for whispering.
SOON. There are many I've posted in this Discord that are good, if you search
Honest to God, Bark out of the box is basically a perfect "Spoken Sung" model. As best exemplified in the classic William Shatner "Rocket Man" clip that I can't believe is such bad quality on YouTube that honest to god I might have a better copy on VHS somewhere. How was this not preserved OMG https://www.youtube.com/watch?v=lul-Y8vSr0I
From The Science Fiction Film Awards, William Shatner's unforgettable performance of Elton John's "Rocket Man".
Includes Karen Black's introduction of Bernie Taupin, and Taupin's introduction of Shatner.
Rock-It, Man... :-)
This aired on local Chicago TV on Friday, January 20th 1978.
About The Museum of Classic Chicago Television:
The Muse...
We need the best science and the best AI to restore that clip. All the alternatives on YouTube seem terrible. This is critical.
Maybe Bark... I can clear up Bark generations with enough sweat and luck. Once you can reverse the semantics like Mylo is working on, which might be days away, you could encode all the lyrics into Bark and regenerate them clean. Or maybe it just works perfectly with a single Shatner model and you don't need to do each lyric. Because Bark is that good, seriously.
@tardy topaz I have 29 good/okay results so far, do you want them :p
I'm gonna try and do some basic grunt work first, bugs in Bark Infinity, install easier. But if you send them (drop box, google, whatever) I might take a look this weekend. Have you tested how the speakers generate if you use them? Do you need to use special text to get them to talk in a whisper?
It's okay if they don't match the text BTW!
as long as they make natural sounding audio outputs that sounds like a real person speaking
then it probably doesn't matter, the way I use them, which is just as a target reference point to nudge bark towards those tokens
I haven't tested yet, I've just been generating them. All of them match the text. One of them is 11 best results out of 200, another is 15 out of 100.
The important thing is, if you use the voice, does it sound like whispering? Or maybe it does but only if you prompt them correctly. The first is the best, the second is still useful, but keep them in separate groups
Like whispering I mean
The first is really good though, since that's what we're going for, just make ANY speaker file whisper with no special prompt
BTW if you happen to get any really nice clear singers, send me!
I need more
They are actual whispers
Oh can you try and vary your prompt? I know it's a pain
But I think one problem I am having is like, I used voices as reference. But all the voices were speaking THE SAME WORDS. So like if you 'more closely match a set of voices all speaking the same words' then that's kind of pushing it not just towards whispering, but towards those words.
I noticed if I more closely matched the French samples, the output was worse. But all the French samples were speaking the same sentence so this kind of makes sense
ok
If I tried to look at the samples and detect 'what does whispering look like' if they are all saying the same words, then that will also include the words "I'm whispering"
It's still useful though to have all the ones that are the same
But thought I'd mention, if you CAN do it, it's better
You can just use the voices to generate more voices!
If they still whisper, and it's high quality, then that's fine to get more diverse output
Even I could do that but it's just kind of a boring thing you gotta grind through
I can still try
All that said, you can send your current stuff. Maybe it just works!
dropbox or google share I guess?
Don't bother right now, tomorrow night earliest, and probably sat
I was gonna see if I could use another method like you said
maybe include I'm whispering in brackets and see if it doesn't break anything
If this does work I'll have to give a custom script, might be a while before I make this a Bark UI feature, not even sure how yet
The easiest way to improve that dataset, take each voice, and make like 100 unique prompts, give each one 2 or 3, save the best from the set. So ideally we have a set of all different words, and some voices can be there 2 or 3 times, is okay I think
Maybe take a book and use every sentence, try to cover all the possible basic sounds, is the idea
None of that might be needed, but if you're poking around anyway...
Maybe you should add a batch generation feature to make that easier then, lol
So, generate 10 results with X and 20 results with Y text
if that's not already possible
There is, on the dev branch
ok ok
The one you provided https://github.com/JonathanFly/bark
But the web user interface?
Says command line
Like do you use a web browser?
God, I got to stop making jokes
It's a WebUI, so it's a joke because I put console output in the Gradio app
but software you know, not really the best avenue for humor maybe.
Mhm...
Is there an option to give it a folder of voices?
Let me check
There should be a checkbox like, "don't join the text"
so then you can put in a long text
and split the text however
and it's all separate
This ?
That's not it
hmm, i don't think it's in that version
I doubt it
In dev branch there is this
Don't have that
You can use that now
git checkout dev from command line
I didn't put the folder input of NPZ files in there alas
but that checkbox might help you
you meant git
yeah
You might also find this useful:
So if you use the text again, it won't be split exactly the same, more diversity
It's like, whatever number you have, + that value, or - that value, randomly. So if you have 150 as your goal size you might get 140 or 160. Then if you use
Honestly I'm worried dev is bugged, since obviously nobody is really testing it. But I will AT THE LEAST do a bug fix pass this weekend
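For anyone reading along, here is a minimal sketch of that random split-size idea. It is not the Bark Infinity code: `split_text`, `goal_size` and `jitter` are hypothetical names, and it assumes the goal size is counted in characters and that splits land on word boundaries.

```python
import random

def split_text(text, goal_size=150, jitter=10, rng=None):
    """Split text into word-aligned chunks. Each chunk gets its own target
    size of goal_size + jitter or goal_size - jitter, picked at random."""
    rng = rng or random.Random()
    chunks, current = [], []
    target = goal_size + rng.choice((-jitter, jitter))
    for word in text.split():
        candidate_len = len(" ".join(current + [word]))
        if current and candidate_len > target:
            # Close the current chunk and draw a fresh target for the next one.
            chunks.append(" ".join(current))
            current = [word]
            target = goal_size + rng.choice((-jitter, jitter))
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Running the same long text through it twice with different seeds gives different chunk boundaries, which is the "more diversity" part.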
This one is surprisingly close to the Jhin quote I used.
At one point when you use the split prompts thing, it was not properly clearing the last voice. So all the samples would sound the same. I can't remember when I fixed that, so look out
Will try it out later
Is that a common expression? I noticed a lot of that with like popular song lyrics in Bark
You get a surprising amount of matching cadence
It's a quote from a video game so I don't know.
Is it like a fighting game line or something maybe? If so I bet it's in the YouTube training data like 1000 times, for each match, lol. Just over and over.
https://www.youtube.com/watch?v=QwQ3i9L0j74 Most likely. The character is iconic.
This is the voice for the new champion Jhin.
His voice has music attached as background
Kinda like Bark
Oh yeah
I bet you hear that when you click on something in League of Legends
So just imagine the incredible number of times that's got to be on YouTube
I bet there's like a 'click on character who says iconic line' style in Bark, that you could trigger
sort of like how you can trigger 'this is a commercial' style, if you've done that
I love so much how if you just keep cranking up the weights on the French accent Bark actually almost REPHRASES YOUR TEXT PROMPT like a French person. Crazy that it somehow works. It usually doesn't, but even 1 in 10 or 20 is all you need to make a funny YouTube video or whatever.
Oh that's an old sample, hmn, still a bit of it but not the one I meant
How do you "crank up the weights"?
Right now it's a mess of code and super hit or miss, but some day it will be a feature in my fork. Basically I look at a set of French versus English and try to up the chance of French tokens
So in the UI you will basically pick a 'target set of voices' and a 'reference set of voices'. The reference set is English voices, the target is French, and then it tries to find the main difference and increase the odds of those tokens in Bark, for your speaker. And honestly accents are the least interesting thing you might do with this format, you can think of many cool ways to use sets of voices like that. But accents seem like a simple case where I can check if the idea works.
Right now there's like 5 or 6 numbers that are super fiddly, like threshold numbers for 'how common should the token be in French', but probably there's a way to make it automatic based on some rules. But I am shelving this until I've made some basic updates or I never will, lol
Oh I see, so there is some expected distribution of tokens in English vs. French, and you want to overexpress the French tokens? What's the interface for overexpressing tokens? Does it look like the SDWebUI? thing:1.2?
I guess... is this a prompt engineering thing you're doing or is it model surgery?
Right now it's scaling the odds by the frequency in the target distribution, with a lot of cutoffs for outliers, a boost for the most 'French but not English' tokens, and an especially hard penalty for tokens common in the English set but not the French. A ton of hardcoded values I picked out of a hat, no science at the moment, but a proof of concept.
But the method right now is just trying some values, didn't work, up them a bit, worse, okay lower them, okay that worked. etc
a big mess
I see. Does this work for all models in the public repo? It seems like it should if you're changing the frequency (are you basically adding silent "frenchy" tokens?)
It works for the 3 models I happened to be testing with, so I kind of assumed it was general
None of them happened to be Suno, but I was using Suno voices in the target set as French and English examples.
I'm just multiplying the odds. Like this token is 4x more common in French than English. So in the model, at the last step, you just slap the multiplier in there. I thought I would have to also consider tokens in front or back too, since the order of tokens should matter. But it actually kind of just works with a general multiplier. Or a negative penalty for English.
Oh you can compare like speaker voice against average English speaker, and give their tokens a bit of an exemption from the penalties
otherwise you can wipe out og voice
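A rough sketch of the "count tokens, multiply the odds" idea described above. This is not the code from the fork: the function names, the smoothing, and the `max_ratio` / `min_count` cutoffs are all made-up assumptions, and it only assumes you have 1-D arrays of token ids per voice and a logits vector at the sampling step.

```python
import numpy as np

def token_bias(target_sets, reference_sets, vocab_size, max_ratio=4.0, min_count=5):
    """Per-token log multiplier: how much more common each token is in the
    target voices (e.g. French) than in the reference voices (e.g. English)."""
    target = np.bincount(np.concatenate(target_sets), minlength=vocab_size).astype(float)
    reference = np.bincount(np.concatenate(reference_sets), minlength=vocab_size).astype(float)
    # Add-one smoothing so unseen tokens do not produce infinite ratios.
    target_freq = (target + 1) / (target.sum() + vocab_size)
    reference_freq = (reference + 1) / (reference.sum() + vocab_size)
    # Hard cap on how strongly any single token is boosted or penalized.
    ratio = np.clip(target_freq / reference_freq, 1 / max_ratio, max_ratio)
    # Cutoff for outliers: leave rarely seen tokens alone.
    ratio[(target + reference) < min_count] = 1.0
    return np.log(ratio)

def apply_bias(logits, bias, strength=1.0, exempt=None):
    """Adding log(ratio) to a logit multiplies that token's unnormalized
    probability by ratio."""
    bias = bias.copy()
    if exempt is not None:
        # Exempt the speaker's own tokens from penalties so the voice survives.
        bias[exempt] = np.maximum(bias[exempt], 0.0)
    return np.asarray(logits, dtype=float) + strength * bias
```

Here `exempt` would be the token ids that show up in the speaker's own prompt, which is one way to read the "exemption from the penalties" comment.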
Can you show me how you over express tokens? I've seen this in the SDWebUI as well, but am not sure if it's custom or a package
Oh it's not models, sorry, I mean prompts
This seems to be one of the libs:
https://github.com/damian0815/compel
Right now it's in the Bark core code, just my custom code
I should probably look at that, because I bet there's smarter ways to do this!
Oh that's weighting parts of the prompt. That would also be cool in Bark!
This is a lot more sophisticated than what I did, probably worth looking into
I'm not really familiar with Stable Diffusion internally so not totally sure how much applies
Gosh, the negative prompt is really fun in Stable Diffusion. I guess you could run one bark generation with a negative prompt, save that token distribution, and then run your positive prompt and try to penalize the tokens in the past negative. 90% chance this is completely useless but 10% could do something interesting?
Anyone know why LLMs don't have negative prompts, is it just useless?
No idea. Negative prompts have a huge impact in stable diffusion from my experience though.
It can't be as easy as the idea I explained, or else it would already exist. Probably it just doesn't affect the output the way it does in Stable Diffusion.
But it still might be interesting in Bark, since it's not quite the same.
jeez, also, there is a super easy way to make this actually useful i'm pretty sure will work.
I am too overbooked though, just writing it down, not trying this
Honestly the french accents thing kind of covers this idea. I can probably work it in there. Essentially you can pick an .npz file, a past sample, as a negative prompt.
And maybe I can get that to work
However it's gonna need a LOT of tuning and tests, so you only penalize the right things. Rather than just like, 'the sound of the human voice'
But i'm sure it COULD work
I'm just imagining the Bark WebUI. This is literally like picking 500 .npz files now, in different menus. Like seriously out of control. hah
Gradio is not ready for this.
Unless I missed it, THERE IS NO FOLDER PICKER?
Can you think what negative even means in Bark? It's a kind of abstract idea! Negative to the raw audio sound, the emotion, the gender, it's not very concrete.
Best I can think of. Take 100 average English samples. Take the negative prompt. Find out what's most unique in it. Then penalize that. Maybe, possibly, that works.
It would associate any word as a negative.
Like for example, if you had a negative prompt of music notes, with the same as your positive prompt. What would you want Bark to do? Just be super formal and monotone?
I'm not sure what 'working correctly' means.
The word could be linked to other things so its not 100% guaranteed either, like how I only got 10-20 whispering results out of 200, so its like a 10% reduction on certain things depending on how they're used. I'm sure that yelling or other expressions have higher odds of appearing, so you could at least use it to filter out music, maybe it'll make the audio clearer if you use it that way?
I guess as long as it changes the output in any way you can detect, it's still fun. Just try shoving shakespeare quotes or rap lyrics in the negative prompt. Maybe it has a cool effect.
It's not gonna be essential like SD but has a chance of at least being a fun thing to try sometimes.
For sure
For whispering, negative prompt, "I HATE YOUR GUTS!!!!"
or something yelling like
Everyone is working on voice cloning. Somebody make a negative prompt and let me know if it does anything interesting. I want to try it without having to work out how to do it.
I don't really know a thing about Stable Diffusion. If it was like, trained with a negative prompt, then there's basically no chance this is useful.
Looks like it's not
It might be harder if everything isn't already tagged in Bark. Stable Diffusion has the benefit of the fact that the websites they scraped did all the tagging for them e.g. there's this one site that has 50~ tags per image.
I'm too sleepy to think this through. Will save idea for rainy day though.
And when you negative prompt in stable diffusion you're just adjusting the weight, adding brackets increases the strength of the negative prompt
It does look like they just hijack the sampler, kind of like I am, but would have to read more about Diffusion models to know how similar it is
Oh, btw
When I had ChatGPT do that invisible prompt thing, what does that do exactly to bark?
Like, I had it edit the code this one time to let me include invisible prompts
That wouldn't affect the output
did it work?
Yes
So I could do "Insert text here" and add another sentence at the end that'd be invisible
Did it affect the output though?
Yes but it also increased the odds of noise
What was the part of the code it edited to do this?
I can think of a simple way, like just treat it as separate samples
And throw away the one invisible one
But I think the noise was generated by the fact that the invisible sentence required using special symbols so the symbols were being included in the sentence which could be fixed
cuz i was using || to hide the last part of the sentence
There is a way I can add this as a feature very simply, so I might. But doing it a deeper level would be hard.
But if you split the text, as new samples, that does lower overall quality
compared to just having one long one probably
But still, maybe the simple way is useful
What symbol do they use for invisible prompt in SD, do you know?
Or some other thing, if there is a standard
"||" the symbol didn't matter, it was just two of these
It was something to be included in the prompt that would be removed, and everything after it would be removed as well
If you find the code, if it actually modified generate_text_semantic
the old bark code is simpler so i might check how i did it that way
then please show me cause I'm not sure offhand how to make it work at that level, without experimenting
Maybe it just replaced them with pad tokens or something
I was using the first version of bark so idk how much you changed lol
But the idea is they should still affect the text. so if you put I AM YELLING
then the person starts yelling
I kind of think it just deleted them, and they didn't still affect the audio. But if it actually does work, heck, I'll just add that to Bark Infinity
chatgpt can be pretty smart, so it could have done it right
I'll try it again.
maybe it was this i needed to edit... def generate_text_to_speech
No hurry honestly I don't need more features, I need basic bug fixes.
The easy and fast way to do it: just split the text, and ignore that segment in the final audio. But still use a bit of it as the history_prompt for the next actual audio segment. I could do that simple version and it might be cool. Though I don't have the partial segment audio joining even in Bark Infinity yet...
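The simple "split it and throw away the invisible one" version could look roughly like this. It is only a sketch: `split_hidden` and `render` are hypothetical helpers, and `generate(text, history)` is a stand-in for whatever call returns both the audio and a new history prompt, not a real Bark function.

```python
def split_hidden(text, marker="||"):
    """Split 'visible ||hidden|| visible' into (segment, is_hidden) pairs.
    Text after an unpaired marker is treated as hidden too."""
    segments = []
    for i, part in enumerate(text.split(marker)):
        part = part.strip()
        if part:
            # Odd-numbered parts sit after an opening marker.
            segments.append((part, i % 2 == 1))
    return segments

def render(text, generate, history=None):
    """Run every segment through generate, but keep audio only for the
    visible ones. Hidden segments still update the history prompt."""
    kept = []
    for segment, hidden in split_hidden(text):
        audio, history = generate(segment, history)
        if not hidden:
            kept.append(audio)
    return kept
```

Done this way a hidden segment can only influence what comes after it, so a hidden sentence at the very end of the prompt would change nothing.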
I can't stay up half the night again, no rush, no chance I do anything with this soon.
All people want is easy to install Bark Webui or Colab notebook, and a nice set of clear speakers. Literally that's all I should do this weekend.
yup
To be honest I am a little confused how I ended up so deep in the audio stuff. The literal reason is that a bunch of really silly ideas keep working and that's addicting.
But I don't really need TTS so sometimes I do stop and wonder why I'm trying so hard to make Trevor Noah sing in a French accent. Not only do I not need this, does anyone? lol
That said, I bet he does a great French accent while singing. en_fiery always delivers.
It is actually ridiculous. I gotta do something with this stuff at least. bug fixes can wait a day, need to make at least one funny video with this Bark tech
what kinda video?
Since I just decided to do this, that part is yet to be determined.
But you know, singing, accents, something like that.
I do think you actually need Trevor Noah singing in a French accent
It is a basic human need ❤️
I had trouble combining both singing and french, but I didn't try too much, lol. But I agree
I think I just need more singing samples!
Or I guess French singing, then it's just one set. That should work
Singing in general works way less well than the accents. Just sounds like autotune most of the time. But the sample of singing I have is small
Also like, I'm not using like principal component analysis or other things I should probably be doing, just literally counting tokens. So not exactly optimal lol
Frankly it's just yet another thing that really shouldn't work the crude way I did it.
My singers are mixed with music, so that's probably why it goes into autotune mode
They are not all voice only
I mean maybe I can lean into the autotune sound. That could be also cool. Mainly every set requires a ton of fiddly guessing of thresholds right now, so there must be a smarter way to do this where it's not so fragile.
One application of a negative prompt for Bark would be generating logits with it as the prompt but negating them and then adding them to the regular prompt generated logits. (You probably want a control for the relative weight of the negative and positive prompts, too.) One problem that I foresee with doing this in the simple and naive way, though, is that because Bark gets things like language and accent from the prompt (regular and history), a negative prompt like this that is prompted in the same language as the main prompt may cause things like the language used to be less stable, which probably isn't the goal.
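A tiny sketch of that logit arithmetic for a single sampling step, assuming you already have one logits vector from the positive prompt and one from the negative prompt. `combine_logits` and `negative_weight` are hypothetical names, and nothing here touches Bark itself.

```python
import numpy as np

def combine_logits(positive, negative, negative_weight=0.5):
    """Negate the negative-prompt logits, scale them, and add them to the
    positive-prompt logits."""
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    return positive + negative_weight * (-negative)

def softmax(logits):
    """Turn logits into sampling probabilities."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()
```

With `negative_weight=0` you get the normal output back, so the weight works as the control for how hard the negative prompt pushes.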
guys, #📚┃suno-school maybe better for that stuff
@tardy topaz you're gonna have to run this one locally lmao
Are you on the dev branch, let me check. Recently I've been using a new method for clear voices.
I think maybe you can do it
Actually, when I convert it with ffmpeg, it works
Yeah you can't quite do it. Basically start with the clearest voice you can. Then, I'll explain later, you erase the coarse prompt. Which is or will be a checkbox. And you make the history prompt as small as possible, but still clear. And then make long 14s samples. You get a pretty good range of voices, pretty diverse, and they are usually fairly clear!
I'll make this easy. It really reduces the worst noisy speakers.
There is some common feeling between the voices, they aren't THAT different, but plenty different to make a ton
Ok, I should be able to get one in a few results
just show me how to split it later so i can batch them
not now tho
I can make one for you
Wait let me see
Honestly it's late let's table this, it's 7 30 am
haha
sorry but I almost forgot
i still think "high quality:" makes things better
first try
lol, probably just a good voice
its not a fluke
don't sweat the voices. I'll set you up later this weekend with how I do it now. I wish i had done it like this.
ok
Also maybe with voice cloning, nobody needs to do any of this?
Randomly synthesized voices are still fun. People will also want voice model merging.
Well, you can't voice clone Barack Obama but french, or whatever.
Since he doesn't exist. So my stuff is still useful.
You can model merge now actually
Bark just kind of works. Not all the time but sometimes
It's not a feature but it could be
I do it like 1000x
it's useful
I will make a feature
it is super useful
Sometimes you just get like, actually, a perfect mix somehow
For example a TTS voice, with a very human voice, it's like half that person, half TTS. just works sometimes lol
Some of my voices I really sweat over. literally 20 or 30 model merges
haha
I'll have to learn that kinda stuff later for sure
It's just gonna be, pick two npzs
instead of 1
Actually it
will be a tool. because it doesn't always work. So I usually make like 10 versions
and then one was okay lol
I got a Donald Trump whisper, almost. But the voice changes too much. Still pretty close lol.
If you use the voice it's even more changed
but maybe fixable
I wasn't trying for whisper, actually one of the most clear whispers though
whispering voices are very hard w/o directly prompting them to whisper
Yeah totally rng
only one I can remember lately
I saved it to mess with
try and make it work better
singing is similar. if you sing, voice changes
it's a really good whisper voice, lol
whispering is probably easier than accents, it's a pretty general sound
I think accents are easier
yeah maybe
should see if can voice clone a whisper
with mylo's thing
then i can use as the samples
instead of you finding them
Its a prompt.
high quality: announcer: Hello passengers, this is your captain speaking. This plane is about to crash!
oh you want to see what my bark generated without being prompted to?
"hello passengers, this is your captain speaking, this plane is about to-" and it cuts off there
Sure
so i made a voice based on a dantdm video, just the intro
. . .how?
I mean mylo
oh
this was not prompted, the prompt was completely different
"if you enjoyed this, like this video... check out AA-"
I had the notion of a 10 hour unprompted Bark YouTube video just endless nonsense
lol
it's the semantics keeping it alive at that point, as spamming random semantics will drown out the voice
probably because the prompt was the intro
this was the audio i used for cloning
it probably recognised that it was an intro, and recognised it
Have you ever seen Whisper, it hears those words so much
they are banned in the raw code
lala
lol
lol
Just give it any audio, if it's not sure, it outputs thank you for subscribing
since it was trained on youtube videos?
Yeah presumably. It literally can't stop hearing it
Any time the audio is noisy, it says that
Am I hallucinating I can't find it now
also, the voice cloning is easy to implement, and i provided some code snippets so you can easily implement it in your webui if you want to
I will, nice
I will maybe even train more
one thing I got is npz everywhere
and usually the prompt
not greatest diversity though
yeah, there's a 4 and a 14 epoch model on the huggingface repo
you could train from there or train from scratch, it doesn't take long to train from scratch
and most of the mistakes it makes are not things you will pick up on that much as a human
like it might misclassify a token for another token, but if you heard them side by side there would barely be a difference
i believe bark, with its 10000 tokens, has a bunch of duplicate tokens which are interchangeable
at least from the perspective of HuBERT base
There's some funny stuff, like some tokens are like a description, or at least it feels like it. adds an effect to the whole clip
or removes
also, the voicemod soundboard sounds pages have a lot of clips that are great for voice cloning as well
the joe biden voice clone is actually based on the elevenlabs generated audio
I think I'll still take some time to tune the clones
You can still dial them in a bit
possible, maybe a consistent sound
Yeah, background hums
sometimes go away for whole clip
or appear
not predictably but, if you're desperate, you can randomly delete. and try to get lucky
I kind of thought a hum would be temporal?
but it's like almost a little tag ? totally no idea here. I was just a little surprised
like it just changes where the prediction goes probably
but subjectively, it was like that
I don't know how semantic and language like, the semantic tokens actually are, maybe not impossible
i gotta see if i can make bark generate infinite length (and probably decode on cpu from that point on)
What do you mean, in the actual model?
or cut into chunks and then decode
You can chunk coarse and fine easily
yeah
I think maybe you can put tokens into the inference space, but I didn't get around to trying that
Like instead of putting history where it should go. take up inference space
MAYBE you can use that trick to chunk semantic?
if you don't do that it just sounds bad
Like giving an actor a 3 word first part of a line
and nothing else
you can chunk semantics probably, just make sure you have a good history prompt
I tried it a lot
2 words is the breakdown point
but it all just sounds bad
because it doesn't have enough context to perform the line
it works just bad
Bark in general, I find, give it a big text if possible. It's more descriptive.
So it sounds like you have an actor, right. And you give him a notecard with 2 words on it. He reads it. Then you give him another.
It just sounds wrong lol
here's a fun experiment
hit me, maybe i tried it
use a cloned history prompt, then generate without prompt with early_stopping=False
Oh that was literally my first idea yeah
honestly i tried to do that with WHISPER
but i couldn't figure it out
could it predict based on audio, what next tokens were likely, with no other input, based on the internal llm
so it's like speech to text but guesses what you say
I think you can do it now, in the cpp fork, but I didn't check
damn sometimes i forget to add the quick kwargs and then it doesn't auto hide (gradio please implement element replacements)
since the point of the webui is more than text-to-speech, voice cloning was just something i wanted because i thought it would be cool.
There's so many easy features I need to add.
just mashing two prompts together, works pretty well
like a model merge
not always but enough
just averages of 2 voices with the same semantics?
like, it should not work
it should work with the same semantics
but you really get a nice hybrid!
usually have to render different size variants and pick the best
even like a robot tts, and human
it's like half tts
lol
even 3 prompts, not impossible
oh here's a fun one
have you tried just taking a speaker. delete every other token
they talk twice as fast. still sound pretty natural
haha
or the opposite, double token
i don't know why I was doing it but it's actually not even that unnatural
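The token trick above is easy to try with NumPy. A sketch, assuming you have a speaker's tokens as an array with time on the last axis (which arrays were actually edited isn't stated here, so treat that as a guess); `speed_up` and `slow_down` are made-up names.

```python
import numpy as np

def speed_up(tokens):
    """Drop every other token along the time axis (roughly 2x faster)."""
    return np.asarray(tokens)[..., ::2]

def slow_down(tokens):
    """Repeat every token twice along the time axis (roughly 2x slower)."""
    return np.repeat(np.asarray(tokens), 2, axis=-1)
```

Both work on a 1-D token array or on a 2-D array with one row per codebook.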
what about this though, instead of merging 2 voices by averaging, you extrapolate the difference from voice a to voice b onto voice b or c? like the add difference merging from stable diffusion webui
yeah did you see my accent work, a little like that
it's 8:30 am and I haven't slept I'm not sure I can actually explain
but I did in discord previous messages, using french sets
of voices
and singing
I think there's SO much you can do
with voices averaging, differences, using a set of voices as a penalty or target
the singing sounds like autotune, but I realized half my singing samples were music after
so actually, that was probably working correctly
also, you keep talking about music, you can finetune bark with music to have it basically be bark but as musiclm
I wonder. Presumably if it could, base bark would be better though?
It must have seen a lot?
Oh nevermind I understand now
You mean finetune, but overwrite existing capability
Fully music Bark
yeah
Yeah that would be cool. Even finetune to specific artist
If it's fast
I really think there's a billion things left to do in current model
that's what history prompts are for
that'd be ideal, i just assumed it would still not work great, but maybe
it should be higher quality than the voice cloning with my model though, since it would actually use the same things as the original model did during training
@viral lynx great work! Are you integrating the cloned into webui. I tried to get to test it but I was lost.
i'll probably release my webui so people have something to play with cloned voices, (also, cloned voices are saved under the same name as the original voices, but with npz, in the custom speakers folder)
Your repo is meant to be used in conjunction with bark’s api right. I was just lost but I will wait for your ui and read the code.
I was trying test_hubert but it’s expecting semantic.npy lol
yeah, that's a file to compare to lol, you could technically just create an empty npy file called that, or disable the check
Oh, so it’s ok to comment out ‘original’
Makes a bit sense now that’s why it’s a test, you are trying to see if they are identical
yeah, you can remove the print as well
this here is how you can actually do voice cloning, as a developer
So how do you use the generated npy?
you put it in the npz with the coarse and fine from the same audio
to make it easy, just use a different voice cloner, and replace the semantic_prompt.npy inside of the npz with the npy from here, make sure it's called semantic_prompt.npy
I ran both and got nothing and assumed this was meant to be used in conjunction with bark
yeah, correct, the semantic_tokens from that code can be saved to an npy
and that can be used inside of a history prompt for the cloned voice, but i'll make my webui public in a bit
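For the "replace semantic_prompt.npy inside the npz" step, a minimal NumPy sketch. `replace_semantic_prompt` is a hypothetical helper and the file paths are placeholders. It relies on the fact that an `.npz` stores each array as `<name>.npy`, so the key to overwrite is `semantic_prompt`.

```python
import numpy as np

def replace_semantic_prompt(npz_in, semantic_npy, npz_out):
    """Copy a history-prompt npz, swapping in new semantic tokens and
    leaving every other array (e.g. coarse and fine) untouched."""
    with np.load(npz_in) as data:
        arrays = {name: data[name] for name in data.files}
    # Stored inside the archive as semantic_prompt.npy.
    arrays["semantic_prompt"] = np.load(semantic_npy)
    np.savez(npz_out, **arrays)
```

The coarse and fine arrays should come from the same audio as the new semantic tokens, as mentioned above.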
Awesome! I will probably get more insight that way but I will still try your instructions now.
it auto installs when you run the run.bat, you can add whatever flags to it as well, i should probably document those
Bark is too powerful. It's so beautiful.
This is typically what you get, nice singing, but doesn't feel like Obama anymore.
BUT it can be done. You can keep backing up and hit a spot where it sings, and not change. SO GOOD.
I think the UI for Bark should, rather than pick a prompt, let you pick a location in the prompt instead. That would make some of this fiddling easy.
Installed. Is there a tab I need to be in to clone?
it's in the text to speech
just pick the bark model and it will load the stuff, as "speaker from" put "upload"
and you'll get a thing where you can upload an audio file
Thanks will try now. Restarting ui
Quite impressive. Just tried it
yeah, with a good input audio you'll get really good results, + if you generated a really good result, you can download the speaker prompt from that generated audio, they are sometimes more consistent
With some effort, I managed to extract your implementation.
The code is a monster bro! How you pulled this off is quite impressive.
@viral lynx 🤝
thanks
100% - super impressive
@viral lynx you never used the models you generated in the webui? Is there a reason why and how could I try those?
I mean the models you have in huggingface
it downloads it from huggingface though?
the 14 is the epoch, the other one is on epoch 4, but i didn't rename it lol
yep
@viral lynx what are your suggestions for audio length to clone from, and why do I sometimes get a different voice between chunks?
around 6 - 10 seconds is usually great. sometimes you get a different voice, i recommend saving the npz that comes out of a good result, since that one is fully bark generated
Thanks for the hint. Will probably generate 10 samples then pick the best
Best use of voice cloning, no prompt, no stop, just get cool audio you barely hear from Bark typically!
Some of the best sound effects and music instrumentals, and like animal sounds, etc, feels a lot different than typical Bark sample
Less structured but also kind of more natural in a chaotic way, super neat
There's more sound effects in Bark than I thought
Mylo has given so many ideas I can't move. You can train this in like 8 hours? You could try SOO many really wild, unbalanced, possibly absurd datasets, like 2 a day, and see what happens!
Maybe nothing for all of them and you stop on day 3, still cool
again, not 8 hours, 20 minutes
the 8 hours is the amount of training data i had
but it trains faster than realtime
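To put a number on "faster than realtime": a quick sketch of the arithmetic, assuming the 20 minutes covers one pass over the 8 hours of training audio (the chat does not say how many passes that time includes).

```python
def realtime_factor(audio_hours, training_minutes):
    """How many minutes of audio are processed per minute of training."""
    return (audio_hours * 60) / training_minutes


# 8 hours of audio in 20 minutes of training
print(realtime_factor(8, 20))  # 24.0
```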
A fun rap AI taking over the world.
yoo how'd you get obama and trump?
Those are hand made, but you can also clone them now
Or both, which is actually still kind of maybe necessary
I still had to tweak the wav clones honestly by ear
Nice
I kind of copied and pasted all that into my code, I may even update. Not a polished release but it works.
You can do both now. Clone automatically, model merge, etc. Though merging is not in the next update
Can you point me to where I should look for the training part? I've played with the inference a lot and I really like the flow of the voices
you mean voice clone? training no idea
just thought I remembered somebody trying a new language
Yep voice clone my bad, I thought it was classified as training!
(trying to replicate my own voice 😅 )
Ohh this looks terrific I'll give it a go: https://github.com/gitmylo/audio-webui
I should have done it but I wasted the day, and now I'm tired
are you technical enough to install via conda yourself?
Yep I'm a TD in VFX 🙂
I could push this version
but it doesn't have an updated yml for conda or pip list
someone would have to just figure it out
and i don't have time until maybe late today
I could PR that if I manage to figure it out
it does have the cloning
but it's like just a mess
for production
i mean sure whatever
let me just at least remove print statements
Ahaha, well I've seen everything in the AI/Python world I'm immune to this now 😅
Hit me up with the link when you can and I'll contribute back if I can! Thanks Jonathan!
it's https://github.com/JonathanFly/bark so i'll push a new branch probably.
actually if you check for problems, that would be helpful
since i plan on doing more tonight
maybe an hour
the cloning will work
not sure about generation though haha
this is a real mess all in one file just to do it in a couple hours
haha
I think I can make anti voice clone work, or at least, be occasionally funny. I've been saving buckets of clone tokens and trying things, instead of fixing critical filename crash bugs. audio as input, just alone. cool idea. it's just more tokens. even a vague style hint. a minute of audio is a lot of tokens.
I made a simple web api for whoever wants it, it has streaming and file based generation, and a simple short lived queue https://github.com/demandcluster/bulldog
Check out a sample short podcast I created using Bark: https://youtu.be/CW790VwEO9c
🎙️ "Tech Talk with Rob: IoT Workshops in Agriculture" 🌾
Join Rob in this insightful episode as he explores IoT workshops in the Faculty of Agriculture in Israel. Powered by AI and featuring seamless narration by Bark TTS technology, this podcast delves into how technology is revolutionizing agriculture.
Discover how IoT bridges the gap between...
@viral lynx I am getting these errors while running the ./run.bat script. Plz help
are you on linux? use the sh files instead if you're on linux
Like WAT how ???!?! I got work but bark will be what I mess with asap
This is really amazing. I am new to bark ai and new to the server.. this is one of the first things i've previewed-- I was curious how you got this rapper voice?
is there any documentation for voice cloning with bark
It's a bit rough, it's only been out since last Friday. I'll add some soon.
The short version in my fork is upload a wav file, you get a bunch of voices. Try the voices maybe you get a real good one.
Oh bark infinity does voice cloning differently? That sounds interesting
AI musings on using sexualized consciousness to travel through time. Ooba Booga WebUI with Stable Diffusion. Manticore 13b LLM with Bark TTS and Automatic1111's SadTalker extension.
(this is post-processed btw, but the voice in the beginning is straight bark-infinity output)
I made a jupyter notebook that you can use https://colab.research.google.com/drive/1IA3c_R859nANerMARazCSrjc2UD3ws8A?usp=sharing
I had a bot I'm building summarize an article, then convert the summary to an audio file.
The prompt was pure laughs and I didn't even mean to join the segments. Perfection, honestly.
I didn't save all the .npz for each segment though, a travesty.
That clip sounds like it came out of a horror movie.
had some similarly haunted. 🙈
using your self model ?
what self model?
voice cloning
ah. yes. tried to random generate while using a voice cloned npz
Sounds like a haunted doll with a built-in voice box.
My clip is a totally normal Bark random speaker, just came out perfect.
The laugh at the end after you think it's over, jesus
Here's another one B)
Today's podcast made in one go (just one inference for each podcast, not manually picked from hundreds):
https://soundcloud.com/jacktalk/sets/jacktalk-today-20230524
JackTalk is brought to you by ai.pictures. All content is generated with A.I.
probably not a good idea to go over 1 for the temperature. but it started so well...
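For anyone wondering why going over 1 gets weird: a minimal sketch of temperature sampling in general, assuming Bark follows the usual approach of dividing the logits by the temperature before the softmax. The function name and the example logits are made up for illustration, this is not Bark's own code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temp in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Below 1 the top token gets more of the probability, above 1 the distribution flattens out and unlikely tokens get picked more often, which fits a clip that starts well and then drifts.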
me after the lobotomy:
Super cool. Do you generate multiple segments of 10-ish seconds and merge them later ? The voices are still Bark-weird but you made an excellent use of this limitation, and that gives the podcast a certain deliberate crazy tone. Love it.
thanks! yes, the limit is 14 seconds, so it is a concatenation of many short clips
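The concatenation step can be sketched with only the standard library, assuming each short clip was already saved as a WAV file with the same sample rate, channel count and sample width. The function name `concat_wavs` and the file names in the usage note are hypothetical, this is not the poster's pipeline.

```python
import wave


def concat_wavs(paths, out_path):
    """Join WAV clips end to end; all clips must share the same format."""
    fmt = None
    chunks = []
    for path in paths:
        with wave.open(path, "rb") as clip:
            clip_fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
            if fmt is None:
                fmt = clip_fmt
            elif clip_fmt != fmt:
                raise ValueError(f"{path} has a different format: {clip_fmt} != {fmt}")
            chunks.append(clip.readframes(clip.getnframes()))
    if fmt is None:
        raise ValueError("no clips given")

    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for chunk in chunks:
            out.writeframes(chunk)
```

Usage would be something like `concat_wavs(["seg_000.wav", "seg_001.wav"], "podcast.wav")`, with the segments listed in playback order.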
Next I will try to make a live comedian with this :