#suno-showcase
holy
Did you filter out the brackets? She keeps replacing them with () instead.
BTW I think Bark could be really cool for a livestream that needs constantly changing voices. It could even create a new voice on the spot. That's a pretty unique capability. So a whole script, with new characters, and each gets a unique voice for each script.
Instead of single speaker.
(for clarity, I would cheat a bit, and make the random voices variants of already clear voices...)
This is really cool!! Is it like footage loops with some Wav2Lip?
It responds impressively fast, for Bark anyway.
That's me
30cm?
Oh no, nvm, I am same name
like the image I posted, probably before you
I tried to get her to say brackets for a while
I saw I tried the song
But yes it's super fast
A lot has happened in LLMs lately that I haven't followed
but for bark yes it must use the small ones
I tested a friend's 4090 and pretty sure even that didn't return 14s in 14s
on large models
But it wasn't pytorch nightly...
Is there a huge difference between the 3090 and 4090?
Wow still much more than I thought
Yeah I am still jealous.
But wait... in a very short time it does: Some diffusion image used as presentation, language processing, wav2lip and bark....
Must be multiple things running in parallel
the llm is probably an api
really just wav2lip and bark
or whatever the new hot wav2lip is
Yes it must be gpt3.5 it just did the "as a language model"
I'm still baffled by the speed of progress
it uses diffusion!
I was playing with an 'infinite live tv show' on Twitch. It used diffusion for the video, in a really hacky way. You generate a scene of the characters, which could be anything, like walking down a street in Paris. Then you use some extensions to split it up and make the characters look different, MultiDiffusion I think it was (multiple people in SD tend to blend, like merged faces, if you ask for say three different people normally). Then you rapidly generate 'variations' which, if you play them quickly, look like puppets talking. But it was so not realtime, like I would have had to have three 3090s.
I just ask it that and it told me "yes it's some of the tools I use"
At the time I had a burning motivation because "infinite Seinfeld" was blowing up and I thought it was just awful, and wasting the format. But when that died I didn't have the urgency.
Far from realtime but it got a 2x speedup a few days ago with the nvidia driver update!
I'm experimenting a lot with SD for 3D CG (some tests in the readme here https://github.com/melMass/MLOPs-stage)
One thing I was trying to get working was the depth mapping. You know how a sitcom is a multicamera thing? Like it's a single line of cameras. So all the angles are straight, a little left, and a little right. And you can depth map the output of SD, if you've seen that. 3d-in-painting extension or similar name. And then pan left and right, like a sitcom! But OMFG so slow
If somebody gives me 10 4090s. I could make a hell of a sitcom stream. Just throwing that out there.
Sounds intriguing I know 3d inpaint but I haven't tried much animation besides existing extensions for automatic
(deforum and temporal kit only actually)
I think it might be okay with deforum. I didn't explore it much.
But the images change more
At least back then; I know there have been new developments.
Yeah, now it supports ControlNet, which gives much more control
But you can basically 'pan left, pan right' it just (at the time) looks like a crazy dream image
I had worked so hard just trying to get the characters to talk like puppets, without their shirts changing colors constantly. I was like OH NO I can't throw that away.
It started well
Yeah ControlNet avoids that and more, but there isn't a workflow that wraps it all yet
Haha, actually many voices (or voices that sound like a podcast) seem to switch a lot. It's like modeling the changing of speakers in Bark I guess. Lots of random voices. The worst is cartoons. Try making a cartoon voice that doesn't change randomly sometimes. Also I swear Bark says that 'I Like pizza' in that style all the time, I know that! Weird. I actually never really checked: for speaker prompts that switch voices very consistently, do they ever switch to the same voice at a somewhat regular time, like the end of a sentence or something? So you could maybe use one speaker and just render out a conversation in Bark, with Bark doing both sides. You don't split it at all. Doesn't seem like it should work. But stranger things have.
Cool projects, followed
Ohhh I had seen your GauGAN experiment!!
Haha, I think I burned 10 years of GPU lifetime against the NVIDIA GauGan, they never limited it.
I literally put entire movies through it.
Frame by frame. I had this ridiculous browser based segmentation model using processing.js just thrown together, as a preprocessing before the main one.
That's another project that was weirdly limited. It was like 'draw a tree' but actually had 100+ other categories
My Discord Icon? That's a gaugan person. When the model has no people. But the category for people was in the GauGAN model, a little bit, accidentally. And it would render people that looked like this. It was enchanting and spooky. Delightful.
It even has a mouth!
Later NVIDIA blocked all those categories, a sad day
Pretty sure they eventually came back, but SD kind of made it feel old fashioned
I had no idea you could use it in the cloud back then! I showed your experiment around at the time quite a lot!
Testing ChatGPT with the input "Write a mid size monologue from an A.I that would imitate Joe Rogan's voice, make it light and fun" and the rogan voice (some long parts work really well):
Sounds pretty good but you need to run a speedup on that rogan. All the clones are too slow for me by default. Just run a bunch of long prompts and resave, try to find one that doesn't change voice, but talks faster.
Unless that's the goal, a robot Rogan?
I need to bring back mylo's implementation, I just made a basic CLI based on the notebook you shared earlier and the output of voices is deterministic there for some reason
I think it is deterministic. That's partly why I cut like random.
I even added noise and stuff at one point, just get different results, but I think I pulled it
Or at least the large model. Maybe small wasn't, the first tone.
I don't recall if it's always a match but I have gotten exact same clones, with same audio. But Bark Infinity spits out clones like a fountain so doesn't really matter, just every few seconds in the audio.
That's my secret. I'm just spamming out the clones. There's good in there somewhere
So you are spamming voice clones AND inference? Because the latter is non-deterministic too. I've had the same voice produce garbage or great results, so now I mostly focus on the latter
It's like, generate 10 clones, from each, render one sample. That's the default. The last clone, the one you get if you do 1, is often the best.
(and I think it likes 10 sec inputs better for cloning)
But then the cherry-picked one can infer badly too, right?
Yeah totally, it's really just go get a snack, come back, check the samples
Maybe some inferred well
Then check again
Yep I do that for the audio generation but not much for the voices (I tried initially), but I'm very much starting all this you have way more experience
My experience is mostly pre cloning, from tweaking voices. But weirdly it's still useful. Because the clones don't come out the oven quite right.
However I am still trying to figure out how to put some in Bark Infinity, as a feature that just works.
Mainly how to edit the speaker files
So for example, you can grab the rogan voice from seconds 12 to 18, or whatever. (choosing random numbers)
because you can hear that part is good
So you can snatch that rogan up
I learned today you CAN do this in Gradio. There is an audio tool to pick a section of audio.
So it should be doable
I don't know if that makes sense. But for that rogan clip, if you save all the npzs, there might be a section that just sounds really good. Well, you can carefully try to build a new npz of just that, by selecting the right spot.
It's almost so simple it's weird, right? Just, like, find the best rogan part, and make that a voice. But really it's that simple thinking that works in Bark
So today I learned I can make this possible in the UI
That would be awesome I was wondering how to edit/visualize it better to enhance it.
One idea I want to try was lerping good voices from batch on mylo's ui or yours that are both non deterministic
Yeah it's still a little awkward, but I'll try.
Like that bar, you can't see anything. But if you hit the play button it starts at that spot.
So it's trial and error but you can find the good spots
Audio editing is a really bad fit in Gradio but it's kind of what you need.
But you aren't editing the wav file, or not always. You edit the NPZ generation.
And maybe just semantic or coarse.
Though that's maybe too advanced and fancy.
I will just use the wav file as input, so I know where the user thinks the best voice is.
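A minimal sketch of the "grab seconds 12 to 18" idea above, assuming a Bark speaker file holds `semantic_prompt`, `coarse_prompt` and `fine_prompt` arrays at roughly 49.9 and 75 tokens per second. Those key names and rates, and the helper `trim_speaker_prompt`, are assumptions for illustration, not code from Bark Infinity; the demo arrays are synthetic.

```python
import numpy as np

# Assumed token rates for Bark speaker prompts (tokens per second).
SEMANTIC_RATE_HZ = 49.9
COARSE_RATE_HZ = 75.0  # fine tokens are assumed to share this rate


def trim_speaker_prompt(prompt, start_s, end_s):
    """Return a new prompt dict covering only start_s..end_s seconds."""
    if not 0 <= start_s < end_s:
        raise ValueError("need 0 <= start_s < end_s")
    sem = slice(int(round(start_s * SEMANTIC_RATE_HZ)),
                int(round(end_s * SEMANTIC_RATE_HZ)))
    cod = slice(int(round(start_s * COARSE_RATE_HZ)),
                int(round(end_s * COARSE_RATE_HZ)))
    return {
        "semantic_prompt": prompt["semantic_prompt"][sem],
        "coarse_prompt": prompt["coarse_prompt"][:, cod],
        "fine_prompt": prompt["fine_prompt"][:, cod],
    }


# Synthetic 20-second prompt standing in for a real saved .npz.
demo = {
    "semantic_prompt": np.arange(998),
    "coarse_prompt": np.zeros((2, 1500), dtype=np.int64),
    "fine_prompt": np.zeros((8, 1500), dtype=np.int64),
}
rogan_12_18 = trim_speaker_prompt(demo, 12.0, 18.0)
```

To try it on a real voice, load the file with `np.load`, trim, and write the result back with `np.savez`. Whether a trimmed prompt still clones well is something to check by ear.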
I may be getting more ambitious than my software dev skills. But it's really fun.
Just poking and fiddling with each voice.
I mean I got stuck today trying to fix some bad filename bug for 3 hours, so yeah, probably getting too ambitious. But it's a good idea.
bark infinity is super cool, I couldn't make 'batch' work even after ticking the warning, so I started the CLI to both clone and infer.
I will try to decipher the logic behind yours and mylo's clone to understand why mine is deterministic and I'll test further
For batch, can you describe what you were doing?
I just want to know if it was a bug, which it could be
The cursor was showing a red forbidden sign and I couldn't bump it up
(in the voice cloning tab)
the main one worked
What we are saying: 1 wav input -> multiple clones (npz)
Regarding your audio editing tool, you can also replace the player completely in html/js to fit better
I could use a tool if somebody wrote one, but myself, probably not do it. But yeah it could work and I did search a little didn't find anything. But maybe already exists somewhere?
Okay so by batch, you want upload one wav.
That is 10 seconds ish
and get different clones
By processing it over and over
Yep to then cherry pick the best!
(for now I'm doing this on the inference in my cli because my clone is deterministic)
I pulled that feature when it was deterministic, though for a while I had it running an audio process to make the wav slightly different.
Oh yours is too!
However, I'm not sure you are getting better results
No I'm saying the opposite 🤣
Inference -> non deterministic
Voice clones -> deterministic
Mylo's audio webui isn't
you give the same input wav and each generation creates a different size of npz
I think it is, but what happens is the gradio UI does some conversions sometimes
I tried to figure this out
and I'm pretty sure it is deterministic, if you give it exact same wav
But sometimes there's some audio conversion that adds rng to it
mine (I compared in numpy but they even are exactly the same size on disk) produce the same stuff
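A quick way to settle the "is cloning deterministic" question is to compare two clone files array by array rather than by file size. A minimal sketch, assuming the clones are ordinary `.npz` archives; `npz_identical` and `_fake_clone` are hypothetical helpers, and the in-memory archives below only stand in for real clone files.

```python
import io
import numpy as np


def npz_identical(a, b):
    """True if two .npz archives hold the same keys with identical arrays."""
    with np.load(a) as x, np.load(b) as y:
        if sorted(x.files) != sorted(y.files):
            return False
        return all(np.array_equal(x[k], y[k]) for k in x.files)


def _fake_clone(seed):
    # Stand-in for a real clone file; the key name is an assumption.
    buf = io.BytesIO()
    rng = np.random.default_rng(seed)
    np.savez(buf, semantic_prompt=rng.integers(0, 10_000, size=500))
    buf.seek(0)
    return buf


same = npz_identical(_fake_clone(0), _fake_clone(0))
different = npz_identical(_fake_clone(0), _fake_clone(1))
```

With real files, pass the two paths instead of the in-memory buffers. If the arrays differ between runs on the same wav, something upstream (for example an audio conversion step) is adding randomness.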
I am not sure if that is useful, have you found it helpful?
Maybe it's a bug then but try the same input and then save and reclick generate you'll see a different size
Like the same wav is really good a second time?
Haha, it's confusing.
I personally abandoned that path and only batch the inference (text to speech from the same npz)
and compare that
I think you can test the cli in your bark infinity env
In Mylo's UI, you can use a wav file two ways. One way is pure speaker. That puts a file in /clones_voices directory. That is deterministic.
The other way, the other field, is using the audio file as a prompt. So you get a crazy variety of voices. But it sometimes works. That is the speaker.npz file.
so if the name of the .npz is the name of the wav, then it's 100% the same I think. The other one is kind of ignoring part of the file, it's more for creative uses.
But actually sometimes makes a decent clone by luck
I'm trying to figure out how to make that stuff clear myself
but basically, if name of npz = name of wav, perfect clone. if not, it's more like a creative sample based on the audio
I would have to check, but when I traced the path to the methods I could not really understand all the slicing etc., and it's definitely more complicated than the one I used or the one from the notebook
But I was only referring to voice cloning
Which I think there is only one way no?
So in my UI it says like, "Use audio file instead of text prompt" (doesn't actually work right now)
Mylo has that feature, it's the lower audio box
IIRC
the top one is pure clone. Honestly I would need to double check
You know I can't quite remember on a late friday, just that directory, I think data\bark_custom_speakers is the pure clones
I'll check in a bit, just not at main computer kind of awkward
Take your time, just sharing as I think it can be useful, you can see the size offset I was mentioning, this is just pressing generate consecutively
Yeah I think that's fine. I think that does a perfect clone. But then it uses the audio file again, as a prompt.
I actually even opened an issue this morning about that: https://github.com/gitmylo/audio-webui/issues/7
So you should have two .npz. one in the dir, and one speaker.npz
It's roughly like cloning deterministically and then using the clone to generate.
That's basically the main process, in general
Why two? That's what I don't understand, it doesn't autosave but autoloads:
the one in the directory, that one is probably the same every time. But the other one is using a generation and saving again, so it's different every time. And sometimes way better.
because it's a real bark voice
well real bark audio sample
I don't really understand what you mean. There is no default npz
I will have to actually check. So you don't get a speaker.npz file?
just the one with the name?
You only get a speaker.npz file but they aren't autosave, just hyperlinked in gradio for you to save in data/bark_custom_speakers
I think you are doing it the right way. You are basically cloning, then generating. So it's different every time with good variety
Yep mylo's ui does it but I think it's a new thing since all the methods are called "new"
So it is deterministic, but it doesn't matter. Because it's way different from a generation.
It kind of just skips the deterministic part.
If you find some little workflow that works, let me know I haven't really tried much, just doing other cleanup
And for completeness this is the one generated with the cli:
I think basically it's just like using that .npz, with your text, and then saving.
Which is the so far tried and true method.
Hmm I think you are right now I understand what you meant
So you can use the CLI and just do the text yourself, and resave, should be same.
4kb
Right. That is just 'hello'
so it's nothing
You are cloning normally, same as CLI, same every time. Then you make clone speak. Then you save.
That is the best way.
Yep I did not get why you correlated the two... they are in this UI
So you can batch it now, just generate samples and save final .npz
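The "clone, make the clone speak, then save" loop above can be sketched like this. The two callables are injected so the loop does not depend on a specific Bark fork; `resave_candidates`, the file naming, and the stubs are all hypothetical, and the stubs exist only so the sketch runs without Bark installed.

```python
import tempfile
from pathlib import Path


def resave_candidates(generate, save_prompt, text, base_prompt, n, out_dir):
    """Render `text` n times from one clone and save each result as a new prompt.

    `generate(text, base_prompt)` must return (full_generation, audio);
    `save_prompt(path, full_generation)` must write a speaker file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for i in range(n):
        full_generation, audio = generate(text, base_prompt)
        path = out_dir / f"candidate_{i:02d}.npz"
        save_prompt(str(path), full_generation)
        results.append((path, audio))
    return results


# Stubs standing in for a real text-to-speech call and a real prompt writer.
saved = []
demo = resave_candidates(
    generate=lambda text, prompt: ({"text": text, "prompt": prompt}, [0.0]),
    save_prompt=lambda path, full: saved.append(path),
    text="hello",
    base_prompt="rogan.npz",
    n=3,
    out_dir=tempfile.mkdtemp(),
)
```

With upstream Bark, I believe `generate` could wrap `generate_audio(..., output_full=True)` and `save_prompt` could be `save_as_prompt` from `bark.api`, but treat both as assumptions to verify against the version you have installed. After the batch, listen to the samples and keep the `.npz` that goes with the best one.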
This is what I settled for and I'm getting good results
with the clone
but only using the same full voice clone from the cli, I don't edit the npz
like the 5 min of joe rogan was first try
Yeah that's a nice feature, I will try to make it
I think it IS in Bark Infinity
but just very crude
It is just chopping 5 seconds
always
first 5 or something
It's not a user feature
But if you generate in the cloning, it saves twice. And one is more in the earlier part of the clip
However if you just pick a number, that is very often a bad spot
You really need the user to listen
This is no?
Don't hit that button, I think
Just put a text prompt in
Like your rogan quote
Try setting repeat, honest to god I can't remember if that works
in the cloning
uncheck the 'just give me box' that is not really good. I mean it's weird. so try if you want. But it's slow
a failed experiment really
I should make this way less about chopping up. It's not great for short audio, like this case
I can't capture it, but what I meant earlier is that whether or not I tick "give me more clone", I can't edit the slider; I get a red forbidden sign as a cursor
Oh hmnn
I can check in 20 minutes,
Probably bugged
But if you can, maybe restart?
and don't check it or leave unchecked
No emergency! It's 4am here, I'll soon sleep
I just did now to make the screenshots but I had tried earlier
Just the 'Create an audio sample for each created clone at the end, using the Main Text Prompt'
and then just did the cli quickly
If you have a second rogan clip you can try the second audio thing
I'm only 80% sure it doesn't help. haha
Ahah yes I meant to ask about that! I'll try, but I think I understand better now what input yields better results
I added many things that didn't really help, but I wanted to try. And left in for now. Will replace with new things I try.
I mean they might help I honestly didn't have testing time. But if so, hit or miss.
The second audio sample uses that audio, as the prompt. Instead of your text that you type in.
It seemed like that might be better but so far, not
So it makes Rogan, or whoever, say the things in that audio
I'll be back in 15, at desktop, if you're still awake I can check whatever.
Oh nice feature so speech to speech?
Yeah
I started to try voice cloning from a French sample but the voice always drifts to English for some reason
Needs french training. Oh I sent someone a clip, so I can download from discord and post. Just funny to run long audio through as second sample, voice just roams all over.
So the og voice, gone quick, if you just keep playing because this is way too long. and it is replaying the audio samples with RNG voices that slowly morph.
To do it right you can't use a long clip but it's amusing
(for speech to speech, you could only really get the first 5 to 12 seconds at best. in this case just to try it, it keeps going, and the voices are completely lost from whatever the speaker was quickly)
Hey all, we're considering using suno for a content creation platform we're building.
Are there any restrictions to using it for a commercial platform that features TTS + voice cloning, and charges users?
( i can share more details in private, if needed )
cc: @lyric steeple
yes you are right, just wav2lip, plus stable diffusion and the chatgpt3.5 api
yes, it is just wav2lip
haha, the backend is just the chatgpt3.5 api, so i don't have much control over the output
wow, may i have the twitch link?
i found there is a link which can clone voice?
spoiler alert: both use the same cloning code, the main difference is:
gitmylo/bark-voice-cloning-HuBERT-quantizer: creates the clone file for use in bark.
serp-ai/bark-with-voice-clone: creates and uses the voice clone in bark.
if you want a webui instead:
https://github.com/gitmylo/audio-webui
or
https://github.com/JonathanFly/bark
if your cloned speaker was singing there's a high chance the generation will continue singing
scrolling through my temp, there are some cursed gens
first time hearing bark yawn (around middle)
for children's books
Did you try a lot? Feels dialed in just right.
first shot
Was it 'clone and do one sample' at least?
clone + prompt one thing, then try something else, this was that else
Nice, yeah it can be enough.
it's kind of funny, it does a good job and then keeps hallucinating, perhaps it should be used with a -0.5s voice connector
She performs for children. She's creative and improvises. Works for me.
Well, maybe not just randomly
But sometimes it sounds like it
no, it's a short prompt and a long voice
Oh yeah
Sounds like the intro to a song
I like my prompts short and my voices long
Though actually it's the opposite for me.
the best way to generate a voice with bark is to first have the voice say what you want to generate. Then use it as a prompt.
Except, if you already have the result, what are you asking bark to do? 🤷
It is a funny quirk
Though I thought my code was bugged, until I tried it on the Suno default speakers lol
Ask them to say their own prompts
lol, don't torture them
I was like, why is my second segment always hallucinating? But I was running the same script, on repeat
I found that the history prompt will speak the script originally fed in, probably because of its GPT-based nature
Oh, because there's no repetition penalty? I forget that used to not be standard.
So maybe the fix is just a repetition penalty? But
But I wonder if it works as with sounds...
it seems to have to be implemented in training stage, isn't it?
I'm pretty sure it's just code in the sampler, like you would implement top-k or whatever. But presumably if it was super easy it would be done already. Actually I just googled and it seems almost too easy to implement, I wonder if I am actually extra confused. This I could do. But why wouldn't it be in the nano GPT?
If it was easy, and effective. Presumably it would be generically in the fork. Whatever I am busy today and also not the person for that.
Someone who writes that code all day should get in there and try though. Maybe it is hard to do it right, as opposed to, at all. I just wait for huggingface to do all that kind of stuff.
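For reference, a sampler-side repetition penalty like the one discussed above can be sketched in a few lines. This is the common "divide positive logits, multiply negative ones" variant, not code from Bark or nanoGPT; `apply_repetition_penalty` is a hypothetical helper, and whether it helps with audio tokens is an open question.

```python
import numpy as np


def apply_repetition_penalty(logits, previous_tokens, penalty=1.2):
    """Make already-generated tokens less likely before sampling."""
    out = np.asarray(logits, dtype=np.float64).copy()
    seen = np.unique(np.asarray(previous_tokens, dtype=np.int64))
    if seen.size == 0:
        return out
    # Positive logits shrink; negative logits are pushed further down.
    out[seen] = np.where(out[seen] > 0, out[seen] / penalty, out[seen] * penalty)
    return out


logits = np.array([2.0, -1.0, 0.5, 3.0])
penalized = apply_repetition_penalty(logits, previous_tokens=[0, 1, 0], penalty=2.0)
```

The call would sit right before top-k and temperature in the sampling loop. No retraining is involved, which matches the "it's just code in the sampler" point.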
You are right, thanks Jonathan, your contribution to this new technology is important
I think I broke it. Anyone know why it does this? It completely avoided the text.
https://huggingface.co/spaces/suno/bark/discussions/94#6472a0e822016353ae3cec6f
With Bark it could happen even normally, but anything with [words] all over it has a high chance of failure. It's not really a 'feature', more something people discovered that sometimes works
I'd be interested to know if you get really good animal sounds actually.
big semantic + big coarse model does make a difference, as much as I don't like waiting for it
I can sacrifice coarse but not big semantic.
I thought that the differences were small but I just tested on the wrong samples. They can generate qualitatively different things in some cases.
some test of some more audio post-processing in my bark plugin.
Today's podcast, all inferred in one go, not picked from dozens:
https://soundcloud.com/jacktalk/sets/jacktalk-today-20230530
JackTalk is brought to you by ai.pictures. All content is generated with A.I.
first non voice clone just making sure gradio upload works, first sample, copy and pasted help gradio text as content. each segment in clip same exact speaker, same text, yet musical in wildly different ways. literally the first generation of the first clone i just tested. just throw any audio into the cloner people. (It's not an audio clip as the prompt, it's just a music clip as a voice clone, I just happened to copy and paste the closest nearby text on screen.) The music clip was the Deus Ex theme, which you can hear a sliver of.
Just love it! Can you walk me through the process? What prompt/voice are you using? Do you generate each segment in parallel and then concatenate the audio? You say there is no post-processing, so how do you ensure the sentences end properly? What specific tweaks is your program doing on the audio?
Hi guys, here is a dedicated blog post https://dev.to/adriens/agi-bark-smart-waitress-285h
Hopefully you'll like it
in the first week, i spent hours to pick a few from hundreds of generated samples, then stick with those few and test with variations, to see which one is good for the results
11
the applause is there, just can't quite surface it all the way. so close.
i found that bark voice (if good quality) sound similar , it is probably because of the training dataset
I've been using Bark for doing a radio show. It's pretty fun.
English Voice 3 sounds very much like Chris Morris
If anyone is interested
Very interesting and creative. Also weird. Overall I enjoyed listening to it. Did you create all sentences in a single generation, or trying your luck multiple times ? The global editing is made by hand, I take it
Yeah just a single generation using the long form scripts. I did the second to try and fix up the weirdness that happens sometimes, which sort of makes it confusing, but works in the context. A bit of editing at first, but then just gave in to the pace of the generations.
If you care about LAN transfer speed, or need to build a larger network, and aren't price-sensitive, the Asus TX-AX6000 better fits your needs. If you are price-sensitive, or want somewhat wider coverage, and also need support for several WiFi 6 technologies, then the TP-Link xdr6088 may suit you better
couldn't resist trying one batch of 30, overloading Bark with hints like the woman literally starts speaking with "In a world..." some with no announcer. Though Bark makes those generally sound like weekly news teaser videos. The text really super shapes the voice you get. This is just a random assortment of them.
if you want to learn how to make chatwithalice, please go here :
https://igg.me/at/chatwithalice
I built a virtual teacher on twitch :
https://www.twitch.tv/chatwithalice
I tell you how to make. | Check out 'Online course-build virtual channel 24x7 on twitch' on Indiegogo.
The Chinese generated voice is very strange. Prompt (translated from Chinese): [MAN]: Can a two-party system exist in the Soviet Union? [WOMAN][laughs] No, impossible, because we couldn't afford it.
Hello
Let me test how Chinese sounds.
I just tried the example prompt "♪ In the jungle, the mighty jungle, the lion barks tonight ♪" and...
what the heck-
it is not an easy one-shot task, as Suno's nature is different from other TTS
I see
right now it is only good at English, I tested several languages
Is every Chinese voice bad? If you can find even 2 or 3 good ones, I think you can find infinite good ones. Eventually...
Chinese is not that bad, it is already better than many TTS projects, just not as good as English. I think it is because the training dataset is not big enough, or maybe the Chinese dataset is not well-rounded enough
Are there any Chinese people here? Come say hi
Look up! ⬆️
I think we should just focus on English at this moment. I think Suno will make a next version in which other languages will be better. Just like ChatGPT, it is particularly good at English
Hi everyone. Can you point me to the best singing examples Bark generated ? And can Bark run on a Windows install without using the GPU (like for example SVC, RVC, Vlad diffusion) ?
Jonathan's webui is for bark, with lots of bark related features
My webui is for bark and any other audio related webuis, with less model specific features
merging them isn't really a good idea, as the webuis have different purposes
I'm frustratingly incapable of actually making the UI I want to make, technically. It would look completely different, like a node based sound laboratory where you draw lines and connect segments to create unique processes. Like a visual UI version of a Jupyter notebook. My only hope is I stumble across something almost like that I can fork lol
Some day I'm going to drink way too much coffee and try to rig up some crazy way to use https://wavesurfer-js.org/examples/#multitrack.js in gradio. But honestly I can't believe somebody hasn't done it already. It's a web page. Half the Stable Diffusion UI features are just hooking javascript into Gradio, already. Somebody who really specializes in that could probably do it quick.
Hey, guys! Want to ask, is it a right place to share my pet project utilising Bark?
yes, that would be great, as long as it doesn't have anything that breaks the community rules
Nice! Here it is https://castpod.live/
It's quite straightforward. You provide a text prompt about a particular topic, and Castpod generates all the elements you'd expect in an audio podcast: cover art, a theme song, a title, description, tags, as well as the podcast's characters, including their names, roles, avatars, and of course, the podcast script and the corresponding audio conversation.
The part which generates the audio output is powered by Bark. Now, there are some limitations in quality and size of the podcast, but sometimes, the results truly mimic authentic conversations with insightful viewpoints that I hadn't even considered before.
It was so fun to develop, thank you a lot for your work and tool you have shared!
For now, podcast generation is quite expensive. The average cost of generating one podcast is approximately 60 cents, a considerable amount! As a result, I implemented a paywall for podcast generation. I'm sharing a license key that comes with some generation credits. Simply enter the key on the podcast creation page, and you'll be all set to give it a try. 04B68F62-79B7-4337-B054-F9741D1A65CC
The music isn't Bark too is it? If it is you must have really worked for those. Soo hard to get that high quality.
Nope, theme songs are generated with https://github.com/riffusion/riffusion
Bark handles only audio conversation output
Very nice. I built a similar engine a while ago but used 11labs for the voices. It's totally unsustainable money-wise, and even though it's way cheaper with self generation, your comment shows we're not quite there yet. Maybe once SoundStorm is ported to open source: it's supposed to accelerate inference by an order of magnitude with their non-autoregressive solution. I think the podcast shows that Bark is not quite there yet also in terms of voice control and quality. It's a good experimental tool (and sometimes an artistic one) and probably a good project for research, but not really something that can be part of a product.
Yes, agree with that. For now, costs and quality are solid limitations for such a project. But the awesome thing here is that it is already able to solve the puzzle: even with super young solutions such as Bark and a pretty dumb GPT-3 compared to GPT-4, the proof of concept works and sometimes works ridiculously well. There is no doubt that as Bark undergoes further iterations and improvements, and with a GPT upgrade, it will reach the level of quality and cost that makes it possible to create products like that.
Indeed. My take was behind https://radio-hn.pages.dev and it was so fun to build that I toyed with the idea of productizing it (like recast), but the price of inference wouldn't allow it (yet)
Yeah, me, I just love building pet projects, and when I checked out Bark that was like the most obvious idea to play with. I wanted to make it open, but for sure there is no way now, too costly. But yeah, that was a lot of fun to build, it looks and feels like magic
!! NOISE WARNING !!
I hadn't seen this before, it goes from blast noise to speaking in a studio midway, maybe there's something interesting in there
{
_version: "0.0.1",
_hash_version: "0.0.2",
_type: "bark",
is_big_semantic_model: false,
is_big_coarse_model: false,
is_big_fine_model: true,
prompt: "♪ 広い宇宙の数ある一つ 青い地球の広い世界で 小さな恋の思いは届く 小さな島のあなたのもとへ",
language: null,
speaker_id: null,
hash: "c342707bb377c37533b46660842959ed",
history_prompt: "None",
history_prompt_npz: null,
history_hash: "6adf97f83acf6453d4a6a4b1070f3754",
text_temp: 0.7,
waveform_temp: 0.7,
date: "2023-06-08_20-59-08",
seed: "1542369585"
}
!! NOISE WARNING !!
As a very low effort youtube channel mostly for figuring out how to chain together different AI tools. I built a youtube channel that is almost entirely automated. Images from Stable Diffusion, script from gpt4, audio from Bark, and face from SadTalker https://www.youtube.com/channel/UChhN-FdST9UDux_Mu5JX4BQ
NEW AI JUST DROPPED! https://huggingface.co/spaces/Martinic/MusicGen made by Meta, a better, open source MusicLM, with melody transfer! (Mario piano)
Music might be dead.
I think they dropped the ball. They cleaned their data too well.
I mean it's awesome don't get me wrong
but it's killing me that the LLM is lobotomized by only seeing tags and that soulless description text on those stock photo sites
What do I know. Maybe the training would have fallen apart if there was more than such regular data. But from what I can tell they didn't try it.
However, holy cow, there are a lot of functions that look super interesting
Very interesting. Thanks for sharing. Prompt: An 80s driving pop song with heavy drums and synth pads in the background
I am a little jealous of the cool wav files
hello! today I installed Bark and MusicGen locally, and then I put this together https://on.soundcloud.com/RHhFh
Included musicgen in my webui alongside bark (tts-generation-webui)
i have been doing it in 2016, but no stable diffusion and gpt4 at that time, fyi :
https://www.youtube.com/@ai.picturesespanol2405
Using vocos on some noisy bark generations:
Hi my phone is running very fast
How was the inference speed, in comparison?
encodec and vocos are both really fast
from slowest to fastest it goes coarse > semantic > fine > vocos > encodec
but really 90% is coarse, 9% semantic, with the rest being small
This one sounds alright.
prompt for this one should be: "solfège, dou rei mie fah sol law sti doe"
then i reuse the npz to make this...: "[do][re][me][fa][so][la][ti]"...
We built a small demo. It's a webui built with Next.js for JavaScript/TypeScript developers: https://github.com/failfa-st/bark-web-ui
It is still very basic but we're attempting to add more features in the next weeks.
Another use in a radio show - was testing different settings to see the results.
Great work. That sounds amazing.
What prompt did you use with musicgen, if you dont mind my asking?
Hi team, can i ask how can i clone the voice?
thanks! it is fun to make music but days later I can't enjoy it that much.
I can't remember the prompt exactly, MusicGen has no way to recover the used prompt.
was playing around with words like :
simplistic, slow authentic drums, exotic percussion, subtle silent textures, old-school hiphop, drum break, simple beat, dj premier, dj revolution , non-harmonics, dry raw empty
Got very random results, so I made a lot of attempts. today it rendered this one. I used the large model between 10 to 20 seconds. then I used Audacity to edit the loop.
if you got a good gpu, you can use RVC (Retrieval based Voice Conversion WebUI).
I also tried with Bark but that was a bit too complicated.
Today I cloned Marilyn Monroe's voice, then I used Bark output to change it to her voice
making it do singing, used Bark Infinity Cloning, probably not as good as it can be, but my pc is slow, only 11gb vram ✨ used 1 sample of 30 seconds for the cloning
I'm making a new version of Marilyn Monroe using Bark Infinity Cloning, now just her spoken voice, no singing. I also used the optional second sample part in the webui and I got really amazing results.
these are using the prompt to the letter. I think it sounds just like her 1953 movies, and her voice has this natural asmr vibe and it is just amazing how it can reproduce it. Now only if it could stay consistent ... [whispering]
I have a silky smooth voice, and today I will tell you about the exercise regimen of the common sloth.
cherrypicking...
Nice. This is helpful. Going to try it out myself!
Haha nice!
@jolly fog would you mind if I DM you with some quick questions?
I've built an app that I'm not ready to make public yet, so want to keep it private.
using local version of sadtalker model here , much better result
yeah ok! I'm curious what app let me know!
@jolly fog thx for the link to Sadtalker
So much fun 🤣
Is it the opentalker one?
whatever is available through auto1111: https://github.com/OpenTalker/SadTalker/blob/main/docs/webui_extension.md
Thanks
what did you use here?
Hi! First one was with online demo of sadtalker, second one is with local sadtalker as extension of stable diffusion. That new one is using an updated model they just released, much more animated. But my gpu is 11gb and very slow at doing this.
did you enable eye blinking in the extension? (the UI does not have an option for that yet)
I was wondering if I can just modify the source files in the auto1111 webui extension.
no, but it is blinking , you are right if you notice it is animating pretty good, just a lucky render I think maybe from the pose style selection? I forgot what number I picked
or maybe your image is not realistic enough for recognizing the eyes, my other examples also have some eye movement
I see there is a dev version you can activate blinking, so even if mine is blinking already, maybe I didnt understand what you meant?
https://github.com/OpenTalker/SadTalker/discussions/386
all good, thanks, I thought you might have used a reference video for eye blinking.
seems better with a different cutout and trying different pose-styles
yeah nice! I think I heard anime style wont allow blinking eyes , but maybe the new coming update might change that.
how did you do the 50 seconds audio?
I used my custom node express server: https://github.com/failfa-st/express-bark
with a dev version of hyv: https://github.com/failfa-st/hyv
It allows setting a batch size (3 seems to be the limit on a 4090) so that means 3 in parallel. the rest is a queue.
I can generate endless long mp3 files this way.
I'll soon build it into this project too (still has the 13s limit): https://github.com/failfa-st/bark-web-ui
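A rough Python sketch of the batch-plus-queue idea described above (the linked server is Node/Express, so this is only an illustration, not that project's code). A pool of three workers runs generations in parallel while the remaining chunks wait their turn, and the results come back in input order so they can be stitched into one long file. `fake_generate` is a hypothetical placeholder, not a Bark call.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 3  # from the message above: about 3 parallel generations on a 4090


def fake_generate(chunk):
    """Hypothetical stand-in for one Bark generation of a short text chunk."""
    return f"<audio:{chunk}>"


def generate_long(chunks, generate=fake_generate, batch_size=BATCH_SIZE):
    """Run at most `batch_size` generations at once; the rest wait in the pool's queue."""
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        # map() preserves input order, so the pieces can be concatenated directly.
        return list(pool.map(generate, chunks))


print(generate_long(["part one", "part two", "part three", "part four"]))
```

Swapping `fake_generate` for a real generation function is left to the caller; whether three parallel jobs fit depends on the GPU and model sizes.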
yo, there's like 5 webuis at this point can yall work together ? lmao
I just want to spew feature ideas and have them appear, I am not a good UI programmer at implementing. I got like 5% of the way through my Bark list
We build our UIs with Next.js instead of Gradio. It's a different approach in general, making AI stuff more accessible to JavaScript developers
Wheres the new sadtalker model?
I have bark and sadtalker set up on my discord bot so you can do it all thru discord
Who wants to PROMPT a TTS model. Just let them talk!
I don't know if my comment made you look for that, but there is a tutorial video and an online demo, and I just noticed my version is much better animated than those old examples. But I may be confused about that, because their changelog doesn't mention any new models except a 512 model.
I love how the Bark speakers trip over their words, and then try again. So eerily human.
how long does it take to process?
Without face restoration it takes 2x length of audio (batchcount 4) on a 4090 with i9 13900ks and 128 ddr5
Face restoration takes 2-3 times as long)
that sounds reminiscent of that band that did the backing to some Japanese animation, I can't remember its name, Ronin? I remember Sturgill Simpson Ronin
What / where is sad talker
Also aye 🧿 randomly got one output that had music behind it ... it's amazing but can't reproduce it ... also: really really want to get these dudes to sing
Installed the latest one-click installer and it seems to be working consistently. Wondering if MusicGen is included in there and how it is even invoked: in the prompt with a tag maybe??
Yes, I believe mylos audio webui and mine (tts-generation-webui) include bark alongside musicgen
Someone said that they have used them together but I don't know how.
this is sadtalker (see link below, it generates talking heads for audio). It has a web-ui and an auto1111 extension (both work very similar but the auto1111 extension makes it easier to integrate generated images)
I tried using sadtalker from the commandline but there are some bugs
I tried using sadtalker through the API provided by gradio but there are several bugs
It works with any image + audio (they do not have to be generated)
I hope this helps.
This one includes several tts ttm models: https://github.com/rsxdalv/tts-generation-webui
It is nice to compare (hopefully soon combine) them. I haven't had any issues so far.
It allows voice cloning, suno.npz improvement (over vocos), bark, tortoise and MusicGen.
I wanna say: Amazeballz
Was playing around and stitched a few generations together to create a guided meditation (script by gpt-3.5-turbo). I tried using 2 different English voices, and tried one in Portuguese (I don't speak Portuguese. I was just seeing if it would work and it seems to do alright!)
It's interesting to me that the speakers sometimes deviate from the original script, or make up words/sentences (sometimes they are semantically similar, which is even more interesting).
Another thing I noticed was the length of time of each meditation differs by a large margin, even when given (at least for the English speakers) the exact same script. Each speaker has their own style and such for reading and incorporating pauses.
From D&D night
Voice processing by Suno, followed up with SadTalker....
Man... how realistic... Could you share the npz file for this ?
taco taco song
I will soon, but the Obama voice specifically is one I'm holding back just a bit because I cloned it not using the cloner. Instead I cloned it 'by ear' using a truly absurd and ridiculous manual cloning process that I was very surprised and fascinated actually worked. I really want to express just how surprised I was when Obama actually emerged out of it.
So one day when I get time, I am going to do a fun YouTube video showing it off. It's horribly inefficient (it took 10+ hours of tweaking to make a really good voice clone by hand, like this, though I could do a bad one in about 2 hours) but the concept is just really interesting. Basically it's just iterative Bark semantic prompts only -- literally no audio reference. And then some manual merging/blending/tweaking to keep it on the rails. And lots of me personally listening with my ears and adjusting. I used my ears to set up an iterative process, then slowly ran Bark over and over again, with lots of me just deciding that 'this voice has the right semantics, but it's missing this aspect from this voice, so if I made 5 new versions with a bit of each...' stuff like that.
Some of the manual tweaking is actually still useful for normal voice clones and I am porting it to my fork, eventually. Voice blending or model merging, ways to render a voice in increasing or decreasing intensity, or to soften it, to dial it in. Stuff like that. But it's hard to code it in a UI for Gradio, so not too much of it. Also with the voice cloner, probably just making a better normal voice clone is 99 times out of 100 better use of anyone's time.
Nobody sane should voice clone like that, but the fact that it is possible in Bark is remarkable.
All that said, I think there is likely some use of the techniques for 'dialing in' normal voice clones too. However all of my methods rely on my personal hearing and judgment of voices (typically I rank a bunch in a folder, as a first step) and that doesn't scale, so would have to be automated.
But also, maybe just fine-tune all of Bark on one voice, and never do any of that stuff...
The TLDR is that Bark lets you do whatever you want, almost. Just open up an .npz file and cut and paste, remove every third token, mash two voices together, insert weird patterns, ban all the tokens in one voice from another. Literally do anything, and usually it still sounds pretty natural! You do get a ton of 'broken' voices with weird stutters that get stuck halfway through reading your text, but Bark makes even the broken voices sound like a real person with a speech impediment!
Obama with speech problems or weird accents.
I mean listen to this. Poor Trevor Noah is given a horrible stutter. But in Bark, he KEEPS TRYING TO SPEAK THE TEXT. He doesn't give up! How cool is that right? It's more like a personality than a voice. The prompt is just the regular sentence, not repeats.
Bark not reading your text is the worst but also maybe actually the best thing about it.
I had it spit out this randomly ignoring the text and it has me briefly question what I was doing lol
(bunch of silence at the start but it's worth it for the ending)
Hi
MusicGen + bark + a bit of audacity and you can have a 24/7 news radio
@blissful pulsar edited it a bit 

Try
[music]
My words here
[music]
[music][music][music]
My words Here
[music][music][music]
It's super hit or miss, but if a voice DOES work, it often works again
So save your .npz outputs. And then try reusing the ones that work.
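The `[music]` wrapping suggested above can be produced with a small helper, shown here as a sketch. The function name `wrap_with_music` is made up for illustration, and the tag only nudges Bark: as noted, results are hit or miss.

```python
def wrap_with_music(text, repeats=1):
    """Put `repeats` copies of the [music] tag on the lines before and after the text."""
    tag = "[music]" * repeats
    return f"{tag}\n{text}\n{tag}"


print(wrap_with_music("My words here"))
print(wrap_with_music("My words here", repeats=3))
```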
(Maybe worth a repost) Google trained SoundStorm on 100,000 hours of dialog in part so they could have two person conversation prompts. The same text prompts - character for character identical - very often just work in Bark right now, out of the box.
(Well... Bark loves to generate a dialog of a person talking to themselves but I think you can fairly consider even those samples a dialog, it's performed like a conversation.)
Something really funny happened to me this morning. | Oh wow, what? | Well, uh I woke up as usual. | Uhhuh | Went downstairs to have uh breakfast. | Yeah | Started eating. Then uh 10 minutes later I realized it was the middle of the night. | Oh no way, that's so funny!
mixing the full and small models helps reach fast generation speeds without the quality dropping much
I've really got to spend some time wrapping my head around npzs, and voice generation
Like it seems to me like it's random every time ??
But somehow feel like that cannot be
It's a checkbox called 'save every NPZ' in my fork. Bottom right area. If you check that box, next to your output .wav or .mp4 files, you will always have new .npz files. These files are the voice of the audio file you just made. (If you used a random voice, it should save the .npz next to the .wav by default.)
Then later you want to make new audio that sounds like that cool .wav sample? So you go to the menu, something like "pick an npz file from your filesystem as the speaker", and you can pick one of those .npz files, the one with the same name as the .wav file. If you decide you like the voice a lot you can put the .npz file into a directory so it shows up in the list of choices, alongside en_speaker_03.npz etc.
For music and more unusual things there's a decent chance the .npz file doesn't reproduce the same effect. But for pure voices, it should.
https://www.youtube.com/channel/UCGimnCFFH_5AyDpGxd71kqw (suno for voiceovers)
Welcome to Comedy, Code, & Pixels โ where dark humor and artificial intelligence collide in a spectacle of pixel wizardry. Step into an unparalleled domain hosted by Zane, the enigmatic Goth, and Oliver, the sharp-witted Englishman. With their elusive charm and sarcastic repartee, they lure you into a world where codes dance to uncanny tunes, an...
(╯°□°)╯︵ ┻━┻
Goddamn! Thanks
Join us in this captivating journey as we unravel the mysteries behind Banksy, the renowned street artist. From his humble beginnings in the underground scene of Bristol to his groundbreaking artistic style, we delve into the impact of his art on the public and the ongoing enigma surrounding his true identity. Discover how Banksy has made a mark...
Sounds like someone slowly reading from a script for the first time
Not even that
He sounds like he's struggling to read
And has no energy!
At least the voiceover is consistent in this video
Join us as we embark on a captivating journey into the world of HAM radio, exploring its significance, origin, emergency communication role, licensing process, diverse frequencies and modes, global reach, sense of community, inclusivity, and technological advancements. Discover the fascinating world of HAM radio and its timeless appeal.
BWAHAHAHAHAJAJAJSJDJDJDO
"Ladies and gentlemen. Gather, round, for another, riveting episode, of our illustrious channel"
FWUHUHUHUHHU
He sounds so slow
And bored
We can make more energetic & stable voices for this though?
I want every day in the show I make to sound like Looney Tunes
What's your manual merging/blending/tweaking?
Code / commands?
oh damn
i found out
that files of the same prompt that take less time to produce have less background noise (i think)
a noisy file took 8 minutes
a good file took just over 1 min (1 min 8 seconds) + waiting time to write files
here they are
with the npz
i assume the npz is the speaker file that i can select
gonna try doing that next
i can write code to automatically quit the process of voice generation if it takes too long so that i don't waste my day (in theory) perhaps
i need to test this some more times to be sure
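A minimal sketch of the "quit if it takes too long" idea, assuming the generation lives in its own script that can be launched as a child process. The script name `generate.py` in the usage note is hypothetical, and the link between long run time and noisy output is only the observation above, not something verified here.

```python
import subprocess
import sys


def run_with_time_limit(script_args, limit_seconds):
    """Run a Python script in a child process and kill it if it exceeds the limit.

    Returns True if the script finished in time with exit code 0, else False.
    """
    try:
        result = subprocess.run([sys.executable, *script_args], timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing keeps running.
        return False
    return result.returncode == 0
```

Usage would look like `run_with_time_limit(["generate.py"], 120)`, retrying with a new seed whenever it returns False.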
damn wrong place to post
*wrong channel
anyway the prompt was """ fine fools flower fight frame """
It's not automated like that. One reason I didn't automate it or build it into a command (yet) is because it's more or less me going "let's try using this many tokens of this, and then doing this.... nope that didn't work. let's try a little more..." instead of a fixed process.
why does this work but not this?
audio_array = generate_audio(text_prompt, history_prompt="v2/en_speaker_1")
this works
audio_array = generate_audio(text_prompt, history_prompt="v2/bark_generation5")
this doesn't
and i put the npz in the same folder
the file paths can be a bit fiddly. just try using the full path, like the whole directory, including '.npz'
ok
i forget exactly what works, but it does work,
also check what your current python process thinks is the current directory. maybe it's not what you thought
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
ded
i just copied and pasted the path
hmmm what are you on, windows, google colab?
windows
there's not like a weird character in the name? or emoji?
no but for some reason the copied path uses slashes in the other direction
\ instead of /
with your voice
alright i replaced all the \ with /
it didn't give the same error
now i wait for it to generate
yay
ngl that was odd lmao
i'm guessing that's what the truncated \UXXXXXXXX escape means
maybe it's for security idk
filenames and paths and differences between windows, linux, whatever - it's a cause of TONS of headaches
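For anyone hitting the same error: in a normal Python string literal, `\U` starts an eight-digit unicode escape, which is why a pasted `C:\Users\...` path fails to parse. It is a string-syntax rule, not a security feature. A small sketch of two ways around it, using a made-up path:

```python
from pathlib import PureWindowsPath

# A raw string (r"...") keeps backslashes literal, so "\U" is no longer an escape.
pasted = r"C:\Users\me\bark\bark_generation5.npz"  # hypothetical path

# as_posix() rewrites the separators as forward slashes, which Windows also accepts.
safe = PureWindowsPath(pasted).as_posix()
print(safe)
```

Either `pasted` or `safe` can then be passed as the `history_prompt` value, following the full-path advice above.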
GODDAMNNNNN
IT'S GOOD.
8 minute and a few seconds for this 13 second audio file.
if this voice was randomly generated, does that mean it was never heard before? or is it selected from existing speakers?
(it's open source and therefore commercial use anyway right)
I can probably keep using this voice again and again, or try to change the emotions and then generate it repeatedly until i get it to speak the way i want it to, correct?
CPU only
if you used no prompts, it's technically random. that said it's not impossible for a 'random' voice to be really close to an existing voice, the most common one I get is the twitch streamer robot voice. it's got to be maybe the most common voice on youtube or something
Lol
Ok
it really is THAT VOICE
I guess 1) if the person who sounds like it hears it, they can tell me to replace it 2) I can put post-processing on it using a DAW to make it sound higher or lower or change the formant to attempt to obscure it more
And lastly, it is fairly common for two people to sound SIMILAR (not exactly the same)
I think I'll keep this one because I want it and it sounds excellent and no one said no (screw copyrighting fears)
Thanks! I truly appreciate all the feedback. We're making improvements every day.
Maybe possible in post, but I don't know of any way to tell the model that. 🤷‍♂️
I did this video today with those samples and hand animated vector art
PS who can guide me to an AI that can take the audio I've generated and make it sound like a clean recording?
I wish for this AI to be available offline, usable on the laptop (Windows 11 PC) I used to make these audio samples, and open source / commercial use.
I'll research this myself as well, will post here if I find something good
Check out https://github.com/facebookresearch/denoiser , have used this with some success
if you dont mind sharing, is this all scripted or do you do some work in a video editor?
Thanks
Everything is scripted. The only human input is really the topic. Video is assembled using FFmpeg
I couldnt get this working due to some old dependency conflicts. Ill have to try in a more isolated environment.
I got some improvements in audio using -filter:a "dynaudnorm" in FFmpeg
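A small sketch of how that filter could be applied to one file from a Python script. It only builds the FFmpeg argument list; the file names are placeholders, and actually running it needs `ffmpeg` on the PATH.

```python
import shlex


def dynaudnorm_command(src, dst):
    """Build an FFmpeg command that applies the dynaudnorm audio filter to one file."""
    return ["ffmpeg", "-i", src, "-filter:a", "dynaudnorm", dst]


cmd = dynaudnorm_command("bark_out.wav", "bark_out_norm.wav")  # hypothetical file names
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.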
Is it possible to use this on specific files instead of the entire system?
Should be able to do this programmatically within Python
Yeah the whole thing but do you mean I have to rewrite the module
I'm looking for quick fixes although if I spend time I probably can
Like ideally there should be existing functions
I'll check it out.
It looks like a lot of the audio reading functions are for getting the dataset file locations (audio.py & denoiser.py)
More time
This won't upgrade speech audio quality
It just denoises background audio
This is what I was trying to find (an offline version)
For the files used in this video (it's okay but it could sound cleaner)
Needs an AI
^ this problem (called "Speech Super-resolution") AFAIK is still pretty difficult, not many good models out there
I'll go try and find something and will post when I do
Done finally, Its a poem I wrote a while ago and I wanted to hear it sung
Nice. I should experiment more with the singing..
I wonder if you could get a longer term melody out of that voice if you made a bunch of variants, used in sequence. It's almost sort of got some coherence, way more than typical Bark singers.
I think Bark can maybe do music if it's singing only, or maybe singing + one single instrument that is used a bit sparingly. More than that and it seems to not quite hold it together.
I am using the BARK TTS WEB UI and it seems it is speaking whatever it is feeling like instead of my script!
Book reading using the AI
hey can you please share the prompt
The source? Its a part of the prologue for the book A game of thrones in the ASOIAF series
Hey @teal grove
I found this https://github.com/Rikorose/DeepFilterNet seems to work really well
Thanks
It's just a denoiser or can it also filter out or correct the rough parts of audio?
Here's the latest short vid I did
I like the music in this case, but it would be cool to use something that can separate the music and give a good result
I'll go ahead and test this out anyways
not that I am aware of
Is that a random voice? Pretty unique vibe
I wonder if that voice will be more likely to hallucinate with normal text prompts... Lol
No, that's a fine-tuned voice. I guess I don't need elevenlabs anymore... Now if only I had time to prepare a proper USLT vietnamese dataset, but ALL OF THE TVB MOVIES FROM THE 2000S HAVE BEEN REPLACED WITH THE NORTHERN ACCENT DUB!!!! *glares at the vietnamese community (HOW DID YOU GUYS LET THAT HAPPEN?!! AND WHY?!!!!! )
Ahh, still super cool voice. But I was like, damn, if that came out of prompting a made up childhood rhyme and Bark just came up with that, that'd probably be the most impressive text prompted Bark voice I'd seen.
It might work with well known childhood rhymes... though getting many children out of bark random voices is like pulling teeth
And no it hallucinates in general, but only at the end of the prompt and even setting min eos has no effect, but whatever, it's after the prompt so cropping it off is easy enough.
And oh yes, it's the best voice I have in any voice AI.
Anyways, childhood rhyme, eh? https://kingkiller.fandom.com/wiki/Lackless_poem
You should hear him in RVC, though 😱
Also wth there's a boy version of that poem??? I never saw that in the book?!!!
Bark is still the model that is choosing which syllables to stress, how to pronounce things. Unless you fine-tuned in just rhymes?
Actually I'll try that in base vanilla bark, curious
Do you mean it's even better?
No, the variance is alright. I just fine tuned it on speaking. But the source speaker is very expressive, so it ends up being a premium dataset. There's little gems like this in the dataset:
And yes he is legendary in RVC
His voice is the reason I halted training all other voices to desperately probe the secrets to his success
Do you happen to remember if RVC stressed the childhood rhyme as well?
No, RVC just flawlessly voice converts source audio, be it singing, emotions (to some extent) and others:
That's inference from an anime character
Right, yeah, so I guess it's about whatever you happen to use for the TTS part, before Bark. If that had the same childhood rhyme speech pattern, as well as those samples. Just curious how unique Bark is.
The dataset doesn't speak in rhymes, the voice is just very expressive
Bark kind of figured out the rhyme on its own and that actually took a while. The first few fine-tunes got the pacing wrong, and also if you put in too many verses, it gets the pacing wrong and ignores half the prompt. Too short and it sounds weird, so no typing it in verse by verse
(first few fine-tunes didn't even bother pronouncing the word "seven" or the s sound at the end of "things". )
For pacing you will eventually be able to control in inference, though I'm not 100% sure the same things I tested work in a fine tuned model, I guess.
I haven't tried myself but the fine-tune is just a diff of the Bark weights, not the full 6 gigs for text_2.pt or whatnot, I think? So I guess you could even just use different versions of the voice
Well here's the same inference on normal speech, so I think bark can read the context to some degree
The end of that is classic Bark weirdness. It's like the perfect weird horror movie sound model. So many times late at night I'm like WOAH.
And this version has trouble speaking german...
Especially because it's usually just like that, after a big sound gap, so you get surprised lol
Actually Bark not too bad, like sample 3, it is understanding the text as a rhyme and the voice
Oh yeah little gems like this one which for native japanese speakers I'm sure must be gold
Yeah, I saw alot of potential in bark back then despite the horrible voice quality. I don't usually go messing around with heavily beta stuff. MusicGEN for example is absolute garbage. But bark.. now bark is actually good.
You mostly get music, actually, because of the lack of punctuation. But the non-music is way more children than Bark typically gives, so the childhood rhyme is being somewhat understood
Genuinely seems like an especially good text prompt, tons of great unusual text voices. Compliments to Rothfuss, maybe.
I actually wasn't expecting it to work that well. Elevenlabs kind of failed at the Alan Wake poems, the "For he did not know, that beyond the lake he called home, lies a deeper, darker ocean green, where waves are both wilder and more serene. To its ports I've been, to its ports... I've been!"
Yeah I'm impressed, it does for sure generate many samples in the proper childhood rhyme speaking pattern and intonation, just out of the box, and using this fictional rhyme so it's just matching the concept of the text, not the rhyme against a known one.
Got a classic Bark 'local TV news broadcaster doing a promo for upcoming news segment' voice. Even in this rhyme, can't dodge the news voice lol.
I don't have this in a public fork yet, it's pretty fiddly, but you can penalize quieter tokens in the generation code and make Bark amusingly fast.
Bark is just so good, I think that sample even sounds like the speaker is out of breath!
God damn
that quick breath in the middle is awesome
I know right? It's like truly modeling a person trying really hard to speak fast!
Woah and that sounds like rapping/rap battles. Hmmm...
Yeah I think there's ton of potential that doesn't need fine-tuning or loras, anything, just nudging the sampler a bit and Bark is so good it usually makes things sound good.
Well I REALLY hope finetuning is sufficient to add another language, or a really hacky solution is to remap chinese characters to Hán Việt (https://en.wikipedia.org/wiki/Sino-Vietnamese_vocabulary)
But that would be REALLY inconvenient
Sino-Vietnamese vocabulary (Vietnamese: từ Hán Việt, Chữ Hán: 詞漢越, literally 'Chinese-Vietnamese words') is a layer of about 3,000 monosyllabic morphemes of the Vietnamese language borrowed from Literary Chinese with consistent pronunciations based on "Annamese" Middle Chinese. Compounds using these morphemes are used extensively in cultural and...
Are you channeling Young Sheldon Cooper?
created a script to take long monologues and export them. Just need to add multi threading now, and fix some of the parsing
I didn't realize there was already a tutorial on this. I made it in java.
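For anyone wanting the same thing in Python, here is a minimal sketch of the splitting step only: it groups sentences into chunks small enough for one generation each. The 30-word budget is an assumption to tune against the roughly 13 second limit mentioned earlier, not a measured value.

```python
import re


def split_monologue(text, max_words=30):
    """Group sentences into chunks of at most `max_words` words each.

    A single sentence longer than the limit is kept whole rather than cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue  # empty input yields no chunks
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_monologue("One two three. Four five six. Seven eight.", max_words=6))
```

Each chunk can then be generated separately and the audio concatenated in order.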
Listening to No-game No-life light novels. I think with RVC and bark it would be really cool audiobook
pretty nice narrator and slightly longer generation
ooh I like that
Japanese bark is all over the place
yeah, we really need to find some good presets there - I think it can be good with good presets but has huge variation
Presets for narration or Japanese?
Can I ask for a quick test, can the bigger models properly spell this: 中華物 (chuukabutsu)
I tried several variations and it always chose the wrong spelling/reading (chuukamono)
by the way, phonetically I've heard that it's able to produce good output, but the phonemes chosen are not always correct
interestingly enough Google Translate chooses the other reading as well
(I am out of my depth here)
if it were English, it's like using a Germanic pronunciation for a Latin origin word?
sometimes Mono is correct, sometimes butsu
aaaaaaaaaaa
sorry I'm just happy w
the first one seems to mimic one of popular Japanese TTS's that's probably why the quality is such
can those models use the small model's npzs? I could give you a better one for history_prompt
ya
might take a little effort but doable
also if you have any slightly longer prompts (3sentences or so) would love to try those
ok I'll find some to choose from
one more decent one as far as i can tell (still grainy)
ใใใ
it sounds like a news special from a tv report
This is just a randomly generated paragraph
ๅฐ็ใฎๆฐๅๅคๅใซ้ขใใๆฐใใ็ ็ฉถใ็บ่กจใใใพใใใ็ ็ฉถ่
ใใกใฏใๅ็ๅฏ่ฝใจใใซใฎใผใฎๅฉ็จใๆฅ้ใซๅขใใฆใใใใจใซใใใๆธฉๅฎคๅนๆใฌในใฎๆๅบ้ใๆธๅฐใใฆใใใใจใๆใใใซใใพใใใๅคช้ฝๅ
ใ้ขจๅใชใฉใฎใฏใชใผใณใชใจใใซใฎใผๆบใฎๅฉ็จใ้ฒใใงใใใใใๅฐ็ๆธฉๆๅใฎๆๅถใซๅคงใใซๅฏไธใใฆใใพใใใใใฏ็ด ๆดใใใใใฅใผในใงใ๏ผ
Also, for some reason bark likes to generate many Japanese voices with foreign accents
yeah, we have the same problem with Chinese (and other languages too)
here's a random one - I think we have a little work to do to make npzs work so will report back
This one has clearer audio but it jumbles words:
Hmm, what if I gave you a recipe for generating a good "seed" voice?
btw with google and chuukabutsu - it's funny because they write chuukamono but they generate chuukabutsu in the audio
i see!
yeah could be worth a try
{
"_version": "0.0.1",
"_hash_version": "0.0.2",
"_type": "bark",
"is_big_semantic_model": true,
"is_big_coarse_model": true,
"is_big_fine_model": false,
"prompt": "ๅใใฆไผใฃใๆฅใใ ๅใฎๅฟใฎๅ
จใฆใๅฅชใฃใ ใฉใใๅใ็ฉบๆฐใ็บใๅใฏ ๅฏใใ็ฎใใใฆใใใ ",
"language": null,
"speaker_id": null,
"hash": "ba221be9420a7791e8dc6ec5f175ca12",
"history_prompt": "None",
"history_prompt_npz": null,
"history_hash": "6adf97f83acf6453d4a6a4b1070f3754",
"text_temp": 0.6,
"waveform_temp": 0.8,
"date": "2023-06-10_19-03-45",
"seed": "332186546"
}
awesome
here's what the above json sounds like for reference
it has an unnecessary bgm but it's voice actor level of diction
another one from the same 'family'
will throw a few more random ones over, gonna need to write some code to try the other ones
(no idea if these are good)
it has good parts but sounds like a record that's stuck in other places
Oh man I had to
they mash up words and leave them out, but there's very little noise - they could be piped into a following generation and the results might be good
interesting! OK we have some work to do it seems
sometimes stuffing words down bark's.. prompt causes it to snap back into reality
this could come in handy
does anyone know if I can add my own voice and have the AI generate audio with it?
😱
I will try that out later. I spent way more time on this than I'd like to admit today haha
ehhh i have to try it. Ill do some reading to see how to add this in
test
/bark
u can do in #๐ถโbark-beta
/bark
so I tried this, but its not working. I renamed it, and changed generation.py to make the index higher
is it v1 or v2?
hello
It's a history prompt so it doesn't have a version, it's supposed to be loaded as a dictionary and passed as history_prompt parameter
Here's Bark + Demucs:
Original (Bark):
then, separating vocals using Demucs:
- Running vocos @ 3.0kbps on the isolated voice:
this is awesome!
Hi, how are you?
This is Fernando speaking. I'm the specialist in GOTA MAX treatment, and I'll be helping you with this service call.
long generation with my Plugin.
any possibility for elevenlabs quality tts voice?
it's like these voices are passed through a filter
and they end up sounding like they have some slight (graininess/machinelike/synthlike) quality which is (at the moment) difficult to hide
yeah we are working on that for the next version. clean history prompts defo help but ultimately it's limited by the codec
generally the variation is a feature since it can do arbitrary audio but it needs to be controllable to remove it for TTS use cases
how do these two audio files compare to your ear?
SPOILERS do not read until you tell me your first impression ||(I don't care about the true quality in this instance, I'm fine with there being an illusion of quality)||
it can be harder to tell the difference if you hear the same thing over and over and get used to the sound (potentially)
the cleaned one sounds a bit richer, but not necessarily less noise
1) richer how & 2) by noise, do you mean machine-like?
you say it sounds a tad ... better? (i hope)
spoiler || it's an EQ + distortion to hopefully make the sound sample sound less tinny ||
|| it actually sounds awful with distortion only ||
actually nvm after a short break i can tell it's the same sample quality
eh well
i tried cleaning a file that sounds like this:
i removed the noise profile in the file ending with _2 from the file ending with _4 to produce the file ending with _3
in fl studio
tell me how good this denoiser is
i think it can work, i'd just need to isolate the worst sounding parts of the audio by hand and obsess over it for a while to hopefully get a better result
it's come out muffled because the noise profile was in the higher frequencies
(used Edison)
these last two, after applying pitch and formant effects, it's like there's background noise
and idk exactly what _7 is
lol
is suno bark just actually cutting up audio and stitching it together?
in creative ways
like sometimes it will say things that aren't in the prompt that i assume are from the training material
or are just made from some sort of manner of processing like the music is
part of this prompt is me trying to make the voice sound angry (history prompts)
^ the prompt i used to generate the speech
perhaps it is impossible to have a clean take because after using distortion what sounds totally clean actually contains faint noise that might say something else or have other noises/sounds, with the loudest sounds being closer to the prompt and the faintest being furthest (usually).
idk actually because sometimes it loudly says something else
perhaps this is the reason why the prompts degrade with each iteration?
(or so I've heard)
I guess there isn't an AI of any kind that is available to the public that sounds perfectly natural in speech.
The Selfie Song - Made with Suno (with postfx)
I used the Vocos vocoder in this example (#suno-showcase message), with no manual editing afterwards.
It's added in my Whispering Tiger Plugin. Others might have added Vocos to their projects as well, not sure which ones though.
tts-generation-webui has vocos from npz, so you can run it on past generations. Vocos can be applied on wav (incl. mp3 etc) as well
Amazing!
still slightly you-know-ish but yeah
where do i find & install this plugin
will try this next
how do i fix this
Oh my, I am a mug, my old scripts have the same error. Oh well.
But they still work though
Nvm I'll try reinstalling / installing pysoundfile 🥱
THANK YOU FOR TELLING ME
(btw the text prompt was the test script from the link)
the audio quality in these samples is good enough to not cause annoying glitch sounds when being processed by the pitch shifter plugin in FL Studio
(so far)
I'm gonna try accessing my old voice preset tomorrow and seeing how much better it sounds with vocos
messing with tempo produces great results too
like this for instance
Planets Planets.
whatup fly
hav u tried this https://huggingface.co/spaces/lj1995/vocal2guitar
seems cool
That's RVC
i havent tried RVC yet
i cant wait until we get stereo effects. imagine the panning effects AI will be able to do
Is the title the prompt?
I'd want to make the sung parts without background music and then make my own tracks. I'll see how that goes
using suno to add vocals to my tracks in ableton
Neat
🎵Aha aha aha aha, doo doo doo doo, yeah yeah yeah yeah, ooo ooo ooo ooo.🎵 was the full prompt. if you put less you won't get a full 14 seconds
sounds great. i was lookin for a vst that matches vocals to a beat but the best i can find is if you already have 2 closely matching waveforms
and obviously suno output isn't going to match an already existing waveform
put a ton of effects on to make it sound better generally. using ovox by waves or maybe antares could tune the vocals
There is an option to force maximum length no matter what, so you always get 14s, which is quite useful for non speaking prompts like this.
oh im still using the OG version
yeah sounds great. im just too lazy to import into a sampler or match up the vocals to the beat properly manually. hopefully AI voice can stay on beat soon
RVC seems cool
It's in the original, but in the lower level functions, set allow_early_stop to False right here: https://github.com/suno-ai/bark/blob/599fed040e52c89e0b3580e02e2684b2c9100701/bark/generation.py#L386
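A toy illustration of what that flag changes, not Bark's actual code: when early stop is allowed, an end-of-speech event can break the loop; when it is disabled, the end token is never a candidate, so generation always runs to the step limit. The 768-step cap and the 5% stop chance are assumptions picked for the demo.

```python
import random


def toy_generate(max_steps: int, eos_prob: float, allow_early_stop: bool,
                 rng: random.Random) -> list:
    """Sample placeholder tokens until the step limit or an early stop."""
    tokens = []
    for _ in range(max_steps):
        # With early stop disabled the end-of-speech check is skipped,
        # so the loop always runs for max_steps.
        if allow_early_stop and rng.random() < eos_prob:
            break
        tokens.append(rng.randrange(10_000))
    return tokens


rng = random.Random(0)
short = toy_generate(768, eos_prob=0.05, allow_early_stop=True, rng=rng)
full = toy_generate(768, eos_prob=0.05, allow_early_stop=False, rng=rng)
print(len(short), len(full))
```

This is also why a two-word prompt can still fill the whole clip: the model has to keep producing audio tokens after the text runs out.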
thanks i am gonna try that
this npz stays on beat well
npz kinda like npc huh
i set allow_early_stop=False,
but it's not working. is there something in bark_perform i need to change
i wonder what semantic rate does
if it's bark_perform, that's --semantic_allow_early_stop False
I didn't realize you were using my fork. if that doesn't work let me know, I'll fix that right now
semantic_rate is just a constant value: it's how much actual audio time each token represents. It's not something you can change.
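A rough sketch of the arithmetic, assuming the semantic rate is 49.9 tokens per second and that one generation is capped at 768 semantic tokens. Both numbers are assumptions here; check them against your copy of generation.py.

```python
SEMANTIC_RATE_HZ = 49.9     # assumed: semantic tokens per second of audio
MAX_SEMANTIC_TOKENS = 768   # assumed: token cap for a single generation


def tokens_to_seconds(n_tokens: int, rate_hz: float = SEMANTIC_RATE_HZ) -> float:
    """Convert a semantic token count into seconds of audio."""
    return n_tokens / rate_hz


# Longest clip a single generation can produce under these assumptions.
print(round(tokens_to_seconds(MAX_SEMANTIC_TOKENS), 2))
```

Under those assumptions the cap works out to a little over 15 seconds, so the rate sets clip length, not speaking speed.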
oh i thought maybe it would speed up speech
sweet now i get 15 seconds with only 2 words of prompt
will this be useful i guess we'll see
For music you can also use a blank text prompt, with a previous music .npz file. However I'm not sure that works in bark_perform.py. I can check.
BTW, for generic music try stuff like [music][music][music] it seems silly but repeated tags can be good for that
Can you run the python bark_webui.py instead? That has a checkbox for 'blank text' prompts for sure
BTW that's also fun to use with voices... #suno-showcase message
Even [music][music][music][music][music][music]
i am way behind on your fork atm. im still using the very first one but i hacked it up so i dont want to upgrade until i have to
respect
In general you can try both () and []
and with some voices, one works, and the other doesn't!
Oh also stars. Try * dance beat * or * ominous music * etc
i've seen someone mention there's a way to use LoRAs with bark. is that just an npz file or something different?
Different. I still need to try it myself. For fine-tuning voice clones on the model.
Not allowing early stop keeps things... interesting
Breaking into random words is kind of a vibe honestly
I don't know what your defaults are for topp and topk, they might be None. Lately I've been using like topk 200, and 150 on coarse
topp 0.95 or even a bit higher
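For anyone unsure what those two knobs do, here is a toy sketch of top-k and top-p filtering on a tiny made-up distribution. It shows the general idea only and is not Bark's implementation; the function name and the example probabilities are made up, and the order of the two filters is an arbitrary choice for this demo.

```python
def top_k_top_p_filter(probs: dict, top_k=None, top_p=None) -> dict:
    """Keep the most likely tokens, then renormalise what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        # top-k: keep only the k most likely tokens.
        ranked = ranked[:top_k]
    if top_p is not None:
        # top-p: keep the smallest prefix whose total probability reaches top_p.
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_k_top_p_filter(toy, top_k=3))    # drops the least likely token
print(top_k_top_p_filter(toy, top_p=0.9))  # keeps tokens until 90% is covered
```

Lower values cut off more of the unlikely tokens, which tends to trade variety for stability.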
solid!
That might be the old code, not sure
i mean they work but sound completely different
i went from 100% music to some guy talkin yeah
It's the end of the song, and then the radio DJ talking. Sounds pretty accurate! Not the words, but the tone of speech is great.
fastback at the residence great song
Bark names its own songs!
lol
( dance music ) works but ( Christmas music ) is a nightmare. Only thing I got that was close. I love how they all have a radio DJ outro though!
One downside to not allowing early stop. If your .npz is the radio outro, you won't be able to easily recover the music.
Since Bark will continue the end of the clip, which is a person talking. Though you can go in and change this with some work, it's not a built-in feature. And picking a random point also tends to work badly. But if it's truly a one of a kind .npz, send it to me, I'll fix it.
Uhhh... sometimes Bark feels like it's giving you a 14s audio clip ripped from the multiverse, just a glimpse of something somewhere happening, lol. Is that a crying baby in the background?!? Actually maybe a cat.
Oh there's a tiny tiny bit of really good music in here, but 90% of the clip is total madness.
oh good to know thanks
yeah its like dialing a random number into bill & teds phone booth
considering how little we know of quantum mechanics i sometimes think AI reaches a threshold that allows something from the consciousness field to enter
I'm dialing in some tweaks...
Actually what's making it good is my buggy code has been sampling twice the whole time... !
(it's 30 seconds because I just bugged it trying stuff, it's just duping the audio)
SICK. Okay trying all kinds of sampling settings, everything CAN work.
Oh I actually have some other code enabled... doesn't work in base bark. gonna have to reverse engineer what I did to make music so good by accident actually.
The range inside Bark is wild.
Was there anything beyond the words in the prompt? Oh I am sleepy today. I can just see the chat bot message. Yeah it's very long prompt, nice.
copy pasted from 2001 movie.
It's a great prompt, I just tried it a bunch. Bark does vaguely mimic the performance much of the time! Also the outputs are just really good quality.
aaaand video: https://www.youtube.com/watch?v=Y4RrMDN_ZDU https://twitter.com/AIlvessuo/status/1680959951756316673
Summary from:
https://twitter.com/AndrewCritchCA/status/1680461874171658242
Once, in a past now distant, humanity faced a formidable challenge: the unchecked acceleration of artificial intelligence. Viewed through the lens of AI, humans were but slow-moving, sentient flora, showing flickers of intelligence in their unhurried existence.
Imagin...
@AndrewCritchCA As video. Slim chance that @ESYudkowsky @elonmusk notice also but let's try. @AndrewCritchCA your summary was spot on.
Boltzmann brain. Greg Egan fan maybe?
More like fan of all secret stuff https://www.eurogamer.net/trials-evolutions-insane-century-spanning-arg-scavenger-hunt-solved
Jonathaaaan, how did you hack Bark again?
It's nice, but it's just not a Bark voice to me without like, the sound of baby crying in the background and the speaker bumping the microphone halfway through. lol
lmao
pod3000 said "Bark feels like dialing a random number into bill & teds phone booth" truly perfectly put
More and more I'm surprised how much control you actually do have with the text prompt.
Some prompts are like, 90% very similar voices.
but is it control or specificity? like, can you actually tell it what you want it to do, or do you have the key for a specific output?
Yeah I don't mean control as in fine grained control of performance. Just for 'summoning' the random voice initially. After that it's much hairier.
I find direct descriptions can work, but it's usually not the best way. Like saying, "I'm a chatbot" seems to work better than describing the chatbot, via other prompting methods.
I remember trying that with genders, it did not work for me lol. I think the audiobooks in the dataset form a "prompt resistance"
however I know that with trying to get a "cheerier" tone, it's useful to add a [laugh] etc
On a generic meditation like:
Listen to my soothing, relaxing voice. Breathe calmly in, and out. Slowly close your eyes. Continue to breathe at this slow pace. Feel the air expand your lungs with each in breath.
I was getting like 80+ percent women. Even adding "Bond. James Bond." only knocked it down to like 50.
Oh, yeah, there's a bias
I think it would be interesting to have a -audiobook or -female +male "bias controls"
Can we train based on our own voices?
There are voice cloning models, which try to make a "voice" that matches yours rather than training the base model
Also tokens that silence noise really exist, huh, if only it wasn't such a pain to employ them
Yes, bark has voice cloning options
I watch a lot of "true" (fake) scary stories YouTube channels to fall asleep to at night and decided to try to make one myself completely with AI. The stories are written by chatgpt, images by dalle, and voices by bark. The stories are pretty corny but the voice would almost be passable if it weren't for the hallucinations between text chunks I think https://youtube.com/watch?v=64iOfT6YI0E&feature=sharea
All stitched together with python/ moviepy so basically 0 human intervention whatsoever lol
One thing Bark absolutely crushes at, largely but not entirely by accident, is horror audio. So much creepy distortion, fading voices, babbling, unrecognizable sounds. I have so many samples that could have been a clip in a horror movie.
It's a little hard to leverage it on purpose, but I've scared myself by playing a longer Bark sample at night. You hit a quiet spot where nothing happens for 4 seconds, you think the audio sample is done so you almost forget about it. Then you suddenly hear some unnatural voice screaming out of nowhere fading into static. And it's the end of your Bark audio sample, it just went off.
It's crazy consistent, right?
Have any npz prompts you could share oriented towards horror?
It's usually the result of a voice switch, where the .npz voice completely changes half way through the audio. Which is normally a bad npz you don't keep, so I haven't been really setting them aside. But I totally should have for the especially creepy ones, and will in future.
Actually ripping this idea from Suno's mc, meditation voices are dual use and basically kind of work as horror voices when you give them different prompts, check it
4 voices like that, + the clips with the npz, might be something usable. It's fairly creepy.
The whiny electric noise... works in horror
one more clean audio, though a bit less creepy. actually ends completely blowing the mood, lol
Probably worth trying some meditation prompts to find slow whispery voices, they are a close fit, and meditation prompts are very strong and consistent with random voices.
make anything good with audiocraft?
i got lucky a few times
suno is good for beats but audiocraft is better for melodies and non-percussion instruments
actually u can make some sick drums in audiocraft
"jadefixed" lol
Any tips on getting rid of the hallucinations? I followed the tips on the GitHub (eos setting or whatever it is)
Eos would only help with overextending
As for regular old hallucinations, they kind of just happen, but more often for some cases than others
Depending on voice, prompt etc
The main one we've seen is attempting not to understuff or overstuff the prompt
I've seen that a longer history (i.e. 2 sentences) doesn't like generating a short phrase (like few words)
EOS can help maybe for end-of-text hallucinations, but some general things:
- Prompt is too long. Generally you can use a longer prompt than the speaker can finish without causing hallucinations, but it can go too far.
- Prompt is too short. Bark really wants to generate at least 6-8 seconds, even if it has to add words.
- Prompt and speaker style mismatch. In the extreme, a different language than the voice. Also: formal vs casual, accents, Old English versus modern slang, etc
- Prompt with non-spoken text generally. (laughs) [screaming] MAN: WOMAN: and so on. Some voices work fine, others fall apart entirely.
- Prompt and speaker lower level mismatch. Your prompt is an unnatural followup to the prompt in original voice. Like if the original random voice prompt ends "I like eating pizza at" Then you prompt: "My name is Suno." as the next words. Bad fit.
- A quirk of Bark voices and repetition. When a voice speaks the same (or very similar) prompt as the one that created the voice in the first place.
- A thing some speakers have a high tendency to do, for mysterious reasons. There's just something in that voice that makes it more likely.
- Random chance. Bark just decided to dial a random number into bill & teds phone booth and record 14s of audio, instead of reading your text. It happens.
For a lot of these, it just makes the chance of hallucinations somewhat more likely, and for the most part may work fine.
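The first two items on that list can be roughly checked before generating. Below is a minimal sketch, assuming an average speaking rate of 2.5 words per second (a guess, real voices vary a lot), the 6 second floor mentioned above, and the roughly 14 second clip length discussed earlier. Both helper names are hypothetical.

```python
def estimate_seconds(text: str, words_per_second: float = 2.5) -> float:
    """Very rough spoken duration; words_per_second is an assumed average."""
    return len(text.split()) / words_per_second


def length_warning(text: str, min_s: float = 6.0, max_s: float = 14.0) -> str:
    """Flag prompts likely to be padded out or cut off by a single generation."""
    seconds = estimate_seconds(text)
    if seconds < min_s:
        return "too short"
    if seconds > max_s:
        return "too long"
    return "ok"


print(length_warning("My name is Suno."))
```

The "too long" side is a soft limit, since a somewhat overlong prompt often still works. Bracketed tags like [music] are counted as words here, which overstates the length of tag-heavy prompts.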
Without the voice nor model to have heard sonata arctica's "i have a right", that was quite sour 
Singing voices are tricky, nice one, regardless!
Aye! Thanks :P And wish me luck getting this one to sound not bad 
Haha, good luck. You might have a lot of ear piercing sounds in your future. I've made some progress with music in Bark recently, but it's much harder than regular voices.
Currently using bark infinity. But don't know where i can get better models :P or even how to train a quite high quality to be NPZ as it turned out to sound the same as the sample, even if the long mp3 is quite damn quality :P
My guess is a simpler prompt might work better, maybe:
[intense music][intense music]My lyrics [intense music] [intense music]
no note symbols maybe too
Aye. Wanted to use seeds to test different ones manually or "that one i like, let me reuse that seed". No custom stuff worked, just errors out, and can't reuse seeds 
Change 'intense' or something else
That was just a sample. But that works somewhat better with just [music] for example
maybe [power ballad] I don't know. It's all undiscovered what works best
Aye. so guessing all the prompts there are just guessing what works? And not a "these are the ones that will work"? :P Is there an "already discovered library" i can look through for emotions for voice, and types of music?
Music is mostly undiscovered. For a random voice, I think focus on the TEXT not the [brackets]
The example I've been giving people is to try this prompt:
Listen to my soothing, relaxing voice. Breathe calmly in, and out. Slowly close your eyes. Continue to breathe at this slow pace. Feel the air expand your lungs with each in breath.
Notice how all the random voices are pretty similar!
Like 8 of 10 will be a slow super calm, sometimes whispering, female.
That's how influential the right text can be.
Also lately I've been using top-k 300 semantic, and top-k 150 or 200 coarse. I think it might be better.
Indeed. Been looking for new/other trained stuff as well, or if bark is the only text to voice synthesizer :P
Ah, i can't use those, just errors out
Are you on AMD?
It should work unless you are on AMD.
I can make it work on AMD, but honestly it's a low priority. I might just wait for better AMD support a bit.
Nvidia. 3090
If you get those error on NVIDIA, then something is wrong.
Okay two things. One, are you using a seed? Turn it off if so. Set it to 0.
Two, well, I'm not sure but your Python AI setup might be a little screwy there.
That's what i wanted. Why won't seed work? As i wanted a unified result and not first the female voice i trained, then a random scottish lad a second later throughout the last 13 sec lol
Oh boi..
Okay, so the seed doesn't help you for that. The seed only means you get the SAME female and then the SAME scottish lad.
The seed should not cause an error with topk, that probably is a bug... but also you shouldn't use a seed.
The seed is a random number seed. So it means you get the same voice with the 100% exact same prompt. If you change anything, or use the voice again, it does not help with consistency in that way.
And this is why seed is important 
The seed has its uses, but it's not really useful the way you want it to be. It makes the output reproducible. Like if you played an RPG and rolled a D20 10 times. If you used the same seed, all D20 rolls will be the same sequence of numbers.
However, it does not make all the D20 rolls similar, right?
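A small sketch of the dice analogy using Python's standard `random` module (a D20 is a 20-sided die). This only illustrates how seeds behave in general; `roll_d20` is a made-up helper and has nothing to do with Bark's internals.

```python
import random


def roll_d20(seed: int, n: int = 10) -> list:
    """Roll a 20-sided die n times with a generator fixed to the given seed."""
    rng = random.Random(seed)
    return [rng.randint(1, 20) for _ in range(n)]


first = roll_d20(seed=1234)
second = roll_d20(seed=1234)

print(first == second)  # same seed, same sequence of rolls
print(first)            # but the rolls within the sequence still vary
```

The seed repeats the whole sequence exactly. It does nothing to make the individual rolls resemble each other, which is why it won't keep a voice consistent across different prompts.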
No idea what a D20 is