#suno-school
don't know if this is the right forum for sharing this, but found this unofficial repo for training a Bark: https://github.com/anyvoiceai/Barkify, thought this could be useful. any comment?
looks promising
yeah, making a little thing for it right now
as a separate script though, so you can just process your generated semantics into wavs
fixed the "rn", it was putting literal \\n and \\r (backslash + letter) instead of actual \n and \r characters, so i just had to convert them
or maybe i should decode it better
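A minimal sketch of the conversion described above, assuming the broken text contained the two-character sequences backslash-n and backslash-r rather than real control characters. The function name `unescape_newlines` is made up for illustration and is not taken from the repo.

```python
def unescape_newlines(text: str) -> str:
    """Turn the literal two-character sequences \\r and \\n into real control characters."""
    return text.replace("\\r", "\r").replace("\\n", "\n")


# What the broken text might have looked like: a backslash followed by a letter.
raw = "first line\\r\\nsecond line"
clean = unescape_newlines(raw)
print(clean.splitlines())  # ['first line', 'second line']
```

Decoding the text properly at the source would avoid the need for this kind of patch, which is probably what "decode it better" is getting at.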
lol
sounding like a nonsensical news article
@glossy trout the repo has been updated, there's now a create_wavs.py, and the texts don't have weird random tokens through them
the create_wavs.py will just create a wav for every semantic prompt it finds in the semantics output folder
it won't process the same file twice in the wav generation, that's on purpose
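The "don't process the same file twice" behaviour can be sketched roughly like this. It is not the actual create_wavs.py: the `.npy` extension, the folder layout, and the `synthesize` callable (which would wrap the real Bark generation step) are all assumptions for illustration.

```python
from pathlib import Path
from typing import Callable, List


def create_missing_wavs(semantics_dir: Path, wav_dir: Path,
                        synthesize: Callable[[Path, Path], None]) -> List[Path]:
    """Call synthesize(src, dst) for every semantic file that has no wav yet."""
    wav_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for src in sorted(semantics_dir.glob("*.npy")):
        dst = wav_dir / (src.stem + ".wav")
        if dst.exists():
            # Already converted on an earlier run, so skip it on purpose.
            continue
        synthesize(src, dst)
        created.append(dst)
    return created
```

Because the check is just "does the output file exist", the script can be stopped and restarted while the semantic files are still being generated, and each run only picks up the new ones.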
3000 something semantic files lol, i have some processing to do
old dataset was 900 files of 100 tokens (about 2-3 seconds)
for the new one i've already got >3000 semantic files which are about 10 seconds long
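For a rough sense of scale, here is the token-to-duration arithmetic, assuming a semantic token rate of about 49.9 tokens per second (the value Bark's generation code appears to use; treat it as an assumption). The helper names are made up for this sketch.

```python
SEMANTIC_RATE_HZ = 49.9  # assumed semantic tokens per second of audio


def tokens_to_seconds(n_tokens: int, rate_hz: float = SEMANTIC_RATE_HZ) -> float:
    """Approximate audio duration covered by n_tokens semantic tokens."""
    return n_tokens / rate_hz


def seconds_to_tokens(seconds: float, rate_hz: float = SEMANTIC_RATE_HZ) -> int:
    """Approximate number of semantic tokens for a clip of the given length."""
    return round(seconds * rate_hz)


print(round(tokens_to_seconds(100), 2))  # 2.0
print(seconds_to_tokens(10))             # 499
```

So 100 tokens is about 2 seconds, in line with the "about 2-3 seconds" estimate, and a 10 second clip is roughly 500 tokens.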
Today is a bit of a crazy schedule - I will try to switch to wav creation, not 100% sure depends on how complicated it is
It's currently cranking out semantic files
git pull, run create_wavs.py
lol
Hi all. Is it possible to use Bark for real-time TTS on React Native?
Only for one particular definition of real-time, in that a very fast GPU like a 4090 can almost generate 14s of audio in 14s (though not quite realistically). But not for latency.
Ah I see. Thanks!
i'm about 10% through my files
like 33% through now
@copper pewter
Here are the semantic files
I couldn't do the .wav files today. It's 4 different servers and I was on the road a lot of the day, so I couldn't ssh into them and update the script
Let me know if you still need them next week, I can run another process to create them
or LMK if you're able to create them and don't need them any more
Either way is OK, happy to run more GPUs, just let me know what you need and when. but I'll be out of town the next few days.
I was able to finetune the existing coarse model on a new language and a new set of codebooks. I kept the fine model as it is and just generated some samples from ground truth semantic tokens. Can you spot which one is the original human audio?
3 is more human like with clarity. How did you train a new language?
it involved multiple steps, I'll write a clean doc sometime this week. your guess is correct. It's finetuned for just 30 mins on an RTX 2070, so I guess it needs more time and data
if it helps, I was able to drop the generation time and improve audio quality while doing so
How did you get the ground truth semantic tokens for a new language?
Hi guys
i need help, i want to generate audio from long text. i tried many things but they didn't work
i generated an audio file but there's no sound in it
i wonder if ~8 hours of training data will be enough, but i'll just have to try lol
for now I have taken the output from HuBERT (hubert_base_ls960.pt) and the kmeans model they provided for it. it's not ideal, but I just wanted to test if it works. then I finetuned the coarse model with this, and surprisingly the model learnt these new codes in very few steps (~2k steps)
btw if you finetune with new codes the old knowledge will be lost by default. There may be some ways to make it still preserve the old knowledge. Now to build a working text to speech model, the semantic model also needs to be trained. That I have yet to do.
nice, i'm also using the hubert_base_ls960.pt, but a custom model for kmeans, to try and make it compatible with everything
it does make sense that finetuning bark's semantic (text_2.pt) would work faster than training from scratch, i just didn't think of it. The back-up for voice cloning will always be a full custom bark model though
try cloning your own voice on your finetuned bark models, since you can extract semantics using the hubertwithkmeans and they should be compatible with your model as that was trained on those tokens.
meanwhile i'm still generating training data, it wasn't running while i was asleep, but it's been running for a few hours already again
and i'm over 50% done now
By 'finetuned bark models', are you referring to already finetuned bark models? Or do you mean fine-tuning the raw bark from scratch? I haven't encountered any repo to fine-tune Bark with, for example, LoRA. Except this issue post https://github.com/suno-ai/bark/issues/117
bark runs in pytorch, you can literally just train it based on the already trained model (which is finetuning)
you could probably also figure out loras based on what we have
hm okey, got that
@lunar glen @copper pewter - Do you think text_2.pt (Bark's text_to_semantic model) was trained using hubert_base_ls960.pt?
probably not, and definitely not those kmeans, the kmeans tokenize to 500 tokens, but bark uses 10000 tokens
however i do believe you can achieve good voice cloning by using hubert_base_ls960 with a custom tokenizer which tokenizes to 10000 tokens
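To make the "kmeans tokenizes to N tokens" part concrete: a k-means quantizer just maps each feature frame to the index of its nearest centroid, so the number of centroids is the vocabulary size (500 for the provided model, 10000 to match Bark). A toy sketch, with 2-D vectors standing in for the real HuBERT frames and a 3-entry codebook; the function name and data are made up for illustration.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest centroid (its token id)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
            for frame in frames]


centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # toy codebook with 3 tokens
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8]]    # toy stand-ins for HuBERT frames
print(quantize(frames, centroids))  # [0, 1, 2]
```

A codebook fitted independently of Bark would still produce ids in the right range, but nothing guarantees that id 42 means the same sound in both vocabularies, which is why a tokenizer trained against real Bark semantics comes up later in the conversation.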
Ah I see. You'd probably lose some fidelity though, since going from 500 to 10k will be a little lossy
oh, and i'm currently at 1.62gb of wavs for the semantics
Noooiiiiccceeee
still just the semantics i generated yesterday lol
the 500 token quantizer won't be used
Btw, just FYI the audioLM paper originally used w2v-BERT to extract semantics from sound: https://arxiv.org/abs/2108.06209
Motivated by the success of masked language modeling (MLM) in pre-training
natural language processing models, we propose w2v-BERT that explores MLM for
self-supervised speech representation learning. w2v-BERT is a framework that
combines contrastive learning and MLM, where the former trains the model to
discretize input continuous speech signal...
it's replaced with my model which fakes it to act like the one used to train bark as well as i can manage
and the audioLM-pytorch implementation uses HuBERT
is anyone aware of any multi-modal or prompt-to-audio models similar to Google's latest project, whereby you can enter a prompt and it will generate sound effects or music?
Oh! Got it
it's a recreation of google's musiclm, just like bark uses a recreation of google's audiolm
you are amazing. thank you
3002/3079
You can clone your own voice now
I thought suno/bark didn't allow that
it's just not a feature, and they won't enable it, but if you figure it out, it's fine
because the legal things around a company doing that are not very clear for example
Ah I see, but we can still fine tune their models on whatever voice we wanted
yeah, or fine tune a model to use different semantics and do voice cloning from there
or train a model to create valid semantics for the regular bark models (which is what i'm doing)
the data is finally done
just preprocessing the data now
Nice
this is from the start, it seems to be going pretty fast lol
with how much variety there is in the training data, it has not had any duplicates yet. so it lowering is a good sign (loss shows every 50 batches, 1 batch is a single audio clip)
What kind of GPU(s) are you using? Also, you seem extremely familiar for some reason... can't put my finger on it though.
3060 12gb
the training is only taking 3.2gb vram total though, you could get it running on a 1650 if you tried probably
passed 2 epochs so far
I see... if you ever need a faster GPU lmk cuz I have a 3090 and would be glad to help out in any way possible.
I like the spirit here
it seems to be fine on my 3060
Another thing is if it's only using 3.2gb of vram, shouldn't you up the batch size to use more resources?
Cuz honestly I think if it's only taking that amount of vram, it sounds like you're not utilizing the full power of your GPU
there's a few things that are still trial and error, like whether to cut unequally sized outputs from the start or from the end
and i am at 100% gpu utilisation
So this is more a test run than anything?
Yeh, but you could fit more into your VRAM, couldn't you?
yeah, i could, but currently a batch is an entire wav
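The "cut unequally sized outputs from the start or from the end" choice mentioned above can be sketched as follows. This is only an illustration with plain lists; the function name and the `cut_from` parameter are hypothetical, and the real training code may align predictions and targets differently.

```python
from typing import List, Sequence, Tuple


def align_lengths(pred: Sequence[int], target: Sequence[int],
                  cut_from: str = "end") -> Tuple[List[int], List[int]]:
    """Trim the longer sequence so both have equal length before computing a loss."""
    n = min(len(pred), len(target))
    if cut_from == "end":
        # Keep the first n items of each sequence.
        return list(pred[:n]), list(target[:n])
    if cut_from == "start":
        # Keep the last n items of each sequence.
        return list(pred[len(pred) - n:]), list(target[len(target) - n:])
    raise ValueError("cut_from must be 'start' or 'end'")


print(align_lengths([1, 2, 3, 4, 5], [9, 8, 7], cut_from="end"))    # ([1, 2, 3], [9, 8, 7])
print(align_lengths([1, 2, 3, 4, 5], [9, 8, 7], cut_from="start"))  # ([3, 4, 5], [9, 8, 7])
```

Trimming per clip like this is one reason a batch of one wav is convenient: batching several clips would additionally need padding and a mask.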
Ah
outputs actually look like outputs, nice
like, the semantics extracted have similar patterns to actual semantics from bark
IT WORKED
original
regenerated audio
the second audio does not know about the semantics of the first, it specifically extracted them from the wav
ChatGPT ELI5:
In simple terms, the sentence means that the patterns found in the extracted meaning (semantics) from a dog's bark are similar to the actual patterns found in the bark itself. It's like the model successfully captured and reproduced those patterns. The second audio, which was generated separately, doesn't have knowledge of the first audio's meaning. Instead, it extracted the meaning specifically from the sound waves (wav file) itself.
lol
anyways, i'll implement voice cloning tomorrow, show it for a bit, and release the model and training code later
currently just have one for hubert_base_ls960. might add more models later if someone wants it for a bigger HuBERT model, but as you saw from before, it's already really good
i'll release the models and code later this week
Nice. I'm excited to play with it, and especially wonder about non voice cloning uses, music, and enabling whole new things. If you can train it that fast you can do one-off projects with different goals than accurately reversing the semantics.
I'm already regretting this and dreading the next update in base bark
Holy moley, yeah, base bark updates aren't going to be fun maintaining that...
this is based on something i spoke, i don't know why it put it twice though lol
actually crazy that it managed this in 3 epochs of training that model
Crazy!
You may have realized what I have, Bark makes everything easy
I'm always thinking, 'this can't work' and then it does
So I could believe it
If we can really train a useful model that fast, seriously, Bark is gonna be just ridiculously powerful
I've found the semantic model is incredibly robust, so I bet it makes do even with what might be a somewhat poor model, it probably still makes it work. Though you could keep training and make it better for sure.
Now I want to take a crack but it's 4AM here and I'm being absurd
But yeah, I can't believe the stuff I shoved through the semantic model and somehow still worked
Sometimes I get messages like "Why don't you make an audio microSaas" or something, and I'm like, "That sounds like another project I don't know anything about that I don't have time for" but people are RABID for voice cloning, it must have been half the chat messages in this Discord for awhile. You probably actually could make an instant business.
lol
The serp github project has like 1000 stars, and it's terrible!
i still have to see the quality of voice cloning, i'm currently adding it to my webui
I'm more surprised that people haven't started doing voice cloned tts yet.
Cuz voice clone is speech to speech so far
that's just proof of concept
because of how bark speaker prompts work, this should also work in text to speech
I've tried making it tts based, but I use so many VB cables that the results are bad (voice clone, i mean)
for some reason i'm getting system errors from soundfile when i do torchaudio.load? but only in this part of my webui, it's fine in the other part
"system error" not very descriptive
Just scipy.io.wavfile for wavs, and pydub later for other stuff
and the "actual error code" is just "system error"
But I'm not sure exactly what you're doing with it, so maybe those do not cover the functionality
i was loading the audio with torchaudio.load
i guess if i want to load the wav then i'll just have to put it in a torch tensor
I didn't even realize torchaudio used it, since I'm not loading audio like that
Just basic output and conversion
Does your webui not process wavs with the same library as the training code?
I guess maybe not, they do different things
ah now i'm getting an error with wavfile, there's something wrong with my directories, kind of expected that
ok, voice cloning works great, but outputs aren't always perfect, sometimes the voice is somewhat different. sometimes completely different?
might just be an issue with my input wav file though
Probably just needs more training, more data. It's honestly shocking it works at all that quickly
I thought it would take at least 10x that long, if not more
i'm just gonna check if the semantics make sense, by taking the audio i used for the voice, and instead generating semantics for it
and use those as the prompt instead of history prompt
nah sounds fine
thats the issue I mentioned with the semantic_prompt generally
some voices work almost perfectly
like, i took the start of a dream video, and i replaced the voice, and that works fine, voice cloning also work fine but not all the time
others will change voice some percentage of the time
like this
If the semantic is close but not quite the same as it should be in Bark, maybe the errors accumulate faster so the voice diverges faster than a regular Bark generated history_prompt
Maybe just needs more training and data
But yeah even regular voices do that sometimes
"works but not all the time" is literally everything in Bark
I mean you trained in 8 hours on a 3060!
And it works sometimes
Be happy!
I mean the thing to keep in mind is that the semantic prompt generated at inference time is not actually the semantic prompt that is used as input to the model
it's correlated with the text embeddings
so only a subset of semantic outputs will actually perform well
Right, the last 10% might be a lot of work. My gut feeling is that Bark is pretty robust so probably handles more than you expect, even though it kind of feels you would need the perfect text embeddings. I've done worse things and they just worked.
yeah I suspect the closer a voice is to the training set, the better it will perform
because the model has better learned to disaggregate text semantics from acoustic semantics
The history prompts carry a mind blowing amount of detail for a voice, considering how tiny they are really. It's really piggy backing off the innate Bark model a lot, so you can get by with some really crude things sometimes too.
I wonder if @copper pewter added a penalty loss to his model between his semantic predictions from hubert and the actual semantic prompts from en_speaker_1/2/3/etc
because those presumably are true semantic inputs
no, not speaker_1/2/3, but actually just saved ones, they're the same
He used a ton of real bark generations
3000 something clips
saved ones aren't the same though
(presumably)
en_speaker_0 etc presumably come from the "true" semantic model
no
which is unreleased
Oh, you know I don't know if they did do that?
They might just be random bark generations?
en_speaker_0 is most likely just a saved bark gen, they claim bark does not include any real people's voices
thats what im guessing
I think you just need more clips, more time, more diversity. 3000 isn't even that much
after all
around 3000 * 50 * 10 semantic tokens
out of 10000 possible options, a lot of overlap
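The back-of-the-envelope count above works out like this, using the rounded figures from the conversation (3000 clips, roughly 50 tokens per second, about 10 seconds each, a vocabulary of 10000). The even spread over the vocabulary is an idealisation; real token usage is certainly skewed.

```python
clips = 3000
tokens_per_second = 50   # rounded semantic token rate
seconds_per_clip = 10
vocab_size = 10_000

total_tokens = clips * tokens_per_second * seconds_per_clip
print(total_tokens)               # 1500000
print(total_tokens / vocab_size)  # 150.0 occurrences per token if spread evenly
```

About 1.5 million tokens over 10000 possible values means each value shows up around 150 times on average, which is the "a lot of overlap" point.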
i think there was mention of an unreleased model though
thats why i assume there's a "true" semantic model out there
Yeah they have their version of what Mylo trained, basically, that's what they said they didn't release
Presumably it is perfect, but I don't think they would use it for the Suno default voices, just don't see the point.
The default voices are all saying the same sentence, if it was the case, probably not even helpful.
i might just train a few more epochs and try that model out, who knows
i mean it's possible that with more data, the new model gets closer to the "true model"
(more data, more training etc)
yeah, i don't have the resources for that, but model and code will be released later this week
with your synthesized training data, did you use the same phrase?
Step 2. Mylo Audio dethrones Eleven labs.
(i can't remember exactly what it was)
I wonder if that would make a difference too
it's trained on >3000 random phrases from books
and shakespeare plays
the actual words don't matter though, the sounds do
They probably matter some
well....i wouldn't be so quick to assume that
Or rather, the overall structure and meaning of the sounds, which we call words
dont you think there is a reason that each provided prompt uses the same phrase?
I don't think there is a reason for that, not really. But just that there's probably some nuance and depth that might be tricky to capture still, compared to real Bark generated semantic prompts. And you could need a wider range of text in the training data as part of that. To hit like perfect match.
I mean there's a way to test my hypothesis
just measure the distance between the bark-provided semantic prompts and mylo-provided prompts for the same audio from the same speaker
if that keeps going down then it theoretically can converge to the same model
Well that's what he trained it yeah, so just keep checking the loss
no, i mean checking against the pre-provided en_speaker_0 etc semantic prompts
I was just speculating that you might hit a wall and need a wider set of text. Asian languages or something, for example.
for audio generated with that speaker
As far I can tell there's nothing special about the provided prompts, but that's just my guess, I guess there could be. But even so I don't think it would be that useful, regular bark generated semantics should be all you need.
its possible, I'm just thinking of ways to empirically verify that
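One way to put a number on "the distance between two semantic prompts" is a normalised edit distance over the token sequences. This is a swapped-in metric chosen for illustration, not something anyone in the conversation said they used; the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute, or match for free
        prev = cur
    return prev[-1]


def normalized_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Edit distance divided by the longer length: 0.0 is identical, 1.0 is fully different."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(normalized_distance([1, 2, 3, 4], [1, 2, 9, 4]))  # 0.25
```

Tracking this value between the provided speaker prompt and the extracted prompt for audio generated with that speaker, across training checkpoints, would show whether the two are actually converging. Exact token match is a strict criterion though, since two different token sequences might still sound the same.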
Hey you seem pretty knowledgeable, random question. Somebody asked me about adding a 'negative prompt' to Bark Infinity, like Stable Diffusion. Do you know if there's a reason LLMs never had this idea? Does it not even make sense, or is useless, or something?
I'm not sure exactly what the implementation would look like even, or even what 'working correctly' means exactly, but it does seem like a fun idea.
"I HATE YOU GUYS" as a negative prompt, to make somebody quiet and friendly
or something
I think you could implement something similar conceptually but negative prompts are mostly a diffusion model concept
because they can sample from two latent spaces
i guess autoregressive decoding could downweight the logits of a negative phrase though
I was thinking something like, generate the negative prompt. Save the generation. Compare the generation to the 'average' generation from that language. Pick some cutoff for the most common tokens or patterns used unique to the negative prompt (so you don't penalize the sound of a human voice generally) and then penalize those in the actual generation. Just a random idea.
Kind of depends on how the semantic works, so it might be useless. But maybe does something.
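A rough sketch of the idea as described: find tokens that are much more common in the negative generation than in a baseline, then push their logits down during sampling. Everything here (function names, the ratio threshold, the penalty size, the toy token ids) is made up for illustration, and whether it does anything useful in Bark is an open question.

```python
from collections import Counter
from typing import List, Sequence, Set


def overrepresented_tokens(negative: Sequence[int], baseline: Sequence[int],
                           ratio: float = 3.0, min_count: int = 2) -> Set[int]:
    """Tokens whose relative frequency in `negative` is at least `ratio` times the baseline's."""
    neg, base = Counter(negative), Counter(baseline)
    picked = set()
    for tok, n in neg.items():
        if n < min_count:
            continue  # too rare in the negative sample to say anything
        neg_freq = n / len(negative)
        base_freq = (base[tok] + 1) / (len(baseline) + 1)  # add-one smoothing
        if neg_freq / base_freq >= ratio:
            picked.add(tok)
    return picked


def penalize(logits: Sequence[float], tokens: Set[int], penalty: float = 5.0) -> List[float]:
    """Subtract a fixed penalty from the logits of the selected token ids."""
    return [v - penalty if i in tokens else v for i, v in enumerate(logits)]


negative = [7, 7, 7, 2, 3, 7]                     # toy tokens from the negative prompt
baseline = [2, 3, 2, 3, 1, 0, 2, 3, 1, 0, 7, 2]   # toy "average" tokens
bad = overrepresented_tokens(negative, baseline)
print(bad)                        # {7}
print(penalize([0.0] * 8, bad))   # only index 7 is lowered
```

Comparing against a baseline is what keeps common tokens (the ones every voice uses) from being penalised along with the ones specific to the negative prompt.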
i think you could figure out some way of doing it but in my experience picking the average of the output of a network never performs that well
its just too much information mushed together
id maybe start with trying to convert man to woman
or vice versa
maybe looking at the attention weights for that particular token?
Well I did use this method, basically, for my french accents I've been posting. It's not reliable
but it does work often enough
But comparing set of english versus french
So it COULD just work, bark is surprisingly robust
do you know about pca?
No
I'm pretty much a code monkey rubbing sticks together. I mean theoretically I took stats in college but it's long been phased out of my neurons
basically to find the major dimensions of variance in a dataset
and the eigenvectors of that variance
that would be a good place to start
Most linear algebra too, annoyingly, since that has been so necessary now I need to relearn a bit
haha
However, there's great stats libraries. I bet I could get pretty far just banging some keys
Somehow I can still program in assembler despite not seeing it in forever, just because that one professor was such a hard ass it got burned into my brain. But not the math.
yeah sklearn.decomposition.PCA is my go to
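For anyone following along, here is what "the major dimension of variance" means in practice: a standard-library sketch that finds only the first principal component with power iteration on the covariance matrix. It is not how scikit-learn implements it, and the function name and toy data are made up for illustration; for real work the `PCA` class in `sklearn.decomposition` is the sensible choice.

```python
import math
from typing import List, Sequence, Tuple


def first_principal_component(rows: Sequence[Sequence[float]],
                              iters: int = 200) -> Tuple[List[float], float]:
    """Return (unit direction of greatest variance, variance along that direction)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, variance


# Toy data lying exactly on the line y = 2x, so the main direction is (1, 2) normalised.
direction, variance = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
print([round(x, 4) for x in direction])  # [0.4472, 0.8944]
print(round(variance, 4))                # 8.3333
```

The sketch assumes the data is not constant (otherwise the normalisation divides by zero) and that the starting vector is not orthogonal to the main direction.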
Yeah I did use that a bit already, just to get some basic stats, entropy, perplexity, whatever. Just to look for simple trends.
Those particular metrics were largely useless, at least for what I was doing at that moment.
i don't know what any of that means
perplexity is probably something you'll need, it's all over any language model training and evaluations
The rest, eh. Just get that loss down.
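Since perplexity came up: it is the exponential of the average negative log-likelihood the model assigns to the correct tokens, so it is just the cross-entropy loss in a more readable unit. A minimal sketch with made-up probabilities:

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that always gives the right token probability 1/4 is as uncertain
# as a uniform choice between 4 options, so its perplexity is about 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```

So getting the loss down and getting the perplexity down are the same thing, which fits the "just get that loss down" advice.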
Have you tried cloning the rare Bark voices, for example a child's voice?
I'll be impressed if that works, but maybe you only need a few samples in the data?
Or maybe a child's voice isn't that different from an adult, compared to say a human versus a dog barking or a car alarm
i believe voice cloning a "rare" bark voice shouldn't be an issue, it doesn't need to be in the training data
Yeah as long as it's close enough, it only has to be in the Bark data
like, who knows, maybe you could record your cat meowing and see what bark thinks it would sound like as a person, that would be quite an interesting thing
so your prompt is accurate enough to get in the right space
Bark can probably do meows!
I haven't heard one, but it's got be in the training data.
is it?
Yeah, I mean it can do music and random sound effects
It's just raw audio, youtube, tv shows, etc, just guessing.
I often get a 'tv commercial' sound
so for sure TV
yeah, i did notice i got a voice which sounded a lot like a speaker that was shared in #bark-technical
the old tv sounding speaker
Yeah it's raw data and not even noise filtered
which is also kind of cool, so it can make the noise too
like an old time AM radio or something
I have some of those
I was really killing myself to get Star Trek TNG background 'hum' sound
possible but difficult as heck
after a bit more training, it doesn't seem to be that much more consistent. still, the times that it does work it's really high quality
i have said this like 100x time for everything i tried, lol.
i'm still kind of blown away it trained so fast, I guess I had a totally wrong intuition about how hard it would be. I should have gone for it. I also hadn't until recently thought about all the non voice clone use cases, which made me more interested. I'm really impressed you went from a single neuron in javascript to just nailing it like 3 days later. You got a bright future.
i do have years of programming experience, so that really helped me
do you have a clip of joe biden talking so i can demonstrate a voice that everyone will recognise?
Not at the moment, away from main desktop
Try for Rick and Morty
I had trouble with Rick
really crazy voice, atypical
need to find some voice lines
Probably google like rick and morty sound board, or something
I can give you some later but won't be home for a while
Have you tried music? I would be super intrigued if it works, just even rarely, it would open up a lot of fun possibilities to make new kinds of stuff with Bark.
Cloning is cool but like, you know, already a thing. But if it works with music, OMFG
idk, music won't be better than bark music, it's just a history prompt
So like, once in a blue moon, I get really good music out of Bark. So it's not impossible.
Like 14 full seconds that almost sound like a real clip
But it's so so rare
Even though it's just a prompt, one time it even stays pretty solid for a minute, using prev segment as new prompt. Literally just one time though, lol
You may be right that encoding music into the history doesn't help much, mostly still sucks, but maybe not
Looks similar, did you train the coarse model from scratch or fine tuned the suno one ?
From yesterday I have been trying to train the text model, the merge token trick is eating me up, any luck with the text model ?
Semantics is extracted from Hubert right ? Then you trained some model for semantics to coarse ?
Or you mean text to semantics you trained ?
also, @untold briar i might have a great idea to fix it not always generating entirely correctly, just save the prompts from when it does. problem solved
i trained a model to take the output from HuBERT and convert it to bark compatible semantics
Oh that's a nice idea dude
i don't have a decent clip without a ton of background audio, so bark is mainly just continuing the background audio
Ah, I wish I was at my desktop, really curious how it comes out. I can try to google one sec
i'm getting a female voice, the same female voice every time, somehow
Haha, I had the same problem! Wacky!
It must be something about it
Or like, it's triggering a cartoon or something
It didn't always happen so I got something Rick like eventually but I actually got that, doing my completely insane method
which seems crazy to me, must be like a model thing in Bark
That's just totally bizarre honestly, like WTF.
yeah your method is kinda a lot of trial and error, just generating multiple and hoping one of them is good right?
It's way more convoluted, I identified certain trends, manually ranked voices by similarity, and iteratively bred them over generations, trying to find a recursive process that made it converge more in that direction. a process which wasn't that similar between voices. with a lot of weird hacks like manually mixing two voices that each have a similar aspect. Just a completely insane process. But also a fascinating concept that it converges so well, I was blown away it kept getting closer.
for example with mine, this was the second try, idk why he screams hello though
i don't really watch dream or anything but i'm pretty sure this sounds pretty accurate
that probably takes ages lol
my voice cloning takes less than 5 seconds on cpu
It was really like curiosity, could this actually work? and it's crazy how close it can get
like this was the first one after
It is kind of my thing to do things "the incorrect way"
technically, if you keep combining the 2 that would have the actual voice between them theoretically, you could actually get a great clone
I was like, this can't possibly work, right? So it was pretty fun! Also just kind of wild it kept getting closer and closer until it seemed perfect to my ears. I thought it would eventually hit a wall.
it would just take a long time
It was like a couple hours, maybe a bit faster at the end but I got bored
But Rick was just weird, constantly got voice switches, and stuff.
I think it must be modeling a cartoon. Bark I mean. Cartoons do switch voices rapidly. So this is, in a sense, perhaps an accurate prediction. But I literally kept getting a woman's voice, that was by far the most common. It's a bit mind blowing that somehow both methods produced that.
It wasn't just cloning, I was interested in making voice hybrids, flipping the gender of a voice, it was all related work
Just seeing what might work
Especially what *shouldn't* work, but if it does it's very fun.
Gotta bail for a bit, @ me if it's interesting, will check later today
Okay, I thought this would be a fascinating YouTube video (Titles... How to voice clone the least efficient way? Still workshopping it.) I may as well try to explain quickly because who knows when I'll get to making that. Basically after the third time I randomly got a Bark voice that sounded suspiciously like Trevor Noah, I was like, ok seriously WHAT is going on?
And it was my test script. I was constantly running the same prompts, in the same order, iteratively using the last segment as the new history, with a bunch of variations: sometimes half the prev segment + half a fixed history, randomly deleting a random portion of the history, mixing two different histories, and stuff like that. Lots of trimming and deleting and mixing, and importantly preserving at least some fraction of previous segment, so it was always iterating and building on the past. (Hitchhiker's Guide to the Galaxy and Andor scenes, if you're curious.)
And I looked at the samples and in some parts of the process, later samples did actually sound a bit more Trevor Noah than earlier samples. It was super rare to get a full clone from this particular process, but I ran this script all the time so it did just randomly happen more than one time. The main thing was that it was a process that somehow trended towards a particular voice. So it made me wonder if you could do better on purpose. And you can if you're completely crazy and procrastinating from actual work.
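Two of the history variations described above, sketched on plain token lists. This is a guess at the general shape, not the actual test script: real Bark histories hold semantic, coarse and fine arrays, and the function names, the ordering of the halves, and the `max_fraction` parameter are all assumptions.

```python
import random
from typing import List, Sequence


def half_and_half(prev: Sequence[int], fixed: Sequence[int]) -> List[int]:
    """Second half of a fixed history followed by the second half of the previous segment."""
    return list(fixed[len(fixed) // 2:]) + list(prev[len(prev) // 2:])


def drop_random_span(history: Sequence[int], max_fraction: float,
                     rng: random.Random) -> List[int]:
    """Delete one random contiguous span covering at most max_fraction of the history."""
    span = rng.randint(0, int(len(history) * max_fraction))
    start = rng.randint(0, len(history) - span)
    return list(history[:start]) + list(history[start + span:])


print(half_and_half([1, 2, 3, 4], [9, 8, 7, 6]))  # [7, 6, 3, 4]
print(drop_random_span(list(range(10)), 0.5, random.Random(0)))
```

Both keep part of the previous segment in the new history, which matches the point about always building on the past rather than starting fresh each round.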
hmm
I hope I can explain in a YouTube somehow, I was just so surprised it worked, don't know if I can capture that feeling of 'how the heck is this working?'.
"Voice cloning with zero seconds of audio (for crazy people)." Hmnnn. I like that one
Totally @ me if you try some music. That's what I'm most curious about, even if it's probably not great.
Trying to get good music out of Bark is extremely difficult because so many samples are literally painful to listen to. So if this is an alternative way to try stuff, it would be a relief.
Oh you gotta try the meow too. That's critical. We have to know what happens. For science!
i do have a little bit of ai generated lo fi from riffusion
but idk, my tokenizer did not have any music in its data
Yeah, I understand it shouldn't work. 100%. But I'm saying, maybe it's still interesting!
ok, so first gen it had the little sounds in the output audio, and a voice saying aaaaaaaaaaaaaaa
I call that a win. That sounds a bit like music to me.
the original music btw, it's kind of low quality, since it was a test of an extension i made for stable diffusion webui
Accuracy isn't the only goal. Can you make an interesting history prompt for Bark, that is capable of generating Bark outputs that would otherwise be hard to make?
That's a win.
and this sounds like nothing lol
can you give an example
It sounds like singing, a bit..
Well any consistent and not noisy vaguely music sounding audio, would be a win
Some of my weird experiments made Bark output like, car alarms. But in clear audio. Not useful, but kind of interesting.
Doorbells, honks, horns, lots of that.
I mean it's early. In future could train on music, sounds, whatever. But thanks for testing for me.

the voice cloning is so good though
best results if you use the voice cloning, then generate a piece of audio which fits the person you're trying to clone, then taking the speaker prompt from there, that way it's fully bark-generated and accurate. but actually has the voice
that's what i did in the clip above
I have a few bidens, one issue I had is he trended towards either public microphone, or way close to mic, he kept oscillating and neither was ideal, I was gonna try hybriding but I got bored of this manual process
He's the one I'm least happy with though, of the presidents
I was thinking about future uses. If you can encode an hour of a speaker, for example. Then sort of like how I'm doing my french accents, one possible use of your model could be to create a deeper clone than the history prompt
Kind of like how I push it towards french tokens, maybe it's possible to do something like with like a bigger sample size
and possibly, small chance, it could overall increase quality past what you can do from single history_prompt. without having to fine-tune
From what I hear fine-tuning is ALSO quick
so this may be kind of a dead end
but still, could be worth trying, some day
fine tuning should be god tier though, for actual practical purposes
Oh can you flip a sign in your model somewhere? Can you produce the semantic that is the least like the wav instead? Probably total noise or silence. But sometimes a bad idea can do something cool.
You're getting a lot of insight into how I work. Try the wrong thing that even if it works, should sound terrible.
I'd say "Hey, it's a living." but it's not, it's really not. If anything it's an anti-living.
I can try all this stuff later in the week when you release, I'll stop bugging you. I'm stuck in a waiting room and bored out of my mind, and I can't try anything myself for a couple hours.
Does anyone know, of the people who did fine-tune whatever model (coarse I think, for a new language) were they able to do it and preserve the Bark original model's capabilities, or did it have a detrimental effect on them?
just put random semantics if you want that
Yeah I agree that's the most likely result, basically the same as random tokens. But it's worth checking at least a bit, sometimes a really bad idea can surprise you.
tried to voice replace a meme with a random voice
if you've got a decent clip you can get decent results on voices that aren't very common
dantdm
Cartoon voices seem to be tightly tangled with background music in Bark. Super deep narrator voices as well are incredibly likely to add it, even if the original speaker had nothing.
Could be really cool if the background sounds were a bit better and more plausibly a real clip
Best way to do that would be editing the audio beforehand to remove the background music as much as possible. Nowadays there are tons of tools that do that.
Or even better, hire a voice actor from Fiverr who can do impersonations and record a bunch of clean voice lines for training
I think they're trying to keep it raw, and figure out some way to sample from it with more control. But I don't know if there's any downside to adding even more data like that as an addition to the dataset.
Probably not?
I wouldn't think so.
What is a good amount of data and voice lines to train custom model?
There's somebody in here who at least started training a new language, but I haven't kept up with the details. It might be only the coarse model as well.
I don't think you'll need to train Bark at all for good voice cloning.
You could do it for better cloning, probably. But should be good with just a good prompt.
Unless your voice lines are atypical, sound effects, music, or a new language.
I see. I'm still new so I don't know how to do voice cloning yet. I've Bark infinity running and tried to do voice cloning through its UI, but haven't managed to make it work yet
Bark should work much better than Tortoise, using a small audio sample.
That's not the real cloning. The person working on it is right above here, probably out later this week.
I put that in Bark Infinity because people asked, but it's really just not worth the effort. Literally just wait a few days.
Oh nice! Would love to give it a try and test it out if I can manage to get it running
still learning how to run Python, git, and stuff, my expertise is more in visual than programming haha
He's still cooking the cloning a bit, just scroll up literally in this thread, that's it right there.
Oh Mylo with the audio! Cool thanks man
Gonna spend the afternoon reading chat history haha
May I ask your reason for cloning? I'm sort of curious why it's just the thing everyone asks about specifically.
If you only need a nice voice in a particular style, you may not need to clone at all. If it just needs be in a specific style and tone, but not an exact voice.
It's pretty wild how expressive Bark is. You can really get quite a specific voice with a bit of effort.
Trying to create a voice for a female character for a Youtube channel. Wanted a clean voice that sounds similar to this
Yeah that specific style happens to be quite tricky in Bark. Children's or high-pitched voices talking real fast. That's one you might just want to clone.
Haha yeah that's what I thought
That might be one that does need fine-tuning to do well, even. Cartoons style in general has a lot of background as we just mentioned, so might be kind of messy. I think that could be fixed but short term you might need fine tuning. Which is not even really a thing, yet, to be clear. But probably not too long.
I see! So the fine tuning would be feeding more audio data in the cloning process?
The cloning probably just needs a short sample, and generates a bark prompt. An .npz file. Fine tuning would be a more traditional process where you use a lot of data and longer time. They are both cloning, but I was using cloning to mean the first, yeah.
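For anyone wondering what is actually inside one of those .npz prompts, here is a minimal sketch for peeking at one. It assumes the three array names Bark speaker prompts are generally reported to use (`semantic_prompt`, `coarse_prompt`, `fine_prompt`); the helper names `describe_prompt` and `missing_keys` are made up for illustration, and your file may differ.

```python
import numpy as np

# Array names a Bark speaker prompt is assumed to contain; verify against your own file.
EXPECTED_KEYS = ("semantic_prompt", "coarse_prompt", "fine_prompt")


def describe_prompt(path):
    """Return {array_name: shape} for every array stored in the .npz at `path`."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}


def missing_keys(path):
    """List the expected prompt arrays that are absent from the file."""
    with np.load(path) as data:
        return [key for key in EXPECTED_KEYS if key not in data.files]
```

If `missing_keys` returns anything, the file is probably not a speaker prompt in the format assumed here.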
Are you planning to release over this weekend?
yep
Gotcha! Appreciate you answering my noob questions man, thank you
i'll see how it does on this
My guess is it's hard to get it to generate clean voice only samples. But maybe it works!
after a few tries it got the voice, but not really following the prompt
Pretty good, yeah.
Holy crap my mind is blown lol
Really good actually. That's a tough style in Bark. A+
That's amazing damn
lmao what happened, i used the speaker prompt that resulted from the last gen's audio and it just switched up
Yeah, cartoon voices man.
lol
the voice just switching lmao
Literally every cartoon voice I have does that
is that the model trying to use its trained data to fill in gaps?
maybe i can play with the temperatures a bit until i get a better output from bark, since using those outputs is even better since they're "true" semantics
I think it's really just because real cartoons switch voices very frequently. So Bark is likely to do it too.
And how do you test the cloned voice from them on? Is it a .npz file that gets spit out?
That's right.
Aaah ok things are starting to click lol
first it creates an npz, which will have a lot of variety in its continuations still, but often gets a realistic response. if you get it to generate more audio in the correct voice with that, you get another npz to save, which will have the consistent voice that you want
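The two-step process above can be sketched roughly like this. It is not the actual cloning code: `generate` and `accept` are hypothetical callables you supply, where `generate(text, prompt)` is assumed to return a `(full_generation, audio)` pair and `full_generation` is assumed to be a dict holding the three prompt arrays. If I remember right, Bark's own `generate_audio` can return those tokens when called with `output_full=True`, but check that against your version.

```python
import numpy as np

# Array names assumed for a Bark-style speaker prompt.
PROMPT_KEYS = ("semantic_prompt", "coarse_prompt", "fine_prompt")


def refine_prompt(first_prompt, text, generate, accept, out_path, max_tries=5):
    """Generate with the cloned prompt until `accept` approves a take, then save
    that take's tokens as a new .npz prompt.

    Returns out_path on success, or None if no take was accepted.
    """
    for _ in range(max_tries):
        full_generation, audio = generate(text, first_prompt)
        if accept(audio):
            arrays = {key: np.asarray(full_generation[key]) for key in PROMPT_KEYS}
            np.savez(out_path, **arrays)
            return out_path
    return None
```

In practice `accept` would be you listening to the take and deciding whether it has the right voice; the second .npz is fully Bark-generated, which is the point of the trick.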
that's how i did the one with joe biden
Got it! Wow blowing my mind here
Yeah, Bark doesn't need fine-tuning. Unless you go way way outside the norm.
Would it be ok for me to get the .npz file to test it out on my end as well?
yeah
still testing a bit though, i want it to both have the voice, and correctly follow text
yeah no rush! Thank you
You want to be truly mind blown. Trained in just 8 hours on a 3060.
Faster than cloning a single voice for many TTS models.
lol that's nuts
There are some TTS models that people fine-tune on a voice for longer than 8 hours, to make ONE clone.
the training itself took like 20 minutes for the model i made lol
on 8 hours of audio data
i guess this would technically be zero-shot voice cloning right?
I'm not really sure how people use zero-shot in audio stuff, not sure.
I remember at the beginning of this year when I was looking at how to do voice training and on the videos the people would train for days on a 4080 to get anything that sounded at least human
on like 30 hours of audio data
for those David attenborough voices
I wonder if the appeal will fade once it's just baked into every single app or whatever
Probably, most people will use it to say some funny words and move on
There will always be a demand for higher quality, but rather than pure voice, it'll almost be 'acting quality' literally how good of a performer the voice model is, similar to how you might judge a real actor.
A short sample will probably sound the same, but reading a paragraph from a book, the better model will have a lot more nuance.
Yeah for sure
Even now I sometimes evaluate just the prompts/speakers like that, in Bark. Not a ton, but there seems to be some subtle differences between speaker variants that are otherwise sonically similar.
It's a little hard to judge in 14 seconds, but I'd guess it stands out more if the generations were longer.
Especially an audiobook, a full paragraph has a kind of rhythm to it you can almost feel, with a real reader. But this version of Bark can't really ever see the whole thing.
Right I think that's gonna be the hardest thing to capture. I think the voice that gets the closest is Google's Voice Assistant, which I think does a better job than Siri in sounding real
So you can kind of cheat it right now in Bark. In the simplest way possible, using a 'beginning of paragraph' speaker .npz, and one for the end. lol. It actually basically works. But obviously doing it for real would sound way better.
Just kind of randomly checking variants and trying to find the right ones for that.
It's also a little TOO regular, at least when I just tested the idea. Maybe okay with more subtle variants than I tried.
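A tiny sketch of that beginning/end trick, under the assumption that you already have separate prompt files for the opening and closing of a paragraph. The `middle_prompt` argument is my own addition (the regular speaker for everything in between), and the function name is hypothetical.

```python
def pick_prompt(index, total, start_prompt, middle_prompt, end_prompt):
    """Choose a speaker prompt for segment `index` out of `total` segments."""
    if total <= 0 or not 0 <= index < total:
        raise ValueError("index must be within range(total)")
    if index == 0:
        return start_prompt      # paragraph opening
    if index == total - 1:
        return end_prompt        # paragraph closing
    return middle_prompt         # everything in between
```

A one-segment paragraph gets the start prompt here; that is an arbitrary choice.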
Wow interesting
Yeah man, Bark is literally barely even explored right now.
Yeah I'm amazed more people aren't talking about it/testing out
You can do the same thing for emotions, whispering (probably didn't test that), whatever. A lot of work to get good variants, but it's a really simple idea and you only need to get them once.
A feature I might add to my fork is a fake prompt that is only used as a history input. So you say, "I went home to work. (I am crying I am so sad.) I said hello to my wife." And basically it uses the middle sentence in the history, but not the audio output. Just splitting text segments, not anything fancy
I think it's probably super super unreliable
but it seems fun
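A rough sketch of the history-only idea, assuming parentheses mark the text that should shape the history but be left out of the final audio. Nothing here is from the fork itself: `generate(segment, history)` is a hypothetical callable returning `(audio, new_history)`.

```python
import re


def split_history_only(text):
    """Split text into (segment, keep_audio) pairs.

    Parenthesized parts are returned without the parentheses and with
    keep_audio=False; everything else has keep_audio=True.
    """
    segments = []
    # The capturing group makes re.split keep the parenthesized parts.
    for part in re.split(r"(\([^()]*\))", text):
        part = part.strip()
        if not part:
            continue
        if part.startswith("(") and part.endswith(")"):
            inner = part[1:-1].strip()
            if inner:
                segments.append((inner, False))
        else:
            segments.append((part, True))
    return segments


def render(text, generate):
    """Generate every segment in order, passing history along, but return only
    the audio of segments marked keep_audio."""
    history, kept = None, []
    for segment, keep_audio in split_history_only(text):
        audio, history = generate(segment, history)
        if keep_audio:
            kept.append(audio)
    return kept
```

The hidden sentence still gets generated, so it costs time; it just never reaches the output.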
Oh wow Literally like a VO Director directing talent on how to deliver the lines, that's cool
So the idea is, it's like you asked Bark to generate the audio with that sentence in the tone, but it doesn't actually include the middle sentence. But if you try this now, you'll find it doesn't work reliably. You really gotta RNG it up to get a good one usually.
I used to be a VO recording technician in the past, one thing we would always do with a VO talent is not tell them exactly how to do a line, but ask for more or less of an emotion, or mention the tone
With Bark, it's usually even more subtle
You have to write text that just would be said in the way you want.
as in "can you finish this sentence on a higher tone?" or "Less crying, more anger"
A better example would be (My dog was dead and I started sobbing.) or something, truly over the top, but then the line would be read more sadly in Bark.
incorporating that into prompts would be gamechanging
It's not gonna be great in the quick version, since splitting the sentences alone kind of makes it less expressive and feel less connected.
Still could be useful and it's easy to implement, so whatever.
And I think I already mentioned, but using this to control emotion doesn't actually work well. It just works once in awhile.
Are there variables in the training data that classify emotions?
or tags, not sure how to call it
No idea. I'd guess no with pretty high confidence, just seems against the spirit of the project.
I see. Could be interesting to train data with those "tags" from the beginning. Maybe that would be easier to control once data is trained?
The one thing they've talked about is the training data is pretty raw, not noise filtered, just raw audio
So that's why I guess no
If you search gkusko (trying not to ping) he's talked a lot in this discord about it. But it could be weeks ago.
I'll search it out
They might hope Bark could eventually do classifying like that on audio. A true foundation audio model.
It probably can, sort of... I can almost imagine something.
Imagine a comically simple method. Encode the audio "I am feeling" into bark, using something like Mylo's model. Then prompt bark to continue the audio, which is just a flag, you can force it to do that.
I bet you get well, mostly nonsense, but statistically I bet different emotions have some correlation in the likely words you hear to the actual emotional tone in the audio.
The least accurate and most inefficient emotion classifier system you could possibly make. But I bet it's more accurate than chance with enough samples. Basically my specialty. Also I guess this only works on the audio of the words "I am feeling" ? Truly the most useless machine. Actually, why not clone the voice of the audio sample, then add a fake "I am feeling" to the end, then give that to Bark. Even more convoluted. Excellent.
If there's a terrible way to do something, that's the way I've probably done it. The funny thing is some of it just actually works straight up in Bark.
I bet it works better than trying to generate punchlines with Bark. https://twitter.com/jonathanfly/status/1650001584485552130 (Though it did actually do a few with the Chicken crossed the road joke.)
Gosh I haven't Tweeted in forever.
You should! You got a whole follower base already
I do have like hundreds of samples of the US presidents singing American pie, right in front of me. The world should hear this.
I'll retweet it to my 100 followers lol
My Twitter is actually a weird mix cause like it's half from crazy jokes, and half from when I used to read https://arxiv.org/ for AI papers every day and was constantly Tweeting paper summaries or usually quickly trying the paper repo, so like serious science people. So I honestly get gunshy and try to think like, can I split the difference somehow for both groups, lol.
Yeah that's stuff that goes over my head for sure
but the funny AI videos I like lol
It's so expressive, right? The voices are so insanely human. On the github for my bark fork there's like 20 minutes of that. I love it so much.
I don't really use TTS in practical use, but it was just so lively and real feeling, I got kind of addicted to poking at the audio model. For literally no purpose but to see what it did. And that's how I somehow ended up spending half my week's free time trying to get Donald Trump to sing in a ridiculous French accent.
I Tweeted Don't Sleep on Bark when nobody knew about it. Bark has 20,000 stars now... and people are still sleeping on Bark. That's the Tweet. Legit like, Mylo cloned in 20 minutes. I threw sticks and stones together and got French accents, somehow. (I might be overemphasizing self derogatory humor a bit here, it was real work and quite a bit of coding, if not very sophisticated.) Imagine what actual domain experts really building on the Bark model will do.
Hahha funny you say that I just got the French voice to talk in a great accent in english too, using the output npz files you guys suggested earlier
Yeah it's cool, many are multilingual.
Specifically I was trying to force any voice to have a specific accent, which normally you can't do. Like, ugh, I have some here somewhere.
some distinct voices are really easy to clone btw, anyone recognise this one?
Not sure, sounds like an announcer?
Sounds familiar but can't put my finger on it
postal dude
ahahaha winamp voice! That's a deep cut
Ahaha Postal dude yes there you go
Hey man I'm right around there too got the reference!
Postal dude is also a deep cut
2.91 version of the iconic software splash screen and remembered track
It's pretty similar!
hahaha might be the same voice actor
rick hunter, who is the voice of the postal dude in postal 1, 2, and as an option in 4, does do radio commercials
also, lol
my explorer hasn't realized that it's already the next day
Man I looked it up. The winamp voice actor vanished into thin air?
wild
Actual urban legend voice
hahaha wow
Oh shoot Jonathan I just noticed that you're the one who made Bark Infinity! That's you right?
Yeah. I don't know how I ended up so deep into audio to be honest. And I kind of need an exit plan eventually, cause like I don't even use it myself except to play and experiment, but I will be updating it soon and supporting it for the near future, for sure.
Man that's awesome thank you for your hard work! Been playing with it nonstop ever since installing it!
And now I feel some responsibility to update it.
My god though, I have like a billion changes for my accent stuff, so it's just gonna be a nightmare.
But I'll do a basic bugfix and core features, I mean 99% of people want clear voices and an easy installer. Literally nobody is asking for people to sing in comical accents, but that's what I'm dying over, lol. And the install is so confusing.
hahahaha it is a bit confusing, took me a little while to make it work on my end
I'm spending 10x the hours on these crazy ideas, and everyone keeps messaging me, "I can't install Bark."
I actually did start a basic patch some today, for real.
Miniconda saved me
If you or Mylo need help with anything visual or even audio data to be trained on just let me know seriously, you both are doing the Lord's work
There are some practical things I have learned, as a result of some silly things, that can be useful for simply joining audio clips more seamlessly, so it's not all for nothing.
absolutely Infinity is saving me a lot, cause using commands on prompts is the death of me haha
Cool. There are some rough spots that drive me nuts. I can't believe how difficult it is to like, load all your settings in Gradio. I looked at how Stable Diffusion does it and they just wrote their own JS thing that does it, basically just bypassing Gradio. Ugh
I knew Automatic1111 was Gradio, and I made a bad assumption that most of its features were Gradio's features. But actually, it's a miracle they got all that stuff to work in Gradio and they built a billion hacks on top to do it.
I couldn't figure out how to display a non fixed amount of audio samples, or let the user pick a folder of files, really simple stuff can be missing in Gradio.
There's a whole thread for Gradio complaining, haha. I really go off easily. It's really not the worst but I was frustrated a lot, and I hadn't developed in Gradio before last month, so I kept being like, "I must be missing a feature?" and actually usually it just didn't exist, lol.
I'm gonna have to rip the Stable Diffusion code just to refresh the file picker dropdowns if there's a new .npz file. How is this just not part of the base ui, lol.
Actually there is a funny thing I didn't think about. The actual most annoying thing about Gradio isn't Gradio's fault, really. It's that the newer Gradio API is so new that ChatGPT doesn't know anything about it, so you can't really use ChatGPT for Gradio dev. I never thought about that but actually, super annoying downside. Actually maybe it works now with plugins? But when I tried originally it was hallucinating all the time, worse than useless. It's an interesting factor to consider -- a software tool or library or language created or heavily modified after an AI model's cutoff date means the model is a much worse aid. Welcome to 2023. I wonder if that's gonna work as a negative pressure against new programming languages, an extra hurdle to overcome.
Hahaha wow yeah true
Have you tried asking Bard to check if it has the same knowledge?
I haven't, I should totally.
They say it has no cutoff date, not sure how exactly that is implemented, but it's got to be at least more recent.
My face when I first tried Gradio, tried using ChatGPT, and it hallucinated the most god damn beautiful Gradio API and code, with all the features I wanted, just perfect. And then I discovered it was all just complete fantasy. There's got to be a German word for this feeling by now.
Lol
@untold briar new programming languages can be easily added onto models via LoRA
Back on topic, anyone found some way to minimize the chance of audio becoming more metallic the longer it gets? It doesn't always happen but I haven't really noticed a trend in sampling parameters, or anything really obvious at least.
the only way to do that is to actually modify the way the pipeline works to use like, a split attention head
Are you replying to the negative prompt idea?
the audio going off the rails over time.
that, i believe, occurs because the attention head is linear and has a limited sliding window len
Huh. Is that a lot of changes, or could it be hacked on pretty quickly?
they can implement multi-layer single head attention and it could work better than multi-head split attention
it's a LOT of changes, but nothing that a genius like the ControlNet team couldn't hack up in a day
and i could also be way off base about it 
Can I also ask you about the negative prompt idea, as you seem to have some real actual expertise. Do you know if it even makes sense in LLMs, or if there's a reason why it's not done?
@open magnet sorry to ping you, but is that close to the mark for the reason we don't have longer than 13s, that the audio recordings even from a given prompt end up 'forgetting who they are'? I concat renderings and the voice noticeably changes toward the end of the more exciting statements.
I think the model was trained on segments of that size, and they do start a bit in the middle, to minimize edge effects, but not sure
@untold briar that would probably be a 'user preference layer', a LoRA. you can generate text embeddings as well. there's really no reason you can't do it. but it's likely that the pipeline would have to be altered to make that work. and i'm assuming that'll be less performant in a big way, or they would have done it.
i should clarify, less performant with our current hw/net architectures
With a regular text LLM it's kind of hard to imagine what exactly the negative prompt should do, if it works correctly. But with Bark I can almost see it. A negative prompt of "I HATE YOU AND YOU SHOULD DIE" might make it more friendly and quiet, something like that.
negative prompts can be thought of as weighted probability sets that influence the decisions made by the model, to drop the probability score of those tokens and reduce their likelihood of generation
they already do this but bake it into the model for things like the N word and so on. they provide positive and negative prompts via the Anthropic dataset. have you looked at that?
So the way I was gonna try it, very crudely. Generate one sample of the negative prompt. Compare that sample versus like an average english sample and try to pick out prompt-specific tokens, not language generally, with some stats. So you don't penalize the sound of a human speaking or something general like that. And then penalize those tokens in the real sample.
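The crude version described above could look something like this. It is only a sketch of the stats step, not anything Bark ships: the frequency-ratio test, the thresholds `ratio` and `min_count`, the flat `penalty`, and both function names are assumptions chosen for illustration.

```python
import numpy as np


def prompt_specific_tokens(negative_tokens, reference_tokens, vocab_size,
                           ratio=3.0, min_count=2):
    """Return token ids over-represented in the negative sample compared with
    a reference sample of ordinary speech."""
    neg = np.bincount(negative_tokens, minlength=vocab_size).astype(float)
    ref = np.bincount(reference_tokens, minlength=vocab_size).astype(float)
    # Add-one smoothing so tokens unseen in one sample don't divide by zero.
    neg_freq = (neg + 1.0) / (neg.sum() + vocab_size)
    ref_freq = (ref + 1.0) / (ref.sum() + vocab_size)
    mask = (neg >= min_count) & (neg_freq / ref_freq >= ratio)
    return np.flatnonzero(mask)


def penalize(logits, token_ids, penalty=2.0):
    """Return a copy of `logits` with a fixed penalty subtracted for the chosen tokens."""
    out = np.array(logits, dtype=float, copy=True)
    out[token_ids] -= penalty
    return out
```

Tokens common to all speech show up in both samples, so their ratio stays near 1 and they are left alone, which is the "don't penalize the sound of a human speaking" part.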
i'm just going to start training stuff on hollywood movies.
good quality images, audio, and, often, transcripts are baked into the format
I was thinking a super weird one might be 'audio description captions' like where they describe what's happening on screen. I don't know how useful it is, but they talk very precisely over lots of dynamic audio.
I mean, I don't know what you do with outputs like that.
But I was kind of curious what it would learn.
omg lmao
the Suno crew should be willing to describe how to do the training and stuff but idk when they'll decide that
baseline Stable Diffusion 2.1 -> my Lord of the Rings model (The Hobbit), early on in training
this is considered an impossible challenge, to style transfer into SD 2.x without destroying it. so, i'm definitely excited for the challenge with Bark.
Nice
The pace of dev in the Diffusion space is nuts. So many things all the time.
Random question before I'm gone for a bit, anyone gotten somewhat realistic applause sounds from Bark? I always get a very very electronic sound in the place that is clearly supposed to be applause.
sorry what's the question? happy to opine. 13s is a bit random. just the standard small LLM context size of 1024
i see. it's just that when i do have longer audio, the voice deviates from the initial audio, a LOT. it never seems to normalize itself against the initial sample
so i was wondering if the processing is fully linear, and if it is, would a split attention approach help with coherence, or would it hurt
i have only briefly looked at it and i definitely barely understood it, so my understanding is coming from the sliding window length stuff i messed with
By deviates, do you mean deviates from the sound of the voice, or the audio quality itself gets the metallic twinge, or both?
I'm not sure if they are correlated, but I can't really remember off hand. I don't remember noticing that specifically at least.
both
The metallic trend is such a consistent effect, it feels like there should be some fix, especially since it's not always there. I kind of wonder if the voice thing is a deeper issue, it's just hard to perfectly model a whole human voice from a history_prompt.
Totally uninformed feelings, lol.
Even in ChatGPT, with text, sometimes you feel it starting to degrade when you approach the context window, and it's less coherent. Not even go past it, just close.
oh, have you ever tried downsampling human voices to 22kHz? maybe tis' just that.
@untold briar it isn't modelling a whole human voice based on the recording. the model takes in the audio and behaves similar to something like GPT2 where it autocompletes what it is given, following the same patterns.
the emotions come from the actual training that went into the coarse model
a voice that sounds similar to X or Y has tensor space, physically co-located
there's interesting issues with 'overfitting' on the audio and i'm not certain whether an overfitted voice will perform better or worse. i know that overfitting the coarse model on one voice will result in a reduced variability of its outputs.
it likely just needs to be a larger model with more parameters to train.
it's a hard trap to fall into - thinking this is as good as it gets - but this stuff is a moving target, and it's likely that suno has internal models that far surpass this toy's abilities
Oh yeah, I agree. I guess I still meant like, presumably the prompt still has to be long enough for that? It's shorter than I expected in actual effect. 256 in semantic, like 206 or in that range in coarse for the semantic tokens. Fine strangely has the longest input but fine also doesn't seem to really matter much. I do find even literally randomly chopping up coarse prompts quite a bit can push it into a bit more expressive space, when the outputs seem to get stuck in a less expressive space. Just from my literal caveman science slicing and dicing. One thing I do all the time is trim both semantic and coarse, removing chunks somewhat randomly, to try and break out of some effect like background sound. Doesn't usually work but sometimes actually does. Or just resizing it fractionally, like, "I want a voice only a little like that..."
An interesting thing to do is render 20 versions of a history_prompt, across the full range of sizes 0 to 256, or 341 or whatever fine is I forget. And you can get a feel for how much prompt carries how much an effect on the voice. Really just 256 I could honestly hardly ever tell the difference in fine, I just saw the output was technically different unless I extended it that much. Using semantic length here, so adjust according per coarse ratio, I usually just think in semantic units even when messing with coarse.
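A sketch of that kind of sweep: trim a prompt to a given number of semantic tokens and scale the coarse and fine arrays to match. The token rates (49.9 Hz semantic, 75 Hz codec frames) and the array layouts (1-D semantic, codebooks-by-frames for coarse and fine) are assumptions about Bark's format, and the function names are made up.

```python
import numpy as np

SEMANTIC_RATE_HZ = 49.9  # assumed semantic token rate
COARSE_RATE_HZ = 75.0    # assumed codec frame rate for coarse and fine


def trim_prompt(prompt, n_semantic):
    """Keep the last `n_semantic` semantic tokens and the matching tail of the
    coarse and fine arrays (which are indexed by codec frame)."""
    n_semantic = max(0, int(n_semantic))
    n_frames = int(round(n_semantic * COARSE_RATE_HZ / SEMANTIC_RATE_HZ))
    semantic = np.asarray(prompt["semantic_prompt"])
    coarse = np.asarray(prompt["coarse_prompt"])
    fine = np.asarray(prompt["fine_prompt"])
    # Index from the end explicitly so a length of 0 yields empty arrays.
    s_start = len(semantic) - min(n_semantic, len(semantic))
    c_start = coarse.shape[1] - min(n_frames, coarse.shape[1])
    f_start = fine.shape[1] - min(n_frames, fine.shape[1])
    return {
        "semantic_prompt": semantic[s_start:],
        "coarse_prompt": coarse[:, c_start:],
        "fine_prompt": fine[:, f_start:],
    }


def prompt_sweep(prompt, steps=20, max_semantic=256):
    """Yield (n_semantic, trimmed_prompt) for evenly spaced lengths from 0 to max_semantic."""
    for n in np.linspace(0, max_semantic, steps).round().astype(int):
        yield int(n), trim_prompt(prompt, int(n))
```

Render the same text with each trimmed prompt and listen for where the voice stops changing; this keeps the tail of the prompt, which is only one of the ways to chop it.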
That's kind of how I got into manually hand crafting voices, cause you can kind of just mix them up like recipes and Bark makes it work, at least sometimes. Crazy.
To be fair there's no good reason to do this manually. Like seriously do it the right way, train a model like Mylo. But it was interesting that doing it by hand was actually kind of possible.
@untold briar - I've found that the voice gets more metallic as you keep passing the full_generation result of a generation in as history_prompt. Instead, if every couple sentences you restart from the base history_prompt file, instead of using the full_generation from the previous prompt, you "restart" the degradation, so the overall quality is higher.
That's correct. However I feel like there's a tradeoff in nuance when you do that. Have you compared the expressiveness of doing it both ways? Even if it's more metallic, the less chopped up version is generally more expressive in my experience. Oh I misread. I meant, even in single generation.
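The restart idea can be sketched as a small loop. `generate(sentence, history)` is a hypothetical callable assumed to return `(full_generation, audio)`, and `reset_every=2` just stands in for "every couple sentences".

```python
def generate_with_resets(sentences, base_prompt, generate, reset_every=2):
    """Generate sentences in order.

    History is normally the previous generation, but every `reset_every`
    sentences it snaps back to `base_prompt` to shed accumulated drift.
    """
    if reset_every < 1:
        raise ValueError("reset_every must be at least 1")
    outputs, history = [], base_prompt
    for i, sentence in enumerate(sentences):
        if i % reset_every == 0:
            history = base_prompt          # restart from the clean prompt
        full_generation, audio = generate(sentence, history)
        outputs.append(audio)
        history = full_generation          # carry forward until the next reset
    return outputs
```

With `reset_every=1` every sentence starts from the base prompt, which should be the least metallic and the least connected option.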
Like just one 14s chunk
Absolutely you can't pass full_generation over and over, without some other edits, yeah.
Hmmm yeah. Aside from that, I haven't found better strategies.
So if you break your 14s segment into two smaller halves, you do usually get less metallic.
But it's less cohesive as a sample.
Mainly you can just keep trying until it's not metallic. It just usually works eventually. So maybe there's some sampling tweak
Ohhh interesting. So that means shorter samples are less metallic?
For me, usually the effect increases over time
Oh wow, good to know. I haven't tested that
It's just strange how it's just perfect, and same sampling parameters, and then it's not.
My understanding is that shorter samples and longer samples both have a higher chance of hallucinations. The closer you can get to 14s, not too long and not too short, the lower the chances of hallucinations
So there's some trade-off between metallic voice vs hallucination probability
I think that's right, yeah, going to 1s is really bad
By going shorter
I've tried everything, can you tell, haha. If you give semantic two or three words at once. I mean it works but it sounds like you handed an actor a notecard with just those words, they said them, and then you gave them the next notecard.
Even if you pass the history correctly, from prev segment, for each two or three word chunk. You super feel the cadence of the original chunks in the voice.
I think this actually doesn't even work at all in the small models.
But the large one kind of suffers through it, but it just sounds terrible.
It makes sense right? Imagine if you gave an actor the first three words of a line read.
How can they possibly know how best to say the full line from that
Hmm - yeah I feel like even passing ~12s of audio, it often doesn't sound good in long sentences
Also I tried using a single semantic generation, with two coarse gens. And IIRC that didn't fix the metallic sound.
If you break a 20s sentence into two parts
So I think it's from semantic. But honestly I didn't double check that.
hmmmm
It feels like it should be in the coarse, so honestly, somebody should double check. Or I will someday
I might still have the notebook, maybe I double check if I screwed up. Cause it feels wrong the effect is in the semantic tokens.
I think theoretically it can be in either. Semantic tokens can probably generate that metallic sound if it's low quality
Yeah they are both super expressive, changes to either can change so much
Someone mentioned even Eleven Audio has that effect too? At least sometimes? so maybe it's not a simple fix
I also agree the biggest audio segment you can get away with the better. Give the model a lot of info. Imagine if you could give it a whole paragraph, so it can properly start and finish like a real audiobook reader would go through a paragraph with a specific rhythm.
You do end up with way too fast talking sometimes, but in general, aim big
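One way to "aim big" without cutting mid-sentence is to pack whole sentences into each chunk up to a word budget. This is just a sketch: the `max_words=35` default is a guess at what fits in one generation, not a measured number, and the sentence splitting is deliberately naive.

```python
import re


def pack_sentences(text, max_words=35):
    """Group whole sentences into chunks of at most `max_words` words.

    A single sentence longer than the limit becomes its own chunk, unsplit.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

If the output talks too fast, lowering `max_words` is the obvious knob to try.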
This is one hacky way to adjust speaking speed as well. Keep trying large chunks of text and saving, or small chunks of text and saving, until you get something that still sounds the same but generally talks faster or slower.
If you don't need that specific voice though, just start with a fast or slow one.
I really want .npz to come with variants, baked in. It's super annoying to get good ones, but somebody only needs to make them once.
fast slow sad happy whispering, whatever
I just tried for whispering, because somebody in here wants that specifically. It generally changed the voice, so that one might not work. But I was only trying on one npz.
It should work, probably just unlucky
btw if by some freak chance somebody has been generating tons of clear French singing npzs, I would be eternally grateful for them.
I'm a bit more optimistic about Bark for music, if it's pure singing, with little to no background. That seems doable.
Bark can do two music things. Pure singing... well that's a maybe. And sick beats. https://soundcloud.com/jonathan-fly-620508219
https://huggingface.co/GitMylo/bark-voice-cloning
https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer
release
@untold briar @glossy trout
Nice, you're gonna kill me, it's 5AM, but I am curious
Should I literally make tea
jesus
I guess it's saturday
The other annoying voice in this otherwise great clip pops in or out based on some really tiny token changes in the prompt. Clip a few tokens and there he is. Carry the clipped segment into the next, he doesn't appear. If only I could carry these tokens into any clip and ward away voice changes. (I tried, didn't work, lol.)
Is this it mylo??
Hooo baby! Thanks man!
Is it useful to use really long data of audio? like 45 mins or hours?
8 hours for the training of the voice cloning model
6 seconds or more for a voice you want to clone
there's already 2 pretrained models, 4 and 14 epochs
they should be fine for most purposes
So I would probably get the same output if I use 10 secs of cloning or 10 minutes of cloning?
for cloning*
yeah, just make sure your audio clip fits the recommendations explained in the git repository
yep yep
clear, no music, ends after a sentence
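A rough way to check the length recommendation before cloning, as a minimal sketch: it only measures duration with Python's standard `wave` module (PCM .wav files only), and the function names are made up for illustration. It cannot tell whether the clip is clear, free of music, or ends on a sentence.

```python
import wave

MIN_SECONDS = 6.0  # "6 seconds or more", per the messages above


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, min_seconds=MIN_SECONDS):
    """True if the clip meets the minimum length suggested for cloning."""
    return wav_duration_seconds(path) >= min_seconds
```

Calling `is_long_enough("my_voice.wav")` on your own clip (the filename is just an example) gives a quick yes/no before you bother uploading it.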
Does this have a UI yet, or just the source files?
this can be implemented in any ui etc
I'll add it, maybe Sunday?
i can beta release my webui, it's just really early access and stuff
like most of this is practically useless
gotcha
still kinda cool to have this refreshing thing though
Jonathan you're the man!
gradio has most things you need, just not very clear how to do them
Awesome!
Does it refresh automatically? I genuinely thought I had to do it myself?
no, not here, it would be too slow
If the user generates a sample. Gradio won't see the new npz by default
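One way a UI could pick up newly saved voices is to rescan the speaker folder whenever the user asks for a refresh. A minimal sketch of just the scanning part, assuming the voices are .npz files in a single folder; `list_speaker_files` is a hypothetical helper, not something from Gradio or the webui, and wiring it to a refresh button is left out because that depends on the Gradio version.

```python
from pathlib import Path


def list_speaker_files(speaker_dir):
    """Return the sorted names of all .npz speaker files in a directory."""
    directory = Path(speaker_dir)
    if not directory.is_dir():
        # A missing folder just means there are no custom voices yet.
        return []
    return sorted(p.name for p in directory.glob("*.npz"))
```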
like, i made this a while ago, and it made me want to learn more about ai to make my own models etc
So I found if you randomly sample and penalize tokens from language, you can just get random accents.
Even without a lot of data to use as reference
makes sense
This was the only time I got this accent, but I forgot to save the random tokens!
also, do you allow people to use empty prompts or na?
hmm, what if someone made a model to enhance audio, so you can replace it in the semantic history to get a higher quality voice
confused
both prompts though
I think I mentioned, encoding bad audio to fix it
wonder if you can literally just disable the assert for empty prompts for enabling them
is a cool use
i want infinite famous people nonsense!
You can, ugh, I'm pretty sure I at least tried this
it was why I made extra confused mode an int
for how many segments of no prompt
haha
also, with my code it's kind of like, something updates in bark? my code will keep working, as i have to update it myself. It did not support v2 speaker prompts until i updated it myself lol
i wonder what that could do
but i'll see if i can increase the token limit as well, just slide the history window right?
dantdm changed language
3 sets of tokens? only the first one is limited though
well, limited the shortest at least
idk about the exact limits
Oh I meant like, all the parameters, it's really just 3 'sets of tokens'
history prompts work, so this should work too
"so i have this book, it's called"
I have at least 5 hours like that on my hard drive
maybe 2x that
For the two jokes. I was looking for real punchlines
bitteling
sometimes it makes a bit of sense, sometimes it repeats a word, and sometimes it's just a word salad
If you scroll WAY back. somebody from Suno ran it on their better or bigger model.
And it made a punchline on like sample 3!
Somewhere in this discord
So it's way better than our bark, for that. I ran so many
@copper pewter Thanks for the repo. I'm still wondering, how do I do the cloning? Sorry, I'm new to this.
I have the wav audio file which i want to clone and i have your repo.
https://huggingface.co/GitMylo/bark-voice-cloning
https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer
Now how do I proceed from here? If you can, please explain it so that all of us who are non-programmers can understand and use it.
I tried to use this script https://github.com/gitmylo/bark-data-gen to generate the data using Colab. It's running and has generated a bunch of .npy files. If this is the first step, what do I do next?
i'll release my webui in a bit, you can do it in there (just note that it's bleeding edge, and breaking changes might be pushed to main at times)
@copper pewter I'm just looking at your repo now - did you forget to commit hubert_manager.py?
possible
yeah looks like it, i guess pycharm didn't think it was needed since it was externally added (since i made that in the webui initially)
pushed it
I took a stab and downloaded https://huggingface.co/GitMylo/bark-voice-cloning/blob/main/quantifier_hubert_base_ls960_14.pth as the quantizer path
correct, that's the 14 epoch quantizer
great, thanks
clone, run run.bat and you're ready
should be, haven't done testing on more than 2 machines
untested on linux, etc as well
the other problem with doing ML stuff is I feel like I'm constantly buying/filling up hard drives
yeah, agree
i have an ssd with hundreds of gigabytes of stable diffusion models
and a hard drive with a hundred gigabytes of other models as well
Next Frame Prediction models.
So many
the checkpoints are each very different!
Even days apart
i should probably add a little thing to show the command line flags too
since the -si flag is really useful, it skips the install, so it won't check if your packages are installed before launching
You know I can barely type but your code is small and clean. Could I actually do this even on 0 sleep
hmnn
A think worth doing is a think worth badly. Maybe.
well let's try using it at least
Even clear documentation. Nice
You're the best Mylo! So I just download this and run the .bat file? Anything I need to do before that?
What was your og biden clip? Was it also too close to mic? Sounds like my Biden, funnily enough lol.
yep
You gotta carefully breed singing bidens into more likely singing bidens
haha this is hilarious
They are soooo good
the rhythm, sooo nice
Actually a ton of work, trying to get singing but not change voices too much, lol
You kind of need like tons of clones, one is not enough
I didn't make a tool yet, but the best trick: use a full, long 4 line singing prompt. But then cut the history prompt as short as you can, from the front, and preserve the og voice. best chance
like use first few seconds only
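The "first few seconds only" trick can be sketched as keeping the same leading fraction of every token stream in the history prompt, so the streams stay time-aligned. This is only an illustration on plain Python lists: it assumes the prompt uses the key names `semantic_prompt`, `coarse_prompt` and `fine_prompt` as in Bark's .npz speaker files, and `truncate_history` is a made-up helper. With real .npz files the values are NumPy arrays, where you would slice along the last axis instead.

```python
def truncate_history(prompt, keep_fraction):
    """Keep only the leading part of each token stream in a history prompt.

    `prompt` maps 'semantic_prompt' to a 1-D sequence, and 'coarse_prompt' /
    'fine_prompt' to sequences of rows (one row per codebook).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    def head(tokens):
        # Always keep at least one token so the prompt is never empty.
        return tokens[: max(1, int(len(tokens) * keep_fraction))]

    return {
        "semantic_prompt": head(prompt["semantic_prompt"]),
        "coarse_prompt": [head(row) for row in prompt["coarse_prompt"]],
        "fine_prompt": [head(row) for row in prompt["fine_prompt"]],
    }
```

How short you can go before the voice drifts is something you would have to find by ear; the fraction here is just a knob to turn.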
squidward as history, and it starts singing?
the stuff that it creates actually sounds like music
omg it sounds like squidward is summoning a demon
You are the real MVP! Can't wait to tinker with this!
Hey @copper pewter ! So where exactly do I run the command lines? I add that to the .bat file?
Also, where do I add the file to clone the voice in the UI?
sorry to bug you for tech support lol
the bat is for the webui, it has it built in
like, the huggingface link is the model itself, the bark-voice-cloning-HuBERT-quantizer repo is the semantic extraction code itself. the audio-webui is a webui i was working on
the reason i made voice cloning was because i felt like it was missing from my webui (although a lot of stuff is still missing)
Oh so the webui is a separate thing from the voice clone tool?
yeah, the webui has voice cloning integrated
Gotcha! and where do I do it exactly? lol sorry for the stupid questions
for the webui: you just load the bark model on the text to speech tab, and set "speaker from" to "upload", which allows you to upload a wav
Ah gotcha that's where I'm running into the low memory problem then I only have an 8gb gpu
where do i run the command for the low vram?
instead of directly running run.bat, you can add arguments, or create a bat where you run with arguments
gotcha!
would this work if i add this as the first thing on the bat file?
set COMMANDLINE_ARGS= --skip-install --bark-low-vram
Oh nvm it didn't work lol, the bat is still installing
I'll figure it out thanks for the clarification man
it doesn't use COMMANDLINE_ARGS
this isn't stable diffusion webui
instead, what you do is you create a new bat file, which does this:
call run.bat --skip-install --bark-low-vram
exactly where i got it from lol
ah that's what was missing! the call command. Thanks man!
Finally got it working you're the man Mylo! Thank you!
@copper pewter I'm planning on tagging the NPZs I make so I don't mix them up with regular generations maybe this could be a general convention? Just in case I might have wished to know in the future. Maybe
Is cloning only for English now ?
I mean other languages
I haven't had time, but it might be ok in European languages or ones closer to English
but it's lower quality for sure
I bet it could be fixed in a day, so no worries
I could probably do French, happen to have a lot
so you need to train another model for a different language, right ?
then all language mixed together?
It could be the case it works better if you split, I kind of doubt it but maybe
it hasn't been 24 hours so like, pretty early
I want to try Japanese and Chinese at the moment
I'm super short on time, but tomorrow I'll try a bunch
ok, I'll wait
Hi ! Where can I find a birdseye view on how bark works ? The global architecture, and the different steps ?
I am new to ml and tts and have been going around audio ml, spear, and other to chew on new concepts. I'd like to do the same with bark. Does it do text to semantic token to ?? To mel-spec to audio ?
no mel
look at audiolm
the same
a lot of it is almost exactly the same
there's no page like that for bark specifically so that's the best model
Anything I say is basically badly trying to summarize that
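To make the "no mel" answer concrete: as far as I understand it, the flow is text, then semantic tokens, then coarse codec tokens, then fine codec tokens, then a codec decoder that outputs the waveform. The sketch below only does the arithmetic for how many tokens each stage holds for a given length of audio. The constants are what I believe Bark's generation code uses, so treat them as assumptions, and `tokens_per_stage` is a made-up helper.

```python
# Assumed values, as I understand them from Bark's generation code.
SEMANTIC_RATE_HZ = 49.9   # semantic tokens per second
COARSE_RATE_HZ = 75       # codec frames per second
N_COARSE_CODEBOOKS = 2    # codebooks predicted by the coarse model
N_FINE_CODEBOOKS = 8      # total codebooks after the fine model
SAMPLE_RATE = 24_000      # output audio samples per second


def tokens_per_stage(seconds):
    """Return the approximate size of each stage's output for `seconds` of audio."""
    frames = round(seconds * COARSE_RATE_HZ)
    return {
        "semantic": round(seconds * SEMANTIC_RATE_HZ),
        "coarse": (N_COARSE_CODEBOOKS, frames),  # (codebooks, frames)
        "fine": (N_FINE_CODEBOOKS, frames),      # (codebooks, frames)
        "audio_samples": round(seconds * SAMPLE_RATE),
    }
```

Under these assumptions the coarse and fine stages share the same frame count and differ only in how many codebooks they cover.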
well the low level boring stuff i kind of know
Ok, I was reviewing this as well, cool. Does it use the PyTorch implementation of SoundStream or EnCodec? Both of those codecs can turn audio tokens back into audio, right?
Also generally speaking, why is there a separation between coarse and fine audio tokens?
that's a surprisingly adept question
I don't think they would be too mad if I mentioned, the devs said it's probably not actually ideal. But that's how it is.
haha
I mean still works great
but there were strong hints they probably would not do that again
Is that an architecture geared for training optimisation ?
that's past what I know, just a few details and practical stuff, not sure
Doesn't seem like it but no idea
really it just feels like its goals are general purpose and wide, first
so really that's it
It might train fast, I think somebody did something with a new language
very fast but I don't know
Ok, so that's also probably how audio lm work? But there must be a general reason for the why. Is there a Suno team member I can ask in here ? (I thought you might be)
admin in discord, but has sleep symbol
but those people
especially gkucsko if you catch him online
or just search the discord for his message perhaps
that might answer your questions pretty well
that's where all the knowledge would have come from, if somebody did know
Cool thanks let me ping @open magnet on the question then, for when he'll be around
They say they want to be a foundation model for audio. Foundation models aren't like about training fast, right? Just being super cool and awesome.
Actually that's not right, training fast can be part of that, in fact is
It's too late
I was thinking GPT like super big and slow, then I was thinking of Diffusion, all the Segment Everything things, and like training fast in some ways is cool
Traceback (most recent call last):
  File "D:\audio-webui-master\audio-webui-master\main.py", line 9, in <module>
    from webui.modules.implementations.tts_monkeypatching import patch as patch1
  File "D:\audio-webui-master\audio-webui-master\webui\modules\implementations\__init__.py", line 1, in <module>
    import webui.modules.implementations.ttsmodels as tts
  File "D:\audio-webui-master\audio-webui-master\webui\modules\implementations\ttsmodels.py", line 9, in <module>
    from webui.modules.implementations.patches.bark_custom_voices import wav_to_semantics, generate_fine_from_wav,
  File "D:\audio-webui-master\audio-webui-master\webui\modules\implementations\patches\bark_custom_voices.py", line 8, in <module>
    from hubert.pre_kmeans_hubert import CustomHubert
  File "D:\audio-webui-master\audio-webui-master\hubert\pre_kmeans_hubert.py", line 9, in <module>
    import fairseq
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\__init__.py", line 20, in <module>
    from fairseq.distributed import utils as distributed_utils
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\distributed\__init__.py", line 7, in <module>
    from .fully_sharded_data_parallel import (
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\distributed\fully_sharded_data_parallel.py", line 10, in <module>
    from fairseq.dataclass.configs import DistributedTrainingConfig
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\dataclass\__init__.py", line 6, in <module>
    from .configs import FairseqDataclass
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\dataclass\configs.py", line 1104, in <module>
    @dataclass
     ^^^^^^^^^
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'fairseq.dataclass.configs.CommonConfig'> for field common is not allowed: use default_factory
Press any key to continue . . .
I don't think I can possibly debug that this late but I might try and update my fork in an hour, if it can be done, and I have that running and didn't get that error
To me that looks like one library is some slightly wrong version
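For anyone curious what that ValueError means: it comes from Python's own `dataclasses` module refusing a mutable default value on a field. Below is a minimal reproduction using a plain list default (this is not fairseq's actual code), plus the `default_factory` fix the message suggests. My understanding, stated as an assumption, is that Python 3.11 tightened this check to reject any unhashable default, which would explain why an older fairseq release trips it on a newer interpreter.

```python
from dataclasses import dataclass, field


def mutable_default_error():
    """Declare a dataclass with a mutable default and return the error text."""
    try:
        @dataclass
        class BadConfig:
            tags: list = []  # rejected when the class is created
    except ValueError as err:
        return str(err)
    return None


@dataclass
class GoodConfig:
    # default_factory builds a fresh list for every instance instead.
    tags: list = field(default_factory=list)
```

With `default_factory`, each `GoodConfig()` gets its own list, so appending to one instance does not leak into another.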
good luck figuring it out
try to add anaconda installation
omfg I really am asleep. Search my history I did have that
Just pip install fairseq
forget conda
too out of date
some breaking change
sorry that's hilarious, literally that exact problem hours ago, like 5
it should be like a thing, if you make breaking changes, then update your conda wtf
actually not sure how that stuff is organized exactly, now that I think about. but surely somebody is to blame
Also reading the SoundStorm page https://google-research.github.io/seanet/soundstorm/examples/ this doesn't use coarse and fine tokens, I believe? And is orders of magnitude faster for inference?
Sounds a lot less expressive to me though
Not to mention, all the same style
Bark is really wide
all those samples are a tiny blip in Bark space
the same dot
they are okay but all kind of close. though maybe Google just chose them all in that style
i'm being hyperbolic but I do think Bark is quite a bit better
Ok, I personally am blown away by the SoundStorm demo. Generating 30 seconds of dialog from 2 seconds of audio, in exactly the same voice, is way over any results I've seen from bark right
They do 0 seconds as well
I mean
like a perfect model, no wav input
I have like 10
IT's nuts
I keep meaning to do a writeup but never got around to it
you can hone in on them in the latent space
like anyone it's crazy. no wavs
it's a bit pointless now that we have actual cloning
but super cool
not really anyone, presumably famous people
I would explain but basically just keep tweaking bark voices directionally
Personally I don't want cloning. I want to be able to generate 10-20 voices that are general purpose for audio books and podcast, and even the top would be prompting the model for a voice
I'm a bit baffled by the cloning myself
Cloning is just a quick way to get a voice
Why do people want it so bad? If it's just clear voices, you won't need it
Just need a bit of time
Time is exactly what people don't have, and skills. Voice cloning removes those problems.
Also, long form right
I tried to do the long form bark, could not get anywhere. I still want to learn and improve my bark-fu, so I'll get back to it
But can you share your best inferences on your best voices ?
I can totally but can we make it tomorrow, I'll be doing dev on bark probably
i mean i probably have random wavs but
So what I have handy is trying to make the presidents sing
lol
These are some of my favorites to be honest though
Do you have paragraph long audio ?
Singing is hard and tends to lead to distortion or voice changes
Like long paragraph ?
Doing singing is super impressive and cool, but not very useful (at least I think) it's like a beautiful geeky art
these were not chosen for clarity at all just some i have
typically the clarity is mostly in the voice
but Bark does have a reliability issue. but non real time? you can just try again and get clear
Get that. But imagine you want to feed it a book or an article. You need a reliable pipeline, not manual trial and error. I am sure there are ways to get there but bark is not there rn
Do you think so? I feel like that's not right exactly
Oh nvm
I was gonna say you would direct an actor
And listen to the whole book
but you mean, as purely automated
My knowledge is super limited actually, maybe bark can actually do all that !
Yeah I think Bark is better thought of almost like an actor. Not a script. So you may have to be involved, but the final result is a lot better
(can't download your sounds from the app, I'll try the PC later)
Apologizing to have all this discussion on the main channel. Let's do a thread
tomorrow i'm out for the night, even mentally
Bark and longform
Bark makes mistakes in Japanese readings
I haven't been able to run the ui yet
trying to upgrade conda
and my base env fucked me up
There are multiple UIs
@copper pewter Would it be possible to correct incorrect readings/pronunciations with a fine-tuned semantic model?
Though the dataset would be another challenge.
I think I'll push mine to git, even just mid changes, cause it does work and it's been weeks now. But you may have to figure out what libraries are new or changed yourself, or wait for tomorrow
For installing or whatever
But it has a copy and paste huberty myloizer
Yes, but you'll have to make it learn those semantics, so you'll need to use a model to extract them
That is super exciting, because it suggests that there's an actual method to improve upon this model's weaknesses.
There are multiple UIs if you want an easier install.
how to clone with audio-webui
I have no idea, there is not a place for audio input ?
since I used anaconda, I might have just skipped all the model downloads
put the "speaker from" on "upload", then upload your audio file
and if you want to replace a voice in an audio clip, put "Input type" on "File"
when you clone a voice, it gets saved in your bark custom speakers directory,
the button you get to download the speaker is if you want to download the speaker based on the audio generated
it should automatically download them from facebook and huggingface?
I used anaconda, so I just comment out something in main.py and run it
or how to disable venv installation in run.bat process
Yeah they should be automatic
just if the app is loaded
I actually used regular installer though, like used the .bat
however I had tried conda, that's why I knew the fairseq thing
but nothing is downloaded, I am in webui
from webui import args # Will show help message if needed
#from install import ensure_installed
#print('Checking installs and venv')
#ensure_installed() # Installs missing packages
from webui.modules.implementations.tts_monkeypatching import patch as patch1
patch1()
print('Launching')
from webui.webui import launch_webui
launch_webui()
right, that's how you disable the install and venv check/activation, (since it only checks if it's in venv i think, i don't have conda and don't want to install it)
yeah, just no models
it downloads the cloning models when you start cloning for the first time though
it doesn't download on install, in case you'd never use it, it would be a waste of storage space
I have no idea where the upload is