#📚┃suno-school


turbid vault
#

Another question: can I train with a dataset of Japanese voice and make it generate voice in English?

crude raft
#

don't know if this is the right forum for sharing this, but i found this unofficial repo for training Bark: https://github.com/anyvoiceai/Barkify, thought this could be useful. any comments?

turbid vault
#

looks promising

copper pewter
#

yeah, making a little thing for it right now

#

as a separate script though, so you can just process your generated semantics into wavs

copper pewter
#

fixed the "rn", it was putting the literal escape sequences \\n and \\r instead of actual \n and \r characters, so i just had to convert them

#

or maybe i should decode it better

#

👌

#

sounding like a nonsensical news article

#

@glossy trout the repo has been updated, there's now a create_wavs.py, and the texts don't have weird random tokens through them

#

the create_wavs.py will just create a wav for every semantic prompt it finds in the semantics output folder

copper pewter
#

it won't process the same file twice in the wav generation, that's on purpose
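
A minimal sketch of the behaviour described above (one wav per semantic file, files that already have a wav are skipped). This is not the actual create_wavs.py: the `.npy` extension, the folder layout, and the `render` callback that turns one semantic file into one wav are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, List


def pending_semantic_files(semantic_dir: Path, wav_dir: Path) -> List[Path]:
    """Return semantic files that do not have a matching .wav yet."""
    pending = []
    for semantic_path in sorted(semantic_dir.glob("*.npy")):  # extension is an assumption
        wav_path = wav_dir / (semantic_path.stem + ".wav")
        if not wav_path.exists():  # already converted files are skipped on purpose
            pending.append(semantic_path)
    return pending


def create_wavs(semantic_dir: Path, wav_dir: Path,
                render: Callable[[Path, Path], None]) -> int:
    """Render every pending semantic file; `render` is a hypothetical callback."""
    wav_dir.mkdir(parents=True, exist_ok=True)
    done = 0
    for semantic_path in pending_semantic_files(semantic_dir, wav_dir):
        render(semantic_path, wav_dir / (semantic_path.stem + ".wav"))
        done += 1
    return done
```

Because the skip check only looks at whether the output file exists, an interrupted run can simply be restarted and will pick up where it left off.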

copper pewter
#

3000 something semantic files lol, i have some processing to do

#

old dataset was 900 files of 100 tokens (about 2-3 seconds)

#

for the new one i've already got >3000 semantic files which are about 10 seconds long
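
For a rough sense of the clip lengths mentioned above, token counts can be converted to seconds. The sketch assumes a semantic token rate of 49.9 tokens per second, which is the value I believe Bark's reference generation code uses; treat it as an assumption and adjust `rate_hz` if your setup differs.

```python
# Assumed semantic token rate (tokens per second); not confirmed in this chat.
SEMANTIC_RATE_HZ = 49.9


def tokens_to_seconds(n_tokens: int, rate_hz: float = SEMANTIC_RATE_HZ) -> float:
    """Approximate audio duration covered by n_tokens semantic tokens."""
    return n_tokens / rate_hz


def seconds_to_tokens(seconds: float, rate_hz: float = SEMANTIC_RATE_HZ) -> int:
    """Approximate number of semantic tokens needed for a given duration."""
    return round(seconds * rate_hz)
```

Under that assumption, 100 tokens is about 2 seconds and a 10 second clip is about 499 tokens.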

glossy trout
#

Today is a bit of a crazy schedule - I will try to switch to wav creation, not 100% sure depends on how complicated it is

#

It's currently cranking out semantic files

formal pebble
#

Hi all. Is it possible to use Bark for real-time TTS on React Native?

untold briar
#

Only for one particular definition of real-time, in that a very fast GPU like a 4090 can almost generate 14s of audio in 14s (though not quite realistically). But not for latency.
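
The throughput side of this is usually expressed as a real-time factor (generation time divided by audio duration). A small sketch; the function name is made up here, and it says nothing about latency, which is the time until the first audio is available.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute per second of audio; 1.0 or less keeps up with playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds
```

Generating 14s of audio in 14s gives a factor of 1.0, but the listener may still wait most of those 14s before hearing anything if the clip is produced as a whole.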

copper pewter
#

like 33% through now

glossy trout
#

@copper pewter

#

Here are the semantic files

I couldn't do the .wav files today. It's 4 different servers and I was on the road a lot of the day, so I couldn't ssh into them and update the script

#

Let me know if you still need them next week, I can run another process to create them

#

or LMK if you're able to create them and don't need them any more

#

Either way is OK, happy to run more GPUs, just let me know what you need and when. but I'll be out of town the next few days.

lunar glen
#

I'm able to finetune the existing coarse model on a new language and a new set of codebooks. I kept the fine model as it is and just generated some samples from ground-truth semantic tokens. Can you spot which one is the original human audio?

glossy trout
#

@lunar glen - I would guess #3

#

Can you share how you retrained the coarse model?

spring herald
lunar glen
#

it involved multiple steps, I'll write a clean doc sometime this week. Your guess is correct 😄 It's finetuned for just 30 mins on an RTX 2070 😛 So I guess it needs more time and data 😄

graceful condor
#

if it helps, I was able to drop the generation time and improve audio quality while doing so 😲

glossy trout
hasty acorn
#

Hi guys

#

i need help, i want to generate audio from long text. i tried many things but they didn't work

#

i have generated an audio file but there's no sound in it

copper pewter
#

i wonder if ~8 hours of training data will be enough, but i'll just have to try lol

lunar glen
#

btw if you finetune with new codes the old knowledge will be lost by default. There may be some ways to make it still preserve the old knowledge. Now to build a working text to speech model, the semantic model also needs to be trained. That I have yet to do.

copper pewter
#

nice, i'm also using the hubert_base_ls960.pt, but a custom model for kmeans, to try and make it compatible with everything

#

it does make sense that finetuning bark's semantic (text_2.pt) would work faster than training from scratch, i just didn't think of it. The back-up for voice cloning will always be a full custom bark model though

copper pewter
#

meanwhile i'm still generating training data, it wasn't running while i was asleep, but it's been running for a few hours already again

#

and i'm over 50% done now

feral knoll
# copper pewter try cloning your own voice on your finetuned bark models, since you can extract ...

By 'finetuned bark models', are you referring to already finetuned bark models? Or do you mean fine-tuning the raw bark from scratch? I haven't encountered any repo to fine-tune Bark with, for example, LoRA, except this issue post: https://github.com/suno-ai/bark/issues/117

copper pewter
#

you could probably also figure out loras based on what we have

glossy trout
#

@lunar glen @copper pewter - Do you think text_2.pt (Bark's text_to_semantic model) was trained using hubert_base_ls960.pt?

copper pewter
#

probably not, and definitely not those kmeans, the kmeans tokenize to 500 tokens, but bark uses 10000 tokens

#

however i do believe you can achieve good voice cloning by using hubert_base_ls960 with a custom tokenizer which tokenizes to 10000 tokens

glossy trout
#

Ah I see. You'd probably lose some fidelity though, since going from 500 to 10k will be a little lossy

copper pewter
#

oh, and i'm currently at 1.62gb of wavs for the semantics

glossy trout
#

Noooiiiiccceeee

copper pewter
#

still just the semantics i generated yesterday lol

copper pewter
glossy trout
#

Btw, just FYI the audioLM paper originally used w2v-BERT to extract semantics from sound: https://arxiv.org/abs/2108.06209

copper pewter
#

it's replaced with my model which fakes it to act like the one used to train bark as well as i can manage

copper pewter
arctic pasture
#

is anyone aware of any multi-modal or prompt-to-audio models similar to Google's latest project, where you can enter a prompt and it will generate sound effects or music?

glossy trout
copper pewter
#

it's a recreation of google's musiclm, just like bark uses a recreation of google's audiolm

arctic pasture
copper pewter
#

3002/3079

graceful inlet
#

I thought suno/bark didn’t allow that

copper pewter
#

because the legal things around a company doing that are not very clear for example

graceful inlet
copper pewter
#

yeah, or fine tune a model to use different semantics and do voice cloning from there
or train a model to create valid semantics for the regular bark models (which is what i'm doing)

#

the data is finally done

copper pewter
#

just preprocessing the data now

graceful inlet
#

Nice

copper pewter
#

this is from the start, it seems to be going pretty fast lol

#

with how much variety there is in the training data, it has not had any duplicates yet. so the loss going down is a good sign (loss shows every 50 batches, 1 batch is a single audio clip)

brisk crown
copper pewter
#

3060 12gb

#

the training is only taking 3.2gb vram total though, you could get it running on a 1650 if you tried probably

#

passed 2 epochs so far

brisk crown
#

I see... if you ever need a faster GPU lmk cuz I have a 3090 and would be glad to help out in any way possible.

hollow root
#

I like the spirit here

copper pewter
#

it seems to be fine on my 3060

brisk crown
#

Another thing is if it's only using 3.2gb of vram, shouldn't you up the batch size to use more resources?

#

Cuz honestly I think if it's only taking that amount of vram, it sounds like you're not utilizing the full power of your GPU

copper pewter
#

there's a few things that are still trial and error, like whether to cut unequally sized outputs from the start or from the end
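
The start-or-end choice can be made a single switch so both variants are easy to compare. A minimal sketch on plain token lists; the function name and the `keep` parameter are invented for illustration and are not taken from the training code being discussed.

```python
from typing import List, Sequence, Tuple


def trim_to_match(a: Sequence[int], b: Sequence[int],
                  keep: str = "start") -> Tuple[List[int], List[int]]:
    """Cut both sequences to the shorter length, keeping the start or the end."""
    n = min(len(a), len(b))
    if keep == "start":
        return list(a[:n]), list(b[:n])
    if keep == "end":
        # Slice from len - n so that n == 0 yields empty lists rather than everything.
        return list(a[len(a) - n:]), list(b[len(b) - n:])
    raise ValueError("keep must be 'start' or 'end'")
```

For example, `trim_to_match([1, 2, 3, 4], [9, 8], keep="end")` keeps `[3, 4]` from the longer sequence.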

#

and i am at 100% gpu utilisation

brisk crown
#

So this is more a test run than anything?

brisk crown
copper pewter
brisk crown
#

Ah

copper pewter
#

outputs actually look like outputs, nice

#

like, the semantics extracted have similar patterns to actual semantics from bark

#

IT WORKED

#

the second audio does not know about the semantics of the first, it specifically extracted them from the wav

leaden gazelle
#

ChatGPT ELI5:
In simple terms, the sentence means that the patterns found in the extracted meaning (semantics) from a dog's bark are similar to the actual patterns found in the bark itself. It's like the model successfully captured and reproduced those patterns. The second audio, which was generated separately, doesn't have knowledge of the first audio's meaning. Instead, it extracted the meaning specifically from the sound waves (wav file) itself. 👍

copper pewter
#

lol

#

anyways, i'll implement voice cloning tomorrow, show it for a bit, and release the model and training code later

#

currently just have one for hubert_base_ls960. might add more models later if someone wants it for a bigger HuBERT model, but as you saw from before, it's already really good

copper pewter
#

i'll release the models and code later this week

untold briar
#

Nice. I'm excited to play with it, and especially wonder about non-voice-cloning uses: music, and enabling whole new things. If you can train it that fast you can do one-off projects with different goals than accurately reversing semantics.

#

I'm already regretting this and dreading the next update in base bark

true wave
copper pewter
#

this is based on something i spoke, i don't know why it put it twice though lol

#

actually crazy that it managed this in 3 epochs of training that model

untold briar
#

You may have realized what I have, Bark makes everything easy

#

I'm always thinking, 'this can't work' and then it does

#

So I could believe it

#

If we can really train a useful model that fast, seriously, Bark is gonna be just ridiculously powerful

#

I've found the semantic model is incredibly robust, so I bet even with what might be a somewhat poor model, it probably still works. Though you could keep training and make it better for sure.

#

Now I want to take a crack but it's 4AM here and I'm being absurd

#

But yeah, I can't believe the stuff I shoved through the semantic model and somehow still worked

untold briar
# copper pewter actually crazy that it managed this in 3 epochs of training that model

Sometimes I get messages like "Why don't you make an audio microSaas" or something, and I'm like, "That sounds like another project I don't know anything about that I don't have time for" but people are RABID for voice cloning, it must have been half the chat messages in this Discord for awhile. You probably actually could make an instant business.

copper pewter
#

lol

untold briar
#

The serp github project has like 1000 stars, and it's terrible!

copper pewter
#

i still have to see the quality of voice cloning, i'm currently adding it to my webui

leaden gazelle
#

I'm more surprised that people haven't started doing voice cloned tts yet.

#

Cuz voice clone is speech to speech so far

copper pewter
#

that's just proof of concept

#

because of how bark speaker prompts work, this should also work in text to speech

leaden gazelle
#

I've tried making it tts based, but I use so many vb cables that the results are bad (voice clone i mean)

copper pewter
#

for some reason i'm getting system errors from soundfile when i do torchaudio.load? but only in this part of my webui, it's fine in the other part

#

"system error" not very descriptive

untold briar
#

I nuked soundfile from my install

#

because it was so randomly breaking for people

copper pewter
#

what did you replace it with

#

soundfile code

untold briar
#

Just scipy.io.wavfile for wavs, and pydub later for other stuff

copper pewter
#

and the "actual error code" is just "system error"

untold briar
#

But I'm not sure exactly what you're doing with it, so maybe those do not cover the functionality

copper pewter
#

i was loading the audio with torchaudio.load

#

i guess if i want to load the wav then i'll just have to put it in a torch tensor

untold briar
#

I didn't even realize torchaudio used it, since I'm not loading audio like that

#

Just basic output and conversion

#

Does your webui not process wavs with the same library as the training code?

#

I guess maybe not, they do different things

copper pewter
#

ah now i'm getting an error with wavfile, there's something wrong with my directories, kind of expected that

copper pewter
#

ok, voice cloning works great, but outputs aren't always perfect, sometimes the voice is somewhat different. sometimes completely different?

#

might just be an issue with my input wav file though

untold briar
#

Probably just needs more training, more data. It's honestly shocking it works at all that quickly

#

I thought it would take at least 10x that long, if not more

copper pewter
#

i'm just gonna check if the semantics make sense, by taking the audio i used for the voice, and instead generating semantics for it

#

and use those as the prompt instead of history prompt

#

nah sounds fine

foggy sandal
#

some voices work almost perfectly

copper pewter
#

like, i took the start of a dream video, and i replaced the voice, and that works fine, voice cloning also works fine but not all the time

foggy sandal
#

others will change voice some percentage of the time

copper pewter
#

like this

untold briar
#

If the semantic is close but not quite the same as it should be in Bark, maybe the errors accumulate faster so the voice diverges faster than a regular Bark generated history_prompt

#

Maybe just needs more training and data

#

But yeah even regular voices do that sometimes

#

"works but not all the time" is literally everything in Bark

#

I mean you trained in 8 hours on a 3060!

#

And it works sometimes

#

Be happy!

foggy sandal
#

I mean the thing to keep in mind is that the semantic prompt generated at inference time is not actually the semantic prompt that is used as input to the model

#

it's correlated with the text embeddings

#

so only a subset of semantic outputs will actually perform well

untold briar
#

Right, the last 10% might be a lot of work. My gut feeling is that Bark is pretty robust so probably handles more than you expect, even though it kind of feels you would need the perfect text embeddings. I've done worse things and they just worked.

foggy sandal
#

yeah I suspect the closer a voice is to the training set, the better it will perform

#

because the model has better learned to disaggregate text semantics from acoustic semantics

untold briar
#

The history prompts carry a mind blowing amount of detail for a voice, considering how tiny they are really. It's really piggy backing off the innate Bark model a lot, so you can get by with some really crude things sometimes too.

foggy sandal
#

I wonder if @copper pewter added a penalty loss to his model between his semantic predictions from hubert and the actual semantic prompts from en_speaker_1/2/3/etc

#

because those presumably are true semantic inputs

copper pewter
#

no, not speaker_1/2/3, but actually just saved ones, they're the same

untold briar
#

He used a ton of real bark generations

copper pewter
#

3000 something clips

foggy sandal
#

saved ones arent the same though

#

(presumably)

#

en_speaker_0 etc presumably come from the "true" semantic model

copper pewter
#

no

foggy sandal
#

which is unreleased

untold briar
#

Oh, you know I don't know if they did do that?

#

They might just be random bark generations?

copper pewter
#

en_speaker_0 is most likely just a saved bark gen, they claim bark does not include any real people's voices

foggy sandal
#

thats what im guessing

untold briar
#

I think you just need more clips, more time, more diversity. 3000 isn't even that much

copper pewter
#

after all

copper pewter
#

out of 10000 possible options, a lot of overlap

foggy sandal
#

i think there was mention of an unreleased model though

#

thats why i assume there's a "true" semantic model out there

untold briar
#

Yeah they have their version of what Mylo trained, basically, that's what they said they didn't release

#

Presumably it is perfect, but I don't think they would use it for the Suno default voices, just don't see the point.

#

The default voices are all saying the same sentence, if it was the case, probably not even helpful.

copper pewter
#

i might just train a few more epochs and try that model out, who knows

foggy sandal
#

i mean it's possible that with more data, the new model gets closer to the "true model"

#

(more data, more training etc)

copper pewter
#

yeah, i don't have the resources for that, but model and code will be released later this week

foggy sandal
#

with your synthesized training data, did you use the same phrase?

untold briar
#

Step 2. Mylo Audio dethrones Eleven labs.

foggy sandal
#

(i can't remember exactly what it was)

#

I wonder if that would make a difference too

copper pewter
#

it's trained on >3000 random phrases from books

#

and Shakespeare plays

#

the actual words don't matter though, the sounds do

untold briar
#

They probably matter some

foggy sandal
#

well....i wouldn't be so quick to assume that

untold briar
#

Or rather, the overall structure and meaning of the sounds, which we call words

foggy sandal
#

dont you think there is a reason that each provided prompt uses the same phrase?

untold briar
#

I don't think there is a reason for that, not really. But just that there's probably some nuance and depth that might be tricky to capture still, compared to real Bark generated semantic prompts. And you could need a wider range of text in the training data as part of that. To hit like perfect match.

foggy sandal
#

I mean there's a way to test my hypothesis

#

just measure the distance between the bark-provided semantic prompts and mylo-provided prompts for the same audio from the same speaker

#

if that keeps going down then it theoretically can converge to the same model
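
One way to put a number on that distance, as a sketch: treat both prompts as token sequences and compare them position by position, or with an edit distance when they may be shifted against each other. Both metrics are generic choices picked for illustration, not something anyone in the chat said they used.

```python
from typing import Sequence


def mismatch_rate(ref: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of differing tokens over the overlapping prefix."""
    n = min(len(ref), len(pred))
    if n == 0:
        return 1.0
    mismatches = sum(1 for r, p in zip(ref, pred) if r != p)
    return mismatches / n


def edit_distance(ref: Sequence[int], pred: Sequence[int]) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(pred) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, p in enumerate(pred, start=1):
            cost = 0 if r == p else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]
```

Tracking either number across checkpoints for the same audio would show whether the extracted prompts keep moving towards the Bark-generated ones.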

untold briar
#

Well that's what he trained it on, yeah, so just keep checking the loss

foggy sandal
#

no, i mean checking against the pre-provided en_speaker_0 etc semantic prompts

untold briar
#

I was just speculating that you might hit a wall and need a wider set of text. Asian languages or something, for example.

foggy sandal
#

for audio generated with that speaker

untold briar
#

As far as I can tell there's nothing special about the provided prompts, but that's just my guess, I guess there could be. But even so I don't think it would be that useful, regular bark generated semantics should be all you need.

foggy sandal
#

its possible, I'm just thinking of ways to empirically verify that

untold briar
#

Hey you seem pretty knowledgeable, random question. Somebody asked me about adding a 'negative prompt' to Bark Infinity, like Stable Diffusion. Do you know if there's a reason LLMs never had this idea? Does it not even make sense, or is useless, or something?

#

I'm not sure exactly what the implementation would look like even, or even what 'working correctly' means exactly, but it does seem like a fun idea.

#

"I HATE YOUR GUYS" as a negative prompt, to make somebody quiet and friendly

#

or something

foggy sandal
#

I think you could implement something similar conceptually but negative prompts are mostly a diffusion model concept

#

because they can sample from two latent spaces

#

i guess autoregressive decoding could downweight the logits of a negative phrase though
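
The logit down-weighting idea could look roughly like this at a single decoding step. It is only a sketch on plain Python lists: the flat `penalty` value and the function names are assumptions, and nothing here is part of Bark's own sampling code.

```python
import math
from typing import Iterable, List


def penalize_logits(logits: List[float], negative_token_ids: Iterable[int],
                    penalty: float = 2.0) -> List[float]:
    """Return a copy of logits with the 'negative' token ids pushed down."""
    out = list(logits)
    for token_id in set(negative_token_ids):
        if 0 <= token_id < len(out):  # ignore ids outside the vocabulary
            out[token_id] -= penalty
    return out


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Sampling from `softmax(penalize_logits(step_logits, negative_ids))` makes the penalised tokens less likely without ruling them out entirely.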

untold briar
#

I was thinking something like, generate the negative prompt. Save the generation. Compare the generation to the 'average' generation from that language. Pick some cutoff for the most common tokens or patterns used unique to the negative prompt (so you don't penalize the sound of a human voice generally) and then penalize those in the actual generation. Just a random idea.

#

Kind of depends on how the semantic works, so it might be useless. But maybe does something.
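
A sketch of the "tokens unique to the negative prompt" step: count how often each semantic token shows up in the negative generation versus a baseline of ordinary generations, and keep the ones that are strongly over-represented. The thresholds and the crude add-one smoothing are arbitrary choices made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def distinctive_tokens(negative_tokens: Sequence[int], baseline_tokens: Sequence[int],
                       min_ratio: float = 3.0, min_count: int = 2) -> List[int]:
    """Tokens whose relative frequency in the negative sample far exceeds the baseline."""
    neg = Counter(negative_tokens)
    base = Counter(baseline_tokens)
    neg_total = sum(neg.values())
    base_total = sum(base.values())
    if neg_total == 0:
        return []
    picked = []
    for token, count in neg.items():
        if count < min_count:  # skip tokens seen too rarely to judge
            continue
        neg_freq = count / neg_total
        # Add-one smoothing so tokens missing from the baseline do not divide by zero.
        base_freq = (base[token] + 1) / (base_total + 1)
        if neg_freq / base_freq >= min_ratio:
            picked.append(token)
    return sorted(picked)
```

The returned ids are the candidates to penalise during the real generation; tokens that are common everywhere (the sound of a human voice in general) fall below the ratio and are left alone.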

foggy sandal
#

i think you could figure out some way of doing it but in my experience picking the average of the output of a network never performs that well

#

its just too much information mushed together

#

id maybe start with trying to convert man to woman

#

or vice versa

#

maybe looking at the attention weights for that particular token?

untold briar
#

Well I did use this method, basically, for my french accents I've been posting. It's not reliable

#

but it does work often enough

#

But comparing set of english versus french

#

So it COULD just work, bark is surprisingly robust

foggy sandal
#

do you know about pca?

untold briar
#

No

foggy sandal
#

principal component analysis

#

its a bog standard statistical technique

untold briar
#

I'm pretty much a code monkey rubbing sticks together. I mean theoretically I took stats in college but it's long been phased out of my neurons

foggy sandal
#

basically to find the major dimensions of variance in a dataset

#

and the eigenvectors of that variance

#

that would be a good place to start
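
To make the PCA idea concrete, here is a dependency-free sketch that finds only the first principal component by power iteration on the covariance matrix. In practice a library implementation such as scikit-learn's `sklearn.decomposition.PCA` would be the usual choice; this version is a swapped-in simplification for illustration, and what the rows would be (for example per-clip feature vectors) is left open.

```python
import math
from typing import List, Sequence


def first_principal_component(rows: Sequence[Sequence[float]],
                              iterations: int = 200) -> List[float]:
    """Unit vector along the direction of largest variance in the data."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two rows")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Asymmetric start vector, to avoid starting orthogonal to the answer.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance at all; nothing to find
            break
        v = [x / norm for x in w]
    return v
```

For points lying on the line y = 2x the result points along (1, 2), normalised to unit length (the sign of a principal component is arbitrary).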

untold briar
#

Most linear algebra too, annoyingly, since that has been so necessary now I need to relearn a bit

#

haha

#

However, there's great stats libraries. I bet I could get pretty far just banging some keys

#

Somehow I can still program in assembler despite not seeing it in forever, just because that one professor was such a hard ass it got burned into my brain. But not the math.

foggy sandal
#

yeah, sklearn's PCA (sklearn.decomposition.PCA) is my go to

untold briar
#

Yeah I did use that a bit already, just to get some basic stats, entropy, perplexity, whatever. Just to look for simple trends.

#

Those particular metrics were largely useless, at least for what I was doing at that moment.

copper pewter
#

i don't know what any of that means 😎

untold briar
#

perplexity is probably something you'll need, it's all over any language model training and evaluations
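
For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to the correct tokens. A minimal sketch, assuming you already have the natural-log probability of each target token:

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

A model that gives every correct token probability 0.25 has a perplexity of 4, so it is as uncertain as a uniform choice between four options. With a cross-entropy loss measured in nats, perplexity is simply `exp(loss)`.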

#

The rest, eh. Just get that loss down.

untold briar
#

I'll be impressed if that works, but maybe you only need a few samples in the data?

#

Or maybe a child's voice isn't that different from an adult, compared to say a human versus a dog barking or a car alarm

copper pewter
#

i believe voice cloning a "rare" bark voice shouldn't be an issue, it doesn't need to be in the training data

untold briar
#

Yeah as long as it's close enough, it only has to be in the Bark data

copper pewter
#

like, who knows, maybe you could record your cat meowing and see what bark thinks it would sound like as a person, that would be quite an interesting thing

untold briar
#

so your prompt is accurate enough to get in the right space

#

Bark can probably do meows!

#

I haven't heard one, but it's got to be in the training data.

copper pewter
#

is it?

untold briar
#

Yeah, I mean it can do music and random sound effects

#

It's just raw audio, youtube, tv shows, etc, just guessing.

#

I often get a 'tv commercial' sound

#

so for sure TV

copper pewter
#

the old tv sounding speaker

untold briar
#

Yeah it's raw data and not even noise filtered

#

which is also kind of cool, so it can make the noise too

#

like an old time AM radio or something

#

I have some of those

#

I was really killing myself to get Star Trek TNG background 'hum' sound

#

possible but difficult as heck

copper pewter
#

after a bit more training, it doesn't seem to be that much more consistent. still, the times that it does work it's really high quality

untold briar
#

i'm still kind of blown away it trained so fast, I guess I had a totally wrong intuition about how hard it would be. I should have gone for it. I also hadn't until recently thought about all the non voice clone use cases, which made me more interested. I'm really impressed you went from a single neuron in javascript to just nailing it like 3 days later. You got a bright future.

copper pewter
#

i do have years of programming experience, so that really helped me

#

do you have a clip of joe biden talking so i can demonstrate a voice that everyone will recognise?

untold briar
#

Not at the moment, away from main desktop

#

Try for Rick and Morty

#

I had trouble with Rick

#

really crazy voice, atypical

copper pewter
#

need to find some voice lines

untold briar
#

Probably google like rick and morty sound board, or something

#

I can give you some later but won't be home for a while

#

Have you tried music? I would be super intrigued if it works, just even rarely, it would open up a lot of fun possibilities to make new kinds of stuff with Bark.

#

Cloning is cool but like, you know, already a thing. But if it works with music, OMFG

copper pewter
#

idk, music won't be better than bark music, it's just a history prompt

untold briar
#

So like, once in a blue moon, I get really good music out of Bark. So it's not impossible.

#

Like 14 full seconds that almost sound like a real clip

#

But it's so so rare

#

Even though it's just a prompt, one time it even stays pretty solid for a minute, using prev segment as new prompt. Literally just one time though, lol

#

You may be right that encoding music into the history doesn't help much, mostly still sucks, but maybe not

lunar glen
# copper pewter regenerated audio

Looks similar, did you train the coarse model from scratch or fine tuned the suno one ?
From yesterday I have been trying to train the text model, the merge token trick is eating me up, any luck with the text model ?

copper pewter
#

neither

#

i trained a model to extract semantics

lunar glen
#

Semantics is extracted from Hubert right ? Then you trained some model for semantics to coarse ?

#

Or you mean text to semantics you trained ?

copper pewter
#

also, @untold briar i might have a great idea to fix it not always generating entirely correctly, just save the prompts from when it does. problem solved

copper pewter
lunar glen
#

Oh that's a nice idea dude

copper pewter
untold briar
#

Ah, I wish I was at my desktop, really curious how it comes out. I can try to google one sec

copper pewter
#

i'm getting a female voice, the same female voice every time, somehow

untold briar
#

Haha, I had the same problem! Wacky!

#

It must be something about it

#

Or like, it's triggering a cartoon or something

#

It didn't always happen so I got something Rick like eventually but I actually got that, doing my completely insane method

#

which seems crazy to me, must be like a model thing in Bark

#

That's just totally bizarre honestly, like WTF.

copper pewter
#

yeah your method is kinda a lot of trial and error, just generating multiple and hoping one of them is good right?

untold briar
#

It's way more convoluted, I identified certain trends, manually ranked voices by similarity, and iteratively bred them over generations, trying to find a recursive process that made it converge more in that direction. a process which wasn't that similar between voices. with a lot of weird hacks like manually mixing two voices that each have a similar aspect. Just a completely insane process. But also a fascinating concept that it converges so well, I was blown away it kept getting closer.

copper pewter
#

for example with mine, this was the second try, idk why he screams hello though

#

i don't really watch dream or anything but i'm pretty sure this sounds pretty accurate

copper pewter
#

my voice cloning takes less than 5 seconds on cpu

untold briar
#

It was really just curiosity: could this actually work? And it's crazy how close it can get

copper pewter
untold briar
#

It is kind of my thing to do things "the incorrect way"

copper pewter
#

technically, if you keep combining the 2 that would have the actual voice between them theoretically, you could actually get a great clone

untold briar
#

I was like, this can't possibly work, right? So it was pretty fun! Also just kind of wild it kept getting closer and closer until it seemed perfect to my ears. I thought it would eventually hit a wall.

copper pewter
#

it would just take a long time

untold briar
#

It was like a couple hours, maybe a bit faster at the end but I got bored

#

But Rick was just weird, constantly got voice switches, and stuff.

#

I think it must be modeling a cartoon. Bark I mean. Cartoons do switch voices rapidly. So this is, in a sense, perhaps an accurate prediction. But I literally kept getting a woman's voice, that was by far the most common. It's a bit mind blowing that somehow both methods produced that.

#

It wasn't just cloning, I was interested in making voice hybrids, flipping the gender of a voice, it was all related work

#

Just seeing what might work

#

Especially what *shouldn't* work, but if it does it's very fun.

#

Gotta bail for a bit, @ me if it's interesting, will check later today

untold briar
#

Okay, I thought this would be a fascinating YouTube video (Titles... How to voice clone the least efficient way? Still workshopping it.) I may as well try to explain quickly because who knows when I’ll get to making that. Basically after the third time I randomly got a Bark voice that sounded suspiciously like Trevor Noah, I was like, ok seriously WHAT is going on?

And it was my test script. I was constantly running the same prompts, in the same order, iteratively using the last segment as the new history, with a bunch of variations: sometimes half the prev segment + half a fixed history, randomly deleting a random portion of the history, mixing two different histories, and stuff like that. Lots of trimming and deleting and mixing, and importantly preserving at least some fraction of previous segment, so it was always iterating and building on the past. (Hitchhiker's Guide to the Galaxy and Andor scenes, if you’re curious.)

And I looked at the samples and in some parts of the process, later samples did actually sound a bit more Trevor Noah than earlier samples. It was super rare to get a full clone from this particular process, but I ran this script all the time so it did just randomly happen more than one time. The main thing was that it was a process that somehow trended towards a particular voice. So it made me wonder if you could do better on purpose. And you can if you’re completely crazy and procrastinating from actual work.
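
Two of the history variations described above (half a fixed history plus half the previous segment, and deleting a random portion), sketched on plain token lists. Which halves get kept, the `max_fraction` parameter, and the function names are guesses for illustration; a real Bark history prompt holds separate semantic, coarse and fine arrays, which this ignores.

```python
import random
from typing import List, Optional, Sequence


def half_and_half(fixed_history: Sequence[int], prev_segment: Sequence[int]) -> List[int]:
    """First half of the fixed history followed by the second half of the previous segment."""
    head = list(fixed_history[: len(fixed_history) // 2])
    tail = list(prev_segment[len(prev_segment) // 2:])
    return head + tail


def drop_random_span(history: Sequence[int], max_fraction: float = 0.5,
                     rng: Optional[random.Random] = None) -> List[int]:
    """Delete one random contiguous span, at most max_fraction of the history."""
    rng = rng or random.Random()
    if not history:
        return []
    span = rng.randint(0, int(len(history) * max_fraction))
    start = rng.randint(0, len(history) - span)
    return list(history[:start]) + list(history[start + span:])
```

Both helpers keep the surviving tokens in their original order, so part of the previous segment always carries over into the next prompt.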

copper pewter
#

hmm

untold briar
#

I hope I can explain in a YouTube somehow, I was just so surprised it worked, don't know if I can capture that feeling of 'how the heck is this working?'.

untold briar
#

"Voice cloning with zero seconds of audio (for crazy people)." Hmnnn. I like that one

untold briar
# copper pewter hmm

Totally @ me if you try some music. That's what I'm most curious about, even if it's probably not great.

#

Trying to get good music out of Bark is extremely difficult because so many samples are literally painful to listen to. So if this is an alternative way to try stuff, it would be a relief.

#

Oh you gotta try the meow too. That's critical. We have to know what happens. For science!

copper pewter
#

i do have a little bit of ai generated lo fi from riffusion

#

but idk, my tokenizer did not have any music in its data

untold briar
#

Yeah, I understand it shouldn't work. 100%. But I'm saying, maybe it's still interesting!

copper pewter
#

ok, so first gen it had the little sounds in the output audio, and a voice saying aaaaaaaaaaaaaaa

untold briar
#

I call that a win. That sounds a bit like music to me.

copper pewter
#

the original music btw, it's kind of low quality, since it was a test of an extension i made for stable diffusion webui

untold briar
#

Accuracy isn't the only goal. Can you make an interesting history prompt for Bark, that is capable of generating Bark outputs that would otherwise be hard to make?

#

That's a win.

copper pewter
untold briar
#

It sounds like singing, a bit..

copper pewter
#

yeah

untold briar
#

Well any consistent and not noisy vaguely music sounding audio, would be a win

#

Some of my weird experiments made Bark output like, car alarms. But in clear audio. Not useful, but kind of interesting.

#

Doorbells, honks, horns, lots of that.

copper pewter
untold briar
#

I mean it's early. In future could train on music, sounds, whatever. But thanks for testing for me.

copper pewter
#

the voice cloning is so good though

#

best results if you use the voice cloning, then generate a piece of audio which fits the person you're trying to clone, then taking the speaker prompt from there, that way it's fully bark-generated and accurate. but actually has the voice

#

that's what i did in the clip above

untold briar
#

I have a few bidens, one issue I had is he trended towards either public microphone, or way close to mic, he kept oscillating and neither was ideal, I was gonna try hybriding but I got bored of this manual process

#

He's the one I'm least happy with though, of the presidents

#

I was thinking about future uses. If you can encode an hour of a speaker, for example. Then sort of like how I'm doing my french accents, one possible use of your model could be to create a deeper clone than the history prompt

#

Kind of like how I push it towards french tokens, maybe it's possible to do something like with like a bigger sample size

#

and possibly, small chance, it could overall increase quality past what you can do from single history_prompt. without having to fine-tune

#

From what I hear fine-tuning is ALSO quick

#

so this may be kind of a dead end

#

but still, could be worth trying, some day

#

fine tuning should be god tier though, for actual practical purposes

#

Oh can you flip a sign in your model somewhere? Can you produce the semantic that is the least like the wav instead? Probably total noise or silence. But sometimes a bad idea can do something cool.

#

You're getting a lot of insight into how I work. Try the wrong thing that even if it works, should sound terrible.

#

I'd say "Hey, it's a living." but it's not, it's really not. If anything it's an anti-living.

#

I can try all this stuff later in the week when you release, I'll stop bugging you. I'm stuck in a waiting room and bored out of my mind, and I can't try anything myself for a couple hours.

#

Does anyone know, of the people who did fine-tune whatever model (coarse I think, for a new language) were they able to do it and preserve the Bark original model's capabilities, or did it have a detrimental effect on them?

copper pewter
untold briar
#

Yeah I agree that's the most likely result, basically the same as random tokens. But it's worth checking at least a bit, sometimes a really bad idea can surprise you.

copper pewter
copper pewter
#

if you've got a decent clip you can get decent results on voices that aren't very common

untold briar
#

Cartoon voices seem to be tightly tangled with background music in Bark. Super deep narrator voices as well are incredibly likely to add it, even if the original speaker had nothing.

#

Could be really cool if the background sounds were a bit better and more plausibly a real clip

sharp gale
#

Best way to do that would be editing the audio beforehand to remove the background music as much as possible. Nowadays there are tons of tools that do that

#

Or even better, hire a voice actor from Fiverr who can do impersonations and record a bunch of clean voice lines for training

untold briar
#

I think they're trying to keep it raw, and figure out some way to sample from it with more control. But I don't know if there's any downside to adding even more data like that as an addition to the dataset.

#

Probably not?

sharp gale
#

I wouldn't think so.

#

What is a good amount of data and voice lines to train custom model?

untold briar
#

There's somebody in here who at least started training a new language, but I haven't kept up with the details. It might be only the coarse model as well.

#

I don't think you'll need to train Bark at all for good voice cloning.

#

You could do it for better cloning, probably. But should be good with just a good prompt.

#

Unless your voice lines are atypical, sound effects, music, or a new language.

sharp gale
#

I see. I'm still new so I don't know how to do voice cloning yet. I've Bark infinity running and tried to do voice cloning through its UI, but haven't managed to make it work yet

untold briar
#

Bark should work much better than Tortoise, using a small audio sample.

#

That's not the real cloning. The person working on it is right above here, probably out later this week.

#

I put that in Bark Infinity because people asked, but it's really just not worth the effort. Literally just wait a few days.

sharp gale
#

Oh nice! Would love to give it a try and test it out if I can manage to get it running

#

still learning how to run Python, git, and stuff, my expertise is more in visual than programming haha

untold briar
#

He's still cooking the cloning a bit, just scroll up literally in this thread, that's it right there.

sharp gale
#

Oh Mylo with the audio! Cool thanks man

#

Gonna spend the afternoon reading chat history haha

untold briar
#

May I ask your reason for cloning? I'm sort of curious why it's just the thing everyone asks about specifically.

#

If you only need a nice voice in a particular style, you may not need to clone at all. If it just needs be in a specific style and tone, but not an exact voice.

#

It's pretty wild how expressive Bark is. You can really get quite a specific voice with a bit of effort.

sharp gale
#

Trying to create a voice for a female character for a Youtube channel. Wanted a clean voice that sounds similar to this

untold briar
#

Yeah that specific style happens to be quite tricky in Bark. Children's or high-pitched voices talking real fast. That's one you might just want to clone.

sharp gale
#

Haha yeah that's what I thought

untold briar
#

That might be one that does need fine-tuning to do well, even. Cartoons style in general has a lot of background as we just mentioned, so might be kind of messy. I think that could be fixed but short term you might need fine tuning. Which is not even really a thing, yet, to be clear. But probably not too long.

sharp gale
#

I see! So the fine tuning would be feeding more audio data in the cloning process?

untold briar
#

The cloning probably just needs a short sample, and generates a bark prompt. An .npz file. Fine tuning would be a more traditional process where you use a lot of data and longer time. They are both cloning, but I was using cloning to mean the first, yeah.

whole jay
copper pewter
#

yep

sharp gale
copper pewter
untold briar
#

My guess is it's hard to get it to generate clean voice only samples. But maybe it works!

copper pewter
#

after a few tries it got the voice, but not really following the prompt

untold briar
#

Pretty good, yeah.

sharp gale
untold briar
#

Really good actually. That's a tough style in Bark. A+

sharp gale
#

That's amazing damn

copper pewter
#

lmao what happened, i used the speaker prompt that resulted from the last gen's audio and it just switched up

untold briar
#

Yeah, cartoon voices man.

copper pewter
#

the voice just switching lmao

untold briar
#

Literally every cartoon voice I have does that

sharp gale
#

is that the model trying to use its trained data to fill in gaps?

copper pewter
#

maybe i can play with the temperatures a bit until i get a better output from bark, since using those outputs is even better since they're "true" semantics

untold briar
#

I think it's really just because real cartoons switch voices very frequently. So Bark is likely to do it too.

sharp gale
untold briar
#

That's right.

sharp gale
#

Aaah ok things are starting to click lol

copper pewter
#

first it creates an npz, which will have a lot of variety in its continuations still, but often gets a realistic response, if you get it to generate more audio in the correct voice with that, you get another npz to save, which will have a consistent voice that you want

#

that's how i did the one with joe biden
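
A minimal sketch of that two-step pattern, not the cloning repo's actual code. `generate` is a hypothetical stand-in for anything that takes a text and a speaker prompt and returns `(full_generation, audio)`; with Bark that would presumably be a thin wrapper around `generate_audio(..., output_full=True)`, but that wiring is an assumption here. `is_good` is also hypothetical (a human listening and saying yes or no works fine).

```python
from typing import Callable, Optional, Tuple

# Hypothetical aliases for this sketch: a "prompt" is whatever the generator
# accepts as a history prompt (in Bark, a dict of token arrays or an .npz path);
# "audio" is the generated waveform.
Prompt = object
Audio = object
GenerateFn = Callable[[str, Optional[Prompt]], Tuple[Prompt, Audio]]


def refine_speaker_prompt(
    generate: GenerateFn,
    cloned_prompt: Prompt,
    text: str,
    is_good: Callable[[Audio], bool],
    max_tries: int = 10,
) -> Optional[Prompt]:
    """Generate from the cloned prompt until a take passes `is_good`,
    then return that take's own full generation as the new speaker prompt."""
    for _ in range(max_tries):
        full_generation, audio = generate(text, cloned_prompt)
        if is_good(audio):
            # These tokens were produced by the model itself, so they are
            # "true" semantics rather than an external encoder's estimate.
            return full_generation
    return None  # no acceptable take; keep using the original cloned prompt
```

The returned prompt is the second npz you'd save and reuse from then on.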

untold briar
#

Even with non cloned voices, that's a good pattern.

#

To increase reliability.

sharp gale
#

Got it! Wow blowing my mind here

copper pewter
untold briar
#

Yeah, Bark doesn't need fine-tuning. Unless you go way way outside the norm.

sharp gale
# copper pewter

Would it be ok for me to get the .npz file to test it out on my end as well?

copper pewter
#

yeah

#

still testing a bit though, i want it to both have the voice, and correctly follow text

sharp gale
#

yeah no rush! Thank you

untold briar
#

You want to be truly mind blown. Trained in just 8 hours on a 3060.

#

Faster than cloning a single voice for many TTS models.

sharp gale
#

lol that's nuts

untold briar
# copper pewter ?

There are some TTS models that people fine-tune on a voice for longer than 8 hours, to make ONE clone.

copper pewter
#

the training itself took like 20 minutes for the model i made lol

#

on 8 hours of audio data

#

i guess this would technically be zero-shot voice cloning right?

untold briar
#

I'm not really sure how people use zero-shot in audio stuff, not sure.

sharp gale
#

I remember at the beginning of this year when I was looking at how to do voice training and on the videos the people would train for days on a 4080 to get anything that sounded at least human

#

on like 30 hours of audio data

#

for those David attenborough voices

untold briar
#

I wonder if the appeal will fade once it's just baked into every single app or whatever

sharp gale
#

Probably, most people will use it to say some funny words and move on

untold briar
#

There will always be a demand for higher quality, but rather than pure voice, it'll almost be 'acting quality' literally how good of a performer the voice model is, similar to how you might judge a real actor.

#

A short sample will probably sound the same, but reading a paragraph from a book, the better model will have a lot more nuance.

sharp gale
#

Yeah for sure

untold briar
#

Even now I sometimes evaluate just the prompts/speakers like that, in Bark. Not a ton, but there seems to be some subtle differences between speaker variants that are otherwise sonically similar.

#

It's a little hard to judge in 14 seconds, but I'd guess it stands out more if the generations were longer.

#

Especially an audiobook, a full paragraph has a kind of rhythm to it you can almost feel, with a real reader. But this version of Bark can't really ever see the whole thing.

sharp gale
#

Right I think that's gonna be the hardest thing to capture. I think the voice that gets the closest is Google's Voice Assistant, which I think does a better job than Siri in sounding real

untold briar
#

So you can kind of cheat it right now in Bark. In the simplest way possible, using a 'beginning of paragraph' speaker .npz, and one for the end. lol. It actually basically works. But obviously doing it for real would sound way better.

#

Just kind of randomly checking variants and trying to find the right ones for that.

#

It's also a little TOO regular, at least when I just tested the idea. Maybe okay with more subtle variants than I tried.
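
A rough sketch of the 'beginning of paragraph' / 'end of paragraph' cheat, assuming you already found speaker variants that sound right for each position. The function name is made up for illustration, and the separate middle prompt is an added assumption (the idea above only mentions a start and an end variant; pass the same prompt for start and middle if you only have two).

```python
from typing import List, Tuple


def assign_position_prompts(
    sentences: List[str],
    start_prompt: str,
    middle_prompt: str,
    end_prompt: str,
) -> List[Tuple[str, str]]:
    """Pair each sentence with a speaker prompt chosen by its position
    in the paragraph: first, last, or somewhere in between."""
    last = len(sentences) - 1
    plan = []
    for i, sentence in enumerate(sentences):
        if i == 0:
            prompt = start_prompt      # paragraph-opening delivery
        elif i == last:
            prompt = end_prompt        # paragraph-closing delivery
        else:
            prompt = middle_prompt     # neutral delivery in between
        plan.append((sentence, prompt))
    return plan
```

Each `(sentence, prompt)` pair would then go to the generator one at a time. Picking the same variants every time is probably why it sounds a little too regular.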

sharp gale
#

Wow interesting 🤔

untold briar
#

Yeah man, Bark is literally barely even explored right now.

sharp gale
#

Yeah I'm amazed more people aren't talking about it/testing out

untold briar
#

You can do the same thing for emotions, whispering (probably didn't test that), whatever. A lot of work to get good variants, but it's a really simple idea and you only need to get them once.

#

A feature I might add to my fork is a fake prompt that is only used as a history input. So you say, "I went home to work. (I am crying I am so sad.) I said hello to my wife." And basically it uses the middle sentence in the history, but not the audio output. Just splitting text segments, not anything fancy

#

I think it's probably super super unreliable

#

but it seems fun

sharp gale
#

Oh wow Literally like a VO Director directing talent on how to deliver the lines, that's cool

untold briar
#

So the idea is, it's like you asked Bark to generate the audio with that sentence in the tone, but doesn't actually include the middle sentence. But if you try this now, you'll find it doesn't work reliably. You really gotta RNG it up to get a good one usually.
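
A sketch of how the "fake prompt" splitting could look, under assumptions: parentheses mark the history-only text (as in the example sentence), and `generate` is a hypothetical callable taking `(text, history_prompt)` and returning `(full_generation, audio)`. This is not code from the fork, just the text-splitting idea written out.

```python
import re
from typing import List, Tuple


def split_directed_text(text: str) -> List[Tuple[str, bool]]:
    """Split text into (segment, keep_audio) pairs.
    Parenthesized segments are generated but their audio is thrown away."""
    segments = []
    for match in re.finditer(r"\(([^()]*)\)|([^()]+)", text):
        hidden, spoken = match.group(1), match.group(2)
        if hidden is not None:
            if hidden.strip():
                segments.append((hidden.strip(), False))
        elif spoken.strip():
            segments.append((spoken.strip(), True))
    return segments


def render(generate, text: str, base_prompt) -> list:
    """Generate every segment in order, chaining history, but only keep
    the audio of the segments that were not in parentheses."""
    history = base_prompt
    kept = []
    for segment, keep_audio in split_directed_text(text):
        full_generation, audio = generate(segment, history)
        history = full_generation  # hidden segments still steer what follows
        if keep_audio:
            kept.append(audio)
    return kept
```

Because each segment is generated separately, the joins will be audible at times, and the emotional carry-over stays hit-or-miss.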

sharp gale
#

I used to be a VO recording technician in the past, one thing we would always do with a VO talent is not tell them exactly how to do a line, but ask for more or less of an emotion, or mention the tone

untold briar
#

With Bark, it's usually even more subtle

#

You have to write text that just would be said in the way you want.

sharp gale
#

as in "can you finish this sentence on a higher tone?" or "Less crying, more anger"

untold briar
#

A better example would be (My dog was dead and I started sobbing.) or something, truly over the top, but then the line would be read more sadly in Bark.

sharp gale
#

incorporating that into prompts would be gamechanging

untold briar
#

It's not gonna be great in the quick version, since splitting the sentences alone kind of makes it less expressive and feel less connected.

#

Still could be useful and it's easy to implement, so whatever.

#

And I think I already mentioned, but using this to control emotion doesn't actually work well. It just works once in awhile.

sharp gale
#

Are there variables in the training data that classify emotions?

#

or tags, not sure how to call it

untold briar
#

No idea. I'd guess no with pretty high confidence, just seems against the spirit of the project.

sharp gale
#

I see. Could be interesting to train data with those "tags" from the beginning. Maybe that would be easier to control once data is trained?

untold briar
#

The one thing they've talked about is the training data is pretty raw, not noise filtered, just raw audio

#

So that's why I guess no

#

If you search gkusko (trying not to ping) he's talked a lot in this discord about it. But it could be weeks ago.

sharp gale
#

I'll search it out

untold briar
#

They might hope Bark could eventually do classifying like that on audio. A true foundation audio model.

#

It probably can, sort of... I can almost imagine something.

#

Imagine a comically simple method. Encode the audio "I am feeling" into bark, using something like Mylo's model. Then prompt bark to continue the audio, which is just a flag, you can force it to do that.

#

I bet you get well, mostly nonsense, but statistically I bet different emotions have some correlation in the likely words you hear to the actual emotional tone in the audio.

#

The least accurate and most inefficient emotion classifier system you could possibly make. But I bet it's more accurate than chance with enough samples. Basically my specialty. Also I guess this only works on the audio of the words "I am feeling" ? Truly the most useless machine. Actually, why not clone the voice of the audio sample, then add a fake "I am feeling" to the end, then give that to Bark. Even more convoluted. Excellent.
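
For the record, the voting half of that useless machine could look roughly like this. Big assumptions: you already sampled a pile of continuations after "I am feeling" and somehow got transcripts of them (the transcription step is not part of the idea above), and the word lists are a made-up lexicon, not anything Bark provides.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional, Set


def vote_emotion(
    transcripts: Iterable[str],
    lexicon: Dict[str, Set[str]],
) -> Optional[str]:
    """Count cue words per emotion across all sampled continuations and
    return the emotion with the most hits, or None if nothing matched."""
    votes = Counter()
    for transcript in transcripts:
        words = re.findall(r"[a-z']+", transcript.lower())
        for emotion, cues in lexicon.items():
            votes[emotion] += sum(1 for word in words if word in cues)
    if not votes or max(votes.values()) == 0:
        return None
    return votes.most_common(1)[0][0]
```

One sample tells you nothing. The bet is only that over many samples the counts lean the right way more often than chance.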

sharp gale
#

Haha

#

Would love to see that

untold briar
#

If there's a terrible way to do something, that's the way I've probably done it. The funny thing is some of it just actually works straight up in Bark.

sharp gale
#

Ahahahhaha this is amazing

#

You got an immediate follow great stuff

untold briar
#

Gosh I haven't Tweeted in forever.

sharp gale
#

You should! You got a whole follower base already

untold briar
#

I do have like hundreds of samples of the US presidents singing American pie, right in front of me. The world should hear this.

sharp gale
#

I'll retweet it to my 100 followers lol

untold briar
#

My Twitter is actually a weird mix cause like it's half from crazy jokes, and half from when I used to read https://arxiv.org/ for AI papers every day and was constantly Tweeting paper summaries or usually quickly trying the paper repo, so like serious science people. So I honestly get gunshy and try to think like, can I split the difference somehow for both groups, lol.

sharp gale
#

Yeah that's stuff that goes over my head for sure 🤣

#

but the funny AI videos I like lol

untold briar
#

It's so expressive, right? The voices are so insanely human. On the github for my bark fork there's like 20 minutes of that. I love it so much.

#

I don't really use TTS in practical use, but it was just so lively and real feeling, I got kind of addicted to poking at the audio model. For literally no purpose but to see what it did. And that's how I somehow ended up spending half my week's free time trying to get Donald Trump to sing in a ridiculous French accent.

untold briar
#

I Tweeted Don't Sleep on Bark when nobody knew about it. Bark has 20,000 stars now... and people are still sleeping on Bark. That's the Tweet. Legit like, Mylo cloned in 20 minutes. I threw sticks and stones together and got French accents, somehow. (I might be overemphasizing self derogatory humor a bit here, it was real work and quite a bit of coding, if not very sophisticated.) Imagine what actual domain experts really building on the Bark model will do.

sharp gale
#

Hahha funny you say that I just got the French voice to talk in a great accent in english too, using the output npz files you guys suggested earlier

untold briar
#

Yeah it's cool, many are multilingual.

#

Specifically I was trying to force any voice to have a specific accent, which normally you can't do. Like, ugh, I have some here somewhere.

copper pewter
#

some distinct voices are really easy to clone btw, anyone recognise this one?

untold briar
#

Not sure, sounds like an announcer?

sharp gale
#

Sounds familiar but can't put my finger on it

copper pewter
#

postal dude

untold briar
#

The winamp voice?

#

Oh I'm not sure I know that one

sharp gale
#

ahahaha winamp voice! That's a deep cut

untold briar
#

Gosh showing my age.

#

haha

sharp gale
#

Ahaha Postal dude yes there you go

sharp gale
#

Postal dude is also a deep cut

untold briar
#

It's pretty similar!

sharp gale
#

hahaha might be the same voice actor

untold briar
#

I know right?

#

I don't know the postal voice but hear that pretty close

copper pewter
#

rick hunter, who is the voice of the postal dude in postal 1, 2, and as an option in 4, does do radio commercials

#

also, lol

#

my explorer hasn't realized that it's already the next day

untold briar
#

Man I looked it up. The winamp voice actor vanished into thin air?

#

wild

#

Actual urban legend voice

sharp gale
#

hahaha wow

#

Oh shoot Jonathan I just noticed that you're the one who made Bark Infinity! That's you right?

untold briar
#

Yeah. I don't know how I ended up so deep into audio to be honest. And I kind of need an exit plan eventually, cause like I don't even use it myself except to play and experiment, but I will be updating it soon and support it for near future, for sure.

sharp gale
#

Man that's awesome thank you for your hard work! Been playing with it nonstop ever since installing it!

untold briar
#

And now I feel some responsibility to update it.

#

My god though, I have like a billion changes for my accent stuff, so it's just gonna be a nightmare.

#

But I'll do a basic bugfix and core features, I mean 99% of people want clear voices and an easy installer. Literally nobody is asking for people to sing in comical accents, but that's what I'm dying over, lol. And the install is so confusing.

sharp gale
#

hahahaha it is a bit confusing, took me a little while to make it work on my end

untold briar
#

I'm spending 10x the hours on these crazy ideas, and everyone keeps messaging me, "I can't install Bark."

#

I actually did start a basic patch some today, for real.

sharp gale
#

Miniconda saved me

#

If you or Mylo need help with anything visual or even audio data to be trained on just let me know seriously, you both are doing the Lord's work

untold briar
#

There are some practical things I have learned, as a result of some silly things, that can be useful for simply joining audio clips more seamlessly, so it's not all for nothing.

sharp gale
#

absolutely Infinity is saving me a lot, cause using commands on prompts is the death of me haha

untold briar
#

Cool. There are some rough spots that drive me nuts. I can't believe how difficult it is to like, load all your settings in Gradio. I looked at how Stable Diffusion does it and they just wrote their own JS thing that basically just bypasses Gradio. Ugh

#

I knew Automatic1111 was Gradio, and I made a bad assumption that most of its features were Gradio's features. But actually, it's a miracle they got all that stuff to work in Gradio and they built a billion hacks on top to do it.

#

I couldn't figure out how to display a non fixed amount of audio samples, or let the user pick a folder of files, really simple stuff can be missing in Gradio.

#

There's a whole thread for Gradio complaining, haha. I really go off easily. It's really not the worst but I was frustrated a lot, and I hadn't developed in Gradio before last month, so I kept being like, "I must be missing a feature?" and actually usually it just didn't exist, lol.

#

I'm gonna have to rip the Stable Diffusion code just to refresh the file picker dropdowns if there's a new .npz file. How is this just not part of the base ui, lol.

#

Actually there is a funny thing I didn't think about. The actual most annoying thing about Gradio isn't Gradio's fault, really. It's that the newer Gradio API is so new that ChatGPT doesn't know anything about it, so you can't really use ChatGPT for Gradio dev. I never thought about that but actually, super annoying downside. Actually maybe it works now with plugins? But when I tried originally it was hallucinating all the time, worse than useless. It's an interesting factor to consider -- a software tool or library or language created or heavily modified after an AI model's cutoff date means the model is a much worse aid. Welcome to 2023. I wonder if that's gonna work as a negative pressure against new programming languages, an extra hurdle to overcome.

sharp gale
#

Hahaha wow yeah true

#

Have you tried asking Bard to check if it has the same knowledge?

untold briar
#

I haven't, I should totally.

#

They say it has no cutoff date, not sure how exactly that is implemented, but it's got to be at least more recent.

#

My face when I first tried Gradio, tried using ChatGPT, and it hallucinated the most god damn beautiful Gradio API and code, with all the features I wanted, just perfect. And then I discovered it was all just complete fantasy. There's got to be a German word for this feeling by now.

sharp gale
#

Lol

earnest copper
#

@untold briar new programming languages can be easily added onto models via LoRA

untold briar
#

Back on topic, anyone found some way to minimize the chance of audio becoming more metallic the longer it gets? It doesn't always happen but I haven't really noticed a trend in sampling parameters, or anything really obvious at least.

earnest copper
#

the only way to do that is to actually modify the way the pipeline works to use like, a split attention head

untold briar
#

Are you replying to the negative prompt idea?

earnest copper
#

the audio going off the rails over time.

#

that, i believe, occurs because the attention head is linear and has a limited sliding window len

untold briar
#

Huh. Is that a lot of changes, or could it be hacked on pretty quickly?

earnest copper
#

they can implement multi-layer single head attention and it could work better than multi-head split attention

#

it's a LOT of changes, but nothing that a genius like the ControlNet team couldn't hack up in a day

#

and i could also be way off base about it Sad

untold briar
#

Can I also ask you about the negative prompt idea, as you seem to have some real actual expertise. Do you know if it even makes sense in LLMs, or if there's a reason why it's not done?

earnest copper
#

@open magnet sorry to ping you, but is that close to the mark for the reason we don't have longer than 13s, that the audio recordings even from a given prompt end up 'forgetting who they are'? I concat renderings and the voice noticeably changes toward the end of the more exciting statements.

untold briar
#

I think the model was trained on segments of that size, and they do start a bit in the middle, to minimize edge effects, but not sure

earnest copper
#

@untold briar that would probably be a 'user preference layer', a LoRA. you can generate text embeddings as well. there's really no reason you can't do it. but it's likely that the pipeline would have to be altered to make that work. and i'm assuming that'll be less performant in a big way, or they would have done it.

#

i should clarify, less performant with our current hw/net architectures

untold briar
#

With a regular text LLM it's kind of hard to imagine what exactly the negative prompt should do, if it works correctly. But with Bark I can almost see it. A negative prompt of "I HATE YOU AND YOU SHOULD DIE" might make it more friendly and quiet, something like that.

earnest copper
#

negative prompts can be thought of as weighted probability sets that influence the decisions made by the model, to drop the probability score of those tokens and reduce their likelihood of generation
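
As a toy illustration of "drop the probability score of those tokens": subtract a penalty from the logits of a chosen token set before the softmax. The dict-of-logits shape, the function names, and the penalty value are all assumptions for this sketch; a real sampler would do this on tensors inside the generation loop.

```python
import math
from typing import Dict, Iterable


def softmax(logits: Dict[int, float]) -> Dict[int, float]:
    """Turn per-token logits into probabilities (numerically stable)."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def apply_negative_penalty(
    logits: Dict[int, float],
    negative_tokens: Iterable[int],
    penalty: float = 2.0,
) -> Dict[int, float]:
    """Lower the logit of every token in the negative set by `penalty`,
    leaving all other tokens untouched."""
    negative = set(negative_tokens)
    return {tok: val - penalty if tok in negative else val
            for tok, val in logits.items()}
```

The penalized tokens stay possible, just less likely, and the remaining probability mass is redistributed over everything else.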

#

they already do this but bake it into the model for things like the N word and so on. they provide positive and negative prompts via the Anthropic dataset. have you looked at that?

untold briar
#

So the way I was gonna try it, very crudely. Generate one sample of negative prompt. Compare that sample versus like average english sample and try to pick out prompt specific tokens, not language generally, with some stats. So you don't penalize the sound of human speaking or something general like that. And then penalize those tokens in real sample.
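
One crude way to do the "with some stats" part, as a sketch under assumptions: treat each sample as a flat list of token ids, compare smoothed relative frequencies between the negative sample and a pile of ordinary English samples, and keep tokens that are much more common in the negative one. The function name, the ratio threshold, and the add-one smoothing are illustrative choices, not anything from Bark.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def prompt_specific_tokens(
    negative_sample: Sequence[int],
    baseline_samples: Iterable[Sequence[int]],
    min_ratio: float = 3.0,
    smoothing: float = 1.0,
) -> List[int]:
    """Return tokens whose smoothed relative frequency in the negative sample
    is at least `min_ratio` times their frequency in the baseline samples."""
    neg_counts = Counter(negative_sample)
    base_counts = Counter(tok for sample in baseline_samples for tok in sample)
    vocab = set(neg_counts) | set(base_counts)
    neg_total = sum(neg_counts.values()) + smoothing * len(vocab)
    base_total = sum(base_counts.values()) + smoothing * len(vocab)

    selected = []
    for tok in sorted(neg_counts):
        neg_freq = (neg_counts[tok] + smoothing) / neg_total
        base_freq = (base_counts[tok] + smoothing) / base_total
        # Tokens common to all speech have a ratio near 1 and are skipped,
        # so generic "human speaking" tokens are not penalized.
        if neg_freq / base_freq >= min_ratio:
            selected.append(tok)
    return selected
```

The returned ids are the ones you'd penalize when sampling the real output. One negative sample is thin evidence, so more samples per prompt would probably make the selection less noisy.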

earnest copper
#

i'm just going to start training stuff on hollywood movies.

#

good quality images, audio, and, often, transcripts are baked into the format

untold briar
#

I was thinking a super weird one might be 'audio description captions' like where they describe what's happening on screen. I don't know how useful it is, but they talk very precisely over lots of dynamic audio.

#

I mean, I don't know what you do with outputs like that.

#

But I was kind of curious what it would learn.

earnest copper
#

omg lmao

#

the Suno crew should be willing to describe how to do the training and stuff but idk when they'll decide that

#

baseline Stable Diffusion 2.1 -> my Lord of the Rings model (The Hobbit), early on in training

#

this is considered an impossible challenge, to style transfer into SD 2.x without destroying it. so, i'm definitely excited for the challenge with Bark.

untold briar
#

Nice

#

The pace of dev in the Diffusion space is nuts. So many things all the time.

#

Random question before I'm gone for a bit, anyone gotten somewhat realistic applause sounds from Bark? I always get a very very electronic sound in the place that is clearly supposed to be applause.

open magnet
earnest copper
#

i see. it's just that when i do have longer audio, the voice deviates from the initial audio, a LOT. it never seems to normalize itself against the initial sample

#

so i was wondering if the processing is fully linear, and if it is, would a split attention approach help with coherence, or would it hurt

#

i have only briefly looked at it and i definitely barely understood it, so my understanding is coming from the sliding window length stuff i messed with

untold briar
#

By deviates, do you mean deviates from the sound of the voice, or the audio quality itself gets the metallic twinge, or both?

#

I'm not sure if they are correlated, but I can't really remember off hand. I don't remember noticing that specifically at least.

earnest copper
#

both

untold briar
#

The metallic trend is such a consistent effect, it feels like there should be some fix, especially since it's not always there. I kind of wonder if the voice thing is a deeper issue, it's just hard to perfectly model a whole human voice from a history_prompt.

#

Totally uninformed feelings, lol.

#

Even in ChatGPT, with text, sometimes you feel it starting to degrade when you approach the context window, and it's less coherent. Not even go past it, just close.

earnest copper
#

oh, have you ever tried downsampling human voices to 22kHz? maybe tis' just that.

#

@untold briar it isn't modelling a whole human voice based on the recording. the model takes in the audio and behaves similar to something like GPT2 where it autocompletes what it is given, following the same patterns.

the emotions come from the actual training that went into the coarse model

#

a voice that sounds similar to X or Y has tensor space, physically co-located

#

there's interesting issues with 'overfitting' on the audio and i'm not certain whether an overfitted voice will perform better or worse. i know that overfitting the coarse model on one voice will result in a reduced variability of its outputs.

#

it likely just needs to be a larger model with more parameters to train.

#

it's a hard trap to fall into - thinking this is as good as it gets - but this stuff is a moving target, and it's likely that suno has internal models that far surpass this toy's abilities

untold briar
#

Oh yeah, I agree. I guess I still meant like, presumably the prompt still has to be long enough for that? It's shorter than I expected in actual effect. 256 in semantic, like 206 or in that range in coarse for the semantic tokens. Fine strangely has the longest input but fine also doesn't seem to really matter much. I do find even literally randomly chopping up coarse prompts quite a bit can push it into a bit more expressive space, when the outputs seem to get stuck in a less expressive space. Just from my literal caveman science slicing and dicing. One thing I do all the time is trim both semantic and coarse, removing chunks somewhat randomly, to try and break out of some effect like background sound. Doesn't usually work but sometimes actually does. Or just resizing it fractionally, like, "I want a voice only a little like that..."

#

An interesting thing to do is render 20 versions of a history_prompt, across the full range of sizes 0 to 256, or 341 or whatever fine is I forget. And you can get a feel for how much prompt carries how much an effect on the voice. Really just 256 I could honestly hardly ever tell the difference in fine, I just saw the output was technically different unless I extended it that much. Using semantic length here, so adjust accordingly per the coarse ratio, I usually just think in semantic units even when messing with coarse.
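
A sketch of that resize-and-sweep experiment in plain Python lists, so it runs without Bark. Assumptions, loudly: the prompt is a dict with `semantic_prompt` (flat list), `coarse_prompt` and `fine_prompt` (lists of rows, one per codebook), and the token rates are the values I believe Bark uses (49.9 Hz semantic, 75 Hz coarse/fine frames). Check them against your install. The function names are made up.

```python
from typing import Dict, List

# Assumed token rates; treat these as placeholders to verify, not facts.
SEMANTIC_RATE_HZ = 49.9
COARSE_RATE_HZ = 75.0


def _tail(seq: list, n: int) -> list:
    """Last n items of seq (seq[-0:] would wrongly return everything)."""
    return seq[len(seq) - n:] if n > 0 else []


def trim_history_prompt(prompt: Dict[str, list], n_semantic: int) -> Dict[str, list]:
    """Keep only the last `n_semantic` semantic tokens and the matching
    span of coarse and fine frames, thinking in semantic units."""
    semantic = prompt["semantic_prompt"]
    n_semantic = max(0, min(n_semantic, len(semantic)))
    n_frames = int(round(n_semantic * COARSE_RATE_HZ / SEMANTIC_RATE_HZ))
    return {
        "semantic_prompt": _tail(semantic, n_semantic),
        "coarse_prompt": [_tail(row, min(n_frames, len(row)))
                          for row in prompt["coarse_prompt"]],
        "fine_prompt": [_tail(row, min(n_frames, len(row)))
                        for row in prompt["fine_prompt"]],
    }


def sweep_sizes(max_semantic: int = 256, steps: int = 20) -> List[int]:
    """Evenly spaced semantic lengths from 0 to max_semantic inclusive."""
    return [round(i * max_semantic / (steps - 1)) for i in range(steps)]
```

Rendering one generation per size from `sweep_sizes()` with the same text and seed is the "20 versions" experiment. With real Bark prompts the rows would be numpy arrays and the slicing would be done on the last axis instead.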

#

That's kind of how I got into manually hand crafting voices, cause you can kind of just mix them up like recipes and Bark makes it work, at least sometimes. Crazy.

#

To be fair there's no good reason to do this manually. Like seriously do it the right way, train a model like Mylo. But it was interesting that doing it by hand was actually kind of possible.

glossy trout
#

@untold briar - I've found that the voice gets more metallic as you keep passing the full_generation result of a generation in as history_prompt. Instead, if every couple sentences you restart from the base history_prompt file, instead of using the full_generation from the previous prompt, you "restart" the degradation, so the overall quality is higher.
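
That restart strategy, written out as a small sketch. `generate` is a hypothetical callable taking `(text, history_prompt)` and returning `(full_generation, audio)`, and `reset_every` is an illustrative knob for "every couple sentences"; neither is a Bark name.

```python
from typing import Callable, List


def generate_long(
    generate: Callable,
    sentences: List[str],
    base_prompt,
    reset_every: int = 2,
) -> list:
    """Generate sentence by sentence, chaining each full_generation into the
    next call, but going back to the base prompt every `reset_every` sentences."""
    history = base_prompt
    clips = []
    for i, sentence in enumerate(sentences):
        if i % reset_every == 0:
            history = base_prompt  # "restart" the degradation
        full_generation, audio = generate(sentence, history)
        clips.append(audio)
        history = full_generation  # chained until the next reset
    return clips
```

A smaller `reset_every` should keep the audio cleaner at the cost of less continuity between sentences, which is the trade-off discussed here.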

untold briar
#

Like just one 14s chunk

#

Absolutely you can't pass full_generation over and over, without some other edits, yeah.

glossy trout
#

Hmmm yeah. Aside from that, I haven't found better strategies.

untold briar
#

So if you break your 14s segment into two smaller halves, you do usually get less metallic.

#

But it's less cohesive as a sample.

#

Mainly you can just keep trying until it's not metallic. It just usually works eventually. So maybe there's some sampling tweak

glossy trout
#

Ohhh interesting. So that means shorter samples are less metallic?

untold briar
#

For me, usually the effect increases over time

glossy trout
#

Oh wow, good to know. I haven't tested that

untold briar
#

It's just strange how it's just perfect, and same sampling parameters, and then it's not.

glossy trout
#

My understanding is that shorter samples and longer samples both have a higher chance of hallucinations. The closer you can get to 14s, not too long and not too short, the lower the chances of hallucinations

#

So there's some trade-off between metallic voice vs hallucination probability

untold briar
#

I think that's right, yeah, going to 1s is really bad

glossy trout
#

By going shorter

untold briar
#

I've tried everything, can you tell, haha. If you give semantic two or three words at once. I mean it works but it sounds like you handed an actor a notecard with just those words, they said them, and then you gave them the next notecard.

#

Even if you pass the history correctly, from prev segment, for each two or three word chunk. You super feel the cadence of the original chunks in the voice.

#

I think this actually doesn't even work at all in the small models.

#

But the large one kind of suffers through it, but it just sounds terrible.

#

It makes sense right? Imagine if you gave an actor the first three words of a line read.

#

How can they possibly know how best to say the full line from that

glossy trout
#

Hmm - yeah I feel like even passing ~12s of audio, it often doesn't sound good in long sentences

untold briar
#

Also I tried using a single semantic generation, with two coarse gens. And IIRC that didn't fix the metallic sound.

glossy trout
#

If you break a 20s sentence into two parts

untold briar
#

So I think it's from semantic. But honestly I didn't double check that.

glossy trout
#

hmmmm

untold briar
#

It feels like it should be in the coarse, so honestly, somebody should double check. Or I will someday

#

I might still have the notebook, maybe I double check if I screwed up. Cause it feels wrong the effect is in the semantic tokens.

glossy trout
#

I think theoretically it can be in either. Semantic tokens can probably generate that metallic sound if it's low quality

untold briar
#

Yeah they are both super expressive, changes to either can change so much

#

Someone mentioned even Eleven Audio has that effect too? At least sometimes? so maybe it's not a simple fix

#

I also agree the biggest audio segment you can get away with the better. Give the model a lot of info. Imagine if you could give it a whole paragraph, so it can properly start and finish like a real audiobook reader would go through a paragraph with a specific rhythm.

#

You do end up with way too fast talking sometimes, but in general, aim big

#

This is one hacky way to adjust speaking speed as well. Keep trying large chunks of text and saving, or small chunks of text and saving, until you get something that still sounds the same but generally talks faster or slower.

#

If you don't need that specific voice though, just start with a fast or slow one.

#

I really want .npz to come with variants, baked in. It's super annoying to get good ones, but somebody only needs to make them once.

#

fast slow sad happy whispering, whatever

#

I just tried for whispering, because somebody in here wants that specifically. It generally changed the voice, so that one might not work. But I was only trying on one npz.

#

It should work, probably just unlucky

#

btw if by some freak chance somebody has been generating tons of clear French singing npzs, I would be eternally grateful for them.

#

I'm a bit more optimistic about Bark for music, if it's pure singing, with little to no background. That seems doable.

copper pewter
#

release

#

@untold briar @glossy trout

untold briar
#

Nice, you're gonna kill me, it's 5AM, but I am curious

#

Should I literally make tea

#

jesus

#

I guess it's saturday

untold briar
copper pewter
#

👍

sharp gale
#

Hooo baby! Thanks man!

#

Is it useful to use really long data of audio? like 45 mins or hours?

copper pewter
#

8 hours for the training of the voice cloning model
6 seconds or more for a voice you want to clone

#

there's already 2 pretrained models, 4 and 14 epochs

#

they should be fine for most purposes

sharp gale
#

So I would probably get the same output if I use 10 secs of cloning or 10 minutes of cloning?

#

for cloning*

copper pewter
#

yeah, just make sure your audio clip fits the recommendations explained in the git repository

sharp gale
#

yep yep

copper pewter
#

clear, no music, ends after a sentence

sharp gale
#

Does this have a UI yet, or just the source files?

copper pewter
#

this can be implemented in any ui etc

untold briar
#

I'll add it, maybe Sunday?

copper pewter
#

i can beta release my webui, it's just really early access and stuff

#

like most of this is practically useless

sharp gale
#

gotcha

copper pewter
#

still kinda cool to have this refreshing thing though

sharp gale
untold briar
#

You conquered gradio, refreshed the files.

#

lol

#

Why isn't it just automatic???

copper pewter
#

gradio has most things you need, just not very clear how to do them

untold briar
#

Does it refresh automatically? I genuinely thought I had to do it myself?

copper pewter
#

no, not here, it would be too slow

untold briar
#

If the user generates a sample. Gradio won't see the new npz by default
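A small sketch of the refresh step, assuming speakers live as .npz files in a single folder. `list_speaker_files` is a made-up name, not something from either webui.

```python
from pathlib import Path


def list_speaker_files(folder):
    """Return the sorted names of all .npz files currently in folder.

    The folder is re-scanned on every call, so anything generated since the
    last call shows up. A refresh button's click handler could return this
    list as the new dropdown choices (for example via gr.update(choices=...)
    in Gradio 3.x; that wiring is an assumption, not taken from the chat).
    """
    return sorted(p.name for p in Path(folder).glob("*.npz"))
```

The point is only that the file list is computed when the button is pressed, not once at startup.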

copper pewter
#

like, i made this a while ago, and it made me want to learn more about ai to make my own models etc

untold briar
#

So I found if you randomly sample and penalize tokens from language, you can just get random accents.

#

Even without a lot of data to use as reference

copper pewter
#

makes sense

untold briar
#

This was the only time I got this accent, but I forgot to save the random tokens!

copper pewter
#

also, do you allow people to use empty prompts or na?

untold briar
#

I was gonna allow empty, call it extra confused

#

but didn't actually implement

copper pewter
#

hmm, what if someone made a model to enhance audio, so you can replace it in the semantic history to get a higher quality voice

untold briar
#

both prompts though

copper pewter
#

the prompt was "a"

#

history prompt was dantdm

untold briar
#

I think I mentioned, encoding bad audio to fix it

copper pewter
#

wonder if you can literally just disable the assert for empty prompts for enabling them

untold briar
#

is a cool use

copper pewter
#

i want infinite famous people nonsense!

untold briar
#

You can, ugh, I'm pretty sure I at least tried this

#

it was why I made extra confused mode an int

#

for how many segments of no prompt

#

haha

copper pewter
#

also, with my code it's kind of like, something updates in bark? my code will keep working, as i have to update it myself. It did not support v2 speaker prompts until i updated it myself lol

#

i wonder what that could do

untold briar
#

yeah i'm in even more trouble

#

out of control

copper pewter
#

pain

#

literally just this for me

untold briar
#

well bark infinity isn't using it at all

#

at least now, partly why

copper pewter
#

but i'll see if i can increase the token limit as well, just slide the history window right?

untold briar
#

it looks like a lot but it's 3 sets of tokens

#

and a bunch of constants

#

really

copper pewter
untold briar
#

The length is 1024

#

just the model

#

you can push it but it's probably bad

copper pewter
#

well, limited the shortest at least

#

idk about the exact limits

untold briar
#

Oh I meant like, all the parameters, it's really just 3 'sets of tokens'

copper pewter
untold briar
#

how big is that?

#

My guess you get a lot more not following the text?

copper pewter
untold briar
#

I have at least 5 hours like that on my hard drive

#

maybe 2x that

#

For the two jokes. I was looking for real punchlines

copper pewter
#

bitteling

#

sometimes it makes a bit of sense, sometimes it repeats a word, and sometimes it's just a word salad

untold briar
#

If you scroll WAY back. somebody from Suno ran it on their better or bigger model.

#

And it made a punchline on like sample 3!

#

Somewhere in this discord

#

So it's way better than our bark, for that. I ran so many

spring herald
#

@copper pewter Thanks for the repo. I'm still wondering, how do I do the cloning? Sorry, I'm new to this.
I have the wav audio file which i want to clone and i have your repo.
https://huggingface.co/GitMylo/bark-voice-cloning
https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer

Now how do i approach from here? If you can explain so that all of us who are non programmers can understand and use.

I tried to use this script https://github.com/gitmylo/bark-data-gen to generate the data using colab,Its running and has generated a bunch of .npy files. If this is the first step. What to do next?

copper pewter
#

i'll release my webui in a bit, you can do it in there (just note that it's bleeding edge, and breaking changes might be pushed to main at times)

foggy sandal
#

@copper pewter I'm just looking at your repo now - did you forget to commit hubert_manager.py?

copper pewter
#

possible

#

yeah looks like it, i guess pycharm didn't think it was needed since it was externally added (since i made that in the webui initially)

#

pushed it

copper pewter
#

correct, that's the 14 epoch quantizer

foggy sandal
#

great, thanks

copper pewter
#

clone, run run.bat and you're ready

#

should be, haven't done testing on more than 2 machines

#

untested on linux, etc as well

foggy sandal
#

the other problem with doing ML stuff is I feel like I'm constantly buying/filling up hard drives

copper pewter
#

yeah, agree

#

i have an ssd with hundreds of gigabytes of stable diffusion models
and a hard drive with a hundred gigabytes of other models as well

untold briar
#

Next Frame Prediction models.

#

So many

#

the checkpoints are each very different!

#

Even days apart

copper pewter
#

i should probably add a little thing to show the command line flags too

#

since the -si flag is really useful, it skips the install, so it won't check if your packages are installed before launching

untold briar
#

You know I can barely type but your code is small and clean. Could I actually do this even on 0 sleep

#

hmnn

#

A thing worth doing is a thing worth doing badly. Maybe.

#

well let's try using it at least

#

Even clear documentation. Nice

sharp gale
untold briar
copper pewter
untold briar
untold briar
#

They are soooo good

#

the rhythm, sooo nice

#

Actually a ton of work, trying to get singing but not change voices too much, lol

#

You kind of need like tons of clones, one is not enough

#

I didn't make a tool yet, but the best trick: use a full long 4 line singing prompt. But then cut the history prompt as short as you can, from the front, and preserve the og voice. best chance

#

like use first few seconds only

copper pewter
copper pewter
glossy trout
glossy trout
sharp gale
#

Hey @copper pewter ! So where exactly do I run the command lines? I add that to the .bat file?

Also, where do I add the file to clone the voice in the UI?
sorry to bug you for tech support lol

copper pewter
#

the bat is for the webui, it has it built in

#

like, the huggingface link is the model itself, the bark-voice-cloning-HuBERT-quantizer repo is the semantic extraction code itself. the audio-webui is a webui i was working on

#

the reason i made voice cloning was because i felt like it was missing from my webui (although a lot of stuff is still missing)

sharp gale
copper pewter
#

yeah, the webui has voice cloning integrated

sharp gale
copper pewter
#

for the webui: you just load the bark model on the text to speech tab, and set "speaker from" to "upload", which allows you to upload a wav

sharp gale
#

Ah gotcha that's where I'm running into the low memory problem then I only have an 8gb gpu

#

where do i run the command for the low vram?

copper pewter
#

instead of directly running run.bat, you can add arguments, or create a bat where you run with arguments

sharp gale
#

gotcha!

#

would this work if i add this as the first thing on the bat file?

set COMMANDLINE_ARGS= --skip-install --bark-low-vram

#

Oh nvm it didn't work lol, the bat is still installing

#

I'll figure it out thanks for the clarification man

copper pewter
#

it doesn't use COMMANDLINE_ARGS

#

this isn't stable diffusion webui

#

instead, what you do is you create a new bat file, which does this:
call run.bat --skip-install --bark-low-vram

sharp gale
#

ah that's what was missing! the call command. Thanks man!

#

Finally got it working you're the man Mylo! Thank you!

untold briar
#

@copper pewter I'm planning on tagging the NPZs I make so I don't mix them up with regular generations maybe this could be a general convention? Just in case I might have wished to know in the future. Maybe

wooden pier
#

Is cloning only for English now ?

untold briar
#

If you care about accuracy

#

For music or something it can also be used

wooden pier
#

🤣 I mean other languages

untold briar
#

I haven't had time, but it might be ok in European languages or ones closer to English

#

but it's lower quality for sure

#

I bet it could be fixed in a day, so no worries

#

I could probably do French, happen to have a lot

wooden pier
#

so you need to train another model for a different language, right ?

untold briar
#

Nope, just need more data really

#

Train the same model, just better

wooden pier
#

then all language mixed together?

untold briar
#

It could be the case it works better if you split, I kind of doubt it but maybe

#

it hasn't been 24 hours so like, pretty early

wooden pier
#

💀 I want to try japanese and chinese at the moment

untold briar
#

I'm super short on time, but tomorrow I'll try a bunch

wooden pier
#

ok, I'll wait

signal apex
#

Hi ! Where can I find a birdseye view on how bark works ? The global architecture, and the different steps ?

#

I am new to ml and tts and have been going around audio ml, spear, and other to chew on new concepts. I'd like to do the same with bark. Does it do text to semantic token to ?? To mel-spec to audio ?

untold briar
#

no mel

#

look at audiolm

#

the same

#

a lot of it is almost exactly the same

#

there's no page like that for bark specifically so that's the best model

#

Anything I say is basically badly trying to summarize that

#

well the low level boring stuff i kind of know

signal apex
#

Ok, I was reviewing this as well, cool. Does it use the pytorch implem of soundstream or encodec ? Both those codecs can transform audio tokens to audio right ?

untold briar
#

encodec

#

I think, honestly never dug in that part

#

but pretty sure off top of head
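A hedged sketch of that pipeline as I understand it: text to semantic tokens, then coarse tokens, then fine tokens, then a codec decode to audio, with no mel-spectrogram step. The four function names are the ones in Bark's `bark.generation` module. `text_to_waveform` and the `stages` argument are made up here so each step can be swapped out.

```python
def text_to_waveform(text, history_prompt=None, stages=None):
    """Run the four Bark stages in order and return the decoded audio.

    stages is an optional dict of callables ("semantic", "coarse", "fine",
    "decode"); when omitted, the real Bark functions are imported lazily.
    """
    if stages is None:
        # Needs the bark package installed; only imported when actually used.
        from bark.generation import (
            codec_decode,
            generate_coarse,
            generate_fine,
            generate_text_semantic,
        )
        stages = {
            "semantic": generate_text_semantic,
            "coarse": generate_coarse,
            "fine": generate_fine,
            "decode": codec_decode,
        }

    # 1. Text -> semantic tokens (what is said, and roughly how).
    semantic = stages["semantic"](text, history_prompt=history_prompt)
    # 2. Semantic -> coarse audio tokens (the first couple of codec codebooks).
    coarse = stages["coarse"](semantic, history_prompt=history_prompt)
    # 3. Coarse -> fine audio tokens (the remaining codebooks).
    fine = stages["fine"](coarse, history_prompt=history_prompt)
    # 4. Codec decoder turns the full token stack into a waveform.
    return stages["decode"](fine)
```

The same `history_prompt` goes to all three generation stages, which is why a speaker .npz carries a semantic, a coarse and a fine prompt.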

signal apex
#

Also generally speaking, why is there a separation between coarse and fine audio token ?

untold briar
#

that's a surprisingly adept question

#

I don't think they would be too mad if I mentioned, the devs said it's probably not actually ideal.. But that's how it is.

#

haha

#

I mean still works great

#

but there were strong hints probably would not do that again

signal apex
#

Is that an architecture geared for training optimisation ?

untold briar
#

that's past what I know, just a few details and practical stuff, not sure

#

Doesn't seem like it but no idea

#

really it just feels like its goals are general purpose and wide, first

#

so really that's it

#

It might train fast, I think somebody did something with a new language

#

very fast but I don't know

signal apex
#

Ok, so that's also probably how audio lm work? But there must be a general reason for the why. Is there a Suno team member I can ask in here ? (I thought you might be)

untold briar
#

admin in discord, but has sleep symbol

#

but those people

#

especially gkucsko if you catch him online

#

or just search the discord for his message perhaps

#

that might answer your questions pretty well

#

that's where all the knowledge would have come from, if somebody did know

signal apex
#

Cool thanks let me ping @open magnet on the question then, for when he'll be around 🙏

untold briar
#

They say they want to be a foundation model for audio. Foundation models aren't like about training fast, right? Just being super cool and awesome.

#

Actually that's not right, training fast can be part of that, in fact is

#

It's too late

#

I was thinking GPT like super big and slow, then I was thinking of Diffusion, all the Segment Everything things, and like training fast in some ways is cool

wooden pier
#

Traceback (most recent call last):
File "D:\audio-webui-master\audio-webui-master\main.py", line 9, in <module>
from webui.modules.implementations.tts_monkeypatching import patch as patch1
File "D:\audio-webui-master\audio-webui-master\webui\modules\implementations\__init__.py", line 1, in <module>
import webui.modules.implementations.ttsmodels as tts
File "D:\audio-webui-master\audio-webui-master\webui\modules\implementations\ttsmodels.py", line 9, in <module>
from webui.modules.implementations.patches.bark_custom_voices import wav_to_semantics, generate_fine_from_wav,
File "D:\audio-webui-master\audio-webui-master\webui\modules\implementations\patches\bark_custom_voices.py", line 8, in <module>
from hubert.pre_kmeans_hubert import CustomHubert
File "D:\audio-webui-master\audio-webui-master\hubert\pre_kmeans_hubert.py", line 9, in <module>
import fairseq
File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\__init__.py", line 20, in <module>
from fairseq.distributed import utils as distributed_utils
File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\distributed\__init__.py", line 7, in <module>
from .fully_sharded_data_parallel import (
File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\distributed\fully_sharded_data_parallel.py", line 10, in <module>
from fairseq.dataclass.configs import DistributedTrainingConfig
File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\dataclass\__init__.py", line 6, in <module>
from .configs import FairseqDataclass
File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\dataclass\configs.py", line 1104, in <module>
@dataclass
^^^^^^^^^

#

raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'fairseq.dataclass.configs.CommonConfig'> for field common is not allowed: use default_factory
Press any key to continue . . .
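For anyone reading that traceback later: the ValueError comes from Python's own `dataclasses` module, which refuses mutable defaults on fields. Here is a tiny stand-alone reproduction with a plain list, plus the `default_factory` form the message asks for. `Bad` and `Good` are throwaway names; this is not fairseq's code. On Python 3.11 and later the same check also rejects any unhashable default, which may be what trips the fairseq config here. That part is a guess from the traceback, not something confirmed in the chat.

```python
from dataclasses import dataclass, field


def define_bad():
    # Defining this class is enough to trigger the error; no instance is needed.
    @dataclass
    class Bad:
        items: list = []

    return Bad


try:
    define_bad()
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)


@dataclass
class Good:
    # default_factory builds a fresh list for every instance.
    items: list = field(default_factory=list)


a, b = Good(), Good()
a.items.append(1)  # does not leak into b
```

So the fix belongs either in the library's field definitions or in picking a library and Python version pair that agree with each other.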

untold briar
#

I don't think I can possibly debug that this late but I might try and update my fork in an hour, if it can be done, and I have that running and didn't get that error

#

To me that looks like one library is some slightly wrong version

#

good luck figuring it out

wooden pier
#

💀 try to add anaconda installation

untold briar
#

omfg I really am asleep. Search my history I did have that

#

Just pip install fairseq

#

forget conda

#

too out of date

#

some breaking change

#

sorry that's hilarious, literally that exact problem hours ago, like 5

wooden pier
#

I hate vene😭

#

venv

untold briar
#

it should be like a thing, if you make breaking changes, then update your conda wtf

#

actually not sure how that stuff is organized exactly, now that I think about. but surely somebody is to blame

signal apex
untold briar
#

Sounds a lot less expressive to me though

#

Not to mention, all the same style

#

Bark is really wide

#

all those samples are tiny blip in Bark space

#

the same dot

#

they are okay but all kind of close. though maybe Google just chose them all in that style

#

i'm being hyperbolic but I do think Bark is quite a bit better

signal apex
#

Ok, I personally am blown away by the soundstorm demo. Generating 30 seconds of dialog from 2 seconds of audio, in exactly the same voice, is way over any results I've seen from bark right

untold briar
#

Well

#

what about 0 seconds

signal apex
#

They do 0 seconds as well

untold briar
#

I mean

#

like a perfect model, no wav input

#

I have like 10

#

IT's nuts

#

I keep meaning to do a writeup but never got around to it

#

you can hone in on them in the latent space

#

like anyone it's crazy. no wavs

#

it's a bit pointless now that we have actual cloning

#

but super cool

#

not really anyone, presumably famous people

#

I would explain but basically just keep tweaking bark voices directionally

signal apex
#

Personally I don't want cloning. I want to be able to generate 10-20 voices that are general purpose for audio books and podcast, and even the top would be prompting the model for a voice

untold briar
#

I'm a bit baffled by the cloning myself

signal apex
#

Cloning is just a quick way to get a voice

untold briar
#

Why do people want it so bad? If it's just clear voices, you won't need it

#

Just need a bit of time

signal apex
#

Time is exactly what people don't have, and skills. Voice cloning removes those problems.

#

Also, long form right

#

I tried to do the long form bark, could not get anywhere. I still want to learn and improve my bark-fu, so I'll get back to it

#

But can you share your best inferences on your best voices ?

untold briar
#

I can totally but can we make it tomorrow, I'll be doing dev on bark probably

#

i mean i probably have random wavs but

#

So what I have handy is trying to make the presidents sing

#

lol

#

These are some of my favorites to be honest though

signal apex
#

Do you have paragraph long audio ?

untold briar
#

Singing is hard and tends to lead to distortion or voice changes

signal apex
#

Like long paragraph ?

untold briar
signal apex
untold briar
#

these were not chosen for clarity at all just some i have

#

typically the clarity is mostly in the voice

#

but Bark does have a reliability issue. but non real time? you can just try again and get clear

signal apex
#

Get that. But imagine you want to feed it a book or an article. You need a reliable pipeline, not manual trial and error. I am sure there are ways to get there but bark is not there rn

untold briar
#

Do you think so? I feel like that's not right exactly

#

Oh nvm

#

I was gonna say you would direct an actor

#

And listen to the whole book

#

but you mean, as purely automated

signal apex
#

My knowledge is super limited actually, maybe bark can actually do all that !

untold briar
#

Yeah I think Bark is better thought of almost like an actor. Not a script. So you may have to be involved, but the final result is a lot better

signal apex
#

(can't download your sounds from the app, I'll try the PC later)

untold briar
#

Oh hm

#

i think it's the lack of video

#

annoying

#

in mp4

signal apex
#

Apologizing to have all this discussion on the main channel. Let's do a thread

untold briar
#

tomorrow i'm out for the night, even mentally

signal apex
#

Bark and longform

blissful verge
wooden pier
#

I haven't been able to run the ui yet

#

trying to upgrade conda

#

and my base env fucked me up

blissful verge
#

There are multiple UIs

blissful verge
#

@copper pewter Would it be possible to correct incorrect readings/pronunciations with a fine-tuned semantic model?
Though the dataset would be a another challenge.

untold briar
#

I think I'll push mine to git, even just mid changes, cause it does work and it's been weeks now. But you may have to figure out what libraries are new or changed yourself, or wait for tomorrow

#

For installing or whatever

#

But it has a copy and paste huberty myloizer

copper pewter
blissful verge
#

That is super exciting, because it suggests that there's an actual method to improve upon this model's weaknesses.

#

There are multiple UIs if you want an easier install.

wooden pier
#

how to clone with audio-webui

#

I have no idea, there is not a place for audio input ?

#

since I used anaconda, I might have just skipped all model downloads

copper pewter
#

when you clone a voice, it gets saved in your bark custom speakers directory,

wooden pier
#

looks like I didn't download any model

#

only quantifier_hubert_base_ls960_14.pth

copper pewter
#

the button you get to download the speaker is if you want to download the speaker based on the audio generated

copper pewter
wooden pier
#

I used anaconda, so I just comment out something in main.py and run it

#

or how to disable venv installation in run.bat process

untold briar
#

Yeah they should be automatic

#

just if the app is loaded

#

I actually used regular installer though, like used the .bat

#

however I had tried conda, that's why I knew the fairseq thing

wooden pier
#

but nothing is downloaded, I am in webui

#

from webui import args # Will show help message if needed

#from install import ensure_installed

#print('Checking installs and venv')

#ensure_installed() # Installs missing packages

from webui.modules.implementations.tts_monkeypatching import patch as patch1
patch1()

print('Launching')

from webui.webui import launch_webui

launch_webui()

copper pewter
#

right, that's how you disable the install and venv check/activation, (since it only checks if it's in venv i think, i don't have conda and don't want to install it)

wooden pier
#

🤣 yeah, just no models

copper pewter
#

it doesn't download on install, in case you'd never use it, it would be a waste of storage space

wooden pier
copper pewter
#

you need to load a model first

#

in this case bark