#📚┃suno-school | Suno | Page 2

turbid vault May 17, 2023, 6:30 AM

#

Another question: can I train with a dataset of Japanese voices and have it generate voice in English?

crude raft May 17, 2023, 6:52 AM

#

don't know if this is the right forum for sharing this, but I found this unofficial repo for training Bark: https://github.com/anyvoiceai/Barkify, thought it could be useful. any comments?

turbid vault May 17, 2023, 7:14 AM

#

looks promising

copper pewter May 17, 2023, 8:34 AM

#

yeah, making a little thing for it right now

#

as a separate script though, so you can just process your generated semantics into wavs

copper pewter May 17, 2023, 8:59 AM

#

fixed the "rn", it was putting literal \\n and \\r (escaped) instead of actual \n and \r, so i just had to convert them

#

or maybe i should decode it better

#

👌

#

lol

#

sounding like a nonsensical news article

#

@glossy trout the repo has been updated, there's now a create_wavs.py, and the texts don't have weird random tokens through them

#

the create_wavs.py will just create a wav for every semantic prompt it finds in the semantics output folder

copper pewter May 17, 2023, 9:40 AM

#

it won't process the same file twice in the wav generation, that's on purpose
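
A minimal sketch of that "don't process the same file twice" behaviour, assuming semantics are stored as `.npy` files and each output wav shares the file stem. The folder layout, the extension, and the function name `pending_semantics` are hypothetical, not taken from the actual create_wavs.py.

```python
from pathlib import Path


def pending_semantics(semantic_dir, wav_dir):
    """Return semantic files that do not yet have a matching .wav output."""
    semantic_dir, wav_dir = Path(semantic_dir), Path(wav_dir)
    # Stems of wavs that already exist, e.g. "clip_0001" for "clip_0001.wav".
    done = {p.stem for p in wav_dir.glob("*.wav")}
    # Only files without a finished wav are returned, so reruns skip old work.
    return sorted(p for p in semantic_dir.glob("*.npy") if p.stem not in done)
```

Checking for the output file rather than keeping a separate log means an interrupted run can simply be restarted.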

copper pewter May 17, 2023, 2:28 PM

#

3000 something semantic files lol, i have some processing to do

#

old dataset was 900 files of 100 tokens (about 2-3 seconds)

#

for the new one i've already got >3000 semantic files which are about 10 seconds long

glossy trout May 17, 2023, 2:29 PM

#

Today is a bit of a crazy schedule - I will try to switch to wav creation, not 100% sure depends on how complicated it is

#

It's currently cranking out semantic files

copper pewter May 17, 2023, 2:51 PM

#

glossy trout Today is a bit of a crazy schedule - I will try to switch to wav creation, not 1...

git pull, run create_wavs.py

#

lol

formal pebble May 17, 2023, 4:06 PM

#

Hi all. Is it possible to use Bark for real-time TTS on React Native?

untold briar May 17, 2023, 4:08 PM

#

Only for one particular definition of real-time, in that a very fast GPU like a 4090 can almost generate 14s of audio in 14s (though not quite realistically). But not for latency.
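
The throughput definition of real-time mentioned here is often expressed as a real-time factor. A small sketch, with made-up timings rather than measured ones:

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Seconds of compute spent per second of audio produced.

    A value at or below 1.0 means throughput keeps up with playback.
    It says nothing about latency: the whole clip may still have to
    finish generating before the first sample can be played.
    """
    return generation_seconds / audio_seconds


# Hypothetical numbers: 15 s of compute for a 14 s clip is just short of real-time.
rtf = real_time_factor(15.0, 14.0)
```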

formal pebble May 17, 2023, 4:11 PM

#

untold briar Only for one particular definition of real-time, in that a very fast GPU like a ...

Ah I see. Thanks!

copper pewter May 17, 2023, 4:22 PM

#

copper pewter lol

i'm about 10% through my files

copper pewter May 17, 2023, 7:39 PM

#

like 33% through now

glossy trout May 18, 2023, 3:57 AM

#

https://www.dropbox.com/s/88y5mydw34rz5u4/output1.zip?dl=0
https://www.dropbox.com/s/19cth3yj674626g/output2.zip?dl=0
https://www.dropbox.com/s/w1i7fuh7fjwqbax/output3.zip?dl=0
https://www.dropbox.com/s/arr4d3mqhn8qbud/output4.zip?dl=0

#

@copper pewter

#

Here are the semantic files

I couldn't do the .wav files today. It's 4 different servers and I was on the road a lot of the day, so I couldn't ssh into them and update the script

#

Let me know if you still need them next week, I can run another process to create them

#

or LMK if you're able to create them and don't need them any more

#

Either way is OK, happy to run more GPUs, just let me know what you need and when. but I'll be out of town the next few days.

lunar glen May 18, 2023, 4:48 AM

#

I'm able to finetune the existing coarse model on a new language and a new set of codebooks. I kept the fine model as it is and just generated some samples from ground truth semantic tokens. Can you spot which one is the original human audio?

glossy trout May 18, 2023, 5:07 AM

#

@lunar glen - I would guess #3

#

Can you share how you retrained the coarse model?

spring herald May 18, 2023, 5:16 AM

#

lunar glen Am able to finetune the existing coarse model on a new language and new set of c...

3 is more human like with clarity. How did you train a new language?

lunar glen May 18, 2023, 5:34 AM

#

it involved multiple steps, I'll write a clean doc sometime this week. your guesses are correct 😄 It's finetuned for just 30 mins on an RTX 2070 😛 So I guess it needs more time and data 😄

graceful condor May 18, 2023, 5:37 AM

#

if it helps, I was able to drop the generation time and improve audio quality while doing so 😲

glossy trout May 18, 2023, 5:47 AM

#

lunar glen Am able to finetune the existing coarse model on a new language and new set of c...

How did you get the ground truth semantic tokens for a new language?

hasty acorn May 18, 2023, 7:26 AM

#

Hi guys

#

i need help, i want to generate audio from long text. i tried many things but it didn't work

#

i generated an audio file but there's no sound in it

copper pewter May 18, 2023, 11:35 AM

#

i wonder if ~8 hours of training data will be enough, but i'll just have to try lol

lunar glen May 18, 2023, 11:43 AM

#

glossy trout How did you get the ground truth semantic tokens for a new language?

for now I have taken the output from HuBERT (hubert_base_ls960.pt) and the kmeans model they provided for it. it's not ideal, but I just wanted to test if it works. then I finetuned the coarse model with this; surprisingly the model learnt these new codes in very few steps (~2k steps)
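
For readers unfamiliar with the HuBERT plus kmeans step: the kmeans model turns each continuous feature frame into a discrete token by picking the nearest centroid. A toy sketch of that assignment, with 2-dimensional made-up frames and three made-up centroids (real HuBERT features are much wider, and the real codebook has hundreds of entries):

```python
def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


# Toy data, purely illustrative.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9]]
tokens = quantize(frames, centroids)  # one token id per frame
```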

#

btw if you finetune with new codes the old knowledge will be lost by default. There may be some ways to make it still preserve the old knowledge. Now to build a working text to speech model, the semantic model also needs to be trained. That I have yet to do.

copper pewter May 18, 2023, 11:47 AM

#

nice, i'm also using the hubert_base_ls960.pt, but a custom model for kmeans, to try and make it compatible with everything

#

it does make sense that finetuning bark's semantic (text_2.pt) would work faster than training from scratch, i just didn't think of it. The back-up for voice cloning will always be a full custom bark model though

copper pewter May 18, 2023, 11:51 AM

#

lunar glen btw if you finetune with new codes the old knowledge will be lost by default. Th...

try cloning your own voice on your finetuned bark models, since you can extract semantics using the hubertwithkmeans and they should be compatible with your model as that was trained on those tokens.

#

meanwhile i'm still generating training data, it wasn't running while i was asleep, but it's been running for a few hours already again

#

and i'm over 50% done now

feral knoll May 18, 2023, 12:32 PM

#

copper pewter try cloning your own voice on your finetuned bark models, since you can extract ...

By 'finetuned bark models', are you referring to already finetuned bark models? Or do you mean fine-tuning the raw bark from scratch? I haven't encountered any repo to fine-tune Bark with, for example, LoRA. Except this issue post https://github.com/suno-ai/bark/issues/117

GitHub

[Feature Request] LoRa fine-tuning? · Issue #117 · suno-ai/bark

I propose to add support for LoRa fine-tuning of the model, as it has been shown to be effective for GPT models based on community feedback. Additionally, there have been attempts by the nanoGPT co...

copper pewter May 18, 2023, 12:35 PM

#

feral knoll By 'finetuned bark models', are you referring to already finetuned bark models? ...

bark runs in pytorch, you can literally just train it based on the already trained model (which is finetuning)

#

you could probably also figure out loras based on what we have

feral knoll May 18, 2023, 12:35 PM

#

copper pewter bark runs in pytorch, you can literally just train it based on the already train...

hm okey, got that

glossy trout May 18, 2023, 1:52 PM

#

@lunar glen @copper pewter - Do you think text_2.pt (Bark's text_to_semantic model) was trained using hubert_base_ls960.pt?

copper pewter May 18, 2023, 1:52 PM

#

probably not, and definitely not those kmeans, the kmeans tokenize to 500 tokens, but bark uses 10000 tokens

#

however i do believe you can achieve good voice cloning by using hubert_base_ls960 with a custom tokenizer which tokenizes to 10000 tokens

glossy trout May 18, 2023, 1:54 PM

#

Ah I see. You'd probably lose some fidelity though, since going from 500 to 10k will be a little lossy

copper pewter May 18, 2023, 1:54 PM

#

oh, and i'm currently at 1.62gb of wavs for the semantics

glossy trout May 18, 2023, 1:54 PM

#

Noooiiiiccceeee

copper pewter May 18, 2023, 1:55 PM

#

still just the semantics i generated yesterday lol

copper pewter May 18, 2023, 1:55 PM

#

glossy trout Ah I see. You'd probably lose some fidelity though, since going from 500 to 10k ...

the 500 token quantizer won't be used

glossy trout May 18, 2023, 1:55 PM

#

Btw, just FYI the audioLM paper originally used w2v-BERT to extract semantics from sound: https://arxiv.org/abs/2108.06209

arXiv.org

W2v-BERT: Combining Contrastive Learning and Masked Language Modeli...

Motivated by the success of masked language modeling~(MLM) in pre-training
natural language processing models, we propose w2v-BERT that explores MLM for
self-supervised speech representation learning. w2v-BERT is a framework that
combines contrastive learning and MLM, where the former trains the model to
discretize input continuous speech signal...

copper pewter May 18, 2023, 1:55 PM

#

it's replaced with my model which fakes it to act like the one used to train bark as well as i can manage

copper pewter May 18, 2023, 1:56 PM

#

glossy trout Btw, just FYI the audioLM paper originally used w2v-BERT to extract semantics fr...

and the audioLM-pytorch implementation uses HuBERT

arctic pasture May 18, 2023, 1:57 PM

#

is anyone aware of any multi-modal or prompt-to-audio models similar to Google's latest project, whereby you can enter a prompt and it will generate sound effects or music?

glossy trout May 18, 2023, 1:57 PM

#

copper pewter and the audioLM-pytorch implementation uses HuBERT

Oh! Got it

copper pewter May 18, 2023, 1:57 PM

#

arctic pasture is anyone aware of any multi-modal or prompt-to-audio models similar to Google's...

https://github.com/lucidrains/musiclm-pytorch

GitHub

GitHub - lucidrains/musiclm-pytorch: Implementation of MusicLM, Goo...

Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch - GitHub - lucidrains/musiclm-pytorch: Implementation of MusicLM, Google's...

#

it's a recreation of google's musiclm, just like bark uses a recreation of google's audiolm

arctic pasture May 18, 2023, 1:58 PM

#

copper pewter https://github.com/lucidrains/musiclm-pytorch

you are amazing. thank you

copper pewter May 18, 2023, 6:55 PM

#

3002/3079

graceful inlet May 18, 2023, 7:13 PM

#

copper pewter try cloning your own voice on your finetuned bark models, since you can extract ...

You can clone your own voice now

#

I thought suno/bark didn’t allow that

copper pewter May 18, 2023, 7:14 PM

#

graceful inlet I thought suno/bark didn’t allow that

it's just not a feature, and they won't enable it, but if you figure it out, it's fine

#

because the legal things around a company doing that are not very clear for example

graceful inlet May 18, 2023, 7:15 PM

#

copper pewter it's just not a feature, and they won't enable it, but if you figure it out, it'...

Ah I see, but we can still fine tune their models on whatever voice we wanted

copper pewter May 18, 2023, 7:17 PM

#

yeah, or fine tune a model to use different semantics and do voice cloning from there
or train a model to create valid semantics for the regular bark models (which is what i'm doing)

#

the data is finally done

copper pewter May 18, 2023, 8:06 PM

#

just preprocessing the data now

graceful inlet May 18, 2023, 8:07 PM

#

Nice

copper pewter May 18, 2023, 8:50 PM

#

this is from the start, it seems to be going pretty fast lol

#

with how much variety there is in the training data, it has not seen any duplicates yet, so the loss going down is a good sign (loss shows every 50 batches, 1 batch is a single audio clip)

brisk crown May 18, 2023, 8:56 PM

#

copper pewter yeah, or fine tune a model to use different semantics and do voice cloning from ...

What kind of GPU(s) are you using? Also, you seem extremely familiar for some reason... can't put my finger on it though.

copper pewter May 18, 2023, 9:02 PM

#

3060 12gb

#

the training is only taking 3.2gb vram total though, you could get it running on a 1650 if you tried probably

#

passed 2 epochs so far

brisk crown May 18, 2023, 9:03 PM

#

I see... if you ever need a faster GPU lmk cuz I have a 3090 and would be glad to help out in any way possible.

hollow root May 18, 2023, 9:03 PM

#

I like the spirit here

copper pewter May 18, 2023, 9:04 PM

#

it seems to be fine on my 3060

brisk crown May 18, 2023, 9:05 PM

#

Another thing is if it's only using 3.2gb of vram, shouldn't you up the batch size to use more resources?

#

Cuz honestly I think if it's only taking that amount of vram, it sounds like you're not utilizing the full power of your GPU

copper pewter May 18, 2023, 9:07 PM

#

there's a few things that are still trial and error, like whether to cut unequally sized outputs from the start or from the end
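
The "cut from the start or from the end" choice can be sketched like this, assuming the predicted and target token sequences just need to be trimmed to a common length before computing a loss. The function name `align` and the `cut_from` flag are illustrative, not from the actual training code.

```python
def align(pred, target, cut_from="end"):
    """Trim two sequences to the same length by dropping the surplus items."""
    n = min(len(pred), len(target))
    if cut_from == "end":
        # Keep the first n items of each sequence.
        return pred[:n], target[:n]
    if cut_from == "start":
        # Keep the last n items of each sequence.
        return pred[len(pred) - n:], target[len(target) - n:]
    raise ValueError("cut_from must be 'start' or 'end'")
```

Which side is right depends on where the extra frames come from, which is why it ends up being trial and error.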

#

and i am at 100% gpu utilisation

brisk crown May 18, 2023, 9:07 PM

#

So this is more a test run than anything?

brisk crown May 18, 2023, 9:07 PM

#

copper pewter and i am at 100% gpu utilisation

Yeh, but you could fit more into your VRAM, couldn't you?

copper pewter May 18, 2023, 9:08 PM

#

brisk crown Yeh, but you could fit more into your VRAM, couldn't you?

yeah, i could, but currently a batch is an entire wav

brisk crown May 18, 2023, 9:08 PM

#

Ah

copper pewter May 18, 2023, 9:11 PM

#

outputs actually look like outputs, nice

#

like, the semantics extracted have similar patterns to actual semantics from bark

#

IT WORKED

#

original

#

regenerated audio

#

the second audio does not know about the semantics of the first, it specifically extracted them from the wav

leaden gazelle May 18, 2023, 9:18 PM

#

ChatGPT ELI5:
In simple terms, the sentence means that the patterns found in the extracted meaning (semantics) from a dog's bark are similar to the actual patterns found in the bark itself. It's like the model successfully captured and reproduced those patterns. The second audio, which was generated separately, doesn't have knowledge of the first audio's meaning. Instead, it extracted the meaning specifically from the sound waves (wav file) itself. 👍

copper pewter May 18, 2023, 9:25 PM

#

lol

#

anyways, i'll implement voice cloning tomorrow, show it for a bit, and release the model and training code later

#

currently just have one for hubert_base_ls960. might add more models later if someone wants it for a bigger HuBERT model, but as you saw from before, it's already really good

copper pewter May 18, 2023, 9:49 PM

#

i'll release the models and code later this week

untold briar May 19, 2023, 1:38 AM

#

Nice. I'm excited to play with it, especially curious about non-voice-cloning uses, music, and enabling whole new things. If you can train it that fast you can do one-off projects with different goals than accurately reversing semantics.

#

I'm already regretting this and dreading the next update in base bark

true wave May 19, 2023, 2:21 AM

#

untold briar Nice. I'm excited to play with it, especially wonder about not voice cloning use...

Holy moley, yeah, base bark updates aren't going to be fun maintaining that...

copper pewter May 19, 2023, 8:01 AM

#

this is based on something i spoke, i don't know why it put it twice though lol

#

actually crazy that it managed this in 3 epochs of training that model

untold briar May 19, 2023, 8:02 AM

#

copper pewter this is based on something i spoke, i don't know why it put it twice though lol

Crazy!

#

You may have realized what I have, Bark makes everything easy

#

I'm always thinking, 'this can't work' and then it does

#

So I could believe it

#

If we can really train a useful model that fast, seriously, Bark is gonna be just ridiculously powerful

#

I've found the semantic model is incredibly robust, so I bet it makes do even with what might be a somewhat poor model, it probably still makes it work. Though you could keep training and make it better for sure.

#

Now I want to take a crack but it's 4AM here and I'm being absurd

#

But yeah, I can't believe the stuff I shoved through the semantic model and somehow still worked

untold briar May 19, 2023, 8:17 AM

#

copper pewter actually crazy that it managed this in 3 epochs of training that model

Sometimes I get messages like "Why don't you make an audio microSaas" or something, and I'm like, "That sounds like another project I don't know anything about that I don't have time for" but people are RABID for voice cloning, it must have been half the chat messages in this Discord for awhile. You probably actually could make an instant business.

copper pewter May 19, 2023, 8:19 AM

#

lol

untold briar May 19, 2023, 8:19 AM

#

The serp github project has like 1000 stars, and it's terrible!

copper pewter May 19, 2023, 8:20 AM

#

i still have to see the quality of voice cloning, i'm currently adding it to my webui

leaden gazelle May 19, 2023, 8:20 AM

#

I'm more surprised that people haven't started doing voice cloned tts yet.

#

Cuz voice clone is speech to speech so far

copper pewter May 19, 2023, 8:21 AM

#

that's just proof of concept

#

because of how bark speaker prompts work, this should also work in text to speech

leaden gazelle May 19, 2023, 8:21 AM

#

I've tried making it tts based, but I use so many vb cables that the results are bad (voice clone i mean)

copper pewter May 19, 2023, 8:22 AM

#

for some reason i'm getting system errors from soundfile when i do torchaudio.load? but only in this part of my webui, it's fine in the other part

#

"system error" not very descriptive

untold briar May 19, 2023, 8:23 AM

#

I nuked soundfile from my install

#

because it was so randomly breaking for people

copper pewter May 19, 2023, 8:23 AM

#

what did you replace it with

#

soundfile code

untold briar May 19, 2023, 8:24 AM

#

Just scipy.io.wavfile for wavs, and pydub later for other stuff

copper pewter May 19, 2023, 8:24 AM

#

and the "actual error code" is just "system error"

untold briar May 19, 2023, 8:24 AM

#

But I'm not sure exactly what you're doing with it, so maybe those do not cover the functionality

copper pewter May 19, 2023, 8:25 AM

#

i was loading the audio with torchaudio.load

#

i guess if i want to load the wav then i'll just have to put it in a torch tensor
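
One dependency-free way to do that, sketched with Python's standard `wave` module instead of `scipy.io.wavfile` or `torchaudio.load` (a swap made here only to keep the example self-contained). It assumes 16-bit PCM input and keeps only the first channel; the function name is hypothetical. The returned list could then be wrapped with `torch.tensor(samples)`.

```python
import struct
import wave


def read_wav_mono16(path):
    """Read a 16-bit PCM wav and return (sample_rate, samples in [-1.0, 1.0))."""
    with wave.open(str(path), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = wf.getnchannels()
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    # Little-endian signed 16-bit integers, interleaved by channel.
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    first_channel = ints[::n_channels]
    return rate, [s / 32768.0 for s in first_channel]
```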

untold briar May 19, 2023, 8:27 AM

#

I didn't even realize torchaudio used it, since I'm not loading audio like that

#

Just basic output and conversion

#

Does your webui not process wavs with the same library as the training code?

#

I guess maybe not, they do different things

copper pewter May 19, 2023, 8:32 AM

#

ah now i'm getting an error with wavfile, there's something wrong with my directories, kind of expected that

copper pewter May 19, 2023, 12:01 PM

#

ok, voice cloning works great, but outputs aren't always perfect, sometimes the voice is somewhat different. sometimes completely different?

#

might just be an issue with my input wav file though

untold briar May 19, 2023, 12:02 PM

#

Probably just needs more training, more data. It's honestly shocking it works at all that quickly

#

I thought it would take at least 10x that long, if not more

copper pewter May 19, 2023, 12:03 PM

#

i'm just gonna check if the semantics make sense, by taking the audio i used for the voice, and instead generating semantics for it

#

and use those as the prompt instead of history prompt

#

nah sounds fine

foggy sandal May 19, 2023, 12:04 PM

#

copper pewter ok, voice cloning works great, but outputs aren't always perfect, sometimes the ...

thats the issue I mentioned with the semantic_prompt generally

#

some voices work almost perfectly

copper pewter May 19, 2023, 12:04 PM

#

like, i took the start of a dream video, and i replaced the voice, and that works fine. voice cloning also works fine, but not all the time

foggy sandal May 19, 2023, 12:04 PM

#

others will change voice some percentage of the time

copper pewter May 19, 2023, 12:04 PM

#

like this

untold briar May 19, 2023, 12:07 PM

#

If the semantic is close but not quite the same as it should be in Bark, maybe the errors accumulate faster so the voice diverges faster than a regular Bark generated history_prompt

#

Maybe just needs more training and data

#

But yeah even regular voices do that sometimes

#

"works but not all the time" is literally everything in Bark

#

I mean you trained in 8 hours on a 3060!

#

And it works sometimes

#

Be happy!

foggy sandal May 19, 2023, 12:10 PM

#

I mean the thing to keep in mind is that the semantic prompt generated at inference time is not actually the semantic prompt that is used as input to the model

#

its correlated with the text embeddings

#

so only a subset of semantic outputs will actually perform well

untold briar May 19, 2023, 12:11 PM

#

Right, the last 10% might be a lot of work. My gut feeling is that Bark is pretty robust so probably handles more than you expect, even though it kind of feels you would need the perfect text embeddings. I've done worse things and they just worked.

foggy sandal May 19, 2023, 12:12 PM

#

yeah I suspect the closer a voice is to the training set, the better it will perform

#

because the model has better learned to disaggregate text semantics from acoustic semantics

untold briar May 19, 2023, 12:14 PM

#

The history prompts carry a mind blowing amount of detail for a voice, considering how tiny they are really. It's really piggy backing off the innate Bark model a lot, so you can get by with some really crude things sometimes too.

foggy sandal May 19, 2023, 12:15 PM

#

I wonder if @copper pewter added a penalty loss to his model between his semantic predictions from hubert and the actual semantic prompts from en_speaker_1/2/3/etc

#

because those presumably are true semantic inputs

copper pewter May 19, 2023, 12:16 PM

#

no, not speaker_1/2/3, but actually just saved ones, they're the same

untold briar May 19, 2023, 12:16 PM

#

He used a ton of real bark generations

copper pewter May 19, 2023, 12:16 PM

#

3000 something clips

foggy sandal May 19, 2023, 12:16 PM

#

saved ones arent the same though

#

(presumably)

#

en_speaker_0 etc presumably come from the "true" semantic model

copper pewter May 19, 2023, 12:17 PM

#

no

foggy sandal May 19, 2023, 12:17 PM

#

which is unreleased

untold briar May 19, 2023, 12:18 PM

#

Oh, you know I don't know if they did do that?

#

They might just be random bark generations?

copper pewter May 19, 2023, 12:18 PM

#

en_speaker_0 is most likely just a saved bark gen, they claim bark does not include any real people's voices

foggy sandal May 19, 2023, 12:18 PM

#

thats what im guessing

untold briar May 19, 2023, 12:18 PM

#

I think you just need more clips, more time, more diversity. 3000 isn't even that much

copper pewter May 19, 2023, 12:18 PM

#

after all

copper pewter May 19, 2023, 12:19 PM

#

untold briar I think you just need more clips, more time, more diversity. 3000 isn't even tha...

around 3000 * 50 * 10 semantic tokens

#

out of 10000 possible options, a lot of overlap
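
Worked out, reading the three factors as clips × tokens per second × seconds per clip (an interpretation, the message does not label them):

```python
clips = 3000
tokens_per_second = 50   # assumed meaning of the "50"
seconds_per_clip = 10    # assumed meaning of the "10"
vocab_size = 10_000      # semantic token ids mentioned earlier in the thread

total_tokens = clips * tokens_per_second * seconds_per_clip
average_per_token_id = total_tokens / vocab_size

print(total_tokens, average_per_token_id)  # 1500000 150.0
```

So on average each token id would appear about 150 times if usage were uniform, which it almost certainly is not; rare ids may still be seen only a handful of times.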

foggy sandal May 19, 2023, 12:20 PM

#

i think there was mention of an unreleased model though

#

thats why i assume there's a "true" semantic model out there

untold briar May 19, 2023, 12:20 PM

#

Yeah they have their version of what Mylo trained, basically, that's what they said they didn't release

#

Presumably it is perfect, but I don't think they would use it for the Suno default voices, just don't see the point.

#

The default voices are all saying the same sentence, if it was the case, probably not even helpful.

copper pewter May 19, 2023, 12:23 PM

#

i might just train a few more epochs and try that model out, who knows

foggy sandal May 19, 2023, 12:23 PM

#

i mean it's possible that with more data, the new model gets closer to the "true model"

#

(more data, more training etc)

copper pewter May 19, 2023, 12:25 PM

#

yeah, i don't have the resources for that, but model and code will be released later this week

foggy sandal May 19, 2023, 12:27 PM

#

with your synthesized training data, did you use the same phrase?

untold briar May 19, 2023, 12:27 PM

#

Step 2. Mylo Audio dethrones Eleven labs.

copper pewter May 19, 2023, 12:27 PM

#

foggy sandal with your synthesized training data, did you use the same phrase?

no

foggy sandal May 19, 2023, 12:27 PM

#

(i can't remember exactly what it was)

#

I wonder if that would make a difference too

copper pewter May 19, 2023, 12:27 PM

#

it's trained on >3000 random phrases from books

#

and shakespeare plays

#

the actual words don't matter though, the sounds do

untold briar May 19, 2023, 12:28 PM

#

They probably matter some

foggy sandal May 19, 2023, 12:28 PM

#

well....i wouldn't be so quick to assume that

untold briar May 19, 2023, 12:29 PM

#

Or rather, the overall structure and meaning of the sounds, which we call words

foggy sandal May 19, 2023, 12:29 PM

#

dont you think there is a reason that each provided prompt uses the same phrase?

untold briar May 19, 2023, 12:30 PM

#

I don't think there is a reason for that, not really. But just that there's probably some nuance and depth that might be tricky to capture still, compared to real Bark generated semantic prompts. And you could need a wider range of text in the training data as part of that. To hit like perfect match.

foggy sandal May 19, 2023, 12:31 PM

#

I mean there's a way to test my hypothesis

#

just measure the distance between the bark-provided semantic prompts and mylo-provided prompts for the same audio from the same speaker

#

if that keeps going down then it theoretically can converge to the same model
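
One possible way to put a number on that distance, assuming both prompts are plain sequences of integer token ids: a length-normalized edit distance. This is only one candidate metric, and the function names are made up for the sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer length: 0.0 identical, 1.0 disjoint."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

Tracking this value across checkpoints for the same audio would show whether the extracted prompts are actually moving toward the reference ones.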

untold briar May 19, 2023, 12:33 PM

#

Well that's what he trained it yeah, so just keep checking the loss

foggy sandal May 19, 2023, 12:34 PM

#

no, i mean checking against the pre-provided en_speaker_0 etc semantic prompts

untold briar May 19, 2023, 12:34 PM

#

I was just speculating that you might hit a wall and need a wider set of text. Asian languages or something, for example.

foggy sandal May 19, 2023, 12:34 PM

#

for audio generated with that speaker

untold briar May 19, 2023, 12:35 PM

#

As far as I can tell there's nothing special about the provided prompts, but that's just my guess, I guess there could be. But even so I don't think it would be that useful, regular bark generated semantics should be all you need.

foggy sandal May 19, 2023, 12:36 PM

#

its possible, I'm just thinking of ways to empirically verify that

untold briar May 19, 2023, 12:39 PM

#

Hey you seem pretty knowledgeable, random question. Somebody asked me about adding a 'negative prompt' to Bark Infinity, like Stable Diffusion. Do you know if there's a reason LLMs never had this idea? Does it not even make sense, or is useless, or something?

#

I'm not sure exactly what the implementation would look like even, or even what 'working correctly' means exactly, but it does seem like a fun idea.

#

"I HATE YOUR GUYS" as a negative prompt, to make somebody quiet and friendly

#

or something

foggy sandal May 19, 2023, 12:49 PM

#

I think you could implement something similar conceptually but negative prompts are mostly a diffusion model concept

#

because they can sample from two latent spaces

#

i guess autoregressive decoding could downweight the logits of a negative phrase though
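
A rough sketch of that idea, assuming the decoder exposes a list of per-token logits at each step and that the set of "negative" token ids is already known. The penalty value and function names are arbitrary choices for illustration, not anything Bark provides.

```python
import math


def penalize(logits, negative_ids, penalty=2.0):
    """Subtract a fixed penalty from the logits of unwanted token ids."""
    return [l - penalty if i in negative_ids else l for i, l in enumerate(logits)]


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three equally likely tokens; token 2 is treated as "negative".
probs = softmax(penalize([1.0, 1.0, 1.0], {2}))
```

After the penalty, token 2 is still possible but less likely than the others, which is the softer alternative to banning it outright.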

untold briar May 19, 2023, 12:50 PM

#

I was thinking something like, generate the negative prompt. Save the generation. Compare the generation to the 'average' generation from that language. Pick some cutoff for the most common tokens or patterns used unique to the negative prompt (so you don't penalize the sound of a human voice generally) and then penalize those in the actual generation. Just a random idea.

#

Kind of depends on how the semantic works, so it might be useless. But maybe does something.
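
The "tokens unique to the negative prompt" step could look something like this, assuming both generations are available as lists of token ids. It compares smoothed relative frequencies and keeps tokens that are much more common in the negative generation than in the baseline. The threshold, the smoothing, and the function name are arbitrary illustrative choices.

```python
from collections import Counter


def distinctive_tokens(negative_tokens, baseline_tokens, min_ratio=3.0, smoothing=1.0):
    """Token ids over-represented in the negative generation vs the baseline."""
    neg, base = Counter(negative_tokens), Counter(baseline_tokens)
    vocab_size = len(set(neg) | set(base))
    neg_total = len(negative_tokens) + smoothing * vocab_size
    base_total = len(baseline_tokens) + smoothing * vocab_size
    selected = set()
    for token in neg:
        # Add-one style smoothing avoids dividing by zero for unseen tokens.
        p_neg = (neg[token] + smoothing) / neg_total
        p_base = (base[token] + smoothing) / base_total
        if p_neg / p_base >= min_ratio:
            selected.add(token)
    return selected
```

Tokens that are common in both sets (the "sound of a human voice generally") get a ratio near 1 and are left alone.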

foggy sandal May 19, 2023, 12:54 PM

#

i think you could figure out some way of doing it but in my experience picking the average of the output of a network never performs that well

#

its just too much information mushed together

#

id maybe start with trying to convert man to woman

#

or vice versa

#

maybe looking at the attention weights for that particular token?

untold briar May 19, 2023, 12:55 PM

#

Well I did use this method, basically, for my french accents I've been posting. It's not reliable

#

but it does work often enough

#

But comparing set of english versus french

#

So it COULD just work, bark is surprisingly robust

foggy sandal May 19, 2023, 12:56 PM

#

do you know about pca?

untold briar May 19, 2023, 12:56 PM

#

No

foggy sandal May 19, 2023, 12:56 PM

#

principal component analysis

#

its a bog standard statistical technique

untold briar May 19, 2023, 12:57 PM

#

I'm pretty much a code monkey rubbing sticks together. I mean theoretically I took stats in college but it's long been phased out of my neurons

foggy sandal May 19, 2023, 12:57 PM

#

basically to find the major dimensions of variance in a dataset

#

and the eigenvectors of that variance

#

that would be a good place to start
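
For anyone who wants to see what PCA computes without a library, here is a small sketch that finds only the first principal component by power iteration on the covariance matrix. It is a teaching toy under simple assumptions (small dense data, a start vector that is not orthogonal to the answer); a real project would use a library implementation.

```python
def top_principal_component(rows, iterations=100):
    """Return (direction, variance) of the largest-variance axis of the data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break  # start vector has no component along any variance direction
        v = [x / norm for x in w]
    # Rayleigh quotient: variance captured along v (v is unit length here).
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance


# Points lying on the line y = x: all variance is along the diagonal.
direction, variance = top_principal_component([[0, 0], [1, 1], [2, 2], [3, 3]])
```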

untold briar May 19, 2023, 12:58 PM

#

Most linear algebra too, annoyingly, since that has been so necessary now I need to relearn a bit

#

haha

#

However, there's great stats libraries. I bet I could get pretty far just banging some keys

#

Somehow I can still program in assembler despite not seeing it in forever, just because that one professor was such a hard ass it got burned into my brain. But not the math.

foggy sandal May 19, 2023, 1:02 PM

#

yeah, sklearn's PCA (sklearn.decomposition.PCA) is my go to

untold briar May 19, 2023, 1:03 PM

#

Yeah I did use that a bit already, just to get some basic stats, entropy, perplexity, whatever. Just to look for simple trends.

#

Those particular metrics were largely useless, at least for what I was doing at that moment.

copper pewter May 19, 2023, 1:09 PM

#

i don't know what any of that means 😎

untold briar May 19, 2023, 1:11 PM

#

perplexity is probably something you'll need, it's all over any language model training and evaluations
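
For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to the correct tokens, so it is directly tied to a cross-entropy loss measured in nats. A minimal sketch with made-up probabilities:

```python
import math


def perplexity(correct_token_probs):
    """exp(mean negative log-likelihood) over the probabilities of the true tokens."""
    nll = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(nll)


# A model that always gives the right token probability 0.25 is as uncertain
# as a uniform choice among 4 options.
value = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower is better; a perplexity equal to the vocabulary size would mean the model is no better than guessing uniformly.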

#

The rest, eh. Just get that loss down.

untold briar May 19, 2023, 1:32 PM

#

copper pewter i don't know what any of that means 😎

Have you tried cloning the rare Bark voices, for example a child's voice?

#

I'll be impressed if that works, but maybe you only need a few samples in the data?

#

Or maybe a child's voice isn't that different from an adult, compared to say a human versus a dog barking or a car alarm

copper pewter May 19, 2023, 1:33 PM

#

i believe voice cloning a "rare" bark voice shouldn't be an issue, it doesn't need to be in the training data

untold briar May 19, 2023, 1:33 PM

#

Yeah as long as it's close enough, it only has to be in the Bark data

copper pewter May 19, 2023, 1:33 PM

#

like, who knows, maybe you could record your cat meowing and see what bark thinks it would sound like as a person, that would be quite an interesting thing

untold briar May 19, 2023, 1:34 PM

#

so your prompt is accurate enough to get in the right space

#

Bark can probably do meows!

#

I haven't heard one, but it's got to be in the training data.

copper pewter May 19, 2023, 1:35 PM

#

is it?

untold briar May 19, 2023, 1:35 PM

#

Yeah, I mean it can do music and random sound effects

#

It's just raw audio, youtube, tv shows, etc, just guessing.

#

I often get a 'tv commercial' sound

#

so for sure TV

copper pewter May 19, 2023, 1:36 PM

#

yeah, i did notice i got a voice which sounded a lot like a speaker that was shared in #🐶┃bark-technical

#

the old tv sounding speaker

untold briar May 19, 2023, 1:36 PM

#

Yeah it's raw data and not even noise filtered

#

which is also kind of cool, so it can make the noise too

#

like an old time AM radio or something

#

I have some of those

#

I was really killing myself to get Star Trek TNG background 'hum' sound

#

possible but difficult as heck

copper pewter May 19, 2023, 1:38 PM

#

after a bit more training, it doesn't seem to be that much more consistent. still, the times that it does work it's really high quality

untold briar May 19, 2023, 1:38 PM

#

copper pewter after a bit more training, it doesn't seem to be that much more consistent. stil...

i have said this like 100 times for everything i tried, lol.

#

i'm still kind of blown away it trained so fast, I guess I had a totally wrong intuition about how hard it would be. I should have gone for it. I also hadn't until recently thought about all the non voice clone use cases, which made me more interested. I'm really impressed you went from a single neuron in javascript to just nailing it like 3 days later. You got a bright future.

copper pewter May 19, 2023, 1:44 PM

#

i do have years of programming experience, so that really helped me

#

do you have a clip of joe biden talking so i can demonstrate a voice that everyone will recognise?

untold briar May 19, 2023, 1:45 PM

#

Not at the moment, away from main desktop

#

Try for Rick and Morty

#

I had trouble with Rick

#

really crazy voice, atypical

copper pewter May 19, 2023, 1:48 PM

#

need to find some voice lines

untold briar May 19, 2023, 1:48 PM

#

Probably google like rick and morty sound board, or something

#

I can give you some later but won't be home for a while

#

Have you tried music? I would be super intrigued if it works, just even rarely, it would open up a lot of fun possibilities to make new kinds of stuff with Bark.

#

Cloning is cool but like, you know, already a thing. But if it works with music, OMFG

copper pewter May 19, 2023, 1:52 PM

#

idk, music won't be better than bark music, it's just a history prompt

untold briar May 19, 2023, 1:53 PM

#

So like, once in a blue moon, I get really good music out of Bark. So it's not impossible.

#

Like 14 full seconds that almost sound like a real clip

#

But it's so so rare

#

Even though it's just a prompt, one time it even stayed pretty solid for a minute, using prev segment as new prompt. Literally just one time though, lol

#

You may be right that encoding music into the history doesn't help much, mostly still sucks, but maybe not

lunar glen May 19, 2023, 1:58 PM

#

copper pewter regenerated audio

Looks similar, did you train the coarse model from scratch or fine-tune the Suno one?
Since yesterday I have been trying to train the text model, the merge token trick is eating me up, any luck with the text model?

copper pewter May 19, 2023, 1:58 PM

#

neither

#

i trained a model to extract semantics

lunar glen May 19, 2023, 1:59 PM

#

Semantics are extracted from HuBERT, right? Then you trained some model for semantics to coarse?

#

Or you mean text to semantics you trained ?

copper pewter May 19, 2023, 2:00 PM

#

also, @untold briar i might have a great idea to fix it not always generating entirely correctly, just save the prompts from when it does. problem solved

copper pewter May 19, 2023, 2:00 PM

#

lunar glen Or you mean text to semantics you trained ?

i trained a model to take the output from HuBERT and convert it to bark compatible semantics

lunar glen May 19, 2023, 2:00 PM

#

Oh that's a nice idea dude

copper pewter May 19, 2023, 2:01 PM

#

untold briar Probably google like rick and morty sound board, or something

i don't have a decent clip without a ton of background audio, so bark is mainly just continuing the background audio

untold briar May 19, 2023, 2:03 PM

#

Ah, I wish I was at my desktop, really curious how it comes out. I can try to google one sec

copper pewter May 19, 2023, 2:04 PM

#

i'm getting a female voice, the same female voice every time, somehow

untold briar May 19, 2023, 2:04 PM

#

Haha, I had the same problem! Wacky!

#

It must be something about it

#

Or like, it's triggering a cartoon or something

#

It didn't always happen so I got something Rick like eventually but I actually got that, doing my completely insane method

#

which seems crazy to me, must be like a model thing in Bark

#

That's just totally bizarre honestly, like WTF.

copper pewter May 19, 2023, 2:07 PM

#

yeah your method is kinda a lot of trial and error, just generating multiple and hoping one of them is good right?

untold briar May 19, 2023, 2:08 PM

#

It's way more convoluted, I identified certain trends, manually ranked voices by similarity, and iteratively bred them over generations, trying to find a recursive process that made it converge more in that direction. a process which wasn't that similar between voices. with a lot of weird hacks like manually mixing two voices that each have a similar aspect. Just a completely insane process. But also a fascinating concept that it converges so well, I was blown away it kept getting closer.
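
A rough sketch of the rank-and-breed loop described above, under heavy assumptions: here a "voice" is whatever object you like, `score` stands in for the manual by-ear similarity ranking, and `combine` / `mutate` stand in for the mixing hacks. None of these names come from Bark; this only shows the shape of the iteration.

```python
import random

def breed(population, score, combine, mutate, generations, keep=2, rng=None):
    """Keep the best candidates each round and refill the pool with their offspring."""
    rng = rng or random.Random(0)
    for _ in range(generations):
        # Rank by similarity to the target; in the original process this was done by ear.
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[:keep]  # keep must be at least 2 so two parents can be sampled
        children = []
        while len(children) < len(population) - keep:
            a, b = rng.sample(parents, 2)
            children.append(mutate(combine(a, b), rng))
        # Parents survive unchanged, so the best candidate never gets worse.
        population = parents + children
    return max(population, key=score)
```

Because the parents are carried over each generation, the best score can only stay the same or improve, which matches the "it kept getting closer" observation, at least in this toy form.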

copper pewter May 19, 2023, 2:08 PM

#

for example with mine, this was the second try, idk why he screams hello though

#

i don't really watch dream or anything but i'm pretty sure this sounds pretty accurate

copper pewter May 19, 2023, 2:09 PM

#

untold briar It's way more convoluted, I identified certain trends, manually ranked voices by...

that probably takes ages lol

#

my voice cloning takes less than 5 seconds on cpu

untold briar May 19, 2023, 2:09 PM

#

It was really like curiosity, could this actually work? and it's crazy how close it can get

copper pewter May 19, 2023, 2:10 PM

#

like this was the first one after

untold briar May 19, 2023, 2:10 PM

#

It is kind of my thing to do things "the incorrect way"

copper pewter May 19, 2023, 2:11 PM

#

technically, if you keep combining the 2 that would have the actual voice between them theoretically, you could actually get a great clone

untold briar May 19, 2023, 2:11 PM

#

I was like, this can't possibly work, right? So it was pretty fun! Also just kind of wild it kept getting closer and closer until it seemed perfect to my ears. I thought it would eventually hit a wall.

copper pewter May 19, 2023, 2:11 PM

#

it would just take a long time

untold briar May 19, 2023, 2:12 PM

#

It was like a couple hours, maybe a bit faster at the end but I got bored

#

But Rick was just weird, constantly got voice switches, and stuff.

#

I think it must be modeling a cartoon. Bark I mean. Cartoons do switch voices rapidly. So this is, in a sense, perhaps an accurate prediction. But I literally kept getting a woman's voice, that was by far the most common. It's a bit mind blowing that somehow both methods produced that.

#

It wasn't just cloning, I was interested in making voice hybrids, flipping the gender of a voice, it was all related work

#

Just seeing what might work

#

Especially what *shouldn't* work, but if it does it's very fun.

#

Gotta bail for a bit, @ me if it's interesting, will check later today

untold briar May 19, 2023, 2:59 PM

#

Okay, I thought this would be a fascinating YouTube video (Titles... How to voice clone the least efficient way? Still workshopping it.) I may as well try to explain quickly because who knows when I’ll get to making that. Basically after the third time I randomly got a Bark voice that sounded suspiciously like Trevor Noah, I was like, ok seriously WHAT is going on?

And it was my test script. I was constantly running the same prompts, in the same order, iteratively using the last segment as the new history, with a bunch of variations: sometimes half the prev segment + half a fixed history, randomly deleting a random portion of the history, mixing two different histories, and stuff like that. Lots of trimming and deleting and mixing, and importantly preserving at least some fraction of previous segment, so it was always iterating and building on the past. (Hitchhiker's Guide to the Galaxy and Andor scenes, if you’re curious.)

And I looked at the samples and in some parts of the process, later samples did actually sound a bit more Trevor Noah than earlier samples. It was super rare to get a full clone from this particular process, but I ran this script all the time so it did just randomly happen more than one time. The main thing was that it was a process that somehow trended towards a particular voice. So it made me wonder if you could do better on purpose. And you can if you’re completely crazy and procrastinating from actual work.
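
One step of that history-mixing process might look roughly like this. It is only a sketch: histories are treated as plain token lists, and which half is kept and where the random deletion happens are assumptions, since the message above doesn't specify them. `next_history` and `max_drop` are hypothetical names.

```python
import random

def next_history(prev_segment, fixed_history, rng, max_drop=0.3):
    """Build the next history from half the previous segment plus half a fixed history."""
    # Keep the tail half of the previous segment so every step builds on the last one.
    kept_prev = prev_segment[len(prev_segment) // 2:]
    # Take the head half of the fixed history (assumed; the original split is unknown).
    kept_fixed = fixed_history[:len(fixed_history) // 2]
    # Randomly delete one contiguous chunk, but only from the fixed part,
    # so some fraction of the previous segment is always preserved.
    drop_len = rng.randint(0, int(len(kept_fixed) * max_drop))
    start = rng.randint(0, len(kept_fixed) - drop_len)
    kept_fixed = kept_fixed[:start] + kept_fixed[start + drop_len:]
    return kept_fixed + kept_prev
```

Calling this in a loop, with each generated segment fed back in as `prev_segment`, gives the "always iterating and building on the past" behaviour.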

copper pewter May 19, 2023, 3:05 PM

#

hmm

untold briar May 19, 2023, 3:16 PM

#

I hope I can explain in a YouTube somehow, I was just so surprised it worked, don't know if I can capture that feeling of 'how the heck is this working?'.

untold briar May 19, 2023, 3:32 PM

#

"Voice cloning with zero seconds of audio (for crazy people)." Hmnnn. I like that one

untold briar May 19, 2023, 3:34 PM

#

copper pewter hmm

Totally @ me if you try some music. That's what I'm most curious about, even if it's probably not great.

#

Trying to get good music out of Bark is extremely difficult because so many samples are literally painful to listen to. So if this is an alternative way to try stuff, would be a relief.

#

Oh you gotta try the meow too. That's critical. We have to know what happens. For science!

copper pewter May 19, 2023, 3:37 PM

#

i do have a little bit of ai generated lo fi from riffusion

#

but idk, my tokenizer did not have any music in its data

untold briar May 19, 2023, 3:38 PM

#

Yeah, I understand it shouldn't work. 100%. But I'm saying, maybe it's still interesting!

copper pewter May 19, 2023, 3:38 PM

#

ok, so first gen it had the little sounds in the output audio, and a voice saying aaaaaaaaaaaaaaa

#

untold briar May 19, 2023, 3:38 PM

#

I call that a win. That sounds a bit like music to me.

copper pewter May 19, 2023, 3:39 PM

#

the original music btw, it's kind of low quality, since it was a test of an extension i made for stable diffusion webui

untold briar May 19, 2023, 3:39 PM

#

Accuracy isn't the only goal. Can you make an interesting history prompt for Bark, that is capable of generating Bark outputs that would otherwise be hard to make?

#

That's a win.

copper pewter May 19, 2023, 3:39 PM

#

and this sounds like nothing lol

copper pewter May 19, 2023, 3:39 PM

#

untold briar Accuracy isn't the only goal. Can you make an interesting history prompt for Bar...

can you give an example

untold briar May 19, 2023, 3:40 PM

#

It sounds like singing, a bit..

copper pewter May 19, 2023, 3:40 PM

#

yeah

#

lol

untold briar May 19, 2023, 3:40 PM

#

Well any consistent and not noisy vaguely music sounding audio, would be a win

#

Some of my weird experiments made Bark output like, car alarms. But in clear audio. Not useful, but kind of interesting.

#

Doorbells, honks, horns, lots of that.

copper pewter May 19, 2023, 3:42 PM

#

untold briar May 19, 2023, 3:42 PM

#

I mean it's early. In future could train on music, sounds, whatever. But thanks for testing for me.

copper pewter May 19, 2023, 3:42 PM

#

verify

#

the voice cloning is so good though

#

best results if you use the voice cloning, then generate a piece of audio which fits the person you're trying to clone, then taking the speaker prompt from there, that way it's fully bark-generated and accurate. but actually has the voice

#

that's what i did in the clip above

untold briar May 19, 2023, 3:45 PM

#

I have a few bidens, one issue I had is he trended towards either public microphone, or way close to mic, he kept oscillating and neither was ideal, I was gonna try hybriding but I got bored of this manual process

#

He's the one I'm least happy with though, of the presidents

#

I was thinking about future uses. If you can encode an hour of a speaker, for example. Then sort of like how I'm doing my french accents, one possible use of your model could be to create a deeper clone than the history prompt

#

Kind of like how I push it towards french tokens, maybe it's possible to do something like that with a bigger sample size

#

and possibly, small chance, it could overall increase quality past what you can do from single history_prompt. without having to fine-tune

#

From what I hear fine-tuning is ALSO quick

#

so this may be kind of a dead end

#

but still, could be worth trying, some day

#

fine tuning should be god tier though, for actual practical purposes

#

Oh can you flip a sign in your model somewhere? Can you produce the semantic that is the least like the wav instead? Probably total noise or silence. But sometimes a bad idea can do something cool.

#

You're getting a lot of insight into how I work. Try the wrong thing that even if it works, should sound terrible.

#

I'd say "Hey, it's a living." but it's not, it's really not. If anything it's an anti-living.

#

I can try all this stuff later in the week when you release, I'll stop bugging you. I'm stuck in a waiting room and bored out of my mind, and I can't try anything myself for a couple hours.

#

Does anyone know, of the people who did fine-tune whatever model (coarse I think, for a new language) were they able to do it and preserve the Bark original model's capabilities, or did it have a detrimental effect on them?

copper pewter May 19, 2023, 4:15 PM

#

untold briar Oh can you flip a sign in your model somewhere? Can you produce the semantic tha...

just put random semantics if you want that
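
A minimal sketch of what "random semantics" could mean in practice, assuming Bark's semantic tokens are integers below 10,000 produced at roughly 49.9 per second. Both numbers are assumptions to check against the Bark source, and `random_semantics` is a made-up helper.

```python
import random

SEMANTIC_VOCAB_SIZE = 10_000  # assumed size of Bark's semantic vocabulary
SEMANTIC_RATE_HZ = 49.9       # assumed semantic tokens per second of audio

def random_semantics(seconds, rng, rate_hz=SEMANTIC_RATE_HZ, vocab_size=SEMANTIC_VOCAB_SIZE):
    """Return a list of uniformly random semantic token ids covering `seconds` of audio."""
    n_tokens = int(round(seconds * rate_hz))
    return [rng.randrange(vocab_size) for _ in range(n_tokens)]
```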

untold briar May 19, 2023, 4:17 PM

#

Yeah I agree that's the most likely result, basically the same as random tokens. But it's worth checking at least a bit, sometimes a really bad idea can surprise you.

copper pewter May 19, 2023, 6:05 PM

#

tried to voice replace a meme with a random voice

copper pewter May 19, 2023, 6:58 PM

#

if you've got a decent clip you can get decent results on voices that aren't very common

#

dantdm

untold briar May 19, 2023, 7:23 PM

#

Cartoon voices seem to be tightly tangled with background music in Bark. Super deep narrator voices as well are incredibly likely to add it, even if the original speaker had nothing.

#

Could be really cool if the background sounds were a bit better and more plausibly a real clip

sharp gale May 19, 2023, 7:26 PM

#

Best way to do that would be editing the audio beforehand to remove the background music as much as possible. Nowadays there are tons of tools that do that

#

Or even better, hire a voice actor from Fiverr who can do impersonations and record a bunch of clean voice lines for training

untold briar May 19, 2023, 7:27 PM

#

I think they're trying to keep it raw, and figure out some way to sample from it with more control. But I don't know if there's any downside to adding even more data like that as an addition to the dataset.

#

Probably not?

sharp gale May 19, 2023, 7:27 PM

#

I wouldn't think so.

#

What is a good amount of data and voice lines to train custom model?

untold briar May 19, 2023, 7:29 PM

#

There's somebody in here who at least started training a new language, but I haven't kept up with the details. It might be only the coarse model as well.

#

I don't think you'll need to train Bark at all for good voice cloning.

#

You could do it for better cloning, probably. But should be good with just a good prompt.

#

Unless your voice lines are atypical, sound effects, music, or a new language.

sharp gale May 19, 2023, 7:31 PM

#

I see. I'm still new so I don't know how to do voice cloning yet. I've Bark infinity running and tried to do voice cloning through its UI, but haven't managed to make it work yet

untold briar May 19, 2023, 7:31 PM

#

Bark should work much better than Tortoise, using a small audio sample.

#

That's not the real cloning. The person working on it is right above here, probably out later this week.

#

I put that in Bark Infinity because people asked, but it's really just not worth the effort. Literally just wait a few days.

sharp gale May 19, 2023, 7:32 PM

#

Oh nice! Would love to give it a try and test it out if I can manage to get it running

#

still learning how to run Python, git, and stuff, my expertise is more in visual than programming haha

untold briar May 19, 2023, 7:35 PM

#

He's still cooking the cloning a bit, just scroll up literally in this thread, that's it right there.

sharp gale May 19, 2023, 7:36 PM

#

Oh Mylo with the audio! Cool thanks man

#

Gonna spend the afternoon reading chat history haha

untold briar May 19, 2023, 7:39 PM

#

May I ask your reason for cloning? I'm sort of curious why it's just the thing everyone asks about specifically.

#

If you only need a nice voice in a particular style, you may not need to clone at all. If it just needs to be in a specific style and tone, but not an exact voice.

#

It's pretty wild how expressive Bark is. You can really get quite a specific voice with a bit of effort.

sharp gale May 19, 2023, 7:42 PM

#

Trying to create a voice for a female character for a Youtube channel. Wanted a clean voice that sounds similar to this

untold briar May 19, 2023, 7:43 PM

#

Yeah that specific style happens to be quite tricky in Bark. Children's or high-pitched voices talking real fast. That's one you might just want to clone.

sharp gale May 19, 2023, 7:45 PM

#

Haha yeah that's what I thought

untold briar May 19, 2023, 7:46 PM

#

That might be one that does need fine-tuning to do well, even. Cartoon style in general has a lot of background as we just mentioned, so might be kind of messy. I think that could be fixed but short term you might need fine tuning. Which is not even really a thing, yet, to be clear. But probably not too long.

sharp gale May 19, 2023, 7:48 PM

#

I see! So the fine tuning would be feeding more audio data in the cloning process?

untold briar May 19, 2023, 7:49 PM

#

The cloning probably just needs a short sample, and generates a bark prompt. An .npz file. Fine tuning would be a more traditional process where you use a lot of data and longer time. They are both cloning, but I was using cloning to mean the first, yeah.
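
For the curious, a hedged sketch of what such an .npz speaker prompt might contain. It assumes the file holds three integer arrays named `semantic_prompt`, `coarse_prompt` and `fine_prompt`, with 2 and 8 codebook rows for coarse and fine; check those names and shapes against the Bark repo before relying on them. The helper functions are invented for illustration.

```python
import io
import numpy as np

def make_prompt(semantic, coarse, fine):
    """Bundle the three token arrays under the key names Bark is assumed to use."""
    return {
        "semantic_prompt": np.asarray(semantic, dtype=np.int64),
        "coarse_prompt": np.asarray(coarse, dtype=np.int64),  # assumed shape (2, T)
        "fine_prompt": np.asarray(fine, dtype=np.int64),      # assumed shape (8, T)
    }

def save_prompt(file, prompt):
    """Write the prompt as an .npz archive (path or file-like object)."""
    np.savez(file, **prompt)

def describe_prompt(file):
    """Return {array name: shape} for every array stored in the .npz."""
    with np.load(file) as data:
        return {key: data[key].shape for key in data.files}
```

Running `describe_prompt` on a prompt file is a quick way to sanity-check a clone before feeding it back to Bark as a history prompt.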

whole jay May 19, 2023, 7:49 PM

#

copper pewter yeah, i don't have the resources for that, but model and code will be released l...

Are you planning to release over this weekend? 🙏

copper pewter May 19, 2023, 7:50 PM

#

yep

sharp gale May 19, 2023, 7:53 PM

#

untold briar The cloning probably just needs a short sample, and generates a bark prompt. An ...

Gotcha! Appreciate you answering my noob questions man, thank you 🫡

copper pewter May 19, 2023, 7:54 PM

#

sharp gale Trying to create a voice for a female character for a Youtube channel. Wanted a...

i'll see how it does on this

untold briar May 19, 2023, 7:54 PM

#

My guess is it's hard to get it to generate clean voice only samples. But maybe it works!

copper pewter May 19, 2023, 7:54 PM

#

after a few tries it got the voice, but not really following the prompt

untold briar May 19, 2023, 7:55 PM

#

Pretty good, yeah.

sharp gale May 19, 2023, 7:55 PM

#

copper pewter after a few tries it got the voice, but not really following the prompt

Holy crap my mind is blown lol

untold briar May 19, 2023, 7:55 PM

#

Really good actually. That's a tough style in Bark. A+

sharp gale May 19, 2023, 7:56 PM

#

That's amazing damn

copper pewter May 19, 2023, 7:56 PM

#

lmao what happened, i used the speaker prompt that resulted from the last gen's audio and it just switched up

untold briar May 19, 2023, 7:57 PM

#

Yeah, cartoon voices man.

sharp gale May 19, 2023, 7:57 PM

#

copper pewter after a few tries it got the voice, but not really following the prompt

lol

copper pewter May 19, 2023, 7:57 PM

#

the voice just switching lmao

untold briar May 19, 2023, 7:57 PM

#

Literally every cartoon voice I have does that

sharp gale May 19, 2023, 7:58 PM

#

is that the model trying to use its trained data to fill in gaps?

copper pewter May 19, 2023, 7:58 PM

#

maybe i can play with the temperatures a bit until i get a better output from bark, since using those outputs is even better since they're "true" semantics
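
The temperatures here are sampling temperatures. A minimal sketch of what that knob does to next-token probabilities; this is not Bark's actual sampling code. As far as I know Bark's `generate_audio` exposes it as `text_temp` and `waveform_temp`, but treat those parameter names as an assumption.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature means a sharper distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures make the model stick to its most likely tokens (more consistent, less varied), higher ones flatten the distribution (more variety, more voice drift).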

untold briar May 19, 2023, 7:58 PM

#

I think it's really just because real cartoons switch voices very frequently. So Bark is likely to do it too.

sharp gale May 19, 2023, 8:00 PM

#

copper pewter maybe i can play with the temperatures a bit until a get a better output from ba...

And how do you test the cloned voice from then on? Is it a .npz file that gets spit out?

untold briar May 19, 2023, 8:00 PM

#

That's right.

sharp gale May 19, 2023, 8:00 PM

#

Aaah ok things are starting to click lol

copper pewter May 19, 2023, 8:01 PM

#

first it creates an npz, which will have a lot of variety in its continuations still, but often gets a realistic response, if you get it to generate more audio in the correct voice with that, you get another npz to save, which will have a consistent voice that you want

#

that's how i did the one with joe biden
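
The two-stage trick above, as a hedged sketch. `generate` and `sounds_right` are placeholders you would supply yourself: in Bark, `generate` would presumably wrap something like `generate_audio(text, history_prompt=..., output_full=True)` and the accepted result could be written out with `save_as_prompt`, but treat both as assumptions about the API. `sounds_right` is your own ears, or any check you trust.

```python
def refine_speaker_prompt(cloned_prompt, text, generate, sounds_right, max_tries=10):
    """Generate from the cloned prompt until a take is accepted, then return that take's prompt."""
    for attempt in range(1, max_tries + 1):
        # generate() is assumed to return (full_generation, audio) for a text and history prompt.
        full_generation, audio = generate(text, cloned_prompt)
        if sounds_right(audio):
            # This prompt is fully Bark-generated, so it tends to continue consistently.
            return full_generation, attempt
    return None, max_tries
```

The point of the second prompt is that it comes from Bark's own output rather than from the cloning model, which is why it holds the voice more consistently.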

untold briar May 19, 2023, 8:01 PM

#

Even with non cloned voices, that's a good pattern.

#

To increase reliability.

sharp gale May 19, 2023, 8:03 PM

#

Got it! Wow blowing my mind here

copper pewter May 19, 2023, 8:03 PM

#

sharp gale May 19, 2023, 8:04 PM

#

copper pewter

https://tenor.com/view/the-universe-tim-and-eric-mind-blown-mind-blown-meme-mind-explosion-mind-explosion-meme-gif-18002878

untold briar May 19, 2023, 8:04 PM

#

Yeah, Bark doesn't need fine-tuning. Unless you go way way outside the norm.

sharp gale May 19, 2023, 8:05 PM

#

copper pewter

Would it be ok for me to get the .npz file to test it out on my end as well?

copper pewter May 19, 2023, 8:05 PM

#

yeah

#

still testing a bit though, i want it to both have the voice, and correctly follow text

sharp gale May 19, 2023, 8:06 PM

#

yeah no rush! Thank you

untold briar May 19, 2023, 8:09 PM

#

You want to be truly mind blown. Trained in just 8 hours on a 3060.

#

Faster than cloning a single voice for many TTS models.

sharp gale May 19, 2023, 8:10 PM

#

lol that's nuts

copper pewter May 19, 2023, 8:13 PM

#

untold briar Faster than cloning a single voice for many TTS models.

?

untold briar May 19, 2023, 8:13 PM

#

copper pewter ?

There are some TTS models that people fine-tune on a voice for longer than 8 hours, to make ONE clone.

copper pewter May 19, 2023, 8:13 PM

#

the training itself took like 20 minutes for the model i made lol

#

on 8 hours of audio data

#

i guess this would technically be zero-shot voice cloning right?

untold briar May 19, 2023, 8:15 PM

#

I'm not really sure how people use zero-shot in audio stuff, not sure.

sharp gale May 19, 2023, 8:16 PM

#

I remember at the beginning of this year when I was looking at how to do voice training and on the videos the people would train for days on a 4080 to get anything that sounded at least human

#

on like 30 hours of audio data

#

for those David attenborough voices

untold briar May 19, 2023, 8:18 PM

#

I wonder if the appeal will fade once it's just baked into every single app or whatever

sharp gale May 19, 2023, 8:20 PM

#

Probably, most people will use it to say some funny words and move on

untold briar May 19, 2023, 8:21 PM

#

There will always be a demand for higher quality, but rather than pure voice, it'll almost be 'acting quality' literally how good of a performer the voice model is, similar to how you might judge a real actor.

#

A short sample will probably sound the same, but reading a paragraph from a book, the better model will have a lot more nuance.

sharp gale May 19, 2023, 8:24 PM

#

Yeah for sure

untold briar May 19, 2023, 8:26 PM

#

Even now I sometimes evaluate just the prompts/speakers like that, in Bark. Not a ton, but there seems to be some subtle differences between speaker variants that are otherwise sonically similar.

#

It's a little hard to judge in 14 seconds, but I'd guess it stands out more if the generations were longer.

#

Especially an audiobook, a full paragraph has a kind of rhythm to it you can almost feel, with a real reader. But this version of Bark can't really ever see the whole thing.

sharp gale May 19, 2023, 8:33 PM

#

Right I think that's gonna be the hardest thing to capture. I think the voice that gets the closest is Google's Voice Assistant, which I think does a better job than Siri in sounding real

untold briar May 19, 2023, 8:34 PM

#

So you can kind of cheat it right now in Bark. In the simplest way possible, using a 'beginning of paragraph' speaker .npz, and one for the end. lol. It actually basically works. But obviously doing it for real would sound way better.

#

Just kind of randomly checking variants and trying to find the right ones for that.

#

It's also a little TOO regular, at least when I just tested the idea. Maybe okay with more subtle variants than I tried.
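
The "cheat" could be sketched like this, assuming you already have one speaker prompt that sounds like the start of a paragraph, one for the end, and a regular one for everything in between. The function name and the idea of a separate middle prompt are assumptions for illustration.

```python
def assign_speakers(sentences, start_prompt, middle_prompt, end_prompt):
    """Pair each sentence with a speaker prompt based on its position in the paragraph."""
    last = len(sentences) - 1
    pairs = []
    for index, sentence in enumerate(sentences):
        if index == 0:
            prompt = start_prompt   # "beginning of paragraph" variant
        elif index == last:
            prompt = end_prompt     # "end of paragraph" variant
        else:
            prompt = middle_prompt  # regular variant of the same voice
        pairs.append((sentence, prompt))
    return pairs
```

Using subtler variants of the same voice for the three prompts would probably help with the "too regular" effect mentioned above.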

sharp gale May 19, 2023, 8:36 PM

#

Wow interesting 🤔

untold briar May 19, 2023, 8:36 PM

#

Yeah man, Bark is literally barely even explored right now.

sharp gale May 19, 2023, 8:37 PM

#

Yeah I'm amazed more people aren't talking about it/testing out

untold briar May 19, 2023, 8:37 PM

#

You can do the same thing for emotions, whispering (probably, didn't test that), whatever. A lot of work to get good variants, but it's a really simple idea and you only need to get them once.

#

A feature I might add to my fork is a fake prompt that is only used as a history input. So you say, "I went home to work. (I am crying I am so sad.) I said hello to my wife." And basically it uses the middle sentence in the history, but not the audio output. Just splitting text segments, not anything fancy

#

I think it's probably super super unreliable

#

but it seems fun
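
A minimal sketch of how that fake-prompt feature could work, assuming parentheses mark the history-only text. `generate` is a hypothetical callable that takes a sentence and the current history and returns `(new_history, audio)`; it is not a real Bark function.

```python
import re

def split_segments(text):
    """Split text into (sentence, audible) pairs; parenthesized parts are history-only."""
    segments = []
    # Match either a parenthesized chunk or a run of text outside parentheses.
    for match in re.finditer(r"\(([^()]*)\)|([^()]+)", text):
        hidden, spoken = match.group(1), match.group(2)
        if hidden is not None:
            if hidden.strip():
                segments.append((hidden.strip(), False))
        elif spoken.strip():
            segments.append((spoken.strip(), True))
    return segments

def render(text, generate, history=None):
    """Generate every segment in order, chaining history, keeping audio only for audible ones."""
    kept_audio = []
    for sentence, audible in split_segments(text):
        history, audio = generate(sentence, history)
        if audible:
            kept_audio.append(audio)
    return kept_audio
```

The hidden sentence still gets generated and becomes the history for the next one, so its tone can carry over even though its audio is thrown away.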

sharp gale May 19, 2023, 8:41 PM

#

Oh wow Literally like a VO Director directing talent on how to deliver the lines, that's cool

untold briar May 19, 2023, 8:42 PM

#

So the idea is, it's like you asked Bark to generate the audio with that sentence in the tone, but doesn't actually include the middle sentence. But if you try this now, you'll find it doesn't work reliably. You really gotta RNG it up to get a good one usually.

sharp gale May 19, 2023, 8:43 PM

#

I used to be a VO recording technician in the past, one thing we would always do with a VO talent is not tell them exactly how to do a line, but ask for more or less of an emotion, or mention the tone

untold briar May 19, 2023, 8:44 PM

#

With Bark, it's usually even more subtle

#

You have to write text that just would be said in the way you want.

sharp gale May 19, 2023, 8:44 PM

#

as in "can you finish this sentence on a higher tone?" or "Less crying, more anger"

untold briar May 19, 2023, 8:44 PM

#

A better example would be (My dog was dead and I started sobbing.) or something, truly over the top, but then the line would be read more sadly in Bark.

sharp gale May 19, 2023, 8:45 PM

#

incorporating that into prompts would be gamechanging

untold briar May 19, 2023, 8:46 PM

#

It's not gonna be great in the quick version, since splitting the sentences alone kind of makes it less expressive and feel less connected.

#

Still could be useful and it's easy to implement, so whatever.

#

And I think I already mentioned, but using this to control emotion doesn't actually work well. It just works once in awhile.

sharp gale May 19, 2023, 8:51 PM

#

Are there variables in the training data that classify emotions?

#

or tags, not sure how to call it

untold briar May 19, 2023, 8:52 PM

#

No idea. I'd guess no with pretty high confidence, just seems against the spirit of the project.

sharp gale May 19, 2023, 8:53 PM

#

I see. Could be interesting to train data with those "tags" from the beginning. Maybe that would be easier to control once data is trained?

untold briar May 19, 2023, 8:53 PM

#

The one thing they've talked about is the training data is pretty raw, not noise filtered, just raw audio

#

So that's why I guess no

#

If you search gkusko (trying not to ping) he's talked a lot in this discord about it. But it could be weeks ago.

sharp gale May 19, 2023, 8:56 PM

#

I'll search it out

untold briar May 19, 2023, 8:57 PM

#

They might hope Bark could eventually do classifying like that on audio. A true foundation audio model.

#

It probably can, sort of... I can almost imagine something.

#

Imagine a comically simple method. Encode the audio "I am feeling" into bark, using something like Mylo's model. Then prompt bark to continue the audio, which is just a flag, you can force it to do that.

#

I bet you get well, mostly nonsense, but statistically I bet different emotions have some correlation in the likely words you hear to the actual emotional tone in the audio.

#

The least accurate and most inefficient emotion classifier system you could possibly make. But I bet it's more accurate than chance with enough samples. Basically my specialty. Also I guess this only works on the audio of the words "I am feeling" ? Truly the most useless machine. Actually, why not clone the voice of the audio sample, then add a fake "I am feeling" to the end, then give that to Bark. Even more convoluted. Excellent.
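
The voting half of that gloriously inefficient classifier might look like this. It assumes you have already generated many continuations and transcribed them to text somehow (that part is not shown), and the tiny word list is a made-up stand-in for a real emotion lexicon.

```python
from collections import Counter

# Hypothetical lexicon mapping words heard in continuations to emotion labels.
EMOTION_WORDS = {
    "sad": "sad", "down": "sad",
    "happy": "happy", "great": "happy",
    "angry": "angry", "mad": "angry",
}

def vote_emotion(transcripts, lexicon=EMOTION_WORDS):
    """Count emotion words across many sampled continuations and return the top label."""
    votes = Counter()
    for transcript in transcripts:
        for word in transcript.lower().split():
            word = word.strip(".,!?")
            if word in lexicon:
                votes[lexicon[word]] += 1
    if not votes:
        return None  # nothing recognizable, which will probably happen a lot
    return votes.most_common(1)[0][0]
```

Any single continuation is mostly noise; the bet is only that the vote over many samples lands on the right emotion more often than chance.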

sharp gale May 19, 2023, 9:03 PM

#

Haha

#

Would love to see that

untold briar May 19, 2023, 9:05 PM

#

If there's a terrible way to do something, that's the way I've probably done it. The funny thing is some of it just actually works straight up in Bark.

#

I bet it works better than trying to generate punchlines with Bark. https://twitter.com/jonathanfly/status/1650001584485552130 (Though it did actually do a few with the Chicken crossed the road joke.)

sharp gale May 19, 2023, 9:13 PM

#

Ahahahhaha this is amazing

#

You got an immediate follow great stuff

untold briar May 19, 2023, 9:16 PM

#

Gosh I haven't Tweeted in forever.

sharp gale May 19, 2023, 9:17 PM

#

You should! You got a whole follower base already

untold briar May 19, 2023, 9:18 PM

#

I do have like hundreds of samples of the US presidents singing American pie, right in front of me. The world should hear this.

sharp gale May 19, 2023, 9:19 PM

#

I'll retweet it to my 100 followers lol

untold briar May 19, 2023, 9:22 PM

#

My Twitter is actually a weird mix cause like it's half from crazy jokes, and half from when I used to read https://arxiv.org/ for AI papers every day and was constantly Tweeting paper summaries or usually quickly trying the paper repo, so like serious science people. So I honestly get gunshy and try to think like, can I split the difference somehow for both groups, lol.

sharp gale May 19, 2023, 9:26 PM

#

Yeah that's stuff that goes over my head for sure 🤣

#

but the funny AI videos I like lol

untold briar May 19, 2023, 9:35 PM

#

It's so expressive, right? The voices are so insanely human. On the github for my bark fork there's like 20 minutes of that. I love it so much.

#

I don't really use TTS in practical use, but it was just so lively and real feeling, I got kind of addicted to poking at the audio model. For literally no purpose but to see what it did. And that's how I somehow ended up spending half my week's free time trying to get Donald Trump to sing in a ridiculous French accent.

untold briar May 19, 2023, 9:52 PM

#

I Tweeted Don't Sleep on Bark when nobody knew about it. Bark has 20,000 stars now... and people are still sleeping on Bark. That's the Tweet. Legit like, Mylo cloned in 20 minutes. I threw sticks and stones together and got French accents, somehow. (I might be overemphasizing self derogatory humor a bit here, it was real work and quite a bit of coding, if not very sophisticated.) Imagine what actual domain experts really building on the Bark model will do.

sharp gale May 19, 2023, 9:57 PM

#

Hahha funny you say that I just got the French voice to talk in a great accent in english too, using the output npz files you guys suggested earlier

untold briar May 19, 2023, 9:58 PM

#

Yeah it's cool, many are multilingual.

#

Specifically I was trying to force any voice to have a specific accent, which normally you can't do. Like, ugh, I have some here somewhere.

copper pewter May 19, 2023, 10:01 PM

#

some distinct voices are really easy to clone btw, anyone recognise this one?

untold briar May 19, 2023, 10:01 PM

#

Not sure, sounds like an announcer?

sharp gale May 19, 2023, 10:02 PM

#

Sounds familiar but can't put my finger on it

copper pewter May 19, 2023, 10:02 PM

#

postal dude

untold briar May 19, 2023, 10:02 PM

#

The winamp voice?

#

Oh I'm not sure I know that one

sharp gale May 19, 2023, 10:02 PM

#

ahahaha winamp voice! That's a deep cut

untold briar May 19, 2023, 10:02 PM

#

Gosh showing my age.

#

haha

sharp gale May 19, 2023, 10:03 PM

#

Ahaha Postal dude yes there you go

sharp gale May 19, 2023, 10:03 PM

#

untold briar haha

Hey man I'm right around there too got the reference!

#

Postal dude is also a deep cut

untold briar May 19, 2023, 10:04 PM

#

https://www.youtube.com/watch?v=WqJbvZVGWSE

YouTube

Daniel Neumann

Winamp original - It Really Whips the Llama's Ass

2.91 version of the iconic software splash screen and remembered track

#

It's pretty similar!

sharp gale May 19, 2023, 10:04 PM

#

hahaha might be the same voice actor

untold briar May 19, 2023, 10:04 PM

#

I know right?

#

I don't know the postal voice but hear that pretty close

copper pewter May 19, 2023, 10:05 PM

#

rick hunter, who is the voice of the postal dude in postal 1, 2, and as an option in 4, does do radio commercials

#

also, lol

#

my explorer hasn't realized that it's already the next day

untold briar May 19, 2023, 10:07 PM

#

Man I looked it up. The winamp voice actor vanished into thin air?

#

wild

#

Actual urban legend voice

sharp gale May 19, 2023, 10:08 PM

#

hahaha wow

#

Oh shoot Jonathan I just noticed that you're the one who made Bark Infinity! That's you right?

untold briar May 19, 2023, 10:11 PM

#

Yeah. I don't know how I ended up so deep into audio to be honest. And I kind of need an exit plan eventually, cause like I don't even use it myself except to play and experiment, but I will be updating it soon and supporting it for the near future, for sure.

sharp gale May 19, 2023, 10:12 PM

#

Man that's awesome thank you for your hard work! Been playing with it nonstop ever since installing it!

untold briar May 19, 2023, 10:12 PM

#

And now I feel some responsibility to update it.

#

My god though, I have like a billion changes for my accent stuff, so it's just gonna be a nightmare.

#

But I'll do a basic bugfix and core features, I mean 99% of people want clear voices and an easy installer. Literally nobody is asking for people to sing in comical accents, but that's what I'm dying over, lol. And the install is so confusing.

sharp gale May 19, 2023, 10:16 PM

#

hahahaha it is a bit confusing, took me a little while to make it work on my end

untold briar May 19, 2023, 10:16 PM

#

I'm spending 10x the hours on these crazy ideas, and everyone keeps messaging me, "I can't install Bark."

#

I actually did start a basic patch some today, for real.

sharp gale May 19, 2023, 10:17 PM

#

Miniconda saved me

#

If you or Mylo need help with anything visual or even audio data to be trained on just let me know seriously, you both are doing the Lord's work

untold briar May 19, 2023, 10:19 PM

#

There are some practical things I have learned, as a result of some silly things, that can be useful for simply joining audio clips more seamlessly, so it's not all for nothing.

sharp gale May 19, 2023, 10:23 PM

#

absolutely Infinity is saving me a lot, cause using commands on prompts is the death of me haha

untold briar May 19, 2023, 10:25 PM

#

Cool. There are some rough spots that drive me nuts. I can't believe how difficult it is to like, load all your settings in Gradio. I looked at how Stable Diffusion does it and they just wrote their own JS thing that basically just bypasses Gradio. Ugh

#

I knew Automatic1111 was Gradio, and I made a bad assumption that most of its features were Gradio's features. But actually, it's a miracle they got all that stuff to work in Gradio and they built a billion hacks on top to do it.

#

I couldn't figure out how to display a non fixed amount of audio samples, or let the user pick a folder of files, really simple stuff can be missing in Gradio.

#

There's a whole thread for Gradio complaining, haha. I really go off easily. It's really not the worst but I was frustrated a lot, and I hadn't developed in Gradio before last month, so I kept being like, "I must be missing a feature?" and actually usually it just didn't exist, lol.

#

I'm gonna have to rip the Stable Diffusion code just to refresh the file picker dropdowns if there's a new .npz file. How is this just not part of the base ui, lol.

#

Actually there is a funny thing I didn't think about. The actual most annoying thing about Gradio isn't Gradio's fault, really. It's that the newer Gradio API is so new that ChatGPT doesn't know anything about it, so you can't really use ChatGPT for Gradio dev. I never thought about that but actually, super annoying downside. Actually maybe it works now with plugins? But when I tried originally it was hallucinating all the time, worse than useless. It's an interesting factor to consider -- a software tool or library or language created or heavily modified after an AI model's cutoff date means the model is a much worse aid. Welcome to 2023. I wonder if that's gonna work as a negative pressure against new programming languages, an extra hurdle to overcome.

sharp gale May 19, 2023, 10:46 PM

#

Hahaha wow yeah true

#

Have you tried asking Bard to check if it has the same knowledge?

untold briar May 19, 2023, 10:47 PM

#

I haven't, I should totally.

#

They say it has no cutoff date, not sure how exactly that is implemented, but it's got to be at least more recent.

#

My face when I first tried Gradio, tried using ChatGPT, and it hallucinated the most god damn beautiful Gradio API and code, with all the features I wanted, just perfect. And then I discovered it was all just complete fantasy. There's got to be a German word for this feeling by now.

sharp gale May 19, 2023, 10:55 PM

#

Lol

earnest copper May 19, 2023, 10:58 PM

#

@untold briar new programming languages can be easily added onto models via LoRA

untold briar May 19, 2023, 10:59 PM

#

Back on topic, anyone found some way to minimize the chance of audio becoming more metallic the longer it gets? It doesn't always happen but I haven't really noticed a trend in sampling parameters, or anything really obvious at least.

earnest copper May 19, 2023, 11:00 PM

#

the only way to do that is to actually modify the way the pipeline works to use like, a split attention head

untold briar May 19, 2023, 11:00 PM

#

Are you replying to the negative prompt idea?

earnest copper May 19, 2023, 11:00 PM

#

the audio going off the rails over time.

#

that, i believe, occurs because the attention head is linear and has a limited sliding window len

untold briar May 19, 2023, 11:01 PM

#

Huh. Is that a lot of changes, or could it be hacked on pretty quickly?

earnest copper May 19, 2023, 11:01 PM

#

they can implement multi-layer single head attention and it could work better than multi-head split attention

#

it's a LOT of changes, but nothing that a genius like the ControlNet team couldn't hack up in a day

#

and i could also be way off base about it :(

untold briar May 19, 2023, 11:02 PM

#

Can I also ask you about the negative prompt idea, as you seem to have some real actual expertise? Do you know if it even makes sense in LLMs, or if there's a reason why it's not done?

earnest copper May 19, 2023, 11:03 PM

#

@open magnet sorry to ping you, but is that close to the mark for the reason we don't have longer than 13s, that the audio recordings even from a given prompt end up 'forgetting who they are'? I concat renderings and the voice noticeably changes toward the end of the more exciting statements.

untold briar May 19, 2023, 11:04 PM

#

I think the model was trained on segments of that size, and they do start a bit in the middle, to minimize edge effects, but not sure

earnest copper May 19, 2023, 11:04 PM

#

@untold briar that would probably be a 'user preference layer', a LoRA. you can generate text embeddings as well. there's really no reason you can't do it. but it's likely that the pipeline would have to be altered to make that work. and i'm assuming that'll be less performant in a big way, or they would have done it.

#

i should clarify, less performant with our current hw/net architectures

untold briar May 19, 2023, 11:06 PM

#

With a regular text LLM it's kind of hard to imagine what exactly the negative prompt should do, if it works correctly. But with Bark I can almost see it. A negative prompt of "I HATE YOU AND YOU SHOULD DIE" might make it more friendly and quiet, something like that.

earnest copper May 19, 2023, 11:07 PM

#

negative prompts can be thought of as weighted probability sets that influence the decisions made by the model, to drop the probability score of those tokens and reduce their likelihood of generation

#

they already do this but bake it into the model for things like the N word and so on. they provide positive and negative prompts via the Anthropic dataset. have you looked at that?

untold briar May 19, 2023, 11:08 PM

#

So the way I was gonna try it, very crudely: generate one sample of the negative prompt. Compare that sample versus like an average English sample and try to pick out prompt specific tokens, not language generally, with some stats. So you don't penalize the sound of a human speaking or something general like that. And then penalize those tokens in the real sample.
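
The crude idea described here can be sketched roughly as follows. This is only an illustration, not anything from Bark or from the author's fork: the function names `prompt_specific_tokens` and `penalize_logits`, the smoothed log-ratio statistic, and the fixed penalty value are all assumptions chosen for the example. It ranks tokens that are over-represented in the negative sample relative to a reference sample, then lowers their logits before sampling.

```python
import math
from collections import Counter


def prompt_specific_tokens(negative_tokens, reference_tokens, top_k=5, smoothing=1.0):
    """Return up to top_k tokens over-represented in the negative sample.

    Uses a smoothed log-ratio of relative frequencies (an assumed statistic),
    so tokens common to speech in general score near zero and are skipped.
    """
    neg = Counter(negative_tokens)
    ref = Counter(reference_tokens)
    vocab_size = len(set(neg) | set(ref))
    neg_total = len(negative_tokens) + smoothing * vocab_size
    ref_total = len(reference_tokens) + smoothing * vocab_size

    scores = {}
    for tok in neg:
        p_neg = (neg[tok] + smoothing) / neg_total
        p_ref = (ref[tok] + smoothing) / ref_total
        scores[tok] = math.log(p_neg / p_ref)

    # Highest score first; ties broken by token id for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [t for t in ranked[:top_k] if scores[t] > 0]


def penalize_logits(logits, banned_tokens, penalty=5.0):
    """Return a copy of logits with a fixed penalty subtracted for each banned token."""
    out = list(logits)
    for tok in banned_tokens:
        if 0 <= tok < len(out):
            out[tok] -= penalty
    return out


# Hypothetical usage: token ids here are made up.
specific = prompt_specific_tokens([1, 1, 1, 2, 3, 7, 7, 7, 7], [1, 1, 1, 2, 2, 3, 3, 4, 5])
adjusted = penalize_logits([0.0] * 8, specific)
```

In a real pipeline the penalty would be applied at every sampling step of the real generation, and the reference sample would need to be much larger than one clip for the statistics to mean anything.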

earnest copper May 19, 2023, 11:09 PM

#

i'm just going to start training stuff on hollywood movies.

#

good quality images, audio, and, often, transcripts are baked into the format

untold briar May 19, 2023, 11:11 PM

#

I was thinking a super weird one might be 'audio description captions' like where they describe what's happening on screen. I don't know how useful it is, but they talk very precisely over lots of dynamic audio.

#

I mean, I don't know what you do with outputs like that.

#

But I was kind of curious what it would learn.

earnest copper May 19, 2023, 11:14 PM

#

omg lmao

#

the Suno crew should be willing to describe how to do the training and stuff but idk when they'll decide that

#

baseline Stable Diffusion 2.1 -> my Lord of the Rings model (The Hobbit), early on in training

#

this is considered an impossible challenge, to style transfer into SD 2.x without destroying it. so, i'm definitely excited for the challenge with Bark.

untold briar May 19, 2023, 11:16 PM

#

Nice

#

The pace of dev in the Diffusion space is nuts. So many things all the time.

#

Random question before I'm gone for a bit, anyone gotten somewhat realistic applause sounds from Bark? I always get a very very electronic sound in the place that is clearly supposed to be applause.

open magnet May 19, 2023, 11:21 PM

#

earnest copper @open magnet sorry to ping you, but is that close to the mark for the r...

sorry what's the question? happy to opine. 13s is a bit random. just the standard small LLM context size of 1024

earnest copper May 19, 2023, 11:22 PM

#

i see. it's just that when i do have longer audio, the voice deviates from the initial audio, a LOT. it never seems to normalize itself against the initial sample

#

so i was wondering if the processing is fully linear, and if it is, would a split attention approach help with coherence, or would it hurt

#

i have only briefly looked at it and i definitely barely understood it, so my understanding is coming from the sliding window length stuff i messed with

untold briar May 19, 2023, 11:26 PM

#

By deviates, do you mean deviates from the sound of the voice, or the audio quality itself gets the metallic twinge, or both?

#

I'm not sure if they are correlated, but I can't really remember off hand. I don't remember noticing that specifically at least.

earnest copper May 19, 2023, 11:28 PM

#

both

untold briar May 19, 2023, 11:30 PM

#

The metallic trend is such a consistent effect, it feels like there should be some fix, especially since it's not always there. I kind of wonder if the voice thing is a deeper issue, it's just hard to perfectly model a whole human voice from a history_prompt.

#

Totally uninformed feelings, lol.

#

Even in ChatGPT, with text, sometimes you feel it starting to degrade when you approach the context window, and it's less coherent. Not even go past it, just close.

earnest copper May 20, 2023, 12:01 AM

#

oh, have you ever tried downsampling human voices to 22kHz? maybe tis' just that.

#

@untold briar it isn't modelling a whole human voice based on the recording. the model takes in the audio and behaves similar to something like GPT2 where it autocompletes what it is given, following the same patterns.

the emotions come from the actual training that went into the coarse model

#

a voice that sounds similar to X or Y has tensor space, physically co-located

#

there's interesting issues with 'overfitting' on the audio and i'm not certain whether an overfitted voice will perform better or worse. i know that overfitting the coarse model on one voice will result in a reduced variability of its outputs.

#

it likely just needs to be a larger model with more parameters to train.

#

it's a hard trap to fall into - thinking this is as good as it gets - but this stuff is a moving target, and it's likely that suno has internal models that far surpass this toy's abilities

untold briar May 20, 2023, 12:18 AM

#

Oh yeah, I agree. I guess I still meant like, presumably the prompt still has to be long enough for that? It's shorter than I expected in actual effect. 256 in semantic, like 206 or somewhere in that range in coarse for the semantic tokens. Fine strangely has the longest input but fine also doesn't seem to really matter much. I do find even literally randomly chopping up coarse prompts quite a bit can push it into a bit more expressive space, when the outputs seem to get stuck in a less expressive space. Just from my literal caveman science slicing and dicing. One thing I do all the time is trim both semantic and coarse, removing chunks somewhat randomly, to try and break out of some effect like background sound. Doesn't usually work but sometimes actually does. Or just resizing it fractionally, like, "I want a voice only a little like that..."

#

An interesting thing to do is render 20 versions of a history_prompt, across the full range of sizes 0 to 256, or 341 or whatever fine is, I forget. And you can get a feel for how much prompt carries how much of an effect on the voice. Really, past 256 I could honestly hardly ever tell the difference in fine, I just saw the output was technically different when I extended it that much. Using semantic length here, so adjust accordingly per the coarse ratio, I usually just think in semantic units even when messing with coarse.
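
A minimal sketch of that length sweep, under stated assumptions: the rate constants below (49.9 Hz semantic, 75 Hz codec frames, 2 coarse codebooks) are recalled from Bark's generation code and should be checked against the version in use, and the prompt is modeled as a plain dict of lists with the keys `semantic_prompt`, `coarse_prompt`, and `fine_prompt` standing in for the arrays stored in a `.npz`. The helper names `frames_for_semantic`, `trim_history`, and `sweep_lengths` are hypothetical.

```python
SEMANTIC_RATE_HZ = 49.9    # assumed semantic token rate
COARSE_RATE_HZ = 75        # assumed codec frame rate per codebook
N_COARSE_CODEBOOKS = 2     # assumed number of coarse codebooks

# Flattened coarse tokens per semantic token (roughly 3.006 with the values above).
SEMANTIC_TO_COARSE_RATIO = COARSE_RATE_HZ / SEMANTIC_RATE_HZ * N_COARSE_CODEBOOKS


def frames_for_semantic(n_semantic):
    """Approximate codec frames (per codebook) covering n_semantic semantic tokens."""
    return int(round(n_semantic * COARSE_RATE_HZ / SEMANTIC_RATE_HZ))


def trim_history(prompt, n_semantic, keep="start"):
    """Return a copy of a history prompt cut to n_semantic semantic tokens.

    keep="start" keeps the beginning of the clip, keep="end" keeps the tail.
    Coarse and fine rows are cut to the matching number of codec frames.
    """
    n_frames = frames_for_semantic(n_semantic)

    def cut(row, n):
        row = list(row)
        if n <= 0:
            return []
        return row[:n] if keep == "start" else row[-n:]

    return {
        "semantic_prompt": cut(prompt["semantic_prompt"], n_semantic),
        "coarse_prompt": [cut(r, n_frames) for r in prompt["coarse_prompt"]],
        "fine_prompt": [cut(r, n_frames) for r in prompt["fine_prompt"]],
    }


def sweep_lengths(max_semantic=256, steps=20):
    """Evenly spaced semantic lengths to render, from short up to max_semantic."""
    return sorted({max(1, round(max_semantic * i / steps)) for i in range(1, steps + 1)})
```

Each length from `sweep_lengths()` would be passed to `trim_history`, the result saved as its own `.npz`, and the same text rendered against every variant to hear how much of the prompt actually shapes the voice.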

#

That's kind of how I got into manually hand crafting voices, cause you can kind of just mix them up like recipes and Bark makes it work, at least sometimes. Crazy.

#

To be fair there's no good reason to do this manually. Like seriously do it the right way, train a model like Mylo. But it was interesting that doing it by hand was actually kind of possible.

glossy trout May 20, 2023, 12:36 AM

#

@untold briar - I've found that the voice gets more metallic as you keep passing the full_generation result of a generation in as history_prompt. Instead, if every couple sentences you restart from the base history_prompt file, instead of using the full_generation from the previous prompt, you "restart" the degradation, so the overall quality is higher.
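
The restart strategy can be sketched like this. It is a sketch under assumptions, not the poster's actual code: `generate_long` and its `generate` argument are hypothetical, with `generate(text, history)` expected to return a `(full_generation, audio)` pair. With Bark, that callable could plausibly wrap `generate_audio(text, history_prompt=history, output_full=True)`, but that wiring is left out so the loop stays library-independent.

```python
def generate_long(sentences, base_history, generate, reset_every=2):
    """Generate one clip per sentence, resetting the history periodically.

    Between resets, each clip continues from the previous full_generation.
    Every reset_every sentences the history goes back to base_history, which
    is the step meant to stop the metallic degradation from accumulating.
    """
    history = base_history
    clips = []
    for i, text in enumerate(sentences):
        if i % reset_every == 0:
            history = base_history  # restart from the clean base prompt
        full_generation, audio = generate(text, history)
        clips.append(audio)
        history = full_generation  # chain until the next reset
    return clips
```

A smaller `reset_every` trades continuity between sentences for cleaner audio, which matches the tradeoff discussed in the replies.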

untold briar May 20, 2023, 12:37 AM

#

glossy trout @untold briar - I've found that the voice gets more metallic as you keep...

That's correct. However I feel like there's a tradeoff in nuance when you do that. Have you compared the expressiveness of doing it both ways? Even if it's more metallic, the less chopped up version is generally more expressive in my experience. Oh I misread. I meant, even in a single generation.

#

Like just one 14s chunk

#

Absolutely you can't pass full_generation over and over, without some other edits, yeah.

glossy trout May 20, 2023, 12:39 AM

#

Hmmm yeah. Aside from that, I haven't found better strategies.

untold briar May 20, 2023, 12:39 AM

#

So if you break your 14s segment into two smaller halves, you do usually get less metallic.

#

But it's less cohesive as a sample.

#

Mainly you can just keep trying until it's not metallic. It just usually works eventually. So maybe there's some sampling tweak

glossy trout May 20, 2023, 12:42 AM

#

Ohhh interesting. So that means shorter samples are less metallic?

untold briar May 20, 2023, 12:43 AM

#

For me, usually the effect increases over time

glossy trout May 20, 2023, 12:43 AM

#

Oh wow, good to know. I haven't tested that

untold briar May 20, 2023, 12:43 AM

#

It's just strange how it's just perfect, and same sampling parameters, and then it's not.

glossy trout May 20, 2023, 12:43 AM

#

My understanding is that shorter samples and longer samples both have a higher chance of hallucinations. The closer you can get to 14s, not too long and not too short, the lower the chances of hallucinations

#

So there's some trade-off between metallic voice vs hallucination probability
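
One way to aim for chunks near that length is to split on sentence boundaries with a word budget. This is a rough sketch only: `chunk_text` is a hypothetical helper, and the default of 2.5 words per second is a guessed speaking rate, not a measured Bark value, so the target would need tuning per voice.

```python
import re


def chunk_text(text, target_seconds=13.0, words_per_second=2.5):
    """Group whole sentences into chunks of at most roughly target_seconds.

    A single sentence longer than the budget is kept whole rather than cut
    mid-sentence, so some chunks may run over.
    """
    budget = max(1, int(target_seconds * words_per_second))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > budget:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```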

untold briar May 20, 2023, 12:44 AM

#

I think that's right, yeah, going to 1s is really bad

glossy trout May 20, 2023, 12:44 AM

#

By going shorter

untold briar May 20, 2023, 12:45 AM

#

I've tried everything, can you tell, haha. If you give semantic two or three words at once, I mean it works, but it sounds like you handed an actor a notecard with just those words, they said them, and then you gave them the next notecard.

#

Even if you pass the history correctly, from prev segment, for each two or three word chunk. You super feel the cadence of the original chunks in the voice.

#

I think this actually doesn't even work at all in the small models.

#

But the large one kind of suffers through it, but it just sounds terrible.

#

It makes sense right? Imagine if you gave an actor the first three words of a line read.

#

How can they possibly know how best to say the full line from that

glossy trout May 20, 2023, 12:48 AM

#

Hmm - yeah I feel like even passing ~12s of audio, it often doesn't sound good in long sentences

untold briar May 20, 2023, 12:48 AM

#

Also I tried using a single semantic generation, with two coarse gens. And IIRC that didn't fix the metallic sound.

glossy trout May 20, 2023, 12:48 AM

#

If you break a 20s sentence into two parts

untold briar May 20, 2023, 12:49 AM

#

So I think it's from semantic. But honestly I didn't double check that.

glossy trout May 20, 2023, 12:49 AM

#

hmmmm

untold briar May 20, 2023, 12:50 AM

#

It feels like it should be in the coarse, so honestly, somebody should double check. Or I will someday

#

I might still have the notebook, maybe I'll double check if I screwed up. Cause it feels wrong that the effect is in the semantic tokens.

glossy trout May 20, 2023, 12:53 AM

#

I think theoretically it can be in either. Semantic tokens can probably generate that metallic sound if it's low quality

untold briar May 20, 2023, 12:54 AM

#

Yeah they are both super expressive, changes to either can change so much

#

Someone mentioned even Eleven Audio has that effect too? At least sometimes? so maybe it's not a simple fix

#

I also agree the biggest audio segment you can get away with the better. Give the model a lot of info. Imagine if you could give it a whole paragraph, so it can properly start and finish like a real audiobook reader would go through a paragraph with a specific rhythm.

#

You do end up with way too fast talking sometimes, but in general, aim big

#

This is one hacky way to adjust speaking speed as well. Keep trying large chunks of text and saving, or small chunks of text and saving, until you get something that still sounds the same but generally talks faster or slower.

#

If you don't need that specific voice though, just start with a fast or slow one.

#

I really want .npz to come with variants, baked in. It's super annoying to get good ones, but somebody only needs to make them once.

#

fast slow sad happy whispering, whatever

#

I just tried for whispering, because somebody in here wants that specifically. It generally changed the voice, so that one might not work. But I was only trying on one npz.

#

It should work, probably just unlucky

#

btw if by some freak chance somebody has been generating tons of clear French singing npzs, I would be eternally grateful for them.

#

I'm a bit more optimistic about Bark for music, if it's pure singing, with little to no background. That seems doable.

#

Bark can do two music things. Pure singing... well that's a maybe. And sick beats. https://soundcloud.com/jonathan-fly-620508219

SoundCloud

Jonathan Fly

https://twitter.com/jonathanfly

copper pewter May 20, 2023, 8:49 AM

#

https://huggingface.co/GitMylo/bark-voice-cloning
https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer

#

release

#

@untold briar @glossy trout

untold briar May 20, 2023, 9:01 AM

#

Nice, you're gonna kill me, it's 5AM, but I am curious

#

Should I literally make tea

#

jesus

#

I guess it's saturday

untold briar May 20, 2023, 11:32 AM

#

The other annoying voice in this otherwise great clip pops in or out based on some really tiny token changes in the prompt. Clip a few tokens and there he is. Carry the clipped segment into the next, he doesn’t appear. If only I could carry these tokens into any clip and ward away voice changes. (I tried, didn't work, lol.)

sharp gale May 20, 2023, 12:30 PM

#

copper pewter https://huggingface.co/GitMylo/bark-voice-cloning https://github.com/gitmylo/bar...

Is this it mylo??

copper pewter May 20, 2023, 12:30 PM

#

👍

sharp gale May 20, 2023, 12:31 PM

#

Hooo baby! Thanks man!

#

Is it useful to use really long data of audio? like 45 mins or hours?

copper pewter May 20, 2023, 12:32 PM

#

8 hours for the training of the voice cloning model
6 seconds or more for a voice you want to clone

#

there's already 2 pretrained models, 4 and 14 epochs

#

they should be fine for most purposes

sharp gale May 20, 2023, 12:33 PM

#

So I would probably get the same output if I use 10 secs of cloning or 10 minutes of cloning?

#

for cloning*

copper pewter May 20, 2023, 12:34 PM

#

yeah, just make sure your audio clip fits the recommendations explained in the git repository

sharp gale May 20, 2023, 12:35 PM

#

yep yep

copper pewter May 20, 2023, 12:35 PM

#

clear, no music, ends after a sentence
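
Of those recommendations, only the length is easy to check automatically. A minimal sketch using the standard library `wave` module, assuming the clip is an uncompressed WAV file; the helper names are hypothetical, and the 6 second default comes from the figure quoted above. Whether the clip is clear, music-free, and ends on a sentence still has to be judged by ear.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def long_enough_for_cloning(path, min_seconds=6.0):
    """True if the clip meets the minimum length suggested for cloning."""
    return wav_duration_seconds(path) >= min_seconds
```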

sharp gale May 20, 2023, 12:36 PM

#

Does this have a UI yet, or just the source files?

copper pewter May 20, 2023, 12:37 PM

#

this can be implemented in any ui etc

untold briar May 20, 2023, 12:37 PM

#

I'll add it, maybe Sunday?

copper pewter May 20, 2023, 12:37 PM

#

i can beta release my webui, it's just really early access and stuff

#

like most of this is practically useless

sharp gale May 20, 2023, 12:38 PM

#

gotcha

copper pewter May 20, 2023, 12:38 PM

#

still kinda cool to have this refreshing thing though

sharp gale May 20, 2023, 12:38 PM

#

untold briar I'll add it, maybe Sunday?

Jonathan you're the man!

untold briar May 20, 2023, 12:39 PM

#

You conquered gradio, refreshed the files.

#

lol

#

Why isn't it just automatic???

copper pewter May 20, 2023, 12:39 PM

#

gradio has most things you need, just not very clear how to do them

sharp gale May 20, 2023, 12:39 PM

#

copper pewter still kinda cool to have this refreshing thing though

Awesome!

untold briar May 20, 2023, 12:40 PM

#

Does it refresh automatically? I genuinely thought I had to do it myself?

copper pewter May 20, 2023, 12:40 PM

#

no, not here, it would be too slow

untold briar May 20, 2023, 12:40 PM

#

If the user generates a sample. Gradio won't see the new npz by default
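
A manual refresh button is one common workaround. The sketch below is an assumption about how it could be wired, not how either webui does it: `list_npz_files` and `build_ui` are hypothetical names, the `voices` folder is made up, and it relies on `gr.update(choices=...)` being available in the installed Gradio version. Gradio is imported inside `build_ui` so the file-listing helper can be used and tested without it.

```python
from pathlib import Path


def list_npz_files(folder):
    """Return the sorted names of all .npz files currently in folder."""
    return sorted(p.name for p in Path(folder).glob("*.npz"))


def build_ui(folder="voices"):
    """Build a small Blocks UI with a dropdown and a refresh button."""
    import gradio as gr  # imported lazily; requires gradio to be installed

    with gr.Blocks() as demo:
        voice = gr.Dropdown(choices=list_npz_files(folder), label="History prompt")
        refresh = gr.Button("Refresh")
        # Re-scan the folder on click and push the new choices to the dropdown.
        refresh.click(
            fn=lambda: gr.update(choices=list_npz_files(folder)),
            inputs=None,
            outputs=voice,
        )
    return demo
```

Calling `build_ui().launch()` would start the interface; the same re-scan function could also be triggered after each generation instead of from a button.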

copper pewter May 20, 2023, 12:41 PM

#

like, i made this a while ago, and it made me want to learn more about ai to make my own models etc

untold briar May 20, 2023, 12:42 PM

#

So I found if you randomly sample and penalize tokens from language, you can just get random accents.

#

Even without a lot of data to use as reference

copper pewter May 20, 2023, 12:42 PM

#

makes sense

untold briar May 20, 2023, 12:44 PM

#

This was the only time I got this accent, but I forgot to save the random tokens!

copper pewter May 20, 2023, 12:44 PM

#

also, do you allow people to use empty prompts or na?

untold briar May 20, 2023, 12:44 PM

#

I was gonna allow empty, call it extra confused

#

but didn't actually implement

copper pewter May 20, 2023, 12:45 PM

#

hmm, what if someone made a model to enhance audio, so you can replace it in the semantic history to get a higher quality voice

#

confused

untold briar May 20, 2023, 12:45 PM

#

both prompts though

copper pewter May 20, 2023, 12:45 PM

#

the prompt was "a"

#

history prompt was dantdm

untold briar May 20, 2023, 12:46 PM

#

I think I mentioned, encoding bad audio to fix it

copper pewter May 20, 2023, 12:46 PM

#

wonder if you can literally just disable the assert for empty prompts for enabling them

untold briar May 20, 2023, 12:46 PM

#

is a cool use

copper pewter May 20, 2023, 12:46 PM

#

i want infinite famous people nonsense!

untold briar May 20, 2023, 12:47 PM

#

You can, ugh, I'm pretty sure I at least tried this

#

it was why I made extra confused mode an int

#

for how many segments of no prompt

#

haha

copper pewter May 20, 2023, 12:47 PM

#

also, with my code it's kind of like, something updates in bark? my code will keep working, as i have to update it myself. It did not support v2 speaker prompts until i updated it myself lol

#

i wonder what that could do

#

untold briar May 20, 2023, 12:49 PM

#

yeah i'm in even more trouble

#

out of control

copper pewter May 20, 2023, 12:49 PM

#

pain

#

literally just this for me

untold briar May 20, 2023, 12:50 PM

#

well bark infinity isn't using it at all

#

at least now, partly why

copper pewter May 20, 2023, 12:50 PM

#

but i'll see if i can increase the token limit as well, just slide the history window right?

untold briar May 20, 2023, 12:51 PM

#

it looks like a lot but it's 3 sets of tokens

#

and a bunch of constants

#

really

copper pewter May 20, 2023, 12:51 PM

#

dantdm changed language

untold briar May 20, 2023, 12:51 PM

#

The length is 1024

#

just the model

#

you can push it but it's probably bad

copper pewter May 20, 2023, 12:51 PM

#

untold briar it looks like a lot but it's 3 sets of tokens

3 sets of tokens? only the first one is limited though

#

well, limited the shortest at least

#

idk about the exact limits

untold briar May 20, 2023, 12:52 PM

#

Oh I meant like, all the parameters, it's really just 3 'sets of tokens'

copper pewter May 20, 2023, 12:52 PM

#

untold briar you can push it but it's probably bad

history prompts work, so this should work too

#

untold briar May 20, 2023, 12:54 PM

#

how big is that?

#

My guess is you get a lot more not following the text?

copper pewter May 20, 2023, 12:55 PM

#

"so i have this book, it's called"

untold briar May 20, 2023, 12:56 PM

#

I have at least 5 hours like that on my hard drive

#

maybe 2x that

#

For the two jokes. I was looking for real punchlines

copper pewter May 20, 2023, 12:56 PM

#

bitteling

#

sometimes it makes a bit of sense, sometimes it repeats a word, and sometimes it's just a word salad

untold briar May 20, 2023, 12:58 PM

#

If you scroll WAY back. somebody from Suno ran it on their better or bigger model.

#

And it made a punchline on like sample 3!

#

Somewhere in this discord

#

So it's way better than our bark, for that. I ran so many

spring herald May 20, 2023, 1:00 PM

#

@copper pewter Thanks for the repo. I'm still wondering, how do I do the cloning? Sorry, I'm new to this.
I have the wav audio file which I want to clone and I have your repo.
https://huggingface.co/GitMylo/bark-voice-cloning
https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer

Now how do i approach from here? If you can explain so that all of us who are non programmers can understand and use.

I tried to use this script https://github.com/gitmylo/bark-data-gen to generate the data using colab. It's running and has generated a bunch of .npy files. If this is the first step, what do I do next?

copper pewter May 20, 2023, 1:01 PM

#

i'll release my webui in a bit, you can do it in there (just note that it's bleeding edge, and breaking changes might be pushed to main at times)

foggy sandal May 20, 2023, 1:02 PM

#

@copper pewter I'm just looking at your repo now - did you forget to commit hubert_manager.py?

copper pewter May 20, 2023, 1:02 PM

#

possible

#

yeah looks like it, i guess pycharm didn't think it was needed since it was externally added (since i made that in the webui initially)

#

pushed it

foggy sandal May 20, 2023, 1:03 PM

#

I took a stab and downloaded https://huggingface.co/GitMylo/bark-voice-cloning/blob/main/quantifier_hubert_base_ls960_14.pth as the quantizer path

copper pewter May 20, 2023, 1:04 PM

#

correct, that's the 14 epoch quantizer

foggy sandal May 20, 2023, 1:04 PM

#

great, thanks

copper pewter May 20, 2023, 1:06 PM

#

https://github.com/gitmylo/audio-webui

A webui for different audio related Neural Networks

#

clone, run run.bat and you're ready

#

should be, haven't done testing on more than 2 machines

#

untested on linux, etc as well

foggy sandal May 20, 2023, 1:07 PM

#

the other problem with doing ML stuff is I feel like I'm constantly buying/filling up hard drives

copper pewter May 20, 2023, 1:07 PM

#

yeah, agree

#

i have an ssd with hundreds of gigabytes of stable diffusion models
and a hard drive with a hundred gigabytes of other models as well

untold briar May 20, 2023, 1:08 PM

#

Next Frame Prediction models.

#

So many

#

the checkpoints are each very different!

#

Even days apart

copper pewter May 20, 2023, 1:09 PM

#

i should probably add a little thing to show the command line flags too

#

since the -si flag is really useful, it skips the install, so it won't check if your packages are installed before launching

untold briar May 20, 2023, 1:10 PM

#

You know I can barely type but your code is small and clean. Could I actually do this even on 0 sleep

#

hmnn

#

A think worth doing is a think worth badly. Maybe.

#

well let's try using it at least

#

Even clear documentation. Nice

sharp gale May 20, 2023, 1:14 PM

#

copper pewter https://github.com/gitmylo/audio-webui

You're the best Mylo! So I just download this and run the .bat file? Anything I need to do before that?

untold briar May 20, 2023, 1:17 PM

#

What was your og biden clip? Was it also too close to mic? Sounds like my Biden, funnily enough lol.

copper pewter May 20, 2023, 1:17 PM

#

copper pewter May 20, 2023, 1:17 PM

#

sharp gale You're the best Mylo! So I just download this and run the .bat file? Anything I ...

yep

untold briar May 20, 2023, 1:17 PM

#

You gotta carefully breed singing bidens into more likely singing bidens

sharp gale May 20, 2023, 1:19 PM

#

untold briar You gotta carefully breed singing bidens into more likely singing bidens

haha this is hilarious

untold briar May 20, 2023, 1:19 PM

#

They are soooo good

#

the rhythm, sooo nice

#

Actually a ton of work, trying to get singing but not change voices too much, lol

#

You kind of need like tons of clones, one is not enough

#

I didn't make a tool yet, but the best trick: use a full, long 4-line singing prompt. But then cut the history prompt as short as you can, from the front, and preserve the og voice. Best chance

#

like use first few seconds only
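
A minimal sketch of that trimming idea, assuming a Bark history prompt holds three token streams under the keys `semantic_prompt`, `coarse_prompt` and `fine_prompt`, and assuming token rates of roughly 49.9 Hz (semantic) and 75 Hz (codec frames). The helper name `trim_history` is hypothetical, and plain lists stand in for the NumPy arrays a real `.npz` would contain, so check the rates and keys against your Bark version before relying on it.

```python
SEMANTIC_RATE_HZ = 49.9  # assumed semantic token rate
COARSE_RATE_HZ = 75      # assumed codec frame rate (coarse and fine share it)


def trim_history(prompt, seconds):
    """Keep only the first `seconds` of every token stream (hypothetical helper)."""
    n_semantic = int(seconds * SEMANTIC_RATE_HZ)
    n_frames = int(seconds * COARSE_RATE_HZ)
    return {
        # 1-D stream of semantic tokens
        "semantic_prompt": prompt["semantic_prompt"][:n_semantic],
        # one row per codebook, one column per codec frame
        "coarse_prompt": [row[:n_frames] for row in prompt["coarse_prompt"]],
        "fine_prompt": [row[:n_frames] for row in prompt["fine_prompt"]],
    }


# Dummy 10-second prompt with the assumed shapes (2 coarse rows, 8 fine rows).
full = {
    "semantic_prompt": [0] * 499,
    "coarse_prompt": [[0] * 750 for _ in range(2)],
    "fine_prompt": [[0] * 750 for _ in range(8)],
}
short = trim_history(full, 3)
print(len(short["semantic_prompt"]), len(short["coarse_prompt"][0]))
```

With real NumPy arrays the same cut would be a column slice on the 2-D arrays. All three streams are cut at the same point in time so they stay aligned.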

copper pewter May 20, 2023, 2:00 PM

#

squidward as history, and it starts singing?

#

the stuff that it creates actually sounds like music

glossy trout May 20, 2023, 3:19 PM

#

copper pewter squidward as history, and it starts singing?

omg it sounds like squidward is summoning a demon

glossy trout May 20, 2023, 3:28 PM

#

copper pewter https://huggingface.co/GitMylo/bark-voice-cloning https://github.com/gitmylo/bar...

You are the real MVP! Can't wait to tinker with this! 😁

sharp gale May 20, 2023, 6:29 PM

#

Hey @copper pewter ! So where exactly do I run the command lines? I add that to the .bat file?

Also, where do I add the file to clone the voice in the UI?
sorry to bug you for tech support lol

copper pewter May 20, 2023, 6:42 PM

#

the bat is for the webui, it has it built in

#

like, the huggingface link is the model itself, the bark-voice-cloning-HuBERT-quantizer repo is the semantic extraction code itself. the audio-webui is a webui i was working on

#

the reason i made voice cloning was because i felt like it was missing from my webui (although a lot of stuff is still missing)

sharp gale May 20, 2023, 6:51 PM

#

copper pewter like, the huggingface link is the model itself, the bark-voice-cloning-HuBERT-q...

Oh so the webui is a separate thing from the voice clone tool?

copper pewter May 20, 2023, 6:51 PM

#

yeah, the webui has voice cloning integrated

sharp gale May 20, 2023, 6:53 PM

#

copper pewter yeah, the webui has voice cloning integrated

Gotcha! and where do I do it exactly? lol sorry for the stupid questions

copper pewter May 20, 2023, 6:55 PM

#

for the webui: you just load the bark model on the text to speech tab, and set "speaker from" to "upload", which allows you to upload a wav

sharp gale May 20, 2023, 6:56 PM

#

Ah gotcha, that's where I'm running into the low memory problem then. I only have an 8GB GPU

#

where do i run the command for the low vram?

copper pewter May 20, 2023, 7:00 PM

#

instead of directly running run.bat, you can add arguments, or create a bat where you run with arguments

sharp gale May 20, 2023, 7:12 PM

#

gotcha!

#

would this work if i add this as the first thing on the bat file?

set COMMANDLINE_ARGS= --skip-install --bark-low-vram

#

Oh nvm it didn't work lol, the bat is still installing

#

I'll figure it out thanks for the clarification man

copper pewter May 20, 2023, 7:18 PM

#

it doesn't use COMMANDLINE_ARGS

#

this isn't stable diffusion webui

#

instead, what you do is you create a new bat file, which does this:
call run.bat --skip-install --bark-low-vram
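
A minimal sketch of why the environment variable had no effect, assuming the launcher reads its flags from the command line with `argparse`. The flag names come from this thread; the function name `parse_flags` and the parser layout are hypothetical, and the real audio-webui argument module may look different.

```python
import argparse


def parse_flags(argv):
    """Parse launcher flags from an argv-style list (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-si", "--skip-install", action="store_true",
                        help="skip the package install check before launching")
    parser.add_argument("--bark-low-vram", action="store_true",
                        help="run Bark in a lower-VRAM mode")
    return parser.parse_args(argv)


# What `call run.bat --skip-install --bark-low-vram` forwards to the script.
with_flags = parse_flags(["--skip-install", "--bark-low-vram"])

# What the script sees when the flags only live in an environment variable:
# nothing is forwarded on the command line, so every flag stays off.
without_flags = parse_flags([])

print(with_flags.skip_install, with_flags.bark_low_vram)
print(without_flags.skip_install, without_flags.bark_low_vram)
```

A parser like this only looks at the arguments it is handed, so `set COMMANDLINE_ARGS=...` changes nothing unless the script explicitly reads that variable.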

sharp gale May 20, 2023, 7:19 PM

#

copper pewter this isn't stable diffusion webui

exactly where i got it from lol

#

ah that's what was missing! the call command. Thanks man!

#

Finally got it working you're the man Mylo! Thank you!

untold briar May 21, 2023, 12:28 AM

#

@copper pewter I'm planning on tagging the NPZs I make so I don't mix them up with regular generations. Maybe this could be a general convention? Just in case I wish I'd known in the future. Maybe

wooden pier May 21, 2023, 4:07 AM

#

Is cloning only for English now ?

untold briar May 21, 2023, 4:08 AM

#

If you care about accuracy

#

For music or something it can also be used

wooden pier May 21, 2023, 4:21 AM

#

🤣 I mean other languages

untold briar May 21, 2023, 4:29 AM

#

I haven't had time, but it might be ok in European languages or ones closer to English

#

but it's lower quality for sure

#

I bet it could be fixed in a day, so no worries

#

I could probably do French, happen to have a lot

wooden pier May 21, 2023, 4:31 AM

#

so you need to train another model for a different language, right ?

untold briar May 21, 2023, 4:32 AM

#

Nope, just need more data really

#

Train the same model, just better

wooden pier May 21, 2023, 4:32 AM

#

then all language mixed together?

untold briar May 21, 2023, 4:33 AM

#

It could be the case it works better if you split, I kind of doubt it but maybe

#

it hasn't been 24 hours so like, pretty early

wooden pier May 21, 2023, 4:34 AM

#

💀 I want to try japanese and chinese at the moment

untold briar May 21, 2023, 4:34 AM

#

I'm super short on time, but tomorrow I'll try a bunch

wooden pier May 21, 2023, 4:35 AM

#

ok, I'll wait

signal apex May 21, 2023, 4:38 AM

#

Hi ! Where can I find a birdseye view on how bark works ? The global architecture, and the different steps ?

#

I am new to ML and TTS and have been going around audio ML, SPEAR, and others to chew on new concepts. I'd like to do the same with Bark. Does it go text to semantic tokens to ?? to mel-spec to audio?

untold briar May 21, 2023, 4:41 AM

#

no mel

#

look at audiolm

#

the same

#

or https://github.com/lucidrains/audiolm-pytorch

GitHub

GitHub - lucidrains/audiolm-pytorch: Implementation of AudioLM, a S...

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch - GitHub - lucidrains/audiolm-pytorch: Implementation of AudioLM, a SOTA Language...

#

a lot of it is almost exactly the same

#

there's no page like that for bark specifically so that's the best model

#

Anything I say is basically badly trying to summarize that

#

well the low level boring stuff i kind of know

signal apex May 21, 2023, 4:44 AM

#

Ok, I was reviewing this as well, cool. Does it use the PyTorch implementation of SoundStream or EnCodec? Both of those codecs can transform audio tokens to audio, right?

untold briar May 21, 2023, 4:45 AM

#

encodec

#

I think, honestly never dug in that part

#

but pretty sure off top of head
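
A rough sketch of the pipeline being described, as I understand it: text goes to semantic tokens, then coarse codec tokens, then fine codec tokens, and an EnCodec decoder turns those into a waveform, with no mel-spectrogram step. The constants below (49.9 Hz semantic rate, 75 Hz codec frame rate, 2 coarse codebooks out of 8 total, 24 kHz audio) are assumptions to verify against the Bark source, and `stage_sizes` is a hypothetical helper that only counts tokens per stage; it does not run any model.

```python
SEMANTIC_RATE_HZ = 49.9   # assumed: semantic tokens per second
CODEC_FRAME_RATE_HZ = 75  # assumed: EnCodec frames per second
N_COARSE_CODEBOOKS = 2    # assumed: codebooks predicted by the coarse stage
N_FINE_CODEBOOKS = 8      # assumed: total codebooks after the fine stage
SAMPLE_RATE = 24_000      # assumed: output sample rate in Hz


def stage_sizes(seconds):
    """Count how many values each stage produces for `seconds` of audio."""
    frames = round(seconds * CODEC_FRAME_RATE_HZ)
    return {
        "semantic_tokens": round(seconds * SEMANTIC_RATE_HZ),  # text -> semantic
        "coarse_tokens": frames * N_COARSE_CODEBOOKS,          # semantic -> coarse
        "fine_tokens": frames * N_FINE_CODEBOOKS,              # coarse -> fine
        "audio_samples": round(seconds * SAMPLE_RATE),         # fine -> waveform
    }


sizes = stage_sizes(10)
for stage, count in sizes.items():
    print(f"{stage}: {count}")
```

Under these assumptions the coarse stage carries a quarter of the codec tokens and the fine stage fills in the rest, which is the split the next question asks about.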

signal apex May 21, 2023, 4:46 AM

#

Also generally speaking, why is there a separation between coarse and fine audio token ?

untold briar May 21, 2023, 4:46 AM

#

that's a surprisingly adept question

#

I don't think they would be too mad if I mentioned it: the devs said it's probably not actually ideal. But that's how it is.

#

haha

#

I mean still works great

#

but there were strong hints they probably would not do that again

signal apex May 21, 2023, 4:48 AM

#

Is that an architecture geared for training optimisation ?

untold briar May 21, 2023, 4:49 AM

#

that's past what I know, just a few details and practical stuff, not sure

#

Doesn't seem like it but no idea

#

really it just feels like its goals are general purpose and wide, first

#

so really that's it

#

It might train fast, I think somebody did something with a new language

#

very fast but I don't know

signal apex May 21, 2023, 4:52 AM

#

Ok, so that's also probably how AudioLM works? But there must be a general reason for the why. Is there a Suno team member I can ask in here? (I thought you might be)

untold briar May 21, 2023, 4:52 AM

#

admin in discord, but has sleep symbol

#

but those people

#

especially gkucsko if you catch him online

#

or just search the discord for his message perhaps

#

that might answer your questions pretty well

#

that's where all the knowledge would have come from, if somebody did know

signal apex May 21, 2023, 4:56 AM

#

Cool thanks let me ping @open magnet on the question then, for when he'll be around 🙏

untold briar May 21, 2023, 4:57 AM

#

They say they want to be a foundation model for audio. Foundation models aren't like about training fast, right? Just being super cool and awesome.

#

Actually that's not right, training fast can be part of that, in fact is

#

It's too late

#

I was thinking GPT like super big and slow, then I was thinking of Diffusion, all the Segment Anything things, and like training fast in some ways is cool

wooden pier May 21, 2023, 4:59 AM

#

Traceback (most recent call last):
  File "D:\audio-webui-master\audio-webui-master\main.py", line 9, in <module>
    from webui.modules.implementations.tts_monkeypatching import patch as patch1
  File "D:\audio-webui-master\audio-webui-master\webui\modules\implementations\__init__.py", line 1, in <module>
    import webui.modules.implementations.ttsmodels as tts
  File "D:\audio-webui-master\audio-webui-master\webui\modules\implementations\ttsmodels.py", line 9, in <module>
    from webui.modules.implementations.patches.bark_custom_voices import wav_to_semantics, generate_fine_from_wav,
  File "D:\audio-webui-master\audio-webui-master\webui\modules\implementations\patches\bark_custom_voices.py", line 8, in <module>
    from hubert.pre_kmeans_hubert import CustomHubert
  File "D:\audio-webui-master\audio-webui-master\hubert\pre_kmeans_hubert.py", line 9, in <module>
    import fairseq
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\__init__.py", line 20, in <module>
    from fairseq.distributed import utils as distributed_utils
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\distributed\__init__.py", line 7, in <module>
    from .fully_sharded_data_parallel import (
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\distributed\fully_sharded_data_parallel.py", line 10, in <module>
    from fairseq.dataclass.configs import DistributedTrainingConfig
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\dataclass\__init__.py", line 6, in <module>
    from .configs import FairseqDataclass
  File "D:\audio-webui-master\audio-webui-master\venv\Lib\site-packages\fairseq\dataclass\configs.py", line 1104, in <module>
    @dataclass
    ^^^^^^^^^

#

raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'fairseq.dataclass.configs.CommonConfig'> for field common is not allowed: use default_factory
Press any key to continue . . .
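
A minimal sketch of what this error usually means, independent of fairseq: a dataclass field uses another dataclass instance as its default. Since Python 3.11 the `dataclasses` module rejects any unhashable default (earlier versions only rejected `list`, `dict` and `set`), and the `^^^^` markers in the traceback suggest the venv here is on 3.11 or newer. The class names below are stand-ins, not fairseq's real definitions.

```python
from dataclasses import dataclass, field


@dataclass
class CommonConfig:  # stand-in for the config class named in the error
    seed: int = 1


def define_with_instance_default():
    """Return the error text if this Python rejects the pattern, else None."""
    try:
        @dataclass
        class Config:
            # One shared instance as the default: rejected on Python 3.11+.
            common: CommonConfig = CommonConfig()
    except ValueError as exc:
        return str(exc)
    return None


@dataclass
class FixedConfig:
    # The form the error message asks for: build a fresh default per instance.
    common: CommonConfig = field(default_factory=CommonConfig)


message = define_with_instance_default()
a, b = FixedConfig(), FixedConfig()
a.common.seed = 2  # does not leak into b, each instance has its own config
print(message)
print(a.common.seed, b.common.seed)
```

If that is the cause, the practical options are a fairseq build that already uses `default_factory`, or a venv on an older Python where the old check still applies.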

untold briar May 21, 2023, 5:00 AM

#

I don't think I can possibly debug that this late, but I might try and update my fork in an hour, if it can be done, and I have that running and didn't get that error

#

To me that looks like one library is a slightly wrong version

#

good luck figuring it out

wooden pier May 21, 2023, 5:01 AM

#

💀 try to add anaconda installation

untold briar May 21, 2023, 5:02 AM

#

omfg I really am asleep. Search my history I did have that

#

Just pip install fairseq

#

forget conda

#

too out of date

#

some breaking change

#

sorry that's hilarious, literally that exact problem hours ago, like 5

wooden pier May 21, 2023, 5:03 AM

#

I hate vene😭

#

venv

untold briar May 21, 2023, 5:03 AM

#

it should be like a thing, if you make breaking changes, then update your conda wtf

#

actually not sure how that stuff is organized exactly, now that I think about. but surely somebody is to blame

signal apex May 21, 2023, 5:11 AM

#

untold briar I was thinking GPT like super big and slow, then I was thinking of Diffusion, al...

Also reading the SoundStorm page https://google-research.github.io/seanet/soundstorm/examples/ this doesn't use coarse and fine tokens, I believe? And is orders of magnitude faster for inference?

untold briar May 21, 2023, 5:12 AM

#

Sounds a lot less expressive to me though

#

Not to mention, all the same style

#

Bark is really wide

#

all those samples are a tiny blip in Bark space

#

the same dot

#

they are okay but all kind of close. though maybe Google just chose them all in that style

#

i'm being hyperbolic but I do think Bark is quite a bit better

signal apex May 21, 2023, 5:14 AM

#

Ok, I personally am blown away by the SoundStorm demo. Generating 30 seconds of dialog from 2 seconds of audio, in exactly the same voice, is way over any results I've seen from Bark, right?

untold briar May 21, 2023, 5:14 AM

#

Well

#

what about 0 seconds

signal apex May 21, 2023, 5:14 AM

#

They do 0 seconds as well

untold briar May 21, 2023, 5:14 AM

#

I mean

#

like a perfect model, no wav input

#

I have like 10

#

It's nuts

#

I keep meaning to do a writeup but never got around to it

#

you can hone in on them in the latent space

#

like anyone it's crazy. no wavs

#

it's a bit pointless now that we have actual cloning

#

but super cool

#

not really anyone, presumably famous people

#

I would explain but basically just keep tweaking bark voices directionally

signal apex May 21, 2023, 5:16 AM

#

Personally I don't want cloning. I want to be able to generate 10-20 voices that are general purpose for audio books and podcasts, and even the top would be prompting the model for a voice

untold briar May 21, 2023, 5:17 AM

#

I'm a bit baffled by the cloning myself

signal apex May 21, 2023, 5:17 AM

#

Cloning is just a quick way to get a voice

untold briar May 21, 2023, 5:17 AM

#

Why do people want it so bad? If it's just clear voices, you won't need it

#

Just need a bit of time

signal apex May 21, 2023, 5:18 AM

#

Time is exactly what people don't have, and skills. Voice cloning removes those problems.

#

Also, long form right

#

I tried to do the long form bark, could not get anywhere. I still want to learn and improve my bark-fu, so I'll get back to it

#

But can you share your best inferences on your best voices ?

untold briar May 21, 2023, 5:20 AM

#

I can totally, but can we make it tomorrow, I'll be doing dev on bark probably

#

i mean i probably have random wavs but

#

So what I have handy is trying to make the presidents sing

#

lol

#

These are some of my favorites to be honest though

signal apex May 21, 2023, 5:22 AM

#

Do you have paragraph long audio ?

untold briar May 21, 2023, 5:22 AM

#

Singing is hard and tends to lead to distortion or voice changes

signal apex May 21, 2023, 5:22 AM

#

Like long paragraph ?

signal apex May 21, 2023, 5:23 AM

#

untold briar Singing is hard and tends to lead to distortion or voice changes

Doing singing is super impressive and cool, but not very useful (at least I think) it's like a beautiful geeky art

untold briar May 21, 2023, 5:23 AM

#

these were not chosen for clarity at all, just some i have

#

typically the clarity is mostly in the voice

#

but Bark does have a reliability issue. but non real time? you can just try again and get clear

signal apex May 21, 2023, 5:26 AM

#

Get that. But imagine you want to feed it a book or an article. You need a reliable pipeline, not manual trial and error. I am sure there are ways to get there but bark is not there rn

untold briar May 21, 2023, 5:26 AM

#

Do you think so? I feel like that's not right exactly

#

Oh nvm

#

I was gonna say you would direct an actor

#

And listen to the whole book

#

but you mean, as purely automated

signal apex May 21, 2023, 5:26 AM

#

My knowledge is super limited actually, maybe bark can actually do all that !

untold briar May 21, 2023, 5:27 AM

#

Yeah I think Bark is better thought of almost like an actor. Not a script. So you may have to be involved, but the final result is a lot better

signal apex May 21, 2023, 5:27 AM

#

(can't download your sounds from the app, I'll try the PC later)

untold briar May 21, 2023, 5:27 AM

#

Oh hm

#

i think it's the lack of video

#

annoying

#

in mp4

signal apex May 21, 2023, 5:29 AM

#

Apologizing to have all this discussion on the main channel. Let's do a thread

untold briar May 21, 2023, 5:29 AM

#

tomorrow i'm out for the night, even mentally

signal apex May 21, 2023, 5:30 AM

#

Bark and longform

blissful verge May 21, 2023, 6:50 AM

#

wooden pier 💀 I want to try japanese and chinese at the moment

Bark makes mistakes in Japanese readings

wooden pier May 21, 2023, 6:52 AM

#

I haven't been able to run the ui yet

#

trying to upgrade conda

#

and my base env fucked me up

blissful verge May 21, 2023, 6:54 AM

#

There are multiple UIs

blissful verge May 21, 2023, 7:49 AM

#

@copper pewter Would it be possible to correct incorrect readings/pronunciations with a fine-tuned semantic model?
Though the dataset would be another challenge.

untold briar May 21, 2023, 7:54 AM

#

I think I'll push mine to git, even just mid changes, cause it does work and it's been weeks now. But you may have to figure out what libraries are new or changed yourself, or wait for tomorrow

#

For installing or whatever

#

But it has a copy and paste huberty myloizer

copper pewter May 21, 2023, 8:47 AM

#

blissful verge @copper pewter Would it be possible to correct incorrect readings/pronunc...

Yes, but you'll have to make it learn those semantics, so you'll need to use a model to extract them

blissful verge May 21, 2023, 8:49 AM

#

That is super exciting, because it suggests that there's an actual method to improve upon this model's weaknesses.

#

There are multiple UIs if you want an easier install.

wooden pier May 21, 2023, 9:24 AM

#

how to clone with audio-webui

#

I have no idea, there is not a place for audio input ?

#

since I used anaconda, I might have just skipped all the model downloads

copper pewter May 21, 2023, 9:35 AM

#

wooden pier how to clone with audio-webui

put the "speaker from" on "upload", then upload your audio file
and if you want to replace a voice in an audio clip, put "Input type" on "File"

#

when you clone a voice, it gets saved in your bark custom speakers directory,

wooden pier May 21, 2023, 9:36 AM

#

looks like I didn't download any model

#

only quantifier_hubert_base_ls960_14.pth

copper pewter May 21, 2023, 9:36 AM

#

the button you get to download the speaker is if you want to download the speaker based on the audio generated

copper pewter May 21, 2023, 9:37 AM

#

wooden pier only quantifier_hubert_base_ls960_14.pth

it should automatically download them from facebook and huggingface?

wooden pier May 21, 2023, 9:37 AM

#

I used anaconda, so I just comment out something in main.py and run it

#

or how to disable venv installation in run.bat process

untold briar May 21, 2023, 9:44 AM

#

Yeah they should be automatic

#

just if the app is loaded

#

I actually used regular installer though, like used the .bat

#

however I had tried conda, that's why I knew the fairseq thing

wooden pier May 21, 2023, 9:47 AM

#

but nothing is downloaded, I am in webui

#

from webui import args # Will show help message if needed

#from install import ensure_installed

#print('Checking installs and venv')

#ensure_installed() # Installs missing packages

from webui.modules.implementations.tts_monkeypatching import patch as patch1
patch1()

print('Launching')

from webui.webui import launch_webui

launch_webui()

copper pewter May 21, 2023, 10:17 AM

#

right, that's how you disable the install and venv check/activation, (since it only checks if it's in venv i think, i don't have conda and don't want to install it)

wooden pier May 21, 2023, 10:28 AM

#

🤣 yeah, just no models

copper pewter May 21, 2023, 10:35 AM

#

wooden pier 🤣 yeah, just no models

it downloads the cloning models when you start cloning for the first time though

#

it doesn't download on install, in case you'd never use it, it would be a waste of storage space

wooden pier May 21, 2023, 11:22 AM

#

copper pewter put the "speaker from" on "upload", then upload your audio file and if you want ...

I have no idea where the upload is

copper pewter May 21, 2023, 11:23 AM

#

you need to load a model first

#

in this case bark