#🐣┃suno-showcase


crisp bone
clever turret
hazy rain
tardy topaz
#

Did you filter out the brackets? She keeps replacing them with () instead. 😄

#

BTW I think Bark could be really cool for a livestream that needs constantly changing voices. It could even create a new voice on the spot. That's a pretty unique capability. So a whole script with new characters, and each one gets a unique voice for each script.

#

Instead of single speaker.

#

(for clarity, I would cheat a bit, and make the random voices variants of already clear voices...)

hazy rain
#

This is really cool!! Is it like footage loops with some Wav2Lip?

tardy topaz
#

It responds impressively fast, for Bark anyway.

hazy rain
#

It's overall very cool!

#

of course there is already a troll 😅

tardy topaz
#

That's me

hazy rain
#

30cm?

tardy topaz
#

Oh no, nvm, I am same name

#

like the image I posted, probably before you

#

I tried to get her to say brackets for a while

hazy rain
#

Yeah I thought so. No, someone who came just as I was testing it...

#

classic stuff

hazy rain
#

But yes it's super fast

tardy topaz
#

It must be a small model?

#

Or maybe 4090

hazy rain
#

A lot has happened in LLMs lately that I haven't followed

#

but for bark yes it must use the small ones

tardy topaz
#

I tested on a friend's 4090 and I'm pretty sure even that didn't return 14s of audio in 14s

#

on large models

#

But it wasn't pytorch nightly...

hazy rain
tardy topaz
#

I think it was maybe 40%, I forget now

#

not double at all

hazy rain
#

Wow still much more than I thought

tardy topaz
#

Yeah I am still jealous.

hazy rain
#

But wait... in a very short time it does: Some diffusion image used as presentation, language processing, wav2lip and bark....

#

Must be multiple things running in parallel

tardy topaz
#

the llm is probably an api

#

really just wav2lip and bark

#

or whatever the new hot wav2lip is

hazy rain
#

Yes it must be gpt3.5 it just did the "as a language model"

#

I'm still baffled by the speed of progress

#

it uses diffusion!

tardy topaz
#

I was playing with an 'infinite live TV show' on Twitch. It used diffusion for the video, in a really hacky way. You generate a scene of the characters, which could be anything, like walking down a street in Paris. Then you use some extensions to split it up and make the characters look different, MultiDiffusion I think it was (multiple people in SD tend to blend, like merged faces, if you ask for, say, three different people normally). Then you rapidly generate 'variations' which, if you play them quickly, well, it looks like puppets talking. But it was so not realtime, like I would have had to have three 3090s.

hazy rain
#

I just asked it that and it told me "yes it's some of the tools I use"

tardy topaz
#

At the time I had a burning motivation because "infinite Seinfeld" was blowing up and I thought it was just awful, and wasting the format. But when that died I didn't have the urgency.

hazy rain
tardy topaz
#

One thing I was trying to get working was the depth mapping. You know how a sitcom is a multicamera thing? Like it's a single line of cameras. So all the angles are straight, a little left, and a little right. And you can depth map the output of SD, if you've seen that. 3d-in-painting extension or similar name. And then pan left and right, like a sitcom! But OMFG so slow

#

If somebody gives me 10 4090s. I could make a hell of a sitcom stream. Just throwing that out there.

hazy rain
#

Sounds intriguing. I know 3d inpaint, but I haven't tried much animation besides existing extensions for automatic

#

(deforum and temporal kit only actually)

tardy topaz
#

I think it might be okay with deforum. I didn't explore it much.

#

But the images change more

#

At least then, I know there have been new developments.

hazy rain
#

Yeah, now it supports ControlNet, which gives much more control

tardy topaz
#

But you can basically 'pan left, pan right' it just (at the time) looks like a crazy dream image

#

I had worked so hard just trying to get the characters to talk like puppets, without their shirts changing colors constantly. I was like OH NO I can't throw that away.

hazy rain
hazy rain
tardy topaz
# hazy rain It started well 😅

Haha, actually many voices (or voices that sound like a podcast) seem to switch a lot. It's like Bark is modeling the changing of speakers, I guess. Lots of random voices. The worst is cartoons. Try making a cartoon voice that doesn't change randomly sometimes. Also I swear Bark says that 'I like pizza' in that style all the time, I know that! Weird. I actually never really checked: for speaker prompts that switch voices very consistently, do they ever switch to the same voice at a somewhat regular time, like the end of a sentence or something? Then you could maybe use one speaker and just render out a conversation in Bark, with Bark doing both sides. You don't split it at all. Doesn't seem like it should work. But stranger things have.

hazy rain
#

Ohhh I had seen your GauGAN experiment!!

tardy topaz
#

Haha, I think I burned 10 years of GPU lifetime against the NVIDIA GauGan, they never limited it.

#

I literally put entire movies through it.

#

Frame by frame. I had this ridiculous browser based segmentation model using processing.js just thrown together, as a preprocessing before the main one.

#

That's another project that was weirdly limited. It was like 'draw a tree' but actually had 100+ other categories

#

My Discord Icon? That's a gaugan person. When the model has no people. But the category for people was in the GauGAN model, a little bit, accidentally. And it would render people that looked like this. It was enchanting and spooky. Delightful.

#

It even has a mouth!

#

Later NVIDIA blocked all those categories, a sad day

#

Pretty sure they eventually came back, but SD kind of made it feel old fashioned

hazy rain
#

I had no idea you could use it in the cloud back then! I showed your experiment around at the time quite a lot!

#

Testing ChatGPT with the input "Write a mid size monologue from an A.I that would imitate Joe Rogan's voice, make it light and fun" and the rogan voice (some long parts work really well):

tardy topaz
# hazy rain

Sounds pretty good but you need to run a speedup on that rogan. All the clones are too slow for me by default. Just run a bunch of long prompts and resave, try to find one that doesn't change voice, but talks faster.

#

Unless that's the goal, a robot Rogan?

hazy rain
tardy topaz
#

I think it is deterministic. That's partly why I cut like random.

#

I even added noise and stuff at one point, just to get different results, but I think I pulled it

#

Or at least the large model. Maybe small wasn't, the first tone.

hazy rain
#

But inference isn't, so I run inference multiple times

tardy topaz
#

I don't recall if it's always a match but I have gotten exact same clones, with same audio. But Bark Infinity spits out clones like a fountain so doesn't really matter, just every few seconds in the audio.

#

That's my secret. I'm just spamming out the clones. There's good in there somewhere

hazy rain
tardy topaz
#

It's like, generate 10 clones, from each, render one sample. That's the default. The last clone, the one you get if you do 1, is often the best.

hazy rain
#

(and I think it likes 10 sec inputs better for cloning)

tardy topaz
#

Yeah you don't need 10 clones for 10 seconds.

#

But for 1 minute?

hazy rain
tardy topaz
#

Yeah totally, it's really just go get a snack, come back, check the samples

#

Maybe some inferred well

#

Then check again

hazy rain
#

Yep I do that for the audio generation but not much for the voices (I tried initially), but I'm very much just starting all this, you have way more experience

tardy topaz
#

My experience is mostly pre cloning, from tweaking voices. But weirdly it's still useful. Because the clones don't come out the oven quite right.

#

However I am still trying to figure out how to put some in Bark Infinity, as a feature that just works.

#

Mainly how to edit the speaker files

#

So for example, you can grab the rogan voice from seconds 12 to 18, or whatever. (choosing random numbers)

#

because you can hear that part is good

#

So you can snatch that rogan up

#

I learned today you CAN do this in Gradio. There is an audio tool to pick a section of audio.

#

So it should be doable

#

I don't know if that makes sense. But for that rogan clip, if you save all the npzs, there might be a section that just sounds really good. Well, you can carefully try to build a new npz of just that, by selecting the right spot.

#

It's almost so simple it's weird, right? Just like, find the best rogan part, and make that a voice. But really it's that simple thinking that works in Bark

#

So today I learned I can make this possible in the UI
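
A minimal sketch of picking out "the good rogan part", assuming the speaker file uses Bark's history-prompt layout (`semantic_prompt`, `coarse_prompt`, `fine_prompt` arrays) and Bark's token rates of roughly 49.9 semantic tokens and 75 coarse/fine frames per second. `slice_voice` and the random stand-in data are hypothetical, not part of Bark or Bark Infinity.

```python
import numpy as np

# Assumed token rates, taken from Bark's generation constants.
SEMANTIC_RATE_HZ = 49.9
COARSE_RATE_HZ = 75.0


def slice_voice(voice, start_s, end_s):
    """Return a new voice dict covering only [start_s, end_s) seconds."""
    if not 0 <= start_s < end_s:
        raise ValueError("need 0 <= start_s < end_s")
    sem_a, sem_b = (int(round(t * SEMANTIC_RATE_HZ)) for t in (start_s, end_s))
    cod_a, cod_b = (int(round(t * COARSE_RATE_HZ)) for t in (start_s, end_s))
    return {
        "semantic_prompt": voice["semantic_prompt"][sem_a:sem_b],
        # coarse and fine are (codebooks, time), so slice the last axis.
        "coarse_prompt": voice["coarse_prompt"][:, cod_a:cod_b],
        "fine_prompt": voice["fine_prompt"][:, cod_a:cod_b],
    }


# Stand-in for np.load("rogan.npz"): about 20 seconds of random tokens.
rng = np.random.default_rng(0)
voice = {
    "semantic_prompt": rng.integers(0, 10_000, size=998),
    "coarse_prompt": rng.integers(0, 1024, size=(2, 1500)),
    "fine_prompt": rng.integers(0, 1024, size=(8, 1500)),
}

best = slice_voice(voice, 12, 18)
# np.savez("rogan_12_18.npz", **best) would write the trimmed speaker file.
```

The semantic and coarse slices only line up approximately, since the two rates differ, so expect some trial and error around the boundaries.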

hazy rain
#

That would be awesome, I was wondering how to edit/visualize it better to enhance it.
One idea I wanted to try was lerping good voices from a batch on mylo's ui or yours, which are both non deterministic

tardy topaz
#

Yeah it's still a little awkward, but I'll try.

#

Like that bar, you can't see anything. But if you hit the play button it starts at that spot.

#

So it's trial and error but you can find the good spots

#

Audio editing is a really bad fit in Gradio but it's kind of what you need.

#

But you aren't editing the wav file, or not always. You edit the NPZ generation.

#

And maybe just semantic or coarse.

#

Though that's maybe too advanced and fancy.

#

I will just use the wav file as input, so I know where the user thinks the best voice is.

#

I may be getting more ambitious than my software dev skills. But it's really fun.

#

Just poking and fiddling with each voice.

#

I mean I got stuck today trying to fix some bad filename bug for 3 hours, so yeah, probably getting too ambitious. But it's a good idea.

hazy rain
#

bark infinity is super cool. I couldn't make "batch" work even after ticking the warning, so I started the CLI to both clone and infer.
I will try to decipher the logic behind yours and mylo's clone to understand why mine is deterministic and I'll test further

tardy topaz
#

For batch, can you describe what you were doing?

#

I just want to know if it was a bug, which it could be

hazy rain
#

The cursor was showing a red forbidden sign and I couldn't bump it up

#

(in the voice cloning tab)

#

the main one worked

tardy topaz
#

what were you batching?

#

Voice clones, or samples?

hazy rain
#

What we were saying: 1 wav input -> multiple clones (npz)

#

Regarding your audio editing tool, you can also replace the player completely in html/js to fit better

tardy topaz
#

I could use a tool if somebody wrote one, but myself, I probably wouldn't do it. But yeah it could work, and I did search a little but didn't find anything. But maybe it already exists somewhere?

#

Okay so by batch, you want to upload one wav.

#

That is 10 seconds ish

#

and get different clones

#

By processing it over and over

hazy rain
#

Yep to then cherry pick the best!

#

(for now I'm doing this on the inference in my cli because my clone is deterministic)

tardy topaz
#

I pulled that feature when it was deterministic, though for a while I was running an audio process to make the wav slightly different.

hazy rain
#

Oh yours is too!

tardy topaz
#

However, I'm not sure you are getting better results

hazy rain
#

I did not know!

#

Mylo's is not (audio webui)

tardy topaz
#

Is your CLI really not deterministic?

#

when you clone?

hazy rain
#

No I'm saying the opposite 🤣

#

Inference -> non deterministic
Voice clones -> deterministic

#

Mylo's audio webui isn't

#

you give the same input wav and each generation creates an npz of a different size

tardy topaz
#

I think it is, but what happens is the gradio UI does some conversions sometimes

#

I tried to figure this out

#

and I'm pretty sure it is deterministic, if you give it exact same wav

#

But sometimes there's some audio conversion that adds rng to it

hazy rain
#

mine produces the same stuff (I compared in numpy, and they are even exactly the same size on disk)
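
One way to check the "same wav, same clone" claim is to compare the arrays inside two `.npz` files rather than the file sizes. A small sketch with numpy; `same_npz` is a hypothetical helper and the demo files are stand-ins for real clones.

```python
import os
import tempfile

import numpy as np


def same_npz(path_a, path_b):
    """True if both archives hold the same keys with identical arrays."""
    with np.load(path_a) as a, np.load(path_b) as b:
        if set(a.files) != set(b.files):
            return False
        return all(np.array_equal(a[k], b[k]) for k in a.files)


# Demo with stand-in data: two identical "clones" and one that differs.
tmp = tempfile.mkdtemp()
tokens = np.arange(100)
paths = [os.path.join(tmp, f"clone_{i}.npz") for i in range(3)]
np.savez(paths[0], semantic_prompt=tokens)
np.savez(paths[1], semantic_prompt=tokens)
np.savez(paths[2], semantic_prompt=tokens[::-1])

print(same_npz(paths[0], paths[1]))
print(same_npz(paths[0], paths[2]))
```

Two files can have the same size and still differ in content, so the array comparison is the stronger check.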

tardy topaz
#

I am not sure if that is useful, have you found it helpful?

hazy rain
tardy topaz
#

Like the same wav is really good a second time?

hazy rain
#

I'm lost I thought that's what you are advocating for

#

(multiple voice clones)

tardy topaz
#

Haha, it's confusing.

hazy rain
#

I personally abandoned that path and only batch the inference (text to speech from the same npz)

#

and compare that

#

I think you can test the cli in your bark infinity env

tardy topaz
#

In Mylo's UI, you can use a wav file two ways. One way is pure speaker. That puts a file in /clones_voices directory. That is deterministic.

#

The other way, the other field, is using the audio file as a prompt. So you get a crazy variety of voices. But it sometimes works. That is the speaker.npz file.

#

so if the name of the .npz is the name of the wav, then it's 100% the same I think. The other one is kind of ignoring part of the file, it's more for creative uses.

#

But actually sometimes makes a decent clone by luck

#

I'm trying to figure out how to make that stuff clear myself

#

but basically, if name of npz = name of wav, perfect clone. if not, it's more like a creative sample based on the audio

hazy rain
#

I would have to check, but when I traced the path to the methods I could not really understand all the slicing etc., and it's definitely more complicated than the one I used or the one from the notebook

#

But I was only referring to voice cloning

#

For which I think there is only one way, no?

tardy topaz
#

So in my UI it says like, "Use audio file instead of text prompt" (doesn't actually work right now)

#

Mylo has that feature, it's the lower audio box

#

IIRC

#

the top one is pure clone. Honestly I would need to double check

hazy rain
#

I have my venv around let me check

#

this "speaker from" which generates a voice

tardy topaz
#

You know I can't quite remember on a late friday, just that directory, I think data\bark_custom_speakers is the pure clones

hazy rain
#

That's what mylo told me to do

tardy topaz
#

I'll check in a bit, just not at main computer kind of awkward

hazy rain
#

Take your time, just sharing as I think it can be useful. You can see the size offset I was mentioning, this is just pressing generate consecutively

tardy topaz
#

Yeah I think that's fine. I think that does a perfect clone. But then it uses the audio file again, as a prompt.

hazy rain
tardy topaz
#

So you should have two .npz. one in the dir, and one speaker.npz

#

It's roughly like cloning deterministically and then using the clone to generate.

#

That's basically the main process, in general

hazy rain
#

Why two? That's what I don't understand, it doesn't autosave but autoloads:

tardy topaz
#

the one in the directory, that one is probably the same every time. But the other one is using a generation and saving again, so it's different every time. And sometimes way better.

#

because it's a real bark voice

#

well real bark audio sample

hazy rain
#

I don't really understand what you mean. There is no default npz

tardy topaz
#

I will have to actually check. So you don't get a speaker.npz file?

#

just the one with the name?

hazy rain
#

You only get a speaker.npz file but it isn't autosaved, just hyperlinked in gradio for you to save in data/bark_custom_speakers

tardy topaz
#

I think you are doing it the right way. You are basically cloning, then generating. So it's different every time with good variety

hazy rain
#

Yep mylo's ui does it but I think it's a new thing since all the methods are called "new" 😅

tardy topaz
#

So it is deterministic, but it doesn't matter. Because it's way different from a generation.

#

It kind of just skips the deterministic part.

#

If you find some little workflow that works, let me know I haven't really tried much, just doing other cleanup

hazy rain
tardy topaz
#

I think basically it's just like using that .npz, with your text, and then saving.

#

Which is the so far tried and true method.

hazy rain
#

Hmm I think you are right now I understand what you meant

tardy topaz
#

So you can use the CLI and just do the text yourself, and resave, should be same.

hazy rain
#

4kb

tardy topaz
#

Right. That is just 'hello'

#

so it's nothing

#

You are cloning normally, same as CLI, same every time. Then you make clone speak. Then you save.

#

That is the best way.
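
The clone, speak, save loop described above could be batched roughly like this. It's a sketch with the generator and saver passed in as plain functions, so it runs without Bark; `batch_resave` and the `take_NN.npz` naming are made up for illustration.

```python
def batch_resave(generate_fn, save_fn, text, voice, n):
    """Speak the same text n times with one clone, saving each take as a voice.

    generate_fn(text, voice) -> (full_generation, audio)
    save_fn(path, full_generation) writes one .npz
    """
    paths = []
    for i in range(n):
        full_generation, _audio = generate_fn(text, voice)
        path = f"take_{i:02d}.npz"
        save_fn(path, full_generation)
        paths.append(path)
    return paths


# Stand-ins so the sketch runs without Bark installed.
saved = {}


def fake_generate(text, voice):
    return {"text": text, "voice": voice}, [0.0]


paths = batch_resave(fake_generate, saved.__setitem__, "hello", "rogan.npz", 3)
```

With Bark itself, `generate_fn` would presumably wrap `generate_audio(text, history_prompt=voice, output_full=True)` and `save_fn` would be `save_as_prompt`; treat those names as an assumption to check against the Bark version you have. Then you listen to the takes and keep the `.npz` that sounds best.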

hazy rain
#

Yep I did not get why you correlated the two... they are in this UI

tardy topaz
#

So you can batch it now, just generate samples and save final .npz

hazy rain
tardy topaz
#

with the clone

hazy rain
#

but only using the same full voice clone from the cli I don't edit npz

#

like the 5mn of joe rogan was first try

tardy topaz
#

Yeah that's a nice feature, I will try to make it

#

I think it IS in Bark Infinity

#

but just very crude

#

It is just chopping 5 seconds

#

always

#

first 5 or something

#

It's not a user feature

#

But if you generate in the cloning, it saves twice. And one is more in the earlier part of the clip

#

However if you just pick a number, that is very often a bad spot

#

You really need the user to listen

hazy rain
#

This is no?

tardy topaz
#

That's only for regular samples

#

Well actually

hazy rain
#

I was trying there

#

But got the cursor thing

tardy topaz
#

Don't hit that button, I think

#

Just put a text prompt in

#

Like your rogan quote

#

Try setting repeat, honest to god I can't remember if that works

#

in the cloning

#

uncheck the 'just give me box' that is not really good. I mean it's weird. so try if you want. But it's slow

#

a failed experiment really

#

I should make this way less about chopping up. It's not great for short audio, like this case

hazy rain
#

I can't capture it but what I meant earlier is that ticking or not ticking "give me more clone" doesn't allow me to edit the slider, I get a red forbidden sign as a cursor

tardy topaz
#

Oh hmnn

#

I can check in 20 minutes,

#

Probably bugged

#

But if you can, maybe restart?

#

and don't check it or leave unchecked

hazy rain
#

No emergency! It's 4am here, I'll soon sleep

hazy rain
tardy topaz
#

Just the 'Create an audio sample for each created clone at the end, using the Main Text Prompt'

hazy rain
#

and then just did the cli quickly

tardy topaz
#

If you have a second rogan clip you can try the second audio thing

#

I'm only 80% sure it doesn't help. haha

hazy rain
#

Ahah yes I meant to ask about that! I'll try, but I think I understand better what input yields better results

tardy topaz
#

I added many things that didn't really help, but I wanted to try. And left in for now. Will replace with new things I try.

#

I mean they might help I honestly didn't have testing time. But if so, hit or miss.

#

The second audio sample uses that audio, as the prompt. Instead of your text that you type in.

#

It seemed like that might be better but so far, not

#

So it makes Rogan, or whoever, say the things in that audio

#

I'll be back in 15, at desktop, if you're still awake I can check whatever.

hazy rain
#

Oh nice feature so speech to speech?

tardy topaz
#

Yeah

hazy rain
#

I started to try voice cloning from a French sample but the voice always deviates to English for some reason

tardy topaz
#

Needs French training. Oh I sent someone a clip, so I can download it from discord and post. It's just funny to run long audio through as the second sample, the voice just roams all over.

#

So the og voice is gone quick if you just keep playing, because this is way too long, and it is replaying the audio samples with RNG voices that slowly morph.

#

To do it right you can't use a long clip but it's amusing

#

(for speech to speech, you could only really get the first 5 to 12 seconds at best. in this case just to try it, it keeps going, and the voices are completely lost from whatever the speaker was quickly)

dusky siren
#

Hey all, we're considering using suno for a content creation platform we're building.

Are there any restrictions to using it for a commercial platform that features TTS + voice cloning, and charges users?

( i can share more details in private, if needed )

cc: @lyric steeple

lofty flint
lofty flint
lofty flint
lofty flint
chrome goblet
#

i found there is a link which can clone voice?

viral lynx
chrome goblet
#

thank you

#

i will

#

i will try to make one

#

can we use it to sing a song? hahaha

viral lynx
viral lynx
quaint night
quaint night
tardy topaz
quaint night
#

first shot

tardy topaz
#

Was it 'clone and do one sample' at least?

quaint night
#

clone + prompt one thing, then try something else, this was that else

tardy topaz
#

Nice, yeah it can be enough.

quaint night
tardy topaz
#

She performs for children. She's creative and improvises. Works for me.

#

Well, maybe not just randomly

#

But sometimes it sounds like it

quaint night
#

no, it's a short prompt and a long voice

tardy topaz
#

Oh yeah

#

Sounds like the intro to a song

#

I like my prompts short and my voices long

#

Though actually it's the opposite for me.

quaint night
#

the best way to generate a voice with bark is to first have the voice say what you want to generate. Then use it as a prompt.

#

Except, if you already have the result, what are you asking bark to do? 🤷‍♂️

tardy topaz
#

It is a funny quirk

#

Though I thought my code was bugged, until I tried it on the Suno default speakers lol

#

Ask them to say their own prompts

quaint night
#

lol, don't torture them

tardy topaz
#

I was like, why is my second segment always hallucinating? But I was running the same script, of repeats

lofty flint
tardy topaz
#

So maybe the fix is just a repetition penalty? But

#

But I wonder if it works as well with sounds...

lofty flint
tardy topaz
# lofty flint it seems to have to be implemented in training stage, isn't it?

I'm pretty sure it's just code in the sampler, like you would implement top-k or whatever. But presumably if it was super easy it would be done already. Actually I just googled and it seems almost too easy to implement, I wonder if I am actually extra confused. This I could do. But why wouldn't it be in the nano GPT?

#

If it was easy, and effective. Presumably it would be generically in the fork. Whatever I am busy today and also not the person for that.

#

Someone who writes that code all day should get in there and try though. Maybe it is hard to do it right, as opposed to, at all. I just wait for huggingface to do all that kind of stuff.
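
For reference, a repetition penalty really is only a few lines in the sampler. This sketch uses the common CTRL-style rule (divide positive logits by the penalty, multiply negative ones) on plain numpy logits; it is not Bark's sampler, and whether it helps with audio tokens is exactly the open question here. Function names are illustrative.

```python
import numpy as np


def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make already-generated tokens less likely (CTRL-style penalty)."""
    out = np.array(logits, dtype=np.float64)  # copy, input stays untouched
    seen = np.unique(np.asarray(generated_ids, dtype=np.int64))
    # Dividing a positive logit or multiplying a negative one both lower it.
    out[seen] = np.where(out[seen] > 0, out[seen] / penalty, out[seen] * penalty)
    return out


def sample(logits, rng, temperature=0.7):
    """Temperature sampling from a 1-D logit vector."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, -1.0, 0.5])
history = [0, 2, 0]  # tokens 0 and 2 were already generated
penalized = apply_repetition_penalty(logits, history)
next_token = sample(penalized, rng)
```

The penalty is applied before temperature or top-k, so it slots into an existing sampling loop without touching training.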

lofty flint
past sinew
tardy topaz
#

With Bark it could happen even normally, but anything with [words] all over it has a high chance of failure. It's not really a 'feature', more something people discovered that sometimes works

#

I'd be interested to know if you get really good animals sounds actually.

chrome goblet
#

can i choose the music for song ?

#

and how

zenith notch
#

👍

light otter
quaint night
#

big semantic + big coarse model does make a difference, as much as I don't like waiting for it

tardy topaz
quaint night
#

I thought that the differences were small but I just tested on the wrong samples. They can generate qualitatively different things in some cases.

grizzled shard
lofty flint
tardy topaz
#

first non voice clone just making sure gradio upload works, first sample, copy and pasted help gradio text as content. each segment in clip same exact speaker, same text, yet musical in wildly different ways. literally the first generation of the first clone i just tested. just throw any audio into the cloner people. (It's not an audio clip as the prompt, it's just a music clip as a voice clone, I just happened to copy and paste the closest nearby text on screen.) The music clip was the Deus Ex theme, which you can hear a sliver of.

grizzled shard
round gorge
#

hi

obtuse slate
blissful pulsar
#

Hopefully you'll like it

lofty flint
fallen stump
#

11

tardy topaz
lofty flint
knotty mountain
#

I've been using Bark for doing a radio show. It's pretty fun.

#

English Voice 3 sounds very much like Chris Morris

obtuse slate
# knotty mountain If anyone is interested

Very interesting and creative. Also weird. Overall I enjoyed listening to it. Did you create all sentences in a single generation, or trying your luck multiple times ? The global editing is made by hand, I take it

knotty mountain
#

Yeah just a single generation using the long form scripts. I did the second to try and fix up the weirdness that happens sometimes, which sort of makes it confusing, but works in the context. A bit of editing at first, but then just gave in to the pace of the generations.

cunning tendon
#

#🐣┃suno-showcase If you care about LAN transfer speed, or need to build a larger network, and you are not price-sensitive, the Asus TX-AX6000 better fits your needs. If you are price-sensitive, or want somewhat wider coverage, and also need support for multiple WiFi 6 technologies, then the TP-Link xdr6088 may suit you better

tardy topaz
#

couldn't resist trying one batch of 30, overloading Bark with hints like the woman literally starts speaking with "In a world..." some with no announcer. Though Bark makes those generally sound like weekly news teaser videos. The text really super shapes the voice you get. This is just a random assortment of them.

lofty flint
deft granite
#

The Chinese generated voice is very strange [MAN]:在苏联是否可以存在两党制? [WOMAN][laughs]不,不可能,因为我们养不起. (In English: "Could a two-party system exist in the Soviet Union?" "No, impossible, because we couldn't afford it.")

knotty osprey
#

Hello

tidal fulcrum
#

Testing to see how well Chinese works.

blissful pulsar
#

I just tried the example prompt "♪ In the jungle, the mighty jungle, the lion barks tonight ♪" and...

what the heck-

lofty flint
tardy topaz
lofty flint
rough dust
#

Are there any Chinese people here? Come out for a moment

tardy topaz
#

Look up! ⬆️

lofty flint
#

i think we should just focus on English at this moment. i think suno will make a next version where other languages will be better. it's just like chatgpt, which is particularly good at english

blissful pulsar
#

Hi everyone. Can you point me to the best singing examples Bark has generated? And can Bark run on a Windows install without using the GPU (like, for example, SVC, RVC, Vlad diffusion)?

viral lynx
#

Jonathan's webui is for bark, with lots of bark related features
My webui is for bark and any other audio related webuis, with less model specific features

merging them isn't really a good idea, as the webuis have different purposes

tardy topaz
#

I'm frustratingly incapable of actually making the UI I want to make, technically. It would look completely different, like a node based sound laboratory where you draw lines and connect segments to create unique processes. Like a visual UI version of a Jupyter notebook. My only hope is I stumble across something almost like that I can fork lol

#

Some day I'm going to drink way too much coffee and try to rig up some crazy way to use https://wavesurfer-js.org/examples/#multitrack.js in gradio. But honestly I can't believe somebody hasn't done it already. It's a web page. Half the Stable Diffusion UI features are just hooking javascript into Gradio, already. Somebody who really specializes in that could probably do it quick.

turbid anvil
#

Hey, guys! Want to ask, is it a right place to share my pet project utilising Bark?

inner pasture
turbid anvil
#

Nice! Here it is https://castpod.live/
It's quite straightforward. You provide a text prompt about a particular topic, and Castpod generates all the elements you'd expect in an audio podcast: cover art, a theme song, a title, description, tags, as well as the podcast's characters, including their names, roles, avatars, and of course, the podcast script and the corresponding audio conversation.
The part which generates the audio output is powered by Bark. Now, there are some limitations in quality and size of the podcast, but sometimes the results truly mimic authentic conversations with insightful viewpoints that I hadn't even considered before.
It was so fun to develop, thank you a lot for your work and tool you have shared!

#

For now, podcast generation is quite expensive. The average cost of generating one podcast is approximately 60 cents, a considerable amount! As a result, I implemented a paywall for podcast generation. I'm sharing a license key that comes with some generation credits. Simply enter the key on the podcast creation page, and you'll be all set to give it a try. 04B68F62-79B7-4337-B054-F9741D1A65CC

tardy topaz
turbid anvil
#

Bark handles only audio conversation output

obtuse slate
# turbid anvil For now, podcast generation is quite expensive. The average cost of generating o...

Very nice. I built a similar engine a while ago but used 11labs for the voices. It's totally unsustainable money-wise, and even tho it's way cheaper with self generation, your comment shows we're not quite there yet. Maybe once SoundStorm is ported to open source - it's supposed to accelerate inference by an order of magnitude with their non-autoregressive solution. I think the podcast shows that bark is not quite there yet also in terms of voice control and quality. It's a good experimental tool (and sometimes an artistic one) and probably a good project for research, but not really something that can be part of a product.

turbid anvil
# obtuse slate Very nice. I built a similar engine a while ago but used 11labs for the voices. ...

Yes, agree with that. For now, costs and quality are solid limitations for such a project. But the awesome thing here is that it is already able to solve the puzzle. Even with super young solutions such as Bark, and GPT-3, which is pretty dumb compared to GPT-4, the proof of concept works and sometimes works ridiculously well. There is no doubt that as Bark undergoes further iterations and improvements, along with GPT upgrades, it will reach a level of quality and cost that makes it possible to create products like that.

obtuse slate
turbid anvil
quaint night
#

!! NOISE WARNING !!
I hadn't seen this before, it goes from blast noise to speaking in a studio midway, maybe there's something interesting in there

{
_version: "0.0.1",
_hash_version: "0.0.2",
_type: "bark",
is_big_semantic_model: false,
is_big_coarse_model: false,
is_big_fine_model: true,
prompt: "♪ 広い宇宙の数ある一つ 青い地球の広い世界で 小さな恋の思いは届く 小さな島のあなたのもとへ",
language: null,
speaker_id: null,
hash: "c342707bb377c37533b46660842959ed",
history_prompt: "None",
history_prompt_npz: null,
history_hash: "6adf97f83acf6453d4a6a4b1070f3754",
text_temp: 0.7,
waveform_temp: 0.7,
date: "2023-06-08_20-59-08",
seed: "1542369585"
}
!! NOISE WARNING !!

last geode
blissful pulsar
#

Music might be dead.

tardy topaz
#

I think they dropped the ball. They cleaned their data too well.

#

I mean it's awesome don't get me wrong

#

but it's killing me that the LLM was lobotomized by only seeing tags and that soulless description text on those stock photo sites

#

What do I know. Maybe the training would have fallen apart if there was more than such regular data. But from what I can tell they didn't try it.

#

However, holy cow, there are a lot of functions that look super interesting

fickle otter
tardy topaz
#

I am a little jealous of the cool wav files

jolly fog
quaint night
lofty flint
quaint night
fiery cove
#

Hi my phone is running very fast

obtuse slate
quaint night
#

encodec and vocos are both really fast

#

from slowest to fastest it goes coarse > semantic > fine > vocos > encodec

#

but really 90% is coarse, 9% semantic, with rest being small
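
To check a breakdown like this on your own machine, a tiny stage timer is enough. A sketch using only the standard library; the `time.sleep` calls stand in for the actual model stages, and the stage names are just labels.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Accumulate wall-clock seconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


def shares(stage_seconds):
    """Convert per-stage seconds into percentages of the total."""
    total = sum(stage_seconds.values())
    return {k: 100.0 * v / total for k, v in stage_seconds.items()}


# Stand-in workloads; in practice each block would wrap one model call.
for stage, seconds in [("semantic", 0.002), ("coarse", 0.02), ("fine", 0.001)]:
    with timed(stage):
        time.sleep(seconds)

for stage, pct in sorted(shares(timings).items(), key=lambda kv: -kv[1]):
    print(f"{stage:>8}: {pct:5.1f}%")
```

On a GPU the calls are asynchronous, so real measurements would need a synchronize before reading the clock.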

pliant spruce
pliant spruce
pliant spruce
limber onyx
knotty mountain
#

Another use in a radio show - was testing different settings to see the results.

dusky siren
crude ingot
#

Hi team, can i ask how can i clone the voice?

jolly fog
# dusky siren Great work. That sounds amazing. What prompt did you use with musicgen, if you...

thanks! 😊 it is fun to make music but days later I can't enjoy it that much.

I can't remember the prompt exactly , MusicGen has no way to recover the used prompt.
was playing around with words like :
simplistic, slow authentic drums, exotic percussion, subtle silent textures, old-school hiphop, drum break, simple beat, dj premier, dj revolution , non-harmonics, dry raw empty

Got very random results, so I made a lot of attempts. today it rendered this one. I used the large model between 10 to 20 seconds. then I used Audacity to edit the loop.

jolly fog
jolly fog
#

making it do singing, used Bark Infinity Cloning, probably not as good as it can be, but my pc is slow, only 11gb vram ✨ 😊 used 1 sample of 30 seconds for the cloning

jolly fog
#

I'm Making a new version of Marilyn Monroe using Bark Infinity Cloning, now just her spoken voice, no singing. I also use the optional second sample part in the webui and I got really amazing results.
these are using the prompt to the letter. I think it sounds just like her 1953 movies, and her voice has this natural asmr vibe and it is just amazing how it can reproduce it. Now only if it could stay consistent ... [whispering] 😄

blissful pulsar
#

I have a silky smooth voice, and today I will tell you about the exercise regimen of the common sloth.

dusky siren
dusky siren
#

@jolly fog would you mind if I DM you with some quick questions?

I've built an app that I'm not ready to make public yet, so want to keep it private.

jolly fog
#

using local version of sadtalker model here , much better result

jolly fog
limber onyx
blissful pulsar
limber onyx
jolly fog
# limber onyx what did you use here?

Hi! First one was with online demo of sadtalker, second one is with local sadtalker as extension of stable diffusion. That new one is using an updated model they just released, much more animated. But my gpu is 11gb and very slow at doing this.

limber onyx
jolly fog
jolly fog
jolly fog
# limber onyx did you enable eye blinking in the extension? (the UI does not have an option fo...

I see there is a dev version you can activate blinking, so even if mine is blinking already, maybe I didnt understand what you meant?
https://github.com/OpenTalker/SadTalker/discussions/386

GitHub

Hi, everyone! Thanks for your patience with the bugs and long-time updates in SadTalker. We are releasing some new features in SadTalker for the WEBUI. You can try to install the dev version for th...

limber onyx
jolly fog
limber onyx
blissful pulsar
#

yo, there's like 5 webuis at this point can yall work together ? lmao

tardy topaz
#

I just want to spew feature ideas and have them appear, I am not a good UI programmer at implementing. I got like 5% of the way through my Bark list

limber onyx
fathom hatch
#

Wheres the new sadtalker model?

#

I have bark and sadtalker set up on my discord bot so you can do it all thru discord

tardy topaz
jolly fog
tardy topaz
quaint night
limber onyx
#

Without face restoration it takes 2x length of audio (batchcount 4) on a 4090 with i9 13900ks and 128 ddr5
(Face restoration takes 2-3 times as long)

pale falcon
carmine ravine
#

What / where is sad talker

#

Also aye🧿 randomly got one output that had music behind it … it’s amazing but can’t reproduce it … also: really really want to get these dudes to sing

#

Installed the latest one-click installer and it seems to 🐝 working consistently ... wondering if musicgen is included in there and how it is even invoked ... in the prompt with a tag maybe??

quaint night
#

Yes, I believe mylos audio webui and mine (tts-generation-webui) include bark alongside musicgen

#

Someone said that they have used them together but I don't know how.

limber onyx
# carmine ravine What / where is sad talker

this is sadtalker (see link below, it generates talking heads for audio). It has a web-ui and an auto1111 extension (both work very similarly but the auto1111 extension makes it easier to integrate generated images)
I tried using sadtalker from the command line but there are some bugs
I tried using sadtalker through the API provided by gradio but there are several bugs
It works with any image + audio (they do not have to be generated)
I hope this helps.

https://github.com/OpenTalker/SadTalker

#

This one includes several tts ttm models: https://github.com/rsxdalv/tts-generation-webui

It is nice to compare (hopefully soon combine) them. I haven't had any issues so far.
It allows voice cloning, suno.npz improvement (over vocos), bark, tortoise and MusicGen.

I wanna say: Amazeballz

GitHub

TTS Generation Web UI (Bark, MusicGen, Tortoise). Contribute to rsxdalv/tts-generation-webui development by creating an account on GitHub.

frail python
#

Was playing around and stitched a few generations together to create a guided meditation (script by gpt-3.5-turbo). I tried using 2 different English voices, and tried one in Portuguese (I don't speak Portuguese. I was just seeing if it would work and it seems to do alright!)

It's interesting to me that the speakers sometimes deviate from the original script, or make up words/sentences (sometimes they are semantically similar, which is even more interesting).

Another thing I noticed was the length of time of each meditation differs by a large margin, even when given (at least for the English speakers) the exact same script. Each speaker has their own style and such for reading and incorporating pauses.

teal gulch
tardy topaz
patent gorge
half yew
queen anchor
tardy topaz
# half yew Man... how realistic... Could you share the npz file for this ?

I will soon, but the Obama voice specifically is one I'm holding back just a bit because I cloned it not using the cloner. Instead I cloned it 'by ear' using a truly absurd and ridiculous manual cloning process that I was very surprised and fascinated actually worked. I really want to express just how surprised I was when Obama actually emerged out of it.

So one day when I get time, I am going to do a fun YouTube video showing it off. It's horribly inefficient (it took 10+ hours of tweaking to make a really good voice clone by hand, like this, though I could do a bad one in about 2 hours) but the concept is just really interesting. Basically it's just iterative Bark semantic prompts only -- literally no audio reference. And then some manual merging/blending/tweaking to keep it on the rails. And lots of me personally listening with my ears and adjusting. I used my ears to set up an iterative process, then slowly ran Bark over and over again, with lots of me just deciding that 'this voice has the right semantics, but it's missing this aspect from this voice, so if I made 5 new versions with a bit of each...' stuff like that.

Some of the manual tweaking is actually still useful for normal voice clones and I am porting it to my fork, eventually. Voice blending or model merging, ways to render a voice in increasing or decreasing intensity, or to soften it, to dial it in. Stuff like that. But it's hard to code it in a UI for Gradio, so not too much of it. Also with the voice cloner, probably just making a better normal voice clone is 99 times out of 100 better use of anyone's time.

#

Nobody sane should voice clone like that, but the fact that it is possible in Bark is remarkable.

#

All that said, I think there is likely some use of the techniques for 'dialing in' normal voice clones too. However all of my methods rely on my personal hearing and judgment of voices (typically I rank a bunch in a folder, as a first step) and that doesn't scale, so would have to be automated.

#

But also, maybe just fine-tune all of Bark on one voice, and never do any of that stuff...

#

The TLDR is that Bark lets you do whatever you want, almost. Just open up an .npz file and cut and paste, remove every third token, mash two voices together, insert weird patterns, ban all the tokens in one voice from another. Literally do anything, and usually it still sounds pretty natural! You do get a ton of 'broken' voices with weird stutters that get stuck halfway through reading your text, but Bark makes even the broken voices sound like a real person with a speech impediment!

#

I mean listen to this. Poor Trevor Noah is given a horrible stutter. But in Bark, he KEEPS TRYING TO SPEAK THE TEXT. He doesn't give up! How cool is that right? It's more like a personality than a voice. The prompt is just the regular sentence, not repeats.

tardy topaz
dark token
#

I had it spit out this randomly ignoring the text and it has me briefly question what I was doing lol

#

(bunch of silence at the start but it's worth it for the ending)

hollow trail
#

Hi

turbid jasper
#

MusicGen + bark + a bit of audacity and you can have a 24/7 news radio 😄

random slate
carmine ravine
#

i wanna know how i made that music happen

#

like in the background

random slate
tardy topaz
# carmine ravine

Try

[music]
My words here
[music]

[music][music][music]
My words Here
[music][music][music]

It's super hit or miss, but if a voice DOES work, it often works again

#

So save your .npz outputs. And then try reusing the ones that work.
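
A minimal sketch of that workflow in plain Bark, outside any web UI. `wrap_with_music_tags` and `generate_and_keep_voice` are hypothetical helper names. The Bark calls (`generate_audio` with `output_full=True`, and `save_as_prompt`) are from upstream Bark as I understand it, and forks may differ. Whether the `[music]` tags actually produce music stays hit or miss, as described above.

```python
def wrap_with_music_tags(text, repeats=1):
    """Surround the spoken text with [music] tags on their own lines."""
    tag_line = "[music]" * repeats
    return f"{tag_line}\n{text}\n{tag_line}"


def generate_and_keep_voice(prompt, npz_path):
    """Generate once and save the voice as an .npz so a lucky result can be reused.

    Needs Bark installed; it is imported lazily so the helper above works without it.
    """
    from bark import generate_audio
    from bark.api import save_as_prompt

    # output_full=True returns the token arrays together with the audio.
    full_generation, audio_array = generate_audio(prompt, output_full=True)
    save_as_prompt(npz_path, full_generation)  # npz_path must end in ".npz"
    return audio_array


single = wrap_with_music_tags("My words here")
triple = wrap_with_music_tags("My words here", repeats=3)
print(triple)
```

A saved file can then be passed back as `history_prompt` on later generations.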

tardy topaz
#

(Maybe worth a repost) Google trained SoundStorm on 100,000 hours of dialog in part so they could have two person conversation prompts. The same text prompts - character for character identical - very often just work in Bark right now, out of the box.

(Well... Bark loves to generate a dialog of a person talking to themselves but I think you can fairly consider even those samples a dialog, it's performed like a conversation.)

Something really funny happened to me this morning. | Oh wow, what? | Well, uh I woke up as usual. | Uhhuh | Went downstairs to have uh breakfast. | Yeah | Started eating. Then uh 10 minutes later I realized it was the middle of the night. | Oh no way, that's so funny!
fast ferry
viral lynx
#

mixing the full and small models helps reach fast generation speeds without the quality dropping much

carmine ravine
#

Like it seems to me like it’s random every time ??

#

But somehow feel like that cannot 🐝

tardy topaz
# carmine ravine I’ve really got to spend some time wrapping my head around npzs, and voice gener...

It's a checkbox called 'save every NPZ' in my fork. Bottom right area. If you check that box, next to your output .wav or .mp4 files, you will always have new .npz files. These files are the voice of the audio file you just made. (If you used a random voice, it should save the .npz next to the .wav by default.)

Then later you want to make new audio that sounds like that cool .wav sample? So you go to the menu, something like "pick an npz file from your filesystem as the speaker" and you can pick one of those .npz files, that is the same name as the .wav file. If you decide you like the voice a lot you can put the .npz file into a directory so it shows up in the list of choices, alongside en_speaker_03.npz etc.

For music and more unusual things there's a decent chance the .npz file doesn't reproduce the same effect. But for pure voices, it should.

limber bison
#
runic whale
#

(╯°□°)╯︵ ┻━┻

teal grove
#

Join us in this captivating journey as we unravel the mysteries behind Banksy, the renowned street artist. From his humble beginnings in the underground scene of Bristol to his groundbreaking artistic style, we delve into the impact of his art on the public and the ongoing enigma surrounding his true identity. Discover how Banksy has made a mark...

#

💀

#

Sounds like someone slowly reading from a script for the first time

#

Not even that

#

He sounds like he’s struggling to read

#

And has no energy!

#

At least the voiceover is consistent in this video

#

Join us as we embark on a captivating journey into the world of HAM radio, exploring its significance, origin, emergency communication role, licensing process, diverse frequencies and modes, global reach, sense of community, inclusivity, and technological advancements. Discover the fascinating world of HAM radio and its timeless appeal.

#

BWAHAHAHAHAJAJAJSJDJDJDO

#

“Ladies and gentlemen. Gather, round, for another, riveting episode, of our illustrious channel”

#

FWUHUHUHUHHU

#

He sounds so slow

#

And bored

#

We can make more energetic & stable voices for this though?

#

I want every day in the show I make to sound like Looney Tunes

teal grove
#

Code / commands?

teal grove
#

oh damn

#

i found out

#

that files of the same prompt that take less time to produce have less background noise (i think)

#

a noisy file took 8 minutes

#

a good file took just over 1 min (1 min 8 seconds) + waiting time to write files

#

here they are

#

with the npz

#

i assume the npz is the speaker file that i can select

#

gonna try doing that next

#

i can write code to automatically quit the process of voice generation if it takes too long so that i don't waste my day (in theory) perhaps
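
One way to sketch that idea is to run each generation as a separate process and give it a time limit. This is only an illustration: the two commands below are stand-ins, and a real setup would launch a generation script instead (for example a hypothetical `generate_voice.py`). The cutoff value is also an assumption to tune against your own timings.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a command; return True if it finished in time, False if it was killed."""
    try:
        subprocess.run(command, timeout=timeout_seconds, check=False)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False


# Stand-ins for a generation script: one quick job and one that hangs.
quick_job = [sys.executable, "-c", "print('done')"]
slow_job = [sys.executable, "-c", "import time; time.sleep(30)"]

finished_quick = run_with_timeout(quick_job, timeout_seconds=20)
finished_slow = run_with_timeout(slow_job, timeout_seconds=1)
print(finished_quick, finished_slow)
```

Running generation in its own process means each attempt reloads the models, so this trades startup time for the ability to abandon a slow run cleanly.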

#

i need to test this some more times to be sure

#

damn wrong place to post

#

*wrong channel

#

anyway the prompt was """ fine fools flower fight frame """

tardy topaz
# teal grove What’s your manual merging/blending/tweaking?

It's not automated like that. One reason I didn't automate it or build it into a command (yet) is because it's more or less me going "let's try using this many tokens of this, and then doing this.... nope that didn't work. let's try a little more..." instead of a fixed process.

teal grove
#

audio_array = generate_audio(text_prompt, history_prompt="v2/en_speaker_1")

#

this works

#

audio_array = generate_audio(text_prompt, history_prompt="v2/bark_generation5")

#

this doesn't

#

and i put the npz in the same folder

tardy topaz
#

the file paths can be a bit fiddly. try just using the full path, like the whole directory, including '.npz'

teal grove
#

ok

tardy topaz
#

i forget exactly what works, but it does work,

#

also check what your current python process thinks is the current directory. maybe it's not what you thought
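
A small sketch of both suggestions: build the full path including `.npz`, and report Python's current directory when the file is missing. `resolve_voice_file` and `speak_with_saved_voice` are hypothetical helpers. As far as I know, upstream Bark loads a `history_prompt` string from disk only when it ends in `.npz` and otherwise looks it up among its built-in presets, which would explain why `"v2/bark_generation5"` failed above. Treat that as an assumption to check against your version.

```python
import os
import tempfile
from pathlib import Path


def resolve_voice_file(name):
    """Turn a voice file name into an absolute .npz path, or explain what is wrong."""
    path = Path(name)
    if path.suffix != ".npz":
        path = path.with_name(path.name + ".npz")
    path = path.resolve()  # relative names are resolved against os.getcwd()
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; Python's current directory is {os.getcwd()}"
        )
    return str(path)


def speak_with_saved_voice(text, name):
    """Needs Bark installed; imported lazily so the path check runs without it."""
    from bark import generate_audio

    return generate_audio(text, history_prompt=resolve_voice_file(name))


# Demo of the path check only, using an empty stand-in file.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "bark_generation5.npz").write_bytes(b"")
    resolved = resolve_voice_file(str(Path(folder) / "bark_generation5"))
    print(resolved)
```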

teal grove
#

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

#

ded

#

i just copied and pasted the path

tardy topaz
#

hmmm what are you on, windows, google colab?

teal grove
#

windows

tardy topaz
#

there's not like a weird character in the name? or emoji?

teal grove
#

no but for some reason the copied path uses slashes in the other direction

tardy topaz
#

so worst case, if you just want to get it working

#

rename an existing suno voice

teal grove
#

\ instead of /

tardy topaz
#

with your voice

teal grove
#

alright i replaced all the \ with /

#

it didn't give the same error

#

now i wait for it to generate

#

yay

#

ngl that was odd lmao

#

i'm guessing that's what the truncated \UXXXXXXXX escape means

#

maybe it's for security idk

tardy topaz
#

filenames and paths and differences between windows, linux, whatever - it's a cause of TONS of headaches
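
For what it's worth, that particular error is about Python string escapes rather than security: in a normal string literal, `\U` starts an eight-digit Unicode escape, so a pasted `C:\Users\...` path fails to parse. A short sketch with a hypothetical path shows the failure and three ways around it.

```python
from pathlib import PureWindowsPath

# The text of a line as someone might paste it on Windows (hypothetical path).
pasted_line = 'voice = "C:\\Users\\me\\bark_generation5.npz"'

try:
    compile(pasted_line, "<pasted>", "exec")
    error_message = None
except SyntaxError as err:
    # "\U" must be followed by 8 hex digits, and "sers\me\" is not hex.
    error_message = str(err)

# Three ways to write the same path safely:
raw_path = r"C:\Users\me\bark_generation5.npz"     # raw string, backslashes kept
slash_path = "C:/Users/me/bark_generation5.npz"    # forward slashes work on Windows
converted = PureWindowsPath(raw_path).as_posix()   # convert programmatically

print(error_message)
print(converted == slash_path)
```

Replacing every `\` with `/`, as done above, is the second option.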

teal grove
#

IT'S GOOD.

#

8 minutes and a few seconds for this 13 second audio file.

#

if this voice was randomly generated, does that mean it was never heard before? or is it selected from existing speakers?

#

(it's open source and therefore commercial use anyway right)

#

I can probably keep using this voice again and again, or try to change the emotions and then generate it repeatedly until i get it to speak the way i want it to, correct?

tardy topaz
tardy topaz
#

now common, not that common. i got it 3 or 4 times, in 10s of thousands

#

but still

teal grove
#

Ok

tardy topaz
#

it really is THAT VOICE

teal grove
#

I guess 1) if the person who sounds like it hears it, they can tell me to replace it 2) I can put post-processing on it using a DAW to make it sound higher or lower or change the formant to attempt to obscure it more

#

And lastly, it is fairly common for two people to sound SIMILAR (not exactly the same)

#

I think I’ll keep this one because I want it and it sounds excellent and no one said no (screw copyrighting fears)

limber bison
limber bison
teal grove
#

I did this video today with those samples and hand animated vector art

#

PS who can guide me to an AI that can take the audio I’ve generated and make it sound like a clean recording?

#

I wish for this AI to be available offline, usable on the laptop (Windows 11 PC) I used to make these audio samples, and open source / commercial use.

#

I’ll research this myself as well, will post here if I find something good

fast girder
fast girder
limber bison
limber bison
limber bison
carmine ravine
teal grove
fast girder
teal grove
#

I’m looking for quick fixes although if I spend time I probably can

#

Like ideally there should be existing functions

#

I’ll check it out.

#

It looks like a lot of the audio reading functions are for getting the dataset file locations (audio.py & denoiser.py)

#

More time

#

This won’t upgrade speech audio quality

#

It just denoises background audio

teal grove
teal grove
# teal grove

For the files used in this video (it’s okay but it could sound cleaner)

#

Needs an AI

fast girder
#

^ this problem (called "Speech Super-resolution") AFAIK is still pretty difficult, not many good models out there

teal grove
foggy forge
limber bison
#

Nice. I should experiment more with the singing..

tardy topaz
#

I think Bark can maybe do music if it's singing only, or maybe singing + one single instrument that is used a bit sparingly. More than that and it seems to not quite hold it together.

wraith mesa
#

I am using the BARK TTS WEB UI and it seems it is speaking whatever it is feeling like instead of my script!

foggy forge
nimble musk
foggy forge
#

The source? Its a part of the prologue for the book A game of thrones in the ASOIAF series

gusty ingot
surreal wagon
#

best audio ever lol

teal grove
#

It’s just a denoiser or can it also filter out or correct the rough parts of audio?

#

Here’s the latest short vid I did

#

I like the music in this case, but it would be cool to use something that can separate the music and give a good result

teal grove
little ore
tardy topaz
#

Is that a random voice? Pretty unique vibe

#

I wonder if that voice will be more likely to hallucinate with normal text prompts... Lol

little ore
#

No, that's a fine-tuned voice. I guess I don't need elevenlabs anymore... Now if only I had time to prepare a proper USLT vietnamese dataset, but ALL OF THE TVB MOVIES FROM THE 2000S HAVE BEEN REPLACED WITH THE NORTHERN ACCENT DUB!!!! *glares at the vietnamese community (HOW DID YOU GUYS LET THAT HAPPEN?!! AND WHY?!!!!! )

tardy topaz
#

It might work with well known childhood rhymes... though getting many children out of bark random voices is like pulling teeth

little ore
#

And no it hallucinates in general, but only at the end of the prompt and even setting min eos has no effect, but whatever, it's after the prompt so cropping it off is easy enough.
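
Cropping the tail can be as simple as slicing the sample array. A minimal sketch, assuming Bark's 24 kHz output rate and a cut point chosen by ear; `crop_to_seconds` is a hypothetical helper and the 11.5 second mark is only an example.

```python
SAMPLE_RATE = 24_000  # Bark's output sample rate in Hz


def crop_to_seconds(samples, keep_seconds, sample_rate=SAMPLE_RATE):
    """Keep only the first keep_seconds of audio; works on lists and NumPy arrays."""
    keep = max(0, int(round(keep_seconds * sample_rate)))
    return samples[:keep]


# Stand-in for a 14 second generation where the prompt ends around 11.5 s.
fake_audio = [0.0] * (14 * SAMPLE_RATE)
trimmed = crop_to_seconds(fake_audio, 11.5)
print(len(trimmed) / SAMPLE_RATE)
```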

And oh yes, it's the best voice I have in any voice AI.

Anyways, childhood rhyme, eh? https://kingkiller.fandom.com/wiki/Lackless_poem

Kingkiller Chronicle Wiki

Young boy in troupe of non-Ruh performersI know a poem about Lackless! The Lackless poem is a poem with two versions. One is about Lady Lackless and the other is about the a door related to the...

#

You should hear him in RVC, though 😱

#

Also wth there's a boy version of that poem??? I never saw that in the book?!!!

tardy topaz
#

Bark is still the model that is choosing which syllables to stress, how to pronounce things. Unless you fine-tuned in just rhymes?

#

Actually I'll try that in base vanilla bark, curious

tardy topaz
little ore
#

No, the variance is alright. I just fine tuned it on speaking. But the source speaker is very expressive, so it ends up being a premium dataset. There's little gems like this in the dataset:

#

And yes he is legendary in RVC

#

His voice is the reason I halted training all other voices to desperately probe the secrets to his success

tardy topaz
#

Do you happen to remember if RVC stressed the childhood rhyme as well?

little ore
#

No, RVC just flawlessly voice converts source audio, be it singing, emotions (to some extent) and others:

#

That's inference from an anime character

tardy topaz
#

Right, yeah, so I guess it's about whatever you happen to use for the TTS part, before Bark. If that had the same childhood rhyme speech pattern, as well as those samples. Just curious how unique Bark is.

little ore
#

The dataset doesn't speak in rhymes, the voice is just very expressive

#

Bark kind of figured out the rhyme on its own and that actually took a while. The first few fine-tunes got the pacing wrong, and also if you put in too many verses, it gets the pacing wrong and ignores half the prompt. Too short and it sounds weird, so no typing it in verse by verse

#

(first few fine-tunes didn't even bother pronouncing the word "seven" or the s sound at the end of "things". )

tardy topaz
#

For pacing you will eventually be able to control in inference, though I'm not 100% sure the same things I tested work in a fine tuned model, I guess.

#

I haven't tried myself but the fine-tune is just a diff of the Bark weights, not the full 6 gigs for text_2.pt or whatnot, I think? So I guess you could even just use different versions of the voice

little ore
#

Well here's the same inference on normal speech, so I think bark can read the context to some degree

tardy topaz
#

The end of that is classic Bark weirdness. It's like the perfect weird horror movie sound model. So many times late at night I'm like WOAH.

little ore
#

And this version has trouble speaking german...

tardy topaz
#

Especially because it's usually just like that, after a big sound gap, so you get surprised lol

little ore
#

Oh yeah little gems like this one which for native japanese speakers I'm sure must be gold

#

Yeah, I saw a lot of potential in bark back then despite the horrible voice quality. I don't usually go messing around with heavily beta stuff. MusicGEN for example is absolute garbage. But bark.. now bark is actually good.

tardy topaz
#

You mostly get music, actually, because of the lack of punctuation. But the non-music is way more children than Bark typically gives, so the childhood rhyme is being somewhat understood

little ore
#

I actually wasn't expecting it to work that well. Elevenlabs kind of failed at the Alan Wake poems, the "For he did not know, that beyond the lake he called home, lies a deeper, darker ocean green, where waves are both wilder and more serene. To its ports I've been, to its ports... I've been!"

tardy topaz
little ore
#

HAHAHAHAHAH OH GOD THATS PERFECT

tardy topaz
#

Bark is just so good, I think that sample even sounds like the speaker is out of breath!

#

God damn

fast girder
#

that quick breath in the middle is awesome

tardy topaz
little ore
#

Woah and that sounds like rapping/rap battles. Hmmm...

tardy topaz
#

Yeah I think there's ton of potential that doesn't need fine-tuning or loras, anything, just nudging the sampler a bit and Bark is so good it usually makes things sound good.

rigid idol
#

Hello

little ore
#

Well I REALLY hope finetuning is sufficient to add another language, or a really hacky solution is to remap chinese characters to Hán Việt (https://en.wikipedia.org/wiki/Sino-Vietnamese_vocabulary)
But that would be REALLY inconvenient

Sino-Vietnamese vocabulary (Vietnamese: từ Hán Việt, Chữ Hán: 詞漢越, literally 'Chinese-Vietnamese words') is a layer of about 3,000 monosyllabic morphemes of the Vietnamese language borrowed from Literary Chinese with consistent pronunciations based on "Annamese" Middle Chinese. Compounds using these morphemes are used extensively in cultural and...

keen lotus
pure brook
#

created a script to take long monologues and export them. Just need to add multi threading now, and fix some of the parsing
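
The splitting step of such a script could look roughly like the sketch below (Python rather than the Java mentioned just after). It groups whole sentences into chunks under a character budget so each chunk fits a single generation. `split_into_chunks` and the `max_chars` budget are assumptions to tune by ear, not values from Bark, and Japanese or Chinese text would need `。` added to the sentence-ending pattern.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


monologue = (
    "Bark tends to produce clips of roughly thirteen or fourteen seconds. "
    "Longer scripts are usually split first. Each piece is generated on its own! "
    "The clips are then joined back together."
)
chunks = split_into_chunks(monologue, max_chars=80)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk would then be generated separately, ideally with the same `history_prompt` so the voice stays consistent, and the audio arrays joined in order.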

pure brook
#

I didn't realize there was already a tutorial on this. I made it in java.

fast ferry
fast girder
#

pretty nice narrator and slightly longer generation

light iris
#

ooh I like that

quaint night
fast girder
#

yeah, we really need to find some good presets there - I think it can be good with good presets but has huge variation

quaint night
#

Presets for narration or Japanese?

fast girder
#

Japanese

#

(well - both)

quaint night
#

Can I ask for a quick test, can the bigger models properly spell this: 中華物　＝chuukabutsu

#

I tried several variations and it always chose the wrong spelling/reading (chuukamono)

#

by the way, phonetically I've heard that it's able to produce good output, but the phonemes chosen are not always correct

fast girder
#

interesting! do you have it in a longer prompt

#

i see

#

that makes a lot of sense..

quaint night
#

here's a test I tried on the bot

fast girder
#

interestingly enough Google Translate chooses the other reading as well

#

(I am out of my depth here)

quaint night
#

if it were English, it's like using a Germanic pronunciation for a Latin origin word?

#

sometimes Mono is correct, sometimes butsu

fast girder
#

yeah very interesting!

quaint night
#

aaaaaaaaaaa

#

sorry I'm just happy w

#

the first one seems to mimic one of popular Japanese TTS's that's probably why the quality is such

fast girder
#

yeah 😦

#

we certainly have that problem in English too

quaint night
#

can those models use the small model's npzs? I could give you a better one for history_prompt

fast girder
#

ya

#

might take a little effort but doable

#

also if you have any slightly longer prompts (3sentences or so) would love to try those

quaint night
#

ok I'll find some to choose from

fast girder
#

one more decent one as far as i can tell (still grainy)

quaint night
#

ใŠใŠใŠ

#

it sounds like a news special from a tv report

#

This is just a randomly generated paragraph
地球の気候変動に関する新しい研究が発表されました。研究者たちは、再生可能エネルギーの利用が急速に増えていることにより、温室効果ガスの排出量が減少していることを明らかにしました。太陽光や風力などのクリーンなエネルギー源の利用が進んでいるため、地球温暖化の抑制に大いに寄与しています。これは素晴らしいニュースです！

#

Also, for some reason bark likes to generate many Japanese voices with foreign accents

fast girder
#

yeah, we have the same problem with Chinese (and other languages too)

quaint night
fast girder
#

here's a random one - I think we have a little work to do to make npzs work so will report back

quaint night
#

Hmm, what if I gave you a recipe for generating a good "seed" voice?

#

btw with google and chuukabutsu - it's funny because they write chuukamono but they generate chuukabutsu in the audio

fast girder
#

i see!

quaint night
#
{
  "_version": "0.0.1",
  "_hash_version": "0.0.2",
  "_type": "bark",
  "is_big_semantic_model": true,
  "is_big_coarse_model": true,
  "is_big_fine_model": false,
  "prompt": "初めて会った日から 僕の心の全てを奪った どこか儚い空気を纏う君は 寂しい目をしてたんだ",
  "language": null,
  "speaker_id": null,
  "hash": "ba221be9420a7791e8dc6ec5f175ca12",
  "history_prompt": "None",
  "history_prompt_npz": null,
  "history_hash": "6adf97f83acf6453d4a6a4b1070f3754",
  "text_temp": 0.6,
  "waveform_temp": 0.8,
  "date": "2023-06-10_19-03-45",
  "seed": "332186546"
}
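
To reuse a recipe like this outside the web UI, the metadata can be split into model-size flags and sampling settings. A hedged sketch: the output keys are named after keyword arguments of upstream Bark's `preload_models` (`text_use_small`, `coarse_use_small`, `fine_use_small`) and `generate_audio` (`text_temp`, `waveform_temp`), but the mapping from the `is_big_*` fields is my assumption about what the UI means, and `settings_from_metadata` is a hypothetical helper. How the seed gets applied is app-specific and not shown.

```python
import json

# A trimmed copy of the metadata above (prompt text omitted for brevity).
metadata_text = """
{
  "_type": "bark",
  "is_big_semantic_model": true,
  "is_big_coarse_model": true,
  "is_big_fine_model": false,
  "history_prompt": "None",
  "text_temp": 0.6,
  "waveform_temp": 0.8,
  "seed": "332186546"
}
"""


def settings_from_metadata(meta):
    """Split the metadata into model-size flags, sampling settings, seed and voice."""
    preload = {
        "text_use_small": not meta["is_big_semantic_model"],
        "coarse_use_small": not meta["is_big_coarse_model"],
        "fine_use_small": not meta["is_big_fine_model"],
    }
    sampling = {
        "text_temp": meta["text_temp"],
        "waveform_temp": meta["waveform_temp"],
    }
    # The seed is stored as a string, and "None" is stored as text, not null.
    seed = int(meta["seed"])
    history = None if meta["history_prompt"] in (None, "None") else meta["history_prompt"]
    return preload, sampling, seed, history


preload, sampling, seed, history = settings_from_metadata(json.loads(metadata_text))
print(preload, sampling, seed, history)
```

This recipe mixes big semantic and coarse models with a small fine model, the kind of mixing mentioned earlier in the channel.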
fast girder
#

awesome

quaint night
#

it has an unnecessary bgm but it's voice actor level of diction

fast girder
#

will throw a few more random ones over, gonna need to write some code to try the other ones

#

(no idea if these are good)

quaint night
pure brook
quaint night
# fast girder

they mash up words and leave them out, but there's very little noise - they could be piped into a following generation and the results might be good

fast girder
quaint night
#

sometimes stuffing words down bark's.. prompt causes it to snap back into reality

quaint night
copper night
#

does anyone know if I can
add my own voice
and have the AI
generate audio with it?

pure brook
#

I will try that out later. I spent way more time on this than I'd like to admit today haha

#

ehhh i have to try it. Ill do some reading to see how to add this in

full badger
#

test

dire seal
#

/bark

fast girder
pure brook
cedar ledge
#

/bark

pure brook
#

is it v1 or v2?

remote turret
pure brook
sonic robin
#

hello

pure brook
quaint night
quaint night
chrome tapir
#

yo yo

#

workin on a new chatgpt/suno/audiocraft project

#

dis gon b gud

fast girder
humble ruin
#

Hi, how are you?
This is Fernando speaking. I'm the specialist in treatment with GOTA MAX, and I'll be helping you in this session.

grizzled shard
teal grove
#

any possibility for elevenlabs quality tts voice?

#

it's like these voices are passed through a filter

#

and they end up sounding like they have some slight (graininess/machinelike/synthlike) quality which is (at the moment) difficult to hide

ebon widget
#

yeah we are working on that for the next version. clean history prompts defo help but ultimately it's limited by the codec

#

generally the variation is a feature since it can do arbitrary audio but it needs to be controllable to remove it for TTS use cases

teal grove
#

how do these two audio files compare to your ear?

teal grove
# teal grove

SPOILERS do not read until you tell me your first impression ||(I don't care about the true quality in this instance, I'm fine with there being an illusion of quality)||

#

it can be harder to tell the difference if you hear the same thing over and over and get used to the sound (potentially)

ebon widget
#

the cleaned one sounds a bit richer, but not necessarily less noise

teal grove
#

1) richer how & 2) by noise, do you mean machine-like?

#

you say it sounds a tad ... better? (i hope)

#

spoiler || it's an EQ + distortion to hopefully make the sound sample sound less tinny ||

#

|| it actually sounds awful with distortion only ||

#

actually nvm after a short break i can tell it's the same sample quality

#

eh well

teal grove
#

i tried cleaning a file that sounds like this:

#

i removed the noise profile in the file ending with _2 from the file ending with _4 to produce the file ending with _3

#

in fl studio

#

tell me how good this denoiser is

#

i think it can work, i'd just need to isolate the worst sounding parts of the audio by hand and obsess over it for a while to hopefully get a better result

#

it's come out muffled because the noise profile was in the higher frequencies

#

(used Edison)

#

these last two, after applying pitch and formant effects, it's like there's background noise

#

and idk exactly what _7 is

#

lol

#

is suno bark just actually cutting up audio and stitching it together?

#

in creative ways

#

like sometimes it will say things that aren't in the prompt that i assume are from the training material

#

or are just made from some sort of manner of processing like the music is

#

part of this prompt is me trying to make the voice sound angry (history prompts)

teal grove
#

perhaps it is impossible to have a clean take because after using distortion what sounds totally clean actually contains faint noise that might say something else or have other noises/sounds, with the loudest sounds being closer to the prompt and the faintest being furthest (usually).

#

idk actually because sometimes it loudly says something else

teal grove
#

(or so I've heard)

teal grove
#

I guess there isn’t an AI of any kind that is available to the public that sounds perfectly natural in speech.

hushed hull
grizzled shard
quaint night
#

tts-generation-webui has vocos from npz, so you can run it on past generations. Vocos can be applied on wav (incl. mp3 etc) as well

teal grove
#

still slightly you-know-ish but yeah

teal grove
#

will try this next

teal grove
#

how do i fix this

#

Oh my, I am a mug, my old scripts have the same error. Oh well.

#

But they still work though

#

Nvm I'll try reinstalling / installing pysoundfile 🥱

teal grove
#

HOLYYYYYY

#

DAAAMMMMMNNNN

teal grove
#

(btw the text prompt was the test script from the link)

#

the audio quality in these samples is good enough to not cause annoying glitch sounds when being processed by the pitch shifter plugin in FL Studio 👌

#

(so far)

#

I'm gonna try accessing my old voice preset tomorrow and seeing how much better it sounds with vocos

chrome tapir
chrome tapir
tardy topaz
chrome tapir
#

whatup fly

#

seems cool

quaint night
#

That's RVC

chrome tapir
#

i havent tried RVC yet

#

i cant wait until we get stereo effects. imagine the panning effects AI will be able to do

teal grove
#

I'd want to make the sung parts without background music and then make my own tracks. I'll see how that goes

hazy whale
#

using suno to add vocals to my tracks in ableton

teal grove
chrome tapir
# teal grove Is the title the prompt?

🎵Aha aha aha aha, doo doo doo doo, yeah yeah yeah yeah, ooo ooo ooo ooo.🎵 was the full prompt. if you put less you won't get a full 14 seconds

chrome tapir
#

and obviously suno output isnt going to match a already existing waveform

hazy whale
#

put a ton of effects on it to make it sound better generally, using ovox by waves, or maybe antares could tune the vocals

tardy topaz
chrome tapir
chrome tapir
chrome tapir
#

RVC seems cool

tardy topaz
chrome tapir
#

this npz stays on beat well

#

npz kinda like npc huh

chrome tapir
#

i wonder what semantic rate does

tardy topaz
#

if it's bark_perform that's --semantic_allow_early_stop False

#

I didn't realize you were using my fork. if that doesn't work let me know, I'll fix that right now

tardy topaz
chrome tapir
#

oh i thought maybe it would speed up speech

#

sweet now i get 15 seconds with only 2 words of prompt

#

will this be useful i guess we'll see

tardy topaz
#

BTW, for generic music try stuff like [music][music][music] it seems silly but repeated tags can be good for that

#

Can you run the python bark_webui.py instead? That has a checkbox for 'blank text' prompts for sure

#

Even [music][music][music][music][music][music]

chrome tapir
#

i am way behind on your fork atm. im still using the very first one but i hacked it up so i dont want to upgrade until i have to

chrome tapir
#

i am gettin some good results with just (dance beat)

#

ill try music too

tardy topaz
#

In general you can try both () and []

#

and with some voices, one works, and the other doesn't!

#

Oh also stars. Try * dance beat * or * ominous music * etc
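
If you want to try all three wrappers on the same idea without retyping, something like this works. It only builds the prompt strings and does not call Bark at all; `tag_variants` is a made-up helper name.

```python
def tag_variants(description, repeats=1):
    """Return the same tag wrapped in [], () and * *, each repeated `repeats` times."""
    wrappers = ["[{}]", "({})", "* {} *"]
    return [wrapper.format(description) * repeats for wrapper in wrappers]


for prompt in tag_variants("dance beat", repeats=3):
    print(prompt)
```

Since some voices react to one wrapper and not the other, generating with each variant against the same voice is a cheap way to find out which one that voice listens to.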

chrome tapir
#

i seen someone mention there is a way to use lora's with bark is that just a npz file or different

tardy topaz
#

Different. I still need to try it myself. For fine-tuning voice clones on the model.

chrome tapir
#

oh yeah like obama heh

tardy topaz
#

Breaking into random words is kind of a vibe honestly

chrome tapir
#

everyone loves the wildcard

#

until it starts screaming in your ear at 0db clipping

tardy topaz
#

I don't know what your defaults are for topp and topk, they might be none. Lately I've been using like topk 200 and 150 on coarse

#

topp 0.95 or even a bit higher
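
In case topk and topp are a mystery, here is a rough NumPy sketch of what those two filters generally do to the model's next-token scores before one is picked. This is the generic technique, not Bark's actual code, and the function names are invented for the example. Top-k keeps only the k highest scoring tokens; top-p keeps the smallest set of top tokens whose probabilities add up to at least p.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    weights = np.exp(shifted)
    return weights / weights.sum()

def filter_logits(logits, top_k=None, top_p=None):
    """Mask out (set to -inf) every token rejected by top-k and/or top-p."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    if top_k is not None and top_k < logits.size:
        kth_best = np.sort(logits)[-top_k]
        logits[logits < kth_best] = -np.inf
    if top_p is not None:
        order = np.argsort(logits)[::-1]      # best token first
        probs = softmax(logits[order])
        mass_before = np.cumsum(probs) - probs
        # Drop a token once the tokens ahead of it already cover top_p.
        logits[order[mass_before > top_p]] = -np.inf
    return logits

def sample_token(logits, rng, top_k=None, top_p=None):
    probs = softmax(filter_logits(logits, top_k, top_p))
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng()
print(sample_token([2.0, 1.0, 0.5, -1.0], rng, top_k=3, top_p=0.95))
```

Lower values cut off more of the unlikely tokens, which usually means fewer wild detours but also less variety.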

chrome tapir
#

wow nice little guitar riff in there

#

sometimes npzs dont work u notice that

tardy topaz
#

That might be the old code, not sure

chrome tapir
#

i mean they work but sound completely different

tardy topaz
#

oh that, yeah

#

that's a complicated issue. especially music is prone to failure

chrome tapir
#

i went from 100% music to some guy talkin yeah

tardy topaz
chrome tapir
#

fastback at the residence great song

tardy topaz
#

Bark names its own songs!

#

lol

#

One downside to not allowing early stop. If your .npz is the radio outro, you won't be able to easily recover the music.

#

Since Bark will continue the end of the clip, which is a person talking. Though you can go in and change this with work, it's not a built in feature. And picking a random point tends to also work badly. But if it's truly a one of a kind .npz, send to me, I'll fix it.
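
A rough idea of what "going in and changing it" could look like: cut the history prompt so it ends before the talking starts. This is only a sketch. The key names (`semantic_prompt`, `coarse_prompt`, `fine_prompt`) and the token rates (49.9 Hz semantic, 75 Hz codec) are what Bark's public code appears to use, so treat them as assumptions and check your own .npz first. `trim_history_prompt` is a made-up name, and choosing a good cut point is still the hard part.

```python
import numpy as np

SEMANTIC_RATE_HZ = 49.9  # assumed semantic tokens per second
COARSE_RATE_HZ = 75      # assumed codec frames per second (coarse and fine)

def trim_history_prompt(prompt, keep_seconds):
    """Keep only the first `keep_seconds` of each array in a history prompt."""
    n_semantic = int(round(keep_seconds * SEMANTIC_RATE_HZ))
    n_codec = int(round(keep_seconds * COARSE_RATE_HZ))
    return {
        "semantic_prompt": prompt["semantic_prompt"][:n_semantic],
        # Codec arrays are (codebooks, time), so slice along the time axis.
        "coarse_prompt": prompt["coarse_prompt"][:, :n_codec],
        "fine_prompt": prompt["fine_prompt"][:, :n_codec],
    }
```

Usage would be roughly `trimmed = trim_history_prompt(np.load("radio.npz"), 9.0)` followed by `np.savez("radio_music_only.npz", **trimmed)`, where both file names are placeholders.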

#

Uhhh... sometimes Bark feels like it's giving you a 14s audio clip ripped from the multiverse, just a glimpse of something somewhere happening, lol. Is that a crying baby in the background?!? Actually maybe a cat.

chrome tapir
#

considering how little we know of quantum mechanics i sometimes think AI reaches a threshold that allows something from the consciousness field to enter

tardy topaz
chrome tapir
#

oo thats nice

#

thats musicmusicmusic or u think the parameters are helping

tardy topaz
#

Actually what's making it good is my buggy code has been sampling twice the whole time... !

#

(it's 30 seconds because I just bugged it trying stuff, it's just duping the audio)

tardy topaz
#

Oh I actually have some other code enabled... doesn't work in base bark. gonna have to reverse engineer what I did to make music so good by accident actually.

north loom
#

even scarier than the original! :p 😮

tardy topaz
#

Was there anything beyond the words in the prompt? Oh I am sleepy today. I can just see the chat bot message. Yeah it's very long prompt, nice.

tardy topaz
north loom
#

Summary from:
https://twitter.com/AndrewCritchCA/status/1680461874171658242

Once, in a past now distant, humanity faced a formidable challenge: the unchecked acceleration of artificial intelligence. Viewed through the lens of AI, humans were but slow-moving, sentient flora, showing flickers of intelligence in their unhurried existence.

Imagin...


@AndrewCritchCA As video. Slim chance that @ESYudkowsky @elonmusk notice also but let's try. @AndrewCritchCA your summary was spot on.

tardy topaz
#

Boltzmann brain. Greg Egan fan maybe?

north loom
quaint night
quaint night
quaint night
tardy topaz
#

It's nice, but it's just not a Bark voice to me without like, the sound of baby crying in the background and the speaker bumping the microphone halfway through. lol

quaint night
#

lmao

tardy topaz
#

pod3000 said "Bark feels like dialing a random number into bill & teds phone booth" truly perfectly put

quaint night
#

haha

#

True, some of the best voices are tied to some bgm or inexplicable noise

tardy topaz
#

More and more I'm surprised how much control you actually do have with the text prompt.

#

Some prompts are like, 90% very similar voices.

quaint night
#

but is it control or specificity? like, can you actually tell it what you want it to do, or do you have the key for a specific output?

tardy topaz
#

Yeah I don't mean control as in fine grained control of performance. Just for 'summoning' the random voice initially. After that it's much hairier.

#

I find direct descriptions can work, but it's usually not the best way. Like saying, "I'm a chatbot" seems to work better than describing the chatbot, via other prompting methods.

quaint night
#

I remember trying that with genders, it did not work for me lol. I think the audiobooks in the dataset form a "prompt resistance"

#

however I know that with trying to get a "cheerier" tone, it's useful to add a [laugh] etc

tardy topaz
#

On a generic meditation like:

Listen to my soothing, relaxing voice. Breathe calmly in, and out. Slowly close your eyes. Continue to breathe at this slow pace. Feel the air expand your lungs with each in breath.

I was getting like 80+ percent women. Even adding "Bond. James Bond." only knocked it down to like 50.

quaint night
#

Oh, yeah, there's a bias

#

I think it would be interesting to have a -audiobook or -female +male "bias controls"

tame cobalt
#

Can we train based on our own voices?

quaint night
tame cobalt
#

Do they work with bark?

#

I only know elevenlabs

quaint night
#

Also tokens that silence noise really exist, huh, if only it wasn't such a pain to employ them

#

Yes, bark has voice cloning options

tame cobalt
#

Okay, interesting. Thanks!

#

Man what a crazy time to be alive 😄

quaint night
light bronze
#

All stitched together with python/ moviepy so basically 0 human intervention whatsoever lol

tardy topaz
#

One thing Bark absolutely crushes at, largely but not entirely by accident, is horror audio. So much creepy distortion, fading voices, babbling, unrecognizable sounds. I have so many samples that could have been a clip in a horror movie.

It's a little hard to leverage it on purpose, but I've scared myself by playing a longer Bark sample at night. You hit a quiet spot where nothing happens for 4 seconds, you think the audio sample is done so you almost forget about it. Then you suddenly hear some unnatural voice screaming out of nowhere fading into static. And it's the end of your Bark audio sample, it just went off.

quaint night
tardy topaz
#

It's crazy consistent, right?

light bronze
#

Have any npz prompts you could share oriented towards horror?

tardy topaz
#

It's usually the result of a voice switch, where the .npz voice completely changes half way through the audio. Which is normally a bad npz you don't keep, so I haven't been really setting them aside. But I totally should have for the especially creepy ones, and will in future.

tardy topaz
#

The whiny electric noise... works in horror

#

Probably worth trying some meditation prompts to find slow whispery voices, they are a close fit, and meditation prompts are very strong and consistent with random voices.

chrome tapir
#

make anything good with audiocraft?

#

i got lucky a few times

#

suno is good for beats but audiocraft is better for melodys and non-percussion instruments

chrome tapir
#

actually u can make some sick drums in audiocraft

quaint night
light bronze
#

Any tips on getting rid of the hallucinations? I followed the tips on the GitHub (eos setting or whatever it is)

quaint night
#

Eos would only help with overextending

#

As for regular old hallucinations, they kind of just happen, but more often for some cases than others

#

Depending on voice, prompt etc

fast girder
#

The main one we've seen is attempting not to understuff or overstuff the prompt

quaint night
#

I've seen that a longer history (i.e. 2 sentences) doesn't like generating a short phrase (like a few words)

tardy topaz
# light bronze Any tips on getting rid of the hallucinations? I followed the tips on the GitHub...

EOS can help maybe for end-of-text hallucinations, but some general things:

  1. Prompt is too long. Generally you can use a longer prompt than the speaker can finish without causing hallucinations, but it can go too far.
  2. Prompt is too short. Bark really wants to generate at least 6-8 seconds, even if it has to add words.
  3. Prompt and speaker style mismatch. In the extreme, a different language than the voice. Also: formal vs casual, accents, Old English versus modern slang, etc
  4. Prompt with non-spoken text generally. (laughs) [screaming] MAN: WOMAN: and so on. Some voices work fine, others fall apart entirely.
  5. Prompt and speaker lower level mismatch. Your prompt is an unnatural followup to the prompt in original voice. Like if the original random voice prompt ends "I like eating pizza at" Then you prompt: "My name is Suno." as the next words. Bad fit.
  6. A quirk of Bark voices and repetition. When a voice speaks the same (or very similar) prompt as the one that created the voice in the first place.
  7. A thing some speakers have a high tendency to do, for mysterious reasons. There's just something in that voice that makes it more likely.
  8. Random chance. Bark just decided to dial a random number into bill & teds phone booth and record 14s of audio, instead of reading your text. It happens.
#

For a lot of these, it just makes the chance of hallucinations somewhat more likely, and for the most part may work fine.
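
The first, second and fourth items are the only ones you can really check mechanically, so here is a tiny pre-flight sketch for them. The speaking rate of 2.5 words per second is purely a guess, the 6 and 14 second bounds are just the rough numbers mentioned in this chat, and `prompt_warnings` is an invented helper, so tune everything to taste.

```python
import re

WORDS_PER_SECOND = 2.5  # assumed average speaking rate, not measured from Bark
MIN_SECONDS = 6.0       # Bark reportedly wants at least 6-8 seconds of speech
MAX_SECONDS = 14.0      # roughly one clip; longer prompts can still work

# Matches [bracketed] and (parenthesised) non-spoken tags.
TAG_PATTERN = re.compile(r"\[[^\]]*\]|\([^)]*\)")

def prompt_warnings(prompt):
    """Return a list of hallucination risk flags for a text prompt."""
    warnings = []
    if TAG_PATTERN.search(prompt):
        warnings.append("non-spoken tags")
    spoken_text = TAG_PATTERN.sub(" ", prompt)
    estimated_seconds = len(spoken_text.split()) / WORDS_PER_SECOND
    if estimated_seconds < MIN_SECONDS:
        warnings.append("too short")
    elif estimated_seconds > MAX_SECONDS:
        warnings.append("too long")
    return warnings


print(prompt_warnings("My name is Suno."))
```

A warning only means the odds go up a bit, same as with the list itself.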

alpine swift
tardy topaz
alpine swift
tardy topaz
#

Haha, good luck. You might have a lot of ear piercing sounds in your future. I have made some progress with music in Bark recently, but it's much harder than regular voices.

alpine swift
#

Currently using bark infinity. But don't know where i can get better models :P or even how to train a quite high quality to be NPZ as it turned out to sound the same as the sample, even if the long mp3 is quite damn quality :P

tardy topaz
#

My guess is a simpler prompt might work better, maybe:
[intense music][intense music]My lyrics [intense music] [intense music]

#

no note symbols maybe too

alpine swift
tardy topaz
#

Change 'intense' or something else

#

That was just a sample. But that works somewhat better with just [music] for example

#

maybe [power ballad] I don't know. It's all undiscovered what works best

alpine swift
tardy topaz
#

Music is mostly undiscovered. For a random voice, I think focus on the TEXT not the [brackets]

tardy topaz
#

Like 8 of 10 will be a slow super calm, sometimes whispering, female.

#

That's how influential the right text can be.

#

Also lately I've been using top-k 300 semantic, and top-k 150 or 200 coarse. I think it might be better.

alpine swift
#

Indeed. Been looking for new/other trained stuff as well, or if bark is the only text to voice synthesizer :P

alpine swift
tardy topaz
#

Are you on AMD?

#

It should work unless you are on AMD.

#

I can make it work on AMD, but honestly it's a low priority. I might just wait for better AMD support a bit.

alpine swift
#

Nvidia. 3090

tardy topaz
#

If you get those errors on NVIDIA, then something is wrong.

#

Okay two things. One, are you using a seed? Turn it off if so. Set it to 0.

#

Two, well, I'm not sure but your Python AI setup might be a little screwy there.

alpine swift
#

That's what i wanted. Why won't seed work? As i wanted a unified result and not first the female voice i trained, then into a random scottish lad a second after throughout the last 13 sec lol

#

Oh boi..

tardy topaz
#

Okay, so the seed doesn't help you for that. The seed only means you get the SAME female and then the SAME scottish lad.

#

The seed should not cause an error with topk, that probably is a bug... but also you shouldn't use a seed.

#

The seed is a random number seed. So it means you get the same voice with the 100% exact same prompt. If you change anything, or use the voice again, it does not help with consistency in that way.

alpine swift
tardy topaz
#

The seed has its uses, but it's not really useful the way you want it to be. It makes the output reproducible. Like if you played an RPG and rolled a D20 10 times. If you used the same seed, all D20 rolls will be the same sequence of numbers.

#

However, it does not make all the D20 rolls similar, right?
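
The dice thing in a few lines of Python, using only the standard library. A D20 is a 20-sided die; `roll_d20s` is just a name for this example and has nothing to do with Bark's code.

```python
import random

def roll_d20s(seed, count=10):
    """Roll a 20-sided die `count` times using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(1, 20) for _ in range(count)]


first = roll_d20s(seed=42)
second = roll_d20s(seed=42)
print(first)
print(first == second)  # same seed, same sequence every time
```

Same seed gives the exact same ten rolls, but the ten rolls are still all over the place compared to each other. That is the seed in Bark too: it repeats a result, it does not make the result more consistent.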

alpine swift
#

No idea what a D20 is