# Spin Trained Voice "gibbers" during silence


ashen narwhal
#

I trained my own voice model using spin this time as it seemed interesting, but my trained voice jabbers like it's demonically possessed during "silence" after speaking which isn't ideal. (random vowels? I guess?)

Is there a particular section of the discord where I can talk to people about this? it's all separated by roles so I'm not sure if I can see all the right places 😅

I checked with:

  1. someone else's "spin" voice
  2. some old "contentvec" voices
  3. training the same voice with a contentvec setup

None of them have the same issue 🤔, so I get normal bouts of silence when i stop speaking through them. It's a shame because i like how the voice sounds when trained with Spin, except for the gibbering.

That means there isn't anything wrong with my Vonovox install or Applio, just something is the matter with how I'm training the spin voice specifically.

fathom oak
#

also you could try spin v2 and use the corresponding pretrain like legacy spin (v2)

#

if not sure, just fall back to the default contentvec for both training & inference as it is proven to be stable

#

additionally, stay away from jp/kr hubert, there are no good models using it, not even the old snowie pretrain

ashen narwhal
# fathom oak the model should be inferred with the same embedder as the one used in the train...

Heya, I haven't used the inference tab before in Applio as it's a feature i don't think i need, i train the voice, then i personally use the voice during my livestreams.

on the training tab I was using the embedder model "spin-v2" and that's the one that resulted in the gibbering during silence-after-speaking.

this version of applio (downloaded today) only has contentvec, spin-v2 and custom as options, the other options you said to stay away from seem to be gone.

for the spin v2 i tried using "custom pretrained" with two different sets of files: f0G48k_spin-v2_10e and f0G48k_spin-v2 (along with their "D" counterparts).

should I be using something else for Spin V2?

I kinda like the spin one's sound, and it's definitely possible to have it work: i downloaded "relaxed girl" https://discord.com/channels/1159260121998827560/1407456715561242645 to make sure I didn't have Vonovox set up wrong, and it sounds really good. I really want to get a Spin version of the voice i'm making. pepe_sad_sit

fathom oak
# ashen narwhal Heya, I haven't used the inference tab before in Applio as it's a feature i don'...

you should also do normal inference to test the model. I think the mainline rvc 1006 is still good for inference with the default contentvec, but applio is still not bad for spin/v2.

regarding the voice changer, make sure the environment is quiet enough (no background voices/sounds). a decent headset mic for under 99 bucks is enough. also adjust the silence threshold or use decent external noise suppression as needed.
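
For readers unfamiliar with the term, here is a minimal sketch of what a "silence threshold" (noise gate) generally does: it measures the level of each short frame of input and mutes frames below a cutoff. This is an illustration only, not Vonovox's or Applio's actual code; the function names, the 480-sample frame size and the -50 dBFS default are assumptions made for the example.

```python
import math


def rms_dbfs(frame):
    """Return the RMS level of a frame of float samples (-1.0..1.0) in dBFS."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0.0:
        return float("-inf")
    # 10*log10 of the mean square equals 20*log10 of the RMS amplitude.
    return 10.0 * math.log10(mean_square)


def gate_frames(samples, frame_size=480, threshold_db=-50.0):
    """Replace every frame quieter than threshold_db with digital silence.

    frame_size and threshold_db are hypothetical defaults for illustration.
    """
    gated = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if rms_dbfs(frame) < threshold_db:
            gated.extend([0.0] * len(frame))  # below threshold: mute
        else:
            gated.extend(frame)  # above threshold: pass through unchanged
    return gated
```

A gate like this only decides what reaches the model. Raising the threshold mutes more of the quiet input, which is why it is usually tried first when silence sounds noisy.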

the spin v1 option in training is hidden as it is superseded by v2 (but still available in inference), though you could modify the script at Applio/tabs/train/train.py or try the older version 3.4.0

ashen narwhal
# fathom oak you should also do normal inference to test the model. I think the mainline rvc ...

Oh, well as it needs to work for me speaking, it seems fair to skip that step (even if it was good in inference, if i can't speak with it then it's unusable anyway hehe)

the environment is quiet and as i said it sounds normal when:

  1. I speak into the mic (real voice) no background whatsoever.
  2. when i use someone else's spin model
  3. when i use someone else's contentvec model
  4. when i use my contentvec model with this exact same data.

there is no background sound, no distortion and no problems when using anything else; there's only problems with my trained spin voice. You see? so it must be a problem with how i'm setting it up.

i'd give spin v1 a try maybe 🤔 , I wish i knew what was wrong with the silence training though, or if that's not what the problem is, i wish i knew what the problem was.

fathom oak
ashen narwhal
fathom oak
#

note that you could use pre-recorded audio for the inference

ashen narwhal
fathom oak
ashen narwhal
# fathom oak no I mean you should try pre-record the audio with your real mic as usual and us...

I don't really know how to work the inference tab, there's a whole bunch of settings in the advanced area inside.

Do i really need to learn all this stuff when i'm not going to use any of these offline audio conversion capabilities at all?

i'm not trying to say it can't be used by someone, but I don't think this will help remove the gibberish that's present in the silence in my finished voice, so I guess i just don't get why it's important to the problem.

I don't need a model stress test, or burn in test or a microphone test or anything like that, there is no problem with my audio equipment, there's a problem in the training somewhere.

even if we get something to happen in the inference tab, I can't use the inference tab to do my livestream, I need the finished model to work outside of the inference tab. 🤔

#

Sorry! I did want to take the time to say I am grateful that you are helping me, thank you! 👍🏻

I guess all that's left to do is:

  1. See if i can find a different pretrain
  2. No pretrain
  3. Old versions

and things like that, as I can't find a way to fix the broken silence making random sounds, so i'll try and exhaust all the different ideas.

fathom oak
#

again the noisy silence is because you didnt raise the threshold

#

in case of using wasapi/asio devices, you need an external noise suppression, or perhaps vonovox & tg-develop fork provide the noise suppression plugin

fathom oak
ashen narwhal
# fathom oak again the noisy silence is because you didnt raise the threshold

How would that make sense if the settings work on 20 other voices, but not on any finished trained voice based off of this one?

I get why you might think it's noise around the room, i understand you, but it's deathly silent in here, there is no feedback when listening to the unmodified voice, there's no feedback with voice 1, voice 2, voice 3... all the way up to voice 20.

but there is a repeatable problem immediately with the spin version of the same voice trained on the same dataset, even if i pick any of the .pth files that were created along the epochs.

ashen narwhal
fathom oak
ashen narwhal
#

on the top are 6 contentvecs (although i also have previous ones as these 6 made today are just to check I didn't forget how to do it)

and the 6 spin v2's that all came out "dirty"

fathom oak
ashen narwhal
fathom oak
#

and you could try working on another dataset

#

by any chance

ashen narwhal
#

I think it sounds beautiful in an A/B test on the same dataset, much better than what i am using right now. that's why i'm really hopeful that it can be solved, if i could have that new voice (without the bugs) i would be set for life.

Unfortunately I don't have another 15 minutes of perfect audio from anything else, the way I got this audio was a complete pain in the backside 💀

it's all from a professional in a studio and i really don't know how to go about getting more finished .WAVs of someone else that are that long.

fathom oak
#

so why not try on another dataset, no matter if it's perfect or mediocre

#

there are many youtube sources, though yea you'd need to extract the voice and clean it

ashen narwhal
#

if you have a spare dataset that's hosted up somewhere I can download i'd fire up the training and see if i get the same problem 🤔

fathom oak
#

again you could just try some youtube contents no matter how mediocre it is

ashen narwhal
#

oh, I had no idea, doesn't make too much sense to me, no need to explain it though, rules are rules and that's that 👍🏻

fathom oak
ashen narwhal
#

Huh, I was expecting the internet to just have things like "100 hours of clean spongebob audio" or "35 minutes of clean james bond audio", but it really doesn't seem to be a thing people just upload 🤔

oh well, I thought that would be a quick 30 minute job to try that and see if i get the same silence-error but it doesn't seem it would be that easy.

fathom oak
ashen narwhal
# fathom oak perhaps "spin is sensitive to audio volume" but looking forward to trying on suc...

Even if you turn the volume down to unusable levels it will still inject gibberish-noise into the silence after speaking, no matter what dB 🤷🏻‍♂️ , it's just broken at a functional level.

all my audio is 48khz.

I've given up on looking for other audio to process as i can't find anything in a ready to use state, I didn't realise prepared audio was such a secret thing 😅 all those places you mentioned have things like "100,000 people say the same number" and other such descriptions that sound completely unusable.

I've only ever trained the one voice and it was 15 minutes of one person, I just wanted something easy and ready-assembled like that, all these other things seem like a lot of effort

#

spin v2 seems to have a lot of potential (as per my two demo files i uploaded earlier) but I guess i just got here too soon before some of the bugs have been worked out and it becomes more of a mainstream thing to use 🤔

once -everyone- is using it generally i'm sure someone will immediately see the same thing and fix it with their know-how in two seconds, I have this habit of landing on something before it's public ready with my bad timing 🤷🏻‍♂️

fathom oak
ashen narwhal
# fathom oak - have you tried training it in 32k? - have you tried using rvc/applio inference...

I gave it a go, but still the same garbled area.
I used inference and added an "acapella" song:

  1. with the contentvec trained version it sounded pretty normal
  2. with the spin v2 trained version it sounded completely garbled (with extra garbling on the silence)

same voice, same dataset.

at least when i was speaking through the spin v2 version, it sounded normal during the words, but the inference tab can't seem to use it at all, quite bizarre. 🤔

  1. I tried just setting the training to 32k (without resaving the 48k dataset) and that didn't work
  2. I tried training the whole thing without any pretrain and that didn't work either
  3. I tried training the whole thing with the built in pretrain without directing it to the custom spin 2 pretrains i downloaded and that didn't work either.

So I think I tried everything other than trying to find a version with spin (v1)
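
Since point 1 above trains at 32k on files saved at 48k, a quick way to see which rate the dataset files actually carry is to read their headers. This is a minimal sketch using Python's standard `wave` module; the helper names are made up for the example, and `wave` only reads PCM WAV files (float-encoded WAVs raise `wave.Error`). Whether the training tool resamples automatically is not stated in this thread, so the check only reports a mismatch and draws no conclusion from it.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def find_rate_mismatches(paths, target_rate):
    """Map each file whose sample rate differs from target_rate to its rate.

    target_rate is the rate selected for training, e.g. 32000 or 48000.
    """
    mismatches = {}
    for path in paths:
        rate = wav_sample_rate(path)
        if rate != target_rate:
            mismatches[path] = rate
    return mismatches
```

For example, `find_rate_mismatches(["voice.wav"], 32000)` on a 48 kHz file (the file name is a placeholder) would return `{"voice.wav": 48000}`.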

ashen narwhal
#

Legacycore2.5 just throws an error immediately, so i'm not sure what's up with that 🤔

seems Applio just wanted to be rebooted to stop being fussy 😅 i'm training it now, just to see what it's like

fathom oak
ashen narwhal
fathom oak
mortal bay
#

the random sounds between silences could be that you trained contentvec embeddings while using a spin pretrain or vice versa, or that your dataset needs mute files in order to infer silence

mortal bay
ashen narwhal
# fathom oak here is a tweaked script of Applio/tabs/train/train.py for Applio 3.5.1 to unhid...

I gave all of this a go, I was quite hopeful, but once again on the inference tab the song I made it interpret just came out garbled, that's so sad. I think my dataset is 15 minutes long, all the silence has been removed, all the audio is the one long dataset, and it has had all the other preparation work that makes it a good contentvec dataset, and then it just doesn't work as a "spin"

it sounds better than it does on the inference page when I take control and speak through it, but still has those sounds on the silence after talking 🤔

ashen narwhal
mortal bay
#

og pretrain is contentvec so it only works fine if you train contentvec embeddings

#

training spin embeddings with the og pretrain just causes the model to generate random sounds, not words
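
The rule being described in this thread (the embedder used to extract features, the embedder the pretrain was built on, and the embedder used at inference should all be the same) can be written down as a small checklist. This is a minimal sketch; the function name and the plain string labels such as "spin-v2" and "contentvec" are assumptions for illustration and are not read from any Applio file or setting.

```python
def check_embedder_consistency(extract_embedder, pretrain_embedder, inference_embedder):
    """Return a list of mismatch warnings; an empty list means all three agree.

    All three arguments are plain labels supplied by the user (hypothetical),
    for example "contentvec" or "spin-v2".
    """
    problems = []
    if extract_embedder != pretrain_embedder:
        problems.append(
            f"features extracted with {extract_embedder!r} "
            f"but pretrain was built on {pretrain_embedder!r}"
        )
    if extract_embedder != inference_embedder:
        problems.append(
            f"model trained on {extract_embedder!r} features "
            f"but inference uses {inference_embedder!r}"
        )
    return problems


if __name__ == "__main__":
    # The mismatch described above: spin features on the og (contentvec) pretrain.
    for warning in check_embedder_consistency("spin-v2", "contentvec", "spin-v2"):
        print(warning)
```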

ashen narwhal
#

i've been trying with fresh training anytime i swap over

and i've tried with silent training files set at "2" and at "0"

and i've tried spin v2 pretrained
and spin v2 pretrained custom f0G48k_spin-v2
and spin v2 pretrained custom f0G48k_spin-v2_10e

mortal bay
#

ah thats noobies finetune

#

i told him it had some degradation problems, weird he didnt delete it

#

but it should still be able to speak

ashen narwhal
#

I'd be happy to try another spin2 pretrain you could point me to, it doesn't take me long to do 200 epochs with anything and give it a whirl, it's a 9950x3d and a 4090 👍🏻

mortal bay
#

i dont know any, im training a personal pretrain that uses spinv2 but im gatekeeping to myself

ashen narwhal
#

I fully understand and can appreciate that 👌🏻

mortal bay
#

if you got a big dataset (100 hours) you can try training your own spin pretrain from 0

ashen narwhal
#

Uh, unfortunately I don't. I just have one voice actress for my dataset, and then after that i'm relying on the pretrain to be available from the internet pepe_sad_sit

mortal bay
#

oh sadge

ashen narwhal
#

i actually did one with no pretrain at all, but that came out much the same

#

I think I ran it for 2000 epochs to try and train to get a little lower on the graphs but... yeah, waste of time.

mortal bay
#

rvc always uses the og pretrain by default, so you actually used og pretrain

#

training without a pretrain generates static

#

2000 epochs is too much

#

you only train that amount if you're using something like diffusion ddsp-svc or so-vits

#

rvc is different

ashen narwhal
#

I have the output of how it sounds from an acapella song if you want to hear the nonsense it comes up with. as you can't even hear anything, i assume this isn't something rule breaking as it's just.... nothing words haha. this same dataset on contentvec comes out beautifully sung (mostly)

mortal bay
#

hmm i remember trying noobies pretrain back then and it was able to speak just fine

mortal bay
#

like extracting the dataset phonemes with cvec but using a spin pretrain

#

or spin embeddings and using og pretrain

ashen narwhal
#

i'm finding it a little rough that most spin pretrain links have been removed by the people creating them, or are for non speech sounds like the one at the top of the "pretrain models" section of discord.

mortal bay
#

pretrains are quite expensive to make, hence why they get gatekept

#

we're talking about days of training in the cloud

#

or even weeks

ashen narwhal
#

I understand, it just seems there was a window when people were sharing them, and then it's passed which is quite saddage. haha

mortal bay
#

the og pretrain used a small dataset (vctk) so it was quite cheap to make for the author

#

yea

#

in these days i'd recommend maybe trying to learn how to make your own pretrain lol

ashen narwhal
#

the one at the top of the pretrain model page is "KLM 6.1 (Experimental V3 L608)". I think they had a second thread with a speaking voice (gone now), the author of "Legacy core" had one (gone now), and I think there were some more perhaps?

mortal bay
#

48k is almost impossible to do right tho, nsf hifigan was not meant for that high a sample rate, it has a lot of smearing at the top frequencies

mortal bay
#

i deleted the old spin v2 because it was broken

ashen narwhal
#

I have a resaved 32k. I have no idea if it affected the quality of the contents, i simply did it as one of the things refused to run without it

ashen narwhal
mortal bay
#

and i think legacy 2.5 is also broken in some way, i did something wrong with that

#

i should have duplicated speaker 0 to sid1

#

yt_nails but i didnt

#

oh wait nvm forgot that last part xD i got confused

mortal bay
#

less clear (?

#

idk how to describe it

#

less detailed

ashen narwhal
#

I quite liked legacy 2.5, but this spin version of the voice just sounded like it was from a new technology altogether and i liked it, i just wish she didn't sound like she was croaking after most words

mortal bay
ashen narwhal
# mortal bay less detailed

Yeah, I get the idea, I always can tell when i get a microphone for voice chat that's a lower range, i just wondered if it sort of crushed the existing data or if it was a little more gentle on it haha

mortal bay
#

now i see why that was happening, so im training it from scratch (og pretrain is no longer involved here) and using the right parameters and settings

#

but its gatekept bc this is more expensive to make

#

so i cant give an expensive pretrain for free lol

ashen narwhal
#

Sure, sure, no need to apologize. I was doing a ton of game modding work for free while everyone else was playing the game, and all my reward was complaints and demands for new content.

so now i've set it up as more of a ko-fi and donation based thing and i try to relax and have a good time again and just work on it when i feel like

mortal bay
#

for me its hard to hear a difference between 48k and 32k xD but could be that i have some hearing loss idk who knows

ashen narwhal
mortal bay
#

48k training is very hard to do right and the current vocoders dont do it very well tho, thats true

#

it eats the gpu vram like crazy

#

and the training is veery slow

#

the model needs like perfect consistency in order to learn 48k properly

fathom oak
mortal bay
#

headphone quality matters too

ashen narwhal
#

so both of these from earlier are the same dataset and both 48k, and I feel like the contentvec version sounds quite "AI" still and the spin version is so much closer to natural and the voice actors tone, i just feel like i'd spend all day deleting the "error" clumps in adobe audition for offline work, and it would just sound like i'm burping constantly on a livestream... not very cool 😵

ashen narwhal
mortal bay
#

at least v2

#

another cool thing about the spin v2 is that it works very well in realtime

ashen narwhal
#

I think i said it earlier, but I think I came into spin v2 both a little too late, and little too soon, which is my curse.

anyone saying "wow check this out!" has deleted their set and it hasn't got the "race to release the best" part either which will come later i suppose 😅

ashen narwhal
mortal bay
#

yea hmm, also about your original problem (gibberish instead of actual words) maybe you could try doing all of the preprocessing steps again, delete the model that doesn't work and do everything again

#

and just wait for someone to post a spin v2 pretrain yt_nails

ashen narwhal
#

I think i've essentially tried every combination of all the different files I have on hand, all fresh trained from the start, with and without all the different options, it only takes 7 seconds for the bar to go across for an epoch in the command prompt 👌🏻

nothing good comes out of any of them, all garbled in the inference tab, all usable in real time (but with the mistake in the silence after speaking)

so I think i've exhausted everything i can do.

mortal bay
#

hmm what if you try the 32k spinv2 pretrain instead? iirc noobies trained 32k too

fathom oak
#

I've suggested the 32k one already

mortal bay
fathom oak
mortal bay
#

maybe it's a problem with the pretrain itself

ashen narwhal
#

I hadn't swapped over and tried that yet at that point hehe, I did the leg2.5 in 32k, so that's about the point where i tried 32k for the first time.

mortal bay
#

that one is a finetune of the og

mortal bay
#

the one the other person made

#

for that to work fine you need to do the preprocessing again

ashen narwhal
#

shall we do the 10e, or the regular variant 🤔

mortal bay
ashen narwhal
#

meh, let's do both hehe

mortal bay
#

the first one is 20e
he trained the 10e after i told him the 20e one had degradation/regression and is broken

ashen narwhal
#

Let's go! pepe_stare I guess we'll try the 20e one for funzies 😅

mortal bay
ashen narwhal
#

I kinda doubt my remote other computer would be able to train a pretrain as it's got the awful low ram version of the 3080 in it, it's either the 10gb when there's a 12 or it's the 8gb when there's a 10gb version, i'm drawing a blank trying to remember.

It's something i picked up in a panic during the crypto boom, when any graphics card sold out if it was left up for sale for longer than 15 seconds.

#

and I normally hate to give my main computer tasks where I can't really use it (i'm making an exception to try and make my spin voice) I kinda go a little crazy if my main computer is "occupied" 🤪