#Make Evil Neuro able to somewhat sing by herself


rapid ledge
#

I already see evil doing various fun things with her TTS on stream, so here is an idea: we allow evil to control even more parameters and give her access to a selection of song instrumentals. then on stream, she could choose to pull out a mic and attempt to sing (or ad lib) along to the instrumentals

more crucially, this would not affect karaoke streams, since in normal streams evil would be actively choosing what to sing, could ad lib, could go off key, and could take requests directly, which i think would be quite entertaining

regal olive
#

Many have requested this before for both evil and neuro, but i think the issue right now is how it can be set up. Vedal has to preload the songs and everything for singing to work

rapid ledge
#

the idea is to let evil generate the pitch, phoneme and timing information herself (which evil already seems able to do, going by the weird evil noises compilation), and play the instrumentals alongside (which is similar to an advanced soundboard)

#

the singing will be scuffed compared to if vedal preloaded it, but at least we know it is truly evil singing

#

also this adds spontaneity, imagine evil suddenly doing a scuffed rendition of rickroll on stream

alpine pilot
#

how

rapid ledge
# alpine pilot how

using tags like <word_or_phoneme, pitch, timing> inside the messages, similar to the soundboard idea in concept except this controls the voice

#

and maybe a command to start or stop instrumentals

#

i think evil already controls her voice parameters pretty well

alpine pilot
rapid ledge
#

like pitching up and down

#

those are vocal parameters

dire geyser
#

While her TTS is absolutely capable of breaking the 4th wall and sounding super realistic, the thing is... it's just a TTS. The only input evil gives is... her message, that's it.
Her TTS is the one that tries to "guess" how that message should sound.

Her singing voice sounds super awesome because it's prebaked with an actual singing voice and quite some manual adjustments after that.

For your idea to realistically work... evil would need full vocaloid control over her voice, including pitch and tone.

So... if evil wanted to say anything, her prompt would probably be something like this:

"he" - (vocal tone) (vocal timing)
"lou" - (vocal tone) (vocal timing)
" " (insert pause time)
"chat" - (vocal tone) (vocal timing)

TLDR: Evil would need to put so much work into speaking that she would need to re-learn how to do it, like a baby.

rapid ledge
#

in practice, some TTS systems actually allow this: they prioritize the tags, and if none are given they use the default settings

#

basically what im thinking of is: chat <'iou', timing, pitch> chat. the two "chat"s would use default parameters, while "iou" would use the provided parameters
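
A minimal sketch of how that kind of tag fallback could be parsed. Everything here is an assumption for illustration: the exact tag syntax, the units (seconds for timing, semitone offset for pitch), and the names `Segment`, `parse_message`, `DEFAULT_PITCH` and `DEFAULT_DURATION` are hypothetical and not taken from any real TTS.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical defaults: a real TTS would pick these itself.
DEFAULT_PITCH = 0.0       # semitone offset from the normal speaking voice
DEFAULT_DURATION = None   # None means "let the TTS decide the timing"


@dataclass
class Segment:
    text: str
    pitch: float = DEFAULT_PITCH
    duration: Optional[float] = DEFAULT_DURATION


# Matches tags of the assumed form <'text', timing, pitch>.
TAG = re.compile(
    r"<\s*'([^']*)'\s*,\s*([-+]?\d*\.?\d+)\s*,\s*([-+]?\d*\.?\d+)\s*>"
)


def parse_message(message: str) -> List[Segment]:
    """Split a message into plain segments (defaults) and tagged segments."""
    segments: List[Segment] = []
    pos = 0
    for match in TAG.finditer(message):
        # Untagged text before this tag keeps the default parameters.
        plain = message[pos:match.start()].strip()
        if plain:
            segments.append(Segment(plain))
        # Tagged text uses the parameters given inside the tag.
        segments.append(
            Segment(
                text=match.group(1),
                pitch=float(match.group(3)),
                duration=float(match.group(2)),
            )
        )
        pos = match.end()
    tail = message[pos:].strip()
    if tail:
        segments.append(Segment(tail))
    return segments


for segment in parse_message("chat <'iou', 0.5, 4> chat"):
    print(segment)
```

Untagged text passes through with the defaults, so a normal message would be unaffected. This only covers the parsing side and says nothing about whether the voice model can actually honor the parameters.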

vague pecan
#

There are quite a few problems with this

  1. the latency of a vocal-synthesis-capable system is much greater than that of a regular TTS, and Vedal already doesn't like Evil having higher latency
  2. phoneme data is not something an LLM will know, they're trained on text, not speech, so they're clueless about how something actually sounds. You can easily confuse an LLM with jokes that only work if you can pronounce the text like how it was intended rather than just read it as tokens
  3. if you wanted output that doesn't sound incredibly robotic, it's not enough to generate a single pitch per phoneme, you need a massive rolling list of pitches that forms what is called the pitch curve, which is what separates a song sounding robotic and natural (this is what I've learned from working with NeuroSynth)
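
To give a rough sense of scale for point 3, here is a small sketch comparing one flat pitch per note with a frame-level curve. The 10 ms frame step, the MIDI-style note numbers, the linear glide, and the names `step_curve` and `glide_curve` are all assumptions for illustration; this is not how NeuroSynth or any particular synthesizer builds its curves, and a natural-sounding curve would also need vibrato, overshoot and other detail.

```python
from typing import List, Tuple

FRAME_SECONDS = 0.01  # assumed frame step of 10 ms

Note = Tuple[float, float]  # (pitch as a MIDI-style note number, duration in seconds)


def step_curve(notes: List[Note], frame: float = FRAME_SECONDS) -> List[float]:
    """One flat pitch per note: the 'single pitch per phoneme' case."""
    curve: List[float] = []
    for pitch, duration in notes:
        curve.extend([pitch] * round(duration / frame))
    return curve


def glide_curve(
    notes: List[Note], glide: float = 0.05, frame: float = FRAME_SECONDS
) -> List[float]:
    """Same notes, but each note slides in from the previous pitch."""
    if not notes:
        return []
    curve: List[float] = []
    previous = notes[0][0]
    for pitch, duration in notes:
        frames = round(duration / frame)
        glide_frames = min(frames, round(glide / frame))
        for i in range(frames):
            if i < glide_frames:
                # Linear interpolation from the previous pitch to this one.
                t = (i + 1) / glide_frames
                curve.append(previous + (pitch - previous) * t)
            else:
                curve.append(pitch)
        previous = pitch
    return curve


notes = [(60, 0.2), (64, 0.2)]  # two notes, 0.2 s each
print(len(notes), "notes become", len(glide_curve(notes)), "pitch values")
print(step_curve(notes)[18:26])
print(glide_curve(notes)[18:26])
```

Even this toy version turns two notes into 40 pitch values, which hints at how much more data an LLM would have to produce per word than a single pitch tag.
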
rapid ledge