#methodology for ML virtual camera for vtuber model, if it happens


tough sluice
#

methodology for ML virtual camera for vtuber model, if it happens
If Vedal ever decides to let an ANN control the live2d (or other) model instead of longer loops, here is a method he could use (that is, if he can't mod live2d directly, since I assume their TOS forbids reverse engineering; a plugin might be possible, though). Taken together, the two models below function similarly to an autoencoder.

Avatar control model (I suppose this could also be hooked directly into live2d, if possible)
-> purpose: turn emotions, actions, etc. into a numerical representation of what it would look like for live2d

    Outputs:
  • Same as the inputs for the Camera Fake model (some of those could even be moved here), plus:
    -- arm preset
    -- pupil shape preset
    -- other quick actions
    Inputs:
  • emotions/actions, such as "check chat", "sad", "thank", "wink", "laugh", "nod"
  • speech parameter (drives mouth movements for speech)
    ===
    continued in comment
#

===
Camera Fake model:
-> overall, use human feedback to score outcomes

  • the model first learns how to make an output that tricks live2d into thinking it is a face
    -- possible to start from a previously made face generator - probably a dataset out there
  • the model then learns how to control the avatar
    Output:
  • A video signal for live2d
    Inputs: (note: degrees of freedom don't need to be XY, could be diagonal for example)
  • Head turn/lean vector (3.1d, from front)
    -- forward/back; up/down; left/right; magnitude;
  • Eye look vector: the target of gaze (3.1d, from front)
    -- Left/Right; Up/Down; magnitude
    -- eye angle diff -- the difference between angles of eyes
    -- relative to head, or relative to live2d camera?
  • Body lean vector: (3.1d)
    -- left/right (lean, from front); forward/back; left/right (tilt, from top); magnitude
  • Offset vector: How far from the live2d camera (3.1d) (relative to ??? face, body?)
  • eye color (3d) - HSV or YCrCb value (not RGB)
  • pupil size (1d)
  • eyelash deformation (?.1d vector)
    -- the position of the deformation, the magnitude
    -- either 2 of these, 1 of these (always mirrored), or a diff amount
  • eyelids open percent (2d), eyebrow position (2d)
    -- could be each separate or could be one position and one is offset of the other
  • Mouth state:
    -- open/closed (1d)
    -- deformation (similar to eyelash deformation) (this creates smile/frown)
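
A rough sketch of how the Camera Fake inputs listed above could be packed into one flat vector for a model. The class name, field names, and default values are all made up for illustration. Reading "3.1d" as three direction components plus one magnitude is an assumption, and so is giving each deformation a 2D position plus a magnitude.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed reading of "3.1d": three direction components plus one magnitude.
Vec31 = Tuple[float, float, float, float]


@dataclass
class CameraFakeInput:
    """Hypothetical container for the Camera Fake model's inputs."""
    head: Vec31 = (0.0, 0.0, 0.0, 0.0)        # head turn/lean
    eye_look: Vec31 = (0.0, 0.0, 0.0, 0.0)    # gaze target
    eye_angle_diff: float = 0.0               # difference between eye angles
    body_lean: Vec31 = (0.0, 0.0, 0.0, 0.0)   # lean, forward/back, tilt
    offset: Vec31 = (0.0, 0.0, 0.0, 0.0)      # distance from the live2d camera
    eye_color_hsv: Tuple[float, float, float] = (0.0, 0.0, 1.0)
    pupil_size: float = 0.5
    eyelash_deform: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # x, y, magnitude (assumed)
    eyelids_open: Tuple[float, float] = (1.0, 1.0)                # left, right
    eyebrow_pos: Tuple[float, float] = (0.0, 0.0)                 # left, right
    mouth_open: float = 0.0
    mouth_deform: Tuple[float, float, float] = (0.0, 0.0, 0.0)    # x, y, magnitude (assumed)

    def to_vector(self) -> List[float]:
        """Flatten every field, in declaration order, into one list of floats."""
        parts = [
            self.head, self.eye_look, (self.eye_angle_diff,),
            self.body_lean, self.offset, self.eye_color_hsv,
            (self.pupil_size,), self.eyelash_deform, self.eyelids_open,
            self.eyebrow_pos, (self.mouth_open,), self.mouth_deform,
        ]
        return [float(x) for part in parts for x in part]
```

With these assumptions the vector has 32 entries. The "one position plus an offset" variants mentioned for eyelids and eyebrows would keep the same size and only change what the two numbers mean.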
tough sluice
#

plus, not just face, but position as well

echo light
#

yeah, but I think it's just too much work
to train the whole thing from the ground up

tough sluice
#

well yeah

#

that's why I said

possible to start from a previously made face generator - probably a dataset out there

echo light
#

maybe prompt engineer the LLM and pass it to live2d through triggers

tough sluice
#

you mean like feature visualization

echo light
#

like
bla bla bla *she said furiously* trigger angry

#

yeah

tough sluice
#

the thing about feature visualization is that it only finds the single BEST input, not a range of good ones. for example, if "i'm madge" is what feature visualization finds, then "I'm mad" still wouldn't produce the same output in terms of trigger confidence

echo light
#

I don't get it, where would this be in the response?
what I meant is: fine-tune neuro's LLM to output certain mood tags at the end of the response, like "Wow thank you for the dono [star]", meaning something like star eyes. Then the [star] tag is stripped and "Wow thank you for the dono" is passed through a mood classifier that outputs a 0.0~1.0 vector over the possible emotion tags. The virtual actor system then receives "Wow thank you for the dono" along with, say, [1.0, 0.01, 0.0, 0.1] (you know, excitement, anger, etc.)
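
A minimal sketch of that flow: strip the bracketed tags from the response, then score the remaining text. The function names are hypothetical, the keyword counter is only a stand-in for a real mood classifier, and the mood list beyond excitement and anger is an assumption.

```python
import re
from typing import List, Tuple

# Matches a bracketed tag such as "[star]" plus any whitespace before it.
TAG_PATTERN = re.compile(r"\s*\[(\w+)\]")

# Assumed mood order; only excitement and anger come from the message above.
MOODS = ["excitement", "anger", "sadness", "confusion"]

# Toy keyword lists standing in for a trained mood classifier.
KEYWORDS = {
    "excitement": ("wow", "thank"),
    "anger": ("hate", "angry"),
    "sadness": ("sad", "sorry"),
    "confusion": ("what", "huh"),
}


def split_tags(response: str) -> Tuple[str, List[str]]:
    """Return the response with tags removed, plus the list of tag names."""
    tags = TAG_PATTERN.findall(response)
    text = TAG_PATTERN.sub("", response).strip()
    return text, tags


def classify_mood(text: str) -> List[float]:
    """Score each mood in 0.0~1.0 by counting keyword hits (placeholder logic)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = []
    for mood in MOODS:
        hits = sum(words.count(keyword) for keyword in KEYWORDS[mood])
        scores.append(min(1.0, hits / 2))
    return scores


text, tags = split_tags("Wow thank you for the dono [star]")
mood_vector = classify_mood(text)
# The virtual actor system would receive text, tags, and mood_vector.
```

Here `text` is "Wow thank you for the dono", `tags` is `["star"]`, and the toy classifier scores excitement at 1.0 and everything else at 0.0.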

#

or do you mean that one message can have multiple mood swings???

tough sluice
#

first of all, that's a layer above all of this. The inputs to the Avatar control model are those triggers; the Avatar control model determines what the avatar needs to look like to perform them.
As for that layer, I'm not sure I would call it fine-tuning; it sounds like a fairly large change. What I would do is just add a simple layer that detects what she is saying, instead of changing the LLM. Also consider that the avatar action triggers need access to things the LLM isn't really designed for, such as donation thanks gestures. It would need another input for that.
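
Something like the following could serve as that detection layer. It is only a sketch: the keyword table, the function name, and the `event` argument (standing in for the extra non-LLM input, such as a donation notification) are all hypothetical.

```python
from typing import List, Optional

# Toy keyword table mapping avatar triggers to phrases in the spoken text.
TRIGGER_KEYWORDS = {
    "laugh": ("haha", "lol"),
    "sad": ("sad", "sorry"),
    "thank": ("thank", "thanks"),
}


def detect_triggers(spoken_text: str, event: Optional[str] = None) -> List[str]:
    """Return avatar triggers found in the text, plus any implied by an event."""
    lowered = spoken_text.lower()
    triggers = [
        name
        for name, keywords in TRIGGER_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    # Extra input the LLM does not provide: a donation always asks for a thanks gesture.
    if event == "donation" and "thank" not in triggers:
        triggers.append("thank")
    return triggers
```

The resulting trigger list is what would be fed into the Avatar control model, so the LLM itself stays unchanged.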

grizzled topaz
#

there is no need to simulate a virtual face since you can just change the live2d params directly

tough sluice