Yeah, so my voice is converted to text and sent together with system data in brackets, like [volts: 12.bla bla ] and so on, and he's prompted not to react to those unless asked. As for controlling him, the language model isn't even needed or used here: because of the voice-to-text conversion, I can script him to react to certain keywords that trigger other functions outside of the main function, like face recognition, laser point hunting, etc., before returning to main. No data from the camera is sent to the language model, just the names connected to saved images.
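To make that a bit more concrete, here is a minimal Python sketch of the two pieces: tacking the bracketed system data onto the transcribed text, and checking the transcript for trigger keywords without involving the language model. The function names, the keyword phrases, and the action names are all made up for illustration, not the actual code running on the robot.

```python
# Hypothetical keyword table: spoken phrase -> name of a function outside main.
KEYWORD_ACTIONS = {
    "who is this": "face_recognition",
    "laser": "laser_hunt",
}


def build_message(transcript, telemetry):
    """Append system data in square brackets after the transcribed speech."""
    tags = " ".join(f"[{key}: {value}]" for key, value in telemetry.items())
    return f"{transcript} {tags}".strip()


def route(transcript):
    """Return the action for the first keyword found in the transcript, else None.

    This is plain string matching; the language model is not consulted.
    """
    text = transcript.lower()
    for keyword, action in KEYWORD_ACTIONS.items():
        if keyword in text:
            return action
    return None
```

For example, `build_message("hello", {"volts": 12.4})` gives `hello [volts: 12.4]`, and `route("Chase the LASER")` picks `laser_hunt` while an ordinary sentence returns `None` and just goes to the model as usual.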
I do wait for the language model to respond before going to another function, and every time I add new features I tell him about them in the pre-prompt so he responds in a relevant manner.
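Roughly, that ordering and the pre-prompt update could look like the sketch below. It assumes the model call, the speech playback, and the keyword lookup are passed in as plain functions; `ask_model`, `speak`, `run_action`, and `route` are placeholder names, and the pre-prompt wording is invented for the example.

```python
def build_preprompt(base, features):
    """Extend the base pre-prompt with one line per added feature."""
    lines = [base, "You also have these abilities:"]
    lines += [f"- {name}: {description}" for name, description in features.items()]
    return "\n".join(lines)


def handle_turn(transcript, ask_model, speak, run_action, route):
    """Wait for the model's reply, play it, then run any keyword-triggered action."""
    reply = ask_model(transcript)  # blocks until the language model has responded
    speak(reply)                   # play the response first
    action = route(transcript)     # keyword check happens outside the model
    if action is not None:
        run_action(action)         # then leave main for the other function
    return reply
```

Passing the callables in keeps the sketch independent of any particular speech or model library, so the same loop works whichever ones are actually used.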
Yes, I can tell him to do whatever and he'll do it right away or right after playing the GPT response.