Hey everyone, I built something I think this community will find interesting.
Samuel is a voice-first desktop agent for macOS. He watches your screen (GPT-4o Vision), listens to your system audio (ScreenCaptureKit), and speaks back via the Realtime API.
The novel part: you can ask him to add new tools to himself mid-conversation:
"Hey Samuel, add a weather tool"
→ He proposes what he'll build
→ You approve
→ GPT-4o-mini generates the code, writes it to ~/.samuel/plugins/, hot-loads it via new Function()
→ "Done sir. What city?"
No rebuild. No restart. The tool is live in the same session.
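For anyone curious what the hot-load step can look like, here is a minimal sketch of the general `new Function()` technique, not the actual code from the repo. The `Plugin` shape, the `registry` map, and `loadPluginFromSource` are all names I made up for illustration, and it assumes the generated file assigns its tool to `module.exports`.

```typescript
// Hypothetical plugin shape; the real project's interface may differ.
interface Plugin {
  name: string;
  description: string;
  run: (args: Record<string, unknown>) => unknown;
}

// Hypothetical in-memory registry of tools available in the current session.
const registry = new Map<string, Plugin>();

// Compile plugin source text at runtime and register the tool it exports.
// Assumption: the source assigns { name, description, run } to module.exports.
function loadPluginFromSource(source: string): Plugin {
  const mod: { exports: Partial<Plugin> } = { exports: {} };

  // new Function compiles the string as a function body. It cannot see the
  // local variables here, only globals plus the parameters passed in.
  const factory = new Function("module", "exports", source);
  factory(mod, mod.exports);

  // Read the export after running, since the plugin may reassign module.exports.
  const candidate = mod.exports;
  if (typeof candidate.name !== "string" || typeof candidate.run !== "function") {
    throw new Error("plugin must export a string `name` and a function `run`");
  }

  const plugin: Plugin = {
    name: candidate.name,
    description: typeof candidate.description === "string" ? candidate.description : "",
    run: candidate.run,
  };
  registry.set(plugin.name, plugin); // live immediately, no rebuild or restart
  return plugin;
}

// Usage: the string stands in for generated code read from the plugins folder.
const generated = `
  module.exports = {
    name: "echo",
    description: "Returns the text it was given",
    run: (args) => args.text,
  };
`;
const echo = loadPluginFromSource(generated);
```

One caveat with this technique in general: code compiled this way runs with the same privileges as the rest of the app, so the approval step before codegen is doing real work.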
He also teaches languages by voice while you watch anime/YouTube — sees the subtitles, hears the dialogue, speaks vocabulary out loud without you having to do anything. Memory persists across sessions (preferences, corrections, proficiency level).
Stack: Tauri v2 + React + @openai/agents + Realtime API (WebRTC) + GPT-4o Vision + GPT-4o-mini for plugin codegen
Repo: https://github.com/sambuild04/screen-voice-agent
Would love feedback from anyone who's built on the Realtime API — especially around session management and rolling context.