Hey everyone — wanted to share something I’ve been building solo for a few months and get feedback from this community specifically.
I was learning Japanese by watching anime and reading articles, constantly switching between three tabs and a dictionary, and it was driving me nuts. I wanted an assistant that would just sit next to me: see what I’m seeing, hear what I’m hearing, and answer when I asked. So I built Samuel.
Samuel is an open-source voice AI desktop agent for macOS.
https://github.com/sambuild04/screen-voice-agent
The architecture:
- Voice loop: OpenAI Realtime API over WebRTC, sub-second response
- Screen vision: GPT-4o Vision on captured screenshots, every 20s, with smart change detection so I’m not burning tokens on a static screen
- Tool framework: @openai/agents JS SDK
- Code generation for new tools: GPT-5.5 with reasoning tokens via the Responses API
- Code review before install: GPT-4o-mini
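For anyone curious what screenshot change detection can look like, here is a minimal sketch. It is not the code from the repo: the names (changedFraction, ChangeGate), the frame format (small grayscale buffers, one byte per pixel), and the thresholds are all assumptions made up for illustration.

```typescript
// Hypothetical sketch, not taken from the Samuel codebase.
// Assumes each screenshot was already downscaled to a small grayscale
// buffer (one byte per pixel) before it gets here.

// Fraction of pixels whose brightness moved by more than pixelTolerance.
function changedFraction(
  prev: Uint8Array,
  next: Uint8Array,
  pixelTolerance: number = 16,
): number {
  if (prev.length !== next.length) return 1; // treat a resolution change as a full change
  if (prev.length === 0) return 0;
  let changed = 0;
  for (let i = 0; i < prev.length; i++) {
    if (Math.abs(prev[i] - next[i]) > pixelTolerance) changed++;
  }
  return changed / prev.length;
}

// Remembers the last frame that was sent and decides whether a new
// frame is different enough to be worth a vision request.
class ChangeGate {
  private last: Uint8Array | null = null;
  private readonly threshold: number;

  constructor(threshold: number = 0.05) {
    this.threshold = threshold;
  }

  shouldSend(frame: Uint8Array): boolean {
    if (this.last === null || changedFraction(this.last, frame) >= this.threshold) {
      this.last = frame.slice(); // compare future frames against what was actually sent
      return true;
    }
    return false;
  }
}
```

Comparing against the last sent frame (rather than the previous capture) means slow drift still triggers a request eventually, while a static screen costs nothing.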
A few other things he can do:
- Browser automation via Playwright sidecar (real browser, you sign in, Samuel reads/interacts — no OAuth setup)
- Persistent local memory: preferences, corrections, facts, and saved skills (workflows he’s done before that he can replay)
- Voice-controlled UI: say “make yourself smaller” or “move to the other corner,” it happens live
- Recording mode + transcript Q&A for meetings/lectures
- Proactive reminders: he keeps watching and listening, and alerts you when something you asked about comes up ("Hey Samuel, remind me when you hear an advanced-level Japanese word")
Limitations:
- macOS only right now (depends on ScreenCaptureKit + Peekaboo)
- Plugins run in a new Function() context with full JS access, with no OS-level sandbox. The approval flow is the current security boundary. I’m evaluating Anthropic’s sandbox-runtime and Deno-subprocess approaches.
- Browser sessions don’t persist across launches yet
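To make the plugin limitation concrete, here is a minimal sketch of why an approval step is the only barrier when code is loaded with new Function(). The Plugin type, loadPlugin, and the demo plugin are hypothetical names for illustration and are not Samuel's actual loader.

```typescript
// Hypothetical sketch, not Samuel's real plugin loader.
type Plugin = { name: string; source: string };

// Refuses unapproved plugins; approved source is compiled with new Function().
// Code built this way runs in the global scope of the host process, so it can
// reach globalThis and everything hanging off it.
function loadPlugin(plugin: Plugin, approved: boolean): () => unknown {
  if (!approved) {
    throw new Error(`Plugin "${plugin.name}" was not approved`);
  }
  const factory = new Function(`"use strict"; return (${plugin.source});`);
  return factory() as () => unknown;
}

// A harmless demo plugin that proves it can see host globals.
const demo: Plugin = {
  name: "demo",
  source: "() => typeof globalThis.setTimeout",
};

const run = loadPlugin(demo, true);
console.log(run()); // "function": the plugin can see the host's timers
```

Nothing stops an approved plugin from touching anything else the host process can touch, which is why a subprocess or OS-level sandbox would be a stronger boundary than review alone.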
What I would love feedback on:
Other use cases I’m not seeing? Built this for language learning originally but I keep finding it useful for other things.