# I built a voice AI desktop agent that watches your screen and writes its own tools

alpine viper
#

Hey everyone — wanted to share something I’ve been building solo for a few
months and get feedback from this community specifically.

I was learning Japanese by watching anime and reading articles, switching between three tabs and a dictionary, and it was driving me nuts. I wanted an assistant that would sit next to me, see what I’m seeing, hear what I’m hearing, and answer when I asked. So I built Samuel, an open-source voice AI desktop agent for macOS.

https://github.com/sambuild04/screen-voice-agent

The architecture:

  • Voice loop: OpenAI Realtime API over WebRTC, sub-second response
  • Screen vision: GPT-4o Vision on captured screenshots, every 20s, with smart
change detection so I’m not burning tokens on a static screen
  • Tool framework: @openai/agents JS SDK
  • Code generation for new tools: GPT-5.5 with reasoning tokens via the
Responses API
  • Code review before install: GPT-4o-mini
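
To make the change-detection idea concrete, here is a minimal sketch of one way to skip vision calls on a static screen. It is an illustration under stated assumptions, not the code from the repo: the `Frame` type, `changedFraction`, `ChangeGate`, and both threshold values are hypothetical, and it assumes each screenshot has already been downsampled to a small grayscale thumbnail.

```typescript
// Hypothetical frame type: a downsampled grayscale thumbnail of the screen,
// one byte (0-255) per pixel. How the real app captures frames is not shown here.
type Frame = Uint8Array;

/** Fraction of pixels whose brightness differs by more than `tolerance`. */
function changedFraction(prev: Frame, next: Frame, tolerance = 16): number {
  if (prev.length !== next.length) return 1; // resolution changed: treat as fully changed
  if (prev.length === 0) return 0;
  let changed = 0;
  for (let i = 0; i < prev.length; i++) {
    if (Math.abs(prev[i] - next[i]) > tolerance) changed++;
  }
  return changed / prev.length;
}

/** Decides whether a newly captured frame is worth sending to the vision model. */
class ChangeGate {
  private lastSent: Frame | null = null;
  private readonly threshold: number;

  constructor(threshold = 0.05) {
    this.threshold = threshold; // assumed value: 5% of pixels must change
  }

  shouldAnalyze(frame: Frame): boolean {
    // Compare against the last frame that was actually sent, so slow drift
    // across several captures still adds up and eventually triggers a call.
    if (this.lastSent === null || changedFraction(this.lastSent, frame) >= this.threshold) {
      this.lastSent = frame;
      return true;
    }
    return false;
  }
}
```

On a 20-second capture timer, the caller would only invoke the vision model when `shouldAnalyze` returns `true`, so a screen that has not changed costs no tokens.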

A few other things he can do:

  • Browser automation via Playwright sidecar (real browser, you sign in, Samuel
reads/interacts — no OAuth setup)
  • Persistent local memory: preferences, corrections, facts, and saved skills
(workflows he’s done before that he can replay)
  • Voice-controlled UI: say “make yourself smaller” or “move to the other
corner,” it happens live
  • Recording mode + transcript Q&A for meetings/lectures
  • Active watching and listening, with reminders on request (“Hey Samuel, remind me when you hear an advanced-level Japanese word”)

Limitations:

  • macOS only right now (depends on ScreenCaptureKit + Peekaboo)
  • Plugins run in a new Function() context with full JS access — not OS
sandboxed. The approval flow is the current security boundary. Evaluating
Anthropic’s sandbox-runtime and Deno-subprocess approaches.
  • Browser sessions don’t persist across launches yet
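
The plugin limitation can be illustrated with a short sketch of why an approval step is the only barrier when code is loaded through `new Function()`. The `PluginSource` shape, `installPlugin`, and the `approvedByUser` flag are hypothetical names for illustration; the project's real plugin format and approval flow may look different.

```typescript
// Hypothetical shape for a generated plugin; the real project's format is not shown.
interface PluginSource {
  name: string;
  body: string; // JavaScript function body that receives one argument, `input`
}

type PluginFn = (input: unknown) => unknown;

function installPlugin(plugin: PluginSource, approvedByUser: boolean): PluginFn {
  // The approval check is the only gate: nothing below restricts what the code can do.
  if (!approvedByUser) {
    throw new Error(`Plugin "${plugin.name}" was not approved`);
  }
  // new Function() compiles the body in the global scope. It cannot see local
  // variables of installPlugin, but it can reach globalThis and everything on it.
  return new Function("input", plugin.body) as PluginFn;
}

const double = installPlugin({ name: "double", body: "return input * 2;" }, true);
console.log(double(21)); // 42

// An approved plugin can touch any global API the host process exposes.
const probe = installPlugin(
  { name: "probe", body: "return typeof globalThis.setTimeout;" },
  true,
);
console.log(probe(null)); // "function"
```

Because the compiled function runs with the same privileges as the host page or process, moving execution to a subprocess or an OS-level sandbox, as mentioned above, would add a boundary that does not depend on the reviewer catching everything.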

What I would love feedback on:

Other use cases I’m not seeing? Built this for language learning originally but I keep finding it useful for other things.

GitHub

Open-source macOS desktop AI agent (Tauri + React + OpenAI Realtime API) that watches your screen, listens to system audio, and teaches languages by voice. Self-modifying -- writes new tools at run...

torpid oxide
#

do you have any testimonials?

alpine viper
# torpid oxide do you have any testimonials?

So far I have only shown demo videos on social media, which got a good number of likes and a few interested users. Right now I am fixing bugs and trying to make it "install and ready to use" with no setup, for the best possible user experience. Anything else you would be interested in?

torpid oxide
#

I downloaded Wispr for dictation and tried to command my computer with Talon, but Wispr + Talon ended up messing up my input when you combine Discord, I have an old PC, blah, blah, blah. If this runs efficiently and is native to ChatGPT, you have a customer in me

alpine viper
# torpid oxide I downloaded Wispr for dictation, I tried to command my computer with Talon but ...

Hey, thanks for the message, that's exactly the friction I'm trying to solve. Quick questions before I get back to you properly: are you on Mac or PC? And when you say 'native to ChatGPT,' do you mean using your existing ChatGPT subscription, or just using GPT models in general? Asking because my agent runs on Mac and uses the OpenAI API directly at the moment, though that could be changed with more implementation effort.

torpid oxide
#

too bad about macOS only, all good, should you build out something for Windows in the future lmk!