#Playground for Realtime API / speech-to-speech


torn vine
#

https://playground.livekit.io/

tl;dr: a mobile-friendly, browser-based playground for the Realtime API - make your own ChatGPT Advanced Voice demo with whatever system prompt you want. Requires you to bring your own Realtime API key.

--

I’ve been fortunate to have had early access to the Realtime API through my work at LiveKit, where we’ve built open-source developer tooling that makes building on this model as easy as possible.

I thought it would also be fun to build a “playground” environment, partially to dogfood our own tooling but largely because I just wanted to play with the model. This playground is freely available to anyone to try, and comes loaded up with a bunch of fun demos of the model’s unique capabilities that I’ve put together.

What blew my mind is how much mileage you can get out of the system prompt alone in this API. Here are some use-cases that are at least halfway to a complete MVP:

  • "Customer Support": A complete phone support agent for the playground
  • "Spanish Tutor": A bilingual language-learning demo
  • "Meditation Coach": It can actually pause and resume speech all on its own as it guides you through a meditation routine

Also some fun (and a bit irreverent…) demos of its style and non-verbal capabilities:

  • "Smoker’s Rasp": It can cough and speak like it’s been smoking three packs a day for 30 years (my favorite, lol)
  • "Unconfident Assistant": Umms, buts, and more - surprisingly lifelike
  • "Opera Singer": The best singing demo I’ve been able to compose (but still not quite what they showed off back in May…)

The playground doesn’t store anything anywhere besides your browser but you can share anything fun you put together with a link that encodes your config into URL params.
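A minimal sketch of how a config can round-trip through URL params. The `PlaygroundConfig` shape, the parameter names, and the fallback defaults are all hypothetical; the playground's real parameters may differ.

```typescript
// Hypothetical config shape, for illustration only.
interface PlaygroundConfig {
  instructions: string;
  voice: string;
  temperature: number;
}

// Serialize the config into a shareable link.
function encodeConfig(baseUrl: string, config: PlaygroundConfig): string {
  const params = new URLSearchParams({
    instructions: config.instructions,
    voice: config.voice,
    temperature: String(config.temperature),
  });
  return `${baseUrl}?${params.toString()}`;
}

// Read the config back out of a shared link, with assumed defaults
// for any parameter that is missing.
function decodeConfig(url: string): PlaygroundConfig {
  const params = new URL(url).searchParams;
  return {
    instructions: params.get("instructions") ?? "",
    voice: params.get("voice") ?? "alloy",
    temperature: Number(params.get("temperature") ?? "0.8"),
  };
}
```

Since everything lives in the query string, nothing has to be stored server-side; the trade-off is that very long prompts make for very long links.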

Lastly - if you’re even more curious how this was built or want to tweak or adapt it for yourself, the whole project and every dependency is open-source (link in footer!).

Speech-to-speech playground for OpenAI's new Realtime API. Built on LiveKit Agents.

sudden crow
#

Hey can you give me a TLDR on how you handle interruptions and truncate responses?

sudden crow
#

this is the openai example

#

And I don't want it to say "truncated" - instead, show me the text up to the point where I interrupted

torn vine
#

yeah great question - this is definitely the most immediately apparent tricky thing about adopting this API once you get into it.

Not sure how to summarize this too much but here's how it works:

You need to buffer output audio frames and output text so they can be played back together. As audio frames get played, you advance a "cursor" into the output text and print the additional characters. Unfortunately, text content isn't generated with timings, so you have to make some guesses. You can count raw characters or use some sort of tokenizer; either way you need an initial guess at playback speed. Once all of the text and all of the audio are buffered, you can replace that guess with basically len(text) / len(audio_frames) and forward that many characters per frame.
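A minimal sketch of that cursor idea, using raw character counts. The `TextCursor` class and its method names are hypothetical and are not taken from the agents-js source; it only illustrates "guess first, then correct once everything is buffered".

```typescript
// Hypothetical helper: reveals text in step with audio playback.
class TextCursor {
  private text = "";
  private shown = 0;
  private charsPerFrame: number;

  constructor(initialCharsPerFrame: number) {
    // Initial guess at playback speed, in characters per audio frame.
    this.charsPerFrame = initialCharsPerFrame;
  }

  // Buffer more output text as it streams in.
  append(delta: string): void {
    this.text += delta;
  }

  // Once all text and audio are buffered, swap the guess for the real ratio.
  finalize(totalFrames: number): void {
    if (totalFrames > 0) {
      this.charsPerFrame = this.text.length / totalFrames;
    }
  }

  // Call as frames get played; returns only the newly revealed characters.
  advance(framesPlayed: number): string {
    const target = Math.min(
      this.text.length,
      Math.floor(framesPlayed * this.charsPerFrame),
    );
    if (target <= this.shown) return "";
    const out = this.text.slice(this.shown, target);
    this.shown = target;
    return out;
  }
}
```

Rounding down with `Math.floor` keeps the text slightly behind the audio rather than ahead of it, and the cursor never moves backwards if the corrected rate turns out lower than the guess.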

Interruptions are signaled by input_audio_buffer.speech_started. When you receive that event, immediately pause playback of text and audio, then read the number of played frames, convert it to milliseconds, and send conversation.item.truncate with that value as audio_end_ms.
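A rough sketch of the interruption path. The event names and the `item_id` / `content_index` / `audio_end_ms` fields come from the Realtime API; the frame size, the `PlaybackState` shape, and the `send` callback are assumptions for illustration, so adjust them to your own audio pipeline.

```typescript
// Assumptions: 24 kHz mono output, fixed 480-sample (20 ms) frames.
const SAMPLE_RATE_HZ = 24000;
const SAMPLES_PER_FRAME = 480;

interface TruncateEvent {
  type: "conversation.item.truncate";
  item_id: string;
  content_index: number;
  audio_end_ms: number;
}

// Hypothetical playback state kept by the client.
interface PlaybackState {
  itemId: string; // assistant item currently being played
  framesPlayed: number; // frames actually sent to the speaker
  playing: boolean;
}

// Convert played frames into milliseconds of audio heard by the user.
function playedMs(framesPlayed: number): number {
  return Math.floor((framesPlayed * SAMPLES_PER_FRAME * 1000) / SAMPLE_RATE_HZ);
}

function buildTruncateEvent(state: PlaybackState): TruncateEvent {
  return {
    type: "conversation.item.truncate",
    item_id: state.itemId,
    content_index: 0,
    audio_end_ms: playedMs(state.framesPlayed),
  };
}

// On speech_started: pause playback first, then report what was really heard.
function onServerEvent(
  event: { type: string },
  state: PlaybackState,
  send: (e: TruncateEvent) => void,
): void {
  if (event.type !== "input_audio_buffer.speech_started" || !state.playing) {
    return;
  }
  state.playing = false; // stop text and audio playback
  send(buildTruncateEvent(state));
}
```

In a real client, `send` would serialize the event to JSON and write it to the Realtime connection.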

The biggest gotcha is when your text is running ahead of your audio, because truncation is based on audio alone. The user may then see more text than was actually played as audio, and the model won't know. So it's better to be conservative on the text playback side, I suppose.

Hopefully they add timings someday. Until then, this is the best you can do without adding more models on your own side...

#

Here's the TypeScript source for the synchronized playback in this project: https://github.com/livekit/agents-js/blob/main/agents/src/multimodal/agent_playout.ts

Truncation occurs here: https://github.com/livekit/agents-js/blob/main/agents/src/multimodal/multimodal_agent.ts#L244

i'd really recommend using something like LiveKit in your project. There's a reason OpenAI uses LiveKit for ChatGPT voice mode... 😄

sudden crow
#

Can I self-host LiveKit on an EC2 instance? And how long would it take me to get set up?

#

And do i get all the features if i self-host?

torn vine
#

Makes sense - yeah self hosting is pretty easy. There’s not much that’s not in the OSS version.

Also if you share more in DM about your use case I’ll forward to the pricing team - feedback is always valuable to us and we just rolled out a new pricing model recently so we’re definitely looking to improve it.

sudden crow
#

Also I've noticed that OpenAI's advanced voice mode is much better than the api

#

Any reason for that?

#

Like it's faster and displays emotion better than the api & less random cutoffs/glitches

torn vine
#

I'm not sure - I think you should expect that the API will improve quickly though. My guess (just a guess) is that they have more filtering and restrictions on the public API than on the private one used for AVM, just so they can monitor it and err on the safe side. Once they've got a few weeks or maybe a few months of experience with it running in the wild, they may relax those rules where they can

#

but yeah i see "content filter" all the time when using the API, and never have a similar experience in AVM

violet grail
#

Hey, this project is interesting... can someone explain where I set the system prompt for the agent with the Realtime API?

worthy shoal
#

Can this be used for RAG?

torn vine
#

in the realtime API they are now using the term "instructions" to describe the base prompt included on all requests. they do still support a "system" role in chat but I think they intend you to put your full "system prompt" in the instructions field going forward.
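If you're talking to the Realtime API directly, a minimal sketch of setting instructions is a session.update event like the one below. The `buildSessionUpdate` helper is hypothetical; where exactly this gets set inside a given agent framework may differ.

```typescript
interface SessionUpdateEvent {
  type: "session.update";
  session: { instructions: string };
}

// Hypothetical helper: wraps the base prompt in a session.update event.
function buildSessionUpdate(instructions: string): SessionUpdateEvent {
  return { type: "session.update", session: { instructions } };
}

// Serialized payload, ready to be written to an open Realtime connection.
const payload = JSON.stringify(
  buildSessionUpdate("You are a patient Spanish tutor."),
);
```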

#

@worthy shoal not directly - this playground is just a demo for switching up the model instructions and other parameters. but you can implement RAG using the realtime API with function calling
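A rough sketch of what RAG via function calling could look like. The tool name `search_docs`, the in-memory `DOCS` array, and the keyword-overlap scoring are all made up for illustration; a real setup would query a vector store or search index. The event types are the Realtime API's function-calling events.

```typescript
// Toy "knowledge base" (hypothetical content).
const DOCS: string[] = [
  "Refunds are processed within 5 business days.",
  "Support hours are 9am to 5pm on weekdays.",
];

// Tool definition to include under session.tools in a session.update event.
const searchTool = {
  type: "function",
  name: "search_docs", // hypothetical tool name
  description: "Look up a passage relevant to the user's question.",
  parameters: {
    type: "object",
    properties: { query: { type: "string" } },
    required: ["query"],
  },
};

// Naive retrieval: return the doc sharing the most words with the query.
function searchDocs(query: string): string {
  const words = query.toLowerCase().split(/\W+/).filter(Boolean);
  let best = "";
  let bestScore = 0;
  for (const doc of DOCS) {
    const lower = doc.toLowerCase();
    const score = words.filter((w) => lower.includes(w)).length;
    if (score > bestScore) {
      best = doc;
      bestScore = score;
    }
  }
  return best;
}

interface FunctionCallDone {
  type: "response.function_call_arguments.done";
  call_id: string;
  arguments: string; // JSON-encoded arguments produced by the model
}

interface FunctionOutputEvent {
  type: "conversation.item.create";
  item: { type: "function_call_output"; call_id: string; output: string };
}

interface ResponseCreateEvent {
  type: "response.create";
}

// Run the lookup, hand the result back, then ask the model to respond.
function handleFunctionCall(
  event: FunctionCallDone,
): [FunctionOutputEvent, ResponseCreateEvent] {
  const args = JSON.parse(event.arguments) as { query: string };
  const output = searchDocs(args.query);
  return [
    {
      type: "conversation.item.create",
      item: { type: "function_call_output", call_id: event.call_id, output },
    },
    { type: "response.create" },
  ];
}
```

The two returned events would be sent back over the connection in order, so the model can speak an answer grounded in the retrieved passage.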

cobalt delta
#

I immediately had an idea: an interactive Santa talking to children, guided by a prompt set by the parents. Children would believe in Santa once again haha