If I wanted to build a conversational voice agent with manual control over context, what are the practical differences between the Realtime API, the Responses API with an Audio-Audio capable model, and the Completions API with an Audio-Audio model?
The voice agents post talks about using either the Realtime API or a chained approach. What is the practical difference between the two, considering that the Realtime and Completions APIs can also stream responses?