GPT-4o
GPT-4o is a new flagship model that can reason across audio, vision, and text in real time.
https://openai.com/index/hello-gpt-4o/
GPT-4o is capable of accepting any combination of text, audio, and image inputs and generating corresponding outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.