Everything to know about OpenAI’s GPT-4o with powerful voice and vision capabilities

OpenAI releases GPT-4o. (Image: OpenAI)

OpenAI’s ChatGPT got a major upgrade with the release of the new GPT-4o model, where the “o” stands for “omni” (meaning all or everything). The new AI model can communicate in real time through any combination of text, audio, image, and video, and speaks in a more human-like voice.

OpenAI’s new conversational assistant is rolling out over the next few weeks and will be free for all users through both the ChatGPT app and the web interface, according to the company. ChatGPT Plus users, who subscribe to OpenAI’s $20-a-month paid tier, will be able to make more requests.

OpenAI launched its new model just ahead of Google I/O, the tech giant’s flagship developer conference, with plenty of updates expected on Gemini, Google Search, and Android.

What is OpenAI’s GPT-4o?

GPT-4o is OpenAI’s latest Large Language Model (LLM), and it can handle prompts that blend text, audio, images, and video seamlessly. Unlike its predecessors, which relied on separate models for different content types, it handles all of them in a single model and surpasses GPT-4 Turbo in both capability and performance. Alongside traditional text generation tasks like summarization and Q&A, it excels in reasoning, coding, and solving complex math problems.

The model is “natively multimodal,” which means a single model can generate content or understand commands across voice, text, and images.
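
For developers, native multimodality means a single API request can mix text and image inputs. Below is a minimal sketch using the OpenAI Python SDK; the prompt and image URL are illustrative placeholders, and exact parameters may vary by SDK version.

```python
# Minimal sketch: one request to GPT-4o that combines text and an image.
# Assumes the `openai` Python package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # The question and image URL below are illustrative placeholders.
                {"type": "text", "text": "What equation is written on this whiteboard?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/whiteboard.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```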

OpenAI CEO Sam Altman said the most exciting advancement is the model’s ability to interact with people through voice and video. “It feels like AI from the movies; and it’s still a bit surprising to me that it’s real. Getting to human-level response times and expressiveness turns out to be a big change,” he explained in his personal blog.

Key capabilities of GPT-4o

OpenAI Chief Technology Officer (CTO) Mira Murati led the live demonstration of the new release. GPT-4o merges the capabilities of previous OpenAI offerings into a single “omnimodel,” which should lead to faster responses and smoother transitions between tasks.

“We’re looking at the future of interaction between ourselves and the machines,” Murati said of the demo. “We think that GPT-4o is really shifting that paradigm into the future of collaboration, where this interaction becomes much more natural.”

Barret Zoph and Mark Chen, researchers at OpenAI, detailed several applications for the new model. Most impressive is the live conversation feature, where you can interrupt the model during its responses and it will stop, listen, and respond accordingly. “With new real-time conversational speech functionality, you can interrupt the model, you don’t have to wait for a response, and the model picks up on your emotions,” said Chen, head of frontiers research at OpenAI.

The model can solve visual problems instantly. Zoph used his phone to film himself writing a math equation on paper, with GPT-4o acting as a guide, explaining steps like a teacher would.


In text tasks, GPT-4o demonstrates slightly improved or comparable performance to other leading language models such as previous GPT-4 versions, Anthropic’s Claude 3 Opus, Google’s Gemini, and Meta’s Llama3, based on benchmark results released by OpenAI.

In another video, Sal Khan’s son Imran works through a question on Khan Academy about finding the sine of an angle in a right triangle. He streams his screen to ChatGPT, which uses the multimodal input to guide him toward a solution. ChatGPT engages Imran by asking questions such as “Which side do you think is the hypotenuse?” and accurately interprets his drawings and responses, providing corrective feedback in a conversational tone.

What can GPT-4o do?

The new version is faster and more engaging, interpreting images, translating languages, recognizing emotions, and recalling previous interactions. You can even interrupt GPT-4o mid-response, and it will pause, listen, and adjust course.

  1. Real-time interactions: Engaging in verbal conversations seamlessly without noticeable delays.
  2. Multimodal reasoning and generation: Integrating text, voice, and vision for processing and responding to various data types simultaneously.
  3. Knowledge-based Q&A: Answering queries from its trained knowledge base, including reasoning, coding, and complex math problems.
  4. Reduced hallucination and improved safety: Minimizing the generation of incorrect or misleading information, along with enhanced safety protocols for user well-being.
  5. Memory and contextual awareness: Remembering previous interactions and maintaining context over longer conversations.
  6. Language and audio processing: It can handle over 50 languages proficiently.
  7. Text summarization and generation: Executing common text LLM tasks efficiently.
  8. Voice nuance: Generating speech with emotional inflection for more expressive communication.
  9. Real-time translation: Supporting real-time translation between languages (a minimal API sketch follows this list).
  10. Data analysis: Analyzing and creating data charts based on prompts.
  11. Large context window: Supporting a context window of up to 128,000 tokens for coherence in longer conversations or documents.
  12. Rapid audio response: GPT-4o can respond to audio inputs swiftly, with a response time as short as 232 milliseconds, similar to human conversation response times. It can also generate and comprehend spoken language for various applications.
  13. Enhanced performance: It matches GPT-4 Turbo’s performance in English text and code, with significant improvements in understanding non-English text, while also being faster and more cost-effective in API usage.
  14. Improved vision and audio understanding: GPT-4o exhibits superior capabilities in comprehending and processing visual and audio inputs compared to previous models.
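
As a rough illustration of items 9 and 13, here is a hedged sketch of a streaming translation request to GPT-4o, again using the OpenAI Python SDK; the language pair and prompt are made up for the example.

```python
# Minimal sketch: stream a translation from GPT-4o so partial output appears
# as it is generated. Assumes the `openai` package and OPENAI_API_KEY are set up.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # The language pair and user message are illustrative placeholders.
        {"role": "system", "content": "Translate the user's message from English to Spanish."},
        {"role": "user", "content": "Where is the nearest train station?"},
    ],
    stream=True,  # receive tokens as they are produced instead of waiting for the full reply
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```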

How to access GPT-4o?

In a May 13 blog post, OpenAI said GPT-4o’s capabilities “will be rolled out iteratively,” with its text and image capabilities already starting to roll out in ChatGPT.

GPT-4o is now available to everyone, including free-tier users, for generating text. To access it, simply log into your ChatGPT account via a web browser, and select the GPT-4o option from the drop-down menu in the top left-hand corner.

While GPT-4o is accessible without a subscription, ChatGPT Plus offers more prompts and newer features. Subscribers can send GPT-4o five times more prompts than nonsubscribers.

ChatGPT’s desktop app for Mac computers will first be available to Plus subscribers, likely starting this week.
