Introducing GPT-4o: New Capabilities Making ChatGPT Better Than Ever


Since releasing its first GPT model in 2018, OpenAI has continuously researched and developed its family of AI models. Across four versions, from GPT-1 to GPT-4, each release has made great strides in language processing and generation, with better accuracy and naturalness. Notably, GPT-4, the latest version before GPT-4o, was recognized as the most advanced model in the series.

Most recently, on May 13, 2024, OpenAI introduced GPT-4o, its new flagship model. This is a huge step in improving the natural interaction between humans and computers, significantly surpassing previous models.

What is GPT-4o? 

GPT-4o, where the “o” stands for “omni” (Latin for “all”), is a powerful update to OpenAI’s previous GPT-4 model, demonstrated on May 13, 2024. Mira Murati of OpenAI has said that GPT-4o has the same intelligence level as GPT-4. Thus, GPT-4o not only inherits the intelligence and features of GPT-4 but also adds new capabilities. It is available to all users, whether paying or not.

GPT-4o is one of OpenAI’s major iterations of multimodal models. It can handle various input and output modalities, such as text, audio, images, and even video. All these capabilities are combined into a single model, allowing GPT-4o to work twice as fast as previous versions. In contrast, earlier ChatGPT versions had to chain multiple separate models, each handling a specific task such as converting speech to text, text to speech, or text to images.
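As a concrete illustration of the single-model approach, one request to GPT-4o can carry several modalities at once. The sketch below builds one user message mixing a text part and an image part in the shape used by OpenAI’s Chat Completions API; the helper function name is ours, and the image URL is a placeholder.

```python
def build_multimodal_message(text: str, image_url: str) -> dict:
    """Build one user message containing both a text part and an image reference,
    in the content-parts shape accepted by OpenAI's Chat Completions API."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_multimodal_message(
    "What is in this picture?",
    "https://example.com/photo.jpg",  # placeholder URL
)
```

A message like this would then be passed to the model in a single call (e.g. `client.chat.completions.create(model="gpt-4o", messages=[msg])`), rather than routed through separate speech and vision models.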

To put it simply, before GPT-4o, using Voice Mode meant your speech was transcribed to text by Whisper, GPT-4 Turbo generated a text response, and that text was converted back to speech by a TTS model. For images in ChatGPT, GPT-4 Turbo was combined with DALL-E 3.

Together with the 2x speed, GPT-4o costs half as much as GPT-4 Turbo for both input tokens ($5 per million) and output tokens ($15 per million), making it the cheapest of OpenAI’s flagship models yet.
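Using the per-million-token prices quoted above, the cost of a call is easy to estimate. This is a minimal sketch based only on the article’s figures; the function name is ours, and actual billing may differ (e.g. cached or batch pricing).

```python
# Prices quoted in the article, in USD per one million tokens.
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 15.00

def gpt4o_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a GPT-4o API call from token counts."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# e.g. a call with 10,000 input tokens and 2,000 output tokens:
print(round(gpt4o_cost(10_000, 2_000), 4))  # → 0.08
```

At these rates, a million tokens in and a million out together cost $20, half of what the same traffic cost on GPT-4 Turbo ($10 in, $30 out).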

What new capabilities make GPT-4o better than ever?

GPT-4o introduces a range of new capabilities: real-time voice conversations, improved comprehension across input and output modalities, broader multilingual support, and a new desktop app made specially for macOS users.

Real-time voice conversations

Before GPT-4o, using Voice Mode to interact with ChatGPT involved a pipeline of three separate models. As a result, responses were noticeably delayed, with average latencies of 2.8 seconds for GPT-3.5 and 5.4 seconds for GPT-4. With the capability to hold real-time voice conversations, GPT-4o marks a significant advancement in Voice Mode. Unlike previous ChatGPT versions, all inputs and outputs are processed by the same neural network, which means users receive responses almost instantly.

Moreover, GPT-4o can mimic human voices and understand the nuances, tone, and emotion in a speaker’s voice. As such, it tries to respond the way a human would. All these features aim to create natural, smooth conversations, improving the interaction between humans and computers.

For example, imagine you’re using GPT-4o to practice for a job interview and your voice betrays some anxiety about a difficult question concerning your work experience. GPT-4o can detect your anxious tone and respond in a calm, reassuring voice to help you relax and feel more supported.

Improved comprehension 

Furthermore, GPT-4o can analyze and understand different input modalities better. It can handle not only text and images but also audio and video inputs, thanks to the improvements in Voice Mode. The model reads the input, analyzes the speaker’s intonation and intent, and responds the way the user desires. For instance, if you want the chatbot to be friendly and relaxed, GPT-4o can even joke around with you.

Multilingual support 

GPT-4o’s multilingual capability has been significantly improved. It supports 50 different languages worldwide and offers an API twice as fast as the GPT-4 API, enhancing its ability to understand and generate text in multiple languages.

Along with real-time voice conversation capability and multilingual support, GPT-4o also offers real-time language translation. This feature is widely applicable in international communication, customer service, etc.

macOS desktop app

The GPT-4o desktop app for macOS is available for download now, with keyboard shortcuts and screenshot support. Users can use shortcuts to quickly execute prompts, such as sending requests or switching between features of the application. These additions make it easier to use than ever before. Windows users need not worry: a Windows version is planned by the end of 2024.

GPT-4o’s limitations and safety concerns

GPT-4o is OpenAI’s most advanced model, but it still has limitations. According to OpenAI’s official blog, the model is still in the early stages of exploring its multimodal interaction capabilities, and its audio output has drawbacks, being available only with a few preset voices. GPT-4o therefore needs further development and updates.

In terms of safety, GPT-4o is rated at a medium risk level for issues such as cybersecurity, misinformation, and bias. Although GPT-4o incorporates safety measures like filtered training data and refined post-training model behavior, and underwent thorough review and evaluation before release, these areas still require careful attention.

GPT-4o’s Performance vs Other Models

GPT-4o is compared with other high-end models:

  • GPT-4 Turbo
  • Claude 3 Opus
  • Gemini Pro 1.5
  • Llama 3 400B

across six benchmarks:

  • Massive Multitask Language Understanding (MMLU): Covering 57 academic subjects, including mathematics, philosophy, law, and medicine.
  • Graduate-Level Google-Proof Q&A (GPQA): Testing multiple-choice questions in chemistry, physics and biology, known for their high difficulty.
  • MATH: Evaluating mathematical problem-solving skills at middle and high school levels.
  • HumanEval: Examining the functional correctness of computer-generated code.
  • Multilingual Grade School Math (MGSM): Assessing grade-school mathematics in ten languages, including less commonly represented ones.
  • Discrete Reasoning Over Paragraphs (DROP): Evaluating comprehension and reasoning abilities through questions based on complete paragraphs.
[Figure] Performance of four models (GPT-4o, GPT-4 Turbo, Gemini Pro 1.5, and Claude 3 Opus) across six LLM benchmarks. Data provided by OpenAI.

According to data provided by OpenAI, GPT-4o achieves the highest score in four of the six benchmarks, while Claude 3 Opus and GPT-4 Turbo outperform it in MGSM and DROP, respectively. Overall, GPT-4o’s performance is a remarkable result for the multimodal training approach.

While the improvements from GPT-4 to GPT-4o are notable, they are not as dramatic as earlier jumps such as GPT-1 to GPT-2 or GPT-2 to GPT-3. This suggests that further gains in text reasoning may be increasingly hard to achieve.

However, these benchmarks do not fully capture performance on multimodal tasks, which integrate text, audio, and visual inputs. Because this field is new and still developing, there is a lack of standards for evaluating models across text, audio, and vision.

Overall, GPT-4o’s performance demonstrates the high potential of multimodal training and represents an impressive step forward in AI capabilities.


  1. Is GPT-4o free to all users? 

    Yes, GPT-4o is free to all users. However, free users have a lower message limit per day, while paid users get a limit five times higher: 80 messages every 3 hours.

  2. What makes GPT-4o different from previous GPT models?

    The multimodal capabilities make GPT-4o different from previous GPT models: they allow GPT-4o to handle various input and output modalities, such as text, audio, images, and even video.


By introducing GPT-4o, OpenAI once again demonstrates the immense potential of artificial intelligence in natural language processing and multimodal interaction. GPT-4o, an updated successor to GPT-4, features real-time voice conversations, improved comprehension, multilingual support, and a macOS desktop application, making it the most advanced model to date. With these new capabilities, ChatGPT promises to attract even more users, beyond the more than 100 million OpenAI has already reported.

Elmer Alasteir


Elmer Alasteir is an AI Expert and Consultant at Quarule. He has over 10 years of hands-on experience in developing AI solutions for businesses.