In the dynamic landscape of AI language models, the unveiling of GPT-4o by OpenAI marks a significant milestone. This article delves into an independent analysis comparing GPT-4o, GPT-4, and Gemini 1.5, focusing on their English language understanding capabilities and performance.
Measuring English Language Understanding of OpenAI’s New Flagship Model
OpenAI’s recent unveiling of GPT-4o has ushered in a new era in AI language models and how we interact with them. The most impressive feature is the support for live interaction with ChatGPT, allowing conversational interruptions. Despite some hiccups during the live demo, the accomplishments of the OpenAI team are nothing short of amazing. Best of all, OpenAI provided immediate access to the GPT-4o API following the demo.

In this article, I present my independent analysis measuring the classification abilities of GPT-4o against GPT-4 and Google’s Gemini and Unicorn models using a custom English dataset.
Which Model Excels in English Understanding?
What’s New with GPT-4o?
At the forefront is the concept of an Omni model, designed to seamlessly comprehend and process text, audio, and video. OpenAI appears to be focusing on democratizing GPT-4 level intelligence, making it accessible even to free users. GPT-4o promises enhanced quality and speed across more than 50 languages, offering a more inclusive and globally accessible AI experience at a lower cost. Paid subscribers will enjoy five times the capacity compared to non-paid users. Additionally, a desktop version of ChatGPT is set to facilitate real-time reasoning across audio, vision, and text interfaces for the masses.

How to Use the GPT-4o API
The new GPT-4o model adheres to the existing chat-completion API from OpenAI, ensuring backward compatibility and ease of use.
import asyncio

from openai import AsyncOpenAI

OPENAI_API_KEY = "<your-api-key>"

def openai_chat_resolve(response, strip_tokens=None) -> str:
    """Return the text of the first choice, with any strip_tokens removed."""
    if strip_tokens is None:
        strip_tokens = []
    if response and response.choices:
        content = response.choices[0].message.content
        # message.content can be None, so check it before calling strip()
        if content is not None and content.strip() != '':
            content = content.strip()
            for token in strip_tokens:
                content = content.replace(token, '')
            return content
    raise Exception(f'Cannot resolve response: {response}')

async def openai_chat_request(prompt: str, model_name: str, temperature=0.0):
    """Send a single user message to the chat-completion endpoint."""
    message = {'role': 'user', 'content': prompt}
    client = AsyncOpenAI(api_key=OPENAI_API_KEY)
    return await client.chat.completions.create(
        model=model_name,
        messages=[message],
        temperature=temperature,
    )

# openai_chat_request is a coroutine, so it must be run inside an event loop
response = asyncio.run(openai_chat_request(prompt="Hello!", model_name="gpt-4o-2024-05-13"))
print(openai_chat_resolve(response))

GPT-4o is also available via the ChatGPT interface.
Official Evaluation
OpenAI’s blog post includes evaluation scores from known datasets such as MMLU and HumanEval. According to the graph, GPT-4o’s performance is classified as state-of-the-art, which is promising considering the new model is cheaper and faster. However, it’s essential to independently verify these claims, as some models might be partially trained or overfit on these open datasets, leading to unrealistic leaderboard scores.
My Evaluation Dataset
I created a topic dataset to measure classification performance across different language models. The dataset consists of 200 sentences categorized under 50 topics, with some closely related to make classification tasks more challenging. I manually created and labeled the entire dataset in English and used GPT-4 (gpt-4-0613) to translate it into multiple languages. For this evaluation, only the English version is used to avoid biases from using the same language model for dataset creation and topic prediction.
Performance Results
I evaluated the following models:
- GPT-4o: gpt-4o-2024-05-13
- GPT-4: gpt-4-0613
- GPT-4-Turbo: gpt-4-turbo-2024-04-09
- Gemini 1.5 Pro: gemini-1.5-pro-preview-0409
- Gemini 1.0: gemini-1.0-pro-002
- PaLM 2 Unicorn: text-unicorn@001

The task was to match each sentence in the dataset with the correct topic, which yields an accuracy score and an error rate for each model. Since only the English version of the dataset is used here, there is one score per model. A lower error rate indicates better model performance.
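The article does not publish the prompt it used, so the following is only a minimal sketch of how a sentence and a topic list could be turned into a classification prompt and how a raw answer could be mapped back to a topic. The function names `build_topic_prompt` and `normalize_answer`, the prompt wording, and the sample topics are all assumptions made for illustration.

```python
def build_topic_prompt(sentence, topics):
    """Build a single-turn prompt asking the model to pick exactly one topic."""
    topic_lines = "\n".join(f"- {topic}" for topic in topics)
    return (
        "Classify the sentence into exactly one of the topics below.\n"
        "Answer with the topic name only.\n\n"
        f"Topics:\n{topic_lines}\n\n"
        f"Sentence: {sentence}\n"
        "Topic:"
    )


def normalize_answer(answer, topics):
    """Map a raw model answer onto a known topic, or None if nothing matches."""
    # Drop surrounding whitespace, quotes and full stops, then compare case-insensitively.
    cleaned = answer.strip().strip(".\"'").lower()
    for topic in topics:
        if topic.lower() == cleaned:
            return topic
    return None


# Hypothetical topics and sentence, not taken from the article's dataset.
topics = ["Billing", "Shipping", "Returns"]
prompt = build_topic_prompt("My parcel has not arrived yet.", topics)
print(prompt)
print(normalize_answer(' "Shipping". ', topics))
```

The resulting prompt string could be passed as `prompt` to `openai_chat_request`. Treating an unmatched answer as `None` (and therefore as a mistake) is a design choice of this sketch, not something the article specifies.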
Error Rate Analysis
According to the graph, GPT-4o has the lowest error rate, with only two mistakes across the 200 sentences. GPT-4, Gemini 1.5, and PaLM 2 Unicorn each made one more mistake than GPT-4o, which is still a strong result. Interestingly, GPT-4 Turbo performed slightly worse than GPT-4-0613, contrary to OpenAI’s claims. Lastly, Gemini 1.0 lagged behind, which is expected given its price range.
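The error rates discussed here come down to counting mismatches between predicted and labeled topics. A minimal sketch of that scoring step follows, assuming predictions and labels are parallel lists of topic strings; the `error_rate` helper, the case-insensitive comparison, and the toy data are assumptions, not the author's evaluation code.

```python
def error_rate(predictions, labels):
    """Return (mistakes, error_rate) for two parallel lists of topic labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty dataset")
    # Compare case-insensitively and ignore surrounding whitespace (an assumption).
    mistakes = sum(
        p.strip().lower() != g.strip().lower()
        for p, g in zip(predictions, labels)
    )
    return mistakes, mistakes / len(labels)


# Hypothetical toy data, not taken from the article's dataset.
labels = ["Billing", "Shipping", "Returns", "Billing"]
predictions = ["billing", "Shipping", "Billing", "Billing "]
mistakes, rate = error_rate(predictions, labels)
print(f"{mistakes} mistakes, error rate {rate:.1%}")
```

With the 200-sentence dataset described earlier, two mistakes correspond to an error rate of 2/200 = 1%, and three mistakes to 1.5%.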
Conclusion
This analysis using a uniquely crafted English dataset reveals insights into the state-of-the-art capabilities of these advanced language models. GPT-4o stands out with the lowest error rate among the tested models, affirming OpenAI’s performance claims. The AI community and users must continue independent evaluations with diverse datasets to get a clearer picture of a model’s practical effectiveness beyond standardized benchmarks. Note that the dataset is fairly small, and results may vary. The performance evaluation was conducted using the English dataset only; a multilingual comparison will be presented in future analyses.
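To put the small-dataset caveat into numbers, one option is to attach a confidence interval to each error rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not part of the original evaluation; the `wilson_interval` helper is hypothetical, and the mistake counts are the ones reported above (two for GPT-4o, three for GPT-4, out of 200 sentences).

```python
import math


def wilson_interval(mistakes, total, z=1.96):
    """Wilson score interval for an error proportion (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= mistakes <= total:
        raise ValueError("need total > 0 and 0 <= mistakes <= total")
    p = mistakes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Mistake counts as reported in the article, out of 200 sentences.
for name, mistakes in [("GPT-4o", 2), ("GPT-4", 3)]:
    low, high = wilson_interval(mistakes, 200)
    print(f"{name}: {mistakes / 200:.1%} error rate, interval {low:.1%} to {high:.1%}")
```

Under these assumptions the two intervals overlap, which is consistent with the caution that a one-mistake gap on 200 sentences is weak evidence of a real difference between models.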
FAQs
What are the key features of GPT-4o?
Answer: GPT-4o, OpenAI’s latest flagship model, introduces an Omni model concept, capable of comprehending and processing text, audio, and video seamlessly. It aims to democratize GPT-4 level intelligence, making it accessible to a broader audience, including free users. GPT-4o promises enhanced quality and speed across over 50 languages, providing a more inclusive and globally accessible AI experience at a reduced cost. Additionally, paid subscribers will enjoy five times the capacity compared to non-paid users. OpenAI also plans to release a desktop version of ChatGPT to enable real-time reasoning across various interfaces.
How can I use the GPT-4o API?
Answer: The GPT-4o model follows the existing chat-completion API from OpenAI, ensuring backward compatibility and ease of use. Users can utilize the provided Python code snippet to interact with the API, enabling chat completions with the GPT-4o model. Additionally, GPT-4o is accessible via the ChatGPT interface, allowing users to leverage its capabilities seamlessly.
What official evaluation metrics are available for GPT-4o?
Answer: OpenAI’s blog post includes evaluation scores from well-known datasets such as MMLU and HumanEval. According to the provided graph, GPT-4o’s performance is categorized as state-of-the-art, showcasing its promising capabilities, especially considering its affordability and speed. However, it’s crucial for users to independently verify these claims, as some models may have been trained or overfit on these datasets, potentially leading to unrealistic leaderboard scores.
How was the performance of GPT-4o compared to other models?
Answer: The performance comparison involved evaluating several models, including GPT-4o, GPT-4, GPT-4 Turbo, Gemini 1.5, Gemini 1.0, and PaLM 2 Unicorn, using a custom English dataset. The task was to match sentences in the dataset with the correct topic to calculate accuracy scores and error rates for each model. GPT-4o exhibited the lowest error rate, with only two mistakes, highlighting its strong performance. GPT-4, Gemini 1.5, and PaLM 2 Unicorn each made one more mistake, while GPT-4 Turbo and Gemini 1.0 lagged behind.
Where can I find more insights on GPT-4o and Gemini 1.5’s context memory evaluation?
Answer: Further insights on how GPT-4o and Gemini 1.5 handle context memory can be found in an article titled “OpenAI’s GPT-4o vs. Gemini 1.5 Context Memory Evaluation.” That article provides a more in-depth analysis and comparison of how the two models handle context memory.
What conclusions can be drawn from the performance analysis of GPT-4o and other models?
Answer: The performance analysis using a custom English dataset showcases GPT-4o’s state-of-the-art capabilities, with the lowest error rate among the tested models. This reaffirms OpenAI’s claims regarding its performance and underscores its potential as a leading language model. However, it’s essential for the AI community and users to continue conducting independent evaluations using diverse datasets to gain a more comprehensive understanding of a model’s practical effectiveness beyond standardized benchmarks. It’s worth noting that the dataset used in the analysis is relatively small, and results may vary. Additionally, the evaluation was conducted solely in English, with a multilingual comparison planned for future analyses.