In April, “im-also-a-good-gpt2-chatbot” appeared on LMSYS’s Chatbot Arena leaderboard of top generative AI models.
That model has since been revealed as GPT-4o. The “gpt2” in the name doesn’t refer to OpenAI’s earlier model, GPT-2. Instead, it hints at a new architecture for the GPT family, with the “2” suggesting a major change in the model’s design.
OpenAI’s engineering teams consider the change big enough to justify a new version number. Still, the company’s marketing presents it modestly as a continuation of GPT-4 rather than a complete overhaul.
Let’s look at what’s new in GPT-4o, what it offers, and how to use it in your business.
GPT-4o is OpenAI’s latest flagship generative AI model. The “o” in GPT-4o stands for “omni,” Latin for “every” or “all,” reflecting the model’s improved ability to handle text, speech, and video.
This makes interacting with AI easier. Previous iterations of OpenAI’s generative models focused on making the model more intelligent; GPT-4o focuses on making it simpler to use and much faster to respond.
You can ask ChatGPT powered by GPT-4o a question and interrupt it mid-answer. The model listens to the interruption and reframes its response in real time based on the new input. It can also pick up nuances in a user’s voice and generate different emotive voice outputs, including singing.
OpenAI’s CTO says, “GPT-4o reasons across voice, text, and vision. This is incredibly important because we’re looking at the future of interaction between humans and machines.”
Below are some of the prominent highlights of GPT-4o.
Did you know? You can leverage GPT-4o to equip your website to sell better and faster. Discover how to use GPT-4o as a sales agent.
Generative AI policies in companies are still in their early stages, and the European Union’s AI Act is the only significant legal framework so far. For now, you need to make your own decision about what constitutes safe AI.
OpenAI uses its Preparedness Framework to decide whether a model can be released to the public. It evaluates the model for cybersecurity risks; potential chemical, biological, radiological, and nuclear threats; persuasion ability; and model autonomy. The model’s overall score is the highest grade (Low, Medium, High, or Critical) it receives in any category.
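To make the scoring rule concrete, here’s a small illustrative sketch in Python. The category grades below are hypothetical, not OpenAI’s published figures; only the “highest grade wins” logic comes from the framework description above.

```python
# Illustrative sketch: the overall preparedness score is the highest
# risk grade the model receives in any single category.
GRADE_ORDER = ["Low", "Medium", "High", "Critical"]

def overall_grade(category_grades: dict[str, str]) -> str:
    """Return the most severe grade across all categories."""
    return max(category_grades.values(), key=GRADE_ORDER.index)

# Hypothetical category grades, for illustration only
scores = {
    "cybersecurity": "Low",
    "CBRN threats": "Low",
    "persuasion": "Medium",
    "model autonomy": "Low",
}

print(overall_grade(scores))  # -> "Medium"
```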
GPT-4o received a Medium grade overall, staying clear of the highest risk levels reserved for models that could upend human civilization.
Like all generative AI, GPT-4o may not always behave exactly as you intend, and its new audio capabilities bring risks such as deepfake scam calls. To mitigate them, audio output is limited to a set of preset voices. Even so, GPT-4o shows significant safety improvements over previous models.
GPT-4o offers better image and text understanding for analyzing the content of an input. Compared to previous models, it is better at answering questions about what it sees, such as “What brand of T-shirt is this person wearing?” It can also look at a menu in another language and translate it.
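If you’re a developer, here’s a rough sketch of what such an image query looks like with the official OpenAI Python SDK. The image URL and prompt are placeholders, and the snippet assumes your OPENAI_API_KEY is set in the environment.

```python
# Sketch: ask GPT-4o a question about an image via the Chat Completions API
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Translate this menu into English."},
                # Placeholder URL; point it at a publicly reachable image
                {"type": "image_url", "image_url": {"url": "https://example.com/menu.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```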
Future models are expected to offer even more advanced capabilities, such as watching a sports match and explaining its rules.
Here’s what changed in GPT-4o compared to other generative AI models from OpenAI.
Previous OpenAI voice systems combined Whisper (speech-to-text), GPT-4 Turbo as the reasoning engine, and a text-to-speech model in a pipeline. The reasoning model saw only the transcribed words and lost the tone of voice, background noises, and the presence of multiple speakers, which limited GPT-4 Turbo’s ability to express different emotions or styles of speech.
With GPT-4o, a single model reasons across text and audio. This makes it more receptive to tone of voice and background audio, and it can generate higher-quality responses in different speaking styles.
GPT-4o’s average voice mode latency is 0.32 seconds, roughly nine times faster than GPT-3.5’s average of 2.8 seconds and about 17 times faster than GPT-4’s average of 5.4 seconds.
The average human response time in conversation is about 0.21 seconds, so GPT-4o’s response time is close to that of a human, which makes it suitable for real-time speech translation.
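Those speed-up figures are simple ratios of the quoted averages, as a quick check shows:

```python
# Quick check of the quoted speed-ups (voice mode latencies in seconds)
gpt4o, gpt35, gpt4 = 0.32, 2.8, 5.4

print(f"vs GPT-3.5 voice mode: {gpt35 / gpt4o:.1f}x faster")  # ~8.8x
print(f"vs GPT-4 voice mode:   {gpt4 / gpt4o:.1f}x faster")   # ~16.9x
```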
Tokens are the units of text a model works with. When you use a large language model (LLM), your prompt is first converted into tokens; in English, three words take close to four tokens.
If a language can be represented with fewer tokens, fewer computations are needed and text generation speeds up. It also lowers the price for API users, since OpenAI charges per input and output token.
GPT-4o’s new tokenizer particularly benefits Indian languages such as Hindi, Marathi, Tamil, Telugu, and Gujarati, which now need far fewer tokens. Arabic shows a 2x reduction, while East Asian languages see a 1.4x to 1.7x reduction in tokens.
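You can see the difference yourself with OpenAI’s tiktoken library, which exposes both the older cl100k_base encoding used by GPT-4 and GPT-4 Turbo and the newer o200k_base encoding used by GPT-4o. A minimal sketch, assuming a recent tiktoken release and a sample Hindi sentence; exact counts depend on the text:

```python
# Compare token counts for the same text under the old and new encodings
import tiktoken

text = "नमस्ते, आप कैसे हैं?"  # "Hello, how are you?" in Hindi

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

print(len(old_enc.encode(text)), "tokens with cl100k_base")
print(len(new_enc.encode(text)), "tokens with o200k_base")
```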
GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro are the top contenders to compare with GPT-4o. Llama 3 400B may be a contender in the future, but its training isn’t finished yet.
Below is a comparison of GPT-4o with the aforementioned models based on different parameters.
Performance fluctuates by only a few percentage points between GPT-4 Turbo and GPT-4o. However, these LLM benchmarks don’t measure performance on multimodal problems. The concept is new, and benchmarks for a model’s ability to reason across text, audio, and video are yet to emerge.
GPT-4o’s performance is impressive and shows a promising future for multimodal training.
GPT-4o can reason across text, audio, and video effectively, which makes it suitable for a variety of use cases, for example:
GPT-4o can now interact with you the way you would converse with another person. You spend less time typing, the conversation feels more natural, and you still get quick, accurate information.
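If you’re building this kind of conversational experience with the API, streaming keeps it feeling responsive by printing the reply as it is generated. Here’s a minimal sketch with the OpenAI Python SDK; the prompt is illustrative and the snippet assumes OPENAI_API_KEY is set in the environment.

```python
# Sketch: stream a GPT-4o reply chunk by chunk for a responsive conversation
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Give me three tips for writing clear emails."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```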
With its extra speed and audiovisual capabilities, OpenAI demonstrates several real-time use cases where you interact with the AI using a live view of the world around you. This opens up opportunities for navigation, translation, guided instructions, and understanding complex visual information.
For example, GPT-4o runs on desktop and mobile, and potentially on wearables in the future. You can show it a scene or your desktop screen and ask questions, rather than typing or switching between different apps and screens.
GPT-4o’s ability to understand video input from a camera and verbally describe the scene can also be incredibly useful for visually impaired people. It would work like an audio description feature for real life, helping them understand their surroundings better.
GPT-4o connects your device inputs seamlessly, making it easier to interact with the model. With integrated modalities and improved performance, enterprises can use it to build custom vision applications.
You can use GPT-4o for steps where suitable open-source models aren’t available, and switch to custom models for the remaining steps to reduce costs.
GPT-4o improves both performance and speed. Expertise lets you plug a GPT-4o-powered AI sales agent into your website. Today, it helps your website visitors get answers to complex questions, captures leads, and books meetings faster.
With Expertise AI, you can train these agents to answer highly complex visitor questions. In the future, Expertise might leverage GPT-4o’s capabilities to reason across text, video, and audio to train AI sales agents on multiple media formats.
Until then, let your website visitors get the help they need from Expertise’s AI sales agents before they’re ready to connect with a salesperson.
Try Expertise AI and let your visitors experience the speed of GPT-4o in answering questions related to your products or services.