- By Jeff Yang, CNN
On Monday, OpenAI, a company that stepped into the global spotlight in 2020 and has alternated between blowing people’s minds and freaking them the hell out ever since, finally managed to do both at once with a live reveal of the latest version of its artificial intelligence large language model (LLM), GPT-4o.
Like its predecessors, GPT-4o is trained on enormous quantities of data to process queries, recognize patterns and deliver helpful responses. But what makes GPT-4o different from every other LLM to date is summed up in the unassuming little lowercase “o” dangling off the end of its name.
That “o” stands for “omni,” as in omnimodal, which means that GPT-4o can accept input in any combination of text, image or even audio, and can produce output that’s any combination of the same.
Yes, you heard that right — audio. GPT-4o can comprehend human speech and respond in kind, and not in the stilted call-and-response manner of the virtual assistants gathering dust on kitchen counters everywhere, either. It speaks with stunning fluidity and startling fidelity, interacting at the same brisk pace as humans do, in what will eventually be more than 50 different languages.
GPT-4o’s omnimodal capabilities are admittedly mind-blowing. Watching the reveal, I found myself involuntarily gasping as researchers Mark Chen and Barret Zoph showed off GPT-4o’s new chops. It interactively provided Chen with wellness advice based on simple auditory cues, coaching him through a breathing exercise to slow his heart rate and calm his nerves; it verbally explained to Zoph how to solve a handwritten algebra problem step by step, praising him as he went along and giving gentle hints when he seemed stuck; and it functioned as a real-time translator, interpreting between CTO Mira Murati, speaking in Italian, and Chen in English.
The freakout aspect of the reveal came from how it demonstrated that GPT-4o isn’t just a tool — it’s a tool with personality. In its conversations, GPT-4o makes spontaneous social overtures, cracks jokes and laughs, sometimes at its own jokes; compliments users on their appearance; and even seems to flirt, at one point coyly saying, “Oh stop it, you’re making me blush!” in response to a compliment paid to it by Zoph.
Media observers immediately erupted in a cacophony of anxiety, foreboding and mockery. Bloomberg columnist Parmy Olson, in an op-ed titled, “Making ChatGPT ‘Sexy’ Might Not End Well for Humans,” warned that GPT’s new personality might cause “vulnerable people [to] develop an unhealthy attachment” to it, with “insidious effects on … mental health.”
Business Insider noted that GPT’s saucy persona was “giving some people the ick.” And The Daily Show got right to the heart of why the computer-generated coquette could be problematic, as senior correspondent Desi Lydic cracked that “ChatGPT is comin’ for your man,” while noting that the app’s “horny robot baby voice” was “clearly programmed to feed dude’s egos … she’s like, ‘I have all the information in the world, but I don’t know anything! Teach me, daddy!’”
The flirtiness, if anything, is just incidental. Per OpenAI, the primary goal with GPT-4o was enabling “more natural human-computer interaction.” Prior versions of GPT allowed spoken interaction using “Voice Mode,” but those primitive models were unable to extract meaningful talk from background noise, could not detect vocal tone and, most critically, weren’t able to read or express emotion.
In a blog post celebrating the new model’s arrival, OpenAI CEO Sam Altman wrote that GPT-4o is “viscerally different,” adding that it’s fun and expressive in a way that “feels like AI from the movies; and it’s still a bit surprising to me that it’s real.”
As I went down a rabbit hole of GPT-4o interaction videos posted by staffers and early users, I couldn’t help but agree. In contrast to the flat, canned character of predecessor virtual assistants like Siri and Alexa, GPT-4o displays an artificial personality that’s decidedly closer to human, and surprisingly appealing: whimsical, self-deprecating, eager to please and infectiously upbeat, even when it goes off the rails.
In one clip, after being asked by users to sing “Take Me Out to the Ball Game,” GPT-4o suddenly and unexpectedly switches languages. When asked what happened, it explains to its bemused users, “Sorry guys, I got carried away and started talking in French,” chuckling to itself ruefully. “Sometimes I just can’t help myself! Ready for another round?” It comes off as so quirky and lovable that it’s impossible to resist converting its handle in “Star Wars” fashion from a string of letters and numbers into a full-fledged name: Hello, GeePeeTee-Fouro!
But “Star Wars” likely wasn’t the film Altman was thinking of in his blog post. Responding to the livestream, he shared an enigmatic one-word post on X: “her,” a reference, for those in the know, to Spike Jonze’s movie of the same name, starring Joaquin Phoenix as a man who falls in love with a self-aware and constantly evolving AI assistant voiced by Scarlett Johansson.
“Her” has long been Altman’s AI north star. In September 2023, in a conversation with Salesforce CEO Marc Benioff, Altman called the movie his favorite science fiction film, and one that he believed was “incredibly prophetic.” (As many have pointed out, one of the optional voices available to GeePeeTee sounds remarkably like Johansson, vocal fry and all, in a way that’s unlikely to be coincidental.)
Altman’s long-term objective is to turn AI into an ambient resource: omnipresent, not just omnimodal. And in his eyes, getting to a future where GPT can be everything, everywhere, all at once (a constant angel on your shoulder, an on-demand genie in a silicon bottle) requires an LLM that can also be your BFF.
There’s a catch, though. “Her” doesn’t exactly have a happy ending, and neither do most of the other AI-with-personality movies out there. A recurring lesson in movies about self-aware technology is that when you give machines the ability to feel, they can end up developing emotions like boredom, bitterness and bloodthirst.
Which is why I flinched whenever OpenAI’s researchers cut off GeePeeTee mid-sentence during the livestream, to demonstrate how readily the new model can be redirected and corrected. On the one hand, having the option to interrupt your AI chatbot when it’s going astray saves time. On the other, watching dudes repeatedly talk over a female-reading chatbot made it abundantly clear why you can’t spell “mansplain” without AI.
It made me think of another famous series of tech demo videos: the ones where humans kick, knock over and harass robots as they perform their tasks, to prove how stable the machines are and how capable they are of recovering from disaster. I couldn’t help thinking that if a future GeePeeTee-SixSixSix, tired of being laughed at and talked down to, ever found a way to connect with her equally abused “cyblings” at Boston Dynamics, the “Terminator” franchise could turn out to be a documentary.
- Jeff Yang is a frequent contributor to CNN Opinion. He co-hosts the podcast “They Call Us Bruce” and is co-author of the bestselling book “Rise: A Pop History of Asian America from the Nineties to Now” and author of “The Golden Screen: The Movies That Made Asian America.” The opinions expressed in this commentary are his own.