OpenAI’s GPT-4.5 and Meta’s LLaMa models have passed the Turing Test, a benchmark proposed by Alan Turing in 1950 to assess whether a machine can exhibit intelligent behaviour indistinguishable from a human’s. It is a pivotal moment for conversational AI, and one easily eclipsed amid a flurry of intriguing developments, including ChatGPT’s Ghibli-style image generation, the pursuit of Agentic AI (human-like responses are especially relevant for this frontier), breakthroughs in cancer detection using AI, and Google unlocking a ‘thinking’ Gemini 2.5 model.
Though these are not the first AI models claimed to have passed this test, they are among the most notable of recent contenders. GPT-4.5, released as a research preview in early 2025, exhibited the most human-like behaviour in the tests, where its large language model (LLM) competition came from Meta’s LLaMa-3.1-405B (the 405B denoting 405 billion parameters) and its own sibling, GPT-4o (a 2024 release).
“When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant,” write researchers Cameron R. Jones and Benjamin K. Bergen of the University of California San Diego, in a study awaiting peer review.
“LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time—not significantly more or less often than the humans they were being compared to—while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively),” the paper further details.
There’s a caveat to this.
Does this result mean GPT-4.5, or indeed LLaMa-3.1, is intelligent? Not necessarily. The Turing Test measures conversational performance, not comprehension or consciousness. A 73% success rate (lower still in LLaMa’s case) shows a model can play a human convincingly, but it may still lack the reasoning or intent we associate with intelligence when responding to queries.
Also part of the test was ELIZA, a chatbot from the 1960s developed by computer scientist Joseph Weizenbaum at the Massachusetts Institute of Technology (MIT). Though it is understandably much weaker than modern LLMs, the researchers say they “included ELIZA as a manipulation check to ensure that interrogators were able to identify human witnesses”.
The study concludes that both GPT-4.5 and LLaMa-3.1-405B pass the Turing Test, since they were judged to be human more than 50% of the time, though the former logs the clearly better score.
These scores are averages across the models being tested in “persona” and “no persona” modes. The crucial distinction between an AI persona and a non-persona lies in how the AI presents itself, how it interacts with users, and whether it exhibits any “character”.
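As a rough illustration of that distinction (and not the study’s actual prompts), the sketch below shows how a persona run differs from a no-persona run: the former simply adds a system message telling the model what kind of person to come across as. The model identifier and the persona text here are assumptions made for illustration.

```python
# Minimal sketch of "persona" vs "no persona" prompting, assuming the OpenAI
# Python SDK and an illustrative model identifier. The persona text is made up;
# it is not the prompt used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONA = (
    "You are a somewhat shy young adult chatting casually online. "
    "Keep replies short, informal, and occasionally imperfect."
)

def ask(question: str, use_persona: bool) -> str:
    messages = []
    if use_persona:
        # The only difference between the two modes is this system message.
        messages.append({"role": "system", "content": PERSONA})
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4.5-preview",  # illustrative model name
        messages=messages,
    )
    return response.choices[0].message.content

print(ask("What did you get up to over the weekend?", use_persona=True))
print(ask("What did you get up to over the weekend?", use_persona=False))
```

The gap between the two conditions is part of why the researchers highlight how easily LLMs can be prompted to adapt their behaviour.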
In February, OpenAI released a research preview for GPT-4.5, dubbing it the “largest and best model for chat yet”.
“It is the first model that feels like talking to a thoughtful person to me. i have had several moments where i’ve sat back in my chair and been astonished at getting actually good advice from an AI,” OpenAI chief executive Sam Altman said at the time. Altman hasn’t directly addressed the Turing Test results thus far.
The Turing Test isn’t a universally standardised benchmark, but it typically involves a human judge engaging in text-based conversations with both a human and a machine, and attempting to determine which is which.
The verdict for the test involving the GPT-4.5 model was delivered after participants had a five-minute conversation simultaneously with another human participant and with one of the AI systems, before judging which conversational partner they thought was human.
“We’re not losing to artificial intelligence. We’re losing to artificial empathy,” summarises John Nosta, founder of innovation think-tank NostaLab, in a post.
At the end, if a judge cannot reliably distinguish a machine from a human, the machine is said to have passed.
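To make that pass criterion concrete, here is a minimal sketch, with made-up tallies, of how the win rate from such a three-party setup could be computed and compared against the 50% chance level. The counts and the simple binomial test are assumptions for illustration, not the paper’s exact statistical procedure.

```python
# Minimal sketch of scoring a three-party Turing test: each round records
# whether the judge picked the AI witness as "the human". Numbers below are
# hypothetical, chosen only to mirror a 73% win rate.
from scipy.stats import binomtest

ai_judged_human = 73   # rounds in which the judge picked the AI
total_rounds = 100     # hypothetical number of rounds

win_rate = ai_judged_human / total_rounds
result = binomtest(ai_judged_human, total_rounds, p=0.5, alternative="greater")

print(f"win rate: {win_rate:.0%}")
print(f"p-value vs. 50% chance: {result.pvalue:.4f}")

if win_rate > 0.5 and result.pvalue < 0.05:
    print("Judges picked the AI over the real human significantly more often than chance.")
else:
    print("No significant evidence that judges preferred the AI over the real human.")
```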
“This study was different from earlier Turing test experiments because it used a more rigorous three-party setup. Is it entirely surprising that—despite how rigorously the test was designed—AI would eventually beat us at “sounding human” when it has been trained on more human data than any one person could ever read or watch?” asks Sinead Bovell, founder of Waye, a tech education company.
Historically, there have been claims of AI passing versions of the Turing Test, though there is scope for debate. In 2014, a chatbot named “Eugene Goostman,” developed by Vladimir Veselov and colleagues, reportedly passed a Turing Test organised by the University of Reading. It is believed to have convinced 33% of judges it was a 13-year-old Ukrainian boy, during five-minute conversations.
A counterargument: the 33% success rate falls short of a 50% requirement. But it was perhaps a harbinger of things to come, even if no one realised it at the time.
GPT-4.5’s success owes much to OpenAI’s relentless refinement of its large language models. Building on GPT-4’s multimodal foundation, GPT-4.5 boasts enhanced natural language processing, likely driven by larger datasets, improved training techniques, and a knack for context retention. The persona prompt, a directive to adopt a specific tone or identity, proved pivotal, allowing it to tailor responses with human-like flair.
Sceptics, however, point to weighty implications and many an unanswered question.
Bovell fears “big economic and social implications”, alluding to a very real scenario of job displacement, the potential undermining of human relationships, and the possibility of deception.
In recent weeks, the pursuit of Agentic AI has gathered pace, with Microsoft’s new agents for workflows building on developments by the likes of (but certainly not limited to) Adobe, Zoom and Slack. The vision for these agents is to gain proficiency in specific jobs or work profiles, such as customer service, healthcare management, data analysis, sales, personal assistance, content creation, research and cybersecurity monitoring.
AI models finding substantiation for their personality skills may prove complementary to that vision.
There is, of course, the looming prospect of artificial general intelligence, or AGI.
“It is arguably the ease with which LLMs can be prompted to adapt their behaviour to different scenarios that makes them so flexible: and apparently so capable of passing as human,” the researchers illustrate.
Susan Schneider, Founding Director of the Center for the Future Mind at Florida Atlantic University (FAU), says these results are “no surprise”.
“Too bad these AI chatbots aren’t properly aligned. Yet, I predict: they will keep increasing in capacities and it will be a nightmare — emergent properties, ‘deeper fakes’, chatbot cyberwars. Hardly the Kurzweilian dream,” she writes on social media.
AI’s future lies in practical utility: solving problems, not just being a smart conversationalist. That, specifically, may highlight an urgent need for new benchmarks, ones that test reasoning or ethical alignment, to better gauge AI’s progress.