UC San Diego: Large Language Models Pass the Turing Test
Researchers found that GPT-4.5, when adopting a humanlike persona, convinced human interrogators of its humanity more often than real human participants, demonstrating that advanced LLMs can pass the three-party Turing test.
Summary of https://arxiv.org/pdf/2503.23674
Researchers Cameron R. Jones and Benjamin K. Bergen investigated whether advanced large language models (LLMs) can pass the standard three-party Turing test. Their study involved human interrogators conversing with both a human and an AI, then judging which was human.
The findings indicate that GPT-4.5, when prompted to adopt a persona, was identified as human significantly more often than the actual human participant, marking the first empirical evidence of an AI passing this rigorous version of the test.
Other models, such as LLaMa-3.1-405B, also came across as humanlike at times, but only the persona-prompted models consistently fooled human judges. The study also explored the strategies interrogators used and what the results imply for our understanding of AI capabilities and their societal impact.
- This study provides the first empirical evidence that a Large Language Model (LLM), specifically GPT-4.5 when prompted to adopt a humanlike persona, can pass a standard three-party Turing test. In the study, GPT-4.5-PERSONA was judged as human 73% of the time, significantly more often than the actual human participants.
- Prompting with a humanlike persona was crucial for success. While GPT-4.5-PERSONA and LLAMA-PERSONA were not reliably distinguished from humans, the same models without the persona prompt (GPT-4.5-NO-PERSONA and LLAMA-NO-PERSONA) performed significantly worse: in the undergraduate study, they were judged to be human at rates significantly below chance.
- The study compared the performance of several AI systems, including GPT-4.5, LLaMa-3.1-405B, GPT-4o, and ELIZA. The baseline models, GPT-4o-NO-PERSONA and ELIZA, had significantly lower win rates, indicating that interrogators could generally distinguish them from humans. This suggests the interrogators were not simply guessing randomly.
- The research indicates that interrogators often relied on social, emotional, and linguistic cues rather than traditional measures of knowledge and reasoning when trying to distinguish between humans and AI. Interestingly, providing strange prompts or using "jailbreaks" was the most effective strategy for interrogators, while asking about the weather or human experiences was least effective.
- The findings have significant social and economic implications, suggesting that contemporary LLMs could potentially substitute for humans in short conversations, raising concerns about deception, misinformation, and the potential undermining of real human interaction. The study also found that general knowledge about LLMs and frequent chatbot interaction did not consistently improve participants' ability to distinguish AI from humans.
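The statements above that a win rate is "significantly" above or below chance can be made concrete with an exact binomial test against the 50% rate expected from random guessing. The sketch below is a minimal illustration, not necessarily the analysis Jones and Bergen ran. The game counts (73 of 100 games judged human) are hypothetical numbers chosen only to match the reported 73% rate, and the function names are invented for this example.

```python
from math import comb


def win_rate(judged_human: int, games: int) -> float:
    """Share of games in which the AI witness was judged to be the human."""
    return judged_human / games


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value against a chance rate p.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null hypothesis.
    """
    def pmf(k: int) -> float:
        return comb(trials, k) * p**k * (1 - p) ** (trials - k)

    observed = pmf(successes)
    # Small relative tolerance so floating-point ties count as "as extreme".
    total = sum(pmf(k) for k in range(trials + 1)
                if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: 73 of 100 games judged human (assumed, not from the paper).
rate = win_rate(73, 100)
p_value = binomial_two_sided_p(73, 100)
print(f"win rate = {rate:.2f}, two-sided p = {p_value:.2g}")
```

A small p-value paired with a win rate above 0.5 corresponds to a model being picked as human more often than chance. A small p-value paired with a win rate below 0.5 corresponds to interrogators reliably spotting the AI, as reported for the baseline models.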