LLMs Score High on Feelings, Literally
In the newly published Nature article "Large Language Models are Proficient in Solving and Creating Emotional Intelligence Tests," researchers benchmarked six leading LLMs (ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3) across five well-validated emotional-intelligence (EI) assessments. The average AI score was 81%, towering over the historical human mean of 56%. Two models, ChatGPT-o1 and DeepSeek V3, landed more than two standard deviations above the human average.
Beyond Taking the Test: Writing It
The team then asked ChatGPT-4 to create entirely new EI items: fresh scenarios, answer choices, and scoring keys. When these AI-generated tests were administered to human participants, they matched the original instruments in difficulty, and scores on the two versions were strongly correlated (r = 0.46). Importantly, 88% of AI-written scenarios were judged "mostly new," which eases concerns that the model merely paraphrased existing items.
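For readers who want to see what a correlation such as r = 0.46 measures, here is a minimal sketch of the Pearson coefficient between scores on an original test and its AI-generated counterpart. The function name `pearson_r` and the score lists are hypothetical and invented for illustration; they are not data or code from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical proportion-correct scores for six participants
# (made-up numbers, not taken from the study).
original_scores = [0.55, 0.70, 0.62, 0.48, 0.81, 0.66]
generated_scores = [0.60, 0.58, 0.71, 0.52, 0.69, 0.75]

r = pearson_r(original_scores, generated_scores)
print(f"r = {r:.2f}")
```

A value near 1 would mean participants rank almost identically on both versions, while a value near 0 would mean the two versions measure unrelated things.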
Psychometric Quality Holds Up
Minor gaps emerged (slightly lower perceived realism or content diversity), but effect sizes were small (Cohen's d < 0.25). Internal consistency, item discrimination, and cross-test correlations stayed within acceptable bounds. In short, an LLM can now crank out a draft EI assessment good enough to enter a formal validation pipeline.
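Two of the statistics mentioned here can be sketched in a few lines. The example below assumes two independent groups of ratings for Cohen's d (using a pooled standard deviation) and uses Cronbach's alpha as one common index of internal consistency; the study itself may have used different variants. The function names and all numbers are hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per participant."""
    k = len(rows[0])  # number of items
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical realism ratings on a 1-5 scale (made-up, not study data).
original_ratings = [4, 5, 4, 3, 5, 4, 4]
generated_ratings = [4, 4, 4, 3, 5, 4, 3]
print(f"d = {cohens_d(original_ratings, generated_ratings):.2f}")

# Hypothetical item scores (1 = correct, 0 = incorrect), one row per participant.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

By Cohen's conventional benchmarks, a d of about 0.2 counts as small, which is why differences below 0.25 are usually not treated as practically important.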
Implications: Toward AI-Augmented Psychometrics
These findings suggest LLMs possess a form of cognitive empathy: the ability to reason accurately about emotions and regulation strategies. For test developers, this means:
Rapid Item Pool Expansion: Generate hundreds of plausible questions in minutes.
Cost-Efficient Prototyping: Reserve expensive human pilots for fine-tuning, not first drafts.
Domain Coverage: Use AI to ensure balanced representation of emotion-related constructs.
Of course, human experts must still conduct pilot studies, flag cultural bias, and verify reliability before adoption.
What This Means for Learning Platforms
Tools like ibl.ai's AI Mentor could harness EI-savvy LLMs to coach learners on emotional regulation, feedback reception, or teamwork, skills that are often harder to teach than math or coding. By embedding AI-generated EI exercises aligned with validated frameworks, platforms can give users consistent, personalized practice while educators supervise with confidence.
Caveats and Future Work
Context Matters: LLMs draw from massive text corpora; domain-specific nuances may still trip them up.
Authenticity vs. Performance: High test scores don't guarantee genuine emotional resonance in daily interactions.
Ethical Oversight: Any AI-assisted assessment must protect privacy and avoid reinforcing bias.
Researchers call for broader datasets, cross-cultural replication, and longitudinal studies to see how AI-generated EI materials fare over time.
Final Thoughts
The study marks a milestone: large language models aren't merely linguistic powerhouses; they're emerging as credible partners in socio-emotional research and education. If guided responsibly, AI could democratize high-quality EI training and assessment, helping humans understand feelings, both their own and others', with newfound clarity.