LLMs Score High on Feelings—Literally
In the newly published *Communications Psychology* (a Nature Portfolio journal) article “*[Large Language Models are Proficient in Solving and Creating Emotional Intelligence Tests](https://www.nature.com/articles/s44271-025-00258-x)*,” researchers benchmarked six leading LLMs (ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3) on five well-validated emotional-intelligence (EI) assessments. The average AI score was 81 %, well above the historical human mean of 56 %. Two models, ChatGPT-o1 and DeepSeek V3, scored more than two standard deviations above the human average.
Beyond Taking the Test: Writing It
The team then asked ChatGPT-4 to create entirely new EI items: fresh scenarios, answer choices, and scoring keys. When these AI-generated tests were administered to human participants, they matched the original instruments in difficulty, and scores on the two versions were strongly correlated (r = 0.46). Importantly, 88 % of AI-written scenarios were judged “mostly new,” which eases concerns that the model was merely paraphrasing existing items.
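To make the r = 0.46 figure concrete, here is a minimal sketch of how a Pearson correlation between scores on an original test and its AI-generated counterpart could be computed. The function name `pearson_r` and the score lists are hypothetical and invented purely for illustration; they are not data from the study, and the authors' actual analysis may differ.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical proportion-correct scores for the same six participants
# on the original test and on the AI-generated version (made-up numbers).
original_scores = [0.55, 0.60, 0.48, 0.72, 0.65, 0.58]
generated_scores = [0.50, 0.66, 0.52, 0.61, 0.70, 0.49]

r = pearson_r(original_scores, generated_scores)
print(f"r = {r:.2f}")
```

A value near 1 would mean participants rank almost identically on both versions, while a value near 0 would mean the two versions measure largely unrelated things.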
Psychometric Quality Holds Up
Minor gaps emerged, such as slightly lower perceived realism or content diversity, but the effect sizes were small (Cohen’s d < 0.25). Internal consistency, item discrimination, and cross-test correlations stayed within acceptable bounds. In short, an LLM can now produce a draft EI assessment good enough to enter a formal validation pipeline.
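For readers unfamiliar with Cohen’s d, the sketch below shows one common way to compute it: the difference between two group means divided by their pooled standard deviation. The function name `cohens_d` and the ratings are hypothetical, invented for illustration only; the study may have computed its effect sizes differently.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least 2 values")
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both groups are constant")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical realism ratings on a 1-5 scale (made-up numbers).
original_item_ratings = [4, 5, 3, 4, 4, 5, 3, 4]
generated_item_ratings = [4, 4, 3, 4, 5, 4, 3, 4]

d = cohens_d(original_item_ratings, generated_item_ratings)
print(f"Cohen's d = {d:.2f}")
```

Because d is expressed in standard-deviation units, a value below 0.25 means the two sets of ratings differ by less than a quarter of a standard deviation.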
Implications: Toward AI-Augmented Psychometrics
These findings suggest LLMs possess a form of cognitive empathy, the ability to reason accurately about emotions and regulation strategies. For test developers, this means:
- Rapid Item Pool Expansion – Generate hundreds of plausible questions in minutes.
- Cost-Efficient Prototyping – Reserve expensive human pilots for fine-tuning, not first drafts.
- Domain Coverage – Use AI to ensure balanced representation of emotion-related constructs.
Of course, human experts must still conduct pilot studies, flag cultural bias, and verify reliability before adoption.
What This Means for Learning Platforms
Tools like [ibl.ai’s AI Mentor](https://ibl.ai/product/mentor-ai-higher-ed) could harness EI-savvy LLMs to coach learners on emotional regulation, feedback reception, or teamwork, all skills that are often harder to teach than math or coding. By embedding AI-generated EI exercises aligned with validated frameworks, platforms can give users consistent, personalized practice while educators supervise with confidence.
Caveats and Future Work
- Context Matters – LLMs draw from massive text corpora; domain-specific nuances may still trip them up.
- Authenticity vs. Performance – High test scores don’t guarantee genuine emotional resonance in daily interactions.
- Ethical Oversight – Any AI-assisted assessment must protect privacy and avoid reinforcing bias.
Researchers call for broader datasets, cross-cultural replication, and longitudinal studies to see how AI-generated EI materials fare over time.
Final Thoughts
The study marks a milestone: large language models are not merely linguistic powerhouses; they are emerging as credible partners in socio-emotional research and education. If guided responsibly, AI could democratize high-quality EI training and assessment, helping people understand feelings, both their own and others’, with newfound clarity.