MindEval shows state-of-the-art models struggle with realistic multi-turn therapy conversations.
State-of-the-art models consistently struggle to navigate realistic, multi-turn clinical therapy interactions, and their performance degrades further as conversations lengthen. This shows that neither larger scale nor reasoning alone guarantees clinical competence.
Preprint