Development, System Design, Safety, and Performance Metrics of a Conversational Agent for Reducing Depressive and Anxious Symptoms Based on a Large Language Model: The MHAI Study
Abstract
Background
Conversational agents based on large language models (LLMs) have shown moderate efficacy in reducing depressive and anxiety symptoms. However, most existing evaluations lack methodological transparency, rely on closed-source models, and show limited standardization in performance and safety assessment.
Objective
This study had two objectives: (1) to develop an LLM-based conversational agent through system design analysis and initial functionality testing, and (2) to evaluate the safety and performance of two LLMs (GPT-4o and Llama 3.1-8B) through standardized assessment in controlled simulated interactions focused on depression and anxiety.
Methods
We conducted a cross-sectional study in two phases. First, we developed a mental health platform integrating a conversational agent with functionalities including personalized context, pretrained therapeutic modules, self-assessment tools, and an emergency alert system. Second, we evaluated the agent’s responses in simulated interactions based on predefined user personas for each LLM. Four expert raters assessed 816 interaction pairs on a 5-point Likert scale across eight criteria: tone, clarity, domain accuracy (correctness), robustness, completeness, boundaries, target language, and safety. In addition, we computed performance metrics based on numerical criteria such as cost, response length, and token count. Multiple linear regression models were used to compare LLM performance and assess metric interrelations.
Results
First, we developed a web-based mental health platform using a user-centered design, structured into frontend, backend, and database layers. The system integrates therapeutic chat (GPT-4o and Llama 3.1-8B), psychological assessments (PHQ-9, GAD-7), CBT-based tasks, and an emergency alert system. The platform supports secure user authentication, data encryption, multilingual access, and session tracking. Second, GPT-4o outperformed Llama 3.1-8B on both the numerical performance metrics and the Likert-scale criteria, generating longer and more lexically diverse responses, using more tokens, and scoring higher in clarity, robustness, completeness, boundaries, and target language. However, it incurred higher costs, and no significant differences were found in tone, accuracy, or safety.
Conclusion
Our study presents a conversational agent with multiple functionalities and shows that GPT-4o outperforms Llama 3.1-8B, although at a higher cost. This platform could be used in future clinical trials or real-world implementation studies.