Evaluating the accuracy and reliability of large language models in assisting with pediatric differential diagnoses: A multicenter diagnostic study
Abstract
Importance
Large language models, such as GPT-3, have shown potential for assisting with clinical decision-making, but their accuracy and reliability for pediatric differential diagnosis, particularly in rural healthcare settings, remain underexplored.
Objective
To evaluate the performance of a fine-tuned GPT-3 model in assisting with pediatric differential diagnosis in rural healthcare settings and to compare its accuracy with that of human physicians.
Methods
Retrospective cohort study using data from a multicenter rural pediatric healthcare organization in Central Louisiana serving approximately 15,000 patients. Data from 500 pediatric patient encounters (age range: 0-18 years) between March 2023 and January 2024 were collected and split into training (70%, n=350) and testing (30%, n=150) sets.
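As a rough illustration, a 70/30 split of this kind can be produced as follows; the file name, column layout, and random seed are assumptions, since the abstract does not describe the data pipeline.

# Illustrative 70/30 split of the 500 encounters (file name and seed are assumed).
import pandas as pd
from sklearn.model_selection import train_test_split

encounters = pd.read_csv("pediatric_encounters.csv")  # hypothetical export, 500 rows
train_df, test_df = train_test_split(encounters, test_size=0.30, random_state=42)
print(len(train_df), len(test_df))  # yields 350 and 150 with 500 input rows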
Interventions
GPT-3 model (davinci base) fine-tuned on the training set via the OpenAI API for 10 epochs.
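A minimal sketch of such a run, using the legacy OpenAI fine-tunes endpoint that supported the davinci base model (openai-python < 1.0); the JSONL file and its prompt/completion formatting are assumptions, not the study's actual pipeline.

# Sketch only: legacy davinci fine-tune (openai-python < 1.0).
# "train_encounters.jsonl" is a hypothetical export of the 350 training
# encounters as {"prompt": ..., "completion": ...} pairs.
import openai

training_file = openai.File.create(
    file=open("train_encounters.jsonl", "rb"),
    purpose="fine-tune",
)
job = openai.FineTune.create(
    training_file=training_file.id,
    model="davinci",
    n_epochs=10,  # ten epochs, as reported above
)
print(job.id)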
Main Outcomes and Measures
Accuracy of the fine-tuned GPT-3 model in generating differential diagnoses, evaluated using sensitivity, specificity, precision, F1 score, and overall accuracy. The model's performance on the testing set was compared with that of human physicians.
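These metrics all follow from a per-encounter confusion matrix. A minimal sketch, assuming each test encounter is scored as a binary match between the model's differential and the reference diagnosis (the labels below are synthetic stand-ins; the real per-encounter labels are not reported):

# Synthetic stand-in labels for the 150 test encounters.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 150)  # reference: 1 = correct diagnosis present
y_pred = rng.integers(0, 2, 150)  # model output, binarized the same way

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = f1_score(y_true, y_pred)  # 2 * precision * sensitivity / (precision + sensitivity)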
Results
The fine-tuned GPT-3 model achieved an accuracy of 87% (131/150) on the testing set, with a sensitivity of 85%, a specificity of 90%, a precision of 88%, and an F1 score of 0.87. Its performance was comparable with that of human physicians (accuracy, 91%; P = .47).
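The abstract does not state which test produced this P value; one plausible choice for comparing two accuracies is a Yates-corrected chi-square test, sketched below. The physician numerator (136/150) is back-calculated from the reported 91% and is an assumption, as is the test itself.

# Illustrative two-proportion comparison; test choice and physician count are assumed.
from scipy.stats import chi2_contingency

table = [[131, 150 - 131],   # model: correct, incorrect
         [136, 150 - 136]]   # physicians: correct, incorrect (assumed)
chi2, p, dof, expected = chi2_contingency(table)  # Yates correction applied by default for 2x2
print(round(p, 2))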
Conclusions and Relevance
The fine-tuned GPT-3 model demonstrated high accuracy and reliability in assisting with pediatric differential diagnosis, with performance comparable to that of human physicians. Large language models could be valuable tools for supporting clinical decision-making in resource-constrained environments. Further research should explore their implementation across varied clinical workflows.