Leveraging Reddit Data for Context-Enhanced Synthetic Health Data Generation to Identify Low Self-Esteem
Abstract
Low self-esteem (LoST) is a latent yet critical psychosocial risk factor that predisposes individuals to depressive disorders. Although structured tools exist to assess self-esteem, their limited clinical adoption suggests that relevant indicators of LoST remain buried within unstructured clinical narratives. The scarcity of annotated clinical notes impedes the development of natural language processing (NLP) models for its detection: manual chart reviews are labor-intensive, and large language model (LLM)-driven (weak) labeling raises privacy concerns. Past studies demonstrate that NLP models trained on LLM-generated synthetic clinical notes achieve performance comparable to, and sometimes better than, those trained on real notes. This highlights synthetic data's utility for augmenting scarce clinical corpora while reducing privacy concerns. Prior efforts have leveraged social media data, such as Reddit posts, to identify linguistic markers of low self-esteem; however, the linguistic and contextual divergence between social media and clinical text limits the generalizability of these models. To address this gap, we present a novel framework that generates context-enhanced synthetic clinical notes from social media narratives and evaluates the utility of small language models for identifying expressions of low self-esteem. Our approach includes a mixed-method evaluation framework covering (i) structural analysis, (ii) readability analysis, (iii) linguistic diversity, and (iv) contextual fidelity of LoST cues between source Reddit posts and synthetic notes. This work offers a scalable, privacy-preserving solution for synthetic data generation for early detection of psychosocial risks such as LoST and demonstrates a pathway for translating mental health signals in clinical notes into clinically actionable insights, thereby identifying patients at risk.