Applying machine-learning to rapidly analyse large qualitative text datasets to inform the COVID-19 pandemic response: Comparing human and machine-assisted topic analysis techniques

Abstract

Background

Machine-assisted topic analysis (MATA) uses artificial intelligence methods to assist qualitative researchers to analyse large amounts of textual data. This could allow qualitative researchers to inform and update public health interventions ‘in real-time’, to ensure they remain acceptable and effective during rapidly changing contexts (such as a pandemic). In this novel study we aimed to understand the potential for such approaches to support intervention implementation, by directly comparing MATA and ‘human-only’ thematic analysis techniques when applied to the same dataset (1472 free-text responses from users of the COVID-19 infection control intervention ‘Germ Defence’).

Methods

In MATA, the analysis process included an unsupervised topic modelling approach to identify latent topics in the text. The human research team then described the topics and identified broad themes. In human-only codebook analysis, an initial codebook was developed by an experienced qualitative researcher and applied to the dataset by a well-trained research team, who met regularly to critique and refine the codes. To understand similarities and differences, formal triangulation using a ‘convergence coding matrix’ compared the findings from both methods, categorising each as ‘agreement’, ‘complementary’, ‘dissonant’, or ‘silent’.
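As a rough illustration of how a convergence coding matrix might be tallied once each pairing has been classified by the research team, the sketch below counts pairings per triangulation category. The row layout, the function name `summarise_convergence`, and every row in `example_matrix` are hypothetical placeholders, not the study's data or tooling; only the four category labels come from the text above.

```python
from collections import Counter

# The four triangulation categories named in the Methods.
CATEGORIES = ("agreement", "complementary", "dissonant", "silent")


def summarise_convergence(matrix):
    """Count how many compared findings fall into each category.

    `matrix` is a list of dicts, one per pairing of a MATA topic with a
    human-only theme, each carrying a manually assigned "category".
    """
    counts = Counter()
    for row in matrix:
        category = row["category"]
        if category not in CATEGORIES:
            raise ValueError(f"Unknown triangulation category: {category!r}")
        counts[category] += 1
    # Report every category, including those with a count of zero.
    return {name: counts[name] for name in CATEGORIES}


# Hypothetical rows for illustration only (not taken from the study).
example_matrix = [
    {"mata_topic": "Topic A", "human_theme": "Theme 1", "category": "agreement"},
    {"mata_topic": "Topic B", "human_theme": "Theme 2", "category": "complementary"},
    {"mata_topic": "Topic C", "human_theme": "Theme 3", "category": "agreement"},
]

summary = summarise_convergence(example_matrix)
print(summary)
```

The classification itself remains a human judgement; the code only aggregates the resulting labels so that the balance between categories can be read at a glance.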

Results

Human analysis took much longer (147.5 hours) than MATA (40 hours). Both human-only and MATA identified key themes about what users found helpful and unhelpful (e.g. ‘Boosting confidence in how to perform the behaviours’ vs ‘Lack of personally relevant content’). Formal triangulation of the codes created showed high similarity between the findings. All codes developed from the MATA were classified as in agreement or complementary to the human themes. Where the findings were classified as complementary, this was typically due to slightly differing interpretations or nuance present in the human-only analysis.
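The size of the time saving follows directly from the two durations reported above. The short calculation below is a simple arithmetic check; the variable names are illustrative, and the derived figures are not stated in the abstract itself.

```python
# Durations reported in the Results (hours).
human_hours = 147.5
mata_hours = 40.0

# Derived quantities (computed here, not quoted from the article).
hours_saved = human_hours - mata_hours
percent_saved = 100 * hours_saved / human_hours
speedup = human_hours / mata_hours

print(f"Hours saved: {hours_saved:.1f}")          # 107.5
print(f"Time saved: {percent_saved:.1f}%")        # 72.9%
print(f"Human-only took {speedup:.2f}x as long")  # 3.69x
```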

Conclusions

Overall, the quality of MATA was as high as the human-only thematic analysis, with substantial time savings. For simple analyses that do not require an in-depth or subtle understanding of the data, MATA is a useful tool that can support qualitative researchers to interpret and analyse large datasets quickly. These findings have practical implications for intervention development and implementation, such as enabling rapid optimisation during public health emergencies.

Contributions to the literature

  • Natural language processing (NLP) techniques have been applied within health research due to the need to rapidly analyse large samples of qualitative data. However, the extent to which these techniques lead to results comparable to human coding requires further assessment.

  • We demonstrate that combining NLP with human analysis to analyse free-text data can be a trustworthy and efficient method to use on large quantities of qualitative data.

  • This method has the potential to play an important role in contexts where rapid descriptive or exploratory analysis of very large datasets is required, such as during a public health emergency.
