A field guide for the compositional analysis of any-omics data
Abstract
Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. Today, NGS is routinely used to understand many important topics in biology from human disease to microorganism diversity. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: the magnitude of the counts are determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged, and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when comparing heterogeneous samples (e.g., samples collected across distinct cancers or tissues). Instead, methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. In this manuscript, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. In doing so, we review zero replacement, differential abundance analysis, and within-group and between-group coordination analysis. We then discuss how this pipeline can accommodate complex study design, facilitate the analysis of vertically and horizontally integrated data, including multiomics data, and further extend to single-cell sequencing data. In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”. Taken together, this manuscript establishes the first fully comprehensive analysis protocol that is suitable for any and all -omics data.
Related articles
Related articles are currently not available for this article.