Identifying, understanding, and correcting technical biases on the sex chromosomes in next-generation sequencing data
Abstract
Mammalian X and Y chromosomes share a common evolutionary origin and retain regions of high sequence similarity. This sequence homology can cause the mismapping of short sequencing reads derived from the sex chromosomes and affect variant calling and other downstream analyses. Understanding and correcting this problem is critical for medical genomics and population genomic inference. Here, we characterize how sequence homology can affect analyses on the sex chromosomes and present XYalign, a new tool that: (1) facilitates the inference of sex chromosome complement from next-generation sequencing data; (2) corrects erroneous read mapping on the sex chromosomes; and (3) tabulates and visualizes important metrics for quality control such as mapping quality, sequencing depth, and allele balance. We show how these metrics can be used to identify XX and XY individuals across diverse sequencing experiments, including low- and high-coverage whole-genome sequencing and exome sequencing. We also show that XYalign corrects mismapped reads on the sex chromosomes, resulting in more accurate variant calling. Finally, we discuss how the flexibility of the XYalign framework can be leveraged for other use cases, including the identification of aneuploidy on the autosomes. XYalign is available as open-source software under the GNU General Public License (version 3).
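As a rough illustration of how sequencing depth alone can hint at sex chromosome complement, the following minimal sketch compares mean depth on X and Y with mean autosomal depth. It is not XYalign's implementation: the function names, the input format (plain lists of per-window depths), and both thresholds are hypothetical assumptions chosen for illustration, and in practice depth would be combined with mapping quality and allele balance.

```python
from statistics import mean


def relative_depth(chrom_depths, autosome_depths):
    """Return mean depth of a chromosome divided by mean autosomal depth."""
    return mean(chrom_depths) / mean(autosome_depths)


def guess_complement(x_ratio, y_ratio, y_present=0.2, x_double=0.75):
    """Guess XX or XY from depth ratios relative to the autosomes.

    Both thresholds are illustrative assumptions, not values from XYalign.
    An XX sample is expected near x_ratio = 1.0 with little Y depth;
    an XY sample is expected near x_ratio = 0.5 and y_ratio = 0.5.
    """
    has_y = y_ratio >= y_present
    two_x = x_ratio >= x_double
    if has_y and not two_x:
        return "XY"
    if two_x and not has_y:
        return "XX"
    # Anything else (for example, high X and clear Y depth) is left
    # for manual inspection rather than forced into a call.
    return "ambiguous"


if __name__ == "__main__":
    autosome = [30, 31, 29]  # hypothetical per-window depths
    x_ratio = relative_depth([15, 16, 14], autosome)
    y_ratio = relative_depth([14, 15, 16], autosome)
    print(guess_complement(x_ratio, y_ratio))
```

Residual Y depth in an XX sample can be nonzero precisely because of the X-Y homology described above, which is why the sketch uses a threshold for Y presence rather than requiring zero depth.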