Seq2KING: An unsupervised internal transformer representation of global human heritages
Abstract
Disentangling the intricate tapestry of human genetic relationships is a central challenge in population genetics and precision medicine. We propose that the principles of lexical connectivity, whereby words derive meaning from their contextual interactions, can be adapted to genetic data, enabling transformer models to reveal that individuals with higher genetic similarity form stronger latent connections. We explored this by transposing KING kinship matrices into the query-key-value (QKV) latent space of transformer models and determined that attention mechanisms can capture genetic relatedness in an unsupervised fashion. We found that pairs of individuals from the same continent showed an attention-weight connectivity of 85.34% (p < 0.05), compared with pairs from different continents. Surprisingly, we found that some encoder layers required inversion of their latent representations for this connectivity to become apparent. Lastly, we used BERTViz to render the hyper-dense connectivity patterns among individuals in human-readable form. Our approach is based purely on attention, which yields a non-discrete spectrum of relatedness and thus uncovers patterns from first principles. Seq2KING addresses the significant challenge of discovering population structure to construct a global human relatedness map without relying on predefined labels. Our excavation of the latent space is a paradigm shift from legacy supervised genetic methodologies, presenting a new way to understand the human pangenome and to discern population substructures for developing precision genetic medicines.
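To make the core mechanism concrete, a minimal sketch follows. It assumes a precomputed KING kinship matrix stored in a hypothetical file king.npy, an arbitrary embedding width of 64, and a single randomly initialized attention head; the abstract does not specify the Seq2KING architecture, so this illustrates attention over kinship rows rather than the authors' implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F

# Hypothetical input: a symmetric KING kinship matrix, one row per
# individual. The file name and dimensions are illustrative assumptions.
kinship = torch.tensor(np.load("king.npy"), dtype=torch.float32)  # (n, n)
n, d_model = kinship.shape[0], 64

# Treat each individual's kinship row as that individual's input embedding.
embed = torch.nn.Linear(kinship.shape[1], d_model)
x = embed(kinship)  # (n, d_model)

# Single-head self-attention: project the embeddings into QKV space.
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

q, k, v = W_q(x), W_k(x), W_v(x)
scores = q @ k.T / d_model**0.5   # (n, n) pairwise attention logits
attn = F.softmax(scores, dim=-1)  # each row: one individual's weights over all others

# A higher attention weight between two individuals is read as stronger
# latent genetic connectivity.
print(attn.shape)
```

Under this reading, within-continent versus cross-continent connectivity can be compared by averaging the off-diagonal attention weights within and across continental groupings.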
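The BERTViz rendering step could be sketched as below, using a toy attention map and hypothetical sample identifiers in place of the real individual-by-individual weights; BERTViz's head_view expects one tensor per layer shaped (batch, heads, seq_len, seq_len) and renders inside a notebook.

```python
import torch
from bertviz import head_view

# Toy attention map standing in for the individual-by-individual weights
# computed above; 8 hypothetical individuals.
n = 8
attn = torch.softmax(torch.randn(n, n), dim=-1)

# Illustrative sample identifiers; in practice these would be cohort IDs.
tokens = [f"IND{i}" for i in range(n)]

# Wrap the single attention map as one layer with one head.
attention = [attn.unsqueeze(0).unsqueeze(0)]  # (1, 1, n, n)
head_view(attention, tokens)  # interactive attention view in a notebook
```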