Efficient Identification of Short Tandem Repeats via Context-Aware Motif Discovery and Ultra-Fast Sequence Alignment
Abstract
Tandem repeats (TRs) are highly polymorphic genomic elements associated with diverse molecular traits and implicated in numerous human diseases. However, large-scale analysis of TRs has been limited by computational challenges, including motif recognition, detection in complex regions, and excessive computational cost. Here we present FastSTR, a computationally efficient tool for precise detection and characterization of TRs. FastSTR integrates a context-aware N-gram motif model with a segmented global alignment algorithm to enable accurate motif identification and boundary definition, even for repeat units up to 8 bp. Across 13 species, FastSTR achieved >90% recall and 99% precision, running several times faster than existing methods while outperforming them in both sensitivity and accuracy. Applied to the human genome, FastSTR uncovered previously unannotated HSATII elements, resolved population-specific TR variation, and identified recurrent STR alterations in lung cancer. These results establish FastSTR as a versatile framework for TR annotation and discovery, advancing studies of genome evolution, genetic diversity, and disease.