An annotation-free structural readout of CDS and UTR sequence regimes in the human genome
Abstract
Genomic sequence analysis typically begins with annotation: gene models, functional domains, expression data, or evolutionary conservation. We introduce SHP (Saussurean Hash Projection), a lightweight, zero-training structural readout for nucleotide sequences. Feature computation is annotation-free after CDS/UTR sequence extraction: Ensembl transcript annotations are used to define sequence regions, but no functional labels, expression data, conservation scores, or disease annotations enter the SHP calculation. SHP encodes each sequence window through two complementary views of the same 3-mer stream -- a chroma axis (which 3-mers are present) and a rhythm axis (which 3-mer transitions occur) -- into the same 64-dimensional hash space, then measures structural tension via the Jaccard distance between them. The instrument is calibrated against a fair IID baseline (k = 4, θ₀ = 0.0999). Applied to the full human protein-coding genome (19,491 genes, 224,518 transcript isoforms, Ensembl release 115), SHP produces an 8-dimensional per-gene structural feature vector (fixed_wit, tail_energy, skew, kurt for both CDS and UTR) plus derived gradients, without consulting any external functional labels. We report the following findings: (1) the CDS structural quiescence rate is 48.8% genome-wide, with regime-specific signatures -- MHC class I genes show 80% CDS quiescence, while neural genes show only 20%; (2) a UTR/CDS structural gradient distinguishes functional regimes, with KRTAP genes as the only CDS-led regime (85.7% UTR quiescence); (3) SHP vectors carry functional information: nearest-centroid classification achieves 30-40% per-category recall for olfactory receptors, HLA genes, and keratins from 2D SHP features alone; (4) this clustering is not a GC-content proxy -- zinc-finger genes (GC = 47%) and transcription factors (GC = 63%) are separated by 16 percentage points of GC yet cluster together in SHP space (d = 0.004).
The method requires no training, no model weights, and no external functional databases beyond the input sequence files. Code, demo FASTA input, calibration summaries, figure-generation scripts, and the per-gene and per-isoform SHP matrices are provided for replication. We provide a calibrated, annotation-free structural spectroscopy tool for genomic sequence screening.
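The two-view encoding described above can be illustrated with a minimal sketch. It is not the released SHP code: the CRC32-based bucket function, the stride of one nucleotide, the `>` separator used to build transition tokens, and the names `bucket`, `kmers`, and `shp_tension` are all assumptions made for illustration. The abstract specifies only that both views of the 3-mer stream share a 64-dimensional hash space and are compared by Jaccard distance. The IID calibration (k = 4, θ₀ = 0.0999) and the derived features (fixed_wit, tail_energy, skew, kurt) are not reproduced here.

```python
import zlib

DIM = 64  # size of the shared hash space stated in the abstract


def bucket(token: str) -> int:
    """Map a token to one of DIM buckets (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(token.encode("ascii")) % DIM


def kmers(seq: str, k: int = 3, step: int = 1) -> list[str]:
    """Return the k-mer stream of a sequence (stride `step` is an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def shp_tension(window: str, step: int = 1) -> float:
    """Jaccard distance between the chroma and rhythm bucket sets of one window."""
    stream = kmers(window, 3, step)
    # Chroma axis: which 3-mers are present in the window.
    chroma = {bucket(t) for t in stream}
    # Rhythm axis: which consecutive 3-mer transitions occur in the window.
    rhythm = {bucket(a + ">" + b) for a, b in zip(stream, stream[1:])}
    union = chroma | rhythm
    if not union:
        return 0.0  # window shorter than one 3-mer: no structure to compare
    return 1.0 - len(chroma & rhythm) / len(union)


if __name__ == "__main__":
    # Hypothetical toy window; real input would be an extracted CDS or UTR window.
    print(shp_tension("ATGGCGTACGTTAGCATGGCG"))
```

Under these assumptions, a window containing a single 3-mer has an empty rhythm set and therefore the maximum tension of 1.0, while longer windows fall between 0 and 1 depending on how many buckets the two views share.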