Heimdall: A Modular Framework for Tokenization in Single-Cell Foundation Models
Abstract
Foundation models trained on single-cell RNA-sequencing (scRNA-seq) data have rapidly become powerful tools for single-cell analysis. Their performance, however, depends critically on how cells are tokenized into model inputs – a design space that remains poorly understood. Here, we present Heimdall, a comprehensive framework and open-source toolkit for systematically evaluating tokenization strategies in single-cell foundation models (scFMs). Heimdall decomposes each scFM into modular components: a gene identity encoder (F_G), an expression encoder (F_E), and a “cell sentence” constructor (F_C) with submodules (ORDER, SEQUENCE, and REDUCE) enabling fine-grained control and attribution. Using a transformer trained from scratch, we evaluate tokenization strategies for cell type classification across challenging transfer learning settings – cross-tissue, cross-species, and spatial gene-panel shifts – and separately assess reverse perturbation prediction. Tokenization choices show minimal impact in-distribution but are decisive under distribution shift, with F_G and ORDER driving the largest gains and F_E providing additional improvements. Heimdall further shows how existing strategies can be recombined to enhance generalization. By standardizing evaluation and providing an extensive library, Heimdall establishes a foundation for reproducible, systematic exploration of single-cell tokenization and accelerates the development of next-generation scFMs.
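As an illustration of the modular decomposition described above, the following minimal sketch composes toy versions of F_G, F_E, and F_C in Python with NumPy. It is not the Heimdall API: every function name, the random embedding table, the log1p expression encoding, and the specific readings of ORDER (rank expressed genes by expression), SEQUENCE (truncate to a fixed length), and REDUCE (sum the gene and expression embeddings) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, dim = 6, 4

# F_G (assumed form): gene identity encoder as a random lookup table.
gene_table = rng.normal(size=(n_genes, dim))

def f_g(gene_ids):
    return gene_table[gene_ids]

# F_E (assumed form): expression encoder that scales a fixed direction
# by the log1p-transformed expression value.
expr_direction = rng.normal(size=dim)

def f_e(values):
    return np.log1p(values)[:, None] * expr_direction[None, :]

# ORDER (assumed reading): expressed genes, highest expression first.
def order(expression):
    idx = np.argsort(-expression, kind="stable")
    return idx[expression[idx] > 0]

# SEQUENCE (assumed reading): keep at most max_len genes.
def sequence(gene_ids, max_len):
    return gene_ids[:max_len]

# REDUCE (assumed reading): combine the two embeddings per token by summation.
def reduce(gene_emb, expr_emb):
    return gene_emb + expr_emb

# F_C: build the "cell sentence" by chaining the three submodules.
def f_c(expression, max_len=4):
    ids = sequence(order(expression), max_len)
    return ids, reduce(f_g(ids), f_e(expression[ids]))

cell = np.array([0.0, 5.0, 1.0, 0.0, 3.0, 2.0])  # toy expression vector
ids, tokens = f_c(cell)
print(ids.tolist(), tokens.shape)
```

Because each component is a separate function, one can swap a single piece (for example, a different ordering rule) while holding the others fixed, which is the kind of attribution the framework is described as enabling.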