Reconstructing What the Brain Hears: Cross-Subject Music Decoding from fMRI via Prior-Guided Diffusion Model
Abstract
Reconstructing music directly from brain activity offers a unique window onto the representational geometry of the auditory system and paves the way for next-generation brain–computer interfaces. We introduce a fully data-driven pipeline that combines cross-subject functional alignment with Bayesian decoding in the latent space of a diffusion-based audio generator. Functional alignment projects individual fMRI responses onto a shared representational manifold, improving cross-participant decoding accuracy relative to anatomically normalized baselines. A Bayesian search over latent trajectories then selects the most plausible waveform candidate, stabilizing reconstructions against neural noise. Crucially, we bridge CLAP’s multi-modal embeddings to music-domain latents through a dedicated aligner, eliminating the need for hand-crafted captions and preserving the intrinsic structure of musical features. Evaluated on ten diverse genres, the model achieves a cross-subject-averaged Identification Accuracy of 0.914 ± 0.019 and produces audio that naïve listeners recognize above chance in 85.7% of trials. Voxel-wise analyses locate the predictive signal within a bilateral circuit spanning early auditory, inferior-frontal, and premotor cortices, consistent with hierarchical and sensorimotor theories of music perception. The framework establishes a principled bridge between generative audio models and cognitive neuroscience, opening avenues for thought-driven composition, objective metrics for music-based therapy, and translational applications in non-verbal communication and neurotechnologies.
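The abstract reports Identification Accuracy without specifying the exact protocol; a common definition is 2-way pairwise identification over embeddings (e.g., CLAP features) of reconstructed versus ground-truth clips. The minimal sketch below assumes that definition and is illustrative only: the function name, array shapes, and the use of cosine similarity are assumptions, not the paper's stated procedure.

```python
import numpy as np

def identification_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """2-way pairwise identification accuracy (assumed protocol).

    pred, target: (n_samples, d) embedding matrices, e.g., CLAP features of
    reconstructed and ground-truth music clips. For every ordered pair (i, j)
    with i != j, the reconstruction i is scored as correctly identified if it
    is more similar (cosine) to its own ground truth than to clip j's.
    Chance level is 0.5.
    """
    # L2-normalize so the dot product equals cosine similarity
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)

    sim = pred @ target.T                 # (n, n) similarity matrix
    correct = np.diag(sim)[:, None]       # similarity to the matching clip
    wins = (correct > sim).sum()          # diagonal ties contribute zero
    n = sim.shape[0]
    return wins / (n * (n - 1))           # fraction of pairwise comparisons won
```

Under this convention, a score of 0.914 would mean that roughly 91% of all pairwise comparisons favor the correct ground-truth clip.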