Monitoring Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA
Abstract
Monitoring land cover changes from multi-temporal remote sensing imagery requires detecting visual transformations and describing them in natural language. Existing methods often struggle to balance visual accuracy with linguistic coherence. We propose MVLT-LoRA-CC (multi-modal Vision Language Transformer with Low-Rank Adaptation for Change Captioning), a framework that integrates a Vision Transformer (ViT), a Large Language Model (LLM), and Low-Rank Adaptation (LoRA) for efficient multi-modal learning. The model processes image pairs through convolutional patch embeddings and transformer blocks with self-attention and rotary positional encodings, aligning visual and textual representations via a multi-modal adapter. LoRA improves fine-tuning efficiency by introducing low-rank trainable matrices, reducing computational cost while preserving the LLM's linguistic knowledge. We also propose the Complementary Consistency Score (CCS) framework, comprising CCSBMRC, CCSMC, and CCSMCS, to jointly evaluate descriptive accuracy on change samples and classification precision on no-change cases. Experiments on the LEVIR-CC dataset show that MVLT-LoRA-CC surpasses state-of-the-art methods on semantic and consistency metrics. By integrating vision and language pretraining, the model improves generalization, interpretability, and robustness, establishing a scalable approach for multi-modal Earth observation and environmental monitoring.
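The low-rank adaptation mechanism mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; all dimensions, the scaling factor, and the initialization scheme are illustrative assumptions, showing only the core idea that a frozen weight W is augmented by two small trainable matrices B and A instead of a full-rank update.

```python
import numpy as np

# Minimal LoRA sketch (assumed dimensions, not from the paper).
rng = np.random.default_rng(0)

d_in, d_out, rank = 64, 64, 4           # rank << d_in, d_out
W = rng.standard_normal((d_out, d_in))  # pretrained weight, kept frozen
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))             # trainable up-projection, zero-initialized
alpha = 8.0                             # LoRA scaling factor (common convention)

def lora_forward(x):
    # Effective weight is W + (alpha / rank) * B @ A, but the update is
    # stored as two small factors rather than a full d_out x d_in delta.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: rank * (d_in + d_out) for LoRA vs. d_in * d_out
# for full fine-tuning of this layer.
print(rank * (d_in + d_out), "trainable vs", d_in * d_out, "full")
```

At rank 4 this layer trains 512 parameters instead of 4096, which is the efficiency gain the abstract attributes to LoRA.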