MLIF-Net: Multimodal Fusion of Vision Transformers and Large Language Models for AI Image Detection

Abstract

This paper presents the Multimodal Language-Image Fusion Network (MLIF-Net), a novel architecture for distinguishing AI-generated images from real ones. MLIF-Net combines a Vision Transformer (ViT) with Large Language Models (LLMs) to build a multimodal feature fusion network that improves the accuracy of AI-generated content detection. The model uses a Cross-Attention Mechanism to fuse visual and semantic features and a Multiscale Contextual Reasoning Layer to capture both global and local image features. An Adaptive Loss Function improves the consistency and robustness of feature extraction. Experimental results show that MLIF-Net outperforms existing models in accuracy, recall, and Average Precision (AP). This approach enables more accurate detection of AI-generated content and may extend to other generative content tasks.
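The abstract does not specify implementation details of the Cross-Attention Mechanism, but the general pattern of fusing visual and semantic features via cross-attention can be illustrated as follows. This is a minimal PyTorch sketch, not the paper's actual design: the token counts, embedding dimensions (768 for ViT-B/16, 4096 for a typical LLM), projection layout, and residual structure are all assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion block: visual tokens (e.g., from a
    ViT) attend to semantic tokens (e.g., from an LLM). All dimensions and the
    projection layout are assumptions, not MLIF-Net's published configuration."""

    def __init__(self, vis_dim=768, txt_dim=4096, fused_dim=768, num_heads=8):
        super().__init__()
        # Project each modality into a shared fusion space.
        self.q_proj = nn.Linear(vis_dim, fused_dim)
        self.kv_proj = nn.Linear(txt_dim, fused_dim)
        self.attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N_v, vis_dim); txt_tokens: (B, N_t, txt_dim)
        q = self.q_proj(vis_tokens)            # visual features as queries
        kv = self.kv_proj(txt_tokens)          # semantic features as keys/values
        fused, _ = self.attn(q, kv, kv)        # cross-attention across modalities
        return self.norm(q + fused)            # residual connection + layer norm

# Usage with dummy inputs: fuse ViT patch embeddings with LLM hidden states.
vis = torch.randn(2, 197, 768)   # e.g., ViT-B/16 patch tokens (196 + [CLS])
txt = torch.randn(2, 32, 4096)   # e.g., LLM hidden states for a caption
fused = CrossAttentionFusion()(vis, txt)
print(fused.shape)               # torch.Size([2, 197, 768])
```

Using visual tokens as queries and semantic tokens as keys/values lets each image patch selectively attend to language-derived context; the fused representation could then feed a multiscale reasoning stage and a classification head, per the architecture outlined above.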
