SCAP: Enhancing Image Captioning through Lightweight Feature Sifting and Hierarchical Decoding


Abstract

Image captioning aims to generate descriptive captions for visual content, strengthening the connection between images and their semantic meaning. In this paper, we propose SCAP, a lightweight model that improves image captioning through a novel sifting attention mechanism. SCAP incorporates a summary module and a forget module within its encoder to refine visual information, discarding noise while retaining essential details. Its hierarchical decoder then uses sifting attention to align the refined image features with the caption being generated, producing accurate and contextually relevant descriptions. Extensive experiments on the COCO dataset demonstrate SCAP's superior performance, achieving state-of-the-art results while maintaining computational efficiency. This lightweight design makes SCAP a promising solution for resource-constrained scenarios, advancing the field of image captioning.
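
To make the sifting idea concrete, the sketch below shows one plausible reading of a summary/forget encoder block in PyTorch: a summary vector condenses the region features, and a sigmoid forget gate down-weights regions judged uninformative before they are passed to the decoder. All module names, shapes, and design choices here are assumptions for illustration only; the paper's actual architecture may differ.

    import torch
    import torch.nn as nn

    class SiftingEncoderBlock(nn.Module):
        """Illustrative summary/forget gating over detected region features.
        This is a hypothetical sketch, not SCAP's published implementation."""
        def __init__(self, dim: int):
            super().__init__()
            # Summary module: condense all region features into one global vector.
            self.summary = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
            # Forget module: per-region sigmoid gate conditioned on the summary.
            self.forget_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

        def forward(self, regions: torch.Tensor) -> torch.Tensor:
            # regions: (batch, num_regions, dim)
            summary = self.summary(regions.mean(dim=1, keepdim=True))  # (batch, 1, dim)
            summary = summary.expand_as(regions)
            gate = self.forget_gate(torch.cat([regions, summary], dim=-1))
            return gate * regions  # noisy regions are suppressed, salient ones retained

    # Example usage with assumed sizes (e.g. 36 detected regions, 512-d features):
    block = SiftingEncoderBlock(dim=512)
    feats = torch.randn(2, 36, 512)
    sifted = block(feats)  # same shape; gated, "sifted" visual features

Under this reading, the hierarchical decoder would attend over the sifted features at each decoding step, so the gating acts as a learned filter between the visual backbone and the caption generator.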
