Abstract
Image captioning is a challenging task that aims to generate a natural-language description for an image. Word prediction depends on local linguistic context and fine-grained visual information, and is also guided by previously generated linguistic tokens. However, current captioning methods do not fully exploit local visual and linguistic information, producing coarse or incorrect descriptions. Moreover, recent captioning decoders have focused less on convolutional neural networks (CNNs), which have advantages in feature extraction. To address these problems, we propose a local representation-enhanced recurrent convolutional network (Lore-RCN). Specifically, we propose a visual convolutional network that yields an enhanced local linguistic context by incorporating selected local visual information and modeling short-term neighboring dependencies. Furthermore, we propose a linguistic convolutional network that yields an enhanced linguistic representation by explicitly modeling long- and short-term correlations to leverage guiding information from previous linguistic tokens. Experiments conducted on the COCO and Flickr30k datasets verify the superiority of our proposed recurrent CNN-based model.