Abstract
Image captioning is a challenging task that aims to generate a natural-language description for an image. Word prediction depends on local linguistic context and fine-grained visual information, and is also guided by previously generated linguistic tokens. However, current captioning methods do not fully exploit local visual and linguistic information, producing coarse or incorrect descriptions. Moreover, recent captioning decoders have focused less on convolutional neural networks (CNNs), which have advantages in feature extraction. To address these problems, we propose a local representation-enhanced recurrent convolutional network (Lore-RCN). Specifically, we propose a visual convolutional network that yields an enhanced local linguistic context by incorporating selected local visual information and modeling short-term neighboring dependencies. Furthermore, we propose a linguistic convolutional network that yields an enhanced linguistic representation by explicitly modeling long- and short-term correlations to leverage guiding information from previous linguistic tokens. Experiments conducted on the COCO and Flickr30k datasets verify the superiority of our proposed recurrent CNN-based model.