Say in Human-Like Way: Hierarchical Cross-modal Information Abstraction and Summarization for Controllable Captioning
-
Author
-
Journal
-
Year, Volume, Article Number
2021, 12891,
-
Keywords
-
Abstract
Image captioning aims to generate appropriate textual sentences for an image. However, many existing captioning models exploit information incompletely and generate coarse or even incorrect descriptions of region details. This paper proposes a controllable captioning approach called Say in Human-like Way (Shway), which exploits intra- and inter-modal information in vision and language hierarchically in a diamond shape, with image regions as the control signal. Shway is divided into abstraction and summarization stages: it adequately explores cross-modal information in the first stage, and in the second stage effectively summarizes the generated contexts with a novel fusion mechanism to make predictions. Our experiments are conducted on COCO Entities and Flickr30k Entities. The results demonstrate that our proposed model achieves state-of-the-art performance compared with current methods in terms of controllable caption quality.