Multimodal high-order relational network for vision-and-language tasks
-
Authors
Pan, Hao; Huang, Jun
-
Journal
NEUROCOMPUTING
-
Year, volume, ISSN
2022, 492, 0925-2312
-
Abstract
Vision-and-language tasks require the understanding and learning of visual semantic relations, language syntactic relations, and mutual relations between the two modalities. Existing methods focus only on intra-modality low-order relations by simply combining pairwise features, ignoring intra-modality high-order relations and the sophisticated correlations between visual and textual relations. We thus propose the multimodal high-order relational network (MORN) to simultaneously capture intra-modality high-order relations and the sophisticated correlations between visual and textual relations. The MORN model consists of three modules. A coarse-to-fine visual relation encoder first captures the fully-connected relations between all visual objects and then refines the local relations between neighboring objects. A textual relation encoder captures the syntactic relations between text words. Finally, a relational multimodal transformer aligns the multimodal representations and models the sophisticated correlations between textual and visual relations. Our proposed approach achieves state-of-the-art performance on two vision-and-language tasks: visual question answering (VQA) and visual grounding (VG). (c) 2022 Elsevier B.V. All rights reserved.
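The coarse-to-fine idea described in the abstract — global attention over all detected objects followed by a refinement restricted to each object's spatial neighbors — can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the authors' implementation; the function names, the use of box-center distance to define neighbors, and the choice of `k` are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_attention(feats, mask=None):
    # Scaled dot-product self-attention over object features
    # (feats: [num_objects, dim]); a boolean mask limits which
    # object pairs may attend to each other.
    d = feats.shape[-1]
    scores = feats @ feats.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores, axis=-1) @ feats

def coarse_to_fine_encoder(feats, boxes, k=2):
    # Coarse stage: fully-connected relations between all objects.
    coarse = relation_attention(feats)
    # Fine stage (illustrative): restrict attention to each object's
    # k nearest neighbors by bounding-box center distance.
    centers = boxes.reshape(-1, 2, 2).mean(axis=1)   # boxes: [n, 4]
    dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    nn = np.argsort(dists, axis=-1)[:, : k + 1]      # includes self
    mask = np.zeros_like(dists, dtype=bool)
    np.put_along_axis(mask, nn, True, axis=-1)
    return relation_attention(coarse, mask=mask)

# Example: 5 detected objects with 8-dim features and [x1, y1, x2, y2] boxes.
rng = np.random.default_rng(0)
refined = coarse_to_fine_encoder(
    rng.standard_normal((5, 8)), rng.uniform(0, 1, (5, 4)), k=2
)
```

The two-stage structure lets the first pass propagate context among all objects, while the second pass sharpens relations that matter locally; the paper's actual encoder operates on learned multi-head attention rather than this single-head sketch.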