Abstract
Deep multimodal learning has attracted increasing attention in artificial intelligence because it bridges vision and language. Most existing works focus only on specific multimodal tasks, which limits their ability to generalize to other tasks. Furthermore, these works learn only coarse-grained interactions at the object level in images and the word level in text, while ignoring fine-grained interactions at the relation level and attribute level. In this paper, to alleviate these issues, we propose a Semantic-aware Multi-Branch Interaction (SeMBI) network for various multimodal learning tasks. SeMBI consists of three main modules: a Multi-Branch Visual Semantics (MBVS) module, a Multi-Branch Textual Semantics (MBTS) module, and a Multi-Branch Cross-modal Alignment (MBCA) module. The MBVS module enhances visual features and performs reasoning through three parallel branches: a latent relationship branch, an explicit relationship branch, and an attribute branch. The MBTS module learns relation-level and attribute-level language context through a textual relationship branch and a textual attribute branch, respectively. The enhanced visual features are then passed into the MBCA module to learn fine-grained cross-modal correspondence under the guidance of the relation-level and attribute-level language context. We demonstrate the generalizability and effectiveness of the proposed SeMBI by applying it to three deep multimodal learning tasks: Visual Question Answering (VQA), Referring Expression Comprehension (REC), and Cross-Modal Retrieval (CMR). Extensive experiments on five common benchmark datasets show superior performance compared with state-of-the-art methods.