Abstract
Pedestrian attribute recognition is a challenging task because of variations in appearance, illumination, and other imaging conditions across pedestrian images. We observe that two typical kinds of relations, i.e., relations among regions and relations among attributes, are beneficial to this task. In this paper, we explore the potential of Transformers for pedestrian attribute recognition for the first time and propose a Transformer framework called the Dual-Relations Transformer (DRFormer). A Vision Transformer (ViT) is adopted as the feature extractor because it naturally models long-range relations among regions. Furthermore, an Attribute Relation Module (ARM) built on a Transformer encoder is designed to capture relations among attributes. In the ARM, we encode the spatial and semantic information of attributes into vector embeddings. Equipped with spatial information, DRFormer is able to localize attribute-related regions, while semantic information enables it to learn underlying semantic relations among attributes. Extensive experiments on three popular datasets, PETA, PA-100K, and RAP, demonstrate the superiority of the proposed DRFormer over state-of-the-art methods.
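As a rough illustration of the pipeline outlined above, the following is a minimal PyTorch sketch of an ARM-style module: a Transformer encoder over learned per-attribute embeddings that attend jointly with ViT patch tokens, so the encoder can relate attributes to each other and to image regions. All names, dimensions, and wiring here (e.g., `AttributeRelationModule`, one learned query per attribute, concatenating attribute and region tokens) are assumptions made for illustration; the paper specifies only a ViT backbone and a Transformer-encoder ARM, not this exact implementation.

```python
import torch
import torch.nn as nn


class AttributeRelationModule(nn.Module):
    """Hypothetical ARM sketch: a Transformer encoder over attribute embeddings.

    Assumption: one learned (semantic) embedding per attribute, jointly encoded
    with ViT patch tokens (the spatial/region information) so self-attention
    can capture both attribute-attribute and attribute-region relations.
    """

    def __init__(self, num_attrs: int, dim: int = 768, depth: int = 2, heads: int = 8):
        super().__init__()
        # Learned semantic embedding per attribute (illustrative choice).
        self.attr_embed = nn.Parameter(torch.randn(num_attrs, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.Linear(dim, 1)  # one binary logit per attribute

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, N, dim) patch tokens from a ViT backbone.
        b = region_feats.size(0)
        attrs = self.attr_embed.unsqueeze(0).expand(b, -1, -1)
        # Joint encoding of attribute tokens and region tokens.
        tokens = torch.cat([attrs, region_feats], dim=1)
        out = self.encoder(tokens)[:, : attrs.size(1)]  # keep attribute tokens
        return self.classifier(out).squeeze(-1)         # (B, num_attrs) logits


# Usage with dummy ViT-B/16-shaped tokens (196 patches of dim 768 for a
# 224x224 input); 35 attributes as in the common PETA evaluation protocol.
arm = AttributeRelationModule(num_attrs=35)
patch_tokens = torch.randn(4, 196, 768)
logits = arm(patch_tokens)  # shape: (4, 35)
```

One design note on this sketch: feeding region tokens into the same encoder is one plausible way to give the attribute embeddings access to spatial information; the paper's actual mechanism for injecting spatial information into the embeddings may differ.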