Abstract
Knowledge distillation is a model compression method that transfers knowledge from a complex model to an efficient one. Traditional distillation relies on two-stage training, which is time-consuming and computationally expensive. Self-distillation methods alleviate this cost by adopting a one-stage strategy. However, most of them lack a carefully designed auxiliary network, so the learned knowledge is simplistic and insufficient, which limits the effectiveness of distillation. To address this challenge, we propose a training framework named deep-supervised Knowledge Enhancement Self-Distillation (KED), which organizes the teacher model hierarchically by stacking an auxiliary classifier after each shallow block to form the student models. Specifically, two modules, logit decouple distillation (LDD) and attention map generation (AT-Gen), are embedded in the framework to enhance the distilled knowledge. LDD divides the output of each classifier (the logits) into target and non-target classes, making the knowledge from the outputs more flexible and efficient. AT-Gen extracts attention maps from the features of each block, emphasizing the knowledge from intermediate layers through the integration and interaction of the attention maps. Finally, knowledge from these different sources jointly guides the training of the student models, compensating for the bias of the auxiliary networks. Experiments on public datasets demonstrate that our method outperforms other state-of-the-art methods.
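The two knowledge sources described above can be illustrated with a minimal pure-Python sketch. It is not the formulation used in KED, which the abstract does not specify. It assumes that the target/non-target split is scored with two separate KL terms (one over a binary target-versus-rest distribution, one over the renormalized non-target classes) and that an attention map is the channel-wise sum of squared activations followed by L2 normalization. The function names, the weights `alpha` and `beta`, and the `temperature` value are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def decoupled_logit_loss(teacher_logits, student_logits, target,
                         alpha=1.0, beta=1.0, temperature=4.0):
    """Assumed form of a target / non-target logit split (hypothetical)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # Target part: binary distribution (target class vs. everything else).
    target_term = kl_divergence(
        [p_teacher[target], 1.0 - p_teacher[target]],
        [p_student[target], 1.0 - p_student[target]],
    )

    # Non-target part: distribution over the remaining classes, renormalized.
    def non_target(probs):
        rest = [p for i, p in enumerate(probs) if i != target]
        total = max(sum(rest), 1e-12)
        return [p / total for p in rest]

    non_target_term = kl_divergence(non_target(p_teacher), non_target(p_student))
    return alpha * target_term + beta * non_target_term


def attention_map(feature):
    """Assumed attention map: sum of squared activations over channels,
    flattened and L2-normalized. `feature` is indexed as [channel][row][col]."""
    height, width = len(feature[0]), len(feature[0][0])
    energy = [
        sum(channel[r][c] ** 2 for channel in feature)
        for r in range(height)
        for c in range(width)
    ]
    norm = math.sqrt(sum(e * e for e in energy)) or 1.0
    return [e / norm for e in energy]
```

In a self-distillation setting of the kind described, `teacher_logits` would come from the deepest classifier and `student_logits` from an auxiliary classifier attached to a shallower block. Keeping the target and non-target terms separate lets the two weights be tuned independently, which is one reading of the "more flexible" claim.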