- I am currently fixing all known issues and refining the code, so it should be easier than before to understand how each KD method works.
- The algorithms have been re-implemented, but they still need further verification with hyperparameter tuning.
- Note that some algorithms underperform with my configuration. For example, FitNet performs much better when trained with multi-task learning than when used as an initialization. Nevertheless, I have followed the authors' original approach.
- This repository will be an upgraded version of my previous benchmark repository. (link)
Defined knowledge by the neural response of the hidden layer or the output layer of the network
- Soft-logit: The first knowledge distillation method for deep neural networks. Knowledge is defined by softened logits. Because it is easy to handle, many applications have been built on it, such as semi-supervised learning and defense against adversarial attacks.
- Deep Mutual Learning (DML): Trains the teacher and student networks simultaneously, so that the student follows not only the teacher's final results but also its training procedure.
- Factor Transfer (FT): Encodes the teacher network's feature map and transfers the knowledge by having the student mimic it.
- Jangho Kim et al. "Paraphrasing Complex Network: Network Compression via Factor Transfer" Advances in Neural Information Processing Systems (NeurIPS) 2018 (work in progress)

Increase the quantity of knowledge by sensing several points of the teacher network
- FitNet: To increase the amount of information, knowledge is defined by multi-connected networks, and feature maps are compared by L2 distance.
- Attention Transfer (AT): Knowledge is defined by an attention map, which is the L2-norm of each feature point.
- Activation Boundary (AB): To soften the teacher network's constraint, the authors propose a new metric function inspired by the hinge loss commonly used for SVMs.
- VID: Defines a variational lower bound as the knowledge, in order to maximize the mutual information between the teacher and student networks.
- Ahn et al. Variational Information Distillation for Knowledge Transfer (work in progress)

Defined knowledge by the shared representation between two feature maps
- Flow of Solution Procedure (FSP): To soften the teacher network's constraint, the authors define knowledge as the relation between two feature maps.
- KD using Singular Value Decomposition (KD-SVD): To extract the major information in a feature map, the authors use singular value decomposition.
- Seung Hyun Lee et al. Self-supervised knowledge distillation using singular value decomposition. ECCV 2018 [the original project link]

Defined knowledge by intra-data relation
- Relational Knowledge Distillation (RKD): The authors propose knowledge that contains not only feature information but also intra-data relation information.
- Multi-head Graph Distillation (MHGD): The authors propose a distillation module built with a multi-head attention network. Each attention head extracts relations within the feature map, which contain knowledge about the embedding procedure.
- Comprehensive overhaul (CO):
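
As a rough illustration of two of the definitions listed above (Soft-logit and Attention Transfer), the following is a minimal pure-Python sketch, not this repository's implementation. The function names, the temperature value of 4.0, and the `[position][channel]` list layout are assumptions made for the example. The T² scaling is the usual convention for softened logits, and the attention map follows the description given above rather than any particular paper's exact formulation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def soft_logit_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) between softened distributions, scaled by T^2.

    temperature=4.0 is an assumed example value, not a tuned setting.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


def attention_map(feature_map):
    """Collapse a [position][channel] feature map into an attention vector.

    Each position's attention is the L2-norm of its channel vector; the
    vector is then L2-normalized over positions so that the overall
    feature scale does not matter.
    """
    attn = [math.sqrt(sum(c * c for c in point)) for point in feature_map]
    norm = math.sqrt(sum(a * a for a in attn)) or 1.0  # avoid dividing by zero
    return [a / norm for a in attn]


def attention_transfer_loss(teacher_fm, student_fm):
    """Mean squared difference between teacher and student attention maps.

    Both maps need the same number of positions; channel counts may differ.
    """
    t = attention_map(teacher_fm)
    s = attention_map(student_fm)
    return sum((ti - si) ** 2 for ti, si in zip(t, s)) / len(t)


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    student = [3.0, 2.0, 1.0]
    print("soft-logit loss:", soft_logit_loss(teacher, student))

    teacher_fm = [[3.0, 4.0, 0.0], [0.0, 0.0, 1.0]]
    student_fm = [[1.0], [2.0]]
    print("attention transfer loss:", attention_transfer_loss(teacher_fm, student_fm))
```

In this sketch the teacher and student may have different channel widths, because the attention map removes the channel dimension before the two are compared.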
Methods | Full Dataset (Accuracy) | 50% Dataset (Last Accuracy) | 25% Dataset (Last Accuracy) | 10% Dataset (Last Accuracy) |
---|---|---|---|---|
Teacher | 78.59 | - | - | - |
Student | 76.25 | - | - | - |
Soft_logits | 76.57 | - | - | - |
FitNet | 75.04 | - | - | - |
AT | 78.14 | - | - | - |
FSP | 76.47 | - | - | - |
DML | - | - | - | - |
KD_SVD | - | - | - | - |
KD_EID | - | - | - | - |
FT | - | - | - | - |
AB | - | - | - | - |
RKD | - | - | - | - |
VID | - | - | - | - |
MHGD | - | - | - | - |
CO | - | - | - | - |
- Check all the algorithms.
- Run the experiments.