# Review — Mean Teachers are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results

## Mean Teacher, **Teacher Student Approach,** for Semi-Supervised Learning

In this story, **Mean Teachers are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results**, by The Curious AI Company and Aalto University, is reviewed. In this paper:

- **Mean Teacher** is proposed, to **average the model weights** instead of label predictions as in Temporal Ensembling [13].
- Mean Teacher **improves test accuracy** and **enables training with fewer labels** than Temporal Ensembling [13].
- Without changing the network architecture, Mean Teacher achieves a **lower error rate**.

This is a paper in **2017 NeurIPS** with over **1300 citations**. (Sik-Ho Tsang @ Medium) The teacher-student approach has other uses as well, such as knowledge distillation. This is one of the beginner papers for the teacher-student approach in self-supervised learning.

# Outline

1. **Conceptual Idea of Applying Noise & Ensembling**
2. **Mean Teacher**
3. **Experimental Results**

# 1. Conceptual Idea of Applying Noise & Ensembling

- A sketch of a **binary classification task** with **two labeled examples (large blue dots)** and one unlabeled example, demonstrating how the choice of the **unlabeled target (black circle)** affects the **fitted function (gray curve)**.
- **(a):** A model with **no regularization** is free to fit any function that predicts the labeled training examples well.
- **(b):** A model trained with **noisy labeled data (small dots)** learns to give **consistent predictions around labeled data points**.
- **(c):** Consistency to noise around unlabeled examples provides additional smoothing. For the clarity of illustration, **the teacher model (gray curve) is first fitted to the labeled examples**, and then **left unchanged** during the training of the student model. Also for clarity, the small dots in figures (d) and (e) are omitted.
- **(d):** **Noise on the teacher model reduces the bias** of the targets without additional training. The expected direction of stochastic gradient descent is towards the mean (large blue circle) of individual noisy targets (small blue circles).
- **(e):** **An ensemble of models** gives an **even better** expected target. Both Temporal Ensembling [13] and the Mean Teacher method use this approach.

Thus, **applying noise** and using **model ensembling** are the keys in this paper.

# 2. Mean Teacher

## 2.1. Applying Noise

- Both the student and the teacher model **evaluate the input applying noise (*η*, *η’*)** within their computation.

## 2.2. Model Ensembling Using EMA

- The **softmax output of the student model** is compared with the one-hot label using **classification cost** and with the teacher output using **consistency cost**.
- After the weights of the **student model** have been **updated with gradient descent**, the **teacher model** weights are **updated as an exponential moving average (EMA) of the student weights.**
- Both model outputs can be used for prediction, but at the end of the training **the teacher prediction is more likely to be correct**.
- A training step with an unlabeled example would be similar, except no classification cost would be applied. (i.e. self-supervised learning, but this paper is focusing more on semi-supervised learning.)
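The training step described above can be sketched roughly as follows. This is only a minimal NumPy illustration on a toy linear softmax classifier, not the authors' implementation: the function name `mean_teacher_step`, the Gaussian input noise, and all hyperparameter values (`lr`, `alpha`, `consistency_weight`, `noise_std`) are assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def mean_teacher_step(student_w, teacher_w, x, y, labeled, rng,
                      lr=0.1, alpha=0.99, consistency_weight=1.0, noise_std=0.1):
    """One step for a toy linear classifier (hypothetical helper).

    x: (batch, features); y: (batch,) int classes; labeled: (batch,) bool mask.
    Labels of unlabeled rows are ignored.
    """
    num_classes = student_w.shape[1]
    # Independent noise draws: eta for the student, eta' for the teacher.
    x_student = x + noise_std * rng.standard_normal(x.shape)
    x_teacher = x + noise_std * rng.standard_normal(x.shape)

    p_student = softmax(x_student @ student_w)
    p_teacher = softmax(x_teacher @ teacher_w)  # treated as a constant target

    # Classification cost: cross-entropy on the labeled examples only.
    one_hot = np.eye(num_classes)[y]
    n_labeled = max(int(labeled.sum()), 1)
    log_p = np.log(p_student + 1e-12)
    class_cost = -(labeled[:, None] * one_hot * log_p).sum() / n_labeled

    # Consistency cost: MSE between student and teacher outputs, all examples.
    diff = p_student - p_teacher
    cons_cost = (diff ** 2).mean()

    # Gradients with respect to the student logits (teacher gets no gradient).
    grad_class = labeled[:, None] * (p_student - one_hot) / n_labeled
    g = 2.0 * diff / diff.size  # d(cons_cost) / d(p_student)
    grad_cons = p_student * (g - (g * p_student).sum(axis=1, keepdims=True))
    grad_logits = grad_class + consistency_weight * grad_cons

    # 1) gradient descent on the student, 2) EMA update of the teacher.
    new_student = student_w - lr * (x_student.T @ grad_logits)
    new_teacher = alpha * teacher_w + (1.0 - alpha) * new_student
    return new_student, new_teacher, class_cost, cons_cost

# Usage: 8 examples, of which only the first 2 are labeled.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
y = rng.integers(0, 3, size=8)
labeled = np.array([True, True] + [False] * 6)
student = np.zeros((4, 3))
teacher = student.copy()
student, teacher, class_cost, cons_cost = mean_teacher_step(
    student, teacher, x, y, labeled, rng)
```

For the unlabeled rows, only the consistency term contributes to the gradient, which matches the last bullet above. A real setup would replace the linear model with a ConvNet and rely on automatic differentiation instead of the hand-written gradients.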

## 2.3. **Consistency Cost**

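- Specifically, the **consistency cost** *J* is defined as the expected distance between the prediction of the **student model (with weights *θ* and noise *η*)** and the prediction of the **teacher model (with weights *θ’* and noise *η’*)**:

$$J(\theta) = \mathbb{E}_{x,\eta',\eta}\left[\left\lVert f(x;\theta';\eta') - f(x;\theta;\eta) \right\rVert^{2}\right]$$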

- The difference between the *Π* model, Temporal Ensembling, and Mean Teacher is how the teacher predictions are generated.
- Whereas the *Π* model uses *θ’* = *θ*, and Temporal Ensembling approximates *f*(*x*; *θ’*; *η’*) with a weighted average of successive predictions, **Mean Teacher defines *θ’t* at training step *t* as the EMA of successive weights**:

$$\theta'_{t} = \alpha\,\theta'_{t-1} + (1-\alpha)\,\theta_{t}$$

- where *α* is a smoothing coefficient hyperparameter.
- An additional difference between the three algorithms is that the *Π* model applies training to *θ’*, whereas Temporal Ensembling and Mean Teacher treat it as a constant with regard to optimization.
- **Mean squared error (MSE)** is used as the consistency cost.
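To make the contrast between the teacher variants concrete, here is a small sketch of the two averaging schemes. The function names are hypothetical. The weight update follows the EMA definition above; the prediction update follows the accumulate-then-bias-correct form used by Temporal Ensembling [13], which is recalled here from that paper rather than taken from this review.

```python
import numpy as np

def ema_weights(teacher_w, student_w, alpha):
    # Mean Teacher: EMA over *weights*, applied after every training step.
    return alpha * teacher_w + (1.0 - alpha) * student_w

def ema_predictions(Z, z, alpha, t):
    # Temporal Ensembling: EMA over per-example *predictions*, updated once
    # per epoch t (1-based). Dividing by (1 - alpha**t) corrects the startup
    # bias caused by initializing the accumulator Z with zeros.
    Z = alpha * Z + (1.0 - alpha) * z
    return Z, Z / (1.0 - alpha ** t)

# Weight averaging: alpha = 0 simply copies the student, alpha = 1 freezes the teacher.
teacher_w = np.array([1.0, 1.0])
student_w = np.array([3.0, 5.0])
copied = ema_weights(teacher_w, student_w, alpha=0.0)
frozen = ema_weights(teacher_w, student_w, alpha=1.0)
blended = ema_weights(teacher_w, student_w, alpha=0.5)

# Prediction averaging: after the first epoch the corrected target equals z.
Z = np.zeros(2)
z = np.array([0.2, 0.8])
Z, target = ema_predictions(Z, z, alpha=0.6, t=1)
```

In this sketch, `alpha=0.0` reproduces the *θ’* = *θ* choice of the *Π* model in terms of weights only. It does not capture the other difference noted above, namely that the *Π* model also trains through the teacher branch.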

# 3. Experimental Results

## 3.1. SVHN & CIFAR-10

- All the methods in the comparison use a similar 13-layer ConvNet architecture.
- **Mean Teacher improves test accuracy over the *Π* model and Temporal Ensembling** on semi-supervised SVHN tasks.
- **Mean Teacher also improves results on CIFAR-10** over the baseline *Π* model.
- Virtual Adversarial Training (VAT) performs even better than Mean Teacher on the 1000-label SVHN and the 4000-label CIFAR-10. Yet, VAT and Mean Teacher are complementary approaches.

## 3.2. SVHN with Extra Unlabeled Data

- Besides the primary training data, SVHN also includes an extra dataset of 531131 examples. 500 samples are picked from the primary training set as the labeled training examples.
- The rest of the primary training set, together with the extra training set, is used as unlabeled examples.

Mean Teacher again outperforms the *Π* model.
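The labeled/unlabeled split described in this subsection can be sketched with index bookkeeping alone. The helper `split_svhn_indices` is hypothetical, the primary training set size of 73257 is the size of the standard SVHN training split and is not stated in the text, and plain uniform sampling is an assumption (the paper may select the labeled examples differently).

```python
import random

def split_svhn_indices(n_primary, n_extra, n_labeled, seed=0):
    """Return (labeled, unlabeled) indices over the concatenation [primary | extra]."""
    rng = random.Random(seed)
    # Labeled examples are drawn from the primary training set only.
    labeled = sorted(rng.sample(range(n_primary), n_labeled))
    labeled_set = set(labeled)
    # Everything else, primary remainder plus the whole extra set, is unlabeled.
    unlabeled = [i for i in range(n_primary + n_extra) if i not in labeled_set]
    return labeled, unlabeled

N_PRIMARY = 73257  # assumption: standard SVHN training split size
N_EXTRA = 531131   # from the text

labeled, unlabeled = split_svhn_indices(N_PRIMARY, N_EXTRA, n_labeled=500)
print(len(labeled), len(unlabeled))  # prints: 500 603888
```

Under these assumptions the unlabeled pool is more than a thousand times larger than the labeled one, which is the regime where the consistency cost carries most of the training signal.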

## 3.3. Analysis of the Training Curves

- As expected, the EMA-weighted models (blue and dark gray curves in the bottom row) give more accurate predictions than the bare student models (orange and light gray) after an initial period.

Using the EMA-weighted model as the teacher improves results in the semi-supervised settings.

- Mean Teacher helps when labels are scarce.
- **When using 500 labels (middle column) Mean Teacher learns faster**, and continues training after the *Π* model stops improving.
- On the other hand, in the all-labeled case (left column), Mean Teacher and the *Π* model behave virtually identically.

Mean Teacher uses unlabeled training data more efficiently than the *Π* model.

## 3.4. Mean Teacher with ResNet on CIFAR-10 and ImageNet

- Experiments are run using a 12-block (26-layer) Residual Network [8] (ResNet) with Shake-Shake regularization [5] on CIFAR-10.
- A 50-block (152-layer) ResNeXt architecture is used on ImageNet using 10% of the labels.
- The results improve remarkably with the better network architecture.

The paper also contains ablation experiments (e.g. the effect of applying noise) and an appendix (e.g. the experimental settings). If interested, please feel free to read the paper.

## Reference

[2017 NeurIPS] [Mean Teacher]

Mean Teachers are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results

## Self-Supervised Learning

**2008–2010** [Stacked Denoising Autoencoders] **2014** [Exemplar-CNN] **2015** [Context Prediction] **2016** [Context Encoders] [Colorization] [Jigsaw Puzzles] **2017** [L³-Net] [Split-Brain Auto] [Mean Teacher] **2018** [RotNet/Image Rotations] [DeepCluster]