Achieving Super Convergence of DNNs with 1cycle Policy

I would say that training a deep neural network to a good accuracy is an art. Training lets the model learn its parameters, such as the weights and biases, from the training data, and the model hyper-parameters govern that process. They control the behaviour of training and have a significant impact on model accuracy and convergence.

The learning rate, the number of epochs, the number of hidden layers and hidden units, the activation functions, and the momentum are hyper-parameters we can adjust to make a neural network perform well.

Adjusting the learning rate is vital for convergence: a learning rate that is too small makes training very slow and can lead to overfitting, while a learning rate that is too large makes training diverge. The typical way of finding a good learning rate is a grid search or a random search, which is computationally expensive and takes a lot of time. Isn't there a smarter way to find the optimal learning rate?

Here I'm going to connect some dots on the process I followed to choose a good learning rate for my model, and on training a DNN with a different learning rate policy.

Many researchers are actively working in this area. In the paper "Cyclical Learning Rates for Training Neural Networks", Leslie N. Smith proposed the learning rate range test (LR range test) and Cyclical Learning Rates (CLR).

I'm not going to discuss the interesting theory behind the LR range test and CLR here, as fast.ai has a pretty good introduction to the method, and they even ship an implementation of the LR range test that you can use off the shelf. I strongly recommend reading that post. I also found a nice PyTorch implementation of the LR range test by David Silva; feel free to pull it from https://github.com/davidtvs/pytorch-lr-finder.
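Here is a minimal sketch of running the range test with that package, roughly following its README. The model, optimizer settings, and `train_loader` are placeholders you would swap for your own:

```python
import torch
import torch.nn as nn
from torchvision import models
from torch_lr_finder import LRFinder

# Placeholder model, loss, and optimizer; swap in your own network.
# train_loader is assumed to be a DataLoader over your training set.
model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-7, momentum=0.9)

lr_finder = LRFinder(model, optimizer, criterion, device="cuda")
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)  # sweep the LR upwards over 100 iterations
lr_finder.plot()   # loss vs. learning rate; pick a value just before the loss blows up
lr_finder.reset()  # restore the model and optimizer to their initial state
```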

In 2018, in the paper "A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 – Learning Rate, Batch Size, Momentum, and Weight Decay", Smith introduced the 1cycle policy, which runs only a single cycle of training compared to the several cycles in CLR. I strongly suggest taking a look at this blog post to get an idea of the 1cycle policy.
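PyTorch also ships a scheduler for this policy, torch.optim.lr_scheduler.OneCycleLR. Below is a minimal sketch of setting it up; the epoch count and momentum range are assumed placeholder values, not the exact settings from the paper or from my experiment:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# One cycle over the whole run: the LR ramps up to max_lr and then anneals down,
# while the momentum moves in the opposite direction (cyclical momentum).
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=7e-3,                        # the maximum LR picked from the LR range test
    epochs=25,                          # placeholder; use your own epoch count
    steps_per_epoch=len(train_loader),  # number of mini-batches per epoch
    cycle_momentum=True,
    base_momentum=0.85,                 # assumed momentum range for illustration
    max_momentum=0.95,
)
```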

OK… now you've read it! But does it actually work?

I gave it a try with a simple transfer learning experiment. The dataset and the experiment I used are from the PyTorch documentation, which you can find here. These are the steps I followed during the experiment.

Yeah! I've pushed the experiment to GitHub, so feel free to use it. 😊

  1. Run the LR range finder to find the maximum learning rate value to use for 1cycle training.

[Figure: Output from the LR finder]

According to the graph, around 5e-3 could be the maximum learning rate usable for training. I chose 7e-3, which sits a bit before the minimum of the loss curve, as my maximum learning rate for training.

  2. Run the training with a fixed learning rate (note that learning rate decay was used during training).
  3. Run the training according to the 1cycle policy (cyclical momentum and a cyclical learning rate were used; note that the learning rate and the momentum change every mini-batch, not every epoch). A minimal sketch of both training loops follows below.
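This is roughly how the two schedules differ inside the training loop: an epoch-wise decay scheduler steps once per epoch, while the 1cycle scheduler steps after every mini-batch. The names reuse the placeholders from the sketches above and are illustrative, not the exact code from my experiment:

```python
num_epochs = 25  # placeholder; for 1cycle it must match the epochs given to OneCycleLR

# Baseline: fixed LR with step decay (similar to the decay used in the
# PyTorch transfer learning tutorial); the scheduler steps once per epoch.
step_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
    step_scheduler.step()      # epoch-wise LR decay

# 1cycle: the OneCycleLR scheduler from above steps after every mini-batch,
# so both the learning rate and the momentum change per iteration.
for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()       # per-batch update of LR and momentum
```

Reusing the same model and optimizer for both runs is only for brevity here; in the actual experiment each run starts from its own fresh copy.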


  4. Compare the validation accuracy and validation loss of each method.

Notice that the green line, which represents the run trained with the 1cycle policy, reaches a better validation accuracy and a better validation loss at convergence.

These are the best validation accuracies of the two experiments.

  • Fixed LR: 0.9411
  • 1cycle: 0.9607

Tip: choose the batch size according to the computational capacity you have. The number of iterations in the 1cycle policy depends on the batch size, the number of epochs, and the size of the dataset you are training on.
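As a quick back-of-the-envelope check (the numbers below are made up for illustration), the total number of scheduler steps is simply epochs × mini-batches per epoch, and this is the length the single 1cycle schedule has to span:

```python
import math

# Hypothetical numbers; plug in your own dataset size, batch size, and epoch count.
dataset_size = 1000
batch_size = 16
epochs = 25

steps_per_epoch = math.ceil(dataset_size / batch_size)  # mini-batches per epoch
total_steps = epochs * steps_per_epoch                   # length of the 1cycle schedule
print(steps_per_epoch, total_steps)                      # 63 1575
```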

Though this is a simple experiment, it shows that the 1cycle policy can do a real job in improving the accuracy of a neural network and helping it reach super convergence. Give it a try and don't forget to share your experiences here. 😊

References – 

[1] Cyclical Learning Rates for Training Neural Networks
https://arxiv.org/abs/1506.01186

[2] A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 – Learning Rate, Batch Size, Momentum, and Weight Decay
https://arxiv.org/abs/1803.09820

[3] The 1cycle policy
https://sgugger.github.io/the-1cycle-policy.html

[4] PyTorch Learning Rate Finder
https://github.com/davidtvs/pytorch-lr-finder

[5] Transfer Learning Tutorial
https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
