Achieving Super Convergence of DNNs with 1cycle Policy

I would say, training a deep neural network model to achieve a good accuracy is an art. The training process enable the model to learn the model parameters such as the weights and the biases with the training data. In the process of training, model hyper-parameters govern the process. They control the behavior of model training and does a significant impact on model accuracy and convergence.

Learning rate, number of epochs, hidden layers, hidden units, activation functions, momentum are the hyperparameters that we can adjust to make the neural network models perform well.

Adjusting the learning rate is a vital factor for convergence because a small learning rate makes the training very slow and can occur overfitting, while if the learning rate is too large, the training will diverge. The typical way of finding the optimum learning rate is performing a grid search or a random search which can be computationally expensive and take a lot of time. Isn’t there a smart way to find out the optimal learning rate?

Here I’m going to connect some dots together on a process I followed to choose a good learning rate for my model and a way of training a DNN with different learning rate policy.

Many researchers actively work on this area and through his paper “Cyclical Learning Rates for Training Neural Networks” by Leslie N. Smith proposed Learning rate range test (LR range test) and Cyclical Learning Rates (CLR).

Not going to discuss the interesting theory behind LR range test and CLR, as fast.ai has a pretty good introduction on the method and they even have an implementation of LR range test that can use off the shelf. Strongly recommend to read this post. I  found a nice implementation on LR range test in PyTorch by David Silva and feel free to pull it from here . https://github.com/davidtvs/pytorch-lr-finder

In 2018, by the paper “A disciplined Approach to Neural Network Hyper-Parameters : Part 1 – Learning Rate, Batch Size, Momentum, and Weight Decay” Smith introduces the 1cycle policy which is only running a single cycle of training compared to several cycles in the CLR. Strongly suggest to take a look on this blog post to get an idea on 1cycle policy.

Ok… Now you read it! Is this working???

I give it a try using a simple transfer learning experiment. The dataset and the experiment I used here is from the PyTorch documentation which you can find here.  These are the steps I followed during the experiment.

Yeah! I’ve pushed the experiment to GitHub and feel free to use it. 😊

  1. Run the LR range finder to find the maximum learning rate value to use on 1cycle learning.

lr_finder_output

Output from the LR finder

According to the graph it is clear that 5*1e-3 can be the maximum learning rate value that can be used for training. So, I chose 7*1e-3; which is bit before the minimum as my maximum learning rate for training.

  1. Run the training using a defined learning rate (Note that a learning rate decay has used during training)
  2. Run the training according to the 1cycle policy. (A cyclical momentum and cyclical learning date have been used. Note that the learning rate and the momentum is changing in each mini-batch: not epoch-wise.)

1cy

  1. Compare the validation accuracy and validation loss of each method.

Can you notice that the green line, which represents the experiment trained using 1cycle policy gives a better validation accuracy and a better validation loss when converging.

These are the best validation accuracy of the two experiments.

  • Fixed LR : 0.9411
  • 1-cycle : 0.9607

Tip : Use the batch size according to the computational capacity you are having. The number of iterations in 1cycle policy depends on the batch size, number of epochs and the dataset size you are using for training.

Though this experiment is a simple one, it is proven that 1cycle policy does a job in increasing the accuracy of neural network models and helps for super convergence. Give it a try and don’t forget to share your experiences here. 😊

References – 

[1] Cyclical Learning Rates for Training Neural Networks
https://arxiv.org/abs/1506.01186

[2] A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay
https://arxiv.org/abs/1803.09820

[3] The 1cycle policy
https://sgugger.github.io/the-1cycle-policy.html

[4] PyTorch Learning Rate Finder
https://github.com/davidtvs/pytorch-lr-finder

[5] Tranfer Learning Tutorial
https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

Advertisements

C3D with Batch Normalization for Video Classification

Convolutional Neural Networks (CNNs) are well known for its ability to understand the spatial and positional features. 2D convolutional networks and widely used in computer vision related tasks. There are plenty of research happened and on going with 2D CNNs and the famous ImageNet challenge has gained an accuracy even better than humans!

Research teams have introduced several network architectures for solving the problem of image classification and related computer vision tasks.  LeNet(1998), AlexNet(2012), VGGNet(2014), GoogleNet(2014), ResNet(2015) are some of the famous CNN architectures in use now.  (I’ve discussed about using pre-trained models to perform transfer learning with these architectures here. Take a look. 🙂 )

1_ZqkLRkMU2ObOQWIHLBg8sw

It was all about 2D images. Then what about videos? 3D convolutions which applies a 3D kernel to the data and the kernel moves 3-directions (x, y and z) to calculates the feature representations is helpful in video event detection related tasks.

Same as in the area of 2D CNN architectures, researchers have introduced CNN architectures that are having 3D convolutional layers. They are performing well in video classification, event detection tasks. Some of these architectures have been adopted from the prevailing 2D CNN models by introducing 3D layers for them.

jriyCTU

A 3D Convo operation

Tran et al. from Facebook AI Research introduced the C3D model to learn spatiotemporal features in videos using 3D convolutional Networks.This is the paper : “Learning Spatiotemporal Features with 3D Convolutional Networks In the original paper they have used Dropout to regularize the network.

Instead of using dropout, I tried using Batch Normalization to regularize the network. Each convolutional layer id followed by a 3D batch normalization layer. With batch normalization, you can use bit bigger learning rates to train the network and it allows each layer of the network to learn by itself a little bit more independently from other layers.

This is just the PyTorch porting for the network. I use this network for video classification tasks which each video is having 16 RGB frames with the size of 112×112 pixels. So the tensor given as the input is (batch_size, 3, 16, 112, 112) . You can select the batch size according to the computation capacity you have.

import torch.nn as nn

class C3D_BN(nn.Module):
"""
 The C3D network as described in [1]
 Batch Normalization as described in [2]

 """

def __init__(self):
super(C3D_BN, self).__init__()

self.conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
self.conv1_bn = nn.BatchNorm3d(64)
self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))

self.conv2 = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1))
self.conv2_bn = nn.BatchNorm3d(128)
self.pool2 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

self.conv3a = nn.Conv3d(128, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
self.conv3a_bn = nn.BatchNorm3d(256)
self.conv3b = nn.Conv3d(256, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
self.conv3b_bn = nn.BatchNorm3d(256)
self.pool3 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

self.conv4a = nn.Conv3d(256, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
self.conv4a_bn = nn.BatchNorm3d(512)
self.conv4b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
self.conv4b_bn = nn.BatchNorm3d(512)
self.pool4 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

self.conv5a = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
self.conv5a_bn = nn.BatchNorm3d(512)
self.conv5b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
self.conv5b_bn = nn.BatchNorm3d(512)
self.pool5 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1))

self.fc6 = nn.Linear(8192, 4096)
self.fc7 = nn.Linear(4096, 4096)
self.fc8 = nn.Linear(4096, 8)
self.relu = nn.ReLU()

def forward(self, x):

h = self.relu(self.conv1_bn(self.conv1(x)))
h = self.pool1(h)

h = self.relu(self.conv2_bn(self.conv2(h)))
h = self.pool2(h)

h = self.relu(self.conv3a_bn(self.conv3a(h)))
h = self.relu(self.conv3b_bn(self.conv3b(h)))
h = self.pool3(h)

h = self.relu(self.conv4a_bn(self.conv4a(h)))
h = self.relu(self.conv4b_bn(self.conv4b(h)))
h = self.pool4(h)

h = self.relu(self.conv5a_bn(self.conv5a(h)))
h = self.relu(self.conv5b_bn(self.conv5b(h)))
h = self.pool5(h)

h = h.view(-1, 8192)
h = self.relu(self.fc6(h))
h = self.relu(self.fc7(h))
h = self.fc8(h)
return h

"""
References
----------
[1] Tran, Du, et al. "Learning spatiotemporal features with 3d convolutional networks." 
Proceedings of the IEEE international conference on computer vision. 2015.
[2] Ioffe, Surgey, et al. "Batch Normalization: Accelerating deep network training 
by reducing internal covariate shift."
arXiv:1502.03167v2 [cs.LG] 13 Feb 2015
"""

Let the 3D Convo power be with you! Happy coding! 🙂

Transfer Learning in ConvNets – Part 2

42-29421947We discussed the possibility of transferring the knowledge learned by a ConvNet to another. If you new to the idea of transfer learning, please go check up the previous post here.

Alright… Let’s see a practical scenario where we need to use transfer learning. We all know that deep neural networks are data hungry. We may need a huge amount of data to build unbiased predictive models. Though the perfect scenario is that, in most of the cases, there’s not that much of data to train neural models. So, the ‘To Go” survivor for you may be transfer learning.

Here in this small demonstration what I’ve done is building a multi-class classifier that have 8 classes and only 100 odd images in the training set for each class.

The dataset I’m using here is a derivation of the “Natural Images” dataset (https://www.kaggle.com/prasunroy/natural-images/version/1#_=_ )  . I’ve randomly reduced the number of images in the original dataset for building the “Mini Natural Images”. This dataset consists of three phases for train, test and validation.  (The dataset is available in the GitHub repository) Go ahead and feel free to pull it or fork it!

Here’s an overview of the “Mini Natural Images” dataset.

datasetSo, this is going to be an image classification task. We going to take the advantage of ImageNet; and the state-of-the-art architectures pre-trained on ImageNet dataset.  Instead of random initialization, we initialize the network with a pretrained network and the convNet is finetuned with the training set.

I’ve used PyTorch deep learning framework for the experiment as it’s super easy to adopt for deep learning.  For this type of computer vision applications you can use the models available in torch vision.models (https://pytorch.org/docs/stable/torchvision/models.html )

The models available in the model zoo is pre-trained with ImageNet dataset to classify 1000 classes. With that, there’s 1000 nodes in the final layer. For adopting the model for our need, keep in mind to remove the final layer and replace it with the desired number of nodes for your task. (In this experiment, the final fc layer of the resNet18 has been replaced by 8 node fc layer)

Here’s the way to replace the final layer of resNet architecture and in VGG architecture.

#Using a model pre-trained on ImageNet and replacing it's final linear layer

#For resnet18
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 8)

#for VGG16_BN
model_ft = models.vgg16_bn(pretrained=True)
model_ft.classifier[6].out_features = 8

Rest of the training goes in the same of training and finetuning a CNN. Make sure to use a desired batch size to your GPU available in your rig. (You can use a DLVM for this task if you wish 😊)

The training and validation accuracies are plotted and the confusion matrix is generated using torchnet (https://github.com/pytorch/tnt ) which is pretty good for visualization and logging in PyTorch.

g6_r18

Confusion matrix of the classification

The classifier performs a 97% accuracy for the testing image set, which is not bad.

Now it’s your time to go ahead and get your hands dirty with this experiment. Leave a comment if you find come up with any issue. Happy coding!

Here’s the GitHub Repo for your reference!

Keras; The API for Human Beings

Obviously deep learning is a hit! Being a subfield of machine learning, building deep neural networks for various predictive and learning tasks is one of the major practices all the AI enthusiasts do today. There are several deep learning frameworks out there that helps for building deep neural networks. TensorFlow, Theano, CNTK are some of the major frameworks used in the industry and in the research. These Frameworks has their own way of defining the tensor units and a way of configuring the connections between nodes. That is involves bit of a learning curve.

keras_1As shown in the graph, TensorFlow is the most popular and widely used deep learning framework right now. When it comes to Keras, it’s not working independently. It works as an upper layer for prevailing deep learning frameworks; namely with TensorFlow, Theano & CNTK (MXNet backend for Keras is on the way).  To be more précised, Keras act as a wrapper for these frameworks. Working with Keras is easy as working with Lego blocks. What you have to know is where to fix the right component. So it is the ultimate deep learning tool for human beings!

keras_2

Architecture of Keras API

Why Keras?

  • Fast prototyping – Most of the cases, you may have to test different neural architectures to find the best fit. Building the models from the beginning may time consuming. Keras will help you in this with modularizing your task and giving you the ability to reuse the code.
  • Supports CNN, RNN & combination of both –
  • Modularity
  • Easy extensibility
  • Simple to get started, simple to keep going
  • Deep enough to build serious models.
  • Well-written document. – Yes! Refer http://keras.io
  • Runs seamlessly on CPU and GPU. – Keras support GPU parallelization that will boost your execution.

Keras follows a very simple design idea. Here I’ve sum-up the main four steps of designing a Keras model deep learning model.

  1. Prepare your inputs and output tensors
  2. Create first layer to handle input tensor
  3. Create output layer to handle targets
  4. Build virtually any model you like in between

Basically, Keras models go through the following pipeline. You may have to re-visit the steps again and again to come up with the best model.

keras_3Let’s start with a simple experiment that involves classifying Dog & Cat images from Kaggle. First make sure to download the training & testing image files from Kaggle (https://www.kaggle.com/c/dogs-vs-cats/data)

Before playing with Keras, you may need to setup your rig. Please refer this post and make your beast ready for deep learning.

Then try this code! The code sections are commented for your reference. Here what I’m using is TensorFlow backend. You can change the configurations a bit and use Theano or CNTK as you wish.

# Convolutional Neural Network with Keras

# Installing Tensorflow
# pip install tensorflow-gpu

# Installing Keras
# pip install --upgrade keras

# Part 1 - Building the CNN

# Importing the Keras libraries and packages
import keras
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

# Initialising the CNN
classifier = Sequential()

# Step 1 - Convolution
#input_shape goes reverse if it is theano backend
#Images are 2D
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))

# Step 2 - Pooling
#Most of the time it's (2,2) not loosing many. 
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Adding a second convolutional layer
#Inputs are the pooled feature maps of the previous layer
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))

 

# Step 3 - Flattening
classifier.add(Flatten())

# Step 4 - Full connection
#relu - rectifier activation function
#128 nodes in the hidden layer
classifier.add(Dense(units = 128, activation = 'relu'))
#Sigmoid is used because this is a binary classification. For multiclass softmax
classifier.add(Dense(units = 1, activation = 'sigmoid'))

# Compiling the CNN
#adam is for stochastic gradient descent 
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Part 2 - Fitting the CNN to the images
#Preprocess the images to reduce overfitting
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale = 1./255, #All the pixel values would be 0-1
shear_range = 0.2,
zoom_range = 0.2,
horizontal_flip = True)

test_datagen = ImageDataGenerator(rescale = 1./255)

training_set = train_datagen.flow_from_directory('dataset/training_set',
target_size = (64, 64),
batch_size = 32,
class_mode = 'binary')

test_set = test_datagen.flow_from_directory('dataset/test_set',
target_size = (64, 64),
batch_size = 32,
class_mode = 'binary')

classifier.fit_generator(training_set,
steps_per_epoch = 8000, #number of images in the training set
epochs = 5,
validation_data = test_set,
validation_steps = 2000)

#Prediction
import numpy as np
from keras.preprocessing import image
test_image = image.load_img('dataset/single_prediction/cat_or_dog_2.jpg', target_size=(64,64))
test_image = image.img_to_array(test_image)
test_image = np.expand_dims(test_image, axis=0)
result = classifier.predict(test_image)
training_set.class_indices
if result[0][0] == 1:
prediction = 'dog'
else:
prediction = 'cat'

print(prediction)

Image Classification with CustomVision.ai

cv1Extracting the teeny tiny features in images, feeding the features into deep neural networks with number of hidden neuron layers and granting the silicon chips “eyes” to see has become a hot topic today. Computer vision has gone so far from the era of pattern recognition and feature engineering. With the advancement of machine learning algorithms combined with deep learning; understanding the content in the images and using them in real world applications has become a MUST more than a trend.

Recently during the Microsoft Build2017 conference, they announced a handy tool for training a machine learning image classification model to tag or label your own images. Most interesting part of this tool is, it provides an easy to use user interface to upload your own images for training the model.

After training and tuning the model you can use it as a web service. Using the REST API you just have to push the request to the web service and it’ll do the magic for you.

I just did a tiny experiment with this tool by building an image classifier that classifies few famous landmarks.

I’ve the following image set

  • Eiffel tower – 6 images
  • Great wall – 11 images
  • KL tower – 7 images
  • Stonehenge – 7 images
  • Space Needle – 7 images
  • Taj Mahal – 7 images
  • Sigiriya – 8 images

Let’s get started!

Go to customvision.ai – just sign in with your mail id and you’ll land onto the “My Projects” page

cv2Fill the name, description and select the domain you going to build the model. Here I’ve selected Landmarks because the images I’m going to use contains landmarks and structural buildings.

I had the images of each landmark in separate folders in my local machine. I uploaded the images category by category.  System will detect if you upload duplicate images.

cv4All together 53 images with different tags were uploaded for training.

Training will get few minutes. Optimize the probability threshold to get the best precision and recall. Then get the prediction URL. What you have to do is simply forward a JSON input for the Prediction API.cv7

You can retrain the model by tagging the images used for testing. In a production environment, you can use the user inputs to make the perdition model more accurate. The retrained model will appear as a different iteration. You have the freedom to choose the best iteration that should go live with the API.

You can quickly test how well the model you built us performing. Note that any ML model isn’t giving you 100% accuracy.

cv6

cv9

A prediction from the API

If you prefer to do this in a programmatic way, or your application need to do all the training and calling in the backend, just use Custom vision SDK.

https://github.com/Microsoft/azure-docs/blob/master/articles/cognitive-services/Custom-Vision-Service/csharp-tutorial.md

The SDK comes pretty handy with training new models and adding labels for the images and training it before publishing the prediction API.

Grab a set of images. Build a classifier or a tagger. Make your clients WOW! 😃