Convolutional Neural Networks (CNNs) are well known for its ability to understand the spatial and positional features. 2D convolutional networks and widely used in computer vision related tasks. There are plenty of research happened and on going with 2D CNNs and the famous ImageNet challenge has gained an accuracy even better than humans!
Research teams have introduced several network architectures for solving the problem of image classification and related computer vision tasks. LeNet(1998), AlexNet(2012), VGGNet(2014), GoogleNet(2014), ResNet(2015) are some of the famous CNN architectures in use now. (I’ve discussed about using pre-trained models to perform transfer learning with these architectures here. Take a look. 🙂 )
It was all about 2D images. Then what about videos? 3D convolutions which applies a 3D kernel to the data and the kernel moves 3-directions (x, y and z) to calculates the feature representations is helpful in video event detection related tasks.
Same as in the area of 2D CNN architectures, researchers have introduced CNN architectures that are having 3D convolutional layers. They are performing well in video classification, event detection tasks. Some of these architectures have been adopted from the prevailing 2D CNN models by introducing 3D layers for them.
Tran et al. from Facebook AI Research introduced the C3D model to learn spatiotemporal features in videos using 3D convolutional Networks.This is the paper : “Learning Spatiotemporal Features with 3D Convolutional Networks“ In the original paper they have used Dropout to regularize the network.
Instead of using dropout, I tried using Batch Normalization to regularize the network. Each convolutional layer id followed by a 3D batch normalization layer. With batch normalization, you can use bit bigger learning rates to train the network and it allows each layer of the network to learn by itself a little bit more independently from other layers.
This is just the PyTorch porting for the network. I use this network for video classification tasks which each video is having 16 RGB frames with the size of 112×112 pixels. So the tensor given as the input is (batch_size, 3, 16, 112, 112) . You can select the batch size according to the computation capacity you have.
Let the 3D Convo power be with you! Happy coding! 🙂