Why Do We Need GPUs for Deep Learning?

“If you wanna do deep learning experiments you should have a GPU”

This is one of the most common statements you've probably heard over the years, even from me. Do we really need a GPU for these experiments? If so, why is that? Let's dig in…

As we all know, deep learning is all about deep neural networks. Training a deep neural network (DNN) is one of the most computationally expensive tasks in computer science. Since a DNN holds huge numbers of numerical representations as its inputs, as well as the weights and biases inside the network, it can mathematically be seen as a set of multidimensional matrices.

A simplified illustration of an ANN

Now we have a set of huge multidimensional matrices… what's next? In order to train a neural network to perform a particular task (say, classifying a bunch of images), we commonly use the backpropagation algorithm, which adjusts the weights and biases of the DNN according to the training set we provide. In the forward pass of training, the input is passed through the neural network and, after processing, an output is generated. In the backward pass, we update the weights of the neural network based on the error we got in the forward pass. These operations involve a huge set of matrix calculations (multiplications and summations). Any single mathematical operation inside these passes is pretty simple: multiplying two numbers. But when it comes to a DNN such as VGG16, which contains 16 weight layers (VGG16 is a convolutional neural network specifically designed for computer vision tasks), there are ~140 million parameters, aka weights and biases! So the number of calculations needed to adjust these weights and biases is tremendous!
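To get a feel for that number, here's a back-of-the-envelope calculation (just a rough sketch; the sizes are loosely based on VGG16's first fully connected layer, with 512×7×7 inputs and 4096 outputs) showing how quickly parameters pile up in a single layer:

#Parameters in one fully connected layer
#(sizes loosely based on VGG16's first FC layer)
inputs, outputs = 25088, 4096
weights = inputs * outputs   # one weight per input-output connection
biases = outputs             # one bias per output neuron
print(weights + biases)      # ~102.8 million parameters in this layer alone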

Alright… we now have millions and millions of calculations to be done. But we have very fast computers, so what's the issue? To be honest, there's no issue; the problem is with the definition of speed we use for our daily computations. The Central Processing Unit (CPU) is the key component that performs computations in our computers. A typical modern high-end CPU has 4–8 physical cores (an Intel Core i7 10th gen processor has 8 physical cores). If you remember your early computer science lessons, a CPU core can execute only one operation at a time. So performing those millions and millions of calculations sequentially is going to take ages! (Literally, it may take years to train a complex DNN!) You may think of building a cluster with thousands of CPUs working in parallel. Yes, it's possible, but it's going to cost millions and consume a huge amount of power.

Since each individual computation in DNN training is not complex, Graphics Processing Units (GPUs), which have hundreds or thousands of simple cores, are a great match for this computational task. (For example, even the modest NVIDIA 940MX GPU that sits in a laptop has 384 CUDA cores.)
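If you want to see the difference yourself, here's a minimal sketch in PyTorch that times the same matrix multiplication on the CPU and on the GPU (the matrix size is illustrative and the timings will vary with your hardware; torch.cuda.synchronize() is needed because GPU kernels launch asynchronously):

import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

#CPU: the multiplication runs on a handful of cores
start = time.time()
c = a @ b
print("CPU time:", time.time() - start)

#GPU: the same multiplication spread over hundreds/thousands of CUDA cores
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()   # wait for the copies to finish
    start = time.time()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()   # wait for the kernel to finish
    print("GPU time:", time.time() - start)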

The table below will give you a better idea of GPU vs CPU. Also watch the Mythbusters video that demonstrates the power of thousands of small processors.

Source : https://www.slideshare.net/AlessioVillardita/ca-1st-presentation-final-published

Though the hype around GPUs is recent, the use of general-purpose GPUs for scientific computing started in the early 2000s. In 2006, NVIDIA came out with CUDA, which allows GPUs to be programmed using high-level languages. Now all the major deep learning frameworks can access the GPU with just a few lines of code. You can easily train a DNN model in minutes using the GPU in your laptop, compared to the hours or days it may take on the CPU!
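For example, in PyTorch, moving the work to the GPU really is just a couple of lines (a sketch only; resnet18 here is a stand-in for whatever model you actually train):

import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = models.resnet18()   # any model works the same way
model = model.to(device)    # one line moves all the weights to the GPU

x = torch.randn(8, 3, 224, 224).to(device)   # inputs must live on the same device
y = model(x)                # the forward pass now runs on the GPU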

Alright… alright… Now we need a GPU!

There are two main ways of leveraging GPU power for your computations:

1. Use a physical GPU fitted to your workstation/laptop

This is quite straightforward. I would strongly recommend using an NVIDIA GPU, since the CUDA and cuDNN packages have great support and good documentation for most of the deep learning libraries. AMD also has ROCm, a CUDA-like platform based on OpenCL (https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Current-Release-Notes.html), but I haven't used it myself. GPUs may be a bit expensive, so plan ahead when choosing a GPU for your workstation.

2. Use a GPU on the cloud

This is pretty cool. If you don't have a GPU in your laptop but still want to train DNNs, you can easily spin up a GPU on the cloud and use it for your computations.

Almost all cloud providers (Google Cloud Platform, AWS, Microsoft Azure) provide GPU-based computing services. Some of these services are free to some extent, but you may have to pay if you're going to run your training for a long time.

Kaggle, one of the main data science platforms, provides free access to GPUs with a usage limit (https://www.kaggle.com/dansbecker/running-kaggle-kernels-with-a-gpu). Take a look; it may fit your needs too.

So, that's it! Don't wait for the CPU to finish the computation. Let the GPU train your model.

What’s Your Workhorse?

I'm going to break the flow of the Azure Machine Learning blog series and write a bit of a descriptive answer to a frequently asked question.

Here’s that golden question!

What is the development rig and tools you use for deep learning/machine learning development?

We all know machine learning experiments come with a cost in computation power, and when it comes to deep learning, it's very computationally expensive. As a researcher who does most of my experiments in computer vision and deep learning, this is the setup I use to get my jobs done (as of September 2020).

Disclaimer:

Please note that all the things I'm mentioning here are my personal preferences, not any kind of standard. None of the manufacturers or software vendors have sponsored me for this post 😀 (Most of the software tools I'm mentioning here are FOSS).

Hardware

Choosing the right set of hardware is the most vital part of building a DL workstation. It may be quite expensive to go for the best setup that satisfies your needs, but I see that as an investment. Make sure you don't spend way too much just for the brand name, and make sure it's enough to do your job.

The main specifications to look at in a DL workstation are the processor, RAM, storage and the GPU (if you want to do computer vision or NLP kinds of experiments, this is a must!). I'm using a desktop with a 4-core Intel Xeon processor and 16GB of RAM. For storage, I have an SSD as the primary drive with all my software installed, plus a fairly big (6TB) hard drive. Why such big storage? Since I have to deal with somewhat big image and video datasets, big storage is a necessity.

GPU! This is the pinnacle of your build, and maybe the most expensive part. I have an NVIDIA GTX 1080, which has 8GB of GPU memory and supports CUDA-based processing. It comes in handy for most of my experiments. (When it's not enough, I use a remote cluster with two NVIDIA 2080 Ti cards for big computations.)

Make sure you have proper cooling for the machine! Believe me, these computations are going to generate a lot of heat, and without cooling they can ruin your hardware.

Laptop or a desktop?

This is a hard choice to make. If you have to move from place to place to do your job, then you may have to choose a laptop (many gaming laptops come with GPUs that can be used for CUDA-based processing). I'm more of a desktop person: you can easily build a more powerful rig for the same amount of dollars you'd spend on a good-enough laptop. (I don't prefer MacBooks since they don't come with CUDA-capable GPUs. I just use a light laptop for presentations and to carry around for meetings.)

Alongside all these beasts, a two-monitor setup, a mechanical keyboard with a wireless mouse, a RODE USB mic for video conferences and recordings, and a Bose noise-cancelling headset for complete isolation are there to ease things up.

Software and Tools

This is the most interesting part. Even if you have the right hardware, the wrong choice of tools can make your job hard. (Again, this is completely my choice of software; you may have different preferences.)

I use Ubuntu Linux as my operating system (still 18.04 LTS 😀). Why Ubuntu? It's so easy to set up and work with, and that's why it's my choice. Since Python is my primary programming language, it works like magic with all the Python environment dependencies (plus OpenCV 😀).

So I said Python… what else? Yeah, along with most of the machine learning frameworks (yes, I have an Anaconda-based setup), PyTorch has become my go-to deep learning framework. Easy debugging, Pythonic syntax and wide support made me take this choice.

IDE

VSCode with my installed extensions

No programmer is complete without his/her own choice of IDE and modifications. Microsoft VSCode (which is open source and free) is the IDE I use. It supports Python, Spark, Scala and almost all the languages I use. I've added a few extensions that come in so handy with experiments. Here are some of the extensions I use; see if they suit you.

  • Python – Makes sure I get all the indentation and tooltips right.
  • Remote Explorer – Since I use a remote cluster to train my models, and sometimes a NAS for storage, this comes in handy for managing my SSH connections. It's pretty convenient even for remote debugging.
  • Docker – Makes it pretty easy to manage your Docker environments.
  • Excel Viewer – Just to view CSV files in style.
  • LaTeX Workshop and Code Spell Checker – This is all because I use LaTeX for scientific writing. Believe me, VSCode is a nice place to do word processing too 😀

With all these tweaks I use the dark theme 😀

That's all about the hardware and software setup I use for my experiments. In a later post, I'll talk about some MLOps tips and tricks I use within experiments.

PyTorch Custom Dataset Tips and Tricks

Loading massive and complex datasets for training deep learning models has become normal practice in most deep learning experiments. Handling large datasets that contain multimedia (images, video frames, sound clips etc.) can't be done with simple file-open commands; doing so would drastically reduce model training efficiency.

Featuring a more Pythonic API, the PyTorch deep learning framework offers a GPU-friendly, efficient data generation scheme to load any data type and train deep learning models in an optimal manner.

Based on the Dataset class (torch.utils.data.Dataset) in PyTorch, you can load pretty much any data format, in all shapes and sizes, by overriding two subclass functions:

  • __len__ – returns the size of the dataset
  • __getitem__ – returns a sample from the dataset, given an index

Here's a rough skeleton of a Dataset subclass which you can modify for your needs.

import torch
from torch.utils.data import Dataset

#If available, use GPU memory to load data
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")


class MyCustomDataset(Dataset):
    def __init__(self, *args, **kwargs):
        # Replace *args/**kwargs with your own arguments (paths, split, transforms...)
        # # All the data preparation tasks can be defined here
        # - Deciding the dataset split (train/test/validate)
        # - Data transformation methods
        # - Reading annotation files (CSV/XML etc.)
        # - Preparing the data to be read by an index
        self.samples = []  # e.g. a list of (data_point, label) pairs

    def __getitem__(self, index):
        # # Returns data and labels
        # - Apply the initiated transformations to the data
        # - Push the data to GPU memory
        # - Better to return the data points as a dictionary/tensor
        img, label = self.samples[index]
        return (img, label)

    def __len__(self):
        # Count of how many examples (images?) you have
        return len(self.samples)

These are some tips and tricks I follow when writing custom dataloaders for PyTorch.

  • Datasets keep expanding with more and more samples, so we do not want to store too many tensors in memory at runtime in the Dataset object. Instead, we form the tensors as we iterate through the samples list (see the sketch after this list). This approach may be a bit slower in processing, but it saves us from running out of memory.
  • The __init__ function should be the place where all the initial data preparation and logic happens. Do any operations that read data annotation files (CSV/XML etc.) here.
  • If you have separate portions of the dataset for train/test/validation, make sure you define that logic inside the __init__ function. You can pass the desired data split as an argument to the function.
  • The __init__ function is also the place to define data transformations. For example, if you have image data to load and need to resize and normalize the images, you can use torchvision transforms here.
#Example transform for image data (assumes: from torchvision import transforms)
self.transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
  • Make sure you index your custom dataset into a flat structure when initializing; generating an array or a list of the data points, addressable by index, is a good way to do it.
  • The __len__ function comes in handy for checking how many data points were loaded in __init__. The data length is normally the number of records in the final list or array you created inside __init__.
  • The __getitem__ function should be lightweight. Avoid overly complex computations inside __getitem__.
  • PyTorch DataLoaders just call __getitem__() repeatedly and wrap the results up into a batch when performing training or inference. So this function is called iteratively; make sure you return one data point at a time.
  • Always try to return the values from __getitem__ as tensors.
  • If you have multiple components to return from the dataset, using a Python dictionary is a handy option: you can structure it as key-value pairs. Here's an example dictionary item which contains four values.
item = {
    'video_id': video_id,
    'activity_id': activity_id,
    'activity_frame': activity_frame_as_tensor,
    'activity_annotation': activity_annotation
}
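To make the first tip concrete, here's a minimal sketch of a lazily loading __getitem__ for the MyCustomDataset skeleton above (the PIL usage, the (image_path, label) structure of self.samples and the dictionary keys are my illustrative assumptions, not fixed parts of any API):

from PIL import Image

class MyCustomDataset(Dataset):
    # __init__ and __len__ stay as in the skeleton above;
    # self.samples holds lightweight (image_path, label) pairs only

    def __getitem__(self, index):
        image_path, label = self.samples[index]

        # The heavy work (disk read + decode) happens per item, at iteration time
        img = Image.open(image_path).convert('RGB')
        img = self.transform(img)   # the torchvision transform initiated in __init__

        # Return the components as a dictionary, as suggested above
        return {'image': img, 'label': label}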

Consuming the dataset

You create a MyCustomDataset object when you need to consume the data. Here's a sample code snippet that demonstrates how to access the data points through the custom dataset you created.

#Consuming the dataset
from torch.utils.data import DataLoader, random_split

#Creating the dataset object
dataset = MyCustomDataset(...)

#Randomly split the dataset into a train set and a validation set
#(the split sizes must add up to len(dataset); 50000/10000 is just an example)
train_data, val_data = random_split(dataset, [50000, 10000])

#Create DataLoader iterators
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=2)
val_loader = DataLoader(val_data, batch_size=64, shuffle=True, num_workers=2)

#Iterating through the data loader object
for i, batch in enumerate(train_loader):
    print(i, batch)

You may notice that the DataLoader iterator can batch, shuffle and load the data using multiprocessing just by changing the parameters in the function call. Make sure you choose a batch size that fits your memory capacity. If you're loading the data onto the GPU, it's the GPU memory you should consider.

If you're using a multi-GPU setup with PyTorch DataLoaders, PyTorch tries to divide the data batches evenly among the GPUs. If the batch size is less than the number of GPUs you have, it won't utilize all of them.
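Here's a minimal multi-GPU sketch using nn.DataParallel (MyModel is a hypothetical stand-in for your own network; device and train_loader are the ones defined above):

import torch
import torch.nn as nn

model = MyModel()   # hypothetical model class, replace with your own network
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicates the model and scatters each batch
model = model.to(device)

for i, batch in enumerate(train_loader):
    img, label = batch[0].to(device), batch[1].to(device)
    output = model(img)   # with 2 GPUs, a batch of 64 runs as 2 chunks of 32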

I would say the custom Dataset and DataLoader combo in PyTorch has become a lifesaver for me in most complex data loading scenarios. I would love to hear about your experiences writing custom datasets and DataLoaders in PyTorch.

Happy Coding!  

GPU Accelerated Application Deployment with NVIDIA-Docker

When it comes to deep learning model development and training, personally, the majority of my time is spent on data pre-processing, and then on setting up the development environment. Cloud-based development environments such as Azure DLVM, Google Colab etc. are very good options when you don't have much time to spend installing all the required packages on your workstation. But there are times when we want to do the development on our own machines and train/deploy somewhere else (maybe on a client's environment, on a machine with a better GPU for faster training, or on a Kubernetes cluster). Docker comes in handy in these scenarios.

Docker provides both hardware and software encapsulation and allows portable deployment. If you are a data scientist, machine learning practitioner or deep learning developer, I strongly recommend you give Docker a try, and I'm pretty sure it'll make your life so much easier.

Alright! That's Docker! Now let's assume you are using Docker to deploy your deep learning applications, and you want to ship your deep learning model to a remote computer with a powerful GPU that allows you to use large mini-batch sizes and speed up your training process. Though Docker containers solve the problem of framework and platform dependencies, they are also hardware-agnostic, and this creates a problem!

Have you ever tried to access the GPU resources of the host computer from a program running inside a Docker container? Sadly, Docker does not natively support NVIDIA GPUs within containers.

The early workaround was installing the NVIDIA drivers inside the Docker container. That's a bit of a hassle, as the driver version installed in the container has to match the driver on the host.

To make Docker images that use GPU resources more portable, NVIDIA introduced nvidia-docker!

nvidia-docker

The NVIDIA-Docker plugin enables GPU-accelerated application deployment

Nvidia-docker is a wrapper around the docker command that mounts the host machine's GPUs into the Docker container. The only thing you need to pay attention to is the CUDA version you want to use.

So, in which scenarios can you use this? In my case, nvidia-docker comes in handy when I'm running my experiments on a cluster with more GPU power. What I do is containerize all my code into a Docker image and run it on the remote machine with nvidia-docker. (Windows folks… nvidia-docker is still not available for Windows hosts. Not sure if that is on the development timeline or not 😀)

Here's the official GitHub repo for nvidia-docker. Just install it, make sure to restart your Docker engine, and set nvidia-docker as the default Docker runtime. The rest is the same as building and running a typical Docker image.

Here's a simple Dockerfile I wrote for containerizing my PyTorch code. I've used CUDA 9.1. You can modify this for your needs.

FROM nvidia/cuda:9.1-base-ubuntu16.04

# Install some basic utilities
RUN apt-get update && apt-get install -y \
    curl \
    ca-certificates \
    sudo \
    git \
    bzip2 \
    libx11-6 \
    && rm -rf /var/lib/apt/lists/*

# Create a working directory
RUN mkdir /app
WORKDIR /app

# Create a non-root user and switch to it
RUN adduser --disabled-password --gecos '' --shell /bin/bash user \
    && chown -R user:user /app
RUN echo "user ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/90-user
USER user

# All users can use /home/user as their home directory
ENV HOME=/home/user
RUN chmod 777 /home/user

# Install Miniconda
RUN curl -so ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-4.5.1-Linux-x86_64.sh \
    && chmod +x ~/miniconda.sh \
    && ~/miniconda.sh -b -p ~/miniconda \
    && rm ~/miniconda.sh
ENV PATH=/home/user/miniconda/bin:$PATH
ENV CONDA_AUTO_UPDATE_CONDA=false

# Create a Python 3.6 environment
RUN /home/user/miniconda/bin/conda install conda-build \
    && /home/user/miniconda/bin/conda create -y --name py36 python=3.6.5 \
    && /home/user/miniconda/bin/conda clean -ya
ENV CONDA_DEFAULT_ENV=py36
ENV CONDA_PREFIX=/home/user/miniconda/envs/$CONDA_DEFAULT_ENV
ENV PATH=$CONDA_PREFIX/bin:$PATH

# Install PyTorch with CUDA 9.1 support
RUN conda install -y -c pytorch \
    cuda91=1.0 \
    magma-cuda91=2.3.0 \
    pytorch=0.4.0 \
    torchvision=0.2.1 \
    && conda clean -ya
RUN conda install -y opencv

# Install other dependencies from pip
# My requirements.txt file just contains the following packages I used for the code. Change this for your need.
# numpy==1.14.3
# torch==0.4.0
# torchvision==0.2.1
# matplotlib==2.2.2
# tqdm==4.28.1
COPY requirements.txt .
RUN pip install -r requirements.txt

# Create /data directory so that a container can be run without volumes mounted
RUN sudo mkdir /data && sudo chown user:user /data

# Copy source code into the image
COPY --chown=user:user . /app

# Set the default command to python3
CMD ["python3"]

Here are the bash commands for building the image and running the container using the NVIDIA runtime.

# 1. Build the image (tag it so you can refer to it when running)
docker build -t <dockerImage> .

# 2. Run the docker image
docker run \
    --runtime=nvidia -it -d \
    --rm <dockerImage> python3 <yourCode.py>
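Once the container is up, a quick way to confirm the GPU is actually visible from inside it is a tiny Python script (a sanity-check sketch; the file name check_gpu.py is just my choice):

#check_gpu.py - run inside the container to confirm nvidia-docker exposed the GPU
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device name:", torch.cuda.get_device_name(0))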

 

Just try it and see how easy your deep learning life becomes! Happy coding! 🙂