Compute Resources on Azure Machine Learning

Early concepts of machine learning emerged back in the 1950s, but people were not able to explore their full power because most machine learning algorithms were computationally expensive and the computing power of early systems was not enough to process large amounts of data with complex algorithms. Since cloud computing and GPUs opened the arena for performing complex computations, machine learning and deep learning have received a boost and are now used widely in many real-world applications.

As we discussed in previous posts, being a leading cloud-based machine learning platform, Azure Machine Learning addresses three main burdens in the machine learning model development and deployment process:

  1. Setting up the development environment, including resolving platform and software library dependencies.
  2. Setting up the computing environment (parallel-processing libraries such as CUDA, etc.).
  3. Setting up and managing deployments.

All three of these key areas in machine learning model development require some sort of computing resource to create and manage. The Azure Machine Learning service has centralized all of these resources in an easily accessible portal, allowing the developer to select the most suitable resource for their need.

The Compute tab of the AML studio contains (as of the date I'm writing this) four main compute categories for specific purposes. Let's go through each of them and see where we can use them in our machine learning experiments.

Compute Instances –

No need to mess around with configuring CUDA and all the Python packages to set up your laptop for machine learning experiments or data visualization. You can just go through a few steps in a wizard to create a preconfigured compute instance on Azure. This is very similar to creating a virtual machine on Azure. If you need GPU-based computing, you may have to select an N-series VM. (Make sure the region you selected has the required VM families.)

In order to run experiments on the compute instance you can use JupyterLab, Jupyter notebooks, RStudio, or connect through SSH. You have to create an SSH public key on Azure to access the compute instance through SSH.
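If you prefer code over the portal wizard, here's a minimal sketch of creating a compute instance with the AzureML Python SDK (the instance name, VM size, and the config.json workspace file are assumptions for illustration):

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeInstance, ComputeTarget

# assumes a config.json downloaded from the portal sits next to this script
ws = Workspace.from_config()

# pick an N-series size here if you need GPU compute
instance_config = ComputeInstance.provisioning_configuration(
    vm_size='STANDARD_DS3_V2',
    ssh_public_access=False
)

instance = ComputeTarget.create(ws, 'my-dev-instance', instance_config)
instance.wait_for_completion(show_output=True)
```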

Tips –

  • GPU-based compute instances are costly. Create such instances only if you really need to run deep learning based experiments.
  • Think of a complex deep learning scenario where data preprocessing needs a large amount of CPU processing time while model training should be done on GPUs… You can use two compute instances, where preprocessing happens on a CPU-based instance while the expensive GPU-based compute instance is used only for model training. (Connecting the two processes can be done using Azure Machine Learning pipelines.)
  • Make sure to deallocate the resources when you are not using them. (Otherwise you'd better have a fat wallet in your pocket!)

Training clusters –

Training clusters in Azure ML are the saviour when you have complex computations to perform. You can run tasks such as Automated ML and hyperparameter tuning on these preconfigured clusters, and the maximum number of nodes can be configured according to your needs. The underlying technology behind training clusters is Docker containers: your experiment is containerized and pushed to the cluster for computation/training.
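As a rough sketch, creating such a cluster through the Python SDK could look like this (the cluster name, VM size, and node counts are just illustrative):

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# min_nodes=0 lets the cluster scale down to zero, so you only pay while jobs run
cluster_config = AmlCompute.provisioning_configuration(
    vm_size='STANDARD_NC6',   # GPU node; use a D-series size for CPU-only workloads
    min_nodes=0,
    max_nodes=4
)

cluster = ComputeTarget.create(ws, 'gpu-cluster', cluster_config)
cluster.wait_for_completion(show_output=True)
```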

Inference clusters –

The end results of the experiments you perform are served from inference clusters. The web service endpoints you create can be deployed on these AKS-based inference clusters. You can go for a low-cost inference cluster with a few nodes for dev-test, and a high-performing cluster with many nodes according to the requirements of the production environment. (Normally we use ACI for dev-test and AKS for production web service endpoints.)
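A hedged sketch of provisioning an AKS inference cluster with the SDK (the cluster name and size are assumptions; the DEV_TEST purpose relaxes the usual minimum node requirements):

```python
from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget

ws = Workspace.from_config()

# DEV_TEST purpose allows a smaller, cheaper cluster than a production deployment
aks_config = AksCompute.provisioning_configuration(
    vm_size='Standard_D3_v2',
    agent_count=3,
    cluster_purpose=AksCompute.ClusterPurpose.DEV_TEST
)

aks_cluster = ComputeTarget.create(ws, 'aks-inference', aks_config)
aks_cluster.wait_for_completion(show_output=True)
```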

Attached compute –

This is an interesting feature of Azure Machine Learning where you can push your machine learning workloads to external computing environments. Right now, AML supports:

  • Azure Databricks
  • Data Lake Analytics
  • HDInsight
  • Virtual machine

Your VM should be running Ubuntu in order to connect as an attached compute.
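Attaching an existing Ubuntu VM through the SDK roughly looks like the sketch below (the address, username, and key path are placeholders you'd replace with your own):

```python
from azureml.core import Workspace
from azureml.core.compute import RemoteCompute, ComputeTarget

ws = Workspace.from_config()

# placeholder connection details for your own Ubuntu VM
attach_config = RemoteCompute.attach_configuration(
    address='<public-ip-or-fqdn>',
    ssh_port=22,
    username='azureuser',
    private_key_file='~/.ssh/id_rsa'
)

attached_vm = ComputeTarget.attach(ws, 'my-ubuntu-vm', attach_config)
attached_vm.wait_for_completion(show_output=True)
```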

I'll discuss how to use these computing resources in your machine learning experiments in future posts.

Happy coding! 😊  

Handling data sources on Azure Machine Learning

Image source : https://docs.microsoft.com/en-us/azure/machine-learning/concept-data

Data is the fundamental and most vital factor in any machine learning experiment, so the way you handle data in your experiments is crucial. Here we're going to discuss the different ways of managing your data sources inside Azure Machine Learning (AML).

Since the new Azure Machine Learning service is becoming the one-stop place for managing all ML-related workloads in Azure, these functions can be created and managed through the web portal or through the AzureML Python SDK (you may use the AzureML R SDK or the Azure CLI too).

Data comes in all shapes and sizes. In order to tackle these different data scenarios, AML offers several options for managing data. Let's discuss these options one by one, with their usages, pros, and cons.

Datastore

A Datastore is the place where the data in an AML experiment sits. Your AML workspace can have one or more Datastores attached, according to your needs.

AML is all about cloud-based machine learning, so I would recommend using some sort of Azure-based storage to hold your data in the first place. Blob storage, File Share, Data Lake Storage, Azure SQL Database, Azure Database for PostgreSQL, Azure Database for MySQL, and the Databricks file system are the currently supported storage types for creating Datastores. (Say your data is in an on-premises SQL database: you can use Azure Data Factory to migrate the data onto Azure.)

You can see the Datastores registered to your workspace either through the AML Studio (ml.azure.com) or through the Python SDK. When you create a workspace, two Datastores are created by default: workspacefilestore and workspaceblobstore.

Workspaceblobstore acts as the default Datastore for experiments. You can change the default at any time through the SDK. This is where all the code and other files you put in an experiment sit.
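For example, here's a minimal sketch of grabbing the default Datastore and registering an extra Blob-based Datastore through the Python SDK (the datastore name, container, and storage account are hypothetical):

```python
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# the default datastore (workspaceblobstore) is always there
default_ds = ws.get_default_datastore()

# register an existing blob container as an additional datastore
blob_ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='training_data',
    container_name='images',              # hypothetical container
    account_name='mystorageaccount',      # hypothetical storage account
    account_key='<storage-account-key>'
)
```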

Is it a good idea to keep your data on the workspaceblobstore?

  • Scenario #1 : You are doing a toy experiment with a small dataset (e.g. a 2 MB CSV file). The dataset is static and you have no plans to update it during the experiment. Yes! It's completely OK to keep the dataset inside workspaceblobstore.
  • Scenario #2 : You are doing a deep learning experiment with 100,000 images. No! Never use workspaceblobstore to keep your data.

Why not the workspaceblobstore always?

Workspaceblobstore has file and storage limits (300 MB and/or 2,000 files), so it's impossible to use it when you have a large dataset. This also directly affects the size of the Docker image (or snapshot) created for the experiment, and bulky snapshots or Docker images are not a good thing. Always keep things simple and modularized. So, for anything beyond small static datasets, workspaceblobstore is a no-go!

Datasets

AML datasets are the high-level abstraction of the data you use in experiments. You may create an AML dataset from:

  • A local file / local files
  • Registered datastore (from file(s) sit on a datastore)
  • Web URL
  • Azure Open Datasets

The AML datasets we create belong to one of two main types:

  • Tabular datasets – If you have a file or files containing data in a tabular format (CSV, JSON Lines files, Parquet files, tabular data in SQL databases, etc.), creating a tabular dataset is beneficial because it allows you to transform the data into a pandas or Spark DataFrame.
  • File datasets – Refer to a single file or multiple files in your Datastore or on a public URL. File datasets come in handy when you have a scenario like a dataset of 1K images.
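As a rough sketch, creating and registering both flavours with the Python SDK could look like this (the datastore name, file paths, and dataset names are hypothetical):

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
blob_ds = Datastore.get(ws, 'training_data')    # a previously registered datastore

# tabular dataset from a CSV sitting on the datastore
tabular_ds = Dataset.Tabular.from_delimited_files(path=(blob_ds, 'tables/diabetes.csv'))
df = tabular_ds.to_pandas_dataframe()            # straight into pandas

# file dataset covering a folder of images
file_ds = Dataset.File.from_files(path=(blob_ds, 'images/**'))

# register both so they are versioned and visible in the studio
tabular_ds.register(workspace=ws, name='diabetes-tabular', create_new_version=True)
file_ds.register(workspace=ws, name='image-files', create_new_version=True)
```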

AML datasets come with the advantage of versioning and tracking as well as monitoring. It's not hard to perform data drift detection or a simple statistical analysis on the fields of a dataset with a few clicks.

Microsoft recommends always using AML datasets in experiments rather than pointing to the datastore directly (which is totally possible). I've found pros and cons in both approaches.

  • AML datasets are easier to version and manage compared to datastores.
  • If you have tabular data, I would always recommend going for AML tabular datasets.
  • It becomes tricky when you have files. If you use File datasets, you have to use the to_path() method to get the list of file paths defined by the dataset, and it comes back as a flat list! If you are not concerned about the directory structure of the data this is totally fine, but if you wish to create custom dataloaders (e.g. PyTorch custom dataloaders, which differentiate classes according to directories) this may not come in handy. (You can work around it by processing the file paths to determine the directory structure, though 😀 — see the sketch after this list.)
  • Keep in mind that AML dataset mount() only works on Unix-like OSes. If you wish to run your experiment on a Windows workstation you may have to download() the dataset instead.
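Here's a small sketch of that workaround, assuming a registered file dataset named 'image-files' whose folders are the class names:

```python
import os
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
file_ds = Dataset.get_by_name(ws, 'image-files')

# to_path() returns a flat list such as ['/cats/001.jpg', '/dogs/002.jpg', ...]
paths = file_ds.to_path()

# recover the class label from the parent directory of each file
labels = [os.path.basename(os.path.dirname(p)) for p in paths]

# on a Unix-like compute you can mount the dataset; on Windows, download() instead
mount_context = file_ds.mount()
mount_context.start()
local_root = mount_context.mount_point   # feed this to your custom dataloader
# ... build the PyTorch dataset here ...
mount_context.stop()
```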

I'll discuss using these datasets in different model training scenarios in future posts.

These are just some of the experiences I had while playing around with the new Azure Machine Learning. The Microsoft Learning GitHub repo (https://github.com/MicrosoftLearning/DP100) for the DP-100 exam is a really nice place to find example code for these functionalities. Let me know your findings and experiences with AML too 😊

Happy coding!     

ml.azure.com – New Face of Azure Machine Learning

Azure Machine Learning Studio (preview) interface

Out of all the public cloud platforms, Microsoft Azure has brought almost all the steps of the machine learning life cycle into the cloud. Though the resources and the abilities are there, sometimes finding the correct cloud-based product or service to adopt for your solution can be a problem.

Providing a perfect answer to that issue, Azure has come up with the whole new Azure Machine Learning Studio, which is in preview at the time I'm blogging this! (Don't confuse it with the AzureML Studio, the drag-and-drop interface we had before. This is a new thing – ml.azure.com.) There are no framework dependencies or restrictions for using these services. You can easily adapt your open source machine learning code base (maybe written with Python, scikit-learn, TensorFlow, PyTorch, Keras… anything).

The most awesome feature of this new service is the AzureML Python SDK (https://docs.microsoft.com/en-us/python/api/overview/azure/ml/intro) and the AzureML R SDK (https://github.com/Azure/azureml-sdk-for-r). These SDKs allow you to create and manage ML experiments in your familiar coding style.

In order to use this one-stop solution, you have to create an Azure Machine Learning service from the Azure portal, which will then direct you to the new interface. You can go for either the Enterprise pricing tier or the Basic tier. In the Basic tier you won't get the visual designer or Automated ML features.

Launching the Studio through Azure portal

Let's go through each and every tab in the side pane of the latest release and see what we can do with them.

Notebooks –

These are fully managed Jupyter notebook instances in the cloud. The notebook servers run on top of a new VM instance type called a "notebook VM". These notebook VMs are fully configured work environments for your machine learning and data science tasks. No need to worry about installing all the Python packages and their dependencies; they're already there! You also have the privilege of changing the notebook VM size (yes, GPU-enabled VMs are available too) or installing new packages through a Python package manager.

Automated ML –

Not available in the Basic tier though. Automated ML is a process of selecting the algorithm best suited to the dataset you have. Right now it supports classification, regression, and time-series forecasting for tabular data formats; it is not supported for deep learning based computer vision applications, and deep learning based text analysis is also in preview. The Automated ML process runs a set of machine learning algorithms on top of the data you provide and sees which one gives the best score on the chosen metric. Good for building prototypes, and in some cases even for production.
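You can also drive Automated ML from the SDK. Below is a hedged sketch (the dataset, label column, compute target name, and timeout are assumptions, and parameter names can differ slightly between SDK versions):

```python
from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
training_data = Dataset.get_by_name(ws, 'diabetes-tabular')   # hypothetical tabular dataset

automl_config = AutoMLConfig(
    task='classification',
    training_data=training_data,
    label_column_name='label',            # hypothetical label column
    primary_metric='AUC_weighted',
    compute_target='gpu-cluster',         # a training cluster registered in the workspace
    experiment_timeout_hours=1
)

run = Experiment(ws, 'automl-demo').submit(automl_config, show_output=True)
```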

Designer –

This is the evolution of Azure ML Studio (the old drag-and-drop thing). It seems Microsoft is going to end that product's lifecycle and make the new Designer its replacement. Here you can build the complete ML workflow by dragging and dropping modules. If you want, you can integrate R or Python scripts into the experiment. The machine learning service endpoint can be exposed through an Azure Kubernetes Service (AKS) deployment.

Datasets –

The place to manage and version your datasets. Datasets can be either tabular or file based. Here you can profile your dataset by performing a basic statistical analysis on your data. If your dataset is sitting on a datastore (which we're going to discuss later), the dataset acts as a high-level encapsulation of that data.

Experiments –

You may execute several runs of the same experiment with different configurations. This is the place where you can see all their log files and compare the runs with each other.
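A tiny sketch of logging metrics against an experiment run through the SDK (the experiment name and metric values are made up):

```python
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='my-first-experiment')

# an interactive run; script runs submitted via experiment.submit() log the same way
run = experiment.start_logging()
run.log('accuracy', 0.91)          # logged metrics show up under the experiment's runs
run.log('learning_rate', 0.001)
run.complete()
```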

Pipelines –

Don't confuse Azure Machine Learning pipelines with Azure Pipelines. Azure ML pipelines are specifically designed for MLOps tasks: you can manage the whole experiment process, right up to production, using ML pipelines. These pipelines are reusable and support collaborative development of the solution.
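As a sketch, a two-step pipeline that does preprocessing on a CPU cluster and training on a GPU cluster (just like the tip from the compute post) might look like this; the script names, source directory, and compute target names are hypothetical:

```python
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

prep_step = PythonScriptStep(
    name='preprocess',
    script_name='prep.py',            # hypothetical script in ./src
    source_directory='./src',
    compute_target='cpu-cluster'
)

train_step = PythonScriptStep(
    name='train',
    script_name='train.py',
    source_directory='./src',
    compute_target='gpu-cluster'
)
train_step.run_after(prep_step)       # chain the steps

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
pipeline_run = Experiment(ws, 'two-step-pipeline').submit(pipeline)
```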

Models –

You can register your trained ML models here. Versioning models and managing which model goes to production are some use cases of this model registry. You can also register models that have been trained outside the particular Azure ML workspace.
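Registering a model from the SDK is nearly a one-liner; here's a hedged sketch assuming a locally saved scikit-learn model file:

```python
from azureml.core import Workspace, Model

ws = Workspace.from_config()

model = Model.register(
    workspace=ws,
    model_path='./outputs/model.pkl',    # local file (or folder), e.g. trained outside AML
    model_name='sklearn-classifier',
    tags={'framework': 'scikit-learn'}
)
print(model.name, model.version)         # the registry versions the model automatically
```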

Endpoints –

The endpoint of an Azure Machine Learning experiment can be a web service or an IoT module endpoint. Managing the endpoint keys and so on is done in this section.
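As a rough sketch, deploying a registered model as an ACI web service endpoint could look like this (the scoring script, environment file, and endpoint name are assumptions; for production you'd target an AKS inference cluster instead):

```python
from azureml.core import Workspace, Model, Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, 'sklearn-classifier')               # a previously registered model

env = Environment.from_conda_specification(            # hypothetical conda spec with your packages
    name='inference-env', file_path='environment.yml')
inference_config = InferenceConfig(entry_script='score.py', environment=env)

deploy_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, 'my-endpoint', [model], inference_config, deploy_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)                              # keys via service.get_keys() when auth is enabled
```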

Compute –

In most cases, you will use Azure for your computations. Here in the Compute section you can create and manage the following compute resource types.

  • Notebook VMs – as we discussed previously in the Notebooks section, this is a fully managed ML development environment suited for development and prototyping purposes.
  • Training clusters – You can create either a CPU-based or a GPU-based cluster for running your experiments. Note that you are charged according to the computation hours as well as the number of nodes you use. The good thing is there's no charge while the cluster isn't being used for computation.
  • Inference clusters – These are the AKS clusters where you can deploy your endpoints. You can even register an existing AKS cluster as an inference cluster.
  • Attached compute – If you are working with Azure Databricks, Data Lake Analytics, or HDInsight, you can configure the compute here. In an interesting use case, you can also attach your own physical computer (which should be a workstation running Ubuntu) as a compute target through the AML service.

Datastores –

When it comes to machine learning experiments, it's normal to have a large amount of data, and that data may sit in your Azure storage. A Datastore is a storage abstraction over an Azure storage account which you can then use inside your machine learning experiments.

Data labeling –

A cool new feature for data annotators. Right now this supports image classification (multi-label / multi-class) and object identification (bounding box) annotations. The annotators don't have to have an Azure subscription, so you can easily outsource your tedious annotation workload through this feature.

This is just an overview of the options we have with the new Azure Machine Learning Studio. It's pretty clear that the Azure team is going to bring all the ML-related services under one umbrella. Let's discuss some cool use cases and tips on using these services in the next blog posts.

Happy coding! 😊