Accessing data in different data sources is one of the main tasks in machine learning model development life cycle. Let’s discuss one of the most common data accessing scenarios.
We have to set of relational data points stored in a Azure SQL server to develop a machine learning model using Azure Machine Learning. Let’s see how to leverage data stored in an Azure SQL database in an Azure Machine Learning experiment.
The process contains three main steps.
Set access permissions of Azure SQL database
Connect Azure SQL database to an Azure ML datastore
Register the data in datastore as an Azure ML dataset.
1. Set access permissions of Azure SQL database
By default Azure SQL databases are protected with a firewall which limits outside access for data. Since we going to provide access for the traffic from Ips belongs to Azure resources and services, make sure you allow Azure services to access your SQL server.
2. Connect Azure SQL database to an Azure ML datastore
Azure ML datastores can be defined as the abstraction of data sources for the ML workspace or as the interconnection between the data resource and AzureML workspace.
Go to your Azure Machine Learning Studio (ml.azure.com) and click ‘New datastore’. Provide a datastore name and select ‘Azure SQL database’ as the datastore type. Make sure to authenticate the access with Azure SQL server’s user ID and the password.
3. Register the data in datastore as an Azure ML dataset.
In a later article we discussed on different data storage methods we can use with Azure Machine learning. In this article am gonna briefly discuss different computation options we have with Azure ML.
Since computation power is one of the key advantage we get from cloud based machine learning, choosing correct computation resource for our machine learning experiments is important.
AzureML offers 4 main compute types.
01. Compute instances –
If you don’t wanna spend the time in setting up your local computer for doing the ML experiments or you wanna leverage GPUs or powerful CPUs for doing your experiments, Azure Compute instances offer fully managed virtual machines loaded with most of the essential frameworks /libraries for performing machine learning and data science experiments. When you using AzureML notebooks (jupyter notebook instance attached for AzureML), compute instance is the place where the jupyter notebook is running.
You can access the compute instances using different methods. Accessing through Jupyter notebooks and JupyterLab is the all time favourite of most of the data scientists. If you are a R folk, you can use Rstudio with the compute instances. Accessing the compute instance through SSH is really useful (you may have to enable SSH access when creating the compute instance) in occasions where you have to install custom packages and such for the compute instance. (The machine is ubuntu based and you can use all bash scripts there!)
Basically compute instance can be defined as a virtual machine fully loaded with data science and machine learning essentials which you can use right out of the box.
02. Compute clusters –
Compute clusters are different from compute instances with their ability of having one or more compute nodes. These compute nodes can be created with our desired hardware configurations.
Why having more than one node? That comes with the ability of using parallel processing for computations. If you are going do to hyperparameter tuning/ GPU based complex computations/ several machine learning runs at once you may have to create a compute cluster.
If you are running Automated Machine Learning expriment with AzureML, you must have a compute cluster to perform computations.
When selecting the node configurations, you can either go with CPU based nodes or GPU based nodes. GPU based nodes (NC type etc.) is bit pricy. If you are not using GPU based computing, don’t waste your dollars by just creating a compute cluster with some fancy configs.
One other key setting is ‘Virtual machine priority’. If you are ok with pushing your experiment to the cloud and get the result without a hurry, you can go with low priority nodes, which will save you a lot of dollars rather than using dedicated VMs. No harm is gonna happen for the experimentation accuracy and such.
03. Inference clusters –
There are two options to deploy Azure machine learning web services as REST endpoints. 1) Use ACI (Azure Container Instances) 2) Use AKS (Azure Kubernetes Service)
Deploying the REST web service on ACI is good for testing and development uses and AKS would be the to-go for production level large deployments. You can configure the AKS cluster according to your need through AzureML as well as from the Azure portal. These AKS clusters are pretty much similar for AKS clusters you worked in any other Azure based deployments.
04. Attached compute –
Azure machine learning is not limited for doing computations on compute clusters. You can attach Azure Databricks, Data Lake Analytics, HDInsight or a prevailing VM as a compute for your workspace. Keep in your mind that Azure machine learning only supports virtual machines running Ubuntu. These compute targets will not be managed by Azure Machine Learning itself. So you may have to perform some additional steps to make sure they are compatible with your experiments.
Choosing the correct compute resource is a key component in the success of developing machine learning experiments. On the other hand, bad computation choices may leave you with huge Azure bills! 😀
There’s no hard bound rules on selecting different compute options for your machine learning life cycle. Just make sure you use the right tool at the right time.
In the last blog post, we discussed on developing a machine learning classifier with Azure machine learning service. As mentioned there, we going to utilize the familiar development tools and frameworks for model development.
Key areas of the SDK include:
Explore, prepare and manage the lifecycle of your datasets used in machine learning experiments.
Manage cloud resources for monitoring, logging, and organizing your machine learning experiments.
Train models either locally or by using cloud resources, including GPU-accelerated model training.
Use automated machine learning, which accepts configuration parameters and training data. It automatically iterates through algorithms and hyperparameter settings to find the best model for running predictions.
Deploy web services to convert your trained models into RESTful services that can be consumed in any application.
AzureML python SDK acts as the connector between all the resources on the cloud and the dev environment.
Installing Python SDK –
AzureML SDK can be easily installed for your local computer through pip. Refer this guide for the installation process. I’d suggest to go with default installation since it’s enough for the most of the operations we used in the experiment. It’s a good idea to upgrade the SDK before running an experiment since the package is rapidly updating.
Download config file –
For connecting the AzureML workspace, we may need the Azure subscription ID, resource group which the workspace has been created and the workspace name. The easiest way to grab these details is downloading the config.json file from the Azure portal. Place this file inside the experiment directory.
Connect AzureML Workspace –
Connecting the AzureML workspace and and listing the resources can be done by using easy python syntaxes of AzureML SDK (A sample code is provided below). Refer Python SDK documentation to do modifications for the resources of the AML service.
#!pip install --upgrade azureml-sdk[notebooks]
from azureml.core import Workspace
from azureml.core import ComputeTarget, Datastore, Dataset
print("Ready to use Azure ML", azureml.core.VERSION)
ws = Workspace.from_config()
#View resources in the workspace
for compute_name in ws.compute_targets:
compute = ws.compute_targets[compute_name]
print("\t", compute.name, ":", compute.type)
for datastore in ws.datastores:
for dataset in ws.datasets:
for webservice_name in ws.webservices:
webservice = ws.webservices[webservice_name]
In next blog article, will discuss the data loading and experiment creation.
Azure Machine Learning Service is becoming the one-stop place for managing all ML related workloads in Azure cloud. There are two main advantages of using Azure Machine Learning Service for your ML and data science experiments.
#1 – You can mange the whole machine learning workflow in a single environment. From data wrangling to machine learning service deployment, everything is managed on the cloud with its reliable, scalable and efficient services.
#2 – You can use your familiar open source toolset, languages and frameworks in model development. Being a ML engineer or a data scientist, you may be using python or R as your main development languages. Azure machine learning allows you to use any of those languages and frameworks to develop the experiments.
Pima Indians Diabetes Classification is one of the most famous machine learning experiments. It’s a binary classification problem which use a .CSV based tabular dataset as the input. I’ll walk you through the process I went through to perform my experiment.
Diabetes dataset is available as a .CSV file in your local file system.
I have to build a binary classifier trained with the dataset and deploy it as a web service with a REST endpoint.
As shown in the diagram I used the services and tools in AMLS with my typical development environments to build up the solution.
Step 1: Since the experiment is going to build on Azure cloud, I have transferred my dataset into an Azure blob storage. I used Azure storage Explorer to upload the dataset into the cloud. (For better performance, make sure the dataset is in a storage blob in the same region where the AMLS experiment is)
Step 2: In order to access the data stored in the blob space, it’s registered inside AMLS as a datastore.
Step 3: AMLS supports two types of datasets. Since the .CSV file contains tabular data, it’s registered as a tabular dataset. (You can perform the basic statistical operations and visualizations after registering as a tabular dataset.)
Step 4: Now it’s the time for the real job. Since am more familiar with Python and sci-kit learn, I used those languages and libraries to develop my model. The whole coding part has been done on a Linux machine using my favorite VSCode IDE. 😉 You may wonder how I’m going to connect the code base on my local machine with the cloud… Here’s the place where AzureML python SDK comes to the rescue.
Step 5: I don’t have enough computation power to do the model training on my machine. So that, I use an Azure compute cluster to perform the computation. (In my experiment I did hyperparameter tuning to select the best parameters. Using the compute cluster allowed me to perform parallel training)
Step 6: After model training and getting the desired inference accuracy, I had the need of exposing the binary classification model as a web service. For that, I used Azure Container Instance (ACI) since this is going to be a small testing experiment. (I may have to go for an Azure Kubernetes Services (AKS) if I wanna go for global massive deployment)
Yp! It’s just a simple 6 step process. Complex? Don’t worry, I’m going to walk you through the whole process assisted with the code snippets in the upcoming blog posts. Stay tuned. Let’s start a real experiment with Azure Machine Learning Service.
Designing and implementing predictive experiments requires an understanding about the problem domain as well as the knowledge on machine learning algorithms and methodologies. Extensive knowledge on programming is a necessity when it comes to real-world machine learning model training and implementations.
Automated machine learning is capable of training and tuning a machine learning model for a given dataset and specified target metrics by selecting the appropriate algorithms and parameters by its own. Azure Machine Learning offers a user-friendly wizard-like Automated ML feature for training and implementing predictive models without giving you the burden of algorithm and hyperparameter selection.
Azure Automated ML comes handy, where you are able to implement a complete machine learning pipeline without a single line of coding. It saves a time and compute resources since the model tuning is done by following data science best practices.
Azure machine learning currently supports three types of machine learning user cases in their AutomatedML pipeline.
1. Classification – To predict one of several categories in the target column
2. Time series forecasting – To predict values based on time
3. Regression – To predict continuous numeric values
Let’s go through the step by step process of developing a machine learning experiment pipeline with Azure Automated ML.
01. CREATE AN AZURE MACHINE LEARNING WORKSPACE
Azure Machine Learning Workspace is the resource you create on Azure to perform all machine learning related activities on the cloud. The steps are straight forward same as creating any other Azure resource. Make sure you create the Workspace edition is ‘Enterprise’ since AutoML is not available in the basic edition.
02. CREATE AUTOMATEDML EXPERIMENT
ml.azure.com web interface is the one stop portal for accessing all the tools and services related to machine learning on Azure. You have to create a new Automated ML run by selecting Automated ML on the Author section of the left pane.
03. SELECT DATASET
As of now, AutomatedML supports tabular data formats only. You can upload your dataset from the local storage, import from a registered datastore, fetch from a web file or else retrieve from Azure open datasets.
04. CONFIGURE RUN
In this section you have to specify the target column of the experiment. If it’s a classification task this should be the column that indicates the class values and if it’s regression that’s the column where the numerical value to be predicted. Select a training cluster where the experiments going to run. Make sure you select a cluster that is enough for the complexity of the dataset you provided.
05. SELECT TASK TYPE AND SETTINGS
Select the task type that is appropriate for the dataset you selected. If you have textual data in your dataset you can enable deep learning (which is in preview) to extract the features.
In the settings of the run, you can specify the evaluation metrics, any algorithms that you are don’t want to use, validation type, exit criterion etc. for the experiment. If you wish to select only a specific set of features in the provided dataset you can configure that through the settings.
Running the experiment may take some time depending on the complexity of the dataset, algorithms you use and the exit criterion you used.
When the run is completed AzureML provides a summary of the run by indicating the best performing algorithm. You can directly deploy or download the best performing model as a .pkl file from the portal.
Deployment comes as a REST API which runs on an Azure Kubernetes Service (AKS) or Azure Container Instance (ACI).
AutomatedML comes handy when you need to do fast prototyping for a specific set of data and supports the agile process of intelligent application development. Will look on the other tools and features we have on Azure AI stack in the coming articles.
One of the most time-consuming tasks in the machine learning model development pipeline is data labeling. When it comes to a computer vision related task which may use deep learning methodologies, you may need thousands of labelled data to train your models.
Azure Machine Learning offers a new feature for data labeling tasks specifically designed for computer vision related applications. Right now, AzureML supports 3 types of data labeling tasks.
Image Classification multi-label – If the images in the image set is having more than one label for the image, this is the task type to go with.
Image Classification multi-class – This is the simple image classification type tasks, where each image is having a single label and the dataset is having multiple labels
Object Identification (Bounding Box) – If you need to have annotations for set of images to train a model for object detection, you may have to have bounding box annotations. This is the task type you should choose for such tasks.
Data labeling feature is available in both basic and enterprise versions of AzureML. Despite Basic version is not having the capability of ML assisted data labeling where a ML model is automatically built to assist the labeling process. ML assisted data labeling need a GPU based compute resource for performing model training and inferencing. (Obviously this comes with a cost then 😉 )
There are two ways that you can add the images to build the dataset.
Upload the image files into an Azure blob storage and register it as a DataStore in Azure ML (I’d recommend this since it may bypass the storage restrictions of the default storage)
Upload directly for the default storage
One of the attractive features I’ve seen in the data labeling process is the ability to use keyboard shortcuts which makes the process much more user friendly.
The annotation files can be exported as a JSON which follows the COCO dataset format (This file is saved in the default blob storage of the experiment) or else can be registered as an Azure ML dataset. The progress of the data labeling project can be monitored through the dashboard on AzureML Studio.
Seems like Microsoft is having more plans on developing this product further and hope there would be interesting additions in the coming future.
Being one of the major public cloud providers, Microsoft Azure provides numarous products, services and tools for intelligent application development. This is a high level guideline to select the appropriate product for your application development.
I have just pinpointed the most used tools and services here. The services can be interconnected with each other in order to develop applications for more complex use cases.
Early concepts of machine learning came out back in 1950s but people were not able to explore the full power since most of the machine learning algorithms were largely computationally expensive. The computing power of the early systems were not enough to process large amounts of data using complex algorithms. Since cloud computing and GPUs opened the arena to perform complex computations, machine learning and deep learning got a boost and now used widely in many real-world applications.
As we discussed in the previous posts, being a leading cloud-based machine learning platform, Azure Machine Learning solves three main burdens in machine learning model development and deployment process.
Setting up development environment with the burden of solving platform and software library dependencies.
Setting up the computing environments (parallel processing libraries (CUDA) etc.)
Setting and managing deployments
All of these 3 key areas in machine learning model development require some sort of a computing resource to create and manage. Azure machine learning service has centralized all the resources to a easy accessible portal allowing the developer to select the most suitable resource for their need.
Compute tab of the AML studio contains (as of the date am writing this) 4 main compute categories for specific purposes. Will go through each of those and see in where we can occupy them in our machine learning experiments.
Compute Instances –
No need of messing around with configuring CUDA and all python packages to set up the laptop for performing machine learning experiments or the data visualization. You can just go through few steps in a wizard to create a preconfigured computing instance on Azure. This is more similar for creating a virtual machine on Azure. If you need to use GPU based computing, you may have to select a N-series VM on Azure. (Make sure the region you selected is having the required VM families)
In order to do the experiments on the compute instance you can either use JupyterLab, Jupyter notebook, RStudio or through an SSH connection. You have to create a SSH public key on Azure to access the compute instance through SSH.
GPU based compute instances are costly. Create such instances if you really need to do deep learning based experiments.
Think of a complex deep learning scenario where data preprocessing need large amount of CPU processing time while model training should be done using GPUs… You can use two compute instances where preprocessing happens in a CPU based instance while the GPU based expensive compute instance is used for model training. (Connecting these two processes can be done using Azure machine learning pipelines)
Make sure to deallocate the resources when you are not using it. (Else you should have a fat wallet in your pocket)
Training clusters –
Training clusters in Azure ML is the survivor when we are having complex computations to perform. You can perform tasks as Automated ML and hyperparameter tuning on these preconfigured clusters. The maximum number of nodes can be configured according to your need. Underlying technology behind the training clusters is docker containers. Simply you containerize your experiment and push into the cluster for computations/training.
Inference clusters –
The end result of the experiments you perform sits on inference clusters. The web service endpoints you create can be deployed on this AKS based inference clusters. You can go for a low-cost inference cluster with few nodes for dev-test and a high performing cluster with many nodes according to the requirements in the production environment. (normally we use ACI for dev-test and AKS for production web service end points)
Attached compute –
This is an interesting feature in Azure machine learning where you can push your machine learning workloads into external computing environments. Right now, AML is supporting
Data Lake Analytics
Your VM should be running Ubuntu in order to connect as an attached compute.
Will discuss how to use these computing resources in your machine learning experiments in the future posts.
Being the fundamental and the most vital factor in any machine learning experiment, the way of handling data in your experiments is crucial. Here we going to discuss different ways of managing your data sources inside Azure Machine Learning (AML).
Since the new Azure Machine Learning Service is becoming the one-stop place for managing all ML related workloads in Azure, the functions and methods can be created/managed using the web portal or using AzureML python SDK (You may use AzureML R SDK or the Azure CLI too)
Data comes in all shapes and sizes. In order to tackle these different data scenarios AML offers different options to manage the data. Let’s discuss these options one by one with their usages, pros and cons.
Datastore is the place where the data sits in an AML experiment. Your AML workspace can have one or more Datastores connected according to your need.
AML is all about cloud-based machine learning. So that, I would recommend using some sort of Azure based storage to store your data in the first hand. Blob storage, File Share, Data Lake Storage, Azure SQL database, Azure PostgreSQL, Azure DB for MySQL and Databricks file system are currently supported data storage types to create Datastores. (Will say your data is at on-prem SQL database. You can use Azure Data Factory to migrate your data load onto Azure.)
You can see the Datastores registered to your workspace either through AML Studio (ml.azure.com) or through the python SDK. When you create a workspace, by default two Datastores are created : workspacefilestore and workspaceblobstore.
Workspaceblobstore acts as the default Datastore of experiments. You can change it at any time through the SDK. This is the place where all your code and other files you put in the experiment sits.
Is it a good idea to keep your data on the workspaceblobstore?
Scenario #1 : You are doing a toy experiment with a small dataset (Eg. a 2 MB CSV file). Dataset is static and no plan on updating that during the experiment. Yes! That’s completely ok to keep the dataset inside workspaceblobstore.
Scenario #2 : You are doing a deep learning experiment with a 100,000 images. No! Never use workspaceblobstore to keep your data.
Why not the workspaceblobstore always?
Workspaceblobstore is having a file and storage limitation (300MB and/or 2000 files). So that it’s impossible to use it when we have a large dataset. On the other hand, this directly affects for the docker image size (or the snapshot size) you may create for the experiment. Bulky snapshots or the docker images are not a good thing. Always keep it simple and modularized. So always the workspaceblobstore is a no go!
AML datasets is the high-level abstraction of the data you use in experiments. You may create an AML dataset from,
A local file / local files
Registered datastore (from file(s) sit on a datastore)
Azure Open Datasets
The AML datasets we create may belong to two main types:
Tabular datasets – If you have a file/ files that contains data in a tabular format (CSV, JSON line files, Parquet files, Tabular data in SQL databases etc.) creating a tabular dataset would be beneficial as it allows to transform data into a Pandas or Spark DataFrame.
File datasets – Refer to a single or multiple file in your Datastore or on a public URL. File datasets comes handy when you have a scenario like a dataset with 1K images.
AML datasets comes with the advantage of versioning and tracking as well as monitoring. It’s not a hard thing to create perform data drift detection or a simple statistical analysis on data fields of the dataset with a few clicks.
Microsoft recommends to use AML datasets always in experiments rather than pointing for the datastore directly (which is totally possible). I’ve found out pros and cons in both approaches.
AML datasets are easy to version and manage compared to datastore.
If you have a tabular dataset, I would always recommend to go for AML tabular datasets.
It becomes tricky when you have files. If you use File datasets, you have to use to_path() method to get the list of file paths defined by the dataset. This comes as a flat list! If you are not concerned about the directory structure of the data this is totally fine. But if you wish to create custom dataloaders (Eg. PyTorch custom dataloaders which allows to differentiate classes according to directories) this may not come handy. (You can do a workaround by processing the file path to determine the directory structure though 😀 )
Keep in mind that AML dataset mount() only works for unix-like OS. If you wish to run your experiment on a windows running workstation you may have to download() the dataset.
Will discuss on using these datasets in different model training scenarios in future posts.
There are just some of the experiences I had when playing around with new Azure Machine Learning. The Microsoft Learning GitHub repo (https://github.com/MicrosoftLearning/DP100) for DP100 exam is a really nice place to find some example code on using these functionalities. Let me know your find outs and experiences with AML too 😊
Out of all the public cloud platforms, Microsoft Azure has adopted all most all the steps in machine learning life cycle into cloud. Though the resources and the abilities are there, sometimes finding the correct cloud-based product or service to adopt for your solution might be a problem.
Providing a perfect answer for that issue, Azure has come up with the whole new Azure Machine Learning Studio which is in the preview by the time am blogging this! (Don’t get confused with the AzureML Studio, the drag and drop interface we had before. This is a new thing – ml.azure.com ) There’s no framework dependency or restrictions for using these services. You can easily adapt your open source machine learning code base (may be written with Python, sci-kit learn, TensorFlow, PyTorch, Keras… anything)
In order to use this one-stop solution in Azure you may have to create an Azure Machine Learning Service from Azure portal. Then it’ll direct you for the new interface. You can either go for the Enterprise pricing tier or for the Basic tier. In basic tier you won’t get the visual designer and Automated ML features.
Let’s go through each and every tab we got in the side pane in the latest release and see what can we do with them.
These are fully managed Jupyter notebook instances on cloud. These notebook servers are running on top of a new VM instance type called “notebookVM”. There notebookVMs are fully configured work environments to do your machine learning and data science tasks. No need to worry about installing all the python packages and its dependencies. All are already there! You have the privilege to change the notebook sizes (Yes GPU enabled VMs are also there) or install new packages through a python package manager too.
Automated ML –
Not available in the free tier though. A process of selecting the best suited algorithm for the dataset you are having. Right now, this is supporting classification, regression and time series forecasting for tabular data formats. Not supported for deep learning based computer vision applications. Deep learning based text analysis is also in preview. The Automated ML process runs a set of machine learning algorithms on top of your provided data and see which one gives the best accuracy metric. Good for building prototypes and even in some cases might be in production.
This is the evolution of Azure ML Studio (Old drag and drop thing). Seems like Microsoft is going to end its lifecycle and give the new Designer be its replacement. Here you can build the complete ML workflow by dragging and dropping modules. If you want can integrate R or python scripts in the experiment. The machine learning service endpoint can be exposed through a Azure Kubernetes Service (AKS) deployment.
The place to manage and version your datasets. Datasets can be either tabular or file based. Here you can profile your dataset by performing a basic statistical analysis on your data. If your dataset is sitting on a datastore (which we going to discuss later) this acts as a high-level encapsulation of that data.
You may execute several runs on the same experiment with different configurations. This is the place where you can see all the log files of them and compare the runs with each other.
Don’t get confused Azure machine learning pipelines with the Azure pipelines. Azure ML pipelines are specifically designed for MLOps tasks. You can manage the whole experiment process till production using ML pipelines. These pipelines are reusable and help collaborative development of the solution.
You can register the trained ML models here. Versioning the models, managing which model to go for production are some use cases of this model registry. You can register models that has been trained outside the particular Azure ML workspace too here.
Endpoint of an azure machine learning experiment can be a web service or an IoT module endpoint. Managing the endpoint keys etc. are performed in this section.
In most of the cases, you may use Azure for computations. Here in the Compute section you can create and manage the following compute resource types.
Notebook VMs – as we discussed previously in the Notebooks section this is a fully managed ML development environment suits for development and prototyping purposes.
Training clusters – You can make a either a CPU based or a GPU based cluster for running your experiments. Note that this would be charged according to the computation hours as well as for the number of nodes you are using. Good thing is there’s no charge when you are not using the cluster for computation.
Inference clusters – This talk about AKS clusters where you can deploy the endpoints. Even you can register a prevailing AKS cluster as an inference cluster.
Attached compute – If you working with Azure Databricks, Data Lake Analytics or HDInsight you can configure the computation here. In an interesting use case if you want to attach your physical computer (which should be a workstation running Ubuntu) as a compute target it’s also possible through the AML service.
When it comes to machine learning experiments its normal to have large amount of data. These data may sit in your Azure storage. Datastore is the storage abstraction over an Azure storage account which then you can use inside your machine learning experiments.
Data labeling –
A cool new feature for data annotators. Right now, this supports Image classification in multi-label / multi-class and object identification (bounding box) annotations. The annotator should not have to have an Azure subscription. You can easily outsource your tedious annotation workload through this feature.
This is just an overview of the options we are having with new Azure Machine Learning Studio. It’s pretty clear that Azure team is going to get all the ML related services under one umbrella. Let’s discuss some cool use cases and tips on using these services in next blog posts.