In the last blog post, we discussed on developing a machine learning classifier with Azure machine learning service. As mentioned there, we going to utilize the familiar development tools and frameworks for model development.
Key areas of the SDK include:
Explore, prepare and manage the lifecycle of your datasets used in machine learning experiments.
Manage cloud resources for monitoring, logging, and organizing your machine learning experiments.
Train models either locally or by using cloud resources, including GPU-accelerated model training.
Use automated machine learning, which accepts configuration parameters and training data. It automatically iterates through algorithms and hyperparameter settings to find the best model for running predictions.
Deploy web services to convert your trained models into RESTful services that can be consumed in any application.
AzureML python SDK acts as the connector between all the resources on the cloud and the dev environment.
Installing Python SDK –
AzureML SDK can be easily installed for your local computer through pip. Refer this guide for the installation process. I’d suggest to go with default installation since it’s enough for the most of the operations we used in the experiment. It’s a good idea to upgrade the SDK before running an experiment since the package is rapidly updating.
Download config file –
For connecting the AzureML workspace, we may need the Azure subscription ID, resource group which the workspace has been created and the workspace name. The easiest way to grab these details is downloading the config.json file from the Azure portal. Place this file inside the experiment directory.
Connect AzureML Workspace –
Connecting the AzureML workspace and and listing the resources can be done by using easy python syntaxes of AzureML SDK (A sample code is provided below). Refer Python SDK documentation to do modifications for the resources of the AML service.
#!pip install --upgrade azureml-sdk[notebooks]
from azureml.core import Workspace
from azureml.core import ComputeTarget, Datastore, Dataset
print("Ready to use Azure ML", azureml.core.VERSION)
ws = Workspace.from_config()
#View resources in the workspace
for compute_name in ws.compute_targets:
compute = ws.compute_targets[compute_name]
print("\t", compute.name, ":", compute.type)
for datastore in ws.datastores:
for dataset in ws.datasets:
for webservice_name in ws.webservices:
webservice = ws.webservices[webservice_name]
In next blog article, will discuss the data loading and experiment creation.
Azure Machine Learning Service is becoming the one-stop place for managing all ML related workloads in Azure cloud. There are two main advantages of using Azure Machine Learning Service for your ML and data science experiments.
#1 – You can mange the whole machine learning workflow in a single environment. From data wrangling to machine learning service deployment, everything is managed on the cloud with its reliable, scalable and efficient services.
#2 – You can use your familiar open source toolset, languages and frameworks in model development. Being a ML engineer or a data scientist, you may be using python or R as your main development languages. Azure machine learning allows you to use any of those languages and frameworks to develop the experiments.
Pima Indians Diabetes Classification is one of the most famous machine learning experiments. It’s a binary classification problem which use a .CSV based tabular dataset as the input. I’ll walk you through the process I went through to perform my experiment.
Diabetes dataset is available as a .CSV file in your local file system.
I have to build a binary classifier trained with the dataset and deploy it as a web service with a REST endpoint.
As shown in the diagram I used the services and tools in AMLS with my typical development environments to build up the solution.
Step 1: Since the experiment is going to build on Azure cloud, I have transferred my dataset into an Azure blob storage. I used Azure storage Explorer to upload the dataset into the cloud. (For better performance, make sure the dataset is in a storage blob in the same region where the AMLS experiment is)
Step 2: In order to access the data stored in the blob space, it’s registered inside AMLS as a datastore.
Step 3: AMLS supports two types of datasets. Since the .CSV file contains tabular data, it’s registered as a tabular dataset. (You can perform the basic statistical operations and visualizations after registering as a tabular dataset.)
Step 4: Now it’s the time for the real job. Since am more familiar with Python and sci-kit learn, I used those languages and libraries to develop my model. The whole coding part has been done on a Linux machine using my favorite VSCode IDE. 😉 You may wonder how I’m going to connect the code base on my local machine with the cloud… Here’s the place where AzureML python SDK comes to the rescue.
Step 5: I don’t have enough computation power to do the model training on my machine. So that, I use an Azure compute cluster to perform the computation. (In my experiment I did hyperparameter tuning to select the best parameters. Using the compute cluster allowed me to perform parallel training)
Step 6: After model training and getting the desired inference accuracy, I had the need of exposing the binary classification model as a web service. For that, I used Azure Container Instance (ACI) since this is going to be a small testing experiment. (I may have to go for an Azure Kubernetes Services (AKS) if I wanna go for global massive deployment)
Yp! It’s just a simple 6 step process. Complex? Don’t worry, I’m going to walk you through the whole process assisted with the code snippets in the upcoming blog posts. Stay tuned. Let’s start a real experiment with Azure Machine Learning Service.
Designing and implementing predictive experiments requires an understanding about the problem domain as well as the knowledge on machine learning algorithms and methodologies. Extensive knowledge on programming is a necessity when it comes to real-world machine learning model training and implementations.
Automated machine learning is capable of training and tuning a machine learning model for a given dataset and specified target metrics by selecting the appropriate algorithms and parameters by its own. Azure Machine Learning offers a user-friendly wizard-like Automated ML feature for training and implementing predictive models without giving you the burden of algorithm and hyperparameter selection.
Azure Automated ML comes handy, where you are able to implement a complete machine learning pipeline without a single line of coding. It saves a time and compute resources since the model tuning is done by following data science best practices.
Azure machine learning currently supports three types of machine learning user cases in their AutomatedML pipeline.
1. Classification – To predict one of several categories in the target column
2. Time series forecasting – To predict values based on time
3. Regression – To predict continuous numeric values
Let’s go through the step by step process of developing a machine learning experiment pipeline with Azure Automated ML.
01. CREATE AN AZURE MACHINE LEARNING WORKSPACE
Azure Machine Learning Workspace is the resource you create on Azure to perform all machine learning related activities on the cloud. The steps are straight forward same as creating any other Azure resource. Make sure you create the Workspace edition is ‘Enterprise’ since AutoML is not available in the basic edition.
02. CREATE AUTOMATEDML EXPERIMENT
ml.azure.com web interface is the one stop portal for accessing all the tools and services related to machine learning on Azure. You have to create a new Automated ML run by selecting Automated ML on the Author section of the left pane.
03. SELECT DATASET
As of now, AutomatedML supports tabular data formats only. You can upload your dataset from the local storage, import from a registered datastore, fetch from a web file or else retrieve from Azure open datasets.
04. CONFIGURE RUN
In this section you have to specify the target column of the experiment. If it’s a classification task this should be the column that indicates the class values and if it’s regression that’s the column where the numerical value to be predicted. Select a training cluster where the experiments going to run. Make sure you select a cluster that is enough for the complexity of the dataset you provided.
05. SELECT TASK TYPE AND SETTINGS
Select the task type that is appropriate for the dataset you selected. If you have textual data in your dataset you can enable deep learning (which is in preview) to extract the features.
In the settings of the run, you can specify the evaluation metrics, any algorithms that you are don’t want to use, validation type, exit criterion etc. for the experiment. If you wish to select only a specific set of features in the provided dataset you can configure that through the settings.
Running the experiment may take some time depending on the complexity of the dataset, algorithms you use and the exit criterion you used.
When the run is completed AzureML provides a summary of the run by indicating the best performing algorithm. You can directly deploy or download the best performing model as a .pkl file from the portal.
Deployment comes as a REST API which runs on an Azure Kubernetes Service (AKS) or Azure Container Instance (ACI).
AutomatedML comes handy when you need to do fast prototyping for a specific set of data and supports the agile process of intelligent application development. Will look on the other tools and features we have on Azure AI stack in the coming articles.
One of the most time-consuming tasks in the machine learning model development pipeline is data labeling. When it comes to a computer vision related task which may use deep learning methodologies, you may need thousands of labelled data to train your models.
Azure Machine Learning offers a new feature for data labeling tasks specifically designed for computer vision related applications. Right now, AzureML supports 3 types of data labeling tasks.
Image Classification multi-label – If the images in the image set is having more than one label for the image, this is the task type to go with.
Image Classification multi-class – This is the simple image classification type tasks, where each image is having a single label and the dataset is having multiple labels
Object Identification (Bounding Box) – If you need to have annotations for set of images to train a model for object detection, you may have to have bounding box annotations. This is the task type you should choose for such tasks.
Data labeling feature is available in both basic and enterprise versions of AzureML. Despite Basic version is not having the capability of ML assisted data labeling where a ML model is automatically built to assist the labeling process. ML assisted data labeling need a GPU based compute resource for performing model training and inferencing. (Obviously this comes with a cost then 😉 )
There are two ways that you can add the images to build the dataset.
Upload the image files into an Azure blob storage and register it as a DataStore in Azure ML (I’d recommend this since it may bypass the storage restrictions of the default storage)
Upload directly for the default storage
One of the attractive features I’ve seen in the data labeling process is the ability to use keyboard shortcuts which makes the process much more user friendly.
The annotation files can be exported as a JSON which follows the COCO dataset format (This file is saved in the default blob storage of the experiment) or else can be registered as an Azure ML dataset. The progress of the data labeling project can be monitored through the dashboard on AzureML Studio.
Seems like Microsoft is having more plans on developing this product further and hope there would be interesting additions in the coming future.
Being one of the major public cloud providers, Microsoft Azure provides numarous products, services and tools for intelligent application development. This is a high level guideline to select the appropriate product for your application development.
I have just pinpointed the most used tools and services here. The services can be interconnected with each other in order to develop applications for more complex use cases.
Early concepts of machine learning came out back in 1950s but people were not able to explore the full power since most of the machine learning algorithms were largely computationally expensive. The computing power of the early systems were not enough to process large amounts of data using complex algorithms. Since cloud computing and GPUs opened the arena to perform complex computations, machine learning and deep learning got a boost and now used widely in many real-world applications.
As we discussed in the previous posts, being a leading cloud-based machine learning platform, Azure Machine Learning solves three main burdens in machine learning model development and deployment process.
Setting up development environment with the burden of solving platform and software library dependencies.
Setting up the computing environments (parallel processing libraries (CUDA) etc.)
Setting and managing deployments
All of these 3 key areas in machine learning model development require some sort of a computing resource to create and manage. Azure machine learning service has centralized all the resources to a easy accessible portal allowing the developer to select the most suitable resource for their need.
Compute tab of the AML studio contains (as of the date am writing this) 4 main compute categories for specific purposes. Will go through each of those and see in where we can occupy them in our machine learning experiments.
Compute Instances –
No need of messing around with configuring CUDA and all python packages to set up the laptop for performing machine learning experiments or the data visualization. You can just go through few steps in a wizard to create a preconfigured computing instance on Azure. This is more similar for creating a virtual machine on Azure. If you need to use GPU based computing, you may have to select a N-series VM on Azure. (Make sure the region you selected is having the required VM families)
In order to do the experiments on the compute instance you can either use JupyterLab, Jupyter notebook, RStudio or through an SSH connection. You have to create a SSH public key on Azure to access the compute instance through SSH.
GPU based compute instances are costly. Create such instances if you really need to do deep learning based experiments.
Think of a complex deep learning scenario where data preprocessing need large amount of CPU processing time while model training should be done using GPUs… You can use two compute instances where preprocessing happens in a CPU based instance while the GPU based expensive compute instance is used for model training. (Connecting these two processes can be done using Azure machine learning pipelines)
Make sure to deallocate the resources when you are not using it. (Else you should have a fat wallet in your pocket)
Training clusters –
Training clusters in Azure ML is the survivor when we are having complex computations to perform. You can perform tasks as Automated ML and hyperparameter tuning on these preconfigured clusters. The maximum number of nodes can be configured according to your need. Underlying technology behind the training clusters is docker containers. Simply you containerize your experiment and push into the cluster for computations/training.
Inference clusters –
The end result of the experiments you perform sits on inference clusters. The web service endpoints you create can be deployed on this AKS based inference clusters. You can go for a low-cost inference cluster with few nodes for dev-test and a high performing cluster with many nodes according to the requirements in the production environment. (normally we use ACI for dev-test and AKS for production web service end points)
Attached compute –
This is an interesting feature in Azure machine learning where you can push your machine learning workloads into external computing environments. Right now, AML is supporting
Data Lake Analytics
Your VM should be running Ubuntu in order to connect as an attached compute.
Will discuss how to use these computing resources in your machine learning experiments in the future posts.
Being the fundamental and the most vital factor in any machine learning experiment, the way of handling data in your experiments is crucial. Here we going to discuss different ways of managing your data sources inside Azure Machine Learning (AML).
Since the new Azure Machine Learning Service is becoming the one-stop place for managing all ML related workloads in Azure, the functions and methods can be created/managed using the web portal or using AzureML python SDK (You may use AzureML R SDK or the Azure CLI too)
Data comes in all shapes and sizes. In order to tackle these different data scenarios AML offers different options to manage the data. Let’s discuss these options one by one with their usages, pros and cons.
Datastore is the place where the data sits in an AML experiment. Your AML workspace can have one or more Datastores connected according to your need.
AML is all about cloud-based machine learning. So that, I would recommend using some sort of Azure based storage to store your data in the first hand. Blob storage, File Share, Data Lake Storage, Azure SQL database, Azure PostgreSQL, Azure DB for MySQL and Databricks file system are currently supported data storage types to create Datastores. (Will say your data is at on-prem SQL database. You can use Azure Data Factory to migrate your data load onto Azure.)
You can see the Datastores registered to your workspace either through AML Studio (ml.azure.com) or through the python SDK. When you create a workspace, by default two Datastores are created : workspacefilestore and workspaceblobstore.
Workspaceblobstore acts as the default Datastore of experiments. You can change it at any time through the SDK. This is the place where all your code and other files you put in the experiment sits.
Is it a good idea to keep your data on the workspaceblobstore?
Scenario #1 : You are doing a toy experiment with a small dataset (Eg. a 2 MB CSV file). Dataset is static and no plan on updating that during the experiment. Yes! That’s completely ok to keep the dataset inside workspaceblobstore.
Scenario #2 : You are doing a deep learning experiment with a 100,000 images. No! Never use workspaceblobstore to keep your data.
Why not the workspaceblobstore always?
Workspaceblobstore is having a file and storage limitation (300MB and/or 2000 files). So that it’s impossible to use it when we have a large dataset. On the other hand, this directly affects for the docker image size (or the snapshot size) you may create for the experiment. Bulky snapshots or the docker images are not a good thing. Always keep it simple and modularized. So always the workspaceblobstore is a no go!
AML datasets is the high-level abstraction of the data you use in experiments. You may create an AML dataset from,
A local file / local files
Registered datastore (from file(s) sit on a datastore)
Azure Open Datasets
The AML datasets we create may belong to two main types:
Tabular datasets – If you have a file/ files that contains data in a tabular format (CSV, JSON line files, Parquet files, Tabular data in SQL databases etc.) creating a tabular dataset would be beneficial as it allows to transform data into a Pandas or Spark DataFrame.
File datasets – Refer to a single or multiple file in your Datastore or on a public URL. File datasets comes handy when you have a scenario like a dataset with 1K images.
AML datasets comes with the advantage of versioning and tracking as well as monitoring. It’s not a hard thing to create perform data drift detection or a simple statistical analysis on data fields of the dataset with a few clicks.
Microsoft recommends to use AML datasets always in experiments rather than pointing for the datastore directly (which is totally possible). I’ve found out pros and cons in both approaches.
AML datasets are easy to version and manage compared to datastore.
If you have a tabular dataset, I would always recommend to go for AML tabular datasets.
It becomes tricky when you have files. If you use File datasets, you have to use to_path() method to get the list of file paths defined by the dataset. This comes as a flat list! If you are not concerned about the directory structure of the data this is totally fine. But if you wish to create custom dataloaders (Eg. PyTorch custom dataloaders which allows to differentiate classes according to directories) this may not come handy. (You can do a workaround by processing the file path to determine the directory structure though 😀 )
Keep in mind that AML dataset mount() only works for unix-like OS. If you wish to run your experiment on a windows running workstation you may have to download() the dataset.
Will discuss on using these datasets in different model training scenarios in future posts.
There are just some of the experiences I had when playing around with new Azure Machine Learning. The Microsoft Learning GitHub repo (https://github.com/MicrosoftLearning/DP100) for DP100 exam is a really nice place to find some example code on using these functionalities. Let me know your find outs and experiences with AML too 😊
Out of all the public cloud platforms, Microsoft Azure has adopted all most all the steps in machine learning life cycle into cloud. Though the resources and the abilities are there, sometimes finding the correct cloud-based product or service to adopt for your solution might be a problem.
Providing a perfect answer for that issue, Azure has come up with the whole new Azure Machine Learning Studio which is in the preview by the time am blogging this! (Don’t get confused with the AzureML Studio, the drag and drop interface we had before. This is a new thing – ml.azure.com ) There’s no framework dependency or restrictions for using these services. You can easily adapt your open source machine learning code base (may be written with Python, sci-kit learn, TensorFlow, PyTorch, Keras… anything)
In order to use this one-stop solution in Azure you may have to create an Azure Machine Learning Service from Azure portal. Then it’ll direct you for the new interface. You can either go for the Enterprise pricing tier or for the Basic tier. In basic tier you won’t get the visual designer and Automated ML features.
Let’s go through each and every tab we got in the side pane in the latest release and see what can we do with them.
These are fully managed Jupyter notebook instances on cloud. These notebook servers are running on top of a new VM instance type called “notebookVM”. There notebookVMs are fully configured work environments to do your machine learning and data science tasks. No need to worry about installing all the python packages and its dependencies. All are already there! You have the privilege to change the notebook sizes (Yes GPU enabled VMs are also there) or install new packages through a python package manager too.
Automated ML –
Not available in the free tier though. A process of selecting the best suited algorithm for the dataset you are having. Right now, this is supporting classification, regression and time series forecasting for tabular data formats. Not supported for deep learning based computer vision applications. Deep learning based text analysis is also in preview. The Automated ML process runs a set of machine learning algorithms on top of your provided data and see which one gives the best accuracy metric. Good for building prototypes and even in some cases might be in production.
This is the evolution of Azure ML Studio (Old drag and drop thing). Seems like Microsoft is going to end its lifecycle and give the new Designer be its replacement. Here you can build the complete ML workflow by dragging and dropping modules. If you want can integrate R or python scripts in the experiment. The machine learning service endpoint can be exposed through a Azure Kubernetes Service (AKS) deployment.
The place to manage and version your datasets. Datasets can be either tabular or file based. Here you can profile your dataset by performing a basic statistical analysis on your data. If your dataset is sitting on a datastore (which we going to discuss later) this acts as a high-level encapsulation of that data.
You may execute several runs on the same experiment with different configurations. This is the place where you can see all the log files of them and compare the runs with each other.
Don’t get confused Azure machine learning pipelines with the Azure pipelines. Azure ML pipelines are specifically designed for MLOps tasks. You can manage the whole experiment process till production using ML pipelines. These pipelines are reusable and help collaborative development of the solution.
You can register the trained ML models here. Versioning the models, managing which model to go for production are some use cases of this model registry. You can register models that has been trained outside the particular Azure ML workspace too here.
Endpoint of an azure machine learning experiment can be a web service or an IoT module endpoint. Managing the endpoint keys etc. are performed in this section.
In most of the cases, you may use Azure for computations. Here in the Compute section you can create and manage the following compute resource types.
Notebook VMs – as we discussed previously in the Notebooks section this is a fully managed ML development environment suits for development and prototyping purposes.
Training clusters – You can make a either a CPU based or a GPU based cluster for running your experiments. Note that this would be charged according to the computation hours as well as for the number of nodes you are using. Good thing is there’s no charge when you are not using the cluster for computation.
Inference clusters – This talk about AKS clusters where you can deploy the endpoints. Even you can register a prevailing AKS cluster as an inference cluster.
Attached compute – If you working with Azure Databricks, Data Lake Analytics or HDInsight you can configure the computation here. In an interesting use case if you want to attach your physical computer (which should be a workstation running Ubuntu) as a compute target it’s also possible through the AML service.
When it comes to machine learning experiments its normal to have large amount of data. These data may sit in your Azure storage. Datastore is the storage abstraction over an Azure storage account which then you can use inside your machine learning experiments.
Data labeling –
A cool new feature for data annotators. Right now, this supports Image classification in multi-label / multi-class and object identification (bounding box) annotations. The annotator should not have to have an Azure subscription. You can easily outsource your tedious annotation workload through this feature.
This is just an overview of the options we are having with new Azure Machine Learning Studio. It’s pretty clear that Azure team is going to get all the ML related services under one umbrella. Let’s discuss some cool use cases and tips on using these services in next blog posts.
The ideas for neural networks go back to the 1940s. The essential concept is that a network of artificial neurons built out of interconnected threshold switches can learn to recognize patterns in the same way that an animal brain and nervous system does.
Though the name “neural network” gives an idea of a ‘black box’ type predictive operation; ANN is a set of mathematical operations.
As the name implies by itself; neural network is a structural ‘network’. The nodes of the neural network are organized in layers and the nodes are connected with each other with edges. The edges are directional and they are weighted.
Azure Machine Learning Studio comes with pre-built neural network modules that can easily use for predictive analytics.
Pre-built neural networks in AML Studio
Multiclass Neural Network Module –
Used for multiclass classification problems. The number of hidden nodes, the learning date, number of learning iterations and many parameters can be changed easily by changing the module properties.
Two-Class Neural Network –
Ideal for binary classification problems. Same as the Multiclass neural network module, the properties of the neural network can be changed by the module properties.
Neural Network regression –
This is a supervised machine learning method that can be used to predict a numerical value.
These simple pre-built modules can be added to your ML experiment with just a drag and drop and change the parameters by changing the module properties. What you going to do if you want to implement a complex neural network architecture? Or to create a deep neural network with more hidden layers?
AzureML Studio comes handy here with providing you the ability to define the hidden layer/layers of the ANN with a script. Net# scripting language provide the ability to define almost any neural network architecture in an easy to read format.
Net# scripting language is able to
Create hidden layers and control the number of nodes in each layer.
Specify how layers are to be connected to each other.
Define special connectivity structures, such as convolutions and weight sharing bundles.
Specify different activation functions.
In Azure Machine Learning, you can add the Net# scripts by choosing ‘Custom definition script’ in Hidden layer specification property. By default, it would set to the fully connected case.
Net# lexical is more similar to C#. The structure of a Net# script has four main sections.
Constant declaration (Optional) – Define values used elsewhere in the neural network definition
Layer declaration – The input, hidden and output layers are defined with the layer dimensions. The layer declaration for hidden or output layer can include the output function.
Connection declaration – You can define connection bundles (Full, Filtered, Convolutional, Pooling, Response normalization) – Full connection bundle is the default configuration.
Share declaration (Optional) – Defining multiple bundles with shared weights.
This is a simple neural network defined by a Net# script to perform a binary classification. You can customize the number of hidden neurons and the activation functions and see how the accuracy of the model variate.
<!– HTML generated using hilite.me –>
//A simple neural network definition//auto keyword allows the ANN to automatically include all feature columns in the input examples//input layer named Data
input Data auto;
//Hidden layer named "H" including 200 nodes
hidden H  from Data all;
//output layer named "Out" including 2 nodes (binary classification problem) //Sigmoid activation function has been used.
output Out  sigmoid from H all;
Azure Machine Learning Studio allows you to build and deploy predictive machine learning experiments easily with few drags and drops (technically 😉).
The performance of the machine learning models can be evaluated based on number of matrices that are commonly used in machine learning and statistics available through the studio. Evaluation of the supervised machine learning problems such as regression, binary classification and multi-class classification can be done in two ways.
Train-test split evaluation
Train-test evaluation –
In AzureML Studio you can perform train-test evaluation with a simple experiment setup. The ‘Score Model’ module make the predictions for a portion of the original dataset. Normally the dataset is divided into two parts and the majority is used for training while the rest used for testing the trained model.
You can use ‘Split Data’ module to split the data. Choose whether you want a randomized split or not. In most of the cases, randomized split works better. If the dataset is having a periodic distribution for an example a time series data, NEVER use randomized split. Use the regular split.
Stratified split allows you to split the dataset according to the values in the key column. This would make the testing set more unbiased.
Easy to implement and interpret
Less time consuming in execution
If the dataset is small, keeping a portion for testing would be decrease the accuracy of the predictive model.
If the split is not random, the output of the evaluation matrices are inaccurate.
Can cause over-fitted predictive models.
Cross Validation –
Overcome the mentioned pitfalls in train-test split evaluation, cross validation comes handy in evaluating machine learning methods. In cross validation, despite of using a portion of the dataset for generating evaluation matrices, the whole dataset is used to calculate the accuracy of the model.
k-fold cross validation
We split our data into k subsets, and train on k-1 of those subsets. What we do is holding the last subset for test. We’re able to do it for each of the subsets. This is called k-folds cross validation.
More realistic evaluation matrices can be generated.
Reduce the risk of over-fitting models.
May take more time in evaluation because more calculations to be done.
Cross-validation with a parameter sweep –
I would say using ‘Tune model Hyperparameters’ module is the easiest way to identify the best predictive model and then use ‘Cross validate Model’ to check its reliability.
Here in my sample experiment I’ve used the breast cancer dataset available in AzureML Studio that normally use for binary classification.
The dataset consists 683 rows. I used train-test split evaluation as well as cross validation to generate the evaluation matrices. Note that whole dataset has been used to train the model in cross validation case, while train-test split only use 70% of the dataset for training the predictive model.
Two-class neural networks has used as the binary classification algorithm. The parameters are swapped to get the optimal predictive model.
When observing the outputs, the cross-validation evaluation provides that model trained with whole dataset give a mean accuracy of 0.9736 while the train-test evaluation provides an accuracy of 0.985! So, is that mean training with less data has increased the accuracy? Hell no! The evaluation done with cross-validation provides more realistic matrices for the trained model by testing the model with maximum number of data points.
Take-away – Always try to use cross-validation for evaluating predictive models rather than going for a simple train-test split.
You can access the experiment in the Cortana Intelligence Gallery through this link –