Build a Machine Learning Classifier with Azure Machine Learning Service

Azure Machine Learning Service is becoming the one-stop place for managing all ML-related workloads in the Azure cloud. There are two main advantages of using Azure Machine Learning Service for your ML and data science experiments.

#1 – You can manage the whole machine learning workflow in a single environment. From data wrangling to machine learning service deployment, everything is managed on the cloud with reliable, scalable, and efficient services.

#2 – You can use your familiar open-source toolset, languages, and frameworks for model development. As an ML engineer or a data scientist, you may be using Python or R as your main development language. Azure Machine Learning allows you to use any of those languages and frameworks to develop your experiments.

Pima Indians Diabetes Classification is one of the most famous machine learning experiments. It's a binary classification problem which uses a .CSV-based tabular dataset as the input. I'll walk you through the process I went through to perform my experiment.

Scenario:

  • The diabetes dataset is available as a .CSV file in your local file system.
  • I have to build a binary classifier trained with the dataset and deploy it as a web service with a REST endpoint.

Solution:

As shown in the diagram, I used the services and tools in AMLS along with my typical development environment to build up the solution.

  • Step 1: Since the experiment is going to run on the Azure cloud, I transferred my dataset to Azure Blob storage. I used Azure Storage Explorer to upload the dataset to the cloud. (For better performance, make sure the dataset is in a storage blob in the same region as the AMLS experiment.)
  • Step 2: In order to access the data stored in the blob space, it's registered inside AMLS as a datastore.
  • Step 3: AMLS supports two types of datasets. Since the .CSV file contains tabular data, it's registered as a tabular dataset. (You can perform basic statistical operations and visualizations after registering it as a tabular dataset.)
  • Step 4: Now it's time for the real job. Since I'm more familiar with Python and scikit-learn, I used those languages and libraries to develop my model. The whole coding part was done on a Linux machine using my favorite VSCode IDE. 😉 You may wonder how I'm going to connect the code base on my local machine with the cloud… Here's where the AzureML Python SDK comes to the rescue.
  • Step 5: I don't have enough computation power to do the model training on my machine, so I used an Azure compute cluster to perform the computation. (In my experiment I did hyperparameter tuning to select the best parameters; using the compute cluster allowed me to perform parallel training.)
  • Step 6: After training the model and getting the desired inference accuracy, I needed to expose the binary classification model as a web service. For that, I used Azure Container Instances (ACI), since this is going to be a small testing experiment. (I may have to go for Azure Kubernetes Service (AKS) if I want a massive global deployment.) A minimal SDK sketch of steps 2-5 follows this list.
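To give you a feel for how the SDK stitches these steps together, here's a minimal sketch of steps 2-5 using the azureml-core Python SDK. The storage account details and the names ('diabetes_store', 'pima-diabetes', 'cpu-cluster') are placeholders I made up for illustration; swap in your own values.

from azureml.core import Workspace, Datastore, Dataset
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # loads the AMLS workspace from config.json

# Step 2: register the blob container as a datastore
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='diabetes_store',
    container_name='datasets',
    account_name='<storage-account-name>',
    account_key='<storage-account-key>')

# Step 3: register the .CSV file as a tabular dataset
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'diabetes.csv'))
dataset = dataset.register(workspace=ws, name='pima-diabetes')

# Step 5: provision a compute cluster for (parallel) training
compute_config = AmlCompute.provisioning_configuration(
    vm_size='STANDARD_D2_V2', max_nodes=4)
cluster = ComputeTarget.create(ws, 'cpu-cluster', compute_config)
cluster.wait_for_completion(show_output=True)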

Yp! It's just a simple 6-step process. Complex? Don't worry, I'm going to walk you through the whole process, assisted with code snippets, in the upcoming blog posts. Stay tuned. Let's start a real experiment with Azure Machine Learning Service.

One-Hot Encoding in Practice

Data is the king in machine learning. In the process of building machine learning models, data is used as the input features.

Input features come in all shapes and sizes. To build a predictive model with a better accuracy rate, we should understand the data as well as the logic behind the algorithm we're going to use to fit the model.

Data Understanding, the second step of CRISP-DM, guides us in understanding the types of data we have and the way they are represented. We can distinguish three main kinds of data features.

  1. Quantitative features – data with a numerical scale (age of a person in years, price of a house in dollars, etc.)
  2. Ordinal features – data without a scale but with an ordering (ordered sets: first, second, third, etc.)
  3. Categorical features – data with neither a numerical scale nor an ordering. These features don't allow any statistical summary. (Car manufacturer categories, civil status, n-grams in NLP, etc.)

Most machine learning algorithms, such as linear regression, logistic regression, neural networks, and support vector machines, work better with numerical features.

Quantitative features come with a numerical value and can be used directly as input features of ML algorithms (though sometimes preprocessing such as normalization may be needed).

Ordinal features can be easily represented by numbers (e.g., first = 1, second = 2, third = 3 …). This is called integer encoding. Representing ordinal features using numbers makes sense because the ordering between the values can be expressed numerically.
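For instance, here's a tiny sketch of integer encoding with pandas (the 'place' column is made up for illustration):

import pandas as pd

# an ordinal feature with a natural ordering
df = pd.DataFrame({'place': ['first', 'second', 'third', 'second']})
order = {'first': 1, 'second': 2, 'third': 3}
df['place_encoded'] = df['place'].map(order)  # first -> 1, second -> 2, ...
print(df)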

There are some algorithms that can deal directly with joint discrete distributions, such as Markov chains, Naive Bayes, Bayesian networks, and tree-based models. These algorithms can work with categorical data without any encoding, while for other ML algorithms we should encode the categorical features numerically to use them as input features. That means it's better to change the categorical features to numerical ones most of the time 😊

There are some special cases too. For example, while naïve Bayes classification only really handles categorical features, many geometric models go in the other direction by only handling quantitative features.

How to convert categorical data to numerical data?

There are a few ways to convert categorical data to numerical data.

  • Dummy encoding
  • One-hot encoding / one-of-K scheme

These are the most prominent ones.

One-hot encoding is the process of converting categorical features into numerical ones by performing a "binarization" of the categories and including each binary indicator as a feature to train the model.

In mathematics, we can define one-hot encoding as follows:

One-hot encoding transforms a single variable with n observations and d distinct values into d binary variables, each with n observations. Each observation indicates the presence (1) or absence (0) of the corresponding distinct value.

Let's get this clear with an example. Suppose you have a 'flower' feature which can take the values 'daffodil', 'lily', and 'rose'. One-hot encoding converts the 'flower' feature into three binary features: 'is_daffodil', 'is_lily', and 'is_rose'.
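Here's how that looks as a quick sketch with pandas (the data is made up for illustration):

import pandas as pd

df = pd.DataFrame({'flower': ['daffodil', 'lily', 'rose', 'lily']})
# one binary column per distinct value: is_daffodil, is_lily, is_rose
encoded = pd.get_dummies(df['flower'], prefix='is')
print(encoded)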

A common application of OHE is in Natural Language Processing (NLP), where it can be used to turn words into vectors very easily. Here comes a con of OHE: the vector size can get very large with respect to the number of distinct values in the feature column. If there are only two distinct categories in the feature, there's no need to construct additional columns; you can just replace the feature column with a single Boolean column.


OHE in word vector representation
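As a rough sketch of the idea, a tiny vocabulary can be turned into one-hot word vectors like this (the three-word vocabulary is made up; real NLP vocabularies are far larger, which is exactly the con mentioned above):

import numpy as np

vocab = ['news', 'sport', 'tech']                 # d distinct words
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))                    # a d-dimensional vector
    vec[index[word]] = 1                          # 1 at the word's position
    return vec

print(one_hot('sport'))  # [0. 1. 0.]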

You can easily perform one-hot encoding in Azure ML Studio by using the 'Convert to Indicator Values' module. The purpose of this module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model, which is exactly what OHE does. Let's look at performing one-hot encoding using Python in the next article.

Building a News Classifier with Azure ML

Classification is one of the most popular machine learning applications in use. Classifying spam emails, classifying pictures, and classifying news articles into categories are some well-known examples where machine learning classification algorithms are used.

This sample demonstrates how to use multiclass classifiers and feature hashing in Azure ML Studio to classify the BBC news dataset into the appropriate news categories.

The popular BBC news dataset has been used for this experiment. The dataset consists of 2,225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005. The news is classified into five classes: Business, Entertainment, Politics, Sports, and Tech.

The original dataset was downloaded from "Insight Resources". It consisted of 5 directories, each containing the text files of the news articles of a particular category.

The data was converted to a CSV file that fits with ML Studio by running the C# console application below.

using System;
using System.IO;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Root directory: one sub-directory per news category
            string dir = @"D:\Document_Classification\bbc full text\bbc";
            var dirs = Directory.EnumerateDirectories(dir);

            // Write each article as a CSV row: index,content,category
            using (StreamWriter sw = new StreamWriter(dir + @"\BBCNews.csv"))
            {
                int index = 1;
                foreach (var d in dirs)
                {
                    foreach (var file in Directory.EnumerateFiles(d))
                    {
                        Console.WriteLine(file);
                        // strip commas and line breaks so each article fits in one CSV field
                        string content = File.ReadAllText(file)
                            .Replace(',', ' ').Replace('\n', ' ').Replace('\r', ' ');
                        // the sub-directory name is the category label
                        sw.WriteLine((index++) + "," + content + "," + new DirectoryInfo(d).Name);
                    }
                }
            }

            Console.WriteLine("DONE");
            Console.Read();
        }
    }
}

The names of the categories have been used as the class label, or attribute to predict. The CSV file was uploaded to Azure ML Studio to be used for the experiment.

Data Preparation –

The dummy column headings were replaced with meaningful column names using the Metadata Editor module. Missing values were cleared by removing the entire row containing the missing value.

Term frequency-inverse document frequency (TF-IDF) was calculated for each unigram. A bit size of 15 was specified to extract 2^15 = 32,768 hashing features, and the top 5,000 related features were selected for this experiment.

Feature Engineering –
I used the Feature Hashing module to convert the plain text of the articles to integers and used the integer values as input features to the model.
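Outside of ML Studio, the same idea can be sketched with scikit-learn's HashingVectorizer; n_features=2**15 matches the 15-bit hash size used above (the two sample articles are made up for illustration):

from sklearn.feature_extraction.text import HashingVectorizer

# hash article text into 2^15 = 32,768 feature columns
vectorizer = HashingVectorizer(n_features=2**15, alternate_sign=False)
X = vectorizer.fit_transform([
    'stock markets rallied today after the earnings report',
    'the team won the championship match on penalties',
])
print(X.shape)  # (2, 32768), a sparse matrix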

Model

BBC classifier model

Predictive Experiment built on Azure ML Studio

The Multiclass Neural Network module, starting with default parameters, has been used for training the model. The parameters were then tuned using the "Tune Model Hyperparameters" module.
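ML Studio does this through drag-and-drop modules, but an analogous sketch in scikit-learn would pair a multiclass neural network with a grid search over its hyperparameters. The data and the parameter grid below are stand-ins I made up for illustration:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 20))         # stand-in for the hashed article features
y = np.repeat(np.arange(5), 12)  # stand-in for the five news categories

# try a few network sizes and regularization strengths, much like
# "Tune Model Hyperparameters" does for the Multiclass Neural Network module
param_grid = {'hidden_layer_sizes': [(100,), (200,)], 'alpha': [1e-4, 1e-3]}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)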

R script for calculating the TF-IDF matrix –

# Map 1-based optional input ports to variables
dataset <- maml.mapInputPort(1) # class: data.frame
input.dictionary <- maml.mapInputPort(2) # class: data.frame
##################################################
# Determine the following input parameters:-
# minimum length of a word to be included into the dictionary. 
# Exclude any word if its length is less than *minWordLen* characters.
minWordLen <- 3

# maximum length of a word to be included into the dictionary. 
# Exclude any word if its length is greater than *maxWordLen* characters.
maxWordLen <- 25
##################################################

# we assume that the text is the first column in the input data frame
# and the class label is the second column
label_column <- dataset[[2]]
text_column <- dataset[[1]]

# Contents of optional Zip port are in ./src/
source("src/text.preprocessing.R");
data.set <- calculate.TFIDF(text_column, input.dictionary, 
	minWordLen, maxWordLen)
data.set <- cbind(label_column, data.set)

# Select the document unigrams TF-IDF matrix to be sent to the output Dataset port
maml.mapOutputPort("data.set")

R script for creating the word vocabulary –

# Map 1-based optional input ports to variables
dataset <- maml.mapInputPort(1) # class: data.frame
##################################################
# Determine the following input parameters:-
# minimum length of a word to be included into the dictionary. 
# Exclude any word if its length is less than *minWordLen* characters.
minWordLen <- 3

# maximum length of a word to be included into the dictionary. 
# Exclude any word if its length is greater than *maxWordLen* characters.
maxWordLen <- 25

# minimum document frequency of a word to be included into the dictionary. 
# Exclude any word if it appears in less than *minDF* documents.
minDF <- 9

# maximum document frequency of a word to be included into the dictionary. 
# Exclude any word if it appears in greater than *maxDF* documents.
maxDF <- Inf
##################################################
# we assume that the text is the first column in the input data frame
text_column <- dataset[[1]]

# Contents of optional Zip port are in ./src/
source("src/text.preprocessing.R");

# build the vocabulary: each word with its document frequency (DF)
input.voc <- create.vocabulary(text_column, minWordLen, 
	maxWordLen, minDF, maxDF)

# the output dictionary includes each word, its DF and its IDF
data.set <- calculate.IDF(input.voc, minDF, maxDF)

# Select the dictionary to be sent to the output Dataset port
maml.mapOutputPort("data.set")

Results –
All accuracy values were computed using the Evaluate Model module.
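If you re-run the evaluation outside ML Studio, scikit-learn exposes the same metrics; the labels below are illustrative:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ['business', 'tech', 'sport', 'tech', 'politics']
y_pred = ['business', 'tech', 'sport', 'business', 'politics']
print(accuracy_score(y_true, y_pred))    # overall accuracy
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted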

This sample can be deployed as a web service and consumed by a news classification application. But make sure that you are training the model using the appropriate training data.

Here's the confusion matrix that came as the output. Seems pretty good!


Azure Machine Learning provides you the power of the cloud to make complex, time-consuming machine learning problems easier to compute. Build your own predictive model using AML Studio and see how easy it is. 🙂

You can check out the built experiment in Cortana Intelligence Gallery here! 🙂


Citation for the dataset –
D. Greene and P. Cunningham. “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering”, Proc. ICML 2006.