Azure ML Web Services gets a new look

There's a huge buzz around machine learning. What for? Building intelligent apps is one of the dominant uses of machine learning, and a web service is a "language" that every software developer understands. If data scientists can expose their models as web services for the line of devs, the devs will be super excited, because they only have to deal with JSON; not regression algorithms or neural networks! 😀

Azure ML Studio gives you the power to deploy web services easily, with an interface a software developer can understand. Consuming a web service built with Azure Machine Learning is pretty easy, because the portal even provides code samples and the sample JSON that is transferred in and out.
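As a rough sketch of what the developer's side looks like, the snippet below builds the JSON body for a Request-Response call. The endpoint URL, API key, and column name are placeholders; copy the real values and sample payload from your own service's Consume page, since the exact schema comes from your experiment.

```python
import json

# Placeholders -- replace with the values shown on your web service's
# Consume page at services.azureml.net.
URL = "https://ussouthcentral.services.azureml.net/workspaces/<id>/services/<id>/execute?api-version=2.0"
API_KEY = "<your-api-key>"

def build_request(text):
    """Build the JSON body for a Request-Response call; the shape below
    follows the sample payload the portal generates for an experiment
    with a single text input column (assumed here to be named "Text")."""
    return {
        "Inputs": {
            "input1": {
                "ColumnNames": ["Text"],
                "Values": [[text]],
            }
        },
        "GlobalParameters": {},
    }

body = json.dumps(build_request("Stocks rallied after the earnings report."))
print(body)
# To call the service: POST `body` to URL with headers
# {"Authorization": "Bearer " + API_KEY, "Content-Type": "application/json"}
```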

services.azureml.net

Recently, Azure ML Studio came out with a new interface for managing web services. Now it's pretty easy to manage and monitor the behavior of your web services.

Go to your ML Studio. In the web services section, you'll find a new link directing to the "New web services experience". Currently it's in preview.

New web services dashboard

The dashboard shows the performance of the web service you built, including the average execution time. You can even get a glimpse of the monetary cost attached to consuming the web service.

Testing the web services can be done through the new portal. If you want to build a web application that consumes the web service you built, you can head to the Azure web app template that is pre-built for consuming ML web services.

Take a look at http://services.azureml.net and you'll get used to it! 😀


Building a News Classifier with Azure ML

Classification is one of the most popular machine learning applications. Classifying spam mail, classifying pictures, and classifying news articles into categories are some well-known examples where machine learning classification algorithms are used.

This sample demonstrates how to use multiclass classifiers and feature hashing in Azure ML Studio to classify the BBC news dataset into the appropriate news categories.

The popular 2004-2005 BBC news dataset has been used for this experiment. It consists of 2,225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005, classified into five classes: Business, Entertainment, Politics, Sports and Tech.

The original dataset was downloaded from "Insight Resources". It consisted of 5 directories, each containing text files with the news articles of a particular category.

The data was converted to a CSV file that fits ML Studio by running the following C# console application.

using System;
using System.IO;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Directory containing one sub-directory per news category
            string dir = @"D:\Document_Classification\bbc full text\bbc";
            var dirs = Directory.EnumerateDirectories(dir);

            using (StreamWriter sw = new StreamWriter(dir + @"\BBCNews.csv"))
            {
                int index = 1;
                foreach (var d in dirs)
                {
                    // The directory name (e.g. "business") becomes the class label
                    string label = new DirectoryInfo(d).Name;
                    foreach (var file in Directory.EnumerateFiles(d))
                    {
                        Console.WriteLine(file);
                        // Strip commas and line breaks so each article fits on one CSV row
                        string content = File.ReadAllText(file)
                            .Replace(',', ' ')
                            .Replace('\r', ' ')
                            .Replace('\n', ' ');
                        sw.WriteLine((index++) + "," + content + "," + label);
                    }
                }
            }

            Console.WriteLine("DONE");
            Console.Read();
        }
    }
}

The names of the categories have been used as the class label, or attribute to predict. The CSV file was uploaded to Azure ML Studio to be used for the experiment.

Data Preparation –

The dummy column headings were replaced with meaningful column names using the Metadata Editor. Missing values were cleared by removing the entire row containing the missing value.

The term frequency–inverse document frequency (TF-IDF) of each unigram was calculated. A bit size of 15 was specified to extract 2^15 = 32,768 hashing features, and the top 5,000 related features were selected for this experiment.
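As a rough illustration of what that TF-IDF weighting does, here is a toy corpus with a simplified IDF formula; ML Studio's internal implementation may differ in smoothing and normalization.

```python
import math
from collections import Counter

docs = [
    "markets rise as shares rally",
    "new film tops the box office",
    "markets fall as shares slide",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents containing each word
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf(doc):
    """Weight each term by tf * log(N / df); rare terms score higher."""
    tf = Counter(doc)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

weights = tfidf(tokenized[0])
# "rally" appears in only one document, so it outweighs "markets",
# which appears in two.
assert weights["rally"] > weights["markets"]
```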

Feature Engineering –
I used the Feature Hashing module to convert the plain text of the articles to integers and used the integer values as input features to the model.
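The hashing trick itself can be sketched in a few lines. Here `zlib.crc32` stands in for whatever hash function the Feature Hashing module uses internally; the point is that every word is mapped into one of 2^15 fixed buckets, so the feature space has a known size regardless of vocabulary.

```python
import zlib

N_BITS = 15
N_FEATURES = 2 ** N_BITS  # 32,768 buckets, matching the 15-bit setting

def hash_features(text):
    """Map a document to a sparse {bucket: count} dict."""
    features = {}
    for word in text.lower().split():
        # A stable hash (unlike Python's salted hash()) keeps bucket
        # assignments consistent across runs.
        bucket = zlib.crc32(word.encode("utf-8")) % N_FEATURES
        features[bucket] = features.get(bucket, 0) + 1
    return features

vec = hash_features("Shares rally as markets rise rise")
print(len(vec), sum(vec.values()))
```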

Model

BBC classifier model

Predictive Experiment built on Azure ML Studio

The Multiclass Neural Network module with default parameters has been used for training the model. The parameters were then tuned using the "Tune Model Hyperparameters" module.
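Conceptually, "Tune Model Hyperparameters" performs a sweep like the following. The parameter grid and the scoring function below are made up purely for illustration; in the real module, scoring means training the network on each combination and measuring validation accuracy.

```python
import itertools

# Hypothetical search space, loosely mirroring what a neural network
# sweep might cover (not the module's actual parameter names).
grid = {
    "hidden_nodes": [50, 100, 200],
    "learning_rate": [0.01, 0.1],
}

def validate(params):
    # Stand-in for "train the model and return validation accuracy";
    # a made-up score so this sketch runs on its own.
    return 1.0 / (abs(params["hidden_nodes"] - 100) + 1) + params["learning_rate"]

# Try every combination and keep the one with the best validation score
best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=validate,
)
print(best)
```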

R script for calculating TF-IDF using the word vocabulary –

# Map 1-based optional input ports to variables
dataset <- maml.mapInputPort(1) # class: data.frame
input.dictionary <- maml.mapInputPort(2) # class: data.frame
##################################################
# Determine the following input parameters:-
# minimum length of a word to be included into the dictionary. 
# Exclude any word if its length is less than *minWordLen* characters.
minWordLen <- 3

# maximum length of a word to be included into the dictionary. 
# Exclude any word if its length is greater than *maxWordLen* characters.
maxWordLen <- 25
##################################################

# we assume that the text is the first column in the input data frame
label_column <- dataset[[2]]
text_column <- dataset[[1]]

# Contents of optional Zip port are in ./src/
source("src/text.preprocessing.R");
data.set <- calculate.TFIDF(text_column, input.dictionary, 
	minWordLen, maxWordLen)
data.set <- cbind(label_column, data.set)

# Select the document unigrams TF-IDF matrix to be sent to the output Dataset port
maml.mapOutputPort("data.set")

R script for creating the word vocabulary –

# Map 1-based optional input ports to variables
dataset <- maml.mapInputPort(1) # class: data.frame
##################################################
# Determine the following input parameters:-
# minimum length of a word to be included into the dictionary. 
# Exclude any word if its length is less than *minWordLen* characters.
minWordLen <- 3

# maximum length of a word to be included into the dictionary. 
# Exclude any word if its length is greater than *maxWordLen* characters.
maxWordLen <- 25

# minimum document frequency of a word to be included into the dictionary. 
# Exclude any word if it appears in less than *minDF* documents.
minDF <- 9

# maximum document frequency of a word to be included into the dictionary. 
# Exclude any word if it appears in greater than *maxDF* documents.
maxDF <- Inf
##################################################
# we assume that the text is the first column in the input data frame
text_column <- dataset[[1]]

# Contents of optional Zip port are in ./src/
source("src/text.preprocessing.R");

# build the vocabulary of words that pass the length and DF filters
input.voc <- create.vocabulary(text_column, minWordLen, 
	maxWordLen, minDF, maxDF)

# the output dictionary includes each word, its DF and its IDF 
data.set <- calculate.IDF (input.voc, minDF, maxDF)

# Select the dictionary to be sent to the output Dataset port
maml.mapOutputPort("data.set")
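The length and document-frequency filters in the script above amount to something like the following. This is a toy corpus with `minDF` lowered to 2 so the example has surviving words; the experiment itself uses `minDF = 9` on the full 2,225-document corpus.

```python
from collections import Counter

minWordLen, maxWordLen = 3, 25
minDF, maxDF = 2, float("inf")

docs = [
    "shares rally on strong results",
    "shares slide on weak results",
    "film wins top award",
]

# Document frequency over words that pass the length filter
# ("on" is dropped because it is shorter than minWordLen)
df = Counter(
    w
    for doc in docs
    for w in set(doc.split())
    if minWordLen <= len(w) <= maxWordLen
)

# Keep only words whose document frequency falls in [minDF, maxDF]
vocabulary = {w: n for w, n in df.items() if minDF <= n <= maxDF}
print(sorted(vocabulary))
```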

Results –
All accuracy values were computed using the Evaluate Model module.
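For reference, a confusion matrix and overall accuracy can be computed from scored labels like this. The true/predicted pairs below are toy values, not the experiment's actual output.

```python
from collections import Counter

# Toy actual/predicted class labels standing in for the scored dataset
actual    = ["business", "sport", "tech", "sport", "politics", "business"]
predicted = ["business", "sport", "tech", "tech",  "politics", "business"]

# Confusion matrix as {(actual, predicted): count}; off-diagonal
# entries are misclassifications
confusion = Counter(zip(actual, predicted))

# Accuracy = correctly classified rows / total rows
accuracy = sum(n for (a, p), n in confusion.items() if a == p) / len(actual)
print(accuracy)
```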

This sample can be deployed as a web service and consumed from a news classification application. But make sure that you train the model using appropriate training data.

Here's the confusion matrix that came out as the output. Seems pretty good!

Confusion matrix of the predictive experiment

Azure Machine Learning gives you the power of the cloud to make complex, time-consuming machine learning problems easier to compute. Build your own predictive model using AML Studio and see how easy it is. 🙂

You can check out the built experiment in the Cortana Intelligence Gallery here! 🙂

Citation for the dataset –
D. Greene and P. Cunningham. “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering”, Proc. ICML 2006.