How to Streamline Machine Learning/ Data Science Projects?

CRISP-DM (image from Wikipedia)

When it comes to designing, developing, and implementing a project related to data mining, machine learning, or deep learning, it is always better to follow a framework to streamline the project flow.

It is OK to adapt a software development framework such as Scrum or the waterfall method to manage an ML-related project, but I feel that a more streamlined process which pays attention to data is an advantage for the success of such a project.

To my understanding, there are two variations of ML-related projects.

  1. Solely machine learning/ data science based projects
  2. Software development projects where ML-related services are a subcomponent of the main project.

The step-by-step process I'm explaining can be used in both of these variations, with your own additions and modifications.

Basically, this is what I do when an ML-related project lands in my hands.

I follow the steps of a good old standard process known as the Cross-Industry Standard Process for Data Mining (CRISP-DM) to streamline the project flow. Let's go step by step.

Step 1 : Business understanding

First, you have to identify the problem you are going to address with the project. Then you have to be open-minded and answer the following questions.

  1. What is the current situation of this project? (Is it already using some conventional algorithm to solve the problem, etc.?)
  2. Do we really need to use machine learning to solve this problem? (Using ML or deep learning for some problems may be over-engineering. Check whether it is essential to use ML for the project.)
  3. What is the benefit of implementing the project? (ML projects are quite expensive and resource-hungry. Make sure you get a sufficient RoI from the implementation.)
  4. What are the constraints, limitations, and risks? (It's always better to do a risk assessment prior to the project. The data you have to use may have compliance issues. Look into those aspects for sure!)
  5. What tools and techniques am I going to use? (It may be a bit hard to determine the full tech stack before dipping your feet into the project, but it is good to have even a rough idea of the tools, platforms, and services you are going to use for development and implementation. DON'T forget the implementation phase. You may end up with a pretty cool development that is hard to integrate with the desired application. So make sure you know your tool-set first.)

Tip : If you feel like you don't have experience with this phase, never hesitate to discuss it with peers and experts in the field. They may come up with easy shortcuts and techniques to make your project a success.

Step 2 : Data understanding

Data is the most vital part of any data science/ML-related project. When it comes to understanding the data, I prefer answering these questions.

  1. How big or small is the data? (Sometimes training deep learning models needs a lot of annotated data, which is hard to find.)
  2. How credible and accurate is the data?
  3. What is the distribution of the data?
  4. What are the key attributes and what are the not-so-important attributes in the data?
  5. How has the data been stored? (Data comes as CSVs, JSONs, flat files, etc.)
  6. What does a simple statistical analysis of the data show?

Before digging into the main problem, you can save a lot of time by taking a closer look at the data you have or are going to get. A minimal inspection like the sketch below is usually enough to start with.
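
As a quick illustration, here is a minimal first-pass inspection sketch in pandas; the file name and columns are placeholders for whatever dataset you actually receive.

import pandas as pd

df = pd.read_csv("my_dataset.csv")   # placeholder for your dataset

print(df.shape)          # how big/small the data is
print(df.dtypes)         # how each attribute is stored
print(df.head())         # a quick peek at a few records
print(df.describe())     # simple statistical summary (mean, std, quartiles)
print(df.isna().sum())   # missing values per column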

Step 3 : Data preparation

To be honest, this step takes 80% of the total project time most of the time. Data that we find in the real world is not clean or in perfect shape. Perfectly cleaned and pre-processed data will save a lot of time in later stages. Make sure you follow the correct methodologies for data cleansing. This step may include tasks such as writing data loaders for your data. Make sure to document the data preparation steps you applied to the original dataset; otherwise you may get confused in later stages.
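
Here is a hedged sketch of the kind of cleansing I mean, using pandas; the file and column names are hypothetical, and the right imputation and outlier rules always depend on your data.

import pandas as pd

df = pd.read_csv("raw_data.csv")                              # hypothetical raw dataset

df = df.drop_duplicates()                                     # remove duplicated records
df["price"] = pd.to_numeric(df["price"], errors="coerce")     # coerce wrongly typed values to NaN
df["price"] = df["price"].fillna(df["price"].median())        # impute missing values
df = df[df["price"] <= df["price"].quantile(0.99)]            # clip obvious outliers

df.to_csv("cleaned_data.csv", index=False)                    # keep the prepared dataset (and document the steps!)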

Step 4 : Modelling

This is the step where you actually make use of machine learning algorithms and related approaches. What I normally do is access the data and try some simple modelling techniques to interpret the data I have. For example, say I have a set of images to be classified using an artificial neural network-based classifier. I'd first use a simple neural network with one or two hidden layers and see if the problem formulation and modelling strategy make any sense. If that's successful, I'll move on to more complex approaches.
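
A minimal baseline sketch of that idea, assuming scikit-learn; the digits dataset stands in for whatever image data you actually have.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A small one-hidden-layer network as a sanity check before anything complex
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

baseline = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))

If the baseline makes sense, move on to deeper networks or other, more complex approaches.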

Tip : NEVER forget documentation! Your project may grow exponentially with thousands of lines of code, and you may try hundreds of modelling techniques to get the best accuracy. So keep clear documentation of what you did to make sure you can roll back and see what you have done before.

Step 5 : Evaluation

Evaluating the models we developed is essential to determine whether we have done the right thing. As with software review processes, I prefer having a set framework to evaluate ML projects. Make sure to select appropriate evaluation metrics; some may not reflect the real behaviour of the models you build.

When performing an ML model evaluation, I plan ahead and define a set structure for the evaluation report. It makes it easy to compare the results across different parameter changes of a single model.

In most cases, we neglect the execution or inference time when evaluating ML models. These can be vital factors in some applications, so plan your evaluation wisely.
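
A hedged evaluation sketch along those lines, reusing the baseline model and test split from the modelling sketch above: report more than one metric and measure the inference time as well.

import time
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

start = time.perf_counter()
y_pred = baseline.predict(X_test)
elapsed = time.perf_counter() - start

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Macro F1 :", f1_score(y_test, y_pred, average="macro"))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(f"Inference time for {len(X_test)} samples: {elapsed:.4f} s")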

Step 6 : Deployment & Maintenance

Deployment is everything! If the deployment fails in production, there's no value in all the model development work you did.

You should select the technologies and approaches used to deliver the ML services (as REST web services, on Kubernetes, as container instances, etc.). I personally prefer containerising since it's neat and clean. The deployed models should be monitored regularly. Predictions can deviate over time, and sometimes the data distribution changes. Make sure you create a robust monitoring plan beforehand.
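
As a minimal sketch of the "model behind a REST endpoint" idea (assuming Flask and a scikit-learn model pickled to model.pkl; both are illustrative choices, not a prescription), the service below could be containerised and monitored like any other web app.

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:               # model saved during the modelling step
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]    # expects {"features": [[...], ...]}
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)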

Tip : What about the health of the published web endpoints or the capacity of the inference clusters you are using? Yes! Make sure you monitor the infrastructure too.

https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview

This is just a high-level guideline that you can follow for streamlining data science/machine learning related tasks. It is an iterative process, and there are no hard-and-fast rules saying you MUST follow these steps. Microsoft has introduced the Team Data Science Process (TDSP), adapting and improving this concept with their own tool-sets.

Key takeaway : Please don't follow cowboy coding for machine learning/data science projects! Having a streamlined process is always better! 🙂

Mission Plan for Building a Predictive Model

When it comes to a machine learning or data science related problem, the most difficult part is finding the best approach to cope with the task: simply getting an idea of where to start!

The Cross-Industry Standard Process for Data Mining, commonly known by its acronym CRISP-DM, is a data mining process model that describes commonly used approaches data mining experts use to tackle problems. This process can easily be adopted for developing machine learning based predictive models as well.

CRISP-DM process diagram

It doesn't matter which tools, IDEs, or languages you use for the process; you can adapt your tools according to the requirements you have.

Let’s walk through each step of the CRISP-DM model to see how it can be adopted for building machine learning models.

Business Understanding –

This is the step where you need technical know-how as well as a little bit of knowledge about the problem domain. You should have a clear idea of what you are going to build and what the functional value of the prediction you are supposed to make through the model would be. You can use Decision Model & Notation (https://en.wikipedia.org/wiki/Decision_Model_and_Notation) to describe the business need of the predictive model. Sometimes, the business need you have might be solvable using simple statistics rather than going for a machine learning model.

Identifying the data sources is a task you should do in this step. You should check whether the data sources are reliable, legal, and ethical to use in your application.

Data Understanding –

I would suggest the following steps to get to know your data better.

  1. Data Definition – A detailed description of each data field in the data source. The notation of the data points and the units in which they have been measured are the things you should consider.
  2. Data Visualization – Hundreds or thousands of numerical data points may not give you a clear idea of what the data is about or of its shape. You may be able to find interesting subsets of your data after visualizing it. It's really easy to see clustering patterns or the trending nature of the data in a visualized plot.
  3. Statistical analysis – Starting from simple statistical calculations such as the mean and median, you can calculate the correlation between each data field, which helps you get a good idea of the data distribution. Feature engineering can increase the accuracy of the machine learning model, and a descriptive statistical analysis is a great asset for that. (See the sketch after this list.)
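
A small sketch of those visualization and correlation checks with pandas and matplotlib; the CSV file is a placeholder for your own data source.

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("my_dataset.csv")        # placeholder dataset
numeric = df.select_dtypes("number")      # keep only the numeric fields

print(numeric.describe())                 # mean, median (50%), quartiles
print(numeric.corr())                     # pairwise correlation between fields

pd.plotting.scatter_matrix(numeric, figsize=(8, 8))
plt.show()                                # clusters and trends are easy to spot visually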

For data understanding, the Interactive Data Exploration, Analysis and Reporting tool (IDEAR) can be used without the hassle of doing all the coding from scratch. (I will discuss IDEAR in a separate post soon.)

Data Preparation –

Data preparation takes roughly 80% of the time of the process, implying that it's the most vital part of building predictive models.

This is the phase where you convert the raw data you got from the data sources into the final datasets that you use for building the ML models. Most of the data coming from raw sources like IoT sensors is filled with outliers, missing values, and disruptions. In the data preparation phase, you should follow data preprocessing tasks to make those data fields usable in modeling.

Modeling –

Modeling is the part where the algorithms come onto the scene. You can train and fit your data to a particular predictive model to perform the desired prediction. You may sometimes need to check the math behind the algorithms to select the best one, an algorithm that won't overfit or underfit the model.
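
One simple, hedged way to check for over- or underfitting is to compare the training score with a cross-validated score; the dataset and the deliberately flexible decision tree below are placeholders for your own model.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(max_depth=None, random_state=0)   # very flexible on purpose
model.fit(X, y)

train_score = model.score(X, y)
cv_score = cross_val_score(model, X, y, cv=5).mean()

# A large gap between the two usually indicates overfitting;
# two low scores usually indicate underfitting.
print("Training score :", train_score)
print("5-fold CV score:", cv_score)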

Different modeling methods may need data in different forms, so you may need to go back to the data preparation phase.

Evaluation –

Evaluation is a must before deploying a model. The objective of evaluating the model is to see whether the predictive model meets the business objectives that we figured out at the beginning. The evaluation can be done with many measures, such as accuracy, AUC, etc.

Evaluation may lead you to adjust the parameters of the model, and you might have to choose another algorithm that performs better. Don't expect the machine learning model to be 100% accurate; if it is, it is most probably an overfitted case.
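
As a hedged illustration of the AUC measure mentioned above (the synthetic dataset and logistic regression model are stand-ins for your own experiment):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]       # predicted probability of the positive class

# AUC of 0.5 means random guessing; 1.0 means perfect (and probably overfitted!)
print("AUC:", roc_auc_score(y_test, probs))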

Deployment –

Deployment of the machine learning model is the phase where the client or the end user is going to consume it. In most cases, the predictive model is part of an intelligent application, acting as a service that takes a set of information and gives a prediction as output.

I would suggest deploying the model as a single component, so that it's easy to scale as well as to maintain. APIs and Docker environments are some cool technologies that you can adopt for deploying machine learning models.

CRISP-DM won't do all the magic of getting a perfect model as the output, but it will definitely help you not to end up at a dead end.

Lambda Architecture & Cortana Intelligence Suite solutions

Data processing has become the key part of modern applications. Not only processing the data, but also visualizing data in a meaningful way is vital for making business decisions in an enterprise application.

With the rise of massive data stores and the speed of data generation, effective data processing architectural patterns have become industrial standards.

In the era of big data processing, where data is generated with high volume, variety, velocity, veracity, and value, there are many architectural patterns that industrial applications follow for data processing. Lambda, Kappa, and Zeta are some patterns used for real-time big data processing.

Let's take a look at how the Lambda architecture can be implemented with the products and services that come with the Microsoft Cortana Intelligence Suite.

What is Lambda Architecture?

Lambda architecture is a data processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. Nathan Marz introduced the term Lambda Architecture (LA) for a generic, scalable, and fault-tolerant data processing architecture.

LA contains different layers which handle data using various methodologies during data processing.

The ability to process both batch data and real-time data streams is one of the significant features of the Lambda architecture.

What is Cortana Intelligence Suite?

Cortana Intelligence Suite is Microsoft's umbrella branding for the fully managed business intelligence, big data, and advanced analytics offerings that come with the Azure cloud, enabling businesses to transform data into intelligent actions. So "Cortana" is in the name. Then what? Is this related to the smart assistant that comes with Windows 10? As Microsoft says, Cortana symbolizes the contextual intelligence that the solutions hope to deliver across the entire suite.

Cortana Intelligence Suite comes with services specially designed for the following tasks.

  • Information Management
  • Big Data Stores
  • Machine Learning & Analytics
  • Intelligence
  • Dashboards & Visualizations

How does Cortana Intelligence Suite align with the Lambda architecture?

Cortana Intelligence Suite (CIS) comes with different solutions that can cater to both batch data sources and data streams. It is a significant improvement where you combine traditional batch processing systems and data stream analysis systems.

For example, think of a system that indicates the fuel level, oil level, tire pressure, etc. of a vehicle. The system should have the ability to analyze the data fetched from the IoT sensors in real time as well as to do predictions using the stored batches of data. CIS comes in handy with various approaches to design this system with the Lambda architecture.


Usage of CIS tools for data processing

IoT sensors create hundreds or maybe thousands of data points per second. Handling such data streams and directing them to analytics flows can be done using Event Hubs (https://azure.microsoft.com/en-us/services/event-hubs/). You can use Azure Stream Analytics to get data from Event Hubs into Azure Storage Blobs. Thereafter you can use Azure Data Factory (ADF) to copy data on a scheduled basis from Blobs to Azure Data Lake Store, which can then act as the batch data source. For analyzing and building predictive models on the batch data, HDInsight & Azure Machine Learning are the options you can go with. Azure SQL Data Warehouse can be used to store the analyzed data, and it can be visualized using Power BI. This is the batch data processing line.
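
As a hedged sketch of the ingestion end of that pipeline, here is how a telemetry reading could be pushed to Event Hubs using the azure-eventhub Python package; the connection string, hub name, and payload are placeholders.

import json

from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hub-connection-string>",    # placeholder
    eventhub_name="vehicle-telemetry",           # hypothetical hub name
)

reading = {"fuel_level": 0.62, "oil_level": 0.80, "tire_pressure_psi": 32.5}

batch = producer.create_batch()
batch.add(EventData(json.dumps(reading)))        # one telemetry data point
producer.send_batch(batch)                       # picked up downstream by Stream Analytics, Blobs, ADF, ...
producer.close()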

In the real-time analysis line, you can push the data stream coming from Event Hubs to a Stream Analytics job or to an Azure Machine Learning model. Visualizing the data with Power BI comes in handy here too.

Apart from the components described above for the data processing task, Microsoft Cognitive Services can be used to make the user interaction more human. For example, the Bot Framework and LUIS can be used with the Bing Speech API to provide voice commands for applications, and Cortana skills can be used to enable your app to work with the Cortana assistant.

Democratizing Machine Learning with Cloud

We have already passed the era of gigabytes when it comes to data. The world is talking about terabytes of unstructured data and massive amounts of data points generated from IoT devices and sensors, millions per second. To analyze these heaps of data, we obviously need large computation power and massive storage. Building workhorse machines to fulfil those tremendous workloads would definitely cost a lot. The cloud computing paradigm comes in handy here: the resourcefulness and scalability of the public cloud can be used to perform the large calculations in machine learning algorithms.

Almost all the major public cloud providers in the market come with machine learning services. Cloud Machine Learning services on Google Cloud Platform provide modern machine learning capabilities, with pre-trained models and a service to generate your own tailored models. Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology. IBM Analytics comes with a machine learning platform as part of its cloud data services. Azure Machine Learning Studio is a GUI-based integrated development environment for constructing and operationalizing machine learning workflows on Azure. We discussed a lot about Azure Machine Learning and its applications in practical scenarios in the previous posts.

All the mentioned platforms provide machine learning as a service. Most of them offer pre-built ML algorithms in packages. Simple drag-and-drop interactions and easy deployment have attracted many developers to these tools.

But how would it be if you want to start from scratch? Or if you want to use the power of Graphics Processing Units (GPUs) to run ML algorithms in parallel? Cloud-based virtual machines specifically optimized for computation are one of the best solutions you can consume.

Azure Data Science Virtual Machine (DSVM) –


DSVM in Azure Portal

If you have already used Azure virtual machines for your computation, hosting, or storage tasks, this will not be a new concept for you. The Azure DSVM is specifically optimized for large computations and comes in two flavors: one with Windows and the other with Linux. You can choose the hardware configuration as you wish. Many development environments, programming IDEs, and languages are pre-installed in the VM instances.

My personal favorite here is the Linux DSVM instance. Here I've created a Linux DSVM with the basic configuration. For accessing the VM you can use any tool that can make an SSH call. What I normally do is access the VM using Ubuntu Bash on Windows 10.

GPUs for machine learning –


Configurations of the Linux VM with Nvidia GPU

Many machine learning algorithms currently available can be executed in parallel; large parts of them are embarrassingly parallel. With parallel execution, you can reduce the running time of the algorithms drastically. Data scientists in both industry and academia have been using GPUs for machine learning to make groundbreaking improvements across a variety of applications, including image classification, video analytics, speech recognition, and natural language processing.


GPUs Vs. CPU computing

Especially in deep learning, parallel processing using GPUs can bring a drastic decrease in computation time. Purchasing a deep learning dream machine powered by a CUDA-enabled high-end GPU such as the Nvidia Tesla K80 would cost nearly 6000 dollars! Rather than spending a lot on a machine like that, the most feasible plan is to provision a virtual machine with the specifications we need and pay as we consume.


VM instance price plans

The N-series is a family of Azure Virtual Machines with GPU capabilities that you can use for these kinds of tasks. The N-series features the NVIDIA Tesla accelerated platform as well as NVIDIA GRID 2.0 technology, providing the highest-end graphics support available in the cloud today. Through the Azure portal, you can choose a price plan with the desired configuration for your tasks when provisioning the VM.

Here's my Azure VM specifically configured for deep learning exercises. The machine is powered by a Tesla K80 GPU, which has 4992 cores in it!! I installed Anaconda on it and do my computations using Jupyter notebooks.

Just a hint: stop your VM instance when you are not using it for computation to avoid getting huge unnecessary bills. 😉

No need for a huge wallet! The wise decision is to apply cloud technologies for machine learning.

Simple Linear Regression with Azure ML + Python

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables: one variable, denoted x, is regarded as the predictor, explanatory, or independent variable; the other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Typically, when doing regression analysis, we consider the correlation coefficients of the input variables. Correlation analysis measures the extent to which two variables vary together, including the strength and direction of their relationship.

The linear correlation coefficient (also called the Pearson product-moment correlation coefficient) is a measure of the strength and direction of a linear association between two random variables.

I used the Istanbul Stock Exchange dataset to demonstrate the steps of a simple linear regression prediction. An Azure Machine Learning experiment has been built (get the experiment from here) for the regression model, using the built-in Bayesian Linear Regression algorithm.

The most interesting part comes with Python! 🙂

I've used a Jupyter notebook and fetched the data into that workspace to visualize the dataset and to calculate the coefficient values between each pair of variables. The pearsonr method in the scipy library has been used for that.
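
A minimal sketch of that correlation check; the two lists stand in for a pair of columns fetched from the Istanbul Stock Exchange dataset.

from scipy.stats import pearsonr

x = [1.0, 2.1, 2.9, 4.2, 5.1]    # placeholder values for one index
y = [1.2, 1.9, 3.2, 3.9, 5.0]    # placeholder values for another index

r, p_value = pearsonr(x, y)
print("Pearson correlation coefficient:", r)   # close to +1 or -1 means a strong linear relationship
print("p-value:", p_value)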

Refer to the iPython notebook on Azure Notebooks for the complete Python script and the visualizations.

https://notebooks.azure.com/library/Python%20Visualizations/html/Istanbul%20Stock%20Python%203%20notebook.ipynb

Do run the code on your own. You'll get it for sure!

 

Jupyter Notebook on AzureML

If you are fond of playing with data to dig out its relationships and to plot interesting visualizations, Python is the language you should speak.

Over the years, with strong community support, the Python language has gained dedicated libraries for data analysis and predictive modeling like scikit-learn, TensorFlow, Theano, etc. Even the ultimate IDE in town, Visual Studio, started supporting Python! So, no hesitation: Python is a great choice to make.

You can use many IDEs or even a simple text editor to write your Python files. But Python also comes with a handy web application, the Jupyter notebook, that you can use to write your code and even run it!

Jupyter was born in 2014 as a spin-off project of IPython, a command shell for interactive computing in multiple programming languages, originally developed for Python.

Why Jupyter?

The Jupyter notebook is a very popular tool among data scientists. It is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. "Jupyter" is a loose acronym meaning Julia, Python, and R. One of the most prominent benefits of using a Jupyter notebook is the ability to share the data transformation and visualization steps with your peers.

If you want to run a Jupyter notebook on your local machine, refer to the link below. With a few easy steps, you can have it up and running on your machine.

http://jupyter.readthedocs.io/en/latest/install.html

One of the easiest ways to use Jupyter is running the notebook on Azure. There is no need to have Python or its dependencies installed on your local machine. You can create, edit, and share Jupyter notes using Azure Machine Learning Studio. All the execution happens in the cloud.

Let’s get started!

Access your notebook from the "Notebooks" tab of AzureML Studio. When creating a new notebook, you can select which language and version you want to have in your notebook. Python 2, Python 3, and R are the supported languages right now.

Just like a Jupyter notebook running on your local machine, you get the same IPython interface in your browser.

On the notebook menu bar, you can find the 'Help' menu, which contains a brief user interface tour as well as a list of keyboard shortcuts that you can use to drive the notebook.

Here's a little data mashup I've done using the famous Iris dataset included in Python's sklearn. The .ipynb file is available on my GitHub repo; feel free to download it and play with it. A static HTML page created from the notebook output is also included in the repo.
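
For the flavour of it, here is a small stand-in (not the exact notebook code) that loads the Iris dataset from sklearn and plots two of its features.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

plt.scatter(X[:, 0], X[:, 1], c=y)             # colour the points by species
plt.xlabel(iris.feature_names[0])              # sepal length (cm)
plt.ylabel(iris.feature_names[1])              # sepal width (cm)
plt.title("Iris dataset: sepal length vs. sepal width")
plt.show()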

Azure is coming up with the Azure Notebooks preview feature. Here's the Iris visualization hosted on Azure Notebooks:

https://notebooks.azure.com/library/Python%20Visualizations/html/Iris+Data+Visualization.ipynb

No machine learning algorithms or complex code snippets here. Just data visualization & data transformation. 🙂


Time Series Forecasting with Azure ML

When we have a series of data points indexed in time order, we can define that as a "time series". Most commonly, a time series is a sequence taken at successive, equally spaced points in time. Monthly rainfall data and temperature data of a certain place are some examples of time series.

In the field of predictive analytics, there are many cases where we need to analyze time series data and forecast future values based on the previous values. Think of a scenario where you have to do a time series prediction for your business data, or a case where part of your predictive experiment contains a time series field whose future data points need to be predicted. There are many algorithms and machine learning models that you can use for forecasting time series values.

Multi-layer perceptrons, Bayesian neural networks, radial basis functions, generalized regression neural networks (also called kernel regression), k-nearest neighbor regression, CART regression trees, support vector regression, and Gaussian processes are some machine learning algorithms that can be used for time series forecasting.

See here for more about these methods

Autoregressive Integrated Moving Average (ARIMA), seasonal ARIMA, and exponential smoothing (ETS) are some algorithms widely used for this kind of time series analysis. I'm not going to dig deep into the algorithms, trend analysis, and all the numbers and characteristics bound to time series; I'm just going to demonstrate a simple way you can do time series analysis in your deployments using Azure ML Studio.

After adding a dataset that contains time series data into AzureML Studio, you can perform the time series analysis and predictions by using Python or R scripts. In addition to that, ML Studio offers a pre-built module for anomaly detection on time series datasets. It can learn the normal characteristics of the provided time series and detect deviations from the normal pattern.

Here I've used the forecast R package to write code snippets enabling AzureML Studio to do time series forecasting using popular time series algorithms, namely ARIMA, seasonal ARIMA, and ETS.

ARIMA seasonal & ARIMA non-seasonal

#ARIMA Seasonal / ARIMA non-seasonal 
library(forecast)
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
dataset2 <- maml.mapInputPort(2) # class: data.frame

#Enter the seasonality of the timeseries here
#For non-seasonal model use '1' as the seasonality
seasonality<-12
labels <- as.numeric(dataset1$data)
timeseries <- ts(labels,frequency=seasonality)
model <- auto.arima(timeseries)
numPeriodsToForecast <- ceiling(max(dataset2$date)) - ceiling(max(dataset1$date))
numPeriodsToForecast <- max(numPeriodsToForecast, 0)
forecastedData <- forecast(model, h=numPeriodsToForecast)
forecastedData <- as.numeric(forecastedData$mean)

output <- data.frame(date=dataset2$date,forecast=forecastedData)
data.set <- output

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("data.set");

 

ETS seasonal & ETS non-seasonal

#ETS seasonal / ETS non-seasonal 
library(forecast)
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
dataset2 <- maml.mapInputPort(2) # class: data.frame

#Add the seasonality here
#Use '1' as the seasonality for a non-seasonal ETS model
seasonality<-12
labels <- as.numeric(dataset1$data)
timeseries <- ts(labels,frequency=seasonality)
model <- ets(timeseries)
numPeriodsToForecast <- ceiling(max(dataset2$date)) - ceiling(max(dataset1$date))
numPeriodsToForecast <- max(numPeriodsToForecast, 0)
forecastedData <- forecast(model, h=numPeriodsToForecast)
forecastedData <- as.numeric(forecastedData$mean)

output <- data.frame(date=dataset2$date,forecast=forecastedData)
data.set <- output

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("data.set");

 

The advantage of using an R script for the prediction is the ability to customize the script as you want. But if you are looking for an instant solution for time series prediction, there's a custom module in the Cortana Intelligence Gallery for time series forecasting.

https://gallery.cortanaintelligence.com/Experiment/Time-Series-Forecasting-using-Custom-Modules-1

You just have to open that in your studio and re-use the pre-built modules in your experiment. See what's happening to your sales next December! 🙂

Competing in Kaggle with Azure Machine Learning

Data science is one of the most trending buzzwords in the industry today. Obviously, you have to have a hell of a lot of experience with data analytics and an understanding of different data science problems and their solutions to become a good data scientist.

Kaggle (www.kaggle.com) is a place where you can explore the possibilities of data science, machine learning, and related stuff. Kaggle is also known as "the home of data science" because of its rich content and the wide community behind it. You can find hundreds of interesting datasets uploaded by data science enthusiasts all around the world on Kaggle. The most fascinating thing you can find on Kaggle is the competitions! Some competitions come with exciting prize tags, while some offer wonderful job opportunities when you score a top rank.

As we discussed in previous posts, Azure Machine Learning enables you to deploy and test predictive analytics experiments easily. Sometimes you don't need to code a single line to develop a machine learning model. So let's start our journey on Kaggle with Azure Machine Learning.

01. Sign up for Kaggle – Go to kaggle.com and sign up using your Facebook/Google or LinkedIn account. It's totally free! 🙂

Kaggle landing page

02. Register for a Kaggle competition – Under the competitions section, you can find many competitions. We'll start with a simple one that doesn't come with a prize tag or job offering but is worth trying out as your first experience on Kaggle.

Can you classify monsters?

03. Ghouls, Goblins, and Ghosts… Boo! – Search for this competition, categorized under the 'Knowledge' sector of the competitions. The task you have to do in the competition is described precisely under 'Competition Details'.

04. Get the data – After accepting the terms and conditions of Kaggle, you can download the training dataset, test dataset, and the sample submission in .csv format. Make sure to take a close look at the features and understand whether you need some kind of data preprocessing before jumping into the task 😉

05. Understand the problem – You can easily figure out that this is a multi-class classification machine learning problem. So let's handle it that way!

06. Get the data into your Studio – Here comes Azure Machine Learning! Go to AML Studio (setting up Azure Machine Learning is discussed here) and upload the data files through the 'Add Files' option.

07. Build the classifier experiment – Same as building a normal AML experiment. Here I've split the training dataset to evaluate the model. The model with the highest accuracy has been chosen to do the predictions. 'Tune Model Hyperparameters' has been used to find the optimal model parameters.

Classifier Experiment

08. Do the prediction – Now it's time to use the trained model to predict the type of ghost using the data in the test dataset. You can download the predicted output using the 'Convert to CSV' module.

Predicting with the trained model

09. Submission – Make sure to create the output according to the sample submission.

10. Upload the submission to Kaggle – You can compete as a team or as an individual. See where you are in the list!

Here I am, 278th! 🙂

That's it! You've just completed your first Kaggle competition. This might not lift you to the top of the leaderboard, but it shows it's not impossible to use Azure Machine Learning for real-world machine learning problem solving.

 

SQL support in R tools for Visual Studio

If you have any kind of interest in data science or machine learning, you've probably found out that the R language is the ultimate survivor. If you are a developer familiar with Visual Studio, you don't have to adapt to RStudio; you can code R inside VS!

R Tools for Visual Studio (RTVS) recently released version 0.5. One useful feature that comes with the new version is SQL integration. With that, you can directly import the data in your SQL database into an R environment. SQL queries help you fetch the data you want, and you can then easily play with the data using R.

First, you have to have Visual Studio 2015 with Update 3. (Visual Studio 2015 Community edition is freely available to download.) Update your VS if you haven't done it yet, then download RTVS 0.5 from the link below and install it.
https://aka.ms/rtvs-current

In your R project, you can add a SQL Query item (right-click in Solution Explorer and choose "Add New Item"), which is created as a *.sql file.

At the top of the panel, you can connect to the database using the "Connect" icon. There you should configure the server name, server authentication, and the database details.


Inside the .sql file, you can execute typical SQL queries to fetch data from the SQL database. One main advantage of this is that, by enabling the execution plan, you can analyze and optimize the SQL query you have written.


Adding a database connection for the R project –

Go to R Tools -> Data -> Add Database Connection.
Provide the authentication details of the database that you want to access, then test the connection using the "Test Connection" button. After clicking 'OK', you can see that the database connection string is automatically generated inside the settings.R file. Within the R code, you can access the data inside the particular database as shown in the following example code.


The str() output is shown in the R console

The example shows the code used for accessing the data in the 'Iris Data' table inside the 'DMDatasets' database placed on the local SQL Server. Make sure to install the "RODBC" R package to use the database-related functions inside R.

#The RODBC package is needed to establish the ODBC database interface
install.packages("RODBC")
require("RODBC")

#The auto-generated Settings.R file should be added as a source
#The connection string is contained in this file
source("Settings.R")
conn <- odbcDriverConnect(connection = dbConnection)

#Get the tables of the particular database
tbls <- sqlTables(conn, tableType = "TABLE")
print(tbls)

#The SQL query is used to fetch data from the table 
sql <- "SELECT * FROM [dbo].[Iris Data]"
df <- sqlQuery(conn, sql)
str(df)
#plotting the dataset
plot(df) 

No need to switch developer environments to handle your coding as well as your data analytics tasks. Just keep Visual Studio as your default IDE! 🙂

Azure ML Web Services gets a new look

There's a huge buzz going on about machine learning. What for? Building intelligent apps is one of the dominant uses of machine learning. Web services are a "language" software developers understand. If the data scientists can provide a web service for the devs, they'll be super excited, because then they only have to deal with JSON, not regression algorithms or neural networks! 😀

Azure ML Studio gives you the power to deploy web services easily, with a nice interface that a software developer can understand. Consuming a web service built with Azure Machine Learning is pretty easy because it even provides code samples and sample JSONs showing what is transferred in and out.
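
As a hedged sketch of what consuming such a service looks like from Python (the URL, API key, column names, and exact JSON shape should be taken from the sample code the studio generates for your own service):

import requests

url = "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0"   # placeholder
api_key = "<your-api-key>"                                                                                              # placeholder

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["feature1", "feature2"],   # hypothetical columns
            "Values": [["1.0", "2.0"]],
        }
    },
    "GlobalParameters": {},
}

response = requests.post(url, json=payload, headers={"Authorization": f"Bearer {api_key}"})
print(response.json())   # the prediction comes back as JSON as well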


services.azureml.net

 

Recently, AzureML Studio has come out with a new interface for managing web services. Now it's pretty easy to manage and monitor the behavior of your web services.

Go to your ML Studio. In the web services section, you'll find a new link directing you to the "New web services experience". Currently it's in preview.


New web services dashboard

 

The dashboard shows the performance of the web service you built, including the average execution time. You can even get a glimpse of the monetary cost attached to consuming the web service from the dashboard.

Testing the web services can be done through the new portal. If you want to build a web application to consume the web service you built, you can use the Azure web app template that is pre-built for consuming ML web services.

Take a look at http://services.azureml.net and you'll get used to it! 😀