Data is king in machine learning. When building machine learning models, data supplies the input features.
Input features come in all shapes and sizes. To build a predictive model with better accuracy, we should understand the data as well as the logic behind the algorithm we are going to use to fit the model.
Data Understanding, the second step of CRISP-DM, is about understanding the types of data we get and the way that data is represented. We can distinguish three main kinds of data features.
- Quantitative features – Data with a numerical scale (age of a person in years, price of a house in dollars, etc.)
- Ordinal features – Data without a scale but with an ordering (ordered sets: first, second, third, etc.)
- Categorical features – Data with neither a numerical scale nor an ordering. These features don’t allow statistical summaries such as the mean or median; only the mode is meaningful. (Car manufacturer categories, civil status, n-grams in NLP, etc.)
Most machine learning algorithms, such as linear regression, logistic regression, neural networks and support vector machines, work better with numerical features.
Quantitative features come with a numerical value and can be used directly as the input features of ML algorithms (although preprocessing such as normalization is sometimes needed first).
Ordinal features can easily be represented as numbers (e.g. First = 1, Second = 2, Third = 3 …). This is called integer encoding. Representing ordinal features with numbers makes sense because the order relationship between the values is preserved by the numbers.
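As a minimal sketch of integer encoding, assume a hypothetical ordinal feature with the values 'first', 'second' and 'third'. The names RANK_ORDER and integer_encode are made up for this illustration and do not come from any library.

```python
# Hypothetical ordinal feature: the order of the categories is known in advance.
RANK_ORDER = ["first", "second", "third"]

# Map each category to its 1-based position in the ordering.
RANK_TO_INT = {label: position for position, label in enumerate(RANK_ORDER, start=1)}


def integer_encode(values):
    """Replace each ordinal label with its integer code."""
    return [RANK_TO_INT[value] for value in values]


encoded = integer_encode(["second", "first", "third", "first"])
print(encoded)  # [2, 1, 3, 1]
```

Because the codes follow the ordering, comparisons such as 1 < 2 < 3 still mean "first comes before second, which comes before third".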
Some algorithms can deal directly with joint discrete distributions, such as Markov chains, naive Bayes, Bayesian networks and tree-based models. These algorithms can work with categorical data without any encoding, whereas for other ML algorithms we have to encode the categorical features numerically before using them as input features. That means it’s better to convert categorical features to numerical ones most of the time 😊
There are some special cases too. For example, while naive Bayes classification only really handles categorical features, many geometric models go in the other direction by only handling quantitative features.
How to convert categorical data to numerical data?
There are a few ways to convert categorical data to numerical data. The most prominent are:
- Dummy encoding
- One-hot encoding / one-of-K scheme
One-hot encoding (OHE) is the process of converting categorical features into numerical ones by “binarizing” each category and including it as a feature to train the model.
Mathematically, we can define one-hot encoding as follows.
One-hot encoding transforms:
a single variable with n observations and d distinct values, into
d binary variables with n observations each. Each observation indicates the presence (1) or absence (0) of the corresponding distinct value.
Let’s make this clear with an example. Suppose you have a ‘flower’ feature which can take the values ‘daffodil’, ‘lily’ and ‘rose’. One-hot encoding converts the ‘flower’ feature into three features, ‘is_daffodil’, ‘is_lily’ and ‘is_rose’, all of which are binary.
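The flower example can be sketched in plain Python as follows. This is only an illustration under simple assumptions: the helper name one_hot_encode is hypothetical, and the columns are sorted alphabetically just to keep the output deterministic.

```python
def one_hot_encode(values):
    """Turn a list of category labels into d binary columns.

    Returns (column_names, rows); each row contains exactly one 1.
    """
    # Sort the distinct values so the column order is deterministic.
    categories = sorted(set(values))
    column_names = [f"is_{category}" for category in categories]
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return column_names, rows


flowers = ["rose", "daffodil", "lily", "rose"]
columns, encoded = one_hot_encode(flowers)

print(columns)  # ['is_daffodil', 'is_lily', 'is_rose']
for flower, row in zip(flowers, encoded):
    print(flower, row)
```

Here n = 4 observations and d = 3 distinct values, so the single 'flower' column becomes three binary columns with four observations each.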
A common application of OHE is in Natural Language Processing (NLP), where it is an easy way to turn words into vectors. This also exposes a drawback of OHE: the vector size grows with the number of distinct values in the feature column, so it can get very large.
If there are only two distinct categories in the feature, there is no need to construct two additional columns. You can just replace the feature column with one Boolean column.
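Both points can be sketched in a few lines of Python. The function names one_hot_words and encode_binary_feature, the sample sentence and the 'yes'/'no' feature are all hypothetical and chosen only for illustration.

```python
def one_hot_words(tokens):
    """One-hot encode tokens; each vector is as long as the vocabulary."""
    vocabulary = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in tokens:
        vector = [0] * len(vocabulary)
        vector[index[token]] = 1
        vectors.append(vector)
    return vocabulary, vectors


def encode_binary_feature(values, positive_label):
    """Encode a two-category feature as a single Boolean column."""
    return [value == positive_label for value in values]


vocabulary, vectors = one_hot_words("the cat sat on the mat".split())
# The vector length equals the number of distinct words.
print(len(vocabulary), len(vectors[0]))  # 5 5

# Hypothetical two-category feature: one Boolean column is enough.
print(encode_binary_feature(["yes", "no", "yes"], positive_label="yes"))  # [True, False, True]
```

With a realistic vocabulary of tens of thousands of words, each word vector would have tens of thousands of entries, almost all of them zero.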
You can easily perform one-hot encoding in AzureML Studio by using the ‘Convert to Indicator Values’ module. The purpose of this module is to convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model, which is exactly what OHE does. Let’s look more closely at performing one-hot encoding using Python in the next article.