Natural Language Processing with Python + Visual Studio

cap_4Human Language is one of the most complicated phenomena to interpret for machines. Comparing to artificial languages like programming languages and mathematical notations, natural languages are hard to notate with explicit rules. Natural Language Processing, AKA Computational Linguistics enable computers to derive meaning from human or natural language input.

When it comes to natural language processing, text analysis plays a major role. One of the major problems we have to face when processing natural language is the computation power. Working with big corpus and chunking the textual data into n-grams need a big processing power. All mighty cloud; the ultimate savior comes handy in this application too.

Let’s peep into some of the cool tools you can use in your developments. In most of the cases, you don’t want to get the hassle of developing from the scratch. There are plenty of APIs and libraries that you can directly integrate with your system.

If you think, you wanna go from scratch and do some enhancements, there’s the space for you too. 😊

Text Analytics APIs

Microsoft text analytics APIs are set of web services built with Azure Machine Learning. Many major tasks found in natural language processing are exposed as web services through this. The API can be used to analyze unstructured text for tasks such as sentiment analysis, key phrase extraction, language detection and topic detection. No hard rules are training loads. Just call the API from your C# or python code. Refer the link below for more info.

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-apps-text-analytics

Process natural language from the scratch!

Python! Yeah. that’s it! Among many languages used for programming, python comes handy with many pre-built packages specifically built for natural language processing.

Obviously, python works well with unix systems. But now the best IDE in town; Visual Studio comes with a toolset for python which enable you to edit, debug and compile python scripts using your existing IDE.  You should have Visual Studio 2015 (Community edition, professional or enterprise) for installing the python tools. (https://www.visualstudio.com/vs/python/)

Here I’ve used NLTK (Natural Language Tool Kit) for the task. One of the main advantage with NLTK is, it comes with dozens of built in corpora and trained models.

These are the Language processing tasks and corresponding NLTK modules with examples of functionality comes with that.

cap_1

Source – http://www.nltk.org/book/ch00.html

For running python NLTK for the first time you may need to download the nltk_data. Go for the python interactive console and install the required data from the popping up NLTK downloader. (Use nltk.download()  for this task)

cap_2

Here’s a little simulation of natural language processing tasks done using NLTK. Code snippets are commented for easy reading. 😊

import nltk
from nltk.corpus import treebank
from nltk.corpus import stopwords

#Sample sentence used for processing
sentence = """John came to office at eight o'clock on Monday morning & left the office with Arthur bit early."""

#Tokenizing the sentence into words 
word_tokens = nltk.word_tokenize(sentence)
#Tagging words
tagged_words = nltk.pos_tag(word_tokens)
#Identify named entities
named_entities = nltk.chunk.ne_chunk(tagged_words)

#Removing the stopwords from the text - Predefined stopwords in English have been used.
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print('Sentence - ' + sentence)
print('Word tokens - ')
print(word_tokens)
print('Tagged words - ')
print(tagged_words)
print('Named entities - ')
print(named_entities)
print('Word tokens - ')
print(word_tokens)
print('Filtered sentence - ')
print(filtered_sentence)

 

The output after executing the script should be like this.

cap_3You can Improve these basics to build Named Entity Recognizer and many more…

Try processing the language you read and speak… 😉