Free Tools for Natural Language Processing

I’m a bit of an Artificial Intelligence enthusiast. Now and then I meddle in various aspects of AI, such as neural networks in the Living Image project (now discontinued) and natural language processing (in my bachelor’s thesis).

For a recent personal project I needed a bunch of NLP tools that I could use in a server-side script. After realizing that there are no PHP scripts for NLP, I eventually decided to learn Python. In the process I found and tried out many NLP libraries. Below I’ve compiled a list of those Python modules and functions that I found most useful. For easier skimming, the list is grouped by NLP task, such as tokenization and tagging.

Sidenote: I’m aware that there are also lots of good NLP toolkits for Java/Perl/C/C++. I just didn’t want to mix up multiple programming languages in a single post.

Splitting Text Into Words and Sentences (Tokenizers)

Usually you will need to split the input text into words and/or sentences before you can apply any more complex NLP. This process is called tokenization. The simplest approach is to split the text on space characters.
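To make the difference concrete, here’s a quick sketch (my own toy code, not from any library) comparing a plain whitespace split with a slightly smarter regex-based split:

```python
import re

def whitespace_tokenize(text):
    # Simplest approach: split on runs of whitespace.
    # Punctuation stays glued to words ("me," stays one token).
    return text.split()

def regex_tokenize(text):
    # Slightly smarter: treat runs of letters/digits/apostrophes as
    # words and any other non-space character as its own token.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

text = "Don't split me, please."
print(whitespace_tokenize(text))  # ["Don't", 'split', 'me,', 'please.']
print(regex_tokenize(text))       # ["Don't", 'split', 'me', ',', 'please', '.']
```

The whitespace version is usually too crude for anything downstream (a tagger would see “me,” as a single word), which is why the library tokenizers below are worth using.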

I found that the nltk.tokenize.punkt_word_tokenize() function from the Natural Language Toolkit works pretty well for splitting a sentence into words. It’s also easy to customize (if you know regular expressions). NLTK also contains a class for splitting text into sentences – nltk.tokenize.PunktSentenceTokenizer. The tokenizers are well-documented.
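To see why a trained sentence tokenizer like Punkt earns its keep, compare it with a naive baseline (again my own sketch, no NLTK required): splitting after sentence-final punctuation breaks on abbreviations, which is exactly the case Punkt learns to handle.

```python
import re

def naive_sentence_split(text):
    # Split after '.', '!' or '?' followed by whitespace -- a crude
    # baseline that PunktSentenceTokenizer improves on by learning
    # abbreviations and likely sentence starters from raw text.
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(naive_sentence_split("It works. Mostly!"))
# -> ['It works.', 'Mostly!']
print(naive_sentence_split("Dr. Smith arrived. He left."))
# The abbreviation 'Dr.' triggers a bogus split:
# -> ['Dr.', 'Smith arrived.', 'He left.']
```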

Getting the Base Form of a Word (+ Stemming)

Say you have a word – “running” – and you want to get its base form – “run”. That’s lemmatisation. NLTK includes a primitive lemmatiser (nltk.wordnet.morphy()), but in my opinion it’s not good enough for practical purposes. MontyLemmatiser from the MontyLingua package is definitely better, though not perfect.

On the other hand, NLTK contains several good stemmers. The nltk.stem module contains the popular Porter and Lancaster stemmers, a regexp-based stemmer, and several more.
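The regexp-based stemmer is the easiest to understand: you give it a pattern of suffixes to strip. Here’s the idea in a few lines of my own code (NLTK’s Porter and Lancaster stemmers apply ordered rule sets with extra conditions, so this is only the simplest member of the family):

```python
import re

# Sketch of regexp-based stemming: strip a fixed set of suffixes
# with one pattern, keeping a minimum stem length as a sanity check.
SUFFIX_RE = re.compile(r'(ing|ly|ed|ness|s)$')

def regexp_stem(word):
    # Require at least a three-letter stem so "sing" doesn't become "s".
    stem = SUFFIX_RE.sub('', word)
    return stem if len(stem) >= 3 else word

print([regexp_stem(w) for w in ['jumping', 'happily', 'darkness', 'sing']])
# ['jump', 'happi', 'dark', 'sing']
```

Note that stems aren’t always dictionary words (“happi”) – stemming only needs to map related forms to the same string, unlike lemmatisation.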

Part-of-Speech Tagging

POS-taggers examine a sequence of tokens (usually words) and try to determine what part-of-speech (verbs, nouns, adjectives, etc.) each of them represents. Tagging is useful for various purposes, e.g. it helps with named entity recognition (identifying the names of people and places in a text).

NLTK provides a number of taggers. I’ve used the TrigramTagger class with good results. It’s a probabilistic tagger which, when trained on a large number of tagged sentences, can tag new sentences with over 90% accuracy. Check out the above links to learn more.
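To illustrate the basic idea, here’s a toy unigram tagger of my own: tag each word with whatever tag it carried most often in the training data, and fall back to a default for unseen words. NLTK’s TrigramTagger works on the same principle but also conditions on the two previous tags and backs off to simpler models.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    # Count how often each word appears with each tag...
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_ in sent:
            counts[word.lower()][tag_] += 1
    # ...and keep only the most frequent tag per word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, default='NN'):
    # Unseen words get the default tag (noun is the safest guess).
    return [(w, model.get(w.lower(), default)) for w in words]

train = [[('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')],
         [('the', 'DT'), ('cat', 'NN'), ('runs', 'VBZ')]]
model = train_unigram(train)
print(tag(model, ['The', 'dog', 'runs', 'fast']))
# [('The', 'DT'), ('dog', 'NN'), ('runs', 'VBZ'), ('fast', 'NN')]
```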

A tagger can take forever to train, so use cPickle to save/load it.
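The save/load part is just standard pickling (the module is cPickle in Python 2, plain pickle in Python 3); any trained tagger object can be serialized this way. The dict below is a stand-in for a real trained tagger:

```python
import pickle  # cPickle in Python 2; in Python 3 pickle is already fast

model = {'dog': 'NN', 'runs': 'VBZ'}  # stand-in for a trained tagger

# Train once, save to disk...
with open('tagger.pickle', 'wb') as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# ...then reload instantly in later runs instead of retraining.
with open('tagger.pickle', 'rb') as f:
    restored = pickle.load(f)

assert restored == model
```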

Conjugation and Pluralisation

This is more or less the reverse of lemmatisation. When you have an application that generates text, it’s often a good idea to try and use proper inflection. Otherwise the output could end up sounding like a stereotypical caveman. For this purpose, NodeBox::Linguistics is the best I’ve found so far. It only handles verb conjugation and noun pluralisation, but it does that well.
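For a sense of what’s involved, here are the common English pluralisation rules as a toy function (my own sketch – NodeBox::Linguistics handles the many irregulars and latinate forms this ignores):

```python
# Toy pluraliser covering the regular English rules plus a tiny
# hand-made table of irregulars.
IRREGULAR = {'child': 'children', 'person': 'people', 'mouse': 'mice'}

def pluralise(noun):
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if noun.endswith(('s', 'x', 'z', 'ch', 'sh')):
        return noun + 'es'          # box -> boxes
    if noun.endswith('y') and noun[-2:-1] not in 'aeiou':
        return noun[:-1] + 'ies'    # pony -> ponies (but boy -> boys)
    return noun + 's'

print([pluralise(n) for n in ['dog', 'box', 'pony', 'child']])
# ['dogs', 'boxes', 'ponies', 'children']
```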

I was too lazy to look for modules that manipulate comparative adjective forms and the like. I just wrote my own ad-hoc functions.

Synonyms

The undisputed source for synonyms and related words is WordNet. In addition to synonyms, it also contains information about other semantic relations between words. You can use it to find words with a broader meaning (dog -> animal) or a narrower one (building -> shack). The full list of relations is nicely summarized in this Wikipedia article.
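Structurally, the broader-meaning relation is just a chain you can walk upwards. Here’s a toy illustration with a hand-made hypernym table (the real WordNet database encodes these links for on the order of a hundred thousand synsets):

```python
# Tiny hand-made hypernym graph, illustrating the kind of
# broader-meaning links WordNet stores.
HYPERNYM = {'shack': 'building', 'building': 'structure',
            'dog': 'canine', 'canine': 'carnivore', 'carnivore': 'animal'}

def broader(word):
    # Walk up the chain: dog -> canine -> carnivore -> animal
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

print(broader('dog'))    # ['canine', 'carnivore', 'animal']
print(broader('shack'))  # ['building', 'structure']
```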

Both NodeBox::Linguistics and NLTK include WordNet interfaces. Personally, I think the NLTK Wordnet interface is more intuitive. Also, with some luck you might be able to find a MySQL version of WordNet.

Other Tasks

Google it! :P I haven’t done anything more complex in Python than what’s discussed above, so if you need a full-fledged syntactic parser or a discourse planner, you’re on your own.

There are lots of free NLP tools available online, and there’s even more information and various papers. The real challenge is finding solutions that you can actually use, preferably without having to learn five new programming languages (and how to make them all interoperate).

BTW, this is my one hundred and eleventh post :)

5 Responses to “Free Tools for Natural Language Processing”

  1. lemiffe says:

    Hello there,

    By any chance have you recently stumbled upon any PHP NLP parsers? Specifically I am looking for a Part-of-speech tagger.

    I might have to start working on one, because I am not a great fan of Python.

    On the other hand, have you ever used PiP (Python in PHP)? Do you think it would be easy to implement one of the solutions you mention above in PiP to implement NLP in PHP?

    Thanks!

  2. White Shadow says:

    I’m afraid NLP-related PHP scripts are very few and far between. All the NLP researchers seem to be using Java/Perl/Python/C++ or some kind of academic language these days. I didn’t find any parsers or taggers the last time I looked.

    I haven’t used PiP so I can’t really comment on that.

  3. praveen says:

    can you please suggest me some good python module for Named entity recognization..please…

  4. THANK YOU SO MUCH FOR WRITING THIS BLOG POST! Ahem… please excuse the caps but I am just so relieved. I’ve spent so much time searching for discussion of how to conjugate verbs best in python, and I’ve failed to even find a starting point. I did find some tips on pluralisation. I found plenty of resources on stemming / lemmatisation / language models. But for some reason it was very difficult to find info on conjugation. Finally I found your post. I am going to go check out the NodeBox thing you recommended now!!

  5. [...] very informative blogpost led me to the NodeBox Linguistics library, a set of tools with which “you can do grammar [...]
