I’m a bit of an Artificial Intelligence enthusiast. Now and then I meddle in various aspects of AI, such as neural networks in the Living Image project (now discontinued) and natural language processing (in my bachelor’s thesis).
For a recent personal project I needed a bunch of NLP tools that I could use in a server-side script. After realizing that there are no PHP libraries for NLP, I eventually decided to learn Python. In the process I found and tried out many NLP libraries. Below I’ve compiled a list of the Python modules and functions that I found most useful. For easier skimming, the list is grouped by NLP task, such as tokenization and tagging.
Sidenote: I’m aware of the fact that there are also lots of good NLP toolkits for Java/Perl/C/C++. I just didn’t want to mix up multiple programming languages in a single post.
Splitting Text Into Words and Sentences (Tokenizers)
Usually you will need to split the input text into words and/or sentences before you can apply any more complex NLP. This process is called tokenization. The simplest approach is to split the text on space characters.
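To see why plain splitting is rarely enough, here’s a minimal sketch (plain Python, no libraries) comparing whitespace splitting with a simple regular-expression tokenizer:

```python
import re

text = "This is a sentence. And here is another one."

# Naive approach: split on whitespace. Punctuation stays glued
# to the neighbouring word ("sentence." comes out as one token).
print(text.split())

# A regular expression does slightly better by pulling punctuation
# out into tokens of its own.
print(re.findall(r"\w+|[^\w\s]", text))
```

Even the regex version is crude – it mangles things like “don’t” and “U.S.A.” – which is exactly why the dedicated tokenizers below exist.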
I found that the nltk.tokenize.punkt_word_tokenize() function from the Natural Language Toolkit works pretty well for splitting a sentence into words. It’s also easy to customize (if you know regular expressions). NLTK also contains a class for splitting text into sentences – nltk.tokenize.PunktSentenceTokenizer. The tokenizers are well-documented.
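For illustration only – this is not the actual Punkt algorithm, just a crude regular-expression sentence splitter that sketches the kind of job PunktSentenceTokenizer does (and does far better, since it learns abbreviations from training data):

```python
import re

def split_sentences(text):
    # Very rough heuristic: break after ., ! or ? when followed by
    # whitespace and a capital letter. Abbreviations like "Dr." will
    # fool it, which is why Punkt-style trained tokenizers exist.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(split_sentences("He was late. The meeting had started! Was it over?"))
```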
Getting the Base Form of a Word (+ Stemming)
Say you have a word – “running” – and you want to get its base form – “run”. That’s lemmatisation. NLTK includes a primitive lemmatiser (nltk.wordnet.morphy()), but in my opinion it’s not good enough for practical purposes. MontyLemmatiser from the MontyLingua package is definitely better, though not perfect.
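To make the task concrete, here’s a toy lemmatiser of my own devising – the exception table and suffix rules below are made up for illustration; real lemmatisers like morphy and MontyLemmatiser use far larger exception lists plus part-of-speech information:

```python
# Irregular forms have to come from a lookup table; no suffix rule
# will ever turn "went" into "go".
IRREGULAR = {"ran": "run", "went": "go", "better": "good", "mice": "mouse"}

def lemma(word):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies"):
        return word[:-3] + "y"            # "cities" -> "city"
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]              # "running" -> "runn" -> "run"
        return stem
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                  # "dogs" -> "dog"
    return word

print(lemma("running"), lemma("went"), lemma("cities"))
```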
On the other hand, NLTK contains several good stemmers. The nltk.stem module contains the popular Porter and Lancaster stemmers, a regexp-based stemmer, and several more.
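As a taste of how suffix-stripping stemmers work, here’s a hand-rolled fragment corresponding roughly to the first step of the Porter algorithm – the real nltk.stem.PorterStemmer implements many more rules plus measure-based conditions:

```python
def stem_step1(word):
    # Plural-stripping rules from the start of the Porter algorithm.
    # Note the result need not be a dictionary word ("poni"); stemmers
    # only guarantee that related forms collapse to the same stem.
    if word.endswith("sses"):
        return word[:-2]        # "caresses" -> "caress"
    if word.endswith("ies"):
        return word[:-2]        # "ponies"   -> "poni"
    if word.endswith("ss"):
        return word             # "caress" stays as it is
    if word.endswith("s"):
        return word[:-1]        # "cats"     -> "cat"
    return word

print(stem_step1("caresses"), stem_step1("ponies"), stem_step1("cats"))
```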
Part-of-Speech Tagging
POS-taggers examine a sequence of tokens (usually words) and try to determine which part of speech (verb, noun, adjective, etc.) each of them represents. Tagging is useful for various purposes, e.g. it helps with named entity recognition (identifying the names of people and places in a text).
NLTK provides a number of taggers. I’ve used the TrigramTagger class with good results. It’s a probabilistic tagger which, when trained on a large number of tagged sentences, can tag new sentences with over 90% accuracy. Check out the above links to learn more.
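The underlying idea is easy to sketch. Below is a most-frequent-tag (unigram) baseline in plain Python – not NLTK’s implementation, just the principle that TrigramTagger refines by also conditioning on the two preceding tags:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    # Count how often each word carries each tag; the tagger then
    # always predicts a word's most frequent tag from the training data.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

train = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
         [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
model = train_unigram_tagger(train)

# Words not seen in training fall back on a default tag ("NN" here).
print([model.get(w, "NN") for w in ["the", "dogs", "sleeps"]])
```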
A tagger can take forever to train, so use cPickle to save/load it.
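Something like this (note that cPickle is a Python 2 module; in Python 3 the plain pickle module already uses the fast C implementation under the hood):

```python
import pickle

# Stand-in for a trained tagger; any picklable Python object
# (including NLTK tagger instances) works the same way.
model = {"the": "DT", "dog": "NN"}

# Save the trained model once...
with open("tagger.pickle", "wb") as f:
    pickle.dump(model, f)

# ...and load it back instantly on later runs instead of retraining.
with open("tagger.pickle", "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```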
Conjugation and Pluralisation
This is more or less the reverse of lemmatisation. When you have an application that generates text, it’s often a good idea to try and use proper inflection. Otherwise the output could end up sounding like a stereotypical caveman. For this purpose, NodeBox::Linguistics is the best I’ve found so far. It only handles verb conjugation and noun pluralisation, but it does that well.
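To give you an idea of what’s involved, here’s a toy pluraliser with a few of the usual English rules – the exception table is a made-up stub, and NodeBox::Linguistics ships a far bigger rule set and exception list:

```python
def pluralise(noun):
    # Irregular nouns need a lookup table.
    irregular = {"child": "children", "mouse": "mice", "person": "people"}
    if noun in irregular:
        return irregular[noun]
    # Sibilant endings take "-es": box -> boxes, church -> churches.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Consonant + "y" becomes "-ies": city -> cities (but boy -> boys).
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

print(pluralise("city"), pluralise("box"), pluralise("child"))
```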
I was too lazy to look for modules that manipulate comparative adjective forms and the like. I just wrote my own ad-hoc functions.
Synonyms and Related Words
The undisputed source for synonyms and related words is WordNet. In addition to synonyms, it also contains information about other semantic relations between words. You can use it to find words with a broader meaning (dog -> animal) or a narrower one (building -> shack). The full list of relations is nicely summarized in this Wikipedia article.
Both NodeBox::Linguistics and NLTK include WordNet interfaces. Personally, I think the NLTK WordNet interface is more intuitive. Also, with some luck you might be able to find a MySQL version of WordNet.
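If you’re unsure what a hypernym lookup buys you, here’s a toy illustration using a tiny hand-made fragment of the hierarchy – the real WordNet database encodes these relations for over a hundred thousand synsets:

```python
# A miniature, hand-made fragment of WordNet's hypernym ("is a")
# hierarchy; each word maps to its immediately broader term.
HYPERNYM = {
    "shack": "building",
    "building": "structure",
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "animal",
}

def broader(word):
    # Walk up the hierarchy, collecting every broader term in order.
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

print(broader("dog"))    # from most specific to most general
```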
Everything Else
Google it! 😛 I haven’t done anything more complex in Python than what’s discussed above, so if you need a full-fledged syntactic parser or a discourse planner, you’re on your own.
There are lots of free NLP tools available online, and there’s even more information and various papers. The real challenge is finding solutions that you can actually use, preferably without having to learn five new programming languages (and how to make them all interoperate).
BTW, this is my one hundred and eleventh post.