Text classification is probably the most popular real-world application of machine learning and other AI techniques. This is because the adaptive spam filters that guard our inboxes, comment forms and guestbooks are basically specialized text classifiers that only deal with two categories – “spam” and “not spam”. Text categorization can also be used to detect the language of a document, automatically suggest categories or tags for textual content, and more.
As with most AI algorithms, there are relatively few PHP classes/libraries available that deal with automated text classification. In this post I’ll review the classifier scripts I managed to find.
Note: All of these scripts use some form of the naive Bayes classifier. This algorithm is easy to implement and usually yields pretty good results, but it’s worth knowing that other techniques exist.
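To give a rough idea of how little code the algorithm needs, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not taken from any of the scripts reviewed in this post: the class name `TinyBayes`, its methods and the very crude tokenizer are all made up for illustration, and everything is kept in memory rather than in a database.

```php
<?php
// Illustrative sketch only: TinyBayes and its methods are hypothetical names,
// not the API of any script reviewed in this post.
class TinyBayes
{
    private $wordCounts = array(); // category => (word => count)
    private $docCounts  = array(); // category => number of training documents

    // Crude tokenizer: lowercase, then split on anything that isn't a-z or 0-9.
    private function tokenize($text)
    {
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    public function train($text, $category)
    {
        $this->update($text, $category, 1);
    }

    // Undo an earlier train() call made with the same text and category.
    public function untrain($text, $category)
    {
        $this->update($text, $category, -1);
    }

    private function update($text, $category, $delta)
    {
        if (!isset($this->docCounts[$category])) {
            $this->docCounts[$category]  = 0;
            $this->wordCounts[$category] = array();
        }
        $this->docCounts[$category] = max(0, $this->docCounts[$category] + $delta);
        foreach ($this->tokenize($text) as $word) {
            $count = max(0, ($this->wordCounts[$category][$word] ?? 0) + $delta);
            if ($count > 0) {
                $this->wordCounts[$category][$word] = $count;
            } else {
                unset($this->wordCounts[$category][$word]);
            }
        }
    }

    // Returns the most probable category, or null if nothing has been trained.
    public function classify($text)
    {
        $totalDocs = array_sum($this->docCounts);
        if ($totalDocs == 0) {
            return null;
        }

        // Vocabulary size across all categories, needed for add-one smoothing.
        $vocabulary = array();
        foreach ($this->wordCounts as $counts) {
            $vocabulary += $counts; // array union keeps one entry per distinct word
        }
        $vocabSize = max(1, count($vocabulary));

        $words     = $this->tokenize($text);
        $best      = null;
        $bestScore = -INF;
        foreach ($this->docCounts as $category => $docs) {
            if ($docs == 0) {
                continue; // category was fully untrained
            }
            $totalWords = array_sum($this->wordCounts[$category]);
            // Sum logarithms instead of multiplying tiny probabilities.
            $score = log($docs / $totalDocs);
            foreach ($words as $word) {
                $count  = $this->wordCounts[$category][$word] ?? 0;
                $score += log(($count + 1) / ($totalWords + $vocabSize));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best      = $category;
            }
        }
        return $best;
    }
}

$filter = new TinyBayes();
$filter->train('cheap pills buy now', 'spam');
$filter->train('win money now', 'spam');
$filter->train('meeting at noon tomorrow', 'ham');
$filter->train('lunch with the team tomorrow', 'ham');

echo $filter->classify('buy cheap pills'), "\n";       // prints "spam"
echo $filter->classify('team meeting tomorrow'), "\n"; // prints "ham"
```

Untraining here is just training with the counts decremented, which is why it only works properly if you pass in exactly the same text and category that you trained with.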
The first script is a simple text classifier backed by a MySQL database. You can train it to recognize any number of categories, and it will also allow you to “untrain” the classifier if you mistakenly assign the wrong category to a document. The downside is that most of the documentation is in French, but that shouldn’t be a huge problem – the code itself is easy enough to understand and commented in English. See also: an English translation of the project’s page.
b8 is an actively developed spam filter implemented in PHP. This is a good choice if you need a document classifier specifically for the purpose of catching forum/comment spammers. However, with a bit of modification you could probably coerce it to do other kinds of classification.
The filter can be trained/untrained and supports several database backends – Berkeley DB (very fast), MySQL (ubiquitous, but slower when used this way) and SQLite (a file-based database). The whole thing has a fairly professional overall look and good documentation.
I felt I had to include this class for completeness, but I personally wouldn’t use it for anything but the simplest projects. The good news is that this classifier supports an arbitrary number of categories and has a very simple implementation (the GPL headers probably take up more space than the code itself). The bad news is that there’s almost no documentation, the source comments are in Indonesian, and storing word frequency data as a serialized object isn’t very scalable.
Roll Your Own
If the above scripts don’t suit your needs, you can always write your own. For example, the Wikipedia pages on Bayesian spam filtering and naive Bayes classifiers have a good (if somewhat math-heavy) overview of a simple classification algorithm and a list of useful resources and tutorials. Good luck 🙂