Simple Text Summarizer In PHP
I’ve written a simple text summarizer that can find the most important sentences in any given (English) text and produce a summary of the specified length. It would be pretty easy to adapt the PHP script to other languages, too.
The summary generator is quite primitive and probably doesn’t compare too favourably to the Open Text Summarizer (OTS) or commercial products. But hey, it’s free, and – as far as I know – the only text summarizer implemented in PHP. You can find the download link at the bottom of this post.
How It Works
Here’s the summarization algorithm in a nutshell:
- Split the input text into sentences, and split the sentences into words.
- For each word, stem it and keep count of how many times it occurs in the text. Skip words that are very common, like “you”, “me”, “is” and so on (I just used the stopword list from the Open Text Summarizer).
- Sort the words by their “popularity” in the text and keep only the 20 most popular ones. The idea is that the most common words reflect the main topics of the input text. The “top 20” threshold is a mostly arbitrary choice, so feel free to experiment with other numbers.
- Rate each sentence by the words it contains. In this case I simply added together the popularity ratings of every “important” word in the sentence. For example, if the word “Linux” occurs 4 times overall, and the word “Windows” occurs 3 times, then the sentence “Windows bad, Linux – Linux good!” will get a rating of 11 (assuming “bad” and “good” didn’t make it into the Top 20 word list).
- Take the X highest rated sentences. That’s the summary.
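The steps above can be sketched in a few dozen lines of PHP. This is only an illustration, not the code from includes/summarizer.php: the function names (splitSentences, extractWords, scoreSentence, summarize) and the tiny stopword list are made up for the example, stemming is left out entirely, and returning the chosen sentences in their original order is my own assumption. It needs PHP 7.4 or newer because of the arrow function.

```php
<?php
// Hypothetical sketch of the algorithm described above (no stemming).

// Tiny illustrative stopword list; the real script uses the OTS list.
const STOPWORDS = ['a', 'an', 'and', 'the', 'is', 'are', 'it', 'you', 'me', 'of', 'to', 'in'];

// Split text into sentences at '.', '!' or '?' followed by whitespace.
function splitSentences(string $text): array {
    $parts = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// Lowercase a sentence and return its words, minus stopwords.
function extractWords(string $sentence): array {
    preg_match_all("/[a-z0-9']+/", strtolower($sentence), $matches);
    return array_values(array_diff($matches[0], STOPWORDS));
}

// Rate a sentence: add up the popularity of every "important" word in it.
function scoreSentence(array $words, array $important): int {
    $score = 0;
    foreach ($words as $word) {
        $score += $important[$word] ?? 0;
    }
    return $score;
}

// Return the $maxSentences highest rated sentences, in original order.
function summarize(string $text, int $maxSentences, int $topWords = 20): array {
    $sentences = splitSentences($text);

    // Count how often each non-stopword occurs in the whole text.
    $counts = [];
    $wordsPerSentence = [];
    foreach ($sentences as $i => $sentence) {
        $wordsPerSentence[$i] = extractWords($sentence);
        foreach ($wordsPerSentence[$i] as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    // Keep only the most popular words (word => count).
    arsort($counts);
    $important = array_slice($counts, 0, $topWords, true);

    // Rate every sentence, then pick the highest rated ones.
    $scores = [];
    foreach ($wordsPerSentence as $i => $words) {
        $scores[$i] = scoreSentence($words, $important);
    }
    arsort($scores);
    $best = array_keys(array_slice($scores, 0, $maxSentences, true));
    sort($best); // restore the original sentence order

    return array_map(fn($i) => $sentences[$i], $best);
}

// The worked example from the list above: "Linux" occurs 4 times, "Windows" 3.
$important = ['linux' => 4, 'windows' => 3];
echo scoreSentence(extractWords('Windows bad, Linux – Linux good!'), $important), "\n"; // 11
```

Calling summarize($text, 3) would then give the three highest rated sentences as an array. Adding stemming would mean passing each word through a stemmer inside extractWords() before counting.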
RAR archive: php_summarizer.rar (8 KB)
Here’s what the archive contains:
test_summarizer.php – a simple demo of the text summarizer script.
includes/summarizer.php – the Summarizer PHP class. Some of it is commented!
includes/porter_stemmer.php – a lousy stemmer script used by the summarizer class.
includes/html_functions.php – various utility functions (optional). I use the normalizeHtml() function to convert some Unicode characters to ASCII. Most of the functions were written by 5ubliminal.
Good luck, and let me know if you create anything interesting with this script.