Simple Text Summarizer In PHP
I’ve written a simple text summarizer that can find the most important sentences in any given (English) text and produce a summary of the specified length. It would be pretty easy to adapt the PHP script to other languages, too.
The summary generator is quite primitive and probably doesn’t compare too favourably to OTS or commercial products. But hey, it’s free, and - as far as I know - the only text summarizer implemented in PHP.
You can find the download link at the bottom of this post.
How It Works
Here’s the summarization algorithm in a nutshell:
- Split the input text into sentences, and split the sentences into words.
- For each word, stem it and keep count of how many times it occurs in the text. Skip words that are very common, like “you”, “me”, “is” and so on (I just used the stopword list from the Open Text Summarizer).
- Sort the words by their “popularity” in the text and keep only the Top 20 most popular words. The idea is that the most common words reflect the main topics of the input text. The “Top 20” threshold is a mostly arbitrary choice, so feel free to experiment with other numbers.
- Rate each sentence by the words it contains. In this case I simply added together the popularity ratings of every “important” word in the sentence. For example, if the word “Linux” occurs 4 times overall, and the word “Windows” occurs 3 times, then the sentence “Windows bad, Linux - Linux good!” will get a rating of 11 (assuming “bad” and “good” didn’t make it into the Top 20 word list).
- Take the X highest rated sentences. That’s the summary.
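The steps above can be sketched in a few dozen lines of PHP. This is only a rough illustration, not the code from the archive: the function names, the stopword list passed in by the caller and the naive stemWord() placeholder (lowercase, strip a trailing "s") are all assumptions made for the example, whereas the actual script uses a Porter stemmer and the OTS stopword list. Putting the selected sentences back in their original order is also my own choice here. It assumes PHP 7.4 or newer.

```php
<?php
// Illustrative sketch only. All names below are hypothetical and are not
// taken from includes/summarizer.php.

// Step 1: naive sentence split after ".", "!" or "?" followed by whitespace.
function splitSentences(string $text): array {
    $parts = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// Placeholder for a real stemmer: lowercase the word and strip a trailing
// "s" from words longer than three letters.
function stemWord(string $word): string {
    $word = strtolower($word);
    if (strlen($word) > 3 && substr($word, -1) === 's') {
        return substr($word, 0, -1);
    }
    return $word;
}

// Split a sentence into words, drop stopwords and stem the rest.
function sentenceStems(string $sentence, array $stopwords): array {
    preg_match_all("/[A-Za-z']+/", $sentence, $matches);
    $stems = [];
    foreach ($matches[0] as $word) {
        if (in_array(strtolower($word), $stopwords, true)) {
            continue;
        }
        $stems[] = stemWord($word);
    }
    return $stems;
}

// Steps 2 and 3: count stems across the text and keep the most frequent ones.
function topWords(array $sentences, array $stopwords, int $limit = 20): array {
    $counts = [];
    foreach ($sentences as $sentence) {
        foreach (sentenceStems($sentence, $stopwords) as $stem) {
            $counts[$stem] = ($counts[$stem] ?? 0) + 1;
        }
    }
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}

// Step 4: a sentence's rating is the sum of the counts of its top words.
function rateSentence(string $sentence, array $top, array $stopwords): int {
    $score = 0;
    foreach (sentenceStems($sentence, $stopwords) as $stem) {
        $score += $top[$stem] ?? 0;
    }
    return $score;
}

// Step 5: return the $count highest rated sentences, in original order.
function summarize(string $text, int $count, array $stopwords): array {
    $sentences = splitSentences($text);
    $top = topWords($sentences, $stopwords);
    $scores = [];
    foreach ($sentences as $i => $sentence) {
        $scores[$i] = rateSentence($sentence, $top, $stopwords);
    }
    arsort($scores);
    $keep = array_slice(array_keys($scores), 0, $count);
    sort($keep);
    return array_map(fn($i) => $sentences[$i], $keep);
}
```

With this sketch, rateSentence('Windows bad, Linux - Linux good!', ['linux' => 4, 'window' => 3], []) gives 11, matching the example in the list: 3 for “Windows” (stemmed to “window” by the placeholder) plus 4 for each of the two occurrences of “Linux”.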
Download
RAR archive: php_summarizer.rar (8 KB)
Here’s what the archive contains:
test_summarizer.php - a simple demo of the text summarizer script.
includes/summarizer.php - the Summarizer PHP class. Some of it is commented!
includes/porter_stemmer.php - a lousy stemmer script used by the summarizer class.
includes/html_functions.php - various utility functions (optional). I use the normalizeHtml() function to convert some Unicode characters to ASCII. Most of the functions were written by 5ubliminal.
Good luck and let me know if you create anything interesting with this script.
May 7th, 2008 at 2:26 am
Laura gave the prince this memorial:
Having the privilege accorded me this day of presenting myself before your Royal Highness I beg to assure you that I do so with the greatest gratification to my feelings. I am confident your Royal Highness will pardon the liberty I have taken when your Royal Highness is informed of the circumstances which have led me to do so. [snip]
Wall-of-text removed by W-Shadow. A curious specimen of spam (probably?).
May 7th, 2008 at 8:17 pm
Testing a comment spammer? Impressive how it got past both of my antispam plugins.
July 2nd, 2008 at 8:00 am
Just two small add-ons:
Very often “.” is not the end of a sentence.
You should have taken the title of the text into account (detecting the HTML “H”, “b” or other tags). It would have given much better results.
Anyway, great work.
July 2nd, 2008 at 8:11 am
Take a look at this algorithm:
http://www.planetsourcecode.com/vb/scripts/ShowCode.asp?txtCodeId=9615&lngWId=1
July 2nd, 2008 at 2:44 pm
Judging by the description on that page, the algorithm is almost exactly the same; it even has the same problem with end-of-sentence symbols.
I know there are better algorithms (I’ve even written a better one in Python myself), but there’s really no point in porting them to PHP - that would be like reinventing the wheel. One would also need to port or rewrite various utility algorithms that don’t exist for PHP (e.g. morphology stuff and sentence tokenization). This script is intended to be simple.