Simple Text Summarizer In PHP

I’ve written a simple text summarizer that can find the most important sentences in any given (English) text and produce a summary of the specified length. It would be pretty easy to adapt the PHP script to other languages, too.

The summary generator is quite primitive and probably doesn’t compare too favourably to OTS or commercial products. But hey, it’s free, and – as far as I know – the only text summarizer implemented in PHP ;) You can find the download link at the bottom of this post.

How It Works

Here’s the summarization algorithm in a nutshell:

  1. Split the input text into sentences, and split the sentences into words.
  2. For each word, stem it and keep count of how many times it occurs in the text. Skip words that are very common, like “you”, “me”, “is” and so on (I just used the stopword list from the Open Text Summarizer).
  3. Sort the words by their “popularity” in the text and keep only the Top 20 most popular words. The idea is that the most common words reflect the main topics of the input text. The “Top 20” threshold is a mostly arbitrary choice, so feel free to experiment with other numbers.
  4. Rate each sentence by the words it contains. In this case I simply added together the popularity ratings of every “important” word in the sentence, counting a word each time it appears. For example, if the word “Linux” occurs 4 times overall, and the word “Windows” occurs 3 times, then the sentence “Windows bad, Linux – Linux good!” will get a rating of 3 + 4 + 4 = 11 (assuming “bad” and “good” didn’t make it into the Top 20 word list).
  5. Take the X highest rated sentences. That’s the summary.
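
The steps above can be sketched roughly as follows. This is a minimal illustration and not the code from summarizer.php: the function names (splitSentences, splitWords, topWords, rateSentence, summarize) are made up for this example, the stopword list is passed in by the caller, stemming is skipped entirely (words are only lowercased), and returning the chosen sentences in their original order is an assumption. It needs PHP 7.4 or newer.

```php
<?php
// Hypothetical sketch of the five steps; not taken from summarizer.php.

// Step 1: naive sentence split on ".", "!" or "?" followed by whitespace.
function splitSentences(string $text): array {
    $parts = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// Step 1 (continued): lowercase ASCII words. Assumption: a real
// implementation would also run each word through a stemmer here.
function splitWords(string $sentence): array {
    preg_match_all("/[a-z']+/", strtolower($sentence), $matches);
    return $matches[0];
}

// Steps 2-3: count non-stopwords, then keep the $limit most frequent ones.
// Returns an array of word => occurrence count.
function topWords(array $sentences, array $stopwords, int $limit = 20): array {
    $counts = [];
    foreach ($sentences as $sentence) {
        foreach (splitWords($sentence) as $word) {
            if (in_array($word, $stopwords, true)) {
                continue;
            }
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts); // most frequent first
    return array_slice($counts, 0, $limit, true);
}

// Step 4: add up the count of every top word, once per appearance.
function rateSentence(string $sentence, array $topWords): int {
    $score = 0;
    foreach (splitWords($sentence) as $word) {
        $score += $topWords[$word] ?? 0;
    }
    return $score;
}

// Step 5: keep the $numSentences best sentences, in their original order.
function summarize(string $text, int $numSentences, array $stopwords, int $limit = 20): array {
    $sentences = splitSentences($text);
    $top = topWords($sentences, $stopwords, $limit);

    $scores = [];
    foreach ($sentences as $i => $sentence) {
        $scores[$i] = rateSentence($sentence, $top);
    }
    arsort($scores); // best first, sentence indexes preserved as keys

    $keep = array_slice(array_keys($scores), 0, $numSentences);
    sort($keep); // restore document order
    return array_map(fn($i) => $sentences[$i], $keep);
}
```

With this sketch, rateSentence("Windows bad, Linux – Linux good!", ['linux' => 4, 'windows' => 3]) returns 11, matching the example in step 4. Note that the sentence splitter has the weakness you would expect: any “.” followed by a space is treated as the end of a sentence.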

Download

RAR archive: php_summarizer.rar (8 KB)

Here’s what the archive contains:
test_summarizer.php – a simple demo of the text summarizer script.
includes/summarizer.php – the Summarizer PHP class. Some of it is commented!
includes/porter_stemmer.php – a lousy stemmer script used by the summarizer class.
includes/html_functions.php – various utility functions (optional). I use the normalizeHtml() function to convert some Unicode characters to ASCII. Most of the functions were written by 5ubliminal.

Good luck and let me know if you create anything interesting with this script :)

16 Responses to “Simple Text Summarizer In PHP”

  1. malay says:

    Laura gave the prince this memorial:

    Having the privilege accorded me this day of presenting myself before your Royal Highness I beg to assure you that I do so with the greatest gratification to my feelings. I am confident your Royal Highness will pardon the liberty I have taken when your Royal Highness is informed of the circumstances which have led me to do so. [snip]
    Wall-of-text removed by W-Shadow. A curious specimen of spam (probably?).

  2. White Shadow says:

    Testing a comment spammer? Impressive how it got past both of my antispam plugins.

  3. BlackHatDomainer says:

    Just two small add-ons:

    Very often “.” is not the end of a sentence.

    You should have taken into account the Title of the Text (detecting the html “H”, “b” or other tags). It would have given much better results.

    Anyway, great work.

  4. White Shadow says:

    Judging by the description on that page, the algorithm is almost exactly the same, it even has the same problem with end-of-sentence symbols.

    I know there are better algorithms (I’ve even written a better one in Python myself), but there’s really no point in porting them to PHP – that would be like reinventing the wheel. One would also need to port or rewrite various utility algorithms that don’t exist for PHP (e.g. morphology stuff and sentence tokenization). This script is intended to be simple ;)

  5. […] I will keep making updates later. Besides that, I have also added a summarization feature from here, but it is still English-only and does not yet seem as polished as MEAD. But God willing, it already […]

  6. Swan Pye Zaw says:

    Your summarizer is good, but I think it would be better if it could be used from any programming language.

  7. Barry Hunter says:

    What exactly is the licence on this code? Can I use it in a website that distributes its source code as GPL? (We would need to distribute the code with the source, which incidentally is available at http://code.geograph.org.uk/ )

    We would like to use the code to summarise long descriptions attached to photos.

  8. White Shadow says:

    This code has no single license.

    You can use the portions written by me with GPL code and redistribute them as long as the attribution notice(s) present in the code are preserved. However, I don’t know what license (if any) would apply to the various utility functions in html_functions.php – most of them were found on the php.net site.

  9. Barry Hunter says:

    Ah, I don’t need html_functions.

    summarizer.php will just be included as is (unless we need to change anything, but it doesn’t look like it so far).

    As for the stemmer, I will check with phpguru.org (I think we can get a discount because we are a registered not-for-profit).

  10. Thank you, but can I use tagging in the summarizer?

  11. White Shadow says:

    I’m not entirely sure what you mean by that. You could likely use this summarizer for tagging, but if tagging is your goal then there are specialized scripts and services (e.g. the Yahoo API) that would probably produce better results.

  12. pigpromoter says:

    I gave it a shot and it does the job quite nicely. Shouldn’t you strip HTML tags before even starting to summarize the text? I ran it on some random page and got some “if !(confirm”. Are you still working on/supporting/developing this?

  13. White Shadow says:

    Yes, the included example doesn’t strip HTML tags. You’ll need to do that yourself.

    No, this thing is not actively maintained/developed.

  14. Rick says:

    Dude. You should make this into a WordPress plugin that auto creates Excerpts with a click of a button. Or attaches a summary to the top of posts for lazy readers.

  15. Rzhen says:

    Thank you sir, your code is really helpful for my project assignment.. Keep up the good work sir :)
