Extracting The Main Content From a Webpage

I’ve created a PHP class that can extract the main content from an HTML page, stripping away superfluous components like JavaScript blocks, menus, advertisements and so on. The script isn’t 100% effective, but it is good enough for many practical purposes. It can also serve as a starting point for more complex systems.

Feel free to head straight to the source code if you can’t be bothered with the theoretical stuff 😉

Why Extract?

There are many applications and algorithms that would benefit from being able to find the salient data in a webpage:

Custom search engines.
More efficient text summarization.
News aggregators.
Text-to-voice and screenreader applications.
Data mining in general.

The Algorithm Of Content Extraction

Extracting the main content from a web page is a complex task. A general, robust solution would probably require natural language understanding, an ability to render and visually analyze a webpage (at least partially) and so on – in short, an advanced AI.

However, there are simpler ways to solve this, if you are willing to accept the tradeoffs. You just need to find the right heuristics. My content extractor class initially started out as a PHP port of HTML::ContentExtractor, so some of the ideas below might seem familiar if you’ve used that Perl module.

First, assume that content = text. Text is measured both by word count and character count. Short, solitary lines of text are probably not part of the main content. Furthermore, webpage blocks that contain a lot of links and little unlinked text probably contain navigation elements. So, we remove anything that is not the main content, and return what’s left.

To decide whether a block is part of the content (and should be retained) or unimportant (and can be discarded), the script applies three rules:

The length of a text block. If it’s shorter than the given threshold, the block is deleted.
The links / unlinked_words ratio. If it’s higher than a certain threshold, the block is deleted.
A number of “uninteresting” tags are unconditionally removed. These include <script>, <style>, <img> and so on.
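
As a rough illustration, the first two rules can be condensed into a single check on one block. This is only a minimal sketch, not the code of the class itself: the function name is_content_block() and its parameters are hypothetical, and the word splitting is simpler than what the class does.

```php
<?php
// Hypothetical helper, not part of the ContentExtractor class.
// Decides whether a single text block looks like main content.
function is_content_block(string $unlinkedText, int $linkCount, int $minTextLen, float $maxLinkRatio): bool
{
    $text = trim($unlinkedText);

    // Rule 1: blocks shorter than the length threshold are discarded.
    if (strlen($text) < $minTextLen) {
        return false;
    }

    // Rule 2: many links per unlinked word suggests a navigation block.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $wordCount = count($words);
    $ratio = $wordCount > 0 ? $linkCount / $wordCount : 1.0;

    return $ratio <= $maxLinkRatio;
}

// A lone menu item: too short, so it is rejected.
var_dump(is_content_block('Home', 1, 20, 0.04));

// A paragraph of running text without links: kept.
var_dump(is_content_block('This paragraph holds a reasonable amount of ordinary running text with no links.', 0, 20, 0.04));
```

The thresholds 20 and 0.04 used in the calls are the preset values of $min_text_len and $link_text_ratio in the class.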

You can specify both the minimal text length and the ratio threshold when using the script. If you don’t, they will be calculated automatically. The minimal length will be set to the average length of a text block in the current webpage, and the max. link/text ratio will be set to average ratio * 1.30. In my experience, these defaults work fairly well for a wide range of websites.
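
The automatic defaults boil down to two divisions. The sketch below shows the arithmetic in isolation; auto_thresholds() is a hypothetical helper (the class does this inside extract()), and falling back to 20 and 0.04 for an empty page is my assumption, based on the preset values in the class.

```php
<?php
// Hypothetical helper: derive default thresholds from page-wide totals.
function auto_thresholds(int $totalLinks, int $totalUnlinkedWords, int $totalTextLength, int $textBlocks): array
{
    // Minimal length = average length of a text block on this page.
    // Assumed fallback for a page without text blocks: 20.
    $minTextLen = $textBlocks > 0 ? $totalTextLength / $textBlocks : 20;

    // Maximum ratio = page-wide links per unlinked word, raised by 30%.
    // Assumed fallback for a page without unlinked words: 0.04.
    $maxRatio = $totalUnlinkedWords > 0
        ? ($totalLinks / $totalUnlinkedWords) * 1.3
        : 0.04;

    return array('min_text_len' => $minTextLen, 'link_text_ratio' => $maxRatio);
}

// Example: 40 links, 1000 unlinked words, 6000 characters in 50 text blocks.
$t = auto_thresholds(40, 1000, 6000, 50);
printf("min length: %d, max ratio: %.3f\n", $t['min_text_len'], $t['link_text_ratio']);
```

With these made-up numbers the average block is 120 characters long and the allowed ratio is 0.04 * 1.3 = 0.052.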

Technical : All the thresholds and other stuff are calculated at the level of DOM nodes that correspond to predefined “container” tags (e.g. <div>, <table>, <ul>, etc). The DOM tree is traversed recursively (depth-first), in two passes – the first removes the “uninteresting” tags and calculates the average values used for auto-configuration. The second applies the thresholds. The script is not Unicode-safe.
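
To make the traversal more concrete, here is a stripped-down sketch of the first pass only: a recursive, depth-first walk that drops “uninteresting” nodes. The function name strip_unwanted() and the sample markup are made up for the example, and the statistics gathering and the second pass are left out.

```php
<?php
// Hypothetical first-pass sketch: depth-first removal of unwanted nodes.
function strip_unwanted(DOMNode $node, array $unwanted): void
{
    if (!$node->hasChildNodes()) {
        return;
    }

    // Collect first and remove afterwards: childNodes is a live list,
    // so removing nodes while iterating over it would skip siblings.
    $doomed = array();
    foreach ($node->childNodes as $child) {
        if (in_array($child->nodeName, $unwanted, true)) {
            $doomed[] = $child;
        } else {
            strip_unwanted($child, $unwanted); // recurse into kept nodes
        }
    }
    foreach ($doomed as $child) {
        $node->removeChild($child);
    }
}

$doc = new DOMDocument();
// Parser warnings are suppressed because real-world HTML is rarely valid.
@$doc->loadHTML('<html><body><div><script>var x = 1;</script><p>Hello world</p><img src="a.png"></div></body></html>');

strip_unwanted($doc->documentElement, array('script', 'style', 'img', '#comment'));
echo $doc->saveHTML($doc->documentElement), "\n";
```

After the call, the script and img elements are gone while the paragraph survives.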

The Source Itself

You can copy the source from the box below, or download it in a .zip archive. A simple usage example is included at the end of the source code. There is no other documentation 😛 Note that the results are returned as HTML – use strip_tags() if you want plain text.

<?php
/*
	*** HTML Content Extractor class *** 
	Copyright	: Janis Elsts, 2008
	Website 	: http://w-shadow.com/
	License	: LGPL 
	Notes	 	: If you use it, please consider giving credit / a link :)
*/

class ContentExtractor {
	
	var $container_tags = array(
			'div', 'table', 'td', 'th', 'tr', 'tbody', 'thead', 'tfoot', 'col', 
			'colgroup', 'ul', 'ol', 'html', 'center', 'span'
		);
	var $removed_tags = array(
			'script', 'noscript', 'style', 'form', 'meta', 'input', 'iframe', 'embed', 'hr', 'img',
			'#comment', 'link', 'label'
		);
	var $ignore_len_tags = array(
			'span'
		);	
		
	var $link_text_ratio = 0.04;
	var $min_text_len = 20;
	var $min_words = 0;	
	
	var $total_links = 0;
	var $total_unlinked_words = 0;
	var $total_unlinked_text='';
	var $text_blocks = 0;
	
	var $tree = null;
	var $unremoved=array();
	
	function sanitize_text($text){
		$text = str_ireplace('&nbsp;', ' ', $text);
		$text = html_entity_decode($text, ENT_QUOTES);
		
		$utf_spaces = array("\xC2\xA0", "\xE1\x9A\x80", "\xE2\x80\x83", 
			"\xE2\x80\x82", "\xE2\x80\x84", "\xE2\x80\xAF", "\xA0");
		$text = str_replace($utf_spaces, ' ', $text);
		
		return trim($text);
	}
	
	function extract($text, $ratio = null, $min_len = null){
		$this->tree = new DOMDocument();
		
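		// Real-world HTML is often malformed, so parser warnings are suppressed.
		if (!@$this->tree->loadHTML($text)) return false;
		
		$root = $this->tree->documentElement;
		// First pass: strip unwanted tags and, if needed, gather statistics for auto-configuration.
		$this->HeuristicRemove($root, ( ($ratio == null) || ($min_len == null) ));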
		
		if ($ratio == null) {
			$this->total_unlinked_text = $this->sanitize_text($this->total_unlinked_text);
			
			$words = preg_split('/[\s\r\n\t\|?!.,]+/', $this->total_unlinked_text);
			$words = array_filter($words);
			$this->total_unlinked_words = count($words);
			unset($words);
			if ($this->total_unlinked_words>0) {
				$this->link_text_ratio = $this->total_links / $this->total_unlinked_words;// + 0.01;
				$this->link_text_ratio *= 1.3;
			}
			
		} else {
			$this->link_text_ratio = $ratio;
		};
		
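		if ($min_len == null) {
			// Default: average text block length. Keep the preset value if no text blocks were found.
			if ($this->text_blocks > 0) {
				$this->min_text_len = strlen($this->total_unlinked_text)/$this->text_blocks;
			}
		} else {
			$this->min_text_len = $min_len;
		}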
		
		// Second pass: apply the thresholds to container tags.
		$this->ContainerRemove($root);
		
		return $this->tree->saveHTML();
	}
	
	function HeuristicRemove($node, $do_stats = false){
		if (in_array($node->nodeName, $this->removed_tags)){
			return true;
		};
		
		if ($do_stats) {
			if ($node->nodeName == 'a') {
				$this->total_links++;
			}
			$found_text = false;
		};
		
		$nodes_to_remove = array();
		
		if ($node->hasChildNodes()){
			foreach($node->childNodes as $child){
				if ($this->HeuristicRemove($child, $do_stats)) {
					$nodes_to_remove[] = $child;
				} else if ( $do_stats && ($node->nodeName != 'a') && ($child->nodeName == '#text') ) {
					$this->total_unlinked_text .= $child->wholeText;
					if (!$found_text){
						$this->text_blocks++;
						$found_text=true;
					}
				};
			}
			foreach ($nodes_to_remove as $child){
				$node->removeChild($child);
			}
		}
		
		return false;
	}
	
	function ContainerRemove($node){
		if (is_null($node)) return array(0, 0, 0, 'delete' => false);
		$link_cnt = 0;
		$word_cnt = 0;
		$text_len = 0;
		$delete = false;
		$my_text = '';
		
		$ratio = 1;
		
		$nodes_to_remove = array();
		if ($node->hasChildNodes()){
			foreach($node->childNodes as $child){
				$data = $this->ContainerRemove($child);
				
				if ($data['delete']) {
					$nodes_to_remove[]=$child;
				} else {
					$text_len += $data[2];
				}
				
				$link_cnt += $data[0];
				
				if ($child->nodeName == 'a') {
					$link_cnt++;
				} else {
					if ($child->nodeName == '#text') $my_text .= $child->wholeText;
					$word_cnt += $data[1];
				}
			}
			
			foreach ($nodes_to_remove as $child){
				$node->removeChild($child);
			}
			
			$my_text = $this->sanitize_text($my_text);
			
			$words = preg_split('/[\s\r\n\t\|?!.,\[\]]+/', $my_text);
			$words = array_filter($words);
		
			$word_cnt += count($words);
			$text_len += strlen($my_text);
			
		};

		if (in_array($node->nodeName, $this->container_tags)){
			if ($word_cnt>0) $ratio = $link_cnt/$word_cnt;
			
			if ($ratio > $this->link_text_ratio){
					$delete = true;
			}
			
			if ( !in_array($node->nodeName, $this->ignore_len_tags) ) {
				if ( ($text_len < $this->min_text_len) || ($word_cnt<$this->min_words) ) {
					$delete = true;
				}
			}
			
		}	
		
		return array($link_cnt, $word_cnt, $text_len, 'delete' => $delete);
	}
	
}

/****************************
	Simple usage example
*****************************/

$html = file_get_contents('http://en.wikipedia.org/wiki/Shannon_index');

$extractor = new ContentExtractor();
$content = $extractor->extract($html); 
echo $content;
?>
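
Since the result comes back as HTML, getting plain text takes a little more than a bare strip_tags() call if you also want entities decoded and whitespace tidied. The sketch below is one possible way to do it. The function html_to_plain_text() is hypothetical, and the $content string is made-up markup standing in for the output of extract().

```php
<?php
// Hypothetical helper: turn extracted HTML into a single line of plain text.
function html_to_plain_text(string $html): string
{
    // Put a space before every tag so adjacent blocks do not run together.
    // Tradeoff: inline tags inside a word would also split that word.
    $text = str_replace('<', ' <', $html);
    $text = strip_tags($text);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample standing in for the HTML returned by extract().
$content = '<div><p>Shannon&#39;s index &amp; related measures.</p><p>Second   paragraph.</p></div>';
echo html_to_plain_text($content), "\n";
```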

More Ideas

Use a different HTML segmentation logic. Using the DOM for segmentation may be inefficient in complex, markup-heavy pages. On the other end of the scale, primitive websites that don’t embrace “semantic markup” may be impossible to segment using DOM.
Use Bayesian filters to find blocks of text that are not part of the content. For example, various copyright notices and sponsored links.
Invent a heuristic to filter out comments on blog posts/forums. Here’s an idea – find a short text block that mentions “Comments” and has a relatively large text block before it.
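
The comment-filtering idea could be prototyped along these lines. It is an untested guess at how such a heuristic might look, not something the class does: find_comment_boundary() is a hypothetical function, it works on a plain array of text blocks in page order, and both length thresholds are arbitrary numbers chosen for the example.

```php
<?php
// Hypothetical sketch: find where the comment section probably starts.
// $shortLen and $longLen are arbitrary example thresholds.
function find_comment_boundary(array $blocks, int $shortLen = 40, int $longLen = 400): ?int
{
    for ($i = 1; $i < count($blocks); $i++) {
        $isShort          = strlen($blocks[$i]) <= $shortLen;
        $mentionsComments = stripos($blocks[$i], 'comments') !== false;
        $followsLongBlock = strlen($blocks[$i - 1]) >= $longLen;

        if ($isShort && $mentionsComments && $followsLongBlock) {
            return $i; // blocks from this index onward are likely comments
        }
    }
    return null; // no boundary found
}

$blocks = array(
    str_repeat('Article text. ', 40), // a long block standing in for the post body
    '12 Comments',
    'Nice post, thanks!',
);
var_dump(find_comment_boundary($blocks));
```

A caller could then discard every block from the returned index onward, or keep everything if the function returns null.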

This entry was posted on Friday, January 25th, 2008 at 22:11 and is filed under Web Development.

55 Responses to “Extracting The Main Content From a Webpage”

Ross says:

April 12, 2008 at 16:58

Excellent. If you haven’t already, use this as a frontend to Open Calais, it adds a lot of value.

thanks.
White Shadow says:

April 12, 2008 at 17:53

I didn’t know about Open Calais, thanks for bringing it to my attention 🙂
Pavel says:

June 2, 2008 at 10:55

Looks amazing. I wonder where you have experimented with sites of different types.
Thanks a lot.
White Shadow says:

June 2, 2008 at 13:09

I tried it on some blog posts on various blogs, a Latvian cinema site, a webcomic, several forum threads, wikipedia, PHP.net and my own site.
fp says:

June 5, 2008 at 17:40

hmm… does this class still work?
i get always the full html back from your wikipedia example.

cheers
White Shadow says:

June 5, 2008 at 17:52

It works here. It returned most of the actual content (text), but stripped out navigation, menus, images etc. So it works as it’s supposed to.
fp says:

June 5, 2008 at 18:14

aah…! my fault. i looked at it from console and first thought, it’ll strip the whole html out of the page, not just partwise 🙂

thanks for the fast answer. gonna start a python port now. 🙂
Rizqinofa says:

July 28, 2008 at 05:41

yes, it works really great, it’s like magic 😀
i found most of website i try is parsed correctly, it’s working good

but i found some website which using table based and ecommerce site is not working, i wonder why ? some of website which not shown main content using this script
– http://www.cibuku.com/secret-happiness-p-2656.html
– http://www.kutukutubuku.com/product_info.php?products_id=53
– and some other ecommerce website

thanks!
White Shadow says:

July 28, 2008 at 12:31

Heh, thanks 🙂

For some sites you might need to tweak the $ratio parameter of the extract() method manually. The function tries to calculate a reasonable default if you leave it unspecified but this automatic calculation doesn’t work for some webpage designs.
pahinettambadi says:

March 11, 2009 at 09:43

hi i tried this html extractor and it is very useful..so
i need another script for extracting img only and also leaving advertisement seperate.. thank u..
by
pathy
White Shadow says:

March 11, 2009 at 21:29

You could easily do that with a few regular expressions. However, I’m not doing your homework 😛
digiwebtools says:

May 12, 2009 at 16:55

Wow, that’s what i am looking for!
I’ll try and give some feedback here.
Thank you.
ø Detecting Parked Domains | W-Shadow.com ø says:

November 13, 2009 at 23:14

[…] a related note, my content extraction script could also be used for detecting if a site is low on actual content. Its algorithm is much simpler […]
website reviews says:

March 9, 2010 at 11:45

I was just looking for this script to detect parked domain
Web Design Wellington says:

June 1, 2010 at 14:42

Works like a charm, thanks for posting this it has saved me hours of development time
Sreejith says:

June 9, 2010 at 06:43

Hi…

$utf_spaces = array("\xC2\xA0", "\xE1\x9A\x80", "\xE2\x80\x83",
"\xE2\x80\x82", "\xE2\x80\x84", "\xE2\x80\xAF", "\xA0");

Pls say me the details for the above coding….
Sreejith says:

June 15, 2010 at 07:48

I have tried the program..Great Result…But some it extracts unnecessary contents (comments,wheather etc) from some sites..

http://www.omaha.com/article/20100526/NEWS11/100529757
http://www.billboard.com/features/will-lee-or-crystal-win-idol-stars-weigh-1004093694.story

pls chk the links….
Sreejith says:

July 6, 2010 at 15:02

Hi…DOM implementation is very slow while using bulk url’s…is it a bug??any solution to solve this???
ag gründen says:

October 5, 2011 at 10:40

You really make it appear so easy with your presentation however I find this topic to be actually something that I believe I’d never understand. It sort of feels too complex and extremely huge for me. I’m taking a look ahead on your subsequent post, I’ll try to get the hold of it!