Extracting The Main Content From a Webpage

I’ve created a PHP class that can extract the main content parts from an HTML page, stripping away superfluous components like JavaScript blocks, menus, advertisements and so on. The script isn’t 100% effective, but it is good enough for many practical purposes. It can also serve as a starting point for more complex systems.

Feel free to head straight to the source code if you can’t be bothered with the theoretical stuff 😉

Why Extract?

There are many applications and algorithms that would benefit from being able to find the salient data in a webpage:

Custom search engines.
More efficient text summarization.
News aggregators.
Text-to-voice and screenreader applications.
Data mining in general.

The Algorithm Of Content Extraction

Extracting the main content from a web page is a complex task. A general, robust solution would probably require natural language understanding, an ability to render and visually analyze a webpage (at least partially) and so on – in short, an advanced AI.

However, there are simpler ways to solve this if you are willing to accept the tradeoffs. You just need to find the right heuristics. My content extractor class initially started out as a PHP port of HTML::ContentExtractor, so some of the ideas below might seem familiar if you’ve used that Perl module.

First, assume that content = text. Text is measured both by word count and character count. Short, solitary lines of text are probably not part of the main content. Furthermore, webpage blocks that contain a lot of links and little unlinked text probably contain navigation elements. So, we remove anything that is not the main content, and return what’s left.

To determine whether something is part of the content (and should be retained) or is unimportant (and can be discarded), the script uses these criteria:

The length of a text block. If it’s shorter than the given threshold, the block is deleted.
The links / unlinked_words ratio. If it is higher than a certain threshold, the block is deleted.
A number of “uninteresting” tags are unconditionally removed. These include <script>, <style>, <img> and so on.
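
The two numeric checks can be sketched in isolation as follows. This is a simplified illustration, not code from the class: looks_like_content() is a hypothetical helper, and it assumes you already have a block's unlinked text and its link count. The class applies the same kind of tests while walking the DOM.

```php
<?php
// Hypothetical helper, not part of ContentExtractor: applies the length and
// link-ratio tests to a single block of text.
function looks_like_content($unlinkedText, $linkCount, $minTextLen, $maxLinkRatio) {
    $text = trim($unlinkedText);

    // Split on whitespace and basic punctuation, roughly as the class does.
    $words = preg_split('/[\s|?!.,]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $wordCount = count($words);

    // With no unlinked words the ratio defaults to 1, so link-only blocks fail.
    $ratio = ($wordCount > 0) ? $linkCount / $wordCount : 1;

    if (strlen($text) < $minTextLen) {
        return false; // too short to be main content
    }
    if ($ratio > $maxLinkRatio) {
        return false; // too many links per unlinked word: probably navigation
    }
    return true;
}

// A navigation-like block: 6 links against 6 unlinked words.
var_dump(looks_like_content('Browse by category or by tag:', 6, 20, 0.04)); // bool(false)
// A paragraph-like block with no links.
var_dump(looks_like_content('This paragraph has plenty of ordinary unlinked text in it.', 0, 20, 0.04)); // bool(true)
```

The 20 and 0.04 used here are simply the class defaults for $min_text_len and $link_text_ratio.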

You can specify both the minimal text length and the ratio threshold when using the script. If you don’t, they will be calculated automatically. The minimal length will be set to the average length of a text block in the current webpage, and the max. link/text ratio will be set to average ratio * 1.30. In my experience, these defaults work fairly well for a wide range of websites.
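
As a rough illustration of the auto-configuration arithmetic, here is a small sketch. The function name auto_thresholds() and the page statistics are made up for the example, and the zero checks are my own addition so that the sketch cannot divide by zero.

```php
<?php
// Hypothetical helper illustrating the auto-configuration arithmetic.
// The fallback values mirror the class defaults (0.04 and 20).
function auto_thresholds($totalLinks, $totalUnlinkedWords, $totalTextLength, $textBlocks) {
    $maxRatio = 0.04;
    $minLen = 20;
    if ($totalUnlinkedWords > 0) {
        // Average links-per-word ratio of the page, plus 30% tolerance.
        $maxRatio = ($totalLinks / $totalUnlinkedWords) * 1.3;
    }
    if ($textBlocks > 0) {
        // Average length of a text block.
        $minLen = $totalTextLength / $textBlocks;
    }
    return array('max_ratio' => $maxRatio, 'min_len' => $minLen);
}

// Made-up page statistics: 30 links, 1000 unlinked words,
// 6000 characters of unlinked text spread over 50 text blocks.
$t = auto_thresholds(30, 1000, 6000, 50);
printf("max ratio: %.3f, min length: %d\n", $t['max_ratio'], $t['min_len']);
// prints "max ratio: 0.039, min length: 120"
```

With these numbers, any container with more than roughly one link per 26 unlinked words, or with less than 120 characters of text, would be discarded.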

Technical: All the thresholds and statistics are calculated at the level of DOM nodes that correspond to predefined “container” tags (e.g. <div>, <table>, <ul>, etc). The DOM tree is traversed recursively (depth-first) in two passes: the first removes the “uninteresting” tags and calculates the average values used for auto-configuration, and the second applies the thresholds. The script is not Unicode-safe (for example, text length is measured in bytes with strlen()).

The Source Itself

You can copy the source from the box below, or download it in a .zip archive. A simple usage example is included at the end of the source code. There is no other documentation 😛 Note that the results are returned as HTML – use strip_tags() if you want plain text.

<?php
/*
	*** HTML Content Extractor class *** 
	Copyright	: Janis Elsts, 2008
	Website 	: http://w-shadow.com/
	License	: LGPL 
	Notes	 	: If you use it, please consider giving credit / a link :)
*/

class ContentExtractor {
	
	var $container_tags = array(
			'div', 'table', 'td', 'th', 'tr', 'tbody', 'thead', 'tfoot', 'col', 
			'colgroup', 'ul', 'ol', 'html', 'center', 'span'
		);
	var $removed_tags = array(
			'script', 'noscript', 'style', 'form', 'meta', 'input', 'iframe', 'embed', 'hr', 'img',
			'#comment', 'link', 'label'
		);
	var $ignore_len_tags = array(
			'span'
		);	
		
	var $link_text_ratio = 0.04;
	var $min_text_len = 20;
	var $min_words = 0;	
	
	var $total_links = 0;
	var $total_unlinked_words = 0;
	var $total_unlinked_text='';
	var $text_blocks = 0;
	
	var $tree = null;
	var $unremoved=array();
	
	function sanitize_text($text){
		$text = str_ireplace('&nbsp;', ' ', $text);
		$text = html_entity_decode($text, ENT_QUOTES);
		
		$utf_spaces = array("\xC2\xA0", "\xE1\x9A\x80", "\xE2\x80\x83", 
			"\xE2\x80\x82", "\xE2\x80\x84", "\xE2\x80\xAF", "\xA0");
		$text = str_replace($utf_spaces, ' ', $text);
		
		return trim($text);
	}
	
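	function extract($text, $ratio = null, $min_len = null){
		// Reset statistics so the same object can be reused for several pages.
		$this->total_links = 0;
		$this->total_unlinked_words = 0;
		$this->total_unlinked_text = '';
		$this->text_blocks = 0;
		
		$this->tree = new DOMDocument();
		
		// Real-world HTML is rarely valid, so parser warnings are suppressed.
		if (!@$this->tree->loadHTML($text)) return false;
		
		$root = $this->tree->documentElement;
		if ($root === null) return false;
		
		// First pass: remove unwanted tags and, if needed, collect statistics.
		$this->HeuristicRemove($root, ( ($ratio === null) || ($min_len === null) ));
		
		if ($ratio === null) {
			$this->total_unlinked_text = $this->sanitize_text($this->total_unlinked_text);
			
			$words = preg_split('/[\s\r\n\t\|?!.,]+/', $this->total_unlinked_text);
			$words = array_filter($words);
			$this->total_unlinked_words = count($words);
			unset($words);
			if ($this->total_unlinked_words>0) {
				// Average links-per-word ratio of the page, plus 30% tolerance.
				$this->link_text_ratio = $this->total_links / $this->total_unlinked_words;
				$this->link_text_ratio *= 1.3;
			}
			
		} else {
			$this->link_text_ratio = $ratio;
		}
		
		if ($min_len === null) {
			// Average text block length; keep the current value if no text was found.
			if ($this->text_blocks > 0) {
				$this->min_text_len = strlen($this->total_unlinked_text)/$this->text_blocks;
			}
		} else {
			$this->min_text_len = $min_len;
		}
		
		// Second pass: apply the thresholds to container nodes.
		$this->ContainerRemove($root);
		
		return $this->tree->saveHTML();
	}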
	
	// First pass. Returns true if $node itself should be removed by its parent.
	// When $do_stats is true, also counts links, unlinked text and text blocks.
	function HeuristicRemove($node, $do_stats = false){
		if (in_array($node->nodeName, $this->removed_tags)){
			return true;
		}
		
		$found_text = false;
		if ($do_stats && ($node->nodeName == 'a')) {
			$this->total_links++;
		}
		
		$nodes_to_remove = array();
		
		if ($node->hasChildNodes()){
			foreach($node->childNodes as $child){
				if ($this->HeuristicRemove($child, $do_stats)) {
					$nodes_to_remove[] = $child;
				} else if ( $do_stats && ($node->nodeName != 'a') && ($child->nodeName == '#text') ) {
					$this->total_unlinked_text .= $child->wholeText;
					// Count each node that has direct text as one text block.
					if (!$found_text){
						$this->text_blocks++;
						$found_text=true;
					}
				}
			}
			// Remove after the loop; changing childNodes while iterating would skip nodes.
			foreach ($nodes_to_remove as $child){
				$node->removeChild($child);
			}
		}
		
		return false;
	}
	
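	// Second pass. Returns array(link count, word count, text length, 'delete' => bool) for $node.
	function ContainerRemove($node){
		if (is_null($node)) return array(0, 0, 0, 'delete' => false);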
		$link_cnt = 0;
		$word_cnt = 0;
		$text_len = 0;
		$delete = false;
		$my_text = '';
		
		$ratio = 1;
		
		$nodes_to_remove = array();
		if ($node->hasChildNodes()){
			foreach($node->childNodes as $child){
				$data = $this->ContainerRemove($child);
				
				if ($data['delete']) {
					$nodes_to_remove[]=$child;
				} else {
					$text_len += $data[2];
				}
				
				$link_cnt += $data[0];
				
				if ($child->nodeName == 'a') {
					$link_cnt++;
				} else {
					if ($child->nodeName == '#text') $my_text .= $child->wholeText;
					$word_cnt += $data[1];
				}
			}
			
			foreach ($nodes_to_remove as $child){
				$node->removeChild($child);
			}
			
			$my_text = $this->sanitize_text($my_text);
			
			$words = preg_split('/[\s\r\n\t\|?!.,\[\]]+/', $my_text);
			$words = array_filter($words);
		
			$word_cnt += count($words);
			$text_len += strlen($my_text);
			
		};

		if (in_array($node->nodeName, $this->container_tags)){
			if ($word_cnt>0) $ratio = $link_cnt/$word_cnt;
			
			if ($ratio > $this->link_text_ratio){
					$delete = true;
			}
			
			if ( !in_array($node->nodeName, $this->ignore_len_tags) ) {
				if ( ($text_len < $this->min_text_len) || ($word_cnt<$this->min_words) ) {
					$delete = true;
				}
			}
			
		}	
		
		return array($link_cnt, $word_cnt, $text_len, 'delete' => $delete);
	}
	
}

/****************************
	Simple usage example
*****************************/

$html = file_get_contents('http://en.wikipedia.org/wiki/Shannon_index');

$extractor = new ContentExtractor();
$content = $extractor->extract($html); 
echo $content;
?>
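
Since extract() returns HTML, a final clean-up step is usually needed if you want plain text. The sketch below uses a hard-coded string in place of real extractor output (an assumption, so that it runs on its own). It shows both full stripping and keeping a few inline tags via the second argument of strip_tags().

```php
<?php
// Stand-in for the HTML returned by $extractor->extract($html).
$content = '<div><p>Main <b>content</b><br> with an &amp; entity.</p></div>';

// Plain text: drop all tags, then turn entities back into characters.
$plain = html_entity_decode(strip_tags($content), ENT_QUOTES, 'UTF-8');
echo $plain, "\n";   // Main content with an & entity.

// Keep simple inline formatting but drop the containers.
$inline = strip_tags($content, '<b><br>');
echo $inline, "\n";  // Main <b>content</b><br> with an &amp; entity.
```

Note that strip_tags() does not insert whitespace where a tag used to be, so words separated only by block-level tags may end up glued together.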

More Ideas

Use a different HTML segmentation logic. Using the DOM for segmentation may be inefficient in complex, markup-heavy pages. On the other end of the scale, primitive websites that don’t embrace “semantical markup” may be impossible to segment using DOM.
Use Bayesian filters to find blocks of text that are not part of the content. For example, various copyright notices and sponsored links.
Invent a heuristic to filter out comments on blog posts/forums. Here’s an idea – find a short text block that mentions “Comments” and has a relatively large text block before it.

This entry was posted on Friday, January 25th, 2008 at 22:11 and is filed under Web Development.

55 Responses to “Extracting The Main Content From a Webpage”

sachin says:

June 20, 2012 at 22:44

Can anyone help explain how to use this code?
Agustín says:

September 8, 2012 at 21:18

Look at the end of the source code.

Simple usage example
$html = file_get_contents('http://en.wikipedia.org/wiki/Shannon_index');

$extractor = new ContentExtractor();
$content = $extractor->extract($html);
echo $content;
Danny says:

February 24, 2013 at 10:36

Handy. Why are h1, h2…etc tags removed? Is there a way to preserve them?
Jānis Elsts says:

February 24, 2013 at 11:34

Theoretically, it shouldn’t remove heading tags unless it considers them to be outside the primary content area. Without seeing a specific example, it’s hard to say why it removes them in your case.
George says:

March 4, 2013 at 12:46

Does the script support other languages?

For example greek websites? If not, can any code be added to support?

It’s an awesome script!
Max says:

March 4, 2013 at 12:52

How would it be possible to remove div and span tags depending on their class name or ID?
Jānis Elsts says:

March 5, 2013 at 20:53

@George:
I expect it will work fine with some other languages, but probably not all. The script doesn’t care about what language a page is written in as long as it contains something that looks like words separated by white-space characters.

@Max:
You could parse the HTML yourself (see the DOM API), find the nodes you want to remove with XPath, and remove them with DOMNode::removeChild().
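
A minimal sketch of that approach might look like the following. The function name remove_nodes_by_xpath(), the sample markup and the class/ID names are all made up for illustration. The cleaned HTML could then be passed on to ContentExtractor::extract().

```php
<?php
// Hypothetical helper: removes every node matched by an XPath query
// and returns the remaining document as HTML.
function remove_nodes_by_xpath($html, $query) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings silently
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array before removing anything, so the
    // loop does not depend on a list that changes underneath it.
    $matches = iterator_to_array($xpath->query($query));
    foreach ($matches as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }
    return $doc->saveHTML();
}

$html = '<html><body><div id="sidebar">Links</div>'
      . '<div class="post main">Article text</div>'
      . '<span class="ad">Buy now</span></body></html>';

// Match a div by ID, and a span by one of its (space-separated) class names.
$query = '//div[@id="sidebar"]'
       . ' | //span[contains(concat(" ", normalize-space(@class), " "), " ad ")]';

$clean = remove_nodes_by_xpath($html, $query);
echo $clean;
```

The concat()/normalize-space() construction is the usual XPath 1.0 way to test for a single class name when the attribute may hold several.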
Max says:

April 6, 2013 at 18:25

Thanks for the hint, I found a solution for my problem.

Now there is one more thing I came across. After I ran the script over UTF-8 text, I got a lot of encoding errors like Ã and so on. Any idea how to fix that?
Jānis Elsts says:

April 7, 2013 at 11:18

Does your page have the right character set? For example, <meta charset="utf8">.
Max says:

April 7, 2013 at 12:40

Yes, Server, Database, and the Script are UTF-8
Phillip says:

June 9, 2013 at 12:34

This code is working properly, but how do I get only the main text rather than HTML? I want to place the result in my HTML template.
Max says:

June 9, 2013 at 12:35

Use strip_tags($string)
Jānis Elsts says:

June 9, 2013 at 12:36

Just remove the HTML with something like strip_tags().
Phillip says:

June 9, 2013 at 13:29

Thanks, but I want to get the text complete with its HTML tags. I don’t want to lose them.

I tried using nodeValue (HTML DOM) at http://holidaytravel.asia/dom/ but all the HTML code is lost.

I tried your code at http://holidaytravel.asia/dom/ext.php.

How do I get the text and HTML code in …

thanks
Phillip says:

June 9, 2013 at 13:43

How do I get the text and also HTML code like br, b, etc. in span … /span?
Roman says:

November 6, 2014 at 11:22

Is there any possibility to preserve images (<img> tags)? I tried commenting out ‘img’ in the $removed_tags array but it did not help.
Roman says:

November 6, 2014 at 11:23

PS you have done a great job! Very handy class!
Jānis Elsts says:

November 6, 2014 at 16:40

Hmm, at first glance, I’m not sure why commenting out “img” wouldn’t work. Maybe the image tag is nested inside one of the other $removed_tags tags? Or it’s in a block/div that isn’t detected as part of the content?
Roman says:

November 8, 2014 at 11:49

I’m still investigating how this class works, but I can definitely say that img tags disappear after the ContainerRemove function runs.
I tried passing a simple HTML document to the ContentExtractor and the `img` tag was still there. So you’re probably right that this is a problem with the outer markup.
Could you take a look at this page: http://www.gonewiththefamily.com/gone-with-the-family-adv/athens-greece.html ?
The images are removed from this one despite the fact that there is no ‘img’ in the $removed_tags array.
Jānis Elsts says:

November 11, 2014 at 15:01

All right, I’ve looked at the page and here’s what’s going on:
- The class removes nodes based on the ratio of links vs. non-linked text. For this page, the automatically calculated acceptable ratio is around 0.057
- Each image is inside a span element.
- The span contains very little text – just the caption of the image, which is usually just a few words long. It also contains a link to a high-resolution version of the image.
- As a result, most of the spans have a links-to-words ratio of 0.2 to 0.5, which is above the 0.057 threshold. So they get removed.
I don’t think there’s a way to work around this issue without modifying the class. Maybe you could hard-code an exception for image links or some such.
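
One possible shape for such an exception is sketched below. It is an untested idea rather than a patch that ships with the class: is_image_only_link() is a hypothetical helper, and the sample markup is made up to resemble the image-plus-caption spans described above.

```php
<?php
// Hypothetical helper: true for an <a> element that wraps an <img>
// and has no visible text of its own.
function is_image_only_link($node) {
    if (!($node instanceof DOMElement) || $node->nodeName !== 'a') {
        return false;
    }
    $hasImage = $node->getElementsByTagName('img')->length > 0;
    return $hasImage && trim($node->textContent) === '';
}

// In ContainerRemove() the link counter could then skip such links, e.g.
//   if ($child->nodeName == 'a' && !is_image_only_link($child)) { $link_cnt++; }
// ('img' would also have to be taken out of $removed_tags).

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<span><a href="big.jpg"><img src="small.jpg"></a> Acropolis at dusk</span>'
    . '<p><a href="/about">About us</a></p>');
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href'), ' => ', var_export(is_image_only_link($a), true), "\n";
}
// prints:
// big.jpg => true
// /about => false
```

With the image link no longer counted, a span that holds only an image and a short caption would have a links-to-words ratio of 0 and should pass the ratio test.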
