Extracting The Main Content From a Webpage

I’ve created a PHP class that can extract the main content parts from an HTML page, stripping away superfluous components like JavaScript blocks, menus, advertisements and so on. The script isn’t 100% effective, but it is good enough for many practical purposes. It can also serve as a starting point for more complex systems.

Feel free to head straight to the source code if you can’t be bothered with the theoretical stuff 😉

Why Extract?

There are many applications and algorithms that would benefit from being able to find the salient data in a webpage:

  • Custom search engines.
  • More efficient text summarization.
  • News aggregators.
  • Text-to-voice and screenreader applications.
  • Data mining in general.

The Algorithm Of Content Extraction

Extracting the main content from a web page is a complex task. A general, robust solution would probably require natural language understanding, an ability to render and visually analyze a webpage (at least partially) and so on – in short, an advanced AI.

However, there are simpler ways to solve this if you are willing to accept the tradeoffs. You just need to find the right heuristics. My content extractor class initially started out as a PHP port of HTML::ContentExtractor, so some of the ideas below might seem familiar if you’ve used that Perl module.

First, assume that content = text. Text is measured both by word count and character count. Short, solitary lines of text are probably not part of the main content. Furthermore, webpage blocks that contain a lot of links and little unlinked text probably contain navigation elements. So, we remove anything that is not the main content, and return what’s left.

To determine whether something is part of the content (and should be retained) or is unimportant (and can be discarded), the script uses these criteria:

  • The length of a text block. If it’s shorter than the given threshold, the block is deleted.
  • The links / unlinked_words ratio. If it’s higher than a certain threshold, the block is deleted.
  • A number of “uninteresting” tags are unconditionally removed. These include <script>, <style>, <img> and so on.
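
As a rough illustration, the first two tests can be written as one small function. This is a simplified sketch, not the class itself: should_delete_block() and its parameters are hypothetical names, and the function assumes you already have a block's unlinked text and its link count. The real class works on DOM nodes instead.

```php
<?php
// Hypothetical helper (not part of the class): applies the length test and the
// links / unlinked_words test to a single block.
function should_delete_block(string $unlinkedText, int $linkCount, int $minLen, float $maxRatio): bool
{
    $unlinkedText = trim($unlinkedText);

    // Split on whitespace and basic punctuation, dropping empty pieces.
    $words = preg_split('/[\s|?!.,]+/', $unlinkedText, -1, PREG_SPLIT_NO_EMPTY);
    $wordCount = count($words);

    // With no words at all the ratio is treated as 1, as the class does.
    $ratio = $wordCount > 0 ? $linkCount / $wordCount : 1.0;

    return strlen($unlinkedText) < $minLen || $ratio > $maxRatio;
}

// A link-heavy block: six links next to five unlinked words.
var_dump(should_delete_block('Browse our site by category:', 6, 20, 0.25));

// A text-heavy block: one link inside a full sentence.
var_dump(should_delete_block(
    'Content extraction keeps the long runs of unlinked text and drops the rest.', 1, 20, 0.25
));
```

The first call reports that the block should be deleted, the second that it should be kept. The thresholds 20 and 0.25 are arbitrary values picked for the example.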

You can specify both the minimal text length and the ratio threshold when using the script. If you don’t, they will be calculated automatically. The minimal length will be set to the average length of a text block in the current webpage, and the max. link/text ratio will be set to average ratio * 1.30. In my experience, these defaults work fairly well for a wide range of websites.
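
To make the automatic defaults concrete, here is a small sketch of the arithmetic. It assumes the page has already been reduced to a list of blocks, each with its unlinked text and link count. The auto_thresholds() function and the array keys are made up for this example, and the fallback values mirror the class defaults (0.04 and 20).

```php
<?php
// Hypothetical helper: derives both thresholds from per-block statistics.
// Each block is array('text' => unlinked text, 'links' => number of links).
function auto_thresholds(array $blocks, float $defaultRatio = 0.04, int $defaultMinLen = 20): array
{
    $totalLinks = 0;
    $totalWords = 0;
    $totalChars = 0;

    foreach ($blocks as $block) {
        $text = trim($block['text']);
        $totalLinks += $block['links'];
        $totalChars += strlen($text);
        $totalWords += count(preg_split('/[\s|?!.,]+/', $text, -1, PREG_SPLIT_NO_EMPTY));
    }

    $blockCount = count($blocks);

    return array(
        // Minimal length = average length of a text block.
        'min_len'   => $blockCount > 0 ? $totalChars / $blockCount : $defaultMinLen,
        // Maximal ratio = average links-per-word ratio, plus 30%.
        'max_ratio' => $totalWords > 0 ? ($totalLinks / $totalWords) * 1.3 : $defaultRatio,
    );
}

$thresholds = auto_thresholds(array(
    array('text' => 'one two three four', 'links' => 0), // 18 characters, 4 words
    array('text' => 'five six',           'links' => 2), //  8 characters, 2 words
));
print_r($thresholds);
```

For these two blocks the minimal length comes out as 26 / 2 = 13 characters, and the maximal ratio as (2 / 6) * 1.3, or roughly 0.43.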

Technical note: The thresholds and statistics are calculated at the level of DOM nodes that correspond to predefined “container” tags (e.g. <div>, <table>, <ul>, etc). The DOM tree is traversed recursively (depth-first), in two passes. The first pass removes the “uninteresting” tags and, when auto-configuration is needed, gathers the totals used to compute the averages. The second pass applies the thresholds. The script is not Unicode-safe.
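
The tag-removal half of the first pass can be sketched on its own with PHP's DOM extension. This is a stripped-down illustration, not the class code: remove_unwanted_tags() is a hypothetical name, and the statistics gathering is left out.

```php
<?php
// Hypothetical helper: depth-first walk that removes every node whose name is
// listed in $removedTags.
function remove_unwanted_tags(DOMNode $node, array $removedTags): void
{
    if (!$node->hasChildNodes()) {
        return;
    }

    // Collect first and remove afterwards: childNodes is a live list, so
    // removing nodes while iterating over it would skip siblings.
    $doomed = array();
    foreach ($node->childNodes as $child) {
        if (in_array($child->nodeName, $removedTags, true)) {
            $doomed[] = $child;
        } else {
            remove_unwanted_tags($child, $removedTags);
        }
    }
    foreach ($doomed as $child) {
        $node->removeChild($child);
    }
}

$doc = new DOMDocument();
// The @ hides the warnings libxml emits for sloppy markup.
@$doc->loadHTML('<html><body><div><script>var x = 1;</script><p>Hello</p><!-- note --></div></body></html>');

remove_unwanted_tags($doc->documentElement, array('script', 'style', '#comment'));

$html = $doc->saveHTML();
echo $html;
```

After the walk, the script element and the comment are gone, and the paragraph is still there.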

The Source Itself

You can copy the source from the box below, or download it in a .zip archive. A simple usage example is included at the end of the source code. There is no other documentation 😛 Note that the results are returned as HTML – use strip_tags() if you want plain text.
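
If plain text is what you need, a minimal post-processing step might look like this. The $content string below is a made-up stand-in for what extract() returns. Decoding entities and collapsing whitespace are optional extras, not something the class does for you.

```php
<?php
// Made-up stand-in for the HTML string returned by extract().
$content = "<div>\n<h1>Title</h1>\n<p>Fish &amp; chips with a <a href=\"#\">link</a>.</p>\n</div>";

// Strip the tags first, then decode entities, so a decoded "<" is never
// mistaken for the start of a tag.
$plain = html_entity_decode(strip_tags($content), ENT_QUOTES, 'UTF-8');

// Collapse the leftover line breaks and runs of spaces.
$plain = trim(preg_replace('/\s+/', ' ', $plain));

echo $plain; // Title Fish & chips with a link.
```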

<?php
/*
	*** HTML Content Extractor class *** 
	Copyright	: Janis Elsts, 2008
	Website 	: http://w-shadow.com/
	License	: LGPL 
	Notes	 	: If you use it, please consider giving credit / a link :)
*/

class ContentExtractor {
	
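	// Containers: removed when their links-to-words ratio is too high or their text is too short.
	var $container_tags = array(
			'div', 'table', 'td', 'th', 'tr', 'tbody', 'thead', 'tfoot', 'col', 
			'colgroup', 'ul', 'ol', 'html', 'center', 'span'
		);
	// Always removed, whatever they contain.
	var $removed_tags = array(
			'script', 'noscript', 'style', 'form', 'meta', 'input', 'iframe', 'embed', 'hr', 'img',
			'#comment', 'link', 'label'
		);
	// Exceptions to the length check: never removed for being too short,
	// but still removed when their links-to-words ratio is too high.
	var $ignore_len_tags = array(
			'span'
		);	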
		
	var $link_text_ratio = 0.04;
	var $min_text_len = 20;
	var $min_words = 0;	
	
	var $total_links = 0;
	var $total_unlinked_words = 0;
	var $total_unlinked_text='';
	var $text_blocks = 0;
	
	var $tree = null;
	var $unremoved=array();
	
	// Decodes HTML entities and turns non-breaking and other special spaces into plain spaces.
	function sanitize_text($text){
		$text = str_ireplace('&nbsp;', ' ', $text);
		$text = html_entity_decode($text, ENT_QUOTES);
		
		$utf_spaces = array("\xC2\xA0", "\xE1\x9A\x80", "\xE2\x80\x83", 
			"\xE2\x80\x82", "\xE2\x80\x84", "\xE2\x80\xAF", "\xA0");
		$text = str_replace($utf_spaces, ' ', $text);
		
		return trim($text);
	}
	
	function extract($text, $ratio = null, $min_len = null){
		// Reset the statistics so the same object can process more than one page.
		$this->total_links = 0;
		$this->total_unlinked_words = 0;
		$this->total_unlinked_text = '';
		$this->text_blocks = 0;
		
		$this->tree = new DOMDocument();
		
		// The @ hides the warnings that the parser emits for malformed markup.
		if (!@$this->tree->loadHTML($text)) return false;
		
		$root = $this->tree->documentElement;
		
		// First pass: remove the unwanted tags and, if a threshold is missing, gather statistics.
		$this->HeuristicRemove($root, ( ($ratio == null) || ($min_len == null) ));
		
		if ($ratio == null) {
			$this->total_unlinked_text = $this->sanitize_text($this->total_unlinked_text);
			
			$words = preg_split('/[\s\r\n\t\|?!.,]+/', $this->total_unlinked_text);
			$words = array_filter($words);
			$this->total_unlinked_words = count($words);
			unset($words);
			if ($this->total_unlinked_words>0) {
				// Average links-per-word ratio of the page, plus 30%.
				$this->link_text_ratio = $this->total_links / $this->total_unlinked_words;
				$this->link_text_ratio *= 1.3;
			}
			
		} else {
			$this->link_text_ratio = $ratio;
		}
		
		if ($min_len == null) {
			// Average text block length; skip the division if no text blocks were found.
			if ($this->text_blocks > 0) {
				$this->min_text_len = strlen($this->total_unlinked_text)/$this->text_blocks;
			}
		} else {
			$this->min_text_len = $min_len;
		}
		
		// Second pass: apply the thresholds to the container tags.
		$this->ContainerRemove($root);
		
		return $this->tree->saveHTML();
	}
	
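	// First pass. Returns true if the caller should remove $node; gathers link and text totals when $do_stats is set.
	function HeuristicRemove($node, $do_stats = false){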
		if (in_array($node->nodeName, $this->removed_tags)){
			return true;
		};
		
		if ($do_stats) {
			if ($node->nodeName == 'a') {
				$this->total_links++;
			}
			$found_text = false;
		};
		
		$nodes_to_remove = array();
		
		if ($node->hasChildNodes()){
			foreach($node->childNodes as $child){
				if ($this->HeuristicRemove($child, $do_stats)) {
					$nodes_to_remove[] = $child;
				} else if ( $do_stats && ($node->nodeName != 'a') && ($child->nodeName == '#text') ) {
					$this->total_unlinked_text .= $child->wholeText;
					if (!$found_text){
						$this->text_blocks++;
						$found_text=true;
					}
				};
			}
			foreach ($nodes_to_remove as $child){
				$node->removeChild($child);
			}
		}
		
		return false;
	}
	
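	// Second pass. Returns array(link count, word count, text length, 'delete' => bool) for $node.
	function ContainerRemove($node){
		if (is_null($node)) return array(0, 0, 0, 'delete' => false);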
		$link_cnt = 0;
		$word_cnt = 0;
		$text_len = 0;
		$delete = false;
		$my_text = '';
		
		$ratio = 1;
		
		$nodes_to_remove = array();
		if ($node->hasChildNodes()){
			foreach($node->childNodes as $child){
				$data = $this->ContainerRemove($child);
				
				if ($data['delete']) {
					$nodes_to_remove[]=$child;
				} else {
					$text_len += $data[2];
				}
				
				$link_cnt += $data[0];
				
				if ($child->nodeName == 'a') {
					$link_cnt++;
				} else {
					if ($child->nodeName == '#text') $my_text .= $child->wholeText;
					$word_cnt += $data[1];
				}
			}
			
			foreach ($nodes_to_remove as $child){
				$node->removeChild($child);
			}
			
			$my_text = $this->sanitize_text($my_text);
			
			$words = preg_split('/[\s\r\n\t\|?!.,\[\]]+/', $my_text);
			$words = array_filter($words);
		
			$word_cnt += count($words);
			$text_len += strlen($my_text);
			
		};

		if (in_array($node->nodeName, $this->container_tags)){
			if ($word_cnt>0) $ratio = $link_cnt/$word_cnt;
			
			if ($ratio > $this->link_text_ratio){
					$delete = true;
			}
			
			if ( !in_array($node->nodeName, $this->ignore_len_tags) ) {
				if ( ($text_len < $this->min_text_len) || ($word_cnt<$this->min_words) ) {
					$delete = true;
				}
			}
			
		}	
		
		return array($link_cnt, $word_cnt, $text_len, 'delete' => $delete);
	}
	
}

/****************************
	Simple usage example
*****************************/

$html = file_get_contents('http://en.wikipedia.org/wiki/Shannon_index');

$extractor = new ContentExtractor();
$content = $extractor->extract($html); 
echo $content;
?>

More Ideas

  • Automatically remove blocks that contain predefined words, for example, “all rights reserved”. HTML::ContentExtractor does this.
  • Use a different HTML segmentation logic. Using the DOM for segmentation may be inefficient in complex, markup-heavy pages. On the other end of the scale, primitive websites that don’t embrace “semantical markup” may be impossible to segment using DOM.
  • Use Bayesian filters to find blocks of text that are not part of the content. For example, various copyright notices and sponsored links.
  • Invent a heuristic to filter out comments on blog posts/forums. Here’s an idea – find a short text block that mentions “Comments” and has a relatively large text block before it.
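
The first idea in the list above is easy to prototype. The sketch below assumes the page has already been split into plain-text blocks. contains_stop_phrase() and the phrase list are invented for the example and are not taken from HTML::ContentExtractor.

```php
<?php
// Hypothetical helper: true if the text contains any of the given phrases,
// compared case-insensitively.
function contains_stop_phrase(string $text, array $phrases): bool
{
    foreach ($phrases as $phrase) {
        if (stripos($text, $phrase) !== false) {
            return true;
        }
    }
    return false;
}

$stopPhrases = array('all rights reserved', 'sponsored links');

$blocks = array(
    'The main article text goes here.',
    'Copyright 2008 Example Corp. All Rights Reserved.',
);

// Keep only the blocks that mention none of the stop phrases.
$kept = array_values(array_filter($blocks, function ($block) use ($stopPhrases) {
    return !contains_stop_phrase($block, $stopPhrases);
}));

print_r($kept);
```

Only the first block survives. A real filter would probably also want a length limit, so that a long article that happens to quote one of the phrases is not thrown away.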

55 Responses to “Extracting The Main Content From a Webpage”

  1. Roman says:

    But in my case the amount of linked text is 0 (zero) while the amount of unlinked text (caption) is slightly above zero (several words). Or does the class include any other text within the anchor (like ‘alt’ attributes and so on)?

  2. Jānis Elsts says:

    To clarify, it doesn’t care about the amount of linked text, it counts links. The class doesn’t care if the links contain text or not.

  3. Roman says:

    Thank you, Janis!

  4. Roman says:

    Hi, Janis again!
    I’m still struggling with some unexpected behaviour of this class.
    Please try to extract the content from this page: http://www.worldtravelguide.net/holidays/editorial-feature/feature/20-quirkiest-places-stay-britain
    On my side, part of the JavaScript code comes along into the extraction.
    As I can see there are some ” strings within scripts and class goes crazy because of this (it is just my guess – but the code coming through the extractor begins just after the ” string).
    Can you check my problem?

    Thank you in advance!

  5. Jānis Elsts says:

    That looks like some kind of a bug in the DOM parser. Maybe it’s caused by the fact that the page claims to be “XHTML 1.0 Transitional” but isn’t even close to valid (100+ validation errors). In any case, there’s not much I can do about that.

  6. Roman says:

    Thank you, Janis!
    I get your idea: the source page is invalid to begin with.

  7. Roman says:

    Janis!
    I’m struggling to preserve images during extraction.
    All that I’ve done is commented out ‘img’ string in $removed_tags array:

    var $removed_tags = array(
    ‘script’, ‘noscript’, ‘style’, ‘form’, ‘meta’, ‘input’, ‘iframe’, ’embed’, ‘hr’,// ‘img’,
    ‘#comment’, ‘link’, ‘label’
    );

    However, most of the time the images still get cut out. As far as I understand, this happens due to the low links/text ratio in the containers with these images.

    For example this page: http://www.frommers.com/slideshows/848003-vintage-postcards-of-tourist-attractions-the-same-view-then-and-now
    will not contain any images even though I removed the ‘img’ string from the $removed_tags array.

    Could you recommend any adjustment to the class to make it possible to preserve such images?

    Thanks in advance!

  8. Jānis Elsts says:

    You could try changing the class so it treats img tags as content. One way to do that would be to modify ContainerRemove() and make it increment $word_cnt by some empirically chosen amount every time it finds an img node.

  9. Roman says:

    What I’ve done for the moment is replace the ‘<img’ tags with placeholders beforehand and restore them afterwards:

        function extract($text, $ratio = null, $min_len = null){
    // begin of additional code 1
            $pos = 0;
            while($pos = stripos($text,'',$pos);
                $text = substr($text,0,$pos).'%%IMGCLOSINGBRACKET%%'.substr($text,$pos+1);
            }
            $text = str_ireplace('tree = new DOMDocument();
    // .........
    // rest part of extract() method
    // .........
    // begin of additional code 2 in the end of extract() method
            $result = $this->tree->saveHTML();
            $result = str_replace('%%IMGOPENINGBRACKET%%','',$result);
            return $result;
    // end of additional code 2
    

    It works. However, I can imagine that this can lead to some unexpected behaviour.

    Thank you for keeping track of your class 🙂

  10. Roman says:

    Hmmmm… It’s weird: the code I posted was slightly modified by the blog’s engine.
    However I think that you have an idea of what I made.

  11. Jānis Elsts says:

    I’ve fixed the code somewhat.

  12. Roman says:

    Hi! Now it looks beautiful but still incorrect. 🙂
    If you want I can send the code to your email.
    If ‘yes’ please send your address to r.a.matveev at-sign gmail dot com.

  13. Jānis Elsts says:

    Sorry, my code debugging services are not available at this time 🙂

  14. John Han says:

    Could you please explain the function of var $container_tags, var $removed_tags, var $ignore_len_tags? Thank you.

  15. Jānis Elsts says:

    – Tags in $removed_tags will always be removed from the page.

    – Tags listed in $container_tags will be removed if their links-to-words ratio exceeds the threshold or if their content is too short.

    – $ignore_len_tags is a list of exceptions. These tags will not be removed if their content is too short, but they can still be removed due to high links-to-words ratio.
