Comments on: Extracting The Main Content From a Webpage

By: Jānis Elsts

Jānis Elsts — Fri, 16 Dec 2016 14:58:10 +0000

– Tags in $removed_tags will always be removed from the page.

– Tags listed in $container_tags will be removed if their links-to-words ratio exceeds the threshold or if their content is too short.

– $ignore_len_tags is a list of exceptions. These tags will not be removed if their content is too short, but they can still be removed due to high links-to-words ratio.
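
As a rough illustration of how the three lists interact, here is a minimal sketch. It is not the class's actual code: the `should_remove()` helper, the shortened tag lists, the threshold values and the choice to measure length in characters are all assumptions made for the example.

```php
<?php
// Hypothetical, shortened stand-ins for the three class properties.
$config = [
    'removed_tags'    => ['script', 'style', 'form'],
    'container_tags'  => ['div', 'table', 'td', 'ul', 'li', 'p'],
    'ignore_len_tags' => ['p', 'li'],
    'max_ratio'       => 0.25, // illustrative links-to-words threshold
    'min_len'         => 20,   // illustrative minimum text length (characters)
];

/**
 * Decide whether a node would be dropped under the rules described above.
 * This helper is an assumption for illustration, not a method of the class.
 */
function should_remove(array $config, string $tag, int $words, int $links, int $text_len): bool
{
    // 1. Tags in removed_tags are always removed.
    if (in_array($tag, $config['removed_tags'], true)) {
        return true;
    }
    // 2. Only container tags are subject to the ratio and length checks.
    if (!in_array($tag, $config['container_tags'], true)) {
        return false;
    }
    // 3. A high links-to-words ratio removes any container, exceptions included.
    $ratio = $words > 0 ? $links / $words : ($links > 0 ? INF : 0.0);
    if ($ratio > $config['max_ratio']) {
        return true;
    }
    // 4. Short content removes a container unless it is in ignore_len_tags.
    return $text_len < $config['min_len']
        && !in_array($tag, $config['ignore_len_tags'], true);
}
```

With these assumed values, a short `div` is dropped, a short `p` survives the length check, and a `p` that is mostly links is still removed.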

By: John Han

John Han — Thu, 15 Dec 2016 05:29:32 +0000

Could you please explain the function of var $container_tags, var $removed_tags, var $ignore_len_tags? Thank you.

By: Jānis Elsts

Jānis Elsts — Sat, 28 Feb 2015 17:40:18 +0000

Sorry, my code debugging services are not available at this time 🙂

By: Roman

Roman — Sat, 28 Feb 2015 17:03:51 +0000

Hi! Now it looks beautiful but still incorrect. 🙂
If you want I can send the code to your email.
If ‘yes’ please send your address to r.a.matveev at-sign gmail dot com.

By: Jānis Elsts

Jānis Elsts — Mon, 16 Feb 2015 19:28:15 +0000

I’ve fixed the code somewhat.

By: Roman

Roman — Mon, 16 Feb 2015 17:05:35 +0000

Hmmmm… It’s weird: the code I posted was slightly modified by the blog’s engine.
However, I think you get the idea of what I did.

By: Roman

Roman — Mon, 16 Feb 2015 17:03:42 +0000

What I've done for the moment is replace '<img' tags with placeholders beforehand and restore them afterwards:

[sourcecode language="php"]
function extract($text, $ratio = null, $min_len = null){
    // begin of additional code 1
    // Replace the '>' that closes each <img ...> tag with a placeholder.
    $pos = 0;
    while (($pos = stripos($text, '<img', $pos)) !== false) {
        $pos = strpos($text, '>', $pos);
        if ($pos === false) {
            break; // unterminated <img tag, nothing more to replace
        }
        $text = substr($text, 0, $pos) . '%%IMGCLOSINGBRACKET%%' . substr($text, $pos + 1);
    }
    // Then hide the opening '<img' as well.
    $text = str_ireplace('<img', '%%IMGOPENINGBRACKET%%', $text);
    // end of additional code 1

    $this->tree = new DOMDocument();
    // .........
    // rest part of extract() method
    // .........

    // begin of additional code 2 in the end of extract() method
    $result = $this->tree->saveHTML();
    $result = str_replace('%%IMGOPENINGBRACKET%%', '<img', $result);
    $result = str_replace('%%IMGCLOSINGBRACKET%%', '>', $result);
    return $result;
    // end of additional code 2
}
[/sourcecode]

It works. However, I can imagine that this can lead to some unexpected behaviour. Thank you for keeping track of your class :)

By: Jānis Elsts

Jānis Elsts — Mon, 16 Feb 2015 15:23:50 +0000

You could try changing the class so it treats img tags as content. One way to do that would be to modify ContainerRemove() and make it increment $word_cnt by some empirically chosen amount every time it finds an img node.
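
A minimal sketch of that idea, assuming a standalone helper rather than the real ContainerRemove() method: the function name `count_words_with_images()` and the `$img_weight` value are hypothetical, and the weight would need to be tuned by experiment.

```php
<?php
/**
 * Count the words under $node, adding a fixed bonus for every <img> element.
 * $img_weight is a hypothetical, empirically chosen value.
 */
function count_words_with_images(DOMNode $node, int $img_weight = 20): int
{
    // Text nodes contribute their actual word count.
    if ($node->nodeType === XML_TEXT_NODE) {
        return str_word_count($node->nodeValue);
    }

    $count = 0;
    // Each image is treated as if it were $img_weight words of content.
    if ($node->nodeType === XML_ELEMENT_NODE && strtolower($node->nodeName) === 'img') {
        $count += $img_weight;
    }
    // Recurse into children so nested images and text are included.
    foreach ($node->childNodes as $child) {
        $count += count_words_with_images($child, $img_weight);
    }
    return $count;
}

// Usage: a container with two words of text and two images.
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML('<div><p>Two words</p><img src="a.jpg"><img src="b.jpg"></div>');
$div = $doc->getElementsByTagName('div')->item(0);
echo count_words_with_images($div), "\n";
```

Because the images raise the word count, a container made up mostly of pictures is less likely to fail the length and links-to-words checks.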

By: Roman

Roman — Mon, 16 Feb 2015 10:35:46 +0000

Janis!
I’m struggling to preserve images during extraction.
All I’ve done is comment out the ‘img’ string in the $removed_tags array:

var $removed_tags = array(
    'script', 'noscript', 'style', 'form', 'meta', 'input', 'iframe', 'embed', 'hr', // 'img',
    '#comment', 'link', 'label'
);

However, most of the time the images are still cut out. As far as I understand, this happens because of the low links/text ratio in the containers that hold these images.

For example, this page: http://www.frommers.com/slideshows/848003-vintage-postcards-of-tourist-attractions-the-same-view-then-and-now
will not contain any images in the output, even though I removed the ‘img’ string from the $removed_tags array.

Could you recommend any adjustment to the class that would make it possible to preserve such images?

Thanks in advance!

By: Roman

Roman — Mon, 02 Feb 2015 19:33:04 +0000

Thank you, Janis!
I get your idea: the source page itself is wrong to begin with.