<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Extracting The Main Content From a Webpage</title>
	<atom:link href="http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/feed/" rel="self" type="application/rss+xml" />
	<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/</link>
	<description>Slightly Advanced Computer Stuff (and some magic)</description>
	<lastBuildDate>Fri, 20 Nov 2009 22:08:06 +0200</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: ø Detecting Parked Domains &#124; W-Shadow.com ø</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-32392</link>
		<dc:creator>ø Detecting Parked Domains &#124; W-Shadow.com ø</dc:creator>
		<pubDate>Fri, 13 Nov 2009 20:14:30 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-32392</guid>
		<description>[...] a related note, my content extraction script could also be used for detecting if a site is low on actual content. Its algorithm is much simpler [...]</description>
		<content:encoded><![CDATA[<p>[...] a related note, my content extraction script could also be used for detecting if a site is low on actual content. Its algorithm is much simpler [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dohn &#187; links for 2009-07-15</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-30856</link>
		<dc:creator>Dohn &#187; links for 2009-07-15</dc:creator>
		<pubDate>Wed, 15 Jul 2009 08:33:03 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-30856</guid>
		<description>[...] ø Extracting The Main Content From a Webpage &#124; W-Shadow.com ø [...]</description>
		<content:encoded><![CDATA[<p>[...] ø Extracting The Main Content From a Webpage | W-Shadow.com ø [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: digiwebtools</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-30170</link>
		<dc:creator>digiwebtools</dc:creator>
		<pubDate>Tue, 12 May 2009 13:55:30 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-30170</guid>
		<description>Wow, that&#039;s what I am looking for!
I&#039;ll try it and give some feedback here.
Thank you.</description>
		<content:encoded><![CDATA[<p>Wow, that&#8217;s what I am looking for!<br />
I&#8217;ll try it and give some feedback here.<br />
Thank you.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: White Shadow</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-20388</link>
		<dc:creator>White Shadow</dc:creator>
		<pubDate>Wed, 11 Mar 2009 18:29:25 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-20388</guid>
		<description>You could easily do that with a few regular expressions. However, I&#039;m not doing your homework :P</description>
		<content:encoded><![CDATA[<p>You could easily do that with a few regular expressions. However, I&#8217;m not doing your homework <img src='http://w-shadow.com/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: pahinettambadi</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-20307</link>
		<dc:creator>pahinettambadi</dc:creator>
		<pubDate>Wed, 11 Mar 2009 06:43:34 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-20307</guid>
		<description>Hi, I tried this HTML extractor and it is very useful.
I need another script that extracts only the images and keeps the advertisements separate. Thank you.
by pathy</description>
		<content:encoded><![CDATA[<p>Hi, I tried this HTML extractor and it is very useful.<br />
I need another script that extracts only the images and keeps the advertisements separate. Thank you.<br />
by pathy</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: White Shadow</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-12310</link>
		<dc:creator>White Shadow</dc:creator>
		<pubDate>Mon, 28 Jul 2008 09:31:45 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-12310</guid>
		<description>Heh, thanks :)

For some sites you might need to tweak the $ratio parameter of the extract() method manually. The function tries to calculate a reasonable default if you leave it unspecified but this automatic calculation doesn&#039;t work for some webpage designs.</description>
		<content:encoded><![CDATA[<p>Heh, thanks <img src='http://w-shadow.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>For some sites you might need to tweak the $ratio parameter of the extract() method manually. The function tries to calculate a reasonable default if you leave it unspecified but this automatic calculation doesn&#8217;t work for some webpage designs.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rizqinofa</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-12304</link>
		<dc:creator>Rizqinofa</dc:creator>
		<pubDate>Mon, 28 Jul 2008 02:41:10 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-12304</guid>
		<description>Yes, it works really well, it&#039;s like magic :D
Most of the websites I tried were parsed correctly.

However, some table-based websites and ecommerce sites do not work, and I wonder why. Some pages where this script does not return the main content:
- http://www.cibuku.com/secret-happiness-p-2656.html
- http://www.kutukutubuku.com/product_info.php?products_id=53
- and some other ecommerce websites

Thanks!</description>
		<content:encoded><![CDATA[<p>Yes, it works really well, it&#8217;s like magic <img src='http://w-shadow.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /><br />
Most of the websites I tried were parsed correctly.</p>
<p>However, some table-based websites and ecommerce sites do not work, and I wonder why. Some pages where this script does not return the main content:<br />
- <a href="http://www.cibuku.com/secret-happiness-p-2656.html" rel="nofollow">http://www.cibuku.com/secret-happiness-p-2656.html</a><br />
- <a href="http://www.kutukutubuku.com/product_info.php?products_id=53" rel="nofollow">http://www.kutukutubuku.com/product_info.php?products_id=53</a><br />
- and some other ecommerce websites</p>
<p>Thanks!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: fp</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-12001</link>
		<dc:creator>fp</dc:creator>
		<pubDate>Thu, 05 Jun 2008 15:14:43 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-12001</guid>
		<description>Aah! My fault. I looked at the output in the console and at first thought it would strip all of the HTML out of the page, not just parts of it :)

Thanks for the fast answer. Gonna start a Python port now. :)</description>
		<content:encoded><![CDATA[<p>Aah! My fault. I looked at the output in the console and at first thought it would strip all of the HTML out of the page, not just parts of it <img src='http://w-shadow.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Thanks for the fast answer. Gonna start a Python port now. <img src='http://w-shadow.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: White Shadow</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-12000</link>
		<dc:creator>White Shadow</dc:creator>
		<pubDate>Thu, 05 Jun 2008 14:52:14 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-12000</guid>
		<description>It works here. It returned most of the actual content (text), but stripped out navigation, menus, images etc. So it works as it&#039;s supposed to.</description>
		<content:encoded><![CDATA[<p>It works here. It returned most of the actual content (text), but stripped out navigation, menus, images etc. So it works as it&#8217;s supposed to.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: fp</title>
		<link>http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/comment-page-1/#comment-11999</link>
		<dc:creator>fp</dc:creator>
		<pubDate>Thu, 05 Jun 2008 14:40:18 +0000</pubDate>
		<guid isPermaLink="false">http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/#comment-11999</guid>
		<description>Hmm... does this class still work?
I always get the full HTML back from your Wikipedia example.

Cheers</description>
		<content:encoded><![CDATA[<p>Hmm&#8230; does this class still work?<br />
I always get the full HTML back from your Wikipedia example.</p>
<p>Cheers</p>
]]></content:encoded>
	</item>
</channel>
</rss>
