Detecting Parked Domains

Some time ago, a commenter asked me if it was possible to make one of my WordPress plugins detect and report parked domains. I’ve done some research since then, and while it’s probably not practical to add such a feature to the plugin, I did come up with several ways to programmatically detect parked pages. I’ve decided to summarize my findings in a blog post in case they’re useful to anyone else.

What’s a Parked Domain?

Broadly speaking, a parked domain is a website that contains no real content. Parked domains can have a benign purpose, like when someone puts a “Coming Soon” message on a domain reserved for an in-development site. However, much more commonly they are nothing but single-page “portals” stuffed with ads, bereft of any value to the unfortunate visitor. Typosquatters also frequently park their domains.

How To Detect Parked Sites

Let me state up-front that there is no single fool-proof method to detect parked domains. Each domain parking service has a slightly different page template and site structure, so even if you build a tool that can identify domains parked with SEDO.com with 100% accuracy, it might still be clueless about sites parked with GoDaddy. The best approach is to use a combination of different detection algorithms.

Here’s a summary of techniques that I found promising.

Check for subdomain wildcarding and ubiquitous redirects

A lot of parked domains will gladly accept any HTTP request you throw at them, even if you try to access a non-existent subdomain or page. Instead of the typical “Not Found” error, a parked domain will either return a dynamically generated page full of ads or redirect you to the front page. So one way to check if a domain is parked is to request a bogus subdomain (e.g. http://bogus.example.com/) or page (e.g. http://example.com/this-page-doesnt-exist.html) and see whether you get a normal-looking response.
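Here is a minimal Python sketch of that probe, using only the standard library. The function names (`bogus_probe_urls`, `looks_like_catch_all`, `fetch_status`, `probe_domain`) are hypothetical, and the rule that any 2xx or 3xx answer to a made-up URL is suspicious is an assumption rather than a tested threshold.

```python
import secrets
import urllib.error
import urllib.request


def bogus_probe_urls(domain):
    """Build two URLs that should not exist on a normal site."""
    token = secrets.token_hex(8)  # random, so it will not match real content
    return [
        f"http://{token}.{domain}/",      # non-existent subdomain
        f"http://{domain}/{token}.html",  # non-existent page
    ]


def looks_like_catch_all(status):
    """Assumed rule: a missing URL should give 4xx/5xx or fail outright.

    `status` is the HTTP status code, or None if the request failed
    (DNS error, refused connection, timeout).
    """
    return status is not None and 200 <= status < 400


def fetch_status(url, timeout=10):
    """Return the final HTTP status for `url`, or None on network failure.

    urlopen follows redirects, so a bogus URL that bounces to the front
    page ends up here as a plain 200.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 404, 500, ...
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout


def probe_domain(domain):
    """True if any bogus URL is answered as though it existed."""
    return any(
        looks_like_catch_all(fetch_status(url))
        for url in bogus_probe_urls(domain)
    )
```

Calling `probe_domain("example.com")` performs two live requests. A positive result is only a hint, since some legitimate sites also answer every URL with a 200 page.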

Another sign that a domain might be parked is that it redirects to a subdomain or a sub-directory on a different domain. For example, some domain parking services redirect parked domains to URLs like http://domain-name.domainparking.com or http://domainparking.com/?domain=domain-name.

There is sometimes a good reason to redirect a domain, like when the site has moved to a new address. How can we distinguish between legitimate redirects and parked sites? Legitimate redirects usually use (or should use) the 301 “Moved Permanently” response code, while parked domains tend to use the temporary 302 code.
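The 301-versus-302 rule of thumb could be sketched in Python roughly like this. `host_of`, `classify_redirect` and `first_response` are hypothetical helpers, the labels they return are invented for illustration, and treating "www." as the same site is an assumption.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def host_of(url):
    """Lower-cased hostname without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_redirect(requested_url, status, location):
    """Label the first response to `requested_url`.

    `status` and `location` are the status code and Location header of
    that response, before any redirect is followed.
    """
    if status not in REDIRECT_CODES or not location:
        return "no-redirect"
    target = urljoin(requested_url, location)  # resolves relative Locations
    if host_of(target) == host_of(requested_url):
        return "same-site"       # e.g. http -> https, or adding www
    if status in (301, 308):
        return "permanent-move"  # looks like a site that changed address
    return "suspicious"          # temporary redirect to another host


def first_response(url, timeout=10):
    """Fetch `url` once over plain HTTP without following redirects."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80,
                                      timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

For example, `classify_redirect("http://example.com/", 302, "http://domainparking.com/?domain=example.com")` gives `"suspicious"`, while the same target with a 301 is labelled `"permanent-move"`.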

There are also lots of parked domains that don’t use redirects at all, so this method is unfortunately rather prone to false positives and negatives.

Examine the WHOIS information

The WHOIS info of a parked domain usually points to the domain parking service that owns/controls the domain. You can use this to build a database of known-suspicious WHOIS records that can be used to identify parked domains. Alternatively, if you have access to a big WHOIS database, you could probably catch a good portion of parked sites by flagging any domain registrant that owns a lot of domains (say, more than a hundred).
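As a rough illustration of the second idea, the following Python sketch counts domains per registrant. It assumes you have already extracted (domain, registrant) pairs from WHOIS data by some other means. `flag_bulk_registrants` is a hypothetical name, and the default threshold of 100 simply mirrors the figure above.

```python
from collections import Counter


def flag_bulk_registrants(records, threshold=100):
    """Return registrants that own more than `threshold` distinct domains.

    `records` is an iterable of (domain, registrant) pairs, assumed to
    have been pulled out of WHOIS data beforehand.
    """
    # Normalise case and whitespace, and drop duplicate pairs.
    owned = {
        (domain.strip().lower(), registrant.strip().lower())
        for domain, registrant in records
    }
    counts = Counter(registrant for _, registrant in owned)
    return {name for name, total in counts.items() if total > threshold}
```

Any domain whose normalised registrant appears in the returned set would then be treated as a candidate for closer inspection, not as a confirmed parked site.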

Use an existing blacklist

Surprise, surprise: there is actually a parked domain blacklist. Run by the schizophrenically named company “I’ve Got a Fang”, the Parked Domains Project is an attempt to build and maintain an extensive database of parked domains. ~~It even has a Firefox extension.~~ (Edit: Not any more. The extension appears to have been discontinued.)

As far as I can tell there is no documented public API, but you can probably reverse-engineer the FF add-on and see what it uses (or just screen-scrape the database if you’re feeling naughty).

Analyse the content

Personally, I think the most promising approach for detecting parked domains automatically is content analysis. Parked domains all have something in common – the ads. If you can come up with a reliable and automated way to detect if a page is devoted solely to advertising, you will be able to identify parked domains pretty accurately.
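One very crude stand-in for "devoted solely to advertising" is the share of visible text that sits inside links. The Python sketch below measures that with the standard `html.parser` module. `LinkTextCounter` and `link_text_ratio` are hypothetical names, and link density is only an assumed proxy for ads, not a proven detector.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count visible text characters inside and outside <a> elements."""

    SKIPPED = {"script", "style"}  # their contents are not visible text

    def __init__(self):
        super().__init__()
        self.link_chars = 0
        self.total_chars = 0
        self._link_depth = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        length = len(data.strip())
        self.total_chars += length
        if self._link_depth:
            self.link_chars += length


def link_text_ratio(html):
    """Fraction of visible text that is link text (0.0 for an empty page)."""
    parser = LinkTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars
```

A page made of nothing but links scores close to 1.0, while an ordinary article with a handful of links scores close to 0. Where to draw the line between the two would have to be tuned on real samples.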

The paper Content-Based Approach for Identifying Textual Ads-Portal Domains [PDF] describes a very effective algorithm for detecting ads-only domains:

In this paper, we develop a machine-learning-based classifier to identify ads-portal domains. The features of the classifier are extracted from the web content of the domain. In developing the classifier, we first create negative and positive samples set. Then, we identify set of features that are effective in distinguishing ads-portal domains. Finally, we combine these features along with other keyword-based features into a machine learning classifier. The resulting classifier has 97% accuracy in identifying ads-portal domains.

On a related note, my content extraction script could also be used for detecting if a site is low on actual content. Its algorithm is much simpler than the one proposed in the above paper, but – unlike the paper – the source code of my PHP script is freely available 🙂

Illustration credit : clix @ sxc.hu

This entry was posted on Friday, November 13th, 2009 at 23:12 and is filed under Web Development.

7 Responses to “Detecting Parked Domains”

Kannan says:

November 16, 2009 at 09:41

Thanks for the highly informative post.
Parked domains are unfortunately used by both typosquatters as well as legitimate domainers but the trend is nowadays shifting to developing minisites on domains by domainers.

Most people realise the benefit of developing their domains so it can be indexed, earn adsense income and sufficiently age in the search engine indexes.

Shadow, please consider adding a Twitter plugin so your posts can be tweeted to a wider audience. Thanks.

White Shadow says:

November 17, 2009 at 14:17

Actually, I do have a Twitter plugin. My Twitter account is white_shadow.

David Engel says:

April 16, 2011 at 00:59

Yes. We have developed an API to do that. It was a really long and complicated process. So, lucky you. Just plug it in:

Technical Reference: http://api.companydatatrees.com/wiki/parked-api-reference

Landing Page: http://companydatatrees.com/detect-parked-domain-api.html

Use Case: http://api.companydatatrees.com/wiki/parked-b-use-case

Mike says:

July 8, 2011 at 14:21

Nothing wrong with parked domains, and you, David, have been blocked for quite some time. You don’t get to browse my 1000+ PARKED domain site…..

Steve says:

August 22, 2011 at 17:01

There is everything wrong with parked domains. They are the scourge of the internet, provide nothing of value and are solely there to promote someone’s wallet.

Whatever content is found on them is merely links to legitimate sites. They do not create their own content. They don’t work for a living like we do. They are slackers.

Steve says:

August 22, 2011 at 17:03

I hope the authors of the Parked Domain Project will revive their website again and provide a plugin for other browsers besides Firefox. I for one would use it and contribute sites to the list as I find them. I hate landing on these because they waste my time and Sedo, etc… are not paying me for my time.

Gary says:

July 2, 2012 at 23:31

Great info. For the record, you can’t do many bulk whois queries without paying (in the hundreds, so might work for some). It looks like it would be pretty easy to apply text heuristics. For example, GoDaddy puts “This domain was parked for free by Godaddy” in their pages.

W-Shadow.com