Some time ago, a commenter asked me if it was possible to make one of my WordPress plugins detect and report parked domains. I’ve done some research since then, and while it’s probably not practical to add such a feature to the plugin, I did come up with several ways to programmatically detect parked pages. I’ve decided to summarize my findings in a blog post in case they’re useful to anyone else.
What’s a Parked Domain?
Broadly speaking, a parked domain is a website that contains no real content. Parked domains can have a benign purpose, like when someone puts a “Coming Soon” message on a domain reserved for an in-development site. However, much more commonly they are nothing but single-page “portals” stuffed with ads, bereft of any value to the unfortunate visitor. Typosquatters also frequently park their domains.
How To Detect Parked Sites
Let me state up-front that there is no single foolproof method to detect parked domains. Each domain parking service has a slightly different page template and site structure, so even if you build a tool that can identify domains parked with SEDO.com with 100% accuracy, it might still be clueless about sites parked with GoDaddy. The best approach is to use a combination of different detection algorithms.
Here’s a summary of techniques that I found promising.
Check for subdomain wildcarding and ubiquitous redirects
A lot of parked domains will gladly accept any HTTP request you throw at them, even if you try to access a non-existent subdomain or page. Instead of the typical “Not Found” error, a parked domain will either return a dynamically generated page full of ads or redirect you to the front page. So one way to check if a domain is parked is to request a bogus subdomain (e.g. http://bogus.example.com/) or page (e.g.
Another sign that a domain might be parked is that it redirects to a subdomain or a sub-directory on a different domain. For example, some domain parking services redirect parked domains to URLs like
There is sometimes a good reason to redirect a domain, like when the site has moved to a new address. How can we distinguish between legitimate redirects and parked sites? Legitimate redirects usually use (or should use) the 301 response code, while parked domains use the
There are also lots of parked domains that don’t use redirects at all, so this method is unfortunately rather prone to false positives and negatives.
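As a rough illustration, the probing and redirect checks described above could be sketched in Python like this. It is only a sketch under some assumptions: the function names are made up for this example, the probes use plain HTTP, and any off-domain redirect that is not a 301 is simply treated as suspicious. A hit is a hint, not proof, since some legitimate sites also answer unknown URLs with a page or a redirect.

```python
import http.client
import uuid
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def make_probe_urls(domain):
    """Build a bogus-subdomain URL and a bogus-page URL for the domain."""
    token = "probe-" + uuid.uuid4().hex[:12]  # random, so it should not exist
    return [f"http://{token}.{domain}/", f"http://{domain}/{token}"]


def fetch_status(url, timeout=10):
    """Return (status code, Location header or None) without following redirects."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def probe_looks_parked(status, location):
    """A bogus URL that returns a page or a redirect instead of an error is suspicious."""
    if status == 200:
        return True
    return status in REDIRECT_CODES and bool(location)


def suspicious_redirect(domain, status, location):
    """True for a non-301 redirect that leaves the domain (assumed heuristic)."""
    if status not in REDIRECT_CODES or not location:
        return False
    # A relative Location has no host, so it stays on the same domain.
    host = (urlsplit(location).hostname or domain).lower()
    domain = domain.lower()
    off_domain = host != domain and not host.endswith("." + domain)
    return off_domain and status != 301


def count_probe_hits(domain):
    """Run both probes and count how many look parked (0, 1 or 2)."""
    hits = 0
    for url in make_probe_urls(domain):
        try:
            status, location = fetch_status(url)
        except (OSError, http.client.HTTPException):
            continue  # DNS failure or refused connection: not wildcarded
        if probe_looks_parked(status, location):
            hits += 1
    return hits
```

Keeping the decision logic in small pure functions makes it easy to combine with the other signals in this post, and to tune without touching the network code.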
Examine the WHOIS information
The WHOIS info of a parked domain usually points to the domain parking service that owns/controls the domain. You can use this to build a database of known-suspicious WHOIS records that can be used to identify parked domains. Alternatively, if you have access to a big WHOIS database, you could probably catch a good portion of parked sites by flagging any domain registrant that owns a lot of domains (say, more than a hundred).
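Here is a minimal sketch of the "flag big registrants" idea in Python. It assumes you already have the WHOIS data as (domain, registrant) pairs; how you obtain and parse WHOIS records is out of scope, and the function names and the default threshold of 100 are just illustrative choices.

```python
from collections import Counter


def normalize(registrant):
    """Normalize a registrant name so trivial variations compare equal."""
    return " ".join(registrant.lower().split())


def flag_bulk_registrants(records, threshold=100):
    """Return the registrants that own more than `threshold` distinct domains.

    `records` is an iterable of (domain, registrant) pairs.
    """
    # Deduplicate first so a domain listed twice is only counted once.
    unique = {(domain.lower(), normalize(registrant)) for domain, registrant in records}
    counts = Counter(registrant for _, registrant in unique)
    return {name for name, n in counts.items() if n > threshold}


def whois_looks_suspicious(registrant, known_parking_registrants, bulk_registrants):
    """Check a registrant against a hand-built list and the bulk-owner set.

    Both collections are expected to contain normalized names.
    """
    key = normalize(registrant)
    return key in known_parking_registrants or key in bulk_registrants
```

Note that large registrars, privacy-proxy services and big companies will also trip the bulk-owner check, so it works better as one signal among several than as a verdict.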
Use an existing blacklist
Surprise, surprise: there is actually a parked domain blacklist. Run by the schizophrenically named company “I’ve Got a Fang”, the Parked Domains Project is an attempt to build and maintain an extensive database of parked domains.
It even has a Firefox extension. (Edit: Not any more. The extension appears to have been discontinued.)
As far as I can tell there is no documented public API, but you can probably reverse-engineer the FF add-on and see what it uses (or just screen-scrape the database if you’re feeling naughty).
Analyse the content
Personally, I think the most promising approach for detecting parked domains automatically is content analysis. Parked domains all have something in common – the ads. If you can come up with a reliable and automated way to detect if a page is devoted solely to advertising, you will be able to identify parked domains pretty accurately.
The paper Content-Based Approach for Identifying Textual Ads-Portal Domains [PDF] describes a very effective algorithm for detecting ads-only domains:
In this paper, we develop a machine-learning-based classifier to identify ads-portal domains. The features of the classifier are extracted from the web content of the domain. In developing the classifier, we first create negative and positive samples set. Then, we identify set of features that are effective in distinguishing ads-portal domains. Finally, we combine these features along with other keyword-based features into a machine learning classifier. The resulting classifier has 97% accuracy in identifying ads-portal domains.
On a related note, my content extraction script could also be used for detecting if a site is low on actual content. Its algorithm is much simpler than the one proposed in the above paper, but – unlike the paper – the source code of my PHP script is freely available 🙂
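To give a feel for what a very crude content check might look like, here is a small Python sketch based on link density: the share of visible text that sits inside links. To be clear, this is neither the classifier from the paper nor the algorithm of the PHP script mentioned above; it is a swapped-in heuristic, and the names and the 0.8 threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count visible text characters, and how many of them are inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._link_depth = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # script and style bodies are not visible text
        n = len(data.strip())
        self.total_chars += n
        if self._link_depth:
            self.link_chars += n


def link_density(html):
    """Return the fraction of visible text that is link text (0.0 for an empty page)."""
    parser = LinkTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


def looks_ad_only(html, threshold=0.8):
    """Flag pages whose visible text is almost entirely links."""
    return link_density(html) >= threshold
```

A page that is nothing but a wall of sponsored links scores close to 1.0, while an article with a few navigation links scores close to 0. Link directories and sitemap pages will produce false positives, so treat the result as one more hint to combine with the other checks.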
Illustration credit: clix @ sxc.hu