Can We Fight Splogs And Content Theft?
I’ve had some of my posts scraped by spam blogs (splogs) in the past. Since I think “legal action” against these sites is not very effective (and I’m lazy, too 😉 ), I began wondering if there was some automated way to stop the content thieves. In this post I’ll talk about my findings, starting with simple feed copyrighting and working up to more advanced techniques that could be used to fight sploggers.
So, content theft. To solve this problem you need to do two things –
- Identify splogs, which means finding websites that have scraped your content.
- Identify and block scrapers – stop sploggers from stealing your content in the first place.
Detect Evil (Splogger)
There are already some working solutions for content theft detection. In particular, I’ve seen several WordPress plugins that can add a digital signature, also known as a “fingerprint”, to your RSS feed. The signature is (supposed to be) unique to your blog – an MD5 hash of the blog’s URL would work.
When the evil scraper grabs a post from your feed and puts it on their site, they’ll also post the signature (since scraping is usually done with automated software and sploggers don’t manually examine every post). Then you can find the scraped posts by searching for the signature code on Google. Or set up a Google Alert for this.
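For illustration, here’s roughly how such a signature could be generated and slipped into a feed item. This is a bare Python sketch rather than an actual plugin, and the blog URL and post text are made-up examples.

```python
import hashlib

def blog_fingerprint(blog_url):
    """Derive a signature that is (hopefully) unique to this blog."""
    return hashlib.md5(blog_url.encode("utf-8")).hexdigest()

def tag_feed_item(item_html, blog_url):
    """Append the fingerprint as ordinary text, so search engines will index it."""
    return item_html + '\n<p><small>' + blog_fingerprint(blog_url) + '</small></p>'

# Made-up example values
print(blog_fingerprint("http://example.com/blog"))
print(tag_feed_item("<p>Post body goes here.</p>", "http://example.com/blog"))
```

Searching Google for that hex string (or setting up an alert for it) would then turn up any pages that republished the tagged feed items.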
One WordPress plugin for “fingerprinting” your RSS feed is Digital Fingerprinting. I’m sure there are similar plugins for other blogging systems, too.
On a related note, Angsuman’s Feed Copyrighter plugin can add a copyright message to your feeds, which would notify anyone visiting the scraper’s site that the content may have been stolen and where to find the original.
Then what?
But locating stolen content is only part of the problem. What do you do when you find a scraped article? Going after each individual splogger is tiresome and largely ineffective – you might succeed in taking down that one blog, but the splogger might not even notice. As they say, blackhat webmastering is all about volume – and sploggers often have hundreds, if not thousands of scraper sites.
You might cause them some trouble if you get their advertising account banned, but that would only be a temporary inconvenience for an experienced spammer. Besides, advertising income is only one of the reasons to run a splog – for example, spam blogs can also be used to get a large volume of links to other sites.
It would be more effective if scrapers couldn’t get your content at all. This is the hard part, and no good solutions exist yet. The AntiLeech plugin (site down) used to offer a partial solution, but it has since been discontinued. So I’m going to theorize a bit 🙂
The Concept
Here’s my view on what a good anti-scraper system would be like –
Firstly, it needs to be centralized and adaptive, learning both from automatically collected data and user feedback. Any standalone, single-user system just wouldn’t be able to keep up as new scrapers and splogs appear every day.
Secondly, the purpose of this system would be to prevent scrapers and sploggers from acquiring the actual contents of RSS feeds, and possibly even of the blogs themselves. Instead of outright blocking scrapers it should serve some kind of fake content (like AntiLeech does), e.g. a link to the owner’s site, 100 lines of “lorem ipsum” or whatever you choose. Giving sploggers fake content is more effective than blocking them, because blocks are easier to detect – if they are fed faux content instead, scrapers might not notice they’re in trouble.
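To illustrate, a feed endpoint could swap in filler text for flagged requests, something like the sketch below. The `is_scraper` check is a hypothetical placeholder for whatever detection logic (discussed next) the system ends up using.

```python
LOREM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 20

def render_feed_item(real_html, request_ip, user_agent, blog_url, is_scraper):
    """Serve the real post to ordinary readers, filler plus a link back to suspected scrapers."""
    if is_scraper(request_ip, user_agent):
        # The scraper still receives *something*, so the block is harder to notice.
        return "<p>{}</p><p>Originally published at {}</p>".format(LOREM, blog_url)
    return real_html

# Example: flag everything coming from one suspicious address
print(render_feed_item("<p>Real post</p>", "203.0.113.7", "SomeBot/1.0",
                       "http://example.com/blog",
                       lambda ip, ua: ip == "203.0.113.7"))
```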
So how do you differentiate between legitimate readers and sploggers? Before I answer this I must note that doing this with 100% certainty is not possible in principle. We can only hope that this detection can be done well enough to make scraping sufficiently hard and discourage most sploggers from doing it.
One approach is to check the User-Agent of the application accessing an RSS feed or a webpage. Legitimate web browsers and other applications usually send predefined, well-known User-Agent values. Some scrapers also have their own unique User-Agent strings, so this value can be used to identify them. On the other hand, it’s trivial to spoof the User-Agent. Therefore this value can only be used to prove “guilt” (being a scraper), but not to prove “innocence”.
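A check like this could sit in front of the feed; the User-Agent substrings below are made-up placeholders, since a real deployment would pull the list from the shared, centralized blacklist and update it as new scrapers appear.

```python
# Placeholder substrings for known scraper software (made-up examples).
SCRAPER_UA_SUBSTRINGS = ["BadScraperBot", "ContentGrabber"]

def user_agent_looks_guilty(user_agent):
    """A match proves 'guilt'; no match proves nothing, since the header is trivial to spoof."""
    ua = user_agent.lower()
    return any(s.lower() in ua for s in SCRAPER_UA_SUBSTRINGS)

print(user_agent_looks_guilty("BadScraperBot/2.1"))   # True
print(user_agent_looks_guilty("Mozilla/5.0 (X11)"))   # False -- but not proof of innocence
```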
What’s left is devising some kind of IP-based identification algorithm. My idea – extend the “digital signature” concept by adding the reader’s IP address, and possibly the User-Agent, to the posts in the RSS feed. Everyone who accesses the feed gets a slightly different signature, possibly encrypted with an asymmetric encryption algorithm for added security. When a user or a specialized module of the system finds a splog, they can extract the IP address from the digital fingerprint and add it to a blacklist.
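Here is a minimal sketch of how such a per-reader fingerprint could be generated and later decoded. Instead of the asymmetric encryption mentioned above, it settles for an HMAC-signed token built with the Python standard library, so the IP is merely tamper-evident rather than hidden; the secret key and example values are made up.

```python
import base64
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-real-secret"  # known only to the blog owner / the central service

def make_reader_fingerprint(ip, user_agent):
    """Build a per-reader token: IP, a short hash of the User-Agent, a timestamp and an HMAC tag."""
    ua_hash = hashlib.md5(user_agent.encode("utf-8")).hexdigest()[:8]
    payload = "{}|{}|{}".format(ip, ua_hash, int(time.time()))
    tag = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
    return base64.urlsafe_b64encode("{}|{}".format(payload, tag).encode("utf-8")).decode("ascii")

def extract_ip(fingerprint):
    """Decode a fingerprint found on a splog; return the IP it was issued to, if the tag checks out."""
    try:
        payload, _, tag = base64.urlsafe_b64decode(fingerprint).decode("utf-8").rpartition("|")
    except Exception:
        return None
    expected = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
    if not hmac.compare_digest(tag, expected):
        return None
    return payload.split("|")[0]

token = make_reader_fingerprint("203.0.113.7", "Mozilla/5.0 (example)")
print(token, "->", extract_ip(token))
```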
Too bad it’s not so simple. I can immediately see two caveats – open proxies and dynamic IP addresses. If you try to blacklist them all, some “good” readers that use them will be unjustly punished. And sploggers will still be able to find new, unlisted proxies. This problem is also commonly encountered by antispam tools, and as far as I know, there is no reliable solution.
Another problem is feed republishing, like FeedBurner does. When FeedBurner fetches the feed it would be tagged with their IP and scrapers could safely grab the republished feed from FeedBurner. To make the proposed content protection technique effective, all major feed republishers would have to support it.
Overall I think this is an interesting concept, but there are some practical problems that would make reliable implementation hard.
Update: I just found the Copyfeed plugin (English / German), which does some of the things mentioned above.
Bonus Crazy Idea
Instead of putting the text of your posts in your RSS feed, use a picture of your post. Humans can still read it, but scrapers will have nothing to work with. No keywords, no content = no SE traffic, no rankings and no AdSense income for the sploggers!
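Just to show it’s technically trivial, here’s a rough sketch that renders post text onto an image using the Pillow library (my choice for the example; any image library would do). Font, size and layout are arbitrary defaults.

```python
# Requires the Pillow library (pip install pillow).
from PIL import Image, ImageDraw

def post_to_image(text, path):
    """Render the post text onto a plain white image that only humans can read easily."""
    img = Image.new("RGB", (800, 600), "white")
    draw = ImageDraw.Draw(img)
    draw.text((20, 20), text, fill="black")  # default bitmap font; line wrapping omitted for brevity
    img.save(path)

post_to_image("This is the full text of the post...", "post.png")
```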