DeviantArt, with its huge number of artworks and a large userbase, is just the kind of site that could use a good recommendation engine. A recommendation engine is basically a program that analyzes your tastes and recommends some images/products/whatever that you might like.
Unfortunately there don’t seem to be any official plans to create a recommendation system. So, being the naive creature that I am, I went ahead and started building my own recommendation engine for DA. Maybe I’m in over my head.
Seeing is believing
Here are some screenshots of recommendations that my current system generated. They’re all of the “people that liked this also liked that” type – deviation-based. The script can also make user-based recommendations – “based on your past favorites, you might like this” – but I won’t post those screenshots here, because suggestions the script made for me wouldn’t make a lot of sense to you 😛
Anyway, here we go. The “source” deviation has a red border, and the pictures are the top five generated recommendations. If you think I chose the best examples you are, of course, completely correct 😉
The algorithm could probably be improved by taking into account which categories each deviation belongs to, so that the suggestions stay thematically close to the source image.
I used the free version of Vogoo PHP Lib as the basis of the recommendation algorithm. Vogoo implements several collaborative filtering algorithms, including both item-based and user-based models. I have modified it to improve performance, because some of the original scripts do get sluggish when you have tens of thousands of rows in the DB. Sooner or later I’ll also start tweaking the suggestion algorithms – lots of room for experimenting there.
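For the curious, the core idea behind item-based collaborative filtering is simple enough to sketch in a few lines. This is a minimal Python illustration of the co-occurrence approach – not Vogoo's actual code, and the sample data is made up:

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Toy data: user -> set of faved deviation ids (not real DA users).
favs = {
    "alice": {"d1", "d2", "d3"},
    "bob":   {"d2", "d3", "d4"},
    "carol": {"d1", "d3"},
}

# Count how often each pair of deviations was faved by the same user.
pair_counts = defaultdict(int)
item_counts = defaultdict(int)
for items in favs.values():
    for item in items:
        item_counts[item] += 1
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] += 1

def similarity(a, b):
    """Cosine similarity between two deviations' fan bases."""
    co = pair_counts.get((min(a, b), max(a, b)), 0)
    return co / sqrt(item_counts[a] * item_counts[b])

def recommend(source, top_n=5):
    """'People who faved this also faved...' for one deviation."""
    others = (i for i in item_counts if i != source)
    return sorted(others, key=lambda i: similarity(source, i), reverse=True)[:top_n]

print(recommend("d1"))  # → ['d3', 'd2', 'd4']
```

A real implementation keeps the pair counts in a DB table and updates them incrementally – recomputing everything from scratch is exactly the kind of thing that gets sluggish with tens of thousands of rows.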
The rest of the setup is PHP + MySQL + Apache, all running on my PC (for now).
I’d love to put the system online and let other people check it out (once I manage to add at least a rudimentary user interface), but the harsh truth is that none of my shared hosting servers could possibly handle it. The script needs a lot of bandwidth and CPU power to support more than a couple of users effectively. And even if I had that, I might well run into trouble with DA for downloading thousands of RSS feeds non-stop.
I could get a VPS… which would cost ten times more than my current hosting. Hmm.
Anyway, let’s delve deeper into the technical aspects (or you can stop reading now if delving isn’t your thing 😉 ).
There are a few things to consider even before you can start daydreaming about how to generate the actual suggestions. One of them is choosing what to use as the source data, and how to obtain it. The first part is easy – your past favorites are a natural source of information about what kind of deviations you like. Getting that information is more complex. If the recommendation engine were developed by DA programmers, this wouldn’t be a problem at all – they could query the DeviantArt database(s) directly. However, a random hobbyist like me obviously can’t do the same, and DeviantArt doesn’t have an API. I resorted to using the RSS feeds of user favorites, and checking the “Who favorite’d this?” lists on individual deviations.
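Scraping one user’s favorites then boils down to fetching an RSS document and pulling the item links out of it. Here’s a rough Python sketch; the XML layout follows the standard RSS 2.0 structure, and the sample feed is a made-up stand-in rather than DA’s real format:

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def parse_favorites(rss_xml):
    """Extract (title, link) pairs from a standard RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]

def fetch_favorites(feed_url):
    """Download and parse one user's favorites feed."""
    with urlopen(feed_url) as resp:
        return parse_favorites(resp.read())

# Toy feed standing in for a real favorites RSS document:
sample = """<rss version="2.0"><channel>
  <item><title>Sunset Study</title><link>http://example.com/d/1</link></item>
  <item><title>Dragon Sketch</title><link>http://example.com/d/2</link></item>
</channel></rss>"""
print(parse_favorites(sample))
```

The “Who favorite’d this?” pages aren’t RSS, so those need plain HTML scraping instead – same idea, messier parsing.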
So I’ve got a way to access the favorites… and I’ve got a resource problem. There are millions of users and probably billions of recorded favorites on DA. It would take years to download all that through RSS feeds (if you don’t want to inadvertently DDoS DeviantArt) and a decent server farm to analyze it. I decided to be selective and only download the info that is reasonably relevant to the users of the suggestion engine (me and a few randomly chosen usernames). It goes like this:
- For every “active” user I look at which deviations (s)he recently +fav’ed.
- For every one of those deviations, I check what other users also favorited them.
- For each of those users, I also find what their latest favorites are.
Visually the algorithm could be imagined as a pyramid or an upside-down tree.
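The three steps above amount to a two-level breadth-first crawl. A sketch of the control flow, with the actual scrapers stubbed out (get_recent_favs and get_favers are hypothetical placeholders for the RSS and page scraping):

```python
def crawl(active_users, get_recent_favs, get_favers, max_fans=50):
    """Two-level crawl: active users -> their faved deviations ->
    other users who faved those -> those users' latest favorites.
    Returns a user -> set-of-deviations map for the recommender."""
    favorites = {}
    for user in active_users:
        favorites.setdefault(user, set()).update(get_recent_favs(user))
        for deviation in list(favorites[user]):
            # Cap the fan list so one hugely popular deviation
            # doesn't blow up the crawl.
            for fan in get_favers(deviation)[:max_fans]:
                if fan not in favorites:
                    favorites[fan] = set(get_recent_favs(fan))
    return favorites

# Toy stand-ins for the real scrapers:
favs_db = {"me": ["d1", "d2"], "fan1": ["d2", "d3"]}
favers_db = {"d1": ["fan1"], "d2": ["fan1"]}
result = crawl(["me"],
               lambda u: favs_db.get(u, []),
               lambda d: favers_db.get(d, []))
print(result)
```

The max_fans cap matters in practice: without it, a single front-page deviation with thousands of favoriters would dominate the whole download budget.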
How much is enough?
As far as I know, recommender algorithms work better with more data. On the other hand, there are technical limitations to how much information you can store and process. So how much information do you need to generate decent recommendations? Here’s my experience:
- 300 favs processed – Meh. You’d get better results by picking random pictures.
- 2500 favs processed – So-so. Three or four out of 40 images were pretty good.
- 43,500 favs processed – Finally getting somewhere! About 30% of the suggestions were worthy of a +fav.
By the way, it took more than 24 hours to gather the 43 thousand favorites. That’s partly because my connection is slow.
I wrote this post mainly because I wanted to see what reactions and comments (if any) I’d get. If there’s enough interest I might try and figure out how to get the script up and running on a public site somewhere. If nobody cares, well, at least I have another programmer’s toy to amuse myself with 🙂