I made an automated system that polls the ratebeer website and extracts my (or anyone’s) beer rating data. This data is stored in a database and can then be manipulated however desired. See a dump of my data here, and see my current end-usage here. I’m willing to provide a nice table of data for anyone, and am happy to make it open source if people care (though if I do, no judgments allowed on my hacked-together coding). Future plans are to generate graphs and conclusions: average rating per style, rating counts per style, new beers rated per month of the year, etc. I have the data now; I just have to figure out how to use it.
If you want to learn more, read on. Otherwise, begone!
For better or for worse, I like and seem to be using the Ratebeer website. They have every beer ever created in their database and they make it easy to access. I often pull it up when in a store for a quick sanity check on whether a beer is decent. I also use it to keep track of new beers that I try. This is odd because I don’t take beer drinking nearly as seriously as most other people on the ratebeer site. I’m actually in constant fear that I will be permabanned for my typo-ridden, unique reviews (now this site ensures I have a backup of my data if that does happen…). I really just try to write reviews that remind me whether I generally like or dislike a style or beer. I’ll also include information about what I was doing at the time of drinking if relevant.
Anyways as I was creating my website I wanted to mimic the style of sites like flavors.me. Unfortunately for them, while I liked their idea, I disliked the amount they were charging for such an uncustomizable service. So I decided to create my own! I was able to find wordpress plugins for netflix and goodreads and picasa/flickr, but decided to go further. I wanted to put my beer information online. I already put it into the ratebeer website, so I wanted to plug it into my website. I found it imperative for some reason to enable strangers to learn about my drinking preferences. It also provided a healthy challenge.
Out With the Old
As it turns out, almost no one else on the internet had a similar desire. I did find something close: another guy who provided a service that would pull your ratebeer information into an RSS feed. This was nice, but it only provided very basic information. It was also totally dependent on a stranger who could pull their website at any time. Here is what it looked like on my site:
This was nice enough, but I wanted more customization. Ratebeer has an API, but it doesn’t provide your review text. If it does, oops, my work is obsolete. I know paid users can export their data, but that only exports the most basic information: no review text or detailed rating information. For fun, I decided to make my own solution.
In with the New
It took some work, but I’m happy with the result. I use a combination of the newest technologies available on the web: PHP ft. CURL + Regular Expressions and MySQL (okay, I just picked stuff I was familiar with). My finely tuned system goes to the ratebeer website and scrapes all of my rating data. It can actually scrape anyone’s data just as easily. I feel a bit bad about the first import because it puts a bit of extra load on the Ratebeer server (I have to fetch one page for every rated beer), but I cache the information in a database so it only gets hit hard once. My website will poll the ratebeer website once a day max for each user, and on that visit it will only fetch new beer ratings. (If you later update a rating, my database will never pick up the change; think of it as a feature.) It fetches everything it can. This includes:
- Beer Name
- URL to ratebeer page with your rating on top
- Numbers: overall rating plus component ratings
- Review text
- Review date
You have a ratebeer ID. Mine is 106321. To find your ID, go to My Account, then click on My Ratings. The link you just went to has your ID in the URL. Ratebeer has a nice predictable structure, so I can generate every other URL I need from that ID.
First off, I don’t want to hammer the ratebeer servers. My database is checked, and if I’ve updated the user in the last 24h, I stop. Honestly, you shouldn’t need any more frequent updates, you alcoholic. If it’s been more than 24h or if it’s a new user, I continue.
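That once-a-day check is simple enough to sketch. The real thing was PHP against MySQL; this is a rough Python stand-in, and the name `needs_update` and the idea of passing in a stored timestamp are assumptions for illustration, not the original code:

```python
from datetime import datetime, timedelta

# The post says at most one poll per user per day.
UPDATE_INTERVAL = timedelta(hours=24)

def needs_update(last_updated, now):
    """Decide whether to poll ratebeer for this user.

    last_updated is the stored timestamp of the previous poll,
    or None for a user we have never seen before.
    """
    if last_updated is None:
        return True  # new user: always fetch
    return now - last_updated > UPDATE_INTERVAL
```

Passing `now` in as an argument, rather than calling `datetime.now()` inside, keeps the check easy to test.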
With the ID, you can generate the URL for all rating pages. They follow the structure http://www.ratebeer.com/user/##ID##/ratings/##PAGE##/5/ . The page starts at 1 and will work forever. Pages past your ratings show up empty. I request all pages in sequential order and store a list of beers using some snazzy regular expressions. I also check to verify that the beers are new. Once I hit the last viewed beer in the database I stop.
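Here is roughly what that paging loop looks like, sketched in Python rather than the PHP I actually used. The URL template is the one above; everything else is an assumption: the `BEER_LINK` pattern is a hypothetical guess at the link markup, `fetch_page` is whatever function you supply to download a page, and the sketch assumes ratings are listed newest first.

```python
import re

RATINGS_URL = "http://www.ratebeer.com/user/{user_id}/ratings/{page}/5/"

# Hypothetical pattern: assumes each rated beer shows up as a link of the
# form /beer/<name>/<code>/<user id>/. The real markup may have differed.
BEER_LINK = re.compile(r'href="(/beer/[^"/]+/\d+/\d+/)"')

def ratings_url(user_id, page):
    """Build the URL of one page of a user's ratings (pages start at 1)."""
    return RATINGS_URL.format(user_id=user_id, page=page)

def collect_new_beers(user_id, fetch_page, last_seen=None):
    """Walk the rating pages in order and return beer links not yet stored.

    fetch_page(url) must return the page HTML as a string.
    last_seen is the most recent beer link already in the database.
    """
    new_beers = []
    page = 1
    while True:
        links = BEER_LINK.findall(fetch_page(ratings_url(user_id, page)))
        if not links:
            return new_beers  # empty page: we ran past the last rating
        for link in links:
            if link == last_seen:
                return new_beers  # everything from here on is already stored
            new_beers.append(link)
        page += 1
```

Taking the downloader as a parameter means the loop can be exercised with canned HTML, without touching the ratebeer servers at all.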
Then I iterate through the list of new beers. Each URL has a beer, some random code, and the user ID (which puts your review at the top- a key feature). I fetch that new URL, perform some regex wizardry and pull out the desired information. See an example of the parsing broken down here.
Once I have all of the new data, I stick it in the database and update the user file. For a more graphical update, check here. As a warning, you’ll only see something interesting if I haven’t been updated in the last 24h and if I have performed drinking.
Ta-da! The data is in the database! Now I re-own my data.
The front end page updates the user and then queries the database for the most recent 10 beers. It then spits it out in a pretty format.
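The store-then-query part can be sketched like this. It uses Python’s built-in `sqlite3` as a stand-in for MySQL, and the table layout, column names, and function names are all made up for illustration; my actual schema isn’t shown here.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open the cache database and create the (hypothetical) ratings table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS ratings (
               user_id     INTEGER NOT NULL,
               beer_url    TEXT NOT NULL,
               beer_name   TEXT NOT NULL,
               overall     REAL,
               review      TEXT,
               review_date TEXT,  -- ISO 8601, so text order is date order
               PRIMARY KEY (user_id, beer_url)
           )"""
    )
    return db

def store_ratings(db, user_id, ratings):
    """Insert newly scraped ratings; rows already present are left alone."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR IGNORE INTO ratings VALUES (?, ?, ?, ?, ?, ?)",
            [(user_id, r["url"], r["name"], r["overall"], r["review"], r["date"])
             for r in ratings],
        )

def recent_ratings(db, user_id, limit=10):
    """Return the most recent ratings for the front-end page."""
    cur = db.execute(
        "SELECT beer_name, overall, review_date FROM ratings "
        "WHERE user_id = ? ORDER BY review_date DESC LIMIT ?",
        (user_id, limit),
    )
    return cur.fetchall()
```

`INSERT OR IGNORE` gives the same behavior as the "feature" mentioned earlier: a rating that is already stored never gets overwritten by a later edit.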
Well if ratebeer ever changes their website structure I have to change my special sauce. Other issues include:
- Preventing XSS
- Parsing HTML
- Preventing the RB website from being unnecessarily hammered
- Character encoding which botched up all quote marks (thanks for using windows-1252 instead of utf8; you even provide a red herring by calling it ISO-8859-1)
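For the encoding and XSS items, the fix amounts to two small steps, sketched here in Python rather than the original PHP (the function names are hypothetical): decode the raw bytes as windows-1252 no matter what the page claims, and escape the review text before echoing it into my own HTML.

```python
import html

def decode_page(raw_bytes):
    """Decode a scraped page as windows-1252 (cp1252).

    Bytes such as 0x92 are curly quotes in windows-1252 but invisible
    control characters in ISO-8859-1, which is what mangles the quote
    marks. errors="replace" covers the few bytes cp1252 leaves undefined.
    """
    return raw_bytes.decode("cp1252", errors="replace")

def safe_for_html(text):
    """Escape scraped text so review content cannot inject markup."""
    return html.escape(text)

review = decode_page(b"It\x92s a <b>great</b> beer")
print(safe_for_html(review))
```

Once decoded, the text is ordinary Unicode and can be stored and served as UTF-8.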
Update – 2014
Unfortunately, Ratebeer changed their site structure, and my scripts broke. Since I don’t care about this old project that much, I just left my scripts broken. They’re poorly designed anyway. For now, this project is End of Life. I can’t grab any more data. It’s too bad!