The latest issue of Scientific American has an interesting article, Manipulation of the Crowd: How Trustworthy Are Online Ratings?, a topic of interest to any brewery that's ever received a bad review on either Beer Advocate or Rate Beer. Intuitively, it's seemed to me that the overall quality of the ratings on those sites has been improving as they've matured and built up their numbers of users and reviews.
According to Scientific American, the bad news is that most review-driven websites don't reflect the expected statistical bell curve (the distribution you'd see if the ratings accurately captured people's true opinions); the good news is that online beer reviews prove the exception to the rule and are, in fact, fairly and reasonably accurate.
The philosophy behind such rating sites is known as the “crowdsourcing strategy,” the idea being that the “truest and most accurate evaluations will come from aggregating the opinions of a large and diverse group of people.” But according to Eric K. Clemons of the Wharton School of the University of Pennsylvania, ratings sites like Amazon, TripAdvisor and Yelp “suffer from a number of inherent biases.”
- Disproportion: “People who rate purchases have already made the purchase. Therefore, they are disposed to like the product. ‘I happen to love Larry Niven novels,’ [professor Eric K.] Clemons says. ‘So whenever Larry Niven has a novel out, I buy it. Other fans do, too, and so the initial reviews are very high—five stars.’ The high ratings draw people who would never have considered a science-fiction novel. And if they hate it, their spite could lead to an overcorrection, with a spate of one-star ratings.”
- Polarization: “People tend not to review things they find merely satisfactory. They evangelize what they love and trash things they hate. These feelings lead to a lot of one- and five-star reviews of the same product.”
- Oligarchy of the Enthusiastic: “A small percentage of users account for a huge majority of the reviews. These super-reviewers—often celebrated with ‘Top Reviewer’ badges and ranked against one another to encourage their participation—each contribute thousands of reviews, ultimately drowning out the voices of more typical users (95 percent of Amazon reviewers have rated fewer than eight products). ‘There is nothing to say that these people are good at what they do,’ [computer scientist Vassilis] Kostakos says. ‘They just do a lot of it.’ What appears to be a wise crowd is just an oligarchy of the enthusiastic.”
Yelp, the one I’ve heard people complain about most consistently, apparently has some of the worst transparency issues, and there’s a “perception that the company itself might be manipulating the playing field.”
A separate look at Netflix user data, Dissecting the Netflix Dataset, found some of the same relationships in how people rate the films they rent from Netflix. For example, the average rating for a film is 3.8 (out of 5), neatly fitting the bell-curve results of studies like the one mentioned in Scientific American:
A controlled offline survey of some of these supposedly polarizing products revealed that individuals’ true opinions fit a bell-shaped curve—ratings cluster around three or four, with fewer scores of two and almost no ones and fives. Self-selected online voting creates an artificial judgment gap; as in modern politics, only the loudest voices at the furthest ends of the spectrum seem to get heard.
A similar look at IMDb ratings, Mining gold from the Internet Movie Database, part 1: decoding user ratings, saw complementary results and a similar-looking bell curve. The average rating on the IMDb was 6.2 (out of 10) and the median was 6.4.
It seems that the more popular a ratings website is, and consequently the more reviews it gets, the more reliable the results are, or at least the better they fit the bell-curve distribution that reviews from non-online sources usually produce. The higher the number of reviews, the less heavily the fringe reviews at either end of the spectrum are weighted. Unless, of course, a beer just plain sucks or everyone agrees on how terrific it is, but that's most likely a pretty rare situation.
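To put a rough number on that intuition, here's a quick back-of-the-envelope sketch of my own (not anything from the Scientific American article) showing how little a single one-star review can move a beer's average once enough ratings have piled up. The 4.0 starting average is just an assumed example.

```python
# My own illustration (assumed numbers): how much a single 1-star "fringe"
# review moves a beer's average score as the total number of reviews grows,
# assuming the existing reviews average 4.0 out of 5.

def average_after_one_star(existing_reviews: int, existing_avg: float = 4.0) -> float:
    """Return the new mean after adding a single 1-star rating."""
    total = existing_avg * existing_reviews + 1.0
    return total / (existing_reviews + 1)

for n in (5, 50, 500, 5000):
    print(f"{n:>5} reviews: average drops from 4.00 to {average_after_one_star(n):.2f}")

# Output:
#     5 reviews: average drops from 4.00 to 3.50
#    50 reviews: average drops from 4.00 to 3.94
#   500 reviews: average drops from 4.00 to 3.99
#  5000 reviews: average drops from 4.00 to 4.00
```

With only a handful of reviews, one cranky rating drags the score down half a star; with a few thousand, it barely registers.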
But, as I said at the outset, the good news is that those problems with online ratings apparently don't apply to the beer ratings websites, which are specifically mentioned as an instance where the crowdsourcing model does work.
RateBeer.com, which has attracted some 3,000 members who have rated at least 100 beers each; all but the most obscure beers have been evaluated hundreds or thousands of times. The voluminous data set is virtually manipulation-proof, and the site’s passionate users tend to post on all beers they try—not just ones they love or hate.
I’m quite certain those numbers would be similar for Beer Advocate, too, of course, suggesting that for both of the most popular beer ratings websites the results have become reasonably reliable, especially for the beers that have been most heavily reviewed. For new beers with just a few reviews, obviously, it wouldn’t automatically be as reliable, but the only way to build up reviews is to start somewhere. And that’s where looking more carefully at the reviewers becomes more important. A beer with only 5 reviews where all 5 reviewers are experienced would arguably be judged differently from one where all 5 reviewers were rookies or had very little experience. Obviously, the number of reviews a person has done is no guarantee that his or her reviews are better or more reliable, but it stands to reason that anyone who takes something seriously and continues to practice it will improve over time. And like craft beer itself, the longer it’s been around, the better it gets. It’s nice to see some scientific support to confirm that intuition.