The Problems with Product Ratings

Ratings on the internet (on Amazon, Netflix, IMDB, etc.) can be very valuable, but there are also big problems with them that people aren’t as aware of as they should be. Here, I’ll discuss two major issues with online ratings.

Problem 1: The Different People Problem

Probably the biggest problem with ratings is that different people rate different sorts of products. For example, the people who rate The Selfish Gene (4.5 stars on Amazon) are a somewhat different group from the people who rate The Tipping Point (4.3 stars on Amazon). So if you are more like the sort of person who would read The Selfish Gene than the sort who would read The Tipping Point, then the group that rates the latter may not be a good reference class for you. Hence the 4.3-star rating for The Tipping Point may not be a very reliable indicator of whether you’d like it.

In other words, ratings are more meaningful the more similar you are to the sort of person that typically rates that sort of product. If you don’t have much experience with that sort of product, you may not even be able to judge whether you are like that sort of person. Therefore, on average, we should predict ratings to be less useful for us when we’re buying things that “people like ourselves” don’t typically buy.

The sort of people who rate a given item is not even stable over time. For instance, it’s well known that IMDB movie ratings tend to fall over time, and IMDB itself provides an interesting discussion of this [1]. For big blockbuster films, this seems to be because eager, self-selected fans tend to see the movie right away, giving it artificially high initial ratings. For smaller foreign films, it tends to be because, as interest in the movie spreads beyond its country of origin, viewers from other countries tend to like it less than the originally intended audience does.

In an extreme case, the ratings can be fundamentally misleading. For instance, suppose that the sort of people who are likely to rate X tend to be really picky and so give low ratings, whereas the sort of people who are likely to rate Y tend to be easy to please and give high ratings. You’d then expect Y to have a higher rating relative to X than it deserves, simply because different demographics are rating each.
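To make the picky-versus-easygoing scenario concrete, here is a small illustrative sketch. Everything in it is a made-up assumption: the idea that each item has a single "true quality" number, the per-rater offsets, and the function name observed_average. It only shows how two equally good items can end up with very different averages when different groups rate them.

```python
def observed_average(true_quality, rater_offsets):
    """Average star rating when each rater shifts the true quality by a personal offset.

    Hypothetical model: rating = true quality + rater's offset, clamped to 1..5 stars.
    """
    ratings = [min(5.0, max(1.0, true_quality + offset)) for offset in rater_offsets]
    return sum(ratings) / len(ratings)


# Both items are assumed to be equally good.
true_quality = 3.5

# Invented offsets: X attracts picky raters, Y attracts easy-to-please raters.
picky_offsets = [-1.0, -0.5, -1.0, -0.5]
easy_offsets = [0.5, 1.0, 0.5, 0.0]

x_average = observed_average(true_quality, picky_offsets)
y_average = observed_average(true_quality, easy_offsets)

print(f"X: {x_average:.2f} stars, Y: {y_average:.2f} stars")
```

With these invented numbers, X averages 2.75 stars and Y averages 4.00 stars even though the assumed quality is identical, so the gap comes entirely from who is doing the rating.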

Another big problem occurs when a product is not the sort that most users would bother to rate. Then you can end up with a situation where the ratings come mainly from very unusual purchasers (e.g., people who use the item for a non-standard purpose, or the tiny fraction of people whose product had a defect). Most users may be very satisfied, but you have no way to tell that.

Of course, none of this makes ratings useless (they should still have some correlation with quality), but it does make them hard to interpret. What’s more, the sorts of problems mentioned above persist even if something has a very large number of ratings, since these are forms of bias in the rating process itself. You may be treating large numbers of ratings as a more reliable predictor of quality than you should.

Problem 2: The Number of Ratings Problem

The second big problem I see with ratings is that websites typically make no adjustment to their rankings or search results based on the number of ratings that a product has received. So at face value on Amazon, How to Have a Good Day (4.8 stars with just seven ratings) looks better than The Motivation Switch (4.2 stars with 186 ratings) unless you notice how many ratings each has and attempt some kind of mental adjustment (unfortunately, this mental adjustment is tricky to do). In this case, if you sort by “rating,” then How to Have a Good Day will come up much higher, whereas if you sort by popularity, The Motivation Switch will come up higher. It’s not clear which of those sort orders is better, as both are highly flawed.

So which is actually better, 4.8 stars with seven ratings or 4.2 stars with 186? It’s quite unclear. If we don’t get any other information and have to predict whether the former will still have a higher average rating than the latter once a lot more ratings roll in for both, we might end up assigning pretty close to a 50/50 chance.

There are neat strategies to correct for this “number of ratings” problem, but for some reason, websites very rarely use them. Some argue it’s because they want to make sure that newer products with few ratings can still be easily discovered (and indeed surely there is value in making sure new products are found), but whatever the reason, it makes it hard for consumers to find the best products.

For instance, one solution to this problem is to use Bayesian methods, where you begin with a “prior” for each category (i.e., a distribution describing how often items in that category end up with different average ratings) and then adjust this prior based on the votes an item has so far (so the more ratings the item receives, the more you rely on its own ratings rather than this prior).
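As a rough illustration of the Bayesian idea, here is a minimal sketch of one common simplification, sometimes called a "Bayesian average." Instead of a full prior distribution, it assumes the prior can be summarized by two numbers: the category’s typical average rating (prior_mean) and how many ratings’ worth of weight to give it (prior_weight). Both values below, and the function name, are hypothetical choices for illustration, not anything a particular website is known to use.

```python
def bayesian_average(item_mean, item_count, prior_mean, prior_weight):
    """Blend an item's own average with a category-wide prior.

    The prior acts like `prior_weight` extra ratings at `prior_mean` stars,
    so items with few ratings stay close to the prior and items with many
    ratings are dominated by their own average.
    """
    total_weight = prior_weight + item_count
    return (prior_weight * prior_mean + item_count * item_mean) / total_weight


# Hypothetical prior: books in this category average 4.0 stars,
# and we trust that prior about as much as 20 ratings.
PRIOR_MEAN = 4.0
PRIOR_WEIGHT = 20

few_ratings = bayesian_average(4.8, 7, PRIOR_MEAN, PRIOR_WEIGHT)
many_ratings = bayesian_average(4.2, 186, PRIOR_MEAN, PRIOR_WEIGHT)

print(f"4.8 stars, 7 ratings   -> adjusted {few_ratings:.2f}")
print(f"4.2 stars, 186 ratings -> adjusted {many_ratings:.2f}")
```

Under these assumed prior values, the 0.6-star gap between the two books shrinks to a few hundredths of a star, and a stronger or lower prior would reverse the order. The result is only as good as the prior, which in practice would need to be estimated from real data for each category.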

Another strategy would be to compute a confidence interval around each item’s mean, such as:

sort_rating = average_rating - C * standard_deviation(ratings) / sqrt(number_of_ratings)

Here C is some constant, such as C = 1.96 (to mimic the lower end of a 95% confidence interval). This effectively penalizes items for having either a small number of ratings or a large standard deviation (variability) of ratings.
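The lower-bound formula above can be sketched in a few lines of Python. The function name and the example rating lists are invented for illustration, and the choice to reject items with fewer than two ratings is my own assumption (the sample standard deviation is undefined for a single rating).

```python
import math
import statistics


def lower_bound_score(ratings, c=1.96):
    """Mean rating minus c standard errors, used as a conservative sort key."""
    n = len(ratings)
    if n < 2:
        # Assumption: with fewer than two ratings there is no spread to measure.
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    std_dev = statistics.stdev(ratings)  # sample standard deviation
    return mean - c * std_dev / math.sqrt(n)


# Invented data: the same mix of ratings, seen 1 time versus 50 times.
small_sample = [5, 4, 5, 3]
large_sample = small_sample * 50

print(f"{len(small_sample)} ratings:   {lower_bound_score(small_sample):.2f}")
print(f"{len(large_sample)} ratings: {lower_bound_score(large_sample):.2f}")
```

Both samples have the same average, but the larger one gets a higher score because its standard error is much smaller. One caveat of this simple version: if every rating is identical, the standard deviation is zero and the item receives no penalty at all, however few ratings it has.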

Products with few ratings are especially problematic, though, because it’s common for a person’s friends to rate their product. This doesn’t matter much if there are hundreds of ratings, but when there are only a few, it can distort the average very substantially. So items with only a few ratings tend to be even more unreliably rated than their small number of ratings would imply, because bias is more likely.

[1] http://www.imdb.com/help/show_leaf?highinitialvote


