
Introduction

Objective and Data

The goal of this project is to develop a model using tidymodels that can predict users’ ratings of movies in the IMDb database with the greatest accuracy possible. To do so, three different models were fit, tuned, and compared to determine the best performing model: a k-nearest-neighbors model (KNN), a random forest model (RF), and a boosted trees model (BT).

The data comes from Kaggle (source) and comprises four different source files:

  • movies: Information on 85,856 movies, including things like title, publication date, producers, main actors, length, budget, average rating, etc.
  • names: Information on 297,710 actors, including things like name, date of birth, number of spouses, number of children, etc.
  • ratings: Detailed breakdown on the ratings for each of the 85k movies, like the distribution of votes (how many people gave it a 1, 2, … etc.), demographic information about the raters, US vs international voters, etc.
  • title_principles: A database that links each movie with its principal actors, director, writers, etc.

Only the movies dataset is useful for our purposes. The other datasets contain mainly categorical variables (e.g., principal actor) with thousands of levels that occur infrequently, making them relatively uninformative and likely to be present in the training set but not the test set. This issue was also encountered within the movies dataset and led to the exclusion of several interesting variables (e.g., language, director, etc.). This issue will be discussed in further detail below.

Note: because of the pandemic, there are very few movies from 2020 in the current dataset.

Features

For the reasons stated above, some categorical variables were excluded from the model. In addition, as will be demonstrated in the exploratory data analysis below, some continuous variables had to be excluded due to high levels of missingness. The variables that were included in the model were:

  • avg_vote – the outcome measure: movie rating based on user scores
  • year – the year the movie was released
  • duration – movie length
  • votes – number of user ratings the movie has received on IMDb
  • metascore – critics’ ratings of the movie
  • reviews_from_users – number of written reviews left on the site for a movie by users
  • reviews_from_critics – number of written reviews left on the site for a movie by critics
  • usa_gross_income – domestic gross income
  • worlwide_gross_income – worldwide gross income
  • budget – movie budget
  • top_genre – the film’s primary genre

Exploratory Data Analysis

In this EDA I am primarily concerned with exploring missingness and determining whether the categorical predictors are viable for inclusion in the model.

We’ll begin, however, by checking the distribution of our numeric variables:
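A check along these lines produces faceted histograms of the numeric columns. This is a minimal sketch: it assumes the raw data has been loaded into a tibble I'll call movies, which is not necessarily the name used in the original code.

library(tidyverse)

movies %>%
  select(where(is.numeric)) %>%                        # numeric columns only
  pivot_longer(everything(),
               names_to = "variable", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")              # one panel per variable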

Interestingly, our outcome variable avg_vote appears nearly normally distributed. Most of the other variables are highly skewed, suggesting we should stick to nonlinear models if we want to avoid transforming all our predictors.

Checking missingness

The below figure plots the proportion of missing values for each variable in the dataset.
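A sketch of how such a plot could be built, again assuming the hypothetical movies tibble (packages such as naniar offer ready-made alternatives):

movies %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%   # proportion of NAs per column
  pivot_longer(everything(),
               names_to = "variable", values_to = "prop_missing") %>%
  ggplot(aes(x = reorder(variable, prop_missing), y = prop_missing)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Proportion missing")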

Unfortunately, several variables that seem likely to be very useful to the model (e.g., budget, gross_income) have extremely large numbers of missing observations.

Categorical predictors

Next, let’s check the number of levels in our categorical variables.
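One way to count the distinct levels per categorical column (a sketch using the same assumed movies tibble):

movies %>%
  select(where(is.character)) %>%                      # categorical columns
  summarise(across(everything(), n_distinct)) %>%      # unique levels in each
  pivot_longer(everything(),
               names_to = "variable", values_to = "n_levels") %>%
  arrange(desc(n_levels))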


Unfortunately, most of the categorical predictors have far too many levels to be useful to our model (even collapsing rare levels with step_other did not make them workable).

One exception is the top_genre variable. This is a variable I created out of the original genre variable, which listed many genres for each movie (that is, a single cell might contain a string like “Drama, Comedy, Action”). I took the first genre listed for each movie and interpreted this to be the primary genre. This modified variable resulted in a total of 23 different genres, making it reasonable to include in the model. A similar method for reducing levels was not possible for the other categorical variables.
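The extraction step might look roughly like this. It is a sketch: I'm assuming the original column is called genre and that stringr (loaded with the tidyverse) is available.

movies <- movies %>%
  mutate(top_genre = str_trim(str_extract(genre, "^[^,]+")))   # keep the text before the first comma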

Let’s check out the distribution of genres in the dataset:
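A quick tabulation (sketch) gives the counts behind the plot:

movies %>%
  count(top_genre, sort = TRUE)   # genres ordered from most to least common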

While far from an even distribution, it should still be suitable for our purposes. It is also interesting to note that the distribution appears to follow a power law, and that Drama, Comedy, and Action are the most common genres by far. Since all the models I’ll be fitting are nonlinear, we don’t need to worry about the skewed distribution.

Feature correlations

Lastly, let’s check out how our numeric variables correlate.
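A correlation matrix along these lines underlies the plot (a sketch; pairwise deletion is used here because of the missing values noted above):

movies %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%   # pairwise deletion of NAs
  round(2)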

Some of our predictors are highly correlated with one another–another good argument for avoiding a linear model. Hopefully tuning mtry over a large range will help minimize the effects of these correlations in our random forest models.

Conclusions from the EDA

The EDA makes clear that missingness and categorical predictors with huge numbers of levels are a problem in this dataset. In order to move forward, the following decisions were made:

  • The only categorical predictor included in the model was the top_genre. All other predictors were numeric.
  • All missingness was eliminated from the data: only observations with values for all variables were retained (a sketch of this step appears after this list).
  • Many of the predictors had highly skewed distributions. This makes the choice of models tested here appropriate since they handle nonlinearity well.
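The reduction step might look like this. It is a sketch; the *_tidy column names follow the predictor list given in the model-fitting section below, and movies_clean is a hypothetical object name.

movies_clean <- movies %>%
  select(avg_vote, year, duration, votes, metascore,
         reviews_from_users, reviews_from_critics,
         usa_gross_income_tidy, worlwide_gross_income_tidy,
         budget_tidy, top_genre) %>%
  drop_na()   # keep only complete cases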


Before moving on, let’s check the distribution of the numeric variables in our reduced dataset:

Importantly, the distributions do not appear to have changed significantly from what they were in the full dataset (see above). This gives us some confidence that we have not drastically changed the relationship between the predictors and the outcome variable by cutting down the data.

Model fitting

Procedure

The following procedure was used (a tidymodels sketch of these steps appears after the list):

  • An 80/20 split was used between training and test sets, with stratification on the outcome variable
  • 10-fold cross validation with 5 repeats was used on the training set
  • The outcome variable was avg_vote (average user rating of the movie)
  • 10 predictor variables were included (1 categorical, 9 numeric) as well as all possible 2-way interactions between the numeric predictors. The predictors included were: top_genre, year, votes, duration, metascore, reviews_from_users, reviews_from_critics, usa_gross_income_tidy, worlwide_gross_income_tidy, budget_tidy.
  • All numeric predictors were normalized
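A tidymodels sketch of these steps. Object names, the random seed, and the exact recipe-step ordering are my assumptions, not necessarily the original code.

library(tidymodels)

set.seed(123)   # hypothetical seed

# 80/20 split, stratified on the outcome
movie_split <- initial_split(movies_clean, prop = 0.8, strata = avg_vote)
movie_train <- training(movie_split)
movie_test  <- testing(movie_split)

# 10-fold cross-validation with 5 repeats
movie_folds <- vfold_cv(movie_train, v = 10, repeats = 5, strata = avg_vote)

# recipe: 2-way interactions among numeric predictors, dummy-code genre, normalize
movie_recipe <- recipe(avg_vote ~ ., data = movie_train) %>%
  step_interact(~ all_numeric_predictors():all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())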

Model results

K-Nearest-Neighbors

The neighbors hyperparameter was tuned over the range of 1 to 25. The tuning process returned the following best-performing model, indicating the optimal number of neighbors was 22, with an RMSE of .614:

## # A tibble: 1 × 7
##   neighbors .metric .estimator  mean     n std_err .config              
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1        22 rmse    standard   0.614    50 0.00438 Preprocessor1_Model09
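For reference, a model spec and tuning call along these lines could generate results of this form. This is a sketch: the kknn engine and the movie_recipe / movie_folds objects from the procedure sketch above are assumptions.

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_wf <- workflow() %>%
  add_recipe(movie_recipe) %>%
  add_model(knn_spec)

knn_res <- tune_grid(
  knn_wf,
  resamples = movie_folds,
  grid = tibble(neighbors = 1:25),   # tune neighbors over 1-25
  metrics = metric_set(rmse, rsq)
)

show_best(knn_res, metric = "rmse", n = 1)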

Fitting the model to the test data, we get an RMSE of .57 and an RSQ of .66.

## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.566
## 2 rsq     standard       0.655
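Finalizing on the best neighbors value and evaluating on the held-out test set could be done roughly like this (sketch):

best_knn <- select_best(knn_res, metric = "rmse")

knn_final <- knn_wf %>%
  finalize_workflow(best_knn) %>%
  last_fit(movie_split)   # fit on the full training set, evaluate on the test set

collect_metrics(knn_final)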

Plotting predicted by observed, we see fairly terrible performance on the lower tail and better performance on the upper tail. The model appears to overpredict for unpopular movies and underpredict for popular movies.

Random Forest

The mtry hyperparameter was tuned over a range of 1 to 51 and the min_n hyperparameter was tuned over a range of 2 to 40. The tuning process returned the following best-performing model, indicating the optimal values of mtry and min_n were 26 and 2, respectively. The model had an RMSE of .562 on the training data.

## # A tibble: 1 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    26     2 rmse    standard   0.562    50 0.00404 Preprocessor1_Model03
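A random forest spec and tuning call of roughly this shape could produce such results. The ranger engine, tree count, and grid resolution are my assumptions.

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wf <- workflow() %>%
  add_recipe(movie_recipe) %>%
  add_model(rf_spec)

rf_res <- tune_grid(
  rf_wf,
  resamples = movie_folds,
  grid = grid_regular(mtry(range = c(1, 51)),
                      min_n(range = c(2, 40)),
                      levels = 5),
  metrics = metric_set(rmse, rsq)
)

show_best(rf_res, metric = "rmse", n = 1)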

Fitting the model to the test data, we get an RMSE of .51 and an RSQ of .71.

## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.513
## 2 rsq     standard       0.713

Plotting predicted by observed, we see slightly better performance on the tails compared to the KNN model and an overall tighter correlation, but a similar pattern of overpredicting unpopular movies and underpredicting popular movies.

Boosted Trees

The mtry hyperparameter was tuned over a range of 1 to 51, the min_n hyperparameter was tuned over a range of 2 to 40, and the learning rate was tuned over a range of -5 to 0.2 on the log10 scale (roughly 0.00001 to 1.6). The tuning process returned the following best-performing model, indicating the optimal values of mtry, min_n, and learn_rate were 51, 40, and .63, respectively. The model had an RMSE of .590 on the training data.

It’s clear from the below plot that the only parameter with a meaningful effect on performance was the learning rate.

## # A tibble: 1 × 9
##    mtry min_n learn_rate .metric .estimator  mean     n std_err .config         
##   <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1    51    40      0.631 rmse    standard   0.590    50 0.00483 Preprocessor1_M…
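A boosted-tree spec consistent with this tuning setup might look like the following. The xgboost engine, tree count, and grid resolution are assumptions; note that learn_rate is specified on the log10 scale, so -5 to 0.2 corresponds to roughly 0.00001 to 1.6.

bt_spec <- boost_tree(mtry = tune(), min_n = tune(),
                      learn_rate = tune(), trees = 500) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

bt_wf <- workflow() %>%
  add_recipe(movie_recipe) %>%
  add_model(bt_spec)

bt_res <- tune_grid(
  bt_wf,
  resamples = movie_folds,
  grid = grid_regular(mtry(range = c(1, 51)),
                      min_n(range = c(2, 40)),
                      learn_rate(range = c(-5, 0.2)),   # log10 scale
                      levels = 5),
  metrics = metric_set(rmse, rsq)
)

show_best(bt_res, metric = "rmse", n = 1)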

Fitting the model to the test data, we get an RMSE of .54 and an RSQ of .69. Once again, performance increases with movie popularity; however, the BT model does not display the same pattern of over/underprediction that the other two models showed.

## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.542
## 2 rsq     standard       0.686

Comparing models

Below is a summary of all three models’ performance, sorted from best to worst:

## # A tibble: 3 × 3
##   model          rmse   rsq
##   <chr>         <dbl> <dbl>
## 1 Random Forest 0.513 0.713
## 2 Boosted Trees 0.542 0.686
## 3 KNN           0.566 0.655
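A table like this can be assembled from the three test-set fits. This is a sketch: rf_final and bt_final are assumed to be last_fit results analogous to knn_final above.

bind_rows(
  collect_metrics(knn_final) %>% mutate(model = "KNN"),
  collect_metrics(rf_final)  %>% mutate(model = "Random Forest"),
  collect_metrics(bt_final)  %>% mutate(model = "Boosted Trees")
) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  arrange(rmse)   # best (lowest RMSE) first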


The random forest model performed best overall…

The 10-fold cross-validation with 5 repeats afforded us very low standard errors on our model fits during training (see above), so we can say with roughly 95% confidence that all three models differ significantly from one another in performance. The random forest model performed best on both the RMSE and RSQ metrics. We conclude that the RF model makes the most accurate predictions, on average, of a movie’s user rating.
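As a rough back-of-the-envelope check (ignoring the correlation induced by evaluating all models on the same resamples), even the closest pair of training RMSEs, KNN versus BT, comfortably exceeds the two-standard-error threshold:

\[
|0.614 - 0.590| = 0.024 > 2\sqrt{0.00438^2 + 0.00483^2} \approx 0.013
\]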

…but perhaps not meaningfully so

All the models performed better than expected. The outcome variable (average votes) was on a scale of 1-10, so RMSEs in the .5 range are arguably not too bad, especially considering we only trained on a fraction of the entire dataset due to missingness (and on a small number of variables).

While the difference between models is statistically significant, an important question is whether those differences are practically significant. One reason to care is training time: the difference in RMSE between the worst-performing model (KNN) and the best-performing model (RF) was about .05. But the KNN model only took about a minute to train, while the random forest model took 7.5 hours!

Therefore, I think a strong argument can be made that the increased accuracy provided by the RF model was not worth the additional training time (moreover, the BT model performed better than the KNN model and only took about an hour to train).

Performance at the tails was best for the boosted trees model

The models differed in their performance at the tails. As shown by the above \(Predicted \sim Observed\) plots, the KNN model performed very poorly at the tails, over-predicting on the low end and under-predicting at the high end. The RF model followed the same pattern, though to a lesser extent. The BT model, however, was less skewed at the tails, and did not tend to over- or under-predict at either end. Thus, despite its lower overall accuracy, one might prefer the BT model if accurate predictions at the extremes (i.e., very low or very high movie ratings) are valued over greater average accuracy. The shorter training time of the BT model compared to the RF model makes this benefit weigh even more heavily.

No overfitting was detected

One interesting thing is that all three models performed better on the test set than on the training set. This alleviates concerns about overfitting, but raises questions about why. One possible answer is nonindependence between samples arising in the cross-validation procedure; another is that the resampling estimates come from models fit to only nine tenths of the training data, whereas the final model was fit to all of it.

Conclusion

Based on these results, I don’t think there is a single clear answer as to which model should be used. Instead, one should select based on how important training time, overall accuracy, and accuracy at the tails are to them. If overall accuracy (i.e., accuracy near the mean values) is most important, then the RF model is the clear winner. If training time is a key consideration, however, then KNN (or possibly BT) is the better choice. If performance at the tails (i.e., accurate predictions for very low- or high-rated movies) matters most, then the BT model is the best choice.

Issues and limitations

Too much missingness, too many levels!

High levels of missingness and categorical variables with thousands of levels caused a lot of problems in this project. Both led to failures during model fitting because of values that were present in some folds and not others (and/or present in the training set but not the test set). This led me to exclude all observations containing NAs, reducing the dataset to less than one tenth of its original size.

This is obviously not desirable, and the relative lack of data likely reduced model performance. At the same time, however, reducing the size of the dataset so severely was necessary from a computational standpoint. Nevertheless, it would have been better to subset randomly, rather than by eliminating all NAs, which could possibly (probably) bias the model.

Performance at the tails

One outstanding question regards performance at the tails. Why was performance consistently lower for unpopular movies and higher for popular movies? One can think of several possible explanations. First, while the outcome variable looks roughly normal, it is slightly right-skewed; stratification was used, however, which should address this problem.

Another possibility is that there is more variability in the predictors for unpopular movies than for popular movies. However, I ran a few scatter plots to check this hypothesis and actually found the opposite pattern–see the below example. There was far more variability among the highly-popular movies than among the unpopular ones. Perhaps this variability proved beneficial to the model: a predictor must vary in order to have any predictive power, and if it takes essentially the same value across all levels of the outcome, it cannot explain any variance.

What the scatter plots do show, however, is the vast imbalance in the number of observations for lower-popularity movies compared to more popular movies. Perhaps this was too much for stratification to overcome.



