The goal of this project is to develop a model using tidymodels that can predict users’ ratings of movies in the IMDb database with the greatest accuracy possible. To do so, three different models were fit, tuned, and compared to determine the best performing model: a k-nearest-neighbors model (KNN), a random forest model (RF), and a boosted trees model (BT).
The data comes from Kaggle (source) and comprises four different source files:

- movies: Information on 85,856 movies, including things like title, publication date, producers, main actors, length, budget, average rating, etc.
- names: Information on 297,710 actors, including things like name, date of birth, number of spouses, number of children, etc.
- ratings: Detailed breakdown of the ratings for each of the 85k movies, like the distribution of votes (how many people gave it a 1, 2, etc.), demographic information about the raters, US vs. international voters, etc.
- title_principles: A database that links each movie with its principal actors, director, writers, etc.

Only the movies dataset is useful for our purposes. The other datasets contain mainly categorical variables (e.g., principal actor) with thousands of levels that occur infrequently, making them relatively uninformative and likely to be present in the training set but not the test set. This issue was also encountered within the movies dataset and led to the exclusion of several interesting variables (e.g., language, director, etc.). This issue will be discussed in further detail below.
Note: because of the pandemic, there are very few movies from 2020 in the current dataset.
For the reasons stated above, some categorical variables were excluded from the model. In addition, as will be demonstrated in the exploratory data analysis below, some continuous variables had to be excluded due to high levels of missingness. The variables that were included in the model were:

- avg_vote – the outcome measure: movie rating based on user scores
- year – the year the movie was released
- duration – movie length
- votes – number of user ratings the movie has received on IMDb
- metascore – critics’ ratings of the movie
- reviews_from_users – number of written reviews left on the site for a movie by users
- reviews_from_critics – number of written reviews left on the site for a movie by critics
- usa_gross_income – domestic gross income
- worlwide_gross_income – worldwide gross income
- budget – movie budget
- top_genre – the film’s primary genre

In this EDA I am primarily concerned with exploring missingness and determining whether the categorical predictors are viable for inclusion in the model.
We’ll begin, however, by checking the distribution of our numeric variables:
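A minimal sketch of how these histograms might be produced, assuming the Kaggle file has been read into a data frame called movies (the object name is an assumption):

library(tidyverse)

# Reshape the numeric columns to long format and facet one histogram per variable
movies %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ variable, scales = "free")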
Interestingly, our outcome variable avg_vote
appears
nearly normally distributed. Most of the other variables are highly
skewed, suggesting we should stick to nonlinear models if we want to
avoid transforming all our predictors.
The below figure plots the proportion of missing values for each variable in the dataset.
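One way such a plot could be produced (again assuming the data frame is called movies):

library(tidyverse)

# Proportion of missing values per column, plotted as a sorted bar chart
movies %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "prop_missing") %>%
  ggplot(aes(x = reorder(variable, prop_missing), y = prop_missing)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Proportion missing")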
Unfortunately, several variables that seem likely to be very useful
to the model (e.g., budget
, gross_income
) have
extremely large numbers of missing observations.
Next, let’s check the number of levels in our categorical variables.
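A quick way to count the distinct values of each character column (a sketch, data frame name assumed as before):

library(tidyverse)

# Number of distinct levels in each character column, sorted from most to fewest
movies %>%
  summarise(across(where(is.character), n_distinct)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_levels") %>%
  arrange(desc(n_levels))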
Unfortunately, most of the categorical predictors have far too many levels to be useful to our model (even collapsing rare levels with step_other did not make these variables workable).
One exception is the top_genre
variable. This is a
variable I created out of the original genre
variable,
which listed many genres for each movie (that is, a single cell might
contain a string like “Drama, Comedy, Action”). I took the first genre
listed for each movie and interpreted this to be the primary genre. This
modified variable resulted in a total of 23 different genres, making it
reasonable to include in the model. A similar method for reducing levels
was not possible for the other categorical variables.
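A sketch of how top_genre can be derived from the comma-separated genre column (the exact code is illustrative):

library(tidyverse)

# Keep only the first entry of the comma-separated genre string
# (e.g., "Drama, Comedy, Action" -> "Drama")
movies <- movies %>%
  mutate(top_genre = str_trim(str_extract(genre, "^[^,]+")))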
Let’s check out the distribution of genres in the dataset:
While far from an even distribution, it should still be suitable for our purposes. It is also interesting to note that the distribution appears to follow a power law, and that Drama, Comedy, and Action are the most common genres by far. Since all the models I’ll be fitting are nonlinear, we don’t need to worry about the skewed distribution.
Lastly, let’s check out how our numeric variables correlate.
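One way to compute the pairwise correlations (a sketch; pairwise complete-case handling is an assumption):

library(tidyverse)

# Correlation matrix of the numeric columns, using pairwise complete observations
movies %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)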
Some of our predictors are highly correlated with one another, which is another good argument for avoiding a linear model. Hopefully tuning mtry over a large range will help minimize the effects of these correlations in our random forest models.
The EDA makes clear that missingness and categorical predictors with huge numbers of levels are a problem in this dataset. In order to move forward, the following decisions were made:

- Observations containing missing values were dropped, substantially reducing the size of the dataset.
- All categorical predictors were excluded except top_genre. All other predictors were numeric.
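A sketch of this reduction step. The *_tidy columns are assumed to be numeric versions of the raw currency strings; parse_number() is one plausible way to create them, and the object name movies_reduced is an assumption:

library(tidyverse)

# Parse the currency strings to numbers, keep the selected variables,
# and drop any row with a missing value
movies_reduced <- movies %>%
  mutate(
    usa_gross_income_tidy      = parse_number(usa_gross_income),
    worlwide_gross_income_tidy = parse_number(worlwide_gross_income),
    budget_tidy                = parse_number(budget)
  ) %>%
  select(avg_vote, top_genre, year, duration, votes, metascore,
         reviews_from_users, reviews_from_critics,
         usa_gross_income_tidy, worlwide_gross_income_tidy, budget_tidy) %>%
  drop_na()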
Before moving on, let’s check the distribution of the numeric
variables in our reduced dataset:
Importantly, the distributions do not appear to have changed significantly from what they were in the full dataset (see above). This gives us some confidence that we have not drastically changed the relationship between predictors and outcome variables as a result of our cutting down the data.
The following procedure was used:

- Outcome: avg_vote (average user rating of the movie)
- Predictors: top_genre, year, votes, duration, metascore, reviews_from_users, reviews_from_critics, usa_gross_income_tidy, worlwide_gross_income_tidy, budget_tidy

The neighbors hyperparameter was tuned over the range of 1 to 25. The tuning process returned the following best-performing model, indicating the optimal number of neighbors was 22, with an RMSE of .614:
## # A tibble: 1 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 22 rmse standard 0.614 50 0.00438 Preprocessor1_Model09
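For reference, a minimal sketch of how a KNN workflow along these lines could be specified and tuned with tidymodels. The split, recipe steps, and object names (movies_reduced, movies_train, movie_folds) are assumptions rather than the exact code used:

library(tidymodels)

set.seed(123)
movies_split <- initial_split(movies_reduced, strata = avg_vote)
movies_train <- training(movies_split)
movie_folds  <- vfold_cv(movies_train, v = 10, repeats = 5, strata = avg_vote)

# Normalize numeric predictors and dummy-code top_genre
knn_recipe <- recipe(avg_vote ~ ., data = movies_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_wflow <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec)

# Tune neighbors over 1 to 25 across the repeated CV folds
knn_tuned <- tune_grid(
  knn_wflow,
  resamples = movie_folds,
  grid      = tibble(neighbors = 1:25),
  metrics   = metric_set(rmse, rsq)
)

show_best(knn_tuned, metric = "rmse", n = 1)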
Fitting the model to the test data, we get an RMSE of .57 and an RSQ of .66.
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.566
## 2 rsq standard 0.655
Plotting predicted by observed, we see fairly terrible performance on the lower tail and better performance on the upper tail. The model appears to overpredict for unpopular movies and underpredict for popular movies.
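A sketch of the predicted-vs-observed plot, assuming knn_final_fit is the result of last_fit() on the KNN workflow (the object name is hypothetical):

library(tidymodels)

# Predicted vs. observed ratings on the test set, with a y = x reference line
knn_final_fit %>%
  collect_predictions() %>%
  ggplot(aes(x = avg_vote, y = .pred)) +
  geom_point(alpha = 0.2) +
  geom_abline(linetype = "dashed") +
  labs(x = "Observed rating", y = "Predicted rating")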
The mtry hyperparameter was tuned over a range of 1 to 51
and the min_n
hyperparameter was tuned over a range of 2 to
40. The tuning process returned the following best-performing model,
indicating the optimal values of mtry
and
min_n
were 26 and 2, respectively. The model had an RMSE of
.562 on the training data.
## # A tibble: 1 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 26 2 rmse standard 0.562 50 0.00404 Preprocessor1_Model03
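A sketch of how the random forest might be specified and tuned over these ranges, reusing movies_train and movie_folds from the KNN sketch above (the ranger engine, number of trees, and grid size are assumptions):

library(tidymodels)

rf_recipe <- recipe(avg_vote ~ ., data = movies_train) %>%
  step_dummy(all_nominal_predictors())

# mtry and min_n are tuned; trees = 500 is an assumed fixed value
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wflow <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec)

rf_grid <- grid_regular(
  mtry(range = c(1, 51)),
  min_n(range = c(2, 40)),
  levels = 5
)

rf_tuned <- tune_grid(rf_wflow, resamples = movie_folds, grid = rf_grid,
                      metrics = metric_set(rmse, rsq))
show_best(rf_tuned, metric = "rmse", n = 1)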
Fitting the model to the test data, we get an RMSE of .51 and an RSQ of .71.
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.513
## 2 rsq standard 0.713
Plotting predicted by observed, we see slightly better performance on the tails compared to the KNN model and overall tighter correlation, but a similar pattern of overpredicting unpopular movies and underpredicting popular movies.
The mtry hyperparameter was tuned over a range of 1 to 51, the min_n hyperparameter was tuned over a range of 2 to 40, and the learning rate was tuned over a range of -5 to .2 on the log10 scale. The tuning process returned the following best-performing model, indicating the optimal values of mtry, min_n, and learn_rate were 51, 40, and .63, respectively. The model had an RMSE of .590 on the training data.
It’s clear from the below plot that the only parameter that had a true effect on performance was the learning rate.
## # A tibble: 1 × 9
## mtry min_n learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 51 40 0.631 rmse standard 0.590 50 0.00483 Preprocessor1_M…
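A sketch of the boosted trees specification with these tuning ranges, again reusing the training objects from the sketches above (the xgboost engine and grid size are assumptions; learn_rate is tuned on the log10 scale):

library(tidymodels)

bt_recipe <- recipe(avg_vote ~ ., data = movies_train) %>%
  step_dummy(all_nominal_predictors())

bt_spec <- boost_tree(mtry = tune(), min_n = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

bt_wflow <- workflow() %>%
  add_recipe(bt_recipe) %>%
  add_model(bt_spec)

bt_grid <- grid_regular(
  mtry(range = c(1, 51)),
  min_n(range = c(2, 40)),
  learn_rate(range = c(-5, 0.2)),   # dials interprets this range on the log10 scale
  levels = 5
)

bt_tuned <- tune_grid(bt_wflow, resamples = movie_folds, grid = bt_grid,
                      metrics = metric_set(rmse, rsq))
show_best(bt_tuned, metric = "rmse", n = 1)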
Fitting the model to the test data, we get an RMSE of .54 and an RSQ of .69. Once again, performance increases with movie popularity; however, the BT model does not display the same pattern of over/underprediction that the other two models showed.
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.542
## 2 rsq standard 0.686
Below is a summary of all three models’ performance, sorted from best to worst:
## # A tibble: 3 × 3
## model rmse rsq
## <chr> <dbl> <dbl>
## 1 Random Forest 0.513 0.713
## 2 Boosted Trees 0.542 0.686
## 3 KNN 0.566 0.655
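A sketch of how this comparison table could be assembled from the three test-set fits (the *_final_fit object names are hypothetical last_fit() results):

library(tidymodels)

# Combine the test-set metrics from the three final fits into one table
bind_rows(
  collect_metrics(rf_final_fit)  %>% mutate(model = "Random Forest"),
  collect_metrics(bt_final_fit)  %>% mutate(model = "Boosted Trees"),
  collect_metrics(knn_final_fit) %>% mutate(model = "KNN")
) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  arrange(rmse)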
The 10-fold cross-validation with 5 repeats gave us very low standard errors on our resampled estimates during training (see above), and the resulting 95% confidence intervals for the three models’ RMSEs do not overlap, so we can say that all three models differ significantly from one another in performance. The random forest model performed best on both RMSE and RSQ metrics. We conclude that the RF model makes the most accurate predictions, on average, of a movie’s user rating.
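As a quick check, rough 95% intervals computed from the resampling means and standard errors reported above:

library(tidyverse)

# Normal-approximation 95% intervals (mean +/- 1.96 * SE); none of the three overlap
tibble(
  model   = c("Random Forest", "Boosted Trees", "KNN"),
  mean    = c(0.562, 0.590, 0.614),
  std_err = c(0.00404, 0.00483, 0.00438)
) %>%
  mutate(lower = mean - 1.96 * std_err,
         upper = mean + 1.96 * std_err)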
All the models performed better than expected. The outcome variable (average votes) was on a scale of 1-10, so RMSEs in the .5 range are arguably not too bad, especially considering we only trained on a fraction of the entire dataset due to missingness (and on a small number of variables).
While the difference between models is statistically significant, an important question is whether those differences are practically significant. One reason to care is training time: the difference in RMSE between the worst-performing model (KNN) and the best-performing model (RF) was about .05. But the KNN model only took about a minute to train, while the random forest model took 7.5 hours!
Therefore, I think a strong argument can be made that the increased accuracy provided by the RF model was not worth the additional training time (moreover, the BT model performed better than the KNN model and only took about an hour to train).
The models differed in their performance at the tails. As shown by the above \(Predicted \sim Observed\) plots, the KNN model performed very poorly at the tails, over-predicting on the low end and under-predicting on the high end. The RF model followed the same pattern, though to a lesser extent. The BT model, however, was less skewed at the tails, such that it did not tend to over-predict or under-predict at either extreme. Thus, despite having lower overall accuracy, one might prefer the BT model if accurate predictions at the extremes (i.e., very low or very high movie ratings) matter more than greater average accuracy. The shorter training time of the BT model compared to the RF model strengthens this case further.
One interesting thing is that all three models performed better on the test set than on the training set. This alleviates concerns about overfitting, but raises the question of why. One possible answer is nonindependence between samples arising from the repeated cross-validation procedure used during training.
Based on these results, I don’t think there is a clear single answer as to which model should be used. Instead, one should select based on how important training time, overall accuracy, or accuracy at the tails is to them. If overall accuracy (i.e., accuracy around the mean values) is most important, then the RF model is the clear winner. If training time is a key consideration, however, then KNN (or possibly BT) is the better choice. If performance at the tails (i.e., accurate predictions for very low- or high-rated movies) matters most, then the BT model is the best choice.
High levels of missingness and categorical variables with thousands of levels caused a lot of problems in this project. Both situations led to failures during model fitting, because some values or factor levels appeared in certain folds but not others (and/or in one of the training/test sets but not the other). This led me to exclude all rows with NAs, reducing the dataset to less than one tenth of its original size.
This is obviously not desirable, and the relative lack of data likely reduced model performance. At the same time, however, reducing the size of the dataset so severely was necessary from a computational standpoint. Nevertheless, it would have been better to subset randomly rather than by eliminating all NAs, which could possibly (probably) bias the model.
One outstanding question regards performance at the tails. Why was performance consistently lower for unpopular movies and higher for popular movies? One can think of several possible explanations. First, while the outcome variable looks roughly normal, it is slightly skewed to the right; but stratification was used, which should address this problem.
Another possibility is that there is more variability in the predictors for unpopular movies than for popular movies. However, I ran a few scatter plots to check this hypothesis and actually found the opposite pattern (see the example below): there was far more variability among the highly popular movies than among the unpopular ones. Perhaps this variability proved beneficial to the model; a variable must have a certain amount of variability in order to have any predictive power, and if it takes the same value at all levels of the outcome it won’t explain any variance.
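A sketch of the kind of scatter plot used here; the choice of votes as the predictor is purely illustrative:

library(tidyverse)

# Scatter plot of one predictor against the outcome to inspect how its
# variability changes across the rating range
movies_reduced %>%
  ggplot(aes(x = avg_vote, y = votes)) +
  geom_point(alpha = 0.1) +
  labs(x = "Average user rating", y = "Number of votes")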
What the scatter plots do show, however, is the vast imbalance in the number of observations for lower-popularity movies compared to more popular movies. Perhaps this was too much for stratification to overcome.