Weeks 3 and 4 at Metis Bootcamp were definitely more demanding. This module combined two major topics — web scraping and linear regression. If you’re like me, with no math or stats background, some of the theories and concepts can seem abstract when explained only with equations. Luckily, I found a channel on YouTube — StatQuest — that explains the concepts with graphs and makes them much easier to understand. Hope this is helpful to you as well!
As with the last module, there’s a project due at the end of the second week. The second project at Metis is to scrape data from a website and build linear regression models that address a useful prediction and/or interpretation problem in a domain of interest, such as movies or sports.
As a movie lover, I have always enjoyed watching and discussing movies with friends. When it comes to professional movie critiquing, the late Mr. Roger Ebert reviewed more films than almost anyone and remains one of the best-known movie critics of all time. For this linear regression project, I wanted to analyze what affects Roger Ebert’s ratings and see whether we can predict the rating he would give a movie if he were alive today.
The primary dataset was web-scraped from the film critic’s website with Python’s BeautifulSoup and Selenium libraries. Once the data was collected and cleaned, I realized that there weren’t enough features to create a robust model. I wanted additional features such as user ratings and box office information. On Kaggle, I found a dataset containing MovieLens and IMDb information for the movies. Using the IMDb ID from this dataset, I was able to do a second round of scraping from IMDb for the additional features. I decided not to scrape Rotten Tomatoes ratings or Metacritic scores, since those averages already incorporate critics’ ratings. My original dataset contained 7,847 data points and 6 features. After merging and cleaning, I had 2,191 data points and 11 features.
- BeautifulSoup and Selenium for web-scraping
- pandas and NumPy for data manipulation
- pickle for data storage
- Matplotlib and seaborn for plotting
- scikit-learn and statsmodels for modeling and testing
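As a rough sketch of what the BeautifulSoup part of the scraping step looks like — note that the HTML snippet and the class names (`review`, `title`, `star-rating`) below are invented for illustration, not the actual markup of the critic’s site:

```python
from bs4 import BeautifulSoup

# A minimal static HTML snippet standing in for a fetched review-listing
# page; the structure and class names are hypothetical.
html = """
<div class="review">
  <a class="title" href="/reviews/life-of-pi-2012">Life of Pi</a>
  <span class="star-rating">4.0</span>
</div>
<div class="review">
  <a class="title" href="/reviews/pitch-perfect-2012">Pitch Perfect</a>
  <span class="star-rating">3.0</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for review in soup.select("div.review"):
    rows.append({
        "title": review.select_one("a.title").get_text(strip=True),
        "ebert_rating": float(review.select_one("span.star-rating").get_text()),
    })

print(rows)
```

In the real project, Selenium drives the browser to render and page through the listings, and the resulting page source is handed to BeautifulSoup for parsing like this.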
Each data point is an individual movie. The target variable is the Ebert rating (on a scale from 0.0 to 4.0). Of the 11 features, 3 are categorical — genre, sub-genre, and MPAA rating (converted into dummy variables during feature engineering). The numerical features are year (of release), runtime (in minutes), MovieLens rating (on a scale from 0.0 to 5.0), IMDb rating (on a scale from 0.0 to 10.0), budget, domestic gross, opening-week gross, and worldwide gross.
Looking at Ebert’s rating distribution, I see that he gave almost half of the movies a 3 to 3.5 rating, and he rarely gave very low ratings (0 to 0.5 stars), which may hurt the model’s predictions at the low end.
Using a pair plot and a heat map, I found that MovieLens rating and IMDb rating have the strongest correlation with the target variable. The other numerical features show no apparent linear relationship with the target, which suggests some feature engineering is needed.
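The correlation check can be sketched as below — the numbers here are a small toy stand-in, not the project’s actual data (which holds all 2,191 movies):

```python
import pandas as pd

# Toy stand-in data, constructed so the rating columns track each other.
df = pd.DataFrame({
    "ebert_rating":     [3.5, 2.0, 4.0, 1.5, 3.0, 2.5],
    "imdb_rating":      [7.9, 5.5, 8.6, 4.8, 7.0, 6.1],
    "movielens_rating": [3.8, 2.4, 4.2, 2.0, 3.3, 2.9],
    "runtime":          [127, 98, 142, 89, 110, 101],
})

# Pairwise Pearson correlations; the target column shows how strongly
# each feature tracks Ebert's rating.
corr = df.corr()
print(corr["ebert_rating"].sort_values(ascending=False))

# For the heat-map version used in the post:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```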
- Mapping the genre and sub-genre columns to fewer categories (to eliminate sparse outlier categories)
- Converting categorical features to dummy variables
- Creating new features through feature interaction (e.g., opening-week gross proportion, calculated as opening-week gross divided by cumulative worldwide gross)
- Power-transforming some numerical features to reduce the influence of outliers
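The steps above can be sketched in pandas as follows — a minimal illustration on made-up numbers, using a log transform as the power transform (the project may have used a different one):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; dollar figures are invented for illustration.
df = pd.DataFrame({
    "mpaa_rating": ["PG", "R", "PG-13"],
    "opening_week_gross": [22_000_000, 5_000_000, 48_000_000],
    "worldwide_gross": [609_000_000, 115_000_000, 880_000_000],
    "budget": [120_000_000, 17_000_000, 200_000_000],
})

# 1. Dummy-encode the categorical feature (drop_first avoids collinearity).
df = pd.get_dummies(df, columns=["mpaa_rating"], drop_first=True)

# 2. Interaction feature: share of worldwide gross earned in opening week.
df["opening_week_prop"] = df["opening_week_gross"] / df["worldwide_gross"]

# 3. Log transform to tame the heavy right skew of dollar amounts.
df["log_budget"] = np.log1p(df["budget"])

print(df.columns.tolist())
```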
My baseline model has an R-squared of 0.382, using the 9 features with p-values less than 0.05. I built another three models — polynomial, Ridge, and Lasso — and used 5-fold cross-validation to evaluate which model performs best. However, the results showed no major difference between the models: R-squared values were close and low, both across models and between train and validation sets. This suggests the models were underfitting. To improve, I increased complexity by adding more features and doing more feature engineering.
Besides the previous four models, I also tried applying Ridge and Lasso regularization to the polynomial model. Since the polynomial model’s training score was higher than its validation score, it was overfitting. So I used Ridge and Lasso and tuned the regularization strength, hoping to find the sweet spot in the bias-variance trade-off.
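Tuning the regularization strength on a polynomial model can be sketched as a scikit-learn pipeline with a grid search — again on made-up data, with an illustrative alpha grid rather than the values actually searched in the project:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data with a genuine quadratic term, so degree-2 features help.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=150)

pipe = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     StandardScaler(),
                     Lasso(max_iter=10_000))

# Sweep the regularization strength to find the bias-variance sweet spot.
grid = GridSearchCV(pipe,
                    param_grid={"lasso__alpha": [0.001, 0.01, 0.1, 1.0]},
                    cv=KFold(5, shuffle=True, random_state=0),
                    scoring="r2")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```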
Model Evaluation and Selection
After an iterative process of model refinement, tuning, and selection on the validation set, I finally had a winner — Lasso. It performed slightly better than the other models in cross-validation.
After retraining the Lasso model, I obtained an R-squared of 0.396 and a mean absolute error of 0.55, both slightly better than the baseline. In layman’s terms, the prediction is off by about half a star.
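Computing those two test-set metrics looks roughly like this — a minimal sketch on synthetic data, with an illustrative alpha rather than the tuned one:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic stand-in for the engineered feature matrix and target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.6, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
model = Lasso(alpha=0.01).fit(X_tr, y_tr)
pred = model.predict(X_te)

# R-squared measures variance explained; MAE is the average absolute
# miss, in the same units as the target (stars, in the project).
r2 = r2_score(y_te, pred)
mae = mean_absolute_error(y_te, pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.3f}")
```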
The prediction plot also shows that some of the lower-rating predictions improved slightly. Given that the data contains few low-rating examples, it remains a challenge for the model to predict lower ratings accurately.
Here are some examples of how the model did:
Notice that Life of Pi and Pitch Perfect are within 0.5 stars of the actual rating, but A Nightmare on Elm Street, which has a lower rating, gets a weaker prediction.
And here, just for fun, I used the model to predict Ebert’s rating on some of my recent favorites…
Overall, I think my prediction model is not the best, but I gained some insights from this project. One is that linear regression may not be the best prediction model for this dataset. Also, R-squared values are typically lower than 50% when predicting human behavior, since humans are simply harder to predict than physical processes (see “How to Interpret R-squared in Regression Analysis” by Jim Frost).
If I had more time, I’d first try non-linear prediction models, such as tree-based models (Decision Tree, Random Forest, etc.). Second, I would get more data points by filling in missing values in the original dataset and scraping more features. Lastly, I would build a Flask app to deploy the prediction model.
Don’t dwell on R-squared values too much. Although R-squared serves as a good validation metric, it has limitations — here’s a good article with the explanation. Always look at other metrics, such as mean absolute error and mean squared error, to better understand the fit of the model.
Overall, I have learned a great deal over the past two weeks. It was definitely not easy, but it was rewarding to get through this module. I hope this project is interesting and insightful to you. Thanks for reading :)
You can find my project code here.