IMDb Rating Prediction SystemGithub

This project was a term-long project for SENG 474 Data Mining. The goal of this project was to create a machine learning solution that involved EDA, preprocessing, and modelling. The main focus was finding a way to deal with categorical data, given the IMDb dataset contains many text-related, non-numerical features. For categorical features with a discrete number of values, the chosen route was to one-hot-encode each possible value. For categorical features with a large number of possible values (discovered during EDA), the chosen way to handle them was by one-hot-encoding the top N most frequent values and aggregate the rest into an "other" feature.

After modeling, the obtained results showed that a regression single hidden layer neural network achieved the best Mean Average Error (MAE). This was the chosen metric as it best reflected the rating on a scale from 0.0 to 10.0. The results could have been further improved by choosing a different method for encoding categorical and text-based features. If I were to do this project again, I would encoded most text based features using an embedding layer or by integrating an NLP transformer such as BERT.

The deliverables of this project included a Machine Learning Poster and Research Paper. Both of these can be seen below.

Machine Learning Poster




Machine Learning Research Paper