A Machine Learning Approach to Predict IMDb Ratings from Movie Scripts

Harinda, Janith

URI: http://dlib.iit.ac.lk/xmlui/handle/123456789/2910

Date: 2025

Abstract:

Accurately predicting a movie’s success before its release remains a key challenge in the film industry. Traditional models rely heavily on post-release feedback or superficial metadata, often ignoring the narrative richness embedded in the script. This project addresses that gap by developing a machine learning pipeline that predicts IMDb rating classes from raw movie scripts and associated metadata, enabling data-driven decision-making in the pre-production phase. The system extracts structural, emotional, and linguistic features from scripts using a custom feature engineering pipeline. Semantic understanding was further enhanced using three embedding techniques: TF-IDF, BERT, and Sentence Transformers. Structured metadata such as genre, director, cast, and country was integrated with the engineered features. Machine learning models including Random Forest, XGBoost, and Gradient Boosting were trained on these inputs, with label encoding for categorical fields, SMOTE for class balancing, and hyperparameter tuning. A prototype web interface was also developed using Streamlit and FastAPI, allowing users to upload scripts and receive predictions in real time from a deployed model. The best-performing model combined Sentence Transformer embeddings with Random Forest, achieving an accuracy of 85% and a macro F1-score of 0.85. This result, along with the real-time interface, demonstrates that combining deep semantic script analysis with structured metadata can effectively predict IMDb ratings prior to release, offering practical value to producers and analysts.
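The two metrics reported in the abstract, accuracy and macro F1-score, can be illustrated with a minimal sketch. The record does not state how IMDb ratings were binned into classes or which library computed the scores, so the class names ("low", "medium", "high") and the toy labels below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Each class counts equally, regardless of how many samples it has,
    which is why macro F1 is often reported for imbalanced rating classes.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical rating classes; not taken from the project's data.
y_true = ["low", "low", "medium", "medium", "high", "high"]
y_pred = ["low", "medium", "medium", "medium", "high", "low"]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

Because macro F1 averages per-class scores without weighting by class size, it penalizes a model that performs well only on the most common rating class, which complements the SMOTE-based class balancing mentioned in the abstract.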
