Abstract:
Accurately predicting a movie’s success before its release remains a key challenge in the film
industry. Traditional models rely heavily on post-release feedback or superficial metadata, often
ignoring the narrative richness embedded in the script. This project addresses that gap by
developing a machine learning pipeline to predict IMDb rating classes using raw movie scripts
and associated metadata, enabling data-driven decision-making in the pre-production phase.
The system extracts structural, emotional, and linguistic features from scripts using a custom
feature engineering pipeline. Semantic understanding was further enhanced using three
embedding techniques: TF-IDF, BERT, and Sentence Transformers. Structured metadata such as
genre, director, cast, and country were integrated with the engineered features. Machine learning
models including Random Forest, XGBoost, and Gradient Boosting were trained using these
inputs, along with techniques such as label encoding, SMOTE for class balancing, and
hyperparameter tuning. A prototype web interface was also developed using Streamlit and
FastAPI, allowing users to upload scripts and receive predictions in real time from a deployed
model.
The best-performing model combined Sentence Transformer embeddings with Random Forest,
achieving an accuracy of 85% and a macro F1-score of 0.85. This result, along with the real-time
interface, demonstrates that combining deep semantic script analysis with structured metadata
can effectively predict IMDb rating classes prior to release, offering practical value to producers and
analysts.
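
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and macro F1-score are computed for a multi-class problem. It is not the project's evaluation code: the function names and the class labels ("low", "medium", "high") are assumptions made for illustration, and the toy predictions are invented. Macro F1 is the unweighted mean of the per-class F1 scores, so every rating class counts equally regardless of how many scripts fall into it.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical rating classes and predictions, for illustration only.
y_true = ["low", "low", "medium", "medium", "high", "high"]
y_pred = ["low", "medium", "medium", "medium", "high", "low"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

On this toy data the per-class F1 scores are 0.5 (low), 0.8 (medium), and about 0.667 (high), giving an accuracy of about 0.667 and a macro F1 of about 0.656. When the two metrics are close, as in the reported 85% and 0.85, performance is roughly even across classes rather than concentrated in the most frequent one.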