Abstract:
Environmental, Social and Governance (ESG) factors are non-financial elements that attract growing investor interest, as investors increasingly incorporate them into their analyses to uncover significant risks and growth prospects. However, existing ESG rating systems suffer from inconsistency and restricted data sources, which jeopardise their accuracy and dependability. To address these difficulties, this project proposes a transparent multimodal deep learning pipeline that aims to improve accuracy by combining structured (numeric) and unstructured (text) data sources. A multistage pipeline was developed to collect and preprocess company reports, filter ESG-related text using domain-specific keywords, and extract numeric financial data from Yahoo Finance. The text branch encodes disclosures with a section-aware ESG BERT and learns a compact signal via ridge regression. The numeric branch models fundamentals with XGBoost. The evaluation uses GroupKFold by company ticker and out-of-fold (OOF) predictions to avoid firm-level leakage. Final predictions are produced by a ridge stacking meta-learner that fuses the text and numeric branches, while an early-fusion MLP serves as a deep baseline. Performance was evaluated across the ESG dimensions using standard metrics: R², Mean Absolute Error (MAE), and Mean Squared Error (MSE). Results show that numeric data provides a strong backbone, while text adds a consistent but modest lift, particularly for social factors. Across OOF folds, the stacked model attains R² of approximately 0.20 (environmental), 0.16 (social), and 0.07 (governance), outperforming text-only models and slightly improving upon numeric-only baselines. Explainability is integral to this project: SHAP identifies global and local numeric drivers, meta-weights from the stacker quantify modality reliance and expose cross-modal trust, and anchor-free sentence ranking surfaces disclosure snippets linked to predictions.
These results demonstrate the feasibility of using deep learning for automated ESG scoring, establish a performance baseline for further improvement, and contribute to explainable AI.
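The evaluation protocol summarised above (GroupKFold by company, out-of-fold predictions per branch, and a ridge meta-learner on top) can be illustrated with a minimal sketch. It is not the project's implementation: the data are synthetic, the array names (`X_text`, `X_num`, `groups`) are hypothetical, and scikit-learn's `GradientBoostingRegressor` is swapped in for XGBoost so that the example depends only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GroupKFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 40 firms x 5 yearly observations each.
n_firms, n_years = 40, 5
groups = np.repeat(np.arange(n_firms), n_years)  # hypothetical ticker ids
n = groups.size
X_text = rng.normal(size=(n, 16))  # placeholder for text embeddings
X_num = rng.normal(size=(n, 6))    # placeholder for financial fundamentals
y = X_num[:, 0] + 0.5 * X_text[:, 0] + rng.normal(scale=0.5, size=n)

# GroupKFold keeps every observation of a firm in the same fold,
# so no firm contributes to both training and test data.
cv = GroupKFold(n_splits=5)

# Out-of-fold predictions for each branch.
oof_text = cross_val_predict(Ridge(alpha=1.0), X_text, y, groups=groups, cv=cv)
oof_num = cross_val_predict(
    GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
    X_num, y, groups=groups, cv=cv,
)

# Ridge stacking meta-learner over the two branch predictions.
Z = np.column_stack([oof_text, oof_num])
oof_stack = cross_val_predict(Ridge(alpha=1.0), Z, y, groups=groups, cv=cv)

# Meta-weights: coefficients of a meta-learner fitted on all OOF predictions.
meta = Ridge(alpha=1.0).fit(Z, y)

for name, pred in [("text", oof_text), ("numeric", oof_num), ("stack", oof_stack)]:
    print(f"{name:8s} R2={r2_score(y, pred):.3f} MAE={mean_absolute_error(y, pred):.3f}")
print("meta-weights (text, numeric):", meta.coef_)
```

In this sketch the meta-learner's coefficients play the role of the meta-weights: a larger coefficient on a branch indicates heavier reliance on that modality. The stacking here is not fully nested, so a stricter protocol would regenerate the branch predictions inside each outer fold.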