Abstract:
Oro-facial dysfunction is an early, yet frequently under-recognised, manifestation of
amyotrophic lateral sclerosis (ALS). Existing clinical assessments rely on subjective visual
ratings and static photographs, offering limited sensitivity to the subtle, transient movement
deficits that precede overt speech or swallowing problems. While recent computer-vision
studies have shown promise for Parkinson’s disease, a rigorous, explainable, video-based
approach tailored to ALS remains absent.
This dissertation presents a lightweight spatiotemporal pipeline that couples GPU-accelerated
TV-L1 optical-flow extraction with a fine-tuned VideoMAE vision transformer. Frame-level
motion features are aggregated with calibrated probabilistic voting to yield subject-level
predictions, while transformer attention roll-out provides clinician-friendly saliency overlays.
The system is trained under a strict three-fold cross-validation regime to mitigate data scarcity
and overfitting, and hyperparameters are optimised via grid search without synthetic class
balancing.
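The subject-level aggregation step described above can be illustrated with a minimal sketch. This is not the dissertation's implementation; it assumes the simplest form of calibrated probabilistic (soft) voting, namely averaging per-frame probabilities and thresholding the mean, and the function name is hypothetical.

```python
import numpy as np

def aggregate_subject_prediction(frame_probs, threshold=0.5):
    """Illustrative soft-voting aggregation (hypothetical helper):
    average calibrated per-frame ALS probabilities for one subject's
    video, then threshold the mean to obtain a subject-level label."""
    mean_prob = float(np.mean(frame_probs))
    return mean_prob, int(mean_prob >= threshold)

# Example: calibrated probabilities for frames sampled from one video
probs = [0.91, 0.86, 0.73, 0.95]
mean_p, label = aggregate_subject_prediction(probs)
# mean_p = 0.8625, label = 1 (predicted ALS)
```

In practice the frame probabilities would come from the fine-tuned VideoMAE classifier after calibration; the averaging-and-threshold rule shown here is one common choice among several voting schemes.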
Evaluated on a subset of the Toronto NeuroFace database (ALS = 60 videos; healthy = 60
videos) and an expanded internal hold-out set, the proposed model achieves a mean AUC of
1.000, an F1-score of 0.9785, and a subject-level accuracy of 98.04 %. Motion-energy bar plots reveal a strong correlation (ρ = 0.84) between optical-flow magnitude and expert bulbar-function ratings, supporting clinical validity. These results indicate that transformer-guided optical-flow analysis can furnish a practical, objective screening aid for early ALS-related oro-facial impairment, laying the groundwork for larger-scale, multi-centre studies.