Bone fractures are a critical medical condition requiring early and accurate detection to ensure
timely treatment, yet conventional analysis relies on manual interpretation, which is time-
consuming and prone to error (Sharma et al., 2025). While deep learning models have been
applied to this problem, they often struggle with the small and complex datasets typical in
medical imaging, leading to unreliable results (Alwzwazy et al., 2025). The primary limitations
hindering their clinical adoption are a heavy dependence on large, expert-annotated datasets,
which are difficult to acquire, and a lack of model transparency, which erodes clinical trust
(Alwzwazy et al., 2025).
To address this gap, this research designed and developed a novel, end-to-end fracture detection
system built upon a Vision Transformer (ViT) architecture. This approach diverges from
traditional supervised methods by leveraging a domain-specific Self-Supervised Learning (SSL)
strategy, a technique that has shown significant potential to enhance clinical diagnostics (Wang
and Siddiqui, 2024). The core of the solution is a two-stage process. First, a standard ViT-
Base/16 encoder was pre-trained on a large corpus of unlabeled musculoskeletal radiographs
from the MURA dataset using a Masked Autoencoder (MAE) framework. This forces the
model to learn rich, high-level semantic features of radiographic anatomy without requiring
any human-provided labels. Subsequently, this pre-trained encoder was fine-tuned for fracture
classification using a smaller, labeled subset of the MURA dataset.
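For illustration, the random patch masking at the heart of the MAE pre-training stage can be sketched as follows. This is a minimal sketch, not the thesis implementation: the patch count assumes a ViT-Base/16 encoder on 224×224 radiographs (a 14×14 grid of 196 patches), the 75% mask ratio follows common MAE practice, and the function name is illustrative.

```python
import random

def random_patch_mask(num_patches=196, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets, as in MAE
    pre-training: the encoder sees only the visible patches, and the
    decoder must reconstruct the masked ones from that context."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

visible, masked = random_patch_mask(seed=0)
print(len(visible), len(masked))  # 49 visible patches, 147 masked
```

Because the reconstruction target is derived from the image itself, no human labels are needed at this stage, which is what allows the full unlabeled MURA corpus to be used.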
The developed SSL-ViT model was systematically evaluated on a hold-out validation set, demonstrating the viability and effectiveness of the proposed approach. The system achieved a validation accuracy of 86.90% and an area under the receiver operating characteristic curve (AUC-ROC) of 0.8686, indicating a strong capability to distinguish fractured from non-fractured cases. Analysis
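The reported AUC-ROC has a direct probabilistic reading: it is the probability that a randomly chosen fractured case receives a higher model score than a randomly chosen non-fractured case. A minimal sketch of that pairwise formulation (function name illustrative; ties receive half credit):

```python
def auc_roc(labels, scores):
    """AUC via the Mann-Whitney formulation: fraction of
    positive/negative pairs ranked correctly by the scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```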
of the training dynamics confirmed that the SSL pre-training provided a robust foundation,
enabling the model to learn effectively from limited labeled data. These results validate that the
combination of domain-adaptive self-supervised learning with Vision Transformers presents a
promising pathway toward creating more data-efficient, accurate, and trustworthy AI tools for
clinical diagnostics.