Abstract:
Target tracking in dynamic environments is challenging because target motion, background
clutter, and lighting conditions all vary. Classic tracking models are complex, which makes
them hard to adapt and to interpret in real time. This project addresses these limitations
by proposing an explainable learning framework that combines semi-supervised and
unsupervised learning to improve tracking accuracy while keeping the model's decisions
interpretable.
The integrated approach combines unsupervised clustering (KMeans over CLIP-extracted
features) with semi-supervised fine-tuning of a classifier on diverse pre-processed
datasets. Pre-processing is informed by motion and context and consists of noise reduction,
data augmentation, and CLIP feature extraction. Furthermore, the framework employs
SHAP as an explainability technique to reveal individual feature contributions to tracking
decisions. To enhance robustness and adaptability to novel environmental conditions,
incremental fine-tuning via selective unfreezing of CLIP encoder layers and end-to-end
retraining is incorporated. The system is evaluated quantitatively by mean Average
Precision (mAP) for accuracy and by inference speed in frames per second (FPS),
demonstrating competitive performance and interpretability.
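
As an illustration of the two evaluation measures named above, the following is a minimal sketch rather than the evaluation code used in this work. It assumes that detections for each class are already sorted by descending confidence and already matched against ground truth, and it uses the non-interpolated form of Average Precision. The function names (average_precision, mean_average_precision, frames_per_second) and the class labels are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(is_match: List[bool], num_ground_truth: int) -> float:
    """Non-interpolated AP for a single class.

    is_match lists detections in descending confidence order; an entry is
    True when that detection was matched to a ground-truth target.
    """
    if num_ground_truth == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, matched in enumerate(is_match, start=1):
        if matched:
            hits += 1
            # Precision measured at the rank of each correct detection.
            precision_sum += hits / rank
    return precision_sum / num_ground_truth


def mean_average_precision(per_class: Dict[str, Tuple[List[bool], int]]) -> float:
    """Mean of the per-class AP values (mAP)."""
    aps = [average_precision(matches, n_gt) for matches, n_gt in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


def frames_per_second(num_frames: int, elapsed_seconds: float) -> float:
    """Inference speed as processed frames divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_frames / elapsed_seconds


# Hypothetical example: two classes with hand-made match lists.
example = {
    "pedestrian": ([True, False, True], 2),
    "vehicle": ([True, True], 2),
}
map_score = mean_average_precision(example)
fps = frames_per_second(120, 4.0)
```

In a full detection or tracking benchmark the match flags would typically come from an overlap threshold between predicted and ground-truth boxes; that matching step is outside the scope of this sketch.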