Abstract:
The exponential growth of data in machine learning has introduced substantial challenges in computational cost, storage, and model training efficiency, particularly for large-scale and resource-intensive applications. Dataset distillation, which aims to synthesize smaller, representative datasets without sacrificing model performance, has emerged as a promising solution.
However, existing techniques often suffer from poor scalability, lack of fairness, and limited generalization across diverse model architectures. This research proposes a novel hybrid framework, PRIME-DD (Prioritized Representation Integration through a Multi-phase Hybrid Distillation of Importance Weights and Trajectory Signals for Scalable Dataset Compression), that integrates Importance-Aware Adaptive Distillation (IADD) with Trajectory Matching (TM) to address these limitations. The proposed system dynamically identifies and retains high-value data points based on uncertainty, misclassification, and class balance, while aligning the training trajectories of student models with those of the teacher for improved convergence.
A fully functional prototype was developed using PyTorch and React, with real-time distillation feedback, synthetic sample visualization, and historical benchmarking support. The framework was evaluated on benchmark datasets, including CIFAR-10, CIFAR-100, SVHN, MNIST, and FashionMNIST, and tested across multiple architectures to validate generalization and robustness. Experimental results demonstrate significant improvements in dataset compactness, training time, and cross-architecture accuracy, while also incorporating mechanisms to reduce sampling bias and improve fairness. This research not only advances the state of dataset distillation but also contributes a scalable, explainable, and ethically sound solution suitable for real-world machine learning systems.