Abstract:
The exponential growth of data in machine learning has introduced substantial challenges in computational cost, storage, and model training efficiency, particularly for large-scale and resource-intensive applications. Dataset distillation, which aims to synthesize smaller, representative datasets without sacrificing model performance, has emerged as a promising solution.
However, existing techniques often suffer from poor scalability, lack of fairness, and limited generalization across diverse model architectures. This research proposes a novel hybrid framework, PRIME-DD (Prioritized Representation Integration through a Multi-phase Hybrid Distillation of Importance Weights and Trajectory Signals for Scalable Dataset Compression), that integrates Importance-Aware Adaptive Distillation (IADD) with Trajectory Matching (TM) to address these limitations. The proposed system dynamically identifies and retains high-value data points based on uncertainty, misclassification, and class balance, while aligning the training trajectories of student models with those of the teacher for improved convergence.
A fully functional prototype was developed using PyTorch and React, with real-time distillation feedback, synthetic sample visualization, and historical benchmarking support. The framework was evaluated on benchmark datasets, including CIFAR-10, CIFAR-100, SVHN, MNIST, and FashionMNIST, and tested across multiple architectures to validate generalization and robustness. Experimental results demonstrate significant improvements in dataset compactness, training time, and cross-architecture accuracy, while also incorporating mechanisms to reduce sampling bias and improve fairness. This research not only advances the state of dataset distillation but also contributes a scalable, explainable, and ethically sound solution suitable for real-world machine learning systems.