Knowledge Distillation: Effects of Alpha
Overview
Research project exploring how the alpha parameter in knowledge distillation affects student model performance. Investigates the optimal balance between learning from ground-truth labels and teacher model predictions.
Research Question
How does varying the parameter alpha in knowledge distillation affect the accuracy of a student model compared to training it from scratch or purely from teacher predictions?
Hypothesis
There exists an optimal range of alpha values that yields higher accuracy than training with hard labels alone, by leveraging the additional structure provided by the teacher’s soft predictions.
Knowledge Distillation Concepts
Core Components
- Teacher Model: Large, high-performing CNN trained to \(\sim\)99% accuracy on MNIST
- Student Model: Lightweight CNN (16 and 32 filters)
- Alpha (\(\alpha\)): Scalar in \([0,1]\) that interpolates between:
  - Student loss (ground-truth labels)
  - Distillation loss (soft teacher labels)
Loss Functions
The knowledge distillation framework combines two loss components:
Distillation Loss: Cross-entropy between softened student and teacher logits
\[\mathcal{L}_{\text{distill}} = \text{CE}\left(\text{softmax}\left(\frac{z_t}{T}\right), \text{softmax}\left(\frac{z_s}{T}\right)\right)\]
where \(z_s\) and \(z_t\) are the student and teacher logits, \(T\) is the temperature parameter, and \(\text{CE}(p, q) = -\sum_i p_i \log q_i\) denotes cross-entropy with the first argument as the target distribution (here, the teacher's softened predictions).
Student Loss: Cross-entropy between student predictions and hard labels
\[\mathcal{L}_{\text{student}} = \text{CE}(y_{\text{true}}, \text{softmax}(z_s))\]
Combined Loss: Weighted combination controlled by \(\alpha\)
\[\mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{student}} + (1-\alpha) \cdot \mathcal{L}_{\text{distill}}\]
Implementation
Framework
- TensorFlow/Keras: Primary ML framework
- Custom Distiller Class: Subclass of keras.Model for custom loss computation
- Temperature: Tunable softening parameter (default = 3.0)
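The report describes the Distiller as a keras.Model subclass that computes the combined loss itself; a minimal sketch of what such a subclass could look like is shown below. The loss classes, method structure, and returned metrics are illustrative assumptions rather than the project's exact code.

```python
import tensorflow as tf
from tensorflow import keras


class Distiller(keras.Model):
    """Wraps a trained teacher and a trainable student; blends the two losses via alpha."""

    def __init__(self, student, teacher, alpha=0.3, temperature=3.0):
        super().__init__()
        self.student = student
        self.teacher = teacher
        self.alpha = alpha
        self.temperature = temperature
        # Hard-label loss on integer class labels; the student outputs raw logits.
        self.student_loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        # Soft-label loss between two probability distributions (teacher is the target).
        self.distill_loss_fn = keras.losses.CategoricalCrossentropy()

    def train_step(self, data):
        x, y = data
        teacher_logits = self.teacher(x, training=False)  # teacher stays frozen
        with tf.GradientTape() as tape:
            student_logits = self.student(x, training=True)
            student_loss = self.student_loss_fn(y, student_logits)
            distill_loss = self.distill_loss_fn(
                tf.nn.softmax(teacher_logits / self.temperature),
                tf.nn.softmax(student_logits / self.temperature),
            )
            # L_total = alpha * L_student + (1 - alpha) * L_distill
            total_loss = self.alpha * student_loss + (1.0 - self.alpha) * distill_loss
        grads = tape.gradient(total_loss, self.student.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.student.trainable_variables))
        return {"total_loss": total_loss,
                "student_loss": student_loss,
                "distill_loss": distill_loss}

    def call(self, inputs, training=False):
        return self.student(inputs, training=training)
```

Because the losses are built inside train_step, compiling the wrapper with only an optimizer is sufficient.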
Model Architecture
Teacher Model:
- 2-layer CNN with 256 and 512 filters
- Trained for 1000 epochs
- Achieved \(\sim\)98.8% test accuracy
Student Model:
- Lightweight 2-layer CNN with 16 and 32 filters
- Trained for 100 to 1000 epochs, depending on the experiment
- Various \(\alpha\) values tested
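For concreteness, a plausible Keras definition of the two architectures is sketched below; the report specifies only the filter counts, so the kernel sizes, strides, and dense classifier head are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn(filters_1, filters_2, num_classes=10):
    """Two-convolutional-layer CNN; teacher and student differ only in width."""
    return keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(filters_1, kernel_size=3, strides=2, padding="same", activation="relu"),
        layers.Conv2D(filters_2, kernel_size=3, strides=2, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(num_classes),  # raw logits; softmax is applied inside the losses
    ])


teacher = build_cnn(256, 512)  # large teacher (256 and 512 filters)
student = build_cnn(16, 32)    # lightweight student (16 and 32 filters)
```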
Experimental Setup
Knowledge Distillation Architecture
The distillation process follows a teacher-student paradigm where knowledge is transferred through soft probability distributions:

The teacher model generates soft predictions that encode richer information than hard labels, including uncertainty and class relationships. The student learns from both these soft targets and the ground-truth labels, weighted by the \(\alpha\) parameter.
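As a hypothetical numeric illustration of this weighting: if, for one batch, the hard-label cross-entropy were 0.6 and the distillation cross-entropy 0.2, then with \(\alpha = 0.3\) the combined loss would be \(0.3 \cdot 0.6 + 0.7 \cdot 0.2 = 0.32\), so most of the gradient signal would come from the teacher's soft targets.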
Experimental Results
Dataset: MNIST

- MNIST: Handwritten digits dataset
- Training: 60,000 samples (28×28 grayscale images)
- Testing: 10,000 samples
- Classes: 10 digits (0-9)
- Preprocessing: Normalization, label perturbation for noise analysis
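A sketch of the assumed preprocessing follows; the exact normalization and label-perturbation scheme is not given in the report, so the perturb_labels helper below is a hypothetical implementation of random label flipping.

```python
import numpy as np
from tensorflow import keras

# Load MNIST, scale pixels to [0, 1], and add a channel axis for the CNNs.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = (x_train.astype("float32") / 255.0)[..., np.newaxis]  # (60000, 28, 28, 1)
x_test = (x_test.astype("float32") / 255.0)[..., np.newaxis]    # (10000, 28, 28, 1)


def perturb_labels(labels, noise_rate=0.1, num_classes=10, seed=0):
    """Hypothetical label perturbation for the noise analysis:
    flip a fraction of labels to a different, uniformly chosen class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(noisy)) < noise_rate
    offsets = rng.integers(1, num_classes, size=int(flip.sum()))
    noisy[flip] = (noisy[flip] + offsets) % num_classes
    return noisy
```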
Accuracy vs Alpha Analysis

Key Observations:
- Student with Distillation (Orange line): Shows consistently higher accuracy across most \(\alpha\) values compared to training from scratch
- Student Scratch (Blue line): Baseline performance without teacher guidance
- Optimal Range: \(\alpha \in [0.0, 0.5]\) demonstrates the strongest performance improvement
- Performance Drop: At \(\alpha = 1.0\) (hard labels only), the student performs significantly worse, validating the benefit of soft label guidance
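A condensed sketch of the kind of \(\alpha\) sweep behind this comparison, reusing build_cnn, Distiller, and the preprocessed MNIST arrays from the earlier sketches; the epoch counts and optimizer are assumptions, and the scratch baseline is drawn as a flat reference because it does not depend on \(\alpha\).

```python
import matplotlib.pyplot as plt
from tensorflow import keras

# Train the teacher once (the report trains far longer, up to 1000 epochs).
teacher.compile(optimizer="adam",
                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=["accuracy"])
teacher.fit(x_train, y_train, epochs=5, verbose=0)

alphas = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
distilled_acc = []
for alpha in alphas:
    student = build_cnn(16, 32)
    distiller = Distiller(student, teacher, alpha=alpha, temperature=3.0)
    distiller.compile(optimizer="adam")
    distiller.fit(x_train, y_train, epochs=100, verbose=0)
    student.compile(optimizer="adam",
                    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                    metrics=["accuracy"])
    distilled_acc.append(student.evaluate(x_test, y_test, verbose=0)[1])

# Baseline: the same architecture trained from scratch on hard labels only.
scratch = build_cnn(16, 32)
scratch.compile(optimizer="adam",
                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=["accuracy"])
scratch.fit(x_train, y_train, epochs=100, verbose=0)
scratch_acc = scratch.evaluate(x_test, y_test, verbose=0)[1]

plt.plot(alphas, distilled_acc, label="Student with Distillation")
plt.plot(alphas, [scratch_acc] * len(alphas), label="Student Scratch")
plt.xlabel("alpha"); plt.ylabel("test accuracy"); plt.legend(); plt.show()
```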
Multi-Epoch Analysis

Comparing performance across different training durations (100, 300, 500, and 1000 epochs):
Critical Findings:
- 100 Epochs (Blue): Baseline performance, relatively stable across \(\alpha\) values
- 300 Epochs (Orange): Slight improvement over 100 epochs
- 500 Epochs (Green): Noticeable performance gains, especially in mid-range \(\alpha\) values
- 1000 Epochs (Red): Best overall performance
  - Maintains high accuracy (~96%) for \(\alpha \in [0.0, 0.8]\)
  - Dramatic drop at \(\alpha = 1.0\) (~93%), showing pure hard-label training is suboptimal
  - Peak performance at \(\alpha = 0.3\): achieves ~97% accuracy
Key Insight: Longer training amplifies the benefits of knowledge distillation, with optimal \(\alpha\) values (\(0.3-0.5\)) showing the most significant improvements.
Loss Landscape Analysis

Loss Behavior:
- Student Loss (Red): Remains remarkably stable across all \(\alpha\) values (~0.5), indicating consistent performance on ground-truth labels
- Student Scratch Loss (Green): Shows high variability and generally higher loss values
- Optimal \(\alpha\): Around \(0.4-0.5\), where both losses are minimized
Detailed MNIST Results


MNIST-Specific Analysis:
- Student Scratch Accuracy (Orange): Stable at ~11.5% (near the 10% random-guess baseline for 10 classes)
- Student Accuracy (Blue): Near-zero without distillation, dramatically improved with teacher guidance
- Loss Progression: Shows monotonic improvement as \(\alpha\) increases from 0 to 1
This unusual pattern suggests the student model requires significant training to learn effectively from scratch on MNIST.
Extended Training Analysis (100 Epochs)

100-Epoch Training Pattern:
- Low \(\alpha\) (0.0-0.3): Highest accuracy (~55-57%), student relies heavily on the teacher’s soft labels
- Mid \(\alpha\) (0.4-0.6): Moderate performance (~48-51%), balanced learning
- High \(\alpha\) (0.7-1.0): Stabilizes at ~51%, more emphasis on hard labels
Interpretation: With shorter training (100 epochs), lower \(\alpha\) values perform better, as the student benefits more from the teacher’s pre-learned representations.
Key Experimental Findings
Performance Summary:
- Teacher Model: \(\sim\)98.8% test accuracy (1000 epochs)
- Best Student (\(\alpha = 0.3\), 1000 epochs): \(\sim\)97% accuracy
- Student Scratch (no distillation): \(\sim\)93.1% accuracy
- Performance Gap: Up to 3.9% improvement with optimal \(\alpha\) tuning
Optimal \(\alpha\) Characteristics:
The optimal \(\alpha\) value depends on:
- Teacher Model Quality: Better teachers (>95% accuracy) allow lower \(\alpha\) values (more reliance on soft labels)
- Dataset Complexity: MNIST benefits from \(\alpha \in [0.3, 0.5]\)
- Training Duration: Longer training (>500 epochs) shows more pronounced distillation benefits
- Model Capacity: Smaller student models benefit more from lower \(\alpha\) values
Distillation Benefits:
Models trained with intermediate \(\alpha\) values demonstrated:
- Better Generalization: Improved test accuracy compared to hard-label training
- Faster Convergence: Reached target accuracy in fewer epochs
- Robustness: More stable training curves with less variance
- Knowledge Transfer: Successfully learned the teacher’s decision boundaries
- Smooth Interpolation: Between memorization (\(\alpha = 1\)) and imitation (\(\alpha = 0\))
Mathematical Framework
Temperature-Scaled Softmax
The temperature-scaled softmax for distillation:
\[p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\]
where a higher temperature \(T > 1\) produces softer probability distributions, revealing more information about the teacher’s uncertainty.
Key Properties:
- \(T = 1\): Standard softmax (sharp distributions)
- \(T \to \infty\): Approaches a uniform distribution
- Optimal \(T \in [2, 5]\) for most tasks
- This study used \(T = 3.0\) for soft label generation
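A small numerical illustration of these properties; the logits below are made up and the printed probabilities are approximate.

```python
import numpy as np


def softened_softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T spreads probability mass across classes."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                      # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


logits = np.array([8.0, 2.0, 0.5])    # illustrative teacher logits for a 3-class example
for T in (1.0, 3.0, 10.0):
    print(f"T={T}: {np.round(softened_softmax(logits, T), 3)}")
# T=1  -> roughly [0.997, 0.002, 0.001]: nearly one-hot, close to a hard label
# T=3  -> roughly [0.821, 0.111, 0.067]: secondary classes become visible ("dark knowledge")
# T=10 -> roughly [0.495, 0.272, 0.234]: approaching a uniform distribution
```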
Theoretical Justification
Why Knowledge Distillation Works:
- Dark Knowledge: Teacher’s soft predictions encode inter-class relationships and uncertainty
- Regularization Effect: Soft targets prevent overfitting to hard labels
- Faster Convergence: Pre-learned feature representations guide student training
- Generalization: Student learns decision boundaries rather than memorizing labels
Conclusions
This research provides empirical validation for the effectiveness of knowledge distillation across varying values of the \(\alpha\) parameter:
Main Contributions
- Systematic \(\alpha\) Analysis: Comprehensive evaluation across 11 different \(\alpha\) values (0.0 to 1.0 in 0.1 increments)
- Multi-Scale Training Study: Performance characterization across 100, 300, 500, and 1000 epochs
- Optimal Parameter Identification: \(\alpha \approx 0.3\) yields the best performance for MNIST with a CNN student
- Quantitative Validation: Up to 3.9% accuracy improvement over pure hard-label training
Practical Insights
For Practitioners:
- Start with \(\alpha \in [0.3, 0.5]\) for CNN-based models on image classification
- Use lower \(\alpha\) (\(< 0.3\)) with high-quality teachers (\(> 95\%\) accuracy)
- Train for at least 500 epochs to fully realize distillation benefits
- Set temperature \(T = 3.0\) as a robust default
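Putting those recommendations together, a usage sketch that reuses the Distiller and build_cnn definitions from earlier sections; the \(\alpha\), \(T\), and epoch values are the report's recommendations, and everything else is illustrative.

```python
# Recommended starting point: alpha = 0.3, T = 3.0, >= 500 epochs.
student = build_cnn(16, 32)
distiller = Distiller(student, teacher, alpha=0.3, temperature=3.0)
distiller.compile(optimizer="adam")
distiller.fit(x_train, y_train, epochs=500)
```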
For Researchers:
- Knowledge distillation effectiveness increases with training duration
- Soft labels provide stronger learning signal than hard labels for compact models
- \(\alpha\) serves as a critical hyperparameter requiring task-specific tuning
- Further investigation needed on non-image domains and larger-scale models
Limitations
- Single Dataset: Focused exclusively on MNIST; generalization to complex datasets (CIFAR-10, ImageNet) requires validation
- Architecture Constraints: Limited to CNN-based models; transformer and attention-based architectures not explored
- Fixed Temperature: \(T = 3.0\) used throughout; optimal temperature may vary with \(\alpha\)
- Compute Resources: Extensive hyperparameter search limited by computational budget
Future Directions
Immediate Extensions
- Dataset Diversity: Evaluate on Fashion-MNIST, CIFAR-10/100, and Tiny ImageNet
- Architecture Variations: Test with ResNet, VGG, and modern architectures (Vision Transformers)
- Temperature Tuning: Co-optimize \(\alpha\) and \(T\) for maximum performance
- Advanced Metrics: Include precision, recall, F1-score, and confusion matrix analysis
Advanced Research Questions
- Theoretical Analysis: Mathematical characterization of optimal \(\alpha\) as a function of dataset complexity
- Noisy Label Robustness: Evaluate distillation effectiveness under label noise
- Multi-Teacher Ensembles: Investigate knowledge transfer from multiple teachers with different \(\alpha\) weights
- Neural Architecture Search: Automatic discovery of student architectures optimized for distillation
- Continual Learning: Knowledge distillation for sequential task learning
- Cross-Domain Transfer: Distillation across different modalities (image → text, audio → image)
Broader Impact
This research contributes to the growing body of work on efficient deep learning and model compression, enabling:
- Deployment of high-performance models on edge devices
- Reduced computational costs for inference
- More accessible AI for resource-constrained environments
- Environmental sustainability through reduced energy consumption
Links
- GitHub Repository
- Final Report PDF
- Topics: Deep Learning, Knowledge Distillation, Model Compression, Neural Networks
