Two-Stage Fine-Tuning for Transfer Learning with ResNet-18
I wrote an app that classifies objects using transfer learning, and I trained the model myself on the CIFAR-10 dataset.
What Is Two-Stage Fine-Tuning?
Two-stage fine-tuning is a structured approach to adapting a pretrained model to
a new task. Instead of unfreezing all layers at once, we fine-tune the network in
stages, starting from the top (task-specific layers) and gradually moving
deeper into the backbone.
For a
ResNet-18 pretrained on ImageNet, this method works especially
well because different layers learn features of different abstraction levels.
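As a concrete starting point, a minimal sketch of loading the pretrained backbone and replacing its classification head might look like this (assuming a recent torchvision; num_classes = 10 is an illustrative value for a CIFAR-10-sized task):
import torch.nn as nn
from torchvision import models

# Load ResNet-18 with ImageNet-pretrained weights
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Replace the classification head with one sized for the new task
num_classes = 10  # assumption: e.g. CIFAR-10
model.fc = nn.Linear(model.fc.in_features, num_classes)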
Why Fine-Tuning in Stages Matters
If all layers are unfrozen immediately:
- Gradients from a small dataset can overwrite pretrained features
- Training becomes unstable
- Validation accuracy may stagnate or degrade
Two-stage fine-tuning mitigates these issues by controlling
where learning
happens first.
ResNet-18 Layer Hierarchy
Conceptually, ResNet-18 can be divided into:
- Early layers: conv1, layer1 → edges, textures
- Middle layers: layer2, layer3 → shapes, parts
- Late layers: layer4, fc → semantics, classes
Two-stage fine-tuning leverages this natural separation.
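You can see this hierarchy directly by listing the top-level modules of a torchvision ResNet-18 (a quick inspection sketch, reusing the model object loaded earlier):
# Print the top-level blocks of ResNet-18
for name, module in model.named_children():
    print(name)
# conv1, bn1, relu, maxpool, layer1, layer2, layer3, layer4, avgpool, fc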
Stage 1: Fine-Tune the Head and Top Layers
Objective
Adapt the pretrained model to the new dataset while preserving general visual
features learned from ImageNet.
Layers to Unfreeze
In Stage 1, we typically unfreeze:
- fc (classification head)
- layer4 (highest residual block)
All earlier layers remain frozen.
Why This Works
The classifier head is randomly initialized and must be trained from scratch.
layer4 already contains high-level semantic features that can be safely
adapted without destabilizing the network.
Example: Freezing and Unfreezing Layers
# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Unfreeze layer4
for param in model.layer4.parameters():
    param.requires_grad = True

# Unfreeze the fully connected layer
for param in model.fc.parameters():
    param.requires_grad = True
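With only layer4 and fc trainable, the optimizer should receive just those parameters. A minimal sketch (the choice of Adam and the learning rate of 1e-3 are illustrative assumptions):
import torch

# Collect only the parameters that will be updated in Stage 1
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=1e-3)  # lr chosen for illustration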
Typical Training Behavior
- Rapid improvement in validation accuracy
- Stable loss curves
- Minimal overfitting early on
Stage 1 establishes a strong task-specific baseline.
Stage 2: Fine-Tune Mid-Level Features
Objective
Refine feature representations by allowing limited adaptation of deeper layers
without damaging low-level visual primitives.
Layers to Unfreeze
In Stage 2, we commonly unfreeze:
- layer3 (in addition to layer4 and fc from Stage 1)
Lower layers (layer1, layer2) usually remain frozen for small datasets.
Why Not Unfreeze Everything?
Lower layers encode universal features like edges and color contrasts. Updating
them with limited data:
- Provides little performance gain
- Increases overfitting risk
- Can hurt generalization
Example: Unfreezing an Additional Layer
# Unfreeze layer3
for param in model.layer3.parameters():
    param.requires_grad = True
Learning Rate Strategy
Stage 2 typically uses:
- A lower learning rate
- Or a scheduler such as CosineAnnealingLR
This ensures gradual, controlled updates and prevents catastrophic forgetting.
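One possible Stage 2 setup looks like the sketch below (the lower learning rate and epoch count are illustrative assumptions, not values from this post):
import torch

# Rebuild the optimizer so it also covers the newly unfrozen layer3 parameters
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=1e-4)  # lower than Stage 1 (assumption)

# Cosine annealing decays the learning rate smoothly over Stage 2
stage2_epochs = 10  # assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=stage2_epochs)

# Call scheduler.step() once per epoch inside the Stage 2 training loop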
Why Two-Stage Fine-Tuning Improves Validation Accuracy
From an Optimization Perspective
Stage 1 quickly moves the model into a good region of the loss landscape.
Stage 2 performs smaller, more precise adjustments that improve generalization
rather than memorization.
From a Generalization Perspective
This approach:
- Preserves low-level visual knowledge
- Encourages smoother decision boundaries
- Reduces over-confident predictions
Typical Gains
For ResNet-18 on small to medium datasets:
- ~1–3% improvement in validation accuracy
- More stable results across runs
- Better resistance to overfitting
Key Takeaways
- Two-stage fine-tuning emphasizes control and stability
- Train task-specific layers first, then refine mid-level features
- Lower layers are usually best left frozen for limited data
- ResNet-18 benefits significantly from this structured approach
In practice, two-stage fine-tuning is one of the most reliable ways to extract
maximum performance from a pretrained ResNet-18 without increasing model size or
data requirements.