Three-Stage Fine-Tuning
I wrote an app that classifies objects using transfer learning; the model was trained on the CIFAR-10 dataset, all by myself.
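For context, here is a minimal sketch of the kind of setup I am describing, assuming torchvision's pretrained ResNet-18 and standard ImageNet normalization (the exact transforms and splits in my runs may differ):

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms, models

# CIFAR-10 images are 32x32; resizing to 224x224 lets the ImageNet-pretrained
# backbone see inputs at the scale it was trained on.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.CIFAR10(root="./data", train=True,  download=True, transform=transform)
val_set   = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# Pretrained ResNet-18 with a fresh 10-class head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)
```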
Exploring Stage-3 Fine-Tuning: How Far Should We Unfreeze?
Why Consider a Third Fine-Tuning Stage?
After validating that two-stage fine-tuning produces consistent and reproducible gains,
the next natural question is not
“what else can we add?” but rather:
“Can the data support releasing even more capacity?”
This distinction matters. Stage-3 fine-tuning is not an automatic upgrade — it is a
hypothesis that must be justified by learning dynamics and validated empirically.
Proposed Three-Stage Fine-Tuning Schedule
Stage Definitions
- Stage 1: Train fc only
- Stage 2: Unfreeze layer4 + fc
- Stage 3: Unfreeze layer3 + layer4 + fc
Each stage progressively exposes deeper parts of the network, moving from
task-specific classification to increasingly general feature representations.
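As a concrete illustration, a schedule like this can be expressed with a small helper that releases blocks of torchvision's ResNet-18 stage by stage (set_stage is my own illustrative name, not a library function):

```python
def set_stage(model, stage):
    """Freeze the whole backbone, then release blocks according to the stage."""
    for p in model.parameters():
        p.requires_grad = False

    blocks = [model.fc]              # Stage 1: classifier head only
    if stage >= 2:
        blocks.append(model.layer4)  # Stage 2: high-level semantic block
    if stage >= 3:
        blocks.append(model.layer3)  # Stage 3: mid-level representation block

    for block in blocks:
        for p in block.parameters():
            p.requires_grad = True

    # Return only the trainable parameters, ready to hand to an optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```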
What Each Stage Is Responsible For
Stage 1: Linear Readout Alignment
At this stage, the pretrained backbone is fully frozen. The model learns:
- How to map existing high-level features to new class labels
- Initial decision boundaries without modifying representations
- A stable baseline for downstream adaptation
This stage minimizes risk and establishes a strong reference point.
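A minimal Stage-1 setup, reusing the model and the set_stage helper sketched above (the learning rate is illustrative, not the exact value from my runs):

```python
import torch.optim as optim

criterion = nn.CrossEntropyLoss()

# Stage 1: the backbone stays frozen, so only the new head receives gradients.
stage1_params = set_stage(model, stage=1)        # fc parameters only
optimizer = optim.Adam(stage1_params, lr=1e-3)   # a head-only LR can be relatively high
```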
Stage 2: High-Level Semantic Adaptation
Unfreezing layer4 enables:
- Task-specific refinement of semantic features
- Better alignment between logits and class structure
- Correction of ImageNet-specific biases
This is where most transfer learning gains typically occur.
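One common way to set this stage up is with per-group learning rates, so the pretrained layer4 moves more slowly than the freshly initialized head (the rates here are placeholders, not tuned values):

```python
# Stage 2: release layer4 alongside fc, but give the pretrained block a
# smaller learning rate than the freshly initialized head.
set_stage(model, stage=2)
optimizer = optim.Adam([
    {"params": model.layer4.parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(),     "lr": 1e-3},
])
```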
Stage 3: Mid-Level Representation Adjustment
Unfreezing layer3 allows the network to modify:
- Object parts and spatial configurations
- Mid-level texture and shape cues
- Feature compositions that feed into high-level semantics
This stage increases expressive power substantially — and therefore must be handled with care.
Why Stage-3 Fine-Tuning Is Risky on Small Datasets
Unfreezing layer3 introduces a large number of trainable parameters.
On a limited dataset, this creates several risks:
- Representation drift: pretrained features may be overwritten
- Overfitting: mid-level features adapt too closely to training samples
- Optimization instability: gradients propagate deeper into the network
In other words, Stage-3 fine-tuning trades generalization stability for flexibility.
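To get a concrete sense of how much capacity each stage releases, one can simply count the parameters per block (using the ResNet-18 model with a 10-class head defined earlier; the printed numbers depend on the architecture):

```python
def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Rough sense of how much capacity each stage releases in ResNet-18.
print("fc:    ", count_params(model.fc))      # new 10-class head (Stage 1)
print("layer4:", count_params(model.layer4))  # added in Stage 2
print("layer3:", count_params(model.layer3))  # added in Stage 3
```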
When Stage-3 Fine-Tuning Makes Sense
Despite the risks, there are scenarios where Stage-3 can help:
- The target dataset is visually very different from ImageNet
- Classes depend on subtle part-level differences
- Strong regularization and low learning rates are in place
In these cases, mid-level features may genuinely need to change.
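If Stage-3 is attempted, a conservative recipe along these lines is what I have in mind: very low learning rates for the pretrained blocks plus weight decay (all values here are illustrative assumptions, not tuned settings):

```python
# Stage 3: release layer3 as well, with a deliberately conservative recipe:
# very low learning rates for the pretrained blocks plus weight decay.
set_stage(model, stage=3)
optimizer = optim.AdamW([
    {"params": model.layer3.parameters(), "lr": 1e-5},
    {"params": model.layer4.parameters(), "lr": 5e-5},
    {"params": model.fc.parameters(),     "lr": 1e-4},
], weight_decay=1e-2)
```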
How Stage-3 Fits Into a Data-Driven Workflow
The key point is that Stage-3 is not attempted blindly.
Before introducing it, the data already told us:
- Optimization had stabilized
- Two-stage tuning produced consistent gains
- Validation accuracy had not yet fully saturated
These are necessary (but not sufficient) conditions for considering deeper unfreezing.
Expected Outcomes and Diagnostic Signals
Positive Signals
- Validation accuracy increases beyond two-stage results
- Validation loss decreases or remains stable
- Training accuracy does not spike too early
Negative Signals
- Training accuracy quickly returns to 100%
- Validation loss increases
- Validation accuracy becomes unstable across runs
Negative signals indicate that the dataset cannot support this level of adaptation.
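A simple way to act on these signals is to monitor validation accuracy and stop as soon as it stalls; in this sketch, train_one_epoch and evaluate are hypothetical placeholders for the usual training and evaluation loops:

```python
best_val_acc, patience, bad_epochs = 0.0, 3, 0
num_epochs = 15  # illustrative budget

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, criterion)    # hypothetical training-loop helper
    val_loss, val_acc = evaluate(model, criterion)  # hypothetical evaluation helper

    if val_acc > best_val_acc:
        # Positive signal: validation accuracy is still improving.
        best_val_acc, bad_epochs = val_acc, 0
        torch.save(model.state_dict(), "stage3_best.pt")
    else:
        # Negative signal: validation has stalled; stop before mid-level
        # features drift further toward the training set.
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```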
Why This Experiment Matters Even If It Fails
Whether Stage-3 improves performance or not, the experiment is valuable.
- If it works, we learn that mid-level features matter for this task
- If it fails, we confirm that higher-level adaptation is sufficient
Either outcome reduces uncertainty and sharpens future decisions.
Key Takeaway
Stage-3 fine-tuning is about asking a precise question:
“Does my data justify changing how the network sees parts and structures?”
By progressing one stage at a time and letting validation metrics guide decisions,
we ensure that every increase in model freedom is earned — not assumed.
My Two-Stage Training Results
Stage-1 Best results: Train Loss: 1.1005 | Val Loss: 1.3094 | Train Acc: 85.70% | Val Acc: 72.50%
Stage-2 Best results: Train Loss: 0.7060 | Val Loss: 1.0932 | Train Acc: 100.00% | Val Acc: 86.20%
So, in short, my Stage-2 result already says "layer4 works". I hit:
- 100% train accuracy
- 86.2% validation accuracy
That means layer4 + fc have already fully adapted; the remaining gap lies in earlier feature abstraction.
What the Two-Stage Results Tell Us Before Attempting Stage-3
Before introducing a third fine-tuning stage, it is critical to pause and interpret the
best results from Stage-1 and Stage-2. These numbers are not just checkpoints — they
describe how learning capacity is being absorbed by the model.
Stage-1 Outcome: Feature Space Is Already Highly Informative
Stage-1 training (fc only) achieved:
Train Acc: 85.70% | Val Acc: 72.50%
This is an unusually strong result for a frozen backbone. It tells us that:
- The ImageNet-pretrained ResNet-18 features are already well-aligned with CIFAR-10
- Class separation exists even without any backbone adaptation
- The task does not fundamentally require redefining mid-level representations
This is an important baseline. When a frozen backbone performs this well, it sets a high
bar for how much deeper fine-tuning can realistically help.
Stage-2 Outcome: Most Transfer Learning Gains Are Already Captured
After unfreezing layer4, Stage-2 reached:
Train Acc: 100.00% | Val Acc: 86.20%
The jump from 72.5% → 86.2% confirms that high-level semantic adaptation was both necessary
and effective. However, the accompanying signals matter just as much:
- Training accuracy saturates completely
- Validation loss remains relatively high
- The train–validation gap widens again
This indicates that the model has already absorbed most of the task-specific structure
that the dataset can support, and is now operating near the bias–variance boundary.
Implication for Stage-3 Fine-Tuning
Taken together, these two stages strongly suggest that:
- The dataset benefits from high-level adaptation (Stage-2)
- Mid-level features are already sufficiently expressive
- Additional capacity is more likely to increase variance than reduce bias
In practical terms, this means Stage-3 fine-tuning is statistically risky. Unfreezing layer3
introduces a large number of parameters precisely when the model is already memorizing the training set.
Data-Driven Expectation for Stage-3
Based on these results, a realistic expectation for Stage-3 would be:
- Training accuracy remains at or near 100%
- Validation loss increases or becomes unstable
- Validation accuracy fluctuates or declines across runs
That does not mean Stage-3 should never be tested — but it should be treated as a
confirmation experiment, not a hopeful optimization.
Conclusion: Why This Analysis Matters
The decision to attempt or skip Stage-3 is no longer subjective. The data already provides
a strong prior:
Two-stage fine-tuning captured the majority of transferable signal, and the remaining
error is more likely due to data limitations than representational limits.
This is exactly the kind of evidence-based reasoning that prevents overfitting-driven
regressions and keeps the optimization process principled.
Any comments? Feel free to participate below in the Facebook comment section.