Three-Stage Fine-Tuning
I wrote an app that classifies objects using transfer learning; the model was trained on the CIFAR-10 dataset, all by myself.
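For context, here is a minimal sketch of the kind of setup I am describing, assuming torchvision's pretrained ResNet-18 and standard ImageNet normalization (the exact transforms and splits in my runs may differ):

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms, models

# CIFAR-10 images are 32x32; resizing to 224x224 lets the ImageNet-pretrained
# backbone see inputs at the scale it was trained on.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.CIFAR10(root="./data", train=True,  download=True, transform=transform)
val_set   = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# Pretrained ResNet-18 with a fresh 10-class head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 10)
```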
Exploring Stage-3 Fine-Tuning: How Far Should We Unfreeze?
Why Consider a Third Fine-Tuning Stage?
After validating that two-stage fine-tuning produces consistent and reproducible gains,
the next natural question is not
“what else can we add?” but rather:
“Can the data support releasing even more capacity?”
This distinction matters. Stage-3 fine-tuning is not an automatic upgrade — it is a
hypothesis that must be justified by learning dynamics and validated empirically.
Proposed Three-Stage Fine-Tuning Schedule
Stage Definitions
- Stage 1: Train fc only
- Stage 2: Unfreeze layer4 + fc
- Stage 3: Unfreeze layer3 + layer4 + fc
Each stage progressively exposes deeper parts of the network, moving from
task-specific classification to increasingly general feature representations.
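As a concrete illustration, a schedule like this can be expressed with a small helper that releases blocks of torchvision's ResNet-18 stage by stage (set_stage is my own illustrative name, not a library function):

```python
def set_stage(model, stage):
    """Freeze the whole backbone, then release blocks according to the stage."""
    for p in model.parameters():
        p.requires_grad = False

    blocks = [model.fc]              # Stage 1: classifier head only
    if stage >= 2:
        blocks.append(model.layer4)  # Stage 2: high-level semantic block
    if stage >= 3:
        blocks.append(model.layer3)  # Stage 3: mid-level representation block

    for block in blocks:
        for p in block.parameters():
            p.requires_grad = True

    # Return only the trainable parameters, ready to hand to an optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```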
What Each Stage Is Responsible For
Stage 1: Linear Readout Alignment
At this stage, the pretrained backbone is fully frozen. The model learns:
- How to map existing high-level features to new class labels
- Initial decision boundaries without modifying representations
- A stable baseline for downstream adaptation
This stage minimizes risk and establishes a strong reference point.
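A minimal Stage-1 setup, reusing the model and the set_stage helper sketched above (the learning rate is illustrative, not the exact value from my runs):

```python
import torch.optim as optim

criterion = nn.CrossEntropyLoss()

# Stage 1: the backbone stays frozen, so only the new head receives gradients.
stage1_params = set_stage(model, stage=1)        # fc parameters only
optimizer = optim.Adam(stage1_params, lr=1e-3)   # a head-only LR can be relatively high
```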
Stage 2: High-Level Semantic Adaptation
Unfreezing layer4 enables:
- Task-specific refinement of semantic features
- Better alignment between logits and class structure
- Correction of ImageNet-specific biases
This is where most transfer learning gains typically occur.
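One common way to set this stage up is with per-group learning rates, so the pretrained layer4 moves more slowly than the freshly initialized head (the rates here are placeholders, not tuned values):

```python
# Stage 2: release layer4 alongside fc, but give the pretrained block a
# smaller learning rate than the freshly initialized head.
set_stage(model, stage=2)
optimizer = optim.Adam([
    {"params": model.layer4.parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(),     "lr": 1e-3},
])
```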
Stage 3: Mid-Level Representation Adjustment
Unfreezing layer3 allows the network to modify:
- Object parts and spatial configurations
- Mid-level texture and shape cues
- Feature compositions that feed into high-level semantics
This stage increases expressive power substantially — and therefore must be handled with care.
Why Stage-3 Fine-Tuning Is Risky on Small Datasets
Unfreezing layer3 introduces a large number of trainable parameters.
On a limited dataset, this creates several risks:
- Representation drift: pretrained features may be overwritten
- Overfitting: mid-level features adapt too closely to training samples
- Optimization instability: gradients propagate deeper into the network
In other words, Stage-3 fine-tuning trades generalization stability for flexibility.
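To get a concrete sense of how much capacity each stage releases, one can simply count the parameters per block (using the ResNet-18 model with a 10-class head defined earlier; the printed numbers depend on the architecture):

```python
def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Rough sense of how much capacity each stage releases in ResNet-18.
print("fc:    ", count_params(model.fc))      # new 10-class head (Stage 1)
print("layer4:", count_params(model.layer4))  # added in Stage 2
print("layer3:", count_params(model.layer3))  # added in Stage 3
```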
When Stage-3 Fine-Tuning Makes Sense
Despite the risks, there are scenarios where Stage-3 can help:
- The target dataset is visually very different from ImageNet
- Classes depend on subtle part-level differences
- Strong regularization and low learning rates are in place
In these cases, mid-level features may genuinely need to change.
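If Stage-3 is attempted, a conservative recipe along these lines is what I have in mind: very low learning rates for the pretrained blocks plus weight decay (all values here are illustrative assumptions, not tuned settings):

```python
# Stage 3: release layer3 as well, with a deliberately conservative recipe:
# very low learning rates for the pretrained blocks plus weight decay.
set_stage(model, stage=3)
optimizer = optim.AdamW([
    {"params": model.layer3.parameters(), "lr": 1e-5},
    {"params": model.layer4.parameters(), "lr": 5e-5},
    {"params": model.fc.parameters(),     "lr": 1e-4},
], weight_decay=1e-2)
```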
How Stage-3 Fits Into a Data-Driven Workflow
The key point is that Stage-3 is not attempted blindly.
Before introducing it, the data already told us:
- Optimization had stabilized
- Two-stage tuning produced consistent gains
- Validation accuracy had not yet fully saturated
These are necessary (but not sufficient) conditions for considering deeper unfreezing.
Expected Outcomes and Diagnostic Signals
Positive Signals
- Validation accuracy increases beyond two-stage results
- Validation loss decreases or remains stable
- Training accuracy does not spike too early
Negative Signals
- Training accuracy quickly returns to 100%
- Validation loss increases
- Validation accuracy becomes unstable across runs
Negative signals indicate that the dataset cannot support this level of adaptation.
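A simple way to act on these signals is to monitor validation accuracy and stop as soon as it stalls; in this sketch, train_one_epoch and evaluate are hypothetical placeholders for the usual training and evaluation loops:

```python
best_val_acc, patience, bad_epochs = 0.0, 3, 0
num_epochs = 15  # illustrative budget

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, criterion)    # hypothetical training-loop helper
    val_loss, val_acc = evaluate(model, criterion)  # hypothetical evaluation helper

    if val_acc > best_val_acc:
        # Positive signal: validation accuracy is still improving.
        best_val_acc, bad_epochs = val_acc, 0
        torch.save(model.state_dict(), "stage3_best.pt")
    else:
        # Negative signal: validation has stalled; stop before mid-level
        # features drift further toward the training set.
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```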
Why This Experiment Matters Even If It Fails
Whether Stage-3 improves performance or not, the experiment is valuable.
- If it works, we learn that mid-level features matter for this task
- If it fails, we confirm that higher-level adaptation is sufficient
Either outcome reduces uncertainty and sharpens future decisions.
Key Takeaway
Stage-3 fine-tuning is about asking a precise question:
“Does my data justify changing how the network sees parts and structures?”
By progressing one stage at a time and letting validation metrics guide decisions,
we ensure that every increase in model freedom is earned — not assumed.
My Two-Stage Training Results
Stage-1 Best results: Train Loss: 1.1005 | Val Loss: 1.3094 | Train Acc: 85.70% | Val Acc: 72.50%
Stage-2 Best results: Train Loss: 0.7060 | Val Loss: 1.0932 | Train Acc: 100.00% | Val Acc: 86.20%
So, in short, my Stage-2 result already says "layer4 works". I hit:
- 100% train accuracy
- 86.2% validation accuracy
That means layer4 + fc have already fully adapted; the remaining gap lies in earlier feature abstraction.
What the Two-Stage Results Tell Us Before Attempting Stage-3
Before introducing a third fine-tuning stage, it is critical to pause and interpret the
best results from Stage-1 and Stage-2. These numbers are not just checkpoints — they
describe how learning capacity is being absorbed by the model.
Stage-1 Outcome: Feature Space Is Already Highly Informative
Stage-1 training (fc only) achieved:
Train Acc: 85.70% | Val Acc: 72.50%
This is an unusually strong result for a frozen backbone. It tells us that:
- The ImageNet-pretrained ResNet-18 features are already well-aligned with CIFAR-10
- Class separation exists even without any backbone adaptation
- The task does not fundamentally require redefining mid-level representations
This is an important baseline. When a frozen backbone performs this well, it sets a high
bar for how much deeper fine-tuning can realistically help.
Stage-2 Outcome: Most Transfer Learning Gains Are Already Captured
After unfreezing layer4, Stage-2 reached:
Train Acc: 100.00% | Val Acc: 86.20%
The jump from 72.5% → 86.2% confirms that high-level semantic adaptation was both necessary
and effective. However, the accompanying signals matter just as much:
- Training accuracy saturates completely
- Validation loss remains relatively high
- The train–validation gap widens again
This indicates that the model has already absorbed most of the task-specific structure
that the dataset can support, and is now operating near the bias–variance boundary.
Implication for Stage-3 Fine-Tuning
Taken together, these two stages strongly suggest that:
- The dataset benefits from high-level adaptation (Stage-2)
- Mid-level features are already sufficiently expressive
- Additional capacity is more likely to increase variance than reduce bias
In practical terms, this means Stage-3 fine-tuning is statistically risky. Unfreezing layer3
introduces a large number of parameters precisely when the model is already memorizing the training set.
Data-Driven Expectation for Stage-3
Based on these results, a realistic expectation for Stage-3 would be:
- Training accuracy remains at or near 100%
- Validation loss increases or becomes unstable
- Validation accuracy fluctuates or declines across runs
That does not mean Stage-3 should never be tested — but it should be treated as a
confirmation experiment, not a hopeful optimization.
Conclusion: Why This Analysis Matters
The decision to attempt or skip Stage-3 is no longer subjective. The data already provides
a strong prior:
Two-stage fine-tuning captured the majority of transferable signal, and the remaining
error is more likely due to data limitations than representational limits.
This is exactly the kind of evidence-based reasoning that prevents overfitting-driven
regressions and keeps the optimization process principled.
Any comments? Feel free to participate below in the Facebook comment section.