Digit Classification Neural Network (Source Code Included)
Overview
After learning how a neural network works and getting a solid grasp of the underlying mathematics, I wanted to explore how many hidden layers, and how many nodes/units/neurons per layer, yield the best accuracy. So here's what I did:
Experiment #1
I trained various combinations of the number of units and the number of iterations using a one-hidden-layer architecture.
Experiment #2
I trained various combinations of the number of units and the number of iterations using a two-hidden-layer architecture.
My neural network was trained on Kaggle’s Digit Recognizer dataset. The training data path at the time of writing was '/kaggle/input/digit-recognizer/train.csv'. Here is my neural network configuration (a short data-loading and initialization sketch follows the list):
- The training set has 41000 samples.
- The validation set has 1000 samples.
- The input layer has 784 neurons (one input image of a digit is 28×28 grayscale pixels).
- The output layer has 10 neurons, one per digit, indicating the probability of being that digit.
- The hidden layer activation function is ReLU (rectified linear unit).
- The output layer activation function is softmax.
- I set alpha (learning rate) to 0.1 in all experiments.
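To make the configuration above concrete, here is a minimal sketch of how the data could be loaded and split and how the parameters could be initialized. The shuffle/split details, the scaled random initialization, and the name init_params are illustrative assumptions, not a verbatim copy of my script:
import numpy as np
import pandas as pd

# Load Kaggle's Digit Recognizer training file (42000 labeled images).
data = pd.read_csv('/kaggle/input/digit-recognizer/train.csv').to_numpy()
np.random.shuffle(data)

# First 1000 rows -> validation set, remaining 41000 rows -> training set.
# Transpose so that each column is one 784-pixel sample.
val = data[:1000].T
train = data[1000:].T
Y_val, X_val = val[0], val[1:] / 255.0          # scale pixel values to [0, 1]
Y_train, X_train = train[0], train[1:] / 255.0
m = X_train.shape[1]                            # 41000; used as a global further below

def init_params(hidden_units):
    # W1: (hidden_units, 784), b1: (hidden_units, 1), W2: (10, hidden_units), b2: (10, 1)
    W1 = np.random.rand(hidden_units, 784) - 0.5
    b1 = np.random.rand(hidden_units, 1) - 0.5
    W2 = np.random.rand(10, hidden_units) - 0.5
    b2 = np.random.rand(10, 1) - 0.5
    return W1, b1, W2, b2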
The implementation is straightforward. Below is the source code for forward propagation and backward propagation in Python:
def forward_propagation(W1, b1, W2, b2, X):
    # Hidden layer: linear combination followed by ReLU.
    Z1 = W1.dot(X) + b1
    A1 = ReLU(Z1)
    # Output layer: linear combination followed by softmax (per-digit probabilities).
    Z2 = W2.dot(A1) + b2
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2

def backward_propagation(Z1, A1, Z2, A2, W1, W2, X, Y):
    # m is the number of training samples (41000 here).
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y                               # softmax + cross-entropy gradient
    dW2 = 1 / m * dZ2.dot(A1.T)
    db2 = 1 / m * np.sum(dZ2)
    dZ1 = W2.T.dot(dZ2) * ReLU_derivative(Z1)          # ReLU_derivative is simply Z1 > 0
    dW1 = 1 / m * dZ1.dot(X.T)
    db1 = 1 / m * np.sum(dZ1)
    return dW1, db1, dW2, db2
Note: If you need the entire source code, just let me know.
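In the meantime, here is a minimal sketch of the helper functions and the gradient-descent driver that the code above relies on. The names match the snippets (ReLU, softmax, one_hot, ReLU_derivative), but the bodies, as well as update_params, get_accuracy, and gradient_descent, are an illustrative reconstruction under my stated setup, not necessarily line-for-line what I ran:
def ReLU(Z):
    return np.maximum(Z, 0)

def ReLU_derivative(Z):
    return Z > 0

def softmax(Z):
    # Subtract the column-wise max for numerical stability.
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / np.sum(expZ, axis=0, keepdims=True)

def one_hot(Y):
    # Turn a vector of labels (0-9) into a (10, m) one-hot matrix.
    one_hot_Y = np.zeros((10, Y.size))
    one_hot_Y[Y, np.arange(Y.size)] = 1
    return one_hot_Y

def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    # Plain gradient descent with learning rate alpha (0.1 in all my experiments).
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2
    return W1, b1, W2, b2

def get_accuracy(A, Y):
    # Fraction of samples whose highest-probability output matches the label.
    return np.mean(np.argmax(A, axis=0) == Y)

def gradient_descent(X, Y, hidden_units, iterations, alpha=0.1):
    # Note: backward_propagation uses the global m defined in the loading sketch above.
    W1, b1, W2, b2 = init_params(hidden_units)
    for _ in range(iterations):
        Z1, A1, Z2, A2 = forward_propagation(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_propagation(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
    return W1, b1, W2, b2
With these pieces, a row such as "392, 1000" in the tables below corresponds to one training run like gradient_descent(X_train, Y_train, 392, 1000), with get_accuracy measured on both the training and validation sets for each of the three trials.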
Experiment #1 Results
Data are arranged in the following format:
# of units in the hidden layer, # of training iterations, [training accuracy, validation accuracy] × 3 trials
10, 1000, [0.884 0.876] [0.874 0.889] [0.882 0.89]
20, 1000, [0.896 0.903] [0.8962926829268293 0.901] [0.9008048780487805 0.898]
50, 1000, [0.9130243902439025 0.916] [0.914829268292683 0.913] [0.9159756097560976 0.896]
Since I would eventually test 1568 units in the hidden layer, I decided to evaluate it first at only 100 iterations to estimate how long it would take:
1568, 100, [0.8099512195121952 0.813] [0.8320487804878048 0.869] [0.8083902439024391 0.816]
The results vary widely across the three trials, which means 100 iterations is too low to yield stable accuracy. So I increased the iteration count to 200 for all unit configurations. I also decided to roughly double the number of neurons from one configuration to the next. Here are the results:
10, 200, [0.7634390243902439 0.763] [0.7249268292682927 0.749] [0.7083414634146341 0.729]
11, 200, [0.7770975609756098 0.767] [0.7890975609756098 0.784] [0.7302439024390244 0.742]
22, 200, [0.7990731707317074 0.816] [0.7955609756097561 0.813] [0.7975121951219513 0.791]
44, 200, [0.8286829268292683 0.84] [0.8378048780487805 0.842] [0.8379512195121951 0.84]
88, 200, [0.8581219512195122 0.841] [0.8423658536585366 0.82] [0.8536585365853658 0.828]
196, 200, [0.8684634146341463 0.863] [0.8736341463414634 0.87] [0.8702682926829268 0.851]
392, 200, [0.8734878048780488 0.884] [0.8710731707317073 0.873] [0.8699268292682927 0.87]
784, 200, [0.8856341463414634 0.865] [0.879609756097561 0.864] [0.8699756097560976 0.874]
1000, 200, [0.8774634146341463 0.881] [0.881219512195122 0.859] [0.8838536585365854 0.856]
1568, 200, [0.8723170731707317 0.844] [0.8681219512195122 0.854] [0.8758536585365854 0.865]
Since 392 neurons (exactly half the number of input neurons) produced the highest accuracy, I tried more iterations:
392, 1000, [0.9386585365853658 0.92] [0.940780487804878 0.907] [0.9424878048780487 0.927]
That works out to roughly 94% training accuracy and 92% validation accuracy, which is quite good.
Two-Hidden-Layer Architecture
Before proceeding with Experiment #2, I modified the original functions to include one more hidden layer and renamed some variables for clarity. The updated forward and backward propagation code is below:
def forward_prop(W1, b1, W2, b2, WOutput, bOutput, X):
    # Hidden layer 1
    Z1 = W1.dot(X) + b1
    A1 = ReLU(Z1)
    # Hidden layer 2
    Z2 = W2.dot(A1) + b2
    A2 = ReLU(Z2)
    # Output layer (softmax probabilities)
    ZOutput = WOutput.dot(A2) + bOutput
    AOutput = softmax(ZOutput)
    return Z1, A1, Z2, A2, ZOutput, AOutput

def backward_prop(Z1, Z2, ZOutput, A1, A2, AOutput, W1, W2, WOutput, X, Y):
    one_hot_Y = one_hot(Y)
    # Output layer gradients
    dZOutput = AOutput - one_hot_Y
    dWOutput = 1 / m * dZOutput.dot(A2.T)
    dbOutput = 1 / m * np.sum(dZOutput)
    # Hidden layer 2 gradients
    dZ2 = WOutput.T.dot(dZOutput) * ReLU_derivative(Z2)
    dW2 = 1 / m * dZ2.dot(A1.T)
    db2 = 1 / m * np.sum(dZ2)
    # Hidden layer 1 gradients
    dZ1 = W2.T.dot(dZ2) * ReLU_derivative(Z1)
    dW1 = 1 / m * dZ1.dot(X.T)
    db1 = 1 / m * np.sum(dZ1)
    return dW1, dW2, dWOutput, db1, db2, dbOutput
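The two-hidden-layer runs also need the extra parameter set in the initialization and update step. Here is a minimal sketch in the same style as before; the names init_params_2 and update_params_2 are illustrative, not necessarily what I called them in my script:
def init_params_2(hidden1, hidden2):
    # W1: (hidden1, 784), W2: (hidden2, hidden1), WOutput: (10, hidden2)
    W1 = np.random.rand(hidden1, 784) - 0.5
    b1 = np.random.rand(hidden1, 1) - 0.5
    W2 = np.random.rand(hidden2, hidden1) - 0.5
    b2 = np.random.rand(hidden2, 1) - 0.5
    WOutput = np.random.rand(10, hidden2) - 0.5
    bOutput = np.random.rand(10, 1) - 0.5
    return W1, b1, W2, b2, WOutput, bOutput

def update_params_2(W1, b1, W2, b2, WOutput, bOutput,
                    dW1, db1, dW2, db2, dWOutput, dbOutput, alpha):
    # Same plain gradient-descent update, now over three weight matrices.
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2
    WOutput -= alpha * dWOutput
    bOutput -= alpha * dbOutput
    return W1, b1, W2, b2, WOutput, bOutput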
Experiment #2 Results
Data are presented in the following format:
# of neurons in layer 1, # of neurons in layer 2, # of iterations, [training accuracy, validation accuracy] × 3 trials
Same number of neurons in both layers
10, 10, 500, [0.825390243902439 0.828] [0.825390243902439 0.838] [0.8395609756097561 0.841]
20, 20, 500, [0.8697317073170732 0.861] [0.8773414634146341 0.868] [0.8621463414634146 0.86]
50, 50, 500, [0.9024390243902439 0.891] [0.9023658536585366 0.9] [0.8999756097560976 0.89]
Low neuron count in layer 1, higher in layer 2
10, 20, 500, [0.833780487804878 0.82] [0.8555121951219512 0.859] [0.8400487804878048 0.841]
10, 50, 500, [0.8679268292682927 0.865] [0.8565121951219512 0.844] [0.8647560975609756 0.864]
10, 100, 500, [0.7935609756097561 0.796] [0.854 0.856] [0.8566341463414634 0.849]
Higher neuron count in layer 1, lower in layer 2
20, 10, 500, [0.8624390243902439 0.847] [0.8709756097560976 0.873] [0.8682926829268293 0.867]
50, 10, 500, [0.875780487804878 0.863] [0.851609756097561 0.845] [0.8822439024390244 0.885]
100, 10, 500, [0.8804878048780488 0.875] [0.8760487804878049 0.89] [0.8895365853658537 0.879]
It appears that keeping the number of units in layer 1 and layer 2 the same works best. This configuration gives the best results so far:
50, 50, 500, [0.9024390243902439 0.891] [0.9023658536585366 0.9] [0.8999756097560976 0.89]
I tried higher values:
100, 100, 500, [0.9139512195121952 0.891] [0.9139024390243903 0.906] [0.9126585365853659 0.898]
Accuracy did not improve significantly, so it has likely reached a plateau.
As a reminder, earlier we tested this one-hidden-layer case:
10, 1000, [0.884 0.876] [0.874 0.889] [0.882 0.89]
Now let's add an extra hidden layer while keeping the same number of units:
10, 10, 1000, [0.8730487804878049 0.863] [0.8737560975609756 0.868] [0.8731463414634146 0.87]
Adding a second hidden layer did NOT improve accuracy.
Testing very small hidden layers
How about making the number of units in one or both hidden layers smaller than the number of units in the output layer, just for kicks?
8, 8, 500, [0.8056585365853659 0.789] [0.8227317073170731 0.826] [0.7661219512195122 0.774]
5, 5, 500, [0.770390243902439 0.745] [0.6456585365853659 0.623] [0.5160243902439025 0.477]
8, 20, 500, [0.8347560975609756 0.824] [0.8184878048780487 0.802] [0.8090731707317074 0.789]
20, 8, 500, [0.8524878048780488 0.835] [0.7997073170731708 0.781] [0.8412926829268292 0.844]
Results are poor, as expected.
Conclusions
- A single hidden layer provides solid accuracy for this type of classification task.
- In a one-hidden-layer architecture, you can often find an optimal number of neurons. In my case, it was exactly half the number of input neurons. Once identified, accuracy can be improved further by increasing the number of training iterations.
- A two-hidden-layer architecture may benefit other types of problems, but not this one.
- If using a two-hidden-layer architecture, keeping the number of units in layer 1 and layer 2 the same appears to yield the best results.
Feel free to download my paper here.