Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.
So the idea is you go through the loop:
Idea ==> Code ==> Experiment

Regularization for logistic regression uses these norms:

- `||W|| = Sum(|w[i,j]|)  # sum of absolute values of all w` (the L1 norm)
- `||W||^2 = Sum(|w[i,j]|^2)  # sum of all w squared` (the squared L2 norm)
- Also `||W||^2 = W.T * W` if `W` is a vector.

The normal cost function that we want to minimize is:
`J(w,b) = (1/m) * Sum(L(y(i),y'(i)))`

The L2 regularization version:
`J(w,b) = (1/m) * Sum(L(y(i),y'(i))) + (lambda/2m) * Sum(|w[i]|^2)`

The L1 regularization version:
`J(w,b) = (1/m) * Sum(L(y(i),y'(i))) + (lambda/2m) * Sum(|w[i]|)`

`lambda` here is the regularization parameter (a hyperparameter).

For a neural network, the normal cost function that we want to minimize is:
J(W1,b1...,WL,bL) = (1/m) * Sum(L(y(i),y'(i)))
The L2 regularization version:
`J(w,b) = (1/m) * Sum(L(y(i),y'(i))) + (lambda/2m) * Sum(||W[l]||^2)`, where `||W[l]||^2` is the Frobenius norm of the matrix (the sum of squares of all its entries).
We stack the matrix as one vector (mn,1) and then we apply sqrt(w1^2 + w2^2.....)
To do back propagation (old way):
dw[l] = (from back propagation)
The new way:
dw[l] = (from back propagation) + lambda/m * w[l]
So plugging it in weight update step:
```
w[l] = w[l] - learning_rate * dw[l]
     = w[l] - learning_rate * ((from back propagation) + lambda/m * w[l])
     = w[l] - (learning_rate*lambda/m) * w[l] - learning_rate * (from back propagation)
     = (1 - (learning_rate*lambda)/m) * w[l] - learning_rate * (from back propagation)
```
In practice this penalizes large weights and effectively limits the freedom in your model.
The new factor `(1 - (learning_rate*lambda)/m)` is less than 1, so it causes each weight to decay in proportion to its size; this is why L2 regularization is also called "weight decay".
Here are some intuitions:
- If `lambda` is too large, a lot of `w`'s will be close to zero, which makes the NN simpler (you can think of it as behaving closer to logistic regression) and can underfit.
- If `lambda` is good enough, it will just reduce the weights that make the neural network overfit.
- From the activation view (for tanh): if `lambda` is too large, `w`'s will be small (close to zero), so we will use the roughly linear part of the tanh activation function; the whole NN then goes from a non-linear to a roughly linear classifier.
- If `lambda` is good enough, it will just make some of the tanh activations roughly linear, which prevents overfitting.

Implementation tip: if you implement gradient descent, one of the steps to debug it is to plot the cost function `J` as a function of the number of iterations, and check that `J` decreases monotonically after every iteration of gradient descent with regularization. If you plot the old definition of `J` (without the regularization term), you might not see it decrease monotonically.
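The regularization term and the weight-decay update above can be sketched in NumPy. This is a minimal sketch, not course code: `l2_cost_term`, `weight_decay_step` and the sample values are illustrative names.

```python
import numpy as np

def l2_cost_term(W, lambd, m):
    # (lambda / 2m) * ||W||^2, the Frobenius (sum-of-squares) penalty
    return (lambd / (2 * m)) * np.sum(np.square(W))

def weight_decay_step(W, dW_from_backprop, lambd, m, learning_rate):
    # dw[l] = (from back propagation) + (lambda/m) * w[l]
    dW = dW_from_backprop + (lambd / m) * W
    return W - learning_rate * dW

W = np.array([[1.0, -2.0], [0.5, 0.0]])
W_new = weight_decay_step(W, np.zeros_like(W), lambd=1.0, m=10, learning_rate=0.1)
# with a zero backprop gradient, W is simply shrunk by the factor
# (1 - learning_rate*lambda/m) = 0.99 -- the "decay" in weight decay
```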
Code for Inverted dropout:
```python
keep_prob = 0.8   # 0 <= keep_prob <= 1
l = 3             # this code is only for layer 3
# generated numbers that are less than keep_prob are kept: 80% stay, 20% dropped
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)   # keep only the values where d3 is 1 (True)
# increase a3 to not reduce the expected value of the output
# (ensures that the expected value of a3 remains the same) - this solves the scaling problem
a3 = a3 / keep_prob
```
- Dropout can use a different `keep_prob` per layer.
- You can set a lower `keep_prob` for some layers than others. The downside is that this gives you even more hyperparameters to search for using cross-validation. One other alternative might be to have some layers where you apply dropout and some layers where you don't, and then just have one hyperparameter, which is the `keep_prob` for the layers where you do apply dropout.
- A downside of dropout is that the cost function `J` is no longer well defined, so it is hard to debug by plotting `J`. To debug, first turn dropout off (set all `keep_prob`s to 1), run the code and check that `J` monotonically decreases, and then turn the dropouts back on.
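The scaling step can be checked numerically: with inverted dropout, the mean activation is preserved. A minimal sketch with toy all-ones activations (the sizes are illustrative):

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8
a3 = np.ones((100, 1000))                           # toy activations, all 1.0
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3) / keep_prob                # drop ~20%, rescale the rest
# the mean of a3 stays close to 1.0, so the expected value is preserved
```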
- An advantage of this method is that you don't need to search for a regularization hyperparameter (like `lambda` in L2 regularization).

Normalizing inputs:
- Get the mean of the training set: `mean = (1/m) * sum(x(i))`
- Subtract the mean from each input: `X = X - mean`
- Get the variance of the (now zero-mean) training set: `variance = (1/m) * sum(x(i)^2)`
- Normalize: `X /= np.sqrt(variance)` (dividing by the standard deviation gives each feature unit variance)
- Use the same `mean` and `variance` (computed on the training set) to normalize the dev and test sets.
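The normalization steps can be sketched in NumPy, following the course convention that `X` has shape `(n_x, m)` with one column per example (the toy data here is illustrative):

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(3, 500) * 4.0 + 7.0            # toy inputs, shape (n_x, m)

mean = np.mean(X, axis=1, keepdims=True)           # mean = (1/m) * sum(x(i))
X = X - mean                                       # subtract the mean
variance = np.mean(X ** 2, axis=1, keepdims=True)  # variance of the centered data
X = X / np.sqrt(variance)                          # divide by the standard deviation
# each input feature now has mean ~0 and variance ~1
```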
Vanishing/exploding gradients: assume the activation is linear, `g(z) = z`, and each `b[l] = 0`; then:

```
Y' = W[L]W[L-1].....W[2]W[1]X
```
Then, if we have 2 hidden units per layer and x1 = x2 = 1, we end up with:
```
if W[l] = [1.5  0 ]
          [0   1.5]   (l != L because of different dimensions in the output layer)

Y' = W[L] [1.5  0 ]^(L-1) X = 1.5^L   # which will be very large
          [0   1.5]
```
And if the weights are slightly smaller than the identity:

```
if W[l] = [0.5  0 ]
          [0   0.5]

Y' = W[L] [0.5  0 ]^(L-1) X = 0.5^L   # which will be very small
          [0   0.5]
```
Z = w1x1 + w2x2 + ... + wnxn
- So if `n_x` is large, we want the `W`'s to be smaller to not explode the cost.
- One option is to set the variance of the `W`'s to `1/n_x` (`n_x` is the number of inputs to the neuron).
- So you can initialize the `W`'s like this (better to use with tanh activation):

```python
np.random.randn(shape) * np.sqrt(1/n[l-1])
```
or a variation of this (Bengio et al.):

```python
np.random.randn(shape) * np.sqrt(2/(n[l-1] + n[l]))
```
Setting the variance to `2/n[l-1]` works better for ReLU (He initialization):

```python
np.random.randn(shape) * np.sqrt(2/n[l-1])
```

Gradient checking:
- First take all parameters `W[1],b[1],...,W[L],b[L]` and reshape them into one big vector (`theta`).
- The cost function will then be `J(theta)`.
- Then take all the gradients `dW[1],db[1],...,dW[L],db[L]` and reshape them into one big vector (`d_theta`).
- The gradient checking algorithm:

```
eps = 10^-7   # small number
for i in len(theta):
    d_theta_approx[i] = (J(theta1,...,theta[i] + eps) - J(theta1,...,theta[i] - eps)) / (2 * eps)
```
Finally compute `(||d_theta_approx - d_theta||) / (||d_theta_approx|| + ||d_theta||)` (`|| ||` is the Euclidean vector norm) and check (with `eps = 10^-7`):
- if the result is `< 10^-7` - great, the backpropagation implementation is very likely correct.
- if it is around `10^-5` - it can be OK, but inspect further.
- if it is `>= 10^-3` - bad, there is probably a bug in the backpropagation implementation.
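The whole check can be run on a toy cost function. This is a minimal sketch: `J(theta) = sum(theta^2)` and its exact gradient `2*theta` are illustrative stand-ins for a real network's cost and backprop output.

```python
import numpy as np

def J(theta):
    return np.sum(theta ** 2)

theta = np.array([1.0, -2.0, 3.0])
d_theta = 2 * theta                       # analytic gradient (as if from backprop)

eps = 1e-7
d_theta_approx = np.zeros_like(theta)
for i in range(len(theta)):
    plus, minus = theta.copy(), theta.copy()
    plus[i] += eps
    minus[i] -= eps
    d_theta_approx[i] = (J(plus) - J(minus)) / (2 * eps)

diff = np.linalg.norm(d_theta_approx - d_theta) / (
    np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta))
# diff is tiny here, indicating the analytic gradient is correct
```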
Notes on gradient checking:
- If the check fails, look at the components of the `d_theta_approx - d_theta` vector to try to identify where the bug is.
- Don't forget the regularization term: add `lambda/(2m) * sum(||W[l]||^2)` to `J` if you are using L2 regularization (or the corresponding absolute-value term for L1).
- Gradient checking doesn't work with dropout. First turn dropout off (set `keep_prob = 1.0`), run gradient checking, and then turn on dropout again.

Initialization summary:
- The weights `W[l]` should be initialized randomly to break symmetry.
It is however okay to initialize the biases b[l] to zeros. Symmetry is still broken so long as W[l] is initialized randomly
Different initializations lead to different results
Random initialization is used to break symmetry and make sure different hidden units can learn different things
Don't initialize to values that are too large.
He initialization works well for networks with ReLU activations.
Observations:
What is L2-regularization actually doing?:
What you should remember:
Implications of L2-regularization on:
What you should remember about dropout:
With inverted dropout, if `keep_prob` is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2, hence the output now has the same expected value. You can check that this works even when `keep_prob` is other values than 0.5.

Mini-batch gradient descent:
- Suppose we have `m = 50 million` training examples. Training on all of this data at once takes a huge amount of processing time for a single gradient descent step.
- Instead, split `m` into mini-batches, e.g. of size 1000:
  - `X{1} = 0 ... 1000`
  - `X{2} = 1001 ... 2000`
  - `...`
  - `X{bs} = ...`
- Split `X` & `Y` the same way.
- Notation: `t` indexes the mini-batch: `X{t}, Y{t}`.
- In mini-batch gradient descent the training loop looks like this:
```
for t = 1:No_of_batches   # one pass over all mini-batches is called an epoch
    AL, caches = forward_prop(X{t}, Y{t})
    cost = compute_cost(AL, Y{t})
    grads = backward_prop(AL, caches)
    update_parameters(grads)
```
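Before running such a loop you need to split `(X, Y)` into mini-batches. A minimal sketch (the helper name `make_mini_batches` and its defaults are illustrative, not course code), with shapes following the convention `X: (n_x, m)`, `Y: (1, m)`:

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                 # shuffle the examples first
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):         # the last batch may be smaller
        batches.append((X[:, t:t + batch_size], Y[:, t:t + batch_size]))
    return batches

X = np.random.randn(5, 150)   # 150 examples, 5 features each
Y = np.random.randn(1, 150)
batches = make_mini_batches(X, Y)
# 150 examples with batch_size=64 gives batches of 64, 64 and 22 examples
```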

- (`mini-batch size = m`) ==> Batch gradient descent
- (`mini-batch size = 1`) ==> Stochastic gradient descent (SGD)
- (`mini-batch size` between 1 and `m`) ==> Mini-batch gradient descent
- Typical mini-batch sizes are powers of two: 64, 128, 256, 512, 1024, ...
- The mini-batch size is another hyperparameter.
Exponentially weighted averages - suppose we have daily temperature data through the year:

```
t(1) = 40
t(2) = 49
t(3) = 45
...
t(180) = 60
...
```
...
If we compute the exponentially weighted averages with `beta = 0.9`:

```
V0 = 0
V1 = 0.9 * V0 + 0.1 * t(1) = 4        # 0.9 and 0.1 are the hyperparameters beta and 1-beta
V2 = 0.9 * V1 + 0.1 * t(2) = 8.5
V3 = 0.9 * V2 + 0.1 * t(3) = 12.15
...
```
The general equation:

```
V(t) = beta * V(t-1) + (1 - beta) * theta(t)
```
This averages over approximately the last `1 / (1 - beta)` entries:
- `beta = 0.9` will average over roughly the last 10 entries
- `beta = 0.98` will average over roughly the last 50 entries
- `beta = 0.5` will average over roughly the last 2 entries

Intuition: the higher `beta`, the more slowly the average adapts to changes in `theta`. If `beta` is high (around 0.9), it smooths out skewed data points (oscillations, in gradient descent terminology). This reduces oscillations in gradient descent and hence gives a faster and smoother path towards the minimum.
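The running-average computation with `beta = 0.9` can be reproduced directly for the first three temperatures:

```python
beta = 0.9
v = 0.0
vs = []
for t in [40, 49, 45]:               # t(1), t(2), t(3)
    v = beta * v + (1 - beta) * t    # V(t) = beta * V(t-1) + (1-beta) * theta(t)
    vs.append(round(v, 4))
# vs == [4.0, 8.5, 12.15], matching V1, V2 and V3 above
```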
We can implement the exponentially weighted average with a single running variable:

```
v = 0
Repeat
{
    Get theta(t)
    v = beta * v + (1 - beta) * theta(t)
}
```
Bias correction: because `v(0) = 0`, the weighted average is biased toward zero and the accuracy suffers at the start. To correct this, use:

```
v(t) = (beta * v(t-1) + (1 - beta) * theta(t)) / (1 - beta^t)
```
As `t` becomes larger, `(1 - beta^t)` becomes close to 1, so the bias correction makes almost no difference later in training.
Gradient descent with momentum - the algorithm:

```
vdW = 0, vdb = 0
on iteration t:
    # can be mini-batch or batch gradient descent
    compute dW, db on current mini-batch

    vdW = beta * vdW + (1 - beta) * dW
    vdb = beta * vdb + (1 - beta) * db
    W = W - learning_rate * vdW
    b = b - learning_rate * vdb
```
`beta` is another hyperparameter; `beta = 0.9` is very common and works very well in most cases.

RMSprop (Root Mean Square prop) - the algorithm:
```
sdW = 0, sdb = 0
on iteration t:
    # can be mini-batch or batch gradient descent
    compute dW, db on current mini-batch

    sdW = (beta * sdW) + (1 - beta) * dW^2   # squaring is element-wise
    sdb = (beta * sdb) + (1 - beta) * db^2   # squaring is element-wise
    W = W - learning_rate * dW / sqrt(sdW)
    b = b - learning_rate * db / sqrt(sdb)
```
RMSprop will make the cost function move slower on the vertical direction and faster on the horizontal direction in the following example:

Ensure that `sdW` is not zero by adding a small value `epsilon` (e.g. `epsilon = 10^-8`) to the denominator: `W = W - learning_rate * dW / (sqrt(sdW) + epsilon)`

Adam (Adaptive Moment Estimation) optimization algorithm - it basically combines momentum and RMSprop:
```
vdW = 0, vdb = 0
sdW = 0, sdb = 0
on iteration t:
    # can be mini-batch or batch gradient descent
    compute dW, db on current mini-batch

    vdW = (beta1 * vdW) + (1 - beta1) * dW    # momentum
    vdb = (beta1 * vdb) + (1 - beta1) * db    # momentum

    sdW = (beta2 * sdW) + (1 - beta2) * dW^2  # RMSprop
    sdb = (beta2 * sdb) + (1 - beta2) * db^2  # RMSprop

    vdW = vdW / (1 - beta1^t)   # fixing bias
    vdb = vdb / (1 - beta1^t)   # fixing bias
    sdW = sdW / (1 - beta2^t)   # fixing bias
    sdb = sdb / (1 - beta2^t)   # fixing bias

    W = W - learning_rate * vdW / (sqrt(sdW) + epsilon)
    b = b - learning_rate * vdb / (sqrt(sdb) + epsilon)
```
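The pseudocode can be run as-is on a scalar toy problem. A minimal NumPy sketch minimizing `J(w) = (w - 5)^2` (the learning rate and iteration count are illustrative choices, not recommendations):

```python
import numpy as np

# toy problem: J(w) = (w - 5)^2, so the gradient is dw = 2 * (w - 5)
w = 0.0
vdw, sdw = 0.0, 0.0
beta1, beta2, epsilon, learning_rate = 0.9, 0.999, 1e-8, 0.01

for t in range(1, 3001):
    dw = 2 * (w - 5)
    vdw = beta1 * vdw + (1 - beta1) * dw          # momentum
    sdw = beta2 * sdw + (1 - beta2) * dw ** 2     # RMSprop
    vdw_corrected = vdw / (1 - beta1 ** t)        # fixing bias
    sdw_corrected = sdw / (1 - beta2 ** t)        # fixing bias
    w = w - learning_rate * vdw_corrected / (np.sqrt(sdw_corrected) + epsilon)
# w ends up close to the minimum at w = 5
```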
Hyperparameters for Adam:
- `beta1`: parameter of the momentum term - 0.9 is recommended by default.
- `beta2`: parameter of the RMSprop term - 0.999 is recommended by default.
- `epsilon`: `10^-8` is recommended by default.

Learning rate decay:
- Slowly reduce the learning rate over time:
  `learning_rate = (1 / (1 + decay_rate * epoch_num)) * learning_rate_0`
- `epoch_num` counts passes over all the data (not a single mini-batch).
- Other decay schedules:
  - `learning_rate = (0.95 ^ epoch_num) * learning_rate_0`
  - `learning_rate = (k / sqrt(epoch_num)) * learning_rate_0`
- `decay_rate` is another hyperparameter.

Hyperparameter tuning:
- You usually don't need to tune `beta1`, `beta2` & `epsilon`.
- Sample `N` hyperparameter settings at random and then try all the sampled settings combinations on your problem.
- Use a coarse to fine sampling scheme: once you find promising values, zoom in on a smaller region around them and sample more densely there.
When searching for a hyperparameter like the learning rate on a range `[a, b]`, it's better to sample on the log scale:
- `a_log = log10(a)  # e.g. a = 0.0001 then a_log = -4`
- `b_log = log10(b)  # e.g. b = 1 then b_log = 0`
- Then sample:

```python
r = (a_log - b_log) * np.random.rand() + b_log
# in the example, r is in the range [-4, 0] because np.random.rand() is in [0, 1)
result = 10 ** r
```
This uniformly samples values on a log scale from `[a, b]`.
For a hyperparameter like `beta` (best explored between 0.9 and 0.999), sample `1 - beta` in the range 0.001 to 0.1 (that is, `1 - 0.9` and `1 - 0.999`) and use `a = 0.001` and `b = 0.1`. Then:

```python
a_log = -3
b_log = -1
r = (a_log - b_log) * np.random.rand() + b_log
beta = 1 - 10 ** r   # because 1 - beta = 10^r
```
Batch normalization:
- Question: for any hidden layer, can we normalize `A[l]` to train `W[l+1]`, `b[l+1]` faster? This is what batch normalization is about.
- There is a debate about whether to normalize the values before the activation function, `Z[l]`, or after applying the activation function, `A[l]`. In practice, normalizing `Z[l]` is done much more often, and that is what Andrew Ng presents.
- Algorithm, given `Z[l] = [z(1), ..., z(m)]`, for `i = 1 to m` (for each input):
  - `mean = 1/m * sum(z[i])`
  - `variance = 1/m * sum((z[i] - mean)^2)`
  - `Z_norm[i] = (z[i] - mean) / np.sqrt(variance + epsilon)` (add `epsilon` for numerical stability in case `variance = 0`)
  - Forcing all inputs to mean 0 and variance 1 isn't always what we want, so the network learns parameters `gamma` and `beta`:
    `Z_tilde[i] = gamma * Z_norm[i] + beta`
  - Note: if `gamma = sqrt(variance + epsilon)` and `beta = mean`, then `Z_tilde[i] = z[i]` (the identity).
Batch norm in a neural network:
- The parameters become `W[1], b[1], ..., W[L], b[L], beta[1], gamma[1], ..., beta[L], gamma[L]` (these `beta`s are different from the momentum/Adam `beta`).
- `beta[1], gamma[1], ..., beta[L], gamma[L]` are updated using any optimization algorithm (like GD, RMSprop, Adam).
- In deep learning frameworks you usually won't implement batch norm yourself (e.g. in TensorFlow you can use `tf.nn.batch_normalization()`).
- If you are using batch norm, the bias parameters `b[1], ..., b[L]` don't count because they are eliminated by the mean subtraction step, so:

```
Z[l] = W[l]A[l-1] + b[l]   =>   Z[l] = W[l]A[l-1]
Z_norm[l] = ...
Z_tilde[l] = gamma[l] * Z_norm[l] + beta[l]
```
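The batch-norm forward computation can be sketched in NumPy for one layer (toy shapes; `gamma` initialized to ones and `beta` to zeros purely for illustration):

```python
import numpy as np

np.random.seed(3)
Z = np.random.randn(4, 32) * 3 + 2       # pre-activations, shape (n[l], m)
epsilon = 1e-8

mean = Z.mean(axis=1, keepdims=True)
variance = Z.var(axis=1, keepdims=True)
Z_norm = (Z - mean) / np.sqrt(variance + epsilon)

gamma = np.ones((4, 1))                  # learned scale
beta = np.zeros((4, 1))                  # learned shift
Z_tilde = gamma * Z_norm + beta
# each row (hidden unit) of Z_tilde now has mean ~0 and variance ~1
```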
- Taking the mean of the constants `b[l]` will eliminate the `b[l]`, so the parameters to learn per layer are `W[l]`, `beta[l]`, and `gamma[l]`.
- Shapes:
  - `Z[l]` - `(n[l], m)`
  - `beta[l]` - `(n[l], 1)`
  - `gamma[l]` - `(n[l], 1)`
- Why does batch norm work? Each mini-batch is scaled by the mean/variance computed on just that mini-batch's `Z[l]`. So, similar to dropout, it adds some noise to each hidden layer's activations, which has a slight regularization effect.

Softmax regression:
- Used for multi-class classification, e.g. classifying an image as a dog, cat, baby chick, or none of those:
- Example classes and one-hot labels:
  - dog: `class = 1`, `y = [0 1 0 0]`
  - cat: `class = 2`, `y = [0 0 1 0]`
  - baby chick: `class = 3`, `y = [0 0 0 1]`
  - none: `class = 0`, `y = [1 0 0 0]`
- Notation: `C = no. of classes`, the classes are `(0, ..., C-1)`, and the number of output units `Ny = C`.
- The softmax activation in the last layer:
```
t = e^(Z[L])               # shape (C, m)
A[L] = e^(Z[L]) / sum(t)   # shape (C, m), sum(t) - sum of t's for each example (shape (1, m))
```
- The name softmax contrasts with "hard max", which puts 1 for the maximum value and 0 elsewhere (e.g. `np.max` over the vertical axis).
- Softmax generalizes logistic regression to `C` classes. If `C = 2`, softmax reduces to logistic regression.
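A NumPy sketch of the activation above; subtracting the per-column max before exponentiating is a standard numerical-stability trick (not shown in the formulas above) that leaves the result unchanged:

```python
import numpy as np

def softmax(Z):
    # subtract the per-column max so exp never overflows
    t = np.exp(Z - Z.max(axis=0, keepdims=True))
    return t / t.sum(axis=0, keepdims=True)

Z = np.array([[5.0, 1.0],
              [2.0, 1.0],
              [-1.0, 1.0],
              [3.0, 1.0]])   # shape (C, m) with C = 4 classes, m = 2 examples
A = softmax(Z)
# each column of A sums to 1; the second column is uniform (0.25 each)
```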
The loss function used with softmax:

```
L(y, y_hat) = - sum(y[j] * log(y_hat[j]))   # j = 0 to C-1
```

The cost function used with softmax:

```
J(w[1], b[1], ...) = - (1/m) * sum(L(y[i], y_hat[i]))   # i = 1 to m
```
Back propagation with softmax:

```
dZ[L] = Y_hat - Y
```

The derivative of softmax itself is:

```
Y_hat * (1 - Y_hat)
```
Example:

Suppose we want to minimize the cost function `J(w) = w^2 - 10w + 25`. The minimum is at `w = 5`, since the function is `(w - 5)^2`. Here is how to do it in TensorFlow:
```python
import numpy as np
import tensorflow as tf

w = tf.Variable(0, dtype=tf.float32)                      # creating a variable w
cost = tf.add(tf.add(w**2, tf.multiply(-10.0, w)), 25.0)  # can be written as: cost = w**2 - 10*w + 25
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
session.run(w)   # runs the definition of w; printing this gives zero

session.run(train)
print("W after one iteration:", session.run(w))

for i in range(1000):
    session.run(train)
print("W after 1000 iterations:", session.run(w))
```
Using a placeholder to feed in data:

```python
import numpy as np
import tensorflow as tf

coefficients = np.array([[1.], [-10.], [25.]])

x = tf.placeholder(tf.float32, [3, 1])
w = tf.Variable(0, dtype=tf.float32)      # creating a variable w
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
session.run(w)   # runs the definition of w; printing this gives zero

session.run(train, feed_dict={x: coefficients})
print("W after one iteration:", session.run(w))

for i in range(1000):
    session.run(train, feed_dict={x: coefficients})
print("W after 1000 iterations:", session.run(w))
```
To implement mini-batch gradient descent in TensorFlow, just change `feed_dict={x: coefficients}` to the current mini-batch data on each iteration.

It is better to open the session with a `with` block, which cleans up the session even in case of an error/exception:

```python
with tf.Session() as session:
    session.run(init)
    session.run(w)
```
The sigmoid cross-entropy loss can be computed with the built-in `tf.nn.sigmoid_cross_entropy_with_logits(logits = ..., labels = ...)`.

To initialize the weights of a neural network in TensorFlow:

```python
W1 = tf.get_variable("W1", [25, 12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b1 = tf.get_variable("b1", [25, 1], initializer = tf.zeros_initializer())
```
In a 3-layer network, forward propagation should stop at the linear output `Z3` rather than applying the final activation. The reason is that in TensorFlow the last linear layer output is given as input to the function computing the loss, so you don't need `A3`!

To reset the computation graph between runs, use `tf.reset_default_graph()`.