Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.

Table of contents

1. Practical aspects of Deep Learning

1.1 Train / Dev / Test sets

1.2 Bias / Variance

1.3 Basic Recipe for Machine Learning

1.4 Regularization

1.5 Why regularization reduces overfitting?

Here are some intuitions:

Implementation tip: one way to debug gradient descent is to plot the cost function J as a function of the number of iterations. With regularization included in J, you want to see J decrease monotonically after every iteration of gradient descent. If you plot the old definition of J (without the regularization term), you might not see it decrease monotonically.
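As a minimal numpy sketch of the regularized cost being plotted (the function name and parameters are my own, not from the course): the L2 penalty (lambda / 2m) * sum of squared Frobenius norms is added to the base cross-entropy cost.

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Regularized cost: base cost + (lambda / (2m)) * sum_l ||W[l]||_F^2.

    weights: list of weight matrices W[1..L]; m: number of training examples.
    """
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty
```

Plotting this quantity (rather than the unregularized cost) per iteration is what should give the monotonic decrease.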

1.6 Dropout Regularization

1.7 Understanding Dropout

1.8 Other regularization methods

1.9 Normalizing inputs

1.10 Vanishing / Exploding gradients

1.11 Weight Initialization for Deep Networks

1.12 Numerical approximation of gradients

1.13 Gradient checking implementation notes

1.14 Initialization summary

1.15 Regularization summary

a) L2 Regularization

Observations:

What is L2-regularization actually doing?

What you should remember:
Implications of L2-regularization on:

b) Dropout

What you should remember about dropout:

2. Optimization algorithms

2.1 Mini-batch gradient descent

2.2 Understanding mini-batch gradient descent

2.3 Exponentially weighted averages

2.4 Understanding exponentially weighted averages

  v = 0
  Repeat
  {
  	Get theta(t)
  	v = beta * v + (1-beta) * theta(t)
  }

2.5 Bias correction in exponentially weighted averages

  v(t) = (beta * v(t-1) + (1-beta) * theta(t)) / (1 - beta^t)
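The two formulas above (the running average and its bias-corrected version) can be sketched in numpy; the function name `ewa` is my own:

```python
import numpy as np

def ewa(thetas, beta=0.9, bias_correction=True):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t.

    With bias correction, each v_t is divided by (1 - beta^t) to undo the
    startup bias from initializing v = 0.
    """
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out
```

On a constant sequence, the corrected average recovers the true value immediately, while the uncorrected one starts far too low.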

2.6 Gradient descent with momentum

  vdW = 0, vdb = 0
  on iteration t:
  	# can be mini-batch or batch gradient descent
  	compute dW, db on current mini-batch
  			
  	vdW = beta * vdW + (1 - beta) * dW
  	vdb = beta * vdb + (1 - beta) * db
  	W = W - learning_rate * vdW
  	b = b - learning_rate * vdb
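The momentum update above, as a runnable numpy sketch (the dict-of-arrays layout and the name `momentum_update` are my own choices):

```python
import numpy as np

def momentum_update(params, grads, velocities, beta=0.9, learning_rate=0.01):
    """One gradient-descent-with-momentum step.

    params, grads, velocities are dicts mapping names (e.g. "W1", "b1")
    to numpy arrays; velocities should start as zero arrays.
    """
    for key in params:
        # Exponentially weighted average of the gradients
        velocities[key] = beta * velocities[key] + (1 - beta) * grads[key]
        # Step in the direction of the averaged gradient
        params[key] = params[key] - learning_rate * velocities[key]
    return params, velocities
```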

2.7 RMSprop

  sdW = 0, sdb = 0
  on iteration t:
  	# can be mini-batch or batch gradient descent
  	compute dW, db on current mini-batch
  	
  	sdW = (beta * sdW) + (1 - beta) * dW^2  # squaring is element-wise
  	sdb = (beta * sdb) + (1 - beta) * db^2  # squaring is element-wise
  	W = W - learning_rate * dW / (sqrt(sdW) + epsilon)
  	b = b - learning_rate * db / (sqrt(sdb) + epsilon)
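A runnable numpy sketch of the RMSprop step (names `rmsprop_update` and the dict layout are mine; `epsilon` guards against division by zero):

```python
import numpy as np

def rmsprop_update(params, grads, s, beta=0.9, learning_rate=0.01, epsilon=1e-8):
    """One RMSprop step: keep a running average of squared gradients and
    divide each gradient by the root of that average."""
    for key in params:
        # Element-wise squared gradients, exponentially averaged
        s[key] = beta * s[key] + (1 - beta) * np.square(grads[key])
        params[key] = params[key] - learning_rate * grads[key] / (np.sqrt(s[key]) + epsilon)
    return params, s
```

Dimensions with large squared gradients get smaller effective steps, which damps oscillations.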

2.8 Adam optimization algorithm

  vdW = 0, vdb = 0
  sdW = 0, sdb = 0
  on iteration t:
  	# can be mini-batch or batch gradient descent
  	compute dW, db on current mini-batch
  			
  	vdW = (beta1 * vdW) + (1 - beta1) * dW     # momentum
  	vdb = (beta1 * vdb) + (1 - beta1) * db     # momentum
  			
  	sdW = (beta2 * sdW) + (1 - beta2) * dW^2   # RMSprop
  	sdb = (beta2 * sdb) + (1 - beta2) * db^2   # RMSprop
  			
  	vdW_corrected = vdW / (1 - beta1^t)      # bias correction
  	vdb_corrected = vdb / (1 - beta1^t)      # bias correction
  			
  	sdW_corrected = sdW / (1 - beta2^t)      # bias correction
  	sdb_corrected = sdb / (1 - beta2^t)      # bias correction
  					
  	W = W - learning_rate * vdW_corrected / (sqrt(sdW_corrected) + epsilon)
  	b = b - learning_rate * vdb_corrected / (sqrt(sdb_corrected) + epsilon)
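The whole Adam step, as a numpy sketch (function name and dict layout are mine; note the corrected averages are kept separate so the running averages themselves are never overwritten):

```python
import numpy as np

def adam_update(params, grads, v, s, t, beta1=0.9, beta2=0.999,
                learning_rate=0.001, epsilon=1e-8):
    """One Adam step at iteration t (t starts at 1).

    v holds the momentum averages, s the RMSprop averages; both start at zero.
    """
    for key in params:
        v[key] = beta1 * v[key] + (1 - beta1) * grads[key]            # momentum
        s[key] = beta2 * s[key] + (1 - beta2) * np.square(grads[key]) # RMSprop
        v_corr = v[key] / (1 - beta1 ** t)   # bias correction
        s_corr = s[key] / (1 - beta2 ** t)   # bias correction
        params[key] = params[key] - learning_rate * v_corr / (np.sqrt(s_corr) + epsilon)
    return params, v, s
```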

2.9 Learning rate decay

2.10 The problem of local optima

3. Hyperparameter tuning, Batch Normalization and Programming Frameworks

3.1 Tuning process

3.2 Using an appropriate scale to pick hyperparameters

3.3 Hyperparameters tuning in practice: Pandas vs. Caviar

3.4 Normalizing activations in a network

3.5 Fitting Batch Normalization into a neural network

  Z[l] = W[l]A[l-1] + b[l] => Z[l] = W[l]A[l-1]       # b[l] is canceled by the mean subtraction below
  Z_norm[l] = (Z[l] - mu) / sqrt(sigma^2 + epsilon)   # mu, sigma^2 computed over the mini-batch
  Z_tilde[l] = gamma[l] * Z_norm[l] + beta[l]
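The three steps above can be sketched as one numpy function (the name `batchnorm_forward` is mine; each column of Z is one example, so statistics are taken along axis 1):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, epsilon=1e-8):
    """Batch norm forward pass: normalize Z over the mini-batch, then scale and shift."""
    mu = np.mean(Z, axis=1, keepdims=True)        # per-unit mean over the batch
    var = np.var(Z, axis=1, keepdims=True)        # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)    # zero mean, unit variance
    return gamma * Z_norm + beta                  # learnable scale and shift
```

With gamma = 1 and beta = 0 this returns exactly the normalized activations.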

3.6 Why does Batch normalization work?

3.7 Batch normalization at test time

3.8 Softmax Regression

  t = e^(Z[L])      # shape (C, m)
  A[L] = t / sum(t) # shape (C, m); sum(t) sums over the C classes for each example, shape (1, m)
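As a runnable numpy version of the two lines above (I also subtract the column max before exponentiating, a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax; Z has shape (C, m), each column is one example."""
    t = np.exp(Z - np.max(Z, axis=0, keepdims=True))  # shift by max for stability
    return t / np.sum(t, axis=0, keepdims=True)       # columns sum to 1
```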

3.9 Training a Softmax classifier

  L(y, y_hat) = - sum(y[j] * log(y_hat[j])) # j = 0 to C-1
  J(w[1], b[1], ...) = - 1 / m * (sum(L(y[i], y_hat[i]))) # i = 1 to m
  dZ[L] = Y_hat - Y
  Y_hat * (1 - Y_hat)
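The loss and cost formulas above can be sketched in numpy (the function name is my own; Y and Y_hat are one-hot labels and softmax outputs of shape (C, m)):

```python
import numpy as np

def softmax_cross_entropy_cost(Y, Y_hat):
    """J = -(1/m) * sum over examples i and classes j of y[j] * log(y_hat[j])."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m
```

The convenient gradient dZ[L] = Y_hat - Y follows from differentiating this cost through the softmax.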

3.10 Deep learning frameworks

3.11 TensorFlow

  with tf.Session() as session:       # better for cleaning up in case of error/exception
  	session.run(init)
  	session.run(w)
  W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))

  b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())
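For intuition about what the Xavier initializer above is doing, here is a plain-numpy sketch (the name `xavier_init` and the exact sampling details are my own; the idea is to scale random weights by sqrt(1 / n_in) so activation variance stays roughly constant across layers):

```python
import numpy as np

def xavier_init(n_out, n_in, seed=1):
    """Xavier/Glorot-style initialization: W ~ N(0, 1/n_in), shape (n_out, n_in)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)
```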