Coursera Deep Learning Specialization Notes

Posted on Tue 30 October 2018 in new • 10 min read

These do not contain answers to quizzes or assignments per Honor Code. If you are looking for those, look elsewhere.

Binary Classification

Given a picture, classify it as cat or non-cat. The result is \(\hat{y} = P(y=1 | x)\). In other words, given \(x\), we calculate the probability that this data represents a cat.

Feature Vector from Image

We convert a picture, e.g. a (64, 64, 3) picture, into a (64 * 64 * 3, 1) feature vector.
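A minimal sketch of this conversion with NumPy. The random `image` array is only a stand-in for a real picture.

```python
import numpy as np

# Hypothetical 64x64 RGB image filled with random values, for illustration only.
image = np.random.rand(64, 64, 3)

# Flatten all pixel values into a single column vector.
x = image.reshape(-1, 1)

print(x.shape)  # (12288, 1), since 64 * 64 * 3 = 12288
```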


Logistic Regression

$$\hat{y} = \sigma(w^Tx + b)$$

\(w\) is the weight vector. \(w^T\) is the transpose of \(w\). \(x\) is the input. \(b\) is the bias. \(\sigma\) is the sigmoid function.

Loss Function

The loss function is the error between the real value of \(y\) and our prediction \(\hat{y}\) for a single training example.

$$ L(\hat{y}, y) = \frac{1}{2}(\hat{y}-y)^2 $$

Cost function

The cost function is the average of the loss over all training examples.
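A small sketch of the prediction, the loss and the cost together, using the squared-error loss written above. The function names are made up for this example, and it assumes the training examples are stored as the columns of `X`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, X):
    """y_hat = sigmoid(w^T x + b) for every column x of X, shape (1, m)."""
    return sigmoid(w.T @ X + b)

def loss(y_hat, y):
    """Squared-error loss for each training example."""
    return 0.5 * (y_hat - y) ** 2

def cost(y_hat, y):
    """Average of the loss over all training examples."""
    return np.mean(loss(y_hat, y))

# Toy data: 3 features, 2 training examples (values are arbitrary).
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.0]])
Y = np.array([[1.0, 0.0]])
w = np.zeros((3, 1))
b = 0.0

Y_hat = predict(w, b, X)   # every prediction is 0.5 because w and b are zero
J = cost(Y_hat, Y)         # 0.5 * (0.5)^2 = 0.125 for both examples
```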

Learning rate

Learning rate \(\alpha\) is the coefficient we apply to change weights in

$$w' = w - \alpha \frac{dJ(w)}{dw}$$

Here \(J(w)\) is the cost function.

When it's too small, learning is slow; when it's too large, the updates may overshoot the optimum. So it should be selected wisely.
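To see the effect of the learning rate, here is a toy sketch that minimizes \(J(w) = w^2\), whose derivative is \(2w\). The function and the three rates are arbitrary choices for illustration, not values from the course.

```python
def gradient_descent(alpha, w=5.0, steps=20):
    """Run a few gradient descent steps on J(w) = w^2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w          # dJ/dw for J(w) = w^2
        w = w - alpha * grad
    return w

slow = gradient_descent(alpha=0.01)     # moves toward 0, but is still far away
good = gradient_descent(alpha=0.4)      # ends very close to the optimum w = 0
diverged = gradient_descent(alpha=1.1)  # every step overshoots and |w| grows
```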

IDEA: Can we use a vector instead of a single value for learning rate?

For Leaky ReLU activation functions, that seems possible, though possibly overkill. When we use \(\alpha\) as a vector, it brings an extra layer of complexity and we also need more computation to adjust \(\alpha\). (The only reason to use a vector is to let some features learn more slowly than others.)

The feature vector

\(X\) is a feature vector with \(m\) columns, each for a different training example. Each example has \(n\) features. So, X.shape = (n,m)

Y.shape = (1, m)

The computation graph

We can convert any formula to a graph by treating operands as vertices and operations as edges.

In neural networks, the computation runs forward from the inputs to the output, and the derivatives of the whole network are computed backward from the output to the inputs.

The notation for layers and training examples

\(X^{(i)}\) refers to the \(i^{th}\) training example.

\(z^{[i]}\) refers to the \(z\) values in the \(i^{th}\) layer.

\(z^{[1]}\) is \(z\) value for layer 1.

\(a^{[2]}\) is \(a\) values for layer 2.

\(a^{[1]} = [a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, a^{[1]}_4]^T\) for a neural net having 4 nodes in hidden layer 1.

Steps of computation

There are 2 steps of computation.

  1. \(z^{[1]} = w^{[1]T}x + b^{[1]}\)

  2. \(a^{[1]} = \sigma(z^{[1]})\)

In step 1, the weights and inputs are dot-multiplied and the bias is added. In step 2, an activation function is applied to this result. (For now this is \(\sigma\); we'll see that there are other functions for this purpose.)
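The two steps can be sketched for a single input and a hidden layer of 4 nodes. As an assumption of this sketch, the weight vectors \(w^T\) of the nodes are stacked as the rows of a matrix `W1`; the sizes and random values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                                # 3 input features, 4 hidden nodes

x = rng.standard_normal((n_x, 1))              # one training example, column vector
W1 = rng.standard_normal((n_h, n_x)) * 0.01    # row j holds w^T for node j
b1 = np.zeros((n_h, 1))

z1 = W1 @ x + b1    # step 1: weighted sum plus bias, shape (4, 1)
a1 = sigmoid(z1)    # step 2: activation, shape (4, 1)
```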

Backpropagation Algorithm

This is the basic algorithm that changes the weights to find a solution to the problem.

In the forward pass, the loss \(L(\hat{y}, y)\) is computed.

In backpropagation, the weights and biases are adjusted by their derivatives, scaled by the learning rate \(\alpha\).

The actual formulas are a bit long to keep in mind (at least for me) but the general idea is to calculate the derivatives from the last layer to the first and adjust weights accordingly.
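For the smallest possible case, a single sigmoid unit with the squared-error loss from above, the forward and backward passes fit in a few lines. This is only a sketch of the chain rule for that one unit, not the course's general formulas; `train_step` and the toy values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(w, b, x, y, alpha):
    """One forward and backward pass for a single sigmoid unit."""
    # Forward pass
    z = (w.T @ x).item() + b
    a = sigmoid(z)
    loss = 0.5 * (a - y) ** 2
    # Backward pass (chain rule, from the output back to the parameters)
    da = a - y                 # dL/da
    dz = da * a * (1.0 - a)    # dL/dz, using sigmoid'(z) = a * (1 - a)
    dw = dz * x                # dL/dw
    db = dz                    # dL/db
    # Update, scaled by the learning rate
    return w - alpha * dw, b - alpha * db, loss

w, b = np.zeros((2, 1)), 0.0
x, y = np.array([[1.0], [2.0]]), 1.0

losses = []
for _ in range(100):
    w, b, step_loss = train_step(w, b, x, y, alpha=0.5)
    losses.append(step_loss)
```

The loss recorded at the end of the loop is lower than at the start, which is all this toy example is meant to show.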

Vectorization notation

Training examples are indexed by the superscript \((i)\).

The matrix for \(Z\):

$$Z=\begin{pmatrix} z^{[1](1)} & z^{[1](2)} & z^{[1](3)} & \dots & z^{[1](m)} \\ z^{[2](1)} & z^{[2](2)} & z^{[2](3)} & \dots & z^{[2](m)} \\ z^{[3](1)} & z^{[3](2)} & z^{[3](3)} & \dots & z^{[3](m)} \\ \dots & \dots & \dots & \dots & \dots \\ z^{[n](1)} & z^{[n](2)} & z^{[n](3)} & \dots & z^{[n](m)} \\ \end{pmatrix}$$

The above is a vectorized representation of the \(Z\) matrix, for multiple inputs and multiple layers. The \(A\) matrix is similar.

As \(w^{[i]T}x + b^{[i]}\) is a column vector, we concatenate these column vectors for multiple inputs.
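A quick check of that concatenation for one layer, assuming the examples are the columns of `X` and the nodes' weight vectors are the rows of `W1` (both names and sizes are illustrative). Computing one column at a time and stacking gives the same result as a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5                     # features, nodes, training examples

X = rng.standard_normal((n_x, m))         # one training example per column
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))

# One column vector per example, concatenated side by side.
columns = [W1 @ X[:, [i]] + b1 for i in range(m)]
Z_loop = np.hstack(columns)

# Vectorized version: b1 is broadcast across the m columns.
Z_vec = W1 @ X + b1
```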

Activation Functions

There are roughly 4 kinds of activation functions.


Sigmoid

This is the default one and it's historic. There is no need to use it except in the final activation layer, when we need an output between 0 and 1, as in binary classification.

tanh

It's better than sigmoid in almost every case. It's asymptotic to 1 and -1, and this is a better behavior than sigmoid's 1 and 0.

ReLU

It's a simple function, \(r=\max(0, x)\), and has become rather popular recently.

Leaky ReLU

ReLU has zero gradient for \(x<0\), so a node stuck in that region can stop learning. Leaky ReLU, \(r=\max(0.01x, x)\), avoids this with a very small slope for \(x<0\).
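The four activation functions are short enough to write out with NumPy. The 0.01 slope is the one from the formula above; the function names are just for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

x = np.array([-2.0, 0.0, 3.0])
relu_out = relu(x)          # [0., 0., 3.]
leaky_out = leaky_relu(x)   # [-0.02, 0., 3.]
```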

Training Set and Test Set

In the old days, when datasets had 100, 1,000 or 10,000 elements, we could split them into training/test sets as 70/30% or into training/development/test sets as 60/20/20%.

However, in the age of Big Data™, when our datasets include 10,000,000 elements, we don't need to split the dataset with these percentages, because the dev and test sets only have to be large enough to evaluate the model and keep development fast. So instead of a percentage split, it's more reasonable to keep 10,000 elements as the dev set and a similar number as the test set.

The important point: the dev and test sets should come from the same distribution. We use dev/test sets to check the performance of the model on real data and whether the model overfits.

Bias and Variance

High bias means the model is underfitting.

High variance means the model is overfitting.

High Bias

E.g. Train Set Error Rate: 15% Dev Set Error Rate: 16%.

We have a small model and it is not able to learn the data.

We may use a larger network, train longer, or change the NN architecture.

High Variance

E.g. Train Set Error Rate: 1% Dev Set Error Rate: 14%.

Our model overfits the data and doesn't generalize well.

We need more data, or we can use techniques like regularization to reduce overfitting. Changing the NN structure may also help.

High Bias and High Variance

In classical models, we have a bias/variance tradeoff: when we reduce the bias, we increase the variance and vice versa.

However, in Deep Learning we can have both. We may have, e.g., a 15% Train Set Error Rate and a 30% Dev Set Error Rate. This means we have both high bias and high variance.

In that case we need a larger network and more data, and we should use regularization or change the NN structure.


Regularization

It's used to decrease the variance and overfitting.

Basically there are two types: L2 Regularization adds a penalty on the size of the weights, and Dropout Regularization randomly sets the outputs of some nodes to 0.

L2 Regularization

In the weight update without regularization, we have

$$w^{[l]} \leftarrow w^{[l]} - \alpha (dw^{[l]})$$

With regularization, the cost gets a penalty term \(\frac{\lambda}{2m} \|w^{[l]}\|^2\), whose derivative adds another term to the update:

$$ -\alpha (\frac{\lambda}{m} w^{[l]}) $$

Hence the whole formula becomes

$$w^{[l]} \leftarrow (1 - \alpha \frac{\lambda}{m}) w^{[l]} - \alpha dw^{[l]}$$

This is called weight decay because every update brings \(w\) closer to 0. In effect it makes the network smaller and keeps \(w\) in more manageable intervals.
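A sketch of one such update. It assumes the penalty in the cost is \(\frac{\lambda}{2m}\|w\|^2\), so its derivative contributes \(\frac{\lambda}{m} w\); with a different definition of \(\lambda\) the constant changes, but the shrinking effect is the same. `l2_update` is a name made up for this example.

```python
import numpy as np

def l2_update(W, dW, alpha, lambd, m):
    """One gradient step with L2 regularization (weight decay).

    dW is the gradient of the unregularized cost.
    """
    decay = 1.0 - alpha * lambd / m     # slightly below 1, shrinks W every step
    return decay * W - alpha * dW

W = np.array([[1.0, -2.0]])
dW = np.zeros_like(W)                   # zero gradient isolates the decay effect
W_new = l2_update(W, dW, alpha=0.1, lambd=0.5, m=10)   # [[0.995, -1.99]]
```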


Dropout

Dropout is a crazy technique in which the algorithm knocks out some of the nodes while calculating the weights. These temporarily deleted nodes are considered out, and the cost function \(J\) and the weights \(w^{[l]}\) (which are connected to the remaining nodes) are calculated without their input.

In each iteration, a random set of nodes is removed from the calculations. The number of remaining nodes is determined by keep-prob. If keep-prob is 0.5, for example, roughly half of the nodes are removed from weight calculations.
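A sketch of how a layer's activations could be masked with keep-prob. Dividing the kept activations by `keep_prob` (often called inverted dropout) is an addition of this sketch, not something stated in these notes; it keeps the expected value of the activations unchanged.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Zero out each activation with probability 1 - keep_prob."""
    mask = rng.random(a.shape) < keep_prob   # True for the nodes that are kept
    a_dropped = (a * mask) / keep_prob       # rescale the kept activations
    return a_dropped, mask

rng = np.random.default_rng(0)
a = np.ones((4, 1000))                       # dummy activations, all equal to 1
a_dropped, mask = dropout_forward(a, keep_prob=0.5, rng=rng)
kept_fraction = mask.mean()                  # roughly 0.5
```

A new mask is drawn on every training iteration, and no mask is applied when the trained network is used for prediction.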

Data Augmentation

Adding variations of images found in the dataset can be thought of as a regularization technique too. For example, flipping images horizontally or vertically, adding noise, etc. can be used for regularization.

The important aspect: the changes shouldn't change the meaning of the images. A vertically flipped cat image is still a cat image, but a vertically flipped 4 is no longer a 4.

Early Stopping

While calculating the cost function for the training and test sets, there is a point where the training set \(J\) continues to decrease but the test set \(J\) begins to increase, indicating overfitting.

We can watch for this point and stop training the network there.
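One way to find that point is to remember the epoch with the lowest held-out cost and give up once the cost has failed to improve for a few epochs. The `patience` parameter and the cost values below are made up for this sketch.

```python
def early_stopping_epoch(heldout_costs, patience=2):
    """Return the epoch with the lowest held-out cost seen before stopping."""
    best_epoch, best_cost = 0, float("inf")
    for epoch, cost in enumerate(heldout_costs):
        if cost < best_cost:
            best_epoch, best_cost = epoch, cost
        elif epoch - best_epoch >= patience:
            break                      # no improvement for `patience` epochs
    return best_epoch

# Hypothetical held-out costs: they fall at first, then rise as overfitting starts.
heldout_costs = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_at = early_stopping_epoch(heldout_costs)   # 3
```

In a real training loop the weights from the best epoch would be saved and restored.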


Normalization

Normalization brings all feature values \(X\) to a similar range. If \(x_1\) is between 1 and 1000 and \(x_2\) is between 0 and 1, weights and hidden layers may not cope with these ranges.

For each feature, using \(x_i = \frac{x_i}{m_i}\), where \(m_i\) is the average of \(x_i\), is a good approach.
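A sketch of this scaling with features as rows and examples as columns, using made-up values. It assumes every feature has a nonzero mean.

```python
import numpy as np

# Two features on very different scales, three training examples.
X = np.array([[100.0, 500.0, 900.0],
              [0.2, 0.5, 0.8]])

feature_means = X.mean(axis=1, keepdims=True)   # shape (2, 1), one mean per feature
X_scaled = X / feature_means                    # every feature now averages 1
```

The same `feature_means` computed on the training set should be reused to scale the dev and test sets.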

Exploding and Vanishing Gradients

For deep neural networks, activations and gradients may increase or decrease exponentially with depth. Suppose all weights are 2 and activations are linear. In this case, for an \(l\)-layer network, the final activation will be proportional to \(2^l\), and the network will learn slowly, if it can learn at all.

A similar problem is found when weights are less than 1. In this case, the activations and gradients shrink toward 0 with depth, and the network may not converge at all.
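The arithmetic is easy to check with a toy network that has one node per layer, linear activations and the same weight everywhere. The depth of 50 is an arbitrary choice.

```python
def final_activation(weight, layers, x=1.0):
    """Pass x through `layers` linear layers that all multiply by `weight`."""
    a = x
    for _ in range(layers):
        a = weight * a
    return a

exploding = final_activation(2.0, layers=50)   # 2**50, about 1.1e15
vanishing = final_activation(0.5, layers=50)   # 0.5**50, about 8.9e-16
```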

Weight Initialization to alleviate Exploding/Vanishing Gradients

Because a node sums its weighted inputs, it's better to initialize its weights with a variance of \(\frac{1}{n}\), where \(n\) is the number of nodes feeding into it (the size of the previous layer).

If we are using ReLU activation on a layer, it's reasonable to use

W[i] = np.random.randn(*shape) * np.sqrt(2 / n[i-1])

or if we are using tanh activation

W[i] = np.random.randn(*shape) * np.sqrt(1 / n[i-1])

Another option is the variant of Xavier initialization that uses the sizes of both the previous and the current layer

W[i] = np.random.randn(*shape) * np.sqrt(2 / (n[i] + n[i-1]))

Initialization by 0

Initializing the weights with all 0's fails to break the symmetry, and the network doesn't learn anything useful. Every node in a layer computes the same output and gets the same derivative in backpropagation, so the nodes stay identical no matter how long we train.

It is fine to set the biases to 0, but not the weights.

Initialization with Random Numbers

W[l] = np.random.randn(layer_dim[l], layer_dim[l-1]) * FACTOR

If the FACTOR is too large (like 10), the network converges slowly and vanishing/exploding gradients may occur.

If FACTOR is np.sqrt(2 / layer_dim[l-1]) for layer l, it's called He initialization. (He is the first author of the 2015 paper.)

If FACTOR is np.sqrt(1 / layer_dim[l-1]) for layer l, it's called Xavier initialization.

These two initialization methods work better than setting a constant.
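Putting the two factors into one helper gives a sketch like the following. The function name, the `method` argument and the layer sizes are made up for this example; biases start at 0 as noted earlier.

```python
import numpy as np

def init_weights(layer_dim, method="he", seed=0):
    """Random weights scaled by the He or Xavier factor, biases set to 0."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dim)):
        fan_in = layer_dim[l - 1]
        if method == "he":
            factor = np.sqrt(2.0 / fan_in)   # He initialization
        else:
            factor = np.sqrt(1.0 / fan_in)   # Xavier initialization
        params[f"W{l}"] = rng.standard_normal((layer_dim[l], fan_in)) * factor
        params[f"b{l}"] = np.zeros((layer_dim[l], 1))
    return params

params = init_weights([1000, 50, 1], method="he")
W1_std = params["W1"].std()   # close to sqrt(2 / 1000), about 0.045
```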


Mini-batch Gradient Descent

When we have a large amount of training data (e.g. 10 million images), we cannot feed it to the network all at once. So we divide it into chunks called mini-batches.

The order of samples is randomized. They are divided into manageable chunks and these chunks are fed into the model one at a time.

If the chunk size is 1, the algorithm is called Stochastic Gradient Descent.

If the batch size is equal to the size of training set, it’s called Batch Gradient Descent.

For dataset sizes < 2000, Batch GD can be used.

For larger datasets, minibatch sizes from 32 to 512 can be used. 1024 is rare, larger is much rarer.
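A sketch of the shuffling and splitting, assuming examples are the columns of `X` and `Y`. The function name and the tiny dataset are illustrative; the last mini-batch is simply smaller when the dataset size is not a multiple of the batch size.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle the examples (columns) and cut them into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                  # same shuffle for X and Y
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    return [
        (X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

# 10 examples with 2 features; row 0 of X equals Y so the pairing can be checked.
X = np.arange(20.0).reshape(2, 10)
Y = np.arange(10.0).reshape(1, 10)

batches = random_mini_batches(X, Y, batch_size=4)
sizes = [bx.shape[1] for bx, _ in batches]     # [4, 4, 2]
```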

Optimization: Momentum

When training with mini-batches, it may be better to keep the previous batches' gradients at hand, so that we don't change the gradient direction too much.

We introduce a new hyperparameter called \(\beta\), which decides how much the \(dW\) and \(db\) values from previous mini-batches contribute to the \(dW\) and \(db\) used in the current update. If \(\beta\) is high, it smooths the gradients; if low, recent calculations determine the gradients more.
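A sketch of one momentum update, assuming the usual exponentially weighted average form \(v \leftarrow \beta v + (1 - \beta) dW\). The name `v` and the default values are assumptions of this example; the update for \(b\) has the same shape.

```python
import numpy as np

def momentum_step(W, dW, v, alpha=0.01, beta=0.9):
    """Blend the new gradient into the running average v, then step along v."""
    v = beta * v + (1.0 - beta) * dW
    W = W - alpha * v
    return W, v

W = np.array([1.0])
v = np.zeros_like(W)
W, v = momentum_step(W, np.array([1.0]), v)   # v = 0.1, W = 0.999
```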

Optimization: RMSprop

Instead of updating with \(W \leftarrow W - \alpha dW\) directly, we keep a parameter \(s_{dW}\), which is a moving average of the squared \(dW\).

$$s_{dW} \leftarrow \beta_2 s_{dW} + (1 - \beta_2) (dW)^2$$
$$ W \leftarrow W - \alpha \frac{dW}{\sqrt{s_{dW} + \epsilon}} $$

Calculation for \(b\) is similar.

The product in \((dW)^2\) is element-wise, not a dot-product. Dividing by \(\sqrt{s_{dW}}\) damps the updates in directions where the gradients are large and oscillating, and allows faster movement in directions where they are small.
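A sketch of one RMSprop update, with the squared-gradient term added to the running average. The default values are common choices, not taken from these notes. The example uses two gradient components that differ by a factor of 100 and shows that their step sizes come out almost equal.

```python
import numpy as np

def rmsprop_step(W, dW, s, alpha=0.001, beta2=0.999, eps=1e-8):
    """Scale each gradient component by the root of its running squared average."""
    s = beta2 * s + (1.0 - beta2) * dW ** 2   # element-wise square
    W = W - alpha * dW / np.sqrt(s + eps)
    return W, s

W = np.zeros(2)
s = np.zeros(2)
dW = np.array([0.1, 10.0])                    # one small, one large gradient

W_new, s = rmsprop_step(W, dW, s)
step_sizes = np.abs(W_new - W)                # nearly the same for both components
```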

Optimization: Adam

This is the combination of the momentum and RMSprop techniques and a better approach to keeping track of previous mini-batch gradients. It's a bit more complex to calculate: it uses two hyperparameters \(\beta_1\) and \(\beta_2\) to calculate \(v\) and \(s\) and applies bias correction to both. (The actual calculations can be found elsewhere.) However, empirically it's a much better approach than momentum, and the model converges much more quickly with it.
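For reference, a compact sketch of one Adam update in its commonly published form. The default values are the usual ones and `t` is the update counter starting at 1; neither is taken from these notes.

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (v) plus RMSprop (s), both bias-corrected."""
    v = beta1 * v + (1.0 - beta1) * dW          # moving average of gradients
    s = beta2 * s + (1.0 - beta2) * dW ** 2     # moving average of squared gradients
    v_hat = v / (1.0 - beta1 ** t)              # bias correction
    s_hat = s / (1.0 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

W = np.array([1.0, -1.0])
v = np.zeros_like(W)
s = np.zeros_like(W)
dW = np.array([0.5, -3.0])

# On the first update the step is about alpha in the direction of -sign(dW).
W, v, s = adam_step(W, dW, v, s, t=1)
```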

Learning Rate Decay

When we have a fixed learning rate \(\alpha\), it may be difficult to settle on a good value near the optimum because the steps stay the same size. In this case, we can adjust the learning rate as a function of the number of epochs trained so far.

A good option is \(\alpha = \frac{1}{1 + d * t} \alpha_0\), where \(d\) is the decay rate and \(t\) is the epoch number.
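A sketch of this schedule, using the form with a plus sign in the denominator so that the rate shrinks as the epochs go by. The values of \(\alpha_0\) and \(d\) are arbitrary.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1.0 + decay_rate * epoch)

rates = [decayed_learning_rate(0.2, decay_rate=1.0, epoch=t) for t in range(4)]
# roughly 0.2, 0.1, 0.067, 0.05
```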

Hyperparameter Tuning

We have many hyperparameters in Deep Learning. The learning rate \(\alpha\), the optimization coefficients \(\beta_1\) and \(\beta_2\), the number of layers (\(l\)), the number of units per layer, activation functions, number of epochs, mini-batch sizes, etc. are all hyperparameters that affect the result and speed of the neural network.

We have some heuristics to search these.

When we have multiple hyperparameters, selecting them randomly is usually a better approach than making a uniform grid.

When we have a range for a parameter, we can sample the range logarithmically, rather than uniformly. So, for example, we can use \(\beta = 1 - 10^r\) for \(r \in [-3, -1]\) when selecting \(\beta\). This way, we can scale the parameter from 0.9 to 0.999, and a change in this range has a bigger effect than changing 0.9 to 0.8, for example.
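A sketch of this sampling for \(\beta\). The number of samples and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.uniform(-3, -1, size=1000)   # uniform in the exponent
beta = 1 - 10 ** r                   # values between 0.9 and 0.999

# About half of the samples land above 0.99, which a uniform draw
# between 0.9 and 0.999 would rarely produce.
share_above_099 = np.mean(beta > 0.99)
```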

Batch Normalization

Batch normalization normalizes the \(z\) values of a layer and then rescales them with two learnable parameters, \(\beta\) and \(\gamma\).

Here \(t\) is the index of the mini-batch and \(i\) is the index of the sample.

The mean for batch: \(\mu_t = \frac{1}{m^{\{t\}}} \sum_i z^{\{t\}(i)}\)

The variance for batch:

$$\sigma_t^2 = \frac{1}{m^{\{t\}}} \sum_i (z^{\{t\}(i)} - \mu_t)^2$$

After finding mean and variance

$$z_{\mathrm{norm}}^{\{t\}(i)} = \frac{z^{\{t\}(i)} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} $$
$$\tilde{z}^{\{t\}(i)} = \gamma z_{\mathrm{norm}}^{\{t\}(i)} + \beta$$

Note that this \(\beta\) is completely different from the \(\beta\) in optimization techniques.

Its primary purpose is to make information flow from the input layers to the deeper layers more easily. It keeps the \(z\) values of each layer within a stable range, so the deeper layers are less affected by changes in the earlier ones.

At test time, we may not have batches, so it's a viable approach to estimate \(\mu\) and \(\sigma^2\) with exponentially weighted moving averages of the values seen across mini-batches during training.
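A sketch of the forward computation for one layer and one mini-batch, with nodes as rows and samples as columns. The function names, the `momentum` value of the moving average and the toy data are assumptions of this example.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each row of Z over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)          # uses 1/m, like the formula above
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, mu, var

def update_running(running, batch_value, momentum=0.9):
    """Exponentially weighted moving average, kept for use at test time."""
    return momentum * running + (1.0 - momentum) * batch_value

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 8)) * 5.0 + 2.0     # 3 nodes, mini-batch of 8 samples
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))

Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)
running_mu = update_running(np.zeros((3, 1)), mu)
```

With \(\gamma = 1\) and \(\beta = 0\), each row of the output has mean 0 and variance 1; other values let the network learn a different scale and shift.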