Coursera Deep Learning Specialization Notes
Posted on Tue 30 October 2018 in new • 10 min read
These notes do not contain answers to quizzes or assignments, per the Honor Code. If you are looking for those, look elsewhere.
Binary Classification
Given a picture, classify it as cat or non-cat. The result is \(\hat{y} = P(y=1 \mid x)\). In other words, given \(x\), we calculate the probability that the input represents a cat.
Feature Vector from Image
We convert a picture, e.g. a (64, 64, 3) picture, into a (64 * 64 * 3, 1) feature vector.
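As a minimal sketch of the reshape above, assuming the picture is already loaded as a NumPy array (the random `image` below is only a stand-in for real pixel data):

```python
import numpy as np

# Hypothetical stand-in for a 64x64 RGB picture; real data would come from a file.
image = np.random.rand(64, 64, 3)

# Flatten all pixel values into a single column: (64 * 64 * 3, 1) = (12288, 1).
x = image.reshape(-1, 1)

print(x.shape)
```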
Sigmoid
\(\hat{y} = \sigma(w^T x + b)\), where \(\sigma(z) = \frac{1}{1 + e^{-z}}\). \(w\) is the weight vector. \(w^T\) is the transpose of \(w\). \(x\) is the input. \(b\) is the bias.
Loss Function
The loss function measures the error between the real value of \(y\) and our prediction \(\hat{y}\) for a single training example.
Cost function
The cost function is the average of the loss over all training examples.
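The notes don't say which loss is used, so the sketch below assumes the cross-entropy loss that is common for binary classification. The function names `loss` and `cost` are mine.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single example (an assumed choice of loss)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost: the average of the per-example losses."""
    return float(np.mean(loss(Y_hat, Y)))

# Three toy examples: true labels and predicted probabilities.
Y = np.array([[1, 0, 1]])
Y_hat = np.array([[0.9, 0.2, 0.6]])
J = cost(Y_hat, Y)
print(J)
```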
Learning rate
Learning rate \(\alpha\) is the coefficient we apply to change the weights in the gradient descent update \(w \leftarrow w - \alpha \, dw\).
When it's too small, learning is slow; when it's too large, the updates may overshoot the optimum. So it should be selected carefully.
IDEA: Can we use a vector instead of a single value for learning rate?
For Leaky ReLU activation functions, that seems possible, though possibly overkill. Using \(\alpha\) as a vector adds an extra layer of complexity and needs more computation to adjust \(\alpha\). (The point of a vector learning rate would be to let some features learn more slowly than others.)
The feature vector
\(X\) is a feature matrix with \(m\) columns, one for each training example. Each example has \(n\) features, so X.shape = (n, m)
Y.shape = (1, m)
The computation graph
We can convert any formula to a graph by treating operands as vertices and operations as edges.
In neural networks, the forward computation runs from the inputs to the output, and the derivatives of the whole network are computed in the opposite direction, from the output back to the inputs.
The notation for layers and training examples
\(X^{(i)}\) refers to the \(i^{th}\) training example.
\(z^{[i]}\) refers to the \(z\) values in the \(i^{th}\) layer.
\(z^{[1]}\) is \(z\) value for layer 1.
\(a^{[2]}\) is \(a\) values for layer 2.
\(a^{[1]} = [a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, a^{[1]}_4]^T\) for a neural net having 4 nodes in hidden layer 1.
Steps of computation
There are 2 steps of computation.

\(z^{[1]} = w^{[1]T}x + b^{[1]}\)

\(a^{[1]} = \sigma(z^{[1]})\)
In step 1, the weights and inputs are dot-multiplied and the bias is added. In step 2, an activation function is applied to this result. (For now this is \(\sigma\); we'll see that there are other functions for this purpose.)
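A small sketch of these two steps for one training example. The layer sizes and the random values are made up for illustration, and `w1` is stored so that `w1.T @ x` matches \(w^{[1]T}x\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                        # hypothetical sizes: 3 inputs, 4 hidden nodes
x = rng.standard_normal((n_x, 1))      # one training example as a column vector
w1 = rng.standard_normal((n_x, n_h))   # column j holds the weights of hidden node j
b1 = np.zeros((n_h, 1))

z1 = w1.T @ x + b1   # step 1: weighted sum plus bias
a1 = sigmoid(z1)     # step 2: activation

print(z1.shape, a1.shape)
```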
Backpropagation Algorithm
This is the basic algorithm that changes the weights to find a solution to the problem.
In the forward pass, the cost function \(L(\hat{y}, y)\) is computed.
In backpropagation, the weights and biases are adjusted by their derivatives, scaled by the learning rate \(\alpha\).
The actual formulas are a bit long to keep in mind (at least for me) but the general idea is to calculate the derivatives from the last layer to the first and adjust weights accordingly.
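For the simplest case (a single sigmoid unit, i.e. logistic regression) the whole loop fits in a few lines. This is only a sketch: it assumes the cross-entropy loss, for which the derivative with respect to \(z\) simplifies to `A - Y`, and the toy data and the name `train_step` are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(w, b, X, Y, alpha):
    """One forward pass, one backward pass, one weight update."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                                   # forward pass
    cost = float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))
    dZ = A - Y                                                 # backward pass
    dw = X @ dZ.T / m
    db = float(np.sum(dZ) / m)
    return w - alpha * dw, b - alpha * db, cost                # update

# Toy 1-feature dataset with m = 4 examples.
X = np.array([[0.0, 1.0, 2.0, 3.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = np.zeros((1, 1)), 0.0

costs = []
for _ in range(200):
    w, b, c = train_step(w, b, X, Y, alpha=0.5)
    costs.append(c)

print(costs[0], costs[-1])
```

Deeper networks repeat the same pattern layer by layer, starting from the last layer.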
Vectorization notation
Multiple training examples are indexed with the superscript \((i)\).
For layer \(l\), the \(Z\) matrix is \(Z^{[l]} = [z^{[l](1)}, z^{[l](2)}, \dots, z^{[l](m)}]\).
The above is a vectorized representation of the \(Z\) matrix, for multiple inputs and multiple layers. The \(A\) matrix is similar.
As \(w^{[l]T}x + b^{[l]}\) is a column vector, we concatenate these column vectors for multiple inputs.
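To check the idea, here is a sketch that computes \(Z\) for all examples at once and compares it with concatenating the per-example columns. Sizes and values are made up, and `W1` holds the transposed weight vectors as rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5                  # hypothetical sizes
X = rng.standard_normal((n_x, m))      # one column per training example
W1 = rng.standard_normal((n_h, n_x))   # row j is the transposed weight vector of node j
b1 = rng.standard_normal((n_h, 1))

# Vectorized: b1 is broadcast across the m columns.
Z1 = W1 @ X + b1

# Same thing, one example at a time, then concatenated column by column.
Z1_stacked = np.hstack([W1 @ X[:, [i]] + b1 for i in range(m)])

print(np.allclose(Z1, Z1_stacked))
```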
Activation Functions
There are roughly 4 kinds of activation functions.
Sigmoid
This is the default one and it's historic. There is no need to use it except in the final activation layer, when we need an output between 0 and 1.
Tanh
It's better than Sigmoid in almost every case. It's asymptotic between -1 and 1, and this is better behavior than sigmoid's 0 and 1.
ReLU
It's a simple function, \(r=\max(0, x)\), and has become rather popular recently.
Leaky ReLU
ReLU has a zero gradient for \(x<0\) (and is not differentiable at 0), so this version uses \(r=\max(0.01x, x)\), which has a very small slope for \(x<0\).
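The four functions are short enough to write down directly in NumPy. A sketch, with function names of my choosing:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```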
Training Set and Test Set
In the old days, when datasets had 100, 1,000 or 10,000 elements, we could split them into training/test sets as 70/30% or into training/development/test sets as 60/20/20%.
However, in the age of Big Data™, when our datasets include 10,000,000 elements, we don't need to split the dataset with these percentages, because the reason we use a separate set is to speed up development. So instead of a percentage split, it's more reasonable to keep 10,000 elements as the dev set and a similar number as the test set.
The important point: the dev and test data should come from the same distribution. We use dev/test sets to check the performance of the model on real data and whether the model overfits.
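A sketch of such a fixed-size split, working on indices only. The function name is hypothetical and the dataset size below is smaller than the 10 million above just to keep the example light.

```python
import numpy as np

def split_indices(m, dev_size, test_size, seed=0):
    """Shuffle example indices, then carve off fixed-size dev and test sets."""
    perm = np.random.default_rng(seed).permutation(m)
    dev = perm[:dev_size]
    test = perm[dev_size:dev_size + test_size]
    train = perm[dev_size + test_size:]
    return train, dev, test

train_idx, dev_idx, test_idx = split_indices(1_000_000, 10_000, 10_000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Shuffling before splitting is what keeps the dev and test sets on the same distribution as each other.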
Bias and Variance
High bias means the model is underfitting.
High variance means the model is overfitting.
High Bias
E.g. train set error rate: 15%, dev set error rate: 16%.
The model is too small and is not able to learn the data.
We may use a larger network, train longer, or change the NN architecture.
High Variance
E.g. train set error rate: 1%, dev set error rate: 14%.
Our model overfits the data and doesn't generalize well.
We need more data, or techniques like regularization to reduce overfitting. Changing the NN structure may also help.
High Bias and High Variance
In classical models, we have a bias/variance tradeoff: when we reduce the bias, we increase the variance and vice versa.
However, in Deep Learning we can have both. We may have, e.g., a 15% train set error rate and a 30% dev set error rate. This means we have both high bias and high variance.
In that case we need all of these: a larger network, more data, regularization and a changed NN structure.
Regularization
It's used to decrease the variance and overfitting.
Basically there are two types: L2 Regularization adds a penalty on the size of the weights, and Dropout Regularization randomly sets the output of some nodes to 0.
L2 Regularization
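In the weight update without regularization:
\(W^{[l]} \leftarrow W^{[l]} - \alpha \, dW^{[l]}\)
With regularization, we add another term to the gradient:
\(\frac{\lambda}{m} W^{[l]}\)
Hence the whole formula becomes
\(W^{[l]} \leftarrow W^{[l]} - \alpha \left(dW^{[l]} + \frac{\lambda}{m} W^{[l]}\right) = \left(1 - \frac{\alpha \lambda}{m}\right) W^{[l]} - \alpha \, dW^{[l]}\)
where \(\lambda\) is the regularization strength and \(m\) is the number of training examples.
This is called weight decay because it brings \(w\) closer to 0. In effect it makes the network simpler and keeps \(w\) within a more manageable range.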
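A sketch of one L2-regularized update, assuming the usual form in which a term proportional to the weights is added to the gradient. `lambd` (the regularization strength) and `m` (the number of training examples) are my names for the quantities involved.

```python
import numpy as np

def l2_update(W, dW, alpha, lambd, m):
    """Gradient step with an extra (lambd / m) * W term added to the gradient."""
    return W - alpha * (dW + (lambd / m) * W)

W = np.array([[1.0, -2.0]])
dW = np.zeros_like(W)   # zero gradient, to isolate the decay effect

W_new = l2_update(W, dW, alpha=0.1, lambd=5.0, m=10)
print(W_new)   # every weight is scaled by (1 - alpha * lambd / m)
```

Even with a zero gradient the weights shrink toward 0, which is where the name weight decay comes from.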
Dropout
Dropout is a crazy technique in which the algorithm knocks out some of the nodes while calculating the weights. These temporarily deleted nodes are left out, and the cost function \(J\) and the weights \(w^{[l]}\) (those connected to the remaining nodes) are calculated without their input.
In each iteration, a random set of nodes is removed from the calculations. The expected number of remaining nodes is set via keep-prob.
If keep-prob is 0.5, for example, roughly half of the nodes are removed from the weight calculations.
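A sketch of the masking step for one layer's activations. It uses the "inverted dropout" variant, which also divides by the keep probability so the expected activation stays the same; that rescaling is my assumption and is not mentioned above. `dropout_forward` and `keep_prob` are names I picked.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Zero out each node with probability 1 - keep_prob, then rescale the rest."""
    mask = rng.random(A.shape) < keep_prob   # True for nodes that are kept
    return A * mask / keep_prob, mask

rng = np.random.default_rng(0)
A = np.ones((4, 1000))                       # made-up activations
A_drop, mask = dropout_forward(A, keep_prob=0.5, rng=rng)

print(mask.mean())   # fraction of nodes kept, close to 0.5
```

A fresh mask is drawn in every iteration, and dropout is switched off at test time.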
Data Augmentation
Adding variations of the images found in the dataset can be thought of as a regularization technique too. For example, flipping images horizontally or vertically, or adding noise, can be used for regularization.
The important aspect: the changes shouldn't change the meaning of the images. A vertically flipped cat image is still a cat image, but a vertically flipped 4 is not a 4.
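Two of these variations, sketched on a random array standing in for a real image with pixel values in [0, 1]. The noise level 0.05 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))   # hypothetical image: height, width, channels

# Horizontal flip: reverse the width axis.
flipped = image[:, ::-1, :]

# Additive Gaussian noise, clipped back to the valid pixel range.
noisy = np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0)

print(flipped.shape, noisy.shape)
```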
Early Stopping
While calculating the cost function for the training and test sets, there is a point where the training set \(J\) continues to decrease but the test set \(J\) begins to increase, indicating overfitting.
We can watch for this point and stop training the network there.
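One way to watch for that point, sketched on a list of held-out costs recorded per epoch. The `patience` parameter (how many non-improving epochs to tolerate before stopping) is my addition, since a single noisy epoch shouldn't end training.

```python
def early_stopping_epoch(held_out_costs, patience=2):
    """Return the epoch with the best held-out cost, scanning until the cost
    has failed to improve for `patience` consecutive epochs."""
    best_cost = float("inf")
    best_epoch = 0
    for epoch, c in enumerate(held_out_costs):
        if c < best_cost:
            best_cost, best_epoch = c, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Made-up costs: they improve until epoch 3, then start rising.
costs = [0.9, 0.7, 0.6, 0.55, 0.58, 0.61, 0.7]
stop_at = early_stopping_epoch(costs)
print(stop_at)
```

In practice we would also keep a copy of the weights from the best epoch.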
Normalization
Normalization brings all feature values in \(X\) to a similar range. If \(x_1\) is between 1 and 1000 and \(x_2\) is between 0 and 1, the weights and hidden layers may not cope well with these ranges.
For each feature, using \(x_i = \frac{x_i}{m_i}\), where \(m_i\) is the average of \(x_i\), is a good approach.
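A sketch of the division-by-mean scaling on a made-up matrix with features as rows. The second variant (subtract the mean, divide by the standard deviation) is a common alternative that I'm adding for comparison; it is not from the notes.

```python
import numpy as np

# Two features on very different scales, three examples (shape (n, m)).
X = np.array([[1.0, 500.0, 1000.0],
              [0.1, 0.5, 0.9]])

mean = X.mean(axis=1, keepdims=True)   # per-feature average
X_scaled = X / mean                    # the scaling described above

# Common alternative (my assumption): zero mean and unit variance per feature.
std = X.std(axis=1, keepdims=True)
X_standardized = (X - mean) / std

print(X_scaled)
```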
Exploding and Vanishing Gradients
For deep neural networks, activations and gradients may increase or decrease exponentially with depth. Suppose all weights are 2 and the activations are linear; in this case, for an \(l\)-layer network, the final activation will be on the order of \(2^l\) and the network will learn slowly, if it can learn at all.
A similar problem is found when the weights are less than 1. In this case, the activations and gradients shrink toward 0 and the network may not converge at all.
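The effect is easy to see with a toy chain of scalar "layers" (one weight each, linear activation, no bias). The depth of 50 is arbitrary.

```python
def final_activation(weight, n_layers, x=1.0):
    """Pass x through n_layers linear layers that all share one scalar weight."""
    a = x
    for _ in range(n_layers):
        a = weight * a
    return a

big = final_activation(2.0, 50)     # grows like 2 ** 50
small = final_activation(0.5, 50)   # shrinks like 0.5 ** 50
print(big, small)
```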
Weight Initialization to alleviate Exploding/Vanishing Gradients
Since a node sums over all of its weighted inputs, it is better to initialize the weights with a variance of \(\frac{1}{n}\), where \(n\) is the number of nodes feeding into the layer.
If we are using ReLU activation on a layer, it's reasonable to use
W[i] = np.random.randn(*shape) * np.sqrt(2 / n[i-1])
or, if we are using tanh activation,
W[i] = np.random.randn(*shape) * np.sqrt(1 / n[i-1])
Another option is to use Xavier initialization
W[i] = np.random.randn(*shape) * np.sqrt(2 / (n[i] + n[i-1]))
Initialization by 0
Initializing all the weights to 0 fails to break the symmetry, and the network doesn't learn anything useful. Every node in a layer computes the same output and receives the same derivative, so backpropagation keeps the nodes identical and the network effectively stays the same.
It is fine to set the biases to 0, but not the weights.
Initialization with Random Numbers
W[l] = np.random.randn(layer_dim[l], layer_dim[l-1]) * FACTOR
If FACTOR is too large (like 10), the network converges slowly and vanishing/exploding gradients may occur.
If FACTOR is np.sqrt(2 / layer_dim[l-1]) for layer l, it's called He initialization. (It is named after He, an author of the 2015 paper.)
If FACTOR is np.sqrt(1 / layer_dim[l-1]) for layer l, it's called Xavier initialization.
These two initialization methods work better than setting a constant.
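Both factors fit into one small helper. A sketch: the function name, the `method` switch and the layer sizes are mine, and biases start at 0.

```python
import numpy as np

def initialize_weights(layer_dim, method="he", seed=0):
    """Random weights scaled by sqrt(2 / fan_in) ("he") or sqrt(1 / fan_in) ("xavier")."""
    rng = np.random.default_rng(seed)
    numerator = 2.0 if method == "he" else 1.0
    params = {}
    for l in range(1, len(layer_dim)):
        fan_in = layer_dim[l - 1]
        factor = np.sqrt(numerator / fan_in)
        params[f"W{l}"] = rng.standard_normal((layer_dim[l], fan_in)) * factor
        params[f"b{l}"] = np.zeros((layer_dim[l], 1))
    return params

params = initialize_weights([1000, 500, 1], method="he")
print(params["W1"].shape, params["W1"].std())
```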
Mini-Batches
When we have a large amount of training data (e.g. 10 million images), we cannot feed it to the network all at once. So we divide it into chunks called mini-batches.
The order of the samples is randomized. They are divided into manageable chunks, and these chunks are fed into the model one at a time.
If the chunk size is 1, the algorithm is called Stochastic Gradient Descent.
If the batch size is equal to the size of the training set, it's called Batch Gradient Descent.
For dataset sizes < 2000, Batch GD can be used.
For larger datasets, mini-batch sizes from 32 to 512 can be used. 1024 is rare, larger is much rarer.
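A sketch of the shuffle-and-partition step, assuming examples are stored as columns of `X` and `Y`. The function name and the toy data are made up.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle the examples (columns), then cut them into consecutive chunks."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]   # same permutation keeps pairs aligned
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

X = np.arange(20.0).reshape(2, 10)   # 10 examples with 2 features each
Y = np.arange(10).reshape(1, 10)
batches = make_mini_batches(X, Y, batch_size=4)

print([bx.shape[1] for bx, _ in batches])   # the last chunk may be smaller
```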
Optimization: Momentum
When training with mini-batches, it may be better to keep the previous batches' gradients at hand, so that we don't change the gradient direction too much.
We introduce a new hyperparameter called \(\beta\), which decides the role of the earlier \(dW^{[1..i]}\) and \(db^{[1..i]}\) while calculating \(dW^{[l]}\) and \(db^{[l]}\). If \(\beta\) is high, it smooths the gradients more; if it is low, the most recent calculations determine the gradients more.
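A sketch of one momentum step for a single weight array, assuming the exponentially weighted average form of momentum. The velocity `v` and the default values are my choices; the update for \(b\) would look the same.

```python
import numpy as np

def momentum_update(W, dW, v, alpha=0.01, beta=0.9):
    """Blend the new gradient into the running average v, then step along v."""
    v = beta * v + (1 - beta) * dW
    W = W - alpha * v
    return W, v

W = np.array([1.0])
v = np.zeros(1)
W, v = momentum_update(W, np.array([1.0]), v, alpha=0.1, beta=0.9)
print(W, v)
```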
Optimization: RMSprop
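Instead of updating directly with \(W \leftarrow W - \alpha \, dW\), we keep a quantity \(s_{dW}\), an exponentially weighted average of the squared \(dW\): \(s_{dW} \leftarrow \beta s_{dW} + (1 - \beta)(dW)^2\). The update then becomes \(W \leftarrow W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}\), where \(\epsilon\) is a small constant that avoids division by zero.
Calculation for \(b\) is similar.
The product in \((dW)^2\) is element-wise, not a dot product. Dividing by \(\sqrt{s_{dW}}\) dampens the oscillations along directions where the gradients are large and allows faster movement along directions where they are small.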
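A sketch of one RMSprop step, assuming the standard form of the update. `eps` and the default values are my choices, and the gradients are made up to show two directions on very different scales.

```python
import numpy as np

def rmsprop_update(W, dW, s, alpha=0.001, beta=0.9, eps=1e-8):
    """Track a running average of squared gradients and divide the step by its root."""
    s = beta * s + (1 - beta) * dW ** 2      # element-wise square
    W = W - alpha * dW / (np.sqrt(s) + eps)
    return W, s

W = np.array([1.0, 1.0])
dW = np.array([10.0, 0.1])                   # one steep direction, one shallow
W_new, s = rmsprop_update(W, dW, np.zeros(2))

step = W - W_new
print(step)   # both directions end up with a step of similar size
```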
Optimization: Adam
This combines the momentum and RMSprop techniques and is a better way to keep track of previous mini-batch gradients. It's a bit more complex to calculate: it uses two hyperparameters \(\beta_1\) and \(\beta_2\) to calculate \(v\) and \(s\), and applies bias correction to both. (The actual calculations can be found elsewhere.) Empirically, however, it often works better than momentum alone and makes the model converge more quickly.
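For reference, a sketch of one Adam step in its commonly published form. The default values (0.9, 0.999, 1e-8) are the ones usually quoted for Adam, not something taken from these notes, and `t` counts update steps starting at 1.

```python
import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum-style average v, RMSprop-style average s, both bias-corrected."""
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * dW ** 2
    v_hat = v / (1 - beta1 ** t)   # bias correction for the early steps
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

W = np.array([1.0, 1.0])
dW = np.array([10.0, -0.1])
W_new, v, s = adam_update(W, dW, np.zeros(2), np.zeros(2), t=1)
print(W_new)   # on the first step each weight moves by about alpha
```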
Learning Rate Decay
When we have a fixed learning rate \(\alpha\), it may be difficult to settle on a good converging value because the steps stay the same size. In this case, we can adjust the learning rate as a function of the number of epochs trained so far.
A good option is \(\alpha = \frac{1}{1 + d \cdot t} \alpha_0\), where \(d\) is the decay rate and \(t\) is the epoch number.
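A sketch of this schedule, assuming the decay form \(\alpha_0 / (1 + d \cdot t)\). The starting rate 0.2 and decay rate 1.0 are made-up values.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Learning rate after `epoch` epochs: alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)

rates = [decayed_learning_rate(0.2, 1.0, t) for t in range(4)]
print(rates)   # shrinks every epoch
```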
Hyperparameter Tuning
We have too many hyperparameters in Deep Learning. The learning rate \(\alpha\), the optimization coefficients \(\beta_1\) and \(\beta_2\), the number of layers (\(l\)), the number of units per layer, the activation functions, the number of epochs, the mini-batch size etc. are all hyperparameters that affect the result and speed of a neural network.
We have some heuristics to search these.
When we have multiple hyperparameters, selecting them randomly is usually a better approach than making a uniform grid.
When we have a range for a parameter, we can sample it on a logarithmic scale rather than uniformly. So, for example, we can use \(\beta = 1 - 10^r\) for \(r \in [-3, -1]\) when selecting \(\beta\). This way, we can scale the parameter from 0.9 to 0.999, and this has a bigger effect than changing 0.9 to 0.8, for example.
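A sketch of the logarithmic sampling for \(\beta\): draw \(r\) uniformly from \([-3, -1]\) and map it through \(1 - 10^r\). The sample count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.uniform(-3, -1, size=1000)   # uniform in the exponent
beta = 1 - 10 ** r                   # lands between 0.9 and 0.999

# About half of the samples fall in 0.99..0.999, the other half in 0.9..0.99.
print(beta.min(), beta.max(), (beta > 0.99).mean())
```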
Batch Normalization
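Batch normalization normalizes the \(z\) values of a layer over a mini-batch and then rescales them with two learnable parameters \(\beta\) and \(\gamma\).
Let \(t\) be the index of the mini-batch and \(i\) the index of the sample.
The mean for batch: \(\mu_t = \frac{1}{m^{\{t\}}} \sum_i z^{\{t\}(i)}\)
Variance: \(\sigma_t^2 = \frac{1}{m^{\{t\}}} \sum_i (z^{\{t\}(i)} - \mu_t)^2\)
After finding the mean and variance, we normalize and rescale: \(z_{norm}^{\{t\}(i)} = \frac{z^{\{t\}(i)} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}}\) and \(\tilde{z}^{\{t\}(i)} = \gamma z_{norm}^{\{t\}(i)} + \beta\), where \(\epsilon\) is a small constant for numerical stability.
Note that this \(\beta\) is completely different from the \(\beta\) in optimization techniques.
Its primary purpose is to make information flow from the input layers to the deeper layers more easily. It keeps the \(z\) values of each layer within a range where the layer stays sensitive to changes.
At test time, we may not have batches, so a viable approach is to estimate \(\mu\) and \(\sigma^2\) with an exponentially weighted moving average over the mini-batches seen during training.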
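A sketch of the forward computation for one mini-batch, plus the moving-average bookkeeping for test time. It assumes \(Z\) has one column per sample; `eps`, `momentum` and the function names are my choices.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each row of Z over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta, mu, var

def update_running(running, batch_value, momentum=0.9):
    """Exponentially weighted moving average, kept for use at test time."""
    return momentum * running + (1 - momentum) * batch_value

rng = np.random.default_rng(0)
Z = rng.normal(5.0, 3.0, size=(4, 64))         # 4 units, mini-batch of 64 samples
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))

Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)
running_mu = update_running(np.zeros((4, 1)), mu)

print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))
```

With \(\gamma = 1\) and \(\beta = 0\) the output has zero mean and unit variance per unit; the learnable parameters let the network move away from that if it helps.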