Coursera Deep Learning Specialization Notes

These do not contain answers to quizzes or assignments per Honor Code. If you are looking for those, look elsewhere.

Binary Classification

Given a picture, classify it as cat or non-cat. The result is $\hat{y} = P(y=1 | x)$. In other words, given $x$, we calculate the probability that this data represents a cat.

Feature Vector from Image

We convert a picture, e.g. a (64, 64, 3) picture, into a (64 * 64 * 3, 1) feature vector.


$$\hat{y} = \sigma(w^Tx + b)$$

$w$ is the weights. $w^T$ is the transpose of $w$. $x$ is the input. $b$ is the bias.
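As a minimal sketch of the two steps above, the snippet below flattens an array standing in for a (64, 64, 3) picture and computes $\hat{y}$. The random image, the zero-initialized `w` and `b`, and the helper name `sigmoid` are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))   # stand-in for a real picture
x = image.reshape(-1, 1)          # feature vector of shape (64 * 64 * 3, 1)

w = np.zeros((x.shape[0], 1))     # one weight per feature (untrained)
b = 0.0                           # bias
y_hat = sigmoid(w.T @ x + b)      # shape (1, 1)
print(x.shape, y_hat.item())      # with zero weights the output is 0.5
```

With untrained zero weights the model is maximally unsure, which is why the prediction sits exactly at 0.5.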

Loss Function

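The loss function is the error between the real value of $y$ and our prediction $\hat{y}$ for a single training example. A simple choice is the squared error:

$$ L(\hat{y}, y) = \frac{1}{2}(\hat{y}-y)^2 $$

For logistic regression the cross-entropy loss is usually preferred, because it makes the optimization problem convex:

$$ L(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right) $$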

Cost function

The cost function is the average of the loss over all $m$ training examples: $$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$

Learning rate

Learning rate $\alpha$ is the coefficient that scales each weight update in $$w' = w - \alpha \frac{dJ(w)}{dw}$$

When it is too small, learning is slow; when it is too large, the updates may overshoot the optimum and fail to converge. So it should be selected carefully.
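To make the effect of $\alpha$ concrete, here is a small sketch of gradient descent on a toy cost $J(w) = (w - 3)^2$. The cost function, the starting point and the three learning rates are assumptions picked for illustration.

```python
def gradient_descent(alpha, steps=50, w=0.0):
    """Minimize J(w) = (w - 3)^2, whose derivative is dJ/dw = 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - alpha * grad
    return w

w_small = gradient_descent(alpha=0.001)  # moves only slightly toward 3
w_good = gradient_descent(alpha=0.1)     # ends very close to 3
w_large = gradient_descent(alpha=1.1)    # overshoots further on every step
print(w_small, w_good, w_large)
```

Each step multiplies the distance to the optimum by $(1 - 2\alpha)$, so any $\alpha$ above 1 makes that factor larger than 1 in magnitude and the iterates diverge.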

IDEA: Can we use a vector instead of a single value for learning rate?

For Leaky ReLU activation functions, that seems possible, though possibly overkill. When we use $\alpha$ as a vector, it brings an extra layer of complexity and we also need more computation to adjust $\alpha$. (The point of a vector would be to let learning be slower for some features than for others.)

The feature vector

$X$ is a feature matrix with $m$ columns, one for each training example. Each example has $n$ features. So, X.shape = (n, m)

Y.shape = (1, m)

The computation graph

We can convert any formula to a graph by treating operands as vertices and operations as edges.

{{< figure src="/images/computation-graph.jpg" width="300px" >}}

In neural networks, the forward computation runs from the inputs to the output, and the derivatives are computed backward, from the output to the inputs.

The notation for layers and training examples

$X^{(i)}$ refers to the $i^{th}$ training example.

$z^{[i]}$ refers to the $z$ values in the $i^{th}$ layer.

$z^{[1]}$ is $z$ value for layer 1.

$a^{[2]}$ is $a$ values for layer 2.

$a^{[1]} = [a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, a^{[1]}_4]^T$ for a neural net having 4 nodes in hidden layer 1.

{{< figure src="/images/2-layer-nn.jpg" width="300" >}}

Steps of computation

There are 2 steps of computation.

  1. $z^{[1]} = w^{[1]T}x + b^{[1]}$

  2. $a^{[1]} = \sigma(z^{[1]})$

In step 1, the weights and inputs are dot-multiplied and the bias is added. In step 2, an activation function is applied to this result. (This is $\sigma$ for now; we’ll see that there are other functions for this purpose.)

Backpropagation Algorithm

This is the basic algorithm that changes the weights to find a solution to the problem.

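In the forward pass, the prediction $\hat{y}$ and the loss $L(\hat{y}, y)$ are computed.

In backpropagation, the derivatives of the loss with respect to the weights and biases are computed, and each parameter is adjusted by its derivative scaled by the learning rate $\alpha$.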

The actual formulas are a bit long to keep in mind (at least for me) but the general idea is to calculate the derivatives from the last layer to the first and adjust weights accordingly.

Vectorization notation

Multiple training examples are shown as the superscript $(i)$

The matrix $Z$ stacks the $z$ vectors of all $m$ training examples as columns, with one row per unit:

$$Z = \begin{pmatrix} z_1^{(1)} & z_1^{(2)} & z_1^{(3)} & \dots & z_1^{(m)} \\ z_2^{(1)} & z_2^{(2)} & z_2^{(3)} & \dots & z_2^{(m)} \\ z_3^{(1)} & z_3^{(2)} & z_3^{(3)} & \dots & z_3^{(m)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ z_n^{(1)} & z_n^{(2)} & z_n^{(3)} & \dots & z_n^{(m)} \end{pmatrix}$$

The above is the matrix representation of $Z$ for multiple inputs and the $n$ units of a layer. The $A$ matrix is similar.

As $w^{[i]T}x + b^{[i]}$ is a column vector, we concatenate these column vectors for multiple inputs.

Activation Functions

There are roughly 4 kinds of activation functions.


Sigmoid

This is the historic default. There is little need to use it except in the final activation layer, when the output should be between 0 and 1.

Tanh

It’s better than sigmoid in almost every case. Its output lies between -1 and 1 and is centered at 0, which is a better behavior than sigmoid’s range of 0 to 1.

ReLU

It’s a simple function, $r=\max(0, x)$, and has become rather popular recently.

Leaky ReLU

ReLU has a zero gradient for $x<0$, so a unit stuck in that region stops learning. Leaky ReLU, $r=\max(0.01x, x)$, keeps a very small slope for $x<0$.
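The four activation functions can be sketched in NumPy as below. The function names and the sample input are assumptions chosen for illustration; the 0.01 slope is the one quoted above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # output in (0, 1)

def tanh(x):
    return np.tanh(x)                   # output in (-1, 1), centered at 0

def relu(x):
    return np.maximum(0.0, x)           # zero for negative inputs

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)     # small slope for negative inputs

x = np.array([-2.0, 0.0, 2.0])
for fn in (sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, fn(x))
```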

Training Set and Test Set

In the old days, when datasets had 100, 1,000 or 10,000 elements, we could split them into training/test sets as 70/30%, or into training/development/test sets as 60/20/20%.

However, in the age of Big Data™, when our datasets include 10,000,000 elements, such percentages are wasteful: the dev and test sets only need to be large enough to evaluate and compare models quickly. So instead of a percentage split, it’s more reasonable to keep 10,000 elements as the dev set and a similar number as the test set.

The important point: the dev and test sets should come from the same distribution. We use them to check how the model performs on real data and whether it overfits.

Bias and Variance

High bias means the model is underfitting.

High variance means the model is overfitting.

High Bias

E.g. Train Set Error Rate: 15%, Dev Set Error Rate: 16%.

The model is too small or undertrained and is not able to learn the data.

We may use a larger network, train longer, or change the NN architecture.

High Variance

E.g. Train Set Error Rate: 1%, Dev Set Error Rate: 14%.

Our model overfits the data and doesn’t generalize well.

We need more data or use techniques like regularization to reduce overfitting. Changing NN structure may also help.

High Bias and High Variance

In classical models, we have a bias/variance tradeoff: when we reduce the bias, we increase the variance and vice versa.

However in Deep Learning we can have both. We may have, e.g., a 15% Train Set error rate and a 30% Dev Set error rate. This means we have both high bias and high variance.

In this case we need a larger network as well as more data or regularization, and we may also change the NN structure.


Regularization

It’s used to decrease the variance and overfitting.

Basically there are two types: L2 Regularization adds a penalty on the size of the weights to the cost function, Dropout Regularization randomly sets the outputs of some nodes to 0 during training.

L2 Regularization

In weight calculation, without regularization $$w^{[l]} \leftarrow w^{[l]} - \alpha (dw^{[l]})$$

With regularization, the term $\frac{\lambda}{2m} ||w^{[l]}||^2$ is added to the cost function. Its gradient is $\frac{\lambda}{m} w^{[l]}$, so we add another term to the update: $$ -\alpha (\frac{\lambda}{m} w^{[l]}) $$

Hence the whole formula becomes $$w^{[l]} \leftarrow (1 - \alpha \frac{\lambda}{m}) w^{[l]} - \alpha dw^{[l]}$$

This is called weight decay because each step multiplies $w$ by a factor slightly smaller than 1 and brings it closer to 0. In effect it makes the network simpler and keeps $w$ in a more manageable range.
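A minimal sketch of one weight-decay step follows. It assumes the penalty in the cost function is $\frac{\lambda}{2m} ||w||^2$, whose gradient is $\frac{\lambda}{m} w$; the function name `l2_update` and the sample values are hypothetical.

```python
import numpy as np

def l2_update(W, dW, alpha, lam, m):
    """One gradient step with L2 regularization (weight decay).

    dW is the gradient of the unregularized cost. The penalty
    (lam / (2 * m)) * ||W||^2 contributes (lam / m) * W to the gradient.
    """
    return (1.0 - alpha * lam / m) * W - alpha * dW

W = np.array([[1.0, -2.0], [0.5, 4.0]])
dW = np.zeros_like(W)          # a zero gradient isolates the decay effect
W_new = l2_update(W, dW, alpha=0.1, lam=0.7, m=10)
print(W_new)                   # every entry is scaled by 1 - 0.1 * 0.7 / 10
```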


Dropout

Dropout is a crazy technique, in which the algorithm knocks out some of the nodes while calculating the weights. These temporarily deleted nodes are considered out, and the cost function $J$ and the weights $w^{[l]}$ (which are connected to the remaining nodes) are calculated without their input.

In each iteration, a random set of nodes is removed from the calculations. Each node is kept with probability keep-prob. If keep-prob is 0.5, for example, roughly half of the nodes are removed from the weight calculations.
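The masking step can be sketched as below. This uses the "inverted dropout" variant, which also divides by keep-prob so the expected activation stays the same; that scaling is not described in these notes and is an assumption here, as is the function name `dropout_forward`.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Apply inverted dropout to the activations `a` of one layer."""
    mask = rng.random(a.shape) < keep_prob   # True for the nodes we keep
    a = a * mask                             # dropped nodes output 0
    a = a / keep_prob                        # rescale to keep the expected value
    return a, mask

rng = np.random.default_rng(1)
a = np.ones((1000, 1))
a_drop, mask = dropout_forward(a, keep_prob=0.5, rng=rng)
print(mask.mean())                           # roughly half of the nodes survive
```

A fresh mask is drawn on every training iteration, and no mask is applied at test time.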

Data Augmentation

Adding variations of the images found in the dataset can be thought of as a regularization technique too. For example, flipping images horizontally or vertically, adding noise etc. can be used for regularization.

The important aspect: the changes shouldn’t change the meaning of the images. A vertically flipped cat image is still a cat image, but a vertically flipped 4 is no longer a 4.

Early Stopping

While calculating the cost function for the training and dev sets, there is a point where the training set $J$ continues to decrease but the dev set $J$ begins to increase, indicating overfitting.

We can track this point and stop training the network there.


Normalization brings all feature values in $X$ to a similar range. If $x_1$ is between 1 and 1000 and $x_2$ is between 0 and 1, gradient descent may not cope well with these ranges.

For each feature, subtracting the mean and dividing by the standard deviation is a good approach: $x_i \leftarrow \frac{x_i - \mu_i}{\sigma_i}$, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of $x_i$ over the training set.

Exploding and Vanishing Gradients

For deep neural networks, activations and gradients may increase or decrease exponentially with depth. Suppose all weights are 2 and activations are linear; in this case, for an $l$ layer network, the final activation will be on the order of $2^l$ and the network will learn slowly, if it can learn at all.

A similar problem is found when weights are less than 1. In this case, activations and gradients shrink towards 0 as the network gets deeper and the network may not converge at all.

Weight Initialization to alleviate Exploding/Vanishing Gradients

{{< figure src="/images/node-weights.jpg" width="300" >}}

Since a node sums all of its weighted inputs, we better initialize the weights with a variance of $\frac{1}{n}$, for $n$ being the number of inputs to the node (the number of nodes in the previous layer).

If we are using ReLU activation on a layer, it’s reasonable to use

W[i] = np.random.randn(n[i], n[i-1]) * np.sqrt(2 / n[i-1])

or if we are using tanh activation

W[i] = np.random.randn(n[i], n[i-1]) * np.sqrt(1 / n[i-1])

Another option is the variant of Xavier initialization that uses the sizes of both layers

W[i] = np.random.randn(n[i], n[i-1]) * np.sqrt(2 / (n[i] + n[i-1]))

Initialization by 0

Initializing the weights with all 0’s fails to break the symmetry: every node in a layer computes the same output and receives the same gradient, so the nodes stay identical and the network cannot learn different features.

It is suitable to set bias to 0, but weights shouldn’t be.

Initialization with Random Numbers

W[l] = np.random.randn(layer_dim[l], layer_dim[l-1]) * FACTOR

If the FACTOR is too large (like 10), the network converges slowly and vanishing/exploding gradients may occur.

If FACTOR is np.sqrt(2 / layer_dim[l-1]) for layer l, it’s called He initialization. (It’s named after Kaiming He, the first author of the 2015 paper.)

If FACTOR is np.sqrt(1 / layer_dim[l-1]) for layer l, it’s called Xavier initialization.

These two initialization methods work better than setting a constant.


Mini-batch Gradient Descent

When we have a large amount of training data (e.g. 10 million images), we cannot feed all of it to the network at once. So we divide it into chunks called mini-batches.

The order of samples is randomized. They are divided into manageable chunks and these chunks are fed into the model one at a time, with a weight update after each chunk.

If the chunk size is 1, the algorithm is called Stochastic Gradient Descent.

If the batch size is equal to the size of training set, it’s called Batch Gradient Descent.

For dataset sizes < 2000, Batch GD can be used.

For larger datasets, minibatch sizes from 32 to 512 can be used. 1024 is rare, larger is much rarer.
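The shuffle-and-split procedure can be sketched as follows, using the (n, m) layout for `X` and (1, m) for `Y` described earlier. The function name `random_mini_batches` and the toy data are assumptions for illustration.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, rng):
    """Shuffle the m examples (columns) and split them into mini-batches."""
    m = X.shape[1]
    perm = rng.permutation(m)               # same shuffle for X and Y
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [
        (X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

rng = np.random.default_rng(0)
X = rng.random((3, 100))                     # 3 features, 100 examples
Y = (rng.random((1, 100)) > 0.5).astype(int)
batches = random_mini_batches(X, Y, batch_size=32, rng=rng)
print([bx.shape[1] for bx, _ in batches])    # the last batch holds the remainder
```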

Optimization: Momentum

When training with mini-batches, it may be better to keep the previous batches’ gradients at hand, so that we don’t change the gradient direction too much.

We introduce a new hyperparameter called $\beta$, which decides how much the gradients of previous mini-batches contribute to the running averages $v_{dW}$ and $v_{db}$ that replace $dW^{[l]}$ and $db^{[l]}$ in the update: $v_{dW} \leftarrow \beta v_{dW} + (1 - \beta) dW$. If $\beta$ is high, it smooths the gradients; if low, the most recent calculations determine them more.
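A small sketch of the smoothing effect, assuming the exponentially weighted average form $v \leftarrow \beta v + (1 - \beta) dW$ followed by $W \leftarrow W - \alpha v$. The toy gradients (one coordinate flips sign every step, the other is constant) and the function name `momentum_step` are made up for illustration.

```python
import numpy as np

def momentum_step(W, dW, v, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum."""
    v = beta * v + (1.0 - beta) * dW    # running average of gradients
    W = W - alpha * v                   # step along the smoothed gradient
    return W, v

W = np.zeros(2)
v = np.zeros(2)
for t in range(20):
    # coordinate 0 oscillates between +1 and -1, coordinate 1 is always +1
    dW = np.array([1.0 if t % 2 == 0 else -1.0, 1.0])
    W, v = momentum_step(W, dW, v)
print(v)   # the oscillating coordinate is damped, the steady one is kept
```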

Optimization: RMSprop

Instead of directly using $W \leftarrow W - \alpha dW$, we keep a running average $s_{dW}$ of the squared $dW$.

$$s_{dW} \leftarrow \beta_2 s_{dW} + (1 - \beta_2) (dW)^2$$

$$ W \leftarrow W - \alpha \frac{dW}{\sqrt{s_{dW} + \epsilon}} $$

Calculation for $b$ is similar.

The square in $(dW)^2$ is element-wise, not a dot-product. Dividing by $\sqrt{s_{dW}}$ damps the oscillations along directions where the gradients are large and allows faster moves along directions where they are small.
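The update can be sketched as below, with a plus sign in the running average of squared gradients and $\epsilon$ placed inside the square root as in the formula above. The hyperparameter defaults and the toy gradient are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(W, dW, s, alpha=0.01, beta2=0.999, eps=1e-8):
    """One RMSprop step; all operations are element-wise."""
    s = beta2 * s + (1.0 - beta2) * dW ** 2   # running average of squared gradients
    W = W - alpha * dW / np.sqrt(s + eps)     # large gradients get scaled down
    return W, s

W = np.array([1.0, 1.0])
s = np.zeros(2)
dW = np.array([100.0, 0.01])                  # very different gradient magnitudes
W_new, s = rmsprop_step(W, dW, s)
step = W - W_new
print(step)                                   # both coordinates move by a similar amount
```

Although the raw gradients differ by a factor of 10,000, the resulting steps are of the same order, which is the equalizing effect described above.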

Optimization: Adam

This is the combination of the momentum and RMSprop techniques and a better way to keep track of previous mini-batch gradients. It’s a bit more complex to calculate: it uses two hyperparameters $\beta_1$ and $\beta_2$ to calculate $v$ and $s$, and applies bias correction to both. (The actual calculations can be found elsewhere.) Empirically it often works better than momentum alone and makes the model converge more quickly.

Learning Rate Decay

When we have a fixed learning rate $\alpha$, it may be difficult to find a good converging value because of the size of the steps. In this case, we can adjust the learning rate as a function of the number of epochs trained so far.

A good option is $\alpha = \frac{1}{1 + d * t} \alpha_0$, where $d$ is the decay rate and $t$ is the epoch number.
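As a quick worked example of this schedule, with the denominator taken as $1 + d \cdot t$ so that the rate shrinks as training proceeds. The values $\alpha_0 = 0.2$ and $d = 1$ are assumptions picked for illustration.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Learning rate after `epoch` epochs: alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1.0 + decay_rate * epoch)

rates = [decayed_learning_rate(0.2, 1.0, t) for t in range(4)]
print(rates)   # starts at alpha0 and shrinks every epoch
```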

Hyperparameter Tuning

We have too many hyperparameters in Deep Learning. The learning rate $\alpha$, optimization coefficient $\beta_1$ and $\beta_2$, number of layers ($l$), number of units per layer, activation functions, number of epochs, mini-batch sizes etc. are all hyperparameters that affect the result and speed of neural network.

We have some heuristics to search these.

When we have multiple hyperparameters, selecting them randomly is usually a better approach than making a uniform grid.

When we have a range for a parameter, we can sample it on a logarithmic scale rather than uniformly. So, for example, we can use $\beta = 1 - 10^r$ for $r \in [-3, -1]$ when selecting $\beta$. This way, we can scale the parameter from 0.9 to 0.999, where small changes have a much bigger effect than changing 0.9 to 0.8, for example.
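Random sampling on a logarithmic scale can be sketched as follows. The range for $\beta$ is the one from the text; the range for the learning rate (1e-4 to 1) and the sample count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the exponent uniformly, then map it to beta = 1 - 10^r.
r = rng.uniform(-3.0, -1.0, size=1000)
beta = 1.0 - 10.0 ** r                  # values between 0.9 and 0.999

# The same idea for a learning rate between 1e-4 and 1 (illustrative range).
alpha = 10.0 ** rng.uniform(-4.0, 0.0, size=1000)

print(np.mean(beta > 0.99))             # about half of the samples land above 0.99
```

With uniform sampling of $\beta$ itself, only about 9% of the samples would fall in the sensitive region above 0.99.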

Batch Normalization

Batch normalization is normalizing the $z$ values of a layer (not the weights) within each mini-batch, then rescaling them with two learnable parameters $\beta$ and $\gamma$.

Here $t$ is the index of the mini-batch and $i$ is the index of the sample.

The mean for the batch: $\mu_t = \frac{1}{m^{\{t\}}} \sum_i z^{\{t\}(i)}$

Variance: $$\sigma_t^2 = \frac{1}{m^{\{t\}}} \sum_i (z^{\{t\}(i)} - \mu_t)^2$$

After finding mean and variance

$$z_{\mathrm{norm}}^{\{t\}(i)} = \frac{z^{\{t\}(i)} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} $$

$$\tilde{z}^{\{t\}(i)} = \gamma z_{\mathrm{norm}}^{\{t\}(i)} + \beta$$

Note that this $\beta$ is completely different than the $\beta$ in optimization techniques.

Its primary purpose is to make information flow from the input layers to deeper layers more easily. It keeps the inputs of each layer in a stable range, so that later layers are less sensitive to changes in earlier ones.

At test time, we may not have batches, so a viable approach is to use an exponentially weighted moving average of the $\mu$ and $\sigma^2$ values seen across the training mini-batches.
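The training-time forward pass for one mini-batch can be sketched as below. It assumes `Z` holds the $z$ values of one layer with shape (units, batch size); the function name `batchnorm_forward` and the random data are hypothetical.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit (row) of Z over the mini-batch (columns)."""
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean
    var = Z.var(axis=1, keepdims=True)        # per-unit variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta           # learnable scale and shift
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 64))   # 4 units, 64 examples
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))
```

With $\gamma = 1$ and $\beta = 0$ the output is simply the normalized $z$; during training both are updated by gradient descent like any other parameter.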

Softmax Activation

When we have more than two classes, our final layer $\hat{y}$ becomes a vector instead of a single node. This is called a softmax layer.

Suppose we have 4 classes to separate. In this case our activation function at the final layer is $$a^{[l]} = \frac{e^{z^{[l]}}}{\sum_{j=1}^4 t_j}$$

for $$t = e^{(z^{[l]})}$$.

This means we sum up all $t_j$ values and divide each $t_i$ by this sum: $$a_i^{[l]} = \frac{t_i}{\sum_{j=1}^4 t_j}$$
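A short numerical sketch of the softmax computation for 4 classes. The input vector is an assumption chosen for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector of scores."""
    t = np.exp(z - np.max(z))   # shift by the max to avoid overflow
    return t / np.sum(t)

z = np.array([5.0, 2.0, -1.0, 3.0])
a = softmax(z)
print(a.round(3), a.sum())      # probabilities that sum to 1
```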


It’s much better when we have a single number metric like Recall, Precision, F1 score or mAP. This makes it much easier to evaluate and compare the models we have produced.

However there may be some satisficing criteria that should be met before we optimize for this single number. If we have a model that runs in 1500 seconds with 95% accuracy, and another that runs in 1 second with 90% accuracy, we cannot optimize the first one to reach a sensible running time but we can optimize the second to reach better accuracy. So we have two different types of metrics: (a) an optimizing metric, a single number we try to optimize, and (b) satisficing metrics, which the model should meet to be usable.

Error Analysis

It’s a good approach to check a random subset of misclassified results for knowing where the system goes wrong.

Data Mismatch

If the training set comes from a different distribution (or source) than the dev and test sets, it may be necessary to split off a random part of the training set as a training-dev set. If the error rate on the training-dev set is much lower than on the dev and test sets, we have a data mismatch problem.

The data mismatch problem can be alleviated to some extent by synthetic data generation. But there is a risk of overfitting to a small set of synthetic data: if, for example, we synthesize in-car speech from a large set of voice data and a small set of car noise, the system may overfit to the small set of car noise.

Transfer Learning

When we have a trained neural network for a task A, and we need a neural network for a similar task B, we can use the trained neural network for B.

We can use NN’s output directly for task B or build a few more layers on top of it. It allows basic features to be used across different tasks.

Multi Task Learning

When one item can have more than one class, e.g. “having a traffic sign”, “having a pedestrian” etc, instead of using multiple neural networks for each task, we can just train for multiple tasks.

It doesn’t reduce the performance of the system if the neural network is big enough.

End-to-end learning

In traditional AI systems, there are subsystems, e.g. to find the phonemes of each word. In Deep Learning, when we have big enough data, we can just feed the input and get the output; this is end-to-end learning.

When we have a large enough dataset, end-to-end learning is feasible, but if we have a small dataset, a pipeline with steps like feature extraction is a better approach.

Convolutional Neural Networks

These are used mostly in image processing. A 3x3 filter like

1, 0, -1
1, 0, -1
1, 0, -1

is applied to an image and each cell of the result is the sum of the element-wise products of the filter and the image window under it. E.g. when we have an image like

3, 3, 3, 3
4, 4, 4, 4
5, 5, 5, 5
6, 6, 6, 6

the first convolution operation is applied to window

3, 3, 3
4, 4, 4
5, 5, 5

via product and the result is the sum of

1 * 3, 0 * 3, -1 * 3
1 * 4, 0 * 4, -1 * 4
1 * 5, 0 * 5, -1 * 5

hence, 0. We continue to slide the window and calculate each cell like this.

The resulting matrix is smaller than the original. If the filter size is $f$ by $g$, and the matrix size is $m$ by $n$, the resulting matrix is $m - f + 1$ by $n - g + 1$. For the 4 by 4 image and 3 by 3 filter above, the result is 2 by 2.
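The sliding-window computation on the example above can be sketched as below. As is usual in deep learning, the filter is applied without flipping (strictly speaking a cross-correlation); the function name `conv2d_valid` is an assumption, and the output size follows the rule that an $m$ by $n$ input and an $f$ by $g$ filter give $m - f + 1$ by $n - g + 1$.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the products."""
    m, n = image.shape
    f, g = kernel.shape
    out = np.zeros((m - f + 1, n - g + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + f, j:j + g]
            out[i, j] = np.sum(window * kernel)   # element-wise product, then sum
    return out

image = np.array([[3, 3, 3, 3],
                  [4, 4, 4, 4],
                  [5, 5, 5, 5],
                  [6, 6, 6, 6]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
result = conv2d_valid(image, kernel)
print(result)   # a 2 x 2 matrix of zeros: the image has no vertical edges
```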


Padding

We can use padding to remove the shrinking effect of convolution. We can add a frame of zeros around the original matrix so that the resulting matrix is equal in size to the original. For an odd filter size $f$, a frame of width $\frac{f-1}{2}$ achieves this.

Classic Networks

Why do we look at classic networks? We can use their architecture for transfer learning and learn from their structure. We will work with them in the coming lectures.


LeNet-5

It receives a 32x32x1 image. The problem is digit classification and the images are grayscale. It’s a 1998 paper and has many quirks and implementation ideas that we don’t need today.

It’s a small network by today’s standards: just 60,000 parameters. (Nowadays we are around the 10-100 million mark.)

In two stages of convolution and pooling (average pooling in the original paper), it reduces the image to a volume of 400 values (5x5x16). Then these are connected to a 120 neuron fully connected layer, then to an 84 neuron fully connected layer, and finally to a 10 element softmax.

It mostly uses sigmoid/tanh activations, because in those days ReLU wasn’t in common use.

It’s mostly of historical importance now.

AlexNet and VGG-16

AlexNet is a larger network from 2012. It uses ReLU and receives a larger image, but uses too many ad-hoc parameters for layer sizes.

VGG-16 is from Simonyan and Zisserman. It uses 3x3 filters with a stride of 1, and repeats convolution and max-pooling in a much simpler structure. It has 2 layers of CONV, then 1 layer of MAX-POOL, then 2 layers of CONV, then MAX-POOL, then 3 layers of CONV, then MAX-POOL, and the remaining blocks are similar. Overall it has 138 million parameters and is a large network.

Both of these networks are used in object recognition. VGG performs better and has a similar performance to its larger cousin VGG-19.


ResNets

One of the vital problems of Neural Networks is the vanishing/exploding gradient problem, where some of the gradients become so small or so large that they reduce the rate of learning. Convolutional nets and other techniques are used for this, but for much deeper networks, like 100 layers, ResNets become a viable solution.

In this kind of network, there are shortcuts between layers. For example layer $l$ is connected directly to layer $l-2$ and this reduces the vanishing/exploding gradient problem.

The architecture can be thought of as blocks with bypass connections. Layer $l$ is connected to layer $l-1$ like in a normal network, and is also connected to layer $l-2$. This structure is a block, and not all layers use a bypass, only half of them.

$$a^{[l+2]}=g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$

When the regular path $W^{[l+2]} a^{[l+1]} + b^{[l+2]}$ approaches zero, the residual part $a^{[l]}$ still passes through, which saves the network from the vanishing gradient problem.
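The identity behaviour of a residual block can be checked with a small sketch. It uses fully connected layers and ReLU for $g$, and sets all weights and biases to zero to mimic a regular path that has collapsed; the function name `residual_block` and the sample activations are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two layers with a shortcut from a^{[l]} to the input of the second activation."""
    a_l1 = relu(W1 @ a_l + b1)            # a^{[l+1]}
    a_l2 = relu(W2 @ a_l1 + b2 + a_l)     # a^{[l+2]}, shortcut added before g
    return a_l2

n = 4
a_l = np.array([[0.5], [1.0], [0.0], [2.0]])   # non-negative, as after a ReLU
W_zero, b_zero = np.zeros((n, n)), np.zeros((n, 1))
out = residual_block(a_l, W_zero, b_zero, W_zero, b_zero)
print(out.ravel())                              # identical to a_l
```

Even with a dead regular path the block passes its input through unchanged, so adding such blocks does not make the network worse than the shallower one.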

1x1 Convolutions

When we have an $n_W \times n_H \times n_C$ layer in a CNN, we can apply 1x1 convolutions to reduce $n_C$ and get a smaller layer.

For example, if we have a $32 \times 32 \times 192$ layer and we want a $32 \times 32 \times 32$ layer, we can use 32 filters of size 1x1 (each spanning all 192 channels) to reduce the number of channels.

Inception Network

Each layer has 1x1, 3x3, 5x5 and POOL layers. Results of these operations are merged together and used as a single layer.

Object Detection

When we have multiple objects appearing on an image, we can ask the CNN to draw a bounding box around each object. CNNs are very capable in object detection.

Suppose we are building a CNN for traffic object detection: We can have different objects like cars, traffic signs, pedestrians, etc. We will have a vector like $[p_c, b_x, b_y, b_w, b_h, c_1, c_2, c_3, …]$ where $p_c$ is the probability that the bounding box given by the $b_x, b_y, b_w, b_h$ parameters contains an object, and $c_1, c_2, c_3, …$ indicate which class it belongs to.

Landmark detection

It’s also possible for CNNs to detect landmarks… like on a face. In this case, a set of x,y points are output along with the probability of certain point.

Sliding windows

When we want to detect the bounding boxes of an object without training data for the bounding boxes, we can slide a window on the image to detect an object. In this case, we will have different sizes of windows sliding to detect a region where an object resides.

This approach has several problems: (1) Computational cost caused by the number of squares. (2) Decreased accuracy caused by the strides and size of the windows. Since we cannot have too small step size, we need to find a good measure to keep the number of windows low.

It’s possible to use algorithms like YOLO to increase the accuracy of boxes.

YOLO Algorithm.

YOLO means You Only Look Once. It divides an image into a grid and assigns each detected object to the grid cell that contains its center.

The original paper uses a 7x7 grid; the example in this course uses 19x19.

Each image has bounding boxes around objects. These boxes give the center of the object and its width and height.

We can represent $n$ classes as a vector with $n$ binary elements or as one element that can take values up to $n$. (The former representation seems better for neural nets, the latter is more convenient.) We can use either of them.

What is an anchor box

Sometimes multiple objects on the image can have their midpoints in the same grid cell. For example, if we have two classes like car & pedestrian, a pedestrian can be standing in front of a car and the midpoints of these objects can be in the same grid cell.

In this case, we define two different kinds of boxes, horizontal and vertical rectangles, and have a descriptor for each, e.g. $$[p_1, x_1, y_1, w_1, h_1, c_{11}, c_{12}, p_2, x_2, y_2, w_2, h_2, c_{21}, c_{22}]$$ and then we are able to associate two different objects with one grid cell.

Example in the Homework

Each image is 608x608 pixels. We have a 19x19 grid superimposed on this image. Each cell can be the center of 5 anchor boxes. We have 80 classes of objects. We use the vector representation (one element per class) for classes, therefore we have a $(19, 19, 5, 85)$ tensor for each image.

80 of the 85 is the number of classes, the rest is $p_c, x, y, w, h$.

We will use thresholding to filter out the low probability predictions from the results.

The filtering function receives three elements, like $(19, 19, 5, 1)$ for $p_c$, $(19, 19, 5, 4)$ for boundaries and $(19, 19, 5, 80)$ for classes. We get class scores by multiplying confidence $p_c$ matrix with the class probabilities.

Then we apply non-max-suppression using IoU of rectangles.
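IoU (intersection over union) of two rectangles can be sketched as below. The corner format `(x1, y1, x2, y2)` and the function name `iou` are assumptions for illustration; boxes given as center, width and height would need converting first.

```python
def iou(box1, box2):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(xi2 - xi1, 0.0) * max(yi2 - yi1, 0.0)   # 0 if the boxes do not overlap

    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # overlap area 1, union area 7
```

Non-max suppression keeps the box with the highest score and discards the remaining boxes whose IoU with it exceeds a threshold, then repeats with what is left.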

We use tf.image.non_max_suppression for calculating the resulting boxes.

We use the Keras backend K to write these functions. Element-wise multiplication with broadcasting is used to multiply two tensors. When we multiply a $(19, 19, 5, 80)$ tensor with a $(19, 19, 5, 1)$ tensor, the result is a $(19, 19, 5, 80)$ tensor.

K.argmax is used to find the maximum elements along an axis, in a tensor.

K.max is used to find the maximum values along an axis, in a tensor.

K.gather can be used to select elements of a tensor by their indices, e.g. t[[0, 4, 3]] is equal to K.gather(t, [0, 4, 3]) in a TF/Keras setting.

Region Proposals

Sliding window approach produces too many candidates without any merit. They all consume computational power. Instead of this we can run a crude feature detection algorithm on the image and identify the region we are interested in.

Face Recognition

There are two problems regarding faces: Verification and Recognition. Verification is checking whether a person is who they claim to be, so we look at the distance between the presented image and the stored image of that person. Recognition is selecting a face from a set of known people, or declaring that the face is unknown.

One Shot Learning

One shot learning is learning from one sample per class. Suppose we have only 1 image per person from their ID cards. We need to identify their faces from these images. It is not feasible to train a CNN classifier with so few images, and people may be added to or removed from the set, which would require retraining the network.

Similarity Metric

Suppose we have a feature producer function $f$ that receives a face image and returns a feature vector $v$. For image $A$, $v_A = f(A)$. In order to achieve one shot learning, we can use $f$ to find the distance between two images.

If $f(A)$

Visual Style Transfer

Suppose we have an image C and painting S. (like Starry Night of Van Gogh.) We can transfer the style of painting to the image and get a new “painting” showing the content.

Convolutional Nets learn lower level features in the bottom layers and more complex features in higher layers. We need a cost function for neural style transfer that is composed of two elements.

$J(G) = \alpha J_{Content}(C, G) + \beta J_{Style}(S, G)$ where $G$ is the generated image, $C$ is the content image, and $S$ is the style image.

We need to minimize $J(G)$ to produce an image with content of $C$ and style of $S$.

Generated image $G$ is initialized randomly. Then we use gradient descent to get a closer and closer image.

The content cost function $J_{Content}(C, G)$ is measured by the norm of the difference of the activation values in layer $l$ for $C$ and $G$. In other words, $J_{Content}(C, G) = \frac{1}{2} || a^{[l](C)} - a^{[l](G)} ||^2$.

We run both images through the same CNN and measure the difference at the $l^{th}$ layer.

Style cost function is a bit more complex.

Let $a^{[l]}_{i,j,k}$ be the activation at position $(i, j, k)$ of layer $[l]$.

We define a (Style) Gram matrix over the channels of a layer. $G^{[l]}$ has size $n_c^{[l]} \times n_c^{[l]}$ and, for the style image $S$, its elements are:

$G^{[l](S)}_{kk'} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a_{i, j, k}^{[l](S)} a_{i, j, k'}^{[l](S)}$

In other words, we multiply $a_{..,k}$ and $a_{..,k'}$ for channels $k$ and $k'$ and sum over all positions. The same matrix is computed for the generated image $G$.

We also use a normalization in the form

$\frac{1}{(2 n_H^{[l]} n_W^{[l]} n_c^{[l]})^2}$


and we use the formula to get a cost function in the form

$J_{Style}(S, G) = \sum_k \sum_{k'} (G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)})^2$

The Gram matrix measures the correlation between channels.

LSTMs and GRUs

Both are types of RNNs.

LSTMs are more general. GRUs are simpler.

Both are used to make predictions on sequential data.

Word Embeddings

A word embedding is a matrix where columns are words (word ids) and rows are attributes. Each value denotes a feature for a particular word.


Suppose we have word embeddings $e_{man}$ and $e_{woman}$. The difference between man and woman is represented as $e_{man} - e_{woman}$.

When we find this difference in another pair, this means we found an analogy. e.g.

$e_{king} - e_{queen} \approx e_{man} - e_{woman}$.

We can measure the similarity of words with cosine similarity. A word vector $u$ is similar to another word vector $v$ if

$$\frac{u^T v}{||v||_2 ||u||_2}$$

is close to 1. ($||x||_2$ is the L2 norm of vector $x$)
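Cosine similarity can be sketched as below. The toy vectors are made up for illustration and are not real embeddings; a value near 1 means the vectors point the same way, 0 means they are orthogonal, and -1 means they point in opposite directions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, 2.0 * u))    # same direction
print(cosine_similarity(u, -u))         # opposite direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal
```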

Another option might be Euclidean distance.

Learning word embeddings.

Some of the algorithms are so simple that it is weird to study them. We begin with a complex algorithm.

I want a glass of orange ____

juice: target word

Context: the last 4 words, or 4 words on the left and right

These are put into Softmax for learning the embeddings.


Instead of using all the words on the left and right, we select a random word from a 5 or 10 word window and consider this as the context.
