DNN

Week 1:
Applied ML: iterative process
Correctly ‘guessing’ the right hyper-parameters is challenging
traditional data sizes (e.g. 100,000 examples): 70%/30% train/test, or 60/20/20 train/dev/test
big data (e.g. 1,000,000 examples): dev/test can be a much smaller fraction, e.g. 10,000 each for dev/test, i.e. 98%/1%/1%
Mismatched train/test distributions: e.g. training on pictures from the web, while dev/test use pictures from users
Dev & test should come from the same distribution
Not having a test set can be OK (dev set only)

Bias / Variance:
underfitting vs overfitting
diagnose by comparing train-set error vs dev-set error,
relative to the optimal (Bayes) error, often approximated by human-level error
e.g. train 1% / dev 11% -> high variance; train 15% / dev 16% -> high bias

High bias: the model underfits the data
High variance: the model overfits the data

Basic Recipe for ML
target: low bias (bigger network) & low variance (more data)
in contrast to the classical 'bias/variance trade-off': in deep learning you can usually reduce one without hurting the other

Regularization
often prevents over-fitting
large regularisation parameter -> small w -> activations stay near-linear -> simpler model -> less overfitting
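A minimal numpy sketch of how the L2 penalty enters the cost and the gradient for one weight matrix (lambd, m and the function name are illustrative, not from the notes):

```python
import numpy as np

def l2_regularized_cost_and_grad(cross_entropy_cost, dW_unreg, W, lambd, m):
    """Add the L2 penalty (lambd / (2*m)) * ||W||^2 to the cost and the
    matching (lambd / m) * W 'weight decay' term to the gradient."""
    cost = cross_entropy_cost + (lambd / (2 * m)) * np.sum(np.square(W))
    dW = dW_unreg + (lambd / m) * W
    return cost, dW
```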

Early stopping: only one training run needed, but it couples optimising the cost with preventing overfitting

other tools: weight initialisation, L2 regularisation, dropout (sketch below)
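A sketch of inverted dropout on one layer's activations, assuming an activation matrix A of shape (units, batch) and an illustrative keep_prob:

```python
import numpy as np

def inverted_dropout(A, keep_prob):
    """Randomly zero units of the activation matrix A and rescale so the
    expected activation value is unchanged (inverted dropout)."""
    D = np.random.rand(*A.shape) < keep_prob  # boolean dropout mask
    A = A * D                                 # drop units
    A = A / keep_prob                         # rescale the survivors
    return A, D                               # D is reused for dA in backprop
```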

Week 2

batch vs mini-batch
X^{1}, X^{2}…
epoch = one pass through the training set

size = 1: stochastic gradient descent
mini-batch gradient descent is not guaranteed to converge exactly; it oscillates around the minimum

small training set: batch (e.g. m < 2000)
typical mini-batch size: 64, 128, 256, 512
make sure a mini-batch fits in CPU/GPU memory
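A sketch of building the mini-batches X^{1}, X^{2}, … by shuffling and partitioning; it assumes the course's one-column-per-example layout, and the function and argument names are my own:

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y) column-wise and split into mini-batches X^{t}, Y^{t}.
    X has shape (n_x, m), Y has shape (1, m)."""
    np.random.seed(seed)
    m = X.shape[1]
    permutation = np.random.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]

    mini_batches = []
    for t in range(0, m, mini_batch_size):
        mini_batches.append((X_shuffled[:, t:t + mini_batch_size],
                             Y_shuffled[:, t:t + mini_batch_size]))
    return mini_batches
```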

Exponentially weighted averages:
v(t) = beta v(t-1) + (1-beta) * theta(t)
v(t) approximately average over 1 / (1-beta) days.
so, beta = 0.9 -> 10 days averages
(1 - eps)^(1/eps) ≈ 1/e ≈ 0.35 (roughly 1/3): after ~1/(1-beta) steps a term's weight has decayed to about a third, hence the 1/(1-beta) rule of thumb

bias correction (often not used in practice):
v(t) / (1 - beta^t)
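A small sketch of the exponentially weighted average with optional bias correction (pure Python, illustrative names):

```python
def ewa(thetas, beta=0.9, bias_correction=True):
    """Exponentially weighted average v_t = beta*v_{t-1} + (1-beta)*theta_t,
    optionally divided by (1 - beta**t) to fix the biased-low early steps."""
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out
```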

Gradient descent with momentum: almost always works better

v_dW = beta * v_dW + (1-beta) * dW
v_db = beta * v_db + (1-beta) * db
(friction * velocity) + acceleration

W = W - alpha * v_dW, b = b - alpha * v_db

beta = 0.9 seems to be a practical choice.
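As a sketch, one gradient-descent-with-momentum step in numpy-style code (v_dW, v_db start as zero arrays; alpha = 0.01 is just an illustrative value):

```python
def momentum_update(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One momentum step: accumulate a velocity for each gradient,
    then move the parameters along the velocity."""
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db
```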

RMSprop

S_dW = beta * S_dW + (1-beta) * dW^2 (element-wise square)
S_db = beta * S_db + (1-beta) * db^2
W = W - alpha * dW / (sqrt(S_dW) + eps), b = b - alpha * db / (sqrt(S_db) + eps)  (small eps for numerical stability)

Adam (adaptive moment estimation) optimization algorithm

initialise v_dW = 0, s_dW = 0, v_db = 0, s_db = 0
on each mini-batch, compute dW, db
v_dW = beta_1 * v_dW + (1 - beta_1) * dW, v_db = beta_1 * v_db + (1 - beta_1) * db   (momentum part)
s_dW = beta_2 * s_dW + (1 - beta_2) * dW^2, s_db = beta_2 * s_db + (1 - beta_2) * db^2   (RMSprop part)
apply the bias correction to both v_ and s_
W = W - alpha * v_dW / (sqrt(s_dW) + eps), b = b - alpha * v_db / (sqrt(s_db) + eps)

Hyperparameters
alpha (needs tuning), beta_1 = 0.9 (for dW), beta_2 = 0.999 (for dW^2), eps = 10^(-8) (rarely tuned)
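A hedged sketch of one Adam step for a single parameter tensor, using the defaults above (alpha = 0.001 is an assumed starting point, not from the notes); the v line is the momentum part and the s line is the RMSprop part:

```python
import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single parameter W (same idea applies to b).
    v is the first-moment (momentum) term, s the second-moment
    (squared-gradient) term, t the 1-based step count for bias correction."""
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * np.square(dW)
    v_hat = v / (1 - beta1 ** t)   # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```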

Learning rate decay
1 epoch = 1 pass through data
alpha = alpha_0 / (1 + decay_rate * epoch_num) and other forms
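As a tiny sketch of the decay schedule above (the example values are illustrative):

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha_0 / (1 + decay_rate * epoch_num); other schedules
    (e.g. exponential decay) shrink alpha per epoch in a similar way."""
    return alpha0 / (1 + decay_rate * epoch_num)

# e.g. alpha0 = 0.2, decay_rate = 1.0:
# epoch 1 -> 0.1, epoch 2 -> 0.067, epoch 3 -> 0.05, ...
```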

Problem of local optima: with the high-dimensional parameter space of a neural network, saddle points are far more likely than bad local optima.
Problem of plateaus: these can really slow learning down; algorithms like momentum, RMSprop and Adam help move off plateaus faster.

Adam paper: https://arxiv.org/abs/1412.6980

Week 3:

hyperparameters
tuning priority in brackets: alpha (1), #hidden units (2), mini-batch size (2), beta (2), #layers (3), learning-rate decay (3); beta_1, beta_2, epsilon are almost never tuned

Try random values rather than a grid
Coarse to fine: zoom in on the promising region and re-sample (sketch below)
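A sketch of random (non-grid) hyperparameter sampling for a coarse-to-fine search; the ranges, the log-scale sampling of alpha, and all names are assumptions for illustration:

```python
import numpy as np

def sample_hyperparameters(n_trials, alpha_range=(1e-4, 1e-1),
                           batch_sizes=(64, 128, 256, 512)):
    """Draw random hyperparameter combinations; after a coarse pass,
    shrink alpha_range around the best results and re-sample (fine pass)."""
    trials = []
    log_lo, log_hi = np.log10(alpha_range[0]), np.log10(alpha_range[1])
    for _ in range(n_trials):
        # log-scale sampling of alpha is a common practice assumed here,
        # not something stated in the notes
        alpha = 10 ** np.random.uniform(log_lo, log_hi)
        batch_size = int(np.random.choice(batch_sizes))
        trials.append({"alpha": alpha, "mini_batch_size": batch_size})
    return trials
```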

Babysitting one model ('panda' approach): when compute resources are scarce
Training many models in parallel ('caviar' approach): when resources are plentiful

Batch Normalization: normalise the internal variables z^(i)’s as well
beta & gamma are learnable parameters (e.g. like W’s and b’s)
b becomes irrelevant because the mean is subtracted during normalisation, so it can be removed (or set to zero); beta takes over its role

Batch normalisation lets each layer learn somewhat independently of earlier layers: its inputs keep a stable mean/variance even as the previous layers' weights shift.

Batch norm at test time: estimate mu & sigma^2 using an exponentially weighted average across mini-batches (sketch below)
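A sketch of batch-norm forward logic for one layer's pre-activations Z, including the exponentially weighted running estimates used at test time (names and the momentum value are illustrative):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, running_mu, running_var,
                       momentum=0.9, eps=1e-8, training=True):
    """Normalise pre-activations Z (shape: units x batch), then scale and
    shift with the learnable gamma and beta. At test time, use the
    exponentially weighted averages of mu and sigma^2 kept during training."""
    if training:
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        running_mu = momentum * running_mu + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, running_mu, running_var
```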

Multi-class classification

softmax regression
activation function:
t = e^(z^[L])
a^[L] = e^(z^[L]) / sum( t ), i.e. a_i^[L] = t_i / sum(t)
unlike previous activations, g is a vector-valued function (it takes the whole z^[L] and returns a vector)

hardmax: like softmax but maps the largest value to 1 and the others to 0

softmax regression is a generalisation of logistic regression to C classes

Loss function: L(y^hat, y) = - sum_j y_j * log(y^hat_j)
e.g. y = [0, 1, 0, 0], y^hat = [0.3, 0.2, 0.1, 0.4] -> L = -log(0.2)
basically maximum likelihood estimation
dz^[L] = y^hat - y
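A small numpy sketch of softmax plus this cross-entropy loss, checked against the example above (the max-subtraction is a standard numerical-stability trick, not from the notes):

```python
import numpy as np

def softmax(z):
    """a_i = e^{z_i} / sum_j e^{z_j}."""
    t = np.exp(z - np.max(z))  # subtract max for numerical stability
    return t / np.sum(t)

def cross_entropy(y, y_hat):
    """L = -sum_j y_j * log(y_hat_j); for one-hot y this is -log of the
    probability assigned to the true class."""
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 1, 0, 0])
y_hat = np.array([0.3, 0.2, 0.1, 0.4])
print(cross_entropy(y, y_hat))  # -log(0.2) ≈ 1.609
```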

Frameworks

The programming-assignment notebook shows nicely that Adam works much better than plain gradient descent.

Completed: 2020.02.16.
