[Caffe] Optimization Algorithms "Talking Seriously About Caffe"


Caffe now provides six optimization algorithms:

  • Stochastic Gradient Descent (type: SGD)
  • AdaDelta (type: AdaDelta)
  • Adaptive Gradient (type: AdaGrad)
  • Adam (type: Adam)
  • Nesterov's Accelerated Gradient (type: Nesterov)
  • RMSprop (type: RMSProp)

The solver is the core of Caffe. It alternates between forward computation, which produces the loss, and backward computation, which produces the gradients, and then updates the parameters so as to minimize the loss. It is an iterative optimization procedure.

Assuming a dataset D with |D| samples in total, the average loss over the entire dataset is:

$$L(W) = \frac{1}{|D|} \sum_{i=1}^{|D|} f_W\!\left(X^{(i)}\right) + \lambda\, r(W)$$

The first term is the average loss over all the data, and the second term is the regularization term, where λ is the regularization coefficient used to prevent overfitting.

If the amount of data is small, the formula above can be computed directly. But if |D| is very large, a single evaluation of the loss takes a long time.

So the concept of a batch is introduced: instead of the whole dataset D, each iteration uses a mini-batch of N samples.
$$L(W) \approx \frac{1}{N} \sum_{i=1}^{N} f_W\!\left(X^{(i)}\right) + \lambda\, r(W)$$

In each forward/backward pass we draw only one batch of data, compute the loss and gradients on that batch, and update the parameters from it. We then repeatedly draw new batches and iterate, as in the rough sketch below.
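
As a rough illustration of this idea (not Caffe code), here is a minimal NumPy sketch of evaluating the regularized loss on a single mini-batch; the quadratic per-sample loss, the batch size of 64 and the dataset shape are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))      # full dataset D (10000 samples, 5 features)
w = rng.normal(size=5)               # model parameters W
lam = 1e-4                           # regularization coefficient lambda
batch_size = 64                      # N: samples drawn per iteration

def batch_loss(w, X_batch):
    # toy per-sample loss f_W(X): squared output of a linear map (placeholder model)
    per_sample = 0.5 * (X_batch @ w) ** 2
    # average data loss over the batch plus the L2 regularization term
    return per_sample.mean() + lam * 0.5 * np.dot(w, w)

# one iteration only looks at a single batch, not the whole dataset
idx = rng.choice(len(X), size=batch_size, replace=False)
print(batch_loss(w, X[idx]))
```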

** In Caffe, the default optimization algorithm is SGD, which is the Stochastic Gradient Descent algorithm. **

1. Stochastic Gradient Descent (SGD)#

The Stochastic Gradient Descent algorithm is developed based on the gradient descent algorithm, which is also known as the steepest descent algorithm.

In brief, the principle of SGD is that the new parameters W_{t+1} are a linear combination of the parameters W_t from the previous update and the current update value.

In practice, when we use it in Caffe, we usually add momentum.

$$V_{t+1} = \mu V_t - \alpha \nabla L(W_t)$$
$$W_{t+1} = W_t + V_{t+1}$$

Where α is the learning rate (base_lr), and μ is the weight given to the previous update value V_t, i.e., the momentum.

The learning rate needs to decrease over time for the model to keep improving. The learning rate can be understood as a step size: optimizing the model is like descending a mountain to find the lowest valley along the steepest direction. If the step is too large, it is easy to overshoot the lowest valley; if it is too small, progress is slow and it is easy to get stuck in some other, non-lowest valley. The solver file offers many learning rate decay strategies (lr_policy), which I will explain in detail later; for now this much is enough.

The steepest direction mentioned above is actually the negative gradient direction.

Momentum accelerates the descent as much as possible while keeping the loss stable. ** For SGD it is generally set to 0.9; in some special cases, when the learning rate has decayed to a very small value, the momentum can be raised to 0.99 to improve training. **
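
A minimal NumPy sketch of the update above, not Caffe's C++ implementation; the toy objective f(w) = 0.5·||w||², whose gradient is w itself, is just a stand-in for the loss on a batch.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, base_lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: V_{t+1} = mu*V_t - alpha*grad, W_{t+1} = W_t + V_{t+1}."""
    v_new = momentum * v - base_lr * grad
    return w + v_new, v_new

# toy usage: minimize f(w) = 0.5*||w||^2, whose gradient is simply w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad=w)
print(w)   # approaches the minimum at the origin
```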

2. AdaDelta#

Paper link: https://arxiv.org/pdf/1212.5701.pdf

AdaDelta is an adaptive learning rate method that is also developed based on gradient descent. This algorithm does not require manual tuning of the learning rate and is robust to noise.

Here are some advantages of this method:

  1. No need to manually set the learning rate.

  2. Insensitive to hyperparameter initialization.

  3. Each dimension has a separate dynamic learning rate.

  4. Robust to large gradients, noise, and structural choices.

  5. Good adaptability to both local and distributed environments.

The original author initially wanted to make many modifications to gradient descent, the biggest idea being to apply Newton's method, i.e., to use the second-order derivatives of the cost function, the Hessian matrix:

$$\Delta x_t = -H_t^{-1} g_t$$

Where Δx_t is the change in the parameters, H_t is the Hessian matrix, and g_t is the gradient, which determines the descent direction. The inverse Hessian plays the role of the learning rate, i.e., the step size, so with Newton's method there is no learning rate to set by hand.
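
For intuition, here is a small NumPy sketch of a single Newton step on a toy quadratic of my own choosing; note that it needs the full Hessian, which is exactly what is impractical for a large network.

```python
import numpy as np

# toy quadratic cost f(x) = 0.5 * x^T A x - b^T x, so gradient = A x - b, Hessian = A
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
x = np.zeros(2)

g = A @ x - b                       # gradient g_t
H = A                               # Hessian H_t
delta_x = -np.linalg.solve(H, g)    # Newton step: delta_x = -H^{-1} g
x = x + delta_x
print(x, A @ x - b)                 # for a quadratic, one step lands on the optimum (gradient ~ 0)
```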

However, the Hessian matrix is complex and expensive to compute, so the author borrowed the idea behind it and considered either making better use of the first-order derivatives or using an estimate of the Hessian.

Ultimately, the choice was made to enhance the use of the first-order derivative to approximate the second-order Newton method.

$$\Delta x_t = \mu\, \Delta x_{t-1} - \alpha\, g_t, \qquad x_{t+1} = x_t + \Delta x_t$$

The basic idea is still similar to SGD: the actual change is the momentum times the previous change minus the base learning rate times the current gradient value (here the base learning rate is 1.0 and stays unchanged).

AdaDelta momentum is generally set to 0.95.

In the solver, delta is generally set to 1e-6.

** base_lr must be set to 1.0 **

** lr_policy must be set to "fixed". **
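
To make these settings concrete, here is a sketch that writes such a solver file from Python; the net file name, max_iter and snapshot values are placeholders of my own, and the field names follow my reading of Caffe's SolverParameter, so check them against your Caffe version.

```python
# Sketch: write a minimal AdaDelta solver.prototxt with the settings above.
# "train_val.prototxt" and the iteration counts are placeholders.
solver_text = """
net: "train_val.prototxt"
type: "AdaDelta"
base_lr: 1.0
lr_policy: "fixed"
momentum: 0.95
delta: 1e-6
max_iter: 100000
snapshot: 10000
snapshot_prefix: "adadelta"
"""
with open("solver_adadelta.prototxt", "w") as f:
    f.write(solver_text.strip() + "\n")
```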

------------------------- Below are the specific algorithm steps -------------------------

In the paper the author compares extensively with AdaGrad and argues that AdaDelta mainly addresses two shortcomings of AdaGrad. The first is the learning rate continually decaying as training progresses: if the learning rate keeps shrinking, its effect on optimization eventually drops so low that the later stages of training barely optimize at all. The second is the need to manually pick an initial learning rate.

In 2012, Schaul, S. Zhang and LeCun proposed an AdaGrad-like algorithm; I mention it because it is closely related to AdaDelta.

$$\Delta x_t = -\frac{1}{\operatorname{diag}(H_t)} \frac{E\!\left[g_{t-w:t}\right]^2}{E\!\left[g^2_{t-w:t}\right]}\, g_t$$

Where diag(H_t) is the diagonal of the Hessian matrix, E[g_{t−w:t}] is the expected value of the gradients over the previous w iterations up to the current step t, and the denominator is the expected value of the squared gradients over the same window.

The benefit of using only the previous w gradients is that the accumulated denominator does not grow without bound, so the updates do not shrink towards zero.

Now back to AdaDelta: instead of keeping a fixed window of the previous w gradients, it accumulates over all past iterations using an exponentially decaying average.

$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2$$

Where ** ρ is the decay constant, which is set in Caffe through the momentum parameter **. Using this formula, we calculate the expected value of the squares of all gradients at the current iteration.

For the parameter update we need the square root of this quantity, written here as RMS (root mean square):

$$\operatorname{RMS}[g]_t = \sqrt{E[g^2]_t + \varepsilon}$$

Where RMS[g]_t is the root-mean-square value of the gradient g at the current iteration, and ** ε is a small constant (the delta parameter in the solver) **, a smoothing term that keeps the result away from zero, since it will later appear in a denominator.

From these, the change in the parameters is:

$$\Delta x_t = -\frac{\eta}{\operatorname{RMS}[g]_t}\, g_t$$

The above method is Tieleman & Hinton's RMSProp.
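
A minimal NumPy sketch of this intermediate, RMSProp-style rule, using a fixed learning rate η and the decaying average defined above; the toy objective f(w) = 0.5·||w||² is only there to exercise the update.

```python
import numpy as np

def rmsprop_step(w, sq_avg, grad, eta=0.001, rho=0.95, eps=1e-6):
    """E[g^2]_t = rho*E[g^2]_{t-1} + (1-rho)*g^2;  dx = -eta * g / RMS[g]_t."""
    sq_avg = rho * sq_avg + (1.0 - rho) * grad ** 2
    rms_g = np.sqrt(sq_avg + eps)
    dx = -eta * grad / rms_g
    return w + dx, sq_avg

# toy usage on f(w) = 0.5*||w||^2 (gradient = w)
w, sq_avg = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(2000):
    w, sq_avg = rmsprop_step(w, sq_avg, grad=w)
print(w)
```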

Later, Zeiler considered the issue of units; Δx is used to update the parameters x, so their units must be consistent.

In my understanding, the units of g_t in the formula above are tied to the units of x, and so are the units of RMS[g]_t.

However, once one is divided by the other, the result becomes unitless.

So in the formula below, the RMS of the previous iteration's parameter change is placed in the numerator, which restores the units.

** How exactly to understand these units still confuses me; I think they are best read as dimensions. **

Moreover, including the previous iteration's change makes the final result more robust and prevents sudden jumps.

The ε in the RMS of the numerator and denominator takes the same value.

$$\Delta x_t = -\frac{\operatorname{RMS}[\Delta x]_{t-1}}{\operatorname{RMS}[g]_t}\, g_t, \qquad E[\Delta x^2]_t = \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2, \qquad x_{t+1} = x_t + \Delta x_t$$

The actual step size, which can also be considered the learning rate, is related to the current gradient and the last change in parameters.

Thus the learning rate differs across layers and across dimensions of the parameters; it can be described as "varying from person to person."

This is unlike SGD, where all parameters share the same learning rate.

AdaDelta is very fast in the early and mid-training stages, but it is prone to getting stuck in local optima in the later stages, which is not as effective as SGD.
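
Putting the pieces together, here is a minimal NumPy sketch of the full AdaDelta rule, with ρ and ε playing the roles of the solver's momentum and delta; again the toy objective f(w) = 0.5·||w||² is only an example.

```python
import numpy as np

def adadelta_step(w, sq_g, sq_dx, grad, rho=0.95, eps=1e-6):
    """dx_t = -(RMS[dx]_{t-1} / RMS[g]_t) * g_t, with decaying averages of g^2 and dx^2."""
    sq_g = rho * sq_g + (1.0 - rho) * grad ** 2              # E[g^2]_t
    dx = -(np.sqrt(sq_dx + eps) / np.sqrt(sq_g + eps)) * grad
    sq_dx = rho * sq_dx + (1.0 - rho) * dx ** 2              # E[dx^2]_t
    return w + dx, sq_g, sq_dx

# toy usage on f(w) = 0.5*||w||^2 (gradient = w); no hand-tuned learning rate needed
w = np.array([1.0, -2.0])
sq_g, sq_dx = np.zeros(2), np.zeros(2)
for _ in range(1000):
    w, sq_g, sq_dx = adadelta_step(w, sq_g, sq_dx, grad=w)
print(w)
```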


3. AdaGrad#

Adaptive Gradient algorithm, inspired by L2 Regularizer.

$$\Delta x_t = -\frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2}}\, g_t$$

Where the denominator is the square root of the sum of the squared gradients from iteration 1 to t, and η is the base learning rate.

AdaGrad is sensitive to weight initialization: if the initial weights are too large, the gradients are large and the resulting adjustments are small.

In the later stages of training, the accumulated denominator keeps growing, so the updates easily approach 0 and training effectively stops prematurely.

It is suitable for handling sparse gradients.

The actual usage formula is:

$$W_{t+1} = W_t - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2} + \varepsilon}\, g_t$$

** In the solver, base_lr is generally set not too high; a reference value is 0.01. **

** lr_policy must be set to "fixed." **

The solver's delta parameter corresponds to ε in the formula.
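
A minimal NumPy sketch of the AdaGrad rule above; notice that the accumulated sum of squared gradients only grows, which is why the updates shrink over time. The toy objective f(w) = 0.5·||w||² is my own example.

```python
import numpy as np

def adagrad_step(w, g_sq_sum, grad, eta=0.01, eps=1e-8):
    """dx_t = -eta * g_t / (sqrt(sum of squared gradients) + eps)."""
    g_sq_sum = g_sq_sum + grad ** 2
    dx = -eta * grad / (np.sqrt(g_sq_sum) + eps)
    return w + dx, g_sq_sum

# toy usage on f(w) = 0.5*||w||^2 (gradient = w)
w, g_sq_sum = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(5000):
    w, g_sq_sum = adagrad_step(w, g_sq_sum, grad=w)
print(w)
```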

4. Adam#

Adaptive Moment Estimation.

It is also a gradient-based optimization method.

It only requires first-order gradient information and occupies very little memory.

By estimating the first and second moments of the gradient, it dynamically adjusts the learning rate for each parameter.
$$m_t = \mu\, m_{t-1} + (1-\mu)\, g_t, \qquad n_t = \nu\, n_{t-1} + (1-\nu)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\mu^t}, \qquad \hat{n}_t = \frac{n_t}{1-\nu^t}$$
$$\Delta x_t = -\frac{\eta\, \hat{m}_t}{\sqrt{\hat{n}_t} + \varepsilon}$$

Where mt is the first moment estimate of the gradient, and nt is the second moment estimate of the gradient. The terms with hats are for bias correction.

μ is generally set to 0.9.

ν is generally set to 0.999.

ε is generally set to 1e-8.
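
As far as I know these three correspond to Caffe's momentum, momentum2 and delta parameters. Below is a minimal NumPy sketch of the Adam update, using the same toy objective f(w) = 0.5·||w||² as before.

```python
import numpy as np

def adam_step(w, m, n, grad, t, eta=0.001, mu=0.9, nu=0.999, eps=1e-8):
    """First/second moment estimates with bias correction, then the Adam update."""
    m = mu * m + (1.0 - mu) * grad            # m_t
    n = nu * n + (1.0 - nu) * grad ** 2       # n_t
    m_hat = m / (1.0 - mu ** t)               # bias-corrected first moment
    n_hat = n / (1.0 - nu ** t)               # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(n_hat) + eps)
    return w, m, n

# toy usage on f(w) = 0.5*||w||^2 (gradient = w); t starts at 1 for bias correction
w = np.array([1.0, -2.0])
m, n = np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    w, m, n = adam_step(w, m, n, grad=w, t=t)
print(w)
```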

5. NAG#

Nesterov's Accelerated Gradient.

Nesterov's improvement is to let the accumulated momentum directly influence the point at which the current gradient is evaluated (a look-ahead).

$$V_{t+1} = \mu V_t - \eta\, \nabla L(W_t + \mu V_t)$$
$$W_{t+1} = W_t + V_{t+1}$$

Where η is the base learning rate, and μ is the momentum.

(Figure: the classic illustration comparing a momentum update with a Nesterov update, described below.)

Momentum first calculates a gradient, the short blue vector. Then it makes a large jump in the direction of the accelerated gradient, the long blue vector.

The Nesterov term first makes a large jump in the previous accelerated gradient direction, the brown vector.

Finally, it calculates the gradient and makes a correction, the green vector.

** In the solver, base_lr is generally set to 0.01, and gamma to 0.1. **

** lr_policy is generally set to "step," with stepsize depending on the situation. **

** Momentum is generally set to 0.95. **
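
A minimal NumPy sketch of the Nesterov update, evaluating the gradient at the look-ahead point; grad_fn and the toy objective f(w) = 0.5·||w||² are my own stand-ins.

```python
import numpy as np

def nag_step(w, v, grad_fn, eta=0.01, mu=0.95):
    """Evaluate the gradient at the look-ahead point w + mu*v, then update."""
    lookahead_grad = grad_fn(w + mu * v)
    v = mu * v - eta * lookahead_grad
    return w + v, v

# toy usage on f(w) = 0.5*||w||^2, whose gradient is w itself
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    w, v = nag_step(w, v, grad_fn=lambda x: x)
print(w)
```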

6. RMSprop#

The formula for this algorithm has already been written in the AdaDelta section; RMSprop is an intermediate form of AdaDelta.

** In the solver, base_lr is set to either 1.0 or 0.001 depending on the situation. **

** lr_policy is generally set to "fixed". **

** rms_decay is generally set to 0.98. **

** momentum is generally set to 0.95. **
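
As with AdaDelta, here is a sketch that writes a matching solver file from Python; the net file name, max_iter and snapshot values are placeholders of my own, and the field names follow my reading of Caffe's SolverParameter, so check them against your version.

```python
# Sketch: a minimal RMSProp solver.prototxt matching the settings above.
# "train_val.prototxt" and the iteration counts are placeholders.
solver_text = """
net: "train_val.prototxt"
type: "RMSProp"
base_lr: 0.001
lr_policy: "fixed"
rms_decay: 0.98
momentum: 0.95
max_iter: 100000
snapshot: 10000
snapshot_prefix: "rmsprop"
"""
with open("solver_rmsprop.prototxt", "w") as f:
    f.write(solver_text.strip() + "\n")
```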

7. Choosing Optimization Methods and Strategies#

  1. RMSprop solves AdaGrad's problem of the updates shrinking towards zero in the later stages of training.

  2. In AdaDelta, the learning rate of parameters is independent of the initial learning rate, while in RMSprop, the learning rate of parameters is dependent on the initial learning rate.

  3. Adam adds bias correction and momentum compared to RMSprop (note that the momentum value in RMSprop is applied during the expectation calculation, not counted as momentum).

  4. If the input data is sparse, it is best to use adaptive learning rate optimization methods (AdaGrad, AdaDelta, RMSprop, and Adam) because they automatically adjust the learning rate during optimization.

  5. If optimization speed is a concern, then adaptive learning rate methods will be faster. ** Adam may be the best optimization method currently available. However, many experts still use the original SGD without momentum. **

  6. Shuffle: Randomly shuffle the order of data samples during each iteration.

  7. Curriculum Learning: Sort training samples in a meaningful way, which helps to gradually solve difficult problems.

  8. Normalize initial parameter values.

  9. Early Stopping, which can be combined with visualization techniques. Use the model before overfitting.

  10. Add regularization terms to the loss function.

  11. Dropout: do not apply Dropout during the testing phase.

  12. Parallel and distributed optimization methods: Hogwild, Downpour SGD, Delay-tolerant Algorithms for SGD, Elastic Averaging SGD, tools using TensorFlow.
