Preface#
At first I was reluctant to write this because I didn't know where to start, but I pushed myself to do it, hoping to gain a lot during the writing process.
1. Introduction to Blob#
If we compare a network structure (Net) to a building, then the layers (Layer) are like its floors, and Blobs are like the bricks.
In a Net, data is passed between Layers in the form of Blobs, including the forward activations (data) and the backward gradients (diff). A Blob is a four-dimensional array, (Num, Channels, Height, Width),
which can also be written as (n, k, h, w).
Taking a three-channel 480×640 image as an example, converted to Blob format the size of this Blob is (1, 3, 480, 640).
In the context of Caffe, N equals the batch size, K equals the number of outputs (feature maps) of the layer, and H and W are the height and width of the layer's output feature map. Note that some blogs claim that for a Convolution layer with 1024 outputs and a 7×7 convolution kernel, if the batch size is 1,
then the output Blob is (1, 1024, 7, 7). This is absolutely wrong!!! The output H and W are the sizes obtained after the previous layer's H and W have undergone the convolution operation with the 7×7 kernel.
For example, if the output of the previous layer is (32, 512, 14, 14), the previous layer's output feature map is 14×14. The current layer's number of outputs (number of convolution kernels) is specified as 1024, with a 5×5 convolution kernel, a stride of 1, and padding of 0. Therefore, the output feature map size is (14 - 5)/1 + 1 = 10, so the output Blob size is (32, 1024, 10, 10).
2. Introduction to Layer#
Layer is the basic computational unit of Caffe. Layers give the Net its hierarchy, letting us see the order and relationships of the computations intuitively.
As with constructing a building, data is computed and passed from bottom to top: bottom is a layer's input and top is its output.
Some derived classes of Layer:
1. Vision Layers: Responsible for processing images; both input and output are images.
1.1. Convolution Layer: Core layer.
lr_mult: Learning rate multiplier; the effective learning rate of the current layer is the base_lr in solver.prototxt multiplied by this parameter. If the parameter appears twice,
the second one applies to the bias term. The learning rate for the bias term is generally set to twice that of the weights.
num_output: Number of convolution kernels, i.e. the number of output channels (the k dimension of the output Blob).
kernel_size: Size of the convolution kernel. If the width and height are not equal, kernel_h and kernel_w can be used to set them.
stride: Convolution kernel stride, default is 1. Can also be set using stride_h and stride_w.
pad: Edge padding, default is 0. Can also be set using pad_h and pad_w. If pad = (kernel_size - 1)/2 is set (with a stride of 1), the width and height after convolution remain unchanged.
weight_filler: Weight initialization method. Default is "constant", all zeros. Currently, "xavier" is commonly used, and "gaussian" is also used.
bias_filler: Bias initialization method. Generally set to "constant", all zeros.
group: Grouping, default is 1 group.
The formula for calculating the output feature map width and height of the convolution layer: Output width and height = (Input width and height + 2*pad - kernel_size)/stride + 1
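As a sketch, here is a Convolution layer definition that matches the Blob example above; the layer and blob names ("conv1", "data") are illustrative:
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param { lr_mult: 1 }   # weight learning rate multiplier
  param { lr_mult: 2 }   # bias learning rate multiplier
  convolution_param {
    num_output: 1024     # number of convolution kernels (output channels)
    kernel_size: 5
    stride: 1
    pad: 0
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" }
  }
}
With a (32, 512, 14, 14) input, this layer produces the (32, 1024, 10, 10) output computed in the Blob section above.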
1.2. Pooling Layer: Downsampling layer, reduces the size of the data.
kernel_size: Pooling kernel size.
pool: Pooling method; options are MAX, AVE, and STOCHASTIC. Within each pooling window, the maximum (or the average) value is taken as the output of that position.
pad: Edge expansion.
stride: Stride.
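A minimal Pooling layer sketch (the names and the 2×2/stride-2 setting are illustrative):
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX        # MAX, AVE, or STOCHASTIC
    kernel_size: 2
    stride: 2
  }
}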
1.3. Local Response Normalization (LRN) Layer: Normalizes over local regions (lateral inhibition); used in AlexNet and GoogLeNet.
local_size: Default is 5. If it is cross-channel normalization, it represents the number of channels. If normalizing within a channel, it represents the width and height of the processing area.
alpha: Default is 1, parameter in the formula.
beta: Default is 0.75, parameter in the formula.
norm_region: Default is ACROSS_CHANNELS, indicating summation normalization across adjacent channels. WITHIN_CHANNEL indicates summation normalization within a single channel.
Normalization formula: each input value is the numerator, and the denominator is (1 + (alpha/n) * sum(x_i^2))^beta, where n is local_size and the sum runs over the local region around that value.
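A sketch of an LRN layer using the AlexNet-style values (alpha = 0.0001, beta = 0.75, which differ from the defaults listed above); the layer and blob names are illustrative:
layer {
  name: "norm1"
  type: "LRN"
  bottom: "pool1"
  top: "norm1"
  lrn_param {
    local_size: 5
    alpha: 0.0001
    beta: 0.75
    norm_region: ACROSS_CHANNELS
  }
}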
1.4. im2col Layer: Divides a large matrix into multiple overlapping sub-matrices, serializes each sub-matrix into a vector, and assembles these vectors into another matrix.
In Caffe, convolution first performs im2col on the input and then carries out an inner product (matrix multiplication). This approach is faster than computing the convolution directly.
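As a rough sketch of the shapes involved (assuming C input channels, a k×k kernel, N output kernels, and an H'×W' output map), the convolution forward pass becomes a single matrix product:

$$
\underbrace{W}_{N \times Ck^2} \cdot \underbrace{\mathrm{im2col}(x)}_{Ck^2 \times H'W'} = \underbrace{y}_{N \times H'W'}
$$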
1.5. Batch Normalization (BatchNorm) Layer: Normalization, zero mean, unit variance, applied in ResNet.
In the early days, whitening was used for data processing, commonly using PCA whitening. This involves first performing PCA on the data, then normalizing the variance.
However, whitening requires calculating the covariance matrix and performing inverse operations, which is computationally intensive and may not be differentiable during backpropagation.
Ideally, normalization would be performed on the entire dataset, but this is unrealistic.
Thus, Batch Norm was proposed, using the mean and variance of a Batch as estimates for the mean and variance of the entire dataset.
The BN algorithm process is as follows:
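For reference, the standard per-batch computation from the original Batch Normalization paper, for a mini-batch of m values x_1..x_m (epsilon is a small constant for numerical stability):

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2
$$

$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
$$

Note that Caffe's BatchNorm layer performs only the normalization step; the learnable scale gamma and shift beta are provided by a separate Scale layer (see the sketch further below).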
There is a parameter:
batch_norm_param {
use_global_stats: false
}
This parameter defaults to false, meaning normalization uses the mean and variance of the current batch of data.
If set to true, the layer instead uses its stored global statistics, i.e. the moving-average mean and variance accumulated over past batches during training.
In the training phase, keep the default value of false; otherwise the model will not converge.
In the test (inference) phase, set it to true; otherwise the accuracy will be very low.
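A sketch of how this typically looks in a prototxt. Because Caffe's BatchNorm layer only performs the normalization, it is usually followed by a Scale layer (with bias_term: true) that supplies the learnable scale and shift; the layer and blob names here are illustrative:
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"                # in-place
  batch_norm_param {
    use_global_stats: false   # false for TRAIN, true for TEST
  }
}
layer {
  name: "scale1"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }   # learnable gamma and beta
}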
2. Loss Layers: Used to calculate loss values, and update parameters based on the loss values through backpropagation.
Includes Softmax (SoftmaxWithLoss), Sum-of-Squares/Euclidean (EuclideanLoss),
Hinge/Margin (HingeLoss), Sigmoid Cross-Entropy (SigmoidCrossEntropyLoss),
Infogain (InfogainLoss), Top-k
2.1. SoftmaxWithLoss
Softmax is a classifier that outputs probabilities (likelihood), which is a generalization of Logistic Regression. Logistic regression can only be used for binary classification,
while Softmax can be used for multi-class classification.
The loss of the current sample is obtained by computing the cross-entropy between the Softmax output probabilities and the known label.
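A sketch of a SoftmaxWithLoss layer; it takes two bottoms, the class scores and the labels (the blob names "fc8" and "label" are illustrative):
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"     # class scores
  bottom: "label"   # ground-truth labels
  top: "loss"
}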
3. Activation/Neuron Layers: These perform in-place computation (the output overwrites the input values, so no new memory is occupied).
3.1. Sigmoid
No additional parameters. Maps the input to (0, 1), so it can be used as a classifier, and it is easy to differentiate. However, sigmoid is sensitive to initialization:
if the inputs are too large or too small, the gradient approaches 0 and the parameters stop updating. Moreover, the output of the sigmoid function is not zero-mean.
Formula: y = 1/(1 + e ^ -x)
Layer type: Sigmoid
3.2. TanH/Hyperbolic Tangent
The hyperbolic tangent function has a shape very similar to Sigmoid. It maps the input to (-1, 1), is easy to differentiate, and its output is zero-centered,
so to some extent TanH is slightly better than Sigmoid. It shares Sigmoid's drawback of being sensitive to initialization: inputs that are too large or too small cause the gradient to approach 0.
Formula: y = (e ^ x - e ^ -x) / (e ^ x + e ^ -x)
Layer type: TanH
3.3. ReLU/Rectified-Linear and Leaky-ReLU
ReLU is the most commonly used activation function today: it converges quickly and is easy to differentiate. Standard ReLU sets all negative values to 0, which can discard information. Leaky-ReLU introduces a parameter (negative_slope)
by which negative inputs are multiplied, preserving some of that information.
Standard ReLU formula: y = max(0, x)
Leaky-ReLU formula: y = max(x*negative_slope, x)
Layer type: ReLU
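A sketch of a ReLU layer applied in place; adding negative_slope turns it into Leaky-ReLU (the value 0.1 and the layer/blob names are illustrative):
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"          # in-place computation
  relu_param {
    negative_slope: 0.1 # omit (default 0) for standard ReLU
  }
}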
3.4. Absolute Value
Calculates the absolute value of each input data.
Formula: y = Abs(x)
Layer type: AbsVal
3.5. Power
Performs exponentiation on input data.
power: Default is 1
scale: Default is 1
shift: Default is 0
Formula: y = (shift + scale * x) ^ power
Layer type: Power
3.6. BNLL
Binomial Normal Log Likelihood
Formula: y = log(1 + exp(x))
Layer type: BNLL
4. Data Layers: The bottom layer of the network; mainly responsible for reading data and converting formats.
High efficiency: LevelDB, LMDB, memory. Low efficiency: HDF5, image files.
Layer type: Data
The include setting indicates whether the layer is used in the TRAIN or TEST phase.
transform_param is responsible for data preprocessing.
The data_param parameters vary depending on the data source.
4.1. transform_param
scale: 0.00390625 = 1/256, scales input pixel values into [0, 1).
mean_file: Path to the binaryproto mean file.
mean_value: Repeated three times, representing the mean of the three channels. The commonly used mean for ImageNet is {104, 117, 123}.
crop_size: Crop size; images are randomly cropped during TRAIN and center-cropped during TEST.
The following option is used only in the TRAIN phase.
mirror: 0 or 1 (false or true). Random horizontal mirroring.
4.2. Data from LevelDB or LMDB database
Layer type: Data
data_param parameters:
source: Path containing the database directory.
batch_size: Batch size.
backend: LevelDB or LMDB, default is the former.
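A sketch of a Data layer reading from LMDB that combines the include, transform_param, and data_param settings described above (the source path and numeric values are illustrative):
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    scale: 0.00390625        # 1/256
    mirror: true             # random horizontal flips
    crop_size: 227
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  data_param {
    source: "examples/my_dataset/train_lmdb"   # illustrative path
    batch_size: 32
    backend: LMDB
  }
}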
4.3. Data from memory
Layer type: MemoryData
memory_data_param parameters:
batch_size: Batch size.
channels: Number of channels.
height: Height.
width: Width.
4.4. Data from hdf5
Layer type: HDF5Data
hdf5_data_param parameters:
source: Path.
batch_size: Batch size.
4.5. Data from images
Layer type: ImageData
image_data_param parameters:
source: Path of a text file, each line contains the path and label of an image.
root_folder: Path, combined with the paths in the above txt file to form the complete path of the image.
batch_size: Batch size.
shuffle: Random shuffle. Default is false.
new_height, new_width: If set, resize the image.
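A sketch of an ImageData layer (the paths, batch size, and resize values are illustrative):
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  image_data_param {
    source: "data/train_list.txt"   # each line: image path and label
    root_folder: "data/images/"
    batch_size: 32
    shuffle: true
    new_height: 256
    new_width: 256
  }
}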
4.6. Data from window data files
Layer type: WindowData
window_data_param parameters:
source: Path of a text file.
batch_size: Batch size.
5. Common Layers
Inner Product (InnerProduct), Accuracy (Accuracy), Splitting (Split),
Flattening (Flatten), Reshape (Reshape), Concatenation (Concat), Slicing (Slice),
Elementwise (Eltwise), Argmax (ArgMax), Mean-Variance Normalization (MVN)
5.1. Inner Product (Fully Connected) Layer
Output dimension is (n, k, 1, 1).
Layer type: InnerProduct
lr_mult: Learning rate coefficient.
num_output: This is k.
weight_filler: Weight initialization method. Default is "constant", all zeros, generally set to "xavier" or "gaussian".
bias_filler: Bias initialization method, generally set to "constant", all zeros.
bias_term: Whether to use the bias term, default is true.
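A sketch of an InnerProduct layer (the names and num_output value are illustrative):
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "pool5"
  top: "fc6"
  param { lr_mult: 1 }
  param { lr_mult: 2 }
  inner_product_param {
    num_output: 4096
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" }
  }
}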
5.2. Accuracy
Outputs the accuracy of the classification results. It is only used in the TEST phase, so it requires an include setting specifying the phase.
Layer type: Accuracy
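A sketch showing the include setting mentioned above (the blob names are illustrative):
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc8"
  bottom: "label"
  top: "accuracy"
  include { phase: TEST }
}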
5.3. Dropout
Used to prevent overfitting; during training it randomly deactivates a fraction of the layer's nodes.
dropout_ratio: Fraction of nodes that are dropped.
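A sketch of a Dropout layer applied in place (the ratio 0.5 is a commonly used value, not a requirement; names are illustrative):
layer {
  name: "drop6"
  type: "Dropout"
  bottom: "fc6"
  top: "fc6"
  dropout_param {
    dropout_ratio: 0.5
  }
}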
5.4. Concatenation
Concatenation of input data.
Layer type: Concat
axis: Index into (n, k, h, w) indicating along which dimension to concatenate; the default is 1, i.e. the channel axis k.
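A sketch of a Concat layer joining two bottoms along the channel axis (the blob names are illustrative):
layer {
  name: "concat1"
  type: "Concat"
  bottom: "branch1"
  bottom: "branch2"
  top: "concat1"
  concat_param { axis: 1 }   # concatenate along k (channels)
}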
5.5. Slicing
Splitting of input data, opposite function to Concatenation.
Layer type: Slice
axis: Refer to above.
slice_point: The number of slice_point values must be one less than the number of top blobs. Each value is a cut index along axis: the first top receives everything before the first slice_point, each subsequent top receives the range between consecutive slice points,
and the last top receives everything after the final slice_point (the total size minus the sum of the preceding slices).
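A sketch of a Slice layer splitting a blob with, say, 100 channels into slices of 20, 30, and 50 channels (the names and numbers are illustrative); note that slice_point values are cut indices, not slice sizes:
layer {
  name: "slice1"
  type: "Slice"
  bottom: "data_all"   # assume 100 channels
  top: "part1"         # channels 0-19
  top: "part2"         # channels 20-49
  top: "part3"         # channels 50-99
  slice_param {
    axis: 1
    slice_point: 20
    slice_point: 50
  }
}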
The above content is partially referenced from: http://www.cnblogs.com/denny402/category/759199.html