I'm continuously taking in a lot of new material; some of it I master quickly but forget just as quickly, so I keep this memo for quick reference.
【2017.6.14 Started recording】--------------------------------------------------------------------------------------------------------------------------------------------------------------
Expectation (Mathematical Expectation):#
The mean: the sum, over all possible outcomes of the experiment, of each outcome's value multiplied by its probability, i.e. $E[X] = \sum_i x_i \, p(x_i)$.
Standard Deviation (Mean Square Deviation):#
The square root of the average of the squared differences from the mean, denoted σ; it is the arithmetic square root of the variance. The standard deviation reflects how dispersed a data set is: two data sets with the same mean can have different standard deviations.
Variance:#
Measures the dispersion of a variable around its expectation: $\mathrm{Var}(X) = E\big[(X - E[X])^2\big]$, the square of the standard deviation.
Covariance:#
Measures the joint variability of two variables; variance is the special case where the two variables are the same. For two real random variables X and Y with expectations E[X] and E[Y], it is defined as $\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$.
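A quick numpy check of all four quantities (the sample arrays are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

mean_x = x.mean()    # expectation (sample mean)
var_x = x.var()      # variance: mean of squared deviations from the mean
std_x = x.std()      # standard deviation: sqrt of variance

# Cov(X, Y) = E[(X - E[X]) * (Y - E[Y])]
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()
# np.cov(x, y, bias=True)[0, 1] gives the same value
```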
L-1 Norm:#
The sum of absolute values.
L-2 Norm:#
The square root of the sum of squares.
L-N Norm:#
The N-th root of the sum of the N-th powers of the absolute values: $\|x\|_N = \big(\sum_i |x_i|^N\big)^{1/N}$.
Manhattan Distance:#
L-1 distance.
Euclidean Distance:#
L-2 distance.
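A small numpy sketch of these norms and distances (the vectors are arbitrary examples):

```python
import numpy as np

v = np.array([3.0, -4.0])
w = np.array([1.0, 1.0])

l1 = np.abs(v).sum()                    # L-1 norm: sum of absolute values -> 7.0
l2 = np.sqrt((v ** 2).sum())            # L-2 norm: sqrt of sum of squares -> 5.0
n = 3
ln = (np.abs(v) ** n).sum() ** (1 / n)  # L-N norm: N-th root of sum of N-th powers

manhattan = np.abs(v - w).sum()         # Manhattan distance = L-1 distance
euclidean = np.linalg.norm(v - w)       # Euclidean distance = L-2 distance
```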
Cross-Entropy:#
Can be used as a loss function in neural networks (machine learning), where p represents the distribution of true labels, and q represents the predicted label distribution of the trained model. The cross-entropy loss function can measure the similarity between p and q.
It can also be written as: $H(p, q) = -\sum_i p(x_i) \log q(x_i)$.
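A minimal numpy sketch, assuming a one-hot true distribution p and a made-up prediction q:

```python
import numpy as np

p = np.array([1.0, 0.0, 0.0])   # true label distribution (one-hot)
q = np.array([0.7, 0.2, 0.1])   # model's predicted distribution

# H(p, q) = -sum_i p(x_i) * log(q(x_i))
cross_entropy = -(p * np.log(q)).sum()
print(cross_entropy)            # ~0.357; approaches 0 as q approaches p
```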
【2017.6.23 Updated】--------------------------------------------------------------------------------------------------------------------------------------------------------------
Least Squares (Ordinary Least Squares, OLS):#
Estimates parameters by minimizing the sum of squared errors; used for fitting and regression. The parameters are solved by setting the partial derivatives to zero and substituting back into the original function to obtain the model. Corresponds to the L-2 distance.
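A minimal sketch of OLS via the normal equations in numpy (the data points are invented):

```python
import numpy as np

# Fit y = w*x + b by minimizing the sum of squared errors.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

A = np.stack([x, np.ones_like(x)], axis=1)   # design matrix [x, 1]
# Closed form (normal equations): theta = (A^T A)^{-1} A^T y
theta = np.linalg.solve(A.T @ A, A.T @ y)
w, b = theta
# np.linalg.lstsq(A, y, rcond=None) gives the same solution
```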
Maximum Likelihood Estimation (MLE):#
Given known experimental results (i.e., the samples), MLE estimates the parameters of the distribution that generated them, taking the parameter θ that maximizes the likelihood as the true parameter estimate θ*. In other words, it works backward from the known outcome to the parameter value under which that outcome is most probable. Closely related to the Kullback-Leibler distance (relative entropy).
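A sketch of MLE for a Gaussian, where the closed forms come from setting the derivatives of the log-likelihood to zero (the sample values are made up):

```python
import numpy as np

# Samples assumed drawn from N(mu, sigma^2) with unknown parameters.
samples = np.array([2.1, 1.8, 2.4, 2.0, 1.7, 2.2])

# Setting d(log-likelihood)/d(theta) = 0 gives the estimates theta*:
mu_mle = samples.mean()                     # MLE of the mean
var_mle = ((samples - mu_mle) ** 2).mean()  # MLE of the variance (divides by N, not N-1)
```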
Kullback-Leibler Distance (Relative Entropy):#
$D_{KL}(P \| Q)$ measures the distance between two probability distributions P and Q over the same probability space: $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. In practical applications, P often represents the true distribution of the data, while Q is generally an approximation of P.
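Computing it directly from the definition (the distributions are arbitrary examples):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])   # "true" distribution
Q = np.array([0.4, 0.4, 0.2])   # approximation of P

# D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); asymmetric and >= 0
d_kl = (P * np.log(P / Q)).sum()
```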
【2017.7.6 Updated】--------------------------------------------------------------------------------------------------------------------------------------------------------------
Radial Basis Function:#
A radial basis function is a real-valued function whose value depends only on the distance from the origin, i.e. Φ(x) = Φ(‖x‖); it can also depend on the distance from some point c, called the center, giving Φ(x, c) = Φ(‖x − c‖). Any function Φ satisfying Φ(x) = Φ(‖x‖) is a radial basis function. The standard choice of distance is Euclidean (the Euclidean radial basis function), though other distance functions are also acceptable. In neural network structures it can act as the main function of fully connected and ReLU layers; in support vector machines it serves as a kernel function, and SVM's gamma parameter is the radial basis function's parameter.
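A sketch of a Gaussian RBF, assuming the common form exp(-gamma·‖x − c‖²); the function name `rbf` and the sample points are my own:

```python
import numpy as np

def rbf(x, c, gamma=0.5):
    """Gaussian radial basis function: value depends only on the distance ||x - c||.
    gamma plays the same role as the gamma parameter in an SVM's RBF kernel."""
    return np.exp(-gamma * np.linalg.norm(x - c) ** 2)

x = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])   # center point
print(rbf(x, c))           # shrinks as x moves away from c in any direction
```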
【2017.7.27 Updated】--------------------------------------------------------------------------------------------------------------------------------------------------------------
Decided to memorize some ML fundamentals: at work I mostly just use framework tools, and although I have read a lot of the underlying theory, I keep forgetting it.
Initialization:#
- Make the data zero-mean with unit variance by subtracting the mean and dividing by the standard deviation.
- During training and testing of convolutional neural networks, the input is mean-centered so that the input distribution sits around the origin, which speeds up fitting.
- Input data initialization often also includes whitening, which removes correlation. A common method is PCA whitening: perform PCA on the data, then normalize the variance. Whitening is computationally heavy and the transform is not necessarily differentiable for backpropagation, so it is not recommended (see the numpy sketch after the reference link below).
- Batch Normalization: normalizes activations over each mini-batch; since plain normalization can reduce the model's expressive power as the number of layers increases, two learnable parameters (a scale γ and a shift β) were added.
The above is referenced from: http://blog.csdn.net/elaine_bao/article/details/50890491
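A minimal numpy sketch of standardization and PCA whitening as described above (the random data and the 1e-5 epsilon are my own choices; batch norm's learnable γ/β are omitted):

```python
import numpy as np

X = np.random.randn(1000, 3) * [2.0, 0.5, 1.0] + [1.0, -2.0, 0.0]  # (samples, features)

# Standardize: subtract mean, divide by standard deviation -> zero mean, unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA whitening sketch: rotate onto principal components, then normalize each variance
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = (Xc @ eigvecs) / np.sqrt(eigvals + 1e-5)   # epsilon guards divide-by-zero
```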
DropOut:#
The purpose is to prevent overfitting. Making a CNN deeper and wider increases its expressiveness and classification ability, but also makes it more prone to overfitting.
This method can be used after any layer.
Specifically, during training some randomly chosen network nodes are disabled, i.e. their output is set to 0 (a sketch follows the reference link below).
DropConnect:#
During training, randomly set some weights to 0 instead; everything else is the same.
The above is referenced from: http://blog.csdn.net/elaine_bao/article/details/50890473
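A toy numpy sketch contrasting the two masks (the inverted scaling by the keep probability p is a common convention I've added, not something stated above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # keep probability

x = rng.standard_normal(4)               # activations of some layer
W = rng.standard_normal((3, 4))          # weights of the next layer

# DropOut: zero out random *outputs* during training; dividing by p keeps E[x]
dropout_mask = rng.random(x.shape) < p
x_dropped = x * dropout_mask / p

# DropConnect: zero out random *weights* instead; everything else is the same
dropconnect_mask = rng.random(W.shape) < p
y = (W * dropconnect_mask) @ x
```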
【2017.8.31 Updated】--------------------------------------------------------------------------------------------------------------------------------------------------------------
Convolutional Network Parameter Initialization#
If the initial parameters are too small, the signal shrinks as it passes through each layer and eventually becomes too weak to be useful; if they are too large, the signal grows from layer to layer and the network diverges.
- Xavier initialization draws parameters uniformly from a range set by the layer dimensions, typically $\big(-\sqrt{6/(in+out)},\ \sqrt{6/(in+out)}\big)$, where in is the input dimension of the current layer and out its output dimension. It is derived for tanh-like activations; for ReLU, the MSRA filler below compensates for the halved variance.
- MSRAFiller initialization considers only the number of inputs n, initializing with a Gaussian of mean 0 and variance 2/n.
- Uniform initialization initializes parameters with a uniform distribution, with min and max controlling the bounds, defaulting to (0, 1).
- Gaussian initialization draws from a Gaussian with the given mean and standard deviation.
- Constant initialization sets parameters to a given constant, defaulting to 0.
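Numpy sketches of all five schemes, assuming the uniform-Xavier range sqrt(6/(in+out)) given above (the dimensions are made up):

```python
import numpy as np

fan_in, fan_out = 256, 128
rng = np.random.default_rng(0)

# Xavier: uniform within a range set by the layer dimensions
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_xavier = rng.uniform(-limit, limit, size=(fan_in, fan_out))

# MSRA: Gaussian with mean 0 and variance 2/n, n = number of inputs
W_msra = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Uniform / Gaussian / constant fillers
W_uniform = rng.uniform(0.0, 1.0, size=(fan_in, fan_out))   # default range (0, 1)
W_gauss = rng.normal(0.0, 0.01, size=(fan_in, fan_out))     # given mean and std
W_const = np.zeros((fan_in, fan_out))                       # default constant 0
```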
【2017.11.14 Updated】--------------------------------------------------------------------------------------------------------------------------------------------------------------
Discontinuity Point of the First Kind#
If x0 is a discontinuity point of the function f(x) at which both the left and right limits exist, then x0 is called a discontinuity point of the first kind of f(x).
Among these, if the left and right limits are equal but differ from f(x0) (or f(x0) is undefined), it is a removable discontinuity, e.g. sin(x)/x at x = 0; if they are unequal, it is a jump discontinuity, e.g. sgn(x) at x = 0.
Discontinuity points that are not of the first kind are called discontinuity points of the second kind, e.g. 1/x at x = 0.
Dirichlet Conditions#
Dirichlet believed that a periodic signal can only be expanded into a Fourier series under certain conditions. The conditions are:
- In any finite interval, the function is continuous or has only a finite number of discontinuities of the first kind.
- Within one period, the function has a finite number of maxima and minima.
- x(t) is absolutely integrable over a single period, i.e. $\int_{T} |x(t)|\,dt < \infty$.
Fourier Transform#
Definition: let f(t) be a function of t. If f(t) satisfies the Dirichlet conditions, then the integral operation
$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt$
is called the Fourier transform of f(t), and the integral operation
$f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega)\, e^{i\omega t}\, d\omega$
is called the inverse Fourier transform of F(ω).
F(ω) is called the image function of f(t), and f(t) is called the original function of F(ω).
Fourier Series#
The continuous form of the Fourier transform is actually a generalization of the Fourier series, as integration is essentially a summation operator in a limit form.
For a periodic function f(t) with period T, its Fourier series representation is defined as
$f(t) = \sum_{n=-\infty}^{\infty} F_n\, e^{i 2\pi n t / T}$,
where the Fourier expansion coefficients are
$F_n = \frac{1}{T} \int_{0}^{T} f(t)\, e^{-i 2\pi n t / T}\, dt$.
For real-valued functions (functions with real number ranges), the Fourier series can be written as
$f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left[ a_n \cos\!\left(\tfrac{2\pi n t}{T}\right) + b_n \sin\!\left(\tfrac{2\pi n t}{T}\right) \right]$,
where $a_n$ and $b_n$ are the amplitudes of the real frequency components.
Discrete Fourier Transform (DFT)#
To use computers for Fourier transforms in scientific computing and digital signal processing, the function must be defined at discrete points rather than in a continuous domain, and it must satisfy finiteness or periodicity conditions.
In this case, the discrete Fourier transform of a length-N sequence $x_n$ is
$X_k = \sum_{n=0}^{N-1} x_n\, e^{-i 2\pi k n / N}, \quad k = 0, \dots, N-1$,
and its inverse transform is
$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{i 2\pi k n / N}$.
Computing the DFT directly from the definition costs $O(N^2)$; the Fast Fourier Transform (FFT) reduces this to $O(N \log N)$.
The above content is referenced from "Baidu Baike"
For a more detailed understanding of the Fourier transform formulas, you can refer to: https://www.zhihu.com/question/19714540
For intuition about their meaning and significance, you can refer to: https://zhuanlan.zhihu.com/wille/19763358
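A numpy check that the definition-based DFT matches np.fft.fft and that the inverse recovers the sequence (the sequence itself is arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0])
N = len(x)

# DFT straight from the definition: O(N^2)
n = np.arange(N)
X = np.array([(x * np.exp(-2j * np.pi * k * n / N)).sum() for k in range(N)])

# FFT computes the same thing in O(N log N)
assert np.allclose(X, np.fft.fft(x))

# The inverse transform recovers the original sequence
assert np.allclose(x, np.fft.ifft(X).real)
```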
Complex Number Operations#
Addition: Add real parts, add imaginary parts.
Subtraction: Subtract real parts, subtract imaginary parts.
Multiplication:
$(a + ib)(c + id) = ac + iad + ibc + i^2 bd = (ac - bd) + i(ad + bc)$, since $i^2 = -1$.
If complex numbers are plotted in a coordinate system, the horizontal axis is the real part and the vertical axis the imaginary part.
The modulus of $a + ib$ is $\sqrt{a^2 + b^2}$.
It follows that in this plane, multiplying two complex numbers multiplies their moduli and adds their angles.
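Python's built-in complex type can verify both views of multiplication:

```python
import cmath

z1 = 3 + 4j     # modulus 5
z2 = 1 + 1j     # modulus sqrt(2), angle 45 degrees

print(z1 * z2)  # (3*1 - 4*1) + (3*1 + 4*1)j = -1 + 7j

# Multiplying moduli and adding angles gives the same product:
r = abs(z1) * abs(z2)
phi = cmath.phase(z1) + cmath.phase(z2)
print(cmath.rect(r, phi))   # ~(-1+7j)
```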
Coefficient Representation and Point Value Representation of Polynomials#
A polynomial of degree n has n+1 coefficients (for powers 0 through n).
- Forming these n+1 coefficients into an (n+1)-dimensional vector uniquely determines the polynomial; this vector is the coefficient representation.
- Substituting n+1 distinct numbers and computing the n+1 corresponding values also uniquely determines the polynomial; these numbers and values form the point-value representation.
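A numpy sketch converting between the two representations (the example polynomial is my own):

```python
import numpy as np

# p(x) = 2 + 0*x + 1*x^2, degree n = 2 -> n + 1 = 3 coefficients
coeffs = np.array([2.0, 0.0, 1.0])       # coefficient representation (low to high)

xs = np.array([0.0, 1.0, 2.0])           # any 3 distinct points
ys = np.polynomial.polynomial.polyval(xs, coeffs)  # point-value representation

# Interpolating those 3 points recovers the unique degree-2 polynomial
recovered = np.polynomial.polynomial.polyfit(xs, ys, deg=2)
assert np.allclose(recovered, coeffs)
```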
Kronecker Product#
Written $A \otimes B$; if A is an m×n matrix and B is a p×q matrix, their Kronecker product is the mp×nq block matrix obtained by replacing each entry $a_{ij}$ of A with the block $a_{ij} B$.
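A small numpy example using np.kron:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])        # 2x2  (m x n)
B = np.array([[0, 1]])        # 1x2  (p x q)

K = np.kron(A, B)             # shape (m*p) x (n*q) = 2x4
# Each entry a_ij of A is replaced by the block a_ij * B:
# [[0 1 0 2]
#  [0 3 0 4]]
```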
【2017.11.15 Updated】--------------------------------------------------------------------------------------------------------------------------------------------------------------
Dirac Delta Function#
Definition: $\delta(t) = 0$ for $t \neq 0$, with $\int_{-\infty}^{\infty} \delta(t)\, dt = 1$.
Properties: the sifting property, $\int_{-\infty}^{\infty} f(t)\, \delta(t - t_0)\, dt = f(t_0)$.
By this property, δ(t) can be used to represent any signal: $f(t) = \int_{-\infty}^{\infty} f(\tau)\, \delta(t - \tau)\, d\tau$.
And this property is used in the derivation of the Fourier transform formula.
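A numerical sketch of the sifting property, approximating δ with a narrow unit-area Gaussian (the grid and width are my own choices):

```python
import numpy as np

# Approximate delta(t - t0) with a narrow Gaussian of unit area, then check
# the sifting property: integral of f(t) * delta(t - t0) dt ~= f(t0).
t = np.linspace(-10, 10, 200001)
dt = t[1] - t[0]
t0, eps = 1.5, 0.01
delta = np.exp(-((t - t0) ** 2) / (2 * eps ** 2)) / (eps * np.sqrt(2 * np.pi))

f = np.sin  # any test signal
print((f(t) * delta).sum() * dt)   # ~= sin(1.5) ~= 0.997
```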
To be continued…