Optimization
An optimization problem has an objective function that is defined in terms of a set of variables, referred to as optimization variables.
The goal of the optimization problem is to compute the values of the variables at which the objective function is either maximized or minimized.
It is common to use the minimization form of the objective function in ML, where it is often referred to as a loss function.
A loss function usually refers to an objective function with certain properties, quantifying a nonnegative “cost” associated with a particular configuration of variables.
Matrix Derivative
Types | Scalar | Vector | Matrix |
---|---|---|---|
Scalar | \(\frac{\partial y}{\partial x}\) | \(\frac{\partial \vec{y}}{\partial x}\) | \(\frac{\partial \mathbf{Y}}{\partial x}\) |
Vector | \(\frac{\partial y}{\partial \vec{x}}\) | \(\frac{\partial \vec{y}}{\partial \vec{x}}\) | |
Matrix | \(\frac{\partial y}{\partial \mathbf{X}}\) | | |
First Derivative
Scalar-by-Vector
It is used for scalar fields.
\[\frac{\partial}{\partial \vec{x}} f = \begin{bmatrix} \frac{\partial}{\partial x_1}f & \cdots & \frac{\partial}{\partial x_n}f \end{bmatrix}\]
Vector-by-Vector
It refers to the Jacobian matrix.
\[\begin{align} \frac{\partial}{\partial \vec{x}} F = J_F = \begin{bmatrix} \frac{\partial}{\partial x_1} F_1(\vec{x}) & \cdots & \frac{\partial}{\partial x_n} F_1(\vec{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1} F_m(\vec{x}) & \cdots & \frac{\partial}{\partial x_n} F_m(\vec{x}) \end{bmatrix} && F: \mathbb{R}^{n} \rightarrow \mathbb{R}^m \end{align}\]
E.g. \(F(\vec{x}) = \begin{bmatrix} F_1 \\ F_2 \\ F_3 \end{bmatrix} = \begin{bmatrix} x_1 + x_2 \\ x_1 - x_2 \\ x_1^2 + x_2^2 \end{bmatrix}\)
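Computing each entry \(\frac{\partial}{\partial x_i} F_j\) of the definition for this example gives
\[J_F = \begin{bmatrix} 1 & 1 \\ 1 & -1 \\ 2x_1 & 2x_2 \end{bmatrix}\]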
Scalar-by-Matrix
\[\begin{align} \frac{\partial}{\partial X} f = \begin{bmatrix} \frac{\partial}{\partial x_{11}} f & \cdots & \frac{\partial}{\partial x_{m1}} f \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_{1n}} f & \cdots & \frac{\partial}{\partial x_{mn}} f \end{bmatrix} && f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R} \end{align}\]
Gradient
The gradient operator is the transpose of the derivative operator: \(\nabla = (\frac{\partial}{\partial x})^\intercal\)
Scalar-by-Vector
It is used for scalar fields. The gradient vector is the first derivative of the function \(f\):
\[\begin{align} \nabla f(\vec{x}) = \begin{bmatrix} \frac{\partial}{\partial x_1} f(\vec{x}) \\ \vdots \\ \frac{\partial}{\partial x_n} f(\vec{x}) \end{bmatrix} && f: \mathbb{R}^{n} \rightarrow \mathbb{R} \end{align}\]
E.g. \(f(\vec{x}) = x_1 + x_2\), whose gradient is \(\nabla f(\vec{x}) = \begin{bmatrix} 1 & 1 \end{bmatrix}^\intercal\)
Scalar-by-Matrix
\[\begin{align} \nabla f = \begin{bmatrix} \frac{\partial}{\partial x_{11}} f & \cdots & \frac{\partial}{\partial x_{1n}} f \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_{m1}} f & \cdots & \frac{\partial}{\partial x_{mn}} f \end{bmatrix} && f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R} \end{align}\]
Second Derivative
Scalar-by-Vector
It refers to the Hessian matrix.
\[\begin{align} H(f(\vec{x})) = \nabla^2 f(\vec{x}) = \nabla\nabla^\intercal f = \begin{bmatrix} \frac{\partial^2}{\partial x_1^2} f(\vec{x}) & \cdots & \frac{\partial^2}{\partial x_1 \partial x_n} f(\vec{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_n \partial x_1} f(\vec{x}) & \cdots & \frac{\partial^2}{\partial x_n^2} f(\vec{x}) \end{bmatrix} && f: \mathbb{R}^{n} \rightarrow \mathbb{R} \end{align}\]
Connections
The Jacobian matrix can be seen as the transpose of the outer product of the gradient operator and the function vector \(F\):
\[J_F = (\nabla \otimes F)^\intercal = \left( \begin{bmatrix} \frac{\partial}{\partial x_1} \\ \vdots \\ \frac{\partial}{\partial x_n} \end{bmatrix} \begin{bmatrix} F_1 & \cdots & F_m \end{bmatrix} \right)^\intercal\]
Relationship between Gradient, Jacobian and Hessian:
\[H(f(\vec{x})) = J(\nabla f (\vec{x}))^\intercal\]
Common Objective Derivative
- \(A\) is a constant \(d \times d\) matrix
- \(B\) is a constant \(n \times d\) matrix
- \(\vec{b}\) is a constant \(n\)-dimensional vector independent of the parameter vector \(\vec{w}\)
- \(C\) is a \(k \times d\) matrix
Objective \(J\) | Derivative of \(J\) with respect to \(\vec{w}\) |
---|---|
\(\vec{w}^\intercal A \vec{w}\) | \((A + A^\intercal)\vec{w}\) |
\(\vec{b}^\intercal B \vec{w}\) | \((B^\intercal)\vec{b}\) |
\(\lVert B\vec{w} + \vec{b} \rVert^2\) | \(2 B^\intercal ( B\vec{w} + \vec{b} )\) |
\(f(g(\vec{w}))\) | \(f'(g(\vec{w})) \nabla_{\vec{w}} g(\vec{w})\) |
\(f(\vec{w} \cdot \vec{a} )\) | \(f'(\vec{w} \cdot \vec{a})\vec{a}\) |
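These identities are easy to sanity-check numerically. The following NumPy snippet (an illustration added here, using a random \(A\)) compares the first row of the table against a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))   # constant d x d matrix
w = rng.standard_normal(d)

def J(w):
    return w @ A @ w              # objective w^T A w

# closed form from the table: (A + A^T) w
closed_form = (A + A.T) @ w

# finite-difference estimate, one coordinate at a time
delta = 1e-6
numeric = np.array([
    (J(w + delta * np.eye(d)[i]) - J(w)) / delta
    for i in range(d)
])

print(np.allclose(closed_form, numeric, atol=1e-4))  # True
```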
Convex
If \(f\) is convex, then every local minimizer \(\vec{w}\) is a global minimizer.
Convex Set
A set \(S\) is convex if, for every pair of points \(w_1, w_2 \in S\), the point \(\lambda w_1 + (1-\lambda) w_2\) is also in \(S\) for all \(\lambda \in (0, 1)\)
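E.g. the halfspace \(S = \{\vec{w} : \vec{a}^\intercal \vec{w} \le b\}\) is convex: for \(w_1, w_2 \in S\), \(\vec{a}^\intercal(\lambda w_1 + (1-\lambda) w_2) = \lambda \vec{a}^\intercal w_1 + (1-\lambda) \vec{a}^\intercal w_2 \le \lambda b + (1-\lambda) b = b\).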
Convex function
A convex function \(F(w)\) is defined as a function with a convex domain that satisfies the following condition for any \(\lambda \in (0, 1)\) and any \(w_1, w_2\) in its domain:
\[F(\lambda w_1 + (1-\lambda) w_2) \leq \lambda F(w_1) + (1-\lambda)F(w_2)\]
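E.g. \(F(w) = w^2\) satisfies this condition, since \(\lambda F(w_1) + (1-\lambda)F(w_2) - F(\lambda w_1 + (1-\lambda) w_2) = \lambda(1-\lambda)(w_1 - w_2)^2 \ge 0\).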
Properties
- A linear function of the vector \(\vec{w}\) is always convex.
- The sum of convex functions is always convex.
- The maximum of convex functions is convex.
- The square of a nonnegative convex function is convex.
- If \(F(\cdot)\) is a convex function with a single argument and \(G(\vec{w})\) is a linear function with a scalar output, then \(F(G(\vec{w}))\) is convex.
First-Derivative Characterization of Convexity
A differentiable function \(F(\vec{w})\) is a convex function iff it satisfies the following for any pair \(\vec{w}_0\) and \(\vec{w}\):
\[F(\vec{w}) \ge F(\vec{w}_0) + [\nabla F(\vec{w}_0)]^\intercal(\vec{w} - \vec{w}_0)\]
Second-Derivative Characterization of Convexity
A twice-differentiable function \(F(\vec{w})\) is convex iff it has a positive semidefinite Hessian at every value of the parameter \(\vec{w}\) in the domain of \(F(\cdot)\).
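E.g. \(F(\vec{w}) = w_1^2 + w_2^2\) has Hessian \(\nabla^2 F(\vec{w}) = 2I\) at every \(\vec{w}\), which is positive semidefinite, so \(F\) is convex.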
Unconstrained Optimization
Unconstrained optimization is concerned with finding local and global minimizers of an objective function, with no constraints on the variables.
FONC & SONC
First Order Necessary Condition
If \(\vec{x}\) is a local minimizer of \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) and \(f\) is continuously differentiable in an open neighborhood of \(\vec{x}\), then
\[\nabla f(\vec{x}) = 0\]
For the 1-dimensional scalar case: \(\frac{d}{dx} f(x_0) = 0\)
Second Order Necessary Condition
Every stationary point \(\vec{x}_0\) with increasing function values around it is a local minimizer.
If \(\vec{x}\) is a local minimizer of \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) and \(\nabla^2 f\) is continuous in an open neighborhood of \(\vec{x}\), then
\[\begin{cases} \nabla f(\vec{x}) = 0 \\ \nabla^{2} f(\vec{x}) \text{ is positive semidefinite} \end{cases}\]
A matrix \(A \in \mathbb{R}^{d \times d}\) is positive semidefinite if
\[\vec{x}^\intercal A \vec{x} \ge 0 \text{ for all } \vec{x} \in \mathbb{R}^d\]
Positive semidefiniteness is deduced from the second-order, multivariate Taylor expansion.
Here, at a stationary point \(\vec{x}_0\), along a direction \(\vec{v}\) and for a small radius \(\epsilon \gt 0\):
\[f(\vec{x}_0+\epsilon \vec{v}) \approx f(\vec{x}_0) + \epsilon \vec{v}^\intercal[ \nabla f(\vec{x}_0) ] + \frac{\epsilon^2}{2} [\vec{v}^\intercal H_f \vec{v}]\]
Since \(\vec{x}_0\) is a stationary point, \(\nabla f(\vec{x}_0) = 0\).
Therefore, whether \(f(\vec{x}_0+\epsilon \vec{v})\) is less than \(f(\vec{x}_0)\) depends on the sign of \(\vec{v}^\intercal H_f \vec{v}\); positive semidefiniteness of \(H_f\) ensures that no direction \(\vec{v}\) decreases \(f\).
For the 1-dimensional scalar case:
\[\begin{cases} \frac{d}{dx} f(x_0) & = 0 \\ \frac{d^2}{dx^2} f(x_0) & \ge 0 \end{cases}\]
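In practice, positive semidefiniteness is usually checked via the eigenvalues rather than the definition; below is a small NumPy sketch (the test matrices are illustrative):

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """A symmetric matrix is PSD iff all its eigenvalues are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

print(is_psd(np.array([[2.0, 0.0], [0.0, 1.0]])))  # True
print(is_psd(np.array([[1.0, 2.0], [2.0, 1.0]])))  # False (eigenvalues 3, -1)
```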
Numerical Optimization
function Optimizer(f)
    x[0] <- initialization
    for t in {0, ..., t_max - 1} do
        x[t+1] <- update(x[t], f)
    end
    return x[t_max]
end
Different numerical optimization methods correspond to different update strategies.
Such methods can often find a local minimizer efficiently, but finding the global minimizer of a non-convex objective function is hard.
Gradient Descent
\(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) is the cost function, \(\alpha\) is the step size.
\[\vec{w}_{t+1} \Leftarrow \vec{w}_{t} - \alpha \nabla f (\vec{w}_{t})\]
We can numerically calculate \(\nabla f (\vec{w}_{t})\) as follows, where \(\Delta\) is a tiny increase of \(w_i\).
\[\frac{\partial}{\partial w_{i}} f (\vec{w}) \approx \frac{ f(w_1, \cdots, w_i + \Delta, \cdots, w_d) - f(w_1, \cdots, w_i, \cdots, w_d) }{\Delta}\]
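Putting the update rule and the numerical gradient together, here is a minimal NumPy sketch (the function names and the toy objective are illustrative):

```python
import numpy as np

def numerical_grad(f, w, delta=1e-6):
    """Forward-difference estimate of grad f(w), as in the formula above."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = delta
        grad[i] = (f(w + e) - f(w)) / delta
    return grad

def gradient_descent(f, w0, alpha=0.1, t_max=1000):
    """w_{t+1} <- w_t - alpha * grad f(w_t), repeated t_max times."""
    w = np.asarray(w0, dtype=float)
    for _ in range(t_max):
        w = w - alpha * numerical_grad(f, w)
    return w

# toy objective with minimizer (1, -2)
f = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
print(gradient_descent(f, [0.0, 0.0]))  # approx [ 1. -2.]
```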
Learning Rate Decay and Bold Driver
A fixed learning rate \(\alpha\) either takes too long to reach the minimizer when it is too small, or oscillates around the minimizer for a long time when it is too large.
Allowing the learning rate to decay over time naturally achieves the desired learning-rate adjustment and avoids both problems.
\[\vec{w}_{t+1} \Leftarrow \vec{w}_{t} - \alpha_t \nabla f (\vec{w}_{t})\]
The time \(t\) is typically measured in terms of the number of cycles over all training points.
Exponential Decay
\[\alpha_t = \alpha_0 \exp(-k \cdot t)\]
The parameter \(k\) controls the rate of the decay.
Inverse Decay
\[\alpha_t = \frac{\alpha_0}{1 + k \cdot t}\]
The parameter \(k\) controls the rate of the decay.
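Both schedules are one-liners in code; a small sketch (the parameter values are arbitrary):

```python
import math

def exponential_decay(alpha0, k, t):
    return alpha0 * math.exp(-k * t)

def inverse_decay(alpha0, k, t):
    return alpha0 / (1 + k * t)

# e.g. alpha0 = 0.5, k = 0.1: both shrink the step size as t grows
print(exponential_decay(0.5, 0.1, 10))  # ~0.184
print(inverse_decay(0.5, 0.1, 10))      # 0.25
```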
Coordinate Descent
Coordinate descent is a method that optimizes the objective function one variable at a time.
At each step, we minimize along a single direction, then switch to another direction for the next step.
\[\begin{gather} (w_1^{(i)}, \cdots, w_m^{(i+1)}, \cdots, w_n^{(i)}) \Leftarrow (w_1^{(i)}, \cdots, w_m^{(i)}, \cdots, w_n^{(i)}) \\ \text{where } w_m^{(i+1)} \text{ solves } \frac{\partial}{\partial w_m} f(\vec{w}) = 0 \end{gather}\]
There are two direction-change strategies.
- Greedy: choose the direction with maximal descent
- Cyclic: choose the direction from left to right
When we talk about coordinate descent, we usually mean cyclic coordinate descent.
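Below is a minimal NumPy sketch of cyclic coordinate descent for the least-squares objective \(\lVert B\vec{w} + \vec{b} \rVert^2\) from the table above, where each coordinate update solves \(\frac{\partial}{\partial w_m} f(\vec{w}) = 0\) in closed form (the data here is random and illustrative):

```python
import numpy as np

def cyclic_coordinate_descent(B, b, n_cycles=100):
    """Minimize ||B w + b||^2 one coordinate at a time."""
    n, d = B.shape
    w = np.zeros(d)
    r = B @ w + b                      # current residual
    for _ in range(n_cycles):
        for m in range(d):             # cyclic: left to right
            col = B[:, m]
            # exact minimizer along coordinate m: d/dw_m ||r||^2 = 0
            step = (col @ r) / (col @ col)
            w[m] -= step
            r -= step * col            # keep the residual in sync
    return w

rng = np.random.default_rng(0)
B, b = rng.standard_normal((20, 3)), rng.standard_normal(20)
w = cyclic_coordinate_descent(B, b)
print(np.allclose(w, np.linalg.lstsq(B, -b, rcond=None)[0]))  # True
```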
Constrained Optimization
Many ML applications involve two simple types of constraints:
- Linear and convex constraints: \(f(\vec{x}) \le b\)
- Norm constraints: \(\lVert \vec{w} \rVert = 1\)
Lagrangian Relaxation
Consider a minimization problem:
\[\begin{align} P = \min_{\vec{w}} \ & F(\vec{w}) \\ \text{s.t. } & f_i(\vec{w}) \le 0 && 1 \le i \le m \end{align}\]
This is called the primal problem. Lagrangian relaxation is most useful when \(F(\vec{w})\) and each \(f_i(\vec{w})\) are convex.
The Lagrangian relaxation is defined with the use of nonnegative Lagrange multipliers:
\[\begin{gather} \vec\lambda = [\lambda_1, \dots, \lambda_m ]^\intercal \\ L(\vec{\lambda}) = \min_{\vec{w}} (F(\vec{w}) + \sum_{i=1}^{m} \lambda_i f_i(\vec{w})) \end{gather}\]
Proof that \(L(\vec{\lambda})\) lower-bounds \(P\): for any feasible \(\vec{w}\) we have \(f_i(\vec{w}) \le 0\) and \(\lambda_i \ge 0\), so \(F(\vec{w}) + \sum_{i=1}^{m} \lambda_i f_i(\vec{w}) \le F(\vec{w})\); minimizing over all \(\vec{w}\) therefore gives
\[L(\vec{\lambda}) \le \min_{\text{feasible } \vec{w}} F(\vec{w}) = P\]
Lagrangian Duality
The value of \(L(\vec{\lambda})\) for any nonnegative vector \(\vec{\lambda}\) is thus always no larger than the optimal solution of the primal.
One can tighten this bound by maximizing \(L(\vec{\lambda})\) over all nonnegative \(\vec{\lambda}\) and formulating the dual problem with objective function \(D\).
\[\begin{align} D &= \max_{\vec\lambda \ge 0} L(\vec{\lambda}) \\ &= \max_{\vec\lambda \ge 0} \min_{\vec{w}} (F(\vec{w}) + \sum_{i=1}^{m} \lambda_i f_i(\vec{w})) \end{align}\]
We summarize the relationship between the primal and the dual as follows, where \(\vec{\lambda}^*\) is the maximizing multiplier vector:
\[D = L(\vec{\lambda}^*) \le P\]
This result is referred to as weak duality.
The Lagrangian optimization problem is a minimax problem containing disjoint minimization and maximization variables.
The ordering of the minimization and the maximization in a minimax optimization problem matters in general.
If \(F(\vec w)\) and each \(f_i(\vec w)\) are convex functions, then the optimal objective function value of the dual problem created using Lagrangian relaxation is almost always the same as that of the primal.
We say “almost always” because we also need a relatively weak condition referred to as Slater’s condition, which states that at least one strictly feasible point exists satisfying \(f_i(\vec w) < 0\) for each \(i\). For most machine learning problems, these conditions hold by default.
The general procedure for using duality is to eliminate the primal variables, producing a pure maximization problem over the constraint (dual) variables alone.
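As a minimal worked illustration, consider the toy problem of minimizing \(F(w) = w^2\) subject to \(1 - w \le 0\):
\[\begin{align} L(\lambda) &= \min_{w} \left( w^2 + \lambda(1 - w) \right) = \lambda - \frac{\lambda^2}{4} && (w = \lambda/2) \\ D &= \max_{\lambda \ge 0} \left( \lambda - \frac{\lambda^2}{4} \right) = 1 && (\lambda^* = 2) \end{align}\]
Substituting \(\lambda^* = 2\) back gives \(w = 1\), and indeed \(D = 1 = P\); Slater's condition holds (e.g. \(w = 2\) gives \(1 - w = -1 < 0\)), so strong duality applies.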
For a full-scale example of this procedure, see the SVM.