optimization

An optimization problem has an objective function that is defined in terms of a set of variables, referred to optimization variables.

The goal of the optimization problem is to compute the values of the variables at which the objective function is either maximized or minimized.

It is common to use minimization form of the objective function in ML and it is often referred to as a loss function.

loss function often refers to an objective function with certain types of properties quantifying a nonnegative “cost” associated with a particular configuration of variables.

Matrix derivative

Types Scalar Vector Matrix
Scalar \(\frac{\partial y}{\partial x}\) \(\frac{\partial \vec{y}}{\partial x}\) \(\frac{\partial \mathbf{Y}}{\partial x}\)
Vector \(\frac{\partial y}{\partial \vec{x}}\) \(\frac{\partial \vec{y}}{\partial \vec{x}}\)  
Matrix \(\frac{\partial y}{\partial \mathbf{X}}\)    

First Derivative

Scalar-by-Vector

It is used in scalar field.

\[\frac{\partial}{\partial \vec{x}} f = \begin{array} ([ \frac{\partial}{\partial x_1}f & \cdots & \frac{\partial}{\partial x_n}f ] \end{array}\]

Vector-by-Vector

It refers Jacobian matrix.

\[\begin{align} \frac{\partial}{\partial \vec{x}} F = J_F = \begin{bmatrix} \begin{array} ( \frac{\partial}{\partial x_1} F_1(\vec{x}) & \cdots & \frac{\partial}{\partial x_n} F_1(\vec{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1} F_m(\vec{x}) & \cdots & \frac{\partial}{\partial x_n} F_m(\vec{x}) \\ \end{array} \end{bmatrix} & & F: \mathbb{R}^{n} \rightarrow \mathbb{R}^m \end{align}\]

E.g. \(F(\vec{x}) =\begin{bmatrix} \begin{array} (F_1 \\ F_2 \\ F_3 \end{array} \end{bmatrix} = \begin{bmatrix} \begin{array} (x_1 + x_2 \\ x_1 - x_2 \\ x_1^2 + x_2^2 \end{array} \end{bmatrix}\)

Scalar-by-Matrix

\[\begin{align} \frac{\partial}{\partial X} f =\begin{bmatrix} \begin{array} ( \frac{\partial}{\partial x_{11}} f & \cdots & \frac{\partial}{\partial x_{m1}} f \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_{1n}} f & \cdots & \frac{\partial}{\partial x_{mn}} f \\ \end{array} \end{bmatrix} & & f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R} \end{align}\]

Gradient

Gradient Operator is the transport of Derivative Operator. \(\nabla = (\frac{\partial}{\partial x})^\intercal\)

Scalar-by-Vector

It is used in scalar field. Gradient vector is the first derivative of function \(f\)

\[\begin{align} \nabla f(\vec{x}) = \begin{bmatrix} \begin{array} ( \frac{\partial}{\partial x_1} f(\vec{x}) \\ \vdots \\ \frac{\partial}{\partial x_n} f(\vec{x}) \\ \end{array} \end{bmatrix} & & f: \mathbb{R}^{n} \rightarrow \mathbb{R} \end{align}\]

E.g. \(f(\vec{x}) = x_1 + x_2\)

Scalar-by-Matrix

\[\begin{align} \nabla f =\begin{bmatrix} \begin{array} ( \frac{\partial}{\partial x_{11}} f & \cdots & \frac{\partial}{\partial x_{1n}} f \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_{m1}} f & \cdots & \frac{\partial}{\partial x_{mn}} f \\ \end{array} \end{bmatrix} & & f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R} \end{align}\]

Second Derivative

Scalar-by-Vector

It refers Hessian matrix

\[\begin{align} H(f(\vec{x})) = \nabla^2f(\vec{x}) = \nabla\nabla^\intercal f = \begin{bmatrix} \begin{array} ( \frac{\partial^2}{\partial x_1^2} f(\vec{x}) & \cdots & \frac{\partial^2}{\partial x_1 \partial x_n} f(\vec{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_n \partial x_1} f(\vec{x}) & \cdots & \frac{\partial^2}{\partial x_n^2} f(\vec{x}) \\ \end{array} \end{bmatrix} & & f: \mathbb{R}^{n} \rightarrow \mathbb{R} \end{align}\]

Connections

Jacobian matrix can be seen as the outer product of Gradient operator and function vector \(F\)

\[J_F = (\nabla \times F)^\intercal = \left( \begin{bmatrix} \begin{array} (\frac{\partial}{\partial x_1} \\ \vdots \\ \frac{\partial}{\partial x_n} \end{array} \end{bmatrix} \begin{bmatrix} \begin{array} (F_1 & \cdots & F_m \end{array} \end{bmatrix} \right)^\intercal\]

Relationship between Gradient, Jacobian and Hessian:

\[H(f(\vec{x})) = J(\nabla f (\vec{x}) ) ^\intercal\]

Common Objective Derivative

  • \(A\) is a constant \(d \times d\) matrix
  • \(B\) is a constant \(n \times d\) matrix
  • \(\vec{b}\) is a constant \(n\)-dimensional vector independent of the parameter vector \(\vec{w}\)
  • \(C\) is a \(k \times d\) matrix
Objective \(J\) Derivative of \(J\) with respect to \(\vec{w}\)
\(\vec{w}^\intercal A \vec{w}\) \((A + A^\intercal)\vec{w}\)
\(\vec{b}^\intercal V \vec{w}\) \((B^\intercal)\vec{b}\)
\(\lVert B\vec{w} + \vec{b} \rVert^2\) \(2 B^\intercal ( B\vec{w} + \vec{b} )\)
\(f(g(\vec{w}))\) \(f'(g(\vec{w})) \nabla_{\vec{w}} g(\vec{w})\)
\(f(\vec{w} \cdot \vec{a} )\) \(f'(\vec{w} \cdot \vec{a})\vec{a}\)

Convex

If \(f\) is convex, then every local minimizer \(\vec{w}\) is a global minimizer.

Convex Set

A set \(S\) is convex, if for every pair of points \(w_1, w_2 \in S\), the point \(\lambda w_1 + (1-\lambda) w_2\) must also be in \(S\) for all \(\lambda \in (0, 1)\)

Convex function

A convex function \(F(w)\) is defined as a function with a convex domain that satisfies the following condition for any \(\lambda \in (0, 1)\):

\[F(\lambda w_1 + (1-\lambda) w_2) \leq \lambda F(w_1) + (1-\lambda)F(w_2)\]

Properties

  • A linear function of the vector \(\vec{w}\) is always convex.
  • The sum of convex functions is always convex.
  • The maximum of convex functions is convex.
  • The square of a nonnegative convex function is convex.
  • If \(F(\cdot)\) is a convex function with a single argument and \(G(\vec{w})\) is a linear function with a scalar output, then \(F(G(\vec{w}))\) is convex.

First-Derivative Characterization of Convexity

A differentiable function \(F(\vec{w})\) is a convex function iff it satisfies the following for any pair \(\vec{w}_0\) and \(\vec{w}\):

\[F(\vec{w}) \ge F(\vec{w}_0) + [\nabla F(\vec{w}_0)](\vec{w} - \vec{w}_0)\]

Second-Derivative Characterization of Convexity

The twice differentiable function \(F(\vec{w})\) is convex, iff it has a positive semidefinite Hessian at every value of the parameter \(\vec{w}\) in the domain of \(F(\cdot)\)

Unconstrained Optimization

It refers to local minimizer and global minimizer.

FONC & SONC

First Order Necessary Condition

If \(\vec{x}\) is a local minimizer of \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) and \(f\) is continuously differentiable in an open neighborhood of \(\vec{x}\), then

\[\nabla f(\vec{x}) = 0\]

for 1-dimension scalar: \(\frac{d}{dx} f(x_0) = 0\)

Second Order necessary Condition

Every station point \(\vec{x_0}\) with increasing function values around it is a local minimizer.

If \(x\) is a local minimizer of \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) and \(\nabla^2 f\) is continuous in an open neighborhood of \(\vec{x}\), then

\[\begin{cases} \nabla f(\vec{x}) = 0 \\ \nabla^{2} f(\vec{x}) \text{ is positive semidefine} \end{cases}\]

A matrix \(A \in \mathbb{R}^{d \times d}\) is positive semidefinite if

\[\vec{x}^\intercal A \vec{x} \ge 0 \text{ for all } \vec{x} \in \mathbb{R}^d\]

Positive semidefinite is deduced from second-order, multivariate Taylor expansion.

Here, in station point \(\vec{x}_0\) along the direction \(\vec{v}\) and a small radius \(\epsilon \gt 0\):

\[f(\vec{x}_0+\epsilon \vec{v}) \approx f(\vec{x}_0) + \epsilon \vec{v}^\intercal[ \nabla f(\vec{x}_0) ] + \frac{\epsilon^2}{2} [\vec{v}^\intercal H_f \vec{v}]\]

Since \(\vec{x}_0\) is the station point, it means \(\nabla f(\vec{x}_0) = 0\)

Therefore, whether \(f(\vec{x}_0+\epsilon \vec{v})\) is less than \(f(\vec{x}_0)\) depends on the sign of \(\vec{v}^\intercal H_f \vec{v}\)

for 1-dimension scalar:

\[\begin{cases} \frac{d}{dx} f(x_0) & = 0 \\ \frac{d^2}{dx^2} f(x_0) & \ge 0 \end{cases}\]

Numerical Optimization

function Optimizer(f)
	x[0] <- initialization
	for t in {1, ..., t_max - 1} do
		x[t+1] <- update(x[t], f)
	end
	return x[t_max]
end

Different numerical optimization methods refer to different update strategies.

It is maybe efficient to find a local minimizer. But it is hard to find the global minimizer for a non-convex objective function.

Gradient Descent

\(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) is the cost function, \(\alpha\) is the step size.

\[\vec{w}_{t+1} \Leftarrow \vec{w}_{t} - \alpha \nabla f (\vec{w}_{t})\]

We can numerically calculate \(\nabla f (\vec{w}_{t})\) as following, where \(\Delta\) is a tiny increase on \(w_i\).

\[\frac{\partial}{\partial w_{i}} f (\vec{w}) \approx \frac{ f(w_1, \cdots, w_i + \Delta, \cdots, w_d) - f(w_1, \cdots, w_i, \cdots, w_d) }{\Delta}\]

Learning Rate Decay and Bold Driver

Instead of the fixed learning rate \(\alpha\) which takes too long to reach the minimizer as it is small or oscillates around the point for a long time as it is still large.

Therefore, Allowing the learning rate to decay over time can naturally achieve the desired learning-rate adjustment to avoid both challenges.

\[\vec{w}_{t+1} \Leftarrow \vec{w}_{t} - \alpha_t \nabla f (\vec{w}_{t})\]

The time \(t\) is typically measured in terms of the number of cycles over all training point.

Exponential Decay
\[\alpha_t = \alpha_0 \space exp(-k \cdot t)\]

The parameter \(k\) controls the rate of the decay.

Inverse Decay
\[\alpha_t = \frac{\alpha_0}{1 + k \cdot t}\]

The parameter \(k\) controls the rate of the decay.

Coordinate Descent

Coordinate descent is a method that optimizes the objective function one variable at a time.

Each time, we reach the local minimization toward a single direction. And change direction next time.

\[\begin{gather} (w_1^{(i)}, \cdots, w_m^{(i)}, \cdots, w_n^{(i)}) \Leftarrow (w_1^{(i)}, \cdots, w_m^{(i+1)}, \cdots, w_n^{(i)}) \\ \frac{\partial}{\partial w_m} f(\vec{w}) = 0 \end{gather}\]

We have two direction change strategies.

  • Greedy: choose the direction with maximal descent
  • Cyclic: choose the direction from left to right

Basically, when we talk about coordinate descent, we use cyclic coordinate descent.

Constrained Optimization

Many ML applications involve two simple types of constraints:

  • Linear and convex constraints: \(f(\vec{x}) \le b\)
  • Norm constraints: \(\lVert \vec{w} \rVert = 1\)

Lagrangian Relaxation

Considering a minimization problem:

\[\begin{gather} & P = \min F(\vec{w}) \\ s.t. & f_i( \vec{w} ) \le 0 & 1 \le i \le m \end{gather}\]

It is called the primary problem. It is useful if the functions \(F(\vec{w})\) and each \(f_i(\vec{w})\) are convex.

The Lagrangian relaxation is defined with the use of nonnegative Lagrangian multipliers

\[\begin{gather} \vec\lambda = [\lambda_1 \dots, \lambda_m ]^\intercal \\ L(\vec{\lambda}) = \min_{\vec{w}} (F(\vec{w}) + \sum_{i=1}^{m} \lambda_i f_i(\vec{w})) \end{gather}\] \[\text{proof: } L(\vec{\lambda}) \le \min_{\vec{w}} F(\vec{w}) = P\]

Lagrangian Duality

the value of \(L(\vec{\lambda})\) for any nonnegative vector \(\vec{\lambda}\) is always no larger than the optimal solution to the primal.

One can tighten this bound by maximizing \(L(\vec{\lambda})\) over all nonnegative \(\vec{\lambda}\) and formulating the dual problem with objective function \(D\).

\[\begin{align} D &= \max_{\vec\lambda \ge 0} L(\vec{\lambda}) \\ &= \max_{\vec\lambda \ge 0} \min_{\vec{w}} (F(\vec{w}) + \sum_{i=1}^{m} \lambda_i f_i(\vec{w})) \end{align}\]

We summarize the relationship between the primal and the dual as follows:

\[D = L(\vec{\lambda}^*) \le P\]

This result referred to as that of weak duality.

the Lagrangian optimization problem is a minimax problem containing disjoint minimization and maximization variables

The ordering of the minimization and maximization for any minimax optimization problem does matter.

if \(f(\vec w)\) and each \(f_i(\vec w)\) are convex functions. Then the optimal objective function values of the dual problem created using Lagrangian relaxation is almost always the same as that of the primal.

We use the qualification “almost always” because we also need a relatively weak condition referred to as Slater’s condition, which states that at least one strictly feasible point exists satisfying \(f_i(\vec w) < 0\) for each \(i\). For most machine learning problems, these conditions hold by default

General Procedure for Using Duality is to eliminate primary variables generating pure objective function with only \(max\) constraint variables.

As an example, we can reference to SVM