Classification

Background

True distribution

Assumption

Given \(N\) training examples \((\vec{x}_1, y_1), \cdots, (\vec{x}_N, y_N)\), where \(\vec{x}\) is the feature vector and \(y\) is the label, assume the examples are independent and identically distributed (i.i.d.):

\[(\vec{x}_i, y_i) \sim p^*\]

Classification Problem

Classifier

A classifier is a function \(h\) mapping a feature vector \(\vec{x}\) to a label \(\hat{y}\).

For a sample \(\vec{x}_i\), the classifier returns the predicted label \(\hat{y}_i = h(\vec{x}_i)\).

Goal

A learning algorithm takes a dataset and returns a classifier \(h\).

True Error

The true classification error of \(h\) is

\[\begin{gather} \mathbb{P}[Y \ne h(X)] = \mathbb{E}_{X, Y \sim p^*} [ \mathbb{1}(Y \ne h(X)) ] \\\\ \mathbb{1}(A) = \begin{cases} 1 & A \text{ is true} \\ 0 & A \text{ is false} \end{cases} \end{gather}\]

where \(\mathbb{1}\) is the indicator function.

The true error corresponds to the average fraction of misclassifications on a very large test set.

Bayes optimal

A classifier \(h^*\) that minimizes the true classification error is called Bayes optimal.

Let \(p^*(X, Y)\) be the unknown data-generating distribution. The classifier

\[h^*(\vec{x}) = \arg\max_{y} p^*(y|\vec{x})\]

is Bayes optimal. This requires knowing the true posterior \(p^*(y \vert \vec{x})\) in advance and then choosing the class \(y\) with the highest posterior probability.
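
As a concrete illustration, here is a minimal sketch (with a made-up discrete joint distribution, not from the notes) that computes the Bayes-optimal classifier from a table of \(p^*(x, y)\) and checks its true error by Monte Carlo:

import numpy as np

# Hypothetical joint distribution p*(x, y) over x in {0, 1, 2} and y in {0, 1};
# rows index x, columns index y, entries sum to 1.
p_joint = np.array([[0.30, 0.05],
                    [0.10, 0.20],
                    [0.05, 0.30]])

# Bayes-optimal classifier: h*(x) = argmax_y p*(y | x) = argmax_y p*(x, y)
h_star = p_joint.argmax(axis=1)

# Bayes error: for each x, the probability mass of the labels h* does not pick
bayes_error = 1.0 - p_joint.max(axis=1).sum()

# Monte Carlo check: sample (x, y) ~ p* and count misclassifications
rng = np.random.default_rng(0)
flat = rng.choice(p_joint.size, size=100_000, p=p_joint.ravel())
xs, ys = np.unravel_index(flat, p_joint.shape)
print(bayes_error, np.mean(h_star[xs] != ys))  # the two numbers roughly agree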

KNN

Label a new point according to its closest training points.

  • Label a new point according to the majority vote of its \(k\) nearest neighbors
  • More stable decisions; averaging over neighbors smooths out noise in the data
  • For binary classification, \(k\) should be odd to avoid ties
  • Theoretically sound

if

  1. number of training points \(N \rightarrow \infty\)
  2. number of neighbors \(k \rightarrow \infty\)
  3. \(N\) grows faster than \(k\), i.e. \(\frac{k}{N} \rightarrow 0\)

Then KNN converges towards the Bayes optimal classifier

KNN needs a distance metric

Typically, the Euclidean norm is used.
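
A minimal sketch of the idea (plain NumPy, Euclidean distance; function and variable names are my own):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote of its k nearest neighbors."""
    # Euclidean distances from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny usage example with made-up data
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # -> 1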

Naive Bayes

Bayes Rule

\[p(A|B) = \frac{p(B|A)p(A)}{p(B)}\]

For a feature vector \(\vec{x}\) and label \(y\), this becomes

\[p(y|\vec{x}) = \frac{p(\vec{x}|y)p(y)}{p(\vec{x})} = \frac{p(\vec{x}|y)p(y)}{\sum_{y} p(\vec{x}|y)p(y)}\]

Since \(p(y \vert \vec{x}) \propto p(\vec{x} \vert y ) p(y)\), the denominator \(p(\vec{x})\) does not need to be computed for classification.

Naive Bayes Assumption

Assume the features are conditionally independent given the label. Then we have

\[p(\vec{x}|y) = \prod_{d=1}^{D} p(x_d|y)\]

Naive Bayes Classifier

Let \(p(y)\) be an estimate of the true class proportions \(p^*(y)\), and \(p(x_d \vert y)\) estimates of the true class-conditional distributions \(p^*(x_d \vert y)\), for each feature \(X_d,\ d = 1, \dots, D\).

The Naive Bayes Classifier is given as

\[h_{NB}(\vec{x}) = \arg\max_{y}\ p(y) \prod_{d=1}^{D} p(x_d \vert y)\]

Laplace smoothing

Suppose feature \(x_d\) can take \(K\) values, \(x_d \in \{1, \cdots, K\}\). The Laplace-smoothed estimate is

\[p(x_d = f | y = c) = \frac{N(x_d = f, y = c) + \alpha}{N( y = c ) + \alpha K}\]

where \(N(x_d = f, y = c)\) is the number of data items with \(x_d = f\) and \(y = c\) in the whole dataset, and \(N(y = c)\) is the number of data items with label \(c\).

By the conditional-independence (Naive Bayes) assumption, we have

\[p(\vec{x} = \vec{f} | y = c) = \prod_{d=1}^{D} p(x_d = f_d | y = c)\]

Multiplying by the class prior (the numerator of Bayes' rule) gives the joint probability

\[p(\vec{x} = \vec{f}, y = c) = p(y = c) \prod_{d=1}^{D} p(x_d = f_d | y = c)\]

The general idea behind smoothing is that the count \(N(x_d = f, y = c)\) may be \(0\) for a feature value that never occurs with class \(c\) in the dataset, which would force the whole product, and hence the estimated posterior, to zero.

Therefore, we add the smoothing parameter \(\alpha\) to the numerator and \(\alpha K\) to the denominator, which keeps every estimate greater than \(0\) while the conditional distribution still sums to one.
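
For example (with made-up counts): take \(K = 4\), \(\alpha = 1\), \(N(x_d = f, y = c) = 3\) and \(N(y = c) = 10\). Then

\[p(x_d = f | y = c) = \frac{3 + 1}{10 + 1 \cdot 4} = \frac{4}{14} \approx 0.29\]

while a feature value never observed with class \(c\) gets probability \(\frac{0 + 1}{14}\) instead of \(0\).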

log-domain

Since each probability lies in \([0, 1]\), the product \(p(\vec{x} = \vec{f}, y = c)\) over many features quickly becomes too small to represent in floating point.

Therefore, instead of storing the original values, we store and compute them in the log-domain. Then we have

\[\log p(\vec{x} = \vec{f}, y = c) = \log p(y = c) + \sum_{d = 1}^{D} \log p(x_d = f_d | y = c)\]

Generally, the base is \(e\).
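
A minimal sketch of a categorical Naive Bayes classifier with Laplace smoothing and log-domain scoring, assuming features are integer-coded in \(\{0, \dots, K-1\}\) (plain NumPy; function names are my own):

import numpy as np

def train_nb(X, y, K, alpha=1.0):
    """Estimate log p(y=c) and log p(x_d=f | y=c) with Laplace smoothing.
    X: (N, D) integer-coded features in {0..K-1}, y: (N,) integer labels."""
    classes = np.unique(y)
    N, D = X.shape
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # cond[c, d, f] = log p(x_d = f | y = c)
    cond = np.zeros((len(classes), D, K))
    for ci, c in enumerate(classes):
        Xc = X[y == c]
        for d in range(D):
            counts = np.bincount(Xc[:, d], minlength=K)
            cond[ci, d] = np.log((counts + alpha) / (len(Xc) + alpha * K))
    return classes, log_prior, cond

def predict_nb(x, classes, log_prior, cond):
    """Return argmax_c [ log p(y=c) + sum_d log p(x_d | y=c) ]."""
    D = len(x)
    scores = log_prior + cond[:, np.arange(D), x].sum(axis=1)
    return classes[np.argmax(scores)]

# Made-up data: 2 binary features (K=2), labels 0/1
X = np.array([[0, 1], [0, 0], [1, 1], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1, 1])
model = train_nb(X, y, K=2)
print(predict_nb(np.array([1, 1]), *model))  # most likely label for x = (1, 1)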

Decision Trees

Given a set of features \(X_1, X_2, \dots , X_D\) (discrete or continuous) and a class variable \(Y\), a Decision Tree is a directed binary tree with two types of nodes:

  • Decision nodes (internal nodes), each associated with a feature \(X_i\) and a boolean test \(f: X_{i} \mapsto \{true, false\}\)
  • Prediction nodes (leaves), each associated with either
    • a fixed label \(y\)
    • a distribution over labels \(p(Y)\)

Training Methods

CART

L0 <- root node containing all training data
grow(L0)

function grow(L)
	if stopping criterion met at L then
		make L a prediction node (leaf)
		return
	end
	Xd, f <- split(L)                          // best decision for this node
	L.left, L.right <- partition data in L by f(Xd)
	grow(L.left)
	grow(L.right)
end

function split(L)
	for each possible decision (Xd, f) do
		compute cost(L, Xd, f)
	end
	return the (Xd, f) with minimum cost
end

function cost(L, Xd, f)
	L1, L2 <- data in L partitioned by f(Xd)
	N, N1, N2 <- number of data points in L, L1, L2
	return (N1/N)*impurity(L1) + (N2/N)*impurity(L2) - impurity(L)
end

Stopping Criterion

  • Too small an improvement in cost
  • Too few data points in a node

Both criteria are controlled by thresholds.

Impurity Measures

Gini Impurity
\[G = 1 - \sum_{y} p^2(y)\]
Entropy
\[E = - \sum_y p(y) \log p(y)\]
Self-Classification Error
\[C = 1 - \max_y p(y)\]
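
A minimal sketch of the impurity measures and the split cost used above (plain NumPy; names are my own):

import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_y p(y)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_y p(y) log p(y)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def split_cost(y_parent, y_left, y_right, impurity=gini):
    """Weighted child impurity minus parent impurity (lower is better)."""
    n, n1, n2 = len(y_parent), len(y_left), len(y_right)
    return (n1 / n) * impurity(y_left) + (n2 / n) * impurity(y_right) - impurity(y_parent)

# Example: a split that perfectly separates the two classes has the lowest cost
y = np.array([0, 0, 1, 1])
print(split_cost(y, y[:2], y[2:]))  # -0.5 with Gini impurity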

Random Forests

Improve the performance of Decision Trees with the Bagging algorithm, which reduces variance:

  • Make \(K\) bootstraps
  • Train \(K\) classifiers, one for each bootstrap
  • Aggregate the results: averaging, majority vote, etc.

Bootstrap

Given \(N\) samples, generate \(K\) new datasets of size \(N\) by sampling with replacement, which are called bootstraps.

For large \(N\), each bootstrap contains on average only \(1 - \frac{1}{e} \approx 63.21\%\) of the distinct original samples, since the probability that a given sample is never drawn is \((1 - \frac{1}{N})^N \rightarrow \frac{1}{e}\).

originSet = [x1, ..., xn]

for i in 1..k
	for j in 1..n
		// sample with replacement: random(1, n) returns a uniform random index
		bootstrap[i][j] = originSet[random(1, n)]
	end
end
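
A minimal NumPy sketch of bootstrap sampling, which also checks the \(\approx 63\%\) figure empirically (variable names are my own):

import numpy as np

rng = np.random.default_rng(0)
N, K = 10_000, 20

origin = np.arange(N)                      # stand-in for the original samples
# K bootstraps of size N, sampled with replacement
bootstraps = rng.integers(0, N, size=(K, N))

# Fraction of distinct original samples appearing in each bootstrap
fractions = [np.unique(b).size / N for b in bootstraps]
print(np.mean(fractions))                  # roughly 1 - 1/e ≈ 0.632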

SVM

Build a separating hyperplane that maximizes the margin.

A hyperplane can be represented in vector form:

\[\begin{gather} \vec{x}^\intercal \vec{\beta} + \beta_0 = 0 \end{gather}\]

This leads to the following optimization problem:

\[\max_{\vec{\beta}, \beta_0} \min_{i} \frac{y_i(\vec{x}_i^\intercal \vec{\beta} + \beta_0)}{\lVert \vec{\beta} \rVert}\]

That is, maximize the minimum distance from any training point to the hyperplane, with labels \(y_i \in \{-1, +1\}\) so that correctly classified points have positive \(y_i(\vec{x}_i^\intercal \vec{\beta} + \beta_0)\).

But this objective is not convex. We make the problem tractable by fixing the scale of \(\vec{\beta}\) and \(\beta_0\), which introduces the margin hyperplanes:

\[\begin{gather} \vec{x}^\intercal \vec{\beta} + \beta_0 = 1 \\ \vec{x}^\intercal \vec{\beta} + \beta_0 = -1 \\ \end{gather}\]

Therefore, the distance between the separating hyperplane and each margin hyperplane is \(\frac{1}{\lVert \vec{\beta} \rVert}\): a point on a margin hyperplane satisfies \(\vert \vec{x}^\intercal \vec{\beta} + \beta_0 \vert = 1\), and its distance to the separating hyperplane is \(\frac{\vert \vec{x}^\intercal \vec{\beta} + \beta_0 \vert}{\lVert \vec{\beta} \rVert}\).

Therefore, the optimization problem becomes

\[\begin{aligned} & \max_{\vec{\beta}, \beta_0} \frac{1}{\lVert \vec{\beta} \rVert} \\ \text{s.t. } & y_i(\vec{x}_i^\intercal \vec{\beta} + \beta_0) \ge 1 \end{aligned}\]

This can then be converted into a convex problem; adding slack variables gives the soft-margin formulation:

\[\begin{aligned} & \min_{\vec{\beta}, \beta_0, \vec\xi} \frac{1}{2} \lVert \vec{\beta} \rVert^2 + C \sum_{i} \xi_i \\ \text{s.t. } & y_i(\vec{x}_i^\intercal \vec{\beta} + \beta_0) \ge 1 - \xi_i \\ & \xi_i \ge 0 \end{aligned}\]

We square \(\lVert \vec{\beta} \rVert\) to make the objective differentiable everywhere.

We introduce the slack variables \(\xi_i\) to tolerate non-separable data.

Then, we can use Lagrangian Relaxation and Lagrangian Duality to solve it.

\[\begin{aligned} L_P(\vec\lambda, \vec\mu) &= \min_{\vec{\beta}, \beta_0, \vec\xi} \frac{1}{2} \lVert \vec{\beta} \rVert^2 + C \sum_{i} \xi_i - \sum_{i} \lambda_i \left( y_i(\vec{x}_i^\intercal \vec{\beta} + \beta_0) - (1 - \xi_i) \right) - \sum_{i} \mu_i \xi_i \\ L_D^* &= \max_{\vec{\lambda} \ge 0,\ \vec\mu \ge 0} L_P \end{aligned}\]

Use the stationarity conditions (gradient of \(L_P\) with respect to the primal variables set to zero) to eliminate the primal variables:

\[\begin{gather} \nabla_{\vec{\beta}, \beta_0, \xi_i} L_P = 0 \Rightarrow \begin{cases} \vec{\beta} = \sum_{i} \lambda_i y_i \vec{x}_i \\ \sum_{i} \lambda_i y_i = 0 \\ \lambda_i = C - \mu_i \end{cases} \\ \\ \Rightarrow L_D = \sum_{i} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \vec{x}_i^\intercal \vec{x}_j \end{gather}\]

Then we can apply the kernel trick.

Kernel Trick

In \(L_D\), the data enter only through the inner products \(\vec{x}_i^\intercal \vec{x}_j\). We can treat this inner product as a kernel:

\[K_{ij} = \vec{x}_i^\intercal \vec{x}_j\]

We can make the decision boundary non-linear by first applying a feature transform \(h\): \(f(\vec{x}) = h(\vec{x})^\intercal \vec{\beta} + \beta_0\).

Then the kernel is changed in the same way:

\[K_{ij} = h(\vec{x}_i)^\intercal h(\vec{x}_j)\]
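
A minimal sketch of this identity (plain NumPy; the degree-2 polynomial kernel is just one illustrative choice of \(h\), not taken from the notes, and the function names are my own):

import numpy as np

def poly2_features(X):
    """Explicit degree-2 feature map h(x) for 2-D inputs x = (x1, x2):
    h(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)."""
    x1, x2 = X[:, 0], X[:, 1]
    s = np.sqrt(2.0)
    return np.stack([x1**2, x2**2, s*x1*x2, s*x1, s*x2, np.ones_like(x1)], axis=1)

X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])

# K_ij = h(x_i)^T h(x_j), computed via the explicit transform ...
K_explicit = poly2_features(X) @ poly2_features(X).T
# ... equals the degree-2 polynomial kernel (x_i^T x_j + 1)^2, evaluated
# without ever forming h(x): this is the kernel trick.
K_kernel = (X @ X.T + 1.0) ** 2

print(np.allclose(K_explicit, K_kernel))  # True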

In a sufficiently high-dimensional feature space, the data will often be separable. The objective then becomes

\[\begin{aligned} & \max_{\vec{\lambda}} L_D = \sum_{i} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j K(\vec{x}_i, \vec{x}_j) \\ \text{s.t. } & \sum_{i} \lambda_i y_i = 0 \\ & 0 \le \lambda_i \le C \end{aligned}\]
  1. apply non-linear feature transforms
  2. transform the learning problem into a higher-dimensional feature space

The role of the parameter \(C\) is clearer in an enlarged feature space, since perfect separation is often achievable there.

A large value of \(C\) will discourage any positive \(\xi_i\) and lead to an overfit, wiggly boundary in the original feature space.

A small value of \(C\) will encourage a small value of \(\lVert \beta \rVert\), which in turn causes \(f(x)\) and hence the boundary to be smoother.
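
As one way to see this trade-off in practice, here is a hedged sketch using scikit-learn (which the notes do not mention; the dataset and values of \(C\) are arbitrary):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Noisy, non-linearly-separable toy data
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    # Training accuracy tends to rise with C (less slack, wigglier boundary),
    # which does not necessarily mean better generalization.
    print(C, clf.score(X, y))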

Evaluation

  • Split the data into a training set and a test set
    • training set: train the model
    • test set: estimate the true error
  • Reserve a validation set to tune hyperparameters
    • Typical split: 60/20/20 or 70/15/15 (see the sketch below)
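
A minimal sketch of such a split with NumPy (a 60/20/20 split; function name and proportions are my own assumptions):

import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly split (X, y) into train/validation/test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# Usage with made-up data
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 60 20 20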