# The Essence of Machine Learning

Machine learning is suitable for problems where:

- A pattern exists
- We can not pin it down mathematically
- We have data on it

First of all, there has to be a pattern. If there isn’t a pattern, the problem is unsolvable, let alone using machine learning to solve it. Second, if a problem can be solve with regular programming (e.g. calculating the $n_{th}$ Fibonacci number), we normally don’t want to solve it with machine learning. It’s not because machine learning cannot perform the task, but because doing it the old way would be more efficient. Last but not least, we need to have data in order to do machine learning. Data to machine learning is like petroleum to industrial revolution. Without data, there will be no learning.

Note: sometimes (well most of the times) we don’t know if there’s a pattern or not, but we can still apply machine learning and measure the error. That’s one way to see if there’s a pattern.

# PAC Learning Diagram

PAC stands for probably approximately correct. It is a machine learning framework proposed by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the “probably” part), the selected function will have low generalization error (the “approximately correct” part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples.

The hypothesis set and learning algorithm together, is called a **Learning Model**

# Hoeffding’s Inequality

In order to discuss the learnability problem, we need to introduce the concept of Hoeffding’s Inequality first. Consider flipping a bias coin with the frequency of head being an unknown value $\mu$. We flip the coin $N$ times. The frequency of head in our big samples is $\nu$. Does the sample frequency $\nu$ say anything about the actualy frequency $\mu$?

By the definition of *Hoeffding’s Inequality*, $\nu$ is probably close to $\mu$ within an error $\epsilon$ if we have a large enough set of samples.

Formally, $$ \begin{align} \mathbb{P}[|\nu - \mu| > \epsilon] \le 2e^{-2\epsilon^{2}N} \end{align} $$

For example, if we have a bias coin with $\mu = 0.6$, and flip the coin 10000 times. Hoeffding’s Inequality says that the chance of we getting more than 6100 heads or less than 5900 heads ($\epsilon = 0.01$) is no more than 0.2707

# Apply Hoeffding’s Inequality to The Learning Problem

We will break down this section into three sub-sections:

- for a single hypothesis (Testing)
- for a hypothesis set (Training)
- modified Learning Diagram (with Hoeffding’s Inequality)

## For a single hypothesis (Testing)

Imagine we are picking $N$ inputs $x_i$ from an input space $\mathcal{X}$ wiht probability distribution $P$. We call the sample set $\mathcal{S}$ .For a hypothesis $h$ from the set $\mathcal{H}$, we predict the output $h(x_{i})$ and compare it with $y_{i} = f(x_{i})$. There are two cases:

- the hypothesis get it right: $h(x_{i}) = f(x_{i})$
- the hypothesis get it wrong: $h(x_{i}) \ne f(x_{i})$

Now the case where the hypothesis get it right is just like the case where we get a head in flipping a coin. For all the samples $x_i$, we calculate the frequency of $h(x_{i}) = f(x_{i})$ and call it $\nu_h$; we call the unkown frequency $\mu_h$, that is,

- in-sample accuracy: $ \nu_h = \mathbb{P}[ h(x_{i}) = f(x_{i}) | x_i \in \mathcal{S}] $
- out-of-sample accuracy: $ \mu_h = \mathbb{P}[ h(x_{i}) = f(x_{i}) | x_i \in \mathcal{X}] $

And applying Hoeffding’s Inequality,

$$ \begin{align} \mathbb{P}[|\nu_h - \mu_h| > \epsilon] \le 2e^{-2\epsilon^{2}N} \end{align} $$

Looking at this equation, on the left hand side, it is the probability of our in-sample error doesn’t track the out-of-sample error (a bad event); On the right hand side, it says the probability is less than or equal to a quantity. This quantity is related to the number of samples.

## For a hypothesis set (Training)

The statement above applies to a single hypothesis. When we train a model, we first try a hypothesis to see if it fits our samples, and then another, and then another… In essence, we try some hypothesises untill one hypothesis $g$ turns up that satisfies our pre-determine criteria. This increase the probability of the bad event happening. Using the notations from before,

$$ \nu_g = \mathbb{P}[ g(x_{i}) = f(x_{i}) | x_i \in \mathcal{S}] $$ $$ \mu_g = \mathbb{P}[ g(x_{i}) = f(x_{i}) | x_i \in \mathcal{X}] $$

Then $$ \begin{align} \mathbb{P}[|\nu_g - \mu_g| > \epsilon] \le 2Me^{-2\epsilon^{2}N} \end{align} $$where $M$ is the size of the hypothesis set.

## Modified Learning Diagram (with Hoeffding’s Inequality)

In this section, we seek to apply Hoeffding’s inequality to our PAC learning framework. But looking at the learning diagram from previous section, some pieces are missing, namely the sample space $\mathcal{X}$ the probability distribution of our samples $P$. We include them here for completeness purpose.

# Error Messure

We need a error measure to evaluate the hypotheses. This error measure should be specific to the user case. We make the formal definition of error in this section.

Most of the time, instead of measuring the accuracy, we are more insterested in the error. For a binary classification task, the binary error on a specific sample $x$ is given by:

Binary Error: $$ e(h(x), f(x)) = \lfloor h(x) \ne f(x) \rfloor $$

And the in-sample and out-of-sample error on the entire sample set are :

In-Sample Error (for $x_i \in \mathcal{S}$): $$ \begin{align} E_{in}(h) = \frac{1}{N} \sum\limits_{i=1}^{N} e(h(x_i), f(x_i)) \end{align} $$

Out-of-Sample Error: $$ \begin{align} E_{out}(h) = \mathbb{E}_x [e(h(x), f(x))] \end{align} $$

# Noisy Target

There are cases where the target function is not actually a function. For example, credit approval. Two candidates could have the same salary, indentical time of residence, ect, yet one got approved while the other one got rejected. This suggests that our target function is not really a function. Because if it were a function, $f: \mathcal{X} \to \mathcal{Y}$, for any given input $x \in \mathcal{X}$, it should have and only have one output $y \in \mathcal{Y}$.

To accommodate situations like this, instead of $y=f(x)$, we use target distribution: $$ P(y|x) $$

And any sample pair $(x,y)$ is now generated by the joint distribution: $$ P(y|x)P(x) $$

# Final Learning Diagram

Including error measure and noisy target into our learning diagram, we get its final form:

And Hoeffding’s inequality becomes: $$ \begin{align} \mathbb{P}[|E_{in} - E_{out}| > \epsilon] \le 2Me^{-2\epsilon^{2}N} \end{align} $$ where $M$ is the size of the hypothesis set.

Note: Learning normally comes down to two points:

- Can we make $E_{in}$ closed enough to $E_{out}$?
- Can we make $E_{in}$ small enough (close to zero)?

# VC Theory

## Hypotheses vs Dichotomies

Let’s take another look on the Hoeffding’s Inequality in the training case: $$ \mathbb{P}[|E_{in} - E_{out}| > \epsilon] \le 2Me^{-2\epsilon^{2}N} $$ where $M$ is the size of the hypothesis set.

But even for a simple hypothesis set like perceptron, the size of the set ($M$) is infinite. We would like to replace $M$ with something more practical.

For a finite sample set $\mathcal{S} = \lbrace x_1, x_2, …, x_n \rbrace$,

- A hypothesis $h: \mathcal{X} \to \lbrace -1, +1 \rbrace$
- A dichotomy $h: \mathcal{S} \to \lbrace -1, +1 \rbrace$
- Number of hypotheses $|\mathcal{H}|$ can be infinite
- Number of dichotomies $|\mathcal{H}(x_1, x_2, …, x_n)|$ is at most $2^N$

## Growth Function

The *growth function* ($m_{\mathcal{H}}(N)$) counts the *most* dichotomies on any $N$ points for a given hypothesis set. $$ m_{\mathcal{H}}(N) = \max_{x_1,…,x_N \in \mathcal{X}} |\mathcal{H}(x_1,…,x_N)| $$

If no data set of size $k$ can be ‘shattered’ by a hypothesis set $\mathcal{H}$, then $k$ is a *break point* for $\mathcal{H}$.

Properties of break points:

- For a hypothesis set $\mathcal{H}$, if $k$ is a break point, $k+1$ is also a break point
- No break point $\implies m_{\mathcal{H}}(N) = 2^N$ for all $N$
- Any break point $\implies m_{\mathcal{H}}(N)$ is polynomial in $N$
- The existance of a break point suggests learning is feasible
- The value of the break point suggests the samples needed

It turns out that we can use the growth function to replace $M$ with some minor modifications to Hoeffding’s Inequality. $$ \begin{align} \mathbb{P}[|E_{in}(g)-E_{out}(g)| > \epsilon] \le 4 m_{\mathcal{H}}(2N) e^{- \frac{1}{8} \epsilon^2 N} \end{align} $$ And this is called the Vapnik-Chervonenkis Inequality.

## VC Dimension

The VC dimension of a hypothesis set $\mathcal{H}$, denoted by $d_{VC}(\mathcal{H})$, is the largest value of $N$ for which $m_{\mathcal{H}}(N) = 2^N$. Or, in plain English, VC dimension is the most points $\mathcal{H}$ can shatter. Refering to the Learning Diagram, VC dimension has a nice property that it only relates to $\mathcal{H}$ and the samples, which makes it ideal for evaluating a hypothesis set before training.

Examples of VC dimensions:

hypothesis set | VC dimensions |
---|---|

Positive Rays | 1 |

Positive Intervals | 2 |

d-dimensional perceptron | d + 1 |

In essence, $d_{VC}(\mathcal{H})$ measures the effective number of parameters.

## Upper bound on $m_{\mathcal{h}}(N)$

It can be proved that for a given hypothesis set $\mathcal{H}$ $$ \begin{align} m_{\mathcal{h}}(N) \le \sum_{i=0}^{d_{VC}} {N\choose k} \end{align} $$ Putting this result and the Vapnik-Chervonenkis Inequality together, we can conclude that

$d_{VC}(\mathcal{H})$ is finite $\implies g \in \mathcal{H}$ will generalize