VC Theory

Tags
Machine Learning
Published
April 10, 2018
Author
yanbc

The Essence of Machine Learning

Machine learning is suitable for problems where:
  1. A pattern exists
  2. We cannot pin it down mathematically
  3. We have data on it
First of all, there has to be a pattern. If there isn't a pattern, the problem is unsolvable, let alone with machine learning. Second, if a problem can be solved with regular programming (e.g. calculating the $n$-th Fibonacci number), we normally don't want to solve it with machine learning. It's not that machine learning cannot perform the task, but that doing it the old way is more efficient. Last but not least, we need data in order to do machine learning. Data is to machine learning what petroleum was to the industrial revolution: without data, there will be no learning.
Note that sometimes (well, most of the time) we don't know whether there's a pattern, but we can still apply machine learning and measure the error. That's one way to find out.

PAC Learning Diagram

PAC stands for probably approximately correct. It is a machine learning framework proposed by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the "probably" part), the selected function will have low generalization error (the "approximately correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples.
The hypothesis set and the learning algorithm together are called a Learning Model.
[Figure: the PAC learning diagram]

Hoeffding's Inequality

In order to discuss the learnability problem, we first need to introduce Hoeffding's Inequality. Consider flipping a biased coin whose frequency of heads is an unknown value $\mu$. We flip the coin $N$ times and observe a frequency of heads $\nu$ in our sample. Does the sample frequency $\nu$ say anything about the actual frequency $\mu$?
Hoeffding's Inequality says that $\nu$ is probably close to $\mu$ within an error $\epsilon$ if we have a large enough sample. Formally,
\begin{align} \mathbb{P}[|\nu - \mu| > \epsilon] \le 2e^{-2\epsilon^{2}N} \end{align}
For example, suppose we have a biased coin with $\mu = 0.6$ and flip it 10000 times. Hoeffding's Inequality says that the chance of getting more than 6100 heads or fewer than 5900 heads ($\epsilon = 0.01$) is no more than 0.2707.
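As a quick sanity check, the bound in the example above can be computed directly. A minimal sketch in Python (the function name is just for illustration):

```python
import math

def hoeffding_bound(eps: float, n: int) -> float:
    """Upper bound on P[|nu - mu| > eps] after n coin flips: 2*exp(-2*eps^2*n)."""
    return 2 * math.exp(-2 * eps ** 2 * n)

# The coin example from the text: N = 10000, eps = 0.01.
bound = hoeffding_bound(0.01, 10000)
print(round(bound, 4))  # 0.2707
```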

Apply Hoeffding's Inequality to The Learning Problem

We will break down this section into three sub-sections:
  • The learning problem for a single hypothesis (Testing)
  • The learning problem for a hypothesis set (Training)
  • The modified Learning Diagram (with Hoeffding's Inequality)

For a single hypothesis (Testing)

Imagine we are picking $N$ inputs $x_i$ from an input space $\mathcal{X}$ with probability distribution $P$. We call the sample set $\mathcal{S}$. For a hypothesis $h$ from the set $\mathcal{H}$, we predict the output $h(x_i)$ and compare it with $y_i = f(x_i)$. There are two cases:
  1. The hypothesis gets it right: $h(x_i) = f(x_i)$
  2. The hypothesis gets it wrong: $h(x_i) \ne f(x_i)$
Now the case where the hypothesis gets it right is just like getting a head when flipping a coin. Over all the samples, we calculate the frequency of $h(x_i) = f(x_i)$ and call it $\nu_h$; we call the unknown out-of-sample frequency $\mu_h$. That is,
  • In-sample accuracy: $\nu_h = \mathbb{P}[h(x_i) = f(x_i) \mid x_i \in \mathcal{S}]$
  • Out-of-sample accuracy: $\mu_h = \mathbb{P}[h(x_i) = f(x_i) \mid x_i \in \mathcal{X}]$
Applying Hoeffding's Inequality,
\begin{align} \mathbb{P}[|\nu_h - \mu_h| > \epsilon] \le 2e^{-2\epsilon^{2}N} \end{align}
On the left-hand side of this equation is the probability that our in-sample accuracy fails to track the out-of-sample accuracy (a bad event); the right-hand side says this probability is bounded by a quantity that shrinks as the number of samples grows.
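One practical reading of the bound: given a tolerance $\epsilon$ and a maximum acceptable probability $\delta$ for the bad event, we can invert $2e^{-2\epsilon^2 N} \le \delta$ to estimate how many samples are needed. A small sketch (the function name and the choice $\epsilon = \delta = 0.05$ are just for illustration):

```python
import math

def samples_needed(eps: float, delta: float) -> int:
    """Smallest N with 2*exp(-2*eps^2*N) <= delta, from inverting the bound:
    N >= ln(2/delta) / (2*eps^2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(samples_needed(0.05, 0.05))  # 738
```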

For a hypothesis set (Training)

The statement above applies to a single hypothesis. When we train a model, we try one hypothesis to see if it fits our samples, then another, and then another... In essence, we try hypotheses until one hypothesis $g$ turns up that satisfies our pre-determined criteria. This increases the probability of the bad event happening.
Using the notations from before,
$\nu_g = \mathbb{P}[g(x_i) = f(x_i) \mid x_i \in \mathcal{S}]$
$\mu_g = \mathbb{P}[g(x_i) = f(x_i) \mid x_i \in \mathcal{X}]$
\begin{align} \mathbb{P}[|\nu_g - \mu_g| > \epsilon] \le 2Me^{-2\epsilon^{2}N} \end{align}
where $M$ is the size of the hypothesis set. The factor $M$ comes from the union bound: the probability that any one of the $M$ hypotheses deviates is at most the sum of the $M$ individual probabilities.
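The effect of trying many hypotheses can be illustrated with a small Monte Carlo simulation, modeling each hypothesis as an independent fair coin (an idealized assumption; real hypotheses are usually correlated, which only makes the union bound looser):

```python
import random

def bad_event_rate(m: int, n: int, eps: float, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of P[some hypothesis among m has |nu - mu| > eps].
    Each 'hypothesis' is modeled as an independent fair coin (mu = 0.5)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        for _ in range(m):
            nu = sum(rng.random() < 0.5 for _ in range(n)) / n
            if abs(nu - 0.5) > eps:
                bad += 1
                break  # one deviating hypothesis is enough for a bad trial
    return bad / trials

r1 = bad_event_rate(1, 100, 0.1)
r10 = bad_event_rate(10, 100, 0.1)
print(r1, r10)  # the rate grows with m
```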

Modified Learning Diagram (with Hoeffding's Inequality)

In this section, we apply Hoeffding's Inequality to our PAC learning framework. But looking at the learning diagram from the previous section, some pieces are missing, namely the input space $\mathcal{X}$ and the probability distribution $P$ of our samples. We include them here for completeness.
[Figure: the learning diagram with $\mathcal{X}$ and $P$ added]

Error Measure

We need an error measure to evaluate the hypotheses, and it should be specific to the use case. We make the formal definition of error in this section.
Most of the time, instead of measuring the accuracy, we are more interested in the error. For a binary classification task, the binary error on a specific sample xx is given by:
e(h(x),f(x))=h(x)f(x)e(h(x), f(x)) = \lfloor h(x) \ne f(x) \rfloor
The in-sample and out-of-sample errors are then:
  • In-Sample Error (for $x_i \in \mathcal{S}$)
\begin{align} E_{in}(h) = \frac{1}{N} \sum\limits_{i=1}^{N} e(h(x_i), f(x_i)) \end{align}
  • Out-of-Sample Error
\begin{align} E_{out}(h) = \mathbb{E}_x [e(h(x), f(x))] \end{align}
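As a concrete illustration, $E_{in}$ is just the average pointwise error over the sample. A minimal sketch with a made-up threshold target and hypothesis (the names `f`, `h`, and the sample points are just for illustration):

```python
def in_sample_error(h, f, samples):
    """E_in: fraction of sample points where hypothesis h disagrees with target f."""
    return sum(h(x) != f(x) for x in samples) / len(samples)

f = lambda x: 1 if x >= 0 else -1      # target: sign of x
h = lambda x: 1 if x >= 0.5 else -1    # hypothesis: a shifted threshold
xs = [-1.0, -0.5, 0.0, 0.25, 0.75, 1.0]
print(in_sample_error(h, f, xs))  # h errs on 0.0 and 0.25 -> 2/6
```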

Noisy Target

There are cases where the target function is not actually a function, credit approval for example. Two candidates could have the same salary, the same length of residency, etc., yet one gets approved while the other gets rejected. This suggests that our target is not really a function: if it were a function $f: \mathcal{X} \to \mathcal{Y}$, then any given input $x \in \mathcal{X}$ would have one and only one output $y \in \mathcal{Y}$.
To accommodate situations like this, instead of $y = f(x)$, we use a target distribution
$P(y|x)$
Any sample pair $(x, y)$ can now be generated from the joint distribution
$P(y|x)P(x)$
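A sketch of what sampling from this joint distribution looks like, with a made-up $P(x)$ (uniform on $[0, 1]$) and a made-up noisy $P(y|x)$ (both choices are purely illustrative):

```python
import random

def sample_pair(rng: random.Random):
    """Draw (x, y) from P(x)P(y|x): x uniform on [0, 1],
    then y = +1 with probability 0.9 when x >= 0.5, else with probability 0.1."""
    x = rng.random()                      # draw x ~ P(x)
    p_pos = 0.9 if x >= 0.5 else 0.1      # P(y = +1 | x)
    y = 1 if rng.random() < p_pos else -1  # draw y ~ P(y|x)
    return x, y

rng = random.Random(0)
data = [sample_pair(rng) for _ in range(5)]
print(data)
```

The same $x$ can now come with different labels $y$ on different draws, which is exactly the noisy-target behavior described above.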

Final Learning Diagram

Including the error measure and the noisy target in our learning diagram, we get its final form.
[Figure: the final learning diagram]
Hoeffding's inequality becomes
\begin{align} \mathbb{P}[|E_{in} - E_{out}| > \epsilon] \le 2Me^{-2\epsilon^{2}N} \end{align}
where $M$ is the size of the hypothesis set. In this framework, a successful machine learning job comes down to two questions:
  1. Can we make $E_{in}$ close enough to $E_{out}$?
  2. Can we make $E_{in}$ small enough (close to zero)?
The first question says we should collect a dataset that represents the target data space. The second corresponds to training a model on the dataset to achieve a low in-sample error.

VC Theory

Hypotheses vs Dichotomies

Let's take another look at Hoeffding's Inequality in the training case
\begin{align} \mathbb{P}[|E_{in} - E_{out}| > \epsilon] \le 2Me^{-2\epsilon^{2}N} \end{align}
where $M$ is the size of the hypothesis set.
But even for a simple hypothesis set like the perceptron, the size of the hypothesis set is infinite. Our next step is to replace $M$ with something more practical.
For a finite sample set $\mathcal{S} = \lbrace x_1, x_2, ..., x_N \rbrace$:
  • A hypothesis is a function $h: \mathcal{X} \to \lbrace -1, +1 \rbrace$
  • A dichotomy is a function $h: \mathcal{S} \to \lbrace -1, +1 \rbrace$
  • The number of hypotheses $|\mathcal{H}|$ can be infinite
  • The number of dichotomies $|\mathcal{H}(x_1, x_2, ..., x_N)|$ is at most $2^N$
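To see why dichotomies are a much smaller object than hypotheses, consider positive rays, $h_a(x) = +1$ iff $x > a$: there are infinitely many thresholds $a$, but on $N$ sample points they realize only $N + 1$ distinct dichotomies. A small enumeration sketch (the sample points are arbitrary):

```python
def ray_dichotomies(points):
    """Distinct labelings of the sample realizable by positive rays
    h_a(x) = +1 iff x > a, as the threshold a sweeps over the reals."""
    dichotomies = set()
    xs = sorted(points)
    # one representative threshold below, between, and above the sorted points
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    for a in thresholds:
        dichotomies.add(tuple(1 if x > a else -1 for x in points))
    return dichotomies

pts = [0.2, 1.5, 3.0, 4.7]
print(len(ray_dichotomies(pts)))  # N + 1 = 5 dichotomies, far fewer than 2^4 = 16
```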

Growth Function

Some definitions first. The growth function $m_{\mathcal{H}}(N)$ counts the most dichotomies on any $N$ points for a given hypothesis set: $m_{\mathcal{H}}(N) = \max_{x_1,...,x_N \in \mathcal{X}} |\mathcal{H}(x_1,...,x_N)|$. If no data set of size $k$ can be 'shattered' by a hypothesis set $\mathcal{H}$, then $k$ is a break point for $\mathcal{H}$. Properties of break points:
  • For a hypothesis set $\mathcal{H}$, if $k$ is a break point, then $k+1$ is also a break point
  • No break point $\implies m_{\mathcal{H}}(N) = 2^N$ for all $N$
  • Any break point $\implies m_{\mathcal{H}}(N)$ is polynomial in $N$
  • The existence of a break point suggests learning is feasible
  • The value of the break point suggests how many samples are needed
It turns out that we can replace $M$ with the growth function, with some minor modifications to Hoeffding's Inequality. The result is called the Vapnik-Chervonenkis Inequality.
\begin{align} \mathbb{P}[|E_{in}(g)-E_{out}(g)| > \epsilon] \le 4 m_{\mathcal{H}}(2N) e^{- \frac{1}{8} \epsilon^2 N} \end{align}
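To get a feel for the VC bound, we can plug in a concrete growth function. For positive rays, $m_{\mathcal{H}}(N) = N + 1$, so the polynomial factor is eventually crushed by the exponential (the choice $\epsilon = 0.1$ and the sample sizes are arbitrary):

```python
import math

def vc_bound(n: int, eps: float, growth) -> float:
    """Right-hand side of the VC inequality: 4 * m_H(2N) * exp(-eps^2 * N / 8)."""
    return 4 * growth(2 * n) * math.exp(-eps ** 2 * n / 8)

rays_growth = lambda n: n + 1  # growth function of positive rays

# The bound is vacuous (> 1) for small N but shrinks rapidly as N grows.
for n in (100, 10_000, 1_000_000):
    print(n, vc_bound(n, 0.1, rays_growth))
```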

VC Dimension

The VC dimension of a hypothesis set $\mathcal{H}$, denoted by $d_{VC}(\mathcal{H})$, is the largest value of $N$ for which $m_{\mathcal{H}}(N) = 2^N$. Or, in plain English, the VC dimension is the largest number of points $\mathcal{H}$ can shatter. Referring to the learning diagram, the VC dimension has the nice property that it depends only on $\mathcal{H}$ and the samples, which makes it ideal for evaluating a hypothesis set before training.
[Figure: the learning diagram, with the pieces that determine the VC dimension highlighted]
Examples of VC dimensions:
  • Positive Rays: $d_{VC} = 1$
  • Positive Intervals: $d_{VC} = 2$
  • $d$-dimensional Perceptron: $d_{VC} = d + 1$
In essence, $d_{VC}(\mathcal{H})$ measures the effective number of parameters.
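These values can be checked by brute force. A sketch verifying that positive rays shatter one point but not two (so $d_{VC} = 1$), enumerating one threshold per gap between sorted points:

```python
from itertools import product

def rays_can_realize(points, labels):
    """Can some positive ray h_a(x) = +1 iff x > a produce these labels?"""
    xs = sorted(points)
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return any(tuple(1 if x > a else -1 for x in points) == tuple(labels)
               for a in thresholds)

def shatters(points):
    """True iff positive rays realize all 2^N labelings of the points."""
    return all(rays_can_realize(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

print(shatters([1.0]))       # True: d_VC >= 1
print(shatters([1.0, 2.0]))  # False: the labeling (+1, -1) is impossible
```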

Upper bound on $m_{\mathcal{H}}(N)$

It can be proved that for a given hypothesis set $\mathcal{H}$, $m_{\mathcal{H}}(N) \le \sum_{i=0}^{d_{VC}} \binom{N}{i}$, a polynomial in $N$ of degree $d_{VC}$. Putting this result and the Vapnik-Chervonenkis Inequality together, we can conclude that if $d_{VC}(\mathcal{H})$ is finite, then $g \in \mathcal{H}$ will generalize.
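The polynomial nature of this bound is easy to see numerically. A sketch comparing $\sum_{i=0}^{d_{VC}} \binom{N}{i}$ (with $d_{VC} = 2$, as for positive intervals) against $2^N$:

```python
from math import comb

def sauer_bound(n: int, d_vc: int) -> int:
    """Upper bound on the growth function: sum_{i=0}^{d_vc} C(n, i)."""
    return sum(comb(n, i) for i in range(d_vc + 1))

# Polynomial bound vs. exponential 2^N
for n in (5, 10, 20):
    print(n, sauer_bound(n, 2), 2 ** n)
```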
