Bachelor's Thesis (Trabajo Fin de Grado)
MATHEMATICAL OPTIMIZATION AND
FEATURE SELECTION
Presented by:
Alejandro Casado Reinaldos
Supervisors:
Dr. Rafael Blanquero Bravo, Universidad de Sevilla
Dr. Emilio Carrizosa Priego, Universidad de Sevilla
June 21, 2015
Contents

1 Introduction
2 Support Vector Machine
  2.1 Linear Support Vector Machine
    2.1.1 The Linearly Separable Case
    2.1.2 The Linearly Nonseparable Case
  2.2 Nonlinear Support Vector Machine
    2.2.1 The "Kernel Trick"
    2.2.2 Kernels and Their Properties
    2.2.3 Examples of Kernels
    2.2.4 Building the Optimal Path
  2.3 SVM in R
  2.4 Ramp Loss SVM
3 Feature Selection in SVM
Chapter 1
Introduction
Supervised classification is a common task in big data analysis. It seeks procedures for classifying objects in a set Ω into a set C of classes [7]. Supervised classification has been successfully applied in many different fields. Examples are found in text categorization, such as
document indexing, webpage classification and spam filtering; biology and medicine, such
as classification of gene expression data, homology detection, protein–protein interaction
prediction, abnormal brain activity classification and cancer diagnosis; machine vision;
agriculture; or chemistry, to cite a few fields.
Mathematical optimization has played a crucial role in supervised classification. Tech-
niques from very diverse fields within mathematical optimization have been shown to be
useful. The Support Vector Machine (SVM) is one of the main exponents of the application of mathematical optimization to supervised classification, and a state-of-the-art method for supervised learning. For the two-class case, SVM aims at separating both classes by
means of a hyperplane which maximizes the margin, i.e., the width of the band separating
the two sets. This geometrical optimization problem can be written as a convex quadratic
optimization problem with linear constraints, in principle solvable by any nonlinear opti-
mization procedure.
In some applications the number of features is huge, and training an SVM on the entire feature set would be computationally very expensive, while its outcome would lack insight. This is, for instance, the case in gene expression analysis and text categorization. This leads to the combinatorial problem of selecting a best-possible subset of features and discarding the remaining ones, known as the feature selection problem.
In this work we analyze SVMs and how relevant features can be identified. In Chap-
ter 2, we describe the SVM, first in the case of a linear kernel and then for the more
interesting case of nonlinear kernels. We show how SVM can be handled in the statistical software R. Then, some issues related to feature selection are described in Chapter 3.
Chapter 2
Support Vector Machine
2.1 Linear Support Vector Machine
Assume we have available a non-empty set of data Ω, where each u_i ∈ Ω has two components:

Ω = {u_i = (x_i, y_i) : i = 1, 2, ..., n}

with x_i ∈ R^r the vector of predictor variables, and y_i ∈ {1, −1} indicating to which of the two given classes u_i belongs.
We also have a non-empty set I, which will be called the learning set. The learning set is composed of pairs u_i = (x_i, y_i), where y_i is given for all i. The binary classification problem consists in predicting, from the data in I, the class y_i of a given u_i ∈ Ω. To this end, β ∈ R^r and β_0 ∈ R are used to construct a function f : R^r → R such that

f(x) = β^t x + β_0.
This function is called the separation function [16]. It classifies as class 1 those x ∈ R^r with f(x) > 0 and as class −1 those x ∈ R^r with f(x) < 0.
The goal is to have a function f such that all positive points in I, i.e. those with y_i = 1, are assigned to class 1 and all negative points in I, i.e. those with y_i = −1, to class −1. Points x with f(x) = 0 must be classified according to a predetermined rule. This requirement is expressed by the system

y_i (β^t x_i + β_0) > 0, ∀ i ∈ I.
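As a simple illustration of this classification rule, the following R snippet (a minimal sketch with made-up coefficients and points, not material from the thesis) evaluates the separation function and assigns a class according to the sign of f(x):

# Hypothetical coefficients of a separation function in R^2
beta0 <- -1
beta  <- c(2, -1)

# Separation function f(x) = beta^t x + beta0
f <- function(x) sum(beta * x) + beta0

# Classify a few example points by the sign of f
X <- rbind(c(1, 0), c(0, 3), c(2, 1))
scores  <- apply(X, 1, f)
classes <- ifelse(scores > 0, 1, -1)  # points with f(x) = 0 would need the predetermined rule
cbind(scores, classes)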
2.1.1 The Linearly Separable Case
First, consider the simplest case: suppose the positive (y_i = 1) and negative (y_i = −1) data points from the learning set I can be separated by a hyperplane

{x : f(x) = β^t x + β_0 = 0},

where β is the weight vector, with norm ‖β‖, and β_0 is the bias. If this hyperplane can separate the learning set into the two given classes without error, it is called a separating hyperplane.
If the positive and negative data points can be separated by the hyperplane H_0 := {x : β_0 + x^t β = 0}, then

H^+ : β_0 + β^t x_i > 0, if y_i = 1,
H^− : β_0 + β^t x_i < 0, if y_i = −1.
For separable sets, we have an infinite number of such hyperplanes. Consider any separating hyperplane. Let d_− be the shortest distance from the separating hyperplane to the nearest negative data point, and let d_+ be the shortest distance from the separating hyperplane to the nearest positive data point. We say that the hyperplane is an optimal separating hyperplane if it maximizes the distance between the hyperplane and the closest observation.
In order to find the best separating hyperplane, we use a norm ‖·‖ in R^r and derive the distances between the two given classes and the separating hyperplane. First, let us consider the Euclidean case, with the Euclidean norm given by ‖x‖² = x^t x.

Property. Let ‖·‖ be the Euclidean norm. Then, given x, we have

(2.1)  d_− = d(x, {y : β_0 + β^t y ≤ 0}) = max{β_0 + β^t x, 0} / ‖β‖,

(2.2)  d_+ = d(x, {y : β_0 + β^t y ≥ 0}) = max{−(β_0 + β^t x), 0} / ‖β‖.
Proof. Let x be a fixed point. The distance between x and the separating hyperplane is the optimal value of the problem

(2.3)  min ‖x − y‖
       subject to: β_0 + β^t y = 0,

which is equivalent to

(2.4)  min ‖x − y‖²
       subject to: β_0 + β^t y = 0.

With the Euclidean distance we are in a position to use the Karush-Kuhn-Tucker (KKT) conditions. Let L(y, λ) be the Lagrangian function defined as

L(y, λ) = ‖y − x‖² − λ(β_0 + β^t y).
Applying the KKT conditions:

∂L/∂y = 2(y − x) − λβ = 0.

Multiplying by β^t and using β^t y = −β_0:

2β^t(y − x) − λ β^t β = 0
−2β_0 − 2β^t x = λ β^t β
λ = (−2β_0 − 2β^t x) / (β^t β).

Substituting λ back into 2(y − x) = λβ and taking norms:

‖y − x‖ = (|λ| / 2) ‖β‖ = (|−2β_0 − 2β^t x| / (2 β^t β)) ‖β‖ = (|β_0 + β^t x| / ‖β‖²) ‖β‖ = |β_0 + β^t x| / ‖β‖.
Summarizing, in the Euclidean case:

d(x, {y : β_0 + β^t y = 0}) = |β_0 + β^t x| / ‖β‖.

If β_0 + β^t x ≥ 0,

d(x, {y : β_0 + β^t y ≥ 0}) = 0,
d(x, {y : β_0 + β^t y ≤ 0}) = |β_0 + β^t x| / ‖β‖ = (β_0 + β^t x) / ‖β‖.

If β_0 + β^t x ≤ 0,

d(x, {y : β_0 + β^t y ≤ 0}) = 0,
d(x, {y : β_0 + β^t y ≥ 0}) = |β_0 + β^t x| / ‖β‖ = −(β_0 + β^t x) / ‖β‖.

In general:

d(x, {y : β_0 + β^t y ≥ 0}) = max{0, −(β_0 + β^t x) / ‖β‖},
d(x, {y : β_0 + β^t y ≤ 0}) = max{0, (β_0 + β^t x) / ‖β‖}. □
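As a quick numerical check of this distance formula, the following R sketch (with arbitrary values of β_0, β and x, chosen only for illustration) compares the closed-form expression with the distance to the explicit projection of x onto the hyperplane:

# Hypothetical hyperplane and point in R^3
beta0 <- 1.5
beta  <- c(1, -2, 0.5)
x     <- c(2, 1, -1)

# Closed-form distance |beta0 + beta^t x| / ||beta||
d_formula <- abs(beta0 + sum(beta * x)) / sqrt(sum(beta^2))

# Projection of x onto the hyperplane {y : beta0 + beta^t y = 0}
y_star <- x - ((beta0 + sum(beta * x)) / sum(beta^2)) * beta

c(on_hyperplane = beta0 + sum(beta * y_star),  # approximately 0
  d_formula     = d_formula,
  d_projection  = sqrt(sum((x - y_star)^2)))   # coincides with d_formula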
For an arbitrary norm, we can use the following result, which extends the previous property [24]:

Theorem 1.1. For any norm ‖·‖ and any hyperplane H(β, β_0) we have

d_‖·‖(x, H(β, β_0)) = (β_0 − ⟨β, x⟩) / ‖β‖°, when ⟨β, x⟩ ≤ β_0,
d_‖·‖(x, H(β, β_0)) = (⟨β, x⟩ − β_0) / ‖β‖°, when ⟨β, x⟩ > β_0.

Here ‖β‖° denotes the dual norm of ‖β‖, defined as

(2.5)  ‖β‖° = max u^t β
       subject to: ‖u‖ = 1.
The previous results give the formula for the distance from a point x to a halfspace. Now, given (x_1, ..., x_n) with labels (y_1, ..., y_n), the distance of x_i to the halfspace of misclassification is given by

d_i = max{y_i (β_0 + β^t x_i), 0} / ‖β‖°, ∀ i ∈ I.

The minimum of these distances, d_I = min_{u_i ∈ I} d_i, is called the margin.
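For instance, with the Euclidean norm (which is its own dual), the margin of a given hyperplane over a small learning set could be computed in R as follows (a sketch with invented data, not code from the thesis):

# Hypothetical learning set: rows of X are the x_i, y contains the labels +1/-1
X <- rbind(c(2, 2), c(3, 3), c(-1, -1), c(-2, 0))
y <- c(1, 1, -1, -1)

# A candidate separating hyperplane beta0 + beta^t x = 0
beta0 <- 0
beta  <- c(1, 1)

# Distances d_i to the halfspace of misclassification (Euclidean norm is self-dual)
d_i <- pmax(y * (beta0 + X %*% beta), 0) / sqrt(sum(beta^2))

# The margin is the smallest of these distances
min(d_i)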
Figure 2.1: Linear SVM with the margin
The goal is to maximize the margin. This is done by solving the following optimization problem:

max_{β, β_0} min_{i ∈ I} max{y_i (β_0 + β^t x_i), 0} / ‖β‖°.

Since I is linearly separable, this problem is equivalent to

max_{β, β_0} min_{i ∈ I} y_i (β_0 + β^t x_i) / ‖β‖°,

which is equivalent to

min_{β, β_0} max_{i ∈ I} ‖β‖° / (y_i (β_0 + β^t x_i)),

or

min_{β, β_0} ‖β‖° / min_{i ∈ I} y_i (β_0 + β^t x_i).
The function (β_0, β) ↦ ‖β‖° / min_i y_i (β_0 + β^t x_i) is positively homogeneous of degree 0, i.e., invariant under multiplication of (β_0, β) by any t > 0; hence, we can assume without loss of generality that the denominator equals 1. Then we have the following equivalent representation:

(2.6)  min_{β_0, β} ‖β‖°
       subject to: min_i y_i (β_0 + β^t x_i) = 1,
       β ∈ R^r, β_0 ∈ R.
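The scale invariance invoked above is easy to verify numerically. The following R sketch (Euclidean norm, invented data, purely illustrative) shows that multiplying (β_0, β) by a positive scalar leaves the objective unchanged:

# Objective ||beta|| / min_i y_i (beta0 + beta^t x_i), Euclidean (self-dual) norm
obj <- function(beta0, beta, X, y)
  sqrt(sum(beta^2)) / min(y * (beta0 + X %*% beta))

X <- rbind(c(2, 2), c(3, 3), c(-1, -1), c(-2, 0))
y <- c(1, 1, -1, -1)

obj(0, c(1, 1), X, y)          # some value
obj(0 * 5, c(1, 1) * 5, X, y)  # identical after scaling by t = 5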
It is easily seen that this is equivalent to

(2.7)  min_{β_0, β} ‖β‖°
       subject to: min_i y_i (β_0 + β^t x_i) ≥ 1,
       β ∈ R^r, β_0 ∈ R,

i.e.,

(2.8)  min_{β_0, β} ‖β‖°
       subject to: y_i (β_0 + β^t x_i) ≥ 1, ∀ i ∈ I,
       β ∈ R^r, β_0 ∈ R.
In the Euclidean case the norm is self-dual, ‖β‖° = ‖β‖, and minimizing ‖β‖ is equivalent to minimizing β^t β, so we have:

(2.9)  min_{β_0, β} β^t β
       subject to: y_i (β_0 + β^t x_i) ≥ 1, ∀ i ∈ I,
       β ∈ R^r, β_0 ∈ R,

which is an optimization problem with a convex quadratic objective function and linear constraints.
An optimal solution of (2.9) exists precisely when (2.9) is feasible, that is, when the strict separation system

(2.10)  y_i (β_0 + β^t x_i) > 0, ∀ i ∈ I,
        β ∈ R^r, β_0 ∈ R

has a solution, i.e., when the learning set I is linearly separable.
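As an illustration of how (2.9) could be solved numerically, the following R sketch uses the quadprog package on a small invented, linearly separable data set; the tiny ridge added to the β_0 entry is only a common workaround so that the matrix passed to solve.QP is positive definite, as that routine requires (this is not code from the thesis):

library(quadprog)

# Invented, linearly separable toy data: rows of X are the x_i, y are the labels +1/-1
X <- rbind(c(2, 2), c(3, 3), c(-1, -1), c(-2, 0))
y <- c(1, 1, -1, -1)
n <- nrow(X); r <- ncol(X)

# Variables b = (beta0, beta); minimize (1/2) b^t D b with D ~ diag(0, I)
Dmat <- diag(c(1e-8, rep(1, r)))   # small ridge on beta0 so Dmat is positive definite
dvec <- rep(0, r + 1)

# Constraints y_i (beta0 + beta^t x_i) >= 1, written as t(Amat) %*% b >= bvec
Amat <- t(cbind(1, X) * y)
bvec <- rep(1, n)

sol   <- solve.QP(Dmat, dvec, Amat, bvec)
beta0 <- sol$solution[1]
beta  <- sol$solution[-1]

# Check: all margins y_i (beta0 + beta^t x_i) should be >= 1 (up to tolerance)
y * (beta0 + X %*% beta)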
For polyhedral norms, problem (2.8) can be written as a linear problem. Let us consider the particularly important cases ‖·‖ = ‖·‖_1 and ‖·‖ = ‖·‖_∞. To obtain the duals of those norms we have the following property:

Property. Let ‖·‖_p be a p-norm. Then its dual norm is ‖·‖_p° = ‖·‖_q, where p and q satisfy

1/p + 1/q = 1.
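The two extreme cases can be illustrated numerically: maximizing u^t β over the unit ball of ‖·‖_∞ yields ‖β‖_1 (attained at u = sign(β)), while maximizing over the unit ball of ‖·‖_1 yields ‖β‖_∞ (attained at a signed coordinate vector). A short R check with an arbitrary β (illustrative only):

beta <- c(3, -1, 2)

# Dual of the infinity-norm: max of u^t beta over ||u||_inf <= 1 equals ||beta||_1
u_inf <- sign(beta)
c(attained = sum(u_inf * beta), beta_1_norm = sum(abs(beta)))

# Dual of the 1-norm: max of u^t beta over ||u||_1 <= 1 equals ||beta||_inf
j     <- which.max(abs(beta))
u_one <- rep(0, length(beta)); u_one[j] <- sign(beta[j])
c(attained = sum(u_one * beta), beta_inf_norm = max(abs(beta)))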
If ‖·‖ = ‖·‖_1, then its dual ‖·‖° is the infinity norm ‖·‖_∞, and problem (2.8) can be expressed as follows:

(2.11)  min_{β_0, β} ‖β‖_∞
        subject to: y_i (β_0 + β^t x_i) ≥ 1, ∀ i ∈ I,
        β ∈ R^r, β_0 ∈ R.
This problem can be reformulated as a linear problem,

(2.12)  min z
        subject to: y_i (β_0 + β^t x_i) ≥ 1, ∀ i ∈ I,
        −z ≤ β_j ≤ z, j = 1, ..., r,
        β ∈ R^r, β_0 ∈ R, z ≥ 0.
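A possible way to solve (2.12) in R is via the lpSolve package; since lp() assumes nonnegative variables, β_0 and β are split into positive and negative parts in the sketch below (invented toy data, not code from the thesis):

library(lpSolve)

# Invented, linearly separable toy data
X <- rbind(c(2, 2), c(3, 3), c(-1, -1), c(-2, 0))
y <- c(1, 1, -1, -1)
n <- nrow(X); r <- ncol(X)

# Decision variables (all >= 0 in lpSolve): z, beta0p, beta0m, betap (r), betam (r),
# with beta0 = beta0p - beta0m and beta = betap - betam
obj <- c(1, rep(0, 2 + 2 * r))                 # minimize z

# Classification constraints: y_i (beta0 + beta^t x_i) >= 1
A1 <- cbind(0, y, -y, X * y, -X * y)
# Norm constraints: z - beta_j >= 0 and z + beta_j >= 0, j = 1,...,r
E  <- diag(r)
A2 <- cbind(1, 0, 0, -E, E)
A3 <- cbind(1, 0, 0, E, -E)

sol <- lp("min", obj, rbind(A1, A2, A3),
          rep(">=", n + 2 * r), c(rep(1, n), rep(0, 2 * r)))

v     <- sol$solution
beta0 <- v[2] - v[3]
beta  <- v[4:(3 + r)] - v[(4 + r):(3 + 2 * r)]
c(beta0 = beta0, beta = beta)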
On the other hand, if ‖·‖ = ‖·‖_∞, then its dual ‖·‖° is the 1-norm ‖·‖_1, and our problem can be expressed as follows:

(2.13)  min_{β_0, β} ‖β‖_1
        subject to: y_i (β_0 + β^t x_i) ≥ 1, ∀ i ∈ I,
        β ∈ R^r, β_0 ∈ R,
which can also be converted into a linear problem,

(2.14)  min Σ_j z_j
        subject to: y_i (β_0 + β^t x_i) ≥ 1, ∀ i ∈ I,
        −z_j ≤ β_j ≤ z_j, j = 1, ..., r,
        β ∈ R^r, β_0 ∈ R, z_j ≥ 0.
2.1.2 The Linearly Nonseparable Case
In real applications, it is unlikely that there will be such a clear linear separation between data drawn from the two classes. More likely, there will be some overlap.
The overlap causes problems for any classification rule and, depending upon its extent, some of the overlapping points may be impossible to classify correctly.
The nonseparable case occurs either when the two classes are separable, but not linearly so, or when no clear separability exists between the two classes, linearly or nonlinearly.
In the previous section we assumed that I was linearly separable; if this is not so, the above problem is infeasible. Therefore, we must find another method.