DETECTING CLASS-INDEPENDENT LINEAR RELATIONSHIPS WITHIN AN
ARBITRARY SET OF FEATURES
Ashwin Sarma
Naval Undersea Warfare Center, Newport, RI 02841
ABSTRACT

Classifiers for surveillance sonar systems are often designed to operate on large sets of predefined clues, or features. Sometimes the mathematical definitions for these features are poorly known. Other times the designer is not aware that a fixed and class-independent linear (or affine) relationship exists between subsets of features. We discuss a method based on Gram-Schmidt orthogonalization which allows the classifier designer to determine whether subsets of features have such relationships. Certain features can then be shown unnecessary by application of Wozencraft and Jacobs' "Theorem of Irrelevance". An approach is also described to rank features to aid in the selection of an effective subset.

I. INTRODUCTION

The design of classifiers for a surveillance sonar system is often based on a pre-existing set of measurements or "features" considered useful for discrimination. This set can be large and usually includes features that are simple functions of basic physical measurements of the objects to be classified. Features are often defined via linear functions of fundamental measurements or other more basic features. However, these functional relationships are sometimes unknown to the designer or difficult to sort out.

Denote as $Z$ the raw measurements collected from an object to be classified. Features contained in the vector $\vec{f}(Z)$ are said to be affinely dependent on another set of features contained in the vector $\vec{g}(Z)$ if $\vec{f}(Z) = \vec{b} + A\vec{g}(Z)$, $\forall Z$. In other words, there exist a fixed vector $\vec{b}$ and matrix $A$ such that $\vec{f}(Z)$ is given by the above relationship regardless of the object type (class). Examples include:

(1) $\vec{g}(Z)$ consists of a single feature such as an estimate of the angular width of an object, and $\vec{f}(Z)$ is the cross-range extent measured by a sonar system corresponding to that object at a fixed (non-random) range $R$; then $\vec{g}(Z) = \Delta\theta$ and $\vec{f}(Z) = R\,\Delta\theta$. Certain classes of objects may possess greater angular widths than others. However, feature $\vec{f}(Z)$ is completely defined by $\vec{g}(Z)$.

(2) $\vec{g}(Z)$ and $\vec{f}(Z)$ are estimated probabilities of an object undergoing acceleration or maintaining constant velocity, respectively, at time $t$. Such estimates are provided by tracking algorithms commonly used in sonar. Note that $\vec{g}(Z) + \vec{f}(Z) = 1$. It may be that certain classes of objects perform more maneuvers than others and will exhibit this in terms of higher values of $\vec{g}(Z)$ and lower values of $\vec{f}(Z)$. However, $\vec{f}(Z)$ is completely defined by $\vec{g}(Z)$.

In such cases the dependence may have been known to the classifier designer, but in real cases with large feature sets the dependence may be unknown or difficult to determine. Furthermore, in real problems, more complicated examples of affine dependence can and do arise in obscure and inadvertent ways.

We are interested in detecting whether any such affine relationships are present among features. In this short paper we describe a nonparametric method for detecting affinely dependent features that is based on the linear algebraic structure of the feature space and is independent of the underlying distributions as well as of separability among classes. Removal of such dependent features is argued through application of Wozencraft and Jacobs' "Theorem of Irrelevance" [1]. Removal of such features is critical as they add no additional information and unnecessarily increase the dimensionality of the feature space.

"Approximate" affine dependence can also be identified, and a ranking of the features is possible. The pre-processing advocated herein is a simple but often overlooked first step in the design of sonar classifiers. The method has proved useful for detecting and removing affinely dependent features present in Navy feature databases.

II. "THEOREM OF IRRELEVANCE"

Denote the class-conditional pdf for class $i$ as $p_r(\rho|i)$, where $r$ is the random vector of features and $\rho$ is a particular realization of $r$. Then decompose $r$ as $r = [r_1\ r_2]^T$. Denote $\rho_1$ and $\rho_2$ as the corresponding realizations of $r_1$ and $r_2$. We observe that $p_r(\rho|i) = p_{r_1}(\rho_1|i)\,p_{r_2}(\rho_2|i, r_1 = \rho_1)$ by Bayes' rule. Wozencraft and Jacobs' "Theorem of Irrelevance" states that the optimum classifier/dichotomizer may disregard a vector $r_2$ if and only if

$$p_{r_2}(\rho_2|i, r_1 = \rho_1) = p_{r_2}(\rho_2|r_1 = \rho_1) \qquad (1)$$

In other words, $r_2$ conditioned on $r_1$ must be statistically independent of $i$ for $r_2$ to be declared "irrelevant".

This work was supported by the Office of Naval Research (Award Number: N0001404WX20326) and a NUWC in-house research grant.
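The two examples above can be made concrete with a short numerical sketch. The distributions and values below are hypothetical, chosen only to show that the same fixed pair $(\vec{b}, A)$ links the features for every measurement, regardless of class:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example (1): cross-range extent at a fixed (non-random) range R.
R = 1000.0  # hypothetical range in meters
dtheta = np.concatenate([rng.uniform(0.01, 0.02, 100),   # class A: narrower objects
                         rng.uniform(0.02, 0.04, 100)])  # class B: wider objects
cross_range = R * dtheta          # f(Z) = b + A*g(Z) with b = 0, A = R

# Example (2): maneuver vs. constant-velocity probability estimates.
p_accel = np.concatenate([rng.uniform(0.0, 0.3, 100),    # class A: few maneuvers
                          rng.uniform(0.3, 0.8, 100)])   # class B: many maneuvers
p_const = 1.0 - p_accel           # f(Z) = b + A*g(Z) with b = 1, A = -1

# The same (b, A) holds for every row of data, for both classes
print(np.allclose(cross_range, 0.0 + R * dtheta))   # -> True
print(np.allclose(p_const, 1.0 - 1.0 * p_accel))    # -> True
```

In both cases the class distributions of $\vec{g}(Z)$ differ, yet $\vec{f}(Z)$ adds nothing beyond $\vec{g}(Z)$.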
0-933957-35-1 ©2007 MTS
Presented at the MTS/IEEE Oceans 2007 Conference held in Vancouver, Canada on Sep 29-Oct 4, 2007 (see also ADM002047). Approved for public release; distribution unlimited.
Another useful interpretation is that satisfaction of this requirement implies that $r_1$ is a sufficient statistic for estimating the pdfs $p_r(\rho|i)$, $\forall i$, although not necessarily a minimal one [2], [3], [4].

III. DETECTING AFFINELY DEPENDENT FEATURES

The approach involves construction of a data matrix $X$ in which all available data are included. If $C$ classes are to be distinguished and $N_i$ $(1 \times D)$ real-valued vector measurements (i.e., feature vectors) are available for $i = 1, \ldots, C$, then $X$ is an $N \times D$ data matrix where each row is a feature vector and $N = \sum_{i=1}^{C} N_i$. It is assumed that $N > D$, a reasonable assumption as there are commonly more measurements than features. An unsupervised situation in which $N$ unlabeled feature vectors are presented can also be considered.

Denote the $D$ features by $f_1, f_2, \ldots, f_D$. We are interested in affine dependence of the form:

$$f_k = \sum_{j=1, j \neq k}^{D} \alpha_j^k f_j + b_k \qquad (2)$$

where $b_k$ need not equal zero. Note that the linear algebraic concept of linear dependence among a set of vectors does not accommodate affine dependence. Thus a transformation such as standardization [5] must be performed before standard linear algebraic techniques can be applied. This amounts to estimating the sample mean $\hat{\mu}$ and sample standard deviation $\hat{\sigma}$ of each column of $X$ and transforming every element $x$ in that column according to $x = (x - \hat{\mu})/\hat{\sigma}$. Denote $X$ after standardization as $\tilde{X}$. The transformed features are denoted $\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_D$. Any affine dependence ($b_k \neq 0$ in eqn. (2)) is eliminated, i.e.,

$$\tilde{f}_k = \sum_{j=1, j \neq k}^{D} \tilde{\alpha}_j^k \tilde{f}_j \qquad (3)$$

(see appendix). It is clear that the elements of $\tilde{X}$ will be approximately zero mean with unit variance. Thus standardization has the added effect of controlling the dynamic range of the elements of $X$ and numerically stabilizing Gram-Schmidt orthogonalization.

Gram-Schmidt orthogonalization can be performed on the columns of $\tilde{X}$ via QR matrix decomposition, i.e., $\tilde{X} = QR$. The linearly dependent column vectors in $\tilde{X}$ are marked by zeros along the diagonal of $R$.

If $R_{kk}$ is the leftmost zero element along the diagonal of $R$, then $\tilde{x}_k$ is linearly dependent on the vectors $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{k-1}$ and there exists a set of coefficients $\tilde{\alpha}_1^k, \tilde{\alpha}_2^k, \ldots, \tilde{\alpha}_{k-1}^k$ such that $\tilde{x}_k$ can be expressed as:

$$\tilde{x}_k = \tilde{\alpha}_1^k \tilde{x}_1 + \tilde{\alpha}_2^k \tilde{x}_2 + \ldots + \tilde{\alpha}_{k-1}^k \tilde{x}_{k-1} \qquad (4)$$

Note that at least one (but not all) of the $\tilde{\alpha}^k$'s need be nonzero. Since this dependence is enforced over all $C$ classes and for all $N$ measurements, the probability that such a dependence manifested due to chance is effectively zero. It is worth mentioning that the dependence is not linked to the manner in which $X$ was populated. Specifically, a permutation of the rows of $X$ (equivalently $\tilde{X}$) will not alter the linear dependence. This can be seen by noting that pre-multiplying $\tilde{x}_k$ in (4) by a permutation matrix results in

$$\tilde{u}_k = \tilde{\alpha}_1^k \tilde{u}_1 + \tilde{\alpha}_2^k \tilde{u}_2 + \ldots + \tilde{\alpha}_{k-1}^k \tilde{u}_{k-1} \qquad (5)$$

where $\tilde{u}_j$, $j = 1, \ldots, D$, is the permuted $\tilde{x}_j$.

Thus feature $f_k$ must be expressible as

$$f_k = \alpha_1^k f_1 + \alpha_2^k f_2 + \ldots + \alpha_{k-1}^k f_{k-1} + b_k \qquad (6)$$

or

$$f_k = \sum_{j=1}^{D} \alpha_j^k f_j + b_k \qquad (7)$$

where $\alpha_j^k$, $j \geq k$, must equal 0. If $R_{mm}$ $(k < m \leq D)$ is the next zero element along the diagonal of $R$, then $\tilde{x}_m$ is linearly dependent on the vectors $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{m-1}$ and there exists another set of coefficients, $\tilde{\alpha}_1^m, \tilde{\alpha}_2^m, \ldots, \tilde{\alpha}_{m-1}^m$, such that $\tilde{x}_m$ can be expressed as:

$$\tilde{x}_m = \tilde{\alpha}_1^m \tilde{x}_1 + \tilde{\alpha}_2^m \tilde{x}_2 + \ldots + \tilde{\alpha}_k^m \tilde{x}_k + \ldots + \tilde{\alpha}_{m-1}^m \tilde{x}_{m-1} \qquad (8)$$

Once again not all of the $\tilde{\alpha}^m$'s need be nonzero. Furthermore, since $\tilde{x}_k$ is linearly dependent on $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{k-1}$, $\tilde{\alpha}_k^m$ can be set equal to zero. Thus $f_m$ can be expressed as

$$f_m = \sum_{j=1}^{D} \alpha_j^m f_j + b_m \qquad (9)$$

where $\alpha_j^m$ must equal zero for $j = k$ as well as for $j \geq m$. Therefore the set of features $r_2$ for which $R_{jj} = 0$ is completely determined by $r_1$. From eqns. (7) and (9), we can see that $\rho_2$ is determined by $\rho_1$ independent of class. Thus eqn. (1) is satisfied and $r_2$ is irrelevant. It is worth noting that if $N = D$ and $X$ is full rank, then $\tilde{X}$ has a rank of $D - 1$. Thus this process can be successfully applied only if $N$ is strictly greater than $D$.

Even if an exact affine dependence existed, errors in machine calculation of $R$ can lead to non-zero $R_{jj}$. Higham [6, p. 24, 122] has shown that performing the QR decomposition via a sequence of Givens rotations leads to an ultimate relative error that is acceptable and on the level of machine precision $u$. Here the relative error, defined as $\|R - \hat{R}\|_2 / \|\tilde{X}\|_2$ $(= \|R - \hat{R}\|_F / \|\tilde{X}\|_F)$, is the normalized difference between the exact $R$ and $\hat{R}$, the matrix estimated by the actual decomposition. Note that

$$\frac{\|R - \hat{R}\|_F}{\|\tilde{X}\|_F} \approx \frac{D\epsilon}{\sqrt{ND}} = \sqrt{\frac{D}{N}}\,\epsilon = u \qquad (10)$$

where $\epsilon$ is the average value of $|R_{ij} - \hat{R}_{ij}|$ $(i, j = 1, \ldots, D)$. Thus if $N \approx 100D$, $R_{jj}$ can be compared to a threshold of ten times machine precision to detect linear dependence.
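The detection procedure of this section can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's implementation: NumPy's `qr` uses Householder reflections rather than Givens rotations, and the relative tolerance below is a hypothetical choice made in the spirit of eqn. (10):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 5
X = rng.normal(size=(N, D))
# Plant an exact, class-independent affine dependence: f_3 = 2 f_0 - f_1 + 4
X[:, 3] = 2.0 * X[:, 0] - X[:, 1] + 4.0

# Standardize each column; this eliminates the offset b_k (see appendix)
Xt = (X - X.mean(axis=0)) / X.std(axis=0)

# Gram-Schmidt orthogonalization via QR decomposition of the standardized data
Q, R = np.linalg.qr(Xt)

# (Near-)zero diagonal entries of R mark linearly dependent columns.  Every
# standardized column has norm sqrt(N), so judge |R_jj| relative to that.
rel_diag = np.abs(np.diag(R)) / np.sqrt(N)
dependent = np.where(rel_diag < 1e-10)[0]
print(dependent)  # -> [3]
```

The independent columns yield relative diagonal entries of order one, so the threshold is not delicate here; on real data it should be set with eqn. (10) in mind.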
If certain features are preferred (e.g., considered more intuitive or powerful) by the designer, reordering can be performed such that these populate the leftmost columns of $X$. This step increases the likelihood that such a feature will be retained in the event that it is affinely dependent on other features.

IV. "APPROXIMATELY" AFFINELY DEPENDENT FEATURES

At this point let us assume that features that are exactly affinely dependent have been detected and removed. Operating on the resultant set, the process of Gram-Schmidt orthogonalization of $\tilde{x}_k$ amounts to:

(1) Determining the least-squares fit of $\tilde{x}_k$ in the subspace spanned by $\{\tilde{x}_1, \ldots, \tilde{x}_{k-1}\}$. This is equivalent to determining the projection of $\tilde{x}_k$ onto this subspace. We denote this projection as $P\tilde{x}_k$, where $P$ is the projection matrix.

(2) Subtracting the result from $\tilde{x}_k$ to form the orthogonal error vector $e$. Thus $\tilde{x}_k = P\tilde{x}_k + e$. The norm of the error vector, $\|e\|$, is minimized in this process [3, p. 365] but is never zero.

Again we observe that permuting the rows of $X$ (equivalently $\tilde{X}$) only permutes the elements of $e$ but leaves $\|e\|$ unchanged.

The relationship of Gram-Schmidt orthogonalization to least-squares estimation is important in that, for $N > D$, we can interpret feature $f_k$ as a sum of (A) a fixed (and class-independent) linear function of the previous features and (B) a residual. It is possible that some (or all) of the previous features may be useful for classification. It is also possible that the least-squares estimate of $f_k$ in terms of the previous features is useful as well. However, this information is implicit in the previous features. Thus it is sufficient to determine the additional information captured in the residuals. This amounts to a statistical test¹ comparing the scalar values in $e$ corresponding to class $i$ with those corresponding to class $j$. Nonparametric tests such as a Chi-squared test (or Kolmogorov-Smirnov test) can be used to compare two samples. Specifically, the p-value of the test statistic can be returned. The p-value is the probability of the event that a value of the test statistic greater than or equal to that observed occurs when both samples have a common probability distribution. A standard critical level of significance is 0.05. Thus if the p-value is less than the critical level, we can safely conclude that the distributions governing the two samples are different. A difference between the distributions indicates that the feature may prove useful for classification. It must be stressed that if there are only a few measurements per class, a large p-value need not indicate distributional similarity. However, if there are enough measurements such that p-values lower than the critical level are at least possible, then if the p-value is greater than the critical level we may conclude that $f_k$ is "approximately" affinely dependent on the previous features. Specifically, eqn. (1) with $r_2 = f_k$ and $r_1 = [f_1, f_2, \ldots, f_{k-1}]^T$ is satisfied at the reported p-value. Graphical methods such as quantile-quantile plots can be used to corroborate the conclusions.

The process of feature subset selection involves considering various subsets and, via a pre-chosen separability measure (a.k.a. the Filter Method) or a classifier architecture (a.k.a. the Wrapper Method), ranking the quality of each subset [7], [8]. As testing every possible subset is usually prohibitive, alternatives that consider only specific subsets are almost always applied. One such approach is as follows. Features can be ranked according to p-value. If it is required that a feature must be discarded, the feature with the largest p-value can be selected. The entire process of Gram-Schmidt orthogonalization followed by statistical testing is then repeated if further reduction is required.

¹Or a set of pairwise tests when more than two classes are considered.

V. SUMMARY

Methodologies are provided to detect exact and "approximate" affine relationships within a set of features. They provide insight into feature structure and aid in feature subset selection.

VI. APPENDIX

Assume the affine dependence of eqn. (11) for feature $f_k$:

$$f_k = \sum_{j=1, j \neq k}^{D} \alpha_j^k f_j + b_k \qquad (11)$$

This dependence must be enforced regardless of class. Thus

$$x_k = \sum_{j=1, j \neq k}^{D} \alpha_j^k x_j + b_k 1_N \qquad (12)$$

where $x_j$ is the $j$th column of $X$ $(j = 1, \ldots, D)$ and $1_N$ is the vector $[1\ 1\ \ldots\ 1]^T$ with $N$ entries.

Denote $x_j^T 1_N$, the sum of the elements in column $j$, as $S_j$. Denote the columns of $X$ after standardization as $\tilde{x}_j$. Thus $\tilde{x}_j$ equals

$$\tilde{x}_j = \frac{x_j - \hat{\mu}_j 1_N}{\hat{\sigma}_j} \qquad (13)$$

where $\hat{\mu}_j = S_j/N$ and $\hat{\sigma}_j^2 = (x_j - \hat{\mu}_j 1_N)^T (x_j - \hat{\mu}_j 1_N)/N$. Rewriting eqn. (12) in terms of $\tilde{x}_j$ results in:

$$\hat{\sigma}_k \tilde{x}_k + \hat{\mu}_k 1_N = \sum_{j=1, j \neq k}^{D} \alpha_j^k (\hat{\sigma}_j \tilde{x}_j + \hat{\mu}_j 1_N) + b_k 1_N \qquad (14)$$
Subtracting $\hat{\mu}_k 1_N$ from both sides of (14) yields:

$$\hat{\sigma}_k \tilde{x}_k = \sum_{j=1, j \neq k}^{D} \alpha_j^k \hat{\sigma}_j \tilde{x}_j + \underbrace{\left( \sum_{j=1, j \neq k}^{D} \alpha_j^k \hat{\mu}_j 1_N + b_k 1_N - \hat{\mu}_k 1_N \right)}_{\Gamma}$$

It is clear that the vector $\sum_{j=1, j \neq k}^{D} \alpha_j^k \hat{\mu}_j 1_N + b_k 1_N$ and the vector $\hat{\mu}_k 1_N$ can be represented as $c\,1_N$ and $d\,1_N$ respectively, and that $\Gamma = (c - d) 1_N$. Pre-multiplying both $\Gamma$ and eqn. (12) by $1_N^T$ reveals that $c = d$. Thus

$$\tilde{x}_k = \sum_{j=1, j \neq k}^{D} \tilde{\alpha}_j^k \tilde{x}_j \qquad (15)$$

where $\tilde{\alpha}_j^k$ equals $\alpha_j^k \hat{\sigma}_j / \hat{\sigma}_k$.
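The residual test of Section IV can likewise be sketched numerically. This sketch assumes SciPy is available, and the class structure and noise levels are hypothetical: feature `f3` carries a small class-independent residual on top of a fixed linear part, while `f4`'s residual differs by class:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
n = 400
labels = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]

# Two base features whose means differ by class
F = rng.normal(loc=labels[:, None].astype(float), size=(2 * n, 2))
# f3: fixed, class-independent linear part plus a small class-independent residual
f3 = 3.0 * F[:, 0] - F[:, 1] + 0.5 + 0.1 * rng.normal(size=2 * n)
# f4: its residual distribution DOES depend on class
f4 = F[:, 0] + F[:, 1] + labels * rng.normal(loc=1.0, size=2 * n)

X = np.column_stack([F, f3, f4])
Xt = (X - X.mean(axis=0)) / X.std(axis=0)

def residual(k):
    """Least-squares fit of column k on the previous columns; the residual
    e is the orthogonal error vector isolated by Gram-Schmidt."""
    coef, *_ = np.linalg.lstsq(Xt[:, :k], Xt[:, k], rcond=None)
    return Xt[:, k] - Xt[:, :k] @ coef

e3, e4 = residual(2), residual(3)
p3 = ks_2samp(e3[labels == 0], e3[labels == 1]).pvalue
p4 = ks_2samp(e4[labels == 0], e4[labels == 1]).pvalue
print(f"f3 residual KS p-value: {p3:.3f}  (large: approximately affinely dependent)")
print(f"f4 residual KS p-value: {p4:.2e}  (small: residual is discriminative)")
```

Ranking features by these p-values, discarding the one with the largest, and repeating the orthogonalization-and-test cycle implements the greedy reduction described in Section IV.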
VII. REFERENCES

[1] John M. Wozencraft and Irwin M. Jacobs, Principles of Communication Engineering, John Wiley and Sons, New York, 1965.
[2] Thomas S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach, Academic Press, New York, 1967.
[3] L. L. Scharf, Statistical Signal Processing: Detection, Estimation and Time Series Analysis, Addison-Wesley, Reading, Massachusetts, 1991.
[4] Edward C. Real, "Feature extraction and sufficient statistics in detection and classification," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1996.
[5] R. O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, John Wiley and Sons, New York, second edition, 2001.
[6] Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial and Applied Mathematics, 1996.
[7] Isabelle Guyon and Andre Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[8] K. Z. Mao, "Orthogonal forward selection and backward elimination algorithms for feature subset selection," IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, vol. 34, no. 1, pp. 629-634, February 2004.