Draft version February 5, 2008
Preprint typeset using LaTeX style emulateapj v. 10/10/03
THE INTRINSIC DIMENSIONALITY OF SPECTRO-POLARIMETRIC DATA
A. Asensio Ramos1, H. Socas-Navarro2, A. López Ariste3 & M. J. Martínez González1,4,5
ABSTRACT
The amount of information available in spectro-polarimetric data is estimated. To this end, the
intrinsic dimensionality of the data is inferred with the aid of a recently derived estimator based
on nearest-neighbor considerations and obtained applying the principle of maximum likelihood. We
show in detail that the estimator correctly captures the intrinsic dimension of artificial datasets with
known dimension. The effect of noise in the estimated dimension is analyzed thoroughly and we
conclude that it introduces a positive bias that needs to be accounted for. Real simultaneous spectro-polarimetric observations in the visible 630 nm and the near-infrared 1.5 µm spectral regions are also investigated in detail, showing that the near-infrared dataset provides more information of the physical conditions in the solar atmosphere than the visible dataset. Finally, we demonstrate that the amount of information present in an observed dataset is a monotonically increasing function of the number of available spectral lines.
Subject headings: magnetic fields — Sun: atmosphere, magnetic fields — line: profiles — polarization
arXiv:astro-ph/0701604v1 22 Jan 2007

1 Instituto de Astrofísica de Canarias, 38205, La Laguna, Tenerife, Spain
2 High Altitude Observatory, National Center for Atmospheric Research, PO Box 3000, Boulder, CO 80307-300, USA∗
3 THEMIS, CNRS-UPS 853, C/ Vía Láctea s/n, 38200, La Laguna, Tenerife, Spain
4 Max-Planck Institut für Sonnensystemforschung, 37191 Katlenburg-Lindau, Germany
5 Present address: CNRS UMR 8112-LERMA, Observatoire de Paris, Section de Meudon, 92195 Meudon, France
Electronic address: [email protected]

1. INTRODUCTION

High-dimensional data present difficulties when analyzing and understanding their statistical properties. The efficiency of typical statistical and computational methods usually degrades very fast when the dimensionality of the problem increases, thus making the analysis of the observed data a cumbersome or, sometimes, unfeasible task. This fact is often referred to as the curse of dimensionality. The advent of computers has made it possible to tackle the analysis of increasingly complex data. These data usually exhibit an intricate behavior and, in order to understand the underlying physics that produces such effects, we have been forced to develop very complicated models. Ideally, these models have to be based on physical grounds but there seems to be no way of knowing in advance how complicated this model has to be to correctly reproduce the observed behavior.

In spite of their inherent complexity, the analysis of large datasets such as those produced by modern instrumentation indicates that not all measured data points are equally relevant for the understanding of the underlying phenomena. In other words, it is clear that the reason why many simplified physical models are successful in reproducing a large amount of observations is that the data are not truly high-dimensional. Based on this premise, efforts are being made to develop methods that are capable of reducing the dimensionality of the observed datasets while still preserving their fundamental properties. Mathematically, the idea is that, while the original data may have a very large dimensionality, they are in fact confined to a small sub-region of that high-dimensional space. In this case, we can consider that the data “live” in a subspace of low dimension (the so-called intrinsic dimension) that is embedded in the high-dimensional space. This lower dimension subspace is not usually simple to describe because it often lies in a manifold whose relation with the original high-dimensional space has to be described by a very complex non-linear (and usually unknown) function. In spite of the complexity, when facing a high-dimensionality dataset, it is of great interest to reduce the dimension of the original data prior to any modeling effort. In this manner we can uncover more easily the physics underlying the observations and even detect previously unknown properties that can be of interest.

Among the most popular methods for dimensionality reduction we find principal component analysis (PCA) or Karhunen-Loeve expansion. Due to its computational simplicity, it is one of the most widely employed methods (e.g., Rees et al. 2000; López Ariste & Casini 2002; Socas-Navarro 2005a; Casini et al. 2005; Ferreras et al. 2006). PCA seeks orthogonal directions in the original high-dimensional space along which the data correlation is the largest. From a computational point of view, the method finds the eigenvalues and eigenvectors of the covariance matrix obtained from a given dataset. Then, the directions on the space where the correlation is large (large eigenvalues) may be approximately described with only one parameter (a factor multiplying the associated eigenvector) and sometimes have a physical meaning. An example of this can be seen in Skumanich & López Ariste (2002), who demonstrated how the eigenvectors associated with the largest eigenvalues of the correlation matrix obtained from spectropolarimetric observations of a sunspot are related to fundamental physical parameters. They showed that the first eigenvector is associated with the average spectrum, the second eigenvector gives information about the velocity and the third eigenvector gives information about the magnetic splitting.

One of the weakest points of PCA is its linear character, because it relies only on the information provided
by second-order statistics. Therefore, it cannot efficiently describe a dataset whose embedding in the original high-dimensional space is a nonlinear manifold. Several methods have been developed in recent years to overcome this difficulty. Among them, we can find Locally Linear Embedding (LLE, Roweis & Saul 2000), Isomap (Tenenbaum et al. 2000) and Self-Organizing Maps (SOM, Kohonen 2001). These methods are very promising and have been shown to outperform PCA when reducing the dimension of datasets that present clear nonlinearities. Recently, a promising non-linear extension of PCA (kernel PCA) has also been developed (Schölkopf et al. 1998). It is based on the extension of PCA to non-linear mappings by the application of Mercer kernels and it effectively takes into account high-order statistics from the datasets. Another nonlinear version of PCA can be carried out with the aid of auto-associative neural networks (AANNs) (e.g., Socas-Navarro 2005b, for applications in the inversion of Stokes profiles). All the previous methods are computationally expensive. AANNs require the training of a neural network with a bottleneck hidden layer that contains d neurons, with d the expected intrinsic dimension of the dataset. This training requires a very complex non-linear optimization that can be carried out with standard methods, such as the backpropagation algorithm (Rumelhart et al. 1986; Werbos 1994). Concerning KPCA, it requires the numerical diagonalization of a very large correlation matrix of size N×N (Schölkopf et al. 1998). For large datasets, the diagonalization poses a heavy burden in terms of computational time and memory requirements because the correlation matrix is not sparse.

The above-mentioned tools have been introduced recently and probably require further study in order to understand all their statistical and computational properties. Unfortunately, they all suffer from a very important limitation: none of these methods is capable of giving a reliable estimation of the intrinsic dimension d of the datasets. When this number is known or obtained by a different method, the previous methods are able to yield the projection of the original dataset in a nonlinear subspace of dimension d. If d is close to the correct intrinsic dimension of the original dataset, they usually capture the structure of the nonlinear subspace and give good results. Although it seems of reduced importance, a good estimation of d gives the key to understanding the physics underlying the observations. In the framework of spectropolarimetry, it would be desirable to find possible direct relations between the nonlinear dimensions captured by these methods and the physical parameters employed for the forward modeling (magnetic field strength, filling factor, macroscopic velocities, etc.). If d is too small, important features of the data are projected onto the same dimension and part of the information available is lost. If, on the contrary, d is too large, then the methods can introduce noise in the nonlinear manifold. A good estimation of d is also important to reduce the computational work and avoid a trial-and-error procedure.

Except for PCA and AANNs, no other dimension reduction methods have been applied to the field of spectropolarimetry. Furthermore, the authors are not aware that any nonlinear dimension reduction method has been applied to spectropolarimetric data thus far. In any case, it is always advantageous to have reliable information on the intrinsic dimensionality of the observed datasets. Although the spatial resolution of solar spectro-polarimetric observations has improved during the last decades, the resolution elements are typically much larger than the organization scales in the solar atmosphere. The ensuing mixture of signals inside the resolution element makes it necessary to use complicated models to explain the observed signals. However, it is fundamental to keep in mind that overly complicated models (with a large number of free parameters) may not be constrained by the observations. This paper presents a step forward in the systematic investigation of observational datasets with the aim of extracting as much information as possible from the observations. Although we focus on spectroscopic and/or spectropolarimetric datasets, the philosophy of the approach is applicable to other kinds of data as well. Nowadays, solar spectroscopic and spectropolarimetric datasets are becoming very large and some effort is needed to correctly exploit all the information they carry about the physical processes taking place in the plasma. We review a powerful method presented recently to estimate the intrinsic dimension of a dataset and we apply it to different observations, analyzing in detail the consequences. An example of the datasets we are interested in here is shown in Fig. 1. The usual intensity spectrum is shown in the upper panel for two different spectral ranges, while the wavelength variation of the circular polarization is shown in the lower panel.

2. BASIC THEORY

2.1. Dimension estimation

The intrinsic dimension of a dataset is informally defined as the number of parameters that is needed to describe it. In other words, given a dataset consisting of N different observations, each one made of an M-dimensional vector, we seek the dimension m of the nonlinear manifold that captures the behavior of the N vectors. As already stated, the dimension of this nonlinear manifold is smaller than that of the original space. This is a consequence of the large number of correlations that are present among the data. Consequently, we can consider that the number of parameters m that we need to describe our observations fulfills m ≪ N, always keeping in mind that these parameters have to be able to describe the whole nonlinear manifold.

Dimension estimation methods can be classified in two groups. The first group contains all the methods that rely on the diagonalization of a given correlation matrix (either linear, such as PCA, or nonlinear, such as kernel PCA). These methods estimate the dimension by calculating the number of eigenvalues greater than a given threshold. As discussed above, these methods depend largely on the ability to capture the nonlinearity of the manifold. Moreover, the estimated dimension critically depends on the threshold chosen, a quantity that is often difficult to define and has some degree of arbitrariness. However, model complexity information may be incorporated into the dimension estimate problem to generate a less arbitrary threshold (Asensio Ramos 2006).

The second group contains methods based on geometry, especially important in determining the fractal dimension of dynamical systems. The analysis of dynamical systems reveals that a large fraction of them exhibit trajectories in the phase space that do not have an integer dimension. After Mandelbrot (1982), these objects have been named fractals. Powerful methods have been developed to estimate fractal dimensions. Many of them are based on the box-counting dimension (e.g., Kolmogorov 1958; Hilborn 2000). This method estimates the dimension of a given dataset by calculating the minimum number of “boxes” of side r that are needed to cover the space occupied by the dataset. It is expected that the number of boxes N(r) increases when r decreases, so that the box-counting dimension of the dataset is given by the following scaling relation:

    N(r) = \lim_{r \to 0} k \, r^{-m},    (1)

where k is a constant. From this, the dimension is obtained by taking logarithms:

    m = -\lim_{r \to 0} \frac{\log N(r)}{\log r}.    (2)

For the case of simple low dimensionality datasets, it is easy to verify that the box-counting dimension gives the correct answer. For instance, if our data are distributed on a straight line of length L in a two-dimensional space, it is easy to demonstrate that N(r) = L/r, so that m = 1. However, this estimation based on box-counting suffers from computational problems for complex datasets and the computational work grows exponentially with the dimension of the original data. Another less computationally intensive dimension estimation method (and probably the most popular thus far) was introduced by Grassberger & Procaccia (1983) and employs the correlation dimension. This correlation dimension is based on the observation that in an N-dimensional dataset, the number of pairs of points that are closer than a distance r is proportional to r^m, where m is the correlation dimension. Refinements to this method have been introduced recently to overcome some of its limitations (Camastra & Vinciarelli 2002; Kégl 2002).

2.2. Maximum likelihood dimension estimation

A recent approach to the estimation of dimension has been suggested by Levina & Bickel (2005). It has been obtained by applying the principle of maximum likelihood to the nearest neighbor distances, resulting in a method for dimension estimation that outperforms the previous ones. Let x_i represent one of the N M-dimensional vectors that constitute the observed dataset. The maximum likelihood dimension estimation assumes that the data points surrounding x_i can be correctly described with a uniform probability distribution function. As a consequence, the nearest neighbor distances follow a Poisson process. This also leads to an easy calculation of the statistical properties of the estimator. We assume that the observed dataset represents a nonlinear embedding of a lower dimensional space of dimension m ≪ M. Levina & Bickel (2005) demonstrated that the maximum likelihood estimator \hat{m} of the intrinsic dimension (MLEID) can be written as:

    \hat{m}_k(x_i)^{-1} = \frac{1}{k-2} \sum_{j=1}^{k-1} \log \frac{T_k(x_i)}{T_j(x_i)},    (3)

where T_k(x_i) represents the Euclidean distance between point x_i and its k-th nearest neighbor. Note that the previous equation is only valid for k > 2.

The outcome of the previous equation depends critically on the number of neighbors k that are taken into account. The reason for this is that k sets the scale at which we are analyzing the dataset, and it is possible that the data have a different dimension at different scales. For instance, this is the case for a set of points in a two-dimensional space distributed according to a gaussian density. At very small scale (small value of k), we see individual points and the dimension is close to 0. At larger scales, the dimension reaches the value of 2 (e.g., Kégl 2002). Like other methods, the quality of the estimated dimension usually degrades when k increases as a consequence of the finite number of observations in the dataset (Levina & Bickel 2005).

The previous equation is interesting because it allows us to give local estimations of the intrinsic dimension, in cases where one expects it to change from point to point. Although more work needs to be done, in principle it makes it possible to locate points in the dataset that present anomalies with respect to the average behavior. In any case, it is important to take into account that large fluctuations can be expected in the estimation of the local dimension and the information provided by Eq. (3) has to be analyzed with care. However, if we assume that the observed dataset belongs to the same manifold, it is more convenient to use an estimation that takes into account all the points in the dataset. Levina & Bickel (2005) propose to use the following estimation:

    \hat{m}_k = \frac{1}{N} \sum_{i=1}^{N} \hat{m}_k(x_i),    (4)

which is simply an average over the complete dataset. On the contrary, it has been suggested elsewhere^6 that, due to the mathematical structure of Eq. (3), it makes more sense and is more stable to carry out the average of the inverse of the estimators:

    \hat{m}_k^{-1} = \frac{1}{N(k-1)} \sum_{i=1}^{N} \sum_{j=1}^{k-1} \log \frac{T_k(x_i)}{T_j(x_i)},    (5)

so that the estimation of the dimension is given by 1/\hat{m}_k^{-1}. We have verified that both estimates give almost the same value for the dimension, although the latter has a better behavior for small values of k.

The computational cost of this method (Levina & Bickel 2005) is mainly dominated by the calculation of the k nearest neighbors for every point x_i. The computational cost of evaluating Eqs. (4) or (5) turns out to be almost negligible. Since we are not dealing with too large datasets, our calculations rely on the calculation of the distances among all the points, so that the computational work is essentially proportional to N^2. However, alternative ways of calculating (exact or approximate) nearest neighbors have been developed, the majority of them being based on the construction of efficient tree-like structures that highly reduce the computational work.

6 http://www.inference.phy.cam.ac.uk/mackay/dimension
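As an illustration of Eqs. (3)-(5), the following Python sketch evaluates both global estimators with a brute-force search over all pairwise distances, so its cost scales as N^2, as described above. It is only a minimal sketch and not the code used for the results of this paper: the function names (knn_distances, log_ratio_sums, dimension_eq4, dimension_eq5) and the two toy datasets are assumptions introduced for the example. It requires k > 2, more than k points, and no duplicated points (a zero distance would make the logarithm diverge).

```python
import math
import random


def knn_distances(points, k):
    """Return, for every point, the sorted distances to its k nearest neighbors.

    Brute-force search over all pairs, so the cost grows as N^2.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[:k])
    return result


def log_ratio_sums(points, k):
    """S_i = sum_{j=1}^{k-1} log(T_k(x_i) / T_j(x_i)) for every point x_i."""
    if k <= 2:
        raise ValueError("the estimator is only defined for k > 2")
    sums = []
    for dists in knn_distances(points, k):
        t_k = dists[k - 1]
        sums.append(sum(math.log(t_k / t_j) for t_j in dists[:k - 1]))
    return sums


def dimension_eq4(points, k):
    """Eq. (4): mean of the local estimates of Eq. (3), i.e. (k - 2) / S_i."""
    sums = log_ratio_sums(points, k)
    return sum((k - 2) / s for s in sums) / len(sums)


def dimension_eq5(points, k):
    """Eq. (5): inverse of the averaged inverse estimators, N (k - 1) / sum_i S_i."""
    sums = log_ratio_sums(points, k)
    return len(sums) * (k - 1) / sum(sums)


def closed_curve(n, rng):
    """Toy dataset (assumption): n points on a closed 1-D curve embedded in 3-D."""
    angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(math.cos(a), math.sin(a), 0.3 * math.cos(2.0 * a)) for a in angles]


def flat_patch(n, rng):
    """Toy dataset (assumption): n points on a 2-D patch linearly embedded in 5-D."""
    pts = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        pts.append((u, v, u + v, u - v, 0.5 * u))
    return pts


if __name__ == "__main__":
    rng = random.Random(0)
    datasets = (("curve", closed_curve(300, rng)), ("patch", flat_patch(300, rng)))
    for name, data in datasets:
        for k in (5, 10):
            print(name, k, dimension_eq4(data, k), dimension_eq5(data, k))
```

For such toy data one would expect both estimators to return values close to 1 for the curve and close to 2 for the patch, with the agreement depending on N and k. For larger datasets, the brute-force search in knn_distances could be replaced by a tree-based nearest-neighbor search without changing the rest of the sketch.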
3. ARTIFICIAL DATASETS

3.1. Cases with a known number of dimensions

In order to show the reliability of the method introduced by Levina & Bickel (2005), it is of interest to test it with datasets of known low dimensionality. Although these tests present nothing new with respect to what is already known (e.g., Levina & Bickel 2005, and references therein), we consider them necessary to indicate the potential of these methods. To this aim, we selected a particular Stokes I profile observed with the Tenerife Infrared Polarimeter (Martínez Pillet et al. 1999) of an internetwork region of the quiet Sun (Martínez González et al. 2006a). With this profile we generate a dataset of 2000 elements by performing a random horizontal (i.e., in the wavelength direction) shift. The values of the shift obey a gaussian distribution. The estimated dimension is shown in the left panel of Fig 2. Due to the possible variation of the dimension with the scale at which the data are analyzed, we plot the estimated dimension for each value of k. When k is small, we are referring to small scales while the scale increases when k increases. Because the dataset is probably not dense enough to correctly sample the whole nonlinear manifold, there may be a systematic deviation from the correct dimension for large values of k. The solid line presents the estimation of the intrinsic dimension obtained from Eq. (4) while the dashed line presents the estimation given by Eq. (5). Note that they both yield similar values for the dimension, which is actually the correct one (since we have allowed only one degree of freedom). The method has captured the fact that, although these profiles are discretized in M = 231 wavelength points, only one parameter suffices to describe the entire dataset.

A further complication is introduced in the artificial dataset by carrying out an additional vertical shift to the Stokes I profile. The shift follows again a gaussian distribution that is not correlated with the horizontal shift. The estimated dimension is shown in the right panel of Fig 2. The method correctly gives a dimension of 2. Interestingly, when the vertical and horizontal shifts are forced to be correlated (for instance, we make them equal) the estimated dimension is again 1, just as one would expect.

3.2. Pure noise

Noise turns out to be a problem for estimating dimensions. If a dataset is confined inside a manifold of a high-dimensional space, the inclusion of noise tends to spread the points out of this manifold and starts to fill up a larger volume of the original high-dimensional space. Consequently, we expect that the addition of noise will tend to increase the estimated dimension, asymptotically approaching M, the dimension of the original space. We have generated various sets of profiles with different sizes. Each profile consists of a vector of dimension M made of completely uncorrelated noise following a gaussian distribution. The intrinsic dimension of a dataset composed of M-dimensional elements of pure noise is equal to M and we expect the estimators given by Eqs. (4) and (5) to converge to this value for sufficiently large values of the number of observables N. The estimated dimensions for each value of M are shown in Fig 3 for datasets of different sizes, from N = 500 to N = 4000. In order to minimize figure cluttering, the curves correspond only to the estimation given by Eq. (5). The same overall pattern is found for Eq. (4), with a behavior similar to that found in Fig 2. When M is small (for instance the case with M = 10 at the top left panel), the estimated dimension is very good for small values of k. It degrades as k increases because the assumption of uniform distribution of the data points breaks down for this 10-dimensional space with such a small number of points. Consequently, the assumptions under which the formalism of Levina & Bickel (2005) has been developed are not fulfilled and it cannot be applied. However, it is surprising that it is possible to have a rough estimate with a dataset of only N = 500 elements. When the number of elements of the dataset increases, the curves asymptotically tend to M. For increasing values of M, the dimension estimate is biased towards smaller values, although it is clear that it still yields a reasonable approximation to the correct value even for very small datasets. Figure 3 shows in detail how increasing the number of elements in the dataset leads to an improved estimation of the intrinsic dimension. In the limiting case of a space with extremely large dimension (M = 230), the method underestimates the dimension by a factor of ∼3.

3.3. Fe I database

One of the fastest techniques for Stokes profiles inversion is based on a look-up algorithm with PCA coefficients (Rees et al. 2000). Once a model atmosphere (with a given number of parameters) is selected, a database of models and emerging profiles is generated. The database has to be able to correctly sample the space spanned by all the parameters. Due to computational limitations, the PCA inversion technique has only been applied to the simple Milne-Eddington atmosphere thus far. The eigenvectors of the PCA decomposition are then saved, along with the projection of each element of the database on these eigenvectors, and the Milne-Eddington parameters associated with each one. In the inversion process, an observed set of Stokes profiles is projected on the eigenvectors and the corresponding projections are compared to those saved in the database. Here we have used the PCA database as our observed dataset. We are interested in estimating the dimension of the manifold in which these observations “live”. In principle, each profile contains 180 wavelength points and the phase space would have dimension 180. However, correlations between many of these wavelength points (for instance, all the continuum points that always present the same value) drastically reduce the dimension of the manifold.

The database that we use consists of ∼6200 solar Stokes profiles of the 6301-6302 Å region, where two Fe I lines and two telluric lines are visible. Fig 4 shows the estimated dimension for the Stokes I (upper panel) and the Stokes V profiles (lower panel). The database is reconstructed from the PCA eigenvectors and the projections of each element of the database on these eigenvectors. In order to see the information carried by the eigenvectors, we show in Fig 4 the estimated dimension using an increasing number of eigenvectors N_PCA in the reconstruction. The trend obtained is very instructive, showing that the estimated dimension increases with N_PCA
until a saturation is reached. The case N_PCA = 2 demonstrates that the first two eigenvectors contain a large amount of information and they may be seen, as shown by Skumanich & López Ariste (2002), as directly associated with physical parameters. The situation remains unchanged when reconstructing with N_PCA = 4 eigenvectors, while a saturation is reached when reconstructing with N_PCA = 10. This means that, although the number of Milne-Eddington parameters defining each element of the dataset is 9, only 6 are actually needed to describe the entire dataset. This is an alternative way of showing the strong degeneracy present in the 6301-6302 Å Fe I lines (Martínez González et al. 2006a,b). Although PCA cannot capture the possible nonlinearity of the 6-dimensional manifold, it can be shown that the first 6 eigenvectors are sufficient to describe all the elements of the database with a very small error.

3.4. The effect of noise

We have shown that the method developed by Levina & Bickel (2005) correctly captures the dimension of pure noise data. It is even more important to see how the method behaves when data is corrupted with noise. It is expected that, since the noise reduces the correlation between some of the components of the M-dimensional vectors that represent the dataset, the estimated dimension will grow when the signal-to-noise ratio decreases. A fundamental problem arises because it is very difficult to distinguish truly high-dimensional data from low signal-to-noise data. This test has been carried out with the Fe I dataset reconstructed using all the information available (we used the first 10 eigenprofiles). We have calculated the estimated dimension for four different noise levels, given in terms of the standard deviation σ of the gaussian noise in units of the continuum intensity. Since the typical Stokes V signals are around 1-2 orders of magnitude smaller than Stokes I, a noise of the same σ implies a much smaller signal-to-noise ratio for Stokes V than for Stokes I. Consequently, we expect the dimension increase to start at smaller values of σ for Stokes V than for Stokes I. Figure 5 presents the results for three different values of the noise. The value of σ = 10^{-4} is small enough so that no appreciable difference is found in either Stokes I or V in the estimated dimension with respect to the case with no noise. When the noise increases to σ = 10^{-3}, the Stokes I profiles still maintain the original dimension while the estimated dimension for the Stokes V dataset increases rapidly. It is interesting to note that the estimated dimension increases faster for small values of k. This is because the noise is small enough to produce perturbations (cancellation of correlation between the M components of each Stokes profile) at very small scales, while the large scale dimension still remains unchanged. When the noise is increased further, a drastic increase of the dimension is observed in Stokes V and a smaller one for Stokes I. Note that for even smaller signal-to-noise ratios, the estimated dimensions for large values of k would also increase until reaching (in the limiting case of an infinitely large dataset) a flat dimension estimate, constant for all scales, and equal to M. The typical noise in spectro-polarimetric observations is usually well below 10^{-3}, so that it is apparent from Fig 5 that noise is not expected to change appreciably the dimensionality of noiseless data.

A consequence of the previous analysis is a possible technique to recognize when data is affected in an important manner by noise. Our datasets usually present a large value of M so that, in the case of very large noise, the dimension has to grow until reaching a very large value. A clear effect of the noise, as stated in section 3.2, is that the estimated dimension rapidly increases at small values of k while being held constant for large values of k. Thus, if a calculation shows an estimated dimension that exhibits large values at small k and a steep logarithmic fall for large k, noise is likely rather important. A caveat is mandatory. This test relies on the behavior of the MLEID for large values of k. As already pointed out by Levina & Bickel (2005) and also shown here, a degradation of the dimension estimation occurs for large values of k and the maximum likelihood estimation does not hold. For this reason, one has to be cautious with low signal-to-noise ratio observations. Recently, Levina et al. (2006) have addressed the problem of dimension estimation of high-noise observations when using the MLEID. Their approach to the problem is based on a smoothing of the original dataset, so that the performance of the method is greatly enhanced. They find that the estimated dimension for high-noise spectroscopic observations of chemical mixtures turns out to be extremely large. However, when a certain amount of smoothing is introduced, the MLEID turns out to be a very accurate estimation of the dimension. Thanks to the low noise in our observations, our estimations of the intrinsic dimension are surely not dominated by noise and we consider that smoothing is not necessary.

4. OBSERVED DATASETS

We have shown how the MLEID developed by Levina & Bickel (2005) works for synthetic data. In this section we focus on real spectropolarimetric observations. Our aim is to learn about the intrinsic information content of the data. This may help understand how much complexity can be introduced in the models used to interpret the observational data. A proposed physical model usually consists of a set of free parameters that we have to constrain with the observations. It is crucial to have as much information as possible in the observed dataset so that one can constrain the model parameters. Obviously, it is undesirable to use overly complex models to infer physical information from a dataset if the observables contain only a small amount of information. The parameters used in the physical models are typically non-orthogonal and they usually present degeneracies because the same observable can be obtained with different sets (finite or infinite) of model parameters. Although it is not straightforward to estimate from the intrinsic dimension how many parameters one can introduce in the modeling, it obviously should not be much larger than the estimated intrinsic dimension. If this number is made much larger, many of these parameters may not be constrained by the observations, thus leading to unphysical results or ill-conditioned inversions.

An important application of the estimated dimension tools we have presented here is to make relative comparisons of the intrinsic information present in two different observations. There is an ongoing debate about the different results obtained for unresolved magnetic fields in the quiet Sun from the inversion of two pairs of Fe I lines
at two different spectral regions, one at 6302 Å and the other one at 1.56 µm. Recently, Martínez González et al. (2006b) have demonstrated that the information available in the pair of lines at 6302 Å is not sufficient to constrain simultaneously the intrinsic magnetic field strength and the thermodynamical properties of the plasma. They showed that it is possible to obtain exactly the same observables from completely different combinations of model parameters. Here, we consider this problem by analyzing in detail the amount of information available in the two different spectral regions. To this aim, we compare the intrinsic dimension of the two pairs of Fe I lines.
The observations employed here have been explained in detail elsewhere (Martínez González et al. 2006a,c) and an example has already been shown in Fig. 1. They were targeted to the detailed investigation of the magnetic properties of internetwork regions in the quiet Sun. These high spatial resolution observations were taken simultaneously at two different spectral windows, one in the visible around 6302 Å and the other one in the near-IR around 1.56 µm. The visible observations were acquired with the Polarimetric Littrow Spectrograph (POLIS; Beck et al. 2005) while the near-IR data were obtained with the Tenerife Infrared Polarimeter (TIP; Martínez Pillet et al. 1999). Both instruments were mounted at the German Vacuum Tower Telescope (VTT), located at the Observatorio del Teide of the Instituto de Astrofísica de Canarias. The instruments were used in a configuration such that simultaneous and co-spatial observations of the same field-of-view were possible. The noise level for both sets of data is of the order of $5\times10^{-5}$ in units of the continuum intensity.
The Stokes profiles at each spatial location were considered as vectors in a space of dimension M = 240. In principle, one expects that, unless noise dominates the signal, the intrinsic dimension has to be much smaller than M. This follows from the fact that simple physical models are successful in reproducing many of the properties of the observed Stokes profiles. In fact, this is the case as shown in Fig. 6. The figure shows the estimated dimension of the observed dataset, the upper panel displaying the results for the TIP observations and the lower panel presenting the POLIS results. The intrinsic dimension has been estimated for Stokes I and V separately using a database of 5000 observed profiles. The results obtained with Eq. (4) are shown as solid lines and those of Eq. (5) as dashed lines. It is clear from the figure that the intrinsic dimension of Stokes I is always smaller than for Stokes V, implying that the amount of information encoded in the Stokes I profiles is smaller than that in the circular polarization profiles. The magnetic field in these observations is unresolved and the filling factor of the magnetic regions inside the resolution element is of the order of 2%. Therefore, the Stokes I profile is representative of the 98% of the resolution element that is non-magnetic and carries virtually no information about the magnetic field. One expects that it may contain information about the Doppler velocity shift, temperature and density stratifications.
Focusing on Stokes I, we can see that the estimated dimension is very stable with respect to the scale at which the data are observed. According to the previous discussions, there is no indication of an artificial increase of the dimension due to noise, as expected for these low-noise observations. The presence of noise tends to raise the dimension for small values of k, also increasing the slope of the curve for larger values of k. It is interesting to point out that the curve obtained for the dataset in the visible spectral range appears to be more stable with k than that for the near-IR lines. This indicates that the near-IR data present a richer structure, one that changes with the scale at which it is analyzed. It is not straightforward to build up an intuitive idea of what this variation means. A possible interpretation might be that the set of similar Stokes I profiles presents a small variability (dimension ∼3), so that it is possible to describe them with a very reduced set of parameters. It is plausible to consider that similar Stokes I profiles are also observed in nearby spatial locations or locations exhibiting similar brightness (bright granules versus dark lanes). This result might appear obvious because data seen at small scale typically appear similar unless strong pixel-to-pixel variations are present in the observations. When the scale is increased, the variability increases as well (dimension ∼6), meaning that the set of parameters used for describing them would need to be augmented. In these intermediate values of k, we are focusing on the differences between Stokes I profiles coming from different regions (granules and lanes). Therefore, the lack of variation of the visible data has important consequences, in the sense that their Stokes I profiles tend to be less sensitive to the physical properties of the atmosphere. When the data are observed at large scale, the behavior of both spectral domains tends to be similar. The decay for k ≳ 1000 is likely produced by the breakdown of the fundamental assumption that the points follow a uniform distribution in the neighborhood of every point. The conclusion from the results obtained for Stokes I is that there seems to be an indication that the near-IR data are capable of detecting more variability in the observations than the visible data.
Turning our attention to Stokes V, essentially the same behavior is observed, with almost invariable estimated dimensions for the visible data and strong variations for the near-IR data. The estimated dimension is ∼10 for the visible data and only for k ≳ 1000 do we detect a drop-off. From the results shown in Fig. 6 it is clear that the near-IR data capture more physical information about the atmosphere where they are formed. This is another way of looking at the issues described by Martínez González et al. (2006b). Among other problems, due to the small splitting present in the visible lines, it is possible to mask variations in the magnetic field as variations in the thermodynamical parameters. Consequently, these parameters alone are not constrained by the observations and only some (possibly nonlinear) combination of them can be constrained. The splitting in the near-IR lines is much larger (the splitting is proportional to the wavelength and the effective Landé factor) and these problems are less prominent. On the other hand, the visible lines produce much stronger signals and are less sensitive to noise, especially for weak (≲500 G) fields.
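The remark on the Zeeman splitting can be made concrete with a small numerical sketch. It assumes the standard expression for the splitting in wavelength units, Δλ_B = 4.6686×10^-13 g_eff λ² B (λ in Å, B in G), and compares it with a Doppler width, which scales linearly with λ. The ratio of the two then scales as g_eff λ, which is presumably the sense of the proportionality quoted above. The effective Landé factors (2.5 for Fe I 6302.5 Å, 3.0 for Fe I 15648.5 Å) and the 2 km/s broadening velocity are assumptions of this sketch, not values taken from this paper, and all function names are hypothetical.

```python
# Illustrative sketch: Zeeman splitting relative to Doppler width for one
# visible and one near-IR Fe I line. Line parameters are assumed values.

ZEEMAN_CONSTANT = 4.6686e-13   # angstrom^-1 gauss^-1 (lambda in angstrom, B in gauss)
LIGHT_SPEED_KM_S = 299792.458


def zeeman_splitting(wavelength, g_eff, field):
    """Zeeman splitting in angstrom for a line at `wavelength` (angstrom)."""
    return ZEEMAN_CONSTANT * g_eff * wavelength**2 * field


def doppler_width(wavelength, velocity_km_s):
    """Doppler width in angstrom for a given broadening velocity."""
    return wavelength * velocity_km_s / LIGHT_SPEED_KM_S


def splitting_over_width(wavelength, g_eff, field, velocity_km_s=2.0):
    """Ratio of Zeeman splitting to Doppler width (dimensionless)."""
    return zeeman_splitting(wavelength, g_eff, field) / doppler_width(
        wavelength, velocity_km_s
    )


FIELD_GAUSS = 500.0  # weak field, as in the regime discussed in the text
visible = splitting_over_width(6302.5, 2.5, FIELD_GAUSS)
near_ir = splitting_over_width(15648.5, 3.0, FIELD_GAUSS)
print(f"visible: {visible:.2f}  near-IR: {near_ir:.2f}  gain: {near_ir / visible:.2f}")
```

Under these assumptions the splitting remains below the Doppler width for the visible line and exceeds it for the near-IR line at 500 G, with the near-IR line roughly three times more sensitive in relative terms.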
5. AUGMENTING INFORMATION
We have already pointed out that the information encoded in the pair of lines at 6302 Å is not sufficient to constrain simultaneously the thermodynamical and magnetic properties of the plasma in small unresolved magnetic structures in the quiet Sun. It has been suggested that the solution to this problem relies on the simultaneous observation of many spectral lines (e.g., Semel 1981; Socas-Navarro 2004). Each line contributes by adding somewhat different (hopefully complementary) information and constraints, so that the thermodynamical and magnetic properties of the plasma can be inferred with more confidence. This increase in the information content must be accompanied by an increase in the intrinsic dimension of the space spanned by the observations. It is likely that a large fraction of the information carried by all the spectral lines is common and only a small part can be better inferred from a set of lines. Consequently, it is expected that the inclusion of each additional spectral line would increase slightly the information available, unless the new line turns out to be sensitive to a physical parameter to which the original set was nearly insensitive. In order to investigate this in detail, we show in Fig. 7 the intrinsic dimension obtained using Eq. (5) for three synthetic datasets. These datasets contain the pair of Fe I lines at 1.56 µm, the pair of Fe I lines at 6302 Å and the Mn I line at 8740 Å. The full dataset has been obtained using Local Thermodynamical Equilibrium (LTE) synthetic profiles. The HSRA model atmosphere (Gingerich et al. 1971) was chosen as a reference and random values of the following nine parameters were added to it, producing 10000 different random profiles: macro- and micro-turbulent velocities, filling factor, temperature offset (shifting the whole HSRA temperature height profile), temperature gradient (changing the slope of the reference temperature height profile), magnetic field offset, magnetic field gradient, velocity field offset and velocity gradient. A total of 9 parameters have been used to construct the database. If the lines contain reliable information about the 9 parameters, one would expect to infer an intrinsic dimension close to 9. However, this is not the case, as can be seen in Fig. 7. The maximum value of the dimension we obtain is only 6 and this is the maximum number of orthogonal parameters we can introduce in our modeling. There are two possible reasons for this. First, the parameters we have varied for generating the database might be degenerate, in the sense that (possibly nonlinear) combinations of two or more parameters yield the same (or very similar) emergent profiles. Second, it is possible that some information is lost in the line formation process due to radiative transfer effects (e.g., line-of-sight blurring). Both mechanisms tend to reduce the information available in the observations.
A very important conclusion of this synthetic experiment is that the amount of information that we can extract from a set of observables increases with the number of spectral lines included in the set. This might sound obvious, but our approach of calculating the intrinsic dimensionality of the observed dataset demonstrates this point rigorously for the first time. The increase in the information content is shown in Fig. 7, where we have plotted the intrinsic dimension obtained from the considered lines. We have also overplotted the result that we obtain when the intrinsic dimension is estimated considering all the lines simultaneously. Fig. 7 demonstrates that the available information is a monotonically increasing function of the number of lines.
It is important to note, however, that the results presented in this section are not in accordance with those shown in the previous section. In the synthetic experiment carried out here, the 630 nm lines capture slightly more information than the 1.5 µm lines. We attribute this apparently puzzling behavior to the fact that this synthetic test is not realistic, in the sense that either the variations in the physical properties that we have included are not representative of what is happening in the solar atmosphere, or the solar case contains correlations among physical parameters that are absent from the database, or both.
6. CONCLUSIONS
We have applied a computationally efficient method developed by Levina & Bickel (2005) for estimating the intrinsic dimension of a dataset. The method relies only on the calculation of the Euclidean distances between the observables (taken as vectors in a high-dimensional space). The properties of the method have been analyzed in detail with artificial datasets. We have verified that it is able to correctly estimate the intrinsic dimension in artificially generated data. If the simulated observations contain noise, the method correctly estimates an increase in the intrinsic dimension that tends towards the dimension of the high-dimensional space. In very high-dimensional spaces with a small number of observations, the assumptions on which the method relies are not fulfilled, so that the method cannot be applied. The presence of noise in the observations produces an overestimation of the dimension at small values of k, and this may be used to judge whether the information has been significantly degraded by the presence of noise. Since both an intrinsically high-dimensional manifold and the noise produce an increase in the estimated dimension, it turns out to be extremely difficult to discriminate between the two. We have suggested a possible way of discriminating both effects by analyzing the behavior of the MLEID curve for large values of k. However, this approach suffers from problems because the hypotheses on which MLEID is based are not correctly fulfilled for large values of k. The application of the method to real observations in the pair of Fe I lines at 1.56 µm and the pair at 6302 Å shows that the near-IR lines appear to carry more information than the visible ones. An extra numerical experiment has shown unequivocally that the amount of information that may be obtained from an observed dataset increases as the number of included lines increases.
Although this work has focused on spectro-polarimetric datasets, it is fundamental to point out the enormous applicability of estimators of the intrinsic dimension like the one presented by Levina & Bickel (2005). Physics, and especially astrophysics, depends on the development of models with different levels of complexity that are used to explain the observables. A posteriori, inversion techniques allow one to infer the properties of the object under study by fitting the observables with the proposed model. The complexity of the model has to be constrained by the amount of information available in the observables. Consequently, the estimators of the intrinsic dimensionality of the observed datasets help us accept or reject different models depending on the amount of information carried by the observables.
We thank R. Casini, M. Collados, E. Khomenko, B. Ruiz Cobo and J. Trujillo Bueno for helpful discussions. We thank the anonymous referee for useful suggestions. This research has been funded by the Spanish Ministerio de Educación y Ciencia through project AYA2004-05792.
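As an illustration of the kind of estimator summarized in the conclusions, the following is a minimal pure-Python sketch of the nearest-neighbor maximum-likelihood estimate of Levina & Bickel (2005), averaged over all points for a fixed number of neighbors k. The function and variable names are hypothetical, the normalization by k-1 follows the form given in that reference and is not guaranteed to match Eqs. (4) and (5) of this paper exactly, and the test data (a plane linearly embedded in a six-dimensional space) are synthetic.

```python
import math
import random


def mle_intrinsic_dimension(points, k):
    """Mean nearest-neighbor MLE of the intrinsic dimension for a fixed k.

    For each point x, with T_j(x) the Euclidean distance to its j-th nearest
    neighbor, the local estimate is (k - 1) / sum_{j<k} log(T_k(x) / T_j(x)).
    """
    if not 2 <= k < len(points):
        raise ValueError("k must satisfy 2 <= k < number of points")
    estimates = []
    for i, p in enumerate(points):
        # Distances from p to every other point, sorted in ascending order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        t_k = dists[k - 1]  # distance to the k-th nearest neighbor
        log_sum = sum(math.log(t_k / t_j) for t_j in dists[: k - 1])
        estimates.append((k - 1) / log_sum)
    return sum(estimates) / len(estimates)


def embedded_plane(n, seed=0):
    """Synthetic data: n points of a 2-D plane embedded in a 6-D space."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        data.append((u, v, u + v, u - v, 2.0 * u, 0.5 * v))
    return data


sample = embedded_plane(500)
dimension = mle_intrinsic_dimension(sample, k=20)
print(f"estimated dimension for k=20: {dimension:.2f}")
```

The brute-force neighbor search scales quadratically with the number of points, which is acceptable for a few hundred vectors. Evaluating the function over a range of k gives a dimension-versus-scale curve of the kind discussed in the paper.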
REFERENCES
Asensio Ramos, A. 2006, ApJ, 646, 1445
Beck, C., Schmidt, W., Kentischer, T., & Elmore, D. 2005, A&A, 437, 1159
Camastra, F., & Vinciarelli, A. 2002, IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 1404
Casini, R., Bevilacqua, R., & López Ariste, A. 2005, ApJ, 622, 1265
Ferreras, I., Pasquali, A., de Carvalho, R. R., de la Rosa, I. G., & Lahav, O. 2006, MNRAS, 616
Gingerich, O., Noyes, R. W., Kalkofen, W., & Cuny, Y. 1971, Sol. Phys., 18, 347
Grassberger, P., & Procaccia, I. 1983, Physica D, 9, 189
Hilborn, R. C. 2000, Chaos and Nonlinear Dynamics, 2nd Edition (New York: Oxford University Press)
Kégl, B. 2002, Advances in NIPS, 14, 1404
Kohonen, T. 2001, Self-organizing maps (Berlin: Springer)
Kolmogorov, A. N. 1958, Dokl. Akad. Nauk. SSSR, 119, 861
Levina, E., & Bickel, P. J. 2005, in Advances in NIPS, Vol. 17
Levina, E., Wagaman, A. S., Callender, A. F., Mandair, G. S., & Morris, M. D. 2006, J. Chemom., in press
López Ariste, A., & Casini, R. 2002, ApJ, 575, 529
Mandelbrot, B. B. 1982, The Fractal Geometry of Nature (San Francisco: W. H. Freeman)
Martínez González, M. J., Collados, M., & Ruiz Cobo, B. 2006a, in Solar Polarization 4, ed. R. Casini & B. W. Lites, ASP Conf. Ser., in press
Martínez González, M. J., Collados, M., & Ruiz Cobo, B. 2006b, A&A, 456, 1159
Martínez González, M. J., Collados, M., Ruiz Cobo, B., & Beck, C. 2006c, ApJ, in preparation
Martínez Pillet, V., Collados, M., Bellot Rubio, L. R., Rodríguez Hidalgo, I., Ruiz Cobo, B., & Soltau, D. 1999, in Astronomische Gesellschaft Meeting Abstracts, vol. 15
Rees, D. E., López Ariste, A., Thatcher, J., & Semel, M. 2000, A&A, 355, 759
Roweis, S., & Saul, L. K. 2000, Science, 290, 2323
Rumelhart, D., Hinton, G., & Williams, R. 1986, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, ed. D. Rumelhart & J. McClelland (Cambridge: MIT), 318
Schölkopf, B., Smola, A. J., & Müller, K.-R. 1998, Neural Computation, 10, 1299
Semel, M. 1981, A&A, 97, 75
Skumanich, A., & López Ariste, A. 2002, ApJ, 570, 379
Socas-Navarro, H. 2004, ApJ, 613, 610
Socas-Navarro, H. 2005a, ApJ, 620, 517
Socas-Navarro, H. 2005b, ApJ, 621, 545
Tenenbaum, J. B., de Silva, V., & Langford, J. C. 2000, Science, 290, 2319
Werbos, P. 1994, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting (New York: John Wiley & Sons)
Fig. 1. Example of the spectropolarimetric data that we have analyzed in this work. These observations have been obtained in an internetwork region of the quiet Sun (Martínez González et al. 2006a,c). The upper panel shows the Stokes I profiles in two different spectral regions, one in the near-IR and the other in the visible. The lower panel shows the circular polarization Stokes V profiles.
Fig. 2. Estimated dimension for two simple cases. In the left panel we show the result when the database consists of a single profile that is horizontally shifted by a random sub-pixel quantity. The method correctly yields a value of 1 for the dimension. The right panel shows the result when an additional vertical shift is applied, giving the correct value of 2.
Fig. 3. Estimated dimension for profiles composed of noise. The number of wavelength points considered in each case is shown in the title.
Fig. 4. Estimated dimension for the Stokes I and V database of Fe I as different numbers of PCA components are used in the reconstruction.