Predicting Nearly As Well As the Optimal Twice
Differentiable Regressor
N. Denizcan Vanli, Muhammed O. Sayin, and Suleyman S. Kozat, Senior Member, IEEE
Abstract—We study nonlinear regression of real valued data in an individual sequence manner, where we provide results that are guaranteed to hold without any statistical assumptions. We address the convergence and undertraining issues of conventional nonlinear regression methods and introduce an algorithm that elegantly mitigates these issues via an incremental hierarchical structure (i.e., via an incremental decision tree). Particularly, we present a piecewise linear (or nonlinear) regression algorithm that partitions the regressor space in a data driven manner and learns a linear model at each region. Unlike the conventional approaches, our algorithm gradually increases the number of disjoint partitions on the regressor space in a sequential manner according to the observed data. Through this data driven approach, our algorithm sequentially and asymptotically achieves the performance of the optimal twice differentiable regression function for any data sequence with an unknown and arbitrary length. The computational complexity of the introduced algorithm is only logarithmic in the data length under certain regularity conditions. We provide the explicit description of the algorithm and demonstrate the significant gains for the well-known benchmark real data sets and chaotic signals.

Index Terms—Online, nonlinear, regression, incremental decision tree.

This work was supported in part by the Turkish Academy of Sciences Outstanding Researcher Programme and in part by TUBITAK under Contract 112E161 and Contract 113E517.

The authors are with the Department of Electrical and Electronics Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey (e-mail: {vanli,sayin,kozat}@ee.bilkent.edu.tr).

I. INTRODUCTION

We study sequential nonlinear regression, where we aim to estimate or model a desired sequence {d[t]}_{t≥1} by using a sequence of regressor vectors {x[t]}_{t≥1}. In particular, we seek to find the relationship, if it exists, between these two sequences, which is assumed to be unknown, nonlinear, and possibly time varying. This generic nonlinear regression framework is extensively studied in the machine learning and signal processing literatures since it can model a wide range of real life applications by capturing the salient characteristics of underlying signals and systems [1]–[15]. In order to define and find this relationship between the desired sequence and regressor vectors, numerous methods such as neural networks, Volterra filters, and B-splines are used [1], [2], [6], [11], [16]–[19]. However, either these methods are extremely difficult to use in real life applications due to convergence issues, e.g., Volterra filters and B-splines, or it is quite hard to obtain a consistent performance in different scenarios, cf. [1]–[3], [6]–[8], [12]–[15], [20]–[22].

To this end, in this paper, we propose an algorithm that alleviates these issues by introducing hierarchical models that recursively and effectively partition the regressor space into subsequent regions in a data driven manner, where a different linear model is learned at each region. Unlike most of the nonlinear models, learning linear structures at each region can be efficiently managed. Hence, using this hierarchical piecewise model, we significantly mitigate the convergence and consistency issues. Furthermore, we prove that the resulting hierarchical piecewise model asymptotically achieves the performance of any twice differentiable regression function that is directly tuned to the underlying observations without any tuning of algorithmic parameters or any assumptions on the data (other than an upper bound on the magnitude). Since most of the nonlinear modeling functions of the regression algorithms in the literature, such as neural networks and Volterra filters, can be accurately represented by twice differentiable functions [1], [2], [6], [16], [18], [19], our algorithm readily performs asymptotically as well as such nonlinear learning algorithms.

In particular, the introduced method sequentially and recursively divides the space of the regressors into disjoint regions according to the amount of the data in each region, instead of committing to an a priori selected partition. In this sense, we avoid creating undertrained regions until a sufficient amount of data is observed. The nonlinear modeling power of the introduced algorithm is incremented (by consecutively partitioning the regressor space into smaller regions) as the observed data length increases. The introduced method adapts itself according to the observed data instead of relying on ad-hoc parameters that are set while initializing the algorithm. Thus, the introduced algorithm provides a significantly stronger modeling power with respect to the state-of-the-art methods in the literature as shown in our experiments.

We emphasize that piecewise linear regression using tree structures is extensively studied in the computational learning and signal processing literatures [7]–[9], [12]–[15], [20]–[24] due to its attractive convergence and consistency features. There exist several tree based algorithms that mitigate the overtraining problem by defining hierarchical piecewise models such as [8], [9], [12]–[15]. Although these methods achieve the performance of the best piecewise model defined on a tree, i.e., the best pruning of a tree, they only yield satisfactory performance when the initial partitioning of the regressor space is highly accurate or tuned to the underlying data (which is unknown or even time-varying). Furthermore, there are more recent algorithms such as [20] that achieve the performance of the optimal combination of all piecewise models defined on a tree that minimizes the accumulated loss. There are also methods that alleviate the overtraining
problem by learning the region boundaries [20] to minimize the regression error for a fixed depth tree with a computational complexity relatively greater compared to the ones in [8], [9], [12], [14], [15] (particularly, exponential in the depth of the tree). However, these algorithms can only provide a limited modeling power since the tree structure in these studies is fixed. Furthermore, the methods such as [20] can only learn the locally optimal region boundaries due to the highly nonlinear (and non-convex) optimization structure. Unlike these methods, the introduced algorithm sequentially increases its nonlinear modeling power according to the observed data and directly achieves the performance of the best twice differentiable regression function that minimizes the accumulated regression error. We also show that in order to achieve the performance of a finer piecewise model defined on a tree, it is not even necessary to create these piecewise models when initializing the algorithm. Hence, we do not train a piecewise model until a sufficient amount of data is observed, and show that the introduced algorithm, in this manner, does not suffer any asymptotical performance degradation. Therefore, unlike the relevant studies in the literature, in which undertrained (i.e., unnecessary) partitions are kept in the overall structure, our method intrinsically eliminates the unnecessarily finer partitions without any loss in asymptotical performance (i.e., we maintain universality).

Aside from such piecewise linear regression techniques based on hierarchical models, there are various different methods to introduce nonlinearity such as B-splines and Volterra series [1], [2], [6], [11], [17]–[19]. In these methods, the nonlinearity is usually introduced by modifying the basis functions to create polynomial estimators, e.g., in [19], the authors use trigonometric functions as their basis functions. We emphasize that these techniques can be straightforwardly incorporated into our framework by using these methods at each region in the introduced algorithm to obtain piecewise nonlinear regressors. Note that the performance of such methods, e.g., B-splines and Volterra series (and other various methods with different basis functions), is satisfactory when the data is generated using the underlying basis functions of the regressor. In real life applications, the underlying model that generates the data is usually unknown. Thus, the successful implementation of these methods significantly depends on the match (or mismatch) between the regressor structure and the underlying model generating the data. On the other hand, the introduced algorithm achieves the performance of any such regressor provided that its basis functions are twice differentiable. In this sense, unlike the conventional methods in the literature, whose performances are highly dependent on the selection of the basis functions, our method can well approximate these basis functions (and regressors formed by these basis functions) via piecewise models such that the performance difference with respect to the best such regressor asymptotically goes to zero in a strong individual sequence manner without any statistical assumptions.

The main contributions of this paper are as follows. We introduce a sequential piecewise linear regression algorithm i) that provides a significantly improved modeling power by adaptively increasing the number of partitions according to the observed data, ii) that is highly efficient in terms of the computational complexity as well as the error performance, and iii) whose performance converges to iii-a) the performance of the optimal twice differentiable function that is selected in hindsight and iii-b) the best piecewise linear model defined on the incremental decision tree, with guaranteed upper bounds without any statistical or structural assumptions on the desired data as well as on the regressor vectors (other than an upper bound on them). Hence, unlike the state-of-the-art approaches whose performances usually depend on the initial construction of the tree, we introduce a method to construct a decision tree, whose depth (and structure) is adaptively incremented (and adjusted) in a data dependent manner, which we call an incremental decision tree. Furthermore, the introduced algorithm achieves this superior performance only with a computational complexity O(log(n)) for any data length n, under certain regularity conditions. Even if these regularity conditions are not met, the introduced algorithm still achieves the performance of any twice differentiable regression function, however with a computational complexity linear in the data length.

The organization of the paper is as follows. We first describe the sequential piecewise linear regression problem in detail in Section II. We then introduce the main algorithm in Section III and prove that the performance of this algorithm is nearly as well as the best piecewise linear model that can be defined by the incremental decision tree in Section IV. Using this result, we also show that the introduced algorithm achieves the performance of the optimal twice differentiable function that is selected after observing the entire data before processing starts, i.e., non-causally. In Section V, we demonstrate the performance of the introduced algorithm through simulations and then conclude the paper with several remarks in Section VI.

II. PROBLEM DESCRIPTION

We study sequential nonlinear regression, where the aim is to estimate an unknown desired sequence {d[t]}_{t≥1} by using a sequence of regressor vectors {x[t]}_{t≥1}, where the desired sequence and the regressor vectors are real valued and bounded but otherwise arbitrary, i.e., d[t] ∈ ℝ, x[t] ≜ [x_1[t],...,x_p[t]]^T ∈ ℝ^p for an arbitrary integer p, and |d[t]|, |x_i[t]| < A < ∞ for all t and i = 1,...,p. We call the regressors "sequential" if, in order to estimate the desired data at time t, i.e., d[t], they only use the past information d[1],...,d[t−1] and the observed regressor vectors¹ x[1],...,x[t].

¹ All vectors are column vectors and denoted by boldface lower case letters. Matrices are denoted by boldface upper case letters. For a vector x, x^T is the ordinary transpose. We denote d_a^b ≜ {d[t]}_{t=a}^b.

In this framework, a piecewise linear model is constructed by dividing the regressor space into a union of disjoint regions, where in each region a linear model holds. As an example, suppose that the regressor space is parsed into K disjoint regions R_1,...,R_K such that ∪_{k=1}^K R_k = [−A,A]^p.
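A minimal sketch of such a fixed-partition piecewise linear model may clarify the setting before the tree construction is introduced. The equal-width split on the first coordinate and the LMS update below are illustrative assumptions, not the algorithm of this paper.

```python
import numpy as np

# Sketch of the fixed-partition piecewise linear model of Section II: the
# regressor space is split into K given regions, each holding its own linear
# model. The split rule and the LMS step size are illustrative assumptions.
class FixedPiecewiseLinear:
    def __init__(self, p, boundaries, mu=0.05):
        self.boundaries = boundaries                  # thresholds on x_1 (assumed 1-D split)
        self.v = np.zeros((len(boundaries) + 1, p))   # one weight vector per region
        self.mu = mu                                  # LMS step size (assumed)

    def region(self, x):
        return int(np.searchsorted(self.boundaries, x[0]))

    def predict(self, x):
        return float(self.v[self.region(x)] @ x)

    def update(self, x, d):
        k = self.region(x)
        e = d - self.v[k] @ x
        self.v[k] += self.mu * e * x                  # LMS step within the active region
        return e

# usage: sequentially predict d[t] from x[t]
model = FixedPiecewiseLinear(p=2, boundaries=[0.0])
x, d = np.array([0.3, -0.1]), 0.2
err = model.update(x, d)
```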
Fig. 1: The partitioning of a one dimensional regressor space, i.e., [−A,A], using a depth-2 full decision tree, where each node represents a portion of the regressor space.

Fig. 2: All different piecewise linear models that can be obtained using a depth-2 full decision tree, where the regressor space is one dimensional. These models are based on the partitioning shown in Fig. 1.
Given such a model, say model m, at each time t, the sequential linear² regressor predicts d[t] as d̂_m[t] = v_{m,k}^T[t] x[t] when x[t] ∈ R_k, where v_{m,k}[t] ∈ ℝ^p for all k = 1,...,K. These linear models assigned to each region can be trained independently using different adaptive methods such as the least mean squares (LMS) or the recursive least squares (RLS) algorithms.

² Note that affine models can also be represented as linear models by appending a 1 to x[t], where the dimension of the regressor space increases by one.

However, by directly partitioning the regressor space as ∪_{k=1}^K R_k = [−A,A]^p before the processing starts and optimizing only the internal parameters of the piecewise linear model, i.e., v_{m,k}[t], one significantly limits the performance of the overall regressor since we do not have any prior knowledge on the underlying desired signal. Therefore, instead of committing to a single piecewise linear model with a fixed and given partitioning, and performing optimization only over the internal linear regression parameters of this regressor, one can use a decision tree to partition the regressor space and try to achieve the performance of the best partitioning over the whole doubly exponential number of different models represented by this tree [25].

As an example, in Fig. 1, we partition the one dimensional regressor space [−A,A] using a depth-2 tree, where the regions R_1,...,R_4 correspond to disjoint intervals on the real line and the internal nodes are constructed using unions of these regions. In the generic case, for a depth-d full decision tree, there exist 2^d leaf nodes and 2^d − 1 internal nodes. Each node of the tree represents a portion of the regressor space such that the union of the regions represented by the leaf nodes is equal to the entire regressor space [−A,A]^p. Moreover, the region corresponding to each internal node is constructed by the union of the regions of its children. In this sense, we obtain 2^{d+1} − 1 different nodes (regions) on the depth-d decision tree (on the regressor space) and approximately 1.5^{2^d} different piecewise models that can be represented by certain collections of the regions represented by the nodes of the decision tree [25]. For example, we consider the same scenario as in Fig. 1, where we partition the one dimensional real space using a depth-2 tree. Then, as shown in Fig. 1, there are 7 different nodes on the depth-2 decision tree; and as shown in Fig. 2, a depth-2 tree defines 5 different piecewise partitions or models, where each of these models is constructed using certain unions of the nodes of the full depth decision tree.

We emphasize that given a decision tree of depth-d, the nonlinear modeling power of this tree is fixed and finite since there are only 2^{d+1} − 1 different regions (one for each node) and approximately 1.5^{2^d} different piecewise models (i.e., partitions) defined on this tree. Instead of introducing such a limitation, we recursively increment the depth of the decision tree as the data length increases. We call such a tree the "incremental decision tree" since the depth of the decision tree is incremented (and potentially goes to infinity) as the data length n increases, hence in a certain sense, we can achieve the modeling power of an infinite depth tree. As shown in Theorem 2, the piecewise linear models defined on the tree will converge to any unknown underlying twice differentiable model under certain regularity conditions as n increases.

To this end, we seek to find a sequential regression algorithm (whose estimate at time t is represented by d̂_s[t]) that, when applied to any sequence of data and regressor vectors, yields the following performance (i.e., regret) guarantee

\sum_{t=1}^{n} \left(d[t]-\hat{d}_s[t]\right)^2 - \inf_{f\in F} \sum_{t=1}^{n} \left(d[t]-\hat{d}_f[t]\right)^2 \le o(n), \qquad (1)

over any n, without the knowledge of n, where F represents the class of all twice differentiable functions, whose parameters are set in hindsight, i.e., after observing the entire data before processing starts, and d̂_f[t] represents the estimate of the twice differentiable function f ∈ F at time t. The relative accumulated error in (1) represents the performance difference of the introduced algorithm and the optimal batch twice differentiable regressor. Hence, an upper bound of o(n) in (1) implies that the algorithm d̂_s[t] sequentially and asymptotically converges to the performance of the regressor d̂_f[t], for any f ∈ F.

III. NONLINEAR REGRESSION VIA INCREMENTAL DECISION TREES

In this section, we introduce the main results of the paper. Particularly, we first show that the introduced sequential piecewise linear regression algorithm asymptotically achieves the performance of the best piecewise linear model defined on the incremental decision tree (with possibly infinite depth) with the optimal regression parameters at each region that minimize the accumulated loss. We then use this result to prove that the introduced algorithm asymptotically achieves the performance of any twice differentiable regression function. We provide the algorithmic details and the construction of the algorithm in Section IV.
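As a quick check of the partition count quoted above, the number of piecewise models (prunings) represented by a depth-d full binary tree satisfies the recursion P(d) = P(d−1)² + 1 with P(0) = 1, which grows roughly like 1.5^{2^d}. A short sketch verifying this (the recursion itself is a standard counting argument, stated here as an assumption consistent with the figures quoted in the text):

```python
# Number of piecewise models defined on a depth-d full binary tree:
# either keep a node's region whole, or split it and combine the models
# of its two subtrees, giving P(d) = P(d-1)**2 + 1 with P(0) = 1.
def num_partitions(d):
    count = 1                      # a single leaf defines one model: itself
    for _ in range(d):
        count = count ** 2 + 1
    return count

print([num_partitions(d) for d in range(5)])   # [1, 2, 5, 26, 677]
print(1.5 ** (2 ** 4))                         # ~656.8, close to 677
```

Note that num_partitions(2) = 5 matches the five models of Fig. 2.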
Theorem 1: Let {d[t]}_{t≥1} and {x[t]}_{t≥1} be arbitrary, bounded, and real-valued sequences of data and regressor vectors, respectively. Then the algorithm d̂[t] (given in Fig. 4) when applied to these data sequences yields

\sum_{t=1}^{n} \left(d[t]-\hat{d}[t]\right)^2 - \min_{m\in M'_n} \inf_{v_{m,k}\in\mathbb{R}^p,\, k=1,\ldots,K_m} \left\{ \sum_{t=1}^{n} \left(d[t]-\hat{d}_b[t]\right)^2 + \delta\|v_m\|^2 \right\} \le O\left(p\log^2(n)\right),

for any n, with a computational complexity upper bounded by O(n), where K_m denotes the number of leaf nodes in the hierarchical model m, M_n represents the set of all hierarchical models defined on the incremental decision tree at time n, M'_n represents the set of all hierarchical models with at most O(log(n)) leaves defined on the incremental decision tree at time n, i.e., M'_n ≜ {m ∈ M_n : K_m ≤ O(log(n))}, and v_m ≜ [v_{m,1};...;v_{m,K_m}].

This theorem indicates that the introduced algorithm can asymptotically and sequentially achieve the performance of any piecewise model in the set M'_n, i.e., the piecewise models having at most O(log(n)) leaves defined on the tree. In particular, over any unknown data length n, the performance of the piecewise models with O(log(n)) leaves can be sequentially achieved by the introduced algorithm with a regret upper bounded by O(p log²(n)). In this sense, we do not compare the performance of the introduced algorithm with a fixed class of regressors over any data length n. Instead, the regret of the introduced algorithm is defined with respect to a set of piecewise linear regressors, whose number of partitions is upper bounded by O(log(n)), i.e., the competition class grows as n increases. In the conventional tree based regression methods, the depth of the tree is set before processing starts and the performance of the regressor is highly sensitive with respect to the unknown data length. For example, if the depth of the tree is large whereas there are not enough data samples, then the piecewise model will be undertrained and yield an unsatisfactory performance. Similarly, if the depth of the tree is small whereas a huge number of data samples are available, then trees (and regressors) with higher depths (and finer regions) can be better trained. As shown in Theorem 1, the introduced algorithm elegantly and intrinsically makes such decisions and performs asymptotically as well as any piecewise regressor in the competition class that grows exponentially in n [25]. Such a significant performance is achieved with a computational complexity upper bounded by O(n), i.e., only linear in the data length, whereas the number of different piecewise models defined on the incremental decision tree can be in the order of 1.5^n [25]. Moreover, under certain regularity conditions the computational complexity of the algorithm is O(log(n)) as will be discussed in Remark 2. This theorem is an intermediate step to show that the introduced algorithm yields the desired performance guarantee in (1), and will be used to prove the next theorem.

Using Theorem 1, we introduce another theorem presenting the main result of the paper, where we define the performance of the introduced algorithm with respect to the class of twice differentiable functions as in (1).

Theorem 2: Let {d[t]}_{t≥1} and {x[t]}_{t≥1} be arbitrary, bounded, and real-valued sequences of data and regressor vectors, respectively. Let F be the class of all twice differentiable functions such that for any f ∈ F, |∂²f(x)/∂x_i∂x_j| ≤ D < ∞, i, j = 1,...,p, and we denote d̂_f[t] = f(x[t]). Then the algorithm d̂[t] given in Fig. 4 when applied to these data sequences yields

\sum_{t=1}^{n} \left(d[t]-\hat{d}[t]\right)^2 - \inf_{f\in F} \sum_{t=1}^{n} \left(d[t]-\hat{d}_f[t]\right)^2 \le o(p^2 n),

for any n, with a computational complexity upper bounded by O(n).

This theorem presents the nonlinear modeling power of the introduced algorithm. Specifically, it states that the introduced algorithm can asymptotically achieve the performance of the optimal twice differentiable function that is selected after observing the entire data in hindsight. Note that there are several kernel and neural network based sequential nonlinear regression algorithms [1], [2], [6] (which can be modeled via twice differentiable functions) whose computational complexities are similar to the introduced algorithm. However, the performances of such nonlinear models are only comparable with respect to their batch variants. On the other hand, we demonstrate the performance of the introduced algorithm with respect to an extremely large class of regressors without any statistical assumptions. In this sense, the performance of any regression algorithm that can be modeled by twice differentiable functions is asymptotically achievable by the introduced algorithm. Hence, the introduced algorithm yields a significantly more robust performance with respect to such conventional approaches in the literature as also illustrated in different experiments in Section V.

The proofs of Theorem 1, Theorem 2, and the construction of the algorithm are given in the following section.

IV. CONSTRUCTION OF THE ALGORITHM AND PROOFS OF THE THEOREMS

In this section, we first introduce a labeling to efficiently manage the hierarchical models and then describe the algorithm in its main lines. We next prove Theorem 1, where we also provide the complete construction of the algorithm. We then present a proof for Theorem 2, using the results of Theorem 1.

A. Notation

We first introduce a labeling for the tree nodes following [26]. The root node is labeled with an empty binary string λ and, assuming that a node has a label κ, where κ = ν_1...ν_l is a binary string of length l formed from letters ν_1,...,ν_l, we label its upper and lower children as κ1 and κ0, respectively.
Here, we emphasize that a string can only take its letters from the binary alphabet, i.e., ν ∈ {0,1}, where 0 refers to the lower child, and 1 refers to the upper child of a node. We also introduce another concept, i.e., the definition of the prefix of a string. We say that a string κ′ = ν′_1...ν′_{l′} is a prefix to string κ = ν_1...ν_l if l′ ≤ l and ν′_i = ν_i for all i = 1,...,l′, and the empty string λ is a prefix to all strings. Finally, we let P(κ) represent all prefixes to the string κ, i.e., P(κ) ≜ {κ_0,...,κ_l}, where l ≜ l(κ) is the length of the string κ, κ_i is the string with l(κ_i) = i, and κ_0 = λ is the empty string, such that the first i letters of the string κ form the string κ_i for i = 0,...,l. Letting L denote the set of leaf nodes for a given decision tree, each leaf node of the tree, i.e., κ ∈ L, is given a specific index α_κ ∈ {0,...,M−1} representing the number of regressor vectors that have fallen into R_κ. For presentation purposes, we consider M = 2 throughout the paper.
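A tiny sketch of this labeling, where nodes are plain binary strings, the root is the empty string, and P(κ) collects all prefixes of a label, i.e., the nodes on the path from the root:

```python
# Node labels as binary strings, following the notation of Section IV-A.
def children(kappa):
    return kappa + "0", kappa + "1"        # lower and upper child

def prefixes(kappa):
    # P(kappa) = {kappa_0, ..., kappa_l}, kappa_0 being the empty string (root)
    return [kappa[:i] for i in range(len(kappa) + 1)]

print(children("01"))      # ('010', '011')
print(prefixes("011"))     # ['', '0', '01', '011']
```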
B. Outline of the Algorithm

At time t = 0, the introduced algorithm starts with a single node (i.e., the root node) representing the entire regressor space. As the new data is observed, the proposed algorithm sequentially divides the regressor space into smaller disjoint regions according to the observed regressor vectors. In particular, each region is divided into subsequent child regions as soon as a new regressor vector has fallen into that region. In this incremental hierarchical structure, we assign an independent linear regressor to each node (i.e., to each region). Such a hierarchical structure (embedded with linear regressors) can define 1.5^n different piecewise linear models or partitions. We then combine the outputs of all these different piecewise models via a mixture of experts approach instead of committing to a single model. However, even for a small n, the number of piecewise models (i.e., experts) grows extremely rapidly (particularly, exponential in n). Hence, in order to perform this calculation in an efficient manner, we assign a weight to each node on the tree and present a method to calculate the final output using these weights with a significantly reduced computational complexity, i.e., logarithmic in n under certain regularity conditions.

We then compare the performance of the introduced algorithm with respect to the best batch piecewise model defined on the incremental decision tree. Our algorithm first suffers a "constructional regret" that arises from the adaptive construction of the incremental decision tree (since the finer piecewise models are not present at the beginning of the processing) and from the sequential combination of the outputs of all piecewise models (i.e., due to the mixture of experts approach). Second, each piecewise model suffers a "parameter regret" while sequentially learning the true regression parameters at each region. We provide deterministic upper bounds on these regrets and illustrate that the introduced algorithm is twice-universal, i.e., universal in both entire piecewise models (even though the finer models appear as n increases and are not used until then) and linear regression parameters.

C. Proof of Theorem 1 and Construction of the Algorithm

In this section, we describe the algorithm in detail and derive a regret upper bound with respect to the best batch piecewise model defined on the incremental decision tree.

Fig. 3: A sample evolution of the incremental decision tree, where the regressor space is one dimensional. The "×" marks on the regressor space represent the value of the regressor vector at that specific time instant. Light nodes are the ones having an index of 1, whereas the index of the dark nodes is 0.

Before the processing starts, i.e., at time t = 0, we begin with a single node, i.e., the root node λ, having index α_λ = 0. Then, we recursively construct the decision tree according to the following principle. For every time instant t > 0, we find the leaf node of the tree κ ∈ L such that x[t] ∈ R_κ. For this node, if we have α_κ = 0, we do not modify the tree but only increment this index by 1. On the other hand, if α_κ = 1, then we generate two children nodes κ0, κ1 for this node by dividing the region R_κ into two disjoint regions R_{κ0}, R_{κ1}, using the plane x_i = c, where i − 1 ≡ l(κ) (mod p) and c is the midpoint of the region R_κ along the ith dimension. For node κν with x[t] ∈ R_{κν} (i.e., the child node containing the current regressor vector), we set α_{κν} = 1 and the index of the other child is set to 0. The accumulated regressor vectors and the data in node κ are also transferred to its children to train a linear regressor in these child nodes.

As an example, in Fig. 3, we consider that the regressor space is one dimensional, i.e., [−A,A], and present a sample evolution of the tree. In the figure, the nodes having an index of 0 are shown as dark nodes, whereas the others are light nodes, and the regressor vectors are marked with ×'s in the one dimensional regressor space. For instance at time t = 2, we have a depth-1 tree, where we have two nodes 0 and 1 with corresponding regions R_0 = [−A,0], R_1 = [0,A], and α_0 = 1, α_1 = 0. Then, at time t = 3, we observe a regressor vector x[3] ∈ R_0 and divide this region into two disjoint regions using the x_1 = −A/2 line. We then find that x[3] ∈ R_{01}, hence set α_{01} = 1, whereas α_{00} = 0.

We assign an independent linear regressor to each node on the incremental decision tree. Each linear regressor is trained using only the information contained in its corresponding node. Hence, we can obtain different piecewise models by using a certain collection of these node regressors according to the hierarchical structure.
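A sketch of the region-splitting rule described above may help fix ideas. Regions are stored here as axis-aligned boxes and the node regressors are omitted for brevity; this is an illustrative fragment, not the full IDT of Fig. 4.

```python
import numpy as np

# Sketch of the incremental tree growth of Section IV-C: a leaf with index
# alpha = 1 is split at the midpoint of the dimension i with i-1 = l(kappa) mod p.
class Node:
    def __init__(self, label, lo, hi):
        self.label = label
        self.lo, self.hi = np.array(lo, float), np.array(hi, float)
        self.alpha, self.children = 0, None

    def contains(self, x):
        return bool(np.all(x >= self.lo) and np.all(x <= self.hi))

def grow(root, x):
    node = root
    while node.children is not None:                  # walk down to the active leaf
        node = next(c for c in node.children if c.contains(x))
    if node.alpha == 0:
        node.alpha = 1
        return
    i = len(node.label) % len(node.lo)                # split dimension
    c = 0.5 * (node.lo[i] + node.hi[i])               # midpoint along dimension i
    lo_hi = node.hi.copy(); lo_hi[i] = c              # lower child: x_i <= c
    hi_lo = node.lo.copy(); hi_lo[i] = c              # upper child: x_i >= c
    lower = Node(node.label + "0", node.lo, lo_hi)
    upper = Node(node.label + "1", hi_lo, node.hi)
    node.children = [lower, upper]
    child = lower if lower.contains(x) else upper
    child.alpha = 1                                   # the sibling keeps alpha = 0

A = 1.0
root = Node("", [-A], [A])
for xt in [np.array([0.4]), np.array([-0.7]), np.array([-0.6])]:
    grow(root, xt)   # mimics the kind of one dimensional evolution shown in Fig. 3
```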
Each such piecewise model suffers a parameter regret in order to sequentially learn the optimal linear regression parameters at each region that minimize the cumulative error. This issue is discussed towards the end of this section.

Using this incremental hierarchical structure with linear regressors at each region, the incremental decision tree can represent up to 1.5^n different piecewise linear models after observing data of length n. For example, in Fig. 3, at time t = 6, we have 5 different piecewise linear models (see Fig. 2), whereas at time t = 4, we have 3 different piecewise linear models. Each of these piecewise linear models can be used to perform the estimation task. However, we use a mixture of experts approach to combine the outputs of all piecewise linear models, instead of choosing a single one among them.

To this end, one can assign a performance dependent weight to each piecewise linear model defined on the incremental decision tree and combine their weighted outputs to obtain the final estimate of the algorithm [16], [27], [28]. In a conventional setting, such a mixture of experts approach is guaranteed to asymptotically achieve the performance of the best piecewise linear model defined on the tree [16], [27], [28]. However, in our framework, to achieve the performance of the best twice differentiable regression function, as t increases (i.e., we observe new data), the total number of different piecewise linear models can increase exponentially with t. In this sense, we have a highly dynamic optimization framework. For example, in Fig. 3, at time t = 4, we have 3 different piecewise linear models, hence calculate the final output of our algorithm as d̂[t] = w_1[t]d̂_1[t] + w_2[t]d̂_2[t] + w_3[t]d̂_3[t], where d̂_i[t] represents the output of the ith piecewise linear model and w_i[t] represents its weight. However, at time t = 6, we have 5 different piecewise linear models, i.e., d̂[t] = Σ_{i=1}^5 w_i[t]d̂_i[t], therefore the number of experts increases. Hence, not only does such a combination approach require the processing of the entire observed data at each time t (i.e., it results in a brute-force batch-to-online conversion), but it also cannot be practically implemented even for considerably short data sequences such as n = 100.

To elegantly solve this problem, we assign a weight to each node on the incremental decision tree, instead of using a conventional mixture of experts approach. In this way, we illustrate a method to calculate the original highly dynamic combination weights in an efficient manner, i.e., without requiring the processing of the entire data for each new sample, and with a significantly reduced computational complexity.

To accomplish this, to each leaf node κ ∈ L, we assign a performance dependent weight [26] as follows

P_\kappa(n) \triangleq \exp\left\{ -\frac{1}{2a} \sum_{t\le n:\, x[t]\in R_\kappa} \left(d[t]-\hat{d}_{m,k}[t]\right)^2 \right\},

where d̂_{m,k}[t] represents the linear regressor assigned to the kth node of the mth piecewise model and is constructed using the regressor introduced in [29] and discussed in (7). Then, we define the weight of an inner node κ ∉ L as follows [26]

P_\kappa(n) \triangleq \frac{1}{2} P_{\kappa 0}(n)\, P_{\kappa 1}(n) + \frac{1}{2} \exp\left\{ -\frac{1}{2a} \sum_{t\le n:\, x[t]\in R_\kappa} \left(d[t]-\hat{d}_{m,k}[t]\right)^2 \right\}.

Using these definitions, the weight of the root node λ can be constructed as follows

P_\lambda(n) = \sum_{m\in M_n} 2^{-B_m} P(n|m),

where

P(n|m) \triangleq \exp\left\{ -\frac{1}{2a} \sum_{t=1}^{n} \left(d[t]-\hat{d}_m[t]\right)^2 \right\}

represents the performance of a given partition m ∈ M_n over a data length of n, and B_m represents the number of bits required to represent the model m on the binary tree using a universal code [30].

Hence, the performance of the root node satisfies P_λ(n) ≥ 2^{−B_m} P(n|m) for any m ∈ M_n. That is,

-2a\ln(P_\lambda(n)) \le \min_{m\in M_n}\left\{ \sum_{t=1}^{n}\left(d[t]-\hat{d}_m[t]\right)^2 \right\} + 2a\ln(2)\log(n) + 4A^2 K_m \log(n), \qquad (2)

where the last line follows when we maximize B_m with respect to m ∈ M_n, and the regret term 4A²K_m log(n) follows due to the adaptive construction of the incremental decision tree. This upper bound corresponds to the constructional regret of our algorithm.

Hence, we have obtained a weighting assignment achieving the performance of the optimal piecewise linear model. We next introduce a sequential algorithm achieving P_λ(n). To this end, we first note that we have

P_\lambda(n) = \prod_{t=1}^{n} \frac{P_\lambda(t)}{P_\lambda(t-1)}. \qquad (3)

Now if we can demonstrate a sequential algorithm whose performance is greater than or equal to P_λ(t)/P_λ(t−1) for all t, we can conclude the proof. To this end, we present a sequential update from P_λ(t−1) to P_λ(t).

After the structural updates, i.e., the growth of the incremental decision tree, are completed, say at time t, we observe a regressor vector x[t] ∈ R_κ for some κ ∈ L. Then, we can compactly denote the weight of the root node at time t−1 as follows

P_\lambda(t-1) = \sum_{\kappa_i\in P(\kappa)} \pi_{\kappa_i}[t-1] \exp\left\{ -\frac{1}{2a} \sum_{t'<t:\, x[t']\in R_{\kappa_i}} \left(d[t']-\hat{d}_{\kappa_i}[t']\right)^2 \right\},

where d̂_κ[t] represents the output of the regressor for node κ, κ_i ∈ P(κ) is the string formed from the first i letters of κ = ν_1...ν_l, and π_{κ_i}[t] is recursively defined as follows

\pi_{\kappa_i}[t] \triangleq \begin{cases} \frac{1}{2}, & \text{if } i=0 \\ \frac{1}{2} P_{\kappa_{i-1}\nu_i^c}(t-1)\, \pi_{\kappa_{i-1}}[t], & \text{if } 1\le i\le l-1 \\ P_{\kappa_{i-1}\nu_i^c}(t-1)\, \pi_{\kappa_{i-1}}[t], & \text{if } i=l. \end{cases}
Since x[t] ∈ R_κ for some κ ∈ L, then after d[t] is revealed, the weight of the root node at time t can be calculated as follows

P_\lambda(t) = \sum_{\kappa_i\in P(\kappa)} \pi_{\kappa_i}[t-1] \exp\left\{ -\frac{1}{2a}\left(d[t]-\hat{d}_{\kappa_i}[t]\right)^2 \right\} \times \exp\left\{ -\frac{1}{2a} \sum_{t'<t:\, x[t']\in R_{\kappa_i}} \left(d[t']-\hat{d}_{\kappa_i}[t']\right)^2 \right\},

which results in

\frac{P_\lambda(t)}{P_\lambda(t-1)} = \sum_{\kappa_i\in P(\kappa)} \mu_{\kappa_i}[t-1] \exp\left\{ -\frac{1}{2a}\left(d[t]-\hat{d}_{\kappa_i}[t]\right)^2 \right\}, \qquad (4)

where

\mu_{\kappa_i}[t-1] \triangleq \frac{ \pi_{\kappa_i}[t-1] \exp\left\{ -\frac{1}{2a} \sum_{t'<t:\, x[t']\in R_{\kappa_i}} \left(d[t']-\hat{d}_{\kappa_i}[t']\right)^2 \right\} }{ P_\lambda(t-1) }.

We then focus on (4) and observe that we have Σ_{κ_i∈P(κ)} µ_{κ_i}[t−1] = 1, which means that if the second term in (4), i.e., f(d̂_{κ_i}[t]) ≜ exp{−(1/2a)(d[t]−d̂_{κ_i}[t])²}, is concave, then by Jensen's inequality, we can conclude that

\exp\left\{ -\frac{1}{2a}\left( d[t] - \sum_{\kappa_i\in P(\kappa)} \mu_{\kappa_i}[t-1]\, \hat{d}_{\kappa_i}[t] \right)^2 \right\} \ge P_\lambda(t\,|\,t-1). \qquad (5)

Since the function f(d̂_{κ_i}[t]) is concave when (d[t]−d̂_{κ_i}[t])² < a, and we have |d[t]| ≤ A, we have to set a ≥ 4A². Therefore, we obtain a sequential regressor in (5), whose performance is greater than or equal to the performance of the root node, and the final estimate of our algorithm is calculated as follows

\hat{d}[t] \triangleq \sum_{\kappa_i\in P(\kappa)} \mu_{\kappa_i}[t-1]\, \hat{d}_{\kappa_i}[t]. \qquad (6)

Hence, our algorithm can achieve the performance of the best piecewise linear model defined on the incremental tree with a constructional regret given in (2). In order to achieve the performance of the best "batch" piecewise linear model, the introduced algorithm also suffers a parameter regret while learning the true regression parameters at each region. An upper bound on this regret is calculated as follows.

Consider an arbitrary piecewise model defined on the incremental decision tree, say the mth model, having K_m disjoint regions R_1,...,R_{K_m} such that ∪_{k=1}^{K_m} R_k = [−A,A]^p. Then, a piecewise linear regressor can be constructed using the universal linear predictor of [29] in each region as d̂_m[t] = v_{m,k}^T[t] x[t], when x[t] ∈ R_k, with the regression parameters v_{m,k}[t] = (R_k[t] + δI)^{−1} p_k[t], where I represents the appropriate sized identity matrix, R_k[t] ≜ Σ_{t′≤t: x[t′]∈R_k} x[t′]x^T[t′], and p_k[t] ≜ Σ_{t′<t: x[t′]∈R_k} d[t′]x[t′]. The upper bound on the performance of this regressor can be calculated following similar lines to [29] and it is obtained as follows

\sum_{t=1}^{n}\left(d[t]-\hat{d}_m[t]\right)^2 - \min_{v_{m,k}\in\mathbb{R}^p,\, k=1,\ldots,K_m}\left\{ \sum_{t=1}^{n}\left(d[t]-\hat{d}_b[t]\right)^2 + \delta\|v_m\|^2 \right\} \le A^2 K_m\, p \ln(n/K_m) + O(1). \qquad (7)

We emphasize that in each region of a piecewise model, different learning algorithms, e.g., different linear regressors or nonlinear ones, from the broad literature can be used. Note that although the main contribution of the paper is the hierarchical organization and efficient management of these piecewise models, we also discuss the implementation of a piecewise linear model [29] into our framework for completeness.

Finally, we achieve an upper bound on the performance of the introduced algorithm with respect to the best batch piecewise linear model. Combining the results in (2) and (7), we obtain

\sum_{t=1}^{n}\left(d[t]-\hat{d}[t]\right)^2 \le \min_{m\in M_n}\left\{ \sum_{t=1}^{n}\left(d[t]-\hat{d}_m[t]\right)^2 \right\} + 2a\ln(2)\log(n) + 4A^2 K_m\log(n)
\le \min_{m\in M_n} \min_{v_{m,k}\in\mathbb{R}^p,\, k=1,\ldots,K_m}\left\{ \sum_{t=1}^{n}\left(d[t]-\hat{d}_b[t]\right)^2 + \delta\|v_m\|^2 \right\} + \underbrace{A^2 K_m\left(p\ln(n/K_m) + 4\log(n)\right) + 2a\ln(2)\log(n) + O(1)}_{\le\, O\left(p\log^2(n)\right)},

where the upper bound on the regret follows when K_m = O(log(n)). This proves the upper bound in Theorem 1 and concludes the construction of the algorithm. Before we conclude the proof, we finally discuss the computational complexity of the introduced algorithm in detail.

The computational complexity for the construction of the incremental decision tree is O(|P(κ)|), where κ represents a leaf node of the incremental decision tree (see lines 2−35 of the algorithm in Fig. 4 and note that |T_κ| ≤ |P(κ)|). The computational complexity of the sequential weighting method is O(|P(κ)|) (see (6) and lines 36−49 of the algorithm in Fig. 4). According to the incremental hierarchical partitioning method described, the number of light nodes on the tree (see Fig. 3) is t at time t, therefore we may observe a decision tree of depth n, i.e., |P(κ)| = n, in the worst-case scenario, e.g., when x[t] = [A,...,A]^T for all t. Hence, the computational complexity of the algorithm over a data length of n is upper bounded by O(n). Although theoretically the computational complexity of the algorithm is upper bounded by O(n), in many real life applications the regressor vectors converge to stationary distributions [16]. Hence, in such practical applications, the computational complexity of the algorithm can be upper bounded by O(log(n)) as discussed in Remark 2. We emphasize that in order to achieve the computational complexity O(log(n)), we do not require any statistical assumptions; instead it is sufficient that the regressor vectors are evenly (to some degree) distributed in the regressor space. This concludes the proof of Theorem 1. □
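A sketch of the per-region predictor described above: each region accumulates R_k[t] and p_k[t] and predicts with v_{m,k}[t] = (R_k[t] + δI)^{-1} p_k[t]. The exact time indexing of the paper (R_k including the current sample) is glossed over here, and a practical implementation would update the inverse recursively via the matrix inversion lemma rather than solving from scratch.

```python
import numpy as np

# Regularized least squares regressor kept inside one region of a piecewise model.
class RegionRegressor:
    def __init__(self, p, delta=1.0):
        self.R = np.zeros((p, p))    # sum of x x^T over samples in the region
        self.b = np.zeros(p)         # sum of d x over samples in the region
        self.delta = delta           # regularization parameter delta

    def predict(self, x):
        v = np.linalg.solve(self.R + self.delta * np.eye(len(x)), self.b)
        return float(v @ x)

    def update(self, x, d):
        self.R += np.outer(x, x)
        self.b += d * x

reg = RegionRegressor(p=2)
x, d = np.array([0.2, -0.4]), 0.1
print(reg.predict(x))   # 0.0 before any data is accumulated
reg.update(x, d)
```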
1: for t = 1 to n do
2:   % Find the set of nodes containing x[t]
3:   κ = λ
4:   S = κ
5:   while κ has children do
6:     κ = κν, where ν is the last letter of the child containing x[t].
7:     S = S + κ
8:   end while
9:   % Check the index of the leaf node κ: if α_κ = 0, tree remains the same.
10:  if α_κ = 0 then
11:    α_κ = α_κ + 1
12:    T_κ = T_κ + t
13:    % If α_κ = 1, create nodes κ0 and κ1.
14:  else
15:    % Train nodes κ0 and κ1.
16:    for all z ∈ T_κ do
17:      if x[z] ∈ R_κ0 then
18:        T_κ0 = T_κ0 + z
19:        L_κ0 = L_κ0 exp(−(d[z] − w_κ0^T x[z])²/2a)
20:        P_κ0 = L_κ0
21:        R_κ0 = R_κ0 + x[z]x[z]^T
22:        w_κ0 = w_κ0 + R_κ0\(x[z](d[z] − w_κ0^T x[z]))
23:      else
24:        % Do the similar for node κ1.
25:      end if
26:    end for
27:    for all κ ∈ S do
28:      P_κ = (P_κ0 P_κ1 + L_κ)/2
29:    end for
30:    % Find the child containing x[t] and perform tree updates.
31:    ν = 0, if x[t] ∈ R_κ0; ν = 1, otherwise.
32:    κ = κν
33:    S = S + κ
34:    α_κ = 1
35:  end if
36:  % Calculate combination weights and perform estimation.
37:  for all κ_i ∈ P(κ) do
38:    if κ_i = λ then
39:      π_κi = 1/2
40:    else if κ_i ∉ {λ, κ} then
41:      π_κi = P_{κ_{i−1}ν_i^c} π_{κ_{i−1}}/2
42:    else
43:      π_κi = P_{κ_{i−1}ν_i^c} π_{κ_{i−1}}
44:    end if
45:    µ_κi = π_κi L_κi/P_λ
46:    d̂_κi = w_κi^T x[t]
47:  end for
48:  d̂ = µ^T d̂
49:  e = d[t] − d̂
50:  % Perform algorithmic updates.
51:  for all κ_i ∈ P(κ) do
52:    L_κi = L_κi exp(−(d[t] − d̂_κi)²/(2a))
53:    if κ_i = κ then
54:      P_κi = L_κi
55:    else
56:      P_κi = (P_{κi0} P_{κi1} + L_κi)/2
57:    end if
58:    R_κi = R_κi + x[t]x[t]^T
59:    w_κi = w_κi + R_κi\(x[t](d[t] − d̂_κi))
60:  end for
61: end for

Fig. 4: The pseudocode of the Incremental Decision Tree (IDT) regressor.
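The estimation step in lines 37–48 of Fig. 4 can be transcribed compactly as follows. The node bookkeeping (the weights P and L, and the node outputs) is assumed to be maintained exactly as in the pseudocode; P_sibling[κ_i] stands for the weight P of the sibling of κ_i, i.e., P_{κ_{i−1}ν_i^c}. This is a sketch of that step only, not a full implementation.

```python
def combine_path(path_nodes, P_sibling, L, node_preds, P_root):
    """Combine node predictions along the path from the root to the active leaf."""
    pi, mu, d_hat = {}, {}, 0.0
    for i, kappa in enumerate(path_nodes):
        if i == 0:
            pi[kappa] = 0.5                                          # line 39
        elif i < len(path_nodes) - 1:
            pi[kappa] = 0.5 * P_sibling[kappa] * pi[path_nodes[i-1]] # line 41
        else:
            pi[kappa] = P_sibling[kappa] * pi[path_nodes[i-1]]       # line 43
        mu[kappa] = pi[kappa] * L[kappa] / P_root                    # line 45
        d_hat += mu[kappa] * node_preds[kappa]                       # lines 46, 48
    return d_hat, mu

# toy call with arbitrary bookkeeping values, path root -> "0" -> "01"
d_hat, mu = combine_path(["", "0", "01"],
                         P_sibling={"0": 0.8, "01": 0.7},
                         L={"": 0.9, "0": 0.85, "01": 0.8},
                         node_preds={"": 0.1, "0": 0.15, "01": 0.2},
                         P_root=0.9)
```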
Remark 1: Note that the algorithm in Fig. 4 achieves the performance of the best piecewise linear model having O(log(n)) partitions with a regret of O(p log²(n)). In the most generic case, i.e., for an arbitrary piecewise model m having O(K_m) partitions, the introduced algorithm still achieves a regret of O(pK_m log(n/K_m)). This indicates that for models having O(n) partitions, the introduced algorithm achieves a regret of O(pn), hence the performance of the piecewise model cannot be asymptotically achieved. However, we emphasize that no other algorithm can achieve a smaller regret than O(pn) [8], i.e., the introduced algorithm is optimal in a strong minimax sense. Intuitively, this lower bound can be justified by considering the case in which the regressor vector at time t falls into the tth region of the piecewise model.

Remark 2: As mentioned in Remark 1 (and as can also be observed in (7)), no algorithm can converge to the performance of the piecewise linear models having O(n) disjoint regions. Therefore, we can limit the maximum depth of the tree by O(log(t)) at each time t to achieve a low complexity implementation. With this limitation and according to the update rule of the tree, we can observe that while dividing a region into two disjoint regions, we may be forced to perform O(t) computations due to the accumulated regressor vectors (since we no longer have |T_κ| ≤ |P(κ)| but instead have |T_κ| ≤ t). However, since a regressor vector is processed by at most O(log(n)) nodes for any n, the average computational complexity of the update rule of the tree remains O(log(n)). Furthermore, the performance of this low complexity implementation will be asymptotically the same as the exact implementation provided that the regressor vectors are evenly distributed in the regressor space, i.e., they are not gathered around a considerably small neighborhood. This result follows when we multiply the tree construction regret in (2) by the total number of accumulated regressor vectors, whose order, according to the above condition, is upper bounded by o(n/log(n)).

Remark 3: We emphasize that the node indexes, i.e., the α_κ's, determine when to create finer regions. According to the described procedure, if a node at depth l is partitioned into smaller regions, then its ith predecessor, i.e., κ_i ∈ P(κ), has observed at least l−i different regressor vectors. Hence, a child node is created when coarser regions (i.e., predecessor nodes)
are sufficiently trained. In this sense, we introduce new nodes to the tree according to the current status of the tree as well as the most recent data. We also point out that, in this paper, we divide each region from its midpoint (see Fig. 3) to maintain universality. However, this process can also be performed in a data dependent manner, e.g., one can partition each region using the hyperplane that is perpendicular to the line joining two regressor vectors in that region. If there are more than two accumulated regressor vectors, then more advanced methods such as support vectors and anomaly detectors can be used to define a separator hyperplane. All these methods can be straightforwardly incorporated into our framework to produce different algorithms depending on the regression task.

D. Proof of Theorem 2

We begin our proof by emphasizing that the introduced algorithm converges to the best linear model in each region with a regret of O(p log²(n)) for any finite regression parameter v_m (since ||v_m|| ≤ δGp log(n)) as already proven in Theorem 1. Therefore, using any other linear model yields a higher regret. Hence, say we define a suboptimal affine model by applying Taylor's theorem to a twice differentiable function f ∈ F about the midpoint of each region. Let d̂_s[t] denote the prediction of this suboptimal affine regressor. Then, we have

\sum_{t=1}^{n}\left(d[t]-\hat{d}[t]\right)^2 \le \sum_{t=1}^{n}\left(d[t]-\hat{d}_s[t]\right)^2 + O\left(p\log^2(n)\right).

Now applying the mean value theorem with the Lagrange form of the remainder, we obtain

\sum_{t=1}^{n}\left(d[t]-\hat{d}_s[t]\right)^2 - \sum_{t=1}^{n}\left(d[t]-\hat{d}_f[t]\right)^2 \le 2A \sum_{t=1}^{n}\left\{ \sum_{i=1}^{p}\sum_{j=1}^{p} (x_i[t]-a_{\kappa,i})(x_j[t]-a_{\kappa,j}) \left.\frac{\partial^2 f(x)}{\partial x_i\,\partial x_j}\right|_{x=b} \right\},

for some m ∈ M'_n and b ∈ R_κ, where a_κ ≜ [a_{κ,1},...,a_{κ,p}]^T is the midpoint of the region R_κ. Maximizing this upper bound with respect to x we obtain

\sum_{t=1}^{n}\left(d[t]-\hat{d}[t]\right)^2 - \sum_{t=1}^{n}\left(d[t]-\hat{d}_f[t]\right)^2 \le 2ADp^2 n\, \frac{A^2}{O\left(\log^{2/p}(n)\right)} + O\left(p\log^2(n)\right) \le o(p^2 n).

This concludes the proof of Theorem 2. □

V. SIMULATIONS

In this section, we investigate the performance of the introduced algorithm with respect to various methods under several benchmark scenarios. Throughout the experiments, we denote the incremental decision tree algorithm of Theorem 1 by "IDT", the context tree weighting algorithm of [8] by "CTW", the linear regressor by "LR", the Volterra series regressor by "VSR" [6], the sliding window Multivariate Adaptive Regression Splines of [31], [32] by "MARS", and the Fourier nonlinear regressor of [19] by "FNR". The combination weights of the LR, VSR, and FNR are updated using the recursive least squares (RLS) algorithm [16]. Unless otherwise stated, the CTW algorithm has depth 2, the VSR, FNR, and MARS algorithms are second order, and the MARS algorithm uses 21 knots with a window length of 500 that shifts in every 200 samples.

| Algorithm | Computational Complexity |
|-----------|--------------------------|
| IDT       | O(p² log(n))             |
| CTW       | O(p²d)                   |
| LR        | O(p²)                    |
| VSR       | O(p^{2r})                |
| MARS      | O(rbw³)                  |
| FNR       | O((pr)^{2r})             |

TABLE I: Comparison of the computational complexities of the proposed algorithms with the corresponding update rules. In the table, p represents the dimensionality of the regressor space, d represents the depth of the trees in the respective algorithms, and r represents the order of the corresponding filters and algorithms. For the MARS algorithm (particularly, the fast MARS algorithm, cf. [32]), b represents the number of basis functions and w represents the window length.

In Table I, we provide the computational complexities of the proposed algorithms. We emphasize that although the computational complexity to create and run the incremental decision tree is O(log(n)), the overall computational complexity of the algorithm is O(p²log(n)) due to the universal linear regressors at each region. Particularly, since the universal linear regressor at each region has a computational complexity of O(p²), the overall computational complexity of O(p²log(n)) follows. However, this universal linear regressor can be straightforwardly replaced with any linear (or nonlinear) regressor in the literature. For example, if we use the LMS algorithm to update the parameters of the linear regressor instead of using the universal algorithm for this update, the computational complexity of the overall structure becomes O(plog(n)). Hence, although the computational complexity of the original IDT algorithm is O(log(n)), this computational complexity may increase according to the computational complexity of the node regressors.

In this section, we first illustrate the performances of the proposed algorithms for a synthetic piecewise linear model that does not match the modeling structure of any of the above algorithms. We then consider the prediction of chaotic signals (generated from Duffing and Tinkerbell maps) and well-known data sequences such as the Mackey-Glass sequence and Chua's circuit [7]. Finally, we consider the prediction of real life examples that can be found in various benchmark data set repositories such as [33], [34].
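For reference, the error measure reported in the experiments is taken here to be the accumulated squared error divided by the number of samples; this normalization is an assumption consistent with the "normalized accumulated squared error" curves of Fig. 5.

```python
import numpy as np

# Running normalized accumulated squared error of a sequential predictor.
def normalized_accumulated_error(d, d_hat):
    sq = (np.asarray(d, float) - np.asarray(d_hat, float)) ** 2
    return np.cumsum(sq) / np.arange(1, len(sq) + 1)
```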
Fig. 5: Normalized accumulated squared error performances for the piecewise linear model in (8) averaged over 10 trials.

Fig. 6: Evolution of the normalized cumulative node weights at the corresponding depths of the tree for the piecewise linear model in (8) averaged over 10 trials.
A. Synthetic Data

In this subsection, we consider the scenario where the desired data is generated by the following piecewise linear model

d[t] = \begin{cases} x_1[t]+x_2[t]+n[t], & \text{if } \|x[t]\|^2 \in [0,0.1]\cup[0.5,1] \\ -x_1[t]-x_2[t]+n[t], & \text{otherwise,} \end{cases} \qquad (8)

and x[t] = [x_1[t], x_2[t]]^T are sample functions of a jointly Gaussian process of mean [0,0]^T and covariance matrix I, and n[t] is a sample function from a zero mean white Gaussian process with variance 0.1. Note that the piecewise model in (8) has circular regions, which cannot be represented by hyperplanes or twice differentiable functions. Hence, the underlying relationship between the desired data and the regressor vectors cannot be exactly modeled using any of the proposed algorithms.
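A sketch reproducing the synthetic model in (8); the random seed is an illustrative assumption.

```python
import numpy as np

# Generate (x[t], d[t]) pairs according to (8): x[t] is standard bivariate
# Gaussian, n[t] is zero mean Gaussian noise with variance 0.1, and the sign of
# the linear model depends on whether ||x[t]||^2 falls in [0,0.1] or [0.5,1].
def generate_synthetic(n, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 2))                   # mean [0,0], covariance I
    noise = rng.normal(0.0, np.sqrt(0.1), size=n)     # variance 0.1
    r2 = np.sum(x ** 2, axis=1)
    sign = np.where((r2 <= 0.1) | ((r2 >= 0.5) & (r2 <= 1.0)), 1.0, -1.0)
    d = sign * (x[:, 0] + x[:, 1]) + noise
    return x, d

x, d = generate_synthetic(10000)
```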
In Fig. 5, we present the normalized accumulated squared errors of the proposed algorithms averaged over 10 trials. For this experiment, "CTW-2" and "CTW-6" show the performances of the CTW algorithm with depths 2 and 6, respectively. Since the performances of the LR and FNR algorithms are incomparable with the rest of the algorithms, they are not included in the figure for this experiment. Fig. 5 illustrates that even for a highly nonlinear system (8), our algorithm significantly outperforms the other algorithms. The normalized accumulated error of the introduced algorithm goes to the variance of the noise signal as n increases, unlike the rest of the algorithms, whose performances converge to the performance of their optimal batch variants as n increases. This observation can be seen in Fig. 5, where the normalized cumulative error of the IDT algorithm steadily decreases since the IDT algorithm creates finer regions as the observed data length increases. Hence, even for a highly nonlinear model such as the circular piecewise linear model in (8), which cannot be represented via hyperplanes, the IDT algorithm can well approximate this highly nonlinear relationship by incrementally introducing finer partitions as the observed data length increases.

Furthermore, even though the depth of the introduced algorithm is comparable with the CTW-6 algorithm over short data sequences, the performance of our algorithm is superior with respect to the CTW-6 algorithm. This is because the IDT algorithm intrinsically eliminates the extremely fine models at the early processing stages and introduces them whenever they are needed, unlike the CTW-6 algorithm. This procedure can be observed in Fig. 6, where the IDT algorithm introduces finer regions (i.e., nodes with higher depths) to the hierarchical model as the coarser regions become unsatisfactory. Since the universal algorithms such as CTW distribute a "budget" into numerous experts, as the number of experts increases, the performance of such algorithms deteriorates. On the other hand, the introduced algorithm intrinsically limits the number of experts according to the unknown data length at each iteration, hence we avoid such possible performance degradations as can be observed in Fig. 6.

B. Chaotic Data

In this subsection, we consider prediction of the chaotic signals generated from the Duffing and Tinkerbell maps. The Duffing map is generated by the following discrete time equation

x[t+1] = a\,x[t] - (x[t])^3 - b\,x[t-1], \qquad (9)

where we set a = 2.75 and b = 0.2 to produce the chaotic behavior [9], [35]. The Tinkerbell map is generated by the following discrete time equations

x[t+1] = (x[t])^2 - (y[t])^2 + a\,x[t] + b\,y[t] \qquad (10)
y[t+1] = 2\,x[t]\,y[t] + c\,x[t] + d\,y[t], \qquad (11)

where we set a = 0.9, b = −0.6013, c = 2, and d = 0.5 [8], [35]. We emphasize that these values are selected to generate the well-known chaotic behaviors of these attractors.
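A sketch generating the chaotic series in (9)–(11) with the parameter values quoted above; the initial conditions are illustrative assumptions.

```python
import numpy as np

# Duffing map of (9) and Tinkerbell map of (10)-(11) with the stated parameters.
def duffing(n, a=2.75, b=0.2, x0=0.1, x1=0.1):
    x = np.empty(n)
    x[0], x[1] = x0, x1                       # assumed initial conditions
    for t in range(1, n - 1):
        x[t + 1] = a * x[t] - x[t] ** 3 - b * x[t - 1]
    return x

def tinkerbell(n, a=0.9, b=-0.6013, c=2.0, d=0.5, x0=-0.72, y0=-0.64):
    x, y = np.empty(n), np.empty(n)
    x[0], y[0] = x0, y0                       # assumed initial conditions
    for t in range(n - 1):
        x[t + 1] = x[t] ** 2 - y[t] ** 2 + a * x[t] + b * y[t]
        y[t + 1] = 2 * x[t] * y[t] + c * x[t] + d * y[t]
    return x, y
```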
Fig. 7a and Fig. 7b show the normalized accumulated squared error performances of the proposed algorithms. We emphasize that due to the chaotic nature of the signals, we observe non-uniform curves in Fig. 7. Since the conventional nonlinear and piecewise linear regression algorithms commit to a priori partitioning and/or basis functions, their performances are limited by the performances of the optimal batch regressors using these prior partitionings and/or basis functions, as can be observed in Fig. 7. Hence, such prior selections result in fundamental performance limitations for these algorithms. For example, in the CTW algorithm, the partitioning of the regressor space is set before the processing