Predicting Nearly As Well As the Optimal Twice
Differentiable Regressor
N. Denizcan Vanli, Muhammed O. Sayin, and Suleyman S. Kozat, Senior Member, IEEE
Abstract—We study nonlinear regression of real valued data in an individual sequence manner, where we provide results that are guaranteed to hold without any statistical assumptions. We address the convergence and undertraining issues of conventional nonlinear regression methods and introduce an algorithm that elegantly mitigates these issues via an incremental hierarchical structure (i.e., via an incremental decision tree). Particularly, we present a piecewise linear (or nonlinear) regression algorithm that partitions the regressor space in a data driven manner and learns a linear model at each region. Unlike the conventional approaches, our algorithm gradually increases the number of disjoint partitions on the regressor space in a sequential manner according to the observed data. Through this data driven approach, our algorithm sequentially and asymptotically achieves the performance of the optimal twice differentiable regression function for any data sequence with an unknown and arbitrary length. The computational complexity of the introduced algorithm is only logarithmic in the data length under certain regularity conditions. We provide the explicit description of the algorithm and demonstrate the significant gains for the well-known benchmark real data sets and chaotic signals.

Index Terms—Online, nonlinear, regression, incremental decision tree.

This work was supported in part by the Turkish Academy of Sciences Outstanding Researcher Programme and in part by TUBITAK under Contract 112E161 and Contract 113E517.

The authors are with the Department of Electrical and Electronics Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey (e-mail: {vanli,sayin,kozat}@ee.bilkent.edu.tr).

I. INTRODUCTION

We study sequential nonlinear regression, where we aim to estimate or model a desired sequence {d[t]}_{t≥1} by using a sequence of regressor vectors {x[t]}_{t≥1}. In particular, we seek to find the relationship, if it exists, between these two sequences, which is assumed to be unknown, nonlinear, and possibly time varying. This generic nonlinear regression framework is extensively studied in the machine learning and signal processing literatures since it can model a wide range of real life applications by capturing the salient characteristics of underlying signals and systems [1]–[15]. In order to define and find this relationship between the desired sequence and regressor vectors, numerous methods such as neural networks, Volterra filters, and B-splines are used [1], [2], [6], [11], [16]–[19]. However, either these methods are extremely difficult to use in real life applications due to convergence issues, e.g., Volterra filters and B-splines, or it is quite hard to obtain a consistent performance in different scenarios, cf. [1]–[3], [6]–[8], [12]–[15], [20]–[22].

To this end, in this paper, we propose an algorithm that alleviates these issues by introducing hierarchical models that recursively and effectively partition the regressor space into subsequent regions in a data driven manner, where a different linear model is learned at each region. Unlike most of the nonlinear models, learning linear structures at each region can be efficiently managed. Hence, using this hierarchical piecewise model, we significantly mitigate the convergence and consistency issues. Furthermore, we prove that the resulting hierarchical piecewise model asymptotically achieves the performance of any twice differentiable regression function that is directly tuned to the underlying observations without any tuning of algorithmic parameters or any assumptions on the data (other than an upper bound on the magnitude). Since most of the nonlinear modeling functions of the regression algorithms in the literature, such as neural networks and Volterra filters, can be accurately represented by twice differentiable functions [1], [2], [6], [16], [18], [19], our algorithm readily performs asymptotically as well as such nonlinear learning algorithms.

In particular, the introduced method sequentially and recursively divides the space of the regressors into disjoint regions according to the amount of the data in each region, instead of committing to an a priori selected partition. In this sense, we avoid creating undertrained regions until a sufficient amount of data is observed. The nonlinear modeling power of the introduced algorithm is incremented (by consecutively partitioning the regressor space into smaller regions) as the observed data length increases. The introduced method adapts itself according to the observed data instead of relying on ad-hoc parameters that are set while initializing the algorithm. Thus, the introduced algorithm provides a significantly stronger modeling power with respect to the state-of-the-art methods in the literature as shown in our experiments.

We emphasize that piecewise linear regression using tree structures is extensively studied in the computational learning and signal processing literatures [7]–[9], [12]–[15], [20]–[24] due to its attractive convergence and consistency features. There exist several tree based algorithms that mitigate the overtraining problem by defining hierarchical piecewise models such as [8], [9], [12]–[15]. Although these methods achieve the performance of the best piecewise model defined on a tree, i.e., the best pruning of a tree, they only yield satisfactory performance when the initial partitioning of the regressor space is highly accurate or tuned to the underlying data (which is unknown or even time-varying). Furthermore, there are more recent algorithms such as [20] that achieve the performance of the optimal combination of all piecewise models defined on a tree that minimizes the accumulated loss. There are also methods that alleviate the overtraining
problem by learning the region boundaries [20] to minimize the regression error for a fixed depth tree with a computational complexity relatively greater compared to the ones in [8], [9], [12], [14], [15] (particularly, exponential in the depth of the tree). However, these algorithms can only provide a limited modeling power since the tree structure in these studies is fixed. Furthermore, the methods such as [20] can only learn the locally optimal region boundaries due to the highly nonlinear (and non-convex) optimization structure. Unlike these methods, the introduced algorithm sequentially increases its nonlinear modeling power according to the observed data and directly achieves the performance of the best twice differentiable regression function that minimizes the accumulated regression error. We also show that in order to achieve the performance of a finer piecewise model defined on a tree, it is not even necessary to create these piecewise models when initializing the algorithm. Hence, we do not train a piecewise model until a sufficient amount of data is observed, and show that the introduced algorithm, in this manner, does not suffer any asymptotical performance degradation. Therefore, unlike the relevant studies in the literature, in which undertrained (i.e., unnecessary) partitions are kept in the overall structure, our method intrinsically eliminates the unnecessarily finer partitions without any loss in asymptotical performance (i.e., we maintain universality).

Aside from such piecewise linear regression techniques based on hierarchical models, there are various different methods to introduce nonlinearity such as B-splines and Volterra series [1], [2], [6], [11], [17]–[19]. In these methods, the nonlinearity is usually introduced by modifying the basis functions to create polynomial estimators, e.g., in [19], the authors use trigonometric functions as their basis functions. We emphasize that these techniques can be straightforwardly incorporated into our framework by using these methods at each region in the introduced algorithm to obtain piecewise nonlinear regressors. Note that the performance of such methods, e.g., B-splines and Volterra series (and other various methods with different basis functions), is satisfactory when the data is generated using the underlying basis functions of the regressor. In real life applications, the underlying model that generates the data is usually unknown. Thus, the successful implementation of these methods significantly depends on the match (or mismatch) between the regressor structure and the underlying model generating the data. On the other hand, the introduced algorithm achieves the performance of any such regressor provided that its basis functions are twice differentiable. In this sense, unlike the conventional methods in the literature, whose performances are highly dependent on the selection of the basis functions, our method can well approximate these basis functions (and regressors formed by these basis functions) via piecewise models such that the performance difference with respect to the best such regressor asymptotically goes to zero in a strong individual sequence manner without any statistical assumptions.

The main contributions of this paper are as follows. We introduce a sequential piecewise linear regression algorithm i) that provides a significantly improved modeling power by adaptively increasing the number of partitions according to the observed data, ii) that is highly efficient in terms of the computational complexity as well as the error performance, and iii) whose performance converges to iii-a) the performance of the optimal twice differentiable function that is selected in hindsight and iii-b) the best piecewise linear model defined on the incremental decision tree, with guaranteed upper bounds without any statistical or structural assumptions on the desired data as well as on the regressor vectors (other than an upper bound on them). Hence, unlike the state-of-the-art approaches whose performances usually depend on the initial construction of the tree, we introduce a method to construct a decision tree, whose depth (and structure) is adaptively incremented (and adjusted) in a data dependent manner, which we call an incremental decision tree. Furthermore, the introduced algorithm achieves this superior performance only with a computational complexity O(log(n)) for any data length n, under certain regularity conditions. Even if these regularity conditions are not met, the introduced algorithm still achieves the performance of any twice differentiable regression function, however with a computational complexity linear in the data length.

The organization of the paper is as follows. We first describe the sequential piecewise linear regression problem in detail in Section II. We then introduce the main algorithm in Section III and prove that the performance of this algorithm is nearly as well as the best piecewise linear model that can be defined by the incremental decision tree in Section IV. Using this result, we also show that the introduced algorithm achieves the performance of the optimal twice differentiable function that is selected after observing the entire data before processing starts, i.e., non-causally. In Section V, we demonstrate the performance of the introduced algorithm through simulations and then conclude the paper with several remarks in Section VI.

II. PROBLEM DESCRIPTION

We study sequential nonlinear regression, where the aim is to estimate an unknown desired sequence {d[t]}_{t≥1} by using a sequence of regressor vectors {x[t]}_{t≥1}, where the desired sequence and the regressor vectors are real valued and bounded but otherwise arbitrary, i.e., d[t] ∈ ℝ, x[t] ≜ [x_1[t],...,x_p[t]]^T ∈ ℝ^p for an arbitrary integer p, and |d[t]|, |x_i[t]| < A < ∞ for all t and i = 1,...,p. We call the regressors "sequential" if, in order to estimate the desired data at time t, i.e., d[t], they only use the past information d[1],...,d[t−1] and the observed regressor vectors¹ x[1],...,x[t].

¹ All vectors are column vectors and denoted by boldface lower case letters. Matrices are denoted by boldface upper case letters. For a vector x, x^T is the ordinary transpose. We denote d_a^b ≜ {d[t]}_{t=a}^b.

In this framework, a piecewise linear model is constructed by dividing the regressor space into a union of disjoint regions, where in each region a linear model holds. As an example, suppose that the regressor space is parsed into K disjoint regions R_1,...,R_K such that ∪_{k=1}^K R_k = [−A,A]^p.
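A minimal sketch of such a fixed-partition piecewise linear model may clarify the setting before the tree construction is introduced. The equal-width split on the first coordinate and the LMS update below are illustrative assumptions, not the algorithm of this paper.

```python
import numpy as np

# Sketch of the fixed-partition piecewise linear model of Section II: the
# regressor space is split into K given regions, each holding its own linear
# model. The split rule and the LMS step size are illustrative assumptions.
class FixedPiecewiseLinear:
    def __init__(self, p, boundaries, mu=0.05):
        self.boundaries = boundaries                  # thresholds on x_1 (assumed 1-D split)
        self.v = np.zeros((len(boundaries) + 1, p))   # one weight vector per region
        self.mu = mu                                  # LMS step size (assumed)

    def region(self, x):
        return int(np.searchsorted(self.boundaries, x[0]))

    def predict(self, x):
        return float(self.v[self.region(x)] @ x)

    def update(self, x, d):
        k = self.region(x)
        e = d - self.v[k] @ x
        self.v[k] += self.mu * e * x                  # LMS step within the active region
        return e

# usage: sequentially predict d[t] from x[t]
model = FixedPiecewiseLinear(p=2, boundaries=[0.0])
x, d = np.array([0.3, -0.1]), 0.2
err = model.update(x, d)
```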
Fig. 1: The partitioning of a one dimensional regressor space, i.e., [−A,A], using a depth-2 full decision tree, where each node represents a portion of the regressor space.

Fig. 2: All different piecewise linear models that can be obtained using a depth-2 full decision tree, where the regressor space is one dimensional. These models are based on the partitioning shown in Fig. 1.
Given such a model, say model m, at each time t, the sequential linear² regressor predicts d[t] as d̂_m[t] = v_{m,k}^T[t] x[t] when x[t] ∈ R_k, where v_{m,k}[t] ∈ ℝ^p for all k = 1,...,K. These linear models assigned to each region can be trained independently using different adaptive methods such as the least mean squares (LMS) or the recursive least squares (RLS) algorithms.

² Note that affine models can also be represented as linear models by appending a 1 to x[t], where the dimension of the regressor space increases by one.

However, by directly partitioning the regressor space as ∪_{k=1}^K R_k = [−A,A]^p before the processing starts and optimizing only the internal parameters of the piecewise linear model, i.e., v_{m,k}[t], one significantly limits the performance of the overall regressor since we do not have any prior knowledge on the underlying desired signal. Therefore, instead of committing to a single piecewise linear model with a fixed and given partitioning, and performing optimization only over the internal linear regression parameters of this regressor, one can use a decision tree to partition the regressor space and try to achieve the performance of the best partitioning over the whole doubly exponential number of different models represented by this tree [25].

As an example, in Fig. 1, we partition the one dimensional regressor space [−A,A] using a depth-2 tree, where the regions R_1,...,R_4 correspond to disjoint intervals on the real line and the internal nodes are constructed using unions of these regions. In the generic case, for a depth-d full decision tree, there exist 2^d leaf nodes and 2^d − 1 internal nodes. Each node of the tree represents a portion of the regressor space such that the union of the regions represented by the leaf nodes is equal to the entire regressor space [−A,A]^p. Moreover, the region corresponding to each internal node is constructed by the union of the regions of its children. In this sense, we obtain 2^{d+1} − 1 different nodes (regions) on the depth-d decision tree (on the regressor space) and approximately 1.5^{2^d} different piecewise models that can be represented by certain collections of the regions represented by the nodes of the decision tree [25]. For example, we consider the same scenario as in Fig. 1, where we partition the one dimensional real space using a depth-2 tree. Then, as shown in Fig. 1, there are 7 different nodes on the depth-2 decision tree; and as shown in Fig. 2, a depth-2 tree defines 5 different piecewise partitions or models, where each of these models is constructed using certain unions of the nodes of the full depth decision tree.

We emphasize that given a decision tree of depth-d, the nonlinear modeling power of this tree is fixed and finite since there are only 2^{d+1} − 1 different regions (one for each node) and approximately 1.5^{2^d} different piecewise models (i.e., partitions) defined on this tree. Instead of introducing such a limitation, we recursively increment the depth of the decision tree as the data length increases. We call such a tree the "incremental decision tree" since the depth of the decision tree is incremented (and potentially goes to infinity) as the data length n increases, hence in a certain sense, we can achieve the modeling power of an infinite depth tree. As shown in Theorem 2, the piecewise linear models defined on the tree will converge to any unknown underlying twice differentiable model under certain regularity conditions as n increases.

To this end, we seek to find a sequential regression algorithm (whose estimate at time t is represented by d̂_s[t]) that, when applied to any sequence of data and regressor vectors, yields the following performance (i.e., regret) guarantee

\sum_{t=1}^{n} \left(d[t]-\hat{d}_s[t]\right)^2 - \inf_{f\in F} \sum_{t=1}^{n} \left(d[t]-\hat{d}_f[t]\right)^2 \le o(n), \qquad (1)

over any n, without the knowledge of n, where F represents the class of all twice differentiable functions, whose parameters are set in hindsight, i.e., after observing the entire data before processing starts, and d̂_f[t] represents the estimate of the twice differentiable function f ∈ F at time t. The relative accumulated error in (1) represents the performance difference of the introduced algorithm and the optimal batch twice differentiable regressor. Hence, an upper bound of o(n) in (1) implies that the algorithm d̂_s[t] sequentially and asymptotically converges to the performance of the regressor d̂_f[t], for any f ∈ F.

III. NONLINEAR REGRESSION VIA INCREMENTAL DECISION TREES

In this section, we introduce the main results of the paper. Particularly, we first show that the introduced sequential piecewise linear regression algorithm asymptotically achieves the performance of the best piecewise linear model defined on the incremental decision tree (with possibly infinite depth) with the optimal regression parameters at each region that minimize the accumulated loss. We then use this result to prove that the introduced algorithm asymptotically achieves the performance of any twice differentiable regression function. We provide the algorithmic details and the construction of the algorithm in Section IV.
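As a quick check of the partition count quoted above, the number of piecewise models (prunings) represented by a depth-d full binary tree satisfies the recursion P(d) = P(d−1)² + 1 with P(0) = 1, which grows roughly like 1.5^{2^d}. A short sketch verifying this (the recursion itself is a standard counting argument, stated here as an assumption consistent with the figures quoted in the text):

```python
# Number of piecewise models defined on a depth-d full binary tree:
# either keep a node's region whole, or split it and combine the models
# of its two subtrees, giving P(d) = P(d-1)**2 + 1 with P(0) = 1.
def num_partitions(d):
    count = 1                      # a single leaf defines one model: itself
    for _ in range(d):
        count = count ** 2 + 1
    return count

print([num_partitions(d) for d in range(5)])   # [1, 2, 5, 26, 677]
print(1.5 ** (2 ** 4))                         # ~656.8, close to 677
```

Note that num_partitions(2) = 5 matches the five models of Fig. 2.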
Theorem 1: Let {d[t]}_{t≥1} and {x[t]}_{t≥1} be arbitrary, bounded, and real-valued sequences of data and regressor vectors, respectively. Then the algorithm d̂[t] (given in Fig. 4) when applied to these data sequences yields

\sum_{t=1}^{n} \left(d[t]-\hat{d}[t]\right)^2 - \min_{m\in M'_n} \inf_{v_{m,k}\in\mathbb{R}^p,\, k=1,\ldots,K_m} \left\{ \sum_{t=1}^{n} \left(d[t]-\hat{d}_b[t]\right)^2 + \delta\|v_m\|^2 \right\} \le O\left(p\log^2(n)\right),

for any n, with a computational complexity upper bounded by O(n), where K_m denotes the number of leaf nodes in the hierarchical model m, M_n represents the set of all hierarchical models defined on the incremental decision tree at time n, M'_n represents the set of all hierarchical models with at most O(log(n)) leaves defined on the incremental decision tree at time n, i.e., M'_n ≜ {m ∈ M_n : K_m ≤ O(log(n))}, and v_m ≜ [v_{m,1};...;v_{m,K_m}].

This theorem indicates that the introduced algorithm can asymptotically and sequentially achieve the performance of any piecewise model in the set M'_n, i.e., the piecewise models having at most O(log(n)) leaves defined on the tree. In particular, over any unknown data length n, the performance of the piecewise models with O(log(n)) leaves can be sequentially achieved by the introduced algorithm with a regret upper bounded by O(p log²(n)). In this sense, we do not compare the performance of the introduced algorithm with a fixed class of regressors over any data length n. Instead, the regret of the introduced algorithm is defined with respect to a set of piecewise linear regressors, whose number of partitions is upper bounded by O(log(n)), i.e., the competition class grows as n increases. In the conventional tree based regression methods, the depth of the tree is set before processing starts and the performance of the regressor is highly sensitive with respect to the unknown data length. For example, if the depth of the tree is large whereas there are not enough data samples, then the piecewise model will be undertrained and yield an unsatisfactory performance. Similarly, if the depth of the tree is small whereas a huge number of data samples are available, then trees (and regressors) with higher depths (and finer regions) can be better trained. As shown in Theorem 1, the introduced algorithm elegantly and intrinsically makes such decisions and performs asymptotically as well as any piecewise regressor in the competition class that grows exponentially in n [25]. Such a significant performance is achieved with a computational complexity upper bounded by O(n), i.e., only linear in the data length, whereas the number of different piecewise models defined on the incremental decision tree can be in the order of 1.5^n [25]. Moreover, under certain regularity conditions the computational complexity of the algorithm is O(log(n)) as will be discussed in Remark 2. This theorem is an intermediate step to show that the introduced algorithm yields the desired performance guarantee in (1), and will be used to prove the next theorem.

Using Theorem 1, we introduce another theorem presenting the main result of the paper, where we define the performance of the introduced algorithm with respect to the class of twice differentiable functions as in (1).

Theorem 2: Let {d[t]}_{t≥1} and {x[t]}_{t≥1} be arbitrary, bounded, and real-valued sequences of data and regressor vectors, respectively. Let F be the class of all twice differentiable functions such that for any f ∈ F, |∂²f(x)/∂x_i∂x_j| ≤ D < ∞, i, j = 1,...,p, and we denote d̂_f[t] = f(x[t]). Then the algorithm d̂[t] given in Fig. 4 when applied to these data sequences yields

\sum_{t=1}^{n} \left(d[t]-\hat{d}[t]\right)^2 - \inf_{f\in F} \sum_{t=1}^{n} \left(d[t]-\hat{d}_f[t]\right)^2 \le o(p^2 n),

for any n, with a computational complexity upper bounded by O(n).

This theorem presents the nonlinear modeling power of the introduced algorithm. Specifically, it states that the introduced algorithm can asymptotically achieve the performance of the optimal twice differentiable function that is selected after observing the entire data in hindsight. Note that there are several kernel and neural network based sequential nonlinear regression algorithms [1], [2], [6] (which can be modeled via twice differentiable functions) whose computational complexities are similar to the introduced algorithm. However, the performances of such nonlinear models are only comparable with respect to their batch variants. On the other hand, we demonstrate the performance of the introduced algorithm with respect to an extremely large class of regressors without any statistical assumptions. In this sense, the performance of any regression algorithm that can be modeled by twice differentiable functions is asymptotically achievable by the introduced algorithm. Hence, the introduced algorithm yields a significantly more robust performance with respect to such conventional approaches in the literature as also illustrated in different experiments in Section V.

The proofs of Theorem 1, Theorem 2, and the construction of the algorithm are given in the following section.

IV. CONSTRUCTION OF THE ALGORITHM AND PROOFS OF THE THEOREMS

In this section, we first introduce a labeling to efficiently manage the hierarchical models and then describe the algorithm in its main lines. We next prove Theorem 1, where we also provide the complete construction of the algorithm. We then present a proof for Theorem 2, using the results of Theorem 1.

A. Notation

We first introduce a labeling for the tree nodes following [26]. The root node is labeled with an empty binary string λ and, assuming that a node has a label κ, where κ = ν_1...ν_l is a binary string of length l formed from letters ν_1,...,ν_l, we label its upper and lower children as κ1 and κ0, respectively.
Here, we emphasize that a string can only take its letters from the binary alphabet, i.e., ν ∈ {0,1}, where 0 refers to the lower child, and 1 refers to the upper child of a node. We also introduce another concept, i.e., the definition of the prefix of a string. We say that a string κ′ = ν′_1...ν′_{l′} is a prefix to string κ = ν_1...ν_l if l′ ≤ l and ν′_i = ν_i for all i = 1,...,l′, and the empty string λ is a prefix to all strings. Finally, we let P(κ) represent all prefixes to the string κ, i.e., P(κ) ≜ {κ_0,...,κ_l}, where l ≜ l(κ) is the length of the string κ, κ_i is the string with l(κ_i) = i, and κ_0 = λ is the empty string, such that the first i letters of the string κ form the string κ_i for i = 0,...,l. Letting L denote the set of leaf nodes for a given decision tree, each leaf node of the tree, i.e., κ ∈ L, is given a specific index α_κ ∈ {0,...,M−1} representing the number of regressor vectors that have fallen into R_κ. For presentation purposes, we consider M = 2 throughout the paper.
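A tiny sketch of this labeling, where nodes are plain binary strings, the root is the empty string, and P(κ) collects all prefixes of a label, i.e., the nodes on the path from the root:

```python
# Node labels as binary strings, following the notation of Section IV-A.
def children(kappa):
    return kappa + "0", kappa + "1"        # lower and upper child

def prefixes(kappa):
    # P(kappa) = {kappa_0, ..., kappa_l}, kappa_0 being the empty string (root)
    return [kappa[:i] for i in range(len(kappa) + 1)]

print(children("01"))      # ('010', '011')
print(prefixes("011"))     # ['', '0', '01', '011']
```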
B. Outline of the Algorithm

At time t = 0, the introduced algorithm starts with a single node (i.e., the root node) representing the entire regressor space. As the new data is observed, the proposed algorithm sequentially divides the regressor space into smaller disjoint regions according to the observed regressor vectors. In particular, each region is divided into subsequent child regions as soon as a new regressor vector has fallen into that region. In this incremental hierarchical structure, we assign an independent linear regressor to each node (i.e., to each region). Such a hierarchical structure (embedded with linear regressors) can define 1.5^n different piecewise linear models or partitions. We then combine the outputs of all these different piecewise models via a mixture of experts approach instead of committing to a single model. However, even for a small n, the number of piecewise models (i.e., experts) grows extremely rapidly (particularly, exponential in n). Hence, in order to perform this calculation in an efficient manner, we assign a weight to each node on the tree and present a method to calculate the final output using these weights with a significantly reduced computational complexity, i.e., logarithmic in n under certain regularity conditions.

We then compare the performance of the introduced algorithm with respect to the best batch piecewise model defined on the incremental decision tree. Our algorithm first suffers a "constructional regret" that arises from the adaptive construction of the incremental decision tree (since the finer piecewise models are not present at the beginning of the processing) and from the sequential combination of the outputs of all piecewise models (i.e., due to the mixture of experts approach). Second, each piecewise model suffers a "parameter regret" while sequentially learning the true regression parameters at each region. We provide deterministic upper bounds on these regrets and illustrate that the introduced algorithm is twice-universal, i.e., universal in both entire piecewise models (even though the finer models appear as n increases and are not used until then) and linear regression parameters.

C. Proof of Theorem 1 and Construction of the Algorithm

In this section, we describe the algorithm in detail and derive a regret upper bound with respect to the best batch piecewise model defined on the incremental decision tree.

Fig. 3: A sample evolution of the incremental decision tree, where the regressor space is one dimensional. The "×" marks on the regressor space represent the value of the regressor vector at that specific time instant. Light nodes are the ones having an index of 1, whereas the index of the dark nodes is 0.

Before the processing starts, i.e., at time t = 0, we begin with a single node, i.e., the root node λ, having index α_λ = 0. Then, we recursively construct the decision tree according to the following principle. For every time instant t > 0, we find the leaf node of the tree κ ∈ L such that x[t] ∈ R_κ. For this node, if we have α_κ = 0, we do not modify the tree but only increment this index by 1. On the other hand, if α_κ = 1, then we generate two children nodes κ0, κ1 for this node by dividing the region R_κ into two disjoint regions R_{κ0}, R_{κ1}, using the plane x_i = c, where i − 1 ≡ l(κ) (mod p) and c is the midpoint of the region R_κ along the ith dimension. For node κν with x[t] ∈ R_{κν} (i.e., the child node containing the current regressor vector), we set α_{κν} = 1 and the index of the other child is set to 0. The accumulated regressor vectors and the data in node κ are also transferred to its children to train a linear regressor in these child nodes.

As an example, in Fig. 3, we consider that the regressor space is one dimensional, i.e., [−A,A], and present a sample evolution of the tree. In the figure, the nodes having an index of 0 are shown as dark nodes, whereas the others are light nodes, and the regressor vectors are marked with ×'s in the one dimensional regressor space. For instance at time t = 2, we have a depth-1 tree, where we have two nodes 0 and 1 with corresponding regions R_0 = [−A,0], R_1 = [0,A], and α_0 = 1, α_1 = 0. Then, at time t = 3, we observe a regressor vector x[3] ∈ R_0 and divide this region into two disjoint regions using the x_1 = −A/2 line. We then find that x[3] ∈ R_{01}, hence set α_{01} = 1, whereas α_{00} = 0.

We assign an independent linear regressor to each node on the incremental decision tree. Each linear regressor is trained using only the information contained in its corresponding node. Hence, we can obtain different piecewise models by using a certain collection of these node regressors according to the hierarchical structure.
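A sketch of the region-splitting rule described above may help fix ideas. Regions are stored here as axis-aligned boxes and the node regressors are omitted for brevity; this is an illustrative fragment, not the full IDT of Fig. 4.

```python
import numpy as np

# Sketch of the incremental tree growth of Section IV-C: a leaf with index
# alpha = 1 is split at the midpoint of the dimension i with i-1 = l(kappa) mod p.
class Node:
    def __init__(self, label, lo, hi):
        self.label = label
        self.lo, self.hi = np.array(lo, float), np.array(hi, float)
        self.alpha, self.children = 0, None

    def contains(self, x):
        return bool(np.all(x >= self.lo) and np.all(x <= self.hi))

def grow(root, x):
    node = root
    while node.children is not None:                  # walk down to the active leaf
        node = next(c for c in node.children if c.contains(x))
    if node.alpha == 0:
        node.alpha = 1
        return
    i = len(node.label) % len(node.lo)                # split dimension
    c = 0.5 * (node.lo[i] + node.hi[i])               # midpoint along dimension i
    lo_hi = node.hi.copy(); lo_hi[i] = c              # lower child: x_i <= c
    hi_lo = node.lo.copy(); hi_lo[i] = c              # upper child: x_i >= c
    lower = Node(node.label + "0", node.lo, lo_hi)
    upper = Node(node.label + "1", hi_lo, node.hi)
    node.children = [lower, upper]
    child = lower if lower.contains(x) else upper
    child.alpha = 1                                   # the sibling keeps alpha = 0

A = 1.0
root = Node("", [-A], [A])
for xt in [np.array([0.4]), np.array([-0.7]), np.array([-0.6])]:
    grow(root, xt)   # mimics the kind of one dimensional evolution shown in Fig. 3
```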
Each such piecewise model suffers a parameter regret in order to sequentially learn the optimal linear regression parameters at each region that minimize the cumulative error. This issue is discussed towards the end of this section.

Using this incremental hierarchical structure with linear regressors at each region, the incremental decision tree can represent up to 1.5^n different piecewise linear models after observing data of length n. For example, in Fig. 3, at time t = 6, we have 5 different piecewise linear models (see Fig. 2), whereas at time t = 4, we have 3 different piecewise linear models. Each of these piecewise linear models can be used to perform the estimation task. However, we use a mixture of experts approach to combine the outputs of all piecewise linear models, instead of choosing a single one among them.

To this end, one can assign a performance dependent weight to each piecewise linear model defined on the incremental decision tree and combine their weighted outputs to obtain the final estimate of the algorithm [16], [27], [28]. In a conventional setting, such a mixture of experts approach is guaranteed to asymptotically achieve the performance of the best piecewise linear model defined on the tree [16], [27], [28]. However, in our framework, to achieve the performance of the best twice differentiable regression function, as t increases (i.e., we observe new data), the total number of different piecewise linear models can increase exponentially with t. In this sense, we have a highly dynamic optimization framework. For example, in Fig. 3, at time t = 4, we have 3 different piecewise linear models, hence calculate the final output of our algorithm as d̂[t] = w_1[t]d̂_1[t] + w_2[t]d̂_2[t] + w_3[t]d̂_3[t], where d̂_i[t] represents the output of the ith piecewise linear model and w_i[t] represents its weight. However, at time t = 6, we have 5 different piecewise linear models, i.e., d̂[t] = Σ_{i=1}^5 w_i[t]d̂_i[t], therefore the number of experts increases. Hence, not only does such a combination approach require the processing of the entire observed data at each time t (i.e., it results in a brute-force batch-to-online conversion), but it also cannot be practically implemented even for considerably short data sequences such as n = 100.

To elegantly solve this problem, we assign a weight to each node on the incremental decision tree, instead of using a conventional mixture of experts approach. In this way, we illustrate a method to calculate the original highly dynamic combination weights in an efficient manner, i.e., without requiring the processing of the entire data for each new sample, and with a significantly reduced computational complexity.

To accomplish this, to each leaf node κ ∈ L, we assign a performance dependent weight [26] as follows

P_\kappa(n) \triangleq \exp\left\{ -\frac{1}{2a} \sum_{t\le n:\, x[t]\in R_\kappa} \left(d[t]-\hat{d}_{m,k}[t]\right)^2 \right\},

where d̂_{m,k}[t] represents the linear regressor assigned to the kth node of the mth piecewise model and is constructed using the regressor introduced in [29] and discussed in (7). Then, we define the weight of an inner node κ ∉ L as follows [26]

P_\kappa(n) \triangleq \frac{1}{2} P_{\kappa 0}(n)\, P_{\kappa 1}(n) + \frac{1}{2} \exp\left\{ -\frac{1}{2a} \sum_{t\le n:\, x[t]\in R_\kappa} \left(d[t]-\hat{d}_{m,k}[t]\right)^2 \right\}.

Using these definitions, the weight of the root node λ can be constructed as follows

P_\lambda(n) = \sum_{m\in M_n} 2^{-B_m} P(n|m),

where

P(n|m) \triangleq \exp\left\{ -\frac{1}{2a} \sum_{t=1}^{n} \left(d[t]-\hat{d}_m[t]\right)^2 \right\}

represents the performance of a given partition m ∈ M_n over a data length of n, and B_m represents the number of bits required to represent the model m on the binary tree using a universal code [30].

Hence, the performance of the root node satisfies P_λ(n) ≥ 2^{−B_m} P(n|m) for any m ∈ M_n. That is,

-2a\ln(P_\lambda(n)) \le \min_{m\in M_n}\left\{ \sum_{t=1}^{n}\left(d[t]-\hat{d}_m[t]\right)^2 \right\} + 2a\ln(2)\log(n) + 4A^2 K_m \log(n), \qquad (2)

where the last line follows when we maximize B_m with respect to m ∈ M_n, and the regret term 4A²K_m log(n) follows due to the adaptive construction of the incremental decision tree. This upper bound corresponds to the constructional regret of our algorithm.

Hence, we have obtained a weighting assignment achieving the performance of the optimal piecewise linear model. We next introduce a sequential algorithm achieving P_λ(n). To this end, we first note that we have

P_\lambda(n) = \prod_{t=1}^{n} \frac{P_\lambda(t)}{P_\lambda(t-1)}. \qquad (3)

Now if we can demonstrate a sequential algorithm whose performance is greater than or equal to P_λ(t)/P_λ(t−1) for all t, we can conclude the proof. To this end, we present a sequential update from P_λ(t−1) to P_λ(t).

After the structural updates, i.e., the growth of the incremental decision tree, are completed, say at time t, we observe a regressor vector x[t] ∈ R_κ for some κ ∈ L. Then, we can compactly denote the weight of the root node at time t−1 as follows

P_\lambda(t-1) = \sum_{\kappa_i\in P(\kappa)} \pi_{\kappa_i}[t-1] \exp\left\{ -\frac{1}{2a} \sum_{t'<t:\, x[t']\in R_{\kappa_i}} \left(d[t']-\hat{d}_{\kappa_i}[t']\right)^2 \right\},

where d̂_κ[t] represents the output of the regressor for node κ, κ_i ∈ P(κ) is the string formed from the first i letters of κ = ν_1...ν_l, and π_{κ_i}[t] is recursively defined as follows

\pi_{\kappa_i}[t] \triangleq \begin{cases} \frac{1}{2}, & \text{if } i=0 \\ \frac{1}{2} P_{\kappa_{i-1}\nu_i^c}(t-1)\, \pi_{\kappa_{i-1}}[t], & \text{if } 1\le i\le l-1 \\ P_{\kappa_{i-1}\nu_i^c}(t-1)\, \pi_{\kappa_{i-1}}[t], & \text{if } i=l. \end{cases}
Since x[t] ∈ R_κ for some κ ∈ L, then after d[t] is revealed, the weight of the root node at time t can be calculated as follows

P_\lambda(t) = \sum_{\kappa_i\in P(\kappa)} \pi_{\kappa_i}[t-1] \exp\left\{ -\frac{1}{2a}\left(d[t]-\hat{d}_{\kappa_i}[t]\right)^2 \right\} \times \exp\left\{ -\frac{1}{2a} \sum_{t'<t:\, x[t']\in R_{\kappa_i}} \left(d[t']-\hat{d}_{\kappa_i}[t']\right)^2 \right\},

which results in

\frac{P_\lambda(t)}{P_\lambda(t-1)} = \sum_{\kappa_i\in P(\kappa)} \mu_{\kappa_i}[t-1] \exp\left\{ -\frac{1}{2a}\left(d[t]-\hat{d}_{\kappa_i}[t]\right)^2 \right\}, \qquad (4)

where

\mu_{\kappa_i}[t-1] \triangleq \frac{ \pi_{\kappa_i}[t-1] \exp\left\{ -\frac{1}{2a} \sum_{t'<t:\, x[t']\in R_{\kappa_i}} \left(d[t']-\hat{d}_{\kappa_i}[t']\right)^2 \right\} }{ P_\lambda(t-1) }.

We then focus on (4) and observe that we have Σ_{κ_i∈P(κ)} µ_{κ_i}[t−1] = 1, which means that if the second term in (4), i.e., f(d̂_{κ_i}[t]) ≜ exp{−(1/2a)(d[t]−d̂_{κ_i}[t])²}, is concave, then by Jensen's inequality, we can conclude that

\exp\left\{ -\frac{1}{2a}\left( d[t] - \sum_{\kappa_i\in P(\kappa)} \mu_{\kappa_i}[t-1]\, \hat{d}_{\kappa_i}[t] \right)^2 \right\} \ge P_\lambda(t\,|\,t-1). \qquad (5)

Since the function f(d̂_{κ_i}[t]) is concave when (d[t]−d̂_{κ_i}[t])² < a, and we have |d[t]| ≤ A, we have to set a ≥ 4A². Therefore, we obtain a sequential regressor in (5), whose performance is greater than or equal to the performance of the root node, and the final estimate of our algorithm is calculated as follows

\hat{d}[t] \triangleq \sum_{\kappa_i\in P(\kappa)} \mu_{\kappa_i}[t-1]\, \hat{d}_{\kappa_i}[t]. \qquad (6)

Hence, our algorithm can achieve the performance of the best piecewise linear model defined on the incremental tree with a constructional regret given in (2). In order to achieve the performance of the best "batch" piecewise linear model, the introduced algorithm also suffers a parameter regret while learning the true regression parameters at each region. An upper bound on this regret is calculated as follows.

Consider an arbitrary piecewise model defined on the incremental decision tree, say the mth model, having K_m disjoint regions R_1,...,R_{K_m} such that ∪_{k=1}^{K_m} R_k = [−A,A]^p. Then, a piecewise linear regressor can be constructed using the universal linear predictor of [29] in each region as d̂_m[t] = v_{m,k}^T[t] x[t], when x[t] ∈ R_k, with the regression parameters v_{m,k}[t] = (R_k[t] + δI)^{−1} p_k[t], where I represents the appropriate sized identity matrix, R_k[t] ≜ Σ_{t′≤t: x[t′]∈R_k} x[t′]x^T[t′], and p_k[t] ≜ Σ_{t′<t: x[t′]∈R_k} d[t′]x[t′]. The upper bound on the performance of this regressor can be calculated following similar lines to [29] and it is obtained as follows

\sum_{t=1}^{n}\left(d[t]-\hat{d}_m[t]\right)^2 - \min_{v_{m,k}\in\mathbb{R}^p,\, k=1,\ldots,K_m}\left\{ \sum_{t=1}^{n}\left(d[t]-\hat{d}_b[t]\right)^2 + \delta\|v_m\|^2 \right\} \le A^2 K_m\, p \ln(n/K_m) + O(1). \qquad (7)

We emphasize that in each region of a piecewise model, different learning algorithms, e.g., different linear regressors or nonlinear ones, from the broad literature can be used. Note that although the main contribution of the paper is the hierarchical organization and efficient management of these piecewise models, we also discuss the implementation of a piecewise linear model [29] into our framework for completeness.

Finally, we achieve an upper bound on the performance of the introduced algorithm with respect to the best batch piecewise linear model. Combining the results in (2) and (7), we obtain

\sum_{t=1}^{n}\left(d[t]-\hat{d}[t]\right)^2 \le \min_{m\in M_n}\left\{ \sum_{t=1}^{n}\left(d[t]-\hat{d}_m[t]\right)^2 \right\} + 2a\ln(2)\log(n) + 4A^2 K_m\log(n)
\le \min_{m\in M_n} \min_{v_{m,k}\in\mathbb{R}^p,\, k=1,\ldots,K_m}\left\{ \sum_{t=1}^{n}\left(d[t]-\hat{d}_b[t]\right)^2 + \delta\|v_m\|^2 \right\} + \underbrace{A^2 K_m\left(p\ln(n/K_m) + 4\log(n)\right) + 2a\ln(2)\log(n) + O(1)}_{\le\, O\left(p\log^2(n)\right)},

where the upper bound on the regret follows when K_m = O(log(n)). This proves the upper bound in Theorem 1 and concludes the construction of the algorithm. Before we conclude the proof, we finally discuss the computational complexity of the introduced algorithm in detail.

The computational complexity for the construction of the incremental decision tree is O(|P(κ)|), where κ represents a leaf node of the incremental decision tree (see lines 2−35 of the algorithm in Fig. 4 and note that |T_κ| ≤ |P(κ)|). The computational complexity of the sequential weighting method is O(|P(κ)|) (see (6) and lines 36−49 of the algorithm in Fig. 4). According to the incremental hierarchical partitioning method described, the number of light nodes on the tree (see Fig. 3) is t at time t, therefore we may observe a decision tree of depth n, i.e., |P(κ)| = n, in the worst-case scenario, e.g., when x[t] = [A,...,A]^T for all t. Hence, the computational complexity of the algorithm over a data length of n is upper bounded by O(n). Although theoretically the computational complexity of the algorithm is upper bounded by O(n), in many real life applications the regressor vectors converge to stationary distributions [16]. Hence, in such practical applications, the computational complexity of the algorithm can be upper bounded by O(log(n)) as discussed in Remark 2. We emphasize that in order to achieve the computational complexity O(log(n)), we do not require any statistical assumptions; instead it is sufficient that the regressor vectors are evenly (to some degree) distributed in the regressor space. This concludes the proof of Theorem 1. □
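A sketch of the per-region predictor described above: each region accumulates R_k[t] and p_k[t] and predicts with v_{m,k}[t] = (R_k[t] + δI)^{-1} p_k[t]. The exact time indexing of the paper (R_k including the current sample) is glossed over here, and a practical implementation would update the inverse recursively via the matrix inversion lemma rather than solving from scratch.

```python
import numpy as np

# Regularized least squares regressor kept inside one region of a piecewise model.
class RegionRegressor:
    def __init__(self, p, delta=1.0):
        self.R = np.zeros((p, p))    # sum of x x^T over samples in the region
        self.b = np.zeros(p)         # sum of d x over samples in the region
        self.delta = delta           # regularization parameter delta

    def predict(self, x):
        v = np.linalg.solve(self.R + self.delta * np.eye(len(x)), self.b)
        return float(v @ x)

    def update(self, x, d):
        self.R += np.outer(x, x)
        self.b += d * x

reg = RegionRegressor(p=2)
x, d = np.array([0.2, -0.4]), 0.1
print(reg.predict(x))   # 0.0 before any data is accumulated
reg.update(x, d)
```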
1: for t = 1 to n do
2:   % Find the set of nodes containing x[t]
3:   κ = λ
4:   S = κ
5:   while κ has children do
6:     κ = κν, where ν is the last letter of the child containing x[t].
7:     S = S + κ
8:   end while
9:   % Check the index of the leaf node κ: if α_κ = 0, tree remains the same.
10:  if α_κ = 0 then
11:    α_κ = α_κ + 1
12:    T_κ = T_κ + t
13:    % If α_κ = 1, create nodes κ0 and κ1.
14:  else
15:    % Train nodes κ0 and κ1.
16:    for all z ∈ T_κ do
17:      if x[z] ∈ R_κ0 then
18:        T_κ0 = T_κ0 + z
19:        L_κ0 = L_κ0 exp(−(d[z] − w_κ0^T x[z])²/2a)
20:        P_κ0 = L_κ0
21:        R_κ0 = R_κ0 + x[z]x[z]^T
22:        w_κ0 = w_κ0 + R_κ0\(x[z](d[z] − w_κ0^T x[z]))
23:      else
24:        % Do the similar for node κ1.
25:      end if
26:    end for
27:    for all κ ∈ S do
28:      P_κ = (P_κ0 P_κ1 + L_κ)/2
29:    end for
30:    % Find the child containing x[t] and perform tree updates.
31:    ν = 0, if x[t] ∈ R_κ0; ν = 1, otherwise.
32:    κ = κν
33:    S = S + κ
34:    α_κ = 1
35:  end if
36:  % Calculate combination weights and perform estimation.
37:  for all κ_i ∈ P(κ) do
38:    if κ_i = λ then
39:      π_κi = 1/2
40:    else if κ_i ∉ {λ, κ} then
41:      π_κi = P_{κ_{i−1}ν_i^c} π_{κ_{i−1}}/2
42:    else
43:      π_κi = P_{κ_{i−1}ν_i^c} π_{κ_{i−1}}
44:    end if
45:    µ_κi = π_κi L_κi/P_λ
46:    d̂_κi = w_κi^T x[t]
47:  end for
48:  d̂ = µ^T d̂
49:  e = d[t] − d̂
50:  % Perform algorithmic updates.
51:  for all κ_i ∈ P(κ) do
52:    L_κi = L_κi exp(−(d[t] − d̂_κi)²/(2a))
53:    if κ_i = κ then
54:      P_κi = L_κi
55:    else
56:      P_κi = (P_{κi0} P_{κi1} + L_κi)/2
57:    end if
58:    R_κi = R_κi + x[t]x[t]^T
59:    w_κi = w_κi + R_κi\(x[t](d[t] − d̂_κi))
60:  end for
61: end for

Fig. 4: The pseudocode of the Incremental Decision Tree (IDT) regressor.
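The estimation step in lines 37–48 of Fig. 4 can be transcribed compactly as follows. The node bookkeeping (the weights P and L, and the node outputs) is assumed to be maintained exactly as in the pseudocode; P_sibling[κ_i] stands for the weight P of the sibling of κ_i, i.e., P_{κ_{i−1}ν_i^c}. This is a sketch of that step only, not a full implementation.

```python
def combine_path(path_nodes, P_sibling, L, node_preds, P_root):
    """Combine node predictions along the path from the root to the active leaf."""
    pi, mu, d_hat = {}, {}, 0.0
    for i, kappa in enumerate(path_nodes):
        if i == 0:
            pi[kappa] = 0.5                                          # line 39
        elif i < len(path_nodes) - 1:
            pi[kappa] = 0.5 * P_sibling[kappa] * pi[path_nodes[i-1]] # line 41
        else:
            pi[kappa] = P_sibling[kappa] * pi[path_nodes[i-1]]       # line 43
        mu[kappa] = pi[kappa] * L[kappa] / P_root                    # line 45
        d_hat += mu[kappa] * node_preds[kappa]                       # lines 46, 48
    return d_hat, mu

# toy call with arbitrary bookkeeping values, path root -> "0" -> "01"
d_hat, mu = combine_path(["", "0", "01"],
                         P_sibling={"0": 0.8, "01": 0.7},
                         L={"": 0.9, "0": 0.85, "01": 0.8},
                         node_preds={"": 0.1, "0": 0.15, "01": 0.2},
                         P_root=0.9)
```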
Remark 1: Note that the algorithm in Fig. 4 achieves the performance of the best piecewise linear model having O(log(n)) partitions with a regret of O(p log²(n)). In the most generic case, i.e., for an arbitrary piecewise model m having O(K_m) partitions, the introduced algorithm still achieves a regret of O(pK_m log(n/K_m)). This indicates that for models having O(n) partitions, the introduced algorithm achieves a regret of O(pn), hence the performance of the piecewise model cannot be asymptotically achieved. However, we emphasize that no other algorithm can achieve a smaller regret than O(pn) [8], i.e., the introduced algorithm is optimal in a strong minimax sense. Intuitively, this lower bound can be justified by considering the case in which the regressor vector at time t falls into the tth region of the piecewise model.

Remark 2: As mentioned in Remark 1 (and as can also be observed in (7)), no algorithm can converge to the performance of the piecewise linear models having O(n) disjoint regions. Therefore, we can limit the maximum depth of the tree by O(log(t)) at each time t to achieve a low complexity implementation. With this limitation and according to the update rule of the tree, we can observe that while dividing a region into two disjoint regions, we may be forced to perform O(t) computations due to the accumulated regressor vectors (since we no longer have |T_κ| ≤ |P(κ)| but instead have |T_κ| ≤ t). However, since a regressor vector is processed by at most O(log(n)) nodes for any n, the average computational complexity of the update rule of the tree remains O(log(n)). Furthermore, the performance of this low complexity implementation will be asymptotically the same as the exact implementation provided that the regressor vectors are evenly distributed in the regressor space, i.e., they are not gathered around a considerably small neighborhood. This result follows when we multiply the tree construction regret in (2) by the total number of accumulated regressor vectors, whose order, according to the above condition, is upper bounded by o(n/log(n)).

Remark 3: We emphasize that the node indexes, i.e., the α_κ's, determine when to create finer regions. According to the described procedure, if a node at depth l is partitioned into smaller regions, then its ith predecessor, i.e., κ_i ∈ P(κ), has observed at least l−i different regressor vectors. Hence, a child node is created when coarser regions (i.e., predecessor nodes)
are sufficiently trained. In this sense, we introduce new nodes to the tree according to the current status of the tree as well as the most recent data. We also point out that, in this paper, we divide each region from its midpoint (see Fig. 3) to maintain universality. However, this process can also be performed in a data dependent manner, e.g., one can partition each region using the hyperplane that is perpendicular to the line joining two regressor vectors in that region. If there are more than two accumulated regressor vectors, then more advanced methods such as support vectors and anomaly detectors can be used to define a separator hyperplane. All these methods can be straightforwardly incorporated into our framework to produce different algorithms depending on the regression task.

D. Proof of Theorem 2

We begin our proof by emphasizing that the introduced algorithm converges to the best linear model in each region with a regret of O(p log²(n)) for any finite regression parameter v_m (since ||v_m|| ≤ δGp log(n)) as already proven in Theorem 1. Therefore, using any other linear model yields a higher regret. Hence, say we define a suboptimal affine model by applying Taylor's theorem to a twice differentiable function f ∈ F about the midpoint of each region. Let d̂_s[t] denote the prediction of this suboptimal affine regressor. Then, we have

\sum_{t=1}^{n}\left(d[t]-\hat{d}[t]\right)^2 \le \sum_{t=1}^{n}\left(d[t]-\hat{d}_s[t]\right)^2 + O\left(p\log^2(n)\right).

Now applying the mean value theorem with the Lagrange form of the remainder, we obtain

\sum_{t=1}^{n}\left(d[t]-\hat{d}_s[t]\right)^2 - \sum_{t=1}^{n}\left(d[t]-\hat{d}_f[t]\right)^2 \le 2A \sum_{t=1}^{n}\left\{ \sum_{i=1}^{p}\sum_{j=1}^{p} (x_i[t]-a_{\kappa,i})(x_j[t]-a_{\kappa,j}) \left.\frac{\partial^2 f(x)}{\partial x_i\,\partial x_j}\right|_{x=b} \right\},

for some m ∈ M'_n and b ∈ R_κ, where a_κ ≜ [a_{κ,1},...,a_{κ,p}]^T is the midpoint of the region R_κ. Maximizing this upper bound with respect to x we obtain

\sum_{t=1}^{n}\left(d[t]-\hat{d}[t]\right)^2 - \sum_{t=1}^{n}\left(d[t]-\hat{d}_f[t]\right)^2 \le 2ADp^2 n\, \frac{A^2}{O\left(\log^{2/p}(n)\right)} + O\left(p\log^2(n)\right) \le o(p^2 n).

This concludes the proof of Theorem 2. □

V. SIMULATIONS

In this section, we investigate the performance of the introduced algorithm with respect to various methods under several benchmark scenarios. Throughout the experiments, we denote the incremental decision tree algorithm of Theorem 1 by "IDT", the context tree weighting algorithm of [8] by "CTW", the linear regressor by "LR", the Volterra series regressor by "VSR" [6], the sliding window Multivariate Adaptive Regression Splines of [31], [32] by "MARS", and the Fourier nonlinear regressor of [19] by "FNR". The combination weights of the LR, VSR, and FNR are updated using the recursive least squares (RLS) algorithm [16]. Unless otherwise stated, the CTW algorithm has depth 2, the VSR, FNR, and MARS algorithms are second order, and the MARS algorithm uses 21 knots with a window length of 500 that shifts in every 200 samples.

| Algorithm | Computational Complexity |
|-----------|--------------------------|
| IDT       | O(p² log(n))             |
| CTW       | O(p²d)                   |
| LR        | O(p²)                    |
| VSR       | O(p^{2r})                |
| MARS      | O(rbw³)                  |
| FNR       | O((pr)^{2r})             |

TABLE I: Comparison of the computational complexities of the proposed algorithms with the corresponding update rules. In the table, p represents the dimensionality of the regressor space, d represents the depth of the trees in the respective algorithms, and r represents the order of the corresponding filters and algorithms. For the MARS algorithm (particularly, the fast MARS algorithm, cf. [32]), b represents the number of basis functions and w represents the window length.

In Table I, we provide the computational complexities of the proposed algorithms. We emphasize that although the computational complexity to create and run the incremental decision tree is O(log(n)), the overall computational complexity of the algorithm is O(p²log(n)) due to the universal linear regressors at each region. Particularly, since the universal linear regressor at each region has a computational complexity of O(p²), the overall computational complexity of O(p²log(n)) follows. However, this universal linear regressor can be straightforwardly replaced with any linear (or nonlinear) regressor in the literature. For example, if we use the LMS algorithm to update the parameters of the linear regressor instead of using the universal algorithm for this update, the computational complexity of the overall structure becomes O(plog(n)). Hence, although the computational complexity of the original IDT algorithm is O(log(n)), this computational complexity may increase according to the computational complexity of the node regressors.

In this section, we first illustrate the performances of the proposed algorithms for a synthetic piecewise linear model that does not match the modeling structure of any of the above algorithms. We then consider the prediction of chaotic signals (generated from Duffing and Tinkerbell maps) and well-known data sequences such as the Mackey-Glass sequence and Chua's circuit [7]. Finally, we consider the prediction of real life examples that can be found in various benchmark data set repositories such as [33], [34].
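For reference, the error measure reported in the experiments is taken here to be the accumulated squared error divided by the number of samples; this normalization is an assumption consistent with the "normalized accumulated squared error" curves of Fig. 5.

```python
import numpy as np

# Running normalized accumulated squared error of a sequential predictor.
def normalized_accumulated_error(d, d_hat):
    sq = (np.asarray(d, float) - np.asarray(d_hat, float)) ** 2
    return np.cumsum(sq) / np.arange(1, len(sq) + 1)
```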
Fig. 5: Normalized accumulated squared error performances for the piecewise linear model in (8) averaged over 10 trials.

Fig. 6: Evolution of the normalized cumulative node weights at the corresponding depths of the tree for the piecewise linear model in (8) averaged over 10 trials.
A. Synthetic Data

In this subsection, we consider the scenario where the desired data is generated by the following piecewise linear model

d[t] = \begin{cases} x_1[t]+x_2[t]+n[t], & \text{if } \|x[t]\|^2 \in [0,0.1]\cup[0.5,1] \\ -x_1[t]-x_2[t]+n[t], & \text{otherwise,} \end{cases} \qquad (8)

and x[t] = [x_1[t], x_2[t]]^T are sample functions of a jointly Gaussian process of mean [0,0]^T and covariance matrix I, and n[t] is a sample function from a zero mean white Gaussian process with variance 0.1. Note that the piecewise model in (8) has circular regions, which cannot be represented by hyperplanes or twice differentiable functions. Hence, the underlying relationship between the desired data and the regressor vectors cannot be exactly modeled using any of the proposed algorithms.
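A sketch reproducing the synthetic model in (8); the random seed is an illustrative assumption.

```python
import numpy as np

# Generate (x[t], d[t]) pairs according to (8): x[t] is standard bivariate
# Gaussian, n[t] is zero mean Gaussian noise with variance 0.1, and the sign of
# the linear model depends on whether ||x[t]||^2 falls in [0,0.1] or [0.5,1].
def generate_synthetic(n, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 2))                   # mean [0,0], covariance I
    noise = rng.normal(0.0, np.sqrt(0.1), size=n)     # variance 0.1
    r2 = np.sum(x ** 2, axis=1)
    sign = np.where((r2 <= 0.1) | ((r2 >= 0.5) & (r2 <= 1.0)), 1.0, -1.0)
    d = sign * (x[:, 0] + x[:, 1]) + noise
    return x, d

x, d = generate_synthetic(10000)
```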
In Fig. 5, we present the normalized accumulated squared errors of the proposed algorithms averaged over 10 trials. For this experiment, "CTW-2" and "CTW-6" show the performances of the CTW algorithm with depths 2 and 6, respectively. Since the performances of the LR and FNR algorithms are incomparable with the rest of the algorithms, they are not included in the figure for this experiment. Fig. 5 illustrates that even for a highly nonlinear system (8), our algorithm significantly outperforms the other algorithms. The normalized accumulated error of the introduced algorithm goes to the variance of the noise signal as n increases, unlike the rest of the algorithms, whose performances converge to the performance of their optimal batch variants as n increases. This observation can be seen in Fig. 5, where the normalized cumulative error of the IDT algorithm steadily decreases since the IDT algorithm creates finer regions as the observed data length increases. Hence, even for a highly nonlinear model such as the circular piecewise linear model in (8), which cannot be represented via hyperplanes, the IDT algorithm can well approximate this highly nonlinear relationship by incrementally introducing finer partitions as the observed data length increases.

Furthermore, even though the depth of the introduced algorithm is comparable with the CTW-6 algorithm over short data sequences, the performance of our algorithm is superior with respect to the CTW-6 algorithm. This is because the IDT algorithm intrinsically eliminates the extremely fine models at the early processing stages and introduces them whenever they are needed, unlike the CTW-6 algorithm. This procedure can be observed in Fig. 6, where the IDT algorithm introduces finer regions (i.e., nodes with higher depths) to the hierarchical model as the coarser regions become unsatisfactory. Since the universal algorithms such as CTW distribute a "budget" into numerous experts, as the number of experts increases, the performance of such algorithms deteriorates. On the other hand, the introduced algorithm intrinsically limits the number of experts according to the unknown data length at each iteration, hence we avoid such possible performance degradations as can be observed in Fig. 6.

B. Chaotic Data

In this subsection, we consider prediction of the chaotic signals generated from the Duffing and Tinkerbell maps. The Duffing map is generated by the following discrete time equation

x[t+1] = a\,x[t] - (x[t])^3 - b\,x[t-1], \qquad (9)

where we set a = 2.75 and b = 0.2 to produce the chaotic behavior [9], [35]. The Tinkerbell map is generated by the following discrete time equations

x[t+1] = (x[t])^2 - (y[t])^2 + a\,x[t] + b\,y[t] \qquad (10)
y[t+1] = 2\,x[t]\,y[t] + c\,x[t] + d\,y[t], \qquad (11)

where we set a = 0.9, b = −0.6013, c = 2, and d = 0.5 [8], [35]. We emphasize that these values are selected to generate the well-known chaotic behaviors of these attractors.
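A sketch generating the chaotic series in (9)–(11) with the parameter values quoted above; the initial conditions are illustrative assumptions.

```python
import numpy as np

# Duffing map of (9) and Tinkerbell map of (10)-(11) with the stated parameters.
def duffing(n, a=2.75, b=0.2, x0=0.1, x1=0.1):
    x = np.empty(n)
    x[0], x[1] = x0, x1                       # assumed initial conditions
    for t in range(1, n - 1):
        x[t + 1] = a * x[t] - x[t] ** 3 - b * x[t - 1]
    return x

def tinkerbell(n, a=0.9, b=-0.6013, c=2.0, d=0.5, x0=-0.72, y0=-0.64):
    x, y = np.empty(n), np.empty(n)
    x[0], y[0] = x0, y0                       # assumed initial conditions
    for t in range(n - 1):
        x[t + 1] = x[t] ** 2 - y[t] ** 2 + a * x[t] + b * y[t]
        y[t + 1] = 2 * x[t] * y[t] + c * x[t] + d * y[t]
    return x, y
```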
Fig. 7a and Fig. 7b show the normalized accumulated squared error performances of the proposed algorithms. We emphasize that due to the chaotic nature of the signals, we observe non-uniform curves in Fig. 7. Since the conventional nonlinear and piecewise linear regression algorithms commit to a priori partitioning and/or basis functions, their performances are limited by the performances of the optimal batch regressors using these prior partitionings and/or basis functions, as can be observed in Fig. 7. Hence, such prior selections result in fundamental performance limitations for these algorithms. For example, in the CTW algorithm, the partitioning of the regressor space is set before the processing