Table Of Content6 - 1
Assessment of description quality of models by information
theoretical criteria based on Akaike and Schwarz-Bayes
applied with stability data of energetic materials
Manfred A. Bohn
Fraunhofer Institut fuer Chemische Technologie, ICT, D-76318 Pfinztal, Germany
Abstract
To describe any measurement data by models or parametric equations is mostly a neces-
sity for the interpretation and further evaluations of the data. There is often the case that
several models may be applicable and the question arises, which model is the better one.
Besides many criteria to argue for one or another model there are objective methods for
the assessment of the description quality for models in comparison. Such methods are
based on information theoretical conclusions and two criteria have proven helpful. One
has been developed by Hirot(s)ugu Akaike and it is called Akaike Information Criterion
(AIC). The other method is based on the early work of Thomas Bayes, which was adapted
by Gideon Schwarz to the framework of an information criterion and is named Bayes In-
formation Criterion (BIC). With several data sets obtained with energetic materials and
some models applied on them the usefulness but also limitations of these two information
criteria are demonstrated and discussed. The data comprise stabilizer consumption and
molar mass degradation of nitrocellulose in a gun propellant as well as the degradation of
cellulose in electrical transformer insulation paper.
Keywords: model selection, minimization of information loss, AIC, BIC
1. Introduction
The selection of a model to describe a set of data is at first controlled by the type of data
and their meaning in physical, chemical, biological or any meaning. In most cases one can
construct a description scheme with the knowledge one has about the inherence and the
origin of the data. For example, if one determines a series of concentrations of substances,
which are changing with time, one orients typically to chemical kinetics and select a
description or one builds an adapted one. But there may be the case that one does not
know what the underlying description of data-time course is. Then one applies parametric
descriptions and try to find out, which one is the best. Surely one has the general
correlation coefficient R2 and normalized sum of squared deviations between model and
data, maybe in the form of squared standard deviation SD2. The can help to find the
suitable model. However, there is a lack of some objective evaluation of the description
quality in the sense that they do not take in account weighting of the number of para-
meters used. Here the so-named information criteria based on probability and statistics
have a supportive role.
There are a lot of methods to evaluate model description quality. Some key terms are:
deviance information criterion (DIC)
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
Paper 6, pages 6-1 to 6-23 in Proc. of the 46th International Annual Conference of ICT on ‘Energetic Materials – Per-
formance, Safety and System Applications’, June 23 to 26, 2015, Karlsruhe, Germany. ISSN 0722-4087. Fraunhofer-
Institut fuer Chemische Technologie (ICT), D-76318 Pfinztal. Germany.
6 - 2
model selection
design of experiments (DoE)
Occam's razor (Ockham’s razor)
Akaike information criterion (AIC)
Bayes factor
Bayes information criterion (BIC)
Minimum description length (Algorithmic information theory)
Minimum message length
Kullback–Leibler divergence (KL-divergence), KL entropy, KL-distance
Here the concentration is on three terms:
Occam's razor
Akaike information criterion (AIC)
Bayesian information criterion (BIC) also named Schwarz-Bayes IC (SBIC)
The Bayes information criterion (BIC) or Schwarz criterion (also SBC, SBIC) as well as the
Akaike information criterion (AIC) are criteria for model selection among a finite set of
models. The selection method is based on the likelihood function. When fitting models, it
is possible to increase the likelihood (description quality) by adding parameters. But doing
so may result in overfitting. Both BIC and AIC resolve this problem by introducing a
penalty term for the number of parameters in the model; the penalty term is larger in BIC
than in AIC.
Overfitting is against the principle of Occam’s razor (Ockham’s razor or Ockham’s
principle, always powerful declaimed by William of Ockham), which states that only the
really necessary amount of variables (model parameters) may be used to describe a data
set. The imperative is to shave away unnecessary assumptions and cutting apart two
similar models. In other word:
> If there is more than one theoretical explanation for the same subject
the simplest theory has to be preferred;
> A theory is simple if it contains as less variables and hypotheses as possible,
which are in reasonable relation to another, which can explain the subject
also reasonably.
But Ockham’s principle is only one criterion for the quality of a theory and it allows no
real assessment on the validity of the theory and the explanations
2. Formulation of the information criteria AIC and BIC
Quantities which are used in AIC and BIC with equal meaning are:
n number of data points
p number of model parameters to be fitted (without constraint ones)
y measured value of dependent variable at data point i
i
f(x) by model calculated value of dependent variable at data point i
i
σ2 true variance of the data
σ 2 estimated variance (biased variance)
e
L maximized value of likelihood function (maximum likelihood)
n (y −f(x ))2
χ2 squared standard deviation or error variance, generallyχ2 =∑ i i
2
σ
i=1 i
6 - 3
n
FQS Fehlerquadradsumme, sum of squared deviations (SSD), FQS=∑(y −f(x ))2
i i
i=1
2.1 The Akaike information criterion (AIC)
The general formulation based on Akaike is formulated with the likelihood function and
is given in Eq.(1).
AIC=2⋅p−2⋅ln(L)=2⋅p−2⋅C+χ2 (1)
C is a constant, which cancels in relative model evaluations and one obtains Eq.(2)
AIC=2⋅p+χ2 (2)
The quantity χ2 denotes the fit of models according to minimization of the sum of
squared deviations (SSD) or Fehlerquadratsumme (FQS). The maximum likelihood L is
given by Eq.(3), in the used logarithmic form.
⎜⎛ n ⎛ 1 ⎞1/2⎟⎞ 1 n (y −f(x ))2 1
ln(L)=ln ∏⎜ ⎟ − ∑ i i =C− ⋅χ2 (3)
⎜⎜⎝ i=1 ⎜⎝2⋅π⋅οi2 ⎟⎠ ⎟⎟⎠ 2 i=1 οi2 2
Case 1
All σ are identical, σ = σ , and σ is unknown and is taken as inherent fit parameter. Then
i i e e
Eq.(4) results, whereby the number of fit parameters has been increased by 1, because of
use of a priori unknown σ . By expansion the Eq.(6) results for the maximum likelihood
e
function, used mostly in logarithmic form.
n/2
⎛ 1 ⎞ 1 1 n
ln(L)=ln⎜ ⎟ − ⋅ ⋅∑(y −f(x ))2 (4)
⎜⎝2⋅π⋅ο2e ⎟⎠ 2 ο2e i=1 i i
1 n FQS
ο2 = ⋅∑(y −f(x ))2 = (5)
e i i
n n
i=1
n ⎛ FQS⎞ n n n n ⎛FQS⎞ n ⎛FQS⎞
ln(L)=− ⋅ln⎜2π⋅ ⎟− =− ⋅ln(2π)− − ⋅ln⎜ ⎟=C− ⋅ln⎜ ⎟ (6)
2 ⎝ n ⎠ 2 2 2 2 ⎝ n ⎠ 2 ⎝ n ⎠
By omitting C in relative use of AIC by using the statistical weights (see later) the Eq.(7)
follows. There a short illustration is given on the meaning of the two terms.
⎛FQS⎞
AIC=n⋅ln⎜ ⎟+2⋅(p+1)
⎝ n ⎠
(7)
⎛FQS⎞
AIC=n⋅ln⎜ ⎟ + 2⋅(p+1)
⎝ n ⎠
likelihood term penalty term
The lower the AIC value the better is the assessment of the model. The more fit para-
meters are used the higher the AIC value. This means formulated in information content
6 - 4
about the data: The higher the AIC of a model, the higher the information loss about the
data.
The often used form of AIC is advantageous with models using to much fit parameters.
Means the penalty term is considered to small. Therefore a modified version was proposed
and formulated /1/, which should or even must be used if n is relatively small or n/p is <<
1000. The AIC corrected for small sample sizes is named AICc and has an additional
penalty term formed from p and n, Eq.(8).
⎛FQS⎞ 2⋅p⋅(p+1)
AICc =n⋅ln⎜ ⎟+2⋅(p+1)+ (8)
⎝ n ⎠ n−p−1
The normalized statistical AIC weight wA or wAc of a model i among a set of u models to
i i
fit the data best, means which minimizes most the information loss by the description of
the data, is obtained by Eq.(9). It uses the ‘distance’ ΔA or ΔAc between the best model
i i
with smallest AICc and the AICc of model i, Eq.(10). The quantity exp(-ΔA /2) or exp(-ΔAc
i i i
/2) is the relative maximum likelihood of model i in comparison with the other models.
( )
exp −ΔAc /2
wAc = i (9)
i u
∑exp(−ΔAc /2)
i
i=1
ΔA = AICc −AICc (10)
i i min
Case 2
All σ are different.
i
AIC=χ2 +2⋅p (11)
n (y −f(x ))2
χ2 =∑ i i (12)
2
σ
i=1 i
The evaluation with the relative likelihood function and the application of statistical
weights is the same as above.
2.1 The Bayes (Schwarz-Bayes) information criterion (BIC, SBIC)
Schwarz has formulated the information criterion in the ‘modern’ way, but based on the
ideas of Bayes /2/. The start of the BIC is Eq.(13).
( )
BIC=−2⋅ln(L)+p⋅ ln(n)−ln(2π) (13)
Case 1
All σ are identical, σ = σ, then Eq.(14) follows.
i i
( )
2
BIC=n⋅lnσ +p⋅(ln(n)−ln(2π)) (14)
e
With n large ln(2π) = 1.83: n >1000.000, Eq.(15) can be used. Introducing Eq.(16) results in
Eq,(17), the mostly used form of BIC.
, ( )
BIC=n⋅lnσ 2 +p⋅ln(n) (15)
e
6 - 5
1 n
σ 2 = ⋅∑(y −f(x ))2 (16)
e i i
n
i=1
⎛FQS⎞
BIC=n⋅ln⎜ ⎟+p⋅(ln(n)−ln(2π)) (17)
⎝ n ⎠
Case 2
All σ are different then again the general squared standard deviation χ2 from Eq.(12) is
i
used.
BIC=χ2 +p⋅(ln(n)−ln(2π)) (18)
For both cases the following evaluation holds as above the assessment with the
normalized statistical weight wB of a model i in comparison in a set of u models.
i
( )
exp −ΔB /2
wB = i (19)
i u
∑exp(−ΔB /2)
i
i=1
ΔB =BIC −BIC (20)
i i min
Notes
Finally the difference between AIC and BIC is the different penalty term. Sometimes the
part ln(2π) in BIC penalty term is omitted, which increases the penalty. AIC and BIC can tell
nothing about the quality of the model in an absolute sense. If all candidate models fit
poorly, AIC and BIC will not give any warning about this. Further on, there is no selection
of a possible ‘true’ model. The general assumption about the selected models to be tested
is that none of them is a ‘true’ model. The only assessment result is in terms of inform-
ation loss in using a description (model) for the experimental data. This means that the
entity of the experimental data itself has the highest degree of information. Any
regression or data reduction or model description reduces this amount of information
(information is in the context of thermodynamics an extensive quantity). AIC and BIC help
to select the description, which provides with the least information loss about the data.
3. Examples of model evaluations with AIC and BIC
The experimental data used have been compiled in Appendix A1.
3.1 Reaction kinetic models used with the examples
The following five reaction kinetic models to describe stabilizer consumption /3, 4/ and
molar mass decrease /5/ have been tested with several data sets.
Model ‘S: linear’, zero order consumption of stabilizer. The reaction rate constant
has the unit mass-%1/time.
S(t,T)=S(0)−k (T)⋅t (21)
0
6 - 6
Model ‘S: exponential’, first order consumption of stabilizer. The reaction rate cons
tant has the unit 1/time.
S(t,T)=S(0)⋅exp(−k (T)⋅t) (22)
1
Model ‘S: exponential + linear’, combination of fist and zero order. The reaction
rate constant k1 has the unit 1/time the other one mass-%/time.
⎛ k (T)⎞ k (T)
S(t,T)=⎜S(0)+ 0 ⎟⋅exp(−k (T)⋅t)− 0 (23)
⎜ ⎟ 1
⎝ k1(T)⎠ k1(T)
Model ‘S: nth order’, stabilizer consumption according to nth order. The reaction
rate constant has the unit 1/time.
⎛ 1 ⎞
⎜ ⎟
S(t,T)=S(0)⋅[1−(1−n)⋅k(T)⋅t] ⎝1−n⎠ (24)
Model ‘S: extended’, stabilizer consumption according to a complete but still basic
reaction scheme assuming a zero order (approximated first order) NC degradation
by k and stabilizer reaction by k . NE(0) the part of NC in the formulation in mass
N S
parts, which has no unit. The two reaction rate constants have the unit 1/time.
( )
exp−NE(0)⋅k ⋅k ⋅t2
N S
S(t)= (25)
1 π k ⎛ NE(0)⋅k ⋅k ⎞
+ ⋅ S ⋅erf⎜ N S ⋅t⎟
S(0) 2 NE(0)⋅k ⎜ 2 ⎟
N ⎝ ⎠
Model for molar mass decrease; all models are based on random chain splitting, but by
two different mechanisms: bond scission between chain elements and chain element de-
composition. In the last version, the chain elements decrease one has mass loss and in the
very final end all material is decomposed /5/.
Model ‘M: CS - BS’, chain splitting (CS) by bond scission (BS) between chain ele-
ments. The reaction rate constant has the unit 1/time.
( )
exp +k ⋅t
Mn(t)=m⋅ B1 (26)
m
( ( ) )
+ exp +k ⋅t −1
B1
Mn(0)
Model ‘M: CS - ED’, chain splitting (CS) by chain element decomposition (ED). The
reaction rate constant has the unit 1/time.
( )
Mn(t) exp −k ⋅t
= M1 (27)
m m ( ( ))
+ 1−exp −k ⋅t
M1
Mn(0)
Model ‘M: CS - BS + CR’, chain splitting (CS) by bond scission (BS) between chain
elements and chain recombination (CR). The two reaction rate constants have the
unit 1/time.
( )
Mn(t) exp +(k +k )⋅t
= B1 B2 (28)
m m
k +k ⋅
m B1 B2 Mn(0)
( ( ) )
+ ⋅ exp +(k +k )⋅t −1
B1 B2
Mn(0) k +k
B1 B2
6 - 7
Model ‘M: CS - ED + CR’, chain splitting (CS) by chain element decomposition (ED)
and chain recombination (CR). The two reaction rate constants have the unit
1/time.
( )
exp −k ⋅t
Mn(t)=m⋅ M1 (29)
m k
+ M1 ⋅(exp(−k ⋅t)−exp(−k ⋅t))
M2 M1
Mn(0) k −k
M1 M2
Till now only isothermal formulations of the models have been given. Sometimes non-
isothermal determinations of data are made. Then non-isothermal formulations of the
models must be used /5/. Mostly one performs the measurement with constant heating
rate, but in principle also time or temperature dependent heating rates can be used with
the following model formulations, here done for the two models for molar mass decrease
with chain recombination.
Model ‘M: niso CS - BS + CR’, non-isothermal (niso) chain splitting (CS) by bond scis-
sion (BS) between chain elements and chain recombination (CR). The two reaction
rate constants have the unit 1/time.
⎛Mn(T)⎞
d⎜ ⎟
2
⎝ m ⎠ 1 ⎛ m ⎞ ⎛Mn(T)⎞ 1 ( ) ⎛Mn(T)⎞
=− ⋅⎜k (T)+k (T)⋅ ⎟⋅⎜ ⎟ + ⋅ k (T)+k (T) ⋅⎜ ⎟ (30)
⎜ B1 B2 ⎟ B1 B2
dT h ⎝ Mn(0)⎠ ⎝ m ⎠ h ⎝ m ⎠
Model ‘M: niso CS - ED + CR’, non-isothermal (niso) chain splitting (CS) by chain
element decomposition (ED) and chain recombination (CR). The two reaction rate
constants have the unit 1/time.
⎛Mn(T)⎞
d⎜ ⎟ 2
⎝ m ⎠ =−1⎜⎛k (T)+k (T)exp(+k (T)⋅1⋅(T−Ta)) m ⎟⎞⎜⎛Mn(T)⎟⎞ −1(k (T)−k (T))⎜⎛Mn(T)⎟⎞
⎜ M1 M2 M1 ⎟ M1 M2
dT h⎝ h Mn(0)⎠⎝ m ⎠ h ⎝ m ⎠
(31)
The non-isothermal models for molar mass decrease, Eq.(30) und Eq.(31) could be inte-
grated but the resulting equations are quite cumbersome. It is therefore much easier, to
use the differential forms given above and apply them via combined Runge-Kutta numeri-
cal integration and non-linear parameter fitting to the data.
3.2 Description of DPA decrease in M10 flake propellant
From the five models the model ‘S: extended’ describes the data best. With a
weight of 54% in AICc. Iit surpasses the next best model ‘S: nth order’, which
achieved 21 %.
6 - 8
Fig. 1: Graphical presentation of the fit qualities of the five models for stabilizer consump-
tion. DPA decrease at 65.5°C. Data from US-Army ARDEC, Picatinny Arsenal.
Table 1: Results of the model assessments of the data of M10 flake propellant. DPA de-
crease at 65.5°C
row
Model → linear expon. expn. + linear, nth order extended
summa-
↓ Quantity k k k , k k, n k , k
1 0 O N S tions
n 9 9 9 9 9
p 1 1 2 2 2
SD2 0.02909 0.00107 0.00066 0.00064 0.00052
R2 0.708 0.98925 0.99411 0.99429 0.99539
0.00429 ± 0.012 ± 0.0103 ± 0.01014 ± 8.10E-3 ±
k / k / k [1/d]
1 N 0.0003 0.0005 7.20E-4 6.45E-4 0.0017
0.763 ±
n [-] - - - -
O 0.081
6.45E-4 ± 1.21E-2 ±
k [m.-%] / k [1/d] - - -
0 S 2.38E-4 0.0012
FQS 0.23272 0.00856 0.00462 0.00448 0.00364
AICc -28.325 -58.049 -60.171 -60.448 -62.317 -
exp(-ΔAci/2) 4.157E-08 0.118 0.342 0.393 1 1.853
wAci 2.243E-08 0.064 0.185 0.212 0.54 1
AIC -28.896 -58.621 -62.171 -62.448 -64.317 -
exp(-ΔAi/2) 2.034E-08 0.058 0.342 0.393 1 1.793
wAi 1.135E-08 0.032 0.191 0.219 0.558 1
BIC -32.537 -62.262 -67.453 -67.73 -69.598 -
6 - 9
exp(-ΔBi/2) 8.960E-09 0.026 0.342 0.393 1 1.761
wBi 5.088E-09 0.015 0.194 0.223 0.568 1
BICoπ -30.699 -60.424 -63.777 -64.054 -65.923 -
exp(-ΔBoπi/2) 2.245E-08 0.064 0.342 0.393 1 1.799
wBoπi 1.248E-08 0.036 0.19 0.218 0.556 1
3.3 Description of DPA decrease in single base GP
From the five models the model ‘S: extended’ describes the data best. With a
weight of 61% in AICc. It surpasses the next best model ‘S: nth order’, which
achieved 39%.
In the case of using a reduced data set up to 7 days, the result changes: the best
model is ‘S: exponential + linear’ with 49% followed by model ‘S: nth order’ with
36%; model ‘S: extended’ reaches 15%.
Fig. 2: Graphical presentation of the fit qualities of the five models for stabilizer consump-
tion in a single base GP. DPA decrease at 90°C over 19 days.
Table 2: Results of the model assessments of the data of single base GP GS, DPA decrease
at 90°C. Data measured over 19 days.
row
Model → linear expon. expn. + linear nth order extended
summa-
↓ Quantity k k k , k k, n k , k
1 0 O N S tions
n 26 26 26 26 26
p 1 1 2 2 2
6 - 10
SD2 0.13955 0.00254 0.00204 0.00067258 0.00065
R2 0.02824 0.98229 0.98639 0.99563
0.1097, 0.3155, 0.2879, 0.013 0.4792,
k / k / k [1/d]
1 N 0.008 0.01 0.059
n [-]
O
0.0140, 0.1882,
k [m.-%] / k [1/d]
0 S 0.0051 0.013
FQS 3.48875 0.0635 0.04896 0.016141919 0.0156
AICc -48.056 -152.218 -156.624 -185.473 -186.361
exp(-ΔAci/2) 9.275E-31 3.854E-08 3.489E-07 0.642 1 1.642
wAci 5.650E-31 2.348E-08 2.125E-07 0.391 0.609 1
AIC -48.222 -152.385 -157.146 -185.995 -186.883
exp(-ΔAi/2) 7.766E-31 3.227E-08 3.489E-07 0.642 1 1.642
wAi 4.731E-31 1.966E-08 2.125E-07 0.391 0.609 1
BIC -50.802 -154.965 -160.306 -189.155 -190.043
exp(-ΔBi/2) 5.812E-31 2.415E-08 3.489E-07 0.642 1 1.642
wBi 3.541E-31 1.471E-08 2.125E-07 0.391 0.609 1
Fig. 3: Graphical presentation of the fit qualities of the five models for stabilizer consump-
tion in a single base GP. DPA decrease at 90°C over 7 days.
Table 3: Results of the model assessments of the data of single base GP GS, DPA decrease
at 90°C. Data measured over 19 days, but only used up to 7 days.
expn. + lin- row
Model → linear expon. nth order extended
ear, summa-
↓ Quantity k k k, n k , k
k , k O N S tions
1 0