Hyperbolic statistics and entropy loss in the description of gene distributions
K. Lukierska-Walasek∗
Faculty of Mathematics and Natural Sciences, Cardinal Stefan Wyszyński University, Dewajtis 5, 01-815 Warsaw, Poland
K. Topolski†
Institute of Mathematics, Wroclaw University, Pl. Grunwaldzki 2/4, 50-384 Wroclaw, Poland
K. Trojanowski‡
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
Zipf's law implies statistical distributions of hyperbolic type, which can describe the properties of stability and entropy loss in linguistics. We present the information theory from which it follows that if a system is described by distributions of hyperbolic type, this leads to the possibility of entropy loss. We present the number of repetitions of genes in the genomes of some bacteria, such as Borrelia burgdorferi and Escherichia coli. The distributions of repetitions of genes in a genome appear to be represented by distributions of hyperbolic type.

PACS numbers:
Keywords: gene length, hyperbolic distributions, entropy
1. Introduction. In recent years it has become possible to gain some knowledge of the genome sequence data of many organisms. The genome has been studied intensively by a number of different methods [1]-[10]. The statistical analysis of DNA is complicated because of its complex structure; it consists of coding and non-coding regions, repetitive elements, etc., which have an influence on the local sequence composition. The long-range correlations in the sequence composition of DNA are much more complex than a simple power law; moreover, the effective exponents of the scaling regions vary considerably between different species [9], [10]. In papers [5], [6] the Zipf approach to analyzing linguistic texts has been extended to the statistical study of DNA base pair sequences.

In our paper we take into account some linguistic features of genomes to study the statistics of gene lengths. We present the information theory from which it follows that if a system is described by a special distribution of hyperbolic type, a possibility of entropy loss is implied. Distributions of hyperbolic type also describe the properties of stability and flexibility, which explain how the language considered in the paper [11] can develop without changing its basic structure. A similar situation can occur in genome sequence data, which carry the genetic information. We can expect the above features in gene strings as in language strings (words, sentences) because of the presence of redundancy in both cases.

In Sect. 2 we present some common features of linguistics and genetics. In Sect. 3 we describe the Code Length Game, the Maximal Entropy Principle and the theorem about distributions of hyperbolic type and entropy loss. Sect. 4 contains the applications to genetics, and Sect. 5 the final remarks. The distributions we find appear to be hyperbolic in the sense of this theorem.
2. Some common features of linguistics and genetics. A language is characterized by some alphabet with letters: a, b, c, ..., which form words as sequences of n letters. In quantum physics the analogy to letters can be attached to pure states, and texts correspond to mixed general states. Languages have very different alphabets: computers 0, 1 (two bits), the English language 27 letters with space, and DNA four nitric bases: G (guanine), A (adenine), C (cytosine), T (thymine). The collection of letters can be ordered or disordered. To quantify the disorder of different collections of letters we use the entropy

H = -\sum_i p_i \log_2 p_i ,    (1)

where p_i denotes the probability of occurrence of the i-th letter. Taking the logarithm to base 2 leads to the entropy measured in bits. When all letters have the same probability, the entropy obviously attains its maximum value H_max.

The real entropy has a lower value H_eff, because in real languages the letters do not have the same probability of appearance. The redundancy R of a language is defined [11] as follows:

R = \frac{H_{max} - H_{eff}}{H_{max}} .    (2)

The quantity H_max - H_eff is called the information. The information depends on the difference between the maximum entropy and the actual entropy: the bigger the actual entropy, the smaller the redundancy. Redundancy can be measured by the values of the frequencies with which different letters occur in one or more texts. The redundancy R can be interpreted as follows: if we remove a fraction R of the letters, determined by the redundancy, the meaning of the text will still be understood. In English some letters occur more frequently than others; similarly, in the DNA of vertebrates the C and G base pairs usually occur less frequently than the A and T pairs. The low value of redundancy makes it easier to fight transcription errors in the genetic code. In papers [5], [6] it is demonstrated that non-coding regions of eukaryotes display a smaller entropy and larger redundancy than coding regions.
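As a concrete illustration of Eqs. (1) and (2), the sketch below (our illustration, not part of the original paper; the base frequencies are made-up values mimicking the C/G depletion mentioned above) computes the effective entropy and the redundancy for a four-letter DNA alphabet:

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits, Eq. (1): H = -sum_i p_i log2 p_i."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical single-base frequencies (illustrative, not measured data):
# a sequence slightly depleted in C and G, as in vertebrate DNA.
freqs = {"A": 0.30, "T": 0.30, "C": 0.20, "G": 0.20}

h_eff = entropy_bits(freqs.values())
h_max = log2(len(freqs))                 # uniform over 4 letters: 2 bits
redundancy = (h_max - h_eff) / h_max     # Eq. (2)

print(f"H_eff = {h_eff:.4f} bits, H_max = {h_max:.1f} bits, R = {redundancy:.4f}")
```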
3. Maximal Entropy Principle and Code Length Game. In this section we provide some mathematics from information theory which will be helpful in the quantitative formulation of our approach; for details see [11].

Let A be the alphabet, which is a discrete set, finite or countably infinite. Let M^1_+(A) and ~M^1_+(A) be, respectively, the set of probability measures on A and the set of non-negative measures P on A such that P(A) ≤ 1. The elements of A can be thought of as letters. By K(A) we denote the set of mappings, compact codes, k : A → [0,∞], which satisfy Kraft's equality [13]

\sum_{i \in A} \exp(-k_i) = 1 .    (3)

By ~K(A) we denote the set of all mappings, general codes, k : A → [0,∞], which satisfy Kraft's inequality [13]

\sum_{i \in A} \exp(-k_i) \leq 1 .    (4)

For k ∈ ~K(A) and i ∈ A, k_i is the code length, for example the length of a word. For k ∈ ~K(A) and P ∈ M^1_+(A) the average code length is defined as

\langle k, P \rangle = \sum_{i \in A} k_i p_i .    (5)

There is a bijective correspondence between p_i and k_i: k_i = -\ln p_i and p_i = \exp(-k_i).
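The bijection k_i = -ln p_i can be checked numerically. The following sketch (our illustration, using an assumed toy word distribution rather than data from the paper) verifies Kraft's equality (3) for the adapted code and confirms that its average code length (5) equals the entropy of the distribution:

```python
import math

def ideal_code(probs):
    """Code lengths k_i = -ln p_i (natural units), adapted to the distribution."""
    return [-math.log(p) for p in probs]

def kraft_sum(code):
    """sum_i exp(-k_i): equals 1 for compact codes (Eq. 3), <= 1 in general (Eq. 4)."""
    return sum(math.exp(-k) for k in code)

def avg_code_length(code, probs):
    """Average code length <k, P> of Eq. (5)."""
    return sum(k * p for k, p in zip(code, probs))

# Assumed toy word distribution (illustrative values only).
P = [0.5, 0.25, 0.125, 0.125]
k = ideal_code(P)

H = -sum(p * math.log(p) for p in P)  # entropy -sum_i p_i ln p_i, in nats
print(f"Kraft sum = {kraft_sum(k):.6f}")   # 1.0: Kraft's equality holds
print(f"<k, P>    = {avg_code_length(k, P):.6f}")
print(f"H(P)      = {H:.6f}")              # equals <k, P>: the adapted code is optimal
```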
For P ∈ M^1_+(A) we also introduce the entropy

H(P) = -\sum_{i \in A} p_i \ln p_i .    (6)

The entropy can be represented as the minimal average code length (see [11]):

H(P) = \min_{k \in \tilde{K}(A)} \langle k, P \rangle .

Let \mathcal{P} ⊆ M^1_+(A); then

H_{max}(\mathcal{P}) = \sup_{P \in \mathcal{P}} \inf_{k \in \tilde{K}(A)} \langle k, P \rangle = \sup_{P \in \mathcal{P}} H(P) \leq \inf_{k \in \tilde{K}(A)} \sup_{P \in \mathcal{P}} \langle k, P \rangle = R_{min}(\mathcal{P}) .    (7)

Formula (7) presents the Nash optimal strategies: R_min denotes the minimum risk and k the Nash equilibrium code, with P a probability distribution. For example, in linguistics the listener is a minimizer and the speaker a maximizer: we have words with distributions p_i and their codes k_i, i = 1, 2, ...; the listener chooses the codes k_i, the speaker chooses the probability distributions p_i.

We can notice that Zipf argued [14] that in the development of a language vocabulary a balance is reached as a result of two opposing forces: unification, which tends to reduce the vocabulary and corresponds to a principle of least effort seen from the point of view of the speaker, and diversification, connected with the listener's wish to know the meaning of the speech.

The principle (7) is as basic as the Maximum Entropy Principle, in the sense that the search for one type of optimal strategy in the so-called Code Length Game translates directly into a search for distributions with maximum entropy. Given a code k ∈ ~K(A) and a distribution P ∈ M^1_+(A), the optimal strategy H(P) = \inf_{k \in \tilde{K}(A)} \langle k, P \rangle is represented by the entropy H(P), while the actual strategy is represented by \langle k, P \rangle.
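To make the game of formula (7) tangible, the sketch below (our numerical illustration with an assumed two-distribution model; a grid search over mixtures stands in for the exact optimization) estimates sup_P H(P), H_max of the convex hull, and the minimax risk over codes adapted to mixtures:

```python
import math

def H(p):
    """Entropy in nats, -sum_i p_i ln p_i."""
    return -sum(x * math.log(x) for x in p if x > 0)

def avg_len(q, p):
    """<k, P> of Eq. (5) for the code k_i = -ln q_i adapted to q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Toy model: two distributions over a 3-letter alphabet (illustrative values).
P1, P2 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
mixtures = [[t * a + (1 - t) * b for a, b in zip(P1, P2)]
            for t in (i / 1000 for i in range(1001))]

sup_H = max(H(P1), H(P2))                        # sup_P H(P)
h_max_co = max(H(q) for q in mixtures)           # H_max of the convex hull
r_min = min(max(avg_len(q, P1), avg_len(q, P2))  # minimax over adapted codes
            for q in mixtures)

print(f"sup H(P)       = {sup_H:.4f}")
print(f"H_max(co(P))   = {h_max_co:.4f}")
print(f"R_min (approx) = {r_min:.4f}")  # matches H_max(co(P)), exceeds sup H(P)
```

In this toy model the grid minimum reproduces H_max of the convex hull and strictly exceeds sup_P H(P); the equilibrium condition discussed below is exactly the requirement that this gap vanish.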
Zipf's law is an empirical observation which relates the rank and the frequency of words in natural language [14]. This law suggests modeling by distributions of hyperbolic type [11], because no distribution over N can have probabilities exactly proportional to 1/i, due to the lack of a normalization constant.

We consider a class of distributions P = (p_1, p_2, ...) over N. If p_1 ≥ p_2 ≥ ..., P is said to be hyperbolic if, for any given α > 1, p_i ≥ i^{-α} for infinitely many indices i. As an example we can choose p_i ∼ i^{-1}(\log i)^{-c} for some constant c > 2.

The Code Length Game for a model \mathcal{P} ⊆ M^1_+(N), with codes k : A → [0,∞] for which \sum_{i \in A} \exp(-k_i) = 1, is in equilibrium if and only if H_{max}(co(\mathcal{P})) = H_{max}(\mathcal{P}). In such a case a distribution P* is the H_max attractor, i.e. P_n → P* for every sequence (P_n)_{n≥1} ⊆ \mathcal{P} for which H(P_n) → H_{max}(\mathcal{P}). One expects that H(P*) = H_{max}(\mathcal{P}), but in the case with entropy loss we have H(P*) < H_{max}(\mathcal{P}). Such a possibility appears when the distributions are hyperbolic. It follows from the theorem in [11].

Theorem. Assume that P* ∈ M^1_+(N) is of finite entropy and has ordered point probabilities. Then a necessary and sufficient condition for P* to occur as the H_max-attractor in a model with entropy loss is that P* is hyperbolic. If this condition is fulfilled, then for every h with H(P*) < h < ∞ there exists a model \mathcal{P} = \mathcal{P}_h with P* as the H_max-attractor and H_{max}(\mathcal{P}_h) = h. In fact, \mathcal{P}_h = \{P : \langle k^*, P \rangle ≤ h\} is the largest such model, where k* denotes the code adapted to P*, i.e. k*_i = -\ln p*_i, i ≥ 1.
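To see why the condition c > 2 matters for the example p_i ∼ i^{-1}(log i)^{-c}, one can truncate the sequence at a finite n, normalize, and watch both the normalization constant and the entropy converge as n grows; the sketch below (our numerical illustration, with the arbitrary choice c = 2.5) does exactly that:

```python
import math

def hyperbolic_weights(n, c=2.5, start=2):
    """Unnormalized weights w_i = i^{-1} (log i)^{-c}, i = start..n.
    For c > 2 the series converges, so the normalized p_i form a
    proper distribution of hyperbolic type with finite entropy."""
    return [1.0 / (i * math.log(i) ** c) for i in range(start, n + 1)]

for n in (10**3, 10**4, 10**5, 10**6):
    w = hyperbolic_weights(n)
    z = sum(w)
    p = [x / z for x in w]
    h = -sum(x * math.log(x) for x in p)  # entropy in nats, Eq. (6)
    print(f"n = {n:>8}: Z ≈ {z:.4f}, H ≈ {h:.4f} nats")

# Z and H settle down as n grows (finite entropy), yet p_i decays only like
# 1/(i log^c i), i.e. slower than i^{-alpha} for every alpha > 1 along the
# tail -- the hyperbolic property used in the theorem.
```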
As an example we can consider an "ideal" language in which the frequencies of words are described by a hyperbolic distribution P* with finite entropy. At a certain stage of life one is able to communicate at a reasonably high rate, about H(P*), and to improve the language by the introduction of specialized words, which occur seldom in the language as a whole. This process can be continued during the rest of life: one is able to express more complicated ideas and to develop the language without changing its basic structure. A similar situation can be expected in gene strings, which carry the genetic information.

4. Applications to genetics. The gene data which are analyzed here were obtained from GenBank (ftp://ftp.ncbi.nih.gov).

FIG. 1: Graph of the number of repetitions of genes in the genome of Escherichia coli. The genome consists of 5874 genes.

FIG. 2: Graph of the number of repetitions of genes in the genome of Escherichia coli. The genome consists of 4892 genes.

FIG. 3: Graph of the number of repetitions of genes in the genome of Borrelia burgdorferi. The genome consists of 1239 genes.

Figures 1, 2 and 3 show how often the same genes are repeated in a genome. For example, Figure 2 shows that the most repeated gene has number 1 and appears 5 times, the second one also appears 5 times, the third gene 4 times, etc. The probabilities of appearance of genes present hyperbolic type distributions [11].
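The repetition statistics behind Figures 1-3 reduce to a count-and-rank computation. The sketch below (our illustration; the gene list is a toy stand-in, not the GenBank annotations analyzed in the paper) counts how many times each gene name occurs and ranks the counts in decreasing order, the form in which a hyperbolic, Zipf-like decay would show up:

```python
from collections import Counter

def repetition_ranking(genes):
    """Count occurrences of each gene and return the counts sorted
    in decreasing order (rank 1 = most repeated gene)."""
    return sorted(Counter(genes).values(), reverse=True)

# Toy gene list standing in for a parsed genome annotation.
genes = ["dnaA", "recA", "recA", "tufA", "tufA", "tufA",
         "rpoB", "dnaA", "tufA", "gyrB", "recA"]

for rank, n in enumerate(repetition_ranking(genes), start=1):
    print(f"rank {rank}: repeated {n} times")
```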
5. Final remarks. Zipf's law relates the rank k and the frequency of words f in natural languages by the exact hyperbolic formula k × f = const. However, the probability values 1/n for all natural numbers n do not satisfy the normalizability condition. This problem does not occur if we introduce distributions of hyperbolic type. In this paper we have followed the statistical description of the structure of natural languages and presented the information-theoretic considerations that yield, as a conclusion, the stability property and the possibility of entropy loss. We have applied similar reasoning to the description of genomes, with the genes corresponding to the words of a language. We have tested three examples of genomes and described the gene statistics dealing with the number of repetitions in the genome. It appears that the distribution of genes is represented by distributions of hyperbolic type.

One of the authors (K.L-W) would like to thank prof. Franco Ferrari for discussions.

∗ Electronic address: k.lukierska@uksw.edu.pl
† Electronic address: topolski@math.uni.wroc.pl
‡ Electronic address: trojanowski@ipipan.waw.pl

[1] W. Li and K. Kaneko, Europhys. Lett. 17, 655 (1992).
[2] C. K. Peng et al., Nature (London) 356, 168 (1992).
[3] R. F. Voss, Phys. Rev. Lett. 68, 3805 (1992).
[4] C. K. Peng et al., Phys. Rev. E 49, 1685 (1994).
[5] R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley, Phys. Rev. Lett. 73, 3169 (1994).
[6] R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley, Phys. Rev. E 52, 2939 (1995).
[7] H. E. Stanley et al., Physica A (Amsterdam) 273, 1 (1999).
[8] M. D. Sousa Vieira, Phys. Rev. E 60, 5932 (1999).
[9] D. Holste et al., Phys. Rev. E 67, 061913 (2003).
[10] P. W. Messer, P. F. Arndt, and M. Lässig, Phys. Rev. Lett. 94, 138103 (2005).
[11] P. Harremoës and F. Topsøe, Entropy 3, 1 (2001).
[12] C. E. Shannon, Bell Syst. Tech. J. 27, 379 (1948).
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[14] G. K. Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley, Cambridge, 1949.