From Statistical Physics to Data-Driven
Modelling
with Applications to Quantitative Biology
Simona Cocco
Rémi Monasson
Francesco Zamponi
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Simona Cocco, Rémi Monasson, and Francesco Zamponi 2022
The moral rights of the authors have been asserted
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2022937922
ISBN 978–0–19–886474–5
DOI: 10.1093/oso/9780198864745.001.0001
Printed and bound by
CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
Contents
1 Introduction to Bayesian inference
1.1 Why Bayesian inference?
1.2 Notations and definitions
1.3 The German tank problem
1.4 Laplace’s birth rate problem
1.5 Tutorial 1: diffusion coefficient from single-particle tracking
2 Asymptotic inference and information
2.1 Asymptotic inference
2.2 Notions of information
2.3 Inference and information: the maximum entropy principle
2.4 Tutorial 2: entropy and information in neural spike trains
3 High-dimensional inference: searching for principal components
3.1 Dimensional reduction and principal component analysis
3.2 The retarded learning phase transition
3.3 Tutorial 3: replay of neural activity during sleep following task learning
4 Priors, regularisation, sparsity
4.1 L_p-norm based priors
4.2 Conjugate priors
4.3 Invariant priors
4.4 Tutorial 4: sparse estimation techniques for RNA alternative splicing
5 Graphical models: from network reconstruction to Boltzmann machines
5.1 Network reconstruction for multivariate Gaussian variables
5.2 Boltzmann machines
5.3 Pseudo-likelihood methods
5.4 Tutorial 5: inference of protein structure from sequence data
6 Unsupervised learning: from representations to generative models
6.1 Autoencoders
6.2 Restricted Boltzmann machines and representations
6.3 Generative models
6.4 Learning from streaming data: principal component analysis revisited
6.5 Tutorial 6: online sparse principal component analysis of neural assemblies
7 Supervised learning: classification with neural networks
7.1 The perceptron, a linear classifier
7.2 Case of few data: overfitting
7.3 Case of many data: generalisation
7.4 A glimpse at multi-layered networks
7.5 Tutorial 7: prediction of binding between PDZ proteins and peptides
8 Time series: from Markov models to hidden Markov models
8.1 Markov processes and inference
8.2 Hidden Markov models
8.3 Tutorial 8: CG content variations in viral genomes
References
Index
Preface
Today’s science is characterised by an ever-increasing amount of data, due to instru-
mental and experimental progress in monitoring and manipulating complex systems
made of many microscopic constituents. While this tendency is true in all fields of
science, it is perhaps best illustrated in biology. The activity of neural populations,
composed of hundreds to thousands of neurons, can now be recorded in real time
and specifically perturbed, offering unique access to the underlying circuitry and
its relationship with functional behaviour and properties. Massive sequencing has
permitted us to build databases of coding DNA or protein sequences from a huge variety
of organisms, and exploiting these data to extract information about the structure,
function, and evolutionary history of proteins is a major challenge. Other examples
abound in immunology, ecology, development, etc.
How can we make sense of such data, and use them to enhance our understanding
of biological, physical, chemical, and other systems? Mathematicians, statisticians,
theoretical physicists, computer scientists, computational biologists, and others have
developed sophisticated approaches over recent decades to address this question. The
primary objective of this textbook is to introduce these ideas at the crossroads between
probability theory, statistics, optimisation, statistical physics, inference, and machine
learning. The mathematical details necessary to deeply understand the methods, as
well as their conceptual implications, are provided. The second objective of this book
is to provide practical applications of these methods, which will allow students to
really assimilate the underlying ideas and techniques. The principle is that students
are given a data set, asked to write their own code based on the material seen during
the theory lectures, and to analyse the data. Each application should correspond to a
two- to three-hour tutorial.
Most of the applications we propose here are related to biology, as they were part of
a course for Master of Science students specialising in biophysics at the Ecole Normale
Supérieure. The book’s companion website1 contains all the data sets necessary for
the tutorials presented in the book. It should be clear to the reader that the tutorials
proposed here are arbitrary and merely reflect the research interests of the authors.
Many more illustrations are possible! Indeed, our website presents further applications
to “pure” physical problems, e.g. coming from atomic physics or cosmology, based on
the same theoretical methods.
1 https://github.com/StatPhys2DataDrivenModel/DDM_Book_Tutorials
Little background in statistical inference is needed to benefit from this book. We
expect the material presented here to be accessible to MSc students not only in physics,
but also in applied maths and computational biology. Readers will need basic knowledge
of programming (Python or some equivalent language) for the applications, and of
mathematics (functional and linear analysis, algebra, probability). One of our major
goals is that students will be able to understand the mathematics behind the methods,
and not act as mere consumers of statistical packages. We pursue this objective
without emphasis on mathematical rigour, but with a constant effort to develop in-
tuition and show the deep connections with standard statistical physics. While the
content of the book can be thought of as a minimal background for scientists in the
contemporary data era, it is by no means exhaustive. Our objective will be truly ac-
complished if readers then actively seek to deepen their experience and knowledge by
reading advanced machine learning or statistical inference textbooks.
As mentioned above, a large part of what follows is based on the course we gave
at ENS from 2017 to 2021. We are grateful to A. Di Gioacchino, F. Aguirre-Lopez,
and all the course students for carefully reading the manuscript and pointing out
typos and errors. We are also deeply indebted to Jean-François Allemand and Maxime
Dahan, who first thought that such a course, covering subjects not always part of the
standard curriculum in physics, would be useful, and who strongly supported us. We
dedicate the present book to the memory of Maxime, who tragically passed away four
years ago.
Paris, January 2022.
Simona Cocco1, Rémi Monasson1,2 and Francesco Zamponi1
1 Ecole Normale Supérieure, Université PSL & CNRS
2 Department of Physics, Ecole Polytechnique
1 Introduction to Bayesian inference
This first chapter presents basic notions of Bayesian inference, starting with the def-
initions of elementary objects in probability, and Bayes’ rule. We then discuss two
historically motivated examples of Bayesian inference, in which a single parameter has
to be inferred from data.
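Anticipating the formal definitions of section 1.2, recall Bayes’ rule, which expresses the posterior probability of a parameter θ given observed data D as the product of the likelihood and the prior, normalised by the evidence:
P(θ | D) = P(D | θ) P(θ) / P(D).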
1.1 Why Bayesian inference?
Most systems in nature are made of small components, interacting in a complex way.
Think of sand grains in a dune, of molecules in a chemical reactor, or of neurons in a
brain area. Techniques to observe and quantitatively characterise these systems, or at
least part of them, are routinely developed by scientists and engineers, and allow one
to ask fundamental questions, see figure 1.1:
• What can we say about the future evolution of these systems? About how they will
respond to some perturbation, e.g. to a change in the environmental conditions?
Or about the behaviour of the subparts not accessible to measurements?
• What are the underlying mechanisms explaining the collective properties of these
systems? How do the small components interact together? What is the role played
by stochasticity in the observed behaviours?
The goal of Bayesian inference is to answer those questions based on observations,
which we will refer to as data in the following. In the Bayesian framework, both the
Fig. 1.1 A. A large complex system includes many components (black dots) that interact
together (arrows). B. An observer generally has access to a limited part of the system and
can measure the behaviour of the components therein, e.g. their characteristic activities over
time.
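To make the Bayesian programme concrete before the formal development, here is a minimal Python sketch in the spirit of the book’s tutorials (a toy illustration, not one of them; the coin-flip setting and all variable names are ours). It infers a single parameter, the unknown bias theta of a coin, by evaluating Bayes’ rule on a grid of candidate values:

import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7
flips = rng.random(50) < theta_true        # 50 coin flips; True counts as heads

theta = np.linspace(0.001, 0.999, 999)     # grid of candidate parameter values
prior = np.ones_like(theta)                # flat prior P(theta)
n_heads, n_flips = flips.sum(), flips.size

# Likelihood P(D | theta) of the observed counts for i.i.d. Bernoulli flips
likelihood = theta**n_heads * (1 - theta)**(n_flips - n_heads)

# Bayes' rule: posterior = likelihood x prior, normalised by the evidence P(D)
posterior = likelihood * prior
posterior /= np.trapz(posterior, theta)

print(f"posterior mean of theta: {np.trapz(theta * posterior, theta):.3f}")

With a flat prior the posterior mean is close to the empirical frequency of heads, and the posterior concentrates around theta_true as the number of flips grows, which is the asymptotic regime studied in chapter 2.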