Table Of ContentStructures of String Matching
and Data Compression
N. Jesper Larsson
Departmentof ComputerScience
LundUniversity
DepartmentofComputerScience
LundUniversity
Box118
S-22100Lund
Sweden
Copyright'1999byJesperLarsson
CODEN:LUNFD6/(NFCS-1015)/1(cid:150)130/(1999)
ISBN 91-628-3685-4
Abstract
This doctoral dissertation presents a range of results concerning ef(cid:2)-
cient algorithms and data structures for string processing, including
several schemescontributing to sequential data compression.It com-
prises boththeoretic resultsand practical implementations.
We study the suf(cid:2)x tree data structure, presenting an ef(cid:2)cient rep-
resentation and several generalizations. This includes augmenting the
suf(cid:2)xtreetofullysupportslidingwindowindexing(includingapracti-
calimplementation)inlineartime.Furthermore,weconsideravariant
thatindexesnaturallyword-partitioneddata, andpresentalinear-time
constructionalgorithmforatreethatrepresentsonlysuf(cid:2)xesstarting
atwordboundaries,requiringspacelinearinthenumberofwords.
By applying our sliding window indexing techniques, we achieve
an ef(cid:2)cient implementation for dictionary-based compression based
on the LZ-77 algorithm. Furthermore, considering predictive source
modelling, we show that a PPM* style model can be maintained in
linear time using arbitrarily bounded storage space.
We also consider the related problem of suf(cid:2)x sorting, applicable
to suf(cid:2)x array construction and block sorting compression. We present
an algorithm that eliminates super(cid:3)uousprocessing of previoussolu-
tions while maintaining robust worst-case behaviour.We experimen-
tally show favourable performance for a wide range of natural and
degenerate inputs,and present a complete implementation.
Block sorting compression using BWT, the Burrows-Wheeler trans-
form,hasimplicitstructurecloselyrelatedtocontexttreesusedinpre-
dictive modelling. We show how an explicit BWT context tree can
be ef(cid:2)ciently generated as a subset of the corresponding suf(cid:2)x tree
and explore the central problems in using this structure. We experi-
mentallyevaluatepredictioncapabilitiesofthetreeandconsiderrep-
resenting it explicitly as part of the compressed data, arguing that a
conscious treatment of the context tree can combine the compres-
sion performance of predictivemodelling with the computational ef-
(cid:2)ciency of BWT.
Finally,weexploreof(cid:3)inedictionary-basedcompression,andpresent
asemi-staticsourcemodellingschemethatobtainsexcellentcompres-
sion,yetisalsocapableofhighdecodingrates.Theamountofmemory
used by the decoder is (cid:3)exible, and the compressed data has the po-
tential of supporting direct search operations.
Between theoryand practice, sometalk as iftheyweretwo (cid:150)making
a separation and difference between them. Yet wise men know that
bothcan begained inapplyingoneselfwhole-heartedlytoone.
Bhagavad-Gfl(cid:17)tafl 5:4
Short-sighted programming can fail to improve the quality of life. It
can reduce it, causing economic loss or even physical harm. In a few
extreme cases, bad programmingpractice can lead todeath.
P.J. Plauger,
Computer Language, Dec. 1990
Contents
Foreword 7
ChapterOne Fundamentals 9
1.1 Basic De(cid:2)nitions 10
1.2 TrieStorageConsiderations 12
1.3 Suf(cid:2)xTrees 13
1.4 SequentialDataCompression 19
ChapterTwo SlidingWindowIndexing 21
2.1 Suf(cid:2)xTreeConstruction 22
2.2 SlidingtheWindow 24
2.3 StorageIssuesandFinalResult 32
ChapterThree IndexingWord-PartitionedData 33
3.1 De(cid:2)nitions 34
3.2 WastingSpace:AlgorithmA 36
3.3 SavingSpace:AlgorithmB 36
3.4 ExtensionsandVariations 40
3.5 SublinearConstruction:AlgorithmC 41
3.6 AdditionalNotesonPractice 45
ChapterFour Suf(cid:2)xSorting 48
4.1 Background 50
4.2 AFasterSuf(cid:2)xSort 52
4.3 TimeComplexity 56
4.4 AlgorithmRe(cid:2)nements 59
4.5 ImplementationandExperiments 63
ChapterFive Suf(cid:2)xTreeSourceModels 71
5.1 Ziv-LempelModel 71
5.2 PredictiveModelling 73
5.3 Suf(cid:2)xTreePPM*Model 74
5.4 FinitePPM*Model 76
5.5 Non-StructuralOperations 76
5.6 Conclusions 78
ChapterSix Burrows-WheelerContextTrees 79
6.1 Background 80
6.2 ContextTrees 82
6.3 TheRelationshipbetweenMove-to-frontCoding
andContextTrees 86
6.4 ContextTreeBWTCompressionSchemes 87
6.5 FinalComments 89
ChapterSeven Semi-Static DictionaryModel 91
7.1 PreviousApproaches 93
7.2 RecursivePairing 94
7.3 Implementation 95
7.4 CompressionEffectiveness 101
7.5 EncodingtheDictionary 102
7.6 Tradeoffs 105
7.7 ExperimentalResults 106
7.8 FutureWork 110
AppendixA SlidingWindowSuf(cid:2)xTreeImplementation 111
AppendixB Suf(cid:2)xSortingImplementation 119
AppendixC Notation 125
Bibliography 127
Foreword
O
riginally,mymotivationforstudyingcomputersciencewasmost
likely spawned by a calculator I bought fourteen years ago. This gad-
getcouldstoreashortsequenceofoperations,includingaconditional
jumptothestart,whichmadeitpossibletoprogramsurprisinglyintri-
catecomputations.Isoonrealizedthatthissimplemechanismhadthe
powerto replace the tedious repeated calculations I sodetested with
an intellectual exercise: to (cid:2)nd a general method to solve a speci(cid:2)c
problem (something I would later learn to refer to as an algorithm)
thatcouldbeexpressedbypressingasequenceofcalculator keys.My
fascination for this process still remains.
With more powerful computers, programming is easier, and more
challenging problems are needed to keep the process interesting. Ul-
timately,inalgorithmtheory,the bothersofproducingan actual pro-
gram are completely skipped over. Instead, the (cid:2)nal product is an
explanation of how an idealized machine could be programmed to
solve a problem ef(cid:2)ciently. In this abstract world, program elements
are represented as mathematical objects that interact as if they were
physical.Theycanbechainedtogether,piledontopofeachother,or
linked together to any level of complexity. Without these data struc-
tures, which can be combined into specialized tools for solving the
problemat hand, producinglargeorcomplicated programswould be
infeasible. However, they do not exist any further than in the pro-
grammer’smind;whentheprogramistobewritten,everythingmust
again betranslated into morebasic operations.Inmyresearch,Ihave
7
Foreword
triedtomaintain thisconnection, seeingalgorithmtheorynotmerely
as mathematics, but ultimately as a programming tool.
At a low level, computers represent everything as sequences of
numbers, albeit with different interpretations depending on the con-
text. The main topic in this thesis is algorithms and data structures
(cid:150) most often tree shaped structures (cid:150) for(cid:2)nding patterns and repeti-
tions in long sequences, strings, of similar items. Examples of typical
strings are texts (strings of letters and punctuation marks), programs
(stringsofoperations),andgeneticdata(stringsofaminoacids).Even
two-dimensional data, such as pictures, are represented as strings at
a lower level. One area particularly explored in the thesis is storing
strings compactly, compressing them, by recording repetition and sys-
tematically introducing abbreviations forrepeating patterns.
Theresultisacollectionofmethodsfororganizing,searching,and
compressing data. Its creation has deepened my insights in computer
science enormously, and I hope some of it can make a lasting contri-
bution to the computing world as well.
Numerouspeoplehavein(cid:3)uencedthis work.Obviously,mycoau-
thorsfordifferentpartsofthethesis,ArneAndersson,AlistairMoffat,
Kunihiko Sadakane, and Kurt Swanson, have had a direct part in its
creation,butmanyothershavecontributedinavarietyofways.With-
outattemptingtonamethemall,Iwouldliketoexpressmygratitude
toallthecentral andperipheralmembersoftheglobalresearchcom-
munity who have supported and assisted me.
Thein(cid:3)uenceofmyadvisorArneAnderssongoesbeyondthework
wherehe stands as an author.He broughtme into the research com-
munity from his special angle, and imprinted me with his views and
visions.Hisnotionsofwhatisrelevantresearch,andhowitshouldbe
presented, have guided me throughthese last (cid:2)ve years.
Finally,Iwishtospeci(cid:2)callythankAlistairMoffatforinvitingmeto
Melbourneandcollaboratingwithmeforthreemonths,duringwhich
time I was accepted as a full member of his dynamic research group.
This gave me a newperspective,and a signi(cid:2)cant pushtowards com-
pleting the thesis.
Malm(cid:246),August1999
JesperLarsson
8
Chapter One
Fundamentals
T
he main theme of this work is the organization of sequential data
to (cid:2)nd and exploit patterns and regularities. This chapter de(cid:2)nes ba-
sicconcepts,formulatesfundamental observationsandtheorems,and
presentsanef(cid:2)cientsuf(cid:2)xtreerepresentation.Followingchaptersfre-
quently referand relate tothe material givenin this chapter.
The material and much of the text in this current work is taken
primarilyfromthe following(cid:2)ve previouslypresentedwritings:
(cid:149) ExtendedApplicationofSuf(cid:2)xTreesto DataCompression,presented at
the IEEE Data Compression Conference 1996 [42]. A revised and
updated version of this material is laid out in chapters two and (cid:2)ve,
andtosomeextentin§1.3.
(cid:149) Suf(cid:2)x Trees on Words, written in collaboration with Arne Andersson
andKurtSwanson,publishedinAlgorithmica,March1998[4].Apre-
liminary versionwaspresentedat theseventhAnnual Symposiumon
Combinatorial Pattern Matching in June 1996. This is presented in
chapterthree,withsomeofthepreliminariesgivenin§1.2.
(cid:149) TheContextTreesofBlock SortingCompression,presented at the IEEE
Data Compression Conference 1998 [43]. This is the basis of chap-
tersix.
(cid:149) Of(cid:3)ine Dictionary-Based Compression,written with Alistair Moffat of
theUniversityofMelbourne,presentedattheIEEEDataCompression
Conference1999[44].Anextendedversionofthisworkispresented
inchapterseven.
9
ChapterOne
(cid:149) FasterSuf(cid:2)xSorting,writtenwithKunihikoSadakaneoftheUniversity
of Tokyo; technical report, submitted [45]. This work is reported in
chapterfour.Someofitsmaterialhasbeenpresentedinapreliminary
version as A Fast Algorithm for Making Suf(cid:2)xArrays andfor Burrows-
WheelerTransformationbyKunihikoSadakane[59].
1.1 Basic De(cid:2)nitions
We assume that the reader is familiar with basic conventional de(cid:2)ni-
tions regarding strings and graphs, and do not attempt to completely
de(cid:2)ne all the concepts used. However, to resolve differences in the
literature concerning the use of some concepts, we state the de(cid:2)ni-
tions of not only our specialized concepts, but also of some more
general ones.
For quick reference to our specialized notations, appendix C on
pages 125(cid:150)126 summarizes terms and symbols used in each of the
chapters of the thesis.
Notational Convention For notation regarding asymptotic growth of
functionsandsimilarconcepts,weadoptthegeneraltraditionincom-
puterscience;see,forinstance,Cormen,Leiserson,andRivest[20].
All logarithms in the thesis are assumed to be base two, except
where otherwise stated.
1.1.1 SymbolsandStrings Theinputofeach ofthe algorithmsdescribed in
this thesis is a sequence of items which we refer to as symbols. The
interpretationofthesesymbolsasletters,programinstructions,amino
acids, etc., is generally beyond our scope. We treat a symbol as an
abstract elementthat canrepresentany kindofunitinthe actual im-
plementation (cid:150)althoughwe doprovideseveral examplesof practical
uses,andoftenaim oureffortsat aparticular areaofapplication.
Two basic sets of operations for symbols are common. Either the
symbolsareconsideredatomic(cid:150)indivisibleunitssubjecttoonlyafew
prede(cid:2)nedoperations,ofwhichpairwisecomparisonisacommonex-
ample(cid:150)ortheyareassumedtoberepresentedbyintegers,andthereby
possibletomanipulatewithallthecommonarithmeticoperations.We
adoptpredominantlythelatter approach,sinceourprimarygoalisto
develop practically useful tools, and in present computers everything
isalways,atthelowestlevel,representedasintegers.Thus,restricting
allowedoperationsbeyondthesetofarithmeticonesoftenintroduces
an unrealistic impediment.
Wedenote thesize ofthe inputalphabet,thesetof possiblevalues
10
Description:grammer's mind; when the program is to be written, everything must. again be thors for different parts of the thesis, Arne Andersson, Alistair Moffat,. Kunihiko munity from his special angle, and imprinted me with his views and . it needs to be regarded as numeric, we normally assign it the value