Table Of ContentPractical Text Mining
and Statistical Analysis
for Non-structured
Text Data Applications
Gary Miner
Tulsa,OK,USA
DursunDelen
Tulsa,OK,USA
JohnElder
Charlottesville,VA,USA
Andrew Fast
Charlottesville,VA,USA
Thomas Hill
Tulsa,OK,USA
Robert A. Nisbet
SantaBarbara,CA,USA
Major Guest Authors:
JenniferThompson
Woodward,OK,USA
RichardFoley
Raleigh,NC,USA
AngelaWaner
Tulsa,OK,USA
LindaWinters-Miner
Tulsa,OK,USA
KarthikBalakrishnan
SanFrancisco,CA,USA
AMSTERDAM(cid:1)BOSTON(cid:1)HEIDELBERG(cid:1)LONDON(cid:1)NEWYORK(cid:1)OXFORD
PARIS(cid:1)SANDIEGO(cid:1)SANFRANCISCO(cid:1)SINGAPORE(cid:1)SYDNEY(cid:1)TOKYO
AcademicPressisanimprintofElsevier
AcademicPressisanimprintofElsevier
225WymanStreet,Waltham,MA02451,USA
TheBoulevard,Langfordlane,Kidlington,Oxford,OX51GB,UK
Copyright(cid:1)2012ElsevierInc.Allrightsreserved.
Nopartofthispublicationmaybereproducedortransmittedinanyformorbyanymeans,electronicor
mechanical,includingphotocopy,recording,oranyinformationstorageandretrievalsystem,withoutpermission
inwritingfromthepublisher.
PermissionsmaybesoughtdirectlyfromElsevier'sScience&TechnologyRightsDepartmentinOxford,UK:
phone:(+44)1865843830,fax:(+44)1865853333,E-mail:[email protected]
yourrequestonlineviatheElsevierhomepage(http://elsevier.com),byselecting“Support&Contact”then
“CopyrightandPermission”andthen“ObtainingPermissions.”
LibraryofCongressCataloging-in-PublicationData
Practicaltextminingandstatisticalanalysisfornon-structuredtextdataapplications/GaryMiner.[etal.].
e1sted.
p.cm.
Includesbibliographicalreferencesandindex.
ISBN978-0-12-386979-1
1.Datamining.I.Miner,Gary.
QA76.9.D343P732012
006.3012–dc23
2011036169
BritishLibraryCataloguing-in-PublicationData
AcataloguerecordforthisbookisavailablefromtheBritishLibrary.
ISBN:978-0-12-386979-1
For information on all Academic Press publications
visitourWebsiteatwww.books.elsevier.com
PrintedandboundinTheUSA
12 13 14 10 9 8 7 6 5 4 3 2 1
To my mother, CarlynElder, who has been alifelong learner and gifted teacher.
dJohn Elder
Firstandforemost,Idedicatethisbooktomymotherandmyfather,whopassedawayin2008.Bothof
themencouragedandsupportedmethroughoutmylife.Ialsodedicatethisbooktomywife,Handan,
andour twochildren, Altug and Serra.Theyare my inspirationand my motivation.
dDursunDelen
Tothepractitionersofpredictiveanalyticsinhonoroftheircontributionstoknowledgeandprogress.
dThomasHill
Tomygrownchildren,RebeccaandMatthew,whotoldmeonmysixtiethbirthday,“Dad,youneedto
grow up anddo something valuablewith your life. .Startwritingbooks!”
dGary Miner
To my wife, Jean, whose longsuffering patience through the completion of this book is greatly
appreciated.
dBob Nisbet
Endorsements for
Practical Text Mining &
Statistical Analysis for Non-structured Text
Data Applications
Karl Rexer, Ph.D.
“Practical”dthat’sit!Thefirstwordinthetitlereallysumsitup.Ifyouwantbigtheoreticalacademic
discussions,goreadanotherbook.Whenyouwantrealhelpextractinginsightfromthemountainsof
textthatyou'refacing,thisisthebooktoturntoforimmediatepracticaladvice.Itwillgiveyounewideas
andframeworkstoaddressyourtextminingproblems.Thisbookwillhelpyougetthejobdone!
Thetutorials(casestudies)thatGaryandhiscoauthorshaveassembledreallybringthebookalive.
Theyshowreal-worldapplicationsoftextmining,includingthecomplexityoftextanalysis,howreal-
world issues shape the analytic plans and implementations, and the great positive impact that text
mining can have.
dKarl Rexer, Ph.D.
President,RexerAnalytics
October,2011
KarlRexerispresidentofRexerAnalytics(www.RexerAnalytics.com),aBoston-basedconsultingfirmspecializingin
data mining and analytic CRM consulting. Karl and his teams have delivered analytic solutions to dozens of
companies, including customer attrition measurement and prediction, multicountry fraud detection programs,
customer segmentations, targeting and measurement analytics for million-piece direct marketing initiatives, multiyear
customersatisfactiontracking,salesforecasting,productallocationoptimization,marketbasketanalyses,productcross-
sellmodeling,andtextmining.Everyyear,RexerAnalyticsconductsandfreelydistributesthewidelyreadAnnualData
Miner Survey. In 2011, over 1,300 data miners participated in the 5th Annual Survey.
Karlisaleaderinthefieldofapplieddatamining.Heisfrequentlyaninvitedspeakeratconferencesanduniversities.
Karl has served on the organizing committees of numerous international Data Mining and Business Intelligence
conferences (e.g., KDD, BIWA Summit, Oracle Collaborate). He also reviews papers, is a conference moderator, and
serves on awards committees to select the best applied data mining papers. Several software companies have
soughtKarl’sinput,andin2008and2010,heservedonSPSS’s(IBM)CustomerAdvisoryBoard.Sincehiselectionin
2007,KarlhasalsoservedontheboardofdirectorsoftheOracleBusinessIntelligence,Warehousing,&Analytics(BIWA)
SpecialInterestGroup.In2011,LinkedInnamedKarlnumber1intheirlistof“TopPredictiveAnalyticsProfessionals.”
Prior to founding Rexer Analytics, Karl held leadership and consulting positions at several consulting firms and two
xi
multinationalbanks.
xii Endorsements
Richard De Veaux, Ph.D.
TheauthorsofPracticalTextMiningandStatisticalAnalysisforNonstructuredTextDataApplicationshave
managed to produce three books in one. First, in17 chapters theygive afriendly yet comprehensive
introduction to the huge field of text mining, a field comprising techniques from several different
disciplines and a variety of different tasks. Miner and his colleagues have produced a readable
overview of the area that is sure to help the practitioner navigate this large and unruly ocean of
techniques.Second,theauthorsprovideacomprehensivelistandreviewofboththecommercialand
free software available to perform most text data mining tasks. Finally, and most importantly, the
authorshavealsoprovidedanamazingcollectionoftutorialsandcasestudies.Thetutorialsillustrate
various text mining scenarios and paths actually taken by researchers, while the case studies go into
even more depth, showing both the methodology used and the business decisions taken based on
the analysis. These practical step-by-step guides are impressive not only in the breadth of their
applications but in the depth and detail that each case study delivers. The studies are authored by
several guest authors in addition to the book authors and are built on real problems with real
solutions.Thesecasestudiesandtutorialsalonemakethebookworthhaving.Ihaveneverseensuch
a collection of real business problems published in any field, much less in such a new field as text
mining. These, together with the explanations in the chapters, should provide the practitioner
wishing to get a broad view of the text mining field an invaluable resource for both learning and
practice.
dRichard DeVeaux
Professor of Statistics; Dept. of Mathematics and Statistics;
WilliamsCollege; Williamstown MA 01267
October,2011
RichardD.DeVeauxisaninternationallyknowneducatorandlecturerindataminingandstatisticswhoservesas
aconsultantformanyFortune500companiesandfederalagencies.HeisaprofessorofstatisticsatWilliamsCollege,
wherehehastaughtsince1994.HealsotaughtintheEngineeringSchoolatPrincetonUniversityandtheWhartonSchool
ofBusinessattheUniversityofPennsylvania.Dr.DeVeauxisanelectedFellowoftheAmericanStatisticalAssociation
(ASA),andin2008,hewasnamedMostellerStatisticianoftheYearbytheBostonChapteroftheASA.Hewasafounding
officeroftheASASectiononStatisticalLearningandDataMining(SLDM)andiscurrentlytherepresentativeoftheSLDM
tothecouncilonSections.In2006and2007,Dr.DeVeauxwastheWilliamR.KenanJr.VisitingProfessorforDistin-
guishedTeachingatPrincetonUniversity.Inadditiontonumerousjournalarticles,heisthecoauthorofseveralcritically
acclaimedandbest-sellingtextbooks,includingIntroStats(3rded.withPaulVellemanandDavidBock),Stats:Dataand
Models(3rded.withVellemanandBock),Stats:ModelingtheWorld(3rded.withVellemanandBock),BusinessStatistics
(2nd ed. with Norean Sharpe and Velleman), and Business Statistics: A First Course (1st ed. with Norean Sharpe and
Velleman),allpublishedbyPearson.
Endorsements xiii
Gerard Britton, J.D.
In writing Practical Text Mining and Statistical Analysis for Nonstructured Text Data Applications, the six
authors(Miner,Delen,Elder,Fast,Hill,andNisbet)acceptedthedauntingtaskofcreatingacohesive
operationalframeworkfromthedisparateaspectsandactivitiesoftextmining,anemergingfieldthat
theyappropriatelydescribeasthe“WildWest”ofdatamining.Tappingintotheiruniqueexpertiseand
applyingawide cross-application lens, they havesucceeded in their mission.
Rather than listing the facets of text mining simply as independent academic topics of discussion,
the book leans much more to the practical, presenting a conceptual road map to assist users in
correlating articulated text mining techniques to categories of actual commonly observed business
needs. To finish out the job, summaries for some of the most prevalent commercial text mining
solutions are included, along with examples. In this way, the authors have uniquely presented a text
mining resource with value to readersacross that breadth of business applications.
dGerard Britton,J.D.
V.P., GRC Analytics, Opera Solutions LLC
October,2011
GerardJ.BrittonisvicepresidentofTextAnalyticsintheGovernanceGroupatOperaSolutionsLLC.Mr.Britton,an
experienced attorney, prosecutor, and public and private sector executive, has over 15 years of experience bringing
technicalandanalyticsstrategiestothemission-criticalchallengesofgovernance,compliance,fraud,investigations,and
electronicdiscovery.Hehasmanagedthedesignanddevelopmentoftextanalyticssolutionsintheelectronicdiscovery
field, including advanced predictive models to enable early case assessments and machine-assisted issue coding of
documents. He maintains the “Critical Thought in Legal, Compliance and Investigative Analytics” blog at http://
postmodern-ediscovery.blogspot.com/ and is a frequent speaker at leading industry events on topics related to
analyticsapplicationsininvestigations,compliance,andlitigation.
James Taylor
TextMiningisoneofthosephrasespeoplethrowaroundasthoughitdescribessomethingsingular.As
the authors of Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications
show us, nothing could be further from the truth. There is a rich, diverse ecosystem of text mining
approachesandtechnologiesavailable.Readersofthisbookwilldiscoveramyriadofwaystousethese
textminingapproachestounderstandandimprovetheirbusiness.Becausetheauthorsareapractical
bunch the book is full of examples and tutorials that use every approach, multiple commercial and
opensourcetools,andthatshowthepowerandtrade-offseachinvolves.Thecasestudiesareworked
through in detail by the authors so you can see exactly how things would be done and learn how to
apply it to your own problems.
Ifyouareinterestedintextmining,andyoushouldbe,thisbookwillgiveyouaperspectivethatis
broad, deep andapproachable.
dJames Taylor
CEO Decision Management Solutions,
October,2011
xiv Endorsements
James is the CEO of Decision Management Solutions, and is the leading expert in how to use business rules and
predictive analytics to build Decision Management Systems. James is passionate about helping companies improve
decision making anddevelopan agile,analytic andadaptivebusiness. Heprovides strategicconsulting, workingwith
clientstoadoptdecisionmakingtechnology.JamesisalsoafacultymemberoftheInternationalInstituteforAnalytics
andhasledDecisionManagement effortsforleadingcompanies ininsurance,banking,healthmanagement andtele-
communications.Jamesistheauthorof“DecisionManagementSystems:Apracticalguidetousingbusinessrulesand
predictive analytics” (IBM Press, 2011) and previously wrote Smart (Enough) Systems: How to Deliver Competitive
AdvantagebyAutomatingHiddenDecisions(PrenticeHall)withNeilRaden.
Foreword 1
FOREWORDNo.1:for“PracticalTextMiningandStatisticallAnalysisforNon-structuredTextData
Applicationsns”fromtheviewpointofapersontrainedandhavingalife-longinternationallynoted
career in traditional statistics, especially Discriminant Analysis and the Jackknife procedure, and
Co-AuthorGary Miner’smentor in statistical analysis:
Textminingandanalysisisanewareaofresearchthatislessthan15yearsold.Sowecanexpectnew
editionsofthisbookasthefieldmatures.Thisbookconcernsthetopicsofextractinginformationfrom
textandfindinginterestingpatternsintheresults.Thebookcoversworkingwithtextdocuments.The
introduction explains that the Library of Congress has many “documents” that include text (books,
reports, emails, etc.), recordings (sounds), and images (including stills and motion pictures). Appar-
entlytextminingdoesnotincludeminingsoundsandimages,althoughIcanvisualizeitexpandingto
includescriptsandlyrics.Ofcourse,songsandmusicalcontributionsinvolvemuchmorethantext.This
may be adirection for the future.
Much of text mining is observational, and as such, it may pay to be cautious about drawing
conclusions.Forexample,indataminingonemaywishtomakeinferencesaboutconsumerbehavior
(e.g.,purchasesinamarketchain).However,sincethemarketmaynothaveallbrandsofallproducts,
no information can be inferred about nonstocked brands. Also, words and phrases are inherently
unordered,somanystatisticalmethodsthatrelyoncontinuousdatawillnotapply.Methodsbasedon
tables andproportions willbe the order of the day.
Therearealwaysconcernsaboutnewareasofscience.Asscientistsbecomefamiliarwiththepitfalls
andbenefits,practiceswillchangeandnewmethodswillbeadded.Forexample,thestandardofevidence
foranassociationinatextminingcontextmightbequitedifferentfromthatacceptedaslegalevidence.
Theadequacyofaccuracyinwebsitesisanareainwhichconcernhasbeennoted.Thisisnotanareathat
textmininghasbeenconcernedwith,butsome“facts”distributedonthewebcanbebadlydistorted.One
needonlyexaminepoliticalcandidates’websites.Additionally,oneneedonlylookattheletterstothe
editoroflocalnewspaperstoobservemisinformationanddisinformation.Ihopethattextminingcan
includesomefiguresofmeritondocuments(basedonaccuracyofinformation,completenessofinfor-
mation,andcurrencyofinformation,amongotherthings).Inmeta-analysis,studiesthatdon’tmeasure
upareminimizedordiscarded.Akeyconcernisthesetoftextsthataren’tincludedinthedatabase.
Themainstepsare as follows:
(cid:1) Definethepurposeof thestudy: This is alltoo easy to omit. Theresearcher may be so familiarwith
the field that the purposeis very clear tohimor her, but someone else in the fieldmayhave
adifferent idea. xv
xvi Foreword 1
(cid:1) Determinetheavailabilityofthedata:Thisimpliesafullliteraturereviewandcataloguingofdatasets,
andknowing the definitions of the terms ineachset is imperativeas well. Asthe data sets are
typicallydocuments, eachauthor will likely havehis or her own definition.
(cid:1) Prepare thedata: Properencodingof thedocuments makes life much easier. Methodsdiscussedin
thisbook makeit much simpler to use. This will include feature extraction,entityextraction, and
reductionofdimension.Thelastmaynotmeananyfewervariables,buttheywillbecombinedin
ways that reduce the number of items in the modelsddefinitely a nontrivialtask.
(cid:1) Develop and assessthe models: One must considerhow the models apply in the contextof the
problem.Assessing model fit, for example,will evaluate how well the models predictthe data.
(cid:1) Evaluate themodels: This is close to the previous bullet.
(cid:1) Deploy the results: Inthis step, the researcher shares hisor her work withthe public.For example,
withinsurancefraud,onewoulddemonstratethatfraudcouldbereducedusingthesemodels.The
modelswouldidentifykeywordsinfraudulentclaimsthatcouldbeusedinfutureworkdinsome
instances,keeping the results confidentialtoavoid alerting the bad guys.
Allof these steps are detailed in the chapters.
English(andanylanguage)hasagreatdealofredundancy,soitisnotnecessarytomineeachword.
Manywordscarrygreatmeaningincontext,sotheremaynotbeaneedtohaveallwordsencoded.For
example,inamedicalcontext,thetermstumor,cancer,andlesionmaybeusedmoreorlessinterchangeably.
Thebookhas28tutorials,whichshouldbereadasthechaptersareread.Ialsostronglyrecommend
thatthereaderhaveaccesstooneoftheprogramsmentionedinthebook.Findageneralpurposetext
minerand work through the exercises.
Bealertforitemsofinterest.Getsomeexamplesofyourown;thiswilllikelybethebestwaytolearn
text mining.
Peter A. Lachenbruch, Ph.D.
Oregon State University, Professor of Public Health (2006epresent),
Past president of the American Statistical Association
Dr.PeterLachenbruchreceivedhisPh.D.inbiostatisticsfromUCLA.Hehasheldpositionsonthefacultiesofthe
University of North Carolina (1965e1976), the University of Iowa (1976e1985), and UCLA (1985e1994). He was
employedbytheFDA/CBERfrom1994to2005andretiredfromthereasthedirectoroftheDivisionofBiostatistics.Heis
currentlyProfessorofPublicHealthatOregonStateUniversity(2006epresent).HeisaFellowoftheAmericanStatistical
AssociationandaformerelectedmemberoftheInternationalStatisticalInstitute.Hehasheldmanyprofessionaloffices
andwasthepresidentoftheAmericanStatisticalAssociationfor2008.
He has statistical interests in discriminant analysis, two-part models, model-independent inference, statistical
computing, and data analysis. He is known for making the jackknife re-sampling method an accepted procedure,
and in recent years he has published on validation of neural networks using hybrid resampling methods. He has
application interests in rheumatology, psychiatry, pediatrics, gerontology and accident epidemiology. He has more
than 180 publications in these fields. Dr. Lachenbruch serves on the Editorial Boards of Statistics in Medicine,
Methods of Information in Medicine, Journal of Biopharmaceutical Statistics, and Statistical Methods in Medical
Research.
Foreword 2
PracticalText Mining and StatisticalAnalysis for Nonstructured Text Data Applications follows in the tradi-
tionofthepopularHandbookofStatisticalAnalysis&DataMiningApplications(Elsevier,2009),whichwas
authored by three of this new text’s six authors: Robert Nisbet, John Elder, and Gary Miner. Three
additional authorsdThomas Hill, Dursun Delen, and Andrew Fastdcontributed to this book, each
bringingsubstantialexperience in text mining.
Practical Text Mining was written to be used as an application-oriented text mining handbook. At
some thousand pages in length, the authors have created a text that will to my mind serve as the
standard resource in the field for many years. The authors have managed to provide readers with
acomprehensive,yetthoroughlyunderstandable,overviewofthetheoryandscopeoftextmining.In
addition, they provide 28 tutorials that assist readers in actually applying theory to a wide range of
practicalapplications using up-to-date statistical techniques.
Text mining is one of those disciplines that are now emerging as cutting-edge technologies. Only
a few years ago, standard desktop computers did not have the physical capability to handle huge
amount of textual material or to engage in the complex analysis of page-length material. Storage
capacity and in particular MHz and RAM were insufficient to appropriately analyze large or complex
textual situations. Recently, however, PC-type computers have become powerful and fast enough to
engage in sophisticated text mining analysis, which itself has the capability of assisting researchers to
better understand a host ofsocialand biological issues.
Severaltextminingapplicationsnowexisttoenableresearcherstoengageintextminingactivities.
Thebook doesnot limit itself tousing only onetext mining package but rather providesstep-by-step
instructionsonhowtoworkwithseveraloftheforemostapplications,includingStatSoft’sSTATISTICA
DataMinerandTextMiner,SASEnterpriseMiner&TextMiner,andIBM-SPSSModelerPremium.The
authors also give readers using RapidMiner, Topsy, Weka, and Salford System’s STM(cid:1), CART(cid:3), and
(cid:3)
TreeNet tutorials employing these applications. For all of the tutorials, readers are guided through
such text mining processes as classification, prediction, named entity extraction, feature selection,
dimensionality reduction, singular value decomposition, clustering, focused web crawling, web
mining, andotherassociated techniques.
PartItakesthereaderthroughthehistoricalbackgroundoftextmining,providesgeneraltextmining
theory,andpresentscommonapplicationsandtheirassociatedstatisticalanddatamanagementtools.
Part II is called the text mining laboratory in which the 28 tutorials are provided. Each tutorial is
authored by the book’s authors and/or by guest authors having special expertise in the subject of the
tutorialandthesoftwareusedforit.Codeisprovidedforthetutorialswhereapplicable,andannotated
screenshots are displayed throughout in order to make it easier for readers to replicate the specific
tutorial example,as well as to enablethemto work throughtheir own project analysis. xvii
Description:"Theyve done it again. From the same industry leaders who brought you the "bible" of data mining comes the definitive, go-to text mining resource. This book empowers you to dig in and seize value, with over two dozen hands-on tutorials that drive an incredible range of applications such as predict