Table Of Content

Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications Gary Miner Tulsa,OK,USA DursunDelen Tulsa,OK,USA JohnElder Charlottesville,VA,USA Andrew Fast Charlottesville,VA,USA Thomas Hill Tulsa,OK,USA Robert A. Nisbet SantaBarbara,CA,USA Major Guest Authors: JenniferThompson Woodward,OK,USA RichardFoley Raleigh,NC,USA AngelaWaner Tulsa,OK,USA LindaWinters-Miner Tulsa,OK,USA KarthikBalakrishnan SanFrancisco,CA,USA AMSTERDAM(cid:1)BOSTON(cid:1)HEIDELBERG(cid:1)LONDON(cid:1)NEWYORK(cid:1)OXFORD PARIS(cid:1)SANDIEGO(cid:1)SANFRANCISCO(cid:1)SINGAPORE(cid:1)SYDNEY(cid:1)TOKYO AcademicPressisanimprintofElsevier AcademicPressisanimprintofElsevier 225WymanStreet,Waltham,MA02451,USA TheBoulevard,Langfordlane,Kidlington,Oxford,OX51GB,UK Copyright(cid:1)2012ElsevierInc.Allrightsreserved. Nopartofthispublicationmaybereproducedortransmittedinanyformorbyanymeans,electronicor mechanical,includingphotocopy,recording,oranyinformationstorageandretrievalsystem,withoutpermission inwritingfromthepublisher. PermissionsmaybesoughtdirectlyfromElsevier'sScience&TechnologyRightsDepartmentinOxford,UK: phone:(+44)1865843830,fax:(+44)1865853333,E-mail:[email protected] yourrequestonlineviatheElsevierhomepage(http://elsevier.com),byselecting“Support&Contact”then “CopyrightandPermission”andthen“ObtainingPermissions.” LibraryofCongressCataloging-in-PublicationData Practicaltextminingandstatisticalanalysisfornon-structuredtextdataapplications/GaryMiner.[etal.]. e1sted. p.cm. Includesbibliographicalreferencesandindex. ISBN978-0-12-386979-1 1.Datamining.I.Miner,Gary. QA76.9.D343P732012 006.3012–dc23 2011036169 BritishLibraryCataloguing-in-PublicationData AcataloguerecordforthisbookisavailablefromtheBritishLibrary. ISBN:978-0-12-386979-1 For information on all Academic Press publications visitourWebsiteatwww.books.elsevier.com PrintedandboundinTheUSA 12 13 14 10 9 8 7 6 5 4 3 2 1 To my mother, CarlynElder, who has been alifelong learner and gifted teacher. dJohn Elder Firstandforemost,Idedicatethisbooktomymotherandmyfather,whopassedawayin2008.Bothof themencouragedandsupportedmethroughoutmylife.Ialsodedicatethisbooktomywife,Handan, andour twochildren, Altug and Serra.Theyare my inspirationand my motivation. dDursunDelen Tothepractitionersofpredictiveanalyticsinhonoroftheircontributionstoknowledgeandprogress. dThomasHill Tomygrownchildren,RebeccaandMatthew,whotoldmeonmysixtiethbirthday,“Dad,youneedto grow up anddo something valuablewith your life. .Startwritingbooks!” dGary Miner To my wife, Jean, whose longsuffering patience through the completion of this book is greatly appreciated. dBob Nisbet Endorsements for Practical Text Mining & Statistical Analysis for Non-structured Text Data Applications Karl Rexer, Ph.D. “Practical”dthat’sit!Thefirstwordinthetitlereallysumsitup.Ifyouwantbigtheoreticalacademic discussions,goreadanotherbook.Whenyouwantrealhelpextractinginsightfromthemountainsof textthatyou'refacing,thisisthebooktoturntoforimmediatepracticaladvice.Itwillgiveyounewideas andframeworkstoaddressyourtextminingproblems.Thisbookwillhelpyougetthejobdone! Thetutorials(casestudies)thatGaryandhiscoauthorshaveassembledreallybringthebookalive. Theyshowreal-worldapplicationsoftextmining,includingthecomplexityoftextanalysis,howreal- world issues shape the analytic plans and implementations, and the great positive impact that text mining can have. dKarl Rexer, Ph.D. President,RexerAnalytics October,2011 KarlRexerispresidentofRexerAnalytics(www.RexerAnalytics.com),aBoston-basedconsultingfirmspecializingin data mining and analytic CRM consulting. Karl and his teams have delivered analytic solutions to dozens of companies, including customer attrition measurement and prediction, multicountry fraud detection programs, customer segmentations, targeting and measurement analytics for million-piece direct marketing initiatives, multiyear customersatisfactiontracking,salesforecasting,productallocationoptimization,marketbasketanalyses,productcross- sellmodeling,andtextmining.Everyyear,RexerAnalyticsconductsandfreelydistributesthewidelyreadAnnualData Miner Survey. In 2011, over 1,300 data miners participated in the 5th Annual Survey. Karlisaleaderinthefieldofapplieddatamining.Heisfrequentlyaninvitedspeakeratconferencesanduniversities. Karl has served on the organizing committees of numerous international Data Mining and Business Intelligence conferences (e.g., KDD, BIWA Summit, Oracle Collaborate). He also reviews papers, is a conference moderator, and serves on awards committees to select the best applied data mining papers. Several software companies have soughtKarl’sinput,andin2008and2010,heservedonSPSS’s(IBM)CustomerAdvisoryBoard.Sincehiselectionin 2007,KarlhasalsoservedontheboardofdirectorsoftheOracleBusinessIntelligence,Warehousing,&Analytics(BIWA) SpecialInterestGroup.In2011,LinkedInnamedKarlnumber1intheirlistof“TopPredictiveAnalyticsProfessionals.” Prior to founding Rexer Analytics, Karl held leadership and consulting positions at several consulting firms and two xi multinationalbanks. xii Endorsements Richard De Veaux, Ph.D. TheauthorsofPracticalTextMiningandStatisticalAnalysisforNonstructuredTextDataApplicationshave managed to produce three books in one. First, in17 chapters theygive afriendly yet comprehensive introduction to the huge field of text mining, a field comprising techniques from several different disciplines and a variety of different tasks. Miner and his colleagues have produced a readable overview of the area that is sure to help the practitioner navigate this large and unruly ocean of techniques.Second,theauthorsprovideacomprehensivelistandreviewofboththecommercialand free software available to perform most text data mining tasks. Finally, and most importantly, the authorshavealsoprovidedanamazingcollectionoftutorialsandcasestudies.Thetutorialsillustrate various text mining scenarios and paths actually taken by researchers, while the case studies go into even more depth, showing both the methodology used and the business decisions taken based on the analysis. These practical step-by-step guides are impressive not only in the breadth of their applications but in the depth and detail that each case study delivers. The studies are authored by several guest authors in addition to the book authors and are built on real problems with real solutions.Thesecasestudiesandtutorialsalonemakethebookworthhaving.Ihaveneverseensuch a collection of real business problems published in any field, much less in such a new field as text mining. These, together with the explanations in the chapters, should provide the practitioner wishing to get a broad view of the text mining field an invaluable resource for both learning and practice. dRichard DeVeaux Professor of Statistics; Dept. of Mathematics and Statistics; WilliamsCollege; Williamstown MA 01267 October,2011 RichardD.DeVeauxisaninternationallyknowneducatorandlecturerindataminingandstatisticswhoservesas aconsultantformanyFortune500companiesandfederalagencies.HeisaprofessorofstatisticsatWilliamsCollege, wherehehastaughtsince1994.HealsotaughtintheEngineeringSchoolatPrincetonUniversityandtheWhartonSchool ofBusinessattheUniversityofPennsylvania.Dr.DeVeauxisanelectedFellowoftheAmericanStatisticalAssociation (ASA),andin2008,hewasnamedMostellerStatisticianoftheYearbytheBostonChapteroftheASA.Hewasafounding officeroftheASASectiononStatisticalLearningandDataMining(SLDM)andiscurrentlytherepresentativeoftheSLDM tothecouncilonSections.In2006and2007,Dr.DeVeauxwastheWilliamR.KenanJr.VisitingProfessorforDistin- guishedTeachingatPrincetonUniversity.Inadditiontonumerousjournalarticles,heisthecoauthorofseveralcritically acclaimedandbest-sellingtextbooks,includingIntroStats(3rded.withPaulVellemanandDavidBock),Stats:Dataand Models(3rded.withVellemanandBock),Stats:ModelingtheWorld(3rded.withVellemanandBock),BusinessStatistics (2nd ed. with Norean Sharpe and Velleman), and Business Statistics: A First Course (1st ed. with Norean Sharpe and Velleman),allpublishedbyPearson. Endorsements xiii Gerard Britton, J.D. In writing Practical Text Mining and Statistical Analysis for Nonstructured Text Data Applications, the six authors(Miner,Delen,Elder,Fast,Hill,andNisbet)acceptedthedauntingtaskofcreatingacohesive operationalframeworkfromthedisparateaspectsandactivitiesoftextmining,anemergingfieldthat theyappropriatelydescribeasthe“WildWest”ofdatamining.Tappingintotheiruniqueexpertiseand applyingawide cross-application lens, they havesucceeded in their mission. Rather than listing the facets of text mining simply as independent academic topics of discussion, the book leans much more to the practical, presenting a conceptual road map to assist users in correlating articulated text mining techniques to categories of actual commonly observed business needs. To finish out the job, summaries for some of the most prevalent commercial text mining solutions are included, along with examples. In this way, the authors have uniquely presented a text mining resource with value to readersacross that breadth of business applications. dGerard Britton,J.D. V.P., GRC Analytics, Opera Solutions LLC October,2011 GerardJ.BrittonisvicepresidentofTextAnalyticsintheGovernanceGroupatOperaSolutionsLLC.Mr.Britton,an experienced attorney, prosecutor, and public and private sector executive, has over 15 years of experience bringing technicalandanalyticsstrategiestothemission-criticalchallengesofgovernance,compliance,fraud,investigations,and electronicdiscovery.Hehasmanagedthedesignanddevelopmentoftextanalyticssolutionsintheelectronicdiscovery field, including advanced predictive models to enable early case assessments and machine-assisted issue coding of documents. He maintains the “Critical Thought in Legal, Compliance and Investigative Analytics” blog at http:// postmodern-ediscovery.blogspot.com/ and is a frequent speaker at leading industry events on topics related to analyticsapplicationsininvestigations,compliance,andlitigation. James Taylor TextMiningisoneofthosephrasespeoplethrowaroundasthoughitdescribessomethingsingular.As the authors of Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications show us, nothing could be further from the truth. There is a rich, diverse ecosystem of text mining approachesandtechnologiesavailable.Readersofthisbookwilldiscoveramyriadofwaystousethese textminingapproachestounderstandandimprovetheirbusiness.Becausetheauthorsareapractical bunch the book is full of examples and tutorials that use every approach, multiple commercial and opensourcetools,andthatshowthepowerandtrade-offseachinvolves.Thecasestudiesareworked through in detail by the authors so you can see exactly how things would be done and learn how to apply it to your own problems. Ifyouareinterestedintextmining,andyoushouldbe,thisbookwillgiveyouaperspectivethatis broad, deep andapproachable. dJames Taylor CEO Decision Management Solutions, October,2011 xiv Endorsements James is the CEO of Decision Management Solutions, and is the leading expert in how to use business rules and predictive analytics to build Decision Management Systems. James is passionate about helping companies improve decision making anddevelopan agile,analytic andadaptivebusiness. Heprovides strategicconsulting, workingwith clientstoadoptdecisionmakingtechnology.JamesisalsoafacultymemberoftheInternationalInstituteforAnalytics andhasledDecisionManagement effortsforleadingcompanies ininsurance,banking,healthmanagement andtele- communications.Jamesistheauthorof“DecisionManagementSystems:Apracticalguidetousingbusinessrulesand predictive analytics” (IBM Press, 2011) and previously wrote Smart (Enough) Systems: How to Deliver Competitive AdvantagebyAutomatingHiddenDecisions(PrenticeHall)withNeilRaden. Foreword 1 FOREWORDNo.1:for“PracticalTextMiningandStatisticallAnalysisforNon-structuredTextData Applicationsns”fromtheviewpointofapersontrainedandhavingalife-longinternationallynoted career in traditional statistics, especially Discriminant Analysis and the Jackknife procedure, and Co-AuthorGary Miner’smentor in statistical analysis: Textminingandanalysisisanewareaofresearchthatislessthan15yearsold.Sowecanexpectnew editionsofthisbookasthefieldmatures.Thisbookconcernsthetopicsofextractinginformationfrom textandfindinginterestingpatternsintheresults.Thebookcoversworkingwithtextdocuments.The introduction explains that the Library of Congress has many “documents” that include text (books, reports, emails, etc.), recordings (sounds), and images (including stills and motion pictures). Appar- entlytextminingdoesnotincludeminingsoundsandimages,althoughIcanvisualizeitexpandingto includescriptsandlyrics.Ofcourse,songsandmusicalcontributionsinvolvemuchmorethantext.This may be adirection for the future. Much of text mining is observational, and as such, it may pay to be cautious about drawing conclusions.Forexample,indataminingonemaywishtomakeinferencesaboutconsumerbehavior (e.g.,purchasesinamarketchain).However,sincethemarketmaynothaveallbrandsofallproducts, no information can be inferred about nonstocked brands. Also, words and phrases are inherently unordered,somanystatisticalmethodsthatrelyoncontinuousdatawillnotapply.Methodsbasedon tables andproportions willbe the order of the day. Therearealwaysconcernsaboutnewareasofscience.Asscientistsbecomefamiliarwiththepitfalls andbenefits,practiceswillchangeandnewmethodswillbeadded.Forexample,thestandardofevidence foranassociationinatextminingcontextmightbequitedifferentfromthatacceptedaslegalevidence. Theadequacyofaccuracyinwebsitesisanareainwhichconcernhasbeennoted.Thisisnotanareathat textmininghasbeenconcernedwith,butsome“facts”distributedonthewebcanbebadlydistorted.One needonlyexaminepoliticalcandidates’websites.Additionally,oneneedonlylookattheletterstothe editoroflocalnewspaperstoobservemisinformationanddisinformation.Ihopethattextminingcan includesomefiguresofmeritondocuments(basedonaccuracyofinformation,completenessofinfor- mation,andcurrencyofinformation,amongotherthings).Inmeta-analysis,studiesthatdon’tmeasure upareminimizedordiscarded.Akeyconcernisthesetoftextsthataren’tincludedinthedatabase. Themainstepsare as follows: (cid:1) Definethepurposeof thestudy: This is alltoo easy to omit. Theresearcher may be so familiarwith the field that the purposeis very clear tohimor her, but someone else in the fieldmayhave adifferent idea. xv xvi Foreword 1 (cid:1) Determinetheavailabilityofthedata:Thisimpliesafullliteraturereviewandcataloguingofdatasets, andknowing the definitions of the terms ineachset is imperativeas well. Asthe data sets are typicallydocuments, eachauthor will likely havehis or her own definition. (cid:1) Prepare thedata: Properencodingof thedocuments makes life much easier. Methodsdiscussedin thisbook makeit much simpler to use. This will include feature extraction,entityextraction, and reductionofdimension.Thelastmaynotmeananyfewervariables,buttheywillbecombinedin ways that reduce the number of items in the modelsddefinitely a nontrivialtask. (cid:1) Develop and assessthe models: One must considerhow the models apply in the contextof the problem.Assessing model fit, for example,will evaluate how well the models predictthe data. (cid:1) Evaluate themodels: This is close to the previous bullet. (cid:1) Deploy the results: Inthis step, the researcher shares hisor her work withthe public.For example, withinsurancefraud,onewoulddemonstratethatfraudcouldbereducedusingthesemodels.The modelswouldidentifykeywordsinfraudulentclaimsthatcouldbeusedinfutureworkdinsome instances,keeping the results confidentialtoavoid alerting the bad guys. Allof these steps are detailed in the chapters. English(andanylanguage)hasagreatdealofredundancy,soitisnotnecessarytomineeachword. Manywordscarrygreatmeaningincontext,sotheremaynotbeaneedtohaveallwordsencoded.For example,inamedicalcontext,thetermstumor,cancer,andlesionmaybeusedmoreorlessinterchangeably. Thebookhas28tutorials,whichshouldbereadasthechaptersareread.Ialsostronglyrecommend thatthereaderhaveaccesstooneoftheprogramsmentionedinthebook.Findageneralpurposetext minerand work through the exercises. Bealertforitemsofinterest.Getsomeexamplesofyourown;thiswilllikelybethebestwaytolearn text mining. Peter A. Lachenbruch, Ph.D. Oregon State University, Professor of Public Health (2006epresent), Past president of the American Statistical Association Dr.PeterLachenbruchreceivedhisPh.D.inbiostatisticsfromUCLA.Hehasheldpositionsonthefacultiesofthe University of North Carolina (1965e1976), the University of Iowa (1976e1985), and UCLA (1985e1994). He was employedbytheFDA/CBERfrom1994to2005andretiredfromthereasthedirectoroftheDivisionofBiostatistics.Heis currentlyProfessorofPublicHealthatOregonStateUniversity(2006epresent).HeisaFellowoftheAmericanStatistical AssociationandaformerelectedmemberoftheInternationalStatisticalInstitute.Hehasheldmanyprofessionaloffices andwasthepresidentoftheAmericanStatisticalAssociationfor2008. He has statistical interests in discriminant analysis, two-part models, model-independent inference, statistical computing, and data analysis. He is known for making the jackknife re-sampling method an accepted procedure, and in recent years he has published on validation of neural networks using hybrid resampling methods. He has application interests in rheumatology, psychiatry, pediatrics, gerontology and accident epidemiology. He has more than 180 publications in these fields. Dr. Lachenbruch serves on the Editorial Boards of Statistics in Medicine, Methods of Information in Medicine, Journal of Biopharmaceutical Statistics, and Statistical Methods in Medical Research. Foreword 2 PracticalText Mining and StatisticalAnalysis for Nonstructured Text Data Applications follows in the tradi- tionofthepopularHandbookofStatisticalAnalysis&DataMiningApplications(Elsevier,2009),whichwas authored by three of this new text’s six authors: Robert Nisbet, John Elder, and Gary Miner. Three additional authorsdThomas Hill, Dursun Delen, and Andrew Fastdcontributed to this book, each bringingsubstantialexperience in text mining. Practical Text Mining was written to be used as an application-oriented text mining handbook. At some thousand pages in length, the authors have created a text that will to my mind serve as the standard resource in the field for many years. The authors have managed to provide readers with acomprehensive,yetthoroughlyunderstandable,overviewofthetheoryandscopeoftextmining.In addition, they provide 28 tutorials that assist readers in actually applying theory to a wide range of practicalapplications using up-to-date statistical techniques. Text mining is one of those disciplines that are now emerging as cutting-edge technologies. Only a few years ago, standard desktop computers did not have the physical capability to handle huge amount of textual material or to engage in the complex analysis of page-length material. Storage capacity and in particular MHz and RAM were insufficient to appropriately analyze large or complex textual situations. Recently, however, PC-type computers have become powerful and fast enough to engage in sophisticated text mining analysis, which itself has the capability of assisting researchers to better understand a host ofsocialand biological issues. Severaltextminingapplicationsnowexisttoenableresearcherstoengageintextminingactivities. Thebook doesnot limit itself tousing only onetext mining package but rather providesstep-by-step instructionsonhowtoworkwithseveraloftheforemostapplications,includingStatSoft’sSTATISTICA DataMinerandTextMiner,SASEnterpriseMiner&TextMiner,andIBM-SPSSModelerPremium.The authors also give readers using RapidMiner, Topsy, Weka, and Salford System’s STM(cid:1), CART(cid:3), and (cid:3) TreeNet tutorials employing these applications. For all of the tutorials, readers are guided through such text mining processes as classification, prediction, named entity extraction, feature selection, dimensionality reduction, singular value decomposition, clustering, focused web crawling, web mining, andotherassociated techniques. PartItakesthereaderthroughthehistoricalbackgroundoftextmining,providesgeneraltextmining theory,andpresentscommonapplicationsandtheirassociatedstatisticalanddatamanagementtools. Part II is called the text mining laboratory in which the 28 tutorials are provided. Each tutorial is authored by the book’s authors and/or by guest authors having special expertise in the subject of the tutorialandthesoftwareusedforit.Codeisprovidedforthetutorialswhereapplicable,andannotated screenshots are displayed throughout in order to make it easier for readers to replicate the specific tutorial example,as well as to enablethemto work throughtheir own project analysis. xvii

Description:

"Theyve done it again. From the same industry leaders who brought you the "bible" of data mining comes the definitive, go-to text mining resource. This book empowers you to dig in and seize value, with over two dozen hands-on tutorials that drive an incredible range of applications such as predict

Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications PDF

1000 Pages·2012·30.87 MB·English

by Miner

Checking for file health...

Save to my drive

Quick download

Download

Download Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications PDF Free - Full Version

by Miner| 2012| 1000 pages| 30.87| English

Download Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications by Miner in PDF format completely FREE. No registration required, no payment needed. Get instant access to this valuable resource on PDFdrive.to!

Free Download PDF

About Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications

Detailed Information

Author:	Miner
Publication Year:	2012
Pages:	1000
Language:	English
File Size:	30.87
Format:	PDF
Price:	FREE

Download Free PDF

Safe & Secure Download - No registration required

Why Choose PDFdrive for Your Free Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications Download?

100% Free: No hidden fees or subscriptions required for one book every day.
No Registration: Immediate access is available without creating accounts for one book every day.
Safe and Secure: Clean downloads without malware or viruses
Multiple Formats: PDF, MOBI, Mpub,... optimized for all devices
Educational Resource: Supporting knowledge sharing and learning

Frequently Asked Questions

Is it really free to download Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications PDF?

Yes, on https://PDFdrive.to you can download Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications by Miner completely free. We don't require any payment, subscription, or registration to access this PDF file. For 3 books every day.

How can I read Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications on my mobile device?

After downloading Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications PDF, you can open it with any PDF reader app on your phone or tablet. We recommend using Adobe Acrobat Reader, Apple Books, or Google Play Books for the best reading experience.

Is this the full version of Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications?

Yes, this is the complete PDF version of Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications by Miner. You will be able to read the entire content as in the printed version without missing any pages.

Is it legal to download Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications PDF for free?

https://PDFdrive.to provides links to free educational resources available online. We do not store any files on our servers. Please be aware of copyright laws in your country before downloading.

The materials shared are intended for research, educational, and personal use in accordance with fair use principles.