Natural Language Processing and Text Mining
Anne Kao and Stephen R. Poteet (Eds)
Anne Kao, BA, MA, MS, PhD
Bellevue, WA 98008, USA
Stephen R. Poteet, BA, MA, CPhil
Bellevue, WA 98008, USA
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Control Number: 2006927721
ISBN-10: 1-84628-175-X
ISBN-13: 978-1-84628-175-4
Printed on acid-free paper
©Springer-Verlag London Limited 2007
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced,
stored or transmitted, in any form or by any means, with the prior permission in writing of the
publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued
by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be
sent to the publishers.
The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of
a specific statement, that such names are exempt from the relevant laws and regulations and therefore
free for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the
information contained in this book and cannot accept any legal responsibility or liability for any errors
or omissions that may be made.
Printed in the United States of America (MVY)
Springer Science+Business Media, LLC
springer.com
List of Contributors
Jan W. Amtrup
Kofax Image Products
5465 Morehouse Dr, Suite 140
San Diego, CA 92121, USA
Jan [email protected]

John Atkinson
Departamento de Ingeniería Informática
Universidad de Concepción
P.O. Box code: 160-C
Concepción, Chile
[email protected]

Chutima Boonthum
Department of Computer Science
Old Dominion University
Norfolk, VA 23529, USA
[email protected]

Janez Brank
J. Stefan Institute
Jamova 39, 1000
Ljubljana, Slovenia
[email protected]

Stephen W. Briner
Department of Psychology, Institute for Intelligent Systems
University of Memphis
Memphis, TN 38152, USA
[email protected]

Razvan C. Bunescu
Department of Computer Sciences
University of Texas at Austin
1 University Station C0500
Austin, TX 78712-0233, USA
[email protected]

Kiel Christianson
Department of Educational Psychology
University of Illinois
Champaign, IL 61820, USA
[email protected]

Navdeep Dhillon
Insightful Corporation
1700 Westlake Ave N, Suite 500
Seattle, WA 98109, USA
[email protected]

Oren Etzioni
Department of Computer Science
University of Washington
Seattle, WA 98125-2350, USA
[email protected]

Bernd Freisleben
Department of Mathematics and Computer Science
University of Marburg
Hans-Meerwein-Str.
D-35032 Marburg, Germany
[email protected]

Marko Grobelnik
J. Stefan Institute
Jamova 39, 1000
Ljubljana, Slovenia
[email protected]

Renu Gupta
Center for Language Research
The University of Aizu
Aizu-Wakamatsu City
Fukushima 965-8580, Japan
[email protected]

Martin Hoof
Department of Electrical Engineering
FH Kaiserslautern
Morlauterer Str. 31
D-67657 Kaiserslautern, Germany
[email protected]

Youcef-Toumi Kamal
Dept. of Mechanical Engineering
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
[email protected]

Anne Kao
Mathematics and Computing Technology
Boeing Phantom Works
Seattle, WA 92107, USA
[email protected]

Krzysztof Koperski
Insightful Corporation
1700 Westlake Ave N, Suite 500
Seattle, WA 98109, USA
[email protected]

Irwin B. Levinstein
Department of Computer Science
Old Dominion University
Norfolk, VA 23529, USA
[email protected]

Jisheng Liang
Insightful Corporation
1700 Westlake Ave N, Suite 500
Seattle, WA 98109, USA
[email protected]

Ying Liu
Singapore MIT Alliance
National University of Singapore
Singapore 117576
[email protected]

Han Tong Loh
Dept. of Mechanical Engineering
National University of Singapore
Singapore 119260
[email protected]

Giovanni Marchisio
Insightful Corporation
1700 Westlake Ave N, Suite 500
Seattle, WA 98109, USA
[email protected]

Philip M. McCarthy
Department of Psychology, Institute for Intelligent Systems
University of Memphis
Memphis, TN 38152, USA
[email protected]

Danielle S. McNamara
Department of Psychology, Institute for Intelligent Systems
University of Memphis
Memphis, TN 38152, USA
[email protected]

Dunja Mladenić
J. Stefan Institute
Jamova 39, 1000
Ljubljana, Slovenia
[email protected]

Raymond J. Mooney
Department of Computer Sciences
University of Texas at Austin
1 University Station C0500
Austin, TX 78712-0233, USA
[email protected]

Eni Mustafaraj
Department of Mathematics and Computer Science
University of Marburg
Hans-Meerwein-Str.
D-35032 Marburg, Germany
[email protected]

Thien Nguyen
Insightful Corporation
1700 Westlake Ave N, Suite 500
Seattle, WA 98109, USA
[email protected]

Lubos Pochman
Insightful Corporation
1700 Westlake Ave N, Suite 500
Seattle, WA 98109, USA
[email protected]

Ana-Maria Popescu
Department of Computer Science
University of Washington
Seattle, WA 98125-2350, USA
[email protected]

Stephen R. Poteet
Mathematics and Computing Technology
Boeing Phantom Works
Seattle, WA 92107, USA
[email protected]

Jonathan Reichhold
Insightful Corporation
1700 Westlake Ave N, Suite 500
Seattle, WA 98109, USA
[email protected]

Vasile Rus
Department of Computer Science, Institute for Intelligent Systems
University of Memphis
Memphis, TN 38152, USA
[email protected]

Mauritius A. R. Schmidtler
Kofax Image Products
5465 Morehouse Dr, Suite 140
San Diego, CA 92121, USA
Maurice [email protected]

Lothar M. Schmitt
School of Computer Science & Engineering
The University of Aizu
Aizu-Wakamatsu City
Fukushima 965-8580, Japan
[email protected]

Shu Beng Tor
School of Mechanical and Aerospace Engineering
Nanyang Technological University
Singapore 117576
[email protected]

Carsten Tusk
Insightful Corporation
1700 Westlake Ave N, Suite 500
Seattle, WA 98109, USA
[email protected]

Dan White
Insightful Corporation
1700 Westlake Ave N, Suite 500
Seattle, WA 98109, USA
[email protected]

Preface
The topic this book addresses originated from a panel discussion at the 2004 ACM SIGKDD
(Special Interest Group on Knowledge Discovery and Data Mining) Conference held in Seattle,
Washington, USA. We the editors organized the panel to promote discussion on how text mining
and natural language processing, two related topics originating from very different disciplines,
can best interact with each other, and benefit from each other’s strengths. It attracted a great
deal of interest and was attended by 200 people from all over the world. We then guest-edited
a special issue of ACM SIGKDD Explorations on the same topic, with a number of very
interesting papers. At the same time, Springer believed this to be a topic of wide interest and
expressed an interest in seeing a book published. After a year of work, we have put together
11 papers from international researchers on a range of techniques and applications.
We hope this book includes papers readers do not normally find in conference proceedings,
which tend to focus more on theoretical or algorithmic breakthroughs that are often only tried
on standard test data. We would like to provide readers with a wider range of applications,
give some examples of the practical application of algorithms to real-world problems, and
share a number of useful techniques.
We would like to take this opportunity to thank all our reviewers: Gary Coen, Ketty Gann,
Mark Greaves, Anne Hunt, Dave Levine, Bing Liu, Dragos Margineantu, Jim Schimert,
John Thompson, Rod Tjoelker, Rick Wojcik, Steve Woods, and Jason Wu. Their backgrounds
include natural language processing, machine learning, applied statistics, linear algebra,
genetic algorithms, web mining, ontologies and knowledge management. They complement
the editors’ own backgrounds in text mining and natural language processing very well. As
technologists at Boeing Phantom Works, we work on practical large scale text mining problems
such as Boeing airplane maintenance and safety, various kinds of survey data, knowledge
management, and knowledge discovery, and evaluate data and text mining, and knowledge
management products for Boeing use. We would also like to thank Springer for the opportunity
to interact with researchers in the field and for publishing this book and especially Wayne
Wheeler and Catherine Brett for their help and encouragement at every step. Finally, we
would like to offer our special thanks to Jason Wu. We would not have been able to put all
the chapters together into a book without his expertise in LaTeX and his dedication to the
project.
Bellevue, Washington, USA Anne Kao
April 2006 Stephen R. Poteet
Contents
1 Overview
Anne Kao and Stephen R. Poteet
2 Extracting Product Features and Opinions from Reviews
Ana-Maria Popescu and Oren Etzioni
3 Extracting Relations from Text: From Word Sequences to Dependency Paths
Razvan C. Bunescu and Raymond J. Mooney
4 Mining Diagnostic Text Reports by Learning to Annotate Knowledge Roles
Eni Mustafaraj, Martin Hoof, and Bernd Freisleben
5 A Case Study in Natural Language Based Web Search
Giovanni Marchisio, Navdeep Dhillon, Jisheng Liang, Carsten Tusk, Krzysztof Koperski, Thien Nguyen, Dan White, and Lubos Pochman
6 Evaluating Self-Explanations in iSTART: Word Matching, Latent Semantic Analysis, and Topic Models
Chutima Boonthum, Irwin B. Levinstein, and Danielle S. McNamara
7 Textual Signatures: Identifying Text-Types Using Latent Semantic Analysis to Measure the Cohesion of Text Structures
Philip M. McCarthy, Stephen W. Briner, Vasile Rus, and Danielle S. McNamara
8 Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling
Mauritius A. R. Schmidtler and Jan W. Amtrup