Apache Solr
A Practical Approach to Enterprise Search
Dikshant Shahi
Any source code or other supplementary materials referenced by the author in this
text is available to readers at www.apress.com. For detailed information about how
to locate your book’s source code, go to www.apress.com/source-code/.
ISBN 978-1-4842-1071-0 e-ISBN 978-1-4842-1070-3
DOI 10.1007/978-1-4842-1070-3
© Apress 2015
Apache Solr: A Practical Approach to Enterprise Search
Managing Director: Welmoed Spahr
Acquisitions Editor: Celestin Suresh John
Development Editor: Matthew Moodie
Technical Reviewer: Shweta Gupta
Editorial Board: Steve Anglin, Pramilla Balan, Louise Corrigan, James
DeWolf, Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Michelle
Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper,
Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing
Coordinating Editor: Rita Fernando
Copy Editor: Sharon Wilkey
Compositor: SPi Global
Indexer: SPi Global
For information on translations, please e-mail [email protected], or visit
www.apress.com/.
Apress and friends of ED books may be purchased in bulk for academic,
corporate, or promotional use. eBook versions and licenses are also available for
most titles. For more information, reference our Special Bulk Sales–eBook
Licensing web page at www.apress.com/bulk-sales.
This work is subject to copyright. All rights are reserved by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation,
reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed. Exempted from this legal
reservation are brief excerpts in connection with reviews or scholarly analysis or
material supplied specifically for the purpose of being entered and executed on a
computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the
Copyright Law of the Publisher’s location, in its current version, and permission
for use must always be obtained from Springer. Permissions for use may be
obtained through RightsLink at the Copyright Clearance Center. Violations are
liable to prosecution under the respective Copyright Law.
Trademarked names, logos, and images may appear in this book. Rather than use
a trademark symbol with every occurrence of a trademarked name, logo, or
image, we use the names, logos, and images only in an editorial fashion and to the
benefit of the trademark owner, with no intention of infringement of the
trademark. The use in this publication of trade names, trademarks, service marks,
and similar terms, even if they are not identified as such, is not to be taken as an
expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate
at the date of publication, neither the authors nor the editors nor the publisher can
accept any legal responsibility for any errors or omissions that may be made. The
publisher makes no warranty, express or implied, with respect to the material
contained herein.
Distributed to the book trade worldwide by Springer Science+Business Media
New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone
1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit
www.springer.com. Apress Media, LLC is a California LLC and the sole member
(owner) is Springer Science + Business Media Finance Inc. (SSBM Finance Inc.).
SSBM Finance Inc. is a Delaware corporation.
To my foster mother, Mrs. Pratima Singh, for educating me!
Introduction
This book is for developers who are building or planning to build an enterprise
search engine using Apache Solr. Chapters 1 and 3 can be read by anyone who
intends to learn the basics of information retrieval, search engines, and Apache
Solr specifically. Chapter 2 kick-starts development with Solr and will prove to be
a great resource for Solr newbies and administrators. All other chapters explore
the Solr features and approaches for developing a practical and effective search
engine.
This book covers use cases and examples from various domains such as
e-commerce, legal, medical, and music, which will help you understand the need for
certain features and how to approach the solution. While discussing the features,
the book generally provides a snapshot of the required configuration, the
command (using curl) to execute the feature, and a code snippet as required. The
book dives into implementation details and writing plug-ins for integrating custom
features.
What this book doesn’t cover is performance improvement in Solr and
optimizing it for high-speed indexing. This book covers Solr features through
release 5.3.1, which is the latest at the time of this writing.
What This Book Covers
Chapter 1, Apache Solr: An Introduction, as the name states, starts with an
introduction to Apache Solr and its ecosystem. It then discusses the features,
reasons for Solr’s popularity, its building blocks, and other information that will
give you a holistic view of Solr. It also introduces related technologies and
compares Solr to the alternatives.
Chapter 2, Solr Setup and Administration, begins with Solr fundamentals
and covers Solr setup and the steps for indexing and searching your first set of
documents. It then describes the Solr administrative features and various
management options.
Chapter 3, Information Retrieval, is dedicated to the concepts of information
retrieval, content extraction, and text processing.
Chapter 4, Schema Design and Text Analysis, covers schema design, text
analysis, going schemaless, and managed schemas in Solr. It also describes
common text-analysis techniques.
Chapter 5, Indexing Data, concentrates on the Solr indexing process by
describing the indexing request flow, various indexing tools, supported document
formats, and important update request processors. This is also the first chapter
that provides the steps to write a Solr plug-in, a custom UpdateRequestProcessor
in this case.
Chapter 6, Searching Data, describes the Solr searching process, various
query types, important query parsers, supported request parameters, and steps for
writing a custom SearchComponent.
Chapter 7, Searching Data: Part 2, continues from the previous chapter and
covers local parameters, result grouping, statistics, faceting, reranking queries, and
joins. It also dives into the details of function queries for deriving a practical
relevance ranking and the steps for writing your own named function.
Chapter 8, Solr Scoring, explains the Solr scoring process, supported scoring
models, the score computation, and steps for customizing similarity.
Chapter 9, Additional Features, explores Solr features including
spell-checking, autosuggestion, document similarity, and sponsored search.
Chapter 10, Traditional Scaling and SolrCloud, covers the distributed
architectures supported by Solr and the steps for setting up SolrCloud, creating a
collection, distributed indexing and searching, shard splitting, and ZooKeeper.
Chapter 11, Semantic Search, introduces the concept of semantic search and
covers the tools and techniques for integrating semantic capabilities in Solr.
What You Need for This Book
Apache Solr requires Java Runtime Environment (JRE) 1.7 or newer. The
provided custom Java code is tested on Java Development Kit (JDK) 1.8 and
requires Apache Maven.
The last chapter requires downloading resources required by Apache
OpenNLP and WordNet.
Who This Book Is For
This book expects you to have a basic understanding of the Java programming
language, which is essential if you want to execute the custom components.
Acknowledgments
My first vote of thanks goes to my daily dose of caffeine (without which this book
would not have been possible), my sister for preparing it, and my wife for
teaching me to prepare it myself. Thanks to my parents for their love!
Thank you, Celestin, for providing me the opportunity to write this book; Rita
for coordinating the whole process; and Shweta, Matthew, Sharon, and SPi
Global for all their help to get this book to completion. My sincere thanks to
everyone else from Apress for believing in me.
I am deeply indebted to everyone whom I have worked with in my
professional journey and everyone who has motivated me and helped me learn
and improve, directly or indirectly.
A special thanks to my colleagues at The Digital Group for providing the
support, flexibility, and occasional work break to complete the book on time. I
would also like to thank all the open source contributors, especially of Apache
Lucene and Solr; without their great work, there would have been no need for this
book.
As someone has rightly said, it takes a village to create a book. In creating this
book, there is a small village, Sandha, located in the land of Buddha, which I
frequented for the tranquility and serenity that helped me focus on writing this book.
Thank you!