Learning Big Data with
Amazon Elastic MapReduce
Easily learn, build, and execute real-world Big Data
solutions using Hadoop and AWS EMR
Amarkant Singh
Vijay Rayapati
BIRMINGHAM - MUMBAI
Learning Big Data with Amazon Elastic MapReduce
Copyright © 2014 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the authors, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2014
Production reference: 1241014
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78217-343-4
www.packtpub.com
Cover image by Pratyush Mohanta ([email protected])
Credits
Authors: Amarkant Singh, Vijay Rayapati
Reviewers: Venkat Addala, Vijay Raajaa G.S, Gaurav Kumar
Commissioning Editor: Ashwin Nair
Acquisition Editor: Richard Brookes-Bland
Content Development Editor: Sumeet Sawant
Technical Editors: Mrunal M. Chavan, Gaurav Thingalaya
Copy Editors: Roshni Banerjee, Relin Hedly
Project Coordinator: Judie Jose
Proofreaders: Paul Hindle, Bernadette Watkins
Indexers: Mariammal Chettiyar, Monica Ajmera Mehta, Rekha Nair, Tejal Soni
Graphics: Sheetal Aute, Ronak Dhruv, Disha Haria, Abhinash Sahu
Production Coordinators: Aparna Bhagat, Manu Joseph, Nitesh Thakur
Cover Work: Aparna Bhagat
About the Authors
Amarkant Singh is a Big Data specialist. Being one of the initial users of Amazon
Elastic MapReduce, he has used it extensively to build and deploy many Big Data
solutions. He has been working with Apache Hadoop and EMR for almost 4 years
now. He is also a certified AWS Solutions Architect. As an engineer, he has designed
and developed enterprise applications of various scales. He currently leads the
product development team at one of the most dynamic cloud-based enterprises in
the Asia-Pacific region. At the time of writing, he is also an all-time top user on
Stack Overflow for EMR. He blogs at http://www.bigdataspeak.com/ and is
active on Twitter as @singh_amarkant.
Vijay Rayapati is the CEO of Minjar Cloud Solutions Pvt. Ltd., one of the leading
providers of cloud and Big Data solutions on public cloud platforms. He has over
10 years of experience in building business rule engines, data analytics platforms,
and real-time analysis systems used by many leading enterprises across the world,
including Fortune 500 businesses. He has worked on various technologies such as
LISP, .NET, Java, Python, and many NoSQL databases. He has rearchitected and led
the initial development of a large-scale location intelligence and analytics platform
using Hadoop and AWS EMR. He has worked with many ad networks, e-commerce,
financial, and retail companies to help them design, implement, and scale their data
analysis and BI platforms on the AWS Cloud. He is passionate about open source
software, large-scale systems, and performance engineering. He is active on Twitter
as @amnigos, blogs at amnigos.com, and has a GitHub profile at
https://github.com/amnigos.
Acknowledgments
We would like to extend our gratitude to Udit Bhatia and Kartikeya Sinha from
Minjar's Big Data team for their valuable feedback and support. We would also
like to thank the reviewers and the Packt Publishing team for their guidance in
improving our content.
About the Reviewers
Venkat Addala has been involved in research in the areas of Computational
Biology and Big Data Genomics for the past several years. He currently works
as a Computational Biologist at Positive Bioscience, Mumbai, India, the first
company to provide clinical DNA sequencing services in India. He approaches
Biology from a computational perspective and uses the Amazon cloud to tackle
the complex puzzle of Big Data analysis of the human genome. He is a certified
MongoDB developer and has good knowledge of Shell, Python, and R. His passion
lies in decoding the human genome into computer codecs. His areas of focus are
cloud computing, HPC, mathematical modeling, machine learning, and natural
language processing. His passion for computers and genomics keeps him going.
Vijay Raajaa G.S leads the Big Data / semantic-based knowledge discovery
research at Mu Sigma's Innovation & Development group. He previously
worked with the BSS R&D division at Nokia Networks and interned with Ericsson
Research Labs. He has architected and built a feedback-based sentiment engine and
a scalable in-memory-based solution for a telecom analytics suite. He is passionate
about Big Data, machine learning, the Semantic Web, and natural language processing,
and he has an immense fascination for open source projects. He is currently researching
how to build a semantic-based personal assistant system using a multiagent framework.
He holds a patent on churn prediction using the graph model and has authored a white
paper that was presented at a conference on Advanced Data Mining and Applications.
He can be reached at https://www.linkedin.com/in/gsvijayraajaa.
Gaurav Kumar has been working professionally since 2010 to provide solutions
for distributed systems using open source and Big Data technologies. He has
hands-on experience with Hadoop, Pig, Hive, Flume, Sqoop, and NoSQL databases
such as Cassandra and MongoDB. He possesses knowledge of cloud technologies
and has production experience with AWS.
His areas of expertise include developing large-scale distributed systems to analyze
big sets of data. He has also worked on predictive analysis models and machine
learning. He architected a solution to perform clickstream analysis for Tradus.com.
He also played an instrumental role in providing distributed searching capabilities
using Solr for GulfNews.com (one of UAE's most-viewed newspaper websites).
Learning new languages is not a barrier for Gaurav. He is particularly proficient
in Java and Python, as well as frameworks such as Struts and Django. He has
always been fascinated by the open source world and constantly gives back to the
community on GitHub. He can be contacted at
https://www.linkedin.com/in/gauravkumar37 or on his blog at
http://technoturd.wordpress.com. You can also follow him on Twitter @_gauravkr.
www.PacktPub.com
Support files, eBooks, discount offers, and more
You might want to visit www.PacktPub.com for support files and downloads related to
your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
[email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can access, read and search across Packt's entire library of books.
Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print and bookmark content
• On demand and accessible via web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view nine entirely free books. Simply use your login credentials for
immediate access.
Instant updates on new Packt books
Get notified! Find out when new books are published by following @PacktEnterprise on
Twitter, or the Packt Enterprise Facebook page.