Advanced Analytics with Spark
PATTERNS FOR LEARNING FROM DATA AT SCALE
Sandy Ryza, Uri Laserson,
Sean Owen, & Josh Wills
SECOND EDITION
Beijing Boston Farnham Sebastopol Tokyo
Advanced Analytics with Spark
by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Copyright © 2017 Sanford Ryza, Uri Laserson, Sean Owen, Joshua Wills. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional
sales department: 800-998-9938 or [email protected].
Editor: Marie Beaugureau
Production Editor: Melanie Yarbrough
Copyeditor: Gillian McGarvey
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2017: Second Edition
Revision History for the Second Edition
2017-06-09: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Advanced Analytics with Spark, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-97295-3
Table of Contents
Foreword
Preface
1. Analyzing Big Data
The Challenges of Data Science
Introducing Apache Spark
About This Book
The Second Edition
2. Introduction to Data Analysis with Scala and Spark
Scala for Data Scientists
The Spark Programming Model
Record Linkage
Getting Started: The Spark Shell and SparkContext
Bringing Data from the Cluster to the Client
Shipping Code from the Client to the Cluster
From RDDs to Data Frames
Analyzing Data with the DataFrame API
Fast Summary Statistics for DataFrames
Pivoting and Reshaping DataFrames
Joining DataFrames and Selecting Features
Preparing Models for Production Environments
Model Evaluation
Where to Go from Here
3. Recommending Music and the Audioscrobbler Data Set
Data Set
The Alternating Least Squares Recommender Algorithm
Preparing the Data
Building a First Model
Spot Checking Recommendations
Evaluating Recommendation Quality
Computing AUC
Hyperparameter Selection
Making Recommendations
Where to Go from Here
4. Predicting Forest Cover with Decision Trees
Fast Forward to Regression
Vectors and Features
Training Examples
Decision Trees and Forests
Covtype Data Set
Preparing the Data
A First Decision Tree
Decision Tree Hyperparameters
Tuning Decision Trees
Categorical Features Revisited
Random Decision Forests
Making Predictions
Where to Go from Here
5. Anomaly Detection in Network Traffic with K-means Clustering
Anomaly Detection
K-means Clustering
Network Intrusion
KDD Cup 1999 Data Set
A First Take on Clustering
Choosing k
Visualization with SparkR
Feature Normalization
Categorical Variables
Using Labels with Entropy
Clustering in Action
Where to Go from Here
6. Understanding Wikipedia with Latent Semantic Analysis
The Document-Term Matrix
Getting the Data
Parsing and Preparing the Data
Lemmatization
Computing the TF-IDFs
Singular Value Decomposition
Finding Important Concepts
Querying and Scoring with a Low-Dimensional Representation
Term-Term Relevance
Document-Document Relevance
Document-Term Relevance
Multiple-Term Queries
Where to Go from Here
7. Analyzing Co-Occurrence Networks with GraphX
The MEDLINE Citation Index: A Network Analysis
Getting the Data
Parsing XML Documents with Scala’s XML Library
Analyzing the MeSH Major Topics and Their Co-Occurrences
Constructing a Co-Occurrence Network with GraphX
Understanding the Structure of Networks
Connected Components
Degree Distribution
Filtering Out Noisy Edges
Processing EdgeTriplets
Analyzing the Filtered Graph
Small-World Networks
Cliques and Clustering Coefficients
Computing Average Path Length with Pregel
Where to Go from Here
8. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data
Getting the Data
Working with Third-Party Libraries in Spark
Geospatial Data with the Esri Geometry API and Spray
Exploring the Esri Geometry API
Intro to GeoJSON
Preparing the New York City Taxi Trip Data
Handling Invalid Records at Scale
Geospatial Analysis
Sessionization in Spark
Building Sessions: Secondary Sorts in Spark
Where to Go from Here
9. Estimating Financial Risk Through Monte Carlo Simulation
Terminology
Methods for Calculating VaR
Variance-Covariance
Historical Simulation
Monte Carlo Simulation
Our Model
Getting the Data
Preprocessing
Determining the Factor Weights
Sampling
The Multivariate Normal Distribution
Running the Trials
Visualizing the Distribution of Returns
Evaluating Our Results
Where to Go from Here
10. Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling
Ingesting Genomics Data with the ADAM CLI
Parquet Format and Columnar Storage
Predicting Transcription Factor Binding Sites from ENCODE Data
Querying Genotypes from the 1000 Genomes Project
Where to Go from Here
11. Analyzing Neuroimaging Data with PySpark and Thunder
Overview of PySpark
PySpark Internals
Overview and Installation of the Thunder Library
Loading Data with Thunder
Thunder Core Data Types
Categorizing Neuron Types with Thunder
Where to Go from Here
Index
Foreword
Ever since we started the Spark project at Berkeley, I’ve been excited about not just
building fast parallel systems, but helping more and more people make use of
large-scale computing. This is why I’m very happy to see this book, written by four experts
in data science, on advanced analytics with Spark. Sandy, Uri, Sean, and Josh have
been working with Spark for a while, and have put together a great collection of
content with equal parts explanations and examples.
The thing I like most about this book is its focus on examples, which are all drawn
from real applications on real-world data sets. It’s hard to find one, let alone 10,
examples that cover big data and that you can run on your laptop, but the authors
have managed to create such a collection and set everything up so you can run them
in Spark. Moreover, the authors cover not just the core algorithms, but the intricacies
of data preparation and model tuning that are needed to really get good results. You
should be able to take the concepts in these examples and directly apply them to your
own problems.
Big data processing is undoubtedly one of the most exciting areas in computing
today, and remains an area of fast evolution and introduction of new ideas. I hope
that this book helps you get started in this exciting new field.
— Matei Zaharia, CTO at Databricks
and Vice President, Apache Spark