Table Of ContentFoundations for Analytics
with Python
Clinton W. Brownley
BBeeiijjiinngg BBoossttoonn FFaarrnnhhaamm SSeebbaassttooppooll TTookkyyoo
For Aisha and Amaya,
“Education is the kindling of a flame,
not the filling of a vessel.” —Socrates
May you always enjoy stoking the fire.
Foundations for Analytics with Python
by Clinton W. Brownley
Copyright © 2016 Clinton Brownley. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or [email protected].
Editors: Laurel Ruma and Tim McGovern Indexer: Judith McConville
Production Editor: Colleen Cole Interior Designer: David Futato
Copyeditor: Jasmine Kwityn Cover Designer: Karen Montgomery
Proofreader: Rachel Head Illustrator: Rebecca Demarest
August 2016: First Edition
Revision History for the First Edition
2016-08-10: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491922538 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Foundations for Analytics with Python,
the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-92253-8
[LSI]
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. Python Basics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
How to Create a Python Script 1
How to Run a Python Script 4
Useful Tips for Interacting with the Command Line 7
Python’s Basic Building Blocks 11
Numbers 12
Strings 14
Regular Expressions and Pattern Matching 19
Dates 22
Lists 25
Tuples 31
Dictionaries 32
Control Flow 37
Reading a Text File 44
Create a Text File 44
Script and Input File in Same Location 47
Modern File-Reading Syntax 47
Reading Multiple Text Files with glob 48
Create Another Text File 49
Writing to a Text File 52
Add Code to first_script.py 53
Writing to a Comma-Separated Values (CSV) File 55
print Statements 57
Chapter Exercises 58
v
2. Comma-Separated Values (CSV) Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Base Python Versus pandas 61
Read and Write a CSV File (Part 1) 62
How Basic String Parsing Can Fail 69
Read and Write a CSV File (Part 2) 70
Filter for Specific Rows 72
Value in Row Meets a Condition 73
Value in Row Is in a Set of Interest 75
Value in Row Matches a Pattern/Regular Expression 77
Select Specific Columns 79
Column Index Values 79
Column Headings 81
Select Contiguous Rows 83
Add a Header Row 86
Reading Multiple CSV Files 88
Count Number of Files and Number of Rows and Columns in Each File 90
Concatenate Data from Multiple Files 93
Sum and Average a Set of Values per File 97
Chapter Exercises 100
3. Excel Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Introspecting an Excel Workbook 104
Processing a Single Worksheet 109
Read and Write an Excel File 109
Filter for Specific Rows 113
Select Specific Columns 120
Reading All Worksheets in a Workbook 124
Filter for Specific Rows Across All Worksheets 124
Select Specific Columns Across All Worksheets 127
Reading a Set of Worksheets in an Excel Workbook 129
Filter for Specific Rows Across a Set of Worksheets 129
Processing Multiple Workbooks 132
Count Number of Workbooks and Rows and Columns in Each Workbook 134
Concatenate Data from Multiple Workbooks 136
Sum and Average Values per Workbook and Worksheet 138
Chapter Exercises 142
4. Databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Python’s Built-in sqlite3 Module 145
Insert New Records into a Table 151
Update Records in a Table 156
MySQL Database 160
vi | Table of Contents
Insert New Records into a Table 165
Query a Table and Write Output to a CSV File 170
Update Records in a Table 172
Chapter Exercises 177
5. Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Find a Set of Items in a Large Collection of Files 179
Calculate a Statistic for Any Number of Categories from Data in a CSV File 192
Calculate Statistics for Any Number of Categories from Data in a Text File 204
Chapter Exercises 213
6. Figures and Plots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
matplotlib 215
Bar Plot 216
Histogram 218
Line Plot 220
Scatter Plot 222
Box Plot 224
pandas 226
ggplot 227
seaborn 231
7. Descriptive Statistics and Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Datasets 239
Wine Quality 239
Customer Churn 240
Wine Quality 241
Descriptive Statistics 241
Grouping, Histograms, and t-tests 243
Pairwise Relationships and Correlation 244
Linear Regression with Least-Squares Estimation 247
Interpreting Coefficients 249
Standardizing Independent Variables 249
Making Predictions 251
Customer Churn 252
Logistic Regression 255
Interpreting Coefficients 257
Making Predictions 259
8. Scheduling Scripts to Run Automatically. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Task Scheduler (Windows) 261
The cron Utility (macOS and Unix) 270
Table of Contents | vii
Crontab File: One-Time Set-up 271
Adding Cron Jobs to the Crontab File 273
9. Where to Go from Here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Additional Standard Library Modules and Built-in Functions 278
Python Standard Library (PSL): A Few More Standard Modules 278
Built-in Functions 279
Python Package Index (PyPI): Additional Add-in Modules 280
NumPy 280
SciPy 286
Scikit-Learn 290
A Few Additional Add-in Packages 292
Additional Data Structures 293
Stacks 293
Queues 294
Graphs 294
Trees 295
Where to Go from Here 295
A. Download Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
B. Answers to Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
viii | Table of Contents
Description:If you’re like many of Excel’s 750 million users, you want to do more with your data—like repeating similar analyses over hundreds of files, or combining data in many files for analysis at one time. This practical guide shows ambitious non-programmers how to automate and scale the processing a