Five real-world Python projects
Leonard Apeltsin
MANNING
Core algorithms inside the book
Algorithm                         Use case                        First introduced
K-means                           Clustering                      Section 10
DBSCAN                            Clustering                      Section 10
Jaccard similarity computation    Text comparison                 Section 13
Cosine similarity computation     Text comparison                 Section 13
Principal component analysis      Dimension reduction             Section 14
Singular value decomposition      Dimension reduction             Section 14
Power iteration                   Eigenvector computation         Section 14
TFIDF vectorization               Text comparison                 Section 15
Shortest path length computation  Network path optimization       Section 18
PageRank                          Network centrality measurement  Section 19
Markov clustering                 Social network clustering       Section 19
K-nearest neighbors               Supervised classification       Section 20
Cross-validation                  Model performance testing       Section 20
Perceptron                        Supervised classification       Section 21
Logistic regression               Supervised classification       Section 21
Decision tree                     Supervised classification       Section 22
Random forest                     Supervised classification       Section 22
A trained logistic regression classifier distinguishes between two classes of points by slicing like a
cleaver through 3D space (see section 21).
Data Science Bookcamp
FIVE PYTHON PROJECTS
LEONARD APELTSIN
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]
©2021 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning Publications
was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
The author and publisher have made every effort to ensure that the information in this book
was correct at press time. The author and publisher do not assume and hereby disclaim any
liability to any party for any loss, damage, or disruption caused by errors or omissions, whether
such errors or omissions result from negligence, accident, or any other cause, or from any usage
of the information herein.
Development editor: Elesha Hyde
Technical development editors: Arthur Zubarev and Alvin Raj
Review editors: Ivan Martinović and Adriana Sabo
Production editor: Deirdre S. Hiam
Copy editor: Tiffany Taylor
Proofreader: Katie Tennant
Technical proofreader: Raffaella Ventaglio
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor
ISBN: 9781617296253
Printed in the United States of America
To my teacher, Alexander Vishnevsky,
who taught me how to think
brief contents
CASE STUDY 1 FINDING THE WINNING STRATEGY IN A CARD GAME
1 ■ Computing probabilities using Python
2 ■ Plotting probabilities using Matplotlib
3 ■ Running random simulations in NumPy
4 ■ Case study 1 solution
CASE STUDY 2 ASSESSING ONLINE AD CLICKS FOR SIGNIFICANCE
5 ■ Basic probability and statistical analysis using SciPy
6 ■ Making predictions using the central limit theorem and SciPy
7 ■ Statistical hypothesis testing
8 ■ Analyzing tables using Pandas
9 ■ Case study 2 solution
CASE STUDY 3 TRACKING DISEASE OUTBREAKS USING NEWS HEADLINES
10 ■ Clustering data into groups
11 ■ Geographic location visualization and analysis
12 ■ Case study 3 solution
CASE STUDY 4 USING ONLINE JOB POSTINGS TO IMPROVE YOUR DATA SCIENCE RESUME
13 ■ Measuring text similarities
14 ■ Dimension reduction of matrix data
15 ■ NLP analysis of large text datasets
16 ■ Extracting text from web pages
17 ■ Case study 4 solution
CASE STUDY 5 PREDICTING FUTURE FRIENDSHIPS FROM SOCIAL NETWORK DATA
18 ■ An introduction to graph theory and network analysis
19 ■ Dynamic graph theory techniques for node ranking and social network analysis
20 ■ Network-driven supervised machine learning
21 ■ Training linear classifiers with logistic regression
22 ■ Training nonlinear classifiers with decision tree techniques
23 ■ Case study 5 solution