DATA ANALYSIS
WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice,
Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith,
Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: Vic Barnett, J. Stuart Hunter, Joseph B. Kadane, Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.
DATA ANALYSIS
What Can Be Learned From the Past 50 Years
Peter J. Huber
Professor of Statistics, retired
Klosters, Switzerland
WILEY
A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax
(978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be
suitable for your situation. You should consult with a professional where appropriate. Neither the
publisher nor author shall be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at
(317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic formats. For more information about Wiley products, visit our web site at
www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Huber, Peter J.
Data analysis : what can be learned from the past 50 years / Peter J. Huber.
p. cm. — (Wiley series in probability and statistics ; 874)
Includes bibliographical references and index.
ISBN 978-1-118-01064-8 (hardback)
1. Mathematical statistics—History. 2. Mathematical statistics—Philosophy. 3. Numerical
analysis—Methodology. I. Title.
QA276.15.H83 2011
519.509—dc22 2010043284
Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
CONTENTS
Preface xi
1 What is Data Analysis? 1
1.1 Tukey's 1962 paper 3
1.2 The Path of Statistics 5
2 Strategy Issues in Data Analysis 11
2.1 Strategy in Data Analysis 11
2.2 Philosophical issues 13
2.2.1 On the theory of data analysis and its teaching 14
2.2.2 Science and data analysis 15
2.2.3 Economy of forces 16
2.3 Issues of size 17
2.4 Strategic planning 21
2.4.1 Planning the data collection 21
2.4.2 Choice of data and methods 22
2.4.3 Systematic and random errors 23
2.4.4 Strategic reserves 24
2.4.5 Human factors 25
2.5 The stages of data analysis 26
2.5.1 Inspection 26
2.5.2 Error checking 27
2.5.3 Modification 30
2.5.4 Comparison 30
2.5.5 Modeling and Model fitting 30
2.5.6 Simulation 31
2.5.7 What-if analyses 32
2.5.8 Interpretation 32
2.5.9 Presentation of conclusions 32
2.6 Tools required for strategy reasons 33
2.6.1 Ad hoc programming 33
2.6.2 Graphics 34
2.6.3 Record keeping 35
2.6.4 Creating and keeping order 35
3 Massive Data Sets 37
3.1 Introduction 38
3.2 Disclosure: Personal experiences 39
3.3 What is massive? A classification of size 39
3.4 Obstacles to scaling 40
3.4.1 Human limitations: visualization 40
3.4.2 Human-machine interactions 41
3.4.3 Storage requirements 41
3.4.4 Computational complexity 42
3.4.5 Conclusions 43
3.5 On the structure of large data sets 43
3.5.1 Types of data 43
3.5.2 How do data sets grow? 44
3.5.3 On data organization 44
3.5.4 Derived data sets 45
3.6 Data base management and related issues 46
3.6.1 Data archiving 48
3.7 The stages of a data analysis 49
3.7.1 Planning the data collection 49
3.7.2 Actual collection 50
3.7.3 Data access 50
3.7.4 Initial data checking 50
3.7.5 Data analysis proper 51
3.7.6 The final product: presentation of arguments and conclusions 51
3.8 Examples and some thoughts on strategy 52
3.9 Volume reduction 55
3.10 Supercomputers and software challenges 56
3.10.1 When do we need a Concorde? 57
3.10.2 General Purpose Data Analysis and Supercomputers 57
3.10.3 Languages, Programming Environments and Data-based Prototyping 58
3.11 Summary of conclusions 59
4 Languages for Data Analysis 61
4.1 Goals and purposes 62
4.2 Natural languages and computing languages 64
4.2.1 Natural languages 64
4.2.2 Batch languages 65
4.2.3 Immediate languages 67
4.2.4 Language and literature 68
4.2.5 Object orientation and related structural issues 69
4.2.6 Extremism and compromises, slogans and reality 71
4.2.7 Some conclusions 73
4.3 Interface issues 74
4.3.1 The command line interface 75
4.3.2 The menu interface 78
4.3.3 The batch interface and programming environments 80
4.3.4 Some personal experiences 81
4.4 Miscellaneous issues 82
4.4.1 On building blocks 82
4.4.2 On the scope of names 83
4.4.3 On notation 83
4.4.4 Book-keeping problems 84
4.5 Requirements for a general purpose immediate language 85
5 Approximate Models 89
5.1 Models 89
5.2 Bayesian modeling 92
5.3 Mathematical statistics and approximate models 94
5.4 Statistical significance and physical relevance 96
5.5 Judicious use of a wrong model 97
5.6 Composite models 98
5.7 Modeling the length of day 99
5.8 The role of simulation 111
5.9 Summary of conclusions 112
6 Pitfalls 113
6.1 Simpson's paradox 114
6.2 Missing data 116
6.2.1 The Case of the Babylonian Lunar Six 118
6.2.2 X-ray crystallography 126
6.3 Regression of Y on X or of X on Y? 129
7 Create order in data 133
7.1 General considerations 134
7.2 Principal component methods 135
7.2.1 Principal component methods: Jury data 137
7.3 Multidimensional scaling 145
7.3.1 Multidimensional scaling: the method 145
7.3.2 Multidimensional scaling: a synthetic example 145
7.3.3 Multidimensional scaling: map reconstruction 147
7.4 Correspondence analysis 147
7.4.1 Correspondence analysis: the method 147
7.4.2 Kültepe eponyms 148
7.4.3 Further examples: marketing and Shakespearean plays 156
7.5 Multidimensional scaling vs. Correspondence analysis 160
7.5.1 Hodson's grave data 162
7.5.2 Plato data 168