A Handbook of
Small Data Sets
Edited by
D.J. Hand,
F. Daly,
A.D. Lunn,
K.J. McConway
and
E. Ostrowski
Springer-Science+Business Media, B.V.
First edition 1994
© 1994 D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway
and E. Ostrowski
Originally published by Chapman & Hall in 1994
Softcover reprint of the hardcover 1st edition 1994
The editors and publisher would like to acknowledge the kind permission to
reproduce each data set given by the individual copyright holders. We have
made every effort to contact the copyright holder for each data set and would
be grateful if any errors were brought to the attention of the publisher for
correction at a later printing.
Additional material to this book can be downloaded from http://extras.springer.com
ISBN 978-0-412-39920-6 ISBN 978-1-4899-7266-8 (eBook)
DOI 10.1007/978-1-4899-7266-8
Apart from any fair dealing for the purposes of research or private study, or
criticism or review, as permitted under the UK Copyright Designs and Patents
Act, 1988, this publication may not be reproduced, stored, or transmitted, in
any form or by any means, without the prior permission in writing of the
publishers, or in the case of reprographic reproduction only in accordance
with the terms of the licences issued by the Copyright Licensing Agency in
the UK, or in accordance with the terms of licences issued by the appropriate
Reproduction Rights Organization outside the UK. Enquiries concerning
reproduction outside the terms stated here should be sent to the publishers at
the London address printed on this page.
The publisher makes no representation, express or implied, with regard to
the accuracy of the information contained in this book and cannot accept any
legal responsibility or liability for any errors or omissions that may be made.
A catalogue record for this book is available from the British Library
Printed on permanent acid-free text paper, manufactured in accordance
with ANSI/NISO Z39.48-1992 and ANSI/NISO Z39.48-1984 (Permanence of
Paper).
CONTENTS
Introduction
How to use the disk
The data sets
Data structure index
Subject index
INTRODUCTION
During our work as teachers of statistical methodology we have often been
in the position of trying to find suitable data sets to illustrate techniques or
phenomena or to use in examination questions. In common with many other
teachers, we have often fabricated numbers to fill the role. But this is far
from ideal for several reasons:
• Obviously unreal data sets ('In a country called Randomania, the Grand
Vizier wanted to know the average number of sheep per household') do
not convey to the students the importance and relevance of the discipline
of statistics. If the technique being taught is as important as the teacher
claims, how is it that he/she has been unable to find a real example?
• If data purporting to come from some real domain are invented ('Suppose
a researcher wanted to find out if women scored higher than men on the
WAIS-R test') there is the risk of misleading - it is in fact quite difficult
to create realistic artificial data sets unless one is very familiar with the
application area. One needs to be sure that the means are in the right range,
that the dispersion is realistic, and so on.
• Inventing data serves to reinforce the misconception that statistics is a
science of calculation, instead of a science of problem solving. To avoid
this risk it is necessary to present real problems along with the statistical
solutions.
Since artificial data sets have a number of drawbacks, real ones must be found.
And this is often not easy. Many subject matter journals do not give the raw
data, but only the results of statistical analyses, typically insufficient to allow
reconstruction of the data. One can spend hours browsing through books and
journals to locate a suitable set. We have spent many hours so doing, and
we are certain that many other teachers of statistics share our experience.
For this reason we decided that a source book, a volume containing a large
number of small data sets suitable for teaching, would be valuable. This book
is the result. In what follows, about 500 real small data sets, with brief
descriptions and details of their sources, are given.
Of course, a book such as this will only realize its potential if users can
locate a data set to illustrate the sort of technique that they wish to use. This
obviously must be achieved through an index, but it is perhaps not as easy
as it might appear. Data sets can be analysed in many different ways. A
contingency table can be used to illustrate chi-squared tests, for log-linear
modelling, and for correspondence analysis, but it might also be used for
less obvious purposes. It might be used to illustrate methods of outlier detection,
the dangers of collapsing tables, Simpson's paradox, methods for estimating
small probabilities, problems with structural zeros, sampling inadequacies,
or doubtless a whole host of other things we have not thought of. So indexing
the data sets by possible statistical technique, while in a sense ideal, seemed
impracticable.
An alternative was to index the data sets by their properties, so that users
could find a data set which had the sort of structure they needed to demonstrate
whatever it was that they were interested in.
This was the strategy we finally adopted. The book has two indexes:
(i) a data structure index,
(ii) a subject index.
The second of these is straightforward. It simply contains keywords describing
the application domain - the technical area and the problem from which the
data arose.
The first is more difficult. There are various theories of data which we
could have used to produce a taxonomy through which to classify the data
sets in this volume (for example, Coombs, C.H. (1964) A theory of data.
New York: John Wiley & Sons). However, none of them seemed to provide
the right mix of simplicity and power for our purposes. We needed an approach
which could handle most of the data sets, but which was not excessively
complicated and difficult to grasp. It was not critical if particularly unorthodox
data sets had to be handled by supplementary comments, provided this did
not occur too frequently.
The approach we adopted was to describe the data sets in terms of:
(a) two numbers, the first representing the number of independent units
described in the set and the second the number of measurements taken
on each unit;
(b) a categorization of the variables measured;
(c) an optional supplementary word or phrase describing the structure in
familiar terms.
Of course, such descriptions are not always unambiguous. For example, they
tell us nothing about any grouping structure beyond that contained in terms
such as nominal, categorical, and binary in (b) above. However, to have
included such descriptions would have led to substantially greater complexity
of description.
Also, there is often more than one way of describing any given data set.
In particular, the description of a data set will depend on the objective of
the analysis. As a simple example, consider responses given as percentages
of items correct in each of six tests taken by two people. If the aim is to compare
people, then one could describe this as two units, each with six scores.
Alternatively, if the aim is to compare tests, then one could describe it as
six units, each with two scores.
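The two readings of this example are transposes of one another. The short Python sketch below illustrates only the shapes involved; the labels `person A`, `person B` and `test 1` to `test 6` are hypothetical placeholders, and no actual scores are assumed.

```python
# Hypothetical labels only: each cell holds a (person, test) pair standing
# in for the percentage score that person obtained on that test.
people = ["person A", "person B"]
tests = [f"test {i}" for i in range(1, 7)]

# Reading 1: two units (people), each with six measurements (tests).
by_person = [[(p, t) for t in tests] for p in people]

# Reading 2: six units (tests), each with two measurements (people).
# This is simply the transpose of the first table.
by_test = [list(column) for column in zip(*by_person)]


def shape(table):
    """Return (number of units, number of measurements per unit)."""
    return (len(table), len(table[0]))


print(shape(by_person))  # units and measurements when comparing people
print(shape(by_test))    # units and measurements when comparing tests
```

In the notation of the data structure index, the first reading would begin with the numbers 2 6 and the second with 6 2.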
Such disadvantages have to be weighed against the merits of keeping the
descriptive scheme short and this is the spirit in which our data structure index
was treated. It is not intended to lead the user to the single data set which
will do the job but to several which might be suitable and from which a choice
can be made. And, a point to which we return below, it is not intended to
take the place of casual browsing through the data sets.
The terms used in (i) above are as follows:
• A grouping of subjects has been represented by a nominal, binary
(if two groups), or categorical variable, though the table might show
the groups separately rather than give an explicit grouping variable.
• Nominal represents a variable with unordered response categories, and
categorical represents a variable with ordered response categories.
• The term numeric has been used to indicate measurement on an interval
or ratio scale.
• Values expressed as 'parts per million' could sensibly be regarded as
proportions, ratios, counts, rates, or simply numeric. We adopted
whatever seemed most sensible to us in the context of the example (which,
of course, need not seem sensible to you, though we hope it does).
• Other terms have been used occasionally, where we thought them desirable,
such as maxima, if the values represent the maximum values observed
in some process, count, if the values are counts, and so on.
• The description of the numbers and types of the variables is sometimes
followed by an overall description of the data set. These occur in square
brackets. Examples are:
[survival] to indicate that the data show survival times (and will
often be censored). Censoring is indicated by a binary variable in the
description of the data set, though again it may not appear explicitly
in the data (it may appear as asterisks against appropriate values - the
descriptions of each data set will make things clear).
[spatial] to indicate that there is a spatial component to the data. (Such
data would normally require a complex descriptive scheme to describe
them adequately, which would be contrary to our aim of simplicity and
could not be justified for a mere handful of data sets.)
[time series]: the fact that the data are a time series will be apparent
from the descriptions of the variables - having the form n m rep(r),
with rep(r) signifying repeated r times. Nevertheless, we thought it
helpful to flag such data sets explicitly.
[latin square].
[transition matrix].
[dissimilarity matrix].
[correlation matrix].
Some examples of data descriptions are:
40 3 numeric(2), binary [survival]
which means that there are 40 cases, each measured on three variables, two
of which are numeric and the third binary (if there is censoring in the survival
data, this is indicated by this binary variable). The final term indicates that
it is survival data.
1 26 rep(26)count [time series]
which indicates that a single object is measured 26 times, producing a single
count on each occasion.
9 x 9 [correlation matrix]
is a correlation matrix of size 9.
13 22 rep(11)(numeric, rate)
shows that 13 objects each produce a numeric score and a rate on each of
11 occasions.
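For readers who want to work with these descriptions programmatically, the following Python sketch parses the four example descriptions above. It is an illustration only, not part of the book or its data disk: the names `Descriptor`, `parse_descriptor` and `count_variables` are hypothetical, and the pattern handles only the forms shown in these examples, so other entries in the index may need extra cases.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    units: int                # number of independent units
    measurements: int         # number of measurements on each unit
    variables: str            # raw variable categorization, e.g. "numeric(2), binary"
    structure: Optional[str]  # supplementary term from square brackets, if any


# Two integers (optionally separated by "x" for matrices), then an optional
# variable list, then an optional [structure] term.
_PATTERN = re.compile(
    r"^\s*(\d+)\s+(?:x\s+)?(\d+)\s*([^\[\]]*?)\s*(?:\[([^\]]+)\])?\s*$"
)


def parse_descriptor(text: str) -> Descriptor:
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised descriptor: {text!r}")
    units, measurements, variables, structure = match.groups()
    return Descriptor(int(units), int(measurements), variables.strip(), structure)


def count_variables(spec: str) -> int:
    """Count the measurements implied by a variable list such as
    "numeric(2), binary" or "rep(11)(numeric, rate)"."""
    repeated = re.fullmatch(r"rep\((\d+)\)\s*(.*)", spec)
    if repeated:
        inner = repeated.group(2).strip()
        if inner.startswith("(") and inner.endswith(")"):
            inner = inner[1:-1]
        return int(repeated.group(1)) * count_variables(inner)
    total = 0
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        counted = re.fullmatch(r"(.+?)\((\d+)\)", item)
        total += int(counted.group(2)) if counted else 1
    return total


for example in [
    "40 3 numeric(2), binary [survival]",
    "1 26 rep(26)count [time series]",
    "9 x 9 [correlation matrix]",
    "13 22 rep(11)(numeric, rate)",
]:
    d = parse_descriptor(example)
    print(d, count_variables(d.variables))
```

For the three non-matrix examples, the count recovered from the variable list agrees with the second number of the description, which gives a simple consistency check on a transcribed entry. The matrix example has no variable list, so its count is zero and the check does not apply.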
We stress again that this does not remove all ambiguity. For example, if
two counts are given, with one necessarily being part of the other (e.g. number
of children in a family and number of female children in a family) then it
may be described as count(2) or count, proportion. Similarly, very large
counts might arguably be treated as numeric. It is thus possible that our way
of describing a data set may not be the way you would have chosen. While
it may be possible to define a formal language, free from ambiguity, to describe
all conceivable data structures, the complexity of such a language would be
out of place here. We hope and expect that users of this volume will browse
through it. In any case, it is worth noting that many of the data sets have
intrinsic interest in their own right and are informative, educational, or even
amusing.
The data sets have been drawn from a very wide range of sources and
application domains and we have also tried to provide material which can
be used to illustrate a correspondingly wide range of statistical methodology.
However, we are all too aware of the enormous size of the discipline of
statistics. If you feel that our coverage of some subdomain of statistics is too
weak, then please let us know - we can try to rectify the inadequacy in any
future edition that may be produced.
Similarly, while we have made every effort to ensure the accuracy of the
figures, given the number of digits reproduced it is likely that some
inaccuracies exist. We apologise in advance should this prove to be the case
and would appreciate being informed of any inaccuracies that you spot. At
least the presence of the data disk will remove the risk of further data entry
errors beyond those we may have introduced!
In this connection, the filename of each data set, as used on the data disk,
is indicated in the data structure index.
We hope that the data sets collected here will be of value to both teachers
and students of statistics. And that both teachers and students will enjoy
analysing them.
David J. Hand
Fergus Daly
A. Dan Lunn
Kevin J. McConway
Elizabeth Ostrowski
The Open University, July 1993