Environmental and Ecological Statistics with R
© 2010 by Taylor & Francis Group, LLC
CHAPMAN & HALL/CRC
APPLIED ENVIRONMENTAL STATISTICS
Series Editor
Richard Smith
University of North Carolina
U.S.A.
Published Titles
Michael E. Ginevan and Douglas E. Splitstone, Statistical Tools for
Environmental Quality
Timothy G. Gregoire and Harry T. Valentine, Sampling Strategies for
Natural Resources and the Environment
Daniel Mandallaz, Sampling Techniques for Forest Inventory
Bryan F. J. Manly, Statistics for Environmental Science and
Management, Second Edition
Steven P. Millard and Nagaraj K. Neerchal, Environmental Statistics
with S-PLUS
Song S. Qian, Environmental and Ecological Statistics with R
Chapman & Hall/CRC
Applied Environmental Statistics
Environmental and Ecological Statistics with R
Song S. Qian
Nicholas School of the Environment
Duke University
Durham, North Carolina, U.S.A.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2010 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20110725
International Standard Book Number-13: 978-1-4200-6208-3 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Preface
Statistics is part of the curriculum of almost all environmental and ecological
studies departments and programs in higher education institutions worldwide.
Yet statistics is also often cited as the subject that is least liked and
ineffectively taught, especially for students outside the mathematics/statistics
area. A common problem in learning statistics is that statistics is often
perceived as a subfield of mathematics. Consequently, we expect to learn a set
of rules and be able to use statistics in our work. But applied statistics is
not mathematics. This book represents an effort in bridging the gap between a
typical applied statistics text and the need of scientists in environmental and
ecological fields, with an emphasis on the inductive nature of statistical
thinking. Much of the mathematical/theoretical background is avoided. Examples
are used to introduce concepts and to illustrate methods. Statistics is
introduced as a tool to facilitate scientific thinking, as it was intended when
R.A. Fisher introduced statistics to applied scientists.
The approach adopted by this book follows Fisher’s general steps of a
statistical modeling problem, namely, model specification, parameter estimation,
and model evaluation. These steps are similar to the steps a scientist takes
in a scientific project. However, as discussed by many, statistics is often the
subject that students in science and engineering do not like [Berthouex
and Brown, 1994] and upon which ecologists often make mistakes [Peters,
1991]. The difficulty lies in the disconnect between a typical applied
statistics course/book and a typical scientific problem. In solving a scientific
problem, we start with a hypothesis about the underlying mechanism as the basis
for data collection. The proposed hypothesis provides the basis for formulating
a model, often with unknown parameters. Experiments and other data collection
efforts are to provide data for estimating these unknown parameters. Once these
parameters are estimated, scientists can evaluate the model by comparing the
model’s predictions to new observations. In this simplified summary of a
scientific problem-solving process, the first step (forming a hypothesis) is
often the most difficult part and requires the scientist to be both experienced
and creative. Model/hypothesis formulation is also the most important step of
the process because a wrong model will never lead us to success. In applied
statistics, the typical steps we take, as described by R.A. Fisher, are similar
to the steps of a scientific problem-solving process. With a specific problem,
we must first examine the data and propose a statistical model to describe the
distribution of the variable of interest. The statistical model is
parameterized with unknown parameters to be estimated with data.
When the parameters are estimated, we must assess the uncertainty of such a
model by examining the sampling distributions of the estimated parameters.
This similarity in the processes of scientific problem solving and statistical
model development, however, does not translate into easy learning of statistics
for scientists. The difficulty is the transition from a scientific hypothesis to
a statistical model. There are, unfortunately, no easy-to-follow steps to make
this transition. A typical applied statistics course/book presents the subject
as a collection of methods for different types of statistical models, and more
or less ignores the problem of model formulation. This treatment is inevitable,
because model formulation is necessarily a scientific problem. Applied
statistics books or courses are focused on the statistical problems of parameter
estimation and model evaluation. Different types of models often require
different mathematical solutions. Frequently, this treatment of statistics leads
to a misperception of what statistics is and why we learn statistics.
This book is motivated by this underlying link between statistical thinking
and scientific methods. The book is still organized based on statistical models.
However, throughout the book, examples are used to discuss each type of
statistical model, and some of these examples are used to cover several types
of models. The emphasis of these examples is on model formulation; the
underlying mathematical/statistical theories are mostly omitted and replaced
by presentations of the R implementation of these models. The book is based on
teaching materials I accumulated at the Nicholas School of the Environment
of Duke University. The book can be divided into three units.
• Chapters 1 to 5 have been used in a graduate-level applied data analysis
  course. They can be read as a unit to serve as a prerequisite for advanced
  statistical modeling. These chapters are intended for building a foundation
  so that readers will be able to conduct a simple data analysis task
  such as exploratory data analysis and fitting linear regression models.
• Chapters 6 to 8 have been used in a follow-up course in statistical
  modeling. The three chapters in this unit are somewhat independent of each
  other, and they can be read separately. The same is true for the three
  topics in Chapter 8 (Sections 8.1-8.4, 8.5, and 8.6).
• Chapters 9 and 10 have been used for a PhD-level independent study
  course. Chapter 9 discusses the use of simulation for model checking,
  providing tools for a critical assessment of the developed model.
  Simulation is commonly used for parameter estimation and for uncertainty
  assessment. The use of simulation for model checking, although less
  frequently discussed in the literature, is an important aspect of model
  development and assessment. Chapter 10 discusses the use of multilevel
  regression models, a class of models that can have a broad impact in
  environmental and ecological data analysis.
Data sets and R scripts used in the book are available online at
http://www.duke.edu/~song/eeswithr.htm.
Many people helped in the process of writing this book. Kenneth H. Reckhow,
Curtis J. Richardson, and Michael Lavine are my mentors and longtime
collaborators. This book reflects their influence on my approach to
environmental and ecological statistics. Collaboration with Yandong Pan improved
my understanding of ecological problems and the problem-solving process in
ecology. Craig A. Stow constantly feeds me with interesting ideas and papers.
His work in analyzing the PCB in fish data is greatly appreciated. Olli
Malve, George B. Arhonditsis, and Andrew D. Gronewold spent numerous
hours helping me sort through ideas and concepts. Thomas F. Cuffney and
Gerard McMahon presented the EUSE example to me and spent many hours
discussing the example used in Chapter 10. Zehao Shen hosted me at Peking
University in the summer of 2007 and provided many interesting examples.
Richard L. Smith read the manuscript of the book and provided a critical
review, which helped greatly in the presentation of the book and in improving
the clarity of the discussions of some key concepts. Many errors were found and
improvements suggested by Meg Mobley, Ibrahim Alameddine, Itai Shelem,
Kristen Marine, Emily Sharp, Erin Gray, and Wyatt Hartman.
Song S. Qian
Durham, North Carolina
March, 2009
Contents
Preface
Table of Contents
List of Tables
List of Figures
I Basic Concepts
1 Introduction
1.1 The Everglades Example
1.2 Statistical Issues
1.3 Bibliography Notes
2 R
2.1 What is R?
2.2 Getting Started with R
2.2.1 R Prompt and Assignment
2.2.2 Data Types
2.2.3 R Functions
2.3 The R Commander
3 Statistical Assumptions
3.1 The Normality Assumption
3.2 The Independence Assumption
3.3 The Constant Variance Assumption
3.4 Exploratory Data Analysis
3.4.1 Graphs for Displaying Distributions
3.4.2 Graphs for Comparing Distributions
3.4.3 Graphs for Exploring Dependency Among Variables
3.5 From Graphs to Statistical Thinking
3.6 Bibliography Notes
4 Statistical Inference
4.1 Estimation of Population Mean and Confidence Interval
4.1.1 Bootstrap Method for Estimating Standard Error
4.2 Hypothesis Testing
4.2.1 T-Test
4.2.2 Two-Sided Alternatives
4.2.3 Hypothesis Testing Using the Confidence Interval
4.3 A General Procedure
4.4 Nonparametric Methods for Hypothesis Testing
4.4.1 Rank Transformation
4.4.2 Wilcoxon Signed Rank Test
4.4.3 Wilcoxon Rank Sum Test
4.4.4 A Comment on Distribution-Free Methods
4.5 Significance Level α, Power 1 − β, and p-Value
4.6 One-Way Analysis of Variance
4.6.1 Analysis of Variance
4.6.2 Statistical Inference
4.6.3 Multiple Comparisons
4.7 Examples
4.7.1 The Everglades Example
4.7.2 Kemp’s Ridley Turtles
4.7.3 Assessing Water Quality Standard Compliance
4.7.4 Interaction between Red Mangrove and Sponges
4.8 Bibliography Notes
II Statistical Modeling
5 Linear Models
5.1 ANOVA as a Linear Model
5.2 Simple and Multiple Linear Regression Models
5.2.1 The Least Squares
5.2.2 PCBs in the Fish Example
5.2.3 Regression with One Predictor
5.2.4 Multiple Regression
5.2.5 Interaction
5.2.6 Residuals and Model Assessment
5.2.7 Categorical Predictors
5.2.8 The Finnish Lakes Example and Collinearity
5.3 General Considerations in Building a Predictive Model
5.4 Uncertainty in Model Predictions
5.5 Two-Way ANOVA
5.5.1 Interaction
5.6 Bibliography Notes
6 Nonlinear Models
6.1 Nonlinear Regression
6.1.1 Piecewise Linear Models
6.1.2 Example: U.S. Lilac First Bloom Dates
6.2 Smoothing