The Statistical Sleuth
Preface
This book is written for those who need to use statistical methods to analyze data from experiments and observational studies and who need to communicate the results to others. It is intended as a text for graduate students who are preparing to design, implement, analyze, and report their research. The students must have some knowledge of basic statistical concepts such as means, standard deviations, histograms, the normal and t-distributions, but they need not be familiar with calculus or matrix algebra. All should have access to statistical software on a moderately powerful computer.
Statistics is like grout — the word feels decidedly unpleasant in the mouth, but it describes something essential for holding a mosaic in place. Statistics is a common bond supporting all other sciences. It provides standards of empirical proof and a language for communicating scientific results. Statistical sleuthing is the process of using statistical tools to answer questions of interest. It includes devising experiments to unearth hidden truths, describing non-ideal real data using tools based on ideal mathematical models, answering the questions of interest efficiently, verifying that the tools are appropriate, and snooping around to see if there is anything more to be learned. The Statistical Sleuth will show you how this all comes about.
Case Studies. The Statistical Sleuth is organized around case studies, which begin each chapter and are used throughout to illustrate how the statistical tools operate. A small section entitled “Summary of Statistical Findings” accompanies each case study, demonstrating how to communicate statistical findings for a research publication. You should realize that the methods upon which the findings are based will be foreign to you upon first reading. After the chapter has been read, you should return to the studies and consider carefully how the chapter’s methods have been used to answer the questions posed by the researchers. Examine each case study carefully for its structural design. Ask yourself why the study was structured in the way it was. The studies will not only serve to illustrate techniques in analysis; most also illustrate how your own studies might be structured.
Mathematical Level. The emphasis of this book is on the practical usage of statistical methods. The correct practical usage of statistical tools requires that the user has some understanding of what’s behind the tools. Sometimes algebra or elementary mathematical statistics are the best device for communicating the motivation. In general, however, the level of mathematics required to follow this book is not high.
What Will You Learn? Do not expect to learn all that you will need to make you an experienced analyst. You will improve your understanding of statistical reasoning and of measures of uncertainty. You will learn how to translate mountains of computer output into short summary statements communicating the results in a language common to all scientists. You will also learn a fairly large body of statistical tools that will be useful for a wide range of problems. But there are many more tools that are not covered in this book and many lessons that can only be gained through experience. At some point you may need to seek the help of a professional statistician. Then, at least, you will know the language, the general tools, and the spirit of statistical data analysis, which will make communication with a statistician more effective and beneficial.
Level of Sophistication. The level of sophistication is high when it comes to models and methods needed to analyze data and interpret results, but low when it comes to mathematics. Our foremost concern is that future researchers learn a proper attitude for conducting the statistical aspects of their research. To this end, mathematics is neither sought out nor avoided.
Case Studies. Most chapters begin with two case studies for motivation and demonstration. Making these a central feature forces us to consider applied statistics more seriously than if we simply provided a data set as a demonstration of a particular tool. It forces us to maintain a question-driven approach to the analysis of data.
The case studies are also our tool for exciting students. We cannot successfully teach them if they are not genuinely interested. We have tried to find a variety of interesting real data sets where the statistical analysis provides useful answers to questions of interest. In some cases, we have found descriptions and summary statistics that make excellent examples, but we were unable to obtain the raw data. We have used some of these by generating data to match the summary statistics. In their descriptions, we use the phrase “based on a real study” to identify these.
Although we have made the data problems central, the limitations of space prevent us from including all the graphical displays and different analyses we would like to present. We encourage the instructor to go into more depth in showing computer output, graphical displays, and alternative analyses.
The Starting Point. At first glance, the first four chapters of The Sleuth appear to be devoted exclusively to the two-sample problem. This is not the case. The Sleuth outlines the fundamentals of drawing sound inferences: making sure the design justifies the inference, making sure that the data are in general agreement with the models upon which the tools are based, making scale adjustments to bring problems into conformity with assumed statistical models, using alternative tools for situations where standard tools are expected to be invalid, and presenting the inferences in intelligible formats. The two-sample problem is merely the most convenient backdrop against which these various tough issues may be clearly displayed.
Material Covered. The Sleuth’s principal tool is regression analysis. We have added several topics that are not ordinarily covered in a regression text: (1) Generalized Linear Models, including logistic and log-linear regression. This important tool enables researchers to analyze a wide range of problems that have until recently been analyzed with inappropriate tools (ANOVA) or with appropriate but difficult tools (contingency table chi-squares). We stress the analogy of generalized linear model regression with ordinary regression. With calculations provided by statistical software, this tool is entirely understandable and extremely valuable. (2) Repeated Measures. Whereas there is a great tendency for researchers to turn to a computer package that has “repeated measures” in the title, we feel there needs to be more guidance on a strategy for considering such data. Chapters 16 and 17 respectively emphasize question-driven and data-driven reduction of dimensionality. (3) Serial Correlation. Although full treatment of time series analysis is outside the scope of this book, adjustment and filtering for the first-order autoregression provide tools that expand the usefulness of regression technology to problems involving serial correlation.
The coverage has evolved from consideration of the kinds of problems graduate researchers typically encounter. These topics arise repeatedly in the campus-wide consulting service operated by Oregon State University’s Statistics department for faculty and graduate students. Inclusion of these topics was also designed to relieve pressure to offer separate courses in categorical data analysis, in multivariate analysis, and in time series analysis to non-statistics majors or to enroll them in classes designed primarily for statistics majors.
Possible Paths Through the Chapters. The Sleuth was designed for a three-quarter sequence covering eight chapters per term. Not all the material is typically covered, and we have provided a number of optional topics as “Related Issues” at the end of each chapter. The book may also be used for a two-semester class in its entirety. For a one-semester or two-quarter class, we recommend the following sequence: conclusions and interpretations (1-4), several-sample problems (5-6), simple linear regression (7-8), multiple regression (9-12), two-way analysis of variance (13-14), and logistic regression (20-21). There is room to mix and match to suit specific needs.
The Sleuth covers regression prior to two-way analysis of variance, in contrast with the more traditional presentation of two-way ANOVA directly after one-way ANOVA. Our reasoning here is that regression tools applied to the two-way situation are easier to interpret. They are also less subject to misunderstandings arising from imbalance in the design. Experimental design chapters (23-24) appear at the end of the book. In truth, design issues are discussed throughout the text as they apply to the case studies. Topics such as replication, blocking, factorial treatment structures, and randomization appear repeatedly. So the actual chapters on design organize and summarize the issues. We also believe it is difficult to design an experiment without an understanding of the analytic tools that can be applied.
A computer and a packaged statistical computer program are essential companions for The Statistical Sleuth. There are a great number of good packages available. Unfortunately, they differ considerably in their style, language, and output. Some instruction about the use of the particular package must accompany the instruction of the statistical tools in this book. The main point here is that the statistical analysis should be guided by good statistical strategy and not by the package, which is just a means for accomplishing the end. The data sets presented as case studies and as exercises are available on an enclosed disk.