In this first part of a three- or four-part series, I’m going to address how techniques known as exploratory data analysis can be used to understand scientific data better than typical charts and tables allow. This first part sets up Parts 2 and 3, both of which will include several figures with explanations illustrating specific examples. The examples I’m using are based upon environmental data from my actual projects, but the techniques they illustrate can be used in other branches of science as well.
In the process of doing environmental science, large amounts of data are usually collected at significant expense. Often these data are used in specific but fairly superficial ways; for example, to determine whether regulatory criteria are exceeded or as “evidence” that property or health has suffered damage. Any remaining information in the data is often not extracted.
However, there are difficult environmental questions and problems that require a more comprehensive understanding. If you are called upon to determine why there is high organism mortality in a lake, to ascertain whether a client is responsible for contaminating a neighborhood, or to assess whether the concentrations of metals in soil are due to facility operations, natural background levels, or historic fill, tables of analytical data and t-tests are usually neither sufficient nor convincing. Often trends and patterns in environmental data do not readily emerge from tables and numerical statistics, but understanding this hidden information can be crucial for answering such difficult questions and solving recalcitrant problems.
Exploratory and inferential data analysis (“EDA”) provides rapid and comprehensive approaches for finding patterns and trends in data using statistical methods. Good EDA programs provide rapid visualization of statistical and geostatistical data across multiple variables and allow efficient hypothesis testing to determine whether the trends and patterns are statistically significant. In fact, preformed hypotheses are not necessarily required because hypotheses can emerge from the trends and patterns. In combination with appropriate educational training and professional experience, which are needed to understand and explain the emergent patterns, EDA is a powerful tool for answering environmental questions and finding solutions to environmental problems. The posts to follow will describe and illustrate how EDA was used to process data from actual sites and answer the questions posed above.
Large quantities of data are typically generated when we assess the natural state of the environment or its responses to human activities and, quite often, much of the information in those data is never extracted. The data that are collected generally serve important purposes: assessing risks to human health and other organisms, establishing baselines against which environmental changes can be compared, indicating trends in those changes, triggering regulatory action at a given concentration based upon studies of risk, supporting lawsuits for damages believed due to these environmental conditions, and so on.
Quite often, the scientists and engineers using the data are looking for something very specific, for example to negate concerns that the soils and groundwater exceed regulatory criteria such as the Michigan Part 201 criteria (“Part 201”). At a more complex level of need, the data user might be trying to decide whether active remediation is needed, whether monitored natural attenuation is occurring at an acceptable rate, whether enough data have been collected to minimize potential data gaps, or how and where to install active systems for remediation of contaminants. I have required relatively complex data queries to, among other tasks:
- Review available information on the sediments in Manistee Lake and determine why the sediments in the lake are essentially devoid of benthic organisms in certain areas,
- Resolve whether a facility is responsible for widespread contamination of an area when the facility is being sued for that contamination, and
- Spatially and volumetrically discriminate between natural soil constituents, industrial fill materials and waste-contaminated soils at old industrial sites.
Exploratory and inferential data analysis can often meet the demands of such complex scenarios because it lets the collected data be manipulated in a variety of ways, so that hypotheses can be tested quickly and visualization of the data can suggest alternative hypotheses. EDA can provide rapid and comprehensive approaches for finding patterns and trends in data using spatial and statistical methods. In the process of developing informative graphical displays of the data, EDA can generate numerical values for the appropriate statistics to indicate whether or not the relationships among the different datasets are significant. Conceptual site models (“CSM”) can be developed using EDA, and these are generally much easier for clients and regulators to understand than simple tables or maps of sample locations that contain lists of contaminant concentrations. This is particularly true for complex sites or when asking complex questions about a site.
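To make the idea of attaching a significance value to a visible pattern a little more concrete, here is a minimal sketch in Python of one possible approach: a permutation test on the correlation between two variables. It is not the method used in DataDesk or Aabel, and the numbers (distance from a facility and lead in soil) are hypothetical values invented for illustration, as are the function names.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real pairing between x and y.
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical values: distance from a facility (m) and lead in soil (mg/kg).
distance_m = [50, 120, 200, 310, 400, 520, 650, 800, 950, 1100]
lead_mg_kg = [410, 380, 290, 300, 210, 180, 150, 160, 90, 110]

r = pearson_r(distance_m, lead_mg_kg)
p = permutation_p_value(distance_m, lead_mg_kg)
print(f"r = {r:.2f}, permutation p = {p:.4f}")
```

A permutation test is used here only because it makes few assumptions about how the data are distributed, which suits the often skewed concentrations seen in environmental work; a real analysis would also need to consider spatial autocorrelation and censored (non-detect) results.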
The body of this series consists primarily of describing subsets of the EDA done on two real-world examples, the first two tasks listed above, to illustrate how an EDA approach can help scientists understand and resolve difficult or seemingly intractable issues.
EXPLORATORY DATA ANALYSIS
J.W. Tukey coined the term “exploratory data analysis” in 1977. He defined and described EDA as follows:
“…the examination of data with minimal preconceptions about its structure through which it is hoped that relationships and patterns, at least some of which are unanticipated, will be uncovered.”
“The principal theme is flexibility of technique. Good exploration requires many and varied analyses of the same data, and a certain amount of trial and error is expected. Creativity in approach may, therefore, be crucial. Precision and efficiency in any particular analysis are not nearly as critical as robustness and convenience.”
“A secondary theme is that structures and patterns in data fall into two broad classes: the obvious and the surprising. The techniques discussed can be correspondingly divided into those designed to display clearly and simply the first level of structure and those intended to look beyond those relations to uncover the unexpected features underlying what confirmatory analysis might dismiss as residuals or errors.”
When Tukey developed EDA, virtually all aspects of the exploration had to be calculated manually. Since then, many computer programs have been developed that leverage his ideas and allow the rapid exploration and analysis of data. Most of the data explorations, statistical charts, tables, and other displays and values presented in this series were generated using either DataDesk or Aabel, programs that follow Tukey’s approach to examining data to the extent possible.
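As a small illustration of what such programs automate, the sketch below computes a five-number summary (the basis of a box plot) and flags values beyond the usual 1.5 × IQR fences. It is a rough Python approximation under stated assumptions: it uses the standard library’s quartiles rather than Tukey’s original hinges, which can differ slightly, and the arsenic concentrations and function names are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


def tukey_outliers(values, k=1.5):
    """Values lying more than k * IQR beyond the lower or upper quartile."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical arsenic concentrations in soil samples (mg/kg).
arsenic_mg_kg = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 27.5, 5.6, 4.9]

print(five_number_summary(arsenic_mg_kg))
print(tukey_outliers(arsenic_mg_kg))
```

In Tukey’s terms, the summary displays the obvious first level of structure, while the flagged values are candidates for the surprising: samples worth a closer look rather than automatic dismissal as errors.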
OK. That’s it for Part 1. The first example (Part 2) will be about an EDA investigation that sought to better understand the nature and biological impacts of sediment contaminants that had been found in Manistee Lake, Michigan. Part 3 will focus on the use of EDA in a lawsuit in which a large company was being sued in federal court for air and soil contamination, and I was the expert witness for the defendant.