This is Part 3 of my series on using exploratory and inferential data analysis (“EDA”) to solve practical problems in complex environmental scenarios. Part 1 defined EDA and Part 2 was an example of using EDA to assess sediment toxicity in a lake. This second example addresses a lawsuit against a major manufacturing facility and how, as an expert witness, I used EDA to show that the accused manufacturer could not be held liable for metal contamination of neighborhood soils.

Example 2: Big Z Corporation Litigation

This EDA addressed a lawsuit where a company was being sued for air and soil contamination. Although a real circumstance, the specifics of this example have been modified to maintain client anonymity.

Big Z Corporation (“Big Z”) was being sued by a regional authority (“RA”) and accused of contaminating soils in its vicinity with zinc (“Zn”) via air deposition over a period of years. To press the suit, the RA collected soil samples from a large number of locations outside the boundaries of Big Z and had them analyzed for Zn and certain other metals. Because Big Z was assumed to be the largest user of Zn in the area, the finding of high Zn concentrations in many of the soil samples was sufficient for the RA to blame Big Z and incorporate this claim into the lawsuit, using the soil samples as evidence.

As the soil chemistry expert for Big Z and their attorneys, I performed EDA on soil concentration data obtained from the RA via the lawsuit to determine whether these data contained information indicating that Big Z was, or was not, the likely source of the metal contamination. The analyses presented here are a subset of the EDA that was done, but they were the central components.


The soil data were subjected to a variety of comparative, statistical and geographical information system (“GIS”) mapping procedures. This was done to assess whether sufficient information was contained in the analyses and concentration distributions to point to Big Z as the source or whether other sources might be responsible. Other sources could include natural soil mineralogy, historic activities, or other facilities. Initial procedures included comparison of the soil metal concentrations to Michigan’s regulatory requirements, the Part 201 Residential and Commercial 1 Generic Cleanup Criteria and Screening Levels, and to those concentrations found by the Michigan Background Soil Survey 2005.

Statistical evaluation procedures for the soil samples involved a range of analyses including simple descriptive statistics of individual analyte concentrations (mean, median, range, skewness, etc.) along with spatial, trend and regression analyses in an attempt to determine whether any information could be gleaned from these soil data that might indicate the metal sources. Tabulations of the metal analyte concentration data, comparisons of the concentrations to Part 201 criteria and background soil concentrations, as well as simple descriptive statistics and GIS maps depicting the concentration distributions of these metals across the sampled areas (with concentrations color-coded relative to Part 201 criteria), were compiled.
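As a rough illustration of the simple descriptive statistics mentioned above, here is a minimal Python sketch. It is not the analysis that was run in Aabel, DataDesk or R; the function name `describe` and the Zn values are hypothetical, and skewness is computed as the plain moment ratio rather than any particular package's adjusted estimator.

```python
import statistics

def describe(values):
    """Return simple descriptive statistics for a list of concentrations."""
    n = len(values)
    mean = statistics.fmean(values)
    # Second and third central moments (population form) for the skewness ratio.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "skewness": m3 / m2 ** 1.5,
    }

# Hypothetical Zn concentrations (mg/kg), invented for illustration only.
zn_mg_kg = [48.0, 55.0, 61.0, 70.0, 82.0, 95.0, 130.0, 410.0]
stats = describe(zn_mg_kg)
```

A mean well above the median together with positive skewness, as in this invented dataset, is the usual signature of a few high values pulling the distribution to the right.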

To develop a context for the EDA results, it was important to look at the industrial and urban uses of Zn, because the location of Big Z and the surrounding soils is in an area that has been highly industrialized for more than a century. In that regard, the specific industrial and commercial activities that occurred within and surrounding that area were also investigated, providing additional context for the soils data and the results of the EDA. Investigations such as these are useful and important because the EDA results can be supplemented and supported by them. The results of these preliminary investigations showed that Zn had been widely used and disseminated for more than a century in the area and that the natural soil background in the area had higher than average Zn concentrations, as did a lot of the industrial fill materials that had been used to level numerous properties across the region.

Sample concentration and location data were input into either a Microsoft Excel spreadsheet or the EDA program Aabel (Gigawiz.com) and transposed, coded or reorganized when necessary for GIS, data plotting or statistical analyses. Certain GIS displays of the concentration versus location data were done in the program ArcView 9.0 (not displayed here). Most of the statistical data analyses and visualizations were done in Aabel, although certain of the data were also evaluated using the R statistical program and some confirmatory statistics were carried out using the DataDesk program.

GIS displays did not sufficiently clarify the distribution of the Zn in the sampled areas relative to Big Z. To determine whether patterns of soil deposition might emerge in relation to potential sources using other approaches, transects were developed along lines of samples to establish whether there was any directionality to the distribution of the metals. Zn was chosen because this particular metal would be the one most expected to point to Big Z if this facility was the sole or even a primary source of Zn in the soils.

For the transect analysis, it was hypothesized that:

  1. A positive likelihood of Big Z being the source could be shown by a general increase in Zn concentrations as transects of soil samples directionally approached the Big Z property boundary (i.e., consistently increasing soil concentrations with approach to the facility); this result would require a plurality of transects showing this pattern.
  2. A negative likelihood of Big Z being the source could be shown by decreases in, or randomness of, Zn concentrations as transects of soil samples directionally approached the Big Z property boundary (i.e., decreasing or random soil concentrations with approach to the facility).
  3. A pattern of high soil concentrations at a generally consistent distance along the transects from the Big Z property boundaries might imply Zn sourcing by the Big Z facility (maximum soil deposition at a fairly consistent distance from the facility).
  4. A random relationship in soil Zn concentrations, with respect to Big Z boundaries, increases the likelihood of numerous, possibly localized, sources for the Zn.

Following the transect analyses, an analysis of spatial concentration “zones” with distance from the Big Z property boundaries was done; the hypotheses for this analysis were essentially the same as for the transect analyses.


Figure 6 provides an overview for the transect analyses, showing all Zn data in the vicinity of Big Z. This plot, created in the program Aabel, depicts:

  • Big Z property boundaries (light blue area)
  • Soil sample locations (all westerly of Big Z)
  • Transects across the soil locations
  • The general location of the Big Z operations
  • Other potential stationary industrial emission sources (by facility code)
  • Color-filled contours for the Zn concentrations to show areas of higher Zn concentration versus lower.

Figure 6. Sampling Locations, Transect Overview, and Zn Concentrations

The transect overview in Figure 6 displays information about sampling points that were high statistical lognormal outliers at the 95% upper confidence level for one or multiple metals (i.e., they were outside the expected concentration range based on the complete body of data at 95% confidence), which could potentially be called “hotspots,” highlighted by the type of marker:

  • A small yellow-tan circle indicates that no constituents were outliers relative to the body of the data at that sampling location.
  • A circular red “beach-ball” marker indicates a sampling location was an outlier for a metal other than Zn.
  • A hexagonal black marker indicates the sample location was an outlier only for Zn.
  • A hexagonal red marker indicates that the sample location was an outlier for Zn and at least one of the other metals.
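One plausible way to flag high lognormal outliers like those marked in Figure 6 is sketched below in Python. The exact rule used in the original analysis is not stated here, so this is an assumption: values are log-transformed, and anything above the mean plus the one-sided 95% normal quantile times the standard deviation (on the log scale) is flagged. The function name and the Zn values are hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev

def high_lognormal_outliers(values, confidence=0.95):
    """Flag values above a one-sided upper cutoff computed on the log scale.

    Assumes the data are roughly lognormal; this is one reading of
    "high outlier at the 95% upper confidence level", not the only one.
    """
    logs = [math.log(v) for v in values]
    mu, sigma = fmean(logs), stdev(logs)
    z = NormalDist().inv_cdf(confidence)      # about 1.645 for 0.95
    cutoff = math.exp(mu + z * sigma)         # back-transform to mg/kg
    return cutoff, [v for v in values if v > cutoff]

# Hypothetical Zn concentrations (mg/kg) with one obvious hotspot.
zn = [40, 45, 50, 52, 55, 58, 60, 62, 65, 70, 75, 80, 900]
cutoff, flagged = high_lognormal_outliers(zn)
```

Note that a large outlier inflates the standard deviation used to detect it, so more robust estimators (for example, ones based on the median) may be preferable when several hotspots are present.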

The transects in Figure 6 are shown as orange lines or polygons. The distal end (furthest from the Big Z facility) of each transect line/polygon is labeled with a capital letter and a prime symbol (’), and the proximal end (closest to Big Z) with just a plain capital letter; for example, “A’ – A” signifies a transect. A total of nine transects were done on each dataset and labeled A’ – A through I’ – I. Each transect was subjected to regression analysis (Zn concentration versus distance from the Big Z property boundary) as well as individual visualization using either bubble charts or spatial bar charts. An example is provided in Figure 7 for transect A’ – A.

Figure 7. A' - A Zinc Concentration Transect

A review of all the transect figures, such as Figure 7, which displayed linear regressions of Zn concentration along each transect, showed that the r² values were generally quite low (at or below 0.31 for eight of the nine transects; see Table 2), indicating that linear distance/direction along the transect line toward Big Z does not account well for the concentrations of Zn. The slopes of the regression lines are also generally not very large in either direction, while the scatter of the points about those lines is relatively large, supporting the r² determinations and the general lack of trending in concentration versus distance from the Big Z property.
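For readers who want to reproduce this kind of transect check, the following is a minimal Python sketch of an ordinary least-squares fit that returns the slope, intercept and r². The distances and concentrations are hypothetical, and the sign of the slope depends on how the x-axis is defined (distance from the boundary here), so it should not be read against the sign convention of the original figures.

```python
from statistics import fmean

def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r2), where r2 is the fraction of the
    variance in y accounted for by the straight line.
    """
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2

# Hypothetical transect: distance from the property boundary (km)
# and soil Zn concentration (mg/kg) at each sample location.
distance_km = [0.2, 0.5, 0.9, 1.3, 1.8, 2.4]
zn_mg_kg = [120, 310, 90, 260, 150, 205]
slope, intercept, r2 = linear_fit(distance_km, zn_mg_kg)
```

With scattered values like these, r² comes out close to zero, which is the pattern one would expect if distance from the facility explains almost none of the variation in concentration.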

Table 2 summarizes the r² and slope values for each transect and states whether Zn concentrations trended upward toward or away from the property. The table shows similar numbers of transects with slopes in each direction (four toward, five away). This tends to support the rather dispersed, random-appearing hotspots and colored contours of Figure 6. These results also support Hypotheses 2 and 4 (above): randomness of Zn concentrations relative to the Big Z property boundaries and the likelihood of numerous localized Zn sources.

Table 2. Transect Regression Data for Zinc

Transect   Regression Fit (r²)   Slope        Zn Concentration Slope Relative to Big Z Property
A’-A       7.30E-02              -5.37E+02    Away
B’-B       2.60E-02              6.89E+02     Toward
C’-C       6.06E-04              -3.60E+01    Away
D’-D       3.06E-01              -3.81E+03    Away
E’-E       7.20E-01              3.21E+03     Toward
F’-F       1.18E-06              -2.33E+00    Away
G’-G       8.21E-02              1.62E+02     Toward
H’-H       3.17E-04              -6.57E+01    Away
I’-I       1.53E-01              5.19E+02     Toward
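The tallies drawn from Table 2 can be checked with a few lines of Python. The values below are transcribed from the table; the variable names are my own, and the script only counts directions and summarizes r², it does not repeat the regressions.

```python
from statistics import median

# (transect, r2, slope, direction of increasing Zn relative to Big Z),
# transcribed from Table 2.
table2 = [
    ("A'-A", 7.30e-02, -5.37e+02, "Away"),
    ("B'-B", 2.60e-02, 6.89e+02, "Toward"),
    ("C'-C", 6.06e-04, -3.60e+01, "Away"),
    ("D'-D", 3.06e-01, -3.81e+03, "Away"),
    ("E'-E", 7.20e-01, 3.21e+03, "Toward"),
    ("F'-F", 1.18e-06, -2.33e+00, "Away"),
    ("G'-G", 8.21e-02, 1.62e+02, "Toward"),
    ("H'-H", 3.17e-04, -6.57e+01, "Away"),
    ("I'-I", 1.53e-01, 5.19e+02, "Toward"),
]

n_toward = sum(1 for row in table2 if row[3] == "Toward")
n_away = sum(1 for row in table2 if row[3] == "Away")
median_r2 = median(row[1] for row in table2)
best_fit = max(table2, key=lambda row: row[1])   # transect with the highest r2
```

This gives four transects trending toward the property and five away, with a median r² of 0.073. Only E’-E (r² = 0.72) shows a reasonably tight linear fit; the other eight are at or below 0.31.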

As an additional check on the spatial relationship of the Zn concentrations in the soil samples relative to the Big Z boundaries, the Zn data were separated into five spatially equivalent zones corresponding to surface areas replicated using the approximate shape of the Big Z boundary, extended westward across the soil sampling space. These zones were numbered one through five and are depicted in Figure 8.

Figure 8. Spatial Zones for Zinc Analysis

All the soil samples within a zone were selected as a group and the Zn concentrations within each zone’s group were compared to all the other zones, and the total of all the zones, using statistical methods. The comparisons consisted of one-way analysis of variance (ANOVA) combined with diamond plots, to compare the means, and box and whiskers plots to compare the medians. The ANOVA and mean tables along with these plots are provided in Figure 9.

Figure 9. Comparisons of Zone Means and Medians for Zinc

The diamonds in the “diamond means comparison plots” (left, Figure 9) show the concentration means (central line in each diamond) and 95% confidence intervals of those means (tips of the diamonds) for each zone relative to the grand mean of all zones combined (dashed line across the entire plot) and to one another. When the diamonds overlap at all, the zone means cannot be considered different with 95% confidence; when they don’t overlap, the means can be considered different. In this case, the means for all five zones are almost exactly equal to one another, and to the grand mean, and the diamonds fully overlap.

The same holds true for the medians (straight connected lines) in the box and whiskers plot of Figure 9 (right plot), with very small differences in their values and overlap of the 95% confidence “notches” surrounding the medians. The overlap of the notches indicates that the medians cannot be considered different with 95% confidence. These statistical plots indicate no significant differences in the means or the medians of the five zones as one progresses westward away from the Big Z property boundary. The plots are fully supported by both the mean and ANOVA tables in Figure 9. The means, standard deviations, and both lower and upper 95% confidence intervals are almost identical in the means table while the ANOVA table yields a probability (column “P > F”) that indicates none of the means has a significantly different value from the others. The average amount of Zn in the soils is the same, whatever the distance from Big Z.
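For readers curious about what sits behind an ANOVA table like the one in Figure 9, here is a minimal Python sketch of the one-way F ratio. It is an illustration only: the zone values are hypothetical, the function name is my own, and the sketch stops at the F ratio and its degrees of freedom rather than computing the “P > F” probability, which would normally come from an F table or a statistics package.

```python
from statistics import fmean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    # Variation of the group means about the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations about their own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical Zn concentrations (mg/kg) for five zones, invented so that
# the zone means are nearly identical while the scatter within zones is large.
zones = [
    [95, 140, 210, 80, 175],
    [110, 160, 90, 200, 145],
    [130, 85, 190, 150, 140],
    [100, 180, 120, 160, 135],
    [150, 95, 170, 115, 165],
]
f_ratio, df_between, df_within = one_way_anova(zones)
```

An F ratio far below 1, as these invented data produce, means the differences among zone means are smaller than would be expected from the scatter within the zones alone, which is the situation described above.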

To further eliminate any doubts or questions about the equivalency of the Zn means and overall zone concentrations, two additional modifications were made to the data for further comparison (figures not shown): (1) The natural log (ln) values of the Zn concentrations within each zone were similarly evaluated and there were still no significant differences between the means and medians. (2) Likewise, all potential statistical Zn high outliers were removed from the evaluations with the same result; no significant differences between means or medians.

These concentration zone results, particularly when combined with the transect analyses, show no discernible spatial relationship between Zn soil concentrations and Big Z, and make it virtually impossible that the source is solely Big Z activities. It would be difficult or impossible to envision a scenario in which the mean soil concentrations of Zn would be essentially the same at any zonal distance from a facility in an urban area if that facility were to blame for the concentrations of Zn in those soils. The results are far more indicative of long-term widespread deposition from numerous sources, combined with high native Zn topsoil content and with many instances of localized inputs (hotspots) that account for the high Zn values that were found in a few of the samples.

In summary, the EDA of the Zn soil data seems to indicate that Big Z was not predominantly responsible for the Zn present in the soils within the area.


Using the techniques of exploratory data analysis, particularly with current software packages, can be a rapid and effective way to formulate and test hypotheses using real-world data to achieve the goals and requirements of understanding site scenarios. It allows insight into the relationships among the data components through visualization of the data and elucidation of their patterns, trends and associated statistics. EDA allows the development of conceptual site models that are reality-based and are generally easy for clients and regulators to understand. It is particularly helpful for complex sites or when trying to answer complex questions about a site.

I have presented two examples of how EDA was used to resolve specific and important environmental questions. In the first example, it was possible to develop a general understanding of the locations, structure and relationships among the contaminants in Manistee Lake sediments in Michigan. EDA was then used to assess whether single contaminants or suites of contaminants were responsible for the toxic effects of the sediments on benthic organisms and then identified the most likely suite and source from all the possible contaminant combinations. Based on these results an “Action Plan” for the stakeholders and a further refined plan for sampling Manistee Lake were developed.

In the second example, EDA was used, in conjunction with a historical and current understanding of the sampled area, to show in the context of litigation that it was not possible for a company to have been the sole or primary cause of metals contamination in residential soils west of the facility.


Note: These references are for Parts 1 and 2 of this series also.

  1. “Part 201 criteria.” Part 213 Tier 1 Risk-Based Screening Levels, of the Administrative Rules for Part 201, Environmental Remediation, Michigan Public Act 451, Natural Resources and Environmental Protection Act, of 1994, as amended.
  2. J. W. Tukey, “Exploratory Data Analysis”, 1977, Addison-Wesley.
  3. DataDesk (http://www.datadescription.com/)
  4. Aabel (http://www.gigawiz.com/)
  5. Schaetzl, R. J., 2004. GEO 333, Geography of Michigan and the Great Lakes Region. http://www.geo.msu.edu/geo333/MIwatershed.html.
  6. Kazmierski, J., Kram M., Mills, E., Phemister, D., Reo, N., Riggs, C., and R. Tefertiller. “Upper Manistee River Watershed Conservation Plan.” Prepared for The Grand Traverse Regional Land Conservancy. M. S. Project. Donna Erickson, Faculty Advisor. University Of Michigan. School of Natural Resources & Environment. April 2002.
  7. Rediske, R.; Gabrosek, J.; Thompson, C.; Bertin; Blunt, J.; and P.G. Meier. Preliminary Investigation of The Extent of Sediment Contamination in Manistee Lake. AWRI Publication # TM-2001-7, Great Lakes National Program Office #985906-01, July 2001.
  8. Velleman, P. F. (1997). DataDesk Version 6.0, Handbook, Volumes 2 and 3. Ithaca, N.Y., Data Description, Inc.
  9. Michigan Background Soil Survey 2005. Hazardous Waste Technical Support Unit, Hazardous Waste Section, Waste and Hazardous Materials Division.
  10. ArcView 9.0 (http://www.esri.com/)
  11. R (http://www.r-project.org/)




This is Part 2 of my series on using exploratory and inferential data analysis (EDA) to solve practical problems in complex environmental scenarios. Part 1 defined EDA. Parts 2 and 3 are examples of its use.

This example addresses how EDA was used on data from Manistee Lake, Michigan. The work was funded by the Little River Band of Ottawa Indians (LRBOI) through a grant from the U. S. EPA. Any references cited will be included in Part 3.

Part 3 of this series should be coming soon. It will focus on a lawsuit against a major manufacturing facility and how, as an expert witness, I used EDA to show that the accused manufacturer could not be held liable for metal contamination of neighborhood soils.

Example 1: Manistee Lake Sediment Toxicity EDA

The overall goals of this project were to:

  • Review available information on the sediments in Manistee Lake. Evaluate these data using statistical exploratory data analysis (“EDA”) to develop a conceptual site model (“CSM”) based on available data to help understand the contaminant issues in the lake.
  • Develop an “inventory and action plan” that assessed the major sources of industrial pollution in Manistee Lake, the major sediment pollutants of concern, and proposed courses of action.
  • Create a presentation to increase local Manistee Lake awareness and begin discussions.
  • Evaluate and design a plan for further sampling of the lake.

A portion of the EDA used to accomplish the first of these goals is presented in this blog posting.

Manistee Lake Setting and Historical Impacts

Manistee Lake is in the Manistee watershed, which encompasses more than 5000 square kilometers (about 1930 square miles, or 1,240,000 acres). The lake is a drowned river mouth, fed by the Manistee River from the northeast and the Little Manistee River from the southeast. A channel connects the lake to Lake Michigan. Flow is generally SE to NW with crossflow across the northern portion from the Manistee River west to the Lake Michigan channel. Manistee Lake itself has an area of about 930 acres and a maximum depth of about 50 feet. Manistee Lake was once a large bay of Lake Michigan; as water levels dropped, the two became separated by sand bars and low dunes.

The history of Manistee Lake includes more than a century and a half of industrial usage. Contamination of the lake bottom sediments is extensive and profound. This has resulted in the near elimination of the natural populations of sediment-dwelling benthic organisms, creating a negative impact on the lake at all trophic levels.

Study Background

This EDA sought to better understand the nature and impacts of some of the sediment contaminants that were quantified in Manistee Lake at 14 sampling locations (Figure 1) in a study by Rediske et al. (2001) and to discern whether additional insights could be gained from the data accumulated in that report for better understanding of the lake’s environment.

Figure 1. Manistee Sampling Locations from Rediske et al.


EDA focused initially on the metal and metalloid contaminants, total organic carbon (TOC), and hexane extractable materials with regard to descriptive statistics, depths in the sediments, concentrations, and sampling locations. Further evaluation added additional contaminants, their concentrations and locations, and potential pollutant sources. Sediment organism mortality percentages and counts were evaluated within this context, with a particular emphasis on trying to identify correlations between organism mortality, individual or multiple contaminants, and industrial activities that might be the source of the contaminants.

A detailed evaluation of the chemicals versus four organism studies using stepwise multiple regression analysis was conducted and the most likely chemical causes of organism mortality were identified. Following an inventory of industrial activity in the vicinity of the lake and an assessment that scored these activities for the “significant contamination factors,” (i.e., the contaminants from the stepwise regression analysis that most impacted the organisms) these contaminants and organism mortality results were plotted versus location along the length of the lake with the highest scoring industrial activities highlighted. A conceptual site model for sediment organism mortality was the result.


The first analysis was to determine whether any of the constituents measured in the sediments were found preferentially at any particular depth(s). This was done by creating both dotplots and box and whiskers plots of each of the constituents versus depth level. Figure 2 displays, as an example, such a box and whiskers plot for mercury (modified by assigning the reporting limit to samples reported at less than that limit, i.e., a conservative approach). The depth level S is for surface samples acquired using a Ponar sediment sampler, whereas T, M, and B refer to top, middle and bottom core sections, respectively.

Boxplots have the following components; more rigorous definitions can be found in Velleman:

  • The outlined central box depicts the middle half of the data between the 25th and the 75th percentiles.
  • The horizontal line across the box marks the median.
  • The whiskers extend from the top and bottom of the box to depict the extent of the main body of the data.
  • In addition, extreme data values are plotted individually, usually with a circle.
  • Very extreme values are plotted with a starburst.

Figure 2. Box and Whiskers Plot for Mercury

In addition to these components, the 95% confidence interval of the median for each group can be depicted with a shaded area. If the shaded areas for two or more groups do not overlap, then there is 95% statistical confidence that their medians are different. Using such plots it was possible to observe that, in general, the contaminants of concern were primarily found in the uppermost lake sediments (Figure 2, S and T), potentially indicating anthropogenic origin.
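The shaded intervals around the medians can be approximated with a widely used rule of thumb, median ± 1.57 × IQR / √n. The Python sketch below uses that rule as an assumption; I am not claiming it is the exact formula that DataDesk applies, and the mercury values and function names are hypothetical.

```python
import math
from statistics import median, quantiles

def median_interval(values):
    """Approximate 95% comparison interval for the median of one group."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(values))
    m = median(values)
    return m - half_width, m + half_width

def intervals_overlap(a, b):
    """True if the median intervals of groups a and b overlap."""
    lo_a, hi_a = median_interval(a)
    lo_b, hi_b = median_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

# Hypothetical mercury concentrations (mg/kg) at two depth levels.
surface = [0.8, 1.1, 1.4, 1.6, 1.9, 2.3, 2.7, 3.0, 3.4]
bottom = [0.05, 0.06, 0.08, 0.10, 0.10, 0.12, 0.15, 0.18, 0.20]
different = not intervals_overlap(surface, bottom)
```

When the two intervals do not overlap, as with these invented values, the medians of the two depth levels would be judged different at roughly the 95% level.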

Linear regressions of the contaminant concentrations versus sample locations in the flow direction along the length of the lake (not shown) were then done. These plots showed that many of the contaminants, particularly in the shallow samples, increased in concentration from the S end to the N end of the lake, along the flowpath. This indicated that the flowing water might be transporting contaminants in a cumulative manner to the sediments as it passed industrial locations along the shoreline.

Rediske et al. performed two types of biological studies. One of these consisted of counting organisms in the collected sediment samples at each location. A value was assigned to the total number of organisms counted for each location and the species were noted and counted. This resulted in two biological values for each surface sampling location that were used in this EDA:

  1. Organisms (total), and
  2. Species #

The second type of biological study was independent laboratory tests of acute toxicity, via exposure to the actual sediments collected at each location, for two types of organisms. The organisms were:

  1. Hyalella azteca (amphipod), and
  2. Chironomus tentans (midge)

Eight replicate toxicity tests were done with each organism for each location. For the EDA the final average number surviving at each location was converted to a percent mortality value.
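The conversion from average survivors to percent mortality is simple arithmetic, sketched below in Python. The number of organisms placed in each replicate is not stated above, so the value of 10 used here is an assumption, as are the survivor counts and the function name.

```python
def percent_mortality(survivors_per_replicate, organisms_per_replicate):
    """Convert replicate survivor counts to a single percent mortality value."""
    mean_survivors = sum(survivors_per_replicate) / len(survivors_per_replicate)
    return 100.0 * (1.0 - mean_survivors / organisms_per_replicate)

# Eight replicates at one location; counts are hypothetical, and the
# starting number of 10 organisms per replicate is an assumption.
survivors = [9, 8, 10, 7, 9, 8, 9, 8]
mortality = percent_mortality(survivors, organisms_per_replicate=10)
```

Here an average of 8.5 survivors out of 10 corresponds to a mortality of 15 percent.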

Boxplots of the organism studies indicated that the sediment control location mortalities (Figure 1, M-1 and M-14) were very different from the others along the length of the lake. By the time one arrives at Locations M-2 and M-13, organism mortality/disappearance is worsened by a range of factors from 1.6 to 5.5. This indicates significant sediment toxicity in the industrialized areas of the lake.

Area charts of the four sets of biological study results are displayed versus sampling location in Figure 3.

Figure 3. Organism Area Plots vs. Sampling Location

The charts for sediment mortality to H. azteca and C. tentans are stacked (Figure 3, top), whereas the magnitudes of the values for Species # and Organisms (total) were too different for stacking without applying a data transformation. The percent mortalities of the two organisms track one another very well and appear to be almost perfect counterparts to the Organisms (total) and Species # counts, with increased percent mortality corresponding very closely with reductions in the numbers and species of organisms natively present in the sediments. These plots illustrate the extreme toxicity of the shallow Manistee Lake sediments just beyond the river mouths (i.e., the control locations) within the lake.

Accounting for the specific environmental factors that contribute to the sudden increase in organism mortality upon entering the lake environment is important for a variety of reasons, including:

  1. The levels of these toxic chemicals need to be reduced to increase biodiversity
  2. Identifying the primary contaminants of concern might also identify the sources, creating an opportunity to control their discharge
  3. Predictions of residence time and fate in the sediments might be possible based on the contaminant chemistry and the geochemistry
  4. Additional sampling with increased focus on the most toxic areas can be planned
  5. A conceptual model of the lake bed and its environment can begin to be developed.

The next part of the EDA plotted the results of these organism studies versus their position along the lake. Figure 4 illustrates the complexity of understanding the specifics of toxicity to these organisms by also depicting stacked area charts of a variety of selected contaminant concentrations.

Figure 4. Organism Mortality vs. Selected=

Virtually every measured contaminant in these surface sediment (Ponar) samples increased dramatically in concentration beyond the river mouth locations (control Locations 1 and 14) within the lake (Locations 2 through 13). The Percent Mortality lines for H. azteca and C. tentans overlying these area graphs (Figure 4) show the extremely close correspondence between the most highly contaminated zones and the toxicity of the sediments to these organisms. Cumulative, rather than individual, toxic contaminant impacts might be implied by these charts, so that possibility was further evaluated.

To determine whether individual toxicities or combined contaminant impacts were responsible, a stepwise multiple regression analysis was the next phase of this EDA. This was done using the program DataDesk to develop a “model” for the impacts of the various contaminants on the organisms. The familiar simple linear regression, y = mx + a, describes the relationship between a response variable such as mortality (y) and a predictor variable such as a contaminant concentration (x). The data are plotted as a scatterplot that shows the datapoints, a regression line (based on the regression equation where m = slope and a = y-axis intercept), and confidence interval boundaries, usually set at 95% confidence for environmental studies. A multiple regression includes more than one predictor variable to try to further account for the response variable values. However, the results of a multiple regression become difficult to visualize. With only two predictor variables, the straight line of the simple regression becomes a flat surface. Each further predictor adds another dimension. Because of this, numerical values must be used to explain the model rather than figures.

Understanding whether predictor variables (such as arsenic, lead, hexane extractables, etc.) are significant and predict response values (such as H. azteca % Mortality, Species #, etc.) requires the interpretation of tables containing t-ratios, probabilities, and R² fit values. A discussion of these values and their interpretation is beyond the scope of this blog but can be found in standard statistics textbooks and the references cited. In summary, the Pearson Product Moment Correlation values for the selected organism study versus all the possible predictors (contaminants) are computed. This yields the residual correlations used in the stepwise regression analysis. The predictor variables are then individually added stepwise (hence the name) to the regression model, each causing its own residual correlation to go to zero and generating a probability. Regression begins with the predictor that has the highest residual correlation. In the case of H. azteca this was arsenic, with a residual correlation of 0.756. Additional contaminant predictors were added to the regression until the t-ratio probability of the last added predictor was greater than 0.05, indicating that that contaminant was not significant to mortality of the organism at the 95% confidence level.
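The selection logic just described can be sketched in a few dozen lines of Python. This is an illustration of forward stepwise selection using residual (partial) correlations, not a reproduction of what DataDesk does internally. Two simplifications are assumptions on my part: the stopping rule compares the t-ratio with a caller-supplied critical value (`t_crit`) instead of computing a probability, and the data are invented so that the result can be checked by hand. All function names are hypothetical.

```python
import math
from statistics import fmean

def center(v):
    m = fmean(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def corr(a, b):
    """Pearson correlation of two already-centered vectors (0 if one is constant)."""
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

def residualize(v, x):
    """Remove from centered vector v the part explained by centered vector x."""
    beta = dot(v, x) / dot(x, x)
    return [vi - beta * xi for vi, xi in zip(v, x)]

def forward_stepwise(y, predictors, t_crit):
    """Add predictors one at a time, highest residual correlation first.

    Stops when the t-ratio of the best remaining predictor falls below t_crit.
    Returns a list of (name, residual correlation, t-ratio) in order of entry.
    """
    resid_y = center(y)
    resid_x = {name: center(vals) for name, vals in predictors.items()}
    n, selected = len(y), []
    while resid_x:
        r = {name: corr(resid_y, x) for name, x in resid_x.items()}
        best = max(r, key=lambda name: abs(r[name]))
        df = n - len(selected) - 2          # degrees of freedom for this step
        if df <= 0:
            break
        unexplained = 1.0 - r[best] ** 2
        t = math.inf if unexplained <= 0 else abs(r[best]) * math.sqrt(df / unexplained)
        if t < t_crit:
            break
        selected.append((best, r[best], t))
        x_best = resid_x.pop(best)
        # Remove the new predictor's contribution from the response and
        # from every remaining candidate before the next step.
        resid_y = residualize(resid_y, x_best)
        resid_x = {name: residualize(x, x_best) for name, x in resid_x.items()}
    return selected

# Invented data for eight locations (not Manistee Lake measurements).
mortality = [79, 77, 61, 63, 37, 39, 23, 21]
contaminants = {
    "As": [30, 30, 30, 30, 10, 10, 10, 10],
    "Hg": [2, 2, 1, 1, 2, 2, 1, 1],
    "Cr": [60, 40, 60, 40, 60, 40, 60, 40],
}
# 2.57 is roughly the two-sided 5% point of Student's t for 5 degrees of freedom.
model = forward_stepwise(mortality, contaminants, t_crit=2.57)
entered = [name for name, _, _ in model]
```

With these invented numbers, As enters first (correlation of about 0.93), Hg enters second once the As contribution has been removed, and Cr is rejected because its residual correlation is zero. Real data are rarely this tidy, and highly correlated predictors can make the order of entry unstable.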

The results of the stepwise multiple regression analyses are provided in Table 1:

Table 1. Summary of Regression Model Results for the Biological Studies

Study                     Significant Predictors             Outlier Locations   Model Fit (R² value)
H. azteca % Mortality     As, Hg, Hexane Extractables, Se    12                  97.7%                   109
C. tentans % Mortality    Hexane Extractables, Cr            13 (PAH)            81.7%                   20.3
Organisms (total)         As, Cr                                                 62.0%                   11.6
Species #                 As                                 12, 13 (PAH)        88.0%                   32.8

The stepwise multiple regression analyses indicated that certain sediment contaminants were well correlated with the loss of organisms in Manistee Lake. These contaminants and their apparent order of importance were:

As > Cr, hexane extractables, PAH > Hg, Se

Following this modeling, facilities along the lake shoreline were “scored” with respect to industrial processes and on-site materials that might generate this particular suite of contaminants. The highest scores were associated with facilities that stored coal or coke in piles along the shoreline of the lake. Figure 5 depicts the locations of these coal storage piles versus these important sediment contaminants from the stepwise regression and plots these versus the organism mortality values.

Figure 5. Coal Pile Locations Relative to Correlated Contaminants and Mortality

The results indicate that coal contamination of the bottom sediments along the shoreline seems to exert a substantial negative impact on the benthic organisms in those areas and immediately downgradient from them.


Several important conclusions resulted from the EDA study of Manistee Lake. These were:

  • The contaminants of concern were primarily found in the uppermost lake sediments, indicative of anthropogenic origin.
  • Many contaminants increased in concentration along the flowpath.
  • Area plots of the results of the sediment organism studies versus sample locations showed very good correspondence among the four biological approaches.
  • Mortality/disappearance worsened by factors of 1.6 to 5.5 immediately within the lake relative to controls.
  • There was excellent correspondence between organism mortality and locations of very high total contamination concentrations.
  • Stepwise multiple regression analyses indicated that certain sediment contaminants were well correlated with the dearth of organisms in Manistee Lake.

EDA elucidated that facilities hosting coal/coke storage piles were most strongly associated with contaminant concentrations and organism mortality, establishing a major portion of a conceptual site model. The results illustrate where cleanup could be focused to increase the populations of benthic organisms in Manistee Lake and increase its biodiversity.


Stay tuned for Part 3, which should conclude this series on EDA.

In this first of a three- or four-part blog posting, I’m going to address how techniques known as exploratory data analysis can be used to achieve a better understanding of scientific data than is usually acquired from typical charts and tables. This first part will be the setup for parts 2 and 3, both of which will include several figures with explanations illustrating specific examples. The examples I’m using are based upon environmental data from my actual projects, but the techniques they illustrate can be used in other branches of science as well.


    In the process of doing environmental science, large amounts of data are usually collected at significant expense. Often these data are used in specific but fairly superficial ways; for example, to determine whether regulatory criteria are exceeded or as “evidence” that property or health has suffered damage. Any remaining information in the data is often not extracted.

    However, there are difficult environmental questions and problems that require a more comprehensive understanding. If you are called upon to determine why there is high organism mortality in a lake, to ascertain whether a client is responsible for contaminating a neighborhood, or to assess whether the concentrations of metals in soil are due to facility operations, natural background levels, or historic fill, tables of analytical data and t-tests are usually neither sufficient nor convincing. Often trends and patterns in environmental data do not readily emerge from tables and numerical statistics, but understanding this hidden information can be crucial for answering such difficult questions and solving recalcitrant problems.

    Exploratory and inferential data analysis (“EDA”) provides rapid and comprehensive approaches for finding patterns and trends in data using statistical methods. Good EDA programs provide rapid visualization of statistical and geostatistical data across multiple variables and allow efficient hypothesis testing to determine whether the trends and patterns are statistically significant. In fact, preformed hypotheses are not necessarily required because hypotheses can emerge from the trends and patterns. In combination with appropriate educational training and professional experience, which are needed to understand and explain the emergent patterns, EDA is a powerful tool for answering environmental questions and finding solutions to environmental problems. The posts to follow will describe and illustrate how EDA was used to process data from actual sites and answer the questions posed above.


    Large quantities of data are typically generated when we assess the natural state of the environment or its responses to anthropogenic activities, and quite often a large portion of the information contained in the data is never accessed. The collected data are generally used in important ways: to assess risks to human health and other organisms, to establish baselines against which environmental changes can be compared, to indicate trends in such changes, to set risk-based concentrations that trigger regulatory action, to support lawsuits for damages believed to be due to environmental conditions, and so on.

    Quite often, the scientists and engineers using the data are looking for something very specific, such as confirmation that the soils and groundwater do not exceed particular regulatory criteria, for example the Michigan Part 201 criteria (“Part 201”). At a more complex level of need, the data user might be trying to decide whether active remediation is needed, whether monitored natural attenuation is occurring at an acceptable rate, whether enough data have been collected to minimize potential data gaps, or how and where to install active systems for remediation of contaminants. I have needed relatively complex data queries to, among other tasks:

    1. Review available information on the sediments in Manistee Lake and determine why the sediments in the lake are essentially devoid of benthic organisms in certain areas,
    2. Resolve whether a facility is responsible for widespread contamination of an area when the facility is being sued for that contamination, and
    3. Spatially and volumetrically discriminate between natural soil constituents, industrial fill materials and waste-contaminated soils at old industrial sites.

    Exploratory and inferential data analysis can often achieve the goals and requirements of understanding complex scenarios by allowing the collected data to be manipulated in a variety of ways to achieve quick testing of hypotheses and yield insight into potential alternative hypotheses through visualization of the data. EDA can provide rapid and comprehensive approaches for finding patterns and trends in data using spatial and statistical methods. In the process of developing informative graphical displays of the data, EDA can generate numerical values for the appropriate statistics to indicate whether or not the relationships among the different datasets are significant. Conceptual site models (“CSM”) can be developed using EDA and these are generally much easier for clients and regulators to understand than simple tables or maps of sample locations that contain lists of contaminant concentrations. This is particularly true for complex sites or when asking complex questions about a site.

    The body of this post consists primarily of describing subsets of the EDA done on two real-world examples, numbers one and two above, to illustrate how an EDA approach can help scientists understand and resolve difficult or seemingly intractable issues.


    J.W. Tukey coined the term “exploratory data analysis” in 1977. He defined and described EDA as follows:

    “…the examination of data with minimal preconceptions about its structure through which it is hoped that relationships and patterns, at least some of which are unanticipated, will be uncovered.”

    “The principal theme is flexibility of technique. Good exploration requires many and varied analyses of the same data, and a certain amount of trial and error is expected. Creativity in approach may, therefore, be crucial. Precision and efficiency in any particular analysis are not nearly as critical as robustness and convenience.”

    “A secondary theme is that structures and patterns in data fall into two broad classes: the obvious and the surprising. The techniques discussed can be correspondingly divided into those designed to display clearly and simply the first level of structure and those intended to look beyond those relations to uncover the unexpected features underlying what confirmatory analysis might dismiss as residuals or errors.”

    When Tukey developed EDA, virtually all aspects of the exploration had to be calculated manually. In the interim, many computer programs have been developed that leverage his ideas and allow the rapid exploration and analysis of data. Most of the data explorations, statistical charts, tables, and other displays and values presented in this manuscript were generated using either DataDesk or Aabel, programs that follow Tukey’s approach to examining data to the extent possible.

    OK. That’s it for Part 1. The first example (Part 2) will be about an EDA investigation that sought to better understand the nature and biological impacts of sediment contaminants that had been found in Manistee Lake, Michigan. Part 3 will focus on the use of EDA in a lawsuit where a large company was being sued in federal court for air and soil contamination and I was the expert witness for the defendant.