
This is Part 3 of my series on using exploratory and inferential data analysis (“EDA”) to solve practical problems in complex environmental scenarios. Part 1 defined EDA and Part 2 was an example of using EDA to assess sediment toxicity in a lake. This second example addresses a lawsuit against a major manufacturing facility and how, as an expert witness, I used EDA to show that the accused manufacturer could not be held liable for metal contamination of neighborhood soils.

Example 2: Big Z Corporation Litigation

This EDA addressed a lawsuit where a company was being sued for air and soil contamination. Although a real circumstance, the specifics of this example have been modified to maintain client anonymity.

Big Z Corporation (“Big Z”) was being sued by a regional authority (“RA”) and accused of contaminating soils in its vicinity with zinc (“Zn”) via air deposition over a period of years. To press the suit, the RA collected soil samples from a large number of locations outside the boundaries of Big Z and had them analyzed for Zn and certain other metals. Because Big Z was assumed to be the largest user of Zn in the area, the finding of high Zn concentrations in many of the soil samples was sufficient for the RA to blame Big Z and incorporate this claim into the lawsuit, using the soil samples as evidence.

As the soil chemistry expert for Big Z and their attorneys, I performed EDA on soil concentration data obtained from the RA through the lawsuit to determine whether these data contained information indicating that Big Z was, or was not, the likely source of the metal contamination. The analyses presented here are a subset of the EDA that was done, but they were its central components.


The soil data were subjected to a variety of comparative, statistical and geographical information system (“GIS”) mapping procedures. This was done to assess whether sufficient information was contained in the analyses and concentration distributions to point to Big Z as the source or whether other sources might be responsible. Other sources could include natural soil mineralogy, historic activities, or other facilities. Initial procedures included comparison of the soil metal concentrations to Michigan’s regulatory requirements, the Part 201 Residential and Commercial 1 Generic Cleanup Criteria and Screening Levels, and to those concentrations found by the Michigan Background Soil Survey 2005.

Statistical evaluation procedures for the soil samples involved a range of analyses including simple descriptive statistics of individual analyte concentrations (mean, median, range, skewness, etc.) along with spatial, trend and regression analyses in an attempt to determine whether any information could be gleaned from these soil data that might indicate the metal sources. Tabulations of the metal analyte concentration data, comparisons of the concentrations to Part 201 criteria and background soil concentrations, as well as simple descriptive statistics and GIS maps depicting the concentration distributions of these metals across the sampled areas (with concentrations color-coded relative to Part 201 criteria), were compiled.
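
As a rough illustration of the simple descriptive statistics mentioned above, the sketch below computes the mean, median, range, and a moment-based skewness for one analyte. The concentration values and the function name `describe` are hypothetical and are not taken from the case data; the original work was done in Aabel, R, and DataDesk, not in Python.

```python
import statistics

def describe(values):
    """Basic descriptive statistics for one analyte (illustrative only)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Second and third central moments, used for the moment-based skewness.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "skewness": m3 / m2 ** 1.5,
    }

# Hypothetical Zn concentrations (mg/kg), NOT the litigation data.
zn_mg_kg = [45.0, 52.0, 60.0, 75.0, 88.0, 110.0, 140.0, 900.0]
summary = describe(zn_mg_kg)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A single high value pulls the mean well above the median and gives a positive skewness, which is the usual picture for soil metal data and one reason medians and log-transformed values are examined alongside means.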

To develop a context for the EDA results, it was important to look at the industrial and urban uses of Zn, because Big Z and the surrounding soils are located in an area that has been highly industrialized for more than a century. In that regard, the specific industrial and commercial activities that occurred within and surrounding that area were also investigated, providing additional context for the soils data and the results of the EDA. Investigations such as these are useful and important because they can supplement and support the EDA results. The results of these preliminary investigations showed that Zn had been widely used and disseminated for more than a century in the area and that the natural soil background in the area had higher than average Zn concentrations, as did much of the industrial fill material that had been used to level numerous properties across the region.

Sample concentration and location data were input into either a Microsoft Excel spreadsheet or the EDA program Aabel (Gigawiz.com) and transposed, coded or reorganized when necessary for GIS, data plotting or statistical analyses. Certain GIS displays of the concentration versus location data were done in the program ArcView 9.0 (not displayed here). Most of the statistical data analyses and visualizations were done in Aabel, although certain of the data were also evaluated using the R statistical program and some confirmatory statistics were carried out using the DataDesk program.

GIS displays did not sufficiently clarify the distribution of the Zn in the sampled areas relative to Big Z. To determine whether patterns of soil deposition in relation to potential sources might emerge using other approaches, transects were developed along lines of samples to establish whether there was any directionality to the distribution of the metals. Zn was chosen because it is the metal most expected to point to Big Z if this facility were the sole or even a primary source of Zn in the soils.

For evaluating the transect analyses, the following hypotheses were posed:

  1. A positive likelihood of Big Z being the source could be shown by a general increase in Zn concentrations as transects of soil samples directionally approached the Big Z property boundary (i.e., consistently increasing soil concentrations with approach to the facility); this result would require a plurality of transects showing this pattern.
  2. A negative likelihood of Big Z being the source could be shown by decreases in, or randomness of, Zn concentrations as transects of soil samples directionally approached the Big Z property boundary (i.e., decreasing or random soil concentrations with approach to the facility).
  3. A pattern of high soil concentrations at a generally consistent distance along the transects from the Big Z property boundaries might imply Zn sourcing by the Big Z facility (maximum soil deposition at a fairly consistent distance from the facility).
  4. A random relationship in soil Zn concentrations, with respect to Big Z boundaries, increases the likelihood of numerous, possibly localized, sources for the Zn.

Following the transect analyses, an analysis of spatial concentration “zones” with distance from the Big Z property boundaries was done; the hypotheses for this analysis were essentially the same as for the transect analyses.


Figure 6 provides an overview for the transect analyses, showing all Zn data in the vicinity of Big Z. This plot, created in the program Aabel, depicts:

  • Big Z property boundaries (light blue area)
  • Soil sample locations (all westerly of Big Z)
  • Transects across the soil locations
  • The general location of the Big Z operations
  • Other potential stationary industrial emission sources (by facility code)
  • Color-filled contours for the Zn concentrations to show areas of higher Zn concentration versus lower.

Figure 6. Sampling Locations, Transect Overview, and Zn Concentrations

The transect overview in Figure 6 also identifies sampling points that were high statistical outliers, assuming a lognormal distribution, at the 95% upper confidence level for one or more metals (i.e., they were outside the expected concentration range based on the complete body of data at 95% confidence). These points, which could potentially be called “hotspots,” are distinguished by marker type:

  • A small yellow-tan circle indicates that no constituents were outliers relative to the body of the data at that sampling location.
  • A circular red “beach-ball” marker indicates a sampling location was an outlier for a metal other than Zn.
  • A hexagonal black marker indicates the sample location was an outlier only for Zn.
  • A hexagonal red marker indicates that the sample location was an outlier for Zn and at least one of the other metals.
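
The exact outlier procedure used in Aabel is not described here, so the following is only a minimal sketch of one plausible approach: log-transform the concentrations, then flag any value above the mean plus about 1.645 standard deviations of the logs (a one-sided 95% limit under a lognormal assumption). The data values and the function name are hypothetical.

```python
import math
import statistics

def lognormal_high_outliers(values, level=0.95):
    """Flag high values under an assumed lognormal distribution (a sketch)."""
    logs = [math.log(v) for v in values]
    mu = statistics.fmean(logs)
    sigma = statistics.stdev(logs)  # sample standard deviation of the logs
    z = statistics.NormalDist().inv_cdf(level)  # about 1.645 for level=0.95
    cutoff = math.exp(mu + z * sigma)  # back-transform to concentration units
    flagged = [v for v in values if v > cutoff]
    return cutoff, flagged

# Hypothetical Zn concentrations (mg/kg), NOT the litigation data.
zn = [40, 55, 60, 62, 70, 75, 80, 90, 95, 2500]
cutoff, flagged = lognormal_high_outliers(zn)
print(f"cutoff = {cutoff:.0f} mg/kg, flagged = {flagged}")
```

Running the same check separately for each metal would give the per-location status (no outliers, Zn only, other metals only, or both) that the marker types encode.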

The transects in Figure 6 are shown as orange lines or polygons. The distal end of each transect line/polygon (furthest from the Big Z facility) is labeled with a capital letter and a prime symbol ( ’ ), and the proximal end (closest to Big Z) with just a plain capital letter; for example, “A’ – A” signifies a transect. A total of nine transects were done on each dataset and labeled A’ – A through I’ – I. Each transect was subjected to regression analysis (Zn concentration versus distance from the Big Z property boundary) as well as individual visualization using either bubble charts or spatial bar charts. An example is provided in Figure 7 for transect A’ – A.


Figure 7. A' - A Zinc Concentration Transect

A review of all the transect figures displaying linear regressions of the concentration data, such as Figure 7, showed that the r² values were generally quite low (seven of the nine transects in Table 2 have r² below 0.16), indicating that linear distance/direction along the transect line toward Big Z does not account well for the concentrations of Zn. The slopes of the regression lines are also generally not very large in either direction, while the scatter of the points about those lines is relatively large, supporting the r² determinations and the general lack of trending in concentration versus distance from the Big Z property.
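
For readers who want to see what a per-transect regression involves, here is a minimal least-squares sketch that returns the slope, intercept, and r² for one transect. The distances and concentrations are invented for illustration, and the sign of the slope depends entirely on how the distance axis is defined, which is an assumption in this sketch.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = intercept + slope * x, plus r-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical transect: position along the transect (km) and Zn (mg/kg).
position_km = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
zn_mg_kg = [120, 300, 90, 260, 110, 240]
slope, intercept, r_squared = linear_fit(position_km, zn_mg_kg)
print(f"slope = {slope:.1f}, r2 = {r_squared:.3f}")
```

In this made-up case the scatter swamps the trend, so r² is close to zero even though the fitted slope is not zero, which is the same kind of result described for most of the real transects.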

Table 2 summarizes the r² and slope values for each transect and states whether the slopes showed Zn concentrations increasing toward or away from the property. The table shows that there were similar numbers of transects with slopes in either direction (five away, four toward). This tends to support the rather dispersed, random-appearing hotspots and colored contours of Figure 6. These results also support Hypotheses 2 and 4 (above): randomness of Zn concentrations relative to the Big Z property boundaries and the likelihood of numerous localized Zn sources.

Table 2. Transect Regression Data for Zinc

Transect   Regression Fit (r²)   Slope        Zn Concentration Slope Relative to Big Z Property
A’-A       7.30E-02              -5.37E+02    Away
B’-B       2.60E-02               6.89E+02    Toward
C’-C       6.06E-04              -3.60E+01    Away
D’-D       3.06E-01              -3.81E+03    Away
E’-E       7.20E-01               3.21E+03    Toward
F’-F       1.18E-06              -2.33E+00    Away
G’-G       8.21E-02               1.62E+02    Toward
H’-H       3.17E-04              -6.57E+01    Away
I’-I       1.53E-01               5.19E+02    Toward
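
The tallies drawn from Table 2 can be checked with a few lines of code. The values below are copied from the table; the only assumptions are the dictionary layout and the r² threshold of 0.16, which is an arbitrary cut chosen here just to separate the weak fits from the two stronger ones.

```python
import statistics

# (r2, slope) for each transect, copied from Table 2.
table2 = {
    "A'-A": (7.30e-02, -5.37e+02),
    "B'-B": (2.60e-02, 6.89e+02),
    "C'-C": (6.06e-04, -3.60e+01),
    "D'-D": (3.06e-01, -3.81e+03),
    "E'-E": (7.20e-01, 3.21e+03),
    "F'-F": (1.18e-06, -2.33e+00),
    "G'-G": (8.21e-02, 1.62e+02),
    "H'-H": (3.17e-04, -6.57e+01),
    "I'-I": (1.53e-01, 5.19e+02),
}

# In Table 2, positive slopes are labeled "Toward" and negative slopes "Away".
toward = [name for name, (r2, slope) in table2.items() if slope > 0]
away = [name for name, (r2, slope) in table2.items() if slope < 0]
weak_fits = [name for name, (r2, slope) in table2.items() if r2 < 0.16]
median_r2 = statistics.median(r2 for r2, slope in table2.values())

print(f"toward: {len(toward)}, away: {len(away)}")
print(f"transects with r2 < 0.16: {len(weak_fits)} of {len(table2)}")
print(f"median r2: {median_r2}")
```

The split of four toward and five away, together with a median r² of about 0.07, is what one would expect if distance from the property had little to do with the Zn concentrations.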

As an additional check on the spatial relationship of the Zn concentrations in the soil samples relative to the Big Z boundaries, the Zn data were separated into five spatially equivalent zones corresponding to surface areas replicated using the approximate shape of the Big Z boundary, extended westward across the soil sampling space. These zones were numbered one through five and are depicted in Figure 8.

Figure 8. Spatial Zones for Zinc Analysis

All the soil samples within a zone were selected as a group and the Zn concentrations within each zone’s group were compared to all the other zones, and the total of all the zones, using statistical methods. The comparisons consisted of one-way analysis of variance (ANOVA) combined with diamond plots, to compare the means, and box and whiskers plots to compare the medians. The ANOVA and mean tables along with these plots are provided in Figure 9.
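
To make the one-way ANOVA step concrete, the sketch below computes the F statistic by hand for five groups. The zone values are invented so that the zone means are nearly equal; they are not the case data, and the function name is hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual samples around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical Zn concentrations (mg/kg) for zones 1 through 5.
zones = [
    [80, 120, 150, 250],
    [90, 110, 160, 244],
    [70, 130, 155, 241],
    [85, 125, 145, 249],
    [75, 115, 165, 245],
]
f_stat, df_b, df_w = one_way_anova_f(zones)
print(f"F({df_b}, {df_w}) = {f_stat:.5f}")
```

An F value far below 1 means the zone means differ by less than the scatter within zones would produce by chance. Turning F into the “P > F” probability requires the F distribution, which a statistics package supplies (for example, `scipy.stats.f_oneway` in Python returns both the statistic and the p-value).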

Figure 9. Comparisons of Zone Means and Medians for Zinc

The diamonds in the “diamond means comparison plots” (left, Figure 9) show the concentration mean (central line in each diamond) and the 95% confidence interval of that mean (tips of the diamond) for each zone, relative to one another and to the grand mean of all zones combined (dashed line across the entire plot). When the diamonds overlap at all, the means of those zones cannot be considered different with 95% confidence; when they do not overlap, the means can be considered different. In this case, the means for all five zones are almost exactly equal to one another and to the grand mean, and the diamonds fully overlap.
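
The overlap rule for the diamonds can be sketched as a comparison of confidence intervals. This version uses a large-sample normal approximation (a multiplier of about 1.96) rather than the t-based interval a statistics package would normally use, so it is an approximation under that stated assumption; the zone values are hypothetical.

```python
import statistics

def mean_ci(values, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / n ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96
    return mean - z * std_err, mean + z * std_err

def intervals_overlap(a, b):
    """True if closed intervals a=(low, high) and b=(low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical Zn concentrations (mg/kg) for two zones.
zone_1 = [80, 120, 150, 250]
zone_5 = [75, 115, 165, 245]
ci_1 = mean_ci(zone_1)
ci_5 = mean_ci(zone_5)
print(ci_1, ci_5, intervals_overlap(ci_1, ci_5))
```

For small groups a t-based multiplier would widen the intervals somewhat, which only makes overlap more likely.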

The same holds true for the medians (straight connected lines) in the box and whiskers plot of Figure 9 (right plot), with very small differences in their values and overlap of the 95% confidence “notches” surrounding the medians. The overlap of the notches indicates that the medians cannot be considered different with 95% confidence. These statistical plots indicate no significant differences in the means or the medians of the five zones as one progresses westward away from the Big Z property boundary. The plots are fully supported by both the mean and ANOVA tables in Figure 9. The means, standard deviations, and both lower and upper 95% confidence limits are almost identical in the means table, while the ANOVA table yields a probability (column “P > F”) indicating that none of the means differs significantly from the others. In other words, the average amount of Zn in the soils is statistically indistinguishable, whatever the distance from Big Z.
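
How Aabel computes its notches is not stated here, so the sketch below uses a commonly used convention as an assumption: the notch spans the median plus or minus 1.57 times the interquartile range divided by the square root of the sample size. The data and function name are hypothetical.

```python
import statistics

def median_notch(values, factor=1.57):
    """Approximate 95% notch around the median: median +/- factor*IQR/sqrt(n)."""
    n = len(values)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = factor * (q3 - q1) / n ** 0.5
    return median - half_width, median + half_width

# Hypothetical Zn concentrations (mg/kg) for two zones.
zone_a = [60, 75, 90, 110, 130, 150, 170, 200, 400]
zone_b = [55, 80, 95, 105, 125, 155, 175, 210, 380]
notch_a = median_notch(zone_a)
notch_b = median_notch(zone_b)
overlap = notch_a[0] <= notch_b[1] and notch_b[0] <= notch_a[1]
print(notch_a, notch_b, overlap)
```

Because the notch is built from the median and the interquartile range, a handful of very high samples in a zone has little effect on it, which makes it a useful companion to the mean-based diamonds.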

To further eliminate any doubts or questions about the equivalency of the Zn means and overall zone concentrations, two additional modifications were made to the data for further comparison (figures not shown): (1) The natural log (ln) values of the Zn concentrations within each zone were similarly evaluated and there were still no significant differences between the means and medians. (2) Likewise, all potential statistical Zn high outliers were removed from the evaluations with the same result; no significant differences between means or medians.

These concentration zone results, particularly when combined with the transect analyses, show no discernible spatial relationship between Zn soil concentrations and Big Z, and make it virtually impossible that the source is solely Big Z activities. It would be difficult or impossible to envision a scenario in which the mean soil concentrations of Zn would be statistically the same at every zonal distance from a facility in an urban area if that facility were to blame for the Zn in those soils. The results are far more indicative of long-term widespread deposition from numerous sources, combined with high native Zn topsoil content and with many instances of localized inputs (hotspots) that account for the high Zn values that were found in a few of the samples.

In summary, the EDA of the Zn soil data seems to indicate that Big Z was not predominantly responsible for the Zn present in the soils within the area.


Using the techniques of exploratory data analysis, particularly with current software packages, can be a rapid and effective way to formulate and test hypotheses using real-world data to achieve the goals and requirements of understanding site scenarios. It allows insight into the relationships among the data components through visualization of the data and elucidation of their patterns, trends and associated statistics. EDA allows the development of conceptual site models that are reality-based and are generally easy for clients and regulators to understand. It is particularly helpful for complex sites or when trying to answer complex questions about a site.

I have presented two examples of how EDA was used to resolve specific and important environmental questions. In the first example, it was possible to develop a general understanding of the locations, structure and relationships among the contaminants in Manistee Lake sediments in Michigan. EDA was then used to assess whether single contaminants or suites of contaminants were responsible for the toxic effects of the sediments on benthic organisms, and to identify the most likely suite and source from all the possible contaminant combinations. Based on these results, an “Action Plan” for the stakeholders and a further refined plan for sampling Manistee Lake were developed.

In the second example, EDA was used, in conjunction with a historical and current understanding of the sampled area, to show in the context of litigation that it was not possible for a company to have been the sole or primary cause of metals contamination in residential soils west of the facility.


Note: These references are for Parts 1 and 2 of this series also.

  1. “Part 201 criteria.” Part 213 Tier 1 Risk-Based Screening Levels, of the Administrative Rules for Part 201, Environmental Remediation, Michigan Public Act 451, Natural Resources and Environmental Protection Act, of 1994, as amended.
  2. J. W. Tukey, “Exploratory Data Analysis”, 1977, Addison-Wesley.
  3. DataDesk (http://www.datadescription.com/)
  4. Aabel (http://www.gigawiz.com/)
  5. Schaetzl, R. J., 2004. GEO 333, Geography of Michigan and the Great Lakes Region. http://www.geo.msu.edu/geo333/MIwatershed.html.
  6. Kazmierski, J., Kram M., Mills, E., Phemister, D., Reo, N., Riggs, C., and R. Tefertiller. “Upper Manistee River Watershed Conservation Plan.” Prepared for The Grand Traverse Regional Land Conservancy. M. S. Project. Donna Erickson, Faculty Advisor. University Of Michigan. School of Natural Resources & Environment. April 2002.
  7. Rediske, R.; Gabrosek, J.; Thompson, C.; Bertin; Blunt, J.; and P.G. Meier. Preliminary Investigation of The Extent of Sediment Contamination in Manistee Lake. AWRI Publication # TM-2001-7, Great Lakes National Program Office #985906-01, July 2001.
  8. Velleman, P. F. (1997). DataDesk Version 6.0, Handbook, Volumes 2 and 3. Ithaca, N.Y., Data Description, Inc.
  9. Michigan Background Soil Survey 2005. Hazardous Waste Technical Support Unit, Hazardous Waste Section, Waste and Hazardous Materials Division.
  10. ArcView 9.0 (http://www.esri.com/)
  11. R (http://www.r-project.org/)



