How to Analyze Like a Citizen Data Scientist in Flint
If you follow the news in the United States, then you’ve heard that there’s a water crisis in Flint, Michigan. Although debate will continue about how large a role ethics played in the data collection practices, it’s worthwhile to at least be ready to perform the correct analysis on the data when you have it. Here’s how you can use Minitab to be like a citizen data scientist in Flint, and see for yourself what the data indicate.
Let’s start with the Environmental Protection Agency’s (EPA) Lead and Copper Rule. The EPA says that a water system needs to act when “lead concentrations exceed an action level of 15 ppb” in more than 10% of samples. The value that separates the highest 10% of the samples from the rest is called the 90th percentile, so the rule amounts to checking whether the 90th percentile exceeds 15 ppb.
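To make the rule concrete, here is a minimal Python sketch of the "more than 10% of samples above 15 ppb" check. The function names and the ten readings are invented for illustration; they are not part of the EPA rule or the Flint data.

```python
ACTION_LEVEL_PPB = 15.0  # action level quoted from the Lead and Copper Rule
MAX_FRACTION = 0.10      # "more than 10% of samples"


def fraction_above(samples_ppb, level=ACTION_LEVEL_PPB):
    """Return the share of samples whose lead concentration exceeds `level`."""
    if not samples_ppb:
        raise ValueError("need at least one sample")
    return sum(1 for x in samples_ppb if x > level) / len(samples_ppb)


def needs_action(samples_ppb, level=ACTION_LEVEL_PPB, max_fraction=MAX_FRACTION):
    """True when more than `max_fraction` of the samples exceed `level`."""
    return fraction_above(samples_ppb, level) > max_fraction


# Hypothetical readings in ppb, invented for illustration (not the Flint data).
example = [1.2, 3.4, 0.8, 22.0, 5.1, 2.9, 40.3, 7.7, 1.0, 4.4]

print(fraction_above(example))  # 2 of the 10 readings exceed 15 ppb
print(needs_action(example))
```

Note that exactly 10% of samples above the level does not trigger the rule in this sketch, because the wording is "more than 10%".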
The applicable Code of Federal Regulations (CFR) does not prescribe a random sample to characterize the entire water system. Instead, the CFR suggests that those who administer the water system should select sampling sites based on the likelihood of contamination. In particular, those who administer the system should prefer sampling sites that meet one or both of these criteria:
(i) Contain copper pipes with lead solder installed after 1982 or contain lead pipes; and/or
(ii) Are served by a lead service line.
Clearly, we are not dealing with a random sample—that's because the goal is not to characterize the entire system, but to better understand the worst contamination risks. In this context we're characterizing only the sites that we sample, which we suspect contain the highest lead results in the system. The CFR suggests taking samples from at least 60 sites for a system the size of Flint’s.
The data we’ll work with were collected through an effort organized by an independent research team at Virginia Tech. The data contain 271 samples from 269 different locations, which exceeds the minimum recommended sample size. Because we’re looking for the 90th percentile, what we do isn’t very different from counting down 271/10 ≈ 27 data points from the maximum. The CFR references the use of “first draw” tap samples, so we’ll pay attention to that column in the Virginia Tech data.
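The "count down about 27 points" shortcut can be checked with a little arithmetic. The sketch below assumes the (n + 1)p position rule, which is one of several common percentile conventions; your software may use a different one. The function name is hypothetical.

```python
import math


def ranks_from_top(n, p):
    """Return the two ranks, counted down from the maximum, that bracket
    the p-th quantile under the (n + 1) * p position rule (an assumption)."""
    position = p * (n + 1)             # position counted up from the minimum
    lower = math.floor(position)       # rank just below the position
    upper = lower + 1                  # rank just above the position
    return n - lower + 1, n - upper + 1


# 271 samples, 90th percentile: the position is 0.9 * 272, about 244.8,
# so the percentile lies between the 244th and 245th smallest values.
print(ranks_from_top(271, 0.9))
```

Counted from the top, those are the 28th and 27th largest values, which agrees with the rough 271/10 ≈ 27 estimate.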
A Quick Calculation of the 90th Percentile
Once the data’s in Minitab Statistical Software, the fastest way to calculate the 90th percentile is with Minitab’s calculator. Try this:
- Choose Calc > Calculator.
- In Store result in variable, enter 90th percentile.
- In Expression, enter percentile('PB Bottle 1 (ppb) – First Draw', 0.9). Click OK.
Minitab stores the value 26.944. Because this value is greater than the action level of 15 ppb, you are now ready to make strongly worded statements urging people to take measures to protect themselves from lead exposure.
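If you don't have Minitab handy, the same kind of calculation can be sketched in Python with the standard library. The readings below are invented for illustration and are not the Virginia Tech data. The "exclusive" method places the percentile at position p(n + 1); other software may interpolate differently, so results on real data could differ slightly from one tool to another.

```python
import statistics

ACTION_LEVEL_PPB = 15.0

# Hypothetical first-draw readings in ppb (19 invented values, not the Flint data).
readings = [
    0.5, 1.1, 1.8, 2.0, 2.3, 2.9, 3.4, 3.6, 4.0, 4.8,
    5.5, 6.1, 7.2, 8.9, 10.4, 12.7, 16.3, 31.0, 158.0,
]

# quantiles(..., n=10) returns the nine cut points between deciles;
# the last one is the 90th percentile.
p90 = statistics.quantiles(readings, n=10, method="exclusive")[-1]

print(p90)
print(p90 > ACTION_LEVEL_PPB)
```

With 19 values, the position is 0.9 × 20 = 18, so the 90th percentile here is simply the 18th smallest reading.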
Communicating the 90th Percentile on a Graph
But if you’re really going to communicate your results, it’s nice to have a graph available. A simple bar chart might do.
However, you can show the data in more detail with a histogram.
- Choose Graph > Histogram.
- Select Simple. Click OK.
- In Graph variables, enter 'PB Bottle 1 (ppb) – First Draw'.
- Click Scale.
- Select the Reference Lines tab.
- In Show reference lines at data values, enter 15 26.9. Click OK twice.
Histograms divide the sample values into intervals called bins. The height of each bar represents the number of observations in that bin: the taller the bar, the more observations in that interval. The reference lines on the graph show the action level of 15 ppb and the actual value of the 90th percentile. Because the 90th percentile falls to the right of the action level, the graph shows that the action level is exceeded.
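To see what the bins are doing, here is a small Python sketch that counts observations per interval and prints a rough text histogram. The bin width, the function name, and the readings are all assumptions made for illustration; Minitab chooses its own bins.

```python
def bin_counts(values, bin_width, start=0.0):
    """Count how many values fall in each [left, left + bin_width) interval.

    Returns a dict mapping each occupied bin's left edge to its count.
    """
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)  # which bin the value lands in
        left = start + index * bin_width       # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical readings in ppb, invented for illustration (not the Flint data).
example = [1.2, 3.4, 0.8, 22.0, 5.1, 2.9, 40.3, 7.7, 1.0, 4.4]
counts = bin_counts(example, bin_width=10.0)

# One row per occupied bin; the row of # marks plays the role of the bar.
for left, count in counts.items():
    print(f"{left:5.0f} to {left + 10.0:<5.0f} {'#' * count}")
```

Most of the invented readings land in the lowest bin, with a few far to the right, which is the kind of long right tail that makes a reference line at the action level useful.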
Gather Your Data
In April of 2015, then-mayor of Flint Dayne Walling reported that he and his family “drink and use the Flint water everyday, at home, work, and schools.” It’s easy for me to believe that the mayor’s personal experience with water that was not dangerous affected his judgment about the situation. The zip code for the mayor’s office in Flint is 48502. The news bureau for WNEM TV 5, one place where Mayor Walling drank tap water on TV, is in the same zip code. The citizen data scientists who analyzed the Flint data knew that the geographically limited sample being shown on TV and Twitter wasn't good enough. Instead, they collected data from 269 different locations around Flint and found that lead was a serious problem.
Of course, collecting that data was no small task: the data scientists estimate that gathering, preparing, and analyzing water samples ended up costing about $180,000, not including volunteer labor. If you’d like to donate towards offsetting the costs and future efforts, check out the Flint Water Study Research Support Fundraiser.
If you’d like to support residents in Flint, consider volunteering for or contributing to the United Way of Genesee County’s Flint Water Fund, which “has sourced more than 11,000 filters systems and 5,000 replacement filters, ongoing sources of bottled water to the Food Bank of Eastern Michigan and also supports a dedicated driver for daily distribution.”
The attention brought to Flint has called into question the water testing done in other municipalities in the United States. If you’re concerned about the potential for lead in your own water, the EPA notes that lead testing kits that can be sent to laboratories for analysis are available in home improvement stores.
The citation for the referenced data set is: FlintWaterStudy.org (2015) “Lead Results from Tap Water Sampling in Flint, MI during the Flint Water Crisis.” This link provides the data as a Minitab worksheet: lead_results_from_tap_water_sampling_in_flint__mi_during_the_flint_water_crisis.MTW