The Single Most Important Question In Every Statistical Analysis

Can you really trust your data? As scientists at OPERA demonstrated, it can be all too easy to just assume that even the best measurement tools are behaving the way they should.
Photo credit: OPERA Collaboration

Quality improvement is no cakewalk. Doing Lean Six Sigma projects and proving your results with data analysis and reliable statistics is no small task. Every day you're confronted with critical questions. From working with the stakeholders and process owners, to identifying possible solutions, to gathering the data, to selecting the statistical methods you will use to assess your results, to reporting those results in a way that everyone can clearly see, you face a raft of decisions that have serious consequences. 

But if you want to think about a really tough day at work, imagine the conversations that must have taken place recently among colleagues at world-renowned scientific institutions in Italy and Switzerland.

"Ph.D. in particle physics?"  
"Check." 

"Highly technical research project that pushes the very frontiers of scientific knowledge?"  
"Check." 

"Colleagues from 36 prominent institutions in 13 countries?"  
"Check." 

"Experimental data gathered and analyzed?" 
"Check." 

"Results that contradict established laws of physics and garner worldwide publicity?" 
"Oh yeah, baby—check and mate!"

"And you remembered to plug thingamabob A securely into whatchahoosit B, right?"  
"....Uh-oh..."  

The Single Most Important Question In Every Statistical Analysis. Ever. 

When scientists at the OPERA Collaboration, an international group of physicists studying the properties of subatomic particles called neutrinos, reported in September 2011 that they had detected faster-than-light particles, the science world exploded. The findings, based on three years of research, upset assumptions from which much of modern science derives. They contradicted Albert Einstein's special theory of relativity, which posits that light in a vacuum travels faster than anything else in the universe. Anything.

Naturally, many in the scientific community were, to put it politely, skeptical. In fact, even the scientists who prepared the research were dumbfounded by their results. But they went over the results again and again, and the conclusions appeared sound: they'd clocked neutrinos arriving about 60 nanoseconds sooner than light would have. Were Einstein alive today, it seemed, he would have some explaining to do.

Or maybe not. 

It turns out these surprising findings may have been a result of failing to thoroughly answer the first, and most important, question in any type of data analysis. 

Never Accept the Results of an Analysis Unless...

When I heard about this story, I recalled my very first day of training on how to use Minitab Statistical Software. I still have my notes. Our trainer walked in, introduced himself, and then announced that he would tell us five words that would keep us from making the worst, most embarrassing, and potentially most harmful statistical errors. "Write this question down, and use it every time you perform an analysis," he said.

Then he paused. I had my pen poised, and dutifully wrote down, in big letters, what he said next:

"Can I trust my data?" 

I felt a little cheated.  "Can I trust my data?"  Huh. That's it?  I mean, of course you need to trust your data! If you don't trust it, why analyze it, right? 

But as our trainer continued, I quickly saw that this question's deceptively simple wording opens the door to a very complex set of issues that absolutely, positively have to be addressed if you're going to rely on statistics to help understand and improve processes. 

The question "Can I trust my data?" is so important because we humans have a tendency to take things for granted. And as the scientists involved in OPERA have reminded us, it's very easy to either forget to ask yourself this most important question, or to answer it too quickly, and overconfidently. 

Can You Trust Your Data? 

None of this is meant to denigrate the scientists at OPERA; they are very smart people who happened to make an incredibly common mistake in a very public way.

The take-home lesson for all of us is to never just assume that your data collection methods are okay.  To be sure you can trust your data, you need to prove it.

The factors that will affect the reliability of your data will vary depending on what you're studying and how you're analyzing it. But in general, you will always need to: 

  1. Carefully plan how, when, and what you will measure, and identify who will do the measuring
     
  2. Assess your measurement system and characterize its performance
     
  3. Minimize variation in your measurement system before collecting important data
     
  4. Check the integrity of the data you've collected, and make sure it meets the assumptions for any analysis you're using it for
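
To make steps 2 and 4 a little more concrete, here is a minimal Python sketch, not a substitute for a proper measurement systems analysis. The function names, the sample readings, and the plausible range of 9.0 to 13.0 are all hypothetical and chosen only for illustration. The first function counts missing and out-of-range readings; the second estimates repeatability as the pooled within-part standard deviation of repeated measurements.

```python
import math
import statistics


def _is_missing(value):
    """Treat None and NaN as missing readings."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def integrity_report(values, low, high):
    """Count missing readings and list readings outside a plausible range.

    `low` and `high` are limits you choose from process knowledge
    (hypothetical values are used in the example below).
    """
    present = [v for v in values if not _is_missing(v)]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "out_of_range": [v for v in present if not (low <= v <= high)],
    }


def repeatability_sd(repeats_by_part):
    """Pooled within-part standard deviation.

    `repeats_by_part` maps a part label to repeated measurements of that
    same part (each part needs at least two readings). A large value
    relative to your tolerance suggests the measurement system itself
    is adding noticeable variation.
    """
    sum_sq = 0.0
    deg_free = 0
    for readings in repeats_by_part.values():
        mean = statistics.fmean(readings)
        sum_sq += sum((x - mean) ** 2 for x in readings)
        deg_free += len(readings) - 1
    return math.sqrt(sum_sq / deg_free)


if __name__ == "__main__":
    # Hypothetical readings: one blank, one NaN, one implausible value.
    readings = [10.1, None, 99.0, float("nan"), 10.3]
    print(integrity_report(readings, low=9.0, high=13.0))

    # Hypothetical repeated measurements of two parts.
    repeats = {"part_A": [10.1, 10.2, 10.0], "part_B": [12.0, 12.1, 11.9]}
    print(round(repeatability_sd(repeats), 3))
```

A check like this will not tell you whether a cable is loose, but it does force you to look at the data before you analyze it, which is the whole point of asking the question.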

 

Have you ever fallen victim to untrustworthy data? 

 

Comments

Name: Omar • Friday, February 24, 2012

Eston, what a great post!
I really enjoyed it. Maybe it is because I also realized, some years ago (and also as part of my business relationship with Minitab Inc.), that "Can I trust my data?" is actually the most important question. Now, every time young professionals come to me asking about "this-and-that sound statistical tests" and how to do them in a fancy way, I take my "wisdom position" and ask, "Do you really think this data is trustworthy?" Most of the time they claim, "Sure it is; it comes from our state-of-the-art database"...well...That's not the point.
I guess we don't realize the depth of the question, but your advice (the four general points to consider) offers great guidelines.

Of course I have fallen victim to untrustworthy data! And I hope I have learned my lesson...

greetings from Blackberry&Cross in Costa Rica

