Beware of big data surprises

Maybe I worry too much, but I do worry about rushing to conclusions when the consequences are as important as they are in health and healthcare. Big data is big time. With the enhanced ability to capture data from everywhere, the enhanced ability to store it, and more and more tools to analyze it, the allure and potential of new discoveries and new breakthroughs are amazing.

With all that data and all those tools, there’s a fundamental need to understand the pitfalls of exploratory data analysis.

Discoveries come from surprises, and as a mathematician who taught statistics for many years, I know how to quantify a surprise: A minor surprise is an event that occurs less than 5 percent of the time and a major surprise is one that occurs less than 1 percent of the time. Statisticians call this significance, and it very simply means that the probability of a result at least as extreme as the one observed is small if you assume that nothing much out of the ordinary is going on.

Therein lies the problem.

When you have a hypothesis and the data to test it, a result significant at the 1 percent level (one chance out of 100, the .01 level) or even the 5 percent level is rare and should be considered evidence. But what happens when you can run thousands of experiments on huge data sets? On average, if you ran 1,000 experiments on a set of purely random data (nothing is going on), you would expect 10 of those experiments to show “significance” at the .01 level. Even at the astounding level of .001, you’d expect one “extremely significant” result. The consequences of acting on these “significant” results can be both costly and dangerous.
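The arithmetic above can be checked with a small simulation. What follows is a minimal sketch, not a prescribed method: it assumes each "experiment" compares two groups of 50 values drawn from the same standard normal distribution and uses a two-sided z-test with known unit variance. The function names, the group size, and the seed are illustrative choices, not anything from a real study.

```python
import random
from statistics import NormalDist, mean

def null_experiment_p_value(rng, n=50):
    """Two-sided z-test p-value for two groups drawn from the same N(0, 1)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Assumption: unit variance is known, so the difference in means has SD sqrt(2/n).
    z = (mean(a) - mean(b)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def count_false_positives(num_experiments=1000, seed=0):
    """Run many experiments on pure noise and count 'significant' results."""
    rng = random.Random(seed)
    p_values = [null_experiment_p_value(rng) for _ in range(num_experiments)]
    return {alpha: sum(p < alpha for p in p_values) for alpha in (0.05, 0.01, 0.001)}

counts = count_false_positives()
for alpha, hits in counts.items():
    expected = 1000 * alpha
    print(f"alpha={alpha}: {hits} 'significant' results (expected about {expected:g})")

# Chance that at least one of 1,000 independent null tests passes the .01 level.
p_at_least_one = 1.0 - (1.0 - 0.01) ** 1000
print(f"P(at least one false positive at .01) = {p_at_least_one:.5f}")
```

The exact counts depend on the seed, but they should hover around 50, 10, and 1. The last line shows why a single hit means so little: with 1,000 independent tests of nothing, at least one result at the .01 level is all but guaranteed.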

So what should we do with big data? There is value in mining it, but rather than treating the results as evidence of something, we should use them to discover interesting hypotheses that can and should be tested on data independent of the data that was explored to raise them. That is a way to do good science.
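One way to read "independent of the data that was explored" is a simple split: mine one half of the data for hypotheses, then test only those hypotheses on the half that was never touched. The sketch below assumes that reading, and again assumes pure noise, a z-test with known unit variance, and made-up names and sizes. Truly new data collected after the exploration would be an even stronger check.

```python
import random
from statistics import NormalDist, mean

def z_test_p(a, b):
    """Two-sided z-test p-value, assuming both groups have known unit variance."""
    se = (1.0 / len(a) + 1.0 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def explore_then_confirm(num_variables=1000, n=200, alpha=0.01, seed=1):
    """Raise hypotheses on one half of the data, then retest them on the other half."""
    rng = random.Random(seed)
    candidates, confirmed = 0, 0
    half = n // 2
    for _ in range(num_variables):
        # Pure noise: "treated" and "control" come from the same distribution.
        treated = [rng.gauss(0.0, 1.0) for _ in range(n)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Exploration half: mine for interesting hypotheses.
        if z_test_p(treated[:half], control[:half]) < alpha:
            candidates += 1
            # Confirmation half: data never looked at during exploration.
            if z_test_p(treated[half:], control[half:]) < alpha:
                confirmed += 1
    return candidates, confirmed

candidates, confirmed = explore_then_confirm()
print(f"{candidates} hypotheses raised by exploration, {confirmed} survived confirmation")
```

On noise, exploration should raise roughly 10 "discoveries" out of 1,000, and each of those has only about a 1 percent chance of surviving the independent test, so almost none do.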

I can’t leave this topic without recounting a true story from many years ago. I had a friend, an assistant professor, who ran literally hundreds of statistical tests on data gleaned from a number of rat experiments. He published each of those that returned “significant” results. We discussed the methods and the pitfalls, and when I asked him what he thought the consequence of all his publications would be, he accurately answered “tenure.”

Albert Shar, PhD, is vice president and program officer for the Pioneer Portfolio at the Robert Wood Johnson Foundation (RWJF). RWJF’s mission is to improve the health and healthcare of all Americans, and the Pioneer team seeks bold ideas that push beyond conventional thinking to explore solutions that will transform health and healthcare.
 

Comments

Daniel Crough
If you look at a million events and you find 2 one-in-a-million events, that isn’t very interesting. If you look at a thousand events and find 2 one-in-a-million events, that is very interesting – and begs to be followed up. The follow-up might be just looking at the putative correlation and deciding that it is not interesting (not clinically useful, or likely the result of data entry errors…).
Dr George Margelis
An interesting perspective on the use of big data. The other question that needs to be answered about the various hypotheses is what their clinical relevance is. The danger is that we will find a whole bunch of correlations that have no real relevance to the patient's quality of life or prognosis, and we will spend too much time and too many resources on dead-end investigations and not enough on relevant treatments.
