Data Dredging

When analyzing data that already exist, such as from a questionnaire survey created by other people, there is a temptation to scan through the data looking for remarkable statistics, such as an unexpected correlation between two items. Sometimes this activity is disparaged as a fishing expedition, although of course sometimes one does catch something worth having. The technical term is data dredging, using a similar metaphor of blind trawling through seas of data. Especially when tests of statistical significance are inappropriately applied, this is seen as one of the more egregious sins a social scientist can commit, yet it can also be a legitimate method of discovery.

In recent years, a tremendous amount of effort has been invested by computer and information scientists in developing methods of data mining, a term not very different from data dredging and with similar aims, although the techniques may be rather different. Thus, it is important to understand the classical criticisms of data dredging, and the ways in which secondary analysis can properly explore existing data.

In the classic article about data dredging, Hanan Selvin and Alan Stuart (1966) note that ideally a social-scientific study that collects quantitative data will be based on an explicit theory, and will create measures for the concepts in formal hypotheses. In practice, they say, questionnaire studies tend to be exploratory. In addition, once the data exist, it is cost-effective to analyze them in ways that had not been anticipated at the time the study was designed. It is a practical fact that survey data are expensive, and secondary analysis increases the intellectual profit from the original investment. However, the possibility exists that data dredging will turn up quirks in the particular dataset, rather than real findings.

A good example is the classic 1966 Survey of Northern California Church Bodies, which contains a very interesting battery of ten items, each describing a kind of behavior a person might engage in, following this introduction: “There has always been a good deal of discussion among Christians about how people ought to act in their daily lives. It is not always clear what characteristics ought to be admired and which ones we should disapprove of. Below you will find a series of descriptions of ways in which people act; for each, would you decide how much you would admire or disapprove of a person who acted this way.” The ten kinds of behavior, with their variable numbers and keywords, were:

77. Drinks moderately (DRKNMOD)
78. Is very ambitious (AMBITIOU)
79. Thinks he is better than others (BTRTHNOT)
80. Dresses in a flashy way (FLSHDRSS)
81. Prefers to be with people like himself (PPLLKSLF)
82. Is very patriotic (PATRIOTI)
83. Does not celebrate holidays (NOCELEBR)
84. Is very rich (VERYRICH)
85. Is very anxious to be thought of as an intellectual (INTELLEC)
86. Is satisfied with his lot in life (SATWLIFE)

Four responses were offered, shown here with the percent giving each response for “drinks moderately”: 1) admire him for it (3.6%), 2) think it was all right (67.0%), 3) be mildly disapproving of him (19.8%), and 4) be highly disapproving of him (9.6%). The total number of respondents was 2,783, and all but 92 of them gave one of these four responses to this drinking item. Of course, the original point of this item was to examine how attitudes toward alcohol consumption varied across religious denominations, but secondary analysis may find other purposes for this battery. One way to dredge data for interesting findings is to blindly run all the correlations, and look for any statistically significant correlations that seem meaningful. For example, there is such a correlation between “thinks he is better than others (BTRTHNOT)” and “is very rich (VERYRICH).”
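To make "blindly run all the correlations" concrete, here is a minimal Python sketch that loops over every pair of items and computes Pearson's r from the respondents who answered both. The response table is made up for illustration (it is not the survey data), and both the function names and the use of None for a skipped item are assumptions of this sketch.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def all_pairwise_correlations(data):
    """Return {(item_a, item_b): (r, n)}, keeping for each pair only
    the respondents who answered both items (pairwise deletion)."""
    results = {}
    for a, b in combinations(sorted(data), 2):
        pairs = [(x, y) for x, y in zip(data[a], data[b])
                 if x is not None and y is not None]
        xs, ys = zip(*pairs)
        results[(a, b)] = (pearson_r(xs, ys), len(pairs))
    return results

# Hypothetical responses coded 1-4; None marks a skipped item.
toy = {
    "BTRTHNOT": [4, 3, 4, 2, 3, None, 4, 3],
    "VERYRICH": [2, 2, 3, 1, 2, 2, None, 3],
    "FLSHDRSS": [3, 3, 4, 2, 2, 3, 4, None],
}
correlations = all_pairwise_correlations(toy)
for pair, (r, n) in correlations.items():
    print(pair, round(r, 2), n)
```

Because the loop reports every pair without any prior hypothesis, the number of coefficients grows quickly with the number of items, which is exactly what makes the resulting significance tests hard to interpret.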

Does this confirm that rich people think they are better than others? No, first of all because it is a correlation between two opinions that respondents hold, not between how rich they are and how they feel about themselves.

The statistical significance of this relationship is 0.04 for the 2,702 people who answered both questions, and this meets the common test of beating the 0.05 level of significance. However, there are 45 different correlations among the 10 items. The 0.05 level means that one would expect to beat this level 5 percent of the time, on average, just by chance, so about twice in this battery of items (45 × 0.05 = 2.25). Also, the particular correlation used for this analysis, Pearson’s r, is based on the assumption that the four responses are part of a numerical scale in which the numbers are equal distances apart, and we cannot confirm that they are. So this is a very rough test of significance. Also, statistical significance should not be confused with substantive importance. The correlation between these two items happens also to be 0.04, a bit less than the 0.07 average of the 45 correlations, and much less than the greatest correlations for these two items: 0.20 for “is very rich” and “is very ambitious,” and 0.23 for “thinks he is better than others” and “dresses in a flashy way.” Apparently respondents conceptualize these items in a very different way from what we might infer from one very weak correlation coefficient.
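The arithmetic behind this multiple-comparisons concern can be written out in a few lines of Python. This is a rough sketch: it treats the 45 tests as independent, which the article does not claim and which is unlikely to hold exactly for items from one battery. The Bonferroni threshold at the end is a common correction added here for illustration, not a procedure the article uses.

```python
from math import comb

n_items = 10
alpha = 0.05

# Number of distinct pairs among 10 items: 10 * 9 / 2 = 45.
n_pairs = comb(n_items, 2)

# Expected number of "significant" correlations if every true
# correlation were zero.
expected_false_positives = n_pairs * alpha

# Chance of at least one false positive, ASSUMING independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_pairs

# Bonferroni-adjusted per-test threshold (illustrative addition).
bonferroni_alpha = alpha / n_pairs

print(n_pairs, expected_false_positives)
print(round(p_at_least_one, 2), round(bonferroni_alpha, 4))
```

Under these assumptions, finding at least one correlation that beats the 0.05 level among 45 is close to a sure thing, so a single significance level of 0.04 carries little weight on its own.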

With proper caution, however, it is quite appropriate to explore existing data for new findings. One way that is by no means foolproof, but is based on the full set of data rather than a haphazardly selected pair of items, is exploratory factor analysis. Statistically, this method is quite complex, with many options, but modern statistical analysis software makes it very easy to do, almost too easy if one is not cautious in interpreting results. For the sake of illustration, an exploratory factor analysis was done on these ten items, starting with “listwise deletion,” which means focusing only on those 2,390 respondents who answered all ten questions. Among them, the correlation between “thinks he is better than others” and “is very rich” is only 0.03 with a statistical significance of 0.15, which fails to be below the common 0.05 test! Apparently removing respondents who answered in a haphazard fashion kills this false finding. For the record, the technical decisions about what kind of factor analysis to run were: principal components method, using varimax rotation on all factors with eigenvalues greater than 1.
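The reported significance levels can be roughly checked from the correlation and the sample size alone. The sketch below converts r to a t statistic and uses a normal approximation for the two-sided p-value, which is reasonable for samples in the thousands. The function name is hypothetical, and because the published correlations are rounded to two decimals, the results only approximate the figures quoted in the text.

```python
from math import sqrt, erfc

def approx_two_sided_p(r, n):
    """Approximate two-sided p-value for Pearson's r with n cases.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and treats t as standard
    normal, an approximation that is adequate only for large n.
    """
    t = r * sqrt((n - 2) / (1 - r * r))
    # erfc(|t| / sqrt(2)) equals 2 * (1 - Phi(|t|)).
    return erfc(abs(t) / sqrt(2))

# Pairwise deletion: r = 0.04 among 2,702 respondents.
p_pairwise = approx_two_sided_p(0.04, 2702)

# Listwise deletion: r = 0.03 among 2,390 respondents.
p_listwise = approx_two_sided_p(0.03, 2390)

print(round(p_pairwise, 3), round(p_listwise, 3))
```

With these inputs the first value falls just under 0.05 and the second well above it, roughly in line with the reported 0.04 and 0.15. A shift of 0.01 in a correlation this weak is enough to move it across the conventional threshold.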

Factor analysis is a data reduction method, akin to what computer scientists today call “machine learning,” that looks for a simpler way to represent the data than the 45 correlation coefficients. In this case it identified three factors (or dimensions of variation), of which the third seems to be leftovers, as is often the case. Each of the first two factors brings together three of the ten items, and the remaining four items fail to fit together into coherent groups. Factor 1 is: “is very ambitious,” “is very rich,” and “is very patriotic.” Factor 2 is: “thinks he is better than others,” “dresses in a flashy way,” and “does not celebrate holidays.” Note that the two items we began with fall into different factors. It seems that respondents to this survey primarily had two very different kinds of concern about the behavior of others. Factor 1 looks like the issue of being too aggressive or power hungry. Factor 2 looks like narcissism or self-centeredness. At the risk of getting carried away by the results, together the two factors look like a general theory of human personality variation, shared by the respondents and expressed unexpectedly in a survey that focused on religion.

It is dangerous to overinterpret the results of data dredging, even after making sure that the findings are strong and cannot easily be explained by methodological errors. Results might be worthy of publication if they connect very strongly to theory, but it would have been better to begin with the theory and look only at the items relevant to it. Perhaps some other existing dataset has similar items, which would allow an approximate replication of the findings. However, the ideal way to go forward would be to build the items into a new questionnaire, along with other measures designed specifically to expand and test the reliability of the original findings, and nail down more firmly what they mean in terms of theory.

Selvin, Hanan C., and Alan Stuart. 1966. “Data-Dredging Procedures in Survey Analysis.” The American Statistician 20(3): 20-23.
