“Thinking, Judging, Noticing, Feeling”: John W. Tukey against the Mechanization of Inferential Knowledge

Abstract

During the past half-century, a set of statistical techniques and ideas about inference have experienced a remarkable scientific success. Significance at the 5 percent level has come to mark a clear and distinct criterion for scientific knowledge in a wide range of fields. Recently, however, this convention has been embroiled in controversy, as the relentless pursuit of significance has produced a range of well-known scientific abuses. Instead of staking out a position in these debates, this article analyzes the history of epistemological values underlying them. It focuses on an earlier critic of the misuse of statistical tests: John W. Tukey. Speaking to behavioral scientists in the middle of the twentieth century, Tukey insisted that reducing inference to a set of universal rules or mechanical procedures to eliminate uncertainty was a pursuit doomed to failure. Scientists needed to accept the irreducibility of individual judgments and decisions in data analysis, even when they risked charges of subjectivism or arbitrariness. For Tukey, the enforcement of scientific consensus and even the value of objectivity must yield to empirical judgments and an ethic of individual conscience. These values were informed by his comparative understanding of the history of science, which reserved a special place for empiricism in younger sciences. Reconstructing Tukey’s work offers an alternative perspective on the quantitative, formal objectivity of the postwar sciences as well as the present, where big data and machine learning have raised thorny new problems for statistical inference and scientific expertise.

As a scientist and investigator you can never give over your responsibilities as a thinking, judging, noticing, feeling person. You can receive much help from such tools as concepts and statistical techniques, but you must use them, not let them use you. You can do better than a machine, but only by taking the chance of doing worse.

—John W. Tukey1

I.  An Outsider to the Inferential Order

It would be difficult to imagine a more influential system for the production of scientific knowledge than the set of statistical techniques and interpretations grouped under the heading of “null hypothesis significance testing” during the twentieth century. According to its conventions, researchers analyze data by measuring the difference between their observations and a model of a null hypothesis, almost always taking the form of no mean difference between experimental groups. If the difference between experimental observations and the model of the null hypothesis reaches a certain magnitude, a threshold conventionally set at the 5 percent level (p < .05), researchers term the evidence “significant” enough to cast doubt on, or even reject, the null hypothesis—indirectly affirming the existence of some effect through a negative operation. The 5 percent level of the p value is interpreted probabilistically as a one in twenty chance that a statistical value at least as extreme as that of the observed data could have arisen under the null model.2
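The procedure described above can be made concrete with a minimal sketch. It is an illustration only, not a method drawn from the sources discussed here: it assumes a permutation test as the model of the null hypothesis (group labels are shuffled to mimic "no mean difference"), and the function names, the two small data sets, and the number of permutations are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation p value for a difference in group means.

    Assumption: under the null hypothesis the group labels are
    exchangeable, so shuffling them simulates "no mean difference".
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def is_significant(p_value, alpha=0.05):
    """The conventional binary verdict at the chosen threshold."""
    return p_value < alpha


# Hypothetical measurements for two experimental groups.
control = [4.1, 5.0, 4.8, 5.3, 4.6, 5.1, 4.9, 4.7]
treated = [5.9, 6.4, 6.1, 5.7, 6.8, 6.0, 6.3, 5.8]

p = permutation_p_value(control, treated)
print(f"p = {p:.4f}, significant at .05: {is_significant(p)}")
```

Note how the final step collapses a continuous quantity into a yes-or-no answer; it is this last move, rather than the calculation itself, that the critiques discussed below take aim at.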

The above picture is one of an elementary statistical procedure, painted in admittedly broad strokes. From humble beginnings it expanded spectacularly over the course of the twentieth century, driving what Gerd Gigerenzer has termed the “inference revolution,” which transformed the sciences of life, behavior, medicine, and mind in its image.3 These diverse sciences strive to produce knowledge about causes in the form of statistical law on the basis of “the inference from sample to population.”4 The validity of these claims to knowledge is then evaluated according to a powerful convention: significance at the 5 percent level. This threshold defines a more or less implicit ontological distinction between the existence of an underlying mechanism explaining the data’s distribution and random variation or noise. Over time this revolution has stabilized into a regime or inferential order, where knowledge from very different fields presents itself as an immense collection of p values, promising a uniform standard for weighing evidence in a variable, uncertain world.

This article analyzes one strain of the many critiques of this inferential order. It focuses less on statistical issues—which have been well-documented—than on their underlying epistemological values. After characterizing the more recent crisis of this order, it reconstructs one genealogical thread, drawing on a little-known set of writings produced by the omnivorous statistician John W. Tukey. The purpose is not to suggest that Tukey holds any secret key to the riddle of inference. His writings are fraught with gnomic allusions and some downright bizarre statements. These works rarely resembled traditional academic articles and were circulated in unpublished form among a select group of colleagues. What makes them worth returning to is not to discover an overlooked foundation for inferential knowledge but rather to unearth suppressed epistemological and even ethical issues that have recently resurfaced. Although his style could be difficult, Tukey insisted on facing up to problems that have been precariously paved over by the dull rumble of mechanical procedures—impersonal, rule bound, computational—for making knowledge through statistical inference.

Beyond the more contemporary crisis of inference, these debates shed light on the history of the social sciences during the Cold War period,5 as well as the longer history of scientific objectivity analyzed by Theodore Porter, Lorraine Daston, Peter Galison, and others.6 Their work has historicized the ideal of objectivity in the formation of knowledge by showing how it has been valued within political contexts and emerges through the ethical self-fashioning of scientific personae. The adjective “mechanical” here draws on their uses. As Porter explains, mechanical objectivity “means following the rules … of a rigorous method, enforced by disciplinary peers, canceling the biases of the knower and leading ineluctably to valid conclusions.”7 Similarly, for Daston and Galison the mechanical variant of objectivity refers to a “passionate commitment to suppress the will,” including the use of technologies and rules to inhibit suspect forms of subjective intervention, interpretation, and judgment—even when this might lead to less accurate scientific representations.8

Tukey offers an atypical vantage on the ideal of objectivity in the postwar sciences in the United States, one that scholars have characterized in terms of a particularly “rigorous objectivity,” a highly “formal,” rule-bound rationality, where “computers might reason better than human minds.”9 Historians have identified both internal and external factors to explain this especially intense form of objectivity. The advent of digital computing brought more and more social phenomena into the domain of the calculable. Political developments, such as a growing scientific bureaucracy and anticommunism, exerted quantitative pressure from the outside to avoid appearances of a politicized science. From an internal perspective, social scientists reacted to the enormous prestige of physics by incorporating its mathematical models and formalisms. In some cases, this borrowing relied on superficial analogical thinking, while in others it produced genuinely productive ways of modeling social worlds, fully aware of the limits of formal description. In both cases subjective, interpretative social thought tended to give way to numerical models that projected an aura of objectivity.10

Tukey’s understanding of data analysis and inference departs from these diagnoses in dramatic ways. Even if he represents an exception that supports the wider rule, his example points to dissensus during the period as well as dynamics that may run beyond it. He believed emphatically that the behavioral sciences should learn from what he saw as their more mature siblings in the physical sciences. But the lessons he drew from these comparisons were not those commonly recounted in histories of the period. His was a measured respect for epistemological habits and practices of the physical sciences, expressed in historical terms, rather than an uncritical physics envy. Tukey emphasized the empirical, provisional character of scientific knowledge in fields like physics, in which successes were built on the subjective skill and judgment of “thinking, judging, noticing, feeling” individuals—not reducible to mechanical mathematical procedure. Again and again he warned against the dangers of optimization and the powerful but ultimately illusory attraction of quantitative precision in the computerized Cold War sciences. Better, in his words, “to be approximately right” than “exactly wrong.”11

However, Tukey was not reacting against the nascent inferential order in any romantic or irrationalist way. The object of his critique was not so much statistical tests themselves; they should be part of any scientist’s arsenal. Rather, he was worried about the way that they were being used and more profoundly that the epistemological values determining these uses—universality, objectivity, and impersonality—might inhibit the development of the social sciences.12 He affirmed a different set of scientific values—judgment, experience, and even pluralism—to guide the use of statistical techniques in new contexts. Analyzing Tukey’s departure from this scientific orthodoxy casts the period into sharper relief. His warnings have been forgotten by the inferential victors and flattened by histories of objectivity.

It has been left to critics of the contemporary inferential order to reprise the epistemic values that Tukey cultivated. Instead of safeguarding conclusions under conditions of uncertainty or producing consensus through impersonal, context-independent standards, he imagined statistical techniques being deployed to “get a feel” for the data, “to dissect data so as to see what is going on,” through “techniques of incision, rather than those of conclusion or decision.”13 This perceptual language, often stretched to the limits of sense, is not accidental. Tukey resisted the increasing separation of calculative, mathematical data analysis from the empirical practices of the physical sciences. The special irony to Tukey’s creative empiricity is that it emerged from the center of the “closed world” of Cold War sciences—intensely quantitative and computational, guided by high theory and calculation rather than observation and judgment.14

These epistemic tensions echo into our present, where a concept of data—“big” and multidimensional—has replaced that of “information,” which tended to predominate in Tukey’s period. Worlds that are probabilistic rather than deterministic require techniques for producing knowledge under conditions of uncertainty and, as Porter has argued, social institutions for producing trust in knowledge that may not have guarantees or foundations.15 Machine learning, for instance, is even more highly mediated and opaque than the testing procedures and inferential regimes that it seems poised to replace, once again demanding new modes of empirical engagement and understanding. If statistical tests distanced analysts from their data, machine learning severs the connection, operating with data on scales that no individual observer could hope to approach. Here, scholars such as Matthew Jones have returned to Tukey’s work as an alternative to the “highly instrumentalist” epistemic culture of machine learning, which “prized prediction over knowledge about an interpretable, causal, or mechanistic model.”16 Building social trust in these techniques may require new ways of experiencing data collectively or publicly, the possibility of multiple perspectives rather than mechanically enforced consensus. At this highest level, Tukey’s work continues to illuminate knotty relationships between technical expertise, knowledge, and politics in modern societies. Before returning to his historical moment, it will be useful to briefly characterize the present crisis of statistical inference to understand these clashes of values.

II.  Tests and Rituals

In the past decade, the inferential order described in broad strokes above has come under great strain. Its effects have perhaps been felt most strongly in the discipline of psychology, where misuses of statistical inference and problematic experimental designs have been grouped under the heading of a “replication crisis.”17 In the biomedical sciences, John Ioannidis’s hyperbolic wake-up call, “Why Most Published Research Findings Are False,” took issue with a similar set of research and publication incentives that threaten to undermine public confidence.18

The outbreak of these controversies has led to a search for their causes. One powerful set of explanations concerns professional norms around the reporting of results and academic publishing practices. Many commenters have noted that journal editors are much more likely to publish results that reach the .05 level of significance. In turn, motivated by the perverse professional imperative to produce as many publications as possible, researchers have distorted their experimental designs and data analysis to produce results at this level, at times eschewing more original scientific contributions in the quest for significance.19 Scientists have engaged in questionable, if not fraudulent, practices, including selective reporting (publishing only results that reach significance thresholds) and data dredging (using statistical software to identify any relationships significant at the 5 percent level while ignoring inferential problems such as multiple comparisons).20
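The multiple-comparisons problem behind data dredging can be sketched numerically. The following is an illustration under stated assumptions, not a reconstruction of any study cited here: every outcome variable is pure noise drawn from a standard normal distribution with known unit variance, a simple two-sided z test stands in for whatever test a researcher might run, and the function names and sample sizes are hypothetical.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test p value, assuming a known common sigma."""
    standard_error = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(n_variables, n_per_group=30, alpha=0.05, seed=1):
    """Count 'significant' results among variables that are pure noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_variables):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(a, b) < alpha:
            hits += 1
    return hits


def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive in n independent null tests."""
    return 1 - (1 - alpha) ** n_tests


print("false positives among 2000 noise variables:", dredge(2000))
print("chance of at least one 'finding' in 20 tests:",
      round(family_wise_error_rate(20), 3))
```

Because no real effect exists anywhere in the simulated data, every hit is spurious. Reporting only the hits and staying silent about the number of comparisons attempted is what turns a 5 percent convention into a much higher error rate.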

Even when researchers are not engaged in overt misconduct, there is a wealth of subtleties covered over by binary, mechanical interpretations of statistical tests. Many commenters have noted that the .05 threshold of significance is arbitrary or conventional, rather than an ontological marker of causal relations. Results that fall narrowly on either side of this marker may retrospectively appear to be more different than they actually are. Critics often attribute historical responsibility for the decision to adopt the .05 level to R. A. Fisher, whose influential 1925 book Statistical Methods for Research Workers included a set of tables at this level. It is a remarkable historical contingency that a convention Fisher developed for analyzing agricultural experiments became a widespread means of weighing evidence in areas as diverse as psychology and medicine. However, Fisher and those who read him closely would not have accepted the open-and-shut interpretation of significance tests that has ossified over time. He stressed that only the repetition of significant results could lead to firm scientific knowledge.21 Furthermore, as Stephen Stigler has shown, questionable interpretations of significance levels in fields like psychology predated the widespread dissemination of Fisher’s tables.22

Many other problems with significance tests had been understood for decades before breaking out into the open. In a pair of articles from the 1990s, the psychologist Jacob Cohen cataloged common errors exacerbated by these tests, especially the artificiality of formulating “straw-person” null hypotheses of zero effect size.23 Cohen also noted that the ease of computing statistical tests with software estranged analysts from their data. And once it had achieved a dominant position in scientific research and publishing, null hypothesis significance testing overshadowed other factors relevant to inference such as effect size and statistical power.24

Around this time Gigerenzer and Cohen began to cast these problems in wider terms.25 Beyond the clearly pathological cases of misconduct, they saw that the use of statistical tests had become increasingly divorced from scientific judgment; what had once required subjective experience, understanding, and sensitivity to experimental context had petrified into what Cohen called “a single all-purpose, mechanized, ‘objective’ ritual in which we convert numbers into other numbers and get a yes-no answer,” allowing researchers to “neglect close scrutiny of where the numbers came from.”26 The replication crisis has reactivated this more systemic critique. For instance, Gigerenzer has argued that many researchers do not cynically manipulate inferential techniques for professional gain but seem to actually believe in the simplified and often misleading results derived from them. To account for such beliefs, which defy sound statistical sense, Gigerenzer posits the existence of a “statistical ritual” in which “researchers engage in delusions about the meaning of the null ritual, and above all about its sacred number, the p value.”27 This focus on ritualized delusions implies, with some condescension, that many scientists know not what they do; even further, what they do might require little thought at all. However, it also prompts the question, What prevents researchers from leaving the darkness of ritual and delusion and stepping into the clear light of statistical reason?

One set of possible explanations—the forces of habituation, social reproduction, and power involved in rituals—remains underdeveloped in Gigerenzer’s account.28 He is understandably more interested in analyzing his own data on problematic interpretations of statistical tests. But a deeper engagement with the extensive literature on ritual in the social sciences would help us see how practices that from one perspective appear irrational or delusional can also reveal how social practices and claims to knowledge persist over time—including at the very heart of the scientific institutions whose own self-understanding involves the replacement of premodern ritual with modern rationality.29 These works ask us to think about ritual in a positive sense, about the epistemological values they incarnate, in addition to the more negative “delusions” that they may engender. What should interest us is not only how rituals obscure the truth—implying that we could simply withdraw our belief in them to resolve our inferential crisis—but also how they endure and solidify over time. The American Statistical Association summed up the impenetrable circularity of these processes in an editorial commentary on their own “Statement on p-Values,” precipitated by the replication crisis: “We teach it [the .05 level] because it’s what we do; we do it because it’s what we teach.”30 Here the editors present the teaching and interpretation of p values as a sort of vicious cycle, resistant to historical understanding.

This problem motivates historical or genealogical approaches to inferential knowledge. One way of breaking through this apparent cycle is to return to moments before inference stabilized into a mechanical, judgment-free exercise. Such a return would be less interested in locating some original truth or falsity in the use of these techniques than in the contingencies, translations, and epistemological mixtures that have crystallized, producing, at least for a time, both scientific and social consensus.31 It is significant that critiques of the current inferential order often lead back to a single individual: Tukey. For instance, when Cohen turns to the question of how to use statistics more effectively in psychology, the first book he cites is Tukey’s Exploratory Data Analysis from 1977. Tukey, he explains, recognized that “the emphasis on inference in modern statistics has resulted in a loss of flexibility in data analysis.”32 Gigerenzer’s more recent proposal to replace the statistical ritual with a “statistical toolbox”—an array of techniques whose deployment should be informed and justified by judgment—also reserves a prominent place for Tukey.33 What led Tukey to develop these inferential alternatives, and what can his critiques tell us about both the recent history of inferential knowledge and scientific values in worlds of data?

III.  “Badmandments”

Even a cursory review of Tukey’s biography suggests interdisciplinarity, heterodoxy, and conflict as topoi of his career and thought.34 Although his PhD in mathematics was devoted to pure topics in topology, his “conversion” to applied work in statistics and data analysis resulted from experiences in Merrill Flood’s Fire Control Research Office and Princeton’s Statistical Research Group during the Second World War—a major site in the development of the techniques and styles of reasoning of the inference revolution. Tukey remained at Princeton for his entire career while maintaining a half-time appointment at Bell Labs, where he made important contributions on applications ranging from computing to missile systems to signal processing.35 He also filled a number of important roles on governmental science advisory boards, including the National Security Agency, as a delegate to nuclear weapons testing treaties between the United States and Soviet Union, and in producing analyses of pollution and public health.36 Given this huge range of interests, Tukey naturally gravitated to the behavioral sciences, which were asserting a new quantitative confidence following the Second World War.

This interest crystallized during the 1957–58 academic year, which Tukey spent on a fellowship at the Center for Advanced Study in the Behavioral Sciences (CASBS) near Stanford (fig. 1). The institution was founded only a few years earlier in 1954, but it immediately attracted leading lights in an astonishing range of fields. Present among its early classes were economists Kenneth Arrow and Frank Knight, political scientists Karl Deutsch and William Riker, rational choice theorist Vincent Ostrom, anthropologist E. E. Evans-Pritchard, and game theorists Howard Raiffa and Duncan Luce.37 The 1957–58 class of CASBS fellows easily held its own, counting sociologist Talcott Parsons, economists George Stigler and Robert Solow, and Tukey’s Bell colleague Claude Shannon among its members. Tukey, whose earliest published work called for a spirit of “scientific generalism,” thrived in the center’s collegial and collaborative environment.38 His role as a statistician with pure mathematical talent permitted special movement between the already fluid disciplinary boundaries at CASBS; he performed time series analyses with econometricians and assisted psychologists with experimental design and data analysis. Outside of CASBS, Fisher had just published his philosophically minded treatise Statistical Methods and Scientific Inference, which Tukey read with interest.39 In short, it was an auspicious moment. Tukey shared the sense that the postwar behavioral sciences were on the cusp of major discoveries about human nature and society. But he also saw dangers.

Figure 1. 

Images from the 1957–58 CASBS yearbook created for the fellows. Left, the cover depicts the center’s acronym on the ground of a “golden egghead.” Right, a photograph of Tukey at CASBS. The shapes in the bottom-right corner indicate “wizard” and “ping-pong”—one of Tukey’s hobbies—in the symbolic code developed for the yearbook. Courtesy: Stephen Stigler.

Tukey expressed this ambivalence in a manifesto of sorts, whose “rather uninhibited expression”—an understatement—spoke to the dangers he saw lurking behind the promise of the behavioral sciences.40 Those who read early drafts were struck by his language. Leslie Kish of the University of Michigan’s Institute for Social Research described the moralistic, even religious fervor that he felt emanating from Tukey’s manuscript: “I can testify from frequent encounters with them that the dragons that you are trying to slay are genuine, big, living, and common. Nevertheless, I tend to believe that your message would be more effective if Chapter A would have a different tone. What you preach about those sinners is true. Yet you get yourself into the role of a preacher who paints his sinners so black and comic that his listeners fail to recognize themselves. They think that the preacher is talking about somebody who hasn’t even come to church.”41 Perhaps Tukey took this advice to heart, as he never formally published the manuscript before its much later inclusion in his Collected Works in 1986. Instead, he circulated it among colleagues, although most statisticians, like Kish, would have counted themselves among the converted. Nonetheless, the document is significant as an expression of ideas that would guide Tukey’s subsequent research. More importantly, it ranged far beyond formal statistical and probabilistic issues, broaching questions of epistemology and values whose tracks were subsequently covered by the mechanization of inference.

Tukey gave this unwieldy manifesto an unwieldy title: “Data Analysis and Behavioral Science or Learning to Bear the Quantitative Man’s Burden by Shunning Badmandments.” The subtitle, echoing Kipling’s notorious apology for racist colonialism, is appropriate in retrospect, given the imperial ambitions of the quantitative behavioral sciences. The neologism “badmandment” referred to a sort of negative commandment: “unwise statements which most of us can imagine someone else teaching to his students, either by word or by deed.”42 Introducing the manuscript, Tukey asked readers to resist the dazzling clarity of statistical formalisms in order to soberly face data—and the world—as they are, “to help the reader face up to what the situation is really like, to what statistics can and cannot do for him, to which burdens of uncertainty and judgment he must shoulder if quantitative procedures are to serve him well.”43 The first “great badmandment” stated: “if it’s messy, sweep it under the rug,” parodying the way researchers ignored empirical observations that failed to conform to the restrictive conditions required by many statistical models.44 All in all, he enumerated one hundred subsequent badmandments. Although they covered a wide range of issues, a number of underlying values emerge, which offer alternatives to mechanized inference procedures that ultimately proved too strong for the behavioral sciences to resist.

IV.  Issues with Significance

Tukey devoted a major section of the manuscript to significance tests, the pillar of the inferential order. However, instead of dealing with technical specifics he focused on larger interpretive issues. First, although significance tests were urgently needed to guide behavioral scientists, they became dangerous when mistaken for guarantors of inferential certainty or as “sanctification” of experimental results, meant to ward off criticism from colleagues.45 For Tukey, the epistemological value of these tests was neither to eliminate doubt nor to enforce consensus by foreclosing further discussion. Ideally, they would allow scientists to speak more precisely about comparisons and uncertainty.

Second, instead of quibbling about the proper magnitude of the threshold for significance, Tukey argued that the larger problem was its binary structure, which worked to “render a portrait [of experimental results] with a single round dot, either black or white.”46 Classification into these overly broad categories obscured more important similarities or differences that other techniques, like confidence intervals, could illuminate. Anticipating later concerns about experimental replication, Tukey maintained that the classes “statistically significant” and “not statistically significant” may not be well-defined, “in the sense that independent reclassification, namely repetition of the whole experiment or observation on independent chosen individuals under independently chosen circumstances would differ from the original classification in a non-negligible fraction of all instances.”47 In other words, the apparent clarity of these binary categories became a weakness when it proved fragile in repetition, which should be the ultimate test of inferences; moreover, the seeming “broadness” of these classes could hide this fragility.
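The fragility of the binary classification under repetition can be given rough numbers. The sketch below is an illustration, not Tukey's own calculation: it assumes a two-group comparison analyzed with a two-sided z test, normally distributed outcomes with known unit variance, and a hypothetical true effect and sample size chosen so that the experiment has roughly even odds of reaching significance.

```python
from statistics import NormalDist


def power_two_group_z(effect, n_per_group, alpha=0.05, sigma=1.0):
    """Probability that a two-sided z test declares significance.

    Assumptions: equal group sizes, known sigma, and a true mean
    difference equal to `effect`.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect / (sigma * (2 / n_per_group) ** 0.5)
    upper_tail = 1 - std_normal.cdf(critical - shift)
    lower_tail = std_normal.cdf(-critical - shift)
    return upper_tail + lower_tail


def reclassification_rate(power):
    """Chance that two independent runs land in different classes."""
    return 2 * power * (1 - power)


# Hypothetical design: true effect of half a standard deviation,
# thirty observations per group.
power = power_two_group_z(effect=0.5, n_per_group=30)
print(f"chance of 'significant' in one run: {power:.3f}")
print(f"chance a repetition disagrees: {reclassification_rate(power):.3f}")
```

Under these assumptions an exact repetition of the experiment contradicts the original verdict about as often as a coin flip would, even though the underlying effect is real and unchanged. A confidence interval for the difference would convey the same data without forcing this unstable classification.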

Third, Tukey sensed danger in the way that the apparent clarity of significance tests crowded out other relevant techniques. “Principles of significance are important,” Tukey maintained, “but they gain their value by being combined with techniques”—very often confidence techniques and the analysis of variance.48 The exclusive use of any single technique was likely to make the data analyst overly optimistic, wrongly implying that conclusions could be drawn from a single procedure. Tukey’s characterizations of this narrowness could be harsh: “The idea of the single overall test of significance as something natural, universal, and perhaps even as a cure-all, might almost be considered a statistical disease.”49 Instead of a single procedure or set of rules valid independent of context, Tukey advocated for a plurality of techniques whose combined strengths and weaknesses would paint a more detailed portrait of the data. This is the opposite of conceiving inference in terms of a single set of standardized rules or calculative procedure. More than a half-century later, critics such as Gigerenzer have reactivated this pluralist approach to knowledge, as in his call for a “toolbox” of statistical approaches.50

V.  Grinding Up Uncertainty

These discussions of significance tests were embedded within wider reflections on inference, data analysis, and even the history of science. Again and again Tukey argued that researchers in the behavioral sciences needed to give up the seeming security of procedures and learn to be comfortable with a certain amount of subjective, even arbitrary action; they needed to think independently about relationships between samples and populations rather than relying on rules.51 When faced with data that did not meet the restrictive requirements of probabilistic models, scientists should not simply throw up their hands. Nor should they regard “unspecified or unspecifiable populations with disdain and fear”52—the emotion that Daston and Galison posit at the source of objective epistemologies.53 Instead, Tukey argued, “the nature of ‘The Exact Sciences’ is that they are full of ‘corrections,’ ‘art’ and what might even appear to be ‘folk-wisdom,’ especially when one is concerned with the practice of measurement.” Tukey continued, “other fields”—here he is referring to the behavioral sciences—“cannot hope to become ‘Exact’ with a capital E by abjuring good quantitative judgment, or by abjuring empirically sound adjustments, or by abjuring ‘arbitrary’ corrections.”54 Contrary to the broader contours of objectivity in the Cold War sciences, Tukey did not see the movement toward quantitative precision in terms of a constraint on more personal forms of judgment—even when they appeared arbitrary. Rather, subjective judgment must work alongside mathematical procedures to navigate difficult inferential terrain. There could be no simple identification of mathematics with objectivity.

Few would disagree with sentiments about sound judgment in the abstract, but this more subjective value faces perennial challenges in quantitative worlds that value numbers as objective means to resolve conflicts, communicate between cultures, and produce consensus. Tukey broached these conflicts in explicit, if not systematic ways in the “Badmandments” manuscript, where they were ultimately resolved on ethical grounds. As an alternative to impersonal procedures—using “statistical techniques as machines to grind up uncertainty and making certainty out of the grist”—he argued that “everyone ought to make up his own mind about what standard of intellectual honesty, for each individual and for each field, will best support and facilitate progress in the field in question.”55 In other words, statistical procedures alone offered no guarantee of truth. Individual scientists would need to make personal ethical decisions regarding scientific values, faintly echoing the Protestant theology that Tukey invoked elsewhere. Although he tended to be supremely confident in his own intellectual judgments on inferential topics, Tukey recognized that disagreement was inevitable and that disputes among such “warring gods” could not be decided by technical means alone. They could only be resolved by examination of “scientific consciences”—again note the ethical language.56

VI.  Communication and Pluralism

Appeals to conscience reveal the ethical stakes of epistemological conflicts but do little to resolve substantive disagreements and may deepen them. Tukey realized this and throughout the manuscript proposed strategies for producing consensus that avoided the problems with mechanized objectivity. One of his alternatives was the development of techniques for the expression and communication of statistical ideas.57 The “Badmandments” manuscript, for instance, featured some of Tukey’s earliest reflections on graphical techniques, which were expanded in his best-known work, Exploratory Data Analysis.58 Their inclusion underscores the fact that Tukey’s attention to graphics came out of a wider set of statistical and inferential concerns. Notably, the importance of observation and judgment took the form of a datafied empiricism, emerging “more or less directly from contact with data … rather than from suggestions by theory or unmitigated ‘common sense.’”59 Despite Tukey’s phrasing, this was by no means a direct or immediate sensory empiricism but rather needed to be subjectively cultivated and supported by techniques.

This empirical attitude formed the core of Tukey’s more positive response to dogmatic theorization and the mechanical use of statistics in the behavioral sciences. He grouped these visual alternatives under the umbrella “modes of expression” and conceptualized them using sensory metaphors—often auditory but broadly consistent with his earlier language of “incision” and “dissection of data.” He argued, “We are, at this point, trying to tune our ears to hear what the data are trying to say to us. Good data try, much harder than most of us realize, to tell us what is going on. We need receptive ears.”60 Here the epistemological relationship is conceived not between knowing subject and an external object but in strangely intersubjective terms, with data anthropomorphized. His solution to this eminently empirical problem of receptivity involved a different conception of techniques like graphing. Instead of using single, finished graphs to demonstrate one’s finished argument as the “deaf-to-data investigator” does, data analysts should sketch rough, iterative series of plots in order to find expressions of the relationships between variables. Tukey gave various visual rules of thumb for helpful modes of expression, such as scaling, to produce curves in the form of “straight parallel lines,” which can be “described by a single number, the vertical distance between the curves.”61 His empirical understanding of data analysis was not simply a stick to beat back theorizing and calculation. In a positive sense it emphasized the generative, creative sides of perception. It focused on surprising relationships, interesting outliers, and the production of new hypotheses rather than rote confirmation of existing ones (fig. 2).

Figure 2. 

A rough plot from the “Badmandments” manuscript illustrating Tukey’s frequent practice of using a logit scale to better show relationships. Courtesy: American Philosophical Society.
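The rule of thumb about re-expression can be illustrated with a minimal sketch. It does not use Tukey's data or the data behind figure 2; the two curves, their slope, and their intercepts are invented for illustration, and the logistic form is an assumption chosen because it makes the effect of a logit scale easy to see. On the raw proportion scale the vertical gap between the curves changes from point to point, while on the logit scale the curves become straight parallel lines whose separation is a single number.

```python
import math


def logit(p):
    """Log-odds re-expression: maps a proportion in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def logistic(z):
    """Inverse of logit: maps a real number to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical example values (not taken from the manuscript).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
SLOPE = 1.2
INTERCEPT_A = -0.5
INTERCEPT_B = 1.0

# Two S-shaped curves of proportions that share a slope but differ in intercept.
curve_a = [logistic(INTERCEPT_A + SLOPE * x) for x in xs]
curve_b = [logistic(INTERCEPT_B + SLOPE * x) for x in xs]

# Vertical distance between the curves on the raw scale and on the logit scale.
raw_gaps = [b - a for a, b in zip(curve_a, curve_b)]
logit_gaps = [logit(b) - logit(a) for a, b in zip(curve_a, curve_b)]

for x, raw, lg in zip(xs, raw_gaps, logit_gaps):
    print(f"x={x:+.1f}  raw gap={raw:.3f}  logit gap={lg:.3f}")
```

Under these assumptions the logit gap equals the difference between the two intercepts at every point, which is the sense in which a pair of curves can be “described by a single number.” With real data the lines would be only approximately parallel, and judging whether the approximation is good enough is exactly the kind of decision Tukey left to the analyst.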

Alongside visual techniques, Tukey described statistical communication using the language of information theory that had been developed by his Bell and CASBS colleague Claude Shannon. The binary outcome of significance tests transmitted only a single unit or bit of information, and Tukey argued that this was not sufficient: “If we wish to know more about some investigator’s result than merely the dichotomy of significant–not significant, we are likely to require several bits to specify what we have learned … an increased effort, a greater channel capacity.” As an alternative, he endorsed a pragmatic pluralism, holding out hope that “we can all learn to communicate more effectively about results, both with ourselves and with one another, using more flexible and useful codes.”62 This position was characteristic of Tukey’s understanding of the social and professional roles of data analysts. Instead of stamping the imprimatur of sanctification on scientific results, statisticians might reasonably come to different conclusions.

In another unpublished manuscript, “What Have Statisticians Been Forgetting?,” Tukey expanded on this theme. He began by questioning the value not of expertise but of expert consensus, asking, “Economists are not expected to give identical advice to congressional committees. Engineers are not expected to design identical bridges—or aircraft. Why should statisticians be expected to reach identical results from examinations of the same set of data?”63 This affirmation of pluralism was only Tukey’s opening salvo. His larger point was that the value of objectivity itself needed to be revalued if not transformed in statistical practice. After the single-minded pursuit of optimality, “the next fetish to be attacked,” he argued with some relish, “is the fetish of objectivity,” understood as “an eternal supply of scapegoats … the fallacy that to a single body of data there corresponds a unique appropriate analysis.”64 Against this procedural understanding of objectivity as optimum, Tukey opposed a sort of communicative rationality, although his definition stretches the limits of the notion. Objectivity “in … the deepest and best sense” is nothing else than the willingness “to listen to reasoned technical arguments just so far as these arguments were scientifically convincing.”65 Who could disagree?

Recent scholarship has explored the relationship between the nascent sciences of information and communication in terms of objectivity. Orit Halpern, following Daston and Galison’s example, identifies a value of “communicative objectivity” that emerged during this period. Halpern’s notion shares some of Tukey’s concerns with data, perception, and communication, but her category is ultimately grounded in the world of design, rather than the behavioral sciences that concerned Tukey.66 His valuation of objectivity is better characterized as a minor tradition in opposition to the idea of mechanical objectivity as developed by Daston, Galison, and Porter and updated by scholars of the Cold War sciences. Instead of the impersonality of rules, suppression of the will, or algorithmic calculation, Tukey called for scientists to speak freely as rational individuals, even, or perhaps especially, when their interpretations of data differed.

There is an evident affinity between Tukey’s position and the epistemic virtue that Daston and Galison oppose to mechanical objectivity: “trained judgment,” which legitimated subjective judgments through professional training and institutions that arose in the twentieth century.67 However, a closer look at Tukey’s work shows that this virtue captures his positions in a very partial way. While he celebrated subjective judgment and thought deeply about the proper forms of scientific education and training, he also chafed at the conformity professionalization produced. He valued pluralism over the guild’s reliance on inferential statistics to “sanctify” results and protect the behavioral sciences from outside criticism, producing a false optimism. His empiricism was neither a naïve realism nor the “physiognomic sight” that Daston and Galison date roughly to this period. To be sure, both of these eschew exclusively quantitative criteria; approximate accuracy was to be preferred over numerically exact objectivity. But while Daston and Galison emphasize the holistic or unconscious means by which midcentury scientists drew out family resemblances, Tukey emphasized a creative sensory engagement with data.68

VII.  Historical Epistemology

The differences between Tukey’s epistemology and ethos of data analysis and the context of the postwar sciences should now be clear. The question remains how to account for these differences. What motivated them? In external or sociological terms, Tukey’s scientific confidence doubtless stemmed from his position as a singular elite, whose mathematical abilities and broad understanding of science allowed him to operate with a high degree of autonomy in academic, corporate, and governmental circles. This also differentiates him from the collective professional standards on which trained judgment, as described by Daston and Galison, rests. Many of the new behavioral scientists felt pressure to use quantification as a means both to initiate new members into their professional guild and to defend against powerful external skeptics, political imperatives that ironically contributed to the subsequent crisis in their influential order. Tukey, however, could speak to behavioral scientists as a friendly but ultimately disinterested outsider. The future of statistics was secure; it would be up to behavioral scientists to make the best use of it.

This attitude shares something with the culture of elite, technocratic Ponts et Chaussées engineers analyzed in Porter’s study of mechanical objectivity. For this nineteenth-century corps, quantification operated from a position of institutional strength. Numbers were less likely to be used as an objective means to protect against outsiders, and within the group, ambiguity, subjective judgment, and uncertainty could be openly discussed.69 Tukey was nothing if not an elite and in many ways an aloof one, which made him freer to speak critically about the limits of overly mathematized inferential techniques and the mechanical application of tests, although it is notable that he tended to save his harshest criticisms for privately circulated manuscripts like “Badmandments.” Even though Tukey continues to influence critics of the abuse of statistical testing and inferential impasses, it would be misleading to cast him as having resolved these problems in any definitive way. Instead he suggested alternative strategies. His own epistemic values of pluralism and commitment to perception and judgment are attractive in many ways. We still, as Tukey maintained, need to act in the face of irreducible uncertainty, even when these actions can be called arbitrary. But these values have their own epistemological and political vulnerabilities. They provide little guarantee of consensus or even decision; they are deeply connected to the personal or subjective qualities of the researcher; they reflect an elite sensibility that is likely difficult to democratize. Mechanical forms of inference, for all the flaws that Tukey identified, address these concerns.

What differentiated Tukey from closed circles of elites was his commitment to scientific training and practice in collaborative modes. His interest in communication was based on the hope that statistical insights might be responsibly discussed rather than mechanically codified. The roots of this position were grounded in Tukey’s comparative understanding of the history of the sciences. Faith in a redemptive history of science unified two strands of Tukey’s thought: strong criticisms of the nascent inferential order and his view that the behavioral sciences needed to pierce through the illusions of certainty and quantitative strictures by means of empirical techniques and subjective judgment. More specifically, Tukey tried to make better comparisons among different scientific disciplines. Behavioral scientists’ desire to imitate the formal physical models could be misleading not because of any qualitative difference in their objects but rather because these sciences were at different points on their historical trajectories. The proper models for the behavioral sciences were not the “completed edifices” of the relatively mature physics of the postwar period.70 Rather, Tukey stressed that behavioral scientists should look to the older physical sciences at earlier moments of their development for lessons. These sciences had been built on the sort of painstaking empirical and experimental work for which Tukey advocated in the behavioral sciences. If inferential statistics mechanized the behavioral sciences, divorcing them from the “arts of empirical approximation” and subjective acts of “thinking, judging, noticing, [and] feeling,” behavioral scientists would fail to learn from the historical norms that had proven so profitable elsewhere.71

The “Badmandments” manuscript, for all its polemical fervor, ended on a positive note, in which Tukey enumerated a smaller set of “goodmandments.” What unified these “goodmandments”? A historical directive: “In building new sciences, look to how the elder sciences actually were built,” as opposed to their retrospective self-presentation. Tukey advocated a critical historical sensibility that drew attention to scientific practices in their diversity. The behavioral scientists that Tukey encountered at CASBS should not look to the imposing monuments left by these sciences but rather to “the actual practice of scientists” during their formative periods.72 Like all historical theories, Tukey’s was not neutral, and it is at the source of the “uninhibited,” polemical character of this work. His history relied on its own debatable assumptions, notably that despite a plurality of disciplinary time lines, newer sciences would nonetheless develop in the same manner as the sciences that preceded them. These more mature sciences could still serve as models, provided that those working in newer sciences focused on historical practices rather than finished states. This question of unity has been the subject of debate in the history and philosophy of science, and it is notable that the sciences over which the inferential order reigned tended to be those with complex, variable, and unstable objects, from agriculture, to medicine, to social policy.73

Thus, although Tukey was a singularity and cleared a road that until recently has not been well traveled by behavioral scientists, he can nonetheless tell us about sources of epistemic values and commitments that have come to a head in contemporary crises of inference. Beyond bromides about facts and values, knowledge and power, studying Tukey’s strange writings allows us to perceive differences, surfacing rather than suppressing the sources of epistemic commitments. His willingness to risk awkward expressions was the price to be paid for questioning normally unspoken assumptions of inferential knowledge and attempting to ground alternative norms. His works reveal the formation of the values and ethics at the source of scientific knowledge, what Daston and Galison call “scientific selves.”74 These selves are not always forged through the gradual accumulation of virtues, impersonal forces, or agentless processes of change but rather in the polemical discourses of those like Tukey, reflecting on their own scientific practice. History becomes a powerful source of self-understanding and subjectivity when scientists understand the provisional, historical nature of the knowledge they make. In our present, machine learning and other inferential techniques for analyzing big data have exacerbated many of the epistemic problems that Tukey diagnosed. We may be more or less sanguine about the nature of scientific progress than he was, but we will need to think carefully, as he did, about epistemic values and the politics of inferential knowledge to incorporate these techniques into trustworthy democratic institutions.

Notes

Archival research for this article was generously supported by the Leon and Joanne V. C. Knopoff Library Resident Research Fellowship at the American Philosophical Society. The article also benefited from conversations and additional material shared by Stephen M. Stigler.

1. John W. Tukey, “Data Analysis and Behavioral Science or Learning to Bear the Quantitative Man’s Burden by Shunning Badmandments,” in The Collected Works of John W. Tukey, vol. 3, Philosophy and Principles of Data Analysis: 1949–1964, ed. Lyle V. Jones (Pacific Grove, CA: Wadsworth & Brooks, 1986), 312.

2. Ronald L. Wasserstein and Nicole A. Lazar, “The ASA Statement on p-Values: Context, Process, and Purpose,” American Statistician 70, no. 2 (April 2, 2016): 131–33, https://doi.org/10.1080/00031305.2016.1154108.

3. Gerd Gigerenzer and David J. Murray, Cognition as Intuitive Statistics (Hillsdale, NJ: Lawrence Erlbaum Associates, 1987).

4. Gerd Gigerenzer, “Statistical Rituals: The Replication Delusion and How We Got There,” Advances in Methods and Practices in Psychological Science 1, no. 2 (2018): 200, https://doi.org/10.1177/2515245918771329.

5. Joel Isaac, “The Human Sciences in Cold War America,” Historical Journal 50, no. 3 (2007): 725–46.

6. Theodore M. Porter, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life (Princeton, NJ: Princeton University Press, 1995); Lorraine J. Daston and Peter Galison, Objectivity (New York: Zone Books, 2007).

7. Porter, Trust in Numbers, 4.

8. Daston and Galison, Objectivity, 143.

9. Paul Erickson et al., How Reason Almost Lost Its Mind: The Strange Career of Cold War Rationality (Chicago: University of Chicago Press, 2015), 3–4.

10. Theodore M. Porter, “Foreword: Positioning Social Science in Cold War America,” in Cold War Social Science: Knowledge Production, Liberal Democracy, and Human Nature, ed. Mark Solovey and Hamilton Cravens (New York: Palgrave Macmillan, 2012), ix. For an analysis of the relationship between physics and economics, see Philip Mirowski, More Heat Than Light: Economics as Social Physics, Physics as Nature’s Economics (Cambridge: Cambridge University Press, 1991).

11. John W. Tukey, “The Future of Processes of Data Analysis,” in The Collected Works of John W. Tukey, vol. 4, Philosophy and Principles of Data Analysis: 1965–1986, ed. Lyle V. Jones (Pacific Grove, CA: Wadsworth & Brooks, 1986), 540.

12. Porter, Trust in Numbers, 5. Porter’s discussion of the ways in which some accountants worked to protect their professional discretion and judgment in the name of realism has interesting similarities with Tukey.

13. Tukey, “Data Analysis and Behavioral Science,” 188.

14. Paul N. Edwards, The Closed World: Computers and the Politics of Discourse in Cold War America (Cambridge, MA: MIT Press, 1996).

15. Porter, Trust in Numbers, 89.

16. Matthew L. Jones, “How We Became Instrumentalists (Again): Data Positivism since World War II,” Historical Studies in the Natural Sciences 48, no. 5 (November 2018): 673–74, https://doi.org/10.1525/hsns.2018.48.5.673.

17. By now the literature on this crisis is vast. For a good early introduction, see Harold Pashler and Eric-Jan Wagenmakers, “Editors’ Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence?,” Perspectives on Psychological Science, November 7, 2012, https://doi.org/10.1177/1745691612465253. Readers may also consult Adam Morton’s more recent article in KNOW: “Evidence-Based Beliefs?,” KNOW: A Journal on the Formation of Knowledge 1, no. 2 (2017): 339–51, https://doi.org/10.1086/693355.

18. John P. A. Ioannidis, “Why Most Published Research Findings Are False,” PLoS Medicine 2, no. 8 (2005): e124, https://doi.org/10.1371/journal.pmed.0020124.

19. For a wider view, see Mario Biagioli and Alexandra Lippman, eds., Gaming the Metrics: Misconduct and Manipulation in Academic Research (Cambridge, MA: MIT Press, 2020).

20. John W. Tukey, “The Philosophy of Multiple Comparisons,” Statistical Science 6, no. 1 (1991): 100–116.

21. Ronald Aylmer Fisher, “The Arrangement of Field Experiments,” Journal of the Ministry of Agriculture of Great Britain, no. 33 (1926): 503–13.

22. Stephen Stigler, “Fisher and the 5% Level,” CHANCE 21, no. 4 (December 1, 2008): 12, https://doi.org/10.1007/s00144-008-0033-3.

23. David H. Krantz, “The Null Hypothesis Testing Controversy in Psychology,” Journal of the American Statistical Association 94, no. 448 (1999): 1376, https://doi.org/10.2307/2669949.

24. Jacob Cohen, “The Earth Is Round (p < .05),” American Psychologist 49, no. 12 (1994): 997, https://doi.org/10.1037/0003-066X.49.12.997.

25. Gigerenzer and Murray, Cognition as Intuitive Statistics.

26. Jacob Cohen, “Things I Have Learned (So Far),” American Psychologist 45, no. 12 (1990): 1309, https://doi.org/10.1037/0003-066X.45.12.1304.

27. Gigerenzer, “Statistical Rituals,” 200.

28. He cites a single psychological study comparing cultural rituals and obsessive-compulsive disorders: Siri Dulaney and Alan Page Fiske, “Cultural Rituals and Obsessive-Compulsive Disorder: Is There a Common Psychological Mechanism?,” Ethos 22, no. 3 (1994): 243–83, https://doi.org/10.1525/eth.1994.22.3.02a00010.

29. For an introductory overview to this vast literature, see Catherine Bell, Ritual: Perspectives and Dimensions (New York: Oxford University Press, 1997).

30. Wasserstein and Lazar, “ASA Statement on p-Values,” 130.

31. Gigerenzer’s account of the way that the current inferential order emerged from a highly unstable mixture of ideas from R. A. Fisher, on the one hand, and Jerzy Neyman and Egon Pearson, on the other, is exemplary. See Gerd Gigerenzer et al., “The Inference Experts,” in The Empire of Chance: How Probability Changed Science and Everyday Life (Cambridge: Cambridge University Press, 1989), 70–122.

32. Cohen, “Things I Have Learned (So Far),” 1310.

33. Gigerenzer, “Statistical Rituals,” 212.

34. Recently a full biographical treatment of Tukey has appeared, although it is unclear whether it draws on Tukey’s archives held at the American Philosophical Society: Mark Jones Lorenzo, Adventures of a Statistician: The Biography of John W. Tukey (Philadelphia: CreateSpace Independent Publishing Platform, 2018).

35. F. R. Anscombe, “Quiet Contributor: The Civic Career and Times of John W. Tukey,” Statistical Science 18, no. 3 (2003): 292, https://doi.org/10.2307/3182747.

36. David R. Brillinger, “John W. Tukey’s Work on Time Series and Spectrum Analysis,” Annals of Statistics 30, no. 6 (2002): 1595–1618.

37. S. M. Amadae, Rationalizing Capitalist Democracy: The Cold War Origins of Rational Choice Liberalism (Chicago: University of Chicago Press, 2003), 7.

38. Hendrik Bode, Frederick Mosteller, John W. Tukey, and Charles Winsor, “The Education of a Scientific Generalist,” Science 109, no. 2840 (1949): 553–58, https://doi.org/10.2307/1676674.

39. R. A. Fisher, Statistical Methods and Scientific Inference (Edinburgh: Oliver & Boyd, 1956).

40. Tukey, “Data Analysis and Behavioral Science,” 187. Other manuscripts and working documents are preserved among Tukey’s papers. See “Data Analysis and Behavioral Science or Learning to Bear the Quantitative Man’s Burden by Shunning Badmandments,” 1960, Series II, Works by Tukey, 3 folders, John W. Tukey Papers, American Philosophical Society Library, Philadelphia, PA.

41. Leslie Kish, “Letter from Leslie Kish to John W. Tukey,” November 15, 1965, Series X, Research, box 1, John W. Tukey Papers.

42. Tukey, “Data Analysis and Behavioral Science,” 197.

43. Tukey, “Data Analysis and Behavioral Science,” 188.

44. Tukey, “Data Analysis and Behavioral Science,” 198.

45. Tukey, “Data Analysis and Behavioral Science,” 291–93. Porter makes a similar point: that the rules of statistical inference are often enforced in response to suspicious outsiders. See Porter, Trust in Numbers, 214.

46. Tukey, “Data Analysis and Behavioral Science,” 293.

47. Tukey, “Data Analysis and Behavioral Science,” 294.

48. Tukey, “Data Analysis and Behavioral Science,” 294.

49. Tukey, “Data Analysis and Behavioral Science,” 297.

50. Gigerenzer, “Statistical Rituals,” 214.

51. Tukey, “Data Analysis and Behavioral Science,” 220.

52. Tukey, “Data Analysis and Behavioral Science,” 221.

53. Daston and Galison, Objectivity, 372–74.

54. Tukey, “Data Analysis and Behavioral Science,” 212.

55. Tukey, “Data Analysis and Behavioral Science,” 220–21.

56. Tukey, “Data Analysis and Behavioral Science,” 288.

57. Tukey, “Data Analysis and Behavioral Science,” 225.

58. John W. Tukey, Exploratory Data Analysis (Reading, MA: Addison-Wesley, 1977).

59. Tukey, “Data Analysis and Behavioral Science,” 222.

60. Tukey, “Data Analysis and Behavioral Science,” 247.

61. Tukey, “Data Analysis and Behavioral Science,” 247.

62. Tukey, “Data Analysis and Behavioral Science,” 227.

63. John W. Tukey, “What Have Statisticians Been Forgetting?,” in Jones, Collected Works of John W. Tukey, 4:589.

64. Tukey, “What Have Statisticians Been Forgetting?,” 589.

65. Tukey, “What Have Statisticians Been Forgetting?,” 589.

66. Orit Halpern, Beautiful Data: A History of Vision and Reason since 1945 (Durham, NC: Duke University Press, 2014), 95.

67. Daston and Galison, Objectivity, 319–24.

68. Daston and Galison, Objectivity, 314–35.

69. Porter, Trust in Numbers, 138.

70. Tukey, “Data Analysis and Behavioral Science,” 313.

71. Tukey, “Data Analysis and Behavioral Science,” 312–13.

72. Tukey, “Data Analysis and Behavioral Science,” 312–13.

73. This debate no longer seems very fashionable. For two of many statements on disunity, see Nancy Cartwright, How the Laws of Physics Lie (Oxford: Oxford University Press, 1983); and John Dupré, The Disorder of Things: Metaphysical Foundations of the Disunity of Science (Cambridge, MA: Harvard University Press, 1993).

74. Daston and Galison, Objectivity, 39.