After 10 Years, ‘Many Labs’ Comes to an End – But Its Success Is Replicable

May 23, 2022 By Eric Williamson, williamson@virginia.edu

Brian Nosek, a professor of psychology at the University of Virginia who studies human biases, was a graduate student in the 1990s when he first started thinking closely about one of science’s open secrets: failure to replicate research findings is a common occurrence.

“At that time, the textbooks had been talking for 30 years about the problem,” Nosek said. “I thought, ‘What’s going on here? Is there actually a problem?’”

He couldn’t have known then that this fascination would someday make him one of the most hated figures in science, at least for a while, and also one of the most respected.

Replication is integral to the scientific method. It’s the lather-rinse-repeat approach that either strengthens confidence that a theory is legitimate or forces its reconsideration. If results fail to replicate, was something off in the experimental process? If not, are the conclusions wrong?

But with careers and grant monies so often in the balance, the incentives align with generating new discoveries over running old experiments repeatedly.

“Replication is a core concept of science, and yet it’s not very often practiced,” said Nosek, who is also co-founder and executive director of the independent nonprofit Center for Open Science, which advocates for transparency in research methods and data. “Innovation without verification gets us nowhere. But any individual researcher has choices to make about how they spend their time. The answer is pretty easy when thinking about career advancement: Try something new.”

UVA psychology professor Brian Nosek is an expert in human biases, and co-founder and executive director of the independent Center for Open Science. (Photo by Dan Addison, University Communications)

Ten years ago, mindful of the trend of research being increasingly generated, but not increasingly revisited, Nosek and colleagues decided to re-run a series of published scientific experiments, creating the “Many Labs” project. The global effort, which at times has been both headline-grabbing and apple cart-turning, wrapped up at the end of April.

“It’s hard to overstate how central Brian Nosek’s role in the reform movement, sometimes called the ‘replication crisis’ or ‘credibility revolution,’ has been,” commented Simine Vazire, a professor at the University of Melbourne who studies psychology ethics and science’s ability to self-correct. “Brian has been a leader in the movement, a diplomat reaching out across sharp dividing lines in our field, and someone who gets things done.

“Among the many projects that Nosek and the Center for Open Science made possible, the Many Labs projects, which Brian collaborated on, but which were individually led by Richard Klein and Charlie Ebersole, are among the most impressive and important for science. Each of these five projects was a gargantuan effort, and tackled a question that had been the topic of much debate, but virtually no empirical tests.”

‘#Repligate’ Lights Up the Scientific Community

The Many Labs revolution has a complex history, with Nosek running similar projects simultaneously. He began his initial Many Labs replication studies in 2011. It wasn’t long before the scientific community was feeling “twitchy,” The Guardian reported.

Nosek reflected, “Because performing replication was so unusual, when someone said, ‘I want to replicate your study,’ instead of the person taking it as a compliment, it was often considered a threat. Like, ‘What, you don’t trust me? What’s up?’”

He thinks of 2012 as when Many Labs officially started, however. That’s when the journal Social Psychology accepted his and Dutch researcher Daniël Lakens’ pitch to serve as guest editors for a special issue with a unique approach. They would invite researchers to submit proposed replications and put the designs through peer review before knowing the results, so that no one would be biased for or against the studies based on the outcomes.

The first Many Labs project was one of the papers in this special issue, and its approach was slightly different: It tested the replicability of 13 classic and contemporary studies.

A sampling of the inquiries for that component included: Can people’s behavior be “primed” by visual cues? Would a quote be perceived differently if someone thought it came from Thomas Jefferson versus Soviet founder Vladimir Lenin? Can a chance observation – in this case a dice roll – influence thoughts about what might have happened prior?

Overall, the crowd-sourced Many Labs experiments spanned 36 independent research sites, utilizing a total of 6,344 participants.

Recognizing the depth of the challenge, Nosek started with a simple design.

“With Lab 1, we wanted to look at a number of different findings in an easy-to-transfer protocol,” Nosek said. Using a combination of online and in-person methods, “We chose research findings in which the study and procedure could be done in 1 to 3 minutes.”

At first blush, the results might not have seemed revolutionary. There was arguably good news: 10 of the 13 studies replicated their original findings, a much higher percentage than Nosek and colleagues might have expected.

Rick Klein (leader of Many Labs 1, 2, and 4), Charlie Ebersole (leader of Many Labs 3 and 5), and Olivia Atherton (leader of Many Labs 3) enjoy some down time from the research. (Contributed photo)

But the news was bad for the remaining three studies. There was only weak support for one theory – that imagining contact with someone can reduce prejudice – and for the theory of “priming” in general, which was integral to the other two studies.

In one of the two, participants were shown a subtle image of an American flag to see if it would influence their sense of conservatism. No strong evidence could be found for the visual cue having an effect on their subsequent behavior. With priming having previously been a well-accepted theory, social media users wagged their tongues at this and related findings from the special issue, often flagging the conversation as “#repligate.”

Nosek, with 269 collaborators, followed Many Labs the next year with concurrent research titled “The Reproducibility Project: Psychology.”

While not formally part of Many Labs, the work continued its themes. The researchers conducted replication attempts on research published in three journals in 2008. They were only able to reproduce results in fewer than 50 of the 100 cases – worse odds than flipping a coin.

The marquee observation was that, ironically, 97 of the original studies claimed significant effects. Even the studies that did successfully replicate didn’t usually do so with as much oomph as originally touted. In a few worst cases, the replicating researchers even found effects that ran opposite to the initial findings.

Does loneliness amplify a belief in the supernatural? Maybe, but researchers couldn’t confirm a connection. The same goes for a causal link between racial prejudice and how people responded to pictures of ethnically diverse people with guns.

As the new research dropped, The Atlantic described Nosek as a “respected” and “level-headed figure” who expressed no glee in the outcomes. In fact, when he found, oddly, that the cognitive psychology experiments (about how we think) were about twice as likely to replicate as the social psychology experiments (involving how people influence each other), Nosek cringed for fear of scientific in-fighting.

If people were talking, though, that was what mattered most, he told himself.

“Little of the criticism I received suggested that I was doing something misaligned with my principles or values,” he said. “Most of the criticism was that I might embarrass the field or that I am incompetent. The possibility that the work might illuminate something embarrassing only reinforced my view of the importance of doing the research. If it revealed something embarrassing, all the more important that we know about it to fix it.”

Prior to replication studies like his, potentially flawed research was only whispered about at conferences, he said. Now, disparate findings were being discussed openly. He and his many collaborators carefully prepared every word of their reports with an understanding of the many sensitivities. Yet even so, Nosek was both the hero and the goat for leading the revolution.

“Getting criticized and losing friends was not fun, at all,” he said. “For me, this was a situation in which doing the right thing and doing the popular thing were sometimes in conflict. I never felt any conflict about which path to take.”

But what about any conflicts in how he orchestrated the work?

Recall Nosek’s expertise: biases. While the initial findings focused on where other scientists could be wrong, Nosek was also obsessed with where his own efforts could stray – including any potentially unrealized, or “implicit,” biases in how he set up Many Labs. This awareness set the stage for the labs to come.

Many Labs 2-5: ‘Many’ More Takeaways

Failures to replicate could reflect flaws in either the original experiments or the Many Labs replications themselves, so Nosek wanted to take any criticism he received seriously. Some suggested that certain research didn’t replicate because of geographical differences among participants. Others pointed to the time of year. College students are often recruited for research for the obvious reason that they are accessible to university researchers, and some critics assumed that an exam period, for example, could throw off the results.

“The basic idea was each time we started a new Many Lab, the most important thing we couldn’t resolve in the last one would become an important feature in the next one,” Nosek said. “It seemed wise to lean into the areas of debate.”

So Lab 2 added more types of testing sites in different cultural contexts around the world. Lab 3 adjusted the times during the semester that the researchers conducted the experiments.

While Nosek acknowledged that culture can have an impact on how a research participant responds, culture couldn’t account for whether a finding replicated successfully or not. Similarly, the time of the semester didn’t seem to matter.

Next, Labs 4 and 5 emphasized shoring up any potential variability from the original experimental design, including by allowing original authors greater participation in the replication efforts. Failure to truly understand the original experimental process – because those duplicating it weren’t “the” experts – was a legitimate possible explanation for a failure to replicate.

But every scientist also knew that “p-hacking” was another possibility.

“The ‘p’ refers to a statistical outcome used in most science to decide whether an observed finding is unlikely to have occurred by chance,” Nosek explained. “Researchers are motivated to find low p-values so that they can claim to have discovered something. But, in ordinary practice, researchers have flexibility in the decisions they make when they analyze their data, which might erroneously increase the occurrence of low p-values.”
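
To make that concrete, here is a minimal, hypothetical Python simulation – not drawn from the article or any Many Labs study – of one common form of that flexibility: measuring several outcomes and reporting whichever comparison happens to clear p < .05. Even when the true effect is zero, the “flexible” strategy turns up spuriously low p-values far more often than the nominal 5% rate.

```python
# Hypothetical illustration (not from any Many Labs study) of how analytic
# flexibility inflates false positives: with no true effect, testing several
# outcomes and keeping the best one yields p < .05 far more than 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIMS = 5_000       # simulated "studies"
N_PER_GROUP = 30     # participants per condition
N_OUTCOMES = 3       # outcome measures the flexible analyst can choose among
ALPHA = 0.05

strict_hits = 0      # one pre-specified outcome, tested once
flexible_hits = 0    # test all outcomes, report the lowest p-value

for _ in range(N_SIMS):
    # Both groups are drawn from the same distribution: the null is true.
    control = rng.normal(size=(N_PER_GROUP, N_OUTCOMES))
    treatment = rng.normal(size=(N_PER_GROUP, N_OUTCOMES))

    pvals = [stats.ttest_ind(treatment[:, k], control[:, k]).pvalue
             for k in range(N_OUTCOMES)]

    strict_hits += pvals[0] < ALPHA       # disciplined, preregistered-style analysis
    flexible_hits += min(pvals) < ALPHA   # "pick whichever worked" analysis

print(f"False-positive rate, one planned test: {strict_hits / N_SIMS:.3f}")           # ~0.05
print(f"False-positive rate, best of {N_OUTCOMES} tests: {flexible_hits / N_SIMS:.3f}")  # ~0.14
```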

Nosek’s academic career started by investigating the concept of implicit bias: how people might inadvertently behave in ways that are misaligned with their intentions and ethics.

And he’s studied a number of these biases – including the public belief that if science has a gender, it is probably male.

The bias is often characterized by a lack of self-awareness. So, in theory, a self-respecting scientist could both be appalled at the thought that she would cherry-pick evidence, and yet still be guilty of it.

There are other possibilities in a hypothetical laundry list of scientific errors, but Nosek said introspection is required as the process of elimination continues to scratch them off. He himself never rules out the possibility of his own incompetence being the reason that he failed to replicate others’ findings. After all, how would an incompetent person know?

“The other major criticism is: ‘You guys are just idiots,’” he said. “We might be idiots, but we constructed our replication studies to minimize the likelihood of that by getting materials and advice from original authors in advance, by preregistering our design and analysis plan, and by transparently reporting all of our findings and data.

“Also, in a couple investigations, we directly tested whether including more of the original authors’ expertise in the design of the research would improve replicability, and it did not. We may well be idiots, but – outside of assertions by my teenage daughters – we haven’t yet found any evidence to support that.”

The Long Tail of the Many Labs Approach

Over its decade, the sprawling Many Labs endeavor generated five major papers, extending the evidence provided by The Reproducibility Project: Psychology, and put numerous theories to the test. One, called terror management theory, which asserts that human beings embrace certain mindsets in order to tamp down the fear of death, didn’t fare well in new trials. Another, a Nobel Prize-winning theory positing that losses loom larger in people’s minds than gains, seemed to hold up.

But, more importantly, Many Labs put to bed assumptions.

“There had been a lot of talk about how psychological findings are very sensitive to these kinds of variations in contexts, and leaders in the field argued, sometimes in the pages of The New York Times, that this explains why replications often fail,” said Vazire, of the University of Melbourne. “This was meant as a defense of the field of psychology – the idea was that failed replications do not represent a symptom of a serious problem, but are to be expected even when research is conducted to the highest standards. But all of these claims were based on speculation – no one bothered to test whether these contextual factors actually matter, and no one had bothered to check whether experts, who supposedly know just the right conditions that should produce the effect, can actually get the effects to replicate.”

Lab 5’s results on terror management, published April 29, marked the end of the Many Labs component of Nosek’s replication work. “That program of research accomplished what I wanted to accomplish,” he said. “Time to move on to bigger and better things that build on it.”

But instead of being punctuated by a period, the experiments ended with an ellipsis.

“It has been incredibly gratifying to see a dozen other groups form and adapt the Many Labs concept in their own area of research,” Nosek said. “There are many ‘Manys’ now – Many Babies, Many Smiles, Many Dogs, Many Birds, Many EEGs, etc. Each of those projects are examining replicability of findings in their fields with big collaborative teams. It is very gratifying to see communities form and advance this approach across the scientific community.”

The conversation about how much trust the public has, or should have, in published research also continues – as does the hard work of a community focusing inward.

“There is a lot still to do,” he said, “but there is a strong community spirit to improve, and tangible results indicating that some of the reforms are working.”

Journals have gotten more discerning in what they accept, as have some grant providers in what they look for methodologically, Nosek noted. Many Labs helped lead that discussion with its “prediction markets,” in which researchers attempted to forecast whether findings would replicate, based on their relative confidence in the stated approaches.

They were right 71% of the time. Red flags include vague methods, for example, or the appearance that only the original researchers would have the know-how to reproduce the results.

Unfortunately, “There is no one thing that clarifies whether a finding is trustworthy or not,” he said. “Research is hard; lots of things can go wrong. For me, the most important thing is for the research to be transparent and open. We can’t expect to get everything right, but we can expect that researchers will show each other how they got to their claims, so that productive critique will occur to reveal error and accelerate progress.”

Nosek earned his master’s and doctoral degrees in psychology from Yale University, after obtaining his bachelor’s degrees from California Polytechnic State University in psychology, women’s studies and computer science. Over his illustrious career, he co-developed the Implicit Association Test, a method that advanced research and public interest in implicit bias. In addition to co-founding the Center for Open Science, he also co-founded two other nonprofit organizations: Project Implicit, and the Society for the Improvement of Psychological Science.

In 2018, the American Association for the Advancement of Science named Nosek a fellow. The group cited him for inclusion with this language: “In the field of psychology, for unsurpassed leadership in the quest to improve openness, integrity, and reproducibility in research, not only for psychological science, but for all of science.”

He’s currently working on a new replication-related project that would use algorithms to provide an early indicator of whether a finding is credible. “Initial evidence is promising,” he said. “And, if it works, it could help to direct attention toward findings that are interesting, but need more work to determine if they are credible.”

So, in the end, how excited should any of us get about the latest trumpeted discovery?

Nosek said new possibilities are always worthy of cautious enthusiasm, but “we should simultaneously recognize that it is uncertain, and be eager to support the follow-on research to assess its reliability and validity.”

Read More

The Complete List of Many Labs Replication Studies, Psychology

A Critique of the Many Labs Projects

Related Research

The Reproducibility Project: Psychology

The Reproducibility Project: Cancer Biology
