The Decline Effect - Why Most Published Research Findings are False
In a compelling and important programme broadcast on 26 August 2014 on BBC Radio 4, produced and presented by Jolyon Jenkins, the distinguished Greek-American professor of health research and policy, and of statistics, at Stanford University, John Ioannidis, claimed that most of what we take to be true in published research papers turns out, in time, to be either false or much less likely to be true than we initially thought (“the decline effect”). He wrote a very influential paper entitled “Why Most Published Research Findings Are False”, in which he argued that fewer than half of scientific papers can be believed, and that the more research teams are involved in a particular field, the less likely its findings are to be true.
Hal Herzog, a professor of psychology at Western Carolina University, author of Some We Love, Some We Hate, Some We Eat, and involved for many years in research on the influence of pets on their owners, pointed out that most people believe this influence is very positive. In a one-year follow-up study, for instance, Erika Friedmann (1980) claimed that having a pet decreased high blood pressure in 100 people who had suffered a heart attack and thus increased their survival time. However, Herzog said, this has not been confirmed in many subsequent studies. When asked how the influence of having a dog could fail to be positive, he listed some of the many ways: the dog might bark and make your neighbour angry with you, you might be bitten by it, you might trip over it and injure yourself, or you might not bond with it but still have to take it out every day and scoop up its faeces.
Daniele Fanelli of the University of Montreal chose another example: second-generation antipsychotic drugs have turned out to be much less effective than first-generation antipsychotics. It is as if the truth were a perishable commodity which wears off.
At the heart of research is the idea of replicability, namely the notion that initial findings need to be confirmed by subsequent findings if they are to be regarded as likely to be true. But what if the same experiment done under the same conditions in different laboratories across the world yields different results? This is what John Crabbe, a behavioural neuroscientist at Oregon Health & Science University, found in a study of the same genetic strain of mice in three different labs in North America: the same mice in Edmonton, Alberta, were generally less active than those in the other two labs. One explanation might be that the mice responded differently to different experimenters, for instance to their different smell.
Ioannidis carried out a study in 2004 of the 49 most frequently cited papers in medical journals and found that only 34 of them had been retested, and that 41% of these, i.e. almost a third of the 49, were found to be wrong, although they were still being cited. He also noted that even when the investigators of the best studies tried to replicate their own results, they failed in 70-90% of cases (this was shown in two papers). With studies linking genes to disease the failure-to-replicate rate was even higher, at 98-99%. He concluded that the average paper in a scientific journal is probably not true.
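The arithmetic behind these figures can be checked in a few lines (a sketch using only the numbers reported above):

```python
# Checking the arithmetic behind the reported 2004 figures:
# 49 most-cited papers, 34 retested, 41% of the retested found wrong.
cited = 49
retested = 34
wrong = round(0.41 * retested)  # 41% of the 34 retested papers

print(wrong)           # 14 papers found wrong
print(wrong / cited)   # ≈ 0.29 of the original 49, i.e. "almost a third"
```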
What are the reasons for this? One reason is that too many studies are too small: they have too few subjects to attain statistically significant results, or they are “underpowered”, that is to say they have an insufficient capacity to detect an effect which does exist, and a higher proportion of the effects they do report turn out not to exist. When an underpowered study finds an effect which is true, that effect may well seem bigger than it actually is, a phenomenon known as the “winner’s curse”. The term is used because investigators who find a genuine effect are in that sense “winners”, but they are also “cursed” because the effect seems more important than it really is. Underpowered studies don’t make scientific sense, but they make sense in terms of advancing a scientific career. If you have limited financial resources with which to carry out research, you don’t want to spend them all on a single large study which might come up with nothing; in this context several smaller, underpowered studies seem sensible, as a way of hedging your bets. If the smaller studies produce inconclusive or inconsistent results, the line of research continues to be pursued, whereas a single large, well-funded study might have produced a conclusive result at the outset, leaving much less cogent grounds for further studies in the area. In short, insufficient financial resources are driving scientists to produce more and more unreliable results.
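The winner’s curse can be illustrated with a small simulation (a hypothetical sketch, not from the programme; the effect size and sample size are assumptions chosen for illustration). Many small studies are drawn from a population with a modest true effect; among the studies that happen to cross the significance threshold, the average estimated effect is much larger than the true one:

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.2  # modest true effect, in standard-deviation units (assumed)
N = 20             # subjects per study: small, i.e. underpowered (assumed)
STUDIES = 5000     # number of simulated studies

all_estimates = []
significant_estimates = []

for _ in range(STUDIES):
    # One underpowered study: N noisy observations of the true effect.
    sample = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    all_estimates.append(mean)
    # Crude significance test: keep the study only if the effect
    # is about two standard errors above zero.
    if mean / se > 1.96:
        significant_estimates.append(mean)

print("true effect:                 ", TRUE_EFFECT)
print("mean estimate (all studies): ", round(statistics.mean(all_estimates), 3))
print("mean estimate (significant): ", round(statistics.mean(significant_estimates), 3))
```

Across all studies the average estimate is close to the true effect, but the “significant” subset, the only one likely to be published, overestimates it by a factor of two or more.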
Over the last 20 to 40 years scientists have been reporting more and more positive results (Fanelli). One explanation for this is that journal editors don’t want to publish negative results, but this does not explain why the psychological literature, for instance, contains many more positive results than, say, astrophysics. Another possibility is that scientists themselves “sit on” their negative results. Hal Herzog, the previously mentioned “dog man”, found, by asking around at conferences, three unpublished studies showing that pets had no beneficial effects on their owners, perhaps, he thought, because many of the researchers are animal lovers and don’t like their own results.
However, the most important reason for unreliable results is career pressure: scientists, above all in the USA, are expected to publish papers, especially widely cited ones, if they are to advance in their careers, even though linking career advancement to publication impact measured in this way is itself unscientific practice. It becomes obligatory to inflate the importance of your findings. If a scientist spends two years on a study and finds nothing, the paper she writes about it probably won’t be accepted by a major journal. Alternatively she can “dredge the data” and come up with some results that seem interesting, possibly even statistically significant, but she has then deviated from her original intention to test a particular hypothesis and entered an area where she can find almost any result she wants. You can find a pattern in any data, just as you can detect the outline of faces in shadows or passing clouds. Ioannidis said that there is no result which cannot be made to seem plausible, even if it is a complete red herring. This is a kind of subtle, “allowable” fraud, which Fanelli calls “grey areas”. 30% of scientists, he claimed, admit to writing a paper which presents a chance finding as its main hypothesis, and they are prepared to admit this because they don’t think it is entirely wrong to do so.
To give an example, imagine that the hypothesis to be tested is that a particular drug lowers blood pressure. Half of the people studied are given a placebo; the other half, who are given the new drug, turn out to have the same blood pressure as before, but the male subjects among them show increased hair growth. A publishable paper can now be written claiming that the new drug promotes hair growth in men. The only problem is that this might well be a chance finding. The paper mentions neither the original hypothesis (that the new drug would lower blood pressure) nor the possibility that the hair-growth result is due to chance. The scientists involved fool themselves into believing they have found something truly important.
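Why dredging almost always turns something up can be shown with a short simulation (an illustrative sketch; the number of outcomes is an assumption). If a trial records 20 unrelated outcomes, hair growth, mood, appetite and so on, none of which the drug actually affects, and each is tested at the conventional 5% significance level, the chance that at least one crosses the threshold by luck alone is about 64%:

```python
import random

random.seed(0)

OUTCOMES = 20   # unrelated outcomes measured in the trial (assumed)
TRIALS = 10000  # number of simulated trials
ALPHA = 0.05    # conventional significance threshold

false_discoveries = 0
for _ in range(TRIALS):
    # Each outcome independently has a 5% chance of looking
    # "significant" purely by chance.
    if any(random.random() < ALPHA for _ in range(OUTCOMES)):
        false_discoveries += 1

rate = false_discoveries / TRIALS
print(f"chance of at least one spurious 'finding': {rate:.2f}")
# theory: 1 - 0.95**20 ≈ 0.64
```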
But surely, you might think, the truth will out. In fact little effort goes into replicating research findings. The psychologist Brian Nosek, of the University of Virginia, quoted a study published in 2012 which claimed that only 1% of the published literature could be described as replications of prior results. You cannot make a career out of replications, i.e. get published in prestigious journals and get money from grant funders: an “ecosystem” of universities, journals and grant funders rewards only the next innovation, at the expense of replication. It may also be risky, especially for young researchers, to attempt a replication that ends up undermining the results of an influential senior researcher, or to produce new results which do so. One route to success is neither to replicate nor to innovate, but to tinker a little with known results and test some of their marginal details. Another is to be innovative without clashing with existing theories. If experienced researchers have developed a theory from their own results they will, consciously or unconsciously, try to defend it, and as a consequence they tend to generate further results which fit that theory.
The Reproducibility Project, involving more than 150 scientists worldwide, was set up to address some of these issues. Studies were chosen according to agreed criteria from three noted psychology journals in 2008. Only about a third of the replication attempts reproduced the original findings, which suggests that there is less reproducibility than is generally assumed.
Sometimes bitter and personalized battles ensue. Jenkins quoted the example of the psychologist Simone Schnall, Director of the Cambridge Embodied Cognition and Emotion Laboratory. She did a study showing that people who had been primed with sentences about cleanliness (and so felt clean) made less severe moral judgments. A US study trying to replicate these results got different ones, and the lead American researcher described Schnall’s results on his blog as an “epic fail”, despite the possibility that the US students, unlike the British students Schnall had studied, may simply have had stronger moral judgments. Schnall was subjected to attacks on social media; one senior Harvard researcher described some of the replicators as “shameless little bullies”. Schnall continues to believe that her results are right, but thinks her reputation has been damaged by the attacks: she noted that in a recent interview for a substantial grant somebody had questioned her work.
Although, as already mentioned, researchers usually try to defend their own results, discovering where we go wrong is what leads to great science. The Nobel-Prize-winning physicist Richard Feynman once wrote: “The first principle is that you must not fool yourself, and you are the easiest person to fool”. The second part of this statement echoes one of the most important conclusions reached by another Nobel-Prize winner, the psychologist Daniel Kahneman, in his very insightful book “Thinking, Fast and Slow”, and has been called the “bias blind spot”: he is sceptical about our ability to notice our own cognitive biases.
Ioannidis pointed out that there are some 15 million scientists worldwide publishing papers, but that there are not 15 million major discoveries. The probability is that most scientists will work very hard, but not come up with any major discoveries. He thought it was perfectly alright to admit this, and to respect their effort. But the problem is not so much that the vast majority of scientists are not finding anything out, but that they have to pretend to themselves that they are - for instance in order to get grant money. Ioannidis concluded: “This is what, I think, we need to change. We can promise that we will do our best, but we cannot promise that we will save the world”.
I have attempted to summarize the main points made in the radio programme on which this is based as accurately as I could, and in some passages I have quoted what was said word for word. I have added a comment of my own, namely about Daniel Kahneman's views. There may be minor errors in the transcription of the programme, but not, I hope, major errors in the presentation of the main points that were made.
Paul Crichton, London, 15 September 2014