[00:00:02.670] - Wesley
You're listening to Journal Entries, a podcast about philosophy and cognitive science, where researchers open up about the articles they publish. I'm Wesley Buckwalter.
[00:00:12.560]
In this episode, Edouard Machery talks about his paper with Benjamin et al., "Redefine Statistical Significance," published in Nature Human Behavior in 2018.
[00:00:21.370]
Edouard is a Distinguished Professor in the Department of History and Philosophy of Science, as well as the Director of the Center for Philosophy of Science at the University of Pittsburgh, where he works primarily on philosophical issues raised by cognitive science and neuroscience.
[00:00:36.850] - Edouard
Yes, so the paper is very simple. What we are arguing is just that we should cut the significance level by an order of magnitude, moving it from .05 to .005, in order to address the replication crisis.
[00:00:50.920]
So that's a very simple message. It's the beauty of the paper really. Simple idea. And the structure is also very simple. We first explain why .05 is the wrong threshold. Then we explain why .005 would be a better threshold. And then we address a few objections.
[00:01:12.950]
I think the history of .05 is quite complicated. In fact, while it was codified by Fisher, Fisher himself actually changed his tune quite a bit with respect to .05. At the beginning he's actually fairly clear that .05 is a good threshold. Later on in his life, he actually thinks there should be no threshold whatsoever. And it also ties in with practices that predate Fisher's work in a complicated manner. So there's actually quite a bit of history of statistics on the matter. I think one of the shortcomings, a tiny shortcoming, of the paper we published in Nature Human Behavior was in fact the history.
[00:01:54.390]
I think we said somewhere that Fisher was responsible for the .05. Well, in fact, the history is much more complicated than that. It's a small tiny point. And I think the history here is fascinating. And one interesting aspect for a historian of philosophy of science like me is how decisions made one century ago are still with us.
[00:02:16.620]
And in fact, they might weigh down science. They get to be part of the way we do science, and somehow we stop reflecting on these decisions, and they frame the type of science we're doing for decades, in fact for more than a century. I think that's a remarkable aspect of science. People think that science is a very self-reflective practice, but in fact, we have a very clear counterexample.
[00:02:46.520]
The notion of a P-value is one of the most misunderstood concepts in science by scientists, which is amazing because every single paper, not every single paper, but pretty much 90 percent of scientific papers use P values.
[00:03:01.530]
But then when you ask a scientist to define a p value, you usually get a mistaken answer. And worse, some textbooks, including some intro to stats textbooks, mischaracterize p values. So what's a p value? Well, a p value is, to simplify a tiny bit but not too much, the probability of the observed data or more extreme data conditional on the null hypothesis. So the null hypothesis specifies a specific distribution of data points in a population.
[00:03:35.680]
Right. And the p value describes the probability of observing the data that we've actually observed, or more extreme data, data that would be further from the null hypothesis, if the null hypothesis is true. It's used as a way to decide whether or not the hypothesis the scientist had in mind is supported by the data. So the way scientists use it is that they set a significance level, a criterion. And the p value has to be below this criterion for them to conclude that the data invalidates the null hypothesis.
[00:04:11.790]
And as a result, supports their own hypothesis. So scientists, for example, will want to say that, let's say, a drug works, and a scientist will create two conditions, one in which people are given a drug, for example, against COVID-19, and one condition in which they're given a placebo, like sugar, for example. And the question is, will the drug make people feel better after a couple of weeks? After a few weeks?
[00:04:41.690]
Now, what they do is they give some people the drug and they give the other people the placebo, in a random manner. And after a few weeks they measure improvements in the health of the participants. They take the difference in improvements between the test condition and the control condition and then they compute the p value: the probability of getting that difference or a larger difference if there is no real difference, if the drug has no impact. And then if that p value is below the significance level, they are able to say, oh, the null hypothesis is false, the hypothesis that there is no difference is false.
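The computation described here can be sketched with a permutation test, which estimates the probability of a difference at least as large as the observed one if the group labels made no difference. This is a minimal sketch, not the procedure of any particular study: the function name `permutation_p_value`, the two-sided comparison, and the improvement scores are all assumptions invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """Estimate a two-sided p value for a difference in group means.

    Under the null hypothesis the labels are interchangeable, so we shuffle
    them and count how often the shuffled difference is at least as large
    as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    n_treated = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical improvement scores, invented for illustration only.
drug = [4.1, 5.3, 3.8, 6.0, 4.7, 5.5, 4.9, 5.1]
placebo = [3.9, 4.2, 3.1, 4.8, 3.6, 4.4, 4.0, 3.7]

p = permutation_p_value(drug, placebo)
print(f"p = {p:.4f}; below .05: {p < 0.05}; below .005: {p < 0.005}")
```

The same p value can clear one threshold and miss the other, which is why the choice of significance level matters.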
[00:05:19.520]
As a conclusion, we can accept the hypothesis that there is a difference between the test condition and the control condition. So the way it's used is you set a significance level, and when the p value is below the significance level, you can reject the null hypothesis and endorse your own hypothesis. For example, the hypothesis that a drug is actually effective, that the drug's actually working, improving people's health. In practice, in many sciences, not in all, of course, but in many sciences, the significance level is put at .05.
[00:05:56.060]
So you need to have a p value lower than .05 to be able to reject the null hypothesis and conclude that your own hypothesis, the one you really want to support, is actually supported by the data you've observed. And the whole point of the paper is to argue, look, .05 is just a bad, bad level. It's way too high. We need to cut it down and cut it down substantially, by at least an order of magnitude.
[00:06:21.840]
I'm going to try to walk you through the argument for why .05 is a weak significance level, and that's a little bit complicated. So the first thing to do is to keep in mind that a p value is not comparative. It does not provide you a comparison between two hypotheses. It just takes the null hypothesis, the hypothesis that there is no difference between your two conditions, for example, and tells you what the probability of the data you've observed, or more extreme data, is conditional on this null hypothesis.
[00:06:57.730]
And so it does not compare two hypotheses. Just one hypothesis, the null hypothesis. It means, for that and other reasons, that it's not quite a good measure of evidence, right. A good measure of evidence would take two competing hypotheses, the null hypothesis and the scientist's hypothesis, which we can call the alternative hypothesis, and say, look, the data supports the alternative hypothesis, the scientist's hypothesis, more than the null hypothesis. So a good measure of evidence would be comparative. And furthermore, one might think that a measure of evidence should really depend only on the data you've observed.
[00:07:38.200]
What you have seen, and not data you have not observed. A p value depends on the data you've observed, but also on these things like more extreme data, which are data you haven't really seen. You could have seen them. But you haven't seen them. So p values aren't really a good measure of evidence. So what we needed to do was to translate p values into a good measure of evidence, and a good measure of evidence, in a Bayesian framework, is the Bayes factor.
[00:08:09.000]
So to compute a Bayes factor you compare two things: the probability of the data you've observed conditional on your own hypothesis, and the probability of the data you've observed conditional on another hypothesis, maybe the null hypothesis, but it could be any other hypothesis. So a Bayes factor, to summarize, is a ratio between the probability of the data you've observed conditional on your own hypothesis and the probability of the data you've observed conditional on another hypothesis.
[00:08:48.430]
Maybe the null hypothesis. Notice that a Bayes factor has two virtues compared to the p value. It's comparative. It tells you, oh, the data supports my hypothesis more than the other hypothesis. And it only depends on the data you've observed. All right. So what we need to do is to find a translation between a p value and a Bayes factor. That's the technical part of the paper. It's not very difficult, but it is a little bit technical.
[00:09:25.580]
So the idea is, if you put the significance level at .05, it means that if you get a p value exactly equal to .05, then you are in a position to reject the null hypothesis and to accept your own hypothesis.
[00:09:44.360]
So the question is, let's suppose you have such a p value. Let's suppose you have a data point or a data set with a p value exactly equal to .05. How much evidence do you have? Which really means, what is your Bayes factor for your own hypothesis and against the null hypothesis? So we need to do this translation from the p value to a Bayes factor. Now, there is no one-to-one translation, and a p value can correspond to many Bayes factors.
[00:10:17.930]
Why? Well, because the Bayes factor depends on the alternative hypothesis. It depends on which alternative you choose. So depending on the alternative hypothesis you choose, there will be different Bayes factors corresponding to the same p value. So that's, of course, a tricky issue. Right. The good news, then, is that you can find the upper bound. You can find the maximum Bayes factor that a p value of .05 would correspond to. So what is this upper bound?
[00:10:49.040]
Well, you need to find the hypothesis, the alternative hypothesis, that best predicts the data point. So the one that's really the best possible predictor of the data you've observed. And of course, if that's the best predictor of the data you've observed, no other hypothesis will make a better prediction. And as a result, the Bayes factor, the ratio of the probability of the data conditional on your own hypothesis and the probability of the data conditional on the null hypothesis, will be maximal for that hypothesis.
[00:11:24.840]
So to find the upper bound, the highest Bayes factor, we need to find the hypothesis that best predicts the data. And what is this hypothesis? Well, that's very obvious. That's the hypothesis that gets it exactly right. So the hypothesis that predicts exactly the data you've observed. No other hypothesis could do better. This one just predicts exactly what you've observed. So we can say that a p value of .05 is going to correspond at most to a Bayes factor comparing the hypothesis that predicts exactly the data you've observed and the null hypothesis.
[00:12:04.860]
So when you do that, when you add a little bit of math, which I will skip here, you can show that a p value of .05 corresponds at most to a Bayes factor equal to about 3. And we want to argue that a Bayes factor equal to about 3 is just not enough evidence to reject the null hypothesis. Next, I want to explain why a Bayes factor of 3 just isn't enough evidence to reject the null.
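The math skipped here can be sketched under stated assumptions. The sketch assumes a two-sided z-test and an alternative that splits its weight evenly between the observed effect and its mirror image, which gives an approximate upper bound of one half times exp(z²/2), ignoring the negligible contribution of the mirror-image point. That choice of alternative and the function name `max_bayes_factor` are assumptions of this sketch, not details stated in the conversation.

```python
from math import exp
from statistics import NormalDist


def max_bayes_factor(p_value):
    """Approximate upper bound on the Bayes factor for a two-sided p value.

    Assumes a z-test and an alternative with half its weight on the observed
    effect and half on its mirror image (hence the factor 0.5).
    """
    z = NormalDist().inv_cdf(1 - p_value / 2)  # z score matching the p value
    return 0.5 * exp(z ** 2 / 2)


for p in (0.05, 0.005):
    print(f"p = {p}: Bayes factor is at most about {max_bayes_factor(p):.1f}")
```

Under these assumptions the bound at p = .05 comes out slightly above 3, while the bound at p = .005 is several times larger.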
[00:12:38.150]
Right. So remember, we can translate a p value equal to .05 to a Bayes factor equal to about 3, slightly larger than 3. And that's the maximum Bayes factor that could correspond to a p value equal to .05. There are a few simplifications here in the background, which I will bracket. Why is a Bayes factor equal to 3 not enough? Well, we give several arguments in the paper, but I think the crucial argument is this: let's suppose you start with an unlikely hypothesis.
[00:13:18.780]
Let's suppose you don't want to test something trivial. And of course, scientists don't want to test anything trivial. Why would they? If they test something trivial and if they get evidence that supports their trivial hypothesis, they will not be able to publish it. Right. Scientists are incentivized to do original science, to make discoveries that are groundbreaking. And that's the only way to publish in the top journals.
[00:13:45.210]
But in fact, to publish at all. A usual way to reject a paper is to say, look, that's interesting, but that's fairly obvious. Why would anyone doubt that that is the case? So scientists usually try to test an unlikely hypothesis, and that really means that the prior probability of this hypothesis is low. It means the hypothesis is more likely to be false than it is to be true.
[00:14:14.580]
One way to think about that is that if we take all the hypotheses scientists test, most of those hypotheses are going to be false. Right. They're testing unlikely hypotheses. So what is the probability that a hypothesis is going to be false? Well, I don't really know, but we can imagine that maybe only one out of ten hypotheses will be true. Nine out of ten hypotheses will be false. People do risky science. Right. So the prior probability of the hypothesis would be maybe .1, which means there's only 1 chance out of 10 that it is true.
[00:14:52.680]
If you were to choose it randomly among all the possible hypotheses to test, well, nine times out of ten you would choose a false hypothesis, and one time out of ten a true hypothesis. We're doing risky science. And if that's true, and if you reject the null hypothesis because you have a Bayes factor equal to 3, well, it turns out that the posterior probability of your hypothesis, meaning the probability of your hypothesis once you've collected the data, remains low: the odds remain in favor of the null hypothesis.
[00:15:34.230]
Right. So if you start with a risky hypothesis and if you reject the null based on a Bayes factor equal to 3, at the end of the day, once you've collected the evidence, you should still be betting on the null hypothesis being true. Right. Because 3 is not going to be enough to move from an unlikely hypothesis to a likely hypothesis. So that's the main reason why we feel that a Bayes factor equal to 3 just isn't enough when you do risky science, and most scientists are actually going to do risky science. All right.
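The arithmetic behind this point follows from the odds form of Bayes' rule: posterior odds equal prior odds times the Bayes factor. A minimal sketch, using the prior of .1 and the Bayes factor of 3 mentioned above; the function name `posterior_probability` is an assumption of the sketch.

```python
def posterior_probability(prior, bayes_factor):
    """Update a prior probability with a Bayes factor in favor of the hypothesis."""
    prior_odds = prior / (1 - prior)            # e.g. .1 -> odds of 1 to 9
    posterior_odds = prior_odds * bayes_factor  # odds form of Bayes' rule
    return posterior_odds / (1 + posterior_odds)


posterior = posterior_probability(prior=0.1, bayes_factor=3)
print(f"posterior probability of the hypothesis: {posterior:.2f}")
print(f"still more likely false than true: {posterior < 0.5}")
```

With a prior of .1, the Bayes factor would have to exceed 9 before the hypothesis became more likely true than false.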
[00:16:19.630]
So that's the main thought behind the view. What you need is a much stronger amount of evidence that allows you to move from an unlikely hypothesis to a likely hypothesis. And we give various arguments for why .005 is a good threshold.
[00:16:42.570]
Of the three arguments we give, only one of them is a pretty good argument. And that's the argument that uses the false positive rate. So let me explain to you what a false positive rate is. It's a concept that's been used quite a lot because of the replication crisis for fifteen years now. And the idea is that it's the proportion of significant results that are false positives. Right, so you take all your significant results, all the ones where the p value is below the significance level, and you say: of, let's say, 100 significant results, how many of them are false positives? 5 percent? 10 percent? 30 percent? 40 percent or more? Now, it's a very important quantity, because if you have a false positive rate that's very high, and if everything that gets published is significant, maybe because if you don't get a significant result you're not going to submit it to a journal, because you think it's not going to be published, or maybe because reviewers really don't want to publish non-significant results, then the false positive rate in what you read is very high. When you open a journal, let's say Cognition or Cognitive Science or the Journal of Personality and Social Psychology, you know that maybe one third or half or two thirds of the significant results you're looking at are false positives, and if you're in that kind of situation, you're probably not going to trust the results that are reported in the journal. So that's a very important quantity. Now, the false positive rate is not a very simple quantity to define. But in a nutshell, it depends on two other quantities. It depends on the prior, so the ratio of hypotheses that are false to hypotheses that are true. We've looked at this quantity before. It also depends on the power of your experiment. And the power, from memory, is the probability of rejecting the null hypothesis when the null hypothesis is false.
When you do an experiment with a high power, you are doing an experiment such that if the null hypothesis is false, then you have a high probability of rejecting it. Right. That's the kind of science you want to do. You don't want to do experiments that are such that, well, even if the null is false I have no chance to reject it, or barely any chance to reject it. So the false positive rate depends on the prior probability, the significance level, and the power of your experiment. So what we show in the paper is that for many priors, if you cut down the significance level from .05 to .005, you can dramatically reduce the false positive rate. In fact, what we show is this. Suppose you set your significance level at .05, as is usually the case, and your power is .8, which means that if the null hypothesis is false, you have 8 chances out of 10 to reject the null hypothesis. That's a very high power, much larger than the real power in psychology. So if your significance level is at .05 and your power is at .8, then for many priors, the rate of false positives among significant findings is more than 30 percent.
[00:20:21.290]
So 3 significant results out of 10 are false positives. We think it's huge. We think it's unacceptable. We think it's disturbing. We think it's actually undermining the trustworthiness of science. By contrast, supposing again that you have a power of .8, and for the same range of prior probabilities, if you set your significance level at .005, you decrease the false positive rate from about 35 percent to 5 percent. So only one significant result out of 20 happens to be a false positive.
[00:20:57.230]
One might think it's not great, but we think that's acceptable. In fact, we think that's the kind of proportion that many scientists are willing to live with. Right. Many scientists are probably willing to live in a world in which, when they open their journals and they look at 20 significant results, one of them is a false positive. That's not too bad. 19 of them are true positives. That's not perfect. But actually that's a risk we think most scientists are willing to take.
[00:21:31.970]
So the suggestion here is that for a range of prior probabilities, depending on how risky your science is, and for a high power of .8, moving from .05 to .005 is going to decrease your false positive rate by a factor of 7 and bring it to .05. Which we think is acceptable.
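The dependence of the false positive rate on the prior, the significance level, and the power can be sketched directly. This is a simplified illustration, assuming a prior probability of .1 (the figure used earlier in the conversation) and the textbook relation in which false positives occur at rate alpha among false hypotheses and true positives at rate power among true ones. The function name `false_positive_rate` is an assumption, and the paper's own calculation may differ in detail.

```python
def false_positive_rate(alpha, power, prior):
    """Share of significant results that are false positives.

    alpha: significance level; power: chance of detecting a true effect;
    prior: proportion of tested hypotheses that are true.
    """
    false_positives = alpha * (1 - prior)  # true nulls that reach significance
    true_positives = power * prior         # real effects that reach significance
    return false_positives / (false_positives + true_positives)


for alpha in (0.05, 0.005):
    rate = false_positive_rate(alpha, power=0.8, prior=0.1)
    print(f"alpha = {alpha}: false positive rate = {rate:.1%}")
```

Under these assumptions the rate falls from a bit over a third to about one in twenty, in line with the figures quoted above.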
[00:21:58.870]
So many scientists, in fact, think that the significance level is identical to the false positive rate. Right. So they think that when they have a significance level at .05, it means that only 5 percent of the significant findings are false positives. But that's not at all what it means. Right. It means something very different. It just concerns the probability of getting the data you observe or more extreme data conditional on the null hypothesis being true. But that's compatible with a false positive rate being as high as 30, 50 or 60 percent, depending on the power, the significance level, and the prior probability.
[00:22:39.170]
Right. So in fact, what scientists have always had in mind is something like the false positive rate. And what we are doing by setting the significance level at .005 is giving them what they've always wanted, even though they didn't know that was what they wanted. Right. What people really want is: OK, here's a bunch of significant results, how many of them are false positives? And what you really want is a low number, maybe 5 percent. That's a false positive rate. If you fix your significance level at .005 instead of .05, you're going to get what you want. That's the message we're giving scientists.
[00:23:20.890]
So in the paper, we consider many objections to our work, and actually many of the objections have been made in print immediately after the publication of our paper. And we consider three objections mostly. One of them is that by decreasing the significance level, we're going to increase the rate of false negatives.
[00:23:45.280]
So, from memory, a false negative is a failure to reject the null hypothesis when the null hypothesis happens to be false. That is, there is a real phenomenon to be found and we fail to find it. That's a bad thing too. That's a first objection. A second objection is: well, look, there are many other issues that are responsible for the replication crisis, and why aren't you talking about those? That's a second objection. And a third objection is: well, why are you still sticking to classical statistics? Why aren't we all becoming Bayesian statisticians? So let me just say a few words about these three objections. I think the second one is a bit silly. Of course, there are many reasons, many explanations and many causes of the replication crisis. And we acknowledge that in the paper, but the reason we focus on the significance level is because we believe that publication without a sufficient amount of evidence is one of the main reasons for the replication crisis.
[00:24:50.260]
And it's a very simple intervention that will have, we believe, a dramatic impact on science. So it's an important cause and it's a very simple intervention. So that strikes me as a good enough reason to be focusing on that idea. Why not switch to Bayes? Well, I think there are a lot of reasons. One of them is that to do Bayes in the right way is actually not easy. So here I'm going to speak for myself and not for my collaborators on this paper. Many of my collaborators are Bayesians. So they will say we'd be delighted if we all moved to a Bayesian framework, and indeed they have developed tools and internet software to compute Bayes factors and to do statistics based on Bayes factors. I tend to think that if you want to become a Bayesian, you should not imitate what classical statisticians are doing. So you should not do a t test in a Bayesian framework using a Bayes factor instead of a p value. You should not do the Bayesian counterparts of classical statistics based on Bayes factors. You should become a full-blown Bayesian. You should build Bayesian models based on reasonable assumptions about the priors. And you should update them in light of the data. And that requires quite a bit of training in statistics. It's not the kind of thing you can easily learn in a matter of a couple of weeks, and then just go online and use JASP, which is one of the new software packages for doing that kind of thing, or even some of the packages in R, as you can do, for example, with BayesFactor.
[00:26:37.090]
So Bayes factors are easy to use. But I think if you want to become a Bayesian statistician, you should do it well, and you should learn to develop Bayesian models and learn how to compute posterior probabilities by using approximation algorithms and this kind of thing. It's doable. I don't think it's impossible to do. There are excellent textbooks for that, but it does require a fair amount of work. And I don't think it's reasonable to expect all scientists who've already made quite a career, you know, who have been successful, who've published thirty papers already, who are full professors or associate professors, who might have God knows 100 papers under their belt; it's not reasonable to expect those people to suddenly move to another statistical framework. I think what we want is at least a simple way to improve statistics and that's what our proposal is all about.
[00:27:36.780]
Now the first one is a bit of a better objection. Right. And the reason is that there is a tradeoff, when everything else is equal, between false positives and false negatives. So if you reduce the significance level, you're going to reduce the rate of false positives. So the false positive rate is going to go down, as we saw a minute ago. But because you make it harder to reject the null hypothesis, you're going to increase the rate of false negatives. So when you keep everything else about your experiment fixed and you decrease the significance level, you decrease the frequency of false positives, but increase the frequency of false negatives.
[00:28:22.090]
And that's a real worry, because one might think that false negatives are as bad as false positives. We don't want to miss real discoveries. I think that's a real concern. But the key point here that we make in the paper, and that I think I make in follow-up papers also, is that things need not be equal. What we want is in fact to decrease the significance level, which decreases the probability of making a false positive, and to increase the sample size of experiments. If one increases the sample size of the experiment, one increases the power of one's experiment and one can keep the rate of false negatives constant. So the idea is, and that's my take, anyway, on this objection, which I think is a very important objection, that by itself decreasing the significance level might not be such a good idea, because it would result in an increase in false negatives.
[00:29:32.620]
But combined with an increase of the sample size, we're going to be able to keep the rate of false negatives constant and decrease dramatically the frequency of false positives. So what we are suggesting here is that we want a combo. We want two things to happen at the same time, a decrease in false positives and an increase in the sample size of experiments. In any case, I just think we need to increase the sample size of experiments right now.
[00:30:05.860]
And it is still the case that many experiments are based on an insufficiently large sample size. Earlier I mentioned that when we computed the false positive rate, we assumed a power of .8. That's an idealization. For 60 years now, the power in psychology has been equal to .5. Meaning that if the null hypothesis happens to be false, so there's a real discovery to be made, then you have one chance out of two to show that it's false.
[00:30:34.550]
That's ridiculous. You'd be better off flipping a coin. It's faster, less expensive, and works as well as your experiment, right? So a power of .5 is unacceptable. It's been the average power in psychology for 60 years now. It needs to increase and it must increase. And one of the best ways to increase power, and one of the most obvious ways, is by increasing the sample size. So sample size must increase. And when we combine it with the decrease of the significance level, we will keep the rate of false negatives constant.
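The combination of a stricter threshold and a larger sample can be sketched with a standard power calculation. This is a minimal sketch, assuming a two-sided one-sample z-test and a hypothetical standardized effect size of 0.3; the function names `power` and `required_n` and the effect size are assumptions for illustration, and real studies would use the power analysis appropriate to their design.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(effect_size, n, alpha):
    """Power of a two-sided one-sample z-test with n observations."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n)
    # Probability that the test statistic lands in either rejection region.
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


def required_n(effect_size, alpha, target_power=0.8):
    """Approximate sample size needed to reach target_power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(target_power)
    return ceil(((z_crit + z_power) / effect_size) ** 2)


EFFECT = 0.3  # hypothetical standardized effect size
n_05 = required_n(EFFECT, alpha=0.05)
n_005 = required_n(EFFECT, alpha=0.005)

print(f"n for power .8 at alpha .05:  {n_05}")
print(f"power at alpha .005 with that same n: {power(EFFECT, n_05, 0.005):.2f}")
print(f"n for power .8 at alpha .005: {n_005} ({n_005 / n_05:.2f} times larger)")
```

In this setup, keeping the sample fixed while tightening the threshold drops power to roughly one half, and restoring a power of .8 takes roughly 70 percent more observations.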
[00:31:14.150]
I should say one thing: many of my examples are drawn from psychology. And sometimes people say, yes, there's a problem with social psychology, but that's not a problem with science. Things are much better in other disciplines. Not so. In fact, studies of the power in neuroscience suggest that it's usually around .2 or .3, which means even lower than in psychology. Studies of various fields in drug testing and various areas of the biomedical sciences suggest that the average power is .08. Less than .1, which is an unbelievably low power. One wonders why. What are people thinking when they do biomedical experiments where the chance of rejecting the null when the null is false, the chance of making a real discovery, is less than 1 out of 10?
[00:32:13.480]
So the issue is not just an issue for psychology, it's an issue with many of the sciences: ecology, evolutionary biology, the part which is experimental, psychology, neuroscience, biomedical sciences, and so on and so forth. So we need to increase sample size. We need to move to big science. There's no way around that. And if we do that, we're going to keep the rate of false negatives constant, even if we decrease the rate of false positives.
[00:32:44.760]
It may turn out that in some fields you actually cannot increase sample size. So, for example, you're doing cognitive anthropology and you're working with a small-scale society in the Amazon, and you're going to work with populations, in villages, that are really small.
[00:33:08.760]
Now, surely you're not going to get hundreds of participants, because there aren't hundreds of people speaking that language or living in that community. In that case, I think the right thing to do is not to pretend that you have amazing evidence for your hypothesis. It's just to say, look, here is some weak evidence for my hypothesis, it suggests something. It's better than nothing. That's the best we will ever get, given the limited amount of evidence we would ever be in a position to get, right?
[00:33:40.650]
So I think the right thing to do is this: when you can't get good evidence, well, you might still want to do the science. But you should just be completely honest about the fact that you're working in an area where you can't get stronger evidence for your hypothesis. And I think then you want to talk of suggestive evidence. You don't want to assert that you've made discoveries based on data that just can't allow you to make discoveries.
[00:34:06.420]
I think that's the same kind of thing we find when we do, you know, the history of paleoanthropology or when we do historical sciences. When we look at the deep past, you know, what we can get in this area is at best suggestive evidence. And I think the mistake is to say, oh, no, we have very good evidence that blah, blah, blah, instead of saying, well, we find suggestive evidence that might suggest that. I think the responsible thing to do as scientists is to calibrate the strength of the conclusion based on the type of evidence we're able to get. So if you can't get good evidence, maybe you should keep doing the science, but you should modulate the strength of the claim you're making.
[00:34:52.590]
I'd like to respond to what I take to be the most interesting point in Daniel Lakens's paper "Justify Your Alpha," which I believe was published one year after our paper was published. So the criticism here is that we shouldn't have a standard. We should, in every scientific context, think very hard about what the right significance level is and decide, depending on the costs of false positives and the costs of false negatives, to have an adaptive significance level rather than a fixed one.
[00:35:26.570]
I'd say in principle, Lakens and colleagues might be right that in an idealized scientific context, we would want to set the significance level by looking at the costs of false positives and the costs of false negatives. But I don't think that's the way science is done or how science can be done, because of limited resources. We are finite beings, as Herb Simon said some time ago. And that's true of the subjects of psychological experiments. And it's true of scientists too. Sometimes psychologists believe that they themselves are quite different, but we all have finite cognitive resources, and we must use crutches in order to make proper decisions.
[00:36:12.340]
What we know from the replication crisis, and also from the history of science, is that scientists are not that reflective. Most scientists don't have the time or the inclination or the resources or the training to be extremely reflective about every aspect of their science. Some things must be taken for granted. They must be part of the norms that are governing science so that people don't have to think about every aspect of doing science.
[00:36:40.310]
And I'd say here is a very important lesson that fits with some of the best psychology. We are limited cognizers. Our resources are limited. When we act, and science is full of action, we can't think about every aspect of our decision. So we must, in a sense, externalize some aspects of a decision. We must rely on norms that are going to help us act properly. And I think what we want is to have a normative apparatus of science that helps scientists make good decisions when they do science.
[00:37:18.800]
Most of the time. All right. And I think the significance level is one of these crutches, or, one might say, one of these norms that is there to help scientists avoid having to think about everything when they do science. You know, I think that the proposal from Lakens and colleagues, as well as the proposals coming from many statisticians, takes scientists to have an enormous amount of intellectual resources, the capacity to think about every single possible aspect of research, and basically ignores the realities of doing science.
[00:37:56.480]
I think what we want is science for limited scientists, limited human beings who have limited cognitive resources, limited financial resources and so on and so forth. And I think our proposal is exactly geared toward real science, not toward idealized science. And the alternative, that you should decide in advance what the significance level is going to be, taking into account the costs of false positives and the costs of false negatives, while granted in an ideal world, is actually bad advice in practice.
[00:38:25.550]
There are also other considerations for why it is bad advice. One of them is that it's not going to look good. Laypeople are actually worried that scientists are manipulating their data, and for good reason, because scientists are manipulating their data. And now we tell them, oh look, in addition to being able to drop an outlier, they can also change the significance level depending on very vague costs which are hard to specify.
[00:38:54.920]
How good is that going to look? And how are reviewers going to be able to make these decisions, and so on and so forth?
[00:39:01.070]
So I think this idealized view of science, when the rubber meets the road, is just bound to fail. But that does not mean that in some circumstances we should not relax the significance level. Of course we should, you know, in extreme circumstances: when you do drug testing in a situation of crisis, or when you do exploratory science and really need to explore a set of possibilities. Of course, relax the significance level.
[00:39:29.240]
No one's going to object to that. But that's definitely compatible with a default significance level in situations of everyday science.
[00:39:41.080]
The other response I'd like to mention is one which I thought was very interesting. The idea was that when we computed a false positive rate and the effect of decreasing the false positive rate, we didn't take into account p-hacking. P-hacking, from memory, is all the practices that increase the probability of getting a significant result. So you might drop one data point. You might run 20 different analyses. You might look at 20 different measures and just report the ones that work, and so on. All of that increases the probability of getting a significant result. And the idea was, well, our proposal works well when there's no p-hacking, but as soon as you add p-hacking to the picture, our proposal fails. So that was the suggestion.
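To make the arithmetic behind this objection concrete, here is a minimal sketch that is not taken from the Benjamin et al. paper. It assumes, as a simplification, that every extra analysis is an independent test of a true null hypothesis, so that the chance of at least one significant result among k analyses is 1 - (1 - alpha)^k. Real p-hacked analyses are run on the same data and are correlated, so the numbers are only indicative. The function names are hypothetical and chosen for illustration.

```python
import math


def chance_of_false_positive(alpha: float, k: int) -> float:
    """Probability that at least one of k independent tests of a true
    null hypothesis gives p < alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** k


def analyses_needed(alpha: float, target: float = 0.5) -> int:
    """Smallest number of independent null analyses for which the chance
    of at least one significant result reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - alpha))


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        rate = chance_of_false_positive(alpha, 20)
        needed = analyses_needed(alpha, 0.5)
        print(f"alpha={alpha}: P(at least one hit in 20 analyses)={rate:.3f}, "
              f"analyses needed for even odds={needed}")
```

Under these assumptions, 20 analyses give roughly a 64% chance of a spurious significant result at .05 but under 10% at .005, and reaching even odds takes 14 analyses at .05 against 139 at .005. This simple model does not reproduce the simulations discussed in the episode; it only shows the direction of the effect.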
[00:40:34.070]
And this is something we did not consider in the paper. It was a useful contribution to the debate. Unfortunately, this objection really fails as well. And it fails because of the costs of doing p-hacking when you decrease the significance level. It's very easy to p-hack to .05. Simonsohn and colleagues have shown that by just doing a few manipulations of your data, where you drop a few outliers, you add a covariate, and you compare two or three different variables, you're likely to get a significant result when your significance level is at .05. However, the number of manipulations you need to do increases exponentially. Now we're all used to exponential curves these days because of Covid-19: they increase very quickly. So for .05, a few manipulations are sufficient to be likely to get a significant result. For .005 you need to do dozens of manipulations to your data. And now what I think is the case is that scientists in the past have been willing to play a little bit with the data.
[00:41:51.660]
Not because they were trying to commit fraud, but because they were trying to say, oh, what do I really show? You know, what do these data show? Maybe that result is due to a bad outlier, maybe that measure doesn't work, maybe this measure works better. They're exploring their data and then reporting the outcome of the exploration as if that was what they intended to show in the first place. So not really being engaged in fraud at all, but rather exploring what the data really show and confusing exploration with testing your hypothesis. That was a very common practice, and I do believe it is still reasonably common. I think it's fine to do that provided you don't say that you were testing your hypotheses in the first place. I think it's fine to do that, but you're not going to be able to do that when the significance level is at .005, because you're not going to be able to explore your data just a little bit. You're going to have to do a systematic manipulation of your data in all possible ways to get a significant result, dozens of different ways of looking at your data. And that's going to start feeling like you're committing fraud. It's going to start feeling like, oh, I'm not simply exploring my data. I'm not simply trying to see what my data are telling me. I'm trying to manipulate my data so as to get the results that allow me to get a publication.
[00:43:17.760]
So I think there's a difference between doing a few things to your data to see what they might tell you and massaging your data in a crazy way so as to get significant results. So once you take p-hacking into account, you just realize that in fact our proposal would have a very positive side effect. It would make p-hacking much, much, much harder and, I believe, much less common, because scientists are by and large honest people. They don't want to commit fraud. They don't want to feel like they're committing fraud.
[00:43:52.410]
They don't want to feel like they're cheating. They don't view themselves as cheaters. They're actually committed to a norm of doing good science. They tolerate exploring data because that's a common practice. They don't tolerate massaging the data to the point where it looks like fraud. So I think a side effect of our proposal, which we hadn't discussed at all in the original paper, but which I believe is actually true, is that it might reduce the frequency of p-hacking.
[00:44:25.410]
A few years after the publication, I think it's probably time now to try to assess whether it's been a success or not. Our goal was really to get journals, as well as associations like the American Psychological Association, the APA, and so on and so forth, to just change their requirements and to be explicit. I think in that respect we failed. To my knowledge, no journal has officially endorsed .005 as an expected threshold for significance.
[00:44:57.630]
In other respects we've been more successful. I think we've brought an important topic to the discussion and there's been a lot of discussion of these ideas among psychologists and other scientists. So in that respect it's been quite successful. And also indeed, I think we've slightly changed expectations. If all the p values in a paper are around .04, there will be some concerns about the quality of the science that's getting submitted. So that's, in a sense, a byproduct of the debate we created rather than our intended goal.
[00:45:40.920]
So I think we failed to meet our main goal largely because of the reaction of the critics. I think the critics were just not very helpful. Our paper created a flurry of responses: dozens of blog posts and a bunch of publications and so on and so forth. And many of these responses, instead of taking a pragmatic approach, how to make science better, how to decrease the proportion of false positives in science, just went to foundational questions about Bayesian versus frequentist statistics and so on and so forth, which in a sense prevented scientists from seeing the simple, pragmatic point we were trying to make in this paper.
[00:46:34.410]
So the main goal here was: let's bracket all disagreements about foundational questions. Let's try to improve science by doing simple things. And that simple idea was lost in the debate. And the outcome is that we haven't been able to change the norms as much as we were probably hoping.
[00:46:54.640]
Where to go next? I don't exactly know. And to be honest, I do worry sometimes about all the proposals that we are making, and by we I don't mean the authors of this paper, but in general, people who care about improving science.
[00:47:13.480]
All these proposals work for a short period, but then get subverted by, in a sense, inertia. For example, preregistration, the idea of indicating before doing an experiment what the hypotheses are, what data are going to be collected, and how the data are going to be analyzed, is an excellent idea that in a sense addresses some of the causes of the replication crisis, particularly p-hacking. On the other hand, what we have seen for the last two or three years is that preregistration has been subverted to the point that it's now often pretty much useless.
[00:47:53.440]
People preregister very vague hypotheses. They don't preregister their data analysis. They preregister 20 different measures, they preregister many different ways of looking at the data, and the outcome is that the value of preregistration is zero. If you preregister 20 measures and report only one of them in your paper, then you are in fact p-hacking, and you're preregistering that you're going to be p-hacking. So the worry here is that many good ideas have been put forward, but some of the best of these ideas, such as cutting your p value, setting your significance level to .005, and preregistering your research, are getting subverted by science.
[00:48:41.900]
So I don't exactly know where to go from there. There are signs that some aspects of science are improving. Sample sizes have been increasing over the last two or three years, suggesting that scientists have been sensitive to some of the debates that have taken place and have taken issue with having a very low sample size. I think there's been an improvement in the kind of statistics people are doing. I think you see more and more sophisticated statistics coming from psychologists.
[00:49:17.680]
So that's an improvement. But I think the room for optimism is not very large. There's a concern that science is a very inertial system and that changing its course is actually much harder than what we, the authors of the Benjamin et al. paper, but also other people concerned with improving science, thought maybe three or four years ago.
[00:49:46.970]
So I see signs for optimism. I also see signs for concern in the inertia of science. I should also say that in reaction to our proposal and also to others' proposals, not only did I think the debates were less than useful, I also thought established scientists were less than helpful partners. There was a lot of talk, and there is still a lot of talk, to the effect that all is fine and dandy and we can still keep doing the science we were doing 10 years ago.
[00:50:28.250]
People coming from very prestigious places like Harvard and Princeton, and no name will be given here, have actually, I think, undermined scientific reforms by publishing op-eds, special issues in Proceedings of the National Academy of Sciences, and so on and so forth, arguing again and again that we should trust science and that science is doing just fine. I understand where it comes from. We live at a time where there's a distrust in science, and we don't want to foster a distrust in science. But I do think that it has undermined some of the efforts to do better science. And that's also another reason why I am not extremely optimistic these days.
[00:51:22.130] - Wesley
That's it for today's episode. Visit our website at journalentries.fireside.fm for more information about Edouard Machery, his work, and some of the resources mentioned in this episode. Special thanks to Two Cheers for creating our theme music and to Christopher McDonald for sound engineering.