What does “Scientists rise up against statistical significance” mean? (Comment in Nature)
The Comment in Nature, Scientists rise up against statistical significance, begins with:
Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.
and later contains statements like:
Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.
I think I can grasp that the image below is not saying the two studies disagree merely because one "rules out" no effect while the other does not. But the article seems to go into much more depth than I can understand.
Towards the end there seems to be a summary in four points. Is it possible to summarize these in even simpler terms for those of us who read statistics rather than write it?
When talking about compatibility intervals, bear in mind four things.
First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible...
Second, not all values inside are equally compatible with the data, given the assumptions...
Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention...
Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval...
Tags: statistical-significance, p-value, bias
asked 3 hours ago by uhoh
3 Answers
I'll try.
- The confidence interval (which they rename the compatibility interval) shows the values of the parameter that are most compatible with the data. But that doesn't mean values outside the interval are absolutely incompatible with the data.
- Values near the middle of the confidence (compatibility) interval are more compatible with the data than values near the ends of the interval.
- 95% is just a convention. You can compute 90%, 99%, or any other level of interval (a short sketch follows this list).
- The confidence/compatibility intervals are only helpful if the experiment was done properly, if the analysis followed a preset plan, and if the data conform to the assumptions of the analysis methods. If you've got bad data analyzed badly, the compatibility interval is not meaningful or helpful.
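To make the third bullet concrete, here is a minimal sketch (mine, not part of the answer) showing that the confidence level is just a dial: the same made-up data yield narrower or wider intervals at 90%, 95% and 99%. The sample values and the use of SciPy's t-interval are assumptions for illustration only.

```python
# Minimal sketch: the same data, three different confidence levels.
# The data are invented; a real analysis would use the actual measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=12.0, scale=1.0, size=30)  # hypothetical measurements

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

for level in (0.90, 0.95, 0.99):
    lo, hi = stats.t.interval(level, df=len(sample) - 1, loc=mean, scale=sem)
    print(f"{level:.0%} interval: ({lo:.2f}, {hi:.2f})")
```

The only thing that changes between the three lines of output is how much of the sampling distribution you choose to cover; nothing about the data or the model is different.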
answered 3 hours ago by Harvey Motulsky
The first three points, as far as I can tell, are a variation on a single argument.
Scientists often treat uncertainty measurements ($12 \pm 1$, for instance) as probability distributions that look like this:
[first figure: a flat-topped, uniform-looking distribution over the quoted interval]
When actually, they are much more likely to look like this:
[second figure: a peaked, roughly Gaussian distribution]
As a former chemist, I can confirm that many scientists with non-mathematical backgrounds (primarily non-physical chemists and biologists) don't really understand how uncertainty (or error, as they call it) is supposed to work. They recall a time in undergraduate physics when they maybe had to use uncertainties, possibly even propagating a compound error through several different measurements, but they never really understood them. I too was guilty of this, and assumed all measurements had to fall within the $\pm$ interval. Only recently (and outside academia) did I find out that error measurements usually refer to a certain standard deviation, not an absolute limit.
So to break down the points:
1. Measurements outside the CI still have a chance of happening, because the real (likely Gaussian) uncertainty is non-zero there (or anywhere, for that matter, although it becomes vanishingly small far out). If the values after the $\pm$ do indeed represent one s.d., then there is still a roughly 32% chance of a data point falling outside them (a quick numerical check follows this list).

2. The distribution is not uniform (flat-topped, as in the first graph); it is peaked. You are more likely to get a value in the middle than at the edges. It's like rolling a handful of dice rather than a single die.

3. 95% is an arbitrary cutoff, and coincides almost exactly with two standard deviations.

4. This point is more of a comment on academic honesty in general. A realisation I had during my PhD is that science isn't some abstract force; it is the cumulative effort of people attempting to do science. These are people trying to discover new things about the universe, but at the same time trying to keep their kids fed and keep their jobs, which unfortunately in modern times means some form of publish-or-perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't lead to publications.

   Arbitrary thresholds such as $p < 0.05$ can often be self-perpetuating, especially among those who don't fully understand statistics and just need a pass/fail stamp on their results. As such, people do sometimes half-jokingly talk about 'running the test again until you get $p < 0.05$'. It can be very tempting, especially if a PhD, grant or job is riding on the outcome, for marginal results to be jiggled around until the desired $p = 0.0498$ shows up in the analysis.

   Such practices can be detrimental to science as a whole, especially if done widely, all in pursuit of a number that is, in the eyes of nature, meaningless. This part is in effect exhorting scientists to be honest about their data and work, even when that honesty works to their detriment.
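As a quick numerical check of the coverage figures in points 1 and 3 (my own sketch, not part of the answer), the two-tailed probability of a Gaussian beyond one and two standard deviations can be computed directly:

```python
# Probability mass of a normal distribution lying outside +/- k standard deviations.
from scipy import stats

for k in (1, 2):
    outside = 2 * stats.norm.sf(k)  # sf = survival function = upper-tail probability
    print(f"P(|X - mu| > {k} sd) = {outside:.3f}")
# -> about 0.317 for k = 1 (the ~32% quoted above) and 0.046 for k = 2 (the 95%/5% convention)
```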
answered 1 hour ago by Ingolifs (edited 55 mins ago)
– Ingolifs (53 mins ago): Ack. I can't figure out how to make the points 1.-4. align vertically. Any help?
– uhoh (20 mins ago): I've made the edit, just needs accepting; single space in front of each number and blank line between each item seems to do it.
Much of the article and the figure you include make a very simple point:
Lack of evidence for an effect is not evidence that it does not exist.
For example,
"In our study, mice given cyanide did not die at statistically-significantly higher rates" is not evidence for the claim "cyanide has no effect on mouse deaths".
Suppose we give two mice a dose of cyanide and one of them dies. In the control group of two mice, neither dies. Since the sample size was so small, this result is not statistically significant ($p > 0.05$). So this experiment does not show a statistically significant effect of cyanide on mouse lifespan. Should we conclude that cyanide has no effect on mice? Obviously not.
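To make that non-significance concrete, here is a hedged sketch (the answer doesn't specify a test; Fisher's exact test is my choice for a 2×2 table this small):

```python
# Two treated mice (one dies) vs. two control mice (none die): is the difference "significant"?
from scipy.stats import fisher_exact

#            died  survived
table = [[1, 1],   # cyanide group
         [0, 2]]   # control group
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(p_value)  # 1.0 -- with n = 2 per group there is essentially no evidence either way
```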
But this is the mistake the authors claim scientists routinely make.
For example, in your figure, the red line could arise from a study on very few mice, while the blue line could arise from the exact same study on many mice.
The authors suggest that, instead of using effect sizes and p-values, scientists describe the range of possibilities that are more or less compatible with their findings. In our two-mouse experiment, we would have to write that our findings are compatible both with cyanide being very poisonous and with it not being poisonous at all. In a 100-mouse experiment, we might find a confidence interval of $[60\%, 70\%]$ fatality with a point estimate of $65\%$. Then we should write that our results are most compatible with the assumption that this dose kills 65% of mice, but that they are also somewhat compatible with fatality rates as low as 60% or as high as 70%, and less compatible with a truth outside that range. (We should also describe the statistical assumptions we made to compute these numbers.)
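For the 100-mouse illustration, a rough sketch of how such an interval might be computed (the choice of a Wilson binomial interval and the statsmodels call are my assumptions; the $[60\%, 70\%]$ range above is only stylized):

```python
# Compatibility/confidence interval for 65 deaths out of 100 mice.
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=65, nobs=100, alpha=0.05, method="wilson")
print(f"point estimate: 0.65, 95% interval: ({low:.2f}, {high:.2f})")
# roughly (0.55, 0.74); the exact width depends on the method chosen and, as the
# article stresses, on the statistical assumptions behind it
```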
answered 29 mins ago by usul