When testing multiple hypotheses, what does it mean when there are not enough extremes? [closed]

Suppose you are testing a large number of hypotheses, say a million. Unlike the usual situation where you have a lot of very small p-values, in this case all of your p-values are greater than 5%.



What does that imply, and what's the best way to handle something like this?







asked Aug 12 at 4:38 by badmax · edited Aug 12 at 6:28 by kjetil b halvorsen

closed as unclear what you're asking by Martijn Weterings, mdewey, Michael Chernick, kjetil b halvorsen, whuber♦ Aug 12 at 17:10


  • If tests are independent it's possible that there's an issue with the assumptions. If the tests are sufficiently dependent, then it may not indicate any problem. Your question is a bit light on details -- and the details may provide clues that would be needed to give a useful answer.
    – Glen_b♦
    Aug 12 at 5:37

1 Answer

As the sample size ($n$) grows, hypothesis tests gain power and even tiny effects eventually become significant. However, as the number of independent hypothesis tests ($k$) grows, each individual test still behaves the same.



The problem of multiple testing is that the overall chance of a false positive becomes deceptively large: among multiple tests, the chance of at least one false positive exceeds the significance level ($\alpha$). This is why you should apply a multiple-testing correction.



Suppose the effects/differences you are testing for are simply not present in the population, or they are so infinitesimally small that you cannot detect them with your current hypothesis tests. This is essentially what you assume when applying a Bonferroni correction: there are no true effects, so every test can only produce a false positive. There are then $k$ potential false positives and a chance of $1 - (1 - \alpha)^k$ of at least one false positive.
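
To make that arithmetic concrete, here is a minimal sketch in Python (my own addition, not part of the original answer) that evaluates the family-wise error rate $1-(1-\alpha)^k$ and the corresponding Bonferroni-adjusted per-test threshold $\alpha/k$ for a few values of $k$:

```python
# Sketch: family-wise error rate under k independent true-null tests,
# and the Bonferroni-adjusted per-test threshold alpha / k.
alpha = 0.05

for k in (1, 10, 100, 1_000_000):
    fwer = 1 - (1 - alpha) ** k   # P(at least one false positive)
    bonferroni = alpha / k        # per-test threshold that keeps the FWER <= alpha
    print(f"k = {k:>9,}: FWER = {fwer:.6f}, Bonferroni threshold = {bonferroni:.2e}")
```

For $k = 10^6$ the family-wise error rate is numerically indistinguishable from $1$, which is why an uncorrected threshold of $0.05$ is meaningless at this scale.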




So what does it mean when you don't observe extremely small $p$-values? Under the null hypothesis, the $p$-value is uniformly distributed, so even if there are no true effects you would expect the number of values close to $0$ to increase with the number of tests, since you would essentially be drawing $k$ numbers from $\mathsf{Unif}(0,1)$.



If you are running a very large number of tests and don't conclude any nominally significant differences (uncorrected), then perhaps your test is not powerful enough, or your tests are not actually independent. However, if roughly $\alpha \cdot 100\%$ of your $p$-values are nominally significant, then nothing strange is going on. (In your example, you would expect about $50,000$ $p$-values below $0.05$.)
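
As a rough check of that expectation, here is a small simulation sketch (my own addition; it assumes NumPy and SciPy are available) that generates a million test statistics under the null and counts the nominally significant $p$-values:

```python
# Sketch: p-values from one million independent tests with no true effect.
# Under H0, each p-value is Unif(0, 1), so about alpha * k of them fall below alpha.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
k, alpha = 1_000_000, 0.05

z = rng.standard_normal(k)          # z-statistics from pure noise (H0 true everywhere)
p_values = 2 * norm.sf(np.abs(z))   # two-sided p-values

print((p_values < alpha).sum())     # expected around alpha * k = 50,000
```

A count near $50,000$ is what independent null tests look like; seeing essentially no $p$-values below $0.05$ would instead point to low power or strong dependence, as discussed above.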




Lastly, as for what to report: it might be more interesting to give a set of confidence intervals (or credible intervals), so you can say something about the effect sizes. Alternatively, if your sample size is indeed large and you want to demonstrate that there are no meaningful effects, you should run equivalence tests instead.
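
For the equivalence-testing route, one common choice is the two one-sided tests (TOST) procedure. The following is only an illustrative sketch, not the answer's own recipe: the equivalence margin of $\pm 0.1$ is an arbitrary assumption, and it uses `scipy.stats.ttest_1samp` with the `alternative` argument (SciPy >= 1.6):

```python
# Sketch: two one-sided tests (TOST) for equivalence of a single mean to 0,
# with an assumed equivalence margin of +/- 0.1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=5_000)   # hypothetical large sample
low, high = -0.1, 0.1                            # assumed equivalence bounds

# TOST null hypothesis: the true mean lies outside [low, high].
p_lower = stats.ttest_1samp(x, low, alternative='greater').pvalue
p_upper = stats.ttest_1samp(x, high, alternative='less').pvalue
p_tost = max(p_lower, p_upper)   # equivalence is supported only if both one-sided tests reject

print(f"TOST p-value: {p_tost:.4f}")   # small p => mean lies within the margin
```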




To elaborate on what Glen_b alluded to in the comments:



If your tests are not actually independent, then neither are your $p$-values. In other words, your collection of $p$-values only behaves like an independent sample from a uniform distribution if you (1) repeatedly draw samples from the same population and test the same (true) null hypothesis, or (2) perform independent tests of different effects. A simple, albeit somewhat contrived, example: perform the very same test multiple times. Every $p$-value is then identical and may well be above the significance threshold.
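
To illustrate that extreme case of dependence (again my own sketch, not from the original answer), compare repeating the very same test with running genuinely independent tests:

```python
# Sketch: perfectly dependent "tests" produce identical p-values, so it is entirely
# possible that none of them falls below 0.05, even with many tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(size=30)                    # one sample under H0 (true mean 0)

# Perfectly dependent: the same test repeated 1,000 times.
p_same = stats.ttest_1samp(sample, 0.0).pvalue
dependent = np.full(1_000, p_same)              # 1,000 identical p-values

# Independent: 1,000 fresh samples, one test each.
independent = np.array([
    stats.ttest_1samp(rng.normal(size=30), 0.0).pvalue for _ in range(1_000)
])

print((dependent < 0.05).sum())    # either 0 or 1,000, depending on the single draw
print((independent < 0.05).sum())  # around 0.05 * 1,000 = 50
```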






answered Aug 12 at 5:15 by Frans Rodenburg · edited Aug 12 at 15:23

  • "If you are running a very large number of tests and don't conclude any nominally significant differences (uncorrected), then perhaps your test is not powerful enough" But wouldn't the p-values still be uniformly distributed in this situation, if there is a null effect? I'm just having a hard time wrapping my mind around there being a clustering of p-values above a threshold, since null results would imply a uniform distribution. I haven't done any simulations of this, so I may very well be wrong.
    – Mark White
    Aug 12 at 14:53










  • I'm not sure what you mean, but you would expect $(1-alpha)cdot100%$ of the tests to have a $p$-value below $alpha$. You don't need any 'clustering' for that, that's just $frac1alpha$ of the uniform distribution from $0$ to $1$.
    – Frans Rodenburg
    Aug 12 at 15:00
















  • "If you are running a very large number of tests and don't conclude any nominally significant differences (uncorrected), then perhaps your test is not powerful enough" But wouldn't the p-values still be uniformly distributed in this situation, if there is a null effect? I'm just having a hard time wrapping my mind around there being a clustering of p-values above a threshold, since null results would imply a uniform distribution. I haven't done any simulations of this, so I may very well be wrong.
    – Mark White
    Aug 12 at 14:53










  • I'm not sure what you mean, but you would expect $(1-alpha)cdot100%$ of the tests to have a $p$-value below $alpha$. You don't need any 'clustering' for that, that's just $frac1alpha$ of the uniform distribution from $0$ to $1$.
    – Frans Rodenburg
    Aug 12 at 15:00















"If you are running a very large number of tests and don't conclude any nominally significant differences (uncorrected), then perhaps your test is not powerful enough" But wouldn't the p-values still be uniformly distributed in this situation, if there is a null effect? I'm just having a hard time wrapping my mind around there being a clustering of p-values above a threshold, since null results would imply a uniform distribution. I haven't done any simulations of this, so I may very well be wrong.
– Mark White
Aug 12 at 14:53




"If you are running a very large number of tests and don't conclude any nominally significant differences (uncorrected), then perhaps your test is not powerful enough" But wouldn't the p-values still be uniformly distributed in this situation, if there is a null effect? I'm just having a hard time wrapping my mind around there being a clustering of p-values above a threshold, since null results would imply a uniform distribution. I haven't done any simulations of this, so I may very well be wrong.
– Mark White
Aug 12 at 14:53












I'm not sure what you mean, but you would expect $(1-alpha)cdot100%$ of the tests to have a $p$-value below $alpha$. You don't need any 'clustering' for that, that's just $frac1alpha$ of the uniform distribution from $0$ to $1$.
– Frans Rodenburg
Aug 12 at 15:00




I'm not sure what you mean, but you would expect $(1-alpha)cdot100%$ of the tests to have a $p$-value below $alpha$. You don't need any 'clustering' for that, that's just $frac1alpha$ of the uniform distribution from $0$ to $1$.
– Frans Rodenburg
Aug 12 at 15:00


Comments

Popular posts from this blog

What does second last employer means? [closed]

List of Gilmore Girls characters

Confectionery