
I think the math here should be right, although perhaps the confusion is because I wasn't clear about the role of the ordering:

We're ordering the six trials by speed and computing the probability of getting an AAABBB ordering. There are C(6,3) = 20 equally likely orderings under random chance, so the probability of AAABBB is 1/20. That's why we don't care about getting BBB first, and why getting AAA first does tell you that A is faster than B.
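
You can sanity-check the 1/20 by enumeration. A minimal Python sketch (the A/B labels are just illustrative):

    from itertools import permutations

    # All distinct orderings of three A trials and three B trials: C(6,3) = 20
    orderings = set(permutations("AAABBB"))
    print(len(orderings))                # 20
    print(tuple("AAABBB") in orderings)  # True: AAABBB is one of the 20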

You can also click the link in the OP and see other people explaining the logic of the rule. If there's an error in either my explanation or theirs, I'm interested in hearing it.



The problem is that you are equally likely to get a BBBAAA ordering, which would give you the same Type I error as AAABBB in the case where A and B have the same distribution.

The key thing to remember when deriving a p-value is that you are computing the probability of seeing an outcome at least as extreme as yours given the null hypothesis, not the probability of seeing your particular outcome.
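
Concretely, a rough Python sketch of "at least as extreme" for this setup, counting both directions:

    from itertools import permutations

    # Under the null, all orderings of AAABBB are equally likely.
    # Test statistic: number of B's among the three fastest trials.
    orderings = set(permutations("AAABBB"))

    # "At least as extreme" in either direction: 0 B's (AAABBB) or 3 B's (BBBAAA)
    extreme = [o for o in orderings if o[:3].count("B") in (0, 3)]
    print(len(extreme) / len(orderings))  # 0.1 -> the two-sided p-value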


I see your point. But isn't that only true for two-tailed tests? This sort of ordering problem seems more suited to a one-tailed test. Let's say, for example, that we represent the bins of the probability distribution as the number of B's among the three fastest trials (0, 1, 2, or 3).

In this case I am defining statistical significance for our observed ordering (AAABBB) as anything in the lowest 0.05 of that probability distribution, which corresponds to the 0-B's bin. That corresponds to a one-sided 95% confidence interval.
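
For reference, a quick enumeration of that distribution in Python (assuming all orderings are equally likely under the null):

    from itertools import permutations
    from collections import Counter

    # Distribution of "number of B's among the three fastest trials"
    # when all 20 orderings of AAABBB are equally likely.
    orderings = set(permutations("AAABBB"))
    bins = Counter(o[:3].count("B") for o in orderings)
    print(sorted(bins.items()))  # [(0, 1), (1, 9), (2, 9), (3, 1)]
    print(bins[0] / 20)          # 0.05 -> the 0-B's bin alone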

What am I doing wrong?


A one-sided test is only valid if you know that it's not possible to get a result in the other direction. In the general case you don't know for certain that B can't be faster than A. One-sided tests are almost always invalid because it's very difficult to know that the other direction is impossible.

If this doesn't make sense, I would recommend running simulations under the null hypothesis. You will see that 5% of the time you will falsely conclude that A < B, and in another 5% of the time you will falsely conclude that B < A, leading to an overall false positive rate of 10%.
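
Something like this rough Python sketch (under the null the actual speed distribution doesn't matter, so shuffling the labels stands in for sorting random speeds):

    import random

    # Simulate the null: A and B have identical speed distributions,
    # so the speed-sorted ordering of the six trials is uniformly random.
    runs = 100_000
    a_faster = b_faster = 0
    for _ in range(runs):
        labels = list("AAABBB")
        random.shuffle(labels)  # stands in for sorting six random speeds
        if labels[:3] == ["A", "A", "A"]:
            a_faster += 1       # false conclusion: A < B
        elif labels[:3] == ["B", "B", "B"]:
            b_faster += 1       # false conclusion: B < A
    print(a_faster / runs, b_faster / runs, (a_faster + b_faster) / runs)
    # ~0.05, ~0.05, ~0.10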


I think I'm finally starting to see the heart of the problem now.

> You will see that 5% of the time you will falsely conclude that A < B, and in another 5% of the time you will falsely conclude that B < A, leading to an overall false positive rate of 10%.

From my perspective, this makes perfect sense, since I know there is a 1/20 chance of getting three A's first, based on the combinations formula. Where I'm having trouble is understanding why the p-value needs to be derived to account for both of these extremes.

Earlier you wrote:

> The key thing to remember when deriving a p-value is that you are computing the probability of seeing an outcome at least as extreme as yours given the null hypothesis, not the probability of seeing your particular outcome.

This is where the confusion is coming from for me. Based on this definition of the p-value, everything you say makes sense. But I don't understand what the point of deriving a p-value this way is, or what it tells us. And to clarify, I'm not saying you're wrong, just that I need to read up on p-value derivation to understand this.



