We have previously performed some 2-sample t-tests. When the p-value was smaller than 0.05, we concluded that the 2 samples originate from 2 populations with different means. Obviously, the probability of concluding that 2 samples (actually originating from populations with different means) have equal means depends on the sample size, the actual difference in means, the variability of the populations and the type-I error. That is what we will investigate here.

Let us define some parameters:

rm(list = ls()) 
delta<-0.5 # difference in means
sd<-1.5 # standard deviation in populations
mu1<-20 # mean of 1st population
mu2<-mu1+delta # mean of 2nd population
n<-40 # number (n) of elements in each sample

Under the null hypothesis, we assume both samples originate from 2 populations with equal means:

$$H_0: \mu_1 = \mu_2$$

Because (i) the size of the samples is large (here n is set at 40) and (ii) the variance of the populations is known (and equal, here set at 2.25), the observed difference in means follows a normal distribution with mean 0 and variance $2\sigma^2/n$ (Z-test).
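
As a quick check (a minimal sketch reusing the parameters defined above; the name d_h0 is arbitrary), we can verify by simulation that the standard deviation of this difference is indeed sd*sqrt(2/n):

# empirical check: sd of the difference in means under H0
d_h0<-replicate(10^4,mean(rnorm(n,mu1,sd))-mean(rnorm(n,mu1,sd)))
sd(d_h0)     # empirical value (R still finds the sd() function despite the sd variable)
sd*sqrt(2/n) # theoretical value, about 0.335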

As for any test, I calculate a critical value (stat_c) above which (in absolute value) I reject the null hypothesis (here with a type-I error of 0.05):

# Let us calculate some parameters using the formula
s_d<-sd*sqrt(2/n) # standard error of the mean difference
stat_c<-qnorm((1-0.05/2),0,s_d) # critical value (two-sided, alpha = 0.05)
stat_c
## [1] 0.6573919
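
Note that this is just the familiar 1.96 standard errors; as a sanity check, the standardized form gives the same value:

qnorm(1-0.05/2)*s_d # identical to stat_c: 1.959964 standard errors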

Let us plot all of this:

Here, I have defined the alternative hypothesis as $H_1: \mu_2 - \mu_1 = \delta$, where $\delta$ is set at 0.5.

# plot. Using non-standardized variables
x<-seq(-1.2,1.8,0.001)
y<-dnorm(x,0,s_d)
plot(x,y,type='l',ylab='density',xlab='difference in means',lwd=1.5,ylim=c(0,1.35))
abline(v=c(-stat_c,stat_c),lty=2,lwd=2)
abline(h=0)
# annotate
text(-0.35, 1, expression("H"[0]),cex=1.5)
text(-0.75,0.05,expression(alpha/2),cex=0.8)
text(+0.74,0.05,expression(alpha/2),cex=0.8)
text(0.95, 1,expression("H"[1]),cex=1.5)
text(0.3,0.4,expression(beta),cex=1.5)
text(0.8,0.4,expression(Power),cex=1)
text(delta/2,1.35,expression(mu[2]-mu[1]),cex=1)
segments(0, 1.3,delta,1.3)
segments(0, 1.1,0,1.3)
segments(delta, 1.1,delta,1.3)
# shade the two rejection regions (alpha/2 in each tail of H0)
polygon(c(x[x>=stat_c], stat_c), c(y[x>=stat_c], y[x==max(x)]), col="#BEBEBE80") # right tail
polygon(c(-stat_c,x[x<=-stat_c]), c(y[x==min(x)], y[x<=-stat_c]), col="#BEBEBE80") # left tail

x1<-x+(delta)
y1<-y
lines(x1,y1,lwd=2)
polygon(c(stat_c,x1[x1<=stat_c]), c(y[x1==min(x1)], y[x1<=stat_c]),  border = gray(.1),density = 10,angle = 135,col = "#BEBEBE80") # beta: H1 density below stat_c
polygon(c(x1[x1>=stat_c], stat_c), c(y[x1>=stat_c], y[x1==max(x1)]), border = gray(.1), density = 5, angle = 45,col ="#BEBEBE80") # power: H1 density above stat_c

We define the power as:

$$P = 1 - \beta$$

This is the probability of rejecting the null hypothesis (because the difference in observed means is above stat_c) given that the alternative hypothesis is true. To calculate this value, we just have to use pnorm and set the mean at $\delta$.

(1-pnorm(stat_c,delta,s_d)) # calculate Power assuming normal distribution
## [1] 0.3194448
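
Strictly speaking, the two-sided test also rejects when the difference falls below -stat_c. Under H1 this second region carries a negligible probability here, so the shortcut above is a very good approximation; the exact two-sided power would be:

# exact two-sided power: add the (tiny) lower rejection region
pnorm(-stat_c,delta,s_d)+(1-pnorm(stat_c,delta,s_d)) # about 0.3197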

Here, given a type-I error of 0.05, a sample size of 40, a difference in means (in populations) of 0.5 and a variance (in populations) of 2.25 (square of 1.5), we will 'see' a difference in means (reject the null hypothesis) 32 percent of the time. In other words, we will incorrectly conclude that the means are equal 68 percent of the time! This value can also be obtained using the function power.t.test:

power.t.test(n, delta, sd)$power # calculate Power using a t.test
## [1] 0.3129206
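
Note that power.t.test solves for whichever argument is left unspecified. For instance (a usage sketch, not part of the original analysis), we can ask for the sample size required to reach a power of 80 percent with the same delta and sd:

power.t.test(delta=delta,sd=sd,power=0.8)$n # n per group, roughly 140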

The difference between these 2 power values (0.3194 vs. 0.3129) can be attributed to the fact that power.t.test performs a t-test and not a z-test. A third possibility consists in performing a simulation (this might take a few tens of seconds):

# mini-simulation
n_sim<-10^6
rejects<-sapply(1:n_sim,function(z){
  abs(mean(rnorm(n,mu2,sd))-mean(rnorm(n,mu1,sd)))>stat_c
}) # TRUE when the null hypothesis is rejected
mean(rejects) # proportion of rejections = empirical power
## [1] 0.320129

Here, we have simply calculated the difference in observed means 1,000,000 times (for 2 datasets sampled from 2 normal distributions) and checked whether the difference (in absolute value) was larger than stat_c.
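
Since the mean of n normal draws is itself normally distributed with standard deviation sd/sqrt(n), the same simulation can be written in a vectorized (and much faster) form; a minimal sketch:

# faster equivalent: draw the 2 sample means directly
d_obs<-rnorm(n_sim,mu2,sd/sqrt(n))-rnorm(n_sim,mu1,sd/sqrt(n))
mean(abs(d_obs)>stat_c) # again close to 0.32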

In summary, we should always specify the statistical power when performing t-tests. Here, assuming that your test was non-significant (because your p-value was, by chance, larger than 0.05), you should write: I do not see any difference in means, but I shall mention that my test misses a true difference of 0.5 roughly 70 percent of the time. Alternatively, you might state that the smallest difference you could detect with a power of 80 percent (conventionally considered a good statistical power) is about 1 (see code below):

delta_c<-seq(0.1,2,0.1)
p_all<-power.t.test(n, delta_c, sd)$power
plot(delta_c,p_all,ylab="Power [Type-I error: 0.05, n=40]",xlab="Difference in means")
abline(h=0.8,lwd=2)
grid()
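
Rather than reading the value off the plot, we can again leave one argument of power.t.test unspecified and solve for delta directly:

power.t.test(n=40,sd=1.5,power=0.8)$delta # smallest detectable difference, about 0.95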

Finally, a power analysis also explains why the type-I error is often set at 0.05: a smaller value such as 0.01 would increase the type-II error and, in turn, decrease the statistical power.
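
To put numbers on this trade-off, we can recompute the power with sig.level set to 0.01, everything else unchanged:

power.t.test(n,delta,sd,sig.level=0.01)$power # drops from about 0.31 to roughly 0.13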