Let's first have a look at the code below:
rm(list = ls()) n<-2:20 m<-factorial(n)/(factorial(n-2)*factorial(2)) p<- 1- (1-0.05)^m plot(n,p, main='Probability to observe at least 1 false discovery', xlab='Number of samples') grid()
What I calculate above is called the FWER, which is the probability (when making multiple comparisons, e.g. multiple t-tests) of having at least one false positive (a type-I error, e.g. to conclude that a coin is biased while, in reality, it is a fair coin). To determine the FWER for samples, we first have to calculate the total number of tests that has to be performed:
Then, we calculate the probability NOT to make a type-I error, which is . For samples, the overall probability is : And so the probability to make at least one type-I error by chance (to reject the null hypothesis is: which is what is plotted above (assuming a type-I error of 0.05).Comparing dispersions NOT means
There are different methods to maintain a low FWER when performing multiple comparisons (we will discuss one later) but they all suffer from limitations. In contrast, ANOVA is a robust statistical analysis and allows to determine if, among those samples, there is at least one mean that is different from others. Instead of looking at the means, ANOVA looks at dispersions among samples. Let us look at the plot below (I also give the code for convenience):
# if (!require(shape)) install.packages('shape') par(mfrow=c(1,2)) x <- seq(-2, 2, length=100) y <- dnorm(x,sd=0.5) par(mfrow=c(1,2)) plot(x,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "",xlim=c(-4,4),ylim=c(0,1)) axis(1, at = -5:5) lines(x+2,y,col='red',lwd=2) lines(x-3,y,col='blue',lwd=2) Arrows(-3.5,0.92,+2.7,0.92,code=3,arr.type="triangle") text(-0.5,0.97,"overall dispersion") Arrows(-0.5,0.82,0.5,0.82,code=3,arr.type="triangle") text(-0.2,0.86,"individual dispersion") plot(x,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "",ylim=c(0,1)) axis(1, at = -2:2) lines(x+0.2,y,col='red',lwd=2) lines(x-0.3,y,col='blue',lwd=2) Arrows(-0.35,0.92,+0.35,0.92,code=3,arr.type="triangle") text(-0.2,0.97,"overall dispersion") Arrows(-0.25,0.82,0.25,0.82,code=3,arr.type="triangle") text(-0.1,0.86,"individual dispersion") par(mfrow=c(1,1))
Assuming that all three samples have equal dispersions (variances), we can see the overall dispersion will be much larger than individual dispersions when samples have different means (left panel). In contrast, we expect the overall dispersion to be almost identical to any of the individual dispersions when samples have roughly identical means. This is exactly what ANOVA does.Example In the following paragraphs, I demonstrate how to perform an ANOVA in R. Let's assume we have measured PSA levels in different groups. We might consider that each group (factor) corresponds to a specific medication.
# ANOVA # Some data T1<-c(20,17,18,22,20,22,24,25) T2<-c(14,12,12,13,16,14,18,11,17,14) T3<-c(19,20,18,21,21,17,22) T4<-c(27,24,22,30,26,29,26,28) values<-c(T1,T2,T3,T4) exp<-as.factor(c(rep('T1',length(T1)),rep('T2',length(T2)),rep('T3',length(T3)),rep('T4',length(T4)))) all<-data.frame(values,exp) # just create a data.frame with values and factor head(all)
What I do above is something I have done before to group values per factors (see here). Let's have a look at the beginning and the at the end of the data.frame:
## values exp ## 1 20 T1 ## 2 17 T1 ## 3 18 T1 ## 4 22 T1 ## 5 20 T1 ## 6 22 T1
## values exp ## 28 22 T4 ## 29 30 T4 ## 30 26 T4 ## 31 29 T4 ## 32 26 T4 ## 33 28 T4
Having defined factors, it is easy to do multiple operations (here plot 4 boxplots) in a one line command.
with(all,boxplot(values~exp,xlab='Treatment' ,ylab='Protein Concentration (ng/ml)',main='PSA screening')) # boxplot
# Note that the plot above could have been obtained with # boxplot(all$values~all$exp,xlab='Treatment',ylab='Protein Concentration (ng/ml)',main='PSA screening') # or # boxplot(all[,1]~all[,2],xlab='Treatment',ylab='Protein Concentration (ng/ml)',main='PSA screening')
ANOVA relies on several assumptions: normality of the residuals, homogeneity of Variance. In addition it considers that, in each group, observations are (i.i.d.).Normality
To test the normality of the residuals, I use the following script:
# Assumption 1 # Normality of residuals # For each group, we take the mean and then subtract it to the actual values res_cal<-c(T1-mean(T1),T2-mean(T2),T3-mean(T3),T4-mean(T4)) # and then do check quantiles for normality qqnorm(res_cal) qqline(res_cal)
There I use a quantile-quantile plot, which allows to determine if two data sets have the same distribution. Basically, I compare quantiles calculated for a normal distribution with those calculated in my sample. If the data do have the same distribution (normal), data points should lie along a line (qqline). Note that this result could be obtained with the aov function:
# The above results can also be obtained using: res.aov<-aov(all[,1]~all[,2]) # plot(res.aov,2) # same graph as before. But shows possible outliers
# BTW, residuals are indeed what we have calculated above aov_residuals <- residuals(object = res.aov ) aov_residuals - res_cal # this is equal to 0
## 1 2 3 4 5 ## 8.215650e-15 1.021405e-14 -2.220446e-15 -8.104628e-15 -3.996803e-15 ## 6 7 8 9 10 ## -4.551914e-15 -3.996803e-15 -4.440892e-15 4.163336e-16 8.881784e-16 ## 11 12 13 14 15 ## 8.881784e-16 8.881784e-16 6.661338e-16 4.163336e-16 8.881784e-16 ## 16 17 18 19 20 ## 8.881784e-16 4.440892e-16 4.163336e-16 7.771561e-16 5.551115e-16 ## 21 22 23 24 25 ## 1.110223e-15 6.661338e-16 6.661338e-16 1.332268e-15 8.881784e-16 ## 26 27 28 29 30 ## 0.000000e+00 4.440892e-16 0.000000e+00 -4.440892e-16 0.000000e+00 ## 31 32 33 ## 4.440892e-16 0.000000e+00 -2.220446e-16
While a Q-Q plot provides sufficient evidence, some people prefer to run a test. The Shapiro-Wilk is a test for normality and assume, under the null hypothesis, that the observations originate from a normal distribution.
shapiro.test(x = aov_residuals )
## ## Shapiro-Wilk normality test ## ## data: aov_residuals ## W = 0.97926, p-value = 0.7637
Here, the p-value is larger than 0.05 and so we cannot conclude that the observations do not originate from a normal distribution.Homogeneity of variances
Again, I wil be using a graphical approach to test the homogeneity of variances:
# Assumption 2 # Homogeneity of variances plot(res.aov,1)
This graph just plots residuals as a function of factors (for instance, treatment 2 has a mean at about 14 and has residuals, which are in between -3.1 and 3.9). If the dispersion is about the same for all groups (which is the case here), we can conclude that variances are similar among groups (note that the red line is a local fit to the data).
As before, we can do a test (Bartlett's test) to check if samples originate from populations that have equal variances.
## ## Bartlett test of homogeneity of variances ## ## data: all[, 1] by all[, 2] ## Bartlett's K-squared = 1.2415, df = 3, p-value = 0.7431
Here, the p-value is larger than 0.05 and so we cannot conclude the observations do not originate from populations having diffent variances.ANOVA
We already have performed the ANOVA when calculating residuals (res.aov<-aov(all[,1]~all[,2], see above). To check the results, we only need to enter:
# ANOVA summary(res.aov) # anova
## Df Sum Sq Mean Sq F value Pr(>F) ## all[, 2] 3 694.6 231.6 39.89 2.03e-10 *** ## Residuals 29 168.3 5.8 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, the p-value is much smaller than 0.05 and indicates that there is at least one mean that is different than others.What's next ?
When the ANOVA is significant (low p-value), we can perform multiple t-tests to determine which means are different. As mentioned above, there are different possibilities with pros and cons. You might have heard about the Bonferroni correction, which is conservative. In other words, it keeps the type-I error low (by controlling it) at the expense of a low statistical power (i.e. an increase in the number of false negatives; see also my post). Thus it is more likely you will declare than two means are statistically identical while, in reality, they are not.
Today, the False Discovery Rate (FDR) correction seems to be preferred. Because it is less conservative (as compared to the Bonferroni correction) and so tend to increase the number of false positives, it has a greater statistical power. To perform all possible t-tests, we just have to use:
# There are different solutions pairwise.t.test(all[, 1],all[, 2], p.adjust.method = "fdr") # this calculates all possible t.tests (here 6)
## ## Pairwise comparisons using t tests with pooled SD ## ## data: all[, 1] and all[, 2] ## ## T1 T2 T3 ## T2 4.3e-06 - - ## T3 0.3110 8.1e-05 - ## T4 0.0001 6.0e-11 1.5e-05 ## ## P value adjustment method: fdr
where the argument p.adjust.method of pairwise.t.test is set to 'fdr'. Here, any value lower than 0.05 indicates that two means are statistically different (e.g. T3 and T4).