In a time series, change point detection tries to identify abrupt changes. Different approaches have been proposed and here I show one of them based on a simple rolling t-test.
For this example, I use ggplot2 and so you need to enter:
rm(list = ls())
if (!require(ggplot2)) install.packages('ggplot2')
## Loading required package: ggplot2
Let's generate some data : 50 points where the first (last) 25 points originate from a normal distribution with mean 2 (8) and variance 0.25 (0.25). The change point therefore occurs at index 26 (or 25 depending on how you define it).
# generate data
set.seed(10)
data<-c(rnorm(25,2,0.5),rnorm(25,8,0.5))
There are many papers discussing change point detection with a t-test. Here, I have chosen to cite a paper from Seol et al. that basically applies this method to Magnetic Tweezers time traces. The idea is quite simple: for each point (i), we select 2 different windows of length (n-1):
We then perform a t-test on this 2 samples and return 0 if the p-value smaller than some chosen value (say 0.05) or the test's statistics (its absolute value) otherwise. In my opinion the paper of Seol and collaborators is over-complicating things (giving analytic formulas for the cumulative distribution) but it shows some graphics that help the reader to understand.
The t-test is performed for all possible datapoints (given that both windows have a size of n-1). Note that I perform here an unequal variances t-test (the default argument is var.equal = FALSE) because we expect large differences in variances (when close to the change point).
# test
alpha<-0.05
n<-10 # length of window
t<-1:(length(data)-2*n+2) # define a vector for sapply (rolling)
values<-sapply(t, function(z){
w1 <- data[z:(n + z-2)] # window1
w2 <- data[(n+z):(2*n + z-2)] #window2
res<-t.test(w1, w2) # 0 for H0, t for H1 (see below)
if (res$p.value< alpha) {
t<-abs(res$statistic)
}
else {
t<-0
}
})
That is basically it. We have calculated 32 values (50 minus 2 times n-1, where n is 10) that correspond to indexes:
values_x<-seq(n,length(data)-n+1,1)
values_x
## [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [26] 35 36 37 38 39 40 41
A graph is always good to see what's happening. Because I will make 3 graphs using the same script, I first create a function that creates 1 graphs composed of 2 panels: the first one displays the data (with 2 rectangles indicating the 2 windows for index i); the second one shows the output of the sapply calculation (for index i and below) :
# a simple function to plot
my_plot<-function(i)
{
par(mfrow=c(2,1))
i<-i-n+1
j<-i-2
plot(data,pch=19,cex=1,cex.main=1.5, cex.lab=1.5, cex.axis=1.5)
rect(j+2,-n,n+j,100, col= 'blue', density=5)
rect(n+j+2,-50,2*n+j,100, col= 'black', density=5)
plot(values_x[1:(i)],values[1:(i)],
ylim=c(0,10),xlim=c(0,50),
pch=19,cex=1, cex.main=1.5, cex.lab=1.5, cex.axis=1.5,
xlab="index",ylab="test statistics (H1, abs)")
par(mfrow=c(1,1))
}
my_plot(10)
my_plot(20)
my_plot(41)
As you can see, the change point corresponds to the maximum of the test's statistics (as defined above).
values_x[which.max(values)]
## [1] 25
Note (i) that some averaging is performed in the original paper (I did not do it here as I guess it is always best to work on raw data) and (ii) that data have to be normally distributed (t-test assumptions).
Work done with Iryna Riabko