In a time series, change point detection tries to identify abrupt changes. In a previous post we have learned how to identify single change points using likelihood estimates and AIC values. Here, I show how to identify multiple change points states (i.e. statistically identical data).

# packages
rm(list=ls()) # clear memory
if (!require(ggplot2)) install.packages('ggplot2')
## Loading required package: ggplot2
if (!require(RColorBrewer)) install.packages('RColorBrewer')
## Loading required package: RColorBrewer
if (!require(dplyr)) install.packages('dplyr')
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Data

The script below looks similar to the previous one except that I now have multiple change points.

# parameters
penalty<-10 # Delta AIC penalty
display<-1 # display all graphs during calculations (good to see what is going on)
sd<-1 # sd is known and set here as 1
# data 
set.seed(50)
values<-c(rnorm(200,10,1),rnorm(200,2.1,1),rnorm(250,6,1),rnorm(75,2.1,1)) # NEW VALUES
plot(values)

Function(s)

We will also be using the function I have defined previously.

#main function
likelihood <- function (data,display){
  # Log likelihood calculation
  # R_1 is just the log of the cumulative product (w/o -1/(2*sd^2) )
  n<-length(data)
  R_1 <- sapply (1:n,function(t){
    part_inf<-(data[1:t]-mean(data[1:t]))^2
    part_sup<-(data[(t+1):n]-mean(data[(t+1):n]))^2
    cumsum(part_inf)[length(part_inf)]+cumsum(part_sup)[length(part_sup)]
  }  )
  R_0 <- cumsum((data-mean(data))^2)[n]
  #Log \hat{L}_1 (just need to add -(n/2)*log (2 * pi * sd^2)) and multiply -1/(2*sd^2) 
  norm_sum<-(-n/2)*(log(2*pi*sd^2))
  Log_H1<-norm_sum+(-1/(2*sd^2)) * R_1 # This is the *log* MLE for Hypothesis 1
  #Log \hat{L}_0 (just need to add -(n/2)*log (2 * pi * sd^2)) and multiply -1/(2*sd^2) 
  Log_H0<-norm_sum+(-1/(2*sd^2)) * R_0 # This is the *log* MLE for Hypothesis 0
  #AIC_1 (2 parameters in the model, so k=2)
  AIC_1 <- 2*2-2*Log_H1
  #AIC_0 (1 parameter in the model, so k=1)
  AIC_0 <- 2*1-2*Log_H0
  # Penalty and display
  if ((AIC_0-min(AIC_1,na.rm = TRUE)) <= penalty | which.min(AIC_1) ==1 | which.min(AIC_1) == n)
  {hyp<-0
  }else{hyp<-1
  }
  if (display==1) {
    par(mfrow=c(2,1))
    plot(data,xlab="Index (relative to chunk start)")
    if (hyp ==0)
    {title(main= paste("No changepoint"))
    }else{title(main= paste("Change point :",which.min(AIC_1)))
      abline(v=which.min(AIC_1), col="red",lty=2,lwd=3)
    }
    plot(AIC_1,ylab="Akaike information criterion",xlab="Index (relative to chunk start)")
    abline(h=AIC_0, col="blue",lty=2,lwd=2)
  }
  return<-c(hyp,which.min(AIC_1))
}
Script to identify multiple changepoints

There are different approaches (reviewed by Killick and Eckley to detect multiple change points. Binary segmentation (as proposed by Edwards and Cavalli-Sforza is a simple yet efficient method and consists in dividing the time trace into two 2 sub traces (where at least one trace contains no change point) and repeating this operation until no change point is found (see Figure below for an example).

That's pretty easy to code.

####################################################
####################################################
# Binary segmentation
# This identifies changepoints NOT states
####################################################
####################################################
start<-c(1) # For the first iteration, we take all data. 
stop<-c(length(values)) # And there is only one array
ch_points<-vector() # This is where I store the changepoints
nstart<-vector()
nstop<-vector()

while (length(start)!=0) {
  indC<-vector()
  i<-1
  for (i in i:length(start)) {
    return_lik <- likelihood(values[start[i]:stop[i]],display)
    if (return_lik[1]==0){
      nstart<-start[-i]
      nstop<-stop[-i]
      indC<-c(indC,i)
    }else{
      nstart<-c(start,(start[i]+return_lik[2]))
      nstop<-c(stop,(start[i]+return_lik[2]-1))
      ch_points<-c(ch_points,(start[i]+return_lik[2]))
    }
  }
  if (length(indC) ==0){
    start<-sort(nstart)
    stop<-sort(nstop)
  }else{
    start<-sort(nstart[-indC])
    stop<-sort(nstop[-indC])
  }  
}

This script produces 7 graphs here (note that the index is relative to the start of the (sub)-trace and so this is different from the graph just above).

We can finally list change points [using an absolute indexing and adding (or not) a value of one].

# Display results
final<-sort(ch_points[!duplicated(ch_points)])
final
## [1] 201 401 651
States

Having found change points, we now have to detect if different chunks (a part of the time series in between subsequent change points) are statically identical. Again there are different approaches and I have chosen a simple one (explained by Ensign and Pande). Basically, we (i) re-arrange chunks by (ascending) mean values and select the ones with the two smallest means (ii) apply our change point algorithm (the likelihood function) (iii) declare that those two chunks are statistically identical if no change point is found and (iv)repeat this procedure with new chunks (having larger means). This what is done in the script below.

####################################################
####################################################
## States
####################################################
####################################################
# first get all means and cpoints
cp_loc<-c(1,final,length(values)+1) # all change points (including 1 and length of data +1)
cp_means<-sapply(1:(length(cp_loc)-1),function(i){
  mean(values[cp_loc[i]:cp_loc[i+1]])
}) # all_means
# Then associate values, group (i.e. values between 2 changepoints) and states (0 as we do not know yet)
# to two data frame
states<-rep(0,length(values))
df_main<-data.frame(values, group = rep(seq_along(diff(cp_loc)), diff(cp_loc)), states)->df_main_iter
s<-0 # index of states 
# greedy algorithm
# we just take 2 groups with the 2 smallest means and check
# Then we remove the smallest mean
while (length(unique(df_main_iter$group))>1){
  means_all<-aggregate(df_main_iter$values, by=list(df_main_iter$group), FUN=mean)
  colnames(means_all)<-c("group",'mean')
  order_means<-means_all[order(means_all$mean),]
  # Then get the 2 groups having the 2 smallest means: order_means[1,1] and =order_means[2,1]
  # And Finally extract data 
  values_extracted<-c(df_main[which(df_main[,2]==order_means[1,1]),1], df_main[which(df_main[,2]==order_means[2,1]),1])
  # Now, check if there is a changepoint
  # And associate a new value (states) to column states of the main (iter) data frame
  cp_states_calc<-likelihood(values_extracted,display)
  if (cp_states_calc[1] == 0){
    df_main_iter[which(df_main_iter[,2]==order_means[1,1]),3]<-s->df_main_iter[which(df_main_iter[,2]==order_means[2,1]),3]
    df_main[which(df_main[,2]==order_means[1,1]),3]<-s->df_main[which(df_main[,2]==order_means[2,1]),3]
  }else{
    df_main_iter[which(df_main_iter[,2]==order_means[1,1]),3]<-s
    df_main[which(df_main[,2]==order_means[1,1]),3]<-s
    s<-s+1 # there is a new state
    df_main_iter[which(df_main_iter[,2]==order_means[2,1]),3]<-s
    df_main[which(df_main[,2]==order_means[2,1]),3]<-s
  }
  # remove values with smallest means (those defined by order_means[1,1])
  df_main_iter<-df_main_iter[-which(df_main_iter[,2]==order_means[1,1]),]
  df_main_iter[-which(df_main_iter[,2]==order_means[1,1]),]
}

Again, I will not spend much time on explain this script (there are many comments) and I have not introduced new R functions apart from aggregate. Instead, we will look at the output of the script (producing several graphs).

The two sub- time traces with the lowest means are the ones located between indexes [201,401[ and [651,[. The likelihood (AIC) algorithm does not detect any change point and so we will consider these two chunks statistically identical

Then we consider the chunks between [651,[ and [401,651]. Here a change point is found and so these two chunks are statistically different (their parent distributions have non equal means).

Finally we consider the chunks between [401,651] and [,201[. Here a change point is found and so these two chunks are statistically different (their parent distributions have non equal means).

Final Graph

Almost done. We can re-organize the data using tidyverse.

####################################################
####################################################
# Re-organize data
####################################################
####################################################
df_main <- df_main %>%
  mutate(idx = seq_along(values), group = as.integer(group)) %>%
  group_by(group) %>%
  mutate(m = mean(values)) %>%
  ungroup() %>%
  mutate(group2 = cumsum(group != lag(group, default = -1)))

The tibble contains all information (raw data, indexes and state) as well as a new column (group2), wich will be used to draw segments (see graph below).

df_main # print
## # A tibble: 725 x 6
##    values group states   idx     m group2
##     <dbl> <int>  <dbl> <int> <dbl>  <int>
##  1  10.5      1      2     1  9.89      1
##  2   9.16     1      2     2  9.89      1
##  3  10.0      1      2     3  9.89      1
##  4  10.5      1      2     4  9.89      1
##  5   8.27     1      2     5  9.89      1
##  6   9.72     1      2     6  9.89      1
##  7  10.4      1      2     7  9.89      1
##  8   9.41     1      2     8  9.89      1
##  9  11.0      1      2     9  9.89      1
## 10   8.55     1      2    10  9.89      1
## # ... with 715 more rows
####################################################
####################################################
# Final Graph
####################################################
####################################################


col_values <- brewer.pal(length(unique(df_main$states)), "Dark2")

ggplot(data = df_main, mapping = aes(x= idx, y = values)) +
  geom_point(shape = 1, color = "black", size=2) +
  geom_line(aes(x = idx, y = m, group = group2, color = as.factor(states)),
            size = 2) +
  scale_color_manual(values = col_values,
                     name = "states") +
  theme_light()

Finally, we can estimate means for all states using:

# Get states 
df_main%>%
  group_by(states) %>%
  dplyr::summarize(Mean = mean(values)) 
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   states  Mean
##    <dbl> <dbl>
## 1      0  2.18
## 2      1  6.02
## 3      2  9.89
Final remarks

Just to demonstrate that this method works nicely at a SNR close to 1. Using :

set.seed(50)
values<-c(rnorm(200,3.5,1),rnorm(200,2.1,1),rnorm(250,3,1),rnorm(75,2.1,1))

we get

final
## [1] 198 405 678

... and we correctly identify states.


Work done with Iryna Riabko