异常(断点)检测

Table of Contents

1 Break Point Detection

1.1 Why need Break Point Detection?

  • Break Point is a kind of anomaly
  • Break Point may influence the result for fitting ARIMA

1.2 CUSUM

1.2.1 原理1

对于时间序列 $ X = x1, x2, …, xn $, 假设: $$ S_{n + 1} = max(0, S_n + x_n - w_n)$$ 选定某个阈值 h, 当 Sk > h 时, 即认为在k处出现了断点(发生了异常)。

1.2.2 参数选择

通常情况下, 选择 $ wi = x.mean() , h = 0 $

1.2.3 优点:

计算简单,快捷

1.3 Adaptive threshold algorithm2

1.4 其他方法

通过暴力计算时间序列各部分的拟合函数,找到残差最小的点。参考以下代码:

#cumsum test
fit <- arima(x, order = c(2,0,0), include.mean = FALSE)
e <- residuals(fit)
sigma <- sqrt(fit$sigma2)
n <- length(x)
cs <- cumsum(e) / sigma

#find the break point
rss <- sum(residuals(fit)^2)
sigma2 <- fit$sigma2
stats <- rep(NA, n)
for (i in seq.int(20, n-20))
{
  fit1 <- arima(x[seq(1,i)], order = c(2,0,0), include.mean = FALSE)
  fit2 <- arima(x[seq(i+1,n)], order = c(2,0,0), include.mean = FALSE)
  ess <- sum(c(residuals(fit1), residuals(fit2))^2)
  stats[i] <- (rss - ess)/sigma2
}
plot(stats)
abline(h = qchisq(0.05, df = length(coef(fit)), lower.tail = FALSE), lty = 2, col = "red")
which.min(1 - pchisq(stats, df = 2))

2 Anomaly Detection

2.1 针对时间序列的异常检测

2.1.1 Grubbs' Test3

2.1.2 概述

Grubbs' test (Grubbs 1969 and Stefansky 1972) is used to detect a single outlier in a univariate data set that follows an approximately normal distribution. Grubbs' test is also known as the maximum normed residual test.

2.1.3 原理

Grubbs' test is defined for the hypothesis:

H0: There are no outliers in the data set Ha: There is exactly one outlier in the data set

Test Statistic: The Grubbs' test statistic is defined as: $$ G = \frac{\max{|Y_{i} - \bar{Y}|}} {s} $$

with $ \bar{Y}$ and \(s\) denoting the sample mean and standard deviation, respectively. The Grubbs' test statistic is the largest absolute deviation from the sample mean in units of the sample standard deviation. This is the two-sided version of the test. The Grubbs' test can also be defined as one of the following one-sided tests:

test whether the minimum value is an outlier $$ G = \frac{\bar{Y} - Y_{min}} {s} $$ with Ymin denoting the minimum value.

test whether the maximum value is an outlier $$ G = \frac{Y_{max} - \bar{Y}} {s} $$ with Ymax denoting the maximum value.

Significance Level Critical Region:

For the two-sided test, the hypothesis of no outliers is rejected if $$ G > \frac{(N-1)} {\sqrt{N}} \sqrt{\frac{(t_{\alpha/(2N),N-2})^2} {N-2+(t_{\alpha/(2N),N-2})^2}} $$

with tα/(2N),N−2 denoting the critical value of the t distribution with (N-2) degrees of freedom and a significance level of α/(2N). For one-sided tests, we use a significance level of level of α/N.

2.1.4 Tietjen-Moore Test for Outliers4

2.1.5 概述

The Tietjen-Moore test (Tietjen-Moore 1972) is used to detect multiple outliers in a univariate data set that follows an approximately normal distribution. The Tietjen-Moore test is a generalization of the Grubbs' test to the case of multiple outliers. If testing for a single outlier, the Tietjen-Moore test is equivalent to the Grubbs' test.

It is important to note that the Tietjen-Moore test requires that the suspected number of outliers be specified exactly. If this is not known, it is recommended that the generalized extreme studentized deviate test be used instead (this test only requires an upper bound on the number of suspected outliers).

2.1.6 原理

The Tietjen-Moore test is defined for the hypothesis: H0: There are no outliers in the data set Ha: There are exactly k outliers in the data set Test Statistic: Sort the n data points from smallest to the largest so that yi denotes the ith largest data value. The test statistic for the k largest points is

$$ L_{k} = \frac{\sum_{i=1}^{n-k}{(y_{i} - \bar{y}_{k})^{2}}} {\sum_{i=1}^{n}{(y_{i} - \bar{y})^{2}}} $$ with y¯ denoting the sample mean for the full sample and y¯k denoting the sample mean with the largest k points removed.

The test statistic for the k smallest points is

$$ L_{k} = \frac{\sum_{i=k+1}^{n}{(y_{i} - \bar{y}_{k})^{2}}} {\sum_{i=1}^{n}{(y_{i} - \bar{y})^{2}}} $$ with y¯ denoting the sample mean for the full sample and y¯k denoting the sample mean with the smallest k points removed.

To test for outliers in both tails, compute the absolute residuals

$$ r_{i} = |y_{i} - \bar{y}| $$

and then let zi denote the yi values sorted by their absolute residuals in ascending order. The test statistic for this case is

$$ E_{k} = \frac{\sum_{i=1}^{n-k}{(z_{i} - \bar{z}_{k})^{2}}} {\sum_{i=1}^{n}{(z_{i} - \bar{z})^{2}}} $$

with z¯ denoting the sample mean for the full data set and z¯k denoting the sample mean with the largest k points removed.

Significance Level: α Critical Region:

The critical region for the Tietjen-Moore test is determined by simulation. The simulation is performed by generating a standard normal random sample of size n and computing the Tietjen-Moore test statistic. Typically, 10,000 random samples are used. The value of the Tietjen-Moore statistic obtained from the data is compared to this reference distribution. The value of the test statistic is between zero and one. If there are no outliers in the data, the test statistic is close to 1. If there are outliers in the data, the test statistic will be closer to zero. Thus, the test is always a lower, one-tailed test regardless of which test statisic is used, Lk or Ek.

2.1.7 Generalized ESD(extreme Studentized deviate) Test 5

2.1.8 概述

ESD被用来判断近似正态分布中的异常值/离群点。 Grubbs Test只能检测一个异常值, 而Tietjen-Moore Test需给定异常值的个数k,并且如果k设置不合理,可能很难获得正确结果。 但是, ESD则不同。只需设置异常值最多有多少个,ESD就能找出那些可能的异常值。因此,可以说ESD比之前的办法更通用。

2.1.9 原理

Given the upper bound, r, the generalized ESD test essentially performs r separate tests: a test for one outlier, a test for two outliers, and so on up to r outliers. The generalized ESD test is defined for the hypothesis: H0: There are no outliers in the data set Ha: There are up to r outliers in the data set Test Statistic: Compute

$$ R_i = \frac{\mbox{max}_i |x_i - \bar{x}|}{s} $$

with x¯ and s denoting the sample mean and sample standard deviation, respectively.

Remove the observation that maximizes |xi−x¯| and then recompute the above statistic with n - 1 observations. Repeat this process until r observations have been removed. This results in the r test statistics R1, R2, …, Rr.

2.1.10 示例

R commands and output:

## Input data. y = c(-0.25, 0.68, 0.94, 1.15, 1.20, 1.26, 1.26, 1.34, 1.38, 1.43, 1.49, 1.49, 1.55, 1.56, 1.58, 1.65, 1.69, 1.70, 1.76, 1.77, 1.81, 1.91, 1.94, 1.96, 1.99, 2.06, 2.09, 2.10, 2.14, 2.15, 2.23, 2.24, 2.26, 2.35, 2.37, 2.40, 2.47, 2.54, 2.62, 2.64, 2.90, 2.92, 2.92, 2.93, 3.21, 3.26, 3.30, 3.59, 3.68, 4.30, 4.64, 5.34, 5.42, 6.01)

## Generate normal probability plot. qqnorm(y)

## Create function to compute the test statistic. rval = function(y){ ares = abs(y - mean(y))/sd(y) df = data.frame(y, ares) r = max(df$ares) list(r, df)}

## Define values and vectors. n = length(y) alpha = 0.05 lam = c(1:10) R = c(1:10)

## Compute test statistic until r=10 values have been ## removed from the sample. for (i in 1:10){

if(i==1){ rt = rval(y) R[i] = unlist(rt1) df = data.frame(rt2) newdf = df[df$ares!=max(df$ares),]}

else if(i!=1){ rt = rval(newdf$y) R[i] = unlist(rt1) df = data.frame(rt2) newdf = df[df$ares!=max(df$ares),]}

## Compute critical value. p = 1 - alpha/(2*(n-i+1)) t = qt(p,(n-i-1)) lam[i] = t*(n-i) / sqrt((n-i-1+t**2)*(n-i+1))

} ## Print results. newdf = data.frame(c(1:10),R,lam) names(newdf)=c("No. Outliers","Test Stat.", "Critical Val.") newdf

##> No. Outliers Test Stat. Critical Val. ##> 1 1 3.118906 3.158794 ##> 2 2 2.942973 3.151430 ##> 3 3 3.179424 3.143890 ##> 4 4 2.810181 3.136165 ##> 5 5 2.815580 3.128247 ##> 6 6 2.848172 3.120128 ##> 7 7 2.279327 3.111796 ##> 8 8 2.310366 3.103243 ##> 9 9 2.101581 3.094456 ##> 10 10 2.067178 3.085425

###################################################################### ## ============================================================== ## ######################################################################

2.1.11 Seasonal Hybrid ESD (S-H-ESD)

2.2 针对特征空间的异常检测6

2.2.1 TODO SVM

2.2.2 TODO iForest

Footnotes:

Author: Marcnuth

Last Updated: 2017-05-24 Wed 13:42