loading

The perfect choice of one-stop service for diversification of architecture.

How to Select Anomaly Detection Algorithm

Anomaly detection (also known as outlier detection) is a task to detect abnormal instances, which are very different from conventional instances. These instances are called outliers or outliers, while normal instances are called internal values.

Anomaly detection can be used in a variety of applications, such as:

â‘  Fraud identification

â‘¡ Detect defective products in manufacturing

â‘¢ Data cleansing -- removing outliers from a data set before training another model.

You may have noticed that some unbalanced classification problems are often solved by anomaly detection algorithms. For example, spam detection task can be considered as a classification task (spam is much less than ordinary e-mail), but we can use exception detection to achieve this task.

A related task is singular value detection. It differs from anomaly detection in that it is assumed that the algorithm is trained on a clean data set (no outliers). It is widely used in online learning when it is necessary to identify whether a new instance is an outlier.

Another related task is density estimation. It is the task of estimating the probability density function of the random process generated by the data set. Density estimation is usually used for anomaly detection (instances located in low-density areas are likely to be anomalies) and data analysis. It is usually solved by clustering algorithm based on density (Gaussian mixture model or DBSCAN).

statistical method

The easiest way to detect outliers is to try statistical methods, which were developed a long time ago. One of the most popular methods is called outlier detection Tukey method (or quartile distance IQR).

Its essence is to calculate the range between percentile and quartile. Data points before q1-1.5 * IQR and after Q3 1.5 * IQR are considered outliers. Below you can see an example of using a person's height data set. Heights below 54.95 inches (139 cm) and above 77.75 inches (197 cm) are considered outliers.

This and other statistical methods (Z-score method for detecting outliers, etc.) are usually used for data cleaning.

Clustering and dimensionality reduction algorithm

Another simple, intuitive and usually effective anomaly detection method is to use some clustering algorithms (such as Gaussian mixture model and DBSCAN) to solve the task of density estimation. Then, any instance located in the low-density area can be considered as an exception. We only need to set some density thresholds.

In addition, any with inverse can be used_ The dimension reduction algorithm of transform() method. This is because the abnormal reconstruction error is always much larger than that of the normal example.

Isolated forest and SVM

Some supervised learning algorithms can also be used for anomaly detection, of which the two most popular are isolated forest and SVM. These algorithms are more suitable for singular value detection, but they are usually also suitable for anomaly detection.

The isolated forest algorithm constructs a random forest, in which each decision tree grows randomly. With each step, the forest isolates more and more points until all points become isolated. Because exceptions are located far from the usual data points, they are usually isolated in fewer steps than normal instances. The algorithm performs well for high-dimensional data, but needs a larger data set than SVM.

SVM (a kind of SVM in our example) is also widely used in anomaly detection. Kernel SVM can construct an effective "constraint hyperplane", which separates normal points from abnormal points. Like any SVM modification, it can handle high-dimensional or sparse data well, but it is only suitable for small and medium-sized data sets.

Local anomaly factor

The local outlier factor (LOF) algorithm is based on the assumption that the anomaly is located in a low-density region. It not only sets the density threshold (as we can do with DBSCAN), but compares the density of a point with the density of K of its nearest neighbor. If the density of this particular point is much lower than that of its neighbors (which means it is far from them), it is considered an anomaly.

The algorithm can be used for both anomaly detection and singular value detection. Because of its simple calculation and good quality, it will be often used.

Minimum covariance determinant

The minimum covariance determinant (MCD or its modified fast MCD) can be used for outlier detection, especially in data cleaning. It assumes that interior points are generated from a single Gaussian distribution, while outliers are not generated from this distribution. Because many data have normal distribution (or can be simplified to normal distribution), the algorithm usually performs well. In sklearn, the ellipticenvelope class is its implementation.

How to select anomaly detection algorithm?

If you need to clean up the dataset, you should first try classical statistical methods, such as Tukey method for outlier detection. If you know that the data distribution is Gaussian, you can use fast MCD,.

If you don't do exception detection for data cleaning, first try a simple and fast lof. If it doesn't work well (or if you need to separate hyperplanes for some reason) - try other algorithms based on your task and dataset:

Single class SVM for sparse high-dimensional data or isolated forest for continuous high-dimensional data

If you can assume that the data is generated by the mixture of multiple Gaussian distributions, you can try the Gaussian mixture model

How to Select Anomaly Detection Algorithm 1

GET IN TOUCH WITH Us
recommended articles
Related Blogs blog
Although GPR has been widely used in hydrology, engineering, environment and other fields, many basic theoretical and technical problems have not been fundamentally ...
So I am thinking of a possible answer to my own question:Build my own journal note entry app linked to a wiki.Zim Wiki uses a file based system for wiki. Maybe I cou...
About 8 I think1. where can i get ncaa football 10 rosters with names?For the last two years I got mine from "Pastapadre". From what i can tell they are really prett...
The Haunting I vaguely remember some kind of eye injury in the movie.• Other Related Knowledge ofa cocktail glass— â€&...
rsync can be somewhat painful if you have a very large number of files - especially if your rsync version is lower than 3. On the other hand: if you use tar, you wou...
Blockchain Technology Explained: Powering BitcoinMicrosoft recently became the latest big name to officially associate with Bitcoin, the decentralized virtual curren...
PLEASE HELP ME CHOOSE A VIDEO CAMERA!?the Flip ultra HD is a really good HD portable camcorder, and it's fairly cheap. A lot of famous youtubers use it, such as timo...
In order to implement the tasks proposed in the outline of the national medium and long term science and technology development plan (2006-2020), the national key R ...
Alibaba group and Royal Philips of the Netherlands announced that they have officially signed an IT infrastructure service framework agreement to jointly promote the...
but it seems the company has disappeared and you can only get it from other sites such as Canadian Content.It works on Windows 7 and due to the nature of the OSes, t...
no data
Customer service
detect