One of the greatest challenges in business is dealing with anomalies in the data that come across your desk. According to the Engineering Statistics Handbook, “An outlier is an observation that lies at an abnormal distance from other values in a random sample from a population.” In order for a business to be successful, its principals need to understand any deviation from what constitutes the norm — from “business as usual,” so to speak.
Suppose you own a company called Trendline Sunglasses and you’ve just discovered a spike in overall sales of your top product. This could be a great business opportunity: the result of a successful marketing campaign, which could mean greater profitability and the need to reallocate funds. But a spike in sales could also indicate a problem, such as a pricing glitch that has caused a run on an item and is leading to lost revenue.
1. Global outliers
Point anomalies are single data points that are termed global because they lie far outside the distribution of the data set as a whole.
A business customer who typically deposits $10,000 in the bank each Friday and suddenly deposits $50,000 on two consecutive Fridays would be a global anomaly, since these deposits stand well outside the customer’s history. A time series line chart of the account would show a hockey-stick uptick, which would likely alert the federal authorities to possible illicit activity.
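The deposit example can be sketched in a few lines of Python. This is only an illustration, not a method prescribed by this article: it assumes the modified z-score (based on the median and the median absolute deviation) with the commonly used cutoff of 3.5, and the function name `global_outliers` and the deposit figures are made up for the example.

```python
from statistics import median

def global_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a few extreme values do not distort the baseline.
    The 0.6745 constant and the 3.5 cutoff are conventional choices,
    used here as assumptions.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values equal the median; flag anything different.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical weekly deposits: six ordinary Fridays, then two large ones.
deposits = [10_000, 10_200, 9_800, 10_100, 9_900, 10_000, 50_000, 50_000]
print(global_outliers(deposits))  # [6, 7]
```

A median-based score is used instead of the mean and standard deviation because the two large deposits would otherwise inflate the very baseline they are being compared against.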
2. Contextual (conditional) outliers
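A data point is considered a contextual outlier if its value deviates significantly from other data points in the same context, even though the same value might be perfectly normal in a different context. In textual data, this could be punctuation among letters; in speech recognition, background noise.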
An example of a contextual outlier might be the aforementioned sudden surge in sales volume at the sunglasses company if it occurs outside of a promotion. Might this surge be due to a price glitch?
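One way to picture this is to compare each value only against other values that share its context. The sketch below is a minimal illustration under stated assumptions: the context labels (`"regular"`, `"promo"`), the sales figures, the function name `contextual_outliers`, and the three-standard-deviation cutoff are all hypothetical.

```python
from statistics import mean, stdev

def contextual_outliers(records, threshold=3.0):
    """records is a list of (context, value) pairs.

    Each value is compared with the other values that share its context
    (leave-one-out), and flagged when it lies more than `threshold`
    standard deviations from their mean.
    """
    flagged = []
    for i, (context, value) in enumerate(records):
        peers = [v for j, (c, v) in enumerate(records)
                 if c == context and j != i]
        if len(peers) < 2:
            continue  # not enough history in this context to judge
        mu, sigma = mean(peers), stdev(peers)
        if sigma == 0:
            if value != mu:
                flagged.append(i)
        elif abs(value - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Hypothetical daily unit sales, labeled by whether a promotion was running.
sales = [
    ("regular", 100), ("regular", 110), ("promo", 390), ("promo", 420),
    ("regular", 95), ("regular", 105), ("promo", 400), ("regular", 400),
    ("regular", 98), ("promo", 410),
]
print(contextual_outliers(sales))  # [7]
```

Sales of 400 units are routine on promotion days in this made-up data, so a global test would not single them out. Only the 400 recorded on a regular day is flagged.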
3. Collective outliers
A subset of data points within a larger data set is considered anomalous if, taken together, their values deviate significantly from the rest of the data set, even when no individual point looks unusual on its own.
It is axiomatic that stock prices of publicly traded companies fluctuate. It’s why we hire people to manage our portfolios for us: so that we don’t have to watch whether our stocks are up or down. But if a stock stayed at the exact same price (to the penny) for a long period of time, that would be a collective outlier. Something like this appeared to happen in 2017, when a computer glitch caused several tech stocks, including Apple and Amazon, to be listed at $123.47 for an extended period.
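The frozen-price case lends itself to a very simple check: look for unusually long runs of identical consecutive values. The following is a minimal sketch, assuming a hypothetical `min_length` of five observations and made-up prices. Other kinds of collective outliers would need different tests.

```python
def flat_runs(prices, min_length=5):
    """Return (start, end) index pairs, inclusive, for every run of
    identical consecutive values that is at least min_length long."""
    runs = []
    start = 0
    for i in range(1, len(prices) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(prices) or prices[i] != prices[start]:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = i
    return runs

# Hypothetical quotes: the price stops moving for six observations.
prices = [150.10, 150.62, 150.00, 150.00, 150.00,
          150.00, 150.00, 150.00, 151.01]
print(flat_runs(prices))  # [(2, 7)]
```

None of the individual quotes would trip a point-anomaly test. It is the run as a whole that is suspicious.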
Within each of these categories, you can find examples of univariate and multivariate anomalies. Univariate anomalies are outliers on one variable; multivariate anomalies are outliers on at least two. Both types can influence outcomes in statistical analysis.
Time series data & analysis
A time series consists of a succession of data points taken from measurements over time. Some examples of time series are ocean tide measurements, counts of sunspots, stock market values, and measurements of weather activity. Visualizations of time series data are typically done with line charts.
Common business applications of time series analysis include webpage views over time, active app users, sales by platform, time on site and numbers of transactions over time. You see how valuable time series can be within a business context.
Detection methods for time series data
Univariate outlier (anomaly) detection
A univariate time series measures one variable over time. For example, data might be collected on the number of visitors to a specific webpage every quarter-hour. This will give you a one-dimensional value every fifteen minutes. Univariate anomaly detection will focus on this one specific metric.
An advantage of univariate modeling is that it allows you to home in on specific processes and see them fully. But this is also its disadvantage: by focusing solely on one metric, it can keep you from noticing problems in other metrics.
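As a rough illustration of univariate detection on the quarter-hour visitor counts, the sketch below compares each new value with a rolling window of the values just before it. The window size, the threshold, the function name `rolling_zscore_anomalies`, and the visitor counts are assumptions chosen for the example, not recommendations.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=8, threshold=3.0):
    """Flag index i when series[i] lies more than `threshold` standard
    deviations from the mean of the `window` values preceding it."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            if series[i] != mu:
                flagged.append(i)
        elif abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical visitor counts, one per fifteen-minute interval.
visits = [120, 118, 125, 122, 119, 121, 124, 120, 123, 310, 122, 119]
print(rolling_zscore_anomalies(visits))  # [9]
```

Note that once the spike enters the window it inflates the standard deviation for the following intervals, so a production system would usually exclude flagged points from the baseline or use a robust statistic instead.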
Multivariate outlier (anomaly) detection
Multivariate outlier detection refers to processes for detecting anomalies across two or more variables in time series data. An advantage of multivariate detection is that it seeks to detect outliers as complete incidents and learns a single model for all of the data metrics. But, in the spirit of Maslow’s observation that “if you have a hammer, everything looks like a nail,” not all problems suit the heavy-handedness of multivariate models. Some problems require the focused intensity of univariate modeling.
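To make the multivariate idea concrete, here is a small sketch that scores a pair of metrics jointly using the Mahalanobis distance, which takes the correlation between the two metrics into account. It is only one of many possible techniques and is not claimed to be the approach any particular product uses. The metric pair (page views and transactions), the data, and the function name `mahalanobis_2d` are hypothetical.

```python
import math
from statistics import mean

def mahalanobis_2d(points, query):
    """Mahalanobis distance of `query` from the cloud of 2-D `points`,
    using the sample covariance matrix of the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = query[0] - mx, query[1] - my
    # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T for the 2x2 case
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical history of (page views, transactions) per interval.
history = [(100, 10), (120, 12), (140, 15), (160, 16),
           (180, 17), (200, 21), (220, 22), (240, 23)]

typical = mahalanobis_2d(history, (220, 22))   # follows the usual relationship
unusual = mahalanobis_2d(history, (220, 11))   # many views, few transactions
print(typical < 3 < unusual)  # True
```

In this made-up data, 220 page views and 11 transactions are each within the range seen before, so a univariate check on either metric alone would pass. Only the combination breaks the usual relationship between the two.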
A hybrid approach combines univariate and multivariate anomaly detection, and in most cases it is superior to either one used alone. This is because the two types of detection complement each other: univariate modeling gives you a focused view of individual metrics, while multivariate modeling groups related anomalies into concise incidents and helps decipher them. Using both together is the only way of getting a complete picture.
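A very simplified way to picture the hybrid idea: run a univariate detector on each metric, then group the resulting flags into incidents when they occur close together in time. The sketch below shows only the grouping step and uses time proximity as a stand-in for a real multivariate model. The metric names, the flagged indices, and the `max_gap` parameter are assumptions for illustration.

```python
def group_into_incidents(flags_by_metric, max_gap=1):
    """flags_by_metric maps a metric name to the time indices at which a
    univariate detector flagged it. Flags no more than `max_gap` steps
    apart are merged into one incident, whichever metric raised them."""
    events = sorted((t, metric)
                    for metric, times in flags_by_metric.items()
                    for t in times)
    incidents = []
    for t, metric in events:
        if incidents and t - incidents[-1]["end"] <= max_gap:
            incidents[-1]["end"] = t
            incidents[-1]["metrics"].add(metric)
        else:
            incidents.append({"start": t, "end": t, "metrics": {metric}})
    return incidents

# Hypothetical output of per-metric univariate detectors.
flags = {
    "page_views": [9, 10],
    "checkout_errors": [10, 11],
    "revenue": [40],
}
for incident in group_into_incidents(flags):
    print(incident["start"], incident["end"], sorted(incident["metrics"]))
# 9 11 ['checkout_errors', 'page_views']
# 40 40 ['revenue']
```

Five separate flags collapse into two incidents, the first of which ties together two metrics that misbehaved at the same time.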
Automated real-time anomaly detection for business metrics
Businesses today have thousands (often millions) of metrics to track. It is financially infeasible to hire the number of analysts necessary to track all the anomalies that occur. As a result, today’s companies all too often settle for creating high-level dashboards that aren’t nearly as refined as they need to be. With too few analysts and dashboards that have coarse filters, catching all the anomalies becomes a matter of luck. The only way to solve the problem is to adopt an automated approach to anomaly detection.
A reliable automated real-time anomaly detection system should use sophisticated, intelligent detection methods so as to detect all types of outliers—global, contextual, collective—and to understand the relationships between different data sets.
This is best achieved through:
A hybrid detection approach
The implementation of appropriate models and distributions for each time series (stationary, non-stationary, irregularly sampled, discrete, etc.)
The consideration of seasonal and trend patterns
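To illustrate why seasonal patterns matter, the sketch below removes a repeating cycle before testing for outliers: each value is compared with the median of all values at the same position in the cycle, and the leftover residuals are scored with a median-based z-score. It handles seasonality only, not trend, and the weekly period, the 3.5 cutoff, the function name `seasonal_anomalies`, and the data are assumptions made for the example.

```python
from statistics import median

def seasonal_anomalies(series, period, threshold=3.5):
    """Flag points that deviate from the typical value at the same
    position (phase) in a repeating cycle of length `period`."""
    # Typical value for each phase of the cycle.
    phase_median = [median(series[p::period]) for p in range(period)]
    residuals = [v - phase_median[i % period] for i, v in enumerate(series)]
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    if mad == 0:
        return [i for i, r in enumerate(residuals) if r != med]
    return [i for i, r in enumerate(residuals)
            if 0.6745 * abs(r - med) / mad > threshold]

# Hypothetical daily sales for three weeks: quiet weekdays, busy weekends.
daily_sales = [
    100, 104,  98, 102, 101, 300, 310,
    103,  99, 101,  97, 100, 305, 298,
     99, 102, 295, 101, 103, 302, 307,
]
print(seasonal_anomalies(daily_sales, period=7))  # [16]
```

The value 295 on a weekday is flagged, while the similar weekend values are not, because weekends are routinely that high in this made-up data.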
When is a spike in sales a sign of a good thing and when is it a sign of a problem? This, and more, is what outlier detection can tell you. Anomaly detection is crucial for separating signal from noise in business data. Real-time automated anomaly detection helps pinpoint outliers in the millions of metrics generated each year.
Ira Cohen is chief data scientist and co-founder of Anodot, where he develops real-time multivariate anomaly detection algorithms designed to oversee millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has more than 12 years of industry experience.