One article explains the storage, search, analysis and mining of time series and sequence data

Source: Data STUDIO
This article is about 4800 words, it is recommended to read for 5 minutes
This article will summarize the methods of industrial-level time series analysis.

In algorithm practice, it is of great significance to quickly grasp the characteristics of time series data and estimate long-term trends, repetitive changes, and cyclic changes. The reason is that it helps in choosing appropriate methods for analysis and helps in comprehensive understanding of the data. Through systematic analysis of trends, cycles, seasons and movements of irregular components, people can formulate long-term or short-term forecasts (i.e. forecast time series) under more reasonable circumstances.

How to conduct trend analysis

There are four main components of variation used to specialize time series data:

01 Long-term or trend changes

Trend movement, which is used to reflect the general direction of change, and its time series diagram is the data change over a longer time interval. This change is reflected as a trend curve, or trend line. Typical methods for determining trend curves or trend lines include the weighted moving average method and the least squares method, as discussed in detail below.

02 Cyclic movement or cyclic change

Cyclic movement or cyclic variations mainly refers to cyclicity, that is, the trend line or curve shows signs of swinging over a long period of time, which may or may not be cyclical. That is, the loop does not need to evolve along the same pattern between equal time intervals.

03 Seasonal movement or seasonal changes

Seasonal movements, or seasonal variations, reflect events that recur every year, such as a sudden increase in sales of chocolates and flowers before Valentine’s Day, or a sudden increase in sales of stored goods before Christmas. In other words, seasonal movement refers to the same, or nearly identical, pattern that repeats during relevant months in consecutive years.

04 Irregular or random changes

Irregular or random movements, which reflect sporadic temporal changes in random or accidental events, such as labor demands, floods, or personnel changes within the enterprise, etc.

The above trends, cyclic, repetitive, and irregular movements can be represented by variables T, C, S, and I respectively. Timing analysis can also refer to analysis that decomposes timing into the above four basic movements. The time series variable Y can be expressed as the product of four variables (i.e. Y=T%C%S%I), or the sum of the four variables. The choice is usually based on experience. “Given a set of values for Y (i.e., 1,2,3…., how to determine the trend of the data?” A common method to determine the trend is to calculate a moving average of order n according to the arithmetic mean sequence. average of order n).

Methods of trend analysis

01 Common methods of trend analysis: moving average

Using a weighted moving average of appropriate order can eliminate cycles, repetitions and irregular patterns in the data, while retaining only trend changes. Moving averages reduce the total amount of variation in a data set.

Therefore, using moving average instead of time series can reduce unwanted fluctuations, so it is also called smoothing of time series. The moving average will lose the head and tail data in the series, which sometimes generates cycles or other changing trends that do not appear in the original data, and it may be affected by some extreme data. The negative impact of extreme data can be reduced by using a weighted moving average with appropriate weights.

02 Other ways to calculate trends

One of them is the so-called free-hand method, which draws an approximate curve or straight line to fit a set of data based on the user’s judgment. This method is expensive and only reliable for large-scale data mining. The other is the least squares method, in which the most consistent curve C is used as the least squares curve, that is, the curve has a minimum value, where the deviation or error of d refers to the value y of point (x, y): and the corresponding curve C The difference between the values.

03 Adjustment methods for cyclical movements and seasonal fluctuations

In many business transactions, there are expected seasonal fluctuations, such as peak sales during the Christmas period. Therefore, for trend and cycle data analysis, it is important to identify such recurring changes and “deseasonalize” the data.

To this end, the concept of seasonal index is introduced, using a set of numbers to represent the relative value of a variable in certain months of the year. For example, if the sales in October, November and December are respectively 80%, 120% and 140% of the annual average monthly sales, then 80, 120 and 140 are the seasonal indexes for this year. If the original monthly data are removed by the corresponding seasonal index, the resulting data are said to be deseasonalized, or adjusted for seasonal variables. Counter-seasonalized data can be further adjusted for trends, that is, these data are removed according to the corresponding trend values. Moreover, a suitable moving average can smooth out irregular changes, leaving only cyclic changes for further analysis.

If the cycle exhibits periodicity or near-periodicity, a cyclic index can be introduced in the same way as a seasonal index.

04 Others

Finally, irregular or random changes can be estimated by adjusting the data for trends, seasonal and cyclical changes. Generally, small deviations occur with high frequency, while large deviations occur with low frequency, following the normal distribution. Recommended to follow @public account: DATA STUDIO for more high-quality articles~

Technical basis of timing analysis

01 What is a time series database?

What is a sequence database? A time series database refers to a database that consists of values or times that change over time. Values are usually data measured at equal time intervals. Time series databases are common in many applications, such as daily fluctuations in the stock market, dynamic product processing, scientific experiments, medical treatments, etc.

Time series database is also a kind of sequence database. However, a sequence database refers to a database composed of an ordered sequence of events, which may or may not have time stamps. For example, a web page traversal sequence is a type of sequence data, but may not be time series data. This section will introduce several important contents of time series database and sequence database mining, including trend analysis, similarity search, sequence pattern mining and periodic pattern mining of time-related data.

02 What is a similar search?

Unlike general database queries, which are intended to find the exact data that matches the query, similarity search is about finding the data sequence that is closest to a given query sequence. Subsequence matching is to find all data sequences that are similar to a given sequence, while whole sequence matching is to find sequences that are similar to each other. Analysis of financial markets (such as stock data analysis), Similar searches in time series analysis are of great use in medical diagnosis (such as electrocardiogram analysis), and scientific and engineering databases (such as energy consumption analysis).

03 Data transformation: from time domain to frequency domain

Question 1: Why do we need to transform data?

Many signal analysis techniques require data from the frequency domain. Therefore, orthogonal transformations related to distances are often used to transform data from the time domain to the frequency domain. Typically, a data-independent transformation is used, whose transformation matrix is predetermined independent of the input data. Two common data-independent transforms are the discrete Fourier transform (DFT) and the discrete wavelet transform (DWT)29. Since the distance between two signals in the time domain is similar to the Euclidean distance in the frequency domain, DTF works well, especially at the first few coeficients. By keeping only the first few (i.e., “strongest”) coefficients of the DFT, we can calculate a lower bound on the actual distance.

Question 2: Once the data has been transformed, such as DFT, how to perform a similar search?

To improve access efficiency, a multidimensional index can be constructed using the first few Fourier coefficients. When a similar query is submitted to the system, the index can be used to retrieve sequences that maintain a certain minimum distance from the query sequence. By calculating the actual distance between the time domain sequence and the sequence that does not satisfy the query, the necessary post-processing (postprocessing) can be performed.

Question 3: How to perform subsequence matching?

For subsequence matching, each sequence is first divided into window “segments” of length . Each sequence is mapped to a “trail” in the feature space. For subsequence analysis, the clues of each sequence are divided into “subtrails”, each of which is represented by a minimum bounding rectangle. Longer matching sequences can be searched using a multipiece assembly algorithm.

04 Enhanced similarity search method to handle gaps and differences in offset and amplitude

Most practical applications do not necessarily require that the matched subsequences are completely consistent on the timeline. In other words, we can consider pairs of subsequences to match if they have the same shape but have gaps within the sequence or differences in offsets or amplitudes. This is particularly useful in the analysis of many similar sequences, such as stock market analysis and electrocardiogram analysis.

Question 1: How to improve similar search so that it can still determine the similarity even if such differences exist?

An improved similarity model allows users or experts to specify some parameters, such as sliding window size, similarity range width, maximum gap, matching ffaction, etc. As shown in Figure 9.7, a two-person time series is given, in which the gaps are removed, and the resulting sequence is normalized by offset transfer and amplitude adjustment, so that sequences with different amplitudes and offsets can be matched.

When a sub-sequence and another sub-sequence are within a certain width e (where e is a small number that can be specified by the user or an expert), the two sub-sequences are similar and matchable. Two sequences are similar when there are enough non-overlapping pairs of similar subsequences in time order between them.

Question 2: What are the steps to perform similar searches with gaps, offsets and amplitude differences?

Atomic matching: Find all pairs of smaller identical windows without gaps.
Window stitching: Combining identical windows to form large pairs of similar subsequences, allowing gaps between several pairs of atoms.
Subsequence ordering: Linearly arrange subsequence matches to determine whether there are large enough similar fragments. Through the above processing steps, similar sequences with similar shapes but gaps or differences in offset or amplitude can be obtained.

05 Indexing method for similar search

“Is there an efficient implementation method?” To improve the efficiency of similar searches in large databases, various indexing methods have been proposed. For example, R-tree, R*-tree is used to store the minimum bounding rectangle to speed up similarity search. In addition, the ekdB tree was proposed to improve the speed of spatial similarity connections on high-dimensional points, and the suffix tree (suix tree) was also proposed.

Sequential Pattern Mining

Sequence pattern mining refers to mining patterns that appear frequently relative to time or other patterns. An example of a sequential pattern is “A customer who bought a Pentium PC nine months ago is likely to order a new CPU chip within a month.” Since many business transactions, telex records, weather data, and product processing are time series data, sequence pattern mining is very useful in data analysis for target markets, customer attraction, weather forecasts, etc. Recommended to follow @public account: DATA STUDIO for more high-quality articles~

01 Situations and parameters of sequence pattern mining

Many studies on sequence pattern mining mainly focus on symbolic patterns (SVMbolic patterns), because numerical curve patterns usually fall into the category of trend analysis and prediction in statistical time series analysis.

For sequence pattern mining, there are some parameters whose values will seriously affect the mining results. The first parameter is the duration of the time series (duration)T.

The second parameter is the event folding window w.

The third parameter is the interval int between times in the discovered pattern. This parameter can take the following values: int-0: indicates no time interval.

02 Sequential pattern mining method

The Apriori feature used in association rule mining can be used for mining sequence patterns, because if a sequence pattern of length k is infrequent, its superset (length k + 1) cannot be frequent. Therefore, most methods for sequence pattern mining adopt variants of the Apriori-like algorithm, although the parameter settings and constraints considered are different. Another method of mining such patterns is the database project based sequential pattern growth (database project based sequential pattern growth) technology, which is similar to the candidate-free frequent pattern mining (frequent pattern) method, frequent pattern growth (FP-growth).

Cycle Analysis

It refers to the mining of periodic patterns, that is, finding recurring patterns in time series databases. Periodic patterns can be applied in many important areas. Examples include seasons, tides, planetary orbits, daily energy consumption, daily traffic patterns, and all TV programs at a specific time of week. As pointed out in the previous section, periodic pattern mining can be regarded as sequence pattern mining with a set of fragmented sequences as the duration, such as each position before and after an event occurs every year, and so on.

01 Periodic pattern mining problems can be divided into three categories

Mining the full periodic pattern: Here each time point affects (exactly or approximately) the cyclic behavior in the timing. Each day of the year plays a role in the seasonal cycle of the year.

Mining partial periodic patterns: It describes timing cycles at partial points in time. For example, Sand reads the New York Times between 7:00 and 7:30 on weekday mornings, but not at other times. Partial cycles are a looser form than full cycles and are more common in the real world.

Mining cyclic or periodic association rules: This type of rule is an association rule for events that occur periodically. An example of a periodic association rule is “Based on daily business records, if afternoon tea on weekends is between 3:00-5:00pm, then the best business hours for dinner are 7:00-9:00.

The technique of full cycle analysis has been studied in signal analysis and statistics. Methods such as FFT (Fast Fourier Transform) have been widely used to convert data from time domain to frequency domain to facilitate such analysis.

02 Can full-cycle pattern mining methods be used for partial-cycle pattern mining?

Efficient partial periodic pattern mining has received attention in recent research in data mining. Most methods of full-cycle pattern mining are either inapplicable or too costly compared to partial-cycle pattern mining, because partial-cycle patterns contain periodic and non-periodic events in the same cycle.

For example, FFT cannot be used for partial cycle mining because it treats time series as inseparable data streams. Some periodic detection methods cannot cover partial periodic patterns unless the period, length, and timing of the partial pattern are explicitly specified. Taking news reading as an example, we need to clearly state something like “Using a 24-hour cycle, find Sandv’s regular activities within half an hour after 7:00.” It is not advisable to simply apply this method to the problem of mining partial periodic patterns, because it needs to deal with a large number of combinations of period, length, and timing.

Most studies on partial periodic pattern and cyclic association rule mining have applied Apriori feature heuristics and adopted modified Apriori mining methods. Constraints can be introduced in sequential and periodic pattern mining. Apriori features, various improved Apriori algorithms, and the use of mining constraint.

Editor: Wang Jing

Proofreading: Lin Yilin