Time series data plays a vital role across industries, from stock prices in financial markets, to temperature changes in meteorology, to electrocardiogram data in medicine. Because raw data of this kind is often collected from sensors and other measurement systems, it tends to contain noise and short-term fluctuations, and smoothing it is an important step toward extracting valuable information.
Traditionally, the moving average has been one of the most commonly used smoothing techniques for time series data. The basic idea is to replace each data point with the average of the points in a window around it, thereby reducing short-term fluctuations. However, the moving average has significant shortcomings. For example, while eliminating noise it easily loses important details and abrupt change points in the data. In addition, it needs a full, fixed-size window of data, so its smoothing near the boundaries of the series is poor.
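A minimal sketch of both shortcomings, using a synthetic Gaussian peak rather than real data (the signal, the 11-point window, and all variable names are assumptions made for illustration). A centered rolling mean lowers the peak and leaves the points near both ends of the series undefined.

```python
import numpy as np
import pandas as pd

# Synthetic, noise-free peak of height 1.0 centered at t = 50 (illustrative only).
t = np.arange(101)
signal = pd.Series(np.exp(-0.5 * ((t - 50) / 3.0) ** 2))

# Centered moving average over 11 points.
window = 11
smoothed = signal.rolling(window=window, center=True).mean()

# The averaged peak is lower than the original one.
print(f"original peak: {signal.max():.3f}")
print(f"smoothed peak: {smoothed.max():.3f}")

# Positions without a full window are returned as NaN.
print(f"undefined points at the edges: {int(smoothed.isna().sum())}")
```

With pandas' default `min_periods`, a centered window of 11 points cannot be filled for the first and last 5 samples, so those positions come back as NaN.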
To overcome these shortcomings, the Savitzky-Golay filter can be used. It smooths the data by fitting a polynomial within a sliding window, removing noise while preserving as much of the detail and peaks in the data as possible. Compared to the moving average, the Savitzky-Golay filter not only handles the data at the series boundaries better, but also maintains the overall shape of the signal during smoothing. As a result, the Savitzky-Golay filter has become a popular method for smoothing time series data, favored by researchers and data analysts in many fields.
What is the Savitzky-Golay filter?
The Savitzky-Golay filter is a digital filter that can be applied to a set of digital data points with the aim of smoothing the data, i.e., improving the accuracy of the data without distorting the signal trend. This is achieved through a process called convolution, in which successive subsets of adjacent data points are fitted with a low-order polynomial by the method of linear least squares. When the data points are equally spaced, an analytical solution to the least squares equations can be found in the form of a single set of "convolution coefficients" that can be applied to all data subsets to estimate the smoothed signal (or the derivative of the smoothed signal) at the center point of each subset. The method, which is based on established mathematical procedures, was popularized by Abraham Savitzky and Marcel J. E. Golay, who published tables of convolution coefficients for various polynomials and subset sizes in 1964.
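To make the idea of "convolution coefficients" concrete, here is a small sketch that derives them from the least squares problem for a 5-point window and a quadratic polynomial. The helper name `smoothing_coefficients` is hypothetical and an odd window length is assumed. The result is compared against the commonly tabulated weights (-3, 12, 17, 12, -3)/35 and against SciPy's `savgol_coeffs`.

```python
import numpy as np
from scipy.signal import savgol_coeffs

def smoothing_coefficients(window_length, polyorder):
    """Least squares weights that give the fitted value at the window center.

    Hypothetical helper for illustration; assumes an odd window_length.
    """
    half = window_length // 2
    offsets = np.arange(-half, half + 1)  # positions relative to the center
    # Design matrix with columns 1, z, z^2, ... for each offset z.
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Polynomial coefficients are pinv(design) @ y. The fitted value at z = 0
    # is the constant term, so the first row of the pseudo-inverse holds the weights.
    return np.linalg.pinv(design)[0]

coeffs = smoothing_coefficients(5, 2)
tabulated = np.array([-3, 12, 17, 12, -3]) / 35

print(np.allclose(coeffs, tabulated))            # agrees with the tabulated weights
print(np.allclose(coeffs, savgol_coeffs(5, 2)))  # agrees with SciPy
```

Because the weights depend only on the window length and the polynomial order, they can be computed once and reused for every window position.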
The Savitzky-Golay filter slides a window across the time series. Within each window it fits a polynomial that minimizes the least squares fitting error for that particular subset, and the fitted value at the center of the window becomes the new, smoother point (a moving average, so to speak, with an additional fitting step).
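The sliding fit described above can be written out directly. The sketch below is a deliberately slow, hand-rolled version (the function name `savgol_by_hand`, the synthetic sine signal, and the parameters are assumptions for illustration). It fits a polynomial in every window with `np.polyfit` and checks that, away from the edges, the result matches `scipy.signal.savgol_filter`.

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_by_hand(y, window_length, polyorder):
    """Fit a polynomial in each window and keep the fitted value at its center.

    Illustrative only: assumes an odd window_length and leaves the edges as NaN.
    """
    y = np.asarray(y, dtype=float)
    half = window_length // 2
    offsets = np.arange(-half, half + 1)
    smoothed = np.full(y.shape, np.nan)
    for i in range(half, len(y) - half):
        window = y[i - half : i + half + 1]
        poly = np.polyfit(offsets, window, polyorder)  # least squares fit
        smoothed[i] = np.polyval(poly, 0.0)            # value at the window center
    return smoothed

# Synthetic noisy sine wave (not the M4 data).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
noisy = np.sin(t) + rng.normal(scale=0.2, size=t.size)

manual = savgol_by_hand(noisy, 11, 2)
reference = savgol_filter(noisy, 11, 2)

# Compare only the interior, where a full window is available.
interior = slice(5, -5)
print(np.allclose(manual[interior], reference[interior]))
```

The two versions differ only at the first and last few points, where `savgol_filter` applies its own edge handling (its `mode` parameter) instead of leaving gaps.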
Python example of the Savitzky-Golay filter
Next, a time series from the M4 Competition dataset is used as an example to show the smoothing effect of the Savitzky-Golay filter. The M4 Competition is an important event in the field of time series forecasting that aims to compare and evaluate the performance of various forecasting methods.
The following code imports the analysis packages, loads the dataset, and plots the selected series:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.signal import savgol_filter
import plotly.express as px
from statsforecast import StatsForecast
from tqdm.autonotebook import tqdm
train = pd.read_csv('https://auto-arima-results.s3.amazonaws.com/M4-Hourly.csv')
uid = np.array(['H386'])
df_train = train.query('unique_id in @uid').copy()  # copy so that new columns can be added without a SettingWithCopyWarning
StatsForecast.plot(df_train, plot_random = False, engine='plotly')
When smoothing time series data, the "window size" plays a key role in the results of our analysis. Both the moving average and the Savitzky-Golay filter are controlled by this parameter, which determines the range of data considered when smoothing any given point. Think of the window size as the aperture of a camera lens: the wider the aperture, the more light is captured, which affects the sharpness and detail of the final image. In a moving average, the window size defines the number of data points that are averaged to produce a single smoothed point. In a Savitzky-Golay filter, the points in the window are not simply averaged: a polynomial is fitted to them, striking a balance between smoothing and fidelity to the features of the signal. The following code computes four smoothed versions of the original time series, applying the moving average and the Savitzky-Golay filter with window sizes of 10 and 25:
computed_features = []  # names of the smoothed columns, used later for plotting
for window_size in [10, 25]:
    # centered moving average; points without a full window are NaN
    df_train.loc[:, f'moving_average_{window_size}'] = df_train['y'].rolling(window=window_size, center=True).mean()
    # Savitzky-Golay filter with a 2nd-order polynomial
    # (older SciPy releases require an odd window length, so 10 may need to become 11 there)
    df_train.loc[:, f'savgol_filter_{window_size}'] = savgol_filter(df_train['y'], window_size, 2)
    computed_features.append(f'moving_average_{window_size}')
    computed_features.append(f'savgol_filter_{window_size}')
df_train.tail(30)
Let's see how the results of smoothing with a window size of 10 compare to the original time series:
fig = px.line(df_train, x='ds', y=['y'] + computed_features[:2],
              title='Window size 10: moving average vs. Savitzky-Golay filter smoothing',
              labels={'Value': 'y', 'Date': 'Date'},
              line_shape='linear')
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Sensor Value',
    hovermode='x'
)
fig.show()
The moving average, while smooth, does not reflect the smaller peaks that appear around a larger peak. The Savitzky-Golay filter, on the other hand, follows the time series more closely, preserving the details of the peaks and valleys. Because each smoothed value is fitted from the points on both sides of it, the filter can pick up a peak without lagging behind it. The simplicity of the moving average is its greatest strength, but it is also its Achilles' heel.
Moving averages are slower to respond to true changes in the data and often lag when the trend changes. In addition, they treat all the points within the window equally, ignoring the fact that points closer to the center may matter more. In contrast, the Savitzky-Golay filter fits a low-order polynomial to each subset of adjacent data points by linear least squares, preserving the shape and features of the original data. This means that it retains important features such as peaks and valleys, which are often smoothed out by moving averages.
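The difference in weighting can be inspected directly. In the sketch below (synthetic peak, 11-point window, and variable names are assumptions for illustration), `savgol_coeffs` with `polyorder=0` reduces to the equal weights of a moving average, while `polyorder=2` gives the central points more weight. Both sets of weights are then applied to a noise-free peak to compare how much of its height survives.

```python
import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter

window = 11

# polyorder=0 fits a constant, i.e. the plain mean: every point gets weight 1/11.
flat_weights = savgol_coeffs(window, 0)
# polyorder=2 fits a parabola: central points weigh more, the outermost ones
# even receive slightly negative weights.
quad_weights = savgol_coeffs(window, 2)
print(np.round(flat_weights, 3))
print(np.round(quad_weights, 3))

# Synthetic, noise-free peak of height 1.0 (illustrative only).
t = np.arange(101)
peak = np.exp(-0.5 * ((t - 50) / 3.0) ** 2)

moving_avg = np.convolve(peak, flat_weights, mode="same")
savgol = savgol_filter(peak, window, 2)

print(f"moving average peak: {moving_avg.max():.3f}")
print(f"Savitzky-Golay peak: {savgol.max():.3f}")
```

On this example the quadratic fit keeps the peak much closer to its original height than the equal-weight average does. How large the gap is on real data depends on the peak width relative to the window.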
As the previous plot shows, the Savitzky-Golay filter still retains the spikes, and sometimes we may want to eliminate them. Let's see whether we can achieve this by increasing the window size. Here is the smoothing effect with a window size of 25:
fig = px.line(df_train, x='ds', y=['y'] + computed_features[2:4],
              title='Window size 25: moving average vs. Savitzky-Golay filter smoothing',
              labels={'Value': 'y', 'Date': 'Date'},
              line_shape='linear')
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Sensor Value',
    hovermode='x'
)
fig.show()
In the plot above, the Savitzky-Golay filter does an excellent job of capturing the seasonal variation in the time series, with no delay and with the spikes removed. The moving average, on the other hand, tracks only the long-term average and loses much of the information contained in the signal.
With an appropriately chosen window size, the Savitzky-Golay filter maintains high fidelity to the signal while removing unwanted noise and anomalies. The moving average can still be used to compute the average level of a time series, but a similar result can be obtained, possibly more precisely, by increasing the window size of the Savitzky-Golay filter. In many smoothing applications, the Savitzky-Golay filter is the better choice.
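One hedged way to explore the choice of window size is to test it on a synthetic signal whose noise-free version is known. The sketch below is not part of the M4 analysis: the sine signal, the noise level, the candidate windows, and the `rmse` helper are all assumptions. It measures how far each smoothed series is from the clean signal for several window sizes.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic signal with a known clean version (illustrative only).
rng = np.random.default_rng(42)
t = np.linspace(0, 4 * np.pi, 400)
clean = np.sin(t)
noisy = clean + rng.normal(scale=0.3, size=t.size)

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Ignore the edges so that every window below is fully inside the data.
margin = 50
inner = slice(margin, -margin)

results = {}
for window in (11, 25, 51, 101):
    moving_avg = np.convolve(noisy, np.ones(window) / window, mode="same")
    savgol = savgol_filter(noisy, window, 2)
    results[window] = (rmse(moving_avg[inner], clean[inner]),
                       rmse(savgol[inner], clean[inner]))

print(f"unsmoothed: {rmse(noisy[inner], clean[inner]):.3f}")
for window, (ma_err, sg_err) in results.items():
    print(f"window {window:>3}: moving average {ma_err:.3f}, Savitzky-Golay {sg_err:.3f}")
```

On real data no clean reference exists, so a comparison like this only indicates how the two filters trade noise removal against distortion as the window grows.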