Implementation of Lowess local weighted regression algorithm in python and R language

Article directory

  • Algorithm application scenarios
  • Algorithm principle
    • main idea
    • Weight function selection
    • Use of weight function
    • Residual calculation and iteration
  • Parameter Description
  • Easy to use library
    • Disadvantages of library transfer
  • Preparation:
  • Programming ideas:
  • python implementation
  • R language implementation
  • Compare results
  • Reflect and improve

Most of the theories in this article refer to this blog [Algorithm] Local Weighted Regression (Lowess) is only used for your own learning records.

Algorithm application scenarios

The essential function of the Lowess local weighted regression algorithm is to fit the trend line of the data, and is commonly used to solve prediction problems and smoothing problems.

  • When solving forecasting problems, use trend lines to make forecast data, which is suitable for cyclical and volatile data;
  • When doing data smoothing, for trend or seasonal data, you cannot simply use the mean plus or minus three standard deviations to remove outliers. Therefore, use the trend line as the baseline to eliminate true outliers that deviate far from the baseline.

Algorithm principle

Core idea

The general idea of local weighted regression (Lowess) is: centering on a point The center value of the regression line, where y is the corresponding value of the curve after fitting. For all n data points, n weighted regression lines can be made, and the line connecting the central value of each regression line is the Lowess curve of this data.

Weight function selection

The general idea of selecting the weight function is: we hope that W(x) is greater than 0, and the scope is [-1,1], and it is a symmetric function. This function has a larger value in the middle (at 0) and a larger value on both sides (-1\ 1) The value is small. The weight in the middle is higher and has a greater impact on weighted regression; the reason for [-1,1] is thatfor any irregular data segment, it can be compressed and mapped to [-1,1] , easy to handle.

W

(

x

)

=

{

(

1

?

x

3

)

3

,

for

x

< 1 0 , for ∣ x ∣ ≥ 1 W(x)= \begin{cases} (1-|x|^3)^3, & amp;\text{for} |x|<1\ 0, & amp;\text{for} |x| ≥1 \end{cases} W(x)={(1?∣x∣3)3,0,?for∣x∣<1for∣x∣≥1?

B

(

x

)

=

{

(

1

?

x

2

)

2

,

for

x

< 1 0 , for ∣ x ∣ ≥ 1 B(x) = \begin{cases} (1-|x|^2)^2, & amp;\text{for} |x|<1\ 0, & amp;\text{for} |x| ≥1 \end{cases} B(x)={(1?∣x∣2)2,0,?for∣x∣<1for∣x∣≥1?

The difference between quadratic and cubic functions is that the cubic function slows down faster for surrounding weights, has a good initial smoothing effect, and is suitable for most distributions, but it increases the variance of the residuals.
For weight function selection, the W function (cubic function) is used for the first iteration, and the B function (quadratic function) is used for subsequent iterations.

Using weight function

  1. Use the weight function W(x);
  2. The data segment [d 1, d 2] is mapped to the coordinates corresponding to [?1,1];
  3. Bring in the function W(x) and calculate the w i corresponding to each point;
  4. Use weighted regression to derive the model:

    Y

    ^

    =

    X

    (

    X

    T

    w

    X

    )

    ?

    1

    X

    T

    w

    Y

    \hat Y = X ( X ^T w X ) ^{-1}X ^T w Y

    Y^=X(XTwX)?1XTwY

Residual calculation and iteration

First, the original value is y, the predicted value is y ^, the residual is e = y ? y ^, and s is the median of ∣ e i ∣. The additional value of robustness weight adjustment is δ k = W ( e k/ 6 s ), and the corrected weight is δ k w k .
The iterative process is:
1. Use the W function (cubic function) as the weight function to find w i .
2. Bring w i into weighted regression to calculate y ^.
3. Find e = y ? y and s.
4. Use the B function as the modified weight function to find δ k = B (e k/6 s) and calculate δ k w k.
5. Use δ k w k as the correction weight and repeat steps 2, 3, and 4.

Parameter description

  1. x_data, the data type is a list, storing the value of x
  2. y_data, the data type is a list, storing the value of y
  3. The length frac is the length that should be intercepted for local processing. frac is the proportion of the original data volume and is a floating point number;
  4. The number of iterations iter is of type int. The function will basically converge around 3 times, usually 3.
  5. Regression interval delta, floating point number, when it is 0, each point is counted as a weighted regression; it is recommended that when the data points N>5000, delta=0.01*N, a weighted regression is performed every delta point, and the intermediate point uses: linear interpolation , quadratic interpolation, cubic interpolation and other methods.

Easy to use library transfer

import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv("Data.csv")
x_data = list(data.Date)
y_data = list(data.Value)
# Directly transfer the library using lowess, the input is a one-dimensional array, and the return is a list of point coordinates
lowess = sm.nonparametric.lowess(y_data, x_data, frac=0.1, it=3, delta=0.0)
x1 = [point[0] for point in lowess]
y1 = [point[1] for point in lowess]
plt.plot(x1,y1)

Disadvantages of library adjustment

  1. The weight function is not adjustable
  2. The interpolation function is not adjustable when using delta

Preparation:

  1. Write the interception data function get_points (input x list, y list, center x, interception ratio frac; return the intercepted x list, y list)
  2. Determine the weight functions W(x) and B(x)
  3. Write the weighted regression function weighted_linear_regression (x, y list and weight list w_list; return

    y

    ^

    \hat{y}

    list of y^?)

  4. Write a function new_W that calculates new weights (input

    y

    ^

    \hat{y}

    y^?, y, and weight list w_list; output residuals, median residuals, and new weight list) for the iterative process

Programming ideas:

  1. First check whether the input point coordinate data is arranged in sequence.
  2. Loop through x
  3. Intercept frac proportional data as a list
  4. Calculate the weight function for the points in the list and perform weighted regression to return

    y

    ^

    \hat{y}

    y^?

  5. Enter the iteration loop to calculate the residual e and the median s of the residual, use the weight function B(x) to calculate the weight correction added value, and output the new weight
  6. Then calculate the new

    y

    ^

    \hat{y}

    y^?, repeat until the end of the iteration (generally only iterate 3 times)

  7. Returns the center point of the result

python implementation

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

####Interception data function: with x as the center, intercept a piece of data with a length of frac forward and backward
def get_points(x_data,y_data,x,frac):
    num = len(x_data)*frac
    index = x_data.index(x)

    if int(index-num//2) >= 0 and int(index + num//2) <= len(x_data):
        x_list = x_data[int(index-num//2):int(index + num//2)]
        y_list = y_data[int(index-num//2):int(index + num//2)]
    elif int(index-num//2) < 0:
        x_list = x_data[0:int(index + num//2)]
        y_list = y_data[0:int(index + num//2)]
    else:
        x_list = x_data[int(index-num//2):len(x_data)]
        y_list = y_data[int(index-num//2):len(x_data)]
    return x_list,y_list


def weighted_linear_regression(x_list, y_list, w_list):
    # Convert the input list to a NumPy array for mathematical operations
    x = np.array(x_list)
    y = np.array(y_list)
    w = np.array(w_list)

    # Calculate the weight parameters of weighted linear regression
    #The weighted least squares method is used here
    #X to be converted from the list to the design matrix ((((((((((((((((((here ((here ((here it has been troubled here for a long time here
    X = np.vstack((x, np.ones_like(x))).T
    #The weight list also needs to be converted into a diagonal matrix
    W = np.diag(w)
    '''
    A = np.dot(X.T, np.dot(W, X))
    B = np.dot(X.T, np.dot(W, y))
    params = np.linalg.solve(A, B)
    # Calculate predicted value
    y_hat = np.dot(X, params)
    '''
    # can also be written directly as one
    y_hat = [email protected](X.T@W@X)@X.T@W@y
    return y_hat


####Find the residual, median, and new weights
def new_W(y_hat,y_list,w_list):
    e = y_list - y_hat ##Calculate the corresponding residual list
    e_sort = sorted(abs(e))
    n = len(e)
    ##s is the median of the error
    if n % 2 == 1:
        s = e_sort[(n - 1)//2]
    else:
        s = (e_sort[n//2] + e_sort[n//2 + 1])/2

    new_w_list = np.array(B(e/(6*s)))*np.array(w_list)
    return e,s,new_w_list


####Quadratic weight function B(x)
def B(x_list):
    #When using the B function to calculate the weight value to adjust the additional value, there is no need to map it to [-1,1], otherwise the curve will not be smooth.
    #x_list = [(x - min(x_list)) / (max(x_list) - min(x_list)) * 2 - 1 for x in x_list]
    w_list = []
    for i in x_list:
        if abs(i) < 1:
            w_list.append((1-i**2)**2)
        else:
            w_list.append(0)
    return w_list

####Cubic weight function W(x)
def W(x_list):
    ##First map x_list to [-1,1]: compressing and mapping any irregular data segment to [-1,1] will not change the order of the original sequence
    x_list = [(x - min(x_list)) / (max(x_list) - min(x_list)) * 2 - 1 for x in x_list]
    w_list = []
    for i in x_list:
        if abs(i) < 1:
            w_list.append((1 - abs(i) ** 3) ** 3)
        else:
            w_list.append(0)
    return w_list

def mylowess(x_data,y_data,frac,iter):
    result = []
    #Before intercepting the code segment centered on
    sorted_data = sorted(zip(x_data,y_data),key= lambda p: p[0])
    x_data,y_data = zip(*sorted_data)
    for x in x_data:
        x_list, y_list = get_points(x_data, y_data, x, frac)
        w_list = W(x_list)
        y_hat = weighted_linear_regression(x_list, y_list, w_list)
        for it in range(iter):
            e, s, w_list = new_W(y_list, y_hat, w_list)
            y_hat = weighted_linear_regression(x_list, y_list, w_list)
        result.append(y_hat[x_list.index(x)])
    return x_data, result



data = pd.read_csv("PolarArea_Data.csv")
x_data = list(data.Date)
y_data = list(data.Value)
frac = 0.1
iter = 3

plt.scatter(x_data,y_data)

for i in [1,2,3,4,5]:
    x,y_fit = mylowess(x_data,y_data,frac,i)
    plt.plot(x,y_fit, label = f"iter={<!-- -->i}")

plt.legend()
plt.show()

R language implementation

# Interception data function
get_points <- function(x_data, y_data, x, frac) {<!-- -->
  num <- length(x_data) * frac
  index <- which(x_data == x)
  
  if (index - num / 2 >= 1 & amp; & amp; index + num / 2 <= length(x_data)) {<!-- -->
    x_list <- x_data[(index - num / 2):(index + num / 2)]
    y_list <- y_data[(index - num / 2):(index + num / 2)]
  } else if (index - num / 2 < 1) {<!-- -->
    x_list <- x_data[1:(index + num / 2)]
    y_list <- y_data[1:(index + num / 2)]
  } else {<!-- -->
    x_list <- x_data[(index - num / 2):length(x_data)]
    y_list <- y_data[(index - num / 2):length(x_data)]
  }
  return(list(x_list, y_list))
}

# Weighted linear regression function
weighted_linear_regression <- function(x_list, y_list, w_list) {<!-- -->
  X <- cbind(x_list, 1)
  W <- diag(w_list)
  params <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y_list
  y_hat <- X %*% params
  return(y_hat)
}

# Quadratic weight function B(x)
B <- function(x_list) {<!-- -->
  w_list <- sapply(x_list, function(x) {<!-- -->
    if (abs(x) < 1) {<!-- -->
      return((1 - x^2)^2)
    } else {<!-- -->
      return(0)
    }
  })
  return(w_list)
}

# Cubic weight function W(x)
W <- function(x_list) {<!-- -->
  x_list <- (x_list - min(x_list)) / (max(x_list) - min(x_list)) * 2 - 1
  w_list <- sapply(x_list, function(x) {<!-- -->
    if (abs(x) < 1) {<!-- -->
      return((1 - abs(x)^3)^3)
    } else {<!-- -->
      return(0)
    }
  })
  return(w_list)
}

# Find residuals, medians, and new weights
new_W <- function(y_hat, y_list, w_list) {<!-- -->
  e <- y_list - y_hat
  e_sort <- sort(abs(e))
  n <- length(e)
  
  if (n %% 2 == 1) {<!-- -->
    s <- e_sort[(n - 1) %/% 2 + 1]
  } else {<!-- -->
    s <- (e_sort[n %/% 2] + e_sort[n %/% 2 + 1]) / 2
  }
  
  new_w_list <- B(e / (6 * s)) * w_list
  return(list(e, s, new_w_list))
}

# mylowess function
mylowess <- function(x_data, y_data, frac, iter) {<!-- -->
  result <- numeric(length(x_data))
  
 
  # Check if x_data is in order
  sorted_indices <- order(x_data)
  x_data <- x_data[sorted_indices]
  y_data <- y_data[sorted_indices]
  
  
  for (x in x_data) {<!-- -->
    points <- get_points(x_data, y_data, x, frac)
    x_list <- points[[1]]
    y_list <- points[[2]]
    w_list <- W(x_list)
    y_hat <- weighted_linear_regression(x_list, y_list, w_list)
    
    for (it in 1:iter) {<!-- -->
      new_w <- new_W(y_hat, y_list, w_list)
      e <- new_w[[1]]
      s <- new_w[[2]]
      new_w_list <- new_w[[3]]
      y_hat <- weighted_linear_regression(x_list, y_list, new_w_list)
    }
    result[which(x_data == x)] <- y_hat[which(x_list == x)]
  }
  return(result)
}

# Example call
data <- read.csv("PolarArea_Data.csv")
x_data <- data$Date
y_data <- data$Value
frac <- 0.1
iter <- 3

plot(x_data, y_data, pch = 19, col = "blue", xlab = "Date", ylab = "Value")

result <- mylowess(x_data, y_data, frac, iter)
lines(x_data, result, type = "l", col = "red")

#Add title, legend and other drawing parameters
title(main = "Scatter Plot with LOESS Fit", sub = "", xlab = "Date", ylab = "Value")
legend("topright", legend = c("Scatter", "LOESS Fit"), col = c("blue", "red"), pch = c(19, NA), lty = c(NA, 1) , cex = 0.8)


Result comparison

python library adjustment uses iteration 5 times

mylowess iterates 5 times

It can be seen that mylowess still has a lot of problems. The amplitude of the change in each iteration is so small that it is almost impossible to see any change. There is still a lot of room for improvement; in addition, the interval interpolation included in the original tuning library function is not available here. Realization, still needs to be added.

Reflection and improvement

The step code for mapping the list to [-1,1] is wrong and should be modified as follows:

####Cubic weight function W(x)
def W(xc,x_list):
    ##First map x_list to [-1,1]: compressing and mapping any irregular data segment to [-1,1] will not change the order of the original sequence
    x_list = [(x - xc) / (max(x_list) - min(x_list)) for x in x_list]
    w_list = []
    for i in x_list:
        if abs(i) < 1:
            w_list.append((1 - abs(i) ** 3) ** 3)
        else:
            w_list.append(0)
    return w_list

operation result

The image seems to be closer to the library function, but it is not smooth enough. Due to my limited level of mathematics, I only know that the wrong multiplication by 2 actually made the weight of points closer to the center larger and the weight of points farther away from the center smaller.

####Cubic weight function W(x)
def W(xc,x_list):
    ##First map x_list to [-1,1]: compressing and mapping any irregular data segment to [-1,1] will not change the order of the original sequence
    x_list = [(x - xc) / (max(x_list) - min(x_list))*2 for x in x_list]
    w_list = []
    for i in x_list:
        if abs(i) < 1:
            w_list.append((1 - abs(i) ** 3) ** 3)
        else:
            w_list.append(0)
    return w_list

Insert code snippet here
Another problem is that when fitting curves to sparse points, the x values are not continuous. Taking a certain number of points according to the ratio will cause problems. It may be necessary to add a distance filter judgment.