Noisy ring clustering analysis based on DBSCAN (MATLAB-with case code)

Table of Contents

1. Summary

2. Basic principles

2.1. DBSCAN clustering algorithm

2.2. Calculation principle

2.3. Advantages and Disadvantages

3. Cluster analysis based on DBSCAN

references

Appendix–Case Code

1. Summary

The principle of the DBSCAN method is explained, and the DBSCAN method is used to perform cluster analysis on two randomly generated rings with noise. In addition, the clustering results under different density parameters were calculated, and the influence of different density parameters on the clustering effect was analyzed.

2. Basic principles

2.1 DBSCAN clustering algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise, density-based clustering method with noise) is a density-based spatial clustering algorithm. This algorithm divides areas with sufficient density into clusters and finds arbitrarily shaped clusters in spatial databases with noise. It defines a cluster as the largest set of density-connected points strong>.

Clustering algorithms such as traditional k-means can only handle spherical clusters, that is, a solid group (this is because of the limitations of the algorithm itself in calculating the average distance). However, for other rings and irregular shapes that often exist in reality, DBSCAN’s density-based spatial clustering feature has advantages that traditional clustering algorithms do not have.

3c5f276511984f649631fca1575ab951.png

Figure 2.1 DBSCAN clustering results

7f1fda9abfa548f7a3e51bb491c02c4a.png

Figure 2.2 K-means clustering results

2.2. Calculation principle

Density-based method: The density of any point in space is the number of points contained in a circle with the point as the center and EPS as the radius.

It works like this:

  1. First, select two parameters, a radius epsilon and a minimum number of clusters minPoints.
  2. Then, start from any point in the data set. If there are more than minPoints points (including the original point itself) within a distance epsilon of this point, we consider all these points to be part of the “cluster”.
  3. Then, extend the cluster by checking all new points to see if they also have points within a distance of ε that exceed minPoints, and if so, then Grow the cluster recursively.
  4. Eventually, when you have run out of points to add to the cluster, select a new arbitrary point and repeat the process.

If you select a point it is entirely possible that there are fewer points within its epsilon distance than minPoints and also not be part of any other cluster. If so, it is considered a “noise point” that does not belong to any cluster.

2.3. Advantages and Disadvantages

(1) Advantages

  • This type of algorithm can overcome the shortcomings of distance-based algorithms that can only find “circular” (convex) clusters.
  • Can discover clusters of any shape and is insensitive to noisy data.
  • No need to specify the number of classes cluster
  • There are only two parameters in the algorithm, scanning radius (eps) and minimum number of included points (min_samples)

(2) Disadvantages:

  • Computational complexity, without any optimization, the time complexity of the algorithm is O(N^{2}), usually R-tree, k-d tree, ball
  • tree index to speed up calculation and reduce the time complexity of the algorithm to O(Nlog(N)).
  • It is greatly affected by eps. When the density of data distribution in a class is uneven, when eps is small, the cluster with small density will be divided into multiple clusters with similar properties; when eps is large, clusters that are closer and denser will be merged into one cluster. In high-dimensional data, the selection of eps is more difficult due to the curse of dimensionality.
  • Depends on the selection of the distance formula. Due to the dimension hazard, the distance metric is not important.
  • It is not suitable for data sets with large density differences, because it is difficult to select eps and metric.

3. Cluster analysis based on DBSCAN

Two circles containing noise are randomly generated with radii of 0.8 and 5 respectively. Each circle contains 400 pieces of data, totaling 800 pieces of original data. The original data visualization results are as follows:

d2c6ffb8725b4b189793886169d7b6e8.png

Figure 3.1 Raw data visualization

Use matlab’s self-encoding DBSCAN algorithm to calculate the clustering results. Different radius and density parameters were selected, and the following three scenarios were set:

Scene number

Epsilon

Minpoints

1

0.5

5

2

0.5

10

3

1

5

The clustering results are as follows:

cf696b025289425e93575445876448a2.png

Figure 3.2 Epsilon=0.5, minpoints=5 clustering results

894ab0c23a344c23b2d18d442f392012.png

Figure 3.3 Epsilon=0.5, minpoints=10 clustering results

ded22c29a6fc448abc314b4542db5510.png

Figure 3.4 Epsilon=1, minpoints=5 clustering results

It can be seen that different density parameters will greatly affect the data clustering results. The larger the radius, the smaller the number of clusters and the fewer categories, and vice versa. In addition, DBSCAN can discover clusters of arbitrary shapes and is insensitive to noisy data.

References

[1] DBSCAN clustering algorithm – machine learning (theory + diagram + python code)_Feng Xianhe’s blog – CSDN blog. Blog. DBSCAN clustering algorithm – machine learning (theory + diagram + python code) – CSDN blog

[2] DBSCAN clustering algorithm – density-based clustering method (theory + diagram + python code)_dbscan clustering_Lindsay.Lu丶’s blog – CSDN blog. Blog. DBSCAN clustering algorithm – density-based clustering method (theory) + diagram + python code)_Density-based clustering algorithm dbscan-CSDN blog

Appendix-Case Code

The Matlab code for this case is as follows:

clc;
clear;
close all;
 
%% Generate Data
% Generate data containing two noise circles
 N =400; % Size of each cluster
r1 = 0.8;
r2 = 5;
theta = linspace(0,2*pi,N)';

X1 = r1*[cos(theta),sin(theta)] + rand(N,1);
X2 = r2*[cos(theta),sin(theta)] + rand(N,1);
X = [X1;X2]; % Noisy 2-D circular data set

%% Visualize raw data
figure('name','Original data visualization')
scatter(X(:,1),X(:,2));
set(gca,'FontSize',12);
 
%% Run DBSCAN Clustering
 
epsilon=1; % the values of two key parameters
MinPts=5;
IDX=DBSCAN(X,epsilon,MinPts);
 
 %% Plot Results
 figure('name','DBSCAN clustering result');
PlotClusterinResult(X, IDX);
set(gca,'FontSize',12);
title(['DBSCAN Clustering (\epsilon = ' num2str(epsilon) ', MinPts = ' num2str(MinPts) ')']);
%%%%%%% ----DBSCAN function ------%%%%%%%%%%%
function [IDX, isnoise]=DBSCAN(X,epsilon,MinPts) % DBSCAN clustering function
    C=0;
    n=size(X,1);
    IDX=zeros(n,1);
    D=pdist2(X,X);
    visited=false(n,1);
    isnoise=false(n,1);
    for i=1:n
        if~visited(i)
            visited(i)=true;
            Neighbors=RegionQuery(i);
            if numel(Neighbors)<MinPts % if less than MinPts
                % X(i,:) is NOISE
                isnoise(i)=true; % This point is an abnormal point
            else
                C=C + 1;
                ExpandCluster(i,Neighbors,C);
            end
            
        end
    
    end
    function ExpandCluster(i,Neighbors,C)
        IDX(i)=C;
        
        k = 1;
        while true % keeps looping
            j = Neighbors(k);
            if ~visited(j)
                visited(j)=true;
                Neighbors2=RegionQuery(j);
                if numel(Neighbors2)>=MinPts
                    Neighbors=[Neighbors Neighbors2]; %#ok
                end
            end
            if IDX(j)==0
                IDX(j)=C;
            end
            
            k = k + 1;
            if k > numel(Neighbors)
                break;
            end
        end
    end
    function Neighbors=RegionQuery(i) % This function is used to query the number of surrounding points whose distance is less than or equal to epsilon
        Neighbors=find(D(i,:)<=epsilon);
    end
end

%%%% ---- Drawing function ----%%%%%%%
function PlotClusterinResult(X, IDX)
    k=max(IDX);
    Colors=hsv(k);
    Legends = {};
    for i=0:k
        Xi=X(IDX==i,:);
        if i~=0
            Style = 'x';
            MarkerSize = 8;
            Color = Colors(i,:);
            Legends{end + 1} = ['Cluster #' num2str(i)];
        else
            Style = 'o';
            MarkerSize = 6;
            Color = [0 0 0];
            if ~isempty(Xi)
                Legends{end + 1} = 'Noise';
            end
        end
        if ~isempty(Xi)
            plot(Xi(:,1),Xi(:,2),Style,'MarkerSize',MarkerSize,'Color',Color);
        end
        hold on;
    end
    hold off;
    axis equal;
    grid on;
    legend(Legends);
    legend('Location', 'NorthEastOutside');
end