Table of Contents
1. Summary
2. Basic principles
2.1. DBSCAN clustering algorithm
2.2. Calculation principle
2.3. Advantages and Disadvantages
3. Cluster analysis based on DBSCAN
Appendix – Case Code
1. Summary
This report explains the principle of the DBSCAN method and uses it to perform cluster analysis on two randomly generated noisy rings. Clustering results under different density parameters are then computed, and the influence of these parameters on the clustering quality is analyzed.
2. Basic principles
2.1 DBSCAN clustering algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm. It groups regions of sufficient density into clusters and can find arbitrarily shaped clusters in noisy spatial databases. A cluster is defined as the largest set of density-connected points.
Traditional clustering algorithms such as k-means can only handle roughly spherical (convex) clusters, a limitation that stems from their reliance on distances to a cluster mean. For the rings and irregular shapes that often occur in real data, DBSCAN’s density-based notion of a cluster gives it an advantage that traditional algorithms lack.
Figure 2.1 DBSCAN clustering results
Figure 2.2 K-means clustering results
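The contrast shown in the two figures can be reproduced with a short scikit-learn sketch (the data set and parameter values below are my own choices for the demo, not taken from this report):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles

# Two concentric rings: k-means' spherical-cluster assumption fails
# here, while DBSCAN's density-based definition recovers each ring.
X, y_true = make_circles(n_samples=800, factor=0.3, noise=0.03,
                         random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(len(set(db_labels) - {-1}))  # DBSCAN cluster count (the two rings)
```

Plotting both label sets shows DBSCAN separating the inner and outer ring, while k-means simply cuts the plane in half through both rings.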
2.2. Calculation principle
Density-based method: the density of any point in space is the number of points contained in a circle centred at that point with radius eps.
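As a tiny illustration of this density definition (a NumPy sketch; the function and variable names are my own):

```python
import numpy as np

def density(X, i, eps):
    # Number of points of X within distance eps of X[i] (X[i] itself included).
    dist = np.linalg.norm(X - X[i], axis=1)
    return int(np.sum(dist <= eps))

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0]])
print(density(pts, 0, 0.5))  # → 3: the three left-hand points fall in the circle
```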
It works like this:
- First, select two parameters: a radius epsilon and a minimum number of points minPoints.
- Then, start from an arbitrary point in the data set. If there are at least minPoints points (including the point itself) within distance epsilon of this point, all of these points are considered part of one cluster.
- Next, extend the cluster by checking each newly added point: if it also has at least minPoints points within distance epsilon, its neighbours are added as well, growing the cluster recursively.
- Finally, when no more points can be added to the cluster, select a new arbitrary unvisited point and repeat the process.
If a selected point has fewer than minPoints points within distance epsilon and is not reachable from any other cluster, it is considered a “noise point” that belongs to no cluster.
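The steps above can be sketched from scratch in Python/NumPy (a minimal illustration of the procedure, not a production implementation; all names are my own):

```python
import numpy as np

NOISE = -1  # label for points that belong to no cluster

def dbscan(X, eps, min_points):
    n = len(X)
    labels = np.zeros(n, dtype=int)          # 0 = not yet assigned
    visited = np.zeros(n, dtype=bool)
    # Precompute all pairwise distances (O(n^2) time and memory).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(np.flatnonzero(D[i] <= eps))
        if len(neighbors) < min_points:
            labels[i] = NOISE                # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        # Grow the cluster: each dense neighbour contributes its own
        # neighbourhood (breadth-first expansion instead of recursion).
        k = 0
        while k < len(neighbors):
            j = neighbors[k]
            if not visited[j]:
                visited[j] = True
                nbrs2 = np.flatnonzero(D[j] <= eps)
                if len(nbrs2) >= min_points:
                    neighbors.extend(nbrs2)
            if labels[j] <= 0:               # unassigned or provisional noise
                labels[j] = cluster
            k += 1
    return labels

# Demo: 100 evenly spaced points on a unit circle plus one far outlier.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
ring = np.c_[np.cos(theta), np.sin(theta)]
data = np.vstack([ring, [[10.0, 10.0]]])
labels = dbscan(data, eps=0.5, min_points=3)
print(labels.max(), labels[-1])  # the ring forms one cluster; the outlier is noise
```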
2.3. Advantages and Disadvantages
(1) Advantages
- Overcomes the shortcoming of distance-based algorithms, which can only find roughly circular (convex) clusters.
- Can discover clusters of arbitrary shape and is insensitive to noisy data.
- Does not require the number of clusters to be specified in advance.
- Has only two parameters: the scanning radius (eps) and the minimum number of included points (min_samples).
(2) Disadvantages:
- Computational complexity: without any optimization, the time complexity of the algorithm is O(N²); an R-tree, k-d tree, or ball tree index is usually used to speed up neighbourhood queries and reduce the time complexity to O(N log N).
- Strongly affected by eps. When the density within a class is uneven: if eps is small, a sparse cluster will be split into multiple clusters with similar properties; if eps is large, clusters that are close together and dense will be merged into one. In high-dimensional data, choosing eps is even harder because of the curse of dimensionality.
- Depends on the choice of distance metric; because of the curse of dimensionality, distances become less and less informative as the dimension grows.
- Not well suited to data sets with large density differences, because a single eps (and metric) is then hard to choose.
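The eps sensitivity described above is easy to demonstrate with scikit-learn (the data and parameter values below are invented for the demo): two tight, well-separated blobs stay distinct with a small eps but merge into one cluster when eps exceeds the gap between them.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two tight Gaussian blobs whose centres are 3 units apart.
blob1 = rng.normal((0.0, 0.0), 0.1, size=(100, 2))
blob2 = rng.normal((3.0, 0.0), 0.1, size=(100, 2))
X = np.vstack([blob1, blob2])

counts = {}
for eps in (0.5, 5.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    counts[eps] = len(set(labels) - {-1})   # clusters found (noise excluded)
print(counts)  # small eps keeps the blobs apart; large eps merges them
```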
3. Cluster analysis based on DBSCAN
Two noisy circles with radii 0.8 and 5 are randomly generated. Each circle contains 400 points, for a total of 800 points of raw data. The raw data are visualized below:
Figure 3.1 Raw data visualization
A self-written MATLAB implementation of DBSCAN is used to compute the clustering results. Different radius and density parameters were selected, giving the following three scenarios:
| Scenario | Epsilon | MinPoints |
| --- | --- | --- |
| 1 | 0.5 | 5 |
| 2 | 0.5 | 10 |
| 3 | 1 | 5 |
The clustering results are as follows:
Figure 3.2 Epsilon=0.5, MinPoints=5 clustering results
Figure 3.3 Epsilon=0.5, MinPoints=10 clustering results
Figure 3.4 Epsilon=1, MinPoints=5 clustering results
It can be seen that the density parameters strongly affect the clustering results: the larger the radius eps, the fewer clusters are found, and vice versa. In addition, DBSCAN can discover clusters of arbitrary shapes and is insensitive to noise.
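For readers without MATLAB, the same experiment can be sketched in Python with scikit-learn (the data generation mirrors the appendix code, including its scalar noise term; cluster counts will not match the figures exactly because the random data differ):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
N = 400
theta = np.linspace(0, 2 * np.pi, N)

def noisy_ring(r):
    # Ring of radius r plus uniform [0, 1) noise, as in the appendix code.
    return r * np.c_[np.cos(theta), np.sin(theta)] + rng.random((N, 1))

X = np.vstack([noisy_ring(0.8), noisy_ring(5.0)])

results = {}
for eps, min_pts in [(0.5, 5), (0.5, 10), (1.0, 5)]:
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    results[(eps, min_pts)] = len(set(labels) - {-1})
print(results)  # number of clusters found in each of the three scenarios
```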
Appendix – Case Code
The MATLAB code for this case is as follows:
clc; clear; close all;

%% Generate data: two noisy circles
N = 400;                              % size of each cluster
r1 = 0.8; r2 = 5;                     % radii of the two circles
theta = linspace(0, 2*pi, N)';
X1 = r1*[cos(theta), sin(theta)] + rand(N,1);
X2 = r2*[cos(theta), sin(theta)] + rand(N,1);
X = [X1; X2];                         % noisy 2-D circular data set

%% Visualize raw data
figure('name', 'Original data visualization');
scatter(X(:,1), X(:,2));
set(gca, 'FontSize', 12);

%% Run DBSCAN clustering
epsilon = 1;                          % the two key parameters
MinPts = 5;
IDX = DBSCAN(X, epsilon, MinPts);

%% Plot results
figure('name', 'DBSCAN clustering result');
PlotClusterinResult(X, IDX);
set(gca, 'FontSize', 12);
title(['DBSCAN Clustering (\epsilon = ' num2str(epsilon) ...
       ', MinPts = ' num2str(MinPts) ')']);

%% ---- DBSCAN function ----
function [IDX, isnoise] = DBSCAN(X, epsilon, MinPts)
    % DBSCAN clustering function
    C = 0;
    n = size(X, 1);
    IDX = zeros(n, 1);                % 0 = unassigned
    D = pdist2(X, X);                 % pairwise distance matrix
    visited = false(n, 1);
    isnoise = false(n, 1);

    for i = 1:n
        if ~visited(i)
            visited(i) = true;
            Neighbors = RegionQuery(i);
            if numel(Neighbors) < MinPts
                isnoise(i) = true;    % X(i,:) is a noise point
            else
                C = C + 1;
                ExpandCluster(i, Neighbors, C);
            end
        end
    end

    function ExpandCluster(i, Neighbors, C)
        IDX(i) = C;
        k = 1;
        while true
            j = Neighbors(k);
            if ~visited(j)
                visited(j) = true;
                Neighbors2 = RegionQuery(j);
                if numel(Neighbors2) >= MinPts
                    Neighbors = [Neighbors Neighbors2]; %#ok
                end
            end
            if IDX(j) == 0
                IDX(j) = C;
            end
            k = k + 1;
            if k > numel(Neighbors)
                break;
            end
        end
    end

    function Neighbors = RegionQuery(i)
        % Indices of points whose distance to X(i,:) is <= epsilon
        Neighbors = find(D(i,:) <= epsilon);
    end
end

%% ---- Plotting function ----
function PlotClusterinResult(X, IDX)
    k = max(IDX);
    Colors = hsv(k);
    Legends = {};
    for i = 0:k
        Xi = X(IDX == i, :);
        if i ~= 0
            Style = 'x';
            MarkerSize = 8;
            Color = Colors(i,:);
            Legends{end+1} = ['Cluster #' num2str(i)];
        else
            Style = 'o';              % cluster 0 = noise points
            MarkerSize = 6;
            Color = [0 0 0];
            if ~isempty(Xi)
                Legends{end+1} = 'Noise';
            end
        end
        if ~isempty(Xi)
            plot(Xi(:,1), Xi(:,2), Style, ...
                 'MarkerSize', MarkerSize, 'Color', Color);
        end
        hold on;
    end
    hold off;
    axis equal;
    grid on;
    legend(Legends);
    legend('Location', 'NorthEastOutside');
end