Difference between K-means and DBSCAN clustering?
- Naveen
Clustering involves grouping data points by similarity. In unsupervised machine learning, for example, data points are grouped into clusters depending on the information available in the dataset. The data items in the same cluster are similar to each other, while the items in different clusters are dissimilar.
K-Means and DBSCAN are two of the most popular clustering algorithms. Both are simple to understand and fairly easy to implement. I have used both of them and I found that, while K-Means was powerful and interesting enough, DBSCAN was much more interesting.
The algorithms are as follows:
K-Means: K-Means clustering is one of the most popular clustering algorithms. It is a centroid-based (partition-based) algorithm that groups the data into K clusters of similar points.
- K centroids are placed at random, one for each cluster.
- The distance between each point and each centroid is calculated.
- Each data point is assigned to its closest centroid; the points assigned to the same centroid form a cluster.
- The position of each of the K centroids is recalculated as the mean of the points in its cluster.
- The distance, assignment and recalculation steps are repeated until the assignments stop changing.
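The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production implementation: the function name `kmeans`, the `max_iters` and `seed` parameters, and the toy `points` are assumptions made for the example, and the initial centroids are drawn from the data points themselves, which is one common way of placing them at random.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Basic K-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Place K centroids at random by picking K distinct data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assign every point to the index of its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels

# Two small, well-separated groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points should end up in one cluster and the last three in the other; which of the two gets label 0 depends on the random starting centroids.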
Advantages of K-Means
- It is easy to use, understand and implement.
- It can handle large datasets.
Disadvantages of K-Means
- The number of clusters/centroids (K) has to be chosen in advance, and finding the right value can be complicated. You might want to try the elbow method to help you choose it.
- Outliers can disrupt the operation of the algorithm. This is because outliers drag centroids toward themselves, which causes clusters to get skewed.
- As the number of dimensions increases, Euclidean distance becomes less informative: the points are all far apart, and the distances between them converge to a nearly constant value.
- As the number of dimensions increases, this method also becomes slower.
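The point about high dimensions can be checked with a small experiment. The sketch below is an illustration under stated assumptions rather than a proof: it uses uniformly random points, and the helper name `relative_contrast` and its parameters are made up for this example. It measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# The contrast should shrink as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast is close to zero, the nearest and the farthest centroid look almost equally far away, so assigning a point to its "closest" centroid carries little information.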
DBSCAN Clustering: DBSCAN is an efficient density-based clustering algorithm with a few key features. The most important one is that, inside a cluster, the neighborhood of radius R around a point must contain at least a given number of points (M). This heuristic has proven to be extremely effective for identifying clusters.
Algorithm:
- Every data point in the dataset is one of the following types:
Core Point: A data point is a core point if it has at least M points near it, i.e. within the specified radius R.
Border Point: A data point is a border point if:
- Its neighborhood contains fewer than M data points, and
- It is reachable from some core point, i.e. it lies within distance R of that core point.
Outlier Point: An outlier (noise) point is neither a core point nor within distance R of any core point, so it cannot be connected to any cluster.
- The outliers are set aside as noise and are not assigned to any cluster.
- Core points that are neighbors (within distance R of each other) are grouped into the same cluster.
- Each border point is assigned to the cluster of a core point it is reachable from.
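The point types and grouping steps above can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation: the names `dbscan`, `radius`, `min_pts` and `NOISE` are choices made for this example, the toy data is invented, and a point is assumed to count as a member of its own neighborhood (conventions differ on this).

```python
import math

NOISE = -1  # label used for outlier points

def dbscan(points, radius, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    # Neighborhood of i: all points within `radius` of it (itself included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [NOISE] * n
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue  # only an unassigned core point starts a new cluster
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neighbors[current]:
                if labels[j] == NOISE:
                    labels[j] = cluster_id   # core or border point joins
                    if is_core[j]:
                        frontier.append(j)   # only core points keep expanding
        cluster_id += 1
    return labels

points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),          # dense group A
    (2.2, 1.0),                                               # border of A
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),  # dense group B
    (5.0, 5.0),                                               # isolated point
]
labels = dbscan(points, radius=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Here the point (2.2, 1.0) has too few neighbors to be a core point but lies within the radius of the core point (1.0, 1.0), so it joins cluster 0 as a border point, while (5.0, 5.0) is reachable from no core point and stays labeled as noise.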
Advantages of DBSCAN
- This algorithm has been shown to work well for datasets with lots of noise.
- It can identify outliers easily.
- It can find clusters of arbitrary shape. Unlike K-Means, it is not limited to roughly spherical clusters.
Disadvantages of DBSCAN
- This algorithm works best when the clusters are dense. On sparse data, or data whose density varies a lot, a single radius and a single M do not fit every cluster.
- The results are sensitive to the choice of the radius R (often called eps) and the minimum number of points M (often called minPts).
- The algorithm is not easily partitioned to run on multiprocessor systems.