What are GMM and Agglomerative Clustering?
- Naveen
A Gaussian mixture model (GMM) is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of multivariate Gaussian distributions. The parameters of these distributions (mixture weights, means, and covariances) are unknown and are estimated from the data. A GMM can be thought of as a generalization of K-means: both locate the centers of latent clusters, but unlike K-means, a GMM also takes the covariance structure of each distribution into account. Fitting uses the expectation-maximization (EM) algorithm, which iteratively updates the distribution parameters to increase the log-likelihood, a measure of how well the model explains the data. The key steps performed in this model are:
- Initialize k Gaussian distributions.
- For each point, calculate the probability that it is associated with each distribution.
- Recalculate the distribution parameters based on the probabilities associated with each point.
- Repeat the previous two steps until the log-likelihood stops improving (EM converges to a local maximum).
There are 4 options for calculating covariances in GMM:
- Full: Each distribution has its own general covariance matrix.
- Tied: All distributions share a general covariance matrix.
- Diag: Each distribution has its own diagonal covariance matrix.
- Spherical: Each distribution has its own single variance.
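One way to see the difference between these options is to look at the shape of the fitted covariance parameters. The sketch below assumes scikit-learn's `GaussianMixture` and its `covariance_type` parameter; the synthetic three-blob dataset is made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data: three blobs (an assumption for this example).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[5.0, 5.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[0.0, 6.0], scale=1.0, size=(200, 2)),
])

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(X)
    # The array shape reflects how many free covariance values each option stores.
    shapes[cov_type] = gmm.covariances_.shape
    print(cov_type, shapes[cov_type])
```

With 3 components and 2 features, `full` stores one 2x2 matrix per component, `tied` stores a single 2x2 matrix, `diag` stores 2 variances per component, and `spherical` stores one variance per component.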
We have to choose both the covariance type and the number of clusters (components) in the model. The BIC score, Silhouette score, Calinski-Harabasz score, and Davies-Bouldin score can be used to select both with a grid search.
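A grid search of this kind might look like the following sketch, which uses only the BIC score (lower is better) via scikit-learn's `GaussianMixture.bic`. The helper name `select_gmm_by_bic`, the candidate ranges, and the synthetic data are assumptions for this example. The other three scores are available in `sklearn.metrics` as `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score` and could be computed from the predicted labels in the same loop.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(X, component_range, covariance_types, random_state=0):
    """Fit one GMM per (covariance type, k) pair and keep the lowest BIC."""
    best = None
    scores = {}
    for cov_type in covariance_types:
        for k in component_range:
            gmm = GaussianMixture(
                n_components=k,
                covariance_type=cov_type,
                n_init=3,  # a few restarts to reduce sensitivity to initialization
                random_state=random_state,
            ).fit(X)
            bic = gmm.bic(X)
            scores[(cov_type, k)] = bic
            if best is None or bic < best[0]:
                best = (bic, cov_type, k)
    return best, scores


# Synthetic 2-D data with three well-separated blobs (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[8.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[4.0, 8.0], scale=1.0, size=(200, 2)),
])

best, scores = select_gmm_by_bic(X, range(1, 6), ["full", "tied", "diag", "spherical"])
best_bic, best_cov_type, best_k = best
print(best_cov_type, best_k)
```

BIC penalizes the number of free parameters, so it tends to prefer simpler covariance types and fewer components when they explain the data about as well as richer models.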
Agglomerative clustering is a family of hierarchical clustering algorithms that build nested clusters by merging them successively, starting with each data point in its own cluster. This hierarchy of clusters can be represented as a tree diagram known as a dendrogram. The root of the tree is a single cluster containing all data points, while the leaves are the individual points. The linkage criterion determines which clusters are merged at each step:
- Single linkage: Minimizes the distance between the closest observations of pairs of clusters.
- Complete or Maximum linkage: Minimizes the maximum distance between observations of pairs of clusters.
- Average linkage: Minimizes the average of the distances between all observations of pairs of clusters.
- Ward: Minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach, so its objective is conceptually similar to that of K-means, but it is optimized with an agglomerative hierarchical approach.
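To make the four criteria concrete, here is a small plain-Python sketch of each one, together with a naive merging loop. The function names (`single`, `complete`, `average`, `ward`, `agglomerate`) and the toy points are assumptions for this example, and the loop is far slower than what a library such as scikit-learn or SciPy would do; it is meant only to show the idea of repeatedly merging the pair of clusters with the smallest linkage value.

```python
import math
from itertools import combinations


def single(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)


def complete(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)


def average(a, b):
    """Mean distance over all pairs of points, one from each cluster."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def ward(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    factor = len(a) * len(b) / (len(a) + len(b))
    return factor * math.dist(centroid(a), centroid(b)) ** 2


def agglomerate(points, linkage, n_clusters):
    """Merge the two closest clusters (under `linkage`) until n_clusters remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    merges = []  # record of merges, i.e. the information a dendrogram draws
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for fn in (single, complete, average, ward):
    clusters, merges = agglomerate(points, fn, n_clusters=2)
    print(fn.__name__, sorted(sorted(c) for c in clusters))
```

On data this cleanly separated, all four criteria recover the same two groups. They tend to differ on elongated or noisy clusters, where single linkage can chain points together and Ward tends to favor compact, similarly sized clusters.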