Cluster Analysis aims to establish a set of clusters such that cases within a cluster are more similar to each other than they are to cases in other clusters.
In other words, we're using data to arrange objects into groups. Arranging objects into groups is a natural skill we all share.
In everyday life, we use an item's characteristics or traits to decide which group to put it in. In Cluster Analysis, the metrics "similarity" and "distance" are used to perform that very same action when arranging items into groups.
Minitab uses a hierarchical clustering method: it starts with single-member clusters, which are then fused to form larger clusters (this is also known as an agglomerative method).
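To make the agglomerative idea concrete, here is a minimal sketch using SciPy (an analog, not Minitab's own implementation; the data values are made up for illustration). Every observation starts as its own cluster, and the two closest clusters are fused at each step; SciPy's linkage matrix records this merge history:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical observations measured on two variables
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.0], [8.3, 8.1]])
Z = linkage(X, method="single", metric="euclidean")

for i, (a, b, dist, size) in enumerate(Z):
    # Clusters 0..3 are the original points; ids 4, 5, ... label merged clusters
    print(f"step {i}: fuse cluster {int(a)} and {int(b)} at distance {dist:.2f} "
          f"-> new cluster of size {int(size)}")
```

Notice that the two tight pairs fuse first at small distances, and the final step joins the two far-apart groups at a much larger distance.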
Here are some tips and facts about Hierarchical Clustering:
- Appropriate for smaller samples, e.g., fewer than 250 observations.
- The larger the distance coefficient at which clusters merge, the more the clustering involves combining unlike entities, which may be undesirable.
- The researcher must specify the similarity/distance measure, how clusters are aggregated, and how many clusters are needed.
- Clusters are nested rather than being mutually exclusive.
The first step in cluster analysis calculations is establishing a data matrix of distances between the observations that you are analyzing. “Distance” really is a distance metric: it’s trying to calculate the distance between clusters, which helps in figuring out which items go into certain clusters. There are different ways of measuring distances between two clusters.
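This first step can be sketched in a few lines of Python (again an analog, not Minitab itself; the data are hypothetical). We compute all pairwise Euclidean distances between observations and arrange them as a symmetric matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Five hypothetical observations measured on two variables
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [8.0, 8.0],
              [8.2, 7.9],
              [0.5, 0.6]])

# Condensed vector of pairwise Euclidean distances, then the full symmetric matrix
d = pdist(X, metric="euclidean")
D = squareform(d)
print(np.round(D, 2))
```

Entry D[i, j] is the distance between observations i and j; the diagonal is zero because every observation is at distance zero from itself.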
There are also different ways to determine how distances between two clusters are defined, via linkage methods. Now, that sounds like the same thing we just said in the preceding paragraph, right? Not quite. "Distance measures" focus on how to calculate the distance between clusters once the start and end points have been established, while "linkage methods" help determine those start and end points. Some linkage methods are single linkage (a), complete linkage (b), and average linkage (c):
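These three linkage methods can be compared on the same small data set. A sketch using SciPy (whose single/complete/average options are analogous to, though not identical with, Minitab's; the data are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9], [0.5, 0.6]])

final = {}
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    # Each row of Z records one merge: (cluster i, cluster j, distance, size)
    final[method] = Z[-1, 2]
    print(f"{method}: final merge distance = {final[method]:.2f}")
```

Single linkage joins clusters by their nearest members, complete linkage by their farthest members, and average linkage by the mean of all pairwise distances, so the final merge distance grows from single to average to complete.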
In the Minitab dialog window for Cluster Observations, you must specify the final partition via a final number of clusters or by a level of similarity. This can be confusing because the common expectation is for Minitab to come up with the final groupings for all clusters.
As mentioned before, the clustering technique utilizes linkage and distance methods to make sense of the items being grouped and link them accordingly, in a hierarchical fashion. However, it's the user's responsibility to view the hierarchy and make sense of it. A good way of doing this is by looking at a dendrogram, which graphically represents the hierarchical clustering as a tree. Sometimes it's useful to first look at the dendrogram without specifying a final partition. Here is an example of how Minitab determines the grouping if you chose a final partition of 4 clusters:
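The same "cut the tree into a fixed number of clusters" step can be sketched with SciPy, where `fcluster` plays the role of Minitab's "Number of clusters" field (hypothetical data; not Minitab's implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Eight hypothetical observations forming two well-separated groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9],
              [0.5, 0.6], [9.0, 9.1], [1.1, 1.0], [8.5, 8.4]])
Z = linkage(X, method="average")

# Request a final partition of exactly 4 clusters
labels = fcluster(Z, t=4, criterion="maxclust")
print(labels)  # one cluster id (1-4) per observation
```

The tree itself does not change; specifying 4 clusters simply picks the height at which to cut it, and every branch below that cut becomes one cluster.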
The same method applies when searching by similarity level. If you specify a similarity of 50, Minitab identifies the fastest routes to 100% similarity from this starting point. If one vertical line is a straight shot down to 100%, it's a cluster. If another line is a straight shot down to a group of other observations, then that entire group is a cluster.
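Cutting by similarity level can also be sketched in code. Minitab reports similarity as 100 × (1 − d/dmax), where dmax is the largest distance in the original distance matrix (an assumption worth checking against Minitab's documentation); under that convention, a 50% similarity level translates into a distance threshold like so:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Same hypothetical two-group data as before
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9],
              [0.5, 0.6], [9.0, 9.1], [1.1, 1.0], [8.5, 8.4]])
Z = linkage(X, method="average")

d_max = pdist(X).max()                      # largest pairwise distance
similarity = 50.0                           # cut the tree at 50% similarity
threshold = (1 - similarity / 100.0) * d_max  # assumed Minitab convention
labels = fcluster(Z, t=threshold, criterion="distance")
print(labels)  # one cluster id per observation
```

With two well-separated groups, a cut at 50% similarity lands well above every within-group merge and below the final merge, so each group comes out as one cluster.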
There are many applications for Cluster Analysis:
- Biology: transcriptomics, sequence analysis
- Marketing research: used with survey data to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers and potential customers
- Social network analysis: recognizing communities
- Image segmentation: dividing digital images into regions for object recognition
- Data mining: partitioning data items into subsets
I hope these tips help next time you open up Minitab!