Multivariate statistics can be used to better understand the structure of large data sets, typically customer-related data.

Suppose you have a large amount of data about your customers (preferences, degree of satisfaction, expectations, dislikes, etc.) and a large number of variables you need to analyze.

Your data might seem somewhat chaotic at first, and you might consider using many different types of graphs to better understand the overall data structure. But a large number of variables makes it very difficult to obtain the full picture in only a few graphs. At this point, you need more powerful statistical tools, such as multivariate techniques.

I'll focus here on Principal Component Analysis (PCA) to analyze a large dataset. Multivariate techniques are very useful when you need to summarize many variables into a smaller number of variables (i.e., reduce the number of dimensions) to simplify the analysis of a dataset and better understand how variables may be grouped.

## The Tour de France

A few years ago Minitab organized a “Tour de France,” with events taking place in various places and cities around France. We presented new features of Minitab Statistical Software to audiences of customers and potential customers. Afterwards, we asked attendees to fill in evaluation forms.

We wanted to understand whether we had different profiles of attendees and, if so, what they were. We also wanted to identify the attendees' most important expectations, and see if some of the questions were redundant (and therefore highly correlated). We posed more than ten questions to 115 attendees, who responded with a Yes (1) or a No (0). I then performed a principal component analysis (PCA) on the results.
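Redundant yes/no questions show up as highly correlated columns. The Python sketch below is only an illustration of that check: the answers are randomly generated stand-ins, not the actual survey data, and the `pearson` helper is a hypothetical name. On 0/1 data, the Pearson correlation is the same quantity as the phi coefficient.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation; on 0/1 data this equals the phi coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# ASSUMPTION: synthetic stand-in for the survey (115 attendees, 1 = Yes, 0 = No).
rng = random.Random(0)
info_training = [rng.randint(0, 1) for _ in range(115)]
# A nearly redundant question: same answer about 90% of the time.
info_minitab = [a if rng.random() < 0.9 else 1 - a for a in info_training]
# An unrelated question, answered independently.
learn_software = [rng.randint(0, 1) for _ in range(115)]

r_redundant = pearson(info_training, info_minitab)
r_unrelated = pearson(info_training, learn_software)
print(f"Info Training vs Info Minitab:        r = {r_redundant:+.2f}")
print(f"Info Training vs Learn about software: r = {r_unrelated:+.2f}")
```

A pair of questions with a correlation close to 1 carries almost the same information, which is exactly the kind of redundancy PCA summarizes into a single component.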

## Principal Component Analysis

The graph below is a Loading Plot from a principal component analysis. Lines that go in the same direction and are close to one another indicate how the variables may be grouped. In this diagram, the first component in the horizontal direction is a summary of the Info Training, Info Minitab, Info Quality Companion variables, and of the Meet Minitab, Meet users, Why use Quality Companion, and Other events variables. These variables have been grouped together as they are closely associated/correlated from a statistical point of view.
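To make the idea of a loading plot more concrete, here is a minimal Python sketch under several assumptions: the data are synthetic (two made-up latent interests, not the real survey), loadings are taken to be the eigenvector coefficients of the correlation matrix, and the eigenvectors are found by power iteration with deflation simply to keep the example free of dependencies (this is not a description of how Minitab computes them). All function names are hypothetical.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_eigenpair(m, rng, iters=1000):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    p = len(m)
    v = [rng.random() + 0.1 for _ in range(p)]  # random positive start vector
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
    return sum(a * b for a, b in zip(v, mv)), v

def pca_loadings(columns, n_components=2, seed=0):
    """Eigenvalues and loadings of the correlation matrix of `columns`."""
    rng = random.Random(seed)
    p = len(columns)
    m = [[pearson(a, b) for b in columns] for a in columns]
    eigvals, loadings = [], []
    for _ in range(n_components):
        lam, v = top_eigenpair(m, rng)
        eigvals.append(lam)
        loadings.append(v)
        # Deflate: subtract the component just found so the next one can emerge.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return eigvals, loadings

def noisy_copy(latent, rng, agree=0.9):
    """Answer that matches the latent interest about 90% of the time."""
    return [a if rng.random() < agree else 1 - a for a in latent]

# ASSUMPTION: synthetic data with two balanced, nearly uncorrelated interests.
n = 115
info = [i % 2 for i in range(n)]          # wants information
tech = [(i // 2) % 2 for i in range(n)]   # wants technical content
rng = random.Random(1)
names = ["Info Training", "Info Minitab", "Info Quality Companion",
         "Learn about software", "Why use Minitab"]
columns = ([noisy_copy(info, rng) for _ in range(3)]
           + [noisy_copy(tech, rng) for _ in range(2)])

eigvals, (pc1, pc2) = pca_loadings(columns)
for name, a, b in zip(names, pc1, pc2):
    print(f"{name:24s} PC1 = {a:+.2f}  PC2 = {b:+.2f}")
```

Variables driven by the same latent interest end up with similar loadings on the same component, which is what "lines going in the same direction" means on the plot. The sign of each component is arbitrary, so only relative directions matter.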

The first component (in the horizontal direction) is the most important one: these attendees came to get more information (about training, Minitab and Quality Companion), and these three variables are strongly correlated.

Some attendees had more “social” interests (Meet users or Minitab employees, willingness to attend other events, learning why to use Quality Companion), and these variables were also strongly correlated. So we see evidence for two types of attendee expectations: Get more information (“Info”) and Meet other professionals (“Social expectations”).

Let's now consider the second component (in the vertical direction) and the two variables that are closely associated with it: Learn about software and Why use Minitab. The second component summarizes these two variables. It also helps us differentiate between the “Info” variables and the more “Social” ones.

We therefore have three different types of expectations: Get more Info (three variables), which is somewhat correlated (same direction) with more “Social” expectations (four variables), and then, completely uncorrelated with the previous expectations, more “Technical” requests (Learn about software and Why use Minitab: two variables).

The ten variables were later summarized into only two components (in the graph below), and these variables have been distributed into three different groups: Get more Info, “Social” interests and “Technical” expectations. Thanks to this multivariate analysis, we can better understand the expectations of the attendees.

A cluster analysis (another multivariate technique) was used to distribute the attendees into three coherent groups according to the first and second components. A cluster analysis seeks to minimize the differences *within* groups (for greater cohesiveness) while maximizing the differences *between* groups.
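The post does not say which clustering method was used, so the following Python sketch assumes plain k-means (Lloyd's algorithm) purely for illustration. The component scores are synthetic, and the three group centers are made up to loosely mimic a score plot with three profiles. The `kmeans` function is a hypothetical helper, not a Minitab routine.

```python
import math
import random

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    # Farthest-first initialisation: deterministic, spreads the start centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster as is
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# ASSUMPTION: synthetic (first component, second component) scores, 115 attendees.
rng = random.Random(42)
centers = [(-2.0, 2.0), (2.0, 0.0), (-2.0, -2.0)]  # made-up group centers
sizes = [35, 45, 35]
scores, truth = [], []
for g, ((cx, cy), size) in enumerate(zip(centers, sizes)):
    for _ in range(size):
        scores.append((rng.gauss(cx, 0.4), rng.gauss(cy, 0.4)))
        truth.append(g)

labels, centroids = kmeans(scores, k=3)

# Within-group sum of squares (what k-means minimizes) vs. total sum of squares.
wss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(scores, labels))
grand = tuple(sum(c) / len(scores) for c in zip(*scores))
tss = sum(math.dist(p, grand) ** 2 for p in scores)
print(f"within-group SS = {wss:.1f}, total SS = {tss:.1f}")
```

The difference between the total and the within-group sum of squares is the between-group part, so a small within-group value relative to the total means cohesive, well-separated groups.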

Compare the Loading plot to the Score plot: the black points (in the upper left corner) tend to represent attendees who are interested in more “Technical” issues, whereas the green points represent attendees who are interested in getting more info and who also have “Social” interests. The last group, with red points (in the lower left corner), is more neutral, with no major, well-defined interest.

So we've identified three different attendee profiles, and the Score Plot in Minitab is helpful to show how a particular individual is positioned when considering the main variables (by using the brushing functionality in Minitab, for example).

Finally, in the Score plot below, we have displayed the professional positions of the attendees according to the two main PCA components. Master Black Belts (the violet triangles) are positioned in the lower right corner and seem to have more “Social” expectations from such meetings, whereas engineering professionals tend to be positioned in the upper left corner, with more “Technical” expectations. The other professional groups tend to be spread over a large area, making it difficult to define a clear behavior.

Thanks to this Principal Component Analysis we now have a much better view of what different groups of attendees expect from us. The way in which variables are grouped, according to correlations, makes sense from a logical point of view.

This type of multivariate statistical analysis can be used in many different contexts. Although this approach is computationally intensive, the graphs in Minitab make it easier to understand the structure of your data.

Time: Thursday, November 10, 2011

Thanks for working out this example. This helps me further understand the use of these analyses.

I would like to have the raw datafile, or mtb file with the data if possible, so I can repeat the analysis for myself.

Hope this is available.

Erwin

Time: Tuesday, November 15, 2011

I agree with Erwin. The value of the blog would be increased if we could repeat the examples at home. The possibility of downloading the data would be highly appreciated.

Karoly

Time: Thursday, February 13, 2014

Thanks for such a nice and lucid explanation. I am a biologist by training. I have a query regarding the number of independent groups that can be included at once. That is, can I perform PCA/PLS-DA for 3 independent groups? I am asking this because I have rarely seen papers involving more than two groups for NMR-based multivariate analysis. Your explanation and example (score plot) show that we can indeed.