Adam Ozimek published an interesting post on April 15th on the Modeled Behavior blog at Forbes.com. He observed that one of the advantages of big data is how easy it is to get test data to validate a model that you built from sample data.
Ozimek notes that he is “for the most part a p-value checking, residual examining, data modeling culture economist,” but he’s correct to observe that if you can test your model on real data, then you should.
What I’ll describe is certainly not the only way to divide data in Minitab Statistical Software. Still, I think it’s pretty good, if I do say so myself. Want to follow along? I’ll use steps that go with the Educational Placement data set for Minitab 17. To follow along in Minitab 16, use the Education.MTW data set, but change the column titles as appropriate. (Here's how to find the sample data set folder in Minitab 16 if its existence is new to you.)
Different professionals will need to divide their data into different numbers of groups and in different proportions. The goal is usually to use part of the data to develop a model and part of the data to test the prediction quality of the model. The basic operations in Minitab are similar whether you're dividing the data into two samples for training and validation or into three samples for fitting, validation, and testing. The steps are also pretty similar whether you want equally sized groups or want to use only 10% of your data for validation.
I’m going to divide the data set into two groups where the size of the training set is twice the size of the validation set.
First, I’m going to set up a column that I’ll use to randomly assign the 180 observations in the data set to the two samples. A data set this small doesn’t really require so many steps, but the steps can save you some effort when your data set is large enough that keeping track of the row counts takes some thought.
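If you like to see the logic spelled out, here is a minimal sketch of that assignment column in Python. It is not a Minitab command, just the same idea in code. The function name `make_group_labels` and the way leftover rows are handed to the first group are my own assumptions for illustration.

```python
def make_group_labels(n_obs, weights):
    """Build a column of group labels (1, 2, ...) sized in proportion to weights.

    Hypothetical helper for illustration; not part of Minitab.
    """
    total = sum(weights)
    # Integer share of rows for each group, rounded down
    sizes = [n_obs * w // total for w in weights]
    # Assumption: any rows lost to rounding go to the first group
    sizes[0] += n_obs - sum(sizes)
    labels = []
    for group, size in enumerate(sizes, start=1):
        labels.extend([group] * size)
    return labels


# 180 observations, training set twice the size of the validation set
labels = make_group_labels(180, [2, 1])
print(labels.count(1), labels.count(2))  # prints: 120 60
```

The same helper would cover other layouts by changing the weights, for example `[9, 1]` to hold back 10% for validation or three weights for a fitting, validation, and testing split. At this point the labels are still in order, so they line up with the sort order of the data.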
Now we’ll randomize the group assignments to reduce bias. After all, the data in this data set are in order by track, and you probably would not want to fit the model with the data from tracks 1 and 2 and then validate it with the data from track 3.
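As a rough sketch of the same idea outside Minitab, shuffling the label column in Python might look like this. The seed value 17 is arbitrary, chosen only so the example is repeatable, and the label list is written out directly so the snippet stands on its own.

```python
import random

# Ordered labels: 120 rows for group 1 (training), 60 rows for group 2 (validation)
labels = [1] * 120 + [2] * 60

# A seeded generator makes the shuffle repeatable; the seed is an arbitrary choice
rng = random.Random(17)

# Shuffle a copy so the original ordered column is left untouched
shuffled = labels[:]
rng.shuffle(shuffled)

# The group sizes are unchanged; only the order of the labels is different
print(shuffled.count(1), shuffled.count(2))  # prints: 120 60
```

Because only the order changes, every row still gets exactly one label, and the 2-to-1 proportion is preserved no matter how the original data were sorted.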
Now, you have a column that identifies which observations belong in each data set.
Now, you have two data sets, one with 120 observations and one with 60 observations. You can fit the model on the larger data set, then use the second data set to validate the model.
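For readers who want to see the split itself as code, here is a small Python sketch that separates rows by a group column. The rows are made-up stand-ins, not the Educational Placement data: the `id` and `track` fields, the equal track sizes, the helper name `split_by_group`, and the seed are all assumptions for illustration.

```python
import random


def split_by_group(rows, groups):
    """Return a dict mapping each group label to the list of rows carrying it."""
    subsets = {}
    for row, group in zip(rows, groups):
        subsets.setdefault(group, []).append(row)
    return subsets


# Made-up rows sorted by track, standing in for the real worksheet
rows = [{"id": i, "track": 1 + i // 60} for i in range(180)]

# Randomized group column: 1 = training, 2 = validation (seed is arbitrary)
groups = [1] * 120 + [2] * 60
random.Random(17).shuffle(groups)

subsets = split_by_group(rows, groups)
training, validation = subsets[1], subsets[2]
print(len(training), len(validation))  # prints: 120 60
```

Each row lands in exactly one subset, so the two pieces together still account for all 180 observations.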
Statistics like predicted R² have done a lot to help us choose good models when we don’t have enough data both to get good estimates and to see how well the model does on new data. But in cases where you have access to more data than you need to get good parameter estimates, it’s good practice to use some of the data for validation. With a few moves in Minitab, you’re all set to go. Next time I post, I'll show some ways that you can use a validation data set to check the quality of a model in Minitab.