Dividing a Data Set into Training and Validation Samples
Adam Ozimek had an interesting post April 15th on the Modeled Behavior blog at Forbes.com. He observed that one of the advantages of big data is how easy it is to get test data to validate a model that you built from sample data.
Ozimek notes that he is “for the most part a p-value checking, residual examining, data modeling culture economist,” but he’s correct to observe that if you can test your model on real data, then you should.
What I’ll describe is certainly not the only way to divide data in Minitab Statistical Software. Still, I think it’s pretty good if I do say so myself. Want to follow along? I’ll use steps that go with the Educational Placement data set for Minitab 17. To follow along in Minitab 16, use the Education.MTW data set, but change the column titles as appropriate. (Here's how to find the sample data set folder in Minitab 16 if its existence is new to you.)
Different professionals will have need to divide the data into different numbers of groups and in different proportions. The goal is usually to use part of the data to develop a model and part of the data to test the prediction quality of the model. The basic operations in Minitab will be similar whether you're dividing the data into two samples for training and validation or into 3 samples for fitting, validation, and testing. The steps will also be pretty similar whether you want equally sized groups, or to use only 10% of your data for validation.
I’m going to divide the data set into two groups where the size of the training set is twice the size of the validation set.
First, I’m going to set up a column to randomly assign the 180 observations in the data set to the two different samples. The size of this dataset does not require so many steps, but the steps can save you some effort if you have such a large data set that keeping track of the numbers requires thought.
- Choose Calc > Make Patterned Data > Simple Set of Numbers.
- In Store patterned data in, enter c4.
- In From first value, enter 1.
- In To last value, enter 1.
- In Number of times to list each value, enter 120. Click OK.
- Press CTRL + E to reopen Simple Set of Numbers.
- Change Store patterned data in to c5.
- Change From first value to 2.
- Change To last value to 2.
- Change Number of times to list each value to 60. Click OK.
- Choose Data > Stack > Columns.
- In Stack the following columns, enter c4 c5.
- In Store stacked data in, select Column of current worksheet.
- In Column of current worksheet, enter c6. Click OK.
Now we’ll randomize the groups to reduce the bias. After all, in this dataset, the data are in order by track. The most common application would probably not be to use data from tracks 1 and 2 to fit the model and from track 3 to validate the model.
- Choose Calc > Random Data > Sample From Columns.
- In Number of rows to sample, enter 180.
- In From columns, enter c6.
- In Store Samples In, enter c6. Click OK.
Now, you have a column that identifies which observations belong in each data set.
- Choose Data > Unstack Columns.
- In Unstack the data in, enter ‘Test Score’ Motivation.
- In Using subscripts in, enter c6.
- In Store unstacked data, select After last column in use. Click OK.
Now, you have two data sets, one with 120 observations and one with 60 observations. You can fit the model on the larger data set, then use the second data set to validate the model.
Statistics like predicted R2 have done a lot to help us get good models when we don’t have enough data to get good estimates and see how well the model does on new data. But in cases where you have access to more data than you need to get good parameter estimates, it’s good practice to use some of the data for validation. With a few moves in Minitab, you’re all set to go. Next time I post, I'll show some ways that you can use a validation data set to check the quality of a model in Minitab.