Use a Line Plot to Show a Summary Statistic Over Time
If you’re already a strong user of Minitab Statistical Software, then you’re probably familiar with how to use bar charts to show means, medians, sums, and other statistics. Bar charts are excellent tools, but they are traditionally used when you want each of your categories to have its own section on the chart. When you want to plot statistics with groups that flow directly from one category to the next, look no further than Minitab’s line plots. I particularly like line plots when I want to use time as a category, because I prefer the connected-line display to separated bars.
I like to illustrate Minitab with data about pleasant subjects: poetry, candy, and maybe even the volume of ethanol in E85 fuel. Data about unpleasant subjects also exist, and we can learn from those data too. We’re fortunate to have both the Chicago Project on Security and Terrorism (CPOST) and the National Consortium for the Study of Terrorism and Responses to Terrorism (START) working hard to produce publicly accessible databases with information about terrorism.
START has been sharing analyses of its 2013 data recently. The new data prompted staff from the two institutions to engage in an interesting debate on the Washington Post’s website about whether the Global Terrorism Database (GTD) that START maintains “exaggerates a recent increase in terrorist activities.” For today, I’m just going to use the GTD to demonstrate a nice line plot in Minitab, which will give a tiny bit of insight into what that debate is about.
When you download the GTD data, you can open one file that has all of the data except for the year 1993. Incident-level data for 1993 were lost, so that year is not included, although you can get country-level totals for numbers of attacks and casualties from the GTD Codebook. Those who maintain the GTD advise that “users should note that differences in levels of attacks and casualties before and after January 1, 1998, before and after April 1, 2008, and before and after January 1, 2012 are at least partially explained by differences in data collection” (START, downloaded August 18th, 2014).
The GTD is great for detail. One of its columns records a 1 if an event was a suicide attack and a 0 if it was not, which makes it easy to sum that column so that you can see the number of suicide attacks per year. Absent from the data is a column that references the changes in methodology, but we can easily add this column in Minitab. Without a methodology column, it’s easy to end up with the recently criticized graph that started the debate between the staff at the two institutions. The graph shows all of the data in the GTD for the number of suicide attacks for each year since 1970. It looks a bit like this:
The message of this graph is that the number of suicide attacks has never been higher. The criticism about the absence of the different methodologies seems fair. So how would we capture the different methodologies in Minitab? With a calculator formula, of course. Try this, if you’re following along:
- Choose Calc > Calculator.
- In Store result in variable, enter Methodology.
- In Expression, enter:
if(iyear < 1998, 1, iyear < 2009, 2, iyear=2009 and imonth < 4, 2, iyear < 2012, 3, 4)
- Click OK.
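If you want to check the logic outside Minitab, the Calculator expression above can be sketched in Python. This is only an illustration: the function name `methodology` is my own, the cutoffs are copied as-is from the expression, and the sample year/month pairs are hypothetical rather than rows from the GTD.

```python
def methodology(iyear: int, imonth: int) -> int:
    """Return a methodology code 1-4, mirroring the Calculator expression above.

    The conditions are tested in order, and the first one that is true wins,
    which is how the cutoffs in the expression stack on top of each other.
    """
    if iyear < 1998:
        return 1
    if iyear < 2009:
        return 2
    if iyear == 2009 and imonth < 4:
        return 2
    if iyear < 2012:
        return 3
    return 4


# Hypothetical (year, month) pairs chosen to land on each side of a cutoff.
for year, month in [(1997, 12), (2009, 3), (2009, 4), (2013, 6)]:
    print(year, month, methodology(year, month))
```

Because the conditions are evaluated in order, each later test only needs to name its upper bound.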
Notice that because the GTD uses 3 separate columns to record the dates, I’ve used two conditions to identify the second methodology. With the new column, you can easily divide the data series trends according to the method for counting events. This is where the line plot comes in. The line plot is the easiest way in Minitab to plot a summary statistic with time as a category. You can try it this way:
- Choose Graph > Line Plot.
- Select With Symbols, One Y. Click OK.
- In Function, select Sum.
- In Graph variables, enter suicide.
- In Categorical variable for X-scale grouping, enter iyear.
- In Categorical variable for legend grouping, enter Methodology.
You’ll get a graph that looks a bit like this, though I already edited some labels.
One interesting feature of this line plot is that there are two data points for 2009. Because we’re calling attention to the different methodologies, it’s important to consider that the first quarter and the last 3 quarters of 2009 use different methodologies. In this display, we can see the mixture of methodologies. The fact that the two highest points are from the newest methodology also lends some credence to the question of whether the numbers from 2012 and 2013 should be directly compared to numbers from earlier years. The amount of the increase due to better data collection is not clear.
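The reason a single year can show two points is that the plot computes one sum for every combination of year and legend group. Here is a rough Python sketch of that grouping, not a description of how Minitab does it internally. The rows are invented toy values, not GTD records, and the function name `sum_by_year_and_group` is hypothetical.

```python
from collections import defaultdict

# Invented toy rows of (iyear, methodology, suicide); these are NOT GTD records.
rows = [
    (2008, 2, 1), (2008, 2, 0),
    (2009, 2, 1), (2009, 2, 1),   # early 2009, older methodology
    (2009, 3, 0), (2009, 3, 1),   # later 2009, newer methodology
    (2010, 3, 1),
]


def sum_by_year_and_group(rows):
    """Sum the 0/1 suicide indicator for each (year, methodology) pair."""
    totals = defaultdict(int)
    for iyear, method, suicide in rows:
        totals[(iyear, method)] += suicide
    return dict(totals)


totals = sum_by_year_and_group(rows)
for (iyear, method), total in sorted(totals.items()):
    print(iyear, method, total)
```

In this toy data, 2009 appears under both methodology 2 and methodology 3, so it contributes one point to each line.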
Interestingly, a line plot that shows the proportion of suicide attacks out of all terrorist attacks presents a different picture about the increase related to the different methodologies. That’s what you get if you make a line plot of the means instead of the sums.
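The mean works as a proportion here because the column holds only 0s and 1s: the sum of the column counts the suicide attacks, and dividing by the number of rows gives their share of all attacks. A small Python sketch with invented numbers (not GTD data; the labels and the helper name `summarize` are hypothetical) shows how the two statistics can move in opposite directions.

```python
from statistics import mean

# Invented 0/1 indicators for two hypothetical years; NOT GTD data.
suicide_by_year = {
    "year A": [1, 0, 0, 0],
    "year B": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
}


def summarize(flags):
    """Return (count of suicide attacks, proportion of all attacks)."""
    return sum(flags), mean(flags)


for label, flags in suicide_by_year.items():
    total, proportion = summarize(flags)
    print(label, total, proportion)
```

In this made-up example the count rises from year A to year B while the proportion falls, which is why it helps to look at both plots.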
Considering which statistics to compute and how to interpret them in conjunction with one another is an important task for people doing data analysis. In the final installment of the series on the Washington Post’s website, GTD staff members note that they do not “rely solely on global aggregate percent change statistics when assessing trends.” The flexibility of the line plot to show different statistics can make the work of considering the data from different perspectives much easier.
We do like to have fun at the Minitab Blog, but we know that there’s serious data in the world too. Whether your application is making tires that keep people safe on the road or helping people recover from wounds, our goal is to give you the best possible tools to make your process improvement efforts successful.