3 Ways to Clean Up Data So You Can Promote Public Dialog
"By publishing the historical data, public dialogue that results from the data release can be more productive because you’ll be able to discuss changes over time." — Denice Ross, 5/17/2015
Last month, President Obama launched the Police Data Initiative. A key goal of the initiative was to make data about police departments more accessible to the public. Twenty-one communities decided to participate in the initial round, including Philadelphia.
Among Code for America's recommendations to help police departments get started was the suggestion to open historical records. On June 19, the data set "Philadelphia Police Advisory Commission Complaints" was made available via opendataphilly.org. The data set includes several variables about complaints made against police officers between 2009 and 2012, and gives us the chance to explore some steps you can take to clean up your data for analysis, using features in Minitab.
One thing to look for is redundant categories and labels. If you download the data and take a look at the actions that resulted from the complaints, you’ll find these values in these frequencies.
Tally for Discrete Variables: ACTION
It’s easy to see that the values “Accept” and “ACCEPT” should be the same. If you're using Minitab, it can change those values for you. (If you're not using Minitab, you can get a free 30-day trial.) Try this:
- Choose Calc > Calculator.
- In Store result in variable, enter ‘Action taken’.
- In Expression enter Proper(ACTION). Click OK.
Now there’s a column with these values and frequencies:
Tally for Discrete Variables: Action taken
Action taken Count
Instead of having to make 62 corrections in the data, you have to make only 2. Prefer a different format? You could substitute LOWER or UPPER for PROPER to get all lowercase or all uppercase letters.
The Philadelphia data set includes a variable for the date and time of the incident, but none of the times are recorded. Including the unused values for time yields data like these:
To get the usable "date" portion of the data, you can use the calculator. Try this:
- Choose Calc > Calculator.
- In Store result in variable, enter 'Text date'.
- In Expression, enter Left(DATE_, 10). Click OK.
The column that results is still formatted as text. To do an analysis where you can sort by date, you can quickly change the date format. Select a cell in the column, right-click, and select Format Column. When you pick Date from the list of types, Minitab recognizes the format for you.
If you dig a bit deeper into the data, you’ll notice an oddity that’s not readily apparent. The current web site for the police in Philadelphia lists 21 districts. In the data, 23 units are included. That's because the 23rd District has been incorporated into the 22nd District, and the 4th District incorporated into the 3rd. If we want to include complaints about officers from those districts in their new districts, you can recode the districts. Try this:
- Choose Data > Code > To Text.
- In Code values in the following columns, enter Unit.
- In Method, select Code Individual Values.
- For District 4, change the Coded value to District 3.
- For District 23, change the Coded value to District 22.
- Click OK.
Original Value Recoded Value of Rows
District 4 District 3 2
District 23 District 22 7
Source data column UNIT
Recoded data column Coded UNIT
Number of unchanged rows: 446
Minitab shows you a summary table so you can see how the values were recoded and you’re ready to go!
Whether you have data about police complaints or patient throughput times, you’re likely to need to do a little bit of work for your data to be ready to analyze. Fortunately, Minitab makes it easy to make common adjustments like getting the case of letters to match across entries. The faster your data is ready to analyze, the faster you can do the analysis to make better decisions.