Exploratory Data Analysis: The First (and Sometimes Last) Step

Minitab Guest Blogger | 7/9/2013

by Matthew Barsalou, guest blogger

A good way to begin researching a topic is with exploratory data analysis (EDA). In his 1977 book Exploratory Data Analysis, John Tukey suggested using EDA to collect and analyze data—not to confirm a hypothesis, but to form a hypothesis that could later be confirmed through other methods.


In some cases, EDA can even eliminate the need for a more in-depth hypothesis test. Here's a case in point. 

When I heard about the new Star Trek movie, I started to complain to anybody who would listen (which was not many people) that director J. J. Abrams had used such a young cast in the 2009 Star Trek film.

With a tentative hypothesis of “the new Star Trek films use very young actors and actresses compared to the older Star Trek series,” I decided to look into this further. The first thing I did was collect data to use later in boxplots, which are a part of Tukey’s EDA.

Collecting Data for the Exploratory Analysis

I needed to determine the age at which each main Star Trek actor first appeared; however, before I started looking for ages, I needed a method to determine whom I should consider a main character in each series. To select the actors to consider, I went to www.StarTrek.com and noted which characters were listed for each Star Trek series. This kept me from biasing my results by hand-picking older or younger crew members who may not have been as relevant as others.

The tables below list each actor, the character played, and the episode or movie in which the character first appeared, followed by the actor's year of birth as listed in the actor's entry at the Internet Movie Database. To determine each person's age, the year of birth was subtracted from the year of first appearance. These are rough calculations that could be off by a year, because the month of birth and the month of first appearance were not considered.
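The age calculation described above can be sketched in a few lines of Python. The function name `approximate_age` is a hypothetical helper chosen for illustration, and the example years come from the William Shatner and Walter Koenig rows of Table 1.

```python
def approximate_age(year_of_birth: int, year_of_first_appearance: int) -> int:
    """Return the rough age at first appearance.

    Only the years are used, so the result can be off by one year depending
    on whether the birthday fell before or after the first appearance.
    """
    return year_of_first_appearance - year_of_birth


# William Shatner: born 1931, first appeared in 1966.
print(approximate_age(1931, 1966))  # 35

# Walter Koenig: born 1936, first appeared in 1967.
print(approximate_age(1936, 1967))  # 31
```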

Table 1: Star Trek: The Original Series

| Name | Character | First appeared in | Year of birth | Year of first appearance | Age (+/- 1 year) |
| --- | --- | --- | --- | --- | --- |
| William Shatner | James T. Kirk | The Man Trap | 1931 | 1966 | 35 |
| Leonard Nimoy | Spock | The Man Trap | 1931 | 1966 | 35 |
| DeForest Kelley | Leonard “Bones” McCoy | The Man Trap | 1920 | 1966 | 46 |
| James Doohan | Montgomery “Scotty” Scott | The Man Trap | 1920 | 1966 | 46 |
| George Takei | Sulu | The Man Trap | 1937 | 1966 | 29 |
| Nichelle Nichols | Uhura | The Man Trap | 1932 | 1966 | 34 |
| Walter Koenig | Pavel Andreievich Chekov | Amok Time | 1936 | 1967 | 31 |

Table 2: Star Trek: The Next Generation

| Name | Character | First appeared in | Year of birth | Year of first appearance | Age |
| --- | --- | --- | --- | --- | --- |
| Patrick Stewart | Jean-Luc Picard | Encounter at Farpoint | 1940 | 1987 | 47 |
| Jonathan Frakes | Will Riker | Encounter at Farpoint | 1952 | 1987 | 35 |
| Brent Spiner | Data | Encounter at Farpoint | 1949 | 1987 | 38 |
| LeVar Burton | Geordi La Forge | Encounter at Farpoint | 1957 | 1987 | 30 |
| Michael Dorn | Worf | Encounter at Farpoint | 1952 | 1987 | 35 |
| Marina Sirtis | Deanna Troi | Encounter at Farpoint | 1955 | 1987 | 32 |
| Gates McFadden | Beverly Crusher | Encounter at Farpoint | 1949 | 1987 | 38 |
| Wil Wheaton | Wesley Crusher | Encounter at Farpoint | 1972 | 1987 | 15 |

Table 3: Star Trek: Deep Space Nine

| Name | Character | First appeared in | Year of birth | Year of first appearance | Age |
| --- | --- | --- | --- | --- | --- |
| Avery Brooks | Benjamin Sisko | Emissary | 1948 | 1993 | 45 |
| Nana Visitor | Kira Nerys | Emissary | 1957 | 1993 | 36 |
| Rene Auberjonois | Odo | Emissary | 1940 | 1993 | 53 |
| Alexander Siddig | Julian Bashir | Emissary | 1965 | 1993 | 28 |
| Colm Meaney | Miles O’Brien | Emissary | 1953 | 1993 | 40 |
| Terry Farrell | Jadzia Dax | Emissary | 1963 | 1993 | 30 |
| Armin Shimerman | Quark | Emissary | 1949 | 1993 | 44 |
| Cirroc Lofton | Jake Sisko | Emissary | 1978 | 1993 | 15 |
| Michael Dorn | Worf | The Way of the Warrior | 1952 | 1995 | 43 |
| Nicole de Boer | Ezri Dax | Image in the Sand | 1970 | 1998 | 28 |

Table 4: Star Trek: Voyager

| Name | Character | First appeared in | Year of birth | Year of first appearance | Age |
| --- | --- | --- | --- | --- | --- |
| Kate Mulgrew | Kathryn Janeway | Caretaker | 1955 | 1995 | 40 |
| Robert Beltran | Chakotay | Caretaker | 1953 | 1995 | 42 |
| Tim Russ | Tuvok | Caretaker | 1956 | 1995 | 39 |
| Robert Duncan McNeill | Tom Paris | Caretaker | 1964 | 1995 | 31 |
| Roxann Dawson | B’Elanna Torres | Caretaker | 1958 | 1995 | 37 |
| Garrett Wang | Harry Kim | Caretaker | 1968 | 1995 | 27 |
| Robert Picardo | The Doctor | Caretaker | 1953 | 1995 | 42 |
| Ethan Phillips | Neelix | Caretaker | 1955 | 1995 | 40 |
| Jennifer Lien | Kes | Caretaker | 1974 | 1995 | 21 |
| Jeri Ryan | Seven of Nine | Scorpion: Part 2 | 1968 | 1997 | 29 |

Table 5: Star Trek: Enterprise

| Name | Character | First appeared in | Year of birth | Year of first appearance | Age |
| --- | --- | --- | --- | --- | --- |
| Scott Bakula | Jonathan Archer | Broken Bow: Part 1 | 1954 | 2001 | 47 |
| Jolene Blalock | T’Pol | Broken Bow: Part 1 | 1975 | 2001 | 26 |
| Connor Trinneer | Charles “Trip” Tucker III | Broken Bow: Part 1 | 1969 | 2001 | 32 |
| Dominic Keating | Malcolm Reed | Broken Bow: Part 1 | 1962 | 2001 | 39 |
| John Billingsley | Phlox | Broken Bow: Part 1 | 1960 | 2001 | 41 |
| Linda Park | Hoshi Sato | Broken Bow: Part 1 | 1978 | 2001 | 23 |
| Anthony Montgomery | Travis Mayweather | Broken Bow: Part 1 | 1971 | 2001 | 30 |

Table 6: Star Trek (2009)

| Name | Character | First appeared in | Year of birth | Year of first appearance | Age |
| --- | --- | --- | --- | --- | --- |
| Chris Pine | James T. Kirk | Star Trek (2009) | 1980 | 2009 | 29 |
| Zachary Quinto | Spock | Star Trek (2009) | 1977 | 2009 | 32 |
| Karl Urban | Leonard “Bones” McCoy | Star Trek (2009) | 1972 | 2009 | 37 |
| Zoe Saldana | Nyota Uhura | Star Trek (2009) | 1978 | 2009 | 31 |
| Simon Pegg | Montgomery “Scotty” Scott | Star Trek (2009) | 1970 | 2009 | 39 |
| John Cho | Hikaru Sulu | Star Trek (2009) | 1972 | 2009 | 37 |
| Anton Yelchin | Pavel Andreievich Chekov | Star Trek (2009) | 1989 | 2009 | 30 |

EDA: Interpreting the Data with a Boxplot

Simply looking at the results in Tables 1 through 6 led me to suspect my hypothesis might be incorrect, but I still proceeded to create a Minitab boxplot with the data.

Figure 1: Boxplot of Star Trek ages

The boxplot depicts the ages of the actors and actresses in each Star Trek series as well as in the 2009 reboot. The rectangular boxes represent the middle 50% of each data set, and the vertical lines on top of the boxes represent the upper 25% of the data. The vertical lines on the bottom of the boxes represent the lower 25% of the data, except in the case of outliers. Outliers are unusually large or small observations and are represented by an asterisk. There is only one outlier in this boxplot: Wil Wheaton as Wesley Crusher in Star Trek: TNG.
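As a rough check on the outlier described above, the sketch below applies the common 1.5 × IQR rule to the Star Trek: TNG ages from Table 2. Both the rule and the quartile method (Python's `statistics.quantiles` with `method="exclusive"`) are assumptions made for illustration; the article does not say how the boxplot software flags outliers, and other quartile conventions would give slightly different fences. The helper name `find_outliers` is hypothetical.

```python
from statistics import quantiles

# Ages at first appearance for Star Trek: TNG, copied from Table 2.
tng_ages = [47, 35, 38, 30, 35, 32, 38, 15]


def find_outliers(ages):
    """Return the values lying more than 1.5 * IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(ages, n=4, method="exclusive")
    iqr = q3 - q1  # width of the box in the boxplot
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [age for age in ages if age < lower_fence or age > upper_fence]


print(find_outliers(tng_ages))  # [15]
```

Under these assumptions the only flagged value is 15, the age listed for Wil Wheaton.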

The symbol that looks like a plus sign inside a small circle represents the average of the data set. The average age of actors and actresses in the 2009 reboot is 33.57 years, just slightly lower than that of Star Trek: TNG, which had an average age of 33.75 years. The highest average age was for Star Trek: TOS, at 36.57 years.
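The averages quoted above can be reproduced from the Age columns exactly as tabulated in Tables 1, 2, and 6. This is only a minimal sketch; the dictionary name `ages_by_series` is chosen here for illustration.

```python
from statistics import mean

# Ages at first appearance, copied from the Age columns of Tables 1, 2, and 6.
ages_by_series = {
    "Star Trek: TOS": [35, 35, 46, 46, 29, 34, 31],
    "Star Trek: TNG": [47, 35, 38, 30, 35, 32, 38, 15],
    "Star Trek (2009)": [29, 32, 37, 31, 39, 37, 30],
}

# Print each average rounded to two decimal places.
for series, ages in ages_by_series.items():
    print(f"{series}: {mean(ages):.2f}")
```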

What truly stands out in the boxplot is the spread of the data. The spread of actors' ages in the reboot is narrower than in any of the other series. This makes sense, as it would not be plausible to use actors or actresses in their 50s or 60s to portray people who are still attending Starfleet Academy.
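One simple way to put a number on that spread is the range (maximum age minus minimum age) for each series. Choosing the range is an assumption made here for illustration, since the article judges spread visually from the boxplot; the names `ages_by_series` and `age_range` are hypothetical.

```python
# Ages at first appearance, taken from the Age columns of Tables 1 through 6.
ages_by_series = {
    "TOS": [35, 35, 46, 46, 29, 34, 31],
    "TNG": [47, 35, 38, 30, 35, 32, 38, 15],
    # Worf's DS9 entry is taken as 1995 - 1952 = 43.
    "DS9": [45, 36, 53, 28, 40, 30, 44, 15, 43, 28],
    "Voyager": [40, 42, 39, 31, 37, 27, 42, 40, 21, 29],
    "Enterprise": [47, 26, 32, 39, 41, 23, 30],
    "Star Trek (2009)": [29, 32, 37, 31, 39, 37, 30],
}


def age_range(ages):
    """Return the difference between the oldest and youngest age."""
    return max(ages) - min(ages)


ranges = {series: age_range(ages) for series, ages in ages_by_series.items()}
narrowest = min(ranges, key=ranges.get)

for series, spread in ranges.items():
    print(f"{series}: {spread} years")
print(f"Narrowest spread: {narrowest}")  # Narrowest spread: Star Trek (2009)
```

The range is sensitive to a single very young or very old cast member, so it is best read alongside the boxplot rather than in place of it.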

The hypothesis that originally started this was “the new Star Trek films use very young actors and actresses compared to the older Star Trek series,” but a look at the boxplot in Figure 1 shows that this may not be the case. In fact, there is no reason to proceed to confirmation testing because my hypothesis can be discarded at this point.

It looks like I owe director J. J. Abrams an apology.

Exploratory Data Analysis Raises New Questions

Even a hypothesis that was discarded after performing EDA can lead to the...um...next generation of hypotheses, and new insights. In this case, my new hypothesis could be, “The actors and actresses in Star Trek are not getting younger; I am getting older.” The new hypothesis could also be explored with EDA prior to moving on to more robust methods.  

However, in this case, I will not investigate my new hypothesis. I would rather just change the subject.

 

About the Guest Blogger: 

Matthew Barsalou is an engineering quality expert in BorgWarner Turbo Systems Engineering GmbH’s Global Engineering Excellence department. He has previously worked as a quality manager at an automotive component supplier and as a contract quality engineer at Ford in Germany and Belgium. He possesses a bachelor of science in industrial sciences, a master of liberal studies and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany.

 

Photo of Star Trek figures by Miguel Bernas, used under Creative Commons 2.0 license.