People say that I overthink everything. I've given this assertion considerable thought, and I don't believe that it is true. After all, how can any one person possibly overthink every possible thing in just one lifetime?
For example, suppose I live 85 years. That's 2,680,560,000 seconds (85 years x 365 days per year x 24 hours per day x 60 min per hour x 60 seconds per minute). I'm asleep about a third of the time, so that leaves just 1,787,040,000 seconds to ponder a nearly infinite variety of things. This morning, I paused for about 2 seconds to ruminate about a gray hair. ("Hey, that hair wasn't gray yesterday.") At a rate of 1 cogitation every 2 seconds, I would have time in life to mull over only 893,520,000 items.
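For anyone inclined to check that arithmetic (an overthinker's prerogative), here is a short Python sketch. The variable names are mine; the one-third sleep fraction and the 2-second rate are simply the figures quoted above.

```python
# Back-of-the-envelope check of the lifetime cogitation budget.
YEARS = 85
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignores leap years, as the text does

total_seconds = YEARS * SECONDS_PER_YEAR
awake_seconds = total_seconds * 2 // 3  # asleep about a third of the time
cogitations = awake_seconds // 2        # one cogitation every 2 seconds

print(f"{total_seconds:,} seconds in total")
print(f"{awake_seconds:,} seconds awake")
print(f"{cogitations:,} items to mull over")
```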
That's a plethora, for sure. But this number doesn't seem so big when you consider the large (though shrinking) number of not-yet-gray hairs on my head, or the vast number of ways that you can use Minitab Statistical Software to improve quality in your organization. So, to those who say that I overthink everything, after much deliberation I am confident that you are mistaken.
But I do overthink some things. Take my house...please. After much blood (literal), sweat (literal), and tears (not telling), I am finally ready to list my house for sale. (If you're in the market for a lovely 4-bedroom home, nestled in the heart of State College, Pennsylvania, I have just the house for you.)
The other day, a gaggle of realtors (I think that's the collective noun for realtors) inspected the property, and each submitted their personal estimate of what the house is worth. Being a numbers guy (and being anxious to know how much I could get for the ol' homestead), I was excited to see the results.
Imagine my disappointment when my realtor gave me only summary data!
Ladies and gentlemen, the data you are about to see are real; only the values have been transformed, to protect the innocent (a.k.a., the guy who bought way more house than he ever should have, or ever will again).
|Number of realtors|12|
|Written comments|"Great large rooms, bright, nice windows, loved the decks!" "Nice, presents well!" "Baseboards are blotchy/scratched and need to be painted."|
Desperate questions crowded my frantic mind as I struggled to process the surprisingly sparse information:
- What happened to the rest of the data!?!
- How many realtors thought the house is worth $460,000? Just one? I can't tell!
- Is $425,000 an outlier? Did the other realtors take all the chocolate chip cookies that I baked and leave poor 425 grumpy and snackless?
- Do I really need to paint the baseboards!?!
I was deeply distraught by this dearth of data, this omission of observations, this not-enough-ness of numbers. So, I asked my realtor if I could see the raw data. Her response shocked me: "My assistant threw out the individual responses. These valuations are just gut feelings. Don't overthink it."
What? "Threw out the individual responses"!?!?!
"Don't overthink it"?!?!?
"Paint the baseboards"?!?!?!?!
I had planned to use the realtor valuations to help me come up with a list price. I was concerned because the mean of different distributions can be the same, even if the shapes of the distributions are wildly different. For example, each sample in this histogram has a mean of 4. Obviously, the mean alone doesn't tell you anything about how the observations are distributed.
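To make the point concrete, here is a minimal Python sketch using three made-up samples (hypothetical numbers, not the realtor valuations and not the data behind the histogram). All three have a mean of 4, yet their spreads and shapes differ a great deal.

```python
from statistics import mean, pstdev

# Three invented samples, chosen only so that each one averages exactly 4.
samples = {
    "peaked": [3, 4, 4, 4, 4, 4, 4, 5],
    "flat": [1, 2, 3, 4, 4, 5, 6, 7],
    "bimodal": [1, 1, 1, 1, 7, 7, 7, 7],
}

for name, data in samples.items():
    # Same mean every time; the spread and the range tell the samples apart.
    print(f"{name:8s} mean={mean(data):.1f} spread={pstdev(data):.2f} "
          f"min={min(data)} max={max(data)}")
```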
Also, the mean itself probably wouldn't be a good list price. I'm not trying to appeal to the average buyer; I'm trying to appeal to those special few buyers who actually like the house and are willing to bid more for it. On the other hand, if I pick a number that is too high, I could price out even the highest bidders and get no offers at all.
What to do, what to do?
I had done a little reading about Monte Carlo simulations. And I recalled that data simulations were invaluable when we designed the new test and confidence interval for 2 variances in Minitab Statistical Software. (You can read more than most people ever want to know about those simulations here.) So I decided to try some simple simulations to see what I could learn about possible sample distributions that fit the summary statistics I was given.
First, a quick note about my methodology. For simplicity, I assigned each observation 1 of 15 discrete values: $425,000, $427,500, $430,000, ..., up to $460,000. Each hypothetical distribution includes 12 observations and has a mean of $450,000 (within rounding error). Each distribution includes at least one observation at $425,000 (the reported minimum) and at least one observation at $460,000 (the reported maximum). Values on the graph are in units of $1,000 (for example, 425 = $425,000). Reference lines are included on the histograms to show the following statistics:
Mn = the mean, which is always equal to $450,000
Md = the median
Mo = the mode
Q3 = the 3rd quartile (also called the 75th percentile)
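If you want to generate candidate samples yourself, the constraints above can be sketched in Python with simple rejection sampling. This is not how the scenarios in this post were built (those were shaped deliberately), and the names `GRID`, `simulate_sample`, and `summarize`, the 0.5 tolerance for "within rounding error", and the seed are all my own assumptions.

```python
import random
from statistics import mean, median, multimode, quantiles

# The 15 allowed valuations, in units of $1,000: 425, 427.5, ..., 460.
GRID = [425 + 2.5 * k for k in range(15)]

def simulate_sample(rng, n=12, target=450.0, tol=0.5, max_tries=1_000_000):
    """Draw n grid values that include the reported min and max and whose
    mean is within tol of target. Candidates that miss the mean are rejected."""
    for _ in range(max_tries):
        # Fix one observation at each reported extreme; draw the rest at random.
        sample = [GRID[0], GRID[-1]] + [rng.choice(GRID) for _ in range(n - 2)]
        if abs(mean(sample) - target) < tol:
            return sorted(sample)
    raise RuntimeError("no qualifying sample found")

def summarize(sample):
    """Return the four reference-line statistics for one sample."""
    return {
        "Mn": mean(sample),
        "Md": median(sample),
        "Mo": multimode(sample),          # a list, because modes can tie
        "Q3": quantiles(sample, n=4)[2],  # default 'exclusive' method
    }

rng = random.Random(2024)  # arbitrary seed, only for repeatability
for _ in range(3):
    s = simulate_sample(rng)
    print(s, summarize(s))
```

Only a small share of random draws land close enough to the target mean, which is itself a hint that the real valuations must lean toward the high end.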
Simulated Sample Data
My first guess when I saw the summary data was that the distribution of the realtor evaluations was probably left-skewed, so I simulated that first.
In this scenario, most of the valuations are clustered at the high end, with fewer valuations in the middle, and even fewer valuations at the low end. This is my favorite scenario, because the most frequent response (the mode) is $460,000, which is the highest value in the sample. If the real distribution looked like this, I'd be comfortable choosing $460,000 as my list price because I'd know that 3 of the 12 realtors think the house is worth that price.
Next I wondered what it would look like if there was a major disagreement among the realtors. So I worked up this bimodal scenario.
In order to maintain a mean of $450,000, I could not include very many observations on the low end of the spectrum. So most of the valuations in this scenario fall on the high end. But—and this is a big but—in this scenario, 3 different realtors actually gave the house the minimum valuation. I would definitely want to know why those realtors priced the house so differently from the others. It could be that they noticed something that the other realtors did not. In this scenario, I can't really come up with a reasonable list price until I find out why there are two distinct peaks.
Next, I wondered what the data might look like if the realtors were feeling blasé about the price. This flat-looking distribution is my statistical interpretation of realtor ennui.
Again, in order to maintain a mean of $450,000, I could not put many observations on the low side. In fact, I included only the one minimum observation on the low side, which makes that observation an outlier. If I didn't already know that this outlier was just Mr. Blotchy Baseboards having a bad day, I'd need to investigate. The other valuations are distributed fairly evenly between $445,000 and $460,000. In a case like this, it seems like the 3rd quartile (Q3) might be a reasonable choice. By definition, at least 25% of the observations in a distribution are greater than or equal to Q3. If 25% of potential buyers think the house is worth $455,000, then I'll have a decent chance of getting an offer quickly at that price.
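Here is a small Python sketch of that Q3 reasoning. The sample is hypothetical: I invented it to match the description of the flat scenario (one low outlier, the rest between 445 and 460, mean of 450), so it is not the data behind the histogram. It uses the default "exclusive" method of `statistics.quantiles`, which interpolates at position 0.75 × (n + 1).

```python
from statistics import mean, quantiles

# Invented flat-ish sample in $1,000s: one low outlier, the rest spread
# between 445 and 460, with a mean of exactly 450.
flat = [425, 445, 447.5, 450, 450, 450, 452.5, 452.5, 455, 455, 457.5, 460]

q3 = quantiles(flat, n=4)[2]  # the third cut point is Q3
share_at_or_above = sum(v >= q3 for v in flat) / len(flat)

print(f"mean = {mean(flat)}")
print(f"Q3 = {q3}")
print(f"share of valuations >= Q3: {share_at_or_above:.0%}")
```

For this made-up sample, a third of the valuations sit at or above Q3, comfortably more than the 25% floor.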
I also wondered what the data might look like if most of the realtors were in close agreement on the price.
In this scenario, most of the valuations are grouped closely together near the mean. The minimum valuation is again an outlier. The maximum valuation also appears to be an outlier. I definitely would not base my list price on the maximum valuation because it does not seem representative. The mode is a disappointing $450,000, but Q3 is a little higher at $452,500.
Just for the heck of it, I tried one final scenario—a right-skewed distribution.
Again, Mr. Blotchy Baseboards is an outlier. This is another disappointing scenario because the mode is $450,000. But at least Q3 is $454,400, which is a little higher than Q3 for the peaked scenario.
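You may notice that $454,400 is not one of the 15 allowed values. That is what you get when Q3 is interpolated between two neighboring observations. The sketch below assumes the common rule that puts Q3 at position 0.75 × (n + 1) in the sorted data. It also assumes that the 9th and 10th sorted valuations were $452,500 and $455,000, which is an inference for illustration, not a value read off the graph.

```python
n = 12
position = 0.75 * (n + 1)   # 9.75: three quarters of the way from the 9th to the 10th value
x9, x10 = 452.5, 455.0      # assumed 9th and 10th sorted valuations, in $1,000s

q3 = x9 + (position - 9) * (x10 - x9)
print(f"Q3 = {q3} (about {q3:.1f} after rounding)")
```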
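Here is a recap of the list prices I would choose for each simulated data set (in decreasing order):
|Scenario|List price|
|Left-skewed|$460,000|
|Flat|$455,000|
|Right-skewed|$454,400|
|Peaked|$452,500|
|Bimodal|None until the two peaks are explained|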
I'm still mad at my realtor for throwing away perfectly good data. But I am feeling better about choosing a list price for my house. I would like to think that the left-skewed scenario is closest to the truth. But even if it is not, the lowest list price that I came up with was $452,500, which isn't much different. The bimodal scenario is problematic, but since I don't know if the actual data were bimodal, I kind of have to ignore that one.
I will probably go with the second highest list price of $455,000. In the end, it's just gut feel, right? I don't want to overthink it.