My wife and I are expecting a baby girl soon—very soon, in fact, as in "Will this blog post be published before the baby is born?" soon. The due date given is May 19th, but we stat geeks know that a point estimate just isn't good enough...we want probability intervals that reflect the uncertainty in the data.
I found a chart that gives the number of babies born after "spontaneous labor" in each week of pregnancy, but I'm interested in more precision than just the week. I converted the data to days instead of weeks (for example, Week 40 starts on day 280 and runs through day 286), and here it is:
Week | First Day | Last Day | Births |
37 | 259 | 265 | 1166 |
38 | 266 | 272 | 3048 |
39 | 273 | 279 | 6616 |
40 | 280 | 286 | 8015 |
41 | 287 | 293 | 4852 |
42+ | 294 | * | 1317 |
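The week-to-day conversion is simple arithmetic, sketched below in Python. The helper name `week_to_days` is my own, and the rule (a week starts on day 7 × week and lasts seven days) is inferred from the Week 40 example rather than taken from the original chart.

```python
# Births by week of pregnancy, copied from the table above.
births_by_week = {37: 1166, 38: 3048, 39: 6616, 40: 8015, 41: 4852}

def week_to_days(week):
    """Return (first_day, last_day) for a week, assuming week N starts on day 7*N."""
    first_day = 7 * week
    return first_day, first_day + 6

rows = [(week, *week_to_days(week), births) for week, births in births_by_week.items()]

# The "42+" row is open-ended, so it gets no last day.
rows.append(("42+", 7 * 42, None, 1317))

for row in rows:
    print(row)
```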
One of the more overlooked areas of statistics is Reliability, a field that was originally intended to estimate when parts would fail but has much broader applications than that. If you ever have ranges of values in which your data fall, rather than exact values, Reliability is the menu for you!
Data like these, where we have a count of occurrences within a range of values (common when things are reported in whole days, weeks, etc.) but know the real distribution is on a continuous scale, are known as Arbitrarily Censored data. In this case I'd like to know the probability of the baby being born by certain days, which requires me to first find a distribution that fits the data. To do this I go to Stat > Reliability/Survival > Distribution Analysis (Arbitrary Censoring) > Distribution ID Plot and complete the dialog like this:
Based on the excellent fit on the probability plot and high Correlation Coefficient, I'm going to use the 3-Parameter Weibull distribution for my analysis:
So from our original categorized data, we now have a continuous distribution to work with.
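For readers without Minitab, here is a rough sketch of the same idea in plain Python. Each row of the table contributes births × log(F(end) − F(start)) to the log-likelihood of a 3-parameter Weibull, and a crude compass search looks for the parameters that maximize it. The interval boundaries (each week ends where the next begins), the starting values, the `eta` reparameterization, and the optimizer itself are all my assumptions, not a description of what Minitab does internally, so the numbers need not match its output exactly.

```python
import math

# (start day, end day, births); assumption: each week ends where the next begins,
# and the open-ended 42+ row has no upper bound.
INTERVALS = [
    (259, 266, 1166), (266, 273, 3048), (273, 280, 6616),
    (280, 287, 8015), (287, 294, 4852), (294, math.inf, 1317),
]

def weibull_cdf(day, shape, scale, threshold):
    """Fraction of births before `day` under a 3-parameter Weibull."""
    if day <= threshold:
        return 0.0
    if math.isinf(day):
        return 1.0
    log_z = shape * math.log((day - threshold) / scale)
    return -math.expm1(-math.exp(min(log_z, 700.0)))  # capped to avoid overflow

def unpack(params):
    """params = (log shape, log scale, eta), where eta = threshold + scale."""
    log_shape, log_scale, eta = params
    shape, scale = math.exp(log_shape), math.exp(log_scale)
    return shape, scale, eta - scale

def neg_log_likelihood(params):
    shape, scale, threshold = unpack(params)
    total = 0.0
    for start, end, births in INTERVALS:
        p = (weibull_cdf(end, shape, scale, threshold)
             - weibull_cdf(start, shape, scale, threshold))
        if p <= 0.0:
            return math.inf  # these parameters make an observed interval impossible
        total -= births * math.log(p)
    return total

def compass_search(f, x, steps, tol=1e-6):
    """Try +/- a step on each coordinate; halve the steps when nothing improves."""
    best = f(x)
    while max(steps) > tol:
        improved = False
        for i in range(len(x)):
            for direction in (1.0, -1.0):
                trial = list(x)
                trial[i] += direction * steps[i]
                value = f(trial)
                if value < best:
                    x, best, improved = trial, value, True
        if not improved:
            steps = [s / 2 for s in steps]
    return x, best

# Hypothetical starting guess: shape 4, scale 40 days, 63.2% of births by day 285.
start = [math.log(4.0), math.log(40.0), 285.0]
best, _ = compass_search(neg_log_likelihood, start, [0.2, 0.2, 2.0])
shape, scale, threshold = unpack(best)

print("shape, scale, threshold:", round(shape, 3), round(scale, 3), round(threshold, 3))
for day in (280, 281, 285):
    print(day, round(100 * weibull_cdf(day, shape, scale, threshold), 2), "% born by this day")
```

The search runs on `eta` (the day by which about 63.2% of babies are born) instead of the threshold directly, because threshold and scale tend to trade off against each other and that slows a coordinate-wise search. A proper analysis would use a real optimizer or a statistics package; this version is only meant to show what "fitting a distribution to interval counts" involves.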
To learn some more about what this distribution means for when to expect a baby, I go to Stat > Reliability/Survival > Distribution Analysis (Arbitrary Censoring) > Parametric Distribution Analysis and complete the dialog like this:
I also click on "Graphs" and choose to show a "Cumulative failure plot".
First I look at the "Characteristics of Distribution" table from the Session Window:
So what can I learn from this table?
Minitab also gives a "Table of Percentiles" in the Session Window, but I prefer to use the Cumulative Failure Plot and again want to remind readers that the term "failure" is associated with a part or product failing and not a child being born...
This graph plots the day of pregnancy on the x-axis and the percentage of babies born by that day on the y-axis. So for a given day, such as day 285, we can find the corresponding point on the line and read that a little over 70% of babies are born by that day (our first was five days late). We could also look at the cumulative probabilities for days 280 and 281 (47.47% and 52.14%, respectively) and take the difference to find that the probability of the baby being born at some point on the due date is about 4.67%.
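The due-date figure is just the difference between two cumulative percentages. A tiny sketch using only the two values quoted above (the variable names are mine):

```python
# Cumulative percentage of babies born by the start of each day, as read from the plot.
cumulative_pct = {280: 47.47, 281: 52.14}

# Chance of a birth during day 280 = share born by day 281 minus share born by day 280.
due_date_pct = cumulative_pct[281] - cumulative_pct[280]

print(round(due_date_pct, 2))
```

The same subtraction works for any span of days, for example a whole weekend, as long as you have the cumulative percentage at both ends.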
Now that we have turned what was originally coarsely binned data into a continuous distribution, we can answer many questions more precisely, such as: