My wife and I are expecting a baby girl soon—very soon, in fact, as in "Will this blog post be published before the baby is born?" soon. The due date given is May 19th, but we stat geeks know that a point estimate just isn't good enough...we want probability intervals that reflect the uncertainty in the data.
I found a chart that gives the number of babies born to "spontaneous labor" by each week of pregnancy, but I'm interested in more precision than just the week. I converted the data to days instead of weeks (for example, Week 40 starts on day 280 and runs through day 286), and here it is:
| Week | First Day | Last Day |
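The week-to-day conversion is simple enough to sketch in a few lines of Python. The function name `week_to_day_range` is mine, not anything from Minitab, and it assumes every week follows the same pattern as the Week 40 example (first day = week × 7, lasting seven days).

```python
def week_to_day_range(week):
    """Return (first_day, last_day) for a pregnancy week.

    Assumes the convention from the Week 40 example:
    Week 40 starts on day 280 and runs through day 286.
    """
    first_day = week * 7
    last_day = first_day + 6
    return first_day, last_day


# Print the day ranges for a few weeks around the due date.
for week in (38, 39, 40, 41):
    first_day, last_day = week_to_day_range(week)
    print(f"Week {week}: day {first_day} through day {last_day}")
```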
One of the more overlooked areas of statistics is Reliability, a field that was originally intended to estimate when parts would fail but has much broader applications than that. If you ever have ranges of values in which your data fall, rather than exact values, Reliability is the menu for you!
Data such as this, where we have a count of occurrences within a range of values (common when things are reported in whole days, weeks, etc.) but know the real distribution is on a continuous scale, is known as Arbitrarily Censored data. In this case I'd like to know the odds of the baby being born by certain days, which requires me to first find the distribution of the data. To do this I go to Stat > Reliability/Survival > Distribution Analysis (Arbitrary Censoring) > Distribution ID Plot and complete the dialog like this:
Based on the excellent fit on the probability plot and high Correlation Coefficient, I'm going to use the 3-Parameter Weibull distribution for my analysis:
So from our original categorized data, we now have a continuous distribution to work with.
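For readers who like to see what is under the hood, here is a rough Python sketch of the two ideas at work: the cumulative distribution function of a 3-parameter Weibull, and how counts that are only known to fall within a range contribute to a likelihood. The function names and argument names (`shape`, `scale`, `threshold`) are my own, no fitted parameter values are included because I am not reproducing Minitab's estimates here, and this is not necessarily how Minitab does its estimation internally.

```python
import math


def weibull3_cdf(t, shape, scale, threshold):
    """Cumulative probability at time t for a 3-parameter Weibull."""
    if t <= threshold:
        # Nothing can happen before the threshold (location) parameter.
        return 0.0
    return 1.0 - math.exp(-(((t - threshold) / scale) ** shape))


def interval_log_likelihood(intervals, counts, shape, scale, threshold):
    """Log-likelihood of counts that are only known to fall within ranges.

    intervals: list of (start, end) pairs; each event counted for a pair
               happened somewhere in [start, end).
    counts:    number of events observed in each interval.
    """
    total = 0.0
    for (start, end), n in zip(intervals, counts):
        # Probability that a single event lands inside this interval.
        p = (weibull3_cdf(end, shape, scale, threshold)
             - weibull3_cdf(start, shape, scale, threshold))
        if p <= 0.0:
            # These parameters make the observed interval impossible.
            return float("-inf")
        total += n * math.log(p)
    return total
```

A maximum likelihood fit would search for the `shape`, `scale`, and `threshold` values that make `interval_log_likelihood` as large as possible for the observed intervals and counts.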
To learn some more about what this distribution means for when to expect a baby, I go to Stat > Reliability/Survival > Distribution Analysis (Arbitrary Censoring) > Parametric Distribution Analysis and complete the dialog like this:
I also click on "Graphs" and choose to show a "Cumulative failure plot".
First I look at the "Characteristics of Distribution" table from the Session Window:
So what can I learn from this table?
- The Mean, also known as MTTF or Mean Time to Failure (remember these terms were created for part failures and not childbirth!), tells me that on average babies are born almost exactly at 280 days, or 40 weeks.
- The Median is about 280.5 days, so about 50% of children are born by half a day past 40 weeks.
- The First Quartile and Third Quartile tell me by which day 25% and 75% of babies are born, respectively, so the "Middle 50%" of babies are born between days 274.5 and 286...since for us day 280 is the 19th, that means we have a 50% chance of the baby being born between Sunday the 13th and Friday the 25th.
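Turning day numbers into calendar dates is just an offset from the due date, as this small Python sketch shows. The helper `day_to_date` is a name I made up, and the year 2012 is an assumption on my part, picked only because May 13th falls on a Sunday and May 25th on a Friday in that year.

```python
import math
from datetime import date, timedelta

DUE_DAY = 280                 # day of pregnancy that lands on the due date
DUE_DATE = date(2012, 5, 19)  # the year is an assumption, see above


def day_to_date(day, due_date=DUE_DATE):
    """Map a (possibly fractional) day of pregnancy to a calendar date."""
    # Day 274.5 falls partway through day 274, so round down.
    return due_date + timedelta(days=math.floor(day) - DUE_DAY)


first_quartile = day_to_date(274.5)
third_quartile = day_to_date(286)
print(first_quartile.strftime("%A, %B %d"))  # Sunday, May 13
print(third_quartile.strftime("%A, %B %d"))  # Friday, May 25
```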
Minitab also gives a "Table of Percentiles" in the Session Window, but I prefer to use the Cumulative Failure Plot and again want to remind readers that the term "failure" is associated with a part or product failing and not a child being born...
This graph plots the day of pregnancy on the x-axis, and the percentage of babies born by that day on the y-axis. So for a given day, such as day 285, we can find the corresponding point on the line and read that a little over 70% of babies are born by that day (our first was five days late). We could also take the cumulative probabilities at days 280 and 281 (47.47% and 52.14%, respectively) and subtract them to find that the chance of the baby being born at some point on the due date is about 4.67%.
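The due-date calculation is nothing more than a subtraction of two cumulative percentages, which a short Python sketch makes explicit. The two values are the ones read from the plot above; the dictionary and function names are just illustrative.

```python
# Cumulative percentage of babies born by the start of each day,
# as read from the cumulative failure plot.
cumulative_pct = {280: 47.47, 281: 52.14}


def pct_born_on_day(day, cumulative):
    """Percentage of births that fall between the start of `day`
    and the start of the following day."""
    return cumulative[day + 1] - cumulative[day]


due_date_pct = pct_born_on_day(280, cumulative_pct)
print(f"{due_date_pct:.2f}%")  # 4.67%
```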
Now that we have used what was originally very categorized data to form a continuous distribution, we could answer many questions more precisely, such as:
- When should a relative arrive on a 7-day stay to have the greatest chance of being there for the birth? (May 17th)
- What are the odds of the baby being born on a weekend? (28.6%)
- What are the odds of the baby being born on her great-grandmother's birthday, May 14th? (3.4%)
- What should we name her? (Actually Parametric Distribution Analysis can't answer that but it's still a pretty great tool)
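Questions like the 7-day stay and the weekend odds all reduce to differences of cumulative probabilities, so they can be sketched in Python once you have a fitted distribution. In the sketch below, `cdf` stands for any function that returns the cumulative probability of birth by a given day; the function names are hypothetical, and the mapping of day 280 to a particular calendar date has to be supplied by the caller. It will not reproduce the numbers above unless it is given the same fitted distribution.

```python
from datetime import timedelta

DUE_DAY = 280  # day of pregnancy that lands on the due date


def prob_between(cdf, start_day, end_day):
    """Probability of birth in [start_day, end_day)."""
    return cdf(end_day) - cdf(start_day)


def best_arrival_day(cdf, first_day, last_day, stay_length=7):
    """Arrival day (as a day of pregnancy) that maximizes the chance
    the birth happens during a stay of `stay_length` days."""
    return max(
        range(first_day, last_day + 1),
        key=lambda day: prob_between(cdf, day, day + stay_length),
    )


def weekend_probability(cdf, first_day, last_day, due_date):
    """Total probability of birth on a Saturday or Sunday,
    summed over the days first_day..last_day."""
    total = 0.0
    for day in range(first_day, last_day + 1):
        calendar_date = due_date + timedelta(days=day - DUE_DAY)
        if calendar_date.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            total += prob_between(cdf, day, day + 1)
    return total
```

The single-day question (such as a particular birthday) is just `prob_between(cdf, day, day + 1)` for the day of pregnancy that falls on that date.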