Data. Science. Begin with Data. Follow the scientific method. Solving the mystery - Fact by fact.
Data Science is about making inferences from the data at hand. How do we make inferences from data in the first place? We will illustrate this process via a story from the history of the scientific method, a mystery story that was solved via the dedication and extreme effort of a single physician.
Only in the mid 1800’s was it discovered that the hygiene of physicians is directly related to the health of their patients. The physician Ignaz Semmelweis was distressed at the number of his patients who were dying in childbirth. He noticed that the number of women dying differed between two maternity wards in the same hospital, and wished to determine the cause. One ward had three times the death rate of the other. He quickly worked through several hypotheses. He had multiple lines of evidence at hand. First, women were dying at a higher rate in the hospital than those who were overcome by labour in transit and gave birth on the street. Secondly, he wondered if rough medical exams by medical students could be the cause. Semmelweis rejected this hypothesis, noting that the injuries due to birth are greater than those that might be caused by partially trained medical students. He also noted that in the ward with fewer deaths, the midwives were using much the same examination procedures as the medical students were using in the ward with the higher death rate. Semmelweis wondered if a priest ringing a bell for the last sacraments could have terrorized the women to death. He convinced the priest to change his route and the deaths did not appear to decrease regardless of changes made to the priest’s route.
Starting to get desperate to solve this mystery he noted that in one ward women were delivered on their backs and in another ward women were delivered on their sides. He examined switching delivery positions, with no effect. Finally he had a critical insight. A colleague of his received a puncture wound while performing an autopsy, and soon died of an illness that appeared identical to that of his female childbirth patients. Semmelweis wondered if the contact of ‘cadaverous’ matter and an open wound might be implicated. It then occurred to him (in a moment of horror we might imagine) that he and his medical students often attended to one of the maternity wards immediately after conducting dissections in the autopsy room, and did not thoroughly wash their hands. He immediately ordered all medical students to wash their hands in a chlorinated lime solution before examining women. Very soon after, the number of patients in the ward with the higher death rate fell to match the other ward. At that point, several bits of evidence he had fell into place. The ward with the lower death rate was attended by mid-wives, who do not participate in autopsies. Secondly, women who give birth on the street, are usually not examined on arrival, and hence avoided getting infected. Semmelweis was constantly making inferences from data. His data and experimental methods might not satisfy a modern day researcher, but there is no faulting his process of inference.
There are three aspects of his (or anyone’s) inference process that concern us: error, independence, and sensitivity. First Semmelweis developed hypotheses, and based on these, developed predictions. He noted the error between his predictions, and the actual results obtained. Based on this error, he refined hypotheses to make new predictions. For example, he later found that not only cadaverous material, but also putrid material from living patients could cause the fever. Secondly, he altered conditions. If he found that the results were independant of all possible conditions for a single factor he could alter (such as position of delivery), he ruled out that factor. Finally, he looked at the sensitivity of his results to additional data. In addition to the women who died at childbirth of fever, he considered their children. Only the children of mothers who contracted the disease during labour also fell sick. The philosopher, Deborah Mayo contends that error is the basis of experimental knowledge. Error in this sense, is the difference between a prediction from a hypothesis and the actual results obtained. This concept is central to modelling processes whether they are predictive (regression, classification, simulation) or pattern finding (clustering, probabilistic techniques, even generative AI).
Often our predictions are in terms of particular statistics, such as the mean or variance. In that case, we also need to take into account the range of variability in the statistics – the confidence limits. In the case where we have multiple alternate hypotheses, we have strong grounds to choose in favour of a particular hypotheses if the error for it is much less than the other alternatives and falls within the confidence limits for the data obtained. These concepts lead to the idea of ‘severe tests’, where an experiment is set up such that a hypothesis either clearly passes or fails. In essence there is a very high probability the test procedure would not yield a passing result if the hypothesis were false. The stability of our inferences, then depends on the ability to provide severe tests amongst alternate hypotheses.
This essay is modified from Chapter 3 of
Topological Stability and Dynamic Resilience in Complex Networks.
Thanks for reading Product Innovation AI. If you liked this post, please click ♡ button below to help product innovation ai grow
Subscribe for free to receive new posts and support this work.
Previous Letter: