Friday, March 30, 2012

Resolve issues now! Do it in a "GIFI"

My colleagues and I came up with a new acronym today. We are prone to coining short, silly phrases to pass around like memes.

This time, after several spurned candidates, we settled on "GIFI":

Go, Investigate, Fix Immediately.

If you research Lean, or other similar efficiency and continuous flow methods, this kind of mindset is encouraged relentlessly.

Phrases abound like "Go to the gemba" (meaning go to the place where the work is being done), "Stop the Line", "Stop and Fix", and so on.

The goal of all of these ideas is to avoid building up a "backlog" or "inventory" of unfinished ideas and problems, or "debts" that need to be "revisited later".

Instead, they tell us to stop piling onto the heap and FOCUS on correcting issues immediately when they are found.

We made tremendous progress this week by reinforcing this mindset and its attendant behavior.

Our whole team oriented its focus toward Quality and Security, and toward "Defect Prevention".

This kind of focus is extremely motivating and addictive. It creates momentum that can be stopped only by lethargy and disinterest.

So, next time you see an issue staring at you, address it in a GIFI.

Go! Investigate, Fix Immediately!

Sunday, March 25, 2012

Control and Prevent “Defect Outbreaks” in Public Health Information Systems by Applying Epidemiological Methods


Have you ever thought much about the following statement?

"CDC 24/7: Saving Lives, Protecting People, Saving Money through Prevention"

Maybe, maybe not. This is the banner headline on the home page of the United States Centers for Disease Control and Prevention. It's an important statement that conveys the constant vigilance, the goals, and the primary mindset required in today's world to help keep people healthy!

Another thing you may never have thought about is the vast and varied set of information systems required for epidemiologists and other public health professionals to quickly and reliably perform the public health surveillance and other scientific work needed to achieve their goals of improving human health. It's easy to understand why such systems are necessary, though. Simply consider how quickly people travel today from country to country and how quickly infectious diseases can spread. Recall the 2003 SARS outbreak as an example.

In the world of public health, these systems operate all over the United States and the world, at local, state, territorial, and federal levels, and in collaboration across national boundaries. They empower the public health workforce to control and prevent disease in a variety of biological populations. Human health also depends upon animal health and the health of plants, trees, and ecosystems as a whole. The entire ecosystem is the shared environment, or context, within which we all live.

Controlling and Preventing Information System Disease

Information systems are like ecosystems. Instead of being composed of populations of biological objects, they're composed of populations of technological objects. Beyond the obvious differences in these types of populations lie a great many similarities in the surveillance and intervention techniques needed to keep these populations healthy and free of disease.

Can information systems really be diseased? I believe they can, and that all too many of them are.

Here's a standard dictionary definition of the word "disease":
"a disordered or incorrectly functioning organ, part, structure, or system of the body resulting from the effect of genetic or developmental errors, infection, poisons, nutritional deficiency or imbalance, toxicity, or unfavorable environmental factors; illness; sickness; ailment."
Information System Disease Definition

Here's my adapted definition for "Information System Disease":
"an incorrectly functioning or incomplete component, feature, sub-system, or unit of an information system resulting from the effect of requirements, design, or developmental errors and defects; performance, usability, or capability deficiency; or unfavorable environmental factors such as network communications failures or operating system incompatibilities."

Aside: With the increasing use of biotechnology and nanotechnology that interacts with our own biology, it will become increasingly difficult to draw any clear distinction between a designed, technologically augmented biological system and one that is strictly naturally evolved.

This definition encompasses many different types of "inputs", though not all possible inputs. It focuses at the outset on one critical perception:
an incorrectly functioning or incomplete component, feature, sub-system, or unit
What makes an information system differ from a biological system is that, because an information system is specifically designed to serve human cognitive needs, we have better control over defining what incorrect or incomplete means than we do in many cases for purely biological systems.



The Quarantine Model of Integration

Think now about quarantine, considering the SARS outbreak again. When SARS happened, public health officials acted quickly and implemented quarantine procedures to try to control and prevent the spread of the pathogen into their own populations. Consider this summation of quarantine measures from Taiwan:
During the 2003 Severe Acute Respiratory Syndrome (SARS) outbreak, traditional intervention measures such as quarantine and border control were found to be useful in containing the outbreak. We used laboratory verified SARS case data and the detailed quarantine data in Taiwan, where over 150,000 people were quarantined during the 2003 outbreak, to formulate a mathematical model which incorporates Level A quarantine (of potentially exposed contacts of suspected SARS patients) and Level B quarantine (of travelers arriving at borders from SARS affected areas) implemented in Taiwan during the outbreak. We obtain the average case fatality ratio and the daily quarantine rate for the Taiwan outbreak. Model simulations is utilized to show that Level A quarantine prevented approximately 461 additional SARS cases and 62 additional deaths, while the effect of Level B quarantine was comparatively minor, yielding only around 5% reduction of cases and deaths. The combined impact of the two levels of quarantine had reduced the case number and deaths by almost a half. The results demonstrate how modeling can be useful in qualitative evaluation of the impact of traditional intervention measures for newly emerging infectious diseases outbreak when there is inadequate information on the characteristics and clinical features of the new disease-measures which could become particularly important with the looming threat of global flu pandemic possibly caused by a novel mutating flu strain, including that of avian variety.

What this summary illustrates is that quarantine, when applied at a higher level in the chain of transmission (Level A, of potentially exposed contacts), led to a far greater reduction in the incidence of infection. The border-level measure (Level B) led to a more modest reduction of around 5% in cases and deaths.

Defining the Quarantine Model of Integration

Let's define a simplified "Quarantine Model of Integration" that can apply to more than just humans with possible infections crossing borders.

Population A: Some set of individual objects.
Population B: Another set of individual objects similar to Population A.
Population B-Harmful: Some subset of Population B with harmful characteristics that will disrupt and weaken the integrity of desired characteristics if introduced into Population A.
Population B-Benign: Some subset of Population B without harmful characteristics if integrated into Population A.
Mitigation Procedures: A set of actions that can be taken upon Population B to identify Population B-Harmful and Population B-Benign, thus allowing Population B-Benign to be integrated into Population A without harming it.

This model can be applied in several settings:
  • to human populations at borders
  • to the scientific body of knowledge in general
  • to systems development: code, unit tests, continuous integration, and so on
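To make the model concrete, here is a minimal sketch in Python of the definitions above. All names here (quarantine_integrate, is_harmful, and the commit-screening example) are hypothetical illustrations of the model, not part of any real tool:

```python
from typing import Callable, Iterable, List, Tuple


def quarantine_integrate(
    population_a: List[str],
    population_b: Iterable[str],
    is_harmful: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Screen Population B, integrating only its benign subset into Population A.

    Returns the updated Population A and the quarantined (B-Harmful) subset.
    """
    quarantined = []
    for individual in population_b:
        if is_harmful(individual):
            # Mitigation Procedure identified a member of Population B-Harmful.
            quarantined.append(individual)
        else:
            # Population B-Benign may safely join Population A.
            population_a.append(individual)
    return population_a, quarantined


# Example: incoming code commits screened by a (hypothetical) failing-test
# predicate before being integrated into the mainline.
mainline = ["c1", "c2"]
incoming = ["c3-ok", "c4-bad", "c5-ok"]
mainline, held = quarantine_integrate(mainline, incoming, lambda c: "bad" in c)
```

The key design point, mirroring the Taiwan summary, is that screening happens before integration, not after: harmful individuals never enter Population A at all.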

Scientific Knowledge Develops through Iteration

Information Systems Develop through Iteration



The vast majority of public health information systems are built to serve one or more functions of public health surveillance for epidemiologists and other public health professionals collaborating to control and prevent health problems in human and animal populations.

A simple definition of public health surveillance by the World Health Organization is:
Public health surveillance is the continuous, systematic collection, analysis and interpretation of health-related data needed for the planning, implementation, and evaluation of public health practice. Such surveillance can:
  • serve as an early warning system for impending public health emergencies;
  • document the impact of an intervention, or track progress towards specified goals; and
  • monitor and clarify the epidemiology of health problems, to allow priorities to be set and to inform public health policy and strategies.

And, a simple definition of epidemiology by the CDC is:

The study of the distribution and determinants of health-related states in specified populations, and the application of this study to control health problems.

Series Goals: Increasing Value, Protecting Investments, and Saving Money through Prevention

This is the first article in a series that will compare the practices of public health surveillance and epidemiology with modern practices of information systems development. It will show how systems developers can view their work through the lens of a prevention-focused mindset.

Here is more detail from that link:
  • Study—Epidemiology is the basic science of public health. It's a highly quantitative discipline based on principles of statistics and research methodologies.
  • Distribution—Epidemiologists study the distribution of frequencies and patterns of health events within groups in a population. To do this, they use descriptive epidemiology, which characterizes health events in terms of time, place, and person.
  • Determinants—Epidemiologists also attempt to search for causes or factors that are associated with increased risk or probability of disease. This type of epidemiology, where we move from questions of "who," "what," "where," and "when" and start trying to answer "how" and "why," is referred to as analytical epidemiology.
  • Health-related states—Although infectious diseases were clearly the focus of much of the early epidemiological work, this is no longer true. Epidemiology as it is practiced today is applied to the whole spectrum of health-related events, which includes chronic disease, environmental problems, behavioral problems, and injuries in addition to infectious disease.
  • Populations—One of the most important distinguishing characteristics of epidemiology is that it deals with groups of people rather than with individual patients.
  • Control—Finally, although epidemiology can be used simply as an analytical tool for studying diseases and their determinants, it serves a more active role. Epidemiological data steers public health decision making and aids in developing and evaluating interventions to control and prevent health problems. This is the primary function of applied, or field, epidemiology.
In this and following articles in this series, we will refer to these basic principles multiple times to build an analogous model within the field of information systems development. The analogy, like all analogies, will not be perfect. It will serve to help both epidemiologists and systems developers to better understand each other's professions and to begin to craft a better "shared language" when discussing both the subject matter of the systems they use and develop, and the actual development and management process of such systems.

Goal: Control and Prevent Defects and Create Healthier Systems

Through better understanding of the fundamental precepts and concepts of epidemiology and public health surveillance, system developers building public health information systems can "control and prevent defects", leading to "healthier systems". Similarly, epidemiologists and public health professionals, with a better understanding of the defect control and prevention practices available for systems development, will see how these systems can be viewed as "technological populations" of individual components that share technological contexts and environments. Thus, the mission is similar to the core mission of public health: to "control and prevent disease" from afflicting biological populations.
As this series of articles will show, there is already wide and pervasive overlap in the skills and mindsets of both epidemiologists and system developers. However, field-specific jargon and other domain-specific differences have often prevented better mutual understanding.
To begin, it is obvious that the fields of epidemiology and systems development are far too vast to draw point-by-point analogies. Instead we will start by comparing and contrasting a "Disease Outbreak" investigation to a "Defect Discovery" investigation. This comparison will closely follow CDC's Excellence in Teaching Epidemiology web site (EXCITE), which is a series of classroom-focused articles and exercises for teaching epidemiology to students.

Defects in Software: High Prevalence Due to Repeated Incidence

It is important to note that "defect discovery" is a process that, as this series will ultimately demonstrate, should not happen "in the field" (among a system's users) in quite the same way that epidemiologists must investigate outbreaks when and where they occur. The reason is similar to what many public health professionals would say: through better understanding and application of disease control and prevention practices and behaviors, both the incidence and prevalence of infectious disease can be greatly reduced.
In the case of software systems development, we can say: through better understanding and application of defect control and prevention practices and behaviors, both the incidence and prevalence of system defects can be greatly reduced, and in many cases completely eradicated.
Why would I say "completely eradicated" in the case of systems development, but not for infectious disease, when it's well-known that many horrific diseases have indeed been eradicated through inoculation and other measures? Unfortunately, new infectious diseases can and do emerge, most often by crossing over from animal populations into human populations. In fact, CDC publishes a scientific journal called Emerging Infectious Diseases devoted entirely to this vast sub-field of epidemiology.
As a sidebar, see the section about "Zero Defects", a philosophical approach to systems engineering pioneered by the Martin Marietta Corporation (now Lockheed Martin) in the early 1960s, which led to the successful human moon landings.
However, systems development has a distinct advantage. The technological populations, the source lines of code, the software objects and components, the hardware components, the networks, and so on, are all 100% designed objects. There is not a single part of a software system that was not developed by human intelligence. Because of this, achieving complete eradication of defects from software systems is not only possible, it is something any public health official would simply expect by default. After all, they have the arduous task of discovering objects invisible to the naked eye that can interact with our biology to harm or kill us, while systems developers have the task of designing well-structured, highly visible objects that must behave in well-specified, 100% predictable ways.
So, where is the disconnect? Why do so many software systems have such high prevalence of defects? The simplified answer returns us to epidemiological science: it has everything to do with the incidence rate of defects introduced during any single time-period. Because very few technological systems have "self-healing" capabilities like biological organisms, once a defect enters a technological population, it can only be removed by human discovery and concentrated effort. This defect discovery and removal process is far too costly, both monetarily and in terms of public-health preparedness.
So, this series will show how to reduce defect incidence rates as close to zero as possible during any one time period (or system release or software version), to come close to guaranteeing zero prevalence of defects.
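As a rough sketch, the two epidemiological measures can be carried over directly to a software release. The numbers below are hypothetical, and "components" is simply a stand-in for whatever unit of the technological population is being counted:

```python
def incidence_rate(new_defects: int, components_at_risk: int) -> float:
    """New defects introduced per component during one period (e.g. one release)."""
    return new_defects / components_at_risk


def prevalence(existing_defects: int, new_defects: int,
               removed_defects: int, components: int) -> float:
    """Defects present per component at the end of the period: what was carried
    over, plus what was introduced, minus what was discovered and removed."""
    return (existing_defects + new_defects - removed_defects) / components


# Hypothetical release: 200 components, 3 defects carried over,
# 4 new defects introduced, 5 discovered and removed.
print(incidence_rate(4, 200))    # 0.02
print(prevalence(3, 4, 5, 200))  # 0.01
```

The point of the sketch is the asymmetry it makes visible: prevalence only falls through costly discovery and removal, so driving the incidence term toward zero is the cheaper lever.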
But before we get there, we need to discuss exactly how defects can and do get reported by those "in the field", the users of systems. Doing so will make clear why the defect discovery process is so time-consuming and costly.
CDC: Steps of an Outbreak Investigation
The list below maps each disease outbreak investigation step to its analogous defect root-cause investigation step, followed by comments:

1. Prepare for field work → Establish defect communication channels. End users are the ones "in the field", and management must provide communication channels (automatic error trapping within deployed systems, plus telephone, email, and issue tracking systems) that enable users to provide their feedback.
2. Establish the presence of an outbreak → Establish confirmed presence of a defect. When users report an issue as a defect, the team must be able to verify whether this is true, or whether it is some other kind of malfunction or a user training issue (misunderstanding, lack of experience, etc.).
3. Verify the diagnosis → Verify the steps to reproduce. To investigate the defect, the precise sequence of actions taken by the user must be documented. These steps are needed to reproduce the same issue under controlled testing conditions.
4. Define and identify cases → Define and identify the type of issue. Not all reported issues fall neatly into the category of "defect". Other categories include:
  • Request for enhancement
  • Difficulty using feature
It is still important to classify each reported item for the purpose of improving the system based upon its users' experiences.
5. Describe and orient the data in terms of time, place, and person → Describe and orient the report in terms of time, browser, operating system, user role, and other system-specific characteristics. Document the conditions that can help characterize the event for detailed investigation by the team.
6. Develop hypotheses → Reproduce the defect within a controlled environment. Using the documented steps, the team attempts to reproduce the defect under controlled conditions.
7. Evaluate hypotheses → Determine the root cause. This step may involve a number of trial-and-error "hypotheses", depending upon how well the system itself has been constructed according to quality engineering practices.
8. Refine hypotheses and carry out additional studies → Develop a "fix" for the defect and perform system regression testing. This simplifies a much larger series of steps, but the most important part is that a "regression test" is performed, which verifies that the resolution of this specific defect does not introduce additional defects in other areas (or even in the same area!).
9. Implement control and prevention measures → Implement control and prevention measures. This step requires the team to reflect upon the root-cause determinant of the defect and analyze what it can do to prevent similar defects from occurring again.
10. Communicate findings → Deploy the fix and communicate the resolution. This step involves updating the system with the defect corrected (and the rest of the potentially affected system fully regression tested).
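One way to picture the defect-side steps is as the data an issue tracker would carry through them. This sketch is purely illustrative; the class, field, and category names are invented for this example:

```python
from dataclasses import dataclass, field
from enum import Enum


class IssueType(Enum):
    """Categories of reported issues (step 4: define and identify type of issue)."""
    DEFECT = "defect"
    ENHANCEMENT = "request for enhancement"
    USABILITY = "difficulty using feature"


@dataclass
class IssueReport:
    """A field report, oriented by time, browser, OS, and user role (step 5)."""
    description: str
    steps_to_reproduce: list = field(default_factory=list)  # step 3
    environment: dict = field(default_factory=dict)         # step 5
    issue_type: IssueType = IssueType.DEFECT
    confirmed: bool = False


def triage(report: IssueReport, reproduced: bool) -> IssueReport:
    """Steps 2 and 6: confirm a defect only if it reproduces under controlled
    conditions; other issue types are classified but never 'confirmed' as defects."""
    if report.issue_type is IssueType.DEFECT and reproduced:
        report.confirmed = True
    return report


r = IssueReport("Save button fails",
                steps_to_reproduce=["open form", "click Save"],
                environment={"browser": "IE8", "os": "Windows XP", "role": "analyst"})
r = triage(r, reproduced=True)
```

The design choice worth noting is that confirmation is gated on reproduction, exactly as an outbreak is established before hypotheses are evaluated.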
Let's now go into more detail about each of these points to create a more complete and nuanced analogy.
First, the CDC site prefaces the steps with:
In investigating an outbreak, speed is essential, but getting the right answer is essential, too. To satisfy both requirements, epidemiologists approach investigations systematically, using the following 10 steps:
    (see the ten steps listed above)

It concludes with this:

The steps are presented here in conceptual order. In practice, however, several may be done at the same time, or they may be done in a different order. For example, control measures should be implemented as soon as the source and mode of transmission are known, which may be early or late in any particular outbreak investigation.
In the software systems development industry, the situation is very similar. It is well known that scientists regard the idea of a strictly defined "scientific method" as naïve at best and dangerous at worst.
In the software industry, there are similarly naïve models of the software development process, or "Life Cycle", which, while applicable in certain situations, are completely inadequate for sophisticated, expensive, and mission-critical systems development.
Without going into great detail here, one of the most prevalent naïve models of development is the "Waterfall method", also known as "single-pass, all-at-once delivery". Here is a diagram from the 1970 paper by Winston W. Royce which began the propagation of this so-called method:

Sadly, and to the extreme financial diminution of countless companies and government budgets, this diagram was Royce's illustration of a well-known risk factor, to use the language of epidemiology. That is, Royce was saying that when a system was developed this way, it had a higher than 50% chance of resulting in failure, a "negative health outcome" so-to-speak.
What Royce went on to illustrate in his paper was this far more nuanced and realistic model of systems development:


Here is a diagram from Public Health Practices and Principles which shows the generally accepted pattern of public health science:
To more fully understand the history of sequential, single-pass Waterfall, and why certain government organizations, such as the Department of Defense, have strongly discouraged it due to its high cost and predictable rates of failure, please see Craig Larman and Victor Basili's paper "Iterative and Incremental Development: A Brief History".