
Wednesday, May 16, 2012

Draft: The Centers for Defect Control and Prevention: Public Health and Epidemiology Principles for the Development of Information Systems


Have you ever thought much about the following statement?

"CDC 24/7: Saving Lives, Protecting People, Saving Money through Prevention"

This is the banner headline on the home page of the United States Centers for Disease Control and Prevention. It's an important statement that conveys the constant vigilance, goals, and primary mindset required to help keep people healthy in today's world!

Another thing you may have never thought about is the vast and varied number of information systems required for epidemiologists and other public health professionals to quickly and reliably perform the public health surveillance and other scientific work required to achieve their goals of improving human health. It's easy to understand why such systems are necessary, though. Simply consider how quickly people travel today from country to country and how quickly infectious diseases can spread. Recall the 2003 SARS epidemic as an example.

In the world of public health, these systems operate all over the United States and world, at local, state, territorial, and federal levels and in collaboration across national boundaries. They empower the public health workforce to control and prevent disease in a variety of biological populations. Human health also depends upon animal health and the health of plants, trees, and ecosystems as a whole. The entire ecosystem is the shared environment, or context, within which we all live.


I do not work directly for CDC or as a federal employee, so these opinions are based only in my own experience working with contracting companies on technology teams providing services to CDC and the public health community at large. I am also not a public health expert, so these ideas are a work-in-progress as my own understanding of public health and epidemiology evolves.

Article Series Goal: Building The Centers for Defect Control and Prevention

Having helped build CDC mission-critical information systems that protect the public's health, I feel it is important to share ideas for improving those systems and the process undertaken to build them. This is the first of a multi-part series of articles that will create a vision for CDC's information systems acquisition and development process, a vision that applies the very principles of public health itself and epidemiology to guide those processes. As we'll see, there are already many parallel concepts between the disciplines. The goal is that CDC should also stand for Centers for Defect Control and Prevention when it comes to its information systems.

This first article will introduce several fundamental concepts of epidemiology and disease control and prevention while drawing parallels with the activities necessary for designing and developing successful, useful, and cost-effective information systems. 

Terms we'll introduce related to epidemiology are:
  • Epidemiology
  • Populations
  • Control (as in controlling health problems)
  • Disease
  • Determinant
  • Incidence
  • Prevalence
  • Incubation Period
  • Subclinical Infection
  • Quarantine and Isolation
For each of these concepts from the domain of epidemiology, which pertains to biological, chemical, ecological (ultimately physical) objects, we'll draw parallel models within the world of information systems which pertain, ultimately, to technological objects.

Definition: Epidemiology 

CDC defines epidemiology as:

The study of the distribution and determinants of health-related states in specified populations, and the application of this study to control health problems.

There is a lot more to say about that, but for this article, let's highlight these two parts:
  • Populations—One of the most important distinguishing characteristics of epidemiology is that it deals with groups of people rather than with individual patients.
  • Control—Although epidemiology can be used simply as an analytical tool for studying diseases and their determinants, it serves a more active role. Epidemiological data steers public health decision making and aids in developing and evaluating interventions to control and prevent health problems. This is the primary function of applied, or field, epidemiology.

Controlling and Preventing Information System Disease in Populations of Technological Objects 

Information systems are like ecosystems. But instead of being composed of populations of biological objects, they're composed of populations of technological objects. Beyond the obvious differences between these types of populations are a great many similarities in the surveillance and intervention techniques needed to keep these populations healthy and free of disease.

Wait, can information systems really be diseased? I believe they can, and that all too many of them are. 

Here's a standard dictionary definition of the word "disease":

"a disordered or incorrectly functioning organ, part, structure, or system of the body resulting from the effect of genetic or developmental errors, infection, poisons, nutritional deficiency or imbalance, toxicity, or unfavorable environmental factors; illness; sickness; ailment."

Definition: Information System Disease 

Here's my adapted definition for "Information System Disease": 
"an incorrectly functioning or incomplete component, feature, sub-system, or unit of an information system resulting from the effect of requirements, design, or developmental errors and defects, performance, usability, or capability deficiency, or unfavorable environmental factors such as network communications failures or operating system incompatibilities." 
Aside: With the increasing use of biotechnology and nanotechnology that interacts with our own biology, it will become increasingly difficult to draw any clear distinction between a designed, technologically-augmented biological system and one that is strictly naturally evolved.

The phrase "developmental errors and defects" has a much catchier name: Bugs! That actually sounds a bit like the germ theory of disease, doesn't it? A lot of people refer to catching "the flu bug" or being "sick with some bug".

The "first actual bug" was a moth found in the Harvard Mark II computer in 1947 and famously taped into the operators' logbook.

Trivia aside, our definition encompasses many different types of "inputs", though not all, and it focuses at the outset on one critical phrase: 

an incorrectly functioning or incomplete component, feature, sub-system, or unit

This brings us to one more important definition before we move on.

Definition: Determinant 

any factor that brings about change in a health condition or in other defined characteristics

In epidemiology, a determinant can take on a broad range of concrete forms. In summary, the World Health Organization groups them into these categories: 
  • the social and economic environment,
  • the physical environment, and
  • the person's individual characteristics and behaviors. 

The Determinants of Information System Health are Almost Always Human-Caused 

Information systems differ from biological systems because they are specifically designed by humans to serve human needs or goals. Because information systems are designed by us, we have better internal control over the resulting behavior, and thus the healthy status, of information systems. Compare this to the medical or epidemiology professions, where purely naturalistic, biological systems are constrained only by the laws of nature, many of which we only partially understand and over which we have only partial external control.

Since software development is entirely human-made, it consists of a closed set of concepts that are entirely understandable and controllable, provided we understand and follow a few simple guiding principles that we'll introduce in the next article. Because of this, software development can be done in a way that builds defect prevention in from the beginning. But for now, let's introduce a few more epidemiology terms and see how they apply to software development.

Definition: Incidence 

Incidence refers to the occurrence of new cases of disease or injury in a population over a specified period of time 

Definition: Prevalence

Prevalence, sometimes referred to as prevalence rate, is the proportion of persons in a population who have a particular disease  or attribute at a specified point in time or over a specified period of time. Prevalence differs from incidence in that prevalence includes all cases, both new and preexisting, in the population at the specified time, whereas incidence is limited to new cases only.

Applying Incidence and Prevalence to Information System Development and Defect Control and Prevention 

We saw above that defect prevention can be built into the software development process from the beginning. While this is true and will be explained in detail in another article, we need to consider the all too common scenario that we are all used to: buggy software.

Let us equate a software defect, bug, or otherwise "incorrectly functioning or incomplete component, feature, sub-system, or unit" with "disease or injury" from the definition of incidence.

Now, suppose an organization hires a contracting company to build a large information system. The contractor says the system will be ready to deploy to a production environment for use by the end of one year's time from project inception.

Next, suppose this company sets out to analyze and define all the requirements to build that system before building even a single small portion of the system. Suppose this process takes six months before any new code is written at all. The company delivers large requirements and design documents to their customer at the end of this process.

At this point, there may already be a high prevalence of undiagnosed defects inside the requirements and design documents for that system! Yet any ensuing "disease" has not had a "date of first occurrence" because none of the system's code has been written, tested, or used -- not even in prototype or proof-of-concept form!
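To make the parallel concrete, here is a small, hypothetical Python sketch of the two measures applied to a defect log. The record layout and dates are invented for illustration; the point is only that incidence counts new cases in a window, while prevalence counts all open cases at a moment in time.

```python
from datetime import date

# Hypothetical defect records: (date discovered, date resolved or None)
defects = [
    (date(2012, 1, 10), date(2012, 2, 1)),  # found, then fixed
    (date(2012, 1, 20), None),              # still open
    (date(2012, 3, 5), None),               # still open
]

def incidence(defects, start, end):
    """Count new defect cases first discovered during [start, end]."""
    return sum(1 for found, _ in defects if start <= found <= end)

def prevalence(defects, on):
    """Count all defects, new and preexisting, still open on a given date."""
    return sum(1 for found, fixed in defects
               if found <= on and (fixed is None or fixed > on))

print(incidence(defects, date(2012, 1, 1), date(2012, 1, 31)))  # 2 new in January
print(prevalence(defects, date(2012, 3, 10)))                   # 2 open on March 10
```

Under the six-month document-only scenario above, incidence stays at zero the whole time, while the true (undiagnosed) prevalence quietly climbs.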

Here are a few more epidemiological terms that draw immediate analogies:

Definition: Incubation Period 

A period of subclinical or unapparent pathologic changes following exposure, ending with the onset of symptoms of infectious disease.

Definition: Latency Period 

A period of subclinical or unapparent pathologic changes following exposure, ending with the onset of symptoms of chronic disease.

Defects Latent in Large Documents Have a Long Incubation Period Followed by Sudden Onset 

Now we can understand that when the contractor spent six months building a large requirements and design document, but built no working code for others to review and use, they raised the risk of "infection", which will likely result in a sudden, or acute, onset of a variety of problems. Ultimately, this will be measured as both a high incidence and a high prevalence during the period in which the defects are discovered.

Latent Defects are Like Subclinical Infections Until Onset 

Wikipedia defines a subclinical infection as follows:

"A subclinical infection is the asymptomatic (without apparent sign) carrying of an (infection) by an individual of an agent (microbe, intestinal parasite, or virus) that usually is a pathogen causing illness, at least in some individuals. Many pathogens spread by being silently carried in this way by some of their host population. Such infections occur both in humans and nonhuman animals."

Now we know such infections occur in humans, nonhuman animals, and large requirements and design documents not yet tested by tangible development. Keep in mind that "tangible development" does not mean 100% complete and ready for release, but it does mean, at minimum, prototyped and delivered in a visible, clickable, malleable form -- not just words on paper or promises in contractual agreements.

Applying Quarantine and Isolation Tactics Not Just at Borders 

Let's now consider quarantine and isolation practices in light of the SARS outbreak mentioned above. When SARS happened, public health officials acted quickly and implemented quarantine procedures to try to control and prevent the spread of the pathogen into their own populations. Consider this summation of quarantine measures from Taiwan:

During the 2003 Severe Acute Respiratory Syndrome (SARS) outbreak, traditional intervention measures such as quarantine and border control were found to be useful in containing the outbreak. We used laboratory verified SARS case data and the detailed quarantine data in Taiwan, where over 150,000 people were quarantined during the 2003 outbreak, to formulate a mathematical model which incorporates Level A quarantine (of potentially exposed contacts of suspected SARS patients) and Level B quarantine (of travelers arriving at borders from SARS affected areas) implemented in Taiwan during the outbreak. We obtain the average case fatality ratio and the daily quarantine rate for the Taiwan outbreak. Model simulations is utilized to show that Level A quarantine prevented approximately 461 additional SARS cases and 62 additional deaths, while the effect of Level B quarantine was comparatively minor, yielding only around 5% reduction of cases and deaths. The combined impact of the two levels of quarantine had reduced the case number and deaths by almost a half. The results demonstrate how modeling can be useful in qualitative evaluation of the impact of traditional intervention measures for newly emerging infectious diseases outbreak when there is inadequate information on the characteristics and clinical features of the new disease-measures which could become particularly important with the looming threat of global flu pandemic possibly caused by a novel mutating flu strain, including that of avian variety.

What this summary illustrates is that quarantine, when applied at a higher level in the chain of transmission, led to a far better reduction in the incidence of infection. The other measure led to a more modest 5% reduction in cases and deaths.

What would happen if we applied this kind of model to the development of information systems, and did it at many levels, in order to prevent large populations of infected, buggy, defect-ridden documents or code from becoming integrated with healthy, corrected, defect-free populations (of software objects)?

Defining the Quarantine Model of Integration

Let's define a simplified "Quarantine Model of Integration" that can apply to more than just humans with possible infections crossing borders, but can also apply to requirements documents, design documents, napkin sketches, whiteboard scrawling, information system releases or upgrades, specific system features, and certainly all the way down to discrete units of software code.

Population A: Some set of individual objects.
Population B: Another set of individual objects similar to Population A.
Population B-Harmful: Some potential subset of population B with harmful characteristics that would disrupt and weaken the integrity of desired characteristics if introduced into Population A.
Population B-Benign: Some potential subset of Population B without harmful characteristics if integrated into Population A.
Mitigating Filter Procedures: A set of actions that can be taken upon Population B to identify Population B-Harmful and Population B-Benign, thus allowing Population B-Benign to be integrated into Population A without harming it (while also isolating Population B-Harmful and preventing it from integrating).
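The definitions above can be sketched in a few lines of Python. This is a deliberately minimal illustration, not any real tool's API: `is_harmful` stands in for whatever mitigating filter procedure you choose (peer review, automated tests, static analysis), and the population names are invented.

```python
def quarantine_integrate(population_a, population_b, is_harmful):
    """Partition Population B and integrate only the benign subset into A."""
    b_harmful = [obj for obj in population_b if is_harmful(obj)]
    b_benign = [obj for obj in population_b if not is_harmful(obj)]
    population_a.extend(b_benign)     # integrate benign members
    return population_a, b_harmful    # harmful members stay isolated

# Example: incoming "code units", where a naming convention marks
# an object harmful (a stand-in for a failing test or review).
trunk = ["module_a", "module_b"]
incoming = ["feature_x", "buggy_y", "feature_z"]
merged, quarantined = quarantine_integrate(
    trunk, incoming, is_harmful=lambda name: name.startswith("buggy"))
print(merged)       # ['module_a', 'module_b', 'feature_x', 'feature_z']
print(quarantined)  # ['buggy_y']
```

The same shape applies whether the "objects" are document sections awaiting review, features awaiting pilot testing, or code units awaiting a test run.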

Improving Outcomes by Applying the Quarantine Integration Model Throughout the Development of an Information System 

We will delve into the specifics of how to apply a model like this to control the development process in the next article. However, the type of control and prevention practices that are necessary when building an information system are different from what you might have seen in many large projects, such as the fictional one described above. Many projects undertaken by large corporations or governments attempt, with good intention, to prevent exposure to risks and defects by trying to define as many "requirements" and "design details" in large documents long before any of the software system is constructed. This is most often a mistake. It's a mistake, as we'll see, that goes back at least 42 years to 1970, but perhaps even further.

You probably remember that I earlier wrote:

Because information systems are designed by us, we have better internal control over the resulting behavior, and thus the healthy status, of information systems.

The key phrase there is "resulting behavior". What is unstated is that the process of creating that resulting behavior can itself take a very meandering path, one that is iterative (completed in multiple passes) and incremental (completed as a series of smaller divisions of a larger whole).

It's often said that an empirical model of process control is needed to properly manage this kind of creative, evolutionary process. 

Definition: Empirical Process Control Model 

The empirical model of process control provides and exercises control through frequent inspection and adaptation for processes that are imperfectly defined and generate unpredictable and unrepeatable outputs.
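As a toy sketch of that definition, the loop below builds in short cycles, inspects each increment, and adapts the backlog before the next pass. Every name here is illustrative; the "inspection" and "adaptation" steps stand in for real reviews and re-planning.

```python
def empirical_control(backlog, build, inspect, adapt, max_cycles):
    """Build in short cycles; inspect each increment; adapt the plan."""
    delivered = []
    for _ in range(max_cycles):
        if not backlog:
            break
        increment = build(backlog.pop(0))   # one small pass
        delivered.append(increment)
        feedback = inspect(increment)       # frequent inspection
        backlog = adapt(backlog, feedback)  # adaptation before the next pass
    return delivered

# Toy run: inspecting the first increment finds a "defect", so a fix
# task is adapted into the front of the backlog before the next cycle.
result = empirical_control(
    backlog=["login", "report"],
    build=lambda item: f"built:{item}",
    inspect=lambda inc: "defect" if inc == "built:login" else "ok",
    adapt=lambda rest, fb: (["fix-login"] + rest) if fb == "defect" else rest,
    max_cycles=3)
print(result)  # ['built:login', 'built:fix-login', 'built:report']
```

Contrast this with a defined process, which assumes the plan can be fixed up front; the empirical model assumes the plan will be wrong and builds the correction step into every cycle.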

Notice that an empirical process control model is a lot like the scientific method. In the next article, we'll also discuss how scientific knowledge advances through iterative, incremental, and evolutionary spurts. For example: we all know that one woman's hypothesis and experiment would not overturn the germ-theory of disease if she claimed that illness was caused by another mechanism. 

Peer Review is the Hallmark of Sound Science (And Also of a Sound Information Systems Development Process)

In the case above, we know the scientist's ideas must face the rigor of the peer review system that is the hallmark of science. The peer review process is just one implementation of the "Quarantine Model of Integration" we just defined. And, peer review is, in fact, the self-correcting mechanism built into the heart of science which differentiates it from countless other "ways of knowing" that our human species has and continues to utilize.

That peer-review system is also, naturally, at the heart of what CDC does in its constant effort to do sound science. And, as we'll preview next time, several types of peer review, and even-wider-review, are at the heart of any successful process for developing a winning, useful, and cost-effective information system.

Tuesday, April 5, 2011

Delivery and Simplicity : Don't Leave Home Without These Agile Principles

Bootstrapping Agile from the Trenches

In February of 2006, I was offered the position of Lead Architect for the redevelopment of CDC's Epi-X system, CDC's flagship secure communications platform for emergent disease outbreak notification and bi-directional collaboration between multi-jurisdictional public health authorities. However, on that same day, I was also offered a position at a private .NET consulting company, Abel Solutions. Realizing that actual redevelopment of Epi-X would be months, if not years, away due to the then very disruptive agency-wide reorganization, I decided to leave so that I could gain more experience in a variety of private sector industries.

During the five years since I left Epi-X, I've worked as a senior software engineer, architect, lead application architect, and as an independent consultant. My first assignment with Abel Solutions was to re-architect and re-develop a very popular web-based electronic commerce & auction system to support more than 1 million registered users and the processing of more than 300 million dollars in annual sales. For a different company,  I re-engineered the security, object-relational, and querying architecture of a complicated human resources & payroll processing system used by thousands of companies. Most recently, I helped lead the design and development of both a modular user-interface architecture and the core service-oriented architecture for a new correspondence banking & ACH settlement platform to be used by hundreds of local and regional banks to conduct business more easily with the Federal Reserve and each other.

For the companies sponsoring the first two projects mentioned above, I introduced and led the successful adoption of Agile management and development practices. For the third, I was recruited specifically to consult both on their adoption of Agile and on the design of the new system's user-interface and service-oriented architecture.

I've also consulted with many other private entrepreneurial businesses about technology strategy, and in 2008 founded both the Atlanta Science Tavern and the ATL ALT.NET community groups.

Aligning the Agile Approach to the Business Domain

Let me be the first to state that adopting Agile in the "real world" is not easy. To be successful, you must internalize the values of Agile, especially the very first one which reads:

Our highest priority is to satisfy the customer

through early and continuous delivery

of valuable software.

Did you notice that this says nothing whatsoever about writing code? It specifies delivery of valuable software.

Later on in the principles document, it says:

Simplicity--the art of maximizing the amount

of work not done--is essential.

It says simplicity is essential, not optional, but essential. How many projects have you seen that feature unnecessary complexity? That is the exact opposite of this Agile principle. For more about this problem, see my post that reviews a Skype architect's presentation.

You can read the rest of the Agile principles here:

I highlight this because a lot of practitioners think that Agile is some kind of magic bullet that will solve all the problems that sequential "waterfall" style development has. This is absolutely not the case.  Agile has its own pitfalls that must be addressed as well, and one of them is plainly that development teams don't even understand or truly believe in these two core principles!

The Core of Agile: Communication and Collaboration

As the principles in the Agile Manifesto explain, collaboration and communication are the two most critical underlying themes of agile development. What if, by communicating with your clients successfully you could help them avoid spending millions of dollars custom-developing a solution to a problem that you could solve using low-cost or open-source software?

Would that not be the ultimate fulfillment of the first principle of Agile? I think it most certainly would. And it would certainly fulfill the simplicity principle I highlighted as well!

Unfortunately, many people, even managers, fail to think this way when they adopt Agile. This is not to say that they don't mean well. It's often just the case that they recognize Agile, and associated development practices like XP and TDD, as a better way of building software, but can lose sight of principle number one: delivery of valuable software.

Internalized Agility = Flexibility

True internalization of Agile values should cause architects, developers, testers, and all manner of managers to adopt an attitude of true collaboration with their stakeholders.

So, keep in mind that being agile doesn't always mean building software. First and foremost, it means delivering valuable software.

Monday, March 28, 2011

Strangulation: The Pattern of Choice for Risk Mitigating, ROI-Maximizing Agilists When Rewriting Legacy Systems

"The most important reason to consider a strangler application over a cut-over rewrite is reduced risk. A strangler can give value steadily and the frequent releases allow you to monitor its progress more carefully. Many people still don't consider a strangler since they think it will cost more - I'm not convinced about that. Since you can use shorter release cycles with a strangler you can avoid a lot of the unnecessary features that cut over rewrites often generate." -- Martin Fowler, Chief Scientist, ThoughtWorks, on The Strangler Pattern

When rewriting a system in a new technology, it's tempting to think that the task will be easier and quicker than the first time it was written. Because of this, business sponsors sometimes believe that a "waterfall" or "big bang all at once" approach will work out, but this is rarely the case for any project large enough and important enough to warrant rewriting. It's always important to practice iterative and incremental development to provide for feedback loops. But it's even more important to do this in the case of a large application rewrite. This article will explain why. There are a few bedrock development principles that project sponsors and team members should put into practice to ensure the success of large-scale migrations. Learned from experience, these are:
  1. Involve business sponsors and end-users directly (or a user-experience specialist) and the entire support and operations teams during the entire rewrite
  2. Involve permanent quality-assurance professionals from the beginning and during the entire rewrite
  3. Design, code, and test one complete feature rewritten from the existing system as quickly as possible
  4. Thereafter, design, code, test, and pilot user-valued, return-on-investment-generating (ROI) features in small increments
  5. Most importantly, continuously build team member skills, knowledge, and leadership abilities

Lessons Learned in Rewriting Large Legacy Systems 

In February of 2006 I joined a small .NET consulting company. Shortly thereafter I was assigned to a brand new project for one of their clients to analyze, design, and develop a new version of an existing electronic commerce platform. The system was a highly successful, niche-market-leading auction site with nearly 700,000 registered users at the time. In operation for more than seven years by then, the system was built on classic ASP, C++/COM, and SQL Server 2000. It consisted of about 330 ASP pages. Our client wanted to do two primary things. First, he wanted to add new, value-added features to the system to provide a much better user experience, one that would be similar to eBay. These features would be called "My Auctions". This new set of features would take the place of roughly 30 pages from the existing web site. Second, he wanted to migrate the other 300 pages, without introducing any functionality or usability improvements, to ASP.NET WebForms. Having already personally designed and developed the entire back-end of COM business objects, he wanted all of the new web site to reuse this investment by utilizing COM Interop.

My Recommendation: Perform a Phased, Vertical Migration One Piece at a Time

My first assignment was to analyze the existing ASP and C++ code and produce a migration strategy recommendation. This strategy document would lay out our company's professional opinion for migrating the system to the .NET platform and the C# language. My recommendation was for our client to perform a vertical migration, which is a migration that incorporates an entire functional slice of a subset of the system (My Auctions) and cuts across all architectural layers (top-to-bottom). In their book The Pragmatic Programmer, Dave Thomas and Andrew Hunt call this a "tracer bullet". This was, in fact, what Microsoft recommended in the best practices guidance documents I researched about performing large-scale system architecture migrations. I recommended that our client hire us to build a new core platform on ASP.NET with C# and get the new, value-added features to market as soon as possible on top of that core platform. Only after these value-added features were in production would we then move on to replacing the rest of the 300 pages with ASP.NET replacements.

My Reasoning: Place Customer Satisfaction, ROI, and Risk Mitigation First

My reasoning was that by creating a new core platform and building the brand new, usability-focused, value-added My Auctions features on top of that, our client would generate a return-on-investment (ROI) much sooner by generating more sales volume with the user-friendly features and would simultaneously mitigate significant risk by testing the viability of the COM Interop strategy. By virtue of the features being value-added, there would be no risk whatsoever for him to deploy them to a parallel web server and get his users to begin pilot testing the system and providing valuable feedback early on in the game when he could still make significant changes prior to committing to replacing the entire system with the new technology.

I've since learned from Dan North at QCon 2010 in S.F. that this is called The Strangler Pattern, per Martin Fowler, hence the title of this post!
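The heart of the pattern is a thin routing layer that sends each request to the new system only once that feature has been migrated, letting everything else fall through to the legacy system. Here is a minimal Python sketch of that idea; the feature names and handlers are invented for illustration, not taken from the actual project.

```python
def legacy_handler(feature):
    """Stand-in for the existing classic ASP system."""
    return f"legacy:{feature}"

def new_handler(feature):
    """Stand-in for the new ASP.NET/C# platform."""
    return f"new:{feature}"

# The set of migrated features grows one vertical slice at a time;
# the legacy system is "strangled" as entries are added here.
MIGRATED = {"my_auctions"}

def route(feature):
    handler = new_handler if feature in MIGRATED else legacy_handler
    return handler(feature)

print(route("my_auctions"))  # new:my_auctions
print(route("search"))       # legacy:search
```

Each slice that goes live shrinks the legacy system's responsibility without ever requiring a risky all-at-once cut-over.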

Client Decision: Let's Do It All At Once

Our client considered my recommendation very carefully, but wanted to take a different approach. Rather than deploy the new My Auctions features independently, side-by-side with the existing system, he wanted to have his in-house staff work on the other 300 pages while our company worked on the value-added features. With more than 330 pages to complete, I estimated that the project would not take less than a year, but would more likely take two years or more to complete. Our client and my manager thought things could be done much faster if we had three or four people working on the system. This was certainly the case early on when I worked side-by-side with another developer in our company. Within four months, he and I had completed the new C# application foundation and the value-added features to the point that they were ready for beta-testing.

And that's when all the fun began!

Planning is Essential; Plans are Useless

As anyone who has worked in the software industry for a number of years knows, the best laid plans never go as you planned. Our client's lead C# developer left his company. Soon after that, my manager at my company was let go, but several months later he was hired by our client to take over the development management of the project. This made sense since he already had a strong background on the project since its inception. Shortly after this, our client's HTML, graphics, and CSS developer quit when asked to change his focus to become an ASP.NET developer. They hired a lead C# developer and he got to work on a large slice of the application while I continued to work on another large slice. Five months later, they hired a second C# developer and he began working on several other slices of the application.

Wanting to see the project through to success, I joined the client as a direct employee to continue being the lead architect for the project.

Naturally, There's a Big Trade Show In The Story

What would any development story be without a "Big Trade Show" lurking around the corner? As luck and fate would have it, in early 2008 there was a huge industry trade show, and it was critical that we would be able to demonstrate the new version of the system to the roughly 25,000 customers that would be passing through our booth. And, it would be very important that these customers be able to see their own real items, either ones they were selling or ones they were buying. The problem was, of course, that the system was not ready to replace the production system! Due to security requirements brought about by a changing legal environment, we had to repartition the back-end database for the new system from 2 SQL databases into 6 separate databases just before the trade show. It was deemed too risky to perform this radical "surgery" on the live, production system just two months before the trade show. The new system's schema was about 95% the same as the old system, but there were corrections to long-standing column name problems or foreign-key reference inconsistencies. This was a complicating factor, however, for migration.

We tossed around various ideas, such as:

  • Perform the "surgery" on the production database to upgrade it to the new system schema, then use views and synonyms to create a "pass-through" database that looked like the old schema, but mapped across to the new DBs and structures.
  • Do the reverse: create several "pass-through" new databases with views and synonyms that actually resolved to the single existing production database's objects.
We felt that the second option would let us mitigate the risk entirely. It also allowed us to "override" some of the production system's tables with configuration data specific to the new system. Using synonyms and views ensured that all writes and reads against the pass-through objects would actually resolve into the production database, enabling the beta version of the new system to live side-by-side with the legacy system.

The War Room

After some proof-of-concept prototyping, we realized this would be a winning strategy. Over the next couple of weeks, the four of us on the development team gathered daily in our "war room" and worked together to create all the necessary SQL scripts, shell databases, synonyms, views, and the rest of the magic glue. We ensured that we could re-run the scripts at will, and automated our quality-control and sanity checks to be certain that all mappings would have proper permissions and configurations. After enough practice runs, we felt confident it was ready to go. We created a single zip file containing five BAK files and a T-SQL script, handed it off to our lead database administrator, and he ran the scripts. Everything worked just as planned!


At the trade show, everything went off flawlessly! Customers attended our booth and we, the development team, aided them directly in logging into the system and showcasing the new features we had worked so hard to develop. It was a very gratifying feeling to see how our improvised plan came together so well. Most importantly, we had succeeded in mitigating all risks to the money-generating production system, while also achieving the benefit of showcasing the new system to customers with real data. This was very exciting to them because they felt that the new features would greatly help them run their own businesses atop our platform.

Phased Transition From Legacy to New

We had now successfully demonstrated and validated the new, value-added features directly with customers in person. This was a great success. Yet there was still much to do after the trade show. Features of lesser prominence, those in the set of roughly 300 other pages, still needed to be developed and tested. This ended up taking a very long time, but we ultimately cycled back to my original recommendation by adopting an incremental replacement strategy.

It worked like this:
  • We deployed the new system to a new web server, named v2.
  • The existing, v1 site, remained at www.
  • We provided a link from v1 to v2 in the header of the v1 site, including advertising the benefits of the new system, but also including disclaimers and calls for assistance in testing and validating the usability of the new system.
  • This garnered a lot of early-adopters who helped find bugs and inconsistencies, all for free to us!
  • We monitored the usage patterns of v2 versus v1, to help estimate the load capacity under real-world conditions.
    • Michael Nygard's book "Release It!" proved prophetic here. In his book he says that "feature complete" is not the same as "production ready."
    • We learned this because the COM code had to be completely replaced with pure C# code since it could not stand up under load using COM Interop.
      • This result bore out my original advice to get the new features into production as soon as possible to monitor under real world conditions.
  • We formally adopted Scrum and Agile practices by identifying business-driven priorities and working through them in sprints.
    • We did this by closely monitoring the real-world usage of both the existing v1 system and the v2 system and focusing our effort first on the highest traffic pages, such as Viewing, Browsing, and Searching. Of course, Bidding and Payment, while producing less volume, were also mission-critical.
    • This focus allowed us to prioritize properly. We did not place inordinate emphasis on automating the testing of all areas of the system.
      • For example: we did not write Selenium test suites for things like Help Pages or Support Pages. Why? They are seldom used! And, they generate no revenue.
        • Instead, we built comprehensive Selenium test suites for the Big Four: Viewing, Browsing & Searching, Bidding, and Payment.

A Pleasant Surprise!

With the site now operating both in legacy, classic mode at www and in "beta" mode at v2, the team began to actively monitor the new system's health and encourage more and more users to jump into using v2. And, because we had focused on developing the value-added My Auctions features at the very beginning of the project, those features sat ready and waiting to go into production! The newest member of our team, who joined about two years after those features were ready and "shelved", took it upon his own initiative, to our delight, to start building a mobile version of the core My Auctions features using ASP.NET MVC and the business objects that supported them. He was reluctant to show this prototype to the "higher ups", but the rest of the team encouraged him to do so. Within a few months, his mobile application was released into production, ahead of the global "switchover" to the new system described below. A job well done!

Switching Over Right on Time for a Cool Billion Dollars

Over the course of more than a year, the team monitored the usage of v1 and v2, and began to more aggressively push the late adopters and stragglers to the new system. Eventually a "switchover" was made, and the v2 system took the place of www. At that point there was a link back to v1, which now ran from a virtual machine. Several months later, the VM was retired, and the v1 system, and all of its legacy COM, was no more.

Just after the legacy system was retired for good, the company celebrated its 10th anniversary and 1 billion dollars in sales volume!


In retrospect, I spent nearly three years working on this project and learned a great deal! While I wish that the original plan of seeing the entire migration take place "all at once" could have been successful, I am also pleased that my original recommendation to take a phased, incremental, risk-mitigating, ROI-maximizing approach proved very sound. Ultimately, that very approach became necessary due to the "expected" unexpected bumps along the road!

Application to Domains Seeking Non-Financial Returns

I understand that not all projects involve financial reward goals. Before I began working on the project just described, I worked for four years at the US Centers for Disease Control and Prevention. While working there, we were not seeking to generate financial return-on-investment. However, we did seek returns in the form of utility and value to the users and stakeholders of our systems. To assess this properly, it was critical to either observe the real users working with the system or to sit down with them and experience their pain, frustration, and sometimes: delight! Our team did this regularly by conducting evaluations, performing proficiency testing, and through coordinated multi-agency and stakeholder exercises under simulated public health emergency "war games".

Tying This All Back to Agile

While I've written more extensively on Agile in other posts on this blog, this post has not been about the "mechanics" of agile so much as it has been about the why. But, I want to look at just the first principle of the Agile Manifesto and make a brief comment:

Our highest priority is to satisfy the customer
through early and continuous delivery
of valuable software.

One might look at this principle and ask how it can be honored when, as in the situation I originally faced, the client asks not for a continuous delivery model but for a "big bang" model. That's a very good question, and it has no quick-fix answer. My best advice is that you need to learn the language and goals of both your client and your client's ultimate customers. If your client values financial returns, then ask him or her exactly what it is that generates financial returns.

In my client's case, returns come in when more people purchase items through his system. The next question should be: what is the shortest path we can take to increase that rate? If the client answers that the shortest path is to rewrite the entire system and deploy a big-bang upgrade, then you're going to have to keep breaking that down into smaller and smaller value-added chunks. You might have to suggest straw men in terms of business value if your client will not prioritize by business value naturally. Ultimately, as in this story, reality may bear down on the situation, and if you have done your best to incrementally develop the system in terms of business value, then you can deliver value, upholding your end of the deal to the utmost of your ability within your realm of control. Sometimes that's the best you can do, until you run your own show!

Saturday, September 12, 2009

Recommended Resources: Becoming an Expert, Extending Agile, and Individual Improvement

There are three recent presentations posted from Agile 2009 that I highly recommend you listen to and learn from. Here are the links with the descriptions:

Mary Poppendieck: Deliberate Practices in Software Development


In the nature vs. nurture debate, researchers have declared nurture the winner. People who excel are the ones who work the hardest; it takes ten+ years of deliberate practice to become an expert. Deliberate practice is not about putting in hours, it’s about working to improve performance. It does not mean doing what you are good at; it means challenging yourself under the guidance of a teacher.
Mary Poppendieck started her career as a process control programmer, moved on to manage the IT department of a manufacturing plant, and then ended up in product development, where she was both a product champion and a department manager. Mary tried to retire in 1998, but instead found herself managing a government software project where she first encountered the word "waterfall".

Josh Comments:

I’ve listened to Mary’s talks at Google and read most of her Implementing Lean Software Development book recently. This talk is excellent. She discusses not just software, but also music performance and artistic talent development, citing studies that have shown it typically takes about 10,000 focused hours for musicians to truly reach the level of expert, and that many of them who begin early in life reach this number of hours by age 20! Regarding software development, 10 years of working professionally is about 20,000 working hours, but of course not all of those hours are spent crafting software. I’ve been working professionally about 10 years, and I think I’m near that level of expertise in a broadened sense, but have much, much, much more to learn in the depth direction.


Alistair Cockburn: I Come to Bury Agile, Not to Praise It


Agile came from small, colocated projects in the 1990s. It has spread to large, globally distributed commercial projects, affecting the IEEE, the PMI, the SEI and the Department of Defense. Agile now sits in a larger landscape and should be viewed accordingly. This talk shows that landscape, clarifying how classical agile fits in and what constitutes effective development outside that narrow area.

Dr. Alistair Cockburn is a world-renowned expert at what is called agile development, the early and regular delivery of business value through improved communications, fast feedback and staged delivery. Dr. Cockburn co-founded the agile development movement, co-authoring the Manifesto for Agile Software Development and the project leadership Declaration of Interdependence.

Josh Comments:

This is a great talk which is not really about burying agile, but about recognizing that the basic practices of agile now need to give way to ideas like Software Craftsmanship. He covers much more ground than this, but I’ll just highlight the Software Craftsmanship principles:

“As aspiring Software Craftsmen we are raising the bar of professional software development by practicing it and helping others learn the craft. Through this work we have come to value:

Not only working software,

              but also well-crafted software

Not only responding to change,

              but also steadily adding value

Not only individuals and interactions,

              but also a community of professionals

Not only customer collaboration,

              but also productive partnerships

That is, in pursuit of the items on the left we have found the items on the right to be indispensable.”

Ashley Johnson and Amr Elssamadisy: Scaling Up by Scaling Down: A (re)Focus on Individual Skills


In this presentation, the causality between performance in the small (individuals and teams) and performance in the large is highlighted and explained. Discover what you can do as an individual regardless of your position in the hierarchy to enable higher performance software development.

Ashley Johnson splits his time between understanding the challenges that companies face, and consulting with senior IT management to facilitate organizational optimization. Author of the leading book on Agile Adoption, Amr Elssamadisy spends his time helping others learn better ways to develop software and connecting the software development process with its rightful driver - business value.

Josh Comments:

What I like most about this presentation is how Ashley Johnson incorporates audience participation and experimentation into the course of the presentation. This is the essence of teaching and learning. During my Scrum training with Jeff Sutherland, I was impressed by how Jeff used Scrum to run the training itself, creating his own task wall with sticky notes, which also served as a burndown mechanism. He broke us up into small groups, and we worked on problems and exercises that way.

Just to highlight: at one point Ashley asks the audience to break into pairs and come up with a list of "What is required for a high-performing team?" This is what the participants came up with on their own:

  • Shared vision that matters
  • Trust between the team members
  • Communication
  • Passion
  • Empowerment
  • Having fun
  • Challenging each other while holding each other accountable
  • Leadership

Saturday, September 5, 2009

D = V * T: The formula in software DeVelopmenT to get features DONE

There’s a hidden formula in software DeVelopmenT that tells how fast a team can get features DONE and Ready-to-Ship.


The formula is: D = V * T

It reads as: DONE Features = Velocity multiplied by Time


The importance of a software development team’s velocity

The term “velocity” as it applies to software development is simple to explain and to illustrate. Here’s my definition:

Velocity: A team’s velocity is the number of features it can get completely DONE and Ready-to-Ship during a short, fixed time period (2 to 4 weeks).

Velocity is extremely important for business owners and other project stakeholders. Without knowing the velocity of their team, they have no way to reliably plan release dates and coordinate marketing and sales teams. (2) It’s no exaggeration to say that the most important thing a professional software team can do to increase its value to an organization is to become skilled in the arts of estimation and planning. This post introduces the concepts behind velocity measurement, and provides links for more detailed reading.


Are we there yet? Speed racing down the software delivery highway

Building successful software and delivering it on time is an art of numbers. It all boils down to math, like physics or accounting.

Who can forget the familiar high-school formula D = V * T? (also written as D = R * T)

This, of course, is the famous equation for calculating how far you can travel from a given starting point to another when you already know your velocity (or rate) and how long you will be travelling.


Distance = Velocity multiplied by Time

For example: if we know we are traveling at 50 miles per hour and plan to travel for 3 hours, then we know we will travel 150 miles.


What happens, though, if we do not know our velocity, but instead know how far we have traveled and how much time it took to get there? Can we derive our velocity from these other two measurements? Of course we can, with simple math. In this case, we have D and T, and can derive V by rearranging the formula to V = D / T.


Velocity = Distance divided by Time

For example: if we have traveled 120 miles in 3 hours from point A to point B, then we know our velocity is 40 miles per hour, or 40 mph.



Figure 1: Two US high school math students calculating how far they can travel before returning to math class in 30 minutes or before being caught by authorities for driving on the wrong side of the road.


Measuring velocity in software development to decrease time-to-market and realize faster ROI

I hear what you’re screaming: enough with the PSAT math prep already! How does this apply to releasing software on time? It’s so simple you’ll kick yourself, or your team, for not doing this already.

Agile teams use a formula that works the same way. It’s calculated differently, because most software teams aren’t very mobile while coding, though it would be relaxing to code on a boat.

Because a team cannot reliably know in advance how quickly it can complete a set of features, it must use the second form of the equation to derive its velocity based upon actual observation of progress first.

Thus, the formula for calculating an agile team’s initial velocity still reads as V = D / T, except the D stands for “DONE Features” instead of distance. T, or time, usually stands for 2 to 4 weeks instead of 1 hour. For this article, we’ll assume it means 3 weeks.


Initial Velocity = DONE Features divided by Time

For example: if we get 6 features DONE in 3 weeks, then we know our velocity is 6 features per 3 weeks. Simplified, we’ll say 2 features per week.
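The same derivation works in code. Here is a toy sketch in Python (the function name and units are mine, not from any real project tooling):

```python
def velocity(done_features, weeks):
    """V = D / T: velocity derived from observed progress."""
    return done_features / weeks

# 6 features DONE in a 3-week iteration:
print(velocity(6, 3))  # 2.0 features per week
```

The point is that velocity is measured after the fact from real progress, never assumed in advance.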

Here is a simple chart depicting this velocity:


Figure 2: Velocity measurement illustration of six features becoming done during a three-week period

It’s tempting to look at this chart and say the velocity is 2 features per week, and that we can now start using the formula DONE Features = Velocity multiplied by Time to plan ahead. We will use this simplification for the purposes of this article, but keep in mind that this may or may not be true, so be careful! Here are two reasons why:

  1. New Requirements Discovered: During the course of any three-week period, teams will discover new requirements frequently. The new requirements could be bugs, change requests from the business team, or important changes required to match the competition. This is a subject for an entire volume on change management!
  2. Definition of DONE: It’s extremely important that a team agrees upon what qualifies as a DONE feature. Each team must define what it means by the word DONE. I leave that as an exercise for a future article, but you can find some recommended reading and listening below for reference. (3, 4)

For the rest of this post, we’ll pretend that no new requirements are discovered and we’ll define a feature as DONE if it has successfully passed through each of the following development phases:

  1. Requirements Definition
  2. Analysis, Design, and sufficient Documentation
  3. Coding
  4. Unit Testing
  5. Code Review (for development standards adherence and security design assessment)
  6. Refactoring (to conform to standards and address security deficiencies)
  7. Functional Testing
  8. User Acceptance Testing (preferably automated)
  9. Performance Testing
  10. Pilot (beta testing with real or proxy users)
  11. Ready-to-Ship

This may sound like a lot of work! And, it certainly is a lot of work. All mission-critical projects consist of a number of features that must go through these steps before they can be considered DONE and Ready-to-Ship.
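One way to keep such a definition of DONE honest is to encode the checklist so that a feature cannot be counted until every phase has passed. A hypothetical sketch follows; the phase names simply mirror the list above and are not from any real tracking tool:

```python
# Every phase a feature must pass before it counts as DONE.
DONE_PHASES = {
    "requirements", "analysis_design_docs", "coding", "unit_testing",
    "code_review", "refactoring", "functional_testing",
    "user_acceptance_testing", "performance_testing", "pilot",
    "ready_to_ship",
}

def is_done(completed_phases):
    # A feature is DONE only when every required phase has passed.
    return DONE_PHASES.issubset(completed_phases)

print(is_done({"coding", "unit_testing"}))  # False
print(is_done(DONE_PHASES))                 # True
```

A team could adapt the set to its own agreed-upon definition; the mechanism stays the same.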


Pitfalls of using early travel velocity to forecast total road trip duration

Returning to our travel example, suppose we are traveling from our city to the mountains for a conference about software estimation and planning. We know the destination is 500 miles away. We also know that the interstate through our city and into the next state has a speed limit of 70 mph. A simple calculation tells us that it would take 7.14 hours to travel 500 miles at 70 mph.

What if you absolutely had to be at the meeting on time? Would you think it’s wise to turn that back-of-the-napkin estimate into a target to which you could commit?

Most people would say it’s insane to expect that you would travel into the mountains at 70 mph, the same velocity as on the interstate. What’s more, you’d have to take bathroom breaks and food breaks too. You agree with most people.

You decide to email the mailing list for the conference and ask if anyone has ever traveled from your city to the mountain location, and you get a response complete with a chart! Your colleague says she kept track of how many miles she traveled during each hour and came up with the chart in figure 3, showing that it took just over 9 hours to complete the 500 miles.



Figure 3: Chart showing total number of miles driven after each hour in red and number of miles driven during each hour in blue

If we round the number of hours traveled up to an even 10, we’ll just call this 50 mph. The reason we cannot travel at 70 mph during the entire trip is simple: mountain roads are curvier and more dangerous, and we have to break for food and the bathroom. Only after completing the trip once can we look back and use the experience to gauge future trips through the same or similar terrain.

Let’s take a beginner's look now at how agile teams can use historical data, combined with estimation, to produce better delivery date forecasts. This will be covered in more depth in my next post.


Producing better software delivery date forecasts using simple, empirical estimation techniques

Similarly, if we know our total number of features is 50, and that our velocity is 2 features-per-week, then it’s tempting to calculate that it should take 25 weeks to complete our project.
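That tempting calculation is just the travel formula rearranged for time, T = D / V. A back-of-the-napkin sketch, carrying the same caveats as the 70 mph road-trip estimate:

```python
def naive_forecast_weeks(remaining_features, features_per_week):
    """T = D / V: weeks to finish, assuming velocity never changes."""
    return remaining_features / features_per_week

print(naive_forecast_weeks(50, 2))  # 25.0 weeks -- optimistic!
```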

Alas, software development is rarely as simple as driving down a straight interstate. Just as the journey into the mountains takes us through a variety of terrain and we must take breaks, all software development takes us through all kinds of unexpected requirements. Stakeholders request new features, markets change, people get hired, people get fired!

And, most importantly, not all features are the same size or complexity. Because of this, agile teams need to take additional steps to bring predictability to delivery schedules. This is usually done with estimation techniques like Wideband Delphi or Planning Poker. These two techniques have been written about by Steve McConnell and Mike Cohn, respectively. (5, 6, 7)

I will cover Planning Poker in more detail in a future post, but the main idea behind it is that the entire team takes a few hours every three weeks to look ahead at the work to be done and places a relative estimate of size or complexity on each item. They then measure how quickly they can complete each item. So instead of our simple count of “50 features”, the team might actually have a number such as 150 “points”, which means that, on average, each feature is roughly 3 points of estimated size or complexity. For now, however, let’s continue to focus on tracking how fast the team moves through 50 features.
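Before returning to raw feature counts, note that point-based arithmetic is unchanged; only the unit switches from features to points. A hedged sketch using the numbers above (the 6-points-per-week velocity is my own illustrative assumption: roughly 2 features per week at about 3 points each):

```python
def forecast_weeks(remaining_points, points_per_week):
    # Same T = D / V forecast, measured in points instead of features.
    return remaining_points / points_per_week

print(forecast_weeks(150, 6))  # 25.0 weeks
```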

Agile teams typically use a chart that is drawn from the top down towards zero, where zero indicates no more features outstanding! This is called a burndown chart, and a realistic one might look like figure 4:



Figure 4: Hypothetical burndown chart illustrating how the amount of actual work, in blue, fluctuates up and down as the total number of UNDONE features approaches zero. The initial estimate of 50 features and the target velocity of burning down 2 features per week is shown in red

This chart shows that the team had 50 features remaining to implement at the start of week 0. The initial target velocity of 2 features per week, shown in red, holds up for a few weeks, but then it falls off a bit before speeding up to faster than 2 per week. Perhaps the business team feels the team can take on more work, and new features get added. This causes the period between weeks 11 and 24 to remain relatively flat before the velocity picks up again.

By the time the initial 50 features are completed, we can calculate that they burned down at a rate of about 1.5 per week. Now, this simple chart does not actually show how many features were added over the course of the project, though the blue spikes make it obvious that some were. There are more sophisticated charts that can help illustrate this, but I’ll leave that for next time.
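That realized rate is simply V = D / T applied after the fact. A small sketch (the 33-week duration is my own inference from the stated rate of about 1.5 features per week; the chart itself does not label the final week):

```python
def realized_velocity(features_completed, weeks_elapsed):
    # The burn-down rate actually achieved, not the one targeted.
    return features_completed / weeks_elapsed

# The original 50 features, finished after roughly 33 weeks:
print(round(realized_velocity(50, 33), 1))  # 1.5 features per week
```

Comparing this realized velocity against the 2-per-week target is exactly the kind of feedback that makes the next forecast more honest.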

In the meantime, please visit the suggested resources, starting with Mike Cohn’s excellent presentation about Agile Estimation and Planning, to learn more. (1)

Until next time, stay agile not fragile.


References and Resources


  1. “Introduction to Agile Estimation and Planning” – by Mike Cohn, PDF presentation about release planning with agile estimation and planning techniques:
  2. “Nokia Test: Where did it come from?” – by Jeff Sutherland, about how Nokia uses velocity tracking to assess their teams’ productivity and likelihood to generate future ROI:
  3. “How Do We Know When We Are Done?” – by Mitch Lacey, about how his team defined DONE with the whole team’s participation:
  4. “Scrum, et al” – by Ken Schwaber, about the history of Scrum, presented at Google:
  5. Software Estimation: Demystifying the Black Art – by Steve McConnell, book about lessons learned and best practices for software estimation:
  6. Agile Estimation and Planning – by Mike Cohn, book about how to perform agile estimation and planning using simple estimation techniques and short, fixed time-boxed development iterations:
  7. ATL ALT.NET Meetup recorded conversation about Agile Estimation and Planning: (direct MP3 link: