Agile From The Ground Up: Strangulation: The Pattern of Choice for Risk Mitigating, ROI-Maximizing Agilists When Rewriting Legacy Systems

"The most important reason to consider a strangler application over a cut-over rewrite is reduced risk. A strangler can give value steadily and the frequent releases allow you to monitor its progress more carefully. Many people still don't consider a strangler since they think it will cost more - I'm not convinced about that. Since you can use shorter release cycles with a strangler you can avoid a lot of the unnecessary features that cut over rewrites often generate." -- Martin Fowler, Chief Scientist, ThoughtWorks, on The Strangler Pattern

When rewriting a system in a new technology, it's tempting to think that the task will be easier and quicker than the first time it was written. Because of this, sometimes business sponsors believe that a "waterfall" or "big bang all at once" approach will work out, but this is rarely the case for any project large enough and important enough to warrant rewriting. It's always important to practice iterative and incremental development to provide for feedback loops. But, it's even more important to do this in the case of a large application rewrite. This article will explain why this is true. There are a few bedrock development principles that project sponsors and team members should put into practice to ensure the success of large scale migrations. Having learned these lessons from experience, these are:

Involve business sponsors and end-users directly (or a user-experience specialist) and the entire support and operations teams during the entire rewrite
Involve permanent quality-assurance professionals from the beginning and during the entire rewrite
Design, code, and test one complete feature rewritten from the existing system as quickly as possible
Thereafter, design, code, test, and pilot user-valued, return-on-investment-generating (ROI) features in small increments
Most importantly, continuously build team member skills, knowledge, and leadership abilities

Lessons Learned in Rewriting Large Legacy Systems

In February of 2006 I joined a small .NET consulting company. Shortly thereafter I was assigned to a brand new project for one of their clients to analyze, design, and develop a new version of an existing electronic commerce platform. The system was a highly successful, niche-market leading auction site with nearly 700,000 registered users at the time. In operation for more than seven years by then, the system was built on classic ASP, C++/COM, and SQL Server 2000. It consisted of about 330 ASP pages. Our client wanted to do two primary things. First, he wanted to add new, value-added features to the system to provide a much better user experience, one that would be similar to eBay. These features would be called "My Auctions". This new set of features would take the place of roughly 30 pages from the existing web site. Second, he wanted to migrate the other 300 pages, without introducing any functionality or usability improvements, to ASP.NET WebForms. Having already personally designed and developed the entire business object back-end COM objects, he wanted all of the new web site to reuse this investment by utilizing COM Interop.

My Recommendation: Perform a Phased, Vertical Migration One Piece at a Time

My first assignment was to analyze the existing ASP and C++ code to and produce a migration strategy recommendation. This strategy document would lay out our company's professional opinion for migrating the system to the .NET platform and the C# language. My recommendation was for our client to perform a vertical migration, which is a migration that incorporates an entire functional slice of a subset of the system (My Auctions) which cuts across all architectural layers (top-to-bottom). In their book The Pragmatic Programmer, Dave Thomas and Andrew Hunt call this a "tracer bullet". This was, in fact, what Microsoft recommended in their best practices guidance documents that I researched about performing large scale system architecture migrations. I recommended that our client hire us to build a new core platform on ASP.NET with C# and get the new, value-added features to market as soon as possible on top of that core platform. Only after these value-added features were in production would we then move on to replacing the rest of the 300 pages with ASP.NET replacements.

My Reasoning: Place Customer Satisfaction, ROI, and Risk Mitigation First

My reasoning was that by creating a new core platform and building the brand new, usability-focused, value-added My Auctions features on top of that, our client would generate a return-on-investment (ROI) much sooner by generating more sales volume with the user-friendly features and would simultaneously mitigate significant risk by testing the viability of the COM Interop strategy. By virtue of the features being value-added, there would be no risk whatsoever for him to deploy them to a parallel web server and get his users to begin pilot testing the system and providing valuable feedback early on in the game when he could still make significant changes prior to committing to replacing the entire system with the new technology.

I've since learned from Dan North at QCon 2010 in S.F. that is called The Strangler Pattern according to Martin Fowler, hence the title of this post!

Client Decision: Let's Do It All At Once

Our client considered my recommendation very carefully, but wanted to take a different approach. Rather than deploy the new My Auctions features independently, side-by-side with the existing system, he wanted to have his in-house staff work on the other 300 pages while our company worked on the value-added features. With more than 330 pages to complete, I estimated that the project would not take less than a year, but would more likely take two years or more to complete. Our client and my manager thought things could be done much faster if we had three or four people working on the system. This was certainly the case early on when I worked side-by-side with another developer in our company. Within four months, he and I had completed the new C# application foundation and the value-added features to the point that they were ready for beta-testing.

And that's when all the fun began!

Planning is Essential; Plans are Useless

As anyone who has worked in the software industry for a number of years knows, the best laid plans never go as you planned. Our client's lead C# developer left his company. Soon after that, my manager at my company was let go, but several months later he was hired by our client to take over the development management of the project. This made sense since he already had strong background on the project since its inception. Shortly after this, our client's HTML, graphics, and CSS developer quit when asked to change his focus to become an ASP.NET developer. They hired a lead C# developer and he got to working on a large slice of the application while I continued to work on another large slice. Five months later, they hired a second C# developer and he began working on several other slices of the application.

Wanting to see the project through to success, I joined the client as a direct employee to continue being the lead architect for the project.

Naturally, There's a Big Trade Show In The Story

What would any development story be without a "Big Trade Show" lurking around the corner? As luck and fate would have it, in early 2008 there was a huge industry trade show, and it was critical that we would be able to demonstrate the new version of the system to the roughly 25,000 customers that would be passing through our booth. And, it would be very important that these customers be able to see their own real items, either ones they were selling or ones they were buying. The problem was, of course, that the system was not ready to replace the production system! Due to security requirements brought about by a changing legal environment, we had to repartition the back-end database for the new system from 2 SQL databases into 6 separate databases just before the trade show. It was deemed too risky to perform this radical "surgery" on the live, production system just two months before the trade show. The new system's schema was about 95% the same as the old system, but there were corrections to long-standing column name problems or foreign-key reference inconsistencies. This was a complicating factor, however, for migration.

We tossed around various ideas, such as:

Perform the "surgery" on the production database to upgrade it to the new system schema, then use views and synonyms to create a "pass-through" database that looked like the old schema, but mapped across to the new DBs and structures.
Do the reverse: create several "pass-through" new databases with views and synonyms that actually resolved to the single existing production database's objects.

We felt that we could mitigate risk entirely by following the second option. What this also allowed us to do was to "override" some of the production system's tables with configuration data specific to the new system. The approach of using synonyms and views ensured that all writes and reads against the pass-through objects would actually resolve into the production database, thus enabling the beta version of the new system to live side-by-side with the legacy system.

The War Room

After some proof-of-concept prototyping, we realized this would be a winning strategy. Over the next couple of weeks, the four of us on the development team gathered daily in our "war room", and worked together to create all the necessary SQL scripts and shell databases, synonyms, views, etc that would be the magic glue. We ensured that we could re-run the scripts at will and automated our quality-control checks and sanity checks to be certain that all mappings would have proper permissions and configurations. After enough practice runs, we felt confident that it was ready to go. We created a single zip file which contained 5 BAK files, and a T-SQL script. We handed them off to our lead database administrator and he ran the scripts. Everything worked just as planned!

Cha-Ching!

At the trade show, everything went off flawlessly! Customers attended our booth and we, the development team, aided them directly in logging into the system and showcasing the new features we had worked so hard to develop. It was a very gratifying feeling to see how our improvised plan came together so well. Most importantly, we had succeeded in mitigating all risks to the money-generating production system, while also achieving the benefit of showcasing the new system to customers with real data. This was very exciting to them because they felt that the new features would greatly help them run their own businesses atop our platform.

Phased Transition From Legacy to New

We had now successfully demonstrated and validated the new, value-added features directly with customers in person. This was a great success. Yet, there was still much to do after the trade show. Features of lesser prominence, those in the 300 other pages set still needed to be developed and tested. This ended up taking a very long time, but we ultimately cycled back to my original recommendation by adopting an incremental replacement strategy.

It worked like this:

We deployed the new system to a new web server, named v2.
The existing, v1 site, remained at www.
We provided a link from v1 to v2 in the header of the v1 site, including advertising the benefits of the new system, but also including disclaimers and calls for assistance in testing and validating the usability of the new system.
This garnered a lot of early-adopters who helped find bugs and inconsistencies, all for free to us!
We monitored the usage patterns of v2 versus v1, to help estimate the load capacity under real-world conditions.

Michael Nygard's book "Release It!" proved prophetic here. In his book he says that "feature complete" is not the same as "production ready."
We learned this because the COM code had to be completely replaced with pure C# code since it could not stand up under load using COM Interop.

This result bore out my original advice to get the new features into production as soon as possible to monitor under real world conditions.

We formally adopted Scrum and Agile practices by identifying business-driven priorities and working through them in sprints.

We did this by closely monitoring the real-world usage of both the existing v1 system and the v2 system and focusing our effort first on the highest traffic pages, such as Viewing, Browsing, and Searching. Of course, Bidding and Payment, while producing less volume, were also mission-critical.
This focus allowed us to prioritize properly. We did not place inordinate emphasis on automating the testing of all areas of the system.

For example: we did not write Selenium test suites for things like Help Pages or Support Pages. Why? They are seldom used! And, they generate no revenue.

Instead, we built comprehensive Selenium test suites for the Big Four: Viewing, Browsing & Searching, Bidding, and Payment.

A Pleasant Surprise!

With the site now operating both in legacy, classic mode at www, and in "beta" mode at v2, the team began to actively monitor the new system's health health and encourage more and more users to jump into using v2. And, because we had focused on developing the value-added My Auctions features in the very beginning of the project, those features sat ready and willing to get into production! Our newest member of the team, who joined about two years after those features were ready and "shelved", took it on his own initiative, to our delight, to start building a mobile version of the core My Auctions features using ASP.NET MVC and the business objects that supported the My Auctions features. He was reluctant to show this prototype to the "higher ups", but the rest of our team encouraged him to do so. Within a few months, his mobile application was released into production before the global "switchover", described below, to the new system. A job well done!

Switching Over Right on Time for a Cool Billion Dollars

Over the course of more than a year, the team monitored the usage of v1 and v2, and began to more aggressively push the late adopters and stragglers into the new system. Eventually a "switchover" was made, and the v2 system took over the place of www. At that point, there was now a link back to v1, which ran from a virtual machine. Several months after this, the VM was retired, and the v1 system, and all of its legacy COM, was no more.

Just after the legacy system was retired for good, the company celebrated its 10th anniversary and 1 billion dollars in sales volume!

Retrospective

In retrospect, I spent nearly three years working on this project and learned a great deal! While I wish that the original plan of seeing the entire migration take place "all at once" could have been successful, I also am pleased that my original recommendation to take a phased, incremental, risk-mitigating, ROI-maximizing approach was very sound. Ultimately, that very approach became necessary due to the "expected" unexpected bumps along the road!

Application to Domains Seeking Non-Financial Returns

I understand that not all projects involve financial reward goals. Before I began working on the project just described, I worked for four years at the US Centers for Disease Control and Prevention. While working there, we were not seeking to generate financial return-on-investment. However, we did seek returns in the form of utility and value to the users and stakeholders of our systems. To assess this properly, it was critical to either observe the real users working with the system or to sit down with them and experience their pain, frustration, and sometimes: delight! Our team did this regularly by conducting evaluations, performing proficiency testing, and through coordinated multi-agency and stakeholder exercises under simulated public health emergency "war games".

Tying This All Back to Agile

While I've written more extensively on Agile in other posts on this blog, this post has not been about the "mechanics" of agile so much as it has been about the why. But, I want to look at just the first principle of the Agile Manifesto and make a brief comment:

Our highest priority is to satisfy the customer
through early and continuous delivery
of valuable software.

One might look at this principle and ask how can this be done when facing the situation I originally faced that featured my own client asking not for a continuous delivery model, but a "big bang" model? That's a very good question, and it's not one that has any quick-fix answer. My best advice here is that you need to learn the language and goals of both your client and your client's ultimate customers. If your client values financial returns, then ask him or her exactly what it is that generates financial returns.

In my client's case, returns come in when more people purchase items through his system. The next question should be: what is the shortest path we can take to increase that rate? If the client answers that the shortest path is to rewrite the entire system and deploy a big-bang upgrade, then you're going to have to keep breaking that down into smaller and smaller value-added chunks. You might have to suggest straw-men in terms of business-value if your client will not prioritize by business value naturally. Ultimately, like in this story, reality may bear down on the situation and if you have done your best to incrementally develop the system in terms of business value, then you can deliver value, upholding your end of the deal to the utmost of your ability within your realm of control. Sometimes that's the best you can do, until you run your own show!

Monday, March 28, 2011

Strangulation: The Pattern of Choice for Risk Mitigating, ROI-Maximizing Agilists When Rewriting Legacy Systems