"The most important reason to consider a strangler application over a cut-over rewrite is reduced risk. A strangler can give value steadily and the frequent releases allow you to monitor its progress more carefully. Many people still don't consider a strangler since they think it will cost more - I'm not convinced about that. Since you can use shorter release cycles with a strangler you can avoid a lot of the unnecessary features that cut over rewrites often generate." -- Martin Fowler, Chief Scientist, ThoughtWorks, on The Strangler Pattern
When rewriting a system in a new technology, it's tempting to think that the task will be easier and quicker than the first time it was written. Because of this, business sponsors sometimes believe that a "waterfall" or "big bang all at once" approach will work out, but this is rarely the case for any project large enough and important enough to warrant rewriting. It's always important to practice iterative and incremental development to provide feedback loops, but it's even more important in the case of a large application rewrite. This article will explain why. There are a few bedrock development principles that project sponsors and team members should put into practice to ensure the success of large-scale migrations. I learned these lessons from experience:
- Involve business sponsors and end-users directly (or a user-experience specialist), along with the entire support and operations teams, throughout the rewrite
 
- Involve permanent quality-assurance professionals from the beginning and during the entire rewrite
 
- Design, code, and test one complete feature rewritten from the existing system as quickly as possible
 
- Thereafter, design, code, test, and pilot user-valued, return-on-investment-generating (ROI) features in small increments
- Most importantly, continuously build team member skills, knowledge, and leadership abilities
 
Lessons Learned in Rewriting Large Legacy Systems 
In February of 2006 I joined a small .NET consulting company. Shortly thereafter I was assigned to a brand new project for one of their clients: to analyze, design, and develop a new version of an existing electronic commerce platform. The system was a highly successful, niche-market-leading auction site with nearly 700,000 registered users at the time. In operation for more than seven years by then, the system was built on classic ASP, C++/COM, and SQL Server 2000. It consisted of about 330 ASP pages. Our client wanted to do two primary things. First, he wanted to add new, value-added features to the system to provide a much better user experience, one that would be similar to eBay. These features would be called "My Auctions". This new set of features would take the place of roughly 30 pages from the existing web site. Second, he wanted to migrate the other 300 pages, without introducing any functionality or usability improvements, to ASP.NET WebForms. Having personally designed and developed all of the back-end COM business objects, he wanted the new web site to reuse this investment through COM Interop.
My Recommendation: Perform a Phased, Vertical Migration One Piece at a Time
My first assignment was to analyze the existing ASP and C++ code and produce a migration strategy recommendation. This strategy document would lay out our company's professional opinion on migrating the system to the .NET platform and the C# language. My recommendation was for our client to perform a vertical migration, which is a migration of an entire functional slice of the system (My Auctions) that cuts across all architectural layers (top-to-bottom). In their book The Pragmatic Programmer, Dave Thomas and Andrew Hunt call this a "tracer bullet". This was, in fact, what Microsoft recommended in the best practices guidance documents I researched on performing large-scale system architecture migrations. I recommended that our client hire us to build a new core platform on ASP.NET with C# and get the new, value-added features to market as soon as possible on top of that core platform. Only after these value-added features were in production would we move on to replacing the rest of the 300 pages with ASP.NET replacements.
My Reasoning: Place Customer Satisfaction, ROI, and Risk Mitigation First
My reasoning was that by creating a new core platform and building the brand new, usability-focused, value-added My Auctions features on top of that, our client would generate a return-on-investment (ROI) much sooner by generating more sales volume with the user-friendly features and would simultaneously mitigate significant risk by testing the viability of the COM Interop strategy. By virtue of the features being value-added, there would be no risk whatsoever for him to deploy them to a parallel web server and get his users to begin pilot testing the system and providing valuable feedback early on in the game, when he could still make significant changes prior to committing to replacing the entire system with the new technology.
I've since learned from Dan North at QCon 2010 in S.F. that this is called The Strangler Pattern, according to Martin Fowler, hence the title of this post!
Client Decision: Let's Do It All At Once
Our client considered my recommendation very carefully, but wanted to take a different approach. Rather than deploy the new My Auctions features independently, side-by-side with the existing system, he wanted to have his in-house staff work on the other 300 pages while our company worked on the value-added features. With more than 330 pages to complete, I estimated that the project would not take less than a year, but would more likely take two years or more to complete. Our client and my manager thought things could be done much faster if we had three or four people working on the system. This was certainly the case early on when I worked side-by-side with another developer in our company. Within four months, he and I had completed the new C# application foundation and the value-added features to the point that they were ready for beta-testing.
And that's when all the fun began!
Planning is Essential; Plans are Useless
As anyone who has worked in the software industry for a number of years knows, the best-laid plans never go as planned. Our client's lead C# developer left his company. Soon after that, my manager at my company was let go, but several months later he was hired by our client to take over the development management of the project. This made sense since he had a strong background on the project from its inception. Shortly after this, our client's HTML, graphics, and CSS developer quit when asked to change his focus to become an ASP.NET developer. They hired a lead C# developer and he started working on a large slice of the application while I continued to work on another large slice. Five months later, they hired a second C# developer and he began working on several other slices of the application.
Wanting to see the project through to success, I joined the client as a direct employee to continue as the lead architect for the project.
Naturally, There's a Big Trade Show In The Story
What would any development story be without a "Big Trade Show" lurking around the corner? As luck and fate would have it, in early 2008 there was a huge industry trade show, and it was critical that we be able to demonstrate the new version of the system to the roughly 25,000 customers that would be passing through our booth. And it would be very important that these customers be able to see their own real items, either ones they were selling or ones they were buying. The problem was, of course, that the system was not ready to replace the production system! Due to security requirements brought about by a changing legal environment, we had to repartition the back-end database for the new system from 2 SQL databases into 6 separate databases just before the trade show. It was deemed too risky to perform this radical "surgery" on the live production system just two months before the trade show. The new system's schema was about 95% the same as the old system's, but there were corrections to long-standing column name problems and foreign-key reference inconsistencies. These differences, however, complicated the migration.
We tossed around various ideas, such as:
- Perform the "surgery" on the production database to upgrade it to the new system schema, then use views and synonyms to create a "pass-through" database that looked like the old schema, but mapped across to the new DBs and structures.
- Do the reverse: create several "pass-through" new databases with views and synonyms that actually resolved to the single existing production database's objects.
We felt that we could mitigate risk entirely by following the second option. What this also allowed us to do was to "override" some of the production system's tables with configuration data specific to the new system. The approach of using synonyms and views ensured that all writes and reads against the pass-through objects would actually resolve into the production database, thus enabling the beta version of the new system to live side-by-side with the legacy system.
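To make the second option more concrete, here is a minimal sketch of how such pass-through objects could be generated. It is not the script we actually wrote: the database, table, and column names are hypothetical, and it assumes that objects with an unchanged shape get a synonym while objects with corrected column names get a single-table view that aliases the legacy columns (in SQL Server, a simple single-table view like this generally stays writable).

```python
from dataclasses import dataclass, field


@dataclass
class Mapping:
    """One object in a shell database that resolves to a legacy object."""
    new_name: str        # object name in the new (shell) database
    legacy_table: str    # table name in the legacy production database
    # new column name -> legacy column name, listing every exposed column;
    # leave empty when the table has the same shape in both schemas.
    columns: dict = field(default_factory=dict)


def quote(name: str) -> str:
    """Bracket-quote a SQL Server identifier."""
    return "[" + name.replace("]", "]]") + "]"


def pass_through_sql(mapping: Mapping, legacy_db: str) -> str:
    """Build the T-SQL statement to run inside the shell database."""
    target = f"{quote(legacy_db)}.[dbo].{quote(mapping.legacy_table)}"
    shell_object = f"[dbo].{quote(mapping.new_name)}"
    if not mapping.columns:
        # Same shape in both schemas: a synonym is enough.
        return f"CREATE SYNONYM {shell_object} FOR {target};"
    # Corrected column names: a view aliases legacy columns to the new names.
    select_list = ", ".join(
        f"{quote(old)} AS {quote(new)}" for new, old in mapping.columns.items()
    )
    return f"CREATE VIEW {shell_object} AS SELECT {select_list} FROM {target};"


if __name__ == "__main__":
    # Hypothetical mappings, for illustration only.
    mappings = [
        Mapping("Bidder", "Bidder"),
        Mapping("Item", "Itm", {"ItemId": "itm_id", "Title": "ttl"}),
    ]
    for m in mappings:
        print(pass_through_sql(m, "Auctions"))
```

Generating the statements from a list of mappings keeps the whole thing re-runnable, which matters when the scripts need to be rehearsed many times before touching production.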
The War Room
After some proof-of-concept prototyping, we realized this would be a winning strategy. Over the next couple of weeks, the four of us on the development team gathered daily in our "war room" and worked together to create all the necessary SQL scripts, shell databases, synonyms, views, and so on that would be the magic glue. We ensured that we could re-run the scripts at will, and we automated our quality-control and sanity checks to be certain that all mappings would have proper permissions and configurations. After enough practice runs, we felt confident that it was ready to go. We created a single zip file that contained 5 BAK files and a T-SQL script. We handed it off to our lead database administrator and he ran the scripts. Everything worked just as planned!
Cha-Ching!
At the trade show, everything went off flawlessly! Customers attended our booth and we, the development team, aided them directly in logging into the system and showcasing the new features we had worked so hard to develop. It was a very gratifying feeling to see how our improvised plan came together so well. Most importantly, we had succeeded in mitigating all risks to the money-generating production system, while also achieving the benefit of showcasing the new system to customers with real data. This was very exciting to them because they felt that the new features would greatly help them run their own businesses atop our platform.
Phased Transition From Legacy to New
We had now successfully demonstrated and validated the new, value-added features directly with customers in person. This was a great success. Yet there was still much to do after the trade show. Features of lesser prominence, those in the set of 300 other pages, still needed to be developed and tested. This ended up taking a very long time, but we ultimately cycled back to my original recommendation by adopting an incremental replacement strategy.
It worked like this:
- We deployed the new system to a new web server, named v2. 
 
- The existing v1 site remained at www.
 
- We provided a link from v1 to v2 in the header of the v1 site, advertising the benefits of the new system but also including disclaimers and calls for assistance in testing and validating its usability.
- This garnered a lot of early-adopters who helped find bugs and inconsistencies, all for free to us!
- We monitored the usage patterns of v2 versus v1, to help estimate the load capacity under real-world conditions.
- Michael Nygard's book "Release It!" proved prophetic here. In his book he says that "feature complete" is not the same as "production ready."
- We learned this because the COM code had to be completely replaced with pure C# code, since it could not stand up under load using COM Interop.
- This result bore out my original advice to get the new features into production as soon as possible to monitor under real world conditions.
- We formally adopted Scrum and Agile practices by identifying business-driven priorities and working through them in sprints.
- We did this by closely monitoring the real-world usage of both the existing v1 system and the v2 system and focusing our effort first on the highest traffic pages, such as Viewing, Browsing, and Searching. Of course, Bidding and Payment, while producing less volume, were also mission-critical.
- This focus allowed us to prioritize properly. We did not place inordinate emphasis on automating the testing of all areas of the system.
- For example: we did not write Selenium test suites for things like Help Pages or Support Pages. Why? They are seldom used! And they generate no revenue.
- Instead, we built comprehensive Selenium test suites for the Big Four: Viewing, Browsing & Searching, Bidding, and Payment.
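The prioritization described in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the page names, the hit counts, and the `pages_to_automate` helper are hypothetical, and the real analysis came from monitoring production usage rather than from a script like this. The idea is simply to take the busiest pages by traffic and then always add the low-volume pages that the business cannot afford to break.

```python
from collections import Counter


def pages_to_automate(requests, mission_critical, top_n):
    """Pick pages that deserve automated tests.

    requests: iterable of page names, one entry per observed hit.
    mission_critical: pages to include regardless of traffic.
    top_n: how many of the busiest pages to include.
    """
    counts = Counter(requests)
    busiest = [page for page, _ in counts.most_common(top_n)]
    # Low-volume but revenue-bearing pages are always included.
    extras = sorted(p for p in mission_critical if p not in busiest)
    return busiest + extras


if __name__ == "__main__":
    # Hypothetical traffic sample, for illustration only.
    hits = (["view"] * 50 + ["browse"] * 30 + ["search"] * 20
            + ["bid"] * 5 + ["pay"] * 3 + ["help"] * 1)
    print(pages_to_automate(hits, {"bid", "pay"}, top_n=3))
```

With the sample data, the help page never makes the list, which mirrors the decision not to write test suites for seldom-used pages that generate no revenue.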
A Pleasant Surprise!
With the site now operating both in legacy, classic mode at www and in "beta" mode at v2, the team began to actively monitor the new system's health and encourage more and more users to jump into using v2. And, because we had focused on developing the value-added My Auctions features at the very beginning of the project, those features sat ready and willing to get into production! The newest member of our team, who joined about two years after those features were ready and "shelved", took it upon himself, to our delight, to start building a mobile version of the core My Auctions features using ASP.NET MVC and the business objects that supported the My Auctions features. He was reluctant to show this prototype to the "higher ups", but the rest of our team encouraged him to do so. Within a few months, his mobile application was released into production, before the global "switchover" to the new system described below. A job well done!
Switching Over Right on Time for a Cool Billion Dollars
Over the course of more than a year, the team monitored the usage of v1 and v2, and began to more aggressively push the late adopters and stragglers into the new system. Eventually a "switchover" was made, and the v2 system took over the place of www. At that point, there was now a link back to v1, which ran from a virtual machine. Several months after this, the VM was retired, and the v1 system, and all of its legacy COM, was no more.
Just after the legacy system was retired for good, the company celebrated its 10th anniversary and 1 billion dollars in sales volume!
Retrospective
In retrospect, I spent nearly three years working on this project and learned a great deal! While I wish that the original plan of seeing the entire migration take place "all at once" could have been successful, I am also pleased that my original recommendation to take a phased, incremental, risk-mitigating, ROI-maximizing approach was very sound. Ultimately, that very approach became necessary due to the "expected" unexpected bumps along the road!
Application to Domains Seeking Non-Financial Returns
I understand that not all projects involve financial reward goals. Before I began working on the project just described, I worked for four years at the US Centers for Disease Control and Prevention. While working there, we were not seeking to generate financial return-on-investment. However, we did seek returns in the form of utility and value to the users and stakeholders of our systems. To assess this properly, it was critical to either observe the real users working with the system or to sit down with them and experience their pain, frustration, and sometimes, delight! Our team did this regularly by conducting evaluations, performing proficiency testing, and running coordinated multi-agency and stakeholder exercises under simulated public health emergency "war games".
Tying This All Back to Agile
While I've written more extensively on Agile in other posts on this blog, this post has not been about the "mechanics" of agile so much as it has been about the why. But I want to look at just the first principle of the Agile Manifesto and make a brief comment:
Our highest priority is to satisfy the customer
through early and continuous delivery
of valuable software.
One might look at this principle and ask how this can be done when facing the situation I originally faced, in which my own client asked not for a continuous delivery model but for a "big bang" model. That's a very good question, and it's not one that has any quick-fix answer. My best advice here is that you need to learn the language and goals of both your client and your client's ultimate customers. If your client values financial returns, then ask him or her exactly what it is that generates financial returns.
In my client's case, returns come in when more people purchase items through his system. The next question should be: what is the shortest path we can take to increase that rate? If the client answers that the shortest path is to rewrite the entire system and deploy a big-bang upgrade, then you're going to have to keep breaking that down into smaller and smaller value-added chunks. You might have to suggest straw-men in terms of business value if your client will not prioritize by business value naturally. Ultimately, as in this story, reality may bear down on the situation, and if you have done your best to incrementally develop the system in terms of business value, then you can deliver value, upholding your end of the deal to the utmost of your ability within your realm of control. Sometimes that's the best you can do, until you run your own show!