As the holiday rush winds down, I sit here reflecting on all the companies that lost business and revenue over the past few days. They lost it not because of technology failure as such, although that is always how the problem manifests itself, but because of failures in the processes meant to remedy failures of technology. I’ve offered some tips on preparing for holiday traffic from the system architecture perspective, but perhaps I should’ve concentrated on preparing for the rush from the organizational perspective.
Behind the extensive downtimes I witness every holiday, I also see a corporate failure to change archaic processes to match the change in business models. The companies most prone to this are often those transitioning either from a brick-and-mortar model or from enterprise software to a web-based offering. The latter are actually transitioning from a B2B to a B2C model, often without realizing it. But they are not the only offenders; even web-only companies suffer from the same symptoms. Whatever the company type, the change must come from the top. Often, corporate inflexibility and complacency are the main drivers behind legacy processes that do not reflect the state of modern operations.
This year, as an example, I witnessed a large e-commerce site, originating from a traditional catalog company, suffer a big revenue blow on Black Friday specifically because it devalued the principles of collaboration and shared responsibility while running a complex, business-critical web application. The owners made a conscious decision to separate the operations and development groups and maintain the traditional software development lifecycle (SDLC), limiting each group’s responsibilities to its respective domain. Those choices were the reason they were unable to accept orders for over 8 hours on Black Friday. 8 hours of no revenue. On an e-commerce site. During the busiest time of the year. Boom.
Lack of shared accountability
Management of the application was a function of the operations group. However, the system administrators had no domain knowledge of the application or, perhaps even worse, of its deployment history or rollback procedures. On the flip side, developers viewed operations solely as the responsibility of the system administrators, so once they were done deploying the code, they assumed their work was done. This meant that no developer was immediately available (it being a holiday and all) to troubleshoot the problem.
Lack of instrumentation
Monitoring was also defined by the business units as a function of the operations team only, and therefore adding application-level monitors was not part of the development life cycle. All the system-level monitors you would expect from a traditional operations team responsible only for systems showed no anomalies. No metrics reflecting application health or business rules were in place, making it difficult to pinpoint the problem in the application layer and, consequently, extending troubleshooting (and outage) time.
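To make the idea of a business-level monitor concrete, here is a minimal sketch in Python. It is not what the company in question ran (they had nothing of the sort); the class name `OrderRateMonitor`, the window size and the threshold are all hypothetical. The point is simply that "are orders still coming in?" is a metric that system-level checks such as CPU, disk and ping can never answer.

```python
from collections import deque


class OrderRateMonitor:
    """Hypothetical business-level check: counts completed orders
    inside a sliding time window and flags the application as
    unhealthy when too few orders have arrived."""

    def __init__(self, window_seconds: float, min_orders: int):
        self.window_seconds = window_seconds  # assumed window, e.g. 300 s
        self.min_orders = min_orders          # assumed floor for a busy day
        self._timestamps = deque()

    def record_order(self, now: float) -> None:
        # Called by the application each time an order is accepted.
        self._timestamps.append(now)

    def orders_in_window(self, now: float) -> int:
        # Drop orders that are older than the sliding window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)

    def is_healthy(self, now: float) -> bool:
        # Servers can look fine while this returns False.
        return self.orders_in_window(now) >= self.min_orders
```

In practice the application would call `record_order(time.time())` on every accepted order, and whatever alerting system the operations team already uses would poll `is_healthy(time.time())`. A check like this would likely have paged someone within minutes, not hours.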
Lack of flexibility
The development group had a defined process for modifying and deploying code that they had to follow, which prevented them from deploying quick patches as needed. They were forced to follow the standard SDLC process for a critical bug fix instead of adjusting the process to shorten time-to-market for an issue affecting millions of users and costing as much in dollars.
It is also worth mentioning the lack of automation: packaging, testing and deploying the patch itself took significant time because of the required coordination and hand-offs between the groups. There was also no rollback plan that would have allowed them to quickly back out the last set of changes and let users continue to shop while developers worked on the fix. One can argue that those oversights fall on the IT groups rather than the business groups, although they still fall within the domain of process failure.
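A rollback plan does not need to be elaborate. The sketch below, again in Python and again purely illustrative, keeps a history of deployed releases and can reactivate the previous one on demand. `ReleaseHistory` and the `activate` callback are hypothetical names; in a real environment `activate` would be whatever switches traffic to a given release (repointing a symlink, flipping a load balancer pool, and so on).

```python
from typing import Callable, List, Optional


class ReleaseHistory:
    """Hypothetical record of deployed releases with one-step rollback."""

    def __init__(self, activate: Callable[[str], None]):
        # `activate` is an assumed hook that makes a release live.
        self._activate = activate
        self._releases: List[str] = []

    @property
    def current(self) -> Optional[str]:
        return self._releases[-1] if self._releases else None

    def deploy(self, version: str) -> None:
        # Record the release only after activation succeeded.
        self._activate(version)
        self._releases.append(version)

    def rollback(self) -> str:
        if len(self._releases) < 2:
            raise RuntimeError("no previous release to roll back to")
        previous = self._releases[-2]
        # Reactivate the last known-good release first, then
        # forget the bad one.
        self._activate(previous)
        self._releases.pop()
        return previous
```

With something like this in place, anyone on call can execute the decision to back out a bad release without needing the developers who wrote the change. Users keep shopping while the fix goes through the normal process.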
The technology space has evolved. So have a lot of technologists. Businesses, however, especially larger ones, have a natural aversion to change that is often justified by risk and cost factors. Yet processes are put in place for exactly that reason: to save time and money. If they don’t accomplish those two goals or, worse, contribute to the opposite, they need to be changed. My hope is that in light of visible, high-profile failures, businesses will begin to realize that the ROI of change in the right direction is worth it.