Mistakes happen. People who claim they can produce a bug-free product are lying either to you or to themselves, and it is debatable which is worse. Anyone who has worked in the tech industry for a few years has a couple of horror stories up their sleeve about THE mistake. Some of those stories are amusing, in retrospect of course, and some are pretty disturbing, but all of them clearly demonstrate one point: there is no perfection. As systems become more and more complex, it is virtually impossible to avoid every mistake and implement a bug-free solution. Once you accept that as an axiom, the emphasis shifts from the question "How do we avoid all mistakes?" to "How do we minimize the impact of a mistake?"
There are a couple of common misconceptions in the development world that can be costly to companies. One of them is a developer's belief that the work is "done" when it is deployed to production. While that is true from a sign-off and acceptance perspective, it is of very little comfort to the company when they realize that the brand new holiday campaign that launched last week (tested and approved) is not actually collecting order information on Black Friday. Or, to take a well-known example, the Y2K bug broke applications developed years before the year 2000. The bottom line is that the solution needs to keep working not only today or tomorrow, but months and years from now. Marry that goal with the axiom above, that no solution is 100% bug-free, and you have a pretty interesting challenge on your hands, which brings us back to the question of how to identify a mistake in time and mitigate its impact.
If you expected revelations of world secrets and conspiracies after that long preamble, you're in the wrong place. The answer is simple: monitoring. Anyone running a website in this day and age employs some monitoring strategy to make sure the site is up and running. Most of these people firmly believe that the monitors in place are sufficient to run their business successfully. And most of them are wrong.
Flaws in “traditional” monitoring
The complexity of today's web applications has gone far beyond the capabilities of "traditional" monitoring. Keeping tabs on the uptime and responses of your site will not paint a full picture of the application's performance and, consequently, will let problems slip through the cracks. Twitter serves over 20 million unique visitors a day … and is legendary for its downtime. Traditional HTTP checks would notify the operations team immediately if the site became unavailable. But what happens if the site is, seemingly, up and running? HTTP checks return a 200 code from the target pages and the browsing trends are above threshold. Everything is up and running, users are happy, the operations team can go out for a few beers. Right? Wrong! One of Twitter's more recent problems was lost tweet data. The site was up and available for browsing, the API accepted post requests and returned success codes, but the posts never registered with Twitter. From the standpoint of basic checks the site was operational; from the standpoint of frustrated users it was not. In all fairness, because of its viral popularity and non-sensitive data, Twitter's constant issues do not significantly affect the company's bottom line. Most companies are not so lucky: they would be paying for the downtime in hard dollars, opportunity cost, or both. To minimize these costs, companies need to identify and implement specific business rules that provide a sound basis for measuring the availability and success of the service they offer.
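To make the limitation concrete, here is what such a "traditional" check amounts to, sketched in a few lines of Python (the URL is a placeholder; any page would do). All it confirms is that the server answered with a 200; it says nothing about whether posts, orders, or registrations are actually being processed behind that page.

```python
import urllib.request

def http_check(url, timeout=10):
    """Traditional uptime check: passes if the page returns HTTP 200.

    This is exactly the kind of check that stayed green while Twitter
    was silently dropping tweets: availability, not correctness.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, connection failures
        return False
```

A check like this is still worth having; the point is only that it cannot be the whole monitoring strategy.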
Business rules for developers
Which brings us back to the developer's perspective. Defining business rules not only lays the groundwork for business success, it also establishes the success criteria for the job being "done" after launch. If the project involves site registration (beta sign-up pages, membership sites, etc.), a viable business check would be to make sure the hourly number of registrations does not drop below a set threshold. If you are working on an e-commerce solution, the ratio of credit card transaction successes to failures would be a good measurement for confirming that the process works as expected. Note that business checks do not conflict with system checks; they can (and should) be used in conjunction with each other.
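Both business checks described above boil down to comparing a number you pull from your own data against a threshold you set. A minimal sketch (the threshold values are illustrative, not recommendations):

```python
def registrations_ok(hourly_registrations, threshold):
    """Business check: alert if sign-ups drop below an expected floor."""
    return hourly_registrations >= threshold

def transactions_ok(successes, failures, min_ratio=0.95):
    """Business check: alert if the credit card success ratio degrades."""
    total = successes + failures
    if total == 0:
        return False  # zero transactions is itself a red flag
    return successes / total >= min_ratio
```

In practice these would run on a schedule, fed from your database or logs, and page someone on failure; the thresholds themselves are a business decision, not a technical one.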
Who to blame?
There is a line of responsibility between system administrators and developers; depending on whether it is clear or blurred, the result is either effective teamwork or cross-department finger-pointing. Oftentimes the operations team has no knowledge of application-specific functionality, or of how system changes would affect the application. System administrators are responsible for system health; developers are responsible for application health. Following the earlier example: if your web server is offline, the application will not accept any registrations. But bringing the server back online does not guarantee that the application resumes working correctly. And if there are no monitors validating application behavior, then, as far as anyone can tell, there are no problems. From the standpoint of the operations team, everything is up and running. From the standpoint of the business owner, not so much.
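The split in responsibility can be expressed directly in the health check itself. A sketch, assuming the two signals (server status and hourly registrations) are already being collected:

```python
def service_healthy(server_up, hourly_registrations, threshold):
    """System health and application health are separate questions.

    A green server with zero registrations is still an outage -- it is
    just an outage that belongs to the developers, not to operations.
    """
    system_ok = server_up                                 # ops' question
    application_ok = hourly_registrations >= threshold    # devs' question
    return system_ok and application_ok
```

Reporting the two flags separately, rather than only their combination, is what turns finger-pointing into a routing decision: the alert goes to whichever team owns the failing half.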
To wrap up …
Development of business and functionality monitors should be part of any project's scope. Period. The application may be elegant, may be extensible, may be near-perfect, but if that "near" rears its ugly head without anyone noticing and acting in a timely fashion, no redeeming quality of the application can make up for it. There is a variety of tools to help developers get the job done and comply with the company's monitoring guidelines. Get them, learn them, use them.