“OMG, Facebook is DOWN!!!” was the cry of millions when Facebook was unavailable for about three hours because of network issues. Given the nature of the Facebook service, the downtime had no lasting effect on its user base. In fact, some say productivity increased significantly during the three-hour window without access to Facebook. The bottom line is – the unavailability of a social networking service doesn’t negatively impact its users (the ego and reputation of the service aside). The question is: does the same hold true for companies leveraging Facebook, or other social networks like Twitter, Flickr, FourSquare, etc., in their daily operations?
In this day and age, more and more companies operating online businesses try to break into the social media realm by leveraging existing services to increase the visibility of and loyalty to their brand, to bring more people to their sites, and, consequently, to increase conversions, whether those are visits, purchases, or participation. I’ve seen many incarnations of social networking implementations, from basic simplified authentication with Facebook Connect augmenting the regular process (for ease of registration/login) to full-blown applications relying heavily on multiple features available from the services’ APIs. Personally, I am all for having these services available and used strategically throughout an application. They provide a tremendous benefit not only in brand familiarity and content, but also in cost savings, considering you’re leveraging years of someone else’s work for your gain. Consider Flickr. The storage, the CDN, and the REST APIs to present the assets are all developed and tested for you; all you need to do is integrate the functionality within the content of your site. The same services are available to everyone, and it’s a business decision which features would benefit the company’s strategy. The implementation of those features, however, varies significantly.
One of the major risks of implementing a third-party service is the reliance on the availability of that service. A service you have no control over. And no matter how large or successful the service you’re using is – it will go down at one point or another. Just look at the intro image with a log from my Twitter client. Now imagine what happens during that downtime if your site relies heavily on the Twitter API.
On a side note, I have talked at length before about the need for application monitors to assist both development and operations teams with monitoring the health of the application, because often monitoring system health alone is not enough. To offer a piece of advice, a good metric to track against your system and application performance is the commit log. It allows you to correlate your version control commits and deployments to irregularities in trending patterns, and it helps you identify the problematic piece of code when something goes wrong. But I digress.
So let’s examine a situation. A large online media company decided to switch to Facebook Connect as the exclusive authentication method for their site. (To prevent a discussion about the viability of this choice, let me just note that there was a business reason for choosing this route.) This is where the fun starts. The graph below represents HTTP load time for the pages on the site at every stage of the process. Even without the captions on the graph (given the tone of this post), everyone should be able to pinpoint the exact time when the changeset went live and the load time of the pages tripled. The project owners were notified, but since the load times were extremely low to begin with (thank you, caching), the load speed was deemed acceptable, and the changes remained in production. Time passed. And then some more. And then the dark day came – the day Facebook went down. The page load times on the media site tripled again for a very brief period (while the Facebook servers were merely lagging), and then dropped to 0, i.e. “users are unable to see the site”. Just like that, Facebook’s problem became the company’s problem.
Upon closer code investigation, the problem was identified and solved quickly, which also reduced the page load time to its original threshold as a byproduct of the change. But the incident shows how dependent your site can become on a third-party service’s availability if its features are not implemented correctly.
So how can these issues be avoided? There are a few common-sense rules that, for some reason, often get ignored during development, and they should help you integrate external services without affecting your site’s performance.
1. Only connect to a third-party service where needed. Don’t connect to Facebook on every page load to validate that the user is still the same user you displayed the previous page to. Cache the results locally.
2. Don’t make connections to a third-party service in the critical path of the page load. Don’t load Google Analytics as the first thing on your page; you will delay the display of the content that actually matters. Make the connections after your content is loaded, or better yet – connect asynchronously.
3. Trap time-outs and errors. You do it with your database connections; why would you treat external connections differently?
4. Create a fallback plan. You have no control over external services, but you do have control over the content presented to your users. If a Flickr feed is an essential feature of your site, store the displayed history locally so you can fall back to the latest available content when Flickr is unavailable. Remember, sometimes stale content is better than no content at all.
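Rules 3 and 4 combine naturally: wrap the external call in a timeout, trap the failure, and fall back to the last known-good response. A sketch in Python, assuming a JSON feed; the fallback filename and the feed URL are illustrative, not a real Flickr endpoint:

```python
import json
import urllib.request

FALLBACK_FILE = "feed_cache.json"  # last known-good copy, stored locally

def fetch_feed(url, timeout=2.0):
    """Fetch an external JSON feed; on any failure, serve the cached copy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = json.load(resp)
        with open(FALLBACK_FILE, "w") as f:
            json.dump(data, f)  # refresh the local fallback on success
        return data
    except (OSError, ValueError):
        # OSError covers timeouts, DNS failures, and connection errors
        # (urllib.error.URLError subclasses it); ValueError covers bad JSON.
        try:
            with open(FALLBACK_FILE) as f:
                return json.load(f)  # stale content beats no content
        except OSError:
            return []  # nothing cached yet: degrade gracefully
```

The short timeout is the key design choice: a lagging third party costs you at most two seconds per request instead of hanging the page, and an outright outage costs you nothing but slightly stale content.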
To make a blanket statement — don’t jump to using social media features without identifying a need for them; use them to support your primary business model. At the end of the day, when integrating any third-party service, you are trying to leverage the benefits of the available functionality to enhance the experience for your own users, not to inherit the service’s availability problems. Integrate smartly, not blindly.