Facilitating High Availability Via Systematic Capacity Planning
High availability is one of the key features of web-scale distributed systems. It has become even more critical as mobile computing becomes increasingly ubiquitous and end-users' latency tolerance keeps shrinking. One of the primary aspects of delivering high availability is systematic and rigorous capacity planning. This is non-trivial: underestimating capacity requirements would degrade the end-user experience, thereby impacting business; conversely, overestimating them would balloon operational costs, likewise impacting business. Further, in the case of services such as Twitter, the event-driven nature of the traffic (where event occurrence is not known a priori) makes capacity planning especially challenging. To this end, at Twitter, we developed a systematic and statistically rigorous approach for capacity planning. In particular, we derived insights from historical time series to estimate, for example, traffic for upcoming events. In the talk, we shall walk through a concrete example of how we went about capacity planning for the Super Bowl in 2013. In spite of the blackout during the game, the capacity deployed, based on the approach we developed, seamlessly handled the 'additional' traffic. We validated our capacity projections after the event.
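The abstract describes the approach only at a high level. As a rough illustrative sketch (not Twitter's actual method), projecting event capacity from historical peak traffic might combine a high percentile of past peaks with assumed growth and headroom multipliers; the function name, inputs, and multipliers below are all hypothetical:

```python
def capacity_estimate(peak_rps_history, growth_factor=1.2, headroom=1.5):
    """Hypothetical sketch: project capacity for an upcoming event
    from historical per-event peak request rates (requests/sec).

    growth_factor: assumed organic traffic growth since past events
    headroom: safety multiplier to absorb unforeseen spikes
    """
    # Use a high percentile of historical peaks rather than the mean:
    # underestimation hurts the user experience more than a modest
    # overestimate hurts cost.
    peaks = sorted(peak_rps_history)
    p95 = peaks[int(0.95 * (len(peaks) - 1))]
    return p95 * growth_factor * headroom

# Made-up peak request rates from past high-traffic events:
history = [12_000, 15_500, 14_200, 18_000, 16_700]
print(round(capacity_estimate(history)))  # → 30060
```

The asymmetry of the two error modes described above is why a sketch like this would lean on a high quantile plus headroom rather than a central estimate.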