Presentation: Improving Resilience by Creating Storms in the Cloud



1:40pm - 2:30pm




Key Takeaways

  • Understand that fault injection is critical to building and maintaining highly available cloud services. It’s also a great tool for training on-call engineers
  • Learn the basic principles of designing, developing and using a fault injection system
  • Discover problems early and in a controlled environment, without affecting customers


To run on the cloud, a company needs assurance that its workloads, services, and data will always be available and secure. To provide such guarantees, application developers and cloud providers need to perform extensive verification across a number of distributed services. Traditional testing tools were not designed to verify the resiliency of such systems.

At Microsoft, we actively develop and use fault injection to test and break our services. By doing this we identify failure points, design better detection, and build mitigations which allow us to auto-heal when real issues arise. Fault injection can span the whole stack: from applications to hardware and the network, from VMs to datacenters.

Developing a fault injection system is tricky, but using it effectively is an order of magnitude harder. It’s important for engineers to embrace the fault injection culture and be trained to leverage it in all phases of development and maintenance. Only then can we significantly reduce the TTD/TTM (time-to-detect/time-to-mitigate). We will present our learnings on designing and using fault injection systems to maintain highly available cloud-scale applications, and on the cultural change necessary to enable them.


What is your role today?
I work as a software engineer in the Service Resilience Team at Microsoft. Our main charter is to provide tools and efficient usage patterns to ensure the resilience of our cloud services. Over the last few years, we have developed a modern, easy-to-use fault injection system for distributed services that helps service owners simulate various failure conditions and verify the health and behavior of their products. Currently, the majority of my time is spent designing and developing integral parts of this system, such as new faults, automation, security and usability.
I’ve been with Microsoft since 2013 and also spent 3 months here as an intern in 2012 while I was pursuing my Master’s at EPFL in Switzerland. I have worked on a variety of projects, mainly as a freelancer, since 2004. For the last couple of years prior to joining Microsoft, I was mostly occupied with Computer Vision research, which I consider one of the most fascinating and promising fields in computer science.
Can you explain your talk title to me?
Nowadays people and companies rely on cloud services for all kinds of workloads. Cloud providers and web-application developers need to guarantee uninterrupted access to services and data. Based on our experience and experiments, we believe it’s not possible to meet customer expectations in terms of availability without automated verification of service resilience under various failure conditions. Web services, in contrast with boxed software, change very fast - new features are introduced daily or weekly and dependencies change regularly. The only way to keep up with this pace is to have automated solutions for resilience verification. So we try to continuously disturb both our cloud infrastructure and our services running on top of it by introducing faults in order to identify problems as early as possible.
What types of things does your Fault Injection System do?
Our system provides more than 25 faults that span the whole execution stack. For example, there are process-related faults, operating system and machine-level faults, resource pressure faults, network-level faults and others. Each of those faults has various configuration options, so we can simulate a large number of incidents that can happen (or have happened in the past). Apart from the faults themselves, the system provides automated and scheduled injections and a subsystem to verify the behavior and health of the targeted service during and after injection, and it exposes different interfaces for user and programmatic interaction.
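As a rough illustration of the ideas in this answer (configurable faults, injection, and health verification during and after injection), here is a minimal Python sketch. It is not the Microsoft system: `ToyService`, `Fault`, `run_experiment`, the two fault types and the 100 ms health threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToyService:
    """Hypothetical stand-in for a real service under test."""
    latency_ms: int = 20
    running: bool = True

    def healthy(self, max_latency_ms: int = 100) -> bool:
        # Assumed health rule: process is up and latency is within budget.
        return self.running and self.latency_ms <= max_latency_ms


@dataclass
class Fault:
    """A named fault with configuration options and inject/revert actions."""
    name: str
    config: Dict[str, int]
    inject: Callable[[ToyService, Dict[str, int]], None]
    revert: Callable[[ToyService, Dict[str, int]], None]


# Two illustrative faults: a network-level one and a process-level one.
def add_latency(svc: ToyService, cfg: Dict[str, int]) -> None:
    svc.latency_ms += cfg["delay_ms"]


def remove_latency(svc: ToyService, cfg: Dict[str, int]) -> None:
    svc.latency_ms -= cfg["delay_ms"]


def kill_process(svc: ToyService, cfg: Dict[str, int]) -> None:
    svc.running = False


def restart_process(svc: ToyService, cfg: Dict[str, int]) -> None:
    svc.running = True


def run_experiment(svc: ToyService, fault: Fault) -> Dict[str, bool]:
    """Inject one fault, record health before/during/after, always revert."""
    report = {"healthy_before": svc.healthy()}
    fault.inject(svc, fault.config)
    try:
        report["healthy_during"] = svc.healthy()
    finally:
        # Revert even if the health check raises, so the fault never leaks.
        fault.revert(svc, fault.config)
    report["healthy_after"] = svc.healthy()
    return report


svc = ToyService()
faults = [
    Fault("network_latency", {"delay_ms": 50}, add_latency, remove_latency),
    Fault("network_latency", {"delay_ms": 500}, add_latency, remove_latency),
    Fault("process_kill", {}, kill_process, restart_process),
]
reports = [run_experiment(svc, f) for f in faults]
for fault, report in zip(faults, reports):
    print(fault.name, fault.config, report)
```

In this toy setup the same fault type with different configuration produces different outcomes: the service tolerates a 50 ms delay but is reported unhealthy under a 500 ms delay, and in every case the revert step restores it afterwards.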
How would you rate this talk?
Intermediate – We will discuss our learnings on both developing and using a fault injection system and other tools for improving resilience.
How would you describe the persona of the target audience of this talk?
The talk is targeted mostly at tech leads, developers, DevOps engineers and architects, as we will present the high-level design of a fault injection system. We will also talk about the importance of investing in fault injection and resilience tools and how they can make a difference for companies developing cloud services. So I hope it will also appeal to senior management since, at the end of the day, they are the ones who decide whether to allocate resources to such projects.
What’s the motivation for your talk?
Every few weeks you hear in the news about an outage at some web service or cloud provider. We believe that we (both at Microsoft and in the industry at large) are still in the early stages of “solving” the service resilience problem. It’s in the best interest of IT companies and their customers for engineers and managers to embrace the failure testing culture and understand its importance. With that in mind, we want to share our learnings on the subject and promote fault injection as a crucial tool for achieving high availability. At the same time, we are attending QCon to hear the views and ideas of all interested parties and to learn from their experiences. We would like to exchange information on the most efficient technical solutions and discuss standardizing patterns across all platforms.
QCon targets advanced architects and sr development leads. What do you feel will be the actionable takeaways that type of persona will walk away from your talk with?
1. Invest in fault injection tools as soon as possible and use them during the design and development process; this will minimize surprises (both for developers and customers) later. 2. Use the system to train on-call engineers and keep them prepared for real-life incidents. 3. Consider Azure for their workloads, knowing that we are serious about providing resilient services.

Speaker: Michalis Zervos

Service Resilience Software Engineer @Microsoft

Michalis Zervos is a software engineer at Microsoft. He is a member of the Service Resilience Team that designs and builds state-of-the-art fault injection solutions and tools for improving the reliability and resilience of Microsoft’s cloud services. Over the years he has worked on a variety of projects such as micro-processor programming, computer vision algorithms and distributed applications. Michalis holds a Master’s degree in Computer Science from EPFL university in Switzerland.



Monday, 13 June

Tuesday, 14 June

Wednesday, 15 June