Track:

Location:

Salon D

Duration:

1:40pm - 2:30pm

Day of week:

Tuesday

Level:

Intermediate

Persona:

Architect

Key Takeaways

Understand that fault injection is critical for building and maintaining highly available cloud services. It is also a great tool for training on-call engineers.
Learn the basic principles of designing, developing, and using a fault injection system.
Discover problems early and in a controlled environment, without affecting customers.

Abstract

To run on the cloud, any company needs assurance that its workloads, services, and data will always be available and secure. To provide such guarantees, application developers and cloud providers need to perform extensive verification across a number of distributed services. Traditional testing tools were not designed to verify the resiliency of such systems.

At Microsoft, we actively develop and use fault injection to test and break our services. By doing this we identify failure points, design better detection, and build mitigations which allow us to auto-heal when real issues arise. Fault injection can span the whole stack: from applications to hardware and the network, from VMs to datacenters.

Developing a fault injection system is tricky, but using it effectively is an order of magnitude harder. It’s important for engineers to embrace the fault injection culture and be trained to leverage it in all phases of development and maintenance. Only then can we significantly reduce the TTD/TTM (time-to-detect/time-to-mitigate). We will present what we have learned about designing and using fault injection systems to maintain highly available cloud-scale applications, and the cultural change necessary to enable them.

Interview

Question:

What is your role today?

Answer:

I work as a software engineer in the Service Resilience Team at Microsoft. Our main charter is to provide tools and efficient usage patterns to ensure the resilience of our cloud services. Over the last few years, we have developed a modern, easy-to-use fault injection system for distributed services that helps service owners simulate various failure conditions and verify the health and behavior of their products. Currently, most of my time is spent designing and developing integral parts of this system, such as new faults, automation, security, and usability.

I’ve been with Microsoft since 2013 and also spent 3 months here as an intern in 2012 while I was pursuing my Master’s at EPFL in Switzerland. Before that, I worked on various projects, mainly as a freelancer, since 2004. For the last couple of years prior to joining Microsoft, I was mostly occupied with computer vision research, which I consider one of the most fascinating and promising fields in computer science.

Question:

Can you explain your talk title to me?

Answer:

Nowadays, people and companies rely on cloud services for all kinds of workloads. Cloud providers and web-application developers need to guarantee uninterrupted access to services and data. Based on our experience and experiments, we believe it’s not possible to meet customer expectations for availability without automated verification of service resilience under various failure conditions. Web services, in contrast to boxed software, change very fast: new features are introduced daily or weekly, and dependencies change regularly. The only way to keep up with this pace is to have automated solutions for resilience verification. So we continuously disturb both our cloud infrastructure and the services running on top of it by introducing faults, in order to identify problems as early as possible.

Question:

What types of things does your Fault Injection System do?

Answer:

Our system provides more than 25 faults that span the whole execution stack. For example, there are process-related faults, operating system and machine-level faults, resource pressure faults, network-level faults, and others. Each of those faults has various configuration options, so we can simulate a large number of incidents that can happen (or have happened in the past). Apart from the faults themselves, the system provides automated and scheduled injections, a subsystem to verify the behavior and health of the targeted service during and after injection, and different interfaces for user and programmatic interaction.
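
To make the ideas in this answer concrete, here is a minimal Python sketch of a configurable fault, an injection run, and a health check evaluated during and after the injection. It is not Microsoft's system: every name in it (`Fault`, `ToyService`, `run_injection`, `success_rate`, and the `drop_rate` option) is hypothetical and chosen only for illustration, and the "service" is an in-process stand-in rather than a real distributed service.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Fault:
    """A named fault with configuration options (hypothetical structure)."""
    name: str
    layer: str  # e.g. "process", "network", "resource"
    options: Dict[str, float] = field(default_factory=dict)


class ToyService:
    """Stand-in for a real service: a request succeeds unless a fault drops it."""

    def __init__(self) -> None:
        self.active_faults: List[Fault] = []

    def handle_request(self, rng: random.Random) -> bool:
        for fault in self.active_faults:
            # "drop_rate" is an illustrative option: the fraction of requests to fail.
            if rng.random() < fault.options.get("drop_rate", 0.0):
                return False
        return True


def success_rate(service: ToyService, rng: random.Random, requests: int) -> float:
    """Health check: fraction of requests that succeed."""
    ok = sum(service.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_injection(
    service: ToyService,
    fault: Fault,
    health_check: Callable[[ToyService, random.Random, int], float],
    requests: int = 1000,
    seed: int = 0,
) -> Dict[str, float]:
    """Inject a fault, measure health during it, remove it, then measure again."""
    rng = random.Random(seed)
    service.active_faults.append(fault)
    try:
        during = health_check(service, rng, requests)
    finally:
        # Always remove the fault, even if the health check itself fails.
        service.active_faults.remove(fault)
    after = health_check(service, rng, requests)
    return {"during": during, "after": after}


if __name__ == "__main__":
    service = ToyService()
    fault = Fault(name="packet_drop", layer="network", options={"drop_rate": 0.3})
    print(run_injection(service, fault, success_rate))
```

In this sketch the fault is always removed in a `finally` block, so the "after" measurement reflects whether the service recovers once the disturbance ends. A real system would additionally need scheduling, access control, and injection across machines, none of which is modeled here.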

Question:

How would you rate this talk?

Answer:

Intermediate – We will discuss our learnings on both developing and using a fault injection system and other tools for improving resilience.

Question:

How would you describe the persona of the target audience of this talk?

Answer:

The talk is targeted mostly at tech leads, developers, DevOps engineers, and architects, as we will present the high-level design of a fault injection system. We will also talk about the importance of investing in fault injection and resilience tools and how they can make a difference for companies developing cloud services. So I hope it will also appeal to senior management since, at the end of the day, they are the ones who make the decision to allocate resources to such projects.

Question:

What’s the motivation for your talk?

Answer:

Every few weeks you hear in the news about an outage at some web service or cloud provider. We believe that we (both at Microsoft and in the industry) are still in the early stages of “solving” the service resilience problem. It’s in the best interest of IT companies and their customers for engineers and managers to embrace the failure testing culture and understand its importance. With that in mind, we want to share what we have learned on the subject and promote fault injection as a crucial tool for achieving high availability. At the same time, we are attending QCon to hear the views and ideas of all interested parties and to learn from their experiences. We would like to exchange information on the most efficient technical solutions and discuss standardizing patterns across all platforms.

Question:

QCon targets advanced architects and senior development leads. What do you feel will be the actionable takeaways that this type of persona will walk away from your talk with?

Answer:

1. Invest in fault injection tools as soon as possible and use them during design and development. This will minimize surprises later, for both developers and customers.
2. Use the system to train on-call engineers and keep them prepared for real-life incidents.
3. Consider Azure for your workloads, knowing that we are serious about providing resilient services.

Speaker: Michalis Zervos

Service Resilience Software Engineer @Microsoft

Michalis Zervos is a software engineer at Microsoft. He is a member of the Service Resilience Team, which designs and builds state-of-the-art fault injection solutions and tools for improving the reliability and resilience of Microsoft’s cloud services. Over the years he has worked on a variety of projects, such as microprocessor programming, computer vision algorithms, and distributed applications. Michalis holds a Master’s degree in Computer Science from EPFL in Switzerland.

Find Michalis Zervos at

Speaker page

@mzervos

Software Engineer

Similar Talks

Learnings from a Culture First Startup

CTO @Buffer

Sunil Sadasivan

Day in the Life with Speech Recognition, Machine Learning, and IOT

Distinguished Engineer, Emerging Technology @IBM

Mark Vanderwiele

Architecting Your Application for Federation

Senior Development Evangelist @Okta

Joël Franusic

Becoming an Outlier

Software Architect @VinSolutions, Author @pluralsight

Cory House

ESPN Next Generation APIs Powering Web, Mobile, TV

Senior Director of Distribution Platforms @ESPN

Manny Pelarinos

The Human Side of Microservices

Tech Lead @Yelp

John Billings

Lessons Learned on Uber's Journey into Microservices

Software Engineer @Uber

Emily Reinhold

What They Don’t Tell You About Microservices…

CTO @Yodle

Daniel Rolnick

Algorithms for Animation

Partner & Tech Lead @CarbonFive

Courtney Hemphill

Tracks

Monday, 13 June

Architectures You've Always Wondered About

Case studies from: Google, LinkedIn, Alibaba, Twitter, and more...

Stream Processing @ Scale

Technologies and techniques to handle ever-increasing data streams

Culture As Differentiator

Stories of companies and teams for whom engineering culture is a differentiator - in delivering faster, in attracting better talent, and in making their businesses more successful.

Practical DevOps for Cloud Architectures

Real-world lessons and practices that enable the DevOps nirvana of operating what you build

Incredible Power of an Open-Sourced .NET

.NET is more than you may think. From Rx to C# 7 designed in the open, learn more about the power of open source .NET

Sponsored Solutions Track 1

Tuesday, 14 June

Better than Resilient: Antifragile

Failure is a constant in production systems; learn how to wield it to your advantage to build more robust systems.

Innovations in Java and the Java Ecosystem

Cutting Edge Java Innovations for the Real World

Modern CS in the Real World

Real-world Industry adoption of modern CS ideas

Containers: From Dev to Prod

Beyond the buzz and into the how and why of running containers in production

Security War Stories

Expert-level security track led by well known and respected leaders in the field

Sponsored Solutions Track 2

Wednesday, 15 June

Microservices and Monoliths

Practical lessons on services. Asks the question: when should you go with microservices, and when should you NOT?

Modern API Architecture - Tools, Methods, Tactics

API-based application development, and the tooling and techniques to support effectively working with APIs in the small or at scale. Using internal and external APIs

Commoditized Machine Learning

Barriers to entry for applied ML are lower than ever before; jumpstart your journey

Full Stack JavaScript

Browser, server, devices - JavaScript is everywhere

Optimizing Yourself

Keeping life in balance is always a challenge. Learning lifehacks

Sponsored Solutions Track 3

Conference for Professional Software Developers

Presentation: Improving Resilience by Creating Storms in the Cloud
