Observability

“Observability”... is a superset of “monitoring”, providing certain benefits and insights that “monitoring” tools come a cropper at. Before examining what these gains might be and when they are even needed, let’s first understand what “monitoring” really is, what its shortcomings are and why “monitoring” alone isn’t sufficient for certain use cases.

These are the four pillars of the Observability Engineering team’s charter:

- Monitoring

- Alerting/visualization

- Distributed systems tracing infrastructure

- Log aggregation/analytics

Source: https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c


Presentations about Observability

Site Reliability Engineer Liz Fong-Jones

Debugging Microservices: How Google SREs Resolve Outages

Site Reliability Engineer @Google Adam Mckaig

Debugging Microservices: How Google SREs Resolve Outages

Production Engineer @Shopify Niko Kurtti

Forced Evolution: Shopify's Journey to Kubernetes

Software Engineer @Facebook Haozhe Gao

Canopy: Scalable Distributed Tracing & Analysis @Facebook

Software Engineer @Facebook Joe O’Neill

Canopy: Scalable Distributed Tracing & Analysis @Facebook

CTO @IOpipe, former Maintainer Docker & OpenStack Erica Windisch

Observability to Better Serverless Apps

Interviews

Site Reliability Engineer

Liz Fong-Jones

Featured Interview

Debugging Microservices: How Google SREs Resolve Outages

What is the work that you do today as a Google SRE?

Adam: I work for a Google DevOps team that takes care of Monarch. Monarch is a very large time series database used for querying and metrics collection. Monarch is roughly the internal equivalent of combining Prometheus, Grafana, and Graphite from the open source world. Monarch also adds to that stack all of Stackdriver and provides the backend for a lot of our cloud signals product. My role is an SRE-SWE, which means I'm involved in the software engineering side as well. So a lot of my time is spent taking apart Monarch and putting it back together more durably and more reliably. Durability is especially important because Monarch is a globally distributed system (it runs in every single availability zone).

Can you give me an idea of the scope and size we’re talking about with Monarch?

Adam: I can’t be specific, but it’s very large in terms of both QPS and resources. The quantity of data per stream is extremely variable in size, from periodically receiving one byte to receiving a constant stream of high-cardinality data. The same applies to the query side, where some queries need only fetch a single stream, and some need to fetch and aggregate a lot of them. Some consumers are doing ad hoc queries, and other teams are doing a tremendous number of queries per second to inform their actual customer-facing products. Without Monarch, we have no monitoring or alerting, so it’s a critical system.


Site Reliability Engineer @Google Adam Mckaig