Observability
“Observability”... is a superset of “monitoring”, providing certain benefits and insights that “monitoring” tools come a cropper at. Before examining what these gains might be and when they are even needed, let’s first understand what “monitoring” really is, what its shortcomings are and why “monitoring” alone isn’t sufficient for certain use cases.
These are the four pillars of the Observability Engineering team’s charter:
- Monitoring
- Alerting/visualization
- Distributed systems tracing infrastructure
- Log aggregation/analytics
Source: https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
Presentations about Observability
Debugging Microservices: How Google SREs Resolve Outages
Forced Evolution: Shopify's Journey to Kubernetes
Canopy: Scalable Distributed Tracing & Analysis @Facebook
Observability to Better Serverless Apps
Interviews
Debugging Microservices: How Google SREs Resolve Outages
What is the work that you do today as a Google SRE?
Adam: I work for a Google DevOps team that takes care of Monarch. Monarch is a very large time series database used for querying and metrics collection. Monarch is roughly the internal equivalent of combining Prometheus, Grafana, and Graphite from the open source world. Monarch also adds to that stack all of Stackdriver and provides the backend for a lot of our cloud signals product. My role is an SRE-SWE which means I'm involved in the software engineering side as well. So a lot of my time is spent taking apart Monarch and putting it back together more durably and more reliably. Durability is especially important because Monarch is a globally distributed system (it runs in every single availability zone).
Can you give me an idea of the scope and size we’re talking about with Monarch?
Adam: I can’t be specific, but it’s very large in terms of both QPS and resources. The quantity of data per stream is extremely variable in size, from periodically receiving one byte, to receiving a constant stream of high-cardinality data. The same applies to the query side, where some queries need only fetch a single stream, and some need to fetch and aggregate a lot of them. Some consumers are doing ad hoc queries, and other teams are doing a tremendous number of queries per second to inform their actual customer facing products. Without Monarch, we have no monitoring or alerting, so it’s a critical system.
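The difference Adam describes between a query that fetches a single stream and one that fetches and aggregates many can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the class `TinySeriesStore`, its methods, and the label names (`job`, `zone`) are hypothetical and say nothing about how Monarch itself is built or queried.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Labels = FrozenSet[Tuple[str, str]]  # one unique label set identifies one stream
Point = Tuple[int, float]            # (timestamp in seconds, value)


class TinySeriesStore:
    """Hypothetical toy store: one list of points per stream (not Monarch's design)."""

    def __init__(self) -> None:
        self._streams: Dict[Labels, List[Point]] = defaultdict(list)

    def write(self, labels: Dict[str, str], ts: int, value: float) -> None:
        """Append one point to the stream identified by the full label set."""
        self._streams[frozenset(labels.items())].append((ts, value))

    def fetch_one(self, labels: Dict[str, str]) -> List[Point]:
        """Cheap query: read exactly one stream by its full label set."""
        return list(self._streams.get(frozenset(labels.items()), []))

    def sum_matching(self, selector: Dict[str, str]) -> Dict[int, float]:
        """Costlier query: scan all streams, keep those matching the
        selector, and sum their values per timestamp."""
        wanted = set(selector.items())
        totals: Dict[int, float] = defaultdict(float)
        for labels, points in self._streams.items():
            if wanted <= labels:  # selector is a subset of the stream's labels
                for ts, value in points:
                    totals[ts] += value
        return dict(totals)


store = TinySeriesStore()
store.write({"job": "api", "zone": "a"}, 0, 1.0)
store.write({"job": "api", "zone": "b"}, 0, 2.0)
store.write({"job": "api", "zone": "a"}, 60, 3.0)
store.write({"job": "db", "zone": "a"}, 0, 10.0)

single = store.fetch_one({"job": "api", "zone": "a"})  # touches one stream
combined = store.sum_matching({"job": "api"})          # touches every "api" stream
```

In this sketch the single-stream read costs the same no matter how many streams exist, while the aggregate has to visit every stream, which is why high-cardinality data makes the second kind of query much more expensive.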