GCP
Presentations about GCP
Debugging Microservices: How Google SREs Resolve Outages
Debugging Microservices: How Google SREs Resolve Outages
Serverless Patterns and Anti-Patterns
Interviews
Debugging Microservices: How Google SREs Resolve Outages
What is the work that you do today as a Google SRE?
Adam: I work for a Google DevOps team that takes care of Monarch. Monarch is a very large time series database used for querying and metrics collection. Monarch is roughly the internal equivalent of combining Prometheus, Grafana, and Graphite from the open source world. Monarch also adds to that stack all of Stackdriver and provides the backend for a lot of our cloud signals product. My role is an SRE-SWE which means I'm involved in the software engineering side as well. So a lot of my time is spent taking apart Monarch and putting it back together more durably and more reliably. Durability is especially important because Monarch is a globally distributed system (it runs in every single availability zone).
Can you give me an idea of the scope and size we’re talking about with Monarch?
Adam: I can’t be specific, but it’s very large in terms of both QPS and resources. The quantity of data per stream is extremely variable in size, from periodically receiving one byte, to receiving a constant stream of high-cardinality data. The same applies to the query side, where some queries need only fetch a single stream, and some need to fetch and aggregate a lot of them. Some consumers are doing ad hoc queries, and other teams are doing a tremendous number of queries per second to inform their actual customer facing products. Without Monarch, we have no monitoring or alerting, so it’s a critical system.