The views, thoughts, and opinions expressed in this blog belong solely to the author, and not necessarily to the author’s employer, organization, committee, or any other group or individual.
Observability means different things to different people; some would say it is all about metrics, logs, and traces. I personally consider it the old wine of monitoring in a new bottle. However, having practiced Observability in the global KL SRE CoE team over the past couple of months, I believe it is all about bringing better and more impactful visibility into systems. Part 1 of this blog series covers The Basics of Observability.
The practice of relying solely on pre-production testing is slowly being phased out, and coding for failure is emerging. Coding for failure means acknowledging that systems will fail, that being able to debug those failures is vital, and that debuggability has to be built into the system from the ground up.
The three aspects of coding for failure are:
- Code and test with the operational semantics of the application,
- Understand the operational characteristics of the system and its dependencies, and
- Write code that is debuggable.
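To make "debuggable code" concrete, here is a minimal sketch in Python. It assumes a hypothetical `payments` service calling a dependency that may fail; the function name `call_dependency`, the event fields, and the retry count are illustrative assumptions, not taken from any particular library or from the book.

```python
import json
import logging
import time

logger = logging.getLogger("payments")  # hypothetical service name


def call_dependency(name, func, request_id, attempts=3):
    """Call func(), logging one structured event per attempt.

    Failures are expected: each one is recorded with enough context
    (dependency name, request id, attempt, latency, error) to debug later.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        start = time.monotonic()
        try:
            result = func()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            outcome = "error"
        else:
            outcome = "ok"
        logger.info(json.dumps({
            "dependency": name,
            "request_id": request_id,
            "attempt": attempt,
            "outcome": outcome,
            "latency_ms": round((time.monotonic() - start) * 1000, 2),
            "error": repr(last_error) if outcome == "error" else None,
        }))
        if outcome == "ok":
            return result
    # Keep the original exception attached so the root cause is not lost.
    raise RuntimeError(f"{name} failed after {attempts} attempts") from last_error
```

With `logging.basicConfig(level=logging.INFO)` enabled, every attempt leaves a machine-readable trail, so a failure seen in production can be tied back to a specific request and dependency.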
Testing is a best-effort verification of correctness and cannot predict every possible way in which a service might fail. This doesn’t mean that testing is useless; it is still crucial. But testing for failure involves acknowledging that certain types of failures can only be surfaced in the production environment. Testing in production is not a replacement for pre-production testing.
Testing in production draws on a range of techniques, including:
- Integration testing
- Tap compare
- Load tests
- Config tests
- Soak tests
- Traffic shaping
- Feature flagging
- Exception tracking
- Logs & events
- Chaos testing
- A/B tests
- Dynamic exploration
- Real user monitoring
- On-call experience
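As one illustration from the list above, feature flagging combined with simple traffic shaping can be sketched as a deterministic percentage rollout. This is a minimal sketch, not a production flag system: the flag name `new-checkout`, the helper `in_rollout`, and the 100-bucket scheme are assumptions made for the example.

```python
import hashlib


def in_rollout(flag, user_id, percent):
    """Return True if user_id falls inside the first `percent` of 100 buckets.

    Hashing flag and user together keeps the decision stable per user,
    and raising `percent` only ever adds users to the rollout.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def checkout(user_id, percent=10):
    """Route a small share of production traffic to the new code path."""
    if in_rollout("new-checkout", user_id, percent):
        return "new"
    return "old"
```

Starting at a low percentage exposes the new path to real traffic while limiting the blast radius; setting `percent` to 0 acts as a kill switch.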
The goal of testing in production is not to eliminate every kind of failure in the system, but to minimize failures and bring reliability to complex distributed systems.
The three pillars of observability are logs, metrics, and traces.
Having access to these three pillars doesn’t by itself make systems more observable, but they are powerful tools. Once we have a good grip on logs, metrics, and traces, we can build better systems.
Log: An event log is an immutable, time-stamped record. It comes in three formats: plain text, structured, and binary.
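A structured log can be sketched with Python's standard `logging` module. The `JsonFormatter` class, the `orders` logger name, and the `fields` key passed through `extra` are assumptions made for this example rather than a standard convention.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra key/value pairs supplied by the caller, if any.
            **getattr(record, "fields", {}),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")  # hypothetical service name
log.addHandler(handler)
log.setLevel(logging.INFO)

# Emits a single JSON line with a timestamp, level, message, and the fields.
log.info("order placed", extra={"fields": {"order_id": "o-42", "latency_ms": 18}})
```

Compared with plain text, the structured form can be filtered and aggregated by field without fragile string parsing.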
Traces and metrics are abstractions built on top of logs that encode information along two axes: one request-centric (a trace), the other system-centric (a metric).
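The two axes can be illustrated by deriving both views from the same log events. This is a deliberately simplified sketch with made-up events and field names; real tracing systems use spans with parent identifiers rather than a flat, time-ordered list.

```python
from collections import defaultdict

# Hypothetical log events: one per service hop, tagged with a request id.
EVENTS = [
    {"ts": 1, "request_id": "r1", "service": "gateway", "duration_ms": 12},
    {"ts": 2, "request_id": "r1", "service": "auth", "duration_ms": 30},
    {"ts": 3, "request_id": "r2", "service": "gateway", "duration_ms": 8},
    {"ts": 4, "request_id": "r1", "service": "db", "duration_ms": 45},
]


def to_metrics(events):
    """System-centric view: aggregate across requests, per service."""
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for event in events:
        totals[event["service"]]["count"] += 1
        totals[event["service"]]["total_ms"] += event["duration_ms"]
    return dict(totals)


def to_traces(events):
    """Request-centric view: the ordered path each request took."""
    traces = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        traces[event["request_id"]].append(event["service"])
    return dict(traces)


print(to_metrics(EVENTS)["gateway"])  # {'count': 2, 'total_ms': 20}
print(to_traces(EVENTS)["r1"])        # ['gateway', 'auth', 'db']
```

The metric answers "how is the gateway behaving overall?", while the trace answers "what happened to request r1?".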
Logs, metrics, and traces each serve their own purpose and are complementary; used together, they give maximum visibility into the behavior of distributed systems.
As Brian Knox put it, the goal of an Observability team is not to collect logs, metrics, or traces. It is to build a culture of engineering based on facts and feedback, and then spread that culture within the broader organization.
The same can be said of observability itself: it is about being data-driven during debugging and using the feedback to iterate on and improve products and services. Observability is not one thing; we have to choose our approach based on the requirements of our products and services, leveraging the three pillars of logs, metrics, and traces.
Happy Observability, Happy SRE!
Thank you, and all credits go to the KL Leadership Team, KL-SRE-CoE Team, free & paid resources, and Cindy Sridharan’s book “Distributed Systems Observability”.