“When it’s done well, it does look easy. People have no idea how complicated and difficult it really is. When you think of a movie, most people imagine a two-hour finished, polished product. But to get to that two-hour product, it can take hundreds or thousands of people many months of full-time work.”

-George Kennedy

Disclaimer: The views, thoughts, and opinions expressed in this blog belong solely to the author, and not necessarily to the author’s employer, organization, committee, or other group or individual.

Containers, orchestrators, microservices architectures, service meshes, immutable infrastructure, and serverless computing, realized in tools such as Docker and Kubernetes, are visionary concepts that have fundamentally shifted the way software is built and operated today.

Observability means different things to different people: some might say it is all about metrics, logs, and traces. At first, I thought of it as the old wine of monitoring in a new bottle. But now, having practiced Observability in the global KL SRE CoE team, I believe it is all about bringing better, more impactful visibility into systems.

I would also like to quote Cindy Sridharan’s definition here:

“Observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained, and evolved in acknowledgment of all the below facts:

  • The ease of debugging is a cornerstone for the maintenance and evolution of robust systems.
  • No complex system is ever fully healthy, and distributed systems are unpredictable.
  • Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, most importantly, operation.”

So, an observable system isn’t achieved simply by having monitoring or an SRE team in place. Observability is a property that must be designed into a system from the start, so that the system can be built to allow a certain degree of testing in production, deployed incrementally, and, after release, produce reports with insightful, intelligent data points.

Observability and Monitoring are complementary, one is not a substitute for the other. 

So let us treat Observability as a superset of both monitoring and testing: it provides insight into the unpredictable failures of cloud-native distributed systems that could not have been monitored for or tested in advance.
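One common way to make such unpredictable failures debuggable is to emit wide, structured events with rich context for every unit of work, so that questions nobody anticipated can still be answered afterwards. Here is a minimal sketch; the field names and the `emit_event` helper are illustrative assumptions, not a standard schema:

```python
import json
import time
import uuid


def emit_event(service, operation, status, duration_ms, **context):
    """Emit one wide, structured event per unit of work.

    The arbitrary keyword context (user_id, region, retry_count, ...)
    is what lets you debug failure modes you never predicted.
    """
    event = {
        "timestamp": time.time(),
        "trace_id": context.pop("trace_id", str(uuid.uuid4())),
        "service": service,
        "operation": operation,
        "status": status,
        "duration_ms": duration_ms,
        **context,  # high-cardinality fields ride along with the event
    }
    print(json.dumps(event))  # in practice: ship to your event pipeline
    return event


# Example: one failed request, captured with enough context to slice
# later by user, region, or retry count.
evt = emit_event("checkout", "charge_card", "error", 412.0,
                 user_id="u-123", region="eu-west-1", retry_count=2)
```

Because every event is self-describing, no new instrumentation is needed when a novel question comes up; you query the events you already have.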

Monitoring is best suited to report the overall health of systems and to derive alerts.

Alerting covers both machine-centric and human-centric responses. Kubernetes, for example, ships with out-of-the-box controllers and operators that remediate certain failures automatically, while other failures still need to alert a human.

Monitoring data should at all times provide a bird’s-eye view of the overall health of a distributed system by recording and exposing high-level metrics over time across all components of the system: load balancers, caches, queues, databases, and stateless services. Just as important, the on-call experience must be humane and sustainable: every alert needs to be actionable, not simply an email blast of every log-derived alert.
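The idea of paging only on actionable, rolled-up signals rather than raw log lines can be sketched as follows. This is a minimal illustration, not a production alerting pipeline; the window size and threshold are assumed values you would tune per service:

```python
from collections import deque


class ErrorRateAlert:
    """Derive an actionable page from a high-level metric (rolling
    error rate) instead of alerting on every individual failure."""

    def __init__(self, window=100, threshold=0.05):
        self.window = deque(maxlen=window)  # last N request outcomes
        self.threshold = threshold          # page above this error rate

    def record(self, ok: bool):
        self.window.append(ok)

    def should_page(self) -> bool:
        if len(self.window) < self.window.maxlen:
            return False                    # not enough signal yet
        errors = self.window.count(False)
        return errors / len(self.window) > self.threshold


# 3 failures out of the last 10 requests: a sustained 30% error rate
# crosses the 20% threshold, so the on-call engineer gets exactly one
# actionable page instead of 3 separate failure emails.
alert = ErrorRateAlert(window=10, threshold=0.2)
for ok in [True] * 7 + [False] * 3:
    alert.record(ok)
print(alert.should_page())
```

Real systems would express the same idea as an alerting rule over time-series metrics (for example, a rate-over-window expression in a monitoring system), but the principle is identical: alert on symptoms at a level a human can act on.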

Let us continue the conversation in Part 2/2 of this blog series, where we will discuss USE & RED Metrics methodologies for monitoring and alerting of the system.

Thank you, and all credits go to the KL Leadership team, the KL SRE CoE team, free and paid resources, and Cindy Sridharan’s book “Distributed Systems Observability”.

