Why does observability seem out of reach for so many teams? Part of the problem, I believe, is because we’ve wishcast an ontologically pure engineering culture that, unintentionally, excludes a lot of people.
I’m going to turn my lens inward—”deploy on fridays”, “nines don’t matter”, “observability is a state and not a tool”—these are all sayings I’ve readily adopted and repeated throughout my career. I don’t think they’re wrong, either! The problem, though, is that they start from a position of you understanding what I actually mean by them, rather than what I’m saying directly.
If you boil down a lot of ‘observability-isms’, they turn into a couple of key points:
– It’s really hard to run an effective engineering organization (and by extension, a business) if you don’t understand what’s happening in your system.
– The things that actually matter to the business, like revenue or customer satisfaction, can’t be measured well with traditional ‘engineering metrics’.
– Differentiation between engineering teams is mostly a factor of agility.
When we talk about “you should be able to deploy on Fridays”, what we mean is that you should have enough agility and adaptive capacity in your engineering organization to deploy changes of pretty much any size or scale whenever you need to, without unnecessary fear or drama. Should you actually deploy on a Friday if you don’t need to? Of course not, it’s Friday! You don’t become an ontologically Good engineering lead by doing unnecessary things just because.
When we hear that observability or OpenTelemetry is ‘too confusing’, or ‘too hard’, or that people ‘don’t see the value’ I think it’s useful to step back and ask if it’s because those things are true, or if the perceptions of those things are true. It’s an important distinction—and the perception is often more real than the reality. Heck, take generative AI—the real revolution there isn’t in creating images from text prompts or text analysis, it’s the ability to use natural language to communicate with computers for complex task control and definition. People are way out over their skis with it, though, because the perception is that we’re on the verge of ‘autonomous AIs’ that will go rampant.
While Skynet is a long way off, observability isn’t. To achieve it, though, we need more than native telemetry everywhere—we need to make these concepts accessible, grounded, and based in real-world value. There’s already plenty of work here to build on. Things like Service Level Objectives (SLOs), but also high cardinality data processing and correlation between performance telemetry and user analytics. The pieces & parts all exist, our job is to stitch them together and show people that a better world is possible.
We just have to make sure that we’re not leaving folks behind on this journey.






