Kubernetes gets a bad rap for being mysterious, but I think that’s unfair.
When we talk about ‘mystery’ in debugging, profiling, or degraded performance, it conjures up the mental image of a hard-boiled detective staring down a web of clues arrayed on a cherry desk. I think this is a habit we’ve kept from the bad old days of puzzling out the difference between two servers named “gandalf” and “gimli”, or why everyone speaks in hushed tones about “the backupocalypse”. It’s an artifact of the past.
Kubernetes is perhaps the least mysterious thing you can find. It’s exhaustively documented, it’s remarkably consistent across most deployment methods, and it emits reams of standardized telemetry about the events occurring inside it.
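You can see this for yourself without any special tooling: the same event stream that backs `kubectl get events` is just an API resource. Here’s a minimal sketch using the official Python client, assuming a working kubeconfig:

```python
# Pull recent cluster events from the API server.
# A sketch, not production code: assumes `pip install kubernetes`
# and a kubeconfig that can list events cluster-wide.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for event in v1.list_event_for_all_namespaces(limit=50).items:
    obj = event.involved_object
    print(event.last_timestamp, event.type, event.reason,
          f"{obj.kind}/{obj.name}:", event.message)
```

Every pod eviction, failed probe, image pull error, and scheduling decision shows up in that stream, with a standardized reason and a reference to the object it concerns.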
The problem isn’t that the answers aren’t there; it’s two separate things:
– It’s hard to connect what’s going on in our application with what’s happening in Kubernetes.
– There’s too much data, and it’s not really clear what’s important unless you’re an expert on Kubernetes and your application.
Now, the former is a bit easier to fix than the latter. OpenTelemetry can annotate application telemetry with Kubernetes metadata, making it possible to associate anomalous application events with pod, node, service, or deployment events and metrics, as sketched below.
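Here’s a minimal sketch of what that looks like at the SDK level in Python. It assumes the pod spec injects pod, namespace, and node names as environment variables via the Downward API (the variable names and service name here are my own, purely illustrative); in practice, the OpenTelemetry Collector’s k8sattributes processor can do this enrichment without touching application code at all:

```python
# A sketch, assuming K8S_POD_NAME, K8S_NAMESPACE_NAME, and K8S_NODE_NAME
# are injected into the container via the Downward API in the pod spec.
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Attach Kubernetes metadata as resource attributes, using the
# OpenTelemetry semantic convention keys for Kubernetes.
resource = Resource.create({
    "service.name": "checkout",  # hypothetical service name
    "k8s.pod.name": os.environ.get("K8S_POD_NAME", "unknown"),
    "k8s.namespace.name": os.environ.get("K8S_NAMESPACE_NAME", "unknown"),
    "k8s.node.name": os.environ.get("K8S_NODE_NAME", "unknown"),
})

trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer("example")

with tracer.start_as_current_span("handle-request"):
    pass  # every span emitted here carries the Kubernetes attributes above
```

Once that metadata rides along on every span, “this request was slow” and “this pod was being evicted” stop being two unrelated facts in two different tools.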
What it doesn’t fix, though, is the second part, and I’d argue that it’s the harder one to solve. Often, Kubernetes administration isn’t something application engineers have to deal with, and there are organizational silos built up that prevent good communication between these teams. I’ve seen logs ferried back and forth as CSVs in Teams chats during incident calls, because of controls on who can actually see what! All the data in the world won’t help you if you can’t get at it, and all too often what ends up happening is that operators and developers are overwhelmed with too much data and not enough context.
This is why we talk about observability not just as a technical change, but as a cultural practice. You can’t take the old way of doing things and apply it to a new technology stack; you need to think about your system holistically. That includes who’s responsible for what, who can access telemetry data, who can make changes (and who those changes are visible to), and more. It takes some tools, yes, and some technology as well, but it also takes some thought about how you’re solving problems in the first place.