“I have observability. I have my metrics going into Prometheus, and we have a Grafana dashboard.”
– several people at recent conferences
(They didn’t exactly ask, but here @martinthwaites answers anyway.)
I don’t completely disagree. However, the statement is loaded with context.
First, metrics on their own aren’t Observability. I’m not talking about the “3 pillars.” (There aren’t 3 pillars; there are actually 3 or 4 signals.) Observability is about asking questions of production and getting answers about its current state.
Therefore, if you can get all the answers about the current state from metrics alone, I’d agree: you’ve achieved 100% Observability. Gold star for you. The reality, though, is that this is impossible. Infrastructure metrics give you only a slice of the picture. They tell you about servers, pods, CPU, and memory, but customers don’t care about those.
Remember: “Pods only matter when users aren’t happy.”
The sentiment here is that no one cares how many pods are running if users’ requests are succeeding. Those metrics ARE useful for autoscaling, hardware sizing, and cost management. Until an actual user or customer is having a bad time, though, we don’t need to look at them. At that point, you might need to investigate pod and server metrics.
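To make the autoscaling use case concrete: the Kubernetes Horizontal Pod Autoscaler documents its core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below implements only that ratio; the real controller also applies a tolerance band, stabilization windows, and min/max replica limits, all omitted here. Notice that nothing in it knows whether users are happy.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA ratio: scale the replica count by how far the metric is from its target.

    Simplified sketch: ignores the HPA's tolerance, stabilization window,
    and min/max replica bounds.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * (current_metric / target_metric))


# 4 pods averaging 90% CPU against a 60% target -> scale out to 6 pods.
print(desired_replicas(4, 90, 60))  # 6
```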
Application metrics (request counts, durations, etc.) can be replaced with traces if you have the right backend to generate graphs from them for analysis, and you’re willing to do some intelligent sampling to keep the useful data, such as errors and slow requests.
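A minimal sketch of both ideas, assuming a hypothetical in-memory list of traces reduced to each request’s root-span duration and error status (real backends work on full spans and typically do this at query time). The sampler is tail-based: it decides after a trace completes, keeps every error and every slow trace, keeps roughly one in `base_rate` of the rest, and records that rate on each kept trace. `request_metrics` then re-weights by sample rate, so request counts, error rates, and latency percentiles can still be graphed from sampled data. `Trace`, `SampledTrace`, `tail_sample`, the 1,000 ms threshold, and the base rate of 10 are illustrative assumptions, not any particular vendor’s API.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical, simplified trace: just the root span's outcome for one request.
    duration_ms: float
    is_error: bool


@dataclass
class SampledTrace:
    trace: Trace
    sample_rate: int  # this kept trace stands in for `sample_rate` original requests


def tail_sample(traces, slow_ms=1000.0, base_rate=10, rng=None):
    """Tail-based sampling: decide after each trace completes, so its outcome is known.

    Every error and every trace at or above `slow_ms` is kept at rate 1;
    the remaining "boring" traces are kept with probability 1/base_rate.
    """
    if base_rate < 1:
        raise ValueError("base_rate must be >= 1")
    rng = rng or random.Random()
    kept = []
    for t in traces:
        if t.is_error or t.duration_ms >= slow_ms:
            kept.append(SampledTrace(t, 1))
        elif rng.random() < 1 / base_rate:
            kept.append(SampledTrace(t, base_rate))
    return kept


def weighted_percentile(sampled, p):
    """Latency percentile where each kept trace counts `sample_rate` times."""
    if not sampled:
        return math.nan
    ordered = sorted(sampled, key=lambda s: s.trace.duration_ms)
    threshold = p * sum(s.sample_rate for s in ordered)
    running = 0
    for s in ordered:
        running += s.sample_rate
        if running >= threshold:
            return s.trace.duration_ms
    return ordered[-1].trace.duration_ms


def request_metrics(sampled):
    """Rebuild classic application metrics from sampled traces, re-weighted by sample rate."""
    requests = sum(s.sample_rate for s in sampled)
    errors = sum(s.sample_rate for s in sampled if s.trace.is_error)
    return {
        "requests": requests,
        "errors": errors,
        "error_rate": errors / requests if requests else 0.0,
        "p50_ms": weighted_percentile(sampled, 0.50),
        "p99_ms": weighted_percentile(sampled, 0.99),
    }


# Usage: 1,000 fast successes, 5 errors, 3 slow requests.
traces = [Trace(50, False)] * 1000 + [Trace(80, True)] * 5 + [Trace(2000, False)] * 3
kept = tail_sample(traces, rng=random.Random(42))
print(len(kept), request_metrics(kept))
```

Because errors and slow traces are kept at rate 1, their counts stay exact and every one of them is available for detailed investigation; only the fast, successful traffic is estimated from a sample.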
By coupling infrastructure metrics with sampled traces, you can likely get to “full observability”: the ability to ask questions you didn’t know you needed to ask and get detailed answers. Both signals are useful in different contexts, but together they’re amazing.