Ask Miss o11y – Sampling and Application engineers


“Should my engineers care about how I set up sampling of our telemetry data?”

This question comes from a platform engineer who owns the observability pipeline, working with any number of application engineers who change code that sends useful data to this pipeline. 

This is an exceptionally good question, and the short answer is “yes, but not all of it”: they should understand the high-level decisions, but not the underlying implementation. Let’s dive into a few things that should inform what you tell them and what you don’t.

What is Sampling?

At its heart, sampling is a cost-saving exercise for telemetry data. Storing everything is expensive, and most traces are similar to each other. Sampling sends a smaller amount of data that is a good representation of what the full data looks like. For example, say your API for getting product data, `/product/{productId}`, received 10,000 requests in a particular second. Send the spans from only 1,000 of those requests and then say “each one of these represents 10 requests.” You’ve saved yourself 90% of your telemetry costs for storage and network.
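To make the arithmetic concrete, here is a minimal Python sketch of that 1-in-10 example. It is an illustration, not how any particular sampler is implemented: the `keep` and `estimated_total` functions, the `sample_rate` field name, and the hash-based decision are all assumptions made for this example.

```python
import hashlib

SAMPLE_RATE = 10  # keep 1 in 10 requests; each kept request represents 10


def keep(trace_id: str, sample_rate: int = SAMPLE_RATE) -> bool:
    """Keep roughly 1 in `sample_rate` traces, decided by a hash of the trace ID.

    Hashing (rather than rolling a die per span) means every span of the
    same trace gets the same decision, so whole stories are kept or dropped.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % sample_rate == 0


def estimated_total(kept_events: list[dict]) -> int:
    """Estimate the original request count by weighting each kept event."""
    return sum(event["sample_rate"] for event in kept_events)


# 10,000 hypothetical requests to /product/{productId}
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [
    {"trace_id": t, "sample_rate": SAMPLE_RATE}
    for t in trace_ids
    if keep(t)
]

print(f"kept {len(kept)} events, representing ~{estimated_total(kept)} requests")
```

The kept count lands near 1,000 rather than exactly on it, which is why the estimate is "about 10,000" and not a precise figure.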

The downside here is that you don’t see each and every request when you’re looking at the raw data in your telemetry backend. It’s all about trade-offs, and what we’ve found is that you can, very effectively, see and debug production issues with a decent sample of production telemetry. 

The trick is choosing which data to keep (sample). Tracing gives us the power to keep the whole story of a request, and drop many similar stories. But what makes a story similar? That is the hard stuff, where application engineers need to chime in.

Sampling traces purely at random rarely gives good data, because you’re unlikely to get a view of all the different types of traces. Instead, you should sample based on what you know about your system. This is where I think it’s important to engage with the application engineers, as they have all the knowledge about the system. They know which endpoints are important: a POST to /checkout is way more important and varied than a GET to /status for a healthcheck. They also know which requests are alike: requests from a particular tenant are likely to be similar to each other, as is each GET for a particular product. Rules built on that knowledge let us keep a useful sample of all the different stories happening in our system.
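As a rough sketch of what "sample based on what you know" can look like, the Python below picks a sample rate from the first matching rule. The `Rule` class, the rule names, the trace fields (`error`, `method`, `route`) and the rates are all hypothetical, chosen to mirror the examples above. This is not Refinery's configuration format.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[dict], bool]
    sample_rate: int  # keep 1 in `sample_rate` matching traces


# Hypothetical rules, checked in order; the first match wins.
RULES = [
    Rule("keep-all-errors", lambda t: t.get("error", False), 1),
    Rule(
        "keep-all-checkouts",
        lambda t: t.get("method") == "POST" and t.get("route") == "/checkout",
        1,
    ),
    Rule("drop-most-healthchecks", lambda t: t.get("route") == "/status", 1000),
]
DEFAULT_RULE = Rule("default", lambda t: True, 10)


def pick_rule(trace: dict) -> Rule:
    """Return the first rule that matches the trace, or the default."""
    for rule in RULES:
        if rule.matches(trace):
            return rule
    return DEFAULT_RULE


def should_keep(trace: dict) -> bool:
    """Decide from a hash of the trace ID, using the matching rule's rate."""
    rule = pick_rule(trace)
    digest = hashlib.sha256(trace["trace_id"].encode()).digest()
    return int.from_bytes(digest[:8], "big") % rule.sample_rate == 0


checkout = {"trace_id": "abc", "method": "POST", "route": "/checkout"}
failed_healthcheck = {"trace_id": "def", "method": "GET", "route": "/status", "error": True}
print(pick_rule(checkout).name, should_keep(checkout))
print(pick_rule(failed_healthcheck).name, should_keep(failed_healthcheck))
```

Rule order matters here: a failing healthcheck matches the error rule before the healthcheck rule, so it is kept. That ordering is exactly the kind of decision application engineers are well placed to review.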

Why should engineers know?

Recently an application engineer, someone who used Honeycomb in their role, suggested to me that Honeycomb “sampled too little” (meaning that there was a smaller amount of data than they expected) and therefore they didn’t get the interesting data they wanted.

This surprised me, as Honeycomb doesn’t sample anything. We provide customers with Refinery, which platform engineers run and configure with sampling rules based on their unique system characteristics. Honeycomb accepts everything that’s sent to us; by design, there’s no magic on our side.

This led me to the conclusion that it is actually quite important for application engineers to understand how the sampling in your organization is configured. Further, they should be able to influence how those rules evolve over time. The tools and infrastructure behind the sampling, though, are not important to them.

What should engineers know?

This is a tough one to answer definitively, as the real answer is “it depends” (can you tell I was a consultant for a long time?). So let me try and detail some of the things that engineers might want to know and why.

  1. High level sampling rules
    This means that if your sampling ensures that all traces with errors come through, they should know. If you’re discarding the majority of traces for a particular endpoint, they should know.
  2. Individual span sampling rules
    If a particular trace is part of a sampled set, which rule caused it to be kept? Was it part of an error set? Did someone think this was an unimportant endpoint?
  3. How and where to ask questions about sampling
    The engineers should know who to ask about the sampling rules and how effective they are, and where to make suggestions when the rules are excluding interesting things.
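One way to make the second point answerable is to record the sampling decision on the data itself. The sketch below is an assumption-laden illustration in Python: the `sampling.rule` and `sampling.rate` field names and both helper functions are made up for this example, so check what your own sampler can actually attach before relying on anything like it.

```python
from collections import Counter


def annotate(event: dict, rule_name: str, sample_rate: int) -> dict:
    """Return a copy of the event tagged with the rule and rate that kept it."""
    return {**event, "sampling.rule": rule_name, "sampling.rate": sample_rate}


def summarize_by_rule(kept_events: list[dict]) -> dict[str, dict[str, int]]:
    """Per rule: how many events were kept and how many requests they represent."""
    kept = Counter()
    represented = Counter()
    for event in kept_events:
        rule = event["sampling.rule"]
        kept[rule] += 1
        represented[rule] += event["sampling.rate"]
    return {
        rule: {"kept": kept[rule], "represented": represented[rule]}
        for rule in kept
    }


events = [
    annotate({"route": "/checkout"}, "keep-all-checkouts", 1),
    annotate({"route": "/checkout"}, "keep-all-checkouts", 1),
    annotate({"route": "/status"}, "drop-most-healthchecks", 1000),
    annotate({"route": "/product/42"}, "default", 10),
]

for rule, counts in summarize_by_rule(events).items():
    print(rule, counts)
```

A summary like this gives application engineers something specific to point at when they ask why an endpoint looks quieter than they expected.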

Conclusion

Sampling is really powerful, but can seem like a black box of magic. When you don’t even know that the black box of magic exists, it can feel like the telemetry system isn’t working for you. Involve your application engineers, don’t hide sampling from them, educate them and produce resources so they know what’s going on, because ultimately they’ll save you money!
