"Tracing is too expensive. You should use metrics instead to save money." A popular argument for not using tracing.
This argument is only partially true. Trace data can be relatively cheap, definitely cheaper than logs, but not if you keep every trace.
With tracing, we have a technique called "sampling." It retains the statistical accuracy of your traces, and the full user context, without storing all the data.
This allows you to query incredibly in-depth information about your user journeys, quickly and cheaply, and still see what the original data looked like.
Doing this intelligently is the hard part. You will see the ability to configure sampling in most SDKs. In fact, #OpenTelemetry has just such a function in all our SDKs. We call this “head sampling” (because it decides whether to sample when the trace is first created).
The issue is that doing sampling inside an SDK means the only data available for the decision is the URL, and maybe the user agent. You don't know whether a request will error, or whether it will end up slow, and those are the interesting requests you want to know about. Those customers are having a real issue with your product.
The most effective sampling happens after the trace has finished, in something like the OpenTelemetry collector, where you can make judgement calls like that. We call this “tail sampling” (because it happens after the trace is over).
You can keep the slow traces, the errors, the important customers, because you have all the telemetry. That means you have the full context to investigate issues, instead of guessing from metrics that had their interesting context stripped at the source.
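As a sketch, those judgment calls map onto policies in the collector's `tail_sampling` processor. The attribute key, thresholds, and policy names below are hypothetical examples, not recommendations:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the trace is complete
    policies:
      - name: keep-errors       # every failed request survives sampling
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow         # traces slower than 500ms are kept
        type: latency
        latency:
          threshold_ms: 500
      - name: keep-key-customers        # hypothetical attribute
        type: string_attribute
        string_attribute:
          key: customer.tier
          values: [enterprise]
      - name: baseline          # plus a small random slice of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

The probabilistic baseline matters: it preserves a statistically representative slice of "boring" traffic so your aggregate numbers stay honest.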
So instead of ruling out tracing as too expensive, don't treat traces like logs! Look into the right ways to handle them in your pipelines, and talk to the people running your backend about how sampling is handled.