How to Validate & Optimize Telemetry with the OpenTelemetry Remote Tap Processor

Jon Reeve December 16, 2024

As more OpenTelemetry (OTel) instrumentation is deployed across our applications and infrastructure, we need ways to understand the actual payloads of the telemetry being sent and received, so that we can validate that data and drive accurate analysis and optimization. The OpenTelemetry collector is a powerful piece of software with many capabilities, but sometimes there’s no substitute for “cracking open” some of the telemetry data to see what’s actually being received, or sent. Common questions that arise once you start with the OTel collector include:

  • Is data even being received by the collector? How do I know?
  • How are the OTel “attributes” or “resource attributes” associated with the telemetry (logs, metrics, traces) actually getting populated (if at all)? How can I see them?
  • If I have enrichment, or other transformation/optimization “processors” in place, how do I know if they’re working and that I’ve configured them correctly?

The Remote Tap Processor

The Remote Tap Processor is currently part of the OpenTelemetry Contrib repo. Recall that the OTel collector simply takes a set of telemetry sources (“receivers”) and wires them to a set of destinations (“exporters”) via one or more pipelines. Those pipelines can be configured with “processors” to enrich, transform, and optimize the telemetry as it flows through. The remote tap processor is a type of processor that samples a subset of the traffic flowing through the pipeline and makes that sample available to a WebSocket client. You can position the processor anywhere in the pipeline (order is important, as we’ll see), and you can also add more than one of them – for example, to sample the telemetry coming in before it is transformed, and again after it leaves the pipeline post transformation.

Tapping the start and end of a logs pipeline in a collector
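In configuration terms, the layout in the figure above looks roughly like the following. This is a conceptual sketch only – the real configuration we use appears in full later in this post:

processors:
    remotetap/start:                  # tap incoming telemetry, before any transformation
        endpoint: localhost:12001
    remotetap/end:                    # tap outgoing telemetry, after all transformations
        endpoint: localhost:12002

service:
    pipelines:
        logs:
            receivers:
                - otlp
            processors:
                - remotetap/start
                # ... your enrichment / transformation / optimization processors here ...
                - remotetap/end
            exporters:
                - debug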

Setup

We will use the configuration from a previous blog here as a starting point, where we optimized all three telemetry types. In the readme for the remotetap processor, we see that it runs on port 12001 and listens on localhost by default, and that it rate limits the sampled telemetry to one message per second by default (this is configurable). We plan to leverage two remote tap processors here – one for the start of the pipeline and one for the end. The “start” tap processor will run on the default port, and the second, “end” tap processor will be configured to run on port 12002. To run the WebSocket client, we will use a utility called websocat, available here. To connect to the start tap processor from the command line, for example, we can execute the following in our terminal:

websocat ws://127.0.0.1:12001
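For reference, these defaults can be overridden in the processor’s configuration. Here is a minimal sketch, assuming the endpoint and limit options described in the readme (limit being the number of sampled messages per second):

processors:
    remotetap/end:
        endpoint: 0.0.0.0:12002   # listen on all interfaces so a Docker-published port can reach it
        limit: 1                  # sampled messages per second (the default)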

Logs Pipeline Validation

We’ll start with validating our logs pipeline. Recall that we had previously optimized our logs by filtering out logs with low severity levels (< WARN) and by deduplicating them, using two processors chained together. You can find the collector configuration in the following GitHub Gist.

There are two Remote Tap processors defined:

processors:
    remotetap/start:
        endpoint: 0.0.0.0:12001
    remotetap/end:
        endpoint: 0.0.0.0:12002

and then these two processors are referenced at the start and end of the logs pipeline respectively:

service:
    pipelines:
        logs:
            exporters:
                - debug
            processors:
                - remotetap/start
                - filter/severity
                - logdedup
                - batch
                - remotetap/end
            receivers:
                - otlp

Assuming the configuration is saved to a file called configremotetap.yaml, run the OTel collector as a Docker container as follows:

docker run -v $(pwd)/configremotetap.yaml:/etc/otelcol-contrib/config.yaml -p 4317:4317 -p 4318:4318 -p 12001:12001 -p 12002:12002 otel/opentelemetry-collector-contrib:latest

Note that in addition to the two OTLP receiver ports (that we’ll send telemetry traffic to), we have also exposed the ports for the two remote tap processors, so that we can connect to them with our WebSocket client.
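Equivalently, if you prefer Docker Compose, here is a minimal sketch of the same setup – the same image, config mount, and port mappings as the docker run command above:

services:
    otel-collector:
        image: otel/opentelemetry-collector-contrib:latest
        volumes:
            - ./configremotetap.yaml:/etc/otelcol-contrib/config.yaml
        ports:
            - "4317:4317"     # OTLP gRPC receiver
            - "4318:4318"     # OTLP HTTP receiver
            - "12001:12001"   # remotetap/start
            - "12002:12002"   # remotetap/end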

Generate Telemetry

We’ll be using the handy otelgen tool again to generate OTel telemetry to send to the OTLP receiver defined in the collector configuration. You can install it locally or run it as a container. For example, to generate logs using Docker and direct them to the OTLP receiver running in our other container (on our Mac):

docker run --rm ghcr.io/krzko/otelgen:latest  --otel-exporter-otlp-endpoint host.docker.internal:4317 --insecure logs m
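If you installed otelgen locally instead, the equivalent invocation would presumably point straight at localhost, with the same flags but no Docker host alias:

otelgen --otel-exporter-otlp-endpoint localhost:4317 --insecure logs m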

Let the logs (samples) flow!

Now we can use the websocat client to see the sampled logs from the start remote tap processor (i.e. the incoming logs before they are transformed in our pipeline) in our terminal. We’ll also use jq to make these more readable:

websocat ws://127.0.0.1:12001 | jq '.'

You should see something like the following – we can already see useful information, including the aforementioned resource attributes (in this case from K8s) and attributes. We can also see the log severity in here, which one of our subsequent processors will filter on.

Introspecting the Logs JSON
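To pull out just the K8s resource attributes, for example, you can filter the resource section of each sampled message. This is a sketch that assumes the keys use the standard k8s. prefix, as in the payload above:

websocat ws://127.0.0.1:12001 | jq '.resourceLogs[].resource.attributes[] | select(.key | startswith("k8s"))'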

Start and End

Let’s now validate the first of the optimization processors in our logs pipeline – recall that this is a filter that removes logs with severity < WARN – from our config:

    filter/severity:
        logs:
            log_record:
                - severity_number < SEVERITY_NUMBER_WARN

Let’s validate that this is happening – in one terminal, for the start tap processor, run:

websocat ws://127.0.0.1:12001 | jq '.' | grep severityText

and in the other terminal – for the end tap processor, run:

websocat ws://127.0.0.1:12002 | jq '.' | grep severityText

You should see something like:

Logs In (left) vs Logs Out (right)

That is, if the logs have been successfully filtered out, you should only see “TRACE”, “DEBUG”, or “INFO” logs at the start of the pipeline and not at the end.
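If you want the severity values alone, without the surrounding JSON, jq can pull the field out directly – this simply replaces the grep in the two commands above:

websocat ws://127.0.0.1:12001 | jq -r '.resourceLogs[].scopeLogs[].logRecords[].severityText'
websocat ws://127.0.0.1:12002 | jq -r '.resourceLogs[].scopeLogs[].logRecords[].severityText'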

Log Deduplication Verification

Let’s also revisit the log deduplication processor in our config:

    logdedup:
        interval: 5s
        log_count_attribute: dedup_count
        timezone: America/Chicago

According to the readme for this processor, the attribute defined as dedup_count above should get injected into our log messages by the deduplication processor, so we should see this attribute only at the end of our logs pipeline, and not at the beginning. This time, let’s run the following in our start tap processor terminal:

websocat ws://127.0.0.1:12001 | jq '.resourceLogs[].scopeLogs[].logRecords[].attributes[] | select(.key=="dedup_count")'

and this in our end tap processor terminal:

websocat ws://127.0.0.1:12002 | jq '.resourceLogs[].scopeLogs[].logRecords[].attributes[] | select(.key=="dedup_count")'

We should not get any results from the first command, since that attribute should not be there yet; we should see the attribute show up only in our second (end) terminal, like:

Validating the Deduplication Processor

This confirms that the processor is working (injecting the attribute), and also that we don’t appear to be getting any deduplication of logs for this particular otelgen scenario (since the “dedup_count” is always 1).
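If you only want to see records that were actually collapsed, a small variation on the jq filter above will do it. This is a sketch that assumes the count is carried as an integer value (intValue) in the OTLP JSON payload:

websocat ws://127.0.0.1:12002 | jq '.resourceLogs[].scopeLogs[].logRecords[].attributes[] | select(.key == "dedup_count" and ((.value.intValue // "0") | tonumber) > 1)'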

Traces Pipeline Validation

One more example – recall that we had a “tail based sampling” processor that would only forward traces that met a latency threshold OR had errors in them. From the config in the Gist:

    tail_sampling:
        decision_wait: 10s
        num_traces: 100
        expected_new_traces_per_sec: 10
        decision_cache:
          sampled_cache_size: 100000
        policies:
          - name: test-policy-1
            type: latency
            latency:
              threshold_ms: 200
          - name: test-policy-2
            type: status_code
            status_code:
              status_codes:
                - ERROR

Looking at the collector configuration we applied from the Gist, we can see that we’ve configured the remote tap processors for the traces pipeline the same way we did for the logs one – with the “start” tap processor first in the pipeline and the “end” tap processor last:

        traces:
            exporters:
                - debug
            processors:
                - remotetap/start
                - tail_sampling
                - batch
                - remotetap/end
            receivers:
                - otlp

If you haven’t already, you can cancel your logs generation from otelgen – we’re going to generate traces into the traces pipeline now. Running otelgen with the -h help option shows that trace generation supports various scenarios:

Trace Generation

We will use the web_request scenario to generate traces (1000 of them in this example) like so:

docker run --rm ghcr.io/krzko/otelgen:latest --otel-exporter-otlp-endpoint host.docker.internal:4317 --insecure traces m -t 1000 -s web_request

We can start by taking a look at some sample traces coming in to the traces pipeline with the start remote tap processor:

websocat ws://127.0.0.1:12001 | jq '.'

You should see something like the following – resourceSpans objects, each with an array of spans corresponding to a given trace (ID). Recall that the tail sampling processor we set up was looking for traces with a latency greater than 200ms (above).

Sample Span/Trace Data

We can compute the latency of individual spans by subtracting the start time from the end time using jq, as follows:

websocat ws://127.0.0.1:12001 | jq '. |= (.duration = ((.resourceSpans[].scopeSpans[].spans[].endTimeUnixNano|tonumber) - (.resourceSpans[].scopeSpans[].spans[].startTimeUnixNano|tonumber))/1000000)' | grep duration

This computes a duration (in ms) and adds that field to the JSON, which we then grep out. Now let’s run this same command against both the start and end processors, as we did above, to look at the latency numbers before and after the tail sampling processor – you should see something like:

Span Latency before (left) and after (right) the tail sampling processor

We can see that the latencies reported for spans after the tail sampling processor are now all > 200ms, as we expect 🙌  Astute readers familiar with the tail sampling processor will note that we are really computing span latency here, not trace latency (the latency across the collection of spans sharing the same trace ID) – but if a single span takes more than 200ms, the trace itself must have at least that much latency, so this is a pretty good indicator that things are working as we would expect.
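As a variant, if you prefer a per-span view to grepping a computed field, something like the following should work. It uses the span fields from the payloads above (name, startTimeUnixNano, endTimeUnixNano) and prints one small object per span:

websocat ws://127.0.0.1:12002 | jq '.resourceSpans[].scopeSpans[].spans[] | {name: .name, durationMs: (((.endTimeUnixNano|tonumber) - (.startTimeUnixNano|tonumber)) / 1000000)}'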

Summary

In this blog, we reviewed the need for, and the power of, getting at the underlying data flowing through the OTel collector – whether it’s to understand the specifics of the data coming in, to ensure transformation or optimization is happening as we expect, or simply to confirm that the collector is functioning at all! We talked about the need to be able to program our collectors “on the fly” in a previous post – the remote tap processor is another great example of that. Furthermore, getting at the underlying data opens up the potential of driving intelligent feedback loops that optimize the telemetry for us – think self-driving for your observability 🙂  If you’re interested in learning more about this and what we’re working on, don’t hesitate to reach out!

For more information, please contact us at info@controltheory.com
