In this article, we’ll take an introductory look at deploying OpenTelemetry Collectors using various strategies to accomplish different objectives. We’ll cover the basics, including No-Collector and Agent-based deployments, and progress to advanced Gateway deployments, highlighting the important considerations to weigh when deciding what should be handled by the Agent and what can (or should) be deferred to the Gateway.
In future blog posts, we’ll take a sample application, determine what processing we need from our OpenTelemetry Collector pipelines, decide whether that processing belongs on the Agent or the Gateway, explore tail-based sampling in depth, and look at ways to monitor the health of our Collector deployments.
Introduction
OpenTelemetry has developed OTLP as its standard for sending and receiving telemetry signals. The OpenTelemetry SDK produces metrics, traces, and logs inside applications and uses OTLP to send them to any backend that supports the protocol.
When it comes to deploying OpenTelemetry Collectors, perhaps the most basic OTel deployment strategy is one that doesn’t use a Collector at all.
The “No Collector” Deployment
This pattern sends telemetry signals directly from the application producing it to the backend consuming it:
Because the backend observability platform consumes OTLP natively, there is no need for an OpenTelemetry Collector, making it quick and simple to get started. Simply run your code from your desktop and send directly to the backend. This is also very useful in cloud environments where serverless functions need a place to offload telemetry to. An available, OTLP-enabled observability backend makes it very convenient to send telemetry without a lot of configuration and management overhead.
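For illustration, here is a minimal sketch of pointing an SDK directly at a backend using the standard OTLP exporter environment variables defined by the OpenTelemetry specification; the service name, endpoint, and API key below are hypothetical placeholders:

```shell
# Minimal sketch: configure the OTel SDK's OTLP exporter to send straight
# to the backend. The endpoint and API key are hypothetical placeholders.
export OTEL_SERVICE_NAME="checkout-service"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.example-backend.com:4317"
export OTEL_EXPORTER_OTLP_HEADERS="api-key=YOUR_API_KEY"
```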
The approach is not without tradeoffs, of course. With this pattern, the application is tightly coupled to the backend, meaning that if the backend becomes unavailable, sending telemetry can block the application and cause service interruptions. Directly sending telemetry to a remote backend also incurs higher round-trip latency than offloading to a locally running Collector, as we’ll see in some of the other deployment models.
Additionally, one of the great strengths of using OpenTelemetry is the availability of processing pipelines: that is, the ability to process telemetry signals before they leave the environment where they were produced. With the No Collector strategy, there is no Collector in which to configure processing pipelines for metrics, traces, and logs. In other words, there is no centralized point to consistently enrich, redact, transform, and route telemetry. Some pipelining can be accomplished using the OTel SDK, but then making changes becomes a completely different workstream and is often less convenient than changing a Collector configuration.
The “Agent” Deployment
The Agent Deployment pattern is the approach that most newcomers to OpenTelemetry start with. It functions as a standalone means of collecting and processing telemetry signals, but it also plays a role in a more advanced deployment strategy, as we’ll see in the Gateway pattern. The approach is fairly straightforward: an OpenTelemetry Collector is deployed as close to the application or service as possible, either on the same host or within the same pod. The idea is to offload telemetry quickly and efficiently from the application to minimize interruption. Offloading to a local Collector has the added benefit of providing a local, temporary cache for storing telemetry should the backend consumers become unavailable.
This pattern consists of a single instance of an OpenTelemetry Collector (i.e., the Agent) that has its own configuration to determine where and how it can receive metrics, traces, and logs. It has any number of configured processors that can be used to enrich the telemetry. Finally, it has a set of configured exporters that enable it to egress to any number of backend consumers.
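As a sketch, a minimal Agent configuration along these lines might look like the following; the backend endpoint is a hypothetical placeholder, and the retry and queue settings on the exporter illustrate the local buffering mentioned above:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    endpoint: backend.example.com:4317 # hypothetical backend endpoint
    retry_on_failure:
      enabled: true
    sending_queue: # buffers telemetry locally if the backend is unavailable
      enabled: true
      queue_size: 5000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```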
A typical Agent Collector deployment architecture consists of an application sending to a Collector, which in turn sends to any number of backends:
Because of its flexibility, this pattern becomes the workhorse for processing telemetry signals. Unlike the No Collector pattern, configuration changes can be made to the Collector configuration without the need to modify code and redeploy an application.
The drawback comes as organizations attempt to scale up their deployment of Collectors. Maintaining a fleet of configurations that consistently enrich, redact, transform, and route telemetry across many separate Collector installations becomes unmanageable. Additionally, other situations arise that can’t be handled by a standalone Agent Collector. For instance, tail-based sampling requires a complete trace with all spans collected before a sampling decision can be made. If spans belonging to the same trace are collected by multiple Agent Collectors, a single view of the entire trace is never assembled and tail sampling is not possible. This is where the Gateway Deployment pattern comes in.
The “Gateway” Deployment
The Gateway Deployment pattern extends the Agent Deployment pattern by introducing a secondary layer of OpenTelemetry Collectors to the telemetry stream. Think of this as an aggregation layer where any number of Agent Collectors can forward their telemetry to a Gateway Collector:
For simplicity, the diagram illustrates multiple Agent Collectors feeding into a Gateway Collector instance. In reality, multiple Gateway Collector instances are clustered together:
This is done to enable scaling up or down Gateway instances as needed to accommodate demand, but also for high availability such that no single point of failure exists.
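From the Agent’s perspective, the Gateway tier is simply another OTLP destination. Here is a minimal sketch of the Agent-side exporter, assuming a hypothetical internal address fronting the Gateway cluster:

```yaml
exporters:
  otlp:
    endpoint: otel-gateway.internal.example.com:4317 # hypothetical address fronting the Gateway cluster
    tls:
      insecure: true # assumes plaintext inside the network; enable TLS as your environment requires
```

One design note: if the Gateway tier performs tail-based sampling, all spans of a given trace must arrive at the same Gateway instance, so simple round-robin load balancing is not sufficient on its own. Trace-ID-aware routing, for example via the Collector’s loadbalancing exporter, is one way to address this; we’ll explore this further when we cover tail-based sampling in depth.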
Additionally, some organizations have network and security requirements that prevent internal systems from accessing the Internet directly. Gateway deployments are a great option here, as they consolidate egress to a manageable number of points, keeping security teams happy.
Use Cases
Earlier we touched on maintaining configuration consistency across a large deployment of Agent Collectors and mentioned how the Gateway Deployment can help. Here we’ll go through some common use cases covering what is (and what is not) a good candidate for a Gateway instance. As you go through your planning exercise, think through your telemetry processing needs and determine whether each is universally applicable to all deployments or specific to a particular application. If it’s the former, it’s likely a candidate for a Gateway deployment. If it’s the latter, it’s probably best handled by an Agent deployment.
Gateway Deployment Examples
Good Examples
We want to consistently enrich telemetry signals such that all applications and services deployed in a particular environment are tagged with the same environment identifier (e.g., `dev`, `qa`, `stage`, `prod`, etc.). For this we can use an attributes processor deployed to the Gateway to always upsert the `deployment.environment.name` attribute:

```yaml
processors:
  attributes:
    actions:
      - key: deployment.environment.name
        value: stage
        action: upsert
```
We want to implement tail-based sampling to control our span egress. As discussed in the Gateway Deployment section, a fully assembled trace containing all spans is necessary to apply sampling policies. As such, we need to apply the sampling policies on the Gateway instance for sampling to work as expected.
```yaml
processors:
  tail_sampling:
    decision_wait: 1s
    expected_new_traces_per_sec: 100
    policies:
      # Policy #1
      - name: env-based-sampling-policy
        type: and
        and:
          and_sub_policy:
            - name: env-prefix-policy
              type: string_attribute
              string_attribute:
                key: env
                values:
                  - dev
                  - qa
            - name: env_sample-policy
              type: always_sample
```
We need to compute Request, Error, and Duration metrics from span data. This is a perfect candidate for a Gateway deployment for a couple of reasons:
- It consolidates the configuration of the `spanmetrics` connector to just the Gateway deployments.
- Span metrics must be computed before tail sampling policies are applied, and because tail sampling must be deployed to a Gateway instance, it makes sense to colocate this processing on the Gateway instances as well.
```yaml
connectors:
  spanmetrics:
    namespace: span.metrics

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, datadog]
    metrics:
      receivers: [spanmetrics]
      exporters: [datadog]
```
Another popular use case for OpenTelemetry Collectors is using processing pipelines to look for sensitive data in the telemetry stream before egressing it externally. Examples include looking for PHI in health care data, PCI data in financial records, or even Bearer tokens in HTTP headers. Because we want these types of rules applied universally across all telemetry, a Gateway with a consistent set of rules covers the requirements.
```yaml
processors:
  transform/replace:
    log_statements:
      - context: log
        statements:
          - set(attributes["pci_present"], "true") where IsMatch(body, "\"creditCardNumber\":\\s*\"(\\d{4}-\\d{4}-\\d{4}-\\d{4})\"")
          - replace_pattern(body, "\"creditCardNumber\":\\s*\"(\\d{4}-\\d{4}-\\d{4}-\\d{4})\"", "\"creditCardNumber\":\"***REDACTED***\"")
  redaction:
    allow_all_keys: false
    allowed_keys:
      - description
      - group
      - id
      - name
    ignored_keys:
      - safe_attribute
    blocked_values: # Regular expressions for blocking values of allowed span attributes
      - '4[0-9]{12}(?:[0-9]{3})?' # Visa credit card number
      - '(5[1-5][0-9]{14})' # MasterCard number
    summary: debug
```
Not Good Examples
There are a number of things that are not intended to be done on the Gateway instance. Typically these involve collecting metrics or attributes about the host where the application and its Agent Collector are running. Examples include:
- `hostmetrics` receiver: we want to collect host metrics from where the Agent is running, not from the Gateway host, so this needs to be deployed on the Agent side. Note that this doesn’t preclude the Gateway from collecting its own metrics; we can use a separate metrics pipeline for the Gateway’s own telemetry, distinct from the metrics we’re processing from upstream Collectors. We’ll cover this more in the next blog post.
- `k8sattributes` and `resourcedetection` processors: similar to the `hostmetrics` receiver, we want to enrich telemetry attributes with the host or pod the application is running on, so we keep these on the Agent Collector, as shown in the sketch below.
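For illustration, here is a minimal sketch of the Agent-side pieces just described (the `k8sattributes` processor would be added similarly in Kubernetes environments); the scraper set, detectors, and Gateway endpoint are assumptions, not a prescribed configuration:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
  hostmetrics: # collect metrics from the host the Agent runs on
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:

processors:
  resourcedetection: # enrich telemetry with attributes of the local host
    detectors: [env, system]

exporters:
  otlp:
    endpoint: otel-gateway.internal.example.com:4317 # hypothetical Gateway address

service:
  pipelines:
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [resourcedetection]
      exporters: [otlp]
```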
Recap
To recap, we’ve covered the three primary deployment architectures for the OpenTelemetry Collector:
- No-Collector Deployment
- Agent Deployment
- Gateway Deployment
We discussed how each deployment operates and covered some of the various use cases for using one versus another.
In the next article, we’ll go into detail on how to monitor the health of the Collector operating as both an Agent and a Gateway. We’ll cover topics that are often overlooked until users are farther along in their OTel journey but that should be considered sooner rather than later to ensure telemetry flows smoothly and without interruption.