In this article, we’ll take an introductory look at deploying OpenTelemetry Collectors using various strategies to accomplish different objectives. We’ll cover the basics, including No-Collector and Agent-based deployments, and progress to advanced Gateway deployments, highlighting the important considerations to weigh when deciding what should be handled by the Agent and what can (or should) be deferred to the Gateway.
In future blog posts, we’ll take a sample application, determine what processing we need from our OpenTelemetry Collector pipelines, decide whether that processing belongs on the Agent or the Gateway, explore tail-based sampling in depth, and look at ways to monitor the health of our Collector deployments.
Introduction
OpenTelemetry has developed OTLP as its standard for sending and receiving telemetry signals. The OpenTelemetry SDK produces metrics, traces, and logs inside applications and uses OTLP to send them to any backend that supports the protocol.
When it comes to deploying OpenTelemetry Collectors, perhaps the most basic OTel deployment strategy is one that doesn’t use a Collector at all.
The “No Collector” Deployment
This pattern sends telemetry signals directly from the application producing it to the backend consuming it:
Because the backend observability platform consumes OTLP natively, there is no need for an OpenTelemetry Collector, making it quick and simple to get started. Simply run your code from your desktop and send directly to the backend. This is also very useful in cloud environments where serverless functions need a place to offload telemetry to. An available, OTLP-enabled observability backend makes it very convenient to send telemetry without a lot of configuration and management overhead.
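For illustration, here is a minimal sketch of pointing an SDK directly at a backend using the standard OTLP exporter environment variables defined by the OpenTelemetry specification; the service name, endpoint, and API key below are hypothetical placeholders:

```shell
# Minimal sketch: configure the OTel SDK's OTLP exporter to send straight
# to the backend. The endpoint and API key are hypothetical placeholders.
export OTEL_SERVICE_NAME="checkout-service"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.example-backend.com:4317"
export OTEL_EXPORTER_OTLP_HEADERS="api-key=YOUR_API_KEY"
```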
The approach is not without tradeoffs, of course. With this pattern, the application is tightly coupled to the backend, meaning that if the backend becomes unavailable, sending telemetry can block the application and cause service interruptions. Directly sending telemetry to a remote backend also incurs higher round-trip latency than offloading to a locally running Collector, as we’ll see in some of the other deployment models.
Additionally, one of the great strengths of using OpenTelemetry is the availability of processing pipelines: that is, the ability to process telemetry signals before they leave the environment where they were produced. With the No Collector strategy, there is no Collector in which to configure processing pipelines for metrics, traces, and logs. In other words, there is no centralized point to consistently enrich, redact, transform, and route telemetry. Some pipelining can be accomplished using the OTel SDK, but then making changes becomes a completely different workstream and is often less convenient than changing a Collector configuration.
The “Agent” Deployment
The Agent Deployment pattern is the approach that most newcomers to OpenTelemetry start with. It functions as a standalone means of collecting and processing telemetry signals, but it also plays a role in a more advanced deployment strategy, as we’ll see in the Gateway pattern. The approach is fairly straightforward: an OpenTelemetry Collector is deployed as close to the application or service as possible, either on the same host or within the same pod. The idea is to offload telemetry quickly and efficiently from the application to minimize interruption. Offloading to a local Collector has the added benefit of providing a local, temporary cache for storing telemetry should the backend consumers become unavailable.
This pattern consists of a single instance of an OpenTelemetry Collector (i.e., the Agent) that has its own configuration to determine where and how it can receive metrics, traces, and logs. It has any number of configured processors that can be used to enrich the telemetry. Finally, it has a set of configured exporters that enable it to egress to any number of backend consumers.
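As a sketch, a minimal Agent configuration along these lines might look like the following; the backend endpoint is a hypothetical placeholder, and the retry and queue settings on the exporter illustrate the local buffering mentioned above:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    endpoint: backend.example.com:4317 # hypothetical backend endpoint
    retry_on_failure:
      enabled: true
    sending_queue: # buffers telemetry locally if the backend is unavailable
      enabled: true
      queue_size: 5000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```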
A typical Agent Collector deployment architecture consists of an application sending to a Collector, which in turn sends to any number of backends:
Because of its flexibility, this pattern becomes the workhorse for processing telemetry signals. Unlike the No Collector pattern, configuration changes can be made to the Collector configuration without the need to modify code and redeploy an application.
The drawback comes as organizations attempt to scale up their deployment of Collectors. Maintaining a fleet of configurations that consistently enrich, redact, transform, and route telemetry across many separate Collector installations becomes unmanageable. Additionally, other situations arise that can’t be handled by a standalone Agent Collector. For instance, tail-based sampling requires a complete trace with all spans collected before a sampling decision can be made. If spans belonging to the same trace are collected by multiple Agent Collectors, a single view of the entire trace is never assembled and tail sampling is not possible. This is where the Gateway Deployment pattern comes in.
The “Gateway” Deployment
The Gateway Deployment pattern extends the Agent Deployment pattern by introducing a secondary layer of OpenTelemetry Collectors to the telemetry stream. Think of this as an aggregation layer where any number of Agent Collectors can forward their telemetry to a Gateway Collector:
For simplicity, the diagram illustrates multiple Agent Collectors feeding into a Gateway Collector instance. In reality, multiple Gateway Collector instances are clustered together:
This is done to enable scaling up or down Gateway instances as needed to accommodate demand, but also for high availability such that no single point of failure exists.
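From the Agent’s perspective, the Gateway tier is simply another OTLP destination. Here is a minimal sketch of the Agent-side exporter, assuming a hypothetical internal address fronting the Gateway cluster:

```yaml
exporters:
  otlp:
    endpoint: otel-gateway.internal.example.com:4317 # hypothetical address fronting the Gateway cluster
    tls:
      insecure: true # assumes plaintext inside the network; enable TLS as your environment requires
```

One design note: if the Gateway tier performs tail-based sampling, all spans of a given trace must arrive at the same Gateway instance, so simple round-robin load balancing is not sufficient on its own. Trace-ID-aware routing, for example via the Collector’s loadbalancing exporter, is one way to address this; we’ll explore this further when we cover tail-based sampling in depth.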
Additionally, some organizations have network and security requirements that prevent internal systems from accessing the Internet directly. Gateway deployments are a great option here, as they consolidate egress to a manageable number of points, keeping security teams happy.
Use Cases
Earlier we touched on maintaining configuration consistency across a large deployment of Agent Collectors and mentioned how the Gateway Deployment can help. Here we’ll go through some common use cases covering what is (and what is not) a good candidate for a Gateway instance. As you go through your planning exercise, think through your telemetry processing needs and determine whether each is universally applicable to all deployments or specific to a particular application. If it’s the former, it’s likely a candidate for a Gateway deployment. If it’s the latter, it’s probably best handled by an Agent deployment.
Gateway Deployment Examples
Good Examples
We want to consistently enrich telemetry signals such that all applications and services deployed in a particular environment are tagged with the same environment identifier (e.g., `dev`, `qa`, `stage`, `prod`, etc.). For this we can use an attributes processor deployed to the Gateway to always upsert the `deployment.environment.name` attribute:

```yaml
processors:
  attributes:
    actions:
      - key: deployment.environment.name
        value: stage
        action: upsert
```
We want to implement tail-based sampling to control our span egress. As discussed in the Gateway Deployment section, a fully assembled trace containing all spans is necessary to apply sampling policies. As such, we need to apply the sampling policies on the Gateway instance for sampling to work as expected.
```yaml
processors:
  tail_sampling:
    decision_wait: 1s
    expected_new_traces_per_sec: 100
    policies:
      # Policy #1
      - name: env-based-sampling-policy
        type: and
        and:
          and_sub_policy:
            - name: env-prefix-policy
              type: string_attribute
              string_attribute:
                key: env
                values:
                  - dev
                  - qa
            - name: env_sample-policy
              type: always_sample
```
We need to compute Request, Error, and Duration metrics from span data. This is a perfect candidate for a Gateway deployment for a couple of reasons:
- It consolidates the configuration of the `spanmetrics` connector to just the Gateway deployments.
- Span metrics must be computed before tail sampling policies are applied, and because tail sampling must be deployed to a Gateway instance, it makes sense to colocate this processing on the Gateway instances as well.
```yaml
connectors:
  spanmetrics:
    namespace: span.metrics

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, datadog]
    metrics:
      receivers: [spanmetrics]
      exporters: [datadog]
```
Another popular use case for OpenTelemetry Collectors is using processing pipelines to look for sensitive data in the telemetry stream before egressing it externally. Examples include looking for PHI in health care data, PCI data in financial records, or even Bearer tokens in HTTP headers. Because we want these types of rules applied universally across all telemetry, a Gateway with a consistent set of rules covers the requirements.
```yaml
processors:
  transform/replace:
    log_statements:
      - context: log
        statements:
          - set(attributes["pci_present"], "true") where IsMatch(body, "\"creditCardNumber\":\\s*\"(\\d{4}-\\d{4}-\\d{4}-\\d{4})\"")
          - replace_pattern(body, "\"creditCardNumber\":\\s*\"(\\d{4}-\\d{4}-\\d{4}-\\d{4})\"", "\"creditCardNumber\":\"***REDACTED***\"")
  redaction:
    allow_all_keys: false
    allowed_keys:
      - description
      - group
      - id
      - name
    ignored_keys:
      - safe_attribute
    blocked_values: # Regular expressions for blocking values of allowed span attributes
      - '4[0-9]{12}(?:[0-9]{3})?' # Visa credit card number
      - '(5[1-5][0-9]{14})' # MasterCard number
    summary: debug
```
Not Good Examples
There are a number of things that are not intended to be done on the Gateway instance. Typically these involve collecting metrics or attributes about the host where the application and its Agent Collector are running. Examples include:
- `hostmetrics` receiver: we want to collect host metrics from where the Agent is running, not from the Gateway host, so this needs to be deployed on the Agent side. Note that this doesn’t preclude the Gateway from collecting its own metrics; we can use a separate metrics pipeline for the Gateway’s own telemetry, distinct from the metrics we’re processing from upstream Collectors. We’ll cover this more in the next blog post.
- `k8sattributes` and `resourcedetection` processors: similar to the `hostmetrics` receiver, we want to enrich telemetry attributes with the host or pod the application is running on, so we keep these on the Agent Collector, as shown in the sketch below.
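For illustration, here is a minimal sketch of the Agent-side pieces just described (the `k8sattributes` processor would be added similarly in Kubernetes environments); the scraper set, detectors, and Gateway endpoint are assumptions, not a prescribed configuration:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
  hostmetrics: # collect metrics from the host the Agent runs on
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:

processors:
  resourcedetection: # enrich telemetry with attributes of the local host
    detectors: [env, system]

exporters:
  otlp:
    endpoint: otel-gateway.internal.example.com:4317 # hypothetical Gateway address

service:
  pipelines:
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [resourcedetection]
      exporters: [otlp]
```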
Recap
To recap, we’ve covered the three primary deployment architectures for the OpenTelemetry Collector:
- No-Collector Deployment
- Agent Deployment
- Gateway Deployment
We discussed how each deployment operates and covered some of the various use cases for using one versus another.
In the next article, we’ll go into detail on how to monitor the health of the Collector operating as both an Agent and a Gateway. We’ll cover topics that are often overlooked until users are farther along in their OTel journey but that should be considered sooner rather than later to ensure telemetry flows smoothly and without interruption.