Telemetry collector pod crash-loops after upgrading to Spaces 1.14 when using the Datadog exporter

Last updated: June 12, 2026

Summary

After upgrading to Upbound Spaces 1.14.x (or upgrading the OpenTelemetry Operator to 0.137.0 or later), the telemetry collector pod may fail to start and enter CrashLoopBackOff if your SharedTelemetryConfig uses the Datadog exporter.

This happens because the new operator version adds a startup probe that kills the collector 30 seconds after launch, while the Datadog exporter can spend ~38 seconds trying to auto-detect its hostname before it reports healthy. The collector is killed before it ever finishes starting.

The fix is a one-line change to your SharedTelemetryConfig: set hostname_detection_timeout: 5s on the Datadog exporter. No downgrade or operator changes are needed.

Am I affected?

You are affected if all of the following apply:

Condition	Affected values
Upbound Spaces version	1.14.0 – 1.14.6 (or any setup with OTEL Operator ≥ 0.137.0)
OTEL Operator version	0.137.0 – 0.139.0 (Helm chart 0.98.0 – 0.100.0)
Telemetry exporter	Datadog
Environment	EKS, or any cluster where the cloud metadata service (`169.254.169.254`) is unreachable from pods

You are not affected on Spaces 1.13.x or earlier (OTEL Operator ≤ 0.136.0), or if you use a different exporter.

What you'll see

After the upgrade, the telemetry-control-plane-collector pod restarts repeatedly and ends up in CrashLoopBackOff:

$ kubectl get pods -n <spaces-namespace>
NAME                                      READY   STATUS             RESTARTS
telemetry-control-plane-collector-...     0/1     CrashLoopBackOff   6

The pod terminates with exit code 137 (killed), and its logs are empty — the collector never got far enough to log anything. The pod events show the real clue:

Container otc-container failed startup probe, will be restarted

Why it happens

Two independent behaviors interact badly:

The operator now enforces a startup deadline. Starting with version 0.137.0, the OpenTelemetry Operator automatically adds a startup probe to the collector container. The probe checks the collector's health endpoint (port 13133) every 10 seconds and gives up after 3 failures — so the collector must be healthy within 30 seconds of starting. Older operator versions only used a liveness probe with a 60-second grace period, which is why this never surfaced before the upgrade.
The Datadog exporter is slow to start when it can't reach cloud metadata. Before the collector opens its health endpoint, the Datadog exporter tries to auto-detect the host's name by querying the cloud provider metadata service (169.254.169.254). On EKS (and many hardened clusters), pods cannot reach that address, so the requests hang until a built-in 25-second timeout expires, after which the exporter falls back to the operating system hostname. End to end, the collector takes about 38 seconds to become healthy.

38 seconds of startup against a 30-second deadline means the pod is killed every time, indefinitely.

This is a known and documented behavior of the Datadog exporter, not a bug in Spaces or in your configuration.

How to fix it

Add hostname_detection_timeout: 5s to the Datadog exporter section of every affected SharedTelemetryConfig:

apiVersion: observability.spaces.upbound.io/v1alpha1kind: SharedTelemetryConfigmetadata:name: <your-stc-name>namespace: defaultspec:exporters:
    datadog:
      api:
        key: "<redacted>"
        site: datadoghq.com
      hostname_detection_timeout: 5s# ... rest of spec unchanged

Apply the change with kubectl apply. The Spaces controller picks it up and rolls out a new collector pod within about a minute, and the pod starts cleanly.

Why this works: on clusters where the metadata service is unreachable, hostname detection always ends at the same fallback (the OS hostname) — the default 25-second timeout just delays getting there. Shortening it to 5 seconds means the collector becomes healthy in well under 10 seconds, comfortably inside the startup probe's deadline. You lose nothing: the detected hostname is identical either way.

Important: make this change in the SharedTelemetryConfig only. Do not patch the generated OpenTelemetryCollector resource or its Deployment directly — the Spaces controller continuously reconciles those objects and will revert any direct edits within about a minute.

Alternative: set a static hostname

If you don't need hostname auto-detection at all, you can skip the detection logic entirely by setting an explicit hostname:

spec:exporters:
    datadog:
      host_metadata:
        hostname: my-collector-host

Upgrading to Spaces 1.14? Do this first

If you use the Datadog exporter and haven't upgraded yet, apply the fix before upgrading so the collector never hits the new probe unprepared:

Add hostname_detection_timeout: 5s to all Datadog-backed SharedTelemetryConfig objects.
Upgrade the OTEL Operator to 0.139.0.
Upgrade Spaces to 1.14.x.

References

Datadog exporter README — collector pod rebooted on startup
OpenTelemetry Operator PR #4347 — startup probe introduced
Spaces 1.14.6 release notes
Internal tracking: PYLON-1767