Why Is Crossplane Not Reconciling? Pod Lifecycle and Reconcile Loop Reference

Last updated: April 22, 2026

Relevant to: Diagnosing situations where Crossplane or a provider pod appears healthy but is not responding to state visible in the Kubernetes API.

Applies to: Crossplane v1.14 through v2.x (equivalent UXP versions apply). This documents behaviour across these versions — for supported versions, refer to the Product Lifecycle page. Version-specific behaviour is noted inline where it differs.


Overview

Both the Crossplane core pod and Crossplane providers are built on controller-runtime, following the standard Kubernetes operator pattern. Each layer adds customisations on top. Understanding the full lifecycle helps narrow down why reconciliation has stalled without obvious pod-level errors.


Phase 1 — Leader Election

Leader election is disabled by default for both the Crossplane core pod and official providers. It must be explicitly enabled via the --leader-election flag. Most single-replica deployments run without it.

If leader election is enabled, controller-runtime uses a Kubernetes Lease object to coordinate which pod is active. Only the lease holder runs the reconcile loop — other replicas stand by.

If leader election is enabled, verify the correct pod holds the lease:

kubectl get lease -n crossplane-system

If the lease holder is stale (e.g. after a crash or eviction), the new pod may be waiting to acquire it. This appears as pod Ready: True with no reconciliation activity.

If leader election is disabled (the default), skip this phase — no lease is involved and the single pod reconciles immediately after startup.
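The staleness check can be made concrete with a short sketch. This is an illustration only — the field names match the coordination.k8s.io/v1 Lease spec, but the values and the `lease_is_stale` helper are invented for this example:

```python
from datetime import datetime, timedelta, timezone

# Illustrative Lease spec, shaped like coordination.k8s.io/v1 Lease .spec
lease_spec = {
    "holderIdentity": "crossplane-7f9c4d8b6-abcde",
    "leaseDurationSeconds": 15,
    "renewTime": "2026-04-22T10:00:00.000000Z",
}

def lease_is_stale(spec, now):
    """The lease is stale once renewTime + leaseDurationSeconds is in the
    past, i.e. the recorded holder has stopped renewing it."""
    renew = datetime.strptime(
        spec["renewTime"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return now > renew + timedelta(seconds=spec["leaseDurationSeconds"])

# Five minutes after the last renewal: a replacement pod should take over.
now = datetime(2026, 4, 22, 10, 5, 0, tzinfo=timezone.utc)
print(lease_is_stale(lease_spec, now))  # True
```

If the lease is stale but no pod picks it up, the replacement pod is usually the thing to investigate, not the lease itself.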


Phase 2 — Cache Warm-up (Shared Informer Cache)

After leader election, the pod populates its in-memory caches via Kubernetes watch streams before it starts reconciling. During this window the pod is Running but processing no work.

What can go wrong:

  • After an OOM kill or API server blip, the informer cache can become stale despite the pod appearing healthy.

  • Misconfiguration of RBAC permissions for API resources can prevent the informer from listing the resources it needs.

  • Missing or unreachable API server endpoints will also block cache population.

  • Look for reflector.go: failed to list errors in pod logs — this indicates the cache is not syncing. Note that these errors won't appear immediately; they surface only after the reflector's list/watch timeout elapses, so if you check logs shortly after startup you may not see them yet. The exact timing depends on your logging configuration.

  • Crossplane issue #3565 is a documented real-world example of this.

kubectl logs -n crossplane-system deploy/crossplane | grep -i "failed to list\|reflector"
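The warm-up gate itself is simple: the controller blocks all reconcilers until every informer reports a synced cache. A minimal Python sketch of that gate — `FakeInformer` is invented for illustration; the real mechanism is client-go's list-then-watch reflector and cache-sync wait:

```python
import threading
import time

class FakeInformer:
    """Stand-in for a shared informer; invented for illustration."""
    def __init__(self):
        self._synced = threading.Event()

    def start(self):
        # The initial LIST populates the cache; only then is it "synced".
        def _list_then_watch():
            time.sleep(0.1)  # stand-in for the initial LIST round-trip
            self._synced.set()
        threading.Thread(target=_list_then_watch, daemon=True).start()

    def wait_for_cache_sync(self, timeout):
        # The controller blocks here before starting any reconciler. If the
        # LIST fails (RBAC, unreachable API server), this never returns True.
        return self._synced.wait(timeout)

informer = FakeInformer()
informer.start()
print(informer.wait_for_cache_sync(timeout=2.0))  # True once populated
```

This is why a pod with broken RBAC sits Running but idle: the gate never opens, and no reconciler ever starts.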

Phase 3 — Watch and Poll Scheduling

Once the cache is warm, the controller schedules reconciles via two mechanisms:

| Mechanism | Trigger | Crossplane default | Provider default |
| --- | --- | --- | --- |
| Kubernetes watch | Object create/update/delete event | Immediate enqueue | Immediate enqueue |
| Poll interval | Periodic re-queue | 1 minute; --poll-interval | 10 minutes; --poll |

Note that Crossplane core and providers have separate, independent poll flags with different defaults. The core --poll-interval (1m) controls how often XRs and Claims are re-queued. The provider --poll flag (10m for Upjet-based providers like provider-upjet-aws) controls how often individual MRs are checked for drift against the external API. Providers also have a --sync flag (default 1 hour) that triggers a full sweep to double-check all MRs for drift — useful for catching changes that occurred outside Crossplane.
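To make the interaction of the two mechanisms concrete, here is a small sketch (the function and its return shape are invented for illustration) of how the next reconcile time for a single resource is determined, using the provider's 10-minute --poll default:

```python
POLL_INTERVAL = 600  # provider --poll default (10m), in seconds

def next_reconcile(last_reconcile, watch_event_at=None):
    """Return (time, reason) for the next reconcile of one resource."""
    poll_at = last_reconcile + POLL_INTERVAL
    if watch_event_at is not None and watch_event_at < poll_at:
        return watch_event_at, "watch"  # events enqueue immediately
    return poll_at, "poll"              # otherwise wait for the re-queue

print(next_reconcile(0))                     # (600, 'poll')
print(next_reconcile(0, watch_event_at=30))  # (30, 'watch')
```

The practical consequence: changes made through the Kubernetes API surface almost immediately (watch), while drift introduced directly in the cloud provider is only noticed on the next poll.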

If reconciliation appears slow for MRs but XRs are updating promptly, the provider's --poll interval is likely the bottleneck rather than the core poll interval — but before adjusting intervals, rule out the following:

  • CPU throttling — check CPU utilization of the Crossplane or provider pod. Container-level CFS throttling metrics (e.g. container_cpu_cfs_throttled_seconds_total from cAdvisor/kubelet) can confirm whether the operator is being starved of CPU rather than simply waiting on a timer.

  • Memory pressure — high memory utilization or OOM proximity can slow reconciliation loops independently of poll configuration.

  • --max-reconcile-rate saturation — if the number of MRs significantly exceeds the configured --max-reconcile-rate on the provider, the work queue will back up and reconciliation will appear slow even with aggressive poll intervals. Check the provider deployment args for this value alongside --poll.

  • External API throttling — if you're specifically investigating drift detection against the external resource state, the cloud provider's API rate limits are a common culprit. Look for throttling or 429 errors in provider logs alongside slow drift resolution.

Check the provider deployment's args to confirm the configured value:

kubectl get deployment <provider-deployment> -n crossplane-system -o jsonpath='{.spec.template.spec.containers[0].args}'

Real-time Compositions

When real-time compositions is enabled, watch events from composed resources trigger XR reconciliation directly, bypassing the poll interval. This significantly reduces latency between an MR becoming ready and the XR being updated.

| Version | Status |
| --- | --- |
| v1.14 | Introduced as alpha (--enable-realtime-compositions flag, opt-in) |
| v1.16 | Promoted to beta |
| v2.0+ | On by default — no flag needed; opt-out with --disable-realtime-compositions |

Key behaviour change from v1.20 to v2: PR #6619 moved the TTL-based requeue so it only fires when real-time compositions is enabled; when disabled, the system falls back to --poll-interval. This affects reconcile frequency but not readiness logic.

Check whether real-time compositions is enabled (v1.x opt-in era):

kubectl get deployment crossplane -n crossplane-system -o jsonpath='{.spec.template.spec.containers[0].args}'

Look for --enable-realtime-compositions (v1.14–v1.x) or --disable-realtime-compositions (v2.x, to confirm it hasn't been turned off).

Circuit Breaker

The circuit breaker was introduced alongside real-time compositions in v1.14 to prevent reconciliation thrashing — when real-time watch events trigger tight reconcile loops that saturate the controller. It applies a per-XR token bucket: if an XR receives too many watch events in a short period, reconciliation is throttled to once every 30 seconds for a 5-minute cooldown. The pod and XR both appear healthy; reconciliation is simply rate-limited.

The circuit breaker is most relevant in v2.0+ where real-time compositions is on by default and all users are subject to it. On v1.x with real-time compositions disabled, the circuit breaker does not apply.

v2.1 note: PR #6911 fixed a bug where the circuit breaker consumed double the expected tokens per event, causing it to trip more aggressively than intended. If you are on v2.0 and seeing frequent circuit breaker opens without obvious event storms, upgrading to v2.1+ resolves this. The same PR also introduced three CLI flags to make the circuit breaker tunable without recompiling — these can be set as args in the Crossplane deployment or via Helm values:

| Flag | Description |
| --- | --- |
| --circuit-breaker-burst | Maximum number of watch events the bucket can absorb before opening |
| --circuit-breaker-refill-rate | Rate at which tokens are refilled (events per second) |
| --circuit-breaker-cooldown | How long the breaker stays open before allowing reconciliation again |

If you are seeing the circuit breaker open on a legitimate high-churn XR (rather than a bug), tuning these flags is preferable to disabling real-time compositions entirely.
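As an illustration of the token-bucket behaviour these flags tune, here is a minimal sketch. The class, its defaults, and the burst numbers are invented for illustration — this is not the actual Crossplane implementation:

```python
class CircuitBreaker:
    """Per-XR token bucket sketch. Parameter names mirror the CLI flags;
    the default values here are illustrative only."""
    def __init__(self, burst=10, refill_rate=1.0, cooldown=300):
        self.burst, self.refill_rate, self.cooldown = burst, refill_rate, cooldown
        self.tokens = float(burst)
        self.open_until = None  # breaker is closed while this is None
        self.last = 0.0

    def allow(self, now):
        """Return True if a watch event may trigger a reconcile at `now`."""
        if self.open_until is not None:
            if now < self.open_until:
                return False        # open: event dropped
            self.open_until = None  # cooldown elapsed: close and refill
            self.tokens = float(self.burst)
        # Refill tokens for the time elapsed since the last event.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens < 1:
            self.open_until = now + self.cooldown  # bucket empty: trip open
            return False
        self.tokens -= 1
        return True

# A burst of 3 events with no refill trips the breaker on the 4th.
cb = CircuitBreaker(burst=3, refill_rate=0.0, cooldown=300)
print([cb.allow(t) for t in range(5)])  # [True, True, True, False, False]
```

The v2.0 bug described above is easy to picture in this model: consuming two tokens per event halves the effective burst, so the breaker opens twice as eagerly.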

Diagnose via Prometheus metrics — see Circuit Breaker Metrics for the full reference:

circuit_breaker_opens_total                     # transitions from closed → open
circuit_breaker_closes_total                    # transitions from open → closed
circuit_breaker_events_total{result="Dropped"}  # events suppressed while open

All three metrics include a controller label (composite/<plural>.<group>) so you can scope to a specific XR type.

Check on the XR directly — when the circuit breaker opens, the XR gains a Responsive: False condition with reason WatchCircuitOpen:

kubectl get composite <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Responsive")'

Phase 4 — Managed Resource Reconcile Loop

The core reconcile loop for MRs is implemented in crossplane-runtime/managed/reconciler.go. Each iteration runs these steps in order:

Connect → Observe → Create / Update / Delete (or no-op) → LateInitialize → PublishConnectionDetails → UpdateStatus

Step-by-step

1. Connect: Initialises the provider client (e.g. AWS SDK, GCP client). Uses credentials from the ProviderConfig referenced by the MR. Failure here produces Synced: False with a connection error. Check ProviderConfig status and secret references.

2. Observe: Calls the external API to check whether the external resource exists and retrieve its current state. For Upjet providers this runs terraform refresh internally. The result determines the next step: Create, Update, Delete, or no-op.

3. Create / Update / Delete: Only executed if Observe determines a diff. Gated by managementPolicies — if a policy does not include the action (e.g. Observe-only), this step is skipped entirely.

4. LateInitialize: Writes provider-defaulted fields back into the MR spec. This triggers a spec update, which in turn triggers another reconcile iteration.

5. PublishConnectionDetails: Writes connection details (endpoints, credentials) to the referenced Secret or StoreConfig.

6. UpdateStatus: Sets Synced and Ready conditions. Ready: True is only set when the provider's readiness check passes (configurable via readiness.policy on the MR).
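The ordering and the managementPolicies gate can be sketched as follows. `FakeExternal` and `Observation` are invented stand-ins; the real logic lives in crossplane-runtime's managed reconciler:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    exists: bool
    up_to_date: bool

class FakeExternal:
    """Stand-in for an external client; records which actions ran."""
    def __init__(self, obs):
        self.obs, self.actions = obs, []
    def observe(self, mr): return self.obs
    def create(self, mr): self.actions.append("create")
    def update(self, mr): self.actions.append("update")

def reconcile(mr, external):
    """One iteration of the loop above (a sketch, not the real reconciler)."""
    obs = external.observe(mr)                     # Observe
    policies = mr.get("managementPolicies", ["*"])
    allowed = lambda action: action in policies or "*" in policies
    if not obs.exists and allowed("Create"):
        external.create(mr)                        # Create
    elif obs.exists and not obs.up_to_date and allowed("Update"):
        external.update(mr)                        # Update
    # LateInitialize, PublishConnectionDetails, UpdateStatus would follow.
    return obs

ext = FakeExternal(Observation(exists=False, up_to_date=False))
reconcile({"managementPolicies": ["Observe"]}, ext)
print(ext.actions)  # []: an Observe-only policy skips Create entirely
```

The last two lines show the diagnostic point from step 3: an MR whose policy excludes an action will silently no-op, which looks identical to a stalled reconciler unless you check the policy.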

Things that halt or modify the loop

| Mechanism | Effect |
| --- | --- |
| managementPolicies | Restricts which actions (Observe, Create, Update, Delete, LateInitialize) the reconciler will take |
| Pause annotation (crossplane.io/paused: "true") | Halts reconciliation entirely; paused resources cannot be deleted |
| external-create-pending/succeeded/failed annotations | Guard against duplicate resource creation when a previous Create may have leaked. external-create-failed is the most operationally significant — it halts the reconciler and requires manual intervention (either delete the annotation after confirming no resource was created, or import the leaked resource). external-create-pending is transient and self-resolves; external-create-succeeded is informational. |
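A simplified sketch of the creation guard follows. The annotation keys are real, but the real reconciler compares the timestamps stored in these annotations — this presence-only version is illustrative:

```python
def may_attempt_create(mr):
    """Simplified sketch: refuse to Create again when a previous Create may
    have leaked an external resource. The real logic compares annotation
    timestamps; this version only checks which annotations are present."""
    anns = mr.get("metadata", {}).get("annotations", {})
    if "crossplane.io/external-create-failed" in anns:
        return False  # halted: requires manual intervention
    if ("crossplane.io/external-create-pending" in anns
            and "crossplane.io/external-create-succeeded" not in anns):
        return False  # a previous Create may still be in flight, or leaked
    return True

mr = {"metadata": {"annotations": {
    "crossplane.io/external-create-failed": "2026-04-22T10:00:00Z"}}}
print(may_attempt_create(mr))  # False: stuck until manually resolved
```

This is why a single failed Create can leave an MR permanently stuck with no pod-level error: the guard is working as designed, waiting for a human decision.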


Phase 5 — Upjet-specific: Provider Backends and Async Operations

Upjet providers implement the crossplane-runtime ExternalClient interface against Terraform provider code. There are three distinct backends depending on how the provider was built:

Backend

Implementation

How it calls Terraform

CLI-based (legacy)

external.go / Connector

Spawns terraform binary processes via a workspace; Observe runs terraform refresh, Create/Update/Delete run terraform apply/destroy

SDK-based (modern)

external_tfpluginsdk.go / TerraformPluginSDKConnector

Calls Terraform Plugin SDK v2 directly in-process — no binary spawned

Framework-based (modern)

external_tfpluginfw.go / TerraformPluginFrameworkConnector

Calls Terraform Plugin Framework directly in-process — no binary spawned

Most current providers (including provider-upjet-aws) use the SDK-based or Framework-based backend. The CLI-based backend is legacy and less common.

Async behaviour

For all backends, long-running operations (Create/Update/Delete) run asynchronously — either via async workspace calls (CLI) or goroutines (SDK/Framework, default timeout 1 hour). Observe is synchronous in all cases. The reconciler requeues after kicking off an async operation and polls for completion on the next iteration.

The async state is surfaced via two status conditions:

  • AsyncOperation — whether an async operation is currently running

  • LastAsyncOperation — the result of the most recent completed operation
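The requeue-and-poll pattern can be sketched like this. `AsyncExternal` is an invented stand-in (real providers use goroutines and surface the state via the two conditions above):

```python
import threading

class AsyncExternal:
    """Sketch of the async pattern: a long-running operation starts in the
    background; the reconciler requeues and polls it on later iterations."""
    def __init__(self):
        self.op = None

    def reconcile(self, work):
        if self.op is not None:
            if self.op.is_alive():
                return "requeue: AsyncOperation in progress"
            self.op = None
            return "done: LastAsyncOperation recorded"
        # Kick off the operation and requeue immediately rather than block.
        self.op = threading.Thread(target=work)
        self.op.start()
        return "requeue: operation started"

ae = AsyncExternal()
print(ae.reconcile(lambda: None))  # requeue: operation started
ae.op.join()
print(ae.reconcile(lambda: None))  # done: LastAsyncOperation recorded
```

The diagnostic consequence: during a long Create the MR shows no activity except its AsyncOperation condition, so an MR that seems stalled may simply be waiting out a slow cloud-side operation.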

Monitor health:

For CLI-based providers, monitor Terraform process health via Prometheus metrics:

# Via Prometheus metrics
upjet_terraform_active_cli_invocations
upjet_terraform_running_processes

If the process pool is exhausted or a process exceeds its TTL (--provider-ttl), reconciles are silently requeued without an error on the MR. High upjet_terraform_running_processes relative to --max-reconcile-rate is the signal.

For SDK/Framework-based providers there are no separate TF processes to monitor — look at provider pod CPU/memory and reconcile queue depth instead.


Diagnosing: Healthy Pod, No Reconciliation Activity

Work through these in order:

1. Confirm the correct pod holds the leader lease (only if --leader-election is enabled)

Leader election is disabled by default. First confirm it is active before investigating lease state:

kubectl get deployment crossplane -n crossplane-system -o jsonpath='{.spec.template.spec.containers[0].args}'
# look for --leader-election in the output

If enabled, verify the correct pod holds the lease:

kubectl get lease -n crossplane-system

2. Check for stale informer cache

kubectl logs -n crossplane-system deploy/crossplane | grep -i "failed to list\|reflector\|cache"
kubectl logs -n crossplane-system deploy/<provider-pod> | grep -i "failed to list\|reflector"

If found, restart the affected pod to force cache re-sync.

3. Check the circuit breaker (v1.14+ with real-time compositions; always applicable on v2.0+)

The circuit breaker is implemented in Crossplane core only. Official providers do not have a circuit breaker — if you're investigating slow MR reconciliation, skip this step and focus on poll intervals, --max-reconcile-rate, and resource utilization instead.

Check for the Responsive: False condition on the XR first — this is the fastest signal:

kubectl get composite <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Responsive")'

Then confirm via Prometheus:

circuit_breaker_opens_total
circuit_breaker_events_total{result="Dropped"}

If opens are recent, wait for the 5-minute cooldown or investigate what is generating excessive watch events on the affected XRs. If on v2.0, consider upgrading to v2.1+ — PR #6911 fixed a double token consumption bug that caused the breaker to trip more aggressively than intended.

4. Check for paused resources

The crossplane.io/paused: "true" annotation halts reconciliation on any Crossplane resource — not just MRs. This includes XRs, Claims, and package resources (Providers, Functions, Configurations, and their Revisions). A paused Provider or Function will stop reconciling all MRs or compositions that depend on it, which can appear as a broader stall rather than a single resource being stuck.

When a resource is paused, it should also report a Synced: False condition with reason Paused. Check for this condition alongside the annotation — if a resource carries the pause annotation but has not set the Synced: False / Paused condition, that itself indicates a deeper problem (e.g. the controller is not running at all due to leader election failure, OOM, etc.) and you should investigate those areas instead.

# Managed resources — check annotation
kubectl get managed -A -o json | jq '.items[] | select(.metadata.annotations["crossplane.io/paused"] == "true") | .metadata.name'

# Managed resources — check Synced: False / Paused condition
kubectl get managed -A -o json | jq '.items[] | select(.status.conditions[]? | .type=="Synced" and .status=="False" and .reason=="Paused") | .metadata.name'

# XRs and Claims
kubectl get composite -A -o json | jq '.items[] | select(.metadata.annotations["crossplane.io/paused"] == "true") | .metadata.name'

# Package resources (Providers, Functions, Configurations)
kubectl get providers,functions,configurations -A -o json | jq '.items[] | select(.metadata.annotations["crossplane.io/paused"] == "true") | .metadata.name'

If the annotation is present but Synced: False / Paused is absent, don't assume the resource is simply paused — investigate controller health first.

5. Check for stuck external-create annotations

kubectl get managed -A -o json | jq '.items[] | select(.metadata.annotations | keys[] | test("external-create")) | {name: .metadata.name, annotations: .metadata.annotations}'

These require manual resolution — see Crossplane docs on creation annotations.

6. Check managementPolicies (Observe-only resources)

An Observe-only MR will never Create or Update. Verify that the managementPolicies on the MR matches the intended behaviour:

kubectl get managed <name> -o jsonpath='{.spec.managementPolicies}'

7. For CLI-based Upjet providers: check process exhaustion

# If Prometheus is available
upjet_terraform_running_processes
upjet_terraform_active_cli_invocations

High sustained values indicate the provider is at capacity. Tuning --max-reconcile-rate or reducing the number of MRs per provider revision may help.


Reference