Why Is Crossplane Not Reconciling? Pod Lifecycle and Reconcile Loop Reference
Last updated: April 22, 2026
Relevant to: Diagnosing situations where Crossplane or a provider pod appears healthy but is not responding to state visible in the Kubernetes API.
Applies to: Crossplane v1.14 through v2.x (equivalent UXP versions apply). This documents behaviour across these versions — for supported versions, refer to the Product Lifecycle page. Version-specific behaviour is noted inline where it differs.
Overview
Both the Crossplane core pod and Crossplane providers are built on controller-runtime, following the standard Kubernetes operator pattern. Each layer adds customisations on top. Understanding the full lifecycle helps narrow down why reconciliation has stalled without obvious pod-level errors.
Phase 1 — Leader Election
Leader election is disabled by default for both the Crossplane core pod and official providers. It must be explicitly enabled via the --leader-election flag. Most single-replica deployments run without it.
If leader election is enabled, controller-runtime uses a Kubernetes Lease object to coordinate which pod is active. Only the lease holder runs the reconcile loop — other replicas stand by.
If leader election is enabled, verify the correct pod holds the lease:
kubectl get lease -n crossplane-system
If the lease holder is stale (e.g. after a crash or eviction), the new pod may be waiting to acquire it. This appears as pod Ready: True with no reconciliation activity.
If leader election is disabled (the default), skip this phase — no lease is involved and the single pod reconciles immediately after startup.
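Beyond listing the leases, it helps to check who holds each one and when it was last renewed, then match the holder against a live pod. A minimal sketch, assuming the standard coordination.k8s.io/v1 Lease fields (the lease name itself varies by deployment):
kubectl get lease -n crossplane-system \
  -o custom-columns='NAME:.metadata.name,HOLDER:.spec.holderIdentity,RENEWED:.spec.renewTime'

# A holder identity that no longer matches any running pod suggests a stale lease
kubectl get pods -n crossplane-system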
Phase 2 — Cache Warm-up (Shared Informer Cache)
After leader election, the pod populates its in-memory caches via Kubernetes watch streams before it starts reconciling. During this window the pod is Running but processing no work.
What can go wrong:
- After an OOM kill or API server blip, the informer cache can become stale despite the pod appearing healthy.
- Misconfigured RBAC can prevent the informer from listing the resources it needs.
- A missing or unreachable API server endpoint will also block cache population.
Look for `reflector.go: failed to list` errors in pod logs — this indicates the cache is not syncing. Note that these errors won't appear immediately; they surface only after the reflector's list/watch timeout elapses, so if you check logs shortly after startup you may not see them yet. The exact timing depends on your logging configuration. Crossplane issue #3565 is a documented real-world example of this.
kubectl logs -n crossplane-system deploy/crossplane | grep -i "failed to list\|reflector"
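If RBAC is the suspected cause, the list permission can be tested directly. A sketch, assuming the core pod runs as the crossplane service account in crossplane-system (adjust the service account, and substitute whichever resource is named in the reflector errors):
# Can the service account list the resource the informer needs?
kubectl auth can-i list providerconfigs \
  --as=system:serviceaccount:crossplane-system:crossplane

# Repeat for the resource appearing in the failed-to-list errors, e.g.
kubectl auth can-i list buckets.s3.aws.upbound.io \
  --as=system:serviceaccount:crossplane-system:crossplane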
Phase 3 — Watch and Poll Scheduling
Once the cache is warm, the controller schedules reconciles via two mechanisms:
| Mechanism | Trigger | Crossplane default | Provider default |
| --- | --- | --- | --- |
| Kubernetes watch | Object create/update/delete event | Immediate enqueue | Immediate enqueue |
| Poll interval | Periodic re-queue | 1 minute | 10 minutes |
Note that Crossplane core and providers have separate, independent poll flags with different defaults. The core --poll-interval (1m) controls how often XRs and Claims are re-queued. The provider --poll flag (10m for Upjet-based providers like provider-upjet-aws) controls how often individual MRs are checked for drift against the external API. Providers also have a --sync flag (default 1 hour) that triggers a full sweep to double-check all MRs for drift — useful for catching changes that occurred outside Crossplane.
If reconciliation appears slow for MRs but XRs are updating promptly, the provider's --poll interval is likely the bottleneck rather than the core poll interval — but before adjusting intervals, rule out the following:
- CPU throttling — check CPU utilization of the Crossplane or provider pod. The Go runtime exposes CPU metrics (`process_cpu_seconds_total` or similar) that can help confirm whether the operator is being starved of CPU rather than simply waiting on a timer.
- Memory pressure — high memory utilization or OOM proximity can slow reconciliation loops independently of poll configuration.
- `--max-reconcile-rate` saturation — if the number of MRs significantly exceeds the configured `--max-reconcile-rate` on the provider, the work queue will back up and reconciliation will appear slow even with aggressive poll intervals. Check the provider deployment args for this value alongside `--poll`.
- External API throttling — if you're specifically investigating drift detection against the external resource state, the cloud provider's API rate limits are a common culprit. Look for throttling or 429 errors in provider logs alongside slow drift resolution.
Check the provider deployment's args to confirm the configured value:
kubectl get deployment <provider-pod> -n crossplane-system -o jsonpath='{.spec.template.spec.containers[0].args}'
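Two quick checks for the CPU-throttling and external-API-throttling causes listed above. A sketch, assuming metrics-server is installed (required for kubectl top) and that the provider logs throttling with recognisable keywords:
# CPU/memory utilization of the core and provider pods (requires metrics-server)
kubectl top pod -n crossplane-system

# Cloud API throttling signals in provider logs
kubectl logs -n crossplane-system deploy/<provider-pod> --since=1h \
  | grep -iE "throttl|429|rate exceeded|too many requests"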
Real-time Compositions
When real-time compositions is enabled, watch events from composed resources trigger XR reconciliation directly, bypassing the poll interval. This significantly reduces latency between an MR becoming ready and the XR being updated.
| Version | Status |
| --- | --- |
| v1.14 | Introduced as alpha (`--enable-realtime-compositions`) |
| v1.16 | Promoted to beta |
| v2.0+ | On by default — no flag needed; opt-out with `--disable-realtime-compositions` |
Key behaviour change from v1.20 to v2: PR #6619 moved the TTL-based requeue so it only fires when real-time compositions is enabled; when disabled, the system falls back to --poll-interval. This affects reconcile frequency but not readiness logic.
Check whether real-time compositions is enabled (v1.x opt-in era):
kubectl get deployment crossplane -n crossplane-system -o jsonpath='{.spec.template.spec.containers[0].args}'
Look for --enable-realtime-compositions (v1.14–v1.x) or --disable-realtime-compositions (v2.x, to confirm it hasn't been turned off).
Circuit Breaker
The circuit breaker was introduced alongside real-time compositions in v1.14 to prevent reconciliation thrashing — when real-time watch events trigger tight reconcile loops that saturate the controller. It applies a per-XR token bucket: if an XR receives too many watch events in a short period, reconciliation is throttled to once every 30 seconds for a 5-minute cooldown. The pod and XR both appear healthy; reconciliation is simply rate-limited.
The circuit breaker is most relevant in v2.0+ where real-time compositions is on by default and all users are subject to it. On v1.x with real-time compositions disabled, the circuit breaker does not apply.
v2.1 note: PR #6911 fixed a bug where the circuit breaker consumed double the expected tokens per event, causing it to trip more aggressively than intended. If you are on v2.0 and seeing frequent circuit breaker opens without obvious event storms, upgrading to v2.1+ resolves this. The same PR also introduced three CLI flags to make the circuit breaker tunable without recompiling — these can be set as args in the Crossplane deployment or via Helm values:
| Flag | Description |
| --- | --- |
|  | Maximum number of watch events the bucket can absorb before opening |
|  | Rate at which tokens are refilled (events per second) |
|  | How long the breaker stays open before allowing reconciliation again |
If you are seeing the circuit breaker open on a legitimate high-churn XR (rather than a bug), tuning these flags is preferable to disabling real-time compositions entirely.
Diagnose via Prometheus metrics — see Circuit Breaker Metrics for the full reference:
circuit_breaker_opens_total # transitions from closed → open
circuit_breaker_closes_total # transitions from open → closed
circuit_breaker_events_total{result="Dropped"} # events suppressed while open
All three metrics include a controller label (composite/<plural>.<group>) so you can scope to a specific XR type.
Check on the XR directly — when the circuit breaker opens, the XR gains a Responsive: False condition with reason WatchCircuitOpen:
kubectl get composite <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Responsive")'
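To scope these metrics to a single XR type in Prometheus, filter on the controller label described above. A sketch; composite/xnetworks.example.org is a hypothetical label value following the composite/<plural>.<group> pattern:
# Breaker opens for one XR type over the last hour
sum(increase(circuit_breaker_opens_total{controller="composite/xnetworks.example.org"}[1h]))

# Suppressed events per XR type while breakers were open
sum by (controller) (rate(circuit_breaker_events_total{result="Dropped"}[5m]))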
Phase 4 — Managed Resource Reconcile Loop
The core reconcile loop for MRs is implemented in crossplane-runtime/managed/reconciler.go. Each iteration runs these steps in order:
Connect → Observe → Create / Update / Delete (or no-op) → LateInitialize → PublishConnectionDetails → UpdateStatus
Step-by-step
1. Connect Initialises the provider client (e.g. AWS SDK, GCP client). Uses credentials from the ProviderConfig referenced by the MR. Failure here produces Synced: False with a connection error. Check ProviderConfig status and secret references (see the commands after this list).
2. Observe Calls the external API to check whether the external resource exists and retrieve its current state. For Upjet providers this runs terraform refresh internally. The result determines the next step: Create, Update, Delete, or no-op.
3. Create / Update / Delete Only executed if Observe determines a diff. Gated by managementPolicies — if a policy does not include the action (e.g. Observe-only), this step is skipped entirely.
4. LateInitialize Writes provider-defaulted fields back into the MR spec. This triggers a spec update, which in turn triggers another reconcile iteration.
5. PublishConnectionDetails Writes connection details (endpoints, credentials) to the referenced Secret or StoreConfig.
6. UpdateStatus Sets Synced and Ready conditions. Ready: True is only set when the provider's readiness check passes (configurable via readiness.policy on the MR).
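To locate which step a given MR is stuck on, the conditions and events usually point at it directly. A sketch with placeholder kind and name:
# Synced reflects Connect/Observe; Ready reflects the readiness check
kubectl get <kind> <name> -o json \
  | jq '.status.conditions[] | {type, status, reason, message}'

# Events carry the reconciler's error message for the failing step
kubectl describe <kind> <name> | sed -n '/Events:/,$p'

# For Connect failures, inspect the referenced ProviderConfig and its credentials
kubectl describe providerconfig <providerconfig-name>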
Things that halt or modify the loop
| Mechanism | Effect |
| --- | --- |
| `managementPolicies` | Restricts which actions (Observe, Create, Update, Delete, LateInitialize) the reconciler will take |
| Pause annotation (`crossplane.io/paused`) | Halts reconciliation entirely; paused resources cannot be deleted |
| External-create annotations | Guard against duplicate resource creation when a previous Create may have leaked |
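For reference, pausing and resuming with the annotation looks like this (the trailing dash on kubectl annotate removes the annotation):
# Pause reconciliation of a single resource
kubectl annotate <kind> <name> crossplane.io/paused="true"

# Resume it
kubectl annotate <kind> <name> crossplane.io/paused-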
Phase 5 — Upjet-specific: Provider Backends and Async Operations
Upjet providers implement the crossplane-runtime ExternalClient interface against Terraform provider code. There are two distinct backends depending on how the provider was built:
| Backend | Implementation | How it calls Terraform |
| --- | --- | --- |
| CLI-based (legacy) |  | Spawns `terraform` CLI processes |
| SDK-based (modern) |  | Calls Terraform Plugin SDK v2 directly in-process — no binary spawned |
| Framework-based (modern) |  | Calls Terraform Plugin Framework directly in-process — no binary spawned |
Most current providers (including provider-upjet-aws) use the SDK-based or Framework-based backend. The CLI-based backend is legacy and less common.
Async behaviour
For all backends, long-running operations (Create/Update/Delete) run asynchronously — either via async workspace calls (CLI) or goroutines (SDK/Framework, default timeout 1 hour). Observe is synchronous in all cases. The reconciler requeues after kicking off an async operation and polls for completion on the next iteration.
The async state is surfaced via two status conditions:
- `AsyncOperation` — whether an async operation is currently running
- `LastAsyncOperation` — the result of the most recent completed operation
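To read these conditions off a specific MR, a sketch with placeholder kind and name:
kubectl get <kind> <name> -o json \
  | jq '.status.conditions[] | select(.type=="AsyncOperation" or .type=="LastAsyncOperation")'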
Monitor health:
For CLI-based providers, monitor Terraform process health via Prometheus metrics:
# Via Prometheus metrics
upjet_terraform_active_cli_invocations
upjet_terraform_running_processes
If the process pool is exhausted or a process exceeds its TTL (`--provider-ttl`), reconciles are silently requeued without an error on the MR. High `upjet_terraform_running_processes` relative to `--max-reconcile-rate` is the signal.
For SDK/Framework-based providers there are no separate TF processes to monitor — look at provider pod CPU/memory and reconcile queue depth instead.
Diagnosing: Healthy Pod, No Reconciliation Activity
Work through these in order:
1. Confirm the correct pod holds the leader lease (only if --leader-election is enabled)
Leader election is disabled by default. First confirm it is active before investigating lease state:
kubectl get deployment crossplane -n crossplane-system -o jsonpath='{.spec.template.spec.containers[0].args}'
# look for --leader-election in the output

If enabled, verify the correct pod holds the lease:
kubectl get lease -n crossplane-system
2. Check for stale informer cache
kubectl logs -n crossplane-system deploy/crossplane | grep -i "failed to list\|reflector\|cache"
kubectl logs -n crossplane-system deploy/<provider-pod> | grep -i "failed to list\|reflector"
If found, restart the affected pod to force cache re-sync.
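For example, a standard rollout restart (swap in the provider deployment name if the provider pod is the one affected):
kubectl rollout restart deployment/crossplane -n crossplane-system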
3. Check the circuit breaker (v1.14+ with real-time compositions; always applicable on v2.0+)
The circuit breaker is implemented in Crossplane core only. Official providers do not have a circuit breaker — if you're investigating slow MR reconciliation, skip this step and focus on poll intervals, --max-reconcile-rate, and resource utilization instead.
Check for the Responsive: False condition on the XR first — this is the fastest signal:
kubectl get composite <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Responsive")'
Then confirm via Prometheus:
circuit_breaker_opens_total
circuit_breaker_events_total{result="Dropped"}
If opens are recent, wait for the 5-minute cooldown or investigate what is generating excessive watch events on the affected XRs. If on v2.0, consider upgrading to v2.1+ — PR #6911 fixed a double token consumption bug that caused the breaker to trip more aggressively than intended.
4. Check for paused resources
The crossplane.io/paused: "true" annotation halts reconciliation on any Crossplane resource — not just MRs. This includes XRs, Claims, and package resources (Providers, Functions, Configurations, and their Revisions). A paused Provider or Function will stop reconciling all MRs or compositions that depend on it, which can appear as a broader stall rather than a single resource being stuck.
When a resource is paused, it should also report a Synced: False condition with reason Paused. Check for this condition alongside the annotation — if a resource carries the pause annotation but has not set the Synced: False / Paused condition, that itself indicates a deeper problem (e.g. the controller is not running at all due to leader election failure, OOM, etc.) and you should investigate those areas instead.
# Managed resources — check annotation
kubectl get managed -A -o json | jq '.items[] | select(.metadata.annotations["crossplane.io/paused"] == "true") | .metadata.name'
# Managed resources — check Synced: False / Paused condition
kubectl get managed -A -o json | jq '.items[] | select(.status.conditions[]? | .type=="Synced" and .status=="False" and .reason=="Paused") | .metadata.name'
# XRs and Claims
kubectl get composite -A -o json | jq '.items[] | select(.metadata.annotations["crossplane.io/paused"] == "true") | .metadata.name'
# Package resources (Providers, Functions, Configurations)
kubectl get providers,functions,configurations -A -o json | jq '.items[] | select(.metadata.annotations["crossplane.io/paused"] == "true") | .metadata.name'
If the annotation is present but Synced: False / Paused is absent, don't assume the resource is simply paused — investigate controller health first.
5. Check for stuck external-create annotations
kubectl get managed -A -o json | jq '.items[] | select((.metadata.annotations // {}) | keys | any(test("external-create"))) | {name: .metadata.name, annotations: .metadata.annotations}'
These require manual resolution — see Crossplane docs on creation annotations.
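The general shape of the manual resolution is to verify by hand whether the external resource actually exists, then clear the stale annotation so the reconciler can proceed. A sketch; confirm the exact procedure against the creation-annotations docs before running it:
# 1. See which external-create annotations are set
kubectl get <kind> <name> -o jsonpath='{.metadata.annotations}'

# 2. After manually verifying the external resource's state against the
#    cloud API, remove the stale annotation (trailing dash deletes it)
kubectl annotate <kind> <name> crossplane.io/external-create-pending-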
6. Check managementPolicies (Observe-only resources)
An Observe-only MR will never Create or Update. Verify that the managementPolicies field on the MR matches the intended behaviour:
kubectl get managed <name> -o jsonpath='{.spec.managementPolicies}'
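The default is ["*"] (all actions permitted); ["Observe"] makes the MR observe-only. If observe-only is unintentional, a merge patch restores full management, as in this sketch:
kubectl patch <kind> <name> --type=merge \
  -p '{"spec":{"managementPolicies":["*"]}}'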
7. For CLI-based Upjet providers: check process exhaustion
# If Prometheus is available
upjet_terraform_running_processes
upjet_terraform_active_cli_invocations
High sustained values indicate the provider is at capacity. Tuning --max-reconcile-rate or reducing the number of MRs per provider revision may help.
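Provider flags such as --max-reconcile-rate are typically set through a DeploymentRuntimeConfig referenced from the Provider. A minimal sketch, assuming Crossplane v1.14+ where DeploymentRuntimeConfig is available (the container must be named package-runtime; the value 50 is illustrative, not a recommendation):
cat <<EOF | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: tuned-reconcile-rate
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
            - name: package-runtime
              args:
                - --max-reconcile-rate=50
EOF

# Point the Provider at it
kubectl patch provider.pkg.crossplane.io <provider-name> --type=merge \
  -p '{"spec":{"runtimeConfigRef":{"name":"tuned-reconcile-rate"}}}'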