Learn how Middleware OpsAI detects pod crashes, OOMKilled errors, and HPA misconfigurations in real time and fixes them automatically without paging your on-call team.

Kubernetes pod crash auto-remediation is the ability to automatically detect why a pod crashed and apply a permanent fix without human intervention. Middleware OpsAI does this by monitoring Kubernetes events, pod metrics, and container logs in real time, diagnosing the root cause of each failure, and patching the cluster directly, for example, raising a memory limit after an OOMKill or correcting a broken ConfigMap mount that causes CrashLoopBackOff. The result is fewer repeat incidents, shorter recovery times, and no 3 AM pages for problems that the system can resolve itself.

TL;DR

  • Kubernetes restarts crashed pods automatically but never fixes why they crashed
  • OOMKilled pods will keep dying on every restart until the memory limit is manually raised; OpsAI does this automatically using real P99 usage data
  • CrashLoopBackOff has many causes (broken ConfigMap, bad volume mount, startup error); OpsAI reads exit codes and logs to find the exact one and patches it
  • HPA misconfiguration never triggers a crash alert; OpsAI detects it by watching the gap between autoscaler behavior and actual service latency
  • You control how much OpsAI acts on its own: Auto Fix for staging, human-reviewed Auto RCA for production
  • No custom scripts, no 3 AM pages for problems the system can resolve itself

Quick Summary

ProblemWhat Kubernetes DoesWhat OpsAI Does
Pod OOMKilled repeatedlyRestarts pod with same memory limitPatches memory limit using P99 usage + 30% buffer
CrashLoopBackOff from broken ConfigMapKeeps restarting the crashing containerDetects the broken volume mount reference and corrects it
HPA stuck at maxReplicasNo action (no crash event to react to)Detects the scaling ceiling, reviews load history, updates maxReplicas
Ambiguous app errorRestarts pod indefinitelyRuns full RCA, opens a PR with proposed fix for human review
Single OOMKill, first timeRestarts podLogs, watches for recurrence, no immediate action
Table of Contents

What is Kubernetes self-healing and what are its limits?

Kubernetes self-healing refers to the platform’s built-in ability to detect and recover from container and node failures, automatically and without operator intervention. It is one of Kubernetes’ most cited advantages, and it genuinely works up to a point. For a full view of what Kubernetes exposes for monitoring and alerting, see the Kubernetes monitoring guide.

What Kubernetes self-healing actually covers

FeatureHow It Works
Container restartsIf a container exits, Kubernetes restarts it based on the pod’s restartPolicy
Liveness probesIf a container stops responding, Kubernetes kills and replaces it
ReplicaSetsIf a pod is deleted, Kubernetes creates a replacement to maintain the desired count
Node failure recoveryIf a node goes offline, Kubernetes reschedules its pods onto healthy nodes

These features handle availability. They keep pods running. What they do not do is understand why a pod failed or prevent the same failure from happening again.

The hard limit of Kubernetes self-healing

This is the core gap. Every restart happens with the same configuration that caused the crash. There is no mechanism in native Kubernetes to:

  • Raise a memory limit after repeated OOMKills
  • Detect a broken ConfigMap reference causing CrashLoopBackOff
  • Identify that an HPA’s maxReplicas is too low for actual traffic demand
  • Distinguish between a transient failure and a systemic misconfiguration

Teams relying solely on Kubernetes self-healing still face repeat incidents, on-call burnout, and slow MTTR because every restart is a bandage over the same wound. For a deeper look at how to reason from Kubernetes symptoms to root causes, see diagnosing abnormal Kubernetes workload behavior.

Middleware OpsAI is built to close exactly this gap, adding a root-cause reasoning layer on top of the same Kubernetes monitoring data your cluster already produces.

Stop getting paged for the same pod crashes every week. OpsAI detects and fixes OOMKills, CrashLoopBackOff, and HPA issues automatically before they wake up your on-call team. Start free · Book a demo · See pricing

Why does my Kubernetes pod keep crashing with OOMKilled, and how do I stop it from happening again?

An OOMKilled pod means the Linux kernel’s Out-of-Memory manager terminated the container because its memory usage reached or exceeded the limit defined in the pod spec. Kubernetes flags this with exit code 137, which corresponds to SIGKILL (signal 9) sent by the kernel. Middleware log monitoring captures these exit codes and log tails in real time so OpsAI has everything it needs to act immediately.

Common causes of repeated OOMKills

  1. Memory limit set too low at deploy time. The original limit was an estimate and never revisited as traffic grew.
  2. New code introduces memory growth. A feature release adds buffering, caching, or large payload processing that was not present when the limit was set.
  3. Memory leak. The application accumulates memory over time; the pod eventually hits its ceiling on every restart cycle.
  4. Traffic spike. A sudden increase in concurrent requests pushes per-request memory allocation above the steady-state limit.

Why Kubernetes will not stop it on its own

Kubernetes sees exit code 137 records an OOMKill event and restarts the container with the same memory limit. If the same workload conditions persist, the pod will be killed again. This cycle repeats indefinitely until a human manually edits the deployment spec. For the manual side of pod restarts and when to use them, see the kubectl restart pod guide. That is precisely what Middleware OpsAI automates.

How to auto-fix Kubernetes OOMKill errors with OpsAI

The scenario

A Node.js API service in the production namespace has a memory limit of 512Mi. Normal usage sits around 320Mi. A new feature released on Tuesday starts buffering large response payloads in memory. By Saturday morning, the pod has been OOMKilled four times in six hours.

What kubectl shows before OpsAI acts

$ kubectl get pods -n production

NAME                           READY   STATUS      RESTARTS   AGE
api-server-7d9f6b8c4-x2pqr     0/1     OOMKilled   4          2h
api-server-7d9f6b8c4-nklt9     1/1     Running     0          45m

$ kubectl describe pod api-server-7d9f6b8c4-x2pqr -n production

Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
  Finished:  Sat, 25 May 2026 04:18:44 +0000

Limits:    memory: 512Mi
Requests:  memory: 256Mi

OpsAI five-step auto-fix flow: no human needed

Step 1: Detect and classify the failure. OpsAI receives the OOMKilled event within seconds of the kernel kill. Exit code 137 confirms a memory limit breach. Four kills in six hours crosses the anomaly threshold, triggering a full root-cause investigation rather than a simple log entry.

Step 2: Pull memory usage history. OpsAI retrieves seven days of memory working-set data for this container. It identifies the upward trend starting Tuesday, correlates it with that deployment’s rollout, and confirms the 512Mi limit now has less than 5% headroom before kill.

Step 3: Calculate the correct new limit. OpsAI checks for an existing VPA recommendation. If none exists, it computes one from first principles: P99 memory usage over seven days plus a 30% safety buffer. Result: 768Mi limit, 384Mi request.

Step 4: Apply the patch. In Auto Fix mode, OpsAI generates a strategic merge patch and applies it through the Kubernetes API. The deployment performs a rolling update with zero downtime. Full patch details are in the OpsAI documentation.

Step 5: Confirm and notify. Once the new pods are stable and memory usage is confirmed to be below the new limit, OpsAI sends a Slack summary. No page is generated. The on-call engineer sees a digest in the morning.

What kubectl shows after OpsAI fixes the pod

$ kubectl get pods -n production

NAME                           READY   STATUS    RESTARTS   AGE
api-server-84c7f9b5d-p3rkx     1/1     Running   0          8m
api-server-84c7f9b5d-qhwlm     1/1     Running   0          7m
api-server-84c7f9b5d-tnvb2     1/1     Running   0          6m

# OpsAI audit log — visible in Middleware dashboard

Action:   memory_limit_patch
Target:   deployment/api-server  (namespace: production)
Before:   limits.memory=512Mi   requests.memory=256Mi
After:    limits.memory=768Mi   requests.memory=384Mi
Trigger:  OOMKill x4 in 6 hours (exit code 137)
Applied:  2026-05-25T04:23:11Z
Mode:     Auto Fix (namespace policy: production)

Tired of OOMKill alerts waking you up? OpsAI fixes memory limit issues automatically using real usage data not guesswork. Connect your cluster in under 5 minutes. Start free — no credit card · Book a demo

What causes a pod to keep restarting in Kubernetes and how can AI fix CrashLoopBackOff automatically?

CrashLoopBackOff is not a specific error. It is a Kubernetes status meaning “this container has restarted repeatedly and keeps failing.” For a complete reference of all common Kubernetes errors and fixes, see Kubernetes common errors and how to fix them. The actual root cause can be any of the following:

  • A missing or incorrectly named environment variable
  • A broken ConfigMap or Secret reference (bad volume mount)
  • A failed database connection at startup
  • An application-level exception before the process is ready
  • A liveness probe that fires before the app has fully initialized

Native Kubernetes cannot distinguish between these causes. It restarts the container regardless. OpsAI identifies the real cause by reading the exit code and the log tail from each crash cycle, then cross-referencing with live cluster state. This is powered by Middleware log monitoring, which ingests and pattern-matches container logs in real time.

Imagepullbackofferror k8
Middleware OpsAI surfacing an ImagePullBackOff K8s warning

CrashLoopBackOff: How OpsAI Finds the Real Cause

CrashLoopBackOff means Kubernetes has tried restarting a container multiple times and keeps failing. It is not a specific error, it is a symptom. The actual cause could be a missing environment variable, a broken ConfigMap reference, a failed database connection, a startup exception in the app code, or a liveness probe that fires before the app is ready.

OpsAI identifies the real cause by reading the exit code and the log tail from each crash cycle, then checking that against the cluster state. Here is how each exit code maps to a root cause:

Exit codeMeaningOpsAI action
1 with stack traceApplication error or missing configLog analysis + config patch
137OOMKill (kernel SIGKILL, signal 9)Memory limit patch (see above)
143Process terminated via SIGTERM (signal 15)Liveness probe timeout investigation

Note on exit code 143: This equals 128 + 15 (SIGTERM). It typically means a liveness or readiness probe forcibly terminated the process, often because the application takes longer to start than the probe’s initialDelaySeconds allows. This is a probe misconfiguration, not an application crash.

Example: broken ConfigMap reference

$ kubectl get pod worker-6d8b9f4c7-zmpvt -n staging

NAME                         READY   STATUS              RESTARTS   AGE
worker-6d8b9f4c7-zmpvt       0/1     CrashLoopBackOff    9          31m

$ kubectl logs worker-6d8b9f4c7-zmpvt -n staging --previous

Error: Cannot find module '/app/config/settings.json'
    at Function.Module._resolveFilename (node:internal/modules:1039:15)
npm ERR! code 1

OpsAI reads exit code 1 and the log message, then audits the cluster state. It finds that a recent deployment renamed the ConfigMap from worker-settings to worker-config but did not update the volume mount reference in the deployment spec. OpsAI identifies this mismatch as the root cause, patches the volume reference to point to the correct ConfigMap name, and triggers a rolling restart.

For cases where log analysis points to an ambiguous application error (something that may require a code change rather than a config patch), OpsAI switches to Auto RCA mode and presents the full diagnosis with a proposed fix for human review, even in namespaces where Auto Fix is enabled.

Example: ImagePullBackOff auto-fix with OpsAI

OpsAI also handles ImagePullBackOff errors another common cause of CrashLoopBackOff that native Kubernetes cannot self-resolve. When a pod fails because the container image reference is broken or points to a nonexistent tag, Kubernetes keeps restarting it with the same bad image spec indefinitely. OpsAI detects the failure, searches the connected repository for the broken image reference, identifies the exact line in the pod spec causing the issue, patches k8s/pod.yaml, and opens a pull request automatically with no manual investigation required.

ops ai auto remediation k8 error

OpsAI Auto Fix in action on an ImagePullBackOff error it searched the repo for the broken image tag, identified the exact file and line, made the change in k8s/pod.yaml, and opened a pull request. The full fix trail is visible in the dashboard.

Can an AI agent automatically adjust HPA settings when it detects misconfiguration?

Yes, and this is one of the most valuable things OpsAI does because HPA misconfiguration rarely generates an alert. Pods are running. The cluster looks healthy. But traffic is being throttled, latency is climbing, and the autoscaler is silently hitting its ceiling. Standard alerting misses this entirely because there is no crash event to fire on, just slow and quiet degradation.

OpsAI detects HPA misconfiguration by watching the gap between what the HPA is configured to do and what is actually happening in the cluster. For a broader comparison of how OpsAI fits against other AI-driven SRE tools, see the 10 best AI SRE tools for 2026. It monitors three patterns:

Pattern 1: Latency rising but no scale-out triggered

The HPA’s scaling metric (usually CPU) stays below its threshold while user-facing latency climbs. This happens when the real bottleneck is memory, I/O, or connection pool exhaustion (metrics the HPA is not configured to watch). Understanding which Kubernetes metrics to track is key to setting up HPA scaling rules that reflect actual service health. OpsAI spots the disconnect between the scaling signal and actual service health.

Pattern 2: Rapid scale-up followed immediately by scale-down (thrashing)

The HPA oscillates between replica counts within a short window. This wastes resources and causes instability. Common causes: stabilizationWindowSeconds is too short, metric variance is too high, or targetAverageUtilization is set too aggressively.

Pattern 3: HPA pinned at maxReplicas for an extended period

$ kubectl get hpa checkout-service -n production

NAME               REFERENCE                TARGETS        MIN  MAX  REPLICAS  AGE
checkout-service   Deployment/checkout      cpu: 92%/70%   3    8    8         2d

$ kubectl describe hpa checkout-service -n production

Conditions:
  ScalingLimited  True  TooManyReplicas
    'the desired replica count is more than the maximum replica count'

Events:
  15m   SuccessfulRescale   New size: 8; reason: cpu utilization above target
  38m   SuccessfulRescale   New size: 8; reason: cpu utilization above target
  61m   SuccessfulRescale   New size: 8; reason: cpu utilization above target

OpsAI detects the HPA has been capped at maxReplicas for over two hours while service latency is 40% above its SLO baseline. It reviews historical peak load data, calculates that maxReplicas should be 14 instead of 8, and either patches the HPA directly (Auto Fix mode) or presents the change for one-click approval in the Middleware dashboard (Auto RCA mode). See the OpsAI documentation for full configuration options.

Kubernetes liveness probes vs AI-based auto-remediation

This distinction is frequently misunderstood. Liveness probes and OpsAI operate at completely different layers of the stack and solve different problems. Both are needed in production.

Kubernetes liveness probesOpsAI auto-remediation
LayerAvailabilityRoot cause
Question answered“Is this container currently healthy?”“Why did this container fail, and how do I prevent recurrence?”
Action takenKills and replaces an unhealthy containerChanges the configuration, resource limits, or autoscaler settings causing the failure
Memory of past failuresNone (each probe check is stateless)Full historical context: usage trends, deployment history, prior incidents
Fixes the underlying problemNoYes
ScopeSingle container health checkFull-stack: Kubernetes events, metrics, logs, APM, infrastructure

Liveness probes are about availability: they keep bad pods out of the load balancer and bring replacements in. OpsAI is about prevention: it changes the underlying configuration so the same failure does not recur. Running OpsAI does not replace liveness probes. It builds on top of them. You can see how the full observability stack connects in Middleware’s Kubernetes monitoring guide.

Auto RCA mode vs Auto Fix mode

OpsAI has two operating modes, configurable at the namespace level. The core difference is whether OpsAI applies a fix autonomously or waits for a human to approve it. You can read how this compares to other approaches in the AI SRE tools roundup.

Auto RCA mode: diagnose and propose, human approves

OpsAI runs a complete root-cause investigation in seconds (reading events, metrics, logs, and cluster state), then presents a ready-to-apply fix in the Middleware dashboard. Nothing changes in your cluster until you click Approve.

  • Full investigation completes in under 30 seconds
  • Fix is written out clearly with before/after diff
  • One-click apply from the dashboard

Best for: Production namespaces with change approval policies

OpsAI Auto RCA panel showing a complete root cause

OpsAI Auto RCA panel showing a complete root cause investigation findings, evidence (code analysis, trace correlation), severity rating, and two ready-to-apply fix options, all prepared and waiting for one-click human approval.

Auto Fix mode: detect, diagnose, and fix automatically

OpsAI runs the same investigation and then applies the fix directly to the cluster, or opens a pull request for code-level issues. No human step is needed.

  • Detect-to-fix latency measured in seconds, not minutes
  • Memory limit patches applied immediately after OOMKill pattern is confirmed
  • HPA and resource limits updated automatically
  • Confidence threshold is configurable per namespace

Best for: Staging and development namespaces where speed matters and rollback is trivial

Recommended configuration for most teams

NamespaceRecommended modeReason
productionAuto RCAEvery cluster change reviewed before apply; review takes around 30 seconds with OpsAI’s prepared analysis
stagingAuto FixSpeed matters; wrong fixes are easy to revert
devAuto FixFrictionless remediation for developer environments

How to automate Kubernetes pod crash recovery without writing custom scripts

Custom remediation scripts are a common first approach: write a script that watches for OOMKilled events and bumps memory limits. They work within limits. For the manual troubleshooting techniques teams use before adopting automation, see top 10 Kubernetes troubleshooting techniques. Here is why Middleware OpsAI is a better long-term approach:

Custom scriptsMiddleware OpsAI
Failure patterns coveredOnly patterns you have already seen and coded forDynamic analysis across any failure pattern, including novel ones
Context usedUsually a single signal (the event type)Cross-references events, metrics, logs, APM, and deployment history simultaneously
Audit trailManual logging if you built itImmutable audit log built in with every action, signal, confidence score, and before/after state
Change approval integrationManualNative integration with standard change management workflows
Maintenance overheadHigh (scripts need updates as your stack evolves)Zero (OpsAI learns workload baselines continuously)
HPA and VPA awarenessTypically noneFirst-class (monitors autoscaler state as a primary signal)

What OpsAI fixes automatically vs what needs human approval

The table below shows default behavior. Everything is configurable through namespace-level policies in OpsAI settings. Check Middleware pricing to see which plan includes auto-remediation.

ScenarioWhat OpsAI doesApproval required?
Pod OOMKilled 3+ times in 6 hoursPatches memory limits using P99 usage + 30% buffer. Rolls out via rolling update.Auto Fix
CrashLoopBackOff from broken ConfigMap or volume mountFixes the volume mount reference. Triggers rolling restart.Auto Fix
HPA stuck at maxReplicas while latency is risingReviews historical load data and updates maxReplicas. Patches HPA spec.Human Review (prod)
CrashLoopBackOff with ambiguous application errorFull RCA. Opens GitHub PR with proposed code fix.Human Review
Node pressure causing pod evictionsReschedules affected pods. Alerts on node capacity shortfall.Auto Fix / Human Review
Memory limit increase would double the current valueAlways escalates regardless of namespace policy.Human Review
Single OOMKill, first occurrenceLogs event, runs lightweight check, monitors for recurrence.Monitor Only

Every action OpsAI takes is recorded in an immutable audit log in the Middleware dashboard, including the exact patch applied, signals that triggered it, confidence score, and before/after pod state. The log is exportable and integrates with standard change management workflows.

See Auto Fix and Auto RCA in action. Middleware OpsAI works across your entire stack Kubernetes, APM, logs, and infrastructure in one platform. Start free · View pricing

How OpsAI watches your Kubernetes cluster

OpsAI connects through the Middleware agent and streams the following signals simultaneously, cross-referencing all of them within 30 seconds of a failure event. For a broader look at how Middleware surfaces workload failures visually, see troubleshooting Kubernetes workloads with Middleware:

  • Kubernetes Events API: OOMKilled, CrashLoopBackOff, BackOff, FailedMount, Evicted, and all Warning-level events
  • Pod phase transitions: Every Pending to Running to Failed transition with timestamps
  • Resource metrics: CPU throttling, memory working set vs limit, restart counts, node pressure
  • Container logs: Ingested and matched against failure patterns in real time
  • HPA and VPA state: Autoscaler conditions, current vs desired replica counts, metric source health
  • Deployment history: Recent rollouts, ConfigMap changes, and spec diffs correlated with failure timing

How to connect OpsAI to your Kubernetes cluster

Install the Middleware agent once as a DaemonSet. OpsAI starts streaming data immediately. Full steps are in the Kubernetes agent installation docs. For a walkthrough of the full monitoring setup, see Kubernetes application monitoring with Middleware.

# Step 1: Add the Middleware Helm repo
helm repo add middleware https://helm.middleware.io
helm repo update

# Step 2: Install the agent with OpsAI enabled
helm install mw-agent middleware/mw-agent \
  --namespace middleware \
  --create-namespace \
  --set mw.apiKey=YOUR_API_KEY \
  --set opsai.enabled=true \
  --set opsai.autoFix.namespaces="staging,dev"

# Step 3: Verify the agent is running
kubectl get pods -n middleware

NAME                     READY   STATUS    RESTARTS   AGE
mw-agent-ds-node-01      1/1     Running   0          2m
mw-agent-ds-node-02      1/1     Running   0          2m

What happens after installation:

  • With real-time, OpsAI can detect anomalous patterns against early baselines
  • Ongoing: OpsAI continuously refines baselines as workloads evolve

See Middleware pricing for plan details, including OpsAI auto-remediation.

Connect your cluster and let OpsAI fix its first incident usually within minutes of installation. No credit card required. Works with any Kubernetes distribution. Get started free · Book a live demo

FAQs

Does Kubernetes fix OOMKilled pods automatically?

Kubernetes restarts an OOMKilled pod automatically but does not fix the root cause. The pod restarts with the same memory limit and will be OOMKilled again under the same workload conditions. Middleware OpsAI patches the memory limit based on actual usage history so the crash does not repeat.

What causes a pod to keep restarting in Kubernetes?

The most common causes are: memory limit too low (OOMKill, exit code 137), broken ConfigMap or Secret reference (exit code 1 with a missing-file error in logs), liveness probe firing too early (exit code 143), application startup exception, or a failed dependency like a database connection. Each has a distinct signature in exit codes and logs, which is how OpsAI identifies and fixes them.

How does Kubernetes self-healing work and what are its limits?

Kubernetes self-healing keeps pods running through container restarts, liveness probes, ReplicaSet enforcement, and node failure recovery. The limit is that it only reacts to failures. It never diagnoses them or prevents recurrence. It restarts a pod with the same broken configuration every time. See the Kubernetes monitoring guide for a full overview of what Kubernetes exposes for observability.

Can an AI agent automatically adjust HPA settings when it detects misconfiguration?

Yes. Middleware OpsAI monitors HPA behavior (not just pod health) and detects patterns like thrashing, scale-up lag, and maxReplicas being hit repeatedly. In Auto Fix mode it patches the HPA spec directly. In Auto RCA mode it prepares the change for one-click human approval. See the OpsAI documentation for configuration options.

How can I automate Kubernetes pod crash recovery without writing custom scripts?

Middleware OpsAI installs via a single Helm command and immediately begins watching for failure patterns across OOMKills, CrashLoopBackOff, and HPA issues. Unlike custom scripts, it handles novel failure patterns dynamically, maintains a full audit trail, and integrates with change approval workflows with no script maintenance required.

Does Kubernetes self-healing fix OOMKilled pods automatically?

No. Kubernetes restarts an OOMKilled pod automatically, but it restarts it with the same memory limit that caused the kill in the first place. The pod will be OOMKilled again under the same workload conditions. Kubernetes keeps your pods running. It does not fix the underlying configuration that is causing them to fail.