📋 Quick Steps
The 5-minute triage commands to understand any Kubernetes cluster when you have zero context.
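The command list this header promises didn't survive the page export. A plausible reconstruction, bundled into a paste-able function, might look like the sketch below; each command is standard kubectl, but the exact set is my assumption:

```shell
#!/usr/bin/env bash
# Hypothetical triage bundle -- paste into your shell, then run `triage`.
triage() {
  kubectl config current-context                      # which cluster am I actually on?
  kubectl get nodes                                   # any NotReady nodes?
  kubectl get pods -A | grep -Ev 'Running|Completed'  # pods in a bad state
  kubectl get events -A --sort-by='.lastTimestamp' | tail -n 20  # the freshest noise
}
```

Every command here is read-only, so running it blind is safe.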
Welcome to the Jungle
You didn't deploy to Kubernetes today. Kubernetes deployed to you. One moment you're sipping coffee, the next you're staring at a terminal connected to a production cluster with zero documentation, three different deployment tools, and alerts screaming about something called "pod-disruption-budget-violation." The last person who understood this system left six months ago, taking the tribal knowledge with them.
This isn't your beautiful greenfield project. This is the Kubernetes Cluster-from-Hell, and your job isn't to architect it—it's to survive it. The good news? Every haunted cluster follows patterns. Once you know what to look for, you can map the chaos, stop the bleeding, and maybe even make it better for the next poor soul.
TL;DR: Your Survival Kit
- First 5 minutes: Run the triage commands above. Don't touch anything until you know what's running and what's on fire.
- First hour: Map namespaces, ingress, storage, and configs. Find the previous admin's hidden notes (check `kubectl describe configmap`).
- First day: Build your escape plan: document everything so you're not the next ghost in the machine.
Phase 1: The 5-Minute Triage (Don't Panic, Yet)
Your first connection is the most dangerous. You have no context, and every command could be a landmine. Start with observation, not action.
Step 1: Assess the Battlefield
Run `kubectl config get-contexts`. Are you in the right cluster? I've seen developers debug "production issues" on a staging cluster for two hours. Check your current namespace with `kubectl config view --minify | grep namespace`. Now run the Quick Steps commands from the top of this post. Look for:
- CrashLoopBackOff: The classic. Something keeps dying on startup.
- Pending pods: Usually means no resources or persistent volume issues.
- Old pods: Pods running for 300+ days are either incredibly stable or zombies no one dares touch.
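To surface just the unhappy pods from the statuses above, a small filter over `kubectl get pods -A` output works. This helper is a sketch; the status keywords are the usual suspects, not an exhaustive list:

```shell
#!/usr/bin/env bash
# Usage: kubectl get pods -A | unhealthy_pods
# Keeps the header plus any pod in a state worth investigating.
unhealthy_pods() {
  awk 'NR == 1 || /CrashLoopBackOff|ImagePullBackOff|Pending|Error|Evicted/'
}
```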
Step 2: Find the Alerts (Follow the Screaming)
Check events sorted by time: `kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "error\|fail\|backoff\|pending"`. Don't get distracted by warnings from 30 days ago. Focus on what's happening now. Pro tip: if you see `ImagePullBackOff`, someone deleted a Docker image or changed permissions. Classic Friday afternoon move.
Phase 2: Mapping the Chaos
Now that you know what's bleeding, figure out how the body is supposed to work. Reverse-engineer the architecture before you try to fix anything.
Step 3: Discover the Ingress Points
Where does traffic enter? Run `kubectl get ingress --all-namespaces`. No ingresses? Check for LoadBalancer services: `kubectl get svc --all-namespaces | grep LoadBalancer`. Still nothing? Maybe it's NodePort hell. Look for patterns in service names; you'll often find `-prod`, `-api`, and `-web` clues.
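The three-step hunt (ingresses, then LoadBalancer or NodePort services, then name clues) can be bundled into one read-only helper. The function name is mine; the queries are the ones described above:

```shell
#!/usr/bin/env bash
# Walk the usual traffic entry points in order. Purely read-only.
find_entrypoints() {
  echo '== Ingresses =='
  kubectl get ingress --all-namespaces
  echo '== LoadBalancer / NodePort services =='
  kubectl get svc --all-namespaces | grep -E 'LoadBalancer|NodePort'
  echo '== Services with telling names =='
  kubectl get svc --all-namespaces | grep -E -- '-(prod|api|web)'
}
```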
Step 4: Uncover the Secrets (Literally)
Secrets and config maps are the cluster's diary. Run `kubectl get secrets,configmaps --all-namespaces | head -20`. Look for config maps with names like "app-config" or "environment", then inspect them with `kubectl describe configmap <name>`. You might find database URLs, feature flags, or, if you're lucky, actual documentation.
Step 5: Follow the Storage Trail
Persistent volumes are where data goes to die or cause problems. Check `kubectl get pvc --all-namespaces`. See any stuck in "Pending" status? That's your disk space issue. Notice PVCs with 1Gi claims on 100Gi volumes? Someone didn't understand storage classes.
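Stuck claims are easy to pick out mechanically. This filter over `kubectl get pvc --all-namespaces` output is a sketch; it assumes the default column layout, where STATUS is the third column:

```shell
#!/usr/bin/env bash
# Usage: kubectl get pvc --all-namespaces | pending_pvcs
# Keeps the header plus any claim stuck in Pending.
pending_pvcs() {
  awk 'NR == 1 || $3 == "Pending"'
}
```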
Phase 3: Emergency Procedures
Now you know what's running and how it's connected. Time to decide: coffee or panic?
When to Make Coffee (It's Fine, Probably)
- Single pod restarting occasionally (check logs first: `kubectl logs <pod> --previous`)
- Old warnings in events (more than 1 hour old)
- CPU/Memory usage under 80% on nodes
When to Panic (Controlled, Professional Panic)
- Multiple nodes showing "NotReady" status
- Critical service (database, message queue) pods down
- All replicas of a deployment unavailable
- Persistent volume claims failing across namespaces
For actual fires: `kubectl delete pod <name>` can kill a stuck pod (its controller will recreate it). `kubectl scale deployment <name> --replicas=0`, then scaling back to the original count, can restart a service. Have the rollback command ready before you touch anything.
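A restart with the rollback command prepared up front might look like this sketch. `safe_restart` and its arguments are hypothetical names of my own; `kubectl rollout restart`, `rollout status`, and `rollout undo` are real subcommands:

```shell
#!/usr/bin/env bash
# Usage: safe_restart <namespace> <deployment>
# Prints the rollback commands first, then restarts and waits for completion.
safe_restart() {
  local ns="$1" name="$2" replicas
  replicas=$(kubectl get deployment "$name" -n "$ns" -o jsonpath='{.spec.replicas}')
  echo "Rollback: kubectl rollout undo deployment/$name -n $ns"
  echo "Original scale: kubectl scale deployment/$name -n $ns --replicas=$replicas"
  kubectl rollout restart "deployment/$name" -n "$ns"
  kubectl rollout status "deployment/$name" -n "$ns" --timeout=120s
}
```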
The "Oh God Why" Checklist
Your cluster is haunted if you find three or more of these:
- Everything in the `default` namespace: The telltale sign of a cluster built by someone who stopped at the first tutorial.
- `latest` tags everywhere: `kubectl get pods -o jsonpath='{..image}' | grep latest`. Finding "latest" tags in production is like finding raw chicken at a sushi restaurant.
- No resource limits: Run `kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}' | grep -v limits`. If nothing has limits, your cluster is a resource-free buffet.
- Secrets in environment variables: Check with `kubectl get pods -o jsonpath='{..env}' | grep -i secret`. They should be mounted as volumes, not exposed in env.
- More than 5 Helm releases with "test" in the name: `helm list --all-namespaces | grep -i test`. Test releases that were never cleaned up.
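The checklist above scripts nicely into a one-shot audit. This is a sketch: the function name and output format are mine, while the kubectl and helm queries are the ones from the list:

```shell
#!/usr/bin/env bash
# Prints one line per haunting sign found; three or more means trouble.
haunted_audit() {
  kubectl get pods -o jsonpath='{..image}' | tr ' ' '\n' | grep -q ':latest$' \
    && echo 'SIGN: latest tags in use'
  kubectl get pods -o jsonpath='{..env}' | grep -qi secret \
    && echo 'SIGN: secrets in environment variables'
  [ "$(helm list --all-namespaces 2>/dev/null | grep -ci test)" -gt 5 ] \
    && echo 'SIGN: more than 5 test Helm releases'
  true  # keep the exit code clean when the last check finds nothing
}
```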
Building Your Escape Plan: Document or Die
You've survived. Now make sure the next person doesn't have to. Create a "cluster-context" document with:
- Critical Services Map: Which deployments talk to what, and which namespaces they're in.
- Ingress Inventory: URLs and which services they point to.
- Storage Dependencies: Which apps need persistent storage and where it lives.
- Secret Locations: Where credentials are stored (without the actual credentials!).
- Common Firefighting Commands: The exact kubectl commands you used to fix things.
Better yet, create a simple script that outputs this: `kubectl get all,ingress,pvc,secrets,configmaps --all-namespaces -o wide > cluster_snapshot_$(date +%Y%m%d).txt`. Run it weekly. Future you will send thank-you notes.
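Wrapped into a file with a little housekeeping, the weekly snapshot might look like this sketch; the file naming and 30-day retention are arbitrary choices of mine, and the resource list is the one from the command above:

```shell
#!/usr/bin/env bash
# Weekly cluster snapshot -- run from cron, keeps a month of history.
snapshot_cluster() {
  local out
  out="cluster_snapshot_$(date +%Y%m%d).txt"
  kubectl get all,ingress,pvc,secrets,configmaps --all-namespaces -o wide > "$out"
  helm list --all-namespaces >> "$out" 2>/dev/null
  # Prune snapshots older than 30 days.
  find . -maxdepth 1 -name 'cluster_snapshot_*.txt' -mtime +30 -delete
}
```

Note that `kubectl get secrets` in table form only lists names and types, so no secret values land in the file.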
Pro Tips from the Trenches
- Alias everything: Add
alias k='kubectl'andalias kg='kubectl get'to your shell. Seconds matter when things are burning. - Use
--contextflag religiously: Never run a command without explicitly specifying context. Your production cluster shouldn't be an accident. - Master
kubectl describeandkubectl logs --previous: 80% of debugging is in these two commands. - Check for Helm before you touch: Run
helm list --all-namespaces. If it's managed by Helm, use Helm commands to modify it, not kubectl. - Set up
kubectl get events --watchin a separate terminal: Real-time monitoring while you work. - Beware of custom resource definitions (CRDs): Run
kubectl get crd. If you see strange resources, you might be in an operator-managed cluster. Tread carefully.
Conclusion: You're Not Alone in the Dark
Every Kubernetes cluster eventually becomes legacy. The clean YAML manifests from day one become a palimpsest of quick fixes, abandoned experiments, and "temporary" solutions that outlasted their creators. Your goal isn't to fix everything—it's to understand enough to keep the lights on and make incremental improvements.
Remember: The person who created this mess probably thought they were doing the right thing with the information they had at the time. (Or they were maliciously incompetent—but let's assume good intentions.) Your survival, and eventual triumph, comes from systematic exploration, cautious intervention, and leaving better breadcrumbs than you found.
Now go check what's in the default namespace. I'll wait.
Quick Summary
- What: Developers get thrown into existing Kubernetes clusters with zero context, outdated documentation, and production on fire, leading to hours of confusion and dangerous trial-and-error