Debugging Production Issues: A Systematic Approach

Debugging, DevOps

Production issues are stressful. Here’s a systematic approach to debugging them quickly.

The Framework

  1. Gather Information
  2. Form Hypothesis
  3. Test Hypothesis
  4. Fix and Verify
  5. Document

Step 1: Gather Information

Logs

# Tail logs
tail -f /var/log/app.log

# Search for errors
grep -i "error" /var/log/app.log | tail -100

# Filter by time window (assumes each line starts with a sortable timestamp)
awk '$0 >= "2024-01-15 14:00" && $0 <= "2024-01-15 15:00"' app.log
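
If the service runs under systemd, journalctl can do the same time-boxed filtering without awk. A minimal sketch, assuming the unit is named myapp:

# Time-boxed error search for a systemd unit ("myapp" is an assumed unit name)
journalctl -u myapp --since "2024-01-15 14:00" --until "2024-01-15 15:00" | grep -i "error"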

Metrics

  • CPU usage
  • Memory usage
  • Request rate
  • Error rate
  • Response time
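
When a dashboard isn’t handy, a quick shell snapshot covers CPU and memory; request rate, error rate, and response time usually come from your metrics stack (see Tools below).

# Quick resource snapshot on the host (or `kubectl top pods` on Kubernetes)
uptime        # load averages
free -m       # memory in MB
vmstat 1 5    # CPU, memory, and I/O, sampled every second for 5 seconds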

Recent Changes

# Git commits in last 24 hours
git log --since="24 hours ago" --oneline

# Deployments
kubectl rollout history deployment/myapp
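
On Kubernetes it also helps to see what a specific revision actually changed and whether the cluster logged recent warnings. The deployment name and revision number below are illustrative:

# Pod template of a specific revision
kubectl rollout history deployment/myapp --revision=3

# Recent cluster events, newest last
kubectl get events --sort-by=.lastTimestamp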

Step 2: Form Hypothesis

A good hypothesis is specific and testable:

Bad: “Something is broken”
Good: “Database connection pool exhausted due to slow queries”

Step 3: Test Hypothesis

-- Check active database connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Check for queries running longer than 5 seconds
SELECT query, query_start, state
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 seconds';
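
To tie the numbers back to the hypothesis, compare the connection count against the server-side limit. A sketch using psql; the host, database, and monitoring role are assumptions:

# Connections by state vs. the server's limit (connection details are assumed)
psql -h db.example.com -U monitor -d mydb <<'SQL'
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
SHOW max_connections;
SQL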

Step 4: Fix and Verify

# Increase pool size (assumes the app maps HIKARI_MAX_POOL_SIZE to Hikari's
# maximumPoolSize; adjust to however your app binds its config)
kubectl set env deployment/myapp HIKARI_MAX_POOL_SIZE=20

# Monitor active connections (Micrometer exposes HikariCP metrics under hikaricp.*)
watch -n 1 'curl -s http://localhost:8080/actuator/metrics/hikaricp.connections.active'
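
Verification should also confirm the rollout itself completed and that errors have actually stopped, not just that the pool metric looks healthier:

# Confirm the new pods rolled out cleanly
kubectl rollout status deployment/myapp

# Spot-check that errors are no longer appearing
tail -f /var/log/app.log | grep -i "error"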

Step 5: Document

## Incident: High Error Rate (2024-01-15)

**Symptoms:** 50% error rate, slow responses

**Root Cause:** Database connection pool exhausted

**Fix:** Increased pool size from 10 to 20

**Prevention:** Add alerting for pool utilization > 80%

Common Patterns

Memory Leak

# Heap dump
jmap -dump:format=b,file=heap.bin <pid>

# Analyze the dump offline with Eclipse MAT (Memory Analyzer)
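
Before pulling a full heap dump, a class histogram and GC stats often show whether the heap is actually growing. Note that jmap -histo:live forces a full GC, so use it with care on an already struggling instance:

# Top object types on the heap (triggers a full GC)
jmap -histo:live <pid> | head -20

# Watch GC behavior; old-gen utilization climbing toward 100% suggests a leak
jstat -gcutil <pid> 1000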

CPU Spike

# Thread dump
jstack <pid> > threads.txt

# Profile with perf
perf record -p <pid> -g -- sleep 30
perf report
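
For a JVM process, correlating the hottest OS thread with the thread dump narrows things down quickly: top -H shows per-thread CPU, and the thread ID in hex matches the nid field in the jstack output.

# Find the busiest thread (TID column)
top -H -p <pid>

# Convert the TID to hex and look it up as "nid=0x..." in threads.txt
printf '%x\n' <tid>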

Network Issues

# Check connectivity
telnet db.example.com 5432

# DNS resolution
nslookup db.example.com

# Packet capture
tcpdump -i any -w capture.pcap port 5432
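
If the basics look fine but connections are still slow or dropping, kernel TCP counters can reveal retransmissions or resets (Linux-specific):

# TCP retransmission and reset counters (run twice and compare the deltas)
netstat -s | grep -iE "retrans|reset"

# Sockets to the database and their states
ss -tan | grep ":5432"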

Tools

  • Logs: ELK Stack, Splunk, CloudWatch
  • Metrics: Prometheus, Grafana, Datadog
  • Tracing: Jaeger, Zipkin
  • APM: New Relic, AppDynamics
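
Most of these expose an API you can hit from the shell mid-incident. Here is a sketch against the Prometheus HTTP query API; the endpoint is standard, but the hostname, the metric name http_requests_total, and its labels are assumptions about your instrumentation:

# 5xx request rate over the last 5 minutes via the Prometheus query API
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))'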

What’s your worst production incident? How did you solve it?