Debugging: The 9 indispensable rules for finding even the most elusive software and hardware problems. By: David J. Agans

  • Hassan Salem
  • 5 min read

Book review

David Agans keeps things lean and tactical. The 2002 classic still reads like a series of battle reports, only now the stories sound even more relevant with cloud-native stacks, microservices, and hardware-software hybrids everywhere. I like how short it is—you can finish it on a weekend flight and walk away with a repeatable playbook you can run on Monday.

Summary

The book teaches you how to get from “something is off” to root cause quickly. Agans insists on ruthless observation, disciplined experiments, and teamwork. The nine rules are timeless, and pairing them with modern tooling (observability dashboards, packet captures, feature flags) makes them even more effective in 2024.

The rules (with 2024 twists)

  1. Understand the system — Sketch the architecture, read the docs, study runbooks, know the SLAs.
  2. Make it fail — Reproduce in a controlled way; capture logs, traces, and metrics.
  3. Quit thinking and look — Inspect the evidence before you theorize.
  4. Divide and conquer — Bisect code, configs, or network paths until the failure window is small.
  5. Change one thing at a time — Isolate variables; use feature flags or canaries when touching production.
  6. Keep an audit trail — Journal your experiments; commit temporary scripts; snapshot dashboards.
  7. Check the plug — Validate assumptions about dependencies, credentials, cables, DNS, and quotas.
  8. Get a fresh view — Pair debug, rubber duck, or call in someone from another team.
  9. If you didn’t fix it, it ain’t fixed — Monitor after the patch; keep watch for regression signals.

The rules look simple, but you get the most value by stacking them. I rarely follow them in strict order because every outage and bug has its own shape.

Rules in practice

Understand the system

Read the architecture doc, the onboarding wiki, and the code comments. Map how requests flow across services and hardware. In distributed systems you also need to understand contracts: API schemas, message queues, retry policies, and backoff timers. Learn your tools—kubectl, journalctl, adb, oscilloscopes—so you know what data they can surface. Do not rely on memory; pull the dashboards and specs every time.

Make it fail

“If it doesn’t fail, I can’t fix it.” — David J. Agans

Reproduce the issue step by step. Automate the failure with a failing unit test, a curl script, or a load scenario in k6. Capture the exact build ID, configuration, and environment variables. For intermittent bugs, log everything, compare good runs against bad ones, and look for what differs between them. When I chase flaky tests, I rerun them in a loop with pytest -k flaky_test --maxfail=1, record seed values, and save the artifacts. Never fake the failure mode: a simulated bug can mask the real trigger.
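The seed-recording habit can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual setup: `flaky_check` is a hypothetical stand-in for a flaky test, and the "bug" inside it is invented so that the example has something to fail on.

```python
import random

def flaky_check(rng):
    """Hypothetical stand-in for a flaky test; returns True on pass."""
    values = [rng.randint(0, 9) for _ in range(5)]
    # Invented bug: the code under test mishandles an all-even batch.
    return not all(v % 2 == 0 for v in values)

def find_failing_seeds(check, attempts=2000):
    """Run `check` once per seed and return the seeds that made it fail."""
    failing = []
    for seed in range(attempts):
        rng = random.Random(seed)  # fresh, seeded generator for every run
        if not check(rng):
            failing.append(seed)
    return failing

failing_seeds = find_failing_seeds(flaky_check)

# Replaying a recorded seed gives the same outcome every time, which turns
# an intermittent failure into one that fails on demand.
replays = [flaky_check(random.Random(seed)) for seed in failing_seeds]
```

Once a failing seed is written down, the intermittent bug behaves like a deterministic one, and the other rules become much easier to apply.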

Quit thinking and look

Richard Feynman called it “leaning over to look.” Open the logs, dump the database record, check the oscilloscope trace. Facts first, theories later. Screenshot metrics before you restart anything. When everything looks fine, dig deeper: turn on debug logging, capture packets, or attach a profiler. Evidence keeps you from twisting facts to match your favorite hypothesis.

Divide and conquer

Cut the problem in half repeatedly until the culprit has nowhere to hide. Use git bisect when a regression appeared after a merge. In production, flip feature flags or route traffic gradually with canary deployments to isolate the suspect component. For hardware issues, remove subsystems or swap cables to see if the defect travels. Every slice should either include or exclude the bug; keep slicing until the answer is obvious.
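The halving idea behind git bisect can be shown as a plain binary search. This is a sketch under stated assumptions: the first revision is known good, the last is known bad, and once the bug appears it stays. The revision names and the `is_bad` predicate are hypothetical; in real use `is_bad` would build and test the revision.

```python
def first_bad(revisions, is_bad):
    """Return the index of the first bad revision.

    Assumes revisions[0] is good, revisions[-1] is bad, and that every
    revision after the first bad one is also bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # culprit is at mid or earlier
        else:
            lo = mid  # culprit is after mid
    return hi

# Hypothetical history of 100 revisions where the bug arrived in r73.
revisions = [f"r{i}" for i in range(100)]
tested = []

def is_bad(rev):
    tested.append(rev)  # record each probe, in the spirit of the audit trail
    return int(rev[1:]) >= 73

culprit = revisions[first_bad(revisions, is_bad)]
```

With 100 revisions the search needs at most seven probes, which is why bisecting beats reading every diff in the range.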

Change one thing at a time

Whether you are editing code, swapping drivers, or toggling infrastructure knobs, vary one variable per experiment. When several changes land together, you cannot tell which one fixed the bug or which one introduced a new regression. Lean on CI, feature flags, and progressive delivery to stage changes safely. If you must apply multiple patches, verify each one separately in a branch or staging environment.
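One way to keep experiments honest is to generate them so that each differs from the baseline in exactly one setting. The sketch below assumes configuration is a flat dictionary; the keys and values (`driver`, `pool_size`, `tls`) are hypothetical.

```python
def one_change_experiments(baseline, candidates):
    """Yield (changed_key, config) pairs, each one setting away from baseline."""
    for key, value in candidates.items():
        if baseline.get(key) == value:
            continue  # no change for this key, so there is nothing to test
        config = dict(baseline)  # copy, so the baseline is never mutated
        config[key] = value
        yield key, config

# Hypothetical baseline and the changes someone wants to try.
baseline = {"driver": "v1", "pool_size": 10, "tls": True}
candidates = {"driver": "v2", "pool_size": 50, "tls": True}

experiments = list(one_change_experiments(baseline, candidates))
```

Here the result is two experiments, one changing only `driver` and one changing only `pool_size`; the unchanged `tls` setting produces no experiment.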

Keep an audit trail

Write everything down. I keep a debugging diary in Notion or a plain Markdown file with timestamps, commands, and observations. Commit temporary scripts under a throwaway branch so the team can rerun them. Saving Grafana snapshots or Datadog notebooks preserves context if someone else needs to pick up the incident later. Detailed trails also feed post-incident reviews.
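A plain Markdown diary needs very little tooling. The helper below is a minimal sketch, assuming one bullet per entry with a UTC timestamp, the command, and the observation. The file name, commands, and observations are hypothetical, and the example writes to a temporary directory so it leaves nothing behind.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def log_entry(path, command, observation):
    """Append one timestamped bullet to a Markdown debugging diary."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"- {stamp} | `{command}` | {observation}\n"
    with Path(path).open("a", encoding="utf-8") as fh:  # append, never overwrite
        fh.write(line)
    return line

with tempfile.TemporaryDirectory() as tmp:
    diary = Path(tmp) / "debug-diary.md"  # hypothetical file name
    log_entry(diary, "kubectl get pods", "api pod is restarting")
    log_entry(diary, "journalctl -u api", "crash follows every config reload")
    entries = diary.read_text(encoding="utf-8").splitlines()
```

Append-only writes matter here: the trail is only useful if earlier observations survive later ones.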

Check the plug

Never skip the obvious. Verify power, cables, certificates, secrets, DNS TTLs, IAM permissions, and quota limits. In cloud environments, expired tokens and forgotten sidecar restarts cause more incidents than exotic race conditions. Build automated health checks that fail loudly when a dependency is unavailable.
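A health check that fails loudly can be as small as the sketch below. It uses only the standard `socket` module. The check names and the stubbed results are hypothetical; real checks would call `resolves` or `tcp_reachable` against your own hosts.

```python
import socket

def resolves(host):
    """Return True if the host name resolves through DNS."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_checks(checks):
    """Run every named check and raise one error that lists all the failures."""
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        raise RuntimeError("dependency checks failed: " + ", ".join(sorted(failed)))

# Stubbed checks keep the example offline. In practice an entry might look
# like lambda: tcp_reachable("db.internal.example", 5432)  (hypothetical host).
checks = {"dns": lambda: True, "database": lambda: True, "quota": lambda: False}

try:
    run_checks(checks)
    message = ""
except RuntimeError as err:
    message = str(err)
```

Collecting every failure before raising is a deliberate choice: one run reports all the broken dependencies instead of stopping at the first.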

Get a fresh view

When you are out of ideas, invite another engineer, a support specialist, or even a product manager to walk through the symptoms. Rubber-ducking forces you to explain the problem clearly; half the time you catch the mistake mid-sentence. Rotating on-call engineers, forming SWAT teams, or posting in the incident channel exposes blind spots and prevents tunnel vision.

If you didn’t fix it, it ain’t fixed

After the patch lands, keep watching. Ship the fix behind a feature flag, enable verbose logging temporarily, and add metrics or synthetic checks that scream if the bug reappears. Run regression tests, chaos experiments, or load tests to ensure the system holds up. Only declare victory when you see stable signals over time.
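If the fix sits behind a flag, one possible way to confirm it is to run the reproduction with the flag off and again with it on. This sketch goes a step beyond what the paragraph above says: it treats the fix as verified only when the bug still shows up without it and never shows up with it. `parse_first_char` and its bug are hypothetical.

```python
def fix_verified(reproduce, runs=20):
    """Return True only if the bug appears with the fix off and never with it on.

    `reproduce(fix_enabled)` must return True when the bug shows up.
    """
    seen_without_fix = any(reproduce(False) for _ in range(runs))
    seen_with_fix = any(reproduce(True) for _ in range(runs))
    return seen_without_fix and not seen_with_fix

def parse_first_char(payload, fix_enabled):
    # Hypothetical bug: empty input raises IndexError unless the fix guards it.
    if fix_enabled and not payload:
        return None
    return payload[0]

def reproduce(fix_enabled):
    """Drive the failing input through the code and report whether it broke."""
    try:
        parse_first_char("", fix_enabled)
    except IndexError:
        return True
    return False

verified = fix_verified(reproduce)
```

If the bug cannot be reproduced with the fix off, the function returns False. In that case nothing proves the patch was responsible, only that the failure stopped showing up.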

Final thoughts

Agans’ nine rules are still the backbone of modern debugging. Pair them with today’s tooling—observability platforms, continuous integration, remote debuggers—and you get a rigorous feedback loop that works across firmware, backend services, and mobile apps. Debugging is a craft; the more disciplined you stay, the faster you earn the reputation for fixing the “impossible” bugs.

Written by: Hassan Salem

A developer sharing my journey.