Debugging: The 9 indispensable rules for finding even the most elusive software and hardware problems. By: David J. Agans

Book review

This book is short and practical. The narrative is straightforward and reads like an exciting war story, and the author adds plenty of actual war stories that you can relate to your own real-life problems.

Summary

The book tells you how to find out what’s wrong with stuff quick.

The rules:

  1. Understand the system
  2. Make it fail
  3. Quit thinking and look
  4. Divide and conquer
  5. Change one thing at a time
  6. Keep an audit trail
  7. Check the plug
  8. Get a fresh view
  9. If you didn’t fix it, it ain’t fixed

The rules are easy to understand and apply. For me, putting them in a list made them easier to use in my daily work. I don’t always follow them in the same order, because every bug I work on differs in complexity.

Rules summary

Understand the system

That means knowing how your system works and how its parts are integrated. You can get there by reading the documentation for every component involved in the issue. Skimming is sometimes enough, but sometimes you need to read everything cover to cover because, most likely, the devil is in the details. This gives you an idea of how the system is supposed to work normally. You also have to learn the fundamentals of your technical field, and know the system’s road map: understand the top level and what all the blocks and interfaces do.

You should know what goes across all the APIs and communication interfaces in the system. Pick the right debugging tools and learn them, so you know what your tools are capable of. Finally, don’t trust your memory; look up the details instead.

Make it fail

There are three reasons to make it fail:

  1. So you can look at it and understand it
  2. So you can get a clue about the cause
  3. So you can tell when it gets fixed

Do it again and make it fail. Try to repeat everything, even down to using the same mug and coffee every time. Keep a detailed, easy-to-follow test procedure document, and always start from the beginning. If the bug report says “I logged in, went to page x, then page y, and at page z I got a problem”, you have to start from the login, then page x, page y, and page z. That is because some bugs depend on the complex state of the machine.

Stimulate the failure: try to arrange all the conditions that make the failure happen. And always remember to automate the failure sequence; in software debugging, that means creating a test that fails. Never simulate the failure. Some people simulate the failure mechanism because the bug is intermittent: they guess that a particular part of the system is failing, so they simulate that part to reproduce the bug. They are likely to solve either nothing or a different bug. Simulating the conditions that stimulate the failure, such as network load, is OK in some cases. But never simulate the failure mechanism itself.
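As a rough sketch of what “automate the failure sequence” can look like, the snippet below replays the reported steps from a fresh login every time. The `Session` class and its state-dependent bug are invented for illustration; in real work the steps would drive your actual application, not a stand-in.

```python
class Session:
    """Hypothetical stand-in for the real application under test."""

    def __init__(self):
        self.logged_in = False
        self.history = []

    def login(self):
        self.logged_in = True

    def visit(self, page):
        if not self.logged_in:
            raise PermissionError("not logged in")
        self.history.append(page)
        # Invented state-dependent bug: page z breaks only after x then y.
        if self.history[-3:] == ["x", "y", "z"]:
            raise RuntimeError("page z failed")
        return f"page {page} ok"


def reproduce(steps):
    """Replay the steps from a fresh session; return the error text, if any."""
    session = Session()
    session.login()
    try:
        for page in steps:
            session.visit(page)
    except RuntimeError as err:
        return str(err)
    return None
```

Calling `reproduce(["x", "y", "z"])` reports the failure, while `reproduce(["z"])` returns `None`, which is why jumping straight to the last page can hide a bug that depends on earlier state.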

What if the bug is intermittent? How do you make it fail? A bug is intermittent when you try many times but it fails only a few of them. The key here is that you repeat the steps, but you don’t know all the conditions affecting the failing part: initial conditions, input data, timing, outside processes, network traffic, and so on. So look for all the conditions affecting the system. You can log everything and compare the logs of good runs with those of bad ones. Do not fully trust statistics, especially with intermittent bugs, because coincidence will sometimes make you think that one condition makes it fail. Gather more information to identify the things that are always associated with the bug. With intermittent bugs, it is tough to know when they are fixed, because a bug that fails one time out of ten today might fail one time out of 1000 tomorrow. So the more test runs we do, the better.
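To put a number on “the more test runs, the better”, here is a small back-of-the-envelope calculation. It is my own illustration, not something from the book, and it assumes every run fails independently with the same probability, which real intermittent bugs often do not. The function name `runs_needed` is made up.

```python
import math


def runs_needed(failure_rate, confidence=0.95):
    """Clean runs in a row needed before a pass streak is unlikely to be luck."""
    # If the bug is still present, n clean runs in a row happen with
    # probability (1 - failure_rate) ** n. Find the smallest n that pushes
    # this probability below 1 - confidence.
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


print(runs_needed(0.1))    # bug that fails 1 time in 10
print(runs_needed(0.001))  # bug that fails 1 time in 1000
```

Under these assumptions a one-in-ten bug needs 29 clean runs in a row before you can be about 95% confident, and a one-in-1000 bug needs 2995. That is a good argument for automating the test.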

Best of all is to find a sequence of events that always goes with the failure, even if the sequence itself is intermittent, so that when it happens you are 100% sure that the bug occurs as well. Then, if the sequence occurs after your fix and the bug does not happen, you are 100% sure that you have fixed the bug. So when you think you have solved the problem, run the tests until the sequence happens.
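One possible way to hunt for such a sequence is to compare the events logged in failing runs with those logged in passing runs. The sketch below assumes each run has already been reduced to a list of event names; the function name and the sample events are hypothetical.

```python
def failure_markers(bad_runs, good_runs):
    """Events seen in every failing run and in no passing run.

    Needs at least one failing run; good_runs may be empty.
    """
    in_all_bad = set.intersection(*(set(run) for run in bad_runs))
    in_any_good = set().union(*(set(run) for run in good_runs))
    return in_all_bad - in_any_good


# Made-up event logs, for illustration only.
bad = [["start", "cache_miss", "retry", "timeout"],
       ["start", "retry", "timeout", "gc"]]
good = [["start", "cache_miss"],
        ["start", "gc"]]

print(sorted(failure_markers(bad, good)))  # ['retry', 'timeout']
```

With only a handful of runs, a marker can still be a coincidence, so treat the result as a place to look, not as proof.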

Don’t say “that cannot happen”. Accept the data and look further into the situation. In some cases we need to create our own debugging tools, so don’t be afraid to build them.

Quit thinking and look

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.

“Sherlock Holmes, A Scandal in Bohemia”

“Quit thinking and look” means don’t assume: start looking into the problem and watch it fail. Seeing the low-level failure is crucial. If you guess at how something is failing, you often fix something that is not the bug. Looking can take different forms, such as setting breakpoints, adding debug statements, monitoring program values, and examining memory.

See the failure, not just its result. What we notice when we first spot a bug is usually the result of a failure; what we need is the root cause. ✨ Look into the details: keep looking until the failure you can see has a limited number of possible causes to examine.

The measure of a good debugger is not how soon you come up with a guess or how good your guesses are, but how few wrong guesses you act on.

✨ Design instrumentation in by adding debugging tools at design time, or build instrumentation in later when you discover new issues. Look for things that will either confirm what you expect or show you the unexpected behavior that is causing the bug.
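In software, one common way to design instrumentation in is to log inputs and outputs at debug level, so the probes stay in the code but stay quiet until you need them. A minimal sketch using Python’s standard `logging` module follows; the `orders` logger name and the `apply_discount` function are made up for the example.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical module name


def apply_discount(price, percent):
    """Hypothetical function, instrumented at design time."""
    logger.debug("apply_discount in: price=%r percent=%r", price, percent)
    result = price * (1 - percent / 100)
    logger.debug("apply_discount out: result=%r", result)
    return result
```

Calling `logging.basicConfig(level=logging.DEBUG)` before the code runs switches the probes on; without it the debug lines are not printed.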

The Heisenberg uncertainty principle says that you can measure either where a particle is or where it is going, but the more precisely you measure one of these, the more you disturb the other. That is because your probes are part of the system. For debugging, it tells us that test instrumentation can affect the system under test.

It is OK to guess where the problem is most likely to be, as long as you use the guess only to decide where to look.

Divide and conquer

Divide the system into a good part and a bad part, then do the same to the bad part. Narrow the search, home in on the problem, and find the range of the target. This method is called successive approximation, and it applies when you need to find something within a range of possibilities. You start at one end of the range, then go halfway to the other end and see whether you are past the target or not. If you are, you go back to the one-fourth point and try again. If not, you go on to the three-fourths point and try again.

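Successive approximation depends on two essential details: you have to know the range of the search, and at each point you check, you have to be able to tell which side of the problem you are on.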

Start with the bad end, because there are many good ends and it is a waste of time to verify every one of them. And fix the bugs you already know about, because bugs sometimes hide each other.
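The halving search described in this rule can be sketched in a few lines. This version assumes the points are numbered (for example builds or commits), that the low end is known good and the high end is known bad, and that everything after the first bad point is also bad. The names `first_bad` and `is_bad` are mine, not the book’s.

```python
def first_bad(lo, hi, is_bad):
    """Find the first failing point between lo and hi by repeated halving.

    Assumes lo is known good, hi is known bad, and every point after the
    first bad one is also bad.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid  # the problem is at mid or earlier
        else:
            lo = mid  # the problem is after mid
    return hi
```

For example, `first_bad(0, 100, lambda n: n >= 37)` returns 37 after checking only 7 points out of 100.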

Change one thing at a time

While debugging, if a change didn’t fix the problem, revert it (use a rifle, not a shotgun). Look at all the dials and indicators, and monitor the system before taking any action. Likewise, change one test pattern at a time. Compare with a good one: take two cases, one that failed and one that did not, and compare scope traces, code traces, debug output, status windows, or whatever else you can instrument.

Another great question to answer is: what changed since the last time it worked? Isolate the key factor: for example, don’t change a plant’s watering schedule if you are looking for the effect of sunlight on it.
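When the two cases are debug output, the comparison with a good one can be automated. Here is a small sketch using `difflib.ndiff` from the Python standard library; the `trace_diff` helper and the sample traces are invented for illustration.

```python
import difflib


def trace_diff(good_trace, bad_trace):
    """Return only the lines that differ between a good and a bad trace."""
    return [line for line in difflib.ndiff(good_trace, bad_trace)
            if line.startswith(("- ", "+ "))]


# Made-up traces, for illustration only.
good = ["open port", "baud=9600", "send hello", "ack"]
bad = ["open port", "baud=19200", "send hello", "timeout"]

for line in trace_diff(good, bad):
    print(line)
```

Lines starting with `- ` appear only in the good trace and lines starting with `+ ` only in the bad one, so identical lines drop out and the differences stand alone.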

Keep an audit trail

Write down what you did, in what order, and what happened. Keep a detailed log, such as which system was running and the sequence of events leading up to the failure. Correlate: it did ‘x’ after I did ‘y’. The audit trail can take the form of VCS commit messages. Add timestamps, because time is critical for seeing the relationship between multiple systems. And remember to write it down.
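A plain notebook works, but if you want the trail in a file, something as small as the sketch below may be enough. The `note` helper and the entry layout are my own choices; UTC timestamps are used on the assumption that entries from several machines will be lined up later.

```python
from datetime import datetime, timezone


def note(trail, action, result):
    """Append one timestamped entry: what was done and what happened."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = f"{stamp} | did: {action} | saw: {result}"
    trail.append(entry)
    return entry


trail = []
note(trail, "restarted worker", "queue drained")
note(trail, "raised load to 500 req/s", "first timeout after 40 s")
```

Each entry records the action and the observation together, which makes the “it did x after I did y” correlation easy to read back.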

Check the plug

Don’t take things for granted, and don’t trust assumptions. Start from the first block, even if that means asking “Is it on or off?”. Always test the testing tools. They might be wrong, so don’t trust them 100%. One way is a manipulation test: artificially create a failing scenario and check whether the tools detect it.
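For a software checker, such a manipulation test can be as simple as planting a known fault and confirming the tool notices. The checker below (`has_duplicates`) is a hypothetical example, and so is the harness around it.

```python
def has_duplicates(records):
    """Hypothetical checker whose verdict we are about to rely on."""
    return len(set(records)) != len(records)


def checker_is_trustworthy(checker):
    """Feed the checker one known-bad and one known-good input."""
    catches_fault = checker(["a", "b", "a"])    # fault planted on purpose
    stays_quiet = not checker(["a", "b", "c"])  # nothing wrong here
    return catches_fault and stays_quiet


print(checker_is_trustworthy(has_duplicates))  # True
```

A checker that always reports “no problem”, such as `lambda records: False`, fails this test, which is exactly the kind of silent tool you want to catch before trusting its output.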

Get a fresh view

Ask people for help. Being busy with the details can build up biases, and a fresh viewer will not share them. Asking also forces you to describe the bug, which shows whether you really understand it. Ask an expert when you are not sure or when the part you are looking at is unknown to you; that is better than learning everything about it yourself. Try to find or write a troubleshooting guide, and search for people who have had the same problem. Don’t be proud; just bring in the right expertise to resolve the issue. And remember: when asking for help, report symptoms, not theories, so you don’t influence the helper’s thinking. You don’t have to be sure which details are related to the problem. Just communicate every detail.

If you didn’t fix it, it ain’t fixed

Always check that the problem is really fixed, and that it was your fix that fixed it. After fixing it, try removing your fix and make sure the problem comes back. If something fails after working for a long time, a recent change made it fail, or a recent condition such as high load. Always find the underlying conditions that created the failure. And fix the process, not the symptom: if you have to clean the data every time, for example, improve your design.
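The “remove your fix” check can be scripted too. In the sketch below, the `fixed` flag stands in for applying or reverting a change; the `parse_port` function and its whitespace bug are invented for the example.

```python
def parse_port(text, fixed=True):
    """Hypothetical function; `fixed` toggles the change under test."""
    if fixed:
        text = text.strip()  # the fix: tolerate surrounding whitespace
    if not text.isdigit():
        raise ValueError(f"bad port: {text!r}")
    return int(text)


def fix_is_proven(failing_input):
    """True only if the failure returns without the fix and vanishes with it."""
    try:
        parse_port(failing_input, fixed=False)
        fails_without_fix = False
    except ValueError:
        fails_without_fix = True
    try:
        parse_port(failing_input, fixed=True)
        passes_with_fix = True
    except ValueError:
        passes_with_fix = False
    return fails_without_fix and passes_with_fix
```

`fix_is_proven(" 8080\n")` is true, but `fix_is_proven("8080")` is false: that input never failed in the first place, so passing it says nothing about the fix.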

Final words

The book is rich with war stories and examples of how the author or his colleagues fixed real problems. Some of the issues are software issues and some are hardware. The book even covers problems that have nothing to do with the tech world, like plumbing issues. I recommend reading this book. It is easy and straightforward. To buy the book from Amazon, please use this link (Amazon’s affiliate marketing program link ♥️): https://amzn.to/3sToH2q