The complexity of the solutions you can build is limited by the complexity you can debug. When code gets complex, debugging gets harder. More data is involved, setup is more difficult, and problems are harder to reproduce. Code paths are more complex. Data structures become unwieldy nested graphs. Stack traces run many hundreds of frames deep. You face multi-threaded timing problems, deadlocks, and intermittent errors. Do you yield at this point, or do you dig in and find and fix the problems? If you walk away in fear, your system faces failure, or at best lives on as a buggy, hated thing people want to replace instead of a solid, stable system that runs for years.
One of the reasons I’ve been successful in my career is that I’m good at debugging the hard problems. I’m not afraid to tackle a more complex design because I am confident I can solve the more complex problems that will arise. When working with teams, I can save a lot of time by helping others find, often much faster, the problems that can otherwise suck up days or weeks.
I’ve wanted to write a post about debugging for a while, but it is a dauntingly complex topic, worthy of a book. I learned most of my tricks by looking over the shoulders of great programmers while they debugged problems. I’ve learned over the years that, when it comes to saving time, there’s no substitute for an intuitive approach. Even with a list of the many things you can do to find a problem, the secret is applying them in the right way at the right time.
Despite the difficulty, I took some time to write down some approaches I’ve found useful over the years:
- Have the right attitude. You will find the bug; it’s only a matter of time. The more frequently a bug occurs, the easier it is to find because of all the data-gathering opportunities. The longer between occurrences, the more time you have to prepare for the next occurrence so you can catch it.
- Familiarize yourself with all of the processes, threads, and data structures involved.
- Even if you can’t easily reproduce the bug in the lab, use the debugger to understand the affected code. Judiciously step into or over functions based on your level of “need to know” about that code path. Examine live data and stack traces to augment your knowledge of the code paths.
- When you can’t reproduce the bug, you may need to instrument the code with additional logging. Make sure the skeleton of major operations has adequate logging to understand what’s happening in the system. Investing some effort in improving the quality and readability of these logs will go far in the overall lifecycle of a complex system. Too much logging swamps performance and hurts readability. But with time-stamps, user-ids, user-agent strings, session-ids, and basic operations, you learn a lot about the running system and why it might have failed for one particular user. Logging is crucial for multi-threaded interactions.
- Generate theories as to what might be causing the problem and test those theories. Keep an open mind. Generate as many theories as possible before you start the longer process of testing those theories. You may decide to test more than one at the same time.
- If you have no theories, you need to learn more about the system, particularly information relevant to the code paths causing the bug. Adding additional logging is a good way to do that when sporadic errors cannot be reproduced.
- Familiarize yourself with all of the layers of the system, at least at an intuitive level, from the hardware on up. This will help you visualize what’s going on in your mind’s eye so your intuition can help steer you towards the most likely source of the problem.
- For certain types of complex code, I will write debugging code, which I put in temporarily just to isolate a specific code path where a simple breakpoint won’t do. I’ve found using the debugger’s conditional breakpoints is usually too slow when the code path you are testing is complicated. You may hit a specific method thousands of times before the call that causes the failure. The only way to stop in the right iteration is to add specific code to test for values of input parameters. Always do a System.out.println or log some visible, unique, consistent token. This makes it easy to find and remove these code snippets when debugging is complete. Once you stop at the interesting point, you can examine all of the relevant state and use that to understand more about the program.
- Some people start out by drawing pictures, flow charts, entity-relationship diagrams of their data structures, and detailed state tables. I will do this only as a last resort or for documenting the project, as it is time consuming. Examining a real instance of those data structures in the debugger is much faster, more accurate, and more informative than any diagram. The JavaDoc or a structured code browser is enough for me to understand the entities and relationships. I try to visualize data structures in my head and only resort to drawings or state tables when necessary. I think that over time, this has made me faster at visualizing and building systems.
- If you get stuck, take a break. Sleep on it, approach things fresh the next day. You may not have enough information and may need more to get the next piece of the puzzle. Too much frustration impedes your motivation and ability to focus. For the best debugging approach, you need to research all relevant aspects of a system, simulate that all in your head, use your intuition to flush out ways things could be going wrong. Instead of trying to find the problem, perhaps you need to learn more about the failure. Under what conditions does it happen? What’s unique about those cases? How can you learn more about those unique code paths?
- Do not spend too much time on minor bugs but do keep in mind the value of true reliability in a system. Your pride is not relevant. Your customers’ experience in using the software is all that matters.
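The temporary debugging code described in the list above might look something like the following. This is only a minimal sketch: the `process` method, the order id `7421`, the customer name, and the `XXX-DEBUG-TRAP` token are all hypothetical stand-ins, not code from a real system. The idea is to test the input parameters in plain code, print a unique token, and set an ordinary (fast) breakpoint on that line.

```java
import java.util.ArrayList;
import java.util.List;

public class DebugTrapSketch {
    // Unique, greppable token so the temporary code is easy to find and remove later.
    static final String DEBUG_TOKEN = "XXX-DEBUG-TRAP";

    // Records trap hits so the sketch can be checked without a debugger attached.
    static final List<String> trapHits = new ArrayList<>();

    // Hypothetical hot method: called thousands of times, only one call is interesting.
    static int process(int orderId, String customer) {
        // TEMPORARY: fire only on the suspicious combination of inputs.
        if (orderId == 7421 && "acme".equals(customer)) {
            String msg = DEBUG_TOKEN + " orderId=" + orderId + " customer=" + customer;
            System.out.println(msg); // put a plain breakpoint on this line
            trapHits.add(msg);
        }
        return orderId * 2; // stand-in for the real work
    }

    // Drives the hot method 10,000 times and returns how often the trap fired.
    static int runDemo() {
        trapHits.clear();
        for (int i = 0; i < 10_000; i++) {
            process(i, i % 2 == 0 ? "other" : "acme");
        }
        return trapHits.size();
    }

    public static void main(String[] args) {
        System.out.println("trap hits: " + runDemo());
    }
}
```

When the bug is found, searching the source tree for the token should turn up every snippet that needs to be removed.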
If you become good at debugging complex problems, your confidence as a programmer will grow, letting you tackle bigger, more relevant problems. When things go wrong, you’ll be able to step up and make things right again.
Did I miss any of your favorite debugging tips? Continue the discussion in the comments!
3 thoughts on “Debugging Hard Problems”
Jeff: great post on a topic that deserves much more coverage. Especially salient for me right now as I’m working on a tough Python memory leak problem that is necessitating custom tooling in C++. On point 5, about theories, I find my own theories or preconceptions can sometimes blind me to the real nature of the underlying problem. I agree it’s important to keep an open mind. One little “meditational mantra” I chant to myself when debugging is “listen to the system”. Read all those stack traces, logs and err msgs again, with an open mind, and try not to force them into whatever hypothesis you favour at any given time.
Great point John O’. Good luck with that bug. Maybe someone needs to build some nice tools to help track down Python/C++ memory problems. In the old days, Java memory leaks were a real challenge. You’d think they’d be easy cause the leak is actually a big pussball of objects in some reference chain. Just find the chain. Once the dump analyzers were written, they became easier as long as the leak was a really big fast leak. One hack I used before the heap analyzers was to patch in a modified version of the major container classes (Vector, HashMap, etc.) to dump a stack trace when the size reached a threshold like (size % 10,000) == 9,999. This would often find the allocation at fault without even rooting around in the dump. You’d at least get an idea of the code paths involved in building the large data structures, assuming any one data structure grows large. Finding the missing “release reference” is harder, though, in mixed-language environments. Code inspection seems like the place I’d start with that. Make an exhaustive list of all code paths that touch the leaked data object. Then follow each one methodically.
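For anyone curious, the container hack can be approximated roughly as below. This is a sketch under assumptions, not the original patch: it subclasses `ArrayList` rather than modifying `Vector` or `HashMap` in place, the class name `LeakTracingList` and the threshold of 10,000 are made up for illustration, and only `add(E)` is instrumented (bulk operations such as `addAll` are not).

```java
import java.util.ArrayList;

public class LeakTracingList<E> extends ArrayList<E> {
    // Assumed threshold: dump a trace each time the list grows by this many elements.
    private static final int THRESHOLD = 10_000;

    // Number of stack traces dumped so far by this instance.
    private int dumps = 0;

    @Override
    public boolean add(E e) {
        boolean added = super.add(e);
        // Fires at sizes 9,999, 19,999, 29,999, ...
        if (size() % THRESHOLD == THRESHOLD - 1) {
            dumps++;
            // The stack trace shows which code path is growing this container.
            new Throwable("LEAK-TRACE size=" + size()).printStackTrace();
        }
        return added;
    }

    public int getDumps() {
        return dumps;
    }

    public static void main(String[] args) {
        LeakTracingList<Integer> list = new LeakTracingList<>();
        for (int i = 0; i < 25_000; i++) {
            list.add(i);
        }
        System.out.println("stack traces dumped: " + list.getDumps());
    }
}
```

The traces go to standard error, so a container that keeps growing shows up as the same call stack repeated over and over in the output.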