Tuesday, December 16, 2008

A Tale of Hunting Down an Elusive Bug

I am a developer, not a tester, so finding and reproducing bugs is not my forte. But I was also in charge of software builds, and this incident occurred when I was upgrading Automated Build Studio (ABS) from version 4 to version 5. The upgrade gave me a taste of what it is like to hunt down an elusive bug.

Upgrading ABS from version 4 to version 5 was especially problematic; a lot of macros stopped working all of a sudden. Going through the production script was no joke, because many of the macros involve lengthy, time-consuming operations such as reading files from the network, pulling data from an external database, or running long builds. In the end I spent half a day debugging my scripts and discovered 7 or 8 bugs in total. All of these could be reproduced easily, so I wrote a bug description for each case and forwarded them to AutomatedQA, the maker of ABS. Being a good software vendor, AutomatedQA fixed every one of them and emailed me the patch 2 weeks after I reported the problems.

There was only one elusive bug, the kind that won't surrender to a concentrated attack. What happened was something like this: running a form operation in ABS would crash Google Chrome. The hard part was that it didn't happen all the time. Sometimes I could run the whole build script without a single glitch, but sometimes the build script would simply crash Google Chrome. All my efforts to trap the bug and reproduce it consistently proved futile. It was a non-deterministic bug. Given that it wasn't critical, I left it alone and hoped that a future version of ABS would fix the problem automagically. Also, why on earth ABS should crash Google Chrome was quite a mystery. They are different applications, and surely each is confined to its own domain, no?

The version of ABS went from 5.0.1 to 5.1.0, but the fix didn't come. I could still observe the crash from time to time, though never on demand. As any software developer will tell you, a bug that can't be reproduced might as well not exist. If you can't reproduce it, how can you fix it? And even if you could, how would you tell whether you had really fixed the problem instead of just wasting your time?
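One rough way to put a number on that last question is a little probability. Suppose, purely as an assumption for illustration (nothing here was measured), that the bug shows up in a fixed fraction of runs and that runs are independent. If the bug were still present, the chance of seeing n clean runs in a row would be (1 - p)^n, so a fix only starts to look believable once that chance is small. The function name and the 95% default below are illustrative choices, not anything from ABS.

```python
import math


def clean_runs_needed(failure_rate, confidence=0.95):
    """Estimate how many consecutive clean runs are needed before a fix is believable.

    Assumes each run fails independently with probability `failure_rate`
    while the bug is present (a simplifying assumption, not a measured fact).
    Returns the smallest n with (1 - failure_rate) ** n <= 1 - confidence.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Hypothetical rates: a bug that bites in 1 of 10 runs versus 1 of 100.
    for rate in (0.1, 0.01):
        print(f"failure rate {rate}: {clean_runs_needed(rate)} clean runs")
```

The rarer the failure, the more clean runs it takes, which is why an intermittent bug is so hard to declare fixed without a reliable way to trigger it.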

But as the crashes kept happening, I gained some insight. One thing I noticed was that the crash happened more often when I had Gmail open. The crash would still occur randomly without Gmail, but the chances of hitting the problem were higher when Gmail was open.

Finally, the day came when I felt confident enough that I could reproduce the problem. So I:
  1. Logged into Gmail using Google Chrome
  2. Made a copy of the script minus the later part that didn't contribute to the problem, since the problem occurred quite early in the script
  3. Ran the test script again and again
The bug didn't occur in the first few runs, then suddenly a "Whoa! Google Chrome has Crashed, Restart now?" message box appeared.
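The "run it again and again" step can be sketched as a small harness. This is only an illustration in Python, not the ABS script I actually used: `run_until_failure` and `make_flaky_step` are hypothetical names, and the flaky step is a stand-in that fails at random rather than a real ABS form operation.

```python
import random


def run_until_failure(step, max_runs=50):
    """Call `step` repeatedly and return the 1-based number of the first
    run that raises an exception, or None if every run succeeds."""
    for attempt in range(1, max_runs + 1):
        try:
            step()
        except Exception as exc:
            print(f"run {attempt}: failed with {exc!r}")
            return attempt
    return None


def make_flaky_step(failure_rate, seed):
    """Build a stand-in for an intermittent operation (hypothetical).

    The returned function raises RuntimeError with probability
    `failure_rate` on each call; the seed makes a session repeatable.
    """
    rng = random.Random(seed)

    def step():
        if rng.random() < failure_rate:
            raise RuntimeError("simulated crash")

    return step


if __name__ == "__main__":
    first_failure = run_until_failure(make_flaky_step(0.2, seed=1), max_runs=50)
    if first_failure is None:
        print("no failure observed; the bug may still be there")
    else:
        print(f"first failure on run {first_failure}")
```

Recording which run failed matters as much as the failure itself: it gives a feel for how often the bug bites and how many runs a trimmed-down script needs before it can be trusted as a reproduction.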

I had seen this message box a few times before, and this was the only time it left me elated.

I had reproduced the problem!

Not yet satisfied, I ran the scripts a few more times to confirm that the problem was indeed occurring. In the process I also trimmed the scripts down to the minimum. Now I had something to report to technical support!

ABS support replied the next day. Yes, they too had reproduced the problem. I felt a sense of pride and relief. Nailing down an elusive bug didn't just help software quality; it was also intellectually satisfying. Aha, I said to the naughty bug, I finally got you!

How did the problem occur? I asked the support.

And here's their reply:

Here are the details: Automated Build Studio installs or removes global mouse and keyboard hooks when an interactive form is opened or closed. If this happens too frequently, Google Chrome crashes. The crash occurs in Automated Build Studio's code that is injected to the chrome.exe process. Therefore, the problem is observed only when both applications are working simultaneously. 
To avoid the problem, we will not use global hooks in future updates of the tool.

Now I understand why good testers are hard to find, and why they are absolutely a must.
