Testing and Telephony

• Tuesday, May 13, 2008 - Adventures with one of those elusive intermittent bugs

We took to saying the process had vaporized, because it exited unexpectedly and left no useful information behind.  It was repeatable, but not on demand - I would fire the load scenario and wait.  There was nothing useful in the log files, even when I turned the log level up.  There was no Dr Watson file (this is Windows 2003), and there should have been.  And I couldn't make it die if I ran the process with a debugger attached.

 

We had an awful time with this one - my developers were pulling their hair out, and my developers are *good*.  They finally had to go to Microsoft support for advice, which eventually did the trick.

 

My secret identity is "Mostly-clueless-with-a-PC Girl", so I'd better make notes about this for future use.  (So what am I doing working in a Microsoft shop?  Well, the product is a conference bridge, and I do the telephony testing.)

 

(1) Make sure your entire process is covered by an exception handler, including an exception handler of last resort.

(2) The normal exception handlers are stack-based, so if your bug is trashing the stack, you may not get a Dr Watson file.

(3) Check your string handling and similar functions - might you be overwriting a buffer somewhere?  That's a prime candidate for stack overwrite.

(4) Might you be throwing an exception within an exception handler?  Repeatedly?
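
For item (3), here is a minimal sketch of the kind of bounded copy that keeps a string from running past the end of a stack buffer. The helper name copy_bounded is made up for illustration; it is not from our code base, and the real fix depends on where the overwrite actually happens.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, which holds dst_size bytes.
   The result is always NUL-terminated and never written past dst.
   Returns 1 if the whole string fit, 0 if it had to be truncated. */
static int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    size_t n;

    if (dst_size == 0) {
        return 0;               /* no room for even the terminator */
    }
    n = strlen(src);
    if (n >= dst_size) {
        memcpy(dst, src, dst_size - 1);
        dst[dst_size - 1] = '\0';
        return 0;               /* truncated */
    }
    memcpy(dst, src, n + 1);    /* includes the terminator */
    return 1;
}
```

The return value matters as much as the bound: a silently truncated string can be its own bug, so the caller should check it and log or reject the input.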

 

The advice from Microsoft support was to add a "vectored exception handler" that writes the exception code and address to a file, then returns so that the normal, stack-based exception handling can proceed.
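
For my own notes, here is a rough sketch of what such a handler might look like in C. This is not our developers' actual code: the log file name and the function names are placeholders I made up. The Windows calls (AddVectoredExceptionHandler, CreateFileA, WriteFile) are real APIs. The Windows-only part is wrapped in #ifdef _WIN32 so the formatting helper can be compiled anywhere.

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Format one log line from an exception code and faulting address.
   Returns the number of characters snprintf would have written. */
static int format_exception_line(char *buf, size_t len,
                                 unsigned long code, const void *addr)
{
    return snprintf(buf, len, "exception code=0x%08lX address=%p\r\n",
                    code, addr);
}

#ifdef _WIN32
/* Vectored handler: called before the stack-based SEH handlers run. */
static LONG CALLBACK log_exception(PEXCEPTION_POINTERS info)
{
    char line[128];
    DWORD written = 0;
    HANDLE h;
    int n = format_exception_line(line, sizeof line,
                (unsigned long)info->ExceptionRecord->ExceptionCode,
                info->ExceptionRecord->ExceptionAddress);

    if (n > 0) {
        if (n >= (int)sizeof line) {
            n = (int)sizeof line - 1;   /* line was truncated */
        }
        /* "vectored_exceptions.log" is a placeholder file name. */
        h = CreateFileA("vectored_exceptions.log", FILE_APPEND_DATA,
                        FILE_SHARE_READ, NULL, OPEN_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL, NULL);
        if (h != INVALID_HANDLE_VALUE) {
            WriteFile(h, line, (DWORD)n, &written, NULL);
            CloseHandle(h);
        }
    }
    /* Log only; let the normal stack-based handlers take over. */
    return EXCEPTION_CONTINUE_SEARCH;
}

/* Call once, early in process startup. */
void install_exception_logger(void)
{
    /* First argument nonzero: ask to be called before other
       vectored handlers. */
    AddVectoredExceptionHandler(1, log_exception);
}
#endif
```

One thing to keep in mind: a vectored handler sees every exception raised in the process, including ones that get handled perfectly well later, so the log can be noisy. The entries that matter are the last few before the process disappears.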

 

Quote from explanation that got passed around: "The idea is that the vectored exception handler gives us a chance to log some data before the complex SEH unwind begins and possibly goes haywire.  This API was added in XP/2K3, likely to provide a means to troubleshoot the scenario we are encountering.  Since SEH is stack based it is susceptible to severe malfunctions."

 

This revealed that the trouble starts with an access violation caused by a bad read (from still-under-development proprietary hardware), and that additional exceptions then get thrown during the attempt to handle the first one.

 

Whew!  Useful information at last!  Developers added defensive code for the bad read(s), and are investigating why it happens in the first place.
