Dr Watson will be there untill the end of this world


A win32 user mode Application is crashing and support is not able to get a crash dump either using adplus.vbs script or using CDB or NTSD debugger. It failed to generate crash dump even after setting WinDbg as the default post mortem debugger.

Steps to resolution

1) Crate a adplus configuration file to generate a minidump on the following breakpoints

  • kernel32!UnhandledExceptionFilter
  • kernel32!SetUnhandledExceptionFilter

Analyze the stack trace

1)  Open the first crash dump with breakpoint on kernel32!UnhandledExceptionFilter, run the following command

0:000> kb
ChildEBP RetAddr  Args to Child
0012ebf4 6da6c8ac 0012ec54 002ef7f0 c0000417 kernel32!UnhandledExceptionFilter
0012ef2c 6da302fd 00000000 00000000 00000000 msvcr90!_invoke_watson+0xf9
0012ef74 6da30366 047d0030 ffffffff 00000001 msvcr90!fread_s+0×4a

Something interesting, why is it calling _invoke_watson?

2) Open the 2nd crash dump with breakpoint on kernel32!SetUnhandledExceptionFilter and run the following command

0:000> kb
ChildEBP RetAddr  Args to Child
0012ebf4 6da6c89f 00000000 002ef7f0 c0000417 kernel32!SetUnhandledExceptionFilter
0012ef2c 6da302fd 00000000 00000000 00000000 msvcr90!_invoke_watson+0xec
0012ef74 6da30366 047d0030 ffffffff 00000001 msvcr90!fread_s+0×4a

Interesting, SetHandledExceptionFilter is passed NULL . MSDN documentation says, A value of NULL for this parameter specifies default handling within UnhandledExceptionFilter so that means a pointer to a top-level exception filter function is nulled out.

stack trace

002eebb4 6dcbc8ac 002eec14 0040f7f0 c0000417 kernel32!UnhandledExceptionFilter

002eeeec 6dc802fd 00000000 00000000 00000000 msvcr90!_invoke_watson+0xf9

002eef34 6dc80366 045b0030 ffffffff 00000001 msvcr90!fread_s+0×4a

002eef50 6ef3105e 045b0030 00000001 000001f4 msvcr90!fread+0×18

Root Cause Analysis

1) Basically, CRT is hardcoded to invoke drwatson, in our case msvcr90!fread under a few scenario

2) Crash dump was not getting created because it was using fread with invalid runtime parameter resulting in calling hardcoded dr watson

3) Event log also shows the Exception Code = 0xC0000417, defined as #define STATUS_INVALID_CRUNTIME_PARAMETER ((NTSTATUS)0xC0000417L) in “NTSTATUS.h”

4) This was confirmed in the following article on microsoft web site A proposal to make Dr.Watson invocation configurable

When any of CRT invariants are violated (invalid arguments passed, buffer overrun detected, etc.) CRT needs to report somehow about that violation. VC++ 2005 uses the following strategy:
1. Display (optionally) a message box with error description.
2. Invoke Dr.Watson to create crash dump.

(All this code is located in abort.c, gs_report.c, and invarg.c files).

prior to windows vista, DrWatson is installed as the default post mortem debugger. But you always have an option to set WinDbg or NTSD or any other debugger to change post mortem debugger. However, you don’t have Dr Watson in Windows Vista ( now you have windows error reporting service). So, We ran into an interesting issue while trying to get a crash dump for a user mode win32 process. You have to hook your own unhandled exception filter to create a dump

Analyze DrWatson minidump using WinDbg

This is the analysis of a recent DrWatson(postmortem debugger) mini crash dump we received

Debugging Steps using WinDBG

  1. Open the .dmp file using WinDBG
  2. Make sure your symbol server path is set correctly(File->Symbol File Path), below is what I have set it to srv*C:\Temp\Symbols*http://msdl.microsoft.com/download/symbols;
  3. You should save the workspace to save your symbol path information unless you are using _NT_SYMBOL_PATH otherwise you will loose the information next time you start WinDBG. Remember to save it without opening a dump file to use it in next debugging session
  4. Run !analyze -v
  5. 0:182> !analyze -v

77e4bee7 5e pop esi
EXCEPTION_RECORD: ffffffff — (.exr 0xffffffffffffffff)
ExceptionAddress: 77e4bee7 (kernel32!RaiseException+0×00000053)
ExceptionCode: e0434f4d (CLR exception)
ExceptionFlags: 00000001
NumberParameters: 1
Parameter[0]: 8007000e
ERROR_CODE: (NTSTATUS) 0xe0434f4d – <Unable to get error code text>
MANAGED_STACK: !dumpstack -EE
OS Thread Id: 0xdd0 (182)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame:
ChildEBP RetAddr Caller,Callee
ADDITIONAL_DEBUG_TEXT: Followup set via attribute from Frame ffffffff on thread ffffffff
LAST_CONTROL_TRANSFER: from 79f071ac to 77e4bee7
STACK_TEXT: 15d9f418 79f071ac e0434f4d 00000001 00000001 kernel32!RaiseException+0×53
15d9f478 79f9293a 75c9cb08 00000000 00000000 mscorwks!RaiseTheExceptionInternalOnly+0x2a8
15d9f4b0 7a00e74a 15d9f508 7a3c441c 9ae216b2 mscorwks!UnwindAndContinueRethrowHelperAfterCatch+0×70
15d9f558 793a858c 0874cd34 00000000 793af628 mscorwks!JIT_NewFast+0×105
15d9f578 79e7c74b 00000000 15d9f5e0 15d9f608 mscorlib_ni+0x2e858c
15d9f588 79e7c6cc 00000000 00000000 15d9f714 mscorwks!CallDescrWorker+0×33
15d9f608 79f00eca 00000000 00000000 15d9f714 mscorwks!CallDescrWorkerWithHandler+0xa3
15d9f628 79f00e75 00000000 00000000 15d9f714 mscorwks!DispatchCallBody+0x1e
15d9f68c 79f00f03 00000000 00000000 15d9f714 mscorwks!DispatchCallDebuggerWrapper+0x3d
15d9f6c0 79f02978 00000000 00000000 15d9f714 mscorwks!DispatchCallNoEH+0×51
15d9f720 79ef3207 15d9f870 7a3b57d8 00000000 mscorwks!AddTimerCallback_Worker+0×79
15d9f734 79ef31a3 15d9f810 15d9f7bc 79f91478 mscorwks!Thread::DoADCallBack+0x32a
15d9f7c8 79ef30c3 15d9f810 9ae21bee 7a3b57d8 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3
15d9f804 79ef4826 15d9f810 00000001 00000000 mscorwks!Thread::ShouldChangeAbortToUnload+0x30a
15d9f82c 79ef48cf 00000001 79f02928 15d9f870 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e
15d9f844 79f02901 00000001 79f02928 15d9f870 mscorwks!ManagedThreadBase::ThreadPool+0×13
15d9f88c 79f0288c 106bcc38 00000001 00000001 mscorwks!AddTimerCallbackEx+0×83
15d9f8a0 79f02850 106bcc38 00000001 9ae21b36 mscorwks!AddTimerCallback+0×10
15d9f8dc 79f024cf 2a023e70 9ae21ac2 79f02030 mscorwks!ThreadpoolMgr::AsyncTimerCallbackCompletion+0×64
15d9f928 79f0202a 00000000 00000000 77e619d1 mscorwks!UnManagedPerAppDomainTPCount::DispatchWorkItem+0xb2
15d9f93c 79f021a0 9ae21a7e 00000000 79f0207d mscorwks!ThreadpoolMgr::ExecuteWorkRequest+0xaf
15d9f994 79f95a2e 00000000 7c81c5f0 15d9fa48 mscorwks!ThreadpoolMgr::WorkerThreadStart+0×223
15d9ffb8 77e64829 106f61d8 00000000 00000000 mscorwks!Thread::intermediateThreadProc+0×49
15d9ffec 00000000 79f959e8 106f61d8 00000000 kernel32!BaseThreadStart+0×34

6) RaiseTheExceptionInternalOnly method signature is RaiseTheExceptionInternalOnly(OBJECTREF, BOOL, BOOL)
7) 2nd parameter evaluates to whether the exception is re-thrown and the 3rd parameter tells you if this is a stack overflow exception but in this case both of them are false
8 ) ExceptionCode: e0434f4d (CLR exception) which applies to any unhandled CLR exception e.g DivideByZero
9) mini dump reports 1204 threads, as shown below

0:182> ~
0 Id: 968.1128 Suspend: 1 Teb: 7ffdf000 Unfrozen
1 Id: 968.c34 Suspend: 1 Teb: 7ffde000 Unfrozen
2 Id: 968.8e0 Suspend: 1 Teb: 7ffdd000 Unfrozen
3 Id: 968.a90 Suspend: 1 Teb: 7ffdc000 Unfrozen
1200 Id: 968.24b0 Suspend: 1 Teb: 7fbd6000 Unfrozen
1201 Id: 968.1bc4 Suspend: 1 Teb: 7fae4000 Unfrozen
1202 Id: 968.25ac Suspend: 1 Teb: 7faba000 Unfrozen
1203 Id: 968.1b7c Suspend: 1 Teb: 7fab8000 Unfrozen
10) More than 200 threads have the following stack trace
0:182> ~*k
1201 Id: 968.1bc4 Suspend: 1 Teb: 7fae4000 Unfrozen
ChildEBP RetAddr
7b22ebec 7c827d0b ntdll!KiFastSystemCallRet
7b22ebf0 77e61d1e ntdll!NtWaitForSingleObject+0xc
7b22ec60 79e789c6 kernel32!WaitForSingleObjectEx+0xac
7b22eca4 79e7898f mscorwks!PEImage::LoadImage+0x1af
7b22ecf4 79e78944 mscorwks!CLREvent::WaitEx+0×117
7b22ed08 79fbc82a mscorwks!CLREvent::Wait+0×17
7b22ed1c 79fbc6da mscorwks!SVR::GCHeap::WaitUntilGCComplete+0×34
7b22ed58 79fc51bf mscorwks!Thread::RareDisablePreemptiveGC+0x1a0
7b22ed8c 79fc529b mscorwks!Thread::UserSleep+0xb1
7b22ee2c 0deb1401 mscorwks!ThreadNative::Sleep+0xce
WARNING: Frame IP not in any known module. Following frames may be wrong.
7b22ef18 793b0d1f 0xdeb1401
7b22ef20 793740ab mscorlib_ni+0x2f0d1f
7b22ef30 79e7c74b mscorlib_ni+0x2b40ab
7b22ef40 79e7c6cc mscorwks!CallDescrWorker+0×33
7b22efc0 79e7c8e1 mscorwks!CallDescrWorkerWithHandler+0xa3
7b22f0fc 79e7c783 mscorwks!MethodDesc::CallDescr+0x19c
7b22f118 79e7c90d mscorwks!MethodDesc::CallTargetWorker+0x1f
7b22f12c 79eb300f mscorwks!MethodDescCallSite::Call_RetArgSlot+0×18
7b22f2f8 79eb2f31 mscorwks!ExecuteCodeWithGuaranteedCleanupHelper+0x9b
7b22f3a8 79373ff7 mscorwks!ReflectionInvocation::ExecuteCodeWithGuaranteedCleanup+0xf9

11) There are 2 GC threads (2 and 3) since this process is running on Windows Server 2003 with 2 CPUs so one GC thread per CPU.
0:182> ~2 kb
ChildEBP RetAddr Args to Child
00b7fd88 7c827d0b 77e61d1e 000000e8 00000000 ntdll!KiFastSystemCallRet
00b7fd8c 77e61d1e 000000e8 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
00b7fdfc 79e789c6 000000e8 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
00b7fe40 79e7898f 000000e8 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x1af
00b7fe90 79e78944 ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0×117
00b7fea4 79efa6c3 ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0×17
00b7fee8 79efb5b7 001568f8 00000006 00000000 mscorwks!SVR::t_join::join+0xae
00b7ff28 79efa9ba 00000002 00000000 001568f8 mscorwks!SVR::gc_heap::mark_phase+0x16f
00b7ff54 79efaf60 ffffffff 00000000 001568f8 mscorwks!SVR::gc_heap::gc1+0×46
00b7ff74 79efa72f 00000002 00000000 001568f8 mscorwks!SVR::gc_heap::garbage_collect+0×246
00b7ff98 79fc8583 79fc8559 77e64590 79fc857c mscorwks!SVR::gc_heap::gc_thread_function+0x6a
00b7ffb8 77e64829 001568f8 00000000 00000000 mscorwks!SVR::gc_heap::gc_thread_stub+0×92
00b7ffec 00000000 79fc8538 001568f8 00000000 kernel32!BaseThreadStart+0×34
0:182> ~3 kb
ChildEBP RetAddr Args to Child
00c7fc9c 79efd1d0 55b4de68 00000001 001568f8 mscorwks!SVR::gc_heap::mark_object_simple1+0xdf
00c7fce4 79efba46 0108ffcc 00000001 09063c94 mscorwks!SVR::gc_heap::mark_object_simple+0×254
00c7fd10 79fbb49d 09063c94 0108ffcc 00000000 mscorwks!SVR::GCHeap::Promote+0×115
00c7fd28 79fc3a38 00c7fdd0 09063c94 00000000 mscorwks!GcEnumObject+0x2d
00c7fd48 79fc39e2 00153050 79fbb475 00c7fdd0 mscorwks!Module::EnumStaticGCRefs+0×59
00c7fd74 79fc3ada 79fbb475 00c7fdd0 00174ea8 mscorwks!DomainAssembly::EnumStaticGCRefs+0×38
00c7fd98 79fc3a9b 79fbb475 00c7fdd0 2a03d8b0 mscorwks!AppDomain::EnumStaticGCRefs+0×29
00c7fdb4 79fc399b 79fbb475 00c7fdd0 79efba5a mscorwks!SystemDomain::EnumAllStaticGCRefs+0x4e
00c7fddc 79efb559 00000000 00000002 00000002 mscorwks!CNameSpace::GcScanRoots+0x1a2
00c7fe28 79efa9ba 00000002 00000000 00157d30 mscorwks!SVR::gc_heap::mark_phase+0×111
00c7fe54 79efaf60 ffffffff 00000000 00157d30 mscorwks!SVR::gc_heap::gc1+0×46
00c7fe74 79efa72f 00000002 00000000 00157d30 mscorwks!SVR::gc_heap::garbage_collect+0×246
00c7fe98 79fc8583 00000000 7c8276bf 00c7fef8 mscorwks!SVR::gc_heap::gc_thread_function+0x6a
00c7ffb8 77e64829 00157d30 00000000 00000000 mscorwks!SVR::gc_heap::gc_thread_stub+0×92
00c7ffec 00000000 79fc8538 00157d30 00000000 kernel32!BaseThreadStart+0×34
12) Mark_Phase signature is mark_phase (int gen_number, BOOL mark_only_p
parameters passed are 2, 0, that means GC was triggered for generation 2

13) Thread # 345 has triggered garbage collection for generation 2 with a request to allocate memory from GC Heap(see highleted in red)
0:182> ~345 kb 100
ChildEBP RetAddr Args to Child
2484e0bc 7c827d0b 77e61d1e 00000100 00000000 ntdll!KiFastSystemCallRet
2484e0c0 77e61d1e 00000100 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
2484e130 79e789c6 00000100 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
2484e174 79e7898f 00000100 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x1af
2484e1c4 79e78944 ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0×117
2484e1d8 79efafc5 ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0×17
2484e1fc 79efad3d ffffffff 71ca0000 00157dd4 mscorwks!SVR::gc_heap::wait_for_gc_done+0×62
2484e224 79fd1cf3 00000002 00000000 00157d30 mscorwks!SVR::GCHeap::GarbageCollectGeneration+0x1b5
2484e2b4 79ef98cf 14fa94e0 0000000c 00000000 mscorwks!SVR::gc_heap::try_allocate_more_space+0×464
2484e2d0 79ef9b9c 14fa94e0 0000000c 00000000 mscorwks!SVR::gc_heap::allocate_more_space+0x2e
2484e2f0 79e7510e 14fa94e0 0000000c 00000000 mscorwks!SVR::GCHeap::Alloc+0×54
2484e30c 79e85806 0000000c 00000000 00000000 mscorwks!Alloc+0×60
2484e34c 79e858cb 79102290 abbf001a 000007d8 mscorwks!FastAllocateObject+0×38
2484e3f0 793ab271 01061034 2484e484 00000000 mscorwks!JIT_NewFast+0x9e
2484e444 793aaef0 f1dcc000 ffffffbc 00000000 mscorlib_ni+0x2eb271
2484e618 793b0d1f 3a5a2074 793740ab 00000000 mscorlib_ni+0x2eaef0
2484e620 793740ab 00000000 14fa94a0 2484e640 mscorlib_ni+0x2f0d1f
2484e630 79e7c74b 00000000 00000000 2484e6c0 mscorlib_ni+0x2b40ab
2484e640 79e7c6cc 2484e710 00000000 2484e6e0 mscorwks!CallDescrWorker+0×33
2484e6c0 79e7c8e1 2484e710 00000000 2484e6e0 mscorwks!CallDescrWorkerWithHandler+0xa3
2484e7fc 79e7c783 796c01c0 2484e92c 2484e84c mscorwks!MethodDesc::CallDescr+0x19c
2484e818 79e7c90d 796c01c0 2484e92c 2484e84c mscorwks!MethodDesc::CallTargetWorker+0x1f
2484e82c 79eb300f 2484e84c abbf0a12 14fa94a0 mscorwks!MethodDescCallSite::Call_RetArgSlot+0×18
2484e9f8 79eb2f31 2484ea8c abbf0942 09060054 mscorwks!ExecuteCodeWithGuaranteedCleanupHelper+0x9b
2484eaa8 79373ff7 2484ea4c 408301c8 0109dd28 mscorwks!ReflectionInvocation::ExecuteCodeWithGuaranteedCleanup+0xf9
2484eac0 79373ede 4a34402c 2484eaf0 2484eb1c mscorlib_ni+0x2b3ff7
00000000 00000000 00000000 00000000 00000000 mscorlib_ni+0x2b3ede


  1. Application has thrown exception at the top of stack
  2. In CLR 2.0, unhandled exceptions at the top of the stack on any thread will terminate the application
  3. Exception has occurred after GC was triggered
  4. Mark Phase was triggered for Generation 2 that means it was under severe memory pressure
  5. since there are more than 1200 threads that itself amounts to 1.2 GB of virtual memory for just the user mode stack and each thread has user mode plus kernel mode stack although not much but kernel stack will also add up to another 3 virtual pages(12KB) 1200*12KB of physical memory, kernel stack is physical memory resident. There will be around 100 MB or so in MEM_IMAGE itself
  6. Since this exception has occurred while starting a worker thread from managed thread pool and at the same GC was also triggered. So that means there was not enough memory
  7. There is no thread leak, these many threads are by design(one thread per request, I don’t know why)
  8. Some of the interesting points

There are more than 200 threads with preemptive GC disabled what that means is these threads can’t be suspended and GC Threads will have to wait for these threads to return to preemptive GC mode in order to claim the memory. Managed threads should not be in a state with preemptive GC disabled because this is very rare as interpreted from the call stack Thread::RareDisablePreemptiveGC

Below is the description of Preemptive GC from http://blogs.msdn.com/yunjin/archive/2005/08/30/457756.aspx

Preemptive GC: also very important. In Rotor, this is m_fPreemptiveGCDisabled field of C++ Thread class. It indicates what GC mode the thread is in: “enabled” in the table means the thread is in preemptive mode where GC could preempt this thread at any time; “disabled” means the thread is in cooperative mode where GC has to wait the thread to give up its current work (the work is related to GC objects so it can’t allow GC to move the objects around). When the thread is executing managed code (the current IP is in managed code), it is always in cooperative mode; when the thread is in Execution Engine (unmanaged code), EE code could choose to stay in either mode and could switch mode at any time; when a thread are outside of CLR (e.g, calling into native code using interop), it is always in preemptive mode.

CLR rotor source code has the following comments before calling Thread::RareDisablePreemptiveGC

We must do the following in this order, because otherwise we would be constructing the exception for the abort without synchronizing with the GC. Also, we have no CLR SEH set up, despite the fact that we may throw a ThreadAbortException.

Most likely it crashed because of not having enough memory to start a worker thread since it crashed with the call stack pointing to WokerThreadStart

The real challenge is how will you find out the exception detail for sure?How will you make sure that this is indeed a out of memory exception?

Since this is a mini dump with no information on CLR Data structures. Effective managed debugging requires a mini dump with full virtual memory because managed heap is created with virtualalloc but Dr. Watson as a postmortem debugger with default options creates a minidump which includes only all thread contexts and 10 instructions.