DPC_WATCHDOG_VIOLATION blue screen problem analysis (1)

Article directory

  • DPC_WATCHDOG_VIOLATION blue screen problem analysis
    • 1. Principle
      • 1.1 DPC WATCHDOG
      • 1.2 DPC timeout acquisition
    • 2. Example
      • 2.1 Analysis
    • 3. Summary

DPC_WATCHDOG_VIOLATION blue screen problem analysis

Under Windows 10, the DPC_WATCHDOG_VIOLATION blue screen error was introduced. The reason for the introduction of this error is to prevent the kernel code from being executed in an error state for too long at DPC and above levels, affecting the stability of the system.

For the blue screen error code DPC_WATCHDOG_VIOLATION, there are two types (use the first parameter to distinguish):

  1. The first parameter is 0 indicating that the execution of a single DPC routine has timed out (A single DPC or ISR exceeded its time allotment).
  2. The first parameter is 1, indicating that the system has timed out on DPC (The system cumulatively spent an extended period of time at IRQL DISPATCH_LEVEL or above).

The two blue screen codes are as follows:

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000000, A single DPC or ISR exceeded its time allotment. The offending
component can usually be identified with a stack trace.
Arg2: 0000000000000501, The DPC time count (in ticks).
Arg3: 0000000000000500, The DPC time allotment (in ticks).
Arg4: fffff8077ff73358, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
additional information regarding this single DPC timeout


DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
DISPATCH_LEVEL or above. The offending component can usually be
identified with a stack trace.
Arg2: 0000000000001e00, The watchdog period.
Arg3: fffff80080705330, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
additional information regarding the cumulative timeout
Arg4: 0000000000000000

Below we will discuss the basic principles and analysis process of DPC_WATCHDOG_VIOLATION by giving examples.

1. Principle

1.1 DPC WATCHDOG

Windows has relevant restrictions on the time when the system runs in DPC and interrupts. Generally speaking, DPC cannot run more than 100MS, and IRS cannot run more than 25MS (but the actual value is higher than this).

In Windows, use KeAccumulateTicks to limit the running time, as follows:

__int64 __fastcall KeAccumulateTicks(...)
{<!-- -->
//...
if (pcrb->NestingLevel == 2 & amp; & amp; pcrb->DpcRoutineActive)
{<!-- -->
DpcTimeCount = pcrb->DpcTimeCount;
pcrb->DpcTime + = KernelTime;
NewDpcTimeCount = DpcTimeCount + 1;
DpcTimeLimit = pcrb->DpcTimeLimit;
pcrb->DpcTimeCount = NewDpcTimeCount;
if(DpcTimeLimit)
{<!-- -->
//...
if (NewDpcTimeCount > DpcTimeLimit)
{<!-- -->
if (KeEnableWatchdogTimeout)
{<!-- -->
HvlInvokeHypervisorDebugger(3i64, 0i64);
if (!(_BYTE)KdDebuggerEnabled || (_BYTE)KdDebuggerNotPresent)
KeBugCheckEx(
0x133u,
0i64,
pcrb->DpcTimeCount,
pcrb->DpcTimeLimit,
(ULONG_PTR) &KeDpcWatchdogProfileGlobalTriageBlock);
}
pcrb->DpcTimeCount = 0;
KiResetGlobalDpcWatchdogProfiler(pcrb);
}
}
}

//...
if (Irql < 2u)
{<!-- -->
DpcWatchdogProfile = pcrb->DpcWatchdogProfile;
pcrb->DpcWatchdogCount = 0;
//...
goto Over;
}
DpcWatchdogPeriod = pcrb->DpcWatchdogPeriod;
NewDpcWatchdogCount = pcrb->DpcWatchdogCount + 1;
pcrb->DpcWatchdogCount = NewDpcWatchdogCount;
if (DpcWatchdogPeriod)
{<!-- -->
//...
if (DpcWatchdogCount >= DpcWatchdogPeriod)
{<!-- -->
if (KeEnableWatchdogTimeout)
{<!-- -->
HvlInvokeHypervisorDebugger(2i64, 0i64);
if (!(_BYTE)KdDebuggerEnabled || (_BYTE)KdDebuggerNotPresent)
KeBugCheckEx(0x133u,
1ui64,
pcrb->DpcWatchdogPeriod,
(ULONG_PTR) &KeDpcWatchdogProfileGlobalTriageBlock,
0i64);
}
}
}
//...
}

From here, if the time is exceeded and KeEnableWatchdogTimeout is true, then a 133 blue screen will be triggered. So when is KeEnableWatchdogTimeout initialized? We can see that the initialization code is as follows:

char __fastcall KeInitSystem(int a1)
{<!-- -->
    //...
    if (!HviIsAnyHypervisorPresent())
      KeEnableWatchdogTimeout = 1;
    //...
}

HviIsAnyHypervisorPresentThis function returns whether the cpuid command is implemented by the hypervisor. If not, then enable WatchDog.

In the above KeBugCheckEx there are two situations:

  1. If pcrb->DpcRoutineActive is in the DPC routine, then determine whether the DPC routine times out. The DPC routine determines whether it uses DpcTimeCount and DpcTimeLimitTicks in PRCB. Compare.
  2. The next step is to determine whether the whole thing is running under DPC basics, using DpcWatchdogCount and DpcWatchdogPeriodTicks in PRCB.

Since DpcTimeCount represents the execution time of a single DPC routine, the value is set to 0 when each DPC is executed.

__int64 __fastcall KiExecuteAllDpcs(...)
{<!-- -->
    //...
    a1->DpcTimeCount = 0;
    //...
}

1.2 DPC timeout acquisition

The DPC timeout information of the current processor can be obtained through KeQueryDpcWatchdogInformation, as follows:

NTSTATUS KeQueryDpcWatchdogInformation(
  PKDPC_WATCHDOG_INFORMATION WatchdogInformation
);

typedef struct _KDPC_WATCHDOG_INFORMATION {<!-- -->
  ULONGDpcTimeLimit;
  ULONGDpcTimeCount;
  ULONGDpcWatchdogLimit;
  ULONG DpcWatchdogCount;
  ULONG Reserved;
} KDPC_WATCHDOG_INFORMATION, *PKDPC_WATCHDOG_INFORMATION;

in:

  1. DpcTimeLimit: The time limit for a single delayed procedure call on the current processor. If DPC timeout is disabled, this value is set to 0.
  2. DpcTimeCount: If DPC timeout is enabled, the usage time of this delay procedure call.
  3. DpcWatchdogLimit : The total time limit allowed for a series of deferred procedure calls. If the DPC watchdog is disabled, this value is set to zero.
  4. DpcWatchdogCount : If the DPC watchdog is enabled, the remaining time value of the current sequence of consecutive delayed procedure calls.

For example we can see the following:

0: kd> dt nt!_KPRCB fffff8007bfe9180 DPC*
    + 0x3310 DpcRuntimeHistoryHashTable : 0xffffaf0c`1f0d5690 _RTL_HASH_TABLE
    + 0x3318 DpcRuntimeHistoryHashTableCleanupDpc : 0xffffaf0c`1f122c40 _KDPC
    + 0x3338 DpcDelegateThread : 0xffffaf0c`1f150140 _KTHREAD
    + 0x3340 DpcData: [2] _KDPC_DATA
    + 0x33a0 DpcStack : 0xfffff800`7c395fb0 Void
    +0x33acDpcRequestRate: 0
    + 0x33b4 DpcLastCount : 0x67518e
    + 0x33ba DpcRoutineActive : 0 ''
    + 0x33bc DpcRequestSummary : 0n131174
    + 0x33bc DpcRequestSlot: [2] 0n102
    + 0x33bc DpcNormalProcessingActive : 0y0
    + 0x33bc DpcNormalProcessingRequested: 0y1
    + 0x33bc DpcNormalThreadSignal: 0y1
    + 0x33bc DpcNormalTimerExpiration : 0y0
    + 0x33bc DpcNormalDpcPresent : 0y0
    + 0x33bc DpcNormalLocalInterrupt : 0y1
    + 0x33bc DpcNormalPriorityAntiStarvation : 0y1
    + 0x33bc DpcNormalSwapToDpcDelegate : 0y0
    + 0x33bc DpcNormalSpare : 0y00000000 (0)
    + 0x33bcDpcThreadActive: 0y0
    + 0x33bcDpcThreadRequested: 0y1
    + 0x33bc DpcThreadSpare : 0y00000000000000 (0)
    + 0x7e40 DpcGate: _KGATE
    + 0x7ea8 DpcWatchdogPeriodTicks : 0n7680
    + 0x7eac DpcWatchdogCount : 0n7680
    + 0x7eb4 DpcWatchdogProfileCumulativeDpcThresholdTicks : 0x1b80
    + 0x7eec DpcWatchdogSequenceNumber : 0x18794c
    + 0x814c DpcTime : 0x4c56
    + 0x815c DpcTimeCount : 0
    + 0x8160 DpcTimeLimitTicks : 0x500
    + 0x8240 DpcWatchdogProfileSingleDpcThresholdTicks : 0x496
    + 0x85d8 DpcWatchdogDpc: _KDPC
    + 0x8618 DpcWatchdogTimer: _KTIMER
    + 0x88a8 DpcWatchdogProfile : 0xffffaf0c`1f16a000 -> 0x00000000`00000011 Void
    + 0x88b0 DpcWatchdogProfileCurrentEmptyCapture : 0xffffaf0c`1f181cf8 -> (null)

In summary, there are two types of monitoring for DPC:

  1. Use DpcTimeCount and DpcTimeLimitTicks to monitor the execution time of a single DPC routine to prevent the DPC routine from executing too long.
  2. Use DpcWatchdogCount and DpcWatchdogPeriodTicks to monitor the total DPC time executed by the CPU to prevent the CPU from executing a series of DPC functions.

2. Example

Below we conduct an example analysis through a DPC_WATCHDOG_VIOLATION blue screen dump file.

2.1 Analysis

Let’s first take a look at the blue screen code DPC_WATCHDOG_VIOLATION, as follows:

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
DISPATCH_LEVEL or above. The offending component can usually be
identified with a stack trace.
Arg2: 0000000000001e00, The watchdog period.
Arg3: fffff80080705330, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
additional information regarding the cumulative timeout
Arg4: 0000000000000000

The call stack at this time is as follows:

0: kd> k
 # Child-SP RetAddr Call Site
00 fffff800`7c39cdd8 fffff800`7fc31c8f nt!KeBugCheckEx
01 fffff800`7c39cde0 fffff800`7fc318a1 nt!KeAccumulateTicks + 0x20f
02 fffff800`7c39ce50 fffff800`7fc2fa8a nt!KiUpdateRunTime + 0x61
03 fffff800`7c39ceb0 fffff800`7fc2f876 nt!KeClockInterruptNotify + 0x11a
04 fffff800`7c39cf40 fffff800`7fc14650 nt!HalpTimerClockIpiRoutine + 0x16
05 fffff800`7c39cf70 fffff800`7fe1e05a nt!KiCallInterruptServiceRoutine + 0xa0
06 fffff800`7c39cfb0 fffff800`7fe1e8c7 nt!KiInterruptSubDispatchNoLockNoEtw + 0xfa
07 fffff882`a7a57430 fffff800`7fd67972 nt!KiInterruptDispatchNoLockNoEtw + 0x37
08 fffff882`a7a575c0 fffff800`7fc6cb64 nt!KiAcquireKobjectLockSafe + 0x32
09 fffff882`a7a575f0 fffff800`7fc44136 nt!KeSetEvent + 0x64
0a fffff882`a7a57680 fffff800`7fc18496 nt!IopCompleteRequest + 0x396
0b fffff882`a7a57760 fffff800`7fc70846 nt!KiDeliverApc + 0x1b6
0c fffff882`a7a57820 fffff800`7fc72169 nt!KiSwapThread + 0x936
0d fffff882`a7a57900 fffff800`7fc75e26 nt!KiCommitThreadWait + 0x159
0e fffff882`a7a579a0 fffff800`8cce1cc1 nt!KeDelayExecutionThread + 0x416
0f fffff882`a7a57a30 fffff800`8cce3b7a xxxDrv_x64 + 0x1cc1
10 fffff882`a7a57a60 fffff800`7fd54295 xxxDrv_x64 + 0x3b7a
11 fffff882`a7a57af0 fffff800`7fe20e34 nt!PspSystemThreadStartup + 0x55
12 fffff882`a7a57b40 00000000`00000000 nt!KiStartSystemThread + 0x34

What we can conclude here is that the blue screen is not caused by executing a certain DPC routine for a long enough time, but that the CPU executes in DPC mode for too long.

At this point we see the DCP times as follows:

0: kd> dt nt!_KPRCB fffff8007bfe9180 DPC*
    + 0x7ea8 DpcWatchdogPeriodTicks : 0n7680
    + 0x7eac DpcWatchdogCount : 0n7680
    + 0x815c DpcTimeCount : 0
    + 0x8160 DpcTimeLimitTicks : 0x500

From here we can also find:

  1. DpcTimeCount is 0, indicating that we have not executed the DPC routine.
  2. DpcWatchdogCount and DpcWatchdogPeriodTicks are equal, indicating that the CPU execution time under DPC is too long.

We can take a look at the current DPC information, as follows:

0: kd> !dpcs 0
CPU Type KDPC Function
 0: Normal: 0xffffaf0c32787278 0xfffff8007fe56420 nt!IopPassiveInterruptDpc
 0: Normal : 0xffffaf0c21e7b520 0xfffff80082092550 Wdf01000!FxInterrupt::_InterruptDpcThunk
 0: Normal: 0xffffaf0c23ce9808 0xfffff80082c314d0 ndis!ndisInterruptDpc
 0: Normal: 0xffffaf0c21eb26e8 0xfffff800858666c0 dxgkrnl!DpiFdoDpcForIsr
 0: Normal: 0xfffff882acf91f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal: 0xfffff8007bff15d8 0xfffff8007fc26b60 nt!PpmPerfAction
 0: Normal: 0xffffaf0c2173a0c8 0xfffff80082ec51d0 tcpip!TcpPeriodicTimeoutHandler
 0: Normal: 0xffffaf0c211f71e8 0xfffff80082732000 stornvme!NVMeCompletionDpcRoutine
 0: Normal: 0xfffff8007bff1c68 0xfffff8007fd4c800 nt!KiEntropyDpcRoutine
 0: Normal : 0xffffff882a8ab6f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal : 0xffffff882ad376f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal: 0xffffaf0c1f122c40 0xfffff8007fd3fd30 nt!KiDpcRuntimeHistoryHashTableCleanupDpcRoutine
 0: Normal : 0xffffaf0c231d0de0 0xfffff80082092550 Wdf01000!FxInterrupt::_InterruptDpcThunk
 0: Normal: 0xffffaf0c21d93db0 0xfffff800b332b470 i8042prt!I8042KeyboardIsrDpc
 0: Normal : 0xffffaf0c23ce9a48 0xfffff80082c314d0 ndis!ndisInterruptDpc
 0: Normal : 0xffffff882a8476f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal : 0xffffaf0c21233000 0xfffff8008278c0f0 storport!RaidpAdapterRedirectDpcRoutine
 0: Normal : 0xffffaf0c23081270 0xfffff800ae096d90 Ndu!NduQueryStatisticsDpc
 0: Normal : 0xffffff882a7f26f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal: 0xfffff882ac1a6f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Threaded: 0xfffff8007bff1758 0xfffff8007fd59dc0 nt!KiDpcWatchdog

A large number of DPCs were found, why? One reason why DPC cannot be executed is that the CPU is always under a high IRQL, such as an interrupt. Is the CPU always under an interrupt?

We can use !dpcwatchdog to check the status of dpcwatchdog, as follows:

0: kd> !dpcwatchdog

All durations are in seconds (1 System tick = 15.625000 milliseconds)

Circular Kernel Context Logger history: !logdump 0x2
DPC and ISR stats: !intstats /d

--------------------------------------------------
CPU#0
--------------------------------------------------
Current DPC: No Active DPC

Pending DPCs:
----------------------------------------
CPU Type KDPC Function
 0: Normal: 0xffffaf0c32787278 0xfffff8007fe56420 nt!IopPassiveInterruptDpc
 0: Normal : 0xffffaf0c21e7b520 0xfffff80082092550 Wdf01000!FxInterrupt::_InterruptDpcThunk
 0: Normal: 0xffffaf0c23ce9808 0xfffff80082c314d0 ndis!ndisInterruptDpc
 0: Normal: 0xffffaf0c21eb26e8 0xfffff800858666c0 dxgkrnl!DpiFdoDpcForIsr
 0: Normal: 0xfffff882acf91f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal: 0xfffff8007bff15d8 0xfffff8007fc26b60 nt!PpmPerfAction
 0: Normal: 0xffffaf0c2173a0c8 0xfffff80082ec51d0 tcpip!TcpPeriodicTimeoutHandler
 0: Normal: 0xffffaf0c211f71e8 0xfffff80082732000 stornvme!NVMeCompletionDpcRoutine
 0: Normal: 0xfffff8007bff1c68 0xfffff8007fd4c800 nt!KiEntropyDpcRoutine
 0: Normal : 0xffffff882a8ab6f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal : 0xffffff882ad376f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal: 0xffffaf0c1f122c40 0xfffff8007fd3fd30 nt!KiDpcRuntimeHistoryHashTableCleanupDpcRoutine
 0: Normal : 0xffffaf0c231d0de0 0xfffff80082092550 Wdf01000!FxInterrupt::_InterruptDpcThunk
 0: Normal: 0xffffaf0c21d93db0 0xfffff800b332b470 i8042prt!I8042KeyboardIsrDpc
 0: Normal : 0xffffaf0c23ce9a48 0xfffff80082c314d0 ndis!ndisInterruptDpc
 0: Normal : 0xffffff882a8476f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal : 0xffffaf0c21233000 0xfffff8008278c0f0 storport!RaidpAdapterRedirectDpcRoutine
 0: Normal : 0xffffaf0c23081270 0xfffff800ae096d90 Ndu!NduQueryStatisticsDpc
 0: Normal : 0xffffff882a7f26f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Normal: 0xfffff882ac1a6f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback
 0: Threaded: 0xfffff8007bff1758 0xfffff8007fd59dc0 nt!KiDpcWatchdog


c0000094 Exception in kdexts.dpcwatchdog debugger extension.
      PC: 00007ffa`507625c0 VA: 00000000`00000000 R/W: 0 Parameter: 00000000`00000000

You can also see a large number of Pending DPCs, Current DPC: No Active DPC.

3. Summary

DPC_WATCHDOG_VIOLATION has two blue screen situations:

  1. Parameter 1 = 0 indicates that the DPC routine is currently being executed and has timed out. You can directly see the reason for the DPC timeout (or the DPC has been unable to exit due to other threads).
  2. Parameter 1 = 1 indicates that the CPU has been executing in DPC state at this time. This reason requires detailed analysis.

There are two WinDBG commands that can assist in analyzing this blue screen:

  1. !dpcwatchdogView the dpcwatchdog information of all CPUs.
  2. !dpcsView the dpc routine information of the specified CPU.