A watchdog timer and system health monitor is a dedicated hardware subsystem that continuously monitors processor operation and system health indicators. It automatically triggers corrective action (reset, interrupt, or safe-state transition) when software execution hangs, thermal limits are exceeded, or supply voltages drift outside specification, providing the autonomous safety net that unattended and safety-critical systems rely on.
Watchdog Timer Architecture:
- Basic Watchdog: a free-running down-counter, ideally clocked by an oscillator that is not derived from the main CPU clock; software must periodically write a specific value to the watchdog register (kick/pet/feed) before the counter reaches zero; if the software fails to respond (hung, crashed, or stuck in an infinite loop), the counter expires and asserts a system reset
- Windowed Watchdog: extends the basic watchdog by defining both a minimum and a maximum time for the kick; the software must respond neither too early nor too late, and a kick outside the window is treated as a fault just like a missed kick; early kicks indicate runaway execution (software looping too fast), so this scheme catches a broader class of software malfunctions than a simple timeout
- Independent Watchdog: uses a completely separate clock source (dedicated RC oscillator or crystal) and power domain from the main CPU; continues operating even if the CPU clock fails; essential for automotive ASIL-D and aerospace applications where the watchdog itself must be immune to the failure modes it monitors
- Multi-Stage Watchdog: provides multiple escalating timeout levels; first timeout generates a non-maskable interrupt (NMI) giving software a chance to recover; second timeout asserts a warm reset; third timeout (if warm reset fails) triggers a cold power-cycle reset
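
The windowed behavior described above can be sketched as a small behavioral model. This is an illustrative simulation, not a register-accurate description of any particular device: the class name `WindowedWatchdog`, the helper `run`, and the tick values (window opens at 50 ticks, timeout at 100 ticks) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowedWatchdog:
    """Behavioral model of a windowed watchdog driven by an abstract tick count."""
    window_open: int              # earliest tick (since last kick) at which a kick is accepted
    timeout: int                  # tick count at which the counter expires without a kick
    elapsed: int = 0              # ticks since the last accepted kick
    reset_asserted: bool = False  # latched once a fault is detected

    def tick(self, n: int = 1) -> None:
        """Advance the watchdog's own clock by n ticks."""
        if self.reset_asserted:
            return
        self.elapsed += n
        if self.elapsed >= self.timeout:
            self.reset_asserted = True  # late: software failed to kick in time

    def kick(self) -> None:
        """Software service request; only valid once the window has opened."""
        if self.reset_asserted:
            return
        if self.elapsed < self.window_open:
            self.reset_asserted = True  # early: treated as runaway execution
        else:
            self.elapsed = 0            # valid kick restarts the count


def run(kick_times, window_open=50, timeout=100, horizon=300):
    """Replay kicks at absolute tick times; return True if a reset occurred."""
    wdt = WindowedWatchdog(window_open, timeout)
    kicks = set(kick_times)
    for t in range(1, horizon + 1):
        wdt.tick()
        if t in kicks:
            wdt.kick()
    return wdt.reset_asserted
```

In this model, kicking every 75 ticks keeps the system alive, while a second kick only 10 ticks after the first, or no kick at all, latches the reset. A multi-stage watchdog could be modeled the same way by replacing the single `reset_asserted` flag with an escalating stage counter.
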
System Health Monitoring:
- Temperature Monitoring: on-die thermal sensors (BJT-based or ring oscillator-based) measure junction temperature at multiple locations; hardware comparators trigger interrupts when temperature approaches the thermal throttle threshold (typically 100°C) and force shutdown above the critical threshold (typically 125°C)
- Voltage Monitoring: on-chip ADC or comparator circuits monitor VDD core, VDD I/O, and other supply rails; under-voltage detection prevents operation below the minimum voltage for reliable logic switching; over-voltage detection prevents gate oxide stress and reliability degradation
- Clock Monitoring: a clock supervisor circuit checks that the main clock is running within the expected frequency range; loss-of-clock detection triggers failsafe mode using the backup oscillator; frequency out-of-range indicates PLL malfunction
- Memory Health: periodic ECC scrubbing of SRAM and flash checks for accumulated bit errors; crossing a correctable error threshold indicates aging or radiation damage that may require preventive maintenance or safe shutdown
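
The comparator-style checks in this list reduce to simple threshold decisions, sketched below. The temperature limits reuse the typical figures quoted above; the function names, the `Action` enum, and the choice to raise the interrupt exactly at the throttle threshold are assumptions for illustration, and real limits are device-specific.

```python
from enum import Enum


class Action(Enum):
    NONE = 0
    INTERRUPT = 1
    SHUTDOWN = 2


# Typical figures quoted in the text; actual thresholds depend on the device.
THROTTLE_C = 100.0
CRITICAL_C = 125.0


def thermal_action(junction_temps_c):
    """Classify the hottest of several on-die sensor readings."""
    hottest = max(junction_temps_c)
    if hottest > CRITICAL_C:
        return Action.SHUTDOWN
    if hottest >= THROTTLE_C:
        return Action.INTERRUPT
    return Action.NONE


def rail_fault(volts, v_min, v_max):
    """Return 'under', 'over', or None for a single supply rail."""
    if volts < v_min:
        return "under"
    if volts > v_max:
        return "over"
    return None


def clock_status(measured_hz, f_min_hz, f_max_hz):
    """Distinguish loss of clock from a frequency outside the expected range."""
    if measured_hz == 0:
        return "lost"
    if not (f_min_hz <= measured_hz <= f_max_hz):
        return "out_of_range"
    return "ok"


def ecc_needs_attention(correctable_errors, threshold):
    """True once the scrubber's correctable-error count reaches the threshold."""
    return correctable_errors >= threshold
```

For example, `thermal_action([90.0, 101.5])` yields `Action.INTERRUPT` because the hottest sensor governs the decision, and `rail_fault(0.85, 0.9, 1.1)` reports an under-voltage on a hypothetical 1.0 V rail with a ±10% tolerance.
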
Design Considerations:
- Kick Sequence: simple single-write kicks are vulnerable to accidental writes from runaway software; robust watchdog designs require a specific multi-step unlock sequence before the kick is accepted, ensuring that only intentional software action can reset the timer
- Reset Behavior: the watchdog reset output must be clean (glitch-free) and held for sufficient duration (typically >100 μs) to ensure all chip blocks properly initialize; the reset cause is recorded in a persistent status register so that software can identify watchdog-triggered resets at boot
- Testability: the watchdog must be testable during manufacturing without waiting for the actual timeout period; test modes provide accelerated timeouts and direct access to the counter and status registers
- Power Consumption: the independent watchdog and its oscillator operate continuously, even in low-power sleep modes; power consumption must be minimized (typically <1 μA total) to avoid significantly impacting battery-powered device standby time
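
The multi-step kick sequence can be illustrated with a small state machine. This is a minimal sketch under stated assumptions: the class name `GuardedKick` and the two key values are hypothetical, and the policy of silently restarting the sequence on a wrong write is one possible choice (some designs instead treat a wrong write as a fault and reset immediately).

```python
class GuardedKick:
    """Accepts a kick only after the full key sequence is written in order."""

    # Hypothetical key values chosen for illustration only.
    KEY_SEQUENCE = (0x55AA, 0xAA55)

    def __init__(self):
        self._step = 0            # index of the next expected key
        self.kicks_accepted = 0   # number of completed, valid sequences

    def write(self, value: int) -> bool:
        """Model one write to the kick register; True if the kick was accepted."""
        if value == self.KEY_SEQUENCE[self._step]:
            self._step += 1
            if self._step == len(self.KEY_SEQUENCE):
                self._step = 0
                self.kicks_accepted += 1
                return True
            return False
        # Wrong value: restart the sequence. If the stray write happens to be
        # the first key, count it as the start of a new attempt.
        self._step = 1 if value == self.KEY_SEQUENCE[0] else 0
        return False
```

A single stray write, or the correct keys in the wrong order, never services the timer in this model; only the complete sequence does, which is the property that protects against accidental writes from runaway code.
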
Watchdog timers and system health monitors are essential autonomous safety infrastructure in microcontrollers and SoCs. They provide the hardware-level failure detection and recovery that keeps systems running reliably when software encounters unexpected conditions, in applications ranging from consumer electronics to life-critical automotive and medical devices.