Functional safety in chip design is governed by ISO 26262, the functional safety standard for electrical and electronic systems in road vehicles. For automotive semiconductor products it requires hardware/software co-design toward an ASIL (Automotive Safety Integrity Level) target, quantified diagnostic coverage, and failure mode analysis, so that vehicles operate safely despite hardware faults.
ASIL Levels and Automotive Requirements
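- ASIL Classification: A (least stringent) to D (most stringent), with QM (quality management only) below A. The ASIL of each hazard is determined by severity (S, degree of injury), exposure (E, probability of being in the relevant driving situation), and controllability (C, ability of the driver to avoid harm).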
- Severity/Exposure/Controllability Matrix: Example: brake failure = highest severity (S3), high exposure (E4), low controllability (C3) → ASIL D (highest). ASIL D designs typically use dual-channel architectures and extensive diagnostics.
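- Hardware Safety Requirements: ASIL D targets a single-point fault metric (SPFM) ≥99% and a latent fault metric (LFM) ≥90%, which in practice drives redundancy (2-channel) and fault isolation. ASIL B is less stringent (SPFM ≥90%, LFM ≥60%) and is commonly met by a single channel with monitoring.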
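- Hardware vs Software Split: Both hardware and software contribute to safety. Hardware development is covered by ISO 26262 Part 5 and software by Part 6, with system-level integration in Part 4 and semiconductor-specific guidance in Part 11. Integrated assessment across both domains required.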
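The severity/exposure/controllability lookup can be sketched in a few lines. This is a minimal illustration, not a substitute for the ISO 26262-3 table: it assumes classes S1-S3, E1-E4, and C1-C3 are passed as plain integers, and it relies on the commonly cited shortcut that the table collapses to the sum of the three class indices. The function name `determine_asil` is hypothetical.

```python
# Sketch of ASIL determination from S/E/C classes.
# Assumption: the risk table is reproduced by summing the class indices
# (sum 10 -> D, 9 -> C, 8 -> B, 7 -> A, anything lower -> QM).

ASIL_BY_SUM = {7: "A", 8: "B", 9: "C", 10: "D"}


def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Return 'QM', 'A', 'B', 'C' or 'D' for S1-S3, E1-E4, C1-C3."""
    if not (1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3):
        raise ValueError("expected S1-S3, E1-E4, C1-C3")
    # Lowering any one class by one step lowers the result by one level.
    return ASIL_BY_SUM.get(severity + exposure + controllability, "QM")


if __name__ == "__main__":
    # Brake-failure example from the text: S3, E4, C3.
    print(determine_asil(3, 4, 3))
```

Lowering any single class by one step (for example, a hazard that is easier for the driver to control) drops the result by one level, which is why hazard analysis spends so much effort justifying each class.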
Safety Island Architecture
- Redundant Processing: ASIL D designs incorporate dual independent processors (separate cores, separate memory, separate I/O). Outputs compared; mismatch indicates failure, triggers safe state.
- Lockstep Execution: Twin cores execute identical instructions on identical inputs, synchronously check results. Transient faults (single-event upsets) detected via mismatch, triggering safe action.
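- Voter Logic: Compares channel outputs; with two channels this is a comparator, and any disagreement triggers safe state (halt, safe default output) because the faulty channel cannot be identified. Voter itself must be ASIL-compliant (simple, auditable logic).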
- Isolated I/O Paths: Separate A/D converters, sensor inputs, actuator outputs per channel. Single failure (sensor malfunction) doesn't propagate to multiple channels.
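The two-channel comparison described above might look like the following sketch. The names `compare_channels`, `CompareResult`, and `SAFE_DEFAULT` are hypothetical, and the channel outputs are modeled as plain integers rather than real actuator commands.

```python
from dataclasses import dataclass

SAFE_DEFAULT = 0  # hypothetical de-energized actuator command


@dataclass
class CompareResult:
    output: int
    safe_state: bool


def compare_channels(channel_a: int, channel_b: int) -> CompareResult:
    """Forward the value only if both independent channels agree."""
    if channel_a == channel_b:
        return CompareResult(output=channel_a, safe_state=False)
    # With only two channels the faulty one cannot be identified,
    # so the reaction is to drop to the safe default output.
    return CompareResult(output=SAFE_DEFAULT, safe_state=True)


if __name__ == "__main__":
    print(compare_channels(42, 42))
    print(compare_channels(42, 43))
```

Keeping the comparator this small is deliberate: it is a single point of failure for the whole scheme, so it has to be simple enough to verify exhaustively.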
Hardware Diagnostic Coverage
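- Diagnostic Coverage (DC): Percentage of a hardware element's failure rate that is detected or controlled by safety mechanisms such as built-in self-test (BIST) and runtime monitoring. ISO 26262-5 classes 60%, 90%, and 99% as low, medium, and high DC; the ASIL D SPFM target of ≥99% generally calls for high DC on safety-critical blocks.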
- Common Failures Covered: Single-bit memory errors (ECC detects and corrects), stuck-at faults (BIST exercises logic), clock distribution failures (clock monitor), supply voltage excursions (brown-out detection).
- Latent Faults: Multiple-point faults that stay undetected until a redundancy comparison fails or a periodic test runs. Periodic self-test (e.g., every 10-100ms, or at each power-on) bounds how long a fault can remain latent.
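- Safe Failure: Detected failures trigger safe actions (limp-home mode for the engine, fail-safe fallback for braking or steering). ISO 26262 requires that a detected fault lead to a defined safe state within the fault tolerant time interval, rather than to uncontrolled behavior.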
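As a rough illustration of how diagnostic coverage is expressed numerically, the sketch below computes DC for one block and the residual failure rate that escapes its safety mechanism. The function names and the 50 FIT SRAM figures are invented for the example.

```python
def diagnostic_coverage(detected_fit: float, total_fit: float) -> float:
    """Fraction of a block's failure rate caught by its safety mechanisms."""
    if total_fit <= 0:
        raise ValueError("total_fit must be positive")
    if not 0 <= detected_fit <= total_fit:
        raise ValueError("detected_fit must lie in [0, total_fit]")
    return detected_fit / total_fit


def residual_fit(total_fit: float, dc: float) -> float:
    """Failure rate that escapes the safety mechanism."""
    return total_fit * (1.0 - dc)


if __name__ == "__main__":
    # Hypothetical SRAM block: 50 FIT total, of which ECC covers 49.5 FIT.
    dc = diagnostic_coverage(49.5, 50.0)
    print(f"DC = {dc:.1%}, residual = {residual_fit(50.0, dc):.2f} FIT")
```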
Safe State Machine Design
- Finite State Machine (FSM): Control logic models system states (Idle, Running, Fault, Safe_Shutdown). Transitions guarded by fault detection logic.
- Watchdog Timer: Independent timer circuit monitors software execution progress. Software must "kick" watchdog periodically. Timeout indicates hang, triggers reset/safe state.
- Timeout Logic: Detects abnormal software execution duration (software loop stuck). Timeout accuracy requires temperature-stable oscillator and careful timeout value selection.
- Safe State Transition: Upon fault, FSM transitions to safe state (output safe defaults, disable dangerous actuators). Transition logic itself subjected to extensive verification.
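The FSM and watchdog interaction described above can be sketched as a small behavioral model. This is a software illustration under simplifying assumptions (discrete ticks, one fault input, no reset path), not RTL; the class names `Watchdog` and `SafetyFsm` and the `step` interface are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    FAULT = auto()
    SAFE_SHUTDOWN = auto()


class Watchdog:
    """Counts ticks since the last kick; expires at timeout_ticks."""

    def __init__(self, timeout_ticks: int):
        self.timeout_ticks = timeout_ticks
        self.counter = 0

    def kick(self) -> None:
        self.counter = 0

    def tick(self) -> bool:
        """Advance one tick; return True once the timeout has expired."""
        self.counter += 1
        return self.counter >= self.timeout_ticks


class SafetyFsm:
    def __init__(self, watchdog: Watchdog):
        self.state = State.IDLE
        self.watchdog = watchdog
        self.actuator_enabled = False

    def step(self, start: bool = False, fault: bool = False,
             kicked: bool = False) -> State:
        if kicked:
            self.watchdog.kick()
        # The watchdog only runs while the application is running.
        expired = self.watchdog.tick() if self.state is State.RUNNING else False
        if self.state is State.IDLE and start:
            self.watchdog.kick()
            self.state, self.actuator_enabled = State.RUNNING, True
        elif self.state is State.RUNNING and (fault or expired):
            self.state = State.FAULT
            self.actuator_enabled = False  # disable outputs immediately
        elif self.state is State.FAULT:
            self.state = State.SAFE_SHUTDOWN  # latched; no exit in this sketch
        return self.state


if __name__ == "__main__":
    fsm = SafetyFsm(Watchdog(timeout_ticks=3))
    fsm.step(start=True)
    for _ in range(4):  # software never kicks, so the watchdog expires
        print(fsm.step())
```

Note that `SAFE_SHUTDOWN` has no outgoing transition here; a real design would leave it only through a reset that re-runs the power-on diagnostics.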
FMEDA Analysis
- Failure Modes, Effects, and Diagnostic Analysis: Systematic identification of all component failures (transistors, capacitors, resistors), effects (circuit malfunction), and detectability (diagnostic coverage).
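- Hardware Components: In principle FMEDA covers every transistor, wire, and via; for an IC it is usually carried out at block or sub-part level, with failure rates typically apportioned by area or gate count. Failure modes: stuck-at 0/1, open, short, out-of-spec leakage.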
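- Software Failures: Software faults are systematic rather than random, so they carry no failure rate in the FMEDA and are addressed by separate analyses. Code coverage and control-flow analysis confirm there are no hidden execution paths. Compiler-generated code audited for safety properties.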
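- Failure Rate Calculation: Each component assigned a failure rate in FIT (1 FIT = 1 failure per 10^9 device-hours). Rates are classified as safe, single-point/residual, or latent multiple-point and summed across all safety-related elements, including both redundant channels, to compute SPFM, LFM, and the probabilistic metric for random hardware failures (PMHF).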
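The roll-up from per-block failure rates to the hardware architectural metrics can be sketched as follows. The formulas follow the usual ISO 26262-5 definitions as understood here (SPFM = 1 - (single-point + residual) / total; LFM = 1 - latent / (total - single-point - residual)), and the target pairs are the SPFM/LFM values for ASIL B, C, and D. The `FmedaRow` structure, function names, and all FIT values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FmedaRow:
    name: str
    total_fit: float   # total failure rate of the block, in FIT
    spf_rf_fit: float  # single-point + residual fault rate
    latent_fit: float  # latent multiple-point fault rate


# (SPFM minimum, LFM minimum) per ASIL
TARGETS = {"B": (0.90, 0.60), "C": (0.97, 0.80), "D": (0.99, 0.90)}


def hardware_metrics(rows):
    """Return (SPFM, LFM) summed over all safety-related blocks."""
    total = sum(r.total_fit for r in rows)
    spf_rf = sum(r.spf_rf_fit for r in rows)
    latent = sum(r.latent_fit for r in rows)
    spfm = 1.0 - spf_rf / total
    lfm = 1.0 - latent / (total - spf_rf)
    return spfm, lfm


def meets_target(rows, asil: str) -> bool:
    spfm, lfm = hardware_metrics(rows)
    spfm_min, lfm_min = TARGETS[asil]
    return spfm >= spfm_min and lfm >= lfm_min


if __name__ == "__main__":
    # Hypothetical FMEDA summary rows.
    rows = [
        FmedaRow("cpu_lockstep", 100.0, 0.5, 4.0),
        FmedaRow("sram_ecc", 50.0, 0.5, 2.0),
        FmedaRow("clock_monitor", 10.0, 0.2, 1.0),
    ]
    spfm, lfm = hardware_metrics(rows)
    print(f"SPFM = {spfm:.2%}, LFM = {lfm:.2%}, ASIL D ok: {meets_target(rows, 'D')}")
```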
ECC and Memory Safety
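- Single-Error Correction, Double-Error Detection (SECDED): Extended Hamming-code ECC corrects single-bit errors and detects double-bit errors. Overhead: 8 check bits per 64-bit word (7 Hamming bits plus 1 overall parity bit), i.e. a (72,64) code.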
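- Parity Checking: Simple parity (even/odd) detects any odd number of bit errors but cannot correct them. SECDED corrects 1-bit errors and detects (but does not correct) 2-bit errors; errors of 3 or more bits may be miscorrected or go undetected.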
- Memory Initialization: All memory is written on boot so that data and ECC check bits are consistent. Uninitialized memory can raise spurious ECC errors and is treated as a potential safety hazard.
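- Scrubbing: Background process periodically reads memory and writes back corrected data, repairing single-bit errors before a second upset in the same word makes them uncorrectable. Typical scrub interval: 100-1000ms, depending on memory size and expected upset rate.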
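A small extended-Hamming model shows how SECDED separates correctable single-bit errors from detect-only double-bit errors, and where the check-bit count for a 64-bit word comes from. This is a bit-list sketch for clarity, assuming position 0 holds the overall parity bit and power-of-two positions hold the Hamming check bits; the function names are hypothetical and real ECC hardware uses parallel XOR trees instead.

```python
def check_bits_secded(data_bits: int) -> int:
    """Hamming check bits r with 2**r >= data_bits + r + 1, plus one overall parity bit."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def encode(data: list[int]) -> list[int]:
    r = check_bits_secded(len(data)) - 1
    n = len(data) + r
    code = [0] * (n + 1)  # index 0 = overall parity
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        # Check bit p makes parity even over all positions with bit p set.
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: list[int]) -> tuple[list[int], str]:
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome <= n:
        fixed[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. a double-bit error: detected, not fixed
    data = [fixed[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return data, status


if __name__ == "__main__":
    print(check_bits_secded(64))  # check bits needed for a 64-bit word
    word = encode([1, 0, 1, 1, 0, 0, 1, 0])
    word[5] ^= 1  # inject a single-bit upset
    print(decode(word))
```

A scrubber is essentially a loop that calls the decode step on every word and writes back any "corrected" result before a second upset can turn it into an uncorrectable one.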
Lockstep CPU Cores and Comparison
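- Dual-Core Lockstep: Identical cores execute the same instruction stream and are compared every cycle (per-bit XOR of outputs, OR-reduced into a single mismatch flag). Area impact: the shadow core roughly doubles CPU logic area, but the chip-level overhead is much smaller (~10-15% where CPU logic is a small share of the die), since memories and peripherals are typically shared and protected by ECC instead.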
- Transient Fault Detection: Single-event upsets (SEU) from cosmic rays/alpha particles introduce bit flips. Comparison detects bit flips, triggers safe shutdown.
- Permanent vs Transient: Lockstep only detects a mismatch; it doesn't distinguish transient from permanent faults. Secondary diagnostics (power-on self-test, logic BIST re-run after reset) assess whether the damage is permanent.
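The per-cycle comparison can be illustrated with a toy behavioral model in which a single bit flip is injected into the shadow core. The "core" here is just a 32-bit accumulator, and the names `alu_step`, `lockstep_mismatch`, and `run_lockstep` are hypothetical; the point is only the XOR-then-OR-reduce check.

```python
def lockstep_mismatch(main_out: int, shadow_out: int) -> bool:
    """Per-bit XOR of the two cores' outputs, OR-reduced to one flag."""
    return (main_out ^ shadow_out) != 0


def alu_step(acc: int, operand: int) -> int:
    """Toy 32-bit 'core': accumulate with wrap-around."""
    return (acc + operand) & 0xFFFFFFFF


def run_lockstep(operands, flip_cycle=None, flip_bit=0):
    """Run both cores on the same inputs.

    Optionally inject one bit flip into the shadow core at flip_cycle.
    Returns the cycle index of the first mismatch, or None.
    """
    main = shadow = 0
    for cycle, op in enumerate(operands):
        main = alu_step(main, op)
        shadow = alu_step(shadow, op)
        if cycle == flip_cycle:
            shadow ^= 1 << flip_bit  # simulated single-event upset
        if lockstep_mismatch(main, shadow):
            return cycle
    return None


if __name__ == "__main__":
    print(run_lockstep([1, 2, 3, 4]))
    print(run_lockstep([1, 2, 3, 4], flip_cycle=2, flip_bit=7))
```

The comparator reports that the cores disagree but not which one is wrong or whether the fault will recur, which is why a follow-up self-test is needed to separate transient from permanent faults.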
Automotive Certification Flow
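- Design Assurance: ISO 26262 Parts 2-9 prescribe the safety lifecycle and development process (safety management, requirements, design, verification, validation), with Part 5 covering hardware and Part 11 giving semiconductor guidance. Auditable design history required.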
- Qualification Support: Foundry provides fault modeling, process variation characterization, failure rate data. OEM and Tier-1 supplier co-verify designs.
- Sign-Off Artifacts: Safety manual documents architecture, failure modes, FMEDA tables, test procedures. Independent assessment bodies (e.g., TÜV) audit artifacts pre-production.
- Field Monitoring: Post-production vehicles monitored for safety-relevant failures. Recalls issued if undiagnosed failures discovered or ASIL requirements not met.