This article needs additional citations for verification. (September 2014) |
Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel.[1] The redundancy (duplication) allows error detection and error correction: the output from lockstep operations can be compared to determine if there has been a fault if there are at least two systems (dual modular redundancy DMR), and the error can be automatically corrected if there are at least three systems (triple modular redundancy TMR), via majority vote. The term "lockstep" originates from army usage, where it refers to synchronized walking, in which marchers walk as closely together as physically practical.
To run in lockstep, each system is set up to progress from one well-defined state to the next well-defined state. When a new set of inputs reaches the system, it processes them, generates new outputs and updates its state. This set of changes (new inputs, new outputs, new state) is considered to define that step, and must be treated as an atomic transaction; in other words, either all of it happens, or none of it happens, but not something in between. Sometimes a timeshift (delay) is set between systems, which increases the detection probability of errors induced by external influences (e.g. voltage spikes, ionizing radiation, or in situ reverse engineering).