Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures.[1][2]
The following design patterns can be combined to make a system more fault tolerant: retry, fallback, timeout, circuit breaker, and bulkhead.[3][4]
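As an illustration, three of these patterns (retry, fallback, and circuit breaker) might be combined as in the following Python sketch. The names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry`, and all parameters and default values, are hypothetical and chosen for this example only. Timeout and bulkhead handling are omitted for brevity.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected outright."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects calls
    until `reset_after` seconds have passed (illustrative parameters)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed.

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # Any success resets the failure count.
        return result


def call_with_retry(func, breaker, fallback, attempts=3, delay=0.0,
                    sleep=time.sleep):
    """Retry `func` through `breaker`; return `fallback()` if every attempt
    fails or the breaker is open."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            break  # Do not keep calling a dependency known to be failing.
        except Exception:
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))  # Exponential backoff.
    return fallback()
```

In this sketch the breaker wraps each individual attempt, so repeated failures across retries open the circuit and subsequent requests go straight to the fallback until the cool-down period has passed.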
Fault tolerance can also be improved by measuring 99th-percentile latency and keeping the slowest 1% of requests (the tail latencies) in check through self-healing mechanisms.[5]
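A 99th-percentile latency can be computed from recorded request times in several ways. The following Python sketch uses the nearest-rank method, which is one common definition among several. The function names and the latency budget are hypothetical, and the action taken when the budget is exceeded (for example, restarting or shedding load) is left to the caller.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank.
    return ordered[rank - 1]


def tail_exceeds_budget(latencies_ms, budget_ms, p=99):
    """Return True when the p-th percentile latency is above the budget,
    signalling that a self-healing action may be warranted."""
    return percentile(latencies_ms, p) > budget_ms
```

For example, if 98 of 100 requests take 10 ms and two take 500 ms, the mean is below 20 ms, but `percentile(latencies, 99)` returns 500, which exposes the slow tail.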