Loop unrolling

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to improve a program's execution speed at the expense of its binary size, an instance of the space–time tradeoff. The transformation can be applied manually by the programmer or automatically by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, because the increased code size can cause more cache misses; cf. Duff's device.^[1]

The goal of loop unrolling is to increase a program's speed in three ways: by reducing or eliminating the instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration;^[2] by reducing branch penalties; and by hiding latencies, including the delay in reading data from memory.^[3] To eliminate this overhead, a loop can be rewritten as a repeated sequence of similar, independent statements.^[4]
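The rewriting described above can be illustrated with a minimal C sketch. The function names, the element-wise scaling task, and the unroll factor of 4 are assumptions chosen for illustration; they do not come from the cited sources. The unrolled version performs the index increment and "end of loop" test once per four elements, and a short remainder loop handles lengths that are not a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: every element costs one multiply-and-store plus one
   index increment and one "end of loop" test. */
void scale_rolled(int *dst, const int *src, size_t n, int k) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}

/* Unrolled by a factor of 4 (an arbitrary choice for this sketch).
   Assuming dst and src do not partially overlap, the four statements
   in the body do not depend on one another, and the increment and
   test now run once per four elements. */
void scale_unrolled4(int *dst, const int *src, size_t n, int k) {
    size_t i = 0;
    for (; n - i >= 4; i += 4) {
        dst[i]     = src[i]     * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++) {
        dst[i] = src[i] * k;
    }
}

/* Small self-check: both forms must produce identical output. */
void check_same_result(void) {
    int src[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int a[10], b[10];
    scale_rolled(a, src, 10, 3);
    scale_unrolled4(b, src, 10, 3);
    for (size_t i = 0; i < 10; i++) {
        assert(a[i] == b[i]);
    }
}
```

Whether the unrolled form is actually faster depends on the processor, the compiler, and the effect of the larger code on the cache, so any such change is best confirmed by measurement.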

Loop unrolling is also part of certain formal verification techniques, in particular bounded model checking.^[5]
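In this setting the loop is unwound a fixed number of times to obtain a loop-free program that a solver can analyze. The following C sketch shows roughly what such an unwinding looks like when written out by hand. The function names, the bit-counting example, and the bound of 3 are hypothetical and are not taken from any particular tool or from the cited work.

```c
#include <assert.h>
#include <stdbool.h>

/* Original program: counts the set bits of x. The number of loop
   iterations depends on the input. */
unsigned count_bits(unsigned x) {
    unsigned c = 0;
    while (x != 0) {
        c += x & 1u;
        x >>= 1;
    }
    return c;
}

/* The same loop unwound to a depth of 3 (a hypothetical bound).
   Each nested "if" is one copy of the loop body. The function
   returns false when the input needs more iterations than the bound
   allows, which is the case a checker would report as an
   insufficient unwinding bound. */
bool count_bits_unwound3(unsigned x, unsigned *out) {
    unsigned c = 0;
    if (x != 0) {
        c += x & 1u;
        x >>= 1;
        if (x != 0) {
            c += x & 1u;
            x >>= 1;
            if (x != 0) {
                c += x & 1u;
                x >>= 1;
                if (x != 0) {
                    return false; /* bound of 3 was too small */
                }
            }
        }
    }
    /* Example property checked on the loop-free program. */
    assert(c <= 3);
    *out = c;
    return true;
}
```

For inputs that fit within the bound, such as 5 or 7, the unwound version agrees with the original loop; for an input such as 8, which needs four iterations, it signals that the bound is insufficient instead of returning a result.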

[1] Tso, Ted (August 22, 2000). "Re: [PATCH] Re: Move of input drivers, some word needed from you". lkml.indiana.edu. Linux kernel mailing list. Retrieved August 22, 2014. "Jim Gettys has a wonderful explanation of this effect in the X server. It turns out that with branch predictions and the relative speed of CPU vs. memory changing over the past decade, loop unrolling is pretty much pointless. In fact, by eliminating all instances of Duff's Device from the XFree86 4.0 server, the server shrunk in size by _half_ _a_ _megabyte_ (!!!), and was faster to boot, because the elimination of all that excess code meant that the X server wasn't thrashing the cache lines as much."
[2] Ullman, Jeffrey D.; Aho, Alfred V. (1977). Principles of compiler design. Reading, Mass: Addison-Wesley Pub. Co. pp. 471–2. ISBN 0-201-10073-8.
[3] Petersen, W. P.; Arbenz, P. (2004). Introduction to Parallel Computing. Oxford University Press. p. 10.
[4] Nicolau, Alexandru (1985). "Loop Quantization: Unwinding for Fine-Grain Parallelism Exploitation". Dept. of Computer Science Technical Report. Ithaca, NY: Cornell University. OCLC 14638257.
[5] Model Checking Using SMT and Theory of Lists.
