Loop tiling
Encyclopedia
Loop tiling, also known as loop blocking, or strip mine and interchange, is a loop optimization
used by compiler
s to make the execution of certain types of loops more efficient.
until it is reused. The partitioning of loop iteration space leads to partitioning of large array into smaller blocks, thus fitting accessed array elements into cache size, enhancing cache reuse and eliminating cache size requirements.
An ordinary loop
can be blocked with a block size B by replacing it with
where min is a function returning the minimum of its arguments.
After we apply loop tiling using 2 * 2 blocks, our code looks like:
The original loop iteration space is n by n. The accessed chunk of array a[i, j] is also n by n. When n is too large and the cache size of the machine is too small, the accessed array elements in one loop iteration (for example,
Another example using an algorithm for matrix multiplication:
Original matrix multiplication:
DO I = 1, M
DO K = 1, M
DO J = 1, M
Z(J, I) = Z(J, I) + X(K, I) * Y(J, K)
After loop tiling B*B:
DO K2 = 1, M, B
DO J2 = 1, M, B
DO I = 1, M
DO K1 = K2, MIN(K2 + B - 1, M)
DO J1 = J2, MIN(J2 + B - 1, M)
Z(J1, I) = Z(J1, I) + X(K1, I) * Y(J1, K1)
It is not always easy to decide what value of tiling size is optimal for one loop because it demands an accurate estimate of accessed array regions in the loop and the cache size of the target machine. The order of loop nests (loop interchange
) also plays an important role in achieving better cache performance.
Loop optimization
In compiler theory, loop optimization plays an important role in improving cache performance, making effective use of parallel processing capabilities, and reducing overheads associated with executing loops. Most execution time of a scientific program is spent on loops...
used by compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...
s to make the execution of certain types of loops more efficient.
Overview
Loop tiling partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the cacheCPU cache
A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations...
until it is reused. The partitioning of loop iteration space leads to partitioning of large array into smaller blocks, thus fitting accessed array elements into cache size, enhancing cache reuse and eliminating cache size requirements.
An ordinary loop
can be blocked with a block size B by replacing it with
where min is a function returning the minimum of its arguments.
Example
The following is an example of matrix vector multiplication. There are three arrays, each with 100 elements. The code does not partition the arrays into smaller sizes.After we apply loop tiling using 2 * 2 blocks, our code looks like:
The original loop iteration space is n by n. The accessed chunk of array a[i, j] is also n by n. When n is too large and the cache size of the machine is too small, the accessed array elements in one loop iteration (for example,
i = 1
, j = 1 to n
) may cross cache lines, causing cache misses.Another example using an algorithm for matrix multiplication:
Original matrix multiplication:
DO I = 1, M
DO K = 1, M
DO J = 1, M
Z(J, I) = Z(J, I) + X(K, I) * Y(J, K)
After loop tiling B*B:
DO K2 = 1, M, B
DO J2 = 1, M, B
DO I = 1, M
DO K1 = K2, MIN(K2 + B - 1, M)
DO J1 = J2, MIN(J2 + B - 1, M)
Z(J1, I) = Z(J1, I) + X(K1, I) * Y(J1, K1)
It is not always easy to decide what value of tiling size is optimal for one loop because it demands an accurate estimate of accessed array regions in the loop and the cache size of the target machine. The order of loop nests (loop interchange
Loop interchange
In compiler theory, loop interchange is the process of exchanging the order of two iteration variables.For example, in the code fragment: for i from 0 to 10 for j from 0 to 20 a[i,j] = i + jloop interchange would result in: for j from 0 to 20...
) also plays an important role in achieving better cache performance.
Further reading
- Wolfe, M. More Iteration Space Tiling. Supercomputing'89, pages 655—664, 1989.
- Wolf, M. E. and Lam, M. A Data Locality Optimizing Algorithm. PLDIPLDIProgramming Language Design and Implementation is one of the ACM SIGPLAN's most important conferences. The precursor of PLDI was the Symposium on Compiler Optimization, held July 27–28, 1970 at the University of Illinois at Urbana-Champaign and chaired by Robert S. Northcote. That conference...
'91, pages 30—44, 1991. - Irigoin, F. and Triolet, R. Supernode Partitioning. POPLPOPLThe annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages is an academic conference in the field of computer science, with focus on fundamental principles in the design, definition, analysis, and implementation of programming languages, programming systems, and programming...
'88, pages 319—329, 1988. - Xue, J. Loop Tiling for Parallelism. Kluwer Academic Publishers. 2000.
- M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, April 1991.