General Matrix Multiply
The General Matrix Multiply (GEMM) is a subroutine in the Basic Linear Algebra Subprograms (BLAS) which performs matrix multiplication, that is, the multiplication of two matrices. This includes:
- SGEMM for single precision real,
- DGEMM for double precision real,
- CGEMM for single precision complex, and
- ZGEMM for double precision complex.
GEMM is often tuned by high-performance computing vendors to run as fast as possible, because it is the building block for so many other routines. It is also the most important routine in the LINPACK benchmark. For this reason, fast BLAS implementations typically focus first on GEMM performance.
Operation

The xGEMM routine calculates the new value of matrix C based on the matrix product of matrices A and B and the old value of matrix C:

C ← α·op(A)·op(B) + β·C

where α and β are scalar coefficients and op(X) is either X or its transpose.
The Fortran interface for these procedures is:
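The declaration below follows the reference BLAS; it is shown for DGEMM, with the other precisions differing only in the types of ALPHA, BETA, A, B and C:

```fortran
      SUBROUTINE DGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
*     Computes C := ALPHA*op(A)*op(B) + BETA*C
*     .. Scalar Arguments ..
      DOUBLE PRECISION ALPHA,BETA
      INTEGER K,LDA,LDB,LDC,M,N
      CHARACTER TRANSA,TRANSB
*     .. Array Arguments ..
      DOUBLE PRECISION A(LDA,*),B(LDB,*),C(LDC,*)
```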
where TRANSA and TRANSB determine whether the matrices A and B are to be transposed. M is the number of rows in matrix C and, depending on TRANSA, the number of rows in the original matrix A or its transpose. N is the number of columns in matrix C and, depending on TRANSB, the number of columns in matrix B or its transpose. K is the number of columns in matrix A (or its transpose) and the number of rows in matrix B (or its transpose). LDA, LDB and LDC specify the size of the first dimension of each matrix as laid out in memory, that is, the memory distance between the start of each row or column, depending on the memory layout.
The C interface for these procedures is similar:
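For example, the CBLAS prototype for the double precision real case is:

```c
void cblas_dgemm(const enum CBLAS_ORDER Order,
                 const enum CBLAS_TRANSPOSE TransA,
                 const enum CBLAS_TRANSPOSE TransB,
                 const int M, const int N, const int K,
                 const double alpha, const double *A, const int lda,
                 const double *B, const int ldb,
                 const double beta, double *C, const int ldc);
```

Note the extra Order argument (CblasRowMajor or CblasColMajor), which the column-major Fortran interface does not need.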
where the 'x' in the name is s, d, c, or z, meaning single precision real, double precision real, single precision complex, or double precision complex respectively, and the other arguments have meanings similar to those in the Fortran interface.
For the C interface, a higher-level API also exists. For example, in the GNU Scientific Library, the corresponding interface (taking the double precision real case as an example) is:
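The GSL declaration, together with a typical call (the matrices A, B and C here are illustrative gsl_matrix objects), looks like this:

```c
/* Prototype from <gsl/gsl_blas.h>: */
int gsl_blas_dgemm(CBLAS_TRANSPOSE_t TransA, CBLAS_TRANSPOSE_t TransB,
                   double alpha, const gsl_matrix *A, const gsl_matrix *B,
                   double beta, gsl_matrix *C);

/* Typical call: C := 1.0*A*B + 0.0*C. */
gsl_blas_dgemm(CblasNoTrans, CblasNoTrans, 1.0, A, B, 0.0, C);
```

The dimensions and leading dimensions are carried inside the gsl_matrix objects, which is why M, N, K, LDA, LDB and LDC disappear from the call.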
where TransA is CblasNoTrans or CblasTrans for A or transposed A respectively, and TransB has similar choices.
Optimization

Not only is GEMM an important building block of other numeric software; it is often an important building block of GEMM itself, applied to larger matrices. By decomposing one or both of the input matrices into block matrices, GEMM can be applied repeatedly to the smaller blocks to build up the result for the full matrix. This is one of the motivations for including the β parameter, so the results of previous blocks can be accumulated. Note that this decomposition requires the special case β = 1, which many implementations optimize for, thereby eliminating one multiplication for each value of C.
This decomposition allows for better locality of reference, both spatial and temporal, in the data used in the product. This, in turn, takes advantage of the cache on the system. For systems with more than one level of cache,
the blocking can be applied a second time to the order in which the blocks are used in the computation.
Both of these levels of optimization are used in implementations such as ATLAS.
More recently, implementations by Kazushige Goto have shown that blocking only for the L2 cache, combined with careful amortization of copying to contiguous memory to reduce TLB misses, is superior to ATLAS. A highly tuned implementation based on these ideas is part of the GotoBLAS.