Memory Hierarchy Behavior Study during the Execution of Recursive Linear Algebra Library



Introduction
In the following text, we assume that A, B and C are real matrices of order n. Every modern CPU architecture uses a complex memory hierarchy scheme (for details see [6]); we assume the scheme depicted in Fig. 1. In contrast to other authors, we also take the performance impact of the TLB into account.

The cache model
The cache model we consider corresponds to the structure of the L1 and L2 (and optionally L3) caches in the Intel x86 architecture. An s-way set-associative cache consists of h sets, and one set consists of s independent blocks (called lines in the Intel terminology). Let C_S denote the size of the data part of a cache in bytes and B_S denote the cache block size in bytes. Then C_S = s × B_S × h. Let S_D denote the size of type double and S_I the size of type integer.
We consider only two types of cache misses:
• Compulsory misses (sometimes called intrinsic or cold misses), which occur when new data is loaded into empty cache blocks.
• Thrashing misses (also called cross-interference, conflict, or capacity misses), which occur when useful data has been prematurely replaced from a cache block for capacity reasons.
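The relation C_S = s × B_S × h determines how a memory address maps to a cache set. As a minimal illustration (using the L1 parameters that appear later in the paper as example values), the set index of an address can be computed as follows:

```c
#include <stdint.h>

/* Example cache geometry (the L1 parameters used later in the paper):
   block size B_S = 64 B, associativity s = 8, data size C_S = 32 KiB. */
#define BS 64u
#define S  8u
#define CS (32u * 1024u)
#define H  (CS / (S * BS))   /* number of sets: C_S = s * B_S * h, so h = 64 */

/* Map a byte address to its cache set index (illustrative, direct mapping
   of the block number onto the sets). */
static unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / BS) % H);
}
```

Two addresses that differ by a multiple of B_S × h map to the same set; this is the resonance effect discussed in the measurement sections.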

The code restructuring
There are many compiler techniques for improving performance (for details see [8,9]). These techniques are used to maximize the cache hit ratio and to improve ALU and FPU utilization.

Existing LA libraries and their possible drawbacks
There exist many libraries for linear algebra (LA), such as BLAS [5], LAPACK [2], Intel MKL [1], etc. They are tuned for a given architecture, but they also have some drawbacks:
• The whole library must be included in the final code, which can significantly increase the code size.
• Licence problems can occur.
• Some formats for storing sparse matrices are not supported.
• Operations are "black boxes", so the user cannot manage the inner solution process.
• A combination of two or more operations with the same operands is solved separately (i.e., inefficiently).
The standard code of these routines achieves good performance, due to a high cache hit ratio, only for small matrix orders. When good performance is required for larger values, modifications must be made. In numerical algebra packages, this is achieved by explicit loop restructuring. This includes loop unroll-and-jam, which increases FPU pipeline utilization in the innermost loop, together with loop blocking (which is why we refer to blocked codes) and loop interchange to maximize the cache hit ratio. After applying these transformations, the codes are divided into two parts: the outer loops are "out-of-cache" and the inner loops are "in-cache". The codes then have almost the same performance irrespective of the amount of data.
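The blocking and interchange transformations described above can be sketched as follows. This is only an illustrative C version, not the paper's measured code; the blocking factor NB is an example value that would in practice be derived from the cache parameters:

```c
/* Sketch of a blocked (tiled) GEMM, C += A * B, for matrices of order n
   stored row-major. NB is the blocking threshold; 64 is an arbitrary
   example value, not a tuned one. */
#define NB 64

void gemm_blocked(int n, const double *A, const double *B, double *C) {
    /* "out-of-cache" loops: walk over NB x NB tiles */
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                /* "in-cache" loops: operate on one tile triple */
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The tile loops touch only about 3 × NB² × S_D bytes at a time, which is what keeps the working set inside the cache regardless of n.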
But good cache utilization can also be achieved by a "divide-and-conquer" approach. Codes that use such recursive-style formulations are referred to as recursive-based.

Blocked code programming
Effective implementation in the blocked code programming style requires the following steps:
1. Straightforward implementation.
2. Design a coarse cache model of this code.
3. Apply source code transformations (according to the architecture parameters).
4. Refine the cache model.
5. Derive optimal values of the program thresholds from the cache parameters.

Recursive code programming
The motivating idea of recursive codes is to divide the matrices into disjoint submatrices of the same size (see [3,4]). The resulting code has much better spatial and temporal locality. Effective implementation in the recursive code programming style requires the following steps:
1. Straightforward implementation.
2. Apply source code transformations (according to the architecture parameters).
3. Express the routine(s) in the recursive style.
4. Measure the threshold value(s).

Discussion
The main differences between blocked and recursive-based codes are:
• Cache analysis of some LA routines is very difficult, much more difficult than a recursive formulation of these routines.
• For optimal performance of blocked codes, the program may need different in-L1-cache loops and in-L2-cache loops (and in-L3-cache loops, respectively). In recursive-based codes, this is done automatically by the partitioning of the data.
• Blocked codes lead to even better cache utilization, because the data is divided optimally to fit into the cache. In recursive codes, some part of the cache memory remains unused.
• The recursive style of expression is very easy to understand and results in error-free codes.
• Recursive-based codes have additional stack management overhead in comparison to blocked codes.
Detailed overview of the matrix-matrix multiplication procedure

Matrix-matrix multiplication
We consider input real matrices A, B and C of order n. A standard sequential pseudocode for matrix-matrix multiplication C = A × B (abbreviated as GEMM, assuming the loop ordering i, j, k) is as follows:
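A straightforward C rendering of this triple loop (under the stated i, j, k ordering, with row-major storage assumed) might look like:

```c
/* Straightforward GEMM, C = A * B, loop order i, j, k,
   for matrices of order n stored row-major. */
void gemm_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```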

Recursion in matrix-matrix multiplication
Let C(i) denote that the final result (matrix C) is computed by recursion over the variable i. Consider the partitioning of matrices A, B and C as follows. For better illustration, the partitioning of the matrices is depicted in Fig. 3.
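As a minimal sketch of the C(i) case, recursion over i splits A and C into top and bottom halves while B is shared; the names, the lda convention and the THRESHOLD value below are illustrative assumptions, not the paper's measured code:

```c
/* Non-recursive base case ("in-cache" kernel): C = A * B on an
   m x n slice of matrices with leading dimension lda; the inner
   (k) dimension is n, i.e. square full matrices are assumed. */
static void gemm_base(int m, int n, int lda,
                      const double *A, const double *B, double *C) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * lda + k] * B[k * lda + j];
            C[i * lda + j] = sum;
        }
}

/* GEMM with recursion over the i (row) dimension only: split A and C
   into top and bottom halves until the slice fits under THRESHOLD.
   THRESHOLD = 2 is an example; in practice it is measured. */
#define THRESHOLD 2

void gemm_rec_i(int m, int n, int lda,
                const double *A, const double *B, double *C) {
    if (m <= THRESHOLD) {
        gemm_base(m, n, lda, A, B, C);
        return;
    }
    int m1 = m / 2;
    gemm_rec_i(m1,     n, lda, A,            B, C);
    gemm_rec_i(m - m1, n, lda, A + m1 * lda, B, C + m1 * lda);
}
```

Recursion over j or k partitions B and C, or A and B, analogously.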

Compulsory misses
During the execution of the GEMM(A, B, C, i, j, k) routine, all elements of arrays A, B and C must be loaded into or written to the cache. So, the number of compulsory misses is 3 n² S_D / B_S. A similar idea is also valid for TLB misses.
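Reading the formula directly (every byte of the three n × n double matrices must enter the cache exactly once, one miss per cache block), the count can be sketched as:

```c
/* Compulsory (cold) misses for GEMM on three n x n matrices of doubles:
   N_comp = 3 * n^2 * S_D / B_S, with S_D = sizeof(double) and
   B_S = cache block size in bytes. Assumes B_S divides 3*n^2*S_D. */
static long compulsory_misses(long n, long s_d, long b_s) {
    return 3 * n * n * s_d / b_s;
}
```

For example, with n = 800, S_D = 8 and B_S = 64, this gives 240,000 compulsory misses.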

HW configuration
All results were measured on a Pentium Celeron M420 at 1.6 GHz with 2 GB RAM at 333 MHz and the following cache parameters: the L1 cache is a data cache with B_S = 64, C_S = 32 kB, s = 8, h = 64, and an LRU replacement strategy; the L2 cache is a unified cache with B_S = 64, C_S = 1 MB, s = 8, h = 2048, and an LRU strategy.

Test data
We measured instances using square matrices with order n ranging from 10 to 1000 (up to 800 for the cache analysis) with a step of 10.

Measurement methodology
• All cache events were monitored by the Cache Analyzer (CA) [7].
• For measuring the TLB miss ratio, we also used the CA, with the following parameters: h = 1, s = 128, B_S = 4096. This means that we model the TLB as a fully associative cache with a block size equal to the system page size.
• For performance measurements, the average of five measured values was taken as the result.
• The caches were flushed before each measurement. In real measurements, they were flushed by reading a given amount of useless data; in measurements using the CA, they were flushed by a CA command.
• The code size of the innermost loops is negligible in comparison to the size of the L2 cache, so we assume that the unified L2 cache is used only for data.
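The flush-by-reading trick mentioned above can be sketched as follows; the buffer size (a multiple of the 1 MB L2 capacity) and the names are illustrative assumptions:

```c
/* Flush the data caches by streaming through a buffer several times larger
   than the L2 cache (1 MB on the measured machine), touching one element
   per cache block so every resident line is evicted. The volatile sink
   keeps the compiler from optimizing the reads away. */
#define FLUSH_BYTES (4u * 1024u * 1024u)   /* 4x the 1 MB L2, example margin */
#define FLUSH_BLOCK 64u                    /* B_S of both caches */

static char flush_buf[FLUSH_BYTES];
volatile char sink;

void flush_caches(void) {
    for (unsigned i = 0; i < FLUSH_BYTES; i += FLUSH_BLOCK)
        sink = flush_buf[i];
}
```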

L1 cache miss ratio
A comparison of the L1 cache miss rates for three different variants of the GEMM procedure is illustrated in Fig. 4. The peaks in this graph are caused by resonance between the matrix element mapping function and the cache memory mapping function. Both unrolled variants achieve better L1 utilization.

L2 cache miss ratio
A comparison of the L2 cache miss rates for three different variants of the GEMM procedure is illustrated in Fig. 5. We can conclude that when the size of the operands exceeds the data cache size (for n > 350), L2 cache utilization drops significantly. Both unrolled variants achieve better L2 utilization.

Choosing the optimal value of the threshold
A comparison of cache utilization for three different threshold values is depicted in Figs. 6 and 7. If the threshold is too large, the L1 cache miss ratio increases rapidly. The L2 cache miss ratio remains the same across the whole measured set.
Comparisons of the cache miss rates for three different variants of the GEMM procedure are illustrated in Figs. 8 and 9. The recursive variant of the GEMM procedure achieves much better cache utilization across the whole measured set.

TLB cache miss ratio
A comparison of TLB utilization for three different threshold values is depicted in Fig. 10. With the exception of one case (caused by resonance of the matrix element mapping function with the TLB memory mapping function for n = 510), the number of TLB misses is very low.
A comparison of the TLB miss rates for three different variants of the GEMM procedure is illustrated in Fig. 11. The recursive variant of the GEMM procedure achieves much better TLB utilization for larger matrices (n > 250). The recursive variant also achieves very good performance irrespective of matrix size; it was only slightly slower than the vendor implementation.

Conclusions
In this paper, we have described the implementation of some important LA routines using a recursive approach, concentrating on the matrix-matrix multiplication routine. All tested routines achieve about 80-90 % of the performance of the vendor MKL library. We can conclude that the recursive variants achieve good cache and TLB utilization, and performance close to the vendor implementation, irrespective of matrix size.