View profile

Exploiting cache locality - Software Shots - Issue #4

Exploiting cache locality - Software Shots - Issue #4
By Karn • Issue #4 • View online
Hello there friend, it’s been a week already. Just watched the finale of WandaVision. What! You have not! Go watch, read this later.

So, I did a small experiment to exploit cache locality in order to improve the performance of my matrix multiplication code. Let’s see what did I do and why did it happen.
Loop order
Let’s set the exposition first. This is a typical way to multiply two matrices, O(n^3) algorithm. Notice the fact that we can change the order of the loops in this program without affecting its correctness.
typical matrix multiplication
typical matrix multiplication
What if we change the order of loops?
Does it affect the performance in any way? As a matter of fact it does.
For a value of n = 4096, I got the following runtimes for different orders of nested loops. (BTW ran it on my mac, while writing this up. Should have fired up an EC2 for more accurate results.)
4096 x 4096
  • i, j, k -> 1255.99 sec
  • i, k, j -> 200.38 sec
  • j, i, k -> 1115.81 sec
  • j, k, i -> 3103.15 sec
  • k, i, j -> 210.53 sec
  • k, j, i -> 3042.23 sec
Apparently, the slowest we have is 3103.15 seconds and the fastest is 200.38 seconds. Here, the loop order affects running time by a factor of 15.4.
What if we increase n, does the factor increase linearly or stays constant or increases exponentially?
May be a weekend exercise for you. Give it a try, I’ll quickly go into why this happened.
Hardware Caches
Each processor reads and writes main memory in contiguous blocks, called cache lines. Previously accessed cache lines are stored in a smaller memory called a cache, that sits near the processor.
Entries stored in the caches are not single words but, instead, “lines” of several contiguous words. In early caches these lines were 32 bytes long; nowadays the norm is 64 bytes. If the memory bus is 64 bits wide this means 8 transfers per cache line. DDR supports this transport mode efficiently.
In this matrix multiplication code, matrices are laid out in memory in row-major order:
row major in memory scene
row major in memory scene
There are programming languages that lay out the memory in column-major order as well, e.g. FORTRAN.
Access Pattern for i, j, k
So by now if you haven’t figured it out, let’s make it easier for you and look at the access pattern for Order i, j, k.
Running time: 1255.99 sec
i, j, k access pattern (explanation down below)
i, j, k access pattern (explanation down below)
i, j, k
i, j, k
Have a look at the code first and imagine when we are executing the last nested loop, for particular i and j.
  • For matrix C, we keep on updating the C[i][j] and since i and j are fixed for this iteration of length n, we get excellent spatial locality. (high number of cache hits)
  • For matrix A, we are accessing A[i][k], we access the whole row A[i] and now since the matrix is stored in row-major order, we get good spatial locality.
  • For matrix B, it’s B[k][j] we access a column, which is distributed far away in memory. Now if the memory bus is 64 bits wide this means 8 transfers per cache line. So the processor is going to be ignoring 7 of the 8 words it brings along with it on that cache line. Hence, poor spatial locality.
Access Pattern for i, k, j (fastest)
  • Order is i, k, j. For that nested iteration i, k is fixed and j moves from 0 to n - 1.
  • Let’s recall the final expression, C[i][j] += A[i][k] * B[k][j]
  • We access the entire ith row in C, fixed cell A[i][k], and the kth row in B.
Obviously there’ a lot of scope of performance improvement by parallelizing the multiplication and it depends a lot on the dimensions of the matrix.
By the way: BEWARE of over optimizations.
Hope you liked reading it, let me know by giving a quick thumbs up. Good night :). I’ll probably rewatch WandaVision.
Did you enjoy this issue?
By Karn

I build and create things. As of now, I work as a Software Engineer at SendX/SendPost.

In order to unsubscribe, click here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Powered by Revue
Gyan Karn, Delhi, India - 110039