## Compile-time optimisation of matrix multiplication chains

[28th October 2007]

In my previous post I outlined my plans for a new vector and matrix library. I also mentioned a multiplication optimisation that I haven't seen anywhere else^{[1]}, which I'll be outlining here. To get the most out of this post, you'll really need to read the last one, else some of it might seem a little mysterious. It's going to be very heavy on templates and meta-programming. If you're not a fan of that kind of stuff you might want to look away! Familiarity with the Boost meta-programming library (MPL) will certainly help when reading this post.

### Definitions

There are a number of different notations in common use for matrices, so before I say anything else, let me just mention what I'll be using here.

If A is a matrix, of dimensions m×n, then it has m rows and n columns. I'll use the notation A(i,j) to denote the element of A in the i^{th} row and j^{th} column. As it happens, this matches the syntax used by the library. I use the notation M^{m×n} to denote the set of matrices whose dimensions are m×n.

I define multiplication of matrices as follows: if A is in M^{m×n} and B is in M^{n×p}, then the product C = A×B is in M^{m×p}, and its elements are given by C(i,j) = A(i,1)B(1,j) + A(i,2)B(2,j) + … + A(i,n)B(n,j).

### Default C++ associativity is against us

Consider the product of three matrices, P, Q and R: P×Q×R.

Since matrix multiplication is associative, there are two ways we can perform this multiplication to obtain the result. We can either multiply P and Q first, and then multiply the result of that with R, i.e. (P×Q)×R. Or we can multiply Q and R before multiplying P with the result, i.e. P×(Q×R). If I was multiplying this out in my head (or on paper, more like!), I know which route I would take. Both of course yield the same answer, but depending on the dimensions involved, one grouping can involve doing far fewer scalar multiplications than the other. For example, if P were 2×10, Q were 10×10 and R were 10×1, the second grouping would need only 120 scalar multiplications against the first grouping's 220.

However, unless we explicitly bracket sub-expressions in our C++ code, when the compiler comes across `P * Q * R`, it doesn't know that the second grouping is less costly. The language rules say that the calls to `operator*` must be made in left-to-right order.

It would be great if the compiler could somehow deduce the cheapest order in which to perform the multiplications. As it happens, we can actually give the compiler a helping hand without explicitly bracketing the sub-expressions, but we need to gather some key ingredients, first.

### Regrouping the product

We need to find an algorithm that can deduce the best grouping for performing the multiplications in a matrix product chain. We also need to have this algorithm run at compile time, ideally, which would mean zero run-time overhead.

Then, we need to somehow stop the compiler from doing its naive left-to-right multiplication and instead use the better way that we've computed using our algorithm.

The good news is that such an algorithm exists and it's nice and simple. The bad news is that it has O(n^{3}) complexity (where n is the number of matrices). But since it runs at compile time, it's no big deal: we won't see the difference at run-time. Here's the algorithm:

- Take the sequence of matrices and separate it into two subsequences.
- Find the minimum cost of multiplying out each subsequence.
- Add these costs together, and add in the cost of multiplying the two result matrices.
- Do this for each possible position at which the sequence of matrices can be split, and take the minimum over all of them.

Here, the cost of multiplying an M^{m×n} by an M^{n×p} is defined as m×n×p i.e. the number of scalar multiplications involved in computing the result.

### Delaying matrix multiplication

So how do we stop the compiler multiplying our matrices left-to-right? We can have `operator*` create an object whose type is a model of the FixedMatrix concept, but the generated object delays the actual multiplication until its elements are accessed:

```
template<typename LeftMatrix, typename RightMatrix>
class matrix_product : public operands_list<LeftMatrix, RightMatrix>
{
public:
    BOOST_STATIC_ASSERT(matrix_width<LeftMatrix>::value == matrix_height<RightMatrix>::value);

    // Components needed to model the FixedMatrix concept
    typedef typename wider_element_type<LeftMatrix, RightMatrix>::type element_type;
    typedef typename wider_size_type<LeftMatrix, RightMatrix>::type size_type;

    static const size_type height = matrix_height<LeftMatrix>::value;
    static const size_type width = matrix_width<RightMatrix>::value;

    typedef matrix_tag operand_category;
    typedef LeftMatrix left_matrix_type;
    typedef RightMatrix right_matrix_type;

    matrix_product(const left_matrix_type &lhs, const right_matrix_type &rhs) : lhs_(lhs), rhs_(rhs) { }
    matrix_product(const matrix_product &other) : lhs_(other.lhs_), rhs_(other.rhs_) { }

    element_type operator() (size_type i, size_type j) const
    {
        element_type ret(0);
        for (size_type k = 0; k != left_matrix_type::width; ++k)
            ret += at(lhs_, i, k) * at(rhs_, k, j);
        return ret;
    }

    const left_matrix_type &lhs() const { return lhs_; }
    const right_matrix_type &rhs() const { return rhs_; }

private:
    const left_matrix_type &lhs_;
    const right_matrix_type &rhs_;
};
```

You'll see that I've defined `operator()` to calculate each element as it's requested. So even if someone were to do something a little weird, like accessing a particular element of a temporary, the code would still work as expected:

```
// Assuming P and Q are each of a type that is a model of FixedMatrix
double x = (P * Q)(0, 1); // fine, matrix_product<>::operator() called
```

Also, note what happens when we multiply more than two matrices in a chain. We effectively get a tree of `matrix_product<>`s. For example, if `P`, `Q` and `R` are each of type `mat10x10`, then `P * Q * R` generates an object of type:

```
matrix_product< matrix_product<mat10x10, mat10x10>, mat10x10 >
```

While this stops the multiplication happening at the usual time, we have actually created a rather large *inefficiency* here. To see why, you need to realise that when multiplying two matrices together, you typically need to access each element of either operand more than once. But if accessing those elements involves a calculation, such as the one performed in `matrix_product<>::operator()`, we're needlessly doing the same work over and over again.

But fear not! Later on we'll see a way to dig our way out of this particular hole.

Lastly, you'll see that `matrix_product<>` derives from `operands_list<>`. We'll see what this gives us shortly.

### Planning the product re-grouping code

So, let's have a look at the meat and potatoes of the problem: assembling a compile-time data structure that indicates the preferred grouping of a matrix multiplication chain. There are a number of ways to go about this, and I am obliged to point you in the direction of an alternative that was kindly presented to me in a conversation in the comp.lang.c++.moderated newsgroup. My method is somewhat different, but they boil down to the same thing in the end.

So, the way I approached this was to devise the following structure, which would act as a binary tree inside the compiler.

```
template<typename Lhs, typename Rhs>
struct assoc_tree
{
    typedef Lhs lhs;
    typedef Rhs rhs;
};
```

The tree structure becomes obvious when you realise that either of the types `Lhs` and `Rhs` may themselves be `assoc_tree<>`s. Alternatively, either or both could be some other type, representing a leaf. In my implementation, I used `boost::mpl::int_<>`s to represent the leaves^{[2]}.

For example, the following type represents a grouping that indicates the 2nd and 3rd matrices should be multiplied first, before the 1st matrix is multiplied with the result.

```
namespace mpl = boost::mpl;
typedef assoc_tree<mpl::int_<0>, assoc_tree<mpl::int_<1>, mpl::int_<2> > > grouping_tree;
```

If we can build up a structure like this at compile-time we'll have obtained an entity that we can use to control the re-grouping of the product.

### Putting together an `assoc_tree`

So we've got a data structure of sorts to aim at now. We just need to create it somehow. To do this, we need to implement the algorithm I quoted from Wikipedia earlier, and we need to run it at compile time.

The most intuitive way to do this, I think, is to assemble a sequence of the types involved in the product, or rather a `boost::mpl::vector<>` of types. A `boost::mpl::vector<>` is the compile-time counterpart of `std::vector<>` in the C++ standard library. Instead of containing values, it contains types.

Once we've created our vector of types, we can then run the grouping algorithm over it in order to create an `assoc_tree<>`. As it turns out, this vector was actually pretty easy to put together. The idea is to keep a vector inside each `matrix_product<>`. Whenever a `matrix_product<>` is instantiated with a pair of template parameters, it takes the existing vectors from its `LeftMatrix` and `RightMatrix` and joins them end-to-end to create its own vector. We can use template specialisation to achieve this:

```
template<typename LeftMatrix, typename RightMatrix>
struct operands_list
{
    typedef mpl::vector<LeftMatrix, RightMatrix> operands;
};

template<typename Lhs, typename Rhs, typename RightMatrix>
struct operands_list<matrix_product<Lhs, Rhs>, RightMatrix>
{
    // append an element to a vector
    typedef typename mpl::push_back<typename matrix_product<Lhs, Rhs>::operands, RightMatrix>::type operands;
};

template<typename LeftMatrix, typename Lhs, typename Rhs>
struct operands_list<LeftMatrix, matrix_product<Lhs, Rhs> >
{
    // prepend an element to a vector
    typedef typename mpl::push_front<typename matrix_product<Lhs, Rhs>::operands, LeftMatrix>::type operands;
};

template<typename Lhs1, typename Rhs1, typename Lhs2, typename Rhs2>
struct operands_list<matrix_product<Lhs1, Rhs1>, matrix_product<Lhs2, Rhs2> >
{
    typedef typename matrix_product<Lhs1, Rhs1>::operands lops;
    typedef typename matrix_product<Lhs2, Rhs2>::operands rops;

    // join two existing vectors
    typedef typename mpl::insert_range<lops, typename mpl::end<lops>::type, rops>::type operands;
};
```

Now the root `matrix_product<>` in the hierarchy is of a type whose inner `operands` typedef is the sequence of types we need (you'll recall that `matrix_product<>` publicly inherits `operands_list<>`). In the above, we used a number of `boost::mpl` meta-functions to prepend/append (`push_front`/`push_back`) types to a vector, and also `mpl::insert_range` to effectively join two existing vectors together.

So we have a list of matrix types and an outline for a tree structure that will contain indices into this list. Now let's see how to construct the actual `assoc_tree<>` that represents the optimal grouping for the chain.

The way I arrived at my implementation was to sketch out the basic algorithm in pseudo-code and then adapt the iterative parts so that they were implemented using recursion instead. This needed to be done because you can't do iteration particularly well at compile-time. Recursion on the other hand can be achieved with templates, with a particular specialisation representing the termination condition.

Here's that process spelled out a little more explicitly:

```
// Pseudo code using iteration.
// Returns a 2-tuple containing the optimal cost and the corresponding assoc_tree.
// list is a sequence of TYPES, which is typically a sub-list of the original list
// of fixed matrices. offset represents how far in to the original list you have to
// look in order to find the list passed to this function.
function find_best_assoc(list, offset)
{
    len = length(list)
    if (len == 1)
    {
        // cost of multiplying a chain of 1 matrix
        return (0, offset) // an integer is a leaf node of an assoc_tree
    }
    if (len == 2)
    {
        // cost of multiplication for the types at positions 0 and 1 in the list
        cost = list[0].height * list[0].width * list[1].width
        return (cost, assoc_tree<offset, offset + 1>)
    }

    // We have more than 2 elements, so a number of different groupings must be considered
    best_cost = INFINITY
    best_assoc = NULL
    for (i = 1; i != len; ++i)
    {
        // split the list about position i and find the details for either side of the split
        (left_cost, left_assoc) = find_best_assoc(sub_range(list, 0, i), offset)
        (right_cost, right_assoc) = find_best_assoc(sub_range(list, i, len), offset + i)

        total_cost = left_cost +
                     right_cost +
                     list[0].height * list[i].height * list[len-1].width

        if (total_cost < best_cost)
        {
            best_cost = total_cost
            best_assoc = assoc_tree<left_assoc, right_assoc>
        }
    }
    return (best_cost, best_assoc)
}
```

In the initial pseudo-code version above, there is an element of recursion, in that we're calling `find_best_assoc()` inside itself, but there is still a loop that must be eliminated. The last term in the `total_cost` calculation is the cost of multiplying together the sub-products on either side of the chain when it is split about position `i`.

So let's get rid of this loop now, in favour of recursion.

```
// Pseudo code using recursion only, no iteration
function find_best_assoc(list, offset)
{
    len = length(list)
    if (len == 1)
    {
        return (0, offset) // an integer is a leaf node of an assoc_tree
    }
    if (len == 2)
    {
        cost = list[0].height * list[0].width * list[1].width
        return (cost, assoc_tree<offset, offset + 1>)
    }

    // We have more than 2 elements, so a number of different groupings must be considered
    return split_details(list, 1, INFINITY, NULL, offset)
}

// Calculates the best cost by splitting the list at positions split and beyond.
// best_cost and best_assoc are the best cost and assoc_tree configuration found so far
function split_details(list, split, best_cost, best_assoc, offset)
{
    len = length(list)
    if (split == len)
    {
        return (best_cost, best_assoc)
    }

    (left_cost, left_assoc) = find_best_assoc(sub_range(list, 0, split), offset)
    (right_cost, right_assoc) = find_best_assoc(sub_range(list, split, len), offset + split)

    total_cost = left_cost +
                 right_cost +
                 list[0].height * list[split].height * list[len-1].width

    if (total_cost < best_cost)
    {
        best_cost = total_cost
        best_assoc = assoc_tree<left_assoc, right_assoc>
    }
    return split_details(list, split + 1, best_cost, best_assoc, offset)
}
```

Now that we have an entirely recursive algorithm, we can turn this into compile-time C++. The recursion-termination conditions in each function must be translated into template specialisations, but otherwise things are much the same. The returned cost and `assoc_tree` are exposed through the `value` constants and `assoc_node` typedefs in each of the template classes.

```
template<typename FixedMatrices, std::size_t Length, std::size_t Offset>
struct find_best_assoc;

template
<
    typename FixedMatrices,
    std::size_t Split,
    std::size_t Length,
    std::size_t BestCost,
    typename BestAssoc,
    std::size_t Offset
>
struct split_details
{
    typedef typename sub_range<FixedMatrices, 0, Split>::type left_list;
    typedef typename sub_range<FixedMatrices, Split, Length>::type right_list;

    typedef find_best_assoc<left_list, Split, Offset> left_best;
    typedef find_best_assoc<right_list, Length - Split, Offset + Split> right_best;

    static const std::size_t total_cost =
        left_best::value + right_best::value +
        matrix_height<typename mpl::at_c<FixedMatrices, 0>::type>::value *
        matrix_width<typename mpl::at_c<FixedMatrices, Length - 1>::type>::value *
        matrix_height<typename mpl::at_c<FixedMatrices, Split>::type>::value;

    static const bool smaller = total_cost < BestCost;
    static const std::size_t best_cost = smaller ? total_cost : BestCost;

    typedef assoc_tree<typename left_best::assoc_node, typename right_best::assoc_node> proposed_assoc;
    typedef typename mpl::if_<mpl::bool_<smaller>, proposed_assoc, BestAssoc>::type best_assoc;

    typedef split_details<FixedMatrices, Split + 1, Length, best_cost, best_assoc, Offset> cas;

    static const std::size_t value = cas::value;
    typedef typename cas::assoc_node assoc_node;
};

template
<
    typename FixedMatrices,
    std::size_t Length,
    std::size_t BestCost,
    typename BestAssoc,
    std::size_t Offset
>
struct split_details<FixedMatrices, Length, Length, BestCost, BestAssoc, Offset>
{
    static const std::size_t value = BestCost;
    typedef BestAssoc assoc_node;
};

template<typename FixedMatrices, std::size_t Length, std::size_t Offset>
struct find_best_assoc
{
    static const std::size_t best_cost = ~std::size_t(0);
    typedef split_details<FixedMatrices, 1, Length, best_cost, void, Offset> cas;

    static const std::size_t value = cas::value;
    typedef typename cas::assoc_node assoc_node;
};

template<typename FixedMatrices, std::size_t Offset>
struct find_best_assoc<FixedMatrices, 1, Offset>
{
    static const std::size_t value = 0;
    typedef mpl::int_<Offset> assoc_node;
};

template<typename FixedMatrices, std::size_t Offset>
struct find_best_assoc<FixedMatrices, 2, Offset>
{
    typedef typename mpl::at_c<FixedMatrices, 0>::type at0;
    typedef typename mpl::at_c<FixedMatrices, 1>::type at1;

    static const std::size_t value = cost<at0, at1>::value;
    typedef assoc_tree<mpl::int_<Offset>, mpl::int_<Offset + 1> > assoc_node;
};
```

Phew! I've omitted the definition of the `sub_range` meta-function, but you can find it in the zip of source code available at the end. So now, if we have a `matrix_product<Lhs, Rhs>`, we can find the optimal grouping like so:

```
typedef typename matrix_product<Lhs, Rhs>::operands operands;
const std::size_t num_operands = detail::mpl::size<operands>::value;
typedef typename detail::find_best_assoc<operands, num_operands, 0>::assoc_node best_assoc;
```

`best_assoc` is now the `assoc_tree<>` structure we have been seeking. We aren't really interested in the final optimal cost, but we had to calculate it along the way in order to compare the different groupings.

There are still two pieces missing at this point. The first is that we still don't have any code to actually perform the regrouping, even though we have a structure describing the preferred multiplication order. The second is that we need to trigger this regrouping somehow. For example consider this code:

`c = a * b;`

Here, we will end up assigning a `matrix_product<>` to the matrix `c`. Without special logic inside the assignment operator involved here, the elements of `c` will be filled using the `operator()` of said `matrix_product<>`, which doesn't perform any regrouping.

Let's tackle this problem of triggering the re-group first, before moving on to look at the actual mechanics of the re-grouping code itself.

### Making the re-group happen

I currently can't see any way around the fact that we'll need some kind of check inside functions such as `operator=` to see if we're dealing with a `matrix_product<>`. However, we can use this to our advantage by allowing clients of our library to use this hook for their own purposes. Let's see what I'm babbling on about.

The library comes with a concrete template class called `matrix`, which is of course a model of the FixedMatrix concept. One of its constructors looks like this:

```
template<typename FixedMatrix>
explicit matrix(const FixedMatrix &m)
{
    BOOST_STATIC_ASSERT(matrix_height<FixedMatrix>::value == height);
    BOOST_STATIC_ASSERT(matrix_width<FixedMatrix>::value == width);

    typedef optimize_matrix_access<FixedMatrix, width * height> optaccess;
    typename optaccess::type m2(optaccess::optimize(m));

    for (size_type i = 0; i != height; ++i)
        for (size_type j = 0; j != width; ++j)
            (*this)(i, j) = matrix_element(m2, i, j);
}
```

The first two lines check that the FixedMatrix we're given has dimensions matching those of the matrix we're constructing. The next two lines are where we have a little bit of magic happening.

`optimize_matrix_access<>` is a template class that clients can specialise. They can change the behaviour of the static `optimize()` function to return an object whose type is still a model of FixedMatrix, but whose elements can be accessed with optimal efficiency.

The default implementation just looks like this:

```
template<typename FixedMatrix, unsigned AccessCount>
struct optimize_matrix_access
{
    typedef const FixedMatrix &type;
    static type optimize(const FixedMatrix &m) { return m; }
};
```

So by default, `optimize_matrix_access<>::optimize()` doesn't do anything; it simply passes through its argument. However, we can specialise this template class for `matrix_product<>` in order to kick off the regrouping of the matrix multiplication chain. And it's not too hard to imagine cases where you might have other models of FixedMatrix that could take advantage of this hook.

So here's what the specialisation for `matrix_product<>` looks like:

```
template<typename Lhs, typename Rhs, unsigned AccessCount>
struct optimize_matrix_access<detail::matrix_product<Lhs, Rhs>, AccessCount>
{
    typedef detail::matrix_product<Lhs, Rhs> product;
    typedef matrix<typename product::element_type, product::height, product::width> type;
    typedef typename product::operands operands;

    static type optimize(const product &m)
    {
        const std::size_t num_operands = detail::mpl::size<operands>::value;
        typedef typename detail::find_best_assoc<operands, num_operands, 0>::assoc_node best_assoc;
        return detail::optimize_matrix_product<best_assoc, product>::multiply_out(m);
    }
};
```

In this specialisation we define the `type` typedef to be a bona fide `matrix<>`, which is the type returned by the `optimize()` function. In the last line of the `optimize()` member function, you'll see that we call a function to multiply out the matrix chain using our `assoc_tree<>`.

Before we move on to looking at that, I'll just point out one last thing. The current implementation of `optimize_matrix_access<>` has an `AccessCount` integral template parameter. It is expected that the value of this will be the number of element accesses made by the calling code. I'm not entirely sure if this will be useful yet, but you can imagine that if only a single element needs to be read, there would sometimes be no real point in multiplying out the chain. So by allowing specialisations based on the (expected) number of element accesses, we can account for this kind of thing.

### Changing the multiplication order

OK, so we've done all the prep. Pencils are sharpened, shoelaces are tied, eyebrows are waxed. We now need the final ingredient: the elusive `optimize_matrix_product<>` class that we just saw, and its `multiply_out()` function in particular.

```
template<typename AssocTree, typename FixedMatrixProduct>
struct optimize_matrix_product
{
    typedef multiplication_traits<typename FixedMatrixProduct::operands, AssocTree> traits;
    typedef matrix<typename traits::element_type, traits::height, traits::width> result_type;

    static result_type multiply_out(const FixedMatrixProduct &original)
    {
        typedef typename optimize_matrix_product<typename AssocTree::lhs, FixedMatrixProduct>::result_type lhs;
        typedef typename optimize_matrix_product<typename AssocTree::rhs, FixedMatrixProduct>::result_type rhs;

        return generic_matrix_product<lhs, rhs>::multiply
        (
            optimize_matrix_product<typename AssocTree::lhs, FixedMatrixProduct>::multiply_out(original),
            optimize_matrix_product<typename AssocTree::rhs, FixedMatrixProduct>::multiply_out(original)
        );
    }
};

template<int Index, typename FixedMatrixProduct>
struct optimize_matrix_product<mpl::int_<Index>, FixedMatrixProduct>
{
    typedef multiplication_traits<typename FixedMatrixProduct::operands, mpl::int_<Index> > traits;
    typedef matrix<typename traits::element_type, traits::height, traits::width> result_type;

    static result_type multiply_out(const FixedMatrixProduct &original)
    {
        return result_type(matrix_at<Index>(original));
    }
};
```

If you ignore all the typedefs and so on, the code is actually pretty simple. We're calling `generic_matrix_product<>::multiply()` (which is by default your standard triple-nested loop to perform matrix multiplication) with arguments whose values are generated by recursive calls to `optimize_matrix_product<>::multiply_out()`. These recursive calls descend down the branches of our `assoc_tree<>` and `matrix_product<>` structures, fetching the relevant matrices. Note the specialisation of `optimize_matrix_product<>` to handle a leaf node of an `assoc_tree<>`.

There are still some details omitted, such as the implementation of `multiplication_traits<>`, which appears in the `multiply_out()` functions. This is simply a class that contains a bunch of typedefs to describe the attributes of the multiplication about to be performed.

The other thing that I haven't described in detail is the `matrix_at<>()` function, as seen in the specialisation of `optimize_matrix_product<>` for leaf nodes of `assoc_tree<>`s. This is actually significantly more complicated, but essentially uses similar tricks to those I've already outlined. I won't go into great detail here, but the basic idea is as follows.

`matrix_product<>` contains a second `mpl::vector<>`, called `directions`. The i^{th} element of `directions` is itself an `mpl::vector<>` describing the sequence of left/right branches that need to be taken to find the i^{th} matrix of the chain within the `matrix_product<>` hierarchy.

So when `matrix_at<Index>(product)` gets called, the `directions` typedef is accessed from `product`'s type. The `mpl::vector<>` located at position `Index` in `directions` contains the sequence of left/right turns that need to be taken to descend down the `matrix_product<>` to access the `Index`^{th} matrix in the chain.

For example, if the directions to get to the matrix at index `idx` were left, left, right, then `matrix_at<idx>` would effectively end up calling `product.lhs().lhs().rhs()`.

The details get a bit hairy, so I'll leave you to rummage through the code if you want to know more. I hope I'll be able to reduce its complexity a little over time.

So there you have it. You'll note that all the runtime code we've defined consists of short one-liners, easily optimized by most compilers (assuming they can handle all the meta-programming to begin with!).

But has all this shuffling actually improved the speed of matrix multiplication?

### Benchmarks

The answer to that question isn't as straightforward as I had hoped. In some cases we do see considerable improvement in execution speed. However, the construction of the `matrix_product<>` and the calls to its `lhs()` and `rhs()` member functions cannot be optimised out completely^{[3]}. This means that multiplying three 4×4 matrices in a chain, for example, will be slower with the proposed method, because each grouping has the same cost and so we're doing a little bit of extra work for no benefit.

So even though we managed to deduce the optimal grouping entirely at compile-time, actually shuffling all the references around incurs a small runtime penalty that cannot be avoided (unless there are some additional tricks that elude me).

So let's have a look at some pretty graphs. It is not the intention to compare the results between different compilers, but rather the differences in speed between the two multiplication techniques on each compiler in turn. Indeed, the results for the Apple GCC compiler were obtained on an entirely different machine, so comparisons of those results against any other compiler would be meaningless.

First, we have the graph that illustrates the kind of speed-up we can get. The second graph illustrates the worst case scenario for this code, i.e. where it actually adds a little bit of overhead compared with naive chain multiplication.

Here are the compiler flags I used.

- GCC (MinGW 3.4.5, MinGW 4.2.1 sjlj, Apple GCC 4.0.1):
`-Wall -Wextra -pedantic -O3 -ansi -Wswitch -D NDEBUG`

- MSVC 8:
`/nologo /D _SECURE_SCL=0 /D _HAS_ITERATOR_DEBUGGING=0 /Oi /Zc:forScope,wchar_t /EHsc /wd4996 /D _CRT_SECURE_NO_DEPRECATE /MT /GR /Ox /W3 /D NDEBUG`

I'm sure there are some additional flags that I could have added to improve things a little (such as better automatic vectorisation on the GCC 4-series compilers), but if the `matrix_product<>` overhead cannot be optimised out, we'd still see the same kind of results. But you're welcome to prove me wrong; in fact I'd love it if you did! The code I used to generate these figures is available in the zip file download at the end of this post.

### Conclusion

We have a partial success here. But unfortunately probably not enough of a success to use this code in the wild.

I say this because the most common case, for me at least, will be multiplying a number of 4×4 matrices together, or a 4×4 by a 4×1, which is where this alternative method doesn't help. Occasionally, I'll have a 4×4 by 4×4 by 4×1 chain, but by adding parentheses explicitly we can still beat this method. So for now, I'm just going to go with the usual matrix multiplication algorithm. It's a shame in a way, but I enjoyed the experiment nonetheless!

I will keep the code around and perhaps work on it some more in future. It might be the case that compilers will one day be able to make up the difference, or I might come across a way that allows current compilers to optimize this stuff better.

### Downloads

##### Footnotes


All original content copyright © Edd Dawson.


All source code appearing on this website that was written by Edd Dawson is made available under the terms of the Boost software license version 1.0 unless otherwise stated or implied by the license associated with the work from which the code is derived.