
Compile-time optimisation of matrix multiplication chains

[28th October 2007]

In my previous post I outlined my plans for a new vector and matrix library. I also mentioned a multiplication optimisation that I haven't seen anywhere else[1], which I'll be outlining here. To get the most out of this post, you'll really need to read the last one, or else some of it might seem a little mysterious. It's going to be very heavy on templates and meta-programming; if you're not a fan of that kind of stuff you might want to look away! Familiarity with the Boost Metaprogramming Library (MPL) will certainly help when reading this post.

Definitions

There are a number of different notations in common use for matrices, so before I say anything else, let me just mention what I'll be using here.

If A is a matrix of dimensions m×n, then it has m rows and n columns. I'll use the notation A(i,j) to denote the element of A in the ith row and jth column. As it happens, this matches the syntax used by the library. I use the notation M_{m×n} to denote the set of matrices whose dimensions are m×n.

I define multiplication of matrices as follows:

If A is in M_{m×n} and B is in M_{n×p}, then C = A×B is in M_{m×p}, where each element is given by

    C(i,j) = A(i,0)·B(0,j) + A(i,1)·B(1,j) + ... + A(i,n-1)·B(n-1,j)

i.e. the sum of the n terms pairing the ith row of A with the jth column of B.

Default C++ associativity is against us

Consider the multiplication of 3 matrices, P, Q and R:

[Diagram: the multiplication chain P × Q × R]

Since matrix multiplication is associative, there are two ways we can perform this multiplication to obtain the result. We can either multiply P and Q first, and then multiply the result of that with R, i.e. (P×Q)×R. Or we can multiply Q and R before multiplying P with the result, i.e. P×(Q×R). If I were multiplying this out in my head (or on paper, more like!), I know which route I would take. While both of course yield the same answer, one way can involve doing far fewer multiplications than the other, depending on the dimensions involved.
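To make the difference concrete, suppose (with illustrative dimensions of my own choosing; the original diagram may have used others) that P is 10×100, Q is 100×50 and R is 50×1. Counting one scalar multiplication for each multiply-accumulate in the formula above:

    (P×Q)×R: 10·100·50 + 10·50·1 = 50,500 scalar multiplications
    P×(Q×R): 100·50·1 + 10·100·1 =  6,000 scalar multiplications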

However, unless we explicitly bracket sub-expressions in our C++ code, when the compiler comes across P * Q * R, it doesn't know that this second grouping is less costly. The language rules say that the calls to operator* must be made in left-to-right order.

It would be great if the compiler could somehow deduce the cheapest order in which to perform the multiplications. As it happens, we can actually give the compiler a helping hand without explicitly bracketing the sub-expressions, but we need to gather some key ingredients, first.

Regrouping the product

We need to find an algorithm that can deduce the best grouping for performing the multiplications in a matrix product chain. Ideally, this algorithm should run at compile time, which would mean zero run-time overhead.

Then, we need to somehow stop the compiler from doing its naive left-to-right multiplication and instead use the better way that we've computed using our algorithm.

The good news is that such an algorithm exists and it's nice and simple. The bad news is that it has O(n³) complexity (where n is the number of matrices). But since it runs at compile time, that's no big deal, as we won't see the difference at run-time.

The algorithm is described on Wikipedia: http://en.wikipedia.org/wiki/Chain_matrix_multiplication

Here, the cost of multiplying an M_{m×n} by an M_{n×p} is defined as m×n×p, i.e. the number of scalar multiplications involved in computing the result.
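This cost shows up later as a cost<> meta-function (you'll see it used in find_best_assoc<> towards the end of the compile-time implementation), whose definition doesn't appear in this post. A minimal sketch consistent with the definition above might be the following; matrix_height<> and matrix_width<> are the library's own traits, but the body is my assumption:

template<typename LeftMatrix, typename RightMatrix>
struct cost
{
    // m * n * p: one scalar multiplication per multiply-accumulate
    static const std::size_t value =
        matrix_height<LeftMatrix>::value *  // m
        matrix_width<LeftMatrix>::value *   // n (== matrix_height<RightMatrix>::value)
        matrix_width<RightMatrix>::value;   // p
};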

Delaying matrix multiplication

So how do we stop the compiler multiplying our matrices left-to-right? We can have operator* create an object whose type is a model of the FixedMatrix concept, but which delays the actual multiplication until its elements are accessed:

template<typename LeftMatrix, typename RightMatrix>
class matrix_product : public operands_list<LeftMatrix, RightMatrix>
{
    public:
        BOOST_STATIC_ASSERT(matrix_width<LeftMatrix>::value == matrix_height<RightMatrix>::value);

        // Components needed to model the FixedMatrix concept
        typedef typename wider_element_type<LeftMatrix, RightMatrix>::type element_type;
        typedef typename wider_size_type<LeftMatrix, RightMatrix>::type size_type;
        static const size_type height = matrix_height<LeftMatrix>::value;
        static const size_type width = matrix_width<RightMatrix>::value;
        typedef matrix_tag operand_category;

        typedef LeftMatrix left_matrix_type;
        typedef RightMatrix right_matrix_type;

        matrix_product(const left_matrix_type &lhs, const right_matrix_type &rhs) : lhs_(lhs), rhs_(rhs) { }

        matrix_product(const matrix_product &other) : lhs_(other.lhs_), rhs_(other.rhs_) { }

        element_type operator() (size_type i, size_type j) const
        {
            element_type ret(0);
            for (size_type k = 0; k != left_matrix_type::width; ++k)
                ret += at(lhs_, i, k) * at(rhs_, k, j);
            return ret;
        }

        const left_matrix_type &lhs() const { return lhs_; }
        const right_matrix_type &rhs() const { return rhs_; }

    private:
        const left_matrix_type &lhs_;
        const right_matrix_type &rhs_;
};
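The operator* overload that builds one of these objects isn't shown in the post; a minimal sketch might look like the following (the real library presumably restricts the overload to models of FixedMatrix, e.g. via enable_if, which I've omitted):

template<typename LeftMatrix, typename RightMatrix>
matrix_product<LeftMatrix, RightMatrix>
operator* (const LeftMatrix &lhs, const RightMatrix &rhs)
{
    // No arithmetic happens here; we just record references to the operands
    return matrix_product<LeftMatrix, RightMatrix>(lhs, rhs);
}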

You'll see that I've defined operator() to calculate each element as it's requested. So even if someone were to do something a little weird, like accessing a particular element of a temporary, the code would still work as expected:

// Assuming P and Q are each of a type that is a model of FixedMatrix
double x = (P * Q)(0, 1); // fine, matrix_product<>::operator() called

Also, note what happens when we multiply more than two matrices in a chain. We effectively get a tree of matrix_product<>s. For example, if P, Q and R are each of type mat10x10, then P * Q * R generates an object of type:

matrix_product< matrix_product<mat10x10, mat10x10>, mat10x10 >

While this stops the multiplication happening at the usual time, we have actually created a rather large inefficiency, here. To see why, you need to realise that when multiplying two matrices together, you typically need to access each element of either operand more than once. But if accessing those elements involves a calculation, such as the one performed in matrix_product<>::operator(), we're needlessly doing the same work over and over again.

But fear not! Later on we'll see a way to dig our way out of this particular hole.

Lastly, you'll see that matrix_product<> derives from operands_list<>. We'll see what this gives us, shortly.

Planning the product re-grouping code

So, let's have a look at the meat and potatoes of the problem: assembling a compile-time data structure that indicates the preferred grouping of a matrix multiplication chain. There are a number of ways to go about this and I am obliged to point you in the direction of an alternative that was kindly presented to me in a conversation in the comp.lang.c++.moderated newsgroup. My method is somewhat different, but they boil down to the same thing in the end.

So, the way I approached this was to devise the following structure, which would act as a binary tree inside the compiler.

template<typename Lhs, typename Rhs>
struct assoc_tree
{
    typedef Lhs lhs;
    typedef Rhs rhs;
};

The tree structure becomes obvious when you realise that the types Lhs and Rhs may themselves be assoc_tree<>s. Alternatively, either or both could be some other type, representing a leaf. In my implementation, I used boost::mpl::int_<>s to represent the leaves[2].

For example, the following type represents a grouping that indicates the 2nd and 3rd matrices should be multiplied first, before the 1st matrix is multiplied with the result.

namespace mpl = boost::mpl;

typedef assoc_tree<mpl::int_<0>, assoc_tree<mpl::int_<1>, mpl::int_<2> > > grouping_tree;

If we can build up a structure like this at compile-time we'll have obtained an entity that we can use to control the re-grouping of the product.

Putting together an assoc_tree

So we've got a data structure of sorts to aim at now. We just need to create it somehow. To do this, we need to implement the algorithm linked from Wikipedia earlier, and we need to run it at compile time.

The most intuitive way to do this, I think, is to assemble a sequence of the types involved in the product, or rather a boost::mpl::vector<> of types. A boost::mpl::vector<> is the compile-time counterpart of std::vector<> in the C++ standard library. Instead of containing values, it contains types.

Once we've created our vector of types, we can then run the grouping algorithm over it in order to create an assoc_tree<>. As it turns out, this vector was actually pretty easy to put together. The idea is to keep a vector inside each matrix_product<>. Whenever a matrix_product<> is instantiated with a pair of template parameters, it takes the existing vectors from its LeftMatrix and RightMatrix and joins them end-to-end to create its own vector. We can use template specialisation to achieve this:

template<typename LeftMatrix, typename RightMatrix>
struct operands_list
{
    typedef mpl::vector<LeftMatrix, RightMatrix> operands;
};

template<typename Lhs, typename Rhs, typename RightMatrix>
struct operands_list<matrix_product<Lhs, Rhs>, RightMatrix>
{
    // append an element to a vector
    typedef typename mpl::push_back<typename matrix_product<Lhs, Rhs>::operands, RightMatrix>::type operands;
};

template<typename LeftMatrix, typename Lhs, typename Rhs>
struct operands_list<LeftMatrix, matrix_product<Lhs, Rhs> >
{
    // prepend an element to a vector
    typedef typename mpl::push_front<typename matrix_product<Lhs, Rhs>::operands, LeftMatrix>::type operands;
};

template<typename Lhs1, typename Rhs1, typename Lhs2, typename Rhs2>
struct operands_list<matrix_product<Lhs1, Rhs1>, matrix_product<Lhs2, Rhs2> >
{
    typedef typename matrix_product<Lhs1, Rhs1>::operands lops;
    typedef typename matrix_product<Lhs2, Rhs2>::operands rops;

    // join two existing vectors
    typedef typename mpl::insert_range<lops, typename mpl::end<lops>::type, rops>::type operands;
};
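As an illustration (using hypothetical concrete types matP, matQ and matR), the operands vector flattens the nested product into left-to-right order:

// P * Q * R produces this nested type...
typedef matrix_product<matrix_product<matP, matQ>, matR> product_type;

// ...whose inherited operands typedef is equivalent to mpl::vector<matP, matQ, matR>
BOOST_MPL_ASSERT((mpl::equal<product_type::operands, mpl::vector<matP, matQ, matR> >));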

Now the root matrix_product<> in the hierarchy is of a type whose inner operands typedef is the sequence of types we need (you'll recall that matrix_product<> publicly inherits operands_list<>). In the above, we used a number of boost::mpl meta-functions to prepend/append (push_front/push_back) types to a vector and also mpl::insert_range to effectively join two existing vectors together.

So we have a list of matrix types and an outline for a tree structure that will contain indices into this list. Now let's see how to construct the actual assoc_tree<> that represents the optimal grouping for the chain.

The way I arrived at my implementation was to sketch out the basic algorithm in pseudo-code and then adapt the iterative parts so that they were implemented using recursion instead. This needed to be done because you can't do iteration particularly well at compile-time. Recursion on the other hand can be achieved with templates, with a particular specialisation representing the termination condition.

Here's that process spelled out a little more explicitly:

// Pseudo code using iteration

// Returns a 2-tuple containing the optimal cost and the corresponding assoc_tree.
// list is a sequence of TYPES, which is typically a sub-list of the original list
// of fixed matrices. offset represents how far into the original list you have to
// look in order to find the list passed to this function
function find_best_assoc(list, offset)
{
    len = length(list)

    if (len == 1)
    {
        // cost of multiplying a chain of 1 matrix
        return (0, offset) // an integer is a leaf node of an assoc_tree
    }

    if (len == 2)
    {
        // cost of multiplication for the types at positions 0 and 1 in the list
        cost = list[0].width * list[0].height * list[1].width
        return (cost, assoc_tree<offset, offset + 1>)
    }

    // We have more than 2 elements, so a number of different groupings must be considered
    best_cost = INFINITY
    best_assoc = NULL

    for (i = 1; i != len; ++i)
    {
        // split the list about position i and find the details for either side of the split
        (left_cost, left_assoc)   = find_best_assoc(sub_range(list, 0, i), offset)
        (right_cost, right_assoc) = find_best_assoc(sub_range(list, i, len), offset + i)

        total_cost = left_cost +
                     right_cost +
                     list[0].height * list[i-1].width * list[len-1].width

        if (total_cost < best_cost)
        {
            best_cost = total_cost
            best_assoc = assoc_tree<left_assoc, right_assoc>
        }
    }    

    return (best_cost, best_assoc)
}

In the initial pseudo-code version above, there is an element of recursion, in that we're calling find_best_assoc() inside itself, but there is still a loop that must be eliminated. The last term in the total_cost calculation is the cost of multiplying together the two sub-products on either side of the chain when it is split about position i.

So let's get rid of this loop now, in favour of recursion.

// Pseudo code using recursion only, no iteration

function find_best_assoc(list, offset)
{
    len = length(list)

    if (len == 1)
    {
        return (0, offset) // an integer is a leaf node of an assoc_tree
    }

    if (len == 2)
    {
        cost = list[0].width * list[0].height * list[1].width
        return (cost, assoc_tree<offset, offset + 1>)
    }

    // We have more than 2 elements, so a number of different groupings must be considered
    return split_details(list, 1, INFINITY, NULL, offset)
}

// Calculates the best cost by splitting the list at positions split and beyond
// best_cost and best_assoc are the best cost and assoc_tree configuration found so far
function split_details(list, split, best_cost, best_assoc, offset)
{
    len = length(list)

    if (split == len)
    {
        return (best_cost, best_assoc)
    }

    (left_cost, left_assoc)   = find_best_assoc(sub_range(list, 0, split), offset)
    (right_cost, right_assoc) = find_best_assoc(sub_range(list, split, len), offset + split)

    total_cost = left_cost +
                 right_cost +
                 list[0].height * list[split-1].width * list[len-1].width

    if (total_cost < best_cost)
    {
        best_cost = total_cost
        best_assoc = assoc_tree<left_assoc, right_assoc>
    }

    return split_details(list, split + 1, best_cost, best_assoc, offset)
} 

Now that we have an entirely recursive algorithm, we can turn this into compile-time C++. The recursion-termination conditions in each function must be translated into template specialisations, but otherwise things are much the same. The returned cost and assoc_tree are exposed through the value constants and assoc_node typedefs in each of the template classes.

// Forward declaration; the general case is defined after its helper, split_details<>
template<typename FixedMatrices, std::size_t Length, std::size_t Offset>
struct find_best_assoc;

template
<
    typename FixedMatrices,
    std::size_t Split,
    std::size_t Length,
    std::size_t BestCost,
    typename BestAssoc,
    std::size_t Offset
>
struct split_details
{
    typedef typename sub_range<FixedMatrices, 0, Split>::type left_list;
    typedef typename sub_range<FixedMatrices, Split, Length>::type right_list;

    typedef find_best_assoc<left_list, Split, Offset> left_best;
    typedef find_best_assoc<right_list, Length - Split, Offset + Split> right_best;

    // Cost of splitting here: the best costs of the two sub-chains, plus
    // m*p*k for multiplying the two sub-products together (k being the
    // shared dimension at the split boundary)
    static const std::size_t total_cost =
        left_best::value + right_best::value +
        matrix_height<typename mpl::at_c<FixedMatrices, 0>::type>::value *
        matrix_width<typename mpl::at_c<FixedMatrices, Length - 1>::type>::value *
        matrix_height<typename mpl::at_c<FixedMatrices, Split>::type>::value;

    static const bool smaller = total_cost < BestCost;
    static const std::size_t best_cost = smaller ? total_cost : BestCost;
    typedef assoc_tree<typename left_best::assoc_node, typename right_best::assoc_node> proposed_assoc;
    typedef typename mpl::if_<mpl::bool_<smaller>, proposed_assoc, BestAssoc>::type best_assoc;

    // Recurse to try the remaining split positions, carrying the best found so far
    typedef split_details<FixedMatrices, Split + 1, Length, best_cost, best_assoc, Offset> cas;
    static const std::size_t value = cas::value;
    typedef typename cas::assoc_node assoc_node;
};

// Termination: Split has reached Length, so there are no more split positions to try
template
<
    typename FixedMatrices,
    std::size_t Length,
    std::size_t BestCost,
    typename BestAssoc,
    std::size_t Offset
>
struct split_details<FixedMatrices, Length, Length, BestCost, BestAssoc, Offset>
{
    static const std::size_t value = BestCost;
    typedef BestAssoc assoc_node;
};

// General case: a chain of three or more matrices; try every possible split position
template<typename FixedMatrices, std::size_t Length, std::size_t Offset>
struct find_best_assoc
{
    // ~std::size_t(0) stands in for the pseudo-code's INFINITY
    static const std::size_t best_cost = ~std::size_t(0);
    typedef split_details<FixedMatrices, 1, Length, best_cost, void, Offset> cas;

    static const std::size_t value = cas::value;
    typedef typename cas::assoc_node assoc_node;
};

// A chain of one matrix costs nothing; its assoc_node is a single leaf index
template<typename FixedMatrices, std::size_t Offset>
struct find_best_assoc<FixedMatrices, 1, Offset>
{
    static const std::size_t value = 0;
    typedef mpl::int_<Offset> assoc_node;
};

// A chain of two matrices has only one possible grouping, with cost m*n*p
template<typename FixedMatrices, std::size_t Offset>
struct find_best_assoc<FixedMatrices, 2, Offset>
{
    typedef typename mpl::at_c<FixedMatrices, 0>::type at0;
    typedef typename mpl::at_c<FixedMatrices, 1>::type at1;
    static const std::size_t value = cost<at0, at1>::value;

    typedef assoc_tree<mpl::int_<Offset>, mpl::int_<Offset + 1> > assoc_node;
};

Phew! I've omitted the definition of the sub_range meta-function, but you can find it in the zip of source code available at the end. So now if we have a matrix_product<Lhs, Rhs>, we can find the optimal grouping like so:

typedef typename matrix_product<Lhs, Rhs>::operands operands;
const std::size_t num_operands = detail::mpl::size<operands>::value;
typedef typename detail::find_best_assoc<operands, num_operands, 0>::assoc_node best_assoc;

best_assoc is now the assoc_tree<> structure we have been seeking. We aren't really interested in the final optimal cost, but we had to do the calculations along the way in order to do comparisons between the different groupings.
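As a quick illustration (hypothetical code; the matrix<> parameter order is taken from snippets later in this post), we can verify at compile time that a 10×100 by 100×50 by 50×1 chain, as in the earlier worked example, groups to the right:

typedef matrix<double, 10, 100> P;
typedef matrix<double, 100, 50> Q;
typedef matrix<double, 50, 1>   R;

typedef mpl::vector<P, Q, R> chain;
typedef detail::find_best_assoc<chain, 3, 0>::assoc_node best;

// P × (Q × R) is far cheaper for these dimensions: 6,000 vs 50,500 scalar multiplications
BOOST_STATIC_ASSERT((boost::is_same<
    best,
    assoc_tree<mpl::int_<0>, assoc_tree<mpl::int_<1>, mpl::int_<2> > >
>::value));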

There are still two pieces missing at this point. The first is that we still don't have any code to actually perform the regrouping, even though we have a structure describing the preferred multiplication order. The second is that we need to trigger this regrouping somehow. For example, consider this code:

c = a * b;

Here, we will end up assigning a matrix_product<> to the matrix c. Without special logic inside the assignment operator involved here, the elements of c will be filled using the operator() of said matrix_product<>, which doesn't perform any regrouping.

Let's tackle this problem of triggering the re-group first, before moving on to look at the actual mechanics of the re-grouping code itself.

Making the re-group happen

I currently can't see any way around the fact that we'll need some kind of check inside functions such as operator= to see if we're dealing with a matrix_product<>. However, we can use this to our advantage by allowing clients of our library to use this hook for their own purposes. Let's see what I'm babbling on about.

The library comes with a concrete template class called matrix, which is of course a model of the FixedMatrix concept. One of its constructors looks like this:

template<typename FixedMatrix>
explicit matrix(const FixedMatrix &m)
{
    BOOST_STATIC_ASSERT(matrix_height<FixedMatrix>::value == height);
    BOOST_STATIC_ASSERT(matrix_width<FixedMatrix>::value == width);

    typedef optimize_matrix_access<FixedMatrix, width * height> optaccess;
    typename optaccess::type m2(optaccess::optimize(m));

    for (size_type i = 0; i != height; ++i)
        for (size_type j = 0; j != width; ++j)
            (*this)(i, j) = matrix_element(m2, i, j);
}

The first two lines check that the FixedMatrix we're given has dimensions matching those of the matrix we're constructing. The next two lines are where we have a little bit of magic happening.

optimize_matrix_access<> is a template class that clients can specialise. They can change the behaviour of the static optimize() function to return an object whose type is still a model of FixedMatrix, but whose elements can be accessed with optimal efficiency.

The default implementation just looks like this:

template<typename FixedMatrix, unsigned AccessCount>
struct optimize_matrix_access
{
    typedef const FixedMatrix &type;
    static type optimize(const FixedMatrix &m) { return m; }
};

So by default, optimize_matrix_access<>::optimize() doesn't do anything; it simply passes through its argument. However, we can specialise this template class for matrix_product<> in order to kick off the regrouping of the matrix multiplication chain. And it's not too hard to imagine cases where you might have other models of FixedMatrix that could take advantage of this hook.

So here's what the specialisation for matrix_product<> looks like:

template<typename Lhs, typename Rhs, unsigned AccessCount>
struct optimize_matrix_access<detail::matrix_product<Lhs, Rhs>, AccessCount>
{
    typedef detail::matrix_product<Lhs, Rhs> product;
    typedef matrix<typename product::element_type, product::height, product::width> type;
    typedef typename product::operands operands;

    static type optimize(const product &m)
    {
        const std::size_t num_operands = detail::mpl::size<operands>::value;
        typedef typename detail::find_best_assoc<operands, num_operands, 0>::assoc_node best_assoc;
        return detail::optimize_matrix_product<best_assoc, product>::multiply_out(m);
    }
};

In this specialisation we define the type typedef to be a bona fide matrix<>, which is the type returned by the optimize() function. You'll see that in the last line of the optimize() member function, we call a function to multiply out the matrix chain using our assoc_tree<>.

Before we move on to looking at that, I'll just point out one last thing. The current implementation of optimize_matrix_access<> has an AccessCount integral template parameter. It is expected that the value of this will be the number of element accesses made by the calling code. I'm not entirely sure how useful this will be yet, but you can imagine that if only a single element needs to be read, there would sometimes be no real point in multiplying out the chain. So by allowing specialisations based on the (expected) number of element accesses, we can account for this kind of thing, as the sketch below shows.
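Here's a sketch of that idea (my own illustration; the real library may handle this differently): when the caller declares it will read just one element, keep the lazy product and let it compute that single element on demand:

template<typename Lhs, typename Rhs>
struct optimize_matrix_access<detail::matrix_product<Lhs, Rhs>, 1>
{
    // For a single access, computing one element lazily is cheaper than
    // multiplying out the entire chain
    typedef const detail::matrix_product<Lhs, Rhs> &type;
    static type optimize(const detail::matrix_product<Lhs, Rhs> &m) { return m; }
};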

Changing the multiplication order

OK, so we've done all the prep. Pencils are sharpened, shoelaces are tied, eyebrows are waxed. We now need the final ingredient: the elusive optimize_matrix_product<> class that we just saw, and its multiply_out() function in particular.

template<typename AssocTree, typename FixedMatrixProduct>
struct optimize_matrix_product
{
    typedef multiplication_traits<typename FixedMatrixProduct::operands, AssocTree> traits;
    typedef matrix<typename traits::element_type, traits::height, traits::width> result_type;

    static result_type multiply_out(const FixedMatrixProduct &original)
    {
        // Recursively multiply out each branch of the assoc_tree, then combine
        // the two results with an ordinary matrix multiplication
        typedef typename optimize_matrix_product<typename AssocTree::lhs, FixedMatrixProduct>::result_type lhs;
        typedef typename optimize_matrix_product<typename AssocTree::rhs, FixedMatrixProduct>::result_type rhs;
        return generic_matrix_product<lhs, rhs>::multiply
        (
            optimize_matrix_product<typename AssocTree::lhs, FixedMatrixProduct>::multiply_out(original),
            optimize_matrix_product<typename AssocTree::rhs, FixedMatrixProduct>::multiply_out(original)
        );
    }
};

template<int Index, typename FixedMatrixProduct>
struct optimize_matrix_product<mpl::int_<Index>, FixedMatrixProduct>
{
    typedef multiplication_traits<typename FixedMatrixProduct::operands, mpl::int_<Index> > traits;
    typedef matrix<typename traits::element_type, traits::height, traits::width> result_type;

    static result_type multiply_out(const FixedMatrixProduct &original)
    {
        // Leaf node: fetch the Index-th matrix of the original product chain
        return result_type(matrix_at<Index>(original));
    }
};

If you ignore all the typedefs and so on, the code is actually pretty simple. We're calling generic_matrix_product<>::multiply() (which is by default your standard triple-nested loop to perform matrix multiplication) with arguments whose values are generated by recursive calls to optimize_matrix_product<>::multiply_out(). These recursive calls descend down the branches of our assoc_tree<> and matrix_product<> structures, fetching the relevant matrices. Note the specialisation of optimize_matrix_product<> to handle a leaf node of an assoc_tree.

There are still some details omitted, such as the implementation of multiplication_traits<>, which appears in the optimize_matrix_product<> classes above. This is simply a class that contains a bunch of typedefs to describe the attributes of the multiplication about to be performed.
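The real definition isn't reproduced in this post, but a rough sketch might look like this (simplified in one respect: I assume a single element_type across the chain, whereas the library presumably widens element types just as matrix_product<> does):

// Describes the result of multiplying out the sub-chain denoted by an
// assoc_tree<> node over the operands vector
template<typename Operands, typename AssocNode>
struct multiplication_traits;

// Leaf: the traits are simply those of the Index-th matrix in the chain
template<typename Operands, int Index>
struct multiplication_traits<Operands, mpl::int_<Index> >
{
    typedef typename mpl::at_c<Operands, Index>::type matrix_type;
    typedef typename matrix_type::element_type element_type;
    static const std::size_t height = matrix_height<matrix_type>::value;
    static const std::size_t width  = matrix_width<matrix_type>::value;
};

// Interior node: the height comes from the left sub-product, the width from the right
template<typename Operands, typename Lhs, typename Rhs>
struct multiplication_traits<Operands, assoc_tree<Lhs, Rhs> >
{
    typedef multiplication_traits<Operands, Lhs> left;
    typedef multiplication_traits<Operands, Rhs> right;

    typedef typename left::element_type element_type;
    static const std::size_t height = left::height;
    static const std::size_t width  = right::width;
};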

The other thing that I haven't described in detail is the matrix_at<>() function, as seen in the specialisation of optimize_matrix_product<> for leaf nodes of assoc_tree<>s. This is actually significantly more complicated, but essentially uses similar tricks to those I've already outlined. I won't go into great detail here, but the basic idea is as follows.

matrix_product<> contains a second mpl::vector<>, called directions. The ith element of directions is itself an mpl::vector<> describing the sequence of left/right branches that need to be taken to find the ith matrix of the chain within the matrix_product<> hierarchy.

So when matrix_at<Index>(product) gets called, the directions typedef is accessed from product's type. The mpl::vector<> located at position Index in directions contains the sequence of left/right turns that need to be taken to descend down the matrix_product<> to access the Indexth matrix in the chain.

For example, if the directions to get to the matrix at index idx were left, left, right, then matrix_at<idx> would effectively end up calling product.lhs().lhs().rhs(), as sketched below.
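To make that concrete, here's a rough sketch of such a descent (entirely my own reconstruction: the names left_tag, right_tag and descend are assumptions, and the real code in the zip no doubt differs):

struct left_tag { };
struct right_tag { };

template<typename Node, typename Directions,
         bool Done = mpl::empty<Directions>::value>
struct descend;

// No turns left to take: we've arrived at the matrix we were after
template<typename Node, typename Directions>
struct descend<Node, Directions, true>
{
    typedef Node type;
    static const Node &get(const Node &n) { return n; }
};

// Take the next turn, then recurse with the remaining directions
template<typename Node, typename Directions>
struct descend<Node, Directions, false>
{
    typedef typename mpl::front<Directions>::type turn;
    typedef typename mpl::pop_front<Directions>::type rest;

    typedef typename mpl::if_<
        boost::is_same<turn, left_tag>,
        typename Node::left_matrix_type,
        typename Node::right_matrix_type
    >::type next;

    typedef typename descend<next, rest>::type type;

    static const type &get(const Node &n)
    {
        return descend<next, rest>::get(step(n, turn()));
    }

    private:
        static const typename Node::left_matrix_type &step(const Node &n, left_tag) { return n.lhs(); }
        static const typename Node::right_matrix_type &step(const Node &n, right_tag) { return n.rhs(); }
};

matrix_at<Index>(product) would then pick the Index-th entry of directions from product's type and hand it to something like descend<> to do the walking.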

The details get a bit hairy, so I'll leave you to rummage through the code if you want to know more. I hope I'll be able to reduce its complexity a little over time.

So there you have it. You'll note that all the runtime code we've defined is made up of short one-liners, easily optimised by most compilers (assuming they can handle all the meta-programming to begin with!).

But has all this shuffling actually improved the speed of matrix multiplication?

Benchmarks

The answer to that question isn't as straightforward as I had hoped. In some cases we do see considerable improvement in execution speed. However, the construction of the matrix_product<> and the calls to its lhs() and rhs() member functions cannot be optimised out completely[3]. This means that multiplying three 4×4 matrices in a chain, for example, will be slower with the proposed method, because each grouping has the same cost and so we're doing a little bit of extra work for no benefit.

So even though we managed to deduce the optimal grouping entirely at compile-time, actually shuffling all the references around incurs a small runtime penalty that cannot be avoided (unless there are some additional tricks that elude me).

So let's have a look at some pretty graphs. The intention is not to compare results between different compilers, but rather to compare the speed of the two multiplication techniques on each compiler in turn. Indeed, the results for the Apple GCC compiler were obtained on an entirely different machine, so comparisons of those results against any other compiler would be meaningless.

[Graph: benchmarks for matrix chain multiplication (length=4)]

[Graph: benchmarks for matrix chain multiplication (length=2)]

The first graph illustrates the kind of speed-up we can get. The second graph illustrates the worst-case scenario for this code, i.e. where it actually adds a little bit of overhead over naive chain multiplication.

Here are the compiler flags I used.

I'm sure there are some additional flags that I could have added to improve things a little (such as better automatic vectorisation on the GCC 4-series compilers), but if the matrix_product<> overhead cannot be optimised out, we'd still see the same kind of results. You're welcome to prove me wrong; in fact, I'd love it if you did! The code I used to generate these figures is available in the zip file download at the end of this post.

Conclusion

We have a partial success here, but unfortunately probably not enough of a success to use this code in the wild.

I say this because the most common case, for me at least, will be multiplying a number of 4×4 matrices together, or a 4×4 by a 4×1, which is where this alternative method doesn't help. Occasionally I'll have a 4×4 by 4×4 by 4×1 chain, but by adding parentheses explicitly we can still beat this method. So for now, I'm just going to go with the usual matrix multiplication algorithm. It's a shame in a way, but I enjoyed the experiment nonetheless!

I will keep the code around and perhaps work on it some more in the future. It might be the case that compilers will one day be able to make up the difference, or I might come across a way that allows current compilers to optimise this stuff better.

Downloads

Footnotes
  1. If you know better, please let me know.
  2. boost::mpl::int_<> is similar to Loki's Int2Type<>, for those more familiar with that library.
  3. Due to aliasing, I suspect.
