:: horsing around with the C++ programming language

Variations on a theme (the STL)

[7th April 2008]

The STL portion of the C++ standard library provides a set of tremendously useful tools. In most cases it will have a generic algorithm that work's as fast as your hand-rolled code. But there are a number of STL implementations and variations on the existing algorithms that can dish out something beyond what's on offer in the stuff shipped with your compiler.

I'll be taking a look at these implementations in this post. You might be surprised at the number of ways in which the STL has been adapted for different needs.


STXXL is a library that provides drop-in replacements for a number of STL containers. What makes it different from your common or garden STL is that it's designed to work on huge amounts of data — amounts that won't fit in RAM and must instead be kept on disk.

It also provides some basic parallel algorithms and a number of pipelining and prefetching optimizations for working with huge data sets.

A fair number of platforms are supported including Linux g++, Mac g++, Sun OS g++ and recent versions of MSVC.


The macstl project provides a replacement for std::valarray<> that has been optimised to make use of SIMD vector operations such as those found on chips supporting Altivec, SSE, MMX and so on.

Though the project started life on the Mac, it reportedly now compiles on a variety of other systems including Windows and Linux.

How much faster is it? According to the website, in the region of 10x-20x faster! That's not to be sniffed at if you're doing a lot of heavy numerical work.


The Multi-Core Standard Template Library is a parallel implementation of the standard C++ library. It makes use of multiple processors and/or multiple cores of a processor with shared memory. The functions provided are drop in replacements for the STL equivalents.

GCC is adding many of the MCSTL algorithms to its libstdc++ under a special parallel mode.

So far it looks like it's only ready for use on Linux and Solaris. Mac and Windows users will have to wait a little longer…

Calum Grant's Atomic Library

This one's a doozy! I haven't seen any other attempt at doing something like this with the STL.

C++ provides the facilities to implement RAII for managing invariants in the face of exceptions. Even though this is a powerful tool, there are times when it's still not enough.

What if I'm making changes to several containers, changes which should be seen as a single logical update by client code, only to have an exception thrown towards the end? I have to go back and undo all my hard work. Sometimes I might not even be able to roll-back my changes if deletions are involved and if I can, chances are the extra code will be tedious to write.

Enter The Atomic Library, which brings transactions to STL containers. In an article at Dr Dobbs' Journal, the author describes the implementation and how it can help to write exception safe code, even when sweeping changes are involved to multiple containers.

The basic idea is that you queue up your insertions, deletions and so on and then commit them all in one go, forming a transaction. If an exception is thrown, the transaction fails and the containers are left in the state they were before the transaction began. Otherwise, you can be sure that all the updates were successful.

The transactions are simple to set up. You just have to use the replacement containers and sprinkle in two lines of code for the start and end of the transaction:

atomic::vector<std::string> list1, list2;

void f(std::string const & x)
    atomic::transaction tr;

While there will predictably be some overhead when compared with the original STL containers, it's smaller than you might think, coming in at a mere 10% speed penalty.

Move semantics

In the next revision of the C++ standard, rvalue references will be added. This is needed to implement move semantics, which have the potential to offer massive efficiency gains in returning strings, containers and other large or expensive-to-copy objects.

To a degree, it's already possible to imbue containers and other objects with movable behavior. One example of this is Andrei Alexandrescu's yasli::vector<>:

Why all the fuss about moving data versus just copying it, you may ask? It can't be too bad, could it? Well, how about an improvement of about 23. And that's not percentage, it's times; that is, insertion using memmove, as implemented and measured by Howard Hinnant in CodeWarrior Pro 9 against a vector<string> (100 elements, 20 characters each) is 23 times faster than the politically correct version that copies strings around.

Andrei, in his DDJ article

Also, some of the more recent versions of the STL provided with Microsoft's Visual C++ compilers include similar abilities. According to Stefan T. Lavavej[1], copying containers of containers and TR1 types will avoid making additional copies where possible (see point #10).

With the additions to the C++ standard, we won't need tricks like those seen here, which is kind of a shame because they're just so damn ingenious! But compilers probably won't implement the new standard for a while after its publication. In the mean time we still have some “movable” containers to play with.

Alternative algorithm implementations

As I mentioned earlier, most of the time you won't be able to do better than the STL shipped with your compiler in terms of efficiency. Sometimes however, there's a tiny bit of domain knowledge that can give you an edge.

Another idea that comes from Anderi Alexandrescu is one that can potentially halve the time it takes to run std::find(), which in my code at least, is one of the most commonly employed algorithms.

A typical std::find() implementation might look something like this:

template<typename InputIterator, typename T>
InputIterator find(InputIterator begin, InputIterator end, const T &what)
    while (b != e && *b != what) ++b;
    return b;

How can we possibly make this faster?! First of all we can create specialisations that map on to more efficient functions where possible, such as memchr(). Many implementations will do just that, but I'm interested in speeding up the general case shown above. It can be done, but we need to work with containers rather than iterators.

The key observation is that the termination condition for the loop contains two checks, one to see if the input range is exhausted and another to see if the current element is equal to the one we're seeking. However, by inserting an element at the end of the range that is equal to what we can omit the range check as we're guaranteed to find the element eventually! Here's a version which takes a container:

template<typename T>
typename vector<T>::iterator find(vector<T> &ctr, const T &what)
    typedef typename vector<T>::iterator iter_t;
    typedef iterator_traits<iter_t>::difference_type diff_t;

    iter_t b = ctr.begin(), e = ctr.end(), p = e;
    while (*p != what) ++p;

    diff_t offset = p - b;

    return b + offset;

The code is more complicated and might now throw an exception when push_back() is called, but now we're only doing a single check inside the loop.

Even though the code above operates on a vector, the same thing would work on a deque and a string. You could also construct a version that works on a list, even though the details would be slightly different due to not having random access iterators.

EDIT: Andrei's version actually swap()s the last element of the range with what, meaning that a push_back() isn't required. His site was down at the time of writing so I couldn't find the article to check the code. Since swap() is almost always a no-throw operation, this is a much better approach in the general case.

Now, I've said this version of find() will potentially double the speed of your code. This is obviously only true if your comparison operator is as cheap as an iterator pre-increment. A lot of the time this won't be the case, but again, if you're writing CPU intensive numeric code using primitive types, this particular version of find() might be worth a go.

Keep your eyes peeled

Finally, I should mention that it's not impossible that your standard library implementation has missed a trick or two. Here's a version of the search_n() algorithm that is optimized for random-access iterators. The details are fiddly, but the speed up is significant.

Any more?

If you've ever come across an interesting take on the STL that I haven't mentioned here, please let me know. I'm sure you'll agree that some of the variations presented here are interesting and you may have already spotted a use for them in your own code!

  1. whose initials are very appropriate for his job and this post []


kaido kert

[07/04/2008 at 21:29:00]

There are couple embedded STL implemenations with focus on code footprint reduction ( stock #include is often enormous on ARM7 thumb code ) and more tailored memory allocation.
Some are specifically feature-limited that just contain most often used parts like trimmed-down vector.

(required, hint)

Links can be added like [this one ->], to my homepage.
Phrases and blocks of code can be enclosed in {{{triple braces}}}.
Any HTML markup will be escaped.