mr-edd.co.uk :: horsing around with the C++ programming language

Multicore scalability testing with Intel Parallel Universe

[30th January 2010]

I'm subscribed to Intel's Visual Adrenaline magazine by email. Some of the content is so-so and has a strong marketing tint, but occasionally something really interesting or useful appears.

One such thing that I noticed the other day is a relatively new free service called Intel Parallel Universe. It allows you to upload applications to be run on systems with 1, 2, 4 and 8 cores. It also claims to run the code on a box with 16 cores, but that's really a hyper-threaded 8 core machine. Not to be sniffed at, though!

There are a couple of restrictions. Most notable is that your application has to be compiled for Win32 and must run to completion, unaided, in under a minute. But otherwise, pretty much anything is fair game.

Once your application has been run, you are presented with some basic statistics relating to speed and scalability. I don't know about you, but I don't have any 8 and 16 core machines lying around. So I was quite keen to put the service through its paces.

Some test code

To try out Parallel Universe, I needed to create a suitable program. I decided to go with a Mandelbrot set renderer.

The Mandelbrot set, for those of you that don't know, is essentially a bunch of complex numbers (2D positions) that obey a certain property. Here's a picture created by my program:

The Mandelbrot set (click to enlarge)

The key thing to note about the algorithm used to generate the Mandelbrot set is that it is embarrassingly parallel; the colour of each pixel can be computed in isolation from the colour of every other pixel. This property makes it a great test case.

So all the program does is iterate over the pixels in an image, find the corresponding complex numbers (2D positions) and calculate the appropriate colours:

const complex_t domain[2] = { complex_t(-2, -1.2), complex_t(0.75, 1.2) };
image img(1024, 768);

for (size_t y = 0; y != img.height; ++y)
{
    const double pos_y = lerp(double(y) / img.height, domain[0].imag(), domain[1].imag());
    for (size_t x = 0; x != img.width; ++x)
    {
        const double pos_x = lerp(double(x) / img.width, domain[0].real(), domain[1].real());
        img(x, y) = mandelbrot(complex_t(pos_x, pos_y)); // colour the pixel
    }
}
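For completeness, here's a sketch of what the lerp and mandelbrot helpers might look like (the real implementations are in the download below; the iteration cap and return type here are arbitrary choices of mine):

```cpp
#include <complex>

typedef std::complex<double> complex_t;

// Linear interpolation: t in [0, 1] maps onto [a, b].
double lerp(double t, double a, double b)
{
    return a + t * (b - a);
}

// Escape-time iteration: count how many steps z = z*z + c takes to
// leave the circle |z| <= 2 (i.e. |z|^2 <= 4), up to a fixed cap.
// The count is then mapped to a pixel colour.
unsigned mandelbrot(const complex_t &c, unsigned max_iterations = 256)
{
    complex_t z(0.0, 0.0);
    unsigned i = 0;
    while (i != max_iterations && std::norm(z) <= 4.0)
    {
        z = z * z + c;
        ++i;
    }
    return i;
}
```

Points inside the set never escape, so they cost the full max_iterations steps; points well outside escape after a handful. That cost difference is what makes the load-balancing story below interesting.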

Using OpenMP, we can easily parallelize the outer loop:

#ifdef _OPENMP
#include <omp.h>
#endif

// ...

#ifdef _OPENMP
omp_set_dynamic(1); // let OpenMP choose an appropriate number of threads
#endif

const int height = img.height; // OpenMP needs signed integers for loop counters

#pragma omp parallel for shared(img, domain) 
for (int y = 0; y < height; ++y)
{
    const double pos_y = lerp(double(y) / img.height, domain[0].imag(), domain[1].imag());
    for (size_t x = 0; x != img.width; ++x)
    {
        const double pos_x = lerp(double(x) / img.width, domain[0].real(), domain[1].real());
        img(x, y) = mandelbrot(complex_t(pos_x, pos_y)); // colour the pixel
    }
}

Not all compilers support OpenMP, but the most recent versions of Visual C++ and g++ do, so you should be able to give it a whirl. The nice thing about it is that a compiler without OpenMP support will still be able to compile the code, though the binary won't benefit from OpenMP's semi-automatic parallelization.

Anyway, that #pragma omp parallel for line is the important part. It tells the compiler that the following loop can be parallelized by dishing out the indices in the range to a number of threads to work on.

The complete code is available in the downloads section further down the page.

Once I finished it, I uploaded it to Parallel Universe and had it produce a nice, big 12800 by 9600 pixel image. After 30 minutes or so, I got the following report:

Report for the first attempt (click to enlarge)

The most interesting graph here, I think, is the one in the lower left corner. It shows the actual performance of the application vs the ideal performance, across the different systems upon which it was executed.

We can see that running the code on a dual core system did indeed result in a 2x performance improvement. But after that, things start to go wrong. The kink at 4 cores is especially weird. I ran the code again to make sure it wasn't a glitch, but the repeats came back the same.

After a bit of pondering, I realized what was happening. There's a clue in the Mandelbrot set itself: it is vertically symmetric. It looked as if OpenMP was splitting the index range of the y-loop into the minimal number of chunks appropriate for the hardware concurrency of the machine. So when the code was running on a dual core machine, two threads were created, each working on half of the image. Due to the symmetry of the fractal, the amount of work associated with each half is the same.

But when the image is split into 4 strips, the work distribution changes. Here's an image showing the relative amounts of work assigned to each thread[1]:

work break-down

As we can see, the 2nd and 3rd threads are given almost 10 times as much work as the 1st and 4th. So threads 1 and 4 finish their work quite promptly, but threads 2 and 3 will still be churning away for quite some time yet.

There's a total of 7841495397 units of work in generating this image. If that work were spread evenly, we would expect linear scaling. Instead, because the threads on cores 2 and 3 are a bottleneck, we can predict that on 4 cores the code should only run about 2.2 times faster: 7841495397 / 3525043721 = 2.2245 (4dp).

This matches the value for 4 cores in the lower-left hand graph in the Parallel Universe report, so the analysis seems to hold.

Let's fix it

So Amdahl's Law gave me a damn good pasting. In order to mount a come-back, I'll conceptually cut the image into lots of horizontal strips. Within those strips I'll have OpenMP parallelize a loop over the rows of pixels. This reduces the granularity of the tasks created by the Visual C++ OpenMP implementation, which should help to spread the load more evenly.

The pixel-filling code now becomes:

const size_t grains = 96;

for (size_t k = 0; k != grains; ++k)
{
    const int y0 = k * img.height / grains;
    const int y1 = (k+1) * img.height / grains;

#pragma omp parallel for shared(img, domain) 
    for (int y = y0; y < y1; ++y)
    {
        const double pos_y = lerp(double(y) / img.height, domain[0].imag(), domain[1].imag());
        for (size_t x = 0; x != img.width; ++x)
        {
            const double pos_x = lerp(double(x) / img.width, domain[0].real(), domain[1].real());
            img(x, y) = mandelbrot(complex_t(pos_x, pos_y));
        }
    }
}

BEHOLD:

Report for the second attempt (click to enlarge)

The graph in the lower left corner shows that we're now scaling pretty nicely up to 8 cores. Things start to go a little sour at 16 cores, but that might be due to hyper-threading. Then again, we might be able to tease out some more performance by continuing to tweak the task granularity.

But at this point Parallel Universe had proved its worth in my mind, so I didn't take this particular test any further.

Conclusions

If I had run this code on my dual core desktop and extrapolated the apparent scalability to infer performance on machines with higher core-counts, I would have been off by a factor of 2! Or maybe more if I had experimented further with task granularity…

Once again, we see that there's a lot to be gained by profiling your code. That's even true in simple cases such as this, in which intuition might suggest that linear scaling is a given. And for those of us without 8 or 16 core machines, I can see Parallel Universe being really useful for simple multi-threaded profiling.

One thing that disappointed me a little is that OpenMP wasn't able to do a better job out of the box. I guess there's a balance to be found between spreading work evenly and producing so many tasks that task creation or cache contention becomes a serial bottleneck.

I should mention that the data Parallel Universe collects can be downloaded and imported into Intel Parallel Amplifier, should you want to do more in-depth analysis. I don't have a copy, but I might try to persuade my boss to get one for our team at some point.

The code

The zip file contains the two versions of the code described above. At the top of each file is a command line by which you can compile an OpenMP-enabled executable with Visual C++. I got MinGW g++ 4.4 working too, but I couldn't find the compiler flags to match the performance of Visual C++ in this case.

Anyway, once compiled, you can run the code like so:

> mandelbrot.exe 1024 768 fractal.tga

That will save a 1024 by 768 pixel image to fractal.tga.

If you omit the file name, the image won't be saved at all. This is useful when running the code through Parallel Universe, as it cuts out the serial I/O code at the end, making it easier to gauge the scalability of the core algorithm.

Downloads

Footnotes
  1. which can be calculated rather conveniently by simply summing pixel intensities
