# Performance in Mandelbrot Set Computation¶

Date: | 2017-04-09 |
---|---|

Abstract: | Programs that compute the Mandelbrot set written in Python and C/C++ are
compared for their speed in terms of time to solution. The C/C++ version
is much faster than the Python version. This article is meant as a
motivation for learning a lower-level language in addition to a high-level
language. |

## General Description¶

I want to have a nice poster of a Mandelbrot set, say 75×50 cm². At 300 DPI printing resolution, I need to compute a lot of pixels:

The computation shall not take forever. When it takes 1000 nano seconds to compute one pixel, then it will take almost one minute for the whole image.

Computing a pixel in a Mandelbrot set is not hard. The set is defined on the complex plane \(\mathbb C\) with an iterative procedure. Take the point \(z_0 \in \mathbb C\). Then iterate:

Sometimes one also calls \(z_0 = c\) and write that equation as

The point \(c\) belongs to the Mandelbrot set if that does not diverge for \(n \to \infty\). The number \(c = 1\) does not belong to the set, it diverges. However, \(c = -1\) is contained in the set. The iteration will be \(-1\), \(0\), \(-1\), and repeat itself. Once a number has \(|z_n| > 2\), we know that it will diverge.

In order to create a pixel image, one maps the pixels to complex numbers. Then one iterates until \(|z_n| > R\) for some \(R \geq 2\). I have chosen \(R = 5\) for my examples. Then I note the iteration number where \(|z_n|\) exceeds the radius. That is the value that I note for the pixel. After 100 iterations, I stop. If it has not diverged, I put in the number 100.

An image is created by taking the logarithm of the iteration number and spreading the values from black to white:

## First Python Implementation¶

Implemententing this in Python is not that hard. I need two loops over the
height and width of the image. The numbers `lx`

and `ly`

give the number of
pixels in the directions. Then I have a minimum real and imaginary part and a
step size `d`

.

```
for i in range(lx):
for j in range(ly):
x = re_min + i * d
y = im_min + j * d
c = x + 1.0j * y
z = 0.0 + 0.0j
for it in range(100):
z = z**2 + c
a = np.abs(z)
if (a > radius):
iters[i, j] = it
break
```

There is some helper code around, but that is the main code. I stop iterating when it has diverged.

This simple Python implementation takes 59000 nano seconds per pixel on my laptop. This means that the whole poster will take around an hour to compute! That is doable, but can it be done faster?

## NumPy Implementation¶

Python itself is not that fast. It is not really the language, more the implementation of the language. Either way, it just takes too long to compute the data for the poster.

The NumPy array implementation for Python is written in C and performs the internal loops in compiled C functions. So although the programmer does not have to deal with C, it still uses C below.

Writing the algorithm in Python with NumPy could look like this:

```
for i in range(100):
z = z**2 + c
a = np.abs(z)
selection = (a >= radius) & (iters == 0)
sel = np.where(selection)
iters[sel] = i
if selection.all():
break
```

The first two lines in the `for`

loop are to be read as arrays. We take all
the values for \(z\) and square them. Then we add the whole array of \(c\) onto
that. The loops are now compiled C code. The following lines then look at the
elements where the absolute value exceeds the radius and where there are no
iterations written down. Then it writes down the current iteration number into
those fields.

The problem with this approch is that a lot of useless work is done. Even when some pixel already has its value, the complex multiplications are still done. This is a waste.

Still, this implementation only needs 2800 nano seconds per pixel. That is a speedup of 21, which is great! I just do not see any way to make it faster within Python, though.

## First C/C++ Implementation¶

Using a native compiled language like C or C++ can make this much faster. For
this we take an array `escape_iter`

which is linearized. Then we loop over
height and width and perform the same computation as before. The code looks as
follows:

```
for (size_t y = 0; y < ly; ++y) {
for (size_t x = 0; x < lx; ++x) {
double z_re = 0.0;
double z_im = 0.0;
for (size_t iter = 0; iter < max_iter; ++iter) {
auto const new_z_re = z_re * z_re - z_im * z_im + c_re(x);
auto const new_z_im = 2 * z_re * z_im + c_im(y);
z_re = new_z_re;
z_im = new_z_im;
auto const abs = z_re * z_re + z_im * z_im;
if (unlikely(abs >= radius)) {
escape_iter[idx(x, y)] = iter;
break;
}
}
}
}
```

Compiled with GCC using `-O3`

, this will get us to 92 nano seconds per pixel!
In comparison with first Python version, that is now a factor 641 faster.

## Adding Threads¶

From here it is relatively easy to distribute the pixels onto the various cores
of my CPU. Just adding a single OpenMP directive does the trick. The following
line gets added before the `for`

loops:

```
#pragma omp parallel for collapse(2)
```

On my laptop, it now takes only 45 nano seconds per pixel. The speedup is already at 1311. This is a day-and-night difference to the first Python implementation!

## SIMD Optimization¶

One can do slightly better than that. The current data layout is not really suited for using SIMD, the vector units in the CPU. My laptop CPU (Intel Sandy Bridge) supports the AVX instruction set and can do operations with four floating point numbers (double precision) at the same time. Since I want to do the same computation for each pixel, one should be able to make use of that. It could become up to four times as fast as before.

One caveat is that now all the code can be put into vector instructions. When doing so, one is doing more work than needed. A group of four pixels has to be iterated until the last one has diverged. In the beautiful parts of the sets, that could happen often. However, the other computations are virtually for free, so there is no harm in it.

Programming this is more cumbersome, though. One needs to introducs arrays of size vector length that are aligned properly such that they can be loaded in one go. Then I need to do extra work and keep track of the finished pixels. All this adds some overhead to the code. The implementation is here:

```
#pragma omp parallel for collapse(2)
for (size_t y = 0; y < ly; ++y) {
for (size_t bx = 0; bx < blocks_x; ++bx) {
alignas(veclen * sizeof(bool)) bool finished[veclen] = {false};
alignas(veclen * sizeof(bool)) bool written[veclen] = {false};
alignas(veclen * sizeof(double)) double z_re[veclen] = {0.0};
alignas(veclen * sizeof(double)) double z_im[veclen] = {0.0};
alignas(veclen * sizeof(double)) double new_z_re[veclen];
alignas(veclen * sizeof(double)) double new_z_im[veclen];
alignas(veclen * sizeof(double)) double c_re[veclen] = {0.0};
for (size_t v = 0; v < veclen; ++v) {
c_re[v] = get_c_re(bx, v);
}
double c_im = get_c_im(y);
for (size_t iter = 0; iter < max_iter; ++iter) {
#pragma omp simd aligned(z_re, z_im, new_z_re, new_z_im, finished)
for (size_t v = 0; v < veclen; ++v) {
new_z_re[v] =
z_re[v] * z_re[v] - z_im[v] * z_im[v] + c_re[v];
new_z_im[v] = 2 * z_re[v] * z_im[v] + c_im;
z_re[v] = new_z_re[v];
z_im[v] = new_z_im[v];
auto const abs = z_re[v] * z_re[v] + z_im[v] * z_im[v];
if (!finished[v] && unlikely(abs > radius)) {
finished[v] = true;
escape_iter[idx(bx, y) * veclen + v] = iter;
}
}
bool all_finished = true;
for (size_t v = 0; v < veclen; ++v) {
all_finished &= finished[v];
}
if (all_finished) {
break;
}
}
}
}
```

All in all it seems to be worth it. One pixel is down to 33 nano seconds. The overal speedup is 1787, which means that my post only takes in the order of a minute to compute.

## Conclusion¶

The Python version was the easiest to implement. It did not run very fast, though. Using a language that allows to get closer to the machine (like C or C++) exposes more opportunities to make it faster. If performance is top priority, then C or C++ give the tools to make it fast.

One should not neglect algorithmic efficiency, though. Choosing a more efficient algorithm usually gives greater improvements than tuning the performance of an implementation.

A summary of the various implementations:

Implementation | Seconds | Nano seconds | Speedup |
---|---|---|---|

Python simple | 5.9e-05 | 59000 | 1 |

Python NumPy | 2.8e-06 | 2800 | 21 |

C++ | 9.2e-08 | 92 | 641 |

OpenMP Threads | 4.5e-08 | 45 | 1311 |

SIMD Optimized | 3.3e-08 | 33 | 1787 |

All the programs can be found in the repository:

git clone git://github.com/martin-ueding/mandelbrot-performance.git