Efficiency of the `pow` function

Someone said that using pow(x, 2) is always more inefficient than using x * x. Well, there are two things to remember:

  1. Do not "optimize" without measurement.
  2. Measure with full compiler optimization.

So this is exactly what I did then. This is a simple C++11 program that uses pow(x, 2) and x * x. I highlighted the lines in questions. The calculations with test are in place so that the compiler does not optimize away the code, which it would do otherwise.

#include <chrono>
#include <cmath>
#include <iostream>

int main() {
    double test {0};
    unsigned iter_max {100000000};

    auto start_time = std::chrono::steady_clock::now();
    for (unsigned iter {0}; iter < iter_max; iter++) {
        test += std::pow(iter, 2);
    }
    double time_in_seconds = std::chrono::duration_cast<std::chrono::milliseconds>
        (std::chrono::steady_clock::now() - start_time).count() / 1000.0;
    std::cout << "Pow: " << time_in_seconds << std::endl;

    start_time = std::chrono::steady_clock::now();
    for (unsigned iter {0}; iter < iter_max; iter++) {
        test += iter * iter;
    }
    time_in_seconds = std::chrono::duration_cast<std::chrono::milliseconds>
        (std::chrono::steady_clock::now() - start_time).count() / 1000.0;
    std::cout << "Multiplication: " << time_in_seconds << std::endl;

    std::cout << test << std::endl;

    return 0;
}

Results

If you compile that without optimization, pow is clearly slower:

$ clang++ -std=c++11 pow.cpp -o pow; and ./pow
Pow: 0.272
Multiplication: 0.042

Now I compiled this with clang++ using its -O3 optimization. When I did this on 2014-05-21, pow was significantly faster. I have revisited this on 2014-07-10, where pow just a tiny bit slower than the multiplication. Interesting.

I also tested with g++ and found that pow is significantly slower than the multiplication. To be fair, I ran each one a couple times since the overall time varies. With g++, the multiplication is actually faster. To get meaningful results, I ran each one 10 times and too mean and standard deviation with this Python script:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import subprocess

import numpy
import unitprint

def compile_cpp(compiler):
    subprocess.check_call([compiler, '-std=c++11', '-O3', 'pow.cpp', '-o', 'pow'])

def get_results():
    words = subprocess.check_output(['./pow']).decode().strip().split()
    return float(words[1]), float(words[3])

def bootstrap(compiler, runs=10):
    compile_cpp(compiler)
    times_pow = []
    times_mul = []
    for i in range(runs):
        time_pow, time_mul = get_results()
        times_pow.append(time_pow)
        times_mul.append(time_mul)

    mean_pow = numpy.mean(times_pow)
    mean_mul = numpy.mean(times_mul)

    std_pow = numpy.std(times_pow)
    std_mul = numpy.std(times_mul)

    return unitprint.siunitx(mean_pow, std_pow), \
            unitprint.siunitx(mean_mul, std_mul)

def main():
    for compiler in ['g++', 'clang++']:
        print(compiler, *bootstrap(compiler))

if __name__ == "__main__":
    main()

These are the results:

Compiler pow / s x * x / s
g++ 2.90 ± 0.03 1.28 ± 0.01
clang++ 1.272 ± 0.008 1.268 ± 0.008

With clang++, there is a tiny difference between pow and multiplication, it is not really significant though, since it is just half a standard deviation. g++ takes more than twice as long for using pow. I can absolutely understand that people using g++ will try to avoid the pow function.

However, and that is my point, the statement that pow is always slower than multiplication does not really hold. I consider the results from clang++ to be on par. Please test your application on your compiler and see what is faster in reality, not in theory.

Source of pow

And if you look at the source of pow(), you will see that they have thought about it:

112         /* First see whether `y' is a natural number.  In this case we
113            can use a more precise algorithm.  */

Then it jumps to 9 where it says:

136 9:      /* OK, we have an integer value for y.  Unless very small
137            (we use < 8), use the algorithm for real exponent to avoid
138            accumulation of errors.  */