There are a lot of novice questions about optimizing NumPy code on Stack Overflow, and they tend to make many of the same mistakes. I'll try to cover them all here.

What does NumPy speed up?

Let's look at some Python code that does some computation element-wise on two lists of lists. I'll use operator.add because it's just about the fastest function you can call.
c = []
for ra, rb in zip(a, b):
    c.append([])
    for ca, cb in zip(ra, rb):
        c[-1].append(operator.add(ca, cb))
This is slow. If I run it on two 100x100 lists, it takes 2.59ms. Where is all that time going?
  • 206µs: the fastest possible 100x100 loop
  • 185µs: extra time for the zip generators
  • 770µs: function calls on operator.add
  • 713µs: the actual additions
  • 716µs: appending
So, we're spending nearly 3x as long on overhead as on the actual work.
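If you want to reproduce timings like these, the standard timeit module is the right tool. Here's a sketch (the function and variable names are mine, and absolute numbers will of course vary by machine):

```python
import operator
import random
import timeit

n = 100
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]

def add_lists(a, b):
    # The naive nested-loop implementation from above.
    c = []
    for ra, rb in zip(a, b):
        c.append([])
        for ca, cb in zip(ra, rb):
            c[-1].append(operator.add(ca, cb))
    return c

# Take the best of a few repeats to reduce noise from other processes.
per_call = min(timeit.repeat(lambda: add_lists(a, b),
                             number=100, repeat=3)) / 100
```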

You can optimize that overhead. Using a list comprehension reduces the time for both the loops and the appends, which gets us down to 2.18ms—a 19% speedup. Manually inlining the add calls as ca+cb reduces the function-call time, and with a couple more tricks, the fastest I can get this in pure Python is 1.74ms. That's a 49% speedup, which isn't bad, but it's still orders of magnitude slower than what you'd get from the dumbest possible C implementation.
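For concreteness, here's roughly what those two pure-Python variants look like (a sketch; the exact micro-optimizations can vary):

```python
import operator

def add_comprehension(a, b):
    # Comprehensions replace the explicit loops and the appends...
    return [[operator.add(ca, cb) for ca, cb in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def add_inlined(a, b):
    # ...and inlining the call as ca + cb removes the function-call
    # overhead on top of that.
    return [[ca + cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]
```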

So, let's turn to NumPy. What can NumPy speed up?

Well, storing both matrices in arrays and filling in a pre-allocated target zeros array gets us a little better cache locality and eliminates the cost of reallocating and copying lists that comes from appending. That gets us to 2.39ms—only an 8% speedup.
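A sketch of what that looks like—arrays for storage, but Python loops still doing all the work (the function name is mine):

```python
import numpy as np

def add_prealloc(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    c = np.zeros(a.shape)  # pre-allocated target: no list reallocation or copying
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + b[i, j]  # but still one Python-level add per element
    return c
```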

If we use a vectorized operation, we can turn the loops into C loops, which should eliminate most of the looping time. It also makes the code simpler:
c = np.vectorize(operator.add)(a, b)
But that still only gets us down to 2.08ms—a 25% speedup. It's a little better than the list comprehension, but not much. And now we can't replace the function call with an expression, so this is the end of the line, right?

No! NumPy comes with a bunch of functions that are already vectorized and implemented in C (or, sometimes, Fortran or C++). One of them, np.add, does the same thing as np.vectorize(operator.add), but a whole lot faster. You can also write this as a normal operator.
c = np.add(a, b)
c = a + b
What's the difference between np.vectorize(operator.add) and np.add?

They both spend about 5µs on preallocating the target array and looping.

But for np.add, that's almost all it has to do—other than that, it's just looping over 0-99 100 times in C, looking up doubles in a couple of C arrays, adding them, and storing the result in another C array. All told, that takes about another 2µs, so our total time is 7.07µs. Yes, microseconds, not milliseconds. That's a 36634% speedup—or, turned around, we've shaved off 99.7% of the time.

By contrast, for np.vectorize(operator.add), inside the loop, it has to make a Python function call to operator.add, and that function has to unbox two Python floats, add them together, and box up a new Python float. We already saw how long those parts take—770µs and 713µs, respectively.

So, the key to using NumPy is simple: it's not about moving the inner loop into NumPy, it's about getting rid of all Python functions in that inner loop. Only call natively-vectorized functions like np.add, or use operators like +, on the arrays.

And notice that in simple cases, this is dead easy: c = a + b is the code from the inner loop, written without the loops. (And yes, you can also just call operator.add(a, b).) If the function you want to vectorize instead returns (a**2 + b**2) ** .5, that's fine, just call that directly on the arrays instead of inside a loop.
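For instance, that expression works unchanged on whole arrays—each operator dispatches to a vectorized ufunc, so there's no Python-level inner loop:

```python
import numpy as np

def hypot_elementwise(a, b):
    # The same expression you'd write for two scalars, applied to whole arrays.
    return (a**2 + b**2) ** .5

a = np.array([3.0, 5.0])
b = np.array([4.0, 12.0])
c = hypot_elementwise(a, b)  # elementwise: sqrt(9+16), sqrt(25+144)
```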

Sometimes it's not as easy. Consider this loop:
c = []
for i, ra in enumerate(a):
    c.append([])
    for j, ca in enumerate(ra):
        c[-1].append(ca + b[j][i] * i)
Here, we're iterating b in a different order from a, and we're using the indices themselves in the computation. How can you do that? This is where NumPy programming becomes an art rather than just literal translation. You have to learn all the functions NumPy includes, and get to know which ones to reach for when facing each problem, and the only way to get there is through experience. But let's work this one through.

First, notice that the different order we're iterating b would be the same order if b were transposed. So, just transpose it. (If you're worried that this wastes time making a copy, don't worry, it doesn't; it just creates a new view over the same array, with different striding rules attached. But, even if it _did_ make a copy, wasting a few microseconds to save a few milliseconds would still be a huge win.)

Next, how do we get the i values? Well, enumerate basically zips a range with its argument, and we can use an array for that range. One catch: i is the row index, so the range has to be reshaped into a column (with [:, None]) so it broadcasts down the rows rather than across the columns.
c = a + b.T * np.arange(a.shape[0])[:, None]
This takes 42.1µs, compared to 3.16ms for the naive pure Python implementation and 2.16ms for the fastest pure Python implementation.
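As a sanity check, you can compare the vectorized expression against the original loop on a small random input (note that the range is reshaped into a column with [:, None] so that it lines up with the row index i):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 4))
b = rng.random((4, 4))

# The original nested loop, for reference.
expected = []
for i, ra in enumerate(a):
    expected.append([])
    for j, ca in enumerate(ra):
        expected[-1].append(ca + b[j][i] * i)

# The vectorized one-liner.
c = a + b.T * np.arange(a.shape[0])[:, None]
assert np.allclose(c, expected)
```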
