I love programming, not only for science, but because I enjoy the process. I programmed for security and automation in industry, for database systems and lately, for videogames. In all of these fields, optimizing your code to get an improvement in speed can save resources, time and money.

If you run short simulations, you may tell yourself that you don’t need faster code because it only takes a few of seconds -or up to a couple of minutes- and you don’t want to “waste” your time learning non interesting coding tricks. However, my experience tells me than good programming habits are easier to learn than bad ones, they decrease the probability of having bugs in your code, and you'll have a clearer and better organized result.

One of the main problems in modern languages, such as Python, is that a simple line can hide a complex algorithm. Well, this is not a problem, it’s a good solution, but it could become a problem if you don’t take the time to understand what is happening inside.

In this post I will focus on Python, particularly its numerical library Numpy, and I will try to explain how to improve a few simple pieces of code that could be found in your scripts, which will easily improve the quality of your programming.

Let's start with a simple example using matrices. In Numpy, matrices are stored as an array of values plus a variable that indicates the shape of each dimension. For example, a 5x10 matrix is stored as an array of 50 items and an array indicating the length of each dimension: [5,10]. The structure of the “items array” are stored first for each row and then for each columns:

a = np.array(range(50))
a.shape = (5, 10)
a.values = [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,
            10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
            20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
            30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
            40, 41, 42, 43, 44, 45, 46, 47, 48, 49]

There are some details you should understand related to this type of storage. For instance, what is the best algorithm to access the data.

Another important thing is how the CPU moves items from/to the main memory to operate on them. In this case, we can improve the code by using DMA (Direct Memory Access), which is a mechanism for moving large amounts of data from one memory address to another.

Let's look at how it works with an example. In the first case we copy item by item from one array to another:

for j in range(WIDTH):
        for i in range(WIDTH):
                b[i,j] = a[i,j]

In my computer, this takes over seven seconds (width=3000), because it needs to access the memory for each item. For each value, the python code mixes with the C wrapper and can waste a bit of time.

Instead, if we copy entire columns:

for i in range(WIDTH):
        d[:,i] = a[:,i]

The copying only takes over 0.35 seconds. However, in this case the code is still not properly using DMA yet because the data is not stored sequentially. For each column, Python calls the C wrapper and it copies the entire column, which only demonstrates that C code is faster than Python code (obviously).

Let's go with a third example that copies rows:

for j in range(WIDTH):
        c[j,:] = a[j,:]

This one takes only 0.068 seconds to finish, which is 100 times faster than the first example! In this case the code is clearly using the DMA.

Finally, we could copy all the data in one sentence:

e[:,:] = a[:,:]

Taking 0.043 seconds. This code copies all the data in one C call and it uses DMA to speed up the data transfer.

For a more practical example, in the paper “A Fast Parallel Algorithm for Thinning Digital Patterns” (T. Y. ZHANG and C. Y. SUEN, ACM March, 1984), the authors presented a method to thin an image. The program needs to add a blank border to the image, so, the code was like the first example:

b = np.zeros((WIDTH + 2, HEIGHT + 2))
for j in range(HEIGHT):
        for i in range(WIDTH):
                b[i+1,j+1] = a[i,j]

This code can be improved to the following one:

b = np.zeros((WIDTH + 2, HEIGHT + 2))
b[1:WIDTH+1,1:HEIGHT+1] = a[:,:]

One image of 1024 x 640 pixels took 0.556 seconds with the original code and  0.0033 seconds with the new one. Consider an experiment that takes 50 images every 10 seconds and then uses this code to process them. The code will need to process over 43,200 images per day: the first code will take more than six hours, but with the second all the images can be processed in less than three minutes.

I want to finish this post with a small trick: If your code needs to operate iterating over columns you can transpose the matrix, run your code (iterating over rows!) and transpose it at the end. It might improve your code, try it!