Martin McBride, 2022-04-06
Tags optimisation numpy scipy
Categories python language intermediate python
We looked at low level otimisation previously. In this article we will look at structural optimisation.
Structural optimisation takes a big picture approach. It attempts to optimise code by solving the problem in a different way, for example:
- Using a faster algorithm or more efficient data structures.
- Using high-performance Python libraries.
- Applying functional programming techniques.
- Caching key data.
- Using multithreading.
These techniques are not a replacement for low-level optimisation, they represent an additional way of optimising code. It isn't always possible to apply this type of optimisation. For example, if you are already using the most efficient known algorithm then there may be no way to improve it. But where structural changes do apply they can sometimes lead to dramatic improvements. And, of course, having found the most efficient structural solution, it might still be possible to make it even faster (if necessary) using low-level optimisations.
This is a big topic, this article is only really an overview of techniques rather than a comprehensive guide.
Algorithm and data structure optimisation
There are many standard algorithms in computer science that cover tasks such as:
- Searching a list to check if it contains a particular value.
- Sorting the elements in a list.
- Finding the shortest path between two elements in a network.
There are many other algorithms for other tasks, of course. There are often several algorithms that achieve the same result but work in different ways. And sometimes, one algorithm can be much faster than another, depending on the specific requirements of your application.
There are also many data structures. Again there are often several different data structures that could be used for any particular purpose, but some might be more efficient than others for a particular purpose.
Example choosing a search algorithm
Let's take the search example. This is a simple example that you might already be familiar with. In the linear search algorithm you start with the first element and try every element until you find a match. If you find a match then the element is present, if you reach the end of the list without finding a match then the element is not present. The time taken to search depends on the length of the list. If the list is a thousand elements long, it will take at most 1000 iterations to find the element. If the list is a million elements long it will take up to one million iterations to find the element.
Another algorithm is the binary search. This requires the list to be sorted. The process is to find the middle element in the list, and check if it is larger or smaller than the target. You can then eliminate half of the elements in the list. Repeat the procedure, eliminating half the remaining elements each time until you either find the element or confirm it isn't there. For a thousand elements, after the first try, you will have reduced the range to 500 elements, then 250, 125, 63, 32, 16, 8, 4, 2, 1. It will take just 10 iterations to complete the search because one thousand is almost 2^10. This compares to a thousand iterations with the linear search.
If you had a million elements it would take a mere 20 iterations (because a million is almost 2^20). If you were using a linear search that takes a million operations, there is no possible way that you will ever be able to use low-level optimisation to compete with the algorithmic efficiency of a binary search.
But here is the problem. A binary search requires the data to be sorted. If the data happens to be sorted already, then of course it is better to do a binary search. If you are likely to need to search the data many times, it could be worth sorting the data (a slow process, but you only need to do it once) to gain the advantage of faster searches every time. But if you only need to search the data once, you are better just doing a linear search.
The algorithm you choose depends on the specific details of your software.
Example choosing a data structure
In the previous example, we assumed that the data should be stored as a list. A list is a good general-purpose data structure that can be used for searching (and for many other purposes). But for searching in particular it is not the absolute best choice.
A set is a better choice. It uses a hash table for searching, which means that searching for an element takes the same time, regardless of the number of elements being searched. In addition, a set only stores unique elements. If a list contains multiple identical elements, the corresponding set would only contain that element once, which can save memory for large repetitive lists.
Your data needs to be stored as a set to take advantage of this efficiency. If you don't need the full features of a list (storing multiple copies of the same value, inserting and removing items at specific indexes, and so on) then you can simply use a set instead of a list. But if you need the data to be held as a list, you will need to copy the data into a set structure before performing any searches. This may or may not be worthwhile, depending on how often the data changes compared to how many times your code will search the data. Converting a large list to a set is quite time-consuming, so it is not worth doing that just to speed up a single search.
Using high-performance Python libraries
One of the reasons that Python is sometimes slow is that it is an interpreted language. Strictly speaking, in the default Python implementation, it is just-in-time compiled to bytecode which is then interpreted. By contrast, a language like C is fully compiled to machine code and saved as an executable. The compilation stage is carried out when the executable is built, so the compiler can apply more complex optimisation strategies to improve the performance.
In many cases, the fact that Python is slightly slower makes absolutely no difference to the end-user. For example, in a GUI based application, the user is often the slowest part of the system. If you move the mouse, it doesn't matter if the program takes 0.1ms to respond or 1ms to respond, it will be fast enough that you will not notice any delay.
Differences will become more apparent if the code is processing very large amounts of data - for example, video data, 3D data, image data, sound data, very large spreadsheets, etc. If, for example, the code is looping over every pixel in a 24 megapixel image, the difference between a C program (that might optimise the inner loop down to a few clock cycles) and a Python program (the might take tens or hundreds of cycles per loop) can be very significant.
So if you want to process large chunks of data efficiently, write the code in C ... but we are Python programmers, how does that help us? Well fortunately there are plenty of Python libraries that can be imported and called in the normal way, but under the hood use highly optimised C code to do the hard work. Some useful libraries are:
- NumPy, a library that handles multidimensional arrays of raw data, and can perform many common array operations very efficiently.
- SciPy, which adds lots of scientific and statistical algorithms to NumPy.
- Pillow image processing.
- MoviePy video processing.
- Pandas for processing spreadsheet data.
There are many libraries for processing all sorts of large data. These are often implemented as wrappers around well-established C libraries, so you will access the speed and reliability of a mature code base but with a Python interface.
If you need to interface to a C/C++ library that doesn't already have a Python interface, you can write your own. You need some knowledge of C programming to do this, but it is not particularly difficult. The built-in ctypes module allows for this.
An alternative is to use a Python compiler. One option is Numba, which can selectively compile individual functions in your Python code. It can also allow code to run on multiple CPU cores, or even on a GPU. As with all techniques, there are pros and cons (it adds additional dependencies, the initial compilation takes time, and some types of code might be slower) but used sparingly it can achieve good results.
Using functional programming techniques
Python provides many functional programming (FP) features. The main benefit of the functional programming paradigm is that it allows us to create more readable and reliable code. But it can also help with performance in several ways:
- Lazy iteration is preferred in FP. In typical procedural programming, data sequences are stored as lists, and we usually process the entire list to create another list in each stage of the process. When we process a sequence of data in FP, we use iterators that provide the data one element at a time and process each element from start to finish before moving on to the next. This doesn't directly make things quicker, but it reduces latency (the first results are calculated sooner) and also uses less memory.
- FP uses higher-order functions - that is, functions that act on other functions - taking function objects as arguments or even return new functions. Some of these functions perform implicit looping operations. For example, the
mapfunction applies a supplied function to every item in a list (or other iterator). This is the equivalent of looping over the iterator, but because it is done internally it usually runs faster.
zipare other examples.
- Reducing functions are another example. The
sumfunction will add all the elements in an iterator, and again the loop is implicit so is often faster.
allare other reducing functions.
itertoolsis a library that allows us to process iterators in lots of useful ways and usually avoids creating intermediate lists that use memory.
- Another feature of FP is that it tries to use pure functions. A pure function is a function that has no side effects, and whose result depends only on its inputs. The result of a pure function can be cached - that is, the first time it is called with a particular set of arguments, the result can be stored. If the function is called again with the same arguments, the cached result can be returned immediately, avoiding a repeat calculation. The
lru_cachemodule provides this functionality.
It is worth noting that the NumPy module also takes a functional approach, and often you only need to make a single function call to process an entire array.
Often your program might need to access data stored in a file or on the network. For example, your program might have various options stored in a configuration file.
A simplistic approach would be to read the configuration file every time you need to check a setting. The problem with this, of course, is that might read the same file several times which is inefficient. Worst case you might read the same file many times creating serious performance issues.
The best solution is usually to cache the data, and there are several approaches:
- The simplest is to read the file right at the start and store the data in memory.
- Sometimes you might prefer to read the file when it is first used, then store the data in memory for subsequent uses. This should be considered if the program only needs the data in particular circumstances (for example, print settings that would only be required if the user decided to do some printing).
- If the configuration file can be modified outside of your program, and if you need your program to take account of the changes, then you would need a more sophisticated system capable of monitoring the file for changes.
Although the basic concept of caching is quite simple, the implementation can sometimes be more complex.
Most modern computers have multiple cores - effectively separate CPUs that can process data simultaneously.
If a task can be broken down into separate parts that can be run independently and at the same time, this can speed processing up significantly.
Unfortunately, the standard Python implementation is not particularly efficient at multithreading, due to its built-in mechanism (the Global Interpreter Lock) for avoiding conflicts. Multithreading can be used for certain purposes, such as ensuring that a GUI program is still responsive while Python is waiting for data to load from disk. But it won't always give you quite the speedup you would expect.
There are several workarounds:
- You can switch to a different Python implementation, for example, Cython (completely different from the standard implementation CPython, despite the similar names). This can be more efficient, but it is a different implementation so there is always the possibility of slightly different behaviour, so some libraries not quite working correctly.
- If you need to speed up specific operations, you could check if there is a library (possibly written in C/C++) to perform the processing using multiple threads.
- Try Numba, as mentioned earlier.
This has been a high-level review of some of the big picture code changes you should consider to speed up your code if low-level improvements aren't helping. Sometimes a new approach is the only way to improve performance.