Python informer

Improve your Python coding skills

Map/reduce example

As an example of functional programming, in this article we will use some Python built in functions to analyse a text file. In the first example we will count the number of words in the file. In the second example we will extract the unique words.

The basic method is map/reduce. In this model the map function is used to pre-process the input data. The reduce function is then used to combine the data to give the final result.

Built in operator functions

As an aside, Python has function equivalents of every operator. For example, the following two lines of code do the same thing:

c = operator.add(a, b)
c = a + b

This is very useful because you can use add wherever you need a function object that perform the equivalent of +. There are functions for every operator, including non-symbolic operators like in.

The map function

The simplest form of map takes a function object and an iterable. It applies the function to each item in the iterable, and returns an iterator containing the results. For example:

from operator import neg

it = map(neg, [1, -3, 5, -7])
print(list(it))

Here, we apply the neg function to the list [1, -3, 5, -7]. neg is a built-in function that does the same job as the unary minus operator. This results in a iterator it.

If we try to print the interator we would just see the iterator object (a map object), we must convert it to a list to see its elements:

[-1, 3, -1, 7]

map can take more than one iterator, provided the function object takes the same number of parameters. For example:

from operator import mul

it = map(mul, [1, 2, 3, 4], [10, 100, 1000])
print(list(it))

This takes a number from each list, and multiplies them together: 1*10, 2*100, 3*1000. If the iterables have different lengths (as in this example) it stops when the shortest sequence is exhausted. So it prints:

[10, 200, 3000]

The reduce function

The reduce function takes a sequence of values and reduces it to a single value by applying a function repeatedly and accumulating the result. Here is a simple example to illustrate how it works:

from functools import reduce
from operator import add

x = reduce(add, [1, 2, 3, 4])
print(x)

What reduce does is take the first two elements and combine them using the function (add in this case). It then takes the result and combines it with the next element, and so on. It calculates

(((1+2)+3)+4)

In other words, it sums all the values. In fact, there is a built in function sum that does the same job - you should use that normally as it is simpler and more efficient. This is just an illustration.

Counting words

As a practical example, here is a list of strings (an excerpt from the Zen of Python):

zen = ['beautiful is better than ugly',
       'explicit is better than implicit',
       'simple is better than complex',
       'complex is better than complicated',
       'flat is better than nested',
       'sparse is better than dense',
       'readability counts']

Now imagine we want to count the total number of words in all these strings. We can use map/reduce:

  • Use map to convert the list of strings into a sequence of number-of-words values for each string
  • Use reduce to convert the list of lengths to a single total number-of-words

Here is how this looks in code

from functools import reduce
from operator import add

def number_of_words(s):
    words = s.split(' ')
    return len(words)

counts = map(number_of_words, zen)
total = reduce(add, counts)

We could get rid of the number_of_words function by using a lambda:

lambda x: len(x.split(' '))

You can also combine the map and reduce into a single line. This is our code for finding the total number of words in a list of sentences:

total = reduce(add, map(lambda x: len(x.split(' ')), zen))

If you think this is a bit too terse, you could leave it as separate lines. But if you are used to functional Python this code is fairly readable.

Counting unique words

Now suppose we want to count the number of unique words in the list of sentences. We could do it like this:

  • Use map to convert each sentence into a set of words (a set will only contain one copy of each word, even if it appears more than once in the sentence).
  • Use reduce to combine the list of sets into a single set.
  • Find the length of the set.

Here is the map code:

def set_of_words(s):
    words = s.split(' ')
    return set(words)

sets = map(set_of_words, zen)

Once again, we use split to convert a string to a list of words. But this time we use set to convert the list into a set. This creates a list of sets (one set for each sentence).

Next we want to find the union of all these sets:

(((s1 | s2) | s3) | s4) etc...

Just as we can use the add function instead of the + operator, we can use the or_ function instead of the | operator (don’t forget to import or_ from operator):

all_words = reduce(or_, sets)
total = len(all_words)

Once again we can use a lambda and code the whole thing as one line:

total = len(reduce(or_, map(lambda x: set(x.split(' ')), zen))

Why bother with map/reduce

In this case, of course we could have simply concatenated all our input strings together, and removed the need to bother with map/reduce. However, imagine that instead of processing a few setences, we had a massive database with millions of sentences.

The first advantage of map is that it uses lazy iteration, so we don’t have to load all that data into memory. We only ever need to store one sentence in memory at a time.

The second advantage is that every map operation is independent. That is the beauty of functional programming, we can map the data in any order because there are no side effects. If we wanted, we could split our database into parts, and run map on several different computers, each processing its own section of the database. A final computer could run the reduce task, speeding the whole process up by many times.

Map/reduce is often used in big data applications because it allows for large scale parallelisation of data processing.