Map/reduce example

By Martin McBride, 2019-09-08
Tags: built-in function operator functools map reduce lambda function
Categories: functional programming


As an example of functional programming, in this article we will use some Python built-in functions to analyse a text file. In the first example we will count the number of words in the file. In the second example we will extract the unique words.

The basic pattern we will be using is map/reduce. In this pattern the map function is used to pre-process the input data. The reduce function is then used to combine the data to give the final result.

Built-in operator functions

As an aside, Python has function equivalents of every operator. These functions live in the operator module. For example, the operator.add function is equivalent to the + operator. Therefore the following two lines of code do the same thing:

import operator

c = operator.add(a, b)
c = a + b

This is very useful because you can use add wherever you need a function object that perform the equivalent of +. There are functions for every operator, including non-symbolic operators like in.

The map function

The simplest form of map takes a function object and an iterable. It applies the function to each item in the iterable, and returns an iterator that accesses the results. For example:

from operator import neg

it = map(neg, [1, -3, 5, -7])
print(list(it))

Here, we apply the neg function (again from the operator module) to the list [1, -3, 5, -7]. neg is a function that does the same job as the unary minus operator. This results in a iterator it.

Remember that iterators are lazy - they don't perform any calculations until they need to. If we try to print the iterator we would just see the iterator object (a map object), because the actual work of the map operation hasn't been done yet. When we convert the iterator to to a list, the iterable is processed to obtain its elements:

[-1, 3, -1, 7]

map can take more than one iterable, provided the function object takes the same number of parameters. For example, mul (the function equivalent of *) takes two parameters, so map accepts two iterables:

from operator import mul

it = map(mul, [1, 2, 3, 4], [10, 100, 1000])
print(list(it))

This takes a number from each list, and multiplies them together: 1*10, 2*100, 3*1000. If the iterables have different lengths (as in this example) it stops when the shortest sequence is exhausted. So it prints:

[10, 200, 3000]

The reduce function

The reduce function takes a sequence of values and reduces it to a single value by applying a function repeatedly, and accumulating the result. Here is a simple example to illustrate how it works:

from functools import reduce
from operator import add

x = reduce(add, [1, 2, 3, 4])
print(x)

What reduce does is take the first two elements and combine them using the function (add in this case). It then takes the result and combines it with the next element, and so on. It calculates

(((1+2)+3)+4)

In other words, it sums all the values. In fact, there is a built-in function sum that does the same job - you should use that normally as it is simpler and more efficient. This is just an illustration.

Counting words

As a practical example, here is a list of strings (an excerpt from the Zen of Python):

zen = ['beautiful is better than ugly',
       'explicit is better than implicit',
       'simple is better than complex',
       'complex is better than complicated',
       'flat is better than nested',
       'sparse is better than dense',
       'readability counts']

Now imagine we want to count the total number of words in all these strings. We can use map/reduce:

  • Use map to convert the list of strings into a sequence of number-of-words values for each string
  • Use reduce to convert the list of lengths to a single total number-of-words

Here is how this looks in code

from functools import reduce
from operator import add

def number_of_words(s):
    words = s.split(' ')
    return len(words)

counts = map(number_of_words, zen)
total = reduce(add, counts)

We could get rid of the number_of_words function by using a lambda:

lambda x: len(x.split(' '))

You can also combine the map and reduce into a single line. This is our code for finding the total number of words in a list of sentences:

total = reduce(add, map(lambda x: len(x.split(' ')), zen))

If you think this is a bit too terse, you could leave it as separate lines - that is a matter of style. But if you are used to functional Python this code is quite readable.

Counting unique words

Now suppose we want to count the number of unique words in the list of sentences. We could do it like this:

  • Use map to convert each sentence into a set of words (a set will only contain one copy of each word, even if it appears more than once in the sentence).
  • Use reduce to combine the list of sets into a single set.
  • Find the length of the set.

This example is a little different because, rather than obtaining a single integer, we are obtaining a set of values. We are using reduce to take a groups of sets of words, and reduce it down to a single set of unique words. But that is a perfectly valid use of reduce.

Here is the map code:

def set_of_words(s):
    words = s.split(' ')
    return set(words)

sets = map(set_of_words, zen)

Once again, we use split to convert a string to a list of words. But this time we use set to convert the list into a set. This creates a list of sets (one set for each sentence).

Next we want to find the union of all these sets:

(((s1 | s2) | s3) | s4) etc...

Just as we can use the add function instead of the + operator, we can use the or_ function instead of the | operator (don't forget to import or_ from operator):

all_words = reduce(or_, sets)
total = len(all_words)

Once again we can use a lambda and code the whole thing as one line:

total = len(reduce(or_, map(lambda x: set(x.split(' ')), zen))

Why bother with map/reduce

In this case, of course we could have simply concatenated all our input strings together, and removed the need to bother with map/reduce. However, imagine that instead of processing a few sentences, we had a massive database with millions of sentences.

The first advantage of map is that it uses lazy iteration, so we don't have to load all that data into memory. We only ever need to store one sentence in memory at a time.

The second advantage is that every map operation is independent. That is the beauty of functional programming, we can map the data in any order because there are no side effects. If we wanted, we could split our database into parts, and run map on several different computers, each processing its own section of the database. A final computer could run the reduce task, speeding the whole process up by many times.

The map/reduce pattern is often used in big data applications because it allows for large scale parallelisation of data processing. But even on smaller data sets, it is an interesting way of breaking problems down into functional units.

See also

See also

If you found this article useful, you might be interested in the book NumPy Recipes or other books by the same author.

Join the PythonInformer Newsletter

Sign up using this form to receive an email when new content is added:

Popular tags

2d arrays abstract data type alignment and angle animation arc array arrays bar chart bar style behavioural pattern bezier curve built-in function callable object chain circle classes clipping close closure cmyk colour combinations comparison operator comprehension context context manager conversion count creational pattern data science data types decorator design pattern device space dictionary drawing duck typing efficiency ellipse else encryption enumerate fill filter font font style for loop formula function function composition function plot functools game development generativepy tutorial generator geometry gif global variable gradient greyscale higher order function hsl html image image processing imagesurface immutable object in operator index inner function input installing iter iterable iterator itertools join l system lambda function latex len lerp line line plot line style linear gradient linspace list list comprehension logical operator lru_cache magic method mandelbrot mandelbrot set map marker style matplotlib monad mutability named parameter numeric python numpy object open operator optimisation optional parameter or pandas partial application path pattern permutations pie chart pil pillow polygon pong positional parameter print product programming paradigms programming techniques pure function python standard library radial gradient range recipes rectangle recursion reduce regular polygon repeat rgb rotation roundrect scaling scatter plot scipy sector segment sequence setup shape singleton slice slicing sound spirograph sprite square str stream string stroke structural pattern subpath symmetric encryption template tex text text metrics tinkerbell fractal transform translation transparency triangle truthy value tuple turtle unpacking user space vectorisation webserver website while loop zip zip_longest