itertools module - general iterators

By Martin McBride, 2021-12-06
Tags: starmap filterfalse dropwhile takewhile compress islice chain zip_longest tee groupby accumulate itertools python standard library
Categories: python standard library


The itertools module provides several general iterators. In many cases, these iterators are variants of built-in functions such as map and filter. There are also some other generally useful iterators.

The iterators in this section are:

  • Map-like iterators - starmap.
  • Filter-like iterators - filterfalse, dropwhile, takewhile, compress, and islice.
  • Zip-like iterators - chain, chain.from_iterable, and zip_longest.
  • Splitting iterators - tee and groupby.
  • Accumulating iterators - accumulate.

Since each of these functions returns an iterator, if you want to print the resulting data you will need to convert it to a list. For example:

i = starmap(...)
print(i)         # <itertools.starmap object ...>
print(list(i))   # prints the data
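
For instance, a concrete version of this (a minimal sketch using operator.add, which we will meet again below) might look like this:

from itertools import starmap
from operator import add

i = starmap(add, [(2, 10), (4, 20), (6, 30)])
print(i)         # something like <itertools.starmap object at 0x...>
print(list(i))   # [12, 24, 36]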

Map-like iterators - starmap

The starmap function is related to map, but it accepts its arguments in a different format.

Here is a simple use of map:

from operator import add

a = [2, 4, 6]
b = [10, 20, 30]

i = map(add, a, b)

This creates a result:

[12, 24, 36]

But suppose we had the initial parameters in a "pre-zipped" format:

[(2, 10), (4, 20), (6, 30)]

We can use starmap to process this data:

from itertools import starmap
from operator import add

z = [(2, 10), (4, 20), (6, 30)]

i = starmap(add, z)

starmap is roughly equivalent to unzipping the values and using map:

i = map(add, *zip(*z))
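
As a quick sanity check, here is a minimal sketch showing that the two approaches produce the same values:

from itertools import starmap
from operator import add

z = [(2, 10), (4, 20), (6, 30)]

print(list(starmap(add, z)))       # [12, 24, 36]
print(list(map(add, *zip(*z))))    # [12, 24, 36]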

Filter-like iterators

These operators provide useful variants of the built-in filter function.

To recap, filter takes a predicate (a function that returns true or false) and applies it to every element in the iterable. The resulting iterator only includes the values for which the predicate is true. For example:

def is_negative(x):
    return x < 0

a = [-3, -2, 0, 1, -5, 6]

i = filter(is_negative, a)

The predicate function returns true if the value is negative. So in this case, the output iterator would produce the values:

-3, -2, -5

filterfalse

filterfalse is similar to filter, except it only includes the values for which the predicate is false. Based on the code above:

from itertools import filterfalse

i = filterfalse(is_negative, a)

This produces only the values that are greater than or equal to zero:

0, 1, 6

takewhile

takewhile takes values from the input iterable for as long as the predicate is true, then it stops.

from itertools import takewhile

i = takewhile(is_negative, a)

In our case, it will take the first two values, because they are negative, then it will stop because the third value is 0 (ie not negative). It will ignore any items after the first non-negative value, even though some of them are also negative.

-3, -2

dropwhile

dropwhile is the opposite of takewhile. It ignores values from the input iterable for as long as the predicate is true, then it yields everything after that.

from itertools import dropwhile

i = dropwhile(is_negative, a)

In our case, it will drop the first two values, because they are negative. It will return everything after that:

0, 1, -5, 6

compress

compress filters an iterable based on a sequence of selectors. It accepts two iterables. The first, data, contains the input data. The second, selectors, contains a set of values that filter the data.

For each element in data, that element will be included if the corresponding selectors value is true, and excluded otherwise. For example:

from itertools import compress

data = [10, 20, 30, 40, 50 , 60]
selectors = [1, 0, 0, 1, 1, 0]

i = compress(data, selectors)

Since selectors is only true at positions 0, 3 and 4, only those elements of data will be included in the output iterator:

10, 40, 50

If data and selectors have different lengths, the length of the output sequence is determined by whichever is shorter.
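
For instance, here is a minimal sketch with a shorter selectors sequence; the trailing values in data are simply never selected:

from itertools import compress

data = [10, 20, 30, 40, 50, 60]
selectors = [1, 0, 1]

print(list(compress(data, selectors)))   # [10, 30]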

islice

islice provides slicing for iterators. It works in a similar way to slicing a list, but of course, since it works on iterables the operation is lazy, that is, it isn't applied until you read the values from the iterator.

There are several ways to call islice. The two-argument form takes an iterable and a stop value:

from itertools import islice

data = [10, 20, 30, 40, 50, 60]
i = islice(data, 4)

This creates an iterator that stops at 4, ie returns every item up to but not including index 4. This is equivalent to a slice [:4] applied to a list. The result is:

10, 20, 30, 40

The three-argument form takes an iterable, a start and a stop value:

i = islice(data, 2, 5)

This is equivalent to a slice [2:5] applied to a list. The result is:

30, 40, 50

Finally, the four-argument form takes an iterable, a start, a stop, and a step value:

i = islice(data, 1, 5, 2)

This is equivalent to a slice [1:5:2] applied to a list. It takes values from position 1, up to but not including 5, in steps of 2. The result is:

20, 40

In either of the previous two examples, if the stop value is set to None the iteration continues to the end of the sequence:

i = islice(data, 1, None, 2)  # equivalent to [1::2]
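
Because it is lazy, islice can also take a slice of an infinite iterator, something a list slice could never do. Here is a minimal sketch using itertools.count (an infinite counting iterator):

from itertools import count, islice

evens = count(0, 2)                  # 0, 2, 4, 6, ... forever
print(list(islice(evens, 5)))        # [0, 2, 4, 6, 8]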

Zip-like iterators

These functions join two or more iterables, in various ways.

chain

chain joins two or more iterables to act like a single iterable with all the values joined end to end. For example:

from itertools import chain

a = [1, 2, 3, 4]
b = [10, 20]
c = [100, 200, 300]
i = chain(a, b, c)

This gives an iterable with the following sequence of values:

1, 2, 3, 4, 10, 20, 100, 200, 300

chain.from_iterable

This is similar to chain, except that it takes a single iterable:

from itertools import chain

m = [[1, 2, 3, 4],
     [10, 20],
     [100, 200, 300]]
i = chain.from_iterable(m)

In this case, we have used a list of lists, but it can accept any iterable of iterables. The main iterable m is evaluated lazily, that is to say, chain will not attempt to access the next iterable until the previous one is exhausted.
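
We can illustrate this lazy behaviour with a small sketch, using a generator function whose print calls (purely for illustration) show when each inner list is created:

from itertools import chain

def make_lists():
    for n in (1, 2, 3):
        print('creating list', n)
        yield [n, n]

i = chain.from_iterable(make_lists())
print(next(i))   # prints "creating list 1", then 1
print(next(i))   # prints 1 only, the second list is not created yet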

zip_longest

In a normal zip operation, if you attempt to zip several iterables of different lengths, the sequence will stop when the shortest iterable is exhausted:

a = [1, 2, 3, 4]
b = [10, 20]
c = [10, 200, 300]
i = zip(a, b, c)

Since the shortest iterable, b, has length 2, the output sequence also has length 2:

(1, 10, 10), (2, 20, 200)

With zip_longest the sequence will stop when the longest iterable is exhausted:

from itertools import zip_longest

a = [1, 2, 3, 4]
b = [10, 20]
c = [10, 200, 300]
i = zip_longest(a, b, c, fillvalue=-1)

Since the longest iterable, a, has length 4, the output sequence also has length 4:

(1, 10, 10), (2, 20, 200), (3, -1, 300), (4, -1, -1)

fillvalue is used to fill in any blanks. If a value is not supplied it defaults to None.
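
For example, here is a minimal sketch reusing the lists above, but without supplying fillvalue:

i = zip_longest(a, b, c)
print(list(i))
# [(1, 10, 10), (2, 20, 200), (3, None, 300), (4, None, None)]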

Splitting iterators

These iterators split a single input into several outputs.

tee

tee effectively provides two or more iterators that can iterate over the input iterable independently. Here is an example:

from itertools import tee

r = range(4)
a, b = tee(r, 2)
for i in a:
    print(i)
for i in b:
    print(i)

Here, r is a range object. Strictly speaking, a range can be iterated more than once, but many iterables (generator objects, file objects, and the other iterators in this article, for example) can only be consumed once, and tee is mainly useful for those.

The tee function creates two new iterators that can each iterate over the values in r independently. We illustrate this by looping over a and then looping over b. The result is:

0
1
2
3
0
1
2
3

There are a couple of caveats to this function:

  • After creating the tee, you should not access r from anywhere else. The tee function effectively takes ownership of the iterable, and things could get out of step if something else consumes items from the iterable at the same time.
  • The tee iterators are not threadsafe.

Of course, a similar effect can be obtained by creating a list from r, then you can iterate over the list multiple times. The main advantage of tee is lazy evaluation.
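
To make that comparison concrete, here is a minimal sketch of the list-based approach, using a generator expression as a genuinely one-shot iterable:

g = (n*n for n in range(4))   # a one-shot generator

values = list(g)              # eager: every value is read into memory up front
for i in values:
    print(i)
for i in values:
    print(i)

With tee, values are only pulled from the source as the new iterators need them (although tee does buffer any values that one iterator has seen and the other has not).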

groupby

groupby will split an iterable into several iterators, grouping the original elements according to some chosen characteristic.

Here is an example:

from itertools import groupby

items = ['apple', 'apricot', 'cherry', 'carrot',
         'cranberry', 'banana', 'blueberry',
         'avocado', 'almond']

grouped = groupby(items, lambda x: x[0])

for key, values in grouped:
    print('{}: {}'.format(key, ', '.join(values)))

groupby accepts two arguments:

  • The iterable to be grouped.
  • A key function that will be applied to each item in the input iterable, to calculate the key that will be used for grouping.

In the example, the input iterable contains strings. The key function is a lambda that calculates x[0], which is the first character of the string. This means that the elements in the original list will be grouped by their first letter.

groupby returns an iterator. The elements of the iterator are all tuples of the form (key, values), where:

  • key is the key (the first letter in the case of our example).
  • values is another iterable, that returns every name in the group.

There is a separate (key, values) pair for each group.

We print each pair like this:

print('{}: {}'.format(key, ', '.join(values)))

values is an iterator, but the join function iterates over it and joins its contents into the string that is printed. Here is what the code displays:

a: apple, apricot
c: cherry, carrot, cranberry
b: banana, blueberry
a: avocado, almond

The first key is the letter 'a', and the first group contains the two elements that begin with an 'a'. The next key is 'c', and the group contains the three elements that begin with 'c', and so on.

Notice that there are two groups with the letter 'a' as a key. That is because the function will only group similar elements that are adjacent to each other. That is by design: it is a grouping function, not a sorting function. There are two separate runs of words that begin with 'a', and each gets its own group.

If you specifically want to group all the names that begin with the same letter into a single group, you should first sort the items using the same key function as the one used in groupby:

items.sort(key=lambda x: x[0])

The result is then:

a: apple, apricot, avocado, almond
b: banana, blueberry
c: cherry, carrot, cranberry
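
Putting the sort and the groupby together, a minimal runnable sketch of the whole thing might look like this:

from itertools import groupby

items = ['apple', 'apricot', 'cherry', 'carrot',
         'cranberry', 'banana', 'blueberry',
         'avocado', 'almond']

items.sort(key=lambda x: x[0])

for key, values in groupby(items, lambda x: x[0]):
    print('{}: {}'.format(key, ', '.join(values)))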

accumulate

At its simplest, accumulate is a bit like sum, except that it provides a running total. However, accumulate can be used in other ways, as we will see.

Here is the simplest case:

from itertools import accumulate

items = [5, 2, 6, 1, 9]

totals = accumulate(items)

This creates an iterator, totals, that gives a running total of the sum of the items in the original iterable:

5, 7, 13, 14, 23

This is formed from 5, (5 + 2), (5 + 2 + 6), etc.

By default, accumulate will add the values. However, you can supply a different function. For example, if you use the built-in max function, the result will be a running maximum (ie the maximum value so far):

maxima = accumulate(items, func=max)

Giving:

5, 5, 6, 6, 9
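
Another common choice is operator.mul, which gives a running product. A minimal sketch:

from itertools import accumulate
from operator import mul

items = [5, 2, 6, 1, 9]
print(list(accumulate(items, mul)))   # [5, 10, 60, 60, 540]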

It is also possible to create a series based on a recurrence relationship. To do this we can define a function of a and b that only uses the value of a, for example:

items = [2]*8

def fn(a, b):
    return a*2

totals = accumulate(items, fn)

In this case, each new value is equal to the previous value multiplied by 2, giving a sequence that is the powers of 2:

2, 4, 8, 16, 32, 64, 128, 256

Notice that the only value of items that we actually use is the first element, which is the starting point of the sequence. After that, the values are ignored (but of course the length of items determines the length of the output sequence). Since we only care about the first element of items, the easiest way to create it is just to fill it with the initial value, 2.
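
Since only the first element of items is ever used, itertools.repeat (which produces a value a fixed number of times) is a natural alternative to building the list. A minimal sketch:

from itertools import accumulate, repeat

def double(a, b):
    # b is ignored, each new value is just double the previous one
    return a*2

print(list(accumulate(repeat(2, 8), double)))
# [2, 4, 8, 16, 32, 64, 128, 256]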

See also

If you found this article useful, you might be interested in the book NumPy Recipes or other books by the same author.
