How grouper works
A very common question on StackOverflow is: "How do I split a sequence into evenly-sized chunks?"
If it's actually a sequence, rather than an arbitrary iterable, you can do this with slicing. But the itertools documentation has a nifty way to do it completely generally, with a recipe called grouper.
As soon as someone sees the recipe, they know how to use it, and it solves their problem. But, even though it's very short and simple, most people don't understand how it works. And it's really worth taking a look at it, because if you can muddle your way through grouper, you can understand a lot of more complicated iterator-based programming.
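For the sequence case mentioned above, slicing might look something like this sketch. The name chunks_by_slicing is made up for illustration, and it assumes seq supports len() and slice indexing, which arbitrary iterables don't:

```python
def chunks_by_slicing(seq, n):
    """Split a sequence into consecutive slices of length n.

    The last slice is shorter if len(seq) isn't a multiple of n.
    """
    return [seq[i:i + n] for i in range(0, len(seq), n)]


print(chunks_by_slicing([0, 1, 2, 3, 4, 5, 6], 3))  # [[0, 1, 2], [3, 4, 5], [6]]
print(chunks_by_slicing("ABCDEFG", 3))              # ['ABC', 'DEF', 'G']
```

This breaks down as soon as you hand it a generator or a file, which is where grouper comes in.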
Pairs
Let's start with a simpler function to group an even-length iterator over objects into an iterator over pairs of objects:
def pairs(iterable):
    it = iter(iterable)
    return zip(it, it)
How does this work?
First, we make an iterator over the iterable. An iterator is an iterable that keeps track of its current position. The most familiar iterators are the things returned by generator functions, generator expressions, map, filter, zip, the functions in itertools, etc. But you can create an iterator for any iterable with the iter function. For example:
>>> a = range(5) # not an iterator
>>> list(a)
[0, 1, 2, 3, 4]
>>> list(a)
[0, 1, 2, 3, 4]
>>> i = iter(a) # is an iterator
>>> list(i)
[0, 1, 2, 3, 4]
>>> list(i)
[]
Since we've already consumed i in the first list call, there's nothing left in it for the second call. This might be a little easier to see with a function like islice or takewhile that only consumes part of the iterator:
>>> from itertools import islice
>>> i = iter(a)
>>> list(islice(i, 3))
[0, 1, 2]
>>> list(islice(i, 3))
[3, 4]
You may wonder what happens if a was already an iterator. That's perfectly fine: in that case, iter just returns a itself.
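A quick way to check that claim yourself is an identity test. This is just a small sketch using the same range(5) as above; the variable names are only for illustration:

```python
a = range(5)   # an iterable, but not an iterator
i = iter(a)    # an iterator over a

# Calling iter on an iterator hands back the very same object.
same_for_iterator = iter(i) is i
# Calling iter on a plain iterable builds a new iterator object.
same_for_iterable = iter(a) is a

print(same_for_iterator)  # True
print(same_for_iterable)  # False
```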
Anyway, if we have two references to the same iterator, and we advance one reference, the other one has of course also been advanced (since it's the same iterator). Having two separate iterators over the same iterable doesn't do that. For example:
>>> a = range(10)
>>> i1, i2 = iter(a), iter(a) # two separate iterators
>>> list(i1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(i2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> i1 = i2 = iter(a) # two references to the same iterator
>>> list(i1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(i2)
[]
(Of course in this case, if a had already been an iterator, calling iter(a) twice would have given us back the same iterator (a itself) twice, so the first example would be the same as the second.)
So, what happens if you zip the two references to the same iterator together? Each one gets every other value:
>>> i1 = i2 = iter(a)
>>> list(zip(i1, i2))
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
If it isn't obvious why, work through what zip does—or this simplified pure-Python zip:
def fake_zip(i1, i2):
    while True:
        try:
            v1 = next(i1)
            v2 = next(i2)
        except StopIteration:
            # Python 3.7+ turns a leaked StopIteration into RuntimeError
            return
        yield v1, v2
If i1 and i2 are the same iterator, after v1 = next(i1), i1 and i2 will be pointing to the next value after v1, so v2 = next(i2) will get that value.
And that's all there is to the pairs function.
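Here's pairs in action, as a small sketch. The second call uses an odd-length input on purpose, to show what happens when the "even-length" assumption doesn't hold:

```python
def pairs(iterable):
    it = iter(iterable)
    # Both arguments are the same iterator, so zip pulls two
    # consecutive items for every tuple it builds.
    return zip(it, it)


print(list(pairs("abcdef")))  # [('a', 'b'), ('c', 'd'), ('e', 'f')]
print(list(pairs(range(5))))  # [(0, 1), (2, 3)]
```

Notice that in the odd-length case the trailing 4 is read from the iterator and then silently thrown away, because zip stops as soon as it can't fill a complete tuple.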
Arbitrary-sized chunks
So, how do we make n references to the same iterator? There are a few ways to do it, but the simplest is:
args = [iter(iterable)] * n
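If it's not obvious that list multiplication repeats a reference rather than copying the iterator, this sketch may help. It uses range(10) and n = 3 as stand-in values, and compares against a list comprehension, which builds separate iterators instead:

```python
n = 3

# One iterator, referenced n times.
args = [iter(range(10))] * n
all_same = all(arg is args[0] for arg in args)
shared = (next(args[0]), next(args[1]), next(args[2]))

# For contrast: n independent iterators over the same range.
separate = [iter(range(10)) for _ in range(n)]
independent = (next(separate[0]), next(separate[1]), next(separate[2]))

print(all_same)     # True
print(shared)       # (0, 1, 2)
print(independent)  # (0, 0, 0)
```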
And now, how do we zip them together? Since zip takes any number of arguments, not just two, you just use argument list unpacking:
zip(*args)
And now we can almost write grouper:
def grouper(iterable, n):
    args = [iter(iterable)] * n
    return zip(*args)
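As a quick sanity check of this zip-based version, here's a sketch with inputs whose lengths divide evenly by n (the inputs themselves are just examples):

```python
def grouper(iterable, n):
    # n references to one shared iterator, zipped together
    args = [iter(iterable)] * n
    return zip(*args)


print(list(grouper(range(9), 3)))  # [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
print(list(grouper("ABCDEF", 2)))  # [('A', 'B'), ('C', 'D'), ('E', 'F')]
```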
Uneven chunks
Finally, what if the number of items isn't divisible into evenly-sized chunks? For example, what if you want to group range(10) into groups of 3? There are a few possible answers:
- [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]
- [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 0, 0)]
- [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9,)]
- [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
- ValueError
By using zip, we get the fourth one: an incomplete group just doesn't appear at all. Sometimes that's exactly what you want. But most of the time, it's probably one of the first two that you actually want.
For that purpose, itertools has a function called zip_longest. It fills in the missing values with None, or with the fillvalue argument if you pass one in. So:
>>> from itertools import zip_longest
>>> iters = [iter(range(10))] * 3
>>> list(zip_longest(*iters, fillvalue=0))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 0, 0)]
And now we've got everything we need to write, and understand, grouper:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
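To see the finished recipe at work, here's a short usage sketch. It reuses the 'ABCDEFG' example from the comment in the recipe, plus the range(10) example from earlier:

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Each group is a tuple; join them to get the strings from the comment.
print(["".join(group) for group in grouper("ABCDEFG", 3, "x")])
# ['ABC', 'DEF', 'Gxx']

# With no fillvalue, the last group is padded with None.
print(list(grouper(range(10), 3)))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]
```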
And if you want to, e.g., use zip instead of zip_longest, you know how to do it.
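The other two answers from the list of options (a short final group, or a ValueError) can't be had just by swapping zip for zip_longest. One possible approach, which is not the itertools recipe, is to pull n items at a time with islice. The names chunks_short_tail and grouper_strict are made up for this sketch:

```python
from itertools import islice


def chunks_short_tail(iterable, n):
    """Yield tuples of n items; the last one may be shorter."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, n))
        if not chunk:
            return  # iterator exhausted
        yield chunk


def grouper_strict(iterable, n):
    """Yield tuples of exactly n items; raise ValueError on leftovers."""
    for chunk in chunks_short_tail(iterable, n):
        if len(chunk) != n:
            raise ValueError("incomplete final group")
        yield chunk


print(list(chunks_short_tail(range(10), 3)))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9,)]
print(list(grouper_strict(range(9), 3)))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
```

Because grouper_strict is a generator, the ValueError only shows up when you reach the incomplete group, not when you call the function.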