1. Stack Overflow is full of questions where the answer is to create a "multidict", a dict mapping each key to a list of values.

    There are two ways to do this: using defaultdict, or using a regular dict with setdefault. And as soon as someone posts an answer using one or the other, someone else suggests in a comment that they should have used the other, and sometimes it even devolves into an argument about which is better.

    Compare these two functions:

        def f1(pairs):
            d = {}
            for key, value in pairs:
                d.setdefault(key, []).append(value)
            return d
    
        def f2(pairs):
            d = collections.defaultdict(list)
            for key, value in pairs:
                d[key].append(value)
            return d
    

    It's hard to argue that either one is unclear, overly verbose, hard to understand, etc.

    And, while one or the other is probably faster, it's probably not enough to make a difference in real-life programs.*

    So, how do you decide between them?

    The answer is simple: This isn't the relevant code for making the decision. You have to look at how the returned value is going to be used in the code. When you later look up a missing key, do you want an empty list, or a KeyError?
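
    For example, with the two functions above:

        >>> d1 = f1([('a', 1), ('a', 2)])
        >>> d2 = f2([('a', 1), ('a', 2)])
        >>> d1['b']
        Traceback (most recent call last):
          ...
        KeyError: 'b'
        >>> d2['b']   # defaultdict quietly creates the missing entry
        []
        >>> dict(d2)  # note that the lookup above inserted 'b' into d2
        {'a': [1, 2], 'b': []}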


    * From a quick test, setdefault is about 60% slower, so in the rare cases where it matters, if you want a plain dict, it might be worth using defaultdict anyway, then converting at the end.

  2. The problem

    Often, using an iterator lazily is better than generating a sequence (like the one you get from a list comprehension). For example, compare these two scripts:
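
    (The embedded scripts aren't preserved here, but a minimal sketch of the contrast, with a stand-in process function and filename, might look like this:)

        def process(line):   # stand-in for the real per-line work
            print(line.upper())

        # First script: a list comprehension reads the whole file up front.
        with open('data.txt') as f:
            lines = [line for line in f]
        for line in lines:
            process(line)

        # Second script: the file object itself is a lazy iterator.
        with open('data.txt') as f:
            for line in f:
                process(line)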

    The first one has to read the entire file into memory, which can be a problem for huge files, but usually you don't have files that big.

    A more serious problem is that it has to read the entire file before it can do any work. Even with moderately-sized files, the corresponding delay at startup can make debugging more painful (each run takes seconds instead of milliseconds before you can see whether the first results are correct). And it can even have a significant performance cost (by preventing the file reads from interleaving with the network reads, which takes away caching opportunities from the OS/filesystem/drive).

    The problem with the second one is that you can only iterate a file once. If you try to iterate it a second time, it'll be empty (because you've already iterated the whole thing). So, the following code gives you 2 instead of 4:
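
    (Again, the original snippet isn't preserved; something along these lines, using an in-memory stand-in for a two-line file, reproduces the problem:)

        import io

        f = io.StringIO('spam\neggs\n')   # stands in for a two-line file
        count = sum(1 for line in f) + sum(1 for line in f)
        print(count)   # 2, not 4: the second pass finds f already exhausted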

    Of course little of this is specific to files. If you have a generator that requires a lot of CPU work to produce its values, running it to completion before you get started causes the same startup delay, and can prevent pipelining of work, which can have a huge cost in CPU cache misses. But just leaving it as a generator means you can't iterate through it repeatedly; every pass after the first will yield nothing.

    The solution

    So, is there a way to get an iterable that's lazy the first time you run through it like an iterator, but restartable like a sequence?

    Well, you could build a complete "lazy sequence" class, but that's a lot of work, with some fiddly edge cases to deal with in order to handle the full interface properly (including things like indexing and slicing with negative values).

    Fortunately, you don't need the full interface. You need __next__ to store the values as you create them, and __iter__ to give you a new iterator that shares the same storage.

    The easy way

    As it turns out, that's exactly what itertools.tee does:
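
    For example (the original inline snippet isn't preserved, but the idea is):

        >>> import io, itertools
        >>> f = io.StringIO('abc\ndef\n')
        >>> f, t = itertools.tee(f)
        >>> list(t)                  # consumes t, but saves the values for f
        ['abc\n', 'def\n']
        >>> list(f)                  # the "spent" file can be iterated again
        ['abc\n', 'def\n']
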
    The problem is that it's tedious and error-prone to have to call tee explicitly each time you want to iterate. But you can easily wrap this up:
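
    The wrapper (repeated in the full listing further down) is just:

        import itertools

        class Reiterable(object):
            def __init__(self, iterable):
                self.iterable = iterable
            def __iter__(self):
                # tee off a fresh iterator each time, keeping the other
                # half (with its cached values) for the next pass
                self.iterable, t = itertools.tee(self.iterable)
                return t
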
    Now Reiterable is as easy to use as list, and gives you the benefit you care about (being able to iterate the values repeatedly) without the cost (iterating the entire thing up front).

    Performance

    The documentation shows you how tee is implemented. And if you don't understand it, it's probably worth copying the pure-Python implementation from the docs and stepping through what it does.

    But the basic idea is this: Each time you call tee, it creates two new deques and two new generators, both tied to the original iterator. Whenever either generator needs a value that wasn't yet produced, it's taken from the original iterator and added to both deques, and then of course immediately popped from one.

    So, the first time through Reiterable, it iterates the values on demand, and copies each value to the spare generator's deque. Each subsequent time, it's doing the same, but from an iterator over the spare deque instead of from the original iterator. So the values get moved from one deque to the next, with no wasted space and very little wasted time, right?

    Well, not quite. This is hard to see with the C implementation of tee, or even the generator-based implementation given in the docs, but if you build a class-based implementation, you can see what's going on. Unfortunately, the class implementation seems to break the online interactive visualizer, so you'll need to copy the code below and run it locally:

        import collections
        import io

        def tee(iterable, n=2):
            class gen(object):
                it = iter(iterable)
                deques = [collections.deque() for i in range(n)]
                def __init__(self, d):
                    self.d = d
                def __iter__(self):
                    return self
                def __next__(self):
                    if not self.d:                # when the local deque is empty
                        newval = next(gen.it)     # fetch a new value and
                        for d in gen.deques:      # load it to all the deques
                            d.append(newval)
                    return self.d.popleft()
            return tuple(gen(d) for d in gen.deques)

        class Reiterable(object):
            def __init__(self, iterable):
                self.iterable = iterable
            def __iter__(self):
                # use the class-based tee above, so the chain is inspectable
                self.iterable, t = tee(self.iterable)
                return t

        f = io.StringIO('abc\ndef\n')
        f = Reiterable(f)
        for i in range(3):
            list(f)
        print(f.iterable.it.it.it)   # the original StringIO, three tees down
    

    Algorithmic analysis

    Reiterable is building up a chain of tee objects. It moves the values from one deque to the next, so all but the highest are empty, but each level still costs a tee wrapper object and an (empty) deque. Each value iterated is just moved from the highest deque on the chain to the new deque, so the wasted time per iteration step is minimal; but when you run out of values, it has to run through the whole chain to discover that they're all empty before the new iterator can be declared empty.

    So, to iterate N items M times, instead of wasting N space to hold a copy of the iterable, you're wasting N+M space to hold a copy of the iterable and a chain of M empty tees and deques. And instead of NM time for the iteration, it's NM + M²/2 time for the iteration plus the extra empty checks (which is still O(NM), as long as M doesn't dwarf N).

    So, there's no algorithmic cost, except in edge cases when M >> N, which is a very strange use case. (If N is tiny, you really should just use a list; if M is gigantic, that almost always means you're doing a nested iteration that you can just flip over.)

    Real-life performance

    The real cost is the added overhead of having to go through the tee's generator for each value instead of just going through a list iterator. Which you can time pretty easily, so let's try it:

        In [66]: def func():
            ...:     f = (i for i in range(1000000))
            ...:     sum(f)
        In [67]: %timeit func()
        10 loops, best of 3: 73.2 ms per loop

        In [68]: def func():
            ...:     f = (i for i in range(1000000))
            ...:     sum(list(f))
        In [69]: %timeit func()
        10 loops, best of 3: 101 ms per loop

        In [70]: def func():
            ...:     f = (i for i in range(1000000))
            ...:     sum(Reiterable(f))
        In [71]: %timeit func()
        10 loops, best of 3: 108 ms per loop
    

    So, there is an additional performance cost to building a tee out of an iterator vs. building a list… but the overhead is only about 25% higher (roughly 35ms vs. 28ms on top of the 73.2ms baseline).

    A lot of people—not just novices—mix up parameters and arguments, especially when it comes to things like how default-valued parameters relate to keyword arguments, or how argument unpacking relates to variable parameters.

    The FAQ has a section called What is the difference between arguments and parameters, which explains the basics:
    Parameters are defined by the names that appear in a function definition, whereas arguments are the values actually passed to a function when calling it. Parameters define what types of arguments a function can accept. 
    PEP 3102 explains the current design. The glossary entries for argument and parameter are also helpful. And the tutorial also has nice sections on Defining Functions, which explain things in loose terms. However, to actually understand how it works, you need to read the reference documentation on Calls and Function definitions, and some of the terminology is only defined in the middle of the inspect module documentation.

    However, putting it all together is a little difficult.

    The explanations below are all for Python 3.x, but there is a brief overview of the differences between 3.x and 2.x at the end.

    PEP 3102 uses different terminology from the reference documentation and the inspect module; I've chosen to go with the latter.

    Parameters

    Parameters are part of a function definition (def or lambda). There are five different kinds of parameters:

    • positional-or-keyword: Normal parameters in a function definition, with or without default values.
      • Each has a name and an index, and can accept a positional argument with the same index, or a keyword argument with the same name, or (if it has a default value) nothing.
      • Technically, every parameter before the first bare *, var-positional, or var-keyword is a positional-or-keyword parameter.
    • positional-only: Only found in builtin/extension functions.
      • Each has a name and an index, but only accepts positional arguments with the same index.
    • var-positional: This is the *args.
      • This accepts a sequence consisting of all positional arguments whose index is larger than any positional-or-keyword or positional-only parameter. (Note that you can also specify a bare * here. In that case, you don't take variable positional arguments. You do this to set off keyword-only from positional-or-keyword parameters.)
    • keyword-only: These are parameters that come after a * or *args, with or without default values.
      • Each has a name only, and accepts only keyword arguments with the same name.
      • Technically, every parameter after the first bare * or var-positional, but before the var-keyword (if any), is a keyword-only parameter.
    • var-keyword: This is the **kwargs.
      • This accepts a mapping consisting of all keyword arguments whose name does not match any positional-or-keyword or keyword-only parameter.

    Arguments

    Arguments are part of a function call. There are four different kinds of arguments:
    • positional: Arguments without a name.
      • Each is matched to the positional-or-keyword or positional-only parameter with the same index, or to the var-positional parameter if there is no matching index (or, if there is no var-positional parameter, it's an error if there is no match).
    • keyword: Arguments with a name.
      • Each is matched to the positional-or-keyword or keyword-only parameter with the same name, or to the var-keyword parameter if there is no matching name (or, if there is no var-keyword parameter, it's an error if there is no match).
    • packed positional: An iterable preceded by *.
      • The iterable is unpacked, and the values treated as separate positional arguments.
    • packed keyword: A mapping preceded by **.
      • The mapping is iterated, and the key-value pairs treated as separate keyword arguments.
    There is no direct connection between parameters with a default value and keyword arguments. You can pass keyword arguments to parameters without default values, or positional arguments to parameters with them.

    Similarly, there is no direct connection between a *args in a function call and a *args in a function definition. A packed positional argument with three values might be unpacked into the last three positional-or-keyword parameters, or all three values might end up in the var-positional parameter, or the first two might be unpacked into positional-or-keyword parameters and the last into the var-positional (as sketched below).
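
    For instance, with hypothetical functions g and h:

        >>> def g(a, b, *rest):
        ...     print(a, b, rest)
        >>> v = [1, 2, 3]
        >>> g(*v)        # first two fill a and b, the last goes to rest
        1 2 (3,)
        >>> g(0, 0, *v)  # all three end up in rest
        0 0 (1, 2, 3)
        >>> def h(x, y, z):
        ...     print(x, y, z)
        >>> h(*v)        # all three fill positional-or-keyword parameters
        1 2 3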

    Examples

    In the following example:
        >>> def func(spam, eggs=1, *args, foo, bar=2, **kwargs):
        ...     print(spam, eggs, args, foo, bar, kwargs)
        >>> func(1, 2, 3, 4, foo=5, bar=6, baz=7)
        1 2 (3, 4) 5 6 {'baz': 7}
        >>> func(baz=1, bar=2, foo=3, eggs=4, spam=5)
        5 4 () 3 2 {'baz': 1}
        >>> func(foo=1, spam=2)
        2 1 () 1 2 {}
    
    spam is a positional-or-keyword parameter with no default value. This means any call must pass either at least one positional argument (as in the first call), or a keyword argument with name "spam" (as in the other two).

    eggs is a positional-or-keyword parameter with a default value. This means a call may pass either at least two positional arguments (as in the first call), or a keyword argument with name "eggs" (as in the second), or neither (as in the third).

    args is a var-positional parameter. Because there is such a parameter, it's legal to pass more than two positional arguments, in which case it will get all of the excess (as in the first call), but of course it's not required (as in the other two).

    foo is a keyword-only parameter with no default value. This means any call must pass a keyword argument with the name "foo" (as in all three calls).

    bar is a keyword-only parameter with a default value. This means a call may pass a keyword argument with the name "bar" (as in the first two calls), or not (as in the third).

    kwargs is a var-keyword parameter. Because there is such a parameter, it's legal to pass keyword arguments whose names don't match any parameter (as in the first two calls), but of course it's not required (as in the third).

    With the same function from above:
        >>> a = [2,3,4]
        >>> d = {'foo': 5, 'baz': 6}
        >>> func(1, *a, **d)
        1 2 (3, 4) 5 2 {'baz': 6}
        >>> func(1, 2, 3, 4, foo=5, baz=6)
        1 2 (3, 4) 5 2 {'baz': 6}
    
    In the first call, the list a is unpacked into three positional arguments, and the mapping d is unpacked into keyword arguments named "foo" and "baz". So, this has the exact same effect as the second call.

    Python 2.x

    Most of the above is true for Python 2.x as well, except:
    • There are no keyword-only parameters between *args and **kwargs.
    • You cannot use a bare * to mean no variable-positional parameter.
    However, the documentation is not as thorough, and some edge cases may be slightly different.

  4. How grouper works

    A very common question on StackOverflow is: "How do I split a sequence into evenly-sized chunks?"

    If it's actually a sequence, rather than an arbitrary iterable, you can do this with slicing. But the itertools documentation has a nifty way to do it completely generally, with a recipe called grouper.
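
    For an actual sequence, the slicing version is short enough to show inline:

        >>> def chunks(seq, n):
        ...     return [seq[i:i+n] for i in range(0, len(seq), n)]
        >>> chunks('ABCDEFG', 3)
        ['ABC', 'DEF', 'G']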

    As soon as someone sees the recipe, they know how to use it, and it solves their problem. But, even though it's very short and simple, most people don't understand how it works. And it's really worth taking a look at it, because if you can muddle your way through grouper, you can understand a lot of more complicated iterator-based programming.

    Pairs

    Let's start with a simpler function to group an even-length iterator over objects into an iterator over pairs of objects:
        def pairs(iterable):
            it = iter(iterable)
            return zip(it, it)
    How does this work?

    First, we make an iterator over the iterable. An iterator is an iterable that keeps track of its current position. The most familiar iterators are the things returned by generator functions, generator expressions, map, filter, zip, the functions in itertools, etc. But you can create an iterator for any iterable with the iter function. For example:
        >>> a = range(5) # not an iterator
        >>> list(a)
        [0, 1, 2, 3, 4]
        >>> list(a)
        [0, 1, 2, 3, 4]
        >>> i = iter(a) # is an iterator
        >>> list(i)
        [0, 1, 2, 3, 4]
        >>> list(i)
        []
    Since we've already consumed i in the first list call, there's nothing left in it for the second call. This might be a little easier to see with a function like islice or takewhile that only consumes part of the iterator:
        >>> from itertools import islice
        >>> i = iter(a)
        >>> list(islice(i, 3))
        [0, 1, 2]
        >>> list(islice(i, 3))
        [3, 4]
    You may wonder what happens if a was already an iterator. That's perfectly fine: in that case, iter just returns a itself.
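
    You can check that for yourself:

        >>> i = iter(a)
        >>> iter(i) is i
        True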

    Anyway, if we have two references to the same iterator, and we advance one reference, the other one has of course also been advanced (since it's the same iterator). Having two separate iterators over the same iterable doesn't do that. For example:
        >>> a = range(10)
        >>> i1, i2 = iter(a), iter(a) # two separate iterators
        >>> list(i1)
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        >>> list(i2)
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        >>> i1 = i2 = iter(a) # two references to the same iterator
        >>> list(i1)
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        >>> list(i2)
        []
    (Of course in this case, if a had already been an iterator, calling iter(a) twice would have given us back the same iterator (a itself) twice, so the first example would be the same as the second.)

    So, what happens if you zip the two references to the same iterator together? Each one gets every other value:
        >>> i1 = i2 = iter(a)
        >>> list(zip(i1, i2))
        [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
    If it isn't obvious why, work through what zip does—or this simplified pure-Python zip:
        def fake_zip(i1, i2):
            while True:
                try:
                    v1 = next(i1)
                    v2 = next(i2)
                except StopIteration:
                    # in modern Python a generator must return, not leak
                    # StopIteration (PEP 479)
                    return
                yield v1, v2
    If i1 and i2 are the same iterator, after v1 = next(i1), i1 and i2 will be pointing to the next value after v1, so v2 = next(i2) will get that value.

    And that's all there is to the pairs function.

    Arbitrary-sized chunks

    So, how do we make n references to the same iterator? There are a few ways to do it, but the simplest is:
        args = [iter(iterable)] * n
    And now, how do we zip them together? Since zip takes any number of arguments, not just two, you just use argument list unpacking:
        zip(*args)
    And now we can almost write grouper:
        def grouper(iterable, n):
            args = [iter(iterable)] * n
            return zip(*args)
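
    For evenly divisible input, this already does the job:

        >>> list(grouper(range(9), 3))
        [(0, 1, 2), (3, 4, 5), (6, 7, 8)]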

    Uneven chunks

    Finally, what if the number of items isn't divisible into evenly-sized chunks? For example, what if you want to group range(10) into groups of 3? There are a few possible answers:

    1. [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]
    2. [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 0, 0)]
    3. [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9,)]
    4. [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
    5. ValueError

    By using zip, we get the fourth one: an incomplete group just doesn't appear at all. Sometimes that's exactly what you want. But most of the time, it's probably one of the first two that you actually want.

    For that purpose, itertools has a function called zip_longest. It fills in the missing values with None, or with the fillvalue argument if you pass one in. So:
        >>> from itertools import zip_longest
        >>> args = [iter(range(10))] * 3
        >>> list(zip_longest(*args, fillvalue=0))
        [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 0, 0)]
    And now we've got everything we need to write, and understand, grouper:
        def grouper(iterable, n, fillvalue=None):
            "Collect data into fixed-length chunks or blocks"
            # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
            args = [iter(iterable)] * n
            return zip_longest(*args, fillvalue=fillvalue)
    And if you want to, e.g., use zip instead of zip_longest, you know how to do it.
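
    Spelled out, with a hypothetical grouper_drop, it's just the zip version from above:

        >>> def grouper_drop(iterable, n):
        ...     args = [iter(iterable)] * n
        ...     return zip(*args)
        >>> list(grouper_drop(range(10), 3))
        [(0, 1, 2), (3, 4, 5), (6, 7, 8)]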

