This confuses every Python developer the first time they see it—even if they're pretty experienced by the time they see it:
    >>> t = ([], [])
    >>> t[0] += [1]
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <stdin> in <module>()
    ----> 1 t[0] += [1]
    
    TypeError: 'tuple' object does not support item assignment
So far, so good. But now:
    >>> t
    ([1], [])
So the addition succeeded, but also failed!

This is discussed in the official FAQ, but most people don't read that FAQ—and those who do seem to immediately start trying to think of "solutions" without fully understanding the underlying issue. So let's dig into it at a lower level.

How assignment works

Let's take a simpler case first:
    >>> t = [0, 1]
    >>> t[0] = 2
    >>> t
    [2, 1]
In many languages, like C++, what happens here is that t[0] doesn't return 0, but some kind of reference object, which acts like the original 0 in most contexts, but if you modify it—including by assignment—the first element of t sees the change. Like this:
    >>> _x = getitem(t, 0)
    >>> _x.__assign__(2)
But Python doesn't do this. There is no object that's like 0 except when you modify it; there's just 0, which you can't modify. Assignment doesn't modify an object, it just makes the left side into a new name for the value on the right side. And in fact, t[0] isn't even an expression in this context; it's a special thing called an assignment target.

So what actually happens is:
    >>> setitem(t, 0, 2)
This avoids some pretty hairy stuff that comes up in languages like C++. If you want to design, say, a tree-based sorted list container in C++, the __getitem__ has to return a really smart reference-like object (and you have to design and implement those smarts) that knows how to handle __assign__ by changing the value and then moving it to the right new location in the tree to maintain the sorting. And then you have to think about what happens with references to references. (Imagine a sorted list of references to the values in a dictionary.) Also, because you may not want to create this heavy-weight reference object every time someone just wants to see a value, you have to deal with constness—you overload __getitem__ to return a plain value in const contexts, and a fancy reference value otherwise. And so on (and let's not even get into rvalue vs. lvalue references…).

None of that comes up in Python: __getitem__ just returns a plain-old value (an int, in this case), and all the tree-rebalancing just happens in __setitem__.

Of course this means that Python has a whole parallel grammar for assignment targets that's similar to, but not identical to (a subset of) the grammar for expressions. But it's not all that complicated. A target can be:
  • a simple identifier, in which case you're creating or replacing a local, closure, or global variable of that name.
  • an attribute reference (something ending with ".foo"), in which case you're calling setattr on the thing to the left of the last "." (which _is_ effectively an expression, although with slightly restricted syntax—it has to be a "primary expression").
  • a subscription or slicing (something ending with [1] or [:3]), in which case you're calling setitem on the thing to the left of the last "[", with the value or slice inside the brackets as an argument.
  • a target list (one or more targets separated by commas, like "x, y[0]", optionally inside parens or square brackets—which looks a lot like a tuple or list display expression), in which case you're doing iterable unpacking, verifying that the right side has exactly as many values as there are targets (two, in this example), and assigning the first value to the first target and the second to the second.
This is still a bit oversimplified (it doesn't handle extended iterable unpacking with "*spam", chained assignments like "x = y, z = 2, 3", etc.); see the language reference for full details. But it's all pretty intuitive once you get the idea. (By the way, anyone who wants to propose further extensions to assignment targets really needs to understand how the existing system works. The fact that Joshua Landau does is a big part of the reason PEP 448 is under serious consideration, while most such proposals die out after a half-dozen posts on python-ideas…)
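
To make that list concrete, here's each kind of target written out as ordinary Python (Point, p, lst, and the other names are just made up for illustration):

    class Point:
        pass

    p = Point()
    lst = [0, 1, 2]

    x = 1                  # simple identifier: binds the name x
    p.x = 2                # attribute reference: setattr(p, 'x', 2)
    lst[0] = 3             # subscription: lst.__setitem__(0, 3)
    lst[1:] = [4, 5]       # slicing: lst.__setitem__(slice(1, None), [4, 5])
    a, (b, c) = 1, (2, 3)  # target list: iterable unpacking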

How augmented assignment works

Now let's look at this case:
    >>> t = [0, 1]
    >>> t[0] += 2
    >>> t
    [2, 1]
In a C++ style language, this just means reference objects have to have an __iadd__ method that affects the original value, just like their __assign__ method. But Python doesn't have __assign__, so how does it handle this?

Well, for this case, Python could just treat it the same as t[0] = t[0] + 2, like this:
    >>> setitem(t, 0, getitem(t, 0) + 2)
That works fine for immutable objects, but it doesn't work right for mutable objects like lists. For example:
    >>> x = [0]
    >>> t = [x, [1]]
    >>> t[0] += [2]
    >>> t
    [[0, 2], [1]]
    >>> x
    [0, 2]
Here, t[0] refers to the same object as x. You don't just replace it with a different object, you modify the object in-place. To handle this, Python has an __iadd__ method, similar to __add__, but making the change in-place. (In fact, for a list, __iadd__ is the same thing as extend.) So for this case, Python could just treat it the same as t[0].__iadd__([2]). But that wouldn't work for immutable objects like integers.
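You can see the difference between the in-place and rebinding versions (a small illustration, not part of the original example) by keeping a second reference to the same list:

    >>> x = [0]
    >>> y = x
    >>> x += [1]     # in-place: x.__iadd__([1]), same as x.extend([1])
    >>> y
    [0, 1]
    >>> x = x + [2]  # rebinding: builds a brand-new list and rebinds the name x
    >>> y
    [0, 1]
    >>> x
    [0, 1, 2]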

So, what Python does is both. Since we don't have except expressions (at least not yet), I'll have to fake it with a series of separate statements:
    
    >>> _tmp = getitem(t, 0)
    >>> try:
    ...     _tmp2 = _tmp.__iadd__([2])
    ... except AttributeError:
    ...     _tmp2 = _tmp.__add__([2])
    >>> setitem(t, 0, _tmp2)

Back to the original example

    >>> t = ([], [])
    >>> t[0] += [1]
So, that += first calls getitem(t, 0). So far, so good. Then it calls __iadd__ on the resulting list. Since that's a list, that's fine; the list is extended in-place, and returns self. Then it calls setitem(t, 0, <the modified list>). Since t is a tuple, that raises an exception. But t[0] has already been modified in-place. Hence the confusing result.

Could we fix this somehow?

It's not clear how. Python can't know that setitem is going to raise an exception until it actually calls it. And it can't call it until it has the value to pass. And it doesn't have that value until it calls __iadd__. At which point the list has been modified.

You might argue that Python could know that setitem will raise, by testing hasattr(t, '__setitem__') and skipping the rest of the statement, synthesizing an AttributeError instead. But let's look at a similar case where that can't possibly work:
    >>> f = open('foo')
    >>> f.encoding += '16'
    AttributeError: readonly attribute
The exact same thing is going on here, but with setattr rather than setitem failing. (Fortunately, because a file's encoding is a string, which is immutable, we're not modifying it and then failing, but it's easy to construct an example with a read-only mutable attribute.)

And here, there's no way to predict in advance that it's going to fail. The problem isn't that __setattr__ doesn't exist, but that it raises an exception when called specifically with 'encoding'. If you call it with some different attribute, it'll work fine. What's happened is that io.TextIOWrapper is a C extension type that declares 'encoding' as a readonly attribute—or, if you're using the pure-Python implementation, _pyio.TextIOWrapper is a Python class that declares 'encoding' as a @property with no setter. There are plenty of other ways to make __setattr__ fail, and Python can't check them all in advance.
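
For the mutable variant mentioned above, here's a tiny class (made up just for this illustration) with a read-only property wrapping a list; the exact error message depends on your Python version, but the succeed-and-fail behavior is the same as with the tuple:

    >>> class Box:
    ...     def __init__(self):
    ...         self._items = []
    ...     @property
    ...     def items(self):
    ...         return self._items
    ...
    >>> b = Box()
    >>> b.items += [1]
    AttributeError: can't set attribute
    >>> b.items
    [1]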

One option that _might_ work would be to add some new methods to the data model, __cansetitem__(i) and __cansetattr__(name), which do nothing if __setitem__(i, value) would work for some value, but raise if __setitem__(i, value) would raise. Then, Python could do this:
    
    >>> _tmp = getitem(t, 0)
    >>> cansetitem(t, 0)
    >>> try:
    ...     _tmp2 = _tmp.__iadd__([2])
    ... except AttributeError:
    ...     _tmp2 = _tmp.__add__([2])
    >>> setitem(t, 0, _tmp2)
The __cansetitem__ method would be pretty simple, and could be added to existing collection types; you could even make MutableSequence.__cansetitem__ just call __getitem__ and ignore the result, and make MutableMapping.__cansetitem__ always return successfully, and that would make most third-party ABC-mixin-derived types just work.
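
Nothing like this exists today, of course, but here's a sketch of what those default implementations might look like:

    class MutableSequence:
        def __cansetitem__(self, index):
            self[index]   # raises IndexError/TypeError for a bad index, just as __setitem__ would

    class MutableMapping:
        def __cansetitem__(self, key):
            pass          # any (hashable) key can be assigned in a mutable mapping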

The __cansetattr__ method would be a bit more complicated, however. A base implementation in object.__cansetattr__ can fail for readonly attributes of C extension types, for read-only descriptors (to handle @property and custom descriptors), and for non-declared attributes of C extension types and __slots__ types. But every class with a custom __setattr__ would need a custom __cansetattr__. And there are a lot of those out there, so this would be a backward-compatibility nightmare.

Maybe someone can think of a way around this, or come up with a different solution. If you think you've got one, post it to the python-ideas list. But remember that, as Guido says, language design is not just solving puzzles. The fact that __setitem__ and __setattr__ are resolved dynamically is a feature, and a lot of things (including the "no need for reference types" raised above) flow naturally from that feature. And the surprising behavior of "t[0] += [2]" also flows naturally from that feature. So, there's a good chance any "solution" to the problem will actually be unnatural and more surprising under the covers. And solving a problem with deep magic that's harder to understand than the original problem is usually not a good idea.

  1. It's been more than a decade since Typical Programmer Greg Jorgensen taught the world about Abject-Oriented Programming.

    Much of what he said still applies, but other things have changed. Languages in the Abject-Oriented space have been borrowing ideas from another paradigm entirely—and then everyone realized that languages like Python, Ruby, and JavaScript had been doing it for years and just hadn't noticed (because these languages do not require you to declare what you're doing, or even to know what you're doing). Meanwhile, new hybrid languages borrow freely from both paradigms.

    This other paradigm—which is actually older, but was largely constrained to university basements until recent years—is called Functional Addiction.

    A Functional Addict is someone who regularly gets higher-order—sometimes they may even exhibit dependent types—but still manages to retain a job.

    Retaining a job is of course the goal of all programming. This is why some of these new hybrid languages, like Rust, check all borrowing, from both paradigms, so extensively that you can make regular progress for months without ever successfully compiling your code, and your managers will appreciate that progress. After all, once it does compile, it will definitely work.

    Closures

    It's long been known that Closures are dual to Encapsulation.

    As Abject-Oriented Programming explained, Encapsulation involves making all of your variables public, and ideally global, to let the rest of the code decide what should and shouldn't be private.

    Closures, by contrast, are a way of referring to variables from outer scopes. And there is no scope more outer than global.

    Immutability

    One of the reasons Functional Addiction has become popular in recent years is that to truly take advantage of multi-core systems, you need immutable data, sometimes also called persistent data.

    Instead of mutating a function to fix a bug, you should always make a new copy of that function. For example:

    function getCustName(custID)
    {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }

    When you discover that you actually wanted fields 2 and 3 rather than 1 and 2, it might be tempting to mutate the state of this function. But doing so is dangerous. The right answer is to make a copy, and then try to remember to use the copy instead of the original:

    function getCustName(custID)
    {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }
    
    function getCustName2(custID)
    {
        custRec = readFromDB("customer", custID);
        fullname = custRec[2] + ' ' + custRec[3];
        return fullname;
    }

    This means anyone still using the original function can continue to reference the old code, but as soon as it's no longer needed, it will be automatically garbage collected. (Automatic garbage collection isn't free, but it can be outsourced cheaply.)

    Higher-Order Functions

    In traditional Abject-Oriented Programming, you are required to give each function a name. But over time, the name of the function may drift away from what it actually does, making it as misleading as comments. Experience has shown that people will only keep one copy of their information up to date, and the CHANGES.TXT file is the right place for that.

    Higher-Order Functions can solve this problem:

    function []Functions = [
        lambda(custID) {
            custRec = readFromDB("customer", custID);
            fullname = custRec[1] + ' ' + custRec[2];
            return fullname;
        },
        lambda(custID) {
            custRec = readFromDB("customer", custID);
            fullname = custRec[2] + ' ' + custRec[3];
            return fullname;
        },
    ]

    Now you can refer to these functions by order, so there's no need for names.

    Parametric Polymorphism

    Traditional languages offer Abject-Oriented Polymorphism and Ad-Hoc Polymorphism (also known as Overloading), but better languages also offer Parametric Polymorphism.

    The key to Parametric Polymorphism is that the type of the output can be determined from the type of the inputs via Algebra. For example:

    function getCustData(custId, x)
    {
        if (x == int(x)) {
            custRec = readFromDB("customer", custId);
            fullname = custRec[1] + ' ' + custRec[2];
            return int(fullname);
        } else if (x.real == 0) {
            custRec = readFromDB("customer", custId);
            fullname = custRec[1] + ' ' + custRec[2];
            return double(fullname);
        } else {
            custRec = readFromDB("customer", custId);
            fullname = custRec[1] + ' ' + custRec[2];
            return complex(fullname);
        }
    }
    

    Notice that we've called the variable x. This is how you know you're using Algebraic Data Types. The names y, z, and sometimes w are also Algebraic.

    Type Inference

    Languages that enable Functional Addiction often feature Type Inference. This means that the compiler can infer your typing without you having to be explicit:


    function getCustName(custID)
    {
        // WARNING: Make sure the DB is locked here or
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }

    We didn't specify what will happen if the DB is not locked. And that's fine, because the compiler will figure it out and insert code that corrupts the data, without us needing to tell it to!

    By contrast, most Abject-Oriented languages are either nominally typed—meaning that you give names to all of your types instead of meanings—or dynamically typed—meaning that your variables are all unique individuals that can accomplish anything if they try.

    Memoization

    Memoization means caching the results of a function call:

    function getCustName(custID)
    {
        if (custID == 3) { return "John Smith"; }
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }

    Non-Strictness

    Non-Strictness is often confused with Laziness, but in fact Laziness is just one kind of Non-Strictness. Here's an example that compares two different forms of Non-Strictness:

    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
    // function lazyTaxRate(custId) {}
    
    function callByNameTextRate(custId)
    {
        /****************************************
        *
        * TO DO:
        *
        * get tax rate for the customer state
        * eventually from some table
        *
        ****************************************/
    }

    Both are Non-Strict, but the second one forces the compiler to actually compile the function just so we can Call it By Name. This causes code bloat. The Lazy version will be smaller and faster. Plus, Lazy programming allows us to create infinite recursion without making the program hang:

    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
    // function lazyTaxRateRecursive(custId) { lazyTaxRateRecursive(custId); }

    Laziness is often combined with Memoization:

    function getCustName(custID)
    {
        // if (custID == 3) { return "John Smith"; }
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }

    Outside the world of Functional Addicts, this same technique is often called Test-Driven Development. If enough tests can be embedded in the code to achieve 100% coverage, or at least a decent amount, your code is guaranteed to be safe. But because the tests are not compiled and executed in the normal run, or indeed ever, they don't affect performance or correctness.

    Conclusion

    Many people claim that the days of Abject-Oriented Programming are over. But this is pure hype. Functional Addiction and Abject Orientation are not actually at odds with each other, but instead complement each other.

  2. I haven't posted anything new in a couple years (partly because I attempted to move to a different blogging platform where I could write everything in markdown instead of HTML but got frustrated—which I may attempt again), but I've had a few private comments and emails on some of the old posts, so I decided to do some followups.

    A couple years ago, I wrote a blog post on greenlets, threads, and processes. At that time, it was already possible to write things in terms of explicit coroutines (Greg Ewing's original yield from proposal already had a coroutine scheduler as an example, Twisted already had @inlineCallbacks, and asyncio had even been added to the stdlib), but it wasn't in heavy use yet. Things have changed since then, especially with the addition of the async and await keywords to the language (and the popularity of similar constructs in a wide variety of other languages). So, it's time to take a look back (and ahead).

    Differences

    Automatic waiting

    Greenlets are the same thing as coroutines, but greenlet libraries like gevent are not just like coroutine libraries like asyncio. The key difference is that greenlet libraries do the switching magically, while coroutine libraries make you ask for it explicitly.

    For example, with gevent, if you want to yield until a socket is ready to read from and then read from the socket when waking up, you write this:
        buf = sock.recv(4096)
    To do the same thing with asyncio, you write this:
        buf = await loop.sock_recv(sock, 4096)
    Forget (for now) the difference in whether recv is a socket method or a function that takes a socket; the key difference is that await. In asyncio, any time you're going to wait for a value, yielding the processor to other coroutines until you're ready to run, you always do this explicitly, with await. In gevent, you just call one of the functions that automatically does the waiting for you.
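
    For instance, here's a minimal sketch (not taken from the asyncio docs) of an echo handler written against those low-level socket methods; sock is assumed to be a connected, non-blocking socket:

        async def echo(loop, sock):
            while True:
                buf = await loop.sock_recv(sock, 4096)   # explicit wait
                if not buf:
                    return
                await loop.sock_sendall(sock, buf)       # another explicit wait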

    In practice, while marking waits explicitly is a little harder to write (especially during quick and dirty prototyping), it seems to be harder to get wrong, and a whole lot easier to debug. And the more complicated things get, the more important this is.

    If you miss an await, or try to do it in a non-async function, your code will usually fail hard with an obvious error message, rather than silently doing something undesirable.

    Meanwhile, let's say you're using some shared container, and you've got a race on it, or a lock that's being held too long. It's dead simple to tell at a glance whether you have an await between a read and a write to that container, while with automatic waiting, you have to read every line carefully. Being able to follow control flow at a glance is really one of the main reasons people use Python in the first place, and await extends that ability to concurrent code.

    Serial-style APIs

    Now it's time to come back to the difference between sock.recv and sock_recv(sock). The asyncio library doesn't expose a socket API, it exposes an API that looks sort of similar to the socket API. And, if you look around other languages and frameworks, from JavaScript to C#, you'll see the same thing.

    It's hard to argue that the traditional socket API is in any objective sense better, but if you've been doing socket programming for a decade or four, it's certainly more familiar. And there's a lot more language-agnostic documentation on how it works, both tutorial and reference (e.g., if you need to look up the different quirks of a function on Linux vs. *BSD, the closer you are to the core syscall, the easier it will be to find and understand the docs).

    In practice, however, the vast majority of code in a nontrivial server is going to work at a higher level of abstraction. Most often, that abstraction will be Streams or Protocols or something similar, and you'll never even see the sockets. If not, you'll probably be building your own abstraction, and only the code on the inside—a tiny fraction of your overall code—will ever see the sockets.

    One case where using the serial-style APIs really does help, however, is when you've got a mess of already-written code that's either non-concurrent or using threads or processes, and you want to convert it to use coroutines. Rewriting all that code around asyncio (no matter which level you choose) is probably a non-trivial project; rewriting it around gevent, you just import all the monkeypatches and you're 90% done. (You still need to scan your code, and test the hell out of it, to make sure you're not doing anything that will break or become badly non-optimal, of course, but you don't need to rewrite everything.)
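
    A sketch of that migration path (existing_server here is a stand-in for your own blocking-style module, not a real package):

        from gevent import monkey
        monkey.patch_all()        # patches socket, ssl, time.sleep, threading, etc.

        import existing_server    # unchanged code written against blocking sockets
        existing_server.main()    # blocking calls now cooperatively yield to the gevent hub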

    Conclusion

    If I were writing the same blog post today, I wouldn't recommend magic greenlets for most massively-concurrent systems; I'd recommend explicit coroutines instead.

    There is still a place for gevent. But that place is largely in migrating existing threading-based (or non-concurrent) codebases. If you (and your intended collaborators) are familiar enough with threading and traditional APIs, it may still be worth considering for simpler systems. But otherwise, I'd strongly consider asyncio (or some other explicit coroutine framework) instead.

  3. Looking before you leap

    Python is a duck-typed language, and one where you usually trust EAFP ("Easier to Ask Forgiveness than Permission") over LBYL ("Look Before You Leap"). In Java or C#, you need "interfaces" all over the place; you can't pass something to a function unless it's an instance of a type that implements that interface; in Python, as long as your object has the methods and other attributes that the function needs, no matter what type it is, everything is good.

    Of course there are some occasions where you do need to LBYL. For example, say you need to do a whole set of operations or none at all. If the second one raises a TypeError after the first one succeeded, now you're in an inconsistent state. If you can make the first operation reversible—or, better, do the whole transaction on a copy of your state, off to the side, and then commit it atomically at the end—that may be acceptable, but if not, what can you do? You have to LBYL all of the operations before you do the first one. (Of course if you have to worry about concurrency or reentrancy, even that isn't sufficient—but it's still necessary.)

    hasattr

    Traditionally, in Python, you handled LBYL by still using duck-typing: just use hasattr to check for the methods you want. But this has a few problems.

    First, not all distinctions are actually testable with hasattr. For example, often, you want a list-like object, and a dict-like object won't do. But you can't distinguish with __getitem__, because they both support that. You can try to pile on tests--has __getitem__, and __len__, but doesn't have keys... But that's still going to have false positives once you go beyond the builtin types, and it's hard to document for the user of your function what you're testing for, and you have to copy and paste those checks all over the place (and, if you discover that you need to add another one, make sure to update all the copies). And remember to test according to special-method lookup (do the hasattr on the type, not the object itself) when appropriate, if you can figure out when it is appropriate.

    And, on top of the false positives, you may also get false negatives--a type that doesn't have __iter__ may still work in a for loop because of the old-style sequence protocol. So, you have to test for having __iter__ or having __getitem__ (but then you get back to the false positives). And, again, you have to do this all over the place.

    Finally, it isn't always clear from the test what you're trying to do. Any reader should understand that a test for __iter__ is checking whether an object is iterable, but what is a test for read or draw checking?

    The obvious way to solve these problems is to factor out an issequence function, an isiterable function, and so on.

    ABCs

    ABCs ("Abstract Base Classes") are a way to make those functions extensible. You write isinstance(seq, Sequence) instead of issequence(seq), but now a single builtin function works for an unbounded number of interfaces, rather than a handful of hardcoded ones. And we have that Sequence type to add extra things to. And there are three ways to hook the test.

    First, similar to Java interfaces, you can always just inherit from the ABC. If you write a class that uses Sequence as a base class, then of course isinstance(myseq, Sequence) will pass. And Sequence can use abstractmethod to require subclasses to implement some required methods or the inheritance will fail.

    Alternatively, similar to Go interfaces, an ABC can include structural tests (in a __subclasshook__ method) that will automatically treat any matching type as a subclass. For example, if your class defines an __iter__ method, it automatically counts as a subclass of Iterable.
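
    Here's roughly what such a structural hook looks like (this mirrors what collections.abc.Iterable actually does, but with made-up class names):

        from abc import ABCMeta

        class Iterablish(metaclass=ABCMeta):
            @classmethod
            def __subclasshook__(cls, C):
                if cls is Iterablish:
                    # structural check: does any class in the MRO define __iter__?
                    return any('__iter__' in B.__dict__ for B in C.__mro__)
                return NotImplemented

        class Countdown:
            def __iter__(self):
                return iter([3, 2, 1])

        assert issubclass(Countdown, Iterablish)   # no inheritance or registration needed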

    Finally, you can explicitly register any type as a virtual subclass of an ABC by calling the register method. For example, tuple doesn't inherit from Sequence, and Sequence doesn't do structural checking, but because the module calls Sequence.register(tuple) at import time, tuples count as sequences.
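
    You can see the effect of that registration directly:

        >>> from collections.abc import Sequence
        >>> Sequence in tuple.__mro__
        False
        >>> issubclass(tuple, Sequence)
        True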

    Most OO languages provide either the first option or the second one, which can be a pain. In languages with only nominal (subclass-based) interfaces, there's no way to define a new interface like Reversible that stdlib types, third-party types, or even builtins like arrays can match (except in a language like Ruby, where you can always reopen the classes and monkeypatch the inheritance chain at runtime, and it's even idiomatic to do so.) In languages with only structural interfaces, there's no way to do things like specify sequences and mappings as distinct interfaces that use the same method names with different meanings. (That's maybe not as much of a problem for named methods, but think about operator overloading and other magic methods—would you really want to spell dict access d{2} or similar instead of d[2] just so sequences and mappings can use different operators?)

    Sometimes, ABCs can clarify an interface even if you don't actually use them. For example, Python 2 made heavy use of the idea of "file-like objects", without ever really defining what that means. A file is an iterable of lines; an object with read() to read the whole thing and read(4096) to read one buffer; an object with fileno() to get the underlying file descriptor to pass to some wrapped-up C library; and various other things (especially once you consider that files are also writable). When a function needs a file-like object, which of these does it need? When a function gives you a file-like object, which one is it providing? Worse, even actual file objects come in two flavors, bytes and Unicode; an iterable of Unicode strings is a file-like object, but it's still useless to a function that needs bytes.

    So, in Python 3, we have RawIOBase and TextIOBase, which specify exactly which methods they provide, and whether those methods deal in bytes or text. You almost never actually test for them with isinstance, but they still serve as great documentation for both users and implementers of file-like objects.

    But keep in mind that you don't want to create an ABC in Python whenever you'd create an interface in Java or Go. The vast majority of the time, you want to use EAFP duck typing. There's no benefit in using them when not required (Python doesn't do type-driven optimizations, etc.), and there can be a big cost (because so much Python code is written around duck-typing idioms, so you don't want to fight them). The only time to use ABCs is when you really do need to check that an object meets some interface (as in the example at the top).

    ABCs as mixins

    The obvious problem with the more complex ABCs is that interfaces like Sequence or RawIOBase have tons of methods. In the Python 2 days, you'd just implement the one or two methods that were actually needed (once you guessed which ones those were) and you were done.

    That's where mixins come in. Since ABCs are just classes, and mixins are just classes, there's nothing stopping an ABC from providing default implementations for all kinds of methods based on the subclass's implementations of a few required methods.

    If you inherit from Sequence, you only have to implement __getitem__ and __len__, and the ABC fills in all the rest of the methods for you. You can override them if you want (e.g., if you can provide a more efficient implementation for __contains__ than iterating the whole sequence), but you usually don't need to. A complete Sequence is under 10 lines of code for the author, but provides the entire interface of tuple for the user.
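
    For example, here's a complete (read-only) sequence of square numbers (a made-up class, but it shows how little you have to write):

        from collections.abc import Sequence

        class Squares(Sequence):
            def __init__(self, n):
                self._n = n
            def __len__(self):
                return self._n
            def __getitem__(self, index):
                if not 0 <= index < self._n:
                    raise IndexError(index)
                return index * index

        s = Squares(10)
        print(49 in s)            # __contains__ comes from the mixin
        print(s.index(49))        # so does index(): prints 7
        print(list(reversed(s)))  # and __reversed__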

    Of course there are other ways Python could provide similar convenience besides mixins (e.g., see functools.total_ordering, a decorator that similarly lets you define just two methods and get a complete suite implemented for you), but mixins work great for this case.

    Isn't it conceptually wrong for an interface (which you inherit for subtyping) and a mixin (which you inherit for implementation) to be the same class? Well, it's not exactly "pure OO", but then practicality beats purity. Sure, Python could have a SequenceABC and a SequenceMixin and make you inherit both separately when you want both, but when are you going to want SequenceMixin without also wanting SequenceABC? Merging them is very often helpful, and almost never a problem.

  4. Background

    Currently, CPython’s internal bytecode format stores instructions with no args as 1 byte, instructions with small args as 3 bytes, and instructions with large args as 6 bytes (actually, a 3-byte EXTENDED_ARG followed by a 3-byte real instruction). While bytecode is implementation-specific, many other implementations (PyPy, MicroPython, …) use CPython’s bytecode format, or variations on it.

    Python exposes as much of this as possible to user code. For example, you can write a decorator that takes a function’s __code__ member, access its co_code bytecode string and other attributes, build a different code object, and replace the function’s __code__. Or you can build an import hook that takes the top-level module code returned by compile (and all the code objects stored in co_consts for the top-level object or recursively in any of the nested objects), builds up a different one, and returns that as the module’s actual code.
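
    For example (the exact bytes you see will differ between CPython versions, which is exactly the problem):

        def f(x):
            return x + 1

        code = f.__code__
        print(code.co_code)     # the raw, implementation-specific bytecode bytes
        print(code.co_consts)   # the constants LOAD_CONST refers to, e.g. the literal 1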

    And this isn’t just a nifty feature for playing around with Python’s internals without having to hack up the interpreter; there’s real-life production code that depends on doing this kind of stuff to code objects.

    Making CPython bytecode format less brittle

    There have been a few proposals to change the internal format: e.g., “wordcode” stores most instructions as 2 bytes (using EXTENDED_ARGS for 4 or 6 bytes when needed), while “packed ops” steals bits from the opcode to store very small args for certain ops (e.g., LOAD_CONST_2 can be stored as 1 byte instead of 3). Such proposals would obviously break all code that depends on bytecode. Guido suggested on python-dev that it would need two Python versions (so 3.7, if we finished wordcode today) before we could release such a breaking change; I’m not sure even that would be enough time.

    Similarly, while some other implementations use essentially the same format as CPython, they’re not all actually compatible. For example, for compactness, MicroPython makes most non-jump instructions 2 bytes instead of 3, which breaks almost any code that even inspects bytecode, much less tries to modify it.

    One of the replies to Guido suggested, “Maybe we should have a standard portable assembly format for bytecode”. If we had that, it would remain consistent across most internal changes, even as radical as bytecode to wordcode, and would be portable between CPython and MicroPython. If we could get such a format into 3.6, I think it would definitely be reasonable for a radical change like wordcode to go into 3.7 as Guido suggests—and maybe even for any future changes to take place over a single version instead of two.

    Making bytecode processors easier to write

    Separately, as part of the discussion on PEP 511, which adds a new API for directly processing code objects, I suggested that it would be a lot simpler to write such processors, and also simpler for the compiler, if it used some “unpacked bytecode” format that’s easier to work with.

    The main objection I heard to that idea was that any such format would make Python harder to change in the future—but once you consider that the obvious format to use is effectively a bytecode assembly language, it becomes clear that the opposite is true: we’re making Python easier to change in the future, along with making it simpler to work with.

    The other objection is that there are just too many possible designs to choose from, so it’s way too early to put anything in the stdlib. But there really aren’t many possible designs to choose from.

    dis

    Python already has a standard bytecode assembly language format, as exposed by the dis module: a dis.Bytecode object represents the assembly-language version of a code object, as an iterable of dis.Instruction objects instead of just raw bytes (plus a few extra attributes and methods). It abstracts away things like variable instruction size (including EXTENDED_ARG), mapping indices into co_const into actual constant values, etc.
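
    For example, you can already iterate that instruction-level view today:

        import dis

        def f(x):
            return len(x) + 1

        # Bytecode accepts a function, method, code object, or source string
        for instr in dis.Bytecode(f):
            print(instr.offset, instr.opname, instr.argval)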

    This is almost exactly what we want, for both purposes—but there are two problems with it.

    First, there’s no assembler to get back from Bytecode back to code. Even if there were, Bytecode is immutable. And, even if it were mutable, Instruction objects have to be constructed out of a bunch of stuff that is either meaningless or needlessly difficult to deal with when you’re modifying or creating code. Most of that stuff is also useless to the assembler and could just be left at None, but the big one is the argval for each jump-like Instruction: it’s an offset into the raw byte string, not something meaningful at all in the Bytecode object. This means that if you add or remove any bytecodes, all the jumps have to be fixed up.

    Second, the dis module is written in Python. To make it usable in the compiler, the peephole optimizer, and any C-API bytecode processors, we need something accessible from C. We could, of course, rewrite most of dis in C, move it to builtins, and give it a C API, but that’s a lot of work—and it would probably discourage further improvements to dis, and it might even discourage other Python implementations from including it.

    Proposal

    As mentioned above, a Bytecode is basically an iterable of tuples. And an iterable of tuples is something that you can already handle from the C API.

    Meanwhile, what should jump arguments be? The simplest change is to make them references to other instructions. There are a couple of minor questions on that, which I’ll get to in the appropriate disassembly section.

    And that’s really all we need to make the dis format the standard, portable assembly format, and remove the need for most code (including bytecode-based optimizers, code-coverage tools, etc.) to depend on fragile and difficult-to-process raw bytecode.

    PyCode_Assemble

    PyCode_Assemble(instructions, name, filename, firstlineno)
    

    This function (in Include/code.h and compile.c) takes any iterable of tuples of 2 or more elements each, where the first three elements are opcode, argval, and line, and returns a code object. While assembling, it removes any existing EXTENDED_ARG and NOP instructions (although it will of course insert new EXTENDED_ARG as needed).

    Note that a Bytecode is an iterable of Instructions, which are namedtuples, so (assuming we reorganize the attributes of Instruction, and change jump targets to be references to Instructions instead of offsets) you can call this with a Bytecode object. But it’s usually simpler to just call it with a generator.

    The implementation for this new function would be somewhat complex if we were writing it from scratch—but the assemble function in compile.c already does almost exactly what we want, and the small changes to it that we need are not at all complex.

    Do we need a corresponding PyCode_Disassemble? It would be pretty simple, but I can’t think of any code in C, either today or in the future, that will ever need to disassemble a code object. (Well, you might want to write a third-party optimizer in C, but in that case, you can import the dis module from your code and use it.) And we already have a nice disassembler in pure Python. So… no C API for disassembly.

    PyCode_Optimize

    This function currently takes a bytecode bytestring, and in-out lists for consts, names, and lnotab, and returns a new bytestring. The implementation is the peephole optimizer.

    PEP 511 proposes to change it to take a code object and a context object (a collection of flags, etc.), and return a new code object. The implementation calls the code_transformer() method of every registered code transformer, which have the same interface, and then calls the peephole optimizer.

    I propose to instead change it to take an iterable of tuples and return the same, to have PEP 511 transformer objects also use the same interface, and to rewrite the peephole optimizer to do so as well.

    compile.c

    Currently: The compiler converts each AST node into a block of instruction objects, where the instructions are pretty similar to the format used by the dis module (opcode and argument value), but with one major difference: each jump target is a pointer to a block, rather than an offset into the (not-yet-existing) bytecode. The compiler flattens this tree of blocks into a linked list, then calls assemble. The assemble function does a few passes over each instruction of each block and ends up with most of the attributes of a code object. It then calls PyCode_Optimize with four of those attributes, and then builds a code object from the result.

    Instead, the compiler will turn its linked list of arrays of instruction objects into a flat list of instruction tuples, converting each jump target from a block pointer to an instruction pointer along the way, and introducing the NOP label and setlineno pseudo-ops described below (which is very easy—each block starts with a label, and each instruction that starts a block or has a different line from the last instruction is a setlineno). It then calls PyCode_Optimize with that list, and then calls PyCode_Assemble (which replaces the existing assemble) with the result.

    dis.Opcode

    Currently, the dis module has 101 integer constants, opname and opmap to map back and forth between integer opcodes and their names, and each Instruction has both opcode and opname attributes.

    While we’re improving dis, we might as well make an IntEnum type called Opcode to wrap this up. We can even throw has_jrel and so forth on as methods. But for backward compatibility, we’ll keep the two mappings, the extra attribute, and the various lists and sets of raw opcodes too.

    dis.assemble

    The new dis.assemble function just exports PyCode_Assemble to Python in the obvious way.

    dis._get_instructions_bytes

    The core of the dis module is a function called _get_instructions_bytes. This takes a bytecode string, plus a bunch of optional parameters for things like the consts and varnames lists, and generates Instruction objects. (If the optional parameters are missing, it creates fake values—e.g., a LOAD_FAST 1 with no varnames gets an argval="1". Which is not at all useful for processing, but can occasionally be useful for dumping to the REPL, e.g., when you're trying to debug a bytecode processor.)

    We need to make a few changes to this function.

    First, we add a new parameter, raw, both to this function and to all the functions that call it.

    If raw is false, all EXTENDED_ARG and NOP opcodes are skipped during disassembly. (If you’re planning to modify and reassemble, they’re at best useless, and can be confusing.)

    As mentioned earlier, jump arguments need the target instruction in the argval, not an offset. If raw, this is the actual instruction; otherwise, a new label pseudo-instruction (NOP, labelcount++) is generated for the jump target, and later inserted before the actual instruction.

    Also, rather than having an optional starts_line, we'll add line to every instruction; an instruction starts a line iff its line is present and different from the previous instruction's. Also, if raw is false, we'll emit pseudo-instructions (NOP, None, line) for each new line. (This allows people to modify instructions in-place without having to worry about whether they're replacing a line start.)

    While we're at it, it's almost always simpler to not worry about which instructions start a new line while dealing with bytecode. So, _get_instructions_bytes can generate an (NOP, None, line) for each entry in the lnotab, and then users can leave the line numbers off any real instructions they generate without having to worry about breaking the lnotab. But that too can get in the way for really short functions, so maybe it should also be optional. Or maybe just controlled by the same raw flag.

    dis.Instruction

    The Instruction class just needs to add a line attribute (which can be None), rearrange its attributes so (opcode, argval, line) come first, and make all attributes after the second optional (with default values of None).

    In fact, most of the other existing attributes aren’t useful when assembling, and aren’t useful when processing bytecode for other purposes, unless you’re trying to replace the high-level stringifying functions. But there’s no good reason to get rid of them; as long as people don’t have to specify them when creating or modifying instructions, they don’t get in the way. We could mark them deprecated; I’m not sure whether that’s worth it.

    dis.Bytecode

    Bytecode.assemble(self) just calls assemble, passing itself as the instructions, and the name, filename, and line number (if any) that it stored (directly or indirectly) at construction.

    The second public change is to the constructor. The constructor currently looks like this:

    Bytecode(x, *, first_line=None, current_offset=None)
    

    That x can be a code, function, method, generator, or str (the last is handled by calling compile(x)). The constructor stores x for later, and digs out the first line number and current offset for later unless overridden by the optional arguments, and also digs out the code object both to store and to disassemble.

    We’ll add a couple more optional parameters to override the values pulled from the first:

    Bytecode(x, *, name=None, filename=None, first_line=None, current_offset=None)
    

    (This means we also need to store the name and filename, as we do with the other two, instead of pulling them out of the appropriate object on demand.)

    And we’ll add two more “overloads” for the x parameter: It can be a Bytecode, in which case it’s just a shallow copy, or any (non-str, non-Bytecode) iterable, in which case it just stores [Instruction(Opcode(op), *rest) for (op, *rest) in it].

    The third public change is that Bytecode becomes a MutableSequence. I haven’t found any examples where I wanted to mutate a Bytecode; writing a non-mutating generator seems to be always easier, even when I need two passes. However, everyone else who’s written an example of a code generator does it by iterating indexes and mutating something in-place, and I can’t see any good reason to forbid that, so let’s allow it. This means that internally, Bytecode has to store the instructions in a list at construction time, instead of generating them lazily in __iter__.

    Examples

    Constify

    The following function takes an iterable of instruction tuples, and gives you back an iterable of instruction tuples where every LOAD_GLOBAL is replaced by a LOAD_CONST with the current value of that global:

    def constify(instructions):
        for opcode, argval, *rest in instructions:
            if opcode == dis.LOAD_GLOBAL:
                yield (dis.LOAD_CONST, eval(argval, globals()))
            else:
                yield (opcode, argval)
    

    Or, if you prefer doing it mutably:

    def constify(instructions):
        bc = dis.Bytecode(instructions)
        for i, instr in enumerate(bc):
            if instr.opcode == dis.LOAD_GLOBAL:
                bc[i] = (dis.LOAD_CONST, eval(instr.argval, globals()))
        return bc
    

    Either way, you can use it in a decorator:

    def constify_func(func):
        code = func.__code__
        instructions = constify(dis.raw_disassemble(code))
        func.__code__ = dis.assemble(instructions,
                                     code.co_name, code.co_filename, code.co_firstlineno)
        return func
    

    … or a PEP 511 processor:

    def code_transformer(self, instructions, context):
        if context.some_flag:
            return constify(instructions)
        else:
            return instructions
    

    Dejumpify

    For something a little more complicated, the first example everyone gives me of why labels are insufficient: let’s flatten out double-jumps.

    This is one of the rare examples that’s easier without labels:

    def _dejumpify(code):
        bc = dis.Bytecode(code, raw=True)
        for op, arg, *rest in bc:
            if op.hasjump:
                if arg.opcode in (dis.JUMP_FORWARD, dis.JUMP_ABSOLUTE):
                    yield (op, arg.arg, *rest)
                    continue
            yield (op, arg, *rest)
    
    def dejumpify(func):
        code = func.__code__
        func.__code__ = dis.assemble(
            _dejumpify(code),
            code.co_name, code.co_filename, code.co_firstlineno)
        return func
    

    … but it’s still not at all hard (or inefficient) with labels:

    def _dejumpify(code):
        bc = dis.Bytecode(code)
        indices = {instr: i for i, instr in enumerate(bc)}
        for op, arg, *rest in bc:
            if op.hasjump:
                target = bc[indices[arg]+1]
                if target.opcode in (dis.JUMP_FORWARD, dis.JUMP_ABSOLUTE):
                    yield (op, target.arg, *rest)
                    continue
            yield (op, arg, *rest)
    

    Reorder blocks

    For something a little more complicated, what if we wanted to break the code into blocks, reorder those blocks in some way that minimizes jumps, then relinearize? Surely we'd need blocks then, right? Sure. And here's how we get them:

    def blockify(instructions):
        block = []
        for instruction in instructions:
            if instruction.is_jump_target:
                yield block
                block = []
            block.append(instruction)
        yield block
    

    (That could be shorter with itertools, but some people seem to find such code confusing, so I wrote it out.)

    So:

    def minimize_jumps(instructions):
        blocks = list(blockify(instructions))
        # put them in a graph and optimize it or whatever...
        # put them back in a flat iterable of blocks
        return itertools.chain.from_iterable(blocks)
    

    Backward compatibility and third-party libs

    First, the fact that Instruction.argval is now an instruction rather than an integer offset for jump instructions is a breaking change. I don’t think any code actually processes argval; if you need offsets, you pretty much need the raw arg. But I could be wrong. If I am, maybe it’s better to add a new argument or argument_value attribute, and leave argval with its old meaning (adding it to the deprecated list, if we’re deprecating the old attributes).

    If this is added to Python 3.6, we almost certainly want a backport to 3.4-5. I don’t know if it’s too important for that backport to export the C API for PyCode_Assemble, but that’s not hard to add, so why not. Backporting to 2.6+/3.2+ would be great, but… is there a backport of the 3.4 version of dis? If so, adding to that shouldn’t be too hard; if not, someone has to write that part first, and I’m not volunteering. :)

    Code that only inspects bytecode, without modifying it, like coverage.py, shouldn’t need any change. If that code is already using dis in 3.4+, it can continue to do so—and most such code would then work even if Python 3.7 changed to wordcode. If that code isn’t using dis, well, nothing changes.

    Code that significantly modifies bytecode often relies on the third-party module byteplay. We definitely want to allow such code to continue working. Ideally, byteplay will change to just use dis for assembly in 3.6+ (the way it changed to use dis for various other bits in 3.4+), meaning it’ll no longer be a stumbling block to upgrading to new versions of Python. This would be a very easy change to byteplay.

    The only other utility module I’ve found for dealing with bytecode that isn’t stuck in Python 2.4 or something is codetransformer. I haven’t looked at it in as much detail as byteplay, and I don’t know whether it could/should be changed to use the dis assembler in 3.6+.

    Victor Stinner is in the middle of writing a new module called bytecode, as an attempt at working out what would be the best API for rewriting the peephole optimizer line-by-line in Python. I'm willing to bet that whatever API he comes up with could be trivially built as a wrapper on top of dis, if dis isn't sufficient in itself. (I'm not sure why you'd want to rewrite the peephole optimizer line-by-line instead of doing the work at a higher level, but presumably the hope is that whatever he comes up with would also be useful for other code.)


  5. If you want to skip all the tl;dr and cut to the chase, jump to Concrete Proposal.


    Why can’t we write list.len()?

    Python is an OO language. To reverse a list, you call lst.reverse(); to search a list for an element, you call lst.index(). Even operator expressions are turned into OO method calls: elem in lst is just lst.__contains__(elem).

    But a few things that you might expect to be done with methods are instead done with free functions. For example, to get the length of a list (or array or vector or whatever) in different OO languages:1

    • lst.length (Ruby)
    • lst.size() (C++)
    • lst size (Smalltalk)
    • [lst count] (Objective C)
    • lst.count (Swift)
    • lst.length (JavaScript)
    • lst.size() (Java)
    • len(lst) (Python)

    One of these things is not like the others.2 In Python, len is a free function that you call with a collection instance, not a method of collection types that you call on a collection instance. (Of course under the covers, that distinction isn’t as big as it appears, which I’ll get back to later.)

    And the list of things that are methods like lst.index(elem) vs. free functions like len(lst) seems a bit haphazard. This often confuses novices, while people who move back and forth between Python and another language they’re more familiar with tend to list it as an example of “why language XXX is better than Python”.

    As you might expect, this is for historical reasons. This is covered in the official design FAQ:

    The major reason is history. Functions were used for those operations that were generic for a group of types and which were intended to work even for objects that didn’t have methods at all (e.g. tuples).

    In modern Python, this explanation seems a bit weird—tuple has methods like tuple.index(), just like list, so why couldn’t it have a len method too? Well, Python wasn’t originally as consistently OO as it is today. Classes were a completely separate thing from types (the type of an instance of class C wasn’t C, but instance),3 and only a handful of types that needed mutating methods, like list.reverse(), had any methods, and they were implemented very differently from methods of classes.

    So, could this be fixed? Sure. Should it be? I’m a lot less sure. But I think it’s worth working through the details to understand what questions arise, and what it would take to make this work, before deciding whether it’s worth considering seriously.

    Dunder methods

    First, an aside.

    In Python, most generic free functions, like operators, actually do their dispatch via a dunder-method protocol—that is, by calling a method named with double-underscore prefix and suffix on the first argument. For example, len(x) calls x.__len__(). So, couldn’t we just rename __len__ to len, leaving the old name as a deprecated legacy alternative (much like, in the other direction, we renamed next to __next__)? Or, if that doesn’t work, make lst.len() automatically fall back to lst.__len__()?

    Unfortunately, that would only work for a small handful of functions; most are more complicated than len. For example, next(i) just calls i.__next__(), but next(i, None) tries i.__next__() and returns None on StopIteration. So, collapsing next and __next__ would mean that every iterator type now has to add code to handle the default-value optional parameter, which would be a huge waste of code and effort, or it would mean that users could no longer rely on the default-value argument working with all iterators. Meanwhile, bytes(x) tries the buffer protocol (not implementable in Python, but you can inherit an implementation from bytes, bytearray, etc.), then the __bytes__ method, then the __iter__ protocol, then the old-style sequence __getitem__ protocol. Collapsing bytes and __bytes__ would mean none of those fallbacks happen. And int() has both fallback protocols, and an optional parameter that changes which protocols are tried. And so on. So, this idea is a non-starter.
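
    To make that concrete, here's roughly what next with a default does today (a Python sketch, not CPython's actual C implementation):

        _MISSING = object()

        def next_(it, default=_MISSING):
            try:
                # special-method lookup goes through the type, not the instance
                return type(it).__next__(it)
            except StopIteration:
                if default is _MISSING:
                    raise
                return default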

    C++

    C++ isn’t usually the best language to turn to for inspiration; any C++ ideas that make sense for Python probably also made sense for C#, Swift, Scala, or some other language, and the version in those other languages is usually closer to what Python wants.

    But in this case, C++ is one of the few languages that, for its own historical reasons,4 haphazardly mixes free functions and methods. For example, to get the length of a container, you call the cont.size() method, but to get the index of an element, you call find(cont, elem) (the exact opposite of Python, in these two cases). There was originally a principled distinction, but the first two decades of the language and what would become its standard library changing back and forth to support each other led to a mishmash.

    It’s become part of the “modern C++ dogma” that free functions are part of a type’s interface, just like public members. But some things that are part of a type’s interface, you call function-style; others, you call dot-style—and, because there’s no rhyme or reason behind the distinction, you just have to memorize which are which. And, when designing types, and protocols for those types to interact with, you have to pick one arbitrarily, and be careful to be consistent everywhere.

    Over the years, people have tried to come up with solutions to fix this inconsistency for C++. Most of the early attempts involved Smalltalk-style “extension methods”: reopening a type (including “native types” like int and C arrays) to add new methods. Then, you could just reopen the C array type template and add a begin method, so lst.begin() now works no matter what type lst is. Unfortunately, fitting that into a static language like C++ turned out to be difficult.5

    In 2014, there were two new proposals to solve this problem by unifying call syntax. Herb Sutter’s N4165 and Bjarne Stroustrup’s N4174 both have interesting motivation sections that are at least partly relevant to Python and other languages, including some of the areas where they differ. Starting with N4474, they switched to a combined proposal, which gets more into C++-specific issues. At any rate, while I’m not sure which one is right for C++, I think Sutter’s definitely is for Python,6 so that’s what I’m going to summarize here.

    The basic idea is simple: the call syntax x.f(y) tries to look up the method x.f and call it on (y), but falls back to looking up the function f and calling it on (x, y). So, for example, you could write lst.len(); that would fail to find a list.len method, and then try the len function, which would work.

    Python

    So, the basic idea in Python is the same: x.f(y) tries to look up the method x.f, and falls back to the function f.

    Of course Python has to do this resolution at runtime, and it has only generic functions rather than separate automatically-type-switched overloads (unless you use something like singledispatch to implement overloads on top of generic functions)—but neither of those differences turns out to be a problem.

    There are also a number of details that are problems, but I think they’re all solvable.

    Locals

    When lst.len() fails to find a len method on the lst object, we want it to fall back to calling len(lst), so we need to do a lookup on len.

    In Python, the lookup scope is resolved partly at compile time, partly at runtime. See How lookup works for details, but I can summarize the relevant parts inline here: If we’re inside a function body, if that function or an outer nested function has a local variable named len, compile a local or dereferenced lookup (don’t worry about the details for how it decides which—they’re intuitive, but a bit long to explain here, and not relevant); otherwise, compile a global-or-builtin lookup.

    This sounds complicated, but the compiler is already doing this every time you type len(lst) in a function, so making it do the same thing for fallback is pretty simple. Essentially, we compile result = lst.len() as if it were:

    try:
        _meth = lst.len
    except AttributeError:
        result = len(lst)
    else:
        result = _meth()
        del _meth
    

    What raises on failure?

    With the implementation above, if neither lst.len nor len exists, we’re going to end up raising a NameError. We probably wanted an AttributeError. But that’s easy to handle:

    try:
        _meth = lst.len
    except AttributeError:
        try:
            result = len(lst)
        except NameError:
            raise AttributeError(...)
    else:
        result = _meth()
        del _meth
    

    (We don’t have to worry about lst raising a NameError, because if it doesn’t exist, the first lst.len would have raised that instead of AttributeError.)

    I’ll ignore this detail in the following sections, but it should be obvious how to add it back in to each version.

    Method objects

    In Python, functions and methods are both first-class objects. You might not actually call lst.len(), but instead store lst.len to call later, or pass it to some other function. So, how does that work?

    We definitely don’t want to put off resolution until call time. That would mean lst.len has to become some complicated object that knows how to look up "len" both on lst and in the creating scope (which would also mean if len were a local it would have to become a cellvar, and so on).

    So, what we need to do is put the fallback before the call, and make it give us a method object. Like this:

    try:
        meth = lst.len
    except AttributeError:
        meth = make_method_out_of(len, lst)
    # and then the call happens later:
    meth()
    

    As it turns out, this is pretty much how bound methods already work. That made-up make_method_out_of(len, lst) exists, and is spelled len.__get__(lst, type(lst)). The standard function type has a __get__ method that does the right thing. See the Descriptor HowTo Guide for more details, but you don’t really need any more details here. So:

    try:
        meth = lst.len
    except AttributeError:
        meth = len.__get__(lst, type(lst))
    

    The only problem here is that most builtin functions, like len, aren’t of function type, and don’t have a __get__ method.7 One obvious solution is to give normal builtin functions the same descriptor method as builtin unbound methods.8 But this still has a problem: you can construct custom function-like objects in Python just by defining the __call__ method. But those functions won’t work as methods unless you also define __get__. Which isn’t much of a problem today—but might be more of a problem if people expect them to work for fallback.
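
    You can see the difference today (this is just current CPython behavior, nothing to do with the proposal):

    def pylen(x):
        return len(x)

    lst = [1, 2, 3]
    print(pylen.__get__(lst, list)())   # 3: plain Python functions are descriptors
    print(hasattr(len, '__get__'))      # False (in CPython, at least as of this writing)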

    Alternatively, we could make the fallback code build explicit method objects (which is what the normal function.__get__ does already):

    try:
        meth = lst.len
    except AttributeError:
        try:
            meth = len.__get__(lst, type(lst))
        except AttributeError:
            meth = types.MethodType(len, lst)
    

    I’ll assume in later sections that we decide this isn’t necessary and just give builtins the descriptor methods, but it should be obvious how to convert any of the later steps.

    Things are slightly more complicated by the fact that you can call class methods on a class object, and look up instance methods on a class object to call them later as unbound methods, and store functions directly in an object dict and call them with the same dot syntax, and so on. For example, we presumably want list.len(lst) to do the same thing as lst.len(). The HowTo covers all the details of how things work today, or How methods work for a slightly higher-level version; basically, I think it all works out, with either variation, but I’d have to give it more thought to be sure.

    There have been frequent proposals to extend the CPython implementation with a way to do “fast method calls”—basically, x.f() would use a new instruction to skip building a bound method object just to be called and discarded, and instead directly do what calling the bound method would have done. Such proposals wouldn’t cause any real problem for this proposal. Explaining why requires diving into the details of bytecode and the eval loop, so I won’t bother here.

    What about set and delete?

    If x.f() falls back to f.__get__(x, type(x))(), then shouldn’t x.f = 3 fall back to f.__set__(x, 3)?

    No. For one thing, that would mean that, instead of hooking lookup, we’d be hooking monkeypatching, which is a very silly thing to do. For another, if x has no member named f, x.f = 3 doesn’t fail, it just creates a member, so fallback wouldn’t normally happen anyway. And of course the same goes for deletion.

    Data members

    We want to be able to write x.f() and have it call f(x). But when we’re dealing with data rather than a method, we probably don’t want to be able to write x.sys and have it return a useless object that looks like a method but, when called, tries to do sys(x) and raises a TypeError: 'module' object is not callable.

    Fortunately, the way methods work pretty much takes care of this for us. If you try to construct a types.MethodType with a non-callable first argument, it raises a TypeError (which we can handle and turn into an AttributeError), right there at construction time, not later at call time.
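
    A quick demonstration (again, this is existing behavior, not something new):

    import sys
    import types

    try:
        types.MethodType(sys, [])       # sys is a module, not a callable
    except TypeError as e:
        print(e)                        # complains that the first argument must be callable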

    Or, alternatively, we could just make fallback only consider non-data descriptors; if it finds a non-descriptor or a data descriptor, it just raises the AttributeError immediately.

    Namespaces

    In C++, when you write f(x), the compiler looks for functions named f in the current namespace, and also in the namespace where the type of x is defined. This is the basic idea behind argument-dependent lookup. For example:

    // things.h
    namespace things {
        class MyThing { ... };
        void myfunc(MyThing) { ... }
    }
    
    // stuff.cpp
    #include <things.h>
    namespace stuff {
        void stuffy() {
            things::MyThing thing;
            things::myfunc(thing); // works
            myfunc(thing); // works, means the same
            thing.myfunc(); // works under N4165
        }
    }
    

    In Python, the equivalent doesn’t work:

    # things.py
    class MyThing: ...
    def myfunc(thing: MyThing): ...
    
    # stuff.py
    import things
    def stuffy():
        thing = things.MyThing()
        things.myfunc(thing) # works
        myfunc(thing) # NameError
        thing.myfunc() # AttributeError even with this proposal
    

    You can’t just write myfunc(thing), and with the proposed change, you still can’t write thing.myfunc().

    So, does that make the idea useless? No. Some of the most important functions are builtins. And some important third-party libraries are meant to be used with from spam import *.

    But still, it does mean the idea is less useful.

    Could Python implement a simple form of argument-dependent lookup, only for dot-syntax fallback? Sure; it’s actually pretty simple. One option is:

    try:
        meth = lst.len
    except AttributeError:
        try:
            meth = len.__get__(lst, type(lst))
        except NameError:
            mod = sys.modules[type(lst).__module__]
            meth = mod.len.__get__(lst, type(lst))
    

    This extends the LEGB (local, enclosing, global, builtin) rule to LEGBA (…, argument-dependent) for dot-syntax fallback. If the compiler emitted a FAST or DEREF lookup, we don’t need to worry about the A case; if it emitted a GLOBAL lookup, we only need to worry about it if the global and builtins lookups both failed, in which case we get a NameError, and we can then do an explicit global lookup on the module that type(lst) was defined in (and if that too fails, then we get a NameError).

    It isn’t quite this simple, because the module might have been deleted from sys.modules, etc. There are already rules for dealing with this in cases like pickle and copy, and I think applying the same rules here would be perfectly acceptable—or, even simpler, just implement it as above (except obviously turning the KeyError into an AttributeError), and you just can’t use argument-dependent lookup to fall back to a function from a module that isn’t available.

    Again, I’ll assume we aren’t doing argument-dependent lookup below, but it should be obvious how to change any of the later code to deal with it.

    Bytecode details

    We could actually compile the lookup and fallback as a sequence of separate bytecode instructions. So, instead of a simple LOAD_ATTR, we’d do a SETUP_EXCEPT, and handle AttributeError by doing a LOAD_FAST, LOAD_DEREF, or LOAD_GLOBAL. This gets a bit verbose to read, and could be a performance problem, but it’s the smallest change, and one that could be done as a bytecode hack in existing Python.

    Alternatively, we could add new bytecodes for LOAD_ATTR_OR_FAST, LOAD_ATTR_OR_DEREF, and LOAD_ATTR_OR_GLOBAL, and do all the work inside those new opcode handlers. This gets a bit tricky, because we now have to store the name in two arrays (co_names for the ATTR lookup and co_varnames for the FAST lookup) and somehow pass both indices as arguments to the opcode. But we could do that with EXTENDED_ARGS. Or, alternatively, we could just make LOAD_ATTR_OR_FAST look in co_varnames, and change the definition of co_names (which isn’t really publicly documented or guaranteed) so that instead of having globals, attributes, etc., it has globals, attributes that don’t fall back to local/free/cell lookups, etc.

    It might seem like the best place to do this would be inside the LOAD_ATTR opcode handler, but then we’d need some flag to tell it whether to fall back to fast, deref, or global, and we’d also need some way for code that’s inspecting bytecode (like dis or inspect) to know that it’s going to do so, at which point we’ve really done the same thing as splitting it into the three opcodes from the previous paragraph.

    A simpler version of any of these variants is to always fall back to LOAD_NAME, which does the LEGB lookup dynamically, by looking in the locals() dict and then falling back to a global lookup. Since most fallbacks are probably going to be to globals or builtins, and LOAD_NAME isn’t that much slower than LOAD_GLOBAL, this doesn’t seem too bad at first—but it is slower. Plus, this would mean that every function that has any attribute lookup and any local or free variables has to be flagged as needing a locals() dict (normally, one doesn’t get built unless you ask for it explicitly). Still, it might be acceptable for a proof of concept implementation—especially as a variation on the first version.

    Lookup overrides

    In Python, almost everything is dynamic and hookable. Even the normal attribute lookup rules. When you write x.f, that’s handled by calling x.__getattribute__('f'),9 and any type can override __getattribute__ to do something different.
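
    For example, here’s a type that logs every lookup (standard behavior today):

    class Logged:
        def __getattribute__(self, name):
            print('looking up', name)
            return object.__getattribute__(self, name)

    obj = Logged()
    obj.x = 1     # setting goes through __setattr__, not __getattribute__
    obj.x         # prints "looking up x", then evaluates to 1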

    So, if we want types to be able to control or even prevent fallback to free functions, the default fallback has to be handled inside object.__getattribute__. But, as it turns out (see the next section), that’s very complicated.

    Fortunately, it’s not clear it would ever be useful to change the fallback. Let’s say a type wanted to, say, proxy all methods to a remote object, except for methods starting with local_, where local_spam would be handled by calling spam locally. At first glance, that sounds reasonable—but if that spam falls back to something local, you have to ask, local to what?10 How can the type decide at runtime, based on local_spam, that the compiler should have done LEGB resolution on spam instead of local_spam at compile time, and then get that retroactively done? There doesn’t seem to be any reasonable and implementable answer.

    So, instead, we’ll just do the fallback in the interpreter: if normal lookup (as possibly overridden by __getattribute__) for x.f fails, do the f lookup.

    Introspection

    In Python, you can write getattr(x, 'f'), and expect to get the same thing as x.f.

    One option is to make getattr(x, 'f') not do the fallback and just make it raise. But that seems to defeat the purpose of getattr. It’s not just about using getattr (or hasattr) for LBYL-style coding, in which case you could argue that LBYL is always approximate in Python and if people don’t like it they can just suck it. The problem is that sometimes you actually have an attribute name as a string, and need to look it up dynamically. For example, many RPC servers do something like this:

    cmd, args = parse(message)
    try:
        getattr(handler, 'do_' + cmd)(*args)
    except AttributeError:
        handler.do_unknown_command(cmd, *args)
    

    If I wanted to add a do_help that works on all three of my handlers by writing it as a free function, I’d need getattr to be able to find it, right?

    I’m not sure how important this is. But if getattr needs fallback code, the obvious is something like this:

    try:
        return _old_getattr(x, 'f')
    except AttributeError as ex:
        frame = inspect.currentframe().f_back
        try:
            return frame.f_locals['f'].__get__(x, type(x))
        except KeyError:
            pass
        try:
            return frame.f_globals['f'].__get__(x, type(x))
        except KeyError:
            raise ex
    

    But doesn’t this have all the same problems I brought up earlier, e.g., with why we can’t expect people to usefully do this in __getattribute__, or why we can’t do it as the interpreter’s own implementation? Yes, but…

    The first issue is performance. That’s generally not important when you’re doing getattr. People write x.f() in the middle of inner loops, and expect it to be reasonably fast. (Although when they need it to be really fast, they write xf = x.f outside the loop; you shouldn’t normally have to do that.) People don’t write getattr(x, 'f') in the middle of inner loops. It already takes roughly twice as long as x.f for successful lookups, and even longer on failed lookups. So, I don’t think anyone would scream if it also took longer on successful fallback lookups.

    The second issue is portability. We can’t require people to use non-portable frame hacks to implement the portable __getattribute__ method properly. But we can use it internally, in the CPython implementation of a builtin. (And other implementations that do things differently can do whatever they need to do differently.)

    The final issue is correctness. If f is a local in the calling function, but hasn’t been assigned to yet, a LOAD_FAST on f is going to raise an UnboundLocalError, but locals()['f'] is going to just raise a KeyError and our code will fall back to a global lookup, which is bad. But fortunately, getattr is a builtin function. In CPython, it’s implemented in C, and uses the C API rather than the Python one. And in C, you can actually get to the parent frame’s varnames and locals array directly. Also, I’m not sure this would be a problem. It’s already true for dynamic lookup of locals()['x'] (and, for that matter, eval('x')), so why shouldn’t it also be true for dynamic lookup of getattr(f, 'x')?
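
    Here’s the existing behavior that last point is referring to:

    def demo():
        try:
            print(x)               # UnboundLocalError: x is a local here, but not bound yet
        except UnboundLocalError:
            pass
        print('x' in locals())     # False: dynamic lookup just doesn't see the unbound local
        x = 1                      # this assignment is what makes x a local
    demo()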

    So, I think this is acceptable here.

    At any rate, changing getattr automatically appropriately takes care of hasattr, the inspect module, and anything else that might be affected.

    C API

    As just mentioned above, the C API, used by builtins and extension modules, and by programs that embed the Python interpreter, is different from the Python API. The way you get the attribute x.f is by calling PyObject_GetAttrString(x, "f") (or, if you have 'f' as a Python string object f, PyObject_GetAttr(x, f)). So, does this need to fall back the same way as x.f?

    I don’t think it does. C code generally isn’t running inside a Python frame, and doesn’t even have Python locals, or an enclosing scope, and it only has globals if it manually adds them (and accesses them) as attributes of the module object it exports to Python. So, fallback to f(x) doesn’t even really make sense. Plus, C API functions don’t always need to be generic in the same way Python functions do, and they definitely don’t need to be as convenient.

    The one problem is that the documentation for PyObject_GetAttr explicitly says “This is the equivalent of the Python expression o.attr_name.” (And similarly for the related functions.) But that’s easily fixed: just add “(except that it doesn’t fall back to a method on o built from looking up attr_name as a local, global, etc.)” to that sentence in the docs. Or, alternatively, change it to “This is the equivalent of the Python expression o.__getattribute__('attr_name')“.

    This could be a minor headache for something like Cython, that generates C API code out of Pythonesque source. Without looking at how Cython handles local, free, and global variables under the covers, I’m not sure how much of a headache. But it certainly ought to be something that Cython can cope with.

    Concrete proposal

    The expression obj.attr will fall back to a bound method on obj built from function attr.

    Abstractly, this is done by handling AttributeError from __getattribute__ to look for a binding named attr and, if it’s bound to a non-data descriptor, calling attr.__get__(obj, type(obj)) and using that as the value. This means that implementations’ builtin and extension functions should now be non-data descriptors, just like normal functions. The getattr builtin does the same fallback, but it may handle unbound locals the same way that other dynamic (by-name) lookup functions and/or eval do on a given implementation.

    CPython

    The compiler will determine whether attr is a local, freevar, cellvar, or global, just as if the expression had been attr instead of obj.attr, and then emit one of the three new opcodes LOAD_ATTR_FAST, LOAD_ATTR_DEREF, or LOAD_ATTR_GLOBAL in place of LOAD_ATTR. These function like LOAD_ATTR, except that the name is looked up in the appropriate array (e.g., co_varnames instead of co_names for LOAD_ATTR_FAST), and that they add the fallback behavior described above. (The exact implementation of the ceval.c patch will probably be harder to describe in English than to just write.)

    Builtin functions gain a __get__ method that binds the function to an instance. (The details are again harder to describe than to write.)

    The getattr function uses the calling frame’s locals and globals to simulate fallback lookup, as described in a previous section (but implemented in C instead of Python, of course).

    No changes are made to PyObject_GetAttr and friends except for the note in the documentation about being the same as o.attr_name. No changes at all are made to any other public C APIs.

    Analysis

    I suspect that this change would, overall, actually make Python more confusing rather than less, and might also lead to code written by novices (especially those coming from other languages) being even more inconsistent than today.

    But it really is hard to guess in advance. So, is it worth building an implementation to see?

    Well, nobody’s going to run a patched interpreter that’s this different from standard Python on enough code to really get a feel for whether this works.

    On the other hand, a bytecode hack that can be applied to functions as a decorator, or to modules as an import hook (especially if it can hook the REPL, a la MacroPy) might get some people to play with it. So, even though that sounds like not much less work, and a lot less fun, it’s probably worth doing it that way first, if it’s worth doing anything.


    1. In some languages, arrays are special, not like other classes, and sometimes this is signaled by a special API. For example, in Java, to get the length of a “native array” you write lst.length, but to get the length of an instance of an array class, you write lst.size(). Most newer languages either do away with the “native types” distinction, like Ruby, or turn it into a user-extensible “value types” distinction, like Swift.
    2. OK, none of them are like the others—nobody can agree on the name, whether it’s a method or a property (except in those languages where nullary methods and properties are the same thing), etc. But you know what I mean: everyone else has len as a method or property; Python has it as a free function.
    3. In Python 2.1 and earlier, this was true for all classes. Python 2.2 added the option to make a class inherit from a type, like class C(object): pass, in which case it was a new-style class, and new-style classes are types. Python 3.0 made all classes new-style (by making them inherit from object if they don’t have any other bases). This adds a number of other benefits besides just unifying the notions of class and type. But there are a number of historical artifacts left behind–e.g., spam.__class__ vs. type(spam).
    4. … which you don’t want to know about, unless you understand or want to learn about fun things like argument-dependent lookup and the complex overload resolution rules.
    5. N1742 in 2004 seems to be the last attempt to make it work for C++. Building it into a new static language, on the other hand, is a lot easier; C#, Swift, etc. benefited from all of the failed attempts and just had to avoid making the same choices that had painted C++ into a corner where it couldn’t fit extensions in.
    6. In Python, f(x, y) is not considered part of the public interface of x’s type in the same way as in C++. Meanwhile, dynamic lookup makes f(x, y) auto-completion even worse. And TOOWTDI has a much stronger pull.
    7. This occasionally comes up even today. For example, you can write a global function def spam(self, arg), and then write spam = spam inside a class definition, or monkeypatch in C.spam = spam later, and now c.spam(2) works on all C instances. But if you try that with a builtin function, you get a TypeError: spam() takes exactly 2 arguments (1 given). This rarely comes up in practice, and the workarounds aren’t too onerous when they’re rare (e.g., spam = lambda self, arg: spam(self, arg)). But if this fallback became automatic, it would come up all the time, and with no easy workaround, making the fallback feature nearly useless.
    8. At the Python level, functions and unbound methods are the same type, which has a __get__ which creates a bound method when called on an object; bound methods have a __get__ that does nothing. But for C builtins, functions and bound methods are the same type (with functions being “bound” to their module), builtin_function_or_method, which has no __get__, while unbound methods have a different type, method_descriptor, which has a __get__ that works like the Python version. The smallest change would be to give builtin_function_or_method a __get__ that rebinds the method. It is occasionally useful that re-binding a Python bound method is a no-op (e.g., so you can do spam = delegate.spam in a class definition, and that delegate’s self isn’t overwritten), but I don’t think it would be a problem that re-binding a builtin bound method is different (since today, it raises an AttributeError, so no code could be relying on it).
    9. It’s actually a little more complicated than that. See How lookup works for the gory details, or just ignore the details; I’ll cover the bits that are relevant later.
    10. Of course it could use inspect.currentframe()—or, since it’s actually implemented in C, the C API equivalent—to access the locals of the caller. But that doesn’t help with the next part.

    View comments

  6. Many people, when they first discover the heapq module, have two questions:

    • Why does it define a bunch of functions instead of a container type?
    • Why don't those functions take a key or reverse parameter, like all the other sorting-related stuff in Python?

    Why not a type?

    At the abstract level, it's often easier to think of heaps as an algorithm rather than a data structure.

    For example, if you want to build something like nlargest, it's usually easier to understand that larger algorithm in terms of calling heappush and heappop on a list, than in terms of using a heap object. (And certainly, things like nlargest and merge make no sense as methods of a heap object—the fact that they use one internally is irrelevant to the caller.)

    And the same goes for building data types that use heaps: you might want a timer queue as a class, but that class's implementation is going to be more readable using heappush and heappop than going through the extra abstraction of a heap class.

    Also, even when you think of a heap as a data type, it doesn't really fit in with Python's notion of collection types, or most other notions. Sure, you can treat it as a sequence—but if you do, its values are in arbitrary order, which defeats the purpose of using a heap. You can only access it in sorted order by doing so destructively. Which makes it, in practice, a one-shot sorted iterable—which is great for building iterators on top of (like merge), but kind of useless for storing as a collection in some other object. Meanwhile, it's mutable, but doesn't provide any of the mutation methods you'd expect from a mutable sequence. It's a little closer to a mutable set, because at least it has an equivalent to add--but that doesn't fit either, because you can't conceptually (or efficiently, even if you do want to break the abstraction) remove arbitrary values from a heap.

    But maybe the best reason not to use a Heap type is the answer to the next question.

    Why no keys?

    Almost everything sorting-related in Python follows list.sort in taking two optional parameters: a key that can be used to transform each value before comparing them, and a reverse flag that reverses the sort order. But heapq doesn't.

    (Well, actually, the higher-level functions in heapq do--you can use a key with merge or nlargest. You just can't use them with heappush and friends.)

    So, why not?

    Well, consider writing nlargest yourself, or a heapsort, or a TimerQueue class. Part of the point of a key function is that it only gets called once on each value. But you're going to call heappush and heappop N times, and each time it's going to have to look at about log N values, so if you were applying the key in heappush and heappop, you'd be applying it about log N times to each value, instead of just once.

    So, the right place to put the key function is in whatever code wraps up the heap in some larger algorithm or data structure, so it can decorate the values as they go into the heap, and undecorate them as they come back out. Which means the heap itself doesn't have to understand anything about decoration.

    Examples

    The heapq docs link to the source code for the module, which has great comments explaining how everything works. But, because the code is also meant to be as optimized and as general as possible, it's not as simple as possible. So, let's look at some simplified algorithms using heaps.

    Sort

    You can easily sort objects using a heap, just by either heapifying the list and popping, or pushing the elements one by one and popping. Both have the same log-linear algorithmic complexity as most other decent sorts (quicksort, timsort, plain mergesort, etc.), but generally with a larger constant (and obviously the one-by-one has a larger constant than heapifying).

    def heapsort(iterable):
        heap = list(iterable)
        heapq.heapify(heap)
        while heap:
            yield heapq.heappop(heap)
    
    Now, adding a key is simple:
    def heapsort(iterable, key):
        heap = [(key(x), x) for x in iterable]
        heapq.heapify(heap)
        while heap:
            yield heapq.heappop(heap)[1]
    
    Try calling list(heapsort(range(100), str)) and you'll see the familiar [0, 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, ...] that you usually only get when you don't want it.

    If the values aren't comparable, or if you need to guarantee a stable sort, you can use [(key(x), i, x) for i, x in enumerate(iterable)]. That way, two values that have the same key will be compared based on their original index, rather than based on their value. (Alternatively, you could build a namedtuple around (key(x), x) then override its comparison to ignore the x, which saves the space for storing those indices, but takes more code, probably runs slower, and doesn't provide a stable sort.) The same is true for the examples below, but I generally won't bother doing it, because the point here is to keep things simple.
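
    Here's a minimal sketch of that index-decorated variant (heapsort_stable is just an illustrative name):

    import heapq

    def heapsort_stable(iterable, key):
        # the index breaks ties, so equal keys never fall through to comparing the values
        heap = [(key(x), i, x) for i, x in enumerate(iterable)]
        heapq.heapify(heap)
        while heap:
            yield heapq.heappop(heap)[-1]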

    nlargest

    To get the largest N values from any iterable, all you need to do is keep track of the largest N so far, and whenever you find a bigger one, drop the smallest of those N.
    def nlargest(iterable, n):
        heap = []
        for value in iterable:
            heapq.heappush(heap, value)
            if len(heap) > n:
                heapq.heappop(heap)
        return heap
    
    To add a key:
    def nlargest(iterable, n, key):
        heap = []
        for value in iterable:
            heapq.heappush(heap, (key(value), value))
            if len(heap) > n:
                heapq.heappop(heap)
        return [kv[1] for kv in heap]
    
    This isn't stable, and gives you the top N in arbitrary rather than sorted order, and there's lots of scope for optimization here (again, the heapq.py source code is very well commented, so go check it out), but this is the basic idea.

    One thing you might notice here is that, while collections.deque has a nice maxlen attribute that lets you just push things on the end without having to check the length and pop off the back, heapq doesn't. In this case, it's not because it's useless or complicated or potentially inefficient, but because it's so trivial to add yourself:
    def heappushmax(heap, value, maxlen):
        if len(heap) >= maxlen:
            heapq.heappushpop(heap, value)
        else:
            heapq.heappush(heap, value)
    
    And then:
    def nlargest(iterable, n):
        heap = []
        for value in iterable:
            heappushmax(heap, value, n)
        return heap
    

    merge

    To merge (pre-sorted) iterables together, it's basically just a matter of sticking their iterators in a heap, with their next value as a key, and each time we pop one off, we put it back on keyed by the next value:
    def merge(*iterables):
        iterators = map(iter, iterables)
        heap = [(next(it), i, it) 
                for i, it in enumerate(iterators)]
        heapq.heapify(heap)
        while heap:
            nextval, i, it = heapq.heappop(heap)
            yield nextval
            try:
                nextval = next(it)
            except StopIteration:
                pass
            else:
                heapq.heappush(heap, (nextval, i, it))
    
    Here, I did include the index, because most iterables either aren't comparable or are expensive to compare, so it's a bit more serious of an issue if two of them have the same key (next element).

    (Note that this implementation won't work if some of the iterables can be empty, but if you want that, it should be obvious how to do the same thing we do inside the loop.)
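
    (For completeness, here's a sketch of that fix: build the initial heap with the same try/except we use inside the loop.)

    import heapq

    def merge(*iterables):
        heap = []
        for i, it in enumerate(map(iter, iterables)):
            try:
                heap.append((next(it), i, it))
            except StopIteration:
                pass                  # an empty iterable just doesn't go into the heap
        heapq.heapify(heap)
        while heap:
            nextval, i, it = heapq.heappop(heap)
            yield nextval
            try:
                nextval = next(it)
            except StopIteration:
                pass
            else:
                heapq.heappush(heap, (nextval, i, it))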

    What if we want to attach a key to the values as well? The only tricky bit is that we want to transform each value of each iterable, which is one of the few good cases for a nested comprehension: use the trivial comprehension (key(v), v) for v in iterable in place of the iter function.
    def merge(*iterables, key):
        iterators = (((key(v), v) for v in iterable) for iterable in iterables)
        heap = [(next(it), i, it) for i, it in enumerate(iterators)]
        heapq.heapify(heap)
        while heap:
            nextval, i, it = heapq.heappop(heap)
            yield nextval[-1]
            try:
                nextval = next(it)
            except StopIteration:
                pass
            else:
                heapq.heappush(heap, (nextval, i, it))
    
    Again, there are edge cases to handle and optimizations to be had, which can be found in the module's source, but this is the basic idea.

    Summary

    Hopefully all of these examples show why the right place to insert a key function into a heap-based algorithm is not at the level of heappush, but at the level of the higher-level algorithm.

    View comments

  7. Currently, in CPython, if you want to process bytecode, either in C or in Python, it’s pretty complicated.

    The built-in peephole optimizer has to do extra work fixing up jump targets and the line-number table, and just punts on many cases because they’re too hard to deal with. PEP 511 proposes a mechanism for registering third-party (or possibly stdlib) optimizers, and they’ll all have to do the same kind of work.

    Code that processes bytecode in function decorators or import hooks, whether for optimization or for other purposes like extending the language, is usually written in Python rather than C, but it’s no simpler there, so such code usually has to turn to a third-party library like byteplay.

    Compile process

    All of the details are very well documented, but I’ll give a brief summary of the important parts, to make it easy to see where things hook in and what information is available there.

    When the Python interpreter starts, it reads interactive input, stdin, or a script file, and sets up a tokenizer. Then, it parses iteratively, generating an AST (abstract syntax tree) for each top-level statement. For each one:

    • Assuming PEP 511 is accepted, it will first pass the AST node through the chain of all registered AST processors.

    • Next, it calls compile on the node,1 which compiles it into an internal structure, assembles that into bytecode and associated values, passes that through the optimizer, and builds a code object.

      • As of Python 3.5, “the optimizer” means the built-in peephole optimizer; assuming PEP 511, it will mean the peephole optimizer plus the chain of all registered bytecode processors.
    • The code object is then passed to exec.

    Compiling a statement may recursively call the compiler. For example, a function definition first compiles the function body, then compiles the def statement into a statement code object that makes a function out of that function body’s code object. And functions and classes can of course be nested.

    The compiler can also be triggered at runtime, e.g., by a call to compile or exec with source or an AST–or, probably more commonly, by the import system. The default import loader for .py files calls compile on the entire file, which builds an AST for the entire file, then performs the same steps as done for each input statement, to build a module code object. However, an import hook can change this–most simply, by parsing the AST explicitly, processing the AST, then compiling the result, or by processing the bytecode resulting from compile.
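
    For example, here’s the runtime version of that pipeline, using nothing but the stdlib (this is essentially what such an import hook does for you):

    import ast

    tree = ast.parse("print(6 * 7)")            # source -> AST
    # ... an AST processor could rewrite `tree` here ...
    code = compile(tree, "<example>", "exec")   # AST -> code object
    exec(code)                                  # prints 42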

    It’s also possible to post-process code at runtime. A function decorator is just a normal function called with a function object–but that function object contains a code object, and a decorator can modify the function to use a different code object, or just return a new function.
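
    A trivial read-only version of that hook, just to show where the code object lives (show_code is an illustrative name, not an existing API):

    def show_code(func):
        # a decorator receives the finished function object; func.__code__ is its code
        # object, which a bytecode-processing decorator could replace with a new one
        print(func.__name__, 'has', len(func.__code__.co_code), 'bytes of bytecode')
        return func

    @show_code
    def example(x):
        return x + 1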

    Bytecode formats

    Bytecode and code objects

    Bytecode is, as the name implies, just a string of bytes.

    The peephole optimizer (and, in the current version of PEP 511, any plugin optimizers) receives this bytecode string, plus a few associated values as in-out parameters. (See PyCode_Optimize in the source.)

    Decorators and import hooks receive a code object, which includes the bytecode string and a much larger set of associated values, whose members are described in the inspect docs.

    Bytecode is nicely explained in the dis documentation, but I’ll briefly cover it here, and explain where the headaches come in.

    Each operation takes either 1 byte (opcodes below HAVE_ARGUMENT) or 3 (opcodes >= HAVE_ARGUMENT, where the extra 2 bytes form a short integer argument). If an operation needs an argument too large to fit, it’s prefixed by a special EXTENDED_ARG opcode, which provides 2 more bytes.2

    Argument values

    Most arguments are stored as offsets into arrays that get stored with the bytecode. There are separate arrays for constant values, local names, free variable names, cell variable names, and other names (globals, attributes, etc.).

    For example, a LOAD_CONST None may be stored as a LOAD_CONST with argument 1, meaning that it loads the code object’s co_consts[1] at runtime.
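
    You can see the arrays and the indices with the dis module (the exact output varies between CPython versions):

    import dis

    def f():
        return 42

    print(f.__code__.co_consts)   # the constants array for f
    print(f.__code__.co_names)    # the "other names" array (globals, attributes, etc.)
    dis.dis(f)                    # each LOAD_CONST line shows its index into co_consts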

    Jumps

    Jump instructions (which include things like SETUP_FINALLY or FOR_ITER, which indirectly include jumps) come in two forms: relative, and absolute. Either way, jump offsets or targets are in terms of bytes. So, jumping from the 3rd instruction to the 9th might be jumping over 6 bytes, or 19.

    Obviously, this means that inserting or removing an instruction, or changing its argument to now need or no longer need EXTENDED_ARG, requires changing the offsets of any relative jumps over that instruction, and the targets of any absolute jumps beyond that instruction. Worse, in some cases, fixing up a jump will push its argument over or under the EXTENDED_ARG limit, cascading further changes.

    Special arguments

    A few other opcodes assign special meanings to their arguments. For example, CALL_FUNCTION treats its argument as two bytes rather than one short, with the high byte representing the count of positional arguments pushed on the stack, and the low byte representing the count of keyword argument pairs pushed on the stack. MAKE_FUNCTION and MAKE_CLOSURE are the most complicated such opcodes.

    lnotab

    One of the associated values is a compressed line-number table that maps between bytecode offsets and source line numbers. This is complicated enough that it needs a special file, lnotab_notes.txt, to explain it. This of course also needs fixups whenever the first opcode of any source line moves to a new offset.

    Assembler format

    The compiler has a pretty simple and flexible structure that it uses internally, which can be seen at the top of compile.c.

    The compiler first builds a tree of blocks, each containing an array of struct instr instruction objects. Because Python doesn’t have arbitrary goto or similar instructions, all jumps are always to the start of a block, so the compiler just represents jump targets as pointers to blocks. For non-jump arguments, the instruction objects also hold their actual argument (a 32-bit integer for cases like MAKE_FUNCTION, but a const value, variable name, etc. rather than an array index for typical opcodes). Instructions also hold their source line number.

    The compiler linearizes the tree of blocks into a linked list. Then it starts the “assembler” stage, which walks that list, emitting the bytecode and lnotab as it goes, and keeping track of the starting offset of each block. It then runs a fixup step, where it fills in the jump targets and adjusts the lnotab. If any of the jump targets get pushed into EXTENDED_ARG territory, then the block offsets have to be updated and the fixup rerun. It repeats this until there are no changes.

    The assembler’s list-of-blocks representation would be very easy to work with. Iterating over instructions in linear order is slightly complicated by the fact that you have to iterate a linked list, then an array within each node, but it’s still easier than walking 1, 3, or 6 bytes at a time.3

    Meanwhile, you could add and delete instructions just by changing the array of instructions, without invalidating any jump targets or line-number mappings. And you can change arguments without worrying about whether they push you over or under the EXTENDED_ARG limit.

    However, the compiler then calls the optimizer after this fixup step. This means the optimizer has to walk the more complicated packed bytecode, and has to repeat some of the same fixup work (or, as currently implemented, does only the simplest possible fixup work, and punts and doesn’t optimize if anything else might be necessary). The current peephole optimizer is kept pretty limited because of these restrictions and complexities. But if PEP 511 adds plugin optimizers, they’re all going to have to deal with these headaches. (And, besides the complexity, there’s obviously a performance cost to repeating the fixup work once for every optimizer.)

    dis format

    The dis module disassembles bytecode into a format that’s easier to understand: a Bytecode object can be constructed from a code object, which wraps up all of the associated values, and acts as an iterable of Instruction objects. These objects are somewhat similar to the struct instr used by the assembler, in that they hold the argument value and line number. However, jumps are still in terms of bytes rather than (blocks of) instructions, and EXTENDED_ARG is still treated as a separate instruction.

    So, this format would be a little easier to process than the raw bytecode, but it would still require jump fixups, including cascading jump fixups over EXTENDED_ARG changes. Plus, there is no assembler that can be used to go back from the Bytecode object to a code object. So, you need to write code that iterates the instructions and emits bytecode and the lnotab and then does the same fixup work as the compiler. This code is usually in Python, rather than in C, which makes it a little simpler–but it’s still pretty painful.
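
    The iteration half, at least, is already easy:

    import dis

    def f(x):
        return x + 1

    for instr in dis.Bytecode(f):
        print(instr.offset, instr.opname, instr.argval)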

    byteplay format

    The third-party byteplay module has a byteplay.Code type that expands on the dis.Bytecode model by removing the EXTENDED_ARG instructions and synthesizing fake instructions to make things simpler: a SetLineNo instruction to represent each lnotab entry, and a Label instruction to represent each jump target. (It’s also easy to create new SetLineNo and Label instructions as needed.)

    One way to look at this is that byteplay lets you edit code for a simple “macro assembler”, instead of for a raw assembler of the kind that used to be built into the ROM of old home computers.

    Unlike dis, byteplay also works recursively: if a function has a nested function or class definition,4 its code object is already disassembled to a byteplay.Code object.

    The byteplay disassembly also acts as a mutable sequence of instructions, rather than an immutable, non-indexable iterable, and allows replacing instructions with simple opcode-argument 2-tuples instead of having to build instruction objects.

    Finally, byteplay provides a method to assemble the instructions back into a code object–which automatically takes care of building the value and name arrays, calculating the jump targets, building the lnotab, etc.

    Together, all of these changes allow processing instructions without worrying about any of the complicated issues. Instead of being a sequence of variable-length instructions that contain references to byte offsets within that sequence and external arrays along with an associated compacted table of line numbers, bytecode is just a sequence of (opcode, argument) pairs.
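
    Here’s a rough sketch of that workflow with byteplay (the exact spelling of the API has shifted a bit between byteplay versions and Python releases, so treat this as illustrative rather than definitive):

    from byteplay import Code, LOAD_GLOBAL, LOAD_CONST

    def f():
        return len([1, 2, 3])

    c = Code.from_code(f.__code__)     # disassemble into a Code object
    # c.code is a mutable list of (opcode, arg) pairs, plus SetLineno/Label pseudo-instructions
    c.code = [(LOAD_CONST, len) if instr == (LOAD_GLOBAL, 'len') else instr
              for instr in c.code]
    f.__code__ = c.to_code()           # reassemble; jumps and the lnotab get fixed up for us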

    On top of that, byteplay also includes code to check that the stack effects of all of the instructions add up, which helps quite a bit for catching common errors that would otherwise be hard to debug.

    The biggest problem with byteplay is that it’s somewhat complicated code that has to change with each new version of Python, but generally hasn’t tracked those new versions very well. It took years to get it compatible with Python 3.x, and for 2.7 and each new 3.x version, and it had (and may still have) longstanding bugs with those versions. As of the 3.4 improvements to dis, it now builds some of its work on top of that module when available (and dis, being in the stdlib, always tracks changes), but so far, each new version has still required patches that were months behind the release. And of course the need to be backward compatible with a wide range of supported Python versions also complicates things.

    A smaller problem with byteplay is that, while it does a lot of the same things as dis (especially now that it’s partly built on dis), it does them in different terms. For example, the equivalent of dis.Bytecode is called byteplay.Code, not Bytecode; it has a @classmethod alternate constructor from code objects instead of a normal constructor; that constructor can only take actual code objects rather than extracting them from things like functions automatically; it isn’t directly iterable but instead has a code attribute that is; etc.

    Improving the process

    AST processing

    Import hooks already provide a simple way to hook the process between AST parsing and compilation, and the ast module makes processing the ASTs pretty easy. (Some changes can be done even earlier, at the source code or token level.)

    Many of the optimizations and other processing tasks people envision (and much of what the current peephole optimizer does) could also be done on the AST, so PEP 511 allows for AST optimizers as well as bytecode optimizers.

    However, not everything can be done this way–e.g., eliminating redundant store-load pairs when the stored value is never used again. Plus, there’s no way to write a post-processing function decorator in AST terms–at that point, you’ve got a live function object.

    So, this is (already) part of the solution, but not a complete solution on its own.

    Assembler structures

    As explained earlier, the assembler structures are much easier to work on than the compiled bytecode.

    And there’s really no good reason the peephole optimizer couldn’t be called with these structures. Emitting bytecode, fixing it up, and then passing it to code that tries to partially undo that work to make changes, and then redo the work it undid, seems somewhat silly, and it’s pretty complicated.

    Unfortunately, exposing the assembler structures to third-party code, like the bytecode processors envisioned by PEP 511, is a different story. The peephole optimizer is essentially part of the compiler; PEP 511 optimizers aren’t, and shouldn’t be given access to structs that are both dangerous (it’s very easy to segfault the compiler by messing with these structs) and brittle (the structs could change between versions, and definitely should be allowed to). Plus, some of the optimizers will be written in Python.

    So, we’d have to provide some way to wrap these up and expose a C API and Python API for dealing with blocks of instructions. But we probably don’t want the compiler and assembler themselves to work in terms of this C API instead of working directly on their internal structs.

    Meanwhile, for the peephole optimizer and PEP 511 processors, it makes sense to process the code before it’s been fixed up and turned into bytecode–but that obviously doesn’t work for import hooks, or for decorators. So, we’d need some function to convert back from bytecode to a more convenient format, and then to convert from that format back to bytecode and fix it up.

    So, at this point, we’re really not talking about exposing the assembler structures, but designing a similar public structure, and functions to convert that to and from bytecode. At which point there’s no reason it necessarily has to be anything like the assembler structs.

    PyBytecode

    Once you look at it in those terms, what we’re looking for is exactly what byteplay is. If byteplay were incorporated into the stdlib, it would be a lot easier to keep it tracking the dis module and bytecode changes from version to version: any patch that would break byteplay has to instead include the changes to byteplay.

    But byteplay is pure Python. We need something that can be called from C code.

    So, what I’m imagining is a PyBytecode object, which is similar to a byteplay.Code object (including having pseudo-ops like SetLineNum and Label), but with a C API, and hewing closer to the dis model (and to PEP 8). The assembler stage of the compiler, instead of emitting a string of bytecode and related objects, builds up a PyBytecode object. That’s the object that the peephole optimizer and PEP 511 processors work on. Then, the compiler calls PyBytecode_ToCode, which generates executable bytecode and associated objects, does all the fixups, and builds the code object.

    The PyBytecode type would be exposed to Python as, say, dis.Bytecode (possibly renaming the existing thing to dis.RawBytecode). An import hook or decorator can call PyBytecode_FromCode or the dis.Bytecode constructor, process it, then call PyBytecode_ToCode or the to_code method to get back to executable bytecode. A PEP 511 processor written in Python would get one of these objects, but wouldn’t have to import dis to use it (although it would probably be convenient to do so in order to get access to the opcodes, etc.).

    We could, like byteplay, allow instructions to be just 2-sequences of opcode and argument instead of Instruction objects–and make those objects namedtuples, too, a la the tokenize module. We could even allow PEP 511 processors to return any iterable of such instructions. There’s some precedent for this in the way tokenize works (although it has its own kinds of clunkiness). And this would mean that simple one-pass processors could actually be written as generators:

    def constify(bytecode: dis.Bytecode):
        for op, arg in bytecode:
            if op == dis.LOAD_GLOBAL:
                yield (dis.LOAD_CONST, eval(arg, globals()))
            else:
                yield (op, arg)
    

    If you wanted to write a decorator that just constifies a specified list of globals, it would look something like this:

    def constify(*names, globals=None):
        mapping = {name: eval(name, globals) for name in names}
        def process(bytecode: dis.Bytecode):
            for op, arg in bytecode:
                if op == dis.LOAD_GLOBAL and arg in mapping:
                    yield (dis.LOAD_CONST, mapping[arg])
                else:
                    yield (op, arg)
        def deco(func):
            bytecode = dis.Bytecode(func.__code__)
            func.__code__ = dis.Bytecode(process(bytecode)).to_code()
            return func
        return deco
    

    For a more complicated example, here’s a PEP 511 processor to eliminate double jumps:

    def dejumpify(bytecode: dis.Bytecode):
        for instr in bytecode:
            if isinstance(instr.argval, dis.Label):
                target = bytecode[instr.argval.offset + 1]
                if target.opcode in {dis.JUMP_FORWARD, dis.JUMP_ABSOLUTE}:
                    yield (instr.opcode, target.argval)
                    continue
            yield instr
    

    Of course not everything makes sense as a generator. So, Bytecode objects should still be mutable as sequences.

    And, to make things even simpler for in-place processors, the to_code method can skip NOP instructions, so you can delete an instruction by writing bytecode[i] = (dis.NOP, None) instead of bytecode[i:i+1] = [] (which means you can do this in the middle of iterating over bytecode without losing your place).

    Unpacked opcodes

    My initial thought (actually, Serhiy Storchaka came up with it first) was that “unpacking” bytecode into a fixed-length form would be enough of a simplification.

    From Python, you’d call c.unpacked() on a code object, and get back a new code object where there are no EXTENDED_ARGs, and every opcode is 4 bytes5 instead of 1, 3, or 6. Jump targets and the lnotab are adjusted accordingly. The lnotab is also unpacked into 8-byte entries each holding a line number and an offset, instead of 2-byte entries holding deltas and special handling for large deltas.

    So, you do your processing on that, and call uc.packed() to pack it, remove NOPs, and redo the fixups, giving you back a normal code object.6

    This means you still do have to do some fixup if you rearrange any arguments, but at least it’s much easier fixup.

    This also means that dis, as it is today, is really all the help you need. For example, here’s the constify decorator from above:

    def constify(*names, globals=None):
        mapping = {name: eval(name, globals) for name in names}
        def process(code: types.CodeType):
            bytecode = bytearray(code.co_code)
            consts = list(code.co_consts)
            for i in range(0, len(bytecode), 4):
                if bytecode[i] == dis.LOAD_GLOBAL:
                    arg, = struct.unpack('<I', bytecode[i+1:i+4] + b'\0')
                    name = code.co_names[arg]
                    if name in mapping:
                        try:
                            const = consts.index(mapping[name])
                        except ValueError:
                            const = len(consts)
                            consts.append(mapping[name])
                        bytecode[i] = dis.LOAD_CONST
                        bytecode[i+1:i+4] = struct.pack('<I', const)[:3]
            # rebuild the (still unpacked) code object with the modified fields,
            # copying everything else unchanged (CodeType.replace is the easy way)
            return code.replace(co_code=bytes(bytecode), co_consts=tuple(consts))
        def deco(func):
            func.__code__ = process(func.__code__.unpacked()).packed()
            return func
        return deco
    

    Not quite as nice as the byteplay-ish example, but still not terrible.

    However, after implementing a proof of concept for this, I no longer think this is the best idea. That easier fixup is still too much of a pain for simple processors. For example, to add an instruction at offset i, you still have to scan every instruction from 0 to i for any jrel > i, and every instruction for any jabs > i, and add 4 to all of them, and also go through the lnotab and add 4 to the offset for every entry > i.

    Plus, this is obviously horribly inefficient if you’re going to make lots of insertions–which nobody ever is, but people are still going to do overcomplicated things like build a balanced sort tree of jumps and of lnotab entries so they can make those fixups logarithmic instead of linear.

    Of course if you really want to get efficient, there’s a solution that’s constant time, and also makes things a lot simpler: just use relocatable labels instead of offsets, and do all the fixups in a single pass at the end. And now we’re encouraging everyone to build half of byteplay themselves for every code processor.

    And it’s not like we save that much in compiler performance or simplicity–the fixup needed in pack isn’t much simpler than what’s needed by the full byteplay-style design (and the pack fixup plus the manual fixup done in the processor itself will probably be slower than the byteplay fixup, not faster). Also, while having the same code object type represent both packed and unpacked formats seems like a savings, in practice it’ll probably cause confusion more often than it saves anyone something to learn.

    Why can’t we just let people use byteplay if they want?

    Well, for import hooks and decorators, they already can, and do. For PEP 511 optimizers, as long as they’re written in Python, they can.

    The peephole optimizer, of course, can’t use byteplay, if for no other reason than it’s built into Python and byteplay isn’t. But maybe we could just leave that alone. People who want more will install PEP 511 optimizers.

    There is, of course, a cost to building a code object and then converting back and forth between code and byteplay.Code once for each optimizer, instead of building a byteplay.Code (or PyBytecode) object, passing it through all the optimizers, and then converting to code once. Remember that converting to code involves that repeated fixup loop and other stuff that might be kind of slow to do multiple times. Then again, it might still be so simple that the cost is negligible. The only way we’ll really know is to build PEP 511 as currently designed, build a bunch of optimizers that manually use byteplay, then build PEP 511 with PyBytecode and rewrite the optimizers to use that, and profile times for imports, trivial scripts, compileall, etc. both ways.

    Besides the probably-irrelevant performance issues, there are other advantages to PyBytecode. They’ve mostly been mentioned earlier, but to put them all in one place:

    • PyBytecode is automatically always up to date, instead of lagging behind Python releases like byteplay.
    • The compiler gets a bit simpler.
    • A big chunk of what byteplay and the compiler do today is the same work; better to have one implementation than two.
    • The peephole optimizer gets a lot simpler (although, as a change from today, “a lot simpler” is still more work than “unchanged”).
    • The peephole optimizer gets a little better (doesn’t punt as often).
    • If we later decide to build more optimizations into CPython itself as bytecode processors, it will be a lot simpler (which probably means we’ll be a lot more likely to build such optimizations).
    • PyBytecode comes with Python, instead of requiring a third-party install. (Probably not a huge benefit–anyone installing a third-party PEP 511 optimizer can install its dependencies, and the number of people who need to write an optimizer for local use but for some reason can’t use any third-party modules is probably vanishingly small.)
    • It seems like a fun project.

    Overall, none of them seem so hugely compelling that I’d say someone ought to do this–except maybe the last one. :) If I get the time, I’ll try to build it, and rebuild the peephole optimizer, and maybe modify the latest PEP 511 patch as well. Then, if it looks promising, it might be worth suggesting as a patch to CPython to go along with PEP 511.



    1. Well, not really. It just does the same stuff that’s wrapped up by the Python compile function (when called on an AST).
    2. The bytes within an opcode’s argument are in little-endian format, but EXTENDED_ARG holds higher-order bytes, so the overall format for a 4-byte arg is mixed-endian, a la VAX.
    3. Most code actually just walks 1 or 3 bytes at a time, and treats EXTENDED_ARG like a separate instruction, keeping track of the current extended-arg value in an extra register.
    4. Comprehensions are compiled into a function definition and call, so you often have nested functions without realizing it.
    5. Yes, this would mean that you can no longer have absolute or relative jumps over 2**24, const/name/etc. arrays longer than 2**24, or annotations for more than 255 parameters (currently, you can have up to 32767 of these, but you can only have 512 total arguments). I don’t think anyone would complain about any of these limits. But if it really is a problem, we could use 5 or 8 bytes per instruction instead of 4.
    6. We could even allow code in unpacked form to be executed–I played with the ceval loop a bit, and it’s not at all hard to make this work, and doesn’t seem likely to have a significant performance impact when you’re executing packed code. But I don’t see a good reason for that, and it’s a lot simpler to just raise a SystemError('cannot execute unpacked code').

  8. One common "advanced question" on places like StackOverflow and python-list is "how do I dynamically create a function/method/class/whatever"? The standard answer is: first, some caveats about why you probably don't want to do that, and then an explanation of the various ways to do it when you really do need to.

    But really, creating functions, methods, classes, etc. in Python is always already dynamic.
    Some cases of "I need a dynamic function" are just "Yeah? And you've already got one". More often, you do need something a little more complicated, but still something Python already gives you. Occasionally, even that isn't enough. But, once you understand how functions, methods, classes, etc. work in Python, it's usually pretty easy to understand how to do what you want. And when you really need to go over the edge, almost anything you can think of, even if it's almost always a bad idea, is probably doable (either because "almost always" isn't "always", or just because almost nobody would ever think to try to do it, so it wasn't worth preventing).

    Functions

    A normal def statement compiles to code that creates a new function object at runtime.

    For example, consider this code:
    def spam(x):
        return x+1
    Let's say you type that into the REPL (the interactive interpreter). The REPL reads lines until it gets a complete statement, parses and compiles that statement, and then interprets the resulting bytecode. And what does that definition get compiled to? Basically the same thing as this:
    spam = types.FunctionType(
        compile('return x+1\n', '__main__', mode='function'),
        globals(),
        'spam',
        (),
        ())
    spam.__qualname__ = 'spam'
    You can't quite write this, because the public interface for the compile function doesn't expose all of the necessary features--but outside of that compile, the rest is all real Python. (For simple lambda functions, you can use, e.g., compile('1 + 2', '__main__', mode='eval') and the whole thing is real Python, but that doesn't work for def functions. When you really need to create code objects, there are ways to do it, but you very rarely need to, so let's not worry about that.)

    If you put the same thing in a module instead of typing it at the REPL, the only difference is that the body is compiled ahead of time and stored in a marshaled code object inside the .pyc file, so it never needs to be compiled again. The def statement is still compiled and then interpreted as top-level module code that constructs a function on the fly out of that code constant, every time you import the module.
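
    You can watch this happen by disassembling the statement yourself (the exact output varies between CPython versions, but the shape is the same):
    import dis
    dis.dis(compile('def spam(x):\n    return x+1\n', '<module>', 'exec'))
    # The module-level code just does LOAD_CONST (the pre-compiled body),
    # MAKE_FUNCTION, and STORE_NAME, building and binding the function at runtime.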

    For a slightly more complicated example, consider this:
    def add_one(x: int=0) -> int:
        """Add 1"""
        return x+1
    This is equivalent to:
    add_one = types.FunctionType(
        compile('return x+1\n', '__main__', mode='function'),
        globals(),
        'add_one',
        (0,), # This is where default values go
        ())
    add_one.__qualname__ = 'add_one'
    add_one.__doc__ = """Add 1"""
    add_one.__annotations__ = {'return': int, 'x': int}
    Notice that the default values are passed into that FunctionType constructor. That's why defaults are bound in at the time the def statement is executed, which is how you can do tricks like using a dict crammed into a default value as a persistent cache.
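
    For example, here's the usual form of that cache trick (a sketch; the dict is built exactly once, when the def statement runs, and is then shared by every call):
    def fib(n, _cache={}):
        if n not in _cache:
            _cache[n] = n if n < 2 else fib(n - 1) + fib(n - 2)
        return _cache[n]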

    Closures

    The real point of functions always being created on the fly is that this means any function can be a closure--it can capture values from the environment that the function was defined in. The standard Lisp-style example looks like this:
    def make_adder(n):
        def adder(x):
            return x+n
        return adder
    That's equivalent to:
    adder = types.FunctionType(
        compile('return x+n', '__main__', mode='function'),
        globals(),
        'adder',
        (),
        (CellType(locals(), 'n'),)) # tuple of closure cells
    adder.__qualname__ = 'make_adder.<locals>.adder'
    So every time you call make_adder, you get back a new adder function, created on the fly, referencing the particular n local variable from that particular call to make_adder.

    (Unfortunately, I cheated a bit. Unlike function objects, and even code objects, you can't actually manually create closure cells like this. But you rarely want to. And if you ever do need it, you can just do a trivial lambda that captures n and then do a minor frame hack to get at the cell.)
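
    (If you're curious, the trivial-lambda half of that trick looks something like this: close over the value in a throwaway lambda, then pull the cell back out of its __closure__.)
    def make_cell(value):
        # capture value in a throwaway closure, then grab its closure cell
        return (lambda: value).__closure__[0]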

    Even if you never want to do anything this Lispy, closures get used all over the place. For example, if you've ever written a Tk GUI, you may have done something like this in your Frame subclass:
    def __init__(self):
        Frame.__init__(self)
        self.hello_button = tkinter.Button(
            self, text='Hello',
            command=lambda: self.on_button_click(self.hello_button))
    That lambda is creating a new function that captures the local self variable so it can access self.hello_button whenever the button is clicked.

    (A lambda compiles in almost the same way as a def, except that it's a value in the middle of an expression rather than a statement, and it doesn't have a name, docstring, etc.).

    Another common way to write the same button is with functools.partial:
        self.hello_button = tkinter.Button(
            self, text='Hello',
            command=partial(self.on_button_click, self.hello_button))
    If Python didn't come with partial, we could easily write it ourselves:
    def partial(func, *args, **kw):
        def wrapped(*more_args, **more_kw):
            return func(*args, *more_args, **kw, **more_kw)
        return wrapped
    This is also how decorators work:
    def simple_memo(func):
        cache = {}
        def wrapped(*args):
            args = tuple(args)
            if args not in cache:
                cache[args] = func(*args)
            return cache[args]
        return wrapped
    
    @simple_memo
    def fib(n):
        if n < 2: return n
        return fib(n-1) + fib(n-2)
    I've written a dumb exponentially-recursive Fibonacci function, and that @simple_memo magically turns it into a linear-time function that takes a fraction of a second instead of hours. How does this work? Simple: after the usual fib = types.FunctionType blah blah stuff, it does fib = simple_memo(fib). That's it. Because functions are already always created on the fly, decorators don't need anything complicated.

    By the way, if you can follow everything above, you pretty much know all there is to know about dynamic higher-order programming, except for how the theory behind it maps to advanced math. (And that part is simple if you already know the math, but meaningless if you don't.) That's one of those things that sounds scary when functional programmers talk about it, but if you go from using higher-order functions to building them to understanding how they work before the theory, instead of going from theory to implementation to building to using, it's not actually hard.

    Fake functions

    Sometimes, you can describe what code should run when you call spam, but it's not obvious how to construct a function object that actually runs that code. Or it's easy to write the closure, but hard to think about it when you later come back and read it.

    In those cases, you can create a class with a __call__ method, and it acts like a function. For example:
    class Adder:
        def __init__(self, n):
            self._n = n
        def __call__(self, x):
            return x + self._n
    An Adder(5) object behaves almost identically to a make_adder(5) closure. It's just a matter of which one you find more readable. Even for experienced Python programmers, the answer is different in different cases, which is why you'll find both techniques all over the stdlib and popular third-party modules.

    In fact, functools.partial isn't actually a closure, but a class, like this:
    class partial:
        def __init__(self, func, *args, **kw):
            self.func, self.args, self.kw = func, args, kw
        def __call__(self, *more_args, **more_kw):
            return self.func(*self.args, *more_args, **self.kw, **more_kw)
    (Actually, the real partial has a lot of bells and whistles. But it's still not that complicated. The docs link to the source code, if you want to see it for yourself.)

    Methods

    OK, so you can create functions on the fly; what about methods?

    Again, they're already always created on the fly, and once you understand how, you can probably do whatever it was you needed.

    Let's look at an example:
    class Spam:
        def __init__(self, x, y):
            self.x, self.y = x, y
        def eggs(self):
            return self.x + self.y
    spam = Spam(2, 3)
    print(spam.eggs())
    The definition of eggs is compiled and interpreted exactly the same as the definition of any other function. And the result is just stored as a member of the Spam class (see the section on classes to see how that works), so when you write Spam.eggs you just get that function.

    This means that if you want to add a new method to a class, there's no special trick, you just do it:
    def cheese(self):
        return self.x * self.y
    Spam.cheese = cheese
    print(spam.cheese())
    That's all it takes to add a method to a class dynamically.

    But meanwhile, on the instance, spam.eggs is not just a function, it's a bound method. Try print(spam.eggs) from the interactive REPL. A bound method knows which instance it belongs to, so when you call it, that instance can get passed as the self argument.

    The details of how Python turns the function Spam.eggs into the bound method spam.eggs are a bit complicated (and I've already written a whole post about them), but we don't need to know that here.

    Obviously, bound methods get created dynamically. Every time you do spam.eggs or Spam().cheese or string.ascii_letters.find, you're getting a new bound method.

    And if you want to create one manually, you can just call types.MethodType(func, obj).

    So, if you want to add a new method beans to just the spam instance, without adding it to the Spam class? Just construct the same bound method that Python would have constructed for you whenever you looked up spam.beans, and store it there:
    def beans(self):
        return self.x / self.y
    spam.beans = types.MethodType(beans, spam)
    And now you know enough to implement Javascript-style object literals, or even prototype inheritance. Not that you should do either, but if you ever run into something that you really do want to do, that requires creating methods on the fly, either on classes or on instances, you can do it. Because creating methods on the fly is what Python always does.
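
    For instance, here's a rough sketch of a JavaScript-ish object literal built out of nothing but what we've seen so far (obj is a made-up helper, not anything in the stdlib):
    import types

    def obj(**attrs):
        o = types.SimpleNamespace()
        for name, value in attrs.items():
            if callable(value):
                value = types.MethodType(value, o)  # bind functions as methods
            setattr(o, name, value)
        return o

    point = obj(x=3, y=4, norm=lambda self: (self.x**2 + self.y**2) ** 0.5)
    print(point.norm())  # 5.0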

    Classes

    What if we want to create a class dynamically?

    Well, I shouldn't have to tell you at this point. You're always creating classes dynamically.

    Class definitions work a bit differently from function definitions. Let's start with a simple example again:
    class Spam:
        z = 0
        def __init__(self, x, y):
            self.x, self.y = x, y
        def eggs(self):
            return self.x + self.y + self.z
    First, Python interprets the class body the same as any other code, but it runs inside an empty environment. So, those def calls create new functions named __init__ and eggs in that empty environment, instead of at the global level. Then, it dynamically creates a class object out of that environment, where every function or other value that got created becomes a method or class attribute on the class. The code goes something like this:
    _Spam_locals = {}
    exec('def __init__(self, x, y):\n    self.x, ... blah blah ...\n',
         globals(), _Spam_locals)
    Spam = type('Spam', (object,), _Spam_locals)
    This is why you can't access the Spam class object inside the class definition--because there is nothing named Spam until after Python calls type and stores the result in Spam. (But of course you can access Spam inside the methods; by the time those methods get called, it'll exist.)

    So, what if you want to create some methods dynamically inside the class? No sweat. By the time it gets to calling type, nobody can tell whether eggs gets into the locals dict by you calling def eggs(...): or eggs = fancy_higher_order_function(...), so they both do the same thing.
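
    For example, here's a sketch of generating methods with a factory function (Money and make_binop are just illustrations):
    import operator

    def make_binop(name, op):
        def method(self, other):
            return type(self)(op(self.value, other.value))
        method.__name__ = name  # purely for nicer introspection
        return method

    class Money:
        def __init__(self, value):
            self.value = value
        __add__ = make_binop('__add__', operator.add)
        __sub__ = make_binop('__sub__', operator.sub)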

    In fact, one idiom you'll see quite often in the stdlib is this:
        def __add__(self, other):
            blah blah
        __radd__ = __add__
    This just makes __radd__ another name for the same method as __add__.

    And yes, you can call type manually if you need to, passing it any dict you want as an environment:
    def __init__(self, x, y):
        self.x, self.y = x, y
    Spam = type('Spam', (object,),
        {'z': 0, '__init__': __init__,
         'eggs': lambda self: self.x + self.y + self.z})
    There are a few more details to classes. A slightly more complicated example covers most of them:
    @decorate_my_class
    class Spam(Base1, Base2, metaclass=MetaSpam):
        """Look, I've got a doc string"""
        def __init__(self, x, y):
            self.x, self.y = x, y
    This is equivalent to:
    _Spam_locals = {}
    exec('def __init__(self, x, y):\n    self.x, ... blah blah ...\n',
         globals(), _Spam_locals)
    Spam = MetaSpam('Spam', (Base1, Base2), _Spam_locals)
    Spam.__doc__ = """Look, I've got a doc string"""
    Spam = decorate_my_class(Spam)
    As you can see, if there's a metaclass, it gets called in place of type, and if there are base classes, they get passed in place of object, and docstrings and decorators work the same way as in functions.

    There are a few more complexities with qualnames, __slots__, closures (if you define a class inside a function), and the magic to make super() work, but this is almost everything.

    Remember from the last section how easy it is to add methods to a class? Often that's simpler than trying to programmatically generate methods from inside the class definition, or customize the class creation. (See functools.total_ordering for a nice example.) But when you really do need a dynamically-created class for some reason, it's easy.

    Generating code

    Occasionally, no matter how hard you try, you just can't come up with any way to define or modify a function or class dynamically with your details crammed into the right place the right way, at least not readably. In that case, you can always fall back to generating, compiling, and executing source code.

    The simplest way to do this is to just build a string and call exec on it. You can find a few examples of this in the stdlib, like collections.namedtuple. (Notice the trick it uses of calling exec in a custom empty namespace, then copying the value out of it. This is a bit cleaner than just executing in your own locals and/or globals.)
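
    The pattern looks roughly like this (make_adder_src is just an illustration of the shape, not how namedtuple actually spells it):
    def make_adder_src(n):
        src = 'def adder(x):\n    return x + {!r}\n'.format(n)
        namespace = {}          # a fresh, throwaway namespace
        exec(src, namespace)
        return namespace['adder']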

    You've probably heard "exec is dangerous". And of course it is. But "dangerous" really just means two things: "powerful" and "hard to control or reason about". When you need the first one badly enough that you can accept the second, danger is justified. If you don't understand things like how to make sure you're not letting data some user sent to your web service end up inside your exec, don't use it. But if you're still reading at this point, I think you can learn how to reason through the issues.

    Sometimes you don't want to exec right now, you want to compile something that you can pass around and exec later. That works fine too.

    Sometimes, you even want to generate a whole module and save it as a .py file. Of course people do that kind of thing in C all the time, as part of the build process--but in Python, it doesn't matter whether you're generating a .py file during a build or install to be used later, or generating one at normal runtime to be used right now; they're the same case as far as Python is concerned.

    Sometimes you want to build an AST (abstract syntax tree) or a stream of tokens instead of source code, and then compile or execute that. (And this is a great time to yet again plug macropy, one of the coolest projects ever.)
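
    A minimal version of the AST route, using only the stdlib ast module, looks something like this:
    import ast
    tree = ast.parse('def spam(x):\n    return x + 1\n')
    # ... rewrite the tree here, e.g. with an ast.NodeTransformer ...
    namespace = {}
    exec(compile(tree, '<generated>', 'exec'), namespace)
    print(namespace['spam'](41))  # 42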

    Sometimes you even want to do the equivalent of inline assembly (but assembling Python bytecode, not native machine code, of course) with a library like byteplay or cpyasm (or, if you're a real masochist, just assembling it in your head and using struct.pack on the resulting array of 16-bit ints...). Again, unlike C, you can do this at runtime, then wrap that code object up in a function object, and call it right now.

    You can even do stuff like marshaling code objects to and from a database to build functions out of later.

    Conclusion

    Because almost all of this is accessible from within Python itself, and all of it is inherently designed to be executed on the fly, almost anything you can think of is probably doable.

    So, if you're thinking "I need to dynamically create X", you need to think through exactly what you need, but whether that turns out to be "just a normal function" or something deeply magical, you'll be able to do it, or at least explain the magic you're looking for in specific enough terms that someone can actually show you how to do it.

  9. A few years ago, Cesare di Mauro created a project called WPython, a fork of CPython 2.6.4 that “brings many optimizations and refactorings”. The starting point of the project was replacing the bytecode with “wordcode”. However, there were a number of other changes on top of it.

    I believe it’s possible that replacing the bytecode with wordcode would be useful on its own. Of course it could also lead to opportunities for more optimizations and refactorings, but it might be worth keeping a wordcode fork alive (or even proposing it as a patch to core CPython) that doesn’t have additional radical changes.

    The code for this project is in the wpy branch of my github fork of the CPython source. As of post time, it's basically just a proof of concept: the compiler generates wordcode, the interpreter interprets wordcode, but things like the pdb debugger don't work, the peephole optimizer has been disabled, etc., so it won't even pass the full test suite. No attempt at further simplification has been made, or will be made initially: The goal is to get a complete working prototype that passes the test suite and can be benchmarked against stock CPython (and against Serhiy's alternative project to pack a few bits into the opcode and skip arguments) for size and performance. If that looks promising, then I'll look into simplifying the eval loop to see if there are further gains.

    As I make further changes to the project, an updated version of this document will be available in Python/wordcode.md inside the repo.

    Bytecode

    The core of CPython is an eval loop that implements a simple stack-based virtual machine. At each step, it has a frame object, with an attached code object and an instruction pointer. The code object contains a bytes string full of bytecode. So, it just fetches the next bytecode operation and interprets it. The compiler is responsible for turning source code into bytecode, and is called as needed from within the interpreter (by the importer, the exec function, etc.).

    Each operation consists of a single 8-bit opcode, plus possibly a 16-bit argument. For example, the RETURN_VALUE opcode (which returns the value on top of the stack to the caller) doesn’t need an argument, so a RETURN_VALUE instruction is just a single byte (83). But the LOAD_CONST opcode, which loads a constant stored in the code object by index, does need an argument, to tell the VM which index to use, so LOAD_CONST 0 takes 3 bytes (100, 0, 0).
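
    You can see the raw bytes from Python; the constant index can vary, but on CPython 3.5 it looks something like this:
    def f():
        return 42
    print(list(f.__code__.co_code))
    # prints something like [100, 1, 0, 83]:
    # LOAD_CONST 1 (opcode 100 plus a 2-byte little-endian argument),
    # then RETURN_VALUE (opcode 83)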

    Argument values that don’t fit into 16 bits are handled by prefixing an operation with a special EXTENDED_ARG opcode. This obviously rarely comes up with opcodes like LOAD_CONST, but may occasionally turn up with, say, JUMP_ABSOLUTE. For example, to jump to offset 0x123456 would require EXTENDED_ARG 18 (the most-significant 0x12) followed by JUMP_ABSOLUTE 13398 (the least-significant 0x3456), which takes 6 bytes (144, 18, 0, 113, 86, 52).

    The dis module documentation explains what each opcode does, and which ones do and don’t take arguments.

    The biggest potential problem here is that fetching a typical LOAD_CONST takes three single-byte fetches: first, because the interpreter doesn’t know whether it needs an argument until it sees the opcode, it has to fetch just one byte. Then, when it needs an argument, it has to fetch two more bytes (it could just fetch a word, but half the time that word will be misaligned, which is illegal on some platforms, and no cheaper than–or even more expensive than–two single-byte fetches on others). This also means that the argument fetching has to be either conditional (which can break branch prediction), or duplicated to each opcode’s interpreter chunk (which increases cache pressure). The more complicated pointer arithmetic can also slow things down by preventing the CPU from figuring out what to prefetch.

    On top of that, the vast majority of operations with arguments have tiny values (for example, more than half of all LOAD_CONST operations in the stdlib have indices 0-3), but they still require two bytes to store those arguments. This makes bytecode longer, meaning more cache spills, disk reads, VM page faults, etc.

    Variable-width bytecode is also more complicated to scan, interpret, and modify, which complicates code like the peephole optimizer and third-party libraries like byteplay, as well as the interpreter itself. (Notice the custom macros to fetch the next opcode and argument, peek at the next argument, etc.) Making the code simpler would make it easier to read and maintain, and might open the door for adding further changes. (It might also allow the C compiler or CPU to optimize things better without any work on our part–for example, in some cases, a PREDICT macro doesn’t always help as much as it could, because the prediction ends up reordered right after a conditional fetch.)

    Using two-byte arguments also means every argument depends on byte order (CPython stores the arguments in little-endian order, even on big-endian machines).

    Finally, variable-width opcodes mean you can’t synchronize at an arbitrary position. For example, if I want to disassemble some bytecode around offsets 100-150, there’s no way to tell whether offset 100 is an opcode, or part of an argument for an opcode starting at 98 or 99. Usually, starting from the middle of an argument will give you a bunch of invalid operations, so you can try 100, then 99, and then 98 until one of them makes sense, but that’s not guaranteed to work (and it’s not the kind of thing you’d want to automate).

    Packed Arguments

    One solution is to find the most commonly-used opcode/argument combinations and pack them into a single byte. So, we’d still have LOAD_CONST that takes an argument, but we’d also have LOAD_CONST_0 and LOAD_CONST_1 that don’t.

    We could extend this to have single-byte-arg variants, so for 2-255 you’d use LOAD_CONST_B, and only use LOAD_CONST_W for 256 and up.

    This would solve many of the problems with existing bytecode–although at the cost of making things more complicated, instead of simpler. It also means duplicating code, or adding jumps, or doing extra mask-and-shift work on the opcodes, any of which could slow things down. (It also involves using up more opcodes, but since we’re still only up to 101 out of 255 as of CPython 3.5, this probably isn’t too worrying.)

    Wordcode

    A different solution is to simplify things to use 16 bits for every operation: one byte for the opcode, one byte for the argument. So, LOAD_CONST 1 goes from 100, 1, 0 to just 100, 1. The interpreter can just fetch (aligned) 16-bit words, with no need for conditionals or duplicated code. All else being equal, it should make things faster on most platforms. It also makes things simpler, both in the interpreter and in bytecode processors.

    There are two obvious problems here.

    First, every opcode that used to take 1 byte now takes 2.

    Second, while most arguments are tiny, not all of them are. How can you use an argument >255 with this solution? The obvious answer here is to expand the use of EXTENDED_ARG. Keeping things simple, we can allow up to three EXTENDED_ARG opcodes, each carrying an additional byte for the following operation (which are then assembled in big-endian order). So, for example, LOAD_CONST 321 goes from 100, 65, 1 to 144, 1, 100, 65.
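
    Decoding that format is simple enough that the whole rule fits in a few lines of Python (a sketch of a disassembler's fetch loop, not the C eval loop itself):
    import dis

    def iter_wordcode(co_code):
        # every instruction is 2 bytes: an opcode and a 1-byte argument;
        # EXTENDED_ARG carries extra high-order bytes for the operation after it
        ext = 0
        for i in range(0, len(co_code), 2):
            op, arg = co_code[i], co_code[i + 1]
            if op == dis.opmap['EXTENDED_ARG']:
                ext = (ext << 8) | arg
                continue
            yield i, op, (ext << 8) | arg
            ext = 0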

    Of course this means that many jumps (where values >255 aren’t that rare) now take 4 bytes instead of 3, although that’s balanced by many other jumps taking 2 bytes instead of 3. (Also note that if we treat the bytecode as an array of 16-bit words instead of an array of 8-bit bytes, each offset now takes 1 less bit.)

    So, putting all of that together, will code get longer or shorter overall? (Remember, shorter code also means a faster interpreter.) It’s hard to predict, but I did a quick experiment with the importlib frozen code and the .pyc files for the stdlib, and it looks like wordcode even with the peephole optimizer disabled is about 7% smaller than bytecode with the optimizer enabled. So, I think we’d get a net savings. But obviously, this is something to test with a more complete implementation, not just guess at. Plus, it’s possible that, even if we’re saving space overall, we’re hitting the worst costs in exactly the worst places (e.g., mid-sized functions may often end up with loop bodies that expand into another cache line), which we’d need to test for.

    More extensive use of EXTENDED_ARG might also increase register pressure (because the compiler and/or CPU decides to dedicate a register to holding the current extended value rather than use that register for something else).

    So, there’s no guarantee that this will make things faster, and a possibility that it will make things slower. We won’t know until we build something to test.

    Besides performance, more extensive use of EXTENDED_ARG now means if you write a quick&dirty bytecode processor that just pretends it never exists, it will fail reasonably often instead of very rarely.

    On the other hand, Python doesn’t have a type for “immutable array of pairs of bytes” akin to bytes. Continuing to use the bytes type for bytecode could be confusing (especially if jumps use word-based offsets–that means code would need *2 and //2 all over the place). But using something like array('H') has its own problems (endianness, mutability, not being a builtin). And creating a new builtin words type for something that most users are never going to touch seems like overkill.

    Hybrid

    Obviously, you can’t do both of the above at the same time. The first change is about creating more opcodes that don’t need an argument at all, while the second change gives an argument to opcodes even if they don’t need it, which would cancel out the benefit.

    However, there is a hybrid that might make sense: For opcodes that frequently need values a little bigger than 255, we could steal a few bits from the opcode and use it as part of the argument. For example, we’d have JUMP_ABSOLUTE_0 through JUMP_ABSOLUTE_3, so to jump to offset 564 (0x234), you’d do JUMP_ABSOLUTE_2 52 instead of EXTENDED_ARG 2 JUMP_ABSOLUTE 52.

    It’s worth noting that all of the problems we’re looking at occur in real machine code. For example, the x86 uses variable-width operations, where some opcodes take no arguments, some take one, some take two, and those arguments can even be different widths. And RISC is essentially this hybrid solution–for example, on PowerPC, every operation is 32 bits, with the arguments encoded in anywhere from 0 to 26 of those bits.

    Implementing wordcode

    The argument-packing idea is worth exploring on its own. The hybrid idea might be worth exploring as an extension to either wordcode or argument-packing, if they both pan out. But experimenting with just the wordcode idea seems to be worth pursuing.

    It’s probably worth starting with the smallest possible change. This means bytecode is still treated as a string of bytes, but every operation is now always 2 bytes instead of 1 or 3, and jumps are still byte-offset-based, and no additional simplifications or refactorings are attempted.

    So, what needs to be changed?

    Interpreter

    The core interpreter loop (PyEval_EvalFrameEx in ceval.c) already has a set of macros to wrap up the complicated fetching of opcodes. We just need to change these macros to always work on 2 bytes instead of conditionally on 1 or 3. Again, this doesn’t give us the simplest code; it gives us the code with the fewest changes. This means changing the FAST_DISPATCH macro (both versions), NEXTOP, NEXTARG, PEEKARG, PREDICTED, and PREDICTED_WITH_ARG. The only other place the instruction pointer is manipulated is the next_instr = first_instr + f->f_lasti + 1; line under the long comment.

    As that comment implies, we also have to patch the frame setup (PyFrame_New in frameobject.c) to start f_lasti at -2 instead of -1, and change the YIELD_FROM code inside the eval loop to -= 2 instead of --.

    We also need to change EXTENDED_ARG to only shift 8 bits instead of 16. (You might expect that we need more changes, to allow up to three EXTENDED_ARGs instead of just one, but the code as written already allows multiple instances–although if you use more than one with bytecode, or more than three with wordcode, the extra bits just get shifted out.)

    Surprisingly, that’s all that needs to be done to interpret wordcode instead of bytecode.

    Compiler

    Obviously the compiler (in compile.c) needs to be changed to emit wordcode instead of bytecode. But again, it’s designed pretty flexibly, so it’s less work than you’d expect. In particular, it works on a list of instruction objects as long as possible, and only emits the actual bytecode at the very end. This intermediate representation treats instructions with preceding EXTENDED_ARG as single instructions. So, it should work unchanged for our purposes.

    We need to change the instrsize function to return 2, 4, 6, or 8 depending on whether 0, 1, 2, or 3 EXTENDED_ARG opcodes will be needed, instead of 1, 3, or 6 for no args, args, or args plus EXTENDED_ARG. And we need to modify assemble_emit to emit 0 to 3 EXTENDED_ARG opcodes instead of 0 or 1. Finally, the jump target fixup code in assemble_jump_offsets has to be similarly modified to count 0 to 3 EXTENDED_ARG opcodes instead of 0 or 1.

    And that’s it for the compiler.

    Peephole optimizer

    Unfortunately, the peephole optimizer (in peephole.c) doesn’t work on that nice intermediate list-of-instructions representation, but on the emitted bytecode. Which means it has to reproduce all the jump-target fixup code, and do it in a more complicated way. It also doesn’t process EXTENDED_ARG as nicely as the eval loop (it essentially assumes that only MAKE_FUNCTION and jump arguments will ever use it, which is no longer almost-always true with wordcode).

    For my first proof of concept, I just disabled the peephole optimizer. But obviously, we can’t do useful benchmark tests against the stock interpreter this way, so we’ll have to tackle it before too long.

    As a possible alternative: Victor Stinner’s PEP 511 proposes an API for registering AST- and bytecode-based code transformers, and in his earlier work with FAT Python he’s reproduced everything the peephole optimizer does (and more) as separate transformers. Most of these should be simpler to port to wordcode (especially since most of them are AST-based, before we even get to the bytecode step). So, it may be simpler to use the PEP 511 patch, disable the peephole optimizer, and use separate optimizers, both for bytecode and for wordcode. We could then test that the 511-ized bytecode interpreter is pretty close to stock CPython, and then fairly compare the 511-ized wordcode interpreter to the 511-ized bytecode interpreter.

    Debugger

    The pdb debugger has an intimate understanding of Python bytecode, and how it maps back to the compiled source code. Obviously it will need some changes to support wordcode. (This may be another place where we get some simplification opportunities.)

    I haven’t yet looked at how much work pdb needs, but I’m guessing it shouldn’t be too hard.

    Introspection tools

    The dis module disassembles bytecode. It obviously needs to be patched. But this turns out to be very easy. There are two functions, _get_instructions_bytes and findlabels, that each need two changes: to skip or fetch 1 byte instead of fetching 0 or 2 for arguments, and to handle multiple EXTENDED_ARGs.

    Marshaling, bootstrapping, etc.

    I expected that we might need some changes in marshaling, importer bootstrapping, and other parts of the interpreter. But as it turns out, all of the rest of the code just treats bytecode as an uninterpreted bytes object, so all of it just works.

    Third-party code

    Of course any third-party disassemblers, decompilers, debuggers, and optimizers that deal with bytecode would have to change.

    I believe the most complicated library to fix will be byteplay, so my plan is to tackle that one. (It may also be useful for replacing the peephole optimizer, and possibly for debugging any problems that come up along the way.)


  10. Many languages have a for-each loop. In some, like Python, it’s the only kind of for loop:

    for i in range(10):
        print(i)
    

    In most languages, the loop variable is only in scope within the code controlled by the for loop,[1] except in languages that don’t have granular scopes at all, like Python.[2]

    So, is that i a variable that gets updated each time through the loop or is it a new constant that gets defined each time through the loop?

    Almost every language treats it as a reused variable. Swift, and C# since version 5.0, treat it as separate constants. I think Swift is right here (and C# was right to change, despite the backward-compatibility pain), and everyone else is wrong.

    However, if there is a major language that should be using the traditional behavior, it’s Python.

    Loops and closures

    In any language that has lexical closures that capture variables, you’ll run into the foreach-lambda problem. This has been known in Python (and JavaScript, and Ruby, etc.) for decades, and is even covered in Python’s official Programming FAQ, but every new programmer discovers it on his own, and every new language seems to do so as well, and it always takes people by surprise.

    Consider this:

    powers = [lambda x: x**i for i in range(10)]
    

    This creates 10 functions that return x**0, x**1, … x**9, right?

    If for-each works by re-using a variable, as it does in Python (and JavaScript, …) then no, this creates 10 functions that all return x**9 (or, in some languages, x**10). Each function returns x**i, where i is the variable from the scope the function was defined in. At the end of that scope, i has the value 9.

    The problem has nothing to do with lambdas or comprehensions.[3] You’ll get the same behavior this way:

    powers = []
    for i in range(10):
        def func(x):
            return x**i
        powers.append(func)
    

    Whenever people run into this problem, in any language, they insist that the language “does closures wrong”–but in fact, the language is doing closures perfectly.

    So, does that mean all those users are wrong?

    Traditional solution

    In most languages, you can solve the problem by wrapping the function definition inside another function that you define and call:

    for i in range(10):
        def make_func(j):
            def func(x):
                return x**j
            return func
        powers.append(make_func(i))
    

    This works because the j parameter gets a reference to the value of the i argument, not a reference to the i variable itself. Then, func gets a reference to that j variable, which is different each time.

    The problem with this solution is that it’s verbose, and somewhat opaque to the reader. It’s a bit less verbose when you can use lambdas, but arguably it’s even less readable that way:

    for i in range(10):
        powers.append((lambda j: (lambda x: x**j))(i))
    

    In some languages, including Python, there’s a simpler way to do this:

    for i in range(10):
        def func(x, j=i):
            return x**j
        powers.append(func)
    

    That’s because the default parameter value j gets a reference to the value of i in the defining scope, not a closure cell capturing the variable i.

    Of course this is a bit “magical”, but people quickly learn to recognize it in Python. In fact, most experienced Python developers will spell it i=i instead of j=i, because (once you recognize the idiom) that makes it clear why we’re using a parameter with a default value here: to get the value of i at the time of definition (or, in some cases, as an optimization–e.g., you can use len=len to turn len into a local variable instead of a builtin within the function, which makes it a bit faster to look up each time you call it).
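
    For instance, here's the len=len version of the same idiom (count_short is just an illustration):
    def count_short(words, len=len):
        # len is bound at definition time, so it's a fast local lookup here
        return sum(1 for w in words if len(w) < 4)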

    The bigger problem is that it makes j a parameter, which means you can override the default value with an argument. For example, powers[3](2, 10) gives you 1024, which is almost certainly not something anyone wanted. You can do tricks like making it keyword-only, prefixing it with an underscore, etc. to try to discourage people from accidentally passing an argument, but it’s still a flaw.

    Terminology sidebar

    In Python, this issue is often described in terms of late vs. early binding. There are actually three places things could get bound: compile time (like constants), function definition time (like default parameter values), or function call time (like free variables captured by a closure). In many discussions of late vs. early binding, the distinction is between run time and compile time, but in this case, it’s between the later and earlier run time cases: free variables are bound late, at call time, while default parameter values are bound early, at definition time. So, capturing a nonlocal is late binding; the “default-value trick” of passing the value as the default value of an extra parameter means using early binding instead.

    A different way of solving the same problem is to consider capture by variable vs. capture by value. In C++, variables are lvalues, but lvalues are also things you can pass around and take references to. You can write a function that takes an int, or one that takes a reference to an int, spelled int& (and another that takes a reference to a const int, and one that takes an rvalue-reference to an int, with complicated rules about how you collapse an lvalue reference to an rvalue reference and how overload distinguishes between an int and a reference to const int when the lvalue is constant, and…). And closures are built on top of this: there’s no special “closure cell” or “closure environment”; closures just have (lvalue) references to variables from the outer scope. Of course taking a reference to something doesn’t keep it alive, because that would be too easy, so returning a function that closes over a local variable means you’ve created a dangling reference, just like returning a pointer to a local variable. So, C++ also added capture by value: instead of copying an lvalue reference (which gives you a new reference to the same lvalue), you can copy an rvalue reference (which steals the reference if it’s a temporary that would have otherwise gone away, or copies the value into a new lvalue if not). As it turns out, this gives you another way to solve the loop problem: if you capture the loop variable by value, instead of by variable, then of course each closure is capturing a different value. Confused? You won’t be, after this week’s episode of C++.

    Anyway, you can now forget about C++. In simpler languages like Java, C#, JavaScript, and Python, people have proposed adding a way to declare capture by value for closures, to solve the loop problem. For example, borrowing C++ syntax (sort of):

    for i in range(10):
        def func[i](x):
            return x**i
        powers.append(func)
    

    Or, alternatively, one of these proposals:

    for i in range(10):
        def func(x; i):
            return x**i
        powers.append(func)
    
    for i in range(10):
        def func(x) sharedlocal(i):
            return x**i
        powers.append(func)
    
    for i in range(10):
        def func(x):
            sharedlocal i
            return x**i
        powers.append(func)
    

    No matter how you spell it, the idea is that you’re telling func to capture the current value of the i local from the enclosing scope, rather than capturing the actual variable i.

    If you think about it, that means that capture by value in a language like Python is exactly equivalent to early binding (in the “binding at definition time” sense). And all of these solutions do the exact same thing as the parameter default-value trick i=i, except that i is no longer visible (and overridable) as a parameter, so it’s no longer really a “trick”. (In fact, we could even hijack the existing machinery and store the value in the function object’s __defaults__, if we just add a new member to the code object to keep a count of captured local values that come after all the parameters.)

    While we’re talking terminology, when discussing how parameters get passed in the previous section, is that “pass by value” or “pass by reference”? If you think that’s a good question, or you think the problem could be solved by just coming up with a new name, see this post.

    Solutions

    Since the traditional solution works, and people obviously can learn it and get used to it, one solution is to just do nothing.

    But if we are going to do something, there are two obvious solutions:

    1. Provide a way for closures to use early binding or capture by value. (Again, these options are equivalent in languages like Python.)
    2. Provide a way for for-each loops to define a new variable each time through the loop, instead of reusing the same variable.

    Either one would solve the problem of closures capturing loop variables. Either one might also offer other benefits, but it’s hard to see what they might be. Consider that, in decades of people using the default-value trick in Python, it’s rarely used (with locals[4]) for anything but loop variables. And similarly, you can find FAQ sections and blog posts and StackOverflow questions in every language discussing the loop problem, and none of them mention any other cases.

    The first one obviously has to be optional: you still want closures to capture some (in fact, most) variables, or they really aren’t closures. And there’s no obvious way to make an automatic distinction, so it has to be something the user spells explicitly. (As shown in the previous section, there are plenty of alternative ways to spell it, each with room for more bikeshedding.)

    The second one could be optional–but does it have to be? The C# team carefully considered adding a new form like this (in Python terms):

    for new i in range(10):
        def func(x):
            sharedlocal i
            return x**i
        powers.append(func)
    

    But they decided that any existing code that actually depends on the difference between for i and for new i is more likely to be buggy than to be intentionally relying on the old semantics, so, even from a backward-compatibility point of view, it’s still better to change things. Of course that had to be balanced with the fact that some people write code that’s used in both C# 4 and C# 5, and having it do the right thing in one version and the wrong in the other is pretty ugly… but even so, they decided to make the breaking change.

    I’m going to examine the arguments for each of the two major alternatives, without considering any backward compatibility issues.

    Consistency with C-style for

    This one isn’t relevant to languages like Python, which don’t have C-style for loops, but many languages do, and they almost always use the same keyword, or closely related ones, to introduce C-style for loops and for-each loops. So, superficially, it seems like they should be as similar as possible, all else being equal.

    In a C-style for loop, the loop variable is clearly a variable whose lifetime lasts for the entire loop, not just a single iteration. That’s obvious from the way you define the loop:

    for (int i=0; i != 10; i++)
    

    That third part isn’t generating a value for a new constant (that also happens to be named i), it’s mutating or rebinding a variable named i. That’s what the ++ operator means. If you change to i + 1, you get an infinite loop, where i stays 0 forever.

    And this is why it’s usually legal to modify the loop variable inside the loop. It allows you to do things like this:

    for (int i=0; i != spam.len(); ++i) {
        if (spam[i].has_arg) {
            process_with_arg(spam[i], spam[i+1]);
            ++i; // skips the next spam; we already used it
        } else {
            process_no_arg(spam[i]);
        }
    }
    

    So, shouldn’t for-each loops be consistent with that?

    I don’t think so. Most languages already consider it legal and reasonable and only slightly unusual to modify the loop variable of a C-style for, but not to modify the loop variable of a for-each. Why is that? I think it’s because the two loops are semantically different. They’re not the same thing, just because they share a keyword.

    So, at first glance this seems like an argument for #1, but in fact, it’s not an argument either way. (And for a new language, or an existing language that doesn’t already have C-style for loops, really, you don’t want C-style for loops…)

    Consistency with functions

    Anyone who’s done any functional programming has noticed that statement suites in imperative languages are a lot like function bodies. In fact, most control statements can be converted to a function definition or two, and a call to a higher-order function:

    if spam:
        do_stuff()
        do_more_stuff()
    else:
        do_different_stuff()
        
    def if_body():
        do_stuff()
        do_more_stuff()
    def else_body():
        do_different_stuff()
    if_func(spam, if_body, else_body)
    
    while eggs:
        eat(cheese)
        
    def while_cond():
        return eggs
    def while_body():
        eat(cheese)
    while_func(while_cond, while_body)
    

    But that doesn’t work with for-each:

    powers = []
    for i in range(10):
        def func(x):
            return x**i
        powers.append(func)
    
    powers = []
    def for_body(i):
        def func(x):
            return x**i
        powers.append(func)
    for_each(range(10), for_body)
    

    And this makes the problem obvious: that i is an argument to the for_body. Calling a function 10 times doesn’t reuse a single variable for its parameter; each time, the parameter is a separate thing.

    This is even more obvious with comprehensions, where we already have the equivalent function:[5]

    powers = [lambda x: x**i for i in range(10)]
    
    powers = map(lambda i: (lambda x: x**i), range(10))
    

    Also consider Ruby, where passing blocks to higher-order-functions[6] is the idiomatic way to loop, and for-each loops are usually described in terms of equivalence to that idiom, and yet, the following Ruby 1.8[7] code produces 10 lambdas that all raise to the same power:

    powers = []
    10.times do |i|
        powers.push(lambda { |x| x**i })
    end
    

    So, this is definitely an argument for #2. Except…

    Python

    This is where Python differs from the rest of the pack. In most other languages, each suite is a scope.[8] This is especially obvious in languages like C++, D, and Swift, which use RAII, scope-guard statements, etc. where Python would use a with statement (or a try/finally). In such languages, the duality between scopes and functions is much clearer. In particular, if i is already going to go out of scope at the end of the for loop, it makes sense for each individual i to go out of scope at the end of an iteration.

    But in Python, not only does i not go out of scope at the end of the for loop, there are comfortable idioms involving using the loop variable after its scope. (All such cases could be written by putting the use of the loop variable into the else clause of a for/else statement, but many novices find for/else confusing, and it’s not idiomatic to use it when not necessary.)

    In addition, one of the strengths of Python is that it has simple rules, with few exceptions, wherever possible, which makes it easier to work through the semantics of any code you read. Currently, the entire semantics of variable scope and lifetime can be expressed in a few simple rules, which are easy to hold in your head:

    • Only functions and classes have scopes.
      • Comprehensions are a hidden function definition and call.
    • Lookup goes local, enclosing, global, builtin.
    • Assignment creates a local variable.
      • … unless there’s an explicit global or nonlocal statement.
    • Exceptions are unbound after an except clause.

    Adding another rule to this list would mean one more thing to learn. Of course it would also mean not having to learn about the closure-over-loop-variable problem. Of course that problem is a natural consequence of the simple rules of Python, but, nevertheless, everyone runs into it at some point, has to think it through, and then has to remember it. Even after knowing about it, it’s still very easy to screw up and accidentally capture the same variable in a list of functions. (And it’s especially easy to do so when coming back to Python from Swift or C#, which don’t have that problem…)

    Mutable values

    Let’s forget about closures now, and consider what happens when we iterate over mutable values:

    x = [[], [], []]
    for i, sublist in enumerate(x):
        sublist.append(i)
    

    If you’re not familiar with Python, try it this way:

    x = [[], [], []]
    i = 0
    for sublist in x:
        sublist.append(i)
        i += 1
    

    What would we expect here? Clearly, if this is legal, the result should be:

    x = [[0], [1], [2]]
    

    And there’s no obvious reason it shouldn’t be legal.

    So, each sublist isn’t really a constant, but a variable, right? After all, it’s mutable.

    Well, yes, each sublist is mutable. But that’s irrelevant to the question of whether the name sublist is rebindable. In languages where variables hold references to values, or are just names for values, like Java (except for native types) or Python, there’s a very obvious distinction here:

    a = [1, 2, 3]
    b = [4, 5, 6]
    b = a
    b.append(7)
    

    It’s perfectly coherent that sublist is a non-rebindable name for a possibly-mutable value. In other words, a new variable each time through the loop.

    In a language like C++, things are a little trickier, but it’s just as coherent that sublist is a non-l-modifiable-but-r-mutable l-value (unless sublist is of non-const reference type in which case it’s a modifiable l-value). Anyway, C++ has bigger problems: capturing a variable doesn’t do anything to keep that variable alive past its original scope, and there’s no way to capture by value without copying, so C++ has no choice but to force the developer to specify exactly what he’s trying to capture by reference and what he’s trying to copy.

    Anyway, the one thing you definitely don’t expect is for sublist to be mutating the same list in-place each time through (which is what you’d get if the loop did sublist[:] = ... under the covers instead of sublist = ...). But you wouldn’t expect that whether sublist is a reused variable or a new one each time through.
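
    To make that difference concrete, here’s a small illustrative snippet (ordinary Python, not part of any proposal) contrasting rebinding with in-place slice assignment:

    x = [[], [], []]
    for sublist in x:
        sublist = [99]       # rebinds the local name; the lists inside x are untouched
    print(x)                 # [[], [], []]

    y = [[], [], []]
    for sublist in y:
        sublist[:] = [99]    # mutates each list object in place
    print(y)                 # [[99], [99], [99]]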

    So, ultimately, mutability isn’t an argument either way.

    Performance

    Obviously, creating and destroying a new variable each time through the loop isn’t free.

    Most of the time, you’re not keeping a reference to the loop variable beyond the lifetime of the loop iteration. So, why should you pay the cost of creating and destroying that variable?

    Well, if you think about it, in most languages, if you elide the creation and destruction, the result is invisible to the user unless the variable is captured. Almost all languages allow optimizers to elide work that has no visible effects. There would be nothing inconsistent about Python, or Java, deciding to reuse the variable where the effect is invisible, and only create a new variable when it’s captured.

    So, what about the case where the variable is captured? If we want each closure to see a different value, our only choices are to capture a new variable each time, to copy the variable on capture, or to copy the value instead of capturing the variable. The first is no more costly than the second, and cheaper than the third. It’s the simplest and most obvious way to get the desired behavior.

    So, performance isn’t an argument either way, either.

    Simplicity

    When a new user is learning the language, and writes this:

    powers = []
    for i in range(10):
        def func(x):
            return x**i
        powers.append(func)
    

    … clearly, they’re expecting 10 separate functions, each raising x to a different power.

    And, as mentioned earlier, even experienced developers in Python, Ruby, JavaScript, C#, etc. who have run into this problem before still write such code, and expect it to work as intended; the only difference is that they know how to spot and fix this code when debugging.
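
    For reference, here’s what the loop above actually does in current Python, along with the idiomatic default-value fix those developers reach for (a sketch for illustration, not a proposal):

    powers = []
    for i in range(10):
        def func(x):
            return x**i
        powers.append(func)
    print(powers[2](2))      # 512, not 4: every func sees the final i, which is 9

    powers = []
    for i in range(10):
        def func(x, i=i):    # capture the current value as a default parameter
            return x**i
        powers.append(func)
    print(powers[2](2))      # 4, as intended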

    So, what’s the intuition here?

    If the intuition is that lexical closures are early-bound, or by-value, then we’re in big trouble. They obviously aren’t, and, if they were, that would make closures useless. People use closures all the time in these languages, without struggling over whether they make sense.
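
    As a quick illustration of why closures need to be late-bound to be useful at all, consider the standard counter closure (plain Python, included only for illustration):

    def make_counter():
        count = 0
        def bump():
            nonlocal count
            count += 1       # the closure must see (and rebind) the current count
            return count
        return bump

    counter = make_counter()
    print(counter(), counter(), counter())    # 1 2 3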

    If the intuition is that they’re defining a new function each time through the loop because they have a new i, then that intuition doesn’t lead to any problems or inconsistencies anywhere else.

    And the only other alternative is that nobody actually understands what they’re doing with for-each loops, and we’re all only (usually) writing code that works because we treat them like magic. I don’t think that’s true at all; the logic of these loops is not that complicated (especially in Python).

    So, I think this is an argument for #2.

    Consistency with other languages

    As I mentioned at the start, most languages that have both for-each loops and closures have this problem.

    The only language I know of that’s solved it by adding capture by value is C++, and they already needed capture by value for a far more important reason (the dangling reference problem). Not to mention that, in C++, capture by value means something different than it would in an lvalue-less language like Python or JavaScript.

    By contrast, C# 5.0 and Ruby 1.9 both changed from "reused-i" to "new-i" semantics, and Lua, Swift, and Scala have used "new-i" semantics from the start.[9] C# and Ruby are particularly interesting, because that was a breaking backward-compatibility change, and they could very easily have offered new syntax (like for new i) instead of changing the semantics of the existing syntax. Eric Lippert’s blog covers the rationale for the decision in C#.

    As mentioned earlier, Python’s simpler scoping rules (only functions are scopes, not every suite) do weaken the consistency argument. But I think it still falls on the side of #2. (And, for a language with suite scopes, and/or a language where loop bodies are some weird not-quite-function things like Ruby’s blocks, it’s definitely #2.)

    Exact semantics

    In languages where each suite is a scope, or where loop suites are already function-esque objects like Ruby’s blocks, the semantics are pretty simple. But what about in Python?

    The first implementation suggestion for Python came from Greg Ewing. His idea was that whenever a loop binds a cellvar, the interpreter creates a new cell each time through the loop, replacing the local i binding each time. This obviously solves the loop-capture problem, with no performance effect on the more usual case where you aren’t capturing a loop variable.

    This works, but, as Guido pointed out, it’s pretty confusing. Normally, a binding just means a name in a dict. The fact that locals and closures are implemented with indexes into arrays instead of dict lookups is an implementation detail of CPython, but Greg’s solution requires that implementation detail. How would you translate the design to a different implementation of Python that handled closures by just capturing the parent dict and using it for dict lookups?[10]
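
    To see the machinery Greg’s idea targets, here’s a short, CPython-specific snippet showing that, today, every closure created in the loop shares a single cell for i, which is exactly why they all end up seeing the same final value:

    def make_funcs():
        funcs = []
        for i in range(3):
            funcs.append(lambda: i)
        return funcs

    funcs = make_funcs()
    # All three lambdas share one cell object for i...
    print(funcs[0].__closure__[0] is funcs[2].__closure__[0])    # True
    # ...so they all see the final value of i:
    print([f() for f in funcs])                                  # [2, 2, 2]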

    Nick Coghlan suggested that the simplest way to define the semantics is to spell out the translation to a while loop. So the current semantics for our familiar loop are:

    _it = iter(range(10))
    try:
        while True:
            i = next(_it)
            powers.append(lambda x: x**i)
    except StopIteration:
        pass
    

    … and the new semantics are:

    _it = iter(range(10))
    try:
        while True:
            i = next(_it)
            def _suite(i=i):
                powers.append(lambda x: x**i)
            _suite()
    except StopIteration:
        pass
    

    But I think it’s misleading that way. In a comprehension, the i is an argument to the _suite function, not a default parameter value, and the function is built outside the loop. If we reuse the same logic here, we get something a bit simpler to think through:

    _it = iter(range(10))
    def _suite(i):
        powers.append(lambda x: x**i)
    try:
        while True:
            i = next(_it)
            _suite(i)
    except StopIteration:
        pass
    

    And now the “no-capture” optimization isn’t automatic, but it’s a pretty simple optimization that could be easily implemented by a compiler (or a plugin optimizer like FAT Python). In CPython terms, if the i in _suite ends up as a cellvar (because it’s captured by something in the suite), you can’t simplify it; otherwise, you can just inline the _suite, and that gives you exactly the same code as in Python 3.5.
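
    For what it’s worth, the compiler already records the information that check needs. Here’s a small CPython-specific illustration (the _suite names below are just hypothetical examples):

    # A suite whose parameter is captured by a nested lambda: i becomes a cellvar.
    def _suite_captured(i):
        return lambda x: x**i

    # A suite that merely uses its parameter: i stays an ordinary local.
    def _suite_plain(i):
        return i * 2

    print(_suite_captured.__code__.co_cellvars)    # ('i',)
    print(_suite_plain.__code__.co_cellvars)       # ()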

    I still think there might be a better answer than the comprehension-like answer, but, if so, I haven’t thought of it. My suspicion is that it’s going to be like the comprehension problem: once you see a way to unify the two cases, everything becomes simpler.[11]

    Conclusion

    So, if you’re designing a new language whose variable semantics are like C#/Java/etc., or like Python/JavaScript/etc., I think you definitely want a for-each statement that declares a new variable each time through the loop–just like C#, Swift, and Ruby.

    For an existing language, I think it’s worth looking at existing code to try to find anything that would be broken by making the change backward-incompatibly. If you find any, then consider adding new syntax (like the for new i mentioned above); otherwise, just do the same thing C# and Ruby did and change the semantics of for.

    For the specific case of Python, I’m not sure. I don’t know if the backward-compatibility-breaking decision that made sense for C# and Ruby makes sense for Python. I also think the new semantics need more thought–and, after that’s worked out, it will depend on how easily the semantics fit into the simple scoping rules, in a way which can be taught to novices and transplants from other languages and then held in their heads. (That really is an important feature of Python, worth preserving.) Also, unlike many other languages, the status quo in Python really isn’t that bad–the idiomatic default-value trick shown earlier works, and doesn’t have the same verbosity, potential for errors, etc. as, say, JavaScript’s idiomatic anonymous wrapper function definition and call.


    1. In a typical for statement, there’s a statement or suite of statements run for each iteration. In a comprehension or other expression involving for, there may be subsequent for, if, and other clauses, and an output expression. Either way, the loop variable is in scope for all that stuff, but nowhere else.

    2. In Python, only functions and classes define new scopes. This means the loop variable of a for statement is available for the rest of the current function. Comprehensions compile to a function definition, followed by a call of that function on the outermost iterable, so technically the loop variable is available more widely than you’d expect, but that rarely comes up–you’d have to do something like write a nested comprehension where the second for clause uses the variable from the third for clause, but doesn’t use it the first time through.

    3. There are plenty of languages where normal function definitions and lambda definitions actually define different kinds of objects, and normal functions can’t make closures (C++, C#) or even aren’t first-class values (Ruby). In those languages, of course, you can’t reproduce the problem with lambda.

    4. The same trick is frequently used with globals and builtins, for two different purposes. First, len=len allows Python to look up len as a local name, which is fast, instead of as a builtin, which is a little slower; that can make a difference if you’re using it a zillion times in a loop. Second, len=len allows you to “hook” the normal behavior, extending it to handle a type that forgot a __len__ method, or to add an optional parameter, or whatever, but to still access the original behavior inside the implementation of your hook. Some possible solutions to the “local capture” problem might also work for these uses, some might not, but I don’t think that’s actually relevant to whether they’re good solutions to the intended problem.

    5. In modern Python (3.0+), map is lazy–equivalent to a generator expression rather than to a list comprehension. But let’s ignore that detail for now. Just read it as list(map(...)) if you want to think through the behavior in Python 3.

    6. Well, not higher-order functions, because they can’t take functions, only blocks, in part because functions aren’t first-class values. But the Ruby approximation of HOFs. Ruby is just two-ordered instead of arbitrary-ordered like Lisp or Python.

    7. Ruby 1.9 made a breaking change that’s effectively my #2: the i block parameter is now a new variable each time, which shadows any local i in the block’s caller’s scope, instead of being a single variable in the caller’s scope that gets rebound repeatedly. There were some further changes for 2.0, but they aren’t relevant here.

    8. Ruby is an interesting exception here. The scope rules are pretty much like Python’s–but those rules don’t matter, because loop bodies are almost always written as blocks passed to looping methods rather than as suites within a looping statement or expression. You could argue that this makes the choice obvious for Ruby, in a way that makes it irrelevant to other languages–but it’s actually not obvious how block parameters should be scoped, as evidenced by the fact that things changed between 1.8 and 1.9, primarily to fix exactly this problem.

    9. Most of these also make i a constant. This avoids some potential confusion, at the cost of a restriction that really isn’t necessary. Swift’s design is full of such decisions: when the cost of the restriction is minimal (as it is here), they go with avoiding potential confusion.

    10. If you think it through, there is an answer here, but the point is that it’s far from trivial. And defining semantics in terms of CPython’s specific optimizations, and then requiring people to work back to the more general design, is not exactly a clean way to do things…

    11. Look at the Python 2.7 reference for list displays (including comprehensions), set/dict displays (including comprehensions), and generator expressions. They’re a mess. List comprehensions are given their full semantics. Set and dict comprehensions repeat most of the same text (with different formatting, and with a typo), and still fail to actually define how the key: value pairs in a dict comprehension get handled. Then generator expressions only hint at the semantics. The Python 3.3 reference tries to refactor things to make it simpler. The intention was actually to make it even simpler: [i**2 for i in range(10)] ought to be just an optimization for list(i**2 for i in range(10)), with identical semantics. But it wasn’t until someone tried to write it that way that everyone realized that, in fact, they’re not identical. (Raise a StopIteration from the result expression or a top-level if clause and see what you get.) I think there’s some kind of similar simplification possible here, and I’d like to actually work it through ahead of time, rather than 3 versions after the semantics are implemented and it’s too late to change anything. (Not that I think my suggestion in this post will, or even necessarily should, get implemented in Python anyway, but you know what I mean.)
