Many languages have a for-each loop. In some, like Python, it’s the only kind of for loop:

for i in range(10):
    print(i)

In most languages, the loop variable is only in scope within the code controlled by the for loop,[1] except in languages that don’t have granular scopes at all, like Python.[2]

So, is that i a variable that gets updated each time through the loop or is it a new constant that gets defined each time through the loop?

Almost every language treats it as a reused variable. Swift, and C# since version 5.0, treat it as separate constants. I think Swift is right here (and C# was right to change, despite the backward-compatibility pain), and everyone else is wrong.

However, if there is a major language that should be using the traditional behavior, it’s Python.

Loops and closures

In any language that has lexical closures that capture variables, you’ll run into the foreach-lambda problem. This has been known in Python (and JavaScript, and Ruby, etc.) for decades, and is even covered in Python’s official Programming FAQ, but every new programmer discovers it on his own, and every new language seems to do so as well, and it always takes people by surprise.

Consider this:

powers = [lambda x: x**i for i in range(10)]

This creates 10 functions that return x**0, x**1, … x**9, right?

If for-each works by re-using a variable, as it does in Python (and JavaScript, …) then no, this creates 10 functions that all return x**9 (or, in some languages, x**10). Each function returns x**i, where i is the variable from the scope the function was defined in. At the end of that scope, i has the value 9.
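
You can verify this by calling them (a quick check of the comprehension above):

powers = [lambda x: x**i for i in range(10)]
print([p(2) for p in powers])
# [512, 512, 512, 512, 512, 512, 512, 512, 512, 512] -- every closure sees i == 9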

The problem has nothing to do with lambdas or comprehensions.[3] You’ll get the same behavior this way:

powers = []
for i in range(10):
    def func(x):
        return x**i
    powers.append(func)

Whenever people run into this problem, in any language, they insist that the language “does closures wrong”–but in fact, the language is doing closures perfectly.

So, does that mean all those users are wrong?

Traditional solution

In most languages, you can solve the problem by wrapping the function definition inside another function that you define and call:

for i in range(10):
    def make_func(j):
        def func(x):
            return x**j
        return func
    powers.append(make_func(i))

This works because the j parameter gets a reference to the value of the i argument, not a reference to the i variable itself. Then, func gets a reference to that j variable, which is different each time.

The problem with this solution is that it’s verbose, and somewhat opaque to the reader. It’s a bit less verbose when you can use lambdas, but arguably it’s even less readable that way:

for i in range(10):
    powers.append((lambda j: (lambda x: x**j))(i))

In some languages, including Python, there’s a simpler way to do this:

for i in range(10):
    def func(x, j=i):
        return x**j
    powers.append(func)

That’s because the default parameter value j gets a reference to the value of i in the defining scope, not a closure cell capturing the variable i.

Of course this is a bit “magical”, but people quickly learn to recognize it in Python. In fact, most experienced Python developers will spell it i=i instead of j=i, because (once you recognize the idiom) that makes it clear why we’re using a parameter with a default value here: to get the value of i at the time of definition (or, in some cases, as an optimization–e.g., you can use len=len to turn len into a local variable instead of a builtin within the function, which makes it a bit faster to look up each time you call it).
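
Both idioms look like this in practice (a small sketch; total_length is just an illustrative name):

powers = []
for i in range(10):
    def func(x, i=i):        # i=i: capture the current value of i at definition time
        return x**i
    powers.append(func)

def total_length(items, len=len):   # len=len: make len a fast local lookup
    return sum(len(item) for item in items)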

The bigger problem is that it makes j a parameter, which means you can override the default value with an argument. For example, powers[3](2, 10) gives you 1024, which is almost certainly not something anyone wanted. You can do tricks like making it keyword-only, prefixing it with an underscore, etc. to try to discourage people from accidentally passing an argument, but it’s still a flaw.
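
If you're worried about that, you can at least make the accident harder by making the extra parameter keyword-only, with an underscore name (still overridable, just not by accident; a sketch):

powers = []
for i in range(10):
    def func(x, *, _i=i):
        return x**_i
    powers.append(func)

powers[3](2)       # 8, as intended
# powers[3](2, 10) would now raise TypeError instead of silently computing 1024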

Terminology sidebar

In Python, this issue is often described in terms of late vs. early binding. There are actually three places things could get bound: compile time (like constants), function definition time (like default parameter values), or function call time (like free variables captured by a closure). In many discussions of late vs. early binding, the distinction is between run time and compile time, but in this case, it’s between the later and earlier run time cases: free variables are bound late, at call time, while default parameter values are bound early, at definition time. So, capturing a nonlocal is late binding; the “default-value trick” of passing the value as the default value of an extra parameter means using early binding instead.
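
Here's the difference in a few lines (a small sketch):

def make():
    i = 2
    def late(x):
        return x**i          # free variable: i is looked up when late() is called
    def early(x, i=i):
        return x**i          # default value: i was evaluated when early was defined
    i = 5
    return late, early

late, early = make()
print(late(2), early(2))     # 32 4 -- late sees the rebound i, early kept the old one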

A different way of solving the same problem is to consider capture by variable vs. capture by value. In C++, variables are lvalues, but lvalues are also things you can pass around and take references to. You can write a function that takes an int, or one that takes a reference to an int, spelled int& (and another that takes a reference to a const int, and one that takes an rvalue-reference to an int, with complicated rules about how you collapse an lvalue reference to an rvalue reference and how overload distinguishes between an int and a reference to const int when the lvalue is constant, and…). And closures are built on top of this: there’s no special “closure cell” or “closure environment”; closures just have (lvalue) references to variables from the outer scope. Of course taking a reference to something doesn’t keep it alive, because that would be too easy, so returning a function that closes over a local variable means you’ve created a dangling reference, just like returning a pointer to a local variable. So, C++ also added capture by value: instead of copying an lvalue reference (which gives you a new reference to the same lvalue), you can copy an rvalue reference (which steals the reference if it’s a temporary that would have otherwise gone away, or copies the value into a new lvalue if not). As it turns out, this gives you another way to solve the loop problem: if you capture the loop variable by value, instead of by variable, then of course each closure is capturing a different value. Confused? You won’t be, after this week’s episode of C++.

Anyway, you can now forget about C++. In simpler languages like Java, C#, JavaScript, and Python, people have proposed adding a way to declare capture by value for closures, to solve the loop problem. For example, borrowing C++ syntax (sort of):

for i in range(10):
    def func[i](x):
        return x**i
    powers.append(func)

Or, alternatively, one of these proposals:

for i in range(10):
    def func(x; i):
        return x**i
    powers.append(func)

for i in range(10):
    def func(x) sharedlocal(i):
        return x**i
    powers.append(func)

for i in range(10):
    def func(x):
        sharedlocal i
        return x**i
    powers.append(func)

No matter how you spell it, the idea is that you’re telling func to capture the current value of the i local from the enclosing scope, rather than capturing the actual variable i.

If you think about it, that means that capture by value in a language like Python is exactly equivalent to early binding (in the “binding at definition time” sense). And all of these solutions do the exact same thing as the parameter default-value trick i=i, except that i is no longer visible (and overridable) as a parameter, so it’s no longer really a “trick”. (In fact, we could even hijack the existing machinery and store the value in the function object’s __defaults__, if we just add a new member to the code object to keep a count of captured local values that come after all the parameters.)
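
For what it's worth, you can already get the same effect without an overridable parameter by baking the value in with functools.partial; this is just another spelling of the early-binding trick, not one of the proposals above:

import functools

powers = []
for i in range(10):
    powers.append(functools.partial(lambda i, x: x**i, i))

print(powers[3](2))   # 8 -- the exponent was bound when partial() was called
# powers[3](2, 10) now raises TypeError: the exponent slot is already filled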

While we’re talking terminology, when discussing how parameters get passed in the previous section, is that “pass by value” or “pass by reference”? If you think that’s a good question, or you think the problem could be solved by just coming up with a new name, see this post.

Solutions

Since the traditional solution works, and people obviously can learn it and get used to it, one solution is to just do nothing.

But if we are going to do something, there are two obvious solutions:

  1. Provide a way for closures to use early binding or capture by value. (Again, these options are equivalent in languages like Python.)
  2. Provide a way for for-each loops to define a new variable each time through the loop, instead of reusing the same variable.

Either one would solve the problem of closures capturing loop variables. Either one might also offer other benefits, but it’s hard to see what they might be. Consider that, in decades of people using the default-value trick in Python, it’s rarely used (with locals[4]) for anything but loop variables. And similarly, you can find FAQ sections and blog posts and StackOverflow questions in every language discussing the loop problem, and none of them mention any other cases.

The first one obviously has to be optional: you still want closures to capture some (in fact, most) variables, or they really aren’t closures. And there’s no obvious way to make an automatic distinction, so it has to be something the user spells explicitly. (As shown in the previous section, there are plenty of alternative ways to spell it, each with room for more bikeshedding.)

The second one could be optional–but does it have to be? The C# team carefully considered adding a new form like this (in Python terms):

for new i in range(10):
    def func(x):
        return x**i
    powers.append(func)

But they decided that any existing code that actually depends on the difference between for i and for new i is more likely to be buggy than to be intentionally relying on the old semantics, so, even from a backward-compatibility point of view, it’s still better to change things. Of course that had to be balanced with the fact that some people write code that’s used in both C# 4 and C# 5, and having it do the right thing in one version and the wrong thing in the other is pretty ugly… but even so, they decided to make the breaking change.

I’m going to examine the arguments for each of the two major alternatives, without considering any backward compatibility issues.

Consistency with C-style for

This one isn’t relevant to languages like Python, which don’t have C-style for loops, but many languages do, and they almost always use the same keyword, or closely related ones, to introduce C-style for loops and for-each loops. So, superficially, it seems like they should be as similar as possible, all else being equal.

In a C-style for loop, the loop variable is clearly a variable whose lifetime lasts for the entire loop, not just a single iteration. That’s obvious from the way you define the loop:

for (int i=0; i != 10; i++)

That third part isn’t generating a value for a new constant (that also happens to be named i), it’s mutating or rebinding a variable named i. That’s what the ++ operator means. If you change i++ to i + 1, you get an infinite loop, where i stays 0 forever.

And this is why it’s usually legal to modify the loop variable inside the loop. It allows you to do things like this:

for (int i=0; i != spam.len(); ++i) {
    if (spam[i].has_arg) {
        process_with_arg(spam[i], spam[i+1]);
        ++i; // skips the next spam; we already used it
    } else {
        process_no_arg(spam[i]);
    }
}

So, shouldn’t for-each loops be consistent with that?

I don’t think so. Most languages already consider it legal and reasonable and only slightly unusual to modify the loop variable of a C-style for, but not to modify the loop variable of a for-each. Why is that? I think it’s because the two loops are semantically different. They’re not the same thing, just because they share a keyword.

So, at first glance this seems like an argument for #1, but in fact, it’s not an argument either way. (And for a new language, or an existing language that doesn’t already have C-style for loops, really, you don’t want C-style for loops…)

Consistency with functions

Anyone who’s done any functional programming has noticed that statement suites in imperative languages are a lot like function bodies. In fact, most control statements can be converted to a function definition or two, and a call to a higher-order function:

if spam:
    do_stuff()
    do_more_stuff()
else:
    do_different_stuff()
    
def if_body():
    do_stuff()
    do_more_stuff()
def else_body():
    do_different_stuff()
if_func(spam, if_body, else_body)

while eggs:
    eat(cheese)
    
def while_cond():
    return eggs
def while_body():
    eat(cheese)
while_func(while_cond, while_body)
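
Neither if_func nor while_func is a real builtin, but if you want to actually run these translations, minimal sketches will do:

def if_func(condition, if_body, else_body):
    if condition:
        if_body()
    else:
        else_body()

def while_func(condition_func, body_func):
    while condition_func():
        body_func()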

But that doesn’t work with for-each:

powers = []
for i in range(10):
    def func(x):
        return x**i
    powers.append(func)

powers = []
def for_body(i):
    def func(x):
        return x**i
    powers.append(func)
for_each(range(10), for_body)
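
Likewise, for_each isn't a real function, but a one-line sketch is enough:

def for_each(iterable, body):
    for item in iterable:
        body(item)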

And this makes the problem obvious: that i is an argument to the for_body. Calling a function 10 times doesn’t reuse a single variable for its parameter; each time, the parameter is a separate thing.

This is even more obvious with comprehensions, where we already have the equivalent function:[5]

powers = [lambda x: x**i for i in range(10)]

powers = map(lambda i: (lambda x: x**i), range(10))

Also consider Ruby, where passing blocks to higher-order-functions[6] is the idiomatic way to loop, and for-each loops are usually described in terms of equivalence to that idiom, and yet, the following Ruby 1.8[7] code produces 10 lambdas that all raise to the same power:

powers = []
10.times do |i|
    powers.push(lambda { |x| x**i })
end

So, this is definitely an argument for #2. Except…

Python

This is where Python differs from the rest of the pack. In most other languages, each suite is a scope.[8] This is especially obvious in languages like C++, D, and Swift, which use RAII, scope-guard statements, etc., where Python would use a with statement (or a try/finally). In such languages, the duality between scopes and functions is much clearer. In particular, if i is already going to go out of scope at the end of the for loop, it makes sense for each individual i to go out of scope at the end of an iteration.

But in Python, not only does i not go out of scope at the end of the for loop, there are comfortable idioms involving using the loop variable after its scope. (All such cases could be written by putting the use of the loop variable into the else clause of a for/else statement, but many novices find for/else confusing, and it’s not idiomatic to use it when not necessary.)
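
For example, this common search idiom relies on the loop variable outliving the loop (a runnable sketch):

values = [4, 8, 7, 10]
for n in values:
    if n % 2:          # find the first odd value
        break
else:
    raise ValueError("no odd value found")
print(n)               # 7 -- still bound after the loop, and idiomatically used there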

In addition, one of the strengths of Python is that it has simple rules, with few exceptions, wherever possible, which makes it easier to work through the semantics of any code you read. Currently, the entire semantics of variable scope and lifetime can be expressed in a few simple rules, which are easy to hold in your head:

  • Only functions and classes have scopes.
      • Comprehensions are a hidden function definition and call.
  • Lookup goes local, enclosing, global, builtin.
  • Assignment creates a local variable.
      • … unless there’s an explicit global or nonlocal statement.
  • Exceptions are unbound after an except clause.

Adding another rule to this list would mean one more thing to learn. Of course it would also mean not having to learn about the closure-over-loop-variable problem. Of course that problem is a natural consequence of the simple rules of Python, but, nevertheless, everyone runs into it at some point, has to think it through, and then has to remember it. Even after knowing about it, it’s still very easy to screw up and accidentally capture the same variable in a list of functions. (And it’s especially easy to do so when coming back to Python from Swift or C#, which don’t have that problem…)

Mutable values

Let’s forget about closures now, and consider what happens when we iterate over mutable values:

x = [[], [], []]
for i, sublist in enumerate(x):
    sublist.append(i)

If you’re not familiar with Python, try it this way:

x = [[], [], []]
i = 0
for sublist in x:
    sublist.append(i)
    i += 1

What would we expect here? Clearly, if this is legal, the result should be:

x = [[0], [1], [2]]

And there’s no obvious reason it shouldn’t be legal.

So, each sublist isn’t really a constant, but a variable, right? After all, it’s mutable.

Well, yes, each sublist is mutable. But that’s irrelevant to the question of whether the name sublist is rebindable. In languages where variables hold references to values, or are just names for values, like Java (except for primitive types) or Python, there’s a very obvious distinction here:

a = [1, 2, 3]
b = [4, 5, 6]
b = a
b.append(7)
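print(a)   # [1, 2, 3, 7] -- a and b name the same list: the append mutated it, the earlier rebinding of b copied nothing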

It’s perfectly coherent that sublist is a non-rebindable name for a possibly-mutable value. In other words, a new variable each time through the loop.

In a language like C++, things are a little trickier, but it’s just as coherent that sublist is a non-l-modifiable-but-r-mutable l-value (unless sublist is of non-const reference type in which case it’s a modifiable l-value). Anyway, C++ has bigger problems: capturing a variable doesn’t do anything to keep that variable alive past its original scope, and there’s no way to capture by value without copying, so C++ has no choice but to force the developer to specify exactly what he’s trying to capture by reference and what he’s trying to copy.

Anyway, the one thing you definitely don’t expect is for sublist to be mutating the same list in-place each time through (what you get if you write sublist[:] = ... instead of sublist = ...). But you wouldn’t expect that whether sublist is being reused, or is a new variable.

So, ultimately, mutability isn’t an argument either way.

Performance

Obviously, creating and destroying a new variable each time through the loop isn’t free.

Most of the time, you’re not keeping a reference to the loop variable beyond the lifetime of the loop iteration. So, why should you pay the cost of creating and destroying that variable?

Well, if you think about it, in most languages, if you elide the creation and destruction, the result is invisible to the user unless the variable is captured. Almost all languages allow optimizers to elide work that has no visible effects. There would be nothing inconsistent about Python, or Java, deciding to reuse the variable where the effect is invisible, and only create a new variable when it’s captured.

So, what about the case where the variable is captured? If we want each closure to see a different value, our only choices are to capture a new variable each time, to copy the variable on capture, or to copy the value instead of capturing the variable. The first is no more costly than the second, and cheaper than the third. It’s the simplest and most obvious way to get the desired behavior.

So, performance isn’t an argument either way, either.

Simplicity

When a new user is learning the language, and writes this:

powers = []
for i in range(10):
    def func(x):
        return x**i
    powers.append(func)

… clearly, they’re expecting 10 separate functions, each raising x to a different power.

And, as mentioned earlier, even experienced developers in Python, Ruby, JavaScript, C#, etc. who have run into this problem before still write such code, and expect it to work as intended; the only difference is that they know how to spot and fix this code when debugging.

So, what’s the intuition here?

If the intuition is that lexical closures are early-bound, or by-value, then we’re in big trouble. They obviously aren’t, and, if they were, that would make closures useless. People use closures all the time in these languages, without struggling over whether they make sense.

If the intuition is that they’re defining a new function each time through the loop because they have a new i, that doesn’t point to any other problems or inconsistencies anywhere else.

And the only other alternative is that nobody actually understands what they’re doing with for-each loops, and we’re all only (usually) writing code that works because we treat them like magic. I don’t think that’s true at all; the logic of these loops is not that complicated (especially in Python).

So, I think this is an argument for #2.

Consistency with other languages

As I mentioned at the start, most languages that have both for-each loops and closures have this problem.

The only language I know of that’s solved it by adding capture by value is C++, and they already needed capture by value for a far more important reason (the dangling reference problem). Not to mention that, in C++, capture by value means something different than it would in an lvalue-less language like Python or JavaScript.

By contrast, C# 5.0 and Ruby 1.9 both changed from "reused-i" to "new-i" semantics, and Lua, Swift, and Scala have used "new-i" semantics from the start.[9] C# and Ruby are particularly interesting, because that was a breaking backward-compatibility change, and they could very easily have offered new syntax (like for new i) instead of changing the semantics of the existing syntax. Eric Lippert’s blog covers the rationale for the decision in C#.

As mentioned earlier, Python’s simpler scoping rules (only functions are scopes, not every suite) do weaken the consistency argument. But I think it still comes down on the side of #2. (And, for a language with suite scopes, and/or a language where loop bodies are some weird not-quite-function things like Ruby’s blocks, it’s definitely #2.)

Exact semantics

In languages where each suite is a scope, or where loop suites are already function-esque objects like Ruby’s blocks, the semantics are pretty simple. But what about in Python?

The first implementation suggestion for Python came from Greg Ewing. His idea was that whenever a loop binds a cellvar, the interpreter creates a new cell each time through the loop, replacing the local i binding each time. This obviously solves the loop-capture problem, with no performance effect on the more usual case where you aren’t capturing a loop variable.

This works, but, as Guido pointed out, it’s pretty confusing. Normally, a binding just means a name in a dict. The fact that locals and closures are implemented with indexes into arrays instead of dict lookups is an implementation detail of CPython, but Greg’s solution requires that implementation detail. How would you translate the design to a different implementation of Python that handled closures by just capturing the parent dict and using it for dict lookups?[10]

Nick Coghlan suggested that the simplest way to define the semantics is to spell out the translation to a while loop. So the current semantics for our familiar loop are:

_it = iter(range(10))
try:
    while True:
        i = next(_it)
        powers.append(lambda x: x**i)
except StopIteration:
    pass

… and the new semantics are:

_it = iter(range(10))
try:
    while True:
        i = next(_it)
        def _suite(i=i):
            powers.append(lambda x: x**i)
        _suite()
except StopIteration:
    pass

But I think it’s misleading that way. In a comprehension, the i is an argument to the _suite function, not a default parameter value, and the function is built outside the loop. If we reuse the same logic here, we get something a bit simpler to think through:

_it = iter(range(10))
def _suite(i):
    powers.append(lambda x: x**i)
try:
    while True:
        i = next(_it)
        _suite(i)
except StopIteration:
    pass
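
With that translation, each lambda closes over the i parameter of a separate _suite call, so (assuming powers = [] beforehand) the loop finally does what its author expected:

print([f(2) for f in powers])
# [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]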

And now the “no-capture” optimization isn’t automatic, but it’s a pretty simple optimization that could be easily implemented by a compiler (or a plugin optimizer like FAT Python). In CPython terms, if the i in _suite ends up as a cellvar (because it’s captured by something in the suite), you can’t simplify it; otherwise, you can just inline the _suite, and that gives you exactly the same code as in Python 3.5.

I still think there might be a better answer than the comprehension-like answer, but, if so, I haven’t thought of it. My suspicion is that it’s going to be like the comprehension problem: once you see a way to unify the two cases, everything becomes simpler.[11]

Conclusion

So, if you’re designing a new language whose variable semantics are like C#/Java/etc., or like Python/JavaScript/etc., I think you definitely want a for-each statement that declares a new variable each time through the loop–just like C#, Swift, and Ruby.

For an existing language, I think it’s worth looking at existing code to try to find anything that would be broken by making the change backward-incompatibly. If you find any, then consider adding new syntax like for new; otherwise, just do the same thing C# and Ruby did and change the semantics of for.

For the specific case of Python, I’m not sure. I don’t know if the no-backward-compatibility decision that made sense for C# and Ruby makes sense for Python. I also think the new semantics need more thought–and, after that’s worked out, it will depend on how easily the semantics fit into the simple scoping rules, in a way which can be taught to novices and transplants from other languages and then held in their heads. (That really is an important feature of Python, worth preserving.) Also, unlike many other languages, the status quo in Python really isn’t that bad–the idiomatic default-value trick works, and doesn’t have the same verbosity, potential for errors, etc. as, say, JavaScript’s idiomatic anonymous wrapper function definition and call.


  1. In a typical for statement, there’s a statement or suite of statements run for each iteration. In a comprehension or other expression involving for, there may be subsequent for, if, and other clauses, and an output expression. Either way, the loop variable is in scope for all that stuff, but nowhere else.

  2. In Python, only functions and classes define new scopes. This means the loop variable of a for statement is available for the rest of the current function. Comprehensions compile to a function definition, followed by a call of that function on the outermost iterable, so technically the loop variable is available more widely than you’d expect, but that rarely comes up–you’d have to do something like write a nested comprehension where the second for clause uses the variable from the third for clause, but doesn’t use it the first time through.

  3. There are plenty of languages where normal function definitions and lambda definitions actually define different kinds of objects, and normal functions can’t make closures (C++, C#) or even aren’t first-class values (Ruby). In those languages, of course, you can’t reproduce the problem with lambda.

  4. The same trick is frequently used with globals and builtins, for two different purposes. First, len=len allows Python to lookup len as a local name, which is fast, instead of as a builtin, which is a little slower, which can make a difference if you’re using it a zillion times in a loop. Second, len=len allows you to “hook” the normal behavior, extending it to handle a type that forgot a __len__ method, or to add an optional parameter, or whatever, but to still access the original behavior inside the implementation of your hook. Some possible solutions to the “local capture” problem might also work for these uses, some might not, but I don’t think that’s actually relevant to whether they’re good solutions to the intended problem.

  5. In modern Python (3.0+), map is lazy–equivalent to a generator expression rather than to a list comprehension. But let’s ignore that detail for now. Just read it as list(map(...)) if you want to think through the behavior in Python 3.

  6. Well, not higher-order functions, because they can’t take functions, only blocks, in part because functions aren’t first-class values. But the Ruby approximation of HOFs. Ruby is just two-ordered instead of arbitrary-ordered like Lisp or Python.

  7. Ruby 1.9 made a breaking change that’s effectively my #2: the i block parameter is now a new variable each time, which shadows any local i in the block’s caller’s scope, instead of being a single variable in the caller’s scope that gets rebound repeatedly. There were some further changes for 2.0, but they aren’t relevant here.

  8. Ruby is an interesting exception here. The scope rules are pretty much like Python’s–but those rules don’t matter, because loop bodies are almost always written as blocks passed to looping methods rather than as suites within a looping statement or expression. You could argue that this makes the choice obvious for Ruby, in a way that makes it irrelevant to other languages–but it’s actually not obvious how block parameters should be scoped, as evidenced by the fact that things changed between 1.8 and 1.9, primarily to fix exactly this problem.

  9. Most of these also make i a constant. This avoids some potential confusion, at the cost of a restriction that really isn’t necessary. Swift’s design is full of such decisions: when the cost of the restriction is minimal (as it is here), they go with avoiding potential confusion.

  10. If you think it through, there is an answer here, but the point is that it’s far from trivial. And defining semantics in terms of CPython’s specific optimizations, and then requiring people to work back to the more general design, is not exactly a clean way to do things…

  11. Look at the Python 2.7 reference for list displays (including comprehensions), set/dict displays (including comprehensions), and generator expressions. They’re a mess. List comprehensions are given their full semantics. Set and dict comprehensions repeat most of the same text (with different formatting, and with a typo), and still fail to actually define how the key: value pairs in a dict comprehension get handled. Then generator expressions only hint at the semantics. The Python 3.3 reference tries to refactor things to make it simpler. The intention was actually to make it even simpler: [i**2 for i in range(10)] ought to be just an optimization for list(i**2 for i in range(10)), with identical semantics. But it wasn’t until someone tried to write it that way that everyone realized that, in fact, they’re not identical. (Raise a StopIteration from the result expression or a top-level if clause and see what you get.) I think there’s some kind of similar simplification possible here, and I’d like to actually work it through ahead of time, rather than 3 versions after the semantics are implemented and it’s too late to change anything. (Not that I think my suggestion in this post will, or even necessarily should, get implemented in Python anyway, but you know what I mean.)


  1. It's been more than a decade since Typical Programmer Greg Jorgensen taught the world about Abject-Oriented Programming.

    Much of what he said still applies, but other things have changed. Languages in the Abject-Oriented space have been borrowing ideas from another paradigm entirely—and then everyone realized that languages like Python, Ruby, and JavaScript had been doing it for years and just hadn't noticed (because these languages do not require you to declare what you're doing, or even to know what you're doing). Meanwhile, new hybrid languages borrow freely from both paradigms.

    This other paradigm—which is actually older, but was largely constrained to university basements until recent years—is called Functional Addiction.

    A Functional Addict is someone who regularly gets higher-order—sometimes they may even exhibit dependent types—but still manages to retain a job.

    Retaining a job is of course the goal of all programming. This is why some of these new hybrid languages, like Rust, check all borrowing, from both paradigms, so extensively that you can make regular progress for months without ever successfully compiling your code, and your managers will appreciate that progress. After all, once it does compile, it will definitely work.

    Closures

    It's long been known that Closures are dual to Encapsulation.

    As Abject-Oriented Programming explained, Encapsulation involves making all of your variables public, and ideally global, to let the rest of the code decide what should and shouldn't be private.

    Closures, by contrast, are a way of referring to variables from outer scopes. And there is no scope more outer than global.

    Immutability

    One of the reasons Functional Addiction has become popular in recent years is that to truly take advantage of multi-core systems, you need immutable data, sometimes also called persistent data.

    Instead of mutating a function to fix a bug, you should always make a new copy of that function. For example:

    function getCustName(custID)
    {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }

    When you discover that you actually wanted fields 2 and 3 rather than 1 and 2, it might be tempting to mutate the state of this function. But doing so is dangerous. The right answer is to make a copy, and then try to remember to use the copy instead of the original:

    function getCustName(custID)
    {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }
    
    function getCustName2(custID)
    {
        custRec = readFromDB("customer", custID);
        fullname = custRec[2] + ' ' + custRec[3];
        return fullname;
    }

    This means anyone still using the original function can continue to reference the old code, but as soon as it's no longer needed, it will be automatically garbage collected. (Automatic garbage collection isn't free, but it can be outsourced cheaply.)

    Higher-Order Functions

    In traditional Abject-Oriented Programming, you are required to give each function a name. But over time, the name of the function may drift away from what it actually does, making it as misleading as comments. Experience has shown that people will only keep one copy of their information up to date, and the CHANGES.TXT file is the right place for that.

    Higher-Order Functions can solve this problem:

    function []Functions = [
        lambda(custID) {
            custRec = readFromDB("customer", custID);
            fullname = custRec[1] + ' ' + custRec[2];
            return fullname;
        },
        lambda(custID) {
            custRec = readFromDB("customer", custID);
            fullname = custRec[2] + ' ' + custRec[3];
            return fullname;
        },
    ]

    Now you can refer to these functions by order, so there's no need for names.

    Parametric Polymorphism

    Traditional languages offer Abject-Oriented Polymorphism and Ad-Hoc Polymorphism (also known as Overloading), but better languages also offer Parametric Polymorphism.

    The key to Parametric Polymorphism is that the type of the output can be determined from the type of the inputs via Algebra. For example:

    function getCustData(custId, x)
    {
        if (x == int(x)) {
            custRec = readFromDB("customer", custId);
            fullname = custRec[1] + ' ' + custRec[2];
            return int(fullname);
        } else if (x.real == 0) {
            custRec = readFromDB("customer", custId);
            fullname = custRec[1] + ' ' + custRec[2];
            return double(fullname);
        } else {
            custRec = readFromDB("customer", custId);
            fullname = custRec[1] + ' ' + custRec[2];
            return complex(fullname);
        }
    }
    

    Notice that we've called the variable x. This is how you know you're using Algebraic Data Types. The names y, z, and sometimes w are also Algebraic.

    Type Inference

    Languages that enable Functional Addiction often feature Type Inference. This means that the compiler can infer your typing without you having to be explicit:


    function getCustName(custID)
    {
        // WARNING: Make sure the DB is locked here or
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }

    We didn't specify what will happen if the DB is not locked. And that's fine, because the compiler will figure it out and insert code that corrupts the data, without us needing to tell it to!

    By contrast, most Abject-Oriented languages are either nominally typed—meaning that you give names to all of your types instead of meanings—or dynamically typed—meaning that your variables are all unique individuals that can accomplish anything if they try.

    Memoization

    Memoization means caching the results of a function call:

    function getCustName(custID)
    {
        if (custID == 3) { return "John Smith"; }
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }

    Non-Strictness

    Non-Strictness is often confused with Laziness, but in fact Laziness is just one kind of Non-Strictness. Here's an example that compares two different forms of Non-Strictness:

    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
    // function lazyTaxRate(custId) {}
    
    function callByNameTextRate(custId)
    {
        /****************************************
        *
        * TO DO:
        *
        * get tax rate for the customer state
        * eventually from some table
        *
        ****************************************/
    }

    Both are Non-Strict, but the second one forces the compiler to actually compile the function just so we can Call it By Name. This causes code bloat. The Lazy version will be smaller and faster. Plus, Lazy programming allows us to create infinite recursion without making the program hang:

    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
    // function lazyTaxRateRecursive(custId) { lazyTaxRateRecursive(custId); }

    Laziness is often combined with Memoization:

    function getCustName(custID)
    {
        // if (custID == 3) { return "John Smith"; }
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    }

    Outside the world of Functional Addicts, this same technique is often called Test-Driven Development. If enough tests can be embedded in the code to achieve 100% coverage, or at least a decent amount, your code is guaranteed to be safe. But because the tests are not compiled and executed in the normal run, or indeed ever, they don't affect performance or correctness.

    Conclusion

    Many people claim that the days of Abject-Oriented Programming are over. But this is pure hype. Functional Addiction and Abject Orientation are not actually at odds with each other, but instead complement each other.

  2. I haven't posted anything new in a couple years (partly because I attempted to move to a different blogging platform where I could write everything in markdown instead of HTML but got frustrated—which I may attempt again), but I've had a few private comments and emails on some of the old posts, so I decided to do some followups.

    A couple years ago, I wrote a blog post on greenlets, threads, and processes. At that time, it was already possible to write things in terms of explicit coroutines (Greg Ewing's original yield from proposal already had a coroutine scheduler as an example, Twisted already had @inlineCallbacks, and asyncio had even been added to the stdlib), but it wasn't in heavy use yet. Things have changed since then, especially with the addition of the async and await keywords to the language (and the popularity of similar constructs in a wide variety of other languages). So, it's time to take a look back (and ahead).

    Differences

    Automatic waiting

    Greenlets are the same thing as coroutines, but greenlet libraries like gevent are not just like coroutine libraries like asyncio. The key difference is that greenlet libraries do the switching magically, while coroutine libraries make you ask for it explicitly.

    For example, with gevent, if you want to yield until a socket is ready to read from and then read from the socket when waking up, you write this:
        buf = sock.recv(4096)
    To do the same thing with asyncio, you write this:
        buf = await loop.sock_recv(sock, 4096)
    Forget (for now) the difference in whether recv is a socket method or a function that takes a socket; the key difference is that await. In asyncio, any time you're going to wait for a value, yielding the processor to other coroutines until you're ready to run, you always do this explicitly, with await. In gevent, you just call one of the functions that automatically does the waiting for you.

    In practice, while marking waits explicitly is a little harder to write (especially during quick and dirty prototyping), it seems to be harder to get wrong, and a whole lot easier to debug. And the more complicated things get, the more important this is.

    If you miss an await, or try to do it in a non-async function, your code will usually fail hard with an obvious error message, rather than silently doing something undesirable.

    Meanwhile, let's say you're using some shared container, and you've got a race on it, or a lock that's being held too long. It's dead simple to tell at a glance whether you have an await between a read and a write to that container, while with automatic waiting, you have to read every line carefully. Being able to follow control flow at a glance is really one of the main reasons people use Python in the first place, and await extends that ability to concurrent code.

    Serial-style APIs

    Now it's time to come back to the difference between sock.recv and sock_recv(sock). The asyncio library doesn't expose a socket API, it exposes an API that looks sort of similar to the socket API. And, if you look around other languages and frameworks, from JavaScript to C#, you'll see the same thing.

    It's hard to argue that the traditional socket API is in any objective sense better, but if you've been doing socket programming for a decade or four, it's certainly more familiar. And there's a lot more language-agnostic documentation on how it works, both tutorial and reference (e.g., if you need to look up the different quirks of a function on Linux vs. *BSD, the closer you are to the core syscall, the easier it will be to find and understand the docs).

    In practice, however, the vast majority of code in a nontrivial server is going to work at a higher level of abstraction. Most often, that abstraction will be Streams or Protocols or something similar, and you'll never even see the sockets. If not, you'll probably be building your own abstraction, and only the code on the inside—a tiny fraction of your overall code—will ever see the sockets.

    One case where using the serial-style APIs really does help, however, is when you've got a mess of already-written code that's either non-concurrent or using threads or processes, and you want to convert it to use coroutines. Rewriting all that code around asyncio (no matter which level you choose) is probably a non-trivial project; rewriting it around gevent, you just import all the monkeypatches and you're 90% done. (You still need to scan your code, and test the hell out of it, to make sure you're not doing anything that will break or become badly non-optimal, of course, but you don't need to rewrite everything.)

    Conclusion

    If I were writing the same blog post today, I wouldn't recommend magic greenlets for most massively-concurrent systems; I'd recommend explicit coroutines instead.

    There is still a place for gevent. But that place is largely in migrating existing threading-based (or non-concurrent) codebases. If you (and your intended collaborators) are familiar enough with threading and traditional APIs, it may still be worth considering for simpler systems. But otherwise, I'd strongly consider asyncio (or some other explicit coroutine framework) instead.

  3. Looking before you leap

    Python is a duck-typed language, and one where you usually trust EAFP ("Easier to Ask Forgiveness than Permission") over LBYL ("Look Before You Leap"). In Java or C#, you need "interfaces" all over the place; you can't pass something to a function unless it's an instance of a type that implements that interface; in Python, as long as your object has the methods and other attributes that the function needs, no matter what type it is, everything is good.

    Of course there are some occasions where you do need to LBYL. For example, say you need to do a whole set of operations or none at all. If the second one raises a TypeError after the first one succeeded, now you're in an inconsistent state. If you can make the first operation reversible—or, better, do the whole transaction on a copy of your state, off to the side, and then commit it atomically at the end—that may be acceptable, but if not, what can you do? You have to LBYL all of the operations before you do the first one. (Of course if you have to worry about concurrency or reentrancy, even that isn't sufficient—but it's still necessary.)

    hasattr

    Traditionally, in Python, you handled LBYL by still using duck-typing: just use hasattr to check for the methods you want. But this has a few problems.

    First, not all distinctions are actually testable with hasattr. For example, often, you want a list-like object, and a dict-like object won't do. But you can't distinguish with __getitem__, because they both support that. You can try to pile on tests--has __getitem__, and __len__, but doesn't have keys... But that's still going to have false positives once you go beyond the builtin types, and it's hard to document for the user of your function what you're testing for, and you have to copy and paste those checks all over the place (and, if you discover that you need to add another one, make sure to update all the copies). And remember to test according to special-method lookup (do the hasattr on the type, not the object itself) when appropriate, if you can figure out when it is appropriate.

    And, on top of the false positives, you may also get false negatives--a type that doesn't have __iter__ may still work in a for loop because of the old-style sequence protocol. So, you have to test for having __iter__ or having __getitem__ (but then you get back to the false positives). And, again, you have to do this all over the place.

    Finally, it isn't always clear from the test what you're trying to do. Any reader should understand that a test for __iter__ is checking whether an object is iterable, but what is a test for read or draw checking?

    The obvious way to solve these problems is to factor out an issequence function, an isiterable function, and so on.
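
    Something like this is the pre-ABC style (a sketch, not anything from the stdlib; the checks mirror the caveats above):

        def isiterable(obj):
            # __iter__, or the old-style sequence protocol via __getitem__;
            # special-method lookup goes through the type, not the instance
            return hasattr(type(obj), '__iter__') or hasattr(type(obj), '__getitem__')

        def issequence(obj):
            # "list-like, not dict-like": indexable and sized, but no keys()
            t = type(obj)
            return (hasattr(t, '__getitem__') and hasattr(t, '__len__')
                    and not hasattr(t, 'keys'))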

    ABCs

    ABCs ("Abstract Base Classes") are a way to make those functions extensible. You write isinstance(seq, Sequence) instead of issequence(seq), but now a single builtin function works for an unbounded number of interfaces, rather than a handful of hardcoded ones. And we have that Sequence type to add extra things to. And there are three ways to hook the test.

    First, similar to Java interfaces, you can always just inherit from the ABC. If you write a class that uses Sequence as a base class, then of course isinstance(myseq, Sequence) will pass. And Sequence can use abstractmethod to require subclasses to implement some required methods or the inheritance will fail.

    Alternatively, similar to Go interfaces, an ABC can include structural tests (in a __subclasshook__ method) that will automatically treat any matching type as a subclass. For example, if your class defines an __iter__ method, it automatically counts as a subclass of Iterable.

    Finally, you can explicitly register any type as a virtual subclass of an ABC by calling the register method. For example, tuple doesn't inherit from Sequence, and Sequence doesn't do structural checking, but because the module calls Sequence.register(tuple) at import time, tuples count as sequences.
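
    Here's a sketch of the last two hooks in action (Drawable, Circle, and Point are made-up names; the first hook is just ordinary inheritance):

        from abc import ABC, abstractmethod

        class Drawable(ABC):
            @abstractmethod
            def draw(self): ...

            @classmethod
            def __subclasshook__(cls, C):
                # structural check: anything with a draw method counts
                if cls is Drawable and any('draw' in B.__dict__ for B in C.__mro__):
                    return True
                return NotImplemented   # fall through to inheritance/register checks

        class Circle:                   # never mentions Drawable
            def draw(self): print("o")

        class Point:                    # no draw method; registered explicitly
            pass

        Drawable.register(Point)

        print(issubclass(Circle, Drawable))   # True, via __subclasshook__
        print(issubclass(Point, Drawable))    # True, via register()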

    Most OO languages provide either the first option or the second one, which can be a pain. In languages with only nominal (subclass-based) interfaces, there's no way to define a new interface like Reversible that stdlib types, third-party types, or even builtins like arrays can match (except in a language like Ruby, where you can always reopen the classes and monkeypatch the inheritance chain at runtime, and it's even idiomatic to do so). In languages with only structural interfaces, there's no way to do things like specify sequences and mappings as distinct interfaces that use the same method names with different meanings. (That's maybe not as much of a problem for named methods, but think about operator overloading and other magic methods—would you really want to spell dict access d{2} or similar instead of d[2] just so sequences and mappings can use different operators?)

    Sometimes, ABCs can clarify an interface even if you don't actually use them. For example, Python 2 made heavy use of the idea of "file-like objects", without ever really defining what that means. A file is an iterable of lines; an object with read() to read the whole thing and read(4096) to read one buffer; an object with fileno() to get the underlying file descriptor to pass to some wrapped-up C library; and various other things (especially once you consider that files are also writable). When a function needs a file-like object, which of these does it need? When a function gives you a file-like object, which one is it providing? Worse, even actual file objects come in two flavors, bytes and Unicode; an iterable of Unicode strings is a file-like object, but it's still useless to a function that needs bytes.

    So, in Python 3, we have RawIOBase and TextIOBase, which specify exactly which methods they provide, and whether those methods deal in bytes or text. You almost never actually test for them with isinstance, but they still serve as great documentation for both users and implementers of file-like objects.

    But keep in mind that you don't want to create an ABC in Python whenever you'd create an interface in Java or Go. The vast majority of the time, you want to use EAFP duck typing. There's no benefit in using them when not required (Python doesn't do type-driven optimizations, etc.), and there can be a big cost (because so much Python code is written around duck-typing idioms, so you don't want to fight them). The only time to use ABCs is when you really do need to check that an object meets some interface (as in the example at the top).

    ABCs as mixins

    The obvious problem with the more complex ABCs is that interfaces like Sequence or RawIOBase have tons of methods. In the Python 2 days, you'd just implement the one or two methods that were actually needed (once you guessed which ones those were) and you were done.

    That's where mixins come in. Since ABCs are just classes, and mixins are just classes, there's nothing stopping an ABC from providing default implementations for all kinds of methods based on the subclass's implementations of a few required methods.

    If you inherit from Sequence, you only have to implement __getitem__ and __len__, and the ABC fills in all the rest of the methods for you. You can override them if you want (e.g., if you can provide a more efficient implementation for __contains__ than iterating the whole sequence), but you usually don't need to. A complete Sequence is under 10 lines of code for the author, but provides the entire interface of tuple for the user.
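
    For example, a complete (if silly) sequence type, assuming nothing beyond the stdlib:

        from collections.abc import Sequence

        class Squares(Sequence):
            """The first n square numbers, as an immutable sequence."""
            def __init__(self, n):
                self._n = n
            def __len__(self):
                return self._n
            def __getitem__(self, index):
                # plain int indexes are all the mixin methods need
                if not 0 <= index < self._n:
                    raise IndexError(index)
                return index * index

        s = Squares(5)
        print(list(s), 9 in s, s.index(16), list(reversed(s)))
        # [0, 1, 4, 9, 16] True 4 [16, 9, 4, 1, 0]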

    Of course there are other ways Python could provide similar convenience besides mixins (e.g., see functools.total_ordering, a decorator that similarly lets you define just two methods and get a complete suite implemented for you), but mixins work great for this case.

    Isn't it conceptually wrong for an interface (which you inherit for subtyping) and a mixin (which you inherit for implementation) to be the same class? Well, it's not exactly "pure OO", but then practicality beats purity. Sure, Python could have a SequenceABC and a SequenceMixin and make you inherit both separately when you want both, but when are you going to want SequenceMixin without also wanting SequenceABC? Merging them is very often helpful, and almost never a problem.

  4. Background

    Currently, CPython’s internal bytecode format stores instructions with no args as 1 byte, instructions with small args as 3 bytes, and instructions with large args as 6 bytes (actually, a 3-byte EXTENDED_ARG followed by a 3-byte real instruction). While bytecode is implementation-specific, many other implementations (PyPy, MicroPython, …) use CPython’s bytecode format, or variations on it.

    Python exposes as much of this as possible to user code. For example, you can write a decorator that takes a function’s __code__ member, access its co_code bytecode string and other attributes, build a different code object, and replace the function’s __code__. Or you can build an import hook that takes the top-level module code returned by compile (and all the code objects stored in co_consts for the top-level object or recursively in any of the nested objects), builds up a different one, and returns that as the module’s actual code.
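
    As a toy example of the first kind (a sketch assuming Python 3.8+, where code objects grew a replace method; nothing here is part of the proposal below):

        def swap_const(old, new):
            """Decorator that rebuilds a function's code object with one constant swapped."""
            def deco(func):
                code = func.__code__
                consts = tuple(new if c == old else c for c in code.co_consts)
                func.__code__ = code.replace(co_consts=consts)
                return func
            return deco

        @swap_const("hello", "goodbye")
        def greet():
            return "hello"

        print(greet())   # goodbye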

    And this isn’t just a nifty feature for playing around with Python’s internals without having to hack up the interpreter; there’s real-life production code that depends on doing this kind of stuff to code objects.

    Making CPython bytecode format less brittle

    There have been a few proposals to change the internal format: e.g., “wordcode” stores most instructions as 2 bytes (using EXTENDED_ARGS for 4 or 6 bytes when needed), while “packed ops” steals bits from the opcode to store very small args for certain ops (e.g., LOAD_CONST_2 can be stored as 1 byte instead of 3). Such proposals would obviously break all code that depends on bytecode. Guido suggested on python-dev that it would need two Python versions (so 3.7, if we finished wordcode today) before we could release such a breaking change; I’m not sure even that would be enough time.

    Similarly, while some other implementations use essentially the same format as CPython, they're not all actually compatible. For example, for compactness, MicroPython makes most non-jump instructions 2 bytes instead of 3, which breaks almost any code that even inspects bytecode, much less tries to modify it.

    One of the replies to Guido suggested, “Maybe we should have a standard portable assembly format for bytecode”. If we had that, it would remain consistent across most internal changes, even as radical as bytecode to wordcode, and would be portable between CPython and MicroPython. If we could get such a format into 3.6, I think it would definitely be reasonable for a radical change like wordcode to go into 3.7 as Guido suggests—and maybe even for any future changes to take place over a single version instead of two.

    Making bytecode processors easier to write

    Separately, as part of the discussion on PEP 511, which adds a new API for directly processing code objects, I suggested that it would be a lot simpler to write such processors, and also simpler for the compiler, if it used some “unpacked bytecode” format that’s easier to work with.

    The main objection I heard to that idea was that any such format would make Python harder to change in the future—but once you consider that the obvious format to use is effectively a bytecode assembly language, it becomes clear that the opposite is true: we’re making Python easier to change in the future, along with making it simpler to work with.

    The other objection is that there are just too many possible designs to choose from, so it’s way too early to put anything in the stdlib. But there really aren’t many possible designs to choose from.

    dis

    Python already has a standard bytecode assembly language format, as exposed by the dis module: a dis.Bytecode object represents the assembly-language version of a code object, as an iterable of dis.Instruction objects instead of just raw bytes (plus a few extra attributes and methods). It abstracts away things like variable instruction size (including EXTENDED_ARG), mapping indices into co_const into actual constant values, etc.
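
    For example, dumping the instructions of a trivial function (the exact opcodes vary by CPython version):

        import dis

        def double_plus_one(x):
            return x * 2 + 1

        for instr in dis.Bytecode(double_plus_one):
            print(instr.offset, instr.opname, instr.argrepr)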

    This is almost exactly what we want, for both purposes—but there are two problems with it.

    First, there’s no assembler to get from a Bytecode back to code. Even if there were, Bytecode is immutable. And, even if it were mutable, Instruction objects have to be constructed out of a bunch of stuff that is either meaningless or needlessly difficult to deal with when you’re modifying or creating code. Most of that stuff is also useless to the assembler and could just be left at None, but the big one is the argval for each jump-like Instruction: it’s an offset into the raw byte string, not something meaningful at all in the Bytecode object. This means that if you add or remove any bytecodes, all the jumps have to be fixed up.

    Second, the dis module is written in Python. To make it usable in the compiler, the peephole optimizer, and any C-API bytecode processors, we need something accessible from C. We could, of course, rewrite most of dis in C, move it to builtins, and give it a C API, but that’s a lot of work—and it would probably discourage further improvements to dis, and it might even discourage other Python implementations from including it.

    Proposal

    As mentioned above, a Bytecode is basically an iterable of Instructions, which are just (named) tuples. And an iterable of tuples is something that you can already handle from the C API.

    Meanwhile, what should jump arguments be? The simplest change is to make them references to other instructions. There are a couple of minor questions on that, which I’ll get to in the appropriate disassembly section.

    And that’s really all we need to make the dis format the standard, portable assembly format, and remove the need for most code (including bytecode-based optimizers, code-coverage tools, etc.) to depend on fragile and difficult-to-process raw bytecode.

    PyCode_Assemble

    PyCode_Assemble(instructions, name, filename, firstlineno)
    

    This function (in Include/code.h and compile.c) takes any iterable of tuples of 2 or more elements each, where the first three elements are opcode, argval, and line, and returns a code object. While assembling, it removes any existing EXTENDED_ARG and NOP instructions (although it will of course insert new EXTENDED_ARG as needed).

    Note that a Bytecode is an iterable of Instructions, which are namedtuples, so (assuming we reorganize the attributes of Instruction, and change jump targets to be references to Instructions instead of offsets) you can call this with a Bytecode object. But it’s usually simpler to just call it with a generator.
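
    Purely to illustrate the shape of the input—this uses the dis.assemble wrapper and dis.Opcode names proposed below, so none of it exists today—assembling something equivalent to return x + 1 might look like:

    instructions = [
        (dis.LOAD_FAST,    'x',  1),
        (dis.LOAD_CONST,   1,    1),
        (dis.BINARY_ADD,   None, 1),
        (dis.RETURN_VALUE, None, 1),
    ]
    code = dis.assemble(instructions, 'f', '<generated>', 1)
    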

    The implementation for this new function would be somewhat complex if we were writing it from scratch—but the assemble function in compile.c already does almost exactly what we want, and the small changes to it that we need are not at all complex.

    Do we need a corresponding PyCode_Disassemble? It would be pretty simple, but I can’t think of any code in C, either today or in the future, that will ever need to disassemble a code object. (Well, you might want to write a third-party optimizer in C, but in that case, you can import the dis module from your code and use it.) And we already have a nice disassembler in pure Python. So… no C API for disassembly.

    PyCode_Optimize

    This function currently takes a bytecode bytestring, and in-out lists for consts, names, and lnotab, and returns a new bytestring. The implementation is the peephole optimizer.

    PEP 511 proposes to change it to take a code object and a context object (a collection of flags, etc.), and return a new code object. The implementation calls the code_transformer() method of every registered code transformer, which have the same interface, and then calls the peephole optimizer.

    I propose to instead change it to take an iterable of tuples and return the same, to have PEP 511 transformer objects use the same interface, and to rewrite the peephole optimizer to do so as well.

    compile.c

    Currently: The compiler converts each AST node into a block of instruction objects, where the instructions are pretty similar to the format used by the dis module (opcode and argument value), but with one major difference: each jump target is a pointer to a block, rather than an offset into the (not-yet-existing) bytecode. The compiler flattens this tree of blocks into a linked list, then calls assemble. The assemble function does a few passes over each instruction of each block and ends up with most of the attributes of a code object. It then calls PyCode_Optimize with four of those attributes, and then builds a code object from the result.

    Instead, the compiler will turn its linked list of arrays of instruction objects into a flat list of instruction tuples, converting each jump target from a block pointer to an instruction pointer along the way, and introducing the NOP label and setlineno pseudo-ops described below (which is very easy—each block starts with a label, and each instruction that starts a block or has a different line from the last one gets a setlineno in front of it). It then calls PyCode_Optimize with that list, and then calls PyCode_Assemble (which replaces the existing assemble) with the result.

    dis.Opcode

    Currently, the dis module has 101 integer constants, opname and opmap to map back and forth between integer opcodes and their names, and each Instruction has both opcode and opname attributes.

    While we’re improving dis, we might as well make an IntEnum type called Opcode to wrap this up. We can even throw has_jrel and so forth on as methods. But for backward compatibility, we’ll keep the two mappings, the extra attribute, and the various lists and sets of raw opcodes too.
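
    A rough sketch of the shape of such an enum, built out of pieces dis already exposes (the free-function has_jrel here is just for brevity; the names are mine, not part of any proposal text):

    import dis
    from enum import IntEnum

    Opcode = IntEnum('Opcode', dis.opmap)

    def has_jrel(op):
        # relative-jump test, using the existing dis.hasjrel list
        return op in dis.hasjrel

    print(Opcode.LOAD_CONST)               # Opcode.LOAD_CONST
    print(has_jrel(Opcode.JUMP_FORWARD))   # True
    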

    dis.assemble

    The new dis.assemble function just exports PyCode_Assemble to Python in the obvious way.

    dis._get_instructions_bytes

    The core of the dis module is a function called _get_instructions_bytes. This takes a bytecode string, plus a bunch of optional parameters for things like the consts and varnames lists, and generates Instruction objects. (If the optional parameters are missing, it creates fake values—e.g., a LOAD_FAST 1 with no varnames gets an argval="1". Which is not at all useful for processing, but can occasionally be useful for dumping to the REPL, e.g., when you’re trying to debug a bytecode processor.)

    We need to make a few changes to this function.

    First, we add a new parameter, raw, both to this function and to all the functions that call it.

    If raw is false, all EXTENDED_ARG and NOP opcodes are skipped during disassembly. (If you’re planning to modify and reassemble, they’re at best useless, and can be confusing.)

    As mentioned earlier, jump arguments need the target instruction in the argval, not an offset. If raw, this is the actual instruction; otherwise, a new label pseudo-instruction (NOP, labelcount++) is generated for the jump target, and later inserted before the actual instruction.

    Also, instead of an optional starts_line, every instruction now gets an optional line; an instruction starts a new line iff its line is present and different from the previous instruction’s. And, if raw is false, we’ll emit pseudo-instructions (NOP, None, line) for each new line. (This allows people to modify instructions in-place without having to worry about whether they’re replacing a line start.)

    While we’re at it, it’s almost always simpler not to worry about which instructions start a new line while dealing with bytecode. So, _get_instructions_bytes can generate an (NOP, None, line) for each entry in the lnotab, and then users can leave the line numbers off any real instructions they generate without having to worry about breaking the lnotab. But that too can get in the way for really short functions, so maybe it should also be optional. Or maybe just controlled by the same raw flag.

    dis.Instruction

    The Instruction class just needs to add a line attribute (which can be None), rearrange its attributes so (opcode, argval, line) come first, and make all attributes after the second optional (with default values of None).

    In fact, most of the other existing attributes aren’t useful when assembling, and aren’t useful when processing bytecode for other purposes, unless you’re trying to replace the high-level stringifying functions. But there’s no good reason to get rid of them; as long as people don’t have to specify them when creating or modifying instructions, they don’t get in the way. We could mark them deprecated; I’m not sure whether that’s worth it.

    dis.Bytecode

    Bytecode.assemble(self) just calls assemble, passing itself as the instructions, and the name, filename, and line number (if any) that it stored (directly or indirectly) at construction.

    The second public change is to the constructor. The constructor currently looks like this:

    Bytecode(x, *, first_line=None, current_offset=None)
    

    That x can be a code, function, method, generator, or str (the last is handled by calling compile(x)). The constructor stores x for later, and digs out the first line number and current offset for later unless overridden by the optional arguments, and also digs out the code object both to store and to disassemble.

    We’ll add a couple more optional parameters to override the values pulled from the first:

    Bytecode(x, *, name=None, filename=None, first_line=None, current_offset=None)
    

    (This means we also need to store the name and filename, as we do with the other two, instead of pulling them out of the appropriate object on demand.)

    And we’ll add two more “overloads” for the x parameter: It can be a Bytecode, in which case it’s just a shallow copy, or any (non-str, non-Bytecode) iterable, in which case it just stores [Instruction(Opcode(op), *rest) for (op, *rest) in it].

    The third public change is that Bytecode becomes a MutableSequence. I haven’t found any examples where I wanted to mutate a Bytecode; writing a non-mutating generator seems to be always easier, even when I need two passes. However, everyone else who’s written an example of a code generator does it by iterating indexes and mutating something in-place, and I can’t see any good reason to forbid that, so let’s allow it. This means that internally, Bytecode has to store the instructions in a list at construction time, instead of generating them lazily in __iter__.

    Examples

    Constify

    The following function takes an iterable of instruction tuples, and gives you back an iterable of instruction tuples where every LOAD_GLOBAL is replaced by a LOAD_CONST with the current value of that global:

    def constify(instructions):
        for opcode, argval, *rest in instructions:
            if opcode == dis.LOAD_GLOBAL:
                yield (dis.LOAD_CONST, eval(argval, globals()))
            else:
                yield (opcode, argval)
    

    Or, if you prefer doing it mutably:

    def constify(instructions):
        bc = dis.Bytecode(instructions)
        for i, instr in enumerate(bc):
            if instr.opcode == dis.LOAD_GLOBAL:
                bc[i] = (dis.LOAD_CONST, eval(instr.argval, globals()))
        return bc
    

    Either way, you can use it in a decorator:

    def constify_func(func):
        code = func.__code__
        instructions = constify(dis.raw_disassemble(code))
        func.__code__ = dis.assemble(instructions,
                                     code.co_name, code.co_filename, code.co_firstlineno)
        return func
    

    … or a PEP 511 processor:

    def code_transformer(self, instructions, context):
        if context.some_flag:
            return constify(instructions)
        else:
            return instructions
    

    Dejumpify

    For something a little more complicated, the first example everyone gives me of why labels are insufficient: let’s flatten out double-jumps.

    This is one of the rare examples that’s easier without labels:

    def _dejumpify(code):
        bc = dis.Bytecode(code, raw=True)
        for op, arg, *rest in bc:
            if op.hasjump:
                if arg.opcode in (dis.JUMP_FORWARD, dis.JUMP_ABSOLUTE):
                    yield (op, arg.argval, *rest)
                    continue
            yield (op, arg, *rest)
    
    def dejumpify(func):
        code = func.__code__
        func.__code__ = dis.assemble(
            _dejumpify(code),
            code.co_name, code.co_filename, code.co_firstlineno)
        return func
    

    … but it’s still not at all hard (or inefficient) with labels:

    def _dejumpify(code):
        bc = dis.Bytecode(code)
        indices = {instr: i for i, instr in enumerate(bc)}
        for op, arg, *rest in bc:
            if op.hasjump:
                target = bc[indices[arg]+1]
                if target.opcode in (dis.JUMP_FORWARD, dis.JUMP_ABSOLUTE):
                    yield (op, target.argval, *rest)
                    continue
            yield (op, arg, *rest)
    

    Reorder blocks

    For something a little more complicated, what if we wanted to break the code into blocks, reorder those blocks in some way that minimizes jumps, then relinearize? Surely we’d need blocks then, right? Sure. And here’s how we get them:

    def blockify(instructions):
        block = []
        for instruction in instructions:
            if instruction.is_jump_target:
                yield block
                block = []
            block.append(instruction)
        yield block
    

    (That could be shorter with itertools, but some people seem to find such code confusing, so I wrote it out.)

    So:

    def minimize_jumps(instructions):
        blocks = list(blockify(instructions))
        # put them in a graph and optimize it or whatever...
        # put them back in a flat iterable of blocks
        return itertools.chain.from_iterable(blocks)
    

    Backward compatibility and third-party libs

    First, the fact that Instruction.argval is now an instruction rather than an integer offset for jump instructions is a breaking change. I don’t think any code actually processes argval; if you need offsets, you pretty much need the raw arg. But I could be wrong. If I am, maybe it’s better to add a new argument or argument_value attribute, and leave argval with its old meaning (adding it to the deprecated list, if we’re deprecating the old attributes).

    If this is added to Python 3.6, we almost certainly want a backport to 3.4-5. I don’t know how important it is for that backport to export the C API for PyCode_Assemble, but that’s not hard to add, so why not. Backporting to 2.6+/3.2+ would be great, but… is there a backport of the 3.4 version of dis? If so, adding to that shouldn’t be too hard; if not, someone has to write that part first, and I’m not volunteering. :)

    Code that only inspects bytecode, without modifying it, like coverage.py, shouldn’t need any change. If that code is already using dis in 3.4+, it can continue to do so—and most such code would then work even if Python 3.7 changed to wordcode. If that code isn’t using dis, well, nothing changes.

    Code that significantly modifies bytecode often relies on the third-party module byteplay. We definitely want to allow such code to continue working. Ideally, byteplay will change to just use dis for assembly in 3.6+ (the way it changed to use dis for various other bits in 3.4+), meaning it’ll no longer be a stumbling block to upgrading to new versions of Python. This would be a very easy change to byteplay.

    The only other utility module I’ve found for dealing with bytecode that isn’t stuck in Python 2.4 or something is codetransformer. I haven’t looked at it in as much detail as byteplay, and I don’t know whether it could/should be changed to use the dis assembler in 3.6+.

    Victor Stinner is in the middle of writing a new module called bytecode, as an attempt at working out what would be the best API for rewriting the peephole optimizer line-by-line in Python. I’m willing to bet that whatever API he comes up with could be trivially built as a wrapper on top of dis, if dis isn’t sufficient in itself. (I’m not sure why you’d want to rewrite the peephole optimizer line-by-line instead of doing the work at a higher level, but presumably the hope is that whatever he comes up with would also be useful for other code.)


  5. If you want to skip all the tl;dr and cut to the chase, jump to Concrete Proposal.


    Why can’t we write list.len()?

    Python is an OO language. To reverse a list, you call lst.reverse(); to search a list for an element, you call lst.index(). Even operator expressions are turned into OO method calls: elem in lst is just lst.__contains__(elem).

    But a few things that you might expect to be done with methods are instead done with free functions. For example, to get the length of a list (or array or vector or whatever) in different OO languages:1

    • lst.length (Ruby)
    • lst.size() (C++)
    • lst size (Smalltalk)
    • [lst count] (Objective C)
    • lst.count (Swift)
    • lst.length (JavaScript)
    • lst.size() (Java)
    • len(lst) (Python)

    One of these things is not like the others.2 In Python, len is a free function that you call with a collection instance, not a method of collection types that you call on a collection instance. (Of course under the covers, that distinction isn’t as big as it appears, which I’ll get back to later.)

    And the list of things that are methods like lst.index(elem) vs. free functions like len(lst) seems a bit haphazard. This often confuses novices, while people who move back and forth between Python and another language they’re more familiar with tend to list it as an example of “why language XXX is better than Python”.

    As you might expect, this is for historical reasons. This is covered in the official design FAQ:

    The major reason is history. Functions were used for those operations that were generic for a group of types and which were intended to work even for objects that didn’t have methods at all (e.g. tuples).

    In modern Python, this explanation seems a bit weird—tuple has methods like tuple.index(), just like list, so why couldn’t it have a len method too? Well, Python wasn’t originally as consistently OO as it is today. Classes were a completely separate thing from types (the type of an instance of class C wasn’t C, but instance),3 and only a handful of types that needed mutating methods, like list.reverse(), had any methods, and they were implemented very differently from methods of classes.

    So, could this be fixed? Sure. Should it be? I’m a lot less sure. But I think it’s worth working through the details to understand what questions arise, and what it would take to make this work, before deciding whether it’s worth considering seriously.

    Dunder methods

    First, an aside.

    In Python, most generic free functions, like operators, actually do their dispatch via a dunder-method protocol—that is, by calling a method named with double-underscore prefix and suffix on the first argument. For example, len(x) calls x.__len__(). So, couldn’t we just rename __len__ to len, leaving the old name as a deprecated legacy alternative (much like, in the other direction, we renamed next to __next__)? Or, if that doesn’t work, make lst.len() automatically fall back to lst.__len__()?

    Unfortunately, that would only work for a small handful of functions; most are more complicated than len. For example, next(i) just calls i.__next__(), but next(i, None) tries i.__next__() and returns None on StopIteration. So, collapsing next and __next__ would mean that every iterator type now has to add code to handle the default-value optional parameter, which would be a huge waste of code and effort, or it would mean that users could no longer rely on the default-value argument working with all iterators. Meanwhile, bytes(x) tries the buffer protocol (not implementable in Python, but you can inherit an implementation from bytes, bytearray, etc.), then the __bytes__ method, then the __iter__ protocol, then the old-style sequence __getitem__ protocol. Collapsing bytes and __bytes__ would mean none of those fallbacks happen. And int() has both fallback protocols, and an optional parameter that changes which protocols are tried. And so on. So, this idea is a non-starter.
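
    A two-line illustration of that next() behavior:

    it = iter([])
    # next(it) would raise StopIteration here; the two-argument form
    # catches it and returns the default instead
    print(next(it, None))   # None
    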

    C++

    C++ isn’t usually the best language to turn to for inspiration; any C++ ideas that make sense for Python probably also made sense for C#, Swift, Scala, or some other language, and the version in those other languages is usually closer to what Python wants.

    But in this case, C++ is one of the few languages that, for its own historical reasons,4 haphazardly mixes free functions and methods. For example, to get the length of a container, you call the cont.size() method, but to get the index of an element, you call find(cont, elem) (the exact opposite of Python, in these two cases). There was originally a principled distinction, but the first two decades of the language and what would become its standard library changing back and forth to support each other led to a mishmash.

    It’s become part of the “modern C++ dogma” that free functions are part of a type’s interface, just like public members. But some things that are part of a type’s interface, you call function-style; others, you call dot-style—and, because there’s no rhyme or reason behind the distinction, you just have to memorize which are which. And, when designing types, and protocols for those types to interact with, you have to pick one arbitrarily, and be careful to be consistent everywhere.

    Over the years, people have tried to come up with solutions to fix this inconsistency for C++. Most of the early attempts involved Smalltalk-style “extension methods”: reopening a type (including “native types” like int and C arrays) to add new methods. Then, you could just reopen the C array type template and add a begin method, so lst.begin() now works no matter what type lst is. Unfortunately, fitting that into a static language like C++ turned out to be difficult.5

    In 2014, there were two new proposals to solve this problem by unifying call syntax. Herb Sutter’s N4165 and Bjarne Stroustrup’s N4174 both have interesting motivation sections that are at least partly relevant to Python and other languages, including some of the areas where they differ. Starting with N4474, they switched to a combined proposal, which gets more into C++-specific issues. At any rate, while I’m not sure which one is on the right side for C++, I think Sutter’s definitely is for Python,6 so that’s what I’m going to summarize here.

    The basic idea is simple: the call syntax x.f(y) tries to look up the method x.f and call it on (y), but falls back to looking up the function f and calling it on (x, y). So, for example, you could write lst.len(); that would fail to find a list.len method, and then try the len function, which would work.

    Python

    So, the basic idea in Python is the same: x.f(y) tries to look up the method x.f, and falls back to the function f.

    Of course Python has to do this resolution at runtime, and it has only generic functions rather than separate automatically-type-switched overloads (unless you use something like singledispatch to implement overloads on top of generic functions)—but neither of those differences turns out to be a problem.

    There are also a number of details that are problems, but I think they’re all solvable.

    Locals

    When lst.len() fails to find a len method on the lst object, we want it to fall back to calling len(lst), so we need to do a lookup on len.

    In Python, the lookup scope is resolved partly at compile time, partly at runtime. See How lookup works for details, but I can summarize the relevant parts inline here: If we’re inside a function body, if that function or an outer nested function has a local variable named len, compile a local or dereferenced lookup (don’t worry about the details for how it decides which—they’re intuitive, but a bit long to explain here, and not relevant); otherwise, compile a global-or-builtin lookup.

    This sounds complicated, but the compiler is already doing this every time you type len(lst) in a function, so making it do the same thing for fallback is pretty simple. Essentially, we compile result = lst.len() as if it were:

    try:
        _meth = lst.len
    except AttributeError:
        result = len(lst)
    else:
        result = _meth()
        del _meth
    

    What raises on failure?

    With the implementation above, if neither lst.len nor len exists, we’re going to end up raising a NameError. We probably wanted an AttributeError. But that’s easy to handle:

    try:
        _meth = lst.len
    except AttributeError:
        try:
            result = len(lst)
        except NameError:
            raise AttributeError(...)
    else:
        result = _meth()
        del _meth
    

    (We don’t have to worry about the lst raising a NameError, because if it doesn’t exist, the first lst.len would have raised that instead of AttributeError.)

    I’ll ignore this detail in the following sections, but it should be obvious how to add it back in to each version.

    Method objects

    In Python, functions and methods are both first-class objects. You might not actually call lst.len(), but instead store lst.len to call later, or pass it to some other function. So, how does that work?

    We definitely don’t want to put off resolution until call time. That would mean lst.len has to become some complicated object that knows how to look up "len" both on lst and in the creating scope (which would also mean if len were a local it would have to become a cellvar, and so on).

    So, what we need to do is put the fallback before the call, and make it give us a method object. Like this:

    try:
        meth = lst.len
    except AttributeError:
        meth = make_method_out_of(len, lst)
    # and then the call happens later:
    meth()
    

    As it turns out, this is pretty much how bound methods already work. That made-up make_method_out_of(len, lst) exists, and is spelled len.__get__(lst, type(lst)). The standard function type has a __get__ method that does the right thing. See the Descriptor HowTo Guide for more details, but you don’t really need any more details here. So:

    try:
        meth = lst.len
    except AttributeError:
        meth = len.__get__(lst, type(lst))
    

    The only problem here is that most builtin functions, like len, aren’t of function type, and don’t have a __get__ method.7 One obvious solution is to give normal builtin functions the same descriptor method as builtin unbound methods.8 But this still has a problem: you can construct custom function-like objects in Python just by defining the __call__ method. But those functions won’t work as methods unless you also define __get__. Which isn’t much of a problem today—but might be more of a problem if people expect them to work for fallback.
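
    To make that concrete (size is just a throwaway name; the False result reflects CPython’s current behavior, as described in footnote 8):

    def size(self):
        return len(self)

    lst = [1, 2, 3]
    # a plain Python function is a (non-data) descriptor, so __get__ builds a bound method
    print(size.__get__(lst, type(lst))())   # 3
    # builtin functions aren't descriptors today
    print(hasattr(len, '__get__'))          # False
    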

    Alternatively, we could make the fallback code build explicit method objects (which is what the normal function.__get__ does already):

    try:
        meth = lst.len
    except AttributeError:
        try:
            meth = len.__get__(lst, type(lst))
        except AttributeError:
            meth = types.MethodType(len, lst)
    

    I’ll assume in later sections that we decide this isn’t necessary and just give builtins the descriptor methods, but it should be obvious how to convert any of the later steps.

    Things are slightly more complicated by the fact that you can call class methods on a class object, and look up instance methods on a class object to call them later as unbound methods, and store functions directly in an object dict and call them with the same dot syntax, and so on. For example, we presumably want list.len(lst) to do the same thing as lst.len(). The HowTo covers all the details of how things work today, or How methods work for a slightly higher-level version; basically, I think it all works out, with either variation, but I’d have to give it more thought to be sure.

    There have been frequent proposals to extend the CPython implementation with a way to do “fast method calls”—basically, x.f() would use a new instruction to skip building a bound method object just to be called and discarded, and instead directly do what calling the bound method would have done. Such proposals wouldn’t cause any real problem for this proposal. Explaining why requires diving into the details of bytecode and the eval loop, so I won’t bother here.

    What about set and delete?

    If x.f() falls back to f.__get__(x, type(x))(), then shouldn’t x.f = 3 fall back to f.__set__(x, 3)?

    No. For one thing, that would mean that, instead of hooking lookup, we’d be hooking monkeypatching, which is a very silly thing to do. For another, if x has no member named f, x.f = 3 doesn’t fail, it just creates a member, so fallback wouldn’t normally happen anyway. And of course the same goes for deletion.

    Data members

    We want to be able to write x.f() and have it call f(x). But when we’re dealing with data rather than a method, we probably don’t want to be able to write x.sys and have it return a useless object that looks like a method but, when called, tries to do sys(x) and raises a TypeError: 'module' object is not callable.

    Fortunately, the way methods work pretty much takes care of this for us. If you try to construct a types.MethodType with a non-callable first argument, it raises a TypeError (which we can handle and turn into an AttributeError), right there at construction time, not later at call time.

    Or, alternatively, we could just make fallback only consider non-data descriptors; if it finds a non-descriptor or a data descriptor, it just raises the AttributeError immediately.

    Namespaces

    In C++, when you write f(x), the compiler looks for functions named f in the current namespace, and also in the namespace where the type of x is defined. This is the basic idea behind argument-dependent lookup. For example:

    // things.h
    namespace things {
        class MyThing { ... };
        void myfunc(MyThing) { ... }
    }
    
    // stuff.cpp
    #include <things.h>
    namespace stuff {
        void stuffy() {
            things::MyThing thing;
            things::myfunc(thing); // works
            myfunc(thing); // works, means the same
            thing.myfunc(); // works under N4165
        }
    }
    

    In Python, the equivalent doesn’t work:

    # things.py
    class MyThing: ...
    def myfunc(thing: MyThing): ...
    
    # stuff.py
    import things
    def stuffy():
        thing = things.MyThing()
        things.myfunc(thing) # works
        myfunc(thing) # NameError
        thing.myfunc() # AttributeError even with this proposal
    

    You can’t just write myfunc(thing), and with the proposed change, you still can’t write thing.myfunc().

    So, does that make the idea useless? No. Some of the most important functions are builtins. And some important third-party libraries are meant to be used with from spam import *.

    But still, it does mean the idea is less useful.

    Could Python implement a simple form of argument-dependent lookup, only for dot-syntax fallback? Sure; it’s actually pretty simple. One option is:

    try:
        meth = lst.len
    except AttributeError:
        try:
            meth = len.__get__(lst, type(lst))
        except NameError:
            mod = sys.modules[type(lst).__module__]
            meth = mod.len.__get__(lst, type(lst))
    

    This extends the LEGB (local, enclosing, global, builtin) rule to LEGBA (…, argument-dependent) for dot-syntax fallback. If the compiler emitted a FAST or DEREF lookup, we don’t need to worry about the A case; if it emitted a GLOBAL lookup, we only need to worry about it if the global and builtins lookups both failed, in which case we get a NameError, and we can then do an explicit global lookup on the module that type(lst) was defined in (and if that too fails, then we get a NameError).

    It isn’t quite this simple, because the module might have been deleted from sys.modules, etc. There are already rules for dealing with this in cases like pickle and copy, and I think applying the same rules here would be perfectly acceptable—or, even simpler, just implement it as above (except obviously turning the KeyError into an AttributeError), and you just can’t use argument-dependent lookup to fall back to a function from a module that isn’t available.

    Again, I’ll assume we aren’t doing argument-dependent lookup below, but it should be obvious how to change any of the later code to deal with it.

    Bytecode details

    We could actually compile the lookup and fallback as a sequence of separate bytecode instructions. So, instead of a simple LOAD_ATTR, we’d do a SETUP_EXCEPT, and handle AttributeError by doing a LOAD_FAST, LOAD_DEREF, or LOAD_GLOBAL. This gets a bit verbose to read, and could be a performance problem, but it’s the smallest change, and one that could be done as a bytecode hack in existing Python.

    Alternatively, we could add new bytecodes for LOAD_ATTR_OR_FAST, LOAD_ATTR_OR_DEREF, and LOAD_ATTR_OR_GLOBAL, and do all the work inside those new opcode handlers. This gets a bit tricky, because we now have to store the name in two arrays (co_names for the ATTR lookup and co_varnames for the FAST lookup) and somehow pass both indices as arguments to the opcode. But we could do that with EXTENDED_ARGS. Or, alternatively, we could just make LOAD_ATTR_OR_FAST look in co_varnames, and change the definition of co_names (which isn’t really publicly documented or guaranteed) so that instead of having globals, attributes, etc., it has globals, attributes that don’t fall back to local/free/cell lookups, etc.

    It might seem like the best place to do this would be inside the LOAD_ATTR opcode handler, but then we’d need some flag to tell it whether to fall back to fast, deref, or global, and we’d also need some way for code that’s inspecting bytecode (like dis or inspect) to know that it’s going to do so, at which point we’ve really done the same thing as splitting it into the three opcodes from the previous paragraph.

    A simpler version of any of these variants is to always fall back to LOAD_NAME, which does the LEGB lookup dynamically, by looking in the locals() dict and then falling back to a global lookup. Since most fallbacks are probably going to be to globals or builtins, and LOAD_NAME isn’t that much slower than LOAD_GLOBAL, this doesn’t seem too bad at first—but it is slower. Plus, this would mean that every function that has any attribute lookup and any local or free variables has to be flagged as needing a locals() dict (normally, one doesn’t get built unless you ask for it explicitly). Still, it might be acceptable for a proof of concept implementation—especially as a variation on the first version.

    Lookup overrides

    In Python, almost everything is dynamic and hookable. Even the normal attribute lookup rules. When you write x.f, that’s handled by calling x.__getattribute__('f'),9 and any type can override __getattribute__ to do something different.
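
    For example, with the hook that already exists today (nothing to do with the proposed fallback yet):

    class Loud:
        x = 42
        def __getattribute__(self, name):
            print('looking up', name)
            return object.__getattribute__(self, name)

    obj = Loud()
    print(obj.x)   # prints "looking up x", then 42
    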

    So, if we want types to be able to control or even prevent fallback to free functions, the default fallback has to be handled inside object.__getattribute__. But, as it turns out (see the next section), that’s very complicated.

    Fortunately, it’s not clear it would ever be useful to change the fallback. Let’s say a type wanted to, say, proxy all methods to a remote object, except for methods starting with local_, where local_spam would be handled by calling spam locally. At first glance, that sounds reasonable—but if that spam falls back to something local, you have to ask, local to what?10 How can the type decide at runtime, based on local_spam, that the compiler should have done LEGB resolution on spam instead of local_spam at compile time, and then get that retroactively done? There doesn’t seem to be any reasonable and implementable answer.

    So, instead, we’ll just do the fallback in the interpreter: if normal lookup (as possibly overridden by __getattribute__) for x.f fails, do the f lookup.

    Introspection

    In Python, you can write getattr(x, 'f'), and expect to get the same thing as x.f.

    One option is to make getattr(x, 'f') not do the fallback and just make it raise. But that seems to defeat the purpose of getattr. It’s not just about using getattr (or hasattr) for LBYL-style coding, in which case you could argue that LBYL is always approximate in Python and if people don’t like it they can just suck it. The problem is that sometimes you actually have an attribute name as a string, and need to look it up dynamically. For example, many RPC servers do something like this:

    cmd, args = parse(message)
    try:
        getattr(handler, 'do_' + cmd)(*args)
    except AttributeError:
        handler.do_unknown_command(cmd, *args)
    

    If I wanted to add a do_help that works on all three of my handlers by writing it as a free function, I’d need getattr to be able to find it, right?

    I’m not sure how important this is. But if getattr needs fallback code, the obvious is something like this:

    try:
        return _old_getattr(x, 'f')
    except AttributeError as ex:
        frame = inspect.currentframe().f_back
        try:
            return frame.f_locals['f'].__get__(x, type(x))
        except KeyError:
            pass
        try:
            return frame.f_globals['f'].__get__(x, type(x))
        except KeyError:
            raise ex
    

    But doesn’t this have all the same problems I brought up earlier, e.g., with why we can’t expect people to usefully do this in __getattribute__, or why we can’t do it as the interpreter’s own implementation? Yes, but…

    The first issue is performance. That’s generally not important when you’re doing getattr. People write x.f() in the middle of inner loops, and expect it to be reasonably fast. (Although when they need it to be really fast, they write xf = x.f outside the loop; you shouldn’t normally have to do that.) People don’t write getattr(x, 'f') in the middle of inner loops. It already takes roughly twice as long for successful lookups, and even longer on failed lookups. So, I don’t think anyone would scream if it also took longer on successful fallback lookups.

    The second issue is portability. We can’t require people to use non-portable frame hacks to implement the portable __getattribute__ method properly. But we can use it internally, in the CPython implementation of a builtin. (And other implementations that do things differently can do whatever they need to do differently.)

    The final issue is correctness. If f is a local in the calling function, but hasn’t been assigned to yet, a LOAD_FAST on f is going to raise an UnboundLocalError, but locals()['f'] is going to just raise a KeyError and our code will fall back to a global lookup, which is bad. But fortunately, getattr is a builtin function. In CPython, it’s implemented in C, and uses the C API rather than the Python one. And in C, you can actually get to the parent frame’s varnames and locals array directly. Also, I’m not sure this would be a problem. It’s already true for dynamic lookup of locals()['x'] (and, for that matter, eval('x')), so why shouldn’t it also be true for dynamic lookup of getattr(f, 'x')?

    So, I think this is acceptable here.

    At any rate, changing getattr automatically appropriately takes care of hasattr, the inspect module, and anything else that might be affected.

    C API

    As just mentioned above, the C API, used by builtins and extension modules, and by programs that embed the Python interpreter, is different from the Python API. The way you get the attribute x.f is by calling PyObject_GetAttrString(x, "f") (or, if you have 'f' as a Python string object f, PyObject_GetAttr(x, f)). So, does this need to fall back the same way as x.f?

    I don’t think it does. C code generally isn’t running inside a Python frame, and doesn’t even have Python locals, or an enclosing scope, and it only has globals if it manually adds them (and accesses them) as attributes of the module object it exports to Python. So, fallback to f(x) doesn’t even really make sense. Plus, C API functions don’t always need to be generic in the same way Python functions do, and they definitely don’t need to be as convenient.

    The one problem is that the documentation for PyObject_GetAttr explicitly says “This is the equivalent of the Python expression o.attr_name.” (And similarly for the related functions.) But that’s easily fixed: just add “(except that it doesn’t fall back to a method on o built from looking up attr_name as a local, global, etc.)” to that sentence in the docs. Or, alternatively, change it to “This is the equivalent of the Python expression o.__getattribute__('attr_name')“.

    This could be a minor headache for something like Cython, that generates C API code out of Pythonesque source. Without looking at how Cython handles local, free, and global variables under the covers, I’m not sure how much of a headache. But it certainly ought to be something that Cython can cope with.

    Concrete proposal

    The expression obj.attr will fall back to a bound method on obj built from function attr.

    Abstractly, this is done by handling AttributeError from __getattribute__ to look for a binding named attr and, if it’s bound to a non-data descriptor, calling attr.__get__(obj, type(obj)) and using that as the value. This means that implementations’ builtin and extension functions should now be non-data descriptors, just like normal functions. The getattr builtin does the same fallback, but it may handle unbound locals the same way that other dynamic (by-name) lookup functions and/or eval do on a given implementation.

    CPython

    The compiler will determine whether attr is a local, freevar, cellvar, or global, just as if the expression had been attr instead of obj.attr, and then emit one of the three new opcodes LOAD_ATTR_FAST, LOAD_ATTR_DEREF, or LOAD_ATTR_GLOBAL in place of LOAD_ATTR. These function like LOAD_ATTR, except that the name is looked up in the appropriate array (e.g., co_varnames instead of co_names for LOAD_ATTR_FAST), and for the fallback behavior described above. (The exact implementation of the ceval.c patch will probably be harder to describe in English than to just write.)

    Builtin functions gain a __get__ method that binds the function to an instance. (The details are again harder to describe than to write.)

    The getattr function uses the calling frame’s locals and globals to simulate fallback lookup, as described in a previous section (but implemented in C instead of Python, of course).

    No changes are made to PyObject_GetAttr and friends except for the note in the documentation about being the same as o.attr_name. No changes at all are made to any other public C APIs.

    Analysis

    I suspect that this change would, overall, actually make Python more confusing rather than less, and might also lead to code written by novices (especially those coming from other languages) being even more inconsistent than today.

    But it really is hard to guess in advance. So, is it worth building an implementation to see?

    Well, nobody’s going to run a patched interpreter that’s this different from standard Python on enough code to really get a feel for whether this works.

    On the other hand, a bytecode hack that can be applied to functions as a decorator, or to modules as an import hook (especially if it can hook the REPL, a la MacroPy) might get some people to play with it. So, even though that sounds like not much less work, and a lot less fun, it’s probably worth doing it that way first, if it’s worth doing anything.


    1. In some languages, arrays are special, not like other classes, and sometimes this is signaled by a special API. For example, in Java, to get the length of a “native array” you write lst.length, but to get the length of an instance of an array class, you write lst.size(). Most newer languages either do away with the “native types” distinction, like Ruby, or turn it into a user-extensible “value types” distinction, like Swift.
    2. OK, none of them are like the others—nobody can agree on the name, whether it’s a method or a property (except in those languages where nullary methods and properties are the same thing), etc. But you know what I mean: everyone else has len as a method or property; Python has it as a free function.
    3. In Python 2.1 and earlier, this was true for all classes. Python 2.2 added the option to make a class inherit from a type, like class C(object): pass, in which case it was a new-style class, and new-style classes are types. Python 3.0 made all classes new-style (by making them inherit from object if they don’t have any other bases). This adds a number of other benefits besides just unifying the notions of class and type. But there are a number of historical artifacts left behind–e.g., spam.__class__ vs. type(spam).
    4. … which you don’t want to know about, unless you understand or want to learn about fun things like argument-dependent lookup and the complex overload resolution rules.
    5. N1742 in 2004 seems to be the last attempt to make it work for C++. Building it into a new static language, on the other hand, is a lot easier; C#, Swift, etc. benefited from all of the failed attempts and just had to avoid making the same choices that had painted C++ into a corner where it couldn’t fit extensions in.
    6. In Python, f(x, y) is not considered part of the public interface of x’s type in the same way as in C++. Meanwhile, dynamic lookup makes f(x, y) auto-completion even worse. And TOOWTDI has a much stronger pull.
    7. This occasionally comes up even today. For example, you can write a global function def spam(self, arg): , and then write spam = spam inside a class definition, or monkeypatch in C.spam = spam later, and now c.spam(2) works on all C instances. But if you try that with a builtin function, you get a TypeError: spam() takes exactly 2 arguments (1 given). This rarely comes up in practice, and the workarounds aren’t too onerous when they’re rare (e.g., spam = lambda self, arg: spam(self, arg)). But if this fallback became automatic, it would come up all the time, and with no easy workaround, making the fallback feature nearly useless.
    8. At the Python level, functions and unbound methods are the same type, which has a __get__ which creates a bound method when called on an object; bound methods have a __get__ that does nothing. But for C builtins, functions and bound methods are the same type (with functions being “bound” to their module), builtin_function_or_method, which has no __get__, while unbound methods have a different type, method_descriptor, which has a __get__ that works like the Python version. The smallest change would be to give builtin_function_or_method a __get__ that rebinds the method. It is occasionally useful that re-binding a Python bound method is a no-op (e.g., so you can do spam = delegate.spam in a class definition, and that delegate’s self isn’t overwritten), but I don’t think it would be a problem that re-binding a builtin bound method is different (since today, it raises an AttributeError, so no code could be relying on it).
    9. It’s actually a little more complicated than that. See How lookup works for the gory details, or just ignore the details; I’ll cover the bits that are relevant later.
    10. Of course it could use inspect.currentframe()—or, since it’s actually implemented in C, the C API equivalent—to access the locals of the caller. But that doesn’t help with the next part.

  6. Many people, when they first discover the heapq module, have two questions:

    • Why does it define a bunch of functions instead of a container type?
    • Why don't those functions take a key or reverse parameter, like all the other sorting-related stuff in Python?

    Why not a type?

    At the abstract level, it's often easier to think of heaps as an algorithm rather than a data structure.

    For example, if you want to build something like nlargest, it's usually easier to understand that larger algorithm in terms of calling heappush and heappop on a list, than in terms of using a heap object. (And certainly, things like nlargest and merge make no sense as methods of a heap object—the fact that they use one internally is irrelevant to the caller.)

    And the same goes for building data types that use heaps: you might want a timer queue as a class, but that class's implementation is going to be more readable using heappush and heappop than going through the extra abstraction of a heap class.

    Also, even when you think of a heap as a data type, it doesn't really fit in with Python's notion of collection types, or most other notions. Sure, you can treat it as a sequence—but if you do, its values are in arbitrary order, which defeats the purpose of using a heap. You can only access it in sorted order by doing so destructively. Which makes it, in practice, a one-shot sorted iterable—which is great for building iterators on top of (like merge), but kind of useless for storing as a collection in some other object. Meanwhile, it's mutable, but doesn't provide any of the mutation methods you'd expect from a mutable sequence. It's a little closer to a mutable set, because at least it has an equivalent to add--but that doesn't fit either, because you can't conceptually (or efficiently, even if you do want to break the abstraction) remove arbitrary values from a heap.

    But maybe the best reason not to use a Heap type is the answer to the next question.

    Why no keys?

    Almost everything sorting-related in Python follows list.sort in taking two optional parameters: a key that can be used to transform each value before comparing them, and a reverse flag that reverses the sort order. But heapq doesn't.

    (Well, actually, the higher-level functions in heapq do--you can use a key with merge or nlargest. You just can't use them with heappush and friends.)

    So, why not?

    Well, consider writing nlargest yourself, or a heapsort, or a TimerQueue class. Part of the point of a key function is that it only gets called once on each value. But you're going to call heappush and heappop N times, and each time it's going to have to look at about log N values, so if you were applying the key in heappush and heappop, you'd be applying it about log N times to each value, instead of just once.

    So, the right place to put the key function is in whatever code wraps up the heap in some larger algorithm or data structure, so it can decorate the values as they go into the heap, and undecorate them as they come back out. Which means the heap itself doesn't have to understand anything about decoration.

    Examples

    The heapq docs link to the source code for the module, which has great comments explaining how everything works. But, because the code is also meant to be as optimized and as general as possible, it's not as simple as possible. So, let's look at some simplified algorithms using heaps.

    Sort

    You can easily sort objects using a heap, just by either heapifying the list and popping, or pushing the elements one by one and popping. Both have the same log-linear algorithmic complexity as most other decent sorts (quicksort, timsort, plain mergesort, etc.), but generally with a larger constant (and obviously the one-by-one has a larger constant than heapifying).

    def heapsort(iterable):
        heap = list(iterable)
        heapq.heapify(heap)
        while heap:
            yield heapq.heappop(heap)
    
    Now, adding a key is simple:
    def heapsort(iterable, key):
        heap = [(key(x), x) for x in iterable]
        heapq.heapify(heap)
        while heap:
            yield heapq.heappop(heap)[1]
    
    Try calling list(heapsort(range(100), str)) and you'll see the familiar [0, 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, ...] that you usually only get when you don't want it.

    If the values aren't comparable, or if you need to guarantee a stable sort, you can use [(key(x), i, x) for i, x in enumerate(iterable)]. That way, two values that have the same key will be compared based on their original index, rather than based on their value. (Alternatively, you could build a namedtuple around (key(x), x) then override its comparison to ignore the x, which saves the space for storing those indices, but takes more code, probably runs slower, and doesn't provide a stable sort.) The same is true for the examples below, but I generally won't bother doing it, because the point here is to keep things simple.
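
    For example, a sketch of that decorated variant (the function name is mine):

    import heapq

    def stable_heapsort(iterable, key):
        # decorate with the original index so equal keys compare on position,
        # never on the (possibly incomparable) values themselves
        heap = [(key(x), i, x) for i, x in enumerate(iterable)]
        heapq.heapify(heap)
        while heap:
            yield heapq.heappop(heap)[-1]
    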

    nlargest

    To get the largest N values from any iterable, all you need to do is keep track of the largest N so far, and whenever you find a bigger one, drop the smallest of those N.
    def nlargest(iterable, n):
        heap = []
        for value in iterable:
            heapq.heappush(heap, value)
            if len(heap) > n:
                heapq.heappop(heap)
        return heap
    
    To add a key:
    def nlargest(iterable, n, key):
        heap = []
        for value in iterable:
            heapq.heappush(heap, (key(value), value))
            if len(heap) > n:
                heapq.heappop(heap)
        return [kv[1] for kv in heap]
    
    This isn't stable, and gives you the top N in arbitrary rather than sorted order, and there's lots of scope for optimization here (again, the heapq.py source code is very well commented, so go check it out), but this is the basic idea.

    One thing you might notice here is that, while collections.deque has a nice maxlen attribute that lets you just push things on the end without having to check the length and pop off the back, heapq doesn't. In this case, it's not because it's useless or complicated or potentially inefficient, but because it's so trivial to add yourself:
    def heappushmax(heap, value, maxlen):
        if len(heap) >= maxlen:
            heapq.heappushpop(heap, value)
        else:
            heapq.heappush(heap, value)
    
    And then:
    def nlargest(iterable, n):
        heap = []
        for value in iterable:
            heappushmax(heap, value, n)
        return heap
    

    merge

    To merge (pre-sorted) iterables together, it's basically just a matter of sticking their iterators in a heap, with their next value as a key, and each time we pop one off, we put it back on keyed by the next value:
    def merge(*iterables):
        iterators = map(iter, iterables)
        heap = [(next(it), i, it) 
                for i, it in enumerate(iterators)]
        heapq.heapify(heap)
        while heap:
            nextval, i, it = heapq.heappop(heap)
            yield nextval
            try:
                nextval = next(it)
            except StopIteration:
                pass
            else:
                heapq.heappush(heap, (nextval, i, it))
    
    Here, I did include the index, because most iterables either aren't comparable or are expensive to compare, so it's a bit more serious of an issue if two of them have the same key (next element).

    (Note that this implementation won't work if some of the iterables can be empty, but if you want that, it should be obvious how to do the same thing we do inside the loop.)
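
    If you do want empty iterables handled, here's one way to do what that parenthetical suggests—skip exhausted iterators up front the same way we drop them inside the loop (a sketch, not the stdlib version):

    import heapq

    def merge(*iterables):
        heap = []
        for i, it in enumerate(map(iter, iterables)):
            try:
                heap.append((next(it), i, it))
            except StopIteration:
                pass    # empty iterable: just leave it out of the heap
        heapq.heapify(heap)
        while heap:
            nextval, i, it = heapq.heappop(heap)
            yield nextval
            try:
                nextval = next(it)
            except StopIteration:
                pass
            else:
                heapq.heappush(heap, (nextval, i, it))
    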

    What if we want to attach a key to the values as well? The only tricky bit is that we want to transform each value of each iterable, which is one of the few good cases for a nested comprehension: use the trivial comprehension (key(v), v) for v in iterable in place of the iter function.
    def merge(*iterables, key):
        iterators = (((key(v), v) for v in iterable) for iterable in iterables)
        heap = [(next(it), i, it) for i, it in enumerate(iterators)]
        heapq.heapify(heap)
        while heap:
            nextval, i, it = heapq.heappop(heap)
            yield nextval[-1]
            try:
                nextval = next(it)
            except StopIteration:
                pass
            else:
                heapq.heappush(heap, (nextval, i, it))
    
    Again, there are edge cases to handle and optimizations to be had, which can be found in the module's source, but this is the basic idea.

    Summary

    Hopefully all of these examples show why the right place to insert a key function into a heap-based algorithm is not at the level of heappush, but at the level of the higher-level algorithm.

  7. Currently, in CPython, if you want to process bytecode, either in C or in Python, it’s pretty complicated.

    The built-in peephole optimizer has to do extra work fixing up jump targets and the line-number table, and just punts on many cases because they’re too hard to deal with. PEP 511 proposes a mechanism for registering third-party (or possibly stdlib) optimizers, and they’ll all have to do the same kind of work.

    Code that processes bytecode in function decorators or import hooks, whether for optimization or for other purposes like extending the language, is usually written in Python rather than C, but it’s no simpler there, so such code usually has to turn to a third-party library like byteplay.

    Compile process

    All of the details are very well documented, but I’ll give a brief summary of the important parts, to make it easy to see where things hook in and what information is available there.

    When the Python interpreter starts, it reads interactive input, stdin, or a script file, and sets up a tokenizer. Then, it parses iteratively, generating an AST (abstract syntax tree) for each top-level statement. For each one:

    • Assuming PEP 511 is accepted, it will first pass the AST node through the chain of all registered AST processors.

    • Next, it calls compile on the node,1 which compiles it into an internal structure, assembles that into bytecode and associated values, passes that through the optimizer, and builds a code object.

      • As of Python 3.5, “the optimizer” means the built-in peephole optimizer; assuming PEP 511, it will mean the peephole optimizer plus the chain of all registered bytecode processors.
    • The code object is then passed to exec.

    Compiling a statement may recursively call the compiler. For example, a function definition first compiles the function body, then compiles the def statement into a statement code object that makes a function out of that function body’s code object. And functions and classes can of course be nested.

    The compiler can also be triggered at runtime, e.g., by a call to compile or exec with source or an AST–or, probably more commonly, by the import system. The default import loader for .py files calls compile on the entire file, which builds an AST for the entire file, then performs the same steps as done for each input statement, to build a module code object. However, an import hook can change this–most simply, by explicitly parsing the source to an AST, processing the AST, then compiling the result, or by processing the bytecode resulting from compile.

    It’s also possible to post-process code at runtime. A function decorator is just a normal function called with a function object–but that function object contains a code object, and a decorator can modify the function to use a different code object, or just return a new function.

    Bytecode formats

    Bytecode and code objects

    Bytecode is, as the name implies, just a string of bytes.

    The peephole optimizer (and, in the current version of PEP 511, any plugin optimizers) receives this bytecode string, plus a few associated values as in-out parameters. (See PyCode_Optimize in the source.)

    Decorators and import hooks receive a code object, which includes the bytecode string and a much larger set of associated values, whose members are described in the inspect docs.

    Bytecode is nicely explained in the dis documentation, but I’ll briefly cover it here, and explain where the headaches come in.

    Each operation takes either 1 byte (opcodes below HAVE_ARGUMENT) or 3 (opcodes >= HAVE_ARGUMENT, where the extra 2 bytes form a short integer argument). If an operation needs an argument too large to fit, it’s prefixed by a special EXTENDED_ARG opcode, which provides 2 more bytes.2
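
    To make that layout concrete, here's a rough sketch of walking a raw bytecode string in this (pre-wordcode) format; iter_ops and its EXTENDED_ARG folding are just illustrative, not anything from the stdlib:

    import dis

    def iter_ops(code_bytes):
        # Rough sketch: walk the 1/3/6-byte format described above, yielding
        # (opcode, argument) pairs and folding any EXTENDED_ARG prefix into
        # the following instruction's argument.
        i = ext = 0
        while i < len(code_bytes):
            op = code_bytes[i]
            if op >= dis.HAVE_ARGUMENT:
                arg = code_bytes[i+1] | (code_bytes[i+2] << 8) | ext
                i += 3
            else:
                arg = None
                i += 1
            if op == dis.EXTENDED_ARG:
                ext = arg << 16
                continue
            ext = 0
            yield op, arg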

    Argument values

    Most arguments are stored as offsets into arrays that get stored with the bytecode. There are separate arrays for constant values, local names, free variable names, cell variable names, and other names (globals, attributes, etc.).

    For example, a LOAD_CONST None may be stored as a LOAD_CONST with argument 1, meaning that it loads the code object’s co_consts[1] at runtime.
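
    A quick way to see this for yourself (a tiny, hedged illustration; the exact indices depend on the code object):

    import dis

    def f():
        x = 'spam'
        return x

    print(f.__code__.co_consts)   # the constants stored with the code object
    dis.dis(f)                    # the LOAD_CONST argument is the index of
                                  # 'spam' in that co_consts tuple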

    Jumps

    Jump instructions (which include things like SETUP_FINALLY or FOR_ITER, which indirectly include jumps) come in two forms: relative, and absolute. Either way, jump offsets or targets are in terms of bytes. So, jumping from the 3rd instruction to the 9th might be jumping over 6 bytes, or 19.
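
    If you want to poke at these from Python, something like this little sketch (show_jumps is just an illustrative name) lists a function's jumps along with their raw arguments and the targets dis resolves for them:

    import dis

    def show_jumps(func):
        for instr in dis.get_instructions(func):
            if instr.opcode in dis.hasjrel:
                kind = 'relative'   # counted from the end of the jump instruction
            elif instr.opcode in dis.hasjabs:
                kind = 'absolute'   # the target offset itself
            else:
                continue
            print(instr.offset, instr.opname, kind, instr.arg, '->', instr.argval)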

    Obviously, this means that inserting or removing an instruction, or changing its argument to now need or no longer need EXTENDED_ARG, requires changing the offsets of any relative jumps over that instruction, and the targets of any absolute jumps beyond that instruction. Worse, in some cases, fixing up a jump will push its argument over or under the EXTENDED_ARG limit, cascading further changes.

    Special arguments

    A few other opcodes assign special meanings to their arguments. For example, CALL_FUNCTION treats its argument as two bytes rather than one short, with the high byte representing the count of positional arguments pushed on the stack, and the low byte representing the count of keyword argument pairs pushed on the stack. MAKE_FUNCTION and MAKE_CLOSURE are the most complicated such opcodes.
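
    So decoding a CALL_FUNCTION argument in this format looks something like this (a two-line sketch):

    def decode_call_function_arg(arg):
        # low byte: count of positional arguments pushed on the stack
        # high byte: count of keyword (name, value) pairs pushed on the stack
        return arg & 0xFF, (arg >> 8) & 0xFF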

    lnotab

    One of the associated values is a compressed line-number table that maps between bytecode offsets and source line numbers. This is complicated enough that it needs a special file, lnotab_notes.txt, to explain it. This of course also needs fixups whenever the first opcode of any source line moves to a new offset.

    Assembler format

    The compiler has a pretty simple and flexible structure that it uses internally, which can be seen at the top of compile.c.

    The compiler first builds a tree of blocks, each containing an array of struct instr instruction objects. Because Python doesn’t have arbitrary goto or similar instructions, all jumps are always to the start of a block, so the compiler just represents jump targets as pointers to blocks. For non-jump arguments, the instruction objects also hold their actual argument (a 32-bit integer for cases like MAKE_FUNCTION, but a const value, variable name, etc. rather than an array index for typical opcodes). Instructions also hold their source line number.

    The compiler linearizes the tree of blocks into a linked list. Then it starts the “assembler” stage, which walks that list, emitting the bytecode and lnotab as it goes, and keeping track of the starting offset of each block. It then runs a fixup step, where it fills in the jump targets and adjusts the lnotab. If any of the jump targets get pushed into EXTENDED_ARG territory, then the block offsets have to be updated and the fixup rerun. It repeats this until there are no changes.

    The assembler’s list-of-blocks representation would be very easy to work with. Iterating over instructions in linear order is slightly complicated by the fact that you have to iterate a linked list, then an array within each node, but it’s still easier than walking 1, 3, or 6 bytes at a time.3

    Meanwhile, you could add and delete instructions just by changing the array of instructions, without invalidating any jump targets or line-number mappings. And you can change arguments without worrying about whether they push you over or under the EXTENDED_ARG limit.

    However, the compiler then calls the optimizer after this fixup step. This means the optimizer has to walk the more complicated packed bytecode, and has to repeat some of the same fixup work (or, as currently implemented, does only the simplest possible fixup work, and punts and doesn’t optimize if anything else might be necessary). The current peephole optimizer is kept pretty limited because of these restrictions and complexities. But if PEP 511 adds plugin optimizers, they’re all going to have to deal with these headaches. (And, besides the complexity, there’s obviously a performance cost to repeating the fixup work once for every optimizer.)

    dis format

    The dis module disassembles bytecode into a format that’s easier to understand: a Bytecode object can be constructed from a code object, which wraps up all of the associated values, and acts as an iterable of Instruction objects. These objects are somewhat similar to the struct instr used by the assembler, in that they hold the argument value and line number. However, jumps are still in terms of bytes rather than (blocks of) instructions, and EXTENDED_ARG is still treated as a separate instruction.

    So, this format would be a little easier to process than the raw bytecode, but it would still require jump fixups, including cascading jump fixups over EXTENDED_ARG changes. Plus, there is no assembler that can be used to go back from the Bytecode object to a code object. So, you need to write code that iterates the instructions and emits bytecode and the lnotab and then does the same fixup work as the compiler. This code is usually in Python, rather than in C, which makes it a little simpler–but it’s still pretty painful.
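
    Reading code this way is already pleasant; for example (a quick sketch):

    import dis

    def double(x):
        return x * 2

    bc = dis.Bytecode(double)   # wraps the code object and its associated values
    for instr in bc:            # an iterable of Instruction objects
        print(instr.offset, instr.opname, instr.argrepr)
    # ...but there's nothing here to assemble those Instructions back into a
    # code object.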

    byteplay format

    The third-party byteplay module has a byteplay.Code type that expands on the dis.Bytecode model by removing the EXTENDED_ARG instructions and synthesizing fake instructions to make things simpler: a SetLineNo instruction to represent each lnotab entry, and a Label instruction to represent each jump target. (It’s also easy to create new SetLineNo and Label instructions as needed.)

    One way to look at this is that byteplay lets you edit code for a simple “macro assembler”, instead of for a raw assembler of the kind that used to be built into the ROM of old home computers.

    Unlike dis, byteplay also works recursively: if a function has a nested function or class definition,4 its code object is already disassembled to a byteplay.Code object.

    The byteplay disassembly also acts as a mutable sequence of instructions, rather than an immutable, non-indexable iterable, and allows replacing instructions with simple opcode-argument 2-tuples instead of having to build instruction objects.

    Finally, byteplay provides a method to assemble the instructions back into a code object–which automatically takes care of building the value and name arrays, calculating the jump targets, building the lnotab, etc.

    Together, all of these changes allow processing instructions without worrying about any of the complicated issues. Instead of being a sequence of variable-length instructions that contain references to byte offsets within that sequence and external arrays along with an associated compacted table of line numbers, bytecode is just a sequence of (opcode, argument) pairs.

    On top of that, byteplay also includes code to check that the stack effects of all of the instructions add up, which helps quite a bit for catching common errors that would otherwise be hard to debug.
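
    Putting that together, a typical byteplay transformation looks something like this (a rough sketch from memory of its API; the details vary a bit between byteplay versions):

    from byteplay import Code, LOAD_GLOBAL, LOAD_CONST

    def constify_len(func):
        # Replace every LOAD_GLOBAL of 'len' with a LOAD_CONST of the builtin,
        # then reassemble; to_code() takes care of the jumps, the lnotab, and
        # the const/name arrays.
        c = Code.from_code(func.__code__)
        c.code = [(LOAD_CONST, len) if (op, arg) == (LOAD_GLOBAL, 'len')
                  else (op, arg)
                  for op, arg in c.code]
        func.__code__ = c.to_code()
        return func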

    The biggest problem with byteplay is that it’s somewhat complicated code that has to change with each new version of Python, but generally hasn’t tracked those new versions very well. It took years to get it compatible with Python 3.x at all, support for 2.7 and each new 3.x version came late, and it had (and may still have) longstanding bugs on those versions. As of the 3.4 improvements to dis, it now builds some of its work on top of that module when available (and dis, being in the stdlib, always tracks changes), but so far, each new version has still required patches that were months behind the release. And of course the need to be backward compatible with a wide range of supported Python versions also complicates things.

    A smaller problem with byteplay is that, while it does a lot of the same things as dis (especially now that it’s partly built on dis), it does them in different terms. For example, the equivalent of dis.Bytecode is called byteplay.Code, not Bytecode; it has a @classmethod alternate constructor from code objects instead of a normal constructor; that constructor can only take actual code objects rather than extracting them from things like functions automatically; it isn’t directly iterable but instead has a code attribute that is; etc.

    Improving the process

    AST processing

    Import hooks already provide a simple way to hook the process between AST parsing and compilation, and the ast module makes processing the ASTs pretty easy. (Some changes can be done even earlier, at the source code or token level.)

    Many of the optimizations and other processing tasks people envision (and much of what the current peephole optimizer does) could also be done on the AST, so PEP 511 allows for AST optimizers as well as bytecode optimizers.
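
    For example, here's a minimal sketch of an AST-level transformation with ast.NodeTransformer (using the 3.5-era ast.Num nodes; ConstantFolder is just an illustrative name):

    import ast

    class ConstantFolder(ast.NodeTransformer):
        # Fold `2 + 3`-style additions of number literals into a single constant.
        def visit_BinOp(self, node):
            self.generic_visit(node)
            if (isinstance(node.op, ast.Add)
                    and isinstance(node.left, ast.Num)
                    and isinstance(node.right, ast.Num)):
                return ast.copy_location(ast.Num(node.left.n + node.right.n), node)
            return node

    tree = ast.parse('x = 2 + 3')
    tree = ast.fix_missing_locations(ConstantFolder().visit(tree))
    code = compile(tree, '<example>', 'exec')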

    However, not everything can be done this way–e.g., eliminating redundant store-load pairs when the stored value is never used again. Plus, there’s no way to write a post-processing function decorator in AST terms–at that point, you’ve got a live function object.

    So, this is (already) part of the solution, but not a complete solution on its own.

    Assembler structures

    As explained earlier, the assembler structures are much easier to work on than the compiled bytecode.

    And there’s really no good reason the peephole optimizer couldn’t be called with these structures. Rather than emitting bytecode, fixing it up, and passing it to some code that tries to partially undo that work to make changes and then redo the work it undid seems somewhat silly, and it’s pretty complicated.

    Unfortunately, exposing the assembler structures to third-party code, like the bytecode processors envisioned by PEP 511, is a different story. The peephole optimizer is essentially part of the compiler; PEP 511 optimizers aren’t, and shouldn’t be given access to structs that are both dangerous (it’s very easy to segfault the compiler by messing with these structs) and brittle (the structs could change between versions, and definitely should be allowed to). Plus, some of the optimizers will be written in Python.

    So, we’d have to provide some way to wrap these up and expose a C API and Python API for dealing with blocks of instructions. But we probably don’t want the compiler and assembler themselves to work in terms of this C API instead of working directly on their internal structs.

    Meanwhile, for the peephole optimizer and PEP 511 processors, it makes sense to process the code before it’s been fixed up and turned into bytecode–but that obviously doesn’t work for import hooks, or for decorators. So, we’d need some function to convert back from bytecode to a more convenient format, and then to convert from that format back to bytecode and fix it up.

    So, at this point, we’re really not talking about exposing the assembler structures, but designing a similar public structure, and functions to convert that to and from bytecode. At which point there’s no reason it necessarily has to be anything like the assembler structs.

    PyBytecode

    Once you look at it in those terms, what we’re looking for is exactly what byteplay is. If byteplay were incorporated into the stdlib, it would be a lot easier to keep it tracking the dis module and bytecode changes from version to version: any patch that would break byteplay has to instead include the changes to byteplay.

    But byteplay is pure Python. We need something that can be called from C code.

    So, what I’m imagining is a PyBytecode object, which is similar to a byteplay.Code object (including having pseudo-ops like SetLineNum and Label), but with a C API, and hewing closer to the dis model (and to PEP 8). The assembler stage of the compiler, instead of emitting a string of bytecode and related objects, builds up a PyBytecode object. That’s the object that the peephole optimizer and PEP 511 processors work on. Then, the compiler calls PyBytecode_ToCode, which generates executable bytecode and associated objects, does all the fixups, and builds the code object.

    The PyBytecode type would be exposed to Python as, say, dis.Bytecode (possibly renaming the existing thing to dis.RawBytecode). An import hook or decorator can call PyBytecode_FromCode or the dis.Bytecode constructor, process it, then call PyBytecode_ToCode or the to_code method to get back to executable bytecode. A PEP 511 processor written in Python would get one of these objects, but wouldn’t have to import dis to use it (although it would probably be convenient to do so in order to get access to the opcodes, etc.).

    We could, like byteplay, allow instructions to be just 2-sequences of opcode and argument instead of Instruction objects–and make those objects namedtuples, too, a la the tokenize module. We could even allow PEP 511 processors to return any iterable of such instructions. There’s some precedent for this in the way tokenize works (although it has its own kinds of clunkiness). And this would mean that simple one-pass processors could actually be written as generators:

    def constify(bytecode: dis.Bytecode):
        for op, arg in bytecode:
            if op == dis.LOAD_GLOBAL:
                yield (dis.LOAD_CONST, eval(arg, globals()))
            else:
                yield (op, arg)
    

    If you wanted to write a decorator that just constifies a specified list of globals, it would look something like this:

    def constify(*names, globals=None):
        mapping = {name: eval(name, globals) for name in names}
        def process(bytecode: dis.Bytecode):
            for op, arg in bytecode:
                if op == dis.LOAD_GLOBAL and arg in mapping:
                    yield (dis.LOAD_CONST, mapping[arg])
                else:
                    yield (op, arg)
        def deco(func):
            func.__code__ = dis.Bytecode(process(dis.Bytecode(func.__code__))).to_code()
            return func
        return deco
    

    For a more complicated example, here’s a PEP 511 processor to eliminate double jumps:

    def dejumpify(bytecode: dis.Bytecode):
        for instr in bytecode:
            if isinstance(instr.argval, dis.Label):
                target = bytecode[instr.argval.offset + 1]
                if target.opcode in {dis.JUMP_FORWARD, dis.JUMP_ABSOLUTE}:
                    yield (instr.opcode, target.argval)
                    continue
            yield instr
    

    Of course not everything makes sense as a generator. So, Bytecode objects should still be mutable as sequences.

    And, to make things even simpler for in-place processors, the to_code method can skip NOP instructions, so you can delete an instruction by writing bytecode[i] = (dis.NOP, None) instead of bytecode[i:i+1] = [] (which means you can do this in the middle of iterating over bytecode without losing your place).
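
    For example, here's a sketch in terms of the proposed API (as in the earlier examples, dis.NOP and friends stand in for however the opcodes end up being spelled):

    def drop_global_pops(bytecode: dis.Bytecode):
        # NOP out any LOAD_GLOBAL that's immediately popped, without touching
        # offsets, labels, or the lnotab.
        for i in range(len(bytecode) - 1):
            if (bytecode[i][0] == dis.LOAD_GLOBAL
                    and bytecode[i+1][0] == dis.POP_TOP):
                bytecode[i] = (dis.NOP, None)
                bytecode[i+1] = (dis.NOP, None)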

    Unpacked opcodes

    My initial thought (actually, Serhiy Storchaka came up with it first) was that “unpacking” bytecode into a fixed-length form would be enough of a simplification.

    From Python, you’d call c.unpacked() on a code object, and get back a new code object where there are no EXTENDED_ARGs, and every opcode is 4 bytes5 instead of 1, 3, or 6. Jump targets and the lnotab are adjusted accordingly. The lnotab is also unpacked into 8-byte entries each holding a line number and and offset, instead of 2-byte entries holding deltas and special handling for large deltas.

    So, you do your processing on that, and call uc.packed() to pack it, remove NOPs, and redo the fixups, giving you back a normal code object.6

    This means you still do have to do some fixup if you rearrange any arguments, but at least it’s much easier fixup.

    This also means that dis, as it is today, is really all the help you need. For example, here’s the constify decorator from above:

    def constify(*names, globals=None):
        mapping = {name: eval(name, globals) for name in names}
        def process(code: types.CodeType):
            bytecode = bytearray(code.co_code)
            consts = list(code.co_consts)
            for i in range(0, len(bytecode), 4):
                if bytecode[i] == dis.LOAD_GLOBAL:
                    arg, = struct.unpack('<I', bytecode[i+1:i+4] + b'\0')
                    name = code.co_names[arg]
                    if name in mapping:
                        try:
                            const = consts.index(mapping[name])
                        except ValueError:
                            const = len(consts)
                            consts.append(mapping[name])
                        bytecode[i] = dis.LOAD_CONST
                        bytecode[i+1:i+4] = struct.pack('<I', const)[:3]
            # (then return a new unpacked code object rebuilt from the modified
            # bytecode and consts; the boilerplate for that is omitted here)
        def deco(func):
            func.__code__ = process(func.__code__.unpacked()).packed()
            return func
        return deco
    

    Not quite as nice as the byteplay-ish example, but still not terrible.

    However, after implementing a proof of concept for this, I no longer think this is the best idea. That easier fixup is still too much of a pain for simple processors. For example, to add an instruction at offset i, you still have to scan every instruction from 0 to i for any jrel > i, and every instruction for any jabs > i, and add 4 to all of them, and also go through the lnotab and add 4 to the offset for every entry > i.

    Plus, this is obviously horribly inefficient if you’re going to make lots of insertions–which nobody ever is, but people are still going to do overcomplicated things like build a balanced sort tree of jumps and of lnotab entries so they can make those fixups logarithmic instead of linear.

    Of course if you really want to get efficient, there’s a solution that’s constant time, and also makes things a lot simpler: just use relocatable labels instead of offsets, and do all the fixups in a single pass at the end. And now we’re encouraging everyone to build half of byteplay themselves for every code processor.

    And it’s not like we save that much in compiler performance or simplicity–the fixup needed in pack isn’t much simpler than what’s needed by the full byteplay-style design (and the pack fixup plus the manual fixup done in the processor itself will probably be slower than the byteplay fixup, not faster). Also, while having the same code object to represent both packed and unpacked formats seems like a savings, in practice it’ll probably lead to confusion more often than less to learn about.

    Why can’t we just let people use byteplay if they want?

    Well, for import hooks and decorators, they already can, and do. For PEP 511 optimizers, as long as they’re written in Python, they can.

    The peephole optimizer, of course, can’t use byteplay, if for no other reason than it’s built into Python and byteplay isn’t. But maybe we could just leave that alone. People who want more will install PEP 511 optimizers.

    There is, of course, a cost to building a code object and then converting back and forth between code and byteplay.Code once for each optimizer, instead of building a byteplay.Code (or PyBytecode) object, passing it through all the optimizers, and then converting to code once. Remember that converting to code involves that repeated fixup loop and other stuff that might be kind of slow to do multiple times. Then again, it might still be so simple that the cost is negligible. The only way we’ll really know is to build PEP 511 as currently designed, build a bunch of optimizers that manually use byteplay, then build PEP 511 with PyBytecode and rewrite the optimizers to use that, and profile times for imports, trivial scripts, compileall, etc. both ways.

    Besides the probably-irrelevant performance issues, there are other advantages to PyBytecode. They’ve mostly been mentioned earlier, but to put them all in one place:

    • PyBytecode is automatically always up to date, instead of lagging behind Python releases like byteplay.
    • The compiler gets a bit simpler.
    • A big chunk of what byteplay and the compiler do today is the same work; better to have one implementation than two.
    • The peephole optimizer gets a lot simpler (although, as a change from today, “a lot simpler” is still more work than “unchanged”).
    • The peephole optimizer gets a little better (doesn’t punt as often).
    • If we later decide to build more optimizations into CPython itself as bytecode processors, it will be a lot simpler (which probably means we’ll be a lot more likely to build such optimizations).
    • PyBytecode comes with Python, instead of requiring a third-party install. (Probably not a huge benefit–anyone installing a third-party PEP 511 optimizer can install its dependencies, and the number of people who need to write an optimizer for local use but for some reason can’t use any third-party modules is probably vanishingly small.)
    • It seems like a fun project.

    Overall, none of them seem so hugely compelling that I’d say someone ought to do this–except maybe the last one. :) If I get the time, I’ll try to build it, and rebuild the peephole optimizer, and maybe modify the latest PEP 511 patch as well. Then, if it looks promising, it might be worth suggesting as a patch to CPython to go along with PEP 511.



    1. Well, not really. It just does the same stuff that’s wrapped up by the Python compile function (when called on an AST).
    2. The bytes within an opcode’s argument are in little-endian format, but EXTENDED_ARG holds higher-order bytes, so the overall format for a 4-byte arg is mixed-endian, a la VAX.
    3. Most code actually just walks 1 or 3 bytes at a time, and treats EXTENDED_ARG like a separate instruction, keeping track of the current extended-arg value in an extra register.
    4. Comprehensions are compiled into a function definition and call, so you often have nested functions without realizing it.
    5. Yes, this would mean that you can no longer have absolute or relative jumps over 2**24, const/name/etc. arrays longer than 2**24, or annotations for more than 255 parameters (currently, you can have up to 32767 of these, but you can only have 512 total arguments). I don’t think anyone would complain about any of these limits. But if it really is a problem, we could use 5 or 8 bytes per instruction instead of 4.
    6. We could even allow code in unpacked form to be executed–I played with the ceval loop a bit, and it’s not at all hard to make this work, and doesn’t seem likely to have a significant performance impact when you’re executing packed code. But I don’t see a good reason for that, and it’s a lot simpler to just raise a SystemError('cannot execute unpacked code').

  8. One common "advanced question" on places like StackOverflow and python-list is "how do I dynamically create a function/method/class/whatever"? The standard answer is: first, some caveats about why you probably don't want to do that, and then an explanation of the various ways to do it when you really do need to.

    But really, creating functions, methods, classes, etc. in Python is always already dynamic.
    Some cases of "I need a dynamic function" are just "Yeah? And you've already got one". More often, you do need something a little more complicated, but still something Python already gives you. Occasionally, even that isn't enough. But, once you understand how functions, methods, classes, etc. work in Python, it's usually pretty easy to understand how to do what you want. And when you really need to go over the edge, almost anything you can think of, even if it's almost always a bad idea, is probably doable (either because "almost always" isn't "always", or just because almost nobody would ever think to try to do it, so it wasn't worth preventing).

    Functions

    A normal def statement compiles to code that creates a new function object at runtime.

    For example, consider this code:
    def spam(x):
        return x+1
    Let's say you type that into the REPL (the interactive interpreter). The REPL reads lines until it gets a complete statement, parses and compiles that statement, and then interprets the resulting bytecode. And what does that definition get compiled to? Basically the same thing as this:
    spam = types.FunctionType(
        compile('return x+1\n', '__main__', mode='function'),
        globals(),
        'spam',
        (),
        ())
    spam.__qualname__ = 'spam'
    You can't quite write this, because the public interface for the compile function doesn't expose all of the necessary features--but outside of that compile, the rest is all real Python. (For simple lambda functions, you can use, e.g., compile('1 + 2', '__main__', mode='eval') and the whole thing is real Python, but that doesn't work for def functions. When you really need to create code objects, there are ways to do it, but you very rarely need to, so let's not worry about that.)

    If you put the same thing in a module instead of typing it at the REPL, the only difference is that the body is compiled ahead of time and stored in a marshaled code object inside the .pyc file so it never needs to be compiled again. The def statement is still compiled and then interpreted as top-level module code that constructs a function on the fly out of that code constant, every time you import the module.

    For a slightly more complicated example, consider this:
    def add_one(x: int=0) -> int:
        """Add 1"""
        return x+1
    This is equivalent to:
    add_one = types.FunctionType(
        compile('return x+1\n', '__main__', mode='function'),
        globals(),
        'add_one',
        (0,), # This is where default values go
        ())
    add_one.__qualname__ = 'add_one'
    add_one.__doc__ = """Add 1"""
    add_one.__annotations__ = {'x': int, 'return': int}
    Notice that the default values are passed into that FunctionType constructor. That's why defaults are bound in at the time the def statement is executed, which is how you can do tricks like using a dict crammed into a default value as a persistent cache.

    Closures

    The real point of functions always being created on the fly is that this means any function can be a closure--it can capture values from the environment that the function was defined in. The standard Lisp-style example looks like this:
    def make_adder(n):
        def adder(x):
            return x+n
        return adder
    That's equivalent to:
    adder = types.FunctionType(
        compile('return x+n', '__main__', mode='exec'),
        globals(),
        'adder',
        (),
        (CellType(locals(), 'n'),)) # tuple of closure cells
    adder.__qualname__ = 'make_adder.<locals>.adder'
    So every time you call make_adder, you get back a new adder function, created on the fly, referencing the particular n local variable from that particular call to make_adder.

    (Unfortunately, I cheated a bit. Unlike function objects, and even code objects, you can't actually manually create closure cells like this. But you rarely want to. And if you ever do need it, you can just do a trivial lambda that captures n and then do a minor frame hack to get at the cell.)

    Even if you never want to do anything this Lispy, closures get used all over the place. For example, if you've ever written a Tk GUI, you may have done something like this in your Frame subclass:
    def __init__(self):
        Frame.__init__(self)
        self.hello_button = tkinter.Button(
            self, text='Hello',
            command=lambda: self.on_button_click(self.hello_button))
    That lambda is creating a new function that captures the local self variable so it can access self.hello_button whenever the button is clicked.

    (A lambda compiles in almost the same way as a def, except that it's a value in the middle of an expression rather than a statement, and it doesn't have a name, docstring, etc.).

    Another common way to write the same button is with functools.partial:
        self.hello_button = tkinter.Button(
            self, text='Hello',
            command=partial(self.on_button_click, self.hello_button))
    If Python didn't come with partial, we could easily write it ourselves:
    def partial(func, *args, **kw):
        def wrapped(*more_args, **more_kw):
            return func(*args, *more_args, **kw, **more_kw)
        return wrapped
    This is also how decorators work:
    def simple_memo(func):
        cache = {}
        def wrapped(*args):
            args = tuple(args)
            if args not in cache:
                cache[args] = func(*args)
            return cache[args]
        return wrapped
    
    @simple_memo
    def fib(n):
        if n < 2: return n
        return fib(n-1) + fib(n-2)
    I've written a dumb exponentially-recursive Fibonacci function, and that @simple_memo magically turns it into a linear-time function that takes a fraction of a second instead of hours. How does this work? Simple: after the usual fib = types.FunctionType blah blah stuff, it does fib = simple_memo(fib). That's it. Because functions are already always created on the fly, decorators don't need anything complicated.

    By the way, if you can follow everything above, you pretty much know all there is to know about dynamic higher-order programming, except for how the theory behind it maps to advanced math. (And that part is simple if you already know the math, but meaningless if you don't.) That's one of those things that sounds scary when functional programmers talk about it, but if you go from using higher-order functions to building them to understanding how they work before the theory, instead of going from theory to implementation to building to using, it's not actually hard.

    Fake functions

    Sometimes, you can describe what code should run when you call spam, but it's not obvious how to construct a function object that actually runs that code. Or it's easy to write the closure, but hard to think about it when you later come back and read it.

    In those cases, you can create a class with a __call__ method, and it acts like a function. For example:
    class Adder:
        def __init__(self, n):
            self._n = n
        def __call__(self, x):
            return x+self._n
    An Adder(5) object behaves almost identically to a make_adder(5) closure. It's just a matter of which one you find more readable. Even for experienced Python programmers, the answer is different in different cases, which is why you'll find both techniques all over the stdlib and popular third-party modules.

    In fact, functools.partial isn't actually a closure, but a class, like this:
    class partial:
        def __init__(self, func, *args, **kw):
            self.func, self.args, self.kw = func, args, kw
        def __call__(self, *more_args, **more_kw):
            return self.func(*self.args, *more_args, **self.kw, **more_kw)
    (Actually, the real partial has a lot of bells and whistles. But it's still not that complicated. The docs link to the source code, if you want to see it for yourself.)

    Methods

    OK, so you can create functions on the fly; what about methods?

    Again, they're already always created on the fly, and once you understand how, you can probably do whatever it was you needed.

    Let's look at an example:
    class Spam:
        def __init__(self, x, y):
            self.x, self.y = x, y
        def eggs(self):
            return self.x + self.y
    spam = Spam(2, 3)
    print(spam.eggs())
    The definition of eggs is compiled and interpreted exactly the same as the definition of any other function. And the result is just stored as a member of the Spam class (see the section on classes to see how that works), so when you write Spam.eggs you just get that function.

    This means that if you want to add a new method to a class, there's no special trick, you just do it:
    def cheese(self):
        return self.x * self.y
    Spam.cheese = cheese
    print(spam.cheese())
    That's all it takes to add a method to a class dynamically.

    But meanwhile, on the instance, spam.eggs is not just a function, it's a bound method. Try print(spam.eggs) from the interactive REPL. A bound method knows which instance it belongs to, so when you call it, that instance can get passed as the self argument.

    The details of how Python turns the function Spam.eggs into the bound method spam.eggs are a bit complicated (and I've already written a whole post about them), but we don't need to know that here.

    Obviously, bound methods get created dynamically. Every time you do spam.eggs or Spam().cheese or string.ascii_letters.find, you're getting a new bound method.

    And if you want to create one manually, you can just call types.MethodType(func, obj).

    So, if you want to add a new method beans to just the spam instance, without adding it to the Spam class? Just construct the same bound method that Python would have constructed for you whenever you looked up spam.beans, and store it there:
    def beans(self):
        return self.x / self.y
    spam.beans = types.MethodType(beans, spam)
    And now you know enough to implement Javascript-style object literals, or even prototype inheritance. Not that you should do either, but if you ever run into something that you really do want to do, that requires creating methods on the fly, either on classes or on instances, you can do it. Because creating methods on the fly is what Python always does.

    Classes

    What if we want to create a class dynamically?

    Well, I shouldn't have to tell you at this point. You're always creating classes dynamically.

    Class definitions work a bit differently from function definitions. Let's start with a simple example again:
    class Spam:
        z = 0
        def __init__(self, x, y):
            self.x, self.y = x, y
        def eggs(self):
            return self.x + self.y + self.z
    First, Python interprets the class body the same as any other code, but it runs inside an empty environment. So, those def calls create new functions named __init__ and eggs in that empty environment, instead of at the global level. Then, it dynamically creates a class object out of that environment, where every function or other value that got created becomes a method or class attribute on the class. The code goes something like this:
    _Spam_locals = {}
    exec('def __init__(self, x, y):\n    self.x, ... blah blah ...\n',
         globals(), _Spam_locals)
    Spam = type('Spam', (object,), _Spam_locals)
    This is why you can't access the Spam class object inside the class definition--because there is nothing named Spam until after Python calls type and stores the result in Spam. (But of course you can access Spam inside the methods; by the time those methods get called, it'll exist.)

    So, what if you want to create some methods dynamically inside the class? No sweat. By the time it gets to calling type, nobody can tell whether eggs gets into the locals dict by you calling def eggs(...): or eggs = fancy_higher_order_function(...), so they both do the same thing.

    In fact, one idiom you'll see quite often in the stdlib is this:
        def __add__(self, other):
            blah blah
        __radd__ = __add__
    This just makes __radd__ another name for the same method as __add__.

    And yes, you can call type manually if you need to, passing it any dict you want as an environment:
    def __init__(self, x, y):
        self.x, self.y = x, y
    Spam = type('Spam', (object,),
        {'z': 0, '__init__': __init__,
         'eggs': lambda self: self.x + self.y + self.z})
    There are a few more details to classes. A slightly more complicated example covers most of them:
    @decorate_my_class
    class Spam(Base1, Base2, metaclass=MetaSpam):
        """Look, I've got a doc string"""
        def __init__(self, x, y):
            self.x, self.y = x, y
    This is equivalent to:
    _Spam_locals = {}
    exec('def __init__(self, x, y):\n    self.x, ... blah blah ...\n',
         globals(), _Spam_locals)
    Spam = MetaSpam('Spam', (Base1, Base2), _Spam_locals)
    Spam.__doc__ = """Look, I've got a doc string"""
    Spam = decorate_my_class(Spam)
    As you can see, if there's a metaclass, it gets called in place of type, and if there are base classes, they get passed in place of object, and docstrings and decorators work the same way as in functions.

    There are a few more complexities with qualnames, __slots__, closures (if you define a class inside a function), and the magic to make super() work, but this is almost everything.

    Remember from the last section how easy it is to add methods to a class? Often that's simpler than trying to programmatically generate methods from inside the class definition, or customize the class creation. (See functools.total_ordering for a nice example.) But when you really do need a dynamically-created class for some reason, it's easy.

    Generating code

    Occasionally, no matter how hard you try, you just can't come up with any way to define or modify a function or class dynamically with your details crammed into the right place the right way, at least not readably. In that case, you can always fall back to generating, compiling, and executing source code.

    The simplest way to do this is to just build a string and call exec on it. You can find a few examples of this in the stdlib, like collections.namedtuple. (Notice the trick it uses of calling exec in a custom empty namespace, then copying the value out of it. This is a bit cleaner than just executing in your own locals and/or globals.)
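
    Here's a minimal sketch of that trick (the names are just illustrative):

    source = "def double(x):\n    return x * 2\n"
    namespace = {}
    exec(source, namespace)       # run the generated source in a fresh namespace
    double = namespace['double']  # then pull the new function out of it
    print(double(21))             # 42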

    You've probably heard "exec is dangerous". And of course it is. But "dangerous" really just means two things: "powerful" and "hard to control or reason about". When you need the first one badly enough that you can accept the second, danger is justified. If you don't understand things like how to make sure you're not letting data some user sent to your web service end up inside your exec, don't use it. But if you're still reading at this point, I think you can learn how to reason through the issues.

    Sometimes you don't want to exec right now, you want to compile something that you can pass around and exec later. That works fine too.
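
    For example (a trivial sketch):

    code = compile("print('hello, later')", '<generated>', 'exec')
    # ...stash it, pass it around, whatever...
    exec(code)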

    Sometimes, you even want to generate a whole module and save it as a .py file. Of course people do that kind of thing in C all the time, as part of the build process--but in Python, it doesn't matter whether you're generating a .py file during a build or install to be used later, or generating one at normal runtime to be used right now; they're the same case as far as Python is concerned.

    Sometimes you want to build an AST (abstract syntax tree) or a stream of tokens instead of source code, and then compile or execute that. (And this is a great time to yet again plug macropy, one of the coolest projects ever.)

    Sometimes you even want to do the equivalent of inline assembly (but assembling Python bytecode, not native machine code, of course) with a library like byteplay or cpyasm (or, if you're a real masochist, just assembling it in your head and using struct.pack on the resulting array of 16-bit ints...). Again, unlike C, you can do this at runtime, then wrap that code object up in a function object, and call it right now.

    You can even do stuff like marshaling code objects to and from a database to build functions out of later.

    Conclusion

    Because almost all of this is accessible from within Python itself, and all of it is inherently designed to be executed on the fly, almost anything you can think of is probably doable.

    So, if you're thinking "I need to dynamically create X", you need to think through exactly what you need, but whether that turns out to be "just a normal function" or something deeply magical, you'll be able to do it, or at least explain the magic you're looking for in specific enough terms that someone can actually show you how to do it.

  9. A few years ago, Cesare di Mauro created a project called WPython, a fork of CPython 2.6.4 that “brings many optimizations and refactorings”. The starting point of the project was replacing the bytecode with “wordcode”. However, there were a number of other changes on top of it.

    I believe it’s possible that replacing the bytecode with wordcode would be useful on its own. Of course it could also lead to opportunities for more optimizations and refactorings, but it might be worth keeping a wordcode fork alive (or even proposing it as a patch to core CPython) that doesn’t have additional radical changes.

    The code for this project is in the wpy branch of my github fork of the CPython source. As of post time, it's basically just a proof of concept: the compiler generates wordcode, the interpreter interprets wordcode, but things like the pdb debugger don't work, the peephole optimizer has been disabled, etc., so it won't even pass the full test suite. No attempt at further simplification has been made, or will be made initially: The goal is to get a complete working prototype that passes the test suite and can be benchmarked against stock CPython (and against Serhiy's alternative project to pack bits into the opcode and skip arguments) for size and performance. If that looks promising, then I'll look into simplifying the eval loop to see if there are further gains.

    As I make further changes to the project, an updated version of this document will be available in Python/wordcode.md inside the repo.

    Bytecode

    The core of CPython is an eval loop that implements a simple stack-based virtual machine. At each step, it has a frame object, with an attached code object and an instruction pointer. The code object contains a bytes string full of bytecode. So, it just fetches the next bytecode operation and interprets it. The compiler is responsible for turning source code into bytecode, and is called as needed from within the interpreter (by the importer, the exec function, etc.).

    Each operation consists of a single 8-bit opcode, plus possibly a 16-bit argument. For example, the RETURN_VALUE opcode (which returns the value on top of the stack to the caller) doesn’t need an argument, so a RETURN_VALUE instruction is just a single byte (83). But the LOAD_CONST opcode, which loads a constant stored in the code object by index, does need an argument, to tell the VM which index to use, so LOAD_CONST 0 takes 3 bytes (100, 0, 0).

    Argument values that don’t fit into 16 bits are handled by prefixing an operation with a special EXTENDED_ARG opcode. This obviously rarely comes up with opcodes like LOAD_CONST, but may occasionally turn up with, say, JUMP_ABSOLUTE. For example, to jump to offset 0x123456 would require EXTENDED_ARG 18 (the most-significant 0x12) followed by JUMP_ABSOLUTE 13398 (the least-significant 0x3456), which takes 6 bytes (144, 18, 0, 113, 86, 52).

    The dis module documentation explains what each opcode does, and which ones do and don’t take arguments.

    The biggest potential problem here is that fetching a typical LOAD_CONST takes three single-byte fetches: first, because the interpreter doesn’t know whether it needs an argument until it sees the opcode, it has to fetch just one byte. Then, when it needs an argument, it has to fetch two more bytes (it could just fetch a word, but half the time that word will be misaligned, which is illegal on some platforms, and no cheaper than–or even more expensive than–two single-byte fetches on others). This also means that the argument fetching has to be either conditional (which can break branch prediction), or duplicated to each opcode’s interpreter chunk (which increases cache pressure). Even the more complicated pointer arithmetic can slow things down by preventing the CPU from figuring out what to prefetch.

    On top of that, the vast majority of operations with arguments have tiny values (for example, more than half of all LOAD_CONST operations in the stdlib have indices 0-3), but they still require two bytes to store those arguments. This makes bytecode longer, meaning more cache spills, disk reads, VM page faults, etc.

    Variable-width bytecode is also more complicated to scan, interpret, and modify, which complicates code like the peephole optimizer (and third-party libraries like byteplay), as well as the interpreter itself. (Notice the custom macros to fetch the next opcode and argument, peek at the next argument, etc.) Making the code simpler would make it easier to read and maintain, and might open the door for adding further changes. (It might also allow the C compiler or CPU to optimize things better without any work on our part–for example, in some cases, a PREDICT macro doesn’t always help as much as it could, because the prediction ends up reordered right after a conditional fetch.)

    Using two-byte arguments means every argument depends on word-order (CPython stores the arguments in little-endian order, even on big-endian machines).

    Finally, variable-width opcodes mean you can’t synchronize at an arbitrary position. For example, if I want to disassemble some bytecode around offsets 100-150, there’s no way to tell whether offset 100 is an opcode, or part of an argument for an opcode starting at 98 or 99. Usually, starting from the middle of an argument will give you a bunch of invalid operations, so you can try 100, then 99, and then 98 until one of them makes sense, but that’s not guaranteed to work (and it’s not the kind of thing you’d want to automate).

    Packed Arguments

    One solution is to find the most commonly-used opcode/argument combinations and pack them into a single byte. So, we’d still have LOAD_CONST that takes an argument, but we’d also have LOAD_CONST_0 and LOAD_CONST_1 that don’t.

    We could extend this to have single-byte-arg variants, so for 2-255 you’d use LOAD_CONST_B, and only use LOAD_CONST_W for 256 and up.

    This would solve many of the problems with existing bytecode–although at the cost of making things more complicated, instead of simpler. It also means duplicating code, or adding jumps, or doing extra mask-and-shift work on the opcodes, any of which could slow things down. (It also involves using up more opcodes, but since we’re still only up to 101 out of 255 as of CPython 3.5, this probably isn’t too worrying.)

    Wordcode

    A different solution is to simplify things to use 16 bits for every operation: one byte for the opcode, one byte for the argument. So, LOAD_CONST 1 goes from 100, 1, 0 to just 100, 1. The interpreter can just fetch (aligned) 16-bit words, with no need for conditionals or duplicated code. All else being equal, it should make things faster on most platforms. It also makes things simpler, both in the interpreter and in bytecode processors.

    There are two obvious problems here.

    First, every opcode that used to take 1 byte now takes 2.

    Second, while most arguments are tiny, not all of them are. How can you use an argument >255 with this solution? The obvious answer here is to expand the use of EXTENDED_ARG. Keeping things simple, we can allow up to three EXTENDED_ARG opcodes, each carrying an additional byte for the following operation (which are then assembled in big-endian order). So, for example, LOAD_CONST 321 goes from 100, 65, 1 to 144, 1, 100, 65.
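
    Here's a tiny sketch of that emission rule (emit_wordcode is just illustrative, not part of any patch):

    import dis

    def emit_wordcode(opcode, arg=0):
        # Emit up to three EXTENDED_ARG prefixes, each carrying one extra byte
        # of the argument, most-significant byte first, then the operation
        # itself with the low byte.
        out = []
        for shift in (24, 16, 8):
            byte = (arg >> shift) & 0xFF
            if byte or out:
                out += [dis.EXTENDED_ARG, byte]
        out += [opcode, arg & 0xFF]
        return bytes(out)

    # emit_wordcode(100, 321) == bytes([144, 1, 100, 65]), matching the example above.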

    Of course this means that many jumps (where values >255 aren’t that rare) now take 4 bytes instead of 3, although that’s balanced by many other jumps taking 2 bytes instead of 3. (Also note that if we treat the bytecode as an array of 16-bit words instead of an array of 8-bit bytes, each offset now takes 1 less bit.)

    So, putting all of that together, will code get longer or shorter overall? (Remember, shorter code also means a faster interpreter.) It’s hard to predict, but I did a quick experiment with the importlib frozen code and the .pyc files for the stdlib, and it looks like wordcode even with the peephole optimizer disabled is about 7% smaller than bytecode with the optimizer enabled. So, I think we’d get a net savings. But obviously, this is something to test with a more complete implementation, not just guess at. Plus, it’s possible that, even if we’re saving space overall, we’re hitting the worst costs in exactly the worst places (e.g., mid-sized functions may often end up with loop bodies that expand to another cache line), which we’d need to test for.

    More extensive use of EXTENDED_ARG might also increase register pressure (because the compiler and/or CPU decides to dedicate a register to holding the current extended value rather than use that register for something else).

    So, there’s no guarantee that this will make things faster, and a possibility that it will make things slower. We won’t know until we build something to test.

    Besides performance, more extensive use of EXTENDED_ARG now means if you write a quick&dirty bytecode processor that just pretends it never exists, it will fail reasonably often instead of very rarely.

    On the other hand, Python doesn’t have a type for “immutable array of pairs of bytes” akin to bytes. Continuing to use the bytes type for bytecode could be confusing (especially if jumps use word-based offsets–that means code would need *2 and //2 all over the place). But using something like array('H') has its own problems (endianness, mutability, not being a builtin). And creating a new builtin words type for something that most users are never going to touch seems like overkill.

    Hybrid

    Obviously, you can’t do both of the above at the same time. The first change is about creating more opcodes that don’t need an argument at all, while the second change gives an argument to opcodes even if they don’t need it, which would cancel out the benefit.

    However, there is a hybrid that might make sense: For opcodes that frequently need values a little bigger than 255, we could steal a few bits from the opcode and use it as part of the argument. For example, we’d have JUMP_ABSOLUTE_0 through JUMP_ABSOLUTE_3, so to jump to offset 564 (0x234), you’d do JUMP_ABSOLUTE_2 52 instead of EXTENDED_ARG 2 JUMP_ABSOLUTE 52.

    It’s worth noting that all of the problems we’re looking at occur in real machine code. For example, the x86 uses variable-width operations, where some opcodes take no arguments, some take one, some take two, and those arguments can even be different widths. And RISC is essentially this hybrid solution–for example, on PowerPC, every operation is 32 bits, with the arguments encoded in anywhere from 0 to 26 of those bits.

    Implementing wordcode

    The argument-packing idea is worth exploring on its own. The hybrid idea might be worth exploring as an extension to either wordcode or argument-packing, if they both pan out. But experimenting with just the wordcode idea seems to be worth pursuing.

    It’s probably worth starting with the smallest possible change. This means bytecode is still treated as a string of bytes, but every operation is now always 2 bytes instead of 1 or 3, and jumps are still byte-offset-based, and no additional simplifications or refactorings are attempted.

    So, what needs to be changed?

    Interpreter

    The core interpreter loop (PyEval_EvalFrameEx in ceval.c) already has a set of macros to wrap up the complicated fetching of opcodes. We just need to change these macros to always work on 2 bytes instead of conditionally on 1 or 3. Again, this doesn’t give us the simplest code; it gives us the code with the fewest changes. This means changing the FAST_DISPATCH macro (both versions), NEXTOP, NEXTARG, PEEKARG, PREDICTED, and PREDICTED_WITH_ARG. The only other place the instruction pointer is manipulated is the next_instr = first_instr + f->f_lasti + 1; line under the long comment.

    As that comment implies, we also have to patch the frame setup (PyFrame_New in frameobject.c) to start f_lasti at -2 instead of -1, and change the YIELD_FROM code inside the eval loop to -= 2 instead of --.

    We also need to change EXTENDED_ARG to only shift 8 bits instead of 16. (You might expect that we need more changes, to allow up to three EXTENDED_ARGs instead of just one, but the code as written already allows multiple instances–although if you use more than one with bytecode, or more than three with wordcode, the extra bits just get shifted out.)

    Surprisingly, that’s all that needs to be done to interpret wordcode instead of bytecode.

    Compiler

    Obviously the compiler (in compile.c) needs to be changed to emit wordcode instead of bytecode. But again, it’s designed pretty flexibly, so it’s less work than you’d expect. In particular, it works on a list of instruction objects as long as possible, and only emits the actual bytecode at the very end. This intermediate representation treats instructions with preceding EXTENDED_ARG as single instructions. So, it should work unchanged for our purposes.

    We need to change the instrsize function to return 2, 4, 6, or 8 depending on whether 0, 1, 2, or 3 EXTENDED_ARG opcodes will be needed, instead of 1, 3, or 6 for no args, args, or args plus EXTENDED_ARG. And we need to modify assemble_emit to emit 0 to 3 EXTENDED_ARG opcodes instead of 0 or 1. Finally, the jump target fixup code in assemble_jump_offsets has to be similarly modified to count 0 to 3 EXTENDED_ARG opcodes instead of 0 or 1.
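
    In Python terms, the new instrsize is just a matter of counting how many bytes the argument needs; a sketch (the real code is C, in compile.c):

    def instrsize(oparg):
        # every instruction is 2 bytes; each extra byte of argument costs
        # one 2-byte EXTENDED_ARG prefix
        if oparg <= 0xFF:
            return 2
        elif oparg <= 0xFFFF:
            return 4
        elif oparg <= 0xFFFFFF:
            return 6
        else:
            return 8
    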

    And that’s it for the compiler.

    Peephole optimizer

    Unfortunately, the peephole optimizer (in peephole.c) doesn’t work on that nice intermediate list-of-instructions representation, but on the emitted bytecode. That means it has to reproduce all the jump-target fixup code, and do it in a more complicated way. It also doesn’t process EXTENDED_ARG as nicely as the eval loop: it essentially assumes that only MAKE_FUNCTION and jump arguments will ever use it, which is almost always true for bytecode but no longer true for wordcode.

    For my first proof of concept, I just disabled the peephole optimizer. But obviously, we can’t do useful benchmark tests against the stock interpreter this way, so we’ll have to tackle it before too long.

    As a possible alternative: Victor Stinner’s PEP 511 proposes an API for registering AST- and bytecode-based code transformers, and in his earlier work with FAT Python he’s reproduced everything the peephole optimizer does (and more) as separate transformers. Most of these should be simpler to port to wordcode (especially since most of them are AST-based, before we even get to the bytecode step). So, it may be simpler to use the PEP 511 patch, disable the peephole optimizer, and use separate optimizers, both for bytecode and for wordcode. We could then test that the 511-ized bytecode interpreter is pretty close to stock CPython, and then fairly compare the 511-ized wordcode interpreter to the 511-ized bytecode interpreter.

    Debugger

    The pdb debugger has an intimate understanding of Python bytecode, and how it maps back to the compiled source code. Obviously it will need some changes to support wordcode. (This may be another place where we get some simplification opportunities.)

    I haven’t yet looked at how much work pdb needs, but I’m guessing it shouldn’t be too hard.

    Introspection tools

    The dis module disassembles bytecode. It obviously needs to be patched. But this turns out to be very easy. There are two functions, _get_instructions_bytes and findlabels, that each need two changes: to skip or fetch 1 byte instead of fetching 0 or 2 for arguments, and to handle multiple EXTENDED_ARGs.
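
    For example, a wordcode findlabels might look something like this (a sketch, assuming jumps stay byte-offset-based as described above):

    import dis

    def findlabels(code):
        """Return the set of byte offsets that are jump targets."""
        labels = set()
        ext = 0
        for i in range(0, len(code), 2):
            op, arg = code[i], code[i + 1]
            if op == dis.EXTENDED_ARG:
                ext = (ext << 8) | arg
                continue
            arg = (ext << 8) | arg
            ext = 0
            if op in dis.hasjrel:
                labels.add(i + 2 + arg)     # relative to the next instruction
            elif op in dis.hasjabs:
                labels.add(arg)
        return labels
    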

    Marshaling, bootstrapping, etc.

    I expected that we might need some changes in marshaling, importer bootstrapping, and other parts of the interpreter. But as it turns out, all of the rest of the code just treats bytecode as an uninterpreted bytes object, so all of it just works.

    Third-party code

    Of course any third-party disassemblers, decompilers, debuggers, and optimizers that deal with bytecode would have to change.

    I believe the most complicated library to fix will be byteplay, so my plan is to tackle that one. (It may also be useful for replacing the peephole optimizer, and possibly for debugging any problems that come up along the way.)


  10. Many languages have a for-each loop. In some, like Python, it’s the only kind of for loop:

    for i in range(10):
        print(i)
    

    In most languages, the loop variable is only in scope within the code controlled by the for loop,[1] except in languages that don’t have granular scopes at all, like Python.[2]

    So, is that i a variable that gets updated each time through the loop or is it a new constant that gets defined each time through the loop?

    Almost every language treats it as a reused variable. Swift, and C# since version 5.0, treat it as separate constants. I think Swift is right here (and C# was right to change, despite the backward-compatibility pain), and everyone else is wrong.

    However, if there is a major language that should be using the traditional behavior, it’s Python.

    Loops and closures

    In any language that has lexical closures that capture variables, you’ll run into the foreach-lambda problem. This has been known in Python (and JavaScript, and Ruby, etc.) for decades, and is even covered in Python’s official Programming FAQ, but every new programmer discovers it on his own, and every new language seems to do so as well, and it always takes people by surprise.

    Consider this:

    powers = [lambda x: x**i for i in range(10)]
    

    This creates 10 functions that return x**0, x**1, … x**9, right?

    If for-each works by re-using a variable, as it does in Python (and JavaScript, …) then no, this creates 10 functions that all return x**9 (or, in some languages, x**10). Each function returns x**i, where i is the variable from the scope the function was defined in. At the end of that scope, i has the value 9.

    The problem has nothing to do with lambdas or comprehensions.[3] You’ll get the same behavior this way:

    powers = []
    for i in range(10):
        def func(x):
            return x**i
        powers.append(func)
    

    Whenever people run into this problem, in any language, they insist that the language “does closures wrong”–but in fact, the language is doing closures perfectly.

    So, does that mean all those users are wrong?

    Traditional solution

    In most languages, you can solve the problem by wrapping the function definition inside another function that you define and call:

    for i in range(10):
        def make_func(j):
            def func(x):
                return x**j
            return func
        powers.append(make_func(i))
    

    This works because the j parameter gets a reference to the value of the i argument, not a reference to the i variable itself. Then, func gets a reference to that j variable, which is different each time.

    The problem with this solution is that it’s verbose, and somewhat opaque to the reader. It’s a bit less verbose when you can use lambdas, but arguably it’s even less readable that way:

    for i in range(10):
        powers.append((lambda j: (lambda x: x**j))(i))
    

    In some languages, including Python, there’s a simpler way to do this:

    for i in range(10):
        def func(x, j=i):
            return x**j
        powers.append(func)
    

    That’s because the default parameter value j gets a reference to the value of i in the defining scope, not a closure cell capturing the variable i.

    Of course this is a bit “magical”, but people quickly learn to recognize it in Python. In fact, most experienced Python developers will spell it i=i instead of j=i, because (once you recognize the idiom) that makes it clear why we’re using a parameter with a default variable here: to get the value of i at the time of definition (or, in some cases, as an optimization–e.g., you can use len=len to turn len into a local variable instead of a builtin within the function, which makes it a bit faster to look up each time you call it).

    The bigger problem is that it makes j a parameter, which means you can override the default value with an argument. For example, powers[3](2, 10) gives you 1024, which is almost certainly not something anyone wanted. You can do tricks like making it keyword-only, prefixing it with an underscore, etc. to try to discourage people from accidentally passing an argument, but it’s still a flaw.

    Terminology sidebar

    In Python, this issue is often described in terms of late vs. early binding. There are actually three places things could get bound: compile time (like constants), function definition time (like default parameter values), or function call time (like free variables captured by a closure). In many discussions of late vs. early binding, the distinction is between run time and compile time, but in this case, it’s between the later and earlier run time cases: free variables are bound late, at call time, while default parameter values are bound early, at definition time. So, capturing a nonlocal is late binding; the “default-value trick” of passing the value as the default value of an extra parameter means using early binding instead.

    A different way of solving the same problem is to consider capture by variable vs. capture by value. In C++, variables are lvalues, but lvalues are also things you can pass around and take references to. You can write a function that takes an int, or one that takes a reference to an int, spelled int& (and another that takes a reference to a const int, and one that takes an rvalue-reference to an int, with complicated rules about how you collapse an lvalue reference to an rvalue reference and how overload distinguishes between an int and a reference to const int when the lvalue is constant, and…). And closures are built on top of this: there’s no special “closure cell” or “closure environment”; closures just have (lvalue) references to variables from the outer scope. Of course taking a reference to something doesn’t keep it alive, because that would be too easy, so returning a function that closes over a local variable means you’ve created a dangling reference, just like returning a pointer to a local variable. So, C++ also added capture by value: instead of copying an lvalue reference (which gives you a new reference to the same lvalue), you can copy an rvalue reference (which steals the reference if it’s a temporary that would have otherwise gone away, or copies the value into a new lvalue if not). As it turns out, this gives you another way to solve the loop problem: if you capture the loop variable by value, instead of by variable, then of course each closure is capturing a different value. Confused? You won’t be, after this week’s episode of C++.

    Anyway, you can now forget about C++. In simpler languages like Java, C#, JavaScript, and Python, people have proposed adding a way to declare capture by value for closures, to solve the loop problem. For example, borrowing C++ syntax (sort of):

    for i in range(10):
        def func[i](x):
            return x**i
        powers.append(func)
    

    Or, alternatively, one of these proposals:

    for i in range(10):
        def func(x; i):
            return x**i
        powers.append(func)
    
    for i in range(10):
        def func(x) sharedlocal(i):
            return x**i
        powers.append(func)
    
    for i in range(10):
        def func(x):
            sharedlocal i
            return x**i
        powers.append(func)
    

    No matter how you spell it, the idea is that you’re telling func to capture the current value of the i local from the enclosing scope, rather than capturing the actual variable i.

    If you think about it, that means that capture by value in a language like Python is exactly equivalent to early binding (in the “binding at definition time” sense). And all of these solutions do the exact same thing as the parameter default-value trick i=i, except that i is no longer visible (and overridable) as a parameter, so it’s no longer really a “trick”. (In fact, we could even hijack the existing machinery and store the value in the function object’s __defaults__, if we just add a new member to the code object to keep a count of captured local values that come after all the parameters.)

    While we’re talking terminology: is the way parameters get passed in the previous section “pass by value” or “pass by reference”? If you think that’s a good question, or you think the problem could be solved by just coming up with a new name, see this post.

    Solutions

    Since the traditional solution works, and people obviously can learn it and get used to it, one solution is to just do nothing.

    But if we are going to do something, there are two obvious solutions:

    1. Provide a way for closures to use early binding or capture by value. (Again, these options are equivalent in languages like Python.)
    2. Provide a way for for-each loops to define a new variable each time through the loop, instead of reusing the same variable.

    Either one would solve the problem of closures capturing loop variables. Either one might also offer other benefits, but it’s hard to see what they might be. Consider that, in decades of people using the default-value trick in Python, it’s rarely used (with locals[4]) for anything but loop variables. And similarly, you can find FAQ sections and blog posts and StackOverflow questions in every language discussing the loop problem, and none of them mention any other cases.

    The first one obviously has to be optional: you still want closures to capture some (in fact, most) variables, or they really aren’t closures. And there’s no obvious way to make an automatic distinction, so it has to be something the user spells explicitly. (As shown in the previous section, there are plenty of alternative ways to spell it, each with room for more bikeshedding.)

    The second one could be optional–but does it have to be? The C# team carefully considered adding a new form like this (in Python terms):

    for new i in range(10):
        def func(x):
            sharedlocal i
            return x**i
        powers.append(func)
    

    But they decided that any existing code that actually depends on the difference between for i and for new i is more likely to be buggy than to be intentionally relying on the old semantics, so, even from a backward-compatibility point of view, it’s still better to change things. Of course that had to be balanced with the fact that some people write code that’s used in both C# 4 and C# 5, and having it do the right thing in one version and the wrong thing in the other is pretty ugly… but even so, they decided to make the breaking change.

    I’m going to examine the arguments for each of the two major alternatives, without considering any backward compatibility issues.

    Consistency with C-style for

    This one isn’t relevant to languages like Python, which don’t have C-style for loops, but many languages do, and they almost always use the same keyword, or closely related ones, to introduce C-style for loops and for-each loops. So, superficially, it seems like they should be as similar as possible, all else being equal.

    In a C-style for loop, the loop variable is clearly a variable whose lifetime lasts for the entire loop, not just a single iteration. That’s obvious from the way you define the loop:

    for (int i=0; i != 10; i++)
    

    That third part isn’t generating a value for a new constant (that also happens to be named i), it’s mutating or rebinding a variable named i. That’s what the ++ operator means. If you change the i++ to just i + 1, you get an infinite loop, where i stays 0 forever.

    And this is why it’s usually legal to modify the loop variable inside the loop. It allows you to do things like this:

    for (int i=0; i != spam.len(); ++i) {
        if (spam[i].has_arg) {
            process_with_arg(spam[i], spam[i+1]);
            ++i; // skips the next spam; we already used it
        } else {
            process_no_arg(spam[i]);
        }
    }
    

    So, shouldn’t for-each loops be consistent with that?

    I don’t think so. Most languages already consider it legal and reasonable and only slightly unusual to modify the loop variable of a C-style for, but not to modify the loop variable of a for-each. Why is that? I think it’s because the two loops are semantically different. They’re not the same thing, just because they share a keyword.

    So, at first glance this seems like an argument for #1, but in fact, it’s not an argument either way. (And for a new language, or an existing language that doesn’t already have C-style for loops, really, you don’t want C-style for loops…)

    Consistency with functions

    Anyone who’s done any functional programming has noticed that statement suites in imperative languages are a lot like function bodies. In fact, most control statements can be converted to a function definition or two, and a call to a higher-order function:

    if spam:
        do_stuff()
        do_more_stuff()
    else:
        do_different_stuff()
        
    def if_body():
        do_stuff()
        do_more_stuff()
    def else_body():
        do_different_stuff()
    if_func(spam, if_body, else_body)
    
    while eggs:
        eat(cheese)
        
    def while_cond():
        return eggs
    def while_body():
        eat(cheese)
    while_func(while_cond, while_body)
    

    But that doesn’t work with for-each:

    powers = []
    for i in range(10):
        def func(x):
            return x**i
        powers.append(func)
    
    powers = []
    def for_body(i):
        def func(x):
            return x**i
        powers.append(func)
    for_each(range(10), for_body)
    

    And this makes the problem obvious: that i is an argument to the for_body. Calling a function 10 times doesn’t reuse a single variable for its parameter; each time, the parameter is a separate thing.
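
    To make the analogy concrete, here’s the trivial for_each helper being assumed above; the point is just that each iteration’s value arrives as a fresh parameter binding:

    def for_each(iterable, body):
        for value in iterable:
            body(value)          # each call binds a brand-new parameter

    powers = []
    def for_body(i):
        powers.append(lambda x: x**i)
    for_each(range(10), for_body)

    print([f(2) for f in powers])    # [1, 2, 4, ..., 512] -- ten different powers
    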

    This is even more obvious with comprehensions, where we already have the equivalent function:[5]

    powers = [lambda x: x**i for i in range(10)]
    
    powers = map(lambda i: (lambda x: x**i), range(10))
    

    Also consider Ruby, where passing blocks to higher-order-functions[6] is the idiomatic way to loop, and for-each loops are usually described in terms of equivalence to that idiom, and yet, the following Ruby 1.8[7] code produces 10 lambdas that all raise to the same power:

    powers = []
    10.times do |i|
        powers.push(lambda { |x| x**i })
    end
    

    So, this is definitely an argument for #2. Except…

    Python

    This is where Python differs from the rest of the pack. In most other languages, each suite is a scope.[8] This is especially obvious in languages like C++, D, and Swift, which use RAII, scope-guard statements, etc. where Python would use a with statement (or a try/finally). In such languages, the duality between scopes and functions is much clearer. In particular, if i is already going to go out of scope at the end of the for loop, it makes sense for each individual i to go out of scope at the end of an iteration.

    But in Python, not only does i not go out of scope at the end of the for loop, there are comfortable idioms involving using the loop variable after its scope. (All such cases could be written by putting the use of the loop variable into the else clause of a for/else statement, but many novices find for/else confusing, and it’s not idiomatic to use it when not necessary.)

    In addition, one of the strengths of Python is that it has simple rules, with few exceptions, wherever possible, which makes it easier to work through the semantics of any code you read. Currently, the entire semantics of variable scope and lifetime can be expressed in a few simple rules, which are easy to hold in your head:

    • Only functions and classes have scopes.
      • Comprehensions are a hidden function definition and call.
    • Lookup goes local, enclosing, global, builtin.
    • Assignment creates a local variable.
      • … unless there’s an explicit global or nonlocal statement.
    • Exceptions are unbound after an except clause.

    Adding another rule to this list would mean one more thing to learn. Of course it would also mean not having to learn about the closure-over-loop-variable problem. Of course that problem is a natural consequence of the simple rules of Python, but, nevertheless, everyone runs into it at some point, has to think it through, and then has to remember it. Even after knowing about it, it’s still very easy to screw up and accidentally capture the same variable in a list of functions. (And it’s especially easy to do so when coming back to Python from Swift or C#, which don’t have that problem…)

    Mutable values

    Let’s forget about closures now, and consider what happens when we iterate over mutable values:

    x = [[], [], []]
    for i, sublist in enumerate(x):
        sublist.append(i)
    

    If you’re not familiar with Python, try it this way:

    x = [[], [], []]
    i = 0
    for sublist in x:
        sublist.append(i)
        i += 1
    

    What would we expect here? Clearly, if this is legal, the result should be:

    x = [[0], [1], [2]]
    

    And there’s no obvious reason it shouldn’t be legal.

    So, each sublist isn’t really a constant, but a variable, right? After all, it’s mutable.

    Well, yes, each sublist is mutable. But that’s irrelevant to the question of whether the name sublist is rebindable. In languages where variables hold references to values, or are just names for values, like Java (except for native types) or Python, there’s a very obvious distinction here:

    a = [1, 2, 3]
    b = [4, 5, 6]
    b = a           # rebinds the name b; the [4, 5, 6] list is untouched
    b.append(7)     # mutates the one list both names refer to: a == [1, 2, 3, 7]
    

    It’s perfectly coherent that sublist is a non-rebindable name for a possibly-mutable value. In other words, a new variable each time through the loop.

    In a language like C++, things are a little trickier, but it’s just as coherent that sublist is a non-l-modifiable-but-r-mutable l-value (unless sublist is of non-const reference type in which case it’s a modifiable l-value). Anyway, C++ has bigger problems: capturing a variable doesn’t do anything to keep that variable alive past its original scope, and there’s no way to capture by value without copying, so C++ has no choice but to force the developer to specify exactly what he’s trying to capture by reference and what he’s trying to copy.

    Anyway, the one thing you definitely don’t expect is for sublist to be mutating the same list in-place each time through (what you get if you write sublist[:] = ... instead of sublist = ...). But you wouldn’t expect that whether sublist is being reused, or is a new variable.

    So, ultimately, mutability isn’t an argument either way.

    Performance

    Obviously, creating and destroying a new variable each time through the loop isn’t free.

    Most of the time, you’re not keeping a reference to the loop variable beyond the lifetime of the loop iteration. So, why should you pay the cost of creating and destroying that variable?

    Well, if you think about it, in most languages, if you elide the creation and destruction, the result is invisible to the user unless the variable is captured. Almost all languages allow optimizers to elide work that has no visible effects. There would be nothing inconsistent about Python, or Java, deciding to reuse the variable where the effect is invisible, and only create a new variable when it’s captured.

    So, what about the case where the variable is captured? If we want each closure to see a different value, our only choices are to capture a new variable each time, to copy the variable on capture, or to copy the value instead of capturing the variable. The first is no more costly than the second, and cheaper than the third. It’s the simplest and most obvious way to get the desired behavior.

    So, performance isn’t an argument either way, either.

    Simplicity

    When a new user is learning the language, and writes this:

    powers = []
    for i in range(10):
        def func(x):
            return x**i
        powers.append(func)
    

    … clearly, they’re expecting 10 separate functions, each raising x to a different power.

    And, as mentioned earlier, even experienced developers in Python, Ruby, JavaScript, C#, etc. who have run into this problem before still write such code, and expect it to work as intended; the only difference is that they know how to spot and fix this code when debugging.

    So, what’s the intuition here?

    If the intuition is that lexical closures are early-bound, or by-value, then we’re in big trouble. They obviously aren’t, and, if they were, that would make closures useless. People use closures all the time in these languages, without struggling over whether they make sense.

    If the intuition is that they’re defining a new function each time through the loop because they have a new i, that doesn’t point to any other problems or inconsistencies anywhere else.

    And the only other alternative is that nobody actually understands what they’re doing with for-each loops, and we’re all only (usually) writing code that works because we treat them like magic. I don’t think that’s true at all; the logic of these loops is not that complicated (especially in Python).

    So, I think this is an argument for #2.

    Consistency with other languages

    As I mentioned at the start, most languages that have both for-each loops and closures have this problem.

    The only language I know of that’s solved it by adding capture by value is C++, and they already needed capture by value for a far more important reason (the dangling reference problem). Not to mention that, in C++, capture by value means something different than it would in an lvalue-less language like Python or JavaScript.

    By contrast, C# 5.0 and Ruby 1.9 both changed from "reused-i" to "new-i" semantics, and Lua, Swift, and Scala have used "new-i" semantics from the start.[9] C# and Ruby are particularly interesting, because that was a breaking backward-compatibility change, and they could very easily have instead offered new syntax (like for new i) instead of changing the semantics of the existing syntax. Eric Lippert’s blog covers the rationale for the decision in C#.

    As mentioned earlier, Python’s simpler scoping rules (only functions are scopes, not every suite) do weaken the consistency argument. But I think it still falls on #2. (And, for a language with suite scopes, and/or a language where loop bodies are some weird not-quite-function things like Ruby, it’s definitely #2.)

    Exact semantics

    In languages where each suite is a scope, or where loop suites are already function-esque objects like Ruby’s blocks, the semantics are pretty simple. But what about in Python?

    The first implementation suggestion for Python came from Greg Ewing. His idea was that whenever a loop binds a cellvar, the interpreter creates a new cell each time through the loop, replacing the local i binding each time. This obviously solves the loop-capture problem, with no performance effect on the more usual case where you aren’t capturing a loop variable.
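
    You can see the current shared-cell behavior (the thing Greg’s proposal would change) directly from Python:

    def make():
        powers = []
        for i in range(3):
            powers.append(lambda x: x**i)
        return powers

    powers = make()
    # all three closures share a single cell for i, so they all see its final value
    print([f.__closure__[0] is powers[0].__closure__[0] for f in powers])  # [True, True, True]
    print([f(2) for f in powers])                                          # [4, 4, 4]
    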

    This works, but, as Guido pointed out, it’s pretty confusing. Normally, a binding just means a name in a dict. The fact that locals and closures are implemented with indexes into arrays instead of dict lookups is an implementation detail of CPython, but Greg’s solution requires that implementation detail. How would you translate the design to a different implementation of Python that handled closures by just capturing the parent dict and using it for dict lookups?[10]

    Nick Coghlan suggested that the simplest way to define the semantics is to spell out the translation to a while loop. So the current semantics for our familiar loop are:

    _it = iter(range(10))
    try:
        while True:
            i = next(_it)
            powers.append(lambda x: x**i)
    except StopIteration:
        pass
    

    … and the new semantics are:

    _it = iter(range(10))
    try:
        while True:
            i = next(_it)
            def _suite(i=i):
                powers.append(lambda x: x**i)
            _suite()
    except StopIteration:
        pass
    

    But I think it’s misleading that way. In a comprehension, the i is an argument to the _suite function, not a default parameter value, and the function is built outside the loop. If we reuse the same logic here, we get something a bit simpler to think through:

    _it = iter(range(10))
    def _suite(i):
        powers.append(lambda x: x**i)
    try:
        while True:
            i = next(_it)
            _suite(i)
    except StopIteration:
        pass
    

    And now the “no-capture” optimization isn’t automatic, but it’s a pretty simple optimization that could be easily implemented by a compiler (or a plugin optimizer like FAT Python). In CPython terms, if the i in _suite ends up as a cellvar (because it’s captured by something in the suite), you can’t simplify it; otherwise, you can just inline the _suite, and that gives you exactly the same code as in Python 3.5.

    I still think there might be a better answer than the comprehension-like answer, but, if so, I haven’t thought of it. My suspicion is that it’s going to be like the comprehension problem: once you see a way to unify the two cases, everything becomes simpler.[11]

    Conclusion

    So, if you’re designing a new language whose variable semantics are like C#/Java/etc., or like Python/JavaScript/etc., I think you definitely want a for-each statement that declares a new variable each time through the loop–just like C#, Swift, and Ruby.

    For an existing language, I think it’s worth looking at existing code to try to find anything that would be broken by making the change backward-incompatibly. If you find any, then consider adding new syntax, like for new; otherwise, just do the same thing C# and Ruby did and change the semantics of for.

    For the specific case of Python, I’m not sure. I don’t know if the no-backward-compatibility decision that made sense for C# and Ruby makes sense for Python. I also think the new semantics need more thought–and, after that’s worked out, it will depend on how easily the semantics fit into the simple scoping rules, in a way which can be taught to novices and transplants from other languages and then held in their heads. (That really is an important feature of Python, worth preserving.) Also, unlike many other languages, the status quo in Python really isn’t that bad–the idiomatic default-value trick works, and doesn’t have the same verbosity, potential for errors, etc. as, say, JavaScript’s idiomatic anonymous wrapper function definition and call.


    1. In a typical for statement, there’s a statement or suite of statements run for each iteration. In a comprehension or other expression involving for, there may be subsequent for, if, and other clauses, and an output expression. Either way, the loop variable is in scope for all that stuff, but nowhere else.

    2. In Python, only functions and classes define new scopes. This means the loop variable of a for statement is available for the rest of the current function. Comprehensions compile to a function definition, followed by a call of that function on the outermost iterable, so technically the loop variable is available more widely than you’d expect, but that rarely comes up–you’d have to do something like write a nested comprehension where the second for clause uses the variable from the third for clause, but doesn’t use it the first time through.

    3. There are plenty of languages where normal function definitions and lambda definitions actually define different kinds of objects, and normal functions can’t make closures (C++, C#) or even aren’t first-class values (Ruby). In those languages, of course, you can’t reproduce the problem with lambda.

    4. The same trick is frequently used with globals and builtins, for two different purposes. First, len=len allows Python to lookup len as a local name, which is fast, instead of as a builtin, which is a little slower, which can make a difference if you’re using it a zillion times in a loop. Second, len=len allows you to “hook” the normal behavior, extending it to handle a type that forgot a __len__ method, or to add an optional parameter, or whatever, but to still access the original behavior inside the implementation of your hook. Some possible solutions to the “local capture” problem might also work for these uses, some might not, but I don’t think that’s actually relevant to whether they’re good solutions to the intended problem.

    5. In modern Python (3.0+), map is lazy–equivalent to a generator expression rather than to a list comprehension. But let’s ignore that detail for now. Just read it as list(map(...)) if you want to think through the behavior in Python 3.

    6. Well, not higher-order functions, because they can’t take functions, only blocks, in part because functions aren’t first-class values. But the Ruby approximation of HOFs. Ruby is just two-ordered instead of arbitrary-ordered like Lisp or Python.

    7. Ruby 1.9 made a breaking change that’s effectively my #2: the i block parameter is now a new variable each time, which shadows any local i in the block’s caller’s scope, instead of being a single variable in the caller’s scope that gets rebound repeatedly. There were some further changes for 2.0, but they aren’t relevant here.

    8. Ruby is an interesting exception here. The scope rules are pretty much like Python’s–but those rules don’t matter, because loop bodies are almost always written as blocks passed to looping methods rather than as suites within a looping statement or expression. You could argue that this makes the choice obvious for Ruby, in a way that makes it irrelevant to other languages–but it’s actually not obvious how block parameters should be scoped, as evidenced by the fact that things changed between 1.8 and 1.9, primarily to fix exactly this problem.

    9. Most of these also make i a constant. This avoids some potential confusion, at the cost of a restriction that really isn’t necessary. Swift’s design is full of such decisions: when the cost of the restriction is minimal (as it is here), they go with avoiding potential confusion.

    10. If you think it through, there is an answer here, but the point is that it’s far from trivial. And defining semantics in terms of CPython’s specific optimizations, and then requiring people to work back to the more general design, is not exactly a clean way to do things…

    11. Look at the Python 2.7 reference for list displays (including comprehensions), set/dict displays (including comprehensions), and generator expressions. They’re a mess. List comprehensions are given their full semantics. Set and dict comprehensions repeat most of the same text (with different formatting, and with a typo), and still fail to actually define how the key: value pairs in a dict comprehension get handled. Then generator expressions only hint at the semantics. The Python 3.3 reference tries to refactor things to make it simpler. The intention was actually to make it even simpler: [i**2 for i in range(10)] ought to be just an optimization for list(i**2 for i in range(10)), with identical semantics. But it wasn’t until someone tried to write it that way that everyone realized that, in fact, they’re not identical. (Raise a StopIteration from the result expression or a top-level if clause and see what you get.) I think there’s some kind of similar simplification possible here, and I’d like to actually work it through ahead of time, rather than 3 versions after the semantics are implemented and it’s too late to change anything. (Not that I think my suggestion in this post will, or even necessarily should, get implemented in Python anyway, but you know what I mean.)


  11. When the first betas for Swift came out, I was impressed by their collection design. In particular, the way it allows them to write map-style functions that are lazy (like Python 3), but still as full-featured as possible. On further reflection, I realized that you can get the same kinds of views without needing Swift's complicated idea of generalized indexes. So, as soon as I had some spare time, I wrote up an implementation.

    Views


    The basic idea is simple:

    >>> m = views.map(lambda x: x*x, range(10))
    >>> len(m)
    10
    >>> m[3]
    9
    >>> list(m)
    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    >>> list(m)
    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    

    In other words, map acts like a sequence, not an iterator. But it's still just as lazy as an iterator. If I just write for sq in map(lambda x: x*x, range(10)): break, only the first square ever gets computed. Much like existing views (or "virtual sequences") in the builtins and stdlib, like range, memoryview, and dict_keys.

    But try that with filter and there's an obvious problem: you can iterate a filtered list lazily (and from either end), but you can't index it. For example:

    >>> f = views.filter(lambda x: x%2, range(10))
    >>> list(f)
    [1, 3, 5, 7, 9]
    >>> f[3]
    TypeError: 'filter' object is not subscriptable
    

    How could f[3] work? Well, in this case, with a trivial bit of arithmetic, you can figure out that the fourth odd positive number is 7, but that obviously isn't something that works in general, or that can be automated. The only way the filter object could do it is if it cached each value as you asked for it.

    That's actually not an unreasonable design, but it's a different design than what I was going for (or Swift's designers). Caching definitely fits in nicely with lazy single-linked lists a la Haskell, but in Python, we'd be getting the disadvantages (e.g., iterating a list in reverse takes linear time to get started, and linear space) without the advantages (the zero-space lazy iteration that you get automatically because iterating or tail-recursing on lst.tail leaves the head node unreferenced for the garbage collector). If you want caching, you can always build that as a wrapper view (which I'll get to later), but you can't remove caching if you don't want it.

    Anyway, filter can't be a sequence, but it can still be more than an iterator. It's a collection (Python doesn't have an official term for this; I'm following Terry Reedy's suggestion) that can be repeatedly iterated, to generate the same elements each time. It's also reversible.

    So, once you realize that filter can't produce a sequence, what happens if you call map on a filter? You can't get back a sequence, but you can get back a reversible collection. (But notice that map can take multiple iterables for input, and if you pass it two filters, the result can't be reversible--you have to know which one is longer, and filters can't give you their lengths in constant time.)

    Implementation


    The obvious way to implement map is by building a sequence that forwards each of its operations. For example:

    def __getitem__(self, index):
        # deal with slice stuff here
        # deal with negative indices here
        return self.func(*(it[index] for it in self.iterables))
    

    Obviously, if any of the iterables aren't sequences, they'll raise a TypeError, and the map will just pass that through. The magic of duck typing.

    But this gets a bit complicated when you consider how to handle negative indices. For example, m = map(add, [1, 2, 3], [10, 20, 30, 40]) has values 11, 22, 33. So, m[-2] is 22, which you can't get from [1, 2, 3][-2] + [10, 20, 30, 40][-2]--you need -3 on the longer input.

    Well, at least that's doable. The easiest way is to just map negative indices the way you would in a non-view sequence: if index < 0: index += len(self) (and then if it's still negative, it's an IndexError). And __len__ can just forward to its inputs: min(len(it) for it in self.iterables), and that will duck-type you an error if any of them aren't sized. Since all sequences are sized, this isn't a problem.
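
    Here's a minimal sketch of how those pieces fit together (ignoring slices, and borrowing the MapSequenceView name that shows up again below):

    class MapSequenceView:
        def __init__(self, func, *iterables):
            self.func, self.iterables = func, iterables
        def __len__(self):
            # like the builtin map and zip, the shortest input wins
            return min(len(it) for it in self.iterables)
        def __getitem__(self, index):
            # (slice handling omitted)
            if index < 0:
                index += len(self)
                if index < 0:
                    raise IndexError(index)
            return self.func(*(it[index] for it in self.iterables))

    from operator import add
    m = MapSequenceView(add, [1, 2, 3], [10, 20, 30, 40])
    assert len(m) == 3 and m[-2] == 22
    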

    But now look at __reversed__. You need to do the same kind of work there, but there are reversible iterables that aren't sized. So, how do you handle that? Well, if all the iterables are sized and reversible, you use one algorithm; if not, if there's only one iterable, it doesn't have to be sized, just reversible.

    This is all doable, but it starts to get into a whole lot of if and try statements all over the place.

    Meanwhile, although everything works out fine for EAFP duck typing, it doesn't work so nicely for LBYL testing. If you map over a filter, or an iterator, you get something that claims to be a sequence (isinstance(m, Sequence) returns true) but raises a TypeError if you try to use it as one.

    And it's even worse for static type checking. How would you represent to MyPy that a map over a list is a Sequence, but the exact same type when used for a map over an iterator is not?

    One way to solve all of these problems is to have separate classes for map over iterators, map over iterables, map over a single reversible iterable, map over sized reversible iterables, and map over sequences. Each is a subclass of the one before, and each adds or overrides some functionality and also adds at least one ABC. And then, map isn't a constructor for a type, it's a function that figures out which type to construct and does so. Which it obviously does by LBYL against the ABCs.
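
    A sketch of that dispatching constructor, checking ABCs from strongest to weakest (MapSequenceView is the class sketched above; the other view class names are placeholders for the rest of the hierarchy):

    from collections.abc import Iterator, Reversible, Sequence, Sized

    def map(func, *iterables):
        if all(isinstance(it, Sequence) for it in iterables):
            return MapSequenceView(func, *iterables)
        if all(isinstance(it, Sized) and isinstance(it, Reversible) for it in iterables):
            return MapSizedReversibleView(func, *iterables)
        if len(iterables) == 1 and isinstance(iterables[0], Reversible):
            return MapReversibleView(func, *iterables)
        if any(isinstance(it, Iterator) for it in iterables):
            return MapIteratorView(func, *iterables)   # openly a one-shot iterator
        return MapIterableView(func, *iterables)
    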

    This also lets you distinguish iterator inputs (so that what you get back doesn't pretend to be a re-iterable view with unstable contents, but is openly just a one-shot iterator). You can't really duck-type that, because it's the less-powerful type (Iterator) that has the extra method (__next__), but you can LBYL it easily. (In my initial implementation, I actually went the other way with this one, but I think that was a mistake.)

    And you can even statically type-check all of this by stubbing map as a bunch of overloads. Since there's no vararg type unpacking in PEP 484 types, you either have to be sloppy:

    def map(func: Callable[..., B], *iterables: Sequence) -> MapSequenceView[B]
    

    ... or repetitive:

    def map(func: Callable[[A0], B], iterable0: Sequence[A0]) -> MapSequenceView[B]
    def map(func: Callable[[A0, A1], B], iterable0: Sequence[A0], iterable1: Sequence[A1]) -> MapSequenceView[B]
    # ... and so on for up to N arguments, where N is the most any sane person would want
    

    (One nice thing about "gradual static typing" is that the sloppy version makes sense, and is more useful than nothing, despite being sloppy. Maybe one day Python will gain some way to express the complete set of overloads via iteration or recursion instead of repetition, but until then, we don't have to do the N overloads unless we want to.)

    But anyway, do one of those, and repeat for inputs that aren't Sequences but are Sized Reversible Iterables, and for a single Reversible, and so on, and the type checker can tell that map(lambda x: x*x, range(10))[3] is valid, but map(lambda x: x*x, filter(lambda x: x%2, range(10)))[3] is a type error.

    And that's pretty much exactly what you get from Swift's views, in Python, without any need for generalized indexes.

    More views


    The code for map is a bit repetitive. Worse, the code for filter shares a lot of the same repetition. But of course they're not identical.

    First, how do you express the difference in how they use their function argument? Well, that's pretty easy in terms of iteration: for one element (ignoring the multiple-input case for the moment), you yield from a special method. In the case of map that special method just does yield self.func(value), while for filter, it's if self.func(value): yield value.

    Meanwhile, there are a lot of small differences in the API: filter can take None for a function (which effectively just means bool), but map can't (it used to, in Python 2, where it effectively just meant lambda *args: tuple(args)). map can take multiple iterables, but filter can't.

    And, of course, there's the fact that map is "stronger": given appropriate inputs, it can be anything up to a Sequence, while filter can only be a Reversible.
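
    One way to factor that out is a shared base class whose __iter__ defers each element to a per-view hook; a single-input sketch (the class names here are made up, but _do_one is the hook in question):

    class ViewBase:
        def __init__(self, func, iterable):
            self.func, self.iterable = func, iterable
        def __iter__(self):
            for value in self.iterable:
                yield from self._do_one(value)

    class MapView(ViewBase):
        def _do_one(self, value):
            yield self.func(value)

    class FilterView(ViewBase):
        def _do_one(self, value):
            if (self.func or bool)(value):   # filter(None, ...) means bool
                yield value
    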

    Will the yield from self._do_one(value) trick work for all reasonable views? Does it work for reverse iteration as well as forward? Can the API differences be generalized in a simple wrapper? Is the notion of "stronger" view functions coherent and linear (and, if so, is a Sized Reversible stronger than a plain Reversible)? There are clearly multiple ways to handle multiple inputs of different lengths (hence the need for zip and zip_longest); can that be wrapped up? It's hard to guess in the abstract case. So, I built a few more views: zip and zip_longest, enumerate, islice...

    And then, of course, I got distracted by the fact that slicing an islice view should give you a smaller view. And...

    Slicing


    All of the views can return sub-views for slices. Should they?

    Well, that runs into the usual memory question. On the one hand, copying a largish subslice instead of just referencing it wastes memory for the copy. On the other hand, keeping a large collection alive when it's only referenced by a smallish subslice wastes memory for the rest of the collection. Which one is more of a problem in practice? For some kinds of work, clearly the copying is more of a problem--e.g., people using NumPy often have multiple slices over arrays that take up more than half their memory; if they were copies instead of slices, they wouldn't be able to fit. But in general?

    Well, there's obviously a reason Python chose copying semantics. But look at what the copies are: slicing a list gives you a new list; slicing a tuple gives you a new tuple; slicing a string gives you a new string. So, slicing a map view should give you... a new map view? Then that _is_ just a reference, right?

    Meanwhile, even non-sequences (like filter) can be "isliced" into views. (I've used itertools.islice and the builtin filter together in real-life code.) At first glance, it seems like it would be great if you could give filter views the native slicing syntax. But that might be more than a little odd--remember that iterating a sliced filter requires iterating everything before the start of the slice. Is that the same problem as indexing a filter requiring iterating everything before that index? Not really, because it's O(N) wasted time once per iteration, rather than O(N) wasted time N times for walking the indices, but it still seems bad.

    Anyway, I think it won't be clear how far to push slicing until I've played with it some more.

    Caching


    You can build a caching sequence on top of any iterable:

    from collections.abc import Sequence

    class CacheView(Sequence):
        def __init__(self, iterable):
            self._cache = []
            self._iterator = iter(iterable)
        def __len__(self):
            # forces the rest of the input, so we know the length
            self._cache.extend(self._iterator)
            return len(self._cache)
        def __getitem__(self, index):
            # deal with slices
            if index < 0:
                index += len(self)
            try:
                # pull elements until the cache is long enough to answer
                while index >= len(self._cache):
                    self._cache.append(next(self._iterator))
            except StopIteration:
                raise IndexError(index)
            return self._cache[index]
    

    Of course this needs to process the first n elements to do [n], which isn't necessary if you know you have a sequence. And if you know you have something sized and reversible, __len__, [-1], or __reversed__ doesn't need to process the entire input. And so on. In fact, we could build views that wrap each level of the hierarchy, and provide more laziness the more powerful the input is.

    We could also build views that present each level of the hierarchy. For example, I may want to wrap an iterator in a repeatable iterable without adding the sequence API. At first glance, that doesn't seem necessary. If I just want to iterate the same iterator twice, tee is already a great way to do that, and with maximal laziness (e.g., if I have one branch at position #200 and another at #201, I've only got two elements stored, not 201). If I want anything more than tee, I probably just want a full sequence, right? But once you consider infinite iterables, you clearly don't want to be forced to wrap those in a sequence which will consume all of your memory if you ever call len instead of raising an exception, right?

    Anyway, I'm experimenting with different ideas for the caching views. I'm not sure we actually need to generalize here. Or, if we do, I'm not sure it's the same as in the other cases. For most caches, you either want tee, a "half-sequence" (something with __getitem__ but not __len__, and no negative indices), a full lazy sequence, or just an eager list. Do you need cached reversible iteration?

    Generalized indexes


    So, we don't need generalized indexes to build views. But they still give you some nice features.

    In Swift, find works on any iterable, and gives you an index (or a range) that you can use to reference the found element (or sub-iterable).

    For example, imagine that you had to implement str.find and str.replace yourself. That's not too hard:

    def find(haystack, needle, start=0):
        while start <= len(haystack) - len(needle):
            if haystack[start:start+len(needle)] == needle:
                return start
            start += 1
        return -1

    def replace(haystack, needle, replacement):
        start = 0
        while True:
            start = find(haystack, needle, start)
            if start == -1:
                return haystack
            haystack = haystack[:start] + replacement + haystack[start + len(needle):]
            start += len(replacement)
    

    This works with any sequence type, not just str. But what if you wanted to work on any forward-iterable object, not just sequences? Writing find in terms of iterables would be painful, and then writing replace on top of that would be very difficult. But generalized indexes solve this problem. The idea is that indexes aren't just numbers; they're something like C++ iterators. The only thing you can do with them is compare them, advance them, and use them to index their sequence. Under the covers, they might just be a number (for a list), or they might be a list of numbers (for a tree) or even a reference to a node (for a linked list). But the key is that you can remember multiple iteration positions, advance them independently, and do things like construct slices between them, and it all works for any iterable collection (although not for an iterator).

    def find(haystack, needle, start=None):
        if start is None:
            start = haystack.start()
        while start != haystack.end():
            pos = start
            for element in needle:
                if haystack[pos] != element:
                    break
                pos = pos.advance()
            else:
                return start, pos
            start = start.advance()
        return start, start
    
    def replace(haystack, needle, replacement):
        start = haystack.start()
        while True:
            start, end = find(haystack, needle, start)
            if start == haystack.end():
                return haystack
            haystack = haystack[:start] + replacement + haystack[end:]
            start = end
    

    The big question is, how often do you need this? Swift has linked lists in its stdlib. More importantly, its strings are stored as UTF-8 but iterated as grapheme clusters, meaning you can't randomly access characters. But they also intentionally designed their string API so that you use slicing and startswith and endswith, so you still rarely need to actually use indices as indices this way.

    The one place where they're indispensable in Swift is for mutation. If you want to insert into the middle of a linked list, or a string, you need some way to indicate where you want to insert. With an iterator-based API, you can't mutate like this without a tightly-coupled knowledge of the internals of the thing you're iterating over; instead, you'd have to return a new object (maybe a chain of iterators). But doing complex things immutably instead of mutably is already the norm in Python, and has a number of advantages. In fact, notice that, even with generalized indexes, what came naturally to me in replace above was still an immutable copy, not an in-place mutation. Sometimes, working in-place is a valuable optimization. More often, it's a pessimization--e.g., C++ copies strings all over the place where Python (or manually-managed C) doesn't, and then tries to use copy-on-write, inline-small-string, and other optimizations to reduce that.

    So, assuming we have views, including slice views, and the handful of tricky building-block functions (like startswith) are already written, and we don't want to mutate in-place, what do we get out of generalized indexes? Not enough to be worth the complexity, I think.

    At any rate, we definitely don't need them for the views themselves to be useful.

    Conclusions


    Building Swift-like views actually seems easier in Python than in Swift. The lack of generalized indexes is not a problem. Dynamic typing makes things a little easier, not harder (although it would have been more of a problem back in the days before ABCs), and gradual static typing allows us to express (and maybe type-check) the easier and more useful part of the interface without having to work out the whole thing.

    Of course the problem is that Python already has a language and a stdlib designed around iteration, not views. So, it's not clear how often your code would be able to take advantage of the expanded interfaces in the first place.

  12. In a previous post, I explained in detail how lookup works in Python. But, briefly, Python has certain optimizations baked into the design; those optimizations may sometimes restrict your code (e.g., you can't use exec to set a local variable), and may even restrict other optimizations the implementation might want to do. So, let's go back to the start and redesign the lookup model and see if there's an alternative.

    Core Python behavior

    There are some ideas that are central enough to Python that changing them would give us an entirely new language, so we don't want to change them. The tl;dr for most of it is the basic LEGB rule, but there's a bit more to it than that:
    • Implicit declarations: assigning to a new name creates a new local variable.
      • Explicit nonlocal declarations: after nonlocal spam, assigning to spam does not implicitly declare a new local variable; instead, it references the first variable named spam in an outer scope.
      • Explicit global declarations: after global spam, assigning to spam does not implicitly declare a new local variable; instead, it references the variable named spam in the module's global scope.
    • Lexical scoping: free names are bound in lexically outer scopes, not in the dynamic runtime environment.
    • Globals and builtins: the global scope of a module may be the outermost lexical scope in the source code, but builtins are treated as another scope even outside that.
    • Local scopes are defined by function definitions (including def, lambda, and the hidden definition inside comprehensions) and class definitions only (not by every block/suite).
    • Dynamic globals: names at the module global scope are always looked up by name at runtime. This allows us to build modules iteratively (compiling statement by statement--as is done for __main__, the top-level script or interactive REPL), modify their environments with statements like exec, and so on. (Note that we're talking about dynamic lookup within a lexical scope, not about dynamic scope. This isn't too important for globals, but it will matter later.)

    Other ideas aren't fundamental to Python's design, and are driven by other design decisions, or by optimization, or by implementation simplicity; we could change them and still have a language that feels like Python (even if it might not actually meet the Python reference well enough to be called a Python implementation).

    Function locals

    Whenever execution enters a new function call, a new context gets created. Conceptually, this is a mapping, that holds the callee's local variables. You can get the local context at any time by calling locals or vars (and it's implicitly passed as the default value to any call to eval or exec).

    However, this isn't necessarily what Python actually uses. In particular, CPython stores the context as an array, by converting all local name lookups into array indices at compile time. So, if you try to change your local context (e.g., locals()["x"] = 3 or exec('x = 3')), it will generally have no effect.
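
    A quick demonstration of the difference (CPython behavior):

    def f():
        x = 1
        locals()['x'] = 3    # writes into a snapshot dict, not the real fast locals
        return x

    assert f() == 1          # the store had no effect on x

    g = {'x': 1}
    exec("x = 3", g)         # module-style code really does go through its dict
    assert g['x'] == 3
    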

    This is different from the case with global variables. Code that accesses global variables does so through the same dict returned by globals(), and module-level code uses the global scope as its local scope.

    We can force things to work the same way at the local scope by converting every LOAD_FAST and STORE_FAST into a LOAD_NAME and STORE_NAME, putting the names in co_names instead of co_varnames, and making sure that the locals dict always gets created instead of only on demand (which, in CPython, you can do by making sure the usual CO_NEWLOCALS flag is set but the usual CO_OPTIMIZED is not). This is all something we could do in a decorator with some bytecode hacking.

    Closures

    As mentioned above, Python compiles local variable references to indexes into an array of locals. So how do closures work? For any local accessed by an inner function, a cell object--just a holder with a single member containing the value--is stored in place of the value in the array. When that inner function is entered, it gets references to the same cells in its frame. Instead of LOAD_FAST and STORE_FAST you get LOAD_DEREF and STORE_DEREF, which, instead of accessing the cell itself, access the value inside the cell.

    When you call locals, if there are any freevars (cells inherited from an outer scope) or cellvars (cells exposed to an inner scope), you don't get the cell objects in the dictionary, but their values. Changing them has no effect even in cases where changing local variables does.

    If we changed local lookup to be by name, what would happen with closures? Well, I think the changes above would copy all names inward, which works fine for names that are never rebound; but if we're trying to rebind names in the outer scope (i.e., we've used nonlocal), or (I think) even if we're trying to pick up rebindings from the outer scope, it breaks.

    There might be a way to make this work easily using the LOAD_CLASSDEREF op, which is designed to look in a real locals dict (class suites don't use fast locals) and then fall back to consulting cells. But I'm not sure. Anyway, the next bit is the fun part, so let's charge forward assuming this isn't workable.

    The traditional Lispy solution to this is to make environments into something like a ChainMap instead of a dict. That automatically fixes picking up rebindings from the outer scope, but it doesn't help for pushing rebindings back up. We need two separate operations: "add/modify this binding in the topmost dict" vs. "modify this binding in the topmost dict that already has such a binding". Then every LOAD_DEREF turns into a LOAD_NAME just like LOAD_FAST did (it's now a chained lookup automatically by virtue of using ChainMap), but STORE_DEREF calls this new rebind-topmost-existing-binding method instead of the usual set-in-top-dict method used by STORE_NAME.
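
    Here's a minimal sketch of what those two operations could look like, built on collections.ChainMap. The Env class and its define/rebind methods are made-up names for illustration, not anything CPython actually has:

    from collections import ChainMap

    class Env(ChainMap):
        def define(self, name, value):
            # "add/modify this binding in the topmost dict" -- what STORE_NAME does
            self.maps[0][name] = value

        def rebind(self, name, value):
            # "modify this binding in the topmost dict that already has such a
            # binding" -- what a nonlocal-style STORE_DEREF would become
            for m in self.maps:
                if name in m:
                    m[name] = value
                    return
            raise NameError(name)

    outer = Env({'i': 0})
    inner = outer.new_child()    # a nested scope chained onto the outer one
    inner.rebind('i', 42)        # the rebinding is visible in the outer scope
    assert outer['i'] == 42
    inner.define('i', 99)        # defining shadows it instead
    assert outer['i'] == 42 and inner['i'] == 99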

    If we wanted to still have cell objects (they are exposed to the Python level, and even mutable in 3.7+), I think they'd have to change from a simple value holder to a reference to a ChainMap together with a name.

    In fact, having done that, we could just leave LOAD_DEREF and STORE_DEREF alone, and let the cell variables handle the indirection through the ChainMap for us. Although I think that might be more confusing, not less.

    The other closure-related opcode is LOAD_CLOSURE, which loads a cell object so it can be passed farther down. I think nobody will need to access the loaded cell anymore unless someone inspects the closure in a frame object or the like, so we could just strip these out (or change them to load a const—something has to be on the stack…) if we're willing to break that. Alternatively, if we're building chain-map-referencing cell objects, we have to replace LOAD_CLOSURE with code to build them.

    There might be other subtle semantic differences from the change to chained environments. On the other hand, it would make it a lot easier to experiment with different semantics from within Python. Creating new cell variables and new closures is very hard; creating a new chained environment to pass to something like exec is dead easy.
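
    In fact, you can already get a taste of this today, because exec accepts any mapping (not just a dict) for its locals. A quick demonstration with a plain ChainMap:

    from collections import ChainMap

    outer = {'x': 1}
    inner = ChainMap({}, outer)      # a fresh scope chained onto outer
    exec("y = x + 1", {}, inner)     # LOAD_NAME finds x through the chain
    assert inner['y'] == 2           # STORE_NAME wrote into the topmost dict
    assert 'y' not in outer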

    One thing you could (I think) conceivably do with this is to generate your own chained environments and pass them into function code instead of using the default ones, which you could use to, e.g., explicitly use dynamic instead of lexical scoping for some particular call.

    Globals

    Once we have the same flexibility in locals and closures that we do in globals, is there any need for a special global scope anymore? We could just make it another environment at the bottom of the chain. If we wanted to replace LOAD_GLOBAL and STORE_GLOBAL with code that uses the chain map, we'd need a way to go to the back of the chain rather than the front.

    But the simpler solution is to just keep passing around separate globals and locals (in functions and classes) and keep accessing globals the same as always. The fact that the global namespace is just a reference to the same dict at the back of the local namespace's chain wouldn't affect normal code, but it would be true for anyone who wants to play around.

    Either way, not only can we now close over globals, we can even (if we decided to implement closure cells as indirections to the chain instead of scrapping them) construct and pass around cells for them. Closing over globals is something that nobody but Lisp fanatics even notices that Python can't do today, but hey, it can't. Of course we might want to preserve that limitation instead of fixing it, but I don't think we need to. The compiler should still barf on using nonlocal for something that's global instead of outer, and still generate LOAD_GLOBAL ops for something that's global instead of outer with neither declaration, and so on, so existing code should just work the same way; it would just become possible, with a bit of monkeying, to pick up a global environment, pass it around as if it were a nonlocal one, and have everything magically work.

    Builtins

    If we monkey with globals, builtins are a bit of a problem. Conceptually, the global namespace already works sort of like a chained environment with globals in front of builtins (and no way to directly store into the outer namespace). However, practically, it's a big pile of hacks. For example, eval just takes locals and globals, no builtins. If the passed-in globals has a __builtins__ member, it's used as the builtin environment; otherwise, a __builtins__ member referencing the caller's builtins gets automatically inserted into the passed-in globals first. Trying to make this not break anything could be a lot of fun.
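
    You can watch that hack from Python; eval quietly mutates the globals dict you hand it:

    g = {}
    eval("1 + 1", g)
    assert '__builtins__' in g    # a builtins mapping was inserted for us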

    Also, of course, if we move builtins into a real chained environment outside of globals instead of the sort-of-like-that thing Python does today, the LOAD_GLOBAL and STORE_GLOBAL are no longer going to the end of the chain, but to one level before the end, which is getting a little too special to be readable as a special case. But I don't think we wanted to replace the global ops anyway. And, if not, we can just leave builtins alone.

    Classes

    Code inside a class suite gets executed the same way as top-level code (as opposed to function code), except that instead of having the enclosing global namespace for both its globals and locals, it has the enclosing globals and a fresh local namespace of its own. (Also, some extra values are crammed into the locals to allow super to work, set up __module__, etc., but I think the compiler does that just by emitting some explicit STORE_NAME stuff at the top of the code.) When the suite is finished executing, the metaclass gets called with the class name, base classes, and the local environment.

    As mentioned above, classes actually use the locals dict, not a fast array on the frame like functions, so there's no need to change anything there. As for using LOAD_CLASSDEREF in place of LOAD_DEREF, changing that to LOAD_NAME with a chained environment should just work.

    Summary

    Basically, what (I think) we could change without breaking everything is:
    • Replace the locals dictionary with a chain of dictionaries (with an extra method to replace rather than insert).
    • Make the locals dictionary actually live and mutable even inside function calls.
    • Replace all fast locals with existing ops that go to the locals dictionary.
    • Replace closure cell loads with the same existing op.
    • Replace closure cell stores with calls to the new replace method (or a new opcode that does that).
    • Wrap up exec-like functions to chain globals onto the end of locals.
    And the result should be something that:
    • Works like normal Python for 99.9% of code.
    • Is significantly slower (fast locals and closure cells are an optimization, as the name "fast" implies, and we'd be tossing them).
    • Exposes a bit of new functionality that might be fun to play with.
    • Is significantly easier to understand, except that nobody could really try to learn it until they already knew how the more complicated normal Python rules work.
    So, is this worth doing? Well, I've got an 0.1 decorator-based version that completely breaks any inner functions with closure cells, or inner class definitions with or without them, but works for simple functions, which is kind of cool. If I have the time and inclination, maybe I'll go farther, or at least clean up what I have and post it on GitHub. But I can't imagine actually using it for anything. The exploration and implementation is the only point.

  13. The documentation does a great job explaining how things normally get looked up, and how you can hook them.

    But to understand how the hooking works, you need to go under the covers to see how that normal lookup actually happens.

    When I say "Python" below, I'm mostly talking about CPython 3.5. Other implementations vary to differing degrees in how they implement things; they just have to result in the same rules being followed.

    Variables: LEGB


    Python has simple lexical scoping, like most other languages. (Unlike many languages, only module, function, and class definitions create new scopes, not every suite, but that's not too important here. Also, remember that lambda expressions and comprehensions are function definitions, and thus create new scopes, despite not having suites.)

    The specific rule Python uses for lookup is LEGB: Local, Enclosing, Global, Builtin. In other words, when you write x, Python looks for a local variable named x; then (if you're in a nested function or class) it goes through all the enclosing scopes' local variables; then it looks at the module's globals; and then it looks at builtins.

    Meanwhile, because Python has implicit declarations (any assignment may be declaring a new variable), it needs a rule for that: any assignment creates a local variable (shadowing any enclosing, global, or builtin of the same name), unless you have a nonlocal or global declaration in the same scope.

    The way Python implements these rules is not as simple as you'd think. Primarily, this is for reasons of optimization.

    tl;dr


    The short version is:

    * Locals (except when shared with a closure) are loaded and stored in an array on the stack frame.
    * Nonlocals (and locals shared with a closure) are basically the same but with an extra dereference.
    * Globals are loaded and stored in a dict.
    * Builtins are loaded via two or three dict lookups (and are never directly stored).

    Fast locals


    Local variables are stored in an array on the stack frame, and loads and stores are compiled into array-indexing operations on that array. This is basically the same behavior as C--and it's a lot faster than going through a dictionary.

    Exceptions to this rule:

    * Modules don't have fast locals; they use the globals dict as their locals, so everything works the same as for the globals section below.
    * Local variables that are shared with a nested closure are slightly more complicated; see the next section.

    So, how does Python know how to compile your locals to array indices? And how does it turn that back into the local variable names that you can see in tracebacks, locals(), inspect and dis functions, etc.?

    When compiling the body of your function, Python keeps track of all of your variable assignments. (Assignments to attributes or subscripts don't count; those are just method calls on the attributed or subscripted object.) Function parameters are local, any variable that you assign to is local (unless you have a global or nonlocal statement for it); anything else is not.

    So, it creates a tuple of all those local names, and stores it in the co_varnames of the code object. It also stores the total count in co_nlocals. Then, when compiling your code to bytecode, every load or save to co_varnames[3] becomes a LOAD_FAST 3 or STORE_FAST 3.

    When a function is called, the stack frame reserves co_nlocals extra space at the end, called f_localsplus, and when the interpreter sees a LOAD_FAST 3, it just reads from frame.f_localsplus[3].

    For example, in this code:

    def spam(eggs):
        cheese = 0
        return eggs + cheese
    

    ... eggs is local because it's a parameter, and cheese is local because you assign to it, so that last line will get compiled to, in effect, LOAD_FAST(0) + LOAD_FAST(1), not LOAD_NAME('eggs') + LOAD_NAME('cheese').
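
    You can check this yourself with the dis module (the exact offsets and line numbers vary by version, but the opcodes won't):

    import dis

    def spam(eggs):
        cheese = 0
        return eggs + cheese

    print(spam.__code__.co_varnames)   # ('eggs', 'cheese')
    dis.dis(spam)                      # all LOAD_FAST/STORE_FAST, no LOAD_NAME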

    Now you can probably understand why these optimizations work:

    def spam(eggs=eggs):
        local_cheese = cheese
        for _ in range(1000000000):
            eggs(local_cheese())
    

    You've replaced a billion dictionary lookups for the globals eggs and cheese with one dictionary lookup for each plus a billion array lookups for f_localsplus[0] and f_localsplus[1], which is obviously faster. (Dictionary lookups are, of course, constant-time, just like array lookups. But with a larger constant multiplier--large enough to make a significant difference here.)

    But how does Python get the names back out? Well, as mentioned, they're stored in co_varnames. And when a function is called, the stack frame gets a pointer to the function's code object in f_code. So, if it needs to build a traceback or disassembly or whatever for fast #1, it just gives you frame.f_code.co_varnames[1].

    What about locals()? Well, there actually is a locals dict on the stack frame, called f_locals. But it doesn't get created unless you ask for it (e.g., by calling locals(), or just by asking for the Python version of the frame object with, e.g., sys._getframe()). This calls a function PyFrame_FastToLocals, which effectively does frame.f_locals = dict(zip(frame.f_code.co_varnames, frame.f_localsplus)), and then it returns that f_locals to you. It should be obvious why, as the docs say, modifying locals() doesn't affect the function's actual locals: it's just a snapshot of your locals. (There is actually a function PyFrame_LocalsToFast, but you can only call that from C, not Python.)
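
    A tiny demonstration (in CPython) that you're only ever poking at the snapshot:

    def f():
        x = 1
        locals()['x'] = 2    # mutates the snapshot dict, not the fast-locals array
        return x

    assert f() == 1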

    What about exec()? Again, as the docs say, this function by default works on (something equivalent to) its caller's globals() and locals(), so, again, it can't change your local variables.

    I used Python-like pseudocode for all the stuff that happens in C under the covers. But most of that actually is exposed to Python. For example, if you call sys._getframe(), the frame object you get back has an f_code member. Or, if you just look at a function, it has a __code__ member. And either way, that object has a co_varnames. And so on. The f_localsplus member and the PyFrame_FastToLocals and PyFrame_LocalsToFast functions are the only things mentioned here that aren't exposed. And even there, you can always do this:

    ctypes.pythonapi.PyFrame_LocalsToFast.argtypes = [ctypes.py_object, ctypes.c_int]
    ctypes.pythonapi.PyFrame_LocalsToFast.restype = None
    ctypes.pythonapi.PyFrame_LocalsToFast(sys._getframe(), 0)
    

    Have fun with that.

    Free and cell variables


    Variables from an enclosing scope are almost the same as locals--they're handled by looking them up in an array--but with an extra dereference. But that requires a bit of extra work to set up.

    So, how does Python know how to compile your nonlocals to array indices?

    As we saw above, when Python sees a variable, it knows whether that variable is local to your function. If it isn't, Python steps through the outer enclosing functions to see if it's local to any of them. If so, it's a free variable. (Of course a global statement means it's a global variable, so it doesn't look through enclosing scopes, and a nonlocal means it's a free variable even if there's a local assignment.)

    Notice that the module's global scope is not counted as a scope for these purposes--anything found in the global scope is always global, not enclosing.

    So anyway, the compiler creates a tuple of all those free names, and stores it in co_freevars. Then, when compiling your code to bytecode, every load or save to co_freevars[2] becomes a LOAD_DEREF 2 or STORE_DEREF 2.

    Now, when a function is called, the stack frame doesn't just reserve co_nlocals space, but co_nlocals + len(co_freevars) extra space at the end (I'm still lying to you here; see a few paragraphs down for the truth) in f_localsplus, and when the interpreter sees a LOAD_DEREF 2, it just reads from frame.f_localsplus[frame.f_code.co_nlocals + 2].

    Well, not quite. That obviously wouldn't allow you to reassign closure variables. (You could call mutating methods, but assignment would just break the connection, just like after a = b = [1, 2, 3], a.append(4) affects b, but a = [4] breaks the connection.) And that would defeat the entire purpose of nonlocal.

    So, what you actually have in that slot isn't the object itself, but a cell object that has a member named cell_contents that's the actual object. And what the outer scope has in its fast locals slot is the same cell object, not the thing it points to. So, if either the inner function or the outer function assigns to the variable, they're not assigning to one of their fast local slots, but assigning to the cell_contents of a cell in one of their fast local slots, so they can each see what the other one does.

    That's why local variables that are referenced by a nested scope work like nonlocal variables. They're stored in co_cellvars instead of co_freevars. Your frame's actual space is co_nlocals + len(co_cellvars) + len(co_freevars). The name that goes with a LOAD_DEREF 2 is equivalent to (co_cellvars+co_freevars)[2]. At runtime, that LOAD_DEREF 2 actually does a f_localsplus[co_nlocals+2].cell_contents. When you construct a nested function at runtime, there's zero or more LOAD_CLOSURE bytecodes to push the current frame's cells onto the stack, then a MAKE_CLOSURE instead of MAKE_FUNCTION, which does the extra step of popping those cells off and stashing them in the function's __closure__ attribute. And when you call a function, its __closure__ cells get copied into the f_localsplus array. Meanwhile, even if your function doesn't access any variables from an enclosing scope, if a function nested within your function does so, they're free variables for you, and have to get put in your __closure__ so they can end up in your stack frame so they can get to your nested function's __closure__ and stack frame.
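
    Most of this is visible from Python, so you can sanity-check that horrible paragraph interactively:

    def outer():
        x = 1
        def inner():
            return x
        return inner

    f = outer()
    print(outer.__code__.co_cellvars)        # ('x',)  -- shared with a nested scope
    print(f.__code__.co_freevars)            # ('x',)  -- inherited from outer
    print(f.__closure__[0].cell_contents)    # 1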

    And I believe that's all of the actual details correct.

    But you don't have to memorize all this stuff, or go dig it out of the C code, or work it out by trial and error, or plow your way through that horrible paragraph ever again; the docs for the dis module describe what each opcode does, and have improved tremendously over the past few versions of Python. If you can't remember exactly what LOAD_DEREF looks up and where, just look at the docs: "Loads the cell contained in slot i of the cell and free variable storage. Pushes a reference to the object the cell contains on the stack." That might not have explained everything to you if you'd never read anything before, but it should be more than enough to refresh your memory once you have.

    Also, again, most of this is exposed to Python, so the Python-esque pseudocode above is almost all real code. However, from Python, you can't write to a cell through its cell_contents, and you can't construct new cell instances. (If you want to have fun with ctypes.pythonapi, you can probably figure it out from here.)

    Finally, locals(), in addition to giving you a snapshot of your locals rather than a live way to change them, also flattens out any cells, giving you their cell_contents instead.

    At first glance, it seems like it would be simpler to just have each function's stack frame have a pointer to its outer function's stack frame. You could still optimize things just as much by having instructions like LOAD_OUTER 3, 2 to go 3 steps out and get the fast-local 2 from that frame, right? In fact, some languages do implement closures that way. But that means that a function's stack frame can't go away--and, therefore, neither can anything else it refers to--until all closures referring to any of its variables, or any of its outer scopes, go away. And that "anything else it refers to" includes the stack frame of its caller. You can see why this could be a problem. There are ways to optimize that (although you still run into problems with coroutines), but it's actually simpler, not more complicated, to just reference the individual values pulled off the frame, as Python does.

    Globals and builtins


    Globals are easy. If the compiler sees a name that isn't assigned in the current function or an outer nested function, or that has a global statement, it's global. There's no special operation for builtins; it's just a fallback for every global lookup.

    The compiler does still use an array here, but only to fetch the name from a constant tuple on the code object (co_names) to look up with LOAD_GLOBAL and STORE_GLOBAL at runtime. They look something like this:

    def LOAD_GLOBAL(index):
        name = frame.f_code.co_names[index]
        try:
            return frame.f_globals[name]
        except KeyError as e:
            try:
                builtins = frame.f_globals['__builtins__']
                if isinstance(builtins, ModuleType): builtins = builtins.__dict__
                return builtins[name]
            except KeyError as f:
                raise e
    
    def STORE_GLOBAL(index, value):
        name = frame.f_code.co_names[index]
        frame.f_globals[name] = value
    

    That f_globals will normally be the actual globals for the module the current (module-level, function, or class) code was defined in, but it could be something else. (For example, exec lets you pass in any mapping for globals; if you want a sanitized and quasi-sandboxed environment with your own under-the-hood functions and builtins like exec and __import__ stripped out, that's easy.)
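
    For example, here's the usual quasi-sandbox trick (which, as always, is not an actual security sandbox):

    safe_globals = {'__builtins__': {'len': len, 'print': print}}
    exec("print(len('spam'))", safe_globals)    # 4
    try:
        exec("__import__('os')", safe_globals)
    except NameError as e:
        print(e)    # __import__ isn't in this environment, so the lookup fails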

    Much simpler than closure variables, right? But it can be slower. So, why have a special case for globals instead of just making it the last stop on the closure chain (or making builtins the last stop instead of a special case within the globals special case)?

    Well, for one thing, not having fast locals or cells and doing everything through the globals dict allows for more dynamic stuff, like creating new variables in exec and using them in the same scope. For another, not having to make a list of all of the names at scope, simply adding them to the dict as they're created, means that the global scope can be executed incrementally, statement by statement (so it works the same as in the REPL--in fact, both the REPL and the top-level script work by just incrementally building a module named __main__). In particular, you can define a function that uses a global, and start passing it around, before that global has been created.
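
    That last point is easy to see in a tiny example:

    def greet():
        return "hello, " + name      # 'name' compiles to a global lookup

    # greet already exists and can be passed around...
    name = "world"                   # ...and the global can show up afterward
    print(greet())                   # hello, world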

    More importantly, lookup speed shouldn't really matter at global level. If you're doing any heavy computation that involves millions of name lookups at module scope, move it into a function and call that function. And if you're using those globals millions of times from another function, get them out of global scope too. Those are "structured programming 101" anyway; why try to optimize things that people shouldn't be doing?

    Well, except for the fact that your module's top-level functions and classes are globals, and it's perfectly reasonable to use them millions of times. In less dynamic languages, they're only looked up as globals at compile time, so they aren't like runtime global variables--but in Python, functions, types, and everything else are first-class values, and they're looked up the same way as any other global variables. In practice, this usually isn't a bottleneck for most programs; when Python is too dynamic and slow for the inner code of some project, there are usually other factors that matter much more than this dict lookup. But occasionally, it is. So, this is one of those places people keep looking for new optimizations. At the user level, you can always just copy the global to a local or a default parameter value, as mentioned earlier. Or you can use an ambitious optimizer that tries to automate the same benefit (like FAT Python, which creates an extra version of each function's __code__ with globals copied into the co_consts, and wraps the function in a guard that checks whether the cached globals have changed since the function was compiled and, if they have, calls the original "slow" version instead of the fast one).

    What about classes?


    A class definition is a scope, just like a function definition. There are a couple of minor differences (the special LOAD_CLASSDEREF opcode, the magic __class__ cell that's created in class scope so methods can access it directly or via super, etc.). But the really big difference is that the class code isn't executed every time the class is called; it's executed once to produce the dict that gets passed to type (or another metaclass), and then it, and its scope, goes away. (Of course the body of its __new__ and/or __init__ will get executed when the class is called.)
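
    If that's surprising, watch it happen (the class names here are just throwaway examples):

    class Spam:
        print("running the class body")    # executes exactly once, right now
        x = 1

    Spam(); Spam()                         # instantiating doesn't re-run the body

    # Roughly what the class statement did under the covers:
    ns = {'x': 1}
    Eggs = type('Eggs', (), ns)
    assert Eggs.x == 1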

    I don't think there's anything important to learn about classes nested in functions or classes (or classes nested in functions, unless you want to know how super works) that isn't obvious once you play with it.

    Attributes


    Attribute lookup seems really simple at first: Every attribute lookup is compiled to LOAD_ATTR, which uses a string taken from the same co_names array as LOAD_GLOBAL, and looks it up in the object's __dict__. Or, if it's not found there, in the object's type's dict. Or, if not there, then each type on that type's MRO. If that all fails, __getattr__ is looked up (but using special-method lookup, not normal attribute lookup) and called. Oh, unless __getattribute__ was overridden, in which case whatever it does happens instead. Also, somehow functions get magically turned into bound methods. And @property, and __slots__ and... and how does it find the __dict__ in the first place? And what about C-API types? Maybe it's not so simple after all...

    It's actually pretty simple: the only rule in the interpreter is special-method lookup, and LOAD_ATTR only ever uses that rule to look up and then call __getattribute__. (Other special methods are handled internally by their opcodes or other special builtin or extension functions written in C.) Everything else happens inside __getattribute__; we'll get to that later.

    Special method lookup


    Special method lookup is only somewhat vaguely documented, in the data model reference chapter. See call_method and its helpers in the source for details.

    The basic idea of special method lookup is that it ignores the instance dictionary, and most kinds of dynamic customization.
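
    Here's what "ignores the instance dictionary" means in practice:

    class Spam:
        pass

    s = Spam()
    s.__add__ = lambda other: 42           # per-instance special method: ignored
    try:
        s + 1
    except TypeError as e:
        print(e)                           # unsupported operand type(s) for +

    Spam.__add__ = lambda self, other: 42  # on the type, it's found
    print(s + 1)                           # 42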

    You may wonder what the point of special method lookup is. Obviously __getattribute__ lookup can't be hooked by __getattribute__, but why should it also skip the instance, or treat metaclasses differently? And why should other special methods like __add__ follow the same rules? (If you wanted to add per-instance methods to a value, __add__ or __lt__ would probably be the first things you'd think of.)

    In fact, the original "new-style class" design for Python 2.3 didn't have special method lookup--but (long before the first beta), Guido discovered a problem: Some methods make sense both on normal objects, and on types. For example, hash(int) makes just as much sense as hash(3). By normal lookup rules, both call int.__hash__. By special method lookup rules the former skips int's dict and instead calls type.__hash__. And a few other methods have the same problem, like str and repr. And what would happen if a type had both __call__ and __new__? And, even those that don't make sense for types--int + str, abs(Counter), or -socket--might make sense for some subclass of type (e.g., the new optional static types use subscripting, like Iterable[int], for generics).

    So, that's why we have special method lookup, and it obviously solves the __getattribute__ problem.

    How do we find the __dict__ without finding the __dict__ first?


    There's still a circular definition problem here. How do we find an attribute? By calling the __getattribute__ attribute on the object's type and walking its MRO looking in the __dict__ of each type. So... how do we get that __dict__? Certainly not by searching __dict__s. And the process also involves a variety of other attribute lookups--you get an object's type via its __class__, a type's MRO via its __mro__, and so on.

    The secret is that none of these are actual attribute lookups. In the C struct representing each object, there's an ob_type pointing to the type. In the C struct representing each type object, there's a tp_mro pointing to the MRO, and a tp_getattro pointing to the __getattribute__, and a tp_dict pointing to the __dict__, and so on.

    But you can set __class__ or __dict__ or __getattribute__ from Python. How does that work? In general, when you overwrite one of these special values, the interpreter updates the corresponding C struct slot. (There are a few cases where it does different things, like using a slot as a cache and searching the dict if it's NULL, but that works too.)
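
    For example, assigning __class__ on an ordinary instance really does retarget that slot (as long as the two types have compatible layouts):

    class A: pass
    class B: pass

    a = A()
    a.__class__ = B        # under the covers, this updates the ob_type pointer
    assert type(a) is B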

    So, in the pseudocode below, when I write type(obj).__mro__, that's really following the ob_type slot from the object, then the tp_mro slot from the type, and so on.

    Finding __getattribute__


    The basic process to lookup and call __getattribute__ is, in pseudocode:

    def find_special(obj, name):
        for t in type(obj).__mro__:
            if name in t.__dict__:
                it = t.__dict__[name]
                __get__ = find_special(it, '__get__')
                if __get__:
                    it = __get__(obj, type(obj))
                return it
    
    def LOAD_ATTR(obj, index):
        name = f_code.co_names[index]
        __getattribute__ = find_special(obj, '__getattribute__')
        if not __getattribute__:
            raise AttributeError
        try:
            return __getattribute__(name)
        except AttributeError:
            __getattr__ = find_special(obj, '__getattr__')
            if not __getattr__:
                raise
            return __getattr__(name)
    

    If you don't understand the __get__, read the Descriptor HowTo Guide.

    (It shouldn't be possible to ever fail to find a __getattribute__, but there's code for it anyway--if you play around with the C API, you can try to call a method on a class before setting up its MRO, and you get an AttributeError rather than a segfault or something. On the other hand, __getattr__ can easily be missing, because object doesn't implement that, nor do most other builtin/stdlib types.)

    I've skipped over a few things (like the code to handle types that aren't fully constructed yet without crashing, etc.), but that's the basic idea. And it's also how other special-method lookups, like the nb_add (__add__) lookup inside BINARY_ADD, work.

    Notice that a __getattribute__ defined in the instance, or dynamically (by __getattribute__?), or a __getattribute__, __mro__, __class__, or __dict__ pushed onto the type by its metatype, all get ignored here.

    But if you've overridden __getattribute__ on your object's type, or any of its supertypes, or done the equivalent in C, your code will get called (as a bound method--that's the point of the descriptor get bit).

    One thing I skipped that may be worth noting is the method cache. See the macros at the top of typeobject.c, but the basic idea is that 4096 methods (that aren't special-cased away or given unreasonable names) get cached, so they can skip all the MRO-walking and dict lookups and jump right to the descriptor call.

    Inside the default __getattribute__


    Assuming you got to object.__getattribute__, what it does is pretty similar to special-method lookup all over again, but with some minor differences. The short version is that it allows instances, metaclass __getattribute__, and __getattr__ to get involved. In full detail, it's something like:

    • There's no type slot equivalent to "general name known only by a string passed in at runtime", so the per-type step does the tp_dict[name] bit.
    • Four-stage lookup instead of just the per-type MRO walk:
      1. Look on the type and its MRO for a data descriptor (that is, a descriptor with __set__). If so, return its __get__(obj, type(obj)).
      2. Look in the object itself only: if tp_dict (__dict__) and tp_dict[name] exist, you're done--do not call its __get__ method, just return it.
      3. Look on the type and its MRO for a non-data descriptor or non-descriptor. Return its __get__(obj, type(obj)) (or, if it's not a descriptor, just return it).
      4. Special-method lookup __getattr__, call that, and do the __get__(obj, type(obj)) on the result if present.

    So, in pseudocode:

    def __getattribute__(self, name):
        # find a descriptor (or plain value) for name on the type and its MRO
        _missing = object()
        desc = _missing
        for t in type(self).__mro__:
            if name in t.__dict__:
                desc = t.__dict__[name]
                break
        # 1. a data descriptor on the type wins over the instance dict
        if desc is not _missing and find_special(desc, '__set__'):
            return find_special(desc, '__get__')(self, type(self))
        # 2. then the instance's own dict--returned as-is, no __get__ call
        if name in self.__dict__:
            return self.__dict__[name]
        # 3. then a non-data descriptor (or plain class attribute)
        if desc is not _missing:
            __get__ = find_special(desc, '__get__')
            if __get__:
                return __get__(self, type(self))
            return desc
        # 4. finally, fall back to __getattr__, if there is one
        __getattr__ = find_special(self, '__getattr__')
        if __getattr__:
            return __getattr__(name)
        raise AttributeError(name)
    

    I believe the point of that multi-stage lookup is to make it easy to shadow methods (which are non-data descriptors) but hard to accidentally shadow properties and similar things (which are data descriptors).
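
    You can see that shadowing rule directly (NonData and Data here are just illustrative descriptor classes):

    class NonData:
        def __get__(self, obj, owner):
            return "from the descriptor"

    class Data(NonData):
        def __set__(self, obj, value):
            raise AttributeError("read-only")

    class Spam:
        nondata = NonData()
        data = Data()

    s = Spam()
    s.__dict__['nondata'] = "from the instance"
    s.__dict__['data'] = "from the instance"
    print(s.nondata)    # from the instance  -- non-data descriptors can be shadowed
    print(s.data)       # from the descriptor -- data descriptors win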

    Notice that if the descriptor __get__ raises AttributeError, that will trigger the fallback to __getattr__ from the previous process.

    What about classes?


    If you look something up on a class, the class is the object, and the metaclass is the type. And if your metaclass doesn't override __getattribute__, type does (and it never raises or supers).

    The only difference between type.__getattribute__ and object.__getattribute__ is at the descriptor steps, it calls __get__(None, cls). So, everything works the same, except for the descriptor get bit. (If you work things through, you should see how this, e.g., lets @classmethods work the same whether called on an instance or on the class, and how it lets metaclass methods act like classmethods.)

    Other tricks


    What else is left?

    • __slots__ works by creating a hidden array, and a set of descriptors that read and write to that array, and preventing a __dict__ from being created on the instance.
    • Methods don't do anything special--spam.eggs() is the same as tmp = spam.eggs; tmp() (unlike many other OO languages). They work because they're stored as plain-old functions in the class dict, and plain-old functions are descriptors that build bound methods. See How methods work if that isn't clear, or the sketch just after this list.
    • Everything else--@property, @staticmethod, etc.--is just a descriptor; object.__getattribute__ (or type.__getattribute__) finds it on the class, calls its __get__ method, and gives you the result. (It may help to work through how @classmethod produces a bound class method whether looked up on the type Spam.eggs or on the instance spam.eggs.)
    • There are various C API mechanisms to expose struct members, get/set function pairs, etc. as if they were Python attributes. But it's pretty obvious how these work unless you want the gory details, in which case they're in the C API docs.
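
    Here's the method machinery from that second bullet, spelled out by hand:

    class Spam:
        def eggs(self):
            return 42

    s = Spam()
    m = s.eggs                                   # attribute lookup builds a bound method...
    assert m() == 42
    assert m.__func__ is Spam.__dict__['eggs']   # ...wrapping the plain function in the class dict
    assert m.__self__ is s

    # The same lookup, spelled out via the descriptor protocol:
    m2 = Spam.__dict__['eggs'].__get__(s, Spam)
    assert m2() == 42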

    What about super?


    It uses four different tricks to be that super, thanks for asking:
    • In every class definition, __class__ is a local referring to the class currently being defined.
    • A call to super() with no arguments gets filled in at runtime as if you'd written super(__class__, <first argument>). (The first local is the first parameter, so this handles self in normal methods, cls in classmethods and __new__ and normal metaclass methods, mcls in metaclass __new__ and metaclass classmethods, even if you decide to call them something confusing like s or this.)
    • Each method defined in a class gets __class__ as a free variable whenever it refers to super, not just whenever it directly refers to __class__.
    • super.__getattribute__ overrides the normal behavior from object to jump ahead on the MRO chain after the current __class__.

  14. In Python (I'm mostly talking about CPython here, but other implementations do similar things), when you write the following:
        def spam(x):
            return x+1
        spam(3)
    
    What happens?

    Really, it's not that complicated, but there's no documentation anywhere that puts it all together. Anyone who's tried hacking on the eval loop understands it, but explaining it to someone else is very difficult. In fact, the original version of this was some notes to myself, which I tried to show to someone who I'm pretty sure is at least as smart as me, and he ended up completely lost. So I reorganized it so that, at least hopefully, you can start each section and then skip the rest of the section when you get over your head (and maybe skip the last few sections entirely) and still learn something useful. If I've failed, please let me know.

    Compilation

    To make things simpler to explore, let's put all of that inside an outer function for the moment. This does change a few things (notably, spam becomes a local within that outer function rather than a global), but it means the code that creates and calls the function sticks around for us to look at. (Top-level module code is executed and discarded at module import time; anything you type at the REPL is compiled, executed, and discarded as soon as the statement is finished; scripts work similar to the REPL.)

    So, there are two top-level statements here.

    The def statement creates a function object out of a name and a body and then stores the result in a local variable corresponding to that name. In pseudocode:
        store_local("spam", make_function("spam", spam_body))
    The call expression statement takes a function object and a list of arguments (and an empty dict of keyword arguments) and calls the function (and then throws away the result). In pseudocode:
        call_function(load_local("spam"), [3], {})
    So, the only tricky bit here is, what is that spam_body? Well, the compiler has to work recursively: it sees a def statement, and it knows that's going to compile into a make_function that takes some kind of body-code object, so it compiles the body suite of the def into that object, then stashes the result as a constant. Just like the number 3 and the string "spam" are constants in the outer function, so is the code object spam_body.

    So, what does the body do? It's pretty simple:
        return(add(load_local("x"), 1))
    Notice that parameters like x are just local variables, the same as any you define inside the function; the only difference is that they get an initial value from the caller's arguments, as we'll see later.

    Obviously real Python bytecode is a bit different from this pseudocode, but we'll get back to that at the very end. There are a few easier questions to get through first, starting with: what kind of thing is a code object?

    Code objects

    You need more than just compiled bytecode to store and execute a function body, because the bytecode references things like constants, which have to go somewhere, and because that call_function is going to need some information about its parameters to know how to map the first argument 3 to the local variable x.

    The object with all this information is a code object. The inspect module docs give some detail of what's in a code object. But some of the key members are:
    • co_consts, a tuple of constant values. In spam this just has the number 1. In the top-level function, it has the number 3, the string "spam", and the code object for the spam body.
    • co_argcount and related values that specify the calling convention. For spam, there's an argcount of 1, meaning it's expecting 1 positional parameter, and also meaning its first 1 local variables in co_varnames are parameters. (This is why you can call it with spam(x=3) instead of spam(3).) The full details of how arguments get mapped to parameters is pretty complicated (and I think I've written a whole blog post about it).
    • co_code, which holds the actual compiled bytecode, which we'll get to later.
    There's also a bunch of stuff there that's only needed for tracebacks, reflection, and debugging, like the filename and line number the source code came from.

    So, the compiler, after recursively compiling the inner function body into bytecode, then builds the code object around it. You can even do this yourself in Python, although it's a bit of a pain—type help(types.CodeType) and you'll see the 15-parameter constructor. (Some of those parameters are related to closures, which I'll get to later.)
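
    You can poke at all of this interactively; for example, with the outer-function setup described above:

        import types

        def outer():
            def spam(x):
                return x + 1
            return spam(3)

        code = outer.__code__
        print(code.co_consts)    # includes the code object compiled from spam's body
        spam_code = next(c for c in code.co_consts
                         if isinstance(c, types.CodeType))
        print(spam_code.co_argcount, spam_code.co_varnames)    # 1 ('x',)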

    Function objects

    The make_function pseudocode above just takes a code object and a name and builds a function object out of it.

    Why do we need a separate function object? Why can't we just execute code objects? Well, we can (via exec). But function objects add a few things.

    First, function objects store an environment along with the code, which is how you get closures. If you have 16 checkboxes, and an on_check function that captures the checkbox as a nonlocal variable, you have 16 function objects, but there's no need for 16 separate code objects.

    In the next few sections, we'll ignore closures entirely, and come back to them later, because they make things more complicated (but more interesting).

    Function objects also store default values, which get used to fill in parameters with missing arguments at call time. The fact that these are created at def time is useful in multiple ways (although it also leads to the common unexpected-mutable-default bug).

    If you look at the inspect module, you'll see that the key attributes of a function are just the code object in __code__, the default values in __defaults__ and __kwdefaults__, and the closure environment in __closure__ (for nonlocals, which I'll explain later) and __globals__ (for globals—these aren't captured individually; instead, an entire globals environment is). And, of course, a bunch of stuff to aid tracebacks, debugging, reflection, and static type-checking.

    You can do the same thing as that make_function pseudocode instruction from Python—try help(types.FunctionType) to see the parameters to the constructor. And now you know enough to do some simple hackery, like turning spam from a function that adds 1 to a function that adds 2:
        import types

        c = spam.__code__
        consts = tuple(2 if const == 1 else const
                       for const in c.co_consts)
        nc = types.CodeType(
            c.co_argcount, c.co_kwonlyargcount, c.co_nlocals,
            c.co_stacksize, c.co_flags, c.co_code, consts,
            c.co_names, c.co_varnames, c.co_filename, c.co_name,
            c.co_firstlineno, c.co_lnotab, c.co_freevars,
            c.co_cellvars)
        spam = types.FunctionType(
            nc, spam.__globals__, spam.__name__,
            spam.__defaults__, spam.__closure__)
    
    There are a few limits, but most things you can imagine are doable, and work the way you'd expect. Of course if you want to get fancy, you should consider using a library like byteplay instead of doing it manually.

    Fast Locals

    In reality, Python doesn't do load_local("x"), looking up x in a dict and returning the value, except in special cases. (It does do that for globals, however.)

    At compile time, the compiler makes a list of all the locals (just x here) and stores their names in the code object's co_varnames, and then turns every load_local("x") into load_fast(0). Conceptually, this means to do a load_local(co_varnames[0]), but of course that would be slower, not faster. So what actually happens is that the local variables are stored in an array that gets created at the start of the function call, and load_fast(0) just reads from slot #0 of that array.

    Looking things up by name (a short string whose hash value is almost always cached) in a hash table is pretty fast, but looking things up by index in an array is even faster. This is why the def foo(*, min=min) microoptimization works—turning load_global("min") into load_local("min") might help a little (because the local environment is usually smaller, not to mention that builtins require an extra step over normal globals), but turning it into load_fast(0) helps a lot more.

    But it does mean that if you call locals(), Python has to build a dict on the fly—and that dict won't be updated with any further changes, nor will any changes you make to the dict have any effect on the actual locals. (Usually. If, say, you're executing top-level code, the locals and globals are the same environment, so changing locals does work.)

    That's also why you usually can't usefully exec('x = 2')—that again just creates a dict on the fly with a copy of the locals, then executes the assignment against that dict. (In Python 2, exec was a statement, and it would create a dict on the fly, execute the code, and then try to copy changed values back into the array. Sometimes that worked, but there were tons of edge cases. For example, if you never had a non-exec assignment to x, the compiler couldn't know to put x in co_varnames in the first place.)

    Finally, that means closures can't be implemented as just a stack of dicts (as in many Lisp implementations), but Python has a different optimization for them anyway, as we'll see later.

    Calling

    OK, so now we know everything about how the def statement works (except for the actual bytecode, which I'll get to later, and closure-related stuff, which I'll also get to later). What about the call expression?

    To call a function, all you need to supply is the function object and a list of positional arguments (and a dict of keyword arguments, and, depending on the Python version, *args and/or **kw may be part of the call rather than splatting them into the list/dict in-place… but let's keep it simple for now). What happens inside the interpreter when you do this?

    Well, first, calling a function actually just looks up and calls the function object's __call__ method. That's how you can call class objects, or write classes whose instances are callable. Near the end, we'll get into that. But here, we're talking about calling an actual function object, and the code for that works as follows.

    The first thing it needs to do is create a new stack frame to run the function's code in. As with codes and functions, you can see the members of a frame object in the inspect docs, and you can explore them in the interactive interpreter, and you can find the type as types.FrameType (although, unlike code and function objects, you can't construct one of these from Python). But the basic idea is pretty simple:

    A frame is a code object, an environment, an instruction pointer, and a back-pointer to the calling frame. The environment is a globals dict and a fast locals array, as described earlier. That's it.

    You may wonder how the compiler builds that list of co_varnames in the first place. While it's parsing your code, it can see all the assignments you make. The list of local variables is just the list of parameters, plus the list of names you assign to somewhere in the function body. Anything you access that isn't assigned to anywhere (ignoring the case of closures, which we'll get to later) is a global; it goes into co_names, and gets looked up as a global (or builtin) at runtime.

    To construct a frame, in pseudocode:
        code = func.__code__
        locals = [_unbound for _ in range(code.co_nlocals)]
        do_fancy_arg_to_param_mapping(locals, args, kwargs)
        frame = Frame()
        frame.f_back = current_frame
        frame.f_lasti = 0
        frame.f_code = code
        frame.f_locals = locals
    
    And then all the interpreter has to do is recursively call interpreter(frame), which runs along until the interpreter hits a return, at which point it just returns to the recursive caller.

    The interpreter doesn't actually need to be recursive; a function call could just be the same as any other instruction in the loop except that it sets current_frame = frame, and then returning would also be the same as any other instruction except that it sets current_frame = current_frame.f_back. That lets you do deep recursion in Python without piling up on the C stack. It makes it easier to just pass frame objects around and use them to represent coroutines, which is basically what Stackless Python is all about. But mainline Python can, and does, already handle the latter just by wrapping frames up in a very simple generator object. Again, see inspect to see what's in a generator, but it's basically just a frame and a couple extra pieces of information needed to handle yield from and exhausted generators.

    Notice the special _unbound value I used in the pseudocode. In fact, at the C level, this is just a null pointer, although it could just as easily be a real sentinel object. (It could even be exposed to the language, like JavaScript's undefined, although in JS, and most other such languages, that seems to cause a lot more problems than it solves.) If you try to access a local variable before you've assigned to it, the interpreter sees that it's unbound and raises an UnboundLocalError. (And if you del a local, it gets reset to _unbound, with the same effect.)
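
    Which is exactly the error you see here:

        def f():
            print(x)    # x is local to f (it's assigned below), but still unbound here
            x = 1

        try:
            f()
        except UnboundLocalError as e:
            print(e)    # complains that x was referenced before assignment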

    Defaults

    I glossed over the issues with default values above. Let's look at them now.

    Let's say we defined def spam(x=0):. What would that change?

    First, the body of spam doesn't change at all, so neither does its code object. It still just has an x parameter, which it expects to be filled in by the function-calling machinery in the interpreter, and it doesn't care how. You can dis it and explore its members and nothing has changed.
    Its function object does change, however—__defaults__ now has a value in it.

    If you look at the outer function, its code changes. It has to store the 0 as a constant, and then load that constant to pass to make_function. So the first line of pseudocode goes from this:
        store_local("spam", make_function("spam", spam_body))
    … to this:
        store_local("spam", 
            make_function("spam", spam_body, defaults=(0,)))
    Inside the interpreter's make_function implementation, it just stores that tuple in the created function's __defaults__ attribute.

    At call time, the function-calling machinery is a bit more complicated. In our case, we passed a value for x, so nothing much changes, but what if we didn't? Inside call_function, if the function expects 1 positional parameter, and we passed 0 positional arguments, and we didn't provide a keyword argument matching the name of the function's code's first positional parameter, then it uses the first value from the function's __defaults__ instead, and puts that in the frame's locals array the same way it would for a value we passed in explicitly.

    (If we hadn't set a default value, and then called without any arguments, it would try to use the first value from __defaults__, find there isn't one, and raise a TypeError to complain.)

    This explains why mutable default values work the way they do. At def time, the value gets constructed and stored in the function's __defaults__. Every time the function is called without that argument, nobody constructs a new value, it just copies in the same one from the same tuple every time.
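
    The classic demonstration (append_to is just a toy example):

        def append_to(item, seq=[]):
            seq.append(item)
            return seq

        print(append_to(1))                # [1]
        print(append_to(2))                # [1, 2] -- the same list, fetched from __defaults__
        print(append_to.__defaults__)      # ([1, 2],)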

    Closures

    As soon as you have nonlocal variables, things get more fun. So let's start with the most trivial example:
        def eggs():
            y = 1
            def spam(x):
                return x + y
            spam(3)
    
    Our inner function has a free variable, y, which it captures from the outer scope. Its code doesn't change much:
        return(add(load_local("x"),
                load_local_cell("y").cell_contents))
    But the outer function changes a bit more. In pseudocode:
        load_local_cell("y").set_cell_contents(1)
        store_local("spam", make_function("spam", spam_body, 
            closure=[load_local_cell("y")]))
        call_function(load_local("spam"), [], {})
    The first thing to notice is that y is no longer a normal local variable. In both functions, we're doing a different call, load_local_cell to look it up. And what we get is a cell object, which we have to look inside to get or set the actual value.

    Also notice that when we're calling make_function, we pass the cell itself, not its contents. This is how the inner function ends up with the same cell as the outer one. Which means if either function changes the cell's contents, the other one sees it.

    The only hard part is how the compiler knows that y is a cellvar (a local variable you share with a nested inner function) in eggs and a freevar in spam (a local variable that an outer function is sharing with you, or with a deeper nested function).

    Remember that the compiler scans the code for assignments to figure out what's local. If it finds something that isn't local, then it walks the scope outward to see if it's local to any containing function. If so, that variable becomes a cellvar in the outer function, and a freevar in the inner function (and any functions in between). If not, it becomes a global. (Unless there's an explicit nonlocal or global declaration, of course.) Then the compiler knows which code to emit (local, cell, or global operations) for each variable. Meanwhile, it stores the list of cellvars and freevars for each code object in co_cellvars and co_freevars. When compiling a def statement, the compiler also looks at the inner code object's co_freevars, inserts that closure=[load_local_cell("y")], and passes it along to the make_function.

    Inside call_function, if the code object has any cellvars, the interpreter creates an empty (_unbound) cell for each one. If it has any freevars, the interpreter copies the cells out of the function object's __closure__.

    And that's basically all there is to closures, except for the fast local optimization.

    For historical reasons, the way things get numbered is a little odd. The f_locals array holds the normal locals first, then the cellvars, then the freevars. But cellvars and freevars are numbered starting from the first cellvar, not from the start of the array. So if you have 3 normal locals, 2 cellvars, and 2 freevars, freevar #2 matches slot 0 in co_freevars, and slot 5 in f_locals. Confused? It's probably easier to understand in pseudocode than English. But first…

    For an additional optimization, instead of just one load_local_cell function, Python has a load_closure that just loads the cell, load_deref that loads the cell's cell_contents in a single step (without having to load the cell object itself onto the stack), and store_deref that stores into the cell's cell_contents in a single step.

    So, the pseudocode to construct a frame looks like this:
        code = func.__code__
        locals = [_unbound for _ in range(code.co_nlocals)]
        cells = [cell() for _ in range(len(code.co_cellvars))]
        do_fancy_arg_to_param_mapping(locals, args, kwargs)
        frame = Frame()
        frame.f_back = current_frame
        frame.f_lasti = 0
        frame.f_code = code
        frame.f_locals = locals + cells + list(func.__closure__)
    
    And the pseudocode for load_closure, load_deref, and store_deref, respectively:
        frame.f_locals[frame.f_code.co_nlocals + i]
        frame.f_locals[frame.f_code.co_nlocals + i].cell_contents
        frame.f_locals[frame.f_code.co_nlocals + i].set_cell_contents(value)
    
    These cells are real Python objects. You can look at a function's __closure__ list, pick out one of the cells, and see its cell_contents. (You can't modify it, however.)

    Example

    It may be worth working through an example that actually relies on closure cells changing:
    def counter():
        i = 0
        def count():
            nonlocal i
            i += 1
            return i
        return count
    count = counter()
    count()
    count()
    
    If you want to go through all the details, you can easily dis both functions (see the section on bytecode below), look at their attributes and their code objects' attributes, poke at the cell object, etc. But the short version is pretty simple.

    The inner count function has a freevar named i. This time, it's explicitly marked nonlocal (which is necessary if you want to rebind it). So, it's going to have i in its co_freevars, and some parent up the chain has to have a cellvar i or the compiler will reject the code; in our case, of course, counter has a local variable i that it can convert into a cellvar.

    So, count is just going to load_deref the freevar, increment it, store_deref it, load_deref it again, and return it.

    At the top level, when we call counter, the interpreter sets up a frame with no locals and one cellvar, so f_locals has one empty cell in it. The i = 0 does a store_deref to set the cell's value. The def does a load_closure to load the cell object, then passes it to make_function to make sure it ends up in the defined function's __closure__, and then it just returns that function.

    When we call the returned function, the interpreter sets up a frame with no locals and one freevar, and copies the first cell out of __closure__ into the freevar slot. So, when it runs, it updates the 0 in the cell to 1, and returns 1. When we call it again, same thing, so now it updates the 1 in the cell to 2, and returns 2.
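
    If you want to see the pieces directly, you can poke at them from Python (reusing the counter and count definitions above; the attribute names are the CPython ones described in this post):
    count2 = counter()
    print(counter.__code__.co_cellvars)         # ('i',) -- counter shares i with a nested function
    print(count2.__code__.co_freevars)          # ('i',) -- count borrows i from its outer function
    print(count2.__closure__[0].cell_contents)  # 0, the value stored by i = 0
    count2(); count2()
    print(count2.__closure__[0].cell_contents)  # 2 -- both calls updated the same cell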

    Other callable types

    As mentioned earlier, calling a function is really just looking up and calling the function's __call__ method. Of course if calling anything means looking up its __call__ method and calling that, things have to bottom out somewhere, or that would just be an infinite recursion. So how does anything work?

    First, if you call a special method on a builtin object, that also gets short-circuited into a C call. There's a list of slots that a builtin type can provide, corresponding to the special dunder methods in Python. If a type has a function pointer in its nb_add slot, and you call __add__ on an instance of that type, the interpreter doesn't have to look up __add__ in the dict, find a wrapper around a builtin function, and call it through that wrapper; it can just find the function pointer in the slot for the object's type and call it.

    One of those slots is tp_call, which is used for the __call__ method.

    Of course the function type defines tp_call with a C function that does the whole "match the args to params, set up the frame, and call the interpreter recursively on the frame" thing described earlier. (There's a bit of extra indirection so this can share code with eval and friends, and some optimized special cases, and so on, but this is the basic idea.)

    What if you write a class with a __call__ method and then call an instance of it? Well, spam.__call__ will be a bound method object, which is a simple builtin type that wraps up a self and a function. So, when you try to call that bound method, the interpreter looks for its __call__ by calling its tp_call slot, which just calls the underlying function with the self argument crammed in. Since that underlying function will be a normal Python function object (the one you defined with def __call__), its tp_call does the whole match-frame-eval thing, and your __call__ method's code gets run and does whatever it does.

    Finally, most builtin functions (like min) don't create a whole type with a tp_call slot, they're just instances of a shared builtin-function type that just holds a C function pointer along with a name, docstring, etc. so its tp_call just calls that C function. And similarly for methods of builtin types (the ones that aren't already taken care of by slots, and have to get looked up by dict). These builtin function and method implementations get a self (or module), list of args, and dict of kwargs, and have to use C API functions like PyArg_ParseTupleAndKeywords to do the equivalent of argument-to-parameter matching. (Some functions use argclinic to automate most of the annoying bits, and the goal is for most of the core and stdlib to do so.) Beyond that, the C code just does whatever it wants, returning any Python object at the end, and the interpreter then puts that return value on the stack, just as if it came from a normal Python function.

    You can see much of this from Python. For example, if f is a function, f.__call__ is a <method-wrapper '__call__' of function object>, f.__call__.__call__ is a <method-wrapper '__call__' of method-wrapper object>, and if you pile on more .__call__ you just get method-wrappers around method-wrappers with the same method. Similarly, min is a builtin function, min.__call__ is a <method-wrapper '__call__' of builtin_function_or_method object>, and beyond that it's method-wrappers around method-wrappers. But if you just call f or min, it doesn't generate all these wrappers; it just calls the C function that the first method-wrapper is wrapping.
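
    Here's that exploration as actual code (the exact reprs and addresses vary a bit by CPython version):
    def f(x):
        return x + 1

    print(f.__call__)              # <method-wrapper '__call__' of function object at 0x...>
    print(f.__call__.__call__)     # <method-wrapper '__call__' of method-wrapper object at 0x...>
    print(min.__call__)            # <method-wrapper '__call__' of builtin_function_or_method object at 0x...>
    print(f(3), f.__call__(3))     # 4 4 -- calling f directly doesn't build any of these wrappers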

    Actual bytecode

    Real Python bytecode is the machine language for a stack machine with no registers (except for an instruction pointer and frame pointer). Does that sound scary? Well, that's the reason I've avoided it so far—I think there are plenty of people who could understand all the details of how function calls work, including closures, but would be scared off by bytecode.

    But the reality is, it's a very high-level stack machine, and if you've made it this far, you can probably get through the rest. In fact, I'm going to go out of my way to scare you more at the start, and you'll get through that just fine.

    Let's go back to our original trivial example:
        def spam(x):
            return x+1
        spam(3)
    
    Don't put this inside a function, just paste it at the top level of your REPL. That'll get us a global function named spam, and we can look at what's in spam.__code__.co_code:
        b'|\x00\x00d\x01\x00\x17S'
    Well, that's a bit ugly. We can make it a little nicer by mapping from bytes to hex:
        7c 00 00 64 01 00 17 53
    But what does that mess mean?

    Bytecode is just a sequence of instructions. Each instruction has a 1-byte number; if it's at least 90 (dis.HAVE_ARGUMENT), it's followed by a 2-byte (little-endian) operand value. So, we can look up 0x7c in a table and see that it's LOAD_FAST, and the 2-byte operand is just 0, just like our pseudocode load_fast(0). So, this is taking the frame's f_locals[0] (which we know is x, because co_varnames[0] is 'x'), and pushing it on the stack.
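
    If you want to do that first lookup yourself, the dis module's opname table makes it easy (this assumes the pre-3.6 bytecode layout described here, and the spam function from above; on 3.6+ every instruction is two bytes, so the decoding differs):
        import dis

        code = spam.__code__
        op = code.co_code[0]                             # 0x7c
        arg = code.co_code[1] | (code.co_code[2] << 8)   # little-endian 2-byte operand
        print(dis.opname[op], arg)                       # LOAD_FAST 0
        print(code.co_varnames[arg])                     # x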

    Fortunately, we don't have to do all this work; the dis module does it for us. Just call dis.dis(spam) and you get this:
         0 LOAD_FAST 0 (x)
         3 LOAD_CONST 1 (1)
         6 BINARY_ADD
         7 RETURN_VALUE
    
    The dis docs also explain what each op actually does, so we can figure out how this function works: It pushes local #0 (the value of x) onto the stack, then pushes constant #1 (the number 1) onto the stack. Then it pops the top two values, adds them (which in general is pretty complicated—it needs to do the whole deal with looking up __add__ and __radd__ methods on both objects and deciding which one to try, as explained in the data model docs), and puts the result on the stack. Then it returns the top stack value.

    Brief aside: If operands are only 16 bits, what if you needed to look up the 65536th constant? Or, slightly more plausibly, you needed to jump to instruction #65536? That won't fit in a 16-bit operand. So there's a special opcode EXTENDED_ARG (with number 0x90) that sets its 16-bit operand as an extra (high) 16 bits for the next opcode. So to load the 65536th constant, you'd do EXTENDED_ARG 1 followed by LOAD_CONST 0, and this means LOAD_CONST 65536.

    Anyway, now let's add the outer function back in, and compare the pseudocode to the bytecode:
        def outer():
            def spam(x):
                return x+1
            spam(3)
    
    In pseudocode:
        store_fast(0, make_function("spam", spam_body))
        call_function(load_fast(0), [3], {})
    In real bytecode:
         0 LOAD_CONST 1 (<code object spam at 0x12345678>)
         3 LOAD_CONST 2 ('spam')
         6 MAKE_FUNCTION 0
         9 STORE_FAST 0 (spam)
        12 LOAD_FAST 0 (spam)
        15 LOAD_CONST 3 (3)
        18 CALL_FUNCTION 1
        21 POP_TOP
        22 LOAD_CONST 0 (None)
        25 RETURN_VALUE
    
    So, 0-9 map to the first line of pseudocode: push constant #1, the code object (what we called spam_body), and constant #2, the name, onto the stack. Make a function out of them, and store the result in local #0, the spam variable.

    MAKE_FUNCTION can get pretty complicated, and it tends to change pretty often from one Python version to another, so read the dis docs for your version. Fortunately, when you have no default values, annotations, closures, etc.--so the pseudocode is just make_function(name, code)--you just push the code and the name and do MAKE_FUNCTION 0.

    Lines 12-18 map to the second line of pseudocode: push local #0 (spam) and constant #3 (the number 3) onto the stack, and call a function.

    Again, CALL_FUNCTION can get pretty complicated, and change from version to version, but in our case it's dead simple: we're passing nothing but one positional argument, so we just put the function and the argument on the stack and do CALL_FUNCTION 1.

    The interpreter's function-call machinery then has to create the frame out of the function object and its code object, pop the appropriate values off the stack, figure out which parameter to match our argument to and copy the value into the appropriate locals array slot, and recursively interpret the frame. We saw above how that function runs. When it returns, that leaves a return value on the stack.

    Line 21 just throws away the top value on the stack, since we don't want to do anything with the return value from spam.

    Lines 22-25 don't map to anything in our pseudocode, or in our source code. Remember that in Python, if a function falls off the end without returning anything, it actually returns None. Maybe this could be handled magically by the function-call machinery, but it's not; instead, the compiler stores None in the code object's constants, then adds explicit bytecode to push that constant on the stack and return it.

    By the way, you may have noticed that the bytecode does some silly things like storing a value into slot 0 just to load the same value from slot 0 and then never use it again. (You may have noticed that some of my pseudocode does the same thing.) Of course it would be simpler and faster to just not do that, but Python's limited peephole optimizer can't be sure that we're never going to load from slot 0 anywhere else. It could still dup the value before storing so it doesn't need to reload, but nobody's bothered to implement that. There have been more detailed bytecode optimizer projects, but none of them have gotten very far—probably because if you're serious about optimizing Python, you probably want to do something much more drastic—see PyPy, Unladen Swallow, ShedSkin, etc., which all make it so we rarely or never have to interpret this high-level bytecode instruction by instruction in the first place.

    MAKE_FUNCTION and CALL_FUNCTION

    As mentioned above, these two ops can get pretty complicated. As also mentioned, they're two of the most unstable ops, changing from version to version because of new features or optimizations. (Optimizing function calls is pretty important to overall Python performance.) So, if you want to know how they work, you definitely need to read the dis docs for your particular Python version. But if you want an example (for pre-3.5), here goes.

    Take this pseudocode:
        make_function("spam", spam_body,
            defaults=(1,), kw_defaults={'x': 2},
            closure=(load_closure(0),))
    
    The bytecode is:
         0 LOAD_CONST 1 (1)
         3 LOAD_CONST 2 ('x')
         6 LOAD_CONST 3 (2)
         9 LOAD_CLOSURE 0 (y)
        12 BUILD_TUPLE 1
        15 LOAD_CONST 4 (<code object spam at 0x12345678>)
        18 LOAD_CONST 5 ('spam')
        21 MAKE_CLOSURE 257
    
    Yuck. Notice that we're doing MAKE_CLOSURE rather than MAKE_FUNCTION, because, in addition to passing a name and code, we're also passing a closure (a tuple of cells). And then we're passing 257 as the operand instead of 0. This breaks down into 1 | (1<<8) | (0<<16), meaning 1 positional default, 1 keyword default, and 0 annotations, respectively. And of course we have to push all that stuff on the stack in appropriate format and order.

    If we'd had an annotation on that x parameter, that would change the operand to 1 | (1<<8) | (1<<16), meaning we'd need EXTENDED_ARG, and pushing the annotations is a few more lines, but that's really about as complicated as it ever gets.

    More info

    If you're still reading, you probably want to know where to look for more detail. Besides the inspect and dis docs, there's the C API documentation for Object Protocol (PyObject_Call and friends) and Function and Code concrete objects, PyCFunction, and maybe Type objects. Then there's the implementations of all those types in the Objects directory in the source. And of course the main interpreter loop in Python/ceval.c.

  15. I've seen a number of people ask why, if you can have arbitrary-sized integers that do everything exactly, you can't do the same thing with floats, avoiding all the rounding problems that they keep running into.

    If we compare the two, it should become pretty obvious at some point within the comparison.

    Integers


    Let's start with two integers, a=42 and b=2323, and add them. How many digits do I need? Think about how you add numbers: you line up the columns, and at worst carry one extra column. So, the answer can be as long as the bigger one, plus one more digit for carry. In other words, max(len(a), len(b)) + 1.

    What if I multiply them? Again, think about how you multiply numbers: you line up the smaller number with each column of the larger number and add up the shifted partial products. So, the answer can be as long as the sum of the two lengths. In other words, len(a) + len(b).

    What if I exponentiate them? Here, things get a bit tricky, but there's still a well-defined, easy-to-compute answer if you think about it: asking how many digits are in a**b is just asking for the smallest x such that a**b < 10**x. So, take log10 of both sides, and x is log10(a**b) = log10(a) * b, rounded up. Plug in your values, and it's log10(42) * 2323 ~= 3770.808, which rounds up to 3771. Try len(str(42**2323)) and you'll get 3771.
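
    A quick sanity check of that digit arithmetic in Python:
    import math

    a, b = 42, 2323
    print(max(len(str(a)), len(str(b))) + 1)  # 5 -- enough digits for a + b
    print(len(str(a)) + len(str(b)))          # 6 -- enough digits for a * b
    print(math.ceil(b * math.log10(a)))       # 3771 -- digits in a ** b
    print(len(str(a ** b)))                   # 3771, matching the estimate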

    You can come up with other fancy operations to apply to integers--factorials, gcds, whatever--but the number of digits required for the answer is always a simple, easy-to-compute function of the number of digits in the operands.*

    * Except when the answer is infinite, of course. In that case, you easily compute that the answer can't be stored in any finite number of digits and use that fact appropriately--raise an exception, return a special infinity value, whatever.

    Decimals


    Now, let's start with two decimals, a=42 and b=.2323, and add them. How many digits do I need? Well, how many digits do the originals have? It kind of depends on how you count. But the naive way of counting says 2 and 4, and the result, 42.2323, has 6 digits. As you'd suspect, len(a) + len(b) + 1 is the answer here.

    What if I multiply them? At first glance, it seems like it should be easy--our example gives us 9.7566, which has 5 digits; multiplying a by itself is the same as integers, and b by itself for 0.05396329 is just adding 4 decimal digits to 4 decimal digits, so it's still len(a) + len(b) + 1.

    What if I exponentiate them? Well, now things get not tricky, but impossible. 42**.2323 is an irrational number. That means it takes an infinite number of digits (in binary, or decimal, or any other integer base) to store. (It also takes an infinite amount of time to compute, unless you have an infinite-sized lookup table to help you.) In fact, most fractional powers of most numbers are irrational--2**0.5, the square root of 2, is the most famous irrational number.

    And it's not just exponentiation; most of the things you want to do with real numbers--take the sine, multiply by pi, etc.--give you irrational answers. Unless you stick to nothing but addition, subtraction, multiplication, and division, you can't have exact math.

    Even if all you want is addition, subtraction, multiplication, and division: a=1, b=3. How many digits do I need to divide them? Start doing some long division: 1 is smaller than 3, so that's 0. 10 has three 3s in it, so that's 0.3, with 1 left over. 10 has three 3s in it, so that's 0.3 with 1 left over. That's obviously going to continue on forever: there is no way to represent 1 / 3 in decimal without an infinite number of digits. Of course you could switch bases. For example, in base 9, 1 / 3 is 0.3. But then you need infinite digits for all kinds of things that are simple in base 10.

    Fractions


    If all you actually want is addition, subtraction, multiplication, and division, you're dealing with fractions, not decimals. Python's fractions.Fraction type does all of these operations with infinite precision. Of course when you go to print out the results as decimals, they may have to get truncated (otherwise, 1/3 or 1/7 would take forever to print), but that's the only limitation.

    Of course if you try to throw exponentiation or sine at a Fraction, or multiply it by a float, you lose that exactness and just end up with a float.
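
    Here's what both halves of that look like with the fractions module:
    from fractions import Fraction

    third = Fraction(1, 3)
    print(third + Fraction(1, 6))    # 1/2, exact
    print(third * 3)                 # 1, exact
    print(float(third))              # 0.3333333333333333 -- truncated only when you ask for a float
    print(third ** Fraction(1, 2))   # ~0.577 -- a non-integral power falls back to a plain float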

    Aren't decimals just a kind of fraction, where the denominator is 10**d, where d is the number of digits after the decimal point? Yes, they are. But as soon as you, say, divide by 3, no power of 10 will work as the denominator--as we saw above, the decimal expansion never ends--so that doesn't do you any good. If you need exact rational arithmetic, you need fractions with arbitrary denominators.

    Accuracy


    In real life, very few values are exact in the first place.* Your table isn't exactly 2 meters long, it's 2.00 +/- 0.005 meters.** Doing "exact" math on that 2 isn't going to do you any good. Doing error-propagating math on that 2.00, however, might.

    Also, notice that a bigger number isn't necessarily more accurate than a smaller one (in fact, usually the opposite), but the simple decimal notation means it has more precision: 1300000000 has 10 digits in it, and if we want to let people know that only the first 3 are accurate, we have to write something like 1300000000 +/- 5000000. And even with commas, like 1,300,000,000 +/- 5,000,000, it's still pretty hard to see how many digits are accurate. In words, we solve that by decoupling the precision from the magnitude: 1300 million, plus or minus 5 million, puts most of the magnitude into the word "million", and lets us see the precision reasonably clearly in "1300 +/- 5". Of course at 13 billion plus or minus 5 million it falls down a bit, but it's still better than staring at the decimal representation and counting up commas and zeroes.

    Scientific notation is an even better way of decoupling the precision from the magnitude. 1.3*10^10 +/- 5*10^6 obviously has magnitude around 10^10, and precision of 3-4 digits.*** And going to 1.3*10^11 +/- 5*10^6 is just as readable. And floating-point numbers give us the same benefit.

    In fact, when the measurement or rounding error is exactly half a digit, it gets even simpler: just write 1.30*10^10, and it's clear that we have 3 digits of precision, and the same for 1.30*10^11. And, while the float type doesn't give us this simplification, the decimal.Decimal type does. In addition to being a decimal fraction rather than a binary fraction, so you can think in powers of 10 instead of 2, it also lets you store 1.3e10 and 1.30e10 differently, to directly keep track of how much precision you want to store. It can also give you the most digits you can get out of the operation when possible--so 2*2 is 4, but 2.00*2.00 is 4.0000. That's almost always more than you want (depending on why you were multiplying 2.00 by 2.00, you probably want either 4.0 or 4.00), but you can keep the 4.0000 around as an intermediate value, which guarantees that you aren't adding any further rounding error from intermediate storage. When you perform an operation that doesn't allow that, like 2.00 ** 0.5, you have to work out for yourself how much precision you want to carry around in the intermediate value, which means you need to know how to do error propagation--but if you can work it out, decimal can let you store it.
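
    A few of those behaviors, straight from the decimal module:
    from decimal import Decimal, getcontext

    print(Decimal('1.3E+10'))                  # 1.3E+10
    print(Decimal('1.30E+10'))                 # 1.30E+10 -- the extra digit of precision is preserved
    print(Decimal('2.00') * Decimal('2.00'))   # 4.0000
    getcontext().prec = 4                      # for operations like sqrt, you pick the precision
    print(Decimal('2.00').sqrt())              # 1.414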

    * Actually, there are values that can be defined exactly: the counting numbers, e, pi, etc. But notice that most of the ones that aren't integers are irrational, so that doesn't help us here. But look at the symbols section for more...

    ** If you're going to suggest that maybe it's exactly 2.0001790308123812082 meters long: which molecule is the last molecule of the table? How do you account for the fact that even within a solid, molecules move around slowly? And what's the edge of a molecule? And, given that molecules' arms vibrate, the edge at what point in time? And how do you even pick a specific time that's exactly the same across the entire table, when relativity makes that impossible? And, even if you could pick a specific molecule at a specific time, its edge is a fuzzy cloud of electron position potential that fades out to 0.

    *** The powers are 10 and 6, so it's at worst off by 4 digits. But the smaller one has a 5, while the bigger one has a 1, so it's obviously a lot less than 4 digits. To work out exactly how many digits it's off, do the logarithm-and-round trick again.

    Money


    Some values inherently have a precision cutoff. For example, with money, you can't have less than one cent.* In other words, they're fixed-point, rather than floating-point, values.

    The decimal module can handle these for you as well. In fact, money is a major reason there's a decimal standard, and implementations of that standard in many languages' standard libraries.***

    * Yes, American gas stations give prices in tenths-of-a-cent per gallon, and banks transact money in fractional cents, but unless you want to end up in Superman 3,** you can ignore that.

    ** And yes, I referenced Superman 3 instead of Office Space. If you're working in software in 2015 and haven't seen Office Space, I don't know what I could say that can help.

    *** For some reason, people are willing to invest money in solving problems that help deal with money.
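
    To make the fixed-point part concrete (the price, tax rate, and rounding rule here are just placeholders), decimal lets you compute with extra precision and then quantize back to whole cents:
    from decimal import Decimal, ROUND_HALF_UP

    price = Decimal('19.99')
    tax = price * Decimal('0.0825')   # 1.649175 -- more precision than you can actually charge
    total = (price + tax).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
    print(total)                      # 21.64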

    Symbols


    So, how do mathematicians deal with all of this in real life? They don't. They do math symbolically, rather than numerically. The square root of 2 is just the square root of 2. And you carry it around that way throughout the entire operation. Multiply 3 * sqrt(2) and the answer is 3 * sqrt(2). But multiply sqrt(2) * sqrt(2) and you get 2, and multiply sqrt(2) * sqrt(3) and you get sqrt(6), and so on. There are simplification rules that give you exact equivalents, and you can apply these as you go along, and/or at the end, to try to get things as simple as possible. But in the end, the answer may end up being irrational, and you're just stuck with sqrt(6).

    Sometimes you need a rough idea of how big that sqrt(6) is. When that happens, you know how rough you want it, and you can calculate it to that precision. To three digits, more than enough to give you a sense of scale, it's 2.45. If you need a pixel-precise graph, you can calculate it to +/- half a pixel. But the actual answer is sqrt(6), and that's what you're going to keep around (and use for further calculation).

    In fact, let's think about that graph in more detail. For something simple and concrete, let's say you're graphing radii vs. circumferences of circles, measured in inches, on a 1:1 scale, to display on a 96 pixels-per-inch screen. So, a circle with radius 3" has a circumference of 18.850" +/- half a pixel. Or, if you prefer, 1810 pixels. But now, let's say your graph is interactive, and the user can zoom in on it. If you just scale that 1810 pixels up at 10:1, you get 18100 pixels. But if you stored 6*pi and recalculate it at the new zoom level, you get 18096 pixels. A difference of 4 pixels may not sound like much, but it's enough to make things look blocky and jagged. Zoom in too much more, and you're looking at the equivalent of face-censored video from Cops.

    Python doesn't have anything built-in for symbolic math, but there are some great third-party libraries like SymPy that you can use.
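
    For example, with SymPy, the exact value sticks around until you explicitly ask for digits:
    from sympy import sqrt, pi

    print(sqrt(2) * sqrt(3))    # sqrt(6), kept exact
    print(sqrt(6).evalf(3))     # 2.45, computed only to the precision you asked for
    print((6 * pi).evalf(10))   # 18.84955592 -- recompute at whatever precision each zoom level needs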

  16. In a recent thread on python-ideas, Stephan Sahm suggested, in effect, changing the method resolution order (MRO) from C3-linearization to a simple depth-first search a la old-school Python or C++. I don't think he realized that's what he was proposing at first, but the key idea is that he wants Mixin to override A in B as well as overriding object in A in this code:

    class Mixin: pass
    class A(Mixin): pass
    class B(Mixin, A): pass
    

    In other words, the MRO should be B-Mixin-A-Mixin-object. (Why not B-Mixin-object-A-Mixin-object? I think he just didn't think that through, but let's not worry about that.) After all, why would he put Mixin before A if he didn't want it to override A in B? And why would he attach Mixin to A if he didn't want it to override object in A?

    Well, that doesn't actually work. The whole point of linearization is that each class appears only once in the MRO, and many features of Python--including super, which he wanted to make extensive use of--depend on that. For example, with his MRO, inside Mixin.spam, super().spam() is going to call A.spam, and its super().spam() is going to call Mixin.spam(), and you've obviously got a RecursionError on your hands.

    I think ultimately, his problem is that what he wants isn't really a mixin class (in typical Python terms--in general OO programming terms, it's one of the most overloaded words you can find...). For example, a wrapper class factory could do exactly what he wants, like this:

    def Wrapper(cls):
        # sketch: build and return a new class that layers the wrapper behavior on cls
        return type('Wrapped' + cls.__name__, (cls,), {})
    class A(Wrapper(object)): pass
    class B(Wrapper(A)): pass
    

    And there are other ways to get where he wants.

    Anyway, changing the default MRO in Python this way is a non-starter. But if he wants to make that change manually, how hard is it? And could he build a function that lets his classes cooperate using that function instead of super?

    Customizing MRO


    The first step is to build the custom MRO. This is pretty easy. He wants a depth-first search of all bases, so:

    [cls] + list(itertools.chain.from_iterable(base.__mro__ for base in cls.__bases__))
    

    Or, if leaving the extra object out was intentional, that's almost as easy:

    [cls] + list(itertools.chain.from_iterable(base.__mro__[:-1] for base in cls.__bases__)) + [object]
    

    But now, how do we get that into the class's __mro__ attribute?

    It's a read-only property; you can't just set it. And, even if you could, you need type.__new__ to actually return something for you to modify--but if you give it a non-linearizable inheritance tree, it'll raise an exception. And finally, even if you could get it set the way you want, every time __bases__ is changed, __mro__ is automatically rebuilt.

    So, we need to hook the way type builds __mro__.

    I'm not sure if this is anywhere in the reference documentation or not, but the answer is pretty easy: the way type builds __mro__ is by calling its mro method. This is treated as a special method (that part definitely isn't documented anywhere), meaning it's looked up on the class (that is, the metaclass of the class being built) rather than the instance (the class being built), doesn't go through __getattribute__, etc., so we have to build a metaclass.

    But once you know that, it's all trivial:

    class MyMRO(type): 
        def mro(cls): 
            return ([cls] + 
                    list(itertools.chain.from_iterable(base.__mro__[:-1] for base in cls.__bases__)) +
                    [object])
    
    class Mixin(metaclass=MyMRO): pass
    class A(Mixin): pass
    class B(Mixin, A): pass
    

    And now, B.__mro__ is B-Mixin-A-Mixin-object, exactly as desired.

    For normal method calls, this does what the OP wanted: Mixin gets to override A.

    But, as mentioned earlier, it obviously won't enable the kind of super he wants, and there's no way it could. So, we'll have to build our own replacement.

    Bizarro Super


    If you want to learn how super works, I think the documentation in Invoking Descriptors is complete, but maybe a little terse to serve as a tutorial. I know there's a great tutorial out there, but I don't have a link, so... google it.

    Anyway, how super works isn't important; what's important is to define what we want here. Once we actually know exactly what we want, anything is possible as long as you believe--that's what science is all about.

    Since we're defining something very different from super but still sort of similar, the obvious name is bizarro.

    Now, we want a call to bizarro().spam() inside B.spam to call Mixin.spam, a call inside Mixin.spam (the first time through) to call A.spam, a call inside A.spam to call Mixin.spam again, and a call inside Mixin.spam (the second time through) to call object.spam.

    The first problem is that calling object.spam is just going to raise an AttributeError. Multiple inheritance uses of super are all about cooperative class hierarchies, and part of that cooperation is usually that the root of your tree knows not to call super. But here, Mixin is the root of our tree, but it also appears in other places on the tree, so that isn't going to work.

    Well, since we're designing our own super replacement, there's no reason it can't also cooperate with the classes, instead of trying to be fully general. Just make it return a do-nothing function if the next class is object, or if the next class doesn't have the method, or if the next class has a different metaclass, etc. Pick whatever rule makes sense for your specific use case. Since I don't have a specific use case, and don't know what the OP's was (he wanted to create a "Liftable" mixin that helps convert instances of a base class into instances of a subclass, but he didn't explain how he wanted all of the edge cases to work, and didn't explain enough about why he wanted such a thing for me to try to guess on his behalf), I'll go with the "doesn't have the method".

    While we're at it, we can skip over any non-cooperating classes that end up in the MRO (which would obviously be important if we didn't block object from appearing multiple times--but even with the MRO rule above, you'll have the same problem if your root doesn't directly inherit object).

    The next problem--the one that's at the root of everything we're trying to work around here--is that we want two different things to happen "inside Mixin.spam", depending on whether it's the first time we're inside or the second. How are we going to do that?

    Well, obviously, we need to keep track of whether it's the first time or the second time. One obvious way is to keep track of the index, so it's not A.spam if we're in Mixin.spam or object.spam if we're in Mixin.spam, it's B.__mro__[2] if we're in B.__mro__[1], and B.__mro__[4] if we're in B.__mro__[3]. (After first coding this up, I realized that an iterator might be a lot nicer than an index, so if you actually need to implement this yourself, try it that way. But I don't want to change everything right now.)

    How can we keep track of anything? Make the classes cooperate. Part of the protocol for calling bizarro is that you take a bizarro_index parameter and pass it into the bizarro call. Let's make it an optional parameter with a default value of 0, so your users don't have to worry about it, and make it keyword-only, so it doesn't interfere with *args or anything. So:

    class Mixin(metaclass=MyMRO):
        def doit(self, *, bizarro_index=0):
            print('Mixin')
            bizarro(Mixin, self, bizarro_index).doit()
    
    class A(Mixin):
        def doit(self, *, bizarro_index=0):
            print('A')
            bizarro(A, self, bizarro_index).doit()
    
    class B(Mixin, A):
        def doit(self, *, bizarro_index=0):
            print('B')
            bizarro(B, self, bizarro_index).doit()
    

    And now, we just have to write bizarro.

    The key to writing something like super is that it returns a proxy object whose __getattribute__ looks in the next class on the MRO. If you found that nice tutorial on how super works, you can start with the code from there. We then have to make some changes:

    1. The way we pick the next class has to be based on the index.
    2. Our proxy has to wrap the function up to pass the index along.
    3. Whatever logic we wanted for dealing with non-cooperating classes has to go in there somewhere.

    Nothing particularly hard. So:

    def bizarro(cls, inst, idx): 
        class proxy: 
            def __getattribute__(self, name): 
                for superidx, supercls in enumerate(type(inst).__mro__[idx+1:], idx+1): 
                    try:
                        method = getattr(supercls, name).__get__(inst) 
                    except AttributeError: 
                        continue 
                    if not callable(method):
                        return method # so bizarro works for data attributes
                    @functools.wraps(method) 
                    def wrapper(*args, **kw):
                        return method(*args, bizarro_index=superidx, **kw)
                    return wrapper 
                return lambda *args, **kw: None 
        return proxy() 
    

    And now, we're done.

    Bizarro am very beautiful


    In Python 3, super(Mixin, self) was turned into super(). This uses a bit of magic, and we can use the same magic here.

    Every method has a cell named __class__ that tells you which class it's defined in (strictly speaking, only methods that mention super or __class__ get that cell, but close enough). And every method takes its self as the first parameter. So, if we just peek into the caller's frame, we can get those easily. And while we're peeking into the frames, since we know the index has to be the bizarro_index parameter to any function that's going to participate in bizarro super-ing, we can grab that too:

    def bizarro():
        f = sys._getframe(1)
        cls = f.f_locals['__class__']
        inst = f.f_locals[f.f_code.co_varnames[0]]
        idx = f.f_locals['bizarro_index']
        # everything else is the same as above
    

    This is cheating a bit; if you read PEP 3135, the super function doesn't actually peek into the parent frame; instead, the parser recognizes calls to super() and changes them to pass the two values. I'm not sure that's actually less hacky--but it is certainly more portable, because other Python implementations aren't required to provide CPython-style frames and code objects. Also leaving the magic up to the parser means that, e.g., PyPy can still apply its no-frames-unless-needed optimization, trading a tiny bit of compile-time work for a small savings in every call.

    If you want to do the same here, you can write an import hook that AST-transforms bizarro calls in the same way. But I'm going to stick with the frame hack.

    Either way, now you can write this:

    class Mixin(metaclass=MyMRO):
        def doit(self, *, bizarro_index=0):
            print('Mixin')
            bizarro().doit()
    
    class A(Mixin):
        def doit(self, *, bizarro_index=0):
            print('A')
            bizarro().doit()
    
    class B(Mixin, A):
        def doit(self, *, bizarro_index=0):
            print('B')
            bizarro().doit()
    

    Meanwhile, notice that we don't actually use cls anywhere anyway, so... half a hack is only 90% as bad, right?

    But still, that bizarro_index=0 bit. All that typing. All that reading. There's gotta be a better way.

    Well, now you can!

    We've already got a metaclass. We're already peeking under the covers. We're already wrapping functions. So, let's have our metaclass peek under the covers of all of our functions and automatically wrap anything that uses bizarro to take that bizarro_index parameter. The only problem is that the value will now be in the calling frame's parent frame (that is, the wrapper), but that's easy to fix too: just look in f_back.f_locals instead of f_locals.

    import functools
    import itertools
    import sys
    
    class BizarroMeta(type):
        def mro(cls):
            return ([cls] +
                    list(itertools.chain.from_iterable(base.__mro__[:-1] for base in cls.__bases__)) +
                    [object])
        def __new__(mcls, name, bases, attrs):
            def _fix(attr):
                if callable(attr) and 'bizarro' in attr.__code__.co_names:
                    @functools.wraps(attr)
                    def wrapper(*args, bizarro_index=0, **kw):
                        return attr(*args, **kw)
                    return wrapper
                return attr
            attrs = {k: _fix(v) for k, v in attrs.items()}
            return super().__new__(mcls, name, bases, attrs)
    
    def bizarro():
        f = sys._getframe(1)
        inst = f.f_locals[f.f_code.co_varnames[0]]
        idx = f.f_back.f_locals['bizarro_index']
        class proxy: 
            def __getattribute__(self, name): 
                for superidx, supercls in enumerate(type(inst).__mro__[idx+1:], idx+1):
                    try:
                        method = getattr(supercls, name).__get__(inst)
                    except AttributeError: 
                        continue 
                    if not callable(method):
                        return method # so bizarro works for data attributes
                    @functools.wraps(method) 
                    def wrapper(*args, **kw): 
                        return method(*args, bizarro_index=superidx, **kw)
                    return wrapper 
                return lambda *args, **kw: None 
        return proxy() 
    
    class Mixin(metaclass=BizarroMeta):
        def doit(self):
            print('Mixin')
            bizarro().doit()
    
    class A(Mixin):
        def doit(self):
            print('A')
            bizarro().doit()
    
    class B(Mixin, A):
        def doit(self):
            print('B')
            bizarro().doit()
    
    B().doit()
    

    Run this, and it'll print B, then Mixin, then A, then Mixin.

    Unless I made a minor typo somewhere, in which case it'll probably crash in some way you can't possibly debug. So you'll probably want to add a bit of error handling in various places. For example, it's perfectly legal for something to be callable but not have a __code__ member--a class, a C extension function, an instance of a class with a custom __call__ method... Whether you want to warn that you can't tell whether Spam.eggs uses bizarro or not because you can't find the code, assume it doesn't and skip it, assume it does and raise a readable exception, or something else, I don't know, but you probably don't want to raise an exception saying that type objects don't have __code__ attributes, or whatever comes out of this mess by default.

    Anyway, the implementation is pretty small, and not that complicated once you understand all the things we're dealing with, and the API for using it is about as nice as you could want.

    I still don't know why you'd ever want to do this, but if you do, go for it.

  17. Note: This post doesn't talk about Python that much, except as a point of comparison for JavaScript.

    Most object-oriented languages out there, including Python, are class-based. But JavaScript is instead prototype-based. Over the years, this has led to a lot of confusion, and even more attempts to resolve that confusion, either by faking classes, or by explaining why prototypes are good. Unfortunately, most of the latter attempts are completely wrong, and only add to the confusion.

    A mess?

    I'm going to pick on Kyle Simpson's JS Objects: Inherited a Mess (and its two followups)—not because it's particularly bad, but for the opposite reason: it's one of the best-written attempts at clearing things up (and it's frequently linked to).

    Simpson's main point is that, ultimately, prototype inheritance is more like delegation than like inheritance, because the inheritance "arrows" go in the opposite direction.

    Let's take a language like C++, and write a class Foo, a subclass Bar, and a Bar instance bar1. There's a superclass-to-subclass arrow pointing from Foo to Bar, which represents Bar's vtable being made by copying Foo's, and a class-to-instance arrow pointing from Bar to bar1, representing bar1's vptr being copied from Bar.

    In JavaScript, on the other hand, there's no such copying; instead, there are live links. Methods are looked up on the object; if not found, the link is followed to look up the method on its prototype, and so on up the chain. So, he draws an arrow from bar1 to Bar.prototype, representing its prototype link, and an arrow from Bar.prototype to Foo.prototype, representing the same thing.

    But he's drawing entirely the wrong distinction. When you notice that C++, Java, and C# all do things one way, but JavaScript does things differently, it's a decent guess that the difference is prototypes vs. classes. But as soon as you toss in a few other languages, it's obvious that the guess is wrong. For example, Python and Smalltalk do things the same way as JavaScript, but they're very definitely class-based languages.

    Simpson brings up some other differences between JS and other languages in the rest of the series--e.g., the fact that a JS object's constructor doesn't automatically call its prototype's constructor, while a C++ class's constructor does automatically call its base classes' constructors. But here again, Python does the same thing as JS. At any rate, the direction of the arrows is what he focuses on, so let's stick with that.

    Dynamic vs. semi-dynamic lookup

    If this difference isn't about prototypes vs. classes, what is it?

    In Python, method lookup is dynamic. If your source code has a call to foo.spam, the compiler turns that into code that does the following, until one of them works:*
    • Ask the object to handle "spam".
    • Ask the object's __class__ to handle "spam".
    • Walk the class's superclass tree (in C3-linearized order), asking each one in turn to handle "spam".
    • Raise an AttributeError.
    In JavaScript, method lookup is dynamic in exactly the same way. The compiler turns a call to foo.spam into this:
    • Ask the object to handle "spam".
    • Walk the object's prototype list, asking each one in turn to handle "spam".
    • Raise a TypeError.
    The only difference is that Python has two kinds of links--instance-to-class and class-to-superclass--that JavaScript collapses into one--object-to-prototype. (And that Python has a linearized multiple inheritance tree, but let's ignore that here.)

    In C++, method lookup is mostly static, with a hook for a very specific kind of dynamism. The compiler turns a call to foo.spam into this:**
    • Look up foo.vptr[2] and call it.
    The compiler has to know that foo is (a reference to) an instance of Foo or some subclass thereof, and it has to have compiled Foo already so that it knows that spam is its third method. The only way anyone outside of Foo can affect what happens here is that a subclass like Bar can override the spam method, so its vtable ends up with a different function pointer in the third slot. If so, if foo is a Bar instance at runtime, then foo.vptr[2] will point to Bar.spam instead of Foo.spam. That's all the late binding you get.

    The downsides to C++-style virtual methods instead of fully dynamic lookup are pretty obvious. In JavaScript or Python, you can set a new foo.spam at any time, overriding the behavior of Foo. Or you can set a new Foo.spam, which will affect all existing instances of Foo or its descendants that don't have any overriding implementation. This can be pretty handy for, e.g., building up transparent proxies based on runtime information, or mocking objects for unit tests.
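
    For example, in Python:
    class Foo:
        def spam(self):
            return "original"

    foo = Foo()
    print(foo.spam())                  # original

    Foo.spam = lambda self: "patched"  # rebind the method on the class...
    print(foo.spam())                  # patched -- ...and existing instances see it immediately

    foo.spam = lambda: "mine"          # or override it on just this one instance
    print(foo.spam())                  # mine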

    But you don't do those things that often. Pretty often, C++'s dynamic behavior is all the dynamic behavior you need. And it's obviously simpler, and a lot faster, to just follow a pointer, index an array, and follow another pointer than to walk through a linked chain of objects asking each one to do a dictionary lookup for you. And it even opens the door to further optimizations--e.g., if the compiler knows that foo must be exactly a Foo rather than a subclass (e.g., because it's a local allocated on the stack), or it knows that no subclass of Foo overrides the method (e.g., because it's got feedback from a whole-program analyzer), it can just insert a call directly to the Foo.spam implementation.

    The problem is that something that works "pretty often" but has no escape hatch only works pretty often. And just as often, the speed of the lookups doesn't matter anyway. As Alan Kay pointed out back in the 70s, usually you're either doing all your work inside some inner loop so it hardly matters how fast you get to that loop, or spending all your time waiting on a user to interact with the GUI/a client to interact with the reactor/etc. so it hardly matters how fast you do anything. Better to give people features they might need than performance they probably don't. So, that's what Smalltalk did. And Objective-C, Python, Ruby, etc.

    * Python offers a wide variety of ways to hook this process, as do many other dynamic OO languages, so this is only the default behavior. But let's not worry about that right now.
    ** I'm ignoring non-virtual functions here, but they don't add any complexity. I'm also ignoring the screwy way that C++ does multiple inheritance. Be glad. I'm also hiding reference/pointer stuff, which means this is closer to C# or Java than C++, but that doesn't really matter here.

    The best of both worlds?

    A Smalltalk offshoot called Self was built around the idea that a just-in-time (JIT) optimizer can eliminate most of the costs of dynamic lookup, while preserving all the benefits. It's pretty uncommon that you change out a method at runtime, and very rare that you do so more than once or twice in the run of a program. So, if you cache the lookup and recompile any function that needs the lookup to use the cached value, and insert some guards to invalidate the cache on the rare occasions when you change something, you may end up with millions or billions of faster-than-C++ lookups for every one recompile and slow lookup.

    It's not coincidental that Self is also the language that popularized prototype-based inheritance. One major thing classes add in dynamic inheritance is that making that first link special makes it easier to optimize. Also, classes of objects tend to almost always override the same set of methods in the same way, so you might as well label those classes to help the optimizer. And if the language encourages you to think in terms of classes of objects, you'll usually make the optimizer's job easier. But with a smart-enough JIT, that isn't necessary. At any rate, the dynamic lookup was the main reason for the JIT, and that reason applies to dynamic class-based languages just as much as to prototype-based ones.

    Plus, of course, it turns out that the idea of a JIT worked pretty well on static code too; HotSpot running compiled Java bytecode is generally faster than PyPy or V8 running Python or JavaScript. But for most uses, PyPy and V8 are more than fast enough. (In fact, for many uses, CPython and Spidermonkey are more than fast enough.)

    What's in a name?

    Simpson wants to insist that JavaScript's behavior is not dynamic inheritance, it's delegation.

    But, if so, Python, Smalltalk, and many other class-based languages aren't doing dynamic inheritance either. So, the difference between class-based and prototype-based still has nothing to do with delegation vs. inheritance, even if you use his terms.

    In fact, what he's calling "delegation" really just means dynamic inheritance. Which makes it obvious why claiming that JavaScript doesn't do dynamic inheritance, it does delegation, is wrong.

    While we're on the subject of names, you could argue that the confusion about prototypes is from the start caused by other things JS got wrong. Other prototype-based languages, from Self to newer languages like IO, don't use constructor functions and a Java-like new operator, so they don't make prototyping/cloning look misleadingly like instantiation. Maybe if JS didn't do that, people wouldn't have been misled as badly in the first place?

    Prototypes

    So then, what is the difference between classes and prototypes?

    Of course there's the minor difference noted above, that JavaScript only has one kind of link instead of two.* And, if you read the footnotes, there could be a performance difference. But... so what?

    Well, getting rid of classes does lose something: In JavaScript, your prototype chain doesn't affect your type. In Python, because the first step of that chain is special, the language can define its notion of type in terms of that link. Of course 95% of your code is duck-typed and EAFP-driven and this doesn't matter. But for that last 5% (and for debugging), you have a coherent and useful notion of type, which can help.

    Also, making multiple inheritance work with classes is not that hard (even though C++ gets it wrong, Java punts on it, etc.); making it work with prototypes is harder.

    The biggest downside is that cooperative inheritance through a super-like mechanism is hard with prototypes. In particular, composing mixins that don't know about each other is a serious problem. Simpson gets into this in his "Polymorphism redux" section in his third part. But I believe this is actually only a problem if you use the constructor idiom instead of using prototype-chained object literals, so a prototype-based language that didn't encourage the former and make the latter more verbose than necessary may not have this problem.

    Anyway, what do you gain in return for these negatives? One thing (although not the motivation behind Self**) is flexible object literals. In JavaScript, creating an object is just like creating a dictionary. If some of the members are callable, they're methods. Sure, this is just syntactic sugar, but it's syntactic sugar around something you often want.

    At first glance, that sounds horrible. We have 30000 objects, all anonymous, all locally created in different calls to the same function and therefore closing over different scopes… so much duplication! But again, you're thinking about ahead-of-time optimization; there's no reason the interpreter can't realize that all of those closures have the same body, and guardably-almost-always the same behavior, so with a JIT, you're wasting very little space and even less time. And, even if that weren't true, it probably wouldn't matter anyway. Meanwhile, if you don't want that duplication, you can always move some of those functions into a prototype—exactly the same way as you'd move them into a class in Python.

    At second glance, it sounds horrible again. Everyone knows that closures and objects are dual—one is the functional way of encapsulating state, the other is the OO way. So, who wants closures inside objects? But here... well, a Python developer shouldn't be surprised that sometimes practicality beats purity, and adding boilerplate can hurt readability. For a locally-created object that just has a bunch of simple functions that all refer to the local variable username, does it actually make things more readable to create and call a constructor that copies username to self.username and make all those functions call self.username, or is that just boilerplate?
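
    To put that comparison in Python terms (a rough sketch--SimpleNamespace is just standing in for a JS-style object literal here):
    from types import SimpleNamespace

    # Closure style: the behavior lives in each object, no constructor boilerplate.
    def make_session(username):
        return SimpleNamespace(
            greet=lambda: "Hello, " + username,
            farewell=lambda: "Goodbye, " + username,
        )

    # Class style: the same thing, with username copied onto self first.
    class Session:
        def __init__(self, username):
            self.username = username
        def greet(self):
            return "Hello, " + self.username
        def farewell(self):
            return "Goodbye, " + self.username

    print(make_session("guido").greet())   # Hello, guido
    print(Session("guido").greet())        # Hello, guido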

    (In fact, you can go farther and explicitly link up multiple closures through manual delegation chains. But I haven't seen many uses of inheritance chains beyond 2-deep in JavaScript, and nearly all of those that I've seen are equivalent to what you'd do in a class-based language. So, let's just stick with the simple case here.)

    Really, it's similar to def vs. lambda: for most functions, def is clearly better, but that doesn't mean there's never a good use for lambda. Nobody needs a separate statement, and an explicit name, for the function lambda x: x+1 or lambda cb: cb(result). And similarly, for most objects, using methods in the class or prototype to encapsulate the object's state is clearly better, but that doesn't mean there's never a good use for object literals with object-specific methods.

    The big difference is that in JavaScript, both ways are easy; in Python, both ways are possible, but only one is easy—there's no clean and simple way to put the behavior in each object.

    Well, that's not quite true; JS's horrible scoping rules and its quirky this mean both ways are actually not easy. But that's not because JS is prototype-based; it's because JS got some other things wrong.

    Anyway, the differences in this section are what really matter—and they're the ones that are down to JS being prototype-based.

    * Or, if you count metaclass-class as different from class-instance, one instead of three. Simulating metaclasses in a prototype language usually means making use of the distinction between calling the prototype vs. just referencing its .prototype, in JS terms, and there's nothing to protect you from mixing up your "metaprototypes" and prototypes. But then Python has nothing to protect you from inheriting from a metaclass as if it were a normal class either, and that never causes problems in real-life code.
    ** The inspiration behind prototypes in Self was that it's often hard to define a class hierarchy in advance, and too hard to modify a class hierarchy after you've already written a bunch of code around it. Slightly oversimplifying, prototypes were mainly an attempt to solve the fragile base class problem.

    The best of both... wait, deja-vu

    Some newer class-based languages, like Scala, have an object-literal and/or singleton-defining syntax. (And it's not that hard to add that to Python, if you want.) If you combine this with a few other features, you can get objects that act like instances and clones at the same time (and hopefully in the ways you want, not the ways you don't).

    On the other hand, JS (ES6) is adding classes for when you want them, but object literals will still be an anonymously-typed clone of an optionally-specific prototype rather than an instance of a class.

    Are we meeting in the middle? I'm not sure.

  18. About a year and a half ago, I wrote a blog post on the idea of adding pattern matching to Python.

    I finally got around to playing with Scala semi-seriously, and I realized that they pretty much solved the same problem, in a pretty similar way to my straw man proposal, and it works great. Which makes me think that it's time to revisit this idea. For one thing, using Scala terminology that people may be familiar with (and can look up) can't hurt. For another, we can see whether any of the differences between Python and Scala will make the solution less workable for Python.

    Extractors


    In my post, I described the solution as a matching protocol. A class that wants to be decomposable by pattern matching has to implement a method __match__, which returns something like a tuple of values, or maybe an argspec of values and optional values, that the pattern-matching code can then match against and/or bind to the components of a pattern.

    For simple cases, this could be automated. Both namedtuple and Namespace could define a __match__, and we could write decorators to examine the class in different ways--__slots__, propertys, copy/pickle, or even runtime introspection--and build the __match__ for you. But when none of these simple cases apply, you have to implement it yourself.
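
    For instance, a namedtuple subclass might implement it like this (remember, __match__ is just the straw-man name from that earlier post, not a real Python protocol):
    from collections import namedtuple

    class Point(namedtuple('Point', ['x', 'y'])):
        def __match__(self):
            # hypothetical protocol: return the values a pattern can bind against
            return (self.x, self.y)

    # A matcher would call this under the hood and bind the pieces to the pattern's names.
    x, y = Point(3, 4).__match__()
    print(x, y)   # 3 4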

    In Scala, the equivalent is the extractor protocol. A class that wants to be decomposable by pattern matching supplies an associated extractor (normally a singleton object, but that isn't important here), which has an unapply method that returns an optional value or tuple of values, as well as an optional apply method that constructs instances of the class from the same arguments. A case class (a simple kind of type that's basically just a record) automatically gets an extractor built for it; otherwise, you have to build it yourself.

    In the simplest case, this is no different from just using __match__ for unapply and, of course, the existing __init__ for apply. But it's actually a lot more flexible.

    • You can write multiple extractors that all work on the same type.
    • You can add an extractor for an existing type, without having to monkeypatch the type. For example, you could extract the integer and fraction parts of a float.
    • You can write an extractor that doesn't match all values of the type.

    Example


    Daniel Westheide's great Neophyte's Guide to Scala shows some of this in action. I'll borrow one of his examples, with minor variations:

    trait User {
      def name: String
      def score: Int
    }
    class FreeUser(val name: String, val score: Int, val upgradeProbability: Double)
      extends User
    class PremiumUser(val name: String, val score: Int) extends User
    
    object FreeUser {
      def unapply(user: FreeUser): Option[(String, Int, Double)] =
        Some((user.name, user.score, user.upgradeProbability))
    }
    object PremiumUser {
      def unapply(user: PremiumUser): Option[(String, Int)] = Some((user.name, user.score))
    }
    
    val user: User = new FreeUser("Daniel", 3000, 0.7d)
    
    user match {
      case FreeUser(name, _, p) =>
        if (p > 0.75) name + ", what can we do for you today?" else "Hello " + name
      case PremiumUser(name, _) => "Welcome back, dear " + name
    }
    

    If you're not used to Scala's syntax, this might be a bit hard to read, but it's not too hard to muddle through. A trait is like an ABC, a class is like a normal class, and an object defines a singleton class and the one instance of it. So:

    First, we define a class hierarchy: FreeUser and PremiumUser are both subclasses of the abstract base class User, and FreeUser adds a new attribute.

    Next, we define two extractors. The fact that they're named FreeUser and PremiumUser, just like the classes, is convenient for readability, but it's the type annotations on their unapply methods that make them actually work on those types, respectively.

    Then we construct a User, who happens to be a free user with a 0.7 probability of upgrading.

    Then we pattern-match that user, first as a free user, then as a premium user if that fails. Here, the syntax works like my proposal, except for minor spelling differences (match for case, case for of, braces for indents, => for :). But, instead of checking isinstance(user, FreeUser) and then calling instance.__match__ and binding to name, _, p, Scala tries to call FreeUser.unapply(user), which works, and binds to name, _, p. The result is the same, but the way it gets there is a little more flexible (as we'll see).

    And then, inside the free user case, we just do an if-else, so the result is "Hello Daniel".

    Another example


    In case it isn't obvious what the optional-value bit is for, here's another example that makes use of that, as well as applying an extractor to a builtin type without having to monkeypatch it. This is a pretty silly example, but it's in the official Scala tutorial, so...

    object Twice {
        def unapply(x: Int): Option[Int] = if (x%2 == 0) Some(x/2) else None
    }
    
    val i: Int = 42
    i match {
      case Twice(n) => n
      case _ => -1
    }
    

    Here, Twice.unapply(42) returns Some(21), so the case Twice(n) will match, binding n to 21, which we'd return.

    But if we tried the same thing with 23, then Twice.unapply(23) returns None, so the case Twice(n) would not match, so we'd hit the default case, so we'd return -1.

    As for the optional values, I'll get to that below.

    Can we do this in Python?


    Some of the details obviously aren't going to fit into Python as-is.

    We could easily add a @singleton class decorator that makes the class anonymous and returns a singleton instance. But we still can't use the same name as the class without a conflict in Python (the instance would just unbind the type). And really, do we need to define a class here just to wrap up a function in a method (which is, or might as well be, a @staticmethod, since it doesn't touch self, which has no state to touch anyway)? I think it would be more pythonic to just have decorated functions. (If you disagree, it's pretty trivial to change.)
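
    Such a decorator is nearly a one-liner; here's a minimal sketch, where @singleton and premium_user are purely hypothetical names for this post, not anything that exists:

    def singleton(cls):
        # Hypothetical helper: replace the class with its single instance, so
        # the name ends up bound to the instance rather than to the type.
        return cls()
    
    @singleton
    class premium_user:
        def unapply(self, user):
            return user.name, user.score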

    Python normally uses dynamic (per-instance) attributes on classes. You can use __slots__ or @property, and you can even define an ABC that makes those abstract, matching Scala's traits, but I think it's more Pythonic to just skip that here. (I'm pretty sure it's orthogonal to everything related to pattern matching and extracting; it just means our most Pythonic equivalent example isn't exactly like the Scala example.)

    Python doesn't have optional types. It does have "implicit nullable types" by just returning None, but since None is often a valid value, that doesn't do any good. Of course the answer here is exceptions; everywhere that Scala returns Option[T], Python returns T or raises, so why should this be any different?

    In my previous proposal, we used AttributeError (or a new subclass of it, MatchError) to signal a failed match (partly because you get that automatically if __match__ doesn't exist), so let's stay consistent here. We can even make a special unapply_simple decorator that lets us return None and have it turned into a MatchError when we know None isn't a valid value.
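
    To make the examples below concrete, here's one minimal way those decorators might be written; unapply_function, MatchError, @unapply, and @unapply_simple are all hypothetical names following this proposal, not anything that exists today:

    class MatchError(AttributeError):
        """Hypothetical: signals that a pattern failed to match."""
    
    class unapply_function:
        """Hypothetical marker type, so the case machinery can tell extractors
        apart from ordinary classes and functions."""
        def __init__(self, func):
            self.func = func
        def __call__(self, value):
            return self.func(value)
    
    def unapply(func):
        return unapply_function(func)
    
    def unapply_simple(func):
        # Treat a None result as a failed match rather than as a valid value.
        def wrapper(value):
            result = func(value)
            if result is None:
                raise MatchError(value)
            return result
        return unapply_function(wrapper)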

    Finally, Python doesn't have static typing. Is that necessary here? Let's try it and see.

    So, using my earlier case syntax:

    class User:
        pass
    
    class FreeUser(User):
        def __init__(self, name: str, score: int, upgrade_probability: float):
            self.name, self.score, self.upgrade_probability = name, score, upgrade_probability
    
    class PremiumUser(User):
        def __init__(self, name: str, score: int):
            self.name, self.score = name, score
    
    @unapply
    def free_user(user):
        return user.name, user.score, user.upgrade_probability
    
    @unapply
    def premium_user(user):
        return user.name, user.score
    
    user = FreeUser("Daniel", 3000, 0.7)
    
    case user:
        of free_user(name, _, p):
            if p > 0.75:
                return '{}, what can we do for you today?'.format(name)
            else:
                return 'Hello {}'.format(name)
        of premium_user(name, _):
            return 'Welcome back, dear {}'.format(name)
    
    @unapply_simple
    def twice(x):
        if x%2 == 0: return (x//2,)
    
    i = 42
    case i:
        of twice(n): print(n)
        else: print(-1)
    

    There are a few problems here, but I think they're easily solvable.

    My earlier proposal matched Breakfast(0, eggs, _) by first checking isinstance(breakfast, Breakfast), and then calling Breakfast.__match__; here, there's no type-checking at all.

    As it turns out, that free_user will fail to match a PremiumUser because user.upgrade_probability will raise an AttributeError--but the other way around would match just fine. And we don't want it to. But with nothing but duck typing, there's no way that Python can know that premium_user shouldn't match a FreeUser. (The only reason a human knows that is that it's implied by the names.)

    So, is the lack of static typing a problem here?

    Well, we can do the exact same isinstance check inside the unapply decorator; just use @unapply(FreeUser) instead of just @unapply. This seems perfectly reasonable. And it's no more verbose than what you have to do in Scala.
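
    As a sketch (building on the hypothetical unapply_function and MatchError above), the decorator could accept either form--bare, or called with a type to check:

    def unapply(cls_or_func):
        # Hypothetical: @unapply used bare wraps the function directly;
        # @unapply(FreeUser) adds an isinstance check before extracting.
        if not isinstance(cls_or_func, type):
            return unapply_function(cls_or_func)
        def decorate(func):
            def checked(value):
                if not isinstance(value, cls_or_func):
                    raise MatchError(value, cls_or_func)
                return func(value)
            return unapply_function(checked)
        return decorate
    
    @unapply(PremiumUser)
    def premium_user(user):
        return user.name, user.score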

    Or we could even have @unapply check for annotations and test them. It's even closer to what you do in Scala (and in most other FP/hybrid languages), and, if we're going to encourage type annotations in general, why not here? On the downside, PEP 484 specifically says that its annotations don't affect anything at runtime. Also, not every PEP 484/mypy static annotation will work dynamically (e.g., Iterable[int] can only check for Iterable). But I don't think that's a problem. The decorator version seems more conservative, but either one seems reasonably Pythonic if we wanted to go with it.

    Can the two proposals be merged?


    It's pretty nice that the earlier proposal uses the type name as the pattern name, especially for simple types like namedtuple.

    But it's also pretty nice that the Scala-influenced proposal allows you to have multiple pattern extractors for the same type.

    Scala gets the best of both worlds by letting you give one of the extractors the same name as the type (and by giving you one automatically for case classes).

    We could get the best of both worlds by just implementing both. The relevant bit of case-statement logic from my previous proposal did this to match of Breakfast(0, _, eggs):

    if isinstance(case_value, Breakfast):
        try:
            _0, _, eggs = case_value.__match__()
            if _0 != 0: raise MatchError(_0, 0)
        except AttributeError:
            pass # fall through to next case
        else:
            case body here
    

    To extend this to handle extractors:

    if isinstance(case_value, Breakfast):
        # ... code above
    elif isinstance(Breakfast, unapply_function):
        try:
            _0, _, eggs = Breakfast(case_value)
        # ... rest of the code is the same as above
    

    We need some way to avoid calling Breakfast when it's a class (again, consider the FileIO problem from my previous post). That's what the second isinstance check is for. That unapply_function type is what the @unapply decorator (and any related decorators) returns. (It's a shame that it can't inherit from types.FunctionType, because, as of Python 3.6, that's still not acceptable as a base type. But it can implement __call__ and otherwise duck-type as a function. Or, alternatively, just use a function and attach a special attribute to it in the decorator.)

    In fact, we can even allow externally defining decomposition without monkeypatching just by having a special decorator that registers an extractor as the type-named extractor for a type, maybe @unapply_default(Spam). That could create Spam.__match__ as a wrapper around the function, but it could just as easily store it in a registry that we look up if isinstance(case_value, Spam) but Spam.__match__ doesn't exist. (The latter allows us to avoid monkeypatching existing classes--which is obviously important for classes like int that can't be patched... but then do you want coders to be able to create a new pattern named int? An extractor that can be applied to int, sure, but one actually called int?)
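
    A rough sketch of the registry variant (again reusing the hypothetical unapply_function and MatchError, with @unapply_default and default_match as made-up names):

    _default_extractors = {}  # hypothetical registry: type -> extractor
    
    def unapply_default(cls):
        def decorate(func):
            extractor = unapply_function(func)
            _default_extractors[cls] = extractor
            return extractor
        return decorate
    
    def default_match(value):
        # What the case machinery might do once isinstance(case_value, Spam)
        # has passed: prefer __match__, then fall back to the registry.
        # (A real version would walk the MRO instead of using type() exactly.)
        try:
            return value.__match__()
        except AttributeError:
            try:
                extractor = _default_extractors[type(value)]
            except KeyError:
                raise MatchError(value)
            return extractor(value)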

    Other features of Scala pattern matching


    Features already covered


    There are some Scala features I didn't mention here because they're equivalent to features I already covered in my previous post:
    • Scala has as clauses with the same semantics (they're spelled NEWNAME @ PATTERN instead of PATTERN as NEWNAME, which I think is less readable).
    • Scala has if guard clauses with the same semantics (including being able to use values bound by the decomposition and the as clause), and almost the same syntax.
    • Scala has value definition and variable definition pattern syntax, with the same semantics and nearly the same syntax as my pattern-matching assignment (not surprising, since their syntax was inspired by Python's sequence unpacking assignment, and mine is an extension of the same).

    Pattern matching expressions


    Scala has a match expression, rather than a statement.

    That shouldn't be too surprising. As I mentioned last time, most languages that make heavy use of pattern matching use an expression. But that's because they're expression-based languages, rather than statement/expression-divide languages. (Traditionally, functional languages are expression-based either because they encourage pure-functional style, or because they're heavily based on Lisp. Scala isn't being traditional, but it's expression-based following the same trend as JavaScript and Ruby.)

    In Python (unlike traditional functional languages, Scala, or Ruby), if is a statement--but then Python also has a more limited form of if that can be used in expressions. Could the same idea apply here? Maybe. But consider that Python had an if statement for many years before the stripped-down expression was added, and the statement is still used far more often.

    I think we'll want pattern-matching statements for anything non-trivial, and we won't know what (if anything) we'd want from a stripped-down limited pattern-matching expression until we get some experience with pattern-matching statements.

    Predicate extractors


    Scala allows extractors that return a boolean instead of an optional tuple. This allows you to match without deconstructing anything:

    object PremiumCandidate {
      def unapply(user: FreeUser): Boolean = user.upgradeProbability > 0.75
    }
    
    user match {
      case PremiumCandidate() => initiateSpamProgram(user)
      case _ => sendRegularNewsletter(user)
    }
    

    This is obviously easy to add to Python:

    @unapply
    def premium_candidate(user: FreeUser):
        return user.upgrade_probability > 0.75
    
    case user:
        of premium_candidate:
            initiate_spam_program(user)
        else:
            send_regular_newsletter(user)
    

    In Scala, the different logic is driven off the static return type, but we can just as easily use the syntactic difference: If we get an unapply_function rather than a (syntactic) call of an unapply_function, we call it and if it returns something truthy it's a match; if it returns something falsey or raises AttributeError it doesn't.
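
    In other words, the hypothetical desugaring for a bare extractor name would look something like this:

    # of premium_candidate:  -- no parentheses, so nothing to decompose
    try:
        matched = bool(premium_candidate(case_value))
    except AttributeError:  # MatchError is a subclass, so it's covered too
        matched = False
    if matched:
        pass  # case body here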

    Scala also lets you use predicate extractors together with its equivalent of the as clause as a way of type-casting the value if it passes a type-check. For example, if initiateSpamProgram requires a FreeUser, rather than just a User, you could write:

    user match {
      case freeUser @ PremiumCandidate() => initiateSpamProgram(freeUser)
    }
    

    We can't do that in Python, but then we don't have any need to. At runtime, initiate_spam_program will probably just duck-type; if not, it'll isinstance or similar, and it will see that user is indeed a FreeUser. For mypy or other static type checkers, there's no reason the checker can't infer that if user passes premium_candidate(user: FreeUser), it's a FreeUser; not being able to do that is just a limitation of Scala's typer (user can't be defined as a User but then be inferred as a FreeUser within the same expression). We don't need to copy that limitation just so we can try to find a workaround for it.

    Extractors with apply


    You'll notice that I mentioned apply way back at the top, but I only implemented unapply in the above examples. So, let's show an example of apply:

    object Twice {
      def apply(x: Int): Int = x * 2
      def unapply(x: Int): Option[Int] = if (x%2 == 0) Some(x/2) else None
    }
    
    val i = Twice(21)
    

    Or, maybe a better example:
    object Email {
      def apply(user: String, domain: String): String = user + "@" + domain
      def unapply(email: String): Option[(String, String)] = {
        val parts = email split "@"
        if (parts.length == 2) Some((parts(0), parts(1))) else None
      }
    }
    
    val email = Email("joe.schmoe", "example.com")
    
    email match {
      case Email(user, domain) => if (domain == our_domain) message(user) else forward(email, domain)
      case _ => log_unknown_email(email)
    }
    

    As you can see, the apply method gives us a constructor whose syntax perfectly matches the case pattern used for decomposition. (Just as we get in my original proposal when your __init__ and __match__ match.) That's nice, but not a huge deal in many cases.

    Anyway, this one is easy to do in Python:

    @apply
    def twice(x):
        return x * 2
    
    case twice(21):
        of twice(n): print(n)
        else: print(-1)
    

    But I don't think this is necessary. It doesn't seem to be used much in Scala code, except in the special case of an extractor whose name matches the class--which we'd handle by having __match__, with the already-existing __init__ for a constructor. (Of course there's nothing forcing you to make your __match__ and __init__ parallel. But then there's nothing forcing apply and unapply to be parallel in Scala either. Usually, you'll want them to go together; in the rare cases where you want them to be different, there's nothing stopping you.)

    Plus, we already have ways to write external factory functions for types when you really want to; there's no reason to add and standardize on a new way.

    But, if we do want apply (and, again, I don't think we do), is sometimes having two functions instead of one a good argument for putting apply and unapply together in a class? I don't think so. Compare:

    @apply
    def twice(x):
        return x * 2
    @unapply
    def twice(x):
        if x%2 == 0: return (x//2,)
    
    @extractor
    class twice:
        def apply(x):
            return x * 2
        def unapply(x):
            if x%2 == 0: return (x/2,)
    

    Sure, the latter does group the two together in an indented block, but I think it looks like heavier boilerplate, not lighter. And in the very common case where you don't have apply, that will be even more true. Also, notice that @extractor has to be a bit magical, auto-singleton-ing the class and/or converting the methods into @staticmethods, which makes it harder to understand. And, if anything, it's less clear what twice(21) or of twice(n): are going to do this way; sure, once you internalize the extractor protocol you'll be able to figure it out, but that's not trivial. (Plenty of Scala developers don't seem to get how it works in their language; they just treat it as magic.)

    Sequence extractors


    Scala lets you write sequence extractors, with another special method, unapplySeq. This returns an optional sequence instead of an optional tuple (or an optional tuple whose last element is a sequence), and there's special syntax that allows you to match the decomposed sequence with a * wildcard, like this:

    xs match {
      case List(a, b, _*) => a * b
    }
    

    It's pretty obvious how to do this in Python, and a lot more simply: this is just the normal use of * for iterable unpacking assignment, *args parameters, and splatting arguments. I already covered how this works with __match__ in my earlier post; it's the same for extractors.

    Infix extractors


    Scala also allows you to use infix extractors, to match infix patterns. For example, you could match 2 * n directly by using an extractor named * with an appropriate unapply.

    More typically, you match head :: tail to deconstruct cons lists, using an extractor that looks like this:

    object :: {
      def unapply[A](xs: List[A]): Option[(A, List[A])] =
        if (xs.isEmpty) None else Some((xs.head, xs.tail))
    }
    

    If Python doesn't need arbitrary infix operators, or the ability to use existing operators as prefix functions or vice-versa, or other infix-operator-related features like sectioning, it doesn't need infix patterns either.

    Especially since the primary use for this, deconstructing cons lists, is something you're rarely going to need in Python. (If you want cons lists, you have to write them yourself, and you won't get any syntactic sugar for constructing them, so why would you be surprised that you don't get syntactic sugar for deconstructing them? You can use the same Cons(head, tail) for a pattern as you use for a constructor.)

    Patterns in other places


    My proposal suggested explicit case statements, and generalizing sequence unpacking in assignments to pattern decomposition, but I didn't think about all the other places Python binds variables besides assignments. Scala's designers did, and they made it work in as many places as possible.

    So, looking at all the places Python does bindings:

    • Iteration variables in for statements and comprehensions: We can already sequence-unpack into the iteration variables; why not pattern-match? So, for Point3D(x, y, z) in vertices: (a rough desugaring is sketched just after this list).
    • as-clause variables in various statements. Python already allows sequence unpacking here, although it requires extra parentheses:
      • with: I don't think Scala has an equivalent, but I could definitely imagine wanting it here. Although often you're going to want to bind the whole thing as well as decomposing it, and doubling the as (with spam.open() as TextFile(filename) as f:) seems pretty horrible.
      • except: Scala's catch basically has an implicit ex match, which you write normal case clauses inside. This could actually be more useful in Python than in Scala, where exceptions are full of good stuff, and you could, e.g., pull the filename out of a FileNotFoundError. But it would be clumsy, because you've already matched the type: except FileNotFoundError as FileNotFoundError(filename):. So... maybe in a new language, but probably not useful in Python.
      • import: It seems like it would always be more readable to import and then case. Maybe someone can come up with a counterexample, but I doubt it.
    • For completeness, if we're looking at import ... as, then we should also look at from ... import, and maybe even plain import. And I think it looks just as bad in the former statement, and a thousand times worse in the latter.
    • Function parameters in def and/or lambda. In my previous post, I suggested ML-style pattern overloads for def, and assumed that the fact that this wouldn't work for lambda wouldn't matter because why would you want it there? But in Scala, single-case lambdas can be written without all the match syntax, and this is used pretty extensively in idiomatic Scala. (Of course in Scala, pattern matching is an expression, not a statement, and idiomatic Scala uses lambdas in many places where idiomatic Python uses named functions.) There's a bit of ambiguity in the fact that a parameter list is actually a list of target-lists that have to be single targets or parenthesized--and remember that Python 3.0 removed that ambiguity by killing even basic iterable unpacking for parameters. Maybe someone could come up with a syntax that allows def line_to(Point3D(x, y, z)): or similar without that ambiguity, but even if you could, does that really look useful in a language without static typing and overloads? (Again, I think if we want this, we want a way to define overloads by pattern.)
    • Conditions are not currently a place to do bindings in Python, and that's a good thing. But occasionally, something like C's if ((c = getch()) != EOF) is handy. Could a stripped-down single-case pattern-matching binding in conditions be a way to get most of the benefits, without the problems? (Similar things certainly work in a variety of languages from Swift to Haskell, but in those languages, an if body has its own scope, which will either shadow an outer-scope name or raise an error complaining about shadowing, so I don't know if that means anything.) Anyway, I haven't thought much about this.
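
    For the first of those, a rough desugaring of the hypothetical for Point3D(x, y, z) in vertices: syntax, in terms of the earlier __match__ protocol and MatchError, might be:

    for _item in vertices:
        if not isinstance(_item, Point3D):
            raise MatchError(_item, Point3D)
        x, y, z = _item.__match__()
        pass  # loop body here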

    Partial functions


    Scala makes extensive use of partial functions. Not partially-applied functions (like functools.partial), but partial functions in the mathematical sense: functions that are only defined on part of the implied domain.

    This is actually a weird hybrid of static and dynamic functionality, but it turns out to be useful. At compile time, Scala can tell that your cases don't exhaust all of the possibilities. In many functional languages, this would mean that you get a compiler error. In Scala, it means you get a partial function instead of a normal function. If you call it on an out-of-domain value, you get a MatchError at runtime. But you can also LBYL it by calling its isDefinedAt function. And code that always checks before calling can be statically determined to be safe.

    And, idiomatically, partial functions--usually constructed with the special anonymous-function-with-implicit-match syntax mentioned above--are what you often pass to things like collect. The collect function does the equivalent of map and filter at the same time, building a new sequence with the results of the partial function applied to the members of a source sequence, while skipping the ones that aren't defined. Like this, in Python terms:

    def collect(func, iterable):
        for value in iterable:
            try:
                yield func(value)
            except MatchError:
                pass
    
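    For instance, with a partial function spelled as a function that raises (and the hypothetical MatchError re-defined here just to keep the example runnable), collect does the map and the filter in one pass:

    class MatchError(AttributeError):  # hypothetical, as proposed earlier
        pass
    
    def half_of_even(x):
        # A partial function in the mathematical sense: only defined on evens.
        if x % 2:
            raise MatchError(x)
        return x // 2
    
    print(list(collect(half_of_even, range(10))))  # [0, 1, 2, 3, 4]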

    Of course that demonstrates that a dynamic language that loves exceptions and EAFP really doesn't need an explicit concept of partial functions. A partial function is just a function that can raise (or that can raise a specific exception), and using partial functions is something we trivially do all the time without thinking about it. So, this is a neat way to get familiar Pythonic functionality into a very non-Python-like language, but Python has nothing to learn there. (Although maybe mypy does?)

    Other languages


    ML


    First, traditional ML patterns have a feature that we haven't discussed: "either patterns", where two (or more) different patterns (with "|" or "or" between them) are treated as a single case. In Python terms:

    case meal:
        of Breakfast(spam, *_) or Lunch(spam, *_):
            print('Are {} servings of spam really enough?'.format(spam))
    

    In ML, the composed patterns have to bind the same set of variables, with the same types (as they are in the example above). There have been attempts to expand that to wrap variables bound only on one branch in optional types, and even to allow variables with different types on each branch to be bound as choice types, but that turns out to be painful to use (you end up with a mess of pattern-matching expressions inside pattern-matching expressions that would be a lot more readable if you just flattened them out).

    In object-oriented ML variants, it makes sense to allow the different branches to match different class types, as long as they share a common supertype, in which case the bound variable is of that supertype. (I don't think either base OCaml or F# actually does that, but it makes sense...)

    But in a duck-typed language like Python, of course, you could dispense with all requirements: variables matched to different types in different branches are fine, as long as all the types duck-type successfully for whatever operations you use with them, and even variables only matched in one branch and not the other are fine, as long as you can handle the possible UnboundLocalError.

    Some ML offshoots have experimented with the dual idea of "both patterns", where two (or more) different patterns (with "&" or "and" between them) are treated as a single case. Here, obviously, they have to either not bind the same variables, or bind them to the same values. (The Python rule would presumably be that they can bind the same variables, but if they bind them to different values, one of them wins arbitrarily. Or maybe it would be useful to say the last one always wins, or the first truthy one, or some other rule? Without experimenting with it, it's hard to guess whether any of them would be particularly helpful...)

    With only literally deconstructing patterns, both patterns aren't very useful (what's the point of matching Breakfast(spam, _, _) and Breakfast(_, 0, eggs) instead of just Breakfast(spam, 0, eggs)?), but with unapply functions, they make a lot more sense. For example, if you had any two of the coordinate systems for 3D points, you could match the others--e.g., cartesian(_, _, z) and spherical(r, theta, _) gives you the cylindrical coordinates.

    I'm not sure how useful either of these features would be in Python, but they're not too hard to define. If someone wants to write a serious proposal to add pattern matching to Python, they should at least provide a rationale for not including them.

    F#


    F# builds on OCaml with the specific intent of using .NET libraries, and in particular using the C#-centric, OO-heavy .NET framework for most of its stdlib. So, they obviously needed a way to decompose arbitrary classes. Since Scala, Swift, Rust, etc. hadn't been invented yet, they had to figure it out for themselves. They took it as a specific case of the more general idea of matching abstract data structures, and wrote a paper on their design, Extensible Pattern Matching Via a Lightweight Language Extension.

    In simplest form, F#'s "active patterns" are pretty similar to Scala's extractors, and even more similar to the variation I suggested above: instead of defining a singleton object with unapply and apply methods, you just define a plain (unapply) function with a special marker (F# uses "banana brackets" around the name, like (|Complex|), rather than a function decorator, which becomes important later) to declare it as an active pattern; you can then use it in any (single, partial-multiple, or total-multiple) match expression the same way you'd use a record type.

    But you can go farther. For example, instead of just returning a tuple, you can return what looks like a record constructor with the function name as the record type name and the tuple as its arguments. And you can then define a "banana split" pattern, with multiple names inside the banana brackets, which can return what looks like a record constructor for any of those names, which can then be pattern-matched as if they were the multiple constructors of an actual sum type. (In fact, they are constructors of an actual sum type, it's just anonymous). The example in the docs deconstructs an abstract list with (|Cons|Nil|) by calling its non-empty, head, and tail methods. In Python terms, it might look something like this:

    @unapply_split('Cons', 'Nil')
    def cons_nil(lst: LinkedList): # LinkedList is an ABC
        if lst: return Cons(lst.head(), lst.tail())
        else: return Nil()
    

    And you can then use Cons and Nil in patterns to match list instances. The (inferred) type of that function in F# is something horrible involving the parametric choice type that you never want to read, so the fact that it would probably not be represented in the type system at all in Python isn't terrible. Anyway, while that's a nice feature, I can't imagine how to make it fit in nicely with Python syntax, and defining two separate partial unapply functions is not that much heavier, and probably a lot easier to understand:

    @unapply
    def Cons(lst: LinkedList):
        if lst: return lst.head(), lst.tail()
    
    @unapply
    def Nil(lst: LinkedList):
        if not lst: return ()
    

    However, it's a lot easier to prove that a pattern matching both Cons and Nil is a total pattern (or detect whether it's a partial pattern, for use in building automatic partial functions with is_defined support) on LinkedList with the "banana split". That's important for F# (or Scala), but probably not so much for Python. (Of course this particular example is pretty stupid for Python in the first place--just use the iterable protocol. But it obviously generalizes to less trivial examples that aren't as silly in Python.)

    F# also allows for matching with independent active patterns, with the usual rule that the patterns are just dynamically tried in order and there must be a default case, which allows you to use partial matching functions and still prove total matching. Anyway, that feature is exactly equivalent to the separate unapply functions just described, except that in F# you have to explicitly define each active pattern as partial (by banana-splitting it with _), and of course it has to be represented somehow in the type system (by wrapping the single type or choice of types in an option that's normally handled for you by the syntactic sugar of the multiple-match expression), while in Python, you'd take it for granted that patterns without default can't be proven total and that the values are duck typed.

    Finally, you can extend this to wrap almost any function as a pattern match. For example, you can write a parse_re active pattern that takes a regex from the pattern, and additional arguments from the match expression, and returns its groups if successful. This is really only tricky with static typing, because inferring the type of that parse_re function is a nightmare. Also, in a dynamic language like Python, it's pretty much unnecessary: you'd either add __match__ to the SRE_Match type, or write an unapply function for it, and then just use re.compile(regex).match(x, y) as your pattern, right?

    Beyond all the ways active patterns expand on ML-style patterns, the fact that they're ultimately just functions means you can use them with higher-order functions. Of course the Scala extractor objects or the proposed unapply functions for Python have the same benefit.

    Finally, F#, unlike most ML offshoots, has a notion of mutable records. With a bit of clumsy syntax, you can actually bind to a mutable attribute in the match and then mutate that attribute in the resulting code. You'd think that, if this were useful, there would be a way to do the same with abstract types, but (short of building a bunch of proxy objects) there isn't. So, presumably that implies that we don't need binding to "reference cells" or anything like that in Python.

    C#


    More recently, the designers of C# looked at F# and Scala pattern matching in practice, and designed pattern matching for C#.

    It's worth noting that they decided that, even though their pattern matching has to handle arbitrary OO classes, it's still worth adding record types or more general ADTs if they're going to add pattern matching. That seems like more evidence for adding automatic matching for the record-like types that already exist in Python (like namedtuples) and/or adding actual ADTs.

    But the most interesting part of C#'s design is that they start out with a single-clause pattern-matching expression, EXPR is PATTERN, and build everything on that. The expression is true if the pattern matches, and any bindings in the pattern are bound in the enclosing scope. (Actually, their rule for deciding the appropriate scope is pretty complicated, and weird in some cases, but the Python equivalent would be easy--the enclosing function or class body is always the appropriate scope.) That's general enough to let you do things like if (expr is List x) do_list_stuff(x); or if (x?.y?.z is Int i) do_int_stuff(i); without needing special syntax like Swift's if let. And then a multi-pattern switch statement or a multi-pattern match expression--or, if you're the C# team, why not both?--is pretty simple syntactic sugar for that (but syntactic sugar that the optimizer can recognize and maybe take advantage of). Destructuring assignment seems to be the only single-pattern use that couldn't be directly built on is, so they added special syntactic sugar for that.

    So, how do they handle deconstructing arbitrary classes? The default behavior only works for destructuring record/ADT types (and for simply matching values to types without any destructuring), but the is operator is overloadable, like any other operator in C#. At first glance, that sounds similar to the __match__ protocol (for normal-method operator overloads) or to the extractor protocol (for static-method overloads), but it's actually more complicated: is is treated like a conversion operator, so you can specify how any class destructures to or from any other class. The example in the docs is a Polar point class that deconstructs a separate Cartesian point class, so you can write var p = Cartesian(3, 4) and then later match it with is Polar(var r, var theta). So, Scala's extractors are really just a special case of C#'s overloaded is. The question is, are there any other cases that are useful? The only examples I see all use a static class with a static is overload and no other methods at all, which means they're just unapply functions with extra boilerplate...

    Mathematica


    I may have mentioned this in my previous post, but... In Mathematica, everything is a primitive value or a tree. And this means pattern-matching in Mathematica is designed around trees, rather than flat structures. So, you can match something like a[b[_], _], which is like first matching a[X, _] and then matching X against b[_]. You can also match against just the second level, or against just a leaf at whatever level it appears, etc. That's all pretty cool, but I don't think it's needed in any language where everything isn't a tree.

    Swift, Rust, etc.


    I've already mentioned Swift multiple times. The other obvious modern language to look at is Rust. I don't think there are any radical new ideas from either, but if I'm wrong, I'll come back and edit this section.

    Dynamic languages


    Pattern matching comes out of statically-typed functional programming. Scala, C#, Swift, Rust, etc. are all primarily attempts to bring ideas from static FP to static OOP (in ways that will hopefully be familiar to traditional OO programmers), and secondarily attempts to find ways to simulate some of the benefits of duck typing on top of static typing, so it's not too surprising that they all came up with similar ways to import the idea of pattern matching.

    But, besides asking whether the lack of static typing could be a problem for pattern matching in Python (e.g., the fact that we can't detect partial patterns at compile time), we should also ask whether duck typing could offer more flexibility or other benefits for pattern matching. I already mentioned how either and both patterns could be more useful in Python than in OCaml. And the fact that Python doesn't have to infer horrible choice types that no human will ever be able to debug if the inference goes wrong might be an advantage.

    But if there are any good examples in other dynamic languages, it would definitely be worth looking at them.

    Of course many dynamic languages have "pattern matching" in Perl terms: regex matching on strings, or some generalization of string-matching ideas to other types. I know that generalization is supposed to be one of the key ideas in Perl 6, but, having not dug into Perl 6 beyond reading the apocalypses, I don't know if it meets up with any interesting common ground with ML-style pattern matching, or if it's just SNOBOL string patterns with a more powerful grammar.

    Also, of course, people have implemented pattern matching in Perl, Ruby, etc., just as they have in Python, and those languages with more flexible syntax allow them to make it look more like ML--but that just makes it look less like the native language, doesn't help at all in designing a way to integrate it as a real native language feature, and, most importantly, generally does nothing to solve the problem of decomposing objects, which is the key problem.

    There's gotta be something on LtU...


    I've only done a cursory search at Lambda the Ultimate to see if anyone's published any papers on integrating pattern matching into dynamic OO languages. I found and skimmed some posts about the problem of decomposing objects, but all in terms of static OO (that's where I found the F# paper, which led me to the C# work in progress...). Some of those might have interesting ideas that aren't incorporated in any of the languages I've looked at, but I don't yet know. And it seems like there must be something on dynamic object decomposition out there, even if I didn't find it in a couple minutes.

    How useful is pattern matching in idiomatic Scala?


    Scala, like many modern functional languages (and, increasingly, even some primarily-imperative languages), builds its failure handling around an Option type, basically equivalent to Haskell's Maybe but with some extra stuff piled on top (like being able to construct an Option from a nullable-typed Java object).

    In most such functional languages, pattern-matching is the idiomatic way to deal with optional values. But in Scala, many guides (like part 5 of Westheide's) point out that it's pretty verbose in Scala, and it's often better to use the extras piled on top of the type instead. Compare:

    println("Name: " + user.name match {
      case Some(name) => name
      case None => ""
    })
    
    println("Name: " + user.name.getOrElse("<unknown>"))
    

    Honestly, I'm not qualified to say whether Westheide is right here (although other guides agree with him--and the fact that Scala has the getOrElse method is telling), or whether this extends to other paradigm examples of pattern matching or is specific to the fact that Option, and what you usually do with its values, is so simple. But still, it's worrying: if idiomatic Scala usually doesn't use pattern matching even for one of the paradigm examples of pattern matching, how often would idiomatic Python use it?

    Comparing to Swift


    Swift made extensive use of Optional from the first beta, but the designers originally didn't feel the need to add general pattern matching. Instead, the general solution is to test for nil, and then unwrap inside the tested code. There's special support for that all over the place:

    • There's a single-character type operator for optional types: Int? is the same as Optional<Int>.
    • Unwrapping is a single-character postfix operator: var!.
    • The language's truthiness notion is based on non-nil (except for actual boolean values), so you can write if var { println(var!) }. (Yes, in theory, this means optional booleans are a bit harder to deal with, but that rarely if ever comes up--practicality beats purity and all that.)
    • if let lets you combine the test and binding into one: if let v: Int = var { println(v) } binds the value from the optional Int var to the non-optional Int v, so you can use it in the "then" scope.
    • guard is like if, but lets you bind a variable in the enclosing scope (by enforcing that the failing case must exit that scope), so you can write guard let v: Int = var else { return } and then use v in the rest of the function.
    • The nil-coalescing operator ?? serves most uses of an "unwrap-or-else-use-default" method: 2 + (v ?? 0).
    • The typing engine, including inference as well as checking, understands all of this, so you can just write if let v = var without specifying a type, the compiler can skip any warning or runtime type check inside the body of an if or after a guard, etc.

    However, in later versions of Swift, they felt the need to add additional sugar to pattern matching for optional values anyway. In general, you match an Enum (ADT) constructor (in switch, in a single-pattern if, and a few other places) with case .Kind(val1, val2) = var, and any bindings you want have to be explicitly signaled, with case .Kind(val1, let var2). But when you're matching an optional value, instead of if case .Some(let var1) = var, you can just write case let var1? = var. So, maybe this implies that pattern matching is important, even when you've done everything you can to special-case away the need for it. Usually, idiomatic Swift code gets away with avoiding any pattern matching for optionals by using if let and ??, but the fact that the designers felt the need to add case let var? implies that they couldn't get rid of every need for it.

    Naming


    From a quick survey of other languages:

    ML uses case ... of to introduce the multi-case pattern-matching construct, and | to separate cases; a function definition can also be defined in terms of cases on its parameters, without needing the case bit. Some ML descendants change the | to a different symbol, or change it to the keyword of (and remove that from the outer clause). But newer descendants mostly strip it down even further. Erlang makes every binding a single-case pattern match, so you don't need any syntax. Haskell embeds the multi-case match into various bits of syntax, like binding, and replaces the | with whitespace, so there's basically no syntax anywhere.

    Meanwhile, to people coming from the imperative world, C's switch and case are familiar. Whether that's a good thing, or an attractive nuisance, seems to be up for debate. Swift deliberately uses both of the same words. Many other languages (Scala, Rust, draft ES7) use match instead of switch, but some of them still use case for the individual cases. Others avoid both C keywords.

    One advantage of using the word case this way is that, unlike of or a symbol or whitespace, it reads well in single-case stripped-down constructs, like Scala's single-case simple anon functions ({ case(a, b) => a+b }) or Swift's if case and similar statements (if case .Point(let x, let y) = point where x == y { print("diagonal at \(x)") }).

    Anyway, my use of case and of probably isn't actually familiar to anyone who isn't currently in a late-80s university class, and it collides with C just enough to be misleading but not enough to be comforting. So, I think either match/case or switch/case would fit Python better. (And, if C# plans to set the precedent that switch means statement and match means expression--coinciding with Swift's switch statement and Scala's match expression--maybe it's not worth fighting that.)

    Use cases


    The main thing stopping anyone (or at least me) from writing a serious proposal for adding pattern matching to Python is the lack of good use cases. Everyone who uses any language built around pattern matching knows that pattern matching is useful... but coming up with actual examples that make sense in Python is harder than it should be. And maybe that means something.

    Part of the problem is that most of the examples that aren't just toys tend to be things that can be better done in other ways in Python:

    • Option, Error, etc. types are an alternative to exceptions, and Python uses exceptions pervasively instead. (Plus, as Scala implies, pattern matching may not even be the best way to handle these most of the time.)
    • Decomposing records in general tends to fight against OO encapsulation. While abstract matching solves that problem, most examples, even in languages like Scala and F# that have abstract matching, seem to be borrowed from languages that don't (and written around case classes, records, or whatever they call them). While such cases might arise in Python, the fact that we don't generally use namedtuples all over the place (and they're buried in collections rather than builtins) and we don't have more general ADT-like features implies otherwise.
    • Simple switching functions like factorial (whether with a match expression, or as pattern-matched overloads) only really look simpler with pattern matching when you define them recursively. But in Python, you generally wouldn't define them recursively.

    • Recursively decomposing (cons) lists (and generalizing that to, e.g., abstract list-like objects, as in the F# example above) has the same problem, and the further problem that you rarely use cons lists in Python, and the further problem that, even if you did, you'd almost certainly want to use the iterator protocol on them.

    • Recursively decomposing more complex recursive structures that don't fit the iterator protocol seems more promising, but again, you'd usually want to process them iteratively rather than recursively if they're large enough to be a problem. Plus, you don't often run into deeply recursive data structures anyway--or, if you do, they tend to be implicit in JSON or similar, and pattern matching is more verbose there, not less; the benefit of using patterns there is static type checking, which you aren't going to get in Python.

    • Many examples would be more idiomatically written as method dispatch in an OO language. Even the more complex related problems (like operator double dispatch) already have better idiomatic solutions in Python.

    • The equivalent of a C switch statement is not very exciting, even when extended to matching enum constants or strings instead of just integers, and that's already readable enough as an elif chain. Or, in many cases, you'd just use a dict.

    There must be examples in my code or other code in other languages that would still look like idiomatic Python-plus-patterns when translated... but I haven't found them yet.

    Conclusions (or lack thereof)


    A year and a half ago, I concluded that, even though I think I came up with a reasonable and implementable proposal for pattern matching in Python, I don't think it's a feature Python needs, or should get, at least at present.

    The fact that Scala's designers came up with a similar solution to what I proposed, and it's very heavily used in idiomatic Scala, even by people who are mostly coming to the language from Java rather than ML or Haskell or something, is pretty heartening.

    There are definitely ways that Scala's feature is more flexible than the one I proposed--and, for the most part, we can pretty easily add that flexibility to my proposal, making it even better.

    On the other hand, the fact that Scala seems to encourage alternatives to pattern matching in some of the places I'd use it in Haskell or ML makes me think it might turn out to be underused in Python as well. Or maybe not--after all, maybe you sometimes have to fall back to pattern matching (or get much more verbose) even though the alternatives are there, as is presumably the case in Swift?

    So... now I'm less against the idea of adding pattern matching to Python. But I'm still not sufficiently excited about it to build an implementation, write a PEP, try and convince the core developers, etc. And if switching between Haskell and Python didn't make me ache for pattern matching in Python, I doubt that switching between Scala and Python will either. But maybe I'm wrong; time will tell.

  19. About a year ago, Jules Jacobs wrote a series (part 1 and part 2, with part 3 still forthcoming) on the best collections library design.

    Much of what follows is a repeat of ideas from Swift-style map and filter views and other posts like Iterator terminology, but hopefully seeing it from a different viewpoint will make it clearer. (At least writing this response to his post made some of it clearer to me.)

    Of course Jacobs acknowledges that there are tradeoffs, and his final best design isn't perfect. It's basically an extension of Clojure's reducer/transducer design (by splitting transducers into refolders and folders, aka transformers and consumers, you can design everything around composing transformers without running into all of the same problems). He's candid about what's missing.

    But I think he's missed something huge.

    The starting idea is that you want a single, fully-general sequence type, with functions to convert to and from all of your different collection types. This way, you only need to write all of your transformers (map, filter, zip, etc.) once, and, with minimal compiler/runtime support, fromSet.map(f).toSet can be just as efficient as a custom setMap f would be (and likewise for list, array, sorted set, or custom third-party types).

    Of course this isn't a revelation. The fundamental idea behind the STL (which eventually became the nucleus of the C++ standard library) decades ago was that you can split up iteration and algorithms, so if you have N collection types and M algorithms you only need to write N+M functions instead of N*M. And this is built into Python at a fundamental level: map, filter, zip, etc. consume any kind of iterable (which means collections or iterators) and produce an iterator. And most collection types can construct themselves from any iterable. So, the Python equivalent is just set(map(f, x)).

    But newer collection libraries since 1979's original STL design have run into the same N*M issues, and tried to solve them in new ways, and run into problems doing so. For example, in Scala, map and friends try to return the same collection type as their argument, meaning that things like zip or concat have to choose one argument and work out how to convert the other, and have to be able to programmatically figure out how to build a list[T,U] when specialized on list[T], set[U]. Just giving you back a general sequence type, or iterator, or whatever that can be lazily and cheaply converted to list[T,U] (or to any other type you prefer, or just used as-is) by the user avoids all of those problems.

    Back to the future (1979)


    But STL had something that Python doesn't, and neither do any of Jacobs' steps toward a solution: instead of just one kind of iteration, there are four kinds. (I'm ignoring output iterators, and mutating algorithms in general, in what follows, to keep things simple.)

    Python iteration is one-pass, forward-only (which STL calls "input").

    But sometimes this isn't sufficient. For example, if you want to build an average function with one-pass iteration, you need to keep the running total and count in some kind of persistent state. But with (multi-pass) forward iteration, you can just compose the sum and len functions in the obvious way. Jacobs's design takes this into account.
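
    In Python terms, the difference looks something like this sketch:

    def average_one_pass(iterable):
        # One-pass ("input") iteration: the running total and count have to
        # live in explicit persistent state.
        total = count = 0
        for value in iterable:
            total += value
            count += 1
        return total / count
    
    def average_multi_pass(seq):
        # Multi-pass iteration: just compose the existing pieces.
        return sum(seq) / len(seq)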

    But sometimes, this isn't sufficient either. For example, if you want to build unique_justseen with forward iteration, you need to keep the previously-seen value in some kind of persistent state. But with bidirectional iteration, you can just look at the previous value directly. Or, more simply, imagine how you'd build a reverse function with only forward iteration. (In Python, of course, there's a special method for producing reversed iterables, as part of the Sequence protocol.) Jacobs's design doesn't take this into account; the only way to reverse his sequences is to stack up all of the values.
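
    For reference, here's the forward-only version of unique_justseen, with the previously-seen value carried as explicit state (a quick sketch, not the itertools recipe):

    def unique_justseen(iterable):
        sentinel = object()
        prev = sentinel
        for value in iterable:
            if value != prev:
                yield value
            prev = value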

    And sometimes, even this isn't sufficient, because you want to randomly access values. For example, there are a variety of sorting, permuting, etc. algorithms that are obvious with random access and either more difficult or impossible without. Jacobs's design doesn't take this into account either.

    The way STL handles this is simple, if a bit clumsy in practice: instead of dealing with sequences, you deal with iterator ranges, and there are four different iterator types, with the obvious chain of subtype relationships. Some algorithms require random-access iterators; some only require forward iterators, and will therefore also work with bidirectional or random-access iterators; etc. In languages with the right kind of overloading, you can even have an algorithm that requires bidirectional, but has an optimized specialization for random-access (e.g., it can run in linear instead of log-linear time, or constant rather than linear space, with random-access iterators).

    Swift makes things simpler, even if the idea is more confusing at first. Swift collections use indexing for everything. Random-access iteration is indexing with integers, as you're used to. Bidirectional iteration is indexing with special index types, which you can only get by asking a collection for its start or end or by incrementing or decrementing another index. And so on. Forward iteration is the same, except there's no decrement.

    So, in Swift, you define map as a function that takes a sequence, and returns some sequence type (a MapView) that's indexable by whatever indexes worked on the original sequence. When you write m = map(f, x) and then ask for m[idx], you get f(x[idx]). And likewise for zip, and plenty of other functions.
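
    A rough Python sketch of that kind of view--Python only has integer indexing, so this only models the random-access case, and MapView is just a made-up name:

    class MapView:
        """Lazy map over a random-access sequence: indexing applies f on demand."""
        def __init__(self, func, seq):
            self.func, self.seq = func, seq
        def __len__(self):
            return len(self.seq)
        def __getitem__(self, idx):
            return self.func(self.seq[idx])
    
    m = MapView(lambda v: v * v, [1, 2, 3, 4])
    print(m[2])       # 9, computed on demand
    print(list(m))    # [1, 4, 9, 16]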

    Of course filter is a bit of a problem; there's no way to get the 7th element of a filtered list without looking at the first 6 predicate-passing elements and all predicate-failing elements up to the next passing one. So, filter can only give you bidirectional indexing, even if you give it a random-indexable sequence. But that's not all that complicated.

    It should be possible to design a language that lets you write the code for map, filter, etc. only once, and have it automatically give you back the appropriate type (which would be inferrable at compile time from the sequence type of the argument).

    How does this fit in with Jacobs's pros and cons?


    The first thing to notice is that as long as you only use the results in one-shot style (even if the input supports more powerful indexing), the semantics are identical to Python's iterators. So you get all the same benefits and limitations.

    But if you, e.g., have a random-access sequence, and you pass it through map to get another random-access sequence, the result is still data-parallelizable in exactly the same way as the input, which is an advantage that Python iterators (and lazy lists, etc.) don't have.
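
    For example, a random-access view can be evaluated in index chunks, in any order; a rough sketch (whether this actually speeds anything up depends on the function and the executor, but the point is that the work can be split at all):

        from concurrent.futures import ThreadPoolExecutor

        def evaluate_in_chunks(view, chunks=4):
            # Each worker takes its own range of indexes; a one-shot iterator
            # couldn't be divided up like this.
            n = len(view)
            bounds = [(i * n // chunks, (i + 1) * n // chunks) for i in range(chunks)]
            def evaluate(bound):
                lo, hi = bound
                return [view[i] for i in range(lo, hi)]
            with ThreadPoolExecutor(max_workers=chunks) as pool:
                pieces = list(pool.map(evaluate, bounds))
            return [value for piece in pieces for value in piece]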

    So, it's the best of all worlds, right?

    Well, not quite; there's a new con that Jacobs's taxonomy didn't anticipate. If you write m = map(f, x) and then ask for m[7] twice, you're going to call f(x[7]) twice. For some algorithms, this could be a big performance hit. (And, if you stop ignoring mutation and side effects, it could even be incorrect or dangerous.)

    Of course you can manually slap memoization onto a lazy sequence, but then you're risking the memory benefits. For example, one-shot iterating over all adjacent pairs of values should only take constant space (two values at a time), but a naively-memoized map view will use O(N). Sure, you can use an LRU cache with depth 1 for memoization, but then you need to convince yourself that you're never re-calculating anything. If you can do that, you could probably just as easily have written the code that uses explicit persistent state to store the last value and just stick with a one-shot iterator.
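
    For comparison, the explicit-persistent-state version of the adjacent-pairs case is short (this is essentially what itertools.pairwise does):

        def pairwise(iterable):
            # One-shot adjacent pairs in constant space: only the last value is kept.
            it = iter(iterable)
            try:
                previous = next(it)
            except StopIteration:
                return
            for value in it:
                yield previous, value
                previous = value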

    I don't know whether this is a real problem, and, if so, how often, and how easy it generally is to work around. It may be a complication worth ignoring, or it may make the entire idea into an attractive nuisance.

    Are there really only four kinds of iteration?


    The original goal was to get N+M functions instead of N*M. But if there are K kinds of iteration, we've really got N+M*K, which is just as bad as where we started, right?

    Well, no. N, the number of possible collection types, is huge and open-ended; K, the number of possible iteration types, is small and fixed.

    Maybe you can imagine ways to extend random-access indexing to things like NumPy multi-dimensional slices or Mathematica tree level selections, but if you think through what map should do with those, it's pretty clear that there shouldn't be any new code to write. (There might be some new algorithms that require these stronger kinds of indexes, but that's fine.)

    Also, how many of these can you imagine? People have come up with half a dozen ways to index or iterate over the past half-century, as opposed to dozens of collection types with hundreds of different variations and implementations, with new ones appearing every year.

    The hard part


    So, I said above that it should be possible to build a language that lets you just write map once, and it'll work appropriately with any of the four iterator types (or any new subtypes added to the iterator tower). Of course if you want to write an algorithm with a special optimized implementation for stronger iterators, you'd need to write it twice, but that's pretty rare, and never mandatory.

    And ideally, the static type system should be able to infer that map[a->b, Set[a]] returns the same kind of iterator as Set, but with element type b instead of a. (If filter has to manually specify that it can't handle anything better than bidirectional, instead of the type system inferring it, that's acceptable.)

    So, is that doable?

    Swift doesn't succeed. You have to write a lot of boilerplate, and to repeat yourself a lot more than should be necessary. But is that a limitation of Swift, or of the idea? I'm working this through in my spare time, but I don't have as much of that as I'd like.

    Fold, refold, unfold, spindle, manipulate


    Going back to Jacobs's posts, he gets into a number of things I haven't covered here. Ideally, you want to be able to support push producers as well as pull producers, and take advantage of parallelism at the production side as well as the processing side, and provide some way to keep state persistently within a refold--and doing all of those at the same time is impossible.

    He takes a step toward a solution, by proposing to redefine map, filter, etc. in terms of what chains of these transformers do. I believe the idea is that you can capture all kinds of transformers by writing a single producer-adapter for each kind of input and a single consumer-adapter for each kind of output. Since he never finished part 3, I'm not positive this is what he was going to write, nor am I positive that it would work, but at least it seems like this is where he was going.

    If so, it has a lot in common with replacing one kind of sequence with four kinds with different indexing. It means that not all functions can work with all producers or all consumers--filter can't work with a random-access consumer, and, likewise, zip can't work with a push producer--but hopefully the system can take care of this automatically (or, at worst, require you to specify it with a simple tag), leaving you to just write the basic implementation of filter in the obvious way.

    The big question is, what happens when you put these two together?

    I think his proposal (that I'm imagining) effectively needs an adapter for each combination of each of the binary questions (push vs. pull, parallelizable, etc.), and, ignoring the nonsensical ones, this means something like 8 adapters on each side. And I think the adapters may need to be able to interact differently with different kinds of iteration, rather than being purely orthogonal, which means we've now got 32 adapters on each side. Put another way, N+M*K+K*J isn't too bad if K and J are small and constant, but writing 64 functions, almost all of which are semantically empty, still seems pretty ugly. Hopefully, if I'm right about all of this, most or all of the adapters could be trivially auto-generated (which means that a clever enough language wouldn't require you to implement them at all, of course).

    Anyway, that's a lot of ifs toward the end. But unless Jacobs posts part 3, or I get the time to look into this in more detail myself, that's as good as I can do.

  20. In three separate discussions on the Python mailing lists this month, people have objected to some design because it leaks something into the enclosing scope. But "leaks into the enclosing scope" isn't a real problem in itself; it's shorthand for the actual problem they're complaining about, and it can stand for many different problems.

    Often, people are just worrying because what they're doing would be a problem in C or Lisp or whatever their first language was. Many of these problems don't exist in Python, or exist only in theory and are rarely if ever relevant to real Python scripts or modules, in which case the solution is to just stop worrying.

    Sometimes, there is a real problem. But it's worth knowing what the real problem is, so you can solve that.

    Background

    For example, consider the usual way of adding optional parameters:
        def spam(eggs=None):
            if eggs is not None:
                ...  # the caller passed an explicit value
    The problem here is that None is a perfectly good value for many uses, and you may want to distinguish spam(None) from spam().

    One possibility is:
        def spam(*args):
            if args:
                eggs, = args  # unpack the single optional argument
    But this gives you a much less meaningful signature, and takes more code, and provides less useful error messages (e.g., if you pass two arguments, instead of a TypeError telling you that spam takes at most 1 positional argument, you get a ValueError telling you that there were too many values to unpack).

    So, the idiomatic solution is this:
        sentinel = object()
        def spam(eggs=sentinel):
            if eggs is not sentinel:
                ...  # the caller really passed a value (even None counts)
    That solves both problems—None may be a valid value for eggs, but sentinel surely isn't, because you just invented it here, and don't use it for any meaning except "not a valid value for eggs". And the signature makes it pretty clear that "eggs" is an optional parameter, and the errors are exactly what you'd expect, and so on.

    But it means the sentinel is now available in the same scope as the function.

    So what? Well, that's what this post is about.

    It's public

    Often, the scope in question is the global scope of some module. Putting sentinel in the global scope of a module means anyone who uses the module will see it. This means that from module import * imports it. And that the inspect module, and similar tools, will show it as a part of the module's interface—your IDE or REPL may offer it as a possible completion for module. and module.s, your automatic documentation generator may list it in the documentation of the module's public interface, etc.

    But the problem here isn't that it's in the module's global scope; it's that it's public in the module's global scope. This is exactly what the private underscore convention is for: use the name _sentinel instead of sentinel, and it's no longer public. People can still find it if they go looking for it, but most tools will not offer it to them—and, even if they do find it, they'll know they're not supposed to use it.

    Also, any module that you intend people to use with from module import * really should have an __all__, which of course shouldn't include _sentinel.
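
    Putting those two conventions together, a minimal sketch:

        __all__ = ['spam']        # from module import * only picks up spam

        _sentinel = object()      # private by convention, and not in __all__

        def spam(eggs=_sentinel):
            if eggs is _sentinel:
                ...               # spam() was called with no argument at all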

    But private isn't really private

    Some people object that Python private names aren't really private. A malicious user can still access module._sentinel and, e.g., put it in the middle of a list, causing some function in your module to stop checking the list halfway through so they can sneak some other data through your protection.

    Nothing is private in Python. There's no way to hide anything. If they couldn't access module._sentinel, they could still access module.spam.__defaults__[0]. Or, if you somehow hid it in the constants, module.spam.__code__.co_consts[0].

    This is inherent to any language that allows reflection. Even in languages that don't, like C++, people can still find the "hidden" values by reinterpret_cast<char *> and similar tricks. Unless your language is specifically designed to protect pieces of code from each other (as Java is), there's nothing you can do about this. If you don't worry about it in every other language, don't worry about it in Python.

    It may collide with something else

    It's exceedingly unlikely that you actually had something different in the module named _sentinel. But what if you use this same idiom twice in the same module? For example:
        _sentinel = object()
        def spam(eggs=_sentinel):
            if eggs is _sentinel:
                ...  # no argument was passed

        # ... much later in the same module ...

        _sentinel = object()
        def cheese(ham=_sentinel):
            if ham is _sentinel:
                ...  # no argument was passed
    Now, if you call spam(), the default value is the first _sentinel (because that gets bound at function creation time), but the if statement is checking it against the second one (because that gets looked up at function call time). Oops!

    But really, is this ever a problem? Unless you're writing 10000-line modules, or writing your modules by blind copy-paste, it should never come up. (And if you are, don't do that!) If you need multiple functions with sentinels in the same module, you either use the same one for all of them, or use a different name for each one.

    The one time this can come up is in auto-generated code. You need to make your code generator create a guaranteed-new name each time, in case it gets run twice on the same module.
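
    For example, a hypothetical generator helper can just mint a fresh private name every time it's asked:

        import uuid

        def fresh_sentinel_name():
            # Effectively unique each call, so running the generator twice on the
            # same module can't produce colliding sentinel names.
            return '_sentinel_' + uuid.uuid4().hex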

    It wastes space

    Micro-optimizing space in a Python module is a mug's game. The value is already in the module. So is the name (for reflection, help, etc.). It does add one more entry to a dict that links the two together. That's 24 bytes on most 64-bit platforms. Even an empty dict is 288 bytes, and a module dict usually has at least __name__, __package__, __doc__, __all__, __loader__, __spec__, on top of your public and private names. This isn't a problem worth solving. If it were, you'd want to write a custom module loader that returned a different object that uses __slots__ (or C code) instead of a __dict__.

    It's slow

    Comparing to None is fast, because None is stored as a constant embedded in the compiled function, but comparing to _sentinel is slower, because _sentinel is compiled as a global name lookup, which generally means a dict lookup on the module's globals.
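
    You can see the difference with dis (the exact bytecode varies by CPython version, but the None comparison compiles to a LOAD_CONST while the sentinel comparison compiles to a LOAD_GLOBAL):

        import dis

        _sentinel = object()

        def with_none(eggs=None):
            return eggs is None

        def with_sentinel(eggs=_sentinel):
            return eggs is _sentinel

        dis.dis(with_none)       # look for LOAD_CONST None
        dis.dis(with_sentinel)   # look for LOAD_GLOBAL _sentinel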

    You can, in fact, solve that:
        _sentinel = object()
        def spam(eggs=_sentinel, _sentinel=_sentinel):
            if eggs is _sentinel:
                ...  # no argument was passed
    Now, you're comparing the first argument to the second argument, and they both default to the same reference stored in the function's fast locals, instead of the second one being a global lookup.

    But is this really a problem that needs to be solved? Yes, global lookups are slow. But so are function calls. And whatever you're actually going to do in the function is almost surely a lot slower. And even if everything were blazingly fast, the branch prediction failure on half the calls to spam would probably be much more costly than an access that almost always hits the cache.

    Accessing a global inside a loop over 1000000 elements may be worth optimizing away, but here?
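
    If you suspect it matters for your code, measure it; a quick sketch with timeit (the helper names are mine):

        import timeit

        _sentinel = object()

        def spam_global(eggs=_sentinel):
            return eggs is _sentinel       # global lookup on each call

        def spam_local(eggs=_sentinel, _sentinel=_sentinel):
            return eggs is _sentinel       # fast local lookup

        print(timeit.timeit(spam_global))  # timeit accepts a callable directly
        print(timeit.timeit(spam_local))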

    This micro-optimization used to be fairly common in early large Python apps—the Zope framework uses it all over the place. People also used to do things like pushing unbound methods as default values and then passing the self explicitly to avoid the attribute lookup and descriptor call. But as people have actually learned to profile Python code over the past couple decades (and as Python has improved, and CPUs have gotten more complicated…), it's become a lot less common, because it almost never provides any measurable benefit. So it probably won't help you here.

    This also solves the collision problem, and allows you to solve the visibility problem by just doing del _sentinel after the function definition. But neither of those is really a problem either, as explained above.

    And the cost is that it's less clear that, while eggs is an optional parameter, _sentinel is a "never pass this" parameter. The underscore helps, but this idiom isn't as widely known by programmers, or as widely supported by tools, as the usual uses of the underscore.

    Conclusion

    When you're worried that a value leaks into the enclosing scope, all of the real problems can be solved by just underscore-prefixing the name and leaving it out of __all__, and the other problems rarely need to be solved.
