Let's say you have a good idea for a change to Python. How do you implement it?

Well, first you go to the devguide, where you see that changing the grammar involves a long checklist of things you usually need to change. Then you look at the source and discover that some of those things aren't even really documented; you just have to pick it up from the surrounding code.

But in many cases, that isn't necessary during the exploratory phase. Sure, a real implementation is going to need to do all that stuff, but you can slap together a test implementation in pure Python, without that much work.

Please keep in mind that the code in this post, and in the linked examples, is not the best way to do things, it's just the shortest way to do things. That's fine when you're just tinkering with an idea and want to get the boilerplate out of the way, but if you actually want to write and deploy real code that uses an import hook to modify Python source, read the docs and do it right.

Import Hooks

See PEP 302 for details on the import hook system, and the docs on the import system for details on what each part does. There's a lot to take in, but if you want a quick&dirty hack, it's as simple as this:
    import importlib.machinery
    import importlib.util
    import sys

    def _call_with_frames_removed(f, *args, **kwargs):
        # Frames from functions with this name get stripped out of import
        # tracebacks (see the last section for why this matters).
        return f(*args, **kwargs)

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)

    _real_pathfinder = sys.meta_path[-1]

    class MyFinder(type(_real_pathfinder)):
        @classmethod
        def find_module(cls, fullname, path=None):
            spec = _real_pathfinder.find_spec(fullname, path)
            if not spec:
                return spec
            loader = spec.loader
            if type(loader).__name__ == 'SourceFileLoader':
                # Swap in our loader class so our source_to_code gets used.
                loader.__class__ = MyLoader
            return loader

    sys.meta_path[-1] = MyFinder
Again, if it isn't obvious from the implementation that this is a hack that depends on specific details of CPython 3.4's import machinery: it's a hack that depends on specific details of CPython 3.4's import machinery.

Now, inside that source_to_code method, you can transform the code in any of five ways:
  • As bytes
  • As text
  • As tokens
  • As an abstract syntax tree
  • As bytecode
I'll explain each of the five separately below, but first, a few notes that apply to all of them:
  • What you're changing is how Python interprets .py files. This means that, unless you change some other stuff, anything that's already been compiled to a .pyc won't be recompiled.
  • Because you're hacking the import system at runtime, this only affects modules that are loaded after you make these changes. So, you can't use this for single-script programs; you have to split them up into a module that does the work, and a script that hacks the import system then imports that module.
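For example, the split might look like this (the module names here are made up):

    # script.py -- the tiny entry point you actually run
    import myhook      # hypothetical module that installs the import hook
    import realwork    # imported after the hook, so it gets transformed
    realwork.main()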

Transforming bytecode

The compile function returns you a code object, whose bytecode is just a bytes object in its co_code attribute.

Since both code and bytes are immutable, to change this, you need to build up a new bytes object (that's normal for Python), and then build a new code object with the new co_code and copies of the other members. You may sometimes also want to change some of those other members—e.g., if you want to add the statement a = 2 somewhere, that a has to go into co_names, and that 2 into co_consts so that the bytecode in co_code can reference them by index into the lists of names and consts.
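As a minimal sketch, here's what rebuilding a code object around new bytecode looks like with the 3.4-era types.CodeType signature (the signature changes in later versions, and copying co_consts and friends unchanged only works if your new bytecode doesn't need any new names or constants):

    import types

    def replace_co_code(code, new_bytecode):
        # Copy every field of the old code object, swapping in the new bytecode.
        return types.CodeType(
            code.co_argcount, code.co_kwonlyargcount, code.co_nlocals,
            code.co_stacksize, code.co_flags, new_bytecode,
            code.co_consts, code.co_names, code.co_varnames,
            code.co_filename, code.co_name, code.co_firstlineno,
            code.co_lnotab, code.co_freevars, code.co_cellvars)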

Non-trivial hacks to the bytecode can be complicated. For example, if you insert any new code in the middle, you have to adjust all the jumps that point after it to new values. There are some great modules for Python 2.x, like byteplay, that make this easier for you, but as far as I know, none of them have been ported to 3.x. (From a quick, incomplete attempt at it myself, I think rewriting one of these from scratch to use the new features in 3.4 would be a lot easier than fixing one of them up to be compatible with 3.4. For example, up to Python 2.5, every bytecode has a fixed effect on the stack; later versions have made that much more complicated, and trying to hack up byteplay to understand how some of these bytecodes work with just the local knowledge of the current bytecode and current stack is either not possible or at least not easy enough to do in a quick 60-minute hack…)

Unfortunately, I don't have any examples to show for this one that work with Python 3.

Transforming ASTs

The AST is usually the easiest place to do most things—as long as what you're trying to do is add new semantics, not new syntax, that is. This is the same level that macros work on in other languages, and the level that MacroPy works on in Python.

In many cases, even if what you want to do really does require a syntax change, you can come up with a way to hack it into the existing syntax that's good enough for exploration.

But if you look at the source_to_code method above, there's no abstract syntax tree anywhere in sight. So, how do you get hold of one to hack on?

The secret is that the compile function can stop after AST generation (by passing the flag ast.PyCF_ONLY_AST—or you can just call ast.parse instead of compile), and it can take an AST instead of source code. So:
    import ast

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # Stop after parsing: this gives us an ast.Module, not a code object.
            tree = _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize,
                                             flags=ast.PyCF_ONLY_AST)
            tree = do_stuff(tree)  # your transform goes here
            return _call_with_frames_removed(compile, tree, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)
Now, what do you do with one of those trees? You create a NodeTransformer and visit it. The example in the docs is pretty easy to understand. To know what nodes you'll have to deal with and what attributes there are, the best thing to do is spend some time interactively playing with ast.dump(ast.parse(s)) for different strings of source and looking up help(ast.Str) and so on.
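Here's a minimal sketch of what a do_stuff for the float-wrapping idea might look like, written against the 3.4-era AST (Num rather than the later Constant). FloatLiteral is a hypothetical name the transformed module would need to have in scope, and, as the next section explains, going through repr only recovers the original spelling for some floats:

    import ast

    class FloatLiteralTransformer(ast.NodeTransformer):
        def visit_Num(self, node):
            if isinstance(node.n, float):
                # Build the FloatLiteral(...) call by parsing it, then graft it in.
                call = ast.parse('FloatLiteral(%r)' % repr(node.n), mode='eval').body
                return ast.copy_location(call, node)
            return node

    def do_stuff(tree):
        tree = FloatLiteralTransformer().visit(tree)
        ast.fix_missing_locations(tree)
        return tree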

For a complete quick&dirty example of hacking up the importer to run an AST transform on every Python source module, see .floatliteralhack/floatliteral_ast.py or emptyset/emptify.py.

Of course the aforementioned MacroPy is a much better—and less hacky—example. In fact, if you think you need to hack on ASTs, it's usually easier to just use MacroPy than to do it yourself. It comes with a complete framework (that works on Python 2.7 and 3.3, not just 3.4+) full of helpful utilities; many transformations can be written without even directly thinking about the AST, and when you need to, it's as easy as it possibly could be. Using it as sample code, on the other hand, may not be as good an idea unless you need backward compatibility (because it's written for 2.7's less-friendly import system, and works on 3.4+ by effectively treating it as 2.7).

Transforming tokens

The problem with transforming ASTs is that you can't change the syntax, because you're getting to the code after it's already been parsed. So, for example, replacing every 1.2 literal with a FloatLiteral('1.2') call is impossible, because by the time you've got an AST, that 1.2 is already a float, not a string anymore. (Well, for 1.2, that works out anyway, because repr(1.2) == '1.2', but that's not true for every float, and presumably the whole reason you'd want to do this particular transformation is to get the actual string for every float.)

Many syntax transformations can be done after lexical analysis. For example, to replace every 1.2 literal with FloatLiteral('1.2') is easy at the token level: that 1.2 is a NUMBER token whose value is still the string '1.2', and generating the sequence of tokens for FloatLiteral('1.2') is dead simple.

But, as with the AST, there's no place above where we have a token stream. And, unlike the AST case, there's no way to ask compile to stop at just the token level and then pick up after you've transformed the tokens.

Fortunately, the tokenize module has both tokenize and untokenize functions, so you can break the source string into a stream of tokens, transform that stream, and turn it back into source that you can then pass to the compiler to parse. In fact, the example in the docs shows how to do exactly that.
For example:
    import io
    from tokenize import tokenize, untokenize

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # Re-encode, tokenize, transform, and untokenize back to source.
            tokens = _retokenize(tokenize(io.BytesIO(source.encode('utf-8')).readline))
            source = untokenize(tokens).decode('utf-8')
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)
That _retokenize function is just an iterator transformation on a stream of token tuples:
    from tokenize import NUMBER, NAME, OP, STRING

    def _retokenize(tokens):
        for num, val, *stuff in tokens:
            if num == NUMBER and ('.' in val or 'e' in val or 'E' in val):
                # Replace a float literal with the tokens for FloatLiteral('...').
                yield NAME, 'FloatLiteral'
                yield OP, '('
                yield STRING, repr(val)
                yield OP, ')'
            else:
                yield num, val
For a complete quick&dirty example of hacking up the importer to run a token stream transform on every Python source module, see floatliteral/floatliteral.

Transforming text

Sometimes, you need to change things in a way that isn't even valid Python at the lexical level, much less the syntactic level. For example, if you want to add identifiers or keywords that aren't made up of Unicode letters and digits, you can't hack that in at the token level.

Here, it's obvious where to hack things in. You've got a big str with all of your code; just use your favorite string-manipulation methods on it.
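For example (in the spirit of the empty-set hack), the crudest possible version is a blind string replacement; the ∅ spelling and its replacement here are just made-up illustrations, and this has exactly the false-positive problem described next:

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # Naive: this also rewrites ∅ inside string literals and comments.
            source = source.replace('∅', 'set()')
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)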

However, this is pretty hard to do without breaking stuff. How do you detect one of those identifiers only where it actually would be an identifier, instead of catching false positives inside strings, comments, etc.? Python isn't a regular language, so trying to do it with regexes is difficult and inefficient (not impossible, since Python's re, like most other modern regex engines, isn't limited to actual regular expressions, but still not fun—see bobince's famous answer on parsing HTML, an even simpler language, with regular expressions).

But there's a handy trick here. Notice that the tokenize docs link to the source. Which is in relatively simple pure Python. So, you don't have to figure out how to write your own Python-like lexer from scratch, or debug your quick&dirty regex that almost works but not quite; you can just fork the stdlib code and hack it up.

For a complete quick&dirty example of hacking up the importer to run a source text transform on every Python source module, see emptyset/emptify.py.

Transforming bytes

Normally, you don't want to bother with this, unless you're trying to implement a change to how encodings are specified. The decode_source function takes care of figuring out the character set and newlines and then decoding to a str for you. But if you want to replace what it does, you can. It's just a matter of not calling decode_source and doing your own decoding, or modifying data before you call it.

I don't have any examples for this one, because I've never had a good reason to write one.
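If you ever needed one, though, the skeleton is just a source_to_code that fiddles with the raw bytes before (or instead of) calling decode_source; a minimal sketch, with a made-up placeholder transform:

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            # Hypothetical byte-level fixup before the normal decoding step:
            # treat NEL (0x85) as an ordinary newline.
            data = data.replace(b'\x85', b'\n')
            source = importlib.util.decode_source(data)
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)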

Hybrid hacking

As I mentioned in my operator sectioning post, the hard part of hacking the compiler is not the parsing part, but the compiling part. For example, for that hack, I had to modify the way symbol tables are generated, the way bytecode is generated (including adding another macro to insert not-quite-trivial steps into a function scope), and so on. If I'd just stopped at creating the BinOpSect AST node, then I could have transformed that into a lambda expression easily. (Of course that wouldn't have the advantage of having a nice __name__, gensyming a hygienic parameter name, etc., but this was never intended to be production code…)

What's with that _call_with_frames_removed thing?

In Python 2.x, the whole import system is written in C, and any tracebacks that you get from an ImportError just go right from the illegal code in the imported module to the import statement.

In Python 3.3+, the import system is mostly written in Python. That's why we can hook it so easily. But that also means that any tracebacks are full of extra stack frames from the import system itself. That's useful while you're debugging the import hook itself, but once it's working, you don't want to see it on every ImportError or SyntaxError in any module you import.

There's been discussion about better ways to fix this, but as of 3.4, there's only a workaround, documented in _bootstrap.py: any sequence of import code that ends in a call to a function named _call_with_frames_removed gets removed from any tracebacks generated. This is just checked by name, not identity. I'm not sure whether it's better to write your own function, or import an undocumented function from a private module. (One possible advantage of the latter might be that if it breaks in Python 3.6, that will probably mean that the workaround has been replaced by a better solution, so you'll want to know about that…)