Let's say you have a good idea for a change to Python. How do you implement it?

Well, first you go to the devguide, where you see that if you want to change the grammar there's a long checklist of things you usually need to change. Then you look at the source and discover that some of those things aren't even really documented; you just have to pick it up from the surrounding code.

But in many cases, that isn't necessary during the exploratory phase. Sure, a real implementation is going to need to do all that stuff, but you can slap together a test implementation in pure Python, without that much work.

Please keep in mind that the code in this post, and in the linked examples, is not the best way to do things, it's just the shortest way to do things. That's fine when you're just tinkering with an idea and want to get the boilerplate out of the way, but if you actually want to write and deploy real code that uses an import hook to modify Python source, read the docs and do it right.

Import Hooks

See PEP 302 for details on the import hook system, and the docs on the import system for details on what each part does. There's a lot to take in, but if you want a quick&dirty hack, it's as simple as this:
    import importlib.machinery
    import importlib.util
    import sys

    def _call_with_frames_removed(f, *args, **kwargs):
        return f(*args, **kwargs)

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)

    _real_pathfinder = sys.meta_path[-1]

    class MyFinder(type(_real_pathfinder)):
        @classmethod
        def find_module(cls, fullname, path=None):
            spec = _real_pathfinder.find_spec(fullname, path)
            if not spec:
                return spec
            loader = spec.loader
            if type(loader).__name__ == 'SourceFileLoader':
                # Swap in our subclass so its source_to_code gets called.
                loader.__class__ = MyLoader
            return loader

    sys.meta_path[-1] = MyFinder
Again, if it isn't obvious from the implementation that this is a hack that depends on specific details of CPython 3.4's import machinery: it's a hack that depends on specific details of CPython 3.4's import machinery.

Now, inside that source_to_code method, you can transform the code in any of five ways:
  • As bytes
  • As text
  • As tokens
  • As an abstract syntax tree
  • As bytecode
I'll explain the five separately below, but first, a few notes that apply to all of them:
  • What you're changing is how Python interprets .py files. This means that, unless you change some other stuff, anything that's already been compiled to a .pyc won't be recompiled.
  • Because you're hacking the import system at runtime, this only affects modules that are loaded after you make these changes. So, you can't use this for single-script programs; you have to split them up into a module that does the work, and a script that hacks the import system and then imports that module (see the sketch below).
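For example, that split might look like this (myhook and realprogram are hypothetical names; myhook is a module containing the finder/loader snippet above, so importing it installs the hook):
    # main.py -- the tiny driver script
    import myhook            # installs MyFinder as a side effect of the import
    import realprogram       # imported *after* the hook, so it goes through MyLoader
    realprogram.main()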

Transforming bytecode

The compile function returns you a code object, whose bytecode is just a bytes object in its co_code attribute.

Since both code and bytes are immutable, to change this, you need to build up a new bytes object (that's normal for Python), and then build a new code object with the new co_code and copies of the other members. You may sometimes also want to change some of those other members—e.g., if you want to add the statement a = 2 somewhere, that a has to go into co_names, and that 2 into co_consts so that the bytecode in co_code can reference them by index into the lists of names and consts.
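If you've never done that before, here's a rough sketch (hedged: the CodeType constructor below uses CPython 3.4's field order; later versions add fields, and 3.8+ has a much friendlier code.replace method):
    import types

    def rebuild_code(code, new_co_code, new_consts=None, new_names=None):
        # Copy every field of the old code object, swapping in new bytecode
        # (and, optionally, new consts and names).
        return types.CodeType(
            code.co_argcount, code.co_kwonlyargcount, code.co_nlocals,
            code.co_stacksize, code.co_flags, new_co_code,
            new_consts if new_consts is not None else code.co_consts,
            new_names if new_names is not None else code.co_names,
            code.co_varnames, code.co_filename, code.co_name,
            code.co_firstlineno, code.co_lnotab,
            code.co_freevars, code.co_cellvars)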

Non-trivial hacks to the bytecode can be complicated. For example, if you insert any new code in the middle, you have to adjust all the jumps that point after it to new values. There are some great modules for Python 2.x, like byteplay, that make this easier for you, but as far as I know, none of them have been ported to 3.x. (From a quick, incomplete attempt at it myself, I think rewriting one of these from scratch to use the new features in 3.4 would be a lot easier than fixing one of them up to be compatible with 3.4. For example, up to Python 2.5, every bytecode has a fixed effect on the stack; later versions have made that much more complicated, and trying to hack up byteplay to understand how some of these bytecodes work with just the local knowledge of the current bytecode and current stack is either not possible or at least not easy enough to do in a quick 60-minute hack…)

Unfortunately, I don't have any examples to show for this one that work with Python 3.

Transforming ASTs

The AST is usually the easiest place to do most things—as long as what you're trying to do is add new semantics, not new syntax, that is. This is the same level that macros work on in other languages, and the level that MacroPy works on in Python.

In many cases, even if what you want to do really does require a syntax change, you can come up with a way to hack it into the existing syntax that's good enough for exploration.

But if you look at the source_to_code method above, there's no abstract syntax tree anywhere. So, how do you hack on one?

The secret is that the compile function can stop after AST generation (by passing the flag ast.PyCF_ONLY_AST—or you can just call ast.parse instead of compile), and it can take an AST instead of source code. So:
    import ast

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # The first compile stops at the AST, thanks to PyCF_ONLY_AST...
            tree = _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize,
                                             flags=ast.PyCF_ONLY_AST)
            tree = do_stuff(tree)
            # ...and the second compile takes the transformed AST instead of source.
            return _call_with_frames_removed(compile, tree, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)
Now, what do you do with one of those trees? You create a NodeTransformer and visit it. The example in the docs is pretty easy to understand. To know what nodes you'll have to deal with and what attributes there are, the best thing to do is spend some time interactively playing with ast.dump(ast.parse(s)) for different strings of source and looking up help(ast.Str) and so on.
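For instance, here's a minimal sketch of a do_stuff for the FloatLiteral idea (it builds the replacement Call node by parsing a string, which sidesteps version differences in the Call node's fields):
    import ast

    class FloatLiteralTransformer(ast.NodeTransformer):
        def visit_Num(self, node):
            if isinstance(node.n, float):
                # By this point 1.2 is already a float, so repr() is the best
                # we can do to recover its source text.
                call = ast.parse('FloatLiteral({!r})'.format(repr(node.n)),
                                 mode='eval').body
                return ast.copy_location(call, node)
            return node

    def do_stuff(tree):
        tree = FloatLiteralTransformer().visit(tree)
        ast.fix_missing_locations(tree)
        return tree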

For a complete quick&dirty example of hacking up the importer to run an AST transform on every Python source module, see floatliteralhack/floatliteral_ast.py or emptyset/emptify.py.

Of course the aforementioned MacroPy is a much better—and less hacky—example. In fact, if you think you need to hack on ASTs, it's usually easier to just use MacroPy than to do it yourself. It comes with a complete framework (that works on Python 2.7 and 3.3, not just 3.4+) full of helpful utilities; many transformations can be written without even directly thinking about the AST, and when you need to, it's as easy as it possibly could be. Using it as sample code, on the other hand, may not be as good an idea unless you need backward compatibility (because it's written for 2.7's less-friendly import system, and works on 3.4+ by effectively treating it as 2.7).

Transforming tokens

The problem with transforming ASTs is that you can't change the syntax, because you're getting to the code after it's already been parsed. So, for example, replacing every 1.2 literal with a FloatLiteral('1.2') call is impossible, because by the time you've got an AST, that 1.2 is already a float, not a string anymore. (Well, for 1.2, that works out anyway, because repr(1.2) == '1.2', but that's not true for every float, and presumably the whole reason you'd want to do this particular transformation is to get the actual string for every float.)

Many syntax transformations can be done after lexical analysis. For example, to replace every 1.2 literal with FloatLiteral('1.2') is easy at the token level: that 1.2 is a NUMBER token whose value is still the string '1.2', and generating the sequence of tokens for FloatLiteral('1.2') is dead simple.

But, as with the AST, there's no place above where we have a token stream. And, unlike the AST case, there's no way to ask compile to stop at just the token level and then pick up after you've transformed the tokens.

Fortunately, the tokenize module has both tokenize and untokenize functions, so you can break the source string into a stream of tokens, transform that stream, and turn it back into source that you can then pass to the compiler to parse. In fact, the example in the docs shows how to do exactly that.
For example:
    import io
    from tokenize import tokenize, untokenize

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            tokens = _retokenize(tokenize(io.BytesIO(source.encode('utf-8')).readline))
            source = untokenize(tokens).decode('utf-8')
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)
That _retokenize function is just an iterable transformation on a stream of token tuples:
    from tokenize import NAME, NUMBER, OP, STRING

    def _retokenize(tokens):
        for num, val, *stuff in tokens:
            if num == NUMBER and ('.' in val or 'e' in val or 'E' in val):
                # A float literal (crudely detected): wrap its source text
                # in a FloatLiteral(...) call.
                yield NAME, 'FloatLiteral'
                yield OP, '('
                yield STRING, repr(val)
                yield OP, ')'
            else:
                yield num, val
For a complete quick&dirty example of hacking up the importer to run a token stream transform on every Python source module, see floatliteral/floatliteral.

Transforming text

Sometimes, you need to change things in a way that isn't even valid Python at the lexical level, much less the syntactic level. For example, if you want to add identifiers or keywords that aren't made up of Unicode letters and digits, you can't hack that in at the syntax level.

Here, it's obvious where to hack things in. You've got a big str with all of your code; just use your favorite string-manipulation methods on it.
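For instance, a deliberately naive sketch, using ∅ as a hypothetical empty-set literal:
    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # Blindly rewrite every ∅ -- including any inside strings and
            # comments, which is exactly the problem discussed next.
            source = source.replace('∅', 'set()')
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)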

However, this is pretty hard to do without breaking stuff. How do you detect one of those identifiers only where it actually would be an identifier, instead of catching false positives inside strings, comments, etc.? Python isn't a regular language, so trying to do it with regexes is difficult and inefficient (not impossible, since Python's re, like most other modern regex engines, isn't limited to actual regular expressions, but still not fun—see bobince's famous answer on parsing HTML, an even simpler language, with regular expressions).

But there's a handy trick here. Notice that the tokenize docs link to the source, which is relatively simple pure Python. So, you don't have to figure out how to write your own Python-like lexer from scratch, or debug your quick&dirty regex that almost works but not quite; you can just fork the stdlib code and hack it up.

For a complete quick&dirty example of hacking up the importer to run a source text transform on every Python source module, see emptyset/emptify.py.

Transforming bytes

Normally, you don't want to bother with this, unless you're trying to implement a change to how encodings are specified. The decode_source function takes care of figuring out the character set and newlines and then decoding to a str for you. But if you want to replace what it does, you can. It's just a matter of not calling decode_source and doing your own decoding, or modifying data before you call it.
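The shape is the same as the other loaders; something like this, where do_byte_stuff is a hypothetical transform:
    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            # data is the raw bytes of the .py file; transform it before
            # handing it to decode_source (or skip decode_source and decode it yourself).
            data = do_byte_stuff(data)
            source = importlib.util.decode_source(data)
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)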

I don't have any examples for this one, because I've never had a good reason to write one.

Hybrid hacking

As I mentioned in my operator sectioning post, the hard part of hacking the compiler is not the parsing part, but the compiling part. For example, for that hack, I had to modify the way symbol tables are generated, the way bytecode is generated (including adding another macro to insert not-quite-trivial steps into a function scope), and so on. If I'd just stopped at creating the BinOpSect AST node, then I could have transformed that into a lambda expression easily. (Of course that wouldn't have the advantage of having a nice __name__, gensyming a hygienic parameter name, etc., but this was never intended to be production code…)

What's with that _call_with_frames_removed thing?

In Python 2.x, the whole import system is written in C, and any tracebacks that you get from an ImportError just go right from the illegal code in the imported module to the import statement.

In Python 3.3+, the import system is mostly written in Python. That's why we can hook it so easily. But that also means that any tracebacks are full of extra stack frames from the import system itself. That's useful while you're debugging the import hook itself, but once it's working, you don't want to see it on every ImportError or SyntaxError in any module you import.

There's been discussion about better ways to fix this, but as of 3.4, there's only a workaround, documented in _bootstrap.py: any sequence of import code that ends in a call to a function named _call_with_frames_removed gets removed from any tracebacks generated. This is just checked by name, not identity. I'm not sure whether it's better to write your own function, or import an undocumented function from a private module. (One possible advantage of the latter might be that if it breaks in Python 3.6, that will probably mean that the workaround has been replaced by a better solution, so you'll want to know about that…)
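In other words, either of these works (the import assumes the private module keeps that name in your Python version):
    # Option 1: define your own, as at the top of this post.
    def _call_with_frames_removed(f, *args, **kwargs):
        return f(*args, **kwargs)

    # Option 2: borrow the undocumented private one.
    from importlib._bootstrap import _call_with_frames_removed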