Let's say you have a good idea for a change to Python. How do you implement it?

Well, first you go to the devguide, where you'll find that if you want to change the grammar there's a long checklist of things you usually need to change. Then you look at the source and discover that some of those things aren't even really documented; you just have to pick them up from the surrounding code.

But in many cases, that isn't necessary during the exploratory phase. Sure, a real implementation is going to need to do all that stuff, but you can slap together a test implementation in pure Python, without that much work.

Please keep in mind that the code in this post, and in the linked examples, is not the best way to do things, it's just the shortest way to do things. That's fine when you're just tinkering with an idea and want to get the boilerplate out of the way, but if you actually want to write and deploy real code that uses an import hook to modify Python source, read the docs and do it right.

Import Hooks

See PEP 302 for details on the import hook system, and the docs on the import system for details on what each part does. There's a lot to take in, but if you want a quick&dirty hack, it's as simple as this:
    import importlib.machinery
    import importlib.util
    import sys

    def _call_with_frames_removed(f, *args, **kwargs):
        # see the last section for what this wrapper is for
        return f(*args, **kwargs)

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)

    _real_pathfinder = sys.meta_path[-1]
    class MyFinder(type(_real_pathfinder)):
        @classmethod
        def find_module(cls, fullname, path=None):
            spec = _real_pathfinder.find_spec(fullname, path)
            if not spec: return spec
            loader = spec.loader
            if type(loader).__name__ == 'SourceFileLoader':
                # swap in our loader subclass so source_to_code gets called
                loader.__class__ = MyLoader
            return loader

    sys.meta_path[-1] = MyFinder
Again, if it isn't obvious from the implementation that this is a hack that depends on specific details of CPython 3.4's import machinery: it's a hack that depends on specific details of CPython 3.4's import machinery.

Now, inside that source_to_code method, you can transform the code in any of five ways:
  • As bytes
  • As text
  • As tokens
  • As an abstract syntax tree
  • As bytecode
I'll explain the five separately below, but first, a few notes that apply to all of them:
  • What you're changing is how Python interprets .py files. This means that, unless you change some other stuff, anything that's already been compiled to a .pyc won't be recompiled.
  • Because you're hacking the import system at runtime, this only affects modules that are loaded after you make these changes. So, you can't use this for single-script programs; you have to split them up into a module that does the work, and a script that hacks the import system and then imports that module (see the sketch just below this list).
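
For example, a minimal two-file layout might look like this (myhook and realmain are hypothetical names for the hook-installing module and the module holding your actual code):
    # main.py -- the bootstrap script; nothing in this file goes through the hook
    import myhook      # installs MyFinder/MyLoader, as in the snippet above
    import realmain    # now compiled through MyLoader.source_to_code
    realmain.main()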

Transforming bytecode

The compile function returns you a code object, whose bytecode is just a bytes object in its co_code attribute.
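
For example (a quick interactive peek; I'm not showing co_code itself, because the exact bytes vary from version to version):
    >>> code = compile('a = 2', '<test>', 'exec')
    >>> type(code.co_code)
    <class 'bytes'>
    >>> code.co_names
    ('a',)
    >>> code.co_consts  # the 2, plus the implicit None the module body returns
    (2, None)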

Since both code and bytes are immutable, to change this, you need to build up a new bytes object (that's normal for Python), and then build a new code object with the new co_code and copies of the other members. You may sometimes also want to change some of those other members—e.g., if you want to add the statement a = 2 somewhere, that a has to go into co_names, and that 2 into co_consts so that the bytecode in co_code can reference them by index into the lists of names and consts.
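
Here's a minimal sketch of that copy-everything-else dance, assuming CPython 3.4's CodeType signature (it grows extra arguments in later versions, so check help(types.CodeType) on yours first):
    import types

    def replace_code(code, co_code, co_consts=None, co_names=None):
        # build a new code object: new bytecode (and optionally new consts
        # and names), plus copies of every other member
        return types.CodeType(
            code.co_argcount, code.co_kwonlyargcount, code.co_nlocals,
            code.co_stacksize, code.co_flags, co_code,
            co_consts if co_consts is not None else code.co_consts,
            co_names if co_names is not None else code.co_names,
            code.co_varnames, code.co_filename, code.co_name,
            code.co_firstlineno, code.co_lnotab,
            code.co_freevars, code.co_cellvars)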

Non-trivial hacks to the bytecode can be complicated. For example, if you insert any new code in the middle, you have to adjust all the jumps that point past it to new values. There are some great modules for Python 2.x, like byteplay, that make this easier for you, but as far as I know, none of them have been ported to 3.x. (From a quick, incomplete attempt at it myself, I think rewriting one of these from scratch to use the new features in 3.4 would be a lot easier than fixing one of them up to be compatible with 3.4. For example, up to Python 2.5, every bytecode had a fixed effect on the stack; later versions have made that much more complicated, and trying to hack up byteplay to figure out what some of these bytecodes do with only local knowledge of the current bytecode and current stack is either not possible, or at least not easy enough to do in a quick 60-minute hack…)

Unfortunately, I don't have any examples to show for this one that work with Python 3.

Transforming ASTs

The AST is usually the easiest place to do most things—as long as what you're trying to do is add new semantics, not new syntax, that is. This is the same level that macros work on in other languages, and the level that MacroPy works on in Python.

In many cases, even if what you want to do really does require a syntax change, you can come up with a way to hack it into the existing syntax that's good enough for exploration.

But if you look at the source_to_code method, we don't have an abstract syntax tree anywhere. So, how do you hack on one?

The secret is that the compile function can stop after AST generation (by passing the flag ast.PyCF_ONLY_AST—or you can just call ast.parse instead of compile), and it can take an AST instead of source code. So:
    import ast

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            tree = _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize,
                                             flags=ast.PyCF_ONLY_AST)
            tree = do_stuff(tree)    # your AST transform goes here; see below
            return _call_with_frames_removed(compile, tree, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)
Now, what do you do with one of those trees? You create a NodeTransformer and visit it. The example in the docs is pretty easy to understand. To know what nodes you'll have to deal with and what attributes there are, the best thing to do is spend some time interactively playing with ast.dump(ast.parse(s)) for different strings of source and looking up help(ast.Str) and so on.
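
As a minimal sketch of a do_stuff (a toy transform I made up for illustration, not one of the linked examples):
    import ast

    class DoubleNumbers(ast.NodeTransformer):
        # toy transform: replace every numeric literal n with n * 2
        def visit_Num(self, node):
            return ast.copy_location(ast.Num(n=node.n * 2), node)

    def do_stuff(tree):
        tree = DoubleNumbers().visit(tree)
        ast.fix_missing_locations(tree)  # new nodes need location info
        return tree
With that installed, every number in any module you import gets doubled: a great way to confuse yourself, but a handy smoke test for the machinery.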

For a complete quick&dirty example of hacking up the importer to run an AST transform on every Python source module, see floatliteralhack/floatliteral_ast.py or emptyset/emptify.py.

Of course the aforementioned MacroPy is a much better—and less hacky—example. In fact, if you think you need to hack on ASTs, it's usually easier to just use MacroPy than to do it yourself. It comes with a complete framework (that works on Python 2.7 and 3.3, not just 3.4+) full of helpful utilities; many transformations can be written without even directly thinking about the AST, and when you need to, it's as easy as it possibly could be. Using it as sample code, on the other hand, may not be as good an idea unless you need backward compatibility (because it's written for 2.7's less-friendly import system, and works on 3.4+ by effectively treating it as 2.7).

Transforming tokens

The problem with transforming ASTs is that you can't change the syntax, because you're getting to the code after it's already been parsed. So, for example, replacing every 1.2 literal with a FloatLiteral('1.2') call is impossible, because by the time you've got an AST, that 1.2 is already a float, not a string anymore. (Well, for 1.2, that works out anyway, because repr(1.2) == '1.2', but that's not true for every float, and presumably the whole reason you'd want to do this particular transformation is to get the actual string for every float.)
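
A quick demonstration of why the repr trick doesn't cut it:
    >>> repr(1.2)   # happens to round-trip
    '1.2'
    >>> repr(1.20)  # but the source spelling '1.20' is already gone
    '1.2'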

Many syntax transformations can be done after lexical analysis. For example, to replace every 1.2 literal with FloatLiteral('1.2') is easy at the token level: that 1.2 is a NUMBER token whose value is still the string '1.2', and generating the sequence of tokens for FloatLiteral('1.2') is dead simple.

But, as with the AST, there's no place above where we have a token stream. And, unlike the AST case, there's no way to ask compile to stop at just the token level and then pick up after you've transformed the tokens.

Fortunately, the tokenize module has both tokenize and untokenize functions, so you can break the source string into a stream of tokens, transform that stream, and turn it back into source that you can then pass to the compiler to parse. In fact, the example in the docs shows how to do exactly that.
For example:
    import io
    from tokenize import tokenize, untokenize

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # tokenize wants a readline over bytes, so re-encode the source
            tokens = _retokenize(tokenize(io.BytesIO(source.encode('utf-8')).readline))
            source = untokenize(tokens).decode('utf-8')
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)
That _retokenize function is just an iterable transformation on a stream of token tuples:
    from tokenize import NAME, NUMBER, OP, STRING

    def _retokenize(tokens):
        for num, val, *stuff in tokens:
            if num == NUMBER and ('.' in val or 'e' in val or 'E' in val):
                # replace the float literal with FloatLiteral('<literal text>')
                yield NAME, 'FloatLiteral'
                yield OP, '('
                yield STRING, repr(val)
                yield OP, ')'
            else:
                yield num, val
For a complete quick&dirty example of hacking up the importer to run a token stream transform on every Python source module, see floatliteral/floatliteral.

Transforming text

Sometimes, you need to change things in a way that isn't even valid Python at the lexical level, much less the syntactic level. For example, if you want to add identifiers or keywords that aren't made up of Unicode letters and digits, you can't hack that in at the syntax level.

Here, it's obvious where to hack things in. You've got a big str with all of your code; just use your favorite string-manipulation methods on it.
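
For instance, a naive sketch (borrowing the empty-set idea from the emptyset hack linked below; the one-line replace is mine):
    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # naively turn every ∅ into a set() call before compiling
            source = source.replace('∅', 'set()')
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)
Of course this also rewrites any ∅ inside a string or comment, which is exactly the problem the next paragraph is about.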

However, this is pretty hard to do without breaking stuff. How do you detect one of those identifiers only where it actually would be an identifier, instead of catching false positives inside strings, comments, etc.? Python isn't a regular language, so trying to do it with regexes is difficult and inefficient (not impossible, since Python's re, like most other modern regex engines, isn't limited to actual regular expressions, but still not fun—see bobince's famous answer about parsing HTML, an even simpler language, with regular expressions).

But there's a handy trick here. Notice that the tokenize docs link to the source, which is relatively simple pure Python. So, you don't have to figure out how to write your own Python-like lexer from scratch, or debug your quick&dirty regex that almost works but not quite; you can just fork the stdlib code and hack it up.

For a complete quick&dirty example of hacking up the importer to run a source text transform on every Python source module, see emptyset/emptify.py.

Transforming bytes

Normally, you don't want to bother with this, unless you're trying to implement a change to how encodings are specified. The decode_source function takes care of figuring out the character set and newlines and then decoding to a str for you. But if you want to replace what it does, you can. It's just a matter of not calling decode_source and doing your own decoding, or modifying data before you call it.
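
If you did want to, the hook looks the same as ever (a sketch; do_byte_stuff is a hypothetical placeholder for your transform):
    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            # work on the raw bytes before decode_source figures out the
            # encoding declaration and newlines
            data = do_byte_stuff(data)
            source = importlib.util.decode_source(data)
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)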

I don't have any examples for this one, because I've never had a good reason to write one.

Hybrid hacking

As I mentioned in my operator sectioning post, the hard part of hacking the compiler is not the parsing part, but the compiling part. For example, for that hack, I had to modify the way symbol tables are generated, the way bytecode is generated (including adding another macro to insert not-quite-trivial steps into a function scope), and so on. If I'd just stopped at creating the BinOpSect AST node, then I could have transformed that into a lambda expression easily. (Of course that wouldn't have the advantage of having a nice __name__, gensyming a hygienic parameter name, etc., but this was never intended to be production code…)
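
To illustrate that last point, here's roughly what the node-to-lambda transform could look like (a sketch; BinOpSect is the custom node from that post, and I'm assuming it carries op and right attributes):
    import ast

    class SectionToLambda(ast.NodeTransformer):
        # NodeTransformer dispatches on the class name, so this fires for
        # the (hypothetical) BinOpSect nodes
        def visit_BinOpSect(self, node):
            # turn, e.g., (* 2) into lambda _x: _x * 2
            args = ast.arguments(
                args=[ast.arg(arg='_x', annotation=None)], vararg=None,
                kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[])
            body = ast.BinOp(left=ast.Name(id='_x', ctx=ast.Load()),
                             op=node.op, right=node.right)
            return ast.copy_location(ast.Lambda(args=args, body=body), node)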

What's with that _call_with_frames_removed thing?

In Python 2.x, the whole import system is written in C, and any tracebacks that you get from an ImportError just go right from the illegal code in the imported module to the import statement.

In Python 3.3+, the import system is mostly written in Python. That's why we can hook it so easily. But that also means that any tracebacks are full of extra stack frames from the import system itself. That's useful while you're debugging the import hook itself, but once it's working, you don't want to see it on every ImportError or SyntaxError in any module you import.

There's been discussion about better ways to fix this, but as of 3.4, there's only a workaround, documented in _bootstrap.py: any sequence of import code that ends in a call to a function named _call_with_frames_removed gets removed from any tracebacks generated. This is just checked by name, not identity. I'm not sure whether it's better to write your own function, or import an undocumented function from a private module. (One possible advantage of the latter might be that if it breaks in Python 3.6, that will probably mean that the workaround has been replaced by a better solution, so you'll want to know about that…)
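
In other words, you have two options (the second import is the undocumented private one, so expect it to move or vanish in some future release):
    # option 1: roll your own, as at the top of this post
    def _call_with_frames_removed(f, *args, **kwargs):
        return f(*args, **kwargs)

    # option 2: borrow the private one from the import machinery
    from importlib._bootstrap import _call_with_frames_removed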