Let's say you have a good idea for a change to Python. How do you implement it?

Well, first you go to the devguide, where you find that changing the grammar means working through a long checklist of things you usually need to change. Then you look at the source and discover that some of those things aren't even really documented; you just have to pick it up from the surrounding code.

But in many cases, that isn't necessary during the exploratory phase. Sure, a real implementation is going to need to do all that stuff, but you can slap together a test implementation in pure Python, without that much work.

Please keep in mind that the code in this post, and in the linked examples, is not the best way to do things, it's just the shortest way to do things. That's fine when you're just tinkering with an idea and want to get the boilerplate out of the way, but if you actually want to write and deploy real code that uses an import hook to modify Python source, read the docs and do it right.

Import Hooks

See PEP 302 for details on the import hook system, and the docs on the import system for details on what each part does. There's a lot to take in, but if you want a quick&dirty hack, it's as simple as this:
    import importlib.machinery
    import importlib.util
    import sys

    def _call_with_frames_removed(f, *args, **kwargs):
        # See the last section for why this name matters.
        return f(*args, **kwargs)

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)

    # The last finder on sys.meta_path is the PathFinder that handles
    # normal filesystem imports; subclass it and swap it out.
    _real_pathfinder = sys.meta_path[-1]

    class MyFinder(type(_real_pathfinder)):
        @classmethod
        def find_spec(cls, fullname, path=None, target=None):
            spec = _real_pathfinder.find_spec(fullname, path, target)
            if spec and type(spec.loader).__name__ == 'SourceFileLoader':
                # Retroactively change the loader's class so that our
                # source_to_code gets called instead of the stock one.
                spec.loader.__class__ = MyLoader
            return spec

    sys.meta_path[-1] = MyFinder
Again, if it isn't obvious from the implementation that this is a hack that depends on specific details of CPython 3.4's import machinery: it's a hack that depends on specific details of CPython 3.4's import machinery.

Now, inside that source_to_code method, you can transform the code in any of four ways:
  • As bytes
  • As text
  • As tokens
  • As an abstract syntax tree
  • As bytecode
I'll explain each of the five separately below, but first, a few notes that apply to all of them:
  • What you're changing is how Python interprets .py files. This means that, unless you change some other stuff (the quickest hack is just deleting the __pycache__ directories), anything that's already been compiled to a .pyc won't be recompiled.
  • Because you're hacking the import system at runtime, this only affects modules that are loaded after you make these changes. So, you can't use this for single-script programs; you have to split them up into a module that does the work, and a script that hacks the import system and then imports that module (as sketched below).
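For example, assuming the hook code above lives in a module named myhook (a name made up just for this sketch), the entry-point script stays tiny:

    # main.py: install the hook first, then import the real code.
    import myhook          # installs MyFinder on sys.meta_path
    import actual_program  # this import now goes through MyLoader
    actual_program.main()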

Transforming bytecode

The compile function returns you a code object, whose bytecode is just a bytes object in its co_code attribute.

Since both code and bytes are immutable, to change this, you need to build up a new bytes object (that's normal for Python), and then build a new code object with the new co_code and copies of the other members. You may sometimes also want to change some of those other members—e.g., if you want to add the statement a = 2 somewhere, that a has to go into co_names, and that 2 into co_consts so that the bytecode in co_code can reference them by index into the lists of names and consts.
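Here's a minimal sketch of rebuilding a code object with a new co_code (and, optionally, new co_consts). The CodeType constructor signature used here is CPython 3.4's; it's implementation-specific and changes between versions:

    import types

    def replace_code(code, new_co_code, new_consts=None):
        # Copy every member except co_code (and maybe co_consts).
        # Note that nested functions live as code objects inside
        # co_consts, so a real transform would recurse into those too.
        return types.CodeType(
            code.co_argcount, code.co_kwonlyargcount, code.co_nlocals,
            code.co_stacksize, code.co_flags, new_co_code,
            new_consts if new_consts is not None else code.co_consts,
            code.co_names, code.co_varnames, code.co_filename,
            code.co_name, code.co_firstlineno, code.co_lnotab,
            code.co_freevars, code.co_cellvars)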

Non-trivial hacks to the bytecode can be complicated. For example, if you insert any new code in the middle, you have to adjust all the jumps that point after it to new values. There are some great modules for Python 2.x, like byteplay, that make this easier for you, but as far as I know, none of them have been ported to 3.x. (From a quick, incomplete attempt at it myself, I think rewriting one of these from scratch to use the new features in 3.4 would be a lot easier than fixing one of them up to be compatible with 3.4. For example, up to Python 2.5, every bytecode has a fixed effect on the stack; later versions have made that much more complicated, and trying to hack up byteplay to understand how some of these bytecodes work with just the local knowledge of the current bytecode and current stack is either not possible or at least not easy enough to do in a quick 60-minute hack…)

Unfortunately, I don't have any examples to show for this one that work with Python 3.

Transforming ASTs

The AST is usually the easiest place to do most things—as long as what you're trying to do is add new semantics, not new syntax, that is. This is the same level that macros work on in other languages, and the level that MacroPy works on in Python.

In many cases, even if what you want to do really does require a syntax change, you can come up with a way to hack it into the existing syntax that's good enough for exploration.

But if you look at the source_to_code method above, there's no abstract syntax tree anywhere. So, how do you hack on one?

The secret is that the compile function can stop after AST generation (by passing the flag ast.PyCF_ONLY_AST—or you can just call ast.parse instead of compile), and it can take an AST instead of source code. So:
    import ast

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # First, stop after parsing: PyCF_ONLY_AST makes compile
            # return a tree instead of a code object.
            tree = _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize,
                                             flags=ast.PyCF_ONLY_AST)
            tree = do_stuff(tree)
            # Then hand the transformed tree back to compile to finish.
            return _call_with_frames_removed(compile, tree, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)
Now, what do you do with one of those trees? You create a NodeTransformer and visit it. The example in the docs is pretty easy to understand. To know what nodes you'll have to deal with and what attributes there are, the best thing to do is spend some time interactively playing with ast.dump(ast.parse(s)) for different strings of source and looking up help(ast.Str) and so on.
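For instance, here's a toy transformer that could serve as the do_stuff above. The name answer is purely hypothetical; the point is just the visit_* shape:

    import ast

    class AnswerTransformer(ast.NodeTransformer):
        # Toy transform: replace every load of the name `answer`
        # with the literal 42.
        def visit_Name(self, node):
            if node.id == 'answer' and isinstance(node.ctx, ast.Load):
                return ast.copy_location(ast.Num(n=42), node)
            return node

    def do_stuff(tree):
        tree = AnswerTransformer().visit(tree)
        ast.fix_missing_locations(tree)
        return tree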

For a complete quick&dirty example of hacking up the importer to run an AST transform on every Python source module, see floatliteralhack/floatliteral_ast.py or emptyset/emptify.py.

Of course the aforementioned MacroPy is a much better—and less hacky—example. In fact, if you think you need to hack on ASTs, it's usually easier to just use MacroPy than to do it yourself. It comes with a complete framework (that works on Python 2.7 and 3.3, not just 3.4+) full of helpful utilities; many transformations can be written without even directly thinking about the AST, and when you need to, it's as easy as it possibly could be. Using it as sample code, on the other hand, may not be as good an idea unless you need backward compatibility (because it's written for 2.7's less-friendly import system, and works on 3.4+ by effectively treating it as 2.7).

Transforming tokens

The problem with transforming ASTs is that you can't change the syntax, because you're getting to the code after it's already been parsed. So, for example, replacing every 1.2 literal with a FloatLiteral('1.2') call is impossible, because by the time you've got an AST, that 1.2 is already a float, not a string anymore. (Well, for 1.2, that works out anyway, because repr(1.2) == '1.2', but that's not true for every float, and presumably the whole reason you'd want to do this particular transformation is to get the actual string for every float.)

Many syntax transformations can be done after lexical analysis. For example, to replace every 1.2 literal with FloatLiteral('1.2') is easy at the token level: that 1.2 is a NUMBER token whose value is still the string '1.2', and generating the sequence of tokens for FloatLiteral('1.2') is dead simple.

But, as with the AST, there's no place above where we have a token stream. And, unlike the AST case, there's no way to ask compile to stop at just the token level and then pick up after you've transformed the tokens.

Fortunately, the tokenize module has both tokenize and untokenize functions, so you can break the source string into a stream of tokens, transform that stream, and turn it back into source that you can then pass to the compiler to parse. In fact, the example in the docs shows how to do exactly that.
For example:
    import io
    from tokenize import tokenize, untokenize

    class MyLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # tokenize wants a readline function over bytes, so encode
            # the decoded source back to UTF-8 and wrap it in a BytesIO.
            tokens = _retokenize(tokenize(io.BytesIO(source.encode('utf-8')).readline))
            source = untokenize(tokens).decode('utf-8')
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)
That _retokenize function is just an iterable transformation on a stream of token tuples:
    from tokenize import NAME, NUMBER, OP, STRING

    def _retokenize(tokens):
        for num, val, *stuff in tokens:
            # Crude float test: any NUMBER with a decimal point or an
            # exponent. (Quick&dirty: this also catches hex digits,
            # as in 0xE, but that's fine for tinkering.)
            if num == NUMBER and ('.' in val or 'e' in val or 'E' in val):
                # Yielding 2-tuples puts untokenize in compatibility
                # mode, so we don't have to fake up position info.
                yield NAME, 'FloatLiteral'
                yield OP, '('
                yield STRING, repr(val)
                yield OP, ')'
            else:
                yield num, val
For a complete quick&dirty example of hacking up the importer to run a token stream transform on every Python source module, see floatliteral/floatliteral.

Transforming text

Sometimes, you need to change things in a way that isn't even valid Python at the lexical level, much less the syntactic level. For example, if you want to add identifiers or keywords that aren't made up of Unicode letters and digits, you can't hack that in at the token level.

Here, it's obvious where to hack things in. You've got a big str with all of your code; just use your favorite string-manipulation methods on it.
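For example, the emptyset hack linked below turns ∅ into an empty-set literal; here's a deliberately naive sketch of the idea (naive because, as the next paragraph explains, plain string replacement has false positives):

    class EmptySetLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            source = importlib.util.decode_source(data)
            # Naive: this also rewrites ∅ inside strings and comments.
            source = source.replace('∅', 'set()')
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)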

However, this is pretty hard to do without breaking stuff. How do you detect one of those identifiers only where it actually would be an identifier, instead of catching false positives inside strings, comments, etc.? Python isn't a regular language, so trying to do this with regexes is difficult and inefficient (not impossible, since Python's re, like most other modern regex engines, isn't limited to actual regular expressions, but still not fun—see bobince's famous answer about parsing HTML, an even simpler language, with regular expressions).

But there's a handy trick here. Notice that the tokenize docs link to the source, which is written in relatively simple pure Python. So, you don't have to figure out how to write your own Python-like lexer from scratch, or debug your quick&dirty regex that almost works but not quite; you can just fork the stdlib code and hack it up.

For a complete quick&dirty example of hacking up the importer to run a source text transform on every Python source module, see emptyset/emptify.py.

Transforming bytes

Normally, you don't want to bother with this, unless you're trying to implement a change to how encodings are specified. The decode_source function takes care of figuring out the character set and newlines and then decoding to a str for you. But if you want to replace what it does, you can. It's just a matter of not calling decode_source and doing your own decoding, or modifying data before you call it.
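If you ever did want to, the shape would look something like this hypothetical loader, which skips decode_source entirely and just decodes everything as Latin-1 (purely illustrative):

    class Latin1Loader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            # Do our own decoding instead of calling decode_source:
            # ignore any coding declaration and treat the bytes as
            # Latin-1. The replace is a rough stand-in for the newline
            # handling decode_source would normally do.
            source = data.decode('latin-1').replace('\r\n', '\n')
            return _call_with_frames_removed(compile, source, path, 'exec',
                                             dont_inherit=True, optimize=_optimize)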

I don't have any real-world examples for this one, because I've never had a good reason to write one.

Hybrid hacking

As I mentioned in my operator sectioning post, the hard part of hacking the compiler is not the parsing part, but the compiling part. For example, for that hack, I had to modify the way symbol tables are generated, the way bytecode is generated (including adding another macro to insert not-quite-trivial steps into a function scope), and so on. If I'd just stopped at creating the BinOpSect AST node, then I could have transformed that into a lambda expression easily. (Of course that wouldn't have the advantage of having a nice __name__, gensyming a hygienic parameter name, etc., but this was never intended to be production code…)

What's with that _call_with_frames_removed thing?

In Python 2.x, the whole import system is written in C, and any tracebacks that you get from an ImportError just go right from the illegal code in the imported module to the import statement.

In Python 3.3+, the import system is mostly written in Python. That's why we can hook it so easily. But that also means that any tracebacks are full of extra stack frames from the import system itself. That's useful while you're debugging the import hook itself, but once it's working, you don't want to see it on every ImportError or SyntaxError in any module you import.

There's been discussion about better ways to fix this, but as of 3.4, there's only a workaround, documented in _bootstrap.py: any sequence of import code that ends in a call to a function named _call_with_frames_removed gets removed from any tracebacks generated. This is just checked by name, not identity. I'm not sure whether it's better to write your own function, or import an undocumented function from a private module. (One possible advantage of the latter might be that if it breaks in Python 3.6, that will probably mean that the workaround has been replaced by a better solution, so you'll want to know about that…)
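If you go the second route, it's a one-liner (relying on a private, undocumented module, so it could break in any release):

    # CPython 3.4: the traceback-stripping code matches frames by
    # function name, so this is equivalent to defining your own copy
    # as at the top of this post.
    from importlib._bootstrap import _call_with_frames_removed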