Currently, in CPython, if you want to process bytecode, either in C or in Python, it’s pretty complicated.

The built-in peephole optimizer has to do extra work fixing up jump targets and the line-number table, and just punts on many cases because they’re too hard to deal with. PEP 511 proposes a mechanism for registering third-party (or possibly stdlib) optimizers, and they’ll all have to do the same kind of work.

Code that processes bytecode in function decorators or import hooks, whether for optimization or for other purposes like extending the language, is usually written in Python rather than C, but it’s no simpler there, so such code usually has to turn to a third-party library like byteplay.

Compile process

All of the details are very well documented, but I’ll give a brief summary of the important parts, to make it easy to see where things hook in and what information is available there.

When the Python interpreter starts, it reads interactive input, stdin, or a script file, and sets up a tokenizer. Then, it parses iteratively, generating an AST (abstract syntax tree) for each top-level statement. For each one:

  • Assuming PEP 511 is accepted, it will first pass the AST node through the chain of all registered AST processors.

  • Next, it calls compile on the node,1 which compiles it into an internal structure, assembles that into bytecode and associated values, passes that through the optimizer, and builds a code object.

    • As of Python 3.5, “the optimizer” means the built-in peephole optimizer; assuming PEP 511, it will mean the peephole optimizer plus the chain of all registered bytecode processors.
  • The code object is then passed to exec.

Compiling a statement may recursively call the compiler. For example, a function definition first compiles the function body, then compiles the def statement into a statement code object that makes a function out of that function body’s code object. And functions and classes can of course be nested.

The compiler can also be triggered at runtime, e.g., by a call to compile or exec with source or an AST–or, probably more commonly, by the import system. The default import loader for .py files calls compile on the entire file, which builds an AST for the entire file, then performs the same steps as done for each input statement, to build a module code object. However, an import hook can change this–most simply, by parsing the AST explicitly, processing the AST, then compiling the result, or by processing the bytecode resulting from compile.
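For example, here’s a minimal sketch of such a hook, assuming a process_ast function defined elsewhere (source_to_code is the real hook point on the standard source loader):

import ast
from importlib.machinery import SourceFileLoader

class ASTProcessingLoader(SourceFileLoader):
    # source_to_code sits exactly between parsing and compilation
    def source_to_code(self, data, path='<string>'):
        tree = ast.parse(data)    # parse the AST explicitly
        tree = process_ast(tree)  # hypothetical AST processor
        return compile(tree, path, 'exec')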

It’s also possible to post-process code at runtime. A function decorator is just a normal function called with a function object–but that function object contains a code object, and a decorator can modify the function to use a different code object, or just return a new function.
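For example, this trivial (runnable) decorator just disassembles the code object it’s handed; a real bytecode-processing decorator would assign a modified code object at the same point:

import dis

def dump_bytecode(func):
    # func.__code__ is the code object a processor would replace
    dis.dis(func)
    return func

@dump_bytecode
def add(a, b):
    return a + b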

Bytecode formats

Bytecode and code objects

Bytecode is, as the name implies, just a string of bytes.

The peephole optimizer (and, in the current version of PEP 511, any plugin optimizers) receives this bytecode string, plus a few associated values as in-out parameters. (See PyCode_Optimize in the source.)

Decorators and import hooks receive a code object, which includes the bytecode string and a much larger set of associated values, whose members are described in the inspect docs.

Bytecode is nicely explained in the dis documentation, but I’ll briefly cover it here, and explain where the headaches come in.

Each operation takes either 1 byte (opcodes below HAVE_ARGUMENT) or 3 (opcodes >= HAVE_ARGUMENT, where the extra 2 bytes form a short integer argument). If an operation needs an argument too large to fit, it’s prefixed by a special EXTENDED_ARG opcode, which provides 2 more bytes.2
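To make that concrete, here’s a minimal sketch of a walker for this format (the 3.5-era encoding; roughly what any code that processes raw bytecode has to do):

import dis

def walk(co_code):
    # yield (offset, opname, argument) triples from a raw bytecode string
    i = ext = 0
    while i < len(co_code):
        offset, op = i, co_code[i]
        if op >= dis.HAVE_ARGUMENT:
            arg = ext | co_code[i+1] | (co_code[i+2] << 8)  # little-endian short
            i += 3
        else:
            arg = None
            i += 1
        if op == dis.EXTENDED_ARG:
            ext = arg << 16  # high-order bytes for the next instruction
        else:
            ext = 0
            yield offset, dis.opname[op], arg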

Argument values

Most arguments are stored as offsets into arrays that get stored with the bytecode. There are separate arrays for constant values, local names, free variable names, cell variable names, and other names (globals, attributes, etc.).

For example, a LOAD_CONST None may be stored as a LOAD_CONST with argument 1, meaning that it loads the code object’s co_consts[1] at runtime.
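For example (real output on CPython 3.5):

import dis

def f():
    return 42

print(f.__code__.co_consts)  # (None, 42): None fills the docstring slot
dis.dis(f)
#   2       0 LOAD_CONST    1 (42)    <- argument 1 indexes co_consts
#           3 RETURN_VALUE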

Jumps

Jump instructions (which include things like SETUP_FINALLY or FOR_ITER, which indirectly include jumps) come in two forms: relative, and absolute. Either way, jump offsets or targets are in terms of bytes. So, jumping from the 3rd instruction to the 9th might mean jumping 6 bytes, or 19, depending on the sizes of the instructions in between.

Obviously, this means that inserting or removing an instruction, or changing its argument to now need or no longer need EXTENDED_ARG, requires changing the offsets of any relative jumps over that instruction, and the targets of any absolute jumps beyond that instruction. Worse, in some cases, fixing up a jump will push its argument over or under the EXTENDED_ARG limit, cascading further changes.
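You can see the byte-oriented targets with today’s dis (this is the real API):

import dis

def f(x):
    while x:
        x -= 1

for instr in dis.get_instructions(f):
    if instr.opcode in dis.hasjabs or instr.opcode in dis.hasjrel:
        # arg and argval are byte offsets, not instruction indices
        print(instr.offset, instr.opname, instr.arg, '->', instr.argval)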

Special arguments

A few other opcodes assign special meanings to their arguments. For example, CALL_FUNCTION treats its argument as two bytes rather than one short, with the low byte representing the count of positional arguments pushed on the stack, and the high byte representing the count of keyword argument pairs pushed on the stack. MAKE_FUNCTION and MAKE_CLOSURE are the most complicated such opcodes.
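For example, the argument for a call with two positional arguments and one keyword argument packs like this (a sketch of the encoding, not an official API):

def call_function_arg(npositional, nkeyword):
    # low byte: positional count; high byte: keyword-pair count
    return npositional | (nkeyword << 8)

assert call_function_arg(2, 1) == 0x0102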

lnotab

One of the associated values is a compressed line-number table that maps between bytecode offsets and source line numbers. This is complicated enough that it needs a special file, lnotab_notes.txt, to explain it. This of course also needs fixups whenever the first opcode of any source line moves to a new offset.
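A simplified decoder (ignoring the chained-entry special cases for deltas over 255 that lnotab_notes.txt describes) shows the basic idea:

def lnotab_pairs(code):
    # decode co_lnotab into (bytecode offset, line number) pairs
    offset, line = 0, code.co_firstlineno
    yield offset, line
    tab = code.co_lnotab
    for i in range(0, len(tab), 2):
        offset += tab[i]   # byte-offset delta
        line += tab[i+1]   # line-number delta
        yield offset, line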

Assembler format

The compiler has a pretty simple and flexible structure that it uses internally, which can be seen at the top of compile.c.

The compiler first builds a tree of blocks, each containing an array of struct instr instruction objects. Because Python doesn’t have arbitrary goto or similar instructions, all jumps are always to the start of a block, so the compiler just represents jump targets as pointers to blocks. For non-jump arguments, the instruction objects also hold their actual argument (a 32-bit integer for cases like MAKE_FUNCTION, but a const value, variable name, etc. rather than an array index for typical opcodes). Instructions also hold their source line number.

The compiler linearizes the tree of blocks into a linked list. Then it starts the “assembler” stage, which walks that list, emitting the bytecode and lnotab as it goes, and keeping track of the starting offset of each block. It then runs a fixup step, where it fills in the jump targets and adjusts the lnotab. If any of the jump targets get pushed into EXTENDED_ARG territory, then the block offsets have to be updated and the fixup rerun. It repeats this until there are no changes.

The assembler’s list-of-blocks representation would be very easy to work with. Iterating over instructions in linear order is slightly complicated by the fact that you have to iterate a linked list, then an array within each node, but it’s still easier than walking 1, 3, or 6 bytes at a time.3

Meanwhile, you could add and delete instructions just by changing the array of instructions, without invalidating any jump targets or line-number mappings. And you can change arguments without worrying about whether they push you over or under the EXTENDED_ARG limit.

However, the compiler doesn’t call the optimizer until after this fixup step. This means the optimizer has to walk the more complicated packed bytecode, and has to repeat some of the same fixup work (or, as currently implemented, does only the simplest possible fixup work, and punts and doesn’t optimize if anything else might be necessary). The current peephole optimizer is kept pretty limited because of these restrictions and complexities. But if PEP 511 adds plugin optimizers, they’re all going to have to deal with these headaches. (And, besides the complexity, there’s obviously a performance cost to repeating the fixup work once for every optimizer.)

dis format

The dis module disassembles bytecode into a format that’s easier to understand: a Bytecode object can be constructed from a code object, which wraps up all of the associated values, and acts as an iterable of Instruction objects. These objects are somewhat similar to the struct instr used by the assembler, in that they hold the argument value and line number. However, jumps are still in terms of bytes rather than (blocks of) instructions, and EXTENDED_ARG is still treated as a separate instruction.

So, this format would be a little easier to process than the raw bytecode, but it would still require jump fixups, including cascading jump fixups over EXTENDED_ARG changes. Plus, there is no assembler that can be used to go back from the Bytecode object to a code object. So, you need to write code that iterates the instructions and emits bytecode and the lnotab and then does the same fixup work as the compiler. This code is usually in Python, rather than in C, which makes it a little simpler–but it’s still pretty painful.
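Concretely, iterating the dis format looks like this (real, current API):

import dis

bc = dis.Bytecode(compile('x = y if z else w', '<example>', 'exec'))
for instr in bc:
    # jump arguments are still byte offsets, and EXTENDED_ARG still
    # shows up as a separate Instruction
    print(instr.offset, instr.opname, instr.arg, instr.argval)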

byteplay format

The third-party byteplay module has a byteplay.Code type that expands on the dis.Bytecode model by removing the EXTENDED_ARG instructions and synthesizing fake instructions to make things simpler: a SetLineno instruction to represent each lnotab entry, and a Label instruction to represent each jump target. (It’s also easy to create new SetLineno and Label instructions as needed.)

One way to look at this is that byteplay lets you edit code for a simple “macro assembler”, instead of for a raw assembler of the kind that used to be built into the ROM of old home computers.

Unlike dis, byteplay also works recursively: if a function has a nested function or class definition,4 its code object is already disassembled to a byteplay.Code object.

The byteplay disassembly also acts as a mutable sequence of instructions, rather than an immutable, non-indexable iterable, and allows replacing instructions with simple opcode-argument 2-tuples instead of having to build instruction objects.

Finally, byteplay provides a method to assemble the instructions back into a code object–which automatically takes care of building the value and name arrays, calculating the jump targets, building the lnotab, etc.

Together, all of these changes allow processing instructions without worrying about any of the complicated issues. Instead of being a sequence of variable-length instructions that contain references to byte offsets within that sequence and external arrays along with an associated compacted table of line numbers, bytecode is just a sequence of (opcode, argument) pairs.
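Here’s a sketch of what that looks like in practice (from memory of byteplay’s API, so details may be slightly off):

from byteplay import Code, LOAD_GLOBAL, LOAD_CONST

def f():
    return x

c = Code.from_code(f.__code__)
# c.code is a plain mutable list of (opcode, argument) pairs,
# with SetLineno and Label pseudo-instructions mixed in
c.code = [(LOAD_CONST, 42) if (op, arg) == (LOAD_GLOBAL, 'x') else (op, arg)
          for op, arg in c.code]
f.__code__ = c.to_code()
assert f() == 42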

On top of that, byteplay also includes code to check that the stack effects of all of the instructions add up, which helps quite a bit for catching common errors that would otherwise be hard to debug.

The biggest problem with byteplay is that it’s somewhat complicated code that has to change with each new version of Python, but generally hasn’t tracked those new versions very well. It took years to become compatible with Python 3.x at all, and it has had (and may still have) longstanding bugs on 2.7 and each new 3.x version. As of the 3.4 improvements to dis, it now builds some of its work on top of that module when available (and dis, being in the stdlib, always tracks changes), but so far, each new version has still required patches that were months behind the release. And of course the need to be backward compatible with a wide range of supported Python versions also complicates things.

A smaller problem with byteplay is that, while it does a lot of the same things as dis (especially now that it’s partly built on dis), it does them in different terms. For example, the equivalent of dis.Bytecode is called byteplay.Code, not Bytecode; it has a @classmethod alternate constructor from code objects instead of a normal constructor; that constructor can only take actual code objects rather than extracting them from things like functions automatically; it isn’t directly iterable but instead has a code attribute that is; etc.

Improving the process

AST processing

Import hooks already provide a simple way to hook the process between AST parsing and compilation, and the ast module makes processing the ASTs pretty easy. (Some changes can be done even earlier, at the source code or token level.)
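For example, here’s a toy AST pass using the real ast module, folding additions of numeric literals:

import ast

class ConstFold(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Num)
                and isinstance(node.right, ast.Num)):
            # replace 1 + 2 with 3, keeping the source location
            return ast.copy_location(ast.Num(node.left.n + node.right.n), node)
        return node

tree = ConstFold().visit(ast.parse('x = 1 + 2'))
code = compile(ast.fix_missing_locations(tree), '<example>', 'exec')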

Many of the optimizations and other processing tasks people envision (and much of what the current peephole optimizer does) could also be done on the AST, so PEP 511 allows for AST optimizers as well as bytecode optimizers.

However, not everything can be done this way–e.g., eliminating redundant store-load pairs when the stored value is never used again. Plus, there’s no way to write a post-processing function decorator in AST terms–at that point, you’ve got a live function object.

So, this is (already) part of the solution, but not a complete solution on its own.

Assembler structures

As explained earlier, the assembler structures are much easier to work on than the compiled bytecode.

And there’s really no good reason the peephole optimizer couldn’t be called with these structures. Emitting bytecode, fixing it up, and then passing it to code that has to partially undo that work to make its changes and then redo the work it undid seems somewhat silly, and it’s pretty complicated.

Unfortunately, exposing the assembler structures to third-party code, like the bytecode processors envisioned by PEP 511, is a different story. The peephole optimizer is essentially part of the compiler; PEP 511 optimizers aren’t, and shouldn’t be given access to structs that are both dangerous (it’s very easy to segfault the compiler by messing with these structs) and brittle (the structs could change between versions, and definitely should be allowed to). Plus, some of the optimizers will be written in Python.

So, we’d have to provide some way to wrap these up and expose a C API and Python API for dealing with blocks of instructions. But we probably don’t want the compiler and assembler themselves to work in terms of this C API instead of working directly on their internal structs.

Meanwhile, for the peephole optimizer and PEP 511 processors, it makes sense to process the code before it’s been fixed up and turned into bytecode–but that obviously doesn’t work for import hooks, or for decorators. So, we’d need some function to convert back from bytecode to a more convenient format, and then to convert from that format back to bytecode and fix it up.

So, at this point, we’re really not talking about exposing the assembler structures, but designing a similar public structure, and functions to convert that to and from bytecode. At which point there’s no reason it necessarily has to be anything like the assembler structs.

PyBytecode

Once you look at it in those terms, what we’re looking for is exactly what byteplay is. If byteplay were incorporated into the stdlib, it would be a lot easier to keep it tracking the dis module and bytecode changes from version to version: any patch that would otherwise break byteplay would have to include the corresponding changes to byteplay.

But byteplay is pure Python. We need something that can be called from C code.

So, what I’m imagining is a PyBytecode object, which is similar to a byteplay.Code object (including having pseudo-ops like SetLineno and Label), but with a C API, and hewing closer to the dis model (and to PEP 8). The assembler stage of the compiler, instead of emitting a string of bytecode and related objects, builds up a PyBytecode object. That’s the object that the peephole optimizer and PEP 511 processors work on. Then, the compiler calls PyBytecode_ToCode, which generates executable bytecode and associated objects, does all the fixups, and builds the code object.

The PyBytecode type would be exposed to Python as, say, dis.Bytecode (possibly renaming the existing thing to dis.RawBytecode). An import hook or decorator can call PyBytecode_FromCode or the dis.Bytecode constructor, process it, then call PyBytecode_ToCode or the to_code method to get back to executable bytecode. A PEP 511 processor written in Python would get one of these objects, but wouldn’t have to import dis to use it (although it would probably be convenient to do so in order to get access to the opcodes, etc.).

We could, like byteplay, allow instructions to be just 2-sequences of opcode and argument instead of Instruction objects–and make those objects namedtuples, too, a la the tokenize module. We could even allow PEP 511 processors to return any iterable of such instructions. There’s some precedent for this in the way tokenize works (although it has its own kinds of clunkiness). And this would mean that simple one-pass processors could actually be written as generators:

def constify(bytecode: dis.Bytecode):
    for op, arg in bytecode:
        if op == dis.LOAD_GLOBAL:
            yield (dis.LOAD_CONST, eval(arg, globals()))
        else:
            yield (op, arg)

If you wanted to write a decorator that just constifies a specified list of globals, it would look something like this:

def constify(*names, globals=None):
    mapping = {name: eval(name, globals) for name in names}
    def process(bytecode: dis.Bytecode):
        for op, arg in bytecode:
            if op == dis.LOAD_GLOBAL and arg in mapping:
                yield (dis.LOAD_CONST, mapping[arg])
            else:
                yield (op, arg)
    def deco(func):
        bytecode = dis.Bytecode(func.__code__)
        func.__code__ = dis.Bytecode(process(bytecode)).to_code()
        return func
    return deco

For a more complicated example, here’s a PEP 511 processor to eliminate double jumps:

def dejumpify(bytecode: dis.Bytecode):
    for instr in bytecode:
        if isinstance(instr.argval, dis.Label):
            target = bytecode[instr.argval.offset + 1]
            if target.opcode in {dis.JUMP_FORWARD, dis.JUMP_ABSOLUTE}:
                yield (instr.opcode, target.argval)
                continue
        yield instr

Of course not everything makes sense as a generator. So, Bytecode objects should still be mutable as sequences.

And, to make things even simpler for in-place processors, the to_code method can skip NOP instructions, so you can delete an instruction by writing bytecode[i] = (dis.NOP, None) instead of bytecode[i:i+1] = [] (which means you can do this in the middle of iterating over bytecode without losing your place).

Unpacked opcodes

My initial thought (actually, Serhiy Storchaka came up with it first) was that “unpacking” bytecode into a fixed-length form would be enough of a simplification.

From Python, you’d call c.unpacked() on a code object, and get back a new code object where there are no EXTENDED_ARGs, and every opcode is 4 bytes5 instead of 1, 3, or 6. Jump targets and the lnotab are adjusted accordingly. The lnotab is also unpacked into 8-byte entries each holding a line number and an offset, instead of 2-byte entries holding deltas and special handling for large deltas.

So, you do your processing on that, and call uc.packed() to pack it, remove NOPs, and redo the fixups, giving you back a normal code object.6

This means you still do have to do some fixup if you rearrange any arguments, but at least it’s much easier fixup.

This also means that dis, as it is today, is really all the help you need. For example, here’s the constify decorator from above:

def constify(*names, globals=None):
    mapping = {name: eval(name, globals) for name in names}
    def process(code: types.CodeType):
        bytecode = bytearray(code.co_code)
        consts = list(code.co_consts)
        for i in range(0, len(bytecode), 4):
            if bytecode[i] == dis.opmap['LOAD_GLOBAL']:
                # low 3 bytes of the fixed-width little-endian argument
                arg, = struct.unpack('<I', bytecode[i+1:i+4] + b'\0')
                name = code.co_names[arg]
                if name in mapping:
                    try:
                        const = consts.index(mapping[name])
                    except ValueError:
                        const = len(consts)
                        consts.append(mapping[name])
                    bytecode[i] = dis.opmap['LOAD_CONST']
                    bytecode[i+1:i+4] = struct.pack('<I', const)[:3]
        # rebuild the (still-unpacked) code object around the new co_code
        # and co_consts; spelling out the full CodeType constructor is elided
        return types.CodeType(...)
    def deco(func):
        func.__code__ = process(func.__code__.unpacked()).packed()
        return func
    return deco

Not quite as nice as the byteplay-ish example, but still not terrible.

However, after implementing a proof of concept for this, I no longer think this is the best idea. That easier fixup is still too much of a pain for simple processors. For example, to add an instruction at offset i, you still have to scan every instruction from 0 to i for any jrel > i, and every instruction for any jabs > i, and add 4 to all of them, and also go through the lnotab and add 4 to the offset for every entry > i.
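To make that concrete, here’s a sketch of the scan a single insertion requires in the unpacked format (assuming an unpacked lnotab stored as mutable [offset, lineno] pairs, and handling boundary conditions loosely):

import dis

def insert_at(bytecode, lnotab, i, new_instr):
    # insert one 4-byte instruction at offset i of unpacked bytecode
    bytecode[i:i] = new_instr
    for j in range(0, len(bytecode), 4):
        if j == i:
            continue  # skip the instruction we just inserted
        op = bytecode[j]
        arg = int.from_bytes(bytecode[j+1:j+4], 'little')
        if op in dis.hasjabs and arg >= i:
            arg += 4  # absolute target moved down
        elif op in dis.hasjrel and j < i <= j + 4 + arg:
            arg += 4  # relative jump now spans the insertion point
        else:
            continue
        bytecode[j+1:j+4] = arg.to_bytes(3, 'little')
    for entry in lnotab:
        if entry[0] >= i:
            entry[0] += 4  # shift line-table offsets too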

Plus, this is obviously horribly inefficient if you’re going to make lots of insertions–which nobody ever is, but people are still going to do overcomplicated things like build a balanced sort tree of jumps and of lnotab entries so they can make those fixups logarithmic instead of linear.

Of course if you really want to get efficient, there’s a solution that’s constant time, and also makes things a lot simpler: just use relocatable labels instead of offsets, and do all the fixups in a single pass at the end. And now we’re encouraging everyone to build half of byteplay themselves for every code processor.

And it’s not like we save that much in compiler performance or simplicity–the fixup needed in pack isn’t much simpler than what’s needed by the full byteplay-style design (and the pack fixup plus the manual fixup done in the processor itself will probably be slower than the byteplay fixup, not faster). Also, while having the same code object type represent both packed and unpacked formats seems like a savings, in practice it’ll probably cause confusion more often than it makes anything easier to learn.

Why can’t we just let people use byteplay if they want?

Well, for import hooks and decorators, they already can, and do. For PEP 511 optimizers, as long as they’re written in Python, they can.

The peephole optimizer, of course, can’t use byteplay, if for no other reason than it’s built into Python and byteplay isn’t. But maybe we could just leave that alone. People who want more will install PEP 511 optimizers.

There is, of course, a cost to building a code object and then converting back and forth between code and byteplay.Code once for each optimizer, instead of building a byteplay.Code (or PyBytecode) object, passing it through all the optimizers, and then converting to code once. Remember that converting to code involves that repeated fixup loop and other stuff that might be kind of slow to do multiple times. Then again, it might still be so simple that the cost is negligible. The only way we’ll really know is to build PEP 511 as currently designed, build a bunch of optimizers that manually use byteplay, then build PEP 511 with PyBytecode and rewrite the optimizers to use that, and profile times for imports, trivial scripts, compileall, etc. both ways.

Besides the probably-irrelevant performance issues, there are other advantages to PyBytecode. They’ve mostly been mentioned earlier, but to put them all in one place:

  • PyBytecode is automatically always up to date, instead of lagging behind Python releases like byteplay.
  • The compiler gets a bit simpler.
  • A big chunk of what byteplay and the compiler do today is the same work; better to have one implementation than two.
  • The peephole optimizer gets a lot simpler (although, as a change from today, “a lot simpler” is still more work than “unchanged”).
  • The peephole optimizer gets a little better (doesn’t punt as often).
  • If we later decide to build more optimizations into CPython itself as bytecode processors, it will be a lot simpler (which probably means we’ll be a lot more likely to build such optimizations).
  • PyBytecode comes with Python, instead of requiring a third-party install. (Probably not a huge benefit–anyone installing a third-party PEP 511 optimizer can install its dependencies, and the number of people who need to write an optimizer for local use but for some reason can’t use any third-party modules is probably vanishingly small.)
  • It seems like a fun project.

Overall, none of them seem so hugely compelling that I’d say someone ought to do this–except maybe the last one. :) If I get the time, I’ll try to build it, and rebuild the peephole optimizer, and maybe modify the latest PEP 511 patch as well. Then, if it looks promising, it might be worth suggesting as a patch to CPython to go along with PEP 511.



  1. Well, not really. It just does the same stuff that’s wrapped up by the Python compile function (when called on an AST).
  2. The bytes within an opcode’s argument are in little-endian format, but EXTENDED_ARG holds higher-order bytes, so the overall format for a 4-byte arg is mixed-endian, a la VAX.
  3. Most code actually just walks 1 or 3 bytes at a time, and treats EXTENDED_ARG like a separate instruction, keeping track of the current extended-arg value in an extra register.
  4. Comprehensions are compiled into a function definition and call, so you often have nested functions without realizing it.
  5. Yes, this would mean that you can no longer have absolute or relative jumps over 2**24, const/name/etc. arrays longer than 2**24, or annotations for more than 255 parameters (currently, you can have up to 32767 of these, but you can only have 512 total arguments). I don’t think anyone would complain about any of these limits. But if it really is a problem, we could use 5 or 8 bytes per instruction instead of 4.
  6. We could even allow code in unpacked form to be executed–I played with the ceval loop a bit, and it’s not at all hard to make this work, and it doesn’t seem likely to have a significant performance impact when you’re executing packed code. But I don’t see a good reason for that, and it’s a lot simpler to just raise a SystemError('cannot execute unpacked code').