Currently, in CPython, if you want to process bytecode, either in C or in Python, it’s pretty complicated.
The built-in peephole optimizer has to do extra work fixing up jump targets and the line-number table, and just punts on many cases because they’re too hard to deal with. PEP 511 proposes a mechanism for registering third-party (or possibly stdlib) optimizers, and they’ll all have to do the same kind of work.
Code that processes bytecode in function decorators or import hooks, whether for optimization or for other purposes like extending the language, is usually written in Python rather than C, but it’s no simpler there, so such code usually has to turn to a third-party library like byteplay.
Compile process
All of the details are very well documented, but I’ll give a brief summary of the important parts, to make it easy to see where things hook in and what information is available there.
When the Python interpreter starts, it reads interactive input, stdin, or a script file, and sets up a tokenizer. Then, it parses iteratively, generating an AST (abstract syntax tree) for each top-level statement. For each one:
- Assuming PEP 511 is accepted, it will first pass the AST node through the chain of all registered AST processors.
- Next, it calls compile on the node,[1] which compiles it into an internal structure, assembles that into bytecode and associated values, passes that through the optimizer, and builds a code object. (As of Python 3.5, “the optimizer” means the built-in peephole optimizer; assuming PEP 511, it will mean the peephole optimizer plus the chain of all registered bytecode processors.)
- The code object is then passed to exec.
Compiling a statement may recursively call the compiler. For example, a function definition first compiles the function body, then compiles the def statement itself into code that makes a function out of that function body’s code object. And functions and classes can of course be nested.
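You can see the result of this recursion today: the nested code objects end up in the enclosing code object’s co_consts. A quick, runnable illustration:

import types

# The module code object holds f's code object in its consts,
# which in turn holds g's; this is what recursive compilation produces.
mod = compile("def f():\n    def g(): pass\n", '<src>', 'exec')
f_code = next(c for c in mod.co_consts if isinstance(c, types.CodeType))
g_code = next(c for c in f_code.co_consts if isinstance(c, types.CodeType))
print(f_code.co_name, g_code.co_name)  # f g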
The compiler can also be triggered at runtime, e.g., by a call to compile or exec with source or an AST, or, probably more commonly, by the import system. The default import loader for .py files calls compile on the entire file, which builds an AST for the entire file, then performs the same steps as done for each input statement, to build a module code object. However, an import hook can change this: most simply, by parsing the AST explicitly, processing the AST, then compiling the result, or by processing the bytecode resulting from compile.
It’s also possible to post-process code at runtime. A function decorator is just a normal function called with a function object, but that function object contains a code object, and a decorator can modify the function to use a different code object, or just return a new function.
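As a trivial runnable illustration (use_code_of is just a made-up name for this sketch), a decorator can swap in another function’s code object wholesale:

def use_code_of(other):
    # Replace the decorated function's code object with another function's.
    # (A real processor would build a modified copy instead.)
    def deco(func):
        func.__code__ = other.__code__
        return func
    return deco

def answer():
    return 42

@use_code_of(answer)
def placeholder():
    pass

print(placeholder())  # 42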
Bytecode formats
Bytecode and code objects
Bytecode is, as the name implies, just a string of bytes.
The peephole optimizer (and, in the current version of PEP 511, any plugin optimizers) receives this bytecode string, plus a few associated values as in-out parameters. (See PyCode_Optimize in the source.)
Decorators and import hooks receive a code object, which includes the bytecode string and a much larger set of associated values, whose members are described in the inspect docs.
Bytecode is nicely explained in the dis documentation, but I’ll briefly cover it here, and explain where the headaches come in.
Each operation takes either 1 byte (opcodes below HAVE_ARGUMENT) or 3 (opcodes >= HAVE_ARGUMENT, where the extra 2 bytes form a short integer argument). If an operation needs an argument too large to fit, it’s prefixed by a special EXTENDED_ARG opcode, which provides 2 more bytes.[2]
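For concreteness, here’s a minimal sketch of walking this (pre-3.6) format in Python, using the real dis.HAVE_ARGUMENT and dis.EXTENDED_ARG constants and folding each EXTENDED_ARG prefix into the following instruction’s argument:

import dis

def instructions(code_bytes):
    """Yield (offset, opcode, arg) triples from pre-3.6 bytecode."""
    i = ext = 0
    while i < len(code_bytes):
        offset, op = i, code_bytes[i]
        if op >= dis.HAVE_ARGUMENT:
            arg = code_bytes[i+1] | (code_bytes[i+2] << 8) | ext
            i += 3
        else:
            arg = None
            i += 1
        if op == dis.EXTENDED_ARG:
            # the 2 extra bytes become the high-order bits of the next arg
            ext = arg << 16
        else:
            ext = 0
            yield offset, op, arg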
Argument values
Most arguments are stored as offsets into arrays that get stored with the bytecode. There are separate arrays for constant values, local names, free variable names, cell variable names, and other names (globals, attributes, etc.).
For example, a LOAD_CONST None may be stored as a LOAD_CONST with argument 1, meaning that it loads the code object’s co_consts[1] at runtime.
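You can see this with today’s dis:

import dis

def f():
    return 42

print(f.__code__.co_consts)  # (None, 42): the docstring slot, then our constant
dis.dis(f)                   # shows LOAD_CONST 1 (42)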
Jumps
Jump instructions (which include things like SETUP_FINALLY or FOR_ITER, which indirectly include jumps) come in two forms: relative, and absolute. Either way, jump offsets or targets are in terms of bytes. So, jumping from the 3rd instruction to the 9th might be jumping over 6 bytes, or 19.
Obviously, this means that inserting or removing an instruction, or changing its argument so that it now needs (or no longer needs) EXTENDED_ARG, requires changing the offsets of any relative jumps over that instruction, and the targets of any absolute jumps beyond that instruction. Worse, in some cases, fixing up a jump will push its argument over or under the EXTENDED_ARG limit, cascading further changes.
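To see why, here’s a rough sketch of just the first, non-cascading round of jump fixups after inserting size bytes at byte offset pos. It assumes a hypothetical list of instruction records with offset, op, and arg fields; the EXTENDED_ARG cascade is what makes the real job worse:

import dis

def fixup_jumps(instrs, pos, size):
    for ins in instrs:
        if ins.op in dis.hasjrel:
            # relative jumps are measured from the end of the jump instruction
            if ins.offset < pos <= ins.offset + 3 + ins.arg:
                ins.arg += size
        elif ins.op in dis.hasjabs and ins.arg >= pos:
            ins.arg += size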
Special arguments
A few other opcodes assign special meanings to their arguments. For example, CALL_FUNCTION treats its argument as two bytes rather than one short, with the low byte representing the count of positional arguments pushed on the stack, and the high byte representing the count of keyword argument pairs pushed on the stack. MAKE_FUNCTION and MAKE_CLOSURE are the most complicated such opcodes.
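Given an instruction’s arg from a loop like the one sketched earlier, decoding the pre-3.6 CALL_FUNCTION argument looks like this:

# CALL_FUNCTION's 2-byte argument packs two counts:
nargs = arg & 0xFF             # positional arguments on the stack
nkwargs = (arg >> 8) & 0xFF    # keyword (name, value) pairs on the stack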
lnotab
One of the associated values is a compressed line-number table that maps between bytecode offsets and source line numbers. This is complicated enough that it needs a special file, lnotab_notes.txt, to explain it. This of course also needs fixups whenever the first opcode of any source line moves to a new offset.
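A simplified decoder shows the mapping; the real rules, including the handling of deltas that don’t fit in one byte, are in lnotab_notes.txt:

def line_starts(code):
    # Expand co_lnotab's (byte_incr, line_incr) delta pairs into
    # absolute (offset, line) pairs. Simplified: the real format spells
    # large deltas with extra 255-valued entries.
    offset, line = 0, code.co_firstlineno
    for byte_incr, line_incr in zip(code.co_lnotab[::2], code.co_lnotab[1::2]):
        offset += byte_incr
        line += line_incr
        yield offset, line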
Assembler format
The compiler has a pretty simple and flexible structure that it uses internally, which can be seen at the top of compile.c.
The compiler first builds a tree of blocks, each containing an array of struct instr instruction objects. Because Python doesn’t have arbitrary goto or similar instructions, all jumps are always to the start of a block, so the compiler just represents jump targets as pointers to blocks. For non-jump arguments, the instruction objects also hold their actual argument (a 32-bit integer for cases like MAKE_FUNCTION, but a const value, variable name, etc. rather than an array index for typical opcodes). Instructions also hold their source line number.
The compiler linearizes the tree of blocks into a linked list. Then it starts the “assembler” stage, which walks that list, emitting the bytecode and lnotab as it goes, and keeping track of the starting offset of each block. It then runs a fixup step, where it fills in the jump targets and adjusts the lnotab. If any of the jump targets get pushed into EXTENDED_ARG territory, the block offsets have to be updated and the fixup rerun. It repeats this until there are no changes.
The assembler’s list-of-blocks representation would be very easy to work with. Iterating over instructions in linear order is slightly complicated by the fact that you have to iterate a linked list, then an array within each node, but it’s still easier than walking 1, 3, or 6 bytes at a time.[3]
Meanwhile, you could add and delete instructions just by changing the array of instructions, without invalidating any jump targets or line-number mappings. And you can change arguments without worrying about whether they push you over or under the EXTENDED_ARG limit.
However, the compiler calls the optimizer only after this fixup step. This means the optimizer has to walk the more complicated packed bytecode, and has to repeat some of the same fixup work (or, as currently implemented, does only the simplest possible fixup work, and punts and doesn’t optimize if anything else might be necessary). The current peephole optimizer is kept pretty limited because of these restrictions and complexities. But if PEP 511 adds plugin optimizers, they’re all going to have to deal with these headaches. (And, besides the complexity, there’s obviously a performance cost to repeating the fixup work once for every optimizer.)
dis format
The dis module disassembles bytecode into a format that’s easier to understand: a Bytecode object can be constructed from a code object, which wraps up all of the associated values, and acts as an iterable of Instruction objects. These objects are somewhat similar to the struct instr used by the assembler, in that they hold the argument value and line number. However, jumps are still in terms of bytes rather than (blocks of) instructions, and EXTENDED_ARG is still treated as a separate instruction.
So, this format would be a little easier to process than the raw bytecode, but it would still require jump fixups, including cascading jump fixups over EXTENDED_ARG changes. Plus, there is no assembler that can be used to go back from the Bytecode object to a code object. So, you need to write code that iterates the instructions, emits bytecode and the lnotab, and then does the same fixup work as the compiler. This code is usually in Python, rather than in C, which makes it a little simpler, but it’s still pretty painful.
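For reference, iterating today’s dis.Bytecode is already easy; it’s the way back that’s missing:

import dis

def f(x):
    return x + 1

for ins in dis.Bytecode(f):
    print(ins.offset, ins.opname, ins.arg, ins.argval)
# ...but once you've changed anything, there's no to_code() to call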
byteplay format
The third-party byteplay module has a byteplay.Code type that expands on the dis.Bytecode model by removing the EXTENDED_ARG instructions and synthesizing fake instructions to make things simpler: a SetLineno instruction to represent each lnotab entry, and a Label instruction to represent each jump target. (It’s also easy to create new SetLineno and Label instructions as needed.)
One way to look at this is that byteplay lets you edit code for a simple “macro assembler”, instead of for a raw assembler of the kind that used to be built into the ROM of old home computers.
Unlike dis, byteplay also works recursively: if a function has a nested function or class definition,[4] its code object is already disassembled to a byteplay.Code object.
The byteplay disassembly also acts as a mutable sequence of instructions, rather than an immutable, non-indexable iterable, and allows replacing instructions with simple opcode-argument 2-tuples instead of having to build instruction objects.
Finally, byteplay provides a method to assemble the instructions back into a code object, which automatically takes care of building the value and name arrays, calculating the jump targets, building the lnotab, etc.
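Putting those pieces together, a round trip looks roughly like this. This is an untested sketch from memory of byteplay’s API (which exposes each opcode as a module-level name):

from byteplay import Code, LOAD_GLOBAL, LOAD_CONST

def f(x):
    return len(x)

c = Code.from_code(f.__code__)   # disassemble (recursively)
# c.code is a mutable list of (opcode, arg) pairs, plus SetLineno/Label entries
c.code = [(LOAD_CONST, len) if instr == (LOAD_GLOBAL, 'len') else instr
          for instr in c.code]
f.__code__ = c.to_code()         # reassemble: arrays, jumps, and lnotab rebuilt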
Together, all of these changes allow processing instructions without worrying about any of the complicated issues. Instead of being a sequence of variable-length instructions that contain references to byte offsets within that sequence and external arrays along with an associated compacted table of line numbers, bytecode is just a sequence of (opcode, argument) pairs.
On top of that, byteplay also includes code to check that the stack effects of all of the instructions add up, which helps quite a bit for catching common errors that would otherwise be hard to debug.
The biggest problem with byteplay is that it’s somewhat complicated code that has to change with each new version of Python, but generally hasn’t tracked those new versions very well. It took years to get it compatible with Python 3.x at all, and it had (and may still have) longstanding bugs with 2.7 and each new 3.x version. As of the 3.4 improvements to dis, it now builds some of its work on top of that module when available (and dis, being in the stdlib, always tracks changes), but so far, each new version has still required patches that were months behind the release. And of course the need to be backward compatible with a wide range of supported Python versions also complicates things.
A smaller problem with byteplay is that, while it does a lot of the same things as dis (especially now that it’s partly built on dis), it does them in different terms. For example, the equivalent of dis.Bytecode is called byteplay.Code, not Bytecode; it has a @classmethod alternate constructor from code objects instead of a normal constructor; that constructor can only take actual code objects rather than extracting them from things like functions automatically; it isn’t directly iterable but instead has a code attribute that is; etc.
Improving the process
AST processing
Import hooks already provide a simple way to hook the process between AST parsing and compilation, and the ast module makes processing the ASTs pretty easy. (Some changes can be done even earlier, at the source code or token level.)
Many of the optimizations and other processing tasks people envision (and much of what the current peephole optimizer does) could also be done on the AST, so PEP 511 allows for AST optimizers as well as bytecode optimizers.
However, not everything can be done this way–e.g., eliminating redundant store-load pairs when the stored value is never used again. Plus, there’s no way to write a post-processing function decorator in AST terms–at that point, you’ve got a live function object.
So, this is (already) part of the solution, but not a complete solution on its own.
Assembler structures
As explained earlier, the assembler structures are much easier to work on than the compiled bytecode.
And there’s really no good reason the peephole optimizer couldn’t be called with these structures. Emitting bytecode, fixing it up, and then passing it to code that has to partially undo that work to make its changes, and then redo the work it undid, seems somewhat silly, and it’s pretty complicated.
Unfortunately, exposing the assembler structures to third-party code, like the bytecode processors envisioned by PEP 511, is a different story. The peephole optimizer is essentially part of the compiler; PEP 511 optimizers aren’t, and shouldn’t be given access to structs that are both dangerous (it’s very easy to segfault the compiler by messing with these structs) and brittle (the structs could change between versions, and definitely should be allowed to). Plus, some of the optimizers will be written in Python.
So, we’d have to provide some way to wrap these up and expose a C API and Python API for dealing with blocks of instructions. But we probably don’t want the compiler and assembler themselves to work in terms of this C API instead of working directly on their internal structs.
Meanwhile, for the peephole optimizer and PEP 511 processors, it makes sense to process the code before it’s been fixed up and turned into bytecode–but that obviously doesn’t work for import hooks, or for decorators. So, we’d need some function to convert back from bytecode to a more convenient format, and then to convert from that format back to bytecode and fix it up.
So, at this point, we’re really not talking about exposing the assembler structures, but designing a similar public structure, and functions to convert that to and from bytecode. At which point there’s no reason it necessarily has to be anything like the assembler structs.
PyBytecode
Once you look at it in those terms, what we’re looking for is exactly what byteplay is. If byteplay were incorporated into the stdlib, it would be a lot easier to keep it tracking the dis module and bytecode changes from version to version: any patch that would break byteplay would instead have to include the matching changes to byteplay.
But byteplay is pure Python. We need something that can be called from C code.
So, what I’m imagining is a PyBytecode object, which is similar to a byteplay.Code object (including having pseudo-ops like SetLineno and Label), but with a C API, and hewing closer to the dis model (and to PEP 8). The assembler stage of the compiler, instead of emitting a string of bytecode and related objects, builds up a PyBytecode object. That’s the object that the peephole optimizer and PEP 511 processors work on. Then, the compiler calls PyBytecode_ToCode, which generates executable bytecode and associated objects, does all the fixups, and builds the code object.
The PyBytecode type would be exposed to Python as, say, dis.Bytecode (possibly renaming the existing thing to dis.RawBytecode). An import hook or decorator can call PyBytecode_FromCode or the dis.Bytecode constructor, process it, then call PyBytecode_ToCode or the to_code method to get back to executable bytecode. A PEP 511 processor written in Python would get one of these objects, but wouldn’t have to import dis to use it (although it would probably be convenient to do so in order to get access to the opcodes, etc.).
We could, like byteplay, allow instructions to be just 2-sequences of opcode and argument instead of Instruction objects (and make those objects namedtuples, too, a la the tokenize module). We could even allow PEP 511 processors to return any iterable of such instructions. There’s some precedent for this in the way tokenize works (although it has its own kinds of clunkiness). And this would mean that simple one-pass processors could actually be written as generators:
def constify(bytecode: dis.Bytecode):
    # Fold every global lookup into a constant holding its current value.
    # (dis.LOAD_GLOBAL etc. are assumed opcode shortcuts in the new module.)
    for op, arg in bytecode:
        if op == dis.LOAD_GLOBAL:
            yield (dis.LOAD_CONST, eval(arg, globals()))
        else:
            yield (op, arg)
If you wanted to write a decorator that just constifies a specified list of globals, it would look something like this:
def constify(*names, globals=None):
    # Look up each named global once, at decoration time.
    mapping = {name: eval(name, globals) for name in names}
    def process(bytecode: dis.Bytecode):
        for op, arg in bytecode:
            if op == dis.LOAD_GLOBAL and arg in mapping:
                yield (dis.LOAD_CONST, mapping[arg])
            else:
                yield (op, arg)
    def deco(func):
        func.__code__ = dis.Bytecode(process(dis.Bytecode(func))).to_code()
        return func
    return deco
For a more complicated example, here’s a PEP 511 processor to eliminate double jumps:
def dejumpify(bytecode: dis.Bytecode):
    # If a jump's target is itself an unconditional jump, retarget the
    # first jump straight at the final destination.
    for instr in bytecode:
        if isinstance(instr.argval, dis.Label):
            # the Label pseudo-instruction sits just before the real target
            target = bytecode[instr.argval.offset + 1]
            if target.opcode in {dis.JUMP_FORWARD, dis.JUMP_ABSOLUTE}:
                yield (instr.opcode, target.argval)
                continue
        yield instr
Of course not everything makes sense as a generator. So, Bytecode objects should still be mutable as sequences.
And, to make things even simpler for in-place processors, the to_code method can skip NOP instructions, so you can delete an instruction by writing bytecode[i] = (dis.NOP, None) instead of bytecode[i:i+1] = [] (which means you can do this in the middle of iterating over bytecode without losing your place).
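So deleting instructions mid-iteration becomes trivial (same hypothetical API as the examples above):

for i, (op, arg) in enumerate(bytecode):
    if op == dis.LOAD_GLOBAL and arg == 'DEBUG':
        bytecode[i] = (dis.NOP, None)  # to_code() drops it; indices stay valid
func.__code__ = bytecode.to_code()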
Unpacked opcodes
My initial thought (actually, Serhiy Storchaka came up with it first) was that “unpacking” bytecode into a fixed-length form would be enough of a simplification.
From Python, you’d call c.unpacked() on a code object, and get back a new code object where there are no EXTENDED_ARGs, and every opcode is 4 bytes[5] instead of 1, 3, or 6. Jump targets and the lnotab are adjusted accordingly. The lnotab is also unpacked, into 8-byte entries each holding a line number and an offset, instead of 2-byte entries holding deltas with special handling for large deltas.
So, you do your processing on that, and call uc.packed() to pack it, remove NOPs, and redo the fixups, giving you back a normal code object.[6]
This means you still do have to do some fixup if you rearrange any arguments, but at least it’s much easier fixup.
This also means that dis, as it is today, is really all the help you need. For example, here’s the constify decorator from above, rewritten against the hypothetical unpacked()/packed() methods:
import dis, struct, types

def constify(*names, globals=None):
    mapping = {name: eval(name, globals) for name in names}
    def process(code: types.CodeType) -> types.CodeType:
        # `code` is assumed to be in the hypothetical unpacked format:
        # every instruction is 4 bytes (1 opcode byte, 3 argument bytes).
        bytecode = bytearray(code.co_code)
        consts = list(code.co_consts)
        for i in range(0, len(bytecode), 4):
            if bytecode[i] == dis.opmap['LOAD_GLOBAL']:
                arg, = struct.unpack('<I', bytecode[i+1:i+4] + b'\0')
                name = code.co_names[arg]
                if name in mapping:
                    try:
                        const = consts.index(mapping[name])
                    except ValueError:
                        const = len(consts)
                        consts.append(mapping[name])
                    bytecode[i] = dis.opmap['LOAD_CONST']
                    bytecode[i+1:i+4] = struct.pack('<I', const)[:3]
        # Rebuild the code object with the new bytecode and consts
        # (spelled with the modern CodeType.replace() here for brevity).
        return code.replace(co_code=bytes(bytecode), co_consts=tuple(consts))
    def deco(func):
        # unpacked() and packed() are the hypothetical methods described above
        func.__code__ = process(func.__code__.unpacked()).packed()
        return func
    return deco
Not quite as nice as the byteplay-ish example, but still not terrible.
However, after implementing a proof of concept for this, I no longer think this is the best idea. That easier fixup is still too much of a pain for simple processors. For example, to add an instruction at offset i, you still have to scan every instruction from 0 to i for any jrel > i, and every instruction for any jabs > i, and add 4 to all of them, and also go through the lnotab and add 4 to the offset for every entry > i.
Plus, this is obviously horribly inefficient if you’re going to make lots of insertions–which nobody ever is, but people are still going to do overcomplicated things like build a balanced sort tree of jumps and of lnotab entries so they can make those fixups logarithmic instead of linear.
Of course if you really want to get efficient, there’s a solution that’s constant time, and also makes things a lot simpler: just use relocatable labels instead of offsets, and do all the fixups in a single pass at the end. And now we’re encouraging everyone to build half of byteplay themselves for every code processor.
And it’s not like we save that much in compiler performance or simplicity: the fixup needed in pack isn’t much simpler than what’s needed by the full byteplay-style design (and the pack fixup plus the manual fixup done in the processor itself will probably be slower than the byteplay fixup, not faster). Also, while having the same code object represent both packed and unpacked formats seems like a savings, in practice it’ll probably lead to more confusion than it saves by giving people one less type to learn about.
Why can’t we just let people use byteplay if they want?
Well, for import hooks and decorators, they already can, and do. For PEP 511 optimizers, as long as they’re written in Python, they can.
The peephole optimizer, of course, can’t use byteplay, if for no other reason than that it’s built into Python and byteplay isn’t. But maybe we could just leave that alone. People who want more will install PEP 511 optimizers.
There is, of course, a cost to building a code object and then converting back and forth between code and byteplay.Code once for each optimizer, instead of building a byteplay.Code (or PyBytecode) object, passing it through all the optimizers, and then converting to code once. Remember that converting to code involves that repeated fixup loop and other stuff that might be kind of slow to do multiple times. Then again, it might still be so simple that the cost is negligible. The only way we’ll really know is to build PEP 511 as currently designed, build a bunch of optimizers that manually use byteplay, then build PEP 511 with PyBytecode and rewrite the optimizers to use that, and profile times for imports, trivial scripts, compileall, etc. both ways.
Besides the probably-irrelevant performance issues, there are other advantages to PyBytecode. They’ve mostly been mentioned earlier, but to put them all in one place:
- PyBytecode is automatically always up to date, instead of lagging behind Python releases like byteplay.
- The compiler gets a bit simpler.
- A big chunk of what byteplay and the compiler do today is the same work; better to have one implementation than two.
- The peephole optimizer gets a lot simpler (although, as a change from today, “a lot simpler” is still more work than “unchanged”).
- The peephole optimizer gets a little better (doesn’t punt as often).
- If we later decide to build more optimizations into CPython itself as bytecode processors, it will be a lot simpler (which probably means we’ll be a lot more likely to build such optimizations).
- PyBytecode comes with Python, instead of requiring a third-party install. (Probably not a huge benefit: anyone installing a third-party PEP 511 optimizer can install its dependencies, and the number of people who need to write an optimizer for local use but for some reason can’t use any third-party modules is probably vanishingly small.)
- It seems like a fun project.
Overall, none of them seem so hugely compelling that I’d say someone ought to do this–except maybe the last one. :) If I get the time, I’ll try to build it, and rebuild the peephole optimizer, and maybe modify the latest PEP 511 patch as well. Then, if it looks promising, it might be worth suggesting as a patch to CPython to go along with PEP 511.
1. Well, not really. It just does the same stuff that’s wrapped up by the Python compile function (when called on an AST). ↩
2. The bytes within an opcode’s argument are in little-endian format, but EXTENDED_ARG holds the higher-order bytes, so the overall format for a 4-byte arg is mixed-endian, a la VAX. ↩
3. Most code actually just walks 1 or 3 bytes at a time, and treats EXTENDED_ARG like a separate instruction, keeping track of the current extended-arg value in an extra register. ↩
4. Comprehensions are compiled into a function definition and call, so you often have nested functions without realizing it. ↩
5. Yes, this would mean that you can no longer have absolute or relative jumps over 2**24, const/name/etc. arrays longer than 2**24, or annotations for more than 255 parameters (currently, you can have up to 32767 of these, but you can only have 512 total arguments). I don’t think anyone would complain about any of these limits. But if it really is a problem, we could use 5 or 8 bytes per instruction instead of 4. ↩
6. We could even allow code in unpacked form to be executed. I played with the ceval loop a bit, and it’s not at all hard to make this work, and it doesn’t seem likely to have a significant performance impact when you’re executing packed code. But I don’t see a good reason for that, and it’s a lot simpler to just raise a SystemError('cannot execute unpacked code'). ↩