The core problem is that the key to pattern matching is unpacking constructors of algebraic data types, and Python has arbitrary OO-style classes, not ADTs. Everything else is bikesheddable, but ultimately obvious. And there are two potential solutions to that core problem.
I'll start by laying out a straw-man proposal.
(This was originally written 17 August 2014, and then updated 7 April 2015. The major change is the new section at the end, although I also fixed a couple typos and restored some formatting that Blogspot broke because it sucks.)
(Also see Pattern matching again, written a year and a half later, after I became less ignorant about a few more languages, most notably Scala.)
Semantics
You can match any of the following:- Any assignment target: Always matches, and the target is bound to the match value.*
- A constructor call: This is the tricky bit that most of this post is about.
- Any expression that isn't ambiguously an assignment target or constructor call: Matches iff the match value == the expression value. (Generally this means literal constants.)
- Any of the above followed by an "as" clause: Matches iff the pattern matches, and if so, binds the as target to the match value. (That is, "(0, spam) as eggs" matches iff (0, spam) matches, and if so binds eggs. This is mainly useful if you need to further decompose a subpattern, but also get a name for the whole thing.)
- Any comma-separated list of the above: Unpacks the match value just like in an assignment target list, then matches iff each pattern matches the corresponding value.
- Any of the above followed by a guard clause: Matches iff the pattern list matches and the if expression (identical to the conditional clause in a comprehension) is truthy. The if test can, and usually will, use values bound by the pattern list match. The "and" mentioned is of course short-circuiting.
If you've seen a case expression in Haskell or ML, all of this should be familiar. Except for one conspicuous absence:
Lists
The very first use case anyone learns for pattern matching in Haskell or ML is matching the head and tail of a list.Of course in Haskell and ML, a list is, under the covers, just semantic sugar for the Cons record type. When you match (x:xs), those x and xs are really the two members of the list object, and (x:xs) is just syntactic sugar for Cons(x, xs). While we could come up with a syntax to match lists similarly in Python, it would be misleading, because a list is not made up of a head and a tail, it's an array made up of N elements.
On top of that, the main use for pattern-matching lists is to write recursive functions over lists, which is a decidedly non-Pythonic thing to do in the first place.
So, I've left out list decomposition intentionally.*
* Note that you can actually match the empty list or tuple, because those are both unambiguously not targets—or, of course, you could match the constructor call list(). And you can match x, *xs against any iterable of 1 or more elements. So, if you really want to recursively decompose a list, you can. But it's a bad idea, so there's no reason to add syntactic sugar to make it easier.
Non-literal values
(It's a bit of an oversimplification to say that only literals can be used as values to match, but it's a useful oversimplification for this part of the discussion.)In many cases, you're going to want to match the value in, say, spam[3].eggs, but because that's a valid target, you can't. Instead, you have to do it with a guard clause.
In ML or Haskell, that isn't as much of a problem. For one thing, only identifiers are actually binding targets. For another, each case (or let) is a new scope, and you can't actually rebind variables (although you can shadow them with inner-scope variables of the same name).
One possibility is to add a special notation that forces something to be a value rather than a target. In the comments, I suggested a hideously ugly straw-man, @target, which draws misleading parallels to not only Haskell pattern matching syntax, but also languages like Perl and its cousins. But I'm sure someone could come up with some better syntax. And that might be worth doing.
An alternate possibility is to always treat any valid expression as a value, and only treat something as a target if the expression would raise a NameError or LookupError. If you want to rebind an existing variable (or a list member), that would be the special case that requires some special notation. However, I think that would very quickly lead to confusion. Anything that allows "spam" to sometimes mean "bind spam to the value" and sometimes "match the value against the current value of spam" would be hard to read without absorbing a lot more context. Worse, a lot of newcomers to Python would see identifiers being used in case statements for their values and assume it's like a C or Java switch statement, which would be very misleading. And then there's the question of what happens if the expression has a global binding.
A more extreme, but much simpler, version would be to have no rule at all: there's always special syntax for a value, even if it's a literal--or, alternatively, there's always special syntax for a binding, even if it's a currently-unbound identifier. (Note that libraries that add pattern matching into Python without new syntax seem to go that way—e.g., pypatt uses "quote(x)", rather than just "x", to mean "match anything and bind to x", but they may not have much choice.) But I don't think this is necessary. I think the "anything that looks like a target is a target" rule is easy to understand, remember, teach, and even discover through experimentation.
Syntax
Example
class Breakfast: def __init__(self, spam, eggs, beans): self.spam, self.eggs, self.beans = spam, eggs, beans # ... case order: of _, Breakfast(0, _, _): raise WaitressError("You can't have it without spam") of _, Breakfast(_, _, beans) if beans: raise WaitressError("Beans are off") of _, Breakfast(spam, eggs, _) if eggs > spam: raise WaitressError("That'd be more eggs than spam, can't have that") of customer, Breakfast(spam, eggs, beans) as breakfast: kitchen.fry_spam(spam) kitchen.fry_eggs(eggs) kitchen.fry_beans(beans) waitress.deliver(customer, breakfast) else: raise WaitressError("I'm confused by that order")This syntax obviously isn't ideal, what with the double indent and the of and else statements being statements rather than clauses, but this should be enough to discuss the real issue.
Note that, unlike Haskell or ML, this is a statement, not an expression.*
Like an if/elif/else chain, this just tries each case until it finds one that matches, then executes its suite. If none of them match and there's no else, you just fall off the statement, having done nothing.**
All of these cases are 2-element pattern lists (or 2-element pattern lists with a guard clause), so order[0] gets matched against the first pattern, and order[1] against the second.***
The first thing is always a target (_ or customer)****, so it always succeeds, and binds the target to order[0].
The second thing in every case is a Breakfast constructor (or a Breakfast constructor with an as clause). That gets (recursively) matched against order[1], and it's the tricky bit that will be examined in detail.
The as clause just means that, if Breakfast(spam, eggs, beans) matches, breakfast gets bound to the whole Breakfast object it matched.
The if clauses are checked only after the pattern list is matched, so eggs and spam are the matched values. If the if condition is false, the whole pattern clause fails.
Note that a failed clause may still end up binding some names. And the case statement doesn't happen in its own scope. If you wanted to do something confusing, like writing a pattern that always fails and then using the names it bound in the suite of the next pattern, well, consenting adults and all that.
When the Breakfast constructor gets matched, it will in some way (described later) try to match its arguments. Some of these are targets, but not all of them—i.e., the literal constant 0. Note that matching Breakfast(0, _, _) is equivalent to matching Breakfast(spam, _, _) if spam == 0 (except that the latter will bind spam to the value).
* Of course in those languages, almost everything is an expression. Just as Python has if statements, often used for side effects, instead of if expressions that have a value, Python would presumably have case statements often used for side effects. Note that Python did eventually add a ternary conditional expression. I suppose it might be possible to add a case expression later, but I don't think it would be very useful; the whole point of the conditional expression is to use it in very short statements that would be overly verbose if split into a 4-line statement.
** In most languages with pattern matching, either else is required, or falling off a case without matching anything is an error. But again, that's because case is an expression, and must have a value, and the same thing is true for if expressions.
*** As with target unpacking, order can be any iterable, not just a sequence. I'd have to look at how assignment unpacking works under the covers, but if it's not building a list but just calling next once for each argument, we could still do the same here. Basically it would just need to try to unpack to up to the Nth element each time it first sees a pattern list of length N, and remember all previously unpacked values and whether it's received StopIteration.
**** Note that (unlike Haskell) _ doesn't have any special meaning here; we're just binding some of the match values to the name "_". If you bind multiple values to the same name, it gets one of them arbitrarily.
Grammar
I've actually built a GRAMMAR that adds this syntax, and from some simple tests there don't seem to be any problems parsing any of it. But of course it would need new AST nodes, at least dummy code generation, and a real test suite, and I didn't build any of that, so don't take that as too strong of a promise.Matching constructor calls
The big question is, how can Python know what to do with Breakfast(spam, eggs, beans)?In Haskell or ML, a case statement matches by deconstruction. Breakfast in those languages would be akin to a ('Breakfast', 'spam eggs beans'), so it's really just as easy as tuple unpacking, which Python already does, except with the addition of checking the type.
But in Python, Breakfast is a class. The fact that its constructor happens to take three arguments, and it happens to assign them to attributes with the same name in the same order (even if that order were recoverable…) is just a lucky accident, not something the language requires, or even encourages.
Option 1: Matching by construction
One alternative is to actually call the Breakfast constructor with special values that override __eq__, then trust that Breakfast's __eq__ method will do the right thing. Like this:class MatchParam: def __eq__(self, other): self.matched = other return TrueSo, to match order[1] against, say, Breakfast(0, eggs, _), Python would do this:
eggs, _ = MatchParam(), MatchParam() if Breakfast(0, eggs, _) == order[1]: eggs, _ = eggs.matched, _.matched # if there's a guard clause, check it here # run the suite code hereTo give a more complicated example, if Order were a class rather than a tuple, we could match Order(customer, Breakfast(0, eggs, _)) like this:
egg, _ = MatchParam(), MatchParam() customer = MatchParam() if Order(customer, Breakfast(0, eggs, _)) == order: # etc.The "as" clause makes this marginally more complicated, but it should be obvious how to make that work.
Unfortunately, while this works for many simple classes, there are plenty of classes for which it does the wrong thing. Basically, any class that does anything with the arguments in the constructor beyond storing them will probably fail. (Even static type checking will complain that a MatchParam is not a str… although that could be handled by making MatchParam act like a subclass of any type.)
Worse, some classes would succeed, but only by doing something dangerous or expensive. For example, I don't think you'd want to create a new FileIO(path, 'w') just to match an existing one!
So, I think this is ultimately a non-starter. Which is a pity, because it's so simple, and so close to what ML-style languages do…
Option 2: Matching by match protocol
Any automatic form of matching is going to have the same "FileIO problem". Really, matching has to be something that classes opt in to. The obvious way to do that is to add a match protocol: a new dunder method that's called to get back the constructor arguments, so they can be checked. Like this:class Breakfast: def __init__(self, spam, eggs, beans): self.spam, self.eggs, self.beans = spam, eggs, beans def __match__(self): return self.spam, self.eggs, self.beansNow we can match order[1] against Breakfast(0, eggs, _) like this (showing the semantics by transforming to existing Python in a way that could be implemented programmatically, although of course the actual implementation wouldn't do that--which means there's no need to worry about "macro hygiene" problems with those underscore-prefixed names affecting existing variables):
if isinstance(order[1], Breakfast): _0, eggs, _ = order[1].__match__() if _0 == 0: # run the suite code here… except that if _either_ if fails, we want to fall through to the next pattern, not just the outer one. I won't try to show how to write that in any of the examples, just keep it in mind. (Note that if case statements evaluate to a function definition and call, like comprehensions, this is a lot simpler to write. But that has the side effect of not exposing bound variables to the outer scope--which may be something you actually want, but if not, it's not something you want to add incidentally...)
To match Order(customer, Breakfast(0, eggs, _)):
if isinstance(order, Order): customer, _breakfast = order.__match__() if isinstance(_breakfast, Breakfast): _0, eggs, _ = _breakfast.__match__() if _0 == 0: # etc.
And to match Order(0, Breakfast(0, eggs, _) as breakfast) if eggs:
if isinstance(order, Order): _0, _breakfast = order.__match__() if _0 == 0: if isinstance(_breakfast, Breakfast): _0, eggs, _ = _breakfast.__match__() if _0 == 0: breakfast = _breakfast if eggs: # suite hereThis transformation shows exactly why pattern-matching can be attractive. It's just syntactic sugar, but it's syntactic sugar for code you wouldn't want to write explicitly, and you certainly wouldn't want to read.
The obvious downside here is that most simple classes won't be matchable. But if matching has to be opt-in, that seems pretty much unavoidable.
Keyword arguments?
One obvious problem with __match__ returning a tuple is that it can't match keyword arguments. Which means classes with keyword-only parameters in their constructors can't be matched at all.I think this could be solved by returning something like an inspect.ArgSpec instead of a tuple, but I haven't investigated to make sure this is feasible. Anyway, it should be obvious how to fit it into the case statement syntax; it's only the semantics that need to be worked through.
Can we remove the boilerplate?
Clearly, we could trivially add __match__ support to namedtuple, which helps in cases where you really do just want an ML-like record type and were already going to used namedtuple for it.It would be easy to define a @simpleclass decorator that generates a __match__ automatically by, e.g., inspecting the parameter names to __init__. Or, alternatively, the decorator could just take the attribute names and generate both the __init__ and __match__ methods for you.
There could also be special decorators for dealing with __slots__ classes, classes with only @property attributes, etc.
But somehow, after the first one, each seems increasingly less Pythonic. It might be appropriate for certain frameworks; you might even want to build a framework with base classes (and/or metaclasses) that do this automatically, instead of using a decorator. (For example, a Django model seems like it could define matching in terms of fields—except for the fact that you generally can't actually construct them directly, you have to call a create method… but there's no reason that Model.objects.create couldn't itself be a matchable type, rather than Model.)
Can we leverage the copy/pickle protocol?
The match protocol is really doing the same thing as the copy/pickle protocol: it's asking a class instance for its values, in a form that could be used to construct an equal instance. Could this be leveraged?The copy/pickle protocol is very flexible. While classes can implement __getnewargs__ or __getnewargs_ex__ to provide constructor arguments, they can also implement __getstate__ and __setstate__, or even __reduce_ex__ and a custom constructor-from-reduction function. And it's not clear how anything other than __getnewargs_ex__ could be used for matching unless the class also implemented __matchstate__ or a custom matcher-from-reduction function.
Still, this might be a good answer: Make __getnewargs__ and __getnewargsex__ automatically provide match information, and provide a way for the other alternatives to optionally specify match information (and if they don't, the class just can't be matched).
However, there aren't all that many classes that use __getnewargs__. Most classes that are simple enough to not need __getstate__ or __reduce__ just rely on the default pickler in object.__reduce_ex__, which of course stores the __dict__ and a _reconstructor function that calls __new__ but not __init__ and then updates the __dict__, which won't be of any use to matching.
And encouraging people to implement __getnewargs__ if they want to match means that they will no longer get the default pickler. For a class that does expensive or dangerous work in the constructor, this means copy.copy will now do that expensive or dangerous work—so we're right back to the FileIO problem.
So, I think that, despite the obvious similarities between the two protocols, they have to remain separate. (Maybe in a new Python-like language with both ideas there from the start, it would be a different story.)
Beyond case
In languages that rely heavily on pattern matching, it can often be used in two other places: function definitions, and let expressions. The equivalents of both are feasible in Python.Function definitions
In Haskell, you can define a function as, in effect, a series of separate definitions that use patterns as their arguments.Putting it in Python terms:
def fact(0): return 1 def fact(n): return n * fact(n-1)In Haskell, this is explicitly syntactic sugar for declaring a single function with a case statement. In other words, the compiler turns that into:
def fact(_n): case _n: of 0: return 1 of n: return n * fact(n-1)Note that this is different from overloading by type in languages like C++ (or in Python's functools.singledispatch or any of the multiple dispatch modules on PyPI), because it's overloading by structure (or value, in the case of constants). But it can be used to overload by type just by matching different type constructors.
In Python, we would need some way to mark these as pattern overloads, as opposed to redefinitions, but the answer to that is obvious, and already used by singledispatch: use a decorator:
@pattern def fact(0): return 1 @fact.pattern def fact(n): return n * fact(n-1)(The specific "@fact.pattern" syntax isn't absolutely necessary; "@pattern" alone could look up __name__ and find the right thing to attach to, and some of the alternative overloading modules on PyPI do this. I've just written it this way to parallel singledispatch.)
This would need pretty minimal language support: a function definition can take a single case pattern instead of a normal argument list. Normally, such a function is not callable (it always raises a TypeError), but the pattern decorator can inspect it to build something callable out of it. I'll leave the implementation of the pattern function as an exercise for the reader.
Assignments
Haskell of course doesn't have assignments; it's a pure (meaning purely-immutable) functional language. And let expressions aren't the same as assignment statements. So in this case, rather that start off with the parallel and then show how it maps to Python, I'll just show the Python:Breakfast(spam, eggs, beans) = order[1]It should be obvious what this means, and how to implement it given the above.
What if there are constants?
Breakfast(0, eggs, beans) = order[1]That matches eggs and beans if order[1].spam == 0, and raises MatchException("0 does not match 42") if order[1].spam == 42.
There's no ambiguity here. A call is not a valid target today, nor is a literal. A target list is a valid target, but since a pattern list is a superset of a target list, this isn't a problem. (At least I'm pretty sure there aren't any assignments that are valid today but that would have a different meaning if treated as a pattern assignment.)
And yes, you could even make as clauses and if clauses work here for consistency, although I don't think they'd be useful that often.
In fact, the easiest way to add pattern-matching assignments is to redefine assignments in terms of case statements, so target_list = expression is equivalent to:
case expression: of target_list: pass else: raise ValueError
What about adding ADTs to Python?
There are really two parts of ADTs: record types, which are basically tuples with named elements, and sum types, which are tagged unions with named values. The first is exactly the same thing as collections.namedtuple, which Python already has. Adding pattern matching for namedtuples would be trivial (with the match protocol described above, or otherwise). The second, Python doesn't have—but enum.Enum is halfway there. The only part that's missing is the ability to attach a value to an enumeration member (and, maybe, statically specify a type for each member). This is exactly the approach that Swift takes. In Python terms, imagine you could do this:class Barcode(enum.Enum): UPCA = 1 QRCode = 2 >>> barcode = Barcode.UPCA(17, 23, 42, 42) >>> barcode <Barcode.UPCA: 1 = (17, 23, 42, 42)> >>> barcode.value (17, 23, 42, 42)For more on this idea, see my separate post ADTs for Python.
If we had this, would it be worth adding pattern matching only for Enum and namedtuple? I don't think so. If neither of these types is common enough to even be a builtin; adding new syntax just for them seems silly.
However, the ability to match on these types in particular--as well as any other types that opt in--might be a good argument for adding a general match protocol, as described above.
Is any of this a good idea?
I'm not actually sure it is.A big part of the reason pattern matching is so important in functional languages is that it's declarative. Like let, it's an expression that binds values within a subexpression, and has the value of that subexpression.
There doesn't seem to be any good reason you couldn't use pattern matching to bind names applicatively for the rest of the scope, or to execute side effects, any more than there's a good reason that for statements and if statements don't make sense. But it's also hard to find a nice paradigm case from a pattern-matching language that isn't better rewritten in Python in a more traditional iterative style. Someone would need to find a good use case that really is more readable as a pattern decomposition.
It's possible that things like singledispatch and the proposed new static type annotations will gradually change Python into a language where pattern matching is a more obvious win (as the addition of iterators and generators changed it into a language where other features borrowed from Haskell and friends turned out to be a bigger win), but it's hard to anticipate that. Certainly adding features like not-quite-sum union types and not-quite-parametric generic types looks like a step in that direction, but it's way too early to tell how big of a step, and how far Python will go, and what it will look like when it gets there.
Szymon Pyzalski's proposal
After reading Szymon Pyzalski's proposal on python-ideas, there are a few ideas worth stealing from him, or at least considering.No targets in patterns
Szymon solves the tricky problem with distinguishing values from targets in patterns by just not having targets. Instead, the "as" clause handles the existing Python tuple decomposition, which handles some simple cases, and when that isn't sufficient, the match object can dump a bunch of variables into the case statement's scope. I don't like that, for multiple reasons, but let's not get into those here. However, it does make some of his other ideas fit in a lot more simply, so keep that in mind."for" instead of "case"
Trying to pick a good keyword for an addition to Python is always tricky. Adding a new keyword means that any code that used that keyword as an identifier is now illegal, and I'm sure there's code out there that has variables named "case" or "switch" or "match" or anything else you might pick. On the other hand, reusing an existing keyword runs the risk of making the syntax ambiguous to the parser and/or to a human. Szymon uses "for", and I think you can make that unambiguous to the parser... but it looks odd to a human. Especially if the pattern (or the iterable, in an existing loop statement) is long and complex, so you have to hunt for an "in" (and one not inside another expression!) to tell which type of "for" statement it is. I think introducing "case" is, all things considered, less bad, but of the existing keywords, "for" does seem to be the best bet.No keyword for subclauses
In his proposal, there's no "of" or "else" keyword, you just put the pattern followed by a colon. (For the "else" case, you use an ellipsis followed by a colon.) And this isn't just more concise, it looks cleaner.I think this means the parser has to actually treat these as clauses of the case statement, instead of (as I did) treating them as separate statements, then having a post-parser rule (equivalent to the "break must be inside a loop" rule) to only allow them inside a case, which would be harder to implement--but then again, the way I implemented it is clearly a hack, so if his way is doable, that's no big loss.
Matching on types
There are a lot of cases in Python where you'd like to match by type, without decomposing. In fact, going back to the duality between function overloading and pattern matching, all of the examples for single and multiple dispatch are also potentially examples for type-based pattern matching. Here's the motivating example from Matthew Rocklin's multipledispatch library converted to pattern matching:def add(*args): case args: of (x: int, y: int): return x+y of (x, y): return "%s + %s" % (x, y)Of course this requires another extension to the language, being able to use function-style type annotations in assignment target lists, one that we're not likely to get any time soon. (Note that PEP 484 explicitly puts off even considering a syntax like "x: int = 3" for future consideration.) Then again, we're not likely to get pattern matching any time soon, either, so maybe that's not a problem.
Without that, by adding the rule that values match pattern values if they're equal or if the pattern value is a type and the value is an instance of that type, you could write it as "of (@int, @int) as x, y:". But that only works in simple cases (again, where tuple decomposition works), and it's more verbose, and it requires the horrible "@" syntax for disambiguating a value from a target (which I hoped would be so rarely needed that it wouldn't matter how ugly it is, but it seems like this case would be pretty common if it were allowed).
Regular expressions
Many newer scripting languages have C-style case statements that can match regular expressions. Szymon includes something similar in his proposal, and takes it even farther, by providing a way to decompose the regular expressions—<P> named subexpressions are like variable names. For example:case xy: of re.compile(r'(?P<x>[0-9]+)-(?P<y>[0-9]+)'): return range(x, y)This works with his design of dumping variable names into the scope. But with ML-style decomposition with targets, it could only work if regexp syntax were part of the language syntax--as in, say, Perl.
Also, I'm not entirely sure this is a good idea anyway. There's a reason regexps aren't embedded in the language in Python as they are in Perl. While it makes a few things harder or more verbose, it also discourages people from using regexp solutions to non-regexp problems. And, in fact, Szymon's motivating example seems to be exactly one of the cases where regexps are an attractive nuisance, not an actual solution: he wants to match two numbers as either a tuple, a dict of values named x and y, or a string, separated by a hyphen. Like this (translated to my case ... of syntax and slightly simplified):
case point: of (Number, Number) as x, y: return x+y of {'x': Number, 'y': Number}: return x+y of re.compile(r'(?P<x>[0-9]+)-(?P<y>[0-9]+)'): return x+yThat looks nice, except that the third one doesn't actually match two numbers, it matches two Arabic numerals representing nonnegative integers. It won't work for "-5-3", or "3.1-3.5", or "1+2j-1-2j". The first two, on the other hand, work for any Number type--even one I didn't know about (say, if the user has imported a quaternion library).
It's also worth noting that he presents the first two as parallels to "a tuple (x, y)" and "a mapping {'x': x, 'y': y}" (in his proposal, other objects are also matched by effectively converting themselves into a tuple or a mapping, so this covers more than it appears to), and it's pretty clear that (with the addition of "where x and y are Numbers") the Python pattern is an exact translation of the for-humans pattern. But the third is presented as a parallel to "a string "{x}-{y}.format(x, y)", which not only doesn't actually match (as explained above), but is not at all readable as doing so.
Can this be hacked into Python as a module?
MacroPy comes with pattern matching as an example of the kinds of things you could add to Python using its macro facility. It essentially uses the "decorator to define both __init__ and __match__" idea above, and only classes defined with that decorator can be matched, and then it (ab)uses the with and if statements, but it seems pretty nice.If you don't want a full macro-processing import hook, there are hacks that can accomplish some of the same by, e.g., ignoring the code and inspecting the function's source, or even decompiling the function. Or, if you're willing to give up some niceties, you can write things a bit more clumsily and use a portable (non-hacky) decorator to convert. Or, with even more clunky syntax, you can do it with just a special type with overloaded operators. There are lots of implementations using every one of these tricks; Grant Jenks' PyPatt helpfully provides a quick survey of them (as well as being a good one in itself).
View comments