Python has a whole hierarchy of collection-related abstract types, described in the collections.abc module in the standard library. But there are two key, prototypical kinds. Iterators are one-shot, used for a single forward traversal, and usually lazy, generating each value on the fly as requested. Sequences--like list, tuple, or range--are repeatably iterable (generating a new iterator each time), searchable with the "in" operator, and indexable with subscript expressions (including negative indices and slices).

Of course there are other kinds of collections that are repeatably-iterable containers but not indexable like sequences (set and dict being obvious examples), and even collections that aren't containers (not to mention strings—the in operator is a substring search, not an element search, so a string contains everything it collects, but not vice-versa…). But when you want one of these, you generally know it.

When what you need is a bunch of values in a row, the key question is: do you need them repeatedly, or in random-access order? If yes to either, you want a sequence; if not, you want an iterator.

Most people who get to this point think, "But I don't need the whole sequence interface! Look at how many methods list has!" But if you actually look at what's needed, it's not hard.

Example: geometric ranges

To take a concrete example, let's say we need to loop over geometric ranges--like range, but each value is the previous value multiplied by a factor, rather than incremented by a step. So, for example, the range from 10 to 10000 by 3 gives us 10, 30, 90, 270, 810, 2430.

Iterators

If we only need to iterate over this once, we can do it pretty simply:

    def geometric_range(start, stop, factor):
        while start < stop:
            yield start
            start *= factor

There are other ways to do this. For example, you can use itertools.accumulate to repeatedly iterate a function over previous values, and itertools.takewhile to truncate it at a given value. You can build useful building blocks out of those to allow you to write a whole family of geometric-range-like iterators that are all simple one-liners. But here, we only need this one, so we might as well keep it simple.

Sequences

But what if we need to iterate the same range multiple times? Or iterate it in reverse? Or pass it around as an object? Or check whether a value is in the range? Or randomly access r[7]?

One simple solution is to just return a list of all of the values. In fact, you can just do list(geometric_range(start, stop, factor)) using the generator function above, and you're done. This is what Python 2's range function did.

That works, but it's not such a great idea if the range is large.

Virtual sequences

What you often want is the best of both worlds: something that lazily generates the values as needed, but that's still a sequence, not an iterator. Since what we're doing is very similar to range, just with geometric rather than arithmetic sequences, this shouldn't be surprising, because that's what range is. The docs describe this as a "virtual sequence".

Where do you compute the values? In the __getitem__ method, of course. And, if you look at the docs, that's almost all you have to implement to be a Sequence; the only other thing you need to write is __len__.

But how do we compute the values lazily? We've defined this as an iterated function: each value is the previous value times some factor. We obviously don't want r[7138] to have to multiply 7138 times. Fortunately, in this case, there's an easy answer: for integers, exponentiation is the same thing as repeated multiplication. So, the way to avoid multiplying 7138 times is to exponentiate once:

    class GeometricRange(collections.abc.Sequence):
        def __init__(self, start, stop, factor):
            self.start, self.stop, self.factor = start, stop, factor  
        def __getitem__(self, index):
            return self.start * (self.factor ** index)
        def __len__(self):
            pass # TODO

If you paste this into your Python interpreter, you'll see that it works. You can write r = GeometricRange(10, 10000, 3), and then r[1] is 30.

Of course this has two obvious problems: We haven't implemented __len__, which means not only len but also various other methods and functions that depend on it, will fail. And we haven't done anything with stop, which means that r[17] gives you 1291401630 instead of raising an IndexError). And it should be pretty obvious that these are related.

So, what's the length of a geometric range? Consider that __getitem__ for an arithmetic range does start + (step*index), while for a geometric range it's start * (factor**index). And __len__ for an arithmetic range is (stop-start) // step, so we need to invert things in the same way: log(stop/start, factor). Python doesn't have an "integer log" function equivalent to its // integer division. The easy way around this is to just use the floating-point math.log and then round it to integer. If you're going to want huge ranges this may not be a great idea, but we'll stick with it for now. (Note that range can take values too big to fit even into a 64-bit integer, or even a float, although calling len on one will raise an OverflowError, at least as of 3.5.)

But just implementing __len__ isn't quite sufficient; we also need to use it in __getitem__. So:

    class GeometricRange(collections.abc.Sequence):
        def __init__(self, start, stop, factor):
            self.start, self.stop, self.factor = start, stop, factor  
        def __getitem__(self, index):
            if index < 0 or index >= len(self):
                raise IndexError('{} object index out of range'.format(
                    type(self).__name__))
            return self.start * (self.factor ** index)
        def __len__(self):
            return round(math.log(self.stop//self.start, self.factor))

And that's a complete sequence. You can call any of the Sequence methods on it, or pass it to functions like reverse, etc. The Sequence class provides mixin implementations for everything necessary. For example, it makes your class iterable by defining an __iter__ method that yields self[0], then self[1], and so on, until it gets an IndexError.[1]

Everything below here is nice to have, and often worth doing. But next time you say to yourself "I don't want to build a whole sequence class", look at the above. That's a whole sequence class, and one that implements a virtual sequence. You never have to do any more than this (although often you'll want to).[2]

[1] That actually isn't strictly necessary, because Python already does that for you automatically if you try to iterate something with a __getitem__ and no __iter__. This is a leftover feature from before Python 2.3, but it's still sometimes useful. The reason the ABC provides a mixin implementation anyway is so that your type can be recognized as an iterable by the Iterable ABC.

[2] Of course not every iterated function can be converted into an analytic function, and not every function is fully invertible. So, let's say that within the boundaries of what's mathematically possible, this is all you ever need. If you want to turn some fractal recurrence into a sequence--or, say, the line numbers in a file--you'll need a different solution, such as a caching lazy sequence. (See the linecache module in the stdlib for an example.)

Missing "basic type" features

There are a few things that aren't part of being a Sequence but that you still probably want. Most types should have a repr (if it makes sense, one that's usable to create the same object in code). We probably don't need a separate str, because you aren't going to be displaying these things to end-users. (Note that range doesn't have one.) Having == work as expected is always nice. And, if you have == and your type is meant to be used as an immutable value, you should add hash (although we haven't quite enforced immutability yet; see later).

The usual way to implement these methods is just by comparing all of the attributes:

        def __repr__(self):
            return '{}({}, {}, {})'.format(
                type(self).__name__, self.start, self.stop, self.factor)

        def __eq__(self, other):
            return ((self.start, self.stop, self.factor) ==
                    (other.start, other.stop, other.factor))

        def __hash__(self):
            return hash((type(self), self.start, self.stop, self.factor))

But that isn't really correct here. Shouldn't GeometricSequence(1, 10, 2) == GeometricSequence(1, 11, 2) be true? After all, they both start at one, keep doubling, and finish with 8.

If you look at range, it uses logic like that. The intuitive idea is that two ranges are equal if they contain the same values in the same order, just like two tuples or two lists or any other kind of sequence. Put in mathematical terms so we can implement it without iterating the entire sequence: two ranges are equal if they're empty, or if they start with the same value and have no other values, or if they start with the same value and have the same step and the same length.

Notice that this also complicates hashing. Two equal values have to have the same hash. That means you can't include the start value when hashing an empty range, and you can't include the step and length when hashing a length-1 range.

So:

        def __eq__(self, other):
            if len(self) != len(other):
                return False
            if len(self) == 0:
                return True
            elif len(self) == 1:
                return self.start == other.start
            else:
                return (self.start, self.factor) == (other.start, other.factor)

        def __hash__(self):
            if len(self) == 0:
                return hash((type(self), 0, None, None))
            elif len(self) == 1:
                return hash((type(self), 1, self.start, None))
            else:
                return hash((type(self), len(self), self.start, self.factor))

While we're at it: often, when you need an unusual __eq__ method, you also need to customize pickling and (deep-)copying. Fortunately, that isn't the case here. But, just for completeness, here's what it looks like:

        def __getnewargs__(self):
            return self.start, self.stop, self.factor

Optimizing container methods

Building a sequence automatically makes it a container: the in operator works, as do the index and count methods. But they work by exhaustively searching the sequence. The same way those methods work on list and tuple. But range doesn't have to do that; you can write 10**5000 in range(1, 10**10000, 2) and it returns false instantly.

There are other types where it's worth implementing these yourself. For example, hash-based types like set can test for membership in constant time, and binary-tree-based types like a SortedTuple can do so in logarithmic time.

Anyway, it's almost always pretty simple. Usually you abstract out the actual search algorithm, then call it in the three ABC methods, like this:

    def _find(self, value)
        return round(math.log(value/self.start, self.factor))

    def index(self, value):
        idx = self._find(value)
        if 0 ≤ idx < len(self) and self[idx] == value: return idx
        raise ValueError('{} is not in range'.format(value))

    def __contains__(self, value):
        idx = self._find(value)
        return 0 ≤ idx < len(self) and self[idx] == value
    
    def count(self, value):
        return 1 if value in self else 0

While we're on optimization: For some types, it's also worth implementing __reversed__, but that isn't needed here.

Also, if you put this class through its paces, you'll see that on some platforms, while iterating an instance, it spends a lot of time calling math.log. Almost all of those calls are in len(self) calls used to bounds-test each index. You can calculate the length at initialization time and store it, then just access self._length everywhere.

Extending the sequence interface

While this is enough for a complete, efficient sequence, it doesn't have all of the features of range, tuple, or other built-in sequences, because some of their methods have more complicated interfaces than the ABC requires.

Negative indices

One of the first little things many people fall in love with Python for is that you can write lst[-1] instead of lst[len(lst)-1]. But that won't work with our class. If you want negative indices, you have to do it manually. Fortunately, it's pretty easy:

        def __getitem__(self, index):
            if index < 0:
                index += len(self)
            if index < 0 or index ≥ len(self):
                raise IndexError('{} object index out of range'.format(
                    type(self).__name__))
            return self.start * (self.factor ** index)

Slices

Python sequences also usually allow slicing. That's again something you have to implement yourself. Many classes can do something as simple as this:

        def __getitem__(self, index):
            if isinstance(index, slice):
                return [self[i] for i in range(*index.indices(len(self)))]
            # normal non-slice implementation here

But ideally, you want slicing to return the same type as the sliced object. When you write range(1000000000)[::2], you get back another range, not a list of 500 million values. So, it's back to basic math for us:

        def __getitem__(self, index):
            if isinstance(index, slice):
                start, stop, stride = index.indices(self._length)
                start = self[start]
                stop = self[stop] if stop < self._length else self.stop
                return type(self)(start, stop, self.factor ** stride)
            # rest as above

Bounded index

The index methods of most built-in sequences take optional start and stop parameters. That's easy:
       def index(self, value, start=None, stop=None):
            idx = self._find(value)
            if start is None: start = 0
            if stop is None: stop = len(self)
            if start ≤ idx < stop and self[idx] == value: return idx
            raise ValueError('{} is not in range'.format(value))

Immutability

A tuple or range is immutable. Our class is sort of immutable, in that you can't assign r[7] = 133. But you can assign r.start = 3. That's a problem--especially since, by defining __hash__ we've implied that we're immutable.

It's very hard to actually enforce immutability in Python, but you don't really have to; you just have to make it clear that the object is supposed to be immutable, and make it hard to mutate by accident. Generally, that means storing the attributes in private attributes behind read-only accessors.

While we're at it, we might as well store the attributes in __slots__ variables:

    class GeometricRange(collections.abc.Sequence):
        __slots__ = '_start _stop _factor'.split()
        def __init__(self, start, stop, factor):
            sekf._start, self._stop, self._factor = start, stop, factor
        @property
        def start(self):
            return self._start
        @property
        def stop(self):
            return self._stop
        @property
        def factor(self):
            return self._factor
        def __getitem__(self, idx): pass
        def __len__(self): pass

If you really want to be immutable, you sometimes need a __new__ method rather than __init__, especially if you expect users to subclass your type. That isn't necessary here, but just to show that it's not nearly as scary as it sounds:

    class GeometricRange(collections.abc.Sequence):
        __slots__ = '_start _stop _factor'.split()
        def __new__(cls, start, stop, factor):
            obj = super(GeometricRange, cls).__new__(cls)
            obj._start, obj._stop, obj._factor = start, stop, factor
            return obj

If you instead wanted to make the class mutable... Well, it can't be a MutableSequence, because it doesn't have __setitem__, __delitem__, and insert, and there's no reasonable meaning for any of those. (Although I guess you could implement pop at least.) But could it be mutable and a Sequence but not a MutableSequence? Sure; the various SortedList types you can find on PyPI fit into that category (because they are, in effect, a list for immutable operations but a set for mutable ones...). So, you can remove the __hash__ method and not change anything else, and that's legal. But it feels like a strange design--a mutable sequence seems like it ought to allow you to replace, add, or remove elements by index, but this type can only globally change all the elements to a whole different set of elements. At any rate, even if it weren't for that gut feeling, the fact that we're intentionally emulating range, and range is immutable, makes the choice for us.

Optional constructor arguments

Constructor arguments aren't actually part of Sequence, or any ABC. (And that's a good thing--otherwise, defaultdict would have to be a factory that returns a dict subclass instead of just being a dict subclass...) But often, you're trying to emulate some specific sequence type, and that means emulating its constructor. The range class actually has a pretty complicated constructor signature. In pseudocode:
     range([start, ]stop[, step])
Or, as shown in the actual docstring for range:
     range(stop)
     range(start, stop[, step])
Python makes it easy to specify optional trailing arguments, but how do you specify optional leading arguments? Or, how do you specify two overloaded signatures for the same name? Well, you can't. If you think about it, in the general case it's impossible ambiguous; it only makes sense here in very special cases. Generally, you're never supposed to do this, and you should treat range as an aberration from the early days of Python rather than something to be emulated. Except that here, we're explicitly writing a range-like class, so we kind of have to emulate it. And there are a few other times when the same kind of thing shows up. So, here are the two ways to do it:
        def __init__(self, start_or_stop, stop=None, step=None):
            if stop is None and factor is None:
                start, stop = 0, start_or_stop
            else:
                start = start_or_stop
            if step is None:
                step = 1

        def __init__(self, *args):
            if len(args) == 1:
                start, stop, step = 0, args[0], 1
            elif len(args) == 2:
                start, stop, step = args[0], args[1], 1
            else:
                start, stop, step = args
The first one is slightly more readable, but it's still ugly. The second one works exactly the same as range.[3] Notice that you could wrap the *args trick up in an overload-dispatcher similar to functools.singledispatch but switching on argument count rather than type. But, since you're only going to do this in very rare special cases, I don't think you want to abstract it out and make it easy. So, just pick whichever one you hate less.

[3] The first one allows you to pass None as any argument, with the same effect as leaving out the argument; range won't accept that. If you really care, you can always do _sentinel=object() in class scope, then use GeometricRange._sentinel instead of None everywhere, then del it at the end of the class definition. But why would you care?

Domain checking

It goes without saying that you need to add error handling, but in this case, it's not clear what that means everywhere. If you pass something to __getitem__ that isn't a slice or an integer (or a slice with non-integral slice values), you'll probably get an exception, but not the nice TypeError you get from range. You'll want to add a check for that. This is common to most sequence types. The three find-related methods will all raise a TypeError or do something else wrong if you pass them non-integral values. You need to check explicitly, then raise a nice ValueError, return False, and return 0, respectively. This only usually comes up with virtual sequence types, or with containers that have some other reason to care about their element types (e.g., a bucket counter for a fixed set of values).

If you pass a negative or zero factor, everything will appear to work until you try to (directly or indirectly) call __len__, at which point you'll get a ValueError: math domain error for trying to take the log of a nonpositive number. Obviously that's bad, but what's the right thing to do? One answer is just to raise a ValueError('GeometricRange() arg 3 must be positive') (akin to the equivalent range error). But GeometricRange(10, 1000, -3) makes intuitive sense as 10, -30, 90, -270. But then what about GeometricRange(10, 100, -3)? A log-based check will exclude the -270, but a simple less-than check will include it.

 If you pass non-integral numbers to the constructor, or to any of the various methods, many things go right, but some don't. If you want floats to work here, it's not hard to think things through and make everything work (although then you're losing a lot of the analogy with range). If you don't, it's pretty simple to add the tests and raise a TypeError.

But notice that range is a bit special here. Most functions that want integers mean they want something where isinstance(value, int) is true. A few mean that they want something that returns a native C long from its index slot or (if implemented in C) or an int value within native C long range from its __index__ method (if not). Normally, in Python code, you don't have to worry about the latter. But, because range elements are often used as indices into lists, the range type does worry about it. If you want to emulate that, you have to do it manually (basically, just try to call value.__index__() if the isinstance fails).

Also, along with this, range.__len__ raises an OverflowError if the range has more elements than will fit into a C long, but you probably don't want to emulate that.

Anything else?

Of course you'll want to write a docstring for the class, and for each public method. And, as implied above, there's probably some trivial error handling still to be added. But otherwise, this is pretty much as complete as you can get; I think I've covered everything you would ever need to touch even in the most esoteric virtual sequence. And it's still not that bad.
2

View comments

It's been more than a decade since Typical Programmer Greg Jorgensen taught the word about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed.

5

I haven't posted anything new in a couple years (partly because I attempted to move to a different blogging platform where I could write everything in markdown instead of HTML but got frustrated—which I may attempt again), but I've had a few private comments and emails on some of the old posts, so I

6

Looking before you leap

Python is a duck-typed language, and one where you usually trust EAFP ("Easier to Ask Forgiveness than Permission") over LBYL ("Look Before You Leap").

1

Background

Currently, CPython’s internal bytecode format stores instructions with no args as 1 byte, instructions with small args as 3 bytes, and instructions with large args as 6 bytes (actually, a 3-byte EXTENDED_ARG followed by a 3-byte real instruction).

6

If you want to skip all the tl;dr and cut to the chase, jump to Concrete Proposal.

8

Many people, when they first discover the heapq module, have two questions:

Why does it define a bunch of functions instead of a container type? Why don't those functions take a key or reverse parameter, like all the other sorting-related stuff in Python? Why not a type?

At the abstract level, it'

1

Currently, in CPython, if you want to process bytecode, either in C or in Python, it’s pretty complicated.

The built-in peephole optimizer has to do extra work fixing up jump targets and the line-number table, and just punts on many cases because they’re too hard to deal with.

3

One common "advanced question" on places like StackOverflow and python-list is "how do I dynamically create a function/method/class/whatever"? The standard answer is: first, some caveats about why you probably don't want to do that, and then an explanation of the various ways to do it when you reall

1

A few years ago, Cesare di Mauro created a project called WPython, a fork of CPython 2.6.4 that “brings many optimizations and refactorings”. The starting point of the project was replacing the bytecode with “wordcode”. However, there were a number of other changes on top of it.

1

Many languages have a for-each loop.

4

When the first betas for Swift came out, I was impressed by their collection design. In particular, the way it allows them to write map-style functions that are lazy (like Python 3), but still as full-featured as possible.

2

In a previous post, I explained in detail how lookup works in Python.

2

The documentation does a great job explaining how things normally get looked up, and how you can hook them.

But to understand how the hooking works, you need to go under the covers to see how that normal lookup actually happens.

When I say "Python" below, I'm mostly talking about CPython 3.5.

7

In Python (I'm mostly talking about CPython here, but other implementations do similar things), when you write the following:

def spam(x): return x+1 spam(3) What happens?

Really, it's not that complicated, but there's no documentation anywhere that puts it all together.

2

I've seen a number of people ask why, if you can have arbitrary-sized integers that do everything exactly, you can't do the same thing with floats, avoiding all the rounding problems that they keep running into.

2

In a recent thread on python-ideas, Stephan Sahm suggested, in effect, changing the method resolution order (MRO) from C3-linearization to a simple depth-first search a la old-school Python or C++.

1

Note: This post doesn't talk about Python that much, except as a point of comparison for JavaScript.

Most object-oriented languages out there, including Python, are class-based. But JavaScript is instead prototype-based.

1

About a year and a half ago, I wrote a blog post on the idea of adding pattern matching to Python.

I finally got around to playing with Scala semi-seriously, and I realized that they pretty much solved the same problem, in a pretty similar way to my straw man proposal, and it works great.

About a year ago, Jules Jacobs wrote a series (part 1 and part 2, with part 3 still forthcoming) on the best collections library design.

1

In three separate discussions on the Python mailing lists this month, people have objected to some design because it leaks something into the enclosing scope. But "leaks into the enclosing scope" isn't a real problem.

2

There's a lot of confusion about what the various kinds of things you can iterate over in Python. I'll attempt to collect definitions for all of the relevant terms, and provide examples, here, so I don't have to go over the same discussions in the same circles every time.

8

Python has a whole hierarchy of collection-related abstract types, described in the collections.abc module in the standard library. But there are two key, prototypical kinds. Iterators are one-shot, used for a single forward traversal, and usually lazy, generating each value on the fly as requested.

2

There are a lot of novice questions on optimizing NumPy code on StackOverflow, that make a lot of the same mistakes. I'll try to cover them all here.

What does NumPy speed up?

Let's look at some Python code that does some computation element-wise on two lists of lists.

2

When asyncio was first proposed, many people (not so much on python-ideas, where Guido first suggested it, but on external blogs) had the same reaction: Doing the core reactor loop in Python is going to be way too slow. Something based on libev, like gevent, is inherently going to be much faster.

Let's say you have a good idea for a change to Python.

1

There are hundreds of questions on StackOverflow that all ask variations of the same thing. Paraphrasing:

lst is a list of strings and numbers. I want to convert the numbers to int but leave the strings alone.

2

In Haskell, you can section infix operators. This is a simple form of partial evaluation. Using Python syntax, the following are equivalent:

(2*) lambda x: 2*x (*2) lambda x: x*2 (*) lambda x, y: x*y So, can we do the same in Python?

Grammar

The first form, (2*), is unambiguous.

1

Many people—especially people coming from Java—think that using try/except is "inelegant", or "inefficient". Or, slightly less meaninglessly, they think that "exceptions should only be for errors, not for normal flow control".

These people are not going to be happy with Python.

2

If you look at Python tutorials and sample code, proposals for new language features, blogs like this one, talks at PyCon, etc., you'll see spam, eggs, gouda, etc. all over the place.

Most control structures in most most programming languages, including Python, are subordinating conjunctions, like "if", "while", and "except", although "with" is a preposition, and "for" is a preposition used strangely (although not as strangely as in C…).

There are two ways that some Python programmers overuse lambda. Doing this almost always mkes your code less readable, and for no corresponding benefit.

1

Some languages have a very strong idiomatic style—in Python, Haskell, or Swift, the same code by two different programmers is likely to look a lot more similar than in Perl, Lisp, or C++.

There's an advantage to this—and, in particular, an advantage to you sticking to those idioms.

1

Python doesn't have a way to clone generators.

At least for a lot of simple cases, however, it's pretty obvious what cloning them should do, and being able to do so would be handy. But for a lot of other cases, it's not at all obvious.

5

Every time someone has a good idea, they believe it should be in the stdlib. After all, it's useful to many people, and what's the harm? But of course there is a harm.

3

This confuses every Python developer the first time they see it—even if they're pretty experienced by the time they see it:

>>> t = ([], []) >>> t[0] += [1] --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <stdin> in <module>()

11
Blog Archive
About Me
About Me
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.