Python has a whole hierarchy of collection-related abstract types, described in the collections.abc module in the standard library. But there are two key, prototypical kinds. Iterators are one-shot, used for a single forward traversal, and usually lazy, generating each value on the fly as requested. Sequences--like list, tuple, or range--are repeatably iterable (generating a new iterator each time), searchable with the "in" operator, and indexable with subscript expressions (including negative indices and slices).

Of course there are other kinds of collections that are repeatably-iterable containers but not indexable like sequences (set and dict being obvious examples), and even collections that aren't containers (not to mention strings—the in operator is a substring search, not an element search, so a string contains everything it collects, but not vice-versa…). But when you want one of these, you generally know it.

When what you need is a bunch of values in a row, the key question is: do you need them repeatedly, or in random-access order? If yes to either, you want a sequence; if not, you want an iterator.

Most people who get to this point think, "But I don't need the whole sequence interface! Look at how many methods list has!" But if you actually look at what's needed, it's not hard.

Example: geometric ranges

To take a concrete example, let's say we need to loop over geometric ranges--like range, but each value is the previous value multiplied by a factor, rather than incremented by a step. So, for example, the range from 10 to 10000 by 3 gives us 10, 30, 90, 270, 810, 2430.

Iterators

If we only need to iterate over this once, we can do it pretty simply:

    def geometric_range(start, stop, factor):
        while start < stop:
            yield start
            start *= factor

There are other ways to do this. For example, you can use itertools.accumulate to repeatedly iterate a function over previous values, and itertools.takewhile to truncate it at a given value. You can build useful building blocks out of those to allow you to write a whole family of geometric-range-like iterators that are all simple one-liners. But here, we only need this one, so we might as well keep it simple.

Sequences

But what if we need to iterate the same range multiple times? Or iterate it in reverse? Or pass it around as an object? Or check whether a value is in the range? Or randomly access r[7]?

One simple solution is to just return a list of all of the values. In fact, you can just do list(geometric_range(start, stop, factor)) using the generator function above, and you're done. This is what Python 2's range function did.

That works, but it's not such a great idea if the range is large.

Virtual sequences

What you often want is the best of both worlds: something that lazily generates the values as needed, but that's still a sequence, not an iterator. Since what we're doing is very similar to range, just with geometric rather than arithmetic sequences, this shouldn't be surprising, because that's what range is. The docs describe this as a "virtual sequence".

Where do you compute the values? In the __getitem__ method, of course. And, if you look at the docs, that's almost all you have to implement to be a Sequence; the only other thing you need to write is __len__.

But how do we compute the values lazily? We've defined this as an iterated function: each value is the previous value times some factor. We obviously don't want r[7138] to have to multiply 7138 times. Fortunately, in this case, there's an easy answer: for integers, exponentiation is the same thing as repeated multiplication. So, the way to avoid multiplying 7138 times is to exponentiate once:

    class GeometricRange(collections.abc.Sequence):
        def __init__(self, start, stop, factor):
            self.start, self.stop, self.factor = start, stop, factor  
        def __getitem__(self, index):
            return self.start * (self.factor ** index)
        def __len__(self):
            pass # TODO

If you paste this into your Python interpreter, you'll see that it works. You can write r = GeometricRange(10, 10000, 3), and then r[1] is 30.

Of course this has two obvious problems: We haven't implemented __len__, which means not only len but also various other methods and functions that depend on it, will fail. And we haven't done anything with stop, which means that r[17] gives you 1291401630 instead of raising an IndexError). And it should be pretty obvious that these are related.

So, what's the length of a geometric range? Consider that __getitem__ for an arithmetic range does start + (step*index), while for a geometric range it's start * (factor**index). And __len__ for an arithmetic range is (stop-start) // step, so we need to invert things in the same way: log(stop/start, factor). Python doesn't have an "integer log" function equivalent to its // integer division. The easy way around this is to just use the floating-point math.log and then round it to integer. If you're going to want huge ranges this may not be a great idea, but we'll stick with it for now. (Note that range can take values too big to fit even into a 64-bit integer, or even a float, although calling len on one will raise an OverflowError, at least as of 3.5.)

But just implementing __len__ isn't quite sufficient; we also need to use it in __getitem__. So:

    class GeometricRange(collections.abc.Sequence):
        def __init__(self, start, stop, factor):
            self.start, self.stop, self.factor = start, stop, factor  
        def __getitem__(self, index):
            if index < 0 or index >= len(self):
                raise IndexError('{} object index out of range'.format(
                    type(self).__name__))
            return self.start * (self.factor ** index)
        def __len__(self):
            return round(math.log(self.stop//self.start, self.factor))

And that's a complete sequence. You can call any of the Sequence methods on it, or pass it to functions like reverse, etc. The Sequence class provides mixin implementations for everything necessary. For example, it makes your class iterable by defining an __iter__ method that yields self[0], then self[1], and so on, until it gets an IndexError.[1]

Everything below here is nice to have, and often worth doing. But next time you say to yourself "I don't want to build a whole sequence class", look at the above. That's a whole sequence class, and one that implements a virtual sequence. You never have to do any more than this (although often you'll want to).[2]

[1] That actually isn't strictly necessary, because Python already does that for you automatically if you try to iterate something with a __getitem__ and no __iter__. This is a leftover feature from before Python 2.3, but it's still sometimes useful. The reason the ABC provides a mixin implementation anyway is so that your type can be recognized as an iterable by the Iterable ABC.

[2] Of course not every iterated function can be converted into an analytic function, and not every function is fully invertible. So, let's say that within the boundaries of what's mathematically possible, this is all you ever need. If you want to turn some fractal recurrence into a sequence--or, say, the line numbers in a file--you'll need a different solution, such as a caching lazy sequence. (See the linecache module in the stdlib for an example.)

Missing "basic type" features

There are a few things that aren't part of being a Sequence but that you still probably want. Most types should have a repr (if it makes sense, one that's usable to create the same object in code). We probably don't need a separate str, because you aren't going to be displaying these things to end-users. (Note that range doesn't have one.) Having == work as expected is always nice. And, if you have == and your type is meant to be used as an immutable value, you should add hash (although we haven't quite enforced immutability yet; see later).

The usual way to implement these methods is just by comparing all of the attributes:

        def __repr__(self):
            return '{}({}, {}, {})'.format(
                type(self).__name__, self.start, self.stop, self.factor)

        def __eq__(self, other):
            return ((self.start, self.stop, self.factor) ==
                    (other.start, other.stop, other.factor))

        def __hash__(self):
            return hash((type(self), self.start, self.stop, self.factor))

But that isn't really correct here. Shouldn't GeometricSequence(1, 10, 2) == GeometricSequence(1, 11, 2) be true? After all, they both start at one, keep doubling, and finish with 8.

If you look at range, it uses logic like that. The intuitive idea is that two ranges are equal if they contain the same values in the same order, just like two tuples or two lists or any other kind of sequence. Put in mathematical terms so we can implement it without iterating the entire sequence: two ranges are equal if they're empty, or if they start with the same value and have no other values, or if they start with the same value and have the same step and the same length.

Notice that this also complicates hashing. Two equal values have to have the same hash. That means you can't include the start value when hashing an empty range, and you can't include the step and length when hashing a length-1 range.

So:

        def __eq__(self, other):
            if len(self) != len(other):
                return False
            if len(self) == 0:
                return True
            elif len(self) == 1:
                return self.start == other.start
            else:
                return (self.start, self.factor) == (other.start, other.factor)

        def __hash__(self):
            if len(self) == 0:
                return hash((type(self), 0, None, None))
            elif len(self) == 1:
                return hash((type(self), 1, self.start, None))
            else:
                return hash((type(self), len(self), self.start, self.factor))

While we're at it: often, when you need an unusual __eq__ method, you also need to customize pickling and (deep-)copying. Fortunately, that isn't the case here. But, just for completeness, here's what it looks like:

        def __getnewargs__(self):
            return self.start, self.stop, self.factor

Optimizing container methods

Building a sequence automatically makes it a container: the in operator works, as do the index and count methods. But they work by exhaustively searching the sequence. The same way those methods work on list and tuple. But range doesn't have to do that; you can write 10**5000 in range(1, 10**10000, 2) and it returns false instantly.

There are other types where it's worth implementing these yourself. For example, hash-based types like set can test for membership in constant time, and binary-tree-based types like a SortedTuple can do so in logarithmic time.

Anyway, it's almost always pretty simple. Usually you abstract out the actual search algorithm, then call it in the three ABC methods, like this:

    def _find(self, value)
        return round(math.log(value/self.start, self.factor))

    def index(self, value):
        idx = self._find(value)
        if 0 ≤ idx < len(self) and self[idx] == value: return idx
        raise ValueError('{} is not in range'.format(value))

    def __contains__(self, value):
        idx = self._find(value)
        return 0 ≤ idx < len(self) and self[idx] == value
    
    def count(self, value):
        return 1 if value in self else 0

While we're on optimization: For some types, it's also worth implementing __reversed__, but that isn't needed here.

Also, if you put this class through its paces, you'll see that on some platforms, while iterating an instance, it spends a lot of time calling math.log. Almost all of those calls are in len(self) calls used to bounds-test each index. You can calculate the length at initialization time and store it, then just access self._length everywhere.

Extending the sequence interface

While this is enough for a complete, efficient sequence, it doesn't have all of the features of range, tuple, or other built-in sequences, because some of their methods have more complicated interfaces than the ABC requires.

Negative indices

One of the first little things many people fall in love with Python for is that you can write lst[-1] instead of lst[len(lst)-1]. But that won't work with our class. If you want negative indices, you have to do it manually. Fortunately, it's pretty easy:

        def __getitem__(self, index):
            if index < 0:
                index += len(self)
            if index < 0 or index ≥ len(self):
                raise IndexError('{} object index out of range'.format(
                    type(self).__name__))
            return self.start * (self.factor ** index)

Slices

Python sequences also usually allow slicing. That's again something you have to implement yourself. Many classes can do something as simple as this:

        def __getitem__(self, index):
            if isinstance(index, slice):
                return [self[i] for i in range(*index.indices(len(self)))]
            # normal non-slice implementation here

But ideally, you want slicing to return the same type as the sliced object. When you write range(1000000000)[::2], you get back another range, not a list of 500 million values. So, it's back to basic math for us:

        def __getitem__(self, index):
            if isinstance(index, slice):
                start, stop, stride = index.indices(self._length)
                start = self[start]
                stop = self[stop] if stop < self._length else self.stop
                return type(self)(start, stop, self.factor ** stride)
            # rest as above

Bounded index

The index methods of most built-in sequences take optional start and stop parameters. That's easy:
       def index(self, value, start=None, stop=None):
            idx = self._find(value)
            if start is None: start = 0
            if stop is None: stop = len(self)
            if start ≤ idx < stop and self[idx] == value: return idx
            raise ValueError('{} is not in range'.format(value))

Immutability

A tuple or range is immutable. Our class is sort of immutable, in that you can't assign r[7] = 133. But you can assign r.start = 3. That's a problem--especially since, by defining __hash__ we've implied that we're immutable.

It's very hard to actually enforce immutability in Python, but you don't really have to; you just have to make it clear that the object is supposed to be immutable, and make it hard to mutate by accident. Generally, that means storing the attributes in private attributes behind read-only accessors.

While we're at it, we might as well store the attributes in __slots__ variables:

    class GeometricRange(collections.abc.Sequence):
        __slots__ = '_start _stop _factor'.split()
        def __init__(self, start, stop, factor):
            sekf._start, self._stop, self._factor = start, stop, factor
        @property
        def start(self):
            return self._start
        @property
        def stop(self):
            return self._stop
        @property
        def factor(self):
            return self._factor
        def __getitem__(self, idx): pass
        def __len__(self): pass

If you really want to be immutable, you sometimes need a __new__ method rather than __init__, especially if you expect users to subclass your type. That isn't necessary here, but just to show that it's not nearly as scary as it sounds:

    class GeometricRange(collections.abc.Sequence):
        __slots__ = '_start _stop _factor'.split()
        def __new__(cls, start, stop, factor):
            obj = super(GeometricRange, cls).__new__(cls)
            obj._start, obj._stop, obj._factor = start, stop, factor
            return obj

If you instead wanted to make the class mutable... Well, it can't be a MutableSequence, because it doesn't have __setitem__, __delitem__, and insert, and there's no reasonable meaning for any of those. (Although I guess you could implement pop at least.) But could it be mutable and a Sequence but not a MutableSequence? Sure; the various SortedList types you can find on PyPI fit into that category (because they are, in effect, a list for immutable operations but a set for mutable ones...). So, you can remove the __hash__ method and not change anything else, and that's legal. But it feels like a strange design--a mutable sequence seems like it ought to allow you to replace, add, or remove elements by index, but this type can only globally change all the elements to a whole different set of elements. At any rate, even if it weren't for that gut feeling, the fact that we're intentionally emulating range, and range is immutable, makes the choice for us.

Optional constructor arguments

Constructor arguments aren't actually part of Sequence, or any ABC. (And that's a good thing--otherwise, defaultdict would have to be a factory that returns a dict subclass instead of just being a dict subclass...) But often, you're trying to emulate some specific sequence type, and that means emulating its constructor. The range class actually has a pretty complicated constructor signature. In pseudocode:
     range([start, ]stop[, step])
Or, as shown in the actual docstring for range:
     range(stop)
     range(start, stop[, step])
Python makes it easy to specify optional trailing arguments, but how do you specify optional leading arguments? Or, how do you specify two overloaded signatures for the same name? Well, you can't. If you think about it, in the general case it's impossible ambiguous; it only makes sense here in very special cases. Generally, you're never supposed to do this, and you should treat range as an aberration from the early days of Python rather than something to be emulated. Except that here, we're explicitly writing a range-like class, so we kind of have to emulate it. And there are a few other times when the same kind of thing shows up. So, here are the two ways to do it:
        def __init__(self, start_or_stop, stop=None, step=None):
            if stop is None and factor is None:
                start, stop = 0, start_or_stop
            else:
                start = start_or_stop
            if step is None:
                step = 1

        def __init__(self, *args):
            if len(args) == 1:
                start, stop, step = 0, args[0], 1
            elif len(args) == 2:
                start, stop, step = args[0], args[1], 1
            else:
                start, stop, step = args
The first one is slightly more readable, but it's still ugly. The second one works exactly the same as range.[3] Notice that you could wrap the *args trick up in an overload-dispatcher similar to functools.singledispatch but switching on argument count rather than type. But, since you're only going to do this in very rare special cases, I don't think you want to abstract it out and make it easy. So, just pick whichever one you hate less.

[3] The first one allows you to pass None as any argument, with the same effect as leaving out the argument; range won't accept that. If you really care, you can always do _sentinel=object() in class scope, then use GeometricRange._sentinel instead of None everywhere, then del it at the end of the class definition. But why would you care?

Domain checking

It goes without saying that you need to add error handling, but in this case, it's not clear what that means everywhere. If you pass something to __getitem__ that isn't a slice or an integer (or a slice with non-integral slice values), you'll probably get an exception, but not the nice TypeError you get from range. You'll want to add a check for that. This is common to most sequence types. The three find-related methods will all raise a TypeError or do something else wrong if you pass them non-integral values. You need to check explicitly, then raise a nice ValueError, return False, and return 0, respectively. This only usually comes up with virtual sequence types, or with containers that have some other reason to care about their element types (e.g., a bucket counter for a fixed set of values).

If you pass a negative or zero factor, everything will appear to work until you try to (directly or indirectly) call __len__, at which point you'll get a ValueError: math domain error for trying to take the log of a nonpositive number. Obviously that's bad, but what's the right thing to do? One answer is just to raise a ValueError('GeometricRange() arg 3 must be positive') (akin to the equivalent range error). But GeometricRange(10, 1000, -3) makes intuitive sense as 10, -30, 90, -270. But then what about GeometricRange(10, 100, -3)? A log-based check will exclude the -270, but a simple less-than check will include it.

 If you pass non-integral numbers to the constructor, or to any of the various methods, many things go right, but some don't. If you want floats to work here, it's not hard to think things through and make everything work (although then you're losing a lot of the analogy with range). If you don't, it's pretty simple to add the tests and raise a TypeError.

But notice that range is a bit special here. Most functions that want integers mean they want something where isinstance(value, int) is true. A few mean that they want something that returns a native C long from its index slot or (if implemented in C) or an int value within native C long range from its __index__ method (if not). Normally, in Python code, you don't have to worry about the latter. But, because range elements are often used as indices into lists, the range type does worry about it. If you want to emulate that, you have to do it manually (basically, just try to call value.__index__() if the isinstance fails).

Also, along with this, range.__len__ raises an OverflowError if the range has more elements than will fit into a C long, but you probably don't want to emulate that.

Anything else?

Of course you'll want to write a docstring for the class, and for each public method. And, as implied above, there's probably some trivial error handling still to be added. But otherwise, this is pretty much as complete as you can get; I think I've covered everything you would ever need to touch even in the most esoteric virtual sequence. And it's still not that bad.
2

View comments

Blog Archive
About Me
About Me
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.