collections.abc
module in the standard library. But there are two key, prototypical kinds. Iterators are one-shot, used for a single forward traversal, and usually lazy, generating each value on the fly as requested. Sequences--like list, tuple, or range--are repeatably iterable (generating a new iterator each time), searchable with the "in" operator, and indexable with subscript expressions (including negative indices and slices).Of course there are other kinds of collections that are repeatably-iterable containers but not indexable like sequences (
set
and dict
being obvious examples), and even collections that aren't containers (not to mention strings—the in
operator is a substring search, not an element search, so a string contains everything it collects, but not vice-versa…). But when you want one of these, you generally know it.When what you need is a bunch of values in a row, the key question is: do you need them repeatedly, or in random-access order? If yes to either, you want a sequence; if not, you want an iterator.
Most people who get to this point think, "But I don't need the whole sequence interface! Look at how many methods list has!" But if you actually look at what's needed, it's not hard.
Example: geometric ranges
To take a concrete example, let's say we need to loop over geometric ranges--likerange
, but each value is the previous value multiplied by a factor, rather than incremented by a step. So, for example, the range from 10 to 10000 by 3 gives us 10, 30, 90, 270, 810, 2430.Iterators
If we only need to iterate over this once, we can do it pretty simply:def geometric_range(start, stop, factor): while start < stop: yield start start *= factor
There are other ways to do this. For example, you can use
itertools.accumulate
to repeatedly iterate a function over previous values, and itertools.takewhile
to truncate it at a given value. You can build useful building blocks out of those to allow you to write a whole family of geometric-range-like iterators that are all simple one-liners. But here, we only need this one, so we might as well keep it simple.Sequences
But what if we need to iterate the same range multiple times? Or iterate it in reverse? Or pass it around as an object? Or check whether a value is in the range? Or randomly accessr[7]
?One simple solution is to just return a list of all of the values. In fact, you can just do
list(geometric_range(start, stop, factor))
using the generator function above, and you're done. This is what Python 2's range
function did.That works, but it's not such a great idea if the range is large.
Virtual sequences
What you often want is the best of both worlds: something that lazily generates the values as needed, but that's still a sequence, not an iterator. Since what we're doing is very similar torange
, just with geometric rather than arithmetic sequences, this shouldn't be surprising, because that's what range
is. The docs describe this as a "virtual sequence".Where do you compute the values? In the
__getitem__
method, of course. And, if you look at the docs, that's almost all you have to implement to be a Sequence
; the only other thing you need to write is __len__
.But how do we compute the values lazily? We've defined this as an iterated function: each value is the previous value times some factor. We obviously don't want
r[7138]
to have to multiply 7138 times. Fortunately, in this case, there's an easy answer: for integers, exponentiation is the same thing as repeated multiplication. So, the way to avoid multiplying 7138 times is to exponentiate once:class GeometricRange(collections.abc.Sequence): def __init__(self, start, stop, factor): self.start, self.stop, self.factor = start, stop, factor def __getitem__(self, index): return self.start * (self.factor ** index) def __len__(self): pass # TODO
If you paste this into your Python interpreter, you'll see that it works. You can write
r = GeometricRange(10, 10000, 3)
, and then r[1]
is 30.Of course this has two obvious problems: We haven't implemented
__len__
, which means not only len
but also various other methods and functions that depend on it, will fail. And we haven't done anything with stop
, which means that r[17]
gives you 1291401630 instead of raising an IndexError
). And it should be pretty obvious that these are related.So, what's the length of a geometric range? Consider that
__getitem__
for an arithmetic range does start + (step*index)
, while for a geometric range it's start * (factor**index)
. And __len__
for an arithmetic range is (stop-start) // step
, so we need to invert things in the same way: log(stop/start, factor)
. Python doesn't have an "integer log" function equivalent to its //
integer division. The easy way around this is to just use the floating-point math.log
and then round
it to integer. If you're going to want huge ranges this may not be a great idea, but we'll stick with it for now. (Note that range
can take values too big to fit even into a 64-bit integer, or even a float, although calling len
on one will raise an OverflowError
, at least as of 3.5.)But just implementing
__len__
isn't quite sufficient; we also need to use it in __getitem__
. So:class GeometricRange(collections.abc.Sequence): def __init__(self, start, stop, factor): self.start, self.stop, self.factor = start, stop, factor def __getitem__(self, index): if index < 0 or index >= len(self): raise IndexError('{} object index out of range'.format( type(self).__name__)) return self.start * (self.factor ** index) def __len__(self): return round(math.log(self.stop//self.start, self.factor))
And that's a complete sequence. You can call any of the
Sequence
methods on it, or pass it to functions like reverse
, etc. The Sequence
class provides mixin implementations for everything necessary. For example, it makes your class iterable by defining an __iter__
method that yields self[0]
, then self[1]
, and so on, until it gets an IndexError
.[1]Everything below here is nice to have, and often worth doing. But next time you say to yourself "I don't want to build a whole sequence class", look at the above. That's a whole sequence class, and one that implements a virtual sequence. You never have to do any more than this (although often you'll want to).[2]
[1] That actually isn't strictly necessary, because Python already does that for you automatically if you try to iterate something with a
__getitem__
and no __iter__
. This is a leftover feature from before Python 2.3, but it's still sometimes useful. The reason the ABC provides a mixin implementation anyway is so that your type can be recognized as an iterable by the Iterable
ABC.[2] Of course not every iterated function can be converted into an analytic function, and not every function is fully invertible. So, let's say that within the boundaries of what's mathematically possible, this is all you ever need. If you want to turn some fractal recurrence into a sequence--or, say, the line numbers in a file--you'll need a different solution, such as a caching lazy sequence. (See the
linecache
module in the stdlib for an example.)Missing "basic type" features
There are a few things that aren't part of being aSequence
but that you still probably want. Most types should have a repr
(if it makes sense, one that's usable to create the same object in code). We probably don't need a separate str
, because you aren't going to be displaying these things to end-users. (Note that range
doesn't have one.) Having ==
work as expected is always nice. And, if you have ==
and your type is meant to be used as an immutable value, you should add hash
(although we haven't quite enforced immutability yet; see later).The usual way to implement these methods is just by comparing all of the attributes:
def __repr__(self): return '{}({}, {}, {})'.format( type(self).__name__, self.start, self.stop, self.factor) def __eq__(self, other): return ((self.start, self.stop, self.factor) == (other.start, other.stop, other.factor)) def __hash__(self): return hash((type(self), self.start, self.stop, self.factor))
But that isn't really correct here. Shouldn't
GeometricSequence(1, 10, 2) == GeometricSequence(1, 11, 2)
be true? After all, they both start at one, keep doubling, and finish with 8.If you look at
range
, it uses logic like that. The intuitive idea is that two ranges are equal if they contain the same values in the same order, just like two tuples or two lists or any other kind of sequence. Put in mathematical terms so we can implement it without iterating the entire sequence: two ranges are equal if they're empty, or if they start with the same value and have no other values, or if they start with the same value and have the same step and the same length.Notice that this also complicates hashing. Two equal values have to have the same hash. That means you can't include the start value when hashing an empty range, and you can't include the step and length when hashing a length-1 range.
So:
def __eq__(self, other): if len(self) != len(other): return False if len(self) == 0: return True elif len(self) == 1: return self.start == other.start else: return (self.start, self.factor) == (other.start, other.factor) def __hash__(self): if len(self) == 0: return hash((type(self), 0, None, None)) elif len(self) == 1: return hash((type(self), 1, self.start, None)) else: return hash((type(self), len(self), self.start, self.factor))
While we're at it: often, when you need an unusual
__eq__
method, you also need to customize pickling and (deep-)copying. Fortunately, that isn't the case here. But, just for completeness, here's what it looks like:def __getnewargs__(self): return self.start, self.stop, self.factor
Optimizing container methods
Building a sequence automatically makes it a container: thein
operator works, as do the index
and count
methods. But they work by exhaustively searching the sequence. The same way those methods work on list
and tuple
. But range
doesn't have to do that; you can write 10**5000 in range(1, 10**10000, 2)
and it returns false instantly.There are other types where it's worth implementing these yourself. For example, hash-based types like
set
can test for membership in constant time, and binary-tree-based types like a SortedTuple
can do so in logarithmic time.Anyway, it's almost always pretty simple. Usually you abstract out the actual search algorithm, then call it in the three ABC methods, like this:
def _find(self, value) return round(math.log(value/self.start, self.factor)) def index(self, value): idx = self._find(value) if 0 ≤ idx < len(self) and self[idx] == value: return idx raise ValueError('{} is not in range'.format(value)) def __contains__(self, value): idx = self._find(value) return 0 ≤ idx < len(self) and self[idx] == value def count(self, value): return 1 if value in self else 0
While we're on optimization: For some types, it's also worth implementing
__reversed__
, but that isn't needed here.Also, if you put this class through its paces, you'll see that on some platforms, while iterating an instance, it spends a lot of time calling
math.log
. Almost all of those calls are in len(self)
calls used to bounds-test each index. You can calculate the length at initialization time and store it, then just access self._length
everywhere.Extending the sequence interface
While this is enough for a complete, efficient sequence, it doesn't have all of the features ofrange
, tuple
, or other built-in sequences, because some of their methods have more complicated interfaces than the ABC requires.Negative indices
One of the first little things many people fall in love with Python for is that you can writelst[-1]
instead of lst[len(lst)-1]
. But that won't work with our class. If you want negative indices, you have to do it manually. Fortunately, it's pretty easy:def __getitem__(self, index): if index < 0: index += len(self) if index < 0 or index ≥ len(self): raise IndexError('{} object index out of range'.format( type(self).__name__)) return self.start * (self.factor ** index)
Slices
Python sequences also usually allow slicing. That's again something you have to implement yourself. Many classes can do something as simple as this:def __getitem__(self, index): if isinstance(index, slice): return [self[i] for i in range(*index.indices(len(self)))] # normal non-slice implementation here
But ideally, you want slicing to return the same type as the sliced object. When you write
range(1000000000)[::2]
, you get back another range
, not a list of 500 million values. So, it's back to basic math for us:def __getitem__(self, index): if isinstance(index, slice): start, stop, stride = index.indices(self._length) start = self[start] stop = self[stop] if stop < self._length else self.stop return type(self)(start, stop, self.factor ** stride) # rest as above
Bounded index
The index
methods of most built-in sequences take optional start
and stop
parameters. That's easy:
def index(self, value, start=None, stop=None): idx = self._find(value) if start is None: start = 0 if stop is None: stop = len(self) if start ≤ idx < stop and self[idx] == value: return idx raise ValueError('{} is not in range'.format(value))
Immutability
Atuple
or range
is immutable. Our class is sort of immutable, in that you can't assign r[7] = 133
. But you can assign r.start = 3
. That's a problem--especially since, by defining __hash__
we've implied that we're immutable.It's very hard to actually enforce immutability in Python, but you don't really have to; you just have to make it clear that the object is supposed to be immutable, and make it hard to mutate by accident. Generally, that means storing the attributes in private attributes behind read-only accessors.
While we're at it, we might as well store the attributes in
__slots__
variables:class GeometricRange(collections.abc.Sequence): __slots__ = '_start _stop _factor'.split() def __init__(self, start, stop, factor): sekf._start, self._stop, self._factor = start, stop, factor @property def start(self): return self._start @property def stop(self): return self._stop @property def factor(self): return self._factor def __getitem__(self, idx): pass def __len__(self): pass
If you really want to be immutable, you sometimes need a
__new__
method rather than __init__
, especially if you expect users to subclass your type. That isn't necessary here, but just to show that it's not nearly as scary as it sounds:
class GeometricRange(collections.abc.Sequence): __slots__ = '_start _stop _factor'.split() def __new__(cls, start, stop, factor): obj = super(GeometricRange, cls).__new__(cls) obj._start, obj._stop, obj._factor = start, stop, factor return obj
If you instead wanted to make the class mutable... Well, it can't be a
MutableSequence
, because it doesn't have __setitem__
, __delitem__
, and insert
, and there's no reasonable meaning for any of those. (Although I guess you could implement pop
at least.) But could it be mutable and a Sequence
but not a MutableSequence
? Sure; the various SortedList
types you can find on PyPI
fit into that category (because they are, in effect, a list for immutable operations but a set for mutable ones...). So, you can remove the __hash__
method and not change anything else, and that's legal. But it feels like a strange design--a mutable sequence seems like it ought to allow you to replace, add, or remove elements by index, but this type can only globally change all the elements to a whole different set of elements.
At any rate, even if it weren't for that gut feeling, the fact that we're intentionally emulating range
, and range
is immutable, makes the choice for us.
Optional constructor arguments
Constructor arguments aren't actually part ofSequence
, or any ABC. (And that's a good thing--otherwise, defaultdict
would have to be a factory that returns a dict subclass instead of just being a dict subclass...) But often, you're trying to emulate some specific sequence type, and that means emulating its constructor.
The range
class actually has a pretty complicated constructor signature. In pseudocode:
range([start, ]stop[, step])Or, as shown in the actual docstring for
range
:
range(stop) range(start, stop[, step])Python makes it easy to specify optional trailing arguments, but how do you specify optional leading arguments? Or, how do you specify two overloaded signatures for the same name? Well, you can't. If you think about it, in the general case it's impossible ambiguous; it only makes sense here in very special cases. Generally, you're never supposed to do this, and you should treat
range
as an aberration from the early days of Python rather than something to be emulated. Except that here, we're explicitly writing a range
-like class, so we kind of have to emulate it. And there are a few other times when the same kind of thing shows up. So, here are the two ways to do it:
def __init__(self, start_or_stop, stop=None, step=None): if stop is None and factor is None: start, stop = 0, start_or_stop else: start = start_or_stop if step is None: step = 1 def __init__(self, *args): if len(args) == 1: start, stop, step = 0, args[0], 1 elif len(args) == 2: start, stop, step = args[0], args[1], 1 else: start, stop, step = argsThe first one is slightly more readable, but it's still ugly. The second one works exactly the same as
range
.[3]
Notice that you could wrap the *args
trick up in an overload-dispatcher similar to functools.singledispatch
but switching on argument count rather than type. But, since you're only going to do this in very rare special cases, I don't think you want to abstract it out and make it easy. So, just pick whichever one you hate less.[3] The first one allows you to pass
None
as any argument, with the same effect as leaving out the argument; range
won't accept that. If you really care, you can always do _sentinel=object()
in class scope, then use GeometricRange._sentinel
instead of None
everywhere, then del
it at the end of the class definition. But why would you care?
Domain checking
It goes without saying that you need to add error handling, but in this case, it's not clear what that means everywhere. If you pass something to__getitem__
that isn't a slice or an integer (or a slice with non-integral slice values), you'll probably get an exception, but not the nice TypeError
you get from range
. You'll want to add a check for that. This is common to most sequence types.
The three find-related methods will all raise a TypeError
or do something else wrong if you pass them non-integral values. You need to check explicitly, then raise a nice ValueError
, return False
, and return 0
, respectively. This only usually comes up with virtual sequence types, or with containers that have some other reason to care about their element types (e.g., a bucket counter for a fixed set of values).If you pass a negative or zero factor, everything will appear to work until you try to (directly or indirectly) call
__len__
, at which point you'll get a ValueError: math domain error
for trying to take the log of a nonpositive number. Obviously that's bad, but what's the right thing to do? One answer is just to raise a ValueError('GeometricRange() arg 3 must be positive')
(akin to the equivalent range
error). But GeometricRange(10, 1000, -3)
makes intuitive sense as 10, -30, 90, -270. But then what about GeometricRange(10, 100, -3)
? A log-based check will exclude the -270, but a simple less-than check will include it.If you pass non-integral numbers to the constructor, or to any of the various methods, many things go right, but some don't. If you want floats to work here, it's not hard to think things through and make everything work (although then you're losing a lot of the analogy with
range
). If you don't, it's pretty simple to add the tests and raise a TypeError
.But notice that
range
is a bit special here. Most functions that want integers mean they want something where isinstance(value, int)
is true. A few mean that they want something that returns a native C long from its index slot or (if implemented in C) or an int value within native C long range from its __index__
method (if not). Normally, in Python code, you don't have to worry about the latter. But, because range
elements are often used as indices into list
s, the range
type does worry about it. If you want to emulate that, you have to do it manually (basically, just try
to call value.__index__()
if the isinstance
fails).Also, along with this,
range.__len__
raises an OverflowError
if the range has more elements than will fit into a C long, but you probably don't want to emulate that.
View comments