In Sorted Collections in the stdlib, I went over why I think we want sorted collections in the stdlib, what they should look like, and why we don't have them.

The biggest stumbling block is that we don't have a good implementation written by someone willing to contribute it (including making any changes that come out of the PEP process) and maintain it into the future. Of course with most non-copyleft open-source libraries, anyone could just fork the library and contribute and maintain that fork. But most of them aren't really close enough to do that. In particular, most of them either don't have a pure-Python implementation, or it's incomplete, or it's much slower than the C implementation; porting someone else's C code to pure Python and then optimizing can be a pretty big project.

In a comment Grant Jenks pointed to his SortedContainers library (or at least someone named Grant pointed to Grant Jenks's library), and it looks great as a potential starting point. It's pure-Python, and appears to be fast, clean, and well-written. It seems at least plausible that he'd be willing to contribute it—and, if not, it would probably be easy for someone else to fork and contribute it.

So, let me review that library.

Implementation

From the docs:
The sorted container types are implemented based on a single observation: bisect.insort is fast, really fast. This is somewhat counter-intuitive since most schools teach that shifting elements in an array is slow. But modern processors do this really well. A lot of time has been spent optimizing memcopy/memmove-like operations both in hardware and software.

But using only one list and bisect.insort would produce sluggish behavior for lengths exceeding one thousand. So the implementation of SortedList uses a list of lists to store values. In this way, inserting or deleting is most often performed on a short list. Only rarely does a new list need to be added or deleted.

Basic performance

Somewhat like blist, this is a wide b-tree-like structure rather than a binary tree, and therefore benefits from the speed of bulk moves and cache/page locality.

Also, using such a simple implementation presumably makes it a lot easier to implement this efficiently in pure Python. Implementing linked lists or trees in Python can be a lot slower and larger than in C, but for resizable arrays, it's hard to beat Python's list type.

At small and medium sizes, these collections seem competitive with, and maybe even a little faster than, custom C collections. And that's in CPython; in PyPy they might even blow them away.

Unfortunately, the performance tests on the website only go so far; they just test how quickly each single operation works. If you're not mixing up mutations and accesses, you're not testing that much; you could probably beat any sorted collection implementation by just adding onto a plain list, sorting it, then accessing it with bisect.

More importantly, the graphs only go up to 10K.

Scalability

Unlike blist, SortedContainers' B-tree-list structure is fixed at two levels, which seems like it places an upper limit on the usable size. Also unlike blist, the bin width is tunable, using a load argument to the constructors. This means that the upper limit is presumably tunable. But the fact that it needs to be tuned seems to argue against its use in the stdlib as a general-purpose sorted collection library.

If (as the documentation says) insort is sluggish for sizes over 1000, that means that as soon as you get beyond 100K, you need to tune the load manually, and as soon as you get over 1M, no amount of tuning will avoid that sluggishness. Sure, it'll be better than a flat list, but if it's still not good enough, that's all that matters.

People store collections of far beyond 1M objects today. I needed a sorted collection of that size a couple years (which I ended up implementing in front of an in-memory sqlite database). And remember, if we're talking about the stdlib, we need to try to imagine what kinds of collections people will be storing in a decade, which could easily be in the billions. A design that can't even handle that in principle, much less in practice, may not be acceptable for the stdlib.

You could obviously solve this by just doing the same thing recursively: instead of a list of lists, a SortedContainer is either a flat list or a list of SortedContainers. Which of course makes it effectively like a blist or B-tree. The question is whether the existing code could be relatively easily modified to work with such a structure, and whether doing so would preserve the speed benefits that come with the simplicity. I think the answer is yes, but I think we'd need at least a proof of concept that shows that it could be grown that way if necessary before I'd want to suggest the current implementation for the stdlib.

Interface

The library provides three main types—SortedList, SortedSet, and SortedDict—as well as new key/value/item views for SortedDict.

General

Sorting keys

None of these types take a key (or a reverse flag). The Sorting HOW TO explains why you want these in a sorting API, but in brief, it's so you can avoid the clutter of decorate-sort-undecorate. I think this is even more important for a sorted collection than for a sorting function, because you otherwise need to decorate and undecorate around every access to the object.

Adding a key function, and possibly a reverse flag, should be trivial. The problem (as described in my initial post) is how to pass it in to the constructor. If they're keyword arguments, then you can't create a SortedDict with a key named "key" or "reverse"! Maybe the right answer is to provide an alternate constructor as a @classmethod, like SortedDict.keyed(key, iterable, **kwargs)? At any rate, I think this problem needs to be solved, not just ignored. But if the proposal gets as far as python-ideas, much less a PEP, there will be lots of time for bikeshedding the solution.

ABCs

The library doesn't define any new ABCs. As I explained before, I think having new ABCs for these types would be very helpful. SortedList and the SortedDict views below give further reasons why these are non-obvious enough to be worth writing.

Also, these ABCs could implement most methods as mixins on top of a couple of abstract methods. For example, if a class provides bisect_left and bisect_right, the ABC can add all of the different value-based-slicing methods, and almost everything else I listed in my previous post. (Some of them might also require something like insert, as explained below.)

The library does attempt to fit its types into the standard ABCs, but this seems incomplete in some places (e.g., its SortedDict is not a MutableMapping), and possibly problematic in others (e.g., its KeysView is both a Sequence and a Set).

SortedSet

SortedSet could easily be a MutableSet just by inheriting from it. (The docs don't appear to define the <= and >= operators, but if they're missing, they're trivial—in fact, just inheriting from the ABC would add them as mixins based on <.) And all of the MutableSet methods seem sensible and obvious.

On top of being a MutableSet, it's also indexable. This makes it almost a Sequence, but with the wrong comparison operators. I'm not sure whether or not this is a good design; maybe having a method that returns a view (or an attribute like iloc) that is a Sequence, or just a named function for indexing, would be clearer? Maybe we need use cases: When do you actually want an indexable set, rather than a sequence with some set-like features?

It also provides some, but not all, MutableSequence methods. The deletion-related methods are fine. But it also has a __setitem__—for a single index, not a slice—which removes the value at that index and then inserts the new value in the appropriate place. So, after s[i]=k, s[i]==k is not necessarily true, and if not, any other indices you had have been invalidated. I can't imagine ever wanting that, and I think it's more likely to mislead readers than to help coders. Anyway, the other MutableSequence methods for inserting/appending/etc. are not present, which I think is good.

Extra methods

SortedSet also adds three extra methods for bisecting, but nothing more.

However, as I mentioned earlier, a MutableSortedSet API should be able to add everything else you'd ever want as mixins on top of the basic MutableSet operations, two bisect functions, and indexing. So, I think it's sufficient.

SortedDict

A SortedDict could easily be a MutableMapping just by inheriting from it. And all of the mapping operations seem sensible and obvious. The constructor is not defined as clearly as it could be, but I think the intention is that it works exactly like the constructor for dict, which seems perfect, except for the lack of key and reverse.

Views

SortedDict also provides custom key, value, and item views. In my previous post, I just said that we'd probably need these types, but didn't go into any detail about what they should look like. And from what I remember, most other existing libraries don't even try to fit the modern Python view-based design properly.

The KeysView doesn't implement the KeysView ABC, or even MappingView, but that's easy to fix. Meanwhile, it goes beyond the ABC by implementing Sequence as well as Set. This seems great at first glance, but I think it's a bad idea, for the same reason that the SortedSet class doesn't implement both MutableSequence and MutableSet. Set and Sequence actually have conflicting semantics for comparison operators—subset-based and lexicographical—and the docs here don't state which one you get. Also, most of the set operations have to return something (in this case a SortedSet) which is no longer a Sequence.

The ValuesView implements Sequence, but not Set, which seems fine.

The ItemsView, like the KeysView, implements both Sequence and Set. Again, maybe this should be implementing a SortedSet ABC. On top of that, the Set operations seem to be implemented on top of a set or dict, as they only work if the values are hashable. This seems odd; just looking up the key by bisecting (as with KeysView and SortedSet), then equality-comparing the value, seems like it should be sufficient. (The lack of a key function may be relevant here: an ItemsView is like a SortedSet of the key-value pairs with key=itemgetter(0).)

Extra methods

SortedDict also provides an iloc attribute, which appears to be effectively a special key view that passes through deletions. It also provides an index method for searching. Given the sequence-like KeysView, neither seems necessary; d.iloc[key] === d[k.keys()[index]], del d.iloc[key] === del d[k.keys()[index]], and d.index(key) === d.keys().index(key). I think this may confuse the interface. But maybe the extra brevity is more helpful than the potential confusion is harmful?

Neither SortedDict nor its key or item views (nor its iloc) appears to implement anything like bisect (index only looks for an exact match). So, none of this is sufficient to build all of the other features of SortedMapping and SortedMutableMapping that I suggested (or the features of bintrees that I borrowed them from). For example, there's no way to get all keys within a given range. I assume it would be easy to add the bisect methods to SortedDict and/or its KeysView, which would solve that.

SortedList

The SortedList type is almost a MutableSequence, but does not provide reverse. (If it inherited from the ABC, it would even get the default implementation of that, but that would not be very useful—if would just raise a ValueError for any SortedList with more than one distinct item.)

As explained in my previous post, I don't think trying to be a MutableSequence is a good idea at all. The Sequence API is a no-brainer, and adding __delitem__ and friends is easy. But the other core mutating methods—__setitem__, insert, append, etc.—are all about adding elements into particular locations, which in general doesn't make sense for sorted lists. And every library that tries to implement a SortedList runs into this problem, and deals with it in some clunky way.

SortedContainers handles all of the insert/append/etc. methods, but raising a ValueError when the element (or elements) is not being inserted in sorted order. (I'm not sure ValueError feels like the right error here, but that's not a big deal.) But this makes them only useful as an optimization for specific cases when you happen to definitely know where the new elements should go.

Meanwhile, it also handles __setitem__, with slices, but doesn't document whether it does so by raising a ValueError if the value(s) aren't in the right place (as with the other MutableSequence methods) or by deleting and adding (as with SortedSet—but obviously more complicated by slices). Either one seems wrong. I can't think of any case where I'd want to write sl[idx] = newvalue, either way.

The methods you actually want for adding to a SortedList are not the list-like insert, extend, etc., but the set-like add and update. Most libraries provide these, and SortedContainers is no exception.

In fact, the one case where one of the SortedSequence methods looks useful—insert into a position that you've just bisected—might be better handled by giving add an optional index as a keyword. I'm not sure whether it should be a hint (as with many C++ methods, finding the right position but starting from the provided one), or a checked requirement (raising a ValueError if the provided position is not right).

As with SortedSet, SortedList provides three additional bisect methods. Together with indexing, all of the other desirable SortedList ABC methods you might want are trivial. To also make all of the desirable MutableSortedList methods work, you also need either insert, or an indexed add. (Note that in my previous post, I screwed things up a bit, putting things like pop_min into SortedSequence instead of MutableSortedSequence.)

Conclusion

This looks like a great library, and it definitely moves forward the discussion on getting sorted collections into the stdlib.

It's certainly fast for medium-sized collections. But I'm not sure it's scalable enough (especially without tuning) for inclusion in the stdlib.

The interface adds a few new clever ideas on top of the previous state of the art, and fleshes out the dict view types in a way that I don't think anyone else has before. I think it may need some work before it's stdlib-ready, but then anything that gets proposed for the stdlib needs some work by the time the bikeshed is painted; it's at the very least a good foundation to build on.
1

View comments

Blog Archive
About Me
About Me
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.