People want sorted collections in the stdlib, but no one has ever made a solid proposal to add them.

Some people want a SortedList that maintains sorted order as elements are added and removed (without requiring linear-time shuffling); some want a SortedDict that keeps its keys in sorted order (and doesn't require hashing); SortedSet is less critical, but an obvious extension.

Every so often, someone proposes adding sorted collections to the stdlib. However, usually, it's in the middle of a thread about something else, like a wrapper around heapq, or making OrderedDict easier to use, or even extending sum (don't ask). There have been two suggestions that would have given us sorted collections, but one was hidden in a proposal to replace the basic list type, and the other was hidden in a proposal to add a general binary tree library; in both cases, the fact that we would get sorted dictionaries for free was a minor afterthought.

The idea came up for a completely different reason in the discussions leading up to PEP 456: Secure and interchangeable hash algorithm: While hash tables have amortized constant average performance, they have linear worst-case performance, and generating pathological cases that trigger that worst case is a way to DoS a web server. A collection with logarithmic average time, but also logarithmic worst-case time, is a safe way to avoid that. Even though sorting is irrelevant to the problem they're trying to solve, the obvious types to use are based on sorting, and they're the same types you'd use if sorting were the goal.

Don't we already have heapq and bisect?

Yes, we do. 

Of course they have horrible C-style APIs rather than OO APIs, but that's not much of a problem. The bisect docs have a link to a SortedCollection recipe at ActiveState, and a bit of searching turns up multiple ActiveState recipes and PyPI modules that wrap both modules.

The problem is that neither of these is a good sorted list type, much less a sorted dictionary. A heap isn't randomly accessible by either key or index, and a bisect-based list doesn't provide any way to mutate in better than linear time.

Other languages

Wikipedia has an article describing the mapping types (both hashed and sorted) in a wide variety of languages. But briefly:

Most other "scripting" languages (Ruby, Perl, Lua, JavaScript, PHP, etc.) do not have a sorted collection in the stdlib, but most "system" languages (Java, C++, .NET, GHC Haskell, etc.) do.

Almost every language that has sorted collections has a SortedDict-like type, plus either a SortedList or a SortedSet. Some also have things like SortedMultiDict (which isn't as useful in Python, given how easy it would be to use a SortedDefaultDict(list) instead) and SortedMultiSet (again, a SortedCounter would take care of most uses better), or even multi-indexable sorted collections. A few languages (such as ObjC) provide tools to make a wide variety of uses (including multi-indexed collections, nested index paths, etc.) easier to build, instead of providing the concrete types.

Almost all of these are implemented with red-black trees, although Java also adds skip-list implementations.

Performance characteristics

There are a variety of different implementations that effectively make everything O(log N)—inserting a new element, deleting an element, looking up a key, etc. This is not as good as the amortized O(1) average time for hash tables, but most implementations can also guarantee O(log N) worst-case time, while hash tables are generally O(N) in the worst case. At any rate, searching a sorted collection is much better than the O(N) time for unsorted collections, and inserting or deleting is much better than the O(N) time for sorted dynamic arrays.

Within that logarithmic time, there are a lot of tradeoffs. B-trees and their relatives have a much larger logarithmic base, but also have better cache locality (for small collections) and paging characteristics (for very large ones). Skip lists are faster at accessing ranges, and easier to implement without locks or with fine-grained locks (hence Java's ConcurrentSkipListMap), while red-black tree algorithms are easy to internally parallelize. AVL trees are very rigidly balanced, which is great when you do a lot more searches than mutations; not so much when you don't. Red-black trees offer very consistent lookup times. Weight-balanced trees and treaps can make more frequently accessed values faster to access. And so on. I don't know of a good survey paper or other link that discusses all of the tradeoffs.

Key requirements

The requirements on the keys of a sorted collection are a little more complicated than those of a dict or a set, and they're not automatically checkable. On the other hand, collections.SortedDict won't be one of the half-dozen types that novices are introduced to in Python right off the bat, so it doesn't have to be quite as simple or foolproof.

Comparability

With a dict, the main key requirement is simple: every key must be hashable. Every operation automatically tests this requirement, and provides a nice exception if it fails. Of course it's possible to trick the code by, e.g., returning a different hash each time you're asked, but anything non-malicious will either work, or fail obviously.
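
For example, any attempt to use an unhashable key fails immediately and intelligibly:

    d = {}
    d[[1, 2]] = 0
    # TypeError: unhashable type: 'list'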

With a sorted dict, the main key requirement is not quite as simple: every key must be comparable with every other key, and that comparison must define a strict weak order (or, if the sort isn't stable, a strict total order—but that only matters for a sorted multidict…). And this does not get automatically tested; in general, for any given operation, the supplied key is only tested against log N other keys; if it's comparable to all of them, and the ordering is not obviously inconsistent, there will be no error.
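
To illustrate, here's a hypothetical pathological key type whose comparisons all succeed but define no consistent order, exactly the kind of thing no per-operation check could catch:

    import random

    class Fickle:
        def __lt__(self, other):
            # Answers differently every time it's asked; comparisons never
            # fail, so a sorted collection would accept Fickle keys without
            # error and silently end up in an inconsistent state.
            return random.random() < 0.5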

Note that heapq and bisect already have this problem—in fact, bisect can't even verify that your list is sorted. And plenty of other stdlib modules have requirements that can't be automatically verified.

One very obvious and important type, float (and complex, and numpy.float*, etc.), does not meet the requirements, because if any of the values are NaN, you don't have an order. However, this is a widespread problem that already affects many core operations in Python (and almost every other language), including list.sort and sorted, so there's no new problem with sorted collections.
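
The same problem is easy to demonstrate with sorted (the exact output depends on where the NaN falls in the input):

    sorted([3.0, float('nan'), 1.0])
    # [3.0, nan, 1.0] -- silently not sorted, and no error raised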

Mutability

For dict and friends, the keys are required to be immutable, or only mutable in ways that don't affect hashing and equality. And dict is important enough that any type that is hashable, but doesn't meet these requirements, is often considered broken. Of course a dict can't check that its values are hash-immutable and equality-immutable, but in practice, just checking that they're hashable (which it does automatically) is pretty close. 

For a sorted collection, the keys are required to be immutable, or only mutable in ways that don't affect comparison. But there is no convention that all comparable types should be immutably comparable. This is another way that the requirements for a sorted collection aren't automatically checkable.

It would be possible to use hash just for verification. Equality-immutability is not sufficient for comparability-immutability, and hashing is not necessary for equality-immutability, but it would catch at least a good number of problem types, and not have that many false positives. (It's rarely that onerous to add a hash method to an immutable class.) However, hashing can be expensive for some types. And is "not that many" good enough? More importantly, this is a very unobvious test. When a dict refuses your type because it's not hashable, that makes sense; if a SortedDict does the same, it will be baffling; what does hashing have to do with sorting?

Finally, some languages/libraries (e.g., C++ std::map and friends) allow mutable keys that you promise not to mutate while they're in the collection, and at least one person has suggested the same here. But this is a much more complicated requirement to explain than immutability, for very little benefit. The same benefit could be added to set and dict, and Python decided against it there, and nobody has come up with a really compelling use case. Still, since mutability isn't actually tested, using "temporarily immutable" keys merely violates the documentation; if you can figure out how to do it safely, well, you're a consenting adult.

Interface

SortedDict

Obviously a SortedDict has to support all of the operations in collections.abc.Mapping. Of course most of those operations will be logarithmic time (even the ItemView, KeyView, and ValueView are logarithmic to index), but that doesn't change the API.

However, there are additional operations a SortedDict can provide:
  • __getitem__, __setitem__, and __delitem__ can take slices.
    • d[low:high] returns all of the values where low <= key < high, and similar for set and del.
    • Leaving out start or end (or both) works just as well as it does for lists.
    • Negative indices or slice components don't make any sense, as these are keys, not indices.
    • A step value is feasible, but doesn't seem all that useful (except maybe as a reverse flag).
    • __delitem__, and __setitem__ with a slice that isn't an exact replacement, may not be O(log N) for all possible implementations (in fact, they're worst-case O(log^2 N) for red-black trees).
  • item_slice, key_slice, and value_slice methods.
  • prev, succ, floor, and ceil on a key (or prev_key, prev_item, etc.).
  • min and max (or min_key, min_item, max_key, and max_item)
    • pop_min and pop_max
    • nsmallest and nlargest
    • pop_nsmallest, pop_nlargest
Lookup by index is possible, but probably not necessary, as d[d.keys()[n]] takes care of it.
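
To make that concrete, here's a hypothetical usage sketch (none of these names exist anywhere yet; they just follow the list above, and the exact return types are assumptions):

    d = SortedDict({'apple': 3, 'cherry': 1, 'pear': 2, 'quince': 7})
    d['b':'q']                # values for 'b' <= key < 'q': [1, 2]
    del d[:'cherry']          # removes 'apple'
    d.key_slice('c', 'z')     # ['cherry', 'pear', 'quince']
    d.floor('plum')           # 'pear' -- the largest key <= 'plum'
    d.min_key(), d.max_key()  # ('cherry', 'quince')
    d.nlargest(2)             # [('quince', 7), ('pear', 2)]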

Not all of these are necessarily worth adding, and of course the names are up for bikeshedding.

Whichever methods are added should be made part of a collections.abc.SortedMapping, which inherits from Mapping. Which methods are required by the ABC and which can be implemented in terms of others is up for investigation. There would of course then be a collections.abc.MutableSortedMapping, and collections.SortedDict would implement that.

I believe this also implies SortedItemView, SortedKeyView, and SortedValueView ABCs (and un-exported concrete implementations as a detail of SortedDict, just as for the dict equivalents).

The ABCs don't define construction. However, the obvious constructor for a SortedDict is the constructor for dict, plus a keyword-only "key" argument. (It might be worth having the key accessible after construction as an attribute, too.) The problem is that you can construct a dict as dict(a=1, b=2, c=3), so "key" is ambiguous! I don't have an answer to that one…
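
The ambiguity in a single line:

    SortedDict(a=1, b=2, key=str.lower)
    # Two items sorted case-insensitively, or three items with keys
    # 'a', 'b', and 'key'? There's no way to tell.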

The quasi-documented __missing__ method could be different from dict.__missing__, because it can be faster if it remembers the insert position. However, this may not be necessary (it seems like it should be only a 33% speedup in theory, and even less in practice, because the rebalance will probably be slower than the search on average).

Note that this interface is very close to the one defined in bintrees 2.0, but not identical. (In particular, bintrees cannot take a key parameter, which means you have to decorate-sort-undecorate to use it.)

SortedSet

A SortedSet can do everything a Set can, so again, we have an ABC subclass SortedSet, and MutableSortedSet, and collections.SortedSet implements collections.abc.MutableSortedSet. (There doesn't seem to be a compelling use case for a collections.FrozenSortedSet).

The added methods are:
  • slice (returns all members in the specified range)
  • prev, succ, floor, ceil
  • min, max, pop_min, pop_max, nsmallest, nlargest, pop_nsmallest, pop_nlargest
Again, the constructor needs to take an optional key.

SortedList

A SortedSequence can do everything a Sequence can, but a MutableSortedSequence cannot do __setitem__, append, extend, or insert. (It can still do __delitem__.) Also, note that indexing is logarithmic instead of constant.

Meanwhile, SortedSequence adds a few methods:
  • indices (like item_slice and slice)
  • prev, succ, floor, ceil
  • min, max, pop_min, pop_max, nsmallest, nlargest, pop_nsmallest, pop_nlargest
  • bisect_left, bisect, bisect_right?
  • insort_left, insort, insort_right? (for manually maintaining stable order)
And MutableSortedSequence adds:
  • add
  • update
And collections.SortedList implements MutableSortedSequence, and has a constructor like list's but with a key parameter.
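
A hypothetical sketch of the whole thing in use (again, nothing here exists yet):

    sl = SortedList([30, 10, 20], key=lambda x: -x)  # sort descending
    list(sl)          # [30, 20, 10]
    sl.add(25)        # lands in sorted position: [30, 25, 20, 10]
    sl[0], sl[-1]     # (30, 10) -- indexable, but in O(log N), not O(1)
    sl.insert(0, 99)  # AttributeError: sorted lists can't insert at a position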

Note that this interface is in some ways more like MutableSet than like MutableSequence. That seems unavoidable. However, in other ways—most notably, indexing—it's very un-set-like. It's possible that the two could be merged into a single ABC, and that the concrete classes could be merged as well, but that might lead to more confusion than benefit.

Alternate implementations

It should be relatively easy to adapt a third-party type like blist.sorteddict to fit collections.abc.MutableSortedMapping. And of course the ABC should do as much work as possible, as it does for the existing ABCs. As mentioned earlier, the exact set of methods that are required vs. default-implemented needs to be explored.
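
For types whose interfaces already match, even plain ABC registration would get you most of the way (a sketch assuming the proposed, and currently nonexistent, ABC):

    from collections import abc
    import blist  # third-party

    abc.MutableSortedMapping.register(blist.sorteddict)  # hypothetical ABC
    isinstance(blist.sorteddict(), abc.MutableSortedMapping)  # True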

Adapting blist is particularly important, because it's probably the most popular sorted collection library available, and it's also an oddball in many ways, some of which would make it desirable even if there were an alternative in the stdlib.

Subclasses

Subclassing these types should be as easy as subclassing the non-sorted equivalents, but note the open issue with SortedDict.__missing__.

SortedDefaultDict could definitely be useful. Whether it needs to be included in the stdlib, listed as a recipe or a link to a recipe on ActiveState, or left as an exercise for the user probably depends on how easy it is to implement (which may depend on whether __missing__ is optimized).
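
For what it's worth, a minimal recipe would look almost exactly like the pure-Python equivalent of defaultdict (assuming SortedDict subclasses as easily as dict does):

    class SortedDefaultDict(SortedDict):
        def __init__(self, default_factory=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.default_factory = default_factory
        def __missing__(self, key):
            # Same semantics as defaultdict: create, store, and return.
            if self.default_factory is None:
                raise KeyError(key)
            value = self[key] = self.default_factory()
            return value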

SortedCounter seems like it will be useful less often, but still has some use. Again, how it should be included/referenced probably depends on how easy it is to implement.

The blist library comes with weakref-based collections that automatically remove members when no longer alive. I don't think this needs to be part of the collections library, but again, it might be worth a link to a recipe.

Implementation

As mentioned earlier, nearly every language that has sorted collections uses red-black trees. Unless someone makes a solid case for a different type, it seems like the default assumption should be that Python should also use red-black trees.

However, that doesn't mean red-black trees are definitely the answer. For example, if blist turns out to be the easiest module to bring into the stdlib and maintain there, I don't think it would be worth rejecting because it uses a different implementation. Or, if skip lists turn out to be useful for some other purpose (e.g., a coroutine-friendly lock-free priority queue for tulip's scheduler), it might be worth reusing them instead of adding a red-black tree implementation alongside it.

Optimizations

One thing not often considered in search trees is the performance of building a tree all at once from an existing collection (or iterator). Obviously, SortedList(foo) is always going to be slower than list(foo) because of the cost of building tree nodes, etc. However, ideally, it should be slower than timsort only by a constant factor—in particular, O(N) or near it instead of O(N log N) in the key cases, like already-sorted input. At least, creating a SortedFoo from an already-sorted collection should be fast enough that the optimization of "just copy the tree (possibly transforming each node)" to create a SortedFoo from a SortedBar isn't necessary.
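
The already-sorted case really is linear. Here's a sketch for a plain node-based binary tree; a real red-black implementation would also have to assign colors, but the shape-building recursion is the same idea:

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def tree_from_sorted(items):
        """Build a perfectly balanced BST from a sorted sequence in O(N)."""
        def build(lo, hi):
            if lo >= hi:
                return None
            mid = (lo + hi) // 2
            node = Node(items[mid])
            node.left = build(lo, mid)
            node.right = build(mid + 1, hi)
            return node
        return build(0, len(items))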

There are some other optimizations worth considering. For example, with a B-tree, calling update with a bunch of values within a small range will go much faster if you either pre-sort the values, or partition the values into subtrees and sort only at the bottom, than if you insert them one by one. Which means you might even want to batch adds and insert them lazily. That one doesn't make any sense for balanced binary trees or treaps, but there are other optimizations that might.

The bintrees library provides a more complex view type, TreeSlice; blist, on the other hand, uses copy-on-write subtrees. If these optimizations aren't necessary for list and other types, are they actually worth adding for sorted collections?

There are some obvious optimizations that can be done in various special cases. For example, if all of the keys are smallish integers, you can use a bucket- or radix-based structure instead of a general tree structure, for constant-time operations (well, really something like O(log K), where K is the largest key) instead of O(log N). Given that we don't do such optimizations for sorted and sort, I don't think we need to do them for sorted collections.

Banyan has some interesting optimizations based on tree augmentation, as well as some obvious ones for things like homogeneous collections.

Pre-existing implementations

The bintrees library is already 90% of the way to implementing this design. It has a solid pure-Python implementation that's known to work in (and even somewhat optimized for) PyPy 2.0, and a C-accelerated implementation that's reasonably fast in CPython 3.3 (and 2.7), although it uses Cython rather than native C. It also includes two other trees (an AVL tree and an unbalanced tree), but those can be stripped out easily. It's actively maintained. Unfortunately, the author does not want to maintain it as a stdlib module.

The Banyan library is also pretty close, and seems to be actively maintained, although I don't know if it's as mature. However, there is no pure-Python implementation, and the native implementation is C++ rather than C (and inherently so, as it relies on template metaprogramming).

There's also blist. This uses a custom B+tree, with some unique optimizations. It would also need more changes than bintrees, and a pure-Python implementation (currently, there's an underlying C library and a bunch of Python modules built on top of it). On the other hand, it's probably in more widespread use by applications than any other sorted collection library. And, given that the author proposed blist for the stdlib once, there's a decent chance he'd be willing to again.

There are a number of other red-black tree implementations on PyPI to look at; I don't know of any that are maintained as well as bintrees, but it's worth doing a survey.

Finally, building a red-black tree from scratch isn't all _that_ hard.

Staged introduction

We could just add the ABCs first, wait for library authors to adapt their code to fit, and then pick one to migrate to the stdlib.

We could even provide a "registry", so the user can just ask for "a MutableSortedMapping" and, depending on what's installed on his system, end up with blist or bintrees or rbtree or whatever, as with BeautifulSoup picking a parser. (This would probably still require adding a pure-Python implementation with the right algorithmic complexity as a fallback, but it wouldn't need C acceleration, optimizations for already-sorted data, etc.)
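
A sketch of what that lookup might do (entirely hypothetical; the module and attribute names are just the obvious candidates):

    def best_sorted_dict():
        for modname, clsname in [('blist', 'sorteddict'),
                                 ('bintrees', 'FastRBTree'),
                                 ('rbtree', 'rbtree')]:
            try:
                return getattr(__import__(modname), clsname)
            except (ImportError, AttributeError):
                continue
        return PurePythonSortedDict  # the stdlib fallback, also hypothetical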

However, similar ideas have been suggested for other proposals over the last few months, and roundly shouted down, so I don't think this is seriously tenable. We need a production-quality implementation for the stdlib or nothing's getting into the stdlib.

Other open issues

Attractive nuisance

There are problems where a sorted collection seems like the obvious solution (especially to a novice), but really isn't.

For example, getting the M largest lines from an N-line file can be done in O(N log N) time with a SortedList, but it can be done in O(N log M) time, and with a smaller constant factor to boot, using heapq.
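
The heapq version is also shorter than anything tree-based would be (m and path assumed defined):

    import heapq

    with open(path) as f:
        largest = heapq.nlargest(m, f)  # O(N log M), only M lines in memory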

In fact, if you aren't going to mutate the collection after sorting, sorted is faster by a significant constant factor than SortedList—and the fact that a regular sorted list is more painful to use (via bisect) makes the attraction even stronger.
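
The build-once, search-many pattern looks something like this, with iterable and target assumed defined; it's more painful than d[target] would be, which is exactly the attraction:

    import bisect

    keys = sorted(iterable)               # one up-front O(N log N) sort
    i = bisect.bisect_left(keys, target)  # each search is O(log N)
    found = i < len(keys) and keys[i] == target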

And if you just want a SortedDict because you're mechanically porting code from Java or C++ and want the equivalent type in Python… you shouldn't be doing that. If your algorithm depends on having a logarithmic sorted mapping, the Python implementation will too, but if your Java or C++ implementation is just using SortedMap or std::map because it was convenient, the Python implementation should not.

One true sorted collection?

Today, we have a choice of various sorted collections; a few seconds on PyPI turned up dozens, plus libraries on top of which a sorted collection is relatively simple to implement. We can choose between red-black trees, skip lists, or B-trees. We can even choose data structures that are slower in general, but faster in some use cases (e.g., keeping a plain list, re-sorting it before the first search after a batch of inserts, and using bisect for searching is faster if you only do a few large batches of inserts). If we want red-black trees, we can choose between bintrees, rbtree, red-black-tree, Banyan, or half a dozen others. So, maybe attempting to define the One True Sorted Collection is a bad idea?

I don't think this is really a problem. The stdlib is full of libraries that aren't the only way to do something, just a way that's usually good and rarely surprising. For example, the existence of MySQL, BeautifulSoup, and lxml doesn't mean sqlite3, html.parser, and ElementTree are bad ideas. In fact, providing a good-enough SortedDict, along with a SortedMapping ABC that alternatives can conform to, should make it easier for users to experiment with different types to optimize their code, not harder.


It's been more than a decade since Typical Programmer Greg Jorgensen taught the world about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed. Languages in the Abject-Oriented space have been borrowing ideas from another paradigm entirely—and then everyone realized that languages like Python, Ruby, and JavaScript had been doing it for years and just hadn't noticed (because these languages do not require you to declare what you're doing, or even to know what you're doing). Meanwhile, new hybrid languages borrow freely from both paradigms.

This other paradigm—which is actually older, but was largely constrained to university basements until recent years—is called Functional Addiction.

A Functional Addict is someone who regularly gets higher-order—sometimes they may even exhibit dependent types—but still manages to retain a job.

Retaining a job is of course the goal of all programming. This is why some of these new hybrid languages, like Rust, check all borrowing, from both paradigms, so extensively that you can make regular progress for months without ever successfully compiling your code, and your managers will appreciate that progress. After all, once it does compile, it will definitely work.

Closures

It's long been known that Closures are dual to Encapsulation.

As Abject-Oriented Programming explained, Encapsulation involves making all of your variables public, and ideally global, to let the rest of the code decide what should and shouldn't be private.

Closures, by contrast, are a way of referring to variables from outer scopes. And there is no scope more outer than global.

Immutability

One of the reasons Functional Addiction has become popular in recent years is that to truly take advantage of multi-core systems, you need immutable data, sometimes also called persistent data.

Instead of mutating a function to fix a bug, you should always make a new copy of that function. For example:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

When you discover that you actually wanted fields 2 and 3 rather than 1 and 2, it might be tempting to mutate the state of this function. But doing so is dangerous. The right answer is to make a copy, and then try to remember to use the copy instead of the original:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

function getCustName2(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[2] + ' ' + custRec[3];
    return fullname;
}

This means anyone still using the original function can continue to reference the old code, but as soon as it's no longer needed, it will be automatically garbage collected. (Automatic garbage collection isn't free, but it can be outsourced cheaply.)

Higher-Order Functions

In traditional Abject-Oriented Programming, you are required to give each function a name. But over time, the name of the function may drift away from what it actually does, making it as misleading as comments. Experience has shown that people will only keep one copy of their information up to date, and the CHANGES.TXT file is the right place for that.

Higher-Order Functions can solve this problem:

function []Functions = [
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    },
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[2] + ' ' + custRec[3];
        return fullname;
    },
]

Now you can refer to these functions by order, so there's no need for names.

Parametric Polymorphism

Traditional languages offer Abject-Oriented Polymorphism and Ad-Hoc Polymorphism (also known as Overloading), but better languages also offer Parametric Polymorphism.

The key to Parametric Polymorphism is that the type of the output can be determined from the type of the inputs via Algebra. For example:

function getCustData(custId, x)
{
    if (x == int(x)) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return int(fullname);
    } else if (x.real == 0) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return double(fullname);
    } else {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return complex(fullname);
    }
}

Notice that we've called the variable x. This is how you know you're using Algebraic Data Types. The names y, z, and sometimes w are also Algebraic.

Type Inference

Languages that enable Functional Addiction often feature Type Inference. This means that the compiler can infer your typing without you having to be explicit:


function getCustName(custID)
{
    // WARNING: Make sure the DB is locked here or
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

We didn't specify what will happen if the DB is not locked. And that's fine, because the compiler will figure it out and insert code that corrupts the data, without us needing to tell it to!

By contrast, most Abject-Oriented languages are either nominally typed—meaning that you give names to all of your types instead of meanings—or dynamically typed—meaning that your variables are all unique individuals that can accomplish anything if they try.

Memoization

Memoization means caching the results of a function call:

function getCustName(custID)
{
    if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Non-Strictness

Non-Strictness is often confused with Laziness, but in fact Laziness is just one kind of Non-Strictness. Here's an example that compares two different forms of Non-Strictness:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRate(custId) {}

function callByNameTaxRate(custId)
{
    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
}

Both are Non-Strict, but the second one forces the compiler to actually compile the function just so we can Call it By Name. This causes code bloat. The Lazy version will be smaller and faster. Plus, Lazy programming allows us to create infinite recursion without making the program hang:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRateRecursive(custId) { lazyTaxRateRecursive(custId); }

Laziness is often combined with Memoization:

function getCustName(custID)
{
    // if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Outside the world of Functional Addicts, this same technique is often called Test-Driven Development. If enough tests can be embedded in the code to achieve 100% coverage, or at least a decent amount, your code is guaranteed to be safe. But because the tests are not compiled and executed in the normal run, or indeed ever, they don't affect performance or correctness.

Conclusion

Many people claim that the days of Abject-Oriented Programming are over. But this is pure hype. Functional Addiction and Abject Orientation are not actually at odds with each other, but instead complement each other.