Reverse lookup
Here's a dead simple problem: I have a dictionary, and I want to find the key corresponding to a certain value.You can do this by linearly searching the items:
next(k for k, v in d.items() if v == value)But if the dictionary might be big, or you're doing this often enough, this can be slow.
And, more importantly, if you're doing this in multiple places in the code, this can be hard to read, and hard to maintain.
The solution to this case is almost obvious: build a dictionary that maps each value to the corresponding key:
rev = {v: k for k, v in d.items()}Then, the lookup is trivial, both in terms of readability and performance:
rev[value]
Mutable dictionaries
This assumes your dictionary is immutable after you've built it. If not, you obviously have to maintain both dictionaries—every time you add a value in the forward dictionary, you have to do the same in the reverse:d[k] = v rev[v] = kBut what if you're not just adding values, but replacing them? Then the code gets a little clunky:
try: del rev[d[k]] except KeyError: pass d[k] = v rev[v] = kSo, you really want to wrap that in a function.
ReversibleDict
In fact, for this simple case, you can do better than wrapping up the clunky bits in a function: wrap the two dictionaries up together in a class:class ReversibleDict(dict): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.rev = {v: k for k, v in self.items()} def __delitem__(self, k): del self.rev[self[k]] del super()[k] def __setitem__(self, k, v): try: del self.rev[self[k]] except KeyError: pass super()[k] = v self.rev[v] = k def lookup(self, v): return self.rev[v]In this, and all following examples, I'm making no attempt at decent error handling. You should always get an exception if you lookup something that doesn't exist, but it may not be the right kind of exception, or may not contain a useful message. Beyond lookup errors, I won't even guarantee that; using non-hashable keys might not raise until later on in your program, etc. Don't use these examples in production code. (Of course you can use them as a starting point for a real implementation, just be aware that a starting point is all they're meant to be.)
Simplifying __setitem__
Notice that __setitem__ is duplicating the hard part of __delitem__ inside its try/except. As things get more complicated, you may want to factor that out into a separate (private) method and use it in both. But if you think about it, there's pretty much no harm in deleting super()[k] before setting super()[k] = v. Sure, there's a performance cost, but it's tiny as a proportion of the total time spent in __setitem__. So, you can just delegate to __delitem__ and not worry about it:def __setitem__(self, k, v): try: del self[k] except KeyError: pass super()[k] = v self.rev[v] = k
Non-hashable values
Everything above assumes that the values are usable as keys—that is, that they're hashable.For values that aren't hashable, but can compare equal to another type that is, you can get around this by converting to the other type—e.g., if you're mapping strings to sets, you can reverse-map frozensets to strings. But this is almost always a bad idea, because most non-hashable types in Python are non-hashable because they're mutable. If your dict is mapping strings to sets, and I mutate one of those sets, that means the reverse dict is no longer valid. If the "use frozenset instead of set" trick works, you probably should be storing frozensets in your dict in the first place.
But there is an alternative that does work. The values may mutate, but they'll never change identity. (They may be replaced by other objects, but that can only happen through actual dictionary operations.) So:
class ReversibleDictionary(dict): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.rev = {id(v): k for k, v in self.items()} def __delitem__(self, k): del self.rev[id(self[k])] del super()[k] def __setitem__(self, k, v): try: del self.rev[id(self[k])] except KeyError: pass super()[k] = v self.rev[id(v)] = k def lookup(self, v): return self.rev[id(v)]
Non-unique values
All of the code above assumes that the dict is one-to-one—that is, there are never two keys that map to the same value. But many real-life dictionaries are not like that. So, how do you handle that? Obviously you can't get "the" key for a value if there might be more than one, you get all of them:class ReversibleDictionary(dict): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.rev = {} for k, v in self.items(): self.rev.setdefault(id(v), set()).add(k) def __delitem__(self, k): vid = id(self[k]) self.rev[vid].remove(k) if not self.rev[vid]: del self.rev[vid] del super()[k] def __setitem__(self, k, v): del self[k] # at this point it's easier than repeating super()[k] = v self.rev.setdefault(id(v), set()).add(k) def lookup(self, v): return self.rev[id(v)]Note that I could have used a defaultdict(set) instead of using setdefault, but then I'd have to add code to explicitly raise in lookup when we get an empty set; otherwise, you'd never get a KeyError. Also, notice that I've exposed self.rev as a public attribute, because I can definitely imagine it being useful (at least for testing and debugging)—but if I used a defaultdict, it would contain every value that had ever existed, or even been looked up, in the dict, not just the ones that are actually present, which seems less useful. (Plus it would waste space, but I don't think that would be an issue with most real-life uses.)
Indirect reverse dictionaries
This exact same technique can be generalized to a lot of common cases. For example, there are three questions on StackOverflow that have minor variations this two-level dictionary (which I assume came from a tutorial, online programming test, or homework problem, because otherwise that's just weird):d = {'ID1': {'a': 123, 'b': 456, 'c': 789}, 'ID2': {'a': 132, 'b': 654, 'c': 987}, 'ID3': {'a': 321, 'b': 546, 'c': 879}}… and want to know which ID maps to the dict that has the value 654 for "b".
Yes, this might be better handled with instances of a namedtuple or ordinary class instead of dicts for the values, but ignore that; the question does make sense, and the answer is a minor variation on what we've done above. It's not really a reverse dictionary anymore, but it's an extension of the same idea. Put another way, it indexes the same data in a different direction, just not the exact opposite direction. Anyway:
rev = {(innerkey, innervalue): outerkey for outerkey, subdict in d.items() for innerkey, innervalue in subdict}And you can use it like this:
>>> rev['b', 654] 'ID2'Notice that this is not one-to-one between the (outer) values and keys, but for a completely different reason than the previous section—in fact, it's because it's one-to-one between the inner values of the outer values and the alternate keys instead, and obviously they can't both be one-to-one (unless the outer values all have size 1, in which case it's silly to make them collections). Obviously, there's no reason the two cases can't be combined into one. Fortunately, that doesn't cause any confusion or ambiguity problems. The first kind of one-to-manyness is encapsulated in the complex keys, and the other kind in returning sets instead of single values.
Multi-index dictionaries
Often, you actually need multiple ways to look things up in the same dictionary. For example, let's say I wanted to be able to get the ID for a given key-value pair in the dict above, but I also wanted to be able to get the ID and key for any value. How can a single reverse dictionary do that?You can't. So you just build multiple dictionaries.
At this point, it starts to be less helpful to talk about variations on reverse dictionaries; calling two different dictionaries a reverse of the same dictionary will just confuse people. So, let's talk about them as alternate index dictionaries. And a dictionary that wraps up alternate indices isn't a reversible dictionary, it's a multi-index dictionary.
The specific example can be wrapped up in a class like this (I'll skip the setitem/delitem, because it's getting a little long and tedious and there are no interesting tricks to learn there):
class MyMultiIndexDict(dict): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.kv, self.v = {} for k1, v1 in self.items(): for k2, v2 in v1.items(): self.kv.setdefault((k2, v2), set()).add(k1) self.v.setdefault(v2, set()).add((k1, k2)) # [snip] def by_kv(self, k, v): return self.kv[k, v] def by_value(self, v): return self.v[v]
Generalizing multi-indexed dictionaries
You don't want to write a separate dict wrapper for each different dictionary that you want to index; there will be a whole lot of boilerplate to repeat.But if you think about it, all you're really doing is applying a key function to each key-value pair to get an index, which is easy to wrap up.:
class IndexDict(dict): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.indices = {} def add_index(self, name, key): self.indices[name] = (key, {key(k, v): k for k, v in self.items()}) def __delitem__(self, key): value = self[key] for keyfunc, index in self.indices.values(): index[keyfunc(key, value)].remove(key) del super()[key] def __setitem__(self, key, value): del self[key] for keyfunc, index in self.indices.values(): index.setdefault(keyfunc((key, value)), set()].add(key) super()[key] = value def lookup(self, name, key): return self.indices[name][key]Except that in the case of the specific example above, each key-value pair actually maps to multiple keys in each of the alternate indexes, so you'll need to extend this in some way that takes that into account—e.g., the protocol for the key function passed to add_index is to return an iterable of keys, not a sing;e key, and the add_index, __delitem__, and __setitem__ functions loop over them. (Hopefully that explanation is sufficient, and I don't need to show the code for this variation.)
Index views
The design above obviously privileges one key as the dict's key, and other keys are secondary keys for different indices. For some applications, that makes perfect sense. But for others, you really want to use different views, which look just like dicts, but keyed off different indices, in different parts of the program. There may still be a privileged index, but most of your code doesn't have to know whether it's dealing with the privileged index or an alternate index.That's pretty easy to add:
class IndexedDictView(collections.abc.Mapping): def __init__(self, parent, name): self.parent, self.name = parent, name def __getitem__(self, key): return self.parent[self.parent.indices[self.name]][key] def __iter__(self): return map(self.parent.indices[name][0], self.parent) def __len__(self): return len(self.parent)And now, you can add this to IndexedDict:
def add_index(self, name, key): self.indices[name] = (key, {key(k, v): k for k, v in self.items()}) return self.index(name) def index(self, name): return IndexedDictView(self, name)So, you can just store the view returned by d.add_index (or look it up later with d.index) as spamd, then do spamd[23] instead of d.lookup('spam', 23), which is a lot more convenient, and readable.
What if you want to be able to mutate through the views? Deleting items is pretty easy—you look up the parent key corresponding to your view key, and tell the parent to set or delete the value at that key. But inserting new items is a problem. There's no way of knowing the parent key corresponding to a new view key. (Put another way: the whole reason we built the index and view in the first place is to remember which index key goes with which parent key; if there were a way to just get the parent keys directly, we wouldn't need it.)
But if the key functions are reversible, you can just pass in a keyfunc and a revkeyfunc to create a new index. Then things are easy. Or you can make revkeyfunc optional; any index without a revkeyfunc gets immutable views, but any index with one gets mutable views.
Keyword indexing
That d.lookup('spam', 23) is kind of ugly; the views are nice for some use cases, but not all. But there's another way to provide the same things through a different interface: d[23] lookups up 23 in the main index, while d[23, 'spam'] looks it up in the spam index. And that's easy to implement:def __getitem__(self, key): if isinstance(key, tuple): name, key = key return super()[self.indices[name][key]] else: return super()[key]If PEP 472 goes through, this is a perfect use case; then you could write d[23, index='spam'], which looks a lot more explicit… or, depending on your application, it may or may not be abuse to write d[spam=23].
If d[spam=23] looks appropriate to you, you can actually do something similar today, by (ab)using slice notation, as in d[spam: 23]. The code is no more complex than the code for taking two values:
def __getitem__(self, key): if isinstance(key, slice): name, key = key.start, key.stop return super()[self.indices[name][key]] else: return super()[key]
Related ideas
Multi-key dictionaries
Very closely related to the idea of multi-indexed dictionaries are multi-key dictionaries. A multi-key dictionary maps multiple keys to the same value. Well, a normal dict can do that, but a multi-key dict keeps the same keys mapped to the same value, only counts the value once when iterating values or items, etc. The usual interface requires a different method for adding a value (which requires all of the keys) vs. lookup/delete (which require any key), but in Python, it's pretty easy to make this look intuitive: d[1, 2, 3] = 4 just calls __setitem__ with the tuple (1, 2, 3).If there are a fixed number of keys for each value, and the keys are tagged in some way (if each key has a different type, you can just use that as the tag), the tags are pretty much exactly equivalent to indexes. But if you don't need that, the keys can be treated homogenously, which gives you more flexibility. (This is similar to the difference between a generic but homogenous list<T> type and a heterogeneous type like Python's list.)
Path-based lookup
A completely different extension to the dict (and list) idea is path-based lookup, as seen in XPath, KVC, etc. For example, given a typical JSON-RPC structure, you'll often write code like j['spam'][0]['eggs']; it might be more convenient to be able to write that as j['spam.0.eggs']. More importantly, you often write code like [spam['eggs'] for spam in j['spam']], and it can definitely be more convenient—and a lot more readable—to write j['spam.*.eggs']. Once you're doing that, some cases of complex alternative lookup (or "indirect reverse lookup", as I described it earlier) become just simple reverse lookup on key paths.Multi-indexed sets
A set is basically a dict without an index. So a multi-indexed set is a multi-indexed dict without a privileged index—that is, it has no s[key], but it can still have s.lookup(key, indexname), or maybe even s[key, indexname].Record-based indexing
Above, I mentioned that you can't mutate through views without a reversible key function. But for many real-life use cases, this is difficult or impossible. Fortunately, for many real-life use cases, it's also not necessary. Often, the values are just records (objects, namedtuples, dicts, …), and the indexes are defined by attribute or key names. So, instead of defining each index in terms of a key function which is always attrgetter(name) or itemgetter(name) or something equivalent, define it in terms of name. And meanwhile, because the record contains the parent key (and all of the other keys), there is no need for a reverse key function; you can get the parent key out of the value. So, spamview[spamvalue] = (spamvalue, eggsvalue, beansvalue) is sensible and unambiguous.But a dict with mutable index views may not be conceptually the best interface for this type. (Why should you have to specify spamvalue twice?) Especially given that record sets are a common data structure with well known idioms of their own, combining record-based indexing and multi-indexed sets to give you multi-indexed record sets may look more appropriate.
Multi-indexed sorted mappings
A sorted mapping is obviously a relative of an unordered mapping like Python's dict. By keeping the key-value pairs in a tree (or other logarithmic data structure) sorted by key, you can still look up and insert values relatively quickly (not as quickly as a hash table—it's logarithmic instead of constant time), and you can also do things that are impossible with a hash table, like finding all keys within a given range. (And notice that you use slicing as the interface for this: d[23:46] can return the values for keys in the range(23, 46).)Obviously, you can extend sorted mappings to multiple indices in exactly the same way as unordered mappings. Instead of having a dict for each index, you'll have a bintrees.FastRBTree or blist.SortedDict or whatever for each index. An IndexedSortedDict extends the SortedDict interface in roughly the same ways as IndexedDict extends the dict interface—or, alternatively, IndexedSortedDict extends IndexedDict in roughly the same ways SortedDict extends dict.
Anyway, often, when you need multiple indexing beyond the basic "one-to-one reverse dict", you also need some of the additional features of sorted mappings.
Space partitioning
Combining the record-set idea and the sorting idea—that is, your values consist of a bunch of keys (plus maybe some additional members that aren't indexed) that are each independently orderable—gives you a k-dimensional space. Each value is a point in that space. Each lookup(index, key) call is gives you all of the values on the (k-1)-dimensional hyperplane crossing the index's axis at the key.So multiple indexing is effectively a space-partitioning problem, and you can use a data structure designed for such problems, like a k-D tree. In fact, while you can build a multi-indexed dictionary on top of a k-D tree, it may make sense to actually think in terms of the k-D tree interface—especially if you need to do things like multidimensional range searches.
Databases
Going in a different direction, what we're doing with IndexedDict—especially if we're using trees and treating the values as records that contain the keys—what we're doing is very similar to what relational databases do.So, if you're trying to do something fancy, ask yourself whether you're really just looking for a relational database. If so, just use one. Python has sqlite3 built-in, and you can upgrade to solutions like MySQL or Postgres with only a few lines of code. If you don't want to write SQL, you can use an ORM or mini-ORM like SqlAlchemy.
So, there's not much down side, but what's the up side? Well, there are a lot of complex issues that you haven't thought of, but that others have, and the solutions are designed into SQL. And there are a bunch of features you may not need (persistence, transactions, aggregation, etc.) but if you get them for free, well, that'll be nice if you need them later. You may not be using the main advantage of relational databases (multiple tables, with relationships between them written declaratively as part of the data model and enforced automatically by the database), but there are still plenty of other advantages.
Even if you decide not to use a relational database, there are a lot of good ideas you can borrow from them; you should at least work through a basic Python sqlite3 tutorial. (Put it on your queue to come back and learn the underlying theory and how it maps and mis-maps to practice, but you don't need all of that yet. Just remember that "relational" doesn't refer to the relationships between rows in different tables, but to each table itself being a relations among attributes, and everyone will think you know what you're talking about.)
There are also a wide variety of "NoSQL" databases, which are effectively different ways of getting at some particular subset of the functionality of a relational database without the (cognitive, performance, or other) overhead related to the unneeded features. You may want to use one of them for implementation, or to borrow ideas from. (Notice that a dict is basically a NoSQL database taken to the extreme in almost every dimension; by adding any database-like features on top of it, you're often walking along the path to one of the major NoSQL designs.)
Multi-indexed arrays
Finally, Pandas is basically a library for multi-indexed arrays—that is, pd.DataFrame is to np.ndarray as IndexedDict is to dict. How close the correspondence is depends on exactly which variation you need and which features, but it's definitely worth looking at how Pandas represents the features that it adds to arrays that correspond to features you're adding to dicts.In fact, in some cases, Pandas may even be the best way to implement what you want. If your keys are, or can be mapped to, some compact, monotonic index, then use those to index a DataFrame and you're done.
Meanwhile, there are similar multi-indexed-array types for other languages besides Python. I wouldn't suggest you go and learn R just so you can see if it has any ideas that can apply to your multi-indexed dictionary design, but if you already know R, that may be a question worth thinking about.
Further reading
There are a lot of choices to make in designing an indexed dictionary, so for a real project, you shouldn't just dive in and write one without thinking it through (or, worse, use the sample code from this blog post).Search PyPI distributions, ActiveState recipes, and maybe sources like StackOverflow and python-list. You may have to try multiple terms, like "index" vs. "multi" and "dict" vs. "mapping", but you'll find lots of implementations; even if none of them are exactly what you want, there will be useful design ideas to get from all of them.
If that isn't enough, try searching for other languages. Ruby on Rails has ActiveRecord, there are two different things called MultiFieldDict for Objective-C, there's a KDMultiIndexedMapping (built on a k-D tree) for Java; etc.
View comments