On Stack Overflow, a user asked, "Should python-dev be required to install pip?" I think the answer to that is no, at least the way things are split up on most distros today. But there's clearly a potential for confusion for new developers, as the OP pointed out in a comment.
On most linux distros, the python package (usually pre-installed) doesn't include Python.h and other files needed for development, which are split off into a package named something like python-dev or python-devel.
Meanwhile, there are distro packages for Python packages that go with that distro-installed Python. If you want numpy, or django, you use the distro's package manager to install python-numpy or python-django.
And for packages the distro doesn't have, there's always python-pip, which installs pip, so you can use pip to install additional packages.
The problem is that if you install python-pip but not python-dev, you can't use pip for any packages that need to build C extension modules. Besides needing a compiler toolchain, for pip install lxml to work, you need to have first installed your distro's libxml2-dev package. And, unless pip is going to add hooks to work with distro packaging systems (in which case it probably shouldn't be installing anything in the first place, just creating distro packages to be installed), there's no way around that problem. But making python-pip depend on python-dev isn't the answer either: it would mean you couldn't install pip without dragging in the whole toolchain, even if you only wanted pure Python packages or binary wheels. And how would it work with virtual environments; would python-virtualenv also have to depend on python-dev?
So, it seems like a big mess…
But only if you're thinking in terms of today's distros, with Python 2.7 and pip 1.4 or earlier. Python 3.4, pip 1.5, and wheel make everything a lot different—and a lot simpler.
Python 3.4 comes with a pip bootstrap (see PEP 453 for details). The distros might decide to split that off into a separate package, but PEP 453 explicitly recommends against that (and was co-written by the guy in charge of Python strategy for Red Hat).
Python 3.4 also comes with a built-in virtual environment tool, venv, which automatically installs pip into every virtual environment. And you can use venv as a deployment tool, which eliminates these problems entirely—you won't want or need to use pip on your production machines.
Meanwhile, pip 1.5 has better integration with the wheel format and package (see PEP 427). If you need to use a package that your distro doesn't provide, and can't use a venv, you can build a wheel on your dev machine, host it on an in-house repo, and pip install it to your production machine without needing python-dev, or a compiler or libxml2-dev or anything of the sort.
So, the solution becomes very simple: Don't install python-dev on production machines; even if you're not using a venv, you won't be pip install-ing anything except off an internal repo. Install python-dev on all other machines, so you can pip install off PyPI.
What about Python 2.7? Well, it's not perfect. That's why we have 3.2, 3.3, 3.4, and other new versions of Python. It's too late to fix 2.7. Red Hat or Ubuntu might come up with their own solution (it wouldn't be that hard to preinstall pip and virtualenv…), but Python isn't going to.
-
Part of the common wisdom among some OO fanatics is that Python isn't a real OO language because it "doesn't have encapsulation." There are a few different things that could mean, but none of them say anything useful.
Python facilitates and encourages bundling data together with methods that work on that data, in the exact same way that Smalltalk, C++, and their descendants do (and in an equivalent way to other OO paradigms, like Self-style prototypes). This is really the only useful definition of "encapsulation," and in this sense, Python does have encapsulation.

The fact that Python's idiom doesn't encourage getters and setters is irrelevant, because getters and setters just provide a different way of spelling attribute access, one which (except in the case of syntactically restricted languages like C++) adds no flexibility, future-proofing, or other benefits.

The fact that Python's _-prefix idiom doesn't actually hide or protect private variables is true, but the same is true for almost every other paradigmatic OO language. So, if you want to define encapsulation in these terms, then no, Python is not a good encapsulating OO language—and neither is Smalltalk, C#, Ruby, JavaScript, …

Hidden internal state
One definition of encapsulation is that the data members of an object should not be visible.
The C++ family (including Java and C#), Objective C, even Eiffel have the full list of the data members visible in the source. And in most of those languages, the header file or other "interface" that you distribute with a compiled module includes them as well.
But at least they hide that list at runtime, right? Well, if you look at most of the OO languages with any kind of reflection—Ruby, Java, etc.—no, no they don't. And in languages without reflection, like C++, the methods are just as hidden as the members.
In fact, this kind of "encapsulation" does exist, and is used frequently, in non-OO languages like C and Lisp: Just pass around an opaque, untyped, meaningless handle instead of a reference to an actual object. This can be a void*-cast pointer to an object whose type isn't defined anywhere in the headers you distribute (like a Win32 HANDLE), or it can be a key into some table that you maintain inside your library (like a POSIX file descriptor). And of course you can do this in Java or Eiffel or Python as well (in fact, slightly more easily in Python than in most languages, because it has a built-in mapping type to use for that table), but I don't think that makes it an OO feature.
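Here's a minimal sketch of that opaque-handle style in Python (all the names here are hypothetical, made up for illustration): the library hands out meaningless integer handles, and the real state lives in a table inside the module.

_table = {}
_next_handle = 0

def create_widget(name):
    # hand back an opaque int, like a POSIX file descriptor
    global _next_handle
    _next_handle += 1
    _table[_next_handle] = {'name': name, 'count': 0}
    return _next_handle

def poke_widget(handle):
    # only the library knows how to interpret the handle
    _table[handle]['count'] += 1

def destroy_widget(handle):
    del _table[handle]

h = create_widget('spam')   # the caller sees nothing but an int
poke_widget(h)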
(In C++, there's an idiom ("pimpl") that wraps up this kind of handle in an OO interface. And this same idiom works in Python, it's just not very common. But that just puts Python in the same class of languages as Java, Ruby, Eiffel, ObjC, etc.)
Restricted access
So forget about actually hiding information; what about restricting access to it?

In C++ and friends, you can mark an attribute as "private". Python's equivalent is to prefix the attribute name with an "_".

Python's "_" doesn't actually stop you from accessing the attribute from outside, it just discourages you (see the toy example at the end of this section). It's a clear signal to the user of your class that he shouldn't be using this attribute, that it could disappear or change meaning in future versions, etc. (It also prevents the attribute from showing up in various kinds of reflection, but you can always get around that with other kinds of reflection—e.g., in IPython or various IDEs, private names aren't offered as completions unless someone first types a _ to see them.) This is pretty closely equivalent to the POSIX notion of "hidden files" with a "." prefix, as opposed to, say, MacOS or Windows actual hidden files.

But then very few other languages actually stop you from accessing the attribute either. This protection effectively comes from static type checking, and almost all statically-typed OO languages either have leaky type systems, or inflexible type systems that need (and have) escape hatches. For example, in C++, you can always cast through void* to char* and get at the structure members. Or, even simpler, define a class with an identical but all-public structure and just cast to that. (Of course it's easier to do that to a C++ class from Python via ctypes or Cython than from C++, but that doesn't actually speak well of C++'s "protection" of its private members.) Just as with, say, MacOS or Windows actual hidden files, there are flags to pass to allow access to the hidden files if you want it.

If you build an object system on top of, say, Haskell, it can actually prevent access in ways that these languages can't. But the fact that few if any OO languages have static strong typing, and that people who use strongly-typed languages like Haskell tend to see only limited use for OO, implies that this kind of restricted access is not an OO feature, any more than access through opaque tokens is.
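To make the _ convention concrete, here's a toy class (assumed just for illustration):

>>> class Stack:
...     def __init__(self):
...         self._storage = []    # "private" by convention only
...     def push(self, value):
...         self._storage.append(value)
...
>>> s = Stack()
>>> s.push(1)
>>> s._storage    # nothing stops you; the _ just signals "don't"
[1]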
Sandboxing
Java (and, to an extent, C#) tries to restrict access further than C++, despite providing more reflection, by effectively making the static protection information available at load time (and attempting to make that secure even for code from different sources, in Java's case) and then running the entire program inside a sandbox that can cover holes in the leaky type system.
If someone really wanted to argue that this means Java (modulo design flaws or JVM implementation bugs) is OO in a way that Eiffel, Ruby, Smalltalk, etc. are not, I suppose that would count as a way that Python isn't OO either. But that doesn't seem like a very useful distinction.
Getters and setters
The standard idiom in most "encapsulated" OO languages is to provide "getter" and "setter" methods for every data attribute. Some, like C# and Eiffel, have ways to automate that for you. Not only does Python have no way to automate this, the idiom explicitly discourages this kind of design.
But using ubiquitous getters and setters means the data members are conceptually not hidden at all. They add absolutely nothing. What Python idiomatically spells as "foo.spam" and "foo.spam = eggs" is exactly the same thing C# idiomatically spells "foo.GetSpam()" and "foo.SetSpam(eggs)". The spam attribute is idiomatically visible in both languages.
Of course getters and setters have an advantage: you can later decide to replace the real attribute with a virtual, computed attribute; just change the getter and setter and your API is unchanged.
But that isn't necessary in Python—or even in C#, ObjC, and similar languages. You can always just replace the real attribute with a @property, and the API is unchanged, but now it's accessing a virtual, computed attribute. (Or, of course, you can always intercept access via __getattr__ and friends…)
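As a minimal sketch (a hypothetical Foo, made up for illustration), here's that replacement in Python; client code reads and writes foo.spam either way:

class Foo:
    def __init__(self):
        self.spam = 0            # version 1: a real attribute

class Foo:
    def __init__(self):
        self._spam = 0
    @property
    def spam(self):              # version 2: a virtual, computed attribute
        return self._spam
    @spam.setter
    def spam(self, value):
        self._spam = value

f = Foo()
f.spam = 3          # same API with either version
print(f.spam)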
Bundling data and methods
Another common definition of encapsulation is that it facilitates bundling data together with the methods that work on that data.

There's really nothing objectionable about that definition. And it applies perfectly well to the class notion in C++, Java, C#, Eiffel, Sather, Smalltalk, ObjC, Ruby, etc.—and in Python. (And the prototype notion in Self or JavaScript, etc.)

This is something that you don't get from C or Lisp—there you have to build encapsulation manually, and use project-specific naming conventions, header-file layouts, documentation, etc. to expose the API you want to the user—while in OO languages, there's a construct that makes it easy to build and self-documenting.

So, in the most useful sense of the word, Python does have encapsulation. And in every other sense where it doesn't, neither do any of the languages it's compared to.
-
The itertools module in the standard library comes with a nifty groupby function to group runs of equal values together.
If you wanted to group runs of adjacent values instead, that should be easy, right?
Let's give a concrete example:
>>> runs([0, 1, 2, 3, 5, 6, 7, 10, 11, 13, 16])
[(0, 3), (5, 7), (10, 11), (13, 13), (16, 16)]
If we could make groupby give us (0, 1, 2, 3), then (5, 6, 7), etc.—this would be trivial.
Unfortunately, that's easier said than done.
What's the key?
To customize groupby, you give it the same kind of key function as all of the sorting-related functions: a function that takes each value and produces a key that can be compared to another value's key. So, if you want to group 1 and 2 into the same run… what key does that?
It may not be immediately obvious how to build such a key. But the functools module comes with a helper called cmp_to_key that does it automatically. You give it an "old-style comparison function"—that is, a function of two arguments that returns -1, 0, or 1 depending on whether the left argument is less than, equal to, or greater than the right—and it gives you back a key function. And that comparison function is obvious.
But if you try it, this doesn't actually work:
>>> import functools, itertools
>>> def adjacent_cmp(x, y):
...     if x+1 < y: return -1
...     elif x > y: return 1
...     else: return 0
...
>>> adjacent_key = functools.cmp_to_key(adjacent_cmp)
>>> a = [0, 1, 2, 3, 5, 6, 7, 10, 11, 13, 16]
>>> [list(g) for k, g in itertools.groupby(a, adjacent_key)]
[[0, 1], [2, 3], [5, 6], [7], [10, 11], [13], [16]]
It groups 0 and 1 together, but 2 isn't part of the group. Why?
Because groupby is remembering the first key in the group, and testing each new value against it, not remembering the most recent key in the group. Since the whole point of the key function is that the first key and the most recent key are equal, it's perfectly within its rights to do this—and, since it makes the code a little simpler and a little more efficient, it makes sense that it would.
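To see this concretely, here's the comparison groupby is effectively making, with the group's first key on the left and each new value's key on the right:

>>> adjacent_key(0) == adjacent_key(1)
True
>>> adjacent_key(0) == adjacent_key(2)    # 2 vs. the first key, 0: not equal
False
>>> adjacent_key(1) == adjacent_key(2)    # 2 vs. the most recent key would have matched
True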
There are two ways to fix this: We could change groupby (after all, the documentation comes with a pure Python equivalent) to remember the most recent key in the group instead of the first one, or we could create a stateful key callable that caches the last value that compared equal instead of its initial value. The second one seems hackier and harder, so let's start with the first.
Customizing groupby
First, let's just take the code out of the docs and run it:
>>> [list(g) for k, g in groupby(a, adjacent_key)]
TypeError: other argument must be K instance
It's not actually equivalent after all! What's the difference?
The Python implementation has lines like this:
while self.currkey == self.tgtkey:
That seems reasonable, but these values start off as an object() sentinel, and the key returned by a key function created by cmp_to_key doesn't know how to compare itself to anything but another key returned by such a function. Its K.__eq__ raises a TypeError complaining that the other argument must be a K instance, which is exactly the error we're seeing.
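You can see that directly (this is the C-accelerated cmp_to_key; the pure-Python fallback raises an AttributeError instead):

>>> adjacent_key(0) == object()
TypeError: other argument must be K instance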
So, the first fix we have to make is to change this comparison, and a similar one a few lines down, to deal with the sentinel explicitly (which also means we have to store the sentinel so we can compare to it). I won't show the code for this or the next change, because you don't really need to see the broken code; the last version will have all of the changes.
So let's try it again:
>>> [list(g) for k, g in groupby(a, functools.cmp_to_key(adjacent_cmp))]
[[0], [1], [2], [3], [5], [6], [7], [10], [11], [13], [16]]
It's still not equivalent. What's different now?
If you look at the comparisons, they look like this:
while self.currkey == self.tgtkey:
That's clearly backward. Obviously when you're doing an ==, it shouldn't matter which value is on which side… except that we've deliberately distorted the meaning of == so that being 1 less counts as the same, without also making 1 more count as the same. We could, and probably should, change our cmp function to do this... but if we're already changing the "equivalent" pure Python groupby to actually be equivalent, we might as well change this too.
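The asymmetry is easy to see in isolation:

>>> adjacent_key(5) == adjacent_key(6)    # 1 less on the left: equal
True
>>> adjacent_key(6) == adjacent_key(5)    # 1 more on the left: not equal
False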
So, with that change, let's try again:
>>> [list(g) for k, g in groupby(a, adjacent_key)]
[[0, 1], [2, 3], [5, 6], [7], [10, 11], [13], [16]]
OK, now we're back to where we wanted to be. The change we wanted was to just remember the last key in the group instead of the first. Which means we need to change this line to remember the last key:
self.currkey = self.keyfunc(self.currvalue)
And let's try it out:
>>> [list(g) for k, g in groupby(a, functools.cmp_to_key(adjacent_cmp))]
[[0, 1, 2, 3], [5, 6, 7], [10, 11], [13], [16]]
But wait, there's another problem. If the last key is part of a group, but doesn't compare equal to the start of that group, it gets repeated on its own. For example:
>>> a = [1, 2, 3]
>>> [list(g) for k, g in groupby(a, functools.cmp_to_key(adjacent_cmp))]
[[1, 2, 3], [3]]
Why does this happen? Well, we're updating tgtkey within _grouper, but that doesn't affect self.tgtkey. As far as I can tell, the only reason _grouper takes tgtkey as an argument instead of using the attribute is a micro-optimization (local variable lookup is faster than attribute lookup), one that's rarely going to make any difference in your program's performance (especially considering all the other attribute lookups we're doing in the same loop). So, the easy fix is to not do that.
Putting it all together:
class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.sentinel = self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def __next__(self):
        # compare against the sentinel explicitly, rather than relying
        # on == working between a key object and the sentinel
        while (self.currkey is self.sentinel
               or self.tgtkey is not self.sentinel
               and self.tgtkey == self.currkey):
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper())
    def _grouper(self):
        # use self.tgtkey rather than a local copy, so __next__ sees our updates
        while self.tgtkey is self.sentinel or self.tgtkey == self.currkey:
            yield self.currvalue
            self.currvalue = next(self.it)    # Exit on StopIteration
            # remember the most recent key in the group, not the first
            self.tgtkey, self.currkey = self.currkey, self.keyfunc(self.currvalue)
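With all three changes in place, the case that was broken before now comes out right (a is still [1, 2, 3] from above):

>>> [list(g) for k, g in groupby(a, functools.cmp_to_key(adjacent_cmp))]
[[1, 2, 3]]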
At any rate, this has turned into a much more complicated solution than it originally appeared. Is there a better way? Stateful keys sounded a lot worse, but let's see if they really are.
Stateful keys
So, what would a stateful key look like? Well, let's look at what our existing key looks like. cmp_to_key is actually implemented in pure Python (with an optional C accelerator), and it looks like this:
def cmp_to_key(mycmp):
    """Convert a cmp= function into a key= function"""
    class K(object):
        __slots__ = ['obj']
        def __init__(self, obj):
            self.obj = obj
        def __lt__(self, other):
            return mycmp(self.obj, other.obj) < 0
        def __gt__(self, other):
            return mycmp(self.obj, other.obj) > 0
        def __eq__(self, other):
            return mycmp(self.obj, other.obj) == 0
        def __le__(self, other):
            return mycmp(self.obj, other.obj) <= 0
        def __ge__(self, other):
            return mycmp(self.obj, other.obj) >= 0
        def __ne__(self, other):
            return mycmp(self.obj, other.obj) != 0
        __hash__ = None
    return K
So, we want to update self.obj = other whenever they're equal, in all six comparison functions.
But we already know from the groupby source that it never calls anything but ==. And really, we don't want this key function to work with anything that calls, say, <; the idea of "sorting by adjacency" doesn't sound as compelling (or even as coherent) as grouping by adjacency. So, let's just implement that one operator.
While we're at it, let's fix the problem mentioned above, that if you compare our keys backward you get the wrong answer. If we can make a==b and b==a always the same, that makes things even less hacky.
class AdjacentKey(object):
    __slots__ = ['obj']
    def __init__(self, obj):
        self.obj = obj
    def __eq__(self, other):
        ret = self.obj - 1 <= other.obj <= self.obj + 1
        if ret:
            self.obj = other.obj
        return ret
Does it work?
>>> a = [0, 1, 2, 3, 5, 6, 7, 10, 11, 13, 16]
>>> [list(g) for k, g in itertools.groupby(a, AdjacentKey)]
[[0, 1, 2, 3], [5, 6, 7], [10, 11], [13], [16]]
Yes, and on the first try. As it turns out, that was actually easier, not harder. Sometimes it's worth looking under the covers of the standard library.
From groups to runs
Now we can easily write the function we wanted in the first place:
def first_and_last(iterable):
    start = end = next(iterable)
    for end in iterable:
        pass
    return start, end

def runs(iterable):
    for k, g in itertools.groupby(iterable, AdjacentKey):
        yield first_and_last(g)
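This gives exactly the output we asked for at the start:

>>> list(runs([0, 1, 2, 3, 5, 6, 7, 10, 11, 13, 16]))
[(0, 3), (5, 7), (10, 11), (13, 13), (16, 16)]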
Beyond 1
We've baked the notion of adjacency into our key: within +/- 1. But you might want a larger or smaller cutoff. Or, for that matter, a different type of cutoff—e.g., timedelta(hours=24). Or you might want to apply a key function to the values before comparing them, which would make it easy to implement a "natural sorting" rule (like the Mac Finder), so "file001.jpg" and "file002.jpg" are adjacent. Or you might want a different comparison predicate all together—e.g., within +/- 1% instead of +/- 1.
All of these are easy to add:
def adjacent_key(cutoff=1, key=None, predicate=None):
    if key is None:
        key = lambda v: v
    if predicate is None:
        def predicate(lhs, rhs):
            return lhs - cutoff <= rhs <= lhs + cutoff
    class K(object):
        __slots__ = ['obj']
        def __init__(self, obj):
            self.obj = obj
        def __eq__(self, other):
            ret = predicate(key(self.obj), key(other.obj))
            if ret:
                self.obj = other.obj
            return ret
    return K
Let's test it out. (I'm going to from … import a bunch of things here to make it a bit more concise.)
>>> [list(g) for k, g in groupby(a, adjacent_key(2))]
[[0, 1, 2, 3, 5, 6, 7], [10, 11, 13], [16]]
>>> b = [date(2001, 1, 1), date(2001, 2, 28), date(2001, 3, 1), date(2001, 3, 2), date(2001, 6, 4)]
>>> [list(g) for k, g in groupby(b, adjacent_key(timedelta(days=1)))]
[[datetime.date(2001, 1, 1)], [datetime.date(2001, 2, 28), datetime.date(2001, 3, 1), datetime.date(2001, 3, 2)], [datetime.date(2001, 6, 4)]]
>>> def extract_number(s):
...     return int(search(r'\d+', s).group(0))
...
>>> c = ['file001.jpg', 'file002.jpg', 'file010.jpg']
>>> [list(g) for k, g in groupby(c, adjacent_key(key=extract_number))]
[['file001.jpg', 'file002.jpg'], ['file010.jpg']]
>>> d = [1, 1.01, 1.02, 10, 99, 100]
>>> [list(g) for k, g in groupby(d, adjacent_key(predicate=lambda x, y: 1/1.1 < x/y < 1.1))]
[[1, 1.01, 1.02], [10], [99, 100]]
And let's make runs more flexible, while we're at it:
def runs(iterable, *args, **kwargs):
    for k, g in itertools.groupby(iterable, adjacent_key(*args, **kwargs)):
        yield first_and_last(g)
A quick test:
>>> list(runs(c, key=extract_number))
[('file001.jpg', 'file002.jpg'), ('file010.jpg', 'file010.jpg')]
And we're done.
-
Often, you have an algorithm that just screams out to use a dict for storage, but your data set is just too big to hold in memory. Or you need to keep the data persistently, but pickling or JSON-ing takes way too long.
A database like sqlite3 (built into the stdlib for almost all builds of Python) is a possible solution, but it's a pretty major change. Instead of just writing this:
dial(phone_numbers[person])
… you have to write this:
cur = db.execute('SELECT number FROM phone_numbers WHERE person=?', (person_id,))
dial(cur.fetchone()[0])
Fortunately, there's a much simpler answer: the dbm family of modules. It gives you a simple dict-like interface, built on top of a simple key-value database. The data are kept on disk, using a hash table implementation optimized for on-disk storage (with in-memory caching), instead of an in-memory hash table. There are limits to it (see below), but in many cases it just works.
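As a minimal sketch, reusing the hypothetical person and dial from the example above (number is assumed too), the dict-like interface looks like this:

import dbm

# 'c' opens the database, creating the file if it doesn't exist
with dbm.open('phone_numbers', 'c') as db:   # hypothetical filename
    db[person] = number          # keys and values must be str or bytes
    dial(db[person].decode())    # values come back as bytes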
Unfortunately, it scares a lot of people away. If you search for "python dbm", the first thing you'll find is probably the docs for the Python 2.7 dbm module, which say it's only available on Unix.
This is because all the modules got shuffled around between 2.x and 3.x. Rest assured, you can use the dbm family on Windows; you just can't (usually) use the specific module that's called dbm on Python 2.7.
Limitations
- A dbm database can only store strings, both as keys and values. (In fact, it only stores bytes, and by default uses your default encoding if you give it Unicode strings.)
- If you want to store arbitrary (pickleable) Python objects as values, but can live with strings for keys, see the shelve module (there's a quick sketch after this list).
- If you need your keys to be something other than strings, then dbm may not be the answer. If you have a way to decorate and undecorate your keys as strings that's unambiguous and efficient, you can write a wrapper… but often if you're getting that fancy, you need something better than dbm.
- Generally, only one program can use a dbm database file at the same time. If you want to share data between processes, dbm is not the answer.
- Despite all having the same API, the different implementations have different storage formats. Some of them may also use incompatible storage on different platforms. So, a dbm database is generally not portable between machines.
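For the shelve case, a minimal sketch (the filename and data here are hypothetical):

import shelve

# string keys, arbitrary pickleable values
with shelve.open('contacts') as db:
    db['alice'] = {'home': '555-0123', 'mobile': '555-0199'}
    print(db['alice']['mobile'])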
What does "dbm" mean, and is it Unix-only?
The actual library named dbm is Unix-only—as in real, licensed, 1970s-style AT&T Unix; not linux or even OS X. Nobody uses it today, but it's the ancestor of a whole family of simple key-value databases used today, like ndbm, gdbm, and Berkeley DB, and that family is often collectively called dbm.

In Python 3.x, dbm is a package in the standard library that includes and wraps up all of the database modules.

In Python 2.x, those modules were all scattered around the stdlib. The top-level wrapper module is named anydbm, and the name dbm is used for the specific ndbm implementation.
Berkeley DB
Berkeley DB is a more powerful replacement for dbm. It has a lot of features dbm doesn't, but also a completely different API. Versions up to 1.x were BSD-licensed; later versions were Sleepycat-licensed, and then dual-licensed as AGPL or commercial. Also, each major version has changed the API considerably. So, many people have stuck with older versions. 1.85 is still in use today—it's available for Windows, and built in on Macs.
What does Python support?
If you use the wrappers, your code will work on every platform. But it may be using the "dumb" implementation on some. If this matters to you, you'll have to know what's available so you can decide what you want to use.

Python 2.7 and 3.4 mostly support the same libraries, but under different names. I'll give the 3.x names first, separated by a slash where relevant.

In general, the wrappers around C libraries are present if the library was present when you built Python. But of course you probably don't build Python, you just install it from a binary, or it comes with your OS. The official Windows installers don't contain any of the implementations except dumb. The official Mac installers contain dumb and ndbm. Apple's pre-installed Python versions contain dumb, ndbm, and bsddb185. Linux distros may include any of dumb, ndbm, gdbm, bsddb185, and bsddb, and will usually have packages for any they don't include; you'll have to check your distro. Similarly for FreeBSD, Solaris, etc. And if you use a third-party installation like ActiveState or Enthought, you'll have to check the documentation.

- dbm / anydbm (and whichdb). Wrapper that uses the appropriate module for existing files, and the best available module for new ones (see the quick check after this list).
- dbm.dumb / dumbdbm. Simple, pure-Python implementation that's always available.
- dbm.ndbm / dbm. Wrapper around ndbm, or around gdbm's ndbm-compat mode.
- dbm.gnu / gdbm. Wrapper around gdbm.
- bsddb: Obsolete wrapper around Berkeley DB 1.x-4.x, deprecated since 2.6. Does not provide a dbm-style API, but the dbhash module wraps it with one. You probably don't want this.
- bsddb185: Third-party (but you may have it pre-installed) wrapper around Berkeley DB 1.85. Includes the dbhash-style wrapper to provide a dbm-style API.
- pybsddb: Third-party wrapper around Berkeley DB 5.0+. Includes the dbhash-style wrapper to provide a dbm-style API.
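As a quick check on what you've actually got, the whichdb half of the wrapper can identify the implementation behind an existing file (the filename here is hypothetical, and the result varies by platform and by which library created the file):

>>> import dbm
>>> dbm.whichdb('phone_numbers')
'dbm.gnu'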
So, what should you use? That depends on a whole lot of factors, but here are some rough rules of thumb:
- If you don't need to support Windows, or dumb is fast enough, just use dbm/anydbm's generic wrappers and don't worry about it.
- If you aren't distributing your project, or are willing to dual-license your open source project as AGPL, or to buy a license for your commercial project, consider Berkeley DB 5.0+ and pybsddb.
- If you don't mind compiling Python yourself on Windows, consider ndbm.
- Otherwise, consider Berkeley DB 1.85 and bsddb185.
You might also want to look at third-party Python bundles like ActiveState to see if they can guarantee a better-than-dumb dbm on every platform you care about.