Unfortunately, that information is out of date, and he hasn't updated it. So, I think it's time for a new version.
Read his article for the state of Python 2.5, then come back here for the differences in Python 3.4. (If you care about 2.6-2.7, it's somewhere in between the two, so you may need a bit of trial and error to see for yourself.)
tl;dr
Case conversion and internal representation are fixed; collation and regular expressions are improved but not entirely fixed; everything else is mostly the same.
Collation
It's still true that the basic str comparison methods compare code points lexicographically, rather than collating properly. His example still gives the same results. However, for many uses, locale.strcoll, or locale.strxfrm as a key function, does the right thing.
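For instance, here's a minimal sketch of collation-aware sorting with nothing but the stdlib; it assumes your environment's locale (LANG or similar) is set to something sensible:

    import locale

    locale.setlocale(locale.LC_COLLATE, '')       # use the user's default locale

    words = ['Arm', 'Ärmel', 'Zebra']
    print(sorted(words))                          # plain code-point order: Ärmel lands after Zebra
    print(sorted(words, key=locale.strxfrm))      # collated for the current locale
    print(locale.strcoll('Ärmel', 'Arm'))         # cmp-style three-way comparison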
Of course whenever you see "many", you'll want to know what the limitations are.
The biggest issue is that many characters compare differently in different locales. The most famous example is that, depending on the European locale, "ä" may sort as a single character between "a" and "b", as a single character after "z", or as the two characters "ae". In fact, in some countries, there are two different sorting rules, usually called dictionary and phonebook sorting.
A second issue is that many strings of codepoints are equivalent, but may not be sorted the same. For example, if you've got a script with string literals stored in NFC, but you take input from the terminal which may come in NFKD, any character that can be decomposed won't match when you try to compare them.
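Here's a quick sketch of that equivalence problem (using NFC vs. NFD for brevity): the two spellings of "é" compare unequal, and therefore sort apart, until you normalize them to a common form:

    import unicodedata

    nfc = '\u00e9'       # 'é' as one precomposed code point
    nfd = 'e\u0301'      # 'e' plus a combining acute accent

    print(nfc == nfd)                                   # False
    print(unicodedata.normalize('NFC', nfd) == nfc)     # True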
The Unicode Collation Algorithm specifies an algorithm, based on ISO 14651, plus rules for customizing that algorithm for different locales, and a Common Locale Data Repository that includes the customizations for all common locales. (They handle the dictionary-vs.-phonebook rule by treating the phonebook as a separate locale—e.g., in ICU, you specify "de__phonebook" instead of "de".)
Python does not come with the CLDR data. It relies on your OS to supply the same data as part of its locale library. But traditional Unix locale data doesn't supply everything needed to replace CLDR, and many *nix systems don't come with a complete set of locales anyway—and Windows has nothing at all. So, if you stick with the stdlib, you're usually just getting plain ISO 14651, which means "ä" is going to come after "z". If you want to sort with German rules, you can explicitly setlocale to de_DE, but that may fail—or, worse, it may succeed but not affect strcoll/strxfrm.
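So the best you can do with the stdlib alone looks something like this sketch; the exact locale name ('de_DE.UTF-8' here) is an assumption that varies by OS, and setlocale may raise locale.Error if it isn't installed:

    import locale

    try:
        locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
    except locale.Error:
        # Locale not installed (or this platform spells the name differently).
        locale.setlocale(locale.LC_COLLATE, '')

    print(sorted(['Ärmel', 'Zebra', 'Arm'], key=locale.strxfrm))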
The Unicode Collation FAQ explains the differences, with links to the full details.
Case Conversion
Python 3.3+ supports the full case conversion rules, including the SpecialCasing.txt table that can convert one character into multiple characters. For example, 'ß'.title() == 'Ss'.
This also includes the casefold method, which can handle case-insensitive comparisons in cases when lower does the wrong thing—'ß'.casefold() == 'ss'.casefold().
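Both of those examples run as-is on 3.3+; a caseless comparison helper is a one-liner (for strings that aren't guaranteed to be normalized, you'd want to normalize before casefolding, too):

    def caseless_equal(a, b):
        # casefold() handles one-to-many mappings that lower() misses, e.g. 'ß' -> 'ss'.
        return a.casefold() == b.casefold()

    print('ß'.title())                          # 'Ss'
    print(caseless_equal('straße', 'STRASSE'))  # True; with lower() this would be False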
Regular Expressions
The long-term intention is (or at least was, at one point) to replace the re module with a new implementation, regex—which, among other things, solves some of the Unicode problems and makes others easier to solve. However, regex wasn't ready for 3.3, or for 3.4, so the re module is still in the stdlib, but has gotten little attention. So, a proper fix for all of these problems may be a few years off.
First, as Christopher Lenz pointed out, Python's re module doesn't support the TR18 Unicode character classes, like \p{Ll}. This is still true in 3.4, and it's also true in regex, although it should be easier to add support to regex.
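Until that happens, the usual workaround is to lean on unicodedata for the category test instead of encoding it in the pattern; a minimal sketch:

    import unicodedata

    def lowercase_letters(text):
        # Roughly what \p{Ll} would match: characters whose general category is Ll.
        return [ch for ch in text if unicodedata.category(ch) == 'Ll']

    print(lowercase_letters('Straße 42 ω'))     # ['t', 'r', 'a', 'ß', 'e', 'ω']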
On top of that, while he didn't mention it, case-insensitive searching has always effectively used lower instead of casefold, meaning it won't find "straße" in "Königstrasse". This is fixed in regex, but is not fixed in the 3.4 stdlib module. (When casefold was added, equivalent code didn't appear to be needed in re, because regex was going to be replacing it anyway.)
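A hedged workaround in the meantime: casefold both strings yourself and do a plain substring search, instead of relying on re.IGNORECASE:

    import re

    haystack = 'Königstrasse'
    needle = 'straße'

    print(re.search(needle, haystack, re.IGNORECASE))   # None: IGNORECASE won't match 'ß' against 'ss'
    print(needle.casefold() in haystack.casefold())     # True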
It might be nice to add normal-form-independent matching as well.
Also, character classes like \w were in ASCII mode by default; they're now UNICODE by default. (Also, the LOCALE mode was broken; it's still broken in 3.4; it's been fixed in regex, but it's been quasi-deprecated for so long that I doubt anyone will care.)
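For example (re.ASCII gives you the old behavior back on str patterns):

    import re

    print(re.findall(r'\w+', 'naïve café'))              # ['naïve', 'café']
    print(re.findall(r'\w+', 'naïve café', re.ASCII))    # ['na', 've', 'caf']: the old ASCII behavior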
Finally, the re engine didn't support \u, \U, and \N escapes. Since you normally write regular expressions with raw string literals, which of course also don't support such escapes, this made it hard to write patterns for characters that are hard to type, or more readable as escapes. This was fixed in… I think 2.7/3.2; at any rate, it's definitely working in 3.4; re.search(r'\u0040', '\u0040') returns a match.
Text Segmentation
This needs to be broken down a bit.
First, Python strings are now iterables of Unicode code points, not of UTF-32-or-UTF-16-depending-on-build code units. This means s[3] is now always a "character" in the sense of a code point, never half a surrogate pair. (See Internal Representation for more on this—and note that many other languages use either UTF-16 or UTF-8 for strings, and therefore still have this problem.)
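A quick illustration (any 3.3+ build behaves the same way):

    s = '\U0001F600'    # an emoji, outside the BMP
    print(len(s))       # 1 (a narrow 3.2 build would have said 2)
    print(s[0] == s)    # True: indexing never hands you half a surrogate pair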
However, they're still not iterables of grapheme clusters. For example, a single Hangul syllable may be three Unicode code points, but in most use cases people think of it as a single character. So, given the string "각" (Hangul gag, in NFD form), the length is 3 instead of 1, and s[2] is the lower "ᄀ", which isn't really a separate character.
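To see it, decompose the syllable yourself:

    import unicodedata

    s = unicodedata.normalize('NFD', '\uac01')   # '각' in NFD
    print(len(s))                                # 3
    print([unicodedata.name(ch) for ch in s])
    # ['HANGUL CHOSEONG KIYEOK', 'HANGUL JUNGSEONG A', 'HANGUL JONGSEONG KIYEOK']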
But the only way to "fix" this would mean a major tradeoff: in a language where strings are treated as iterables of grapheme clusters (Swift is the only one I know of off-hand), strings aren't randomly accessible. So, s[2] will never give you part of a character, but only because s[2] will always raise a TypeError instead.
The way to get the best of both worlds is probably to use the Python representation, plus a generator function so you can iterate graphemes(s). (And maybe some helper functions for the previous and next grapheme after a specific point. Swift handles this by having smart "indexes" that act sort of like bidirectional iterators: s.index(c) doesn't return a number, but it does return something you can use with s[i] and can step forward or backward a grapheme at a time.)
Unfortunately, Python doesn't come with such a function. That's partly because, when it was first suggested long ago, the grapheme rule was so trivial that anyone could write it as a one-liner. (Just read a character, then all continuation characters before the next non-continuation character.) But even if you ignore the extended grapheme cluster rule (which you usually shouldn't), the legacy rule is more complicated than that nowadays—still easy to write, but not easy to come up with on your own. I've seen at least 4 people suggest that Python should add such a function over the past 5 or so years, but none of them have submitted a patch or provided a real-life use case, so it hasn't happened. I'm pretty sure no one has any objection to adding it; it's just that no one has enough motivation to do it.
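For what it's worth, here's a sketch of that simple "base character plus combining characters" rule. This is not the full UAX #29 algorithm (it gets Hangul jamo, regional indicators, ZWJ emoji sequences, and so on wrong), but it shows how small the function would be:

    import unicodedata

    def graphemes(s):
        # Yield runs of one base character followed by its combining characters.
        cluster = ''
        for ch in s:
            if cluster and unicodedata.combining(ch) == 0:
                yield cluster
                cluster = ch
            else:
                cluster += ch
        if cluster:
            yield cluster

    print(list(graphemes('e\u0301tude')))   # ['é', 't', 'u', 'd', 'e'] (the 'é' is still two code points)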
To Unicode, text segmentation also includes word and sentence segmentation, not just characters. But this is pretty clearly outside the scope of the stdlib. For example, as the spec itself mentions, splitting Japanese words is pretty much the same problem as hyphenating English words, and can't be done without a dictionary.
Bi-directional Text
I'm not sure what Python could include that would be helpful here. If you're going to write a layout engine that handles bidi text, you need to know a lot more than just the bidi boundaries (like font metrics), and I can't imagine anyone expects any of the other stuff to be part of a string library.
Python does include the unicodedata.bidirectional function, but that's been there since the early 2.x days, so clearly Christopher Lenz was looking for more than that. I just don't know what, exactly.
Locale Context
This is still as much of a problem as ever: locales are global (plus, changing locales isn't thread-safe), which means you can't use the locale module to handle multiple locales in a single app, like a webserver, unless you spawn a subprocess.
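The usual workaround is a save/set/restore wrapper like the sketch below, but the setting is process-global, so this is still not safe if another thread is doing locale-sensitive work at the same time:

    import locale
    from contextlib import contextmanager

    @contextmanager
    def collation(name):
        old = locale.setlocale(locale.LC_COLLATE)        # query the current setting
        try:
            locale.setlocale(locale.LC_COLLATE, name)    # may raise locale.Error
            yield
        finally:
            locale.setlocale(locale.LC_COLLATE, old)

    # with collation('sv_SE.UTF-8'):        # assumes this locale is installed
    #     words.sort(key=locale.strxfrm)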
In fact, it may be worse than in 2008. Back then, Python wasn't in much use on resource-limited systems, especially on web servers; nowadays, a lot more people are running Python web servers on Raspberry Pi devices and micro-VMs and so on, and some of these systems have very incomplete locale installations.
Internal Representation
This one was fixed in Python 3.3.
For 3.2, there was some talk about deprecating narrow builds, but nobody pushed it very hard. Nobody brought up the disk space issue again (it really isn't relevant; normally you don't store raw Python objects on disk…), but people did bring up the memory issue, and some other issues with dealing with Windows/Cocoa/Java/whatever UTF-16 APIs.
This dragged on until it was too late to fix 3.2, but Martin von Löwis took some of the ideas that came out of that discussion and turned them into PEP 393, which made it into 3.3. Now, strings use 1 byte/character if they're pure Latin-1, 2 bytes/character if they fit in UCS-2 (the BMP), and 4 bytes/character otherwise. Wider strings can also cache their UTF-8 and/or UTF-16 representation if needed. The old C APIs are still supported (with no significant loss in speed for any extension anyone could find), and there's a set of new, cleaner C APIs that can be used when you want a specific width or as much speed as possible.
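You can see the new representation from Python with sys.getsizeof; the per-character cost depends on the widest character in the string (the exact constant overhead varies by platform and version):

    import sys

    ascii_s  = 'a' * 1000            # 1 byte/character
    bmp_s    = '\u03c9' * 1000       # 2 bytes/character (Greek omega)
    astral_s = '\U0001F600' * 1000   # 4 bytes/character

    for s in (ascii_s, bmp_s, astral_s):
        print(len(s), sys.getsizeof(s))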
Other Stuff
There are a few problems that weren't mentioned in the original blog that are still minor problems today.
A lot of effort over the past couple of years has gone into making it easier for people to install, use, and require third-party modules. Python now comes with pip, and many popular packages that need to be compiled are available as pre-compiled wheels. (And the stable API, and the lack of narrow vs. wide builds, make that a lot less of a nightmare than it would have been to implement in the old days.)
And this has brought with it an attitude that many things not only don't have to be in the stdlib, but shouldn't be there, because they need to evolve at a different pace. For example, while the UCD doesn't change that often (Python 3.4.2 is on UCD 6.3.0, and 3.5 should be on 7.0.0, which just came out in October 2014), supplemental data like the CLDR changes more often (it's up to 26.0.1). If the recommended way of dealing with CLDR-compatible sorting is to use PyICU or PyUCA, Python doesn't have to update multiple times per year.
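For reference, sorting with PyUCA is about this much code (a third-party package, pip install pyuca; it uses the default collation table, without locale tailoring):

    from pyuca import Collator

    c = Collator()
    words = ['cafe', 'caff', 'café']
    print(sorted(words))                    # ['cafe', 'caff', 'café']: code-point order
    print(sorted(words, key=c.sort_key))    # ['cafe', 'café', 'caff']: UCA order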
Meanwhile, the unicodedata module doesn't contain all of the information in the UCD tables. Basically, it only contains the data from UnicodeData.txt; you can't ask what block a character is in, or the annotations, or any of the extra information (like the different Shift-JIS mappings for each Emoji). And this hasn't changed much since 2008. This is probably reasonable given that the full database is over 3MB, and Python itself is only about 14MB, but people have been annoyed that they can write 90% of what they want with the stdlib only to discover that for the last 10% they need to either get a third-party library or write their own UCD parser.
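What you do get is the per-character record (roughly the fields of UnicodeData.txt, plus a few extras); a quick sample:

    import unicodedata

    ch = '\u00df'                               # 'ß'
    print(unicodedata.name(ch))                 # LATIN SMALL LETTER SHARP S
    print(unicodedata.category(ch))             # Ll
    print(unicodedata.east_asian_width(ch))     # its East Asian Width class
    print(unicodedata.unidata_version)          # '6.3.0' on 3.4, per the above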
Finally, Python 3.0 added a problem that 2.5 didn't have—it made it hard, in many places, to deal with text in an unknown encoding, or with a mix of ASCII-compatible text and non-text, in both of which you often want to search for ASCII-compatible segments (e.g., consider munging HTTP response headers: if you have to use bytes because the body can't be decoded, but there's no bytes.format or bytes.__mod__, this is no fun). Python 3.3, 3.4, and 3.5 have all added improvements in this area (e.g., surrogateescape means the body can be decoded, and bytes.__mod__ means it doesn't have to be).
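A hedged sketch of the surrogateescape approach: decode header bytes of unknown or mixed encoding, work on them as str, and re-encode losslessly when you're done:

    raw = b'Content-Type: text/html\r\nX-Junk: \xff\xfe\r\n'

    text = raw.decode('ascii', errors='surrogateescape')       # undecodable bytes become lone surrogates
    headers = dict(line.split(': ', 1) for line in text.splitlines() if line)
    print(headers['Content-Type'])                              # 'text/html'

    print(text.encode('ascii', errors='surrogateescape') == raw)   # True: a lossless round trip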
Conclusions
Some of the changes that Christopher Lenz wanted 6 years ago have been made in the stdlib. And some don't belong there—I think the right solution for collation beyond what locale can do should be to encourage people to use PyUCA, Babel, or PyICU (although it would still be nice if there were a more Pythonic wrapper around ICU than PyICU, as he called for 6 years ago). But there are at least two problems that I think still need to be fixed.
Hopefully one day, either the regex module will be ready, or people will give up on it and start improving the re module instead. Either way, it's high time ICU-style character classes were added (and, if the answer is to stick with re, proper casefolding too).
Some way to iterate a string by grapheme clusters would be a great addition to the stdlib. Someone just needs to write it.