Back in 2008, Christopher Lenz wrote a great article called The Truth About Unicode in Python, which covers a lot of useful information that you wouldn't get just by reading the Unicode HOWTO, and that you might not even know to ask about unless you were already… well, not a Unicode expert, but more than a novice.

Unfortunately, that information is out of date, and he hasn't updated it. So, I think it's time for a new version.

Read his article for the state of Python 2.5, then come back here for the differences in Python 3.4. (If you care about 2.6-2.7, it's somewhere in between the two, so you may need a bit of trial and error to see for yourself.)

tl;dr

Case conversion and internal representation are fixed; collation and regular expressions are improved but not entirely fixed; everything else is mostly the same.

Collation

It's still true that the basic str comparison methods compare code points lexicographically, rather than collating properly. His example still gives the same results.

However, for many uses, locale.strcoll, or locale.strxfrm as a key function, does the right thing.
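For example, here's a minimal sketch of locale-aware sorting with locale.strxfrm as a key function. The locale name de_DE.UTF-8 is an assumption; what's installed varies by OS, and setlocale may fail (or silently do nothing useful), as discussed below:

import locale

words = ['Bar', 'Bz', 'Bären']
print(sorted(words))  # plain code-point order: ['Bar', 'Bz', 'Bären']

# Assumes a German locale is installed; otherwise this raises locale.Error.
locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
print(sorted(words, key=locale.strxfrm))  # expected German order: ['Bar', 'Bären', 'Bz']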

Of course whenever you see "many", you'll want to know what the limitations are.

The biggest issue is that many characters compare differently in different locales. The most famous example is that in different European locales, "ä" may sort as a single character between "a" and "b", as a single character after "z", or as the two characters "ae". In fact, in some countries, there are two different sorting rules, usually called dictionary and phonebook sorting.

A second issue is that many strings of codepoints are equivalent, but may not be sorted the same. For example, if you've got a script with string literals stored in NFC, but you take input from the terminal which may come in NFKD, any character that can be decomposed won't match when you try to compare them.
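Here's a quick sketch of that mismatch, and the usual workaround of normalizing both sides with unicodedata.normalize before comparing:

import unicodedata

nfc = 'caf\u00e9'    # 'café' with a precomposed é
nfd = 'cafe\u0301'   # 'café' as 'e' plus a combining acute accent
print(nfc == nfd)    # False: equivalent text, different code points
print(unicodedata.normalize('NFC', nfc) ==
      unicodedata.normalize('NFC', nfd))   # True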

The Unicode Collation Algorithm specifies an algorithm, based on ISO 14651, plus rules for customizing that algorithm for different locales, and a Common Locale Data Repository that includes the customizations for all common locales. (They handle the dictionary-vs.-phonebook rule by treating the phonebook as a separate locale—e.g., in ICU, you specify "de__phonebook" instead of "de".)

Python does not come with the CLDR data. It relies on your OS to supply the same data as part of its locale library. But traditional Unix locale data doesn't supply everything needed to replace CLDR, and many *nix systems don't come with a complete set of locales anyway—and Windows has nothing at all. So, if you stick with the stdlib, you're usually just getting plain ISO 14651, which means "ä" is going to come after "z". If you want to sort with German rules, you can explicitly setlocale de_DE, but that may fail—or, worse, it may succeed but not affect strcoll/strxfrm.
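If you need UCA-style sorting without depending on OS locale data, the usual answer is a third-party library. Here's a minimal sketch using the pyuca package (an assumption: that it's installed via pip; it implements the default collation table, without CLDR tailorings):

from pyuca import Collator   # third-party: pip install pyuca

collator = Collator()
words = ['Bar', 'Bz', 'Bären']
# Under the default collation table, accents are only a secondary difference,
# so 'Bären' sorts between 'Bar' and 'Bz' instead of after 'Bz'.
print(sorted(words, key=collator.sort_key))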

The Unicode Collation FAQ explains the differences, with links to the full details.

Case Conversion

Python 3.3+ supports the full case conversion rules, including the SpecialCasing.txt table that can convert one character into multiple characters. For example, 'ß'.title() == 'Ss'.

This also includes the casefold method, which can handle case-insensitive comparisons in cases when lower does the wrong thing—'ß'.casefold() == 'ss'.casefold().
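A small illustration of the difference (caseless_equal is just a hypothetical helper name for this sketch):

def caseless_equal(a, b):
    # Compare strings case-insensitively via casefold(), not lower().
    return a.casefold() == b.casefold()

print('straße'.lower() == 'strasse'.lower())   # False: lower() leaves 'ß' alone
print(caseless_equal('straße', 'STRASSE'))     # True: casefold() maps 'ß' to 'ss'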

Regular Expressions

The long-term intention is (or at least was, at one point) to replace the re module with a new implementation, regex—which, among other things, solves some of the Unicode problems and makes others easier to solve. However, regex wasn't ready for 3.3, or for 3.4, so the re module is still in the stdlib, but has gotten little attention. So, a proper fix for all of these problems may be a few years off.

First, as Christopher Lenz pointed out, Python's re module doesn't support the TR18 Unicode character classes, like \p{Ll}. This is still true in 3.4, and it's also true in regex, although it should be easier to add support to regex.
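Until then, one workaround is to check categories yourself with unicodedata.category (a rough sketch, not a substitute for real \p{...} support in the regex engine):

import unicodedata

def in_category(s, category='Ll'):
    # Keep only code points whose Unicode general category matches,
    # e.g. 'Ll' for lowercase letters (roughly what \p{Ll} would match).
    return [c for c in s if unicodedata.category(c) == category]

print(in_category('Straße 42'))   # ['t', 'r', 'a', 'ß', 'e']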

On top of that, while he didn't mention it, case-insensitive searching has always effectively used lower instead of casefold, meaning it won't find "straße" in "Königstrasse". This is fixed in regex, but is not fixed in the 3.4 stdlib module. (When casefold was added, equivalent code didn't appear to be needed in re, because regex was going to be replacing it anyway.)
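You can see the problem, and one manual workaround, in a couple of lines (casefolding both sides by hand only works for literal, substring-style patterns):

import re

# Stdlib IGNORECASE uses simple one-to-one case mapping, so 'ß' never matches 'ss':
print(re.search('straße', 'Königstrasse', re.IGNORECASE))             # None
print(re.search('straße'.casefold(), 'Königstrasse'.casefold()))      # matches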

It might also be nice to add normal-form-independent matching on top of all this.

Also, character classes like \w were in ASCII mode by default; they're now UNICODE by default. (Meanwhile, the LOCALE mode was broken; it's still broken in 3.4; it's been fixed in regex, but it's been quasi-deprecated for so long that I doubt anyone will care.)

Finally, the re engine didn't support \u, \U, and \N escapes. Since you normally write regular expressions with raw string literals, which of course also don't process such escapes, this made it hard to write patterns for characters that are hard to type, or that are more readable as escapes. The \u and \U escapes were fixed along the way (in 3.3, as far as I can tell); at any rate, they're definitely working in 3.4: re.search(r'\u0040', '\u0040') returns a match. The \N{NAME} escapes, on the other hand, still aren't supported by re as of 3.4.

Text Segmentation

This needs to be broken down a bit.

First, Python strings are now iterables of Unicode code points, not of UTF-32-or-UTF-16-depending-on-build code units. This means s[3] is now always a "character" in the sense of a code point, never half a surrogate pair. (See Internal Representation for more on this—and note that many other languages use either UTF-16 or UTF-8 for strings, and therefore still have this problem.)

However, they're still not iterables of grapheme clusters. For example, a single Hangul syllable may be three Unicode code points, but in most use cases people think of it as a single character. So, given the string "각" (Hangul gag, in NFD form), the length is 3 instead of 1, and s[2] is just the trailing jamo "ᆨ", which isn't really a separate character.
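For instance (a small sketch; the NFD decomposition is what produces the three jamo):

import unicodedata

s = unicodedata.normalize('NFD', '각')   # Hangul syllable GAG, decomposed
print(len(s))                            # 3
print([unicodedata.name(c) for c in s])
# ['HANGUL CHOSEONG KIYEOK', 'HANGUL JUNGSEONG A', 'HANGUL JONGSEONG KIYEOK']
print(s[2])   # just the trailing jamo, not a syllable on its own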

But the only way to "fix" this would mean a major tradeoff: in a language where strings are treated as iterables of grapheme clusters (Swift is the only one I know of off-hand), strings aren't randomly accessible. So, s[2] can never give you part of a character, but only because s[2] isn't allowed at all; in Python terms, it would always raise a TypeError.

The way to get the best of both worlds is probably to use the Python representation, plus a generator function so you can iterate graphemes(s). (And maybe some helper functions for the previous and next grapheme after a specific point. Swift handles this by having smart "indexes" that act sort of like bidirectional iterators, so s.index(c) doesn't return a number, but it does return an opaque index i that you can use with s[i] and step with i++ and i--.)

Unfortunately, Python doesn't come with such a function. Partly because when it was first suggested long ago, the grapheme rule was so trivial that anyone could write it as a one-liner. (Just read a character, then all continuation characters before the next non-continuation character.) But even if you ignore the extended grapheme rule (which you usually shouldn't), the legacy rule is more complicated than that nowadays—still easy to write, but not easy to come up with on your own. I've seen at least 4 people suggest that Python should add such a function over the past 5 or so years, but none of them have submitted a patch or provided a real-life use case, so it hasn't happened. I'm pretty sure no one has any objection to adding it, just no one has enough motivation to do it.
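For reference, here's what that naive "base character plus combining marks" one-liner grows into as a generator. It's only a sketch of the legacy idea, not UAX #29: no Hangul jamo handling, no ZWJ sequences, no regional indicators:

import unicodedata

def graphemes(s):
    # Naive legacy rule: a cluster is one base character plus any marks
    # (category M*) that follow it. Real segmentation has many more rules.
    cluster = ''
    for c in s:
        if cluster and not unicodedata.category(c).startswith('M'):
            yield cluster
            cluster = c
        else:
            cluster += c
    if cluster:
        yield cluster

print(list(graphemes('cafe\u0301')))   # ['c', 'a', 'f', 'é'] (e + accent stay together)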

To Unicode, text segmentation also includes word and sentence segmentation, not just characters. But this is pretty clearly outside the scope of the stdlib. For example, as the spec itself mentions, splitting Japanese words is pretty much the same problem as hyphenating English words, and can't be done without a dictionary.

Bi-directional Text

I'm not sure what Python could include that would be helpful here. If you're going to write a layout engine that handles bidi text, you need to know a lot more than just the bidi boundaries (like font metrics), and I can't imagine anyone expects any of the other stuff to be part of a string library.

Python does include the unicodedata.bidirectional function, but that's been there since the early 2.x days, so clearly Christopher Lenz was looking for more than that. I just don't know what, exactly.

Locale Context

This is still as much of a problem as ever: locales are global (plus, changing locales isn't thread-safe), which means you can't use the locale module to handle multiple locales in a single app, like a webserver, unless you spawn a subprocess.

In fact, it may be worse than in 2008. Back then, Python wasn't in much use on resource-limited systems, especially on web servers; nowadays, a lot more people are running Python web servers on Raspberry Pi devices and micro-VMs and so on, and some of these systems have very incomplete locale installations.

Internal Representation

This one was fixed in Python 3.3.

For 3.2, there was some talk about deprecating narrow builds, but nobody pushed it very hard. Nobody brought up the disk space issue again (it really isn't relevant; normally you don't store raw Python objects on disk…), but people did bring up the memory issue, and some other issues with dealing with Windows/Cocoa/Java/whatever UTF-16 APIs.

This dragged on until it was too late to fix 3.2, but Martin von Löwis took some of the ideas that came out of that discussion and turned them into PEP 393, which made it into 3.3. Now, strings use 1 byte/character if they're pure Latin-1, 2 bytes if they're pure UCS-2, 4 bytes otherwise. Wider strings can also cache their UTF-8 and/or UTF-16 representation if needed. The old C APIs are still supported (with no significant loss in speed for any extension anyone could find), and there's a set of new, cleaner C APIs that can be used when you want a specific width or as much speed as possible.
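You can see the flexible representation from Python, at least roughly, by measuring strings that differ only in their widest code point (the exact overheads are an implementation detail and vary by version and platform):

import sys

for s in ['a' * 1000, '\u03a9' * 1000, '\U0001f600' * 1000]:
    # Roughly 1, 2, and 4 bytes per character, plus a fixed per-object header.
    print(hex(ord(s[0])), sys.getsizeof(s))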

Other Stuff

There are a few problems that weren't mentioned in the original blog that are still minor problems today.

A lot of effort over the past couple of years has gone into making it easier for people to install, use, and require third-party modules. Python now comes with pip, and many popular packages that need to be compiled are available as pre-compiled wheels. (And the stable API, and the lack of narrow vs. wide builds, make that a lot less of a nightmare than it would have been to implement in the old days.)

And this has brought with it an attitude that many things not only don't have to be in the stdlib, but shouldn't be there, because they need to evolve at a different pace. For example, while the UCD doesn't change that often (Python 3.4.2 is on UCD 6.3.0, and 3.5 should be on 7.0.0, which just came out in October 2014), supplemental data like the CLDR changes more often (it's up to 26.0.1). If the recommended way of dealing with CLDR-compatible sorting is to use PyICU or PyUCA, Python doesn't have to update multiple times per year.

Meanwhile, the unicodedata module doesn't contain all of the information in the UCD tables. Basically, it only contains the data from UnicodeData.txt; you can't ask what block a character is in, or the annotations, or any of the extra information (like the different Shift-JIS mappings for each Emoji). And this hasn't changed much since 2008. This is probably reasonable given that the full database is over 3MB, and Python itself is only about 14MB, but people have been annoyed that they can write 90% of what they want with the stdlib only to discover that for the last 10% they need to either get a third-party library or write their own UCD parser.
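What is there is still handy; it just stops at the per-character properties. A few examples of what unicodedata can tell you in 3.4 (block membership, by contrast, simply isn't exposed):

import unicodedata

c = '\u00e9'
print(unicodedata.name(c))            # 'LATIN SMALL LETTER E WITH ACUTE'
print(unicodedata.category(c))        # 'Ll'
print(unicodedata.decomposition(c))   # '0065 0301'
# There is no unicodedata.block(c) or annotation lookup; for that you need
# a third-party library or your own parser for the extra UCD files.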

Finally, Python 3.0 added a problem that 2.5 didn't have—it made it hard, in many places, to deal with text in an unknown encoding, or with a mix of ASCII-compatible text and non-text, in both of which you often want to search for the ASCII-compatible segments (e.g., consider munging HTTP response headers; if you have to use bytes because the body can't be decoded, but there's no bytes.format or bytes.__mod__, this is no fun). Python 3.3, 3.4, and 3.5 have all added improvements in this area (e.g., surrogateescape means the body can be decoded, or bytes.__mod__ means it doesn't have to be).
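For example, surrogateescape lets you decode headers-plus-body as text and still get the original bytes back (a minimal sketch with made-up header data):

raw = b'Content-Type: text/plain\r\n\r\n\xff\xfe not valid UTF-8'
text = raw.decode('ascii', errors='surrogateescape')
# The ASCII-compatible parts are searchable as ordinary str...
print('Content-Type' in text)   # True
# ...and the undecodable bytes survive a round trip unchanged:
print(text.encode('ascii', errors='surrogateescape') == raw)   # True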

Conclusions

Some of the changes that Christopher Lenz wanted 6 years ago have been made in the stdlib. And some don't belong there—I think the right solution for collation beyond what locale can do should be to encourage people to use PyUCA, Babel, or PyICU (although it would still be nice if there were a more Pythonic wrapper around ICU than PyICU, as he called for 6 years ago). But there are at least two problems that I think still need to be fixed.

Hopefully one day, either the regex module will be ready, or people will give up on it and start improving the re module instead. Either way, it's high time ICU-style character classes were added (and, if the answer is to stick with re, proper casefolding too).

Some way to iterate a string by grapheme clusters would be a great addition to the stdlib. Someone just needs to write it.

It's been more than a decade since Typical Programmer Greg Jorgensen taught the world about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed. Languages in the Abject-Oriented space have been borrowing ideas from another paradigm entirely—and then everyone realized that languages like Python, Ruby, and JavaScript had been doing it for years and just hadn't noticed (because these languages do not require you to declare what you're doing, or even to know what you're doing). Meanwhile, new hybrid languages borrow freely from both paradigms.

This other paradigm—which is actually older, but was largely constrained to university basements until recent years—is called Functional Addiction.

A Functional Addict is someone who regularly gets higher-order—sometimes they may even exhibit dependent types—but still manages to retain a job.

Retaining a job is of course the goal of all programming. This is why some of these new hybrid languages, like Rust, check all borrowing, from both paradigms, so extensively that you can make regular progress for months without ever successfully compiling your code, and your managers will appreciate that progress. After all, once it does compile, it will definitely work.

Closures

It's long been known that Closures are dual to Encapsulation.

As Abject-Oriented Programming explained, Encapsulation involves making all of your variables public, and ideally global, to let the rest of the code decide what should and shouldn't be private.

Closures, by contrast, are a way of referring to variables from outer scopes. And there is no scope more outer than global.

Immutability

One of the reasons Functional Addiction has become popular in recent years is that to truly take advantage of multi-core systems, you need immutable data, sometimes also called persistent data.

Instead of mutating a function to fix a bug, you should always make a new copy of that function. For example:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

When you discover that you actually wanted fields 2 and 3 rather than 1 and 2, it might be tempting to mutate the state of this function. But doing so is dangerous. The right answer is to make a copy, and then try to remember to use the copy instead of the original:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

function getCustName2(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[2] + ' ' + custRec[3];
    return fullname;
}

This means anyone still using the original function can continue to reference the old code, but as soon as it's no longer needed, it will be automatically garbage collected. (Automatic garbage collection isn't free, but it can be outsourced cheaply.)

Higher-Order Functions

In traditional Abject-Oriented Programming, you are required to give each function a name. But over time, the name of the function may drift away from what it actually does, making it as misleading as comments. Experience has shown that people will only keep one copy of their information up to date, and the CHANGES.TXT file is the right place for that.

Higher-Order Functions can solve this problem:

function []Functions = [
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    },
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[2] + ' ' + custRec[3];
        return fullname;
    },
]

Now you can refer to these functions by order, so there's no need for names.

Parametric Polymorphism

Traditional languages offer Abject-Oriented Polymorphism and Ad-Hoc Polymorphism (also known as Overloading), but better languages also offer Parametric Polymorphism.

The key to Parametric Polymorphism is that the type of the output can be determined from the type of the inputs via Algebra. For example:

function getCustData(custId, x)
{
    if (x == int(x)) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return int(fullname);
    } else if (x.real == 0) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return double(fullname);
    } else {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return complex(fullname);
    }
}

Notice that we've called the variable x. This is how you know you're using Algebraic Data Types. The names y, z, and sometimes w are also Algebraic.

Type Inference

Languages that enable Functional Addiction often feature Type Inference. This means that the compiler can infer your typing without you having to be explicit:


function getCustName(custID)
{
    // WARNING: Make sure the DB is locked here or
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

We didn't specify what will happen if the DB is not locked. And that's fine, because the compiler will figure it out and insert code that corrupts the data, without us needing to tell it to!

By contrast, most Abject-Oriented languages are either nominally typed—meaning that you give names to all of your types instead of meanings—or dynamically typed—meaning that your variables are all unique individuals that can accomplish anything if they try.

Memoization

Memoization means caching the results of a function call:

function getCustName(custID)
{
    if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Non-Strictness

Non-Strictness is often confused with Laziness, but in fact Laziness is just one kind of Non-Strictness. Here's an example that compares two different forms of Non-Strictness:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRate(custId) {}

function callByNameTaxRate(custId)
{
    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
}

Both are Non-Strict, but the second one forces the compiler to actually compile the function just so we can Call it By Name. This causes code bloat. The Lazy version will be smaller and faster. Plus, Lazy programming allows us to create infinite recursion without making the program hang:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRateRecursive(custId) { lazyTaxRateRecursive(custId); }

Laziness is often combined with Memoization:

function getCustName(custID)
{
    // if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Outside the world of Functional Addicts, this same technique is often called Test-Driven Development. If enough tests can be embedded in the code to achieve 100% coverage, or at least a decent amount, your code is guaranteed to be safe. But because the tests are not compiled and executed in the normal run, or indeed ever, they don't affect performance or correctness.

Conclusion

Many people claim that the days of Abject-Oriented Programming are over. But this is pure hype. Functional Addiction and Abject Orientation are not actually at odds with each other, but instead complement each other.
