Back in 2008, Christopher Lenz wrote a great article called The Truth About Unicode in Python, which covers a lot of useful information that you wouldn't get just by reading the Unicode HOWTO, and that you might not even know to ask about unless you were already… well, not a Unicode expert, but more than a novice.

Unfortunately, that information is out of date, and he hasn't updated it. So, I think it's time for a new version.

Read his article for the state of Python 2.5, then come back here for the differences in Python 3.4. (If you care about 2.6-2.7, it's somewhere in between the two, so you may need a bit of trial and error to see for yourself.)

tl;dr

Case conversion and internal representation are fixed; collation and regular expressions are improved but not entirely fixed; everything else is mostly the same.

Collation

It's still true that the basic str comparison methods compare code points lexicographically, rather than collating properly. His example still gives the same results.

However, for many uses, locale.strcoll, or locale.strxfrm as a key function, does the right thing.
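
For example, here's a minimal sketch of the difference; the de_DE.UTF-8 locale name is an assumption (it has to be installed on your system, and the exact name varies by platform):

    import locale

    words = ["Ärger", "Zebra", "Apfel"]

    # Plain str comparison is by code point, so 'Ä' (U+00C4) sorts after 'Z':
    print(sorted(words))                      # ['Apfel', 'Zebra', 'Ärger']

    # With a German locale available, strxfrm as a key function collates properly:
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))  # ['Apfel', 'Ärger', 'Zebra']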

Of course whenever you see "many", you'll want to know what the limitations are.

The biggest issue is that many characters compare differently in different locales. The most famous example is that, depending on the European locale, "ä" may sort as a single character between "a" and "b", as a single character after "z", or as the two characters "ae". In fact, in some countries there are two different sorting rules, usually called dictionary and phonebook sorting.

A second issue is that many distinct sequences of code points are canonically equivalent, but won't compare or sort the same unless you normalize them. For example, if you've got a script with string literals stored in NFC, but you take input from the terminal that may come in decomposed form (NFD or NFKD), any character that can be decomposed won't match when you try to compare them.
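
Here's a quick illustration of that equivalence problem with unicodedata.normalize (just the comparison side of it):

    import unicodedata

    nfc = "caf\u00e9"    # 'é' as one precomposed code point (NFC)
    nfd = "cafe\u0301"   # 'e' followed by a combining acute accent (NFD)

    print(nfc == nfd)                                # False: different code points
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True once both are in NFC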

The Unicode Collation Algorithm specifies an algorithm, based on ISO 14651, plus rules for customizing that algorithm for different locales, and a Common Locale Data Repository that includes the customizations for all common locales. (They handle the dictionary-vs.-phonebook rule by treating the phonebook as a separate locale—e.g., in ICU, you specify "de__phonebook" instead of "de".)

Python does not come with the CLDR data; it relies on your OS to supply the equivalent data as part of its locale library. But traditional Unix locale data doesn't supply everything needed to replace CLDR, many *nix systems don't come with a complete set of locales anyway, and Windows has nothing at all. So, if you stick with the stdlib, you're usually just getting plain ISO 14651, which means "ä" is going to come after "z". If you want to sort with German rules, you can explicitly setlocale to de_DE, but that may fail—or, worse, it may succeed but not affect strcoll/strxfrm.

The Unicode Collation FAQ explains the differences, with links to the full details.

Case Conversion

Python 3.3+ supports the full case conversion rules, including the SpecialCasing.txt table that can convert one character into multiple characters. For example, 'ß'.title() == 'Ss'.

This also includes the casefold method, which handles case-insensitive comparisons in cases where lower does the wrong thing: 'ß'.casefold() == 'ss'.casefold().
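
A few concrete examples (these hold in 3.3+):

    print("ß".title())   # 'Ss': one character can become two

    # lower() leaves 'ß' alone, so a naive case-insensitive comparison fails:
    print("straße".lower() == "STRASSE".lower())        # False
    # casefold() maps 'ß' to 'ss', which is what you want for comparison:
    print("straße".casefold() == "STRASSE".casefold())  # True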

Regular Expressions

The long-term intention is (or at least was, at one point) to replace the re module with a new implementation, regex—which, among other things, solves some of the Unicode problems and makes others easier to solve. However, regex wasn't ready for 3.3, or for 3.4, so the re module is still in the stdlib, but has gotten little attention. So, a proper fix for all of these problems may be a few years off.

First, as Christopher Lenz pointed out, Python's re module doesn't support the TR18 Unicode character classes, like \p{Ll}. This is still true in 3.4, and it's also true in regex, although it should be easier to add support to regex.

On top of that, while he didn't mention it, case-insensitive searching has always effectively used lower instead of casefold, meaning it won't find "straße" in "Königstrasse". This is fixed in regex, but not in the 3.4 stdlib module. (When casefold was added, fixing re didn't seem worth the effort, because regex was going to replace it anyway.)
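
You can see the problem, and the obvious manual workaround, with nothing but the stdlib:

    import re

    # re's IGNORECASE uses simple case mapping, so 'ß' never matches 'ss':
    print(re.search("straße", "Königstrasse", re.IGNORECASE))  # None

    # Until re gets full case-folding, you can fold both sides by hand:
    print("straße".casefold() in "Königstrasse".casefold())    # True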

It might be nice to add normal-form-independent matching, too.

Also, character classes like \w used to be ASCII-only by default; for str patterns, they're now UNICODE by default. (Meanwhile, the LOCALE mode was broken, and it's still broken in 3.4; it's been fixed in regex, but it's been quasi-deprecated for so long that I doubt anyone will care.)

Finally, the re engine didn't support \u, \U, and \N escapes. Since you normally write regular expressions with raw string literals, which of course also don't support such escapes, this made it hard to write patterns for characters that are hard to type, or that are more readable as escapes. The \u and \U escapes were added to re in 3.3 (\N still isn't handled as of 3.4, although regex handles it), so re.search(r'\u0040', '\u0040') returns a match in 3.4.

Text Segmentation

This needs to be broken down a bit.

First, Python strings are now iterables of Unicode code points, not of UTF-32-or-UTF-16-depending-on-build code units. This means s[3] is now always a "character" in the sense of a code point, never half a surrogate pair. (See Internal Representation for more on this—and note that many other languages use either UTF-16 or UTF-8 for strings, and therefore still have this problem.)
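
A quick check (len would have been 4 on an old narrow build, because the emoji took two UTF-16 code units):

    s = "a\U0001F600b"   # an astral (non-BMP) character in the middle
    print(len(s))        # 3 on any 3.3+ build
    print(s[1])          # the emoji itself, never half a surrogate pair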

However, they're still not iterables of grapheme clusters. For example, a single Hangul syllable may be three Unicode code points, but in most use cases people think of it as a single character. So, given the string "각" (Hangul gag, in NFD form), the length is 3 instead of 1, and s[2] is just the trailing consonant jamo (the "ᄀ" shape at the bottom of the syllable), which isn't really a separate character.
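
For example (doing the decomposition explicitly, so the result doesn't depend on how your editor saved the literal):

    import unicodedata

    s = unicodedata.normalize("NFD", "각")  # decompose into conjoining jamo
    print(len(s))    # 3 code points, even though it displays as one syllable
    print(s[0])      # the leading consonant, on its own
    print(s[2])      # the trailing consonant, on its own
    print(len(unicodedata.normalize("NFC", s)))  # 1: recomposed into a single code point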

But the only way to "fix" this would mean a major tradeoff: in a language where strings are treated as iterables of grapheme clusters (Swift is the only one I know of off-hand), strings aren't randomly accessible. So s[2] will never give you part of a character, but only because s[2] will always give you a TypeError instead.

The way to get the best of both worlds is probably to use the Python representation, plus a generator function so you can iterate graphemes(s). (And maybe some helper functions for the previous and next grapheme after a specific point. Swift handles this by having smart "indexes" that act sort of like bidirectional iterators, so s.index(c) doesn't return a number, but it does return something that you can use with s[i] and s++ and s--.)

Unfortunately, Python doesn't come with such a function. That's partly because when it was first suggested long ago, the grapheme rule was so trivial that anyone could write it as a one-liner. (Just read a character, then all continuation characters before the next non-continuation character.) But even if you ignore the extended grapheme rule (which you usually shouldn't), the legacy rule is more complicated than that nowadays—still easy to write, but not easy to come up with on your own. I've seen at least 4 people suggest that Python should add such a function over the past 5 or so years, but none of them have submitted a patch or provided a real-life use case, so it hasn't happened. I'm pretty sure no one has any objection to adding it; it's just that no one has enough motivation to do it.
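
For what it's worth, here's a minimal sketch of that old, trivial rule (a base character plus any following combining marks). It is not UAX #29, legacy or extended, so it won't group Hangul jamo sequences, ZWJ emoji, regional-indicator pairs, and so on:

    import unicodedata

    def graphemes(s):
        """Yield clusters of one base character plus any following combining marks.

        This is only the trivial rule described above, not UAX #29.
        """
        cluster = ""
        for ch in s:
            if cluster and unicodedata.combining(ch) == 0:
                yield cluster
                cluster = ch
            else:
                cluster += ch
        if cluster:
            yield cluster

    print(list(graphemes("cafe\u0301")))  # ['c', 'a', 'f', 'é']: 4 clusters, 5 code points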

To Unicode, text segmentation also includes word and sentence segmentation, not just characters. But this is pretty clearly outside the scope of the stdlib. For example, as the spec itself mentions, splitting Japanese words is pretty much the same problem as hyphenating English words, and can't be done without a dictionary.

Bi-directional Text

I'm not sure what Python could include that would be helpful here. If you're going to write a layout engine that handles bidi text, you need to know a lot more than just the bidi boundaries (like font metrics), and I can't imagine anyone expects any of the other stuff to be part of a string library.

Python does include the unicodedata.bidirectional function, but that's been there since the early 2.x days, so clearly Christopher Lenz was looking for more than that. I just don't know what, exactly.

Locale Context

This is still as much of a problem as ever: locales are global (plus, changing locales isn't thread-safe), which means you can't use the locale module to handle multiple locales in a single app, like a webserver, unless you spawn a subprocess.

In fact, it may be worse than in 2008. Back then, Python wasn't in much use on resource-limited systems, especially not as web servers; nowadays, a lot more people are running Python web servers on Raspberry Pi devices, micro-VMs, and so on, and some of these systems have very incomplete locale installations.

Internal Representation

This one was fixed in Python 3.3.

For 3.2, there was some talk about deprecating narrow builds, but nobody pushed it very hard. Nobody brought up the disk space issue again (it really isn't relevant; normally you don't store raw Python objects on disk…), but people did bring up the memory issue, and some other issues with dealing with Windows/Cocoa/Java/whatever UTF-16 APIs.

This dragged on until it was too late to fix 3.2, but Martin von Löwis took some of the ideas that came out of that discussion and turned them into PEP 393, which made it into 3.3. Now, strings use 1 byte/character if they're pure Latin-1, 2 bytes/character if they fit in the BMP (UCS-2), and 4 bytes/character otherwise. Wider strings can also cache their UTF-8 and/or UTF-16 representation if needed. The old C APIs are still supported (with no significant loss in speed for any extension anyone could find), and there's a set of new, cleaner C APIs that can be used when you want a specific width or as much speed as possible.
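
You can see the flexible representation from Python with sys.getsizeof; the exact numbers include per-object overhead and vary by version and platform, so treat them as illustrative:

    import sys

    # Each string is 4 characters long, but the widest code point differs,
    # so PEP 393 stores them with 1, 1, 2, or 4 bytes per character:
    for s in ["spam", "sp\xe1m", "sp\u0101m", "sp\U0001f40dm"]:
        print(repr(s), sys.getsizeof(s))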

Other Stuff

There are a few problems that weren't mentioned in the original blog that are still minor problems today.

A lot of effort over the past couple of years has gone into making it easier for people to install, use, and require third-party modules. Python now comes with pip, and many popular packages that need to be compiled are available as pre-compiled wheels. (And the stable ABI, and the lack of narrow vs. wide builds, make that a lot less of a nightmare than it would have been to implement in the old days.)

And this has brought with it an attitude that many things not only don't have to be in the stdlib, but shouldn't be there, because they need to evolve at a different pace. For example, while the UCD doesn't change that often (Python 3.4.2 is on UCD 6.3.0, and 3.5 should be on 7.0.0, which just came out in October 2014), supplemental data like the CLDR changes more often (it's already up to 26.0.1). If the recommended way of dealing with CLDR-compatible sorting is to use PyICU or PyUCA, Python doesn't have to update multiple times per year.

Meanwhile, the unicodedata module doesn't contain all of the information in the UCD tables. Basically, it only contains the data from UnicodeData.txt; you can't ask what block a character is in, or the annotations, or any of the extra information (like the different Shift-JIS mappings for each Emoji). And this hasn't changed much since 2008. This is probably reasonable given that the full database is over 3MB, and Python itself is only about 14MB, but people have been annoyed that they can write 90% of what they want with the stdlib only to discover that for the last 10% they need to either get a third-party library or write their own UCD parser.
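
For instance, sticking to things that actually exist in 3.4's unicodedata:

    import unicodedata

    ch = "\u00e9"
    print(unicodedata.name(ch))             # 'LATIN SMALL LETTER E WITH ACUTE'
    print(unicodedata.category(ch))         # 'Ll'
    print(unicodedata.combining("\u0301"))  # 230

    # But there's no unicodedata.block() or annotation lookup; block ranges live
    # in Blocks.txt, which isn't shipped, so for that you need a third-party
    # package or your own parser.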

Finally, Python 3.0 added a problem that 2.5 didn't have—it made it hard, in many places, to deal with text in an unknown encoding, or with a mix of ASCII-compatible text and non-text, both of which you often want to search for ASCII-compatible segments. (For example, consider munging HTTP response headers: if you have to use bytes because the body can't be decoded, but there's no bytes.format or bytes.__mod__, this is no fun.) Python 3.3, 3.4, and 3.5 have all added improvements in this area (e.g., surrogateescape means the body can be decoded, and bytes.__mod__ means it doesn't have to be).
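
Here's a small sketch of the surrogateescape side of that; the header bytes are made up for illustration:

    raw = b"Content-Disposition: attachment; filename=caf\xe9.txt\r\n"

    # surrogateescape smuggles undecodable bytes through as lone surrogates,
    # so mostly-ASCII data can be handled as text without losing anything:
    text = raw.decode("ascii", errors="surrogateescape")
    print("attachment" in text)  # True: the ASCII parts are searchable as str

    # ...and the round trip back to the original bytes is lossless:
    assert text.encode("ascii", errors="surrogateescape") == raw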

Conclusions

Some of the changes that Christopher Lenz wanted 6 years ago have been made in the stdlib. And some don't belong there—I think the right solution for collation beyond what locale can do should be to encourage people to use PyUCA, Babel, or PyICU (although it would still be nice if there were a more Pythonic wrapper around ICU than PyICU, as he called for 6 years ago). But there are at least two problems that I think still need to be fixed.

Hopefully one day, either the regex module will be ready, or people will give up on it and start improving the re module instead. Either way, it's high time ICU-style character classes were added (and, if the answer is to stick with re, proper casefolding too).

Some way to iterate a string by grapheme clusters would be a great addition to the stdlib. Someone just needs to write it.