Many novices notice that, for many types, repr and eval are perfect opposites, and assume that this is a great way to serialize their data:
    def save(path, *things):
        with open(path, 'w') as f:
            for thing in things:
                f.write(repr(thing) + '\n')

    def load(path):
        with open(path) as f:
            return [eval(line) for line in f]

If you get lucky, you start running into problems, because some objects don't have a round-trippable repr. If you don't get lucky, you run into the _real_ problems later on.

Notice that the same basic problems come up when designing network protocols as when designing file formats, and most of the same solutions apply.

The obvious problems

By default, a custom class--like many third-party and even stdlib types--will just look like <spam.Spam object at 0x12345678>, which you can't eval. And these are the most fun kinds of bugs--the save succeeds, with no indication that anything went wrong, until you try to load the data later and find that all your useful information is missing.
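
Here's a minimal sketch of that failure mode, using the save and load functions from above (Spam is a stand-in for any class without a custom __repr__):
    class Spam:
        def __init__(self, value):
            self.value = value

    save('spam.txt', Spam(42))   # happily writes '<__main__.Spam object at 0x...>'
    load('spam.txt')             # SyntaxError: eval can't parse that string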

You can add a __repr__ method to your own types (which can be tricky; I'll get to that later), and maybe even subclass or patch third-party types, but eventually you run into something that just doesn't have an obvious round-trippable representation. For example, what string could you eval to re-generate an ElementTree node?
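
For a simple class, a round-trippable __repr__ might look something like this sketch:
    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y
        def __repr__(self):
            # Only round-trips if x and y are themselves round-trippable,
            # and if the name Point is in scope wherever eval is called.
            return 'Point({!r}, {!r})'.format(self.x, self.y)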

Besides that, there are types that are often round-trippable, but aren't when they're too big (like NumPy arrays, whose reprs get truncated with "..."), or in some other cases you aren't likely to run into until you're deep into development (e.g., lists that contain themselves).
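
Two quick demonstrations (the first assumes NumPy is installed):
    import numpy as np
    repr(np.arange(10000))   # 'array([   0,    1,    2, ..., 9997, 9998, 9999])'
                             # -- the '...' truncation means it can't round-trip

    a = [1, 2, 3]
    a.append(a)              # a list that contains itself
    repr(a)                  # '[1, 2, 3, [...]]' -- eval can't rebuild the cycle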

The real problems

Safety and security

Let's say you've written a scheduler program. I look at the config file, and there's a line like this (with the quotes):
    "my task"
What do you think will happen if I change it to this?
    __import__("os").system("rm -rf /")
The eval docs explicitly point this out: "See ast.literal_eval() for a function that can safely evaluate strings with expressions containing only literals." Since the set of objects that can be safely encoded with repr and eval isn't much wider than the set of objects that can be encoded as literals, this can be a solution to the problem. But it doesn't solve most of the other problems.
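
Still, here's a minimal sketch of swapping literal_eval into the load function from the top of this post:
    import ast

    def load(path):
        with open(path) as f:
            # literal_eval only accepts literals (strings, numbers, tuples, lists,
            # dicts, sets, booleans, None), so the os.system line above raises
            # ValueError instead of wiping your disk.
            return [ast.literal_eval(line) for line in f]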

Robustness

Using repr/eval often leads to bugs that only appear in certain cases you may not have thought to test for, and that are very hard to track down when they do appear.

For example, if you accidentally write an f without quotes when you meant to write "f", that _might_ raise an error at eval time... or, if you happen to have a variable named f lying around at eval time, it'll silently pick up whatever value is in that variable. (Given the name, chances are it's a file that you expected to be short-lived, which now ends up being kept open for hours, causing some bug half-way across your program...)
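
A contrived sketch of that silent failure:
    f = open('scratch.txt', 'w')   # some unrelated variable happens to be named f
    eval('f')                      # no error: you just "loaded" an open file object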

And the fact that repr looks human-readable (as long as the human is the developer) makes such un-caught mistakes even more likely once you start editing the files by hand.

Using literal_eval solves this problem, but not in the best way. You will usually get an error at read time, instead of successfully reading in garbage. But it would be a lot nicer to get an error at write time, and there's no way to do that with repr.

Portability

It would be nice if your data were usable by other programs. Python representations look like they should be pretty portable--they look like JavaScript, and Ruby, and most other "scripting" languages (and in some cases, even like valid initializers for C, C++, etc.).

But each of those languages is a little bit different. They don't backslash-escape the same characters in strings, and have different rules about what unnecessary/unknown backslashes mean, or what characters are allowed without escaping. They have different rules for quoting things with quotes in them. Almost all scripting languages agree on the basic two collection types (list/array and dict/hash/object), but most of them have at least one other native collection type that the others don't. For example, {1, 2, 3} is perfectly legal JavaScript, but it doesn't mean a set of three numbers.

Unicode

Python string literals can have non-ASCII characters in them... but only if you know what encoding they're in. Source files have a coding declaration to specify that. But data files don't (unless you decide you want to incorporate PEP 263 into your data file spec, and write the code to parse the coding declarations, and so on).

Fortunately, repr will unicode-escape any strings you give it. (Some badly-designed third-party types may not do that properly, so you'll have to fix them.)

But this means repr is not at all readable for non-English strings. A Chinese name turns into a string of \u1234 sequences.

What to use instead

The key is that you want to use a format designed for data storage or interchange, not one that just happens to often work.

JSON

JavaScript Object Notation is a subset of JavaScript literal syntax, and it looks almost identical to Python literal syntax (the main differences being true/false/null in place of True/False/None). But JSON has advantages over literal_eval.

It's a de facto standard language for data interchange. Python comes with a json module; every other language you're likely to use has something similar either built in or easily available; there are even command-line tools to use JSON from the shell. Good text editors understand it.

There's a good _reason_ it's a de facto standard: It's a good balance between easy for machines to generate, and parse, and validate, and easy for humans to read and edit. It's so simple that a JSON generator can add easy-to-twiddle knobs to let you trade off between compactness and pretty-printing, etc. (and yes, Python's stdlib module has them).

Finally, it's based on UTF-8 instead of ASCII (or Latin-1 or "whatever"), so it doesn't have to escape any characters except a few special invisible ones; a Chinese name will look like a string of Chinese characters (in a UTF-8-compatible editor or viewer, at least).
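
Here's a sketch of the same save/load pair on top of the json module (assuming Python 3; ensure_ascii=False is what keeps non-ASCII text readable in the file):
    import json

    def save(path, *things):
        with open(path, 'w', encoding='utf-8') as f:
            for thing in things:
                f.write(json.dumps(thing, ensure_ascii=False) + '\n')

    def load(path):
        with open(path, encoding='utf-8') as f:
            return [json.loads(line) for line in f]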

pickle

The pickle module provides a Python-specific serialization framework that's incredibly powerful and flexible.

Most types can be serialized without you even having to think about it. If you need to customize the way a class is pickled, you can, but you usually don't have to. It's robust. Anything that can't be pickled will give you an error at save time, rather than at load time. It's generally fast and compact.

Pickle even understands references. For example, if you have a list with 20 references to the same list, and you dump it out and restore it with repr/eval, or JSON, you're going to get back a list of 20 separate copies that are equal, but not identical; with pickle, you get back what you started with. This also means that pickle will only use about 1/20th as much storage as repr or JSON for that list. And it means pickle can dump things with circular references.
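
A quick sketch of that reference-preserving behavior:
    import json, pickle

    inner = [1, 2, 3]
    outer = [inner] * 20   # 20 references to the same list

    p = pickle.loads(pickle.dumps(outer))
    p[0] is p[1]           # True: still one shared list

    j = json.loads(json.dumps(outer))
    j[0] is j[1]           # False: 20 separate, merely-equal copies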

But of course it's not a silver bullet.

Pickle is not meant to be readable, and in fact the default format isn't even text-based.

Pickle does avoid the kind of accidental problems that make eval unsafe, but it's no more secure against malicious data.

Pickle is not only Python-specific, but your-program-specific. In order to load an instance of a custom class, pickle has to be able to find the same class in the same module that it used at save time.

Between JSON and pickle

Sometimes you need to store data types that JSON can't handle, but you don't want all the flexibility (and insecurity and non-human-readability) of pickle.

The json module itself can be extended to tell it how to encode your types; simple types can often be serialized as just the class name and the __dict__; complex types can mirror the pickling API. The third-party jsonpickle library can do a lot of this work for you.
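
For example, here's a sketch of encoding a hypothetical Point class as its class name plus its __dict__, using the json module's default and object_hook hooks:
    import json

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

    def encode(obj):
        # Called by json.dumps for anything it doesn't know how to serialize.
        return {'__class__': type(obj).__name__, '__dict__': obj.__dict__}

    def decode(d):
        # Called by json.loads for every decoded JSON object.
        if d.get('__class__') == 'Point':
            obj = Point.__new__(Point)
            obj.__dict__.update(d['__dict__'])
            return obj
        return d

    s = json.dumps(Point(1, 2), default=encode)
    p = json.loads(s, object_hook=decode)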

YAML is a superset of JSON that adds a number of basic types (like datetime), and an extensibility model, plus an optional human-friendly (indentation-based) alternative syntax. You can restrict yourself to the safe subset of YAML (no extension types), or you can use it to build something nearly as powerful as pickle (simple types encoded as basically the class name plus the __dict__, complex types using a custom encoding).
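
For example, with the third-party PyYAML library, sticking to the safe subset looks like this sketch:
    import datetime
    import yaml   # PyYAML

    doc = yaml.safe_dump({'name': 'my task',
                          'when': datetime.datetime(2015, 1, 1)})
    data = yaml.safe_load(doc)   # the datetime round-trips; arbitrary objects won't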

There are also countless other serialization formats out there, some of which fall within this range. Many of them are better when you have a more fixed, static shape for your objects. Some of them, you'll be forced to use because you're talking to some other program that only talks ASN.1 or plist or whatever. Wikipedia has a comparison table, with links.

Beyond JSON and pickle

Generally, you don't really need to serialize your objects; you need to serialize your data in such a way that you can re-create the objects as needed.

For example, let's say you have a set of Polynomial objects, where each Polynomial has a NumPy array of coefficients. While you could pickle that set, there's a much simpler idea: Just store a list instead of a set, and each element as a list of up to N coefficients instead of an object. Now you've got something you can easily store as JSON, or even just CSV.
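
A sketch of that transformation (Polynomial here is a hypothetical class with a coeffs array and a constructor that accepts a sequence of coefficients):
    import json

    def save_polys(path, polys):
        # Reduce each Polynomial to a plain list of floats.
        data = [list(map(float, p.coeffs)) for p in polys]
        with open(path, 'w') as f:
            json.dump(data, f)

    def load_polys(path):
        with open(path) as f:
            return [Polynomial(coeffs) for coeffs in json.load(f)]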

Sometimes, you can reduce everything to a single list or dict of strings, or of fixed records (each a list of N strings), or to things that you already know how to serialize to one of those structures. A list of strings without newlines can be serialized as just a plain text file, with each string on its own line. If there are newlines, you can escape them, or you can use something like netstrings. Fixed records are perfect for CSV. If you have a dict instead of a list, use dbm. For many applications, simple tools like that are all you need.
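
For instance, sketches of the plain-text and dbm cases (names is assumed to be a list of newline-free strings; dbm here is Python 3's dbm module):
    # A list of newline-free strings: one string per line.
    with open('names.txt', 'w') as f:
        f.writelines(name + '\n' for name in names)

    # A dict of strings: dbm gives you a persistent string-to-string mapping.
    import dbm
    with dbm.open('cache', 'c') as db:
        db['spam'] = 'eggs'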

Often, using one of those trivial formats at the top level, and JSON or pickle for relatively small but flexible objects underneath, is a good solution. The shelve module can automatically pickle simple objects into a dbm for you, but you can build similar things yourself relatively easily.
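
A quick sketch of shelve, which is basically pickle-into-dbm:
    import shelve

    with shelve.open('tasks') as db:   # context manager support needs Python 3.4+
        db['my task'] = {'priority': 3, 'when': 'tomorrow'}

    with shelve.open('tasks') as db:
        print(db['my task'])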

When your data turn out to have more complex relationships, you may find that you're better off thinking through your data model, separate from "just a bunch of objects". If you have tons of simple entities with complex relations, and need to search in terms of any of those relations, a relational database like sqlite3 is perfect. On the other hand, if your data are best represented as non-trivial documents with simple relations, and you need to search based on document structure, CouchDB might make more sense. Sometimes an object-relational mapper like SQLAlchemy can help you connect your live object model to your data model. And so on.

But whatever you have, once you've extracted the data model, it'll be a lot easier to decide how to serialize it.

It's been more than a decade since Typical Programmer Greg Jorgensen taught the world about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed. Languages in the Abject-Oriented space have been borrowing ideas from another paradigm entirely—and then everyone realized that languages like Python, Ruby, and JavaScript had been doing it for years and just hadn't noticed (because these languages do not require you to declare what you're doing, or even to know what you're doing). Meanwhile, new hybrid languages borrow freely from both paradigms.

This other paradigm—which is actually older, but was largely constrained to university basements until recent years—is called Functional Addiction.

A Functional Addict is someone who regularly gets higher-order—sometimes they may even exhibit dependent types—but still manages to retain a job.

Retaining a job is of course the goal of all programming. This is why some of these new hybrid languages, like Rust, check all borrowing, from both paradigms, so extensively that you can make regular progress for months without ever successfully compiling your code, and your managers will appreciate that progress. After all, once it does compile, it will definitely work.

Closures

It's long been known that Closures are dual to Encapsulation.

As Abject-Oriented Programming explained, Encapsulation involves making all of your variables public, and ideally global, to let the rest of the code decide what should and shouldn't be private.

Closures, by contrast, are a way of referring to variables from outer scopes. And there is no scope more outer than global.

Immutability

One of the reasons Functional Addiction has become popular in recent years is that to truly take advantage of multi-core systems, you need immutable data, sometimes also called persistent data.

Instead of mutating a function to fix a bug, you should always make a new copy of that function. For example:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

When you discover that you actually wanted fields 2 and 3 rather than 1 and 2, it might be tempting to mutate the state of this function. But doing so is dangerous. The right answer is to make a copy, and then try to remember to use the copy instead of the original:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

function getCustName2(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[2] + ' ' + custRec[3];
    return fullname;
}

This means anyone still using the original function can continue to reference the old code, but as soon as it's no longer needed, it will be automatically garbage collected. (Automatic garbage collection isn't free, but it can be outsourced cheaply.)

Higher-Order Functions

In traditional Abject-Oriented Programming, you are required to give each function a name. But over time, the name of the function may drift away from what it actually does, making it as misleading as comments. Experience has shown that people will only keep one copy of their information up to date, and the CHANGES.TXT file is the right place for that.

Higher-Order Functions can solve this problem:

function []Functions = [
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    },
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[2] + ' ' + custRec[3];
        return fullname;
    },
]

Now you can refer to these functions by order, so there's no need for names.

Parametric Polymorphism

Traditional languages offer Abject-Oriented Polymorphism and Ad-Hoc Polymorphism (also known as Overloading), but better languages also offer Parametric Polymorphism.

The key to Parametric Polymorphism is that the type of the output can be determined from the type of the inputs via Algebra. For example:

function getCustData(custId, x)
{
    if (x == int(x)) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return int(fullname);
    } else if (x.real == 0) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return double(fullname);
    } else {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return complex(fullname);
    }
}

Notice that we've called the variable x. This is how you know you're using Algebraic Data Types. The names y, z, and sometimes w are also Algebraic.

Type Inference

Languages that enable Functional Addiction often feature Type Inference. This means that the compiler can infer your typing without you having to be explicit:


function getCustName(custID)
{
    // WARNING: Make sure the DB is locked here or
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

We didn't specify what will happen if the DB is not locked. And that's fine, because the compiler will figure it out and insert code that corrupts the data, without us needing to tell it to!

By contrast, most Abject-Oriented languages are either nominally typed—meaning that you give names to all of your types instead of meanings—or dynamically typed—meaning that your variables are all unique individuals that can accomplish anything if they try.

Memoization

Memoization means caching the results of a function call:

function getCustName(custID)
{
    if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Non-Strictness

Non-Strictness is often confused with Laziness, but in fact Laziness is just one kind of Non-Strictness. Here's an example that compares two different forms of Non-Strictness:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRate(custId) {}

function callByNameTaxRate(custId)
{
    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
}

Both are Non-Strict, but the second one forces the compiler to actually compile the function just so we can Call it By Name. This causes code bloat. The Lazy version will be smaller and faster. Plus, Lazy programming allows us to create infinite recursion without making the program hang:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRateRecursive(custId) { lazyTaxRateRecursive(custId); }

Laziness is often combined with Memoization:

function getCustName(custID)
{
    // if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Outside the world of Functional Addicts, this same technique is often called Test-Driven Development. If enough tests can be embedded in the code to achieve 100% coverage, or at least a decent amount, your code is guaranteed to be safe. But because the tests are not compiled and executed in the normal run, or indeed ever, they don't affect performance or correctness.

Conclusion

Many people claim that the days of Abject-Oriented Programming are over. But this is pure hype. Functional Addiction and Abject Orientation are not actually at odds with each other, but instead complement each other.