def save(path, *things):
    with open(path, 'w') as f:
        for thing in things:
            f.write(repr(thing) + '\n')

def load(path):
    with open(path) as f:
        return [eval(line) for line in f]
If you get lucky, you start running into problems, because some objects don't have a round-trippable repr. If you don't get lucky, you run into the _real_ problems later on.
Notice that the same basic problems come up when designing network protocols as when designing file formats, with most of the same solutions.
The obvious problems
By default, a custom class--like many third-party and even stdlib types--will just look like <spam.Spam at 0x12345678>, which you can't eval. And these are the most fun kinds of bugs--the save succeeds, with no indication that anything went wrong, until you try to load the data later and find that all your useful information is missing.

You can add a __repr__ method to your own types (which can be tricky; I'll get to that later), and maybe even subclass or patch third-party types, but eventually you run into something that just doesn't have an obvious round-trippable representation. For example, what string could you eval to re-generate an ElementTree node?
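As a sketch of what "round-trippable" means for your own types (the Spam class here is hypothetical):

```python
# A round-trippable __repr__: eval(repr(x)) rebuilds an equal object.
class Spam:
    def __init__(self, count):
        self.count = count
    def __repr__(self):
        # Emit an expression that reconstructs the object
        return 'Spam({!r})'.format(self.count)

s = Spam(3)
print(repr(s))              # Spam(3)
print(eval(repr(s)).count)  # 3
```

Without that __repr__, repr(s) would be the default <__main__.Spam object at 0x...>, which saves silently and only blows up at load time.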
Besides that, there are types that are often round-trippable, but aren't when they're too big (like NumPy arrays), or in some other cases you aren't likely to run into until you're deep into development (e.g., lists that contain themselves).
The real problems
Safety and security
Let's say you've written a scheduler program. I look at the config file, and there's a line like this (with the quotes):

"my task"

What do you think will happen if I change it to this?
__import__("os").system("rm -rf /")

The eval docs explicitly point this out: "See ast.literal_eval() for a function that can safely evaluate strings with expressions containing only literals." Since the set of objects that can be safely encoded with repr and eval isn't much wider than the set that can be encoded as literals, this can be a solution to the problem. But it doesn't solve most of the other problems.
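A quick sketch of the difference: literal_eval parses literals only, so injected code never runs.

```python
import ast

# A literal string is fine...
print(ast.literal_eval('"my task"'))   # my task

# ...but anything that would execute code is rejected at parse time
try:
    ast.literal_eval('__import__("os").system("echo pwned")')
except ValueError:
    print('rejected')
```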
Robustness
Using repr/eval often leads to bugs that only appear in certain cases you may not have thought to test for, and are very hard to track down when they do.

For example, if you accidentally write an f without quotes when you meant to write "f", that _might_ get an error at eval time... or, if you happen to have a variable named f lying around at eval time, it'll get whatever value is in that variable. (Given the name, chances are it's a file that you expected to be short-lived, which now ends up being kept open for hours, causing some bug half-way across your program...)
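Here's what those missing quotes actually do; whether you get an error or garbage depends entirely on what happens to be in scope:

```python
# The config line was supposed to be the string "f", but the quotes
# were forgotten. With nothing named f in scope, you at least get an error:
try:
    eval('f')
except NameError:
    print('NameError at load time')

# But once something named f exists, eval silently returns it instead:
f = ['some', 'unrelated', 'object']
print(eval('f'))   # ['some', 'unrelated', 'object']
```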
And the fact that repr looks human-readable (as long as the human is the developer) makes such un-caught mistakes even more likely once you start editing the files by hand.
Using literal_eval solves this problem, but not in the best way. You will usually get an error at read time, instead of successfully reading in garbage. But it would be a lot nicer to get an error at write time, and there's no way to do that with repr.
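The asymmetry looks like this in practice (Widget is a stand-in for any type without a round-trippable repr):

```python
import ast

class Widget:
    pass

line = repr(Widget())        # the "save" succeeds silently...
try:
    ast.literal_eval(line)   # ...and only the later "load" fails
except (ValueError, SyntaxError):
    print('error at read time, not write time')
```

With repr there's no hook to make that first step fail; the error can only surface when you read the file back.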
Portability
It would be nice if your data were usable by other programs. Python representations look like they should be pretty portable--they look like JavaScript, and Ruby, and most other "scripting" languages (and in some cases, even like valid initializers for C, C++, etc.).

But each of those languages is a little bit different. They don't backslash-escape the same characters in strings, and they have different rules about what unnecessary/unknown backslashes mean, or what characters are allowed without escaping. They have different rules for quoting things with quotes in them. Almost all scripting languages agree on the basic two collection types (list/array and dict/hash/object), but most of them have at least one other native collection type that the others don't. For example, {1, 2, 3} is perfectly legal JavaScript, but it doesn't mean a set of three numbers.
Unicode
Python string literals can have non-ASCII characters in them... but only if you know what encoding they're in. Source files have a coding declaration to specify that. But data files don't (unless you decide you want to incorporate PEP 263 into your data file spec, and write the code to parse the coding declarations, and so on).

Fortunately, repr will unicode-escape any strings you give it (that's Python 2 behavior; in Python 3, repr leaves printable non-ASCII characters alone, and ascii() does the escaping). Some badly-designed third-party types may not do that properly, so you'll have to fix them.
But this means repr is not at all readable for non-English strings. A Chinese name turns into a string of \u1234 sequences.
What to use instead
The key is that you want to use a format designed for data storage or interchange, not one that just happens to often work.

JSON
JavaScript Object Notation is a subset of JavaScript literal syntax, which also happens to be a subset of Python literal syntax. But JSON has advantages over literal_eval.

It's a de facto standard language for data interchange. Python comes with a json module; every other language you're likely to use has something similar either built in or easily available; there are even command-line tools to use JSON from the shell. Good text editors understand it.
There's a good _reason_ it's a de facto standard: It's a good balance between easy for machines to generate, and parse, and validate, and easy for humans to read and edit. It's so simple that a JSON generator can add easy-to-twiddle knobs to let you trade off between compactness and pretty-printing, etc. (and yes, Python's stdlib module has them).
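Those knobs in the stdlib json module look like this:

```python
import json

data = {'title': 'my task', 'tags': ['home', 'urgent']}

# Most compact: no whitespace at all
print(json.dumps(data, separators=(',', ':')))

# Pretty-printed for human readers and diff-friendly files
print(json.dumps(data, indent=2, sort_keys=True))
```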
Finally, it's based on UTF-8 instead of ASCII (or Latin-1 or "whatever"), so it doesn't have to escape any characters except a few special invisible ones; a Chinese name will look like a string of Chinese characters (in a UTF-8-compatible editor or viewer, at least).
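One caveat worth knowing: Python's json module defaults to ASCII-escaping for safety, so you have to ask for the readable form:

```python
import json

name = '\u5f20\u4f1f'  # a Chinese name

print(json.dumps(name, ensure_ascii=False))  # stays readable UTF-8
print(json.dumps(name))                      # default still escapes to \uXXXX
```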
pickle
The pickle module provides a Python-specific serialization framework that's incredibly powerful and flexible.

Most types can be serialized without you even having to think about it. If you need to customize the way a class is pickled, you can, but you usually don't have to. It's robust. Anything that can't be pickled will give you an error at save time, rather than at load time. It's generally fast and compact.
Pickle even understands references. For example, if you have a list with 20 references to the same list, and you dump it out and restore it with repr/eval or JSON, you're going to get back a list of 20 separate copies that are equal, but not identical; with pickle, you get back what you started with. This also means that pickle will only use 1/20th as much storage as repr or JSON for that list. And it means pickle can dump things with circular references.
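You can see the difference directly:

```python
import json, pickle

inner = [1, 2]
outer = [inner] * 3   # three references to the same list

via_json = json.loads(json.dumps(outer))
via_pickle = pickle.loads(pickle.dumps(outer))

print(via_json[0] is via_json[1])      # False: three separate copies
print(via_pickle[0] is via_pickle[1])  # True: shared reference preserved

via_pickle[0].append(3)
print(via_pickle[1])                   # [1, 2, 3] -- still the same list
```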
But of course it's not a silver bullet.
Pickle is not meant to be readable, and in fact the default format isn't even text-based.
Pickle does avoid the kind of accidental problems that make eval unsafe, but it's no more secure against malicious data.
Pickle is not only Python-specific, but your-program-specific. In order to load an instance of a custom class, pickle has to be able to find the same class in the same module that it used at save time.
Between JSON and pickle
Sometimes you need to store data types that JSON can't handle, but you don't want all the flexibility (and insecurity and non-human-readability) of pickle.

The json module itself can be extended to tell it how to encode your types; simple types can often be serialized as just the class name and the __dict__; complex types can mirror the pickling API. jsonpickle can do a lot of this work for you.
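Here's a minimal sketch of that extension mechanism, using dates as the non-JSON type. The tagged-dict convention ("__date__") is an assumption of this example, not something the json module defines:

```python
import datetime
import json

def encode(obj):
    # Called by json.dumps for anything it can't serialize itself
    if isinstance(obj, datetime.date):
        return {'__date__': obj.isoformat()}
    raise TypeError('{} is not JSON serializable'.format(type(obj).__name__))

def decode(d):
    # Called by json.loads for every decoded dict
    if '__date__' in d:
        return datetime.date.fromisoformat(d['__date__'])
    return d

s = json.dumps({'due': datetime.date(2024, 1, 2)}, default=encode)
print(s)                                  # {"due": {"__date__": "2024-01-02"}}
print(json.loads(s, object_hook=decode))  # {'due': datetime.date(2024, 1, 2)}
```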
YAML is a superset of JSON that adds a number of basic types (like datetime), and an extensibility model, plus an optional human-friendly (indentation-based) alternative syntax. You can restrict yourself to the safe subset of YAML (no extension types), or you can use it to build something nearly as powerful as pickle (simple types encoded as basically the class name plus the __dict__, complex types using a custom encoding).
There are also countless other serialization formats out there, some of which fall within this range. Many of them are better when you have a more fixed, static shape for your objects. Some of them, you'll be forced to use because you're talking to some other program that only talks ASN.1 or plist or whatever. Wikipedia has a comparison table, with links.
Beyond JSON and pickle
Generally, you don't really need to serialize your objects; you need to serialize your data in such a way that you can re-create the objects as needed.

For example, let's say you have a set of Polynomial objects, where each Polynomial has a NumPy array of coefficients. While you could pickle that set, there's a much simpler idea: store a list instead of a set, and store each element as a list of up to N coefficients instead of an object. Now you've got something you can easily store as JSON, or even just CSV.
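A sketch of that reduction, with a hypothetical Polynomial class and plain lists standing in for NumPy arrays to keep it stdlib-only:

```python
import json

class Polynomial:
    def __init__(self, coeffs):
        self.coeffs = list(coeffs)

polys = {Polynomial([1, 0, 2]), Polynomial([3, 4])}

# Reduce to plain data: a list of coefficient lists, trivially JSON-able
text = json.dumps([p.coeffs for p in polys])

# Re-create the objects as needed on the way back in
restored = {Polynomial(c) for c in json.loads(text)}
print({tuple(p.coeffs) for p in restored})
```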
Sometimes, you can reduce everything to a single list or dict of strings, or of fixed records (each a list of N strings), or to things that you already know how to serialize to one of those structures. A list of strings without newlines can be serialized as just a plain text file, with each string on its own line. If there are newlines, you can escape them, or you can use something like netstrings. Fixed records are perfect for CSV. If you have a dict instead of a list, use dbm. For many applications, simple tools like that are all you need.
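Fixed records really are that simple with the stdlib csv module (in-memory buffers here stand in for real files):

```python
import csv
import io

# Each record is a list of N strings
records = [['alice', '3', 'home'], ['bob', '7', 'work']]

buf = io.StringIO()
csv.writer(buf).writerows(records)

back = list(csv.reader(io.StringIO(buf.getvalue())))
print(back == records)   # True
```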
Often, using one of those trivial formats at the top level, and JSON or pickle for relatively small but flexible objects underneath, is a good solution. The shelve module can automatically pickle simple objects into a dbm for you, but you can build similar things yourself relatively easily.
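That's exactly what shelve gives you out of the box: a dict-like store on disk whose values are pickled automatically (a temporary directory is used here just to keep the sketch self-contained):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'tasks')

with shelve.open(path) as db:
    db['task1'] = {'priority': 3, 'tags': ['home']}

with shelve.open(path) as db:
    loaded = db['task1']
print(loaded)   # {'priority': 3, 'tags': ['home']}
```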
When your data turn out to have more complex relationships, you may find out that you're better off thinking through your data model, separate from "just a bunch of objects". If you have tons of simple entities with complex relations, and need to search in terms of any of those relations, a relational database like sqlite3 is perfect. On the other hand, if your data are best represented as non-trivial documents with simple relations, and you need to search based on document structure, CouchDB might make more sense. Sometimes an object-relational mapper like SQLAlchemy can help you connect your live object model to your data model. And so on.
But whatever you have, once you've extracted the data model, it'll be a lot easier to decide how to serialize it.