How to detect a valid integer literal

There are hundreds of questions on StackOverflow that all ask variations of the same thing. Paraphrasing:

lst is a list of strings and numbers. I want to convert the numbers to int but leave the strings alone. How do I do that?

This immediately gets a half-dozen answers that all do some equivalent of:

    lst = [int(x) if x.isdigit() else x for x in lst]

This has a number of problems, but they all come down to the same two:

"Numbers" is vague. You can assume it means only integers based on "I want to convert the numbers to int", but does it mean Python integer literals, things that can be converted with the int function with no base, or things that can be converted with the int function with base=0, or something different entirely, like JSON numbers or Excel numbers or the kinds of input you expect your 3rd-grade class to enter?
Whichever meaning you actually wanted, isdigit() does not test for that.

The right answer depends on what "numbers" actually means.

If it means "things that can be converted with the int function with no base", the right answer—as usual in Python—is to just try to convert with the int function:

    def tryint(x):
        try:
            return int(x)
        except ValueError:
            return x
    lst = [tryint(x) for x in lst]

Of course if you mean something different, that's not the right answer. Even "valid integer literals in Python source" isn't the same rule. (For example, 099 is an invalid literal in both 2.x and 3.x, and 012 is valid in 2.x but probably not what you wanted, but int('099') and int('0123') gives 99 and 123.) That's why you have to actually decide on a rule that you want to apply; otherwise, you're just assuming that all reasonable rules are the same, which is a patently false assumption. If your rule isn't actually "things that can be converted with the int function with no base, then the isdigit check is wrong, and the int(x) conversion is also wrong.

What specifically is wrong with `isdigit`?

I'm going to assume that you already thought through what you meant by "number", and the decision was "things that can be converted to int with the int function with no base", and you're just looking for how to LBYL that so you don't have to use a try.

Negative numbers

Obviously, -234 is an integer, but just as obviously, "-234".isdigit() is clearly going to be false, because - is not a digit.

Sometimes people try to solve this by writing all(c.isdigit() or c == '-' for c in x). But, besides being a whole lot slower and more complicated, that's even more wrong. It means that 123-456 now looks like an integer, so you're going to pass it to int without a try, and you're going to get a ValueError from your comprehension.

Of course you can solve that problem with (x[0].isdigit() or x[0] == '-') and x[1:].isdigit(), and now maybe every test you've thought of passes. But it will give you "1" instead of converting that to an integer, and it will raise an IndexError for an empty string.

One of these might be correct for handling negative integer numerals:

    x.isdigit() or x.startswith('-') and x[1:].isdigit()
    re.match(r'-?\d+', x)?

But is it obvious that either one is correct? The whole reason you wanted to use isdigit is to have something simple, obviously right, and fast, and you already no longer have that. And we're not even nearly done yet.

Positive numbers

+234 is an integer too. And int will treat it as one. But the code above won't. So now, whatever you did for -, you have to do the same thing for +. WHich is pretty ugly if you're using the non-regex solution:

    lst = [int(x) if x.isdigit() or x.startswith(('-', '+')) and x[1:].isdigit() else x
           for x in lst]

Whitespace

The int function allows the numeral to be surrounded by whitespace. But isdigit does not. So, now you have to add .strip() before the isdigit() call. Except we don't just have one isdigit call; to fix the other problems we've had two go with two isdigit calls and a startswith, and surely you don't want to call strip three times. Or we've switched to a regex. Either way, now we've got:

    lst = [int(x) if x.isdigit() or x.startswith(('-', '+')) and x[1:].isdigit() else x
           for x in (x.strip() for x in lst)]
    lst = [int(x) if re.match('\s*[+-]?\d+\s*', x) else x for x in lst]

What's a digit?

The isdigit function tests for characters that are in the Number, Decimal Digit category. In Python 3.x, that's the same rule the int function uses.

But 2.x doesn't use the same rule. If you're using a unicode, it's not entirely clear what int accepts, but it's not all Unicode digits, at least not in all Python 2.x implementations and versions; if you're using a str encoded in your default encoding, int still accepts the same set of digits, but isdigit only checks ASCII digits.

Plus, if you're using either 2.x or 3.0-3.2, and you've got a "narrow" Python build (like the default builds for Windows from python.org), isdigit is actually checking each UTF-16 code point, not each character, so for "\N{MATHEMATICAL SANS-SERIF DIGIT ZERO}", isdigit will return False, but int should accept it.

So, if your user types in an Arabic number like ١٠٤, the isdigit check may mean you end up with "١٠٤", or it may mean you end up with the int 104, or it may be one on some platforms and the other on other platforms.

I can't even think of any way to LBYL around this problem except to just say that your code requires 3.3+.

Have I thought of everything?

I don't know. Do you know? If you don't how are you going to write code that handles the things we haven't thought of.

Other rules might be even more complicated than the int with no base rule. For different use cases, users might reasonably expect 0x1234 or 1e10 or 1.0 or 1+0j or who knows what else to count as integers. The way to test for whatever it is you want to test for is still simple: write a conversion function for that, and see if it fails. Trying to LBYL it means that you have to write most of the same logic twice. Or, if you're relying on int or literal_eval or whatever to provide some or all of that logic, you have to duplicate its logic.

It's been more than a decade since Typical Programmer Greg Jorgensen taught the word about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed.

I haven't posted anything new in a couple years (partly because I attempted to move to a different blogging platform where I could write everything in markdown instead of HTML but got frustrated—which I may attempt again), but I've had a few private comments and emails on some of the old posts, so I

Looking before you leap

Python is a duck-typed language, and one where you usually trust EAFP ("Easier to Ask Forgiveness than Permission") over LBYL ("Look Before You Leap").

Background

Currently, CPython’s internal bytecode format stores instructions with no args as 1 byte, instructions with small args as 3 bytes, and instructions with large args as 6 bytes (actually, a 3-byte EXTENDED_ARG followed by a 3-byte real instruction).

If you want to skip all the tl;dr and cut to the chase, jump to Concrete Proposal.

Many people, when they first discover the heapq module, have two questions:

Why does it define a bunch of functions instead of a container type? Why don't those functions take a key or reverse parameter, like all the other sorting-related stuff in Python? Why not a type?

At the abstract level, it'

Currently, in CPython, if you want to process bytecode, either in C or in Python, it’s pretty complicated.

The built-in peephole optimizer has to do extra work fixing up jump targets and the line-number table, and just punts on many cases because they’re too hard to deal with.

One common "advanced question" on places like StackOverflow and python-list is "how do I dynamically create a function/method/class/whatever"? The standard answer is: first, some caveats about why you probably don't want to do that, and then an explanation of the various ways to do it when you reall

A few years ago, Cesare di Mauro created a project called WPython, a fork of CPython 2.6.4 that “brings many optimizations and refactorings”. The starting point of the project was replacing the bytecode with “wordcode”. However, there were a number of other changes on top of it.

Many languages have a for-each loop.

When the first betas for Swift came out, I was impressed by their collection design. In particular, the way it allows them to write map-style functions that are lazy (like Python 3), but still as full-featured as possible.

In a previous post, I explained in detail how lookup works in Python.

The documentation does a great job explaining how things normally get looked up, and how you can hook them.

But to understand how the hooking works, you need to go under the covers to see how that normal lookup actually happens.

When I say "Python" below, I'm mostly talking about CPython 3.5.

In Python (I'm mostly talking about CPython here, but other implementations do similar things), when you write the following:

def spam(x): return x+1 spam(3) What happens?

Really, it's not that complicated, but there's no documentation anywhere that puts it all together.

I've seen a number of people ask why, if you can have arbitrary-sized integers that do everything exactly, you can't do the same thing with floats, avoiding all the rounding problems that they keep running into.

In a recent thread on python-ideas, Stephan Sahm suggested, in effect, changing the method resolution order (MRO) from C3-linearization to a simple depth-first search a la old-school Python or C++.

Note: This post doesn't talk about Python that much, except as a point of comparison for JavaScript.

Most object-oriented languages out there, including Python, are class-based. But JavaScript is instead prototype-based.

About a year and a half ago, I wrote a blog post on the idea of adding pattern matching to Python.

I finally got around to playing with Scala semi-seriously, and I realized that they pretty much solved the same problem, in a pretty similar way to my straw man proposal, and it works great.

About a year ago, Jules Jacobs wrote a series (part 1 and part 2, with part 3 still forthcoming) on the best collections library design.

In three separate discussions on the Python mailing lists this month, people have objected to some design because it leaks something into the enclosing scope. But "leaks into the enclosing scope" isn't a real problem.

There's a lot of confusion about what the various kinds of things you can iterate over in Python. I'll attempt to collect definitions for all of the relevant terms, and provide examples, here, so I don't have to go over the same discussions in the same circles every time.

Python has a whole hierarchy of collection-related abstract types, described in the collections.abc module in the standard library. But there are two key, prototypical kinds. Iterators are one-shot, used for a single forward traversal, and usually lazy, generating each value on the fly as requested.

There are a lot of novice questions on optimizing NumPy code on StackOverflow, that make a lot of the same mistakes. I'll try to cover them all here.

What does NumPy speed up?

Let's look at some Python code that does some computation element-wise on two lists of lists.

When asyncio was first proposed, many people (not so much on python-ideas, where Guido first suggested it, but on external blogs) had the same reaction: Doing the core reactor loop in Python is going to be way too slow. Something based on libev, like gevent, is inherently going to be much faster.

Let's say you have a good idea for a change to Python.

There are hundreds of questions on StackOverflow that all ask variations of the same thing. Paraphrasing:

lst is a list of strings and numbers. I want to convert the numbers to int but leave the strings alone.

In Haskell, you can section infix operators. This is a simple form of partial evaluation. Using Python syntax, the following are equivalent:

(2*) lambda x: 2*x (*2) lambda x: x*2 (*) lambda x, y: x*y So, can we do the same in Python?

Grammar

The first form, (2*), is unambiguous.

Many people—especially people coming from Java—think that using try/except is "inelegant", or "inefficient". Or, slightly less meaninglessly, they think that "exceptions should only be for errors, not for normal flow control".

These people are not going to be happy with Python.

If you look at Python tutorials and sample code, proposals for new language features, blogs like this one, talks at PyCon, etc., you'll see spam, eggs, gouda, etc. all over the place.

Most control structures in most most programming languages, including Python, are subordinating conjunctions, like "if", "while", and "except", although "with" is a preposition, and "for" is a preposition used strangely (although not as strangely as in C…).

There are two ways that some Python programmers overuse lambda. Doing this almost always mkes your code less readable, and for no corresponding benefit.

Some languages have a very strong idiomatic style—in Python, Haskell, or Swift, the same code by two different programmers is likely to look a lot more similar than in Perl, Lisp, or C++.

There's an advantage to this—and, in particular, an advantage to you sticking to those idioms.

Python doesn't have a way to clone generators.

At least for a lot of simple cases, however, it's pretty obvious what cloning them should do, and being able to do so would be handy. But for a lot of other cases, it's not at all obvious.

Every time someone has a good idea, they believe it should be in the stdlib. After all, it's useful to many people, and what's the harm? But of course there is a harm.

This confuses every Python developer the first time they see it—even if they're pretty experienced by the time they see it:

>>> t = ([], []) >>> t[0] += [1] --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <stdin> in <module>()

How to detect a valid integer literal

What specifically is wrong with `isdigit`?

Negative numbers

Positive numbers

Whitespace

What's a digit?

Have I thought of everything?

View comments

Stupid Python Ideas

How to detect a valid integer literal

What specifically is wrong with isdigit?

Negative numbers

Positive numbers

Whitespace

What's a digit?

Have I thought of everything?

View comments

Stupid Python Ideas

What specifically is wrong with `isdigit`?