How to detect a valid integer literal

There are hundreds of questions on StackOverflow that all ask variations of the same thing. Paraphrasing:

lst is a list of strings and numbers. I want to convert the numbers to int but leave the strings alone. How do I do that?

This immediately gets a half-dozen answers that all do some equivalent of:

    lst = [int(x) if x.isdigit() else x for x in lst]

This has a number of problems, but they all come down to the same two:

"Numbers" is vague. You can assume it means only integers based on "I want to convert the numbers to int", but does it mean Python integer literals, things that can be converted with the int function with no base, or things that can be converted with the int function with base=0, or something different entirely, like JSON numbers or Excel numbers or the kinds of input you expect your 3rd-grade class to enter?
Whichever meaning you actually wanted, isdigit() does not test for that.

The right answer depends on what "numbers" actually means.

If it means "things that can be converted with the int function with no base", the right answer—as usual in Python—is to just try to convert with the int function:

    def tryint(x):
        try:
            return int(x)
        except ValueError:
            return x
    lst = [tryint(x) for x in lst]
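
As a quick illustration, here is a minimal sketch of that function run on a made-up list. The sample values are my own, not taken from any particular question, and it assumes every element is a string (a None in the list would raise TypeError, which this does not catch).

```python
def tryint(x):
    """Return int(x) if int() accepts x; otherwise return x unchanged."""
    try:
        return int(x)
    except ValueError:
        return x

# Hypothetical sample input, including cases a bare isdigit() check mishandles.
lst = ["12", "-7", " +42 ", "spam", "1.5", ""]
converted = [tryint(x) for x in lst]
print(converted)  # [12, -7, 42, 'spam', '1.5', '']
```

The signed and whitespace-padded numerals are converted, and everything int rejects comes back untouched.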

Of course if you mean something different, that's not the right answer. Even "valid integer literals in Python source" isn't the same rule. (For example, 099 is an invalid literal in both 2.x and 3.x, and 012 is valid in 2.x but probably not what you wanted, because it's octal and means 10; but int('099') and int('012') give 99 and 12.) That's why you have to actually decide on a rule that you want to apply; otherwise, you're just assuming that all reasonable rules are the same, which is a patently false assumption. If your rule isn't actually "things that can be converted with the int function with no base", then the isdigit check is wrong, and the int(x) conversion is also wrong.
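
To make the difference between two of these rules concrete, here is a small sketch that checks a few strings against int with no base and int with base=0. The helper name accepts and the sample strings are illustrative assumptions, and the behavior described is for Python 3.

```python
def accepts(convert, s):
    """Return True if convert(s) succeeds, False if it raises ValueError."""
    try:
        convert(s)
        return True
    except ValueError:
        return False

def int_base0(s):
    # base=0 asks int to interpret prefixes like 0x, 0o, 0b, as in source code.
    return int(s, 0)

for s in ["12", "099", "0x1F"]:
    print(repr(s), "no base:", accepts(int, s), "base=0:", accepts(int_base0, s))
```

With no base, "099" is accepted and "0x1F" is rejected; with base=0 it is the other way around. Neither rule is a superset of the other, so you really do have to pick one.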

What specifically is wrong with `isdigit`?

I'm going to assume that you already thought through what you meant by "number", and the decision was "things that can be converted to int with the int function with no base", and you're just looking for how to LBYL that so you don't have to use a try.

Negative numbers

Obviously, -234 is an integer, but just as obviously, "-234".isdigit() is clearly going to be false, because - is not a digit.

Sometimes people try to solve this by writing all(c.isdigit() or c == '-' for c in x). But, besides being a whole lot slower and more complicated, that's even more wrong. It means that 123-456 now looks like an integer, so you're going to pass it to int without a try, and you're going to get a ValueError from your comprehension.

Of course you can solve that problem with (x[0].isdigit() or x[0] == '-') and x[1:].isdigit(), and now maybe every test you've thought of passes. But it will give you "1" instead of converting that to an integer, and it will raise an IndexError for an empty string.
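
Both failed attempts are easy to reproduce. This is a minimal sketch; the function names are hypothetical labels I've added for the two expressions above.

```python
def looks_like_int_v1(x):
    # First attempt: digits and '-' allowed anywhere in the string.
    return all(c.isdigit() or c == '-' for c in x)

def looks_like_int_v2(x):
    # Second attempt: a digit or '-' first, then at least one more digit.
    return (x[0].isdigit() or x[0] == '-') and x[1:].isdigit()

print(looks_like_int_v1("123-456"))  # True, but int("123-456") raises ValueError
print(looks_like_int_v2("1"))        # False, although int("1") works fine
try:
    looks_like_int_v2("")
except IndexError as e:
    print("IndexError:", e)
```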

One of these might be correct for handling negative integer numerals:

    x.isdigit() or x.startswith('-') and x[1:].isdigit()
    re.match(r'-?\d+\Z', x)

But is it obvious that either one is correct? The whole reason you wanted to use isdigit is to have something simple, obviously right, and fast, and you already no longer have that. And we're not even nearly done yet.

Positive numbers

+234 is an integer too. And int will treat it as one. But the code above won't. So now, whatever you did for -, you have to do the same thing for +. Which is pretty ugly if you're using the non-regex solution:

    lst = [int(x) if x.isdigit() or x.startswith(('-', '+')) and x[1:].isdigit() else x
           for x in lst]

Whitespace

The int function allows the numeral to be surrounded by whitespace. But isdigit does not. So, now you have to add .strip() before the isdigit() call. Except we don't just have one isdigit call; to fix the other problems we've had to go with two isdigit calls and a startswith, and surely you don't want to call strip three times. Or we've switched to a regex. Either way, now we've got:

    lst = [int(x) if x.isdigit() or x.startswith(('-', '+')) and x[1:].isdigit() else x
           for x in (x.strip() for x in lst)]
    lst = [int(x) if re.match(r'\s*[+-]?\d+\s*\Z', x) else x for x in lst]
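
One way to gain some confidence in a look-before-you-leap test is to compare it against int itself over a pile of awkward inputs. The sketch below does that for the isdigit-based expression; the function names and the sample list are my own assumptions, and it is written for Python 3.

```python
def lbyl_is_int(x):
    """The isdigit-based check from the text, applied after stripping whitespace."""
    x = x.strip()
    return x.isdigit() or (x.startswith(('-', '+')) and x[1:].isdigit())

def eafp_is_int(x):
    """Ground truth: does int(x) actually succeed?"""
    try:
        int(x)
        return True
    except ValueError:
        return False

samples = ["234", "-234", "+234", " 234\n", "", "-", "1",
           "123-456", "--1", "+-1", "1 2"]
mismatches = [s for s in samples if lbyl_is_int(s) != eafp_is_int(s)]
print(mismatches)  # [] for these particular samples
```

An empty list here only says the two agree on the inputs somebody thought to list. Notice also that eafp_is_int already contains the whole conversion, which is the point: the check is just the try with extra steps.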

What's a digit?

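The isdigit function tests for characters whose Unicode Numeric_Type is Digit or Decimal. In Python 3.x, the int function uses a slightly narrower rule: it accepts characters in the Number, Decimal Digit (Nd) category, which is what isdecimal tests for. So "²".isdigit() is true, but int("²") raises a ValueError.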

But 2.x doesn't use the same rule. If you're using a unicode, it's not entirely clear what int accepts, but it's not all Unicode digits, at least not in all Python 2.x implementations and versions; if you're using a str encoded in your default encoding, int still accepts the same set of digits, but isdigit only checks ASCII digits.

Plus, if you're using either 2.x or 3.0-3.2, and you've got a "narrow" Python build (like the default builds for Windows from python.org), isdigit is actually checking each UTF-16 code unit, not each character. So for "\N{MATHEMATICAL SANS-SERIF DIGIT ZERO}", which lies outside the BMP and is therefore stored as a surrogate pair, isdigit will return False, but int should accept it.

So, if your user types in an Arabic number like ١٠٤, the isdigit check may mean you end up with "١٠٤", or it may mean you end up with the int 104, or it may be one on some platforms and the other on other platforms.

I can't even think of any way to LBYL around this problem except to just say that your code requires 3.3+.
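
On Python 3.3+ the behavior is at least consistent, so it can be checked directly. This is a small sketch assuming Python 3; the sample characters and the int_accepts helper are my own choices. Note that superscript two counts as a digit for isdigit but is not a decimal digit, and int rejects it.

```python
def int_accepts(s):
    """Return True if int(s) succeeds."""
    try:
        int(s)
        return True
    except ValueError:
        return False

samples = {
    "ASCII digits": "104",
    "Arabic-Indic digits": "\u0661\u0660\u0664",  # ١٠٤
    "superscript two": "\u00b2",                  # ²
}

for name, s in samples.items():
    print(name, "isdigit:", s.isdigit(), "isdecimal:", s.isdecimal(),
          "int accepts:", int_accepts(s))
```

For these samples, isdecimal agrees with int and isdigit does not, which is one more thing a hand-written check has to get right.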

Have I thought of everything?

I don't know. Do you know? If you don't, how are you going to write code that handles the things we haven't thought of?

Other rules might be even more complicated than the int with no base rule. For different use cases, users might reasonably expect 0x1234 or 1e10 or 1.0 or 1+0j or who knows what else to count as integers. The way to test for whatever it is you want to test for is still simple: write a conversion function for that, and see if it fails. Trying to LBYL it means that you have to write most of the same logic twice. Or, if you're relying on int or literal_eval or whatever to provide some or all of that logic, you have to duplicate its logic.
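
As an example of "write a conversion function and see if it fails", here is a sketch for one possible rule: accept anything int accepts, plus anything float parses to a whole number, such as 1.0 or 1e10. The rule and the name try_integral are hypothetical, chosen for illustration only. One loud caveat: going through float loses precision for values beyond 2**53, so this would not be good enough if exactness matters.

```python
def try_integral(x):
    """Return an int if x denotes a whole number under this sketch's rule; else x."""
    try:
        return int(x)          # plain integer numerals, handled exactly
    except ValueError:
        pass
    try:
        f = float(x)           # things like "1.0" or "1e10"
    except ValueError:
        return x
    if f.is_integer():         # False for inf, nan, and fractional values
        return int(f)
    return x

print([try_integral(s) for s in ["-7", "1.0", "1e10", "1.5", "inf", "spam"]])
# [-7, 1, 10000000000, '1.5', 'inf', 'spam']
```

The caller still just calls one function and gets back either an int or the original string; none of the parsing rules are written out a second time as a pre-check.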
