Difference between unicode.isdigit() and unicode.isnumeric() - python

What is the difference between methods unicode.isdigit() and unicode.isnumeric()?

The Python 3 documentation is a little clearer than the Python 2 docs:
str.isdigit()
[...] Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
str.isnumeric()
Numeric characters include digit characters, and all characters that have the Unicode numeric value property, e.g. U+2155, VULGAR FRACTION ONE FIFTH. Formally, numeric characters are those with the property value Numeric_Type=Digit, Numeric_Type=Decimal or Numeric_Type=Numeric.
So isnumeric() tests additionally for Numeric_Type=Numeric. Quoting from a historic proposal for official numeric type definitions:
Numeric_Type=Decimal
Characters used in a positional decimal systems, which standard base-10 radix systems with contiguous digits 0..9, and are most-significant-digit first (backingstore order). These are coextensive by definition with General_Category=Decimal_Number.
Numeric_Type=Digit
Variants of positional decimal characters (Numeric_Type=Decimal) or sequences thereof. These include super/subscripts, enclosed, or decorated by the addition of characters such as parentheses, dots, or commas.
Numeric_Type=Numeric
Characters with numeric value, but that are neither Decimal nor Digit.
So any character that is numeric, but not decimal or a variation thereof. Think fractions, roman numerals, glyphs that combine digits, and any numbering system that is not decimal-based.
That includes:
>>> import unicodedata
>>> for codepoint in range(2**16):
...     chr = unichr(codepoint)
...     if chr.isnumeric() and not chr.isdigit():
...         print u'{:04x}: {} ({})'.format(codepoint, chr, unicodedata.name(chr, 'UNNAMED'))
...
00bc: ¼ (VULGAR FRACTION ONE QUARTER)
00bd: ½ (VULGAR FRACTION ONE HALF)
00be: ¾ (VULGAR FRACTION THREE QUARTERS)
09f4: ৴ (BENGALI CURRENCY NUMERATOR ONE)
09f5: ৵ (BENGALI CURRENCY NUMERATOR TWO)
09f6: ৶ (BENGALI CURRENCY NUMERATOR THREE)
09f7: ৷ (BENGALI CURRENCY NUMERATOR FOUR)
09f8: ৸ (BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR)
09f9: ৹ (BENGALI CURRENCY DENOMINATOR SIXTEEN)
0bf0: ௰ (TAMIL NUMBER TEN)
0bf1: ௱ (TAMIL NUMBER ONE HUNDRED)
0bf2: ௲ (TAMIL NUMBER ONE THOUSAND)
0c78: ౸ (TELUGU FRACTION DIGIT ZERO FOR ODD POWERS OF FOUR)
0c79: ౹ (TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR)
0c7a: ౺ (TELUGU FRACTION DIGIT TWO FOR ODD POWERS OF FOUR)
0c7b: ౻ (TELUGU FRACTION DIGIT THREE FOR ODD POWERS OF FOUR)
0c7c: ౼ (TELUGU FRACTION DIGIT ONE FOR EVEN POWERS OF FOUR)
0c7d: ౽ (TELUGU FRACTION DIGIT TWO FOR EVEN POWERS OF FOUR)
0c7e: ౾ (TELUGU FRACTION DIGIT THREE FOR EVEN POWERS OF FOUR)
0d70: ൰ (MALAYALAM NUMBER TEN)
0d71: ൱ (MALAYALAM NUMBER ONE HUNDRED)
0d72: ൲ (MALAYALAM NUMBER ONE THOUSAND)
0d73: ൳ (MALAYALAM FRACTION ONE QUARTER)
0d74: ൴ (MALAYALAM FRACTION ONE HALF)
0d75: ൵ (MALAYALAM FRACTION THREE QUARTERS)
0f2a: ༪ (TIBETAN DIGIT HALF ONE)
0f2b: ༫ (TIBETAN DIGIT HALF TWO)
0f2c: ༬ (TIBETAN DIGIT HALF THREE)
0f2d: ༭ (TIBETAN DIGIT HALF FOUR)
0f2e: ༮ (TIBETAN DIGIT HALF FIVE)
0f2f: ༯ (TIBETAN DIGIT HALF SIX)
0f30: ༰ (TIBETAN DIGIT HALF SEVEN)
0f31: ༱ (TIBETAN DIGIT HALF EIGHT)
0f32: ༲ (TIBETAN DIGIT HALF NINE)
0f33: ༳ (TIBETAN DIGIT HALF ZERO)
1372: ፲ (ETHIOPIC NUMBER TEN)
1373: ፳ (ETHIOPIC NUMBER TWENTY)
1374: ፴ (ETHIOPIC NUMBER THIRTY)
1375: ፵ (ETHIOPIC NUMBER FORTY)
1376: ፶ (ETHIOPIC NUMBER FIFTY)
1377: ፷ (ETHIOPIC NUMBER SIXTY)
1378: ፸ (ETHIOPIC NUMBER SEVENTY)
1379: ፹ (ETHIOPIC NUMBER EIGHTY)
137a: ፺ (ETHIOPIC NUMBER NINETY)
137b: ፻ (ETHIOPIC NUMBER HUNDRED)
137c: ፼ (ETHIOPIC NUMBER TEN THOUSAND)
16ee: ᛮ (RUNIC ARLAUG SYMBOL)
16ef: ᛯ (RUNIC TVIMADUR SYMBOL)
16f0: ᛰ (RUNIC BELGTHOR SYMBOL)
17f0: ៰ (KHMER SYMBOL LEK ATTAK SON)
17f1: ៱ (KHMER SYMBOL LEK ATTAK MUOY)
17f2: ៲ (KHMER SYMBOL LEK ATTAK PII)
17f3: ៳ (KHMER SYMBOL LEK ATTAK BEI)
17f4: ៴ (KHMER SYMBOL LEK ATTAK BUON)
17f5: ៵ (KHMER SYMBOL LEK ATTAK PRAM)
17f6: ៶ (KHMER SYMBOL LEK ATTAK PRAM-MUOY)
17f7: ៷ (KHMER SYMBOL LEK ATTAK PRAM-PII)
17f8: ៸ (KHMER SYMBOL LEK ATTAK PRAM-BEI)
17f9: ៹ (KHMER SYMBOL LEK ATTAK PRAM-BUON)
2150: ⅐ (VULGAR FRACTION ONE SEVENTH)
2151: ⅑ (VULGAR FRACTION ONE NINTH)
2152: ⅒ (VULGAR FRACTION ONE TENTH)
2153: ⅓ (VULGAR FRACTION ONE THIRD)
2154: ⅔ (VULGAR FRACTION TWO THIRDS)
2155: ⅕ (VULGAR FRACTION ONE FIFTH)
2156: ⅖ (VULGAR FRACTION TWO FIFTHS)
2157: ⅗ (VULGAR FRACTION THREE FIFTHS)
2158: ⅘ (VULGAR FRACTION FOUR FIFTHS)
2159: ⅙ (VULGAR FRACTION ONE SIXTH)
215a: ⅚ (VULGAR FRACTION FIVE SIXTHS)
215b: ⅛ (VULGAR FRACTION ONE EIGHTH)
215c: ⅜ (VULGAR FRACTION THREE EIGHTHS)
215d: ⅝ (VULGAR FRACTION FIVE EIGHTHS)
215e: ⅞ (VULGAR FRACTION SEVEN EIGHTHS)
215f: ⅟ (FRACTION NUMERATOR ONE)
2160: Ⅰ (ROMAN NUMERAL ONE)
2161: Ⅱ (ROMAN NUMERAL TWO)
2162: Ⅲ (ROMAN NUMERAL THREE)
2163: Ⅳ (ROMAN NUMERAL FOUR)
2164: Ⅴ (ROMAN NUMERAL FIVE)
2165: Ⅵ (ROMAN NUMERAL SIX)
2166: Ⅶ (ROMAN NUMERAL SEVEN)
2167: Ⅷ (ROMAN NUMERAL EIGHT)
2168: Ⅸ (ROMAN NUMERAL NINE)
2169: Ⅹ (ROMAN NUMERAL TEN)
216a: Ⅺ (ROMAN NUMERAL ELEVEN)
216b: Ⅻ (ROMAN NUMERAL TWELVE)
216c: Ⅼ (ROMAN NUMERAL FIFTY)
216d: Ⅽ (ROMAN NUMERAL ONE HUNDRED)
216e: Ⅾ (ROMAN NUMERAL FIVE HUNDRED)
216f: Ⅿ (ROMAN NUMERAL ONE THOUSAND)
2170: ⅰ (SMALL ROMAN NUMERAL ONE)
2171: ⅱ (SMALL ROMAN NUMERAL TWO)
2172: ⅲ (SMALL ROMAN NUMERAL THREE)
2173: ⅳ (SMALL ROMAN NUMERAL FOUR)
2174: ⅴ (SMALL ROMAN NUMERAL FIVE)
2175: ⅵ (SMALL ROMAN NUMERAL SIX)
2176: ⅶ (SMALL ROMAN NUMERAL SEVEN)
2177: ⅷ (SMALL ROMAN NUMERAL EIGHT)
2178: ⅸ (SMALL ROMAN NUMERAL NINE)
2179: ⅹ (SMALL ROMAN NUMERAL TEN)
217a: ⅺ (SMALL ROMAN NUMERAL ELEVEN)
217b: ⅻ (SMALL ROMAN NUMERAL TWELVE)
217c: ⅼ (SMALL ROMAN NUMERAL FIFTY)
217d: ⅽ (SMALL ROMAN NUMERAL ONE HUNDRED)
217e: ⅾ (SMALL ROMAN NUMERAL FIVE HUNDRED)
217f: ⅿ (SMALL ROMAN NUMERAL ONE THOUSAND)
2180: ↀ (ROMAN NUMERAL ONE THOUSAND C D)
2181: ↁ (ROMAN NUMERAL FIVE THOUSAND)
2182: ↂ (ROMAN NUMERAL TEN THOUSAND)
2185: ↅ (ROMAN NUMERAL SIX LATE FORM)
2186: ↆ (ROMAN NUMERAL FIFTY EARLY FORM)
2187: ↇ (ROMAN NUMERAL FIFTY THOUSAND)
2188: ↈ (ROMAN NUMERAL ONE HUNDRED THOUSAND)
2189: ↉ (VULGAR FRACTION ZERO THIRDS)
2469: ⑩ (CIRCLED NUMBER TEN)
246a: ⑪ (CIRCLED NUMBER ELEVEN)
246b: ⑫ (CIRCLED NUMBER TWELVE)
246c: ⑬ (CIRCLED NUMBER THIRTEEN)
246d: ⑭ (CIRCLED NUMBER FOURTEEN)
246e: ⑮ (CIRCLED NUMBER FIFTEEN)
246f: ⑯ (CIRCLED NUMBER SIXTEEN)
2470: ⑰ (CIRCLED NUMBER SEVENTEEN)
2471: ⑱ (CIRCLED NUMBER EIGHTEEN)
2472: ⑲ (CIRCLED NUMBER NINETEEN)
2473: ⑳ (CIRCLED NUMBER TWENTY)
247d: ⑽ (PARENTHESIZED NUMBER TEN)
247e: ⑾ (PARENTHESIZED NUMBER ELEVEN)
247f: ⑿ (PARENTHESIZED NUMBER TWELVE)
2480: ⒀ (PARENTHESIZED NUMBER THIRTEEN)
2481: ⒁ (PARENTHESIZED NUMBER FOURTEEN)
2482: ⒂ (PARENTHESIZED NUMBER FIFTEEN)
2483: ⒃ (PARENTHESIZED NUMBER SIXTEEN)
2484: ⒄ (PARENTHESIZED NUMBER SEVENTEEN)
2485: ⒅ (PARENTHESIZED NUMBER EIGHTEEN)
2486: ⒆ (PARENTHESIZED NUMBER NINETEEN)
2487: ⒇ (PARENTHESIZED NUMBER TWENTY)
2491: ⒑ (NUMBER TEN FULL STOP)
2492: ⒒ (NUMBER ELEVEN FULL STOP)
2493: ⒓ (NUMBER TWELVE FULL STOP)
2494: ⒔ (NUMBER THIRTEEN FULL STOP)
2495: ⒕ (NUMBER FOURTEEN FULL STOP)
2496: ⒖ (NUMBER FIFTEEN FULL STOP)
2497: ⒗ (NUMBER SIXTEEN FULL STOP)
2498: ⒘ (NUMBER SEVENTEEN FULL STOP)
2499: ⒙ (NUMBER EIGHTEEN FULL STOP)
249a: ⒚ (NUMBER NINETEEN FULL STOP)
249b: ⒛ (NUMBER TWENTY FULL STOP)
24eb: ⓫ (NEGATIVE CIRCLED NUMBER ELEVEN)
24ec: ⓬ (NEGATIVE CIRCLED NUMBER TWELVE)
24ed: ⓭ (NEGATIVE CIRCLED NUMBER THIRTEEN)
24ee: ⓮ (NEGATIVE CIRCLED NUMBER FOURTEEN)
24ef: ⓯ (NEGATIVE CIRCLED NUMBER FIFTEEN)
24f0: ⓰ (NEGATIVE CIRCLED NUMBER SIXTEEN)
24f1: ⓱ (NEGATIVE CIRCLED NUMBER SEVENTEEN)
24f2: ⓲ (NEGATIVE CIRCLED NUMBER EIGHTEEN)
24f3: ⓳ (NEGATIVE CIRCLED NUMBER NINETEEN)
24f4: ⓴ (NEGATIVE CIRCLED NUMBER TWENTY)
24fe: ⓾ (DOUBLE CIRCLED NUMBER TEN)
277f: ❿ (DINGBAT NEGATIVE CIRCLED NUMBER TEN)
2789: ➉ (DINGBAT CIRCLED SANS-SERIF NUMBER TEN)
2793: ➓ (DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN)
2cfd: ⳽ (COPTIC FRACTION ONE HALF)
3007: 〇 (IDEOGRAPHIC NUMBER ZERO)
3021: 〡 (HANGZHOU NUMERAL ONE)
3022: 〢 (HANGZHOU NUMERAL TWO)
3023: 〣 (HANGZHOU NUMERAL THREE)
3024: 〤 (HANGZHOU NUMERAL FOUR)
3025: 〥 (HANGZHOU NUMERAL FIVE)
3026: 〦 (HANGZHOU NUMERAL SIX)
3027: 〧 (HANGZHOU NUMERAL SEVEN)
3028: 〨 (HANGZHOU NUMERAL EIGHT)
3029: 〩 (HANGZHOU NUMERAL NINE)
3038: 〸 (HANGZHOU NUMERAL TEN)
3039: 〹 (HANGZHOU NUMERAL TWENTY)
303a: 〺 (HANGZHOU NUMERAL THIRTY)
3192: ㆒ (IDEOGRAPHIC ANNOTATION ONE MARK)
3193: ㆓ (IDEOGRAPHIC ANNOTATION TWO MARK)
3194: ㆔ (IDEOGRAPHIC ANNOTATION THREE MARK)
3195: ㆕ (IDEOGRAPHIC ANNOTATION FOUR MARK)
3220: ㈠ (PARENTHESIZED IDEOGRAPH ONE)
3221: ㈡ (PARENTHESIZED IDEOGRAPH TWO)
3222: ㈢ (PARENTHESIZED IDEOGRAPH THREE)
3223: ㈣ (PARENTHESIZED IDEOGRAPH FOUR)
3224: ㈤ (PARENTHESIZED IDEOGRAPH FIVE)
3225: ㈥ (PARENTHESIZED IDEOGRAPH SIX)
3226: ㈦ (PARENTHESIZED IDEOGRAPH SEVEN)
3227: ㈧ (PARENTHESIZED IDEOGRAPH EIGHT)
3228: ㈨ (PARENTHESIZED IDEOGRAPH NINE)
3229: ㈩ (PARENTHESIZED IDEOGRAPH TEN)
3251: ㉑ (CIRCLED NUMBER TWENTY ONE)
3252: ㉒ (CIRCLED NUMBER TWENTY TWO)
3253: ㉓ (CIRCLED NUMBER TWENTY THREE)
3254: ㉔ (CIRCLED NUMBER TWENTY FOUR)
3255: ㉕ (CIRCLED NUMBER TWENTY FIVE)
3256: ㉖ (CIRCLED NUMBER TWENTY SIX)
3257: ㉗ (CIRCLED NUMBER TWENTY SEVEN)
3258: ㉘ (CIRCLED NUMBER TWENTY EIGHT)
3259: ㉙ (CIRCLED NUMBER TWENTY NINE)
325a: ㉚ (CIRCLED NUMBER THIRTY)
325b: ㉛ (CIRCLED NUMBER THIRTY ONE)
325c: ㉜ (CIRCLED NUMBER THIRTY TWO)
325d: ㉝ (CIRCLED NUMBER THIRTY THREE)
325e: ㉞ (CIRCLED NUMBER THIRTY FOUR)
325f: ㉟ (CIRCLED NUMBER THIRTY FIVE)
3280: ㊀ (CIRCLED IDEOGRAPH ONE)
3281: ㊁ (CIRCLED IDEOGRAPH TWO)
3282: ㊂ (CIRCLED IDEOGRAPH THREE)
3283: ㊃ (CIRCLED IDEOGRAPH FOUR)
3284: ㊄ (CIRCLED IDEOGRAPH FIVE)
3285: ㊅ (CIRCLED IDEOGRAPH SIX)
3286: ㊆ (CIRCLED IDEOGRAPH SEVEN)
3287: ㊇ (CIRCLED IDEOGRAPH EIGHT)
3288: ㊈ (CIRCLED IDEOGRAPH NINE)
3289: ㊉ (CIRCLED IDEOGRAPH TEN)
32b1: ㊱ (CIRCLED NUMBER THIRTY SIX)
32b2: ㊲ (CIRCLED NUMBER THIRTY SEVEN)
32b3: ㊳ (CIRCLED NUMBER THIRTY EIGHT)
32b4: ㊴ (CIRCLED NUMBER THIRTY NINE)
32b5: ㊵ (CIRCLED NUMBER FORTY)
32b6: ㊶ (CIRCLED NUMBER FORTY ONE)
32b7: ㊷ (CIRCLED NUMBER FORTY TWO)
32b8: ㊸ (CIRCLED NUMBER FORTY THREE)
32b9: ㊹ (CIRCLED NUMBER FORTY FOUR)
32ba: ㊺ (CIRCLED NUMBER FORTY FIVE)
32bb: ㊻ (CIRCLED NUMBER FORTY SIX)
32bc: ㊼ (CIRCLED NUMBER FORTY SEVEN)
32bd: ㊽ (CIRCLED NUMBER FORTY EIGHT)
32be: ㊾ (CIRCLED NUMBER FORTY NINE)
32bf: ㊿ (CIRCLED NUMBER FIFTY)
3405: 㐅 (CJK UNIFIED IDEOGRAPH-3405)
3483: 㒃 (CJK UNIFIED IDEOGRAPH-3483)
382a: 㠪 (CJK UNIFIED IDEOGRAPH-382A)
3b4d: 㭍 (CJK UNIFIED IDEOGRAPH-3B4D)
4e00: 一 (CJK UNIFIED IDEOGRAPH-4E00)
4e03: 七 (CJK UNIFIED IDEOGRAPH-4E03)
4e07: 万 (CJK UNIFIED IDEOGRAPH-4E07)
4e09: 三 (CJK UNIFIED IDEOGRAPH-4E09)
4e5d: 九 (CJK UNIFIED IDEOGRAPH-4E5D)
4e8c: 二 (CJK UNIFIED IDEOGRAPH-4E8C)
4e94: 五 (CJK UNIFIED IDEOGRAPH-4E94)
4e96: 亖 (CJK UNIFIED IDEOGRAPH-4E96)
4ebf: 亿 (CJK UNIFIED IDEOGRAPH-4EBF)
4ec0: 什 (CJK UNIFIED IDEOGRAPH-4EC0)
4edf: 仟 (CJK UNIFIED IDEOGRAPH-4EDF)
4ee8: 仨 (CJK UNIFIED IDEOGRAPH-4EE8)
4f0d: 伍 (CJK UNIFIED IDEOGRAPH-4F0D)
4f70: 佰 (CJK UNIFIED IDEOGRAPH-4F70)
5104: 億 (CJK UNIFIED IDEOGRAPH-5104)
5146: 兆 (CJK UNIFIED IDEOGRAPH-5146)
5169: 兩 (CJK UNIFIED IDEOGRAPH-5169)
516b: 八 (CJK UNIFIED IDEOGRAPH-516B)
516d: 六 (CJK UNIFIED IDEOGRAPH-516D)
5341: 十 (CJK UNIFIED IDEOGRAPH-5341)
5343: 千 (CJK UNIFIED IDEOGRAPH-5343)
5344: 卄 (CJK UNIFIED IDEOGRAPH-5344)
5345: 卅 (CJK UNIFIED IDEOGRAPH-5345)
534c: 卌 (CJK UNIFIED IDEOGRAPH-534C)
53c1: 叁 (CJK UNIFIED IDEOGRAPH-53C1)
53c2: 参 (CJK UNIFIED IDEOGRAPH-53C2)
53c3: 參 (CJK UNIFIED IDEOGRAPH-53C3)
53c4: 叄 (CJK UNIFIED IDEOGRAPH-53C4)
56db: 四 (CJK UNIFIED IDEOGRAPH-56DB)
58f1: 壱 (CJK UNIFIED IDEOGRAPH-58F1)
58f9: 壹 (CJK UNIFIED IDEOGRAPH-58F9)
5e7a: 幺 (CJK UNIFIED IDEOGRAPH-5E7A)
5efe: 廾 (CJK UNIFIED IDEOGRAPH-5EFE)
5eff: 廿 (CJK UNIFIED IDEOGRAPH-5EFF)
5f0c: 弌 (CJK UNIFIED IDEOGRAPH-5F0C)
5f0d: 弍 (CJK UNIFIED IDEOGRAPH-5F0D)
5f0e: 弎 (CJK UNIFIED IDEOGRAPH-5F0E)
5f10: 弐 (CJK UNIFIED IDEOGRAPH-5F10)
62fe: 拾 (CJK UNIFIED IDEOGRAPH-62FE)
634c: 捌 (CJK UNIFIED IDEOGRAPH-634C)
67d2: 柒 (CJK UNIFIED IDEOGRAPH-67D2)
6f06: 漆 (CJK UNIFIED IDEOGRAPH-6F06)
7396: 玖 (CJK UNIFIED IDEOGRAPH-7396)
767e: 百 (CJK UNIFIED IDEOGRAPH-767E)
8086: 肆 (CJK UNIFIED IDEOGRAPH-8086)
842c: 萬 (CJK UNIFIED IDEOGRAPH-842C)
8cae: 貮 (CJK UNIFIED IDEOGRAPH-8CAE)
8cb3: 貳 (CJK UNIFIED IDEOGRAPH-8CB3)
8d30: 贰 (CJK UNIFIED IDEOGRAPH-8D30)
9621: 阡 (CJK UNIFIED IDEOGRAPH-9621)
9646: 陆 (CJK UNIFIED IDEOGRAPH-9646)
964c: 陌 (CJK UNIFIED IDEOGRAPH-964C)
9678: 陸 (CJK UNIFIED IDEOGRAPH-9678)
96f6: 零 (CJK UNIFIED IDEOGRAPH-96F6)
a6e6: ꛦ (BAMUM LETTER MO)
a6e7: ꛧ (BAMUM LETTER MBAA)
a6e8: ꛨ (BAMUM LETTER TET)
a6e9: ꛩ (BAMUM LETTER KPA)
a6ea: ꛪ (BAMUM LETTER TEN)
a6eb: ꛫ (BAMUM LETTER NTUU)
a6ec: ꛬ (BAMUM LETTER SAMBA)
a6ed: ꛭ (BAMUM LETTER FAAMAE)
a6ee: ꛮ (BAMUM LETTER KOVUU)
a6ef: ꛯ (BAMUM LETTER KOGHOM)
a830: ꠰ (NORTH INDIC FRACTION ONE QUARTER)
a831: ꠱ (NORTH INDIC FRACTION ONE HALF)
a832: ꠲ (NORTH INDIC FRACTION THREE QUARTERS)
a833: ꠳ (NORTH INDIC FRACTION ONE SIXTEENTH)
a834: ꠴ (NORTH INDIC FRACTION ONE EIGHTH)
a835: ꠵ (NORTH INDIC FRACTION THREE SIXTEENTHS)
f96b: 參 (CJK COMPATIBILITY IDEOGRAPH-F96B)
f973: 拾 (CJK COMPATIBILITY IDEOGRAPH-F973)
f978: 兩 (CJK COMPATIBILITY IDEOGRAPH-F978)
f9b2: 零 (CJK COMPATIBILITY IDEOGRAPH-F9B2)
f9d1: 六 (CJK COMPATIBILITY IDEOGRAPH-F9D1)
f9d3: 陸 (CJK COMPATIBILITY IDEOGRAPH-F9D3)
f9fd: 什 (CJK COMPATIBILITY IDEOGRAPH-F9FD)
However, the distinction between Numeric_Type=Digit and Numeric_Type=Numeric is no longer considered useful, and Numeric_Type=Digit is no longer used for new characters since Unicode 6.3.0. Quoting Unicode Standard Annex #44:
Starting with Unicode 6.3.0, no newly encoded numeric characters will be given Numeric_Type=Digit, nor will existing characters with Numeric_Type=Numeric be changed to Numeric_Type=Digit. The distinction between those two types is not considered useful.
Thus, 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO) and other characters that once would have been assigned Numeric_Type=Digit have instead been assigned Numeric_Type=Numeric, and they report False for isdigit:
>>> '🄌'.isdigit()
False
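As a quick check of the three Numeric_Type classes described above, here is a minimal sketch in Python 3 (where str plays the role of Python 2's unicode). The helper name classify and the three sample characters are chosen only for illustration; str.isdecimal(), which is not discussed above, is included because it corresponds to Numeric_Type=Decimal.

```python
import unicodedata

# One sample per class: DIGIT FIVE, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF
samples = ["5", "\u00b2", "\u00bd"]


def classify(ch):
    """Return (isdecimal, isdigit, isnumeric) for a single character."""
    return (ch.isdecimal(), ch.isdigit(), ch.isnumeric())


for ch in samples:
    # unicodedata.numeric() gives the numeric value Unicode assigns to the character
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{classify(ch)} value={unicodedata.numeric(ch)}")
```

Each test is a superset of the previous one: a decimal character passes all three, a superscript digit fails only isdecimal(), and a fraction passes only isnumeric().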

unicode.isnumeric()
Return True if there are only numeric characters in S, False otherwise. Numeric characters include digit characters, and all characters that have the Unicode numeric value property, e.g. U+2155, VULGAR FRACTION ONE FIFTH.
str.isdigit()
Return true if all characters in the string are digits and there is at least one character, false otherwise.
For 8-bit strings, this method is locale-dependent.

From the manual
The method isnumeric() checks whether the string consists of only
numeric characters. This method is present only on unicode objects.
Digits include decimal characters and digits that need special
handling, such as the compatibility superscript digits. Formally, a
digit is a character that has the property value Numeric_Type=Digit or
Numeric_Type=Decimal.

The code snippet provided by @Martijn Pieters is written for Python 2 (it uses unichr and the print statement), so it doesn't run on Python 3, e.g. 3.7, the latest version at the time of writing this answer.
Here is the updated code snippet.
import unicodedata

count = 0
for codepoint in range(2**16):
    ch = chr(codepoint)
    if ch.isnumeric() and not ch.isdigit():
        print(u'{:04x}: {} ({})'.format(codepoint, ch, unicodedata.name(ch, 'UNNAMED')))
        count = count + 1
print(f'Total Number of Numeric and Non-Digit Unicode Characters = {count}')
Output:
...
f9d1: 六 (CJK COMPATIBILITY IDEOGRAPH-F9D1)
f9d3: 陸 (CJK COMPATIBILITY IDEOGRAPH-F9D3)
f9fd: 什 (CJK COMPATIBILITY IDEOGRAPH-F9FD)
Total Number of Numeric and Non-Digit Unicode Characters = 335
NOTE: I am using f-strings for formatting. They are a newer way to format strings, introduced in Python 3.6 under PEP 498, and are also called Literal String Interpolation. See the official documentation for details.

From Python's built-in docstrings:
>>> unicode.isdigit.__doc__
'S.isdigit() -> bool\n\nReturn True if all characters in S are digits\nand there is at least one character in S, False otherwise.'
>>> unicode.isnumeric.__doc__
'S.isnumeric() -> bool\n\nReturn True if there are only numeric characters in S,\nFalse otherwise.'

Related

how to add options to inflect number-to-word?

I have a program which calculates the number of minutes of a person's age. It works correctly. However, I want to ask if I can print the first letter capitalized.
from datetime import datetime, date
import sys
import inflect

inflector = inflect.engine()

def main():
    # today = date.today()
    user = input('Date of birth: ')
    min_preter(user)

def min_preter(data):
    try:
        data != datetime.strptime(data, '%Y-%m-%d')
        # Get the y-m-d in current time
        today = date.today()
        # Split the y-m-d
        year, month, day = data.split('-')
        # Convert to datetime
        data = date(year=int(year), month=int(month), day=int(day))
        # And voilà
        end = (today - data).total_seconds() / 60
        # Convert to words
        words = inflector.number_to_words(end).replace('point zero','minutes').upper()
        return words
    except:
        sys.exit('Invalid date')
    # convert from string format to datetime format

if __name__ == "__main__":
    main()
Here is the output when I enter, e.g., 1999-01-01:
twelve million, four hundred and fifty-seven thousand, four hundred and forty point zero
where I expected
Twelve million, four hundred and fifty-seven thousand, four hundred and forty minutes
with the first word, 'Twelve', capitalized (first letter only).
I don't know what this point zero is. I just want the minutes at the end.
Thank you
You can use string.capitalize(). So you can do that:
return words.capitalize()
... and as for the "point zero", try converting the result to int before running your function, like
end = int((today - data).total_seconds() / 60)
Just replace .upper() by capitalize() in your code
An alternative to your replace would be to obtain the total number of minutes as an integer (point zero is because end is a float number) :
end = int((today - data).total_seconds() / 60)
In that case, your words variable would be :
words = inflector.number_to_words(end).capitalize() + " minutes"
You can use .capitalize() to capitalize the first word of the string.
EXAMPLE: words.capitalize()
"twelve million, four hundred and fifty-seven thousand, four hundred and forty point zero".capitalize()
OUTPUT
'Twelve million, four hundred and fifty-seven thousand, four hundred and forty point zero'
Regarding point zero
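The expression end = (today - data).total_seconds() / 60 produces a float, and that is what leads to "point zero". Note that floor division (// instead of /) is not enough on its own: total_seconds() returns a float, so the result of // is still a float (e.g. 1440.0). Convert end to int instead, e.g. end = int((today - data).total_seconds() // 60), and the "point zero" will be gone.
Lastly, append the string ' minutes' to the words returned by number_to_words (not to end itself, which is a number).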
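The two fixes suggested in the answers above (an integer minute count, and capitalizing only the first letter) can be sketched as follows. This is a minimal illustration, not the asker's program: the helper names whole_minutes and label_minutes are made up here, inflect is left out so the block runs with the standard library alone, and the words argument stands in for whatever inflector.number_to_words(...) returns.

```python
from datetime import date, timedelta


def whole_minutes(born, today):
    """Whole minutes between two dates, as an int."""
    # timedelta // timedelta returns an int, so there is no ".0" to spell out.
    return (today - born) // timedelta(minutes=1)


def label_minutes(words):
    """Capitalize the first letter and append the unit."""
    # `words` is assumed to be the lowercase string from number_to_words()
    return words.capitalize() + " minutes"


minutes = whole_minutes(date(1999, 1, 1), date(1999, 1, 2))
print(minutes)                        # an int
print(86400.0 // 60)                  # float // int is still a float
print(label_minutes("one thousand, four hundred and forty"))
```

Dividing one timedelta by another avoids the float that total_seconds() introduces, so no int() call is needed.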

How to generate a 14 digit serial number in python?

How can I generate 14-digit serial numbers in Python where the last four digits will be 0001, then 0002, ..., 0011 and so on? This is the format I want the number to have: 12101010010001. Below is the breakdown of the format.
First three digits (121) = Local Govt. ID
4th & 5th digits (01) = Zonal ID
6th & 7th digits (01) = Area ID
8th to 10th digits (001) = CDA No.
Last four digits (0001) Property No.
I would construct it as a string:
sn = "{:03}{:02}{:02}{:03}{:04}".format(121, 1, 1, 1, 1)
This gives sn the value '12101010010001', zero-padding the fields to the desired width. If you want to convert the result to an integer (as opposed to leaving it as a string), just use int(sn).
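To produce the running sequence the question asks for (0001, 0002, ..., 0011), the same zero-padded formatting can be wrapped in a function and called in a loop. This is only a sketch: the function name make_serial, its parameter names, and the range check are assumptions for illustration, not part of the original answer.

```python
def make_serial(lga, zone, area, cda, prop):
    """Build a 14-digit serial string from the five numeric fields."""
    # (value, width) pairs in the order given in the question's breakdown
    parts = [(lga, 3), (zone, 2), (area, 2), (cda, 3), (prop, 4)]
    for value, width in parts:
        # Reject values that would not fit in their zero-padded field
        if not 0 <= value < 10 ** width:
            raise ValueError(f"{value} does not fit in {width} digits")
    return "".join(f"{value:0{width}d}" for value, width in parts)


# Property numbers 1..11 within the same local government, zone, area and CDA
serials = [make_serial(121, 1, 1, 1, n) for n in range(1, 12)]
print(serials[0], serials[-1])
```

Keeping the result as a string preserves leading zeros if the first field ever starts with 0, which int(sn) would drop.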

how to return numbers in word format from a string in python

EDIT: The "already answered" is not talking about what I am. My string already comes out in word format. I need to strip those words from my string into a list
I'm trying to work with phrase manipulation for a voice assistant in python.
When I speak something like:
"What is 15,276 divided by 5?"
It comes out formatted like this:
"what is fifteen thousand two hundred seventy six divided by five"
I already have a way to change a string to an int for the math part, so is there a way to somehow get a list like this from the phrase?
['fifteen thousand two hundred seventy six','five']
Go through the list of words and check each one for membership in a set of numerical words. Add adjacent words to a temporary list. When you find a non-number word, create a new list. Remember to account for non-number words at the beginning of the sentence and number words at the end of the sentence.
result = []
group = []
nums = set('one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty thirty forty fifty sixty seventy eighty ninety hundred thousand million billion trillion quadrillion quintillion'.split())
for word in "what is fifteen thousand two hundred seventy six divided by five".split():
    if word in nums:
        group.append(word)
    else:
        if group:
            result.append(group)
        group = []
if group:
    result.append(group)
Result:
>>> result
[['fifteen', 'thousand', 'two', 'hundred', 'seventy', 'six'], ['five']]
To join each sublist into a single string:
>>> list(map(' '.join, result))
['fifteen thousand two hundred seventy six', 'five']

Counting di-Amino Acid frequencies (Bigram frequencies) from FASTA files

Given a large number of FASTA files (the peptidome of secreted peptides for various organisms), how can I read the FASTA files (from UniProt) with Python (or Matlab) and count the frequencies of each amino acid and of amino-acid "double" pairings?
I.e., the output should have the percentage of each individual amino acid (out of the 22 letters/chars) and the frequencies of pairings of amino acids.
Effectively, I want to count the bigram (or n-gram, if easy to implement) frequencies for letter pairs.
The 22 amino acids are each represented by a unique letter in the FASTA file, and the name of each protein is preceded on its line by >. (I have already parsed it, so only relevant characters remain.)
Sample of a file:
FFKA
FLRN
MTTVSYVTILLTVLVQVLTSDAKATNNKRELSSGLKERSLSDDAPQFWKGRFSRSEEDPQ
FWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQ
FWKGRFSDGTKRENDPQYWKGRFSRSFEDQPDSEAQFWKGRFARTSSGEKREPQYWKGRF
SRDSVPGRYGRELQGRFGRELQGRFGREAQGRFGRELQGRFGREFQGRFGREDQGRFGRE
DQGRFGREDQGRFGREDQGRFGREDQGRFGREDQGRFGRELQGRFGREFQGRFGREDQGR
FGREDQGRFGRELQGRFGREDQGRFGREDQGRFGREDLAKEDQGRFGREDLAKEDQGRFG
REDIAEADQGRFGRNAAAAAAAAAAAKKRTIDVIDIESDPKPQTRFRDGKDMQEKRKVEK
KDKIEKSDDALAKTS
Thank you very much!
How does this look?
>>> sequence = "LTSDAKAARFSDPQFWKGRFSDPQFWKGRSAAKGRFARTSSGAAEKREPQAAYWKGRF "
>>> occurrenceAA = str(sequence.count("AA")) # counting occurrences of n-aminos
>>> percent_occurrenceAA = float(occurrenceAA)/len(sequence)*100 # calculate percent total of protein
>>> print occurrenceAA, " Double-alanines in your sequence"
4 Double-alanines in your sequence
>>> print round(percent_occurrenceAA,2), " % of total" # rounding off % to 2 decimal places
6.78 % of total
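For counting every amino acid and every pair at once, a sliding window with collections.Counter is one option. Note that str.count() counts non-overlapping matches ("AAA".count("AA") is 1), whereas bigram counting usually wants overlapping pairs. The sketch below assumes each string is one complete, already-parsed sequence (wrapped lines joined beforehand); the function names ngram_counts and to_percent are made up for illustration.

```python
from collections import Counter


def ngram_counts(sequences, n=2):
    """Count overlapping n-grams over an iterable of sequence strings."""
    counts = Counter()
    for seq in sequences:
        seq = seq.strip().upper()
        # Slide a window of width n; n-grams never span two sequences.
        counts.update(seq[i:i + n] for i in range(len(seq) - n + 1))
    return counts


def to_percent(counts):
    """Convert raw counts to percentages of the total."""
    total = sum(counts.values())
    return {gram: 100.0 * c / total for gram, c in counts.items()} if total else {}


peptides = ["FFKA", "FLRN"]  # the first two lines of the sample file
print(to_percent(ngram_counts(peptides, 1)))   # single amino acids
print(ngram_counts(peptides, 2).most_common(3))  # most frequent pairs
```

Passing n=1 gives the per-residue frequencies and n=2 the di-amino-acid frequencies; larger n works the same way.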

python: How to extract records from a natural language file only delimiter is 5 characters from the beginning of the record

I need to extract individual records from log files generated from a fairly archaic system and get them ready for database input. These flat files are all I can extract (and just formatting the query took weeks). Here is an example of a file with two records. The only delimiter I see is "/11 S11-" which is itself at a regular spot 5 characters in, but not quite at the beginning or end.
For those watching, yes, this is related to my other newb question. I have looked at the python documentation, some google results, and some related questions. So, my questions are
a) how to use a delimiter that starts 5 characters into the record?
b) how to grab these big chunks of natural language?
c) how to get rid of the whitespace after newlines? This is probably the easiest part: I can specify in the query how long each field is. Right now, the accessionDate is 10 characters long, the accessionNumber is 10 characters long, and the patMedicalRecordNum is 15 characters long. So the whitespace on the finalDxText is 35 characters.
01/01/11 S11-55555 20/444-55-6666 A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
TOTAL GLEASON SCORE: GLEASON 5+4=9
TUMOR LOCATION: BILATERAL
TUMOR QUANTITATION: 15% OF PROSTATE INVOLVED BY TUMOR
EXTRAPROSTATIC EXTENSION: PRESENT AT RIGHT POSTERIOR
SEMINAL VESICLE INVASION: PRESENT
MARGINS: UNINVOLVED
LYMPHOVASCULAR INVASION: PRESENT
PERINEURAL INVASION: PRESENT
LYMPH NODES (SPECIMENS B AND C):
NUMBER EXAMINED: 25
NUMBER INVOLVED: 1
DIAMETER OF LARGEST METASTASIS: 1.7 mm
ADDITIONAL FINDINGS: HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,
ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVE
CARCINOMA
PATHOLOGIC STAGE: pT3b N1 MX
B. LYMPH NODES, RIGHT PELVIC, EXCISION:
- ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).
C. LYMPH NODES, LEFT PELVIC, EXCISION:
- EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).
01/02/11 S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
GLEASON SCORE: 3 + 3 = 6 WITH TERTIARY PATTERN OF 5.
TUMOR QUANTITATION: APPROXIMATELY 10% BY VOLUME.
TUMOR LOCATION: BILATERAL.
EXTRAPROSTATIC EXTENSION: NOT IDENTIFIED.
MARGINS: NEGATIVE.
PERINEURAL INVASION: IDENTIFIED.
LYMPH-VASCULAR INVASION: NOT IDENTIFIED.
SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.
LYMPH NODES: NONE SUBMITTED.
OTHER: HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.
PATHOLOGIC STAGE (pTNM): pT2c NX.
Delimiters
I might be off base, but looking at your records, and specifically at 01/01/11 S11-55555 20/444-55-6666, 01/01/11 kinda looks like a date to me.
Therefore, judging from your input:
You could check whether the line starts with a date (I'd say mm/dd/yy is the format here), using for instance a pretty straightforward regex and re.match.
Looks like the data in each record is indented, so it looks like a line not being indented means it's a delimiter.
Whitespace
my_string.strip() returns my_string stripped of leading and trailing whitespace.
I'd try something like this:
import re  # regex module

in_string = """Text from above"""
records = []  # list to store all records in order
record = ""   # string to store current record
for line in in_string.splitlines():  # go through each line of the input
    if re.match(r'\d\d/\d\d/\d\d', line):  # match the date at the start (raw string avoids escape warnings)
        records.append(record)  # add current record to list
        record = ""  # start new current record
    record += line.strip()  # add line (without whitespace) to current record
records.append(record)  # add last record to records list
This outputs the following:
['',
'01/01/11 S11-55555 20/444-55-6666 A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:- ADENOCARCINOMA.TOTAL GLEASON SCORE: GLEASON 5+4=9TUMOR LOCATION: BILATERALTUMOR QUANTITATION: 15% OF PROSTATE INVOLVED BY TUMOREXTRAPROSTATIC EXTENSION: PRESENT AT RIGHT POSTERIORSEMINAL VESICLE INVASION: PRESENTMARGINS: UNINVOLVEDLYMPHOVASCULAR INVASION: PRESENTPERINEURAL INVASION: PRESENTLYMPH NODES (SPECIMENS B AND C):NUMBER EXAMINED: 25NUMBER INVOLVED: 1DIAMETER OF LARGEST METASTASIS: 1.7 mmADDITIONAL FINDINGS: HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVECARCINOMAPATHOLOGIC STAGE: pT3b N1 MXB. LYMPH NODES, RIGHT PELVIC, EXCISION:- ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).C. LYMPH NODES, LEFT PELVIC, EXCISION:- EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).',
'01/02/11 S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:- ADENOCARCINOMA.GLEASON SCORE: 3 + 3 = 6 WITH TERTIARY PATTERN OF 5.TUMOR QUANTITATION: APPROXIMATELY 10% BY VOLUME.TUMOR LOCATION: BILATERAL.EXTRAPROSTATIC EXTENSION: NOT IDENTIFIED.MARGINS: NEGATIVE.PERINEURAL INVASION: IDENTIFIED.LYMPH-VASCULAR INVASION: NOT IDENTIFIED.SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.LYMPH NODES: NONE SUBMITTED.OTHER: HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.PATHOLOGIC STAGE (pTNM): pT2c NX.']
Note: This is a crappy regular expression and will match any line that starts with "nn/nn/nn"
You'll probably want to add in a space between rows - something like record += line.strip()+' '
Good luck!
You can muck around with Regular Expressions (regex/re) in an online regex tester: put your regular expression (i.e. \d\d/\d\d/\d\d S11) in the pattern box, and your text in the input box.
This is an idea:
chunky = open(file, 'r')  # file is the path to your log file
for line in chunky:
    if line > '00':  # It's a starting line (continuation lines begin with whitespace)
        linedata = line.split(None, 3)  # separates line in four pieces
        chunk = linedata[3].strip()
    else:
        chunk += ' ' + line.strip()
And for a newb: you get part of a string with line[a:b], where a is the index of the first character you need (counting from 0) and b is the index of the first character you don't need. Your S11 would be linedata[1][0:3]
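Putting the ideas from both answers together, here is a sketch that splits the text into records and separates the header fields for database input. It rests on assumptions taken from the sample only: every record starts with an mm/dd/yy date, an accession number of the form S followed by two digits, a dash, and more digits, then a medical record number without spaces. The function name parse_records is made up; the dictionary keys reuse the field names from the question.

```python
import re

# Header line: date, accession number, medical record number, start of the text
HEADER = re.compile(r"^(\d\d/\d\d/\d\d)\s+(S\d\d-\d+)\s+(\S+)\s+(.*)$")


def parse_records(text):
    """Split the log text into a list of dicts, one per record."""
    records = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            acc_date, acc_number, mrn, first_text = match.groups()
            records.append({
                "accessionDate": acc_date,
                "accessionNumber": acc_number,
                "patMedicalRecordNum": mrn,
                "finalDxText": first_text.strip(),
            })
        elif records and line.strip():
            # Continuation line: drop the indentation, join with one space
            records[-1]["finalDxText"] += " " + line.strip()
    return records


sample = (
    "01/01/11 S11-55555 20/444-55-6666 A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:\n"
    "                                   - ADENOCARCINOMA.\n"
    "01/02/11 S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:\n"
    "                                   - ADENOCARCINOMA.\n"
)
for rec in parse_records(sample):
    print(rec["accessionNumber"], "|", rec["finalDxText"])
```

Matching the full header rather than just the date makes it less likely that a line of diagnosis text that happens to begin with nn/nn/nn is mistaken for a new record.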
