What is the difference between unicodedata.digit and unicodedata.numeric?

From the unicodedata documentation:
unicodedata.digit(chr[, default]) Returns the digit value assigned to
the character chr as integer. If no such value is defined, default is
returned, or, if not given, ValueError is raised.
unicodedata.numeric(chr[, default]) Returns the numeric value assigned
to the character chr as float. If no such value is defined, default is
returned, or, if not given, ValueError is raised.
Can anybody explain the difference between these two functions to me?
Here one can read the implementation of both functions, but the difference is not evident to me from a quick look, because I'm not familiar with the CPython implementation.
EDIT 1:
An example that shows the difference would be nice.
EDIT 2:
Examples that complement the comments and the excellent answer from @user2357112:
import unicodedata

print(unicodedata.digit('1'))    # 1 -- decimal DIGIT ONE
print(unicodedata.digit('١'))    # 1 -- ARABIC-INDIC DIGIT ONE
print(unicodedata.digit('¼'))    # not a digit, so "ValueError: not a digit" is raised
print(unicodedata.numeric('Ⅱ'))  # 2.0 -- ROMAN NUMERAL TWO
print(unicodedata.numeric('¼'))  # 0.25 -- VULGAR FRACTION ONE QUARTER

Short answer:
If a character represents a decimal digit, so things like 1, ¹ (SUPERSCRIPT ONE), ① (CIRCLED DIGIT ONE), ١ (ARABIC-INDIC DIGIT ONE), unicodedata.digit will return the digit that character represents as an int (so 1 for all of these examples).
If the character represents any numeric value, so things like ⅐ (VULGAR FRACTION ONE SEVENTH) and all the decimal digit examples, unicodedata.numeric will give that character's numeric value as a float.
For technical reasons, more recent digit characters like 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO) may raise a ValueError from unicodedata.digit.
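For instance, on a Python whose unicodedata tables include Unicode 7.0 or later (so that 🄌 is known), the short answer plays out like this:
import unicodedata

print(unicodedata.numeric('🄌'))      # 0.0
print(unicodedata.digit('🄌', None))  # None -- no digit value defined; without a default this raises ValueError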
Long answer:
Unicode characters all have a Numeric_Type property. This property can have 4 possible values: Numeric_Type=Decimal, Numeric_Type=Digit, Numeric_Type=Numeric, or Numeric_Type=None.
Quoting the Unicode standard, version 10.0.0, section 4.6,
The Numeric_Type=Decimal property value (which is correlated with the General_Category=Nd
property value) is limited to those numeric characters that are used in decimal-radix
numbers and for which a full set of digits has been encoded in a contiguous range,
with ascending order of Numeric_Value, and with the digit zero as the first code point in
the range.
Numeric_Type=Decimal characters are thus decimal digits fitting a few other specific technical requirements.
Decimal digits, as defined in the Unicode Standard by these property assignments, exclude
some characters, such as the CJK ideographic digits (see the first ten entries in Table 4-5),
which are not encoded in a contiguous sequence. Decimal digits also exclude the compatibility
subscript and superscript digits, to prevent simplistic parsers from misinterpreting
their values in context. (For more information on superscript and subscripts, see
Section 22.4, Superscript and Subscript Symbols.) Traditionally, the Unicode Character
Database has given these sets of noncontiguous or compatibility digits the value Numeric_Type=Digit, to recognize the fact that they consist of digit values but do not necessarily
meet all the criteria for Numeric_Type=Decimal. However, the distinction between
Numeric_Type=Digit and the more generic Numeric_Type=Numeric has proven not to be
useful in implementations. As a result, future sets of digits which may be added to the standard
and which do not meet the criteria for Numeric_Type=Decimal will simply be
assigned the value Numeric_Type=Numeric.
So Numeric_Type=Digit was historically used for other digits not fitting the technical requirements of Numeric_Type=Decimal, but they decided that wasn't useful, and digit characters not meeting the Numeric_Type=Decimal requirements have just been assigned Numeric_Type=Numeric since Unicode 6.3.0. For example, 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO) introduced in Unicode 7.0 has Numeric_Type=Numeric.
Numeric_Type=Numeric is for all characters that represent numbers and don't fit in the other categories, and Numeric_Type=None is for characters that don't represent numbers (or at least, don't under normal usage).
All characters with a non-None Numeric_Type property have a Numeric_Value property representing their numeric value. unicodedata.digit will return that value as an int for characters with Numeric_Type=Decimal or Numeric_Type=Digit, and unicodedata.numeric will return that value as a float for characters with any non-None Numeric_Type.
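To see how the lookup functions line up with these property values, here is a small check (a minimal sketch; it also uses unicodedata.decimal, which is not discussed above but covers exactly the Numeric_Type=Decimal case, and it assumes '⁵' SUPERSCRIPT FIVE has Numeric_Type=Digit and '⅐' has Numeric_Type=Numeric, as described above):
import unicodedata

for ch in '5⁵⅐':
    print(repr(ch),
          unicodedata.decimal(ch, None),  # Numeric_Type=Decimal only
          unicodedata.digit(ch, None),    # Numeric_Type=Decimal or Digit
          unicodedata.numeric(ch, None))  # any non-None Numeric_Type
# '5' 5 5 5.0
# '⁵' None 5 5.0
# '⅐' None None 0.14285714285714285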

Related

Python decimal.Decimal producing result in scientific notation

I'm dividing a very large number by a much smaller one. Both are of type decimal.Decimal().
The result is coming out in scientific notation. How do I stop this? I need to print the number in full.
>>> decimal.getcontext().prec
50
>>> val
Decimal('1000000000000000000000000')
>>> units
Decimal('1500000000')
>>> units / val
Decimal('1.5E-15')
The precision is kept internally - you just have to explicitly call for the number of decimal places you want at the point you are exporting your decimal value to a string.
So, if you are going to print the value, or insert it into an HTML template, the first step is to use the string format method (or f-strings) to make sure the full number is written out:
In [29]: print(f"{units/val:.50f}")
0.00000000000000150000000000000000000000000000000000
Unfortunately, the string-format mini-language has no way to eliminate the redundant zeros on the right-hand side by itself. (The left side can be padded with "0", " ", or any custom character one wants, but everything after the decimal separator, up to the requested precision, is filled with trailing 0s.)
Since finding the least significant non-zero digit is complicated (otherwise we could use a parameter extracted from the number instead of the hard-coded "50" for the precision in the format expression), the simpler thing is to remove those zeros after the formatting takes place, with the string .rstrip method:
In [30]: print(f"{units/val:.50f}".rstrip("0"))
0.0000000000000015
In short, this seems to be the only way to go: at every interface point where the number leaves the core of your program for an output where it is represented as a string, format it with an excess of precision in fixed-point notation and strip the trailing zeros:
return template.render(number=f"{number:.50f}".rstrip("0"), ...)
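If this pattern shows up in several places, it can be wrapped in a small helper (a hypothetical format_plain function, shown here only as a sketch of the approach above):
from decimal import Decimal

def format_plain(d, places=50):
    # Fixed-point with a generous precision, then drop the redundant trailing
    # zeros and the decimal point itself if nothing is left after it.
    return f"{d:.{places}f}".rstrip("0").rstrip(".")

print(format_plain(Decimal('1500000000') / Decimal('1000000000000000000000000')))
# 0.0000000000000015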
Render the decimal into a formatted string with a float type-indicator {:,f}, and it will display just the right number of digits to express the whole number, regardless of whether it is a very large integer or a very large decimal.
>>> val
Decimal('1000000000000000000000000')
>>> units
Decimal('1500000000')
>>> "{:,f}".format(units / val)
'0.0000000000000015'
# very large decimal integer, formatted as float-type string, appears without any decimal places at all when it has none! Nice!
>>> "{:,f}".format(units * val)
'1,500,000,000,000,000,000,000,000,000,000,000'
You don't need to specify the decimal places. It will display only as many as required to express the number, omitting that trail of useless zeros that appear after the final decimal digit when the decimal is shorter than a fixed format width. And you don't get any decimal places if the number has no fraction part.
Very large numbers are therefore accommodated without having to second-guess how large they will be. And you don't have to second-guess whether they will have decimal places either.
Any specified thousands separator {:,f} will likewise only have effect if it turns out that the number is a large integer instead of a long decimal.
Proviso
Decimal(), however, has this idea of significant places, by which it will add trailing zeros if it thinks you want them.
The idea is that it intelligently handles situations where you might be dealing with currency digits such as £ 10.15. To use the example from the documentation:
>>> decimal.Decimal('1.30') + decimal.Decimal('1.20')
Decimal('2.50')
It makes no difference if you format the Decimal() - you still get the trailing zero if the Decimal() deems it to be significant:
>>> "{:,f}".format( decimal.Decimal('1.30') + decimal.Decimal('1.20'))
'2.50'
The same thing happens (perhaps for some good reason?) when you multiply thousands and fractions together:
>>> decimal.Decimal(2500) * decimal.Decimal('0.001')
Decimal('2.500')
Remove significant trailing zeros with the Decimal().normalize() method:
>>> (2500 * decimal.Decimal('0.001')).normalize()
Decimal('2.5')
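Applied to the currency example from the proviso above, normalize() drops the significant trailing zero as well (which may or may not be what you want for money values):
>>> (decimal.Decimal('1.30') + decimal.Decimal('1.20')).normalize()
Decimal('2.5')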

Reason for residuum in python float multiplication

Why do some float multiplications in Python produce these weird residues?
e.g.
>>> 50*1.1
55.00000000000001
but
>>> 30*1.1
33.0
The reason should be somewhere in the binary representation of floats, but what exactly makes the two examples differ?
(This answer assumes your Python implementation uses IEEE-754 binary64, which is common.)
When 1.1 is converted to floating-point, the result is exactly 1.100000000000000088817841970012523233890533447265625, because this is the nearest representable value. (This number is 4953959590107546 • 2⁻⁵², an integer with at most 53 bits multiplied by a power of two.)
When that is multiplied by 50, the exact mathematical result is 55.00000000000000444089209850062616169452667236328125. That cannot be exactly represented in binary64. To fit it into the binary64 format, it is rounded to the nearest representable value, which is 55.00000000000000710542735760100185871124267578125 (which is 7740561859543041 • 2⁻⁴⁷).
When 1.1 is multiplied by 30, the exact result is 33.00000000000000266453525910037569701671600341796875. It also cannot be represented exactly in binary64. It is rounded to the nearest representable value, which is 33. (The next higher representable value is 33.00000000000000710542735760100185871124267578125, and we can see …026 is closer to …000 than to …071.)
That explains what the internal results are. Next there is an issue of how your Python implementation formats the output. I do not believe the Python implementation is strict about this, but it is likely one of two methods is used:
In effect, the number is converted to a certain number of decimal digits, and then trailing insignificant zeros are removed. Converting 55.00000000000000710542735760100185871124267578125 to a numeral with 16 digits after the decimal point yields 55.00000000000001, which has no trailing zeros to remove. Converting 33 to a numeral with 16 digits after the decimal point yields 33.00000000000000, which has 15 trailing zeros to remove. (Presumably your Python implementation always leaves at least one trailing zero after a decimal point to clearly distinguish that it is a floating-point number rather than an integer.)
Just enough decimal digits are used to uniquely distinguish the number from adjacent representable values. This method is required in Java and JavaScript but is not yet common in other programming languages. In the case of 55.00000000000000710542735760100185871124267578125, printing “55.00000000000001” distinguishes it from the neighboring values 55 (which would be formatted as “55.0”) and 55.0000000000000142108547152020037174224853515625 (which would be “55.000000000000014”).
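One way to inspect the exact values stored for these floats is to convert them to decimal.Decimal (which preserves the stored value exactly) or to look at their hexadecimal representation; a quick check, with the expected output shown as comments:
from decimal import Decimal

print(Decimal(1.1))       # 1.100000000000000088817841970012523233890533447265625
print(Decimal(50 * 1.1))  # 55.00000000000000710542735760100185871124267578125
print(Decimal(30 * 1.1))  # 33
print((1.1).hex())        # 0x1.199999999999ap+0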

what's the difference between "%0.6X" and "%06X" in python?

what's the difference between "%0.6X" and "%06X" in Python?
def formatTest():
    print "%0.6X" % 1024
    print "%06X" % 1024

if __name__ == '__main__':
    formatTest()
The result is :
000400
000400
https://docs.python.org/2/library/stdtypes.html#string-formatting
A conversion specifier contains two or more characters and has the following components, which must occur in this order:
The '%' character, which marks the start of the specifier.
Mapping key (optional), consisting of a parenthesised sequence of characters (for example, (somename)).
Conversion flags (optional), which affect the result of some conversion types.
Minimum field width (optional). If specified as an '*' (asterisk), the actual width is read from the next element of the tuple in values, and the object to convert comes after the minimum field width and optional precision.
Precision (optional), given as a '.' (dot) followed by the precision. If specified as '*' (an asterisk), the actual width is read from the next element of the tuple in values, and the value to convert comes after the precision.
Length modifier (optional).
Conversion type.
So the documentation doesn't clearly state what the interaction of width versus precision is. Let's explore some more.
>>> '%.4X' % 1024
'0400'
>>> '%6.4X' % 1024
' 0400'
>>> '%#.4X' % 1024
'0x0400'
>>> '%#8.4X' % 1024
' 0x0400'
>>> '%#08.4X' % 1024
'0x000400'
Curious. It appears that width (the part before the .) controls the whole field and space-pads by default, unless the 0 flag is given. Precision (the part after the .) controls only the digits of the number itself, and always 0-pads.
Let's take a look at new-style formatting. It's the future! (And by future I mean it's available now and has been for many years.)
https://docs.python.org/2/library/string.html#format-specification-mini-language
width is a decimal integer defining the minimum field width. If not specified, then the field width will be determined by the content.
When no explicit alignment is given, preceding the width field by a zero ('0') character enables sign-aware zero-padding for numeric types. This is equivalent to a fill character of '0' with an alignment type of '='.
The precision is a decimal number indicating how many digits should be displayed after the decimal point for a floating point value formatted with 'f' and 'F', or before and after the decimal point for a floating point value formatted with 'g' or 'G'. For non-number types the field indicates the maximum field size - in other words, how many characters will be used from the field content. The precision is not allowed for integer values.
Much more clearly specified! {0:06X} is valid, {0:0.6X} is not.
>>> '{0:06x}'.format(1024)
'000400'
>>> '{0:0.6x}'.format(1024)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Precision not allowed in integer format specifier
06 means that if the value passed in is shorter than 6 digits, it will be padded with leading 0s to fill that width. The X denotes the type of data; in this case the format string expects a hexadecimal number.
1024 in hexadecimal is 400, which is why you get 000400 as your result.
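You can confirm the hexadecimal value quickly in the interpreter:
>>> hex(1024)
'0x400'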
For 0.6X, the . introduces the precision, i.e. how many digits to show. So %0.6X means:
% - start of the format specification
A 0 flag, which means that numeric values are padded with 0 to fill the field width.
A ., which is the precision modifier. The number after it (6) is the minimum number of digits to display.
Finally, the X, which is the conversion type, in this case hexadecimal.
Since hexadecimal integers have no fractional component, the precision here just pads with leading zeros, so the results of both operations are the same.
These happen to be equivalent. That doesn't mean you can always ignore the ., though; the reason for the equivalence is pretty specific and doesn't generalize.
0.6X specifies a precision of 6, whereas 06X specifies a minimum field width of 6. The documentation doesn't say what a precision does for the X conversion type, but Python follows the behavior of printf here, where a precision for X is treated as a minimum number of digits to print.
With 06X, the formatting produces at least 6 digits, adding leading zeros if necessary. With 0.6X, the formatting produces a result at least 6 characters wide. The result would be padded with spaces, but the 0 says to pad with zeros instead. Overall, the behavior works out to be the same.
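Two quick cases where width and precision visibly come apart (the outputs below are what CPython's printf-style formatting should produce; the negative value is used purely for illustration):
>>> '%8.6X' % 1024    # width 8, precision 6: six digits, space-padded to width 8
'  000400'
>>> '%06X' % -1024    # width 6 counts the sign, so only five digits remain
'-00400'
>>> '%.6X' % -1024    # precision 6 demands six digits, regardless of the sign
'-000400'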

What is difference between {:.4e} and {:2.4} in Python scientific notation

I can't quite understand the difference between the two print statements below for the number I am trying to express in scientific notation. I thought the bottom one was supposed to allow 2 spaces for the printed result and move the decimal place 4 times, but the result I get does not corroborate that understanding. As for the first one, what does 4e mean?
>>> print('{:.4e}'.format(3454356.7))
3.4544e+06
>>> print('{:2.4}'.format(3454356.7))
3.454e+06
All help greatly appreciated.
In the first example, .4e means 4 digits after the decimal point, in scientific notation. You can see that by doing:
>>> print('{:.4e}'.format(3454356.7))
3.4544e+06
>>> print('{:.5e}'.format(3454356.7))
3.45436e+06
>>> print('{:.6e}'.format(3454356.7))
3.454357e+06
In the second example, .4 means 4 significant figures, and 2 means a minimum field width of 2 characters:
>>> print('{:2.4}'.format(3454356.7))
3.454e+06
>>> print('{:2.5}'.format(3454356.7))
3.4544e+06
>>> print('{:2.6}'.format(3454356.7))
3.45436e+06
Testing with a different value in place of the 2:
>>> print('-{:20.6}'.format(3454356.7))
-         3.45436e+06
You can learn more from the python documentation on format
If you want to produce a float, you will have to specify the float type:
>>> '{:2.4f}'.format(3454356.7)
'3454356.7000'
Otherwise, if you don't specify a type, Python will choose g as the type, for which the precision means the number of significant figures: the digits before and after the decimal point taken together. Since you have a precision of 4, it will only display 4 significant digits, falling back to scientific notation so it doesn't add false precision.
The precision is a decimal number indicating how many digits should be displayed after the decimal point for a floating point value formatted with 'f' and 'F', or before and after the decimal point for a floating point value formatted with 'g' or 'G'. For non-number types the field indicates the maximum field size - in other words, how many characters will be used from the field content. The precision is not allowed for integer values.
(source, emphasis mine)
Finally, note that the width (the 2 in the format string above) counts the full width: digits before the decimal point, digits after it, the decimal point itself, and the components of the scientific notation. The result above would have a width of 12, so in this case the width in the format string is simply ignored.
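To round this out: spelling the g out explicitly gives the same result as omitting the type, and the width only becomes visible once it exceeds the natural length of the output (a quick check, with the outputs as the interpreter prints them):
>>> '{:.4g}'.format(3454356.7)
'3.454e+06'
>>> '{:15.4e}'.format(3454356.7)
'     3.4544e+06'
>>> '{:15.4f}'.format(3454356.7)
'   3454356.7000'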

How to compute a double precision float score from the first 8 bytes of a string in Python?

Trying to get a double-precision floating point score from a UTF-8 encoded string object in Python. The idea is to grab the first 8 bytes of the string and create a float, so that the strings, ordered by their score, would be ordered lexicographically according to their first 8 bytes (or possibly their first 63 bits, after forcing them all to be positive to avoid sign errors).
For example:
get_score(u'aaaaaaa') < get_score(u'aaaaaaab') < get_score(u'zzzzzzzz')
I have tried to compute the score in an integer using bit-shift-left and XOR, but I am not sure of how to translate that into a float value. I am also not sure if there is a better way to do this.
How should the score for a string be computed so the condition I specified before is met?
Edit: The string object is UTF-8 encoded (as per @Bakuriu's comment).
float won't give you 64 bits of precision. Use integers instead.
import struct

def get_score(s):
    return struct.unpack('>Q', (u'\0\0\0\0\0\0\0\0' + s[:8])[-8:])[0]
In Python 3:
def get_score(s):
    return struct.unpack('>Q', ('\0\0\0\0\0\0\0\0' + s[:8])[-8:].encode('ascii', 'strict'))[0]
EDIT:
For floats, with 6 characters:
def get_score(s):
    return struct.unpack('>d', (u'\0\1' + (u'\0\0\0\0\0\0\0\0' + s[:6])[-6:]).encode('ascii', 'strict'))[0]
You will need to set up the entire alphabet and do the conversion by hand, since conversions to bases greater than 36 are not built in; to do that you only need to define the complete alphabet to use. If it were an ASCII string, for instance, you would convert the input string to a long in base 256, using the whole ASCII table as the alphabet.
There is an example of the full functions to do this here: string to base 62 number
Also, you don't need to worry about negative versus positive numbers when doing this, since the string consisting of the first character of the alphabet encodes to the minimum possible number in the representation, which is the negative value with the largest absolute value (in your case -2**63). That is the correct minimum and lets you compare scores with < and >.
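A minimal sketch of that base-256 idea in Python 3, assuming the input is restricted to ASCII (the string_score helper and its right-padding choice are illustrative, not taken from the linked answer):
def string_score(s, width=8):
    # Interpret the first `width` ASCII characters as a big-endian base-256
    # integer; shorter strings are padded on the right so that prefixes sort first.
    data = s[:width].encode('ascii').ljust(width, b'\0')
    return int.from_bytes(data, 'big')

assert string_score(u'aaaaaaa') < string_score(u'aaaaaaab') < string_score(u'zzzzzzzz')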
Hope it helps!
