I have been reading the Python documentation on the format operator '%' and have encountered some questions.
A conversion specifier contains two or more characters and has the following components, which must occur in this order:
1. The '%' character, which marks the start of the specifier.
2. Mapping key (optional), consisting of a parenthesised sequence of characters (for example, (somename)).
3. Conversion flags (optional), which affect the result of some conversion types.
4. Minimum field width (optional). If specified as an '*' (asterisk), the actual width is read from the next element of the tuple in values, and the object to convert comes after the minimum field width and optional precision.
5. Precision (optional), given as a '.' (dot) followed by the precision. If specified as '*' (an asterisk), the actual precision is read from the next element of the tuple in values, and the value to convert comes after the precision.
6. Length modifier (optional).
7. Conversion type.
A length modifier (h, l, or L) may be present, but is ignored as it is not necessary for Python – so e.g. %ld is identical to %d.
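As a rough illustration of the '*' forms and the length modifier described above, here is a minimal sketch. The variable names are made up for this example only.

```python
# '*' as the width: the width (6) comes first in the tuple, then the value.
star_width = '%*d' % (6, 42)

# '*' as the precision: the precision (2) comes first, then the value.
star_precision = '%.*f' % (2, 3.14159)

# Both at once: width, then precision, then the value to convert.
star_both = '%*.*f' % (8, 3, 3.14159)

# The length modifiers h, l and L are accepted but ignored.
with_modifier = '%ld' % 42
without_modifier = '%d' % 42

print(repr(star_width))      # '    42'
print(repr(star_precision))  # '3.14'
print(repr(star_both))       # '   3.142'
print(with_modifier == without_modifier)  # True
```

So '*' simply means "take this number from the arguments instead of writing it in the format string", which is handy when the width is only known at run time.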
These are the two parts that I don't understand:
For the minimum field width, what does "If specified as an '*' (asterisk), the actual width is read from the next element of the tuple in values, and the object to convert comes after the minimum field width and optional precision." mean?
Similarly, for the precision, what does "If specified as '*' (an asterisk), the actual precision is read from the next element of the tuple in values, and the value to convert comes after the precision." mean?
For the length modifier (h, l, L), what does each of them do to the formatting?
From unicodedata doc:
unicodedata.digit(chr[, default]) Returns the digit value assigned to
the character chr as integer. If no such value is defined, default is
returned, or, if not given, ValueError is raised.
unicodedata.numeric(chr[, default]) Returns the numeric value assigned
to the character chr as float. If no such value is defined, default is
returned, or, if not given, ValueError is raised.
Can anybody explain to me the difference between those two functions?
Here one can read the implementation of both functions, but the difference is not evident to me from a quick look because I'm not familiar with the CPython implementation.
EDIT 1:
An example that shows the difference would be nice.
EDIT 2:
Examples that complement the comments and the spectacular answer from @user2357112:
import unicodedata

print(unicodedata.digit('1')) # Decimal digit one.
print(unicodedata.digit('١')) # ARABIC-INDIC DIGIT ONE.
print(unicodedata.numeric('Ⅱ')) # Roman numeral two.
print(unicodedata.numeric('¼')) # Fraction representing one quarter.
print(unicodedata.digit('¼')) # Not a digit, so "ValueError: not a digit" is raised.
Short answer:
If a character represents a decimal digit, so things like 1, ¹ (SUPERSCRIPT ONE), ① (CIRCLED DIGIT ONE), ١ (ARABIC-INDIC DIGIT ONE), unicodedata.digit will return the digit that character represents as an int (so 1 for all of these examples).
If the character represents any numeric value, so things like ⅐ (VULGAR FRACTION ONE SEVENTH) and all the decimal digit examples, unicodedata.numeric will give that character's numeric value as a float.
For technical reasons, more recent digit characters like 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO) may raise a ValueError from unicodedata.digit.
Long answer:
Unicode characters all have a Numeric_Type property. This property can have 4 possible values: Numeric_Type=Decimal, Numeric_Type=Digit, Numeric_Type=Numeric, or Numeric_Type=None.
Quoting the Unicode standard, version 10.0.0, section 4.6,
The Numeric_Type=Decimal property value (which is correlated with the General_Category=Nd
property value) is limited to those numeric characters that are used in decimal-radix
numbers and for which a full set of digits has been encoded in a contiguous range,
with ascending order of Numeric_Value, and with the digit zero as the first code point in
the range.
Numeric_Type=Decimal characters are thus decimal digits fitting a few other specific technical requirements.
Decimal digits, as defined in the Unicode Standard by these property assignments, exclude
some characters, such as the CJK ideographic digits (see the first ten entries in Table 4-5),
which are not encoded in a contiguous sequence. Decimal digits also exclude the compatibility
subscript and superscript digits, to prevent simplistic parsers from misinterpreting
their values in context. (For more information on superscript and subscripts, see
Section 22.4, Superscript and Subscript Symbols.) Traditionally, the Unicode Character
Database has given these sets of noncontiguous or compatibility digits the value Numeric_Type=Digit, to recognize the fact that they consist of digit values but do not necessarily
meet all the criteria for Numeric_Type=Decimal. However, the distinction between
Numeric_Type=Digit and the more generic Numeric_Type=Numeric has proven not to be
useful in implementations. As a result, future sets of digits which may be added to the standard
and which do not meet the criteria for Numeric_Type=Decimal will simply be
assigned the value Numeric_Type=Numeric.
So Numeric_Type=Digit was historically used for other digits not fitting the technical requirements of Numeric_Type=Decimal, but they decided that wasn't useful, and digit characters not meeting the Numeric_Type=Decimal requirements have just been assigned Numeric_Type=Numeric since Unicode 6.3.0. For example, 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO) introduced in Unicode 7.0 has Numeric_Type=Numeric.
Numeric_Type=Numeric is for all characters that represent numbers and don't fit in the other categories, and Numeric_Type=None is for characters that don't represent numbers (or at least, don't under normal usage).
All characters with a non-None Numeric_Type property have a Numeric_Value property representing their numeric value. unicodedata.digit will return that value as an int for characters with Numeric_Type=Decimal or Numeric_Type=Digit, and unicodedata.numeric will return that value as a float for characters with any non-None Numeric_Type.
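To see the three Numeric_Type cases side by side, here is a small sketch. The helper name `describe` and the choice of sample characters are mine. It also calls unicodedata.decimal, which is not discussed above but is the standard-library counterpart that only accepts Numeric_Type=Decimal characters.

```python
import unicodedata

# DIGIT ONE, SUPERSCRIPT ONE, CIRCLED DIGIT ONE, ARABIC-INDIC DIGIT ONE,
# ROMAN NUMERAL TWO, VULGAR FRACTION ONE QUARTER
samples = ['1', '\u00b9', '\u2460', '\u0661', '\u2161', '\u00bc']


def describe(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


for ch in samples:
    print(unicodedata.name(ch), describe(ch))
```

Passing a default (here None) avoids the ValueError, so characters that only have a numeric value show up as (None, None, value).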
What's the difference between "%0.6X" and "%06X" in Python?
def formatTest():
    print("%0.6X" % 1024)
    print("%06X" % 1024)

if __name__ == '__main__':
    formatTest()
The result is:
000400
000400
https://docs.python.org/2/library/stdtypes.html#string-formatting
A conversion specifier contains two or more characters and has the following components, which must occur in this order:
The '%' character, which marks the start of the specifier.
Mapping key (optional), consisting of a parenthesised sequence of characters (for example, (somename)).
Conversion flags (optional), which affect the result of some conversion types.
Minimum field width (optional). If specified as an '*' (asterisk), the actual width is read from the next element of the tuple in values, and the object to convert comes after the minimum field width and optional precision.
Precision (optional), given as a '.' (dot) followed by the precision. If specified as '*' (an asterisk), the actual width is read from the next element of the tuple in values, and the value to convert comes after the precision.
Length modifier (optional).
Conversion type.
So the documentation doesn't clearly state what the interaction of width versus precision is. Let's explore some more.
>>> '%.4X' % 1024
'0400'
>>> '%6.4X' % 1024
'  0400'
>>> '%#.4X' % 1024
'0x0400'
>>> '%#8.4X' % 1024
'  0x0400'
>>> '%#08.4X' % 1024
'0x000400'
Curious. It appears that width (the part before .) controls the whole field, and space-pads by default, unless flagged with 0. Precision (the part after .) controls only the integer part, and always 0-pads.
Let's take a look at new-style formatting. It's the future! (And by future I mean it's available now and has been for many years.)
https://docs.python.org/2/library/string.html#format-specification-mini-language
width is a decimal integer defining the minimum field width. If not specified, then the field width will be determined by the content.
When no explicit alignment is given, preceding the width field by a zero ('0') character enables sign-aware zero-padding for numeric types. This is equivalent to a fill character of '0' with an alignment type of '='.
The precision is a decimal number indicating how many digits should be displayed after the decimal point for a floating point value formatted with 'f' and 'F', or before and after the decimal point for a floating point value formatted with 'g' or 'G'. For non-number types the field indicates the maximum field size - in other words, how many characters will be used from the field content. The precision is not allowed for integer values.
Much more clearly specified! {0:06X} is valid, {0:0.6X} is not.
>>> '{0:06x}'.format(1024)
'000400'
>>> '{0:0.6x}'.format(1024)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Precision not allowed in integer format specifier
06 means that the value passed in, if less than 6 characters wide, will be padded on the left with 0 to fill that width. The x denotes the conversion type; in this case the integer is rendered in hexadecimal.
1024 in hexadecimal is 400, which is why you get 000400 as your result.
For 0.6x, the . denotes the precision. So, %0.6x means:
% - start of the format string specification
A 0, which means that for numerical values, pad them by 0 to meet the format specification.
A ., which is the precision modifier. The number after it (6) is the precision, which for integer conversions is the minimum number of digits to show.
Finally, the x which is the conversion type, in this case hexadecimal.
For an integer conversion such as x, the precision acts as a minimum number of digits rather than a count of decimal places, so for this value the results of both operations are the same.
These happen to be equivalent. That doesn't mean you can always ignore the ., though; the reason for the equivalence is pretty specific and doesn't generalize.
0.6X specifies a precision of 6, whereas 06X specifies a minimum field width of 6. The documentation doesn't say what a precision does for the X conversion type, but Python follows the behavior of printf here, where a precision for X is treated as a minimum number of digits to print.
With 0.6X, the formatting produces at least 6 digits, adding leading zeros if necessary. With 06X, the formatting produces a result at least 6 characters wide. The result would be padded with spaces, but the 0 says to pad with zeros instead. Overall, the behavior works out to be the same.
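One case where the two forms should diverge is a negative value, because the sign counts toward the field width but not toward the number of digits. A minimal sketch (variable names are illustrative):

```python
value = -1024  # -400 in hexadecimal

# Precision 6: at least six digits; the sign is added on top of them.
precision_form = '%0.6X' % value

# Width 6 with the 0 flag: at least six characters in total, sign included.
width_form = '%06X' % value

print(precision_form)  # -000400
print(width_form)      # -00400
```

For non-negative values without the '#' flag the two forms give the same text, which is why the original example could not tell them apart.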
What does '=' alignment mean in the following error message, and why does this code cause it?
>>> "{num:03}".format(num="1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: '=' alignment not allowed in string format specifier
The code has a subtle problem: the input value "1" is text, not a number. But the error message doesn't appear to have anything to do with that.
Nothing in the error message indicates why “'=' alignment” is relevant,
and it does not appear in the code. So what is the significance of emitting that error message?
The error message occurs because '=' alignment has been implied by the format specifier.
The str.format format spec mini-language parser has decided on the
alignment specifier “=” because:
Preceding the width field by a zero ('0') character enables
sign-aware zero-padding for numeric types. This is equivalent to a
fill character of '0' with an alignment type of '='.
So by specifying 0N as the “zero-padding to N width”, you have implied both “the input is a numeric type”, and “the zeros should go between the sign and the digits”. That latter implication is what is meant by '=' alignment.
Since the value "1" is not numeric, the “=”-alignment handling code raises that exception. The message is written expecting you know what it's talking about because you requested (by implication) the “=” alignment.
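To make the '=' alignment concrete, here is a small sketch comparing it with '>' on a negative number, and requesting '=' explicitly for a string. The variable names are mine. As far as I know, newer Python versions (3.10 and later) no longer raise for the implied form "{num:03}" with a string, so the sketch asks for '=' explicitly, which should still be rejected.

```python
# '=' puts the padding between the sign and the digits.
explicit = format(-12, '0=6')

# A leading 0 before the width implies fill '0' and align '=' for numbers.
implied = format(-12, '06')

# '>' pads to the left of everything, including the sign.
right = format(-12, '0>6')

# Strings have no sign, so an explicit '=' is rejected for them.
error = None
try:
    format('1', '0=3')
except ValueError as exc:
    error = str(exc)

print(explicit, implied, right)  # -00012 -00012 000-12
print(error)
```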
Yes, I think that error message needs to be improved. I've raised an issue for that.
A workaround is to use '>' (right justify) alignment explicitly, with the syntax:
[[fill]align][width]
with align being >, fill being 0 and width being 3.
>>> "{num:0>3}".format(num="1")
'001'
The problem was that there is a different 0 in the format specification:
format_spec ::= [[fill]align][sign][#][0][width][grouping_option][.precision][type]
#                                     ^^^ This one
That zero just makes fill default to 0 and align to =.
= alignment is specified as:
Forces the padding to be placed after the sign (if any) but before the digits. This is used for printing fields in the form ‘+000000120’. This alignment option is only valid for numeric types. It becomes the default when ‘0’ immediately precedes the field width.
Source (Python 3 docs)
This expects the argument to be a numeric type, as strings don't have signs. So we just manually set the alignment to > (right justify), the normal default for numbers.
Also note that 0 just specifies the default values for fill and align. You can change both or just the align.
>>> # fill defaults to '0', align is '>', `0` is set, width is `3`
>>> "{num:>03}".format(num=-1)
'0-1'
>>> # fill is `x`, align is '>', `0` is set (but does nothing), width is `"3"`
>>> "{num:x>03}".format(num=-1)
'x-1'
>>> # fill is `x`, align is '>', `0` is set (but does nothing), width is `"03"` (3)
>>> "{num:x>003}".format(num=-1)
'x-1'
str.__format__ doesn't know what to do with your 03 part. That only works with numbers:
>>> "{num:03}".format(num=1)
'001'
If you actually want to zero-pad a string, you can use rjust:
>>> "1".rjust(3, "0")
'001'
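Another option, not mentioned above, is str.zfill, which pads a string with zeros and keeps a leading sign in front. A quick sketch comparing it with rjust:

```python
padded_rjust = "1".rjust(3, "0")
padded_zfill = "1".zfill(3)

# With a leading sign the two differ: zfill keeps the sign first.
signed_rjust = "-1".rjust(3, "0")
signed_zfill = "-1".zfill(3)

print(padded_rjust, padded_zfill)  # 001 001
print(signed_rjust, signed_zfill)  # 0-1 -01
```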
In my case, I was trying to zero-pad a string instead of a number.
The solution was simply to convert the text to a number before applying the padding:
num_as_text = '23'
num_as_num = int(num_as_text)
padded_text = f'{num_as_num:03}'
You are passing a string ("1") where a number is required. Remove the quotes, i.e. use num=1, and it will work.
This format would be accepted (although it just produces '1:03', with no padding):
"{num}:03".format(num="1")
but the way you have the placeholder specified, {num:03}, is not. That is an interesting ValueError though. If you remove the :, the interesting error is replaced by a standard KeyError.
I wrote a function that reads from a file and checks each line according to some conditions. Each line in the file contains a number. In the function itself, I'd like to add to this number and check it again so I tried the following:
str(int(l.strip()) + 1) # 'l' being a line in the file
I noticed that I got some faulty results each time I cast a number with leading zeroes. (The file contains every possible 6-digit number, for example: 000012).
I think that the conversion to integer just discards the unnecessary leading zeroes, which throws off my algorithm later since the string length has changed.
Can I convert a string to an integer without losing the leading zeroes?
If you want to keep the zero padding on the left you should convert the number to a string again after the addition. Here's an example of string formatting for zero padding on the left for up to six characters.
In [13]: "%06d" % 88
Out[13]: '000088'
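Applied to the question, a sketch of "convert, add, then pad back to the original width" could look like this. The function name `increment_keep_width` is hypothetical, and it assumes each line holds a non-negative integer.

```python
def increment_keep_width(line):
    """Add 1 to the number in line and pad back to its original width."""
    text = line.strip()
    # zfill restores the leading zeros that int() discarded.
    return str(int(text) + 1).zfill(len(text))


print(increment_keep_width("000012\n"))  # 000013
print(increment_keep_width("000099"))    # 000100
print(increment_keep_width("999999"))    # 1000000
```

Note that the last case grows to seven characters, since zfill only pads and never truncates.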
How to truncate a string using str.format in Python? Is it even possible?
There is a width parameter mentioned in the Format Specification Mini-Language:
format_spec ::= [[fill]align][sign][#][0][width][,][.precision][type]
...
width ::= integer
...
But specifying it apparently only works for padding, not truncating:
>>> '{:5}'.format('aaa')
'aaa  '
>>> '{:5}'.format('aaabbbccc')
'aaabbbccc'
So it's more a minimal width than width really.
I know I can slice strings, but the data I process here is completely dynamic, including the format string and the args that go in. I cannot just go and explicitly slice one.
Use .precision instead:
>>> '{:5.5}'.format('aaabbbccc')
'aaabb'
According to the documentation of the Format Specification Mini-Language:
The precision is a decimal number indicating how many digits should be displayed after the decimal point for a floating point value formatted with 'f' and 'F', or before and after the decimal point for a floating point value formatted with 'g' or 'G'. For non-number types the field indicates the maximum field size - in other words, how many characters will be used from the field content. The precision is not allowed for integer values.
You may truncate using the precision parameter alone:
>>> '{:.1}'.format('aaabbbccc')
'a'
The width parameter sets the padded size (strings are left-aligned by default):
>>> '{:3}'.format('ab')
'ab '
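Since the question mentions that the format string and arguments are dynamic, here is a sketch that supplies both the width and the precision as arguments through nested replacement fields. The keyword names `width` and `limit` are just illustrative.

```python
text = 'aaabbbccc'

# Truncate to `limit` characters, then pad to `width`.
padded_and_cut = '{:{width}.{limit}}'.format(text, width=5, limit=3)

# Truncate only.
cut_only = '{:.{limit}}'.format(text, limit=4)

print(repr(padded_and_cut))  # 'aaa  '
print(repr(cut_only))        # 'aaab'
```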