Non-ASCII Python identifiers and reflectivity [duplicate] - python

This question already has answers here:
Identifier normalization: Why is the micro sign converted into the Greek letter mu?
(2 answers)
Closed 5 years ago.
I have learnt from PEP 3131 that non-ASCII identifiers were supported in Python, though it's not considered best practice.
However, I get this strange behaviour, where my 𝜏 identifier (U+1D70F) seems to be automatically converted to τ (U+03C4).
class Base(object):
    def __init__(self):
        self.𝜏 = 5  # defined with U+1D70F

a = Base()
print(a.𝜏)       # 5             # (U+1D70F)
print(a.τ)       # 5 as well     # (U+03C4) ? another way to access it?
d = a.__dict__   # {'τ': 5}      # (U+03C4) ? seems converted
print(d['τ'])    # 5             # (U+03C4) ? consistent with the conversion
print(d['𝜏'])    # KeyError: '𝜏' # (U+1D70F) ?! unexpected!
Is that expected behaviour? Why does this silent conversion occur? Does it have anything to do with NFKC normalization? I thought this was only for canonically ordering Unicode character sequences...

Per the documentation on identifiers:
All identifiers are converted into the normal form NFKC while parsing;
comparison of identifiers is based on NFKC.
You can see that U+03C4 is the appropriate result using unicodedata:
>>> import unicodedata
>>> unicodedata.normalize('NFKC', '𝜏')
'τ'
However, this conversion doesn't apply to string literals, like the one you're using as a dictionary key, hence it's looking for the unconverted character in a dictionary that only contains the converted character.
self.𝜏 = 5 # implicitly converted to "self.τ = 5"
a.𝜏 # implicitly converted to "a.τ"
d['𝜏'] # not converted
You can see similar problems with e.g. string literals used with getattr:
>>> getattr(a, '𝜏')
Traceback (most recent call last):
File "python", line 1, in <module>
AttributeError: 'Base' object has no attribute '𝜏'
>>> getattr(a, unicodedata.normalize('NFKC', '𝜏'))
5
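The point above can be wrapped in a small helper. This is only a sketch: `nfkc_getattr` is a hypothetical name, not part of the standard library, and it assumes you want string lookups to follow the same NFKC rule the parser applies to identifiers.

```python
import unicodedata


def nfkc_getattr(obj, name, *default):
    """Hypothetical helper: NFKC-normalize `name` before calling getattr()."""
    return getattr(obj, unicodedata.normalize('NFKC', name), *default)


class Base:
    def __init__(self):
        self.𝜏 = 5  # written with U+1D70F; the parser stores it as U+03C4


a = Base()
value = nfkc_getattr(a, '\U0001d70f')  # look up using the unnormalized name
keys = list(vars(a))                   # the only key is the normalized U+03C4
```

The same `unicodedata.normalize('NFKC', ...)` call can be applied to a dictionary key before indexing `a.__dict__`.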

Related

What to do with the error [<__main__.Student object at 0x000001E84D968090>, <__main__.Student object at 0x000001E84D95E750>] [duplicate]

This question already has answers here:
How to print instances of a class using print()?
(12 answers)
Closed 7 months ago.
Well this interactive python console snippet will tell everything:
>>> class Test:
...     def __str__(self):
...         return 'asd'
...
>>> t = Test()
>>> print(t)
asd
>>> l = [Test(), Test(), Test()]
>>> print(l)
[<__main__.Test instance at 0x00CBC1E8>, <__main__.Test instance at 0x00CBC260>,
 <__main__.Test instance at 0x00CBC238>]
Basically I would like to get three asd string printed when I print the list. I have also tried pprint but it gives the same results.
Try:
class Test:
    def __repr__(self):
        return 'asd'
And read this documentation link:
The suggestion in other answers to implement __repr__ is definitely one possibility. If that's unfeasible for whatever reason (existing type, __repr__ needed for reasons other than aesthetic, etc), then just do
print [str(x) for x in l]
or, as some are sure to suggest, map(str, l) (just a bit more compact).
You need to make a __repr__ method:
>>> class Test:
...     def __str__(self):
...         return 'asd'
...     def __repr__(self):
...         return 'zxcv'
...
>>> [Test(), Test()]
[zxcv, zxcv]
>>> print _
[zxcv, zxcv]
Refer to the docs:
object.__repr__(self)
Called by the repr() built-in function and by string conversions (reverse quotes) to compute the “official” string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines __repr__() but not __str__(), then __repr__() is also used when an “informal” string representation of instances of that class is required.
This is typically used for debugging, so it is important that the representation is information-rich and unambiguous.
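For reference, here is a minimal Python 3 sketch of both suggestions from the answers above. Aliasing `__repr__` to `__str__` is just one possible choice and assumes you do not need a separate debugging representation.

```python
class Test:
    def __str__(self):
        return 'asd'

    # Containers such as list display their items with repr(),
    # so reuse __str__ for the "official" representation as well.
    __repr__ = __str__


items = [Test(), Test(), Test()]
as_list = str(items)                  # what print(items) would show
joined = ', '.join(map(str, items))   # alternative that leaves __repr__ alone
```

If `__repr__` must stay informative for debugging, the `map(str, ...)` variant avoids touching it.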

What does "a: 5" without curly braces mean? [duplicate]

This question already has answers here:
Use of colon in variable declaration [duplicate]
(1 answer)
What is this odd colon behavior doing?
(2 answers)
Closed 12 months ago.
I noticed that if I type, for instance >>> a: 5 as input of the python interpreter, it does not return an error (whether or not the variable 'a' is already defined). However, if I type >>> a afterwards, I get the usual NameError.
My question is: what does the python interpreter do when I type this kind of dictionary syntax without the curly braces?
Originally, I found this syntax in matplotlib's matplotlibrc file (see here).
It defines a type hint. But without a value, the variable will not be initialized in the global scope.
>>> a: int = 3
>>> globals()['__annotations__']
{'a': <class 'int'>}
>>> a
3
>>> b: str
>>> globals()['__annotations__']
{'a': <class 'int'>, 'b': <class 'str'>}
>>> b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'b' is not defined
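The same effect can be sketched with a class body instead of the interactive prompt. The `Config` class is purely illustrative; it shows that the expression after the colon is stored as an annotation while no name is bound unless a value is assigned.

```python
class Config:
    a: 5          # annotation only: 5 is recorded, but `a` is never bound
    b: int = 3    # annotation plus assignment: `b` exists and equals 3


annotations = Config.__annotations__
has_a = hasattr(Config, 'a')
has_b = hasattr(Config, 'b')
```

Here `annotations` maps `'a'` to `5` and `'b'` to `int`, while only `b` exists as an attribute.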
a: 5 in the interpreter uses the type hint syntax, but this has nothing to do with what you see in matplotlib's documentation, which describes a configuration file (and thus is not Python code).
From the documentation:
You can create custom styles and use them by calling style.use with the path or URL to the style sheet.
For example, you might want to create ./images/presentation.mplstyle with the following:
axes.titlesize : 24
axes.labelsize : 20
lines.linewidth : 3
lines.markersize : 10
xtick.labelsize : 16
ytick.labelsize : 16
The above is not python code

Converting string with leading-zero integer to json

I convert a string to a json-object using the json-library:
a = '{"index":1}'
import json
json.loads(a)
{'index': 1}
However, if I instead change the string a to contain a leading 0, then it breaks down:
a = '{"index":01}'
import json
json.loads(a)
JSONDecodeError: Expecting ',' delimiter
I believe this is due to the fact that it is invalid JSON if an integer begins with a leading zero as described in this thread.
Is there a way to remedy this? If not, then I guess the best way is to remove any leading zeroes by a regex from the string first, then convert to json?
A leading 0 in a JSON number literal is invalid unless the literal is exactly 0 or starts with 0. (optionally preceded by a minus sign). The Python json module is strict and will not accept such literals, in part because a leading 0 is sometimes used to denote octal rather than decimal notation, so deserialising such numbers could lead to unintended programming errors: should 010 be parsed as 8 (octal) or as 10 (decimal)?
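A quick way to see which literals the standard decoder accepts is a small checker like the one below. `is_valid_json` is a hypothetical helper name used only for this sketch.

```python
import json


def is_valid_json(text):
    """Return True if the standard json module can decode `text`."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


accepted = [is_valid_json(s) for s in ('0', '0.5', '-0', '{"index": 1}')]
rejected = [is_valid_json(s) for s in ('01', '010', '{"index":01}')]
```

Every entry in `accepted` is `True` and every entry in `rejected` is `False`.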
You can create a decoder that will do what you want, but you will need to heavily hack the json module or rewrite much of its internals. Either way, you will see a performance slow down as you will no longer be using the C implementation of the module.
Below is an implementation that can decode JSON which contains numbers with any number of leading zeros.
import json
import re
import threading

# a more lenient number regex (modified from json.scanner.NUMBER_RE)
NUMBER_RE = re.compile(
    r'(-?(?:\d*))(\.\d+)?([eE][-+]?\d+)?',
    (re.VERBOSE | re.MULTILINE | re.DOTALL))

# we are going to be messing with the internals of `json.scanner`. As such we
# want to return it to its initial state when we're done with it, but we need to
# do so in a thread safe way.
_LOCK = threading.Lock()

def thread_safe_py_make_scanner(context, *, number_re=json.scanner.NUMBER_RE):
    with _LOCK:
        original_number_re = json.scanner.NUMBER_RE
        try:
            json.scanner.NUMBER_RE = number_re
            return json.scanner._original_py_make_scanner(context)
        finally:
            json.scanner.NUMBER_RE = original_number_re

json.scanner._original_py_make_scanner = json.scanner.py_make_scanner
json.scanner.py_make_scanner = thread_safe_py_make_scanner

class MyJsonDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # overwrite the stricter scan_once implementation
        self.scan_once = json.scanner.py_make_scanner(self, number_re=NUMBER_RE)

d = MyJsonDecoder()
n = d.decode('010')
assert n == 10
json.loads('010') # check the normal route still raise an error
I would stress that you shouldn't rely on this as a proper solution. Rather, it's a quick hack to help you decode malformed JSON that is nearly, but not quite valid. It's useful if recreating the JSON in a valid form is not possible for some reason.
First, using regex on JSON is evil, almost as bad as killing a kitten.
If you want to represent 01 as a valid JSON value, then consider using this structure:
a = '{"index" : "01"}'
import json
json.loads(a)
If you need the string literal 01 to behave like a number, then consider just casting it to an integer in your Python script.
See the post "How to convert string int JSON into real int with json.loads".
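As a minimal sketch of the suggestion above, assuming you control the producer and can quote the value, the string can be cast after decoding:

```python
import json

# The value is sent as a JSON string, so the leading zero is legal.
data = json.loads('{"index": "01"}')

# Cast in Python where a number is needed; int() accepts leading zeros in strings.
index = int(data["index"])
```

The original text `"01"` stays available in `data` if the zero padding matters elsewhere.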
You need to use your own version of the decoder.
More information can be found in the simplejson documentation on GitHub:
https://github.com/simplejson/simplejson/blob/master/index.rst
c = '{"value": 02}'
value= json.loads(json.dumps(c))
print(value)
This seems to work, which is strange:
>>> c = '{"value": 02}'
>>> import json
>>> value = json.loads(json.dumps(c))
>>> print(value)
{"value": 02}
>>> c = '{"value": 0002}'
>>> value = json.loads(json.dumps(c))
>>> print(value)
{"value": 0002}
As @Dunes pointed out, json.loads here simply returns the original string, because json.dumps(c) serializes the string c itself rather than parsing it, so this is not a valid solution.
However,
DEMJSON seems to decode it properly.
https://pypi.org/project/demjson/ -- alternative way
>>> c = '{"value": 02}'
>>> import demjson
>>> demjson.decode(c)
{'value': 2}

Creating new conversion specifier in Python

In python we have conversion specifier like
'{0!s}'.format(10)
which prints
'10'
How can I make my own conversion specifiers like
'{0!d}'.format(4561321)
which print integers in following format
4,561,321
Or converts it into binary like
'{0!b}'.format(2)
which prints
10
What are the classes I need to inherit and which functions I need to modify? If possible please provide a small example.
Thanks!!
What you want to do is impossible, because built-in types cannot be modified and literals always refer to built-in types.
There is a special method to handle the formatting of values, namely __format__. However, it only handles the format string, not the conversion specifier, i.e. you can customize how {0:d} is handled but not how {0!d} is. The only conversions that work with ! are s and r (plus a in Python 3).
Note that d and b already exist as format specifiers:
>>> '{0:b}'.format(2)
'10'
In any case you could implement your own class that handles formatting:
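class MyInt:
    def __init__(self, value):
        self.value = value

    def __format__(self, fmt):
        if fmt == 'd':
            text = list(str(self.value))
        elif fmt == 'b':
            text = list(bin(self.value)[2:])
        else:
            # any other format spec would otherwise leave `text` undefined
            raise ValueError('unsupported format spec: %r' % fmt)
        # insert a comma before every group of three digits, right to left
        for i in range(len(text)-3, 0, -3):
            text.insert(i, ',')
        return ''.join(text)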
Used as:
>>> '{0:d}'.format(MyInt(5000000))
'5,000,000'
>>> '{0:b}'.format(MyInt(8))
'1,000'
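A different route, not taken in the answer above: str.format itself cannot be taught new conversions, but the standard string.Formatter class exposes a convert_field hook that a subclass can override. The sketch below assumes you are willing to call a formatter object instead of str.format; `MyFormatter` and the meanings chosen for `!d` and `!b` are illustrative only.

```python
import string


class MyFormatter(string.Formatter):
    """Sketch: adds hypothetical !d (thousands separators) and !b (binary)."""

    def convert_field(self, value, conversion):
        if conversion == 'd':
            return format(value, ',')   # e.g. 4561321 -> '4,561,321'
        if conversion == 'b':
            return format(value, 'b')   # e.g. 2 -> '10'
        # fall back to the built-in conversions (s, r, a)
        return super().convert_field(value, conversion)


fmt = MyFormatter()
with_commas = fmt.format('{0!d}', 4561321)
as_binary = fmt.format('{0!b}', 2)
as_string = fmt.format('{0!s}', 10)
```

Plain `'{0!d}'.format(...)` still raises an error; only calls routed through the `MyFormatter` instance understand the extra conversions.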
Try not to make your own; use the format specifiers already built into Python. You can use:
'{0:b}'.format(2) # for binary
'{0:d}'.format(2) # for integer
'{0:x}'.format(2) # for hexadecimal
'{0:f}'.format(2) # for float
'{0:e}'.format(2) # for exponential
Please refer to https://docs.python.org/2/library/string.html#formatspec for more.
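For the two outputs asked for in the question, the built-in format specification mini-language already covers both cases. A short sketch:

```python
# The ',' option inserts a thousands separator.
grouped = '{0:,}'.format(4561321)

# The 'b' presentation type renders an integer in binary.
binary = '{0:b}'.format(2)

# The same specifiers work in f-strings (Python 3.6+).
grouped_f = f'{4561321:,}'
```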

Accentuation in python: structure and for loop

I've got a set filled with values taken from a JSON document. When I print the set, I get the following output:
set(['Path\xc3\xa9', 'Synergy Cin\xc3\xa9ma'])
but if I print each element in a for loop, I get the following output:
Pathé
Synergy Cinéma
Why don't I get the same encoding in both cases?
I guess you are using Python 2, where these values are byte strings holding UTF-8 encoded text. Printing the set shows the repr() of each element, which escapes every non-ASCII byte as \xNN. Printing a single element writes its raw bytes to the terminal, which decodes them as UTF-8 and displays the accented characters.
You can obtain the default encoding with the function sys.getdefaultencoding().
Note that in Python 3 the default encoding is UTF-8 and str is always Unicode ("any string created (...) is stored as Unicode", according to the documentation), so you won't see exactly the same behaviour. The Python 2 snippet also shows that, on a UTF-8 terminal, the text literal and the byte literal hash identically because they are the same byte string, so a set treats them as one element:
Python 2 :
>>> a = b'Path\xc3\xa9'
>>> a
'Path\xc3\xa9'
>>> print(a)
Pathé
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> hash('Pathé')
8776754739882320435
>>> hash(b'Path\xc3\xa9')
8776754739882320435
Python 3:
>>> a = b'Path\xc3\xa9'
>>> a
b'Path\xc3\xa9'
>>> print(a)
b'Path\xc3\xa9'
>>> print(a.decode())
Pathé
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> hash("Pathé")
1530394699459763000
>>> hash(b"Path\xc3\xa9")
1621747577200686773
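In Python 3, one way to get readable output for a whole collection is to decode the bytes before printing. This sketch assumes the data really is UTF-8, as the byte values in the question suggest.

```python
raw = {b'Path\xc3\xa9', b'Synergy Cin\xc3\xa9ma'}

# repr() of bytes keeps non-ASCII bytes escaped, which is what
# printing the container itself shows.
escaped = repr(b'Path\xc3\xa9')

# Decoding yields str objects, which print with the accented characters.
decoded = sorted(item.decode('utf-8') for item in raw)
```

Printing `decoded` then shows the accented names directly instead of the `\xc3\xa9` escapes.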
