How to print unicode character from a string variable? - python

I am new to programming, and I am a bit confused.
I expected both print calls to produce the same graphical Unicode exclamation mark symbol:
My experiment:
number = 10071
byteStr = number.to_bytes(4, byteorder='big')
hexStr = hex(number)
uniChar = byteStr.decode('utf-32be')
uniStr = '\\u' + hexStr[2:6]
print(f'{number} - {hexStr[2:6]} - {byteStr} - {uniChar}')
print(f'{uniStr}') # Not working
print(f'\u2757') # Working
Output:
10071 - 2757 - b"\x00\x00'W" - ❗
\u2757
❗
What is the difference between the last two lines?
Please, help me to understand it!
My environment is JupyterHub and v3.9 python.

An escape code is evaluated by the Python parser when it constructs a string literal. For example, the literals '马' and '\u9a6c' are evaluated by the parser as the same length-1 string.
You can (and did) build a string with the 6 characters \u2757 by using an escape code for the backslash (\\) to prevent the parser from evaluating those 6 characters as an escape code, which is why it prints as the 6-character \u2757.
If you build a byte string with those 6 characters, you can decode it with .decode('unicode-escape') to get the character:
>>> b'\\u2757'.decode('unicode_escape')
'❗'
But it is easier to use the chr() function on the number itself:
>>> chr(0x2757)
'❗'
>>> chr(10071)
'❗'
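As a minimal sketch of both approaches applied to the number from the question (the variable names direct, escaped_text and decoded are illustrative, not from the original post):

```python
number = 10071  # decimal value of code point U+2757

# Direct route: chr() maps a code point straight to its character.
direct = chr(number)

# Long route: build the six-character text \u2757, then ask the
# unicode_escape codec to interpret it the way the parser would.
escaped_text = '\\u' + format(number, '04x')
decoded = escaped_text.encode('ascii').decode('unicode_escape')

print(len(escaped_text), escaped_text)  # six separate characters
print(direct, decoded, direct == decoded)
```

Both routes end at the same one-character string; chr() simply skips the detour through text that looks like an escape sequence.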

Related

How to print unicode from a generator expression in python?

Create a list from generator expression:
V = [('\\u26' + str(x)) for x in range(63,70)]
First issue: if you try to use just "\u" + str(...) it gives a decoder error right away. Seems like it tries to decode immediately upon seeing the \u instead of when a full chunk is ready. I am trying to work around that with double backslash.
Second, that creates something promising but still cannot actually print them as unicode to console:
>>> print([v[0:] for v in V])
['\\u2663', '\\u2664', '\\u2665', .....]
>>> print(V[0])
\u2663
What I would expect to see is a list of symbols that look identical to when using commands like '\u0123' such as:
>>> print('\u2663')
♣
Any way to do that from a generated list? Or is there a better way to print them instead of the '\u0123' format?
This is NOT what I want. I want to see the actual symbols drawn, not the Unicode values:
>>> print(['{}'.format(v[0:]) for v in V])
['\\u2663', '\\u2664', '\\u2665', '\\u2666', '\\u2667', '\\u2668', '\\u2669']
Unicode assigns a numeric code point to each character; escape sequences are only a way of writing those code points in source code. Python 3 strings are Unicode. To get the character that corresponds to a Unicode code point, use chr:
chr(i)
Return the string representing a character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. This is the inverse of ord().
The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16). ValueError will be raised if i is outside that range.
To generate the characters between 2663 and 2670:
>>> [chr(x) for x in range(2663,2670)]
['੧', '੨', '੩', '੪', '੫', '੬', '੭']
Escape sequences use hexadecimal notation though. 0x2663 is 9827 in decimal, and 0x2670 becomes 9840.
>>> [chr(x) for x in range(9827,9840)]
['♣', '♤', '♥', '♦', '♧', '♨', '♩', '♪', '♫', '♬', '♭', '♮', '♯']
You can also use hex numeric literals:
>>> [chr(x) for x in range(0x2663,0x2670)]
['♣', '♤', '♥', '♦', '♧', '♨', '♩', '♪', '♫', '♬', '♭', '♮', '♯']
or, to use exactly the same logic as the question
>>> [chr(0x2600 + x) for x in range(0x63,0x70)]
['♣', '♤', '♥', '♦', '♧', '♨', '♩', '♪', '♫', '♬', '♭', '♮', '♯']
The reason the original code doesn't work is that escape sequences are used to represent a single character in a string literal when we can't or don't want to type the character itself. The interpreter or compiler replaces them with the corresponding character immediately. The string literal '\\u26' is an escaped \ followed by u, 2 and 6:
>>> len('\\u26')
4
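If the list of escaped strings from the question already exists, one possible way to turn it into symbols is to strip the \u prefix and parse the remaining digits as hexadecimal. This is only a sketch. Note that str(x) for x in range(63, 70) yields the digits 63 through 69, so the result covers U+2663 through U+2669 rather than the full hexadecimal range shown above:

```python
# The list from the question: each item is 6 characters, e.g. \u2663
V = ['\\u26' + str(x) for x in range(63, 70)]

# Drop the leading backslash and 'u', read the remaining digits as hex,
# and let chr() produce the character for that code point.
symbols = [chr(int(v[2:], 16)) for v in V]

print(' '.join(symbols))
```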

Add a non escaped escape character to python bytearray

I have an API that demands that the quotation marks in my XML attributes are escaped, so <cmd id="1"> will not work; it requires <cmd id=\"1\">.
I have tried iterating through my string, for example:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">SetChLevel</cmd><name>C</name><value>30</value></tx>'
Each time that I encounter a " (ascii 34) I will replace it with an escape character (ascii 92) and another quote. Infuriatingly this results in:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id=\\"1\\">SetChLevel</cmd><name>C</name><value>30</value></tx>'
where the escapes have been escaped. As a sanity check I replaced 92 with any other character and it works as expected.
temp = b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">\
SetChLevel</cmd><name>C</name><value>30</value></tx>'
i = 0
j = 0
payload = bytearray(len(temp) + 4)
for char in temp:
    if char == 34:
        payload[i] = 92
        i += 1
        payload[i] = 34
        i += 1
        j += 1
    else:
        payload[i] = temp[j]
        i += 1
        j += 1
print(bytes(payload))
I would assume that character 92 would appear once but something is escaping the escape!
Your problem is the result of a very common misunderstanding for programmers new to Python.
When displaying the representation of a string (or bytes) on the console, Python escapes the escape character (\) so that the output, when used in Python as a literal, would give you the exact same value.
So:
s = 'abc\\abc'
print(s)
Prints abc\abc, but on the interpreter you get:
>>> s = 'abc\\abc'
>>> print(s)
abc\abc
>>> s
'abc\\abc'
Note that this is correct. After all print(s) should show the string on the console as it is, while s on the interpreter is asking Python to show you the representation of s, which includes the quotes and the escape characters.
Compare:
>>> repr(s)
"'abc\\\\abc'"
repr here prints the representation of the representation of s.
For bytes, things are further complicated because print shows the representation: print works with strings, so a bytes object needs to be decoded first to see its actual text, i.e.:
>>> print(some_bytes.decode('utf-8')) # or whatever the encoding is
In short: your code was doing what you wanted it to. It does not duplicate escape characters; you only thought it did because you were looking at the representation of the bytes, not the actual bytes content.
By the way, this also means that you don't have to be paranoid and go through the trouble of writing custom code to replace characters based on their ASCII values, you can simply:
>>> example = bytes('<some attr="value">test</some>', encoding='utf-8')
>>> result = example.replace(b'"', b"\\\"")
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
I won't pretend that b"\\\"" is intuitive; perhaps b'\\"' is better. Both require that you understand the difference between the representation of a string and its printed value.
So, finally:
>>> example = b'<some attr="value">test</some>'
>>> result = example.replace(b'"', b'\\"')
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
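One way to convince yourself that the backslash is stored only once is to look at the raw byte counts instead of the representation. A small sketch, reusing the example bytes from above:

```python
example = b'<some attr="value">test</some>'
result = example.replace(b'"', b'\\"')

# Each double quote gained exactly one extra byte (the backslash, 92).
print(len(example), len(result))

# Count the backslash bytes directly: one per original quote.
print(result.count(b'\\'), example.count(b'"'))

# The repr doubles the backslash for display; the decoded text does not.
print(repr(result))
print(result.decode('utf-8'))
```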

Create raw unicode character from hex string representation/enter single backslash [duplicate]

I want to create a raw unicode character from a string hex representation. That is, I have a string s = '\u0222' which will be the 'Ȣ' character.
Now, this works if I do
>>> s = '\u0222'
>>> print(s)
Ȣ
but, if I try to do concatenation, it comes out as
>>> h = '0222'
>>> s = r'\u' + '0222'
>>> print(s)
\u0222
>>> s
'\\u0222'
because as it can be seen, what's actually in string is '\\u' not '\u'. How can I create the unicode character from hex strings or, how can I enter a true single backslash?
This was a lot harder to solve than I initially expected:
code = '0222'
uni_code = r'\u' + code
s = uni_code.encode().decode('unicode_escape')
print(s)
Or
code = b'0222'
uni_code = br'\u' + code
s = uni_code.decode('unicode_escape')
print(s)
Entering \u0222 is only for string constants and the Python interpreter generates a single Unicode code point for that syntax. It's not meant to be constructed manually. The chr() function is used to generate Unicode code points. The following works for strings or integers:
>>> chr(int('0222',16)) # convert string to int base 16
'Ȣ'
>>> chr(0x222) # or just pass an integer.
'Ȣ'
And FYI ord() is the complementary function:
>>> hex(ord('Ȣ'))
'0x222'
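A minimal sketch of wrapping this conversion in a helper. The function name char_from_hex is hypothetical, and letting ValueError propagate is just one reasonable choice:

```python
def char_from_hex(code):
    """Return the character for a hex code point string such as '0222'."""
    # int(..., 16) raises ValueError for non-hex input;
    # chr() raises ValueError for values above 0x10FFFF.
    return chr(int(code, 16))

s = char_from_hex('0222')
print(s, len(s))

# ord() goes the other way, so the pair round-trips.
print(format(ord(s), '04X'))
```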

Unicode representation to formatted Unicode?

I am having some difficulty understanding the translation of unicode expressions into their respective characters. I have been looking at the unicode specification, and I have come across various strings that are formatted as follows U+1F600. As far as I have seen, there doesn't appear to be a built in function that knows how to translate these strings into the correct formatting for Python, such as \U0001F600.
In my program I have made a small regular expression that will find these U\+.{5} patterns and substitute the U+ with \U000. However, what I have found is that this syntax isn't the same for all unicode characters, such as the zero width join that actually is supposed to be translated from U+200D to \u200D.
Because I don't know every variation of the correct unicode escape sequence, what is the best method to handle this case? Is it that there are only a finite amount of these special characters that I can just check for or am I going about this the wrong way entirely?
Python version is 2.7.
I think your most reliable method will be to parse the number to an integer and then use unichr to lookup that codepoint:
unichr(0x1f600) # or: unichr(int('1f600', 16))
Note: on Python 3, it's just chr.
U+NNNN is just common notation used to talk about Unicode. Python's syntax for a single Unicode character is one of:
u'\xNN' for Unicode characters through U+00FF
u'\uNNNN' for Unicode characters through U+FFFF
u'\U00NNNNNN' for Unicode characters through U+10FFFF (max)
Note: N is a hexadecimal digit.
Use the correct notation when entering a character. You can use the longer notations even for low characters:
u'A' == u'\x41' == u'\u0041' == u'\U00000041'
Programmatically, you can also generate the correct character with unichr(n) (Python 2) or chr(n) (Python 3).
Note that before Python 3.3, there were narrow and wide Unicode builds of Python. unichr/chr can only support sys.maxunicode, which is 65535 (0xFFFF) in narrow builds and 1114111 (0x10FFFF) on wide builds. Python 3.3 unified the builds and solved many issues with Unicode.
If you are dealing with text strings in the format U+NNNN, here's a regular expression (Python 3). It looks for U+ followed by 4-6 hexadecimal digits and replaces each match with the chr() version. Note that ASCII characters (Python 2) or printable characters (Python 3) will display the actual character and not the escaped version.
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+1F600')
'testing \U0001f600'
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+5000')
'testing \u5000'
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+0041')
'testing A'
>>> re.sub(r'U\+([0-9A-Fa-f]{4,6})',lambda m: chr(int(m.group(1),16)),'testing U+0081')
'testing \x81'
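The same substitution can be wrapped in a function, shown here as a sketch for Python 3. The names U_PLUS and expand_u_plus are made up for illustration, and the example includes the zero width joiner mentioned in the question:

```python
import re

# U+ followed by 4 to 6 hexadecimal digits, as in the answer above.
U_PLUS = re.compile(r'U\+([0-9A-Fa-f]{4,6})')

def expand_u_plus(text):
    """Replace each U+NNNN token in text with the character it names."""
    return U_PLUS.sub(lambda m: chr(int(m.group(1), 16)), text)

converted = expand_u_plus('grin U+1F600, joiner U+200D, letter U+0041')

# ascii() shows non-ASCII characters as escapes, which makes the
# invisible zero width joiner easy to spot.
print(ascii(converted))
```

Because the digits are parsed as a number, the code does not need to know whether the literal form would have been \u or \U.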
You can look into the json module implementation. It seems that it is not that simple:
# Unicode escape sequence
uni = _decode_uXXXX(s, end)
end += 5
# Check for surrogate pair on UCS-4 systems
if sys.maxunicode > 65535 and \
   0xd800 <= uni <= 0xdbff and s[end:end + 2] == '\\u':
    uni2 = _decode_uXXXX(s, end + 1)
    if 0xdc00 <= uni2 <= 0xdfff:
        uni = 0x10000 + (((uni - 0xd800) << 10) | (uni2 - 0xdc00))
        end += 6
char = unichr(uni)
(from cpython-2.7.9/Lib/json/decoder.py lines 129-138)
I think that it would be easier to use json.loads directly:
>>> print json.loads('"\\u0123"')
ģ
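The surrogate pair arithmetic in that excerpt can be checked by hand. As a sketch in Python 3 (the function name combine_surrogates is hypothetical), the pair D83D DE00 combines to U+1F600, and json.loads gives the same character:

```python
import json

def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00))

code_point = combine_surrogates(0xD83D, 0xDE00)
print(hex(code_point))

# json.loads applies the same combination to a \uXXXX\uXXXX pair.
from_json = json.loads('"\\ud83d\\ude00"')
print(from_json == chr(code_point))
```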

Python3: Creating a string with an unescaped backslash

In Python 3.3, I am trying to rebuild unicode characters from truncated unicode values,
and then print the character to console.
For example, from '4E00' I want to form the string '\u4E00'. I have tried:
base = '4E00'
uni = r'\u' + base
print(uni) # getting '\u4E00', want: '一'
print(repr(uni)) # '\\u4E00'
Is there a way to form an unescaped string like '\u4E00' in this situation?
Keep in mind that \u followed by a Unicode character code is only a thing in string literals. r'\u' + '4E00' has no special meaning as a Unicode character because it's not all in one literal; it's just a six-character string.
So you're trying to take a Unicode escape code as it would appear in a Python string literal, then decode that into a Unicode character. You can do that:
base = '4E00'
uni = str(bytes(r'\u' + base, encoding="ascii"), encoding="unicode_escape")
But it's the long way around (especially since you have to convert it to bytes first since it's already Unicode). Your Unicode character spec is in hexadecimal. So convert it directly to an integer and then use chr() to turn it into a Unicode character.
base = '4E00'
uni = chr(int(base, 16))
Use:
chr(int(base, 16))
to turn a hex value into a Unicode character.
The \u escape sequence only works in string literals. You could use:
(br'\u' + base.encode('ascii')).decode('unicode_escape')
but that's much more verbose than this needs to be.
Demo:
>>> base = '4E00'
>>> chr(int(base, 16))
'一'
>>> (br'\u' + base.encode('ascii')).decode('unicode_escape')
'一'
