Python3: Creating a string with an unescaped backslash - python

In Python 3.3, I am trying to rebuild unicode characters from truncated unicode values,
and then print the character to console.
For example, from '4E00' I want to form the string '\u4E00'. I have tried:
base = '4E00'
uni = r'\u' + base
print(uni) # getting '\u4E00', want: '一'
print(repr(uni)) # '\\u4E00'
Is there a way to form an unescaped string like '\u4E00' in this situation?

Keep in mind that \u followed by a Unicode character code is only a thing in string literals. r'\u' + '4E00' has no special meaning as a Unicode character because it's not all in one literal; it's just a six-character string.
So you're trying to take a Unicode escape code as it would appear in a Python string literal, then decode that into a Unicode character. You can do that:
base = '4E00'
uni = str(bytes(r'\u' + base, encoding="ascii"), encoding="unicode_escape")
But that's the long way around, especially because the string is already Unicode and so has to be converted to bytes first. Your code point is given in hexadecimal, so convert it directly to an integer and then use chr() to turn it into a Unicode character.
base = '4E00'
uni = chr(int(base, 16))
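To see why the concatenation stays a six-character string, it may help to compare lengths. The sketch below is only an illustration; the variable names literal, built, and decoded are made up for this example.

```python
# A \u escape is resolved by the parser inside a single literal.
literal = '\u4E00'

# Concatenating at run time just joins a backslash, 'u', and four digits.
base = '4E00'
built = r'\u' + base

print(len(literal))  # 1
print(len(built))    # 6

# Converting the hex digits at run time yields the same character as the literal.
decoded = chr(int(base, 16))
print(decoded == literal)  # True
```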

Use:
chr(int(base, 16))
to turn a hex value into a Unicode character.
The \u escape sequence only works in string literals. You could use:
(br'\u' + base.encode('ascii')).decode('unicode_escape')
but that's much more verbose than this needs to be.
Demo:
>>> base = '4E00'
>>> chr(int(base, 16))
'一'
>>> (br'\u' + base.encode('ascii')).decode('unicode_escape')
'一'
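One further reason to prefer chr(int(base, 16)): a \u escape takes exactly four hex digits, so the unicode_escape route needs \U with eight digits for code points above U+FFFF, while chr() accepts any valid code point. A minimal sketch, where the helper names hex_to_char and hex_to_char_via_escape are hypothetical:

```python
def hex_to_char(hex_digits):
    """Return the character for a hexadecimal code point such as '4E00'."""
    code_point = int(hex_digits, 16)  # raises ValueError for non-hex input
    return chr(code_point)            # raises ValueError above 0x10FFFF


def hex_to_char_via_escape(hex_digits):
    """Same result through unicode_escape, padding to the 8-digit \\U form."""
    escaped = '\\U' + hex_digits.zfill(8)
    return escaped.encode('ascii').decode('unicode_escape')


print(hex_to_char('4E00'))                   # 一
print(hex_to_char('1F600') == '\U0001F600')  # True
```

Both helpers agree on inputs of up to eight hex digits; the chr() version is shorter and does not depend on the input length.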

Related

How to print unicode character from a string variable?

I am new to programming, and I am a bit confused.
I expected both print calls to produce the same graphical Unicode exclamation mark symbol:
My experiment:
number = 10071
byteStr = number.to_bytes(4, byteorder='big')
hexStr = hex(number)
uniChar = byteStr.decode('utf-32be')
uniStr = '\\u' + hexStr[2:6]
print(f'{number} - {hexStr[2:6]} - {byteStr} - {uniChar}')
print(f'{uniStr}') # Not working
print(f'\u2757') # Working
Output:
10071 - 2757 - b"\x00\x00'W" - ❗
\u2757
❗
What is the difference between the last two lines?
Please help me understand it!
My environment is JupyterHub with Python 3.9.
An escape code is evaluated by the Python parser when constructing literal strings. For example, the literals '马' and '\u9a6c' are evaluated by the parser as the same length-1 string.
You can (and did) build a string with the 6 characters \u2757 by using an escape code for the backslash (\\), which prevents the parser from evaluating those 6 characters as an escape code. That is why it prints as the 6-character \u2757.
If you build a byte string with those 6 characters, you can decode it with .decode('unicode_escape') to get the character:
>>> b'\\u2757'.decode('unicode_escape')
'❗'
But it is easier to use the chr() function on the number itself:
>>> chr(0x2757)
'❗'
>>> chr(10071)
'❗'
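Applied to the experiment in the question, this could look as follows. It is a sketch only; the names uni_char and text are made up for this example.

```python
number = 10071

# chr() maps the integer code point straight to the character.
uni_char = chr(number)

# The same character, written as an escape in a literal (0x2757 == 10071).
print(uni_char == '\u2757')  # True

# An f-string can format the number and the character side by side.
print(f'{number} - {number:04x} - {uni_char}')

# If only the six-character text is available, decode it explicitly.
text = '\\u' + format(number, '04x')
print(text.encode('ascii').decode('unicode_escape') == uni_char)  # True
```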

How to convert byte to hex string starting with '%' in python

I would like to convert Unicode characters, encoded as UTF-8, to a hex string in which each byte is prefixed with '%', because the API server only recognizes this form.
For example, to send the Unicode character '※' to the server, I need to convert it to the string '%E2%80%BB'. (It does not matter whether the hex digits are upper or lower case.)
I found a way to convert a Unicode character to bytes and the bytes to a hex string in https://stackoverflow.com/a/35599781.
>>> print('※'.encode('utf-8'))
b'\xe2\x80\xbb'
>>> print('※'.encode('utf-8').hex())
e280bb
But I need the form starting with '%', like '%E2%80%BB' or '%e2%80%bb'.
Is there a concise way to implement this? Or do I need to write a function that adds '%' before each hex byte?
There are two ways to do this:
The preferred solution is to use urllib.parse.urlencode, which takes multiple parameters and encodes them all at once:
import urllib.parse
urllib.parse.urlencode({'parameter': '※', 'test': 'True'})
# parameter=%E2%80%BB&test=True
Or you can manually split the hex string into chunks of two characters, then join them with the % symbol:
def encode_symbol(symbol):
    symbol_hex = symbol.encode('utf-8').hex()
    symbol_groups = [symbol_hex[i:i + 2].upper() for i in range(0, len(symbol_hex), 2)]
    symbol_groups.insert(0, '')
    return '%'.join(symbol_groups)

encode_symbol('※')
# %E2%80%BB
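For a single value rather than a whole query string, urllib.parse.quote from the standard library percent-encodes the UTF-8 bytes directly. The sketch below also shows a shorter manual variant; the function name percent_encode_all is made up for this example.

```python
from urllib.parse import quote

# quote() encodes as UTF-8 by default and emits upper-case hex.
print(quote('※'))  # %E2%80%BB


def percent_encode_all(text):
    """Percent-encode every UTF-8 byte of text, including ASCII ones."""
    return ''.join(f'%{byte:02X}' for byte in text.encode('utf-8'))


print(percent_encode_all('※'))  # %E2%80%BB
print(percent_encode_all('a'))  # %61
```

The two differ for plain ASCII: quote() leaves unreserved characters such as letters and digits as they are, while the manual variant escapes every byte.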

Dealing with doubly escaped unicode string

I have a database of badly formatted strings. The data looks like this:
"street"=>"\"\\u4e2d\\u534e\\u8def\""
when it should be like this:
"street"=>"中华路"
The problem is that when these doubly escaped strings come from the database, they are not decoded to the Chinese characters as they should be. Suppose I have the variable street="\"\\u4e2d\\u534e\\u8def\""; if I print it with print(street), the result is a string of code point escapes: "\u4e2d\u534e\u8def"
What can I do at this point to convert "\u4e2d\u534e\u8def" to actual Unicode characters?
First encode this string as utf8 and then decode it with unicode-escape which will handle the \\ for you:
>>> line = "\"\\u4e2d\\u534e\\u8def\""
>>> line.encode('utf8').decode('unicode-escape')
'"中华路"'
You can then strip the " if necessary.
You could remove the quotation marks with strip and split at every '\\u'. This would give you the characters as strings representing hex numbers. Then for each string you could convert it to int and back to string with chr:
>>> street = "\"\\u4e2d\\u534e\\u8def\""
>>> ''.join(chr(int(x, 16)) for x in street.strip('"').split('\\u') if x)
'中华路'
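Note that the encode/decode route treats the bytes as Latin-1, so any non-ASCII characters already present in the string get mangled, and the split approach assumes the string contains nothing but \uXXXX escapes. A sketch that replaces only the escape sequences and leaves other text alone, assuming four-digit escapes; the name unescape_u is hypothetical:

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_u(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


street = "\"\\u4e2d\\u534e\\u8def\""
print(unescape_u(street).strip('"'))  # 中华路

# Characters that are already decoded pass through untouched.
print(unescape_u('é \\u4e2d'))  # é 中
```

This sketch does not combine escaped surrogate pairs into a single character, so data containing them would need extra handling.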
Based on what you wrote, the database appears to be storing an eval-able ASCII representation of a string with non-ASCII characters.
>>> eval("\"\\u4e2d\\u534e\\u8def\"")
'中华路'
Python has a built-in function, ascii(), that produces this kind of representation.
>>> ascii('中华路')
"'\\u4e2d\\u534e\\u8def'"
The only difference is the use of \" instead of ' for the needed internal quote.
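Because eval() executes whatever expression it is given, it is risky on data read from a database. Since the stored value looks like a double-quoted literal, two standard-library parsers that only accept literals may serve instead. This is a sketch under that assumption about the data; the variable name stored is made up for this example.

```python
import ast
import json

# The text "\u4e2d\u534e\u8def", surrounding quotes included.
stored = "\"\\u4e2d\\u534e\\u8def\""

# json.loads parses a JSON string literal, including \uXXXX escapes.
print(json.loads(stored))  # 中华路

# ast.literal_eval parses Python literals without running arbitrary code.
print(ast.literal_eval(stored))  # 中华路
```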

How do I use string formatting with explicit character escape codes in Python?

I know that I can print the following:
print u'\u2550'
How do I do that with the .format string method? For example:
for i in range(0x2550, 0x257F):
    s = u'\u{04X}'.format(i) # of course this is a syntax error
You are looking for the unichr() function:
s = u'{}'.format(unichr(i))
or just plain
s = unichr(i)
unichr(integer) produces a unicode character for the given codepoint.
The \uxxxx syntax only works for string literals; these are processed by the Python compiler before running the code.
Demo:
>>> for i in range(0x2550, 0x257F):
...     print unichr(i)
...
═
║
╒
╓
╔
╕
╖
# etc.
If you ever do have literal \uxxxx sequences in your strings, you can still have Python turn those into Unicode characters with the unicode_escape codec:
>>> print u'\\u2550'
\u2550
>>> print u'\\u2550'.decode('unicode_escape')
═
>>> print '\\u2550'.decode('unicode_escape')
═
>>> '\\u2550'.decode('unicode_escape')
u'\u2550'
In Python 2, you can use this codec on both byte string and unicode string values; the output is always a unicode string.
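For readers on Python 3, where unichr() no longer exists and every str is Unicode, a rough equivalent of the above looks like this. This is a sketch for Python 3 only, not part of the original Python 2 answer, and the names box_chars and escaped are made up for this example.

```python
# chr() in Python 3 plays the role of unichr() in Python 2.
box_chars = [chr(i) for i in range(0x2550, 0x257F)]
print(''.join(box_chars[:7]))  # ═║╒╓╔╕╖

# chr() composes with str.format like any other value.
s = '{}'.format(chr(0x2550))

# str has no .decode() in Python 3, so go through bytes for unicode_escape.
escaped = '\\u2550'
print(escaped.encode('ascii').decode('unicode_escape') == s)  # True
```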

Convert u'\xe0' to '\u00E0' in Python 2.x?

In Python 2.x, how can I convert a unicode string (e.g. u'\xe0') to a string (here I need it to be '\u00E0')?
To make it clearer: I would like to have '\u00E0', a string of length 6. That is, \u is treated as 2 chars instead of one escaped char.
\u doesn't exist as an escape sequence in byte string (str) literals in Python 2; it is only recognized in unicode literals.
You might mean a JSON-encoded string:
>>> s = u'\xe0'
>>> import json
>>> json.dumps(s)
'"\\u00e0"'
or a UTF-16 (big-endian)-encoded string:
>>> s.encode("utf-16-be")
'\x00\xe0'
but your original request is not fulfillable.
As an aside, note that u'\u00e0' is identical to u'\xe0', but '\u00e0' doesn't exist:
>>> u'\u00e0'
u'\xe0'
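If the clarification in the question is read as wanting the six characters backslash, u, and four hex digits as plain text, formatting the code point is one option. The sketch below is written for Python 3 and the helper name to_u_escape is hypothetical; it covers only code points up to U+FFFF.

```python
def to_u_escape(char):
    """Return the text \\uXXXX (six characters) for a single BMP character."""
    code_point = ord(char)
    if code_point > 0xFFFF:
        raise ValueError('code point does not fit in four hex digits')
    return '\\u{:04X}'.format(code_point)


result = to_u_escape('\xe0')
print(result)       # \u00E0
print(len(result))  # 6
```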
