Unable to convert hex code to unicode characters, get unicodeescape error - python

I have a pandas dataframe with hex values as given below:
df['col1']
<0020>
<0938>
<002E>
<092B>
<092B>
<0916>
<0915>
<0915>
<096F>
<096C>
I want to convert the hex values to their corresponding unicode literals. So, I try to do the following:
df['col1'] = df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
Hoping that this would convert it to my required unicode literal, but I get the following error:
File "<ipython-input-22-891ccdd39e79>", line 1
df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
In Python 3, when we try the following, we get:
>>> string1 = '\u03b4'
>>> print(string1)
δ
So, I tried adding \u to my given string, and I also tried adding \\u, but that shows up as two backslashes. Adding an r before \u also ends up showing two backslashes instead of the unicode literal. I also tried decoding with unicode-escape, but it didn't work either.
Also, it'd be great if someone could explain the concept of raw strings, \u escapes, etc.

Oops, literals are for... literal values! As soon as you have variables, you should use conversion functions like int and chr.
Here you have a column containing strings. For each cell in the column, you want to remove the first and last character, parse what remains as a hex value, and get the unicode character with that code point. In Python, it just reads:
df['col1'].apply(lambda x: chr(int(x[1:-1], 16)))
And with your values, it gives:
0
1 स
2 .
3 फ
4 फ
5 ख
6 क
7 क
8 ९
9 ६
Now for the reason of your error.
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains actual \u characters (meaning that you escape the backslash, as in "\\uxxxx"), but since you directly have a hex representation of the code point, it is simpler to use the chr function directly.
And in your initial code, when you write '\u', the parser sees the start of an escaped character and tries to decode it immediately... but cannot find the hex code point after it, so it throws the exception. If you really want to go that way, you have to double the backslash (\\) to escape it and store it as-is in the string, and then use codecs.decode(..., encoding='unicode_escape') to decode the string, as shown in @ndclt's answer. But I do not advise you to do so.
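To make both routes concrete, here is a minimal sketch for a single cell (the value '<0938>' is taken from the question's data):
import codecs

cell = '<0938>'
# Route 1 (recommended): parse the hex digits, then ask chr for the character.
print(chr(int(cell[1:-1], 16)))                             # स
# Route 2: build a string holding a literal backslash-u escape (note the
# doubled backslash) and let the unicode_escape codec parse it at runtime.
print(codecs.decode('\\u' + cell[1:-1], 'unicode_escape'))  # स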
References can be found in the Python Standard Library documentation: the chr function and the codecs module.

In order to convert all your codes into unicode characters, here is a one-liner:
import codecs
import pandas as pd
(
# create a series with the prefix "\u" to add to the existing column
pd.Series([r'\u'] * len(df['col1']))
# str.strip deletes the "<" and ">" from your column
# str.cat concatenates the prefix created before to the existing column
.str.cat(df['col1'].str.strip('<>'))
# then you apply a conversion from the raw string to normal string.
.apply(codecs.decode, args=['unicode_escape'])
)
In the previous code, you have to create the prefix as a raw string. If you don't, Python expects a valid \uXXXX escape right after the backslash (which is the error you got in your code).
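To see why the raw prefix matters, a tiny illustration (a sketch; it only inspects the prefix itself):
print(list(r'\u'))   # ['\\', 'u'] : two literal characters, kept for runtime
# A bare '\u' never reaches runtime; the parser rejects it as a truncated
# \uXXXX escape before the program can even run.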
Edit: adding the explanation from Serge Ballesta's post:
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains actual \u characters (meaning that you escape the backslash, as in "\\uxxxx"), but since you directly have a hex representation of the code point, it is simpler to use the chr function directly.
His solution is more elegant than mine.

Related

What encoding is used by json.dumps?

When I use json.dumps in Python 3.8, special characters are being "escaped", like:
>>> import json
>>> json.dumps({'Crêpes': 5})
'{"Cr\\u00eapes": 5}'
What kind of encoding is this? Is this an "escape encoding"? And why is this kind of encoding not part of the encodings module? (Also see codecs, I think I tried all of them.)
To put it another way, how can I convert the string 'Crêpes' to the string 'Cr\\u00eapes' using Python encodings, escaping, etc.?
You are probably confused by the fact that this is a JSON string, not directly a Python string.
Python would encode this string as "Cr\u00eapes", where \u00ea represents a single Unicode character using its hexadecimal code point. In other words, in Python, len("\u00ea") == 1.
JSON requires the same sort of encoding, but embedding the JSON-encoded value in a Python string requires you to double the backslash; so in Python's representation, this becomes "Cr\\u00eapes" where you have a literal backslash (which has to be escaped by another backslash), two literal zeros, a literal e character, and a literal a character. Thus, len("\\u00ea") == 6
If you have JSON in a file, the absolutely simplest way to load it into Python is to use json.loads() to read and decode it into a native Python data structure.
If you need to decode the hexadecimal sequence separately, the unicode-escape codec does that on a bytes value:
>>> b"Cr\\u00eapes".decode('unicode-escape')
'Crêpes'
This is sort of coincidental, and works simply because the JSON representation happens to be identical to the Python unicode-escape representation. You still need a b'...' aka bytes input for that. ("Crêpes".encode('unicode-escape') produces a slightly different representation. "Cr\\u00eapes".encode('us-ascii') produces a bytes string with the Unicode representation b"Cr\\u00eapes".)
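As a short check of the claims above, here is a sketch (output as produced by CPython 3.8; json.dumps escapes non-ASCII by default, and ensure_ascii=False turns that off):
import json

s = json.dumps({'Crêpes': 5})
print(s)                                   # {"Cr\u00eapes": 5}
print(len('\u00ea'), len('\\u00ea'))       # 1 6
print(json.loads(s))                       # {'Crêpes': 5}
print(json.dumps({'Crêpes': 5}, ensure_ascii=False))  # {"Crêpes": 5}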
It is not a Python encoding. It is the way JSON encodes Unicode non-ASCII characters. It is independent of Python and is used exactly the same for example in Java or with a C or C++ library.
The rule is that a non-ASCII character in the Basic Multilingual Plane (i.e. with a code point that fits in 16 bits) is encoded as \uxxxx, where xxxx is the hexadecimal code point value.
That explains why ê is written as \u00ea: its unicode code point is U+00EA.

Converting escaped characters to UTF-8 in Python

Is there an elegant way to convert "test\207\128" into "testπ" in Python?
My issue stems from using avahi-browse on Linux, which has a -p flag to output information in an easy-to-parse format. However, the problem is that it outputs non-alphanumeric characters as escaped sequences. So a service published as "name#id" gets output by avahi-browse as "name\035id". This can be dealt with by splitting on the \, dropping the leading zero and using chr(35) to recover the #. This solution breaks on multi-byte UTF-8 characters such as "π", which gets output as "\207\128".
The input string you have is an encoding of a UTF-8 string, in a format that Python can't deal with natively. This means you'll need to write a simple decoder, then use Python to translate the UTF-8 string to a string object:
import re
value = r"test\207\128"
# First off turn this into a byte array, since it's not a unicode string
value = value.encode("utf-8")
# Now replace any "\###" with a byte character based off
# the decimal number captured
value = re.sub(b"\\\\([0-9]{3})", lambda m: bytes([int(m.group(1))]), value)
# And now that we have a normal UTF-8 string, decode it back to a string
value = value.decode("utf-8")
print(value)
# Outputs: testπ
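The same decoder also covers the single-byte case from the question, since chr(35) is '#' (a sketch reusing the question's "name\035id" example):
import re

raw = br"name\035id"
decoded = re.sub(b"\\\\([0-9]{3})",
                 lambda m: bytes([int(m.group(1))]),
                 raw).decode("utf-8")
print(decoded)  # name#id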

Order of evaluation, unicode strings & format

Here I am creating a list of strings from the Unicode map.
unicode_strings = ["\U00000{:0>3}".format(str.upper(hex(i))[2:]) for i in range(16)]
but this code emits an error message.
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
This happens because of the order of evaluation: first "\U00000" is evaluated, and only then is format executed.
As the error message says, a unicode escape must be a complete "\UXXXXXXXX" block.
The escape is evaluated first, but it is not a complete block at that point; the unicode character would need to be fully constructed by the time the format function runs.
I know the raw-string prefix 'r' suppresses this error message, but then no unicode character is produced.
How can I attach "\U" to the string, or have the format function execute first?
If I delete '\U', I get the following result:
['00000001',
'00000002',
'00000003',
'00000004',
'00000005',
'00000006',
'00000007',
'00000008',
'00000009',
'0000000A',
'0000000B',
'0000000C',
'0000000D',
'0000000E',
'0000000F']
UPDATE:
I want a result like this:
['\U00000001',
'\U00000002',
'\U00000003',
'\U00000004',
'\U00000005',
'\U00000006',
'\U00000007',
'\U00000008',
'\U00000009',
'\U0000000A',
'\U0000000B',
'\U0000000C',
'\U0000000D',
'\U0000000E',
'\U0000000F']
I want to get the sequence of characters in the Unicode map.
Not entirely sure what exactly you are after, but given that, for instance, \U00000000 is the same as \x00, a comprehension like the following would seem to make more sense for generating this list anyway:
unicode_strings = [chr(i) for i in range(16)]
If the question is why this happens, the format docs may be a little subtle about it:*)
The string on which this method is called can contain literal text or replacement fields delimited by braces {}... Returns a copy of the string where each replacement field is replaced with the string value of the corresponding argument.
But basically the literal strings and "replacement fields" are identified first, and each is processed as such. In your case the string literal \U00000 is being considered and is invalid, since eight hex digits are expected after \U. In other words, it's not really a matter of order (literals first, expressions later), but of how the string gets split into chunks and processed (literals and expressions are identified first, then each is handled accordingly).
So if you were trying to do something like that for a larger string generation, you could do it as follows:
somelist = [f"abcd{chr(i)}efgh" for i in range(16)]
*) PEP-498 on f-strings may be a bit more explicit (and the mechanics are the same in this respect), namely:
f-strings are parsed in to literal strings and expressions...
The parts of the f-string outside of braces are literal strings. These literal portions are then decoded. For non-raw f-strings, this includes converting backslash escapes such as '\n', '\"', "\'", '\xhh', '\uxxxx', '\Uxxxxxxxx', and named unicode characters '\N{name}' into their associated Unicode characters.
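A quick way to convince yourself that the two spellings agree (a minimal sketch; chr produces exactly the characters the \U-literals would, without fighting the parser):
unicode_strings = [chr(i) for i in range(16)]
assert unicode_strings[1] == '\U00000001'    # escape literal and chr agree
assert unicode_strings[15] == '\U0000000F'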

Python 2 string somehow saved as pure Unicode

I have the following strings in Chinese that are saved in the following form as the "str" type:
\u72ec\u5230
\u7528\u8272
I am on Python 2.7; when I print those strings, they are printed as actual Chinese characters:
chinese_list = ["\u72ec\u5230", "\u7528\u8272", "\u72ec"]
print(chinese_list[0], chinese_list[1], chinese_list[2])
>>> 独到 用色 独
I can't really figure out how they were saved in that form; to me it looks like Unicode. The goal would be to take other Chinese characters that I have and save them in the same kind of encoding. Say I have "国道", and I would need it to be saved in the same way as the strings in the original chinese_list.
I've tried to encode it as utf-8 and also other encodings, but I never get the same output as the original:
new_string = u"国道"
print(new_string.encode("utf-8"))
# >>> b'\xe5\x9b\xbd\xe9\x81\x93'
print(new_string.encode("utf-16"))
# >>> b'\xff\xfe\xfdVS\x90'
Any help appreciated!
EDIT: it doesn't have to have 2 Chinese characters.
EDIT2: Apparently, the encoding was unicode-escape. Thanks @deceze.
print(u"国".encode('unicode-escape'))
>>> \u56fd
The \u.... is unicode escape syntax. It works similarly to how \n is a newline, not the two characters \ and n.
The elements of your list never actually contain a byte string with literal characters of \, u, 7 and so on. They contain a unicode string with the actual unicode characters, i.e. 独 and so on.
Note that this only works with unicode strings! In Python 2, you need to write u"\u....". Python 3 always uses unicode strings.
The unicode escape value of a character can be obtained with the ord builtin. For example, ord(u"国") gives 22269, the same value as 0x56fd.
To get the hexadecimal escape value, convert the result to hex.
>>> def escape_literal(character):
... return r'\u' + hex(ord(character))[2:]
...
>>> print(escape_literal('国'))
\u56fd
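One caveat worth noting: hex(ord(...))[2:] can yield fewer than four digits (e.g. 'a' for '\n'), which would produce a truncated \uXXXX escape. A padded variant avoids that (a sketch; the name escape_literal_padded is made up here):
def escape_literal_padded(character):
    # format(..., '04x') zero-pads to the four hex digits that \uXXXX requires
    return r'\u' + format(ord(character), '04x')

print(escape_literal_padded(u'国'))   # \u56fd
print(escape_literal_padded(u'\n'))  # \u000a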

How to combine unicode subscript with string formatting

I am trying to get unicode subscripts working with string formatting...
I know I can do something like this...
>>> print('Y\u2081')
Y₁
>>> print('Y\u2082')
Y₂
But what I really need is something like the following, since I need the subscript to iterate over a range. Obviously this doesn't work though:
>>> print('Y\u208{0}'.format(1))
File "<ipython-input-62-99965eda0209>", line 1
print('Y\u208{0}'.format(1))
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 1-5: truncated \uXXXX escape
Any help appreciated
\uhhhh is an escape syntax in the string literal. You'd have to produce a raw string (where the escape syntax is ignored), then re-apply the normal Python parser handling of escapes:
import codecs
print(codecs.decode(r'Y\u208{0}'.format(1), 'unicode_escape'))
However, you'd be better off using the chr() function to produce the whole character:
print('Y{0}'.format(chr(0x2080 + 1)))
The chr() function takes an integer and returns the corresponding Unicode code point as a string. The above starts from the hexadecimal base 0x2080 and adds 1 to produce your desired character in the U+2080 subscript range.
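Putting that together with the range the question asked for (a small sketch; the subscript digits zero through nine live at U+2080..U+2089):
for i in range(1, 4):
    print('Y{0}'.format(chr(0x2080 + i)))
# Y₁
# Y₂
# Y₃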
