Convert ASCII to Unicode encoding issue [duplicate] - python

This question already has answers here:
Python string to unicode [duplicate]
(3 answers)
Closed 6 years ago.
I have a question about Python 2 encoding. I am trying to decode an ASCII string which contains Unicode code of a letter to Unicode, and then encode it back to Latin-1, but with no success. Here is an illustration:
In[27]: d = u'\u010d'
In[28]: print d.encode('utf-8')
č
In[29]: d1 = '\u010d'
In[30]: d1.decode('ascii').encode('utf-8')
Out[30]: '\\u010d'
I would like to convert '\u010d' to 'č'. Are there any built-in solutions to avoid custom string replacement?

When you do
d1 = '\u010d'
you actually get this string:
In [3]: d1
Out[3]: '\\u010d'
This is because "normal" (non-Unicode) strings don't recognize the \unnnn escape sequence and therefore convert it to a literal backslash, followed by unnnn.
In order to decode that, you need to use the unicode_escape codec:
In [4]: print d1.decode("unicode_escape").encode('utf-8')
č
But of course you shouldn't use Unicode escape sequences in non-Unicode strings in the first place.

Related

Convert unicode string into byte string [duplicate]

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I have a string like:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
I need to be able to get the corresponding byte literal of that unicode (for pickle.loads):
s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
Here the solution of using s_new: bytes = bytes(s_str, encoding="raw_unicode_escape") was posted, but it does not work for me. I got an incorrect result: b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04' that has two backslashes (actually representing only one) for each one that it should have.
Also here and here a similar solution is proposed, but it does not work for me either, I end up getting the double backslashes again. Why does this occur? How do I get the bytes result I want?
You do not have byte escape codes as shown below (length 9) or you wouldn't get the s_not_bytes result:
s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
You have literal escape codes (length 36), and note the r for raw string that prevents interpreting the escape codes as bytes:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
Note the difference. \\ is an escape code indicating a literal, single backslash:
>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36
The following gets the desired byte string by converting each code point to a byte using the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF. Then it decodes the literal escape codes, resulting in a Unicode string again so once more encode using latin1 to convert 1:1 back to bytes:
s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)
Output:
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
If you did have s_str as posted, a simple .encode('latin1') would convert it:
>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
I was about to post the question when I encounter a valid solution almost by chance. The combination that works for me is:
s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")
As I said I have no idea why this works so feel free to explain it if you know why.
You might simply use .encode("utf-8") to get desired result i.e.:
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)
output
b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'

How to print unicode-character-codes-stored-as-strings as human-readable text [duplicate]

This question already has answers here:
Decode Hex String in Python 3
(3 answers)
Closed 3 years ago.
I have lots of unicode characters codes stored as strings in Python3, e.g.
unicode = '3077'
where U+3077 is ぷ. How do I print this as human-readable text? I.e. how do I convert the string unicode to unicode_as_text such that:
>>> print(unicode_as_text)
ぷ
Your string is the unicode codepoint represented in hexdecimal, so the character can be rendered by printing the result of calling chr on the decimal value of the code point.
>>> print(chr(int('3077', 16)))
ぷ

Remove unicode characters python [duplicate]

This question already has answers here:
Replace non-ASCII characters with a single space
(12 answers)
Closed 6 years ago.
I am pulling tweets in python using tweepy.
It gives the entire data in type unicode.
Eg: print type(data) gives me <type 'unicode'>
It contains unicode characters in it.
Eg: hello\u2026 im am fine\u2019s
I want to remove all of these unicode characters. Is there any regular expression i can use?
str.replace isn't a viable option as unicode characters can be any values, from smileys to unicode apostrophes.
In [10]: from unicodedata import normalize
In [11]: out_text = normalize('NFKD', input_text).encode('ascii','ignore')
Try this.
Edit
Actually normalize Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. If you wana more about NFKD go to this link
In [12]: u = unichr(40960) + u'abcd' + unichr(1972)
In [13]: u.encode('utf-8')
Out[13]: '\xea\x80\x80abcd\xde\xb4'
In [14]: u
Out[14]: u'\ua000abcd\u07b4'
In [16]: u.encode('ascii', 'ignore')
Out[16]: 'abcd'
From the above code you will get what encode('ascii','ignore') does.
Ref : https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize

Python decode "\u041b" string [duplicate]

This question already has an answer here:
How can I convert strings like "\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167" to Chinese characters
(1 answer)
Closed 9 years ago.
I have unicode string, i'm sure that it's UTF-8, but I can't decode it. The string is '\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'. How to decode it?
You can use aString.decode('unicode_escape'), it convert a unicode-format string to unicode object
>>> u'\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'
u'\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'
>>> '\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'.decode('unicode_escape')
u'\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'
>>>
In your case
>>> print '\u041b\u0435\u0433\u043a\u043e\u0432\u044b\u0435'.decode('unicode_escape')
Легковые
>>>

Python string to unicode [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?
How do convert unicode escape sequences to unicode characters in a python string
I have a string that contains unicode characters e.g. \u2026 etc. Somehow it is not received to me as unicode, but is received as a str. How do I convert it back to unicode?
>>> a="Hello\u2026"
>>> b=u"Hello\u2026"
>>> print a
Hello\u2026
>>> print b
Hello…
>>> print unicode(a)
Hello\u2026
>>>
So clearly unicode(a) is not the answer. Then what is?
Unicode escapes only work in unicode strings, so this
a="\u2026"
is actually a string of 6 characters: '\', 'u', '2', '0', '2', '6'.
To make unicode out of this, use decode('unicode-escape'):
a="\u2026"
print repr(a)
print repr(a.decode('unicode-escape'))
## '\\u2026'
## u'\u2026'
Decode it with the unicode-escape codec:
>>> a="Hello\u2026"
>>> a.decode('unicode-escape')
u'Hello\u2026'
>>> print _
Hello…
This is because for a non-unicode string the \u2026 is not recognised but is instead treated as a literal series of characters (to put it more clearly, 'Hello\\u2026'). You need to decode the escapes, and the unicode-escape codec can do that for you.
Note that you can get unicode to recognise it in the same way by specifying the codec argument:
>>> unicode(a, 'unicode-escape')
u'Hello\u2026'
But the a.decode() way is nicer.
>>> a="Hello\u2026"
>>> print a.decode('unicode-escape')
Hello…

Categories

Resources