I have unicode-like strings, but with the backslash escaped, for example '\\u000D'. I need to decode them as normal strings. The above example should be converted to '\r', which is the character '\u000D' corresponds to.
Use the unicode-escape codec.
>>> import codecs
>>> codecs.decode('\\u000D', 'unicode-escape')
'\r'
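One caveat, as far as I know: when the input is a str that already contains non-ASCII characters, the unicode-escape codec tends to mangle them, because it treats the data as Latin-1 bytes. A minimal sketch, assuming only \uXXXX escapes need decoding (the helper name decode_u_escapes is hypothetical, not part of any library), replaces just those sequences with a regex:

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(text: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# The escaped carriage return from the question.
assert decode_u_escapes("\\u000D") == "\r"

# Real non-ASCII text next to an escape is left untouched.
assert decode_u_escapes("caf\u00e9 \\u00e9") == "caf\u00e9 \u00e9"
```

This sketch does not combine surrogate pairs written as two \u escapes, and it ignores other escapes such as \n or \x41.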
I have a txt file which contains a line:
' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
The contents of the double quotes are actually octal escapes, but with two escape characters (backslashes) instead of one.
After the line has been read in, I used regex to extract the contents in the double quotes.
c = re.search(r': "(.+)"', line).group(1)
After that, I have two problems:
First, I need to replace the two escape characters with one.
Second, I need to tell Python that the str object c is actually a bytes object.
Neither of them has been solved.
I have tried:
re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)
All failed.
A bytes object can easily be defined with the 'b' prefix.
c = b'\351\231\220\346\227\266\345\205\215\350\264\271'
How do I change the type of a string to bytes? I think this is not an encode-and-decode thing.
I googled a lot, but found no answers. Maybe I used the wrong keywords.
Does anyone know how to do this, or another way to get what I want?
This is always a little confusing. I assume your bytes object should represent a string like:
b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'
To get that with your escaped string, you could use the codecs library and try:
import re
import codecs
line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'
giving the same result.
The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
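The 1:1 property can be checked directly. The following is a small sketch (the variable names are illustrative only) showing that Latin-1 maps byte N to code point N and back for all 256 byte values:

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding as Latin-1 maps byte N to code point N.
text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))

# Encoding as Latin-1 reverses the mapping exactly.
assert text.encode("latin-1") == all_bytes

# Code points above U+00FF have no Latin-1 byte, so encoding fails.
try:
    "\u9650".encode("latin-1")
except UnicodeEncodeError:
    print("U+9650 cannot be encoded as Latin-1")
```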
Step-by-Step:
>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s) # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1') # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
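The chain above can be wrapped in a small helper. This is only a sketch: the name unescape_utf8 is hypothetical, and it assumes the escaped bytes really are UTF-8 and that the input contains no characters above U+00FF.

```python
def unescape_utf8(text: str) -> str:
    """Turn literal escape codes for UTF-8 bytes into the text they encode."""
    # str -> bytes, 1:1 (raises UnicodeEncodeError above U+00FF)
    as_bytes = text.encode("latin-1")
    # interpret the \351-style escapes; result holds code points 0-255
    unescaped = as_bytes.decode("unicode-escape")
    # code points -> the real UTF-8 bytes, again 1:1
    utf8_bytes = unescaped.encode("latin-1")
    return utf8_bytes.decode("utf-8")

s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
# Same four characters as the step-by-step result above.
assert unescape_utf8(s) == "\u9650\u65f6\u514d\u8d39"
```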
Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module's literal_eval function to turn the dictionary directly into a Python object, and then just fix the affected value:
>>> # Python dictionary-like text
>>> d = '{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d) # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'
I have: b'{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
I need: '{"street": "Grosskölnstraße"}'
I tried:
s.decode('utf8'): # '{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
s.decode('unicode_escape'): # '{"street":"GrosskölnstraÃ\x9fe"}'
What's the correct way?
That's.. quite a mess you have there. That looks like UTF-8 bytes embedded as Python byte escape sequences.
There is no codec that'll produce bytes as output again; you'll need to decode with the unicode_escape codec, then re-encode as Latin-1 to get back to the UTF-8 bytes, then decode those as UTF-8:
s.decode('unicode_escape').encode('latin1').decode('utf8')
Demo:
>>> s = b'{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
>>> s.decode('unicode_escape').encode('latin1').decode('utf8')
'{"street":"Grosskölnstraße"}'
Another option is to target just the \x[hexdigits]{2} pattern with a regex; this may be the more robust option if the specific data wasn't produced by a faulty Python script:
import re
from functools import partial
escape = re.compile(rb'\\x([\da-f]{2})')
repair = partial(escape.sub, lambda m: bytes.fromhex(m.group(1).decode()))
repair() returns a bytes object:
>>> repair(s)
b'{"street":"Grossk\xc3\xb6lnstra\xc3\x9fe"}'
>>> repair(s).decode('utf8')
'{"street":"Grosskölnstraße"}'
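The character class in the regex above only matches lowercase hex digits. If the data might contain uppercase ones (an assumption, since the example does not), a variant along these lines could be used; repair_hex_escapes is a hypothetical name:

```python
import re

# Literal backslash, 'x', then two hex digits in either case.
_HEX_ESCAPE = re.compile(rb"\\x([0-9a-fA-F]{2})")

def repair_hex_escapes(data: bytes) -> bytes:
    """Replace each literal \\xNN sequence with the single byte it names."""
    return _HEX_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")), data
    )

mixed = b'{"street":"Grossk\\xC3\\xB6lnstra\\xc3\\x9fe"}'
assert repair_hex_escapes(mixed).decode("utf-8") == '{"street":"Grossk\u00f6lnstra\u00dfe"}'

# Other backslash sequences are left alone, unlike with unicode_escape.
assert repair_hex_escapes(b"a\\nb") == b"a\\nb"
```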
For instance, if my string contains 'नमस्ते', how do I print the Unicode escape sequences for all the characters in the string?
If you want the \u escapes for each character (what you'd type to redefine the string in pure ASCII Python code), use the unicode-escape codec:
>>> 'नमसत'.encode('unicode-escape')
b'\\u0928\\u092e\\u0938\\u0924'
If it needs to end up a str, rather than bytes, decode it back as ASCII (and remove the quoting and doubled backslashes on display by printing it):
>>> print('नमसत'.encode('unicode-escape').decode('ascii'))
\u0928\u092e\u0938\u0924
>>> s = "नमस्ते"
>>> s.encode('utf-8')
b'\xe0\xa4\xa8\xe0\xa4\xae\xe0\xa4\xb8\xe0\xa5\x8d\xe0\xa4\xa4\xe0\xa5\x87'
>>> s.encode('unicode-escape')
b'\\u0928\\u092e\\u0938\\u094d\\u0924\\u0947'
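To list the escape for each character separately, something like the following sketch may help. Note that the unicode-escape codec leaves ASCII characters as they are and uses \xNN for code points below U+0100, so this sketch formats the code point itself to get a uniform \uXXXX form; the helper name u_escape is hypothetical.

```python
import unicodedata

def u_escape(ch: str) -> str:
    """Return the \\uXXXX (or \\UXXXXXXXX) escape for one character."""
    cp = ord(ch)
    return f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}"

# The six code points of the greeting from the question, written as
# escapes so the combining marks cannot be lost in copy and paste.
text = "\u0928\u092e\u0938\u094d\u0924\u0947"

# One line per character: its escape and its Unicode name.
for ch in text:
    print(u_escape(ch), unicodedata.name(ch, "<unnamed>"))
```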
What is the correct way to convert '\xbb' into a unicode string? I have tried the following and only get UnicodeDecodeError:
unicode('\xbb', 'utf-8')
'\xbb'.decode('utf-8')
Since it comes from Word it's probably CP1252.
>>> print '\xbb'.decode('cp1252')
»
It looks to be Latin-1 encoded. You should use:
unicode('\xbb', 'Latin-1')
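The answers above are written for Python 2. As a rough Python 3 sketch of the same idea: for this particular byte, CP1252 and Latin-1 give the same result, and they only differ in the 0x80-0x9F range (where Word's curly quotes live in CP1252).

```python
raw = b"\xbb"

# Not valid UTF-8 on its own: 0xBB is a continuation byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# Both single-byte codecs agree on 0xBB (right-pointing guillemet).
assert raw.decode("cp1252") == raw.decode("latin-1") == "\u00bb"

# They disagree on 0x80-0x9F.
assert b"\x93".decode("cp1252") == "\u201c"   # left double quotation mark
assert b"\x93".decode("latin-1") == "\x93"    # C1 control character
```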
Not sure what you are trying to do. But in Python 3 all strings are Unicode by default. In Python 2.x you have to use u'my unicode string \xbb' (or the double- or triple-quoted forms) to get unicode strings. When you want to print unicode strings, you have to encode them in a character set that is supported by the output device, e.g. the terminal: u'my unicode string \xbb'.encode('iso-8859-1'), for instance.
When I tried to get the content of a tag using "unicode(head.contents[3])", I got output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be returned as a string. How do I do this in Python?
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling that it is unicode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use the "encode" method to turn the unicode into e.g. a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
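The session above is Python 2. A rough Python 3 equivalent, assuming the name arrives as Latin-1 bytes (the variable names are illustrative), would be:

```python
# The name as raw Latin-1 bytes; 0xF6 is o with diaeresis.
raw = b"Christensen Sk\xf6ld"

# bytes -> str: decode with the encoding the bytes are actually in.
name = raw.decode("latin-1")
assert name == "Christensen Sk\u00f6ld"

# str -> bytes: re-encode as UTF-8 when another system needs it.
utf8 = name.encode("utf-8")
assert utf8 == b"Christensen Sk\xc3\xb6ld"
```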
I suspect that it's actually working correctly. By default, the interactive interpreter displays the repr of a string, which escapes non-ASCII characters, since not all terminals support Unicode. If you actually print the string, though, it should work. See the following example:
>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
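In Python 3 the picture is slightly different: repr() keeps printable non-ASCII characters, and ascii() gives the escaped form that Python 2 showed. A small sketch of the distinction:

```python
s = "\xcfa"   # two characters: U+00CF followed by 'a'
assert len(s) == 2

# ascii() escapes everything outside ASCII, like the Python 2 repr.
assert ascii(s) == r"'\xcfa'"

# repr() in Python 3 leaves printable non-ASCII characters as they are.
assert repr(s) == "'\u00cfa'"
```

Either way, the string holds the same two characters; only the displayed representation differs.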
Given a byte string with Unicode escapes such as b"\N{SNOWMAN}", b"\N{SNOWMAN}".decode('unicode-escape') will produce the expected Unicode string u'\u2603'.
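A Python 3 sketch of the same check follows. It doubles the backslash in the bytes literal, on the assumption that recent Python 3 versions warn about \N as an invalid escape inside bytes literals; the resulting bytes are the same.

```python
import unicodedata

# The literal text \N{SNOWMAN} as ASCII bytes.
escaped = b"\\N{SNOWMAN}"

# The unicode-escape codec resolves named escapes as well as \u and \x.
snowman = escaped.decode("unicode-escape")
assert snowman == "\u2603" == "\N{SNOWMAN}"

# The name can be looked up in both directions with unicodedata.
assert unicodedata.name(snowman) == "SNOWMAN"
assert unicodedata.lookup("SNOWMAN") == snowman
```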