bytes with embedded literal \xhh escapes to unicode - python

I have: b'{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
I need: '{"street": "Grosskölnstraße"}'
I tried:
s.decode('utf8'): # '{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
s.decode('unicode_escape'): # '{"street":"GrosskölnstraÃ\x9fe"}'
What's the correct way?

That's.. quite a mess you have there. That looks like UTF-8 bytes embedded as Python byte escape sequences.
There is no codec that'll produce bytes as output again; you'll need to use the unicode_escape sequence then re-encode as Latin-1 to go back to UTF8 bytes, then decode as UTF-8:
s.decode('unicode_escape').encode('latin1').decode('utf8')
Demo:
>>> s = b'{"street":"Grossk\\xc3\\xb6lnstra\\xc3\\x9fe"}'
>>> s.decode('unicode_escape').encode('latin1').decode('utf8')
'{"street":"Grosskölnstraße"}'
Another option is to target just the \x[hexdigits]{3} pattern in a regex; this may be the more robust option if the specific data wasn't produced by a faulty Python script:
import re
from functools import partial
escape = re.compile(rb'\\x([\da-f]{2})')
repair = partial(escape.sub, lambda m: bytes.fromhex(m.group(1).decode()))
repair() returns a bytes object:
>>> repair(s)
b'{"street":"Grossk\xc3\xb6lnstra\xc3\x9fe"}'
>>> repair(s).decode('utf8')
'{"street":"Grosskölnstraße"}'

Related

How to tell python that a string is actually bytes-object? Not converting

I have a txt file which contains a line:
' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
The contents in the double quotes is actually octal encoding, but with two escape characters.
After the line has been read in, I used regex to extract the contents in the double quotes.
c = re.search(r': "(.+)"', line).group(1)
After that, I have two problem:
First, I need to replace the two escape characters with one.
Second, Tell python that the str object c is actually a byte object.
None of them has been done.
I have tried:
re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)
All failed.
A bytes object can be easily define with 'b'.
c = b'\351\231\220\346\227\266\345\205\215\350\264\271'
How to change the variable type of a string to bytes? I think this not a encode-and-decode thing.
I googled a lot, but with no answers. Maybe I use the wrong key word.
Does anyone know how to do these? Or other way to get what I want?
This is always a little confusing. I assume your bytes object should represent a string like:
b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'
To get that with your escaped string, you could use the codecs library and try:
import re
import codecs
line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'
giving the same result.
The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
Step-by-Step:
>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s) # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1') # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹' # convert back to byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module's literal_eval function to turn the dictionary directly into a Python object, and then just fix this line of code:
>>> # Python dictionary-like text
d='{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d) # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'

Correctly decoding hex escaped unicode strings in python [duplicate]

I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode this string.
I used urllib.unquote_plus(str) but it works wrong.
expected : çöasd+fjkls%asd
result : çöasd fjkls%asd
double coded utf-8 characters(%C3%A7 and %C3%B6) are decoded wrong.
My python version is 2.7 under a linux distro.
What is the best way to get expected result?
You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you have got, without the confusion engendered by people with different console encodings communicating the results of print fubar.
Summary: either (a) you start with a unicode object and apply the unquote function to that or (b) you start off with a str object and your console encoding is not UTF-8.
If as you say you start off with a unicode object:
>>> s0 = u'%C3%A7%C3%B6asd+fjkls%25asd'
>>> print repr(s0)
u'%C3%A7%C3%B6asd+fjkls%25asd'
this is an accidental nonsense. If you apply urllibX.unquote_YYYY() to it, you get another nonsense unicode object (u'\xc3\xa7\xc3\xb6asd+fjkls%asd') which would cause your shown symptoms when printed. You should convert your original unicode object to a str object immediately:
>>> s1 = s0.encode('ascii')
>>> print repr(s1)
'%C3%A7%C3%B6asd+fjkls%25asd'
then you should unquote it:
>>> import urllib2
>>> s2 = urllib2.unquote(s1)
>>> print repr(s2)
'\xc3\xa7\xc3\xb6asd+fjkls%asd'
Looking at the first 4 bytes of that, it's encoded in UTF-8. If you do print s2, it will look OK if your console is expecting UTF-8, but if it's expecting ISO-8859-1 (aka latin1) you'll see your symptomatic rubbish (first char will be A-tilde). Let's park that thought for a moment and convert it to a Unicode object:
>>> s3 = s2.decode('utf8')
>>> print repr(s3)
u'\xe7\xf6asd+fjkls%asd'
and inspect it to see what we've actually got:
>>> import unicodedata
>>> for c in s3[:6]:
... print repr(c), unicodedata.name(c)
...
u'\xe7' LATIN SMALL LETTER C WITH CEDILLA
u'\xf6' LATIN SMALL LETTER O WITH DIAERESIS
u'a' LATIN SMALL LETTER A
u's' LATIN SMALL LETTER S
u'd' LATIN SMALL LETTER D
u'+' PLUS SIGN
Looks like what you said you expected. Now we come to the question of displaying it on your console. Note: don't freak out when you see "cp850"; I'm doing this portably and just happen to be doing this in a Command Prompt on Windows.
>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> print s3
çöasd+fjkls%asd
Note: the unicode object was explicitly encoded using sys.stdout.encoding. Fortunately all the unicode characters in s3 are representable in that encoding (and cp1252 and latin1).
Using either unquote or unquote_plus will give you a byte string. If you want a Unicode string then you have to decode the byte string to unicode:
>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd').decode('utf8'))
çöasd fjkls%asd
>>>
Compared with:
>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd'))
çöasd fjkls%asd
>>>
Note that your input string must be a byte string: if you pass unicode to unquote/unquote_plus then you'll get a bit of a mess. If this is the case then encode it first:
>>> print(urllib.unquote_plus(u'%C3%A7%C3%B6asd+fjkls%25asd'.encode('ascii')).decode('utf8'))
çöasd fjkls%asd
Try urllib2 once more:
print urllib2.unquote('%C3%A7%C3%B6asd+fjkls%25asd')
'%C3%A7%C3%B6asd+fjkls%25asd' - this is not a unicode string.
This is a url-encoded string. Use urllib2.unquote() instead.
You have a double problem: your string is unicode encoded and contains caracter urlencoded. Some match. You can normalize your string to ascci to be sure it won't be interpreted incorrectly:
>>> s = '%C3%A7%C3%B6asd+fjkls%25asd' # ascii string
>>> print urllib2.unquote(s) # works as expected
çöasd+fjkls%asd
>>> s = u'%C3%A7%C3%B6asd+fjkls%25asd' # unicode string
>>> print urllib2.unquote(s) # decode stuff that it shouldn't
çöasd+fjkls%asd
>>> print urllib2.unquote(s.encode('ascii')) # encode the unicode string to ascii: works!
çöasd+fjkls%asd
You are using unquote_plus method which is taking space into account and converting to +. Just use unquote method and you should be fine.
>>> import urllib
>>> print urllib.unquote('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd+fjkls%asd
>>> print urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd fjkls%asd

Python - ASCII encoding string in the unicode string; how to remove that 'u'?

When I use python module 'pygoogle' in chinese, I got url like u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
It's unicode but include ascii. I try to encode it back to utf-8 but the code be changed too.
a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
a.encode('utf-8')
>>> 'http://zh.wikipedia.org/zh/\xc3\xa6\xc2\xb1\xc2\x89\xc3\xa8\xc2\xaf\xc2\xad'
Also I try to use :
str(a)
but I got error :
UnicodeEncodeError: 'ascii' codec can't encode characters in position 27-32: ordinal not in range(128)
How can I encoding it for remove the 'u' ?
By the way, if there is not 'u' I will get correct result like:
s = 'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
print s
>>> http://zh.wikipedia.org/zh/汉语
You have a Mojibake; in this case those are UTF-8 bytes decoded as if they were Latin-1 bytes.
To reverse the process, encode to Latin-1 again:
>>> a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> a.encode('latin-1')
'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> print a.encode('latin-1')
http://zh.wikipedia.org/zh/汉语
The print worked because my terminal is configured to handle UTF-8. You can get a unicode object again by decoding as UTF-8:
>>> a.encode('latin-1').decode('utf8')
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
The ISO-8859-1 (Latin-1) codec maps one-on-one to the first 255 Unicode codepoints, which is why the string contents look otherwise unchanged.
You may want to use the ftfy library for jobs like these; it handles a wide variety of text issues, including Windows codepage Mojibake where some resulting 'codepoints' are not legally encodable to the codepage. The ftfy.fix_text() function takes Unicode input and repairs it:
>>> import ftfy
>>> ftfy.fix_text(a)
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'

Evaluate UTF-8 literal escape sequences in a string in Python3

I have a string of the form:
s = '\\xe2\\x99\\xac'
I would like to convert this to the character ♬ by evaluating the escape sequence. However, everything I've tried either results in an error or prints out garbage. How can I force Python to convert the escape sequence into a literal unicode character?
What I've read elsewhere suggests that the following line of code should do what I want, but it results in a UnicodeEncodeError.
print(bytes(s, 'utf-8').decode('unicode-escape'))
I also tried the following, which has the same result:
import codecs
print(codecs.getdecoder('unicode_escape')(s)[0])
Both of these approaches produce the string 'â\x99¬', which print is subsequently unable to handle.
In case it makes any difference the string is being read in from a UTF-8 encoded file and will ultimately be output to a different UTF-8 encoded file after processing.
...decode('unicode-escape') will give you string '\xe2\x99\xac'.
>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape')
'â\x99¬'
>>> _ == '\xe2\x99\xac'
True
You need to decode it. But to decode it, encode it first with latin1 (or iso-8859-1) to preserve the bytes.
>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
'♬'

How to convert a string to utf-8 in Python

I have a browser which sends utf-8 characters to my Python server, but when I retrieve it from the query string, the encoding that Python returns is ASCII. How can I convert the plain string to utf-8?
NOTE: The string passed from the web is already UTF-8 encoded, I just want to make Python to treat it as UTF-8 not ASCII.
In Python 2
>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)
^ This is the difference between a byte string (plain_string) and a unicode string.
>>> s = "Hello!"
>>> u = unicode(s, "utf-8")
^ Converting to unicode and specifying the encoding.
In Python 3
All strings are unicode. The unicode function does not exist anymore. See answer from #Noumenon
If the methods above don't work, you can also tell Python to ignore portions of a string that it can't convert to utf-8:
stringnamehere.decode('utf-8', 'ignore')
Might be a bit overkill, but when I work with ascii and unicode in same files, repeating decode can be a pain, this is what I use:
def make_unicode(inp):
if type(inp) != unicode:
inp = inp.decode('utf-8')
return inp
Adding the following line to the top of your .py file:
# -*- coding: utf-8 -*-
allows you to encode strings directly in your script, like this:
utfstr = "ボールト"
If I understand you correctly, you have a utf-8 encoded byte-string in your code.
Converting a byte-string to a unicode string is known as decoding (unicode -> byte-string is encoding).
You do that by using the unicode function or the decode method. Either:
unicodestr = unicode(bytestr, encoding)
unicodestr = unicode(bytestr, "utf-8")
Or:
unicodestr = bytestr.decode(encoding)
unicodestr = bytestr.decode("utf-8")
city = 'Ribeir\xc3\xa3o Preto'
print city.decode('cp1252').encode('utf-8')
In Python 3.6, they do not have a built-in unicode() method.
Strings are already stored as unicode by default and no conversion is required. Example:
my_str = "\u221a25"
print(my_str)
>>> √25
Translate with ord() and unichar().
Every unicode char have a number asociated, something like an index. So Python have a few methods to translate between a char and his number. Downside is a ñ example. Hope it can help.
>>> C = 'ñ'
>>> U = C.decode('utf8')
>>> U
u'\xf1'
>>> ord(U)
241
>>> unichr(241)
u'\xf1'
>>> print unichr(241).encode('utf8')
ñ
First, str in Python is represented in Unicode.
Second, UTF-8 is an encoding standard to encode Unicode string to bytes. There are many encoding standards out there (e.g. UTF-16, ASCII, SHIFT-JIS, etc.).
When the client sends data to your server and they are using UTF-8, they are sending a bunch of bytes not str.
You received a str because the "library" or "framework" that you are using, has implicitly converted some random bytes to str.
Under the hood, there is just a bunch of bytes. You just need ask the "library" to give you the request content in bytes and you will handle the decoding yourself (if library can't give you then it is trying to do black magic then you shouldn't use it).
Decode UTF-8 encoded bytes to str: bs.decode('utf-8')
Encode str to UTF-8 bytes: s.encode('utf-8')
The url is translated to ASCII and to the Python server it is just a Unicode string, eg.:
"T%C3%A9st%C3%A3o"
Python understands "é" and "ã" as actual %C3%A9 and %C3%A3.
You can encode an URL just like this:
import urllib
url = "T%C3%A9st%C3%A3o"
print(urllib.parse.unquote(url))
>> Téstão
See https://www.adamsmith.haus/python/answers/how-to-decode-a-utf-8-url-in-python for details.
you can also do this:
from unidecode import unidecode
unidecode(yourStringtoDecode)
You can use python's standard library codecs module.
import codecs
codecs.decode(b'Decode me', 'utf-8')
Yes, You can add
# -*- coding: utf-8 -*-
in your source code's first line.
You can read more details here https://www.python.org/dev/peps/pep-0263/

Categories

Resources