This question already has answers here:
Python: Converting from ISO-8859-1/latin1 to UTF-8
(5 answers)
Closed 1 year ago.
In Python 2.7, how do you convert a latin1 string to UTF-8.
For example, I'm trying to convert é to utf-8.
>>> "é"
'\xe9'
>>> u"é"
u'\xe9'
>>> u"é".encode('utf-8')
'\xc3\xa9'
>>> print u"é".encode('utf-8')
é
The letter is é which is LATIN SMALL LETTER E WITH ACUTE (U+00E9)
The UTF-8 byte encoding for is: c3a9
The latin byte encoding is: e9
How do I get the UTF-8 encoded version of a latin string? Could someone give an example of how to convert the é?
To decode a byte sequence from latin 1 to Unicode, use the .decode() method:
>>> '\xe9'.decode('latin1')
u'\xe9'
Python uses \xab escapes for unicode codepoints below \u00ff.
>>> '\xe9'.decode('latin1') == u'\u00e9'
True
The above Latin-1 character can be encoded to UTF-8 as:
>>> '\xe9'.decode('latin1').encode('utf8')
'\xc3\xa9'
>>> u"é".encode('utf-8')
'\xc3\xa9'
You've got a UTF-8 encoded byte sequence. Don't try to print encoded bytes directly. To print them you need to decode the encoded bytes back into a Unicode string.
>>> u"é".encode('utf-8').decode('utf-8')
u'\xe9'
>>> print u"é".encode('utf-8').decode('utf-8')
é
Notice that encoding and decoding are opposite operations which effectively cancel out. You end up with the original u"é" string back, although Python prints it as the equivalent u'\xe9'.
>>> u"é" == u'\xe9'
True
concept = concept.encode('ascii', 'ignore') concept =
MySQLdb.escape_string(concept.decode('latin1').encode('utf8').rstrip())
I do this, I am not sure if that is a good approach but it works everytime !!
Related
I have a Python 2.7 code which retrieves a base64 encoded response from a server. This response is decoded using base64 module (b64decode / decodestring functions, returning str). Its decoded content has the Unicode code points of the original strings.
I need to convert these Unicode code points to UTF-8.
The original string has a substring content "Não". When I decode the responded string, it shows:
>>> encoded_str = ... # server response
>>> decoded_str = base64.b64decode(encoded_str)
>>> type(decoded_str)
<type 'str'>
>>> decoded_str[x:y]
'N\xe3o'
When I try to encode to UTF-8, it leads to errors as
>>> (decode_str[x:y]).encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 2: ordinal not in range(128)
However, when this string is manually written in Unicode type, I can correctly convert it to my desired UTF-8 string.
>>> test_str = u'N\xe3o'
>>> test.encode('utf-8')
'N\xc3\xa3o'
I have to retrieve this response from the server and correctly generate an UTF-8 string which can be printed as "Não", how can I do this in Python 2?
You want to decode, not encode the byte string.
Think of it like this: a Unicode string was encoded into bytes, and these bytes were further encoded into base64.
To reverse this, you need to reverse both encodings, in the opposite order.
However, the sample you show most definitely isn't a valid UTF-8 byte string - 0xE3 in isolation is not a valid UTF-8 encoding. Most likely, the Unicode string was encoded using Latin-1 or a related encoding (the sample is much too small to establish this conclusively; other common candidates are the fugly Windows code page CP1252 and Latin-9).
>>> c='中文'
>>> c
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(c)
6
>>> cu=u'中文'
>>> cu
u'\u4e2d\u6587'
>>> len(cu)
2
>>> s='𤭢'
>>> s
'\xf0\xa4\xad\xa2'
>>> len(s)
4
>>> su=u'𤭢'
>>> su
u'\U00024b62'
>>> len(su)
2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'UTF-8'
First, I want to make some concepts clear myself.
I've learned that unicode string like cu=u'中文' ,actually is encoded in UTF-16 by python shell default. Right? So, when we saw '\u*' , that actually UTF-16 encoding? And '\u4e2d\u6587' is an unicode string or byte string? But cu has to be stored in the memory, so
0100 1110 0010 1101 0110 0101 1000 0111
(convert \u4e2d\u6587 to binary) is the form that cu preserved if that a byte string? Am I right?
But it can't be byte string. Otherwise len(cu) can't be 2, it should be 4!!
So it has to be unicode string. BUT!!! I've also learned that
python attempts to implicitly encode the Unicode string with whatever
scheme is currently set in sys.stdout.encoding, in this instance it's
"UTF-8".
>>> cu.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
So! how could len(cu) == 2??? Is that because there are two '\u' in it?
But that doesn't make len(su) == 2 sense!
Am I missing something?
I'm using python 2.7.12
The Python unicode type holds Unicode codepoints, and is not meant to be an encoding. How Python does this internally is an implementation detail and not something you need to be concerned with most of the time. They are not UTF-16 code units, because UTF-16 is another codec you can use to encode Unicode text, just like UTF-8 is.
The most important thing here is that a standard Python str object holds bytes, which may or may not hold text encoded to a certain codec (your sample uses UTF-8 but that's not a given), and unicode holds Unicode codepoints. In an interactive interpreter session, it is the codec of your terminal that determines what bytes are received by Python (which then uses sys.stdin.encoding to decode these as needed when you create a u'...' unicode object).
Only when writing to sys.stdout (say, when using print) does the sys.stdout.encoding value come to play, where Python will automatically encode your Unicode strings again. Only then will your 2 Unicode codepoints be encoded to UTF-8 again and written to your terminal, which knows how to interpret those.
You probably want to read up about Python and Unicode, I recommend:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
I have this code fragment (Python 2.7):
from bs4 import BeautifulSoup
content = ' foo bar';
soup = BeautifulSoup(content, 'html.parser')
w = soup.get_text()
At this point w has a byte with value 160 in it, but it's encoding is ASCII.
How do replace all of the \xa0 bytes by another character?
I've tried:
w = w.replace(chr(160), ' ')
w = w.replace('\xa0', ' ')
but I am getting the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
And why does BS return an ASCII encoded string with an invalid character in it?
Is there a way to convert w to a 'latin1` encoded string?
At this point w has a byte with value 160 in it, but it's encoding is 'ascii'.
You have an unicode string:
>>> w
u'\xa0 foo bar'
>>> type(w)
<type 'unicode'>
How do replace all of the \xa0 bytes with another character?
>>> x = w.replace(u'\xa0', ' ')
>>> x
u' foo bar'
And why does BS return an 'ascii' encoded string with an invalid character in it?
As mentioned above, it is not an ascii encoded string, but an Unicode string instance.
Is there a way to convert w to a 'latin1` encoded string?
Sure:
>>> w.encode('latin1')
'\xa0 foo bar'
(Note this last string is an encoded string, not an unicode object, and its representation is not prefixed by 'u' like the previous unicode objects).
Notes (edited):
If you are typing strings into your source files, note that encoding of source files matters. Python will assume your source files are ASCII. The command line interpreter, on the other hand, will assume you are entering strings in your default system encoding. Of course you can override all this.
Avoid latin1, use UTF-8 if possible: ie. w.encode('utf8')
When encoding and decoding can tell Python to ignore errors, or replace characters that cannot be encoded with some marker character . I don't recommend to ignore encoding errors (at least without logging them), except for the hopefully rare cases when you know there are encoding errors or you need to encode text into a more reduced character set, requiring replacement of the code points that cannot be represented (ie if you need to encode 'España' into ASCII, you definitely should replace the 'ñ'). But for these cases there are imho better alternatives and you should look into the magical unicodedata module (see https://stackoverflow.com/a/1207479/401656).
There is a Python Unicode HOWTO: https://docs.python.org/2/howto/unicode.html
When I use python module 'pygoogle' in chinese, I got url like u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
It's unicode but include ascii. I try to encode it back to utf-8 but the code be changed too.
a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
a.encode('utf-8')
>>> 'http://zh.wikipedia.org/zh/\xc3\xa6\xc2\xb1\xc2\x89\xc3\xa8\xc2\xaf\xc2\xad'
Also I try to use :
str(a)
but I got error :
UnicodeEncodeError: 'ascii' codec can't encode characters in position 27-32: ordinal not in range(128)
How can I encoding it for remove the 'u' ?
By the way, if there is not 'u' I will get correct result like:
s = 'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
print s
>>> http://zh.wikipedia.org/zh/汉语
You have a Mojibake; in this case those are UTF-8 bytes decoded as if they were Latin-1 bytes.
To reverse the process, encode to Latin-1 again:
>>> a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> a.encode('latin-1')
'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> print a.encode('latin-1')
http://zh.wikipedia.org/zh/汉语
The print worked because my terminal is configured to handle UTF-8. You can get a unicode object again by decoding as UTF-8:
>>> a.encode('latin-1').decode('utf8')
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
The ISO-8859-1 (Latin-1) codec maps one-on-one to the first 255 Unicode codepoints, which is why the string contents look otherwise unchanged.
You may want to use the ftfy library for jobs like these; it handles a wide variety of text issues, including Windows codepage Mojibake where some resulting 'codepoints' are not legally encodable to the codepage. The ftfy.fix_text() function takes Unicode input and repairs it:
>>> import ftfy
>>> ftfy.fix_text(a)
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
Suppose I have strings with lots of stuff like
âwords words words
Is there a way to convert these through python directly into the characters they represent?
I tried
h = HTMLParser.HTMLParser()
print h.unescape(x)
but got this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I also tried
print h.unescape(x).encode(utf-8)
but it encodes
â as â
when it should be a quote
â form a UTF-8 byte sequence, for the U+201C LEFT DOUBLE QUOTATION MARK character. Something is majorly mucked up there. The correct encoding would have been “.
You can use the HTML parser to unescape this, but you'll need to repair the resulting Mochibake:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> x = 'â'
>>> h.unescape(x)
u'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1')
'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1').decode('utf8')
u'\u201c'
>>> print h.unescape(x).encode('latin1').decode('utf8')
“
If printing still gives you a UnicodeEncodeError, then your terminal or console is incorrectly configured and Python is inadventently encoding to ASCII.
the problem is that you cannot decode unicode properly ... you need to convert it away from unicode to just utf8
x="âwords words words"
h = HTMLParser.HTMLParser()
msg=h.unescape(x) #this converts it to unicode string ..
downcast = "".join(chr(ord(c)&0xff) for c in msg) #convert it to normal string (python2)
print downcast.decode("utf8")
there may be a better way to do this in the HTMLParser library ...