Suppose I have strings with lots of stuff like
âwords words words
Is there a way to convert these through python directly into the characters they represent?
I tried
h = HTMLParser.HTMLParser()
print h.unescape(x)
but got this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I also tried
print h.unescape(x).encode(utf-8)
but it encodes
â as â
when it should be a quote
â form a UTF-8 byte sequence, for the U+201C LEFT DOUBLE QUOTATION MARK character. Something is majorly mucked up there. The correct encoding would have been “.
You can use the HTML parser to unescape this, but you'll need to repair the resulting Mochibake:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> x = 'â'
>>> h.unescape(x)
u'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1')
'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1').decode('utf8')
u'\u201c'
>>> print h.unescape(x).encode('latin1').decode('utf8')
“
If printing still gives you a UnicodeEncodeError, then your terminal or console is incorrectly configured and Python is inadventently encoding to ASCII.
the problem is that you cannot decode unicode properly ... you need to convert it away from unicode to just utf8
x="âwords words words"
h = HTMLParser.HTMLParser()
msg=h.unescape(x) #this converts it to unicode string ..
downcast = "".join(chr(ord(c)&0xff) for c in msg) #convert it to normal string (python2)
print downcast.decode("utf8")
there may be a better way to do this in the HTMLParser library ...
Related
I'm sanitizing a pandas dataframe and encounters unicode string that has a u inside it with a backslash than I need to replace e.g.
u'\u2014'.replace('\u','')
Result: u'\u2014'
I've tried encoding it as utf-8 then decoding it but that didn't work and I feel there must be an easier way around this.
pandas code
merged['Rank World Bank'] = merged['Rank World Bank'].astype(str)
Error
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 0: ordinal not in range(128)
u'\u2014' is actually -. It's not a number. It's a utf-8 character. Try using print keyword to print it . You will know
This is the output in ipython:
In [4]: print("val = ", u'\u2014')
val = —
Based on your comment, here is what you are doing wrong
"-" is not same as "EM Dash" Unicode character(u'\u2014')
So, you should do the following
print(u'\u2014'.replace("\u2014",""))
and that will work
EDIT:
since you are using python 2.x, you have to encode it with utf-8 as follows
u'\u2014'.encode('utf-8').decode('utf-8').replace("-","")
Yeah, Because it is taking '2014' followed by '\u' as a unicode string and not a string literal.
Things that can help:
Converting to ascii using .encode('ascii', 'ignore')
As you are using pandas, you can use 'encoding' parameter and pass 'ascii' there.
Do this instead : u'\u2014'.replace(u'\u2014', u'2014').encode('ascii', 'ignore')
Hope this helps.
I have this code fragment (Python 2.7):
from bs4 import BeautifulSoup
content = ' foo bar';
soup = BeautifulSoup(content, 'html.parser')
w = soup.get_text()
At this point w has a byte with value 160 in it, but it's encoding is ASCII.
How do replace all of the \xa0 bytes by another character?
I've tried:
w = w.replace(chr(160), ' ')
w = w.replace('\xa0', ' ')
but I am getting the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
And why does BS return an ASCII encoded string with an invalid character in it?
Is there a way to convert w to a 'latin1` encoded string?
At this point w has a byte with value 160 in it, but it's encoding is 'ascii'.
You have an unicode string:
>>> w
u'\xa0 foo bar'
>>> type(w)
<type 'unicode'>
How do replace all of the \xa0 bytes with another character?
>>> x = w.replace(u'\xa0', ' ')
>>> x
u' foo bar'
And why does BS return an 'ascii' encoded string with an invalid character in it?
As mentioned above, it is not an ascii encoded string, but an Unicode string instance.
Is there a way to convert w to a 'latin1` encoded string?
Sure:
>>> w.encode('latin1')
'\xa0 foo bar'
(Note this last string is an encoded string, not an unicode object, and its representation is not prefixed by 'u' like the previous unicode objects).
Notes (edited):
If you are typing strings into your source files, note that encoding of source files matters. Python will assume your source files are ASCII. The command line interpreter, on the other hand, will assume you are entering strings in your default system encoding. Of course you can override all this.
Avoid latin1, use UTF-8 if possible: ie. w.encode('utf8')
When encoding and decoding can tell Python to ignore errors, or replace characters that cannot be encoded with some marker character . I don't recommend to ignore encoding errors (at least without logging them), except for the hopefully rare cases when you know there are encoding errors or you need to encode text into a more reduced character set, requiring replacement of the code points that cannot be represented (ie if you need to encode 'España' into ASCII, you definitely should replace the 'ñ'). But for these cases there are imho better alternatives and you should look into the magical unicodedata module (see https://stackoverflow.com/a/1207479/401656).
There is a Python Unicode HOWTO: https://docs.python.org/2/howto/unicode.html
When I use python module 'pygoogle' in chinese, I got url like u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
It's unicode but include ascii. I try to encode it back to utf-8 but the code be changed too.
a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
a.encode('utf-8')
>>> 'http://zh.wikipedia.org/zh/\xc3\xa6\xc2\xb1\xc2\x89\xc3\xa8\xc2\xaf\xc2\xad'
Also I try to use :
str(a)
but I got error :
UnicodeEncodeError: 'ascii' codec can't encode characters in position 27-32: ordinal not in range(128)
How can I encoding it for remove the 'u' ?
By the way, if there is not 'u' I will get correct result like:
s = 'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
print s
>>> http://zh.wikipedia.org/zh/汉语
You have a Mojibake; in this case those are UTF-8 bytes decoded as if they were Latin-1 bytes.
To reverse the process, encode to Latin-1 again:
>>> a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> a.encode('latin-1')
'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> print a.encode('latin-1')
http://zh.wikipedia.org/zh/汉语
The print worked because my terminal is configured to handle UTF-8. You can get a unicode object again by decoding as UTF-8:
>>> a.encode('latin-1').decode('utf8')
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
The ISO-8859-1 (Latin-1) codec maps one-on-one to the first 255 Unicode codepoints, which is why the string contents look otherwise unchanged.
You may want to use the ftfy library for jobs like these; it handles a wide variety of text issues, including Windows codepage Mojibake where some resulting 'codepoints' are not legally encodable to the codepage. The ftfy.fix_text() function takes Unicode input and repairs it:
>>> import ftfy
>>> ftfy.fix_text(a)
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
My file is in unicode. However, for some reason, I want to change it to plain ascii while dropping any characters that are not recognized in ascii. For example, I want to change u'This is a string�' to just 'This is a string'. Following is the code I use to do so.
ascii_str = unicode_str.encode('ascii', 'ignore')
However, I still get the following annoying error.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0:
ordinal not in range(128)
How can I solve this problem? I am fine with plain ascii strings.
I assume that your unicode_str is a real unicode string.
>>> u"\xf3".encode("ascii", "ignore")
''
If not use this
>>> "\xf3".decode("ascii", "ignore").encode("ascii")
Always the best way would be, find out which encoding you deal with and than decode it. So you have an unicode string in the right format. This means start at unicode_str either to be a real unicode string or read it with the right codec. I assume that there is a file. So the very best would be:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
print repr(line)
Another desperate approach would be:
>>> import string
>>> a = "abc\xf3abc"
>>> "".join(b for b in a if b in string.printable)
'abcabc'
You need to decode it. if you have a file
with open('example.csv', 'rb') as f:
csv = f.read().decode("utf-8")
if you wanna decode a string, you can do it this way
data.decode('UTF-8')
UPDATE
You can use ord() to get code ascii of every character
d=u'This is a string'
l=[ord(s) for s in d.encode('ascii', 'ignore')]
print l
If you need to concatenate them, you can use join
print "".join(l)
As you have a Replacement character ( a symbol found in the Unicode standard at codepoint U+FFFD in the Specials table) in your string , you need to specify that for your interpreter before decoding , with add u at the leading of your string :
>>> unicode_str=u'This is a string�'
>>> unicode_str.encode('ascii', 'ignore')
'This is a string'
This question already has answers here:
Python: Converting from ISO-8859-1/latin1 to UTF-8
(5 answers)
Closed 1 year ago.
In Python 2.7, how do you convert a latin1 string to UTF-8.
For example, I'm trying to convert é to utf-8.
>>> "é"
'\xe9'
>>> u"é"
u'\xe9'
>>> u"é".encode('utf-8')
'\xc3\xa9'
>>> print u"é".encode('utf-8')
é
The letter is é which is LATIN SMALL LETTER E WITH ACUTE (U+00E9)
The UTF-8 byte encoding for is: c3a9
The latin byte encoding is: e9
How do I get the UTF-8 encoded version of a latin string? Could someone give an example of how to convert the é?
To decode a byte sequence from latin 1 to Unicode, use the .decode() method:
>>> '\xe9'.decode('latin1')
u'\xe9'
Python uses \xab escapes for unicode codepoints below \u00ff.
>>> '\xe9'.decode('latin1') == u'\u00e9'
True
The above Latin-1 character can be encoded to UTF-8 as:
>>> '\xe9'.decode('latin1').encode('utf8')
'\xc3\xa9'
>>> u"é".encode('utf-8')
'\xc3\xa9'
You've got a UTF-8 encoded byte sequence. Don't try to print encoded bytes directly. To print them you need to decode the encoded bytes back into a Unicode string.
>>> u"é".encode('utf-8').decode('utf-8')
u'\xe9'
>>> print u"é".encode('utf-8').decode('utf-8')
é
Notice that encoding and decoding are opposite operations which effectively cancel out. You end up with the original u"é" string back, although Python prints it as the equivalent u'\xe9'.
>>> u"é" == u'\xe9'
True
concept = concept.encode('ascii', 'ignore') concept =
MySQLdb.escape_string(concept.decode('latin1').encode('utf8').rstrip())
I do this, I am not sure if that is a good approach but it works everytime !!