I'm sanitizing a pandas dataframe and have run into a unicode string containing a backslash followed by a u that I need to replace, e.g.
u'\u2014'.replace('\u','')
Result: u'\u2014'
I've tried encoding it as utf-8 then decoding it but that didn't work and I feel there must be an easier way around this.
pandas code
merged['Rank World Bank'] = merged['Rank World Bank'].astype(str)
Error
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 0: ordinal not in range(128)
u'\u2014' is actually the em dash character, not a hyphen. It's not a number; it's a single Unicode character. Try printing it with print and you will see.
This is the output in ipython:
In [4]: print("val = ", u'\u2014')
val = —
Based on your comment, here is what you are doing wrong
A hyphen ("-") is not the same as the em dash Unicode character (u'\u2014').
So, you should do the following
print(u'\u2014'.replace("\u2014",""))
and that will work
EDIT:
Since you are using Python 2.x, the search string has to be a unicode literal as well. A round trip through UTF-8 changes nothing, and "-" is a hyphen, not the em dash, so do this instead:
u'\u2014'.replace(u'\u2014', u'')
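A minimal sketch of the replacement discussed above. It is written so that the same source should run on Python 2.7 and Python 3.3+ (an assumption; the thread targets Python 2), and the helper name remove_em_dash is hypothetical.

```python
# -*- coding: utf-8 -*-
# u'\u2014' is ONE character (the em dash), not a backslash, a 'u' and digits.
EM_DASH = u'\u2014'


def remove_em_dash(text):
    """Return text with every em dash removed (hypothetical helper)."""
    # Both arguments are unicode literals, so no implicit ASCII
    # conversion takes place on Python 2.
    return text.replace(EM_DASH, u'')


print(len(EM_DASH))                          # length of the escape: one character
print(repr(remove_em_dash(u'\u2014')))       # the lone dash is removed
print(repr(remove_em_dash(u'12\u201434')))   # digits around the dash survive
```

Searching for the character itself, rather than for the text of its escape sequence, is what makes the replacement match.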
Yes, because '\u' followed by '2014' is read as a single Unicode escape (one character), not as a literal backslash, a 'u' and four digits.
Things that can help:
Converting to ASCII using .encode('ascii', 'ignore')
As you are using pandas, you can use the 'encoding' parameter and pass 'ascii' there.
Do this instead: u'\u2014'.replace(u'\u2014', u'2014').encode('ascii', 'ignore')
Hope this helps.
Related
I am running into a problem where decoding a string raises one error and encoding it raises another (errors below). Is there a permanent solution for this?
P.S. Please note that you may not be able to reproduce the encoding error with the string I provided, as I couldn't copy/paste some of the characters.
text = "sometext"
string = '\n'.join(list(set(text)))
try:
    print "decode"
    text = string.decode('UTF-8')
except Exception as e:
    print e
    text = string.encode('UTF-8')
Errors:-
error while using string.decode('UTF-8')
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8')
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The First Error
The code you have provided will work, as text is a bytestring (you are using Python 2). The message says "can't encode", though, which suggests that your real string is already a unicode object: in Python 2, calling .decode() on a unicode object first encodes it implicitly to ASCII, and that only works if the string contains nothing but ASCII characters. In your case, it's encountering a Unicode character (specifically ☂, U+2602) which has no ASCII equivalent. If plain ASCII output is what you want, you can get around this behaviour by using:
string.encode('ascii', 'ignore')
Which will just ignore (i.e. replace with nothing) the characters that cannot be encoded into ASCII.
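To see what the error handlers do with a character that has no ASCII equivalent, here is a minimal sketch. It assumes Python 3 (the thread uses Python 2), and the names sample and encode_ascii are hypothetical.

```python
# The umbrella character from the error message has no ASCII equivalent.
UMBRELLA = u'\u2602'
sample = u'rain ' + UMBRELLA


def encode_ascii(text, errors):
    """Encode text to ASCII bytes using the given error handler."""
    return text.encode('ascii', errors)


# The default handler ('strict') raises instead of producing output.
try:
    encode_ascii(sample, 'strict')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print('strict raised:', strict_failed)

# The other standard handlers trade the exception for a lossy result.
for handler in ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace'):
    print(handler, encode_ascii(sample, handler))
```

'ignore' silently loses data, so 'backslashreplace' or 'xmlcharrefreplace' may be the safer choice when the original character has to stay recoverable.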
The Second Error
This error is more interesting. It appears the text you are passing on contains either NULL bytes or control characters, which the consumer of the string does not allow (the message asks for "XML compatible" strings). Again, the code that you have actually provided works, but something in your real text violates that rule. Note that the trick from above does not help here, because UTF-8 can encode NULL bytes and control characters without complaint, so
string.encode('UTF-8', 'ignore')
leaves them in place. You will have to strip the offending characters yourself, or look into what it is in your specific text input that is causing the problem.
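If the control characters have to go, one option is to strip them explicitly before handing the text on. A minimal sketch, assuming Python 3 and assuming that "XML compatible" means the character range allowed by XML 1.0; the helper name strip_xml_invalid is hypothetical.

```python
import re

# Characters allowed by the XML 1.0 "Char" production:
# tab, newline, carriage return, and everything from U+0020 upwards
# except the surrogate range and U+FFFE/U+FFFF.
_XML_INVALID = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_xml_invalid(text):
    """Remove NULL bytes and control characters that XML 1.0 forbids."""
    return _XML_INVALID.sub(u'', text)


dirty = u'a\x00b\x0bc\n\u2602'
print(repr(strip_xml_invalid(dirty)))
```

Tab, newline and carriage return are kept on purpose, since XML allows them.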
I am trying to run a bash command from my Python program which writes its result to a file. I am using os.system to execute the bash command, but I am getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 793: ordinal not in range(128)
I don't understand how to handle it. Please suggest a solution.
Have a look at this Blog post
These messages usually mean that you're either trying to mix Unicode strings with 8-bit strings, or trying to write Unicode strings to an output file or device that only handles ASCII.
Input data should be decoded to Unicode first. Assuming the string referred to by value is encoded as UTF-8:
value = unicode(value, "utf-8")
You need to encode your string as:
your_string = your_string.encode('utf-8')
For example:
>>> print(u'\u201c'.encode('utf-8'))
“
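When the text ends up in a file, an alternative to encoding by hand is to open the file with an explicit encoding. A minimal sketch using io.open, which exists on both Python 2.7 and Python 3; the helper names write_text and read_text and the temporary file are assumptions made for illustration.

```python
import io
import os
import tempfile


def write_text(path, text, encoding='utf-8'):
    """Write a unicode string to path using an explicit encoding."""
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(text)


def read_text(path, encoding='utf-8'):
    """Read path back into a unicode string using the same encoding."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()


# Round trip a string containing curly quotes through a temporary file.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
write_text(demo_path, u'\u201cquoted\u201d')
round_trip = read_text(demo_path)
os.remove(demo_path)

print(round_trip == u'\u201cquoted\u201d')
```

Naming the encoding at the file boundary keeps the program from falling back to the ASCII default that triggers the error.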
My file is in unicode. However, for some reason, I want to change it to plain ascii while dropping any characters that are not recognized in ascii. For example, I want to change u'This is a string�' to just 'This is a string'. Following is the code I use to do so.
ascii_str = unicode_str.encode('ascii', 'ignore')
However, I still get the following annoying error.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0:
ordinal not in range(128)
How can I solve this problem? I am fine with plain ascii strings.
I assume that your unicode_str is a real unicode string.
>>> u"\xf3".encode("ascii", "ignore")
''
If not, use this:
>>> "\xf3".decode("ascii", "ignore").encode("ascii")
The best way is always to find out which encoding you are dealing with and then decode from it, so that you end up with a proper unicode string. That means unicode_str must either already be a real unicode string, or be read with the right codec. I assume that there is a file, so the very best would be:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
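The "decode with the right codec, then drop what is not ASCII" idea can be sketched as one small function. This assumes Python 3, where bytes and text are separate types; the name bytes_to_ascii and its source_encoding parameter are hypothetical.

```python
def bytes_to_ascii(raw, source_encoding='utf-8'):
    """Decode raw bytes with a known encoding, then drop non-ASCII characters."""
    # Step 1: bytes -> text, using the encoding the data was written in.
    text = raw.decode(source_encoding)
    # Step 2: text -> ASCII, silently discarding anything outside ASCII.
    return text.encode('ascii', 'ignore').decode('ascii')


print(bytes_to_ascii(b'abc\xc3\xb3abc'))             # UTF-8 input
print(bytes_to_ascii(b'abc\xf3abc', 'latin-1'))      # Latin-1 input
```

The decoding step is where the original UnicodeDecodeError comes from: a byte such as 0xf3 only becomes a character once the right source encoding is named.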
Another desperate approach would be:
>>> import string
>>> a = "abc\xf3abc"
>>> "".join(b for b in a if b in string.printable)
'abcabc'
You need to decode it. If you have a file:
with open('example.csv', 'rb') as f:
    csv = f.read().decode("utf-8")
If you want to decode a string, you can do it this way:
data.decode('UTF-8')
UPDATE
You can use ord() to get the character code of every character:
d=u'This is a string'
l=[ord(s) for s in d.encode('ascii', 'ignore')]
print l
If you need to concatenate them, convert the numbers to strings first, because join only accepts strings:
print "".join(str(n) for n in l)
As you have a replacement character (a symbol found in the Unicode standard at code point U+FFFD in the Specials table) in your string, you need to tell the interpreter that the literal is unicode by adding a u prefix to it:
>>> unicode_str=u'This is a string�'
>>> unicode_str.encode('ascii', 'ignore')
'This is a string'
Suppose I have strings with lots of stuff like
&acirc;&#128;&#156;words words words
Is there a way to convert these through Python directly into the characters they represent?
I tried
h = HTMLParser.HTMLParser()
print h.unescape(x)
but got this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I also tried
print h.unescape(x).encode('utf-8')
but it encodes
&acirc;&#128;&#156; as â
when it should be a quote
&acirc;&#128;&#156; form a UTF-8 byte sequence for the U+201C LEFT DOUBLE QUOTATION MARK character. Something is majorly mucked up there. The correct encoding would have been the entity for “ (&#8220;).
You can use the HTML parser to unescape this, but you'll need to repair the resulting Mojibake:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> x = '&acirc;&#128;&#156;'
>>> h.unescape(x)
u'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1')
'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1').decode('utf8')
u'\u201c'
>>> print h.unescape(x).encode('latin1').decode('utf8')
“
If printing still gives you a UnicodeEncodeError, then your terminal or console is incorrectly configured and Python is inadvertently encoding to ASCII.
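The repair step (encode with the wrong codec, decode with the right one) can be wrapped in a small helper. A minimal sketch, assuming Python 3 and assuming the text was UTF-8 that got misread as Latin-1; the name repair_mojibake is hypothetical.

```python
def repair_mojibake(text, wrong='latin-1', right='utf-8'):
    """Undo a wrong decode: turn the text back into bytes, then decode properly."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mojibake of this kind, so leave it untouched.
        return text


broken = u'\xe2\x80\x9cwords words words'
print(repr(repair_mojibake(broken)))
```

Returning the input unchanged on failure keeps already correct strings from being damaged by a second repair attempt.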
The problem is that you cannot decode a unicode object properly ... you need to convert it from unicode back to a plain UTF-8 byte string first
import HTMLParser
x="&acirc;&#128;&#156;words words words"
h = HTMLParser.HTMLParser()
msg=h.unescape(x) #this converts it to unicode string ..
downcast = "".join(chr(ord(c)&0xff) for c in msg) #convert it to normal string (python2)
print downcast.decode("utf8")
there may be a better way to do this in the HTMLParser library ...
I'm dealing with unknown data and trying to insert into a MySQL database using Python/Django. I'm getting some errors that I don't quite understand and am looking for some help. Here is the error.
Incorrect string value: '\xEF\xBF\xBDs m...'
My guess is that the string is not being properly converted to unicode? Here is my code for unicode conversion.
s = unicode(content, "utf-8", errors="replace")
Without the above unicode conversion, the error I get is
'utf8' codec can't decode byte 0x92 in position 31: unexpected code byte. You passed in 'Fabulous home on one of Decatur\x92s most
Any help is appreciated!
What is the original encoding? I'm assuming "cp1252", from pixelbeat's answer. In that case, you can do
>>> orig # Byte string, encoded in cp1252
'Fabulous home on one of Decatur\x92s most'
>>> uni = orig.decode('cp1252')
>>> uni # Unicode string
u'Fabulous home on one of Decatur\u2019s most'
>>> s = uni.encode('utf8')
>>> s # Correct byte string encoded in utf-8
'Fabulous home on one of Decatur\xe2\x80\x99s most'
0x92 is right single curly quote in windows cp1252 encoding.
\xEF\xBF\xBD is the UTF8 encoding of the unicode replacement character
(which was inserted instead of the erroneous cp1252 character).
So it looks like your database is not accepting the valid UTF8 data?
2 options:
1. Perhaps you should be using unicode(content,"cp1252")
2. If you want to insert UTF-8 into the DB, then you'll need to configure it appropriately. I'll leave that answer to others more knowledgeable
The "Fabulous..." string doesn't look like utf-8: 0x92 is above 128 and as such should be a continuation of a multi-byte character. However, in that string it appears on its own (apparently representing an apostrophe).
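For data of unknown origin, one common approach is to try UTF-8 first and fall back to cp1252 only when that fails. A minimal sketch, assuming Python 3 and assuming cp1252 is the only other encoding in play (as guessed in the answers above); the name decode_with_fallback is hypothetical.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'cp1252')):
    """Return raw decoded with the first encoding that accepts it."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep going, but mark bad bytes with U+FFFD.
    return raw.decode(encodings[0], errors='replace')


orig = b'Fabulous home on one of Decatur\x92s most'
uni = decode_with_fallback(orig)
print(repr(uni))
print(repr(uni.encode('utf-8')))

# What errors="replace" did in the question: 0x92 becomes U+FFFD,
# which is stored as the bytes EF BF BD in UTF-8.
print(repr(b'\x92'.decode('utf-8', errors='replace').encode('utf-8')))
```

UTF-8 goes first because valid multi-byte UTF-8 rarely decodes sensibly as cp1252, whereas the reverse mistake usually fails loudly.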