UnicodeDecodeError when using a Python string handling function - python

I'm doing this:
word.rstrip(s)
Where word and s are strings containing unicode characters.
I'm getting this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
There's a bug report where this error happens on some Windows Django systems. However, my situation seems unrelated to that case.
What could be the problem?
EDIT: The code is like this:
def Strip(word):
    for s in suffixes:
        return word.rstrip(s)

The issue is that s is a byte string while word is a unicode string, so Python tries to turn s into a unicode string so that the rstrip makes sense. The problem is that it assumes s is encoded in ASCII, which it clearly isn't (since it contains a character outside the ASCII range).
So, since you initialise it as a literal, it is very easy to turn it into a unicode string by putting a u in front of it:
suffixes = [u'ি']
will work. As you add more suffixes, you'll need the u in front of each of them individually.
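For illustration, a minimal Python 2 session reproducing the error and the fix (the word is a hypothetical example; the suffix is the one from the question, in a UTF-8 source file):
>>> word = u'মাছি'
>>> word.rstrip('ি')   # byte-string suffix: implicit ASCII decode fails
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
>>> word.rstrip(u'ি')  # unicode suffix: works
u'\u09ae\u09be\u099b'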

I guess this happens because of the implicit conversion in Python 2.
It's explained in this document, but I recommend reading the whole presentation about handling unicode in Python 2 and 3 (and why Python 3 is better ;-)).
So, I think the solution to your problem would be to force the decoding of the strings as UTF-8 before stripping.
Something like:
def Strip(word):
    word = word.decode("utf8")
    for s in suffixes:
        return word.rstrip(s.decode("utf8"))
Second try:
def Strip(word):
    if type(word) == str:
        word = word.decode("utf8")
    for s in suffixes:
        if type(s) == str:
            s = s.decode("utf8")
        return word.rstrip(s)

Related

Encoding and decoding for chars are not treated the same for polish letters

From another source I get two names with two Polish letters (ń and ó), like below:
piaseczyÅ„ski
zielonogórski
Of course there are more than two such names.
The 1st should look like piaseczyński, and the 2nd looks good. But when I use an operation to fix it:
str(entity_name).encode('1252').decode('utf-8') then the 1st is fixed, but the 2nd returns an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
Why are the Polish letters not treated the same?
How to fix it?
As you probably realise already, those strings have different encodings. The best approach is to fix it at the source, so that it always returns UTF-8 (or at least some consistent, known encoding).
If you really can't do that, you should try to decode as UTF-8 first, because it's more strict: not every string of bytes is valid UTF-8. If you get UnicodeDecodeError, try to decode it as some other encoding:
def decode_crappy_bytes(b):
    try:
        return b.decode('utf-8')
    except UnicodeDecodeError:
        return b.decode('1252')
Note that this can still fail, in two ways:
If you get a string in some non-UTF-8 encoding that happens to be decodable as UTF-8 as well.
If you get a string in a non-UTF-8 encoding that's not Windows codepage 1252. Another common one in Europe is ISO-8859-1 (Latin-1). Every bytestring that's valid in one is also valid in the other.
If you do need to deal with multiple different non-UTF-8 encodings and you know that it should be Polish, you could count the number of non-ASCII Polish letters in each possible decoding, and return the one with the highest score. Still not infallible, so really, it's best to fix it at the source.
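For what it's worth, here is a hedged sketch of that scoring idea (the candidate encodings and the letter set are my assumptions; for Polish, ISO-8859-2 or CP1250 are more likely legacy encodings than Latin-1):
POLISH_LETTERS = set(u'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ')

def decode_best_guess(raw, encodings=('utf-8', 'cp1252', 'iso-8859-2')):
    # Try each candidate encoding; keep the decoding with the most Polish letters.
    best, best_score = None, -1
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        score = sum(ch in POLISH_LETTERS for ch in text)
        if score > best_score:
            best, best_score = text, score
    return best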
@Thomas I added another except and now it works perfectly:
try:
    entity_name = entity_name.encode('1252').decode('utf-8')
except UnicodeDecodeError:
    pass
except UnicodeEncodeError:
    pass
Passed for żarski.

Struggling with utf-16 encoding/decoding

I'm parsing a document that has some UTF-16 encoded strings.
I have a byte string that contains the following:
my_var = b'\xc3\xbe\xc3\xbf\x004\x004\x000\x003\x006\x006\x000\x006\x00-\x001\x000\x000\x003\x008\x000\x006\x002\x002\x008\x005'
When converting to utf-8, I get the following output:
print(my_var.decode('utf-8'))
#> þÿ44036606-10038062285
The first two chars þÿ indicate it's a BOM for UTF-16BE, as indicated on Wikipedia
But, what I don't understand is that if I try the UTF16 BOM like this:
if value.startswith(codecs.BOM_UTF16_BE):
This returns false. In fact, printing codecs.BOM_UTF16_BE doesn't show the same results:
print(codecs.BOM_UTF16_BE)
#> b'\xfe\xff'
Why is that? I'm suspecting some encoding issue on the higher end, but I'm not sure how to fix it.
There are already a few mentions of how to decode UTF-16 on Stackoverflow (like this one), and they all say one thing: Decode using utf-16 and Python will handle the BOM.
... But that doesn't work for me.
print(my_var.decode('utf-16'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ㄀  ㌀㠀 㘀㈀㈀㠀㔀
But with UTF-16BE:
print(my_var.decode('utf-16be'))
#> 쎾쎿44036606-10038062285
(the BOM is not removed)
And with UTF-16LE:
print(my_var.decode('utf-16le'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ㄀  ㌀㠀 㘀㈀㈀㠀㔀
So, for a reason I can't explain, using only .decode('UTF-16') doesn't work for me. Why?
UPDATE
The original source string isn't the one I mentioned, but this one:
source = '\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'
I converted it using the following:
def decode_8bit(cls, match):
    value = match.group().replace(b'\\', b'')
    return chr(int(value, base=8)).encode('utf-8')

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)
Maybe I did something wrong here?
It is right that þÿ indicates the BOM for UTF-16BE, if you use the CP1252 encoding.
The difference is the following:
Your first byte is 0xC3, which is 11000011 in binary.
UTF-8:
The first two bits are set (and the third is clear), indicating that your UTF-8 character is 2 bytes long.
That gives 0xC3 0xBE for your first character, which is þ in UTF-8.
CP1252
CP1252 characters are always 1 byte long, and 0xC3 maps to Ã.
But if you look up 0xC3 in your linked BOM list, you won't find any matching encoding.
Looks like there wasn't a BOM in the first place.
Using the default encoding is probably the way to go, which is UTF-16LE for Windows.
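You can check that byte-level analysis in the interpreter yourself:
>>> format(0xC3, '08b')   # leading '110' marks a 2-byte UTF-8 sequence
'11000011'
>>> b'\xc3\xbe'.decode('utf-8')
'þ'
>>> b'\xc3'.decode('cp1252')
'Ã'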
Edit after original source added
Your encoding to UTF-8 destroys the BOM because it is not valid UTF-8. Try to avoid decoding, and pass the bytes along instead.
OP's solution:
bytes([int(value, base=8)])
As requested by @Tomalak and @Hyarus, here's the reason for my issue:
When decoding the 8-bit values, I was returning them UTF-8 encoded:
def decode_8bit(cls, match):
    value = match.group().replace(b'\\', b'')
    return chr(int(value, base=8)).encode('utf-8')

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)
This was garbling the returned data, since the original was not UTF-8 encoded (duh). So the correct code should have been:
def decode_8bit(cls, match):
    value = match.group().replace(b'\\', b'')
    return bytes([int(value, base=8)])  # one byte; bytes(n) would give n zero bytes

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)
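For reference, here is a self-contained version of that round trip (a sketch under the assumption that the escaped string from the update arrives as raw bytes; the utf-16 codec then consumes the BOM by itself):
import re

source = rb'\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'

def decode_8bit(match):
    # b'\\376' -> 0xFE -> b'\xfe'
    value = match.group().replace(b'\\', b'')
    return bytes([int(value, base=8)])

my_var = re.sub(rb'\\[0-9]{1,3}', decode_8bit, source)
print(my_var.decode('utf-16'))   # 44036606-10038062285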
Hope that helps someone else... Good luck with encoding! :/

Python 3 Decoding Strings

I understand that this is likely a repeat question, but I'm having trouble finding a solution.
In short I have a string I'd like to decode:
raw = "\x94my quote\x94"
string = decode(raw)
expected from string
'"my quote"'
Last point of note is that I'm working with Python 3 so raw is unicode, and thus is already decoded. Given that, what exactly do I need to do to "decode" the "\x94" characters?
string = "\x22my quote\x22"
print(string)
You don't need to decode; Python 3 does that for you, but you need the correct character code for the double quote ".
If, however, you have a different character set (it appears you have Windows-1252), then you need to decode the byte string from that character set:
str(b"\x94my quote\x94", "windows-1252")
If your string isn't a byte string, you have to encode it first; the latin-1 encoding works here because it maps code points 0-255 one-to-one onto bytes:
string = "\x94my quote\x94"
str(string.encode("latin-1"), "windows-1252")
I don't know if this is what you mean, but this works:
some_binary = b"\x94my quote\x94"
result = some_binary.decode("windows-1252")  # bare .decode() assumes UTF-8 and fails on 0x94
And you get the result...
If you don't know which encoding to choose, you can use chardet.detect:
import chardet
chardet.detect(some_binary)
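A hedged usage sketch (the dict chardet returns varies with the input, and 'encoding' can be None for very short inputs):
import chardet

some_binary = b"\x94my quote\x94"
guess = chardet.detect(some_binary)              # e.g. {'encoding': 'Windows-1252', 'confidence': ...}
encoding = guess['encoding'] or 'windows-1252'   # fall back if detection fails
print(some_binary.decode(encoding, 'replace'))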
Did you try it like this? I think you need to call decode as a method on the bytes object, and pass utf-8 as the argument. Add b in front of the string too.
string = b"\x94my quote\x94"
decoded_str = string.decode('utf-8', 'ignore')  # 'ignore' silently drops the invalid 0x94 bytes, losing the quotes
print(decoded_str)

Identify string ascii status in an if/else loop with Python

I might be asking this wrong, but please help if I am. I need to establish whether a string contains non-ascii characters in order to separate those strings from the purely ascii ones.
I gather strings from multiple separate files and need to remove the ones containing non-ascii characters so that I can place the remaining strings in a list to be used further. Without any filtering I get the following error while extracting the strings:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 40: ordinal not in range(128)
I would like to achieve the following:
Read string
if string contains non-ascii
->add to list
else
->do not add to list.
All I need to do is determine how to filter; I have the rest of the code intact.
You can attempt to encode the string and use a try/except to detect those that contain non-ascii characters. Something like this might work for you:
ascii_strings = []
non_ascii_strings = []
for s in sequence_of_strings:
    try:
        if isinstance(s, bytes):  # handle Python 3 byte strings
            _ = s.decode('ascii')
        else:
            _ = s.encode('ascii')
        ascii_strings.append(s)
    except UnicodeError:
        non_ascii_strings.append(s)
That's the general idea and it should work in Python 2 and 3.
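If you're on Python 3.7 or newer, str and bytes also have an isascii() method, which avoids the encode/decode round trip entirely (reusing sequence_of_strings from above):
ascii_strings = [s for s in sequence_of_strings if s.isascii()]
non_ascii_strings = [s for s in sequence_of_strings if not s.isascii()]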

Python, Encoding output to UTF-8

I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try x.write(string) I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'). But I need to use a variable, and the quotation marks in u'...' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially, all this is supposed to be doing is building a large number of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest attempting to open the input and/or output file with the encoding "utf-8-sig", which will automatically handle the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM. Python's default UTF-8 codec interprets BOMs as regular characters, so the input would not raise an issue, but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
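Putting those suggestions together, a hedged Python 2 sketch that reuses the asker's names (target, newTarget, actionT, and splitList come from the question and aren't defined here):
import codecs

source = codecs.open("actionbreak/" + target + '.csv', 'r', 'utf-8-sig')  # strips a UTF-8 BOM, if present
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', 'utf-8')

x = unicode(actionT(splitList[0], splitList[1]))  # pass unicode, not encoded str, to codecs streams
outTarget.write(x)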
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes, the second unicode code points. It is easy to determine which type a string is: a unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' chars are the same, because they are in the ASCII range. \u0430 is a unicode code point; it is outside the ASCII range. A "code point" is Python's internal representation of a unicode character, and code points can't be saved to a file directly: they need to be encoded to bytes first. Here is what an encoded unicode string looks like (as it is encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string can now be written to a file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
...     f.write(s.encode('utf8'))
Now it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode it. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
...     content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as in s.encode('utf8'). To decode, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got our unicode string with unicode code points back.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM or byte order mark, and it's a callback to the early days of unicode, when people couldn't agree on which byte order they wanted. Many unicode documents are now prefaced with a \ufeff; read with the wrong byte order, it shows up as \ufffe.
If you hit an error on those characters in the first position, the issue is almost certainly that you are not decoding the input with a BOM-aware codec (such as utf-8-sig); the file itself is probably fine.
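The stdlib's utf-8-sig codec exists for exactly this case: it strips a leading BOM that plain utf-8 keeps as a character. A quick Python 2 session to illustrate:
>>> data = u'\ufeffsome text'.encode('utf-8')
>>> data.decode('utf-8')
u'\ufeffsome text'
>>> data.decode('utf-8-sig')
u'some text'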
