I am trying to work with several documents that all have various encodings - some utf-8, some ISO-8859-2, some ascii etc. Is there a reliable way of decoding to a standard encoding for processing?
I have tried the following:
import chardet
encoding = chardet.detect(text)
text = unicode(text,encoding['encoding']).decode(sys.getdefaultencoding(),'ignore')
With the above code I still get UnicodeEncodeError errors
Use decode to convert bytes to unicode, and encode to convert unicode to bytes:
text.decode(encoding['encoding'], 'ignore').encode(sys.getdefaultencoding(), 'ignore')
Although I would recommend doing your processing on the unicode objects themselves, or UTF-8 encoded strings if you absolutely need to work with bytes. sys.getdefaultencoding() is 'ascii', which provides a very limited character set. See also: http://wiki.python.org/moin/DefaultEncoding
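For example, a minimal sketch of that approach, assuming text is a byte string and using chardet as in your snippet (the helper name is just illustrative):
import chardet

def to_unicode(text):
    # Guess the encoding; fall back to UTF-8 if detection comes back empty.
    guess = chardet.detect(text)['encoding'] or 'utf-8'
    # Decode once and keep the result as a unicode object for processing.
    return text.decode(guess, 'ignore')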
You probably mean encode:
u = unicode(text, encoding['encoding'], 'ignore')
text = u.encode(sys.getdefaultencoding(), 'ignore')
or equivalently and more commonly,
u = text.decode(encoding['encoding'], 'ignore')
text = u.encode(sys.getdefaultencoding(), 'ignore')
You may want ignore on both, as above: the incoming text may have invalid characters in it, causing it to fail to decode to Unicode, and it may have characters which can't be represented in the default encoding, causing it to fail to encode. (You may not actually want to ignore errors, though, since it looks like you were just trying to work around using the wrong function.)
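A quick interactive illustration of both failure modes (the string here is just an example):
>>> b = u'café'.encode('latin-1')        # byte string containing 0xe9
>>> b.decode('utf-8')                    # wrong codec for these bytes
Traceback (most recent call last):
  ...
UnicodeDecodeError: ...
>>> b.decode('utf-8', 'ignore')          # invalid byte is dropped
u'caf'
>>> u'café'.encode('ascii', 'ignore')    # unencodable character is dropped
'caf'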
So I have a message which is read from a file of unknown encoding. I want to send it to a webpage for display. I've grappled a lot with UnicodeErrors and have gone through many Q&As on StackOverflow, and I think I have a decent understanding of how Unicode and encoding works. My current code looks like this:
try:
    return message.decode(encoding='utf-8')
except:
    try:
        return message.decode(encoding='latin-1')
    except:
        print("Unable to entirely decode in latin or utf-8, will replace error characters with '?'")
        return message.decode(encoding='utf-8', errors="replace")
The returned message is then dumped into JSON and sent to the front end.
I assumed that because I'm using errors="replace" on the last try/except, I would avoid exceptions at the cost of having a few '?' characters in my display. An acceptable cost.
However, it seems that I was too hopeful, and for some files I still get a UnicodeDecodeError saying the 'ascii' codec can't decode some character. Why doesn't errors="replace" just take care of this?
(Also, as a bonus question: what does ascii have to do with any of this? I'm specifying UTF-8.)
You should not get a UnicodeDecodeError with errors='replace'. Also str.decode('latin-1') should never fail, because ISO-8859-1 has a valid character mapping for every possible byte sequence.
My suspicion is that message is already a unicode string, not bytes. Unicode text has already been ‘decoded’ from bytes and can't be decoded any more.
When you call .decode() on a unicode string, Python 2 tries to be helpful and decides to encode the Unicode string back to bytes (using the default encoding), so that you have something that you can really decode. This implicit encoding step doesn't use errors='replace', so if there are any characters in the Unicode string that aren't in the default encoding (probably ASCII) you'll get a UnicodeEncodeError.
(Python 3 no longer does this as it is terribly confusing.)
Check the type of message and assuming it is indeed Unicode, work back from there to find where it was decoded (possibly implicitly) to replace that with the correct decoding.
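As a rough sketch of that check (assuming Python 2 and that the bytes, if any, are UTF-8):
if isinstance(message, unicode):
    text = message                              # already decoded upstream
else:
    text = message.decode('utf-8', 'replace')   # only bytes need decoding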
decode with errors='replace' uses the 'replace' error handler (for text encodings only): it substitutes '?' for encoding errors (to be encoded by the codec), and '\ufffd' (the Unicode replacement character) for decoding errors.
Here, "text encoding" means a "codec which encodes Unicode strings to bytes."
Maybe your data is malformed; you could try the 'ignore' error handler, where malformed data is ignored and encoding or decoding continues without further notice:
message.decode(encoding='utf-8', errors="ignore")
I have characters \u002d, \u2019, \u2022, \u25ba, \u2013 etc. coming in my data.
I have to do json.loads(data)
I tried doing
data1 = data.encode('utf-8')
json.loads(data1)
I still get an error.
I also tried the below, but it ended up in an error:
b1 = data.encode('ascii', 'ignore')
b2 = json.loads(b1)
It works if I replace the characters in my data, like '\u002d' to '-', but I do not know what other characters might creep in. So I am looking for a solution which would encode these characters
There is no need to encode the data.
Feed it directly to json.loads(); the JSON standard uses \u.... escape codes to denote unicode values too.
The values are not encoded in UTF-8, the Python json module will handle them for you.
Even if the data was encoded in UTF-8, the json module will handle that for you as well. Even if it didn't, you'd use str.decode(), not encode.
UTF-8 data looks different as well; the U+2019 codepoint looks like:
>>> u'\u2019'.encode('utf8')
'\xe2\x80\x99'
when encoded to UTF-8.
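A quick interactive check of that point: json.loads turns the \u.... escapes into unicode values on its own (the double backslashes just keep the escapes literal inside the Python string):
>>> import json
>>> json.loads(u'["\\u002d", "\\u2019"]')
[u'-', u'\u2019']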
I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try to x.write(string) I get the UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'). But I need to use a variable, and the quotation marks in u'_' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially all this is supposed to be doing is building me a large amount of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest attempting to open the input and/or output file with the encoding "utf-8-sig", which will automatically handle the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, since Python's default UTF-8 codec interprets BOMs as regular characters, so the input would not raise an error but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
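Putting those suggestions together, a sketch of what the open/write pattern could look like (the file names here are placeholders, and 'utf-8-sig' on the input is only needed if the source file starts with a BOM):
import codecs

source = codecs.open('input.csv', 'r', 'utf-8-sig')     # strips a UTF-8 BOM if present
outTarget = codecs.open('output.csv', 'w+', 'utf-8')
for line in source:
    outTarget.write(line)    # codecs file objects expect unicode, not encoded bytes
source.close()
outTarget.close()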
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes, and the second contains Unicode code points. It is easy to tell which type of string you have, because a unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' characters are the same because they are in the ASCII range. \u0430 is a Unicode code point, and it is outside the ASCII range. A code point is Python's internal representation of a Unicode character; code points can't be saved to a file directly, so they need to be encoded to bytes first. Here is how the encoded unicode string looks (once it is encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string now can be written to file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
... f.write(s.encode('utf8'))
Now, it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
... content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You can see we've got encoded bytes, exactly the same as from s.encode('utf8'). To decode, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got our unicode string with Unicode code points back.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and it's basically a throwback to the early days of Unicode, when people couldn't agree which way they wanted their bytes to go. Many Unicode documents are prefaced with this character; it is always \ufeff in the decoded text, but the byte order the document was written in determines how it appears on disk (and decoding with the wrong byte order assumption shows up as \ufffe instead).
If you hit an error on that character in the first position, you can be sure the issue is that you are not decoding it as UTF-8 (with the BOM handled), and the file itself is probably fine.
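For illustration, here is how the BOM shows up with the plain utf-8 codec versus utf-8-sig in Python 2 (the bytes EF BB BF are the UTF-8 encoding of U+FEFF):
>>> '\xef\xbb\xbfabc'.decode('utf-8')        # plain utf-8 keeps the BOM as a character
u'\ufeffabc'
>>> '\xef\xbb\xbfabc'.decode('utf-8-sig')    # utf-8-sig strips it
u'abc'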
In a text file I'm processing, I have characters like ����. Not sure what they are.
I'm wondering how to remove/convert these characters.
I have tried to convert it into ASCII by using .encode('ascii', 'ignore'). Python told me the char is not within range(0, 128).
I have also tried unicodedata, unicodedata.normalize('NFKD', text).encode('ascii','ignore'), with the same error
Anyone help?
Thanks!
You can always take a Unicode string and use the code you showed:
my_ascii = my_uni_string.encode('ascii', 'ignore')
If that gave you an error, then you didn't really have a Unicode string to begin with. If that is true, then you have a byte string instead. You'll need to know what encoding it's using, and you can turn it into a Unicode string with:
my_uni_string = my_byte_string.decode('utf8')
(assuming your encoding is UTF-8).
This split between byte string and Unicode string can be confusing. My presentation, Pragmatic Unicode, or, How Do I Stop The Pain can help you to keep it all straight.
It's not perfect (especially for shorter strings) but the chardet library would be of use here:
http://pypi.python.org/pypi/chardet
To have chardet figure out the encoding and then decode to unicode, you would do:
import chardet
encoding = chardet.detect(some_string)['encoding']
unicode_string = unicode(some_string, encoding)
Of course, you won't be able to encode them as ascii if they're out of the ascii range.
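Combining the two answers above into one rough sketch (not guaranteed for every input, since chardet can guess wrong; the function name is just illustrative):
import chardet
import unicodedata

def to_plain_ascii(byte_string):
    # Guess the encoding and decode to a unicode string.
    guess = chardet.detect(byte_string)['encoding'] or 'utf-8'
    uni = byte_string.decode(guess, 'ignore')
    # Decompose accented characters, then drop anything outside ASCII.
    return unicodedata.normalize('NFKD', uni).encode('ascii', 'ignore')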
Ok, I have a hardcoded string I declare like this
name = u"Par Catégorie"
I have a # -*- coding: utf-8 -*- magic header, so I am guessing it's converted to utf-8
Down the road it's outputted to xml through
xml_output.toprettyxml(indent='....', encoding='utf-8')
And I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Most of my data is in French and is outputted correctly in CDATA nodes, but that one hardcoded string keeps ... I don't see why an ascii codec is called.
What's wrong?
The coding header in your source file tells Python what encoding your source is in. It's the encoding Python uses to decode the source of the unicode string literal (u"Par Catégorie") into a unicode object. The unicode object itself has no encoding; it's raw unicode data. (Internally, Python will use one of two encodings, depending on how it was configured, but Python code shouldn't worry about that.)
The UnicodeDecodeError you get means that somewhere, you are mixing unicode strings and bytestrings (normal strings.) When mixing them together (concatenating, performing string interpolation, et cetera) Python will try to convert the bytestring into a unicode string by decoding the bytestring using the default encoding, ASCII. If the bytestring contains non-ASCII data, this will fail with the error you see. The operation being done may be in a library somewhere, but it still means you're mixing inputs of different types.
Unfortunately the fact that it'll work just fine as long as the bytestrings contain just ASCII data means this type of error is all too frequent even in library code. Python 3.x solves that problem by getting rid of the implicit conversion between unicode strings (just str in 3.x) and bytestrings (the bytes type in 3.x.)
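A minimal Python 2 session that reproduces the trap described above (the string is just an example):
>>> bytes_fr = u"Catégorie".encode('utf-8')    # a UTF-8 byte string
>>> u"Par " + bytes_fr                         # mixing unicode and bytes
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)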
Wrong parameter name? From the doc, I can see the keyword argument name is supposed to be encoding and not coding.