I am working with external data that's encoded in latin1, so I've added a sitecustomize.py and in it called
sys.setdefaultencoding('latin_1')
Sure enough, working with latin1 strings now works fine.
But when I encounter something that cannot be encoded in latin1:
s=str(u'abc\u2013')
I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
What I would like is for the unencodable characters to simply be replaced, i.e. in the above example I would get s == 'abc?', and to do that without explicitly calling decode() or encode() each time, i.e. without s.encode(..., 'replace') on each call.
I tried doing different things with codecs.register_error but to no avail.
Please help?
There is a reason scripts can't call sys.setdefaultencoding. Don't do that; some libraries (including standard libraries shipped with Python) expect the default to be 'ascii'.
Instead, explicitly decode strings to Unicode when read into your program (via file, stdin, socket, etc.) and explicitly encode strings when writing them out.
Explicit decoding takes a parameter specifying behavior for undecodable bytes.
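In fact, the built-in 'replace' handler already produces exactly the 'abc?' asked for in the question; a minimal sketch, reusing the string from above:
text = u'abc\u2013'.encode('latin-1', 'replace')
print(text)  # b'abc?' on Python 3, 'abc?' on Python 2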
You can define your own custom handler and use it instead to do as you please. See this example:
import codecs
from logging import getLogger

log = getLogger()

def custom_character_handler(exception):
    log.error("%s for %s on %s from position %s to %s. Using '?' in-place of it!",
              exception.reason,
              exception.object[exception.start:exception.end],
              exception.encoding,
              exception.start,
              exception.end)
    return ("?", exception.end)

codecs.register_error("custom_character_handler", custom_character_handler)

print(b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf8', 'custom_character_handler'))
print(codecs.encode(u"abc\u03c0de", "ascii", "custom_character_handler"))
Running it, you will see:
invalid start byte for b'\xbb' on utf-8 from position 5 to 6. Using '?' in-place of it!
Føö?Bår
ordinal not in range(128) for π on ascii from position 3 to 4. Using '?' in-place of it!
b'abc?de'
References:
https://docs.python.org/3/library/codecs.html#codecs.register_error
https://docs.python.org/3/library/exceptions.html#UnicodeError
I am a newbie in Python.
I have a unicode string in Tamil.
When I use sys.getdefaultencoding() I get the output "Cp1252".
When I use text = testString.decode("utf-8") I get the error "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-8: character maps to <undefined>".
When I use the sys.getdefaultencoding() I get the output as "Cp1252"
Two comments on that: (1) it's "cp1252", not "Cp1252". Don't type from memory. (2) Whoever caused sys.getdefaultencoding() to produce "cp1252" should be told politely that that's not a very good idea.
As for the rest, let me guess. You have a unicode object that contains some text in the Tamil language. You try, erroneously, to decode it. Decoding means converting from a str object to a unicode object. Unfortunately you don't have a str object, and even more unfortunately you get bounced by one of the very few awkish/perlish warts in Python 2: it tries to make a str object by encoding your unicode string using the system default encoding. If that's 'ascii' or 'cp1252', the encoding will fail. That's why you get a UnicodeEncodeError instead of a UnicodeDecodeError.
Short answer: do text = testString.encode("utf-8"), if that's what you really want to do. Otherwise please explain what you want to do, and show us the result of print repr(testString).
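To see that wart in action, here is a minimal Python 2 sketch (the Tamil letter is just a stand-in for whatever testString contains):
testString = u'\u0b85'          # a unicode object (Tamil letter A)
testString.decode('utf-8')      # Python 2 first encodes it with the default codec -> UnicodeEncodeError
testString.encode('utf-8')      # '\xe0\xae\x85', which is probably what was wanted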
Add this as the first line of your code:
# -*- coding: utf-8 -*-
Later in your code:
text = unicode(testString, "UTF-8")
You need to know which character encoding testString uses; if it's not UTF-8, an error will occur when using decode('utf8').
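If you don't know the encoding, one common approach is to guess it. A sketch assuming the third-party chardet package and a hypothetical input.txt:
import chardet

raw = open('input.txt', 'rb').read()
guess = chardet.detect(raw)            # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
text = raw.decode(guess['encoding'])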
I have a fairly large python 2.6 application with lots of print statements sprinkled about. I'm using unicode strings throughout, and it usually works great. However, if I redirect the output of the application (like "myapp.py >output.txt"), then I occasionally get errors such as this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)
I guess the same issue comes up if someone has set their locale to ASCII. Now, I understand perfectly well the reason for this error: there are characters in my Unicode strings that cannot be encoded in ASCII. Fair enough. But I'd like my Python program to make a best effort to print something understandable, maybe skipping the suspicious characters or replacing them with their Unicode IDs.
This problem must be common... What is the best practice for handling it? I'd prefer a solution that allows me to keep using plain old "print", but I can modify all occurrences if necessary.
PS: I have now solved this problem. The solution was neither of the answers given. I used the method given at http://wiki.python.org/moin/PrintFails , as given by ChrisJ in one of the comments. That is, I replace sys.stdout with a wrapper that calls unicode encode with the correct arguments. Works very well.
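For reference, a minimal sketch of that wrapper approach (it assumes UTF-8 output is acceptable; the PrintFails page discusses variants for other terminals):
import codecs
import sys

# Every unicode string printed is now encoded as UTF-8 on the way out,
# with unencodable characters replaced instead of raising.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout, 'replace')

print u'\xa1Hola!'   # works even when stdout is redirected to a file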
If you're dumping to an ASCII terminal, encode manually using unicode.encode, and specify that errors should be ignored.
u = u'\xa0'
u.encode('ascii') # This fails
u.encode('ascii', 'ignore') # This drops unencodable characters; here the result is an empty string
If you want to store unicode files, try this:
u = u'\xa0'
print >>open('out', 'w'), u # This fails
print >>open('out', 'w'), u.encode('utf-8') # This is ok
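Alternatively (a sketch), open the file through codecs so the file object does the encoding and every write accepts unicode directly:
import codecs

out = codecs.open('out', 'w', encoding='utf-8')
print >>out, u'\xa0'   # the stream encodes to UTF-8 for you
out.close()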
Either wrap all your print statements in a method that performs the unicode -> utf8 conversion, or as a last resort change the Python default encoding from ascii to utf-8 in your site.py. In general it is a bad idea to print unicode strings unfiltered to sys.stdout, since Python will trigger an implicit conversion of the unicode string to the configured default encoding, which is ascii.
I have a site that displays user input by decoding it to unicode using utf-8. However, user input can include binary data, which obviously can't always be decoded as utf-8.
I'm using Python, and I get an error saying:
'utf8' codec can't decode byte 0xbf in position 0: unexpected code byte. You passed in '\xbf\xcd...
Is there a standard efficient way to convert those undecodable characters into question marks?
It would be most helpful if the answer uses Python.
Try:
inputstring.decode("utf8", "replace")
See here for reference
I think what you are looking for is:
str.decode('utf8','ignore')
which should drop invalid bytes rather than raising an exception.
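A quick sketch contrasting the two error handlers on the byte from the question:
data = '\xbfabc'                            # not valid UTF-8
print repr(data.decode('utf8', 'replace'))  # u'\ufffdabc' -- invalid byte becomes U+FFFD
print repr(data.decode('utf8', 'ignore'))   # u'abc'       -- invalid byte is dropped
Note that 'replace' substitutes U+FFFD rather than a literal question mark; if you really need '?', you can post-process with .replace(u'\ufffd', u'?') or register a custom error handler as shown in an earlier answer on this page.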
I'm trying to get Mako to render some strings with unicode characters:
tempLook=TemplateLookup(..., default_filters=[], input_encoding='utf8',output_encoding='utf-8', encoding_errors='replace')
...
print sys.stdout.encoding
uname=cherrypy.session['userName']
print uname
kwargs['_toshow']=uname
...
return tempLook.get_template(page).render(**kwargs)
The related template file :
...${_toshow}...
And the output is:
UTF-8
Deşghfkskhü
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1: ordinal not in range(128)
I don't think there's any problem with the string itself since I can print it just fine.
Although I've played (a lot) with the input/output_encoding and default_filters parameters, it always complains about being unable to decode/encode with the ascii codec.
So I decided to try the example found in the documentation, and the following works "best":
input_encoding='utf-8', output_encoding='utf-8'
# (note: it still raised an error without output_encoding, although the tutorial suggests it shouldn't be needed)
With
${u"voix m’a réveillé."}
And the result being
voix mâ€™a réveillé
I simply don't get why this doesn't work. "Magic encoding comments" don't work either. All the files are encoded in UTF-8.
I've spent hours on this to no avail; am I missing something?
Update :
I have a simpler question now :
Now that all the variables are unicode, how can I get Mako to render unicode strings without applying anything ? Passing a blank filter / render_unicode() doesn't help.
Yes, UTF-8 != Unicode.
UTF-8 is a specific string encoding, as are ASCII and ISO 8859-1. Try this:
For any input string, do inputstring.decode('utf-8') (or whatever input encoding you get). For any output string, do outputstring.encode('utf-8') (or whatever output encoding you want). For any internal use, work with unicode strings ('this is a normal string'.decode('utf-8') == u'this is a normal string').
'foo' is a bytestring; u'foo' is a unicode string, which doesn't "have" an encoding (it can't be decoded). So any time Python wants to change the encoding of a normal string, it first tries to "decode" it, then to "encode" it. And the default codec is "ascii", which fails more often than not :-)
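Applied to the Mako code from the question, that pattern looks roughly like this (a sketch; it assumes the session value arrives as a UTF-8 bytestring, which may not match your setup):
from mako.template import Template

template = Template(u'...${_toshow}...')
uname = 'De\xc5\x9fghfkskh\xc3\xbc'             # stand-in for cherrypy.session['userName']
uname = uname.decode('utf-8')                   # decode at the input boundary
page = template.render_unicode(_toshow=uname)   # stay in unicode internally
body = page.encode('utf-8')                     # encode once, at the output boundary
render_unicode() returns a unicode string and skips the output_encoding step; combined with decoding every input up front, it should avoid the implicit ascii conversions the question describes.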
OK, I have a hardcoded string I declare like this:
name = u"Par Catégorie"
I have a # -*- coding: utf-8 -*- magic header, so I am guessing it's converted to utf-8
Down the road it's outputted to xml through
xml_output.toprettyxml(indent='....', encoding='utf-8')
And I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Most of my data is in French and is outputted correctly in CDATA nodes, but that one hardcoded string keeps ... I don't see why an ascii codec is called.
What's wrong?
The coding header in your source file tells Python what encoding your source is in. It's the encoding Python uses to decode the source of the unicode string literal (u"Par Catégorie") into a unicode object. The unicode object itself has no encoding; it's raw unicode data. (Internally, Python will use one of two encodings, depending on how it was configured, but Python code shouldn't worry about that.)
The UnicodeDecodeError you get means that somewhere, you are mixing unicode strings and bytestrings (normal strings.) When mixing them together (concatenating, performing string interpolation, et cetera) Python will try to convert the bytestring into a unicode string by decoding the bytestring using the default encoding, ASCII. If the bytestring contains non-ASCII data, this will fail with the error you see. The operation being done may be in a library somewhere, but it still means you're mixing inputs of different types.
Unfortunately the fact that it'll work just fine as long as the bytestrings contain just ASCII data means this type of error is all too frequent even in library code. Python 3.x solves that problem by getting rid of the implicit conversion between unicode strings (just str in 3.x) and bytestrings (the bytes type in 3.x.)
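A minimal Python 2 illustration of the implicit conversion described above:
u = u'Par Cat\xe9gorie'    # unicode object
b = u.encode('utf-8')      # bytestring containing non-ASCII bytes
u + b                      # Python decodes b with the default 'ascii' codec -> UnicodeDecodeError
u + b.decode('utf-8')      # fine: both operands are unicode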
Wrong parameter name? From the doc, I can see the keyword argument name is supposed to be encoding and not coding.