Rise UnicodeEncodeError in logging.StreamHandler - python

I migrated my python code from Win10 host to WS2012R2. Surprisingly it stops operating correctly and now shows warning message: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to "
I've tried to execute a command:
set PYTHONLEGACYWINDOWSSTDIO=yes
My code:
import logging
import sys
def get_console_handler():
console_handler = logging.StreamHandler(sys.stdout)
return console_handler
def get_logger():
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logger.addHandler(get_console_handler())
return logger
my_logger = get_logger()
my_logger.debug("Это отладочное сообщение".encode("cp1252"))
What should I do to get rid of this warning?
Update
Colleagues, I am sorry for misleading you! I am obviously was tired after long hours of bug tracking )
The problem doesn't connect with "*.encode()" calling as such, it is connected with default python encoding while IO console operation (I suppose so)! The original code makes some requests from DB in cp1251 charset but the problem appears when python is trying to convert it to cp1252.
Here is another example of how to summon the error.
Create a plain text file, i.e. test.txt with text "Это отладочное сообщение" and save it cp1252.
Run python console and enter:
f = open("test.txt")
f.read()
Output:
f = open("test.txt")
f.read()
Traceback (most recent call last): File "<stdin>", line 1, in <module>
File "c:\project\venv\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 29: character maps to <undefined>

Use encode("utf-8"). Here is a list of python encodings: https://docs.python.org/2.4/lib/standard-encodings.html
my_logger.debug("Это отладочное сообщение".encode("utf-8"))
then use .decode("utf-8") to see the printable value of your string

The problem is how logging.StreamHandler performs console output, namely due to the fact that you couldn't change default encoding in contrast with FileHandler.
If the default system encoding doesn't match the needed one, you could face an issue.
For my example. I wanted to output cp1251 lines, while system default encoding was:
import locale
locale.getpreferredencoding()
'cp1252'
This question was solved by changing system locale (see https://stackoverflow.com/a/11234956/9851754). Choose "Change system locale..." for non-Unicode programs. No code changes needed.
import locale
locale.getpreferredencoding()
'cp1251'

I have tested your code with Python 3.6.8 and it worked for me (I didn't change anything).
Python 3.6.8:
>>> python3 -V
Python 3.6.8
>>> python3 test.py
Это отладочное сообщение
But when I have tested it with Python 2.7.15+, I got a similar error than you.
Python 2.7.15+ with your implementation:
>>> python2 -V
Python 2.7.15+
>>> python2 test.py
File "test.py", line 17
SyntaxError: Non-ASCII character '\xd0' in file test.py on line 17, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Then I have put the following line into the first line it worked for me.
Begging of code:
# -*- coding: utf-8 -*-
import logging
import sys
...
Output with Python 2.7.15+ and with modified code:
>>> python2 -V
Python 2.7.15+
>>> python2 test.py
Это отладочное сообщение

Related

UnicodeEncodeError when logging to file in PyCharm

I'm logging some Unicode characters to a file using "logging" in Python 3. The code works in the terminal, but fails with a UnicodeEncodeError in PyCharm.
I load my logging configuration using logging.config.fileConfig. In the configuration, I specify a file handler with encoding = utf-8. Logging to console works fine.
I'm using PyCharm 2019.1.1 (Community Edition). I don't think I've changed any relevant setting, but when I ran the same code in the PyCharm on another computer, the error was not reproduced. Therefore, I suspect the problem is related to a PyCharm setting.
Here is a minimal example:
import logging
from logging.config import fileConfig
# ok
print('1. café')
# ok
logging.error('2. café')
# UnicodeEncodeError
fileConfig('logcfg.ini')
logging.error('3. café')
The content of logcfg.ini (in the same directory) is the following:
[loggers]
keys = root
[handlers]
keys = file_handler
[formatters]
keys = formatter
[logger_root]
level = INFO
handlers = file_handler
[handler_file_handler]
class = logging.handlers.RotatingFileHandler
formatter = formatter
args = ('/tmp/test.log',)
encoding = utf-8
[formatter_formatter]
format = %(levelname)s: %(message)s
I expect to see the first two logging messages in the console, and the third one in the logging file. The first two logging statements worked fine, but the third one failed. Here is the complete console output in PyCharm:
1. café
ERROR:root:2. café
--- Logging error ---
Traceback (most recent call last):
File "/anaconda3/lib/python3.6/logging/__init__.py", line 996, in emit
stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)
Call stack:
File "/Users/klkh/test.py", line 12, in <module>
logging.error('3. café')
Message: '3. café'
Arguments: ()
You seem to use utf-8 as encode for your config file, python (when using pycharm) in your case seems to raise an encode error UnicodeEncodeError in place of guessing your config file encode wildly, because if it use a wrong encode all the config file gonna be decrypted differently from the original, so best to do is precising the encode type of the config in your python script
Notice : I can't seem to find documentation of fileConfig from logging.config so I'm using basicConfig
import logging
from logging.config import fileConfig
print('1. café')
logging.error('2. café')
logging.basicConfig(filename='your config' , encode='utf-8') # in your case the encode is utf-8
logging.error('3. café')
output:
1. café
ERROR:root:2. café
ERROR:root:3. café
I found a solution myself. I should pass the encoding like this:
args = ('/tmp/test.log', 'a', 0, 0, 'utf-8')
instead of
args = ('/tmp/test.log',)
encoding = utf-8
However, I'm still interested in knowing why the PyCharm on the other computer uses utf-8 by default. How do I set the default encoding for non-console streams in PyCharm?

Python: print binary as string

I have a crawler that parses the HMTL of a given site and prints parts of the source code. Here is my script:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import urllib.request
import re
class Crawler:
headers = {'User-Agent' : 'Mozilla/5.0'}
keyword = 'arroz'
def extra(self):
url = "http://buscando.extra.com.br/search?w=" + self.keyword
r = requests.head(url, allow_redirects=True)
print(r.url)
html = urllib.request.urlopen(urllib.request.Request(url, None, self.headers)).read()
soup = BeautifulSoup(html, 'html.parser')
return soup.encode('utf-8')
def __init__(self):
extra = self.extra()
print(extra)
Crawler()
My code works fine, but it prints the source with an annoying b' before the value. I already tried to use decode('utf-8') but it didn't work. Any ideas?
UPDATE
If I don't use the encode('utf-8') I have the following error:
Traceback (most recent call last):
File "crawler.py", line 25, in <module>
Crawler()
File "crawler.py", line 23, in __init__
print(extra)
File "c:\Python34\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position
13345: character maps to <undefined>
When I run your code as given except replacing return soup.encode('utf-8') with return soup, it works fine. My environment:
OS: Ubuntu 15.10
Python: 3.4.3
python3 dist-packages bs4 version: beautifulsoup4==4.3.2
This leads me to suspect that the problem lies with your environment, not your code. Your stack trace mentions cp850.py and your source is hitting a .com.br site - this makes me think that perhaps the default encoding of your shell can't handle unicode characters. Here's the Wikipedia page for cp850 - Code Page 850.
You can check the default encoding your terminal is using with:
>>> import sys
>>> sys.stdout.encoding
My terminal responds with:
'UTF-8'
I'm assuming yours won't and that this is the root of the issue you are running into.
EDIT:
In fact, I can exactly replicate your error with:
>>> print("\u2030".encode("cp850"))
So that's the issue - because of your computer's locale settings, print is implicitly converting to your system's default encoding and raising the UnicodeDecodeError.
Updating Windows to display unicode characters from the command prompt is a bit outside my wheelhouse so I can't offer any advice other than to direct you to a relevant question/answer.

Encoding and decoding in Python 2.7.5.1 on windows cmd and pycharm get diffrent result

I use this code to deal with chinese:
# -*- coding: utf-8 -*-
strInFilNname = u'%s' % raw_input("input fileName:").decode('utf-8')
pathName = u'%s' % raw_input("input filePath:").decode('utf-8')
When I run this on PyCharm everything is ok. But when I run this on windows CMD, I get this error code:
Traceback (most recent call last):
File "E:\Sites\GetAllFile.py", line 23, in <module>
strInFilNname = u'%s' % raw_input("input filename:").decode('utf-8')
File "E:\Portable Python 2.7.5.1\App\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
I have read this python document about Unicode HOWTO, but can't get effective solution.
I really want to know why it does so .
The Windows console encoding is not UTF-8. I will assume you are using a Chinese-localized version of Windows since you mention the errors go away in Python 3.3 and suggest trying sys.stdin.encoding instead of utf-8.
Below is an example from my US-localized Windows using characters from the cp437 code page, which is what the US console uses (Python 2.7.9):
This returns a byte string in the console encoding:
>>> raw_input('test? ')
test? │┤╡╢╖╕╣
'\xb3\xb4\xb5\xb6\xb7\xb8\xb9'
Convert to Unicode:
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> raw_input('test? ').decode(sys.stdin.encoding)
test? │┤╡╢╖╕╣║╗╝╜╛
u'\u2502\u2524\u2561\u2562\u2556\u2555\u2563\u2551\u2557\u255d\u255c\u255b'
Note it prints correctly:
>>> print(raw_input('test? ').decode(sys.stdin.encoding))
test? │┤╡╢╖╕╣║╗
│┤╡╢╖╕╣║╗
This works correctly for a Chinese Windows console as well as it will use the correct console encoding for Chinese. Here's the same code after switching my system to use Chinese:
>>> raw_input('Test? ')
Test? 我是美国人。
'\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb\xa1\xa3'
>>> import sys
>>> sys.stdin.encoding
'cp936'
>>> raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
u'\u6211\u662f\u7f8e\u56fd\u4eba\u3002'
>>> print raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
我是美国人。
Python 3.3 makes this much simpler:
>>> input('Test? ')
Test? 我是美国人。
'我是美国人。'

UTF-8 In Python logging, how?

I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example:
import logging
def logging_test():
handler = logging.FileHandler("/home/ted/logfile.txt", "w",
encoding = "UTF-8")
formatter = logging.Formatter("%(message)s")
handler.setFormatter(formatter)
root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.setLevel(logging.INFO)
# This is an o with a hat on it.
byte_string = '\xc3\xb4'
unicode_string = unicode("\xc3\xb4", "utf-8")
print "printed unicode object: %s" % unicode_string
# Explode
root_logger.info(unicode_string)
if __name__ == "__main__":
logging_test()
This explodes with UnicodeDecodeError on the logging.info() call.
At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this:
file_handler.write(unicode_string.encode("UTF-8"))
When it should be doing this:
file_handler.write(unicode_string)
Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation.
Having code like:
raise Exception(u'щ')
Caused:
File "/usr/lib/python2.7/logging/__init__.py", line 467, in format
s = self._fmt % record.__dict__
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
This happens because the format string is a byte string, while some of the format string arguments are unicode strings with non-ASCII characters:
>>> "%(message)s" % {'message': Exception(u'\u0449')}
*** UnicodeEncodeError: 'ascii' codec can't encode character u'\u0449' in position 0: ordinal not in range(128)
Making the format string unicode fixes the issue:
>>> u"%(message)s" % {'message': Exception(u'\u0449')}
u'\u0449'
So, in your logging configuration make all format string unicode:
'formatters': {
'simple': {
'format': u'%(asctime)-s %(levelname)s [%(name)s]: %(message)s',
'datefmt': '%Y-%m-%d %H:%M:%S',
},
...
And patch the default logging formatter to use unicode format string:
logging._defaultFormatter = logging.Formatter(u"%(message)s")
Check that you have the latest Python 2.6 - some Unicode bugs were found and fixed since 2.6 came out. For example, on my Ubuntu Jaunty system, I ran your script copied and pasted, removing only the '/home/ted/' prefix from the log file name. Result (copied and pasted from a terminal window):
vinay#eta-jaunty:~/projects/scratch$ python --version
Python 2.6.2
vinay#eta-jaunty:~/projects/scratch$ python utest.py
printed unicode object: ô
vinay#eta-jaunty:~/projects/scratch$ cat logfile.txt
ô
vinay#eta-jaunty:~/projects/scratch$
On a Windows box:
C:\temp>python --version
Python 2.6.2
C:\temp>python utest.py
printed unicode object: ô
And the contents of the file:
This might also explain why Lennart Regebro couldn't reproduce it either.
I'm a little late, but I just came across this post that enabled me to set up logging in utf-8 very easily
Here the link to the post
or here the code:
root_logger= logging.getLogger()
root_logger.setLevel(logging.DEBUG) # or whatever
handler = logging.FileHandler('test.log', 'w', 'utf-8') # or whatever
formatter = logging.Formatter('%(name)s %(message)s') # or whatever
handler.setFormatter(formatter) # Pass handler as a parameter, not assign
root_logger.addHandler(handler)
I had a similar problem running Django in Python3: My logger died upon encountering some Umlauts (äöüß) but was otherwise fine. I looked through a lot of results and found none working. I tried
import locale;
if locale.getpreferredencoding().upper() != 'UTF-8':
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
which I got from the comment above.
It did not work. Looking at the current locale gave me some crazy ANSI thing, which turned out to mean basically just "ASCII". That sent me into totally the wrong direction.
Changing the logging format-strings to Unicode would not help.
Setting a magic encoding comment at the beginning of the script would not help.
Setting the charset on the sender's message (the text came from a HTTP-reqeust) did not help.
What DID work was setting the encoding on the file-handler to UTF-8 in settings.py. Because I had nothing set, the default would become None. Which apparently ends up being ASCII (or as I'd like to think about: ASS-KEY)
'handlers': {
'file': {
'level': 'DEBUG',
'class': 'logging.handlers.TimedRotatingFileHandler',
'encoding': 'UTF-8', # <-- That was missing.
....
},
},
Try this:
import logging
def logging_test():
log = open("./logfile.txt", "w")
handler = logging.StreamHandler(log)
formatter = logging.Formatter("%(message)s")
handler.setFormatter(formatter)
root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.setLevel(logging.INFO)
# This is an o with a hat on it.
byte_string = '\xc3\xb4'
unicode_string = unicode("\xc3\xb4", "utf-8")
print "printed unicode object: %s" % unicode_string
# Explode
root_logger.info(unicode_string.encode("utf8", "replace"))
if __name__ == "__main__":
logging_test()
For what it's worth I was expecting to have to use codecs.open to open the file with utf-8 encoding but either that's the default or something else is going on here, since it works as is like this.
If I understood your problem correctly, the same issue should arise on your system when you do just:
str(u'ô')
I guess automatic encoding to the locale encoding on Unix will not work until you have enabled locale-aware if branch in the setencoding function in your site module via locale. This file usually resides in /usr/lib/python2.x, it worth inspecting anyway. AFAIK, locale-aware setencoding is disabled by default (it's true for my Python 2.6 installation).
The choices are:
Let the system figure out the right way to encode Unicode strings to bytes or do it in your code (some configuration in site-specific site.py is needed)
Encode Unicode strings in your code and output just bytes
See also The Illusive setdefaultencoding by Ian Bicking and related links.

Setting the encoding for sax parser in Python

When I feed a utf-8 encoded xml to an ExpatParser instance:
def test(filename):
parser = xml.sax.make_parser()
with codecs.open(filename, 'r', encoding='utf-8') as f:
for line in f:
parser.feed(line)
...I get the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "test.py", line 72, in search_test
parser.feed(line)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 29: ordinal not in range(128)
I'm probably missing something obvious here. How do I change the parser's encoding from 'ascii' to 'utf-8'?
Your code fails in Python 2.6, but works in 3.0.
This does work in 2.6, presumably because it allows the parser itself to figure out the encoding (perhaps by reading the encoding optionally specified on the first line of the XML file, and otherwise defaulting to utf-8):
def test(filename):
parser = xml.sax.make_parser()
parser.parse(open(filename))
Jarret Hardie already explained the issue. But those of you who are coding for the command line, and don't seem to have the "sys.setdefaultencoding" visible, the quick work around this bug (or "feature") is:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Hopefully reload(sys) won't break anything else.
More details in this old blog:
The Illusive setdefaultencoding
The SAX parser in Python 2.6 should be able to parse utf-8 without mangling it. Although you've left out the ContentHandler you're using with the parser, if that content handler attempts to print any non-ascii characters to your console, that will cause a crash.
For example, say I have this XML doc:
<?xml version="1.0" encoding="utf-8"?>
<test>
<name>Champs-Élysées</name>
</test>
And this parsing apparatus:
import xml.sax
class MyHandler(xml.sax.handler.ContentHandler):
def startElement(self, name, attrs):
print "StartElement: %s" % name
def endElement(self, name):
print "EndElement: %s" % name
def characters(self, ch):
#print "Characters: '%s'" % ch
pass
parser = xml.sax.make_parser()
parser.setContentHandler(MyHandler())
for line in open('text.xml', 'r'):
parser.feed(line)
This will parse just fine, and the content will indeed preserve the accented characters in the XML. The only issue is that line in def characters() that I've commented out. Running in the console in Python 2.6, this will produce the exception you're seeing because the print function must convert the characters to ascii for output.
You have 3 possible solutions:
One: Make sure your terminal supports unicode, then create a sitecustomize.py entry in your site-packages and set the default character set to utf-8:
import sys
sys.setdefaultencoding('utf-8')
Two: Don't print the output to the terminal (tongue-in-cheek)
Three: Normalize the output using unicodedata.normalize to convert non-ascii chars to ascii equivalents, or encode the chars to ascii for text output: ch.encode('ascii', 'replace'). Of course, using this method you won't be able to properly evaluate the text.
Using option one above, your code worked just fine for my in Python 2.5.
To set an arbitrary file encoding for a SAX parser, one can use InputSource as follows:
def test(filename, encoding):
parser = xml.sax.make_parser()
with open(filename, "rb") as f:
input_source = xml.sax.xmlreader.InputSource()
input_source.setByteStream(f)
input_source.setEncoding(encoding)
parser.parse(input_source)
This allows parsing an XML file that has a non-ASCII, non-UTF8 encoding. For example, one can parse an extended ASCII file encoded with LATIN1 like: test(filename, "latin1")
(Added this answer to directly address the title of this question, as it tends to rank highly in search engines.)
Commenting on janpf's answer (sorry, I don't have enough reputation to put it there), note that Janpf's version will break IDLE which requires its own stdout etc. that is different from sys's default. So I'd suggest modifying the code to be something like:
import sys
currentStdOut = sys.stdout
currentStdIn = sys.stdin
currentStdErr = sys.stderr
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = currentStdOut
sys.stdin = currentStdIn
sys.stderr = currentStdErr
There may be other variables to preserve, but these seem like the most important.

Categories

Resources