I have a crawler that parses the HTML of a given site and prints parts of the source code. Here is my script:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import urllib.request
import re
class Crawler:
    headers = {'User-Agent': 'Mozilla/5.0'}
    keyword = 'arroz'

    def extra(self):
        url = "http://buscando.extra.com.br/search?w=" + self.keyword
        r = requests.head(url, allow_redirects=True)
        print(r.url)
        html = urllib.request.urlopen(urllib.request.Request(url, None, self.headers)).read()
        soup = BeautifulSoup(html, 'html.parser')
        return soup.encode('utf-8')

    def __init__(self):
        extra = self.extra()
        print(extra)

Crawler()
My code works fine, but it prints the source with an annoying b' before the value. I already tried to use decode('utf-8') but it didn't work. Any ideas?
UPDATE
If I don't use the encode('utf-8') I have the following error:
Traceback (most recent call last):
File "crawler.py", line 25, in <module>
Crawler()
File "crawler.py", line 23, in __init__
print(extra)
File "c:\Python34\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position
13345: character maps to <undefined>
When I run your code as given except replacing return soup.encode('utf-8') with return soup, it works fine. My environment:
OS: Ubuntu 15.10
Python: 3.4.3
python3 dist-packages bs4 version: beautifulsoup4==4.3.2
This leads me to suspect that the problem lies with your environment, not your code. Your stack trace mentions cp850.py and your source is hitting a .com.br site - this makes me think that perhaps the default encoding of your shell can't handle unicode characters. Here's the Wikipedia page for cp850 - Code Page 850.
You can check the default encoding your terminal is using with:
>>> import sys
>>> sys.stdout.encoding
My terminal responds with:
'UTF-8'
I'm assuming yours won't and that this is the root of the issue you are running into.
EDIT:
In fact, I can exactly replicate your error with:
>>> print("\u2030".encode("cp850"))
So that's the issue - because of your computer's locale settings, print is implicitly encoding to your system's default encoding and raising the UnicodeEncodeError.
Updating Windows to display unicode characters from the command prompt is a bit outside my wheelhouse so I can't offer any advice other than to direct you to a relevant question/answer.
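The failure is easy to reproduce on any platform by encoding to cp850 directly. Here is a sketch; `errors="replace"` is just one way to degrade gracefully instead of crashing, not a full fix for the console encoding:

```python
# Reproduce the failure, then show a fallback that degrades instead of raising.
text = "per mille: \u2030"

try:
    text.encode("cp850")  # U+2030 has no mapping in code page 850
    raised = False
except UnicodeEncodeError:
    raised = True

# errors="replace" substitutes '?' for characters the code page can't represent
safe = text.encode("cp850", errors="replace").decode("cp850")
print(raised, safe)  # True per mille: ?
```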
Related
I migrated my python code from a Win10 host to WS2012R2. Surprisingly, it stopped operating correctly and now shows the error message: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>"
I've tried to execute a command:
set PYTHONLEGACYWINDOWSSTDIO=yes
My code:
import logging
import sys

def get_console_handler():
    console_handler = logging.StreamHandler(sys.stdout)
    return console_handler

def get_logger():
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logger.addHandler(get_console_handler())
    return logger

my_logger = get_logger()
my_logger.debug("Это отладочное сообщение".encode("cp1252"))
What should I do to get rid of this warning?
Update
Colleagues, I am sorry for misleading you! I was obviously tired after long hours of bug tracking )
The problem is not connected with the "*.encode()" call as such; it is connected with the default Python encoding during console IO operations (I suppose so)! The original code makes some requests from a DB in the cp1251 charset, but the problem appears when Python tries to convert it to cp1252.
Here is another example of how to summon the error.
Create a plain text file, e.g. test.txt, with the text "Это отладочное сообщение" and save it in UTF-8.
Run the Python console and enter:

>>> f = open("test.txt")
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\project\venv\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 29: character maps to <undefined>
Use encode("utf-8"). Here is a list of python encodings: https://docs.python.org/2.4/lib/standard-encodings.html
my_logger.debug("Это отладочное сообщение".encode("utf-8"))
then use .decode("utf-8") to see the printable value of your string
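A minimal round-trip of that suggestion:

```python
# Round-trip: encode to bytes for transport or logging, decode to read it back.
s = "Это отладочное сообщение"
encoded = s.encode("utf-8")        # bytes, safe for any byte-oriented sink
decoded = encoded.decode("utf-8")  # back to the original printable text
print(decoded == s)  # True
```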
The problem is how logging.StreamHandler performs console output, namely that you cannot change its encoding, in contrast with FileHandler.
If the default system encoding doesn't match the needed one, you could face an issue.
In my example, I wanted to output cp1251 lines, while the system default encoding was:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
This question was solved by changing system locale (see https://stackoverflow.com/a/11234956/9851754). Choose "Change system locale..." for non-Unicode programs. No code changes needed.
>>> import locale
>>> locale.getpreferredencoding()
'cp1251'
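Side note on the FileHandler contrast mentioned above: FileHandler does accept an explicit encoding argument, so file output is not at the mercy of the system locale. A sketch (the temp-file path is illustrative):

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "app.log")

logger = logging.getLogger("encoding-demo")
logger.setLevel(logging.DEBUG)
# Unlike StreamHandler, FileHandler lets you pin the output encoding.
handler = logging.FileHandler(log_path, encoding="utf-8")
logger.addHandler(handler)
logger.debug("Это отладочное сообщение")
handler.close()

content = open(log_path, encoding="utf-8").read()
print(content.strip())
```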
I have tested your code with Python 3.6.8 and it worked for me (I didn't change anything).
Python 3.6.8:
>>> python3 -V
Python 3.6.8
>>> python3 test.py
Это отладочное сообщение
But when I tested it with Python 2.7.15+, I got a similar error to yours.
Python 2.7.15+ with your implementation:
>>> python2 -V
Python 2.7.15+
>>> python2 test.py
File "test.py", line 17
SyntaxError: Non-ASCII character '\xd0' in file test.py on line 17, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Then I put the following line at the top of the file and it worked for me.
Beginning of code:
# -*- coding: utf-8 -*-
import logging
import sys
...
Output with Python 2.7.15+ and with modified code:
>>> python2 -V
Python 2.7.15+
>>> python2 test.py
Это отладочное сообщение
I'm troubleshooting an existing python based Nagios plugin that uses PycURL to test that different actions can be taken on a remote WebDav service (GET,PUT,DELETE). We are having an issue when the service responds with a 301 redirection with the error "411 Length Required".
After checking the headers of the PUT requests for both the original service and to the redirected one, the latter is missing the "Content-Length" header, which is why this is failing. I haven't been able to find if there is an option that needs to be setup that is maybe needed for this to occur.
I am able to fix this in Python2 by adding the filesize using the option "INFILESIZE":
c.setopt(c.INFILESIZE, os.path.getsize(filepath))
The code looks like this:
#!/bin/python2
import pycurl
import os
filepath = '/tmp/testfile'
c = pycurl.Curl()
c.setopt(c.URL, 'http://remote_host.com/filename')
c.setopt(c.UPLOAD, 1)
file = open(filepath)
c.setopt(c.READDATA, file)
c.setopt(c.FOLLOWLOCATION, 1)
c.setopt(c.INFILESIZE, os.path.getsize(filepath))
c.perform()
c.close()
file.close()
However on Python3 (I've tried on 3.4, 3.6 and 3.7) the same code exits with error:
Traceback (most recent call last):
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 2: invalid continuation byte
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
pycurl.error: (42, 'operation aborted by callback')
And I do not understand whether this is an issue with the reply from the server. But if I remove the INFILESIZE option, it just fails with the 411 error mentioned above.
If anyone has any clue what I'm doing wrong, it would be greatly appreciated.
PycURL has different handling of files in Python 2 and Python 3. It sounds like you are running into this difference. See this manual page for the description of behavior: http://pycurl.io/docs/latest/files.html
It sounds like on Python 2, libcurl is able to perform a stat(2) call on the open file descriptor to figure out its size. On Python 3, no file descriptor is passed but a function, hence the stat(2) approach doesn't work and no file length is calculated for you.
To troubleshoot the Unicode decode error, on Python 3 change the code to use READFUNCTION rather than READDATA and see what error you get. If the server returns an error message in the response which is not valid utf-8, Python 3 may fail in the way you describe.
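The decode error itself can be reproduced without pycurl: on Python 3, `open()` in text mode decodes the file through a character codec, while an upload needs raw bytes. A self-contained sketch (the byte 0xda mirrors the byte in the traceback; the temp file stands in for /tmp/testfile):

```python
import os
import tempfile

# Write bytes that are not valid UTF-8, like the 0xda in the traceback above.
path = os.path.join(tempfile.mkdtemp(), "testfile")
with open(path, "wb") as f:
    f.write(b"\xda\x01\x02")

# Text mode: Python 3 decodes on read and fails on the invalid byte.
try:
    open(path, encoding="utf-8").read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

# Binary mode: bytes pass through untouched, which is what an upload needs.
data = open(path, "rb").read()
print(text_mode_failed, data)  # True b'\xda\x01\x02'
```

This suggests opening the upload file with mode 'rb' before handing it to pycurl, rather than the default text mode.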
I use this code to deal with chinese:
# -*- coding: utf-8 -*-
strInFilNname = u'%s' % raw_input("input fileName:").decode('utf-8')
pathName = u'%s' % raw_input("input filePath:").decode('utf-8')
When I run this in PyCharm everything is OK. But when I run it in the Windows CMD, I get this error:
Traceback (most recent call last):
File "E:\Sites\GetAllFile.py", line 23, in <module>
strInFilNname = u'%s' % raw_input("input filename:").decode('utf-8')
File "E:\Portable Python 2.7.5.1\App\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
I have read this python document about Unicode HOWTO, but can't get effective solution.
I really want to know why this happens.
The Windows console encoding is not UTF-8. I will assume you are using a Chinese-localized version of Windows since you are working with Chinese, and suggest trying sys.stdin.encoding instead of utf-8.
Below is an example from my US-localized Windows using characters from the cp437 code page, which is what the US console uses (Python 2.7.9):
This returns a byte string in the console encoding:
>>> raw_input('test? ')
test? │┤╡╢╖╕╣
'\xb3\xb4\xb5\xb6\xb7\xb8\xb9'
Convert to Unicode:
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> raw_input('test? ').decode(sys.stdin.encoding)
test? │┤╡╢╖╕╣║╗╝╜╛
u'\u2502\u2524\u2561\u2562\u2556\u2555\u2563\u2551\u2557\u255d\u255c\u255b'
Note it prints correctly:
>>> print(raw_input('test? ').decode(sys.stdin.encoding))
test? │┤╡╢╖╕╣║╗
│┤╡╢╖╕╣║╗
This works correctly for a Chinese Windows console as well as it will use the correct console encoding for Chinese. Here's the same code after switching my system to use Chinese:
>>> raw_input('Test? ')
Test? 我是美国人。
'\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb\xa1\xa3'
>>> import sys
>>> sys.stdin.encoding
'cp936'
>>> raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
u'\u6211\u662f\u7f8e\u56fd\u4eba\u3002'
>>> print raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
我是美国人。
Python 3.3 makes this much simpler:
>>> input('Test? ')
Test? 我是美国人。
'我是美国人。'
When trying to access a file with a Russian unicode character using win32file.CreateFile(), I get:
Traceback (most recent call last):
File "test-utf8.py", line 36, in <module>
None )
pywintypes.error: (123, 'CreateFile', 'The filename, directory name, or volume l
abel syntax is incorrect.')
Here's the code. I'm using Python 2.7. I verify that I can open it with regular Python 'open':
# -*- coding: UTF-8 -*-
# We need the line above to tell Python interpreter to use UTF8.
# You must save this file with UTF8 encoding.
'''
Testing UTF8 Encoding
'''
import win32file
import os, sys
path = u'C:\\TempRandom\\utf8-1\\boo\\hi это русский end - Copy.txt'
# Clean path when printing since Windows terminal only supports ASCII:
print("Path: " + path.encode(sys.stdout.encoding, errors='replace'))
# Test that you can open it with normal Python open:
normal_fp = open (path, mode='r')
normal_fp.close()
fileH = win32file.CreateFile(path,
                             win32file.GENERIC_READ,
                             win32file.FILE_SHARE_READ | win32file.FILE_SHARE_WRITE,
                             None,                     # No special security requirements
                             win32file.OPEN_EXISTING,  # Expect the file to exist
                             0,                        # Not creating, so attributes don't matter
                             None)                     # No template file
result, msg = win32file.ReadFile(fileH, 1000, None)
print("File Content >>")
print(msg)
The solution is to use CreateFileW and not CreateFile, with the same arguments:
fileH = win32file.CreateFileW(path, win32file.GENERIC_READ, ...)
Ironically, the documentation for CreateFile says it supports PyUnicode strings, but the underlying Windows function does not unless you use CreateFileW, which supports wide (Unicode) characters.
Thanks to this post discussing the C version of CreateFile:
How do I open a file named 𤭢.txt with CreateFile() API function?
When I feed a utf-8 encoded xml to an ExpatParser instance:
import codecs
import xml.sax

def test(filename):
    parser = xml.sax.make_parser()
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parser.feed(line)
...I get the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "test.py", line 72, in search_test
parser.feed(line)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 29: ordinal not in range(128)
I'm probably missing something obvious here. How do I change the parser's encoding from 'ascii' to 'utf-8'?
Your code fails in Python 2.6, but works in 3.0.
This does work in 2.6, presumably because it allows the parser itself to figure out the encoding (perhaps by reading the encoding optionally specified on the first line of the XML file, and otherwise defaulting to utf-8):
def test(filename):
    parser = xml.sax.make_parser()
    parser.parse(open(filename))
Jarret Hardie already explained the issue. But for those of you coding for the command line who don't seem to have "sys.setdefaultencoding" visible, the quick workaround for this bug (or "feature") is:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Hopefully reload(sys) won't break anything else.
More details in this old blog:
The Illusive setdefaultencoding
The SAX parser in Python 2.6 should be able to parse utf-8 without mangling it. Although you've left out the ContentHandler you're using with the parser, if that content handler attempts to print any non-ascii characters to your console, that will cause a crash.
For example, say I have this XML doc:
<?xml version="1.0" encoding="utf-8"?>
<test>
<name>Champs-Élysées</name>
</test>
And this parsing apparatus:
import xml.sax
class MyHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print "StartElement: %s" % name

    def endElement(self, name):
        print "EndElement: %s" % name

    def characters(self, ch):
        #print "Characters: '%s'" % ch
        pass

parser = xml.sax.make_parser()
parser.setContentHandler(MyHandler())
for line in open('text.xml', 'r'):
    parser.feed(line)
This will parse just fine, and the content will indeed preserve the accented characters in the XML. The only issue is the line in characters() that I've commented out. Run with that line uncommented in the console in Python 2.6, this produces the exception you're seeing, because print must convert the characters to ascii for output.
You have 3 possible solutions:
One: Make sure your terminal supports unicode, then create a sitecustomize.py entry in your site-packages and set the default character set to utf-8:
import sys
sys.setdefaultencoding('utf-8')
Two: Don't print the output to the terminal (tongue-in-cheek)
Three: Normalize the output using unicodedata.normalize to convert non-ascii chars to ascii equivalents, or encode the chars to ascii for text output: ch.encode('ascii', 'replace'). Of course, using this method you won't be able to properly evaluate the text.
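A sketch of option three (the sample string is illustrative):

```python
import unicodedata

ch = "Champs-Élysées"

# errors='replace' swaps each unencodable character for '?'
ascii_replace = ch.encode("ascii", "replace").decode("ascii")

# NFKD splits accented letters into base letter + combining mark,
# and 'ignore' then drops the unencodable combining marks.
ascii_normalized = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode("ascii")

print(ascii_replace)     # Champs-?lys?es
print(ascii_normalized)  # Champs-Elysees
```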
Using option one above, your code worked just fine for me in Python 2.5.
To set an arbitrary file encoding for a SAX parser, one can use InputSource as follows:
import xml.sax
import xml.sax.xmlreader

def test(filename, encoding):
    parser = xml.sax.make_parser()
    with open(filename, "rb") as f:
        input_source = xml.sax.xmlreader.InputSource()
        input_source.setByteStream(f)
        input_source.setEncoding(encoding)
        parser.parse(input_source)
This allows parsing an XML file that has a non-ASCII, non-UTF8 encoding. For example, one can parse an extended ASCII file encoded with LATIN1 like: test(filename, "latin1")
(Added this answer to directly address the title of this question, as it tends to rank highly in search engines.)
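Here is a self-contained sketch of the same mechanism end to end (the Collector handler and the temp file are illustrative, not from the question):

```python
import os
import tempfile
import xml.sax
import xml.sax.xmlreader

class Collector(xml.sax.handler.ContentHandler):
    """Accumulates character data so we can check what the parser decoded."""
    def __init__(self):
        self.chunks = []

    def characters(self, ch):
        self.chunks.append(ch)

# A latin1-encoded document with no encoding declaration in the prolog.
path = os.path.join(tempfile.mkdtemp(), "t.xml")
with open(path, "wb") as f:
    f.write('<?xml version="1.0"?><name>Champs-Élysées</name>'.encode("latin1"))

parser = xml.sax.make_parser()
handler = Collector()
parser.setContentHandler(handler)
with open(path, "rb") as f:
    src = xml.sax.xmlreader.InputSource()
    src.setByteStream(f)
    src.setEncoding("latin1")  # overrides the parser's utf-8 default
    parser.parse(src)

print("".join(handler.chunks))  # Champs-Élysées
```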
Commenting on janpf's answer (sorry, I don't have enough reputation to put it there): note that janpf's version will break IDLE, which requires its own stdout etc., different from sys's defaults. So I'd suggest modifying the code to something like:
import sys
currentStdOut = sys.stdout
currentStdIn = sys.stdin
currentStdErr = sys.stderr
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = currentStdOut
sys.stdin = currentStdIn
sys.stderr = currentStdErr
There may be other variables to preserve, but these seem like the most important.