Before someone says this is a duplicate question, I just want to point out that the error I am getting when running this program from the command line is different from all the other related questions I've seen.
I am trying to run a very short Python script:
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen("http://dictionary.reference.com/browse/word?s=t").read().strip()
dhtml = str(html, "utf-8").strip()
soup = BeautifulSoup(dhtml.strip(), "html.parser")
print(soup.prettify())
But I keep getting an error when I run this program with python.exe: UnicodeEncodeError: 'charmap' codec can't encode character '\u025c'. I have tried a lot of methods to get around this, and I have managed to isolate the problem to the conversion of bytes to strings. When I run this program in IDLE, I get the HTML as expected. What is it that IDLE is automatically doing? Can I use IDLE's interpretation program instead of python.exe? Thanks!
EDIT:
My problem is caused by print(soup.prettify()), yet type(soup.prettify()) returns str, so why does printing it fail?
RESOLVED:
I finally decided to use encode() and decode() to work around the trouble this has caused. If someone knows how to actually resolve the underlying issue, please post an answer; also, thank you for all your answers.
UnicodeEncodeError: 'charmap' codec can't encode character '\u025c'
The console's character encoding can't represent '\u025c', i.e., the "ɜ" character (U+025C LATIN SMALL LETTER REVERSED OPEN E).
What is it that IDLE is automatically doing?
IDLE displays Unicode directly (BMP characters only), provided the configured font supports the characters in question.
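A quick way to see the difference (a small sketch, not part of the original post) is to print the stream encoding in both environments:

import sys

# Run this once from the console (python.exe) and once inside IDLE.
# The console typically reports a legacy code page such as 'cp437' or 'cp1252',
# which has no mapping for U+025C; IDLE passes (BMP) characters straight to
# its Tk-based shell window instead.
print(sys.stdout.encoding)
print('\u025c')  # raises UnicodeEncodeError on the console, shows "ɜ" in IDLE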
Can I use IDLE's interpretation program instead of python.exe?
Yes, run:
T:\> py -midlelib -r your_script.py
Note: you could write arbitrary Unicode characters to the Windows console if the Unicode API is used:
T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py
See What's the deal with Python 3.4, Unicode, different languages and Windows?
I just want to let you know that the error I am getting from running this program in command line is different from all the other related questions I've seen.
Not really. You have PrintFails like everyone else.
The Windows console can't print Unicode. (This isn't strictly true, but going into exactly why, when and how you can get Unicode out of the console is a painful exercise and not usually worth it.) Trying to print a character that isn't in the console's limited encoding can't work, so Python gives you an error.
print them out (which I need an easier solution to, because I cannot do .encode("utf-8") for a lot of elements)
You could run the command set PYTHONIOENCODING=utf-8 before running the script to tell Python to use an encoding which can represent any character (so no errors), but any non-ASCII output will still come out garbled, as its encoding won't match the console's actual code page.
(Or indeed just use IDLE.)
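If you would rather handle it inside the script than via the environment, one option (a sketch for Python 3; replaced characters are lost, the same trade-off as above) is to rewrap sys.stdout with a forgiving error handler:

import io
import sys

# Characters the console code page cannot represent are printed as '?'
# instead of raising UnicodeEncodeError.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                              encoding=sys.stdout.encoding,
                              errors='replace',
                              line_buffering=True)

print('\u025c now prints as "?" instead of raising')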
Related
I am Python beginner so I hope this problem will be an easy fix.
I would like to print the value of an attribute as follows:
print (follower.city)
I receive the following error message:
File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input, self.errors, encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0130' in position 0: character maps to <undefined>
I think the problem is that cp850.py does not contain the relevant character in the encoding table.
What would be the solution to this problem? There is no absolute need to display the character correctly, but the error must be avoided. Do I need to modify cp850.py?
Sorry if this question has been addressed before, but I was not able to figure it out using previous answers to this topic.
To print a string it must first be converted from pure Unicode to the byte sequences supported by your output device. This requires an encode to the proper character set, which Python has identified as cp850 - the Windows Console default.
Starting with Python 3.3 you can set the Windows console to use UTF-8 with the following command issued at the command prompt:
chcp 65001
This should fix your issue, as long as you've configured the window to use a font that contains the character.
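If you cannot rely on chcp being run first, another option is to escape anything the active code page cannot represent, so printing never raises. A minimal sketch with a hypothetical safe_print helper (not part of the question's code):

import sys

def safe_print(text):
    # Escape characters the current console encoding cannot represent
    # ('\u0130' becomes the literal text \u0130) instead of raising.
    enc = sys.stdout.encoding or 'ascii'
    sys.stdout.write(text.encode(enc, 'backslashreplace').decode(enc) + '\n')

safe_print(u'City: \u0130stanbul')  # works on a cp850 console, just not pretty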
import requests
from bs4 import BeautifulSoup
res = requests.get(self.urlBase)
html = res.text
soup = BeautifulSoup(html)
print soup.prettify()
gives the error:
'ascii' codec can't encode character u'\xa0' in position 10816: ordinal not in range(128)
I'm using Requests and BeautifulSoup4.
I assume it has to do with Unicode? Every single example I have seen uses it this way without issues. I'm not sure why there's a problem with my encoding.
The content type is text/html; charset=UTF-8
Try
print soup.prettify().encode('ascii', 'ignore')
This encodes the prettified output, dropping every character it cannot represent in ASCII.
Without the 'ignore' argument, encode() throws an error as soon as it encounters a non-ASCII character.
You are correct that this has to do with Unicode: essentially, Python is saying that it can't print some characters directly to the command line, because '\xa0' is the no-break space and the console's codec cannot encode it. For fixing this specific problem, see this link.
Edit: see the comments below for more specific information regarding the print statement, as well as a more thorough and complete description of what may be causing the problem.
Edit: This link mentions the same error, and a comment there notes that the 'ascii' codec error is specific to Python 2.x, coming from requests and other urllib-style modules. This confirms my statement from before, although it is not exhaustively documented.
Now for some unsolicited advice:
If the program is small and does not have many dependencies or use libraries that only exist in Python 2, use Python 3. I started a web-scraping project earlier this summer in Python 2.7 and ran into several Unicode decoding errors that I ultimately could not resolve, even when I used the decoding methods on the strings themselves.
I then stumbled across the fact that Python 3 was made specifically to fix what Guido van Rossum himself said was "breaking Python": uniting Unicode and strings once and for all.
The reason I asked whether your code is relatively small is that I upgraded my whole script, about 400 lines, to Python 3 in a few minutes, especially since I had a good interpreter that told me about the syntax issues as they arose. There are a few differences, but not very many, and you will be happy that you did this.
Short-term fix: use the (limited) support Python 2 has for Unicode.
Long-term fix: Find a way to port to Python 3.
Edit: Because this code specifically involves the print statement, I retract my statements, as I do not have enough experience with it to build test cases in both Python 2.x and 3.x showing that a switch to Python 3 will necessarily fix this.
It would be worth a reply from the OP, however, to see if the issue is addressed.
Edit 2: To further make matters more inconclusive, I have tried the following codes in Python 2.7 and Python 3.4:
Python 2.7:
from bs4 import BeautifulSoup
soup = BeautifulSoup(u'string with "\xa0" character')
print soup.prettify()
Python 3.4:
from bs4 import BeautifulSoup
soup = BeautifulSoup('string with "\xa0" character')
print(soup.prettify())
Both ways return the same expected answer. Even removing the Unicode classifier from the string does not affect Python 2.7's output. Further investigation is needed.
print soup.prettify().encode('utf8')
Although, to inspect the contents, dumping the raw response before it goes through soup works better:
res = requests.get('urlfoobar')
print res.content
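A slightly fuller sketch of the same idea (Python 2, with a placeholder URL, not the asker's actual code): decode the response explicitly, keep everything unicode internally, and encode exactly once when printing:

import requests
from bs4 import BeautifulSoup

res = requests.get('http://example.com/')      # placeholder URL
res.encoding = res.encoding or 'utf-8'         # fall back to UTF-8 if requests could not detect a charset
soup = BeautifulSoup(res.text, 'html.parser')  # res.text is already unicode
print soup.prettify().encode('utf-8')          # encode once, at the output boundary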
I have searched and found some related problems but the way they deal with Unicode is different, so I can't apply the solutions to my problem.
I won't paste my whole code but I'm sure this isolated example code replicates the error:
(I'm also using wx for the GUI, so this is inside a class)
#coding: utf-8
...
self.something = u'ЧЕТЫРЕ'
# show the Russian text in a Label on the GUI
self.ExampleLabel.SetValue(str(self.something))
In Eclipse everything works perfectly and it displays the Russian characters. However, when I run the file directly with Python, I get this error on the command line:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11:
ordinal not in range(128)
I figured this has something to do with the command line not being able to output the Unicode characters, and Eclipse doing some behind-the-scenes magic. Any help on how to make it work on its own?
When you call str() on something without specifying an encoding, the default encoding is used, which depends on the environment your program is running in. In Eclipse, that's different from the command line.
Don't rely on the default encoding, instead specify it explicitly:
self.ExampleLabel.SetValue(self.something.encode('utf-8'))
You may want to study the Python Unicode HOWTO to understand what encoding and str() do with unicode objects. The wxPython project has a page on Unicode usage as well.
Try self.something.encode('utf-8') instead.
If you use repr instead of str, it handles the conversion for you and also covers the case where the object is not a string, but you may find that it adds an extra set of quotes, or even the unicode u prefix, in your context. repr is safer than str: str assumes ASCII encoding, while repr shows the code points the same way you would see them in code. Since wrapping the result in eval is supposed to give back what it was, the repr has to be in a form Python source could contain, namely ASCII-safe, since most Python code is written in ASCII.
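A short illustration of that difference (Python 2, reusing the question's Russian string):

# -*- coding: utf-8 -*-
something = u'ЧЕТЫРЕ'

print repr(something)            # u'\u0427\u0415\u0422\u042b\u0420\u0415' -- ASCII-safe, but quoted
print something.encode('utf-8')  # the UTF-8 bytes; displays correctly on a UTF-8 terminal
# str(something) would implicitly encode with the ascii codec and raise UnicodeEncodeError.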
I'm trying to understand the difference in a bit of Python script behavior when run on the command line vs run as part of an Emacs elisp function.
The script looks like this (I'm using Python 2.7.1 BTW):
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape")
that is, [in general] take a JSON segment containing Unicode characters, dump it to its Unicode-escaped string form, then decode it back to its Unicode representation. When run on the command line, the dumps part of this returns:
'{"Foo": "\\u30b6"}'
which when printed looks like:
'{"Foo": "\u30b6"}'
the decode part of this looks like:
u'{"Foo": "\u30b6"}'
which when printed looks like:
{"Foo": "ザ"}
i.e., the original string representation of the structure, at least in a terminal/console that supports unicode (in my testbed, an xterm). In a Windows console, the output is not correct with respect to the unicode character, but the script does not error out.
In Emacs, the dumps conversion is the same as on the command line (at least as far as confirming with a print), but the decode part blows out with the dreaded:
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u30b6' in position 9: ordinal not in range(128)
I've a feeling I'm missing something basic here with respect to either the script or Emacs (in my testbed 23.1.1). Is there some auto-magic part of print invoking the correct codec/locale that happens at the command line but not in Emacs? I've tried explicitly setting the locale for an Emacs invocation (here's a stub test without the json logic):
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s'"
produces the same exception, while
"LC_ALL=\"en_US.UTF-8\" python -c 'import sys; enc=sys.stdout.encoding; print enc' "
indicates that the encoding is 'None'.
If I attempt to coerce the conversion using:
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s.encode(\"utf8\",\"replace\")'"
the error goes away, but the result is the "garbled" version of the string seen in the non-unicode console:
Fooa?¶
Any ideas?
UPDATE: thanks to unutbu. Because the locale detection falls down, the command needs to be explicitly decorated with a utf8 encode (see the answer for working directly with a unicode string). In my case, I am getting what I need from the dumps/decode sequence, so I add the additional required decoration to achieve the desired result:
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape").encode("utf8","replace")
Note that this is the "raw" Python without the necessary escaping required by Emacs.
As you may have guessed from looking at the original part of this question, I'm using this as part of some JSON formatting logic in Emacs -- see my answer to this question.
The Python wiki page "PrintFails" says:
When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the "ascii" codec.
It appears that when python is run from an elisp function, it cannot detect the desired character set, so it defaults to "ascii". Trying to print unicode then tacitly causes python to encode it as ascii, which is the reason for the error.
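A way to make the script itself robust against that (a sketch in Python 2, matching the question; not the only possible fix) is to stop relying on print's implicit encoding and fall back explicitly when sys.stdout.encoding is None:

import json
import sys

t = {"Foo": u"\u30b6"}
out = json.dumps(t).decode("unicode_escape")  # u'{"Foo": "\u30b6"}'
encoding = sys.stdout.encoding or "utf-8"     # None when stdout is a pipe, e.g. under Emacs
sys.stdout.write(out.encode(encoding, "replace") + "\n")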
Replacing u\"Fooザ\" with u\"Foo\\u30b6\" seems to work:
(defun mytest ()
  (interactive)
  (shell-command-on-region (point)
                           (point) "LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Foo\\u30b6\"; print s.encode(\"utf8\",\"replace\")'" nil t))
C-x C-e M-x mytest
yields
Fooザ
I currently have serious problems with coding/encoding under Linux (Ubuntu). I never needed to deal with that before, so I don't have any idea why this actually doesn't work!
I'm parsing *.desktop files from /usr/share/applications/ and extracting information which is shown in the Web browser via an HTTPServer. I'm using jinja2 for templating.
First, I received UnicodeDecodeError at the call to jinja2.Template.render() which said that
utf-8 cannot decode character XXX at position YY [...]
So I made all values that come from my appfind module (which parses the *.desktop files) return only unicode strings.
That solved the problem at that point, but at some later point I write a string returned by a function to the BaseHTTPServer.BaseHTTPRequestHandler.wfile slot, and I can't get this error fixed, no matter what encoding I use.
At this point, the string that is written to wfile comes from jinja2.Template.render() which, afaik, returns a unicode object.
The bizarre part is that it works on my Ubuntu 12.04 LTS but not on my friend's Ubuntu 11.04 LTS. However, that might not be the reason: he has a lot more applications installed, and maybe their *.desktop files use encodings that raise the error.
However, I properly checked for the encoding in the *.desktop files:
data = dict(parser.items('Desktop Entry'))
try:
    encoding = data.get('encoding', 'utf-8')
    result = {
        'name': data['name'].decode(encoding),
        'exec': DKENTRY_EXECREPL.sub('', data['exec']).decode(encoding),
        'type': data['type'].decode(encoding),
        'version': float(data.get('version', 1.0)),
        'encoding': encoding,
        'comment': data.get('comment', '').decode(encoding) or None,
        'categories': _filter_bool(data.get('categories', '').
                                   decode(encoding).split(';')),
        'mimetypes': _filter_bool(data.get('mimetype', '').
                                  decode(encoding).split(';')),
    }
    # ...
Can someone please enlighten me about how I can fix this error? I have read in an answer on SO that I should always use unicode(), but that would be a lot of pain to implement, and I don't think it would fix the problem when writing to wfile anyway.
Thanks,
Niklas
This is probably obvious, but anyway: wfile is an ordinary byte stream, so everything must be encoded (unicode.encode()) before being written to it.
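A minimal sketch of what that looks like in practice (Python 2; the template, handler name, and content are illustrative, not the OP's actual code):

import BaseHTTPServer
from jinja2 import Template

template = Template(u'<html><body>{{ name }}</body></html>')

class AppHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        page = template.render(name=u'caf\xe9')   # jinja2 returns a unicode object
        body = page.encode('utf-8')               # wfile wants bytes
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == '__main__':
    BaseHTTPServer.HTTPServer(('localhost', 8080), AppHandler).serve_forever()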
Reading the OP, it is not clear to me exactly what is afoot. However, here are some tricks I have found helpful for debugging encoding problems. I apologize in advance if this is stuff you have long since transcended.
cat -v on a file will show all non-ASCII bytes in ^X/M-X notation, which is the only fool-proof way I have found to see what encoding a file really has. UTF-8 non-ASCII characters are multi-byte, which means they will show up as sequences of more than one such entry in cat -v.
Shell environment (LC_ALL, et al) is in my experience the most common cause of problems. Make sure you have a system that has locales with both UTF-8 and e.g. latin-1 available. Always set your LC_ALL to a locale that explicitly names an encoding, e.g. LC_ALL=sv_SE.iso88591.
In bash and zsh, you can run a command with specific environment changes for that command, like so:
$ LC_ALL=sv_SE.utf8 python ./foo.py
This makes it a lot easier to test than having to export different locales, and you won't pollute the shell.
Don't assume that you have unicode strings internally. Write assert statements that verify that strings are unicode.
assert isinstance(foo, unicode)
Learn to recognize mangled/misrepresented versions of common characters in the encodings you are working with. E.g. '\xe4' is "ä" in latin-1, and 'Ã¤' is the two UTF-8 bytes that make up "ä" mistakenly displayed as latin-1. I have found that knowing this sort of gorp cuts the time spent debugging encoding issues considerably.
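That last point is easy to reproduce (Python 2, purely illustrative): encode "ä" as UTF-8 and read the bytes back as latin-1, and you get exactly the two-character mangling described above:

# -*- coding: utf-8 -*-
a_umlaut = u'\xe4'                      # LATIN SMALL LETTER A WITH DIAERESIS
as_utf8 = a_umlaut.encode('utf-8')      # '\xc3\xa4' -- two bytes
print repr(as_utf8.decode('latin-1'))   # u'\xc3\xa4', i.e. u'Ã¤' -- the tell-tale mojibake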
You need to take a disciplined approach to your byte strings and Unicode strings. This explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?
By default, when Python hits an encoding issue with unicode, it throws an error. However, this behavior can be modified, for instance when the error is expected or not important.
Say you are converting between two code pages that are supersets of ASCII. They both have mostly the same characters, but there is no one-to-one correspondence, so you may want to ignore errors.
To do so, use the errors argument of the encode function.
mystring = u'This is a test'
print mystring.encode('utf-8', 'ignore')
print mystring.encode('utf-8', 'replace')
print mystring.encode('utf-8', 'xmlcharrefreplace')
print mystring.encode('utf-8', 'backslashreplace')
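Note that every valid Unicode string encodes to UTF-8 without error, so the four lines above all produce the same output. The handlers only behave differently against a narrower target encoding; a small illustration (not part of the original answer), using ASCII and a string that actually contains a non-ASCII character:

mystring = u'This is a t\xe9st'  # contains U+00E9, e with acute accent

print mystring.encode('ascii', 'ignore')            # This is a tst
print mystring.encode('ascii', 'replace')           # This is a t?st
print mystring.encode('ascii', 'xmlcharrefreplace') # This is a t&#233;st
print mystring.encode('ascii', 'backslashreplace')  # This is a t\xe9st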
There are lots of issues with unicode if the wrong encodings are used when reading/writing. Make sure that after you get the unicode string, you convert it to the form of unicode desired by jinja2.
If this doesn't help, could you please add the second error you see, with perhaps a code snippet to clarify what's going on?
Try using .encode(encoding) instead of .decode(encoding) in all its occurrences in the snippet.