exec() not working with unicode characters - python

Im trying to execute a .py program from within my python code, but non-ASCII characters behave oddly when printed and dealt with.
module1.py:
test = "áéíóúabcdefgçë"
print(test)
Main code:
exec(open("module1.py").read(), globals())
I want this to print áéíóúabcdefgçë but it instead prints áéíóúabcdefgçë. This happens with all non-ASCII characters i have tried.
I am using Python 3.7 and Windows 10.
Running module1.py individually does not produce this error, but i want to run the program using exec() or something else that has roughly the same function.

I found a way to fix the issue. Python's open is assuming some encoding other than UTF-8. Changing the main code to the following fixes the issue on my computer (python 3.7 and windows 10):
exec(open("module1.py", encoding="utf-8").read(),globals())
Thanks #jjramsey for additional information:
According to the Python documentation for open(), "The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)."
For me, if I run the following check:
import locale
print(locale.getpreferredencoding())
I get cp1252, which is notably not UTF-8 and so open() will cause the issues we have seen in this question, unless we specify the encoding.

Related

How to read excel Unicode characters using Python

I am receiving an Excel file whose content I cannot influence. It contains some Unicode characters like "á" or "é".
My code has been unchanged, but I migrated from Eclipse Juno to LiClipse together to a migration to a different python package (2.6 from 2.5). In principle the specific package I am using has a working version on win32com package.
When I read the Excel file my code is crashing when extracting and converting to to strings using str(). The console output is the following:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 89: ordinal not in range(128)
Being more concrete I perform the following:
Read the Excel:
xlApp = Dispatch("Excel.Application")
excel = xlApp.Workbooks.Open(excel_location)
in an internal loop I extract the value of the cell:
cell_value = self.excel.ActiveSheet.Cells(excel_line + 1, excel_column + 1)
and finally, if I try to convert cell_value to str, crashes:
print str(cell_value)
If I go to the Excel and remove the non-ASCII characters everything is working smoothly. I have tried this encode proposal. Any other solution I have googled proposes saving the file in a specific format, that I can't do.
What puzzles me is that the code was working before with the same input Excel but this change to LiClipse and 2.6 Python killed everything.
Any idea how can I progress?
This is a common problem when working with UTF-8 encoded Unicode data in Python 2.x. The handling of this has changed in a few places between 2.4 and 2.7, so it's no surprise that you suddenly get an error.
The source of the error is print: In Python 2.x, print doesn't try to assume what encoding your terminal supports. It just plays save and assumes that ascii is the only supported charset (which means characters between 0 and 127 are fine, everything else gives an error).
Now you convert a COMObject to a string. str is just a bunch of bytes (values 0 to 255) as far as Python 2.x is concerned. It doesn't have an encoding.
Combining the two is a recipe for trouble. When Python prints, it tries to validate the input (the string) and suddenly finds UTF-8 encoded characters (UTF-8 adds these odd \xe1 markers which tells the decoder that the next byte is special in some way; check Wikipedia for the gory details).
That's when the ascii encoder says: Sorry, can't help you there.
That means you can work with this value, compare it and such, but you can't print it. A simple fix for the printing problem is:
s = str(cell_value) # Convert COM -> UTF-8 encoded string
print repr(s) # repr() converts anything to ascii
If your terminal supports UTF-8, then you need to tell Python about it:
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
You should also have a look at sys.stdout.encoding which tells what Python currently thinks the output encoding is/should be. When Python 2 is properly configured (like on modern Linux distributions), then the correct codec for output should be used automatically.
Related:
Python 2 Unicode howto
Pragmatic Unicode, or, How do I stop the pain?
Setting the correct encoding when piping stdout in Python
.Cells(row,col) returns a Range object. You probably want the text from the cell:
cell = xl.ActiveSheet.Cells(1,2).Text
or
cell = xl.ActiveSheet.Range('B1').Text
The resulting value will be a Unicode string. To convert to bytes that you can write to a file, use .encode(encoding), for example:
bytes = cell.encode('utf8')
The below example uses the following spreadsheet:
import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
xl.Workbooks.Open(r'book1.xlsx')
cell = xl.ActiveSheet.Cells(1,2)
cell_value = cell.Text
print repr(cell)
print repr(cell_value)
print cell_value
Output (Note, Chinese will only print if console/IDE supports the characters):
<win32com.gen_py.Microsoft Excel 14.0 Object Library.Range instance at 0x129909424>
u'\u4e2d\u56fd\u4eba'
中国人
What is described here is a hack, you should not use as a long term
solution. Looking at the comments it could crush the terminal.
Finally I found a solution helped by the suggestion that #Huan-YuTseng provided, probably the solutions offered by other might work in other context but not in this one.
So, what happened is that I migrated from Eclipse Juno version (as Pydev stopped working due to Java upgrade needed that I can't accomplish in this computer) to LiClipse direct package (I did not upgraded a downloaded Eclipse version).
By default, in my LiClipse version (1.4.0.201502042042) the Console output is not by default utf-8. So I needed to change the output from either LiClipse or using my code. Fourtunately, there was another question related to a similar problem that helped me. You can see more details here, but essentially what you need to do is to include at the begginning of your code the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
And everything works. In the answers from #AarongDigulla the solution is there, but is actually the very last solution.
However, I need to say that LiClipse is giving me an error on sys.setdefaultencoding statement, that during execution is not creating any issue... no idea what's happening. That stopped me testing this solution before. Maybe there is something wrong in LiClipse (is alowing me to execute code with errors!)
Use 'utf-8 BOM' which in python used as utf_8_sig for Unicode character & also to avoid irrelevant results in Excel sheet.

How to get IDLE to accept paste of Unicode characters?

Oftentimes when I'm working interactively in IDLE, I'd like to paste a Unicode string into the IDLE window. It appears to paste properly but generates an error immediately. It has no trouble displaying the same character on output.
>>> c = u'ĉ'
Unsupported characters in input
>>> print u'\u0109'
ĉ
I suspect that the input window, like most Windows programs, uses UTF-16 internally and has no trouble dealing with the full Unicode set; the problem is that IDLE insists on coercing all input to the default mbcs code page, and anything not in that page gets rejected.
Is there any way to configure or cajole IDLE into accepting the full Unicode character set as input?
Python 3.2 handles this much better and has no trouble with anything I throw at it.
I know that I can simply save the code to a file in UTF-8 and import it, but I want to be able to work with Unicode characters in the interactive window.
I finally figured out a way. Since the sources to IDLE are part of the distribution you can make a couple of quick edits to enable the capability. The files will typically be found in C:\Python27\Lib\idlelib.
The first step is to prevent IDLE from trying to encode all those nice Unicode characters into a character set that can't handle them. This is controlled by IOBinding.py. Edit the file, find the section after if sys.platform == 'win32': and comment out this line:
#encoding = locale.getdefaultlocale()[1]
Now add this line after it:
encoding = 'utf-8'
I was hoping that there would be a way to override this with an environment variable or something, but getdefaultlocale calls directly into a Win32 function that gets the globally configured Windows mbcs encoding.
This is half the battle, now we need to get the command line interpreter to recognize that the input bytes are UTF-8 encoded. It didn't appear that there was a way to pass an encoding into the interpreter, so I came up with the mother of all hacks. Maybe someone with a little more patience can come up with a better way, but this works for now. The input is processed in PyShell.py, in the runsource function. Change the following:
if isinstance(source, types.UnicodeType):
from idlelib import IOBinding
try:
source = source.encode(IOBinding.encoding)
except UnicodeError:
self.tkconsole.resetoutput()
self.write("Unsupported characters in input\n")
return
To:
from idlelib import IOBinding # line moved
if isinstance(source, types.UnicodeType):
try:
source = source.encode(IOBinding.encoding)
except UnicodeError:
self.tkconsole.resetoutput()
self.write("Unsupported characters in input\n")
return
source = "#coding=%s\n%s" % (IOBinding.encoding, source) # line added
We're taking advantage of PEP 263 to specify the encoding for each line of input provided to the interpreter.
Update: In Python 2.7.10 it is no longer necessary to make the change in PyShell.py, it already works properly if the encoding is set to utf-8. Unfortunately I haven't found a way to bypass the change in IOBinding.py.

unicode string tolerance in Eclipse+PyDev

I'm using Eclipse+PyDev to write code and often face unicode issues when moving this code to production. The reason is shown in this little example
a = u'фыва '\
'фыва'
If Eclipse see this it creates unicode string like nothing happened, but if type same command directly to Python shell(Python 2.7.3) you'll get this:
SyntaxError: (unicode error) 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)
because correct code is:
a = u'фыва '\
u'фыва'
But because of Eclipse+PyDev's "tolerance" I always get in trouble :( How can I force PyDev to "follow the rules"?
This happens because the encoding for the console is utf-8.
There's currently no way to set that globally in the UI, although you can change it by editing: \plugins\org.python.pydev_2.7.6\pysrc\pydev_sitecustomize\sitecustomize.py
And just remove the call to: (line 108) sys.setdefaultencoding(encoding)
This issue should be fixed in PyDev 3.4.0 (not released yet). Fabio (PyDev maintainer) says: "from now on PyDev will only set the PYTHONIOENCODING and will no longer change the default encoding". And PYTHONIOENCODING is supported since Python 2.6.
Here is the commit on GitHub.
Try adding # -*- coding: utf-8 -*- as the first line of your source files. It should make Python behave.
This solved the problem for me in my source code without having to modify the pydev sitecustomize.py file:
import sys
reload(sys).setdefaultencoding("utf-8")
You could use "ascii" or whatever other encoding you wanted to use.
In my case, the when I ran the program on the command-line, PyDev was using "utf-8", whereas the console was incorrectly setting "ascii".
It may not be what you are asking. But for my case I got these UTF-8 characters by accident copying my code from various sources. To find what character is making troubles I did in my Eclipse Mars:
Edit->set encoding
other->US ASCII
then I tried to save my file. And I got modal window telling me "Save problems". There was button "Select First Character"
It showed me troubling character and I just deleted that character and typed ASCII one.

Wrong encoding in filenames created on Windows XP by Python script

My Python script creates a xml file under Windows XP but that file doesn't get the right encoding with Spanish characters such 'ñ' or some accented letters.
First of all, the filename is read from an excel shell with the following code, I use to read the Excel file xlrd libraries:
filename = excelsheet.cell_value(rowx=first_row, colx=5)
Then, I've tried some encodings without success to generate the file with the right encode:
filename = filename[:-1].encode("utf-8")
filename = filename[:-1].encode("latin1")
filename = filename[:-1].encode("windows-1252")
Using "windows-1252" I get a bad encoding with letter 'ñ', 'í' and 'é'. For example, I got BAJO ARAGÓN_Alcañiz.xml instead of BAJO ARAGÓN_Alcañiz.xml
Thanks in advance for your help
You should use unicode strings for your filenames. In general operating systems support filenames that contain arbitrary Unicode characters. So if you do:
fn = u'ma\u00d1o' # maÑo
f = open(fn, "w")
f.close()
f = open(fn, "r")
f.close()
it should work just fine. A different thing is what you see in your terminal when you list the content of the directory where that file lives. If the encoding of the terminal is UTF-8 you will see the filename maño, but if the encoding is for instance iso-8859-1 you will see maÃo. But even if you see these strange characters you should be able to open the file from python as described above.
In summary, do not encode the output of
filename = excelsheet.cell_value(rowx=first_row, colx=5)
instead make sure it is a unicode string.
Reading the Unicode filenames section of the Python Unicode HOWTO can be helpful for you.
Trying your answers I found a fast solution, port my script from Python 2.7 yo Python 3.3, the reason to port my code is Python 3 works by default in Unicode.
I had to do some little changes in my code, the import of xlrd libraries (Previously I had to install xlrd3):
import xlrd3 as xlrd
Also, I had to convert the content from 'bytes' to 'string' using str instead of encode()
filename = str(filename[:-1])
Now, my script works perfect and generate the files on Windows XP without strange characters.
First,
if you had not, please, read http://www.joelonsoftware.com/articles/Unicode.html -
Now, "latin-1" should work for Spanish encoding under Windows - there are two hypotheses tehr: the strigns you are trying to "encode" to either encoding are not Unicdoe strings, but are already in some encoding. tha, however, would more likely give you an UnicodeDecodeError than strange characters, but it might work in some corner case.
The more likely case is that you are checking your files using the windows Prompt AKA 'CMD" -
Well, for some reason, Microsoft Windows does use two different encodings for the system - one from inside "native" windows programs - which should be compatible with latin1, and another one for legacy DOS programs, in which category it puts the command prompt. For Portuguese, this second encoding is "cp852" (Looking around, cp852 does not define "ñ" - but cp850 does ).
So, this happens:
>>> print u"Aña".encode("latin1").decode("cp850")
A±a
>>>
So, if you want your filenames to appear correctly from the DOS prompt, you should encode them using "CP850" - if you want them to look right from Windows programs, do encode them using "cp1252" (or "latin1" or "iso-8859-15" - they are almost the same, give or take the "€" symbol)
Of course, instead of trying to guess and picking one that looks good, and will fail if some one runs your program in Norway, Russia, or in aa Posix system, you should just do
import sys
encoding = sys.getfilesystemencoding()
(This should return one of the above for you - again, the filename will look right if seem from a Windows program, not from a DOS shell)
In Windows, the file system uses UTF-16, so no explicit encoding is required. Just use a Unicode string for the filename, and make sure to declare the encoding of the source file.
# coding: utf8
with open(u'BAJO ARAGÓN_Alcañiz.xml','w') as f:
f.write('test')
Also, even though, for example, Ó isn't supported by the cp437 encoding of my US Windows system, my console font supports the character and it still displays correctly on my console. The console supports Unicode, but non-Unicode programs can only read/write code page characters.

How to display utf-8 in windows console

I'm using Python 2.6 on Windows 7
I borrowed some code from here:
Python, Unicode, and the Windows console
My goal is to be able to display uft-8 strings in the windows console.
Apparantly in python 2.6, the
sys.setdefaultencoding()
is no longer supported
However, I wrote reload(sys) before I tried to use it and it magically didn't error.
This code will NOT error, but it shows funny characters instead of japanese text.
I believe the problem is because I have not successfully changed the codepage of the windows console.
These are my attempts, but they don't work:
reload(sys)
sys.setdefaultencoding('utf-8')
print os.popen('chcp 65001').read()
sys.stdout.encoding = 'cp65001'
Perhaps you can use win32console to change the codepage?
I tried the code from the website I linked, but it also errored from the win32console.. maybe that code is obsolete.
Here's my code, that doesn't error but prints funny characters:
#coding=<utf8>
import os
import sys
import codecs
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
#print os.popen('chcp 65001').read()
print(sys.stdout.encoding)
sys.stdout.encoding = 'cp65001'
print(sys.stdout.encoding)
x = raw_input('press enter to continue')
a = 'こんにちは世界'#.decode('utf8')
print a
x = raw_input()
I know you state you're using Python 2.6, but if you're able to use Python 3.3 you'll find that this is finally supported.
Use the command chcp 65001 before starting Python.
See http://docs.python.org/dev/whatsnew/3.3.html#codecs
In Python 3.6 it's no longer even necessary to use the chcp command, since Python bypasses the byte-level console interface entirely and uses a native Unicode interface instead. See PEP 528: Change Windows console encoding to UTF-8.
As noted in the comments by #mbom007, it's also important to make sure the console is configured with a font that supports the characters you're trying to display.
Never ever ever use setdefaultencoding. If you want to write unicode strings to stdio, encode them explicitly. Monkeying around with setdefaultencoding will cause stdlib modules and third-party modules alike to break in horrible subtle ways by allowing implicit conversion between str and unicode when it shouldn't happen.
Yes, the problem is most likely that your code page isn't set properly. However, using os.popen won't change the code page; it'll spawn a new shell, change its code page, and then immediately exit without affecting your console at all. I'm not personally very familiar with windows, so I couldn't tell you how to change your console's code page from within your python program.
The way to properly display unicode data via utf-8 from python, as mentioned before, is to explicitly encode your strings before printing them: print s.encode('utf-8')
Changing the console code page is both unnecessary and won't work (in particular, setting it to 65001 runs into a Python bug). See this question for details, and for how to print Unicode characters to the console regardless of the code page.
Windows doesn't support UTF-8 in a console properly. The only way I know of to display Japanese in the console is by changing (on XP) Control Panel's Regional and Language Options, Advanced Tab, Language for non-Unicode Programs to Japanese. After rebooting, open a console and run "chcp" to find out the Japanese console's code page. Then either print Unicode strings or byte strings explicitly encoded in the correct code page.

Categories

Resources