How to print unicode strings in a Python 2 shell under Windows? - python

I'm having problems when trying to print symbols such as €, ≤, Å, Ω, ℃, etc., in Python 2.7.11 under Windows 10. I expected that running this piece of code from IDLE:
print u'\u20AC\u2A7D\u212B\u2126\u2103'
would produce the following output on the screen:
>>> ================================ RESTART ================================
>>>
€⩽ÅΩ℃
>>>
But it didn't. I obtained a funky string of non-ascii characters instead. After struggling for a while, I finally got the expected output by setting up an environment variable:
PYTHONIOENCODING=UTF-8
So far, so good. My problem is that I am unable to get the same output from the Python shell:
>>> print u'\u20AC\u2A7D\u212B\u2126\u2103'
Ôé¼Ô®¢Ôä½ÔäªÔäâ
>>>
I have unsuccessfully tried a number of workarounds I found in answers to similar questions:
Changed the code page from 850 (which is the default in my system) to 65001 (which corresponds to utf-8 enconding)
Wrapped sys.stdout to ensure the appropriate encoding
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
Even changed - although it is widely discouraged - the default encoding
sys.setdefaultencoding("UTF-8")
None of the above worked for me.
My question is twofold:
Why if I run print u'\u20AC\u2A7D\u212B\u2126\u2103' from IDLE the output is €⩽ÅΩ℃ (as expected) whereas if I run this code from the Python shell the output is incorrect?
Does anyone have any tips for printing those symbols correctly from the shell?

Why: IDLE uses tkinter, which wraps the tcl/tk GUI framework. Tcl/tk uses unicode strings, like Python 3, except that it is limited to the first 2**16 characters (the Basic Multilingual Plane, BMP). On Windows, Python uses Command Prompt, which uses code pages mostly limited to 256 chars. CP65001 seems to be a fraud; join the large crowd of people who have failed to get it to work over the last decade. (Search web for code page 65001.)
Tip: unless you limit output to chars in a working codepage, use IDLE to run the program. IDLE has a -r file startup option. See Help => IDLE Help, 3.1 Command line usage. I don't normally recommend using IDLE to run already developed programs, but do on Windows for BMP output.

Related

Python 3 array with UTF-16 chars

I created two arrays with positive and negative emojis by using the emojis' unicode value:
positive = [
u'\U0001F600',
u'\U0001F601',
u'\U0001F602',
u'\U0001F923',
u'\U0001F603',
u'\U0001F604',
u'\U0001F605',
u'\U0001F606',
u'\U0001F609',
u'\U0001F60A',
u'\U0001F60B',
u'\U0001F60E',
u'\U0001F60D',
u'\U0001F618',
u'\U0001F617',
u'\U0001F619',
u'\U0001F61A',
u'\U0000263A',
u'\U0001F642',
u'\U0001F917',
u'\U0001F60F',
u'\U0001F60C',
u'\U0001F61B',
u'\U0001F61C',
u'\U0001F61D',
u'\U0001F924',
u'\U0001F643',
u'\U0001F62C']
negative = [
u'\U0001F610',
u'\U0001F611',
u'\U0001F636',
u'\U0001F644',
u'\U0001F60F',
u'\U0001F623',
u'\U0001F625',
u'\U0001F62E',
u'\U0001F910',
u'\U0001F62F',
u'\U0001F62A',
u'\U0001F62B',
u'\U0001F634',
u'\U0001F612',
u'\U0001F613',
u'\U0001F614',
u'\U0001F615',
u'\U0001F641',
u'\U0001F616',
u'\U0001F61E',
u'\U0001F61F',
u'\U0001F624',
u'\U0001F622',
u'\U0001F62D',
u'\U0001F626',
u'\U0001F627',
u'\U0001F628',
u'\U0001F629',
u'\U0001F630',
u'\U0001F631',
u'\U0001F635',
u'\U0001F621',
u'\U0001F620',
u'\U0001F637',
u'\U0001F912',
u'\U0001F915',
u'\U0001F922',
u'\U0001F927']
But when I print positive[0], for example, I get back this weird character instead of an emoji:
�
I'm working on an EC2 machine with Amazon Linux and using python-3.4.
Same code works as expected from my Macbook.
The issue is not a Python Issue. The Mac supports fonts in the terminal that can print those unicode characters.
Those same fonts are not supported in the case you are using.
If the device in question were to support those unicode values they would print properly.
I tested a standard macOS SSH Terminal to Ubuntu and that worked as does native Mac.
The issue was I was running under screen
Hello Daniel Haviv,
Try this if it works:
print(positive[0].encode('utf-8'))
You can read more here Python 2.x’s Unicode Support
Update: The previous solution did not worked for you try this:
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
print(positive[0])
I found the solution from Setting the correct encoding when piping stdout in Python

Text involving special symbols in python and command prompt

I'm going to print something involving special phonetic symbols with python 3.4. In IDLE and cmd I tried, it raised UnicodeEncodeError or something similar. After those failures, I create a txt saved in 'utf-8' to test it out. In the txt, it is abæʃ. And I write:
import sys
sys.stdout._encoding='cp65001'
f=open('test.txt')
c=f.read()
print(c)
print(c.encode())
In IDLE, it returns:
>>>
abæʃ
b'\xc3\xaf\xc2\xbb\xc2\xbfab\xc3\x83\xc2\xa6\xc3\x8a\xc6\x92'
When tested in cmd, it return the first row different with that from IDLE, but still in mess, but when I used '>>'(redirection) in cmd, from the file I saw they were the same.
I don't know why there's something before ab, and why can't either IDLE or cmd return the correct string even the encoding are the same. Why did it happen, and how can I solve it? Thanks in advance

Python - ShiftJIS errors in DOS

I have csv files that I need to edit in Python that have to remain in Shift-JIS. When test my code by entering each section into the Python interpreter, the files get edited fine and they stay in Shift-JIS. I run the following lines in the Python interpreter:
import sys, codecs
reload(sys)
sys.setdefaultencoding('shift_jis')
I put these lines in a script and run them from the DOS prompt and of course the shift-JIS characters I add get messed up. If I run chcp at the DOS prompt, it tells me that I'm running chcp 932, shift-JIS. Does anybody know what's not working?
In case anyone needs to know, this is the remedy:
In this case Python was using Unicode when I needed Shift-JIS. What worked for me was specifying lines to use unicode then encode them in Shift-JIS, then write them to the file. This worked every time.
For example:
name = u"テスト "
newstring = name + other_string_data
newstring = newstring.encode('shift_jis')
Then the string would get encoded into shift-JIS and written. This isn't the most elegant way to do this but I hope this helps somebody, it took me about 2 hours to figure out.

Dropping a file onto a python script in windows - avoid Windows use of backslashes in the argument

I know all about how Windows uses backslashes for filenames, etc., and Unix uses forward. However, I never use backslashes with strings I create in my code. However:
When windows explorer "drops" a file onto a python script, the string it passes contains backslashes. These translate into escape sequences in the strings in the sys.argv list and then I have no way to change them after that (open to suggestions there)
Is there any way I can somehow make windows pass a literal string or ... any other way I can solve this problem?
I'd love my script to be droppable, but the only thing preventing me is windows backslashes.
EDIT:
Sorry everyone, the error was actually not the passing of the string - as someone has pointed out below, but this could still help someone else:
Make sure you use absolute path names because when the Windows shell will NOT run the script in the current directory as you would from a command line. This causes permission denied errors when attempting to write to single-part path-names that aren't absolute.
Cannot reproduce. This:
import os, sys
print sys.argv
print map(os.path.exists, sys.argv)
raw_input()
gives me this:
['D:\\workspaces\\generic\\SO_Python\\9266551.py', 'D:\\workspaces\\generic\\SO_Python\\9254991.py']
[True, True]
after dropping the second file onto the first one. Python 2.7.2 (on Windows). Can you try this code out?

Unicode filename to python subprocess.call() [duplicate]

This question already has answers here:
Unicode filenames on Windows with Python & subprocess.Popen()
(5 answers)
Closed 7 years ago.
I'm trying to run subprocess.call() with unicode filename, and here is simplified problem:
n = u'c:\\windows\\notepad.exe '
f = u'c:\\temp\\nèw.txt'
subprocess.call(n + f)
which raises famous error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8'
Encoding to utf-8 produces wrong filename, and mbcs passes filename as new.txt without accent
I just can't read any more on this confusing subject and spin in circle. I found here lot of answers for many different problems in past so I thought to join and ask for help myself
Thanks
I found a fine workaround, it's a bit messy, but it works.
subprocess.call is going to pass the text in its own encoding to the terminal, which might or not be the one it's expecting. Because you want to make it portable, you'll need to know the machine's encoding at runtime.
The following
notepad = 'C://Notepad.exe'
subprocess.call([notepad.encode(sys.getfilesystemencoding())])
attempts to figure out the current encoding and therefore applies the correct one to subprocess.call
As a sidenote, I have also found that if you attempt to compose a string with the current directory, using
os.cwd()
Python (or the OS, don't know) will mess up directories with accented characters. To prevent this I have found the following to work:
os.cwd().decode(sys.getfilesystemencoding())
Which is very similar to the solution above.
Hope it helps.
If your file exists, you can use short filename (aka 8.3 name). This name is defined
for existent files, and should cause no trouble to non-Unicode aware programs when passed as argument.
One way to obtain one (needs Pywin32 to be installed):
import win32api
short_path = win32api.GetShortPathName(unicode_path)
Alternatively, you can also use ctypes:
import ctypes
import ctypes.wintypes
ctypes.windll.kernel32.GetShortPathNameW.argtypes = [
ctypes.wintypes.LPCWSTR, # lpszLongPath
ctypes.wintypes.LPWSTR, # lpszShortPath
ctypes.wintypes.DWORD # cchBuffer
]
ctypes.windll.kernel32.GetShortPathNameW.restype = ctypes.wintypes.DWORD
buf = ctypes.create_unicode_buffer(1024) # adjust buffer size, if necessary
ctypes.windll.kernel32.GetShortPathNameW(unicode_path, buf, len(buf))
short_path = buf.value
It appears that to make this work, the subprocess code would have to be modified to use a wide character version of CreateProcess (assuming that one exists). There's a PEP discussing the same change made for the file object at http://www.python.org/dev/peps/pep-0277/ Perhaps you could research the Windows C calls and propose a similar change for subprocess.
I don't have an answer for you, but I've done a fair amount of research into this problem. Python converts all output (including system calls) to the same character as the terminal it is running in. Windows terminals use code pages for character mapping; the default code page is 437, but it can be changed with the chcp command. chcp 65001 will theoretically change the code page to utf-8, but as far as I know python doesn't know what to do with this, so you're SOL.
As ΤΖΩΤΖΙΟΥ and starbuck mentioned, the problem is with the console code page which is in your case 866 (in Russian localization of windows) and not 1251. Just run chcp in console.
The problem is the same as when you want output unicode to Windows console. Unfortunatelly you will need at least to reqister and alias for unicode as 'cp866' in encodings\aliases.py (or do it programmatically on script start) and change the code page of the console to 65001 before running the notepad and setting it back afterwards.
chcp 65001 & c:\WINDOWS\notepad.exe nèw.txt & chcp 866
By the way, to be able to run the command in console and see the filename correctly, you will need to change the console font to Lucida Console in console window properties.
It might be even worse: you will need to change the code page of the current process. To do that, you will need either run chcp 65001 right before the script start or use pywin32 to do it within the script.
You can try opening the file as:
subprocess.call((n + f).encode("cp437"))
or whichever codepage chcp reports as being used in a command prompt window. If you try to chcp 65001 as starbuck suggested, you'll have to edit the stdlib encodings\aliases.py file and add cp65001 as an alias to 'utf-8' beforehand. It's an open issue in the Python source.
UPDATE: since this is a multiple target scenario, before running such a command, make sure you run a single chcp command first, analyse the output and retrieve the current "Command Prompt" (DOS) codepage. Subsequently, use the discovered codepage to encode the subprocess.call argument.
Use os.startfile with the operation edit. This will work better as it will open the default application for your extension.

Categories

Resources