Unicode filename to python subprocess.call() [duplicate] - python

This question already has answers here:
Unicode filenames on Windows with Python & subprocess.Popen()
I'm trying to run subprocess.call() with a Unicode filename. Here is a simplified version of the problem:
n = u'c:\\windows\\notepad.exe '
f = u'c:\\temp\\nèw.txt'
subprocess.call(n + f)
which raises the famous error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8'
Encoding to utf-8 produces the wrong filename, and mbcs passes the filename as new.txt, without the accent.
I just can't read any more on this confusing subject; I keep going in circles. I have found a lot of answers here for many different problems in the past, so I thought I'd join and ask for help myself.
Thanks

I found a fine workaround, it's a bit messy, but it works.
subprocess.call passes the text to the terminal in its own encoding, which may or may not be the one the terminal expects. To keep this portable, you need to find out the machine's encoding at runtime.
The following
import subprocess
import sys
notepad = u'C://Notepad.exe'
subprocess.call([notepad.encode(sys.getfilesystemencoding())])
looks up the file system encoding at runtime and applies it to the argument passed to subprocess.call.
As a side note, I have also found that if you build a string from the current directory using
os.getcwd()
Python (or the OS, I don't know which) will mangle directories with accented characters. To prevent this, I have found the following to work:
os.getcwd().decode(sys.getfilesystemencoding())
which is very similar to the solution above.
Hope it helps.
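The encoding step above can be wrapped in a small helper that fails loudly when a name cannot be represented in the target encoding. This is only a sketch, not part of the original answer: the name encode_for_child is hypothetical, and the encoding can be passed in explicitly so the behaviour can be tried with any code page.

```python
import sys


def encode_for_child(text, encoding=None):
    """Encode text for a child process, failing loudly on lossy input.

    encode_for_child is a hypothetical helper name. When no encoding is
    given, the file system encoding reported by Python is used.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    # Strict encoding raises UnicodeEncodeError instead of silently
    # producing a different filename.
    return text.encode(encoding)


# cp1252 can represent the accented name from the question.
print(repr(encode_for_child(u'n\xe8w.txt', 'cp1252')))
```

If the call raises UnicodeEncodeError, the chosen encoding cannot carry the filename at all, and one of the other approaches on this page is probably needed.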

If your file exists, you can use its short filename (a.k.a. 8.3 name). This name is defined for existing files and should cause no trouble when passed as an argument to programs that are not Unicode-aware.
One way to obtain it (requires Pywin32 to be installed):
import win32api
short_path = win32api.GetShortPathName(unicode_path)
Alternatively, you can also use ctypes:
import ctypes
import ctypes.wintypes
ctypes.windll.kernel32.GetShortPathNameW.argtypes = [
    ctypes.wintypes.LPCWSTR,  # lpszLongPath
    ctypes.wintypes.LPWSTR,   # lpszShortPath
    ctypes.wintypes.DWORD,    # cchBuffer
]
ctypes.windll.kernel32.GetShortPathNameW.restype = ctypes.wintypes.DWORD
buf = ctypes.create_unicode_buffer(1024) # adjust buffer size, if necessary
ctypes.windll.kernel32.GetShortPathNameW(unicode_path, buf, len(buf))
short_path = buf.value

It appears that to make this work, the subprocess code would have to be modified to use a wide character version of CreateProcess (assuming that one exists). There's a PEP discussing the same change made for the file object at http://www.python.org/dev/peps/pep-0277/ Perhaps you could research the Windows C calls and propose a similar change for subprocess.
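For what it's worth, Python 3 accepts text (str) arguments in subprocess and, on Windows, hands them to the wide-character process-creation API, so no manual encoding should be needed there. The sketch below assumes Python 3. It uses the Python interpreter itself as a stand-in child process (instead of notepad.exe) so that it can run anywhere, and the helper name echo_argument is hypothetical.

```python
import subprocess
import sys


def echo_argument(arg):
    """Start a child Python process and return the argument it received."""
    # The child prints an ASCII-only escaped form of its first argument,
    # so the result does not depend on the console encoding.
    child_code = "import sys; print(ascii(sys.argv[1]))"
    out = subprocess.check_output([sys.executable, "-c", child_code, arg])
    return out.decode("ascii").strip()


# The accented filename from the question, written with an escape.
print(echo_argument("n\u00e8w.txt"))
```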

I don't have an answer for you, but I've done a fair amount of research into this problem. Python converts all output (including system calls) to the same character encoding as the terminal it is running in. Windows terminals use code pages for character mapping; the default code page is 437, but it can be changed with the chcp command. chcp 65001 will theoretically change the code page to UTF-8, but as far as I know Python doesn't know what to do with this, so you're SOL.

As ΤΖΩΤΖΙΟΥ and starbuck mentioned, the problem is with the console code page, which in your case is 866 (in the Russian localization of Windows) and not 1251. Just run chcp in the console to check.
The problem is the same as when you want to output Unicode to the Windows console. Unfortunately, you will need at least to register an alias for unicode as 'cp866' in encodings\aliases.py (or do it programmatically on script start), change the code page of the console to 65001 before running notepad, and set it back afterwards.
chcp 65001 & c:\WINDOWS\notepad.exe nèw.txt & chcp 866
By the way, to be able to run the command in the console and see the filename correctly, you will need to change the console font to Lucida Console in the console window properties.
It might be even worse: you will need to change the code page of the current process. To do that, you will need to either run chcp 65001 right before the script starts or use pywin32 to do it within the script.

You can try opening the file as:
subprocess.call((n + f).encode("cp437"))
or whichever code page chcp reports as being used in a command prompt window. If you try chcp 65001 as starbuck suggested, you'll have to edit the stdlib encodings\aliases.py file and add cp65001 as an alias to 'utf-8' beforehand. It's an open issue in the Python source.
UPDATE: since this is a multiple target scenario, before running such a command, make sure you run a single chcp command first, analyse the output and retrieve the current "Command Prompt" (DOS) codepage. Subsequently, use the discovered codepage to encode the subprocess.call argument.
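The discovery step described in the update could look roughly like this. It is a sketch under assumptions: the helper names parse_chcp_output and console_codepage are hypothetical, and the sample text "Active code page: 437" is what an English-language Windows prints for chcp; other localizations word it differently, which is why only the number is extracted.

```python
import re
import subprocess
import sys


def parse_chcp_output(output):
    """Extract a codec name such as 'cp437' from the text printed by chcp."""
    match = re.search(r"(\d+)", output)
    if match is None:
        raise ValueError("no code page number found in %r" % output)
    return "cp" + match.group(1)


def console_codepage():
    """Run chcp (Windows only) and return the matching codec name."""
    if sys.platform != "win32":
        raise OSError("chcp is only available on Windows")
    raw = subprocess.check_output("chcp", shell=True)
    return parse_chcp_output(raw.decode("ascii", "replace"))


print(parse_chcp_output("Active code page: 437"))  # prints cp437
```

On Windows, the result of console_codepage() would then replace the hard-coded "cp437" in the encode call above.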

Use os.startfile with the operation edit. This will work better as it will open the default application for your extension.
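A minimal sketch of that suggestion follows. os.startfile exists only on Windows, so the wrapper (the name open_for_editing is hypothetical) returns False elsewhere instead of failing.

```python
import os


def open_for_editing(path):
    """Open path with the application registered for the 'edit' verb.

    Returns False on platforms where os.startfile does not exist.
    """
    startfile = getattr(os, "startfile", None)
    if startfile is None:
        return False
    startfile(path, "edit")
    return True
```

For the file in the question, the call would be open_for_editing(u'c:\\temp\\nèw.txt').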

Related

Python special characters encoding problems in PATH

Given this simple code, I receive faulty paths if the userfolder contains any special characters. For example the returned path is expected to be "C:\Users\Aoë\", but the ë is instead shown as a ‰ or a \u2030 depending on what is done with encoding. This then messes up the rest of my code because of attempts to write to nonexistent paths.
I ran into this problem trying to run kivy, but it seems to be happening globally.
from pathlib import Path
home = str(Path.home())
print(home)
I've spent quite some time, but haven't been able to reach a solution. This is with the latest python, x64 on windows with eclipse. No matter what I do, I cannot get python to handle special characters properly.
Try the 'r' prefix at the beginning; it ignores the special characters:
home = r'%s'%str(Path.home())

How to print unicode strings in a Python 2 shell under Windows?

I'm having problems when trying to print symbols such as €, ≤, Å, Ω, ℃, etc., in Python 2.7.11 under Windows 10. I expected that running this piece of code from IDLE:
print u'\u20AC\u2A7D\u212B\u2126\u2103'
would produce the following output on the screen:
>>> ================================ RESTART ================================
>>>
€⩽ÅΩ℃
>>>
But it didn't. I obtained a funky string of non-ascii characters instead. After struggling for a while, I finally got the expected output by setting up an environment variable:
PYTHONIOENCODING=UTF-8
So far, so good. My problem is that I am unable to get the same output from the Python shell:
>>> print u'\u20AC\u2A7D\u212B\u2126\u2103'
Ôé¼Ô®¢Ôä½ÔäªÔäâ
>>>
I have unsuccessfully tried a number of workarounds I found in answers to similar questions:
Changed the code page from 850 (which is the default in my system) to 65001 (which corresponds to utf-8 encoding)
Wrapped sys.stdout to ensure the appropriate encoding
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
Even changed - although it is widely discouraged - the default encoding
sys.setdefaultencoding("UTF-8")
None of the above worked for me.
My question is twofold:
Why if I run print u'\u20AC\u2A7D\u212B\u2126\u2103' from IDLE the output is €⩽ÅΩ℃ (as expected) whereas if I run this code from the Python shell the output is incorrect?
Does anyone have any tips for printing those symbols correctly from the shell?
Why: IDLE uses tkinter, which wraps the tcl/tk GUI framework. Tcl/tk uses unicode strings, like Python 3, except that it is limited to the first 2**16 characters (the Basic Multilingual Plane, BMP). On Windows, Python uses Command Prompt, which uses code pages mostly limited to 256 chars. CP65001 seems to be a fraud; join the large crowd of people who have failed to get it to work over the last decade. (Search web for code page 65001.)
Tip: unless you limit output to chars in a working codepage, use IDLE to run the program. IDLE has a -r file startup option. See Help => IDLE Help, 3.1 Command line usage. I don't normally recommend using IDLE to run already developed programs, but do on Windows for BMP output.
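To see which characters fall outside a given code page before printing, something like the following may help. It is a sketch only: the helper names unencodable_chars and safe_for_console are hypothetical, and cp850 is used because it is the default code page mentioned in the question.

```python
def unencodable_chars(text, encoding):
    """Return the characters of text that the given codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


def safe_for_console(text, encoding):
    """Replace characters the code page cannot represent with '?'."""
    return text.encode(encoding, "replace").decode(encoding)


# Euro sign (not in cp850) followed by an accented e (in cp850).
sample = u"\u20ac 5 \xe9"
print(repr(unencodable_chars(sample, "cp850")))
print(repr(safe_for_console(sample, "cp850")))
```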

Python - ShiftJIS errors in DOS

I have csv files that I need to edit in Python that have to remain in Shift-JIS. When I test my code by entering each section into the Python interpreter, the files get edited fine and they stay in Shift-JIS. I run the following lines in the Python interpreter:
import sys, codecs
reload(sys)
sys.setdefaultencoding('shift_jis')
I put these lines in a script and run them from the DOS prompt and of course the shift-JIS characters I add get messed up. If I run chcp at the DOS prompt, it tells me that I'm running chcp 932, shift-JIS. Does anybody know what's not working?
In case anyone needs to know, this is the remedy:
In this case Python was using Unicode when I needed Shift-JIS. What worked for me was specifying lines to use unicode then encode them in Shift-JIS, then write them to the file. This worked every time.
For example:
name = u"テスト "
newstring = name + other_string_data
newstring = newstring.encode('shift_jis')
Then the string would get encoded into Shift-JIS and written. This isn't the most elegant way to do this, but I hope it helps somebody; it took me about 2 hours to figure out.
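An alternative that avoids encoding each string by hand is to let the file object do the encoding. This is a sketch, not the method used above: write_shift_jis is a hypothetical helper, and it relies on io.open with an encoding argument, which is available in Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def write_shift_jis(path, rows):
    """Write rows of Unicode text to path as Shift-JIS, one line per row."""
    # newline="" turns off newline translation; each line is ended with
    # "\r\n" explicitly, the usual CSV convention.
    with io.open(path, "w", encoding="shift_jis", newline="") as handle:
        for row in rows:
            handle.write(row + u"\r\n")


# The escapes below spell the katakana word from the example above.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_shift_jis(demo_path, [u"\u30c6\u30b9\u30c8,1"])
with io.open(demo_path, "rb") as handle:
    print(repr(handle.read()))
```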

Dropping a file onto a python script in windows - avoid Windows use of backslashes in the argument

I know all about how Windows uses backslashes for filenames, etc., and Unix uses forward. However, I never use backslashes with strings I create in my code. However:
When windows explorer "drops" a file onto a python script, the string it passes contains backslashes. These translate into escape sequences in the strings in the sys.argv list and then I have no way to change them after that (open to suggestions there)
Is there any way I can somehow make windows pass a literal string or ... any other way I can solve this problem?
I'd love my script to be droppable, but the only thing preventing me is windows backslashes.
EDIT:
Sorry everyone, the error was actually not the passing of the string - as someone has pointed out below, but this could still help someone else:
Make sure you use absolute path names, because the Windows shell will NOT run the script in the current directory as you would from a command line. This causes permission denied errors when attempting to write to single-part path names that aren't absolute.
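One way to get an absolute output path is to anchor it to the script's own folder. A minimal sketch, with a hypothetical helper name (path_next_to) and a made-up example path:

```python
import os


def path_next_to(script_path, name):
    """Return an absolute path for name in the folder containing script_path."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, name)


# In a real script, pass __file__ (or sys.argv[0]) as script_path.
print(path_next_to(os.path.join("some_folder", "script.py"), "output.txt"))
```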
Cannot reproduce. This:
import os, sys
print sys.argv
print map(os.path.exists, sys.argv)
raw_input()
gives me this:
['D:\\workspaces\\generic\\SO_Python\\9266551.py', 'D:\\workspaces\\generic\\SO_Python\\9254991.py']
[True, True]
after dropping the second file onto the first one. Python 2.7.2 (on Windows). Can you try this code out?

Different results from converting a file from iso-8859-1 to utf-8 iconv in shell vs calling it from python with subprocess

Well, this could be a simple question; to be frank, I'm a little confused with encodings and all those things.
Let's suppose I have the file 01234.txt which is iso-8859-1.
When I do:
iconv --from-code=iso-8859-1 --to-code=utf-8 01234.txt > 01234_utf8.txt
It gives me the desired result, but when I do the same thing with python and using subprocess:
import subprocess
p0 = subprocess.Popen([<here the same command>], shell=True)
p0.wait()
I get almost the same result, but the new file is missing e.g. part of the line before the last one and the last one.
Here the last three lines of both files:
iconv result:
795719000|MARIA TERESA MARROU VILLALOBOS|107
259871385|CHRISTIAM ALBERTO SUAREZ VILLALOBOS|107
311015100|JORGE MEZA CERVANTES|09499386
python result:
795719000|MARIA TERESA MARROU VILLALOBOS|107
259871385|CHRISTIAM
EDIT: In the python file I've tried using coding: utf-8 and coding: iso-8859-1 (not both at the same time).
EDIT: I've used codecs in bpython it works great. When using it from a file I get the not desired result.
EDIT: I'm using linux (Ubuntu 9.10) and python 2.6.2.
Any suggestions?
You wrote: "In the python file I've used coding: utf-8 and coding: iso-8859-1."
Only the first of those will be used. Secondly, that specifies the encoding of the Python source file in which it appears, so that the Python compiler can do its job. Consequently it has absolutely nothing to do with the encodings of your input file and output file. A script to transcode data from encoding X to encoding Y can be written using only ASCII characters.
Now to your problem:
You wrote: "p0 = subprocess.Popen([<here the same command>], shell=True)"
Please (always) when asking a question, show the EXACT code that was run, not what you hoped/thought was run. Use copy/paste, don't retype it. Don't try to put it in a comment; edit your question.
Update: Here is a GUESS, based on the symptoms: you are losing the last few bytes of a file -- looks like failure to flush a buffer before fading away. Is the size of the truncated output file an integral power of 2?
Perhaps you should not rely on the command line processor doing > 01234_utf8.txt reliably. If you omit that part of the command, does the full payload appear on stdout? If so, you may be able to work around the problem by opening the output file yourself, passing its handle as the stdout arg, and later doing handle.flush() and handle.close().
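That workaround might look like the sketch below. The helper name run_to_file is hypothetical, and the demo uses the Python interpreter as a stand-in child process so that it runs without iconv or the input file being present.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(command, out_path):
    """Run command (a list of arguments) and write its stdout to out_path."""
    # check_call waits for the child to finish; leaving the with block
    # then closes (and therefore flushes) the output file.
    with open(out_path, "wb") as handle:
        subprocess.check_call(command, stdout=handle)


demo_out = os.path.join(tempfile.mkdtemp(), "demo_out.txt")
run_to_file([sys.executable, "-c", "print('converted text')"], demo_out)
with open(demo_out, "rb") as handle:
    print(repr(handle.read()))
```

For the case in the question, the call would presumably be run_to_file(["iconv", "--from-code=iso-8859-1", "--to-code=utf-8", "01234.txt"], "01234_utf8.txt"), with no shell=True and no > redirection.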
