Python - ShiftJIS errors in DOS - python

I have csv files that I need to edit in Python that have to remain in Shift-JIS. When I test my code by entering each section into the Python interpreter, the files get edited fine and they stay in Shift-JIS. I run the following lines in the Python interpreter:
import sys, codecs
reload(sys)
sys.setdefaultencoding('shift_jis')
I put these lines in a script and run them from the DOS prompt and of course the shift-JIS characters I add get messed up. If I run chcp at the DOS prompt, it tells me that I'm running chcp 932, shift-JIS. Does anybody know what's not working?

In case anyone needs to know, this is the remedy:
In this case Python was using Unicode when I needed Shift-JIS. What worked for me was building each line as a unicode string, encoding it to Shift-JIS, and then writing the encoded bytes to the file. This worked every time.
For example:
name = u"テスト "
newstring = name + other_string_data
newstring = newstring.encode('shift_jis')
Then the string would get encoded into shift-JIS and written. This isn't the most elegant way to do this but I hope this helps somebody, it took me about 2 hours to figure out.
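A variation on the approach above, as a minimal sketch: instead of calling .encode('shift_jis') on each string, the file can be opened with io.open and an explicit encoding, so every write is encoded on the way out. The function name append_sjis_line is hypothetical, and the sketch assumes that all text passed to it is already unicode.

```python
import io


def append_sjis_line(path, text):
    """Append one line of unicode text to a file stored as Shift-JIS."""
    # io.open encodes on write, so the caller only ever handles unicode.
    # newline="" stops Python from translating the explicit "\r\n" below.
    with io.open(path, "a", encoding="shift_jis", newline="") as handle:
        handle.write(text + u"\r\n")
```

With this helper, the example above would become something like append_sjis_line('data.csv', name + other_string_data), where data.csv is a placeholder file name and other_string_data is a unicode string.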

Related

Text involving special symbols in python and command prompt

I'm trying to print text containing special phonetic symbols with Python 3.4. In both IDLE and cmd it raised UnicodeEncodeError or something similar. After those failures, I created a text file saved as 'utf-8' to test it out. The file contains abæʃ. Then I wrote:
import sys
sys.stdout._encoding='cp65001'
f=open('test.txt')
c=f.read()
print(c)
print(c.encode())
In IDLE, it returns:
>>>
abæʃ
b'\xc3\xaf\xc2\xbb\xc2\xbfab\xc3\x83\xc2\xa6\xc3\x8a\xc6\x92'
When tested in cmd, the first line of output differs from the one in IDLE but is still garbled. When I used '>>' (redirection) in cmd, the output in the file matched the IDLE output.
I don't know why there is something before ab, or why neither IDLE nor cmd prints the correct string even though the encodings are the same. Why does this happen, and how can I solve it? Thanks in advance.
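The bytes printed above are consistent with one reading, offered here as an assumption rather than a confirmed diagnosis: test.txt starts with a UTF-8 byte order mark, and open() decoded the file with the Windows default cp1252 instead of UTF-8. The sketch below reproduces that from raw bytes and shows that decoding with utf-8-sig gives the intended text. The variable names are illustrative.

```python
# Assumed file content: a UTF-8 BOM followed by the UTF-8 bytes of u"ab\u00e6\u0283".
raw = b"\xef\xbb\xbfab\xc3\xa6\xca\x83"

# Decoding with cp1252 maps every single byte to its own character (mojibake).
wrong = raw.decode("cp1252")

# utf-8-sig decodes as UTF-8 and drops a leading BOM if one is present.
right = raw.decode("utf-8-sig")

print(wrong.encode("utf-8"))        # the mis-decoded text, re-encoded as UTF-8
print(right == u"ab\u00e6\u0283")   # whether the intended text was recovered
```

If that reading is correct, opening the file with open('test.txt', encoding='utf-8-sig') in Python 3 should give the expected string. Whether the console can then display it still depends on the console code page and font.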

Dropping a file onto a python script in windows - avoid Windows use of backslashes in the argument

I know all about how Windows uses backslashes for filenames, etc., and Unix uses forward. However, I never use backslashes with strings I create in my code. However:
When windows explorer "drops" a file onto a python script, the string it passes contains backslashes. These translate into escape sequences in the strings in the sys.argv list and then I have no way to change them after that (open to suggestions there)
Is there any way I can somehow make windows pass a literal string or ... any other way I can solve this problem?
I'd love my script to be droppable, but the only thing preventing me is windows backslashes.
EDIT:
Sorry everyone, the error was actually not the passing of the string - as someone has pointed out below, but this could still help someone else:
Make sure you use absolute path names, because the Windows shell will NOT run the script in the current directory as you would from a command line. This causes permission denied errors when attempting to write to bare file names that are not absolute paths.
Cannot reproduce. This:
import os, sys
print sys.argv
print map(os.path.exists, sys.argv)
raw_input()
gives me this:
['D:\\workspaces\\generic\\SO_Python\\9266551.py', 'D:\\workspaces\\generic\\SO_Python\\9254991.py']
[True, True]
after dropping the second file onto the first one. Python 2.7.2 (on Windows). Can you try this code out?
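The EDIT in the question recommends absolute path names. As a minimal sketch of one way to do that, output files can be anchored to the directory containing the script instead of the working directory chosen by the shell. The helper name path_next_to_script is hypothetical.

```python
import os
import sys


def path_next_to_script(filename, script_path=None):
    """Return an absolute path for filename inside the script's own directory."""
    if script_path is None:
        # sys.argv[0] is the script path as it was passed to the interpreter.
        script_path = sys.argv[0]
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, filename)
```

A dropped-on script could then write to open(path_next_to_script('output.txt'), 'w') instead of open('output.txt', 'w'); output.txt is a placeholder name.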

Bash Variable Contains '\r' - Carriage Return Causing Problems in my Script

I have a bash script (rsync.sh) that works fine and has this line in it:
python /path/to/rsync_script.py $EMAIL "$RSYNC $PATH1 $PATH1_BACKUP"
I want to break the command (it's actually much longer than shown here because my variables have longer names) in two and use something like this:
python /path/to/rsync_script.py \
$EMAIL "$RSYNC $PATH1 $PATH1_BACKUP"
But when I do this I get the error:
scripts/rsync.sh: line 32: $'admin#mydomain.com\r': command not found
It puts the carriage return, \r in there.
How can I break this line up and not include the carriage return?
The problem looks like Windows line endings.
Here's how you can check in Python.
repr(open('rsync.sh', 'rb').read())
# If you see any \r\n in the output, the file has Windows line endings
Here's how you can fix it:
text = open('rsync.sh', 'rb').read().replace(b'\r\n', b'\n')
open('rsync.sh', 'wb').write(text)
Edit
Here's some code that shows the problem.
# Python:
open('abc-n.sh', 'wb').write('echo abc \\' + '\n' + 'def')
open('abc-r-n.sh', 'wb').write('echo abc \\' + '\r\n' + 'def')
And then run the files we made...
$ sh abc-n.sh
abc def
$ sh abc-r-n.sh
abc
abc-r-n.sh: 2: def: not found
If you can change the python script, maybe it will be easier to pass it the variable names themselves, instead of their content.
From within the Python code you have better and more consistent tools to deal with whitespace characters (like \r) than from within bash.
To do that, just change your .sh line to
python /path/to/rsync_script.py EMAIL "RSYNC PATH1 PATH1_BACKUP"
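And in your rsync_script.py, use os.environ to read the contents of the shell variables (and strip the \r characters from them). Note that the variables must be exported in the shell script to be visible in os.environ. Something like: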
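import os, sys
paths = []
# sys.argv[2] holds the space-separated variable names, e.g. "RSYNC PATH1 PATH1_BACKUP"
for var_name in sys.argv[2].split(" "):
    # strip() removes a trailing \r along with any other surrounding whitespace
    paths.append(os.environ[var_name].strip())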
So I figured it out... I made a mistake in this question and I got so much awesome help but it was me doing a dumb thing that caused the problem. As I mentioned above, I may have copied and pasted from Windows at some point (I had forgotten since I did most of the edits in vim). I went back and wrote a short script with the essentials of the original in vim and then added in the '\' for line break and the script worked just fine. I feel bad accepting my own answer since it was so stupid. I made sure to up-vote everyone who helped me. Thanks again.

Unicode filename to python subprocess.call() [duplicate]

This question already has answers here:
Unicode filenames on Windows with Python & subprocess.Popen()
(5 answers)
Closed 7 years ago.
I'm trying to run subprocess.call() with unicode filename, and here is simplified problem:
n = u'c:\\windows\\notepad.exe '
f = u'c:\\temp\\nèw.txt'
subprocess.call(n + f)
which raises famous error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8'
Encoding to utf-8 produces the wrong filename, and mbcs passes the filename as new.txt without the accent.
I just can't read any more on this confusing subject; I'm going in circles. I have found a lot of answers here for many different problems in the past, so I thought I would join and ask for help myself.
Thanks
I found a fine workaround, it's a bit messy, but it works.
subprocess.call is going to pass the text in its own encoding to the terminal, which might or might not be the one the terminal is expecting. Because you want to make it portable, you'll need to know the machine's encoding at runtime.
The following
import subprocess, sys
notepad = u'C://Notepad.exe'
subprocess.call([notepad.encode(sys.getfilesystemencoding())])
attempts to figure out the current encoding and therefore applies the correct one to subprocess.call
As a side note, I have also found that if you compose a string with the current directory, using
os.getcwd()
Python (or the OS, I don't know which) will mess up directories with accented characters. To prevent this I have found the following to work:
os.getcwd().decode(sys.getfilesystemencoding())
Which is very similar to the solution above.
Hope it helps.
If your file exists, you can use its short filename (aka 8.3 name). This name is defined for existing files, and should cause no trouble to non-Unicode-aware programs when passed as an argument.
One way to obtain one (needs Pywin32 to be installed):
import win32api
short_path = win32api.GetShortPathName(unicode_path)
Alternatively, you can also use ctypes:
import ctypes
import ctypes.wintypes
ctypes.windll.kernel32.GetShortPathNameW.argtypes = [
    ctypes.wintypes.LPCWSTR,  # lpszLongPath
    ctypes.wintypes.LPWSTR,   # lpszShortPath
    ctypes.wintypes.DWORD,    # cchBuffer
]
ctypes.windll.kernel32.GetShortPathNameW.restype = ctypes.wintypes.DWORD
buf = ctypes.create_unicode_buffer(1024) # adjust buffer size, if necessary
ctypes.windll.kernel32.GetShortPathNameW(unicode_path, buf, len(buf))
short_path = buf.value
It appears that to make this work, the subprocess code would have to be modified to use a wide character version of CreateProcess (assuming that one exists). There's a PEP discussing the same change made for the file object at http://www.python.org/dev/peps/pep-0277/. Perhaps you could research the Windows C calls and propose a similar change for subprocess.
I don't have an answer for you, but I've done a fair amount of research into this problem. Python converts all output (including system calls) to the same character encoding as the terminal it is running in. Windows terminals use code pages for character mapping; the default code page is 437, but it can be changed with the chcp command. chcp 65001 will theoretically change the code page to utf-8, but as far as I know python doesn't know what to do with this, so you're SOL.
As ΤΖΩΤΖΙΟΥ and starbuck mentioned, the problem is with the console code page which is in your case 866 (in Russian localization of windows) and not 1251. Just run chcp in console.
The problem is the same as when you want to output unicode to the Windows console. Unfortunately, you will need at least to register an alias for unicode as 'cp866' in encodings\aliases.py (or do it programmatically on script start), change the code page of the console to 65001 before running notepad, and set it back afterwards.
chcp 65001 & c:\WINDOWS\notepad.exe nèw.txt & chcp 866
By the way, to be able to run the command in console and see the filename correctly, you will need to change the console font to Lucida Console in console window properties.
It might be even worse: you will need to change the code page of the current process. To do that, you will need either to run chcp 65001 right before the script starts or to use pywin32 to do it within the script.
You can try opening the file as:
subprocess.call((n + f).encode("cp437"))
or whichever codepage chcp reports as being used in a command prompt window. If you try to chcp 65001 as starbuck suggested, you'll have to edit the stdlib encodings\aliases.py file and add cp65001 as an alias to 'utf-8' beforehand. It's an open issue in the Python source.
UPDATE: since this is a multiple target scenario, before running such a command, make sure you run a single chcp command first, analyse the output and retrieve the current "Command Prompt" (DOS) codepage. Subsequently, use the discovered codepage to encode the subprocess.call argument.
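The UPDATE above (run chcp, analyse its output, then encode with the discovered code page) could be sketched roughly as follows. The helper name codepage_from_chcp is hypothetical. The sketch assumes that the chcp output contains the code page number as its only run of digits, as in the English "Active code page: 437", and it relies only on the number because the label is localized.

```python
import re


def codepage_from_chcp(output):
    """Return a codec name such as 'cp437' from the text printed by chcp."""
    # Only the number is used; the surrounding label differs between locales.
    match = re.search(r"(\d+)", output)
    if match is None:
        raise ValueError("no code page number found in %r" % (output,))
    return "cp" + match.group(1)
```

On Windows the input text could come from subprocess.check_output('chcp', shell=True), and the result would then be used as in subprocess.call((n + f).encode(codepage_from_chcp(text))). As noted above, a result of cp65001 may not be a codec that Python knows about.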
Use os.startfile with the operation edit. This will work better as it will open the default application for your extension.

Different results from converting a file from iso-8859-1 to utf-8 iconv in shell vs calling it from python with subprocess

Well, this could be a simple question; to be frank, I'm a little confused by encodings and all those things.
Let's suppose I have the file 01234.txt which is iso-8859-1.
When I do:
iconv --from-code=iso-8859-1 --to-code=utf-8 01234.txt > 01234_utf8.txt
It gives me the desired result, but when I do the same thing with python and using subprocess:
import subprocess
p0 = subprocess.Popen([<here the same command>], shell=True)
p0.wait()
I get almost the same result, but the new file is missing e.g. part of the line before the last one and the last one.
Here the last three lines of both files:
iconv result:
795719000|MARIA TERESA MARROU VILLALOBOS|107
259871385|CHRISTIAM ALBERTO SUAREZ VILLALOBOS|107
311015100|JORGE MEZA CERVANTES|09499386
python result:
795719000|MARIA TERESA MARROU VILLALOBOS|107
259871385|CHRISTIAM
EDIT: In the python file I've tried using coding: utf-8 and coding: iso-8859-1 (not both at the same time).
EDIT: I've used codecs in bpython and it works great. When running it from a file I get the undesired result.
EDIT: I'm using linux (Ubuntu 9.10) and python 2.6.2.
Any suggestions?
You wrote: "In the python file I've used coding: utf-8 and coding: iso-8859-1."
Only the first of those will be used. Secondly, that declaration specifies the encoding of the Python source file in which it appears, so that the Python compiler can do its job. Consequently it has absolutely nothing to do with the encodings of your input file and output file. A script to transcode data from encoding X to encoding Y can be written using only ASCII characters.
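To illustrate that last point, here is a minimal sketch of such a transcoding script, written with ASCII characters only. It is not the asker's code; the function name transcode and its defaults are assumptions chosen to match the iconv command in the question.

```python
import io


def transcode(src_path, dst_path, src_encoding="iso-8859-1", dst_encoding="utf-8"):
    """Copy src_path to dst_path, re-encoding the text line by line."""
    # newline="" on both sides keeps the original line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
            for line in src:
                dst.write(line)
```

For the files in the question this would be called as transcode('01234.txt', '01234_utf8.txt'). Both files are closed, and therefore flushed, when the with blocks exit.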
Now to your problem:
You wrote: "p0 = subprocess.Popen([<here the same command>], shell=True)"
Please (always) when asking a question, show the EXACT code that was run, not what you hoped/thought was run. Use copy/paste, don't retype it. Don't try to put it in a comment; edit your question.
Update: Here is a GUESS, based on the symptoms: you are losing the last few bytes of a file -- looks like failure to flush a buffer before fading away. Is the size of the truncated output file an integral power of 2?
Perhaps you should not rely on the command line processor handling > 01234_utf8.txt reliably. If you omit that part of the command, does the full payload appear on stdout? If so, you may be able to work around the problem by opening the output file yourself, passing its handle as the stdout arg, and later doing handle.flush() and handle.close().
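The workaround suggested above might look roughly like this. It is a sketch under assumptions: the helper name run_to_file is hypothetical, and the command is passed as an argument list without shell=True so that no shell redirection is involved.

```python
import subprocess


def run_to_file(argv, out_path):
    """Run argv (a list, no shell) with its stdout written to out_path."""
    with open(out_path, "wb") as handle:
        # The child process writes directly to the file; subprocess.call
        # waits for it to finish before returning its exit code.
        returncode = subprocess.call(argv, stdout=handle)
    # Leaving the with block flushes and closes the handle.
    return returncode
```

For the command in the question, the call would be run_to_file(['iconv', '--from-code=iso-8859-1', '--to-code=utf-8', '01234.txt'], '01234_utf8.txt'). A non-zero return value would indicate that iconv itself reported a failure.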
