python3 for win and cygwin - line endings in buffer - python

Setup: Python 3.6 for Windows running under Cygwin (I have to use the Windows build because of functionality introduced in 3.5, and Cygwin is stuck at 3.4).
How do I get \n newlines in the (stdout) buffer output of a Python script, instead of \r\n? The output is a list of paths and I want one per line for further processing by other Cygwin/Windows tools.
All the answers I've found so far deal with writing files, whereas I just want to change what is written to stdout. So far the only reliable way to get rid of \r is piping the results through sed 's/\\10//', which is awkward.
The weird thing is that even Windows applications fed the script's output don't accept it, failing with messages like:
Can't find file <asdf.txt
>
(note newline before >)
Supposedly sys.stdout.write does raw output, but when I do:
sys.stdout.write(line)
I get the list of paths without any separation at all. If I introduce anything resembling a newline (\n, \012, etc.) it is automatically converted to CRLF (\r\n). How do I stop this conversion?

You need to write to stdout in binary mode; the default is text mode, which translates everything you write.
According to Issue4571 you can do this by writing directly to the internal buffer used by stdout.
sys.stdout.buffer.write(line)
Note that if you're writing Unicode strings you'll need to encode them to byte strings first.
sys.stdout.buffer.write(line.encode('utf-8')) # or 'mbcs'
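For example, a minimal sketch (the paths here are made up) that emits one path per line with a bare \n by writing straight to the binary buffer:
import sys

paths = [r'C:\temp\a.txt', r'C:\temp\b.txt']  # hypothetical example data
for p in paths:
    # bypasses the text layer, so no \n -> \r\n translation happens
    sys.stdout.buffer.write(p.encode('utf-8') + b'\n')
sys.stdout.buffer.flush()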

Related

File size changes after read/write txt file in python

After executing the following code to generate a copy of a text file with Python, the newfile.txt doesn't have the exact same file size as oldfile.txt.
with open('oldfile.txt', 'r') as a, open('newfile.txt', 'w') as b:
    content = a.read()
    b.write(content)
While oldfile.txt has e.g. 667 KB, newfile.txt has 681 KB.
Does anyone have an explanation for that?
There are various possible causes.
You are opening the file in text mode, so the bytes of the file are interpreted (decoded) on the way into Python and then encoded again on the way out, so changes are possible.
From the open documentation (https://docs.python.org/3/library/functions.html#open):
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.
So if the original file used \r\n line endings (e.g. it was generated on Windows), the \r is stripped on reading. When the file is written back you either no longer have the original \r (on Linux or macOS) or you always get \r\n (on Windows, which seems to be your case, since the file grew in size).
Encoding can also change the text: a BOM could be removed (or added), and potentially (though AFAIK it is not done implicitly) redundant code points could be dropped. Unicode has some extra code points that change the behaviour of nearby code points; more than one of them may be present, but only the last one is effective.
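A quick way to confirm this, as a minimal sketch using the file names from the question: copy the file in binary mode, so no decoding or newline translation happens, and the sizes should come out equal:
import os

with open('oldfile.txt', 'rb') as a, open('newfile.txt', 'wb') as b:
    b.write(a.read())  # bytes pass through untouched

print(os.path.getsize('oldfile.txt'), os.path.getsize('newfile.txt'))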
I tried this on Linux/Ubuntu and it works as expected; the file sizes of both files are exactly equal.
At this point, I guess this behaviour is not related to Python; maybe it depends on your filesystem (compression) or operating system.

Modifying files containing SUB/escape characters

I am beginning to learn Python and want to use it to automate a process.
The process consists of:
modifying a few lines of a file
using the file as the input for an executable
saving, moving, etc.
repeating
The problem is that the file I'm trying to modify was written in a language that utilizes the SUB character to run. Therefore, when I try
with open(myFile, 'r') as file:
    data = list(file)
data does not contain any information beyond the SUB character.
Therefore, I need to be able to do two things:
Read the whole file in python (without exiting prematurely at the SUB character locations) so that I can modify it.
Be able to run it on the executable (that is, the SUB characters need to be back at their respective places).
Any suggestions on how to go about solving this problem?
Thanks
Use binary mode to open the file:
with open(myFile, 'rb') as file:
    for line in file:
        print line
Are you on Windows? Quoted from your link to the SUB character:
In CP/M, 86-DOS, MS-DOS, PC DOS, DR-DOS and their various derivatives, character 26 was also used to indicate the end of a character stream, and thereby used to terminate user input in an interactive command line window (and as such, often used to finish console input redirection, e.g. as instigated by COPY CON: TYPEDTXT.TXT).
While no longer technically required to indicate the end of a file many text editors and program languages up to the present still support this convention...
Python 2.7 in text mode will stop at a CTRL-Z character (hex 1A), so open the file in binary mode:
Example:
# Create a file with embedded character 1Ah
with open('sub.txt', 'wb') as f:
    f.write(b'abc\x1adef')

# Open in default (text) mode and read as much as possible
with open('sub.txt', 'r') as f:
    print repr(f.read())

# Open in binary mode
with open('sub.txt', 'rb') as f:
    print repr(f.read())
Output:
'abc'
'abc\x1adef'
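To cover both requirements, here is a rough sketch (the file name and the edit are hypothetical) of the full round trip: read the file in binary mode, modify it as bytes, and write it back so the \x1a bytes stay exactly where they were:
with open('input.dat', 'rb') as f:
    data = f.read()  # the SUB characters are just b'\x1a' bytes at this point

data = data.replace(b'OLD_VALUE', b'NEW_VALUE')  # hypothetical modification

with open('input.dat', 'wb') as f:
    f.write(data)  # SUB characters are written back untouched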

Reading both newline types from an already opened file in python

I'm using subprocess.Popen to run a command and grab its stdout.
It so happens that the program (mplayer) sort of uses both eol types, \n and \r. The \rs come from terminal control characters. So the output I end up with is regular lines interspersed with really long lines where the \rs were ignored.
I know that if I had opened a file myself, I could set the newline type. However, I'm getting the stdout from Popen, so I have no control over that.
I had a look at the Python 2.7 source and I imagine I can somehow use TextIOWrapper to respect both eol types. However, I'm not too sure what I need to pass to it. I know I need to pass the constructor some sort of buffer, but I don't know how to get the buffer from an already opened file.
All in all, how do I readline() in Python so that it breaks at both \n and \r, given an already open file/stream?
subprocess.Popen (and subprocess.check_output, if that convenience function is enough for you) has a universal_newlines parameter, which is False by default but when set to True gives you the behaviour you need of converting all newline variants to \n.
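As a minimal sketch (the mplayer command line is made up), reading the output line by line with universal_newlines=True:
import subprocess

proc = subprocess.Popen(['mplayer', 'movie.mkv'],  # hypothetical command
                        stdout=subprocess.PIPE,
                        universal_newlines=True)
for line in proc.stdout:
    print(line.rstrip('\n'))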

Open a new window in Vim-embedded python script

I've just started wrapping my head around vim+python scripts (having no experience with native vim scripts).
How can I open a new window to contain the stdout from a background process?
Currently, after reading some :help python, the only option I see is something like:
cmd = ":bel new"
vim.command(cmd)
Since vim.command can execute most (if not all?) ex commands, you can simply call :new +read!ls from within it.
:new splits the current window and puts a new (empty, no name) buffer into the upper window. It takes an argument +[cmd] which we use to execute read!cmd which reads the stdout of cmd after the bang into the buffer. Be aware that you need to escape spaces in your command with \
All in all you get vim.command("new +read!cmd")
:python vim.command("new +read!ls")
to read the contents of the current directory into a new buffer in a new, horizontally split window.
If you want to handle escaping of special characters, consider using python's re.escape():
:py import re;vim.command("new +read!"+re.escape("ls Dire*"))
which should be sufficient for most cases. If in doubt, check its documentation and compare it to that of your shell.
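If you would rather capture the output in Python and fill the buffer yourself, a rough sketch (the shell command is hypothetical, and the vim module is only available inside Vim's embedded Python) could look like:
import subprocess
import vim  # provided by Vim's embedded Python, not installable separately

out = subprocess.check_output(['ls'])  # hypothetical background command
vim.command('bel new')  # new empty buffer in a lower horizontal split
vim.current.buffer[:] = out.decode().splitlines()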

Unicode filename to python subprocess.call() [duplicate]

This question already has answers here:
Unicode filenames on Windows with Python & subprocess.Popen()
I'm trying to run subprocess.call() with unicode filename, and here is simplified problem:
n = u'c:\\windows\\notepad.exe '
f = u'c:\\temp\\nèw.txt'
subprocess.call(n + f)
which raises famous error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8'
Encoding to utf-8 produces the wrong filename, and mbcs passes the filename as new.txt, without the accent.
I just can't read any more on this confusing subject; I keep going in circles. I have found a lot of answers here for many different problems in the past, so I thought I would join and ask for help myself.
Thanks
I found a fine workaround, it's a bit messy, but it works.
subprocess.call is going to pass the text in its own encoding to the terminal, which might or might not be the one it's expecting. Because you want to make it portable, you'll need to know the machine's encoding at runtime.
The following
import sys
import subprocess

notepad = 'C://Notepad.exe'
subprocess.call([notepad.encode(sys.getfilesystemencoding())])
attempts to figure out the current encoding and therefore applies the correct one to subprocess.call.
As a side note, I have also found that if you attempt to compose a string with the current directory using
os.getcwd()
Python (or the OS, I don't know which) will mess up directories with accented characters. To prevent this I have found the following to work:
os.getcwd().decode(sys.getfilesystemencoding())
which is very similar to the solution above.
Hope it helps.
If the file exists, you can use its short filename (aka 8.3 name). This name is defined for existing files and should cause no trouble to non-Unicode-aware programs when passed as an argument.
One way to obtain one (needs Pywin32 to be installed):
import win32api
short_path = win32api.GetShortPathName(unicode_path)
Alternatively, you can also use ctypes:
import ctypes
import ctypes.wintypes
ctypes.windll.kernel32.GetShortPathNameW.argtypes = [
    ctypes.wintypes.LPCWSTR,  # lpszLongPath
    ctypes.wintypes.LPWSTR,   # lpszShortPath
    ctypes.wintypes.DWORD     # cchBuffer
]
ctypes.windll.kernel32.GetShortPathNameW.restype = ctypes.wintypes.DWORD
buf = ctypes.create_unicode_buffer(1024) # adjust buffer size, if necessary
ctypes.windll.kernel32.GetShortPathNameW(unicode_path, buf, len(buf))
short_path = buf.value
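The short name can then be passed on like any other argument; a small sketch reusing the (hypothetical) paths from the question:
import subprocess
import win32api  # Pywin32

short_path = win32api.GetShortPathName(u'c:\\temp\\nèw.txt')
subprocess.call([u'c:\\windows\\notepad.exe', short_path])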
It appears that to make this work, the subprocess code would have to be modified to use a wide-character version of CreateProcess (assuming one exists). There's a PEP discussing the same change made for the file object at http://www.python.org/dev/peps/pep-0277/. Perhaps you could research the Windows C calls and propose a similar change for subprocess.
I don't have an answer for you, but I've done a fair amount of research into this problem. Python converts all output (including system calls) to the same character set as the terminal it is running in. Windows terminals use code pages for character mapping; the default code page is 437, but it can be changed with the chcp command. chcp 65001 will theoretically change the code page to UTF-8, but as far as I know Python doesn't know what to do with that, so you're SOL.
As ΤΖΩΤΖΙΟΥ and starbuck mentioned, the problem is the console code page, which in your case is 866 (the Russian localisation of Windows) and not 1251. Just run chcp in the console.
The problem is the same as when you want to output Unicode to the Windows console. Unfortunately, you will need at least to register an alias for Unicode as 'cp866' in encodings\aliases.py (or do it programmatically at script start), and change the code page of the console to 65001 before running Notepad, setting it back afterwards:
chcp 65001 & c:\WINDOWS\notepad.exe nèw.txt & chcp 866
By the way, to be able to run the command in the console and see the filename correctly, you will need to change the console font to Lucida Console in the console window properties.
It might be even worse: you may need to change the code page of the current process. To do that, you will need to either run chcp 65001 right before the script starts or use pywin32 to do it from within the script.
You can try opening the file as:
subprocess.call((n + f).encode("cp437"))
or whichever code page chcp reports as being used in a command prompt window. If you try chcp 65001 as starbuck suggested, you'll have to edit the stdlib encodings\aliases.py file and add cp65001 as an alias for 'utf-8' beforehand. It's an open issue in the Python source.
UPDATE: since this is a multiple-target scenario, before running such a command make sure you run a single chcp command first, analyse the output, and retrieve the current "Command Prompt" (DOS) code page. Then use the discovered code page to encode the subprocess.call argument.
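A rough sketch of that idea (the parsing of chcp's output is deliberately naive, and n and f are the variables from the question):
import subprocess

n = u'c:\\windows\\notepad.exe '
f = u'c:\\temp\\nèw.txt'

# chcp prints something like "Active code page: 866"
chcp_out = subprocess.check_output('chcp', shell=True)
codepage = 'cp' + ''.join(c for c in chcp_out.decode('ascii', 'ignore') if c.isdigit())
subprocess.call((n + f).encode(codepage))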
Use os.startfile with the 'edit' operation. This will work better, as it will open the default application for your extension.
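A minimal sketch (reusing the path from the question, and assuming the .txt file type has an 'edit' verb registered):
import os

os.startfile(u'c:\\temp\\nèw.txt', 'edit')  # opens the registered editor for .txt files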
