add header to stdout of a subprocess in python - python

I am merging several dataframes into one and sorting them using unix sort. Before I write the final sorted data I would like to add a prefix/header to that output.
So, my code is something like:
my_cols = '\t'.join(['CHROM', 'POS', "REF" ....])
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]
with open(output + 'mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    subprocess.run(my_cmd, stdout=sort_data)
But the code above puts my_cols at the end of the final output file (i.e. mergedAndSorted.txt).
I also tried substituting:
sort_data=io.StringIO(my_cols)
but this gives me an error as I had expected.
How can I add that header to the beginning of the subprocess output? I believe this can be achieved with a simple code change.

The problem with your code is a matter of buffering; the tldr is that you can fix it like this:
sort_data.write(my_cols + '\n')
sort_data.flush()
subprocess.run(my_cmd, stdout=sort_data)
If you want to understand why it happens, and how the fix solves it:
When you open a file in text mode, you're opening a buffered file. Writes go into the buffer, and the file object doesn't necessarily flush them to disk immediately. (There's also stream-encoding from Unicode to bytes going on, but that doesn't really add a new problem, it just adds two layers where the same thing can happen, so let's ignore that.)
As long as all of your writes are to the buffered file object, that's fine—they get sequenced properly in the buffer, so they get sequenced properly on the disk.
But if you write to the underlying sort_data.buffer.raw disk file, or to the sort_data.fileno() OS file descriptor, those writes may get ahead of the ones that went to sort_data.
And that's exactly what happens when you use the file as a pipe in subprocess. This doesn't seem to be explained directly, but can be inferred from Frequently Used Arguments:
stdin, stdout and stderr specify the executed program’s standard input, standard output and standard error file handles, respectively. Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer), an existing file object, and None.
This implies pretty strongly—if you know enough about the way piping works on *nix and Windows—that it's passing the actual file descriptor/handle to the underlying OS functionality. But it doesn't actually say that. To really be sure, you have to check the Unix source and Windows source, where you can see that it is calling fileno or msvcrt.get_osfhandle on the file objects.
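Putting the fix back into the question's code, a complete sketch might look like this (column list and output path simplified from the question):
import subprocess

my_cols = '\t'.join(['CHROM', 'POS', 'REF'])      # abbreviated column list
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]

with open('mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')   # lands in Python's user-space buffer
    sort_data.flush()                 # push the header to the OS before sort starts writing
    subprocess.run(my_cmd, stdout=sort_data)  # sort writes via sort_data.fileno()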

Related

Python Compressed file ended before the end-of-stream marker was reached. But file is not Corrupted

I made some simple requests code that downloads a file from a server:
r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:\...\index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close
When I now extract the file manually in the directory with 7zip, everything is fine and the file decompresses as normal.
I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:
import lzma
with open('C:\...\index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)
This one gives me the error "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.
The second way I tried was with 7zip (via py7zr), because by hand it worked fine:
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
This one gives me the error "OSError 22 Invalid Argument" at the "with py7zr..." line.
I really don't understand where the problem is here. Why does it work by hand but not in Python?
Thanks
You didn't close your file, so data stuck in user mode buffers isn't visible on disk until the file is cleaned up at some undetermined future point (may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which would cause it to appear as if the file was truncated, getting the error you observe.
The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup), whenever possible, so even when an exception occurs the file is definitely closed; it's most important for files you're writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open file handle limit.
Rewriting your first block of code to be completely safe gets you:
with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    index_en.write(r.content)
Note: requests.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn't get corrupted (e.g. "C:\foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):
# stream=True means underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever size
    # is available in a single system call)
    # Changing to writelines means the iterator is consumed and written
    # as the data arrives
    index_en.writelines(r.iter_content(None))
Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn't consumed and freed immediately).
Also note that print(compressed.readline) is doing nothing (because you didn't call readline). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is not such a garbage line, and if you'd called readline properly (with print(compressed.readline())), it would have broken decompression because the file pointer would now have skipped the first few (or many) bytes of the file, landing at some mostly random offset.
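For reference, a minimal sketch of the first approach with those issues fixed (binary mode so LZMAFile gets raw bytes, no stray readline; the path is a hypothetical stand-in for the question's elided one):
import lzma

archive_path = r"C:\path\to\index_en.txt.lzma"            # hypothetical path

with open(archive_path, 'rb') as compressed:              # binary mode for the raw LZMA data
    with lzma.LZMAFile(compressed) as uncompressed:       # wrap the raw stream
        for line in uncompressed:
            print(line)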
Lastly,
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
is wrong because you passed it a mode indicating you're opening it for write, when you're clearly attempting to read from it; either omit the 'w' or change it to 'r'.
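A sketch of that call with the mode corrected (this assumes the file really is a 7z archive that py7zr can read; the paths are the question's placeholders):
import py7zr

with py7zr.SevenZipFile(r"C:\...\index_en.txt.lzma", 'r') as archive:
    archive.extract(path=r"C:\...\Json")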

How to read a print out statement from another program with python?

I have an algorithm written in C++ that outputs cout debug statements to the terminal window, and I would like to figure out how to read that printout with Python without it being piped/written to a file or returned as a value.
Python organizes how each of the individual C++ algorithms is called, while the data is kept on the heap and not on disk. Below is an example of a situation with similar output:
+-------------- terminal window-----------------+
(c++)runNewAlgo: Debug printouts on
(c++)runNewAlgo: closing pipes and exiting
(c++)runNewAlgo: There are 5 objects of interest found
( PYTHON LINE READS THE PRINT OUT STATEMENT)
(python)main.py: Starting the next processing node, calling algorithm
(c++)newProcessNode: Node does work
+---------------------------------------------------+
Say the line of interest is "there are 5 objects of interest" and the code will be inserted before the python call. I've tried to use sys.stdout and subprocess.Popen() but I'm struggling here.
Your easiest path would probably be to invoke your C++ program from inside your Python script.
More details here: How to call an external program in python and retrieve the output and return code?
You can use stdout from the returned process and read it line-by-line. The key is to pass stdout=subprocess.PIPE so that the output is sent to a pipe instead of being printed to your terminal (via sys.stdout).
Since you're printing human-readable text from your C++ program, you can also pass encoding='utf-8' as well to automatically decode each line using utf-8 encoding; otherwise, raw bytes will be returned.
import subprocess

proc = subprocess.Popen(['/path/to/your/c++/program'],
                        stdout=subprocess.PIPE, encoding='utf-8')
for line in proc.stdout:
    do_something_with(line)
    print(line, end='')  # if you also want to see each line printed
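If you specifically need to catch the "objects of interest" line from the question's example output, a hypothetical filter (program path and downstream handling are placeholders) might look like this:
import re
import subprocess

proc = subprocess.Popen(['/path/to/your/c++/program'],
                        stdout=subprocess.PIPE, encoding='utf-8')
for line in proc.stdout:
    match = re.search(r'There are (\d+) objects of interest', line)
    if match:
        count = int(match.group(1))
        print(f'found {count} objects of interest')  # hand off to the next processing node here
proc.wait()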

"_csv.Error: line contains NULL byte" after truncating a csv log file that is being piped to by another process

How do I truncate a csv log file that is being used as std out pipe destination from another process without generating a _csv.Error: line contains NULL byte error?
I have one process running rtlamr > log/readings.txt that is piping radio signal data to readings.txt. I don't think it matters what is piping to the file--any long-running pipe process will do.
I have a file watcher using watchdog (Python file watcher) on that file, which triggers a function when the file is changed. The function reads the file and updates a database.
Then I try to truncate readings.txt so that it doesn't grow infinitely (or back it up).
file = open(dir_path+'/log/readings.txt', "w")
file.truncate()
file.close()
This corrupts readings.txt and generates the error (the start of the file contains garbage characters).
I tried moving the file instead of truncating it, in the hopes that rtlamr will recreate a fresh file, but that only has the effect of stopping the pipe.
EDIT
I noticed that the charset changes from us-ascii to binary but attempting to truncate the file with file = open(dir_path+'/log/readings.log', "w",encoding="us-ascii") does not do anything.
If you truncate a file¹ while another process has it open in w mode, that process will continue to write at the same offsets, making the file sparse. Low offsets will thus be read back as NUL (0) bytes.
As per x11 - Concurrent writing to a log file from many processes - Unix & Linux Stack Exchange and Can two Unix processes simultaneous write to different positions in a single file?, each process that has a file open has its own offset in it, and a ftruncate() doesn't change that.
If you want the other process to react to truncation, it needs to have the file open in append ('a') mode.
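A minimal demonstration of that effect (two handles in one process standing in for the two processes; POSIX semantics assumed):
writer = open('demo.log', 'w')      # plays the role of the long-running producer
writer.write('first line\n')
writer.flush()                      # the writer's file offset is now 11

open('demo.log', 'w').close()       # another handle truncates the file

writer.write('second line\n')       # still written at offset 11
writer.flush()

with open('demo.log', 'rb') as f:
    print(f.read())                 # b'\x00' * 11 + b'second line\n'
writer.close()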
Your approach has fundamental bugs, too. E.g. it's not atomic: you may (= will, eventually) truncate the file after the producer has added data but before you have read it, so that data would be lost.
Consider using dedicated data buffering utilities instead like buffer or pv as per Add a big buffer to a pipe between two commands.
¹ Which is superfluous, because open(mode='w') already truncates. Either truncate or reopen; no need to do both.

Using textfile as stdin in python under windows 7

I'm a win7-user.
I accidentally read about redirections (like command1 < infile > outfile) in *nix systems, and then I discovered that something similar can be done in Windows (link). And Python can also do something like this with pipes(?) or stdin/stdout(?).
I do not understand how this happens in Windows, so I have a question.
I use some kind of proprietary windows-program (.exe). This program is able to append data to a file.
For simplicity, let's assume that it is the equivalent of something like
from time import ctime, sleep

while True:
    f = open('textfile.txt', 'a')
    f.write(repr(ctime()) + '\n')
    f.close()
    sleep(100)
The question:
Can I use this file (textfile.txt) as stdin?
I mean that the script (while it runs) should always (not just once) handle all new data, i.e.:
In the "never-ending cycle":
The program (.exe) writes something.
Python script captures the data and processes.
Could you please write how to do this in python, or maybe in win cmd/.bat or somehow else.
This is an insanely cool thing. I want to learn how to do it! :D
If I am reading your question correctly then you want to pipe output from one command to another.
This is normally done as such:
cmd1 | cmd2
However, you say that your program only writes to files. I would double-check the documentation to see if there isn't a way to get the command to write to stdout instead of a file.
If this is not possible then you can create what is known as a named pipe. It appears as a file on your filesystem, but is really just a buffer of data that can be written to and read from (the data is a stream and can only be read once). This means the program reading it will not finish until the program writing to the pipe stops writing and closes the "file". I don't have experience with named pipes on Windows, so you'll need to ask a new question for that. One downside of pipes is that they have a limited buffer size, so if no program is reading data from the pipe, then once the buffer is full the writing program won't be able to continue and will just wait indefinitely until a program starts reading from the pipe.
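For what it's worth, here is a POSIX-only sketch of the named-pipe idea (the question targets Windows, where named pipes use a different API, e.g. via pywin32; the path is hypothetical):
import os

fifo_path = '/tmp/demo_fifo'        # hypothetical path
if not os.path.exists(fifo_path):
    os.mkfifo(fifo_path)            # create the named pipe

# Opening the FIFO for reading blocks until a writer opens the other end;
# data can then be read exactly once as it streams through.
with open(fifo_path) as fifo:
    for line in fifo:
        print('got:', line, end='')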
An alternative is that on Unix there is a program called tail which can be set up to continuously monitor a file for changes and output any data as it is appended to the file (with a short delay).
tail --follow=name --retry textfile.txt | mycmd
# wait for data to be appended to the file and output new data to mycmd
cmd1 >> textfile.txt # append output to file
One thing to note about this is that tail won't stop just because the first command has stopped writing to the file. tail will continue to listen to changes on that file forever or until mycmd stops listening to tail, or until tail is killed (or "sigint-ed").
This question has various answers on how to get a version of tail onto a windows machine.
import sys
sys.stdin = open('textfile.txt', 'r')
for line in sys.stdin:
    process(line)
If the program writes to textfile.txt, you can't change that to redirect to stdin of your Python script unless you recompile the program to do so.
If you were to edit the program, you'd need to make it write to stdout, rather than a file on the filesystem. That way you can use the redirection operators to feed it into your Python script (in your case the | operator).
Assuming you can't do that, you could write a program that polls for changes on the text file, and consumes only the newly written data, by keeping track of how much it read the last time it was updated.
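A minimal polling sketch of that idea (names are illustrative; truncation or rotation of the file is not handled):
import time

def follow(path, interval=1.0):
    """Yield complete lines appended to `path`, remembering how far we've read."""
    offset = 0
    while True:
        with open(path, 'rb') as f:
            f.seek(offset)
            for raw in f:
                if not raw.endswith(b'\n'):
                    break                     # partial line: wait for the rest
                offset += len(raw)
                yield raw.decode('utf-8', errors='replace')
        time.sleep(interval)

for line in follow('textfile.txt'):
    print(line, end='')                       # replace with real processing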
When you use < to redirect the contents of a file to a Python script, that script receives the data on its stdin stream.
Simply read from sys.stdin to get that data:
import sys

for line in sys.stdin:
    ...  # do something with line

Is it possible for `tail` to emit incomplete lines?

I am using tail -F log | python parse.py to monitor and parse a growing log file, but some parsing errors occur that may be caused by reading incomplete lines from the log file.
Is it possible that tail emits incomplete lines?
In the parser, I am reading rows with code like the following:
import csv
import sys

reader = csv.reader(sys.stdin)
for row in reader:
    ...  # process
It is possible that tail can emit 'unparsable lines' - but only if invalid lines are written to the file. Kind of a circular bit of argument, but here's an example of how it could happen:
You tail -f on /var/log/syslog
syslog-ng dies in the middle of a block-spanning write. Sectors are 512 bytes and your filesystem block size is most likely larger (although probably not much larger than 4096), so if syslog has 9k of data buffered up to write out, it may get through one 4k page and die before it can write the remaining 4k+1k. On ext2 at least, that'll end up as a partial write even after fsck; I'd hope a journaling filesystem like ext3 does better, but I've been doing embedded work for so long I can't remember. And who's to say that the data you're writing is always going to be correct? You might get a non-fatal string formatting error that doesn't include the newline you're expecting.
You'll then have a partial line that isn't terminated with a newline (or even a \0), and the next time syslog starts up and starts appending, it'll just append to the end of the file with no notion of 'valid' records. So the first new record will be garbage, but the next one will be OK.
This is easy to exercise:
In one window
tail -f SOMEFILE
In another window
echo FOO >>SOMEFILE
echo BAR >>SOMEFILE
printf NO_NEWLINE >>SOMEFILE
echo I_WILL_HAVE_THE_LAST_LINE_PREFIXED_TO_ME_CAUSING_NERD_RAGE >>SOMEFILE
Since Linux's tail uses inotify by default, whatever is reading its output will receive NO_NEWLINE without a trailing newline and will wait until the next newline comes along, at which point NO_NEWLINE ends up prefixed to what it considers 'the latest line'.
If you want to do this the 'Pythonic' way: if you're using Linux, use inotify; if you're using OSX or BSD, use 'knotty'; and skip using 'tail' as an input pipe, just watch the file yourself.
Tail might do weird things if you use the 'resync on truncate' too - i.e. if the file gets zeroed and restarted in the middle of a read, you might get some weird amount of data on the read since 'tail' will close the previously opened file handle in exchange for the new one.
Since you changed up the question on me.. new answer! :p
import csv
import sys

reader = csv.reader(sys.stdin)
for row in reader:
    try:
        validate_row_data_somehow(row)
        do_things_with_valid_row(row)
    except Exception:
        print("failed to process row", repr(row))
