Grabbing output FILE from Python Popen process?

I have written a python program to interface with a compiled program (call it ProgramX) that has some idiosyncrasies that are proving difficult to deal with. I need to feed many thousands of input files to ProgramX via my python program. What I would like to do is to grab the output file that ProgramX creates with each run, and rename it something sensible, like inputfilename.output.
The problem comes in the output file that is written by ProgramX -- it is named via an unpredictable method, which will write, and "mercilessly overwrite", the output file if it already exists (which is the case the majority of the time). The saving grace probably comes with the fact that there is a standard prefix to the output files: think ProgramX.notQuiteRandomNumber.
The only thing I can think to do is something like this in my bash shell:
PROGRAMXOUTPUT=$(ls -ltr ProgramX* | tail -n -1 | awk '{print $8}')
mv $PROGRAMXOUTPUT input.output
Which does 90% of what I need, but before I program all that bash into a series of Popen statements, is there a better way to do this? This feels like a problem for which people might have a much better solution than what I'm thinking of.
Sidenote: I can grab the program's standard output without problems, however it's the output file that I need to grab.
Bonus: I was planning on running a bunch of instantiations of the program in the same directory, so my naive approach above may start to have unforeseen problems. So perhaps something fancy that watches the PID of ProgramX and follows its output.

To do what your shell script above does, assuming you've only got one ProgramX* in the current directory:
import glob, os
programxoutput = glob.glob('ProgramX*')[0]
os.rename(programxoutput, 'input.output')
If you need to sort by time, etc., there are ways to do that too (look at os.stat), but using the most recent modification date is a recipe for nasty race conditions if you'll be running multiple copies of ProgramX concurrently.
I'd suggest instead that you create and change to a new, perhaps temporary directory for each run of ProgramX, so the runs have no possibility of treading on each other. The tempfile module can help with this.
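A minimal sketch of the one-directory-per-run idea, under some assumptions: ProgramX accepts the input file path as its last command-line argument, it writes exactly one file whose name starts with the prefix into its working directory, and the helper name run_in_private_dir and its parameters are made up for illustration.

```python
import glob
import os
import shutil
import subprocess
import tempfile


def run_in_private_dir(command, input_path, output_dir, prefix="ProgramX"):
    """Run `command` on `input_path` inside a fresh temporary directory and
    move the single prefix-named file it creates to <input name>.output.

    `command` is a list such as ["/path/to/ProgramX"] (hypothetical); the
    input path is appended as the last argument, which is an assumption
    about how ProgramX is invoked.
    """
    input_path = os.path.abspath(input_path)
    output_dir = os.path.abspath(output_dir)
    workdir = tempfile.mkdtemp()
    try:
        # cwd=workdir means concurrent runs can never see each other's files.
        subprocess.check_call(list(command) + [input_path], cwd=workdir)
        produced = glob.glob(os.path.join(workdir, prefix + "*"))
        if len(produced) != 1:
            raise RuntimeError("expected one output file, found %d" % len(produced))
        target = os.path.join(output_dir, os.path.basename(input_path) + ".output")
        shutil.move(produced[0], target)
        return target
    finally:
        # Remove the scratch directory whether or not the run succeeded.
        shutil.rmtree(workdir, ignore_errors=True)
```

Because each run owns its directory, there is no need to sort by modification time, and several copies can run in parallel without overwriting each other's output.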

Two options that I see:
You could use lsof to find open files to find the files that ProgramX is writing.
A different approach would be to run ProgramX in a temporary directory (see tempfile for an easy way of setting up directories). Between runs of ProgramX, you can clean that directory, or keep requesting new temp directories if you are planning on running multiple copies of ProgramX at the same time.

If there is only one ProgramX* file, then what about just:
mv ProgramX* input.output

Related

separate working path for thread?

I'm writing a Python script within which I sometimes change directories with os.chdir(IMG_FOLDER) in order to do my file operations. That works fine as long as I have only one thread (as I can go back where I came from before leaving the function). Now, in the case of multithreading, I would require a separate "os path" instance for each thread, otherwise it might mess up my file operations, hey?
How do I best go about this?
Don't use os.chdir. Instead, use os.path.join to form full paths.
The ultimate solution to this problem was that I:
use absolute paths, no longer relative ones, as suggested by Perkins
write incoming data in my main thread to a file with a .tmp name and, only once the write has completely finished, rename it to the name that I'm scanning for in the separate thread.
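The write-then-rename step can be sketched as follows. This is an illustration rather than the poster's actual code: the function name write_then_publish is hypothetical, and os.replace assumes Python 3.3 or later. The rename is atomic on POSIX only when both names live on the same filesystem, which holds here because the temporary name sits next to the final one.

```python
import os


def write_then_publish(directory, name, data):
    """Write `data` (bytes) under a temporary name, then rename it into place.

    A scanning thread that looks for `name` never sees a half-written file,
    because the final name only appears once the write has finished.
    """
    # Build absolute paths with os.path.join instead of relying on os.chdir.
    final_path = os.path.join(os.path.abspath(directory), name)
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before publishing
    os.replace(tmp_path, final_path)
    return final_path
```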

Python run binary and intercept file writes (using subprocess)

I have a simple command-line utility which produces output both on the console and the filesystem. While I know very well how to capture the console output, I am not aware how can I also intercept the file - for which I know the filename in advance.
I would like to keep the execution "in memory" without touching the filesystem as I immediately parse and delete the file created and this creates an unnecessary bottleneck (especially when I need to run the little tool millions of times).
So, to sum up, I am trying to achieve following:
Run a binary using python's subprocess
Capture both the tool's output AND contents of a file it creates (in current working directory with in-advance known name)
Ideally, run it all without touching the filesystem.
Since you only need to support Linux, one possibility is to use named pipes. The idea is to pre-create the output file as a named pipe, and have your process read the tool's output from the pipe.
See, for example, Introduction to Named Pipes.
The Python API is os.mkfifo().
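A rough, Linux-only sketch of the named-pipe idea. The helper name run_with_fifo is made up, and it assumes the tool simply opens the known filename in its working directory for writing and writes sequentially. A tool that deletes and recreates the file, or seeks within it, will not work with a FIFO, and if the tool never opens the file the reader thread blocks forever.

```python
import os
import subprocess
import tempfile
import threading


def run_with_fifo(command, filename):
    """Run `command` with `filename` pre-created as a named pipe in its
    working directory; return (console_output, file_contents) as bytes."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, filename)
    os.mkfifo(fifo_path)
    chunks = []

    def drain():
        # open() blocks until the tool opens the pipe for writing.
        with open(fifo_path, "rb") as fifo:
            chunks.append(fifo.read())

    # Read the pipe in a thread so that reading stdout cannot deadlock with it.
    reader = threading.Thread(target=drain)
    reader.daemon = True
    reader.start()
    try:
        proc = subprocess.Popen(command, cwd=workdir, stdout=subprocess.PIPE)
        console, _ = proc.communicate()
        reader.join()
    finally:
        os.remove(fifo_path)
        os.rmdir(workdir)
    return console, b"".join(chunks)
```

The pipe still has a name in the filesystem, but the data written to it goes straight to the reader without being stored on disk.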

Most efficient way to traverse file structure Python

Is using os.walk in the way below the least time consuming way to recursively search through a folder and return all the files that end with .tnt?
import os
for root, dirs, files in os.walk('C:\\data'):
    print("Now in root %s" % root)
    for f in files:
        if f.endswith('.tnt'):
            pass  # handle the matching file here
Yes, using os.walk is indeed the best way to do that.
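For reference, the same os.walk loop can be wrapped in a small generator so the caller gets full paths back. This is only a sketch; the name find_by_extension and its parameters are hypothetical.

```python
import os


def find_by_extension(top, extension):
    """Yield the full path of every file under `top` whose name ends with
    `extension` (for example '.tnt')."""
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.endswith(extension):
                yield os.path.join(root, name)
```

Usage would look like `for path in find_by_extension('C:\\data', '.tnt'): ...`, which keeps the traversal lazy instead of building one large list first.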
As everyone has said, os.walk is almost certainly the best way to do it.
If you actually have a performance problem, and profiling has shown that it's caused by os.walk (and/or iterating the results with .endswith), your best answer is probably to step outside Python. Replace all of the code above with:
for f in sys.argv[1:]:
Now you need some outside tool that can gather the paths and run your script. (Ideally batching as many paths as possible into each script execution.)
If you can rely on Windows Desktop Search having indexed the drive, it should only need to do a quick database operation to find all files under a certain path with a certain extension. I have no idea how to write a batch file that runs that query and gets the results as a list of arguments to pass to a Python script (or a PowerShell file that runs the query and passes the results to IronPython without serializing it into a list of arguments), but it would be worth researching this before anything else.
If you can't rely on your platform's desktop search index, on any POSIX platform, it would almost certainly be fastest and simplest to use this one-liner shell script:
find /my/path -name '*.tnt' -exec myscript.py {} +
Unfortunately, you're not on a POSIX platform, you're on Windows, which doesn't come with the find tool, which is the thing that's doing all the heavy lifting here.
There are ports of find to native Windows, but you'll have to figure out the command-line intricacies to get everything quoted right and format the path and so on, so you can write the one-liner batch file. Alternatively, you could install cygwin and use the exact same shell script you'd use on a POSIX system. Or you could find a more Windows-y tool that does what you need.
This could conceivably be slower rather than faster—Windows isn't designed to execute lots of little processes with as little overhead as possible, and I believe it has smaller limits on command lines than platforms like linux or OS X, so you may spend more time waiting for the interpreter to start and exit than you save. You'd have to test to see. In fact, you probably want to test both native and cygwin versions (with both native and cygwin Python, in the latter case).
You don't actually have to move the find invocation into a batch/shell script; it's probably the simplest answer, but there are others, such as using subprocess to call find from within Python. This might solve performance problems caused by starting the interpreter too many times.
Getting the right amount of parallelism may also help—spin off each invocation of your script to the background and don't wait for them to finish. (I believe on Windows, the shell isn't involved in this; instead there's a tool named something like "run" that kicks off a process detached from the shell. But I don't remember the details.)
If none of this works out, you may have to write a custom C extension that does the fastest possible Win32 or .NET thing (which also means you have to do the research to find out what that is…) so you can call that from within Python.

running a python script indefinitely (as a process, pretty much)

I have tests that I run which can take up to 15 minutes at a time. During these 15 minutes, a log file is periodically written to. However, most of the content is useless.
In response to this I have a Python script that parses out the useless text and displays the relevant data.
What I'm trying to achieve is similar to tail -f log_file: constantly updating the terminal with the newest additions to a file. I was thinking that if a Python script ran as a process, it could parse the log file whenever the tests write to it, then go to sleep until it is woken again the next time the log file is written to.
Any ideas how one can achieve this?
I already have a script that does the parsing, I just don't know how to make it do it continually and efficiently.
You could just have the script filter standard input, and pipe tail -f through it. When you're waiting on stdin, your script will sleep, so it's plenty efficient.
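E.g. (note the single &, which starts the long-running script in the background so that tail can follow the log while it runs):
python long_running_script.py & tail -f log_file | python filter_logs.py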
Your script can be something like
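import sys
for line in iter(sys.stdin.readline, ''):  # readline returns '' at EOF
    if filter_line(line):  # filter_line is your own parsing function
        sys.stdout.write(line)
        sys.stdout.flush()  # show each line immediately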
looks like you need something like "pytailer":
http://code.google.com/p/pytailer/
While I never used it myself, last example looks like what you want.
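To show the general idea of following a file from Python, here is a minimal polling sketch. It is not pytailer's implementation or API; the function follow and its parameters are invented for this example, and it simply sleeps and retries when no new line is available. A line that is still being written may be returned in two pieces.

```python
import time


def follow(path, poll_interval=0.5, from_start=False, stop=lambda: False):
    """Yield lines appended to `path`, similar in spirit to `tail -f`.

    `stop` is a callable checked before each read; the generator ends
    once it returns True.
    """
    with open(path) as fh:
        if not from_start:
            fh.seek(0, 2)  # jump to the end: only report new lines
        while not stop():
            line = fh.readline()
            if line:
                yield line
            else:
                time.sleep(poll_interval)  # nothing new yet; wait and retry
```

Each yielded line can be passed through the existing parsing code, and the sleep keeps the script idle between writes.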
any ideas how one can achieve this?
This should be pretty easy to do. Most of what you want is already part of your OS.
python test.py | python log_parser.py
Be sure your tests write their log to stdout instead of some other file. This is often easy to do with small changes to the logging configuration.
Having implemented almost this exact tool, I had great success using the inotify capability in Twisted.

Check if the directory content has changed with shell script or python

I have a program that creates files in a specific directory.
When those files are ready, I run LaTeX to produce a .pdf file.
So, my question is: how can I use this directory change as a trigger to call LaTeX, using a shell script or a Python script?
Best Regards
inotify replaces dnotify.
Why?
...dnotify requires opening one file descriptor for each directory that you intend to watch for changes...
Additionally, the file descriptor pins the directory, disallowing the backing device to be unmounted, which causes problems in scenarios involving removable media. When using inotify, if you are watching a file on a file system that is unmounted, the watch is automatically removed and you receive an unmount event.
...and more.
More Why?
Unlike its ancestor dnotify, inotify doesn't complicate your work by various limitations. For example, if you watch files on a removable media these file aren't locked. In comparison with it, dnotify requires the files themselves to be open and thus really "locks" them (hampers unmounting the media).
Reference
Is dnotify what you need?
Make on unix systems is usually used to track by date what needs rebuilding when files have changed. I normally use a rather good makefile for this job. There seems to be another alternative around on google code too
You not only need to check for changes, but need to know that all changes are complete before running LaTeX. For example, if you start LaTeX after the first file has been modified and while more changes are still pending, you'll be using partial data and have to re-run later.
Wait for your first program to complete:
#!/bin/bash
first-program &&
run-after-changes-complete
Using && means the second command is only executed if the first completes successfully (a zero exit code). Because this simple script will always run the second command even if the first doesn't change any files, you can incorporate this into whatever build system you are already familiar with, such as make.
Python FAM is a Python interface for FAM (File Alteration Monitor)
You can also have a look at Pyinotify, which is a module for monitoring file system changes.
Not much of a python man myself. But in a pinch, assuming you're on linux, you could periodically shell out and "ls -lrt /path/to/directory" (get the directory contents and sort by last modified), and compare the results of the last two calls for a difference. If so, then there was a change. Not very detailed, but gets the job done.
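The same compare-two-listings idea can be done without shelling out to ls. The sketch below records each entry's modification time and size with os.stat; the function names snapshot and has_changed are hypothetical, and it only looks at the top level of the directory.

```python
import os


def snapshot(directory):
    """Map each entry name in `directory` to (mtime, size)."""
    state = {}
    for name in os.listdir(directory):
        st = os.stat(os.path.join(directory, name))
        state[name] = (st.st_mtime, st.st_size)
    return state


def has_changed(before, after):
    """True if any entry was added, removed, or modified between snapshots."""
    return before != after
```

Calling snapshot periodically and comparing the last two results gives the same coarse answer as diffing two ls listings.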
You can use Python's standard hashlib module, which implements the MD5 algorithm:
>>> import hashlib
>>> import os
>>> m = hashlib.md5()
>>> for root, dirs, files in os.walk(path):
...     for file_read in files:
...         full_path = os.path.join(root, file_read)
...         with open(full_path, 'rb') as fh:  # binary mode: hash raw bytes
...             for line in fh:
...                 m.update(line)
...
>>> m.digest()
'pQ\x1b\xb9oC\x9bl\xea\xbf\x1d\xda\x16\xfe8\xcf'
You can save this result in a file or a variable, and compare it to the result of the next run. This will detect changes in any files, in any sub-directory.
This does not take into account file permission changes; if you need to monitor these changes as well, this could be addressed by appending a string representing the permissions (accessible via os.stat, for instance; the attributes depend on your system) to the m variable.
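A sketch of that extension, with a few additions that are my own assumptions rather than part of the answer: the function name directory_digest, walking in sorted order so the result does not depend on the order os.walk happens to return, mixing each relative path into the hash so renames are detected, and reading files in fixed-size binary chunks.

```python
import hashlib
import os


def directory_digest(path, include_mode=False):
    """Return an MD5 hex digest over the names and contents of all files
    under `path`; with include_mode=True, permission bits are hashed too."""
    m = hashlib.md5()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            full_path = os.path.join(root, name)
            m.update(os.path.relpath(full_path, path).encode("utf-8"))
            if include_mode:
                # st_mode carries the permission bits mentioned above
                m.update(oct(os.stat(full_path).st_mode).encode("ascii"))
            with open(full_path, "rb") as fh:
                for chunk in iter(lambda: fh.read(65536), b""):
                    m.update(chunk)
    return m.hexdigest()
```

Comparing the digest from one run with the digest from the next indicates whether anything changed, at the cost of re-reading every file each time.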
