Separate working path for each thread? - Python

I'm writing a Python script in which I sometimes change directories with os.chdir(IMG_FOLDER) in order to do my file operations. That works fine as long as I have only one thread (I can go back to where I came from before leaving the function). Now, in the case of multithreading, I would need a separate "os path" instance for each thread, otherwise it might mess up my file operations, right?
How do I best go about this?

Don't use os.chdir. Instead, use os.path.join to form full paths.
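For instance, a minimal sketch of that approach, reusing the IMG_FOLDER name from the question (the base path and the file handling are placeholders):

import os

IMG_FOLDER = "/path/to/images"  # absolute base path, so no chdir is needed

def load_image(name):
    # Build the full path instead of changing the working directory,
    # so concurrent threads never interfere with each other.
    full_path = os.path.join(IMG_FOLDER, name)
    with open(full_path, "rb") as f:
        return f.read()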

The ultimate solution to this problem was that I:
use absolute paths, no more relative ones, as suggested by Perkins, and
when data is being received in my main thread, I write it to e.g. data.tmp, and only once the write is completely finished is it renamed to the name that I'm scanning for in the separate thread (a sketch of this write-then-rename pattern follows).
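A minimal sketch of that write-then-rename pattern; the file names are just examples:

import os

def write_atomically(payload, final_name="data.json"):
    tmp_name = final_name + ".tmp"
    # Write everything to a temporary file first...
    with open(tmp_name, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    # ...then publish it in one step. The scanning thread only looks for
    # final_name, so it never sees a half-written file.
    os.replace(tmp_name, final_name)  # atomic rename; os.replace needs Python 3.3+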


Python module to keep a file location variable updated?

I have a variable in my Python script that holds the path to a file as a string. Is there a way, through a Python module or similar, to keep the file path updated if the file were to be moved to another destination?
Short answer: no, not with a string.
Long answer: If you want to use only a string to record the location of this file, you're probably out of luck unless you use other tools to find the location of the file, or record whenever it moves - I don't know what those tools are.
You don't give a lot of info in your question about what you want this variable for; as @DeepSpace says in his comment, if you're trying to make this string follow the file between different runs of this program, then you'd be better off making the path an argument to the script. If, however, you expect the file to move sometime during the execution of your program, you might be able to use a file object to keep track of it. That is, instead of keeping the file path in memory, keep an open file descriptor in memory (the kind you get from the open() function) and just never close that file until the program terminates. You can use seek to return to the start of the file if you need to read it multiple times. Problems with this include that it ties up a file descriptor for the lifetime of the program, and it's absolutely not a best practice.
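A minimal sketch of that idea, assuming the file already exists when the program starts (tracked.txt is a hypothetical name):

# Keep an open file object instead of a path string. On POSIX systems the
# underlying descriptor stays valid even if the file is renamed or moved
# within the same filesystem.
f = open("tracked.txt", "r")

def read_contents():
    f.seek(0)  # rewind so repeated reads start from the top
    return f.read()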
TL;DR
Your best bet is probably to go with a solution like @DeepSpace mentioned, where you call your script with a command-line parameter, which forces the user to input a valid path.
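A minimal sketch of that command-line approach using argparse (the argument name is just an example):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("path", help="path to the file to process")
args = parser.parse_args()

if not os.path.isfile(args.path):
    parser.error("not a valid file: %s" % args.path)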
This is actually a really good question, but unfortunately, speaking purely in Python terms, it is impossible.
No Python module will be able to dynamically link a variable to a path on the file system. You will need an external script or routine to update whatever data structure holds the path value.
Even then, only the name of the file could change, not its location. Here is what I mean by that.
Let's say you wrapped that file in a folder containing only that specific file. Since you now know that its location is fixed (theoretically speaking), you could have another Python script/routine read the filename and store it in a text file. Your other script could then go and get that file name (assuming your routine syncs that file on a regular basis). But as soon as the location of the file changes, how can you possibly know where it is now? It has to be manually hard-coded somewhere to get anything close to the behavior you're expecting.
Note that my example is not in any way a go-to solution for your problem. I'm actually trying to underline the shortcomings of such a feature.

python and another program writing to the same file

I have noticed that Python always remembers where it finished writing in a file and continues from that point.
Is there a way to reset that, so that if the file is edited by another program that removes certain text and adds other text, Python will not fill the gaps with NULL bytes on its next write?
I have the file open in the parent and the threading children are writing to it. I used flush to ensure the data is physically written to the file after each write, but that is all it does.
Is there another function I'm missing that will make Python append properly?
One thing that is certainly safe, OS independent, and reliable is to close the file and open it again when writing.
If the performance hindrance due to that is unacceptable, you could try to use "seek" to move to the end of file before writing. I just did some naive testing in the interactive console, and indeed, using file.seek(0, os.SEEK_END) before writing worked.
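A minimal sketch of that seek-before-write approach (logfile.txt is a hypothetical file name):

import os

f = open("logfile.txt", "r+")

def append_line(line):
    # Jump to the real end of the file as it is on disk right now, in case
    # another program has changed it since our last write.
    f.seek(0, os.SEEK_END)
    f.write(line + "\n")
    f.flush()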
Note that I don't think having two processes writing to the same file can be safe under most circumstances -- you will end up with race conditions of some sort doing this. One way around it is to implement file locks, so that a process only writes to the file after acquiring the lock. Getting this right may be tough. So, check whether your application wouldn't be in a better place using something built and hardened over the years to allow simultaneous updates by various processes, like an SQL engine (MySQL or PostgreSQL).
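A minimal sketch of the file-lock idea, assuming a POSIX system (fcntl is not available on Windows; the path is an example):

import fcntl
import os

def append_locked(path, line):
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # block until we hold the lock
        try:
            f.seek(0, os.SEEK_END)     # re-check the end after locking
            f.write(line + "\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)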

Most efficient way to traverse file structure Python

Is using os.walk in the way below the least time consuming way to recursively search through a folder and return all the files that end with .tnt?
import os

matches = []
for root, dirs, files in os.walk('C:\\data'):
    print "Now in root %s" % root
    for f in files:
        if f.endswith('.tnt'):
            matches.append(os.path.join(root, f))
Yes, using os.walk is indeed the best way to do that.
As everyone has said, os.walk is almost certainly the best way to do it.
If you actually have a performance problem, and profiling has shown that it's caused by os.walk (and/or iterating the results with .endswith), your best answer is probably to step outside Python. Replace all of the code above with:
import sys

for f in sys.argv[1:]:
    process(f)  # process() stands in for whatever you do with each matching file
Now you need some outside tool that can gather the paths and run your script. (Ideally batching as many paths as possible into each script execution.)
If you can rely on Windows Desktop Search having indexed the drive, it should only need to do a quick database operation to find all files under a certain path with a certain extension. I have no idea how to write a batch file that runs that query and gets the results as a list of arguments to pass to a Python script (or a PowerShell file that runs the query and passes the results to IronPython without serializing it into a list of arguments), but it would be worth researching this before anything else.
If you can't rely on your platform's desktop search index, on any POSIX platform, it would almost certainly be fastest and simplest to use this one-liner shell script:
find /my/path -name '*.tnt' -exec myscript.py {} +
Unfortunately, you're not on a POSIX platform, you're on Windows, which doesn't come with the find tool, which is the thing that's doing all the heavy lifting here.
There are ports of find to native Windows, but you'll have to figure out the command-line intricacies to get everything quoted right, format the path properly, and so on, so you can write the one-liner batch file. Alternatively, you could install Cygwin and use the exact same shell script you'd use on a POSIX system. Or you could find a more Windows-y tool that does what you need.
This could conceivably be slower rather than faster—Windows isn't designed to execute lots of little processes with as little overhead as possible, and I believe it has smaller limits on command lines than platforms like linux or OS X, so you may spend more time waiting for the interpreter to start and exit than you save. You'd have to test to see. In fact, you probably want to test both native and cygwin versions (with both native and cygwin Python, in the latter case).
You don't actually have to move the find invocation into a batch/shell script; it's probably the simplest answer, but there are others, such as using subprocess to call find from within Python. This might solve performance problems caused by starting the interpreter too many times.
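A minimal sketch of calling find once from Python via subprocess, assuming a find binary is on the PATH (on Windows that would be a port or the Cygwin version):

import subprocess

# Ask find for matching paths, one per line, and read them back in one go.
out = subprocess.check_output(["find", "/my/path", "-name", "*.tnt"])
paths = out.decode().splitlines()
for p in paths:
    print(p)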
Getting the right amount of parallelism may also help—spin off each invocation of your script to the background and don't wait for them to finish. (I believe on Windows, the shell isn't involved in this; instead there's a tool named something like "run" that kicks off a process detached from the shell. But I don't remember the details.)
If none of this works out, you may have to write a custom C extension that does the fastest possible Win32 or .NET thing (which also means you have to do the research to find out what that is…) so you can call that from within Python.

Python: Securing untrusted scripts/subprocess with chroot and chjail?

I'm writing a web server based on Python which should be able to execute "plugins" so that functionality can be easily extended.
For this I considered the approach to have a number of folders (one for each plugin) and a number of shell/python scripts in there named after predefined names for different events that can occur.
One example is to have an on_pdf_uploaded.py file which is executed when a PDF is uploaded to the server. To do this I would use Python's subprocess tools.
For convenience and security, this would allow me to use Unix environment variables to provide further information and set the working directory (cwd) of the process so that it can access the right files without having to find their location.
Since the plugin code is coming from an untrusted source, I want to make it as secure as possible. My idea was to execute the code in a subprocess, but put it into a chroot jail with a different user, so that it can't access any other resources on the server.
Unfortunately I couldn't find anything about this, and I wouldn't want to rely on the untrusted script to put itself into a jail.
Furthermore, I can't put the main/calling process into a chroot jail either, since plugin code might be executed in multiple processes at the same time while the server is answering other requests.
So here's the question: How can I execute subprocesses/scripts in a chroot jail with minimum privileges to protect the rest of the server from being damaged by faulty, untrusted code?
Thank you!
Perhaps something like this?
# main.py
subprocess.call(["python", "pluginhandler.py", "plugin", env])
Then,
# pluginhandler.py
os.chroot(chrootpath)
os.setgid(gid) # Important! Set GID first! See comments for details.
os.setuid(uid)
os.execle(programpath, arg1, arg2, ..., env)
# or another subprocess call
subprocess.call["python", "plugin", env])
EDIT: I wanted to use fork() but didn't really understand what it did, so I looked it up. New code!
# main.py
import os,sys
somevar = someimportantdata
pid = os.fork()
if pid:
    # this is the parent process... do whatever needs to be done as the parent
    pass
else:
    # we are the child process... let's do that plugin thing!
    # chroot needs root privileges, so enter the jail before dropping them
    os.chroot(chrootpath)
    os.setgid(gid)  # Important! Set GID first! See comments for details.
    os.setuid(uid)
    import untrustworthyplugin
    untrustworthyplugin.run(somevar)
    sys.exit(0)
This was useful and I pretty much just stole that code, so kudos to that guy for a decent example.
After creating your jail you would call os.chroot from your Python source to go into it. But even then, any shared libraries or module files already opened by the interpreter would still be open, and I have no idea what the consequences of closing those files via os.close would be; I've never tried it.
Even if this works, setting up chroot is a big deal, so be sure the benefit is worth the price. In the worst case you would have to ensure that the entire Python runtime, with all modules you intend to use, as well as all dependent programs, shared libraries, and other files from /bin, /lib etc., is available within each jailed filesystem. And of course, doing this won't protect other types of resources, e.g. network destinations or databases.
An alternative could be to read in the untrusted code as a string and then exec code in mynamespace where mynamespace is a dictionary defining only the symbols you want to expose to the untrusted code. This would be sort of a "jail" within the Python VM. You might have to parse the source first looking for things like import statements, unless replacing the built-in __import__ function would intercept that (I'm unsure).
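A minimal sketch of that exec-in-a-namespace idea; it limits which names the untrusted code can see, but it is not a real sandbox, and the plugin source here is just an example string:

plugin_source = "result = allowed_helper(21)"  # untrusted code, as a string

def allowed_helper(x):
    return x * 2

# Expose only the symbols we explicitly choose, and block the builtins too.
mynamespace = {"__builtins__": {}, "allowed_helper": allowed_helper}
exec(plugin_source, mynamespace)
print(mynamespace["result"])  # -> 42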

Grabbing output FILE from Python Popen process?

I have written a python program to interface with a compiled program (call it ProgramX) that has some idiosyncrasies that are proving difficult to deal with. I need to feed many thousands of input files to ProgramX via my python program. What I would like to do is to grab the output file that ProgramX creates with each run, and rename it something sensible, like inputfilename.output.
The problem comes in the output file that is written by ProgramX -- it is named via an unpredictable method, which will write, and "mercilessly overwrite", the output file if it already exists (which is the case the majority of the time). The saving grace probably comes with the fact that there is a standard prefix to the output files: think ProgramX.notQuiteRandomNumber.
The only thing I can think to do is something like this in my bash shell:
PROGRAMXOUTPUT=$(ls -ltr ProgramX* | tail -n -1 | awk '{print $8}')
mv $PROGRAMXOUTPUT input.output
Which does 90% of what I need, but before I program all that bash into a series of Popen statements, is there a better way to do this? This problem feels like something people might have a much better solution than what I'm thinking.
Sidenote: I can grab the program's standard output without problems, however it's the output file that I need to grab.
Bonus: I was planning on running a bunch of instantiations of the program in the same directory, so my naive approach above may start to have unforeseen problems. So perhaps something fancy that watches the PID of ProgramX and follows its output.
To do what your shell script above does, assuming you've only got one ProgramX* in the current directory:
import glob, os
programxoutput = glob.glob('ProgramX*')[0]
os.rename(programxoutput, 'input.output')
If you need to sort by time, etc., there are ways to do that too (look at os.stat), but using the most recent modification date is a recipe for nasty race conditions if you'll be running multiple copies of ProgramX concurrently.
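For completeness, a sketch of picking the most recently modified match, keeping the race-condition caveat above in mind:

import glob
import os

candidates = glob.glob('ProgramX*')
# Pick the most recently modified match; unsafe if several runs overlap.
newest = max(candidates, key=os.path.getmtime)
os.rename(newest, 'input.output')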
I'd suggest instead that you create and change to a new, perhaps temporary directory for each run of ProgramX, so the runs have no possibility of treading on each other. The tempfile module can help with this.
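A minimal sketch of that per-run directory idea, assuming ProgramX can be launched as a command and writes its output into its current working directory (the names are placeholders):

import glob
import os
import shutil
import subprocess
import tempfile

def run_programx(input_file, output_name):
    workdir = tempfile.mkdtemp()  # private directory for this run only
    try:
        subprocess.call(["ProgramX", input_file], cwd=workdir)
        produced = glob.glob(os.path.join(workdir, "ProgramX*"))[0]
        shutil.move(produced, output_name)  # e.g. inputfilename.output
    finally:
        shutil.rmtree(workdir, ignore_errors=True)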
Two options that I see:
You could use lsof to find the files that ProgramX currently has open for writing.
A different approach would be to run ProgramX in a temporary directory (see tempfile for an easy way of setting up directories). Between runs of ProgramX, you can clean that directory, or keep requesting new temp directories if you are planning on running multiple copies of ProgramX at the same time.
If there is only one ProgramX* file, then what about just:
mv ProgramX* input.output
