Prevent a file from being created in python - python

I'm working on a python server which concurrently handles transactions on a number of databases, each storing performance data about a different application. Concurrency is accomplished via the Multiprocessing module, so each transaction thread starts in a new process, and shared-memory data protection schemes are not viable.
I am using sqlite as my DBMS, and have opted to set up each application's DB in its own file. Unfortunately, this introduces a race condition on DB creation; If two process attempt to create a DB for the same new application at the same time, both will create the file where the DB is to be stored. My research leads me to believe that one cannot lock a file before it is created; Is there some other mechanism I can use to ensure that the file is not created and then written to concurrently?
Thanks in advance,
David

The usual Unix-style way of handling this for regular files is to just try to create the file and see if it fails. In Python's case, that would be:
try:
os.open(filename, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
except IOError: # or OSError?
# Someone else created it already.
At the very least, you can use this method to try to create a "lock file" with a similar name to the database. If the lock file is created, you go ahead and make the database. If not, you do whatever you need to for the "database exists" case.

Name your database files in such a way that they are guaranteed not to collide.
http://docs.python.org/library/tempfile.html

You could capture the error when trying to create the file in your code and in your exception handler, check if the file exists and use the existing file instead of creating it.

You didn't mention the platform, but on linux open(), or os.open() in python, takes a flags parameter which you can use. The O_CREAT flag creates a file if it does not exist, and the O_EXCL flag gives you an error if the file already exists. You'll also be needing O_RDONLY, O_WRONLY or O_RDWR for specifying the access mode. You can find these constants in the os module.
For example: fd = os.open(filename, os.O_RDWR | os.O_CREAT | os.O_EXCL)

You can use the POSIX O_EXCL and O_CREAT flags to open(2) to guarantee that only a single process gets the file and thus the database; O_EXCL won't work over NFSv2 or earlier, and it'd be pretty shaky to rely on it for other network filesystems.
The liblockfile library implements a network-filesystem safe locking mechanism described in the open(2) manpage, which would be convenient; but I only see pre-made Ruby and Perl bindings. Depending upon your needs, maybe providing Python bindings would be useful, or perhaps just re-implementing the algorithm:
O_EXCL Ensure that this call creates the file: if this flag is
specified in conjunction with O_CREAT, and pathname
already exists, then open() will fail. The behavior of
O_EXCL is undefined if O_CREAT is not specified.
When these two flags are specified, symbolic links are not
followed: if pathname is a symbolic link, then open()
fails regardless of where the symbolic link points to.
O_EXCL is only supported on NFS when using NFSv3 or later
on kernel 2.6 or later. In environments where NFS O_EXCL
support is not provided, programs that rely on it for
performing locking tasks will contain a race condition.
Portable programs that want to perform atomic file locking
using a lockfile, and need to avoid reliance on NFS
support for O_EXCL, can create a unique file on the same
file system (e.g., incorporating hostname and PID), and
use link(2) to make a link to the lockfile. If link(2)
returns 0, the lock is successful. Otherwise, use stat(2)
on the unique file to check if its link count has
increased to 2, in which case the lock is also successful.

Related

Do functions in the "os" module wait to be finished?

With:
import os
for file in files:
os.remove(file)
Will it wait for every file to be removed like a synchronous function in each call to os.remove(), or will it iterate through while calling os.remove()?
Generally, the os module provides wrappers around system calls of the operating system. For example, on Linux the os.remove/os.unlink functions correspond to the unlink system call. These functions wait until the system call has finished.
Whether this means that the high level operation intended by the program has finished depends on the use-case.
For example, unlink merely removes the path pointing to the file content; if there are other paths for the same file (i.e. hardlinks) or processes with a file handle on it, the file content remains. Only when all references are gone is the file content eligible for removal from the filesystem (similar to reference counting). The filesystem itself may arbitrarily delay removal of the content, and distributed filesystems may have additional consistency and synchronisation constraints.
As a rule of thumb, if there are no special requirements then it is fine to consider the os call to be prompt and synchronous. If there are specific requirements, such as file content being completely destroyed, read up on the specific behaviour of the involved components.

Pythonic way to handle if the file being written is deleted externally

A python process is writing to a file and the file has been deleted/moved by an external process (cron job in my case).
The python process will continue to execute without any errors (expected as it is being written to the buffer rather than the file and will be flushed after f.close()). Yet there won't be any new file created in this case and buffer will be silently discarded(correct me if I'm wrong).
Is there any pythonic way to handle this instead of checking if file exists and create one if not, before every write operation.
There is no "pythonic" way to do this because the question isn't about a specific language. It's an operating system question. So the answer is going to be different for MS Windows than it is for a UNIX like OS such as Linux or macOS. To do this efficiently requires using a facility such as the Linux inotify API. A simpler approach that will work on any UNIX like OS is to open the file then call os.fstat() and remember the st_ino member of the returned object. Then periodically call os.stat() on the path name and compare its st_ino value to the one you saved earlier. If it changes, or the os.stat() call fails, then you know the file name you are writing to is no longer the same file.

How to make Python ssl module use data in memory rather than pass file paths?

The full explanation of what I want to do and why would take a while to explain. Basically, I want to use a private SSL connection in a publicly distributed application, and not handout my private ssl keys, because that negates the purpose! I.e. I want secure remote database operations which no one can see into - inclusive of the client.
My core question is : How could I make the Python ssl module use data held in memory containing the ssl pem file contents instead of hard file system paths to them?
The constructor for class SSLSocket calls load_verify_locations(ca_certs) and load_cert_chain(certfile, keyfile) which I can't trace into because they are .pyd files. In those black boxes, I presume those files are read into memory. How might I short circuit the process and pass the data directly? (perhaps swapping out the .pyd?...)
Other thoughts I had were: I could use io.StringIO to create a virtual file, and then pass the file descriptor around. I've used that concept with classes that will take a descriptor rather than a path. Unfortunately, these classes aren't designed that way.
Or, maybe use a virtual file system / ram drive? That could be trouble though because I need this to be cross platform. Plus, that would probably negate what I'm trying to do if someone could access those paths from any external program...
I suppose I could keep them as real files, but "hide" them somewhere in the file system.
I can't be the first person to have this issue.
UPDATE
I found the source for the "black boxes"...
https://github.com/python/cpython/blob/master/Modules/_ssl.c
They work as expected. They just read the file contents from the paths, but you have to dig down into the C layer to get to this.
I can write in C, but I've never tried to recompile an underlying Python source. It looks like maybe I should follow the directions here https://devguide.python.org/ to pull the Python repo, and make changes to. I guess I can then submit my update to the Python community to see if they want to make a new standardized feature like I'm describing... Lots of work ahead it seems...
It took some effort, but I did, in fact, solve this in the manner I suggested. I revised the underlying code in the _ssl.c Python module / extension and rebuilt Python as a whole. After figuring out the process for building Python from source, I had to learn the details for how to pass variables between Python and C, and I needed to dig into guts of OpenSSL (over which the Python module is a wrapper).
Fortunately, OpenSSL already has functions for this exact purpose, so it was just a matter of swapping out the how Python is trying to pass file paths into the C, and instead bypassing the file reading process and jumping straight to the implementation of using the ca/cert/key data directly instead.
For the moment, I only did this for Windows. Since I'm ultimately creating a cross platform program, I'll have to repeat the build process for the other platforms I'll support - so that's a hassle. Consider how badly you want this, if you are going to pursue it yourself...
Note that when I rebuilt Python, I didn't use that as my actual Python installation. I just kept it off to the side.
One thing that was really nice about this process was that after that rebuild, all I needed to do was drop the single new _ssl.pyd into my working directory. With that file in place, I could pass my direct cert data. If I removed it, I could pass the normal file paths instead. It will use either the normal Python source, or implicitly use the override if the .pyd file is simply put in the program's directory.

Python: Securing untrusted scripts/subprocess with chroot and chjail?

I'm writing a web server based on Python which should be able to execute "plugins" so that functionality can be easily extended.
For this I considered the approach to have a number of folders (one for each plugin) and a number of shell/python scripts in there named after predefined names for different events that can occur.
One example is to have an on_pdf_uploaded.py file which is executed when a PDF is uploaded to the server. To do this I would use Python's subprocess tools.
For convenience and security, this would allow me to use Unix environment variables to provide further information and set the working directory (cwd) of the process so that it can access the right files without having to find their location.
Since the plugin code is coming from an untrusted source, I want to make it as secure as possible. My idea was to execute the code in a subprocess, but put it into a chroot jail with a different user, so that it can't access any other resources on the server.
Unfortunately I couldn't find anything about this, and I wouldn't want to rely on the untrusted script to put itself into a jail.
Furthermore, I can't put the main/calling process into a chroot jail either, since plugin code might be executed in multiple processes at the same time while the server is answering other requests.
So here's the question: How can I execute subprocesses/scripts in a chroot jail with minimum privileges to protect the rest of the server from being damaged by faulty, untrusted code?
Thank you!
Perhaps something like this?
# main.py
subprocess.call(["python", "pluginhandler.py", "plugin", env])
Then,
# pluginhandler.py
os.chroot(chrootpath)
os.setgid(gid) # Important! Set GID first! See comments for details.
os.setuid(uid)
os.execle(programpath, arg1, arg2, ..., env)
# or another subprocess call
subprocess.call["python", "plugin", env])
EDIT: Wanted to use fork() but I didn't really understand what it did. Looked it up. New
code!
# main.py
import os,sys
somevar = someimportantdata
pid = os.fork()
if pid:
# this is the parent process... do whatever needs to be done as the parent
else:
# we are the child process... lets do that plugin thing!
os.setgid(gid) # Important! Set GID first! See comments for details.
os.setuid(uid)
os.chroot(chrootpath)
import untrustworthyplugin
untrustworthyplugin.run(somevar)
sys.exit(0)
This was useful and I pretty much just stole that code, so kudos to that guy for a decent example.
After creating your jail you would call os.chroot from your Python source to go into it. But even then, any shared libraries or module files already opened by the interpreter would still be open, and I have no idea what the consequences of closing those files via os.close would be; I've never tried it.
Even if this works, setting up chroot is a big deal so be sure the benefit is worth the price. In the worst case you would have to ensure that the entire Python runtime with all modules you intend to use, as well as all dependent programs and shared libraries and other files from /bin, /lib etc. are available within each jailed filesystem. And of course, doing this won't protect other types of resources, i.e. network destinations, database.
An alternative could be to read in the untrusted code as a string and then exec code in mynamespace where mynamespace is a dictionary defining only the symbols you want to expose to the untrusted code. This would be sort of a "jail" within the Python VM. You might have to parse the source first looking for things like import statements, unless replacing the built-in __import__ function would intercept that (I'm unsure).

Python - How to check if a file is used by another application?

I want to open a file which is periodically written to by another application. This application cannot be modified. I'd therefore like to only open the file when I know it is not been written to by an other application.
Is there a pythonic way to do this? Otherwise, how do I achieve this in Unix and Windows?
edit: I'll try and clarify. Is there a way to check if the current file has been opened by another application?
I'd like to start with this question. Whether those other application read/write is irrelevant for now.
I realize it is probably OS dependent, so this may not really be python related right now.
Will your python script desire to open the file for writing or for reading? Is the legacy application opening and closing the file between writes, or does it keep it open?
It is extremely important that we understand what the legacy application is doing, and what your python script is attempting to achieve.
This area of functionality is highly OS-dependent, and the fact that you have no control over the legacy application only makes things harder unfortunately. Whether there is a pythonic or non-pythonic way of doing this will probably be the least of your concerns - the hard question will be whether what you are trying to achieve will be possible at all.
UPDATE
OK, so knowing (from your comment) that:
the legacy application is opening and
closing the file every X minutes, but
I do not want to assume that at t =
t_0 + n*X + eps it already closed
the file.
then the problem's parameters are changed. It can actually be done in an OS-independent way given a few assumptions, or as a combination of OS-dependent and OS-independent techniques. :)
OS-independent way: if it is safe to assume that the legacy application keeps the file open for at most some known quantity of time, say T seconds (e.g. opens the file, performs one write, then closes the file), and re-opens it more or less every X seconds, where X is larger than 2*T.
stat the file
subtract file's modification time from now(), yielding D
if T <= D < X then open the file and do what you need with it
This may be safe enough for your application. Safety increases as T/X decreases. On *nix you may have to double check /etc/ntpd.conf for proper time-stepping vs. slew configuration (see tinker). For Windows see MSDN
Windows: in addition (or in-lieu) of the OS-independent method above, you may attempt to use either:
sharing (locking): this assumes that the legacy program also opens the file in shared mode (usually the default in Windows apps); moreover, if your application acquires the lock just as the legacy application is attempting the same (race condition), the legacy application will fail.
this is extremely intrusive and error prone. Unless both the new application and the legacy application need synchronized access for writing to the same file and you are willing to handle the possibility of the legacy application being denied opening of the file, do not use this method.
attempting to find out what files are open in the legacy application, using the same techniques as ProcessExplorer (the equivalent of *nix's lsof)
you are even more vulnerable to race conditions than the OS-independent technique
Linux/etc.: in addition (or in-lieu) of the OS-independent method above, you may attempt to use the same technique as lsof or, on some systems, simply check which file the symbolic link /proc/<pid>/fd/<fdes> points to
you are even more vulnerable to race conditions than the OS-independent technique
it is highly unlikely that the legacy application uses locking, but if it is, locking is not a real option unless the legacy application can handle a locked file gracefully (by blocking, not by failing - and if your own application can guarantee that the file will not remain locked, blocking the legacy application for extender periods of time.)
UPDATE 2
If favouring the "check whether the legacy application has the file open" (intrusive approach prone to race conditions) then you can solve the said race condition by:
checking whether the legacy application has the file open (a la lsof or ProcessExplorer)
suspending the legacy application process
repeating the check in step 1 to confirm that the legacy application did not open the file between steps 1 and 2; delay and restart at step 1 if so, otherwise proceed to step 4
doing your business on the file -- ideally simply renaming it for subsequent, independent processing in order to keep the legacy application suspended for a minimal amount of time
resuming the legacy application process
Unix does not have file locking as a default. The best suggestion I have for a Unix environment would be to look at the sources for the lsof command. It has deep knowledge about which process have which files open. You could use that as the basis of your solution. Here are the Ubuntu sources for lsof.
One thing I've done is have python very temporarily rename the file. If we're able to rename it, then no other process is using it. I only tested this on Windows.

Categories

Resources