Giving permission to access a directory in Python - python

Background:
I'm running the ABSApp for sentiment analysis, which requires Linux or macOS. Our network filesystems require permission to access our datasets, and I'm trying to figure out how to grant that permission from my code so that I can run my preprocessing scripts on our data and store the results on the network, since I'm not allowed to store them locally. I can access the files from my Linux dual-boot installation by connecting to the server directly, but that permission doesn't carry over when I run my code.
Tried:
I've tried to access the dir with os.walk(dir, topdown=True), and when I step through the debugger I see this message:
top = fspath(top)
dirs = []
nondirs = []
walk_dirs = []
# We may not have read permission for top, in which case we can't
# get a list of the files the directory contains. os.walk
# always suppressed the exception then, rather than blow up for a
# minor reason when (say) a thousand readable directories are still
# left to visit. That logic is copied here.
I don't see anything useful when I jump to the definition for fspath(path) either.
I read the documentation for os.access(), but I already know I don't have permission to the files. It does say this at the bottom, but it doesn't tell me a work-around:
Note
I/O operations may fail even when access() indicates that they would succeed,
particularly for operations on network filesystems
which may have permissions semantics beyond the usual POSIX permission-bit model.
TLDR:
So does anyone have any solutions for accessing and writing to a dir on a local network server that requires permissions? I can do python, java and c++, so I'm open to any solutions that exist! Thanks in advance!!

Solved: I needed to add noperm at the end of the mount options to mount with write permissions:
sudo mount.cifs <domain> /home/<mount location> -o username=<username>,vers=<set accordingly>,noperm
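For anyone debugging a similar mount from Python: the comment quoted in the question explains why os.walk stays silent on a permission failure. A minimal sketch of how to make those errors visible, using the onerror argument that os.walk accepts (the function name walk_reporting_errors is made up for this example):

```python
import os


def walk_reporting_errors(top):
    """Walk `top` and return (file_paths, errors) instead of hiding failures."""
    errors = []
    files = []
    # os.walk calls onerror with the OSError it would otherwise suppress,
    # so collecting those exceptions shows which directories were unreadable.
    for root, _dirs, names in os.walk(top, topdown=True, onerror=errors.append):
        files.extend(os.path.join(root, name) for name in names)
    return files, errors
```

Calling it on the mount point and printing each error's filename and strerror attributes should show whether the problem is a permission error on the share or something else, such as a missing path.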

Related

How to check if a folder is open in Windows/Macintosh and possibly close it?

I am working in python and sometimes with os.rename() I run into the problem that if the file or folder to rename is open in the windows system I get a PermissionError: [WinError 5] error.
So if I close the folder and rerun the script, everything works.
In any case, I am working on Windows, but I think it is good practice to take into account that the script could also be run on a Macintosh.
I don't know what the best practice is for this, but please have a little patience, I'm still learning python and really don't know how to ask the question any better than that.
In general, no. This is a violation of system security: another entity (user, session, process, etc.) is using the resource (file or folder), in a way that requires exclusive rights (typically update / modify rights). If you steal the resource, how is that other entity supposed to react or adapt to the change?
This is why an OS has these privileges and locks: to manage system resources. Since you already have user control over the resource, you are supposed to use that authority to release the file lock -- not cracking the security from outside.
However, as the controlling user, you do likely have rights to view your own sessions and jobs, to inquire which one owns the resource, and then terminate the job or otherwise force it to release the resource.
In that fashion, it is possible to steal the resource from the other process. If you want to do that, you need to educate yourself on your OS's capabilities and rights. The most useful ones will be available through Python's os package. Enjoy the learning.
To echo Prune's answer, figuring out where it's in use and closing it sounds difficult, and probably not a great idea anyway. Imagine if it's not available because some other program is currently saving to one of the files; you could end up with corrupted data.
That said, you could make your Python script smart enough to notice there's a permission error, then pause and let you know so that you could try and close things on your own before telling it to try again and continue.
import os

def try_to_rename(src, dst):
    while True:
        try:
            os.rename(src, dst)
            break
        except PermissionError:
            input(f"Unable to rename {src} to {dst}. If one or both "
                  "files/folders are open, please close them. Press Enter to "
                  "continue.")
P.S. It makes little difference for this simple example, but for working extensively with paths and files, I'd recommend pathlib over os. It can really make things a lot more convenient and readable.
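As a rough illustration of the pathlib suggestion, here is a non-interactive variant that retries a few times and then gives up. The function name and the attempts and delay parameters are invented for this sketch, not taken from the answer above:

```python
import time
from pathlib import Path


def rename_with_retries(src, dst, attempts=3, delay=0.5):
    """Rename src to dst, retrying on PermissionError; return the new Path."""
    src, dst = Path(src), Path(dst)
    for attempt in range(1, attempts + 1):
        try:
            src.rename(dst)
            return dst
        except PermissionError:
            # Give whatever holds the file a moment to release it, but
            # re-raise on the last attempt so the caller sees the failure.
            if attempt == attempts:
                raise
            time.sleep(delay)
```

Only PermissionError is retried; other errors, such as a missing source file, propagate immediately.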

Caching remote files with python

Background
We have a lot of data files stored on a network drive which we process in python. For performance reasons I typically copy the files to my local SSD when processing. My wish is to make this happen automatically, so whenever I try to open a file it will grab the remote version if it isn't stored locally, and ideally also keep some sort of timer to delete the files after some time. The files will practically never be changed so I do not require actual syncing capabilities.
Functionality
To sum up what I am looking for is functionality for:
Keeping a local cache of files/directories from a network drive, automatically retrieving the remote version when unavailable locally
Support for directory structure - that is, the files are stored in a directory structure on the remote server which should be duplicated locally for the requested files
Ideally keep some sort of timer to expire cached files
It wouldn't be too difficult for me to write a piece of code which does this myself, but when possible I prefer to rely on existing projects, as this typically gives a more versatile end result and also makes any of my own improvements easily available to other users.
Question
I have searched around for terms like python local file cache, file synchronization and the like, but what I have found mostly handles caching of function return values. I was a bit surprised, because I would imagine this is quite a general problem. My question is therefore: is there something I have overlooked and, more importantly, are there any technical terms describing this functionality which could help my research?
Thank you in advance,
Gregers Poulsen
-- Update --
Due to other proprietary software packages I am forced to use Windows so the solution naturally must support this.
Have a look at fsspec remote caching, with a tutorial on the anaconda blog and the official documentation. Quoting the former:
In this article, we will present [fsspec]'s new ability to cache remote content, keeping a local copy for faster lookup after the initial read.
They give an example for how to use it:
import fsspec
of = fsspec.open("filecache://anaconda-public-datasets/iris/iris.csv", mode='rt',
                 cache_storage='/tmp/cache1',
                 target_protocol='s3', target_options={'anon': True})
with of as f:
    print(f.readline())
On first call, the file will be downloaded, stored in the cache, and provided. On the second call, it will be read from the local cache instead of being downloaded again.
I haven't used it yet, but I need it and it looks promising.
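If the remote files really never change, the behaviour asked for in the question can also be approximated with the standard library alone. This is not fsspec, just a hand-rolled sketch; cached_path and its parameters are names made up here:

```python
import shutil
import time
from pathlib import Path


def cached_path(remote_root, cache_root, relative_path, max_age_seconds=7 * 24 * 3600):
    """Return a local copy of remote_root/relative_path, copying it if needed."""
    remote = Path(remote_root) / relative_path
    local = Path(cache_root) / relative_path
    if local.exists():
        age = time.time() - local.stat().st_mtime
        if age <= max_age_seconds:
            return local  # still fresh, no network access needed
        local.unlink()  # expired: drop it and fetch again below
    # Mirror the remote directory structure inside the cache.
    local.parent.mkdir(parents=True, exist_ok=True)
    # copyfile gives the copy a fresh modification time, which is what the
    # expiry check relies on (copy2 would keep the remote timestamp instead).
    shutil.copyfile(remote, local)
    return local
```

Expiry here is checked lazily, only when a file is requested again, so stale files that are never asked for stay on disk until something else cleans them up.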

Where should I write a user specific log file to (and be XDG base directory compatible)

By default, pip logs errors into "~/.pip/pip.log". Pip has an option to change the log path, and I'd like to put the log file somewhere besides ~/.pip so as not to clutter up my home directory. Where should I put it and be XDG base dir compatible?
Right now I'm considering one of these:
$XDG_DATA_HOME (typically $HOME/.local/share)
$XDG_CACHE_HOME (typically $HOME/.cache)
This is, for the moment, unclear.
Different software seems to handle this in different ways (imsettings puts it in $XDG_CACHE_HOME, profanity in $XDG_DATA_HOME).
Debian, however, has a proposal which I can get behind (emphasis mine):
This is a recurring request/complaint (see this or this) on the xdg-freedesktop mailing list to introduce another directory for state information that does not belong in any of the existing categories (see also home-dir.proposal). Examples for this information are:
history files of shells, repls, anything that uses libreadline
logfiles
state of application windows on exit
recently opened files
last time application was run
emacs: bookmarks, ido last directories, backups, auto-save files, auto-save-list
The above example information is not essential data. However it should still persist on reboots of the system unlike cache data that a user might consider putting in a TMPFS. On the other hand the data is rather volatile and does not make sense to be checked into a VCS. The files are also not the data files that an application works on.
A default folder for a future STATE category might be: $HOME/.local/state
This would effectively introduce another environment variable since $XDG_DATA_HOME usually points to $HOME/.local/share and this hypothetical environment variable ($XDG_STATE_HOME?) would point to $HOME/.local/state
If you really want to adhere to the current standard I would place my log files in $XDG_CACHE_HOME since log files aren't required to run the program.
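A small sketch of how a program could resolve that location, following the fallback rule in the XDG Base Directory Specification (an unset, empty, or relative value means "use the default"). The helper names xdg_dir and log_file_path, and the per-application sub-directory, are assumptions for illustration:

```python
import os
from pathlib import Path


def xdg_dir(variable, default_relative, env=None, home=None):
    """Resolve an XDG base directory, falling back to a default under $HOME."""
    env = os.environ if env is None else env
    home = Path(home) if home is not None else Path.home()
    value = env.get(variable, "")
    # Unset or empty variables fall back to the default, and the spec says
    # relative paths should be treated as invalid and ignored.
    if value and Path(value).is_absolute():
        return Path(value)
    return home / default_relative


def log_file_path(app_name, env=None, home=None):
    """Place the log under $XDG_CACHE_HOME/<app_name>/<app_name>.log."""
    cache_home = xdg_dir("XDG_CACHE_HOME", ".cache", env, home)
    return cache_home / app_name / f"{app_name}.log"
```

Swapping in "XDG_STATE_HOME" with the default ".local/state" gives the state-directory variant described in the proposal, which later revisions of the specification did adopt.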

Python: Securing untrusted scripts/subprocess with chroot and chjail?

I'm writing a web server based on Python which should be able to execute "plugins" so that functionality can be easily extended.
For this I considered the approach to have a number of folders (one for each plugin) and a number of shell/python scripts in there named after predefined names for different events that can occur.
One example is to have an on_pdf_uploaded.py file which is executed when a PDF is uploaded to the server. To do this I would use Python's subprocess tools.
For convenience and security, this would allow me to use Unix environment variables to provide further information and set the working directory (cwd) of the process so that it can access the right files without having to find their location.
Since the plugin code is coming from an untrusted source, I want to make it as secure as possible. My idea was to execute the code in a subprocess, but put it into a chroot jail with a different user, so that it can't access any other resources on the server.
Unfortunately I couldn't find anything about this, and I wouldn't want to rely on the untrusted script to put itself into a jail.
Furthermore, I can't put the main/calling process into a chroot jail either, since plugin code might be executed in multiple processes at the same time while the server is answering other requests.
So here's the question: How can I execute subprocesses/scripts in a chroot jail with minimum privileges to protect the rest of the server from being damaged by faulty, untrusted code?
Thank you!
Perhaps something like this?
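# main.py
import subprocess
# env (a dict of environment variables) goes in the env keyword,
# not in the argument list
subprocess.call(["python", "pluginhandler.py", "plugin"], env=env)
Then,
# pluginhandler.py
import os
import subprocess
os.chroot(chrootpath)  # needs root, so do it before dropping privileges
os.chdir("/")  # move the working directory inside the jail
os.setgid(gid) # Important! Set GID first! See comments for details.
os.setuid(uid)
os.execle(programpath, arg1, arg2, ..., env)
# or another subprocess call
subprocess.call(["python", "plugin"], env=env)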
EDIT: Wanted to use fork() but I didn't really understand what it did. Looked it up. New code!
# main.py
import os, sys
somevar = someimportantdata
pid = os.fork()
if pid:
    # this is the parent process... do whatever needs to be done as the parent
    pass
else:
    # we are the child process... lets do that plugin thing!
    os.chroot(chrootpath)  # must happen while still root, i.e. before setuid
    os.chdir("/")  # make sure the working directory is inside the jail
    os.setgid(gid) # Important! Set GID first! See comments for details.
    os.setuid(uid)
    import untrustworthyplugin
    untrustworthyplugin.run(somevar)
    sys.exit(0)
This was useful and I pretty much just stole that code, so kudos to that guy for a decent example.
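Since the question asks about subprocess specifically, the same steps can probably be attached to a subprocess call through its preexec_fn argument, which runs in the child just before the program is executed. A sketch only: make_jail_preexec and run_plugin are hypothetical names, the parent must be running as root, and the Python documentation warns that preexec_fn is not safe when the parent uses threads:

```python
import os
import subprocess


def make_jail_preexec(chroot_path, uid, gid):
    """Build a function that jails the child and drops its privileges."""
    def enter_jail():
        os.chroot(chroot_path)  # requires root, so it has to come first
        os.chdir("/")           # working directory must be inside the jail
        os.setgroups([])        # drop supplementary groups inherited from root
        os.setgid(gid)          # group before user: after setuid we could not
        os.setuid(uid)          # change the group any more
    return enter_jail


def run_plugin(argv, chroot_path, uid, gid, env):
    """Run argv jailed; argv[0] must be a path that exists inside the jail."""
    return subprocess.call(
        argv,
        preexec_fn=make_jail_preexec(chroot_path, uid, gid),
        env=env,
    )
```

Because the chroot happens before the exec, the interpreter and the plugin script both have to be present inside the jail directory.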
After creating your jail you would call os.chroot from your Python source to go into it. But even then, any shared libraries or module files already opened by the interpreter would still be open, and I have no idea what the consequences of closing those files via os.close would be; I've never tried it.
Even if this works, setting up chroot is a big deal so be sure the benefit is worth the price. In the worst case you would have to ensure that the entire Python runtime with all modules you intend to use, as well as all dependent programs and shared libraries and other files from /bin, /lib etc. are available within each jailed filesystem. And of course, doing this won't protect other types of resources, e.g. network destinations or databases.
An alternative could be to read in the untrusted code as a string and then run exec(code, mynamespace), where mynamespace is a dictionary defining only the symbols you want to expose to the untrusted code. This would be sort of a "jail" within the Python VM. You might have to parse the source first looking for things like import statements, unless replacing the built-in __import__ function would intercept that (I'm unsure).
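To make the namespace idea concrete, here is a minimal sketch; run_restricted is a name invented for this example. One detail worth knowing is that exec inserts the real builtins into the namespace unless a "__builtins__" key is already present, so the sketch supplies an empty one, which also makes import statements fail:

```python
def run_restricted(source, exposed):
    """Execute source with only the names in `exposed` available to it."""
    # An explicit, empty __builtins__ stops exec from adding the real one,
    # so open(), __import__ and friends are not reachable by name.
    namespace = {"__builtins__": {}}
    namespace.update(exposed)
    exec(source, namespace)
    return namespace


# The plugin can call what was exposed, but "import os" raises ImportError.
result = run_restricted("answer = double(21)", {"double": lambda x: x * 2})
print(result["answer"])
```

This should not be treated as a real security boundary: there are well-known ways to get back to powerful objects through attribute access on ordinary values, so it only guards against accidents, not against hostile code.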

Check if the directory content has changed with shell script or python

I have a program that create files in a specific directory.
When those files are ready, I run Latex to produce a .pdf file.
So, my question is, how can I use this directory change as a trigger
to call Latex, using a shell script or a python script?
Best Regards
inotify replaces dnotify.
Why?
...dnotify requires opening one file descriptor for each directory that you intend to watch for changes...
Additionally, the file descriptor pins the directory, disallowing the backing device to be unmounted, which causes problems in scenarios involving removable media. When using inotify, if you are watching a file on a file system that is unmounted, the watch is automatically removed and you receive an unmount event.
...and more.
More Why?
Unlike its ancestor dnotify, inotify doesn't complicate your work with various limitations. For example, if you watch files on removable media, these files aren't locked. By comparison, dnotify requires the files themselves to be open and thus really "locks" them (hampers unmounting the media).
Reference
Is dnotify what you need?
Make on Unix systems is usually used to track, by date, what needs rebuilding when files have changed. I normally use a rather good makefile for this job. There seems to be another alternative around on Google Code too.
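The date comparison that make performs is easy to imitate in Python if you would rather not write a makefile. A minimal sketch (needs_rebuild is a made-up name): the target is stale if it is missing or older than any of its sources.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any source file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Same rule as make: rebuild when any prerequisite is newer than the target.
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```

For the LaTeX case, the target would be the .pdf and the sources the generated files in the directory, with LaTeX run only when the function returns True.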
You not only need to check for changes, but need to know that all changes are complete before running LaTeX. For example, if you start LaTeX after the first file has been modified and while more changes are still pending, you'll be using partial data and have to re-run later.
Wait for your first program to complete:
#!/bin/bash
first-program &&
run-after-changes-complete
Using && means the second command is only executed if the first completes successfully (a zero exit code). This simple script always runs the second command, even if the first doesn't change any files, so you may prefer to incorporate it into whatever build system you are already familiar with, such as make.
Python FAM is a Python interface for FAM (File Alteration Monitor)
You can also have a look at Pyinotify, which is a module for monitoring file system changes.
Not much of a python man myself. But in a pinch, assuming you're on linux, you could periodically shell out and "ls -lrt /path/to/directory" (get the directory contents and sort by last modified), and compare the results of the last two calls for a difference. If so, then there was a change. Not very detailed, but gets the job done.
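The same polling idea can be done without shelling out, by recording each file's modification time and size and comparing two such snapshots. A rough sketch; snapshot and diff_snapshots are names invented here:

```python
import os


def snapshot(directory):
    """Map every file under directory to its (mtime_ns, size)."""
    state = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            state[path] = (info.st_mtime_ns, info.st_size)
    return state


def diff_snapshots(old, new):
    """Return sorted lists of (added, removed, modified) paths."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return added, removed, modified
```

Taking a snapshot every few seconds and running LaTeX once two consecutive snapshots are identical again is one way to wait until the changes have settled.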
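You can use Python's standard hashlib module, which implements the MD5 algorithm:
>>> import hashlib
>>> import os
>>> path = "."  # placeholder: the directory to check
>>> m = hashlib.md5()
>>> for root, dirs, files in os.walk(path):
...     dirs.sort()  # visit sub-directories in a stable order
...     for file_read in sorted(files):
...         full_path = os.path.join(root, file_read)
...         with open(full_path, "rb") as f:  # hashlib needs bytes
...             for line in f:
...                 m.update(line)
...
>>> m.digest()
b'pQ\x1b\xb9oC\x9bl\xea\xbf\x1d\xda\x16\xfe8\xcf'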
You can save this result in a file or a variable, and compare it to the result of the next run. This will detect changes in any files, in any sub-directory.
This does not take file permission changes into account; if you need to monitor these changes as well, you could feed a string representing the permissions (available via os.stat, for instance; the attributes depend on your system), encoded as bytes, into m with m.update().
