One of pyfilesystem's main features is virtual filesystems, e.g.
from fs import open_fs

home_fs = open_fs('~/')
projects_fs = home_fs.opendir('/projects')
I think that is a great feature and was hoping that fsspec has something similar. But I couldn't find an example and I'm not able to get it working.
You might want DirFileSystem, invoked like
import fsspec
from fsspec.implementations.dirfs import DirFileSystem

fs = DirFileSystem(
    "<root path>", fs=fsspec.filesystem("file")
)
You can apply this to any filesystem, not only local ones. The root path needs to be a string that, when you append further path parts to it, makes a complete path the target filesystem can understand; it may include the protocol (for HTTP, it must). In your case, it would be "~" (or the expanded version of this, which would be more explicit).
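For instance (a minimal sketch; "/home/me" stands in for your expanded home directory):

import fsspec
from fsspec.implementations.dirfs import DirFileSystem

# all paths given to `fs` are now interpreted relative to the root
fs = DirFileSystem("/home/me", fs=fsspec.filesystem("file"))
fs.ls("projects")          # lists /home/me/projects
fs.cat("projects/a.txt")   # reads /home/me/projects/a.txt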
Alternatively, you can create an arbitrarily mapped virtual filesystem with
ReferenceFileSystem.
mapping = {"/key1": ["/local/path/file1"],
"/key2": ["/other/unrelated/path/file"]}
fs = fsspec.filesystem("reference", fo=mapping)
Here, fs.cat("/key1") would get the contents of "/local/path/file1". You can have those paths be remote, or a mix of different backends, and even byte ranges of target files.
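For instance, a reference can also be a [target, offset, length] triple, mixing backends (a sketch; the S3 URL is made up, and depending on your fsspec version you may need to pass remote_protocol):

import fsspec

# each value is [target] for a whole file, or [target, offset, length]
# for a byte range; targets may live on different backends
mapping = {
    "/key1": ["/local/path/file1"],
    "/key3": ["s3://some-bucket/remote.bin", 0, 100],  # hypothetical URL
}
fs = fsspec.filesystem("reference", fo=mapping)
fs.cat("/key3")  # the first 100 bytes of the remote file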
I have a session-level fixture in pytest that downloads several binary files that I use throughout my test suite. The current fixture looks something like the following:
@pytest.fixture(scope="session")
def image_cache(pytestconfig, tmp_path_factory):
    # A temporary directory loaded with the test image files downloaded once.
    remote_location = pytestconfig.getoption("remote_test_images")
    tmp_path = tmp_path_factory.mktemp("image_cache", numbered=False)
    # ... download the files and store them into tmp_path
    yield tmp_path
This used to work well; however, the amount of data is now making things slow, so I want to cache it between test runs (similar to this question). Contrary to the related question, I want to use pytest's own cache for this, i.e., I'd like to do something like the following:
@pytest.fixture(scope="session")
def image_cache(request, tmp_path_factory):
    # A temporary directory loaded with the test image files downloaded once.
    remote_location = request.config.option.remote_test_images
    tmp_path = request.config.cache.get("image_cache_dir", None)
    if tmp_path is None:
        # what is the correct location here?
        tmp_path = ...
        request.config.cache.set("image_cache_dir", tmp_path)
    # ... ensure path exists and is empty, clean if necessary
    # ... download the files and store them into tmp_path
    yield tmp_path
Is there a typical/default/expected location that I should use to store the binary data?
If not, what is a good (platform-independent) location to choose? (Tests run on the three major OSes: Linux, macOS, Windows.)
A bit late here, but maybe I can still offer a helpful suggestion. You have at least two options, I think:
use pytest's caching solution on its own. Yes, it expects JSON-serializable data, so you'll need to convert your binary data into strings. You can use base64 to safely encode arbitrary binary data into characters that can be stored as a string and later converted from a string back into the original binary (i.e., back into an image, or whatever).
use pytest's caching solution as a means of remembering a filename or a directory name. Then use Python's generic temporary-file machinery so you can manage temporary files in a platform-independent way. In the end, you'll write just filenames or paths into the pytest cache and do everything else manually (a sketch of this follows the base64 example below).
Solution 1 benefits from fully automatic management of cache content and will be compatible with other pytest extensions (such as distributed testing with xdist). It keeps all files in the same place, making it easier to see and manage disk usage.
Solution 2 is likely to be faster and to scale more gracefully. It avoids transcoding to base64, which wastes CPU and space (base64 takes considerably more space than the original binary). Additionally, the pytest cache may not be well suited to a large number of potentially large values (depending on the number and size of the images we're talking about here).
Converting image / arbitrary bytes into something that can be encoded to JSON:
import base64

# encodebytes returns ASCII bytes; decode them to get a JSON-friendly str
now_im_a_string = base64.encodebytes(your_bytes).decode("ascii")
...
# cache it, store it, whatever
...
# decodestring is deprecated (removed in Python 3.9); use decodebytes
im_your_bytes_again = base64.decodebytes(string_read_from_cache.encode("ascii"))
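And a minimal sketch of option 2, remembering a directory rather than file contents (the download helpers are hypothetical; cache.mkdir needs a reasonably recent pytest, older versions spell it makedir):

import pytest

@pytest.fixture(scope="session")
def image_cache(request):
    # pytest hands out a stable directory under its own cache, which
    # survives between test runs
    remote_location = request.config.option.remote_test_images
    cache_dir = request.config.cache.mkdir("image_cache")
    for name in list_remote_files(remote_location):       # hypothetical helper
        target = cache_dir / name
        if not target.exists():                           # only download once
            download_file(remote_location, name, target)  # hypothetical helper
    return cache_dir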
I have a .py script that pulls in data from Google Sheets and outputs it in a YAML format. This is for my Hugo-powered website, which is served via Netlify. As I understand it, Netlify is capable of running Python too, so I thought I could upload the web content and the Python file in the same directory. This is required for updating the content dynamically, and I expect the Python file to run every time I trigger a build for the website. However, the Python file requires certain credentials to work.
My code currently looks like this:
from pathlib import Path

# Set location to write new files to.
outputpath = Path("D:/content/submission/")

# Read JSON:
json = Path("D:/credentials.json")
These are hardcoded local paths. When I bundle the script in the website directory, what paths should I write so that, when the script runs, these files are read and output correctly?
I want to output to my content/submission folder and read from my creds/credentials.json. Should I just put these paths in? Will it know to look within the directory for these folders, or do I need to add something to the script that tells it to work within the directory it is sitting in?
🧨 First, credentials and secrets are best kept out of files (and especially, especially, out of source control).
For general file locations however, you can use something like:
pa_same_but_json = Path(__file__).with_suffix(".json")
pa_same_directory = Path(__file__).parent / "nosecrets.json"
To answer your comments:
(Mind you, I'm not 100% sure about Windows drives in the following.)
parent
is an attribute on Path objects allowing you to climb up the hierarchy.
Path(r"c:\temp\foo.json").parent returns the same as Path(r"c:\temp") (note the raw strings; in a plain string, "\t" would be a tab character).
Yes, you can do mypath.parent.parent
/ is a path concatenation operator
when applied to Path objects
So
myfile = os.path.join("c:", "temp", "foo.json")
and
myfile_as_a_Path = Path("c:") / "temp" / "foo.json"
are the same, except that one is a string and the other a Path instance. Once the first Path has been built (on c:), the rest of the code knows it is operating on Path instances and re-purposes the division operator (the magic __truediv__ method, normally used for arithmetic) to support path concatenation. This works because most operations on Path instances return another Path, allowing you to do this type of chaining.
It's best not to write way up the hierarchy in a hosted/VM context (you never know the directory structure above, or whether you have permissions), but something based on your script location might be:
pa_current = Path(__file__).parent
# could use "content/submission" here, but that's assuming you're always on
# POSIX systems. Letting pathlib do the work is safer, even if Windows
# probably puts up with "/"
pa_write = pa_current / "content" / "submission"
pa_read = pa_current / "credentials.json"
At this point these are Path instances, but really not much different from strings, except that they have smarter methods to manipulate them. They don't know or care whether the files exist.
P.S.
🧨 A consideration is that, in many web contexts, writing to code directories (like what happens with a content/submission under the Python scripts) is a security goof as well.
Maybe pa_write = pa_current.parent.parent / "uploads" / "content" / "submission" would be better.
Specifically when it comes to user uploads and secrets, please refer to best practices for your platform, not just what Python can do. This answer was about pathlib.Path, not Hugo uploads.
I'm getting the name of a file in Django after an image save:
path -> 'companies/92_dsa/log/Hydrangeas.jpg' (as it is in the database)
I make a clone of the file, resize it (it is an image), and want to save the new file with a different name.
I get the directory of the original file:
folder = os.path.dirname(path)
the filename and extension:
filename, extension = os.path.splitext(os.path.basename(path))
then create a
new_filename = filename + '_sz' + extension
and then I want to recreate the path:
new_path = os.path.join(folder, new_filename)
and here is the problem (a backslash instead of a slash before the filename):
'companies/94_sda/logos\Hydrangeas_sz.jpg'
I'm working in Windows, but the final deploy will probably be on Linux, so I want a fix that is indifferent to the OS.
Unfortunately, you can't really have your cake and eat it.
You say that
I'm working in Windows, but the final deploy will probably be on Linux
This implies you are running the program on Windows, but dealing with *nix file names (be it Linux, Unix, or mac OS).
To do this completely OS-independently, you would need to split the original path on "/" to get all the subcomponents and then re-join them with os.path.join.
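A sketch of that split-and-rejoin approach for the relative paths stored in the DB:

import os

db_path = 'companies/92_dsa/log/Hydrangeas.jpg'  # POSIX-style, as stored
native_path = os.path.join(*db_path.split('/'))
# on Windows: 'companies\\92_dsa\\log\\Hydrangeas.jpg'
# on Linux:   'companies/92_dsa/log/Hydrangeas.jpg'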
But then you need to deal with the fact that directory structures for absolute paths are very different between the two OS's - not to mention the leading drive specifier on Windows. This is less of an issue if you are only dealing with relative paths.
In short, the root of your problem is that the database contains Linux-style paths and you are processing them on Windows. You would have a similar problem if it was the other way around.
You need to choose your deployment platform and code for it.
Alternatively, write your code to simply remove the extension from the full path and replace it with "_sz" + extension.
Since you don't actually care about the path in relation to the host OS (because you've chosen to store paths POSIX-style in your DB), you can just use string joining: new_path = '/'.join([folder, new_filename]). Or you could use the posixpath module directly: import posixpath; new_path = posixpath.join(folder, new_filename).
You could also investigate pathlib, though that may be overkill for you.
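If you do go that route, PurePosixPath manipulates POSIX-style paths without touching the host filesystem; a sketch of the renaming from the question:

from pathlib import PurePosixPath

p = PurePosixPath('companies/92_dsa/log/Hydrangeas.jpg')
new_path = p.with_name(p.stem + '_sz' + p.suffix)
# PurePosixPath('companies/92_dsa/log/Hydrangeas_sz.jpg'), on any host OS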
Python has recently added the pathlib module (which I like a lot!).
There is just one thing I'm struggling with: is it possible to normalize a path to a file or directory that does not exist? I can do that perfectly well with os.path.normpath, but wouldn't it be absurd to have to use something other than the library that should take care of path-related stuff?
The functionality I would like to have is this:
from os.path import normpath
from pathlib import Path
pth = Path('/tmp/some_directory/../i_do_not_exist.txt')
pth = Path(normpath(str(pth)))
# -> /tmp/i_do_not_exist.txt
but without having to resort to os.path and without having to type-cast to str and back to Path. Also, pth.resolve() does not work for non-existing files.
Is there a simple way to do that with just pathlib?
is it possible to normalize a path to a file or directory that does not exist?
Starting from 3.6, it's the default behavior. See https://docs.python.org/3.6/library/pathlib.html#pathlib.Path.resolve
Path.resolve(strict=False)
...
If strict is False, the path is resolved as far as possible and any remainder is appended without checking whether it exists
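So, with the example from the question (assuming /tmp/some_directory is not a symlink):

from pathlib import Path

pth = Path('/tmp/some_directory/../i_do_not_exist.txt')
print(pth.resolve())
# Python 3.6+: /tmp/i_do_not_exist.txt, even though the file does not exist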
As of Python 3.5: No, there's not.
PEP 0428 states:
Path resolution
The resolve() method makes a path absolute, resolving
any symlink on the way (like the POSIX realpath() call). It is the
only operation which will remove ".." path components. On Windows,
this method will also take care to return the canonical path (with the
right casing).
Since resolve() is the only operation to remove the ".." components, and it fails when the file doesn't exist, there won't be a simple means using just pathlib.
Also, the pathlib documentation gives a hint as to why:
Spurious slashes and single dots are collapsed, but double dots ('..')
are not, since this would change the meaning of a path in the face of
symbolic links:
PurePath('foo//bar') produces PurePosixPath('foo/bar')
PurePath('foo/./bar') produces PurePosixPath('foo/bar')
PurePath('foo/../bar') produces PurePosixPath('foo/../bar')
(a naïve approach would make PurePosixPath('foo/../bar') equivalent to PurePosixPath('bar'), which is wrong if foo is a symbolic link to another directory)
All that said, you could create a 0 byte file at the location of your path, and then it'd be possible to resolve the path (thus eliminating the ..). I'm not sure that's any simpler than your normpath approach, though.
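A sketch of that placeholder trick (note it creates the intermediate directory as a side effect):

from pathlib import Path

pth = Path('/tmp/some_directory/../i_do_not_exist.txt')
pth.parent.mkdir(parents=True, exist_ok=True)  # intermediate dirs must exist
pth.touch()                # 0-byte placeholder so resolve() can succeed
resolved = pth.resolve()   # Path('/tmp/i_do_not_exist.txt'), even on Python 3.5
pth.unlink()               # remove the placeholder again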
If this fits your use case (e.g., the file's directory already exists), you might try to resolve the path's parent and then re-append the file name, e.g.:
from pathlib import Path
p = Path()/'hello.there'
print(p.parent.resolve()/p.name)
Old question, but here is another solution, in particular if you want POSIX paths across the board (i.e., *nix-style paths on Windows too).
I found pathlib's resolve() to be broken as of Python 3.10, and this normalization is not currently exposed by PurePosixPath.
What I found worked was posixpath.normpath(). I also found PurePosixPath.joinpath() to be broken, i.e., it will not join ".." with "myfile.txt" as expected; it returns just "myfile.txt". But posixpath.join() works perfectly and returns "../myfile.txt".
Note this works on path strings, but you can easily go back to pathlib.Path(my_posix_path) et al. for an OOP container.
You can just as easily handle Windows platform paths by constructing them this way, as the module takes care of the platform specifics for you.
Might be the solution for others with Python file-path woes.
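A sketch of that combination, normalizing as strings and wrapping back into pathlib objects:

import posixpath
from pathlib import PurePosixPath

s = posixpath.normpath('/tmp/some_directory/../i_do_not_exist.txt')
# '/tmp/i_do_not_exist.txt' -- pure string manipulation, no filesystem access
p = PurePosixPath(s)                         # back to an OOP container
joined = posixpath.join('..', 'myfile.txt')  # '../myfile.txt'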
Imagine you have a library for working with some sort of XML file or configuration file. The library reads the whole file into memory and provides methods for editing the content. When you are done manipulating the content you can call a write to save the content back to file. The question is how to do this in a safe way.
Overwriting the existing file (starting to write to the original file) is obviously not safe. If the write method fails before it is done, you end up with a half-written file and you have lost data.
A better option would be to write to a temporary file somewhere, and when the write method has finished, you copy the temporary file to the original file.
Now, if the copy somehow fails, you still have correctly saved data in the temporary file. And if the copy succeeds, you can remove the temporary file.
On POSIX systems, I guess you can use the rename system call, which is an atomic operation. But how would you best do this on a Windows system? In particular, how do you handle this best using Python?
Also, is there another scheme for safely writing to files?
If you look at Python's documentation, it mentions that os.rename() is an atomic operation (on POSIX, provided source and destination are on the same filesystem). So in your case, writing the data to a temporary file and then renaming it over the original file would be quite safe.
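A minimal sketch of that pattern (os.replace, Python 3.3+, also overwrites an existing destination on Windows, where plain os.rename would fail):

import os
import tempfile

def atomic_write(path, data):
    # create the temp file in the same directory, so it is on the same
    # filesystem and the final rename can be atomic
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the data to disk before the swap
        os.replace(tmp, path)      # atomic on POSIX; overwrites on Windows
    except BaseException:
        os.unlink(tmp)
        raise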
Another way could work like this:
let original file be abc.xml
create abc.xml.tmp and write new data to it
rename abc.xml to abc.xml.bak
rename abc.xml.tmp to abc.xml
after new abc.xml is properly put in place, remove abc.xml.bak
As you can see, you keep abc.xml.bak around, which you can use to restore the original if anything goes wrong with the tmp file or with putting it in place.
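That scheme, spelled out as a sketch (the file names are the illustrative ones from the list above):

import os

def swap_in(original='abc.xml', new_tmp='abc.xml.tmp'):
    backup = original + '.bak'
    os.rename(original, backup)   # abc.xml -> abc.xml.bak
    os.rename(new_tmp, original)  # abc.xml.tmp -> abc.xml
    os.remove(backup)             # drop the backup once the swap succeeded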
If you want to be POSIXly correct and safe, you have to:
Write to temporary file
Flush and fsync the file (or fdatasync)
Rename over the original file
Note that calling fsync has unpredictable effects on performance -- Linux on ext3 may stall for disk I/O whole numbers of seconds as a result, depending on other outstanding I/O.
Notice that rename is not an atomic operation in POSIX -- at least not in relation to file data, as you might expect. However, most operating systems and filesystems will work this way. But it seems you missed the very large Linux discussion about ext4 and filesystem guarantees about atomicity. I don't know exactly where to link, but here is a start: ext4 and data loss.
Notice, however, that on many systems rename will be as safe in practice as you expect. However, it is in a way not possible to get both -- performance and reliability across all possible Linux configurations!
With a write to a temporary file followed by a rename of the temporary file, one would expect the operations to be dependent and executed in order.
The issue, however, is that most, if not all, filesystems separate metadata and data. A rename is only metadata. It may sound horrible to you, but filesystems value metadata over data (take journaling in HFS+ or ext3/4, for example)! The reason is that metadata is lighter, and if the metadata is corrupt, the whole filesystem is corrupt -- the filesystem must, of course, preserve itself first, then the user's data, in that order.
Ext4 did break the rename expectation when it first came out; however, heuristics were added to resolve it. The issue is not a failed rename, but a successful rename: ext4 might successfully register the rename but fail to write out the file data if a crash comes shortly thereafter. The result is then a 0-length file and neither the original nor the new data.
So in short, POSIX makes no such guarantee. Read the linked Ext4 article for more information!
In the Win API I found the quite nice function ReplaceFile, which does what the name suggests, even with an optional backup. There is always the way with the DeleteFile, MoveFile combo.
In general what you want to do is really good. And I cannot think of any better write scheme.
A simplistic solution: use tempfile to create a temporary file, and if writing succeeds, just rename it to your original configuration file.
Note that rename is not atomic across filesystems. You'll have to resort to a slight workaround (e.g. tempfile on target filesystem, followed by a rename) in order to be really safe.
For locking a file, see portalocker.
The standard solution is this.
Write a new file with a similar name. X.ext# for example.
When that file has been closed (and perhaps even read and checksummed), then you do two renames:
X.ext (the original) to X.ext~
X.ext# (the new one) to X.ext
(Only for the crazy paranoids) call the OS sync function to force dirty buffer writes.
At no time is anything lost or corruptible. The only glitch can happen during the renames. But you haven't lost anything or corrupted anything. The original is recoverable right up until the final rename.
Per RedGlyph's suggestion, I added an implementation of ReplaceFile that uses ctypes to access the Windows APIs. I first added this to jaraco.windows.api.filesystem.
from ctypes import windll
from ctypes.wintypes import BOOL, DWORD, LPVOID, LPWSTR

ReplaceFile = windll.kernel32.ReplaceFileW
ReplaceFile.restype = BOOL
ReplaceFile.argtypes = [
    LPWSTR,
    LPWSTR,
    LPWSTR,
    DWORD,
    LPVOID,
    LPVOID,
]

REPLACEFILE_WRITE_THROUGH = 0x1
REPLACEFILE_IGNORE_MERGE_ERRORS = 0x2
REPLACEFILE_IGNORE_ACL_ERRORS = 0x4
I then tested the behavior using this script.
from jaraco.windows.api.filesystem import ReplaceFile
import os
open('orig-file', 'w').write('some content')
open('replacing-file', 'w').write('new content')
ReplaceFile('orig-file', 'replacing-file', 'orig-backup', 0, 0, 0)
assert open('orig-file').read() == 'new content'
assert open('orig-backup').read() == 'some content'
assert not os.path.exists('replacing-file')
While this only works in Windows, it appears to have a lot of nice features that other replace routines would lack. See the API docs for details.
There's now a codified, pure-Python, and I dare say Pythonic solution to this in the boltons utility library: boltons.fileutils.atomic_save.
Just pip install boltons, then:
from boltons.fileutils import atomic_save
with atomic_save('/path/to/file.txt') as f:
    f.write('this will only overwrite if it succeeds!\n')
There are a lot of practical options, all well-documented. Full disclosure, I am the author of boltons, but this particular part was built with a lot of community help. Don't hesitate to drop a note if something is unclear!
You could use the fileinput module to handle the backing-up and in-place writing for you:
import fileinput

for line in fileinput.input(filename, inplace=True, backup='.bak'):
    # inplace=True causes the original file to be moved to a backup;
    # standard output is redirected to the original file.
    # backup='.bak' specifies the extension for the backup file.

    # manipulate line (note: `line` keeps its trailing newline and
    # print() adds another, so strip it first)
    newline = process(line.rstrip('\n'))
    print(newline)
If you need to read in the entire contents before you can write the new lines, you can do that first and then print the entire new contents in one shot:
newcontents = process(contents)
for line in fileinput.input(filename, inplace=True, backup='.bak'):
    print(newcontents)
    break
If the script ends abruptly, you will still have the backup.