Where to store cross-testrun state in pytest (binary files)? - python

I have a session-level fixture in pytest that downloads several binary files that I use throughout my test suite. The current fixture looks something like the following:
@pytest.fixture(scope="session")
def image_cache(pytestconfig, tmp_path_factory):
    # A temporary directory loaded with the test image files, downloaded once per session.
    remote_location = pytestconfig.getoption("remote_test_images")
    tmp_path = tmp_path_factory.mktemp("image_cache", numbered=False)
    # ... download the files and store them into tmp_path
    yield tmp_path
This used to work well; however, the amount of data is now making things slow, so I wish to cache it between test runs (similar to this question). Contrary to the related question, I want to use pytest's own cache for this, i.e., I'd like to do something like the following:
@pytest.fixture(scope="session")
def image_cache(request, tmp_path_factory):
    # A temporary directory loaded with the test image files, downloaded once and cached.
    remote_location = request.config.option.remote_test_images
    tmp_path = request.config.cache.get("image_cache_dir", None)
    if tmp_path is None:
        # what is the correct location here?
        tmp_path = ...
        request.config.cache.set("image_cache_dir", tmp_path)
    # ... ensure the path exists and is empty, clean if necessary
    # ... download the files and store them into tmp_path
    yield tmp_path
Is there a typical/default/expected location that I should use to store the binary data?
If not, what is a good (platform-independent) location to choose? (tests run on the three major OS: linux, mac, windows)

A bit late here, but maybe I can still offer a helpful suggestion. You have at least two options, I think:
use pytest's caching solution on its own. Yes, it expects JSON-serializable data, so you'll need to convert your "binary" into strings. You can use base64 to safely encode arbitrary binary into characters that can be stored as a string and then later converted from a string back into the original binary (i.e., back into an image, or whatever).
use pytest's caching solution as a means of remembering a filename or a directory name. Then, use python's generic temporary file system so you can manage temporary files in a platform-independent way. In the end, you'll write just filenames or paths into pytest cache and do everything else manually.
Solution 1 benefits from fully-automatic management of cache content and will be compatible with other pytest extensions (such as distributed testing with xdist). It means that all files are kept in the same place and it is easier to see and manage disk usage.
Solution 2 is likely to be faster and will scale safely. It avoids the need to transcode to base64, which will be a waste of CPU and of space (since base64 will take a lot more space than the original binary). Additionally, the pytest cache may not be well suited for a large number of potentially large values (depending on the number of files and sizes of images we're talking about here).
Converting image / arbitrary bytes into something that can be encoded to JSON:
import base64

# encodebytes returns bytes; decode to ASCII so the result can be stored as a JSON string
now_im_a_string = base64.encodebytes(your_bytes).decode("ascii")
...
# cache it, store it, whatever
...
im_your_bytes_again = base64.decodebytes(string_read_from_cache.encode("ascii"))
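For solution 2, a minimal sketch could look like the following (download_images is a hypothetical helper; the cache stores only the directory path as a string):

import os
import tempfile
import pytest

@pytest.fixture(scope="session")
def image_cache(request):
    remote_location = request.config.option.remote_test_images
    # pytest's cache only remembers the path, as a JSON-friendly string
    cache_dir = request.config.cache.get("image_cache_dir", None)
    if cache_dir is None or not os.path.isdir(cache_dir):
        # a persistent directory outside pytest's tmp_path machinery
        cache_dir = tempfile.mkdtemp(prefix="image_cache_")
        request.config.cache.set("image_cache_dir", cache_dir)
        download_images(remote_location, cache_dir)  # hypothetical download helper
    yield cache_dir

On later runs the cached path is found and reused; the download only happens again if the directory has disappeared (e.g., after the OS cleaned its temp area).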

Related

Does fsspec support virtual filesystems such as pyfilesystem?

One of pyfilesystem's main features is virtual filesystems, e.g.
from fs import open_fs

home_fs = open_fs('~/')
projects_fs = home_fs.opendir('/projects')
I think that is a great feature and was hoping that fsspec has something similar. But I couldn't find an example and I'm not able to get it working.
You might want DirFileSystem, invoked like
import fsspec.implementations.dirfs

fs = fsspec.implementations.dirfs.DirFileSystem(
    "<root path>", fs=fsspec.filesystem("file")
)
You can apply this to any filesystem, not only local ones. The root path needs to be a string that, when you affix further path parts to it, makes a complete path the target filesystem can understand; it may include the protocol (for HTTP, it must). In your case, it would be "~" (or the expanded version of this, which would be more explicit).
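A minimal sketch for the home-directory case from the question (expanding "~" explicitly) might be:

import os
import fsspec.implementations.dirfs

# everything on home_fs is now resolved relative to the home directory
home_fs = fsspec.implementations.dirfs.DirFileSystem(
    os.path.expanduser("~"), fs=fsspec.filesystem("file")
)
print(home_fs.ls("projects"))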
Alternatively, you can create an arbitrarily mapped virtual filesystem with ReferenceFileSystem.
mapping = {"/key1": ["/local/path/file1"],
           "/key2": ["/other/unrelated/path/file"]}
fs = fsspec.filesystem("reference", fo=mapping)
Here, fs.cat("/key1") would get the contents of "/local/path/file1". You can have those paths be remote, or a mix of different backends, and even byte ranges of target files.

How do I tell a .py script to look for folders and file in the directory it is in?

I have a .py script that pulls in data from Google Sheets and outputs it in a yaml format. This is for my Hugo-powered website, which is served via Netlify. As I understand it, Netlify is capable of running Python too, so I thought I could upload the web content and the Python file in the same directory. This is required for updating the content dynamically, and I expect the Python file to run every time I trigger a build for the website. However, the Python file requires certain credentials to work.
My code currently looks like this:
from pathlib import Path

# Set location to write new files to.
outputpath = Path("D:/content/submission/")

# Read JSON:
json = Path("D:/credentials.json")
These are hardcoded local paths. When I bundle the script in the website directory, what paths should I write in so that when the script runs, these files are read in and outputted correctly?
I would want to output in my content/submission folder and read in from my creds/credentials.json. Should I just put these paths in? Will it know that it has to look within the directory for these folders, or is there something I need to add to the script that tells it to work within the directory it is sitting in?
🧨 First, credentials and secrets are best kept out of files (and especially out of source control).
For general file locations however, you can use something like:
pa_same_but_json = Path(__file__).with_suffix(".json")
pa_same_directory = Path(__file__).parent / "nosecrets.json"
To answer your comments:
(Mind you, I'm not 100% sure about Windows drive letters in the following.)
parent is an attribute on Path objects allowing you to "climb up" the hierarchy:
Path(r"c:\temp\foo.json").parent returns the same as Path(r"c:\temp")
Yes, you can do mypath.parent.parent
/ is a path-concatenation operator when applied to Path objects. So
myfile = os.path.join("c:", "temp", "foo.json")
and
myfile_as_a_Path = Path("c:") / "temp" / "foo.json"
are the same, except for one being a string and the other a Path instance. Once the first Path has been built (on C:), the rest of the code "knows that it is operating on Path instances" and re-purposes the division operator (the __truediv__ magic method normally used for arithmetic) to support path concatenation. This works because most operations on Path instances return another Path, allowing you to do this type of chaining.
It's best not to write far up the hierarchy in a hosted/VM context (you never know the directory structure above you, or whether you have permissions), but something based on your script location might be
pa_current = Path(__file__).parent

# could be `content/submission`, but that's assuming you're always on Posix
# systems. Letting pathlib do the work is safer, even if Windows probably
# puts up with `/`.
pa_write = pa_current / "content" / "submission"
pa_read = pa_current / "credentials.json"
These, at this point, are Path instances, but really not much different from strings, except that they have smarter methods to manipulate them. They don't know or care whether the files exist or not.
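As a minimal usage sketch built on those paths (standard pathlib and json calls; the credentials layout and output file name are just assumptions):

import json
from pathlib import Path

pa_current = Path(__file__).parent
pa_write = pa_current / "content" / "submission"
pa_read = pa_current / "credentials.json"

pa_write.mkdir(parents=True, exist_ok=True)              # make sure the output folder exists
creds = json.loads(pa_read.read_text())                  # read the credentials JSON
(pa_write / "example.yaml").write_text("title: hi\n")    # write an output file next to the script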
P.S.
🧨 A consideration is that, in many web contexts, writing to code directories (like what happens with a content/submission folder under the Python scripts) is a security goof as well.
Maybe pa_write = pa_current.parent.parent / "uploads" / "content" / "submission" would be better.
Specifically when it comes to user uploads and secrets, please refer to best practices for your platform, not just what Python can do. This answer was about pathlib.Path, not Hugo uploads.

Efficient design to store lookup table for files in directories

Let's say I have three directories dir1, dir2 & dir3, with thousands of files in each. Each file has a unique name with no pattern.
Now, given a filename, I need to find which of the three directories it's in. My first thought was to create a dictionary with the filename as key and the directory as the value, like this:
{'file1':'dir1',
'file2':'dir3',
'file3':'dir1', ... }
But seeing as there are only three unique values, this seems a bit redundant and takes up space.
Is there a better way to implement this? What if I can compromise on space but need faster lookup?
A simple way to solve this is to query the file-system directly instead of caching all the filenames in a dict. This will save a lot of space, and will probably be fast enough if there are only a few hundred directories to search.
Here is a simple function that does that:
import os

def find_directory(filename, directories):
    # Return the first directory that contains the given filename, or None.
    for directory in directories:
        path = os.path.join(directory, filename)
        if os.path.exists(path):
            return directory
    return None
On my Linux system, when searching around 170 directories, it takes about 0.3 seconds to do the first search, and then only about 0.002 seconds thereafter. This is because the OS does file-caching to speed up repeated searches. But note that if you used a dict to do this caching in Python, you'd still have to pay a similar initial cost.
Of course, the subsequent dict lookups would be faster than querying the file-system directly. But do you really need that extra speed? To me, two thousandths of a second seems easily "fast enough" for most purposes. And you get the extra benefit of never needing to refresh the file-cache (because the OS does it for you).
PS:
I should probably point out that the above timings are worst-case: that is, I dropped all the system file-caches first, and then searched for a filename that was in the last directory.
You can store the index as a dict of sets. It might be more memory-efficient.
index = {
    "dir1": {"f1", "f2", "f3", "f4"},
    "dir2": {"f3", "f4"},
    "dir3": {"f5", "f6", "f7"},
}

filename = "f4"
for dir, files in index.items():
    if filename in files:
        print(dir)
With thousands of files, you'll barely see any difference between this method and your inverted index.
Also, repeated strings in Python can be interned in order to save memory; CPython sometimes interns short strings itself.
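As a small illustration of interning applied to the original filename-to-directory idea (a sketch; the directory names and the use of os.listdir are assumptions, not from the question):

import os
import sys

directories = ["dir1", "dir2", "dir3"]

# sys.intern guarantees that equal directory-name strings share a single object,
# even if they were built independently (e.g. parsed from different inputs),
# so every dict value is one of only three shared strings.
lookup = {}
for dirname in directories:
    interned = sys.intern(dirname)
    for filename in os.listdir(dirname):
        lookup[filename] = interned

# lookup["some_file.txt"] then gives back the directory holding that file.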

os.walk() caching/speeding up

I have a prototype server[0] that's doing an os.walk()[1] for each query a client[0] makes.
I'm currently looking into ways of:
caching this data in memory,
speeding up queries, and
hopefully allowing for expansion into storing metadata and data persistence later on.
I find SQL complicated for tree structures, so I thought I would get some advice before actually committing to SQLite.
Are there any cross-platform, embeddable or bundle-able non-SQL databases that might be able to handle this kind of data?
I have a small (10k-100k files) list.
I have an extremely small amount of connections (maybe 10-20).
I want to be able to scale to handling metadata as well.
[0] the server and client are actually the same piece of software; this is a P2P application, designed to share files over a local trusted network without a main server, using zeroconf for discovery and twisted for pretty much everything else
[1] query time is currently 1.2s with os.walk() on 10,000 files
Here is the related function in my Python code that does the walking:
def populate(self, string):
    for name, sharedir in self.sharedirs.items():
        for root, dirs, files in os.walk(sharedir):
            for dir in dirs:
                if fnmatch.fnmatch(dir, string):
                    yield os.path.join(name, *os.path.join(root, dir)[len(sharedir):].split("/"))
            for file in files:
                if fnmatch.fnmatch(file, string):
                    yield os.path.join(name, *os.path.join(root, file)[len(sharedir):].split("/"))
You don't need to persist a tree structure -- in fact, your code is busily dismantling the natural tree structure of the directory tree into a linear sequence, so why would you want to restart from a tree next time?
Looks like what you need is just an ordered sequence:
i | X | result
where X, a string, names either a file or a directory (you treat them just the same), i is a progressively incrementing integer (to preserve the order), and the result column, also a string, is the result of os.path.join(name, *os.path.join(root, &c.
This is perfectly easy to put in a SQL table, of course!
To create the table the first time, just remove the if fnmatch.fnmatch guards (and the string argument) from your populate function, yield the dir or file before the os.path.join result, and use cursor.executemany to save the enumerate of the call (or use a self-incrementing column, your pick). To use the table, populate becomes essentially a:
select result from thetable where X LIKE '%foo%' order by i
where string is foo.
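A minimal sqlite3 sketch of that idea (the table and column names are placeholders, and populate_all is a hypothetical unfiltered walker yielding (x, result) pairs as described above):

import sqlite3

conn = sqlite3.connect("shares.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS thetable (
        i INTEGER PRIMARY KEY AUTOINCREMENT,  -- preserves insertion order
        x TEXT,                               -- file or directory name
        result TEXT                           -- the joined path to yield
    )
""")
# populate_all() is the hypothetical unfiltered walker yielding (x, result) pairs
conn.executemany("INSERT INTO thetable (x, result) VALUES (?, ?)", populate_all())
conn.commit()

def populate(string):
    # the original fnmatch-style query, now answered by the table
    pattern = "%" + string + "%"
    for (result,) in conn.execute(
        "SELECT result FROM thetable WHERE x LIKE ? ORDER BY i", (pattern,)
    ):
        yield result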
I misunderstood the question at first, but I think I have a solution now (and sufficiently different from my other answer to warrant a new one). Basically, you do the normal query the first time you run walk on a directory, but you store the yielded values. The second time around, you just yield those stored values. I've wrapped the os.walk() call because it's short, but you could just as easily wrap your generator as a whole.
import os

cache = {}

def os_walk_cache(dir):
    if dir in cache:
        # replay the results recorded on the first walk
        for x in cache[dir]:
            yield x
    else:
        cache[dir] = []
        for x in os.walk(dir):
            cache[dir].append(x)
            yield x
I'm not sure of your memory requirements, but you may want to consider periodically cleaning out cache.
Have you looked at MongoDB? What about mod_python? mod_python should allow you to do your os.walk() and just store the data in Python data structures, since the script is persistent between connections.

How to safely write to a file?

Imagine you have a library for working with some sort of XML file or configuration file. The library reads the whole file into memory and provides methods for editing the content. When you are done manipulating the content you can call a write to save the content back to file. The question is how to do this in a safe way.
Overwriting the existing file (starting to write to the original file) is obviously not safe. If the write method fails before it is done you end up with a half written file and you have lost data.
A better option would be to write to a temporary file somewhere, and when the write method has finished, you copy the temporary file to the original file.
Now, if the copy somehow fails, you still have correctly saved data in the temporary file. And if the copy succeeds, you can remove the temporary file.
On POSIX systems I guess you can use the rename system call which is an atomic operation. But how would you do this best on a Windows system? In particular, how do you handle this best using Python?
Also, is there another scheme for safely writing to files?
If you look at Python's documentation, it mentions that os.rename() is an atomic operation (a POSIX requirement when it succeeds). So in your case, writing the data to a temporary file and then renaming it onto the original file would be quite safe.
Another way could work like this:
let original file be abc.xml
create abc.xml.tmp and write new data to it
rename abc.xml to abc.xml.bak
rename abc.xml.tmp to abc.xml
after new abc.xml is properly put in place, remove abc.xml.bak
As you can see, you still have abc.xml.bak, which you can use to restore things if there are any issues with the tmp file or with copying it back into place.
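A minimal sketch of that sequence with the standard library (the names follow the example above; os.replace is used so each step overwrites atomically where the OS allows it):

import os

def save_with_backup(path, new_data):
    tmp = path + ".tmp"
    bak = path + ".bak"
    with open(tmp, "w") as f:
        f.write(new_data)      # write the new content to abc.xml.tmp
    os.replace(path, bak)      # abc.xml -> abc.xml.bak
    os.replace(tmp, path)      # abc.xml.tmp -> abc.xml
    os.remove(bak)             # only after the new file is safely in place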
If you want to be POSIXly correct and safe, you have to:
Write to temporary file
Flush and fsync the file (or fdatasync)
Rename over the original file
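A sketch of those three steps in Python (error handling omitted; the .tmp suffix is just a convention):

import os

def atomic_write(path, data):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
        f.flush()              # flush Python's buffers
        os.fsync(f.fileno())   # ask the OS to push the data to disk
    os.rename(tmp, path)       # rename over the original file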
Note that calling fsync has unpredictable effects on performance -- Linux on ext3 may stall on disk I/O for whole seconds at a time as a result, depending on other outstanding I/O.
Notice that rename is not an atomic operation in POSIX -- at least not in relation to file data as you expect. However, most operating systems and filesystems will work this way. But it seems you missed the very large Linux discussion about Ext4 and filesystem guarantees about atomicity. I don't know exactly where to link, but here is a start: ext4 and data loss.
Notice, however, that on many systems rename will be as safe in practice as you expect. However, it is in a way not possible to get both -- performance and reliability -- across all possible Linux configurations!
With a write to a temporary file, then a rename of the temporary file, one would expect the operations are dependent and would be executed in order.
The issue, however, is that most, if not all, filesystems separate metadata and data. A rename is only metadata. It may sound horrible to you, but filesystems value metadata over data (take journaling in HFS+ or Ext3/4, for example)! The reason is that metadata is lighter, and if the metadata is corrupt, the whole filesystem is corrupt -- the filesystem must of course preserve itself first, then preserve the user's data, in that order.
Ext4 did break the rename expectation when it first came out; however, heuristics were added to resolve it. The issue is not a failed rename, but a successful rename. Ext4 might successfully register the rename, but fail to write out the file data if a crash comes shortly thereafter. The result is then a 0-length file and neither the original nor the new data.
So in short, POSIX makes no such guarantee. Read the linked Ext4 article for more information!
In the Win API I found the quite nice function ReplaceFile, which does what the name suggests, even with an optional backup. There is also always the DeleteFile/MoveFile combo.
In general what you want to do is really good. And I cannot think of any better write scheme.
A simplistic solution: use tempfile to create a temporary file, and if writing succeeds, just rename the file to your original configuration file.
Note that rename is not atomic across filesystems. You'll have to resort to a slight workaround (e.g. tempfile on target filesystem, followed by a rename) in order to be really safe.
For locking a file, see portalocker.
The standard solution is this.
Write a new file with a similar name. X.ext# for example.
When that file has been closed (and perhaps even read and checksummed), then you do two renames.
X.ext (the original) to X.ext~
X.ext# (the new one) to X.ext
(Only for the crazy paranoids) call the OS sync function to force dirty buffer writes.
At no time is anything lost or corruptible. The only glitch can happen during the renames. But you haven't lost anything or corrupted anything. The original is recoverable right up until the final rename.
Per RedGlyph's suggestion, I've added an implementation of ReplaceFile that uses ctypes to access the Windows APIs. I first added this to jaraco.windows.api.filesystem.
from ctypes import windll
from ctypes.wintypes import BOOL, DWORD, LPWSTR, LPVOID

ReplaceFile = windll.kernel32.ReplaceFileW
ReplaceFile.restype = BOOL
ReplaceFile.argtypes = [
    LPWSTR,
    LPWSTR,
    LPWSTR,
    DWORD,
    LPVOID,
    LPVOID,
]

REPLACEFILE_WRITE_THROUGH = 0x1
REPLACEFILE_IGNORE_MERGE_ERRORS = 0x2
REPLACEFILE_IGNORE_ACL_ERRORS = 0x4
I then tested the behavior using this script.
from jaraco.windows.api.filesystem import ReplaceFile
import os
open('orig-file', 'w').write('some content')
open('replacing-file', 'w').write('new content')
ReplaceFile('orig-file', 'replacing-file', 'orig-backup', 0, 0, 0)
assert open('orig-file').read() == 'new content'
assert open('orig-backup').read() == 'some content'
assert not os.path.exists('replacing-file')
While this only works in Windows, it appears to have a lot of nice features that other replace routines would lack. See the API docs for details.
There's now a codified, pure-Python, and I dare say Pythonic solution to this in the boltons utility library: boltons.fileutils.atomic_save.
Just pip install boltons, then:
from boltons.fileutils import atomic_save

with atomic_save('/path/to/file.txt') as f:
    f.write('this will only overwrite if it succeeds!\n')
There are a lot of practical options, all well-documented. Full disclosure, I am the author of boltons, but this particular part was built with a lot of community help. Don't hesitate to drop a note if something is unclear!
You could use the fileinput module to handle the backing-up and in-place writing for you:
import fileinput

for line in fileinput.input(filename, inplace=True, backup='.bak'):
    # inplace=True causes the original file to be moved to a backup,
    # and standard output is redirected to the original file.
    # backup='.bak' specifies the extension for the backup file.

    # manipulate line
    newline = process(line)
    print(newline)
If you need to read in the entire contents before you can write the new lines, then you can do that first, then print the entire new contents with
newcontents = process(contents)
for line in fileinput.input(filename, inplace=True, backup='.bak'):
    print(newcontents)
    break
If the script ends abruptly, you will still have the backup.
