os.walk() caching/speeding up - python

I have a prototype server[0] that's doing an os.walk()[1] for each query a client[0] makes.
I'm currently looking into ways of:
caching this data in memory,
speeding up queries, and
hopefully allowing for expansion into storing metadata and data persistence later on.
I find SQL complicated for tree structures, so I thought I would get some advice before actually committing to SQLite.
Are there any cross-platform, embeddable or bundle-able non-SQL databases that might be able to handle this kind of data?
I have a small (10k-100k files) list.
I have an extremely small amount of connections (maybe 10-20).
I want to be able to scale to handling metadata as well.
[0] the server and client are actually the same piece of software; this is a P2P application designed to share files over a local trusted network without a main server, using zeroconf for discovery and twisted for pretty much everything else
[1] query time is currently 1.2s with os.walk() on 10,000 files
Here is the related function in my Python code that does the walking:
def populate(self, string):
    for name, sharedir in self.sharedirs.items():
        for root, dirs, files in os.walk(sharedir):
            for dir in dirs:
                if fnmatch.fnmatch(dir, string):
                    yield os.path.join(name, *os.path.join(root, dir)[len(sharedir):].split("/"))
            for file in files:
                if fnmatch.fnmatch(file, string):
                    yield os.path.join(name, *os.path.join(root, file)[len(sharedir):].split("/"))

You don't need to persist a tree structure -- in fact, your code is busily dismantling the natural tree structure of the directory tree into a linear sequence, so why would you want to restart from a tree next time?
Looks like what you need is just an ordered sequence:
i | X | result-of-os.path.join-for-X
where X, a string, names either a file or directory (you treat them just the same), i is a progressively incrementing integer (to preserve the order), and the result column, also a string, is the result of os.path.join(name, *os.path.join(root, &c.
This is perfectly easy to put in a SQL table, of course!
To create the table the first time, just remove the if fnmatch.fnmatch guards (and the string argument) from your populate function, yield the dir or file before the os.path.join result, and use a cursor.executemany to save the enumerated results of the call (or use a self-incrementing column, your pick). To use the table, populate becomes essentially a:
select result from thetable where X LIKE '%foo%' order by i
where foo stands for the value of string.
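A minimal sqlite3 sketch of that idea (the table and column names are just placeholders, and the path handling mirrors the question's code):

import os
import sqlite3

def build_index(db_path, sharedirs):
    # One-time build of the flat (i, X, result) table described above.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS entries "
                 "(i INTEGER PRIMARY KEY, x TEXT, result TEXT)")
    rows = []
    for name, sharedir in sharedirs.items():
        for root, dirs, files in os.walk(sharedir):
            for entry in dirs + files:
                rel = os.path.join(root, entry)[len(sharedir):]
                rows.append((entry, os.path.join(name, *rel.split("/"))))
    conn.executemany("INSERT INTO entries (x, result) VALUES (?, ?)", rows)
    conn.commit()
    return conn

def populate(conn, string):
    # Query-time replacement for the os.walk() loop.
    pattern = "%" + string + "%"
    for (result,) in conn.execute(
            "SELECT result FROM entries WHERE x LIKE ? ORDER BY i", (pattern,)):
        yield result

Note that LIKE '%foo%' is a plain substring match rather than an fnmatch-style pattern, which is exactly the query suggested above.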

I misunderstood the question at first, but I think I have a solution now (and sufficiently different from my other answer to warrant a new one). Basically, you do the normal query the first time you run walk on a directory, but you store the yielded values. The second time around, you just yield those stored values. I've wrapped the os.walk() call because it's short, but you could just as easily wrap your generator as a whole.
import os

cache = {}

def os_walk_cache(dir):
    if dir in cache:
        for x in cache[dir]:
            yield x
    else:
        cache[dir] = []
        for x in os.walk(dir):
            cache[dir].append(x)
            yield x
I'm not sure of your memory requirements, but you may want to consider periodically cleaning out cache.
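If you do decide to expire entries, a minimal time-based sketch might look like this (the timeout value is arbitrary, and a walk is only cached once it has been fully consumed):

import os
import time

CACHE_TIMEOUT = 300  # seconds; arbitrary
cache = {}  # dir -> (timestamp, list of os.walk() tuples)

def os_walk_cache(dir):
    entry = cache.get(dir)
    if entry is not None and time.time() - entry[0] < CACHE_TIMEOUT:
        yield from entry[1]
        return
    walked = []
    for x in os.walk(dir):
        walked.append(x)
        yield x
    # Only store a complete walk, so a partially consumed generator
    # never leaves a truncated entry behind.
    cache[dir] = (time.time(), walked)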

Have you looked at MongoDB? What about mod_python? mod_python should allow you to do your os.walk() and just store the data in Python data structures, since the script is persistent between connections.

Related

Where to store cross-testrun state in pytest (binary files)?

I have a session-level fixture in pytest that downloads several binary files that I use throughout my test suite. The current fixture looks something like the following:
@pytest.fixture(scope="session")
def image_cache(pytestconfig, tmp_path_factory):
    # A temporary directory loaded with the test image files, downloaded once.
    remote_location = pytestconfig.getoption("remote_test_images")
    tmp_path = tmp_path_factory.mktemp("image_cache", numbered=False)
    # ... download the files and store them into tmp_path
    yield tmp_path
This used to work well; however, the amount of data is now making things slow, so I wish to cache it between test runs (similar to this question). Contrary to the related question, I want to use pytest's own cache for this, i.e., I'd like to do something like the following:
@pytest.fixture(scope="session")
def image_cache(request, tmp_path_factory):
    # A temporary directory loaded with the test image files, downloaded once.
    remote_location = request.config.option.remote_test_images
    tmp_path = request.config.cache.get("image_cache_dir", None)
    if tmp_path is None:
        # what is the correct location here?
        tmp_path = ...
        request.config.cache.set("image_cache_dir", tmp_path)
    # ... ensure path exists and is empty, clean if necessary
    # ... download the files and store them into tmp_path
    yield tmp_path
Is there a typical/default/expected location that I should use to store the binary data?
If not, what is a good (platform-independent) location to choose? (tests run on the three major OS: linux, mac, windows)
A bit late here, but maybe I can still offer a helpful suggestion. You have at least two options, I think:
use pytest's caching solution on its own. Yes, it expects JSON-serializable data, so you'll need to convert your "binary" into strings. You can use base64 to safely encode arbitrary binary into letters that can be stored as a string and then later converted from a string back into the original binary (ie back into an image, or whatever).
use pytest's caching solution as a means of remembering a filename or a directory name. Then, use Python's generic temporary-file support (the tempfile module) so you can manage temporary files in a platform-independent way. In the end, you'll write just filenames or paths into the pytest cache and do everything else manually (see the sketch at the end of this answer).
Solution 1 benefits from fully-automatic management of cache content and will be compatible with other pytest extensions (such as distributed testing with xdist). It means that all files are kept in the same place and it is easier to see and manage disk usage.
Solution 2 is likely to be faster and will scale safely. It avoids the need to transcode to base64, which will be a waste of CPU and of space (since base64 will take a lot more space than the original binary). Additionally, the pytest cache may not be well suited for a large number of potentially large values (depending on the number of files and sizes of images we're talking about here).
Converting image / arbitrary bytes into something that can be encoded to JSON:
import base64

# encodebytes() returns ASCII bytes; decode them to get a JSON-friendly str.
now_im_a_string = base64.encodebytes(your_bytes).decode("ascii")
...
# cache it, store it, whatever
...
# decodebytes() reverses encodebytes().
im_your_bytes_again = base64.decodebytes(string_read_from_cache.encode("ascii"))
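If you go with option 2, a rough sketch of the fixture might look like this (the cache key and directory name are placeholders I made up; recent pytest versions also offer config.cache.mkdir(), if your version has it):

import tempfile
from pathlib import Path

import pytest

@pytest.fixture(scope="session")
def image_cache(request):
    # Re-use a directory remembered in pytest's cache, or create a fresh one.
    cached = request.config.cache.get("image_cache/dir", None)
    if cached is None or not Path(cached).is_dir():
        cache_dir = Path(tempfile.gettempdir()) / "my_test_image_cache"
        cache_dir.mkdir(parents=True, exist_ok=True)
        request.config.cache.set("image_cache/dir", str(cache_dir))
        # ... download the files into cache_dir here (only on a fresh directory)
    else:
        cache_dir = Path(cached)
    yield cache_dir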

Compare one directory at time 1 to same directory at time 2

My goal: compare the content of one directory (including sub-directories and files) at time 1 to the content of the same directory at time 2 (e.g. 6 months later). "Content" means: number and names of the subdirectories + number, names and size of the files. The main intended outcome is being sure that no files were destroyed or corrupted in the meantime.
I did not find any existing tool, although I was wondering whether folderstats (https://github.com/njanakiev/folderstats) could help.
Would you have any suggestion of modules or anything to start well? If you heard about an existing tool for this, I would also be interested.
Thanks.
Here's some code that should help get you started. It defines a function that builds a data structure of nested dictionaries corresponding to the contents of the starting root directory and everything below it in the filesystem. Each item dictionary whose 'type' key has the value 'file' will also have a 'stat' key that can contain whatever file metadata you want or need, such as time of creation, last modification time, length in bytes, etc.
You can use it to obtain "before" and "after" snapshots of the directory you're tracking and use them for comparison purposes. I've left the latter (the comparing) out since I'm not sure exactly what you're interested in; a possible sketch follows the code.
Note that when I actually went about implementing this, I found it simpler to write a recursive function than to use os.walk(), as I suggested in a comment.
The following implements a version of the function and prints out the data structure of nested dictionaries it returns.
import os
from pathlib import PurePath

def path_to_dict(path):
    result = {}
    result['full_path'] = PurePath(path).as_posix()
    if os.path.isdir(path):
        result['type'] = 'dir'
        result['items'] = {filename: path_to_dict(os.path.join(path, filename))
                           for filename in os.listdir(path)}
    else:
        result['type'] = 'file'
        result['stat'] = os.stat(path)  # Preserve any needed metadata.
    return result

root = './folder'  # Change as desired.
before = path_to_dict(root)

# Pretty-print data structure created.
from pprint import pprint
pprint(before, sort_dicts=False)
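As mentioned, the comparison itself is left out above. One possible sketch, which only looks at file sizes (adapt it to whatever metadata you keep under 'stat'):

def compare_snapshots(before, after, path=""):
    # Recursively report items added, removed, or changed between two snapshots.
    before_items = before.get('items', {})
    after_items = after.get('items', {})
    for name in before_items.keys() - after_items.keys():
        print("removed:", path + "/" + name)
    for name in after_items.keys() - before_items.keys():
        print("added:  ", path + "/" + name)
    for name in before_items.keys() & after_items.keys():
        b, a = before_items[name], after_items[name]
        if b['type'] != a['type']:
            print("type changed:", path + "/" + name)
        elif b['type'] == 'file' and b['stat'].st_size != a['stat'].st_size:
            print("size changed:", path + "/" + name)
        elif b['type'] == 'dir':
            compare_snapshots(b, a, path + "/" + name)

# e.g. six months later:
after = path_to_dict(root)
compare_snapshots(before, after)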

Efficient design to store lookup table for files in directories

Let's say I have three directories dir1, dir2 & dir3, with thousands of files in each. Each file has a unique name with no pattern.
Now, given a filename, I need to find which of the three directories it's in. My first thought was to create a dictionary with the filename as key and the directory as the value, like this:
{'file1':'dir1',
'file2':'dir3',
'file3':'dir1', ... }
But seeing as there are only three unique values, this seems a bit redundant and takes up space.
Is there a better way to implement this? What if I can compromise on space but need faster lookup?
A simple way to solve this is to query the file-system directly instead of caching all the filenames in a dict. This will save a lot of space, and will probably be fast enough if there are only a few hundred directories to search.
Here is a simple function that does that:
import os

def find_directory(filename, directories):
    for directory in directories:
        path = os.path.join(directory, filename)
        if os.path.exists(path):
            return directory
On my Linux system, when searching around 170 directories, it takes about 0.3 seconds to do the first search, and then only about 0.002 seconds thereafter. This is because the OS does file-caching to speed up repeated searches. But note that if you used a dict to do this caching in Python, you'd still have to pay a similar initial cost.
Of course, the subsequent dict lookups would be faster than querying the file-system directly. But do you really need that extra speed? To me, two thousandths of a second seems easily "fast enough" for most purposes. And you get the extra benefit of never needing to refresh the file-cache (because the OS does it for you).
PS:
I should probably point out that the above timings are worst-case: that is, I dropped all the system file-caches first, and then searched for a filename that was in the last directory.
You can store the index as a dict of sets. It might be more memory-efficient.
index = {
    "dir1": {"f1", "f2", "f3", "f4"},
    "dir2": {"f3", "f4"},
    "dir3": {"f5", "f6", "f7"},
}

filename = "f4"
for dir, files in index.items():
    if filename in files:
        print(dir)
Speaking of thousands of files, you'll barely see any difference between this method and your inverted index.
Also, repeated strings in Python can be interned to save memory; CPython sometimes interns short strings itself.
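For what it's worth, manual interning is just a call to sys.intern on each repeated value (a micro-optimization; measure before bothering):

import sys

# Example data; interning the directory names means every value equal to
# "dir1" refers to one shared string object rather than separate copies.
pairs = [("file1", "dir1"), ("file2", "dir3"), ("file3", "dir1")]
lookup = {filename: sys.intern(dirname) for filename, dirname in pairs}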

How can I optimise this recursive file size function?

I wrote a script that sums the size of the files in subdirectories on an FTP server:
for dirs in ftp.nlst("."):
    try:
        print("Searching in " + dirs + "...")
        ftp.cwd(dirs)
        for files in ftp.nlst("."):
            size += ftp.size(files)
        ftp.cwd("../")
    except ftplib.error_perm:
        pass
print("Total size of " + serveradd + tvt + " = " + str(size * 10**-9) + " GB")
Is there a quicker way to get the size of the whole directory tree other than summing the file sizes for all directories?
As Alex Hall commented, this is not recursive. I'll address the speeding-up issue, as you can read about recursion from many sources, for example here.
Putting that aside, you didn't mention approximately how many files are in that directory, but you're wasting time by spending a whole round-trip on every file in it. Instead, ask the server to return the entire listing for the directory and sum the file sizes:
import re

class DirSizer:
    def __init__(self):
        self.size = 0

    def add_list_entry(self, lst):
        if '<DIR>' not in lst:
            metadata = re.split(r'\s+', lst)
            self.size += int(metadata[2])

ds = DirSizer()
ftp.retrlines('LIST', ds.add_list_entry)  # add_list_entry will be called for every line
print(ds.size)  # => size (shallow, currently) of the directory
Note that:
This should of course be done recursively for every directory in the tree.
Your server might return the list in a different format, so you might need to change either the re.split line or the metadata[2] part.
If your server supports the MLSD FTP command, use that instead, as it'll be in a standardized format.
See here for an explanation of retrlines and the callback.
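If the server does support MLSD, a recursive version might look roughly like this (a sketch; ftp is an ftplib.FTP connection, and the server is assumed to report the standard "type" and "size" facts):

def tree_size(ftp, path="."):
    # Recursively sum file sizes below path using MLSD (RFC 3659).
    total = 0
    for name, facts in ftp.mlsd(path, facts=["type", "size"]):
        if facts.get("type") == "dir":
            total += tree_size(ftp, path + "/" + name)
        elif facts.get("type") == "file":
            total += int(facts.get("size", 0))
    return total

This avoids one SIZE round-trip per file, since each MLSD response line already carries the size.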

I want a clever algorithm for indexing a file directory...pointers?

I have a directory of music files (.mp3, .wav, etc.) on Ubuntu. This directory can have as many sub-directories as it needs, no limits. I want to be able to make a music library out of it - that is, return a list of songs based on filters of:
1) membership to playlist
2) artist name
3) string search
4) name of song
etc, etc
However, if files are renamed, moved, or added to my Music directory, I need to be able to reflect this in my music organization engine - quickly!
I originally thought to just monitor my directory with pyinotify, incron, or inotify. Unfortunately my directory is a Samba share and so monitoring file events failed. So my next guess was to simply recursively search the directory in python, and populate a SQL database. Then when updating, I would just look to see if anything has changed (scanning each subfolder to see if each song's name is in the database already, and if not adding it), and make UPDATEs accordingly. Unfortunately, this seems to be a terrible O(n^2) implementation - awful for a multi-terabyte music collection.
A slightly better one might involve creating a tree structure in SQL, thus narrowing the possible candidates to search for a match at any given subfolder step to the size of that subfolder. Still seems inelegant.
What design paradigms/packages can I use to help myself out? Obviously will involve lots of clever hash tables. I'm just looking for some pointers in the right direction for how to approach the problem. (Also I'm a complete junkie for optimization.)
The hard part of this is the scanning of the directory, just because it can be expensive.
But that's a cruel reality since you can't use inotify et al.
In your database, simply create a node type record:
create table node (
    nodeKey integer not null primary key,
    parentNode integer references node(nodeKey), -- allow null for the root, or have root point to itself, whatever
    fullPathName varchar(2048),
    nodeName varchar(2048),
    nodeType varchar(1) -- d = directory, f = file, or whatever else you want
)
That's your node structure.
You can use the full path column to quickly find anything by the absolute path.
When a file moves, simply recalculate the path.
Finally, scan your music files. In Unix, you can do something like:
find . -type f | sort > sortedListOfFiles
Next, simply suck all of the path names out of the database.
select fullPathName from node where nodeType != 'd' order by fullPathName
Now you have two sorted lists of files.
Run them through diff (or comm), and you'll have a list of deleted and new files. You won't have a list of "moved" files. If you want to apply some heuristic that compares new and old files with the same endings (i.e. ..../album/song) to try to detect "moves" versus new and old files, fine, no big deal. Worth a shot.
But diff will give you your differential in a heartbeat.
If you have zillions of files, then, sorry, this is going to take some time -- but you already knew that when you lost the inotify capability. If you had it, this would just be incremental maintenance.
When a file moves, it's trivial to find its new absolute path, because you can ask its parent for its path and simply append your name to it. After that, you're not crawling a tree or anything, unless you want to. Works both ways.
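For reference, the same differential in Python using sets (no sorting needed; "music.db" is a hypothetical database file, and the paths in both sources are assumed to use the same convention):

import sqlite3

conn = sqlite3.connect("music.db")
fs_paths = set(line.strip() for line in open("sortedListOfFiles"))
db_paths = set(row[0] for row in conn.execute(
    "select fullPathName from node where nodeType != 'd'"))

new_files = fs_paths - db_paths      # on disk but not yet in the database
deleted_files = db_paths - fs_paths  # in the database but gone from disk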
Addenda:
If you want to track actual name changes, you can get a little more information.
You can do this:
find . -type f -print0 | xargs -0 ls -i | sort -n > sortedListOfFileWithInode
The -print0 and -0 are used to work with files with spaces in them. Quotes in the file names will wreck this however. You might be better off running the raw list through python and fstat to get the inode. Different things you can do here.
What this does is rather than just having names, you also get the inode of the file. The inode is the "real" file, a directory links names to inodes. This is how you can have multiple names (hard links) in a unix file system to a single file, all of the names point to the same inode.
When a file is renamed, the inode will remain the same. In unix, there's a single command used for renaming, and moving files, mv. When mv renames or moves the file, the inode stays the same AS LONG AS THE FILE IS ON THE SAME FILE SYSTEM.
So, using the inode as well as the file name will let you capture some more interesting information, like file moves.
It won't help if they delete the file and add a new file. But you WILL (likely) be able to tell that it happened, since it is unlikely that the old inode number will be reused for the new file.
So if you have a list of files (sorted by file name):
1234 song1.mp3
1235 song2.mp3
1236 song3.mp3
and someone removes and adds back song 2, you'll have something like
1234 song1.mp3
1237 song2.mp3
1236 song3.mp3
But if you do this:
mv song1.mp3 song4.mp3
You'll get:
1237 song2.mp3
1236 song3.mp3
1234 song4.mp3
The other caveat is that if you lose the drive and restore it from backup, likely all of the inodes will change, forcing effectively a rebuild of your index.
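A small sketch of the inode idea in Python (the path is a placeholder, and it only works while everything stays on one filesystem):

import os

def inode_map(root):
    # Map inode number -> path for every file under root.
    result = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            result[os.stat(path).st_ino] = path
    return result

before = inode_map("/path/to/music")   # earlier scan
after = inode_map("/path/to/music")    # later scan

# Same inode, different path: the file was renamed or moved.
renamed = {ino: (before[ino], after[ino])
           for ino in before.keys() & after.keys()
           if before[ino] != after[ino]}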
If you're real adventurous you can try playing with extended file system attributes and assign other interesting meta data to files. Haven't done much with that, but it's got possibilities as well, and there are likely unseen dangers, but...
My aggregate_digup program reads an extended sha1sum.txt-format file produced by the digup program. This lets me locate a file based on its sha1sum. The digup program stores the mtime, size, hash and pathname in its output; by default it skips hashing a file if the mtime and size match. The index produced by my aggregate_digup is used by my modified version of the open uri context menu gedit plugin, allowing one to option-click on sha1:b7d67986e54f852de25e2d803472f31fb53184d5 and have it list the copies of the file it knows about so you can pick one and open it.
How this relates to the problem is that there are two parts: one, the playlists, and two, the files.
If we can assume that nothing the player does changes the files, then the hashes and sizes of the files are constant, so we should be able to use the size and hash of a file as a unique identifier.
For example, the key for the file mentioned: 222415:b7d67986e54f852de25e2d803472f31fb53184d5
I've found that in practice this has no collisions in any natural collection.
(This does mean that the ID3 metadata, which is appended or prepended to the MP3 data, can't change unless you choose to skip that metadata while hashing.)
So the playlist database would be something like this:
files(file_key, hash, size, mtime, path, flag)
tracks(file_key, title, artist)
playlists(playlistid, index, file_key)
To update the files table (pseudocode; the SQL-looking lines stand in for whatever database layer you use):

import os
import stat

# add new files:
update files set flag = 0
for path in filesystem:
    s = os.stat(path)
    if stat.S_ISREG(s.st_mode):
        fetch first row of: select mtime, hash, size from files where path = path
        if row is not None:
            if s.st_mtime == mtime and s.st_size == size:
                update files set flag = 1 where path = path
                continue
        hash = hash_file(path)
        file_key = "%s:%s" % (s.st_size, hash)
        insert or update files set file_key=file_key, size=s.st_size, mtime=s.st_mtime, hash=hash, flag=1 where path=path

# remove non-existent files:
delete from files where flag = 0
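Assuming the files table lives in SQLite with path declared unique, a runnable sketch of that update loop could look like this (table and column names as above; roots is whatever list of directories you scan):

import hashlib
import os
import sqlite3
import stat

def hash_file(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def update_files_table(conn, roots):
    conn.execute("update files set flag = 0")
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                s = os.stat(path)
                if not stat.S_ISREG(s.st_mode):
                    continue
                row = conn.execute(
                    "select mtime, size from files where path = ?", (path,)
                ).fetchone()
                if row is not None and row[0] == s.st_mtime and row[1] == s.st_size:
                    conn.execute("update files set flag = 1 where path = ?", (path,))
                    continue
                digest = hash_file(path)
                file_key = "%d:%s" % (s.st_size, digest)
                conn.execute(
                    "insert or replace into files "
                    "(file_key, hash, size, mtime, path, flag) values (?, ?, ?, ?, ?, 1)",
                    (file_key, digest, s.st_size, s.st_mtime, path),
                )
    conn.execute("delete from files where flag = 0")
    conn.commit()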
The reality is, this is a hard problem. You're starting from a disadvantage as well: Python and MySQL aren't the fastest tools to use for this purpose.
Even iTunes draws complaints about the time it takes to import libraries and index new files. Can you imagine the man-hours that went into making iTunes as good as it is?
Your best bet is to look at the code of major open source music players such as
Miro, http://www.getmiro.com/,
Banshee, http://banshee.fm/, and
Songbird, http://getsongbird.com/
and try to adapt their algorithms to your purpose and to Python idioms.
import os
import re

# ... your other code here that initially sets up a dictionary containing which
# files you already have in your library (I called the dictionary archived_music)

music_directory = '/home/username/music'
music_type = re.compile(r'\.mp3$|\.wav$|\.etc$')

# -mtime -1 finds files modified within the last day.
found_files = os.popen('find %s -type f -mtime -1 2>/dev/null' % music_directory)
for file in found_files:
    file = file.strip()
    directory, filename = os.path.split(file)
    if music_type.search(filename):
        # found a music file, check if you already have it in the library
        if filename in archived_music:
            continue
        # if you have gotten to this point, the music was not found in the archived
        # music collection, so now perform whatever processing you would like to do
        # on the full path found in file.
You can use this code as a little function or whatever and call it at whatever time resolution you would like. It uses the find command to locate every file modified within the last day, then checks whether it is of type music_type; if it is, it checks the filename against whatever current database you have set up, and you can continue processing from there. This should be enough to get you started on updating newly added music or whatnot.
I've done something similar in the past, but ended up utilizing Amarok w/ MySQL. Amarok will create a mysql database for you and index all your files quite nicely - after that interfacing with the database should be relatively straightforward from python.
It was quite a time saver for me :)
HTH
