I'm working on a Python NDB model on app engine that looks like:
class NDBPath(ndb.Model):
    path = ndb.StringProperty()
    directory = ndb.ComputedProperty(lambda self: getDirectory(self.path))
    cat = ndb.IntegerProperty(indexed=False)
Path is a file path, directory is the superdirectory of that file, and cat is some number. These entities are effectively read only after an initial load.
I query the datastore with various filepaths and want to pull out the cat property of an entity if either a) its path matches the queried path (same file), or b) the entity's directory is one of the superdirectories of the queried path. So I end up doing a query like:
NDBPath.query(NDBPath.directory.IN(generateSuperPaths(queriedPath)))
Where generateSuperPaths lists all the superdirectories of the queried path in their full form (e.g. a/b/c/d.html --> [/a, /a/b, /a/b/c]).
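For reference (generateSuperPaths isn't shown here), a minimal sketch of the behaviour I'm assuming:

def generateSuperPaths(path):
    # e.g. 'a/b/c/d.html' -> ['/a', '/a/b', '/a/b/c']
    parts = path.lstrip('/').split('/')[:-1]  # drop the file name itself
    return ['/' + '/'.join(parts[:i + 1]) for i in range(len(parts))]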
Because these are read only, using a computed property is effectively a waste of a write as it will never change. Is there any way to query based on a dynamically transformed value, like
NDBPath.query(getDirectory(NDBPath.path).IN(generateSuperPaths(queriedPath)))
So I can save writing the directory as a property and just use it in the query?
AFAIK, yes. Both approaches end up doing the same thing, and this will save you write time and cost. It will, however, make your query run slower (because of the added computation), so be wary of possible timeouts.
I use GridFS as follows:
from pymongo import MongoClient
import gridfs

connection = MongoClient(host='localhost')
db = connection.gridfs_example
fs = gridfs.GridFS(db)
fileId = fs.put("Contents of my file", key='s1')
After files are originally stored in GridFS, I have a process that computes additional metadata respective to the contents of the file.
def qcFile(fileId):
    # DO QC
    return "QC PASSED"
qcResult = qcFile(fileId)
It would have been great if I could do:
fs.update(fileId, QC_RESULT = qcResult)
But that option does not appear to exist in the documentation. I found here (the question was updated with a solution) that the Java driver appears to offer a way to do this, but I can't find its equivalent in Python's gridfs.
So, how do I use pymongo to tag my file with the newly computed metadata value qcResult? I can't find it within the documentation.
GridFS stores files in two collections:
1. files collection: stores the files' metadata
2. chunks collection: stores the files' binary data
You can hack into the files collection. The name of the files collection is 'fs.files' where 'fs' is the default bucket name.
So, to update the QC result, you can do something like this:
db.fs.files.update({'_id': fileId}, {'$set': {'QC_RESULT': qcResult}})
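If you are on a newer PyMongo (3.x or later), the collection-level update method is deprecated (and later removed); the same fix would use update_one:

db.fs.files.update_one({'_id': fileId}, {'$set': {'QC_RESULT': qcResult}})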
I'm looking for a metadata layer that sits on top of files which can interpret key-value pairs of information in file names for apps that work with thousands of files. More info:
These aren't necessarily media files that have built-in metadata - hence the key-value pairs.
The metadata goes beyond OS information (file sizes, etc.) to whatever the app puts into the key-values.
It should be accessible via command line as well as a python module so that my applications can talk to it.
ADDED: It should also be supported by common OS commands (cp, mv, tar, etc.) so that it doesn't get lost if a file is copied or moved.
Examples of functionality I'd like include:
list files in directory x for organization_id 3375
report on files in directory y by translating load_time to year/month and show file count & size for each year/month combo
get oldest file in directory z based upon key of loadtime
Files with this simple metadata embedded within them might look like:
bowling_state-ky_league-15_game-8_gametime-201209141830.tgz
bowling_state-ky_league-15_game-9_gametime-201209141930.tgz
This metadata is very accessible and tightly joined to the file. But I'd prefer to avoid needing to use cut or wildcards for every operation.
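For illustration, a tiny parser for that naming convention (assuming keys are lowercase words and values run up to the next underscore; this helper is just a sketch, not part of any existing tool):

import re

def parse_name(filename):
    # 'bowling_state-ky_league-15_game-8_gametime-201209141830.tgz'
    #   -> {'state': 'ky', 'league': '15', 'game': '8', 'gametime': '201209141830'}
    stem = filename.rsplit('.', 1)[0]
    return dict(re.findall(r'([a-z]+)-([^_]+)', stem))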
I've looked around and can only find media & os metadata solutions and don't want to build something if it already exists.
Have you looked at extended file attributes? See: http://en.wikipedia.org/wiki/Extended_file_attributes
Basically, you store the key-value pairs as zero terminated strings in the filesystem itself. You can set these attributes from the command line like this:
$ setfattr -n user.comment -v "this is a comment" testfile
$ getfattr testfile
# file: testfile
user.comment
$ getfattr -n user.comment testfile
# file: testfile
user.comment="this is a comment"
To set and query extended file system attributes from python, you can try the python module xattr. See: http://pypi.python.org/pypi/xattr
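On Linux with Python 3.3 or later, the standard library also exposes these operations directly (os.setxattr, os.getxattr, os.listxattr), so a rough equivalent of the shell commands above would be:

import os

path = 'testfile'
os.setxattr(path, 'user.comment', b'this is a comment')  # like setfattr -n user.comment -v ...
print(os.listxattr(path))                                # like getfattr testfile
print(os.getxattr(path, 'user.comment'))                 # like getfattr -n user.comment testfile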
EDIT
Extended attributes are supported by most filesystem manipulation commands, such as cp, mv and tar, by adding command-line flags, e.g. cp -a or tar --xattrs. You may need these commands to work transparently (you may have users who are unaware of your extended attributes); in this case you can create an alias, e.g. alias cp="cp -a".
As already discussed, xattrs are a good solution when available. However, when you can't use xattrs:
NTFS alternate data streams
On Microsoft Windows, xattrs are not available, but NTFS alternate data streams provide a similar feature. ADSs let you store arbitrary amounts of data together with the main stream of a file. They are accessed using
drive:\path\to\file:streamname
An ADS is effectively just its own file with a special name. Apparently you can access them from Python by specifying a filename containing a colon:
open(r"drive:\path\to\file:streamname", "wb")
and then using it like an ordinary file. (Disclaimer: not tested.)
From the command line, use Microsoft's streams program.
Since ADSs store arbitrary binary data, you are responsible for writing the querying functionality.
SQLite
SQLite is an embedded RDBMS that you can use. Store the .sqlite database file alongside your directory tree.
For each file you add, also record it in a table:
CREATE TABLE file (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT
);
Then, for example, you could store a piece of metadata as a table:
CREATE TABLE organization_id (
    file_id INTEGER PRIMARY KEY,
    value INTEGER,
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Then you can query on it:
SELECT path FROM file NATURAL JOIN organization_id
WHERE value == 3375 AND path LIKE '/x/%';
Alternatively, if you want a pure key-value store, you can store all the metadata in one table:
CREATE TABLE metadata (
    file_id INTEGER,
    key TEXT,
    value TEXT,
    PRIMARY KEY(file_id, key),
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Query:
SELECT path FROM file NATURAL JOIN metadata
WHERE key == 'organization_id' AND value == 3375 AND path LIKE '/x/%';
Obviously it's your responsibility to update the database whenever you read or write a file. You must also make sure these updates are atomic (e.g. add a column active to the file table; when adding a file: set active = FALSE, write file, fsync, set active = TRUE, and as cleanup delete any files that have active = FALSE).
The Python standard library includes SQLite support as the sqlite3 package.
From the command line, use the sqlite3 program.
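A minimal sketch of driving the key-value variant from Python with sqlite3 (the database file name and the helper functions here are just examples):

import sqlite3

db = sqlite3.connect('metadata.sqlite')  # stored alongside the directory tree

def add_file(path):
    row_id = db.execute("INSERT INTO file (path) VALUES (?)", (path,)).lastrowid
    db.commit()
    return row_id

def tag(file_id, key, value):
    # PRIMARY KEY(file_id, key) makes this an upsert
    db.execute("INSERT OR REPLACE INTO metadata (file_id, key, value) VALUES (?, ?, ?)",
               (file_id, key, value))
    db.commit()

def files_for(key, value, prefix):
    q = ("SELECT path FROM file NATURAL JOIN metadata "
         "WHERE key = ? AND value = ? AND path LIKE ?")
    return [row[0] for row in db.execute(q, (key, value, prefix + '%'))]

fid = add_file('/x/bowling_state-ky_league-15_game-8_gametime-201209141830.tgz')
tag(fid, 'organization_id', '3375')
print(files_for('organization_id', '3375', '/x/'))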
Note that xattrs have size limits (up to 4 KB on ext4) and that on Linux the key must be prefixed with 'user.'.
Also, not all filesystems support xattrs.
You could try the iDB.py library, which wraps xattr and can easily be switched to disable xattr support.
I have a directory of music on Ubuntu (.mp3, .wav, etc) files. This directory can have as many sub directories as it needs, no limits. I want to be able to make a music library out of it - that is, return list of songs based on filters of:
1) membership to playlist
2) artist name
3) string search
4) name of song
etc, etc
However, if files are renamed, moved, or added to my Music directory, I need to be able to reflect this in my music organization engine - quickly!
I originally thought to just monitor my directory with pyinotify, incron, or inotify. Unfortunately my directory is a Samba share and so monitoring file events failed. So my next guess was to simply recursively search the directory in python, and populate a SQL database. Then when updating, I would just look to see if anything has changed (scanning each subfolder to see if each song's name is in the database already, and if not adding it), and make UPDATEs accordingly. Unfortunately, this seems to be a terrible O(n^2) implementation - awful for a multi-terabyte music collection.
A slightly better one might involve creating a tree structure in SQL, thus narrowing the possible candidates to search for a match at any given subfolder step to the size of that subfolder. Still seems inelegant.
What design paradigms/packages can I use to help myself out? Obviously this will involve lots of clever hash tables. I'm just looking for some pointers in the right direction for how to approach the problem. (Also, I'm a complete junkie for optimization.)
The hard part of this is the scanning of the directory, just because it can be expensive.
But that's a cruel reality since you can't use inotify et al.
In your database, simply create a node type record:
create table node (
    nodeKey integer not null primary key,
    parentNode integer references node(nodeKey), -- allow null for the root, or have root point to itself, whatever
    fullPathName varchar(2048),
    nodeName varchar(2048),
    nodeType varchar(1) -- d = directory, f = file, or whatever else you want
);
That's your node structure.
You can use the full path column to quickly find anything by the absolute path.
When a file moves, simply recalculate the path.
Finally, scan your music files. In unix, you can do something like:
find . -type f | sort > sortedListOfFiles
Next, simply suck all of the path names out of the database.
select fullPathName from node where nodeType != 'd' order by fullPathName
Now you have two sorted lists of files.
Run them through diff (or comm), and you'll have a list of deleted and new files. You won't have a list of "moved" files. If you want to apply some heuristic that compares new and old files with the same endings (i.e. ..../album/song) to try to detect "moves" vs. new and deleted files, then fine, no big deal. Worth a shot.
But diff will give you your differential in a heartbeat.
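If you'd rather stay in Python than shell out to diff or comm, the same differential can be computed with set operations (a rough sketch; db is assumed to be a sqlite3 connection to the node table above, and both sides are assumed to store paths in the same form):

with open('sortedListOfFiles') as f:
    on_disk = set(line.rstrip('\n') for line in f)
in_db = set(path for (path,) in
            db.execute("select fullPathName from node where nodeType != 'd'"))

new_files = on_disk - in_db      # present on disk, missing from the database
deleted_files = in_db - on_disk  # in the database, gone from disk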
If you have zillions of files, then, sorry, this is going to take some time -- but you already know that from when you lost the inotify capability. If you had that, it would just be incremental maintenance.
When a file moves, it's trivial to find its new absolute path, because you can ask its parent for its path and simply append the file's name to it. After that, you're not crawling a tree or anything, unless you want to. Works both ways.
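A sketch of that recalculation against the node table above, using sqlite3 (the helper name is hypothetical):

def recompute_path(db, node_key):
    # look up the parent's full path and append this node's own name
    row = db.execute(
        "select p.fullPathName, n.nodeName from node n"
        " join node p on p.nodeKey = n.parentNode"
        " where n.nodeKey = ?", (node_key,)).fetchone()
    new_path = row[0].rstrip('/') + '/' + row[1]
    db.execute("update node set fullPathName = ? where nodeKey = ?",
               (new_path, node_key))
    return new_path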
Addenda:
If you want to track actual name changes, you can get a little more information.
You can do this:
find . -type f -print0 | xargs -0 ls -i | sort -n > sortedListOfFileWithInode
The -print0 and -0 are used to cope with file names containing spaces. Quotes in the file names will wreck this, however. You might be better off running the raw list through Python and os.stat to get the inodes. There are different things you can do here.
What this does is rather than just having names, you also get the inode of the file. The inode is the "real" file, a directory links names to inodes. This is how you can have multiple names (hard links) in a unix file system to a single file, all of the names point to the same inode.
When a file is renamed, the inode remains the same. In unix, there's a single command for renaming and moving files, mv. When mv renames or moves a file, the inode stays the same AS LONG AS THE FILE STAYS ON THE SAME FILE SYSTEM.
So, using the inode as well as the file name will let you capture some more interesting information, like file moves.
It won't help if they delete a file and add a new one. But you WILL (likely) be able to tell that it happened, since it is unlikely that an old inode will be reused for the new file.
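If you'd rather gather the inodes from Python instead of find and ls, a rough sketch (os.walk plus os.stat; st_ino is the inode number):

import os

def inode_index(top):
    # map inode -> path for every regular file under top
    index = {}
    for root, dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            index[os.stat(path).st_ino] = path
    return index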
So if you have a list of files (sorted by file name):
1234 song1.mp3
1235 song2.mp3
1236 song3.mp3
and someone removes and adds back song 2, you'll have something like
1234 song1.mp3
1237 song2.mp3
1236 song3.mp3
But if you do this:
mv song1.mp3 song4.mp3
You'll get:
1237 song2.mp3
1236 song3.mp3
1234 song4.mp3
The other caveat is that if you lose the drive and restore it from backup, likely all of the inodes will change, forcing effectively a rebuild of your index.
If you're really adventurous, you can try playing with extended filesystem attributes and assign other interesting metadata to files. I haven't done much with that, but it's got possibilities as well, and there are likely unseen dangers, but...
My aggregate_digup program reads an extended sha1sum.txt-format file produced by the digup program. This lets me locate a file based on its sha1sum. The digup program stores the mtime, size, hash and pathname in its output. By default it skips hashing a file if the mtime and size match. The index produced by my aggregate_digup is used by my modified version of the open uri context menu gedit plugin, allowing one to option-click on sha1:b7d67986e54f852de25e2d803472f31fb53184d5 and have it list the copies of the file it knows about so you can pick one and open it.
How this relates to the problem is that there are two parts: one, the playlists, and two, the files.
If we can assume that nothing the player does changes the files, then the hash and size of a file are constant, so we should be able to use the size and hash of a file as a unique identifier.
For example, the key for the file mentioned: 222415:b7d67986e54f852de25e2d803472f31fb53184d5
I've found that in practice this has no collisions in any natural collection.
(This does mean that the ID3 metadata, which is appended or prepended to the mp3 data, can't change unless you choose to skip that metadata while hashing.)
So the playlist database would be something like this:
files(file_key, hash, size, mtime, path, flag)
tracks(file_key, title, artist)
playlists(playlistid, index, file_key)
To update the files table:
import os
import stat
import hashlib

def hash_file(path):
    # sha1 of the file contents
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            h.update(chunk)
    return h.hexdigest()

def update_files(db, filesystem):
    # db is a sqlite3 connection; filesystem is an iterable of paths to scan.
    # Assumes files.path is UNIQUE so "insert or replace" acts as an upsert.
    db.execute("update files set flag=0")
    # add new or changed files:
    for path in filesystem:
        s = os.stat(path)
        if not stat.S_ISREG(s.st_mode):
            continue
        row = db.execute("select mtime, hash, size from files where path=?",
                         (path,)).fetchone()
        if row is not None and s.st_mtime == row[0] and s.st_size == row[2]:
            db.execute("update files set flag=1 where path=?", (path,))
            continue
        digest = hash_file(path)
        file_key = "%s:%s" % (s.st_size, digest)  # size:hash, as described above
        db.execute("insert or replace into files"
                   " (file_key, hash, size, mtime, path, flag) values (?,?,?,?,?,1)",
                   (file_key, digest, s.st_size, s.st_mtime, path))
    # remove non-existent files:
    db.execute("delete from files where flag=0")
    db.commit()
The reality is, this is a hard problem. You're starting from a disadvantage as well: Python and MySQL aren't the fastest tools to use for this purpose.
Even iTunes gets complaints about the time it takes to import libraries and index new files. Can you imagine the man-hours that went into making iTunes as good as it is?
Your best bet is to look at the code of major open source music players such as
Miro, http://www.getmiro.com/,
Banshee, http://banshee.fm/, and
Songbird, http://getsongbird.com/
and try to adapt their algorithms to your purpose and to Python idioms.
import os
import re

# ... your other code here that initially sets up a dictionary containing which
# files you already have in your library (I called the dictionary archived_music)

music_directory = '/home/username/music'
music_type = re.compile(r'\.mp3$|\.wav$|\.etc$')

found_files = os.popen('find %s -type f -mtime -1 2>/dev/null' % music_directory)
for path in found_files:
    directory, filename = os.path.split(path.rstrip('\n'))
    if music_type.search(filename):
        # found a music file, check if you already have it in the library
        if filename in archived_music:
            continue
        # if you have gotten to this point, the music was not found in the archived
        # music dictionary, so now perform whatever processing you would like to do
        # on the full path found in path.
You can use this code as a little function or whatever and call it at whatever time resolution you would like. It uses the find command to locate every file modified within the last day. It then checks whether each one matches music_type; if it does, it checks the filename against whatever current database you have set up, and you can continue processing from there. This should get you started on updating newly added music or whatnot.
I've done something similar in the past, but ended up utilizing Amarok with MySQL. Amarok will create a MySQL database for you and index all your files quite nicely; after that, interfacing with the database from Python should be relatively straightforward.
It was quite a time saver for me :)
HTH
I have just started working on web2py. Personally, I find it easier to learn than Django.
My query is that I have to load a file at application startup. It's a pickled hashtable. Where should I store this file so that the system is able to see it?
My code is :
import cPickle as pickle
def index():
    """
    Load the file into memory and message the number of entries
    """
    f = open('tables.pkl','rb')
    session.tables = pickle.load(f)
    f.close()
    terms = len(session.tables.keys())
    message = 'The total entries in table = ' + str(terms)
    return dict(message=message)
As you can see, I have put the code in index() to load it at startup. At present I am using the absolute path to the physical location of the 'tables.pkl' file. Where should I put it in my application folder?
Also, I want the tables variable to be available to all functions in the controller. Is session.tables the right way to go? It is just a search app, so there is no user login.
The table has to be loaded only once for all users accessing the page.
Thank you.
I think the private folder would be a good place for this. You can get the absolute path with:
import os
fp = os.path.join(request.folder,'private','tables.pkl')
I would use cache instead of session if the file is not unique per user.
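A sketch of what that might look like with web2py's cache.ram (assuming the usual request and cache globals available in a controller; time_expire=None keeps the loaded dict in memory for the life of the process):

import cPickle as pickle
import os

def load_tables():
    fp = os.path.join(request.folder, 'private', 'tables.pkl')
    with open(fp, 'rb') as f:
        return pickle.load(f)

def index():
    # loaded once, shared by all users of this process
    tables = cache.ram('tables', load_tables, time_expire=None)
    message = 'The total entries in table = ' + str(len(tables))
    return dict(message=message)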
I have a prototype server[0] that's doing an os.walk()[1] for each query a client[0] makes.
I'm currently looking into ways of:
caching this data in memory,
speeding up queries, and
hopefully allowing for expansion into storing metadata and data persistence later on.
I find SQL complicated for tree structures, so I thought I would get some advice before actually committing to SQLite.
Are there any cross-platform, embeddable or bundle-able non-SQL databases that might be able to handle this kind of data?
I have a small (10k-100k files) list.
I have an extremely small amount of connections (maybe 10-20).
I want to be able to scale to handling metadata as well.
[0] the server and client are actually the same piece of software, this is a P2P application, that's designed to share files over a local trusted network with out a main server, using zeroconf for discovery, and twisted for pretty much everything else
[1] query time is currently 1.2s with os.walk() on 10,000 files
Here is the related function in my Python code that does the walking:
def populate(self, string):
    for name, sharedir in self.sharedirs.items():
        for root, dirs, files in os.walk(sharedir):
            for dir in dirs:
                if fnmatch.fnmatch(dir, string):
                    yield os.path.join(name, *os.path.join(root, dir)[len(sharedir):].split("/"))
            for file in files:
                if fnmatch.fnmatch(file, string):
                    yield os.path.join(name, *os.path.join(root, file)[len(sharedir):].split("/"))
You don't need to persist a tree structure -- in fact, your code is busily dismantling the natural tree structure of the directory tree into a linear sequence, so why would you want to restart from a tree next time?
Looks like what you need is just an ordered sequence:
i | X | result of os.path.join for X
where X, a string, names either a file or directory (you treat them just the same), i is a progressively incrementing integer number (to preserve the order), and the result column, also a string, is the result of os.path.join(name, *os.path.join(root, &c.
This is perfectly easy to put in a SQL table, of course!
To create the table the first time, just remove the if fnmatch.fnmatch guards (and the string argument) from your populate function, yield the dir or file before the os.path.join result, and use cursor.executemany to save the enumerate of the call (or use a self-incrementing column, your pick). To use the table, populate becomes essentially:
select result from thetable where X LIKE '%foo%' order by i
where string is foo.
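A rough sketch of how this might look with the standard sqlite3 module (the table and column names are just examples, and populate_all is assumed to be the unfiltered variant of populate described above, yielding (X, result) pairs):

import sqlite3

db = sqlite3.connect('paths.sqlite')
db.execute("create table if not exists thetable (i integer primary key, X text, result text)")

db.executemany("insert into thetable (i, X, result) values (?, ?, ?)",
               ((i, x, result) for i, (x, result) in enumerate(populate_all())))
db.commit()

def query(string):
    return [row[0] for row in db.execute(
        "select result from thetable where X like ? order by i",
        ('%' + string + '%',))]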
I misunderstood the question at first, but I think I have a solution now (and sufficiently different from my other answer to warrant a new one). Basically, you do the normal query the first time you run walk on a directory, but you store the yielded values. The second time around, you just yield those stored values. I've wrapped the os.walk() call because it's short, but you could just as easily wrap your generator as a whole.
import os

cache = {}
def os_walk_cache(dir):
    if dir in cache:
        for x in cache[dir]:
            yield x
    else:
        # materialize the whole walk before yielding, so that abandoning the
        # iteration early can't leave a half-filled cache entry behind
        cache[dir] = list(os.walk(dir))
        for x in cache[dir]:
            yield x
I'm not sure of your memory requirements, but you may want to consider periodically cleaning out cache.
Have you looked at MongoDB? What about mod_python? mod_python should allow you to do your os.walk() and just store the data in Python data structures, since the script is persistent between connections.