Need a file system metadata layer for applications


I'm looking for a metadata layer that sits on top of files which can interpret key-value pairs of information in file names for apps that work with thousands of files. More info:
These aren't necessarily media files that have built-in metadata - hence the key-value pairs.
The metadata goes beyond os information (file sizes, etc) - to whatever the app puts into the key-values.
It should be accessible via command line as well as a python module so that my applications can talk to it.
ADDED: It should also be supported by common os commands (cp, mv, tar, etc) so that it doesn't get lost if a file is copied or moved.
Examples of functionality I'd like include:
list files in directory x for organization_id 3375
report on files in directory y by translating load_time to year/month and show file count & size for each year/month combo
get oldest file in directory z based upon key of loadtime
Files with this simple metadata embedded within them might look like:
bowling_state-ky_league-15_game-8_gametime-201209141830.tgz
bowling_state-ky_league-15_game-9_gametime-201209141930.tgz
This metadata is very accessible & tightly joined to the file. But - I'd prefer to avoid needing to use cut or wild-cards for all operations.
I've looked around and can only find media & os metadata solutions and don't want to build something if it already exists.
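To make the naming scheme concrete, here is a minimal sketch of parsing such names in Python. The function name parse_name is hypothetical, and the sketch assumes that neither keys nor values contain an underscore and that the part before the first underscore is a plain prefix (so a key such as organization_id would need a different separator).

```python
import os

def parse_name(filename):
    """Split 'prefix_key-value_key-value.ext' into (prefix, metadata dict)."""
    stem = os.path.basename(filename).split(".", 1)[0]
    prefix, *pairs = stem.split("_")
    meta = {}
    for pair in pairs:
        # Everything before the first '-' is the key, the rest is the value.
        key, _, value = pair.partition("-")
        meta[key] = value
    return prefix, meta

names = [
    "bowling_state-ky_league-15_game-8_gametime-201209141830.tgz",
    "bowling_state-ky_league-15_game-9_gametime-201209141930.tgz",
]
parsed = {name: parse_name(name)[1] for name in names}
# "Oldest file" by the gametime key; fixed-width timestamps sort correctly as strings.
oldest = min(names, key=lambda n: parsed[n]["gametime"])
print(oldest)
```

All values come back as strings, so numeric comparisons or date grouping would need an explicit conversion.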

Have you looked at extended file attributes? See: http://en.wikipedia.org/wiki/Extended_file_attributes
Basically, you store the key-value pairs as zero terminated strings in the filesystem itself. You can set these attributes from the command line like this:
$ setfattr -n user.comment -v "this is a comment" testfile
$ getfattr testfile
# file: testfile
user.comment
$ getfattr -n user.comment testfile
# file: testfile
user.comment="this is a comment"
To set and query extended file system attributes from python, you can try the python module xattr. See: http://pypi.python.org/pypi/xattr
EDIT
Most file manipulation commands, such as cp, mv and tar, can preserve extended attributes, but usually only when given the right command-line flags, e.g. cp -a or tar --xattrs. You may need these commands to work transparently (you may have users who are unaware of your extended attributes). In that case you can create an alias, e.g. alias cp="cp -a".
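On Linux, Python 3 can also read and write extended attributes without a third-party module, through os.setxattr, os.getxattr and os.listxattr. The following is a minimal sketch under that assumption; the helper names set_meta and get_meta are hypothetical, and the calls raise OSError on filesystems that do not support user attributes.

```python
import os

def set_meta(path, key, value):
    """Store one key-value pair as a 'user.' extended attribute (Linux only)."""
    os.setxattr(path, "user." + key, value.encode("utf-8"))

def get_meta(path):
    """Return all 'user.' attributes of a file as a dict of strings."""
    meta = {}
    for name in os.listxattr(path):
        if name.startswith("user."):
            meta[name[len("user."):]] = os.getxattr(path, name).decode("utf-8")
    return meta
```

For example, set_meta("testfile", "organization_id", "3375") followed by get_meta("testfile") would return a dict containing that pair, provided the filesystem accepts the attribute.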

As already discussed, xattrs are a good solution when available. However, when you can't use xattrs:
NTFS alternate data streams
On Microsoft Windows, xattrs are not available, but NTFS alternate data streams provide a similar feature. ADSs let you store arbitrary amounts of data together with the main stream of a file. They are accessed using
drive:\path\to\file:streamname
An ADS is effectively just its own file with a special name. Apparently you can access them from Python by specifying a filename containing a colon:
open(r"drive:\path\to\file:streamname", "wb")
and then using it like an ordinary file. (Disclaimer: not tested.)
From the command line, use Microsoft's streams program.
Since ADSs store arbitrary binary data, you are responsible for writing the querying functionality.
SQLite
SQLite is an embedded RDBMS that you can use. Store the .sqlite database file alongside your directory tree.
For each file you add, also record each file in a table:
CREATE TABLE file (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT
);
Then, for example, you could store a piece of metadata as a table:
CREATE TABLE organization_id (
    file_id INTEGER PRIMARY KEY,
    value INTEGER,
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Then you can query on it:
SELECT path FROM file NATURAL JOIN organization_id
WHERE value == 3375 AND path LIKE '/x/%';
Alternatively, if you want a pure key-value store, you can store all the metadata in one table:
CREATE TABLE metadata (
    file_id INTEGER,
    key TEXT,
    value TEXT,
    PRIMARY KEY(file_id, key),
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Query:
SELECT path FROM file NATURAL JOIN metadata
WHERE key == 'organization_id' AND value == 3375 AND path LIKE '/x/%';
Obviously it's your responsibility to update the database whenever you read or write a file. You must also make sure these updates are atomic (e.g. add a column active to the file table; when adding a file: set active = FALSE, write file, fsync, set active = TRUE, and as cleanup delete any files that have active = FALSE).
The Python standard library includes SQLite support as the sqlite3 package.
From the command line, use the sqlite3 program.
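As a rough sketch of how the key-value variant could be driven from Python, the following uses the schema above with an in-memory database. The helper add_file and the sample paths are made up for illustration; a real application would point sqlite3.connect at a database file next to the directory tree.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # use a file path such as "meta.sqlite" in practice
db.executescript("""
CREATE TABLE file (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT
);
CREATE TABLE metadata (
    file_id INTEGER,
    key TEXT,
    value TEXT,
    PRIMARY KEY(file_id, key),
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
""")

def add_file(path, **meta):
    """Record a file and its key-value pairs in one transaction."""
    with db:  # commits on success, rolls back on error
        file_id = db.execute("INSERT INTO file (path) VALUES (?)", (path,)).lastrowid
        db.executemany(
            "INSERT INTO metadata (file_id, key, value) VALUES (?, ?, ?)",
            [(file_id, k, str(v)) for k, v in meta.items()],
        )

add_file("/x/a.tgz", organization_id=3375)
add_file("/x/b.tgz", organization_id=12)
add_file("/y/c.tgz", organization_id=3375)

# "list files in directory x for organization_id 3375"
rows = db.execute(
    "SELECT path FROM file NATURAL JOIN metadata "
    "WHERE key = ? AND value = ? AND path LIKE ?",
    ("organization_id", "3375", "/x/%"),
).fetchall()
print(rows)
```

Values are stored as text here, so the query passes "3375" as a string parameter.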

Xattrs have size limits (on ext4, up to about 4 KB), and on Linux the keys of user-defined attributes must be prefixed with 'user.'.
Also, not all file systems support xattrs.
Try the iDB.py library, which wraps xattr and makes it easy to switch xattr support off.

Related

Obtaining metadata "Where from" of a file on Mac

I am trying to obtain the "Where from" extended file attribute which is located on the "get info" context-menu of a file in MacOS.
Example
When right-clicking on the file and displaying the info, it shows this metadata.
The highlighted part in the image below shows the information I want to obtain (the link of the website where the file was downloaded from).
I want to use this Mac-specific function using Python.
I thought of using OS tools but couldn't figure out any.

TL;DR: Get the extended attribute like MacOS's "Where from" by e.g. pip-install pyxattr and use xattr.getxattr("file.pdf", "com.apple.metadata:kMDItemWhereFroms").
Extended Attributes on files
These extended file attributes like your "Where From" in MacOS (since 10.4) store metadata not interpreted by the filesystem. They exist for different operating systems.
using the command-line
You can also query them on the command-line with tools like:
exiftool:
exiftool -MDItemWhereFroms -MDItemTitle -MDItemAuthors -MDItemDownloadedDate /path/to/file
xattr (apparently MacOS also uses a Python-script)
xattr -p -l -x /path/to/file
On MacOS many attributes are displayed in property-list format, thus use -x option to obtain hexadecimal output.
using Python
Ture Pålsson pointed out the missing link keywords. Such common and appropriate terms are helpful to search Python Package Index (PyPi):
Search PyPi by keywords: extend file attributes, meta data:
xattr
pyxattr
osxmetadata, requires Python 3.7+, MacOS only
For example to list and get attributes use (adapted from pyxattr's official docs)
import xattr
xattr.listxattr("file.pdf")
# ['user.mime_type', 'com.apple.metadata:kMDItemWhereFroms']
xattr.getxattr("file.pdf", "user.mime_type")
# b'text/plain'
xattr.getxattr("file.pdf", "com.apple.metadata:kMDItemWhereFroms")
# raw bytes of a binary plist (they start with b'bplist00'), not a decoded list of URLs
However you will have to convert the MacOS specific metadata which is stored in plist format, e.g. using plistlib.
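A minimal sketch of that conversion step with the standard library's plistlib follows. The function name decode_where_froms is hypothetical, and the sample bytes are generated locally as a stand-in for the value returned by xattr.getxattr, so the sketch also runs on systems that do not have the attribute.

```python
import plistlib

def decode_where_froms(raw):
    """Decode the raw bytes of com.apple.metadata:kMDItemWhereFroms into a list of URLs."""
    return plistlib.loads(raw)  # the plist format (binary or XML) is detected automatically

# Stand-in for xattr.getxattr("file.pdf", "com.apple.metadata:kMDItemWhereFroms")
sample = plistlib.dumps(
    ["https://example.com/downloads/file.pdf", "https://example.com/"],
    fmt=plistlib.FMT_BINARY,
)
print(decode_where_froms(sample))
```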
File metadata on MacOS
Mac OS X 10.4 (Tiger) introduced Spotlight, a system for extracting (or harvesting), storing, indexing, and querying metadata. It provides an integrated system-wide service for searching and indexing.
This metadata is stored as extended file attributes having keys prefixed with com.apple.metadata:. The "Where from" attribute for example has the key com.apple.metadata:kMDItemWhereFroms.
using Python
Use osxmetadata for functionality similar to MacOS's md* utils:
from osxmetadata import OSXMetaData
filename = 'file.pdf'
meta = OSXMetaData(filename)
# get and print "Where from" list, downloaded date, title
print(meta.wherefroms, meta.downloadeddate, meta.title)
See also
MacIssues (2014): How to look up file metadata in OS X
OSXDaily (2018): How to View & Remove Extended Attributes from a File on Mac OS
Ask Different: filesystem - What all file metadata is available in macOS?
Query Spotlight for a range of dates via PyObjC
Mac OS X : add a custom meta data field to any file

macOS stores metadata such as the "Where from" attribute under the key com.apple.metadata:kMDItemWhereFroms.
import xattr
value = xattr.getxattr("sublime_text_build_4121_mac.zip",'com.apple.metadata:kMDItemWhereFroms').decode("ISO-8859-1")
print(value)
'bplist00¢\x01\x02_\x10#https://download.sublimetext.com/sublime_text_build_4121_mac.zip_\x10\x1chttps://www.sublimetext.com/\x08\x0bN\x00\x00\x00\x00\x00\x00\x01\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00m'
I had faced a similar problem long ago. We did not use Python to solve it.

Query GAE Datastore on Transformed Property Values

I'm working on a Python NDB model on app engine that looks like:
class NDBPath(ndb.Model):
    path = ndb.StringProperty()
    directory = ndb.ComputedProperty(lambda self: getDirectory(self.path))
    cat = ndb.IntegerProperty(indexed=False)
Path is a file path, directory is the superdirectory of that file, and cat is some number. These entities are effectively read only after an initial load.
I query the datastore with various filepaths and want to pull out the cat property of an entity if either a) its path matches the queried path (same file), or b) if the entity's directory is in a superdirectory of the queried path. So I end up doing a query like:
NDBPath.query(NDBPath.directory.IN(generateSuperPaths(queriedPath)))
Where generateSuperPaths lists all the superdirectories in their full form of the queried Path (eg a/b/c/d.html --> [/a, /a/b, /a/b/c])
Because these are read only, using a computed property is effectively a waste of a write as it will never change. Is there any way to query based on a dynamically transformed value, like
NDBPath.query(getDirectory(NDBPath.path).IN(generateSuperPaths(queriedPath)))
So I can save writing the directory as a property and just use it in the query?
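For reference, a helper with the behaviour described for generateSuperPaths could look roughly like this. The question does not show the real implementation, so this is only an assumed version (named generate_super_paths here) that reproduces the a/b/c/d.html example.

```python
import posixpath

def generate_super_paths(path):
    """List every ancestor directory of path in full form, shortest first."""
    # Normalise to exactly one leading slash so relative and absolute inputs agree.
    path = posixpath.normpath("/" + path.strip("/"))
    supers = []
    directory = posixpath.dirname(path)
    while directory != "/":
        supers.append(directory)
        directory = posixpath.dirname(directory)
    return supers[::-1]

print(generate_super_paths("a/b/c/d.html"))
```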

AFAIK, yes. They will end up doing the same and it will save you on write time and cost. Although it will make your query run slower (because of the added computation), so be wary of possible timeouts.

Render hg.tar into in-memory tree-structure after getting back from MongoDB

I am working on a project where I have to store .hg directories. The easiest way is to pack the .hg into hg.tar. I save it in MongoDB's GridFS filesystem.
If I go with this plan, I have to read the tar out.
import tarfile, cStringIO as io
repo = get_repo(saved_repo.id)
ios = io.StringIO()
ios.write(repo.hgfile.read())
ios.seek(0)
tar = tarfile.open(mode='r', fileobj=ios)
members = tar.getmembers()
#for info in members:
#    tar.extract(info.name, '/tmp')
for file in members:
    print file.name, file.isdir()
This code works: I get all the file and directory names as the loop runs.
My question is how to extract this tar into a valid, file-system-like directory structure. I can .extractfile each member into memory individually, but if I want to feed it to the Mercurial API, I probably need the entire .hg directory in memory as a single directory tree, the way it exists in the filesystem.
Thoughts?
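One possible shape for such an in-memory structure is a tree of nested dicts, sketched below for Python 3 (io.BytesIO instead of cStringIO). The function name tar_to_tree is hypothetical, and the tiny archive built at the end only stands in for repo.hgfile.read(). Mercurial would still need an adapter on top of this, which the sketch does not attempt.

```python
import io
import tarfile

def tar_to_tree(data):
    """Unpack tar bytes into nested dicts: directories -> dict, files -> bytes."""
    tree = {}
    with tarfile.open(mode="r", fileobj=io.BytesIO(data)) as tar:
        for member in tar.getmembers():
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if not parts:
                continue
            node = tree
            for part in parts[:-1]:
                node = node.setdefault(part, {})  # create parent directories on demand
            if member.isdir():
                node.setdefault(parts[-1], {})
            elif member.isfile():
                node[parts[-1]] = tar.extractfile(member).read()
    return tree

# Build a tiny archive in memory to stand in for repo.hgfile.read()
buf = io.BytesIO()
with tarfile.open(mode="w", fileobj=buf) as tar:
    payload = b"revlogv1\n"
    info = tarfile.TarInfo(".hg/requires")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

tree = tar_to_tree(buf.getvalue())
print(tree)
```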

Mercurial has a concept called opener that's used to abstract filesystem access. I first looked at http://hg.intevation.org/mercurial/crew/file/tip/mercurial/revlog.py to see if you can replace the revlog class (which is the base class for changelog, manifest log and filelogs), but recent versions of Mercurial also have a VFS abstraction layer. It can be found in http://hg.intevation.org/mercurial/crew/file/8c64c4af21a4/mercurial/scmutil.py#l202 and is used by the localrepo.localrepository class for all file access.

I want a clever algorithm for indexing a file directory...pointers?

I have a directory of music on Ubuntu (.mp3, .wav, etc) files. This directory can have as many sub directories as it needs, no limits. I want to be able to make a music library out of it - that is, return list of songs based on filters of:
1) membership to playlist
2) artist name
3) string search
4) name of song
etc, etc
However, if files are renamed, moved, or even added to my Music directory, I need to be able to reflect this in my music organization engine - quickly!
I originally thought to just monitor my directory with pyinotify, incron, or inotify. Unfortunately my directory is a Samba share and so monitoring file events failed. So my next guess was to simply recursively search the directory in python, and populate a SQL database. Then when updating, I would just look to see if anything has changed (scanning each subfolder to see if each song's name is in the database already, and if not adding it), and make UPDATEs accordingly. Unfortunately, this seems to be a terrible O(n^2) implementation - awful for a multi-terabyte music collection.
A slightly better one might involve creating a tree structure in SQL, thus narrowing the possible candidates to search for a match at any given subfolder step to the size of that subfolder. Still seems inelegant.
What design paradigms/packages can I use to help myself out? Obviously will involve lots of clever hash tables. I'm just looking for some pointers in the right direction for how to approach the problem. (Also I'm a complete junkie for optimization.)

The hard part of this is the scanning of the directory, just because it can be expensive.
But that's a cruel reality since you can't use inotify et al.
In your database, simply create a node type record:
create table node (
    nodeKey integer not null primary key,
    parentNode integer references node(nodeKey), -- allow null for the root, or have root point to itself, whatever
    fullPathName varchar(2048),
    nodeName varchar(2048),
    nodeType varchar(1) -- d = directory, f = file, or whatever else you want
)
That's your node structure.
You can use the full path column to quickly find anything by the absolute path.
When a file moves, simply recalculate the path.
Finally, scan your music files. In unix, you can do something like:
find . -type f | sort > sortedListOfFiles
Next, simply suck all of the path names out of the database.
select fullPathName from node where nodeType != 'd' order by fullPathName
Now you have two sorted lists of files.
Run them through DIFF (or comm), and you'll have a list of deleted and new files. You won't have a list of "moved" files. If you want to do some heuristic where you compare new and old files and they have the same endings (i.e. ..../album/song) to try and detect "moves" vs new and old, then fine, no big deal. Worth a shot.
But diff will give you your differential in a heartbeat.
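If the comparison is done inside Python instead of with diff or comm, a set difference gives the same two lists. This is a swapped-in technique, not the sorted-merge that diff performs; the function name diff_paths and the sample paths are made up for illustration.

```python
def diff_paths(on_disk, in_db):
    """Return (new, deleted): paths only on disk, and paths only in the database."""
    disk, db = set(on_disk), set(in_db)
    return sorted(disk - db), sorted(db - disk)

on_disk = ["./a/song1.mp3", "./a/song3.mp3", "./b/song4.mp3"]
in_db = ["./a/song1.mp3", "./a/song2.mp3", "./a/song3.mp3"]
new, deleted = diff_paths(on_disk, in_db)
print(new, deleted)
```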
If you have zillions of files, then, sorry, this is going to take some time -- but you already knew that when you lost the inotify capability. If you had that, it would just be incremental maintenance.
When a file moves, it's trivial to find its new absolute path, because you can ask its parent for its path and simply append your name to it. After that, you're not crawling a tree or anything, unless you want to. Works both ways.
Addenda:
If you want to track actual name changes, you can get a little more information.
You can do this:
find . -type f -print0 | xargs -0 ls -i | sort -n > sortedListOfFileWithInode
The -print0 and -0 are used to work with files with spaces in them. Newlines in the file names will still wreck the output, however. You might be better off running the raw list through Python and os.stat to get the inode. Different things you can do here.
What this does is rather than just having names, you also get the inode of the file. The inode is the "real" file, a directory links names to inodes. This is how you can have multiple names (hard links) in a unix file system to a single file, all of the names point to the same inode.
When a file is renamed, the inode will remain the same. In unix, there's a single command used for renaming, and moving files, mv. When mv renames or moves the file, the inode stays the same AS LONG AS THE FILE IS ON THE SAME FILE SYSTEM.
So, using the inode as well as the file name will let you capture some more interesting information, like file moves.
It won't help if they delete the file and add a new file. But you WILL (likely) be able to tell that it happened, since it is unlikely that an old inode will be reused for the new file.
So if you have a list of files (sorted by file name):
1234 song1.mp3
1235 song2.mp3
1236 song3.mp3
and someone removes and adds back song 2, you'll have something like
1234 song1.mp3
1237 song2.mp3
1236 song3.mp3
But if you do this:
mv song1.mp3 song4.mp3
You'll get:
1237 song2.mp3
1236 song3.mp3
1234 song4.mp3
The other caveat is that if you lose the drive and restore it from backup, likely all of the inodes will change, forcing effectively a rebuild of your index.
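The inode comparison can be sketched in Python with os.stat, which exposes the inode number as st_ino. The function names snapshot and find_moves are hypothetical, and the sketch assumes all files live on one filesystem and have no hard links (two names for one inode would overwrite each other in the dict).

```python
import os
import tempfile

def snapshot(root):
    """Map inode number -> relative path for every file under root."""
    seen = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            seen[os.stat(path).st_ino] = os.path.relpath(path, root)
    return seen

def find_moves(before, after):
    """Return (old_path, new_path) pairs whose inode survived under a new name."""
    return sorted(
        (before[ino], after[ino])
        for ino in before.keys() & after.keys()
        if before[ino] != after[ino]
    )

with tempfile.TemporaryDirectory() as root:
    for name in ("song1.mp3", "song2.mp3", "song3.mp3"):
        open(os.path.join(root, name), "w").close()
    before = snapshot(root)
    # Same operation as: mv song1.mp3 song4.mp3
    os.rename(os.path.join(root, "song1.mp3"), os.path.join(root, "song4.mp3"))
    moves = find_moves(before, snapshot(root))
    print(moves)
```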
If you're real adventurous you can try playing with extended file system attributes and assign other interesting meta data to files. Haven't done much with that, but it's got possibilities as well, and there are likely unseen dangers, but...

My aggregate_digup program reads an extended sha1sum.txt format file produced by the digup program. This lets me locate a file based on its sha1sum. The digup program stores the mtime, size, hash and pathname in its output. By default it skips hashing a file if the mtime and size match. The index produced by my aggregate_digup is used by my modified version of the open uri context menu gedit plugin, allowing one to option-click on sha1:b7d67986e54f852de25e2d803472f31fb53184d5 and it will list the copies of the file it knows about so you can pick one and open it.
How this relates to the problem is that there are two parts: one, the playlists, and two, the files.
If we can assume that nothing the player does changes the files, then the hash and sizes of the files are constant, so we should be able to use the size and hash of a file as a unique identifier.
For example, the key for the file mentioned: 222415:b7d67986e54f852de25e2d803472f31fb53184d5
I've found that in practice this has no collisions in any natural collection.
(This does mean that the ID3 metadata, which is appended or prepended to the mp3 data, can't change unless you choose to skip that metadata while hashing.)
So the playlist database would be something like this:
files(file_key, hash, size, mtime, path, flag)
tracks(file_key, title, artist)
playlists(playlistid, index, file_key)
to update the files table:
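import hashlib
import os
import stat

def hash_file(path):
    # sha1 of the file contents, read in 1 MiB chunks
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def update_files(db, root):
    # assumes db is a sqlite3 connection and that files.path is declared unique
    db.execute("update files set flag=0")
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            s = os.stat(path)
            if not stat.S_ISREG(s.st_mode):
                continue
            row = db.execute("select mtime, size from files where path=?", (path,)).fetchone()
            if row is not None and row[0] == s.st_mtime and row[1] == s.st_size:
                # unchanged since the last scan: skip hashing, just mark as seen
                db.execute("update files set flag=1 where path=?", (path,))
                continue
            # add new or changed files; the key is size:hash, as in the example above
            digest = hash_file(path)
            file_key = "%s:%s" % (s.st_size, digest)
            db.execute("insert or replace into files (file_key, hash, size, mtime, path, flag) values (?, ?, ?, ?, ?, 1)",
                       (file_key, digest, s.st_size, s.st_mtime, path))
    # remove non-existent files:
    db.execute("delete from files where flag=0")
    db.commit()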

The reality is, this is a hard problem. You're starting from a disadvantage as well: Python and MySQL aren't the fastest tools to use for this purpose.
Even iTunes is complained about because of the time it takes to import libraries and index new files. Can you imagine the man hours that went into making iTunes as good as it is?
Your best bet is to look at the code of major open source music players such as
Miro, http://www.getmiro.com/,
Banshee, http://banshee.fm/, and
Songbird, http://getsongbird.com/
And try to adapt their algorithms to your purpose and to Python idioms.

import os
import re

# Your other code goes here: it should set up a dictionary of the files you
# already have in your library (called archived_music below).
archived_music = {}
music_directory = '/home/username/music'
music_type = re.compile(r'\.mp3$|\.wav$|\.etc$')
# -mtime -1 matches files modified within the last 24 hours
found_files = os.popen('find %s -type f -mtime -1 2>/dev/null' % music_directory)
for line in found_files:
    file = line.rstrip('\n')
    directory, filename = os.path.split(file)
    if music_type.search(filename):
        # found a music file, check if you already have it in the library
        if filename in archived_music:
            continue
        # If you have gotten to this point, the music was not found in the
        # archived music dictionary, so now perform whatever processing you
        # would like to do on the full path found in file.
You can use this code as a little function or whatever and call it at whatever time resolution you would like. It uses the find command to find every file modified within the last day (find tests the modification time, not the creation time). It then checks whether the file is of type music_type; if it is, it checks the filename against whatever current database you have set up, and you can continue processing from there. This should be able to get you started on updating newly added music or whatnot.

I've done something similar in the past, but ended up utilizing Amarok w/ MySQL. Amarok will create a mysql database for you and index all your files quite nicely - after that interfacing with the database should be relatively straightforward from python.
It was quite a time saver for me :)
HTH

Returning a file in uploads directory with web2py - strings issue

I have users upload files into a fake directory structure using a database. I have fields for the parent path & the filename & the file (file is of type "upload") that I set using my controller. I can see that files are properly being stored in the uploads directory so that is working. Just for reference I store the files using
db.allfiles.insert(filename=filename, \
parentpath=parentpath, \
file=db.allfiles.file.store(file.file,filename), \
datecreated=now,user=me)
I am trying to set up a function for downloading files as well so a user can download files using something like app/controller/function/myfiles/image.jpg. I find the file using this code:
file=db((db.allfiles.parentpath==parentpath)&\
(db.allfiles.filename==filename)&\
(db.allfiles.user==me)).select()[0]
and I tried returning file.file, but what I got back were jpg files containing strings like:
allfiles.file.89fe64038f1de7be.6d6f6e6b65792d372e6a7067.jpg
Which is the filename in the database. I tried this code:
os.path.join(request.folder,('uploads/'),'/'.join(file.file))
but I'm getting this path:
/home/charles/web2py/applications/chips/uploads/a/l/l/f/i/l/e/s/./f/i/l/e/./8/9/f/e/6/4/0/3/8/f/1/d/e/7/b/e/./6/d/6/f/6/e/6/b/6/5/7/9/2/d/3/7/2/e/6/a/7/0/6/7/./j/p/g
I think this is a special type of string, or maybe file.file isn't exactly a string. Is there some way I can return the file to the user through my function?

You're almost right. Try:
os.path.join(request.folder,'uploads',file.file)

Python strings are sequence types, and therefore iterable. When you submit a single string as an argument to the join method, it iterates over each character in the string. So, for example:
>>> '/'.join('hello')
'h/e/l/l/o'
Also, note that os.path.join will automatically separate its arguments by the appropriate path separator for your OS (i.e., os.path.sep), so no need to insert slashes manually.
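Applied to the path from the question, the difference looks like this. The sketch uses posixpath (which is what os.path is on Linux and macOS) so that the result is the same on every platform, and the folder variable only stands in for request.folder.

```python
import posixpath

stored_name = "allfiles.file.89fe64038f1de7be.6d6f6e6b65792d372e6a7067.jpg"
folder = "/home/charles/web2py/applications/chips"  # stands in for request.folder

# '/'.join() iterates over the characters of the string, giving a/l/l/f/...
wrong = posixpath.join(folder, "uploads/", "/".join(stored_name))
# Passing the name as its own argument keeps it intact.
right = posixpath.join(folder, "uploads", stored_name)
print(right)
```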
