I am trying to obtain the "Where from" extended file attribute, which is shown in the "Get Info" panel of a file in macOS.
Example
When right-clicking on the file and displaying its info, this metadata is shown.
The highlighted part in the image below shows the information I want to obtain (the URL of the website the file was downloaded from).
I want to access this Mac-specific attribute using Python.
I thought of using OS tools but could not find a suitable one.
TL;DR: Get extended attributes like macOS's "Where from" by e.g. pip-installing pyxattr and using xattr.getxattr("file.pdf", "com.apple.metadata:kMDItemWhereFroms").
Extended Attributes on files
Extended file attributes like the "Where from" entry in macOS (since 10.4) store metadata that is not interpreted by the filesystem. They exist on various operating systems.
using the command-line
You can also query them on the command-line with tools like:
exiftool:
exiftool -MDItemWhereFroms -MDItemTitle -MDItemAuthors -MDItemDownloadedDate /path/to/file
xattr (on macOS this is apparently implemented as a Python script)
xattr -p -l -x /path/to/file
On macOS many attributes are stored in property-list format, so use the -x option to obtain hexadecimal output.
using Python
Ture Pålsson pointed out the missing link: the right keywords. Such common and appropriate terms are helpful when searching the Python Package Index (PyPI).
Searching PyPI for keywords like extended file attributes and metadata turns up:
xattr
pyxattr
osxmetadata (requires Python 3.7+, macOS only)
For example, to list and get attributes, use (adapted from pyxattr's official docs):
import xattr
xattr.listxattr("file.pdf")
# ['user.mime_type', 'com.apple.metadata:kMDItemWhereFroms']
xattr.getxattr("file.pdf", "user.mime_type")
# 'text/plain'
xattr.getxattr("file.pdf", "com.apple.metadata:kMDItemWhereFroms")
# ['https://example.com/downloads/file.pdf']
However, you will have to convert the macOS-specific metadata, which is stored in plist format, e.g. using plistlib, as sketched below.
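A minimal sketch of that conversion (the file name is only an example; xattr.getxattr returns the attribute as raw bytes, and plistlib.loads requires Python 3.4+):

import plistlib
import xattr

# Read the raw binary plist stored in the extended attribute ...
raw = xattr.getxattr("file.pdf", "com.apple.metadata:kMDItemWhereFroms")
# ... and decode it into a regular Python list of URLs.
where_froms = plistlib.loads(raw)
print(where_froms)  # e.g. ['https://example.com/downloads/file.pdf']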
File metadata on macOS
Mac OS X 10.4 (Tiger) introduced Spotlight, a system for extracting (or harvesting), storing, indexing, and querying metadata. It provides an integrated system-wide service for searching and indexing.
This metadata is stored as extended file attributes whose keys are prefixed with com.apple.metadata:. The "Where from" attribute, for example, has the key com.apple.metadata:kMDItemWhereFroms.
using Python
Use osxmetadata to get functionality similar to macOS's md* utilities:
from osxmetadata import OSXMetaData
filename = 'file.pdf'
meta = OSXMetaData(filename)
# get and print "Where from" list, downloaded date, title
print(meta.wherefroms, meta.downloadeddate, meta.title)
See also
MacIssues (2014): How to look up file metadata in OS X
OSXDaily (2018): How to View & Remove Extended Attributes from a File on Mac OS
Ask Different: filesystem - What all file metadata is available in macOS?
Query Spotlight for a range of dates via PyObjC
Mac OS X : add a custom meta data field to any file
macOS stores metadata such as the "Where from" attribute under the key com.apple.metadata:kMDItemWhereFroms.
import xattr
value = xattr.getxattr("sublime_text_build_4121_mac.zip",'com.apple.metadata:kMDItemWhereFroms').decode("ISO-8859-1")
print(value)
'bplist00¢\x01\x02_\x10#https://download.sublimetext.com/sublime_text_build_4121_mac.zip_\x10\x1chttps://www.sublimetext.com/\x08\x0bN\x00\x00\x00\x00\x00\x00\x01\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00m'
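Rather than decoding the raw bytes as ISO-8859-1 text, you can parse the binary plist directly; a short sketch, assuming Python 3.4+ for plistlib.loads:

import plistlib
import xattr

# Parse the binary plist into a Python list of URLs instead of printing raw bytes.
urls = plistlib.loads(xattr.getxattr(
    "sublime_text_build_4121_mac.zip",
    "com.apple.metadata:kMDItemWhereFroms"))
print(urls)
# ['https://download.sublimetext.com/sublime_text_build_4121_mac.zip',
#  'https://www.sublimetext.com/']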
I had faced a similar problem long ago. We did not use Python to solve it.
I'm working on a Python NDB model on app engine that looks like:
class NDBPath(ndb.Model):
    path = ndb.StringProperty()
    directory = ndb.ComputedProperty(lambda self: getDirectory(self.path))
    cat = ndb.IntegerProperty(indexed=False)
Path is a file path, directory is the superdirectory of that file, and cat is some number. These entities are effectively read only after an initial load.
I query the datastore with various filepaths and want to pull out the cat property of an entity if either a) its path matches the queried path (same file), or b) if the entity's directory is in a superdirectory of the queried path. So I end up doing a query like:
NDBPath.query(NDBPath.directory.IN(generateSuperPaths(queriedPath)))
where generateSuperPaths lists all the superdirectories of the queried path in their full form (e.g. a/b/c/d.html --> [/a, /a/b, /a/b/c]).
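For illustration only, one possible shape of generateSuperPaths (this helper is my guess at what the question's function does, not code from the project):

def generateSuperPaths(queriedPath):
    # 'a/b/c/d.html' -> ['/a', '/a/b', '/a/b/c']
    parts = queriedPath.strip('/').split('/')[:-1]  # drop the file name itself
    return ['/' + '/'.join(parts[:i + 1]) for i in range(len(parts))]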
Because these are read only, using a computed property is effectively a waste of a write as it will never change. Is there any way to query based on a dynamically transformed value, like
NDBPath.query(getDirectory(NDBPath.path).IN(generateSuperPaths(queriedPath)))
So I can save writing the directory as a property and just use it in the query?
AFAIK, yes. They will end up doing the same thing, and it will save you write time and cost. It will, however, make your query run slower (because of the added computation), so be wary of possible timeouts.
I am working on a project where I have to store .hg directories. The easiest way is to pack the .hg into hg.tar. I save it in MongoDB's GridFS filesystem.
If I go with this plan, I have to read the tar out.
import tarfile, cStringIO as io
repo = get_repo(saved_repo.id)
ios = io.StringIO()
ios.write(repo.hgfile.read())
ios.seek(0)
tar = tarfile.open(mode='r', fileobj=ios)
members = tar.getmembers()
#for info in members:
#    tar.extract(info.name, '/tmp')
for file in members:
    print file.name, file.isdir()
This is working code. I can get all the file and directory names as the loop runs.
My question is how do I extract this tar into a valid, filesystem-like directory. I can .extractfile individual files into memory, but if I want to feed this into the Mercurial API, I probably need the entire .hg directory in memory, laid out the same way it exists on the filesystem.
Thoughts?
Mercurial has a concept called opener that's used to abstract filesystem access. I first looked at http://hg.intevation.org/mercurial/crew/file/tip/mercurial/revlog.py to see if you can replace the revlog class (which is the base class for changelog, manifest log and filelogs), but recent versions of Mercurial also have a VFS abstraction layer. It can be found in http://hg.intevation.org/mercurial/crew/file/8c64c4af21a4/mercurial/scmutil.py#l202 and is used by the localrepo.localrepository class for all file access.
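If true in-memory access is not strictly required, a simpler alternative (my assumption, not something the question or the answer above proposes) is to unpack the tar into a temporary directory and open it with Mercurial's Python API:

import tarfile
import tempfile
from mercurial import hg, ui

# Unpack the tarred repository (hg.tar as in the question) into a scratch directory.
tmpdir = tempfile.mkdtemp()
with tarfile.open('hg.tar') as tar:
    tar.extractall(tmpdir)

# Open the repository; tmpdir must now contain the .hg directory.
repo = hg.repository(ui.ui(), tmpdir)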
I have a directory of music files on Ubuntu (.mp3, .wav, etc.). This directory can have as many subdirectories as it needs, no limits. I want to be able to make a music library out of it - that is, return a list of songs based on filters of:
1) membership to playlist
2) artist name
3) string search
4) name of song
etc, etc
However, if files are renamed, moved, or even added to my Music directory, I need to be able to reflect this in my music organization engine - quickly!
I originally thought to just monitor my directory with pyinotify, incron, or inotify. Unfortunately my directory is a Samba share and so monitoring file events failed. So my next guess was to simply recursively search the directory in python, and populate a SQL database. Then when updating, I would just look to see if anything has changed (scanning each subfolder to see if each song's name is in the database already, and if not adding it), and make UPDATEs accordingly. Unfortunately, this seems to be a terrible O(n^2) implementation - awful for a multi-terabyte music collection.
A slightly better one might involve creating a tree structure in SQL, thus narrowing the possible candidates to search for a match at any given subfolder step to the size of that subfolder. Still seems inelegant.
What design paradigms/packages can I use to help myself out? This will obviously involve lots of clever hash tables. I'm just looking for some pointers in the right direction for how to approach the problem. (Also, I'm a complete junkie for optimization.)
The hard part of this is the scanning of the directory, just because it can be expensive.
But that's a cruel reality since you can't use inotify et al.
In your database, simply create a node type record:
create table node (
    nodeKey integer not null primary key,
    parentNode integer references node(nodeKey), -- allow null for the root, or have root point to itself, whatever
    fullPathName varchar(2048),
    nodeName varchar(2048),
    nodeType varchar(1) -- d = directory, f = file, or whatever else you want
)
That's your node structure.
You can use the full path column to quickly find anything by the absolute path.
When a file moves, simply recalculate the path.
Finally, scan your music files. In Unix, you can do something like:
find . -type f | sort > sortedListOfFiles
Next, simply suck all of the path names out of the database.
select fullPathName from node where nodeType != 'd' order by fullPathName
Now you have two sorted lists of files.
Run them through diff (or comm), and you'll have a list of deleted and new files. You won't have a list of "moved" files. If you want to do some heuristic where you compare new and old files that have the same endings (i.e. ..../album/song) to try to detect "moves" vs. new and old files, then fine, no big deal. Worth a shot.
But diff will give you your differential in a heartbeat.
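The same comparison is easy to do directly in Python; a minimal sketch, assuming the index lives in a SQLite file (music.db is a placeholder name) and the find output was saved to sortedListOfFiles:

import sqlite3

# Paths currently on disk, as produced by the find command above.
with open("sortedListOfFiles") as fh:
    on_disk = set(line.rstrip("\n") for line in fh)

# Paths currently recorded in the node table.
conn = sqlite3.connect("music.db")  # hypothetical database file
in_db = set(row[0] for row in
            conn.execute("select fullPathName from node where nodeType != 'd'"))

new_files = sorted(on_disk - in_db)      # on disk but not yet indexed
deleted_files = sorted(in_db - on_disk)  # indexed but no longer on disk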
If you have zillions of files, then, sorry, this is going to take some time - but you already knew that when you lost the inotify capability. If you had that, it would just be incremental maintenance.
When a file moves, it's trivial to find its new absolute path, because you can ask its parent for its path and simply append your name to it. After that, you're not crawling a tree or anything, unless you want to. Works both ways.
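A sketch of that recalculation, assuming the node table above lives in SQLite (recalculate_path is a hypothetical helper):

def recalculate_path(conn, node_key):
    # New absolute path = parent's stored fullPathName + this node's own name.
    parent_key, name = conn.execute(
        "select parentNode, nodeName from node where nodeKey = ?",
        (node_key,)).fetchone()
    parent_path = conn.execute(
        "select fullPathName from node where nodeKey = ?",
        (parent_key,)).fetchone()[0]
    return parent_path + "/" + name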
Addenda:
If you want to track actual name changes, you can get a little more information.
You can do this:
find . -type f -print0 | xargs -0 ls -i | sort -n > sortedListOfFileWithInode
The -print0 and -0 are used to work with files with spaces in them. Quotes in the file names will wreck this however. You might be better off running the raw list through python and fstat to get the inode. Different things you can do here.
What this does is rather than just having names, you also get the inode of the file. The inode is the "real" file, a directory links names to inodes. This is how you can have multiple names (hard links) in a unix file system to a single file, all of the names point to the same inode.
When a file is renamed, the inode will remain the same. In unix, there's a single command used for renaming, and moving files, mv. When mv renames or moves the file, the inode stays the same AS LONG AS THE FILE IS ON THE SAME FILE SYSTEM.
So, using the inode as well as the file name will let you capture some more interesting information, like file moves.
It won't help if they delete the file and add a new one. But you WILL (likely) be able to tell that it happened, since it is unlikely that an old inode will be reused for the new file.
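Following the suggestion above to gather inodes from Python instead of parsing ls -i output, a minimal sketch using os.walk and os.lstat (the music path is a placeholder):

import os

def inode_listing(root):
    # Yield (inode, absolute path) for every file under root.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            yield os.lstat(full).st_ino, full

# Sort by path so the result lines up with the name-sorted listings below.
listing = sorted(inode_listing("/home/username/music"), key=lambda pair: pair[1])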
So if you have a list of files (sorted by file name):
1234 song1.mp3
1235 song2.mp3
1236 song3.mp3
and someone removes and adds back song 2, you'll have something like
1234 song1.mp3
1237 song2.mp3
1236 song3.mp3
But if you do this:
mv song1.mp3 song4.mp3
You'll get:
1237 song2.mp3
1236 song3.mp3
1234 song4.mp3
The other caveat is that if you lose the drive and restore it from backup, likely all of the inodes will change, forcing effectively a rebuild of your index.
If you're really adventurous you can try playing with extended filesystem attributes and assigning other interesting metadata to files. I haven't done much with that, but it's got possibilities as well, and there are likely unseen dangers, but...
My aggregate_digup program reads an extended sha1sum.txt-format file produced by the digup program. This lets me locate a file based on its sha1sum. The digup program stores the mtime, size, hash and pathname in its output, and by default it skips hashing a file if the mtime and size match. The index produced by my aggregate_digup is used by my modified version of the open uri context menu gedit plugin, allowing one to option-click on sha1:b7d67986e54f852de25e2d803472f31fb53184d5 and have it list the copies of the file it knows about, so you can pick one and open it.
How this relates to the problem is that there are two parts: one, the playlists, and two, the files.
If we can assume that nothing the player does changes the files, then the hashes and sizes of the files are constant, so we should be able to use the size and hash of a file as a unique identifier.
For example, the key for the file mentioned above: 222415:b7d67986e54f852de25e2d803472f31fb53184d5
I've found that in practice this has no collisions in any natural collection.
(This does mean that the ID3 metadata, which is appended or prepended to the mp3 data, can't change unless you choose to skip that metadata while hashing.)
So the playlist database would be something like this:
files(file_key, hash, size, mtime, path, flag)
tracks(file_key, title, artist)
playlists(playlistid, index, file_key)
To update the files table (pseudocode mixing Python and SQL):

import os
import stat

# add new files:
update files set flag=0
for path in filesystem:
    s = os.stat(path)
    if stat.S_ISREG(s.st_mode):
        fetch first row of: select mtime, hash, size from files where path=path
        if row is not None:
            if s.st_mtime == mtime and s.st_size == size:
                update files set flag=1 where path=path
                continue
        hash = hash_file(path)
        file_key = "%s:%s" % (s.st_size, hash)  # size:hash, as described above
        insert or update files set file_key=file_key, size=s.st_size, mtime=s.st_mtime, hash=hash, flag=1 where path=path

# remove non-existent files:
delete from files where flag=0
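The hash_file used above isn't shown; one possible sketch (plain SHA-1 over the whole file, so it does not skip the ID3 tags mentioned in the caveat):

import hashlib

def hash_file(path, chunk_size=1 << 20):
    # Hash the file contents in chunks to keep memory use flat.
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()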
The reality is, this is a hard problem. You're starting from a disadvantage as well: Python and MySQL aren't the fastest tools to use for this purpose.
Even iTunes gets complaints about the time it takes to import libraries and index new files. Can you imagine the man-hours that went into making iTunes as good as it is?
Your best bet is to look at the code of major open source music players such as
Miro, http://www.getmiro.com/,
Banshee, http://banshee.fm/, and
Songbird, http://getsongbird.com/
and try to adapt their algorithms to your purpose and to Python idioms.
import os
import re

# Your other code here that initially sets up a dictionary containing which
# files you already have in your library (I called the dictionary archived_music).

music_directory = '/home/username/music'
music_type = re.compile(r'\.mp3$|\.wav$|\.etc$')

found_files = os.popen('find %s -type f -mtime -1 2>/dev/null' % music_directory)
for file in found_files:
    file = file.strip()  # drop the trailing newline from find's output
    directory, filename = os.path.split(file)
    if music_type.search(filename):
        # found a music file, check if you already have it in the library
        if filename in archived_music:
            continue
        # if you have gotten to this point, the music was not found in the
        # archived music directory, so now perform whatever processing you
        # would like to do on the full path found in file.
You can use this code as a little function or whatever and call it at whatever time resolution you would like. It uses the find command to find every file modified within the last day. It then checks whether each one is of type music_type; if it is, it checks the filename against whatever current database you have set up, and you can continue processing from there. This should get you started with updating newly added music or whatnot.
I've done something similar in the past, but ended up utilizing Amarok w/ MySQL. Amarok will create a mysql database for you and index all your files quite nicely - after that interfacing with the database should be relatively straightforward from python.
It was quite a time saver for me :)
HTH
I have users upload files into a fake directory structure using a database. I have fields for the parent path, the filename, and the file itself (the field is of type "upload"), which I set in my controller. I can see that files are properly being stored in the uploads directory, so that is working. Just for reference, I store the files using
db.allfiles.insert(filename=filename,
                   parentpath=parentpath,
                   file=db.allfiles.file.store(file.file, filename),
                   datecreated=now, user=me)
I am trying to set up a function for downloading files as well so a user can download files using something like app/controller/function/myfiles/image.jpg. I find the file using this code:
file = db((db.allfiles.parentpath == parentpath) &
          (db.allfiles.filename == filename) &
          (db.allfiles.user == me)).select()[0]
and I tried returning file.file, but what I got back were strings like:
allfiles.file.89fe64038f1de7be.6d6f6e6b65792d372e6a7067.jpg
which is the filename stored in the database. I tried this code:
os.path.join(request.folder,('uploads/'),'/'.join(file.file))
but I'm getting this path:
/home/charles/web2py/applications/chips/uploads/a/l/l/f/i/l/e/s/./f/i/l/e/./8/9/f/e/6/4/0/3/8/f/1/d/e/7/b/e/./6/d/6/f/6/e/6/b/6/5/7/9/2/d/3/7/2/e/6/a/7/0/6/7/./j/p/g
I think this is a special type of string, or maybe file.file isn't exactly a string. Is there some way I can return the file to the user through my function?
You're almost right. Try:
os.path.join(request.folder,'uploads',file.file)
Python strings are sequence types, and therefore iterable. When you submit a single string as an argument to the join method, it iterates over each character in the string. So, for example:
>>> '/'.join('hello')
'h/e/l/l/o'
Also, note that os.path.join will automatically separate its arguments by the appropriate path separator for your OS (i.e., os.path.sep), so no need to insert slashes manually.
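Putting it together, a minimal download sketch (the function name is hypothetical and it assumes file was looked up as in the question; web2py's response.stream does the serving):

import os

def download_my_file():
    path = os.path.join(request.folder, 'uploads', file.file)
    # Stream the stored file back to the user instead of returning the path string.
    return response.stream(open(path, 'rb'), chunk_size=4096)

Note that web2py also provides a response.download(request, db) helper intended for serving "upload"-type fields, which may be simpler still.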