I am trying to obtain the "Where from" extended file attribute that is shown in the "Get Info" window of a file in macOS.
Example
When right-clicking the file and displaying its info, this metadata is shown. The part I want to obtain is the link to the website the file was downloaded from.
I want to access this Mac-specific metadata using Python.
I thought of using OS tools but couldn't figure out any.
TL;DR: Get extended attributes like macOS's "Where from" by e.g. pip install pyxattr, then use xattr.getxattr("file.pdf", "com.apple.metadata:kMDItemWhereFroms").
Extended Attributes on files
These extended file attributes, like your "Where from" on macOS (since 10.4), store metadata that is not interpreted by the filesystem. They exist on various operating systems.
using the command-line
You can also query them on the command-line with tools like:
exiftool:
exiftool -MDItemWhereFroms -MDItemTitle -MDItemAuthors -MDItemDownloadedDate /path/to/file
xattr (apparently macOS itself implements this tool as a Python script)
xattr -l -x /path/to/file
On macOS many attributes are stored in property-list format; the -x option displays their values as hexadecimal dumps. To print a single attribute, use xattr -p com.apple.metadata:kMDItemWhereFroms -x /path/to/file.
using Python
Ture Pålsson pointed out the missing keywords. Such common and well-established terms are helpful when searching the Python Package Index (PyPI).
Search PyPI by keywords like extended file attributes or metadata:
xattr
pyxattr
osxmetadata (requires Python 3.7+, macOS only)
For example, to list and get attributes, use (adapted from pyxattr's official docs):
import xattr
xattr.listxattr("file.pdf")
# ['user.mime_type', 'com.apple.metadata:kMDItemWhereFroms']
xattr.getxattr("file.pdf", "user.mime_type")
# 'text/plain'
xattr.getxattr("file.pdf", "com.apple.metadata:kMDItemWhereFroms")
# ['https://example.com/downloads/file.pdf']
However, you will have to convert the macOS-specific metadata, which is stored in plist format, e.g. using plistlib.
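For example, a minimal sketch that decodes the "Where from" value with the standard library's plistlib (the file name is a placeholder):
import plistlib
import xattr

# the attribute value is a binary property list (bplist)
raw = xattr.getxattr("file.pdf", "com.apple.metadata:kMDItemWhereFroms")

# parse it; for kMDItemWhereFroms the result is a list of URL strings
print(plistlib.loads(raw))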
File metadata on macOS
Mac OS X 10.4 (Tiger) introduced Spotlight, a system for extracting (or harvesting), storing, indexing, and querying metadata. It provides an integrated, system-wide service for searching and indexing.
This metadata is stored as extended file attributes having keys prefixed with com.apple.metadata:. The "Where from" attribute for example has the key com.apple.metadata:kMDItemWhereFroms.
using Python
Use osxmetadata to get functionality similar to macOS's md* utilities:
from osxmetadata import OSXMetaData
filename = 'file.pdf'
meta = OSXMetaData(filename)
# get and print "Where from" list, downloaded date, title
print(meta.wherefroms, meta.downloadeddate, meta.title)
See also
MacIssues (2014): How to look up file metadata in OS X
OSXDaily (2018): How to View & Remove Extended Attributes from a File on Mac OS
Ask Different: filesystem - What all file metadata is available in macOS?
Query Spotlight for a range of dates via PyObjC
Mac OS X : add a custom meta data field to any file
macOS stores metadata such as the "Where from" attribute under the key com.apple.metadata:kMDItemWhereFroms.
import xattr
value = xattr.getxattr("sublime_text_build_4121_mac.zip",'com.apple.metadata:kMDItemWhereFroms').decode("ISO-8859-1")
print(value)
'bplist00¢\x01\x02_\x10#https://download.sublimetext.com/sublime_text_build_4121_mac.zip_\x10\x1chttps://www.sublimetext.com/\x08\x0bN\x00\x00\x00\x00\x00\x00\x01\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00m'
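Rather than decoding the raw bytes, you can parse the binary plist directly, e.g. with the standard library's plistlib:
import plistlib
import xattr

raw = xattr.getxattr("sublime_text_build_4121_mac.zip",
                     "com.apple.metadata:kMDItemWhereFroms")
# plistlib turns the bplist shown above into a plain list of URLs
print(plistlib.loads(raw))
# ['https://download.sublimetext.com/sublime_text_build_4121_mac.zip',
#  'https://www.sublimetext.com/']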
Related
I'm using Tika, and I realized that each time I run it the jar file is downloaded and placed in the Temp folder:
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\asus\AppData\Local\Temp\tika-server.jar.
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\asus\AppData\Local\Temp\tika-server.jar.md5.
The problem is that the jar file size is around 60MB, which takes some time to download.
This is the code I'm using:
from tika import parser

def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']
The only workaround I found is this:
1 - Manually running the jar using java -jar tika-server-x.x.jar --port xxxx
2 - Using tika.TikaClientOnly = True
3 - Replacing parser.from_file(path) with parser.from_file(path, '/path/to/server')
But I don't want to run the jar file manually. It would be better if I could use Python to automatically run the jar file and set up tika with it, without re-downloading.
To resolve this problem you should set an environment variable that tells tika where to find the server jar, specifying the path of the folder which contains it:
TIKA_SERVER_JAR = 'PATH_OF_FOLDER_CONTAINING_TIKA_SERVER_JAR'
If you don't want to add an environment variable, you can change the directory where tika looks for the tika-server.jar file with the code below:
from tika import tika
tika.TikaJarPath = r'TIKA_SERVER_PATH'
In that TIKA_SERVER_PATH, the jar file must be named tika-server.jar (the name shouldn't include the version), and the .md5 file must be there as well. If the .md5 file doesn't match tika-server.jar, this method doesn't work and tika will delete your file and download the default version.
Here is what worked for me:
os.environ['TIKA_SERVER_JAR'] = "<path_to_jar_and_md5>/tika-server.jar"
os.environ['TIKA_PATH'] = "<path_to_jar_and_md5_again>"
These are read at library import time, so import the parser afterwards, and re-import if you change them.
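Putting it together, a small sketch of that ordering (paths are placeholders):
import os

# both variables must be set before tika is imported,
# since the library reads them at import time
os.environ['TIKA_SERVER_JAR'] = "/path/to/tika-server.jar"
os.environ['TIKA_PATH'] = "/path/to"

from tika import parser  # import only after the environment is prepared

parsed = parser.from_file("document.pdf")
print(parsed['content'])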
After trying almost everything and debugging the tika.py library code, I found that you must set both of these variables for this hack to work:
TIKA_SERVER_JAR="/path_to_tika_server/tika-server.jar"
TIKA_PATH="/path_to_tika_server"
You also need to provide a .md5 signature file, because since Tika version 1.18 the .md5 file is no longer provided (a sha512 signature is provided instead; see https://archive.apache.org/dist/tika/). So you need to trick the library into accepting your downloaded file.
Or someone could just patch the Python library :)
I am wondering how to get the .md5 file of tika-server.jar, since the .md5 file is not provided and a sha512 signature is provided instead.
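You can generate it yourself; a minimal sketch using the standard library's hashlib (run it next to the jar you downloaded):
import hashlib

# compute the MD5 digest of the downloaded jar ...
with open("tika-server.jar", "rb") as jar:
    digest = hashlib.md5(jar.read()).hexdigest()

# ... and write it where tika expects the .md5 file
with open("tika-server.jar.md5", "w") as md5_file:
    md5_file.write(digest)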
I am working on a project where I have to store .hg directories. The easiest way is to pack the .hg into hg.tar. I save it in MongoDB's GridFS filesystem.
If I go with this plan, I have to read the tar out.
import tarfile, cStringIO as io

repo = get_repo(saved_repo.id)
ios = io.StringIO()
ios.write(repo.hgfile.read())
ios.seek(0)
tar = tarfile.open(mode='r', fileobj=ios)
members = tar.getmembers()
#for info in members:
#    tar.extract(info.name, '/tmp')
for file in members:
    print file.name, file.isdir()
This code works: I can get all the file and directory names as the loop runs.
My question is: how do I extract this tar into a valid, filesystem-like directory structure? I can .extractfile individual members into memory, but if I want to feed it into the Mercurial API, I probably need the entire .hg directory in memory, as a single directory just like it exists in the filesystem.
Thoughts?
Mercurial has a concept called opener that's used to abstract filesystem access. I first looked at http://hg.intevation.org/mercurial/crew/file/tip/mercurial/revlog.py to see if you can replace the revlog class (which is the base class for changelog, manifest log and filelogs), but recent versions of Mercurial also have a VFS abstraction layer. It can be found in http://hg.intevation.org/mercurial/crew/file/8c64c4af21a4/mercurial/scmutil.py#l202 and is used by the localrepo.localrepository class for all file access.
I'm looking for a metadata layer that sits on top of files and can interpret key-value pairs of information in file names, for apps that work with thousands of files. More info:
These aren't necessarily media files that have built-in metadata - hence the key-value pairs.
The metadata goes beyond OS information (file sizes, etc.) to whatever the app puts into the key-values.
It should be accessible via the command line as well as a Python module, so that my applications can talk to it.
ADDED: It should also be supported by common os commands (cp, mv, tar, etc) so that it doesn't get lost if a file is copied or moved.
Examples of functionality I'd like include:
list files in directory x for organization_id 3375
report on files in directory y by translating load_time to year/month and show file count & size for each year/month combo
get the oldest file in directory z based upon the load_time key
Files with this simple metadata embedded within them might look like:
bowling_state-ky_league-15_game-8_gametime-201209141830.tgz
bowling_state-ky_league-15_game-9_gametime-201209141930.tgz
This metadata is very accessible and tightly joined to the file. But I'd prefer to avoid needing to use cut or wildcards for all operations.
I've looked around and can only find media and OS metadata solutions, and I don't want to build something if it already exists.
Have you looked at extended file attributes? See: http://en.wikipedia.org/wiki/Extended_file_attributes
Basically, you store the key-value pairs as zero terminated strings in the filesystem itself. You can set these attributes from the command line like this:
$ setfattr -n user.comment -v "this is a comment" testfile
$ getfattr testfile
# file: testfile
user.comment
$ getfattr -n user.comment testfile
# file: testfile
user.comment="this is a comment"
To set and query extended file attributes from Python, you can try the xattr module. See: http://pypi.python.org/pypi/xattr
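A short sketch of the same round-trip with that module (note the user. prefix required on Linux):
import xattr

# attach a key-value pair to the file as an extended attribute
xattr.setxattr("testfile", "user.organization_id", b"3375")

# read it back; values are returned as bytes
print(xattr.getxattr("testfile", "user.organization_id"))  # b'3375'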
EDIT
Extended attributes are supported by most filesystem manipulation commands, such as cp, mv and tar, by adding command line flags, e.g. cp -a or tar --xattrs. You may want these commands to work transparently (you may have users who are unaware of your extended attributes); in that case you can create an alias, e.g. alias cp="cp -a".
As already discussed, xattrs are a good solution when available. However, when you can't use xattrs:
NTFS alternate data streams
On Microsoft Windows, xattrs are not available, but NTFS alternate data streams provide a similar feature. ADSs let you store arbitrary amounts of data together with the main stream of a file. They are accessed using
drive:\path\to\file:streamname
An ADS is effectively just its own file with a special name. Apparently you can access them from Python by specifying a filename containing a colon:
open(r"drive:\path\to\file:streamname", "wb")
and then using it like an ordinary file. (Disclaimer: not tested.)
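Expanding that into a sketch of writing and reading back metadata (untested, as noted above; the path and stream name are placeholders):
# write key-value metadata into an alternate data stream of file.txt
with open(r"C:\path\to\file.txt:metadata", "w") as ads:
    ads.write("organization_id=3375")

# read it back through the same special name
with open(r"C:\path\to\file.txt:metadata") as ads:
    print(ads.read())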
From the command line, use Microsoft's streams program.
Since ADSs store arbitrary binary data, you are responsible for writing the querying functionality.
SQLite
SQLite is an embedded RDBMS that you can use. Store the .sqlite database file alongside your directory tree.
For each file you add, record it in a table:
CREATE TABLE file (
file_id INTEGER PRIMARY KEY AUTOINCREMENT,
path TEXT
);
Then, for example, you could store a piece of metadata as a table:
CREATE TABLE organization_id (
file_id INTEGER PRIMARY KEY,
value INTEGER,
FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Then you can query on it:
SELECT path FROM file NATURAL JOIN organization_id
WHERE value == 3375 AND path LIKE '/x/%';
Alternatively, if you want a pure key-value store, you can store all the metadata in one table:
CREATE TABLE metadata (
file_id INTEGER,
key TEXT,
value TEXT,
PRIMARY KEY(file_id, key),
FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Query:
SELECT path FROM file NATURAL JOIN metadata
WHERE key == 'organization_id' AND value == 3375 AND path LIKE '/x/%';
Obviously it's your responsibility to update the database whenever you read or write a file. You must also make sure these updates are atomic (e.g. add a column active to the file table; when adding a file: set active = FALSE, write file, fsync, set active = TRUE, and as cleanup delete any files that have active = FALSE).
The Python standard library includes SQLite support as the sqlite3 package.
From the command line, use the sqlite3 program.
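A minimal, self-contained sketch of the key-value variant above, using the stdlib sqlite3 module (file names and values are illustrative):
import sqlite3

conn = sqlite3.connect("metadata.sqlite")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS file (
        file_id INTEGER PRIMARY KEY AUTOINCREMENT,
        path TEXT
    );
    CREATE TABLE IF NOT EXISTS metadata (
        file_id INTEGER,
        key TEXT,
        value TEXT,
        PRIMARY KEY (file_id, key),
        FOREIGN KEY (file_id) REFERENCES file(file_id)
    );
""")

# register a file and one piece of metadata for it
cur = conn.execute("INSERT INTO file (path) VALUES (?)",
                   ("/x/bowling_game_8.tgz",))
conn.execute("INSERT INTO metadata VALUES (?, ?, ?)",
             (cur.lastrowid, "organization_id", "3375"))
conn.commit()

# list files in directory /x for organization_id 3375
for (path,) in conn.execute(
        "SELECT path FROM file NATURAL JOIN metadata "
        "WHERE key = 'organization_id' AND value = '3375' AND path LIKE '/x/%'"):
    print(path)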
Note the limits on xattr size (ext4 allows up to 4 KB), and that on Linux the xattr key must be prefixed with 'user.'. Also, not all filesystems support xattrs.
Try the iDB.py library, which wraps xattr and can easily be switched to disable xattr support.
I'm currently on a project where my client needs the reference file paths to remain in Linux format. For example:
A.ma , referencing objects from --> //linux/project/scene/B.ma
B.ma , referencing objects from --> //linux/project/scene/C.ma
Most of our Maya licenses here, however, are on Windows. I can run a Python script that converts all the paths to Windows paths and saves the file. For example:
Z:\project\scene\B.ma
However, I'm trying to figure out a way to do this without converting or altering the original file. I'll try to explain what I'm trying to do:
Run the script to open the file.
The script checks for the Linux-formatted reference paths, and all child paths down the hierarchy.
Maps all paths to their appropriate Windows-formatted paths.
Gives the animators the ability to "save" files normally, without running a separate save script.
Is it possible to achieve this with a Python script? Or will I need a fully-compiled plug-in to get this to work?
Any suggestion is greatly appreciated.
edit: Thank you for your input.
A little more clarification. The projects were set up for us by a remote company, and part of the requirement is that we keep the paths as-is. They come as absolute paths and we have no choice in the matter.
We match the mount //linux/ on our Fedora workstations. That same drive is mapped to Z:\ on our Windows workstations. We only have 2 Maya licenses for Linux, though, which is why I'm trying to do this.
Here is a solution. The first step is to create a dict that keeps track of the Linux/Windows references (don't forget to import the re module for regexps):
>>> def windows_path(path):
return path.replace('//linux', 'Z:').replace('/', '\\')
>>> reg = re.compile('(\w+\.ma) , referencing objects from --> (.*)')
>>> d = {}
>>> for line in open('D:\\temp\\Toto.txt'):
match = reg.match(line)
if match:
file_name = match.groups()[0]
linux_path = match.groups()[1]
d[file_name] = (linux_path, windows_path(linux_path))
>>> d
{'B.ma': ('//linux/project/scene/C.ma', 'Z:\\project\\scene\\C.ma'),
'A.ma': ('//linux/project/scene/B.ma', 'Z:\\project\\scene\\B.ma')}
Then you just need to loop over this dict to ask whether each file should be saved:
>>> for file_name in d.keys():
s = raw_input('do you want to save file %s ? ' % file_name)
if s.lower() in ('y', 'yes'):
# TODO: save your file thanks to d[file][0] for linux path,
# d[file][1] for windows path
print '-> file %s was saved' % file_name
else:
print '-> file %s was not saved' % file_name
do you want to save file B.ma ? n
-> file B.ma was not saved
do you want to save file A.ma ? yes
-> file A.ma was saved
Many Windows applications will interpret paths with two leading "/"s as UNC paths. I don't know if Maya is one of those, but try it out. If Maya can understand paths like "//servername/share/foo", then all you need to do is set up a SMB server named "linux", and the paths will work as they are. I would guess that this is actually what your client does, since the path "//linux" would not make sense in a Linux-only environment.
You can use environment variables to do this. Maya will expand environment variables present in a file path; you could use Maya.env to set them up properly for each platform.
What you are looking for is the dirmap MEL command. It is completely non-intrusive to your files, as you just define a mapping from your Linux paths to Windows paths and/or vice versa. Maya will internally apply the mapping to resolve the paths, without changing them when saving the file.
To set up dirmap, you need to run a MEL script which issues the respective commands on Maya startup. userSetup.mel could be one place to put it.
For more details, see the official documentation. This particular link points to Maya 2012; the command is available in Maya 7.0 and earlier as well:
http://download.autodesk.com/global/docs/maya2012/en_us/Commands/dirmap.html
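A sketch of the mapping, here using the Python equivalent of the MEL command (e.g. in userSetup.py; the mount points are taken from the question and may need adjusting):
import maya.cmds as cmds

# enable dirmap and translate the Linux mount to the Windows drive letter
cmds.dirmap(enable=True)
cmds.dirmap(mapDirectory=('//linux', 'Z:'))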
We're evaluating Babel 0.9.5 [1] under Windows for use with Python 2.6 and have the following questions that we've been unable to answer through reading the documentation or googling.
1) I would like to use an _-like abbreviation for ungettext. Is there a consensus on whether one should use n_ or N_ for this?
n_ does not appear to work. Babel does not extract text.
N_ appears to partially work. Babel extracts text like it does for gettext, but does not format it for ngettext (the plural argument and msgstr[n] are missing).
2) Is there a way to set the initial msgstr fields like the following when creating a POT file?
I suspect there may be a way to do this via Babel cfg files, but I've been unable to find documentation on the Babel cfg file format.
"Project-Id-Version: PROJECT VERSION\n"
"Language-Team: en_US \n"
3) Is there a way to preserve 'obsolete' msgid/msgstrs in our PO files? When I use the Babel update command, newly created obsolete strings are marked with #~ prefixes, but existing obsolete message strings get deleted.
Thanks,
Malcolm
[1] http://babel.edgewall.org/
By default, pybabel extract recognizes the following keywords: _, gettext, ngettext, ugettext, ungettext, dgettext, dngettext, N_. Use the -k option to add others. N_ is often used for NULL translations (also called deferred translations).
Update: The -k option can also list which arguments of the function should be put in the catalog. So, if you use n_ = ngettext, try pybabel extract -k n_:1,2 ....
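For instance, a small sketch of source code matching that extraction setup (the domain and locale directory are placeholders for your own catalog):
from gettext import translation

t = translation('messages', localedir='locale',
                languages=['en_US'], fallback=True)
n_ = t.ungettext  # the alias Babel is told about via -k n_:1,2

count = 3
print n_('%d file', '%d files', count) % count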
To answer question 2):
If you run Babel via pybabel extract, you can set Project-Id-Version via the --project and --version options.
If you run Babel via setup.py extract_messages, then Project-Id-Version is taken from the distribution (project name and version in the setup.py file).
Both ways also support the options --msgid-bugs-address and --copyright-holder for setting the POT metadata.
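For example, a command line combining these options might look like (paths and metadata values are placeholders):
pybabel extract --project "PROJECT" --version "1.0" \
    --msgid-bugs-address "bugs@example.com" \
    --copyright-holder "Your Name" \
    -o messages.pot src/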