Updating metadata for gridfs file object - python

I use GridFS as follows:
import gridfs
from pymongo import MongoClient

connection = MongoClient(host='localhost')
db = connection.gridfs_example
fs = gridfs.GridFS(db)
fileId = fs.put(b"Contents of my file", key='s1')
After files are initially stored in GridFS, I have a process that computes additional metadata from the contents of the file.
def qcFile(fileId):
    # DO QC
    return "QC PASSED"
qcResult = qcFile(fileId)
It would have been great if I could do:
fs.update(fileId, QC_RESULT = qcResult)
But that option does not appear to exist in the documentation. I found here (in a question updated with its solution) that the Java driver appears to offer a way to do something like this, but I can't find its equivalent in the Python gridfs module.
So, how do I use pymongo to tag my file with the newly computed metadata value qcResult? I can't find it within the documentation.

GridFS stores files in two collections:
1. the files collection, which stores each file's metadata
2. the chunks collection, which stores the files' binary data
You can update the files collection directly. The name of the files collection is 'fs.files', where 'fs' is the default bucket name.
So, to update the QC result, you can do something like this:
db.fs.files.update_one({'_id': fileId}, {'$set': {'QC_RESULT': qcResult}})
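As a quick sanity check, here is a minimal sketch (reusing db, fileId and qcResult from the snippets above) that writes the value and reads it back:
# Reuses db, fileId and qcResult from the snippets above
db.fs.files.update_one({'_id': fileId}, {'$set': {'QC_RESULT': qcResult}})

# Read the file document back to confirm the new field landed
doc = db.fs.files.find_one({'_id': fileId})
print(doc['QC_RESULT'])  # -> "QC PASSED"
If you want to stay closer to the GridFS spec, put custom fields under the metadata sub-document instead, e.g. {'$set': {'metadata.QC_RESULT': qcResult}}.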

Related

Transfer files from S3 Bucket to another keeping folder structure - python boto

I have found many questions related to this with solutions using boto3, however I am in a position where I have to use the legacy boto library (version 2.38) on Python 2.
Now I can successfully transfer my files in their folders (not real folders, I know, since S3 doesn't have that concept), but I want them to be saved into a particular folder in my destination bucket.
from boto.s3.connection import S3Connection

def transfer_files():
    conn = S3Connection()
    srcBucket = conn.get_bucket("source_bucket")
    dstBucket = conn.get_bucket(bucket_name="destination_bucket")
    objectlist = srcBucket.list()
    for obj in objectlist:
        dstBucket.copy_key(obj.key, srcBucket.name, obj.key)
My srcBucket keys look like folder/subFolder/anotherSubFolder/file.txt, which after the transfer land in the dstBucket as destination_bucket/folder/subFolder/anotherSubFolder/file.txt.
I would like it to end up in destination_bucket/targetFolder so the final directory structure would look like
destination_bucket/targetFolder/folder/subFolder/anotherSubFolder/file.txt
Hopefully I have explained this well enough and it makes sense
The first parameter of copy_key() is the name of the destination key. Therefore, just prepend the target folder:
dstBucket.copy_key('targetFolder/' + obj.key, srcBucket.name, obj.key)
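Putting it together with the loop from the question (the bucket names are the question's placeholders and 'targetFolder/' is the example prefix):
from boto.s3.connection import S3Connection

conn = S3Connection()
srcBucket = conn.get_bucket("source_bucket")
dstBucket = conn.get_bucket("destination_bucket")

for obj in srcBucket.list():
    # Prepend the prefix so the whole tree lands under targetFolder/
    dstBucket.copy_key('targetFolder/' + obj.key, srcBucket.name, obj.key)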

How to append multiple files into one in Amazon's s3 using Python and boto3?

I have a bucket in Amazon's S3 called test-bucket. Within this bucket, json files look like this:
test-bucket
  | continent
    | country
      | <filename>.json
Essentially, the object keys look like continent/country/<filename>.json. Within each country there are about 100k files, each containing a single dictionary, like this:
{"data":"more data", "even more data":"more data", "other data":"other other data"}
Different files have different lengths. What I need to do is compile all these files together into a single file, then re-upload that file into s3. The easy solution would be to download all the files with boto3, read them into Python, then append them using this script:
import json

def append_to_file(data, filename):
    # Append one JSON record per line to the output file
    with open(filename, "a") as f:
        json.dump(data, f)
        f.write("\n")
However, I do not know all the filenames (the names are a timestamp). How can I read all the files in a folder, e.g. Asia/China/*, then append them to a file, with the filename being the country?
Optimally, I don't want to have to download all the files into local storage. If I could load these files into memory that would be great.
EDIT: to make things more clear. Files on s3 aren't stored in folders, the file path is just set up to look like a folder. All files are stored under test-bucket.
The answer to this is fairly simple. You can list all the objects in the bucket, using a prefix filter to narrow the listing down to a "subdirectory". If you have a list of the continents and countries in advance, you can reduce the number of keys returned. The returned keys include the prefix, so you can then filter the object names down to the ones you want.
import re
import boto3

s3 = boto3.resource('s3')
bucket_obj = s3.Bucket(bucketname)        # e.g. 'test-bucket'
all_s3keys = list(obj.key for obj in bucket_obj.objects.filter(Prefix=job_prefix))

if file_pat:
    # Optional: keep only the keys matching a regex
    filtered_s3keys = [key for key in all_s3keys if re.search(file_pat, key)]
else:
    filtered_s3keys = all_s3keys
The code above returns all the keys, with their complete prefix in the bucket, limited to the prefix provided. So if you provide prefix='Asia/China/', it returns a list of only the files with that prefix. In some cases, I take a second step and filter the file names in that 'subdirectory' before I use the full prefix to access the files.
The second step is to download all the files:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    # local_filepath should really be generated per key (see the note below)
    executor.map(lambda s3key: bucket_obj.download_file(s3key, local_filepath, Config=CUSTOM_CONFIG),
                 filtered_s3keys)
For simplicity, I skipped showing that the code also generates a distinct local_filepath for each downloaded file, so each object ends up under the name and location you actually want.
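Since the question also asks about avoiding local storage entirely, here is a separate in-memory sketch (not part of the answer above): it reads every object under a prefix straight into memory, joins them one record per line, and uploads the result. The bucket name and prefix come from the question; the destination key combined/China.json is made up for illustration.
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')

lines = []
for obj in bucket.objects.filter(Prefix='Asia/China/'):
    # Each object holds a single JSON dictionary; read it into memory
    lines.append(obj.get()['Body'].read().decode('utf-8').strip())

# One record per line, re-uploaded under a hypothetical key
bucket.put_object(Key='combined/China.json', Body='\n'.join(lines).encode('utf-8'))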

Creating Unique Names

I'm creating a corpus from a repository. I download the texts from the repository as PDFs, convert them to text files, and save them. However, I'm trying to find a good way to name these files.
To get the filenames I do this: (the records generator is an object from the Sickle package that I use to get access to all the records in the repository)
import os

for record in records:
    record_data = []  # data is stored in record_data
    for name, metadata in record.metadata.items():
        for i, value in enumerate(metadata):
            if value:
                record_data.append(value)

    file_path = ''
    fulltext = ''
    for data in record_data:
        if 'Fulltext' in data:
            fulltext = data.replace('Fulltext ', '')
            file_path = '/' + os.path.basename(data) + '.txt'

    print fulltext
    print file_path
The print statements on the last two lines produce:
https://www.duo.uio.no/bitstream/handle/10852/34910/1/Bertelsen-Master.pdf
/Bertelsen-Master.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34912/1/thesis-output.pdf
/thesis-output.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9976/1/gartmann.pdf
/gartmann.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34174/1/thesis-mariusno.pdf
/thesis-mariusno.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9285/1/thesis2.pdf
/thesis2.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9360/1/OMyhre.pdf
As you can see, I add .txt to the end of the original filename and want to use that name to save the file. However, a lot of the files have the same filename, like thesis.pdf. One way I thought of solving this was to add a few random numbers to the name, or a number that gets incremented for each record, like this: thesis.pdf.124.txt (adding 124 to the name).
But that does not look very good, and the repository is huge, so in the end I would have quite large numbers appended to each filename. Any smart suggestions on how I can solve this?
I have seen suggestions like using the time module. I was thinking maybe I can use a regex or another technique to extract part of the name (so every name is equally long) and then create a method that adds a string to each file based on the URL of the file, which should be unique.
One thing you could do is to compute a unique hash of the files, e.g. with MD5 or SHA1 (or any other), cf. this article. For a large number of files this can become quite slow, though.
But you don't really seem to touch the files in this piece of code. For generating a unique id, you could use uuid and put it somewhere in the name.
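A minimal sketch of both ideas (not from the answer above; url stands in for one of the bitstream URLs printed earlier):
import hashlib
import os
import uuid

url = 'https://www.duo.uio.no/bitstream/handle/10852/34912/1/thesis-output.pdf'
base = os.path.basename(url)                               # 'thesis-output.pdf'

# Option 1: a fixed-length suffix derived from the (unique) URL, reproducible across runs
suffix = hashlib.sha1(url.encode('utf-8')).hexdigest()[:10]
print('/' + base + '.' + suffix + '.txt')                  # e.g. /thesis-output.pdf.<10 hex chars>.txt

# Option 2: a random UUID, unique but different on every run
print('/' + base + '.' + uuid.uuid4().hex[:10] + '.txt')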

Query GAE Datastore on Transformed Property Values

I'm working on a Python NDB model on app engine that looks like:
from google.appengine.ext import ndb

class NDBPath(ndb.Model):
    path = ndb.StringProperty()
    directory = ndb.ComputedProperty(lambda self: getDirectory(self.path))
    cat = ndb.IntegerProperty(indexed=False)
Path is a file path, directory is the superdirectory of that file, and cat is some number. These entities are effectively read only after an initial load.
I query the datastore with various file paths and want to pull out the cat property of an entity if either a) its path matches the queried path (same file), or b) the entity's directory is a superdirectory of the queried path. So I end up doing a query like:
NDBPath.query(NDBPath.directory.IN(generateSuperPaths(queriedPath)))
where generateSuperPaths lists all the superdirectories of the queried path in their full form (e.g. a/b/c/d.html --> [/a, /a/b, /a/b/c]).
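For reference, a hypothetical sketch of such a helper (not shown in the question) that matches the example above:
def generateSuperPaths(path):
    # 'a/b/c/d.html' -> ['/a', '/a/b', '/a/b/c']
    parts = path.strip('/').split('/')[:-1]   # drop the file name itself
    return ['/' + '/'.join(parts[:i + 1]) for i in range(len(parts))]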
Because these are read only, using a computed property is effectively a waste of a write as it will never change. Is there any way to query based on a dynamically transformed value, like
NDBPath.query(getDirectory(NDBPath.path).IN(generateSuperPaths(queriedPath)))
So I can save writing the directory as a property and just use it in the query?
AFAIK, yes; they end up doing the same thing, and it will save you write time and cost. It will, however, make your query run slower (because of the added computation), so be wary of possible timeouts.

Need a file system metadata layer for applications

I'm looking for a metadata layer that sits on top of files which can interpret key-value pairs of information in file names for apps that work with thousands of files. More info:
These aren't necessarily media files that have built-in metadata - hence the key-value pairs.
The metadata goes beyond os information (file sizes, etc) - to whatever the app puts into the key-values.
It should be accessible via command line as well as a python module so that my applications can talk to it.
ADDED: It should also be supported by common os commands (cp, mv, tar, etc) so that it doesn't get lost if a file is copied or moved.
Examples of functionality I'd like include:
list files in directory x for organization_id 3375
report on files in directory y by translating load_time to year/month and show file count & size for each year/month combo
get oldest file in directory z based upon key of loadtime
Files with this simple metadata embedded within them might look like:
bowling_state-ky_league-15_game-8_gametime-201209141830.tgz
bowling_state-ky_league-15_game-9_gametime-201209141930.tgz
This metadata is very accessible & tightly joined to the file. But - I'd prefer to avoid needing to use cut or wild-cards for all operations.
I've looked around and can only find media & os metadata solutions and don't want to build something if it already exists.
Have you looked at extended file attributes? See: http://en.wikipedia.org/wiki/Extended_file_attributes
Basically, you store the key-value pairs as zero terminated strings in the filesystem itself. You can set these attributes from the command line like this:
$ setfattr -n user.comment -v "this is a comment" testfile
$ getfattr testfile
# file: testfile
user.comment
$ getfattr -n user.comment testfile
# file: testfile
user.comment="this is a comment"
To set and query extended file system attributes from python, you can try the python module xattr. See: http://pypi.python.org/pypi/xattr
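For example, a quick sketch using that module (pip install xattr; the file name and attribute keys here are only examples):
import xattr

attrs = xattr.xattr('bowling_state-ky_league-15_game-8_gametime-201209141830.tgz')
attrs['user.organization_id'] = b'3375'       # values are stored as bytes
attrs['user.load_time'] = b'201209141830'

for key, value in attrs.items():              # enumerate all extended attributes
    print(key, value)

print(attrs['user.organization_id'])          # read a single attribute back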
EDIT
Extended attributes are supported by most filesystem manipulation commands, such as cp, mv and tar, by adding command line flags, e.g. cp -a or tar --xattrs. You may need to make these commands work transparently (you may have users who are unaware of your extended attributes); in that case you can create an alias, e.g. alias cp="cp -a".
As already discussed, xattrs are a good solution when available. However, when you can't use xattrs:
NTFS alternate data streams
On Microsoft Windows, xattrs are not available, but NTFS alternate data streams provide a similar feature. ADSs let you store arbitrary amounts of data together with the main stream of a file. They are accessed using
drive:\path\to\file:streamname
An ADS is effectively just its own file with a special name. Apparently you can access them from Python by specifying a filename containing a colon:
open(r"drive:\path\to\file:streamname", "wb")
and then using it like an ordinary file. (Disclaimer: not tested.)
From the command line, use Microsoft's streams program.
Since ADSs store arbitrary binary data, you are responsible for writing the querying functionality.
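A hedged sketch of the same idea (Windows/NTFS only, untested, with a made-up path and stream name), simply serializing the key-value pairs as JSON into the stream:
import json

path = r"C:\data\bowling_state-ky_league-15_game-8.tgz"   # hypothetical file on NTFS

with open(path + ":user.metadata", "w") as ads:           # write the alternate data stream
    json.dump({"organization_id": 3375, "load_time": "201209141830"}, ads)

with open(path + ":user.metadata") as ads:                # read it back
    print(json.load(ads))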
SQLite
SQLite is an embedded RDBMS that you can use. Store the .sqlite database file alongside your directory tree.
For each file you add, also record it in a table:
CREATE TABLE file (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT
);
Then, for example, you could store a piece of metadata as a table:
CREATE TABLE organization_id (
    file_id INTEGER PRIMARY KEY,
    value INTEGER,
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Then you can query on it:
SELECT path FROM file NATURAL JOIN organization_id
WHERE value == 3375 AND path LIKE '/x/%';
Alternatively, if you want a pure key-value store, you can store all the metadata in one table:
CREATE TABLE metadata (
    file_id INTEGER,
    key TEXT,
    value TEXT,
    PRIMARY KEY(file_id, key),
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
Query:
SELECT path FROM file NATURAL JOIN metadata
WHERE key == 'organization_id' AND value == 3375 AND path LIKE '/x/%';
Obviously it's your responsibility to update the database whenever you read or write a file. You must also make sure these updates are atomic (e.g. add a column active to the file table; when adding a file: set active = FALSE, write file, fsync, set active = TRUE, and as cleanup delete any files that have active = FALSE).
The Python standard library includes SQLite support as the sqlite3 package.
From the command line, use the sqlite3 program.
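For completeness, a minimal Python sketch of the key-value variant above using the standard library sqlite3 module (the file path and metadata values are made up):
import sqlite3

conn = sqlite3.connect("file_metadata.sqlite")
conn.executescript("""
CREATE TABLE IF NOT EXISTS file (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT
);
CREATE TABLE IF NOT EXISTS metadata (
    file_id INTEGER,
    key TEXT,
    value TEXT,
    PRIMARY KEY(file_id, key),
    FOREIGN KEY(file_id) REFERENCES file(file_id)
);
""")

# Register a file and one piece of metadata
cur = conn.execute("INSERT INTO file(path) VALUES (?)", ("/x/bowling_game-8.tgz",))
conn.execute("INSERT INTO metadata VALUES (?, ?, ?)", (cur.lastrowid, "organization_id", "3375"))
conn.commit()

# Query: files in directory /x for organization_id 3375
rows = conn.execute(
    "SELECT path FROM file NATURAL JOIN metadata WHERE key = ? AND value = ? AND path LIKE ?",
    ("organization_id", "3375", "/x/%"),
).fetchall()
print(rows)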
Note that xattrs are limited in size (on ext4, all of a file's attributes must fit in one block, typically 4 KB), that on Linux user-space keys must be prefixed with 'user.', and that not all filesystems support xattrs. You could try the iDB.py library, which wraps xattr and can easily be switched to disable xattr support.
