Getting newest S3 keys first - python

I am writing an app that stores (potentially millions of) objects in an S3 bucket. My app will take the most recent object (roughly), process it, and write it back to the same bucket. I need a way of accessing keys and naming new objects so that the app can easily get to the newest objects.
I know I can do this properly by putting metadata in SimpleDB, but I don't need hard consistency. It's ok if the app grabs an object that isn't quite the newest. I just need the app to tend to grab new-ish keys instead of old ones. So I'm trying to keep it simple by using S3 alone.
Is there a way to access and sort on S3 metadata? Or might there be a scheme for naming the objects that would get me what I need (since I know S3 lists keys in lexicographic order and boto can handle paging)?

S3 versioning really helps out here. If these are really the same "thing", you can turn on versioning for your bucket, get the data from your key, modify it, and store it back to the same key.
You'll need to use boto's
bucket.get_all_versions(prefix='yourkeynamehere')
You get versions out, most recent first, so while this function doesn't handle paging, you can just take the first index and you've got the most recent version.
If you want to go back further and need paging, boto also offers a list_versions() function that takes a prefix as well and gives you a result set that iterates through all the versions without you needing to worry about it.
If these objects really aren't the "same" object, it doesn't matter much, because S3 doesn't store diffs -- it stores the whole thing every time. If you have multiple 'types' of objects, you can keep multiple version sets and pull the most recent from each.
I've been using versioning and I'm pretty happy with it.
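A minimal sketch of that flow with boto 2 ('my-bucket' and the prefix are placeholders, and the processing step is left as a comment):

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')     # placeholder bucket name
bucket.configure_versioning(True)         # one-time switch if versioning isn't on yet

# get_all_versions() returns one page of versions, newest first
versions = bucket.get_all_versions(prefix='yourkeynamehere')
newest = versions[0]
key = bucket.get_key(newest.name)         # the current (newest) version of that key
data = key.get_contents_as_string()

# ... process data here ...

# writing back to the same key creates a new, most recent version
key.set_contents_from_string(data)

# list_versions() handles paging for you if you need to walk further back
for v in bucket.list_versions(prefix='yourkeynamehere'):
    print(v.name, v.version_id)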

What is the proper way to split a source into resource name/zone/project?

I'm listing instances using the google cloud python library method:
service.instances().list()
This returns a dict of instances, for each instance it returns a list of disks, and for each disk the source of the disk is available in the following format:
https://www.googleapis.com/compute/v1/projects/<project name>/zones/<zone>/disks/<disk name>
There is no other "name" in the disks dict, so that is the closest thing I have to retrieve the disk name.
After looking into other methods many of them return resources in a similar way.
However, if I want to use any google disk methods from the library, it's expected that I supply the disk name, project and zone as separate arguments to the library's method.
Is there a common method I can write to split the resource parameters?
In this example that would be project name, zone and disk name, but other resource types might have different components.
I could not find any method in the library that would do the split for me, so I guess it's expected that I write my own.
There is no specific API in GCP that gives you such a result, but since the URL you are getting has a constant format (the order of the parts you want is fixed), I think the easiest way is to apply the following code:
disk_url = "https://www.googleapis.com/compute/v1/projects/<project name>/zones/<zone>/disks/<disk name>"
parts = disk_url.split('/')
project = parts[6]
zone = parts[8]
disk = parts[10]
I think this will be helpful, but if you need something more specific, I believe you will have to do more of the string handling in Python on your own.
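If you want something a bit more general than hard-coded indices, a small helper that pairs up the path segments is one option. This is just a sketch (the function name and the example values are made up, not part of the library):

def parse_resource_url(url):
    # Split a GCP resource self-link into {collection: name} pairs,
    # e.g. {'projects': ..., 'zones': ..., 'disks': ...}
    path = url.split('googleapis.com/')[1]   # drop scheme and host
    parts = path.split('/')[2:]              # drop the 'compute/v1' service prefix
    return dict(zip(parts[::2], parts[1::2]))

info = parse_resource_url(
    "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/disks/my-disk")
print(info['projects'], info['zones'], info['disks'])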

AWS S3 boto3: Server-side filtering for top-level keys

I am developing a GUI for the analysts in my company to be able to browse a series of S3 buckets (and files therein named according to different hierarchies of keys) just as if they were working with any other hierarchical FS. The idea is that they can work without needing to know any details about how (or where) is the data actually stored.
I am aware that S3 does not natively support folders and that this can be (to some extent) simulated by properly using key naming and delimiters. For my use case, let's suppose the content of my bucket is:
asset1/property1/fileA.txt
asset1/property2/fileB.txt
asset2/property1/fileA.txt
asset2/property2/fileB.txt
configX.txt
configY.txt
configZ.txt
I have so far written a simple GUI navigator that enables interactive navigation through the different levels of S3 keys as if they were folders (using the CommonPrefixes key of the dictionary returned by the S3 client or the paginator). The problem arises when I land in an example like the one above. Obviously, CommonPrefixes is not going to return the file basenames under the requested key, but I also want to display them to the user (they are files contained in that "folder"!).
One thing that I have tried: for every "inspection" of a requested key level (when the user clicks on a list item (key substring) as if it were a folder), I retrieve the first, say, 1000 item basenames under the passed prefix and search for any file matching exactly the {prefix}/{basename} key. If any of the first 1000 files matches the criterion, it means that "there are files contained directly in that very folder", so I can then use the paginator to query them all (eventually there are more than 1000 in total) and have their keys returned to me "for displaying them in the folder".
The problem in the above situation is that the paginator will (logically) recursively inspect all the contents under the passed prefix, but, unlike in the above toy example, the 'asset' or 'property' "folders" can contain tens or hundreds of thousands of files, which significantly slows down the search "just" to be able to show the extra top-level 'config' files that share a path with the asset "folders".
I can filter the results myself to display at each level what has to be displayed, resembling a hierarchical folder structure. Nevertheless, the loss of speed and interactivity is massive, rendering the GUI solution for a comfortable analyst experience somewhat pointless. I have been searching for a way to filter results server-side beyond the standard prefix and delimiter (e.g. passing a suffix or a regular expression), but I have been unable to find any solution that does not imply the (slow) full-key retrieval and client-side filtering.
Is there any way to approach this that I am not seeing? What would be the correct way to go about this problem? Since I do not think my use-case is a very specific corner-case, apologies if this is a basic question that has already been solved. I have tried to find an answer, but chances are that I am unable to google.
Thanks very much in advance.
D.
P.S.: BTW, I have read that boto2 delivers the query results by levels, as I would expect them, but I am not certain that it wouldn't query the whole bucket anyway (which is what actually costs time).
If you look at the boto3 docs closely, besides the CommonPrefixes being returned, there is also a Contents key, which contains
"A list of metadata about each object returned."
When you pass a Delimiter, Contents only includes the objects sitting directly at the requested prefix level (your top-level 'config' files), while everything deeper is rolled up into CommonPrefixes, so you don't need to paginate through the whole subtree.
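A short sketch of how that looks with boto3 (the bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

folders, files = [], []
# Delimiter='/' makes S3 roll deeper keys up into CommonPrefixes,
# so Contents only holds the objects directly under the requested prefix
for page in paginator.paginate(Bucket='my-bucket', Prefix='', Delimiter='/'):
    folders += [p['Prefix'] for p in page.get('CommonPrefixes', [])]
    files += [o['Key'] for o in page.get('Contents', [])]

print(folders)   # ['asset1/', 'asset2/']
print(files)     # ['configX.txt', 'configY.txt', 'configZ.txt']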

Is there a good way to store a boolean array into a file or database in python?

I am building an image mosaic that detects whether the user's selected areas are taken or not.
My idea is to store the available_spots in a list, and I would just have to look through the list to check whether a spot is available or not.
The problem is that when I reload the website, available_spots also gets reset to an empty list,
so I want to store this array somewhere that is fast to read and write to.
I am currently thinking about storing this in a text file, but that might take forever to read since the array length is over 1.4 million. Are there any other solutions that might be better?
You can't store the data in a file for a few reasons: (1) GAE standard won't let you, (2) the data is lost when your server is restarted, and (3) different instances will have different data.
Of course you can and should store the data in a database of your choice. Firestore is likely a better and cheaper option than SQL. It should be fast enough for you and you can implement caching if needed.
You might be able to store the data in a single Firestore entity and consider using compression if you are getting close to the max entity size.
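As a rough sketch of that idea (the collection and document names here are made up, and numpy is used only for bit-packing), you could pack the booleans into bytes and compress them before writing:

import zlib
import numpy as np
from google.cloud import firestore

db = firestore.Client()
doc = db.collection('mosaic').document('available_spots')   # hypothetical names

def save_spots(spots):
    # pack ~1.4M booleans into ~175 KB of bits, then compress to stay well under the size limit
    packed = np.packbits(np.asarray(spots, dtype=bool)).tobytes()
    doc.set({'n': len(spots), 'bits': zlib.compress(packed)})

def load_spots():
    data = doc.get().to_dict()
    bits = np.unpackbits(np.frombuffer(zlib.decompress(data['bits']), dtype=np.uint8))
    return bits[:data['n']].astype(bool).tolist()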
If you want to store it in a database, you can use the "sqlite3" module.
It is a simple database that gets stored in a file, so you don't have to install a database program. It is great for small projects.
If you want to do more complex stuff with databases, you can use "sqlalchemy".
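A minimal sketch of the sqlite3 route, with one row per spot (the file and table names are made up):

import sqlite3

conn = sqlite3.connect('spots.db')   # hypothetical file name
conn.execute('CREATE TABLE IF NOT EXISTS spots (id INTEGER PRIMARY KEY, taken INTEGER)')

def mark_taken(spot_id):
    conn.execute('INSERT OR REPLACE INTO spots (id, taken) VALUES (?, 1)', (spot_id,))
    conn.commit()

def is_taken(spot_id):
    row = conn.execute('SELECT taken FROM spots WHERE id = ?', (spot_id,)).fetchone()
    return bool(row and row[0])

This way you never have to read all 1.4 million entries at once; you only query the spot you are interested in.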

python object save and reload

I've been working on a python program that basically creates 5 different types of objects that are in a hierarchy. For example, my program might create 1 Region object that contains 2000 Column objects that contain 8000 Cell objects (4 Cells in each Column), where all the objects interact with each other based on video input.
Now, I want to be able to save all the objects states after the video input changes each of their states over a period of time. So my question is how can I save and reload thousands of objects in Python efficiently? Thanks in advance!
I'm not sure how efficient pickle is at large scales, but I think what you're looking for is object serialization. Are you trying to 'refresh' the information in these objects, or save and load them? Also read the section on 'Persistence of External Objects', since you will need to create an alphanumeric id associated with each object for the relations/associations.
One totally hacky way could also be to json-ify the objects and store that. You would still need the alphanumeric id, or some other usable identifier, to associate each of the objects.
Have you looked at Shelve, Pickle or cPickle?
http://docs.python.org/release/2.5/lib/persistence.html
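For the pickle route, a minimal sketch looks like this (region stands in for your own top-level object; pickle follows the references and serializes every Column and Cell reachable from it):

import pickle

# save the whole hierarchy by pickling the top-level object
with open('region.pkl', 'wb') as f:
    pickle.dump(region, f, protocol=pickle.HIGHEST_PROTOCOL)

# reload it later
with open('region.pkl', 'rb') as f:
    region = pickle.load(f)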
I think you need to look into the ZODB.
The ZODB is an object database that uses pickle to serialize data, is very adept at handling hierarchies of objects, and, if your objects use the included persistent.Persistent base class, will detect and save only the objects that changed when you commit; i.e. there is no need to write out the whole hierarchy on every little change.
Included in the ZODB project is a package called BTrees, which is ZODB-aware and makes storing thousands of objects in one place efficient. Use these for your Region object to store the Columns. We use BTrees to store millions of datapoints at times.
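A rough sketch of that setup (the class shapes here just mirror the question and are not your actual code):

import ZODB, ZODB.FileStorage
import transaction
from persistent import Persistent
from BTrees.OOBTree import OOBTree

class Cell(Persistent):
    def __init__(self):
        self.state = 0

class Column(Persistent):
    def __init__(self):
        self.cells = [Cell() for _ in range(4)]

class Region(Persistent):
    def __init__(self):
        self.columns = OOBTree()   # scales to thousands of entries

# open (or create) the object database file
storage = ZODB.FileStorage.FileStorage('regions.fs')
db = ZODB.DB(storage)
root = db.open().root()

region = Region()
for i in range(2000):
    region.columns[i] = Column()

root['region'] = region
transaction.commit()   # on later commits, only changed persistent objects are written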

Move or copy an entity to another kind

Is there a way to move an entity to another kind in App Engine?
Say you have a kind defined, and you want to keep a record of deleted entities of that kind,
but you want to separate the storage of live objects and archived objects.
Kinds are basically just serialized dicts in Bigtable anyway, and maybe you don't need to index the archive in the same way as the live data.
So how would you move or copy an entity of one kind to another kind?
No - once created, the kind is a part of the entity's immutable key. You need to create a new entity and copy everything across. One way to do this would be to use the low-level google.appengine.api.datastore interface, which treats entities as dicts.
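A rough sketch of that approach using the legacy Python runtime's low-level datastore API ('Archived' is a made-up kind name):

from google.appengine.api import datastore

def archive_entity(key, archive_kind='Archived'):
    # the low-level Entity behaves like a dict of property names to values
    old = datastore.Get(key)
    # create a fresh entity of the archive kind and copy every property across
    new = datastore.Entity(archive_kind)
    new.update(old)
    datastore.Put(new)
    # optionally remove the live entity once it has been archived
    datastore.Delete(key)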
Unless someone's written utilities for this kind of thing, the way to go is to read from one and write to the other kind!
