Page through S3 objects matching a specific filename pattern using boto3 - python

I have an AWS S3 bucket with a Prefix (or "folder") called /photos. That "contains" a bunch of image files and a much smaller number of EVENT.json files. A naive representation might look like this:
my-awesome-events-bucket
    photos/
        image1.jpg
        image2.jpg
        1_EVENT.json
        image3.jpg
        2_EVENT.json
        ...
The EVENT.json files have an object that contains path references to an arbitrary number of image files, which group images into a specific event. Using the example above, image1.jpg and image2.jpg could appear in 1_EVENT.json, and image3.jpg may belong to 2_EVENT.json.
As the bucket gets larger, I have an interest in paging through the results. I only want to request a page at a time from S3 as I need them. The problem I'm running into is that I want to page specifically by keys that contain the word "EVENT". I'm finding this difficult to accomplish without bringing back ALL the objects and then filtering or iterating the results.
Using an S3 Paginator, I'm able to get paging working. Assuming my PageSize and MaxItems are set to 6, this is what I might get back for my first page:
/photos/
/photos/image1.jpg
/photos/image2.jpg
/photos/1_EVENT.json
/photos/image3.jpg
/photos/2_EVENT.json
S3's flat structure means that it's paging through all objects in the bucket according to the Prefix, and limiting and paging according to the pagination parameters. This means that I could easily get multiple EVENT.json files, or none at all, depending on the page.
So I'm looking for something more along the lines of this:
/photos/1_EVENT.json
/photos/2_EVENT.json
/photos/3_EVENT.json
/photos/4_EVENT.json
/photos/5_EVENT.json
/photos/6_EVENT.json
without first having to request all objects and then slice the result set in some way, which is exactly what I'm doing currently:
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket=app.config.get('S3_BUCKET'),
    Prefix="photos/")  # Left PaginationConfig MaxItems & PageSize off intentionally

# JMESPath filter: yields each object entry whose Key contains "EVENT"
filtered_iterator = page_iterator.search(
    "Contents[?contains(Key, `EVENT`)][]")

for key_data in filtered_iterator:
    # Do stuff with each matching object entry.
    pass
The above is really expensive, with no paging, but it does give me a list of all files containing my "EVENT" search string.
I specifically want to page results of only EVENT.json objects through S3 using boto3 without the overhead of returning and filtering all objects every request. Is that possible?
EDIT: I'm already narrowing requests down to just objects with the photos/ Prefix. This is because there are other "folders" in my bucket that also may contain EVENT files. That prevents me from using EVENT or EVENT.json as my Prefix, because the response may be polluted by files from other folders.

The simplest way would be to rehash your filename structure to have the EVENT files follow the pattern photos/EVENT_*.json instead of photos/*_EVENT.json. Then you could use a common prefix of photos/EVENT.
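For example, a minimal sketch of what that could look like with boto3 after the rename (using the example bucket name from the question; the EVENT_ prefix is the assumed new layout):
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# With the renamed files, the common prefix alone narrows the listing to
# EVENT files, so ordinary pagination works without client-side filtering.
page_iterator = paginator.paginate(
    Bucket="my-awesome-events-bucket",
    Prefix="photos/EVENT_",
    PaginationConfig={'PageSize': 6})

for page in page_iterator:
    for obj in page.get('Contents', []):
        print(obj['Key'])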
Short of that, I think that the expensive method you are using is actually the only way to go about it.

There is a prefix option you can pass to one of the search/listing functions in boto. This will dramatically reduce the number of files it has to scan. However, if you need to match a wildcard in the middle of the key name, as far as I know the listing still has to scan all the objects under that prefix, and you then have to filter those results yourself.
ex:
bucket.search_function(prefix="string")
I can't recall the boto function off the top of my head though.
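In boto3 terms, a rough sketch of that two-step approach (narrow the listing with a prefix, then match the mid-string wildcard on the client) could look like this, reusing the example bucket and key layout from the question:
import fnmatch

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# The prefix limits how much S3 has to list; the wildcard in the middle of
# the key still has to be matched client side for every returned key.
for page in paginator.paginate(Bucket="my-awesome-events-bucket", Prefix="photos/"):
    for obj in page.get('Contents', []):
        if fnmatch.fnmatch(obj['Key'], "photos/*_EVENT.json"):
            print(obj['Key'])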

Related

Is there a way of getting part of a dataframe from an Azure blob storage

So I have a lot of data in an Azure blob storage. Each user can upload some cases and the end result can be represented as a series of pandas dataframes. Now I want to be able to display some of this data on our site, but the files are several hundreds of MB and there is no need to download all of it. What would be the best way to get part of the df?
I can make a folder structure in each blob storage containing the different columns in each df, and perhaps a more compact summary of the columns, but I would like to keep it in one file if possible.
I could also set up a database containing the info but I like the structure as it is - completely separated in cases.
Originally I thought I could do it in hdf5 but it seems that I need to download the entire file from the blob storage to my API backend before I can run my python code on it. I would prefer if I could keep the hdf5 files and get the parts of the columns from the blob storage directly but as far as I can see that is not possible.
I am thinking this is something that has been solved a million times before but it is a bit out of my domain so I have not been able to find a good solution for it.
Check out the BlobClient of the Azure Python SDK. The download_blob method might suit your needs. Use chunks() to get an iterator which allows you to iterate over the file in chunks. You can also set other parameters to ensure that a chunk doesn't exceed a set size.
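A minimal sketch of that, assuming placeholder connection details and blob names (substitute your own; handle_chunk stands in for whatever processing you need):
from azure.storage.blob import BlobClient

# Placeholder connection string / container / blob names.
blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="cases",
    blob_name="case_123.h5")

# download_blob also accepts offset=/length= if you only need a byte range.
downloader = blob.download_blob()
for chunk in downloader.chunks():
    # Each chunk is a bytes object, so the whole file never has to fit in memory.
    handle_chunk(chunk)  # hypothetical handler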

How to use multithreading to search for a blob? - Azure Python SDK

Currently, I am working on a Python application that searches for a blob in a container, given a keyword. My code for searching the blob is found below. When performing the search in very large blob containers, this current method is not very efficient, as it takes over 20 minutes to search for a blob (for a blob container containing ~1,100,000 blobs). In addition, my application 'freezes' and is not clickable until the search is finished.
I recently started reading about multi-threading, and started thinking about how it could be used in my application to speed up the search process. Since my current search uses a single thread, would it somehow be possible to use multiple threads to complete the search?
An idea I currently have is to somehow get the total count of blobs that the generator holds, and assign one half of it to one thread to search, and assign the other half to another thread to search. So in the end, multiple threads would be performing the search to ultimately complete the entire search faster. Any ideas, tips or recommendations would be most helpful.
next_marker = None
while True:
    generator = container_client.list_blobs(marker=next_marker)
    for item in generator:
        if search_keyword in item.name:
            print("Container: {0}, Blob: {1}\n".format(
                container_client.container_name, item.name))
    # Using next_marker to get the continuation token and the rest of the blob results
    next_marker = generator.next_marker
    if not next_marker:
        break
Not a perfect solution, but one possible way to parallelize the listing operation is to make use of a blob prefix. Assuming your blob names start with alphanumeric characters (a-z, A-Z and 0-9), what you could do is run a blob prefix search in parallel, where each thread searches for blobs whose names start with a certain prefix ("a", "b", ... etc.).
You would use list_blobs with the name_starts_with parameter and provide the prefix there.
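A minimal sketch of that idea, reusing the container_client and search_keyword from the question (the prefix set and worker count are arbitrary):
import string
from concurrent.futures import ThreadPoolExecutor

def search_prefix(prefix):
    # Each worker only lists the blobs whose names start with its prefix.
    return [blob.name
            for blob in container_client.list_blobs(name_starts_with=prefix)
            if search_keyword in blob.name]

prefixes = string.ascii_lowercase + string.ascii_uppercase + string.digits
with ThreadPoolExecutor(max_workers=8) as pool:
    for names in pool.map(search_prefix, prefixes):
        for name in names:
            print("Container: {0}, Blob: {1}".format(
                container_client.container_name, name))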
Another option would be to make use of Azure Cognitive Search and create an Index that uses an Azure Blob Storage data source. It will be much faster, however you would then be using a different search altogether to do the blob search.

How to get the last file deposited in a GCS bucket (python)

I want to get the last file deposited in a GCS bucket in Python so I can use it in my DAG.
Every day, files are deposited in the bucket gs://name/neth with names in the format file_yymmdd_.
With Cloud Storage you can only filter by prefix. So, given the bucket name and the prefix neth/file_yymmdd_..., you have to list all the files under that prefix and iterate over their metadata to get the latest one. There is no out-of-the-box solution.
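A minimal sketch of that, assuming the bucket name and prefix from the question (substitute your real values):
from google.cloud import storage

client = storage.Client()
# List everything under the prefix, then keep the most recently created blob.
blobs = client.list_blobs("name", prefix="neth/file_")
latest = max(blobs, key=lambda b: b.time_created, default=None)
if latest is not None:
    print(latest.name, latest.time_created)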
Supposing you do not have this information tracked somewhere, you can have a crawler on GCS which iterates the files based on a prefix.
Every time you request the next page of items you can specify a file (key) as the offset. Since listing is a lexicographically ordered operation, you should eventually reach the last file.
The format you use helps a lot since the combinations are limited (days).
Be aware that when using the list-items operations, the pages you retrieve list the items as stored at the point in time you started the listing (behind the scenes, BigTable is probably used for the listing of the files).
This does not block you, though: you can always initiate a new list-objects operation based on an offset, thus getting something fresh.
Here's a Go example that uses the offset:
storage.Query{Prefix: prefix, StartOffset: offSet}
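For the Python client, a rough equivalent (assuming a recent google-cloud-storage release that supports start_offset; the offset value is an example key, not from the question):
from google.cloud import storage

client = storage.Client()
# start_offset skips everything lexicographically before the given key.
blobs = client.list_blobs("name", prefix="neth/file_",
                          start_offset="neth/file_240101_")
for blob in blobs:
    print(blob.name)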

Add a unique File ID to PDF documents received

I need to track PDF documents RECEIVED. I can keep a list of the documents in a database, however sometimes the documents get renamed or moved, so the file path to the PDF is not always reliable.
For other document types, I sometimes add a unique ID as metadata, so that I can recognize that a file that was moved and/or renamed is the same as one seen previously.
I am looking for a solution that will work on Windows 10, and would prefer a solution based on Node.js, although Python would also be acceptable.
The documents are received from many different sources, and I do not have the option of requiring the source of the document to add a unique identifier.
I have used IPTCInfo in this way for media files, but (as far as I know) that can't be used with PDFs.
I am looking for something similar that can be used with PDFs.
Use md5sum:
import subprocess

def check_md5sum(file_path):
    # os.system only returns the exit status; capture stdout instead.
    # md5sum prints "<hash>  <filename>", so keep just the hash.
    output = subprocess.check_output(['md5sum', file_path], text=True)
    return output.split()[0]
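Since the question targets Windows 10, where an md5sum binary may not be available, a portable sketch that hashes the file contents directly with hashlib would be:
import hashlib

def file_md5(file_path):
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        # Read in chunks so large PDFs don't have to fit in memory.
        for chunk in iter(lambda: f.read(8192), b''):
            md5.update(chunk)
    return md5.hexdigest()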

Read Latest File from Google Cloud Storage Bucket Using Cloud Function

The issue I'm facing is that Cloud Storage sorts newly added files lexicographically (alphabetical order), while I'm reading the file placed at index 0 in the Cloud Storage bucket using its Python client library in a Cloud Function (using a Cloud Function is a must as part of my project) and putting the data into BigQuery. This works fine for me, but the newly added file does not always appear at index 0.
The streaming files enter in my bucket every day at different times.
The filename format is the same (data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt), but the date and time fields in the file name differ in every newly added file.
How can I adjust this python code to read the latest added file in Cloud Storage bucket every time the cloud function is triggered?
files = bucket.list_blobs()
fileList = [file.name for file in files if '.' in file.name]
blob = bucket.blob(fileList[0]) #reading file placed at index 0 in bucket
If your Cloud Function is triggered by HTTP, then you could substitute it with one that uses Google Cloud Storage Triggers. If it is already triggered that way, then you only need to take advantage of it.
Any time the function is triggered, you could check for the event type and do whatever with the data, like:
from google.cloud import storage

storage_client = storage.Client()

def hello_gcs_generic(data, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    See https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python
    """
    if context.event_type == storage.notification.OBJECT_FINALIZE_EVENT_TYPE:
        print('Created: {}'.format(data['timeCreated']))  # here for illustration purposes
        print('Updated: {}'.format(data['updated']))
        blob = storage_client.get_bucket(data['bucket']).get_blob(data['name'])
        # TODO: whatever else is needed with the blob
This way, you don't care about when the object was created. You know that whenever one is created, your client library code fetches the corresponding blob and you do whatever you want with it.
If your goal is to process each and every one (or most) of the uploaded files, #fhenrique's answer is a better approach.
But if your processing is rather sparse in comparison with the rate at which the files are uploaded (or simply if your requirements don't allow you to switch to the suggested Cloud Storage trigger), then you need to take a closer look at why your expectation of finding the most recently uploaded file at index 0 is not met.
The first reason that comes to mind is your file naming convention. For example, let's assume 2 such files: data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt and data-2019-10-18T14_25_00.000Z-2019-10-18T14_30_00.txt. Their lexicographic order would be:
['data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt',
'data-2019-10-18T14_25_00.000Z-2019-10-18T14_30_00.txt']
Note that the most recently uploaded file is actually the last one in the list, not the first one. So all you'd have to do is to replace index 0 with index -1.
A few other possible things/reasons to consider (try printing fileList to confirm/deny these theories):
the file you expect to find in the index -1 position isn't actually completely uploaded and finalized. I'm unsure if there is anything you can do in this case - it's simply a matter of managing expectations
the list of files returned isn't actually lexicographically sorted (for whatever reason). I see the sorting being mentioned at Listing Objects, but not at the Storage Client API documentation. Explicitly sorting fileList before picking the file at index -1 should take care of that, if needed.
having files in that bucket which do not follow the mentioned naming rule (for whatever reason) - any such file with a name that positions it after the most recently uploaded file will completely break your algorithm going forward. To protect against such a case you could use the prefix and maybe the delimiter optional arguments to bucket.list_blobs() to filter the results as needed. From the above-mentioned API doc:
prefix (str) – (Optional) prefix used to filter blobs.
delimiter (str) – (Optional) Delimiter, used with prefix to emulate hierarchy.
Such filtering can also be useful to limit the number of entries you get in the list based on the current date/time, which might significantly speed up your function execution, especially if there are many such files uploaded (your naming scheme suggests there can be a whole lot of them); see the sketch below.
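Putting those pieces together, a minimal sketch (the bucket name is a placeholder; the prefix assumes the naming convention from the question):
from datetime import datetime, timezone
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")  # placeholder bucket name

# Narrow the listing to today's files based on the naming convention, then
# sort explicitly and take the lexicographically last (most recent) one.
prefix = "data-{}".format(datetime.now(timezone.utc).strftime("%Y-%m-%d"))
fileList = sorted(blob.name for blob in bucket.list_blobs(prefix=prefix)
                  if '.' in blob.name)
if fileList:
    blob = bucket.blob(fileList[-1])  # most recently uploaded file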
