Recently, I've been working on a Python script that prints a blob's name if it matches a keyword specified by the user. There are several blob containers in the storage account.
Everything seems to work fine, except for one blob container. This container contains roughly 1,100,000 blobs and it takes my Python script approximately 23 minutes to scan through all the blobs and check for a match. I am fairly new to working with Azure in Python, and I was wondering if there is any possible way to speed up the process of printing blob names in a blob container.
The following code is how I am currently printing out the blob names:
next_marker = None
while True:
generator = container_client.list_blobs(marker=next_marker)
for item in generator:
if search_keyword in item.name:
print("Container: {0}, Blob: {1}\n".format(container_client.container_name, item.name))
# Using next_marker to get continuous token and the rest of the blob result
if not next_marker:
break
next_marker = generator.next_marker
I understand that there are a lot of blobs in this container, and 23 minutes to print all 1,100,000 blob names is reasonable. But, if anyone has any suggestions or knowledge on possibly speeding up the process, please let me know.
Related
So I have a a lot of data in an Azure blob storage. Each user can upload some cases and the end result can be represented as a series of panda dataframes. Now I want to be able to display some of this data on our site, but the files are several hundreds of MB and there is no need to download all of it. What would be the best way to get part of the df?
I can make a folder structure in each blob storage containing the different columns in each df and perhaps a more more compact summery of the columns but I would like to keep it in one file if possible.
I could also set up a database containing the info but I like the structure as it is - completely separated in cases.
Originally I thought I could do it in hdf5 but it seems that I need to download the entire file from the blob storage to my API backend before I can run my python code on it. I would prefer if I could keep the hdf5 files and get the parts of the columns from the blob storage directly but as far as I can see that is not possible.
I am thinking this is something that has been solved a million times before but it is a bit out of my domain so I have not been able to find a good solution for it.
Check out the BlobClient of the Azure Python SDK. The download_blob method might suit your needs. Use chunks() to get an iterator which allows you to iterate of over the file in chunks. You can also set other parameters to assure that a chunk doesn't exceed a set size.
Currently, I am working on a Python application that searches for a blob in a container, given a keyword. My code for searching the blob is found below. When performing the search in very large blob containers, this current method is not very effective as it takes over 20 minutes to search for a blob (for a blob container containing ~ 1,100,000 blobs). In addition, my application 'freezes' and is not clickable until the search is finished.
I recently started reading about multi-threading, and starting thinking about how it could be used in my application to speed up the search process. Since my current search is using a single thread, would it somehow be possible to use multiple threads to complete the search?
An idea I currently have is to somehow get the total count of blobs that the generator holds, and assign one half of it to one thread to search, and assign the other half to another thread to search. So in the end, multiple threads would be performing the search to ultimately complete the entire search faster. Any ideas, tips or recommendations would be most helpful.
next_marker = None
while True:
generator = container_client.list_blobs(marker=next_marker)
for item in generator:
if search_keyword in item.name:
print("Container: {0}, Blob: {1}\n".format(container_client.container_name, item.name))
# Using next_marker to get continuous token and the rest of the blob result
if not next_marker:
break
next_marker = generator.next_marker
Not a perfect solution but one possible way to parallelize the listing operation is to make use of blob prefix. Assuming your blob names start with alpha-numeric characters (a-z, A-Z and 0-9), what you could do is do a blob prefix search in parallel where in each thread you search for blobs names of which start with certain prefix ("a", "b", .... etc.).
You would use list_blobs with name_starts_with parameter and provide the prefix there.
Other option would be to make use of Azure Cognitive Search and create an Index which makes use of Azure Blob Storage kind of Data Source. It will be much faster however you are using a different search all together to do blob search.
The issue I'm facing is that Cloud Storage sorts newly added files lexicographically (Alphabetical Order) while I'm reading a file placed at index 0 in Cloud Storage bucket using its python client library in Cloud Functions (using cloud function is must as a part of my project) and put the data in BigQuery which is working fine for me but the newly added file do not always appear at index 0.
The streaming files enter in my bucket every day at different times.
The filename is same (data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt) but the date and time field in file name differ in every newly added file.
How can I adjust this python code to read the latest added file in Cloud Storage bucket every time the cloud function is triggered?
files = bucket.list_blobs()
fileList = [file.name for file in files if '.' in file.name]
blob = bucket.blob(fileList[0]) #reading file placed at index 0 in bucket
If the Cloud Function you have is triggered by HTTP then you could substitute it with one that uses Google Cloud Storage Triggers. If it was already then you only need to take advantage of it.
Any time the function is triggered, you could check for the event type and do whatever with the data, like:
from google.cloud import storage
storage_client = storage.Client()
def hello_gcs_generic(data, context):
"""Background Cloud Function to be triggered by Cloud Storage.
check more in https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python
"""
if context.event_type == storage.notification.OBJECT_FINALIZE_EVENT_TYPE:
print('Created: {}'.format(data['timeCreated'])) #this here for illustration purposes
print('Updated: {}'.format(data['updated']))
blob = storage_client.get_bucket(data['bucket']).get_blob(data['name'])
#TODO whatever else needed with blob
This way, you don't care about when the object was created. You know that when is created your client library code fetches the correspondent blob and you do whatever you want with it.
If your goal is to process each and every one (or most) of the uploaded files #fhenrique's answer is a better approach.
But if your processing is rather sparsely in comparison with the rate at which the files are uploaded (or simply if your requirement doesn't allow you to switch to the suggested Cloud Storage trigger) then you need to take a closer look at why your expectation to find the most recently uploaded file in the index 0 position is not met.
The first reason that comes to mind is your file naming convention. For example let's assume 2 such files: data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt and data-2019-10-18T14_25_00.000Z-2019-10-18T14_30_00.txt. Their
lexicographic order would be:
['data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt',
'data-2019-10-18T14_25_00.000Z-2019-10-18T14_30_00.txt']
Note that the most recently uploaded file is actually the last one in the list, not the first one. So all you'd have to do is to replace index 0 with index -1.
A few other possible things/reasons to consider (try printing fileList to confirm/deny these theories):
the file you expect to find in the index -1 position isn't actually completely uploaded and finalized. I'm unsure if there is anything you can do in this case - it's simply a matter of managing expectations
the list of files returned isn't actually lexicographically sorted (for whatever reason). I see the sorting being mentioned at Listing Objects, but not at the Storage Client API documentation. Explicitly sorting fileList before picking the file at index -1 should take care of that, if needed.
having files in that bucket which do not follow the mentioned naming rule (for whatever reason) - any such file with a name positioning it after the more recently uploaded file will completely break your algorithm going forward. To protect against such case you could use the prefix and maybe the delimiter optional arguments to bucket.list_blobs() to filter the results as needed. From the above-mentioned API doc:
prefix (str) – (Optional) prefix used to filter blobs.
delimiter (str) – (Optional) Delimiter, used with prefix to emulate hierarchy.
Such filtering can also be useful to limit the number of entries you get in the list based on the current date/time, which might significantly speedup your function execution, especially if there are many such files uploaded (your naming suggestion suggests there can be a whole lot of them)
I have mounted a Blob Storage Account in to Databricks, and can access it fine, so i know that it works.
What i want to do though, is list out the names all of the files at a given path.. currently i'm doing this with:
list = dbutils.fs.ls('dbfs:/mnt/myName/Path/To/Files/2019/03/01')
df = spark.createDataFrame(list).select('name')
The issue i have though, is that it's exceptionally slow.. due to there being around 160,000 blobs at that location (storage explorer shows this as ~1016106592 bytes which is 1Gb!)
This surely can't be pulling down all this data, all i need/want is the filename..
Is blob storage my bottle neck, or can i (somehow) get Databricks to execute the command in parallel or something?
Thanks.
Per my experience and based on my understanding for Azure Blob Storage, all operations in SDK or others on Azure Blob Storage will be translated to REST API calling. So your dbutils.fs.ls calling is actually calling the related REST API List Blobs on a
Blob container.
Therefore, I'm sure the performance neck of your code is really affected by transfering the data of amount size of the XML response body of blobs list on Blob Storage to extract blob names to the list variable , even there is around 160,000 blobs.
Meanwhile, all blob names will be wrapped in many slices of XML response, and there is a MaxResults limit per slice, and to get next slice is depended on the NextMarker value of previous slice. The above reason is why to list blobs slow, and it can not be parallelism.
My suggestion for enhancing the efficiency of loading blob list is to cache the result of list blobs in advance, such as to generate a blob to write the blob list line by line. Considering for realtime update, you can try to use Azure Function with Blob Trigger to add the blob name record to an Append Blob when an event of blob creation happened.
Here is how I normally download a GCS file to local:
storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')
blob.download_to_filename('myBigFile.txt)
The files that I am working with a much, much larger than the allowable size/memory of the Cloud Functions (for example, several GBs to several TBs), so the above would not work for these large files.
Is there a simpler, "streaming" (see example 1 below) or "direct-access" (see example 2 below) way to work with GCS files in a Cloud Function?
Two examples of what I'd be looking to do would be:
# 1. Load it in chunks of 5GB -- "Streaming"
storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')
while True:
data = blob.download_to_filename('myBigFile.txt', chunk_size=5GB)
do_something(data)
if not data: break
Or:
# 2. Read the data from GCS without downloading it locally -- "Direct Access"
storage_client = storage.Client()
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('myBigFile.txt')
with blob.read_filename('myBigFile.txt') as f:
do_something(f)
I'm not sure if either of these are possible to do, but I'm leaving a few options of how this could work. It seems like the Streaming Option is supported, but I wasn't sure how to apply it to the above case.
You might be able to achieve something close to your #1 example using the Cloud Storage XML API.
There should not be a problem implementing it inside Cloud Functions since it's entirely based on standard HTTP requests.
You're probably looking for the GET Object request to Download an Object:
GET requests for objects can include a Range header as defined in the
HTTP 1.1 RFC to limit the scope of the returned data within the
object, but be aware that in certain circumstances the range
header is ignored.
That HTTP Range header appear to be usable to implement the "chunks" you're looking for (but as standalone requests, not in a "streaming" mode):
The range of bytes that you want returned in the response, or the
range of bytes that have been uploaded to the Cloud Storage system.
Valid Values
Any contiguous range of bytes.
Example
Range: bytes=0-1999 (first 2000 bytes)
Range: bytes=-2000 (last 2000 bytes)
Range: bytes=2000- (from byte 2000 to end of file)
Implementation Details
Cloud Storage does not handle complex disjoint ranges, but it does
support simple contiguous byte ranges. Also, byte ranges are
inclusive; that is, bytes=0-999 represent the first 1000 bytes in a
file or object. A valid and successful request will result in a 206
Partial Content response code. For more information, see the
specification.
Since the ranges would be static it's unlikely you'll be able to find range values exactly fit to make the chunks perfectly match the stored data "borders". So you may need to choose the chunks overlapping a bit, to be able to capture data which otherwise would be split across 2 chunks.
Note: I didn't try this, the answer is based solely on docs.
As of this writing, the standard Google Cloud Client library does not support stream-like up-/download.
Have a look at GCSFS. Caveat, you may need to implement a retry strategy in case connection gets lost.