updating and saving an existing blob/file in blobstore? - python

The docs are pretty clear about how to create/get a blob, but I can't find any reference to how to modify and save an existing blob.
Is this actually possible given the BlobInfo object?
https://developers.google.com/appengine/docs/python/blobstore/overview#Writing_Files_to_the_Blobstore

You cannot modify an existing blob.
You can use the Files API to read from an existing blob and write to a new blob.
If you don't want to use the Files API to read the existing blob then you can use a BlobReader.
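A minimal sketch of that read-then-write-a-new-blob pattern against the legacy Python SDK (the Files API in the linked docs was later deprecated in favour of Google Cloud Storage); the function name and mime type are illustrative only:

# Sketch: blobs are immutable, so read the old blob and write a new one.
from google.appengine.api import files
from google.appengine.ext import blobstore

def rewrite_blob(old_blob_key):
    # Read the existing blob's contents with a BlobReader.
    data = blobstore.BlobReader(old_blob_key).read()

    # ... modify `data` as needed ...

    # Write the modified bytes to a brand-new blob via the Files API.
    file_name = files.blobstore.create(mime_type='application/octet-stream')
    with files.open(file_name, 'a') as f:
        f.write(data)
    files.finalize(file_name)

    # The new blob gets its own key; the original blob is untouched.
    return files.blobstore.get_blob_key(file_name)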

I use http://tsssaver.1conan.com to save blobs. You put in your device info, and it sends a request for all currently signed firmwares.
You do not need to be on a version to save blobs for it; blobs are just small files from Apple's servers that use some of your device's info and Apple's encryption key to verify a restore.
If you are hoping to save blobs for your current iOS version now that Apple no longer signs it, so that you can restore to it in the future, you are out of luck: you may be able to save the blob on your end, but you cannot obtain Apple's encryption key to verify and perform the restore later.

Related

Is there a way of getting part of a dataframe from Azure blob storage

So I have a lot of data in Azure blob storage. Each user can upload some cases, and the end result can be represented as a series of pandas dataframes. Now I want to be able to display some of this data on our site, but the files are several hundred MB and there is no need to download all of it. What would be the best way to get part of the df?
I could make a folder structure in each blob container holding the different columns of each df, and perhaps a more compact summary of the columns, but I would like to keep it in one file if possible.
I could also set up a database containing the info but I like the structure as it is - completely separated in cases.
Originally I thought I could do it in hdf5 but it seems that I need to download the entire file from the blob storage to my API backend before I can run my python code on it. I would prefer if I could keep the hdf5 files and get the parts of the columns from the blob storage directly but as far as I can see that is not possible.
I am thinking this is something that has been solved a million times before but it is a bit out of my domain so I have not been able to find a good solution for it.
Check out the BlobClient of the Azure Python SDK. The download_blob method might suit your needs. Use chunks() to get an iterator that allows you to iterate over the file in chunks. You can also set other parameters to ensure that a chunk doesn't exceed a set size.
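A minimal sketch with the azure-storage-blob v12 SDK; the connection string, container and blob names are placeholders:

# Stream part of a large blob instead of downloading all of it.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",          # placeholder
    container_name="cases",                  # placeholder
    blob_name="case-123/data.h5",            # placeholder
    max_chunk_get_size=4 * 1024 * 1024,      # cap the size of each downloaded chunk
)

# Iterate over the blob chunk by chunk.
downloader = blob.download_blob()
for chunk in downloader.chunks():
    print(len(chunk))  # process each bytes chunk here

# Or download only a byte range, e.g. the first 4 MiB.
partial = blob.download_blob(offset=0, length=4 * 1024 * 1024).readall()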

Updating an Azure Block Blob in a datalake without using append while having it under a lease

I have an Azure Function that writes to a parquet file in a Gen 2 Data Lake. It needs to append a parquet record on each execution.
When I try to use an Append Blob, I receive an error that append blobs are not supported with my Data Lake setup (Hierarchical Namespace).
My alternative was to obtain a lease on the blob, read the contents out, append my record, then re-upload the blob under the lease and release the lease. However, this does not work, because the blob client cannot upload a blob under a lease, so I risk changes getting overwritten during high-volume times.
I need a way to safely edit a block blob's contents without the risk of losing or overwriting changes.
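For reference, a rough sketch of the lease-then-reupload cycle described above, using the azure-storage-blob v12 SDK; names and the record format are placeholders, and whether the final upload is accepted under your Hierarchical Namespace setup is an assumption to verify:

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",   # placeholder
    container_name="my-container",    # placeholder
    blob_name="records.parquet",      # placeholder
)

# Take a short lease so no other writer can sneak in between the read and the write.
lease = blob.acquire_lease(lease_duration=15)
try:
    current = blob.download_blob(lease=lease).readall()
    updated = current + b"<new record bytes>"  # placeholder: real parquet appends need a library
    # upload_blob takes a lease argument, so the write happens under the lock we hold.
    blob.upload_blob(updated, overwrite=True, lease=lease)
finally:
    lease.release()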

Azure Blob Bindings with Azure Function (Python)

I currently have a process that reads from SQL, uses pandas and pd.ExcelWriter to format the data, and emails it out. I want my function to read from SQL (no problem) and write to a blob, then (using the SendGrid binding) attach that file from the blob and send it out.
My question is: do I need both an in binding (attaching for email) and an out binding (archiving to the blob) for that blob? Additionally, is this the simplest way to do this? It'd be nice to send the email and write to the blob as two unconnected operations instead of sequentially.
It also appears that, with the binding, I have to hard-code the name of the file in the blob path? That seems a little ridiculous; does anyone know a workaround, or perhaps I have misunderstood?
do I need both an in (attaching for email) and an out (archiving to the blob) binding for that blob?
Firstly, I don't think you can bind the blob as both input and output at the same time if it does not exist yet; if you try, you will find it returns an error. Also, you could send the mail directly with the content from SQL and write it to the blob in the same execution; there is no need to read the content back from the blob again.
I have to hard code the name of the file in the blob-path?
If you can accept a GUID or datetime blob name, you can bind the path with {rand-guid} or {DateTime} (you can format the time).
If you cannot accept such a binding, you can pass the blob path from the trigger body as JSON data; if you use another trigger, such as a queue trigger, you can likewise pass JSON data containing the path value. A rough sketch of the Python side follows below.
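For illustration, here is a sketch of the Python side of such a function; the binding names (msg, outputBlob, sendGridMessage), the blob path expression (e.g. reports/{rand-guid}.xlsx, or a path taken from the queue message JSON) and the SendGrid payload schema are assumptions that would have to match your function.json:

# Sketch: archive a report to a blob output binding and email it via a
# SendGrid output binding in the same execution. Binding names and paths
# are assumptions; they must match your function.json.
import base64
import io
import json

import azure.functions as func
import pandas as pd

def main(msg: func.QueueMessage,
         outputBlob: func.Out[bytes],
         sendGridMessage: func.Out[str]) -> None:
    # Build the Excel report (in practice this data would come from SQL).
    df = pd.DataFrame({"example": [1, 2, 3]})
    buf = io.BytesIO()
    with pd.ExcelWriter(buf) as writer:
        df.to_excel(writer, index=False)
    report = buf.getvalue()

    # Archive to blob storage via the output binding (path is set in function.json).
    outputBlob.set(report)

    # Email the same bytes as an attachment; the SendGrid binding is fed a JSON
    # message following the SendGrid v3 Mail API schema (an assumption here).
    sendGridMessage.set(json.dumps({
        "personalizations": [{"to": [{"email": "someone@example.com"}]}],
        "subject": "Report",
        "content": [{"type": "text/plain", "value": "Report attached."}],
        "attachments": [{
            "content": base64.b64encode(report).decode(),
            "filename": "report.xlsx",
            "type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        }],
    }))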

Databricks list all blobs in Azure Blob Storage

I have mounted a Blob Storage account into Databricks and can access it fine, so I know that it works.
What I want to do, though, is list out the names of all the files at a given path. Currently I'm doing this with:
list = dbutils.fs.ls('dbfs:/mnt/myName/Path/To/Files/2019/03/01')
df = spark.createDataFrame(list).select('name')
The issue I have, though, is that it's exceptionally slow, due to there being around 160,000 blobs at that location (Storage Explorer shows this as ~1016106592 bytes, which is about 1 GB!).
Surely this can't be pulling down all that data; all I need/want is the filenames.
Is blob storage my bottleneck, or can I (somehow) get Databricks to execute the command in parallel?
Thanks.
In my experience, and based on my understanding of Azure Blob Storage, all operations performed on Blob Storage through an SDK or otherwise are translated into REST API calls. So your dbutils.fs.ls call is actually calling the List Blobs REST API on a blob container.
Therefore, I'm sure the performance bottleneck of your code is the transfer of the XML response bodies of the blob listing from Blob Storage, just to extract the blob names into the list variable, given that there are around 160,000 blobs.
Meanwhile, the blob names are spread across many pages of the XML response; there is a MaxResults limit per page, and getting the next page depends on the NextMarker value of the previous one. This is why listing blobs is slow, and why it cannot be parallelized.
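To illustrate that paging behaviour, here is a sketch with the azure-storage-blob v12 SDK (connection string and names are placeholders); each page corresponds to one List Blobs call, and the next call needs the continuation token returned by the previous one, which is why the listing is inherently sequential:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<connection-string>",   # placeholder
    container_name="mycontainer",     # placeholder
)

# One REST call per page (up to 5000 names each); the next page needs the
# previous page's continuation token (NextMarker).
names = []
pages = container.list_blobs(name_starts_with="Path/To/Files/2019/03/01/").by_page()
for page in pages:
    names.extend(blob.name for blob in page)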
My suggestion for improving the efficiency of loading the blob list is to cache the result of listing the blobs in advance, for example by generating a blob that stores the blob list line by line. To keep it up to date in real time, you can try using an Azure Function with a Blob Trigger that adds a record of the blob name to an Append Blob whenever a blob-creation event happens; a rough sketch follows below.
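A rough sketch of that caching idea, assuming an Azure Function with a blob trigger and the azure-storage-blob v12 SDK; the connection setting, container and index blob names are placeholders:

# Sketch: append the name of each newly created blob to an append blob that
# acts as a cached listing. Names and settings below are placeholders.
import os

import azure.functions as func
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobClient

def main(newblob: func.InputStream) -> None:
    index_blob = BlobClient.from_connection_string(
        os.environ["AzureWebJobsStorage"],   # assumed storage connection setting
        container_name="metadata",           # placeholder container
        blob_name="blob-index.txt",          # cached list of blob names
    )

    # Create the append blob once; ignore the error if it already exists.
    try:
        index_blob.create_append_blob()
    except ResourceExistsError:
        pass

    # Append one line per created blob, e.g. "mycontainer/2019/03/01/file.csv".
    index_blob.append_block((newblob.name + "\n").encode("utf-8"))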

Uploading a binary file to Google Drive using the Python API still does not show md5Checksum metadata

I need to check whether a file that I have uploaded to my Google Drive account needs to be updated against the original file I have on my local hard drive. For this purpose, I would like to use a hash such as an md5sum, so the md5Checksum metadata on files stored in Drive seems to be the best fit (barring that, I'd look at modification timestamps, but that is less appealing given the burden of dealing with date/time conversion). So, using the code below, with a known fileId:
fileId = 'whatever' # gotten from a previous upload using MediaFileUpload():
fi = drive_service.files().get(fileId=fileId).execute()
import pprint
pprint.pprint(fi)
md5Checksum = fi.get('md5Checksum')
print("\n{}\nmd5Checksum\n{}".format('-' * 80, md5Checksum))
it returns:
{u'id': u'whatever',
u'kind': u'drive#file',
u'mimeType': u'application/octet-stream',
u'name': u'the_name_of_that_file_I_uploaded'}
--------------------------------------------------------------------------------
md5Checksum
None
As indicated by https://stackoverflow.com/a/36623919/257924, the md5Checksum value will be available only for binary files. But since the mimeType shown above is application/octet-stream, shouldn't it be available? Isn't application/octet-stream equivalent to it being a "binary" file upload?
What I cannot do is simply download the file and run the md5sum Linux utility (or equivalent Python code) to check that it is different, because that defeats the whole point of checking whether the file actually needs to be updated (deleted and re-uploaded, a new version uploaded, etc.).
Update #1: This is not the same as Google Drive API 'md5Checksum' does not work on Python, as I am not using the list API to obtain the file ID. I am just using the ID I got from a previous upload to retrieve metadata on that file, and I am hoping to see some type of hash value describing its content, specifically on a file that I think (hope) is considered "binary".
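One hedged aside that may explain the output above: with v3 of the Drive API, files().get() returns only a small default set of fields (id, name, mimeType, kind), so md5Checksum has to be requested explicitly via the fields parameter, for example:

# Request md5Checksum explicitly; v3 omits it from the default field set.
fi = drive_service.files().get(
    fileId=fileId,
    fields='id, name, mimeType, md5Checksum',
).execute()
print(fi.get('md5Checksum'))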
