Hi, I'm using Python with the Azure SDK to download files from a storage blob. The following code is what I use:
from azure.storage.blob import BlockBlobService

BLOB_SERVICE = BlockBlobService(account_name=AZURE_BLOB_SERVICE_ACCOUNT_NAME, account_key=AZURE_BLOB_SERVICE_ACCOUNT_KEY)
BLOB_SERVICE.get_blob_to_path(
    guid,   # container name
    name,   # blob name
    path,   # local file path to write to
)
The download works, but Azure or the SDK decompresses my gzipped files on the fly when they are fetched. I need the files to stay gzipped and would prefer to download them exactly as they are stored. Is there any way to turn off this behaviour?
In my experience, your issue is caused by the blob's properties. You can check them in the portal; you need to set Content-Encoding to null (empty).
I tested your code and the .gz file was downloaded normally.
If I set Content-Encoding = gzip, which corresponds to my file format, the .gz file is decompressed when it is fetched, the same as in your case. You could refer to this doc.
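If you have many blobs to fix, you can also clear the Content-Encoding property programmatically instead of editing each blob in the portal. Below is a minimal sketch using the same legacy SDK as in the question; the account, container, and blob names are placeholders, and note that set_blob_properties replaces all of the blob's HTTP settings, so any other settings you rely on (content type, cache control, etc.) should be set again in the same call.

from azure.storage.blob import BlockBlobService, ContentSettings

blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')

# Overwrite the blob's content settings, leaving content_encoding unset
# so the service no longer advertises gzip and the download is returned as stored.
blob_service.set_blob_properties(
    'mycontainer',
    'myfile.csv.gz',
    content_settings=ContentSettings(content_type='application/gzip'),
)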
I want to access files of type .dcm (DICOM) stored in a container on ADLS Gen2 from a PySpark notebook on Azure Synapse Analytics. I'm using pydicom to access the files, but I'm getting an error that the file does not exist. Please have a look at the code below.
To create the file path I'm using pathlib:
Path(path_to_dicoms_dir).joinpath('stage_2_train_images/%s.dcm' % pid)
where pid is the id of the dcm image.
To fetch the dcm image I'm using one of the following calls:
d = pydicom.read_file(data['dicom'])
OR
d = pydicom.dcmread(data['dicom'])
where data['dicom'] is the path.
I've checked the path and there is no issue with it; the file exists and all the access rights are there, since I'm accessing other files in the directory just above the one containing these dcm files. But those other files are CSVs, not dcm files.
Error:
FileNotFoundError: [Errno 2] No such file or directory: 'abfss:/#.dfs.core.windows.net//stage_2_train_images/stage_2_train_images/003d8fa0-6bf1-40ed-b54c-ac657f8495c5.dcm'
Questions that I have in my mind:
Is this the right storage solution for such image data? If not, shall I use blob storage instead?
Is it some issue with the pydicom library, and am I missing some setting to tell pydicom that this is an ADLS link?
Or should I change the approach entirely and use Databricks to run my notebooks?
Or can someone help me with this issue?
Is this the right storage solution for such image data? If not, shall I use blob storage instead?
The ADLS Gen2 storage account works perfectly fine with Synapse, so there is no need to use blob storage.
It seems like pydicom is not handling the path correctly.
You need to mount the ADLS Gen2 account in Synapse so that pydicom treats the path as an attached drive instead of a URL.
Follow the Microsoft tutorial How to mount Gen2/blob Storage to do the same.
You first need to create a Linked Service in Synapse that stores your ADLS Gen2 account connection details. Then use the code below in the notebook to mount the storage account:
mssparkutils.fs.mount(
    "abfss://mycontainer@<accountname>.dfs.core.windows.net",
    "/test",
    {"linkedService": "mygen2account"}
)
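After mounting, the files can be read through the local mount path rather than the abfss:// URL. Below is a minimal sketch of how that might look, assuming the mount above succeeded and that getMountPath is available in your Synapse runtime; the container name and the pid variable are taken from the question, everything else is a placeholder.

from pathlib import Path
import pydicom
from notebookutils import mssparkutils  # pre-installed in Synapse notebooks

# Resolve the local filesystem path of the mount point created above
local_root = mssparkutils.fs.getMountPath("/test")

# Build a local-style path and read the DICOM file with pydicom
# (pid is the image id, as in the question)
dcm_path = Path(local_root).joinpath("stage_2_train_images/%s.dcm" % pid)
d = pydicom.dcmread(str(dcm_path))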
In my current project, my objective is to access video files (in mp4 format) from an AWS S3 bucket.
I have created an S3 bucket named videostreambucketpankesh. This is a public bucket with the following permissions:
The Access Control list (ACL) of videostreambucketpankesh bucket is as follows:
The bucket policy of videostreambucketpankesh bucket is as follows:
Now the bucket "videostreambucketpankesh" contains several subfolders (prefixes), including one named "video". This subfolder contains some .mp4 files (as shown in the image below).
My problem is that some files (such as firetruck.mp4 and ambulance.mp4) can be accessed directly in the browser when I click their object URL, and I can play them in the browser.
However, I am not able to play the other .mp4 files (39cf9079-7b65-4aa8-8913-8a6b924021d3.mp4, 45fd1749-95aa-488c-ac2f-be8673b8416e.mp4, 8ba187f2-5148-49f6-9acc-2459e41f547b.mp4) in the browser when I click their object URL.
Please note that I uploaded these .mp4 video files programmatically with a Python program (see the following code).
def upload_to_s3(local_file, bucket, s3_file):
    # frame_id, s3_bucket and s3_client are defined elsewhere in the script
    with open(local_file, 'rb') as data:
        s3_client.put_object(Key="video/" + frame_id + ".mp4", Body=data, ContentType='video/mp4', Bucket=s3_bucket)
    print("Upload successful")
However, I am not able to play these mp4 files in my Google Chrome browser (they play fine in VLC). Can you please suggest how I can resolve this issue?
Select the files and look at Properties / Metadata.
It should show Content-Type : video/mp4 like this:
When uploading via the browser, the metadata is automatically set based upon the filetype.
If you are uploading via your own code, you can set the metadata like this:
s3_client.upload_file('video.mp4', bucketname, key, ExtraArgs={'ContentType': "video/mp4"})
or
bucket.put_object(Key=key, Body=data, ContentType='video/mp4')
See: AWS Content Type Settings in S3 Using Boto3
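For the objects that are already in the bucket, you don't need to re-upload them; the metadata can be rewritten in place with a self-copy. A minimal sketch, using one of the keys from the question (adjust bucket and key as needed):

import boto3

s3_client = boto3.client("s3")

# Copy the object onto itself, replacing its metadata so that
# Content-Type becomes video/mp4.
s3_client.copy_object(
    Bucket="videostreambucketpankesh",
    Key="video/39cf9079-7b65-4aa8-8913-8a6b924021d3.mp4",
    CopySource={"Bucket": "videostreambucketpankesh", "Key": "video/39cf9079-7b65-4aa8-8913-8a6b924021d3.mp4"},
    ContentType="video/mp4",
    MetadataDirective="REPLACE",
)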
I'm using django-storages as my default file storage. I have a script that uploads videos directly from the client to Google Cloud Storage. Is there any way to associate such a file with a FileField without downloading and re-uploading it? Thank you for any help!
A FileField is just a pimped CharField, usually persisting a path (depending on the storage). Find out what path should be stored for your uploaded files (e.g. by looking at other files uploaded through your app). You can get the raw values using values_list:
Model.objects.filter(pk=known_instance.pk).values_list('file_field_name')
and set it directly for your manually uploaded files:
instance.file_field_name = 'required/raw/value/for/resource'
instance.save()
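Putting it together, a minimal sketch might look like the following. The Video model, its field name, and the object path are hypothetical; the assumption is that the client has already uploaded the object to the bucket that django-storages points at.

# Hypothetical model: the FileField just stores the relative path
# within the storage backend (here, the GCS bucket).
from myapp.models import Video

video = Video.objects.get(pk=42)
# The client already uploaded this object directly to the bucket,
# so only the path needs to be recorded on the model.
video.file = "uploads/videos/clip-42.mp4"
video.save(update_fields=["file"])

# Reading it back goes through django-storages as usual
print(video.file.url)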
I have Python code for data processing and I want to use an Azure block blob as the data input, to be specific, a CSV file from block blob storage. Downloading the CSV file from Azure blob storage to a local path, and uploading the other way around, works fine when the code runs locally. The problem is that my code runs on an Azure virtual machine, because it is too heavy for my MacBook Air, so pandas read_csv from a local path does not work in this case. Therefore I have to download, upload, and update CSV files in Azure storage as streams, without saving locally. Both the downloaded and uploaded CSVs are quite small, much less than the block blob limits.
There isn't much in the way of tutorials explaining how to do this step by step, and the MS docs aren't great at explaining it either. My minimal code is as follows:
For downloading from Azure blob storage:
from azure.storage.blob import BlockBlobService
import pandas as pd

storage = BlockBlobService(account_name='myname', account_key='mykey')
# here I don't know how to make a csv stream that could be used in the next steps
file = storage.get_blob_to_stream('accountname', 'blobname', 'stream')
df = pd.read_csv(file)
# df for later steps
For uploading to and updating a blob with a dataframe generated by the code, by stream:
df  # dataframe generated by code
# I don't know how to do the preparation steps for df and the final upload operation
storage.put_blob_to_list_by_stream('accountname', 'blobname', 'stream')
Can you please make a step-by-step tutorial for me? For people with experience with Azure blobs this should not be very difficult.
Or, if you have a better solution than using blobs for my case, please drop some hints. Thanks.
So, the documentation is still in progress; I think it is getting better and better...
Useful links:
Github - Microsoft Azure Storage SDK for Python
Quickstart: Upload, download, and list blobs using Python
To download a file as a stream from blob storage, you can use BytesIO:
from azure.storage.blob import BlockBlobService
from io import BytesIO
from shutil import copyfileobj
with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my_account_name', account_key='my_account_key')
        # Download as a stream
        block_blob_service.get_blob_to_stream('mycontainer', 'myinputfilename', input_blob)
        input_blob.seek(0)  # rewind before reading the downloaded bytes
        # Do whatever you want to do - here I am just copying the input stream to the output stream
        copyfileobj(input_blob, output_blob)
        ...
        output_blob.seek(0)  # rewind before uploading
        # Create a new blob
        block_blob_service.create_blob_from_stream('mycontainer', 'myoutputfilename', output_blob)
        # Or update the same blob
        block_blob_service.create_blob_from_stream('mycontainer', 'myinputfilename', output_blob)
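Since the goal in the question is a pandas round trip, here is a minimal sketch of that flow with the same SDK; the account, container, and blob names are placeholders.

import pandas as pd
from io import BytesIO
from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='my_account_name', account_key='my_account_key')

# Download the CSV blob into memory and read it with pandas
stream = BytesIO()
block_blob_service.get_blob_to_stream('mycontainer', 'input.csv', stream)
stream.seek(0)
df = pd.read_csv(stream)

# ... process df ...

# Serialize the DataFrame back to CSV text and upload it, overwriting the target blob
block_blob_service.create_blob_from_text('mycontainer', 'output.csv', df.to_csv(index=False))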
I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me; for reasons outside the scope of my question, we want it to run in Python). In Kotlin, the Google SDK lets me get an InputStream from the Blob object, which I can then pass into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems I only have the options of downloading to a file or as a string.
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content to a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution, as apparently download_to_file downloads the data into a file-like object that the boto3 S3 client can handle.
This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.
from io import BytesIO

import boto3
from google.cloud import storage

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    # Buffer the whole object in memory, then rewind before uploading
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone has information about something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3 without having to buffer it in memory on the host machine), it would be very much appreciated.
google-resumable-media can be used to download the file in chunks from GCS, and smart_open can upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
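For completeness, a minimal sketch of that chunked approach, assuming a recent smart_open version and AWS credentials available in the environment; bucket and object names are placeholders. blob.download_to_file writes the object to the target file-like object as it arrives, and smart_open flushes multipart parts to S3 as its buffer fills, so the whole object never sits in memory at once.

from google.cloud import storage
from smart_open import open as smart_open

def stream_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    blob = gcs_client.get_bucket(gcs_bucket).blob(gcs_filename)

    # smart_open returns a writable file-like object that performs a
    # multipart upload to S3 under the hood.
    s3_uri = "s3://%s/%s" % (s3_bucket, s3_filename)
    with smart_open(s3_uri, "wb") as s3_file:
        # The GCS client writes the downloaded bytes straight into the
        # S3 writer instead of an in-memory buffer.
        blob.download_to_file(s3_file)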