Download file from Azure Storage - python

The official snippet code for downloading a blob from Microsoft Docs is:
# Download the blob to a local file
# Add 'DOWNLOAD' before the .txt extension so you can see both files in the data directory
download_file_path = os.path.join(local_path, str.replace(local_file_name ,'.txt', 'DOWNLOAD.txt'))
blob_client = blob_service_client.get_container_client(container= container_name)
print("\nDownloading blob to \n\t" + download_file_path)
with open(download_file_path, "wb") as download_file:
download_file.write(blob_client.download_blob(blob.name).readall())
The problem is that readall reads the blob content to the memory. Giant blobs (hundreds of Gigabytes) cannot be held in memory.
I didn't find a way to download a blob directly to the file (can use a buffer internally, but not hold all the file's content). Is there any way to do so?

In case of large blobs, you would want to use download_blob method in BlobClient. This method allows you to read a range of data as a stream.
The way this would work is you call this method multiple times and each time when you call it, you would set the offset and length parameter to a different value.
For example, let's say your blob size is 10MB and you want to download it in chunks of 1MB, when you call this method first time, you will set the offset as 0 and length as 1048576 (1 MB) and you will get a stream of 1MB data. Next time you call this method, you set the offset as 1048576 and so on.
This way you will be able to progressively download a blob.

Related

Google Cloud Storage streaming upload from Python generator

I have a Python generator that will yield a large and unknown amount of byte data. I'd like to stream the output to GCS, without buffering to a file on disk first.
While I'm sure this is possible (e.g., I can create a subprocess of gsutil cp - <...> and just write my bytes into its stdin), I'm not sure what's a recommended/supported way and the documentation gives the example of uploading a local file.
How should I do this right?
The BlobWriter class makes this a bit easier:
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')
writer = BlobWriter(blob)
for d in your_generator:
writer.write(d)
writer.close()

Streaming decompression of S3 gzip source object to a S3 destination object using python?

Given a large gzip object in S3, what is a memory efficient (e.g. streaming) method in python3/boto3 to decompress the data and store the results back into another S3 object?
There is a similar question previously asked. However, all of the answers use a methodology in which the contents of the gzip file are first read into memory (e.g. ByteIO). These solutions are not viable for objects that are too big to fit in main memory.
For large S3 objects the contents need to be read, decompressed "on the fly", and then written to a different S3 object is some chunked fashion.
Thank you in advance for your consideration and response.
You can use streaming methods with boto / s3 but you have to define your own file-like objects AFAIK.
Luckily there's smart_open which handles that for you; it also supports GCS, Azure, HDFS, SFTP and others.
Here's an example using a large sample of sales data:
import boto3
from smart_open import open
session = boto3.Session() # you need to set auth credentials here if you don't have them set in your environment
chunk_size = 1024 * 1024 # 1 MB
f_in = open("s3://mybucket/2m_sales_records.csv.gz", transport_params=dict(session=session), encoding="utf-8")
f_out = open("s3://mybucket/2m_sales_records.csv", "w", transport_params=dict(session=session))
byte_count = 0
while True:
data = f_in.read(chunk_size)
if not data:
break
f_out.write(data)
byte_count += len(data)
print(f"wrote {byte_count} bytes so far")
f_in.close()
f_out.close()
The sample file has 2 million lines and it's 75 MB compressed and 238 MB uncompressed.
I uploaded the compressed file to mybucket and ran the code which downloaded the file, extracted the contents in memory and uploaded the uncompressed data back to S3.
On my computer the process took around 78 seconds (highly dependent on Internet connection speed) and never used more than 95 MB of memory; I think you can lower the memory requirements if need be by overriding the part size for S3 multipart uploads in smart_open.
DEFAULT_MIN_PART_SIZE = 50 * 1024**2
"""Default minimum part size for S3 multipart uploads"""
MIN_MIN_PART_SIZE = 5 * 1024 ** 2
"""The absolute minimum permitted by Amazon."""

Azure Blobstore: How can I read a file without having to download the whole thing first?

I'm trying to figure out how to read a file from Azure blob storage.
Studying its documentation, I can see that the download_blob method seems to be the main way to access a blob.
This method, though, seems to require downloading the whole blob into a file or some other stream.
Is it possible to read a file from Azure Blob Storage line by line as a stream from the service? (And without having to have downloaded the whole thing first)
Update 0710:
In the latest SDK azure-storage-blob 12.3.2, we can also do the same thing by using download_blob.
The screenshot of the source code of download_blob:
So just provide an offset and length parameter, like below(it works as per my test):
blob_client.download_blob(60,100)
Original answer:
You can not read the blob file line by line, but you can read them as per bytes. Like first read 10 bytes of the data, next you can continue to read the next 10 to 20 bytes etc.
This is only available in the older version of python blob storage sdk 2.1.0. Install it like below:
pip install azure-storage-blob==2.1.0
Here is the sample code(here I read the text, but you can change it to use get_blob_to_stream(container_name,blob_name,start_range=0,end_range=10) method to read stream):
from azure.storage.blob import BlockBlobService, PublicAccess
accountname="xxxx"
accountkey="xxxx"
blob_service_client = BlockBlobService(account_name=accountname,account_key=accountkey)
container_name="test2"
blob_name="a5.txt"
#get the length of the blob file, you can use it if you need a loop in your code to read a blob file.
blob_property = blob_service_client.get_blob_properties(container_name,blob_name)
print("the length of the blob is: " + str(blob_property.properties.content_length) + " bytes")
print("**********")
#get the first 10 bytes data
b1 = blob_service_client.get_blob_to_text(container_name,blob_name,start_range=0,end_range=10)
#you can use the method below to read stream
#blob_service_client.get_blob_to_stream(container_name,blob_name,start_range=0,end_range=10)
print(b1.content)
print("*******")
#get the next range of data
b2=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=10,end_range=50)
print(b2.content)
print("********")
#get the next range of data
b3=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=50,end_range=200)
print(b3.content)
The accepted answer here may be of use to you. The documentation can be found here.

AWS Lambda and S3: passing s3 object path to image process function

My intention is to have a large image stored on my S3 server and then get a lambda function to read/process the file and save the resulting output(s). I'm using a package called python-bioformats to work with a proprietary image file (which is basically a whole bunch of tiffs stacked together). When I use
def lambda_handler(event, context):
import boto3
key = event['Records'][0]['s3']['object']['key'].encode("utf-8")
bucket = 'bucketname'
s3 = boto3.resource('s3')
imageobj = s3.Object(bucket, key).get()['Body'].read()
bioformats.get_omexml_metadata(imageobj)
I have a feeling that the lambda function tries to download the entire file (5GB) when making imageobj. Is there a way I can just get the second function (which takes a filepath as argument) to refer to the s3 object in a filepath-like manner? I'd also like to not expose the s3 bucket/object publicly, so doing this server-side would be ideal.
If your bioformats.get_omexml_metadata() function requires a filepath as an argument, then you will need to have the object downloaded before calling the function.
This could be a problem in an AWS Lambda function because there is a 500MB limit on available disk space (and only in /tmp/).
If the data can instead be processed as a stream, you could read the data as it is required without saving to disk first. However, the python-bioformats documentation does not show this as an option. In fact, I would be surprised if your above code works, given that it is expecting a path while imageobj is the contents of the file.

How to generate a Zip from a set of streams and producing a stream with the Zip data?

I have an app with manages a set of files, but those files are actually stored in Rackspace's CloudFiles, because most of the files will be ~100GB. I'm using the Cloudfile's TempURL feature to allow individual files, but sometimes, the user will want to download a set of files. But downloading all those files and generating a local Zip file is impossible since the server only have 40GB of disk space.
From the user view, I want to implement it the way GMail does when you get an email with several pictures: It gives you a link to download a Zip file with all the images in it, and the download is immediate.
How to accomplish this with Python/Django? I have found ZipStream and looks promising because of the iterator output, but it still only accepts filepaths as arguments, and the writestr method would need to fetch all the file data at once (~100GB).
Since Python 3.5 it is possible to create zip chunks stream of huge files/folders. You can use the unseekable stream. So no need to use ZipStream now.
See my answer here.
And live example here: https://repl.it/#IvanErgunov/zipfilegenerator
If you don't have filepath, but have chunks of bytes you can exclude open(path, 'rb') as entry from example and replace iter(lambda: entry.read(16384), b'') with your iterable of bytes. And prepare ZipInfo manually:
zinfo = ZipInfo(filename='any-name-of-your-non-existent-file', date_time=time.localtime(time.time())[:6])
zinfo.compress_type = zipfile.ZIP_STORED
# permissions:
if zinfo.filename[-1] == '/':
# directory
zinfo.external_attr = 0o40775 << 16 # drwxrwxr-x
zinfo.external_attr |= 0x10 # MS-DOS directory flag
else:
# file
zinfo.external_attr = 0o600 << 16 # ?rw-------
You should also remember that the zipfile module writes chunks of its zipfile own size. So, if you send a piece of 512 bytes the stream will receive a piece of data only when and only with size the zipfile module decides to do it. It depends on the compression algorithm, but I think it is not a problem, because the zipfile module makes small chunks <= 16384.
You can use https://pypi.python.org/pypi/tubing. Here's an example using s3, you could pretty easily create a rackspace clouldfile Source. Create a customer Writer (instead of sinks.Objects) to stream the data some where else and custom Transformers to transform the stream.
from tubing.ext import s3
from tubing import pipes, sinks
output = s3.S3Source(bucket, key) \
| pipes.Gunzip() \
| pipes.Split(on=b'\n') \
| sinks.Objects()
print len(output)
Check this out - it's part of the Python Standard Library:
http://docs.python.org/3/library/zipfile.html#zipfile-objects
You can give it an open file or file-like-object.

Categories

Resources