Read ZIP file from cloud storage via the streaming library - python

I need to read a ZIP file stored in a Google Cloud Storage bucket from Airflow, without unzipping it first. I am using a library called stream-unzip.
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Iterable that yields the bytes of a zip file
    with httpx.stream('GET', "<google_bucket_file_path>") as r:
        yield from r.iter_bytes(chunk_size=65536)

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks(), password=b'my-password'):
    # unzipped_chunks must be iterated to completion or UnfinishedIterationError will be raised
    for chunk in unzipped_chunks:
        print(chunk)
The error I get when running the above code is:
stream_unzip.TruncatedDataError
The ZIP file contains a text file that has millions of lines. I need to stream these lines and process the data in each line on the fly. Airflow is deployed in a Cloud Composer environment.
Please suggest a good solution for this.
Thank you
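A minimal sketch of the line-streaming part, offered as one possible approach rather than a confirmed fix. The cause of TruncatedDataError is not visible from the question; one guess is that the HTTP request does not return the complete archive bytes (for example, if the URL needs authentication). The helper names read_chunks and iter_lines are hypothetical, and the commented wiring assumes the google-cloud-storage client library and working credentials are available in the environment.

```python
def read_chunks(fileobj, chunk_size=65536):
    """Yield successive byte chunks from a binary file-like object."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def iter_lines(chunks, encoding="utf-8"):
    """Yield decoded text lines from an iterable of byte chunks.

    Lines may span chunk boundaries, so incomplete data is kept in
    `pending` until the newline that ends the line arrives.
    """
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last newline is complete; the rest waits.
        *complete, pending = pending.split(b"\n")
        for raw in complete:
            yield raw.rstrip(b"\r").decode(encoding)
    if pending:
        # Final line without a trailing newline.
        yield pending.rstrip(b"\r").decode(encoding)


# Hypothetical wiring (assumes google-cloud-storage and stream-unzip are
# installed; the bucket and object names are placeholders):
#
# from google.cloud import storage
# from stream_unzip import stream_unzip
#
# blob = storage.Client().bucket("my-bucket").blob("path/to/file.zip")
# with blob.open("rb") as f:
#     for file_name, file_size, unzipped_chunks in stream_unzip(
#             read_chunks(f), password=b"my-password"):
#         for line in iter_lines(unzipped_chunks):
#             ...  # handle one line of text
```

Splitting on the newline byte before decoding means a multi-byte character that straddles two chunks is still decoded as part of a whole line.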

Related

Google Cloud Storage streaming upload from Python generator

I have a Python generator that will yield a large and unknown amount of byte data. I'd like to stream the output to GCS, without buffering to a file on disk first.
While I'm sure this is possible (e.g., I can create a subprocess of gsutil cp - <...> and just write my bytes into its stdin), I'm not sure what the recommended/supported way is, and the documentation only gives the example of uploading a local file.
How should I do this right?
The BlobWriter class makes this a bit easier:
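from google.cloud import storage
from google.cloud.storage.fileio import BlobWriter

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')
writer = BlobWriter(blob)
for d in your_generator:
    writer.write(d)
writer.close()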
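A small sketch of the same write loop as a reusable function, in case it helps. The name stream_to_writer is hypothetical; it only assumes the target has a write() method, so it can be exercised locally with io.BytesIO. The commented usage assumes blob.open("wb") from google-cloud-storage, which should close the writer when the with block exits.

```python
import io


def stream_to_writer(chunks, writer):
    """Write each bytes chunk to `writer` and return the total bytes written."""
    total = 0
    for chunk in chunks:
        writer.write(chunk)
        total += len(chunk)
    return total


# Local demonstration with an in-memory buffer standing in for the blob writer.
buffer = io.BytesIO()
written = stream_to_writer((b"ab", b"cde"), buffer)

# Hypothetical usage against GCS (assumes `blob` was created as above):
#
# with blob.open("wb") as writer:
#     stream_to_writer(your_generator, writer)
```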

Azure Blobstore: How can I read a file without having to download the whole thing first?

I'm trying to figure out how to read a file from Azure blob storage.
Studying its documentation, I can see that the download_blob method seems to be the main way to access a blob.
This method, though, seems to require downloading the whole blob into a file or some other stream.
Is it possible to read a file from Azure Blob Storage line by line as a stream from the service? (And without having to have downloaded the whole thing first)
Update 0710:
In the latest SDK azure-storage-blob 12.3.2, we can also do the same thing by using download_blob.
download_blob accepts offset and length parameters, so you can request just a byte range, as below (this worked in my test):
blob_client.download_blob(60,100)
Original answer:
You cannot read the blob line by line, but you can read it in byte ranges: for example, read the first 10 bytes, then continue with bytes 10 to 20, and so on.
This is only available in the older version of the Python Blob Storage SDK, 2.1.0. Install it as follows:
pip install azure-storage-blob==2.1.0
Here is the sample code (it reads the blob as text, but you can switch to the get_blob_to_stream(container_name,blob_name,start_range=0,end_range=10) method to read into a stream):
from azure.storage.blob import BlockBlobService, PublicAccess
accountname="xxxx"
accountkey="xxxx"
blob_service_client = BlockBlobService(account_name=accountname,account_key=accountkey)
container_name="test2"
blob_name="a5.txt"
# Get the length of the blob; useful if you need a loop to read the blob range by range.
blob_property = blob_service_client.get_blob_properties(container_name,blob_name)
print("the length of the blob is: " + str(blob_property.properties.content_length) + " bytes")
print("**********")
# Get the first bytes of the blob (start_range and end_range are both inclusive)
b1 = blob_service_client.get_blob_to_text(container_name,blob_name,start_range=0,end_range=10)
#you can use the method below to read stream
#blob_service_client.get_blob_to_stream(container_name,blob_name,start_range=0,end_range=10)
print(b1.content)
print("*******")
#get the next range of data
b2=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=10,end_range=50)
print(b2.content)
print("********")
#get the next range of data
b3=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=50,end_range=200)
print(b3.content)
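If you need to walk the whole blob in ranges, the offsets can be computed up front. This is only a sketch: byte_ranges is a hypothetical helper, and the commented usage assumes the azure-storage-blob 12.x BlobClient with its offset and length parameters as mentioned in the update above.

```python
def byte_ranges(total_length, chunk_size):
    """Yield (offset, length) pairs covering total_length bytes without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, total_length, chunk_size):
        # The last range is shortened so it does not run past the end.
        yield offset, min(chunk_size, total_length - offset)


# Hypothetical usage with the 12.x SDK (assumes an existing blob_client):
#
# size = blob_client.get_blob_properties().size
# for offset, length in byte_ranges(size, 4 * 1024 * 1024):
#     data = blob_client.download_blob(offset=offset, length=length).readall()
#     ...  # process this slice of bytes
```

Because each range starts where the previous one ended, no byte is read twice, which is easy to get wrong when the range bounds are inclusive.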
The accepted answer here may be of use to you. The documentation can be found here.

Using IO library to load string variable as a txt file to/from s3

I have old code below that gzips a file and stores it as JSON in S3, using the io library (so the file is never saved locally). I am having trouble adapting the same approach (i.e. using the io library for a buffer) to create a .txt file, push it to S3, and later retrieve it. I know how to create text files and push them to S3, but not how to use io in the process.
The value I want stored in the text file is just a variable holding the string 'test'.
Goal: Use the io library to save a string variable as a text file in S3 and be able to pull it down again.
import gzip
import io
import json

x = 'test'
inmemory = io.BytesIO()
with gzip.GzipFile(fileobj=inmemory, mode='wb') as fh:
    with io.TextIOWrapper(fh, encoding='utf-8', errors='replace') as wrapper:
        wrapper.write(json.dumps(x, ensure_ascii=False, indent=2))
inmemory.seek(0)
s3_resource.Object(s3bucket, s3path + '.json.gz').upload_fileobj(inmemory)
inmemory.close()
I would also appreciate pointers to any documentation on the io library and writing to files. The official documentation (e.g. f = io.StringIO("some initial text data"), https://docs.python.org/3/library/io.html) did not give me enough at my current level.
Duplicate.
For sake of brevity, it turns out there's a way to override the putObject call so that it takes in a string of text instead of a file.
The original post is answered in Java, but this additional thread should be sufficient for a Python-specific answer.
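For the io-based approach the question asks about, here is a minimal sketch, not taken from the linked threads. It assumes a boto3 S3 Object like the one in the question, using its upload_fileobj and download_fileobj methods; upload_text and download_text are hypothetical helper names.

```python
import io


def upload_text(s3_object, text, encoding="utf-8"):
    """Encode `text` into an in-memory buffer and upload it; nothing touches disk."""
    buffer = io.BytesIO(text.encode(encoding))
    s3_object.upload_fileobj(buffer)


def download_text(s3_object, encoding="utf-8"):
    """Download the object into an in-memory buffer and return it as a string."""
    buffer = io.BytesIO()
    s3_object.download_fileobj(buffer)
    return buffer.getvalue().decode(encoding)


# Hypothetical usage, reusing the names from the question:
#
# obj = s3_resource.Object(s3bucket, s3path + '.txt')
# upload_text(obj, 'test')
# print(download_text(obj))
```

io.BytesIO is used instead of io.StringIO because the upload and download calls work with binary file-like objects, so the string is encoded on the way in and decoded on the way out.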

JSON format validation without loading file

How can I validate JSON format without loading the whole file? I am copying files from one S3 bucket to another. After the JSONL files are copied, I want to check that the file format is correct, in the sense that curly braces and commas are in the right places.
I don't want to use json.load() because the files are large and numerous, so it would slow down the process. The files are already copied, so there is no need to parse them; validation is the only requirement.
There is no capability within Amazon S3 itself to validate the content of objects.
You could configure S3 to trigger an AWS Lambda function whenever a file is created in the S3 bucket. The Lambda function could then parse the file and perform some action (eg send a notification or move the object to another location) if the validation fails.
Streaming the file seems to be the way to go: wrap the download in a generator and yield it line by line, checking whether each line is valid JSON. The requests library supports streaming a file.
The solution would look something like this:
import requests

def get_data():
    # Stream the response so the whole file is never held in memory
    r = requests.get('s3_file_url', stream=True)
    yield from r.iter_lines()

def parse_data():
    # initialize generator
    gd_gen = get_data()
    while True:
        try:
            line = next(gd_gen)
        except StopIteration:
            break
        # put your validation code here
Let me know if you need further clarification.
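As a sketch of what the validation step could look like, assuming each line of the JSONL file is meant to be one complete JSON value: find_invalid_lines is a hypothetical helper, and it does parse each line with json.loads, but only one line at a time, so memory use stays bounded by the longest line.

```python
import json


def find_invalid_lines(lines, encoding="utf-8"):
    """Yield (line_number, error_message) for every line that is not valid JSON.

    `lines` may be any iterable of bytes or str, for example the output of
    requests' iter_lines(). Blank lines are skipped.
    """
    for number, raw in enumerate(lines, start=1):
        try:
            text = raw.decode(encoding) if isinstance(raw, bytes) else raw
            if text.strip():
                json.loads(text)
        except ValueError as exc:
            # Both JSON syntax errors and decoding errors are ValueError subclasses.
            yield number, str(exc)


# Hypothetical usage with the generator from the answer above:
#
# for number, message in find_invalid_lines(get_data()):
#     print(f"line {number}: {message}")
```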

Azure blob storage to JSON in azure function using SDK

I am trying to create a timer trigger azure function that takes data from blob, aggregates it, and puts the aggregates in a cosmosDB. I previously tried using the bindings in azure functions to use blob as input, which I was informed was incorrect (see this thread: Azure functions python no value for named parameter).
I am now using the SDK and am running into the following problem:
import sys, os.path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'myenv/Lib/site-packages')))
import json
import pandas as pd
from azure.storage.blob import BlockBlobService
data = BlockBlobService(account_name='accountname', account_key='accountkey')
container_name = ('container')
generator = data.list_blobs(container_name)
for blob in generator:
    print("{}".format(blob.name))
    json = json.loads(data.get_blob_to_text('container', open(blob.name)))
    df = pd.io.json.json_normalize(json)
    print(df)
This results in an error:
IOError: [Errno 2] No such file or directory: 'test.json'
I realize this might be an absolute path issue, but I'm not sure how that works with Azure storage. Any ideas on how to circumvent this?
Made it "work" by doing the following:
for blob in generator:
    loader = data.get_blob_to_text('kvaedevdystreamanablob',blob.name,if_modified_since=delta)
    json = json.loads(loader.content)
This works for ONE json file, i.e I only had one in storage, but when more are added I get this error:
ValueError: Expecting object: line 1 column 21907 (char 21906)
This happens even if I add if_modified_since so as to take in only one blob. Will update if I figure something out. Help is always welcome.
Another update: My data comes in through Stream Analytics and then lands in the blob. I had selected that the data should be written as arrays, which is why the error occurs. When the stream is terminated, the closing ] is not immediately appended to the end of the file, so the JSON file isn't valid. I will now try the line-separated output in Stream Analytics instead of array.
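The difference between the two output formats can be reproduced offline. This is an illustrative sketch with made-up sample content, not the actual blob data; parse_records is a hypothetical helper.

```python
import json


def parse_records(content):
    """Parse newline-delimited JSON text into a list, skipping blank lines."""
    return [json.loads(line) for line in content.splitlines() if line.strip()]


# Made-up content: an array whose closing bracket has not been written yet,
# and the same records written one JSON object per line.
truncated_array = '[{"id": 1},\n{"id": 2}'
line_separated = '{"id": 1}\n{"id": 2}\n'

try:
    json.loads(truncated_array)
    array_ok = True
except ValueError:
    # The unterminated array is rejected as a whole.
    array_ok = False

# Each complete line parses on its own, so no closing bracket is needed.
records = parse_records(line_separated)
```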
Figured it out. In the end it was quite a simple fix:
I had to make sure each JSON entry in the blob was less than 1024 characters, or it would create a new line, making it problematic to read line by line.
The code that iterates through each blob, reads it, and adds the entries to a list is as follows:
import json
from azure.storage.blob import BlockBlobService

data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('collection')
dataloaded = []
for blob in generator:
    loader = data.get_blob_to_text('collection', blob.name)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        if trackerstatusobject.strip():  # skip blank lines, e.g. after a trailing newline
            dataloaded.append(json.loads(trackerstatusobject))
From this you can load the list into a dataframe and do whatever you want :)
Hope this helps if someone stumbles upon a similar problem.
