Azure Grab Data from Blob Storage w. Python (No downloading) - python

I'm trying to open a series of different cracked documents / texts that we've stored in Azure Blob storage, ideally pushing them all into a pandas db. I do not want to download them (I'm going to be opening them from a Docker Container), I just want to store the information in memory.
The file structure looks like: Azure Blob Storage -> MyContainer -> UUIDFolderNames (many) -> 1 "knowledge.json" file in each Folder.
What I've got working:
container = ContainerClient.from_connection_string( <my connection str>, <MyContainer> )
blob_list = container.list_blobs()
for blob in blob_list:
blobClient = container.get_blob_client( blob ) #Not sure this is needed
Ideally for each item in my for loop, I'd do something like opening the .json file, then adding it's text to a row in my dataframe. However, I can't actually manage to open any of the JSON files.
What I've tried:
#1
name = blob.name
json.loads( name )
#2
with open(name, 'r') as f:
data = json.load( f )
Errors:
#1 Json Decoder Error Expecting Value: line 1 column 1 (char 0)
#2: No such file or directory
I've tried other sillier things like json.loads( blob ) or json.loads('knowledge.json') (no folder name in path), but those are kinda nonsensicle things that I was just trying to see if they worked, they're not exactly reasonable.
Most methods (including on Azure's documentation) download the file first, but again, I don't want to download the file.
*Edit: I realized that its somewhat obvious why the file's cannot be found - json.load etc will look in my local directory / where I'm running the python file from, rather than the blob location. Still, not sure how to load a file w.o downloading it.

With the help of the below block you will be able to view the JSON blob:
for blobs in container_client.list_blobs():
blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
content = blob_client.download_blob()
contentastext = content.readall()
print(contentastext)
Below is the full code to read JSON files from blobs, later you can add this data to your dataframes:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient,PublicAccess
import os
import logging
import sys
import azure.functions as func
from azure.storage import blob
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__
def UploadFiles():
CONNECTION_STRING="ENTER_CONNECTION_STR"
Container_name="gatherblobs"
service_client=BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = service_client.get_container_client(Container_name)
for blobs in container_client.list_blobs():
blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
content = blob_client.download_blob()
contentastext = content.readall()
print(contentastext)
if __name__ == '__main__':
UploadFiles()

Related

How to create a file in a python script without storing it in a local directory and save it in a different container on azure blob storage?

Because I want to deploy this code as an azure container instance, I don't want to read a file from a local path on my computer. Basically, I am retrieving a file from a container in blob storage in Azure and then saving it in a different container. There would be some processing that changes the file from the blob storage and uploads the processed file into a different container. In my code, for simplicity I am uploading the file as it is without processing it and changing the file. So far I have managed to read a file from blob storage and make a local file out of it and upload the local file to a different container on blob storage but I don't want to create a local file. I want to process the file from blob storage and directly upload it to a different container without storing it in a local path. Can someone please help me figure it out. I have the following code:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import json
class FileProcessing:
def __init__(self):
self.file_access()
def file_access(self):
filename = "data_map.json"
container_name="filestorage"
constr = ""
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader = blob_client.download_blob()
fileReader = json.loads(streamdownloader.readall())
#Here it stores it in a local directory on my computer; I want it saved on Azure directly
#For simplicity I am not making any changes to the file yet
with open('json_data.json', 'w') as outfile:
json.dump(fileReader, outfile)
container_name2="filedeposit"
container_client = ContainerClient.from_connection_string(constr, container_name2)
print("Uploading files to blob storage")
blob_client= container_client.get_blob_client("json_data.json")
with open(r"C:\Users\python-test\json_data.json", "rb") as data:
blob_client.upload_blob(data)
print("file uploaded")
if __name__ == "__main__":
FileProcessing()
You don't really need to write the data to the file. You can simply convert the JSON data into a string and then upload it. Something like (untested code though):
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import json
class FileProcessing:
def __init__(self):
self.file_access()
def file_access(self):
filename = "data_map.json"
container_name="filestorage"
constr = ""
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader = blob_client.download_blob()
fileReader = json.loads(streamdownloader.readall())
#Read the data into string
data = json.dumps(fileReader)
container_name2="filedeposit"
container_client = ContainerClient.from_connection_string(constr, container_name2)
print("Uploading files to blob storage")
blob_client= container_client.get_blob_client("json_data.json")
blob_client.upload_blob(data)
print("file uploaded")
if __name__ == "__main__":
FileProcessing()

How to process a file located in Azure blob Storage using python with pandas read_fwf function

I need to open and work on data coming in a text file with python.
The file will be stored in the Azure Blob storage or Azure file share.
However, my question is can I use the same modules and functions like os.chdir() and read_fwf() I was using in windows? The code I wanted to run:
import pandas as pd
import os
os.chdir( file_path)
df=pd.read_fwf(filename)
I want to be able to run this code and file_path would be a directory in Azure blob.
Please let me know if it's possible. If you have a better idea where the file can be stored please share.
Thanks,
As far as I know, os.chdir(path) can only operate on local files. If you want to move files from storage to local, you can refer to the following code:
connect_str = "<your-connection-string>"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_name = "<container-name>"
file_name = "<blob-name>"
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(file_name)
download_file_path = "<local-path>"
with open(download_file_path, "wb") as download_file:
download_file.write(blob_client.download_blob().readall())
pandas.read_fwf can read blob directly from storage using URL:
For example:
url = "https://<your-account>.blob.core.windows.net/test/test.txt?<sas-token>"
df=pd.read_fwf(url)

How to parse AVRO blobs captured in a storage account of Event Hub?

In Microsoft Azure we have an Event Hub capturing JSON data and storing it in AVRO format in a blob storage account:
I have written a python script, which would fetch the AVRO files from the Event Hub:
import os, avro
from io import BytesIO
from operator import itemgetter, attrgetter
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
conn_str = 'DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net'
container_name = 'container1'
blob_service_client = BlobServiceClient.from_connection_string(conn_str)
container_client = blob_service_client.get_container_client(container_name)
blob_list = []
for blob in container_client.list_blobs():
if blob.name.endswith('.avro'):
blob_list.append(blob)
blob_list.sort(key=attrgetter('creation_time'), reverse=True)
This works well and I get a list of AVRO blobs, sorted by the creation time.
Now I am trying to add the final steps where I would download the blobs, parse the AVRO-formatted data and retrieve the JSON payload.
I try to retrieve each blob in the list into memory buffer and to parse it:
for blob in blob_list:
blob_client = container_client.get_blob_client(blob.name)
downloader = blob_client.download_blob()
stream = BytesIO()
downloader.download_to_stream(stream) # also tried readinto(stream)
reader = DataFileReader(stream, DatumReader())
for event_data in reader:
print(event_data)
reader.close()
Unfortunately, the above Python code does not work, nothing is printed.
I have also seen, that there is a StorageStreamDownloader.readall() method, but I am not sure, how to apply it.
I am using Windows 10, python 3.8.5 and avro 1.10.0 installed by pip.
When using readall() method, it should be used as below:
with open("xxx", "wb+") as my_file:
my_file.write(blob_client.download_blob().readall()) # Write blob contents into the file.
For more details about reading captured eventhub data, you can refer to this official doc: Create a Python script to read your Capture files.
Please let me know if you still have more issues:).

Transform .xlsx in BLOB storage to .csv using pandas without downloading to local machine

I'm dealing with a transformation from .xlsx file to .csv. I tested locally a python script that downloads .xlsx files from a container in blob storage, manipulate data, save results as .csv file (using pandas) and upload it on a new container. Now I should bring the python script to ADF to build a pipeline to automate the task. I'm dealing with two kind of problems:
First problem: I can't figure out how to complete the task without downloading the file on my local machine.
I found these threads/tutorials but the "azure" v5.0.0 meta-package is deprecated
read excel files from "input" blob storage container and export to csv in "output" container with python
Tutorial: Run Python scripts through Azure Data Factory using Azure Batch
Sofar my code is:
import os
import sys
import pandas as pd
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, PublicAccess
# Create the BlobServiceClient that is used to call the Blob service for the storage account
conn_str = 'XXXX;EndpointSuffix=core.windows.net'
blob_service_client = BlobServiceClient.from_connection_string(conn_str=conn_str)
container_name = "input"
blob_name = "prova/excel/AAA_prova1.xlsx"
container = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
downloaded_blob = container.download_blob(blob_name)
df = pd.read_excel(downloaded_blob.content_as_bytes(), skiprows = 4)
data = df.to_csv (r'C:\mypath/AAA_prova2.csv' ,encoding='utf-8-sig', index=False)
full_path_to_file = r'C:\mypath/AAA_prova2.csv'
local_file_name = 'prova\csv\AAA_prova2.csv'
#upload in blob
blob_client = blob_service_client.get_blob_client(
container=container_name, blob=local_file_name)
with open(full_path_to_file, "rb") as data:
blob_client.upload_blob(data)
Second problem: with this method I can deal only with the specific name of the blob, but in the future I'll have to parametrize the script (i.e. select only blob names starting with AAA_). I can't understand if I have to deal with this in the python script or if I can manage to filter the file through ADF (i.e. adding a Filter File task before running the python script). I can't find any tutorial/code snippet and any help or hint or documentation would be very appreciated.
EDIT
I modified the code to avoid to download to local machine, now it works (problem #1 solved)
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
from io import BytesIO
import pandas as pd
filename = "excel/prova.xlsx"
container_name="input"
blob_service_client = BlobServiceClient.from_connection_string("XXXX==;EndpointSuffix=core.windows.net")
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
df = pd.read_excel(stream, skiprows = 5)
local_file_name_out = "csv/prova.csv"
container_name_out = "input"
blob_client = blob_service_client.get_blob_client(
container=container_name_out, blob=local_file_name_out)
blob_client.upload_blob(df.to_csv(path_or_buf = None , encoding='utf-8-sig', index=False))
Azure Functions, Python 3.8 Version of an Azure function. Waits for a blob trigger from Excel. Then does some stuff and used a good chunk of your code for final completion.
Note the split to trim off the .xlsx of the file name.
This is what I ended up with:
source_blob = (f"https://{account_name}.blob.core.windows.net/{uploadedxlsx.name}")
file_name = uploadedxlsx.name.split("/")[2]
container_name = "container"
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(f"Received/{file_name}")
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
df = pd.read_excel(stream)
file_name_t = file_name.split(".")[0]
local_file_name_out = f"Converted/{file_name_t}.csv"
container_name_out = "out_container"
blob_client = blob_service_client.get_blob_client(
container=container_name_out, blob=local_file_name_out)
blob_client.upload_blob(df.to_csv(path_or_buf = None , encoding='utf-8-sig', index=False))

Extract particular file from zip blob stored in azure container with python using Jupyter notebook

I had uploaded zip file in my azure account as a blob in azure container.
Zip file contains .csv, .ascii files and many other formats.
I need to read specific file, lets say ascii file data containing in zip file. I am using python for this case.
How to read particular file data from this zip file without downloading it on local? I would like to handle this process in memory only.
I am also trying with jypyter notebook provided by azure for ML functionality
I am using ZipFile python package for this case.
Request you to assist in this matter to read the file
Please find following code snippet.
blob_service=BlockBlobService(account_name=ACCOUNT_NAME,account_key=ACCOUNT_KEY)
blob_list=blob_service.list_blobs(CONTAINER_NAME)
allBlobs = []
for blob in blob_list:
allBlobs.append(blob.name)
sampleZipFile = allBlobs[0]
print(sampleZipFile)
The below code should work. This example accesses an Azure Container using an Account URL and Key combination.
from azure.storage.blob import BlobServiceClient
from io import BytesIO
from zipfile import ZipFile
key = r'my_key'
service = BlobServiceClient(account_url="my_account_url",
credential=key
)
container_client = service.get_container_client('container_name')
zipfilename = 'myzipfile.zip'
blob_data = container_client.download_blob(zipfilename)
blob_bytes = blob_data.content_as_bytes()
inmem = BytesIO(blob_bytes)
myzip = ZipFile(inmem)
otherfilename = 'mycontainedfile.csv'
filetoread = BytesIO(myzip.read(otherfilename))
Now all you have to do is pass filetoread into whatever method you would normally use to read a local file (eg. pandas.read_csv())
you could use below code for reading file inside .zip file without extracting in python
import zipfile
archive = zipfile.ZipFile('images.zip', 'r')
imgdata = archive.read('img_01.png')
For details , you can refer to ZipFile docs here
Alternatively, you can do something like this
-- coding: utf-8 --
"""
Created on Mon Apr 1 11:14:56 2019
#author: moverm
"""
import zipfile
zfile = zipfile.ZipFile('C:\\LAB\Pyt\sample.zip')
for finfo in zfile.infolist():
ifile = zfile.open(finfo)
line_list = ifile.readlines()
print(line_list)
Here is the output for the same
Hope it helps.

Categories

Resources