I want to read an excel file stored in Azure blob storage to a python data frame. What method would I use?
There is a function named read_excel in the pandas package, which you can pass a url of an online excel file to the function to get the dataframe of the excel table, as the figure below.
So you just need to generate a url of a excel blob with sas token and then to pass it to the function.
Here is my sample code. Note: it requires to install Python packages azure-storage, pandas and xlrd.
# Generate a url of excel blob with sas token
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
account_name = '<your storage account name>'
account_key = '<your storage key>'
container_name = '<your container name>'
blob_name = '<your excel blob>'
blob_service = BaseBlobService(
account_name=account_name,
account_key=account_key
)
sas_token = blob_service.generate_blob_shared_access_signature(container_name, blob_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
blob_url_with_sas = blob_service.make_blob_url(container_name, blob_name, sas_token=sas_token)
# pass the blob url with sas to function `read_excel`
import pandas as pd
df = pd.read_excel(blob_url_with_sas)
print(df)
I used my sample excel file to test the code below, it works fine.
Fig 1. My sample excel file testing.xlsx in my test container of Azure Blob Storage
Fig 2. The content of my sample excel file testing.xlsx
Fig 3. The result of my sample Python code to read excel blob
Related
I would like to deploy azure function with following functionality
read excel data from Azure blob into stream object instead of downloading onto VM.
read into data frame
I require help to read the excel file to into data frame. How to update placed holder download_file_path to read excel data .
import pandas as pd
import os
import io
from azure.storage.blob import BlobClient,BlobServiceClient,ContentSettings
connectionstring="XXXXXXXXXXXXXXXX"
excelcontainer = "excelcontainer"
excelblobname="Resource.xlsx"
sheet ="Resource"
blob_service_client =BlobServiceClient.from_connection_string(connectionstring)
download_file_path =os.path.join(excelcontainer)
blob_client = blob_service_client.get_blob_client(container=excelcontainer, blob=excelblobname)
with open(download_file_path, "rb") as f:
data_bytes = f.read()
df =pd.read_excel(data_bytes, sheet_name=sheet, encoding = "utf-16")
If you want to read an excel file from Azure blob with panda, you have two choice
Generate SAS token for the blob then use blob URL with SAS token to access it
from datetime import datetime, timedelta
import pandas as pd
from azure.storage.blob import BlobSasPermissions, generate_blob_sas
def main(req: func.HttpRequest) -> func.HttpResponse:
account_name = 'andyprivate'
account_key = 'h4pP1fe76*****A=='
container_name = 'test'
blob_name="sample.xlsx"
sas=generate_blob_sas(
account_name=account_name,
container_name=container_name,
blob_name=blob_name,
account_key=account_key,
permission=BlobSasPermissions(read=True),
expiry=datetime.utcnow() + timedelta(hours=1)
)
blob_url = f'https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas}'
df=pd.read_excel(blob_url)
print(df)
......
Download the blob
from azure.storage.blob import BlobServiceClient
def main(req: func.HttpRequest) -> func.HttpResponse:
account_name = 'andyprivate'
account_key = 'h4pP1f****='
blob_service_client = BlobServiceClient(account_url=f'https://{account_name }.blob.core.windows.net/', credential=account_key)
blob_client = blob_service_client.get_blob_client(container='test', blob='sample.xlsx')
downloader =blob_client.download_blob()
df=pd.read_excel(downloader.readall())
print(df)
....
I'm dealing with a transformation from .xlsx file to .csv. I tested locally a python script that downloads .xlsx files from a container in blob storage, manipulate data, save results as .csv file (using pandas) and upload it on a new container. Now I should bring the python script to ADF to build a pipeline to automate the task. I'm dealing with two kind of problems:
First problem: I can't figure out how to complete the task without downloading the file on my local machine.
I found these threads/tutorials but the "azure" v5.0.0 meta-package is deprecated
read excel files from "input" blob storage container and export to csv in "output" container with python
Tutorial: Run Python scripts through Azure Data Factory using Azure Batch
Sofar my code is:
import os
import sys
import pandas as pd
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, PublicAccess
# Create the BlobServiceClient that is used to call the Blob service for the storage account
conn_str = 'XXXX;EndpointSuffix=core.windows.net'
blob_service_client = BlobServiceClient.from_connection_string(conn_str=conn_str)
container_name = "input"
blob_name = "prova/excel/AAA_prova1.xlsx"
container = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
downloaded_blob = container.download_blob(blob_name)
df = pd.read_excel(downloaded_blob.content_as_bytes(), skiprows = 4)
data = df.to_csv (r'C:\mypath/AAA_prova2.csv' ,encoding='utf-8-sig', index=False)
full_path_to_file = r'C:\mypath/AAA_prova2.csv'
local_file_name = 'prova\csv\AAA_prova2.csv'
#upload in blob
blob_client = blob_service_client.get_blob_client(
container=container_name, blob=local_file_name)
with open(full_path_to_file, "rb") as data:
blob_client.upload_blob(data)
Second problem: with this method I can deal only with the specific name of the blob, but in the future I'll have to parametrize the script (i.e. select only blob names starting with AAA_). I can't understand if I have to deal with this in the python script or if I can manage to filter the file through ADF (i.e. adding a Filter File task before running the python script). I can't find any tutorial/code snippet and any help or hint or documentation would be very appreciated.
EDIT
I modified the code to avoid to download to local machine, now it works (problem #1 solved)
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
from io import BytesIO
import pandas as pd
filename = "excel/prova.xlsx"
container_name="input"
blob_service_client = BlobServiceClient.from_connection_string("XXXX==;EndpointSuffix=core.windows.net")
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
df = pd.read_excel(stream, skiprows = 5)
local_file_name_out = "csv/prova.csv"
container_name_out = "input"
blob_client = blob_service_client.get_blob_client(
container=container_name_out, blob=local_file_name_out)
blob_client.upload_blob(df.to_csv(path_or_buf = None , encoding='utf-8-sig', index=False))
Azure Functions, Python 3.8 Version of an Azure function. Waits for a blob trigger from Excel. Then does some stuff and used a good chunk of your code for final completion.
Note the split to trim off the .xlsx of the file name.
This is what I ended up with:
source_blob = (f"https://{account_name}.blob.core.windows.net/{uploadedxlsx.name}")
file_name = uploadedxlsx.name.split("/")[2]
container_name = "container"
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(f"Received/{file_name}")
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
df = pd.read_excel(stream)
file_name_t = file_name.split(".")[0]
local_file_name_out = f"Converted/{file_name_t}.csv"
container_name_out = "out_container"
blob_client = blob_service_client.get_blob_client(
container=container_name_out, blob=local_file_name_out)
blob_client.upload_blob(df.to_csv(path_or_buf = None , encoding='utf-8-sig', index=False))
I am getting memory error while creating simple dataframe read from CSV file on Azure Machine Learning using notebook VM as compute instance. The VM has config of DS 13 56gb RAM, 8vcpu, 112gb storage on Ubuntu (Linux (ubuntu 16.04). CSV file is 5gb file.
blob_service = BlockBlobService(account_name,account_key)
blobstring = blob_service.get_blob_to_text(container,filepath).content
dffinaldata = pd.read_csv(StringIO(blobstring), sep=',')
What I am doing wrong here ?
you need to provide the right encoding when calling get_blob_to_text, please refer to the sample.
The code below is what normally use for reading data file in blob storages. Basically, you can use blob’s url along with sas token and use a request method. However, You might want to edit the ‘for loop’ depending what types of data you have (e.g. csv, jpg, and etc).
-- Python code below --
import requests
from azure.storage.blob import BlockBlobService, BlobPermissions
from azure.storage.blob.baseblobservice import BaseBlobService
from datetime import datetime, timedelta
account_name = '<account_name>'
account_key = '<account_key>'
container_name = '<container_name>'
blob_service=BlockBlobService(account_name,account_key)
generator = blob_service.list_blobs(container_name)
for blob in generator:
url = f"https://{account_name}.blob.core.windows.net/{container_name}"
service = BaseBlobService(account_name=account_name, account_key=account_key)
token = service.generate_blob_shared_access_signature(container_name, img_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1),)
url_with_sas = f"{url}?{token}"
response = requests.get(url_with_sas)
Please follow the below link to read data on Azure Blob Storage.
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data
I've been trying to compress my CSV files to .gz before uploading to GCS using Cloud Function-Python 3.7, but what my code does only adds the .gz extension but doesn't really compress the file, so in the end, the file was corrupted. Can you please show me how to fix this? Thanks
here is part of my code
import gzip
def to_gcs(request):
job_config = bigquery.QueryJobConfig()
gcs_filename = 'filename_{}.csv'
bucket_name = 'bucket_gcs_name'
subfolder = 'subfolder_name'
client = bigquery.Client()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
QUERY = "SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` session, UNNEST(hits) AS hits"
query_job = client.query(
QUERY,
location='US',
job_config=job_config)
while not query_job.done():
time.sleep(1)
rows_df = query_job.result().to_dataframe()
storage_client = storage.Client()
storage_client.get_bucket(bucket_name).blob(subfolder+'/'+gcs_filename+'.gz').upload_from_string(rows_df.to_csv(sep='|',index=False,encoding='utf-8',compression='gzip'), content_type='application/octet-stream')
As suggested in the thread referred by #Sam Mason in a comment, once you have obtained the Pandas datafame, you should use a TextIOWrapper() and BytesIO() as described in the following sample:
The following sample was inspired by #ramhiser's answer in this SO thread
df = query_job.result().to_dataframe()
blob = bucket.blob(f'{subfolder}/{gcs_filename}.gz')
with BytesIO() as gz_buffer:
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)
blob.upload_from_file(gz_buffer,
content_type='application/octet-stream')
also note that if you expect this file to ever get larger than a couple of MB you are probably better off using something from the tempfile module in place of BytesIO. SpooledTemporaryFile is basically designed for this use case, where it will use a memory buffer up to some given size and only use the disk if the file gets really big
Hi I tried to reproduce your use case:
I created a cloud function using this quickstart link:
def hello_world(request):
from google.cloud import bigquery
from google.cloud import storage
import pandas as pd
client = bigquery.Client()
storage_client = storage.Client()
path = '/tmp/file.gz'
query_job = client.query("""
SELECT
CONCAT(
'https://stackoverflow.com/questions/',
CAST(id as STRING)) as url,
view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
LIMIT 10""")
results = query_job.result().to_dataframe()
results.to_csv(path,sep='|',index=False,encoding='utf-8',compression='gzip')
bucket = storage_client.get_bucket('mybucket')
blob = bucket.blob('file.gz')
blob.upload_from_filename(path)
This is the requirements.txt:
# Function dependencies, for example:
google-cloud-bigquery
google-cloud-storage
pandas
I deployed the function.
I checked the output.
gsutil cp gs://mybucket/file.gz file.gz
gzip -d file.gz
cat file
#url|view_count
https://stackoverflow.com/questions/22879669|52306
https://stackoverflow.com/questions/13530967|46073
https://stackoverflow.com/questions/35159967|45991
https://stackoverflow.com/questions/10604135|45238
https://stackoverflow.com/questions/16609219|37758
https://stackoverflow.com/questions/11647201|32963
https://stackoverflow.com/questions/13221978|32507
https://stackoverflow.com/questions/27060396|31630
https://stackoverflow.com/questions/6607552|31487
https://stackoverflow.com/questions/11057219|29069
I have a number of large csv (tab delimited) data stored as azure blobs, and I want to create a pandas dataframe from these. I can do this locally as follows:
from azure.storage.blob import BlobService
import pandas as pd
import os.path
STORAGEACCOUNTNAME= 'account_name'
STORAGEACCOUNTKEY= "key"
LOCALFILENAME= 'path/to.csv'
CONTAINERNAME= 'container_name'
BLOBNAME= 'bloby_data/000000_0'
blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
# Only get a local copy if haven't already got it
if not os.path.isfile(LOCALFILENAME):
blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)
df_customer = pd.read_csv(LOCALFILENAME, sep='\t')
However, when running the notebook on azure ML notebooks, I can't 'save a local copy' and then read from csv, and so I'd like to do the conversion directly (something like pd.read_azure_blob(blob_csv) or just pd.read_csv(blob_csv) would be ideal).
I can get to the desired end result (pandas dataframe for blob csv data), if I first create an azure ML workspace, and then read the datasets into that, and finally using https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python to access the dataset as a pandas dataframe, but I'd prefer to just read straight from the blob storage location.
The accepted answer will not work in the latest Azure Storage SDK. MS has rewritten the SDK completely. It's kind of annoying if you are using the old version and update it. The below code should work in the new version.
from azure.storage.blob import ContainerClient
from io import StringIO
import pandas as pd
conn_str = ""
container = ""
blob_name = ""
container_client = ContainerClient.from_connection_string(
conn_str=conn_str,
container_name=container
)
# Download blob as StorageStreamDownloader object (stored in memory)
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_csv(StringIO(downloaded_blob.content_as_text()))
I think you want to use get_blob_to_bytes, or get_blob_to_text; these should output a string which you can use to create a dataframe as
from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
df = pd.read_csv(StringIO(blobstring))
Thanks for the answer, I think some correction is needed. You need to get content from the blob object and in the get_blob_to_text there's no need for the local file name.
from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
Simple Answer:
Working as on 12th June 2022
Below are the steps to read a CSV file from Azure Blob into a Jupyter notebook dataframe (python).
STEP 1:
First generate a SAS token & URL for the target CSV(blob) file on Azure-storage by right-clicking the blob/storage CSV file(blob file).
STEP 2: Copy the Blob SAS URL that appears below the button used for generating SAS token and URL.
STEP 3: Use the below line of code in your Jupyter notbook to import the desired CSV. Replace url value with your Blob SAS URL copied in the above step.
import pandas as pd
url ='Your Blob SAS URL'
df = pd.read_csv(url)
df.head()
Use ADLFS (pip install adlfs), which is an fsspec-compatible API for Azure lakes (gen1 and gen2):
storage_options = {
'tenant_id': tenant_id,
'account_name': account_name,
'client_id': client_id,
'client_secret': client_secret
}
url = 'az://some/path.csv'
pd.read_csv(url, storage_options=storage_options)