I'd like to use the Python bindings to delta-rs to read from my blob storage.
Currently I am kind of lost, since I cannot figure out how to configure the filesystem on my local machine. Where do I have to put my credentials?
Can I use adlfs for this?
from adlfs import AzureBlobFileSystem
fs = AzureBlobFileSystem(
    account_name="...",
    account_key='...'
)
and then use the fs object?
Unfortunately, we don't have great documentation for this at the moment. You should be able to set the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_SAS environment variables, as in this integration test.
That will ensure the Python bindings can access the table metadata, but the data itself is typically fetched for querying through Pandas, and I'm not sure whether Pandas will pick up these variables as well (I'm not an ADLSv2 user myself).
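As a minimal sketch of the environment-variable approach described above: the two variable names come from the answer, while the helper name `configure_azure_env` and the placeholder values are assumptions for illustration. The variables have to be set before the table is opened; the `DeltaTable` call is left as a comment because the exact table URI format depends on the delta-rs version in use.

```python
import os


def configure_azure_env(account_name: str, sas_token: str) -> None:
    """Export the storage credentials for the current process."""
    os.environ["AZURE_STORAGE_ACCOUNT"] = account_name
    os.environ["AZURE_STORAGE_SAS"] = sas_token


# Placeholder values (hypothetical); use your own account name and SAS token.
configure_azure_env("myaccount", "sv=placeholder")

# With the variables set, the table can then be opened, for example:
# from deltalake import DeltaTable
# dt = DeltaTable("<table-uri>")
```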
One possible workaround is to download the delta lake files to a tmp-dir and read the files using python-delta-rs with something like this:
import os
import tempfile

from azure.storage.blob import BlobServiceClient
from deltalake import DeltaTable

def get_blobs_for_folder(container_client, blob_storage_folder_path):
    blob_iter = container_client.list_blobs(name_starts_with=blob_storage_folder_path)
    blob_names = []
    for blob in blob_iter:
        if "." in blob.name:
            # To just get files and not directories, there might be a better way to do this
            blob_names.append(blob.name)
    return blob_names

def download_blob_files(container_client, blob_names, local_folder):
    for blob_name in blob_names:
        local_filename = os.path.join(local_folder, blob_name)
        local_file_dir = os.path.dirname(local_filename)
        if not os.path.exists(local_file_dir):
            os.makedirs(local_file_dir)
        with open(local_filename, 'wb') as f:
            f.write(container_client.download_blob(blob_name).readall())

def read_delta_lake_file_to_df(blob_storage_path, access_key):
    blob_storage_url = "https://your-blob-storage"
    blob_service_client = BlobServiceClient(blob_storage_url, credential=access_key)
    container_client = blob_service_client.get_container_client("your-container-name")
    blob_names = get_blobs_for_folder(container_client, blob_storage_path)
    # The table is read inside the block, before the temporary directory is deleted
    with tempfile.TemporaryDirectory() as tmp_dirpath:
        download_blob_files(container_client, blob_names, tmp_dirpath)
        local_filename = os.path.join(tmp_dirpath, blob_storage_path)
        dt = DeltaTable(local_filename)
        df = dt.to_pyarrow_table().to_pandas()
    return df
I don't know about delta-rs but you can use this object directly with pandas.
import pandas as pd
from adlfs import AzureBlobFileSystem
abfs = AzureBlobFileSystem(account_name="account_name", account_key="access_key", container_name="name_of_container")
df = pd.read_parquet("path/of/file/with/container_name/included", filesystem=abfs)
I am trying to download files from Azure Blob Storage, and I want to do it without using a file handle or the open/close methods.
This is my current approach; I'd like an alternative that does not use "with open()":
`
def download_to_blob_storage(CONTAINERNAME, remote_path, local_file_path):
    client = BlobServiceClient(STORAGEACCOUNTURL, credential=default_credential)
    blob_client = client.get_blob_client(container=CONTAINERNAME, blob=remote_path)
    with open(local_file_path, "wb") as my_blob:
        download_stream = blob_client.download_blob()
        my_blob.write(download_stream.readall())
    print('downloaded'+remote_path+'file')
download_to_blob_storage(CONTAINERNAME, '/results/wordcount.py', "./wordcount.py")
`
For this question, first be clear about what you want to do.
What do you mean by "download"? If you just want to get the file content, I can show you a method. But if you want to create and write a file on disk while bypassing Python's I/O mechanism, that is not possible. open() is Python's most basic built-in way to work with files; if you want a file stored on disk, you will inevitably end up calling it.
A Python demo for you:
from azure.storage.blob import BlobClient, ContainerClient
account_url = "https://xxx.blob.core.windows.net"
key = "xxx"
container_name = "test"
blob_name = "test.txt"
# Create the ContainerClient object
container_client = ContainerClient(account_url, container_name, credential=key)
# Create the BlobClient object
blob_client = BlobClient(account_url, container_name, blob_name, credential=key)
#download the file without using open
file_data = blob_client.download_blob().readall()
print("file content is: " + str(file_data))
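If the goal is a file-like object rather than a file on disk, the bytes returned by `readall()` can be wrapped in `io.BytesIO`. The sketch below uses a hard-coded byte string as a stand-in for the downloaded content, so `file_data` here is an assumption and not real blob data.

```python
import io

# Stand-in for: file_data = blob_client.download_blob().readall()
file_data = b"first line\nsecond line\n"

# BytesIO behaves like a binary file opened for reading, but lives in memory.
buffer = io.BytesIO(file_data)
lines = [line.decode("utf-8").rstrip("\n") for line in buffer]
print(lines)
```

Anything that accepts a binary file object can be given `buffer` in place of a file opened with `open()`.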
I'm trying to open a series of cracked documents/texts that we've stored in Azure Blob Storage, ideally loading them all into a pandas DataFrame. I do not want to download them to disk (I'll be opening them from a Docker container); I just want to hold the information in memory.
The file structure looks like: Azure Blob Storage -> MyContainer -> UUIDFolderNames (many) -> 1 "knowledge.json" file in each Folder.
What I've got working:
container = ContainerClient.from_connection_string( <my connection str>, <MyContainer> )
blob_list = container.list_blobs()
for blob in blob_list:
    blobClient = container.get_blob_client( blob ) #Not sure this is needed
Ideally, for each item in my for loop, I'd open the .json file and add its text to a row in my DataFrame. However, I can't actually manage to open any of the JSON files.
What I've tried:
#1
name = blob.name
json.loads( name )
#2
with open(name, 'r') as f:
    data = json.load( f )
Errors:
#1 Json Decoder Error Expecting Value: line 1 column 1 (char 0)
#2: No such file or directory
I've tried other, sillier things like json.loads( blob ) or json.loads('knowledge.json') (no folder name in the path), but those were nonsensical attempts that I made just to see if they worked; they're not exactly reasonable.
Most methods (including those in Azure's documentation) download the file first, but again, I don't want to download the file.
*Edit: I realized that it's somewhat obvious why the files cannot be found: json.load etc. will look in my local directory (where I'm running the Python file from) rather than in the blob location. Still, I'm not sure how to load a file without downloading it.
The block below lets you view the contents of each JSON blob:
for blobs in container_client.list_blobs():
    blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
    content = blob_client.download_blob()
    contentastext = content.readall()
    print(contentastext)
Below is the full code to read JSON files from blobs. You can later add this data to your DataFrames:
from azure.storage.blob import BlobServiceClient

def UploadFiles():
    # Despite its name, this function only reads the blobs and prints their contents
    CONNECTION_STRING = "ENTER_CONNECTION_STR"
    Container_name = "gatherblobs"
    service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container_client = service_client.get_container_client(Container_name)
    for blobs in container_client.list_blobs():
        blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
        content = blob_client.download_blob()
        contentastext = content.readall()
        print(contentastext)

if __name__ == '__main__':
    UploadFiles()
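To go from the raw bytes to rows for a DataFrame, each payload can be parsed with `json.loads`, which accepts `bytes` directly. This is a minimal sketch, assuming each `knowledge.json` holds a single JSON object; the helper name `rows_from_payloads` and the sample data are hypothetical, and in real use the payloads would come from `blob_client.download_blob().readall()`.

```python
import json


def rows_from_payloads(payloads):
    """Turn (blob_name, raw_bytes) pairs into a list of row dicts."""
    rows = []
    for blob_name, raw in payloads:
        record = json.loads(raw)         # json.loads accepts bytes as well as str
        record["blob_name"] = blob_name  # remember which blob the row came from
        rows.append(record)
    return rows


# Sample data standing in for downloaded blobs (hypothetical names and fields).
sample = [
    ("uuid-1/knowledge.json", b'{"title": "A", "text": "alpha"}'),
    ("uuid-2/knowledge.json", b'{"title": "B", "text": "beta"}'),
]
rows = rows_from_payloads(sample)
print(rows)
# pandas.DataFrame(rows) would then give one row per blob.
```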
I need to open and work with data that arrives in a text file, using Python.
The file will be stored in Azure Blob Storage or an Azure file share.
My question is: can I use the same modules and functions, such as os.chdir() and read_fwf(), that I was using on Windows? The code I want to run:
import pandas as pd
import os
os.chdir( file_path)
df=pd.read_fwf(filename)
I want to be able to run this code and file_path would be a directory in Azure blob.
Please let me know if it's possible. If you have a better idea where the file can be stored please share.
Thanks,
As far as I know, os.chdir(path) can only operate on local files. If you want to move files from storage to local, you can refer to the following code:
from azure.storage.blob import BlobServiceClient

connect_str = "<your-connection-string>"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_name = "<container-name>"
file_name = "<blob-name>"
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(file_name)
download_file_path = "<local-path>"
with open(download_file_path, "wb") as download_file:
    download_file.write(blob_client.download_blob().readall())
pandas.read_fwf can also read the blob directly from storage via a URL, for example:
url = "https://<your-account>.blob.core.windows.net/test/test.txt?<sas-token>"
df=pd.read_fwf(url)
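The URL in the example can be assembled from its parts. A minimal sketch: `build_blob_url` is a hypothetical helper, and the account, container, and token values are placeholders. The blob path is percent-encoded so that names containing spaces still yield a valid URL.

```python
from urllib.parse import quote


def build_blob_url(account: str, container: str, blob_name: str, sas_token: str) -> str:
    """Build an HTTPS URL for a blob, with the SAS token as the query string."""
    path = quote(f"{container}/{blob_name}")  # keeps "/" and escapes spaces etc.
    token = sas_token.lstrip("?")             # tolerate a token copied with a leading "?"
    return f"https://{account}.blob.core.windows.net/{path}?{token}"


url = build_blob_url("myaccount", "test", "my data/test.txt", "sv=placeholder")
print(url)
# df = pd.read_fwf(url)  # as in the answer above
```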
I'm working on converting .xlsx files to .csv. Locally, I tested a Python script that downloads .xlsx files from a blob storage container, manipulates the data, saves the results as a .csv file (using pandas), and uploads it to a new container. Now I need to bring the script into ADF and build a pipeline to automate the task. I'm facing two problems:
First problem: I can't figure out how to complete the task without downloading the file to my local machine.
I found these threads/tutorials, but the "azure" v5.0.0 meta-package they rely on is deprecated:
read excel files from "input" blob storage container and export to csv in "output" container with python
Tutorial: Run Python scripts through Azure Data Factory using Azure Batch
So far my code is:
import os
import sys
import pandas as pd
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, PublicAccess
# Create the BlobServiceClient that is used to call the Blob service for the storage account
conn_str = 'XXXX;EndpointSuffix=core.windows.net'
blob_service_client = BlobServiceClient.from_connection_string(conn_str=conn_str)
container_name = "input"
blob_name = "prova/excel/AAA_prova1.xlsx"
container = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
downloaded_blob = container.download_blob(blob_name)
df = pd.read_excel(downloaded_blob.content_as_bytes(), skiprows = 4)
data = df.to_csv (r'C:\mypath/AAA_prova2.csv' ,encoding='utf-8-sig', index=False)
full_path_to_file = r'C:\mypath/AAA_prova2.csv'
local_file_name = 'prova\csv\AAA_prova2.csv'
#upload in blob
blob_client = blob_service_client.get_blob_client(
    container=container_name, blob=local_file_name)
with open(full_path_to_file, "rb") as data:
    blob_client.upload_blob(data)
Second problem: with this method I can only handle one specific blob name, but in the future I'll have to parametrize the script (e.g. select only blob names starting with AAA_). I can't tell whether I have to handle this in the Python script or whether I can filter the files through ADF (e.g. by adding a Filter File task before running the Python script). I can't find any tutorial or code snippet, so any help, hint, or documentation would be greatly appreciated.
EDIT
I modified the code to avoid downloading to the local machine, and it now works (problem #1 solved):
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
from io import BytesIO
import pandas as pd
filename = "excel/prova.xlsx"
container_name="input"
blob_service_client = BlobServiceClient.from_connection_string("XXXX==;EndpointSuffix=core.windows.net")
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
df = pd.read_excel(stream, skiprows = 5)
local_file_name_out = "csv/prova.csv"
container_name_out = "input"
blob_client = blob_service_client.get_blob_client(
    container=container_name_out, blob=local_file_name_out)
blob_client.upload_blob(df.to_csv(path_or_buf=None, encoding='utf-8-sig', index=False))
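For the second problem, the filtering can be done in the script itself. `list_blobs` accepts a `name_starts_with` argument that filters on the full blob path, so the prefix has to include the virtual folder (for example `excel/AAA_`). The sketch below applies the same selection to a plain list of names so that it runs without a storage account; `select_blobs` and the sample names are hypothetical.

```python
from posixpath import basename


def select_blobs(blob_names, folder, name_prefix, extension=".xlsx"):
    """Keep blobs under `folder` whose file name has the given prefix and extension."""
    selected = []
    for name in blob_names:
        file_name = basename(name)
        if (name.startswith(folder)
                and file_name.startswith(name_prefix)
                and file_name.endswith(extension)):
            selected.append(name)
    return selected


names = ["excel/AAA_prova1.xlsx", "excel/BBB_prova1.xlsx", "excel/AAA_notes.txt"]
matches = select_blobs(names, "excel/", "AAA_")
print(matches)

# Against real storage (assumption: container_client created as in the code above):
# for blob in container_client.list_blobs(name_starts_with="excel/AAA_"):
#     ...
```

The prefix and extension could then be passed in as script arguments, so the pipeline only has to supply the values.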
This is an Azure Functions version (Python 3.8). It waits for a blob trigger fired by an uploaded Excel file, does some processing, and reuses a good chunk of your code for the final step.
Note the split that trims the .xlsx extension off the file name.
This is what I ended up with:
source_blob = (f"https://{account_name}.blob.core.windows.net/{uploadedxlsx.name}")
file_name = uploadedxlsx.name.split("/")[2]
container_name = "container"
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(f"Received/{file_name}")
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
df = pd.read_excel(stream)
file_name_t = file_name.split(".")[0]
local_file_name_out = f"Converted/{file_name_t}.csv"
container_name_out = "out_container"
blob_client = blob_service_client.get_blob_client(
    container=container_name_out, blob=local_file_name_out)
blob_client.upload_blob(df.to_csv(path_or_buf=None, encoding='utf-8-sig', index=False))
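The two `split` calls assume a fixed folder depth and a file name containing exactly one dot. A slightly more defensive sketch uses `pathlib.PurePosixPath`; the function name and the sample blob path are hypothetical.

```python
from pathlib import PurePosixPath


def output_blob_name(trigger_name: str, out_folder: str = "Converted") -> str:
    """Map e.g. 'container/Received/report.v2.xlsx' to 'Converted/report.v2.csv'."""
    source = PurePosixPath(trigger_name)
    # .stem drops only the last suffix, so dots inside the name are preserved
    return f"{out_folder}/{source.stem}.csv"


converted = output_blob_name("container/Received/report.v2.xlsx")
print(converted)
```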
I've been trying to compress my CSV files to .gz before uploading them to GCS from a Cloud Function (Python 3.7). However, my code only adds the .gz extension without actually compressing the file, so the resulting file is corrupted. Can you please show me how to fix this? Thanks.
Here is part of my code:
import gzip
import time
from google.cloud import bigquery, storage

def to_gcs(request):
    job_config = bigquery.QueryJobConfig()
    gcs_filename = 'filename_{}.csv'
    bucket_name = 'bucket_gcs_name'
    subfolder = 'subfolder_name'
    client = bigquery.Client()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    QUERY = "SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` session, UNNEST(hits) AS hits"
    query_job = client.query(
        QUERY,
        location='US',
        job_config=job_config)
    while not query_job.done():
        time.sleep(1)
    rows_df = query_job.result().to_dataframe()
    storage_client = storage.Client()
    storage_client.get_bucket(bucket_name).blob(subfolder+'/'+gcs_filename+'.gz').upload_from_string(rows_df.to_csv(sep='|',index=False,encoding='utf-8',compression='gzip'), content_type='application/octet-stream')
As suggested in the thread referenced by @Sam Mason in a comment, once you have obtained the pandas DataFrame, you should use TextIOWrapper() and BytesIO(), as in the following sample.
The sample was inspired by @ramhiser's answer in this SO thread:
import gzip
from io import BytesIO, TextIOWrapper

df = query_job.result().to_dataframe()
blob = bucket.blob(f'{subfolder}/{gcs_filename}.gz')
with BytesIO() as gz_buffer:
    with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
        df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)
    # Rewind so the upload starts at the beginning of the compressed data
    blob.upload_from_file(gz_buffer, rewind=True,
                          content_type='application/octet-stream')
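A shorter in-memory variant is to compress the CSV text with `gzip.compress` and upload the resulting bytes. This is a minimal sketch, with a literal string standing in for the output of `rows_df.to_csv(sep='|', index=False)`; the upload call is shown only as a comment, since it needs a real bucket.

```python
import gzip

# Stand-in for: csv_text = rows_df.to_csv(sep='|', index=False)
csv_text = "url|view_count\nhttps://example.com/1|10\n"

# Compress the encoded text; the result is a complete gzip payload.
compressed = gzip.compress(csv_text.encode("utf-8"))

# Round-trip check: decompressing gives back the original text.
round_trip = gzip.decompress(compressed).decode("utf-8")
print(round_trip == csv_text)

# blob.upload_from_string(compressed, content_type='application/octet-stream')
```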
Also note that if you expect this file to ever grow beyond a couple of MB, you are probably better off using something from the tempfile module in place of BytesIO. SpooledTemporaryFile is basically designed for this use case: it uses a memory buffer up to a given size and only spills to disk if the file gets really big.
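A sketch of the `SpooledTemporaryFile` idea, using only the standard library. The rows and the 1 MB `max_size` threshold are assumptions for illustration; with pandas, `df.to_csv(wrapper, index=False)` would take the place of the `csv.writer` call.

```python
import csv
import gzip
import io
import tempfile

rows = [("url", "view_count"), ("https://example.com/1", 10)]

# Data stays in memory until it exceeds max_size, then spills to disk.
with tempfile.SpooledTemporaryFile(max_size=1024 * 1024) as spooled:
    with gzip.GzipFile(fileobj=spooled, mode="wb") as gz_file:
        wrapper = io.TextIOWrapper(gz_file, encoding="utf-8", newline="")
        csv.writer(wrapper, delimiter="|").writerows(rows)
        wrapper.flush()   # push buffered text into the gzip stream
        wrapper.detach()  # keep the wrapper from closing gz_file itself
    spooled.seek(0)       # rewind before reading or uploading
    # blob.upload_from_file(spooled, content_type='application/octet-stream')
    payload = spooled.read()

restored = gzip.decompress(payload).decode("utf-8")
print(restored)
```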
Hi, I tried to reproduce your use case.
I created a Cloud Function following this quickstart link:
def hello_world(request):
    from google.cloud import bigquery
    from google.cloud import storage
    import pandas as pd
    client = bigquery.Client()
    storage_client = storage.Client()
    path = '/tmp/file.gz'
    query_job = client.query("""
        SELECT
        CONCAT(
            'https://stackoverflow.com/questions/',
            CAST(id as STRING)) as url,
        view_count
        FROM `bigquery-public-data.stackoverflow.posts_questions`
        WHERE tags like '%google-bigquery%'
        ORDER BY view_count DESC
        LIMIT 10""")
    results = query_job.result().to_dataframe()
    results.to_csv(path,sep='|',index=False,encoding='utf-8',compression='gzip')
    bucket = storage_client.get_bucket('mybucket')
    blob = bucket.blob('file.gz')
    blob.upload_from_filename(path)
This is the requirements.txt:
# Function dependencies, for example:
google-cloud-bigquery
google-cloud-storage
pandas
I deployed the function.
I checked the output.
gsutil cp gs://mybucket/file.gz file.gz
gzip -d file.gz
cat file
#url|view_count
https://stackoverflow.com/questions/22879669|52306
https://stackoverflow.com/questions/13530967|46073
https://stackoverflow.com/questions/35159967|45991
https://stackoverflow.com/questions/10604135|45238
https://stackoverflow.com/questions/16609219|37758
https://stackoverflow.com/questions/11647201|32963
https://stackoverflow.com/questions/13221978|32507
https://stackoverflow.com/questions/27060396|31630
https://stackoverflow.com/questions/6607552|31487
https://stackoverflow.com/questions/11057219|29069
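The same check can be done in Python without shelling out: gzip data always starts with the two magic bytes `0x1f 0x8b`. A minimal sketch; `looks_like_gzip` is a hypothetical helper and the sample payloads are made up.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


plain = b"url|view_count\n"
print(looks_like_gzip(plain))                 # plain CSV that was only renamed to .gz
print(looks_like_gzip(gzip.compress(plain)))  # payload that was actually compressed
```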