How to use pd.read_csv() with Blobs from GCS? - python

I have a Python script reading files from a GCS bucket:
from google.cloud import storage
import pandas as pd
client = storage.Client.from_service_account_json('sa.json')
BUCKET_NAME = 'sleep-accel'
bucket = client.get_bucket(BUCKET_NAME)
blobs_all = list(bucket.list_blobs())
blobs_specific = list(bucket.list_blobs(prefix='physionet.org/files/sleep-accel/1.0.0/motion/'))
main_df = pd.DataFrame({})
txtFile = blobs_specific[0].download_to_file('/tmp')
main_df = pd.concat([main_df, pd.read_csv(txtFile)])
blobs_specific is a list of blobs; each blob carries a little metadata and points to the .txt file that I need parsed by .read_csv().
I'm trying to figure out which GCS library function I'm supposed to use so that pd.read_csv() can read it.
All the files are .txt, which is why I'm trying to parse them into .csv here.
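A minimal sketch of one way to do this, assuming the files are space-separated text with no header row (as the answer to the related question below also assumes): download each blob into memory with download_as_bytes() and hand pandas a file-like object.

import io

import pandas as pd
from google.cloud import storage

client = storage.Client.from_service_account_json('sa.json')
bucket = client.get_bucket('sleep-accel')
blobs_specific = list(bucket.list_blobs(prefix='physionet.org/files/sleep-accel/1.0.0/motion/'))

frames = []
for blob in blobs_specific:
    if not blob.name.endswith('.txt'):
        continue  # skip folder placeholders and other non-data objects
    # download_as_bytes() keeps the object in memory, so no temp file is needed
    data = blob.download_as_bytes()
    frames.append(pd.read_csv(io.BytesIO(data), header=None, sep=' '))

main_df = pd.concat(frames, ignore_index=True)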

Related

GCS: Read a text file from Google Cloud Storage into a Python Jupyter notebook

Hello, in my GCP Jupyter notebook I am reading a
from google.cloud import storage
client = storage.Client()
BUCKET_NAME = 'sleep-accel'
bucket = client.get_bucket(BUCKET_NAME)
blobs_all = list(bucket.list_blobs())
blobs_specific = list(bucket.list_blobs(prefix='physionet.org/files/sleep-accel/1.0.0/motion/'))
for doc in blobs_specific:
    print(doc)
dataset that I loaded into GCS, and for some reason it is printing
<Blob: sleep-accel, physionet.org/files/sleep-accel/1.0.0/motion/1455390_acceleration.txt, 1656705245042882>
How can I access the .txt files?
Because my main/end goal is to convert the content of the .txt files into a single .csv file.
Converting the .txt files to .csv format can be achieved by using the pandas module.
Below is my sample code, which converts the .txt files from the bucket to .csv format:
from google.cloud import storage
import pandas as pd
client = storage.Client()
BUCKET_NAME = 'your_bucket_name'
bucket = client.get_bucket(BUCKET_NAME)
# list all the objects inside the physionet.org/files/sleep-accel/1.0.0/motion/ folder
blobs_specific = list(bucket.list_blobs(prefix='physionet.org/files/sleep-accel/1.0.0/motion/'))
for doc in list(blobs_specific)[1:]:  # skip the first object, which is the folder placeholder
    # read the txt file with pandas: no header row, space-separated values
    # (pandas reads gs:// paths through the gcsfs package)
    df = pd.read_csv("gs://" + BUCKET_NAME + "/" + doc.name, header=None, sep=' ')
    # change the .txt extension in doc.name to .csv
    to_csv = doc.name.replace('.txt', '.csv')
    print(to_csv)
    # convert the txt data to csv with pandas and save it under the
    # physionet.org/files/sleep-accel/1.0.0/motion/ folder in your notebook
    df.to_csv(to_csv, index=False, sep=',')
The .csv files will be saved to your notebook's local file system.
Note: you need to create a directory tree like physionet.org/files/sleep-accel/1.0.0/motion/ in your notebook first, because that is where the .csv files will be saved.
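A small sketch of creating that directory tree up front with os.makedirs (an assumption about where the notebook process can write), so that df.to_csv() does not fail on the missing folder:

import os

# create the output folder once before the conversion loop;
# exist_ok avoids an error if it already exists
os.makedirs('physionet.org/files/sleep-accel/1.0.0/motion/', exist_ok=True)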

Convert CSV to Parquet in S3 with Python

I need to convert a CSV file to a Parquet file in an S3 path. I'm trying to use the code below; no error occurs and the code executes successfully, but the CSV file is not converted.
import pandas as pd
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
s3 = boto3.client("s3", region_name='us-east-2', aws_access_key_id='my key id',
                  aws_secret_access_key='my secret key')
obj = s3.get_object(Bucket='my bucket', Key='test.csv')
df = pd.read_csv(obj['Body'])
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table=table, root_path="test.parquet")
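One way to actually land the Parquet output in S3 (a sketch, reusing the bucket, key, and boto3 client from the question) is to serialize the table to an in-memory buffer and upload the bytes with put_object:

import io

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3", region_name='us-east-2')

# read the CSV from S3, exactly as in the question
obj = s3.get_object(Bucket='my bucket', Key='test.csv')
df = pd.read_csv(obj['Body'])

# serialize the DataFrame to Parquet in memory instead of to a local path
buffer = io.BytesIO()
pq.write_table(pa.Table.from_pandas(df), buffer)

# upload the Parquet bytes back to the same bucket
s3.put_object(Bucket='my bucket', Key='test.parquet', Body=buffer.getvalue())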
AWS CSV to Parquet Converter in Python
This script gets a CSV file from Amazon S3, converts it to Parquet for later query jobs, and uploads it back to Amazon S3.
import boto3
import numpy
import pandas
import fastparquet
def lambda_handler(event, context):
    # identifying resources
    s3_object = boto3.client('s3', region_name='us-east-2')
    s3_resource = boto3.resource('s3', region_name='us-east-2')
    # access file
    get_file = s3_object.get_object(Bucket='ENTER_BUCKET_NAME_HERE', Key='CSV_FILE_NAME.csv')
    get = get_file['Body']
    # read the CSV body into a DataFrame
    df = pandas.read_csv(get)
    # convert csv to parquet function (Lambda can only write under /tmp)
    def conv_csv_parquet_file(df):
        df.to_parquet('/tmp/converted_data_parquet_version.parquet')
        return 'converted_data_parquet_version.parquet'
    converted_data_parquet = conv_csv_parquet_file(df)
    print("File converted from CSV to parquet completed")
    # uploading the parquet version file
    s3_path = "converted_to_parquet/" + converted_data_parquet
    with open('/tmp/' + converted_data_parquet, 'rb') as parquet_file:
        put_response = s3_resource.Object('ENTER_BUCKET_NAME_HERE', s3_path).put(Body=parquet_file)
The Python library boto3 lets the Lambda get the CSV file from S3, and then fastparquet (or pyarrow) converts the CSV file into Parquet.
From: https://github.com/ayshaysha/aws-csv-to-parquet-converter.py

Transform .xlsx in BLOB storage to .csv using pandas without downloading to local machine

I'm dealing with a transformation from .xlsx files to .csv. I tested locally a Python script that downloads .xlsx files from a container in Blob Storage, manipulates the data, saves the results as a .csv file (using pandas), and uploads it to a new container. Now I need to bring the Python script into ADF to build a pipeline that automates the task. I'm dealing with two kinds of problems:
First problem: I can't figure out how to complete the task without downloading the file to my local machine.
I found these threads/tutorials, but the "azure" v5.0.0 meta-package is deprecated:
read excel files from "input" blob storage container and export to csv in "output" container with python
Tutorial: Run Python scripts through Azure Data Factory using Azure Batch
So far my code is:
import os
import sys
import pandas as pd
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, PublicAccess
# Create the BlobServiceClient that is used to call the Blob service for the storage account
conn_str = 'XXXX;EndpointSuffix=core.windows.net'
blob_service_client = BlobServiceClient.from_connection_string(conn_str=conn_str)
container_name = "input"
blob_name = "prova/excel/AAA_prova1.xlsx"
container = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
downloaded_blob = container.download_blob(blob_name)
df = pd.read_excel(downloaded_blob.content_as_bytes(), skiprows = 4)
data = df.to_csv (r'C:\mypath/AAA_prova2.csv' ,encoding='utf-8-sig', index=False)
full_path_to_file = r'C:\mypath/AAA_prova2.csv'
local_file_name = 'prova\csv\AAA_prova2.csv'
#upload in blob
blob_client = blob_service_client.get_blob_client(
    container=container_name, blob=local_file_name)
with open(full_path_to_file, "rb") as data:
    blob_client.upload_blob(data)
Second problem: with this method I can only handle one specific blob name, but in the future I'll have to parametrize the script (i.e. select only blob names starting with AAA_). I can't tell whether I have to handle this in the Python script or whether I can filter the files through ADF (i.e. by adding a Filter File task before running the Python script). I can't find any tutorial or code snippet, so any help, hint, or documentation would be very much appreciated.
EDIT
I modified the code to avoid downloading to the local machine, and now it works (problem #1 solved):
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
from io import BytesIO
import pandas as pd
filename = "excel/prova.xlsx"
container_name="input"
blob_service_client = BlobServiceClient.from_connection_string("XXXX==;EndpointSuffix=core.windows.net")
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
df = pd.read_excel(stream, skiprows = 5)
local_file_name_out = "csv/prova.csv"
container_name_out = "input"
blob_client = blob_service_client.get_blob_client(
    container=container_name_out, blob=local_file_name_out)
blob_client.upload_blob(df.to_csv(path_or_buf = None , encoding='utf-8-sig', index=False))
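For the second problem (processing only blobs whose names start with AAA_), a minimal sketch on the Python side, assuming the same connection string and container as above and a hypothetical excel/ prefix, is to filter with ContainerClient.list_blobs(name_starts_with=...):

from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string("XXXX==;EndpointSuffix=core.windows.net")
container_client = blob_service_client.get_container_client("input")

# name_starts_with acts as a prefix filter, so only blobs such as
# excel/AAA_prova1.xlsx are returned
for blob in container_client.list_blobs(name_starts_with="excel/AAA_"):
    print(blob.name)  # feed each name into the xlsx-to-csv code above

Alternatively, the same prefix-style selection could be done in ADF (e.g. the Filter File task mentioned in the question) before running the Python script.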
Azure Functions, Python 3.8 version of an Azure Function. It waits for a blob trigger from the Excel upload, then does some processing and uses a good chunk of your code for the final step.
Note the split to trim the .xlsx off the file name.
This is what I ended up with:
source_blob = (f"https://{account_name}.blob.core.windows.net/{uploadedxlsx.name}")
file_name = uploadedxlsx.name.split("/")[2]
container_name = "container"
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(f"Received/{file_name}")
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
df = pd.read_excel(stream)
file_name_t = file_name.split(".")[0]
local_file_name_out = f"Converted/{file_name_t}.csv"
container_name_out = "out_container"
blob_client = blob_service_client.get_blob_client(
    container=container_name_out, blob=local_file_name_out)
blob_client.upload_blob(df.to_csv(path_or_buf = None , encoding='utf-8-sig', index=False))
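For context, a minimal sketch of how that snippet might sit inside the blob-triggered entry point (the function and parameter names here are assumptions; the actual trigger binding lives in function.json):

import azure.functions as func

def main(uploadedxlsx: func.InputStream):
    # uploadedxlsx.name is the full blob path, e.g. "container/Received/prova.xlsx",
    # which is why the code above splits on "/" to recover just the file name
    file_name = uploadedxlsx.name.split("/")[2]
    print(f"Processing {file_name}")
    # ... the download_blob / read_excel / to_csv / upload_blob code from above goes here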

How to append data to an existing csv file in AWS S3 using python boto3

I have a CSV file in S3, and I have to append data to that file whenever I call the function, but I'm not able to do that:
df = pd.DataFrame(data_list)
bytes_to_write = df.to_csv(None, header=None, index=False).encode()
file_name = "Words/word_dictionary.csv" # Not working the below line
s3_client.put_object(Body=bytes_to_write, Bucket='recengine', Key=file_name)
This code is replacing the data inside the file instead of appending to it. Any solution?
S3 has no append functionality. You need to read the file from S3, append the data in your code, then upload the complete file to the same key in S3.
Check this thread on the AWS forum for details
The code will probably look like:
df = pd.DataFrame(data_list)
bytes_to_write = df.to_csv(None, header=None, index=False).encode()
file_name = "Words/word_dictionary.csv"
# get the existing file and read its body as bytes
current_data = s3_client.get_object(Bucket='recengine', Key=file_name)['Body'].read()
# append the new rows to the existing content
appended_data = current_data + bytes_to_write
# overwrite the object with the combined content
s3_client.put_object(Body=appended_data, Bucket='recengine', Key=file_name)
You can try using the AWS Data Wrangler (awswrangler) library from AWS Labs to append to or overwrite a CSV dataset stored in S3. Check out their documentation and tutorials here: link
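A minimal sketch of that approach, assuming awswrangler is installed and that the data can be stored as a folder-style dataset (the path below is hypothetical):

import awswrangler as wr
import pandas as pd

# rows to append (stand-in for data_list in the question)
new_df = pd.DataFrame({"word": ["example"]})

# mode="append" adds new CSV files under the prefix instead of overwriting;
# a "dataset" here is a folder of CSV files rather than a single object
wr.s3.to_csv(
    df=new_df,
    path="s3://recengine/Words/word_dictionary/",
    dataset=True,
    mode="append",
    index=False,
)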
You can utilize the pandas concat function to append the data and then write the csv back to the S3 bucket:
from io import StringIO
import pandas as pd
# read current data from bucket as data frame
csv_obj = s3_client.get_object(Bucket=bucket, Key=key)
current_data = csv_obj['Body'].read().decode('utf-8')
current_df = pd.read_csv(StringIO(current_data))
# append data
appended_data = pd.concat([current_df, new_df], ignore_index=True)
appended_data_encoded = appended_data.to_csv(None, index=False).encode('utf-8')
# write the appended data to s3 bucket
s3_client.put_object(Body=appended_data_encoded,Bucket=bucket, Key=key)

how do I write a list of data to S3 in ORC format

I need to write a file in ORC format directly to an S3 bucket. The file will be the result of a query to a database.
I know how to write a CSV file directly to S3 but couldn't find a way to write directly in ORC. Any recommendations?
Save ORC content to a file
Using default values as per the linked documentation, since there is no code sample to work with:
df = spark.read.load("examples/src/main/resources/users.parquet")
# write the selected columns out in ORC format
df.select("name", "favorite_color").write.save("namesAndFavColors.orc", format="orc")
upload file
import boto3
# Create an S3 client
s3 = boto3.client('s3')
filename = 'namesAndFavColors.orc'  # the ORC file written in the previous step
bucket_name = 'my-bucket'
# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)
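If Spark is not available, one alternative sketch (an assumption, not from the original answer, and requiring a recent pyarrow with ORC support) is to build the table with pyarrow, write the ORC file locally, and upload it with the same boto3 client:

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc

# build a table from the query results (a pandas DataFrame here as a stand-in)
df = pd.DataFrame({"name": ["Alyssa", "Ben"], "favorite_color": [None, "red"]})
table = pa.Table.from_pandas(df)

# write the ORC file locally, then upload it as in the boto3 snippet above
orc.write_table(table, "namesAndFavColors.orc")

s3 = boto3.client('s3')
s3.upload_file("namesAndFavColors.orc", "my-bucket", "namesAndFavColors.orc")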
