Convert CSV to Parquet in S3 with Python

Convert CSV to Parquet in S3 with Python - python

I need convert a CSV file to Parquet file in S3 path. I'm trying use the code below, but no error occurs, the code execute with success and dont convert the CSV file
import pandas as pd
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
s3 = boto3.client("s3", region_name='us-east-2', aws_access_key_id='my key id',
aws_secret_access_key='my secret key')
obj = s3.get_object(Bucket='my bucket', Key='test.csv')
df = pd.read_csv(obj['Body'])
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table=table, root_path="test.parquet")

AWS CSV to Parquet Converter in Python
This Script gets files from Amazon S3 and converts it to Parquet Version for later query jobs and uploads it back to the Amazon S3.
import numpy
import pandas
import fastparquet
def lambda_handler(event,context):
#identifying resource
s3_object = boto3.client('s3', region_name='us-east-2')
#access file
get_file = s3_object.get_object(Bucket='ENTER_BUCKET_NAME_HERE', Key='CSV_FILE_NAME.csv')
get = get_file['Body']
df = pandas.DataFrame(get)
#convert csv to parquet function
def conv_csv_parquet_file(df):
converted_data_parquet = df.to_parquet('converted_data_parquet_version.parquet')
conv_csv_parquet_file(df)
print("File converted from CSV to parquet completed")
#uploading the parquet version file
s3_path = "/converted_to_parquet/" + converted_data_parquet
put_response = s3_resource.Object('ENTER_BUCKET_NAME_HERE',converted_data_parquet).put(Body=converted_data_parquet)
Python Library Boto3 allows the lambda to get the CSV file from S3 and then Fast-Parquet (or Pyarrow) converts the CSV file into Parquet.
From- https://github.com/ayshaysha/aws-csv-to-parquet-converter.py

Related

How to use pd.read_csv() with Blobs from GCS?

I have python script reading files from GCS bucket
from google.cloud import storage
import pandas as pd
client = storage.Client.from_service_account_json('sa.json')
BUCKET_NAME = 'sleep-accel'
bucket = client.get_bucket(BUCKET_NAME)
blobs_all = list(bucket.list_blobs())
blobs_specific = list(bucket.list_blobs(prefix='physionet.org/files/sleep-accel/1.0.0/motion/'))
main_df = pd.DataFrame({})
txtFile = blobs_specific[0].download_to_file('/tmp')
main_df = pd.concat([main_df, pd.read_csv(txtFile)])
blobs_specific returns array of blobs with contain a blob with few metadata and the .txt file that i need to get parsed by .read_csv();
I'm trying to figure out which GCS library function I'm suppose to use so that pd.read_csv() can read it
All the files are in .txt and thats why im trying to parse it to .csv here

GCS Read from a text file from google cloud to python jupyter notebook

Hello in my GCP jupyter notebook I am reading
from google.cloud import storage
client = storage.Client()
BUCKET_NAME = 'sleep-accel'
bucket = client.get_bucket(BUCKET_NAME)
blobs_all = list(bucket.list_blobs())
blobs_specific = list(bucket.list_blobs(prefix='physionet.org/files/sleep-accel/1.0.0/motion/'))
for doc in blobs_specific:
print(doc)
dataset that I loaded in to GCS and for some reason its printing
<Blob: sleep-accel, physionet.org/files/sleep-accel/1.0.0/motion/1455390_acceleration.txt, 1656705245042882>
how can I access the .txt files ?
Because my main/end goal is to convert the content of .txt into a single .csv format

Converting the .txt files to .csv format can be achieved by using the pandas module.
Below is my sample code converting txt files from the bucket to csv format
from google.cloud import storage
import pandas as pd
client = storage.Client()
BUCKET_NAME = 'your_bucket_name'
bucket = client.get_bucket(BUCKET_NAME)
blobs_specific = list(bucket.list_blobs(prefix='physionet.org/files/sleep-accel/1.0.0/motion/'))
#List all the objects inside physionet.org/files/sleep-accel/1.0.0/motion/ folder
for doc in list(blobs_specific)[1:]:
#read the txt files using pandas and remove the header and separate it using space
df = pd.read_csv("gs://your_bucket_file_path/" + doc.name, header=None, sep=' ')
#change the doc.name value .txt to .csv
to_csv = doc.name.replace('.txt','.csv')
print(to_csv)
#convert the txt files to csv using pandas and save it to physionet.org/files/sleep-accel/1.0.0/motion/ folder in your notebook
df.to_csv(to_csv, index=False, sep=',')
the csv file will be downloaded to your notebook local server.
Note: you need to create a directory tree like this: physionet.org/files/sleep-accel/1.0.0/motion/ in your notebook because this is where the csv file will be saved.

Python: Read CSV file without Pandas from S3 bucket [duplicate]

I have uploaded an excel file to AWS S3 bucket and now I want to read it in python. Any help would be appreciated. Here is what I have achieved so far,
import boto3
import os
aws_id = 'aws_id'
aws_secret = 'aws_secret_key'
client = boto3.client('s3', aws_access_key_id=aws_id, aws_secret_access_key=aws_secret)
bucket_name = 'my_bucket'
object_key = 'my_excel_file.xlsm'
object_file = client.get_object(Bucket=bucket_name, Key=object_key)
body = object_file['Body']
data = body.read()
What do I need to do next in order to read this data and work on it?

Spent quite some time on it and here's how I got it working,
import boto3
import io
import pandas as pd
import json
aws_id = ''
aws_secret = ''
bucket_name = ''
object_key = ''
s3 = boto3.client('s3', aws_access_key_id=aws_id, aws_secret_access_key=aws_secret)
obj = s3.get_object(Bucket=bucket_name, Key=object_key)
data = obj['Body'].read()
df = pd.read_excel(io.BytesIO(data), encoding='utf-8')

You can directly read xls file from S3 without having to download or save it locally. xlrd module has a provision to provide raw data to create workbook object.
Following is the code snippet.
from boto3 import Session
from xlrd.book import open_workbook_xls
aws_id = ''
aws_secret = ''
bucket_name = ''
object_key = ''
s3_session = Session(aws_access_key_id=aws_id, aws_secret_access_key=aws_secret)
bucket_object = s3_session.resource('s3').Bucket(bucket_name).Object(object_key)
content = bucket_object.get()['Body'].read()
workbook = open_workbook_xls(file_contents=content)

You can directly read excel files using awswrangler.s3.read_excel. Note that you can pass any pandas.read_excel() arguments (sheet name, etc) to this.
import awswrangler as wr
df = wr.s3.read_excel(path=s3_uri)

Python doesn't support excel files natively. You could use the pandas library pandas library read_excel functionality

Load CSV data into Jupyter Notebook from S3

I have several CSV files (50 GB) in an S3 bucket in Amazon Cloud. I am trying to read these files in a Jupyter Notebook (with Python3 Kernel) using the following code:
import boto3
from boto3 import session
import pandas as pd
session = boto3.session.Session(region_name='XXXX')
s3client = session.client('s3', config = boto3.session.Config(signature_version='XXXX'))
response = s3client.get_object(Bucket='myBucket', Key='myKey')
names = ['id','origin','name']
dataset = pd.read_csv(response['Body'], names=names)
dataset.head()
But I face the following error when I run the code:
valueError: Invalid file path or buffer object type: class 'botocore.response.StreamingBody'
I came across this bug report about pandas and boto3 object not being compatible yet.
My question is, how else can I import these CSV files from my S3 bucket into my Jupyter Notebook which runs on the Cloud.

You can also use s3fs which allows pandas to read directly from S3:
import s3fs
# csv file
df = pd.read_csv('s3://{bucket_name}/{path_to_file}')
# parquet file
df = pd.read_parquet('s3://{bucket_name}/{path_to_file}')
And then if you have multiple files in a bucket, you can iterate through them like so:
import boto3
s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(name='{bucket_name}')
for file in bucket.objects.all():
# do what you want with the files
# for example:
if 'filter' in file.key:
print(file.key)
new_df = pd.read_csv('s3:://{bucket_name}/{}'.format(file.key))

I am posting this fix to my problem, in case somebody needs it. I replaces the read_csv line with the following and the problem was solved:
dataset = pd.read_csv(io.BytesIO(response['Body'].read()), encoding='utf8')

How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3).
First, I can read a single parquet file locally like this:
import pyarrow.parquet as pq
path = 'parquet/part-r-00000-1e638be4-e31f-498a-a359-47d017a0059c.gz.parquet'
table = pq.read_table(path)
df = table.to_pandas()
I can also read a directory of parquet files locally like this:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('parquet/')
table = dataset.read()
df = table.to_pandas()
Both work like a charm. Now I want to achieve the same remotely with files stored in a S3 bucket. I was hoping that something like this would work:
dataset = pq.ParquetDataset('s3n://dsn/to/my/bucket')
But it does not:
OSError: Passed non-file path: s3n://dsn/to/my/bucket
After reading pyarrow's documentation thoroughly, this does not seem possible at the moment. So I came out with the following solution:
Reading a single file from S3 and getting a pandas dataframe:
import io
import boto3
import pyarrow.parquet as pq
buffer = io.BytesIO()
s3 = boto3.resource('s3')
s3_object = s3.Object('bucket-name', 'key/to/parquet/file.gz.parquet')
s3_object.download_fileobj(buffer)
table = pq.read_table(buffer)
df = table.to_pandas()
And here my hacky, not-so-optimized, solution to create a pandas dataframe from a S3 folder path:
import io
import boto3
import pandas as pd
import pyarrow.parquet as pq
bucket_name = 'bucket-name'
def download_s3_parquet_file(s3, bucket, key):
buffer = io.BytesIO()
s3.Object(bucket, key).download_fileobj(buffer)
return buffer
client = boto3.client('s3')
s3 = boto3.resource('s3')
objects_dict = client.list_objects_v2(Bucket=bucket_name, Prefix='my/folder/prefix')
s3_keys = [item['Key'] for item in objects_dict['Contents'] if item['Key'].endswith('.parquet')]
buffers = [download_s3_parquet_file(s3, bucket_name, key) for key in s3_keys]
dfs = [pq.read_table(buffer).to_pandas() for buffer in buffers]
df = pd.concat(dfs, ignore_index=True)
Is there a better way to achieve this? Maybe some kind of connector for pandas using pyarrow? I would like to avoid using pyspark, but if there is no other solution, then I would take it.

You should use the s3fs module as proposed by yjk21. However as result of calling ParquetDataset you'll get a pyarrow.parquet.ParquetDataset object. To get the Pandas DataFrame you'll rather want to apply .read_pandas().to_pandas() to it:
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()

Thanks! Your question actually tell me a lot. This is how I do it now with pandas (0.21.1), which will call pyarrow, and boto3 (1.3.1).
import boto3
import io
import pandas as pd
# Read single parquet file from S3
def pd_read_s3_parquet(key, bucket, s3_client=None, **args):
if s3_client is None:
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=key)
return pd.read_parquet(io.BytesIO(obj['Body'].read()), **args)
# Read multiple parquets from a folder on S3 generated by spark
def pd_read_s3_multiple_parquets(filepath, bucket, s3=None,
s3_client=None, verbose=False, **args):
if not filepath.endswith('/'):
filepath = filepath + '/' # Add '/' to the end
if s3_client is None:
s3_client = boto3.client('s3')
if s3 is None:
s3 = boto3.resource('s3')
s3_keys = [item.key for item in s3.Bucket(bucket).objects.filter(Prefix=filepath)
if item.key.endswith('.parquet')]
if not s3_keys:
print('No parquet found in', bucket, filepath)
elif verbose:
print('Load parquets:')
for p in s3_keys:
print(p)
dfs = [pd_read_s3_parquet(key, bucket=bucket, s3_client=s3_client, **args)
for key in s3_keys]
return pd.concat(dfs, ignore_index=True)
Then you can read multiple parquets under a folder from S3 by
df = pd_read_s3_multiple_parquets('path/to/folder', 'my_bucket')
(One can simplify this code a lot I guess.)

It can be done using boto3 as well without the use of pyarrow
import boto3
import io
import pandas as pd
# Read the parquet file
buffer = io.BytesIO()
s3 = boto3.resource('s3')
object = s3.Object('bucket_name','key')
object.download_fileobj(buffer)
df = pd.read_parquet(buffer)
print(df.head())

Probably the easiest way to read parquet data on the cloud into dataframes is to use dask.dataframe in this way:
import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/path/to/data-*.parq')
dask.dataframe can read from Google Cloud Storage, Amazon S3, Hadoop file system and more!

Provided you have the right package setup
$ pip install pandas==1.1.0 pyarrow==1.0.0 s3fs==0.4.2
and your AWS shared config and credentials files configured appropriately
you can use pandas right away:
import pandas as pd
df = pd.read_parquet("s3://bucket/key.parquet")
In case of having multiple AWS profiles you may also need to set
$ export AWS_DEFAULT_PROFILE=profile_under_which_the_bucket_is_accessible
so you can access your bucket.

If you are open to also use AWS Data Wrangler.
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://...")

You can use s3fs from dask which implements a filesystem interface for s3. Then you can use the filesystem argument of ParquetDataset like so:
import s3fs
s3 = s3fs.S3FileSystem()
dataset = pq.ParquetDataset('s3n://dsn/to/my/bucket', filesystem=s3)

Using pre-signed URLs
s3 =s3fs.S3FileSystem(key='your_key',secret='your_secret',client_kwargs={"endpoint_url":'your_end_point'})
df = dd.read_parquet(s3.url('your_bucket' + 'your_filepath',expires=3600,client_method='get_object'))

I have tried the #oya163 solution and it works but after little bit change
import boto3
import io
import pandas as pd
# Read the parquet file
buffer = io.BytesIO()
s3 = boto3.resource('s3',aws_access_key_id='123',aws_secret_access_key= '456')
object = s3.Object('bucket_name','myoutput.parquet')
object.download_fileobj(buffer)
df = pd.read_parquet(buffer)
print(df.head())

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Convert CSV to Parquet in S3 with Python - python

Related

How to use pd.read_csv() with Blobs from GCS?

GCS Read from a text file from google cloud to python jupyter notebook

Python: Read CSV file without Pandas from S3 bucket [duplicate]

Load CSV data into Jupyter Notebook from S3

How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

Categories

Resources