I'm trying to overwrite my parquet files in S3 with pyarrow. I've looked through the documentation and haven't found anything.
Here is my code:
import pandas as pd
from s3fs.core import S3FileSystem
import pyarrow as pa
import pyarrow.parquet as pq

s3 = S3FileSystem(anon=False)
output_dir = "s3://mybucket/output/my_table"

my_csv = pd.read_csv("file.csv")
my_table = pa.Table.from_pandas(my_csv, preserve_index=False)

pq.write_to_dataset(my_table,
                    output_dir,
                    filesystem=s3,
                    use_dictionary=True,
                    compression='snappy')
Is there something like mode = "overwrite" option in write_to_dataset function?
I think the best way to do it is with AWS Data Wrangler, which offers three different write modes:
append
overwrite
overwrite_partitions
Example:
import awswrangler as wr

wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])
Here's a solution using pyarrow.parquet (requires version 8.0+; see the docs regarding the existing_data_behavior argument) and S3FileSystem.
Now decide whether you want to overwrite partitions or the parquet part files that often compose those partitions.
Overwrite a single .parquet file
pq.write_to_dataset(
    my_table,
    root_path='bucket/mydata/year=2022/data_part001.parquet',
    filesystem=s3,
    existing_data_behavior="overwrite_or_ignore"
)
Overwrite .parquet files with a common basename within each partition
pq.write_to_dataset(
    my_table,
    root_path='bucket/mydata',
    partition_cols=['year'],
    basename_template='data_part001.parquet',
    filesystem=s3,
    existing_data_behavior="overwrite_or_ignore"
)
Overwriting existing partitions that match new records
If some of your new records belong to a partition that already exists, that entire partition will be overwritten and new partitions will be added with:
pq.write_to_dataset(
    my_table,
    root_path='bucket/mydata',
    partition_cols=['year'],
    filesystem=s3,
    existing_data_behavior="delete_matching"
)
Sorry, there's no such option yet, but the way I work around it is to use boto3 to delete the files before writing them.
import boto3
resource = boto3.resource('s3')
resource.Bucket('mybucket').objects.filter(Prefix='output/my_table').delete()
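A hedged sketch of the full round trip, reusing the variables from the question (my_table, output_dir, s3):
import boto3

# delete the existing objects under the table prefix first
boto3.resource('s3').Bucket('mybucket').objects.filter(Prefix='output/my_table').delete()

# then write the fresh dataset as usual
pq.write_to_dataset(my_table,
                    output_dir,
                    filesystem=s3,
                    use_dictionary=True,
                    compression='snappy')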
I have a pyspark.sql.DataFrame sourced from some parquet files which contains a binary column holding one PDF file per row. Currently, I can write them locally by calling write_documents:
# full_path includes the name of the file and its suffix (.pdf)
def write_document_locally(full_path: str, byte_file: bytearray):
    with open(full_path, "wb") as f:
        f.write(byte_file)

def write_documents(data_frame: sql.DataFrame) -> None:
    [
        write_document_locally(full_path=full_path, byte_file=byte_file)
        for full_path, byte_file in zip(
            data_frame["file_path_and_name"], data_frame["byte_file"]
        )
    ]
From the same job I'm also writing a parquet table to a separate location. Both folders that are created, including the resulting PDF/parquet files, are partitioned by year and id. In the PDF case I partition by manually concatenating year=XXXX/id=XX to the full_path; in the parquet case I use:
data_frame.write.mode("overwrite").partitionBy("year", "id").parquet(path=another_path)
To replicate the PDF export in AWS and write it to an S3 bucket instead, I would have to use boto3. I'm wondering whether there is a more efficient way of doing this using data_frame.write instead.
The problems with using boto3 are: 1) I would write the PDF locally on one driver before uploading it to S3, which is inefficient and gathers all data on one driver (I think), 2) it would not create the partitions automatically for me.
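For reference, this is roughly the boto3-per-partition fallback I have in mind (just a sketch: the bucket name and the documents/ key prefix are placeholders, and it assumes boto3 is installed on the executors):
import boto3

def upload_partition(rows):
    # runs on each executor, so nothing is collected on the driver
    s3 = boto3.client("s3")
    for row in rows:
        key = f"documents/year={row['year']}/id={row['id']}/{row['file_path_and_name']}"
        s3.put_object(Bucket="my-bucket", Key=key, Body=bytes(row["byte_file"]))

data_frame.select("year", "id", "file_path_and_name", "byte_file") \
    .rdd.foreachPartition(upload_partition)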
I was trying to load all CSV files recursively from all subfolders of a GCP bucket using Python pandas.
Currently I am using dask to load the data, but it's very slow.
import dask.dataframe

path = "gs://mybucket/parent_path/" + "*/*.csv"
getAllDaysData = dask.dataframe.read_csv(path).compute()
Can someone help me with a better way?
I would suggest reading from parquet files instead, and using pd.read_parquet(file, engine='pyarrow') to load them into a pandas dataframe.
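For example, once the CSVs have been converted to parquet under the same prefix, something like this should work (a sketch; it assumes gcsfs is installed so pandas can read gs:// paths directly):
import pandas as pd

# reads every parquet file under the prefix into one dataframe
df = pd.read_parquet("gs://mybucket/parent_path/", engine="pyarrow")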
Alternatively you might want to consider loading data into BigQuery first.
You can do something like this, as long as all CSV files have the same structure.
uri = f"gs://mybucket/parent_path/*.csv"
job_config = bigquery.LoadJobConfig(
source_format=bigquery.SourceFormat.CSV
)
load_job = client.load_table_from_uri(
uri,
'destination_table',
job_config=job_config,
location=GCP_LOCATION
)
load_job_result = load_job.result()
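Once the load job finishes, you can pull the result back into pandas if that's the end goal (a sketch, assuming the same client object; 'destination_table' stands in for your fully qualified table ID):
# query the loaded table back into a pandas dataframe
df = client.query("SELECT * FROM destination_table").to_dataframe()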
I'm having trouble writing to a new CSV file into an S3 bucket. I want to be able to read a CSV file that I have in an S3 bucket, and if one of the values in the CSV fits a certain requirement, I want to change it to a different value. I've read that it's not possible to edit an S3 object, so I need to create a new one every time. In short, I want to create a new, updated CSV file from another CSV file in an S3 bucket, with changes applied.
I'm trying to use DictWriter and DictReader, but I always run into issues with DictWriter. I can read the CSV file properly, but when I try to update it, I run into a myriad of significantly different issues with DictWriter. The error I am currently getting is described below the code.
# Function to be pasted into AWS Lambda.
# Accesses the S3 bucket, opens the CSV file, and reads the response line by line.

# To be able to access S3 buckets and the objects within the bucket
import boto3
# To be able to read the CSV by using DictReader
import csv

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key='Insurance.csv')
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    reader = csv.DictReader(lines)

    with open("s3://testing-bucket-1042/Insurance.csv", newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        fieldnames = ['county', 'eq_site_limit']
        writer = csv.DictWriter(lines, fieldnames=fieldnames)
        for row in reader:
            writer.writeheader()
            if row['county'] == "CLAY":  # if the row is under the column 'county' and contains the string "CLAY"
                writer.writerow({'county': 'CHANGED'})
            if row['eq_site_limit'] == "0":  # if the row is under the column 'eq_site_limit' and contains the string "0"
                writer.writerow({'eq_site_limit': '9000'})
Right now, the error that I am getting is that the path I use when attempting to open the CSV, "s3://testing-bucket-1042/Insurance.csv", is said to not exist.
The error says
"errorMessage": "[Errno 2] No such file or directory: 's3://testing-bucket-1042/Insurance.csv'",
"errorType": "FileNotFoundError"
What would be the correct way to use DictWriter, if at all?
First of all, s3:// is not a regular (local) file protocol, which is why you get that error message. It is good that you stated your intentions.
Okay, I refactored your code:
import codecs
import csv
from io import StringIO

import boto3

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key='Insurance.csv')

    # stream the CSV data directly from S3 via codecs
    stream = codecs.getreader('utf-8')(obj.get()['Body'])
    lines = list(csv.DictReader(stream))
    ### now you have your rows there

    csv_buffer = StringIO()
    out = csv.DictWriter(csv_buffer, fieldnames=['county', 'eq_site_limit'])

    for row in lines:
        if row['county'] == "CLAY":
            out.writerow({'county': 'CHANGED'})
        if row['eq_site_limit'] == "0":
            out.writerow({'eq_site_limit': '9000'})

    ### now write the content into some different bucket/key
    s3client = boto3.client('s3')
    s3client.put_object(Body=csv_buffer.getvalue().encode('utf-8'),
                        Bucket=target_bucket,  # your target bucket name
                        Key=target_key)        # your target object key
I hope that this works. Basically there are a few tricks:
use codecs to stream the CSV data directly from the S3 bucket
use StringIO to create an in-memory stream that csv.DictWriter can write to
when you are finished, one way to "upload" your content is through the S3 client's put_object method (as documented by AWS)
To logically separate AWS code from business logic, I normally recommend this approach:
Download the object from Amazon S3 to the /tmp directory
Perform desired business logic (read file, write file)
Upload the resulting file to Amazon S3
Using download_file() and upload_file() avoids having to worry about in-memory streams. It means you can take logic that normally operates on files (e.g. on your own computer) and apply it to files obtained from S3.
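A minimal sketch of that pattern, reusing the bucket and key from the question (the transform_csv() helper is a stand-in for your own read/write logic):
import boto3

s3 = boto3.client('s3')

# 1. download the object from S3 to the Lambda /tmp directory
s3.download_file('testing-bucket-1042', 'Insurance.csv', '/tmp/Insurance.csv')

# 2. run your business logic on ordinary local files
transform_csv('/tmp/Insurance.csv', '/tmp/Insurance_updated.csv')  # hypothetical helper

# 3. upload the resulting file back to S3
s3.upload_file('/tmp/Insurance_updated.csv', 'testing-bucket-1042', 'Insurance_updated.csv')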
It comes down to personal preference.
You can use the streaming functionality of the AWS CLI's S3 commands to make changes on the fly. This is better suited to text manipulation tools such as awk and sed.
Example:
aws s3 cp s3://bucketname/file.csv - | sed 's/foo/bar/g' | aws s3 cp - s3://bucketname/new-file.csv
AWS Docs: https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
Good afternoon. I am hoping that someone can help me with this issue.
I have multiple CSV files sitting in an S3 folder. I would like to use Python without Pandas, relying only on the csv package (because AWS Lambda has very limited packages available, and there is a size restriction), to loop through the files in the S3 bucket and read each CSV's dimensions (number of rows and number of columns).
For example, my S3 folder contains two CSV files (1.csv and 2.csv).
My code should run through the specified S3 folder, count the rows and columns in 1.csv and 2.csv, and put the result in a new CSV file. I greatly appreciate your help! I can do this using the Pandas package (thank god for Pandas), but AWS Lambda has restrictions that limit what I can use.
AWS Lambda uses Python 3.7.
If you can access your S3 resources from your Lambda function, then basically do this to check the rows:
def lambda_handler(event, context):
    import boto3 as bt3

    s3 = bt3.client('s3')

    csv1_data = s3.get_object(Bucket='the_s3_bucket', Key='1.csv')
    csv2_data = s3.get_object(Bucket='the_s3_bucket', Key='2.csv')

    contents_1 = csv1_data['Body'].read().decode('utf-8')
    contents_2 = csv2_data['Body'].read().decode('utf-8')

    rows1 = contents_1.splitlines()
    rows2 = contents_2.splitlines()

    return len(rows1), len(rows2)
It should work directly; if not, please let me know. BTW, hard-coding the bucket and file names into the function like I did in the sample is not a good idea at all.
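If you also need the column count and want to loop over every CSV under a prefix, here's a minimal sketch using only boto3 and the csv module (the bucket name, prefix, and output key are placeholders):
import csv
import io
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    results = [('file', 'rows', 'columns')]

    # list every object under the prefix and measure each CSV
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='the_s3_bucket', Prefix='myfolder/'):
        for obj in page.get('Contents', []):
            if not obj['Key'].endswith('.csv'):
                continue
            data = s3.get_object(Bucket='the_s3_bucket', Key=obj['Key'])['Body'].read().decode('utf-8')
            rows = list(csv.reader(data.splitlines()))
            results.append((obj['Key'], len(rows), len(rows[0]) if rows else 0))

    # write the dimensions to a new CSV back in the bucket
    out = io.StringIO()
    csv.writer(out).writerows(results)
    s3.put_object(Bucket='the_s3_bucket', Key='dimensions.csv', Body=out.getvalue().encode('utf-8'))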
Regards.
I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:
files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']

df = session.read.parquet(*files)
This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like for sparkSql to load as many of the files as it finds into the dataframe, and return this result without complaining. Is this possible?
Yes, it's possible if you change the method of specifying the input to a Hadoop glob pattern, for example:
files = 's3a://dev/2017/01/{02,03}/data.parquet'
df = session.read.parquet(files)
You can read more on patterns in Hadoop javadoc.
But in my opinion, this isn't an elegant way of working with data partitioned by time (by day in your case). If you are able to rename the directories like this:
s3a://dev/2017/01/03/data.parquet --> s3a://dev/day=2017-01-03/data.parquet
s3a://dev/2017/01/02/data.parquet --> s3a://dev/day=2017-01-02/data.parquet
then you can take advantage of spark partitioning schema and read data by:
session.read.parquet('s3a://dev/') \
    .where(col('day').between('2017-01-02', '2017-01-03'))
This way, empty/non-existing directories will be omitted as well. An additional column day will appear in your dataframe (it will be a string in Spark < 2.1.0 and a datetime in Spark >= 2.1.0), so you will know which directory each record comes from.
Can I observe that, as glob-pattern matching includes a full recursive tree walk and pattern match of the paths, it is an absolute performance killer against object stores, especially S3. There's a special shortcut in Spark to recognise when your path doesn't have any glob characters in it, in which case it makes a more efficient choice.
Similarly, a very deep partitioning tree, as in that year/month/day layout, means many directories are scanned, at a cost of hundreds of milliseconds (or worse) per directory.
The layout suggested by Mariusz should be much more efficient, as it is a flatter directory tree; switching to it should have a bigger impact on performance on object stores than on real filesystems.
A solution using union
files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']

for i, file in enumerate(files):
    act_df = spark.read.parquet(file)
    if i == 0:
        df = act_df
    else:
        df = df.union(act_df)
An advantage is that it can be done regardless of any pattern.
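If some of the paths might not exist, a variant of the same idea is to wrap each read in a try/except and skip the missing ones (a sketch; AnalysisException is what Spark raises for a missing path):
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.utils import AnalysisException

dfs = []
for file in files:
    try:
        dfs.append(spark.read.parquet(file))
    except AnalysisException:
        # path does not exist; skip it
        pass

df = reduce(DataFrame.union, dfs)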
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from awsglue.job import Job
import boto3

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

inputDyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={'paths': ["s3://dev-test-laxman-new-bucket/"]},
    format="parquet")
I am able to read multiple (2) parquet files from s3://dev-test-laxman-new-bucket/ and write them out as CSV files.
As you can see, I have 2 parquet files in my bucket.
Hope it will be helpful to others.
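The CSV write step isn't shown above; a minimal sketch of it might look like this (the output path is a placeholder):
# write the dynamic frame back out as CSV files
glueContext.write_dynamic_frame.from_options(
    frame=inputDyf,
    connection_type="s3",
    connection_options={'path': "s3://dev-test-laxman-new-bucket/csv-output/"},
    format="csv")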