Write PySpark binary column as PDF files to S3 (AWS Glue job) - python

I have a pyspark.sql.DataFrame sourced from some Parquet files; it contains a column of binary data that holds one PDF file per row. Currently, I can write the PDFs locally by calling write_documents:
# full_path includes name of file and its suffix (.pdf)
def write_document_locally(full_path: str, byte_file: bytearray):
    with open(full_path, "wb") as f:
        f.write(byte_file)

def write_documents(data_frame: sql.DataFrame) -> None:
    [
        write_document_locally(full_path=full_path, byte_file=byte_file)
        for full_path, byte_file in zip(
            data_frame["file_path_and_name"], data_frame["byte_file"]
        )
    ]
From the same job I'm also writing a Parquet table to a separate location. Both of the resulting folders, containing the PDF and Parquet files respectively, are partitioned by year and id. In the PDF case I partition by manually concatenating year=XXXX/id=XX onto the full_path; in the Parquet case I use:
data_frame.write.mode("overwrite").partitionBy("year", "id").parquet(path=another_path)
To replicate the PDF export in AWS and write to an S3 bucket instead, I would have to use boto3. I'm wondering whether there is a more efficient way of doing this using data_frame.write instead.
The problems with using boto3 are that 1) I would write each PDF locally on the driver before uploading it to S3, which is inefficient and gathers all the data on a single driver (I think), and 2) it would not create the partitions automatically for me.
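For reference, a minimal sketch (not part of the original question) of how the boto3 upload could instead be pushed down to the executors with foreachPartition, so the PDFs never pass through the driver; the bucket name and key prefix are hypothetical, and the Glue job role is assumed to have write access to the bucket:

import boto3

def upload_partition(rows):
    # one client per partition, created on the executor; credentials come from the job role
    s3 = boto3.client("s3")
    for row in rows:
        # full_path is assumed to already contain the year=XXXX/id=XX prefix
        s3.put_object(
            Bucket="my-output-bucket",                  # hypothetical bucket
            Key=f"pdfs/{row['file_path_and_name']}",    # hypothetical prefix
            Body=bytes(row["byte_file"]),
        )

data_frame.select("file_path_and_name", "byte_file").foreachPartition(upload_partition)

This keeps the upload distributed, but the year=/id= "partitions" are still just key prefixes built by hand, as in the local version.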

Related

How to process multiple CSV files from an Amazon S3 bucket in a lambda function?

In an Amazon S3 bucket, event logs are sent as a CSV file every hour. I would like to perform some brief descriptive analysis on one week's worth of data every week (i.e. 168 files per week). The point of the analysis is to output a list of trending products for each week. I have a Python script on my local machine which retrieves the latest 168 files from S3 using boto3 and does all the necessary wrangling etc.
But now I need to put this into a Lambda function. I will set up an EventBridge rule to trigger the Lambda function every Monday. But is it possible to read multiple files into a Lambda function using standard boto3, or do I need to do something special when defining the Lambda handler function?
Here is the code from my local machine for getting the 168 files:
# import modules
import boto3
import pandas as pd
from io import StringIO

# set up aws credentials
s3 = boto3.resource('s3')
client = boto3.client('s3', aws_access_key_id='XXXXXXXXXXXXXXXXXXXX',
                      aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

# name s3 bucket
my_bucket = s3.Bucket('bucket_with_data')

# get names of last week's files from the s3 bucket
files = []
for file in my_bucket.objects.all():
    files.append(file.key)
files = files[-168:]  # all files from the last 7 days (24 files per day * 7 days)

bucket_name = 'bucket_with_data'
data = []
col_list = ["user_id", "event_type", "event_time", "session_id", "event_properties.product_id"]
for obj in files:
    csv_obj = client.get_object(Bucket=bucket_name, Key=obj)
    body = csv_obj['Body']
    csv_string = body.read().decode('utf-8')
    temp = pd.read_csv(StringIO(csv_string))
    final_list = list(set(col_list) & set(temp.columns))
    temp = temp[final_list]
    data.append(temp)

# combining all dataframes into one
event_data = pd.concat(data, ignore_index=True)
So, my question is: can I just put this code into a Lambda function and expect it to work, or do I need to incorporate a lambda_handler function? And if I need a lambda_handler function, how would I handle multiple files, given that the Lambda is triggered by a schedule rather than by a single event?
Yes, you need a Lambda function handler, otherwise it's not a Lambda function.
The Lambda runtime is looking for a specific entry point in your code and it will invoke that entry point with a specific set of parameters (event, context).
You can ignore the event and context if you choose to and simply use the boto3 SDK to list objects in a given S3 bucket (assuming your Lambda has permission to do this, via an IAM role), and then perform whatever actions you need to against those objects.
Your code example explicitly supplies boto3 with an access key and secret key, but that is not the best practice in AWS Lambda (or EC2 or other compute environment running on AWS). Instead, configure the Lambda function to assume an IAM role (that provides the necessary permissions).
Be aware of the Lambda function timeout, and increase it as necessary if you are going to process a lot of files. Increasing the RAM size is also potentially a good idea as it will give you correspondingly more CPU and network bandwidth.
If you're going to process a very large number of files then consider using S3 Batch.
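For illustration, a minimal handler skeleton along these lines; the bucket name is taken from the question, everything else is a sketch (pandas would need to be packaged as a layer or in the deployment artifact, and list_objects_v2 returns at most 1,000 keys per call, so a paginator would be needed for larger buckets):

import boto3
import pandas as pd
from io import StringIO

s3_client = boto3.client('s3')  # credentials come from the Lambda's IAM role

def lambda_handler(event, context):
    # the scheduled EventBridge event payload can be ignored here
    bucket = 'bucket_with_data'
    keys = [obj['Key'] for obj in s3_client.list_objects_v2(Bucket=bucket).get('Contents', [])]
    keys = keys[-168:]  # last week's hourly files, assuming key names sort chronologically
    frames = []
    for key in keys:
        body = s3_client.get_object(Bucket=bucket, Key=key)['Body']
        frames.append(pd.read_csv(StringIO(body.read().decode('utf-8'))))
    event_data = pd.concat(frames, ignore_index=True)
    # ... perform the trending-products analysis and store the result somewhere ...
    return {'files_processed': len(keys)}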

Trying to move data from one Azure Blob Storage to another using a Python script

I have data that exists in zipped format in container A, which I need to transform using a Python script, and I am trying to schedule this to occur within Azure. However, when writing the output to a new storage container (container B), it simply outputs a CSV containing the name of the file rather than the data.
I've followed the tutorial on the Microsoft site exactly, but I can't get it to work - what am I missing?
https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
file_n='iris.csv'
# Load iris dataset from the task node
df = pd.read_csv(file_n)
# Subset records
df = df[df['Species'] == "setosa"]
# Save the subset of the iris dataframe locally in task node
df.to_csv("iris_setosa.csv", index = False, encoding="utf-8")
# Upload iris dataset
blobService.create_blob_from_text(containerName, "iris_setosa.csv", "iris_setosa.csv")
Specifically, the final line seems to give me a CSV called "iris_setosa.csv" whose only content is the string "iris_setosa.csv" in cell A1, rather than the actual data that was read in.
Update:
replace create_blob_from_text with create_blob_from_path.
create_blob_from_text creates a new blob from a str/unicode value, or updates the content of an existing blob. That is why you find the text iris_setosa.csv as the content of the new blob.
create_blob_from_path creates a new blob from a file path, or updates the content of an existing blob. It is what you want.
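A minimal sketch of the corrected call, reusing the names from the question; it uploads the local file written by df.to_csv rather than a literal string:

# Upload the local file itself, not the literal string "iris_setosa.csv"
blobService.create_blob_from_path(containerName, "iris_setosa.csv", "iris_setosa.csv")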
This workaround uses copy_blob and delete_blob to move an Azure blob from one container to another.
from azure.storage.blob import BlobService

def copy_azure_files(self):
    blob_service = BlobService(account_name='account_name', account_key='account_key')
    blob_name = 'iris_setosa.csv'
    copy_from_container = 'test-container'
    copy_to_container = 'demo-container'
    blob_url = blob_service.make_blob_url(copy_from_container, blob_name)
    # blob_url: https://demostorage.blob.core.windows.net/test-container/iris_setosa.csv
    blob_service.copy_blob(copy_to_container, blob_name, blob_url)
    # to move (rather than copy) the file, also delete the source blob
    blob_service.delete_blob(copy_from_container, blob_name)

Spark df partitioning after partitioning by yy/mm/dd

S3 hosts a very large compressed file (20 GB compressed -> 200 GB uncompressed).
I want to read this file in (unfortunately it decompresses on a single core), transform some SQL columns, and then write the output to S3 in the s3_path/year=2020/month=01/day=01/[files 1-200].parquet format.
The entire file consists of data from the same date. This leads me to believe that instead of using partitionBy('year','month','day') I should append "year={year}/month={month}/day={day}/" to the S3 path, because currently Spark is writing a single file at a time to S3 (1 GB each). Is my thinking correct?
Here is what I'm doing currently:
df = df\
    .withColumn('year', lit(datetime_object.year))\
    .withColumn('month', lit(datetime_object.month))\
    .withColumn('day', lit(datetime_object.day))

df\
    .write\
    .partitionBy('year', 'month', 'day')\
    .parquet(s3_dest_path, mode='overwrite')
What I'm thinking:
df = spark.read.format('json')\
    .load(s3_file, schema=StructType.fromJson(my_schema))\
    .repartition(200)
# currently takes a long time decompressing the 20gb s3_file.json.gz

# transform

df.write\
    .parquet(s3_dest_path + 'year={}/month={}/day={}/'.format(year, month, day))
You're probably running into the problem that Spark first writes the data to a _temporary directory and only then commits it to the final location. In HDFS this commit is done by a rename. However, S3 does not support renames; the data is instead fully copied (using only one executor). For more on this topic see for example this post: Extremely slow S3 write times from EMR/Spark
A common workaround is to write to HDFS and then use distcp to copy the data from HDFS to S3 in a distributed way.
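A rough sketch of that workaround, assuming the job runs on a cluster with HDFS available; the staging path and bucket are placeholders:

# write the partitioned output to HDFS first, where the commit is a cheap rename
df.write\
    .partitionBy('year', 'month', 'day')\
    .parquet('hdfs:///staging/my_table', mode='overwrite')

# then copy the finished output to S3 in a distributed way, e.g. from the cluster shell:
#   hadoop distcp hdfs:///staging/my_table s3a://my-bucket/my_table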

How can I read from a CSV file from an S3 bucket, apply certain if-statements to it, and write a new updated CSV file and place it in the S3 bucket?

I'm having trouble writing a new CSV file to an S3 bucket. I want to read a CSV file that I have in an S3 bucket, and if one of the values in the CSV fits a certain requirement, change it to a different value. I've read that it's not possible to edit an S3 object, so I need to create a new one every time. In short, I want to create a new, updated CSV file from another CSV file in an S3 bucket, with changes applied.
I'm trying to use DictWriter and DictReader, but I always run into issues with DictWriter. I can read the CSV file properly, but when I try to update it, there are a myriad of significantly different issues from DictWriter. The error I'm currently getting is described below the code.
# Function to be pasted into AWS Lambda.
# Accesses S3 bucket, opens the CSV file, receive the response line-by-line,

# To be able to access S3 buckets and the objects within the bucket
import boto3
# To be able to read the CSV by using DictReader
import csv

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key='Insurance.csv')
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    reader = csv.DictReader(lines)

    with open("s3://testing-bucket-1042/Insurance.csv", newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        fieldnames = ['county', 'eq_site_limit']
        writer = csv.DictWriter(lines, fieldnames=fieldnames)
        for row in reader:
            writer.writeheader()
            if row['county'] == "CLAY":  # if the row is under the column 'county', and contains the string "CLAY"
                writer.writerow({'county': 'CHANGED'})
            if row['eq_site_limit'] == "0":  # if the row is under the column 'eq_site_limit', and contains the string "0"
                writer.writerow({'eq_site_limit': '9000'})
Right now, the error that I am getting is that the path I use when attempting to open the CSV, "s3://testing-bucket-1042/Insurance.csv", is said to not exist.
The error says
"errorMessage": "[Errno 2] No such file or directory: 's3://testing-bucket-1042/Insurance.csv'",
"errorType": "FileNotFoundError"
What would be the correct way to use DictWriter, if at all?
First of all, s3:// is not a local file protocol, which is why you get that error message (open() cannot read from S3). It is good that you stated your intentions.
Okay, I refactored your code:
import codecs
import boto3
# To be able to read the CSV by using DictReader
import csv
from io import StringIO

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key='Insurance.csv')
    stream = codecs.getreader('utf-8')(obj.get()['Body'])
    lines = list(csv.DictReader(stream))
    ### now you have your object there

    csv_buffer = StringIO()
    out = csv.DictWriter(csv_buffer, fieldnames=['county', 'eq_site_limit'])
    for row in lines:
        if row['county'] == "CLAY":
            out.writerow({'county': 'CHANGED'})
        if row['eq_site_limit'] == "0":
            out.writerow({'eq_site_limit': '9000'})

    ### now write the content to some different bucket/key (placeholders)
    s3client = boto3.client('s3')
    s3client.put_object(Body=csv_buffer.getvalue().encode('utf-8'),
                        Bucket=target_bucket, Key=target_key)
I hope that this works. Basically there are a few tricks:
use codecs to stream the CSV data directly from the S3 bucket
use StringIO to create an in-memory stream that csv.DictWriter can write to
when you are finished, one way to "upload" your content is through the S3 client's put_object method (as documented in AWS)
To logically separate AWS code from business logic, I normally recommend this approach:
Download the object from Amazon S3 to the /tmp directory
Perform desired business logic (read file, write file)
Upload the resulting file to Amazon S3
Using download_file() and upload_file() avoids having to worry about in-memory streams. It means you can take logic that normally operates on files (e.g. on your own computer) and apply it to files obtained from S3.
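A minimal sketch of that approach, reusing the bucket and key from the question (the output key is hypothetical):

import csv
import boto3

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    bucket = 'testing-bucket-1042'
    # 1. download the source object to the Lambda's /tmp directory
    s3_client.download_file(bucket, 'Insurance.csv', '/tmp/Insurance.csv')

    # 2. plain local-file business logic
    with open('/tmp/Insurance.csv', newline='') as src, \
         open('/tmp/Insurance_updated.csv', 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row['county'] == 'CLAY':
                row['county'] = 'CHANGED'
            if row['eq_site_limit'] == '0':
                row['eq_site_limit'] = '9000'
            writer.writerow(row)

    # 3. upload the result as a new object (hypothetical key)
    s3_client.upload_file('/tmp/Insurance_updated.csv', bucket, 'Insurance_updated.csv')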
It comes down to personal preference.
You can use the streaming functionality of the AWS CLI's S3 commands to make changes on the fly. This is well suited to text manipulation tools such as awk and sed.
Example:
aws s3 cp s3://bucketname/file.csv - | sed 's/foo/bar/g' | aws s3 cp - s3://bucketname/new-file.csv
AWS Docs: https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html

Is there a way to use COPY on multiple files at once?

I am trying to find a way to move our MySQL databases onto Amazon Redshift for its speed and scalable storage. The Redshift documentation recommends splitting the data into multiple files and using the COPY command to load the data from S3 into the data warehouse. I am using Python to automate this process and plan to use boto3 for client-side encryption of the data:
s3 = boto3.client('s3',
                  aws_access_key_id='[Access key id]',
                  aws_secret_access_key='[Secret access key]')

filename = '[S3 file path]'
bucket_name = '[Bucket name]'

# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)

# create table for data
statement = 'create table [table_name] ([table fields])'
conn = psycopg2.connect(
    host='[host]',
    user='[user]',
    port=5439,
    password='[password]',
    dbname='dev')
cur = conn.cursor()
cur.execute(statement)
conn.commit()

# load data to redshift
conn_string = "dbname='dev' port='5439' user='[user]' password='[password]' host='[host]'"
conn = psycopg2.connect(conn_string)
cur = conn.cursor()
cur.execute("""copy [table_name] from '[data location]'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    null as 'NA'
    delimiter ','
    removequotes;""")
conn.commit()
The problem with this code is that I think I would have to create each table individually and then run a COPY for every file individually. Is there a way to get the data into Redshift using a single COPY for multiple files? Or is it possible to run multiple COPY statements at once? And is it possible to do this without creating a table for every single file?
Redshift does support a parallelized form of COPY from a single connection; in fact, it appears to be an anti-pattern to concurrently COPY data to the same tables from multiple connections.
There are two ways to do parallel ingestion:
Specify a common prefix in the COPY FROM, instead of a specific file name.
In this case, COPY will attempt to load all files from the bucket / folder with that prefix
OR, provide a manifest file, containing the names of the files
In both instances, you should split the source data up into an appropriate number of files of approximately equal size. Again from the docs:
Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For example, each DS1.XL compute node has two slices, and each DS1.8XL compute node has 32 slices.
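For illustration, a hedged sketch of both options using the same psycopg2 connection style as in the question; the bucket, prefix, manifest path, and table name are placeholders:

# Option 1: COPY everything that shares a common S3 prefix into one table
cur.execute("""copy my_table from 's3://my-bucket/my_table_parts/'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    delimiter ','
    removequotes;""")

# Option 2: COPY exactly the files listed in a manifest file
cur.execute("""copy my_table from 's3://my-bucket/manifests/my_table.manifest'
    access_key_id '[Access key id]'
    secret_access_key '[Secret access key]'
    region 'us-east-1'
    delimiter ','
    removequotes
    manifest;""")

conn.commit()

Either way, one table and one COPY statement can ingest many files; separate tables are only needed if the files genuinely belong to different tables.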
