I am fairly new to unit testing in Python and have been using pytest. I have been using fixtures for some unit tests, but there is a simple function (see below) that I want to test which has me baffled.
It simply reads a Parquet file from S3 and, because there are duplicate rows in the source data, dedupes on a given list of columns.
In the unit test I just want to mock loading the data from S3, as I don't want to perform this read. Any pointers on how to approach this?
def load_parquet(bucket_name: str,
                 bucket_prefix: str,
                 folder: str,
                 columns_required: list,
                 log_date_start: date,
                 log_date_end: date):
    '''Function to load event data for given date
    Args:
        bucket_name: S3 bucket
        bucket_prefix: S3 bucket input path to load parquet
        folder: S3 folder
        columns_required: Columns to load and dedupe on
        log_date_start: Start of logdate
        log_date_end: End of logdate
    Returns:
        Dataframe with required columns and deduped
    '''
    s3_path = f's3://{bucket_name}/{bucket_prefix}/{folder}/'
    df = wr.s3.read_parquet(path=s3_path,
                            columns=columns_required,
                            partition_filter=lambda x: log_date_start <= x["date"] <= log_date_end,
                            dataset=True,
                            use_threads=True)
    # Partition filter creates unnecessary columns and drop any duplicates
    df_deduped = df[columns_required].drop_duplicates()
    return df_deduped
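One possible approach, sketched here under assumptions rather than offered as the definitive answer, is to replace wr.s3.read_parquet with a mock that returns a small hand-built dataframe, so the test exercises only the column selection and dedupe. In this sketch `wr` is a MagicMock standing in for `import awswrangler as wr` so the example runs without the library; the column names and the module name `my_module` mentioned in the comment are hypothetical.

```python
from datetime import date
from unittest import mock

import pandas as pd

# Stand-in for `import awswrangler as wr` so this sketch runs on its own.
# In a real project you would patch the name where it is looked up, e.g.
# mock.patch("my_module.wr.s3.read_parquet") ("my_module" is hypothetical).
wr = mock.MagicMock(name="awswrangler")


def load_parquet(bucket_name, bucket_prefix, folder, columns_required,
                 log_date_start, log_date_end):
    """Trimmed copy of the function under test."""
    s3_path = f"s3://{bucket_name}/{bucket_prefix}/{folder}/"
    df = wr.s3.read_parquet(
        path=s3_path,
        columns=columns_required,
        partition_filter=lambda x: log_date_start <= x["date"] <= log_date_end,
        dataset=True,
        use_threads=True,
    )
    return df[columns_required].drop_duplicates()


def test_load_parquet_dedupes_without_touching_s3():
    # Fake "S3" data: one duplicated row plus the extra partition column.
    fake = pd.DataFrame({
        "user_id": [1, 1, 2],
        "event": ["click", "click", "view"],
        "date": ["2021-01-01", "2021-01-01", "2021-01-01"],
    })
    with mock.patch.object(wr.s3, "read_parquet", return_value=fake) as fake_read:
        result = load_parquet("bucket", "prefix", "folder",
                              ["user_id", "event"],
                              date(2021, 1, 1), date(2021, 1, 2))

    # The read was requested exactly once, against the expected path.
    fake_read.assert_called_once()
    assert fake_read.call_args.kwargs["path"] == "s3://bucket/prefix/folder/"
    # Only the required columns remain and the duplicate row is gone.
    assert list(result.columns) == ["user_id", "event"]
    assert len(result) == 2
```

With pytest, the same idea is often written with the monkeypatch fixture or the pytest-mock plugin instead of unittest.mock directly; the key point is that the patch target is the name as seen from the module that calls it.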
I have a pyspark.sql.DataFrame sourced from some Parquet files. It contains a column of binary type that holds one PDF file per row. Currently, I can write the files locally by calling write_documents:
# full_path includes name of file and its suffix (.pdf)
def write_document_locally(full_path: str, byte_file: bytearray):
    with open(full_path, "wb") as f:
        f.write(byte_file)
def write_documents(data_frame: sql.DataFrame) -> None:
    [
        write_document_locally(full_path=full_path, byte_file=byte_file)
        for full_path, byte_file in zip(
            data_frame["file_path_and_name"], data_frame["byte_file"]
        )
    ]
From the same job I'm also writing a Parquet table to a separate location. Both output folders, including the resulting PDF and Parquet files, are partitioned by year and id. In the PDF case I partition by manually concatenating year=XXXX/id=XX to the full_path; in the Parquet case I use:
data_frame.write.mode("overwrite").partitionBy("year", "id").parquet(path=another_path)
To replicate the PDF export in AWS and write it to an S3 bucket instead, I would have to use boto3. I'm wondering whether there is a more efficient way of doing this using data_frame.write instead.
The problems with using boto3 are: 1) I would write the PDF locally on one driver before uploading it to S3, which is inefficient and gathers all data on one driver (I think), and 2) it would not create partitions automatically for me.
I have two relatively large dataframes (less than 5MB), which I receive from my front-end as files via my API Gateway. I am able to receive the files and can print the dataframes in my receiver Lambda function. From my Lambda function, I am trying to invoke my state machine (which just cleans up the dataframes and does some processing). However, when passing my dataframe to my step function, I receive the following error:
ClientError: An error occurred (413) when calling the StartExecution operation: HTTP content length exceeded 1049600 bytes
My Receiver Lambda function:
dict = {}
dict['username'] = arr[0]
dict['region'] = arr[1]
dict['country'] = arr[2]
dict['grid'] = arr[3]
dict['physicalServers'] = arr[4] #this is one dataframe in json format
dict['servers'] = arr[5] #this is my second dataframe in json format
client = boto3.client('stepfunctions')
response = client.start_execution(
    stateMachineArn='arn:aws:states:us-west-2:##:stateMachine:MyStateMachineTest',
    name='testStateMachine',
    input= json.dumps(dict)
)
print(response)
Is there something I can do to pass in my dataframes to my step function? The dataframes contain sensitive customer data which I would rather not store in my S3. I realize I can store the files into S3 (directly from my front-end via pre-signed URLs) and then read the files from my step function but this is one of my least preferred approaches.
Passing them as direct input via input= json.dumps(dict) isn't going to work, as you are finding. You are running up against the size limit of the request. You need to save the dataframes to a file, somewhere the step functions can access it, and then just pass the file paths as input to the step function.
The way I would solve this is to write the data frames to files in the Lambda file system, with some random ID, perhaps the Lambda invocation ID, in the filename. Then have the Lambda function copy those files to an S3 bucket. Finally invoke the step function with the S3 paths as part of the input.
Over on the Step Functions side, have your state machine expect S3 paths for the physicalServers and servers input values, and use those paths to download the files from S3 during state machine execution.
Finally, I would configure an S3 lifecycle policy on the bucket, to remove any objects more than a few days old (or whatever time makes sense for your application) so that the bucket doesn't get large and run up your AWS bill.
An alternative to using S3 would be to use an EFS volume mount in both this Lambda function and in the Lambda function (or EC2 or ECS) that your step function is executing. With EFS your code could write and read from it just like a local file system, which would eliminate the steps of copying to/from S3, but you would have to add some code at the end of your step function to clean up the files after you are done, since EFS won't do that for you.
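The S3-based approach described above could be sketched roughly as follows. This is a minimal illustration under assumptions, not a definitive implementation: the function name stage_and_start, the "incoming/" key prefix, and the shape of the payload are all hypothetical, and the sketch uploads the JSON strings directly with put_object rather than writing them to the Lambda file system first. The clients are passed in so the function can be exercised without AWS access; in the Lambda they would be boto3.client('s3') and boto3.client('stepfunctions').

```python
import json
import uuid


def stage_and_start(s3_client, sfn_client, bucket, state_machine_arn,
                    metadata, frames_json):
    """Upload JSON-serialised dataframes to S3 and start the state machine
    with only their S3 locations in the input.

    metadata:    small values passed through as-is (username, region, ...)
    frames_json: mapping of field name -> dataframe already dumped to JSON
    """
    run_id = str(uuid.uuid4())  # random ID so concurrent runs never collide
    payload = dict(metadata)

    for field, body in frames_json.items():
        key = f"incoming/{run_id}/{field}.json"  # hypothetical key layout
        s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
        # The state machine receives a pointer, not the data itself.
        payload[field] = {"bucket": bucket, "key": key}

    return sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        name=f"run-{run_id}",
        input=json.dumps(payload),
    )
```

Because the input now carries only a few short strings, it stays far below the request size limit regardless of how large the dataframes are. Using a fresh run ID in the execution name also avoids reusing a fixed name such as 'testStateMachine' for every invocation.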
In an Amazon S3 bucket, event logs are sent as a CSV file every hour. I would like to perform some brief descriptive analysis on one week's worth of data, every week (i.e. 168 files every week). The point of the analysis is to output a list of trending products for each week. I have a Python script written on my local machine which retrieves the latest 168 files from S3 using boto3, and does all the necessary wrangling etc.
But now I need to put this into a Lambda function. I will set up an EventBridge rule to trigger the Lambda function every Monday. Is it possible to read multiple files in a Lambda function using standard boto3, or do I need to do something special when defining the Lambda handler function?
Here is the code from my local machine for getting the 168 files:
# import modules
import boto3
import pandas as pd
from io import StringIO
# set up aws credentials
s3 = boto3.resource('s3')
client = boto3.client('s3', aws_access_key_id='XXXXXXXXXXXXXXXXXXXX',
aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
# name s3 bucket
my_bucket = s3.Bucket('bucket_with_data')
# get names of last weeks files from s3 bucket
files = []
for file in my_bucket.objects.all():
    files.append(file.key)
files = files[-168:] # all files from last 7 days (24 files * 7 days per week)
bucket_name = 'bucket_with_data'
data = []
col_list = ["user_id", "event_type", "event_time", "session_id", "event_properties.product_id"]
for obj in files:
    csv_obj = client.get_object(Bucket=bucket_name, Key=obj)
    body = csv_obj['Body']
    csv_string = body.read().decode('utf-8')
    temp = pd.read_csv(StringIO(csv_string))
    final_list = list(set(col_list) & set(temp.columns))
    temp = temp[final_list]
    data.append(temp)
# combining all dataframes into one
event_data = pd.concat(data, ignore_index=True)
So, my question is, can I just put this code into a lambda function and it should work, or do I need to incorporate a lambda_handler function? If I need to use a lambda_handler function, how would I handle multiple files? Because the lambda is being triggered by a schedule, rather than one event taking place.
Yes, you need a Lambda function handler, otherwise it's not a Lambda function.
The Lambda runtime is looking for a specific entry point in your code and it will invoke that entry point with a specific set of parameters (event, context).
You can ignore the event and context if you choose to and simply use the boto3 SDK to list objects in a given S3 bucket (assuming your Lambda has permission to do this, via an IAM role), and then perform whatever actions you need to against those objects.
Your code example explicitly supplies boto3 with an access key and secret key, but that is not the best practice in AWS Lambda (or EC2 or other compute environment running on AWS). Instead, configure the Lambda function to assume an IAM role (that provides the necessary permissions).
Be aware of the Lambda function timeout, and increase it as necessary if you are going to process a lot of files. Increasing the RAM size is also potentially a good idea as it will give you correspondingly more CPU and network bandwidth.
If you're going to process a very large number of files, then consider using S3 Batch Operations.
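As a rough sketch of the points above, a scheduled handler might look like the following. It makes several assumptions: the helper names latest_keys and read_rows are hypothetical, the files are picked by sorting on LastModified rather than by taking the last 168 keys in listing order as the question's script does, and rows are parsed with the standard csv module purely to keep the sketch dependency-free (pandas.read_csv would slot in the same way). No credentials are passed to boto3; the client relies on the Lambda's IAM role.

```python
import csv
import io

BUCKET_NAME = "bucket_with_data"  # bucket name taken from the question
FILES_PER_WEEK = 24 * 7           # one file per hour for seven days


def latest_keys(s3_client, bucket, count):
    """Return the keys of the `count` most recently modified objects."""
    paginator = s3_client.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket):
        objects.extend(page.get("Contents", []))
    objects.sort(key=lambda obj: obj["LastModified"])
    return [obj["Key"] for obj in objects[-count:]]


def read_rows(s3_client, bucket, key):
    """Download one CSV object and return its rows as dictionaries."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
    text = body.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def lambda_handler(event, context, s3_client=None):
    # The scheduled event carries nothing we need, so `event` is ignored.
    if s3_client is None:
        import boto3  # credentials come from the function's IAM role
        s3_client = boto3.client("s3")

    keys = latest_keys(s3_client, BUCKET_NAME, FILES_PER_WEEK)
    rows = []
    for key in keys:
        rows.extend(read_rows(s3_client, BUCKET_NAME, key))

    # The weekly analysis would run on `rows` here.
    return {"files_read": len(keys), "rows_read": len(rows)}
```

The optional s3_client argument exists only so the handler can be exercised with a stub in tests; the Lambda runtime calls it with just (event, context).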
I have 3 files per date per name in this format:
'nameXX_date', here's an example:
'nameXX_01-01-20'
'nameXY_01-01-20'
'nameXZ_01-01-20'
where 'name' can be anything, and the date is whatever day the file was uploaded (almost every day).
I need to write a cloud function that triggers whenever a new file lands in the bucket, that combines the 3 XX,XY,XZ files into one file with filename = "name_date".
Here's what I've got so far:
bucket_id = 'bucketname'
client = gcs.Client()
bucket = client.get_bucket(bucket_id)
name =
date =
outfile = f'bucketname/{name}_{date}.CSV'
blobs = []
for shard in ('XX', 'XY', 'XZ'):
    sfile = f'{name}{shard}_{date}'
    blob = bucket.blob(sfile)
    if not blob.exists():
        # this causes a retry in 60s
        raise ValueError(f'branch {sfile} not present')
    blobs.append(blob)
bucket.blob(outfile).compose(blobs)
logging.info(f'Successfully created {outfile}')
for blob in blobs:
    blob.delete()
logging.info('Deleted {} blobs'.format(len(blobs)))
The issue I'm facing is that I'm not sure how to get the name and date of the new file that landed in the bucket, so that I can find the other two matching files and combine them.
Btw, I've got this code from this article and I'm trying to implement it here: https://medium.com/google-cloud/how-to-write-to-a-single-shard-on-google-cloud-storage-efficiently-using-cloud-dataflow-and-cloud-3aeef1732325
As I understand it, the cloud function is triggered by a google.storage.object.finalize event on an object in the specific GCS bucket.
In that case your cloud function signature looks like this (taken from the Medium article you mentioned):
def compose_shards(data, context):
The data argument is a dictionary with plenty of details about the object (file) that has been finalized. See some details here: Google Cloud Storage Triggers
For example, data["name"] is the name of the object under discussion.
If you know the pattern/template/rule according to which those objects/shards are named, you can extract the relevant elements from an object/shard name, and use it to compose the target object/file name.
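To make that concrete, here is a small sketch of pulling the name and date out of data["name"], assuming the naming rule from the question (a free-form name, one of the suffixes XX, XY or XZ, an underscore, then a dd-mm-yy style date). The regular expression and the helper names parse_shard_name and shards_for_event are assumptions for illustration only.

```python
import re

SHARDS = ("XX", "XY", "XZ")

# <name><shard>_<date>, e.g. "nameXX_01-01-20" (pattern assumed from the question)
SHARD_PATTERN = re.compile(
    r"(?P<name>.+)(?P<shard>XX|XY|XZ)_(?P<date>\d{2}-\d{2}-\d{2})"
)


def parse_shard_name(object_name):
    """Return (name, date) for a shard object, or None if it is not a shard."""
    match = SHARD_PATTERN.fullmatch(object_name)
    if match is None:
        return None
    return match["name"], match["date"]


def shards_for_event(data):
    """Given the trigger payload, list all three shard names to combine."""
    parsed = parse_shard_name(data["name"])
    if parsed is None:
        # Not a shard, e.g. the combined output file landing in the same bucket.
        return None
    name, date = parsed
    return [f"{name}{shard}_{date}" for shard in SHARDS]
```

Inside compose_shards(data, context), the values returned by parse_shard_name(data["name"]) would fill in the name and date variables left blank in the question's code. Returning None for non-matching objects matters because the composed output file also fires a finalize event when it is written to the same bucket.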
I am looking for some advice on this project. My thought was to use Python and a Lambda to aggregate the data and respond to the website. The main parameters are date ranges and can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets, displaying counts and returns for each bucket (for our purposes here, let's say the buckets are Sectors and Market Cap ranges, which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemented this in an AWS Lambda. However, when testing requests that cover 20 years of data (and yes, I get them), I begin to hit the memory limits in AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert the files into a pandas dataframe. I was loading all of the files into one large dataframe. This is where it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like but does the job of getting what I need.
The end-result dataframe is then converted back into JSON and is less than 500 MB.
This entire process when it works locally outside the lambda is about 40 seconds.
I have tried running this with threads and processing single frames at once but the performance degrades to about 1 min 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside of node.js without the use of a lambda and took almost 3 minutes to generate.
Code currently used
I had to clean this a little to pull out some items but here is the code used.
Read data from S3 into JSON; this results in a list of string data.
while not q.empty():
    fkey = q.get()
    try:
        obj = self.s3.Object(bucket_name=bucket,key=fkey[1])
        json_data = obj.get()['Body'].read().decode('utf-8')
        results[fkey[1]] = json_data
    except Exception as e:
        results[fkey[1]] = str(e)
    q.task_done()
Loop through the JSON files to build a dataframe for working
for k,v in s3Data.items():
    lstdf.append(buildDataframefromJson(k,v))
def buildDataframefromJson(key, json_data):
    tmpdf = pd.DataFrame(columns=['ticker','totalReturn','isExcluded','marketCapStartUsd',
                                  'category','marketCapBand','peGreaterThanMarket', 'Month','epsUsd']
                         )
    #Read the json into a dataframe
    # (np.bool was removed in NumPy 1.24; the builtin bool is equivalent here)
    tmpdf = pd.read_json(json_data,
                         dtype={
                             'ticker':str,
                             'totalReturn':np.float32,
                             'isExcluded':bool,
                             'marketCapStartUsd':np.float32,
                             'category':str,
                             'marketCapBand':str,
                             'peGreaterThanMarket':bool,
                             'epsUsd':np.float32
                         })[['ticker','totalReturn','isExcluded','marketCapStartUsd','category',
                             'marketCapBand','peGreaterThanMarket','epsUsd']]
    # The month is encoded in the fourth path segment of the S3 key as MM-YYYY
    dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
    dtTmp = datetime.strptime(str(dtTmp.year) + '-'+ str(dtTmp.month),'%Y-%m')
    tmpdf.insert(0,'Month',dtTmp, allow_duplicates=True)
    return tmpdf