Parallel delete operations on S3 using PySpark - python

I have a situation where I need to delete a large number of files (hundreds of millions) from S3, and it takes forever with the traditional approaches (even using the Python boto3 package with delete_objects to remove them in chunks of 1,000, processed locally across 16 processes).
So, I developed an approach using PySpark (the driver code is sketched further below), where I:
get the list of files I need to delete
parallelize it into a dataframe, partitioned by prefix (considering that I have a limit of 3,500 DELETE requests/sec per prefix)
get the underlying RDD and apply delete_objects using the RDD's .mapPartitions() method
convert it back to a dataframe (.toDF())
run .cache() and .count() to force the execution of the requests
This is the function I am passing to .mapPartitions():
import boto3
from pyspark.sql import Row

def delete_files(list_of_rows):
    # chunked_iterable is a small helper that yields the rows in chunks of up to 1000,
    # the maximum number of keys a single delete_objects call accepts
    for chunk in chunked_iterable(list_of_rows, 1000):
        session = boto3.session.Session(region_name='us-east-1')
        client = session.client('s3')
        files = list(chunk)
        bucket = files[0][0]
        delete = {'Objects': [{'Key': f[1]} for f in files]}
        response = client.delete_objects(
            Bucket=bucket,
            Delete=delete,
        )
        yield Row(
            deleted=len(response.get('Deleted'))
        )
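For reference, here is a rough sketch of the driver-side pipeline described above (the column order bucket/key/prefix and the example row are my assumptions about how the file list is laid out; delete_files is the function shown above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder file list: one (bucket, key, prefix) tuple per object to delete.
# The column order matters because delete_files reads f[0] as the bucket and f[1] as the key.
list_of_files = [("my-bucket", "data/part=1/file-0001.parquet", "data/part=1/")]
files_df = spark.createDataFrame(list_of_files, ["bucket", "key", "prefix"])

deleted_df = (
    files_df
    .repartition("prefix")        # keep each prefix inside a single partition
    .rdd
    .mapPartitions(delete_files)  # issue the delete_objects calls per partition
    .toDF()
)

# Force execution of the delete requests
deleted_df.cache()
print(deleted_df.count())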
It works nicely, except that, depending on the number of files, I keep getting SlowDown (status code 503) exceptions, hitting the limit of 3,500 DELETE requests/sec per prefix.
It does not make sense to me, considering that I am partitioning my rows by prefix [.repartition("prefix")] (meaning that I should not have the same prefix in more than one partition) and mapping the delete_files function over each partition at once.
In my head it should not be possible that I am calling delete_objects for the same prefix at the same time, so I cannot find a reason to keep hitting those limits.
Is there something else I should consider?
Thanks in advance!

Related

Passing in a dataframe to a stateMachine from Lambda

I have two relatively large dataframes (less than 5MB), which I receive from my front-end as files via my API Gateway. I am able to receive the files and can print the dataframes in my receiver Lambda function. From my Lambda function, I am trying to invoke my state machine (which just cleans up the dataframes and does some processing). However, when passing my dataframe to my step function, I receive the following error:
ClientError: An error occurred (413) when calling the StartExecution operation: HTTP content length exceeded 1049600 bytes
My Receiver Lambda function:
dict = {}
dict['username'] = arr[0]
dict['region'] = arr[1]
dict['country'] = arr[2]
dict['grid'] = arr[3]
dict['physicalServers'] = arr[4]  # this is one dataframe in json format
dict['servers'] = arr[5]  # this is my second dataframe in json format

client = boto3.client('stepfunctions')
response = client.start_execution(
    stateMachineArn='arn:aws:states:us-west-2:##:stateMachine:MyStateMachineTest',
    name='testStateMachine',
    input=json.dumps(dict)
)
print(response)
Is there something I can do to pass in my dataframes to my step function? The dataframes contain sensitive customer data which I would rather not store in my S3. I realize I can store the files into S3 (directly from my front-end via pre-signed URLs) and then read the files from my step function but this is one of my least preferred approaches.
Passing them as direct input via input=json.dumps(dict) isn't going to work, as you are finding. You are running up against the size limit of the request. You need to save the dataframes to files somewhere the step function can access them, and then just pass the file paths as input to the step function.
The way I would solve this is to write the data frames to files in the Lambda file system, with some random ID, perhaps the Lambda invocation ID, in the filename. Then have the Lambda function copy those files to an S3 bucket. Finally invoke the step function with the S3 paths as part of the input.
Over on the Step Functions side, have your state machine expect S3 paths for the physicalServers and servers input values, and use those paths to download the files from S3 during state machine execution.
Finally, I would configure an S3 lifecycle policy on the bucket, to remove any objects more than a few days old (or whatever time makes sense for your application) so that the bucket doesn't get large and run up your AWS bill.
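A minimal sketch of the Lambda side of that flow (the bucket name, key layout, and state machine ARN are placeholders, and the two dataframes are assumed to already be JSON strings; run_id could be context.aws_request_id):
import json
import os

import boto3

s3 = boto3.client('s3')
sfn = boto3.client('stepfunctions')

BUCKET = 'my-staging-bucket'  # placeholder: a bucket with a short lifecycle policy
STATE_MACHINE_ARN = 'arn:aws:states:us-west-2:123456789012:stateMachine:MyStateMachineTest'  # placeholder

def stage_and_start(metadata, physical_servers_json, servers_json, run_id):
    """Write both payloads to /tmp, copy them to S3, then start the state
    machine with only the S3 URIs (plus the small metadata) as input."""
    uris = {}
    for name, payload in [('physicalServers', physical_servers_json),
                          ('servers', servers_json)]:
        local_path = os.path.join('/tmp', f'{run_id}-{name}.json')
        with open(local_path, 'w') as f:
            f.write(payload)
        key = f'staging/{run_id}/{name}.json'
        s3.upload_file(local_path, BUCKET, key)
        uris[name] = f's3://{BUCKET}/{key}'

    return sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({**metadata, **uris}),  # small payload: metadata plus S3 paths only
    )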
An alternative to using S3 would be to use an EFS volume mounted in both this Lambda function and in the Lambda function (or EC2 or ECS task) that your step function executes. With EFS your code could write and read from it just like a local file system, which would eliminate the steps of copying to/from S3, but you would have to add some code at the end of your step function to clean up the files after you are done, since EFS won't do that for you.

Default Dict using a lot of memory Python

I have a server with 32 GB of memory, and my script consumes all of it and more, to the point where it gets killed. Just want to ask what I can do in this scenario. I have a CSV file with 320,000 rows which I am iterating over in Pandas. It has three columns called date, location and value. My goal is to store all locations and values per date, so date is my key. I'm going to store it in S3 as a JSON file. My appending code looks like this:
from collections import defaultdict

appends = defaultdict(list)
for i, row in df.iterrows():
    appends[row["date"]].append(dict(
        location=row['location'],
        value=row['value'],
    ))
Then I write it to S3 like this:
for key in appends.keys():
    try:
        append = appends[key]
        obj = s3.Object(bucket_storage, f"{key}/data.json")
        obj.put(Body=json.dumps(append, cls=DjangoJSONEncoder))
    except IntegrityError:
        pass
But this is consuming all the memory. I've read somewhere that defaultdict can be memory-hungry, but I'm not sure what my other options are. I don't want to involve a database here either.
Basically I need all the data to be mapped in the dict first before saving it to S3, but the problem is that either it can't handle all the data or I am doing something wrong here. Thanks
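For reference, here is a rough sketch of the same date-keyed grouping written with groupby, so each date's records are built and uploaded one group at a time instead of materializing one big dict of Python objects first (the bucket name and CSV path are placeholders, and default=str stands in for the DjangoJSONEncoder above):
import json

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket_storage = 'my-bucket'  # placeholder

df = pd.read_csv('data.csv', usecols=['date', 'location', 'value'])  # placeholder path

# Build and upload one date's payload at a time; only a single group's
# list of dicts is held in memory at any point.
for date, group in df.groupby('date'):
    payload = group[['location', 'value']].to_dict(orient='records')
    s3.Object(bucket_storage, f"{date}/data.json").put(
        Body=json.dumps(payload, default=str)  # default=str used here instead of DjangoJSONEncoder
    )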

Python asynchronous file download + parsing + outputting to JSON

To briefly explain the context: I am downloading SEC prospectus data, for example. After downloading, I want to parse each file to extract certain data, then output the parsed dictionaries to a JSON file which consists of a list of dictionaries. I would use a SQL database for output, but the research cluster admins at my university are being slow about getting me access. If anyone has any suggestions for how to store the data for easy reading/writing later I would appreciate it; I was thinking about HDF5 as a possible alternative.
A minimal example of what I am doing, with the spots that I think need to be improved labeled:
import os
import json
from math import ceil
from multiprocessing import Pool

import grequests
from tqdm import tqdm

BASE_URL = 'https://www.sec.gov/Archives/'  # same base URL as in getformurls.sh

def classify_file(doc):
    try:
        data = {
            'link': doc.url
        }
    except AttributeError:
        return {'flag': 'ATTRIBUTE ERROR'}
    # Do a bunch of parsing using regular expressions

if __name__ == "__main__":
    items = list()
    for d in tqdm([y + ' ' + q for y in ['2019'] for q in ['1']]):
        stream = os.popen('bash ./getformurls.sh ' + d)
        stacked = stream.read().strip().split('\n')
        # split each line into the fixed-width fields
        widths = (12, 62, 12, 12, 44)
        items += [[item[sum(widths[:j]):sum(widths[:j+1])].strip()
                   for j in range(len(widths))] for item in stacked]
    urls = [BASE_URL + item[4] for item in items]

    resp = list()
    # PROBLEM 1
    filelimit = 100
    for i in range(ceil(len(urls)/filelimit)):
        print(f'Downloading: {i*filelimit/len(urls)*100:2.0f}%... ', end='\r', flush=True)
        resp += [r for r in grequests.map((grequests.get(u) for u in urls[i*filelimit:(i+1)*filelimit]))]

    # PROBLEM 2
    with Pool() as p:
        rs = p.map_async(classify_file, resp, chunksize=20)
        rs.wait()
        prospectus = rs.get()

    with open('prospectus_data.json', 'w') as f:  # needs 'w' to write the output
        json.dump(prospectus, f)
The getformurls.sh script referenced is a bash script I wrote that was faster than doing it in Python since I could use grep; the code for that is:
#!/bin/bash
BASE_URL="https://www.sec.gov/Archives/"
INDEX="edgar/full-index/"
url="${BASE_URL}${INDEX}$1/QTR$2/form.idx"
out=$(curl -s ${url} | grep "^485[A|B]POS")
echo "$out"
PROBLEM 1: So I am currently pulling about 18k files in the grequests map call. I was running into an error about too many files being open so I decided to split up the urls list into manageable chunks. I don't like this solution, but it works.
PROBLEM 2: This is where my actual error is. This code runs fine on a smaller set of URLs (~2k) on my laptop (it uses 100% of my CPU and ~20 GB of RAM: ~10 GB for the file downloads and another ~10 GB once the parsing starts), but when I take it to the larger 18k dataset using 40 cores on a research cluster, it spins up to ~100 GB of RAM and ~3 TB of swap usage, then crashes after parsing about 2k documents in 20 minutes via a KeyboardInterrupt from the server.
I don't really understand why the swap usage is getting so crazy, but I think I really just need help with memory management here. Is there a way to create a generator of unsent requests that will be sent when I call classify_file() on them later? Any help would be appreciated.
Generally when you have runaway memory usage with a Pool, it's because the workers are being re-used and accumulating memory with each iteration. You can occasionally close and re-open the pool to prevent this, but it's such a common issue that Python now has a built-in parameter to do it for you...
Pool(...maxtasksperchild) is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool.
There's no way for me to tell you what the right value is, but you generally want to set it low enough that resources can be freed fairly often but not so low that it slows things down. (Maybe a minute's worth of processing... just as a guess.)
with Pool(maxtasksperchild=5) as p:
    rs = p.map_async(classify_file, resp, chunksize=20)
    rs.wait()
    prospectus = rs.get()
For your first problem, you might consider just using requests and moving the call inside of the worker process you already have. Pulling 18K worth of URLs and caching all that data up front is going to take time and memory. If it's all encapsulated in the worker, you'll minimize data usage and you won't need to spin up so many open file handles.
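A rough sketch of that idea, assuming the pool maps over the URLs themselves so each worker downloads and parses its own file (the URL list is built the same way as in the question, and the parsing body is whatever classify_file currently does):
from multiprocessing import Pool

import requests

def download_and_classify(url):
    """Fetch one filing inside the worker process, then parse it there as well."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
    except requests.RequestException:
        return {'flag': 'DOWNLOAD ERROR', 'link': url}
    data = {'link': url}
    # ... do the regular-expression parsing on r.text here, as in classify_file ...
    return data

if __name__ == "__main__":
    urls = []  # built from the getformurls.sh output exactly as in the question
    with Pool(maxtasksperchild=5) as p:
        prospectus = p.map(download_and_classify, urls, chunksize=20)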

Reading csv files with glob to pass data to a database very slow

I have many csv files and I am trying to pass all the data that they contain into a database. For this reason, I found that I could use the glob library to iterate over all csv files in my folder. Following is the code I used:
import requests as req
import pandas as pd
import glob
import json

endpoint = "testEndpoint"
path = "test/*.csv"

for fname in glob.glob(path):
    print(fname)
    df = pd.read_csv(fname)
    for index, row in df.iterrows():
        # print(row['ID'], row['timestamp'], row['date'], row['time'],
        #       row['vltA'], row['curA'], row['pwrA'], row['rpwrA'], row['frq'])
        print(row['timestamp'])
        testjson = {"data":
                    {"installationid": row['ID'],
                     "active": row['pwrA'],
                     "reactive": row['rpwrA'],
                     "current": row['curA'],
                     "voltage": row['vltA'],
                     "frq": row['frq'],
                     }, "timestamp": row['timestamp']}
        payload = {"payload": [testjson]}
        json_data = json.dumps(payload)
        response = req.post(
            endpoint, data=json_data, headers=headers)
This code seems to work fine in the beginning. However, after some time it starts to become really slow (I noticed this because I print the timestamp as I upload the data) and eventually stops completely. What is the reason for this? Is something I am doing here really inefficient?
I can see 3 possible problems here:
memory: read_csv is fast, but it loads the content of a full file into memory. If the files are really large, you could exhaust real memory and start using swap, which has terrible performance
iterrows: you build a dataframe - a data structure optimized for column-wise access - only to then access it by rows. This is already a bad idea, and iterrows is known to have terrible performance because it builds a Series for each row
one POST request per row: an HTTP request has its own overhead, and furthermore this means that you add rows to the database one at a time. If this is the only interface for your database, you may have no other choice, but you should check whether it is possible to prepare a batch of rows and load it as a whole. That often provides a gain of more than an order of magnitude.
Without more info I can hardly say more, but IMHO the biggest gain is to be found in how the database is fed, so point 3. If nothing can be done on that point, or if further performance gains are required, I would try to replace pandas with the csv module (sketched below), which is row-oriented and has a limited footprint because it only processes one line at a time regardless of file size.
Finally, and if it makes sense for your use case, I would try to use one thread for reading the csv files that feeds a queue, and a pool of threads to send the requests to the database. That should help absorb the HTTP overhead. But beware: depending on the endpoint implementation, it may not improve much if database access really is the limiting factor.
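To illustrate points 2 and 3, here is a rough sketch that reads each file with the csv module and posts the rows in batches (the batch size, the headers, and whether the endpoint accepts more than one record under "payload" are assumptions):
import csv
import glob
import json

import requests as req

endpoint = "testEndpoint"
headers = {"Content-Type": "application/json"}
BATCH_SIZE = 500  # assumed; tune to whatever the endpoint accepts

def row_to_record(row):
    return {"data": {"installationid": row['ID'],
                     "active": row['pwrA'],
                     "reactive": row['rpwrA'],
                     "current": row['curA'],
                     "voltage": row['vltA'],
                     "frq": row['frq']},
            "timestamp": row['timestamp']}

batch = []
for fname in glob.glob("test/*.csv"):
    with open(fname, newline='') as f:
        for row in csv.DictReader(f):  # one row at a time, no full-file dataframe
            batch.append(row_to_record(row))
            if len(batch) >= BATCH_SIZE:
                req.post(endpoint, data=json.dumps({"payload": batch}), headers=headers)
                batch = []

if batch:  # flush the remainder
    req.post(endpoint, data=json.dumps({"payload": batch}), headers=headers)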

Hitting AWS Lambda Memory limit in Python

I am looking for some advice on this project. My thought was to use Python and a Lambda to aggregate the data and respond to the website. The main parameters are date ranges and can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets displaying counts and returns for each bucket (for our purposes here, let's say the buckets are Sectors and Market Cap ranges, which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemented this in an AWS Lambda; however, when testing requests for 20 years of data (and yes, I get them), I begin to hit the memory limits of AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert the files into a pandas dataframe. I was loading all of the files into one large dataframe - this is where it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like but does the job of getting what I need.
The end-result dataframe is then converted back into JSON and is less than 500 MB.
This entire process, when it works locally outside the Lambda, takes about 40 seconds.
I have tried running this with threads and processing single frames at once, but the performance degrades to about 1 min 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside of node.js without the use of a lambda and took almost 3 minutes to generate.
Code currently used
I had to clean this a little to pull out some items but here is the code used.
Read data from S3 into JSON; this will result in a list of string data.
while not q.empty():
    fkey = q.get()
    try:
        obj = self.s3.Object(bucket_name=bucket, key=fkey[1])
        json_data = obj.get()['Body'].read().decode('utf-8')
        results[fkey[1]] = json_data
    except Exception as e:
        results[fkey[1]] = str(e)
    q.task_done()
Loop through the JSON files to build a dataframe for working
for k, v in s3Data.items():
    lstdf.append(buildDataframefromJson(k, v))

def buildDataframefromJson(key, json_data):
    tmpdf = pd.DataFrame(columns=['ticker', 'totalReturn', 'isExcluded', 'marketCapStartUsd',
                                  'category', 'marketCapBand', 'peGreaterThanMarket', 'Month', 'epsUsd']
                         )
    # Read the json into a dataframe
    tmpdf = pd.read_json(json_data,
                         dtype={
                             'ticker': str,
                             'totalReturn': np.float32,
                             'isExcluded': np.bool,
                             'marketCapStartUsd': np.float32,
                             'category': str,
                             'marketCapBand': str,
                             'peGreaterThanMarket': np.bool,
                             'epsUsd': np.float32
                         })[['ticker', 'totalReturn', 'isExcluded', 'marketCapStartUsd', 'category',
                             'marketCapBand', 'peGreaterThanMarket', 'epsUsd']]
    dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
    dtTmp = datetime.strptime(str(dtTmp.year) + '-' + str(dtTmp.month), '%Y-%m')
    tmpdf.insert(0, 'Month', dtTmp, allow_duplicates=True)
    return tmpdf
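The aggregation itself (step 3 above) isn't shown in the question; purely as a hedged sketch, one way it might look with the columns from buildDataframefromJson, where the exact buckets and the mean-return metric are my guesses:
import pandas as pd

# Concatenate the per-month frames built above, then bucket by sector
# (category) and market-cap band, counting tickers and averaging returns.
df = pd.concat(lstdf, ignore_index=True)

agg = (
    df.groupby(['Month', 'category', 'marketCapBand'])
      .agg(count=('ticker', 'size'),
           avgReturn=('totalReturn', 'mean'))
      .reset_index()
)

result_json = agg.to_json(orient='records')  # converted back to JSON for the website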
