Hitting AWS Lambda Memory limit in Python

I am looking for some advice on this project. My thought was to use Python and a Lambda to aggregate the data and respond to the website. The main parameters are date ranges and can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets displaying counts and returns for each bucket (for our purposes here, let's say the buckets are Sectors and Market Cap ranges, which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemented this in an AWS Lambda; however, when testing requests that span 20 years of data (and yes, I do get them), I begin to hit the memory limits in AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert the files into a pandas dataframe. I was loading all of the files into one large dataframe. - This is where it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like but does the job of getting what I need.
The end-result dataframe is then converted back into JSON and is less than 500 MB.
This entire process, when run locally outside the Lambda, takes about 40 seconds.
I have tried running this with threads, processing single frames one at a time, but the performance degrades to about 1 minute 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside Node.js without the use of a Lambda and took almost 3 minutes to generate.
Code currently used
I had to clean this up a little to pull out some items, but here is the code used.
Read data from S3 into JSON; this will result in a list of string data.
while not q.empty():
    fkey = q.get()
    try:
        obj = self.s3.Object(bucket_name=bucket, key=fkey[1])
        json_data = obj.get()['Body'].read().decode('utf-8')
        results[fkey[1]] = json_data
    except Exception as e:
        results[fkey[1]] = str(e)
    q.task_done()
Loop through the JSON files to build a dataframe to work with.
for k, v in s3Data.items():
    lstdf.append(buildDataframefromJson(k, v))

def buildDataframefromJson(key, json_data):
    tmpdf = pd.DataFrame(columns=['ticker', 'totalReturn', 'isExcluded', 'marketCapStartUsd',
                                  'category', 'marketCapBand', 'peGreaterThanMarket', 'Month', 'epsUsd']
                         )
    # Read the json into a dataframe
    tmpdf = pd.read_json(json_data,
                         dtype={
                             'ticker': str,
                             'totalReturn': np.float32,
                             'isExcluded': np.bool,
                             'marketCapStartUsd': np.float32,
                             'category': str,
                             'marketCapBand': str,
                             'peGreaterThanMarket': np.bool,
                             'epsUsd': np.float32
                         })[['ticker', 'totalReturn', 'isExcluded', 'marketCapStartUsd', 'category',
                             'marketCapBand', 'peGreaterThanMarket', 'epsUsd']]
    dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
    dtTmp = datetime.strptime(str(dtTmp.year) + '-' + str(dtTmp.month), '%Y-%m')
    tmpdf.insert(0, 'Month', dtTmp, allow_duplicates=True)
    return tmpdf
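The aggregation step mentioned in the process above is not shown; a minimal sketch of what it might look like, assuming the per-file frames in lstdf are concatenated first and the buckets are the category and marketCapBand columns already in the frame (the grouping keys and aggregate names are illustrative, not the original code):

# Hypothetical sketch of the aggregation step described above.
df = pd.concat(lstdf, ignore_index=True)

# Bucket by month, sector (category) and market-cap band, then count securities
# and average returns per bucket. Aggregate names are illustrative.
aggregated = (
    df.groupby(['Month', 'category', 'marketCapBand'])
      .agg(count=('ticker', 'size'), avgReturn=('totalReturn', 'mean'))
      .reset_index()
)
result_json = aggregated.to_json(orient='records')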

Related

Default Dict using a lot of memory Python

I have a server with 32 GB of memory, and I find that my script is consuming all that memory and even more, because it's getting killed. I just want to ask what I can do in this scenario. I have a CSV file with 320,000 rows which I am iterating over in pandas. It has three columns called date, location and value. My goal is to store all locations and values per date, so date is my key. I'm going to store it to S3 as a JSON file. My code for appending is like this:
appends = defaultdict(list)
for i, row in df.iterrows():
    appends[row["date"]].append(dict(
        location=row['location'],
        value=row['value'],
    ))
Then I write it to S3 like this:
for key in appends.keys():
    try:
        append = appends[key]
        obj = s3.Object(bucket_storage, f"{key}/data.json")
        obj.put(Body=json.dumps(append, cls=DjangoJSONEncoder))
    except IntegrityError:
        pass
But this one is consuming all the memory. I've read somewhere that defaultdict can be memory-consuming, but I'm not sure what my other options are. I don't want to involve a database here either.
Basically, I need all the data to be mapped in the dict first before saving it to S3, but the problem is that I think it can't handle all the data, or I am doing something wrong here. Thanks

DynamoDB: Importing Big Data with Python

We are importing a 5 GB CSV file into AWS DynamoDB.
At this time, we want to finish the import into DynamoDB within an hour or two, using only Python.
Also, since we are considering concurrent processing, we cannot split the file.
Using the article below as a reference, I imported a 5 GB file and it took 6 hours to complete.
Is there any way to make this import faster with Python code?
I don't know much about big data, but I'd like to learn more about it.
https://github.com/aws-samples/csv-to-dynamodb
import json
import boto3
import os
import csv
import codecs
import sys

s3 = boto3.resource('s3')
dynamodb = boto3.resource('dynamodb')

bucket = os.environ['bucket']
key = os.environ['key']
tableName = os.environ['table']

def lambda_handler(event, context):
    # get() does not store in memory
    try:
        obj = s3.Object(bucket, key).get()['Body']
    except:
        print("S3 Object could not be opened. Check environment variable.")
    try:
        table = dynamodb.Table(tableName)
    except:
        print("Error loading DynamoDB table. Check if table was created correctly and environment variable.")

    batch_size = 100
    batch = []

    # DictReader is a generator; not stored in memory
    for row in csv.DictReader(codecs.getreader('utf-8')(obj)):
        if len(batch) >= batch_size:
            write_to_dynamo(batch)
            batch.clear()
        batch.append(row)

    if batch:
        write_to_dynamo(batch)

    return {
        'statusCode': 200,
        'body': json.dumps('Uploaded to DynamoDB Table')
    }

def write_to_dynamo(rows):
    try:
        table = dynamodb.Table(tableName)
    except:
        print("Error loading DynamoDB table. Check if table was created correctly and environment variable.")

    try:
        with table.batch_writer() as batch:
            for i in range(len(rows)):
                batch.put_item(
                    Item=rows[i]
                )
    except:
        print("Error executing batch_writer")
For importing huge datasets into DynamoDB, you are better off using the built-in DynamoDB import-from-S3 functionality. With it, you do not pay for each PutItem request, only for the amount of imported data.
Keep in mind: when importing into DynamoDB, you pay based on the uncompressed data size. You may compress the data on S3 to reduce the transfer time and costs (if transferring to another region), but it will not affect the cost of the DynamoDB import operation.
You should prepare your data in one of 3 available formats in S3:
CSV — the simplest and the most cost-efficient, works best if you have strings-only attributes;
DynamoDB JSON — supports all DynamoDB types (numbers, lists, etc.), requires no 3rd-party libs, works best if you have not only strings;
Amazon Ion — supports all DynamoDB types, but you'll probably require using a 3rd-party library to reformat your data into Ion.
Frankly speaking, I have not found any Ion advantages compared to JSON for this use case: it is a superset of JSON and just as verbose, so there are no storage or compression benefits.
When using JSON, the only ways to reduce the uncompressed data size are to:
Minimize attribute names (in JSON you have to include the attribute names in every line).
Use dictionary compression for attribute values, i.e. if you have an enum field, store its values as 1, 2, 3, for example, and then decode them in your application code when reading from DynamoDB: {1: 'val1', 2: 'val2', …}.get(dynamodb_field_value).
Drop all unnecessary symbols, e.g. whitespace.
If you have strings only, I highly recommend using the CSV format, since in that case you do not have to include field names in every line.
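For reference, a minimal boto3 sketch of kicking off such an import might look like the following. Note that import-from-S3 creates a new table; the bucket, key prefix, and table schema below are placeholders, not details from the question.

import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# Start a server-side import of CSV data from S3 into a new table.
response = client.import_table(
    S3BucketSource={'S3Bucket': 'my-import-bucket', 'S3KeyPrefix': 'exports/'},  # placeholder
    InputFormat='CSV',
    InputCompressionType='GZIP',  # assuming the S3 objects are gzip-compressed
    TableCreationParameters={
        'TableName': 'my-imported-table',  # placeholder
        'AttributeDefinitions': [{'AttributeName': 'pk', 'AttributeType': 'S'}],
        'KeySchema': [{'AttributeName': 'pk', 'KeyType': 'HASH'}],
        'BillingMode': 'PAY_PER_REQUEST',
    },
)
print(response['ImportTableDescription']['ImportArn'])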

Reading csv files with glob to pass data to a database very slow

I have many csv files and I am trying to pass all the data that they contain into a database. For this reason, I found that I could use the glob library to iterate over all csv files in my folder. Following is the code I used:
import requests as req
import pandas as pd
import glob
import json

endpoint = "testEndpoint"
path = "test/*.csv"
headers = {"Content-Type": "application/json"}  # not shown in the original; assumed

for fname in glob.glob(path):
    print(fname)
    df = pd.read_csv(fname)

    for index, row in df.iterrows():
        # print(row['ID'], row['timestamp'], row['date'], row['time'],
        #       row['vltA'], row['curA'], row['pwrA'], row['rpwrA'], row['frq'])
        print(row['timestamp'])
        testjson = {"data":
                    {"installationid": row['ID'],
                     "active": row['pwrA'],
                     "reactive": row['rpwrA'],
                     "current": row['curA'],
                     "voltage": row['vltA'],
                     "frq": row['frq'],
                     }, "timestamp": row['timestamp']}
        payload = {"payload": [testjson]}
        json_data = json.dumps(payload)
        response = req.post(
            endpoint, data=json_data, headers=headers)
This code seems to work fine in the beginning. However, after some time it starts to become really slow (I noticed this because I print the timestamp as I upload the data) and eventually stops completely. What is the reason for this? Is something I am doing here really inefficient?
I can see 3 possible problems here:
memory. read_csv is fast, but it loads the content of a full file into memory. If the files are really large, you could exhaust the real memory and start using swap, which has terrible performance.
iterrows: you seem to build a dataframe - meaning a data structure optimized for column-wise access - and then access it by rows. This is already a bad idea, and iterrows is known to have terrible performance because it builds a Series for each row.
one POST request per row. An HTTP request has its own overhead, but furthermore, this means that you add rows to the database one at a time. If this is the only interface for your database, you may have no other choice, but you should check whether it is possible to prepare a batch of rows and load it as a whole; it often provides a gain of more than an order of magnitude (see the sketch after this answer).
Without more info I can hardly say more, but IMHO the biggest gain is to be found in database feeding, so in point 3. If nothing can be done on that point, or if further performance gains are required, I would try to replace pandas with the csv module, which is row-oriented and has a limited footprint because it only processes one line at a time, whatever the file size.
Finally, if it makes sense for your use case, I would try to use one thread to read the csv file and feed a queue, and a pool of threads to send requests to the database. That should help amortize the HTTP overhead. But beware: depending on the endpoint implementation, it may not improve much if the database access really is the limiting factor.
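A rough sketch of point 3, assuming the endpoint accepts multiple entries in the "payload" list (the batch size, headers, and field mapping are illustrative, not confirmed by the question):

import glob
import json
import pandas as pd
import requests as req

endpoint = "testEndpoint"
headers = {"Content-Type": "application/json"}  # assumed
BATCH_SIZE = 500  # illustrative

def row_to_entry(row):
    # Same field mapping as the original loop.
    return {"data": {"installationid": row['ID'],
                     "active": row['pwrA'],
                     "reactive": row['rpwrA'],
                     "current": row['curA'],
                     "voltage": row['vltA'],
                     "frq": row['frq']},
            "timestamp": row['timestamp']}

for fname in glob.glob("test/*.csv"):
    df = pd.read_csv(fname)
    entries = [row_to_entry(row) for row in df.to_dict(orient='records')]
    # One POST per batch instead of one POST per row.
    for start in range(0, len(entries), BATCH_SIZE):
        payload = {"payload": entries[start:start + BATCH_SIZE]}
        req.post(endpoint, data=json.dumps(payload), headers=headers)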

Parallel delete operations on S3 using PySpark

I have a situation where I need to delete a large number of files (hundreds of millions) from S3, and it takes, like, forever with the traditional approaches (even using the Python boto3 package with delete_objects to delete them in chunks of 1000, processing locally with 16 processes).
So, I developed an approach using PySpark, where I:
get the list of files I need to delete
parallelize it in a dataframe, partition it by prefix (considering that I have a limit of 3500 DELETE requests/sec per prefix)
get the underlying RDD and apply delete_objects using the .mapPartitions() method of the RDD
convert it to dataframe again (.toDF())
run .cache() and .count() to force the execution of the requests
This is the function I am passing to .mapPartitions():
def delete_files(list_of_rows):
    for chunk in chunked_iterable(list_of_rows, 1000):
        session = boto3.session.Session(region_name='us-east-1')
        client = session.client('s3')
        files = list(chunk)
        bucket = files[0][0]
        delete = {'Objects': [{'Key': f[1]} for f in files]}
        response = client.delete_objects(
            Bucket=bucket,
            Delete=delete,
        )
        yield Row(
            deleted=len(response.get('Deleted'))
        )
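For context, the driver-side pipeline described in the list above might look roughly like this, using the delete_files function just shown; files_to_delete and the column layout are placeholders, not the original code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# files_to_delete: an iterable of (bucket, key, prefix) tuples -- placeholder input.
df = spark.createDataFrame(files_to_delete, ["bucket", "key", "prefix"])

# Hash-partition by prefix so a given prefix lands in a single partition,
# keeping each partition within the 3500 DELETE requests/sec per-prefix limit.
deleted = (
    df.repartition("prefix")
      .rdd
      .mapPartitions(delete_files)
      .toDF()
      .cache()
)
deleted.count()  # force execution of the delete requests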
It works nicely, except that depending on the number of files, I keep getting SlowDown (status code 503) exceptions, hitting the limit of 3500 DELETE requests/sec per prefix.
That does not make sense to me, considering that I am partitioning my rows by prefix [.repartition("prefix")] (meaning that I should not have the same prefix in more than one partition) and mapping the delete_files function over one partition at a time.
In my head it is not possible that I am calling delete_objects for the same prefix at the same time, so I cannot find a reason why I keep hitting those limits.
Is there something else I should consider?
Thanks in advance!

Python Write dynamically huge files avoiding 100% CPU Usage

I am parsing a huge CSV file of approximately 2 GB with the help of this great stuff. Now I have to generate a dynamic file for each column, where the column name is the file name. So I wrote this code to write the dynamic files:
def write_CSV_dynamically(self, header, reader):
    """
    :header - CSVs first row in string format
    :reader - CSVs all other rows in list format
    """
    try:
        headerlist = header.split(',')  # -- string headers
        zipof = lambda x, y: zip(x.split(','), y.split(','))
        filename = "{}.csv".format(self.dtstamp)
        filename = "{}_" + filename
        filesdct = {filename.format(k.strip()): open(filename.format(k.strip()), 'a')
                    for k in headerlist}
        for row in reader:
            for key, data in zipof(header, row):
                filesdct[filename.format(key.strip())].write(str(data) + "\n")
        for _, v in filesdct.iteritems():
            v.close()
    except Exception, e:
        print e
Now it's taking around 50 seconds to write these huge files, using 100% CPU. As there are other heavy things running on my server, I want to limit my program to using only 10-20% of the CPU while writing these files, no matter if it takes 10-15 minutes.
How can I adjust my code so that it limits CPU usage to 10-20%?
There are a number of ways to achieve this:
Nice the process - plain and simple.
cpulimit - just pass your script and cpu usage as parameters:
cpulimit -P /path/to/your/script -l 20
Python's resource module to set limits from the script; bear in mind it works with absolute CPU time. A minimal sketch of options 1 and 3 follows below.
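A minimal sketch of options 1 and 3 (the nice level and CPU-second limit are arbitrary examples, not values from the question):

import os
import resource

# Option 1: lower the scheduling priority so other processes get the CPU first.
# os.nice() adds to the current niceness; 19 is the lowest priority.
os.nice(19)

# Option 3: cap total CPU time via the resource module. Note this is absolute
# CPU seconds, not a percentage -- the process receives SIGXCPU once the soft
# limit is exceeded, so it is a hard cutoff rather than throttling.
_, hard = resource.getrlimit(resource.RLIMIT_CPU)
resource.setrlimit(resource.RLIMIT_CPU, (600, hard))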
