We are importing a 5 GB CSV file into AWS DynamoDB.
We want the import into DynamoDB to finish within an hour or two, using only Python.
Also, since we are considering concurrent processing, we cannot split the file.
Using the repository below as a reference, I imported a 5 GB file, and it took 6 hours to complete.
Is there any way to make this import faster in Python?
I don't know much about big data, but I'd like to learn more about it.
https://github.com/aws-samples/csv-to-dynamodb
import json
import boto3
import os
import csv
import codecs
import sys

s3 = boto3.resource('s3')
dynamodb = boto3.resource('dynamodb')

bucket = os.environ['bucket']
key = os.environ['key']
tableName = os.environ['table']

def lambda_handler(event, context):
    # get() does not store in memory
    try:
        obj = s3.Object(bucket, key).get()['Body']
    except:
        print("S3 Object could not be opened. Check environment variable. ")
    try:
        table = dynamodb.Table(tableName)
    except:
        print("Error loading DynamoDB table. Check if table was created correctly and environment variable.")

    batch_size = 100
    batch = []

    # DictReader is a generator; not stored in memory
    for row in csv.DictReader(codecs.getreader('utf-8')(obj)):
        if len(batch) >= batch_size:
            write_to_dynamo(batch)
            batch.clear()
        batch.append(row)

    if batch:
        write_to_dynamo(batch)

    return {
        'statusCode': 200,
        'body': json.dumps('Uploaded to DynamoDB Table')
    }

def write_to_dynamo(rows):
    try:
        table = dynamodb.Table(tableName)
    except:
        print("Error loading DynamoDB table. Check if table was created correctly and environment variable.")

    try:
        with table.batch_writer() as batch:
            for i in range(len(rows)):
                batch.put_item(
                    Item=rows[i]
                )
    except:
        print("Error executing batch_writer")
For importing huge datasets into DynamoDB, you are better off using the built-in DynamoDB import from S3 functionality. With it, you do not pay for each PutItem request, only for the amount of imported data.
Keep in mind: when importing to DynamoDB, you pay based on the uncompressed data size. You may compress the data on S3 to reduce transfer time and costs (if transferring to another region), but that will not affect the cost of the DynamoDB import operation.
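As a rough sketch of what starting such an import could look like from Python: boto3's DynamoDB client exposes an import_table call. The bucket, prefix, table name and key attribute below are hypothetical, and the sketch assumes a CSV object with a header row and a single string partition key. Note that import from S3 creates a new table; it does not load into an existing one.

```python
def build_import_request(bucket, key_prefix, table_name, hash_key):
    """Build keyword arguments for the DynamoDB ImportTable API (CSV input).

    All argument values are supplied by the caller; nothing here is taken
    from the question's actual setup.
    """
    return {
        "S3BucketSource": {"S3Bucket": bucket, "S3KeyPrefix": key_prefix},
        "InputFormat": "CSV",
        # Use "GZIP" or "ZSTD" here if the S3 object is compressed.
        "InputCompressionType": "NONE",
        "TableCreationParameters": {
            "TableName": table_name,
            # Assumption: a single string partition key and no sort key.
            "AttributeDefinitions": [
                {"AttributeName": hash_key, "AttributeType": "S"}
            ],
            "KeySchema": [{"AttributeName": hash_key, "KeyType": "HASH"}],
            "BillingMode": "PAY_PER_REQUEST",
        },
    }


def start_import(request):
    """Start the import and return its ARN (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder has no AWS dependency

    client = boto3.client("dynamodb")
    response = client.import_table(**request)
    return response["ImportTableDescription"]["ImportArn"]


# Hypothetical names, for illustration only.
request = build_import_request("my-bucket", "imports/data.csv", "MyTable", "id")
```

Calling start_import(request) only kicks the job off; the import runs asynchronously on the AWS side, so the Python process does not have to stream the 5 GB itself.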
You should prepare your data in S3 in one of the three available formats:
- CSV: the simplest and most cost-efficient; works best if all your attributes are strings.
- DynamoDB JSON: supports all DynamoDB types (numbers, lists, etc.) and requires no third-party libraries; works best if you have more than just strings.
- Amazon Ion: supports all DynamoDB types, but you will probably need a third-party library to reformat your data into Ion.
Frankly speaking, I have not found any advantage of Ion over JSON for this use case: it is a superset of JSON and just as verbose, so there are no storage or compression benefits.
When using JSON, the only ways to reduce the uncompressed data size are to:
- Shorten attribute names (in JSON, the attribute names are repeated on every line).
- Use dictionary encoding for attribute values: if you have an enum field, store its values as 1, 2, 3, for example, and decode them in your application code when reading from DynamoDB: {1: 'val1', 2: 'val2', …}.get(dynamodb_field_value).
- Drop all unnecessary characters, such as whitespace.
If you have strings only, I highly recommend the CSV format, since you do not have to repeat the field names on every line.
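To make the three JSON size reductions concrete, here is a minimal sketch that turns one CSV row into one line of DynamoDB JSON. The column names, the short attribute names and the status codes are all made up for illustration; the one-object-per-line layout with a top-level "Item" key is the shape I understand the import expects.

```python
import json

# Hypothetical mapping from long CSV column names to short attribute names.
SHORT_NAMES = {"customer_id": "c", "status": "s", "amount": "a"}

# Hypothetical dictionary encoding for an enum-like column.
STATUS_CODES = {"active": 1, "suspended": 2, "closed": 3}


def to_dynamodb_json_line(row):
    """Convert one CSV row (a dict of strings) to a compact DynamoDB JSON line."""
    item = {
        SHORT_NAMES["customer_id"]: {"S": row["customer_id"]},
        # Numbers are transported as strings inside the "N" type wrapper.
        SHORT_NAMES["status"]: {"N": str(STATUS_CODES[row["status"]])},
        SHORT_NAMES["amount"]: {"N": row["amount"]},
    }
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    return json.dumps({"Item": item}, separators=(",", ":"))


line = to_dynamodb_json_line(
    {"customer_id": "42", "status": "active", "amount": "9.5"}
)
print(line)  # {"Item":{"c":{"S":"42"},"s":{"N":"1"},"a":{"N":"9.5"}}}
```

The application then needs the reverse mappings when reading items back, which is the price of the smaller import.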
I have many csv files and I am trying to pass all the data that they contain into a database. For this reason, I found that I could use the glob library to iterate over all csv files in my folder. Following is the code I used:
import requests as req
import pandas as pd
import glob
import json

endpoint = "testEndpoint"
path = "test/*.csv"
# headers is not defined in the original snippet; a JSON content type is assumed here
headers = {"Content-Type": "application/json"}

for fname in glob.glob(path):
    print(fname)
    df = pd.read_csv(fname)
    for index, row in df.iterrows():
        #print(row['ID'], row['timestamp'], row['date'], row['time'],
        #      row['vltA'], row['curA'], row['pwrA'], row['rpwrA'], row['frq'])
        print(row['timestamp'])
        testjson = {"data":
                    {"installationid": row['ID'],
                     "active": row['pwrA'],
                     "reactive": row['rpwrA'],
                     "current": row['curA'],
                     "voltage": row['vltA'],
                     "frq": row['frq'],
                     }, "timestamp": row['timestamp']}
        payload = {"payload": [testjson]}
        json_data = json.dumps(payload)
        response = req.post(
            endpoint, data=json_data, headers=headers)
This code seems to work fine in the beginning. However, after some time it starts to become really slow (I noticed this because I print the timestamp as I upload the data) and eventually stops completely. What is the reason for this? Is something I am doing here really inefficient?
I can see 3 possible problems here:
1. Memory. read_csv is fast, but it loads the content of a full file into memory. If the files are really large, you could exhaust physical memory and start using swap, which performs terribly.
2. iterrows. You build a dataframe, a data structure optimized for column-wise access, and then access it by rows. This is already a bad idea, and iterrows is known to perform terribly because it builds a Series for each row.
3. One POST request per row. An HTTP request has its own overhead, and furthermore this means that you add rows to the database one at a time. If this is the only interface to your database, you may have no other choice, but you should check whether it is possible to prepare a batch of rows and load it as a whole. That often provides a gain of more than an order of magnitude.
Without more info I can hardly say more, but IMHO the biggest gain is to be found in feeding the database, i.e. point 3. If nothing can be done on that point, or if further performance gain is required, I would try to replace pandas with the csv module, which is row-oriented and has a limited memory footprint because it only processes one line at a time, whatever the file size.
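A minimal sketch of the csv-module approach combined with batching, using the column names from the question. It assumes the endpoint accepts more than one entry in the "payload" list, which the question does not confirm; the batch size is arbitrary.

```python
import csv
import io


def iter_payloads(csv_file, batch_size=500):
    """Yield request payloads holding up to batch_size rows each.

    Rows are read one at a time, so memory use does not grow with file size.
    Note that csv yields every value as a string, unlike pandas.
    """
    batch = []
    for row in csv.DictReader(csv_file):
        batch.append({
            "data": {
                "installationid": row["ID"],
                "active": row["pwrA"],
                "reactive": row["rpwrA"],
                "current": row["curA"],
                "voltage": row["vltA"],
                "frq": row["frq"],
            },
            "timestamp": row["timestamp"],
        })
        if len(batch) >= batch_size:
            yield {"payload": batch}
            batch = []
    if batch:  # send the final, partially filled batch
        yield {"payload": batch}


# Small in-memory example instead of a real file.
sample = io.StringIO(
    "ID,timestamp,vltA,curA,pwrA,rpwrA,frq\n"
    "1,100,230,1.0,200,10,50\n"
    "1,101,231,1.1,201,11,50\n"
    "1,102,229,0.9,199,9,50\n"
)
sizes = [len(p["payload"]) for p in iter_payloads(sample, batch_size=2)]
print(sizes)  # [2, 1]
```

Each yielded payload would then go out in a single req.post call, ideally through one requests.Session so the connection is reused.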
Finally, if it makes sense for your use case, I would try to use one thread that reads the csv file and feeds a queue, and a pool of threads that send requests to the database. That should hide the HTTP overhead. But beware: depending on the endpoint implementation, it may not improve much if the database access really is the limiting factor.
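The reader/queue/worker-pool idea can be sketched with the standard library alone. Here the calling thread plays the reader, and send is a placeholder for whatever performs the HTTP request; the worker count and queue size are arbitrary assumptions.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def run_pipeline(rows, send, n_workers=4, max_pending=100):
    """Feed rows into a bounded queue consumed by n_workers sender threads.

    send is any callable taking one row. Failures are collected and returned
    instead of killing the worker, so the queue is always drained.
    """
    pending = queue.Queue(maxsize=max_pending)  # bounded: reader cannot run far ahead
    errors = []

    def worker():
        while True:
            item = pending.get()
            if item is _STOP:
                return
            try:
                send(item)
            except Exception as exc:
                errors.append((item, exc))

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for row in rows:          # the calling thread acts as the reader
        pending.put(row)      # blocks while the queue is full
    for _ in workers:
        pending.put(_STOP)    # one sentinel per worker
    for w in workers:
        w.join()
    return errors


# Stand-in for the HTTP call: just record what would have been sent.
sent = []
errors = run_pipeline(range(50), sent.append, n_workers=3, max_pending=5)
```

Results arrive in no particular order, so this only fits if the endpoint does not care about insertion order.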
I have a Python3 Flask app using Flask-Session (which adds server-side session support) and configured to use the filesystem type.
Under the hood, this type uses the Werkzeug class werkzeug.contrib.cache.FileSystemCache (Werkzeug cache documentation).
The raw cache files look like this if opened:
J¬».].Äï;}î(å
_permanentîàå
respondentîåuuidîåUUIDîìî)Åî}î(åintîät˙ò∑flŒºçLÃ/∆6jhåis_safeîhåSafeUUIDîìîNÖîRîubåSECTIONS_VISITEDî]îåcurrent_sectionîKåSURVEY_CONTENTî}î(å0î}î(ås_idîås0îånameîåWelcomeîådescriptionîåîå questionsî]î}î(ås_idîhåq_idîhåq_constructîhåq_textîhå
q_descriptionîhåq_typeîhårequiredîhåoptions_rowîhåoptions_row_alpha_sortîhåreplace_rowîhåoptions_colîhåoptions_col_codesîhåoptions_col_alpha_sortîhåcond_continue_rules_rowîhåq_meta_notesîhuauå1î}î(hås1îhå Screeningîhå[This section determines if you fit into the target group.îh]î(}î(hh/håq1îh hh!å9Have you worked on a product in this field before?
The items stored in the session can be seen a bit above:
- current_section should be an integer, e.g., 0
- SECTIONS_VISITED should be an array of integers, e.g., [0,1,2]
- SURVEY_CONTENT format should be an object with structure like below
{
'item1': {
'label': string,
'questions': [{}]
},
'item2': {
'label': string,
'questions': [{}]
}
}
In the excerpt above you can see, for example, that the text This section determines if you fit into the target group is the value of one label. The strings after questions are keys found in each questions object, e.g., q_text, together with their values, e.g., Have you worked on a product in this field before? is the value of q_text.
I need to retrieve data from the stored cache files in a way that I can read them without all the extra characters like å.
I tried using Werkzeug like this, where the item 9c3c48a94198f61aa02a744b16666317 is the name of the cache file I want to read. However, it was not found in the cache directory.
from werkzeug.contrib.cache import FileSystemCache
cache_dir="flask_session"
mode=0o600  # octal literal; the old 0600 form is a syntax error in Python 3
threshold=20000
cache = FileSystemCache(cache_dir, threshold=threshold, mode=mode)
item = "9c3c48a94198f61aa02a744b16666317"
print(cache.has(item))
data = cache.get(item)
print(data)
What ways are there to read the cache files?
I opened a GitHub issue in Flask-Session, but that's not really been actively maintained in years.
For context, I had an instance where for my web app writing to the database was briefly not working - but the data I need was also being saved in the session. So right now the only way to retrieve that data is to get it from these files.
EDIT:
Thanks to Tim's answer I solved it using the following:
import pickle

obj = []
with open(file_name, "rb") as fileOpener:
    while True:
        try:
            obj.append(pickle.load(fileOpener))
        except EOFError:
            break
print(obj)
I needed to load all pickled objects in the file, so I combined Tim's solution with the one here for loading multiple objects: https://stackoverflow.com/a/49261333/11805662
Without this, I was just seeing the first pickled item.
Also, in case anyone has the same problem, I needed to use the same python version as my Flask app (related post). If I didn't, then I would get the following error:
ValueError: unsupported pickle protocol: 4
You can decode the data with pickle. Pickle is part of the Python standard library.
import pickle

# Pickle data is binary, so the file must be opened in "rb" mode
with open("PATH/TO/SESSION/FILE", "rb") as f:
    data = pickle.load(f)
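A slightly fuller sketch, assuming the file was written by Werkzeug's FileSystemCache, which as far as I can tell pickles two objects back to back: the expiry timestamp first and the cached value second. That would explain why a single pickle.load only returns the first item. Only do this with files you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile


def read_cache_file(path):
    """Return (expires, value) from a cache file holding two pickled objects."""
    with open(path, "rb") as f:
        expires = pickle.load(f)  # assumed: expiry time written first
        value = pickle.load(f)    # assumed: the session data written second
    return expires, value


# Build a stand-in file with the same assumed layout to show the round trip.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_session")
with open(demo_path, "wb") as f:
    pickle.dump(1700000000, f)
    pickle.dump({"current_section": 0, "SECTIONS_VISITED": [0, 1, 2]}, f)

expires, session = read_cache_file(demo_path)
print(session["SECTIONS_VISITED"])  # [0, 1, 2]
```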
I am looking for some advice on this project. My thought was to use Python and a Lambda to aggregate the data and respond to the website. The main parameters are date ranges and can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets displaying counts and returns for each bucket (for our purposes here, let's say the buckets are Sectors and Market Cap ranges, which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemented this in an AWS Lambda. However, when testing requests that span 20 years of data (and yes, I do get them), I begin to hit the memory limits of AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert the files into a pandas dataframe. I was loading all of the files into one large dataframe. This is where it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like, but it does the job of getting what I need.
The resulting dataframe is then converted back into JSON and is less than 500 MB.
When it runs locally outside the Lambda, this entire process takes about 40 seconds.
I have tried running this with threads and processing single frames at once but the performance degrades to about 1 min 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside of node.js without the use of a lambda and took almost 3 minutes to generate.
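One direction that might keep memory flat, sketched under strong assumptions: if the aggregates are counts and simple mean returns, they can be accumulated one file at a time, so no combined dataframe is ever built. The sketch assumes each file is a JSON array of records with the field names used in the code below; it does not reproduce the custom groupby logic, and it does not cover compounded returns.

```python
import json
from collections import defaultdict


def aggregate_months(json_documents, bucket_field="category"):
    """Accumulate count and mean totalReturn per bucket, one file at a time.

    json_documents is any iterable of JSON strings, so files can be
    streamed from S3 and discarded after they are processed.
    """
    totals = defaultdict(lambda: [0, 0.0])  # bucket -> [count, sum of returns]
    for text in json_documents:
        for record in json.loads(text):
            slot = totals[record[bucket_field]]
            slot[0] += 1
            slot[1] += record["totalReturn"]
    return {
        bucket: {"count": count, "meanReturn": total / count}
        for bucket, (count, total) in totals.items()
    }


# Two tiny stand-in "monthly files" with made-up values.
month_1 = '[{"category": "Tech", "totalReturn": 0.5}, {"category": "Energy", "totalReturn": 2.0}]'
month_2 = '[{"category": "Tech", "totalReturn": 1.5}]'
result = aggregate_months([month_1, month_2])
print(result["Tech"])  # {'count': 2, 'meanReturn': 1.0}
```

Only the running totals stay in memory, so peak usage is roughly one file plus one small dict regardless of the date range.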
Code currently used
I had to clean this a little to pull out some items but here is the code used.
Read the data from S3 as JSON text. This results in a collection of strings.
while not q.empty():
    fkey = q.get()
    try:
        obj = self.s3.Object(bucket_name=bucket,key=fkey[1])
        json_data = obj.get()['Body'].read().decode('utf-8')
        results[fkey[1]] = json_data
    except Exception as e:
        results[fkey[1]] = str(e)
    q.task_done()
Loop through the JSON files to build a dataframe for working
for k,v in s3Data.items():
    lstdf.append(buildDataframefromJson(k,v))
import io
from datetime import datetime

import numpy as np
import pandas as pd

def buildDataframefromJson(key, json_data):
    columns = ['ticker','totalReturn','isExcluded','marketCapStartUsd','category',
               'marketCapBand','peGreaterThanMarket','epsUsd']
    # Read the JSON into a dataframe, keeping only the columns needed.
    # StringIO is used because newer pandas versions deprecate passing raw JSON strings.
    tmpdf = pd.read_json(io.StringIO(json_data),
                         dtype={
                             'ticker':str,
                             'totalReturn':np.float32,
                             'isExcluded':bool,  # np.bool was removed from NumPy
                             'marketCapStartUsd':np.float32,
                             'category':str,
                             'marketCapBand':str,
                             'peGreaterThanMarket':bool,
                             'epsUsd':np.float32
                         })[columns]
    # The fourth path segment of the S3 key holds the month as "MM-YYYY";
    # strptime already returns the first day of that month.
    dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
    tmpdf.insert(0,'Month',dtTmp, allow_duplicates=True)
    return tmpdf
I am very new to Python and am trying to move data from a MongoDB collection into a CSV document (needs to be done with Python, not mongoexport, if possible).
I am using the Pymongo and CSV packages to bring the data from the database, into the CSV.
This is the structure of the data I am querying from MongoDB:
Primary identifier - Computer Name (parent): R2D2
Details - Computer Details (parent): Operating System (Child), Owner (Child)
I need for Operating System and Owner to have their own columns in the CSV sheet, but they keep on falling under a single column called Computer Name.
Is there a way around this, so that the child objects can have their own columns instead of being grouped under their parent object?
I've done this kind of thing many times. You will have a loop iterating your mongo query with pymongo. At the bottom end of the loop you will use the csv module to write each line of the csv file. Your question relates to what goes on in the middle of the loop.
MongoDB syntax is closest to JavaScript, which also fits with the 'associative array' objects in Mongo being JSON. Pymongo brings these things closer to Python types.
So the parent objects will come straight from the pymongo iterable, and if they have children, those will be Python dictionaries:
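import csv
from pymongo import MongoClient

# ipaddress and port are placeholders for your MongoDB host and port
client = MongoClient(ipaddress, port)
db = client.MyDatabase
collection = db.mycollection

# In Python 3, open CSV files in text mode with newline='' (not 'wb')
with open('output_file.csv', 'w', newline='') as f:
    csvwriter = csv.writer(f)
    for doc in collection.find():
        # The parent object is a dict; its children are read by key
        computer_name = doc.get('Computer Name')
        operating_system = computer_name.get('Operating System')
        owner = computer_name.get('owner')
        csvwriter.writerow([operating_system, owner])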
Of course there's probably a bunch of other columns in there without children that also come direct from the doc.
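If the nesting varies between documents, a more generic option is to flatten each document first and let csv.DictWriter lay out the columns. This is only a sketch: the field names in the example come from the structure described in the question, the dotted column names are my own convention, and all rows are held in memory to collect the header.

```python
import csv
import io


def flatten(doc, parent_key="", sep="."):
    """Turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in doc.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def write_docs(docs, out_file):
    """Write flattened documents as CSV, one column per leaf field."""
    rows = [flatten(d) for d in docs]
    fieldnames = []
    for row in rows:  # keep first-seen column order
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # missing fields are written as empty cells


# Stand-in for documents coming from collection.find().
docs = [{"Computer Name": "R2D2",
         "Computer Details": {"Operating System": "Linux", "Owner": "Luke"}}]
out = io.StringIO()
write_docs(docs, out)
lines = out.getvalue().splitlines()
print(lines[1])  # R2D2,Linux,Luke
```

With real pymongo results you would probably want to drop or stringify the _id field before writing.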
What is the preferred way to store application-specific parameters (persistent data) for my Python program?
I'm creating a Python program, which needs to store some parameters: "project_name", "start_year", "max_value", ...
I don't know which is the best way to store these data (I must reuse them when making calculations and reports): using local TXT files, using a tiny-very-simple DB (does it exist in Python? shall I use SQLite?), ...
Thank you very much in advance.
SQLite. Very easy to setup, and you gain a number of builtin db functions. You also won't have to handle file read/writes and parsing.
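A minimal sketch of what that could look like with the standard sqlite3 module, storing the parameters as name/value rows. The table and function names are made up, and every value is stored as text here, so numbers have to be converted back by the caller.

```python
import sqlite3


def open_settings(path=":memory:"):
    """Open (or create) the settings database; ":memory:" is for demos only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS settings (name TEXT PRIMARY KEY, value TEXT)"
    )
    return conn


def set_param(conn, name, value):
    """Insert or overwrite one parameter."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO settings (name, value) VALUES (?, ?)",
            (name, str(value)),
        )


def get_param(conn, name, default=None):
    """Return the stored value as a string, or default if it is missing."""
    row = conn.execute(
        "SELECT value FROM settings WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else default


conn = open_settings()
set_param(conn, "project_name", "foo")
set_param(conn, "start_year", 2000)
print(int(get_param(conn, "start_year")))  # 2000
```

Pass a real file path instead of ":memory:" to make the values persist between runs.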
pickle it:
import pickle
options = {
'project_name': 'foo',
'start_year': 2000
}
with open('config.pickle', 'wb') as config:
pickle.dump(options, config)
The pickle module lets you dump most Python objects into a file and read them back again.
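For completeness, a small sketch of the full round trip, including reading the options back. The helper names are hypothetical. Only unpickle files your own program wrote, because loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile


def save_options(options, path):
    """Write the options dict to path as a pickle."""
    with open(path, "wb") as config:
        pickle.dump(options, config)


def load_options(path):
    """Read the options dict back from path."""
    with open(path, "rb") as config:
        return pickle.load(config)


# Round trip through a temporary directory.
config_path = os.path.join(tempfile.mkdtemp(), "config.pickle")
save_options({"project_name": "foo", "start_year": 2000}, config_path)
options = load_options(config_path)
print(options["start_year"])  # 2000
```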
You can use the shelve library. From the shelve documentation:
A "shelf" is a persistent, dictionary-like object. The difference with dbm databases is that the values (not the keys!) in a shelf can be essentially arbitrary Python objects -- anything that the "pickle" module can handle
import shelve

d = shelve.open(filename)  # open, with (g)dbm filename -- no suffix
d[key] = data              # store data at key (overwrites old data if
                           # using an existing key)
data = d[key]              # retrieve a COPY of the data at key (raise
                           # KeyError if no such key) -- NOTE that this
                           # access returns a *copy* of the entry!
del d[key]                 # delete data stored at key (raises KeyError
                           # if no such key)
flag = key in d            # true if the key exists (has_key() is gone in Python 3)
klist = list(d.keys())     # a list of all existing keys (slow!)
d.close()
If the schema is fixed, a SQL database such as sqlite3 is the best choice, plus memcached as a cache.
If the relationships change often, I think flexible data may be better stored in files (hash-indexed).