I have a server with 32 GB of memory, and my script is consuming all of it and more, because it keeps getting killed. I just want to ask what I can do in this scenario. I have a CSV file with 320,000 rows which I am iterating over in Pandas. It has three columns: date, location and value. My goal is to store all locations and values per date, so date is my key, and then upload the result to S3 as JSON files. My code for appending looks like this:
from collections import defaultdict

# df has three columns: date, location and value
appends = defaultdict(list)
for i, row in df.iterrows():
    appends[row["date"]].append(dict(
        location=row['location'],
        value=row['value'],
    ))
Then I write it to S3 like this:
for key in appends.keys():
    try:
        append = appends[key]
        obj = s3.Object(bucket_storage, f"{key}/data.json")
        obj.put(Body=json.dumps(append, cls=DjangoJSONEncoder))
    except IntegrityError:
        pass
But this is consuming all the memory. I've read somewhere that defaultdict can be memory-hungry, but I'm not sure what my other options are. I don't want to involve a database here either.
Basically, I need all the data to be grouped in the dict before saving it to S3, but the problem is that it seems it can't handle all the data, or I am doing something wrong here. Thanks.
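For completeness, a minimal, untested sketch of the same per-date grouping done with df.groupby instead of iterrows and one big defaultdict, uploading each date's JSON as soon as it is built (s3, bucket_storage and DjangoJSONEncoder are the same names as in the code above):
import json

# only one date's records are held in memory at a time
for date, group in df.groupby("date"):
    payload = group[["location", "value"]].to_dict(orient="records")
    obj = s3.Object(bucket_storage, f"{date}/data.json")
    obj.put(Body=json.dumps(payload, cls=DjangoJSONEncoder))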
I have a table in my database with a few columns, one of which is a foreign key to another table. When I insert a smaller number of records (fewer than a million), I use a Python script in which the foreign key object is looked up before each record is built.
The CSV file contains data like this:
contract_no, date, symbol, buyer_broker, seller_broker, quantity, rate, amt
2021020803000132,2021-02-08,ACL,13.0,38.0,20.0,1850.0,37000.0
2021020803000180,2021-02-08,BSL,36.0,35.0,100.0,1850.0,185000.0
2021020803000181,2021-02-08,CLB,36.0,42.0,10.0,1850.0,18500.0
2021020803000182,2021-02-08,CLB,36.0,29.0,54.0,1850.0,99900.0
2021020803000183,2021-02-08,BSL,36.0,34.0,22.0,1850.0,40700.0
2021020803000267,2021-02-08,ACL,54.0,1.0,20.0,1887.0,37740.0
2021020803000268,2021-02-08,ACL,54.0,57.0,100.0,1887.0,188700.0
2021020803000301,2021-02-08,LBL,28.0,38.0,10.0,1887.0,18870.0
2021020803000305,2021-02-08,LBL,28.0,1.0,10.0,1887.0,18870.0
2021020803000356,2021-02-08,CLB,28.0,59.0,10.0,1887.0,18870.0
2021020803000531,2021-02-08,LSL,16.0,1.0,10.0,1900.0,19000.0
2021020803000550,2021-02-08,BSL,54.0,58.0,10.0,1900.0,19000.0
2021020803000552,2021-02-08,ACL,38.0,58.0,10.0,1900.0,19000.0
....
I've imported the CSV file as:
import csv
from django.core.exceptions import ObjectDoesNotExist

file = open('floor_sheet_data.csv')
read_file = csv.reader(file)
data = []
for record in read_file:
    try:
        # skip rows whose contract_no already exists in the database
        obj = FloorSheet.objects.get(contract_no=record[0])
    except ObjectDoesNotExist:
        # resolve the foreign key before building the record
        org = Organization.objects.get(symbol=record[2])
        fl_data = FloorSheet(
            org=org,
            contract_no=record[0],
            date=record[1],
            buyer_broker=record[3],
            seller_broker=record[4],
            quantity=record[5],
            rate=record[6],
            amount=record[7]
        )
        data.append(fl_data)
FloorSheet.objects.bulk_create(data)
This works fine when there is a small number of records to insert into the database. The main problem is this: I have a CSV file with more than 100 million rows of data like the sample above. Following this same process, it would take more than 5 days to load all of it, even using the bulk_create method.
I found a solution that can copy CSV files directly into the PostgreSQL database, but it doesn't cover the case where the data contains a foreign key like mine does.
If you know how I can achieve my goal here, I'd be very thankful!
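For reference, a rough sketch of the same loop with the Organization lookups cached in a dict and bulk_create called in batches rather than once at the end; the model and field names come from the code above, while the batch size and the up-front existing-contract check are assumptions:
import csv

BATCH_SIZE = 10000  # assumed; tune to taste

# one query each, instead of two queries per CSV row
orgs = {o.symbol: o for o in Organization.objects.all()}
existing = set(FloorSheet.objects.values_list('contract_no', flat=True))

batch = []
with open('floor_sheet_data.csv') as f:
    for record in csv.reader(f):
        if record[0] in existing:
            continue
        batch.append(FloorSheet(
            org=orgs[record[2]],
            contract_no=record[0],
            date=record[1],
            buyer_broker=record[3],
            seller_broker=record[4],
            quantity=record[5],
            rate=record[6],
            amount=record[7],
        ))
        if len(batch) >= BATCH_SIZE:
            FloorSheet.objects.bulk_create(batch)
            batch = []
if batch:
    FloorSheet.objects.bulk_create(batch)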
I have a situation where I need to delete a large number of files (hundreds of millions) from S3, and it takes forever with the traditional approaches (even using the Python boto3 package with delete_objects to delete them in chunks of 1000, processed locally across 16 processes).
So, I developed an approach using PySpark, where I:
get the list of files I need to delete
parallelize it in a dataframe, partition it by prefix (considering that I have a limit of 3500 DELETE requests/sec per prefix)
get the underlying RDD and apply delete_objects using the .mapPartitions() method of the RDD
convert it to dataframe again (.toDF())
run .cache() and .count() to force the execution of the requests
This is the function I am passing to .mapPartitions():
import boto3
from pyspark.sql import Row

def delete_files(list_of_rows):
    # chunked_iterable is my helper that yields the rows in chunks of up to 1000
    for chunk in chunked_iterable(list_of_rows, 1000):
        session = boto3.session.Session(region_name='us-east-1')
        client = session.client('s3')
        files = list(chunk)
        bucket = files[0][0]
        delete = {'Objects': [{'Key': f[1]} for f in files]}
        response = client.delete_objects(
            Bucket=bucket,
            Delete=delete,
        )
        yield Row(
            deleted=len(response.get('Deleted'))
        )
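For completeness, roughly how the pieces are wired together on the driver side; the column names here are illustrative (delete_files expects the bucket in the first column and the key in the second), and rows stands for the list of objects to delete:
# rows: assumed list of (bucket, key, prefix) tuples gathered beforehand
df = spark.createDataFrame(rows, ["bucket", "key", "prefix"])
df = df.repartition("prefix")                        # same prefix always lands in one partition
deleted = df.rdd.mapPartitions(delete_files).toDF()  # run the deletes inside each partition
deleted.cache()
deleted.count()                                      # force execution of the requests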
It works nicely, except that depending on the number of files, I keep getting SlowDown (status code 503) exceptions, hitting the limits of 3500 DELETE requests/sec per prefix.
It does not make sense to me, considering that I am partitioning my rows by prefix [.repartition("prefix")] (meaning I should not have the same prefix in more than one partition) and mapping the delete_files function over each whole partition at once.
In my head it should not be possible that I am calling delete_objects for the same prefix at the same time, so I cannot find a reason to keep hitting those limits.
Is there something else I should consider?
Thanks in advance!
I'm trying to save and load a list of tuples of 2 ndarrays and an int to and from a .csv file.
In my current implementation, when I save and load a list l, there is some error in the recovered list on the order of 10^-10. Is there a way to save and recover values more precisely? I would also appreciate comments on my code in general. Thanks!
This is what I have now:
import numpy as np
import pandas as pd

def save_l(l, path):
    tup = ()
    for X in l:
        u = X[0].reshape(784*9)           # flatten the 784x9 array
        v = X[2]*np.ones(1)               # the int as a 1-element array
        w = np.concatenate((u, X[1], v))  # one row per tuple
        tup += (w,)
    L = np.row_stack(tup)
    df = pd.DataFrame(L)
    df.to_csv(path)

def load_l(path):
    df = pd.read_csv(path)
    L = df.values
    l = []
    for v in L:
        tup = ()
        for i in range(784):
            tup += (v[9*i+1:9*(i+1)+1],)  # +1 skips the index column
        T = np.row_stack(tup)
        Q = v[9*784+1:10*784+1]
        i = v[7841]
        l.append((T, Q, i))
    return l
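One variation I have not verified, in case the error comes from the float-to-text round trip rather than from the reshaping itself (float_format and float_precision are standard pandas options for to_csv and read_csv):
df.to_csv(path, float_format='%.17g')                  # 17 significant digits round-trips float64
df = pd.read_csv(path, float_precision='round_trip')   # use the exact (slower) float parser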
It may be that the issue you are experiencing is due to the absence of .csv file protection during save and load.
A good way to make sure that your file is locked until all data are saved/loaded completely is to use a context manager. This way, you won't lose any data if your system stops execution for whatever reason, because all results are saved the moment they are available.
I recommend using the with statement, whose primary purpose is exception-safe cleanup of the object used inside it (in this case your .csv file). In other words, with makes sure that files are closed, locks released, contexts restored, etc.
with open("myfile.csv", "a") as reference: # Drop to csv w/ context manager
df.to_csv(reference, sep = ",", index = False) # Same goes for read_csv
# As soon as you are here, reference is closed
If you try this and still see your error, it's not due to save/load issues.
I am looking for some advice on this project. My thought was to use Python and a Lambda to aggregate the data and respond to the website. The main parameters are date ranges and can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets displaying counts and returns for each bucket (for our purposes here let's say the buckets are Sectors and Market Cap ranges, which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemented this in an AWS Lambda; however, when testing requests that span 20 years of data (and yes, I do get them), I begin to hit the memory limits of AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert the files into a pandas dataframe. I was loading all of the files into one large dataframe. This is where it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like but does the job of getting what I need.
The end result is a dataframe that is then converted back into JSON and is less than 500 MB.
This entire process, when it works locally outside the Lambda, takes about 40 seconds.
I have tried running this with threads, processing single frames one at a time, but the performance degrades to about 1 min 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside Node.js without a Lambda and took almost 3 minutes to generate the result.
Code currently used
I had to clean this up a little to pull out some items, but here is the code used.
Read the data from S3 into JSON; this will result in a list of string data.
while not q.empty():
    fkey = q.get()
    try:
        obj = self.s3.Object(bucket_name=bucket, key=fkey[1])
        json_data = obj.get()['Body'].read().decode('utf-8')
        results[fkey[1]] = json_data
    except Exception as e:
        results[fkey[1]] = str(e)
    q.task_done()
Loop through the JSON files to build a dataframe to work with:
for k,v in s3Data.items():
    lstdf.append(buildDataframefromJson(k,v))

def buildDataframefromJson(key, json_data):
    tmpdf = pd.DataFrame(columns=['ticker','totalReturn','isExcluded','marketCapStartUsd',
                                  'category','marketCapBand','peGreaterThanMarket', 'Month','epsUsd']
                         )
    #Read the json into a dataframe
    tmpdf = pd.read_json(json_data,
                         dtype={
                             'ticker':str,
                             'totalReturn':np.float32,
                             'isExcluded':np.bool,
                             'marketCapStartUsd':np.float32,
                             'category':str,
                             'marketCapBand':str,
                             'peGreaterThanMarket':np.bool,
                             'epsUsd':np.float32
                         })[['ticker','totalReturn','isExcluded','marketCapStartUsd','category',
                             'marketCapBand','peGreaterThanMarket','epsUsd']]
    dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
    dtTmp = datetime.strptime(str(dtTmp.year) + '-'+ str(dtTmp.month),'%Y-%m')
    tmpdf.insert(0,'Month',dtTmp, allow_duplicates=True)
    return tmpdf
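The custom aggregation itself is among the parts I pulled out; as a rough, simplified stand-in, it is a groupby over the bucket columns, something like this (the exact buckets and statistics here are placeholders, not the production aggregation):
# placeholder aggregation: count securities and average returns per Month,
# sector (category) and market-cap band, then serialize back to JSON
aggdf = (pd.concat(lstdf)
           .groupby(['Month', 'category', 'marketCapBand'])
           .agg(count=('ticker', 'size'),
                avgReturn=('totalReturn', 'mean'))
           .reset_index())
result_json = aggdf.to_json(orient='records', date_format='iso')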
I have several million records I want to store, retrieve, and delete pretty frequently. Each of these records has a "key", but the "value" is not easily translatable to a dictionary, as it is an arbitrary Python object returned from a module method that I didn't write (I understand that a lot of hierarchical data structures work better as dictionaries, like JSON, and I'm not sure JSON is the preferred store here in any case).
I am thinking of pickling each entry in a separate file. Is there a better way?
Use the shelve module.
You can use it like a dictionary, much as with json, but it stores objects using pickle.
From the official Python docs:
import shelve

d = shelve.open(filename)  # open -- file may get suffix added by low-level
                           # library

d[key] = data              # store data at key (overwrites old data if
                           # using an existing key)
data = d[key]              # retrieve a COPY of data at key (raise KeyError
                           # if no such key)
del d[key]                 # delete data stored at key (raises KeyError
                           # if no such key)

flag = key in d            # true if the key exists
klist = list(d.keys())     # a list of all existing keys (slow!)

# as d was opened WITHOUT writeback=True, beware:
d['xx'] = [0, 1, 2]        # this works as expected, but...
d['xx'].append(3)          # *this doesn't!* -- d['xx'] is STILL [0, 1, 2]!

# having opened d without writeback=True, you need to code carefully:
temp = d['xx']             # extracts the copy
temp.append(3)             # mutates the copy
d['xx'] = temp             # stores the copy right back, to persist it

# or, d=shelve.open(filename, writeback=True) would let you just code
# d['xx'].append(3) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.

d.close()                  # close it
I would evaluate the use of a key/value database like BerkeleyDB, Kyoto Cabinet or others. This will give you all the fancy features plus better handling of disk space. In a filesystem with a block size of 4096 B, one million files occupy ~4 GB regardless of the size of your objects (as a lower bound; if the objects are larger than 4096 B the size increases).
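As a stand-in for that idea using only the standard library, a minimal sketch with sqlite3 as a single-file key/value table (obj is a placeholder for one of your picklable objects); a dedicated key/value store would follow the same store/retrieve/delete pattern:
import pickle
import sqlite3

# one database file on disk instead of millions of small files
con = sqlite3.connect('records.db')
con.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)')

con.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)',
            ('some-key', pickle.dumps(obj)))                    # store
row = con.execute('SELECT value FROM kv WHERE key = ?',
                  ('some-key',)).fetchone()
obj = pickle.loads(row[0])                                      # retrieve
con.execute('DELETE FROM kv WHERE key = ?', ('some-key',))      # delete
con.commit()
con.close()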