I have (900k, 300) records in a MongoDB collection.
When I try to read the data into pandas, memory consumption increases dramatically until the process is killed.
I have to mention that the data fits in memory (~1.5 GB) when I read it from a CSV file.
My machine has 32 GB of RAM and 16 CPUs, running CentOS 7.
My simple code:
from pymongo import MongoClient
import pandas as pd

client = MongoClient(host, port)
collection = client[db_name][collection_name]
cursor = collection.find()
df = pd.DataFrame(list(cursor))
My multiprocessing code:
import concurrent.futures
import multiprocessing

def read_mongo_parallel(skipses):
    print("Starting process")
    client = MongoClient(skipses[4], skipses[5])
    db = client[skipses[2]]
    collection = db[skipses[3]]
    print("range of {} to {}".format(skipses[0], skipses[0] + skipses[1]))
    cursor = collection.find().skip(skipses[0]).limit(skipses[1])
    return list(cursor)

# skipesess is a list of (skip, limit, db_name, collection_name, host, port) tuples built elsewhere
all_lists = []
with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
    for rows in executor.map(read_mongo_parallel, skipesess):
        all_lists.extend(rows)

df = pd.DataFrame(all_lists)
The memory increases with both methods until the kernel is killed.
What am I doing wrong?
The problem is in the list usage when you build the DataFrame.
The cursor is consumed all at once, making a list with 900k dictionaries inside it, which takes a lot of memory.
You can avoid that if you create an empty DataFrame and then pull the documents in batches, a few documents at a time, appending them to the DataFrame.
def batched(cursor, batch_size):
    batch = []
    for doc in cursor:
        batch.append(doc)
        if batch and not len(batch) % batch_size:
            yield batch
            batch = []

    if batch:   # last documents
        yield batch

df = pd.DataFrame()
for batch in batched(cursor, 10000):
    df = df.append(batch, ignore_index=True)
10000 seems like a reasonable batch size, but you may want to change it according to your memory constraints: the higher it is, the faster this will end, but also the more memory it will use while running.
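If you are on a newer pandas (2.0+) where DataFrame.append has been removed, the same batching idea works with pd.concat; a minimal sketch, reusing the batched generator above:

frames = []
for batch in batched(cursor, 10000):
    frames.append(pd.DataFrame(batch))
df = pd.concat(frames, ignore_index=True)

Collecting the small frames and concatenating once at the end also avoids the repeated copying that growing a DataFrame with append causes.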
UPDATE: Added some benchmarks
Note that this approach does not necessarily make the query take longer,
but rather the opposite, as what actually takes time is the process
of pulling the documents out of MongoDB as dictionaries and allocating
them into a list.
Here are some benchmarks with 300K documents that show how this
approach, with the right batch_size, is actually even faster than pulling
the whole cursor into a list:
The whole cursor into a list
%%time
df = pd.DataFrame(list(db.test.find().limit(300000)))
CPU times: user 35.3 s, sys: 2.14 s, total: 37.5 s
Wall time: 37.7 s
batch_size=10000 <- FASTEST
%%time
df = pd.DataFrame()
for batch in batched(db.test.find().limit(300000), 10000):
    df = df.append(batch, ignore_index=True)
CPU times: user 29.5 s, sys: 1.23 s, total: 30.7 s
Wall time: 30.8 s
batch_size=1000
%%time
df = pd.DataFrame()
for batch in batched(db.test.find().limit(300000), 1000):
    df = df.append(batch, ignore_index=True)
CPU times: user 44.8 s, sys: 2.09 s, total: 46.9 s
Wall time: 46.9 s
batch_size=100000
%%time
df = pd.DataFrame()
for batch in batched(db.test.find().limit(300000), 100000):
    df = df.append(batch, ignore_index=True)
CPU times: user 34.6 s, sys: 1.15 s, total: 35.8 s
Wall time: 36 s
This test harness creates 900k (albeit small) records and runs fine on my stock laptop. Give it a try.
import pymongo
import pandas as pd
db = pymongo.MongoClient()['mydatabase']
db.mycollection.drop()
operations = []
for i in range(900000):
    operations.append(pymongo.InsertOne({'a': i}))
db.mycollection.bulk_write(operations, ordered=False)
cursor = db.mycollection.find({})
df = pd.DataFrame(list(cursor))
print(df.count())
Load the data in chunks.
Using iterator2dataframes from https://stackoverflow.com/a/39446008/12015722
def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames)
client = MongoClient(host,port)
collection = client[db_name][collection_name]
cursor = collection.find()
df = iterator2dataframes(cursor, 1000)
Just wanted to make y'all aware of pymongoarrow, which is officially developed by MongoDB and solves this problem. It can output query results to Arrow tables or pandas DataFrames and is, according to the docs, the preferred way of loading data from Mongo into pandas. It worked like a charm for me!
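For anyone who wants to try it, here is a minimal sketch based on the pymongoarrow docs (assuming pip install pymongoarrow and the same host/db/collection variables as in the question above):

from pymongo import MongoClient
from pymongoarrow.api import find_pandas_all

collection = MongoClient(host, port)[db_name][collection_name]

# Runs the query and materialises the result directly as a DataFrame,
# skipping the intermediate list of Python dicts.
df = find_pandas_all(collection, {})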
You can try to get data from MongoDB in chunks using slice indexing, i.e. fetch 100,000 documents at a time, append them to the DataFrame, then fetch the next 100,000 and append those as well.
client = MongoClient(host, port)
collection = client[db_name][collection_name]
maxrows = 905679

df = pd.DataFrame()
for i in range(0, maxrows, 100000):
    cursor = collection.find()[i:min(i + 100000, maxrows)]
    df2 = pd.DataFrame(list(cursor))
    df = df.append(df2, ignore_index=True)
Refer to the link below to learn more about slice indexing in MongoDB cursors.
https://api.mongodb.com/python/current/api/pymongo/cursor.html
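As a side note, in the PyMongo versions these answers target, a slice on a cursor is just shorthand for skip/limit applied server-side, so the two forms below should be equivalent (values are illustrative):

cursor = collection.find()[100000:200000]
# is the same as
cursor = collection.find().skip(100000).limit(100000)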
I have found a solution with multiprocessing and it is the fastest.
import multiprocessing as mp
from time import time

import pandas as pd
from pymongo import MongoClient

# HOST, PORT, DB_NAME and COLLECTION_NAME are defined elsewhere

def chunks(collection_size, n_cores=mp.cpu_count()):
    """ Return chunks of tuples """
    batch_size = round(collection_size / n_cores)
    rest = collection_size % batch_size
    cumulative = 0
    for i in range(n_cores):
        cumulative += batch_size
        if i == n_cores - 1:
            yield (batch_size * i, cumulative + rest)
        else:
            yield (batch_size * i, cumulative)

def parallel_read(skipses, host=HOST, port=PORT):
    print('Starting process on range of {} to {}'.format(skipses[0], skipses[1]))
    client = MongoClient(host, port)
    db = client[DB_NAME]
    collection = db[COLLECTION_NAME]
    cursor = collection.find({}, {'_id': False})
    _df = pd.DataFrame(list(cursor[skipses[0]:skipses[1]]))
    return _df

def read_mongo(colc_size, _workers=mp.cpu_count()):
    temp_df = pd.DataFrame()
    pool = mp.Pool(processes=_workers)
    results = [pool.apply_async(parallel_read, args=(chunk,)) for chunk in chunks(colc_size, n_cores=_workers)]
    output = [p.get() for p in results]
    temp_df = pd.concat(output)
    return temp_df
time_0 = time()
df = read_mongo(get_collection_size())
print("Reading database with {} processes took {}".format(mp.cpu_count(),time()-time_0))
Starting process on range of 0 to 53866
Starting process on range of 323196 to 377062
Starting process on range of 430928 to 484794
Starting process on range of 538660 to 592526
Starting process on range of 377062 to 430928
Starting process on range of 700258 to 754124
Starting process on range of 53866 to 107732
Starting process on range of 484794 to 538660
Starting process on range of 592526 to 646392
Starting process on range of 646392 to 700258
Starting process on range of 215464 to 269330
Starting process on range of 754124 to 807990
Starting process on range of 807990 to 915714
Starting process on range of 107732 to 161598
Starting process on range of 161598 to 215464
Starting process on range of 269330 to 323196
Reading database with 16 processes took 142.64860558509827
For comparison, with one of the examples above (no multiprocessing):
def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames)
time_0 = time()
cursor = collection.find()
chunk_size = 1000
df = iterator2dataframes(cursor, chunk_size)
print("Reading database with chunksize = {} took {}".format(chunk_size,time()-time_0))
Reading database with chunksize = 1000 took 372.1170778274536
time_0 = time()
cursor = collection.find()
chunk_size = 10000
df = iterator2dataframes(cursor, chunk_size)
print("Reading database with chunksize = {} took {}".format(chunk_size,time()-time_0))
Reading database with chunksize = 10000 took 367.02637577056885
Related
I'm relatively new to Python and very new to multithreading and multiprocessing. I've been trying to send thousands of values (approx. 70,000) to a web-based API in chunks and want it to return the data associated with all those values. The API can take 50 values per batch, so for now, as a test, I have 100 values I'd like to send in 2 chunks of 50 values each. Without multithreading it would have taken me hours to finish the job, so I've tried to use multithreading to improve performance.
The issue: the code gets stuck after performing only one task (the first row, which is just the header, not even the main values) at the pool.map() part, and I had to restart the notebook kernel. I've heard not to use multiprocessing in a notebook, so I coded the whole thing in Spyder and ran it, but it's still the same. The code is below:
import json
from multiprocessing.pool import ThreadPool

import pandas as pd
import requests

# create df data frame with
# some code to get df of 100 values in
# 2 chunks, each chunk contains 50 values.

output:

df =                                              VAL
0     1166835704;1352357565;544477351;159345951;22...
1     354236462063;54666246046;13452466248...

def get_val(df):
    data = []
    v_list = df
    s = requests.Session()
    url = 'https://website/'
    post_fields = {'format': 'json', 'data': v_list}
    r = s.post(url, data=post_fields)
    d = json.loads(r.text)
    sort = pd.json_normalize(d, ['Results'])
    return sort

if __name__ == "__main__":
    pool = ThreadPool(4)               # Make the Pool of workers
    results = pool.map(get_val, df)    # Open the df in their own threads
    pool.close()                       # Close the pool and wait for the work to finish
    pool.join()
Any suggestions would be helpful. Thanks!
Can you check once with the following?
with ThreadPool(4) as pool:
    results = pool.map(get_val, df)   # df should be iterable
    print(results)
Also, please check whether a chunksize argument can be passed to pool.map, as that can affect performance.
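A minimal sketch of the fully chunked version (a sketch only: it reuses get_val from the question, assumes a flat list all_values of the ~70,000 raw values, and joins 50 of them per request with semicolons as in the sample df):

from multiprocessing.pool import ThreadPool

import pandas as pd

# all_values: flat list of raw values (assumed to exist); 50 values per API call
batches = [';'.join(str(v) for v in all_values[i:i + 50])
           for i in range(0, len(all_values), 50)]

if __name__ == "__main__":
    with ThreadPool(4) as pool:
        results = pool.map(get_val, batches)   # chunksize= can also be passed here
    combined = pd.concat(results, ignore_index=True)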
I'm using Pandas' to_sql function to write to MySQL, which is timing out due to large frame size (1M rows, 20 columns).
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
Is there a more official way to chunk through the data and write rows in blocks? I've written my own code, which seems to work. I'd prefer an official solution though. Thanks!
def write_to_db(engine, frame, table_name, chunk_size):
    start_index = 0
    end_index = chunk_size if chunk_size < len(frame) else len(frame)
    frame = frame.where(pd.notnull(frame), None)
    if_exists_param = 'replace'

    while start_index != end_index:
        print("Writing rows %s through %s" % (start_index, end_index))
        frame.iloc[start_index:end_index, :].to_sql(con=engine, name=table_name, if_exists=if_exists_param)
        if_exists_param = 'append'

        start_index = min(start_index + chunk_size, len(frame))
        end_index = min(end_index + chunk_size, len(frame))

engine = sqlalchemy.create_engine('mysql://...')  # database details omitted
write_to_db(engine, frame, 'retail_pendingcustomers', 20000)
Update: this functionality has been merged in pandas master and will be released in 0.15 (probably end of September), thanks to @artemyk! See https://github.com/pydata/pandas/pull/8062
So starting from 0.15, you can specify the chunksize argument and e.g. simply do:
df.to_sql('table', engine, chunksize=20000)
There is a beautiful, idiomatic chunks function provided in an answer to this question.
In your case you can use it like this:
def chunks(l, n):
    """ Yield successive n-sized chunks from l. """
    for i in range(0, len(l), n):
        yield l.iloc[i:i + n]

def write_to_db(engine, frame, table_name, chunk_size):
    for idx, chunk in enumerate(chunks(frame, chunk_size)):
        if idx == 0:
            if_exists_param = 'replace'
        else:
            if_exists_param = 'append'
        chunk.to_sql(con=engine, name=table_name, if_exists=if_exists_param)

The only drawback is that it doesn't support slicing the second index in the iloc call.
Reading from one table and writing to another in chunks...
[myconn1 ---> source table], [myconn2 ---> target table], [ch = 10000]
for i, chunk in enumerate(pd.read_sql_table(table_name=source, con=myconn1, chunksize=ch)):
    # replace the target table on the first chunk only, then append;
    # otherwise every chunk would overwrite the previous one
    chunk.to_sql(name=target, con=myconn2,
                 if_exists="replace" if i == 0 else "append",
                 index=False, chunksize=ch)
    LOGGER.info("Done 1 chunk")
Could someone point out what I did wrong with the following Dask implementation, since it doesn't seem to use multiple cores?
[Updated with reproducible code]
The code that uses Dask:
import time

import dask
import numpy as np
import pandas as pd

bookingID = np.arange(1, 10000)
book_data = pd.DataFrame(np.random.rand(1000))

def calculate_feature_stats(bookingID):
    curr_book_data = book_data
    row = list()
    row.append(bookingID)
    row.append(curr_book_data.min())
    row.append(curr_book_data.max())
    row.append(curr_book_data.std())
    row.append(curr_book_data.mean())
    return row

calculate_feature_stats = dask.delayed(calculate_feature_stats)

rows = []
for bookid in bookingID.tolist():
    row = calculate_feature_stats(bookid)
    rows.append(row)

start = time.time()
rows = dask.persist(*rows)
end = time.time()
print(end - start)  # Execution time = 16s on my machine
Code with normal implementation without dask :
bookingID = np.arange(1,10000)
book_data = pd.DataFrame(np.random.rand(1000))
def calculate_feature_stats_normal(bookingID):
curr_book_data = book_data
row = list()
row.append(bookingID)
row.append(curr_book_data.min())
row.append(curr_book_data.max())
row.append(curr_book_data.std())
row.append(curr_book_data.mean())
return row
rows = []
start = time.time()
for bookid in bookingID.tolist():
row = calculate_feature_stats_normal(bookid)
rows.append(row)
end = time.time()
print(end - start) # Execution time = 4s in my machine
So it's actually faster without Dask. How is that possible?
Answer
Extended comment. You should consider that with Dask there is about 1 ms of overhead per task (see the docs), so if your computation is shorter than that, Dask isn't worth the trouble.
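A quick way to see that overhead (a rough sketch; the absolute numbers depend on your machine and scheduler):

import time

import dask

@dask.delayed
def trivial(x):
    return x + 1                            # far less than 1 ms of real work

start = time.time()
dask.compute(*[trivial(i) for i in range(10_000)])
print("dask:", time.time() - start)         # dominated by per-task scheduling overhead

start = time.time()
_ = [x + 1 for x in range(10_000)]
print("plain loop:", time.time() - start)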
Coming to your specific question, I can think of two possible real-world scenarios:
1. A big dataframe with a column called bookingID and another value column
2. A different file for every bookingID
In the second case you can work from this answer, while for the first case you can proceed as follows:
import dask.dataframe as dd
import numpy as np
import pandas as pd

# create dummy df
df = []
for i in range(10_000):
    df.append(pd.DataFrame({"id": i,
                            "value": np.random.rand(1000)}))
df = pd.concat(df, ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
df.to_parquet("df.parq")
Pandas
%%time
df = pd.read_parquet("df.parq")
out = df.groupby("id").agg({"value":{"min", "max", "std", "mean"}})
out.columns = [col[1] for col in out.columns]
out = out.reset_index(drop=True)
CPU times: user 1.65 s, sys: 316 ms, total: 1.96 s
Wall time: 1.08 s
Dask
%%time
df = dd.read_parquet("df.parq")
out = df.groupby("id").agg({"value":["min", "max", "std", "mean"]}).compute()
out.columns = [col[1] for col in out.columns]
out = out.reset_index(drop=True)
CPU times: user 4.94 s, sys: 427 ms, total: 5.36 s
Wall time: 3.94 s
Final thoughts
In this situation dask starts to make sense if the df doesn't fit in memory.
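In that out-of-core case the sketch looks like the Dask version above, except the result is written back to disk instead of being computed into memory (paths are the same illustrative ones used above):

import dask.dataframe as dd

df = dd.read_parquet("df.parq")
out = df.groupby("id")["value"].agg(["min", "max", "std", "mean"])
# to_parquet executes the graph partition by partition,
# so the full input never has to sit in RAM at once
out.to_parquet("stats.parq")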
I have a folder (7.7 GB) with multiple pandas dataframes stored in parquet file format. I need to load all these dataframes into a Python dictionary, but since I only have 32 GB of RAM, I use the .loc method to load only the data that I need.
When all the dataframes are loaded in memory in the Python dictionary, I create a common index from the indexes of all the data, then I reindex all the dataframes with the new index.
I developed two scripts to do this: the first one in a classic sequential way, the second one using Dask, in order to get some performance improvement from all the cores of my Threadripper 1920X.
Sequential code:
# Standard library imports
import os
import pathlib
import time
# Third party imports
import pandas as pd
# Local application imports
class DataProvider:
def __init__(self):
self.data = dict()
def load_parquet(self, source_dir: str, timeframe_start: str, timeframe_end: str) -> None:
t = time.perf_counter()
symbol_list = list(file for file in os.listdir(source_dir) if file.endswith('.parquet'))
# updating containers
for symbol in symbol_list:
path = pathlib.Path.joinpath(pathlib.Path(source_dir), symbol)
name = symbol.replace('.parquet', '')
self.data[name] = pd.read_parquet(path).loc[timeframe_start:timeframe_end]
print(f'Loaded data in {round(time.perf_counter() - t, 3)} seconds.')
t = time.perf_counter()
# building index
index = None
for symbol in self.data:
if index is not None:
index.union(self.data[symbol].index)
else:
index = self.data[symbol].index
print(f'Built index in {round(time.perf_counter() - t, 3)} seconds.')
t = time.perf_counter()
# reindexing data
for symbol in self.data:
self.data[symbol] = self.data[symbol].reindex(index=index, method='pad').itertuples()
print(f'Indexed data in {round(time.perf_counter() - t, 3)} seconds.')
if __name__ == '__main__' or __name__ == 'builtins':
source = r'WindowsPath'
x = DataProvider()
x.load_parquet(source_dir=source, timeframe_start='2015', timeframe_end='2015')
Dask code:
# Standard library imports
import os
import pathlib
import time
# Third party imports
from dask.distributed import Client
import pandas as pd
# Local application imports
def __load_parquet__(directory, timeframe_start, timeframe_end):
return pd.read_parquet(directory).loc[timeframe_start:timeframe_end]
def __reindex__(new_index, df):
return df.reindex(index=new_index, method='pad').itertuples()
if __name__ == '__main__' or __name__ == 'builtins':
client = Client()
source = r'WindowsPath'
start = '2015'
end = '2015'
t = time.perf_counter()
file_list = [file for file in os.listdir(source) if file.endswith('.parquet')]
# build data
data = dict()
for file in file_list:
path = pathlib.Path.joinpath(pathlib.Path(source), file)
symbol = file.replace('.parquet', '')
data[symbol] = client.submit(__load_parquet__, path, start, end)
print(f'Loaded data in {round(time.perf_counter() - t, 3)} seconds.')
t = time.perf_counter()
# build index
index = None
for symbol in data:
if index is not None:
index.union(data[symbol].result().index)
else:
index = data[symbol].result().index
print(f'Built index in {round(time.perf_counter() - t, 3)} seconds.')
t = time.perf_counter()
# reindex
for symbol in data:
data[symbol] = client.submit(__reindex__, index, data[symbol].result())
print(f'Indexed data in {round(time.perf_counter() - t, 3)} seconds.')
I found the results pretty weird.
Sequential code:
max memory consumption during computations: 30.2GB
memory consumption at the end of computations: 15.6GB
total memory consumption (without Windows and others): 11.6GB
Loaded data in 54.289 seconds.
Built index in 0.428 seconds.
Reindexed data in 9.666 seconds.
Dask code:
max memory consumption during computations: 25.2GB
memory consumption at the end of computations: 22.6GB
total memory consumption (without Windows and others): 18.9GB
Loaded data in 0.638 seconds.
Built index in 27.541 seconds.
Reindexed data in 30.179 seconds.
My questions:
Why is the memory consumption at the end of the computation so much higher with Dask?
Why does building the common index and reindexing all the dataframes take so much longer with Dask?
Also, when using the Dask code, the console prints the following warning.
C:\Users\edit\Anaconda3\envs\edit\lib\site-packages\distributed\worker.py:901: UserWarning: Large object of size 5.41 MB detected in task graph:
(DatetimeIndex(['2015-01-02 09:30:00', '2015-01-02 ... s x 5 columns])
Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s))
Even if the warning's suggestions are really good, I don't get what's wrong with my code. Why is it saying "keep data on workers"? I thought that with the submit method I'm sending all the data to my client, so the workers have easy access to all the data. Thank you all for the help.
I am not an expert at all, just trying to help.
You might want to try not using time.perf_counter; see if that changes anything.
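Not an answer to the timing question, but regarding the warning itself, here is a minimal sketch of what it suggests (reusing the client, index, data and __reindex__ names from the question): scatter the shared index once and pass futures instead of .result() round-trips.

# broadcast the common index to every worker once,
# instead of shipping it inside every task's graph
big_index = client.scatter(index, broadcast=True)

for symbol in data:
    # pass the existing future directly; the frames stay on the workers
    data[symbol] = client.submit(__reindex__, big_index, data[symbol])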
I have a file with around 500K records.
Each record needs to be validated.
Records are deduplicated and stored in a list:
with open(filename) as f:
    records = f.readlines()
The validation file I use is stored in a pandas DataFrame.
This DataFrame contains around 80K records and 9 columns (myfile.csv).
filename = 'myfile.csv'
df = pd.read_csv(filename)
def check(df, destination):
    try:
        area_code = destination[:3]
        office_code = destination[3:6]
        subscriber_number = destination[6:]
        if any(df['AREA_CODE'].astype(int) == area_code):
            area_code_numbers = df[df['AREA_CODE'] == area_code]
            if any(area_code_numbers['OFFICE_CODE'].astype(int) == office_code):
                matching_records = area_code_numbers[area_code_numbers['OFFICE_CODE'].astype(int) == office_code]
                start = subscriber_number >= matching_records['SUBSCRIBER_START']
                end = subscriber_number <= matching_records['SUBSCRIBER_END']
                # Perform intersection
                record_found = matching_records[start & end]['LABEL'].to_string(index=False)
                # We should return only 1 value
                if len(record_found) > 0:
                    return record_found
                else:
                    return 'INVALID_SUBSCRIBER'
            else:
                return 'INVALID_OFFICE_CODE'
        else:
            return 'INVALID_AREA_CODE'
    except KeyError:
        pass
    except Exception:
        pass
I'm looking for a way to improve the comparisons, because when I run it, it just hangs. If I run it with a small subset (10K) it works fine.
Not sure if there is a more efficient notation/recommendation.
for record in records:
    check(df, record)
Using macOS, 8 GB RAM / 2.3 GHz Intel Core i7.
cProfile.run on the check function alone shows:
4253 function calls (4199 primitive calls) in 0.017 seconds.
Hence I assume 500K records will take around 2.5 hours.
While no data is available, consider this untested approach with a couple of left-join merges of both data pieces, followed by the validation steps. This avoids any looping and runs the conditional logic across columns:
import pandas as pd
import numpy as np

with open('RecordsValidate.txt') as f:
    records = f.readlines()
print(records)

rdf = pd.DataFrame({'rcd_id': list(range(1, len(records) + 1)),
                    'rcd_area_code': [int(rcd[:3]) for rcd in records],
                    'rcd_office_code': [int(rcd[3:6]) for rcd in records],
                    'rcd_subscriber_number': [rcd[6:] for rcd in records]})

filename = 'myfile.csv'
df = pd.read_csv(filename)

# VALIDATE AREA CODE
mrgdf = pd.merge(df, rdf, how='left', left_on=['AREA_CODE'], right_on=['rcd_area_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_AREA_CODE', np.nan)
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)

# VALIDATE OFFICE CODE
mrgdf = pd.merge(mrgdf, rdf, how='left', left_on=['AREA_CODE', 'OFFICE_CODE'],
                 right_on=['rcd_area_code', 'rcd_office_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_OFFICE_CODE', mrgdf['RETURN'])

# VALIDATE SUBSCRIBER
mrgdf['RETURN'] = np.where((mrgdf['rcd_subscriber_number'] < mrgdf['SUBSCRIBER_START']) |
                           (mrgdf['rcd_subscriber_number'] > mrgdf['SUBSCRIBER_END']) |
                           (mrgdf['LABEL'].str.len() == 0),
                           'INVALID_SUBSCRIBER', mrgdf['RETURN'])
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)
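As a quick sanity check after the merges (a sketch), you can tally how the records were classified; anything not flagged with one of the INVALID_* labels passed validation:

# distribution of validation outcomes across all records
print(mrgdf['RETURN'].value_counts(dropna=False))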