Parallelize python for loop on pandas dataframe and append the result - python

I have a pandas dataframe with 5M rows and 20+ columns. I want to do some calculations in a for loop, as in the sample below:
grp_list = df.GroupName.unique()
df2 = pd.DataFrame()
for g in grp_list:
    tmp_df = df.loc[df['GroupName'] == g]
    for i in range(len(tmp_df.GroupName)):
        # calls another function
        res = my_func(tmp_df)
    tmp_df['Result'] = res
    df2 = df2.append(tmp_df, ignore_index=True)
There are ~900 distinct GroupName values. To improve performance, I want to parallelize the first for loop, since it is independent for each GroupName, and append the results to an output dataframe. How can I do this effectively with multiprocessing, grouping by GroupName, with the final output as a single appended dataframe?

First, you can try:
out = []
for _, g in df.groupby("GroupName"):
    res = my_func(g)
    out.append(res)
final_df = pd.concat(out)
This should speed up your computation significantly.
If you want to use multiprocessing (whether it speeds things up depends on the computation inside my_func), you can use the following example:
import multiprocessing

def my_func(df):
    # modify df here
    # ...
    return df

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        groups = (g for _, g in df.groupby("GroupName"))
        out = []
        for res in pool.imap_unordered(my_func, groups):
            out.append(res)
        final_df = pd.concat(out)
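If my_func needs extra arguments besides the group itself, one option (an assumption on my part, not part of the original answer) is to wrap it with functools.partial, which pickles cleanly as long as my_func is defined at module level. The sketch below assumes df from the question is available at module level and uses a made-up scale parameter purely for illustration:
import functools
import multiprocessing

import pandas as pd

def my_func(group, scale=1.0):
    # 'scale' is a hypothetical extra parameter, purely for illustration
    group = group.copy()
    group['Result'] = scale  # placeholder for the real calculation
    return group

if __name__ == "__main__":
    worker = functools.partial(my_func, scale=2.0)
    with multiprocessing.Pool() as pool:
        out = pool.map(worker, (g for _, g in df.groupby("GroupName")))
    final_df = pd.concat(out, ignore_index=True)
Passing ignore_index=True to pd.concat reproduces the ignore_index=True behaviour of the original df2.append loop.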

Related

Grouping a dataframe and performing operations on the resulting matrix in a parallelized manner using Python/Dask/multiprocessing?

I am working on a project where I need to group molecules in a database by their ID and perform operations on the resulting matrix. I am using Python and I want to improve performance by parallelizing the process.
I am currently loading the molecules from an SDF file and storing them in a Pandas dataframe. Each molecule has an ID, a unique Pose ID, and a unique Structure. My goal is to group the dataframe by ID and create a matrix for each ID group. The rows and columns of the matrix would correspond to the unique Pose IDs of the molecules in that ID group. Then, I can calculate values for each cell in the matrix, such as the similarity between the molecules that define that cell. However, the specific operations on the molecules are not important for this question. I am primarily asking for advice on how to set up such a system for parallelized computing using Dask or Multiprocessing, or if there are other better options.
Here is a gist of the version without any parallelisation (please note I have heavily modified it to make my question clearer; the code below outputs the desired things, but I am looking to calculate the cells on molecules, not the Pose IDs): https://gist.github.com/Tonylac77/abfd54b1ceef39f0d161fb6b21950edb
#Generate sample dataframe
import pandas as pd
df = pd.DataFrame(columns=['ID', 'Pose ID'])
ids = ['ID' + str(i) for i in range(1, 6)]
pose_ids = ['Pose ' + str(i) for i in range(1, 11)]
# For each ID, add 10 rows to the dataframe with the corresponding Pose IDs
df_list = []
for i in ids:
    temp_df = pd.DataFrame({'ID': [i] * 10, 'Pose ID': pose_ids})
    df_list.append(temp_df)
df = pd.concat(df_list)
print(df)
################
from tqdm import tqdm
import itertools
import functools
import numpy as np
from IPython.display import display
def full_matrix_calculation(df):
    # Here I am using just string concatenation as an example calculation,
    # in reality I am calling external functions
    def matrix_calculation(df, id_list):
        matrices = {}
        calc_dataframes = []
        for id in tqdm(id_list):
            df_name = df[df['ID'] == id]
            df_name.index = range(len(df_name['Pose ID']))
            matrix = pd.DataFrame(0.0, index=[df_name['Pose ID']], columns=df_name['Pose ID'])
            for subset in itertools.combinations(df_name['Pose ID'], 2):
                result = subset[0] + subset[1]
                matrix.iloc[df_name[df_name['Pose ID']==subset[0]].index.values, df_name[df_name['Pose ID']==subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID']==subset[1]].index.values, df_name[df_name['Pose ID']==subset[0]].index.values] = result
            matrices[id] = matrix
        return matrices
    id_list = np.unique(np.array(df['ID']))
    calculated_dfs = matrix_calculation(df, id_list)
    return calculated_dfs

calculated_dfs = full_matrix_calculation(df)
display(calculated_dfs)
I have tried using multiprocessing; however, my implementation seems to be slower than the non-parallelized version: https://gist.github.com/Tonylac77/b4bbada97ee2bab7c37d4a29079af574
def function(tuple):
    return tuple[0] + tuple[1]

def full_matrix_calculation(df):
    # Here I am using just string concatenation as an example calculation,
    # in reality I am calling external functions
    def matrix_calculation(df, id_list):
        matrices = {}
        calc_dataframes = []
        for id in tqdm(id_list):
            df_name = df[df['ID'] == id]
            df_name.index = range(len(df_name['Pose ID']))
            matrix = pd.DataFrame(0.0, index=[df_name['Pose ID']], columns=df_name['Pose ID'])
            with multiprocessing.Pool() as p:
                try:
                    results = p.map(function, itertools.combinations(df_name['Pose ID'], 2))
                except KeyError:
                    print('Incorrect clustering method selected')
                    return
            results_list = list(zip(itertools.combinations(df_name['Pose ID'], 2), results))
            for subset, result in results_list:
                matrix.iloc[df_name[df_name['Pose ID']==subset[0]].index.values, df_name[df_name['Pose ID']==subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID']==subset[1]].index.values, df_name[df_name['Pose ID']==subset[0]].index.values] = result
            matrices[id] = matrix
            for subset in itertools.combinations(df_name['Pose ID'], 2):
                result = subset[0] + subset[1]
                matrix.iloc[df_name[df_name['Pose ID']==subset[0]].index.values, df_name[df_name['Pose ID']==subset[1]].index.values] = result
                matrix.iloc[df_name[df_name['Pose ID']==subset[1]].index.values, df_name[df_name['Pose ID']==subset[0]].index.values] = result
            matrices[id] = matrix
        return matrices
    id_list = np.unique(np.array(df['ID']))
    calculated_dfs = matrix_calculation(df, id_list)
    return calculated_dfs

calculated_dfs = full_matrix_calculation(df)
display(calculated_dfs)
I have also started playing around with Dask; however, the main issue I'm facing is that I need all of the values of one ID to be in the same Dask partition, otherwise I will have incomplete matrices (if I understand correctly at least). I have tried to find a solution to this (like chunking into x partitions etc.) but so far to no avail. I will update this thread if something changes.
Any advice to speed these calculations up is welcome. For reference, the actual datasets I'm working with contain ~10000 unique IDs and ~300000 Pose IDs. With the calculations I'm running on the molecules, some of these are taking 40h to complete.
This should be pretty straightforward using Dask DataFrame and groupby (note that groupby.apply passes only the group to the function, so it takes a single argument):
ddf = your_dataframe_as_dask

def matrix_calculation(df):
    matrix = pd.DataFrame(0.0, index=[df['Pose ID']], columns=df['Pose ID'])
    for subset in itertools.combinations(df['Pose ID'], 2):
        result = subset[0] + subset[1]
        matrix.iloc[df[df['Pose ID']==subset[0]].index.values, df[df['Pose ID']==subset[1]].index.values] = result
        matrix.iloc[df[df['Pose ID']==subset[1]].index.values, df[df['Pose ID']==subset[0]].index.values] = result
    return matrix

ddf.groupby('ID').apply(matrix_calculation).compute()
See https://examples.dask.org/dataframes/02-groupby.html#Groupby-Apply.
This will parallelize the work for each ID.
You might then want to look at https://docs.dask.org/en/stable/scheduling.html to choose the scheduler that suits your needs (the default with DataFrame is threads, which might not be efficient depending on your code).
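If you would rather stay with the standard library, a minimal multiprocessing sketch (my own, assuming the sample df built in the question is available at module level) is to parallelize over ID groups instead of over individual Pose ID pairs, so each worker builds one complete matrix and the per-task overhead stays small:
import itertools
import multiprocessing

import pandas as pd

def matrix_for_group(args):
    # each task is one (ID, group dataframe) pair
    group_id, group_df = args
    pose_ids = list(group_df['Pose ID'])
    matrix = pd.DataFrame('', index=pose_ids, columns=pose_ids)
    for a, b in itertools.combinations(pose_ids, 2):
        result = a + b  # placeholder for the real molecule calculation
        matrix.loc[a, b] = result
        matrix.loc[b, a] = result
    return group_id, matrix

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        matrices = dict(pool.imap_unordered(matrix_for_group, df.groupby('ID')))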

Creating Cartesian Product DataFrame without maxing Memory

I have several dataframes, from which I'm creating a cartesian product (on purpose!)
After this, I'm exporting the result to disk.
I believe the size of the resulting dataframe could exceed my memory footprint, so I'm wondering: is there a way to chunk this so that the whole dataframe doesn't need to be in memory at the same time?
Example Code:
import pandas as pd
def create_list_from_range(r1, r2):
    if r1 == r2:
        return r1
    else:
        res = []
        while r1 < r2 + 1:
            res.append(r1)
            r1 += 1
        return res
# make a list of options
color_opt = ['red','blue','green','orange']
dow_opt = create_list_from_range(1,7)
hod_opt = create_list_from_range(0,23)
# turn each list into a dataframe
df_color = pd.DataFrame({'color': color_opt})
df_day = pd.DataFrame({'day_of_week': dow_opt})
df_hour = pd.DataFrame({'hour_of_day': hod_opt})
# add a dummy column to everything so I can easily do a cartesian product
df_color['dummy']=1
df_day['dummy']=1
df_hour['dummy']=1
# now cartesian product... cascading
merge1 = pd.merge(df_day, df_hour, on='dummy')
FINAL = pd.merge(merge1, df_color, on='dummy')
FINAL.to_csv('FINAL_OUTPUT.csv', index=False)
You could try building up individual rows using itertools.product. In your example, you could do this as follows:
from itertools import product
prod = product(color_opt, dow_opt, hod_opt)
You can then take batches of rows at a time and append them to an existing CSV file using
df.to_csv("file", mode="a")
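A minimal sketch of that chunked approach (my addition; it reuses the option lists from the question and picks an arbitrary chunk size): pull rows from the product iterator in fixed-size batches and append each batch to the CSV, so only one chunk is ever in memory at a time.
from itertools import islice, product

import pandas as pd

prod = product(dow_opt, hod_opt, color_opt)
columns = ['day_of_week', 'hour_of_day', 'color']
chunk_size = 10000  # arbitrary; tune to your memory budget

first = True
while True:
    chunk = list(islice(prod, chunk_size))
    if not chunk:
        break
    # write the header only for the first chunk, then append
    pd.DataFrame(chunk, columns=columns).to_csv(
        'FINAL_OUTPUT.csv',
        mode='w' if first else 'a',
        header=first,
        index=False)
    first = False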

Multiprocessing Equally Query as pandas df then merge df's into Single df

I want to summarize the record counts of many tables, and I want the queries to run in parallel to save time.
list_tabl_cn = ['TBL_A', 'TBL_B', 'TBL_C', 'TBL_D']

def tblRowCn(p_tbl):
    #connDb = pyodbc.connect(f'DSN={nama_db_target}', autocommit=True)
    connDb = sqlanydb.connect(uid='dba',
                              pwd='sql',
                              host='ip:port',
                              dbn='blah')
    is_tableExists = ego.my_desc(p_tbl, 163).shape[0]
    if is_tableExists:
        proc_name = 'df_' + p_tbl
        if p_tbl == 'STG_CFG_SYS':
            Q_ = """\
            SELECT OPENDATE as TGL_POS, COUNT(1) CN FROM {0}
            GROUP BY TGL_POS
            """.format(p_tbl)
        else:
            Q_ = """\
            SELECT TANGGAL_POSISI as TGL_POS, COUNT(1) CN FROM {0}
            GROUP BY TGL_POS
            """.format(p_tbl)
        df_tbl = pd.read_sql_query(Q_, connDb, parse_dates=['TGL_POS'])
        df_tbl['THN'], df_tbl['BLN'] = df_tbl['TGL_POS'].dt.year, df_tbl['TGL_POS'].dt.month
    else:
        df_tbl = []
    return df_tbl

def task(table_nm):
    print(f"Task Executed with process {mp.current_process().pid}")
    tblRowCn(table_nm.upper())

def main():
    executor = mp.Pool(mp.cpu_count() - 8)
    executor.map(task, [n_table_nm for n_table_nm in list_tabl_cn])
    executor.close()

if __name__ == "__main__":
    main()
Maybe something like this:
def main():
    executor = mp.Pool(mp.cpu_count() - 8)
    executor.map(task, [n_table_nm for n_table_nm in list_tabl_cn])
    append.[task1, task2, task...]
    executor.close()
where my whole dataframe would be the appended [task1, task2, task...] results. I believe I have missed something in my code, but it is too unclear to me.
If all your dataframes have the same columns and you want all the rows of all dataframes added into one dataframe, you can use pandas' concat function: add all your individual dataframes to a list, then concat them to build your main dataframe.
list_of_df = []
for df in executor.map(task, list_tabl_cn):
    if len(df):  # skips the empty list returned for missing tables
        list_of_df.append(df)
main_df = pd.concat(list_of_df)
You can remove the else condition in the tblRowCn method; it is redundant and not needed.
Also, in your code you generated a list from list_tabl_cn just to pass it to the map function; you don't have to do that, you can give list_tabl_cn to map directly, as in the code above.
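One more detail (my addition, not part of the original answer): for the loop above to receive dataframes, task has to return the result of tblRowCn; as written in the question it returns nothing, so map would only yield None values.
def task(table_nm):
    print(f"Task executed with process {mp.current_process().pid}")
    return tblRowCn(table_nm.upper())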

Parallelize pandas with multiple class instances

I am trying to figure out how to run a large problem on multiple cores. I am struggling with splitting a dataframe across the different processes.
I have a class as follows:
class Pergroup():
    def __init__(self, groupid):
        ...

    def process_datapoint(self, df_in, group):
        ...
My data is a time series and contains events that can be grouped using the groupid column. I create an instance of the class for each group like so:
for groupname in df_in['groupid'].unique():
    instance_names.append(groupname)
holder = {name: Pergroup(name) for name in instance_names}
Now, for each timestamp in the dataframe, I want to call the corresponding instance (based on the group), and pass to it the dataframe at that timestamp.
I have tried the following, which does not seem to parallelize as I expect:
for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(holder[current_group].process_datapoint, current_df, current_group)
I have also tried using this, which splits the df into its columns, when calling the instances:
Parallel(n_jobs=-1)(map(delayed(holder[current_group].process_datapoint), current_df, current_group))
How should I break up the dataframe such that I can still call the right instance with the right data? Basically, I am attempting to run a loop as below, with the last line running in parallel:
for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    holder[current_group].process_datapoint(current_df, current_group)  # This call should be initiated on as many cores as possible.
A slightly different approach, using Pool:
import pandas as pd
from multiprocessing import Pool

# First make sure each process has its own data
groups = df_in['groupid'].unique()
data = [(group_id, holder[group_id], df_in[df_in['groupid'] == group_id])
        for group_id in groups]

# Prepare a function that can take this data as input
def help_func(current_group, holder, current_df):
    return holder.process_datapoint(current_df, current_group)

# Run in parallel (starmap unpacks each tuple into the three arguments)
with Pool(processes=4) as p:
    results = p.starmap(help_func, data)
I had a similar problem at some point; as I can't completely adapt it to your question, I hope you can transpose this and make it fit your problem:
import math
import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

maxbatchsize = 10000  # limit the amount of data dispatched to each core
ncores = -1  # number of cores to use

data = pd.DataFrame()  # <<<- your dataframe

class DFconvoluter():
    def __init__(self, myparam):
        self.myparam = myparam

    def __call__(self, df):
        return df.apply(lambda row: row['somecolumn'] * self.myparam, axis=1)

nbatches = max(math.ceil(len(data) / maxbatchsize), ncores)
# a vector telling which row should be dispatched to which batch
g = GenStrategicGroups(data['Key'].values, nbatches)

# -- parallel part
def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=ncores)(delayed(func)(group) for _, group in dfGrouped)
    return pd.concat(retLst)

out = applyParallel(data.groupby(g), DFconvoluter(42))
What is left is to write how you'd like to group the batches together; for me this had to be done in such a way that rows with the same value in the 'Key' column stayed together:
def GenStrategicGroups(stratify, ngroups):
    '''Generate a list of integers in a grouped sequence,
    where grouped levels in stratify are preserved.
    '''
    g = []
    nelpg = float(len(stratify)) / ngroups
    prev_ = None
    grouped_idx = 0
    for i, s in enumerate(stratify):
        if i > (grouped_idx + 1) * nelpg:
            if s != prev_:
                grouped_idx += 1
        g.append(grouped_idx)
        prev_ = s
    return g
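A small usage illustration (my own, not from the original answer) showing that rows sharing a key are never split across batches, even if the batches end up uneven:
keys = ['A', 'A', 'B', 'B', 'C', 'C']
print(GenStrategicGroups(keys, 3))
# -> [0, 0, 0, 0, 1, 1]: the 'A' and 'B' rows land in the same batch because
#    the key changed just before the batch-size threshold was crossed.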

python parallel process list with data frames

I have a list that contains data frames. Inside a loop I iterate over this list, clean up each data frame, dump it into another list, and return that list:
allDfs = []

def processDfs(self):
    for df in listOfDfs():
        for column_name in need_to_change_column_name:
            ...  # some column name changes
        df.set_index('id', inplace=True)
        ## dropping any na
        df = df.dropna()
        ...
        df['cost'] = df['cost'].astype('float64')
        allDfs.append(df)
    return allDfs
How do I distribute the processing of each data frame in listOfDfs among multiple threads, then collect the results and return the list of processed dfs?
Use the multiprocessing module:
from multiprocessing import Pool

# enter the desired number of processes here
NUM_PROCS = 8

def process_single_df(df):
    """
    Function that processes a single df.
    """
    for column_name in need_to_change_column_name:
        # some column name changes
        ...
    df.set_index('id', inplace=True)
    ## dropping any na
    df = df.dropna()
    ...
    df['cost'] = df['cost'].astype('float64')
    return df

pool = Pool(processes=NUM_PROCS)
allDfs = pool.map(process_single_df, listOfDfs)
The call to pool.map is blocking, meaning it will wait for all processes to complete before the program can continue.
If you don't need allDfs right away (you are happy to go ahead computing some other stuff while the parallel processing does its work) you can use pool.map_async instead in the last line:
# get async result instead (non-blocking)
async_result = pool.map_async(process_single_df, listOfDfs)
# do other stuff
...
# ok, now I need allDfs so will call async_result.get
allDfs = async_result.get()
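The question asks about threads specifically, while the answer above uses processes, which is usually the right call for CPU-bound pandas work. If the per-dataframe work is mostly I/O-bound (or otherwise releases the GIL), a thread pool is a near drop-in swap. A minimal sketch, assuming the same process_single_df and listOfDfs as above:
from multiprocessing.pool import ThreadPool

with ThreadPool(NUM_PROCS) as pool:
    allDfs = pool.map(process_single_df, listOfDfs)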
