Parallelize pandas with multiple class instances - python

I am trying to figure out how to run a large problem on multiple cores. I am struggling with splitting a dataframe to the different processes.
I have a class as follows:
class Pergroup():
    def __init__(self, groupid):
        ...
    def process_datapoint(self, df_in, group):
        ...
My data is a time series and contains events that can be grouped using the groupid column. I create an instance of the class for each group like so:
instance_names = []
for groupname in df_in['groupid'].unique():
    instance_names.append(groupname)
holder = {name: Pergroup(name) for name in instance_names}
Now, for each timestamp in the dataframe, I want to call the corresponding instance (based on the group), and pass to it the dataframe at that timestamp.
I have tried the following, which does not seem to parallelize as I expect:
for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(holder[current_group].process_datapoint, current_df, current_group)
I have also tried using this, which splits the df into its columns, when calling the instances:
Parallel(n_jobs=-1)(map(delayed(holder[current_group].process_datapoint), current_df, current_group))
How should I break up the dataframe such that I can still call the right instance with the right data? Basically, I am attempting to run a loop as below, with the last line running in parallel:
for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    holder[current_group].process_datapoint(current_df, current_group)  # This call should be initiated on as many cores as possible.

Slightly different approach using Pool:
import pandas as pd
from multiprocessing import Pool
# First make sure each process has its own data: one (group id, instance, rows) tuple per group
groups = df_in['groupid'].unique()
data = [(group_id, holder[group_id], df_in[df_in['groupid'] == group_id])
        for group_id in groups]
# Prepare a function that can take this data as input
def help_func(current_group, instance, current_df):
    return instance.process_datapoint(current_df, current_group)
# Run in parallel; starmap unpacks each tuple into help_func's three arguments
if __name__ == '__main__':
    with Pool(processes=4) as p:
        results = p.starmap(help_func, data)
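A point the approach above leaves implicit is that each worker receives a pickled copy of the instance, so state changed inside process_datapoint does not reach the parent unless it is returned. Here is a minimal sketch of one way to handle that. The body of Pergroup, the helper names run_group and make_tasks, and the column names timestamp and value are all assumptions made for illustration, since the question does not show them.

```python
import pandas as pd
from multiprocessing import Pool


class Pergroup:
    """Hypothetical stand-in for the question's class."""

    def __init__(self, groupid):
        self.groupid = groupid
        self.total = 0.0

    def process_datapoint(self, row, group):
        # Toy per-row update; the real logic is not shown in the question.
        self.total += row["value"]


def run_group(instance, group_df):
    # Feed the rows of one group to its instance in timestamp order,
    # then return the instance so its updated state reaches the parent.
    for _, row in group_df.sort_values("timestamp").iterrows():
        instance.process_datapoint(row, instance.groupid)
    return instance.groupid, instance


def make_tasks(df):
    # One (instance, rows-of-that-group) tuple per group.
    return [(Pergroup(gid), gdf) for gid, gdf in df.groupby("groupid")]


if __name__ == "__main__":
    df_in = pd.DataFrame({
        "timestamp": [1, 2, 3, 4],
        "groupid": ["a", "b", "a", "b"],
        "value": [1.0, 10.0, 2.0, 20.0],
    })
    with Pool(processes=2) as pool:
        holder = dict(pool.starmap(run_group, make_tasks(df_in)))
    print({gid: inst.total for gid, inst in holder.items()})
```

The unit of parallel work here is a whole group rather than a single timestamp, which keeps the per-task overhead small and preserves the time order within each group.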

I had a similar problem at some point. I cannot completely adapt my code to your question, but I hope you can transpose it and make it fit your problem:
import math
import pandas as pd
from joblib import Parallel, delayed
maxbatchsize = 10000  # limit the amount of data dispatched to each core
ncores = -1  # number of cores to use (-1 means all of them)
data = pd.DataFrame()  # <<<- your dataframe
class DFconvoluter():
    def __init__(self, myparam):
        self.myparam = myparam
    def __call__(self, df):
        return df.apply(lambda row: row['somecolumn'] * self.myparam, axis=1)
nbatches = max(math.ceil(len(data) / maxbatchsize), ncores)
# a vector telling which row should be dispatched to which batch
# (GenStrategicGroups is shown further down and must be defined before this line runs)
g = GenStrategicGroups(data['Key'].values, nbatches)
#-- parallel part
def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=ncores)(delayed(func)(group) for _, group in dfGrouped)
    return pd.concat(retLst)
out = applyParallel(data.groupby(g), DFconvoluter(42))
What is left is to write how you'd like to group the batches together. For me this had to be done in such a way that rows with similar values in the 'Key' column stayed together:
def GenStrategicGroups(stratify, ngroups):
    ''' Generate a list of integers in a grouped sequence,
    where grouped levels in stratify are preserved.
    '''
    g = []
    nelpg = float(len(stratify)) / ngroups
    prev_ = None
    grouped_idx = 0
    for i,s in enumerate(stratify):
        if i > (grouped_idx+1)*nelpg:
            if s != prev_:
                grouped_idx += 1
        g.append(grouped_idx)
        prev_ = s
    return g
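To see what the grouping vector looks like, here is the same function applied to a small key list. This is only an illustration: the key values are made up, and the function is transcribed from the answer with the two nested conditions merged into one. It appears to assume that the rows are already sorted by key.

```python
def GenStrategicGroups(stratify, ngroups):
    """Assign a batch index to every row, keeping equal neighbouring keys together."""
    g = []
    nelpg = float(len(stratify)) / ngroups  # target number of elements per batch
    prev_ = None
    grouped_idx = 0
    for i, s in enumerate(stratify):
        # Move to the next batch only past the size threshold AND at a key change.
        if i > (grouped_idx + 1) * nelpg and s != prev_:
            grouped_idx += 1
        g.append(grouped_idx)
        prev_ = s
    return g


keys = ["a", "a", "b", "b", "b", "c"]  # made-up keys, already sorted
batches = GenStrategicGroups(keys, 2)
print(batches)  # [0, 0, 0, 0, 0, 1]
```

The threshold of three rows is passed in the middle of the "b" run, so the switch to batch 1 is postponed until the key changes. Batches can therefore end up quite uneven in size, which is the price of never splitting a key across two batches.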

Related

Parallelize python for loop on pandas dataframe and append the result

I have a pandas dataframe with 5M rows and 20+ columns. I want to do some calculations in a for loop, as in the sample below:
grp_list=df.GroupName.unique()
df2 = pd.DataFrame()
for g in grp_list:
    tmp_df = df.loc[(df['GroupName']==g)]
    for i in range(len(tmp_df.GroupName)):
        # calls another function
        res=my_func(tmp_df)
        tmp_df['Result'] = res
    df2 = df2.append(tmp_df, ignore_index=True)
There are ~900 distinct GroupName values. To improve performance, I want to parallelize the first for loop, as it is independent for each GroupName, and append the results to an output data frame. How can I do this effectively with multiprocessing, grouping by GroupName, with the final output as one appended dataframe?
First, you can try:
out = []
for _, g in df.groupby("GroupName"):
    res = my_func(g)
    out.append(res)
final_df = pd.concat(out)
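This should speed up your computation significantly, because groupby splits the frame once instead of filtering it for every group, and pd.concat builds the result once instead of copying df2 on every append.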
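As a small runnable illustration of the groupby-and-concat pattern, here is a sketch with toy data. The Value column and the body of my_func are assumptions made for the example; the real my_func is not shown in the question.

```python
import pandas as pd


def my_func(group):
    # Hypothetical per-group calculation: attach the group mean to every row.
    out = group.copy()
    out["Result"] = out["Value"].mean()
    return out


df = pd.DataFrame({
    "GroupName": ["x", "y", "x", "y"],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

# One pass over the groups, one concat at the end.
pieces = [my_func(g) for _, g in df.groupby("GroupName")]
final_df = pd.concat(pieces).sort_index()  # sort_index restores the original row order
print(final_df)
```

Because my_func works on a copy and returns it, the same function can later be handed to a process pool without changes.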
If you want to use multiprocessing (whether it speeds things up depends on the computation inside my_func), you can use the following example:
import multiprocessing
def my_func(df):
    # modify df here
    # ...
    return df
if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        groups = (g for _, g in df.groupby("GroupName"))
        out = []
        for res in pool.imap_unordered(my_func, groups):
            out.append(res)
    final_df = pd.concat(out)

Running Functions with Multiple Arguments Concurrently and Aggregating Complex Results

Set Up
This is part two of a question that I posted regarding accessing results from multiple processes.
For part one click Here: Link to Part One
I have a complex set of data that I need to compare to various sets of constraints concurrently, but I'm running into multiple issues. The first issue is getting results out of my multiple processes, and the second issue is getting anything beyond an extremely simple function to run concurrently.
Example
I have multiple sets of constraints that I need to compare against some data, and I would like to do this concurrently because I have a lot of sets of constraints. In this example I'll just be using two sets of constraints.
Jupyter Notebook
Create Some Sample Constraints & Data
# Create a set of constraints
constraints = pd.DataFrame([['2x2x2', 2,2,2],['5x5x5',5,5,5],['7x7x7',7,7,7]],
                           columns=['Name','First', 'Second', 'Third'])
constraints.set_index('Name', inplace=True)
# Create a second set of constraints
constraints2 = pd.DataFrame([['4x4x4', 4,4,4],['6x6x6',6,6,6],['7x7x7',7,7,7]],
                            columns=['Name','First', 'Second', 'Third'])
constraints2.set_index('Name', inplace=True)
# Create some sample data
items = pd.DataFrame([['a', 2,8,2],['b',5,3,5],['c',7,4,7]], columns=['Name','First', 'Second', 'Third'])
items.set_index('Name', inplace=True)
Running Sequentially
If I run this sequentially I can get my desired results, but with the data that I am actually dealing with it can take over 12 hours. Here is what it looks like when run sequentially, so that you know what my desired result looks like.
# Function
def seq_check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()
    df_items['Product'] = df_items.product(axis=1)
    df_constraints['Product'] = df_constraints.product(axis=1)
    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint,'Product']
    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True
    df_res = df_items.iloc[:, 7:]
    return df_res
constraint_sets = [constraints, constraints2, ...]
results = {}
counter = 0
for df in constraint_sets:
    res = seq_check_constraint(df, items)
    results['constraints'+str(counter)] = res
    counter += 1
or uglier:
df_res1 = seq_check_constraint(constraints, items)
df_res2 = seq_check_constraint(constraints2, items)
results = {'constraints0':df_res1, 'constraints1': df_res2}
As a result of running these sequentially I end up with DataFrames like the ones shown here:
I'd ultimately like to end up with a dictionary or list of the DataFrames, or be able to append the DataFrames all together. The order in which I get the results doesn't matter to me; I just want to have them all together and need to be able to do further analysis on them.
What I've Tried
So this brings me to my attempts at multiprocessing. From what I understand you can either use Queues or Managers to handle shared data and memory, but I haven't been able to get either to work. I am also struggling to get my function, which takes two arguments, to execute within the Pool at all.
Here is my code as it stands right now using the same sample data from above:
Function
def check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()
    df_items['Product'] = df_items.product(axis=1) # Mathematical Product
    df_constraints['Product'] = df_constraints.product(axis=1)
    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint,'Product']
    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True
    df_res = df_items.iloc[:, 7:]
    return df_res
Jupyter Notebook
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        res = p.map_async(mpf.check_constraint, (df_ns.constraint_sets, itertools.repeat(items)))
        print(res.get())
and my current error:
TypeError: check_constraint() missing 1 required positional argument: 'df_items_input'
The easiest way is to create a list of tuples (where each tuple represents one set of arguments to the function) and pass it to starmap. Note that starmap blocks and returns a plain list, so there is no .get() call on the result:
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        check_constraint_args = []
        for constraint in df_ns.constraint_sets:
            check_constraint_args.append((constraint, items))
        res = p.starmap(mpf.check_constraint, check_constraint_args)
        print(res)
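The TypeError in the question comes from map handing each element of its iterable to the function as a single argument. Pairing every constraint set with the shared items first, and letting starmap unpack each pair, avoids it. The sketch below shows that pairing with zip and itertools.repeat. Plain dictionaries stand in for the DataFrames, and check_constraint and build_args are simplified, hypothetical versions written only for this example.

```python
import itertools
from math import prod
from multiprocessing import Pool


def check_constraint(constraints, items):
    # Toy stand-in: an item "fits" when its volume is below the constraint's volume.
    return {
        cname: {iname: prod(idims) < prod(cdims) for iname, idims in items.items()}
        for cname, cdims in constraints.items()
    }


def build_args(constraint_sets, items):
    # One (constraints, items) tuple per call; repeat() supplies the shared items.
    return list(zip(constraint_sets, itertools.repeat(items)))


constraints = {"2x2x2": (2, 2, 2), "5x5x5": (5, 5, 5)}
constraints2 = {"4x4x4": (4, 4, 4), "7x7x7": (7, 7, 7)}
items = {"a": (2, 8, 2), "b": (5, 3, 5), "c": (7, 4, 7)}

if __name__ == "__main__":
    args = build_args([constraints, constraints2], items)
    with Pool() as p:
        res = p.starmap(check_constraint, args)  # results come back in input order
    results = {f"constraints{i}": r for i, r in enumerate(res)}
    print(results)
```

Since starmap preserves input order, the results can be keyed by position, which gives the dictionary the question asks for without any Queue or Manager.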

Multiprocessing Equally Query as pandas df then merge df's into Single df

I want to summarize the record counts of many tables, and I want the queries to run in parallel to save time.
list_tabl_cn =['TBL_A', 'TBL_B', 'TBL_C', 'TBL_D']
def tblRowCn(p_tbl):
    #connDb = pyodbc.connect(f'DSN={nama_db_target}', autocommit =True)
    connDb = sqlanydb.connect(uid='dba',
                              pwd='sql',
                              host='ip:port',
                              dbn='blah')
    is_tableExists = ego.my_desc(p_tbl,163).shape[0]
    if is_tableExists:
        proc_name = 'df_'+p_tbl
        if p_tbl == 'STG_CFG_SYS':
            Q_ = """\
            SELECT OPENDATE as TGL_POS, COUNT(1) CN FROM {0}
            GROUP BY TGL_POS
            """.format(p_tbl)
        else:
            Q_ = """\
            SELECT TANGGAL_POSISI as TGL_POS, COUNT(1) CN FROM {0}
            GROUP BY TGL_POS
            """.format(p_tbl)
        df_tbl = pd.read_sql_query(Q_, connDb, parse_dates=['TGL_POS'])
        df_tbl['THN'],df_tbl['BLN']= df_tbl['TGL_POS'].dt.year, df_tbl['TGL_POS'].dt.month
    else:
        df_tbl=[]
    return df_tbl
def task(table_nm):
    print(f"Task Executed with process {mp.current_process().pid}")
    tblRowCn(table_nm.upper())
def main():
    executor = mp.Pool(mp.cpu_count()-8)
    executor.map(task, [n_table_nm for n_table_nm in list_tabl_cn])
    executor.close()
if __name__ == "__main__":
    main()
Maybe something like this:
def main():
    executor = mp.Pool(mp.cpu_count()-8)
    executor.map(task, [n_table_nm for n_table_nm in list_tabl_cn])
    append.[task1, task2, task...]
    executor.close()
where my whole dataframe is append.[task1, task2, task...].
I believe I missed something in my code, but I can't see what.
If all your DataFrames have the same columns and you want all the rows of all DataFrames added to one DataFrame, you can use the pandas concat function. First make task return its result (return tblRowCn(table_nm.upper())), because map can only collect what the function returns. Then add the individual DataFrames to a list and concat them all to make your main DataFrame.
list_of_df = []
for df in executor.map(task, list_tabl_cn):
    # tblRowCn returns an empty list for a missing table; "if df:" would raise on a DataFrame
    if len(df):
        list_of_df.append(df)
main_df = pd.concat(list_of_df)
The else branch in tblRowCn can be removed if you initialise df_tbl (for example to an empty list) before the if statement.
In your code you built a new list from list_tabl_cn to pass to map. You don't have to do that; you can give list_tabl_cn to map directly, as in the code above.
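Putting the pieces together, a self-contained sketch of the collect-and-concat pattern could look like the following. No database is involved: FAKE_COUNTS is a made-up dictionary standing in for the query, and task and combine are hypothetical helpers that only illustrate how missing tables can be skipped.

```python
import pandas as pd
from multiprocessing import Pool

# Made-up row counts standing in for the database queries.
FAKE_COUNTS = {"TBL_A": 3, "TBL_B": 5}


def task(table_nm):
    # Return a one-row DataFrame, or None when the table does not exist.
    name = table_nm.upper()
    if name not in FAKE_COUNTS:
        return None
    return pd.DataFrame({"TBL": [name], "CN": [FAKE_COUNTS[name]]})


def combine(frames):
    # Drop missing or empty results, then build the final frame in one concat.
    frames = [df for df in frames if df is not None and not df.empty]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


if __name__ == "__main__":
    with Pool(processes=2) as pool:
        main_df = combine(pool.map(task, ["TBL_A", "TBL_B", "TBL_C"]))
    print(main_df)
```

Returning None for a missing table and filtering explicitly avoids truth-testing a DataFrame, which pandas rejects as ambiguous.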

Applying parallelization when updating dictionary values

datasets = {}
datasets['df1'] = df1
datasets['df2'] = df2
datasets['df3'] = df3
datasets['df4'] = df4
def prepare_dataframe(dataframe):
    return dataframe.apply(lambda x: x.astype(str).str.lower().str.replace(r'[^\w\s]', '', regex=True))
for key, value in datasets.items():
    datasets[key] = prepare_dataframe(value)
I need to prepare the data in some dataframes for further analysis. I would like to parallelize the for loop that updates the dictionary with a prepared dataframe. This code will eventually run on a machine with dozens of cores and thousands of dataframes. On my local machine I do not appear to be using more than a single core in the prepare_dataframe function.
I have looked at Numba and Joblib but I cannot find a way to work with dictionary values in either library.
Any insight would be very much appreciated!
You can use the multiprocessing library. You can read about its basics here.
Here is the code that does what you need:
from multiprocessing import Pool
def prepare_dataframe(dataframe):
    # do whatever you want here
    # changes made here are *not* global
    # return a modified version of what you want
    return dataframe
def worker(dict_item):
    key, value = dict_item
    return (key, prepare_dataframe(value))
def parallelize(data, func):
    data_list = list(data.items())
    pool = Pool()
    data = dict(pool.map(func, data_list))
    pool.close()
    pool.join()
    return data
if __name__ == '__main__':
    datasets = parallelize(datasets, worker)

How to use pass by reference for data frame in python pandas

Manager Code..
import pandas as pd
import multiprocessing
import time
import MyDF
import WORKER
class Manager():
    'Common base class for all Manager'
    def __init__(self,Name):
        print('Hello Manager..')
        self.MDF=MyDF.MYDF(Name)
        self.Arg=self.MDF.display()
        self.WK=WORKER.Worker(self.Arg)
MGR=Manager('event_wise_count')
if __name__ == '__main__':
    jobs = []
    x=5
    for i in range(5):
        x=10*i
        print('Manager : ',i)
        p = multiprocessing.Process(target=MGR.WK.DISPLAY)
        jobs.append(p)
        p.start()
        time.sleep(x)
worker code...
import pandas as pd
import time
class Worker():
    'Common base class for all Workers'
    empCount = 0
    def __init__(self,DF):
        self.DF=DF
        print('Hello worker..',self.DF.count())
    def DISPLAY(self):
        self.DF=self.DF.head(10)
        return self.DF
Hi, I am trying to do multiprocessing, and I want to share a DataFrame's address with all sub-processes.
In the code above, the Manager class spawns 5 processes, and each sub-process needs to use the DataFrame of the Worker class. I expected each sub-process to share a reference to the worker's DataFrame, but unfortunately that is not happening.
Any answer is welcome.
Thanks in advance.
This answer suggests using Namespaces to share large objects between processes by reference.
Here's an example of an application where several worker processes can read from the same DataFrame. (Note: you can't run this on an interactive console; save it as program.py and run it.)
import numpy as np
import pandas as pd
from multiprocessing import Manager, Pool
def get_slice(namespace, column, rows):
    '''Return the first `rows` rows from column `column` in namespace.data'''
    return namespace.data[column].head(rows)
if __name__ == '__main__':
    # Create a namespace to place our DataFrame in it
    manager = Manager()
    namespace = manager.Namespace()
    namespace.data = pd.DataFrame(np.random.rand(1000, 10))
    # Create a pool of 2 worker processes
    pool = Pool(processes=2)
    for column in namespace.data.columns:
        # Each worker can access the same DataFrame object
        result = pool.apply_async(get_slice, [namespace, column, 5])
        print(result._job, column, result.get().tolist())
While reading from the DataFrame is perfectly fine, it gets a little tricky if you want to write back to it. It's better to just stick to immutable objects unless you really need large write-able objects.
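One reason writing back is tricky: reading an attribute of a Namespace proxy hands back a copy of the stored object, so an in-place change to that copy never reaches the manager. A minimal sketch of the effect and of the usual workaround (fetch, modify, assign back) is shown here. The function name demo and the column names are made up for the illustration.

```python
import pandas as pd
from multiprocessing import Manager


def demo():
    with Manager() as manager:
        ns = manager.Namespace()
        ns.data = pd.DataFrame({"a": [1, 2, 3]})

        # ns.data returns a copy, so this column is added to the copy only.
        ns.data["b"] = 0
        lost = "b" not in ns.data.columns

        # Fetch, modify locally, then assign back so the manager stores the new object.
        local = ns.data
        local["b"] = 0
        ns.data = local
        kept = "b" in ns.data.columns
    return lost, kept


if __name__ == "__main__":
    print(demo())
```

The assign-back step replaces the whole stored object, so two processes doing it at the same time can overwrite each other's changes unless access is coordinated, for example with a manager Lock.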
Sorry about the necromancy.
The issue is that the workers must have unique DataFrame instances. Almost all attempts to slice, or chunk, a Pandas DataFrame will result in aliases to the original DataFrame. These aliases will still result in resource contention between workers.
There are two things that should improve performance. The first would be to make sure that you are working with Pandas rather than against it. Iterating row by row, with iloc or iterrows, fights against the design of DataFrames. Using a new-style class object and the apply method is one option.
import numpy as np
import pandas as pd
def get_example_df():
    return pd.DataFrame(np.random.randint(10, 100, size=(5,5)))
class Math(object):
    def __init__(self):
        self.summation = 0
    def operation(self, row):
        row_result = 0
        for elem in row:
            if elem % 2:
                row_result += elem
            else:
                row_result += 1
        self.summation += row_result
        if row_result % 2:
            return row_result
        else:
            return 1
    def get_summation(self):
        return self.summation
Custom = Math()
df = get_example_df()
# axis=1 passes one row at a time to operation
df['new_col'] = df.apply(Custom.operation, axis=1)
print(Custom.get_summation())
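The second option would be to read in, or generate, a separate DataFrame for each worker, then recombine if desired. Note that the list must hold distinct objects; multiplying a one-element list would repeat the same DataFrame.
workers = 5
df_list = [get_example_df() for _ in range(workers)]
...
# worker code
...
aggregated = pd.concat(df_list, axis=0)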
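As a rough sketch of this second option with an actual pool, each worker below builds its own DataFrame from a seed and returns the processed result, so no frame is shared between processes. The function build_and_process, the row_sum and worker columns, and the use of a seed per worker are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool


def build_and_process(seed):
    # Each worker generates its own DataFrame, so nothing is shared or aliased.
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.integers(10, 100, size=(5, 5)))
    df["row_sum"] = df.sum(axis=1)  # vectorised per-row work
    df["worker"] = seed             # remember which worker produced the rows
    return df


if __name__ == "__main__":
    workers = 5
    with Pool(processes=workers) as pool:
        df_list = pool.map(build_and_process, range(workers))
    aggregated = pd.concat(df_list, axis=0, ignore_index=True)
    print(aggregated.shape)
```

In a real job the seed would typically be replaced by a file name or a query range, so that each worker reads only its own slice of the data.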
However, multiprocessing will not be necessary in most cases. I've processed more than 6 million rows of data without multiprocessing in a reasonable amount of time (on a laptop).
Note: I did not time the above code and there is probably room for improvement.
