Doing a task between multiprocessing steps in Python

I want to use the multiprocessing library to parallelize the computation. If you comment out the two lines marked "# comment to run normal way" and uncomment the marked line, the code runs in a serial fashion.
My dataframe is very big and takes a lot of time, so I want to use multiprocessing.
This is what I am trying:
def do_something(df):
    return df

def main(df, df_hide, df_res):
    p = Pool()  # comment to run normal way
    for i in range(0, df_hide.shape[0]):
        df = df.append(df_hide.iloc[i, :])
        df = p.map(do_something, df)  # comment to run normal way
        #df = do_something(df)  # uncomment to run normal way
        df_res.iloc[i, 0] = df.iloc[-1, 0]
    return df_res

if __name__ == '__main__':
    df = pd.DataFrame({'a': [1, 2, 3]})
    df_hide = pd.DataFrame({'a': [4, 5, 6]})
    df_res = pd.DataFrame({'z': [0, 0, 0]})
    df_res1 = main(df, df_hide, df_res)
    print(df_res1)
Expected output, which I get if I run it normally:
z
0 4
1 5
2 6
This gives me nothing; it freezes the cmd. Even if it did run, I don't think I would get the expected results, as I have to do something after every process. Can you please suggest how to parallelize the above code using multiprocessing? For reference, here is the serial version that runs normally:

import numpy as np
import pandas as pd

def do_something(df):
    return df

def main(df, df_hide, df_res):
    for i in range(0, df_hide.shape[0]):
        df = df.append(df_hide.iloc[i, :])
        df_res.iloc[i, 0] = df.iloc[-1, 0]
    return df_res

if __name__ == '__main__':
    df = pd.DataFrame({'a': [1, 2, 3]})
    df_hide = pd.DataFrame({'a': [4, 5, 6]})
    df_res = pd.DataFrame({'z': [0, 0, 0]})
    df_res1 = main(df, df_hide, df_res)
    print(df_res1)
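A minimal sketch of one way to parallelize this, assuming each iteration only needs the base frame plus the single row being appended. The process_row helper and the use of pd.concat are illustrative additions, not part of the original code:
import pandas as pd
from multiprocessing import Pool

def process_row(args):
    base_df, row = args
    # Append the single hidden row; pd.concat is used because DataFrame.append
    # is deprecated in newer pandas versions.
    df = pd.concat([base_df, row.to_frame().T])
    # ... the real do_something(df) work would go here ...
    return df.iloc[-1, 0]

def main(df, df_hide, df_res):
    tasks = [(df, df_hide.iloc[i, :]) for i in range(df_hide.shape[0])]
    with Pool() as p:
        df_res['z'] = p.map(process_row, tasks)
    return df_res

if __name__ == '__main__':
    df = pd.DataFrame({'a': [1, 2, 3]})
    df_hide = pd.DataFrame({'a': [4, 5, 6]})
    df_res = pd.DataFrame({'z': [0, 0, 0]})
    print(main(df, df_hide, df_res))
This prints the same df_res as the serial version, because each recorded value depends only on the row appended in that iteration.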

Related

Multiprocessing on python 3.10 appending to the same list

I'm currently trying to learn how to use multiprocessing in Python, and I want to apply it to some code of mine.
I have read other questions on the subject, but the solutions in those questions did not work in my environment (maybe because something has changed in Python 3.10).
My code looks like:
def obtenern2():
    A = []
    for d in days:
        aux = dfhabil[dfhabil["day"] == d]
        n2 = casosn(aux, 2)
        aml = ExportarMODml(n2)
        adl = ExportarMODdl(n2)
        A.append(aml)
        A.append(adl)
    return pd.concat(A)

B = obtenern2()
where "ExportarMODml" or "ExportarMODdl" takes the dataframe "n2" and perform some calculations returning a dataframe (so "A" is actually a list of dataframes).
I think that "ExportarMODml" and "ExportarMODdl" could be process in parallel, but I dont know how to append the resulting dataframes to the same list without causing corruption or something like that.
Here is a pattern that you could probably adapt to your requirements.
We have two functions ExportarMODml and ExportarMODdl. Each function takes a dictionary as its only argument and returns a DataFrame.
These can be executed in parallel and a concatenation of the returned DataFrames can be achieved thus:
from pandas import concat, DataFrame
from concurrent.futures import ProcessPoolExecutor

def ExportarMODml(d):
    return DataFrame(d)

def ExportarMODdl(d):
    return DataFrame(d)

def main():
    d = {'a': [1, 2], 'b': [3, 4]}
    with ProcessPoolExecutor() as ppe:
        futures = [ppe.submit(func, d) for func in (ExportarMODml, ExportarMODdl)]
        df = concat([future.result() for future in futures])
    print(df)

if __name__ == '__main__':
    main()
Output:
a b
0 1 3
1 2 4
0 1 3
1 2 4
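To adapt that pattern to the per-day loop from the question, one possibility (a sketch; days, dfhabil, casosn, ExportarMODml and ExportarMODdl are the names from the original code) is to submit both export calls for every day and concatenate all the results at the end:
from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def obtenern2():
    # Submit both exports for every day; the futures list preserves submission order.
    with ProcessPoolExecutor() as ppe:
        futures = []
        for d in days:
            n2 = casosn(dfhabil[dfhabil["day"] == d], 2)
            futures.append(ppe.submit(ExportarMODml, n2))
            futures.append(ppe.submit(ExportarMODdl, n2))
        return pd.concat([f.result() for f in futures])

if __name__ == '__main__':
    B = obtenern2()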

what is the pythonic way to test dataframe processing pipelines

What is the best way to go about testing a pandas dataframe processing chain? I stubbed out the script file and the test file below so you can see what I mean.
I am getting confused about best practice. My only guiding intuition is to make the tests runnable in any order, limit how many times the csv is loaded from disk, and make sure each point in the chain does not modify the fixture. Each step in the process depends on the previous steps, so unit testing each node is like testing the accumulation of processing up to that point in the pipeline. So far I am accomplishing the mission, but there is a lot of code duplication because I am incrementally rebuilding the pipeline in each test.
What is the right way to test this kind of Python script?
This is the data processing file stubbed out:
# main_script.py
def calc_allocation_methodology(df_row):
    print('calculating allocation methodology')
    return 'simple'

def flag_data_for_the_allocation_methodology(df):
    allocation_methodology = df.apply(calc_allocation_methodology, axis=1)
    df = df.assign(allocation_methodology=allocation_methodology)
    print('flagging each row for the allocation methodology')
    return df

def convert_repeating_values_to_nan(df):
    'keep one value and nan the rest of the values'
    print('convert repeating values to nan')
    return df

def melt_and_drop_accounting_columns(df):
    print('melt and drop accounting columns')
    print(f'columns remaining: {df.shape[1]}')
    return df

def melt_and_drop_engineering_columns(df):
    print('melt and drop engineering columns')
    print(f'columns remaining: {df.shape[1]}')
    return df

def process_csv_to_tiny_format(df):
    print('process the entire pipeline')
    return (df
        .pipe(flag_data_for_the_allocation_methodology)
        .pipe(convert_repeating_values_to_nan)
        .pipe(melt_and_drop_accounting_columns)
        .pipe(melt_and_drop_engineering_columns)
    )
This is the test file stubbed out
# test_main.py
from pytest import fixture
import main_script as main
import pandas as pd

@fixture(scope='session')
def df_from_csv():
    return pd.read_csv('database_dump.csv')

@fixture
def df_copy(df_from_csv):
    df = df_from_csv.copy()
    return df

def test_expected_flag_data_for_the_allocation_methodology(df_copy):
    df = df_copy
    node_to_test = df.pipe(main.flag_data_for_the_allocation_methodology)
    assert True

def test_convert_repeating_values_to_nan(df_copy):
    df = df_copy
    node_to_test = df.pipe(main.flag_data_for_the_allocation_methodology).pipe(main.convert_repeating_values_to_nan)
    assert True

def test_melt_and_drop_accounting_columns(df_copy):
    df = df_copy
    node_to_test = (df
        .pipe(main.flag_data_for_the_allocation_methodology)
        .pipe(main.convert_repeating_values_to_nan)
        .pipe(main.melt_and_drop_accounting_columns))
    assert True

def test_melt_and_drop_engineering_columns(df_copy):
    df = df_copy
    node_to_test = (df
        .pipe(main.flag_data_for_the_allocation_methodology)
        .pipe(main.convert_repeating_values_to_nan)
        .pipe(main.melt_and_drop_accounting_columns)
        .pipe(main.melt_and_drop_engineering_columns))
    assert True

def test_process_csv_to_tiny_format(df_from_csv):
    df = df_from_csv.copy()
    tiny_data = main.process_csv_to_tiny_format(df)
    assert True
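One way to cut the duplication (a sketch, not from the original post) is to let pytest fixtures build on one another, so each fixture adds exactly one pipeline stage and each test receives the accumulated result. The fixture names (flagged, deduplicated, and so on) are illustrative:
# test_main.py (sketch): each fixture runs one more stage of the pipeline,
# so individual tests no longer repeat the whole chain.
from pytest import fixture
import pandas as pd
import main_script as main

@fixture(scope='session')
def df_from_csv():
    return pd.read_csv('database_dump.csv')

@fixture
def flagged(df_from_csv):
    return main.flag_data_for_the_allocation_methodology(df_from_csv.copy())

@fixture
def deduplicated(flagged):
    return main.convert_repeating_values_to_nan(flagged)

@fixture
def accounting_melted(deduplicated):
    return main.melt_and_drop_accounting_columns(deduplicated)

@fixture
def engineering_melted(accounting_melted):
    return main.melt_and_drop_engineering_columns(accounting_melted)

def test_flag_data_for_the_allocation_methodology(flagged):
    assert 'allocation_methodology' in flagged.columns

def test_convert_repeating_values_to_nan(deduplicated):
    assert deduplicated is not None

def test_melt_and_drop_accounting_columns(accounting_melted):
    assert accounting_melted is not None

def test_melt_and_drop_engineering_columns(engineering_melted):
    assert engineering_melted is not None
The session-scoped csv fixture is still loaded only once, and each stage is computed at most once per test rather than rebuilt from scratch inside every test body.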

I want to run a loop with condition and save all outputs as dataframes with different names

I wrote a function that only depends on a dataframe, and its output is also a dataframe. I would like to make different dataframes according to a condition and save them as different datasets with different names. However, I couldn't save them as dataframes with different names; instead I do the process manually. Is there code that would do the same? It would be much appreciated.
import os
import numpy as np
import pandas as pd

data1 = pd.read_csv('C:/Users/Oz/Desktop/vintage/vintage1.csv', encoding='latin-1')
product_list = data1['product_types'].unique()

def vintage_table(df):
    df['Disbursement_Date'] = pd.to_datetime(df.Disbursement_Date)
    df['Closing_Date'] = pd.to_datetime(df.Closing_Date)
    df['NPL_date'] = pd.to_datetime(df.NPL_date, errors='ignore')
    df['NPL_date_period'] = df.loc[df.NPL_date > '2015-01-01', 'NPL_date'].apply(lambda x: x.strftime('%Y-%m'))
    df['Dis_date_period'] = df.Disbursement_Date.apply(lambda x: x.strftime('%Y-%m'))
    df['diff'] = ((df.NPL_date - df.Disbursement_Date) / np.timedelta64(3, 'M')).round(0)
    df = df.groupby(['Dis_date_period', 'NPL_date_period']).agg({'Dis_amount': 'sum', 'NPL_amount': 'sum', 'diff': 'mean'})
    df.reset_index(level=0, inplace=True)
    df['Vintage_Ratio'] = df['NPL_amount'] / df['Dis_amount']
    table = pd.pivot_table(df, values='Vintage_Ratio', index='Dis_date_period', columns=['diff']).fillna(0)
    return
The above is the function
#for e in product_list:
#    sub = data1[data1['product_types'] == e]
#    print(sub)

consumer = data1[data1['product_types'] == product_list[0]]
mortgage = data1[data1['product_types'] == product_list[1]]
vehicle = data1[data1['product_types'] == product_list[2]]

table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
I would like to improve this part. Is there a better way to do the same process?
You could have your vintage_table() function return a dataframe (i.e. return table) instead of just modifying one dataframe over and over, and that way you could do this in the second code block:
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
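Building on that, a dict keyed by product type avoids hand-naming a variable for every product. A sketch, assuming vintage_table is changed to return table; data1 and product_list are as defined above:
# One table per product type, without a separate variable for each.
tables = {
    product: vintage_table(data1[data1['product_types'] == product])
    for product in product_list
}

# Individual tables can then be looked up by product name, e.g.:
# tables[product_list[0]]  # the table for the first product type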

Python parallel data processing

We have a dataset which has approx 1.5MM rows. I would like to process it in parallel. The main function of the code is to look up master information and enrich the 1.5MM rows. The master is a two-column dataset with roughly 25,000 rows. However, I am unable to make the multiprocessing work and test its scalability properly. Can someone please help? The cut-down version of the code is as follows:
import pandas
from multiprocessing import Pool

def work(data):
    mylist = []
    #Business Logic
    return mylist.append(data)

if __name__ == '__main__':
    data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv', header='infer')
    print('Source Data :', data_df)
    agents = 2
    chunksize = 2
    with Pool(processes=agents) as pool:
        result = pool.map(func=work, iterable=data_df, chunksize=20)
        pool.close()
        pool.join()
    print('Result :', result)
The work method will contain the business logic, and I would like to pass partitioned data_df into work to enable parallel processing. The sample data is as follows:
CUSTOMER_ID,PRODUCT_ID,SALE_QTY
641996,115089,2
1078894,78144,1
1078894,121664,1
1078894,26467,1
457347,59359,2
1006860,36329,2
1006860,65237,2
1006860,121189,2
825486,78151,2
825486,78151,2
123445,115089,4
Ideally I would like to process 6 rows in each partition.
Please help.
Thanks and regards,
Bala
First, work is returning the output of mylist.append(data), which is None. I assume (and if not, I suggest) you want to return a processed Dataframe.
To distribute the load, you could use numpy.array_split to split the large Dataframe into a list of 6-row Dataframes, which are then processed by work.
import pandas
import math
import numpy as np
from multiprocessing import Pool

def work(data):
    #Business Logic
    return data  # Return it as a Dataframe

if __name__ == '__main__':
    data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv', header='infer')
    print('Source Data :', data_df)
    agents = 2
    rows_per_workload = 6
    num_loads = math.ceil(data_df.shape[0] / float(rows_per_workload))
    split_df = np.array_split(data_df, num_loads)  # A list of Dataframes
    with Pool(processes=agents) as pool:
        result = pool.map(func=work, iterable=split_df)
        result = pandas.concat(result)  # Stitch them back together
        pool.close()
        pool.join()
    print('Result :', result)
My best recommendation is for you to use the chunksize parameter of read_csv (Docs) and iterate over the chunks. This way you won't crash your RAM trying to load everything at once, plus, if you want, you can for example use threads to speed up the process.
for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    # process each chunk here
I'm not sure if this answers your specific question, but I hope it helps.
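Putting the two suggestions together, here is a sketch that feeds fixed-size chunks from read_csv straight to a pool. The file path and the work function are the ones from the question, and chunksize=6 matches the 6-row partitions asked for:
import pandas as pd
from multiprocessing import Pool

def work(chunk):
    # Business logic goes here; return the enriched chunk.
    return chunk

if __name__ == '__main__':
    # read_csv with chunksize yields one 6-row DataFrame at a time,
    # so the full 1.5MM rows are never held in memory twice.
    reader = pd.read_csv('D:\\retail\\customer_sales_parallel.csv', chunksize=6)
    with Pool(processes=2) as pool:
        result = pd.concat(pool.imap(work, reader))
    print('Result :', result)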

How to use pass by reference for data frame in python pandas

Manager Code..
import pandas as pd
import multiprocessing
import time
import MyDF
import WORKER

class Manager():
    'Common base class for all Manager'
    def __init__(self, Name):
        print('Hello Manager..')
        self.MDF = MyDF.MYDF(Name)
        self.Arg = self.MDF.display()
        self.WK = WORKER.Worker(self.Arg)

MGR = Manager('event_wise_count')

if __name__ == '__main__':
    jobs = []
    x = 5
    for i in range(5):
        x = 10 * i
        print('Manager : ', i)
        p = multiprocessing.Process(target=MGR.WK.DISPLAY)
        jobs.append(p)
        p.start()
        time.sleep(x)
worker code...
import pandas as pd
import time

class Worker():
    'Common base class for all Workers'
    empCount = 0

    def __init__(self, DF):
        self.DF = DF
        print('Hello worker..', self.DF.count())

    def DISPLAY(self):
        self.DF = self.DF.head(10)
        return self.DF
Hi, I am trying to do multiprocessing, and I want to share a DataFrame's address with all sub-processes.
In the Manager class above I am spawning 5 processes, where each sub-process needs to use the DataFrame of the Worker class; I am expecting each sub-process to share a reference to the Worker's DataFrame. But unfortunately it is not happening.
Any answer welcome.
Thanks in advance.. please :)..
This answer suggests using Namespaces to share large objects between processes by reference.
Here's an example of an application where 4 different processes can read from the same DataFrame. (Note: you can't run this on an interactive console -- save this as a program.py and run it.)
import numpy as np
import pandas as pd
from multiprocessing import Manager, Pool

def get_slice(namespace, column, rows):
    '''Return the first `rows` rows from column `column` in namespace.data'''
    return namespace.data[column].head(rows)

if __name__ == '__main__':
    # Create a namespace to place our DataFrame in it
    manager = Manager()
    namespace = manager.Namespace()
    namespace.data = pd.DataFrame(np.random.rand(1000, 10))
    # Create 4 processes
    pool = Pool(processes=2)
    for column in namespace.data.columns:
        # Each worker can access the same DataFrame object
        result = pool.apply_async(get_slice, [namespace, column, 5])
        print(result._job, column, result.get().tolist())
While reading from the DataFrame is perfectly fine, it gets a little tricky if you want to write back to it. It's better to just stick to immutable objects unless you really need large write-able objects.
Sorry about the necromancy.
The issue is that the workers must have unique DataFrame instances. Almost all attempts to slice, or chunk, a Pandas DataFrame will result in aliases to the original DataFrame. These aliases will still result in resource contention between workers.
There are two things that should improve performance. The first would be to make sure that you are working with Pandas idiomatically: iterating row by row, with iloc or iterrows, fights against the design of DataFrames. Using a new-style class object and the apply method is one option.
import numpy as np
import pandas as pd

def get_example_df():
    return pd.DataFrame(np.random.randint(10, 100, size=(5, 5)))

class Math(object):
    def __init__(self):
        self.summation = 0

    def operation(self, row):
        row_result = 0
        for elem in row:
            if elem % 2:
                row_result += elem
            else:
                row_result += 1
        self.summation += row_result
        if row_result % 2:
            return row_result
        else:
            return 1

    def get_summation(self):
        return self.summation

Custom = Math()
df = get_example_df()
df['new_col'] = df.apply(Custom.operation, axis=1)  # apply the method row by row
print(Custom.get_summation())
The second option would be to read in, or generate, each DataFrame for each worker. Then recombine if desired.
workers = 5
df_list = [get_example_df() for _ in range(workers)]  # distinct DataFrame instances
...
# worker code
...
aggregated = pd.concat(df_list, axis=0)
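A fuller sketch of that second option, using a Pool so each worker gets its own DataFrame and the results are recombined at the end. The doubling in work() is only a stand-in for real per-worker logic:
import numpy as np
import pandas as pd
from multiprocessing import Pool

def get_example_df():
    return pd.DataFrame(np.random.randint(10, 100, size=(5, 5)))

def work(df):
    # Each worker receives its own pickled copy of the frame and returns a result frame.
    return df * 2

if __name__ == '__main__':
    workers = 5
    df_list = [get_example_df() for _ in range(workers)]  # distinct instances
    with Pool(processes=workers) as pool:
        results = pool.map(work, df_list)
    aggregated = pd.concat(results, axis=0)
    print(aggregated.shape)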
However, multiprocessing will not be necessary in most cases. I've processed more than 6 million rows of data without multiprocessing in a reasonable amount of time (on a laptop).
Note: I did not time the above code and there is probably room for improvement.
