I would like to store the result of the work in a specific variable after multiprocessing, as shown below.
Alternatively, I want to save the results of the job as a csv file. May I know how to do it?
This is my code:
(I want to get the 'df4' and 'df7' data and save them to a csv file)
import pandas as pd
from pandas import DataFrame
import time
import multiprocessing
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
df5 = pd.DataFrame()
df6 = pd.DataFrame()
df7 = pd.DataFrame()
df8 = pd.DataFrame()
date = '2011-03', '2011-02' ........ '2021-03' # There are 120 items.
list1 = df1['resion'].drop_duplicates() # There are 20 items. 'df1' is the original data
#I'd like to divide the list and work on it.
list11 = list1.iloc[0:10]
list12 = list1.iloc[10:20]
#It's a function using 'list11'.
def cal1():
    global df2
    global df3
    global df4
    start = time.time()
    for i, t in enumerate(list11):
        df2 = pd.DataFrame(df1[df1['resion'] == t]) #'df1' is original data
        if i%2 == 0:
            print ("cal1 function processing: ", i)
            end = time.time()
            print (end-start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending = False )
            df4 = df4.append(pd.DataFrame(df3))
        return df4
#It's a function using 'list12'.
def cal2():
    global df5
    global df6
    global df7
    start = time.time()
    for i, t in enumerate(list12):
        df5 = pd.DataFrame(df1[df1['resion'] == t]) #'df1' is original data
        if i%2 == 0:
            print ("cal2 function processing: ", i)
            end = time.time()
            print (end-start)
        else:
            pass
        for n, d in enumerate(date):
            df6 = pd.DataFrame(df5[df5['date'] == d])
            df6['number'] = df6['price'].rank(pct=True, ascending = False )
            df7 = df7.append(pd.DataFrame(df6))
        return df7
## Multiprocessing code
if __name__ == "__main__":
    # creating processes
    p1 = multiprocessing.Process(target=cal1, args=())
    p2 = multiprocessing.Process(target=cal2, args=())

    # starting process 1
    p1.start()
    # starting process 2
    p2.start()

    # wait until process 1 is finished
    p1.join()
    # wait until process 2 is finished
    p2.join()

    # both processes finished
    print("Done!")
It looks like your functions cal1 and cal2 are identical except that they try to assign their results to different global variables. This is not going to work: when you run them in a subprocess, they will assign those globals in the subprocess, but that has no impact whatsoever on the main process from which you started them.
If you want to map a function to multiple input ranges across multiple processes you can use a process Pool and Pool.map.
For example:
def cal(input_list):
    df4 = pd.DataFrame()  # local result accumulator (this was a global in your version)
    start = time.time()
    for i, t in enumerate(input_list):
        df2 = pd.DataFrame(df1[df1['resion'] == t]) #'df1' is original data
        if i%2 == 0:
            print ("cal1 function processing: ", i)
            end = time.time()
            print (end-start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending = False )
            df4 = df4.append(pd.DataFrame(df3))
        # I kept your original loop logic unmodified (apart from making df4 local),
        # but I'm not really sure this is what to do, because you are returning
        # after one pass through the outer loop. I haven't scrutinized what you
        # are actually trying to do but I suspect this is wrong too.
        return df4
Then create a process pool and you can divide up the input how you want (or, with a bit of tweaking, you can let Pool.map chunk the input for you, and then reduce the outputs from map into a single output):
pool = multiprocessing.Pool(2)
dfs = pool.map(cal, [list1.iloc[0:10], list1.iloc[10:20]])
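Since you also asked about saving the results to a csv file, here is a minimal sketch of what you could do with the list returned by pool.map (the filenames are just placeholders):

# dfs[0] corresponds to your df4 (built from list1.iloc[0:10]) and dfs[1] to df7
df4, df7 = dfs
df4.to_csv('df4.csv', index=False)   # save each part separately
df7.to_csv('df7.csv', index=False)
# or combine everything into a single file
pd.concat(dfs, ignore_index=True).to_csv('all_results.csv', index=False)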
This is just to get you started. I would probably do a number of other things differently as well.
Related
I found the bottleneck in my Python script. This function takes over 4 minutes for my csv.
Is it better to use the DataFrame assign function with a lambda here? And is that possible for the function I wrote?
This function checks whether an article number appears more than once in the dataframe and, when that is true, it should mark all of those rows as a variant.
def mark_variants(df):
    single_prods = df["ArtikelNr"].unique()
    varianten = pd.DataFrame()
    non_varianten = pd.DataFrame()
    for prod in single_prods:
        filtered_prods = df[df.ArtikelNr == prod]
        if len(filtered_prods["ArtikelNr"]) > 1:
            varianten = pd.concat([varianten, filtered_prods])
        else:
            non_varianten = pd.concat([non_varianten, filtered_prods])
    varianten["variante"] = 1
    non_varianten["variante"] = 0
    return pd.concat([varianten, non_varianten])
You are concatenating dataframes multiple times within the for-loop, which is computationally expensive.
You did not provide a reproducible example, so I cannot test it myself, but:
using lists instead of empty dataframes to instantiate varianten and non_varianten,
and concatenating only once the for-loop is over,
might speed things up.
Here is how you could refactor your function and give it a try:
def mark_variants(df):
    single_prods = df["ArtikelNr"].unique()
    varianten = []
    non_varianten = []
    for prod in single_prods:
        filtered_prods = df[df.ArtikelNr == prod]
        if len(filtered_prods["ArtikelNr"]) > 1:
            varianten.append(filtered_prods)
        else:
            non_varianten.append(filtered_prods)
    varianten = pd.concat(varianten)
    non_varianten = pd.concat(non_varianten)
    varianten["variante"] = 1
    non_varianten["variante"] = 0
    return pd.concat([varianten, non_varianten])
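If all you need is the flag, a fully vectorized groupby/transform might be even faster. This is only a sketch of an alternative, not a drop-in replacement (untested on your data, and note it keeps the original row order instead of regrouping variants first):

def mark_variants_vectorized(df):
    # number of rows sharing each ArtikelNr, aligned with the original index
    counts = df.groupby("ArtikelNr")["ArtikelNr"].transform("size")
    out = df.copy()
    out["variante"] = (counts > 1).astype(int)
    return out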
I have a Dataframe of 200k lines; I want to split it into parts and call my function S_Function for each partition.
def S_Function(df):
    #mycode here
    return new_df
Main program
from threading import Thread

N_Threads = 10
Threads = []
Out = []
size = df.shape[0] // N_Threads
for i in range(N_Threads + 1):
    begin = i * size
    end = min(df.shape[0], (i+1)*size)
    Threads.append(Thread(target = S_Function, args = (df[begin:end],)))
I run the threads and join them:
for i in range(N_Threads + 1):
    Threads[i].start()
for i in range(N_Threads + 1):
    Out.append(Threads[i].join())
output = pd.concat(Out)
The code is working perfectly, but the problem is that using threading.Thread did not decrease the execution time.
Sequential code: 16 minutes
Parallel code: 15 minutes
Can someone explain what to improve and why this is not working well?
Don't use threading when you have to process CPU-bound operations: because of Python's GIL, only one thread runs Python bytecode at a time, so threads will not speed up CPU-bound pandas work. To achieve your goal, I think you should use the multiprocessing module.
Try:
import pandas as pd
import numpy as np
import multiprocessing
import time
import functools

# Modify here
CHUNKSIZE = 20000

def S_Function(df, dictionnary):
    # do stuff here
    new_df = df
    return new_df

if __name__ == '__main__':
    # Load your dataframe
    df = pd.DataFrame({'A': np.random.randint(1, 30000000, 200000).tolist()})

    # Create chunks to process
    chunks = (df[i:i+CHUNKSIZE] for i in range(0, len(df), CHUNKSIZE))

    dictionnary = {'k1': 'v1', 'k2': 'v2'}
    s_func = functools.partial(S_Function, dictionnary=dictionnary)

    start = time.time()
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.map(s_func, chunks)
    out = pd.concat(data)
    end = time.time()

    print(f"Elapsed time: {end - start:.2f} seconds")
I was hoping to use parallel processing to accelerate a for loop, but as seen in the example below, it is much slower than the loop. Is there anything wrong with my parallel processing approach? Are there better solutions?
The goal here is to update a column of a dataframe using a pre-defined function that operates on multiple other columns of the dataframe.
import itertools
import pandas as pd
import multiprocessing as mp
import timeit
inputs = [range(50),range(90),range(30)]
inputs_list = list(itertools.product(*inputs))
Index = pd.MultiIndex.from_tuples(inputs_list,names={"a", "b", "c"})
df = pd.DataFrame(index = Index)
df['Output'] = 0
start_p = timeit.timeit()
def Addition(A,B,C):
    df.loc[A,B,C]['Output'] = A+B+C
    return df.loc[A,B,C]['Output']
num_workers = mp.cpu_count()
pool = mp.Pool(num_workers)
df['Output'] = pool.starmap(Addition,inputs_list) # specify the function and arguments to map
pool.close()
pool.join()
end_p = timeit.timeit()
print(end_p - start_p)
start_l = timeit.timeit()
for A in range(50):
    for B in range(90):
        for C in range(30):
            df.loc[A,B,C]['Output'] = A+B+C
end_l = timeit.timeit()
print(end_l - start_l)
A better approach is to first prepare a dict, then make the dataframe out of it. Adding rows to a df one by one is slow.
And as DarkKnight mentioned in a comment, timeit does not make sense here, so I use time.time() instead.
import time

start_l = time.time()
dict_to_df = {}
for A in range(50):
    for B in range(90):
        for C in range(30):
            dict_to_df[A,B,C] = A+B+C
df2 = pd.DataFrame.from_dict(dict_to_df, orient='index', columns=['Output'])
end_l = time.time()
print(end_l - start_l)
0.26 sec on my machine
Assuming the dataframe index is well ordered, you can just do something like this, using numpy vectorization:
import numpy as np
import time

start_l = time.time()
a = np.arange(50)
b = np.arange(90)
c = np.arange(30)
a_plus_b = np.add.outer(a, b).flatten()
a_plus_b_plus_c = np.add.outer(a_plus_b, c).flatten()
df['Output'] = a_plus_b_plus_c
end_l = time.time()
print(end_l - start_l)
0.00044
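If you would rather not rely on the index being well ordered, something like the following should give the same result (an untested sketch that builds the column directly from the MultiIndex levels):

# sum the three index levels elementwise; works regardless of row order
df['Output'] = (
    df.index.get_level_values(0)
    + df.index.get_level_values(1)
    + df.index.get_level_values(2)
)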
Explanation of what I'm trying to accomplish:
I have a dataframe to iterate over, looking for some condition given a variable.
I have a list of variables and I iterate over this df using multiprocessing. I pop(0) every time a process starts.
Now I need to add one more level, but I can't understand how to do it.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
import decimal
import multiprocessing
from multiprocessing import Pool, Manager
import itertools
#dataframe
columns = ['A', 'B', 'C', 'D']
data = np.array([np.random.randint(1, 10_000, size=750)]*4).T
df = pd.DataFrame(data, columns= columns)
print(df)
# Creating a list of tuples to apply a given function
a = np.arange(5,20, 1)
b = np.arange(1.01, 1.10, 0.01)
d = np.arange(0.95, 0.99, 0.01)
c = list(itertools.product(a, b, d))
list_of_tuples = []
dic = {}
for x in c:
    dic[(x)] = x
for key, value in dic.items():
    uno, due, tre = value[0], value[1], value[2]
    list_of_tuples.append((uno, due, tre))
print(len(dic)) #checking size of dictionary
print(len(list_of_tuples), len(df)) #checking if size match
maximum = max(dic, key=dic.get) #maximum key inside dictionary
print(maximum, dic[maximum])
new_dic = {}
i = 1
#look_back_period = (len(df) // 10)
#print(look_back_period)
c = 0
"""chunks is the only way where I could use pool.map, it should be a list of list"""
chunks = [list_of_tuples[i::len(list_of_tuples)] for i in range(len(list_of_tuples))]
print(len(chunks[0]))
#this manager is needed to have every process append to the same Dict the result of the
# function that is given below
manager = Manager()
new_dic = manager.dict()
def multi_prova(list_of_tuples):
    list_results = []
    given1, given2, given3 = list_of_tuples.pop(0)
    #sliding_window = df.iloc[0 : c + look_back_period, : ]
    for row in df.itertuples():
        result = (given1 / row.A).round(2)
        list_results.append(result)
        new_dic[str(given1)+', ' + str(given2)+', ' + str(given2)] = result
time1 = time.time()
if __name__ == "__main__":
    try:
        pool = Pool() # Make the Pool of workers
        results = pool.map(multi_prova, chunks) #Open the urls in their own threads
        pool.close() #close the pool and wait for the work to finish
        pool.join()
    except:
        print('error')
time2 = time.time()
print(time2 - time1)
#On my original code len(new_dic) matched len(dic), here is 750 vs 150, don't know why?!?!?!
print(new_dic)
print(len(new_dic))
Shouldn't len(new_dic) == len(dic)?
There are 750 rows, and a result for every row should be 'appended' to the dictionary.
So the problems are two:
why len(new_dic) is not 750;
and on top of that, I would like to have a sliding window to iterate over a slice of the dataframe and get a dictionary of lists of lists with all the results of every slice of the df while c + look_back_period < len(df).
Hope I was clear enough.
A big hug to anyone who can contribute.
I wanted to parallelize df.corr() using the multiprocessing module in Python. I'm taking one column and computing correlation values with all the remaining columns in one process, and a second column with the rest of the columns in another process. I'm continuing in this fashion to fill the upper triangle of the correlation matrix by stacking up the result rows from all the processes.
I took sample data of shape (678461, 210) and tried my parallelized method and df.corr(), and got running times of 214.40s and 42.64s respectively. So, my parallelized method is taking more time.
Is there a way to improve this?
import multiprocessing as mp
import pandas as pd
import numpy as np
from time import *
def _correlation(args):
    i, mat, mask = args
    ac = mat[i]
    arr = []
    for j in range(len(mat)):
        if i > j:
            continue
        bc = mat[j]
        valid = mask[i] & mask[j]
        if valid.sum() < 1:
            c = np.nan
        elif i == j:
            c = 1.
        elif not valid.all():
            c = np.corrcoef(ac[valid], bc[valid])[0, 1]
        else:
            c = np.corrcoef(ac, bc)[0, 1]
        arr.append((j, c))
    return arr
def correlation_multi(df):
    numeric_df = df._get_numeric_data()
    cols = numeric_df.columns
    mat = numeric_df.values
    mat = pd.core.common._ensure_float64(mat).T
    K = len(cols)
    correl = np.empty((K, K), dtype=float)
    mask = np.isfinite(mat)

    pool = mp.Pool(processes=4)
    ret_list = pool.map(_correlation, [(i, mat, mask) for i in range(len(mat))])

    for i, arr in enumerate(ret_list):
        for l in arr:
            j = l[0]
            c = l[1]
            correl[i, j] = c
            correl[j, i] = c

    return pd.DataFrame(correl, index = cols, columns = cols)
if __name__ == '__main__':
    noise = pd.DataFrame(np.random.randint(0,100,size=(100000, 50)))
    noise2 = pd.DataFrame(np.random.randint(100,200,size=(100000, 50)))
    df = pd.concat([noise, noise2], axis=1)

    #Single process correlation
    start = time()
    s = df.corr()
    print('Time taken: ',time()-start)

    #Multi process correlation
    start = time()
    s1 = correlation_multi(df)
    print('Time taken: ',time()-start)
The results from _correlation have to be moved from the worker processes to the process running the Pool via interprocess communication.
This means that the return data is pickled, sent to the other process, unpickled and added to the result list.
This takes time and is by nature a sequential process.
And map processes the returns in the order they were sent, IIRC. So if one iteration takes relatively long, other results might be stuck waiting. You could try using imap_unordered, which yields results as soon as they arrive.
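For example, here is a rough sketch of how the pool part could be rewritten with imap_unordered. The helper _correlation_indexed, the chunksize value, and the np.asarray conversion (in place of the private pd.core.common._ensure_float64 call) are my own additions, so treat this as a starting point rather than a drop-in replacement:

# hypothetical wrapper that also returns the row index, so rows can be
# placed correctly even when results arrive out of order
def _correlation_indexed(args):
    i = args[0]
    return i, _correlation(args)

def correlation_multi_unordered(df):
    numeric_df = df._get_numeric_data()
    cols = numeric_df.columns
    mat = np.asarray(numeric_df.values, dtype=float).T  # instead of the private pandas helper
    K = len(cols)
    correl = np.empty((K, K), dtype=float)
    mask = np.isfinite(mat)

    with mp.Pool(processes=4) as pool:
        args = [(i, mat, mask) for i in range(len(mat))]
        # chunksize is a guess; results are consumed as soon as any worker finishes
        for i, arr in pool.imap_unordered(_correlation_indexed, args, chunksize=8):
            for j, c in arr:
                correl[i, j] = c
                correl[j, i] = c

    return pd.DataFrame(correl, index=cols, columns=cols)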