I was hoping to use parallel processing to accelerate a for loop, but as seen in the example below, it is much slower than the loop. Is there anything wrong with my parallel-processing approach? Are there better solutions?
The goal here is to update a column of a dataframe using a pre-defined function that operates on multiple other columns of the dataframe.
import itertools
import pandas as pd
import multiprocessing as mp
import timeit
inputs = [range(50),range(90),range(30)]
inputs_list = list(itertools.product(*inputs))
Index = pd.MultiIndex.from_tuples(inputs_list,names={"a", "b", "c"})
df = pd.DataFrame(index = Index)
df['Output'] = 0
start_p = timeit.timeit()
def Addition(A,B,C):
    df.loc[A,B,C]['Output'] = A+B+C
    return df.loc[A,B,C]['Output']
num_workers = mp.cpu_count()
pool = mp.Pool(num_workers)
df['Output'] = pool.starmap(Addition,inputs_list) # specify the function and arguments to map
pool.close()
pool.join()
end_p = timeit.timeit()
print(end_p - start_p)
start_l = timeit.timeit()
for A in range(50):
    for B in range(90):
        for C in range(30):
            df.loc[A,B,C]['Output'] = A+B+C
end_l = timeit.timeit()
print(end_l - start_l)
A better approach is to first prepare a dict and then build the dataframe from it; adding rows to the dataframe one by one is slow.
And as DarkKnight mentioned in a comment, timeit does not make sense here: timeit.timeit() called without a statement just times a pass statement, so the printed differences are meaningless. I use time.time() instead.
import time

start_l = time.time()
dict_to_df = {}
for A in range(50):
    for B in range(90):
        for C in range(30):
            dict_to_df[A,B,C] = A+B+C
df2 = pd.DataFrame.from_dict(dict_to_df, orient='index', columns=['Output'])
end_l = time.time()
print(end_l - start_l)
0.26 sec on my machine
Assuming the dataframe index is well ordered, you can just do something like this, using numpy vectorization:
import numpy as np
import time
start_l = time.time()
a = np.arange(50)
b = np.arange(90)
c = np.arange(30)
a_plus_b = np.add.outer(a, b).flatten()
a_plus_b_plus_c = np.add.outer(a_plus_b, c).flatten()
df['Output'] = a_plus_b_plus_c
end_l = time.time()
print(end_l - start_l)
0.00044 sec
I have a DataFrame of 200k lines; I want to split it into parts and call my function S_Function for each partition.
def S_Function(df):
    # my code here
    return new_df
Main program
from threading import Thread

N_Threads = 10
Threads = []
Out = []
size = df.shape[0] // N_Threads
for i in range(N_Threads + 1):
    begin = i * size
    end = min(df.shape[0], (i+1)*size)
    Threads.append(Thread(target=S_Function, args=(df[begin:end],)))
I run the threads and join them:
for i in range(N_Threads + 1):
    Threads[i].start()
for i in range(N_Threads + 1):
    Out.append(Threads[i].join())
output = pd.concat(Out)
The code is working perfectly but the problem is that using threading.Thread did not decrease the execution time.
Sequential Code : 16 minutes
Parallel Code : 15 minutes
Can someone explain what to improve and why this is not working well?
Don't use threading when you have to process CPU-bound operations: because of Python's GIL, only one thread executes Python bytecode at a time, so threads mainly help with I/O-bound work. To achieve your goal, I think you should use the multiprocessing module.
Try:
import pandas as pd
import numpy as np
import multiprocessing
import time
import functools
# Modify here
CHUNKSIZE = 20000
def S_Function(df, dictionnary):
    # do stuff here
    new_df = df
    return new_df

if __name__ == '__main__':
    # Load your dataframe
    df = pd.DataFrame({'A': np.random.randint(1, 30000000, 200000).tolist()})
    # Create chunks to process
    chunks = (df[i:i+CHUNKSIZE] for i in range(0, len(df), CHUNKSIZE))
    dictionnary = {'k1': 'v1', 'k2': 'v2'}
    s_func = functools.partial(S_Function, dictionnary=dictionnary)
    start = time.time()
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.map(s_func, chunks)
        out = pd.concat(data)
    end = time.time()
    print(f"Elapsed time: {end - start:.2f} seconds")
I would like to store the result of the work in a specific variable after multiprocessing, as shown below.
Alternatively, I want to save the results of the job as a CSV file. May I know how to do it?
This is my code:
(I want to get the 'df4' and 'df7' data and save them to a CSV file)
import pandas as pd
from pandas import DataFrame
import time
import multiprocessing
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
df5 = pd.DataFrame()
df6 = pd.DataFrame()
df7 = pd.DataFrame()
df8 = pd.DataFrame()
date = '2011-03', '2011-02' ........ '2021-03'  # there are 120 items in the list
list1 = df1['resion'].drop_duplicates()  # there are 20 items; 'df1' is the original data
#I'd like to divide the list and work on it.
list11 = list1.iloc[0:10]
list12 = list1.iloc[10:20]
#It's a function using 'list11'.
def cal1():
    global df2
    global df3
    global df4
    start = time.time()
    for i, t in enumerate(list11):
        df2 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending=False)
            df4 = df4.append(pd.DataFrame(df3))
        return df4
#It's a function using 'list12'.
def cal2():
    global df5
    global df6
    global df7
    start = time.time()
    for i, t in enumerate(list12):
        df5 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df6 = pd.DataFrame(df5[df5['date'] == d])
            df6['number'] = df6['price'].rank(pct=True, ascending=False)
            df7 = df7.append(pd.DataFrame(df6))
        return df7
## Multiprocessing code
if __name__ == "__main__":
    # creating processes
    p1 = multiprocessing.Process(target=cal1, args=())
    p2 = multiprocessing.Process(target=cal2, args=())
    # starting process 1
    p1.start()
    # starting process 2
    p2.start()
    # wait until process 1 is finished
    p1.join()
    # wait until process 2 is finished
    p2.join()
    # both processes finished
    print("Done!")
It looks like your functions cal1 and cal2 are identical except that they try to assign their results to different global variables. This is not going to work: when you run them in a subprocess, they will assign those globals in the subprocess, but that has no impact whatsoever on the main process from which you started them.
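Here is a minimal, self-contained demonstration (not your code) of that point: a global assigned inside a child process is never visible to the parent, because each process has its own memory space.
import multiprocessing

result = None  # global in the parent process

def worker():
    global result
    result = 42  # this only changes the child process's own copy

if __name__ == "__main__":
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()
    print(result)  # still prints None in the parent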
If you want to map a function to multiple input ranges across multiple processes you can use a process Pool and Pool.map.
For example:
def cal(input_list):
    start = time.time()
    for i, t in enumerate(input_list):
        df2 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending=False)
            df4 = df4.append(pd.DataFrame(df3))
        # I kept your original code unmodified but I'm not really sure this
        # is what to do, because you are returning after one pass through the
        # outer loop. I haven't scrutinized what you are actually trying to
        # do but I suspect this is wrong too.
        return df4
Then create a process pool, and you can divide up the input how you want (or, with a bit of tweaking, you can let Pool.map chunk the input for you and then reduce the outputs from map into a single output; a sketch of that follows the example below):
pool = multiprocessing.Pool(2)
dfs = pool.map(cal, [list1.iloc[0:10], list1.iloc[10:20]])
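As a concrete, hedged sketch of that chunking-and-reduction idea, and of the CSV part of your question: here process_region, the regions list, and the output path "result.csv" are hypothetical stand-ins for your actual per-region work and data, not your real code.
import multiprocessing
import pandas as pd
import numpy as np

def process_region(region):
    # hypothetical stand-in: build a small DataFrame for one region value
    return pd.DataFrame({"resion": [region], "rank": [np.random.rand()]})

if __name__ == "__main__":
    regions = list(range(20))  # placeholder for the 20 values in list1
    with multiprocessing.Pool(2) as pool:
        # chunksize controls how many items are shipped to each worker at once
        parts = pool.map(process_region, regions, chunksize=10)
    out = pd.concat(parts, ignore_index=True)  # reduce to a single DataFrame
    out.to_csv("result.csv", index=False)      # placeholder output path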
This is just to get you started. I would probably do a number of other things differently as well.
My data has 4 columns A-D, which contain integers. I am adding a new column E whose first value is the same as the first value in column D. The next value in E should be the corresponding value in column D if the previous value in column E is negative; otherwise it takes the corresponding value in column C.
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
data=pd.read_excel('/Users/xxxx/Documents/PY Notebooks/Data/yyyy.xlsx')
data1=data.copy()
data1['E']=np.nan
data1.at[0,'E']=data1['D'][0]
l=len(data1)
for i in range(l-1):
    if data1['E'][i] < 0:
        data1.at[i+1,'E'] = data1['D'][i+1]
    else:
        data1.at[i+1,'E'] = data1['C'][i+1]
TL;DR: go to the benchmark code and use Method 1.
Short Answer
No. Vectorization is not possible.
Long Answer
Theorem: For this particular task, the output of a given row cannot be determined from any backward rolling window whose length is smaller than the partial length of the dataframe up to that row.
Thus, there is no way for this output logic to be processed in a vectorized way (see this answer for an idea of how vectorization is performed in CPUs). The output can only be computed from the beginning of the dataframe.
Proof: Consider a target row of a dataframe df. Assume there is a backward rolling window with size n < partial length, so a previous value of df["E"] exists before the window. We denote this previous value by state.
Consider a special case: df["C"] == -1 and df["D"] == 1 within the window.
Case 1 (state < 0): The output within this rolling window will be [1, -1, 1, -1, .....], making the last element (-1)^(n-1)
Case 2 (state >= 0): The output will be [-1, 1, -1, 1, .....], making the last element (-1)^(n)
Therefore, it is possible for the output df["E"] of the target row to be dependent on a state variable outside the window. QED.
Useful Answer
Although vectorization is impossible, that does not mean significant acceleration cannot be achieved. A simple yet very efficient approach is to use a numba-compiled generator to perform the sequential generation. It only requires re-writing your logic as a generator function and adding two additional lines:
import numba
@numba.njit
def my_generator_func():
    ....
Of course, you may have to install numba first. If this is not possible, then using a plain generator without numba optimization is also fine.
Benchmark
The benchmark is performed on an i5-8250U (4C8T) laptop with 16GB RAM running 64-bit Debian 10. The Python version is 3.7.9 and pandas is 1.1.3. n = 10^7 (10 million) records are generated for benchmarking purposes.
Result:
1. numba-njit: 2.48s
2. plain generator (no numba): 5.13s
3. original: 271.15s
A more than 100x efficiency gain is achieved against the original code.
Code
from datetime import datetime
import pandas as pd
import numpy as np
n = 10000000 # a large number of rows
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
#print(df.head())
# ========== Method 1. generator + numba njit ==========
ti = datetime.now()
import numba
@numba.njit
def gen(plus: np.array, minus: np.array):
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state
df["E"] = [i for i in gen(df["C"].values, df["D"].values)]
tf = datetime.now()
print(f"1. numba-njit: {(tf-ti).total_seconds():.2f}s") # 1. numba-njit: 0.47s
# ========== Method 2. Generator without numba ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()
def gen_plain(plus: np.array, minus: np.array):
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state
df["E"] = [i for i in gen_plain(df["C"].values, df["D"].values)]
tf = datetime.now()
print(f"2. plain generator (no numba): {(tf-ti).total_seconds():.2f}s") #
# ========== Method 3. Direct iteration ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()
# code provided by the OP
df['E']=np.nan
df.at[0,'E'] = df['D'][0]
l=len(df)
for i in range(l - 1):
    if df['E'][i] < 0:
        df.at[i+1,'E'] = df['D'][i+1]
    else:
        df.at[i+1,'E'] = df['C'][i+1]
tf = datetime.now()
print(f"3. original: {(tf-ti).total_seconds():.2f}s") # 2. 26.61s
I don't think you can vectorize this operation, since the rows depend on calculations from previous rows. That being said, there is still quite some room for optimization in your function. Let's first check your original implementation with some random data.
import numpy as np
import pandas as pd
import time
size = 10000000
data = np.random.randint(-2, 10, size=size)
data = data.reshape([size//4, 4])
time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
df['E']=np.nan
df.at[0,'E'] = df['D'][0]
for i in range(len(df)-1):
    if df['E'][i] < 0:
        df.at[i+1,'E'] = df['D'][i+1]
    else:
        df.at[i+1,'E'] = df['C'][i+1]
print(f"Operation on pd df took {time.time() - time_start} seconds.")
Output:
Operation on pd df took 84.00791883468628 seconds.
As operations on the DataFrame usually are quite slow, we can operate on the underlying numpy arrays instead.
time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values
e_vals = [d_vals[0]]
last_e = e_vals[0]
for i in range(len(df)-1):
    if last_e < 0:
        last_e = d_vals[i+1]
    else:
        last_e = c_vals[i+1]
    e_vals.append(last_e)
df['E'] = e_vals
print(f"Operation on np array took {time.time() - time_start} seconds.")
Output:
Operation on np array took 2.2387869358062744 seconds.
Now we can argue that for loops are slow in Python and we can use a JIT compiler that can deal with numpy arrays, for instance numba.
import numba
time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values
@numba.jit(nopython=True)
def numba_calc(c_vals, d_vals):
    e_vals = [d_vals[0]]
    last_e = e_vals[0]
    for i in range(len(c_vals)-1):
        if last_e < 0:
            last_e = d_vals[i+1]
        else:
            last_e = c_vals[i+1]
        e_vals.append(last_e)
    return e_vals
df["E"] = numba_calc(c_vals, d_vals)
print(f"Operation on np array with numba took {time.time() - time_start} seconds.")
Output:
Operation on np array with numba took 1.2623450756072998 seconds.
So especially for larger DataFrames, using numba will pay off, while operating on the raw numpy arrays already gives a nice runtime improvement.
Explanation of what I'm trying to accomplish:
I have a dataframe to iterate over, looking for some condition given a variable.
I have a list of variables and I iterate over this df using multiprocessing; I pop(0) every time a process starts.
Now I need to add one more level, but I can't understand how to do it.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
import decimal
import multiprocessing
from multiprocessing import Pool, Manager
import itertools
#dataframe
columns = ['A', 'B', 'C', 'D']
data = np.array([np.random.randint(1, 10_000, size=750)]*4).T
df = pd.DataFrame(data, columns= columns)
print(df)
# Creating a list of tuples to apply a given function
a = np.arange(5,20, 1)
b = np.arange(1.01, 1.10, 0.01)
d = np.arange(0.95, 0.99, 0.01)
c = list(itertools.product(a, b, d))
list_of_tuples = []
dic = {}
for x in c:
    dic[(x)] = x
for key, value in dic.items():
    uno, due, tre = value[0], value[1], value[2]
    list_of_tuples.append((uno, due, tre))
print(len(dic)) #checking size of dictionary
print(len(list_of_tuples), len(df)) #checking if size match
maximum = max(dic, key=dic.get) #maximum key inside dictionary
print(maximum, dic[maximum])
new_dic = {}
i = 1
#look_back_period = (len(df) // 10)
#print(look_back_period)
c = 0
"""chunks is the only way where I could use pool.map, it should be a list of list"""
chunks = [list_of_tuples[i::len(list_of_tuples)] for i in range(len(list_of_tuples))]
print(len(chunks[0]))
#this manager is needed to have every process append to the same Dict the result of the
# function that is given below
manager = Manager()
new_dic = manager.dict()
def multi_prova(list_of_tuples):
    list_results = []
    given1, given2, given3 = list_of_tuples.pop(0)
    #sliding_window = df.iloc[0 : c + look_back_period, : ]
    for row in df.itertuples():
        result = (given1 / row.A).round(2)
        list_results.append(result)
        new_dic[str(given1)+', ' + str(given2)+', ' + str(given2)] = result
time1 = time.time()
if __name__ == "__main__":
    try:
        pool = Pool()  # Make the Pool of workers
        results = pool.map(multi_prova, chunks)  # Open the urls in their own threads
        pool.close()  # close the pool and wait for the work to finish
        pool.join()
    except:
        print('error')
time2 = time.time()
print(time2 - time1)
#On my original code len(new_dic) matched len(dic), here is 750 vs 150, don't know why?!?!?!
print(new_dic)
print(len(new_dic))
Shouldn't len(new_dic) == len(dic)?
There are 750 rows, and a result for every row should be 'appended' to the dictionary.
So the problems are two:
Why len(new_dic) is not 750.
And on top of that, I would like to have a sliding window to iterate over a slice of the dataframe and get a dictionary of lists of lists with all the results of every slice of the df while c + look_back_period < len(df).
Hope I was clear enough.
A big hug to anyone who can contribute.
I wanted to parallelize df.corr() using the multiprocessing module in Python. I take one column and compute correlation values with all the remaining columns in one process, the second column with the remaining columns in another process, and so on, filling the upper triangle of the correlation matrix by stacking up the result rows from all the processes.
I took sample data of shape (678461, 210), tried my parallelized method and df.corr(), and got running times of 214.40s and 42.64s respectively. So my parallelized method is taking more time.
Is there a way to improve this?
import multiprocessing as mp
import pandas as pd
import numpy as np
from time import *
def _correlation(args):
    i, mat, mask = args
    ac = mat[i]
    arr = []
    for j in range(len(mat)):
        if i > j:
            continue
        bc = mat[j]
        valid = mask[i] & mask[j]
        if valid.sum() < 1:
            c = np.nan
        elif i == j:
            c = 1.
        elif not valid.all():
            c = np.corrcoef(ac[valid], bc[valid])[0, 1]
        else:
            c = np.corrcoef(ac, bc)[0, 1]
        arr.append((j, c))
    return arr
def correlation_multi(df):
    numeric_df = df._get_numeric_data()
    cols = numeric_df.columns
    mat = numeric_df.values
    mat = pd.core.common._ensure_float64(mat).T
    K = len(cols)
    correl = np.empty((K, K), dtype=float)
    mask = np.isfinite(mat)
    pool = mp.Pool(processes=4)
    ret_list = pool.map(_correlation, [(i, mat, mask) for i in range(len(mat))])
    for i, arr in enumerate(ret_list):
        for l in arr:
            j = l[0]
            c = l[1]
            correl[i, j] = c
            correl[j, i] = c
    return pd.DataFrame(correl, index=cols, columns=cols)
if __name__ == '__main__':
    noise = pd.DataFrame(np.random.randint(0,100,size=(100000, 50)))
    noise2 = pd.DataFrame(np.random.randint(100,200,size=(100000, 50)))
    df = pd.concat([noise, noise2], axis=1)
    # Single process correlation
    start = time()
    s = df.corr()
    print('Time taken: ', time()-start)
    # Multi process correlation
    start = time()
    s1 = correlation_multi(df)
    print('Time taken: ', time()-start)
The results from _correlation have to be moved from the worker processes to the process running the Pool via interprocess communication.
This means that the return data is pickled, sent to the other process, unpickled and added to the result list.
This takes time and is by nature a sequential process.
And map processes the returns in the order they were sent, IIRC. So if one iteration takes relatively long, other results might be stuck waiting. You could try using imap_unordered, which yields results as soon as they arrive.
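A minimal sketch of imap_unordered (the square function and the input range are just illustrative, not your correlation code):
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        # results are yielded as soon as any worker finishes, not in input order
        for result in pool.imap_unordered(square, range(10), chunksize=2):
            print(result)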