I have a DataFrame of 200k rows that I want to split into parts and pass to my function S_Function, one call per partition.
def S_Function(df):
    # my code here
    return new_df
Main program
N_Threads = 10
Threads = []
Out = []
size = df.shape[0] // N_Threads

for i in range(N_Threads + 1):
    begin = i * size
    end = min(df.shape[0], (i + 1) * size)
    Threads.append(Thread(target=S_Function, args=(df[begin:end],)))
Then I start the threads and join them:
for i in range(N_Threads + 1):
    Threads[i].start()

for i in range(N_Threads + 1):
    Out.append(Threads[i].join())

output = pd.concat(Out)
The code works, but the problem is that using threading.Thread did not decrease the execution time:
Sequential code: 16 minutes
Parallel code: 15 minutes
Can someone explain what I should improve and why this is not working well?
Don't use threading when you have to process CPU-bound operations: because of Python's Global Interpreter Lock (GIL), only one thread executes Python bytecode at a time, so CPU-bound threads do not actually run in parallel. To achieve your goal, I think you should use the multiprocessing module instead.
Try:
import pandas as pd
import numpy as np
import multiprocessing
import time
import functools

# Modify here
CHUNKSIZE = 20000

def S_Function(df, dictionnary):
    # do stuff here
    new_df = df
    return new_df

if __name__ == '__main__':
    # Load your dataframe
    df = pd.DataFrame({'A': np.random.randint(1, 30000000, 200000).tolist()})

    # Create chunks to process
    chunks = (df[i:i+CHUNKSIZE] for i in range(0, len(df), CHUNKSIZE))
    dictionnary = {'k1': 'v1', 'k2': 'v2'}
    s_func = functools.partial(S_Function, dictionnary=dictionnary)

    start = time.time()
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.map(s_func, chunks)
    out = pd.concat(data)
    end = time.time()
    print(f"Elapsed time: {end - start:.2f} seconds")
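If you prefer the standard-library concurrent.futures API, here is a minimal sketch of the same idea with ProcessPoolExecutor, assuming S_Function, df, dictionnary and CHUNKSIZE are defined as in the example above:

import concurrent.futures
import functools
import pandas as pd

if __name__ == '__main__':
    s_func = functools.partial(S_Function, dictionnary=dictionnary)
    chunks = [df[i:i+CHUNKSIZE] for i in range(0, len(df), CHUNKSIZE)]
    # executor.map sends one chunk per task to the worker processes
    with concurrent.futures.ProcessPoolExecutor() as executor:
        out = pd.concat(executor.map(s_func, chunks))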
I was hoping to use parallel processing to accelerate a for loop, but as seen in the example below, it is much slower than the loop. Is there anything wrong with my parallel processing approach? Are there better solutions?
The goal here is to update a column of a dataframe using a pre-defined function that operates on multiple other columns of the dataframe.
import itertools
import pandas as pd
import multiprocessing as mp
import timeit

inputs = [range(50), range(90), range(30)]
inputs_list = list(itertools.product(*inputs))
Index = pd.MultiIndex.from_tuples(inputs_list, names={"a", "b", "c"})
df = pd.DataFrame(index=Index)
df['Output'] = 0

start_p = timeit.timeit()

def Addition(A, B, C):
    df.loc[A, B, C]['Output'] = A + B + C
    return df.loc[A, B, C]['Output']

num_workers = mp.cpu_count()
pool = mp.Pool(num_workers)
df['Output'] = pool.starmap(Addition, inputs_list)  # specify the function and arguments to map
pool.close()
pool.join()

end_p = timeit.timeit()
print(end_p - start_p)

start_l = timeit.timeit()

for A in range(50):
    for B in range(90):
        for C in range(30):
            df.loc[A, B, C]['Output'] = A + B + C

end_l = timeit.timeit()
print(end_l - start_l)
A better approach is to first prepare a dict and then build the dataframe from it. Adding rows to a dataframe one by one is slow.
And as DarkKnight mentioned in a comment, timeit.timeit() does not make sense here; I use time.time() instead.
import time

start_l = time.time()

dict_to_df = {}
for A in range(50):
    for B in range(90):
        for C in range(30):
            dict_to_df[A, B, C] = A + B + C

df2 = pd.DataFrame.from_dict(dict_to_df, orient='index', columns=['Output'])

end_l = time.time()
print(end_l - start_l)
0.26 seconds on my machine.
Assuming the dataframe index is well ordered, you can just do something like this, using numpy vectorization:
import numpy as np
import time

start_l = time.time()

a = np.arange(50)
b = np.arange(90)
c = np.arange(30)

a_plus_b = np.add.outer(a, b).flatten()
a_plus_b_plus_c = np.add.outer(a_plus_b, c).flatten()
df['Output'] = a_plus_b_plus_c

end_l = time.time()
print(end_l - start_l)
0.00044 seconds.
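If you would rather not depend on the index ordering at all, here is a small variant (my own sketch, not part of the answer above) that computes the sum directly from the MultiIndex levels:

# sum the three integer levels of the MultiIndex, so row order does not matter
df['Output'] = (
    df.index.get_level_values(0).to_numpy()
    + df.index.get_level_values(1).to_numpy()
    + df.index.get_level_values(2).to_numpy()
)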
I would like to store the results of the work in specific variables after multiprocessing, as shown below.
Alternatively, I want to save the results of the job as a CSV file. May I know how to do that?
This is my code:
(I want to get the 'df4' and 'df7' data and save them to a CSV file.)
import pandas as pd
from pandas import DataFrame
import time
import multiprocessing

df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
df5 = pd.DataFrame()
df6 = pd.DataFrame()
df7 = pd.DataFrame()
df8 = pd.DataFrame()

date = '2011-03', '2011-02', ..., '2021-03'  # there are 120 entries
list1 = df1['resion'].drop_duplicates()  # there are 20 entries; 'df1' is the original data

# I'd like to divide the list and work on it.
list11 = list1.iloc[0:10]
list12 = list1.iloc[10:20]

# It's a function using 'list11'.
def cal1():
    global df2
    global df3
    global df4
    start = time.time()
    for i, t in enumerate(list11):
        df2 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is the original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending=False)
            df4 = df4.append(pd.DataFrame(df3))
        return df4

# It's a function using 'list12'.
def cal2():
    global df5
    global df6
    global df7
    start = time.time()
    for i, t in enumerate(list12):
        df5 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is the original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df6 = pd.DataFrame(df5[df5['date'] == d])
            df6['number'] = df6['price'].rank(pct=True, ascending=False)
            df7 = df7.append(pd.DataFrame(df6))
        return df7

## Multiprocessing code
if __name__ == "__main__":
    # creating processes
    p1 = multiprocessing.Process(target=cal1, args=())
    p2 = multiprocessing.Process(target=cal2, args=())

    # starting process 1
    p1.start()
    # starting process 2
    p2.start()

    # wait until process 1 is finished
    p1.join()
    # wait until process 2 is finished
    p2.join()

    # both processes finished
    print("Done!")
It looks like your functions cal1 and cal2 are identical except that they are trying to assign results to some different global variables. This is not going to work, because when you run them in a subprocess, they will assign that global variable in the subprocess, but that will have no impact whatsoever on the main process from which you started them.
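A minimal illustration of that point (my own sketch, not part of the original code):

import multiprocessing

result = None

def work():
    global result
    result = 42  # only modifies the child process's copy of the module

if __name__ == "__main__":
    p = multiprocessing.Process(target=work)
    p.start()
    p.join()
    print(result)  # prints None: the parent's global is untouched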
If you want to map a function to multiple input ranges across multiple processes you can use a process Pool and Pool.map.
For example:
def cal(input_list):
    start = time.time()
    for i, t in enumerate(input_list):
        df2 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is the original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending=False)
            df4 = df4.append(pd.DataFrame(df3))
        # I kept your original code unmodified, but I'm not really sure this
        # is what to do, because you are returning after one pass through the
        # outer loop. I haven't scrutinized what you are actually trying to
        # do, but I suspect this is wrong too.
        return df4
Then create a process pool and you can divide up the input how you want (or, with a bit of tweaking, you can let Pool.map chunk the input for you, and then reduce the outputs from map into a single output):
pool = multiprocessing.Pool(2)
dfs = pool.map(cal, [list1.iloc[0:10], list1.iloc[10:20]])
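For the variant mentioned above, where Pool.map does the chunking for you and you reduce the outputs afterwards, a rough sketch could look like this (cal_one is a hypothetical version of cal that handles a single 'resion' value and returns a dataframe):

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        # chunksize controls how many items each worker receives per task
        partial_dfs = pool.map(cal_one, list(list1), chunksize=10)
    out = pd.concat(partial_dfs)  # reduce the per-item results into one dataframe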
This is just to get you started. I would probably do a number of other things differently as well.
I have one stream of data that is coming in very fast, and when a new data point arrives, I would like to make 6 different calculations based on it.
I would like to make those calculations as fast as possible so I can update as soon as I receive new data.
The data can arrive every few milliseconds, so my calculations must be very fast.
The best thing I could think of was to run those calculations on 6 different threads at the same time.
I have never used threads before, so I don't know where to start.
This is the code that describes my problem.
What can I do from here?
import numpy as np
import time

np.random.seed(0)

def calculation_1(data, multiplicator):
    r = np.log(data * (multiplicator + 1))
    return r

start = time.time()

for ii in range(1000000):
    data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]

    # calculations that have to be done together
    calc_1 = calculation_1(data=data_stream_main[0], multiplicator=2)
    calc_2 = calculation_1(data=data_stream_main[0], multiplicator=3)
    calc_3 = calculation_1(data=data_stream_main[1], multiplicator=2)
    calc_4 = calculation_1(data=data_stream_main[1], multiplicator=3)
    calc_5 = calculation_1(data=data_stream_main[2], multiplicator=2)
    calc_6 = calculation_1(data=data_stream_main[2], multiplicator=3)

    print(calc_1)
    print(calc_2)
    print(calc_3)
    print(calc_4)
    print(calc_5)
    print(calc_6)

print("total time:", time.time() - start)
You can use either multiprocessing.pool.Pool or concurrent.futures.ProcessPoolExecutor to create a pool of 6 processes to which you submit your 6 tasks on each loop iteration, execute them in parallel, and await the results. The following example uses multiprocessing.pool.Pool.
But the result will be very disappointing.
The problem is that (1) there is overhead in initially creating the 6 processes and (2) there is overhead in queueing up each task to execute in the separate address space the subprocesses live in. This means that for multiprocessing to be advantageous, your worker function, calculation_1 in this case, needs to be a less trivial, longer-running, more CPU-intensive function. If you were to add to your worker function the following "do-nothing", CPU-intensive loop ...
cnt = 0
for i in range(100000):
    cnt += 1
... then the following multiprocessing code would run several times more quickly. As is, stick with what you have.
import numpy as np
import multiprocessing as mp
import time

def calculation_1(data, multiplicator):
    r = np.log(data * (multiplicator + 1))
    """
    cnt = 0
    for i in range(100000):
        cnt += 1
    """
    return r

# required for Windows and other platforms that use spawn for creating new processes:
if __name__ == '__main__':
    np.random.seed(0)

    # no point in using more processes than processors:
    n_processors = min(6, mp.cpu_count())
    pool = mp.Pool(n_processors)

    start = time.time()
    for ii in range(1000000):
        data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]

        # calculations that have to be done together
        # submit tasks:
        result_1 = pool.apply_async(calculation_1, (data_stream_main[0], 2))
        result_2 = pool.apply_async(calculation_1, (data_stream_main[0], 3))
        result_3 = pool.apply_async(calculation_1, (data_stream_main[1], 2))
        result_4 = pool.apply_async(calculation_1, (data_stream_main[1], 3))
        result_5 = pool.apply_async(calculation_1, (data_stream_main[2], 2))
        result_6 = pool.apply_async(calculation_1, (data_stream_main[2], 3))

        # wait for results:
        calc_1 = result_1.get()
        calc_2 = result_2.get()
        calc_3 = result_3.get()
        calc_4 = result_4.get()
        calc_5 = result_5.get()
        calc_6 = result_6.get()

        print(calc_1)
        print(calc_2)
        print(calc_3)
        print(calc_4)
        print(calc_5)
        print(calc_6)

    print("total time:", time.time() - start)
You could factor the calculation by separating log(data) from log(multiplicator + 1).
Since np.log(data * (multiplicator + 1)) is the same as np.log(data) + np.log(multiplicator + 1), you can compute and store the two possible values of np.log(multiplicator + 1) in global variables, and then compute np.log(data) only once per incoming value, saving roughly 50% of the work on that part.
# global variables and calculation function:
multiplicator2 = np.log(3)
multiplicator3 = np.log(4)

def calculation_1(data):
    logData = np.log(data)
    return logData + multiplicator2, logData + multiplicator3

# in the loop: ...
calc_1, calc_2 = calculation_1(data_stream_main[0])
calc_3, calc_4 = calculation_1(data_stream_main[1])
calc_5, calc_6 = calculation_1(data_stream_main[2])
If you can afford to buffer several rows of data into a numpy matrix before outputting the result, you may get some performance improvement by using numpy's vectorized operations to perform the calculation on a whole chunk at a time and output the results in chunks instead of one row at a time. Separating the reception of the data from the computation and output is where the use of threads may provide a benefit (a sketch of that separation follows the example below).
For example:
start = time.time()
chunk = []
multiplicators = np.array([2, 2, 2, 3, 3, 3])

for ii in range(1000000):
    data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
    chunk.append(data_stream_main * 2)
    if len(chunk) < 1000:
        continue
    # process 1000 lines at a time and output results
    calcs = np.log(np.array(chunk) * multiplicators)
    calc_1, calc_4, calc_2, calc_5, calc_3, calc_6 = calcs[-1, :]
    chunk = []  # reset chunk

print("total time:", time.time() - start)  # 2.7 (compared to 6.6)
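As a rough sketch of that separation (my own illustration, not part of the answer above): one thread receives rows and puts them on a queue, while the main thread pulls them off, accumulates a chunk, and runs the vectorized computation. The receiver function below just simulates the data source.

import queue
import threading
import numpy as np

q = queue.Queue()

def receiver():
    # placeholder producer: replace with the real incoming data stream
    for ii in range(1000000):
        q.put([np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)])
    q.put(None)  # sentinel: no more data

threading.Thread(target=receiver, daemon=True).start()

multiplicators = np.array([2, 2, 2, 3, 3, 3])
chunk = []
while True:
    row = q.get()
    if row is None:
        break
    chunk.append(row * 2)  # duplicate the 3 values so each gets both multiplicators
    if len(chunk) < 1000:
        continue
    calcs = np.log(np.array(chunk) * multiplicators)  # vectorized over the whole chunk
    chunk = []  # reset for the next chunk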
I want to translate a huge MATLAB model to Python, so I need to work on the key functions first. One key function handles parallel processing. Basically, a matrix of parameters is the input, where every row represents the parameters for one run. These parameters are used within a computation-heavy function. This computation-heavy function should run in parallel; I don't need the results of a previous run for any other run, so all processes can run independently of each other.
Why is starmap_async slower on my PC? Also, when I add more code (to test consecutive computation), my Python crashes (I use Spyder). Can you give me advice?
import time
import numpy as np
import multiprocessing as mp
from functools import partial

# Create simulated data matrix
data = np.random.random((100, 3000))
data = np.column_stack((np.arange(1, len(data) + 1, 1), data))

def EAF_DGL(*z, package_num):
    sum_row = 0
    for i in range(1, np.shape(z)[0]):
        sum_row = sum_row + z[i]
    func_result = np.column_stack((package_num, z[0], sum_row))
    return func_result

t0 = time.time()

if __name__ == "__main__":
    package_num = 1
    help_EAF_DGL = partial(EAF_DGL, package_num=1)
    with mp.Pool() as pool:
        #result = pool.starmap(partial(EAF_DGL, package_num), [(data[i]) for i in range(0, np.shape(data)[0])])
        result = pool.starmap_async(help_EAF_DGL, [(data[i]) for i in range(0, np.shape(data)[0])]).get()
        pool.close()
        pool.join()

t1 = time.time()
calculation_time_parallel_async = t1 - t0
print(calculation_time_parallel_async)

t2 = time.time()

if __name__ == "__main__":
    package_num = 1
    help_EAF_DGL = partial(EAF_DGL, package_num=1)
    with mp.Pool() as pool:
        #result = pool.starmap(partial(EAF_DGL, package_num), [(data[i]) for i in range(0, np.shape(data)[0])])
        result = pool.starmap(help_EAF_DGL, [(data[i]) for i in range(0, np.shape(data)[0])])
        pool.close()
        pool.join()

t3 = time.time()
calculation_time_parallel = t3 - t2
print(calculation_time_parallel)
Explanation of what I'm trying to accomplish:
I have a dataframe to iterate over, looking for some condition given a variable.
I have a list of variables and I iterate over this df using multiprocessing. I pop(0) every time a process starts.
Now I need to add one more level, but I can't understand how to do it.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
import decimal
import multiprocessing
from multiprocessing import Pool, Manager
import itertools

# dataframe
columns = ['A', 'B', 'C', 'D']
data = np.array([np.random.randint(1, 10_000, size=750)] * 4).T
df = pd.DataFrame(data, columns=columns)
print(df)

# Creating a list of tuples to apply a given function
a = np.arange(5, 20, 1)
b = np.arange(1.01, 1.10, 0.01)
d = np.arange(0.95, 0.99, 0.01)
c = list(itertools.product(a, b, d))

list_of_tuples = []
dic = {}
for x in c:
    dic[(x)] = x
for key, value in dic.items():
    uno, due, tre = value[0], value[1], value[2]
    list_of_tuples.append((uno, due, tre))

print(len(dic))  # checking size of dictionary
print(len(list_of_tuples), len(df))  # checking if sizes match

maximum = max(dic, key=dic.get)  # maximum key inside dictionary
print(maximum, dic[maximum])

new_dic = {}
i = 1
#look_back_period = (len(df) // 10)
#print(look_back_period)
c = 0

"""chunks is the only way I could get pool.map to work; it should be a list of lists"""
chunks = [list_of_tuples[i::len(list_of_tuples)] for i in range(len(list_of_tuples))]
print(len(chunks[0]))

# this manager is needed so that every process appends the result of the
# function given below to the same dict
manager = Manager()
new_dic = manager.dict()

def multi_prova(list_of_tuples):
    list_results = []
    given1, given2, given3 = list_of_tuples.pop(0)
    #sliding_window = df.iloc[0 : c + look_back_period, : ]
    for row in df.itertuples():
        result = (given1 / row.A).round(2)
        list_results.append(result)
        new_dic[str(given1) + ', ' + str(given2) + ', ' + str(given2)] = result

time1 = time.time()

if __name__ == "__main__":
    try:
        pool = Pool()  # make the Pool of workers
        results = pool.map(multi_prova, chunks)  # run the function on each chunk in its own process
        pool.close()  # close the pool and wait for the work to finish
        pool.join()
    except:
        print('error')

time2 = time.time()
print(time2 - time1)

# In my original code len(new_dic) matched len(dic); here it is 750 vs 150, I don't know why.
print(new_dic)
print(len(new_dic))
Shouldn't len(new_dic) == len(dic)?
There are 750 rows, and a result for every row should be 'appended' to the dictionary.
So there are two problems:
Why is len(new_dic) not 750?
And on top of that, I would like to have a sliding window to iterate over a slice of the dataframe and get a dictionary of lists of lists with all the results of every slice of the df while c + look_back_period < len(df).
Hope I was clear enough.
A big hug to anyone who can contribute.