Parallel processing of large dataframes - python

I have a large Pandas DataFrame with columns a and b (float values acting as coordinates) and a column c (values), which has to be binned and summarized over a certain interval of steps along a and b. The order of the result matters, since a and b simulate coordinates where samples with value c were taken. In the next step, the results would be reshaped into an image and processed further.
This can be solved using nested loops (see below); however, that obviously does not scale well to larger datasets or smaller step sizes.
Example:
import pandas as pd
import time
import numpy as np
a = np.random.random(int(10E7))
b = np.random.random(int(10E7))
c = np.random.random(int(10E7))
df = pd.DataFrame({'a':a, 'b':b, 'values': c})
stepsize = 0.1
means = []
T1 = time.time()
for j in np.arange(0, 1, stepsize):
    for i in np.arange(0, 1, stepsize):
        selection = df[ (df.a > i) & (df.a <= i+stepsize) & (df.b > j) & (df.b <= j+stepsize) ]
        means.append(selection['values'].mean())
T2 = time.time()
I have been wondering how this can be resolved using multiprocessing (or multithreading?).
Therefore, I have set up the following code, though I am stuck: I am not sure if the meanByAB function is set up correctly and if the multiprocessing has been initiated correctly.
import multiprocessing as mp

stepsize = 0.01 # smaller stepsize
A_vals = list(np.arange(0, 1, stepsize))
B_vals = list(np.arange(0, 1, stepsize))

def meanByAB(A, B):
    # df and stepsize are loaded globally. Does it make sense?!
    selection = df[ (df.a > A) & (df.a <= A+stepsize) & (df.b > B) & (df.b <= B+stepsize) ]
    mean = np.mean(selection['values'])

if __name__ == '__main__':
    T1 = time.time()
    p = mp.Process(target=meanByAB, args=(A_vals, B_vals))  # tried many things here; yields ValueError
    p.start()
    p.join()
    T2 = time.time()
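As a point of reference, here is a minimal sketch (not from the original post) of how the same grid of (A, B) pairs could be handed to a multiprocessing.Pool instead of a single Process. It assumes df, stepsize, A_vals and B_vals are defined at module level so that forked worker processes can see them (this relies on fork, so it does not hold on Windows, and each worker receives a copy of df), and it adds a return value to meanByAB so the pool can collect the results in grid order:

import itertools
import multiprocessing as mp

def meanByAB(A, B):
    # same bin selection as the nested-loop version above, but returning the mean
    selection = df[ (df.a > A) & (df.a <= A+stepsize) & (df.b > B) & (df.b <= B+stepsize) ]
    return selection['values'].mean()

if __name__ == '__main__':
    grid = list(itertools.product(A_vals, B_vals))  # all (A, B) pairs, in order
    with mp.Pool(processes=4) as pool:
        means = pool.starmap(meanByAB, grid)        # one mean per (A, B), grid order preserved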

This is my solution. The results are not ordered, because I am not sure which ordering you need. However, each element of the result list is a tuple of the form (mean, A, B), where A and B are the values you specify. Afterwards, you can sort the values by A and B, or by mean if you need to (see the sketch after the code below):
import pandas as pd
import numpy as np
# from multiprocessing import Process, Manager, Array, Queue
# import multiprocessing as mp
from threading import Thread
import threading as th
from queue import Queue
import time
def worker(df, q_input, q_output, stepsize):
    while True:
        A = q_input.get()
        if A == 'DONE':
            break
        for i, B in enumerate(np.arange(0, 1, stepsize)):
            selection = df[ (df.a > A) & (df.a <= A+stepsize) & (df.b > B) & (df.b <= B+stepsize) ]
            mean = np.mean(selection['values'])
            q_output.put((mean, A, B))

def dump_queue(queue):
    result = []
    while not queue.empty():
        result.append(queue.get(timeout=0.01))
    return result

if __name__ == '__main__':
    start_time = time.time()
    start_initialization = time.time()
    NUM_THREADS = 4

    a = np.random.random(int(10E7))
    b = np.random.random(int(10E7))
    c = np.random.random(int(10E7))
    df = pd.DataFrame({'a': a, 'b': b, 'values': c})
    stepsize = 0.1
    means = []

    q_input = Queue()
    for j in np.arange(0, 1, stepsize):
        q_input.put(j)
    for _ in range(NUM_THREADS):
        q_input.put('DONE')
    q_output = Queue()
    print(f"Initialization terminates in {time.time() - start_initialization}")

    threads = [Thread(target=worker, args=(df, q_input, q_output, stepsize), name=f"Worker-{i}") for i in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"Finished in {time.time() - start_time}")

    result = dump_queue(q_output)
    print(result)
    print(len(result))
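For reference, a minimal sketch (my own assumption about the intended use, not part of the original answer) of the a-posteriori sorting mentioned above, ordering the (mean, A, B) tuples by their coordinates:

result.sort(key=lambda t: (t[1], t[2]))      # sort by A, then by B
ordered_means = [t[0] for t in result]       # means in grid order
# the list can then be reshaped into a 2D grid, e.g.
# image = np.array(ordered_means).reshape(int(1 / stepsize), -1)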
EDIT
The edited version uses multithreading instead of multiprocessing: multithreading is lighter than multiprocessing, so I think it should be the better choice for this problem.

Related

Accelerate parallel processing in Python

I was hoping to use parallel processing to accelerate a for loop, but as seen in the example below, it is much slower than the loop. Is there anything wrong with my parallel processing approach? Are there better solutions?
The goal here is to update a column of a dataframe using a pre-defined function that operates on multiple other columns of the dataframe.
import itertools
import pandas as pd
import multiprocessing as mp
import timeit

inputs = [range(50), range(90), range(30)]
inputs_list = list(itertools.product(*inputs))
Index = pd.MultiIndex.from_tuples(inputs_list, names={"a", "b", "c"})
df = pd.DataFrame(index=Index)
df['Output'] = 0

start_p = timeit.timeit()

def Addition(A, B, C):
    df.loc[A, B, C]['Output'] = A + B + C
    return df.loc[A, B, C]['Output']

num_workers = mp.cpu_count()
pool = mp.Pool(num_workers)
df['Output'] = pool.starmap(Addition, inputs_list)  # specify the function and arguments to map
pool.close()
pool.join()

end_p = timeit.timeit()
print(end_p - start_p)

start_l = timeit.timeit()
for A in range(50):
    for B in range(90):
        for C in range(30):
            df.loc[A, B, C]['Output'] = A + B + C
end_l = timeit.timeit()
print(end_l - start_l)
A better approach is to first prepare a dict and then build the DataFrame from it. Adding rows to a DataFrame one by one is slow.
And as DarkKnight mentioned in a comment, timeit.timeit() does not make sense here. I use time.time() instead:
import time

start_l = time.time()
dict_to_df = {}
for A in range(50):
    for B in range(90):
        for C in range(30):
            dict_to_df[A, B, C] = A + B + C
df2 = pd.DataFrame.from_dict(dict_to_df, orient='index', columns=['Output'])
end_l = time.time()
print(end_l - start_l)
0.26 seconds on my machine.
Assuming the dataframe index is well ordered, you can just do something like this, using numpy vectorization:
import numpy as np
start_l = time.time()
a = np.arange(50)
b = np.arange(90)
c = np.arange(30)
a_plus_b = np.add.outer(a, b).flatten()
a_plus_b_plus_c = np.add.outer(a_plus_b, c).flatten()
df['Output'] = a_plus_b_plus_c
end_l = time.time()
print(end_l - start_l)
0.00044 seconds.

How do I get multiprocessing results?

I would like to store the result of the work in a specific variable after multiprocessing as shown below.
Alternatively, I want to save the results of the job as a csv file. May I know how to do it?
This is my code:
(I want to get the 'df4' and 'df7' data and save them to a CSV file.)
import pandas as pd
from pandas import DataFrame
import time
import multiprocessing

df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
df5 = pd.DataFrame()
df6 = pd.DataFrame()
df7 = pd.DataFrame()
df8 = pd.DataFrame()

date = '2011-03', '2011-02' ........ '2021-03'  # there are 120 entries
list1 = df1['resion'].drop_duplicates()  # there are 20 entries; 'df1' is the original data

# I'd like to divide the list and work on it.
list11 = list1.iloc[0:10]
list12 = list1.iloc[10:20]

# It's a function using 'list11'.
def cal1():
    global df2
    global df3
    global df4
    start = time.time()
    for i, t in enumerate(list11):
        df2 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is the original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending=False)
            df4 = df4.append(pd.DataFrame(df3))
        return df4

# It's a function using 'list12'.
def cal2():
    global df5
    global df6
    global df7
    start = time.time()
    for i, t in enumerate(list12):
        df5 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is the original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df6 = pd.DataFrame(df5[df5['date'] == d])
            df6['number'] = df6['price'].rank(pct=True, ascending=False)
            df7 = df7.append(pd.DataFrame(df6))
        return df7

## Multiprocessing code
if __name__ == "__main__":
    # creating processes
    p1 = multiprocessing.Process(target=cal1, args=())
    p2 = multiprocessing.Process(target=cal2, args=())
    # starting process 1
    p1.start()
    # starting process 2
    p2.start()
    # wait until process 1 is finished
    p1.join()
    # wait until process 2 is finished
    p2.join()
    # both processes finished
    print("Done!")
It looks like your functions cal1 and cal2 are identical except that they are trying to assign results to some different global variables. This is not going to work, because when you run them in a subprocess, they will assign that global variable in the subprocess, but that will have no impact whatsoever on the main process from which you started them.
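A minimal stand-alone sketch (not from the original answer) illustrating that point, with a placeholder global x:

import multiprocessing

x = 0

def set_x():
    global x
    x = 42              # modifies only the child's copy of the module globals

if __name__ == '__main__':
    p = multiprocessing.Process(target=set_x)
    p.start()
    p.join()
    print(x)            # prints 0: the parent's global is unchanged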
If you want to map a function to multiple input ranges across multiple processes you can use a process Pool and Pool.map.
For example:
def cal(input_list):
    start = time.time()
    for i, t in enumerate(input_list):
        df2 = pd.DataFrame(df1[df1['resion'] == t])  # 'df1' is original data
        if i % 2 == 0:
            print("cal1 function processing: ", i)
            end = time.time()
            print(end - start)
        else:
            pass
        for n, d in enumerate(date):
            df3 = pd.DataFrame(df2[df2['date'] == d])
            df3['number'] = df3['price'].rank(pct=True, ascending=False)
            df4 = df4.append(pd.DataFrame(df3))
        # I kept your original code unmodified, but I'm not really sure this
        # is what to do, because you are returning after one pass through the
        # outer loop. I haven't scrutinized what you are actually trying to
        # do, but I suspect this is wrong too.
        return df4
Then create a process pool and you can divide up the input how you want (or, with a bit of tweaking, you can let Pool.map chunk the input for you, and then reduce the outputs from map into a single output):
pool = multiprocessing.Pool(2)
dfs = pool.map(cal, [list1.iloc[0:10], list1.iloc[10:20]])
This is just to get you started. I would probably do a number of other things differently as well.
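As a rough sketch of the remaining step (not part of the original answer, and the output filename is just a placeholder): the DataFrames returned by Pool.map can be combined and written out, which would cover the CSV part of the question:

if __name__ == "__main__":
    pool = multiprocessing.Pool(2)
    dfs = pool.map(cal, [list1.iloc[0:10], list1.iloc[10:20]])
    pool.close()
    pool.join()
    result = pd.concat(dfs)        # one combined DataFrame instead of the df4/df7 globals
    result.to_csv('result.csv')    # hypothetical output path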

How to iterate over Dataframe using a "sliding window" with multi processing (used for genetic algo)

Explanation of what I'm trying to accomplish:
I have a dataframe to iterate over, looking for some condition given a variable.
I have a list of variables and I iterate over this df using multiprocessing. I pop(0) every time a process starts.
Now I need to add one more level, but I can't understand how to do it.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
import decimal
import multiprocessing
from multiprocessing import Pool, Manager
import itertools

# dataframe
columns = ['A', 'B', 'C', 'D']
data = np.array([np.random.randint(1, 10_000, size=750)]*4).T
df = pd.DataFrame(data, columns=columns)
print(df)

# Creating a list of tuples to apply a given function
a = np.arange(5, 20, 1)
b = np.arange(1.01, 1.10, 0.01)
d = np.arange(0.95, 0.99, 0.01)
c = list(itertools.product(a, b, d))

list_of_tuples = []
dic = {}
for x in c:
    dic[(x)] = x
for key, value in dic.items():
    uno, due, tre = value[0], value[1], value[2]
    list_of_tuples.append((uno, due, tre))

print(len(dic))  # checking size of dictionary
print(len(list_of_tuples), len(df))  # checking if sizes match

maximum = max(dic, key=dic.get)  # maximum key inside dictionary
print(maximum, dic[maximum])

new_dic = {}
i = 1
#look_back_period = (len(df) // 10)
#print(look_back_period)
c = 0

"""chunks is the only way where I could use pool.map, it should be a list of lists"""
chunks = [list_of_tuples[i::len(list_of_tuples)] for i in range(len(list_of_tuples))]
print(len(chunks[0]))

# this manager is needed to have every process append to the same dict the result of the
# function that is given below
manager = Manager()
new_dic = manager.dict()

def multi_prova(list_of_tuples):
    list_results = []
    given1, given2, given3 = list_of_tuples.pop(0)
    #sliding_window = df.iloc[0 : c + look_back_period, : ]
    for row in df.itertuples():
        result = (given1 / row.A).round(2)
        list_results.append(result)
        new_dic[str(given1) + ', ' + str(given2) + ', ' + str(given2)] = result

time1 = time.time()
if __name__ == "__main__":
    try:
        pool = Pool()  # Make the Pool of workers
        results = pool.map(multi_prova, chunks)  # Open the urls in their own threads
        pool.close()  # close the pool and wait for the work to finish
        pool.join()
    except:
        print('error')

time2 = time.time()
print(time2 - time1)

# On my original code len(new_dic) matched len(dic), here it is 750 vs 150, don't know why?!?!?!
print(new_dic)
print(len(new_dic))
Shouldn't len(new_dic) be equal to len(dic)?
There are 750 rows, and a result for every row should be 'appended' to the dictionary.
So there are two problems:
Why is len(new_dic) not 750?
And on top of that, I would like a sliding window that iterates over a slice of the dataframe and builds a dictionary of lists of lists with the results of every slice of the df while c + look_back_period < len(df).
I hope I was clear enough.
A big hug to anyone who can contribute.
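For what it's worth, a minimal sketch of the sliding-window slicing being described (my own illustration built around the commented-out sliding_window line in the code above, with a placeholder per-window computation, not a full answer):

look_back_period = len(df) // 10
window_results = {}
c = 0
while c + look_back_period < len(df):
    sliding_window = df.iloc[c : c + look_back_period, :]
    # placeholder computation; the real per-tuple logic would go here
    window_results[c] = (1.0 / sliding_window['A']).round(2).tolist()
    c += 1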

Block-wise array writing with Python multiprocessing

I know there are a lot of topics around similar problems (like How do I make processes able to write in an array of the main program?, Multiprocessing - Shared Array or Multiprocessing a loop of a function that writes to an array in python), but I just don't get it... so sorry for asking again.
I need to do some stuff with a huge array and want to speed up things by splitting it into blocks and running my function on those blocks, with each block being run in its own process. Problem is: the blocks are "cut" from one array and the result shall then be written into a new, common array. This is what I did so far (minimum working example; don't mind the array-shaping, this is necessary for my real-world case):
import time
import numpy as np
import multiprocessing as mp

def calcArray(array, blocksize, n_cores=1):
    in_shape = (array.shape[0] * array.shape[1], array.shape[2])
    input_array = array[:, :, :array.shape[2]].reshape(in_shape)
    result_array = np.zeros(array.shape)
    # blockwise loop
    pix_count = array.size
    for position in range(0, pix_count, blocksize):
        if position + blocksize < array.shape[0] * array.shape[1]:
            num = blocksize
        else:
            num = pix_count - position
        result_part = input_array[position:position + num, :] * 2
        result_array[position:position + num] = result_part
    # finalize result
    final_result = result_array.reshape(array.shape)
    return final_result

if __name__ == '__main__':
    start = time.time()
    img = np.ones((4000, 4000, 4))
    result = calcArray(img, blocksize=100, n_cores=4)
    print('Input:\n', img)
    print('\nOutput:\n', result)
How can I now implement multiprocessing in such a way that I set a number of cores and calcArray then assigns processes to each block until n_cores is reached?
With the much appreciated help of @Blownhither Ma, the code now looks like this:
import time, datetime
import numpy as np
from multiprocessing import Pool

def calculate(array):
    return array * 2

if __name__ == '__main__':
    start = time.time()
    CORES = 4
    BLOCKSIZE = 100
    ARRAY = np.ones((4000, 4000, 4))
    pool = Pool(processes=CORES)

    in_shape = (ARRAY.shape[0] * ARRAY.shape[1], ARRAY.shape[2])
    input_array = ARRAY[:, :, :ARRAY.shape[2]].reshape(in_shape)
    result_array = np.zeros(input_array.shape)

    # do it
    pix_count = ARRAY.size
    handles = []
    for position in range(0, pix_count, BLOCKSIZE):
        if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
            num = BLOCKSIZE
        else:
            num = pix_count - position

        ### OLD APPROACH WITH NO PARALLELIZATION ###
        # part = calculate(input_array[position:position + num, :])
        # result_array[position:position + num] = part

        ### NEW APPROACH WITH PARALLELIZATION ###
        handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :],))
        handles.append(handle)

    # finalize result
    ### OLD APPROACH WITH NO PARALLELIZATION ###
    # final_result = result_array.reshape(ARRAY.shape)

    ### NEW APPROACH WITH PARALLELIZATION ###
    final_result = [h.get() for h in handles]
    final_result = np.concatenate(final_result, axis=0)

    print('Done!\nDuration (hh:mm:ss): {duration}'.format(duration=datetime.timedelta(seconds=time.time() - start)))
The code runs and really does start the number of processes I assigned, but it takes much, much longer than the old approach of just using the loop as-is (3 seconds compared to 1 minute). There must be something missing here.
The core functions are pool.apply_async and handler.get.
I have recently been working on the same functions and found it useful to write a standard utility function. balanced_parallel applies the function fn on matrix a in parallel, silently. assigned_parallel explicitly applies the function on each element of a list.
i. The way I split the array is np.array_split. You may use a block scheme instead.
ii. I use concatenation rather than assignment to an empty matrix when collecting results. There is no shared memory.
import numpy as np
from multiprocessing import cpu_count, Pool

def balanced_parallel(fn, a, processes=None, timeout=None):
    """ apply fn on slices of a, return concatenated result """
    if processes is None:
        processes = cpu_count()
    print('Parallel:\tstarting {} processes on input with shape {}'.format(processes, a.shape))
    results = assigned_parallel(fn, np.array_split(a, processes), timeout=timeout, verbose=False)
    return np.concatenate(results, 0)

def assigned_parallel(fn, l, processes=None, timeout=None, verbose=True):
    """ apply fn on each element of l, return list of results """
    if processes is None:
        processes = min(cpu_count(), len(l))
    pool = Pool(processes=processes)
    if verbose:
        print('Parallel:\tstarting {} processes on {} elements'.format(processes, len(l)))
    # add jobs to the pool
    handler = [pool.apply_async(fn, args=x if isinstance(x, tuple) else (x, )) for x in l]
    # pool running, join all results
    results = [handler[i].get(timeout=timeout) for i in range(len(handler))]
    pool.close()
    return results
In your case, fn would be
def _fn(matrix_part): return matrix_part * 2
result = balanced_parallel(_fn, img)
Follow-up:
Your loop should look like this to make parallelization happen.
handles = []
for position in range(0, pix_count, BLOCKSIZE):
    if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
        num = BLOCKSIZE
    else:
        num = pix_count - position
    handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :], ))
    handles.append(handle)

# multiple handles exist at this moment!! Don't `.get()` yet
results = [h.get() for h in handles]
results = np.concatenate(results, axis=0)

Python parallelised correlation slower than single process correlation

I wanted to parallelize df.corr() using the multiprocessing module in Python. I take one column and compute correlation values against all remaining columns in one process, and a second column against its remaining columns in another process. I continue in this fashion to fill the upper triangle of the correlation matrix by stacking up the result rows from all the processes.
I took sample data of shape (678461, 210) and tried my parallelized method and df.corr(), and got running times of 214.40 s and 42.64 s respectively. So my parallelized method takes more time.
Is there a way to improve this?
import multiprocessing as mp
import pandas as pd
import numpy as np
from time import *

def _correlation(args):
    i, mat, mask = args
    ac = mat[i]
    arr = []
    for j in range(len(mat)):
        if i > j:
            continue
        bc = mat[j]
        valid = mask[i] & mask[j]
        if valid.sum() < 1:
            c = np.nan
        elif i == j:
            c = 1.
        elif not valid.all():
            c = np.corrcoef(ac[valid], bc[valid])[0, 1]
        else:
            c = np.corrcoef(ac, bc)[0, 1]
        arr.append((j, c))
    return arr

def correlation_multi(df):
    numeric_df = df._get_numeric_data()
    cols = numeric_df.columns
    mat = numeric_df.values
    mat = pd.core.common._ensure_float64(mat).T
    K = len(cols)
    correl = np.empty((K, K), dtype=float)
    mask = np.isfinite(mat)

    pool = mp.Pool(processes=4)
    ret_list = pool.map(_correlation, [(i, mat, mask) for i in range(len(mat))])

    for i, arr in enumerate(ret_list):
        for l in arr:
            j = l[0]
            c = l[1]
            correl[i, j] = c
            correl[j, i] = c

    return pd.DataFrame(correl, index=cols, columns=cols)

if __name__ == '__main__':
    noise = pd.DataFrame(np.random.randint(0, 100, size=(100000, 50)))
    noise2 = pd.DataFrame(np.random.randint(100, 200, size=(100000, 50)))
    df = pd.concat([noise, noise2], axis=1)

    # Single process correlation
    start = time()
    s = df.corr()
    print('Time taken: ', time() - start)

    # Multi process correlation
    start = time()
    s1 = correlation_multi(df)
    print('Time taken: ', time() - start)
The results from _correlation have to be moved from the worker processes to the process running the Pool via interprocess communication.
This means that the return data is pickled, sent to the other process, unpickled and added to the result list.
This takes time and is by nature a sequential process.
And map processes the returns in the order they were sent, IIRC. So if one iteration takes relatively long, other results might be stuck waiting. You could try using imap_unordered, which yields results as soon as they arrive.
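A minimal sketch (not from the original answer) of what that could look like, reusing mat, mask and correl as built inside correlation_multi above; since results no longer arrive in submission order, the row index has to be carried in the return value:

def _correlation_with_index(args):
    # wrap the original worker so each result carries its row index
    return args[0], _correlation(args)

with mp.Pool(processes=4) as pool:
    jobs = [(i, mat, mask) for i in range(len(mat))]
    for i, arr in pool.imap_unordered(_correlation_with_index, jobs):
        for j, c in arr:
            correl[i, j] = c
            correl[j, i] = c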
