I wanted to parallelize df.corr() using the multiprocessing module in Python. I take one column and compute its correlation with all remaining columns in one process, the second column with its remaining columns in another process, and so on, filling the upper triangle of the correlation matrix by stacking the result rows from all the processes.
On sample data of shape (678461, 210), my parallelized method took 214.40 s while df.corr() took 42.64 s, so the parallelized method is actually slower.
Is there a way to improve this?
import multiprocessing as mp
import pandas as pd
import numpy as np
from time import time

def _correlation(args):
    i, mat, mask = args
    ac = mat[i]
    arr = []
    for j in range(len(mat)):
        if i > j:
            continue
        bc = mat[j]
        valid = mask[i] & mask[j]
        if valid.sum() < 1:
            c = np.nan
        elif i == j:
            c = 1.
        elif not valid.all():
            c = np.corrcoef(ac[valid], bc[valid])[0, 1]
        else:
            c = np.corrcoef(ac, bc)[0, 1]
        arr.append((j, c))
    return arr

def correlation_multi(df):
    numeric_df = df._get_numeric_data()
    cols = numeric_df.columns
    mat = numeric_df.values
    mat = pd.core.common._ensure_float64(mat).T  # private pandas helper; may have moved in newer versions
    K = len(cols)
    correl = np.empty((K, K), dtype=float)
    mask = np.isfinite(mat)

    pool = mp.Pool(processes=4)
    ret_list = pool.map(_correlation, [(i, mat, mask) for i in range(len(mat))])

    for i, arr in enumerate(ret_list):
        for j, c in arr:
            correl[i, j] = c
            correl[j, i] = c

    return pd.DataFrame(correl, index=cols, columns=cols)

if __name__ == '__main__':
    noise = pd.DataFrame(np.random.randint(0, 100, size=(100000, 50)))
    noise2 = pd.DataFrame(np.random.randint(100, 200, size=(100000, 50)))
    df = pd.concat([noise, noise2], axis=1)

    # Single-process correlation
    start = time()
    s = df.corr()
    print('Time taken: ', time() - start)

    # Multi-process correlation
    start = time()
    s1 = correlation_multi(df)
    print('Time taken: ', time() - start)
The results from _correlation have to be moved from the worker processes to the process running the Pool via interprocess communication.
This means that the return data is pickled, sent to the other process, unpickled and added to the result list.
This takes time and is by nature a sequential process.
And map returns results in the order the tasks were submitted, so if one iteration takes relatively long, later results may sit waiting behind it. You could try imap_unordered, which yields results as soon as they arrive.
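For illustration, here is a minimal sketch of that suggestion applied to the code above. The Pool initializer, which lets each worker pickle the large mat and mask arrays only once instead of once per task, is my addition rather than something the original code does:

import multiprocessing as mp
import numpy as np

_mat = None
_mask = None

def _init_worker(mat, mask):
    # runs once per worker process; stores the big arrays as globals
    # so they are not re-pickled for every task
    global _mat, _mask
    _mat, _mask = mat, mask

def _corr_row(i):
    # correlate column i with columns j >= i using only the worker globals
    row = []
    for j in range(i, len(_mat)):
        valid = _mask[i] & _mask[j]
        if valid.sum() < 1:
            c = np.nan
        elif i == j:
            c = 1.0
        else:
            c = np.corrcoef(_mat[i][valid], _mat[j][valid])[0, 1]
        row.append((i, j, c))
    return row

def correlation_multi_unordered(mat, mask, processes=4):
    K = len(mat)
    correl = np.empty((K, K), dtype=float)
    with mp.Pool(processes=processes, initializer=_init_worker, initargs=(mat, mask)) as pool:
        # imap_unordered yields each row as soon as its worker finishes it
        for row in pool.imap_unordered(_corr_row, range(K)):
            for i, j, c in row:
                correl[i, j] = c
                correl[j, i] = c
    return correl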
Related
I have a large pandas DataFrame with columns a and b (float coordinates, roughly speaking) and a column c (values), which has to be binned and summarized over fixed step intervals in a and b. The order of the results matters, since a and b represent coordinates where samples with value c were taken. In a next step, the results would be reshaped into an image and processed further.
This can be solved with nested loops (see below); however, that obviously does not scale well to larger datasets or smaller step sizes.
Example:
import pandas as pd
import time
import numpy as np

a = np.random.random(int(10E7))
b = np.random.random(int(10E7))
c = np.random.random(int(10E7))

df = pd.DataFrame({'a': a, 'b': b, 'values': c})
stepsize = 0.1
means = []

T1 = time.time()
for j in np.arange(0, 1, stepsize):
    for i in np.arange(0, 1, stepsize):
        selection = df[(df.a > i) & (df.a <= i + stepsize) & (df.b > j) & (df.b <= j + stepsize)]
        means.append(selection['values'].mean())
T2 = time.time()
I have been wondering how this could be sped up using multiprocessing (or multithreading?).
Therefore, I have set up the following code, though I am stuck: I am not sure whether meanByAB is set up correctly, or whether the multiprocessing part is initiated correctly.
stepsize = 0.01  # smaller stepsize
A_vals = list(np.arange(0, 1, stepsize))
B_vals = list(np.arange(0, 1, stepsize))

def meanByAB(A, B):
    # df and stepsize are loaded globally. Does it make sense?!
    selection = df[(df.a > A) & (df.a <= A + stepsize) & (df.b > B) & (df.b <= B + stepsize)]
    mean = np.mean(selection['values'])

if __name__ == '__main__':
    T1 = time.time()
    p = mp.Process(target=meanByAB, args=(A_vals, B_vals))  # tried many things here, yields ValueError
    p.start()
    p.join()
    T2 = time.time()
This is my solution. The results are not ordered, because I am not sure which ordering you need. However, each element of the result list is a tuple of the form (mean, A, B), where A and B are the bin values you specified. Afterwards, you can sort the results by A and B, or by mean if you need to:
import pandas as pd
import numpy as np
# from multiprocessing import Process, Manager, Array, Queue
# import multiprocessing as mp
from threading import Thread
import threading as th
from queue import Queue
import time

def worker(df, q_input, q_output, stepsize):
    while True:
        A = q_input.get()
        if A == 'DONE':
            break
        for i, B in enumerate(np.arange(0, 1, stepsize)):
            selection = df[(df.a > A) & (df.a <= A + stepsize) & (df.b > B) & (df.b <= B + stepsize)]
            mean = np.mean(selection['values'])
            q_output.put((mean, A, B))

def dump_queue(queue):
    result = []
    while not queue.empty():
        result.append(queue.get(timeout=0.01))
    return result

if __name__ == '__main__':
    start_time = time.time()
    start_initialization = time.time()

    NUM_THREADS = 4

    a = np.random.random(int(10E7))
    b = np.random.random(int(10E7))
    c = np.random.random(int(10E7))

    df = pd.DataFrame({'a': a, 'b': b, 'values': c})
    stepsize = 0.1
    means = []

    # one work item per A bin, plus one 'DONE' sentinel per worker thread
    q_input = Queue()
    for j in np.arange(0, 1, stepsize):
        q_input.put(j)
    for _ in range(NUM_THREADS):
        q_input.put('DONE')

    q_output = Queue()
    print(f"Initialization terminates in {time.time() - start_initialization}")

    threads = [Thread(target=worker, args=(df, q_input, q_output, stepsize), name=f"Worker-{i}") for i in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(f"Finished in {time.time() - start_time}")
    result = dump_queue(q_output)
    print(result)
    print(len(result))
EDIT
The edited version uses multi-threading instead of multiprocessing: threads are lighter-weight than processes and share the DataFrame in memory rather than pickling a copy to each worker, so I think they are the better choice for this problem.
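For comparison, here is a minimal sketch of what a process-based variant (using the commented-out multiprocessing imports) could look like. It is my own illustration, not part of the answer; note that each worker process receives a pickled copy of df, which is exactly the overhead the threaded version avoids. A smaller DataFrame is used just so the sketch runs quickly:

import multiprocessing as mp
import numpy as np
import pandas as pd

def worker(df, q_input, q_output, stepsize):
    # same worker as above, but df arrives as a pickled copy in each process
    while True:
        A = q_input.get()
        if A == 'DONE':
            break
        for B in np.arange(0, 1, stepsize):
            selection = df[(df.a > A) & (df.a <= A + stepsize) & (df.b > B) & (df.b <= B + stepsize)]
            q_output.put((selection['values'].mean(), A, B))

if __name__ == '__main__':
    NUM_PROCESSES = 4
    stepsize = 0.1
    n = 10**6  # smaller than the original 10E7, just to keep the sketch fast
    df = pd.DataFrame({'a': np.random.random(n), 'b': np.random.random(n), 'values': np.random.random(n)})

    q_input, q_output = mp.Queue(), mp.Queue()
    for j in np.arange(0, 1, stepsize):
        q_input.put(j)
    for _ in range(NUM_PROCESSES):
        q_input.put('DONE')

    procs = [mp.Process(target=worker, args=(df, q_input, q_output, stepsize)) for _ in range(NUM_PROCESSES)]
    for p in procs:
        p.start()
    # drain the output queue (10 x 10 bins) before joining, so workers can flush their queue buffers
    result = [q_output.get() for _ in range(100)]
    for p in procs:
        p.join()
    print(len(result))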
I was hoping to use parallel processing to accelerate a for loop, but as seen in the example below, it is much slower than the loop. Is there anything wrong with my parallel processing approach? Are there better solutions?
The goal here is to update a column of a dataframe using a pre-defined function that operates on multiple other columns of the dataframe.
import itertools
import pandas as pd
import multiprocessing as mp
import timeit

inputs = [range(50), range(90), range(30)]
inputs_list = list(itertools.product(*inputs))
Index = pd.MultiIndex.from_tuples(inputs_list, names={"a", "b", "c"})
df = pd.DataFrame(index=Index)
df['Output'] = 0

start_p = timeit.timeit()

def Addition(A, B, C):
    df.loc[A, B, C]['Output'] = A + B + C
    return df.loc[A, B, C]['Output']

num_workers = mp.cpu_count()
pool = mp.Pool(num_workers)
df['Output'] = pool.starmap(Addition, inputs_list)  # specify the function and arguments to map
pool.close()
pool.join()

end_p = timeit.timeit()
print(end_p - start_p)

start_l = timeit.timeit()
for A in range(50):
    for B in range(90):
        for C in range(30):
            df.loc[A, B, C]['Output'] = A + B + C
end_l = timeit.timeit()
print(end_l - start_l)
A better approach is to first prepare a dict and then build the dataframe from it; adding rows to a dataframe one by one is slow.
And as DarkKnight mentioned in a comment, timeit does not make sense here, so I use time.time():
import time

start_l = time.time()
dict_to_df = {}
for A in range(50):
    for B in range(90):
        for C in range(30):
            dict_to_df[A, B, C] = A + B + C
df2 = pd.DataFrame.from_dict(dict_to_df, orient='index', columns=['Output'])
end_l = time.time()
print(end_l - start_l)
0.26 sec on my machine.
Assuming the dataframe index is well ordered, you can just do something like this, using numpy vectorization:
import numpy as np
start_l = time.time()
a = np.arange(50)
b = np.arange(90)
c = np.arange(30)
a_plus_b = np.add.outer(a, b).flatten()
a_plus_b_plus_c = np.add.outer(a_plus_b, c).flatten()
df['Output'] = a_plus_b_plus_c
end_l = time.time()
print(end_l - start_l)
0.00044 sec.
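If you want to guard the "well ordered" assumption explicitly, a small sanity check like this (my addition, not part of the answer) can help:

# sort the MultiIndex so row order matches the itertools.product order used
# to build a_plus_b_plus_c, then spot-check one cell against the formula
df = df.sort_index()
df['Output'] = a_plus_b_plus_c
assert df.loc[(10, 20, 5), 'Output'] == 10 + 20 + 5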
I have a block of code that takes user input and turns it into matrices, but I need it to run in parallel as opposed to running as one single process. I have no clue how to do so; I started reading into NumPy but haven't really grasped it.
import timeit

start = timeit.default_timer()

def getMatrix(name):
    matrixCreated = []
    i = 0
    while True:
        i += 1
        row = input('\nEnter elements in row %s of Matrix %s (separated by commas)\nOr -1 to exit: ' % (i, name))
        if row == '-1':
            break
        else:
            strList = row.split(',')
            matrixCreated.append(list(map(int, strList)))
    return matrixCreated

def getColAsList(matrixToManipulate, col):
    myList = []
    numOfRows = len(matrixToManipulate)
    for i in range(numOfRows):
        myList.append(matrixToManipulate[i][col])
    return myList

def getCell(matrixA, matrixB, r, c):
    matrixBCol = getColAsList(matrixB, c)
    lenOfList = len(matrixBCol)
    productList = [matrixA[r][i] * matrixBCol[i] for i in range(lenOfList)]
    return sum(productList)

matrixA = getMatrix('A')
matrixB = getMatrix('B')

rowA = len(matrixA)
colA = len(matrixA[0])
rowB = len(matrixB)
colB = len(matrixB[0])

result = [[0 for p in range(colB)] for q in range(rowA)]

if (colA != rowB):
    print('The two matrices cannot be multiplied')
else:
    print('\nThe result is')
    for i in range(rowA):
        for j in range(colB):
            result[i][j] = getCell(matrixA, matrixB, i, j)
        print(result[i])

stop = timeit.default_timer()
print('Time: ', stop - start)
I also have a timer on the code to print the time taken, but since the programme takes user input, the measured time depends on how long the input takes in real time. Is there a way I can time just the execution? I need to compare how much running this code in parallel decreases the run time.
numpy is an efficient C implementation, while jax is an efficient parallel implementation that also supports GPU/TPU.
Both of them would run faster than your current pure-Python implementation.
Import numpy or jax
import numpy as np
or
import jax.numpy as np
Then create the matrices
A = np.array(getMatrix('A'))
B = np.array(getMatrix('B'))
And output the matrix multiplication
C = A @ B
print(C)
If you want only the time of execution of the matrix multiplication, then move the start after the user input, like this:
...
matrixA = getMatrix('A')
matrixB = getMatrix('B')
start = timeit.default_timer()
rowA = len(matrixA)
colA = len(matrixA[0])
rowB = len(matrixB)
colB = len(matrixB[0])
...
stop = timeit.default_timer()
print('Time: ', stop - start)
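Putting the pieces together, a minimal end-to-end sketch could look like the following (with small hard-coded matrices standing in for the interactive getMatrix input, just so it runs without typing anything):

import timeit
import numpy as np  # or: import jax.numpy as np

# hard-coded stand-ins for the matrices normally collected by getMatrix
matrixA = [[1, 2], [3, 4]]
matrixB = [[5, 6], [7, 8]]

start = timeit.default_timer()  # start timing only after input is collected
A = np.array(matrixA)
B = np.array(matrixB)
C = A @ B  # matrix multiplication
stop = timeit.default_timer()

print(C)
print('Time: ', stop - start)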
I have two matrices. One is of size (CxK) and the other is of size (SxK) (where S, C, and K all have the potential to be very large). I want to combine them into an output matrix of size (CxS) using the cosine similarity function. When I run my code it takes a very long time to produce an output, and I was wondering if there is any way to optimize what I currently have. [Note: the two input matrices are often very sparse.]
I was previously traversing each matrix with two for index, row loops, but I have since switched to the while loops below, which improved my run time significantly.
A  # this is one of my input matrices (pandas dataframe)
B  # this is my second input matrix (pandas dataframe)
C = pd.DataFrame(columns=['col_1', 'col_2', 'col_3'])

i = 0
k = 0
while i <= 5:
    col_1 = A.iloc[i].get('label_A')
    while k < 5:
        col_2 = B.iloc[k].get('label_B')
        propensity = cosine_similarity([A.drop('label_A', axis=1).iloc[i]],
                                       [B.drop('label_B', axis=1).iloc[k]])
        d = {'col_1': [col_1], 'col_2': [col_2], 'col_3': [propensity[0][0]]}
        to_append = pd.DataFrame(data=d)
        C = C.append(to_append)
        k += 1
    k = 0
    i += 1
Right now the loops run on only 5 items from each matrix, producing a 5x5 result, but I would obviously like this to work for very large inputs. This is the first time I have done anything like this, so please let me know if any facet of the code can be improved (the data types used to hold the matrices, how to traverse them, how the output matrix is updated, etc.).
Thank you in advance.
This can be done much more easily and way faster by passing the whole arrays to cosine_similarity after you move the labels to the index:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import time

c = 50
s = 50
k = 100

A = pd.DataFrame(np.random.rand(c, k))
B = pd.DataFrame(np.random.rand(s, k))
A['label_A'] = [f'A{i}' for i in range(c)]
B['label_B'] = [f'B{i}' for i in range(s)]
C = pd.DataFrame()

# your program
start = time.time()
i = 0
k = 0
while i < c:
    col_1 = A.iloc[i].get('label_A')
    while k < s:
        col_2 = B.iloc[k].get('label_B')
        propensity = cosine_similarity([A.drop('label_A', axis=1).iloc[i]],
                                       [B.drop('label_B', axis=1).iloc[k]])
        d = {'col_1': [col_1], 'col_2': [col_2], 'col_3': [propensity[0][0]]}
        to_append = pd.DataFrame(data=d)
        C = C.append(to_append)
        k += 1
    k = 0
    i += 1
print(f'elementwise: {time.time() - start:7.3f} s')

# my solution
start = time.time()
A = A.set_index('label_A')
B = B.set_index('label_B')
C1 = pd.DataFrame(cosine_similarity(A, B), index=A.index, columns=B.index).stack().rename('col_3')
C1.index.rename(['col_1', 'col_2'], inplace=True)
C1 = C1.reset_index()
print(f'whole array: {time.time() - start:7.3f} s')

# verification
assert (C[['col_1', 'col_2']].to_numpy() == C1[['col_1', 'col_2']].to_numpy()).all() \
    and np.allclose(C.col_3.to_numpy(), C1.col_3.to_numpy())
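Since the question mentions that the inputs are often very sparse, it is worth noting that cosine_similarity also accepts scipy sparse matrices directly, so something like the following sketch (with random sparse inputs purely for illustration) avoids densifying the data:

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical sparse inputs of shape (C, K) and (S, K); csr_matrix keeps
# memory low and cosine_similarity accepts it without conversion
A_sparse = sparse.random(50, 100, density=0.05, format='csr')
B_sparse = sparse.random(50, 100, density=0.05, format='csr')

sims = cosine_similarity(A_sparse, B_sparse)  # dense (C, S) result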
I know there are a lot of topics around similar problems (like How do I make processes able to write in an array of the main program?, Multiprocessing - Shared Array or Multiprocessing a loop of a function that writes to an array in python), but I just don't get it... so sorry for asking again.
I need to do some stuff with a huge array and want to speed things up by splitting it into blocks and running my function on those blocks, with each block being run in its own process. The problem is: the blocks are "cut" from one array, and the results then have to be written into a new, common array. This is what I have so far (minimum working example; don't mind the array-shaping, that is necessary for my real-world case):
import time
import numpy as np
import multiprocessing as mp

def calcArray(array, blocksize, n_cores=1):
    in_shape = (array.shape[0] * array.shape[1], array.shape[2])
    input_array = array[:, :, :array.shape[2]].reshape(in_shape)
    result_array = np.zeros(in_shape)

    # blockwise loop
    pix_count = array.size
    for position in range(0, pix_count, blocksize):
        if position + blocksize < array.shape[0] * array.shape[1]:
            num = blocksize
        else:
            num = pix_count - position
        result_part = input_array[position:position + num, :] * 2
        result_array[position:position + num] = result_part

    # finalize result
    final_result = result_array.reshape(array.shape)
    return final_result

if __name__ == '__main__':
    start = time.time()
    img = np.ones((4000, 4000, 4))
    result = calcArray(img, blocksize=100, n_cores=4)
    print('Input:\n', img)
    print('\nOutput:\n', result)
How can I now implement multiprocessing in such a way that I set a number of cores and calcArray then assigns processes to each block until n_cores is reached?
With the much appreciated help of @Blownhither Ma, the code now looks like this:
import time, datetime
import numpy as np
from multiprocessing import Pool

def calculate(array):
    return array * 2

if __name__ == '__main__':
    start = time.time()

    CORES = 4
    BLOCKSIZE = 100
    ARRAY = np.ones((4000, 4000, 4))

    pool = Pool(processes=CORES)

    in_shape = (ARRAY.shape[0] * ARRAY.shape[1], ARRAY.shape[2])
    input_array = ARRAY[:, :, :ARRAY.shape[2]].reshape(in_shape)
    result_array = np.zeros(input_array.shape)

    # do it
    pix_count = ARRAY.size
    handles = []
    for position in range(0, pix_count, BLOCKSIZE):
        if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
            num = BLOCKSIZE
        else:
            num = pix_count - position

        ### OLD APPROACH WITH NO PARALLELIZATION ###
        # part = calculate(input_array[position:position + num, :])
        # result_array[position:position + num] = part

        ### NEW APPROACH WITH PARALLELIZATION ###
        handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :],))
        handles.append(handle)

    # finalize result
    ### OLD APPROACH WITH NO PARALLELIZATION ###
    # final_result = result_array.reshape(ARRAY.shape)

    ### NEW APPROACH WITH PARALLELIZATION ###
    final_result = [h.get() for h in handles]
    final_result = np.concatenate(final_result, axis=0)

    print('Done!\nDuration (hh:mm:ss): {duration}'.format(duration=datetime.timedelta(seconds=time.time() - start)))
The code runs and really does start the number of processes I assigned, but it takes much, much longer than the old approach of just using the loop as-is (about 3 sec for the loop versus about 1 min with the pool). There must be something missing here.
The core functions are pool.apply_async and handler.get.
I have recently been working on the same functions and found it useful to write standard utility functions. balanced_parallel applies function fn on matrix a in a parallel manner, silently; assigned_parallel explicitly applies fn to each element of a list.
i. The way I split the array is np.array_split. You may use a block scheme instead.
ii. I use concatenation rather than assignment into an empty matrix when collecting the results; there is no shared memory.
from multiprocessing import cpu_count, Pool

def balanced_parallel(fn, a, processes=None, timeout=None):
    """ apply fn on slices of a, return concatenated result """
    if processes is None:
        processes = cpu_count()
    print('Parallel:\tstarting {} processes on input with shape {}'.format(processes, a.shape))
    results = assigned_parallel(fn, np.array_split(a, processes), timeout=timeout, verbose=False)
    return np.concatenate(results, 0)

def assigned_parallel(fn, l, processes=None, timeout=None, verbose=True):
    """ apply fn on each element of l, return list of results """
    if processes is None:
        processes = min(cpu_count(), len(l))
    pool = Pool(processes=processes)
    if verbose:
        print('Parallel:\tstarting {} processes on {} elements'.format(processes, len(l)))

    # add jobs to the pool
    handler = [pool.apply_async(fn, args=x if isinstance(x, tuple) else (x, )) for x in l]

    # pool running, join all results
    results = [handler[i].get(timeout=timeout) for i in range(len(handler))]

    pool.close()
    return results
In your case, fn would be
def _fn(matrix_part): return matrix_part * 2
result = balanced_parallel(_fn, img)
Follow-up:
Your loop should look like this to make parallelization happen.
handles = []
for position in range(0, pix_count, BLOCKSIZE):
    if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
        num = BLOCKSIZE
    else:
        num = pix_count - position
    handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :], ))
    handles.append(handle)

# multiple handles exist at this moment!! Don't `.get()` yet
results = [h.get() for h in handles]
results = np.concatenate(results, axis=0)