How can I make Matrix Multiplication Code run in parallel?

How can I make Matrix Multiplication Code run in parallel? - python

I have a block of code that takes user input and creates them into matrices but i need them to run in parallel as appose to running the code as one single process. I have no clue how to do so, i started reading into NumPy but haven't really grasped it.
import timeit
start = timeit.default_timer()
def getMatrix(name):
matrixCreated = []
i = 0
while True:
i += 1
row = input('\nEnter elements in row %s of Matrix %s (separated by commas)\nOr -1 to exit: ' % (i, name))
if row == '-1':
break
else:
strList = row.split(',')
matrixCreated.append(list(map(int, strList)))
return matrixCreated
def getColAsList(matrixToManipulate, col):
myList = []
numOfRows = len(matrixToManipulate)
for i in range(numOfRows):
myList.append(matrixToManipulate[i][col])
return myList
def getCell(matrixA, matrixB, r, c):
matrixBCol = getColAsList(matrixB, c)
lenOfList = len(matrixBCol)
productList = [matrixA[r][i] * matrixBCol[i] for i in range(lenOfList)]
return sum(productList)
matrixA = getMatrix('A')
matrixB = getMatrix('B')
rowA = len(matrixA)
colA = len(matrixA[0])
rowB = len(matrixB)
colB = len(matrixB[0])
result = [[0 for p in range(colB)] for q in range(rowA)]
if (colA != rowB):
print('The two matrices cannot be multiplied')
else:
print('\nThe result is')
for i in range(rowA):
for j in range(colB):
result[i][j] = getCell(matrixA, matrixB, i, j)
print(result[i])
stop = timeit.default_timer()
print('Time: ', stop - start)
I also have a timer on the code to pint the time taken, but as its a programme that takes user input it is directly related to how long it takes to process in real-time, is there a way i can time it just to execute? I need to compare how making this code run in parallel can decrease run-time.

numpy is an efficient C implementation, while jax is a an efficient parallel implementation that supports also GPU/TPU.
Both of then would run faster than your current python implementation.
Import numpy or jax
import numpy as np
or
import jax.numpy as np
Then create the matrices
A = np.array(getMatrix('A'))
B = np.array(getMatrix('B'))
And output the matrix multiplication
C = A # B
print (C)

If you want only the time of execution of the matrix multiplication than move the start after the user input, like this:
...
matrixA = getMatrix('A')
matrixB = getMatrix('B')
start = timeit.default_timer()
rowA = len(matrixA)
colA = len(matrixA[0])
rowB = len(matrixB)
colB = len(matrixB[0])
...
stop = timeit.default_timer()
print('Time: ', stop - start)

Related

Optimizing Matrix Traversal/General Code Optimization

I have two matrices. One is of size (CxK) and another is of size (SxK) (where S,C, and K all have the potential to be very large). I want to combine these an output matrix using the cosine similarity function (would be of size [CxS]). When I run my code, it takes a very long time to produce an output, and I was wondering if there is any way to optimize what I currently have. [Note, the two input matrices are often very sparse]
I was previously traversing each matrix using two for index,row loops, but I have since switched to the while loops, which improved my run time significantly.
A #this is one of my input matrices (pandas dataframe)
B #this is my second input matrix (pandas dataframe)
C = pd.DataFrame(columns = ['col_1' ,'col_2' ,'col_3'])
i=0
k=0
while i <= 5:
col_1 = A.iloc[i].get('label_A')
while k < 5:
col_2 = B.iloc[k].get('label_B')
propensity = cosine_similarity([A.drop('label_A', axis=1)\
.iloc[i]], [B.drop('label_B',axis=1).iloc[k]])
d = {'col_1':[col_1], 'col_2':[col_2], 'col_3':[propensity[0][0]]}
to_append = pd.DataFrame(data=d)
C = C.append(to_append)
k += 1
k = 0
i += 1
Right now I have the loops to run on only 5 items from each matrix, producing a 5x5 matrix, but I would obviously like this to work for very large inputs. This is the first time I have done anything like this so please let me know if any facet of code can be improved (data types used to hold matrices, how to traverse them, updating the output matrix, etc.).
Thank you in advance.

This can be done much more easyly and way faster by passing the whole arrays to cosine_similarity after you move the labels to the index:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import time
c = 50
s = 50
k = 100
A = pd.DataFrame( np.random.rand(c,k))
B = pd.DataFrame( np.random.rand(s,k))
A['label_A'] = [f'A{i}' for i in range(c)]
B['label_B'] = [f'B{i}' for i in range(s)]
C = pd.DataFrame()
# your program
start = time.time()
i=0
k=0
while i < c:
col_1 = A.iloc[i].get('label_A')
while k < s:
col_2 = B.iloc[k].get('label_B')
propensity = cosine_similarity([A.drop('label_A', axis=1)\
.iloc[i]], [B.drop('label_B',axis=1).iloc[k]])
d = {'col_1':[col_1], 'col_2':[col_2], 'col_3':[propensity[0][0]]}
to_append = pd.DataFrame(data=d)
C = C.append(to_append)
k += 1
k = 0
i += 1
print(f'elementwise: {time.time() - start:7.3f} s')
# my solution
start = time.time()
A = A.set_index('label_A')
B = B.set_index('label_B')
C1 = pd.DataFrame(cosine_similarity(A, B), index=A.index, columns=B.index).stack().rename('col_3')
C1.index.rename(['col_1','col_2'], inplace=True)
C1 = C1.reset_index()
print(f'whole array: {time.time() - start:7.3f} s')
# verification
assert(C[['col_1','col_2']].to_numpy()==C1[['col_1','col_2']].to_numpy()).all()\
and np.allclose(C.col_3.to_numpy(), C1.col_3.to_numpy())

The Anaconda prompt freezes when I run code with numba's "jit" decorator

I have this python code that should run just fine. I'm running it on Anaconda's Spyder Ipython console, or on the Anaconda terminal itself, because that is the only way I can use the "numba" library and its "jit" decorator.
However, either one always "freezes" or "hangs" just about whenever I run it. There is nothing wrong with the code itself, or else I'd get an error.
Sometimes, the code runs all the way through perfectly fine, sometimes it just prints the first line from the first function, and sometimes the code stops anywhere in the middle.
I've tried seeing under which conditions the same problems reproduce, but I haven't been able to get any insights.
My code is:
import time
import numpy as np
import random
from numba import vectorize, cuda, jit, njit, prange, float64, float32, int64
from numba.numpy_support import from_dtype
import numba
#jit(nopython = True)
def make_array(number_of_rows, row_size, starting_size):
q = np.zeros((number_of_rows,row_size))
q[:,0]=starting_size
return(q)
q = make_array(5,5,5)
#jit(nopython = True)
def row_size(array):
return(array.shape[1])
#jit(nopython = True)
def number_of_rows(array):
return(array.shape[0])
#jit(nopython = True)
def foo(array):
result = np.zeros(array.size).reshape(1,array.shape[1])
result[:] = array[:]
shedding_row = np.zeros(array.size).reshape(1,array.shape[1])
birth_row = np.zeros(array.size).reshape(1,array.shape[1])
for i in range((array.shape[0])):
for j in range((array.shape[1])-1):
if result[i,j] !=0:
shedding = (np.random.poisson( (result[i,j])**.2, 1))[0]
birth = (np.random.poisson( (3), 1))[0]
birth = 0
result[i,j+1] = result[i,j] - shedding + birth
shedding_row[i,j+1] = shedding
birth_row[i,j+1] = birth
if result[i,j] == 0:
result[i,j] = result[i,j]
return(result, shedding_row)
#jit(nopython = True)
def foo_two(array):
result = np.zeros(array.size).reshape(array.shape[0],array.shape[1])
result_two = np.zeros(array.size).reshape(array.shape[0],array.shape[1])
i = 0
while i != (result.shape[0]):
fill_in_row= 0*np.arange(1 * result.shape[1]).reshape(1, result.shape[1])
fill_in_row[0] = array[i]
result[i], shedding_row = foo(fill_in_row)
result_two[i] = shedding_row
i+=1
return(result, result_two)
#jit(nopython = True)
def foo_three(array):
array_sum = np.sum(array, axis = 0)
array_sum = array_sum.reshape(1,array_sum.size)
result = np.zeros(array_sum.size).reshape(1,array_sum.size)
for i in range((result.shape[0])):
for j in range((result.shape[1])):
shed_death_param = .2
shed_metastasis_param = .3
combined_number = (int(array_sum[i,j])) * (shed_death_param+shed_metastasis_param)
for q in range(int(combined_number)):
random_number = random.randint(1, 7)
if random_number == 5:
result[i,j]+=1
number_to_add = (int(array_sum[i,j])) - (int(combined_number))
if j < row_size(array_sum) - 1:
(array_sum[i,j+1]) += number_to_add
return(result)
#jit(nopython = True)
def foo_four(array):
result = np.zeros(array.size).reshape(1,array.size)
for i in range((result.shape[0])):
for j in range((result.shape[1])):
if int(array[i,j])!= 0:
for q in range(int(array[i,j])):
addition = np.zeros((1,result.shape[1]))
addition[0][j] = 1
result = np.concatenate((result, addition), axis=0)
if result.shape[0]!=1:
result = result[1:]
return(result)
def the_process(array):
array, master_shedding_array = (foo_two(array))
master_metastasis_array = foo_three(master_shedding_array)
new_array = (foo_four(master_metastasis_array))
print("new_array is\n", new_array)
return(array,new_array)
def the_bigger_process(array):
big_array = make_array(1,row_size(array),0)
big_metastasis_array = make_array(1,row_size(array),0)
counter =0
i = 0
while counter < row_size(array)-1:
print("We begin, before the_process is called")
updated_array,metastasis_array = the_process(array)
big_array = np.concatenate((big_array, updated_array), axis=0)
if sum( metastasis_array[0] ) != 0:
big_metastasis_array = np.concatenate((big_metastasis_array, metastasis_array), axis=0)
i+=1
third_big_metastasis_array = big_metastasis_array[np.where(big_metastasis_array[:,i] == 1)]
array = third_big_metastasis_array
counter+=1
big_array = big_array[1:]
big_metastasis_array = big_metastasis_array[1:]
return(big_array,big_metastasis_array)
something, big_metastasis_array = the_bigger_process(q)
print("something is\n",something)
print("big_metastasis_array is\n",big_metastasis_array)
I know it's best to just post the part of your code that's relevant, but this such an unusual situation where the code is actually fine, that I thought I should post all of it.
This is a screenshot of when I ran the code two consecutive times, clearly the first time it printed the outputs I wanted just fine, and then the next time it froze. And sometimes it freezes in between.
Of course I put many print functions all over when I was testing if I could see some pattern, but I couldn't, and I took out all those print functions in the code above. But the truth is, this code would freeze in the middle, and there was no consistency or "replicability" to it.
I've googled around but couldn't find anyone else with a similar issue.

You are passing a bad value to np.random.poisson. In your code result[i, j] can sometimes be negative, which is causing an NaN in numba, whereas in python it return an actual (negative) value. In python you might get a ValueError, but numba is failing in a different way that causes the process to hang.
You have to decide whether it makes sense for your particular problem, but if I add, the check between the # ****** comments:
#jit(nopython=True)
def foo(array):
result = np.zeros(array.size).reshape(1, array.shape[1])
result[:] = array[:]
shedding_row = np.zeros(array.size).reshape(1, array.shape[1])
birth_row = np.zeros(array.size).reshape(1, array.shape[1])
for i in range((array.shape[0])):
for j in range((array.shape[1]) - 1):
if result[i, j] != 0:
# ******
if result[i, j] < 0:
continue
# ******
shedding = (np.random.poisson( (result[i, j])**.2, 1))[0]
birth = (np.random.poisson((3), 1))[0]
....
in foo, then the code stops hanging.
As a general debugging tip, it's good to run your code with the jit decorators commented out to see if anything strange is happening.

Block-wise array writing with Python multiprocessing

I know there are a lot of topics around similar problems (like How do I make processes able to write in an array of the main program?, Multiprocessing - Shared Array or Multiprocessing a loop of a function that writes to an array in python), but I just don't get it... so sorry for asking again.
I need to do some stuff with a huge array and want to speed up things by splitting it into blocks and running my function on those blocks, with each block being run in its own process. Problem is: the blocks are "cut" from one array and the result shall then be written into a new, common array. This is what I did so far (minimum working example; don't mind the array-shaping, this is necessary for my real-world case):
import numpy as np
import multiprocessing as mp
def calcArray(array, blocksize, n_cores=1):
in_shape = (array.shape[0] * array.shape[1], array.shape[2])
input_array = array[:, :, :array.shape[2]].reshape(in_shape)
result_array = np.zeros(array.shape)
# blockwise loop
pix_count = array.size
for position in range(0, pix_count, blocksize):
if position + blocksize < array.shape[0] * array.shape[1]:
num = blocksize
else:
num = pix_count - position
result_part = input_array[position:position + num, :] * 2
result_array[position:position + num] = result_part
# finalize result
final_result = result_array.reshape(array.shape)
return final_result
if __name__ == '__main__':
start = time.time()
img = np.ones((4000, 4000, 4))
result = calcArray(img, blocksize=100, n_cores=4)
print 'Input:\n', img
print '\nOutput:\n', result
How can I now implement multiprocessing in way that I set a number of cores and then calcArray assigns processes to each block until n_cores is reached?
With the much appreciated help of #Blownhither Ma, the code now looks like this:
import time, datetime
import numpy as np
from multiprocessing import Pool
def calculate(array):
return array * 2
if __name__ == '__main__':
start = time.time()
CORES = 4
BLOCKSIZE = 100
ARRAY = np.ones((4000, 4000, 4))
pool = Pool(processes=CORES)
in_shape = (ARRAY.shape[0] * ARRAY.shape[1], ARRAY.shape[2])
input_array = ARRAY[:, :, :ARRAY.shape[2]].reshape(in_shape)
result_array = np.zeros(input_array.shape)
# do it
pix_count = ARRAY.size
handles = []
for position in range(0, pix_count, BLOCKSIZE):
if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
num = BLOCKSIZE
else:
num = pix_count - position
### OLD APPROACH WITH NO PARALLELIZATION ###
# part = calculate(input_array[position:position + num, :])
# result_array[position:position + num] = part
### NEW APPROACH WITH PARALLELIZATION ###
handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :],))
handles.append(handle)
# finalize result
### OLD APPROACH WITH NO PARALLELIZATION ###
# final_result = result_array.reshape(ARRAY.shape)
### NEW APPROACH WITH PARALLELIZATION ###
final_result = [h.get() for h in handles]
final_result = np.concatenate(final_result, axis=0)
print 'Done!\nDuration (hh:mm:ss): {duration}'.format(duration=datetime.timedelta(seconds=time.time() - start))
The code runs and really starts the number processes I assigned, but takes much much longer than the old approach with just using the loop "as-is" (3 sec compared to 1 min). There must be something missing here.

The core function is pool.apply_async and handler.get.
I have been recently working on the same functions and find it useful to make a standard utility function. balanced_parallel applies function fn on matrix a in a parallel manner silently. assigned_parallel explicitly apply function on each element.
i. The way I split array is np.array_split. You may use block scheme instead.
ii. I use concat rather than assign to a empty matrix when collecting result. There is no shared memory.
from multiprocessing import cpu_count, Pool
def balanced_parallel(fn, a, processes=None, timeout=None):
""" apply fn on slice of a, return concatenated result """
if processes is None:
processes = cpu_count()
print('Parallel:\tstarting {} processes on input with shape {}'.format(processes, a.shape))
results = assigned_parallel(fn, np.array_split(a, processes), timeout=timeout, verbose=False)
return np.concatenate(results, 0)
def assigned_parallel(fn, l, processes=None, timeout=None, verbose=True):
""" apply fn on each element of l, return list of results """
if processes is None:
processes = min(cpu_count(), len(l))
pool = Pool(processes=processes)
if verbose:
print('Parallel:\tstarting {} processes on {} elements'.format(processes, len(l)))
# add jobs to the pool
handler = [pool.apply_async(fn, args=x if isinstance(x, tuple) else (x, )) for x in l]
# pool running, join all results
results = [handler[i].get(timeout=timeout) for i in range(len(handler))]
pool.close()
return results
In your case, fn would be
def _fn(matrix_part): return matrix_part * 2
result = balanced_parallel(_fn, img)
Follow-up:
Your loop should look like this to make parallelization happen.
handles = []
for position in range(0, pix_count, BLOCKSIZE):
if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
num = BLOCKSIZE
else:
num = pix_count - position
handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :], ))
handles.append(handle)
# multiple handlers exist at this moment!! Don't `.get()` yet
results = [h.get() for h in handles]
results = np.concatenate(results, axis=0)

How to avoid sigkill error 9?

I am trying to build an algorithm which first builds a power set of around 100 symbols excluding null set and repeated elements.
Then for each item in the list of power set it reads data file and evaluates the Sharpe Ratio (Return/Risk).
Results are then appended to a list and at last the program gives the best combination of symbols that would result in highest Sharpe Ratio.
Following is the code:
import pandas as pd
import numpy as np
import math
from itertools import chain, combinations
import operator
import time as t
#ASSUMPTION
#EQUAL ALLOCATION OF RESOURCES
t0 = t.time()
start_date = '2016-06-01'
end_date = '2017-08-18'
allocation = 170000
usesymbols=['PAEL','TPL','SING','DCL','POWER','FCCL','DGKC','LUCK',
'THCCL','PIOC','GWLC','CHCC','MLCF','FLYNG','EPCL',
'LOTCHEM','SPL','DOL','NRSL','AGL','GGL','ICL','AKZO','ICI',
'WAHN','BAPL','FFC','EFERT','FFBL','ENGRO','AHCL','FATIMA',
'EFOODS','QUICE','ASC','TREET','ZIL','FFL','CLOV',
'BGL','STCL','GGGL','TGL','GHGL','OGDC','POL','PPL','MARI',
'SSGC','SNGP','HTL','PSO','SHEL','APL','HASCOL','RPL','MERIT',
'GLAXO','SEARL','FEROZ','HINOON','ABOT','KEL','JPGL','EPQL',
'HUBC','PKGP','NCPL','LPL','KAPCO','TSPL','ATRL','BYCO','NRL','PRL',
'DWSM','SML','MZSM','IMSL','SKRS','HWQS','DSFL','TRG','PTC','TELE',
'WTL','MDTL','AVN','NETSOL','SYS','HUMNL','PAKD',
'ANL','CRTM','NML','NCL','GATM','CLCPS','GFIL','CHBL',
'DFSM','KOSM','AMTEX','HIRAT','NCML','CTM','HMIM',
'CWSM','RAVT','PIBTL','PICT','PNSC','ASL',
'DSL','ISL','CSAP','MUGHAL','DKL','ASTL','INIL']
cost_matrix = []
def data(symbols):
dates=pd.date_range(start_date,end_date)
df=pd.DataFrame(index=dates)
for symbol in symbols:
df_temp=pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)),usecols=['Date','Close'],
parse_dates=True,index_col='Date',na_values=['nan'])
df_temp = df_temp.rename(columns={'Close': symbol})
df=df.join(df_temp)
df=df.fillna(method='ffill')
df=df.fillna(method='bfill')
return df
def mat_alloc_auto(symbols):
n = len(symbols)
mat_alloc = np.zeros((n,n), dtype='float')
for i in range(0,n):
mat_alloc[i,i] = allocation / n
return mat_alloc
def compute_daily_returns(df):
"""Compute and return the daily return values."""
daily_returns=(df/df.shift(1))-1
df=df.fillna(value=0)
daily_returns=daily_returns[1:]
daily_returns = np.array(daily_returns)
return daily_returns
def port_eval(matrix_alloc,daily_return_matrix):
risk_free = 0
amount_matrix = [allocation]
return_mat = np.dot(daily_return_matrix,matrix_alloc)
return_mat = np.sum(return_mat, axis=1, keepdims=True)
return_mat = np.divide(return_mat,amount_matrix)
mat_average = np.mean(return_mat)
mat_std = np.std(return_mat, ddof=1)
sharpe_ratio = ((mat_average-risk_free)/mat_std) * math.sqrt(252)
return return_mat, sharpe_ratio, mat_average
def powerset(iterable):
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))
power_set = list(powerset(usesymbols))
len_power = len(power_set)
sharpe = []
for j in range(0, len_power):
df_01 = data(power_set[j])
matrix_allocation = mat_alloc_auto(power_set[j])
daily_return_mat = compute_daily_returns(df_01)
return_matrix, sharpe_ratio_val, matrix_average = port_eval(matrix_allocation, daily_return_mat)
sharpe.append(sharpe_ratio_val)
max_index, max_value = max(enumerate(sharpe), key=operator.itemgetter(1))
print('Maximum sharpe ratio occurs from ',power_set[max_index], ' value = ', max_value)
t1=t.time()
print('exec time is ', t1-t0, 'seconds')
The above code results in a sigkill error 9.
After research I understood that it is because process allocates too much memory putting pressures on OS.
So I tried running same code on HP Z600 workstation but it takes a lot of time plus the machine is freezes.
My question is how can I make my code more efficient to get instant results.

Python parallelised correlation slower than single process correlation

I wanted to parallelize df.corr() using multiprocessing module in Python. I'm taking one column and computing correlation values with rest all columns in one process and second column with rest other columns in another process. I'm continuing in this fashion to fill the upper traingle of correlation matrix by stacking up the result rows from all the processes.
I took sample data of shape (678461, 210) and tried my parallelized method and df.corr() and got running time of 214.40s and 42.64s respectively. So, my parallelized method is taking more time.
Is there a way to improve this?
import multiprocessing as mp
import pandas as pd
import numpy as np
from time import *
def _correlation(args):
i, mat, mask = args
ac = mat[i]
arr = []
for j in range(len(mat)):
if i > j:
continue
bc = mat[j]
valid = mask[i] & mask[j]
if valid.sum() < 1:
c = NA
elif i == j:
c = 1.
elif not valid.all():
c = np.corrcoef(ac[valid], bc[valid])[0, 1]
else:
c = np.corrcoef(ac, bc)[0, 1]
arr.append((j, c))
return arr
def correlation_multi(df):
numeric_df = df._get_numeric_data()
cols = numeric_df.columns
mat = numeric_df.values
mat = pd.core.common._ensure_float64(mat).T
K = len(cols)
correl = np.empty((K, K), dtype=float)
mask = np.isfinite(mat)
pool = mp.Pool(processes=4)
ret_list = pool.map(_correlation, [(i, mat, mask) for i in range(len(mat))])
for i, arr in enumerate(ret_list):
for l in arr:
j = l[0]
c = l[1]
correl[i, j] = c
correl[j, i] = c
return pd.DataFrame(correl, index = cols, columns = cols)
if __name__ == '__main__':
noise = pd.DataFrame(np.random.randint(0,100,size=(100000, 50)))
noise2 = pd.DataFrame(np.random.randint(100,200,size=(100000, 50)))
df = pd.concat([noise, noise2], axis=1)
#Single process correlation
start = time()
s = df.corr()
print('Time taken: ',time()-start)
#Multi process correlation
start = time()
s1 = correlation_multi(df)
print('Time taken: ',time()-start)

The results from _correlation have to be moved from the worker processes to the process running the Pool via interprocess communication.
This means that the return data is pickled, sent to the other process, unpickled and added to the result list.
This takes time and is by nature a sequential process.
And map processes the returns in the order they were sent, IIRC. So if one iteration takes relatively long, other results might be stuck waiting. You could try using imap_unordered which yields results as soon as they arrive.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I make Matrix Multiplication Code run in parallel? - python

Related

Optimizing Matrix Traversal/General Code Optimization

The Anaconda prompt freezes when I run code with numba's "jit" decorator

Block-wise array writing with Python multiprocessing

How to avoid sigkill error 9?

Python parallelised correlation slower than single process correlation

Categories

Resources