Optimizing a for loop by using vectors in Python

I created the following code:
import numpy as np
import scipy as sp

M = 20000
sample_all = np.load('sample.npy')
sd = np.zeros(M)
chi_arr = np.zeros((M, 4))
sigma_e = np.zeros((M, 41632))
mean_sigma = np.zeros(M)
max_sigma = np.zeros(M)
min_sigma = np.zeros(M)
z = np.load('z_array.npy')
prof = np.load('profile_at_sources.npy')
L = np.load('luminosities.npy')

for k in range(M):
    sd[k] = np.array(sp.std(sample_all[k, :]))
    arr = np.genfromtxt('samples_fin1.txt').T[2:6]
    arr_T = arr.T
    chi_arr[k, :] = arr_T[k, :]
    sigma_e[k, :] = np.sqrt(calc(z, prof, chi_arr[k, :], L))
    mean_sigma[k] = np.array(sp.mean(sigma_e[k, :]))
    max_sigma[k] = np.array(sigma_e[k, :].max())
    min_sigma[k] = np.array(sigma_e[k, :].min())
where calc(...) is a function that does some calculations (its details are not important for my question).
For M=20000, this loop takes about 27 hours on my machine, which is far too long. Is there a way to optimize it, maybe by using vectors instead of the for loop?
Writing loops comes naturally to me; my head thinks in loops for this kind of code, and that is my limitation. Could you help me? Thanks.

It seems to me that the k-th rows created in your various arrays depend only on the k-th rows of the inputs (and of sigma_e), not on any other iteration of your for loop, so you could parallelize it over many workers. I'm not sure the code below is 100% kosher, since you didn't provide a working example.
Note this only works if each k-th iteration is COMPLETELY independent of the (k-1)-th iteration.
import threading
import numpy as np
import scipy as sp

M = 20000
sample_all = np.load('sample.npy')
sd = np.zeros(M)
chi_arr = np.zeros((M, 4))
sigma_e = np.zeros((M, 41632))
mean_sigma = np.zeros(M)
max_sigma = np.zeros(M)
min_sigma = np.zeros(M)
z = np.load('z_array.npy')
prof = np.load('profile_at_sources.npy')
L = np.load('luminosities.npy')
workers = 100

arr = np.genfromtxt('samples_fin1.txt').T[2:6]  # only works if this is really what you're doing to set arr.

def worker(k_start, k_end):
    for k in range(k_start, k_end + 1):
        sd[k] = np.array(sp.std(sample_all[k, :]))
        arr_T = arr.T
        chi_arr[k, :] = arr_T[k, :]
        sigma_e[k, :] = np.sqrt(calc(z, prof, chi_arr[k, :], L))
        mean_sigma[k] = np.array(sp.mean(sigma_e[k, :]))
        max_sigma[k] = np.array(sigma_e[k, :].max())
        min_sigma[k] = np.array(sigma_e[k, :].min())

threads = []
for k in range(0, workers):
    # integer division keeps the bounds as ints in Python 3
    T = threading.Thread(target=worker, args=[k * M // workers, (k + 1) * M // workers - 1])
    threads.append(T)
    T.start()

for t in threads:
    t.join()
Edited following comments:
It seems CPython's Global Interpreter Lock (GIL) acts as a mutex that prevents threads from running Python bytecode in parallel. You could use IronPython or Jython to step around this, or use multiprocessing instead of threading. Also, you can move the file read outside the loop if you're really just deserializing the same array from samples_fin1.txt on every iteration.
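A vectorized alternative is also worth sketching. This is a rough sketch, not the original poster's code: it reuses M, z, prof, L and calc from the question, and assumes calc itself still has to be called row by row since its internals aren't shown. Once the genfromtxt call is hoisted out of the loop and the file is read only once, the std/mean/max/min statistics can be computed on whole arrays at once.
import numpy as np

sample_all = np.load('sample.npy')
arr_T = np.genfromtxt('samples_fin1.txt').T[2:6].T   # read the file once
chi_arr = arr_T[:M, :]                               # shape (M, 4)

sd = sample_all[:M, :].std(axis=1)                   # per-row standard deviation

sigma_e = np.empty((M, 41632))
for k in range(M):                                   # calc is a black box, so keep this loop
    sigma_e[k, :] = np.sqrt(calc(z, prof, chi_arr[k, :], L))

mean_sigma = sigma_e.mean(axis=1)
max_sigma = sigma_e.max(axis=1)
min_sigma = sigma_e.min(axis=1)
Whether this helps much depends on how expensive calc is compared to the file read that the original loop repeated 20000 times.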

Related

How to properly parallelize a for-loop in Python with shared variables?

This is my first attempt at multiprocessing in Python and I am having a hard time finding a fast solution for parallelizing my code. I will do my best to explain my situation.
This is the code I am trying to parallelize (the main problem is the for-loop):
Original code:
import numpy as np

if __name__ == "__main__":
    n = 10
    filling_array = np.ones((n, n)) * -1
    reading_array = np.ones((n, n)) * 10
    for i in range(n):
        for j in range(int(i/2)):
            filling_array[i, j] = 10*i + reading_array[i, j]
    print(filling_array.sum(axis=1), 'end')
And this is my solution:
import numpy as np
from itertools import repeat
from multiprocessing import Pool, cpu_count

def child_p(i,
            filling_array,
            reading_array):
    for j in range(int(i/2)):
        filling_array[i, j] = 10*i + reading_array[i, j]
    return filling_array

if __name__ == "__main__":
    n = 10
    filling_array = np.ones((n, n)) * -1
    reading_array = np.ones((n, n)) * 10
    final_array = np.zeros((n, n))
    with Pool(cpu_count()) as pool:
        result = pool.starmap(child_p,
                              zip(
                                  range(n),
                                  repeat(filling_array, n),
                                  repeat(reading_array, n)
                              )
                              )
    for iter in range(n):
        final_array[iter, :] = result[iter][iter, :]
    print(final_array.sum(axis=1), 'end')
Essentially, the child function acts like the for-loop and fills filling_array using reading_array; I separated the for-loop out into a child function.
My solution is faster than the original for-loop, but it is still very slow and memory-heavy!
I think the main problem is that pool.starmap() has to copy the two arrays into the function for every task (and the arrays are very large).
Would you please guide me? Is there a way to share these large arrays (in my original code there are 5 reading arrays and 1 filling array, and they are large)?
How can I do that?
This is what I did afterwards to solve my problem; it somehow works, but I'm not sure how good it is:
Edited
I tried the method from this website, but with two arrays, like this:
import numpy as np
from itertools import repeat
from multiprocessing import Pool, cpu_count
from multiprocessing.shared_memory import SharedMemory
from multiprocessing.managers import SharedMemoryManager

def child_p(i,
            filling_array_specs,
            reading_array_specs):
    fa_name, fa_shape, fa_dtype = filling_array_specs
    fa_shm = SharedMemory(fa_name)
    filling_array = np.ndarray(shape=fa_shape, dtype=fa_dtype, buffer=fa_shm.buf)
    ra_name, ra_shape, ra_dtype = reading_array_specs
    ra_shm = SharedMemory(ra_name)
    reading_array = np.ndarray(shape=ra_shape, dtype=ra_dtype, buffer=ra_shm.buf)
    for j in range(int(i/2)):
        filling_array[i, j] = 10*i + reading_array[i, j]

if __name__ == "__main__":
    n = 10
    filling_array = np.ones((n, n)) * -1
    reading_array = np.ones((n, n)) * 10
    fa_shape, fa_dtype = filling_array.shape, filling_array.dtype
    ra_shape, ra_dtype = reading_array.shape, reading_array.dtype
    with SharedMemoryManager() as smm:
        fa_shm = smm.SharedMemory(filling_array.nbytes)
        shm_filling_array = np.ndarray(shape=fa_shape, dtype=fa_dtype, buffer=fa_shm.buf)
        np.copyto(shm_filling_array, filling_array)
        filling_array_specs = [fa_shm.name, fa_shape, fa_dtype]
        ra_shm = smm.SharedMemory(reading_array.nbytes)
        shm_reading_array = np.ndarray(shape=ra_shape, dtype=ra_dtype, buffer=ra_shm.buf)
        np.copyto(shm_reading_array, reading_array)
        reading_array_specs = [ra_shm.name, ra_shape, ra_dtype]
        with Pool(cpu_count()) as pool:
            pool.starmap(child_p,
                         zip(
                             range(n),
                             repeat(filling_array_specs, n),
                             repeat(reading_array_specs, n)
                         )
                         )
        print(shm_filling_array.sum(axis=1), 'end')
Please note that I pass each array's specs [shm.name, shape, dtype] to child_p.
So far it seems to be working, but I am not sure how fast it is with my large arrays; I will update this post with the timing.
Edit 2:
# for n = 2000
# The original for-loop took 82.13 sec
# The shared memory method took 17.39 sec
Is this the fastest method I can achieve?
Or are there other methods that you think would work better?
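As a side note: for a fill rule as simple as this toy example, the whole double loop can be vectorized with a broadcast mask, which sidesteps multiprocessing entirely. A minimal sketch, assuming the same n, filling_array and reading_array as above (whether it applies to the real, larger code depends on what the real fill rule looks like):
import numpy as np

n = 10
filling_array = np.ones((n, n)) * -1
reading_array = np.ones((n, n)) * 10

rows = np.arange(n)[:, None]          # column vector of row indices i
cols = np.arange(n)[None, :]          # row vector of column indices j
mask = cols < rows // 2               # same condition as j in range(int(i/2))
filling_array = np.where(mask, 10 * rows + reading_array, filling_array)
print(filling_array.sum(axis=1), 'end')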

Running a for loop for a higher number of iterations in Python

I have written a piece of code which I am trying to run on my local machine with 8 GB of RAM.
import numpy as np

tasks = ['A', 'B', 'C', 'D']
tasks_pass_prob = [0.7, 0.1, 0.5, 0.3]
task_probs = tuple(zip(tasks, tasks_pass_prob))
N = 1000000
n = 1
results_dict = {}
for _ in range(N):
    for t, p in task_probs:
        res = np.random.binomial(n, p, N)
        results_dict[t] = res
For smaller values of N the code runs, but with a higher value of N the machine hangs. Is there a better way to restructure my for loop so the code runs?
I think I can achieve the same thing by simply removing the first loop like so...
import numpy as np

tasks = ['A', 'B', 'C', 'D']
tasks_pass_prob = [0.7, 0.1, 0.5, 0.3]
task_probs = tuple(zip(tasks, tasks_pass_prob))
N = 1000000
n = 1
results_dict = {}
for t, p in task_probs:
    res = np.random.binomial(n, p, N)
    results_dict[t] = res
Actually, your code is not hanging; the work is just so large that it takes a very long time to run.
It is not a RAM issue.
And why did you use for _ in range(N)? Each pass of that outer loop redraws the same N samples per task and overwrites the previous draw, so only the last pass survives; dropping it gives a statistically equivalent result with a tiny fraction of the work.
I suggest you write it like this:
import numpy as np

tasks = ['A', 'B', 'C', 'D']
tasks_pass_prob = [0.7, 0.1, 0.5, 0.3]
task_probs = tuple(zip(tasks, tasks_pass_prob))
N = 1000000
n = 1
results_dict = {}
# for _ in range(N):
for t, p in task_probs:
    res = np.random.binomial(n, p, N)
    results_dict[t] = res
print(f"{res=}, {results_dict=}")
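A further vectorized variant (a sketch, not part of the answer above): np.random.binomial broadcasts an array of probabilities against the requested output shape, so all four tasks can be drawn in a single call with no Python loop at all.
import numpy as np

tasks = ['A', 'B', 'C', 'D']
tasks_pass_prob = [0.7, 0.1, 0.5, 0.3]
N = 1000000
n = 1

# one (N, 4) draw: column k uses probability tasks_pass_prob[k]
all_results = np.random.binomial(n, np.array(tasks_pass_prob), size=(N, len(tasks)))
results_dict = {t: all_results[:, k] for k, t in enumerate(tasks)}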

Multiprocessing Pool Not Mapping

I am attempting to create Pool() objects so that I can break down large arrays. However, each time after the first that I run through the code below, the map is never run: only the first pass seems to enter the function, even though the arguments are the same size, and even when running it with the EXACT same arguments. Only the first job.map(...) appears to run. Below is the source of my pain (not all the code in the file):
import csv
import numpy as np
from multiprocessing import Pool

# (m, window_size, chunk_size, group_step, pa_table_name, shared_list and
#  checkPep are defined elsewhere in the file; not all the code is shown.)

def iterCount():
    # m is not in shared memory, as intended.
    global m
    m = m + 1
    return m

def thread_search(pair):
    divisor_lower = pair[0]
    divisor_upper = pair[1]
    for i in range(divisor_lower, divisor_upper, window_size):
        current_section = np.array(x[i: i + window_size])
        for row in current_section:
            if (row[2].startswith('NP')) and checkPep(row[0]):  # checkPep is a simple unique-in-array checking function.
                # shared_list is a multiprocessing.Manager list.
                shared_list.append(row[[0, 1, 2]])
            m = iterCount()
            if not m % 1000000:
                print(f'Encountered m = {m}', flush=True)

def poolMap(pairs, group):
    job = Pool(3)
    print(f'Pool Created')
    print(len(pairs))
    job.map(thread_search, pairs)
    print('Pool Closed')
    job.close()

if __name__ == '__main__':
    for group in [1, 2, 3]:  # Example times to be run...
        x = None
        lower_bound = int((group - 1) * group_step)
        upper_bound = int(group * group_step)
        x = list(csv.reader(open(pa_table_name, "rt", encoding="utf-8"), delimiter="\t"))[lower_bound:upper_bound]
        print(len(x))
        divisor_pairs = [[int(lower_bound + (i - 1) * chunk_size), int(lower_bound + i * chunk_size)] for i in range(1, 6143)]
        poolMap(divisor_pairs, group)
The output of this function is:
Program started: 03/09/19, 12:41:25 (Machine Time)
11008256 - Length of the file read in (in the group)
Pool Created
6142 - len(pairs)
Encountered m = 1000000
Encountered m = 1000000
Encountered m = 1000000
Encountered m = 2000000
Encountered m = 2000000
Encountered m = 2000000
Encountered m = 3000000
Encountered m = 3000000
Encountered m = 3000000 (Total size is ~ 9 million per set)
Pool Closed
11008256 (this is the number of lines read, correct value)
Pool Created
6142 (Number of pairs is correct, though map appears to never run...)
Pool Closed
11008256
Pool Created
6142
Pool Closed
At this point, shared_list is saved, and only the first group's results appear to be present.
I'm really at a loss as to what is happening here, and I've tried to find bugs or similar instances of this behaviour.
Ubuntu 18.04
Python 3.6
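One observation about the posted code (not a confirmed diagnosis, since not all of the file is shown): divisor_pairs is built from the absolute indices lower_bound .. upper_bound, but x has already been sliced with [lower_bound:upper_bound] and is therefore re-indexed from 0. For groups 2 and 3 every i in thread_search then points past the end of x, the slice x[i:i + window_size] is empty, and the workers loop over nothing, which would produce exactly the output shown (map runs, but no "Encountered m" lines and no appends). A minimal sketch of the fix under that assumption:
# Build the pairs relative to the sliced list x instead of the whole file.
divisor_pairs = [[int((i - 1) * chunk_size), int(i * chunk_size)]
                 for i in range(1, 6143)]
# Or, equivalently, keep the absolute pairs and index with
# x[i - lower_bound : i - lower_bound + window_size] inside thread_search.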

Block-wise array writing with Python multiprocessing

I know there are a lot of topics around similar problems (like How do I make processes able to write in an array of the main program?, Multiprocessing - Shared Array or Multiprocessing a loop of a function that writes to an array in python), but I just don't get it... so sorry for asking again.
I need to do some work on a huge array and want to speed things up by splitting it into blocks and running my function on those blocks, with each block being run in its own process. The problem is: the blocks are "cut" from one array and the result then has to be written into a new, common array. This is what I have done so far (minimum working example; don't mind the array-shaping, this is necessary for my real-world case):
import time
import numpy as np
import multiprocessing as mp

def calcArray(array, blocksize, n_cores=1):
    in_shape = (array.shape[0] * array.shape[1], array.shape[2])
    input_array = array[:, :, :array.shape[2]].reshape(in_shape)
    result_array = np.zeros(in_shape)
    # blockwise loop
    pix_count = array.size
    for position in range(0, pix_count, blocksize):
        if position + blocksize < array.shape[0] * array.shape[1]:
            num = blocksize
        else:
            num = pix_count - position
        result_part = input_array[position:position + num, :] * 2
        result_array[position:position + num] = result_part
    # finalize result
    final_result = result_array.reshape(array.shape)
    return final_result

if __name__ == '__main__':
    start = time.time()
    img = np.ones((4000, 4000, 4))
    result = calcArray(img, blocksize=100, n_cores=4)
    print('Input:\n', img)
    print('\nOutput:\n', result)
How can I now implement multiprocessing in such a way that I set a number of cores and calcArray then assigns processes to the blocks until n_cores is reached?
With the much appreciated help of @Blownhither Ma, the code now looks like this:
import time, datetime
import numpy as np
from multiprocessing import Pool

def calculate(array):
    return array * 2

if __name__ == '__main__':
    start = time.time()
    CORES = 4
    BLOCKSIZE = 100
    ARRAY = np.ones((4000, 4000, 4))
    pool = Pool(processes=CORES)
    in_shape = (ARRAY.shape[0] * ARRAY.shape[1], ARRAY.shape[2])
    input_array = ARRAY[:, :, :ARRAY.shape[2]].reshape(in_shape)
    result_array = np.zeros(input_array.shape)
    # do it
    pix_count = ARRAY.size
    handles = []
    for position in range(0, pix_count, BLOCKSIZE):
        if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
            num = BLOCKSIZE
        else:
            num = pix_count - position
        ### OLD APPROACH WITH NO PARALLELIZATION ###
        # part = calculate(input_array[position:position + num, :])
        # result_array[position:position + num] = part
        ### NEW APPROACH WITH PARALLELIZATION ###
        handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :],))
        handles.append(handle)
    # finalize result
    ### OLD APPROACH WITH NO PARALLELIZATION ###
    # final_result = result_array.reshape(ARRAY.shape)
    ### NEW APPROACH WITH PARALLELIZATION ###
    final_result = [h.get() for h in handles]
    final_result = np.concatenate(final_result, axis=0)
    print('Done!\nDuration (hh:mm:ss): {duration}'.format(duration=datetime.timedelta(seconds=time.time() - start)))
The code runs and really does start the number of processes I assigned, but it takes much, much longer than the old approach of just using the loop as-is (about 1 minute compared to 3 seconds). There must be something missing here.
The core functions are pool.apply_async and handle.get().
I have recently been working on the same kind of functions and found it useful to make a standard utility. balanced_parallel applies a function fn to a matrix a in parallel, silently; assigned_parallel explicitly applies the function to each element of a list.
i. The way I split the array is np.array_split. You may use a block scheme instead.
ii. I use concatenation rather than assignment into an empty matrix when collecting the results. There is no shared memory.
import numpy as np
from multiprocessing import cpu_count, Pool

def balanced_parallel(fn, a, processes=None, timeout=None):
    """ apply fn on slices of a, return concatenated result """
    if processes is None:
        processes = cpu_count()
    print('Parallel:\tstarting {} processes on input with shape {}'.format(processes, a.shape))
    results = assigned_parallel(fn, np.array_split(a, processes), timeout=timeout, verbose=False)
    return np.concatenate(results, 0)

def assigned_parallel(fn, l, processes=None, timeout=None, verbose=True):
    """ apply fn on each element of l, return list of results """
    if processes is None:
        processes = min(cpu_count(), len(l))
    pool = Pool(processes=processes)
    if verbose:
        print('Parallel:\tstarting {} processes on {} elements'.format(processes, len(l)))
    # add jobs to the pool
    handler = [pool.apply_async(fn, args=x if isinstance(x, tuple) else (x, )) for x in l]
    # pool running, join all results
    results = [handler[i].get(timeout=timeout) for i in range(len(handler))]
    pool.close()
    return results
In your case, fn would be
def _fn(matrix_part): return matrix_part * 2
result = balanced_parallel(_fn, img)
Follow-up:
Your loop should look like this to make parallelization happen.
handles = []
for position in range(0, pix_count, BLOCKSIZE):
    if position + BLOCKSIZE < ARRAY.shape[0] * ARRAY.shape[1]:
        num = BLOCKSIZE
    else:
        num = pix_count - position
    handle = pool.apply_async(func=calculate, args=(input_array[position:position + num, :], ))
    handles.append(handle)
# multiple handlers exist at this moment!! Don't `.get()` yet
results = [h.get() for h in handles]
results = np.concatenate(results, axis=0)
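On why the apply_async version was so much slower (my reading of the numbers, not the answerer's): with BLOCKSIZE = 100 and pix_count = ARRAY.size, the loop submits hundreds of thousands of tiny tasks, and pickling each small slice to a worker costs far more than the trivial * 2 it performs. A hedged sketch that keeps the apply_async pattern but uses only a few large blocks, one per core (this is essentially what balanced_parallel above does internally):
# Assumes the same pool, calculate and input_array as in the question's code.
handles = [pool.apply_async(func=calculate, args=(chunk,))
           for chunk in np.array_split(input_array, CORES, axis=0)]
final_result = np.concatenate([h.get() for h in handles], axis=0)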

Python parallelised correlation slower than single process correlation

I wanted to parallelize df.corr() using the multiprocessing module in Python. I take one column and compute correlation values against all the remaining columns in one process, and a second column against its remaining columns in another process. I continue in this fashion to fill the upper triangle of the correlation matrix by stacking up the result rows from all the processes.
I took sample data of shape (678461, 210) and tried both my parallelized method and df.corr(), with running times of 214.40 s and 42.64 s respectively. So my parallelized method is taking more time.
Is there a way to improve this?
import multiprocessing as mp
import pandas as pd
import numpy as np
from time import time

def _correlation(args):
    i, mat, mask = args
    ac = mat[i]
    arr = []
    for j in range(len(mat)):
        if i > j:
            continue
        bc = mat[j]
        valid = mask[i] & mask[j]
        if valid.sum() < 1:
            c = np.nan
        elif i == j:
            c = 1.
        elif not valid.all():
            c = np.corrcoef(ac[valid], bc[valid])[0, 1]
        else:
            c = np.corrcoef(ac, bc)[0, 1]
        arr.append((j, c))
    return arr

def correlation_multi(df):
    numeric_df = df._get_numeric_data()
    cols = numeric_df.columns
    mat = numeric_df.values
    mat = pd.core.common._ensure_float64(mat).T
    K = len(cols)
    correl = np.empty((K, K), dtype=float)
    mask = np.isfinite(mat)
    pool = mp.Pool(processes=4)
    ret_list = pool.map(_correlation, [(i, mat, mask) for i in range(len(mat))])
    for i, arr in enumerate(ret_list):
        for l in arr:
            j = l[0]
            c = l[1]
            correl[i, j] = c
            correl[j, i] = c
    return pd.DataFrame(correl, index=cols, columns=cols)

if __name__ == '__main__':
    noise = pd.DataFrame(np.random.randint(0, 100, size=(100000, 50)))
    noise2 = pd.DataFrame(np.random.randint(100, 200, size=(100000, 50)))
    df = pd.concat([noise, noise2], axis=1)

    # Single process correlation
    start = time()
    s = df.corr()
    print('Time taken: ', time() - start)

    # Multi process correlation
    start = time()
    s1 = correlation_multi(df)
    print('Time taken: ', time() - start)
The results from _correlation have to be moved from the worker processes to the process running the Pool via interprocess communication.
This means that the return data is pickled, sent to the other process, unpickled and added to the result list.
This takes time and is by nature a sequential process.
And map processes the returns in the order they were sent, IIRC. So if one iteration takes relatively long, other results might be stuck waiting. You could try using imap_unordered, which yields results as soon as they arrive; a rough sketch follows.
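A minimal sketch of that idea (an illustration, not tested on the original data): since imap_unordered yields results out of order, the worker has to return its own row index along with the row data. This reuses the _correlation worker above and assumes mat, mask and correl have already been built as in correlation_multi.
def _correlation_indexed(args):
    # return the row index together with the (j, c) pairs
    return args[0], _correlation(args)

with mp.Pool(processes=4) as pool:
    jobs = [(i, mat, mask) for i in range(len(mat))]
    for i, arr in pool.imap_unordered(_correlation_indexed, jobs, chunksize=8):
        for j, c in arr:
            correl[i, j] = c
            correl[j, i] = c
Note that the larger cost identified above is unchanged: mat and mask are still pickled into every task, so the ordering tweak alone is unlikely to beat df.corr().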
