I'm trying to optimize the following loop:
telePortMatrix = sparse.coo_matrix(self.matrix, dtype=float)
telePortMatrixCSR = telePortMatrix.tocsr()

for r in telePortMatrix.row:
    rowCount = telePortMatrix.getrow(r).nnz
    rowSum = telePortMatrix.getrow(r).sum()
    avg = rowSum / rowCount
    for c in telePortMatrix.col:
        if telePortMatrixCSR[r, c] != 0:
            telePortMatrixCSR[r, c] = avg
with the joblib module:
telePortMatrix = sparse.coo_matrix(self.matrix, dtype=float)
telePortMatrixCSR = telePortMatrix.tocsr()

def loop(r):
    rowCount = telePortMatrix.getrow(r).nnz
    rowSum = telePortMatrix.getrow(r).sum()
    avg = rowSum / rowCount
    for c in telePortMatrix.col:
        if telePortMatrixCSR[r, c] != 0:
            telePortMatrixCSR[r, c] = avg

Parallel(n_jobs=-1)(delayed(loop)(i) for i in telePortMatrix.row)
But according to the Shared-memory semantics section of the documentation, if the parallel function really needs to rely on the shared memory semantics of threads, it should be made explicit with require='sharedmem'. I'm not quite sure whether my case counts as shared memory: every call works on the same matrix, but no two function calls will ever access the same row.
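For reference, a minimal sketch of what that option looks like with the same names as above; the random dense matrix is only a hypothetical stand-in for self.matrix so the snippet runs on its own, and whether shared-memory threading is actually the right backend for this workload is exactly the open question:

from joblib import Parallel, delayed
from scipy import sparse
import numpy as np

# hypothetical stand-in for self.matrix so the sketch is self-contained
dense = np.random.rand(200, 200)
dense[dense < 0.9] = 0.0
telePortMatrix = sparse.coo_matrix(dense, dtype=float)
telePortMatrixCSR = telePortMatrix.tocsr()

def loop(r):
    row = telePortMatrix.getrow(r)
    avg = row.sum() / row.nnz
    for c in telePortMatrix.col:
        if telePortMatrixCSR[r, c] != 0:
            telePortMatrixCSR[r, c] = avg

# require='sharedmem' forces a thread-based backend, so every worker mutates
# the one shared telePortMatrixCSR instead of a per-process copy
Parallel(n_jobs=-1, require='sharedmem')(delayed(loop)(i) for i in telePortMatrix.row)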
Related
I have to speed up my current code to do around 10^6 operations in a feasible time. Before using multiprocessing in the actual code, I tried it on a mock case. The following is my attempt:
import time
import numpy as np
import concurrent.futures

def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def do_something(List):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn
t1 = time.time()
List = []
# in the real case these 20's can go as high as 150
for i in range(1, 20-2):
    for k in range(i+1, 20-1):
        for j in range(k+1, 20):
            List.append((i, k, j))
t3 = time.time()
test = []
List = chunkIt(List, 3)
if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.append(result)
test = np.array(test)
t2 = time.time()
T = t2 - t1
T2 = t3 - t1
However, when I increase the size of my "List", my computer tries to use all of my RAM and CPU and freezes. I even cut my "List" into 3 pieces so it would only use 3 of my cores. However, nothing changed. Also, when I tried it on a smaller data set, I noticed the code ran much slower than on a single core.
I am still very new to multiprocessing in Python, so am I doing something wrong? I would appreciate it if you could help me.
To reduce memory usage, I suggest you instead use the multiprocessing module, and specifically the imap method (or the imap_unordered method). Unlike the map method of either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor, these process the iterable argument lazily. What this means is that if you use a generator function or generator expression for the iterable argument, you do not need to create the complete list of arguments in memory; as a processor in the pool becomes free and ready to execute more tasks, the generator will be called upon to generate the next argument for the imap call.
By default a chunksize value of 1 is used, which can be inefficient for a large iterable. When using map with the default chunksize of None, the pool looks at the length of the iterable, first converting it to a list if necessary, and then computes what it deems to be an efficient chunksize based on that length and the size of the pool. When using imap or imap_unordered, converting the iterable to a list would defeat the whole purpose of using that method. But if you know (more or less) what that size would be if the iterable were converted to a list, then there is no reason not to apply the same chunksize calculation the map method would have used, and that is what is done below.
The following benchmarks perform the same processing, first as a single process and then with multiprocessing using imap, where each invocation of do_something on my desktop takes approximately 0.5 seconds. do_something has now been modified to process a single i, k, j tuple, as there is no longer any need to break anything up into smaller lists:
from multiprocessing import Pool, cpu_count
import time

def half_second():
    HALF_SECOND_ITERATIONS = 10_000_000
    sum = 0
    for _ in range(HALF_SECOND_ITERATIONS):
        sum += 1
    return sum

def do_something(tpl):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    half_second() # on my desktop
    return tpl[0]**2, tpl[1]**2, tpl[2]**2

"""
def generate_tpls():
    for i in range(1, 20-2):
        for k in range(i+1, 20-1):
            for j in range(k+1, 20):
                yield i, k, j
"""

# Use a smaller number of tuples so we finish in a reasonable amount of time:
def generate_tpls():
    # 64 tuples:
    for i in range(1, 5):
        for k in range(1, 5):
            for j in range(1, 5):
                yield i, k, j

def benchmark1():
    """ single processing """
    t = time.time()
    for tpl in generate_tpls():
        result = do_something(tpl)
    print('benchmark1 time:', time.time() - t)

def compute_chunksize(iterable_size, pool_size):
    """ This is more-or-less the function used by the Pool.map method """
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

def benchmark2():
    """ multiprocessing """
    t = time.time()
    pool_size = cpu_count() # 8 logical cores (4 physical cores)
    N_TUPLES = 64 # number of tuples that will be generated
    pool = Pool(pool_size)
    chunksize = compute_chunksize(N_TUPLES, pool_size)
    for result in pool.imap(do_something, generate_tpls(), chunksize=chunksize):
        pass
    print('benchmark2 time:', time.time() - t)

if __name__ == '__main__':
    benchmark1()
    benchmark2()
Prints:
benchmark1 time: 32.261038303375244
benchmark2 time: 8.174998044967651
The nested for loops creating the array before the main guard appear to be the problem. Moving that part underneath the if __name__ == '__main__' guard clears up the memory problems.
import time
import numpy as np
import concurrent.futures

def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def do_something(List):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn

if __name__ == '__main__':
    t1 = time.time()
    List = []
    # in the real case these 20's can go as high as 150
    for i in range(1, 20-2):
        for k in range(i+1, 20-1):
            for j in range(k+1, 20):
                List.append((i, k, j))
    t3 = time.time()
    test = []
    List = chunkIt(List, 3)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.append(result)
    test = np.array(test)
    t2 = time.time()
    T = t2 - t1
    T2 = t3 - t1
I am iterating through a very large mesh file. Since the iterations are independent, I would like to split my mesh into smaller pieces and run them all at the same time in order to lower computation time. Below is a sample code. For example, if mesh has length 50000, I would like to divide the mesh into 100 pieces and run fnc for each mesh/100 at the same time.
import numpy as np

def fnc(data, mesh):
    d = []
    for i, dummy_val in enumerate(mesh):
        d.append(np.sqrt((data[:, 0]-mesh[i, 0])**2.0 + (data[:, 1]-mesh[i, 1])**2.0))
    return d

interpolate = fnc(mydata, mymesh)
I would like to know how to achieve this using multiprocessing or multithreading, as I'm unable to reconcile it with the execution of my loop.
This will give you the general idea. I couldn't test this since I do not have your data. The default constructor for ProcessPoolExecutor will use the number of processors on your computer. Since that determines the level of multiprocessing you can have, it will probably be more efficient to set the N_CHUNKS parameter to the number of simultaneous processes you can support. That is, if you have a processing pool size of 6, it is better to divide your array into 6 large chunks and have 6 processes do the work than to break it up into smaller pieces for which processes would have to wait to run. So you should probably pass a max_workers value to ProcessPoolExecutor that is not greater than the number of processors you have, and set N_CHUNKS to the same value.
from concurrent.futures import ProcessPoolExecutor, as_completed
import numpy as np

def fnc(data, mesh):
    d = []
    for i, dummy_val in enumerate(mesh):
        d.append(np.sqrt((data[:, 0]-mesh[i, 0])**2.0 + (data[:, 1]-mesh[i, 1])**2.0))
    return d

def main(data, mesh):
    #N_CHUNKS = 100
    N_CHUNKS = 6 # assuming you have 6 processors; see max_workers parameter
    n = len(mesh)
    assert n != 0
    if n <= N_CHUNKS:
        N_CHUNKS = 1
        chunk_size = n
        last_chunk_size = n
    else:
        chunk_size = n // N_CHUNKS
        last_chunk_size = n - chunk_size * (N_CHUNKS - 1)
    with ProcessPoolExecutor(max_workers=N_CHUNKS) as executor: # assuming you have 6 processors
        the_futures = {}
        start = 0
        for i in range(N_CHUNKS - 1):
            future = executor.submit(fnc, data, mesh[start:start+chunk_size]) # pass slice
            the_futures[future] = (start, start+chunk_size) # map future to request parameters
            start += chunk_size
        if last_chunk_size:
            future = executor.submit(fnc, data, mesh[start:n]) # pass slice
            the_futures[future] = (start, n)
        for future in as_completed(the_futures):
            (start, end) = the_futures[future] # the original range
            d = future.result() # do something with the results

if __name__ == '__main__':
    # the call to main must be done in a block governed by if __name__ == '__main__'
    # or you will get into a recursive loop where each subprocess calls main again
    main(data, mesh)
I have a function which creates a set of results in a list. This is in a for-loop which changes one of the variables in each iteration. I need to be able to store these lists separately so that I can show the difference in results between each iteration as a graph.
Is there any way to store them separately like that? So far the only solution I've found is to copy out the function multiple times and manually change the variable and name of the list it stores to, but obviously this is a terrible way of doing it and I figure there must be a proper way.
Here is the code. The function is messy but works. Ideally I would be able to put this all in another for-loop which changes deceleration_p each iteration and then stores collected_averages as a different list so that I could compare collected_averages for each iteration.
import numpy as np
import random
import matplotlib.pyplot as plt
from statistics import mean

road_length = 500
deceleration_p = 0.1
max_speed = 5
buffer_road = np.zeros(road_length, dtype=int)
buffer_speed = 0
number_of_iterations = 1000
average_speed = 0
average_speed_list = []
collected_averages = []
total_speed = 0

for cars in range(1, road_length):
    empty_road = np.ones(road_length - cars, dtype=int) * -1
    cars_on_road = np.ones(cars, dtype=int)
    road = np.append(empty_road, cars_on_road)
    np.random.shuffle(road)
    for i in range(0, number_of_iterations):
        # acceleration
        for speed in np.nditer(road, op_flags=['readwrite']):
            if -1 < speed < max_speed:
                speed[...] += 1
        # randomisation
        for speed in np.nditer(road, op_flags=['readwrite']):
            if 0 < speed:
                if deceleration_p > random.random():
                    speed += -1
        # slowing down
        for cell in range(0, road_length):
            speed = road[cell]
            for val in range(1, speed + 1):
                new_speed = val
                if (cell + val) > (road_length - 1):
                    val += -road_length
                if road[cell + val] > -1:
                    speed = val - 1
                    road[cell] = new_speed - 1
                    break
        # car movement
        buffer_road = np.ones(road_length, dtype=int) * -1
        for cell in range(0, road_length):
            speed = road[cell]
            buffer_cell = cell + speed
            if (buffer_cell) > (road_length - 1):
                buffer_cell += -road_length
            if speed > -1:
                total_speed += speed
                buffer_road[buffer_cell] = speed
        road = buffer_road
        average_speed = total_speed / cars
        average_speed_list.append(average_speed)
        average_speed = 0
        total_speed = 0
    steady_state_average = mean(average_speed_list[9:number_of_iterations])
    average_speed_list = []
    collected_averages.append(steady_state_average)
Not to my knowledge. As stated in the comments, you could use a dictionary, but my suggestion is to use a list: for every iteration of the outer loop, you append that run's results. Since your results are already in a list, this effectively gives you a 2D array. My recommendation would be to use a numpy array for it, as that is much faster. Hopefully this is helpful.
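As a rough illustration of that idea, here is a minimal sketch; run_simulation is a hypothetical stand-in for the simulation loop in the question and is assumed to return one collected_averages list per deceleration_p value:

import numpy as np

def run_simulation(deceleration_p):
    # hypothetical placeholder for the traffic simulation above;
    # it should return the collected_averages list for one deceleration_p
    return [deceleration_p * cars for cars in range(1, 11)]

results_by_p = {}                                   # option 1: dictionary keyed by parameter
for p in (0.1, 0.2, 0.3):
    results_by_p[p] = run_simulation(p)

results_2d = np.array([results_by_p[p] for p in sorted(results_by_p)])  # option 2: 2D numpy array
print(results_2d.shape)                             # (number of p values, number of cars values)

Each row of results_2d can then be plotted against the number of cars to compare the different deceleration_p values on one graph.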
I am currently running test_matrix_speed() to see how fast my search_and_book_availability function is. Using the PyCharm profiler I can see that each search_and_book_availability call averages about 0.001 ms. Adding the Numba @jit(nopython=True) decorator makes no difference to the performance of this function. Is this because there are no improvements to be had and NumPy is already operating as fast as possible here? (I don't care about the speed of the generate_searches function.)
Here's the code I'm running
import random
import numpy as np
from numba import jit

def generate_searches(number, sim_start, sim_end):
    searches = []
    for i in range(number):
        start_slot = random.randint(sim_start, sim_end - 1)
        end_slot = random.randint(start_slot + 1, sim_end)
        searches.append((start_slot, end_slot))
    return searches

@jit(nopython=True)
def search_and_book_availability(matrix, search_start, search_end):
    search_slice = matrix[:, search_start:search_end]
    output = np.where(np.sum(search_slice, axis=1) == 0)[0]
    number_of_bookable_vecs = output.size
    if number_of_bookable_vecs > 0:
        if number_of_bookable_vecs == 1:
            id_to_book = output[0]
        else:
            id_to_book = np.random.choice(output)
        matrix[id_to_book, search_start:search_end] = 1
        return True
    else:
        return False

def test_matrix_speed():
    shape = (10, 1440)
    matrix = np.zeros(shape)
    sim_start = 0
    sim_end = 1440
    searches = generate_searches(1000000, sim_start, sim_end)
    for i in searches:
        search_start = i[0]
        search_end = i[1]
        availability = search_and_book_availability(matrix, search_start, search_end)
Using your function and the following code to profile the speed
import time

shape = (10, 1440)
matrix = np.zeros(shape)
sim_start = 0
sim_end = 1440
searches = generate_searches(1000000, sim_start, sim_end)

def reset():
    matrix[:] = 0

def test_matrix_speed():
    for i in searches:
        search_start = i[0]
        search_end = i[1]
        availability = search_and_book_availability(matrix, search_start, search_end)

def timeit(func):
    # warmup
    reset()
    func()
    reset()
    start = time.time()
    func()
    end = time.time()
    return end - start

print(timeit(test_matrix_speed))
I find times on the order of 11.5 s for the jitted version and 7.5 s without jit. I am no expert on numba, but what it is made for is optimizing numerical code written in a non-vectorized way, in particular explicit for loops. In your code there are none; you only use vectorized operations. Therefore I expected jit not to outperform the baseline solution, though I must admit I am surprised to see it that much worse. If you're looking to optimize your solution, you can cut the execution time (at least on my PC) with the following code:
def search_and_book_availability_opt(matrix, search_start, search_end):
    search_slice = matrix[:, search_start:search_end]
    # we don't need to sum in order to check if all elements are 0.
    # ndarray.any() can use short-circuiting and is therefore faster.
    # Also, we don't need the selected values from np.where, only the
    # indexes, so np.nonzero is faster
    bookable, = np.nonzero(~search_slice.any(axis=1))
    # short circuit
    if bookable.size == 0:
        return False
    # we can perform random choice even if size is 1
    id_to_book = np.random.choice(bookable)
    matrix[id_to_book, search_start:search_end] = 1
    return True
and by initializing matrix as np.zeros(shape, dtype=bool) instead of the default float64. I am able to get execution times of around 3.8 s, a ~50% improvement over your unjitted solution and a ~70% improvement over the jitted version. Hope that helps.
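For what it's worth, here is a minimal self-contained sketch of that dtype change, reusing the optimized function from above; dtype=bool is written instead of np.bool because the latter was only an alias for the built-in bool and has been removed in recent NumPy releases:

import numpy as np

# same optimized function as above, here operating on a boolean matrix
def search_and_book_availability_opt(matrix, search_start, search_end):
    search_slice = matrix[:, search_start:search_end]
    bookable, = np.nonzero(~search_slice.any(axis=1))
    if bookable.size == 0:
        return False
    id_to_book = np.random.choice(bookable)
    matrix[id_to_book, search_start:search_end] = 1   # stored as True in a bool array
    return True

shape = (10, 1440)
matrix = np.zeros(shape, dtype=bool)                  # bool instead of the default float64

print(search_and_book_availability_opt(matrix, 100, 200))  # True: one free row was booked
print(matrix[:, 100:200].any(axis=1))                      # exactly one row is now marked busy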
I have a large number of data files, and the data loaded from each file are resampled hundreds of times and processed by several methods. I used numpy to do this. What I'm facing is a memory error after the program has run for several hours. As each data set is processed separately and the results are stored in a .mat file using scipy.savemat, I thought the memory used by previous data could be released, so I used del variable_name plus gc.collect(), but this does not work. Then I used the multiprocessing module, as suggested in this post and this post, and it still does not work.
Here is my main code:
import scipy.io as scio
import gc
from multiprocessing import Pool

def dataprocess_session():
    i = -1
    for f in file_lists:
        i += 1
        data = scio.loadmat(f)
        ixs = data['rm_ix'] # resample indices
        del data
        gc.collect()
        data = scio.loadmat('xd%d.mat'%i) # this is the data, and the indices in "ixs" are used to resample subdata from this data

        j = -1
        mvs_ls_org = {} # preallocate results files as dictionaries, as required by scipy.savemat.
        mvs_ls_norm = {}
        mvs_ls_auto = {}
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data, ix)
            mvs_ls_org[key] = process(X)
        scio.savemat('d%d_ls_org.mat'%i, mvs_ls_org)
        del mvs_ls_org
        gc.collect()

        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data, ix)
            X2 = scale(X.copy(), 'norm')
            mvs_ls_norm[key] = process(X2)
        scio.savemat('d%d_ls_norm.mat'%i, mvs_ls_norm)
        del mvs_ls_norm
        gc.collect()

        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data, ix)
            X2 = scale(X.copy(), 'auto')
            mvs_ls_auto[key] = process(X2)
        scio.savemat('d%d_ls_auto.mat'%i, mvs_ls_auto)
        del mvs_ls_auto
        gc.collect()

        # use another method to process data
        j = -1
        mvs_fcm_org = {} # also preallocate variables for storing results
        mvs_fcm_norm = {}
        mvs_fcm_auto = {}
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data['X'].copy(), ix)
            dp, _ = process_2(X.copy())
            mvs_fcm_org[key] = dp
        scio.savemat('d%d_fcm_org.mat'%i, mvs_fcm_org)
        del mvs_fcm_org
        gc.collect()

        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'norm')
            dp, _ = process_2(X2.copy())
            mvs_fcm_norm[key] = dp
        scio.savemat('d%d_fcm_norm.mat'%i, mvs_fcm_norm)
        del mvs_fcm_norm
        gc.collect()

        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'auto')
            dp, _ = process_2(X2.copy())
            mvs_fcm_auto[key] = dp
        scio.savemat('d%d_fcm_auto.mat'%i, mvs_fcm_auto)
        del mvs_fcm_auto
        gc.collect()
This is the initial way I did it. I split file_lists into 7 parts and ran 7 Python instances, as my computer has 8 CPU cores. There is no such problem in MATLAB when I do it this way. I do not combine the iterations over ixs for the different data-processing methods because the memory error can occur, so I ran resample_from_data and saved the results separately. As the memory error persisted, I used the Pool class as:
pool = Pool(processes=7)
pool.map(dataprocess_session_2, file_lists)
which runs the iteration over file_lists in parallel, with the file names in file_lists as inputs.
All code is run on openSUSE with Python 2.7.5, an 8-core CPU and 32 GB of RAM. I used top to monitor the memory used. None of the matrices are particularly large, and everything is fine if I run the full code on any single loaded data file. But after several iterations over file_lists, the free memory drops dramatically. I'm sure this is not caused by the data itself, since even the largest data matrix should not use that much memory while it is being processed. So I suspect that the approaches above, which I tried in order to release the memory used for processing previous data and for storing the results, did not really release the memory.
Any suggestions?
All the variables you del explicitly are automatically released as soon as the loop ends, so I don't think they are your problem. I think it's more likely that your machine simply can't handle 7 workers with (in the worst case) 7 simultaneous executions of data = scio.loadmat(f). You could try to mark that call as a critical section with a lock.
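One way that lock suggestion could look, as a minimal sketch; dataprocess_session_2 is assumed here to be the per-file variant that the question passes to pool.map, and the file names are placeholders:

import scipy.io as scio
from multiprocessing import Pool, Lock

lock = None

def init_worker(shared_lock):
    # make the shared lock visible inside each worker process
    global lock
    lock = shared_lock

def dataprocess_session_2(f):
    # hypothetical per-file version of the processing function from the question
    with lock:                      # only one worker loads a .mat file at a time
        data = scio.loadmat(f)
    # ... resample and process `data`, then scio.savemat the results ...

if __name__ == '__main__':
    file_lists = ['xd0.mat', 'xd1.mat']   # placeholder file names
    shared_lock = Lock()
    pool = Pool(processes=7, initializer=init_worker, initargs=(shared_lock,))
    pool.map(dataprocess_session_2, file_lists)
    pool.close()
    pool.join()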
This may be of some help: gc.collect()