Multiprocessing Pool Not Mapping - python

I am attempting to create Pool() objects so that I can break down large arrays. However, on every pass through the code below after the first, the map never runs. Only the first pass seems to enter the function, even though the arguments are the same size. Even when I run it with the EXACT same arguments, only the first pass appears to run. Below is the source of my pain (not all the code in the file):
def iterCount():
    # m is not in shared memory, as intended.
    global m
    m = m + 1
    return m
def thread_search(pair):
    divisor_lower = pair[0]
    divisor_upper = pair[1]
    for i in range(divisor_lower, divisor_upper, window_size):
        current_section = np.array(x[i: i + window_size])
        for row in current_section:
            if (row[2].startswith('NP')) and checkPep(row[0]):  # checkPep is a simple unique-in-array checking function.
                # shared_list is a multiprocessing.Manager list.
                m = iterCount()
                if not m % 1000000:
                    print(f'Encountered m = {m}', flush=True)
def poolMap(pairs, group):
    job = Pool(3)
    print(f'Pool Created')
    job.map(thread_search, pairs)
    print('Pool Closed')
if __name__ == '__main__':
    for group in [1, 2, 3]:  # Example times to be run...
        x = None
        lower_bound = int((group - 1)*group_step)
        upper_bound = int(group*group_step)
        x = list(csv.reader(open(pa_table_name, "rt", encoding="utf-8"), delimiter="\t"))[lower_bound:upper_bound]
        divisor_pairs = [[int(lower_bound + (i - 1)*chunk_size), int(lower_bound + i*chunk_size)] for i in range(1, 6143)]
        poolMap(divisor_pairs, group)
The output of this function is:
Program started: 03/09/19, 12:41:25 (Machine Time)
11008256 - Length of the file read in (in the group)
Pool Created
6142 - len(pairs)
Encountered m = 1000000
Encountered m = 1000000
Encountered m = 1000000
Encountered m = 2000000
Encountered m = 2000000
Encountered m = 2000000
Encountered m = 3000000
Encountered m = 3000000
Encountered m = 3000000 (Total size is ~ 9 million per set)
Pool Closed
11008256 (this is the number of lines read, correct value)
Pool Created
6142 (Number of pairs is correct, though map appears to never run...)
Pool Closed
Pool Created
Pool Closed
At this point, the shared_list is saved, and only the first thread's results appear to be present.
I'm really at a loss as to what is happening here, and I've tried to find bugs (?) or similar instances of any of this.
Ubuntu 18.04
Python 3.6


How to have precise time in python for timing attacks?

I'd like to know why python gives me two different times when I re-order the two nested for loops.
The difference is significant enough to cause inaccurate results.
This one almost gives me the result I expect to see:
for i in range(20000):
    for j in possibleChars:
        entered_pwd = passStr + j + possibleChars[0] * leftPassLen
        st = time.perf_counter_ns()
        verify_password(stored_pwd, entered_pwd)
        endTime = time.perf_counter_ns() - st
        tmr[j] += endTime
But this code generates inaccurate results, in my view:
for i in possibleChars:
    for j in range(20000):
        entered_pwd = passStr + i + possibleChars[0] * leftPassLen
        st = time.perf_counter_ns()
        verify_password(stored_pwd, entered_pwd)
        endTime = time.perf_counter_ns() - st
        tmr[i] += endTime
This is the function I'm attempting to run the timing attack on:
def verify_password(stored_pwd, entered_pwd):
    if len(stored_pwd) != len(entered_pwd):
        return False
    for i in range(len(stored_pwd)):
        if stored_pwd[i] != entered_pwd[i]:
            return False
    return True
I also observed a problem with the character 'U' (upper case), so to get successful runs I had to delete it from my possibleChars list.
The problem is that when I measure the time for 'U', it is always nearly double that of the other chars.
Let me know if you have any questions.
Summing up the timings may not be a good idea here:
One interruption due to e.g., scheduling will have a huge effect on the total and may completely invalidate your measurements.
Iterating like in the first loop is probably more likely to spread noise more evenly across the measurements (this is just an educated guess though).
However, it would be better to use the median or minimum time instead of the sum.
This way, a few noisy measurements have little or no effect on the result.
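To make the point about sums concrete, here is a minimal sketch with made-up numbers (the timings dictionary and the rank_by helper are hypothetical, not taken from the question). Candidate "a" is the genuinely slower one, while a single sample for "b" is inflated as if the process had been interrupted:

```python
import statistics

# Hypothetical timings in nanoseconds for two candidate characters.
# "a" is consistently slower; "b" is faster, but one of its samples
# was hit by an interruption (the 5000 ns outlier).
timings = {
    "a": [120, 121, 119, 122, 120],
    "b": [100, 101, 99, 5000, 100],
}

def rank_by(reducer):
    """Return the candidate whose reduced timing is the largest."""
    return max(timings, key=lambda ch: reducer(timings[ch]))

print("sum picks:   ", rank_by(sum))
print("median picks:", rank_by(statistics.median))
print("min picks:   ", rank_by(min))
```

With these numbers the sum is dominated by the single outlier and picks "b", while the median and the minimum both pick "a".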
That being said, I don't expect the timing difference to be huge, and Python, being a high-level language, will produce noisier measurements than lower-level languages (because of garbage collection and so on).
But it still works :)
I've implemented an example relying on the minimum time (instead of the sum).
On my local machine, it works reliably except for the last character, where the timing difference is way smaller:
import time
import string
# We want to leak this
stored_pwd = "S3cret"
def verify_password(entered_pwd):
    if len(stored_pwd) != len(entered_pwd):
        return False
    for i in range(len(stored_pwd)):
        if stored_pwd[i] != entered_pwd[i]:
            return False
    return True
possibleChars = string.printable
MEASUREMENTS = 2000 # works even with numbers as small as 20 (for me)
def find_char(prefix, len_pwd):
    # keep the minimum time seen for each candidate character
    tmr = {i: 9999999999 for i in possibleChars}
    for i in possibleChars:
        for j in range(MEASUREMENTS):
            entered_pwd = prefix + i + i * (len_pwd - len(prefix) - 1)
            st = time.perf_counter_ns()
            verify_password(entered_pwd)
            endTime = time.perf_counter_ns() - st
            tmr[i] = min(endTime, tmr[i])
    return max(tmr.items(), key = lambda x: x[1])[0]
def find_length(max_length = 100):
    # keep the minimum time seen for each candidate length
    tmr = [99999999 for i in range(max_length + 1)]
    for i in range(max_length + 1):
        for j in range(MEASUREMENTS):
            st = time.perf_counter_ns()
            verify_password("X" * i)
            endTime = time.perf_counter_ns() - st
            tmr[i] = min(endTime, tmr[i])
    return max(enumerate(tmr), key = lambda x: x[1])[0]
length = find_length()
print(f"password length: {length}")
recovered_password = ""
for i in range(length):
    recovered_password += find_char(recovered_password, length)
    print(f"{recovered_password}{'?' * (length - len(recovered_password))}")
print(f"Password: {recovered_password}")

Can't use Python multiprocessing with large amount of calculations

I have to speed up my current code so that it can do around 10^6 operations in a feasible time. Before using multiprocessing in my actual code, I tried it on a mock case. The following is my attempt:
import time
import concurrent.futures
import numpy as np
def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out
def do_something(List):
    # in real case this function takes about 0.5 seconds to finish for each
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn
t1 = time.time()
List = []
#in the real case these 20's can go as high as 150
for i in range(1,20-2):
    for k in range(i+1,20-1):
        for j in range(k+1,20):
            List.append((i, k, j))
t3 = time.time()
test = []
List = chunkIt(List,3)
if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.extend(result)
    test = np.array(test)
    t2 = time.time()
    T = t2-t1
    T2 = t3-t1
However, when I increase the size of my "List", my computer tries to use all of my RAM and CPU and freezes. I even cut my "List" into 3 pieces so that it would only use 3 of my cores, but nothing changed. Also, when I tried it on a smaller data set, I noticed the code ran much slower than it did on a single core.
I am still very new to multiprocessing in Python. Am I doing something wrong? I would appreciate it if you could help me.
To reduce memory usage, I suggest you use the multiprocessing module instead, and specifically the imap method (or the imap_unordered method). Unlike the map method of either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor, the iterable argument is processed lazily. What this means is that if you use a generator function or generator expression for the iterable argument, you do not need to create the complete list of arguments in memory; as a processor in the pool becomes free and ready to execute more tasks, the generator will be called upon to generate the next argument for the imap call.
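As a side illustration of producing the arguments lazily, the question's three nested loops enumerate every triple i < k < j, which is what itertools.combinations yields one item at a time. This is only a sketch: the helper names nested_loop_triples and lazy_triples are hypothetical, and using combinations is a substitution, not something the original code does.

```python
from itertools import combinations, islice

def nested_loop_triples(n):
    """The question's nested loops, collected eagerly into a list."""
    out = []
    for i in range(1, n - 2):
        for k in range(i + 1, n - 1):
            for j in range(k + 1, n):
                out.append((i, k, j))
    return out

def lazy_triples(n):
    """The same (i, k, j) triples in the same order, produced on demand."""
    return combinations(range(1, n), 3)

# Only the items actually requested are ever built in memory.
first_three = list(islice(lazy_triples(150), 3))
print(first_three)
```

An iterator like the one returned by lazy_triples could be passed as the iterable argument of imap in place of a fully built list.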
By default, imap uses a chunksize value of 1, which can be inefficient for a large iterable. When using map with the default value of None for the chunksize argument, the pool first looks at the length of the iterable, converting it to a list if necessary, and then computes what it deems to be an efficient chunksize based on that length and the size of the pool. When using imap or imap_unordered, converting the iterable to a list would defeat the whole purpose of using that method. But if you know (more or less) what that size would be if the iterable were converted to a list, then there is no reason not to apply the same chunksize calculation the map method would have used, and that is what is done below.
The following benchmarks perform the same processing first as a single process and then with multiprocessing using imap, where each invocation of do_something takes approximately .5 seconds on my desktop. do_something has now been modified to process just a single i, k, j tuple, as there is no longer any need to break anything up into smaller lists:
from multiprocessing import Pool, cpu_count
import time
def half_second():
    sum = 0
    sum += 1
    return sum
def do_something(tpl):
    # in real case this function takes about 0.5 seconds to finish for each iteration
    half_second() # on my desktop
    return tpl[0]**2, tpl[1]**2, tpl[2]**2
def generate_tpls():
    for i in range(1, 20-2):
        for k in range(i+1, 20-1):
            for j in range(k+1, 20):
                yield i, k, j
# Use smaller number of tuples so we finish in a reasonable amount of time:
def generate_tpls():
    # 64 tuples:
    for i in range(1, 5):
        for k in range(1, 5):
            for j in range(1, 5):
                yield i, k, j
def benchmark1():
    """ single processing """
    t = time.time()
    for tpl in generate_tpls():
        result = do_something(tpl)
    print('benchmark1 time:', time.time() - t)
def compute_chunksize(iterable_size, pool_size):
    """ This is more-or-less the function used by the map method """
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize
def benchmark2():
    """ multiprocessing """
    t = time.time()
    pool_size = cpu_count() # 8 logical cores (4 physical cores)
    N_TUPLES = 64 # number of tuples that will be generated
    pool = Pool(pool_size)
    chunksize = compute_chunksize(N_TUPLES, pool_size)
    for result in pool.imap(do_something, generate_tpls(), chunksize=chunksize):
        pass
    print('benchmark2 time:', time.time() - t)
if __name__ == '__main__':
    benchmark1()
    benchmark2()
benchmark1 time: 32.261038303375244
benchmark2 time: 8.174998044967651
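The nested for loops that create the list before the if __name__ == '__main__': guard appear to be the problem. Moving that part underneath the guard clears up the memory problems. A likely reason is that, where worker processes are started by spawning (the default on Windows), each worker re-imports the main module and re-runs all of its top-level code, including those loops.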
import time
import concurrent.futures
import numpy as np
def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out
def do_something(List):
    # in real case this function takes about 0.5 seconds to finish for each
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn
if __name__ == '__main__':
    t1 = time.time()
    List = []
    #in the real case these 20's can go as high as 150
    for i in range(1,20-2):
        for k in range(i+1,20-1):
            for j in range(k+1,20):
                List.append((i, k, j))
    t3 = time.time()
    test = []
    List = chunkIt(List,3)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.extend(result)
    test = np.array(test)
    t2 = time.time()
    T = t2-t1
    T2 = t3-t1

Optimizing a for loop by using vectors in Python

I created the following code:
sample_all = np.load('sample.npy')
sd = np.zeros(M)
chi_arr = np.zeros((M,4))
sigma_e = np.zeros((M,41632))
mean_sigma = np.zeros(M)
max_sigma = np.zeros(M)
min_sigma = np.zeros(M)
z = np.load('z_array.npy')
prof = np.load('profile_at_sources.npy')
L = np.load('luminosities.npy')
for k in range(M):
    arr = np.genfromtxt('samples_fin1.txt').T[2:6]
    arr_T = arr.T
    chi_arr[k,:] = arr_T[k,:]
    sigma_e[k,:]=np.sqrt(calc(z,prof,chi_arr[k,:], L))
    mean_sigma[k] = np.array(sp.mean(sigma_e[k,:]))
    max_sigma[k] = np.array(sigma_e[k,:].max())
    min_sigma[k] = np.array(sigma_e[k,:].min())
where calc(...) is a function that calculates some stuff (it is not important for my question).
For M=20000, this loop takes about 27 hours on my machine, which is too long. Is there a way to optimize it, maybe with vectors instead of a for loop?
Writing loops comes naturally to me; my head thinks in loops for this kind of code, and that is my limitation. Could you help me? Thanks.
It seems to me that the k-th row created in each of your arrays depends only on the k-th iteration of your for loop (and on the k-th row of sigma_e), not on any other iteration, so you could parallelize it over many workers. I'm not sure the code is 100% kosher, since you didn't provide a working example.
Note this only works if each k-th iteration is COMPLETELY independent of the (k-1)-th iteration.
sample_all = np.load('sample.npy')
sd = np.zeros(M)
chi_arr = np.zeros((M,4))
sigma_e = np.zeros((M,41632))
mean_sigma = np.zeros(M)
max_sigma = np.zeros(M)
min_sigma = np.zeros(M)
z = np.load('z_array.npy')
prof = np.load('profile_at_sources.npy')
L = np.load('luminosities.npy')
workers = 100
arr = np.genfromtxt('samples_fin1.txt').T[2:6] # only works if this is really what you're doing to set arr.
import threading
def worker(k_start, k_end):
    # process rows k_start..k_end (inclusive)
    for k in range(k_start, k_end + 1):
        arr_T = arr.T
        chi_arr[k,:] = arr_T[k,:]
        sigma_e[k,:]=np.sqrt(calc(z,prof,chi_arr[k,:], L))
        mean_sigma[k] = np.array(sp.mean(sigma_e[k,:]))
        max_sigma[k] = np.array(sigma_e[k,:].max())
        min_sigma[k] = np.array(sigma_e[k,:].min())
threads = []
kstart = 0
for k in range(0, workers):
    # integer division keeps the row bounds usable as range() arguments
    T = threading.Thread(target=worker, args=[0 + k * M // workers, (1 + k) * M // workers - 1])
    threads.append(T)
    T.start()
for t in threads:
    t.join()
Edited following comments:
It seems CPython has a global interpreter lock (GIL), a mutex that prevents threads from executing Python code in parallel. Use IronPython or Jython to step around this. Also, you can move the file read outside the loop if you're really just deserializing the same array from samples_fin1.txt.

How to effectively release memory in python?

I have a large number of data files, and the data loaded from each file is resampled hundreds of times and processed by several methods. I used numpy to do this. The problem I'm facing is a memory error after the programs have been running for several hours. As each data set is processed separately and the results are stored in a .mat file using scipy.io.savemat, I think the memory used by previous data can be released, so I used del variable_name followed by gc.collect(), but this does not work. Then I used the multiprocessing module, as suggested in this post and this post, but it still does not work.
Here is my main code:
import scipy.io as scio
import gc
from multiprocessing import Pool
def dataprocess_session():
    i = -1
    for f in file_lists:
        i += 1
        data = scio.loadmat(f)
        ixs = data['rm_ix'] # resample indices
        del data
        data = scio.loadmat('xd%d.mat'%i) # this is the data, and indices in "ixs" is used to resample subdata from this data
        j = -1
        mvs_ls_org = {} # preallocate results files as dictionaries, as required by scipy.savemat.
        mvs_ls_norm = {}
        mvs_ls_auto = {}
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data,ix)
            mvs_ls_org[key] = process(X)
        del mvs_ls_org
        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data,ix)
            X2 = scale(X.copy(), 'norm')
            mvs_ls_norm[key] = process(X2)
        del mvs_ls_norm
        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data,ix)
            X2 = scale(X.copy(), 'auto')
            mvs_ls_auto[key] = process(X2)
        del mvs_ls_auto
        # use another method to process data
        j = -1
        mvs_fcm_org = {} # also preallocate variable for storing results
        mvs_fcm_norm = {}
        mvs_fcm_auto = {}
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data['X'].copy(), ix)
            dp, _ = process_2(X.copy())
            mvs_fcm_org[key] = dp
        del mvs_fcm_org
        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'norm')
            dp, _ = process_2(X2.copy())
            mvs_fcm_norm[key] = dp
        del mvs_fcm_norm
        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'auto')
            dp, _ = process_2(X2.copy())
            mvs_fcm_auto[key] = dp
        del mvs_fcm_auto
This is how I did it initially. I split file_lists into 7 parts and ran 7 python screens, as my computer has 8 CPU cores. There is no problem in MATLAB if I do it this way. I do not combine the iterations over ixs for each data processing method because a memory error can occur, so I ran resample_from_data and saved the results separately. As the memory error persisted, I used the Pool class as:
pool = Pool(processes=7)
pool.map(dataprocess_session, file_lists)
which ran the iteration over file_lists in parallel, with the file names in file_lists as inputs.
All code is run on openSUSE with Python 2.7.5, an 8-core CPU and 32 GB of RAM. I used top to monitor memory usage. None of the matrices is very large, and there is no problem if I run any single loaded data set through all of the code. But after several iterations over file_lists, the free memory falls dramatically. I'm sure this is not caused by the data itself, since even processing the largest data matrix should not need that much memory. So I suspect that the ways I tried above to release the memory used by previous data and by stored results did not really release it.
Any suggestions?
All the variables you del explicitly are released automatically anyway as soon as the loop ends. Consequently, I don't think they are your problem. I think it's more likely that your machine simply can't handle 7 processes with (in the worst case) 7 simultaneous executions of data = scio.loadmat(f). You could try to mark that call as a critical section with locks.
This may be of some help: gc.collect()
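The two answers above can be illustrated with a small sketch. It assumes CPython, where an object is freed as soon as its last reference disappears and gc.collect() is only needed for reference cycles. The Block class and both demo functions are hypothetical stand-ins, not part of the question's code.

```python
import gc
import weakref

class Block:
    """Hypothetical stand-in for a large array loaded from a file."""
    def __init__(self, label):
        self.label = label
        self.partner = None

def demo_rebinding():
    """Rebinding a name releases the old object; no explicit del is needed."""
    data = Block("first file")
    probe = weakref.ref(data)      # lets us observe whether the object is alive
    data = Block("second file")    # the last reference to the first Block is gone
    return probe() is None

def demo_cycle():
    """Objects in a reference cycle survive del until the collector runs."""
    gc.disable()                   # keep automatic collection from interfering
    try:
        a, b = Block("a"), Block("b")
        a.partner, b.partner = b, a
        probe = weakref.ref(a)
        del a, b
        alive_before = probe() is not None
        gc.collect()
        alive_after = probe() is not None
    finally:
        gc.enable()
    return alive_before, alive_after

if __name__ == "__main__":
    print("freed by rebinding:", demo_rebinding())
    print("cycle alive before/after gc.collect():", demo_cycle())
```

If the memory in the question keeps growing even though no cycles are involved, this sketch suggests that del and gc.collect() are unlikely to be the missing piece.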

Multiprocessing in python to speed up functions

I am confused with Python multiprocessing.
I am trying to speed up a function which processes strings from a database, but I must have misunderstood how multiprocessing works, because the function takes longer when given to a pool of workers than with "normal processing".
Here is an example of what I am trying to achieve.
from time import clock, time
from multiprocessing import Pool, freeze_support
from random import choice
def foo(x):
    TupWerteMany = []
    for i in range(0,len(x)):
        TupWerte = []
        s = list(x[i][3])
        NewValue = choice(s)+choice(s)+choice(s)+choice(s)
        TupWerte = tuple(TupWerte)
    return TupWerteMany
if __name__ == '__main__':
    start_time = time()
    List = [(u'1', u'aa', u'Jacob', u'Emily'),
            (u'2', u'bb', u'Ethan', u'Kayla')]
    List1 = List*1000000
    # METHOD 1 : NORMAL (takes 20 seconds)
    x2 = foo(List1)
    print x2[1:3]
    # METHOD 2 : APPLY_ASYNC (takes 28 seconds)
    # pool = Pool(4)
    # Werte = pool.apply_async(foo, args=(List1,))
    # x2 = Werte.get()
    # print '--------'
    # print x2[1:3]
    # print '--------'
    # pool = Pool(4)
    # Werte = pool.map(foo, args=(List1,))
    # x2 = Werte.get()
    # print '--------'
    # print x2[1:3]
    # print '--------'
    print 'Time Elapsed: ', time() - start_time
My questions:
Why does apply_async take longer than the "normal way"?
What am I doing wrong with map?
Does it make sense to speed up such tasks with multiprocessing at all?
Finally: after all I have read here, I am wondering if multiprocessing in Python works on Windows at all?
1) Your first problem is that there is no actual parallelism happening in foo(x): you are passing the entire list to the function once.
The idea of a process pool is to have many processes doing computations on separate bits of some data.
jobs = 4
size = len(List1)
pool = Pool(4)
results = []
# split the list into 4 equally sized chunks and submit those to the pool
heads = range(size/jobs, size, size/jobs) + [size]
tails = range(0,size,size/jobs)
for tail,head in zip(tails, heads):
    werte = pool.apply_async(foo, args=(List1[tail:head],))
    results.append(werte) # keep the async result so it can be collected later
pool.close() # no more jobs will be submitted
pool.join() # wait for the pool to be done
for result in results:
    werte = result.get() # get the return value from the sub jobs
This will only give you an actual speedup if the time it takes to process each chunk is greater than the time it takes to launch a process. That is the trade-off with four processes and four jobs to be done; of course these dynamics change if you've got 4 processes and 100 jobs to be done. Remember that you are creating a completely new Python interpreter four times, and this isn't free.
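The splitting step in the code above relies on Python 2 behaviour (integer division with / and range returning a list). For readers on Python 3, a minimal sketch of just that step might look as follows; chunk_bounds is a hypothetical helper name, and the data list is a stand-in for List1.

```python
def chunk_bounds(size, jobs):
    """Return (start, stop) index pairs that split range(size) into at most `jobs` slices."""
    if size == 0:
        return []
    step = -(-size // jobs)  # ceiling division, so no chunk is left over
    return [(start, min(start + step, size)) for start in range(0, size, step)]

data = list(range(10))                      # stand-in for List1
bounds = chunk_bounds(len(data), 4)
chunks = [data[start:stop] for start, stop in bounds]

print(bounds)
print(chunks)
```

Each slice data[start:stop] could then be submitted to the pool with apply_async, exactly as the tail/head pairs are used above.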
2) The problem you have with map is that it applies foo to EVERY element in List1 as a separate task, which will take quite a while. So if your pool has 4 processes, map will pop an item off the list four times and send each one to a process to be dealt with, wait for a process to finish, pop some more items off the list, wait for a process to finish, and so on. This makes sense only if processing a single item takes a long time, for instance if every item is a file name pointing to a one-gigabyte text file. But as it stands, map will just take a single item of the list and pass it to foo, whereas apply_async takes a slice of the list. Try the following code:
def foo(thing):
    print thing
map(foo, ['a','b','c','d'])
That's the built-in Python map, which runs in a single process, but the idea is exactly the same for the multiprocessing version.
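To see why a function written for the whole list does not fit map, here is a small Python 3 sketch using the built-in map. The names name_lengths and name_length are hypothetical, and the rows mirror the sample data from the question.

```python
rows = [("1", "aa", "Jacob", "Emily"), ("2", "bb", "Ethan", "Kayla")]

def name_lengths(all_rows):
    """List-processing shape, like the original foo: expects the whole list."""
    return [len(row[3]) for row in all_rows]

def name_length(row):
    """Per-element shape: map calls this once for each item."""
    return len(row[3])

whole_list_result = name_lengths(rows)
per_element_result = list(map(name_length, rows))

# Handing the list-processing function to map passes it one tuple at a time,
# so it ends up indexing into the individual strings and fails.
try:
    list(map(name_lengths, rows))
    mismatch_error = None
except IndexError as exc:
    mismatch_error = exc

print(whole_list_result, per_element_result, type(mismatch_error).__name__)
```

The same shape requirement applies to Pool.map: the function it receives should handle one element, not the whole list.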
Added as per J.F.Sebastian's comment: You can, however, use the chunksize argument to map to specify an approximate size for each chunk:
pool.map(foo, List1, chunksize=size/jobs)
I don't know, though, whether there is a problem with map on Windows, as I don't have a Windows machine available for testing.
3) Yes, provided that your problem is big enough to justify forking out new Python interpreters.
4) I can't give you a definitive answer on that, as it depends on the number of cores/processors etc., but in general it should be fine on Windows.
On question (2)
With the guidance of Dougal and Matti, I figured out what went wrong.
The original foo function processes a list of lists, while map requires a function to process single elements.
The new function should be
def foo2(x):
    TupWerte = []
    s = list(x[3])
    NewValue = choice(s)+choice(s)+choice(s)+choice(s)
    TupWerte = tuple(TupWerte)
    return TupWerte
and the block to call it:
jobs = 4
size = len(List1)
pool = Pool()
#Werte = pool.map(foo2, List1, chunksize=size/jobs)
Werte = pool.map(foo2, List1)
print Werte[1:3]
Thanks to all of you who helped me understand this.
Results of all methods:
For List * 2 million records: normal 13.3 seconds, parallel with async 7.5 seconds, parallel with map with chunksize 7.3 seconds, parallel with map without chunksize 5.2 seconds.
Here's a generic multiprocessing template if you are interested.
import multiprocessing as mp
import time
def worker(x):
    print "x= %s, x squared = %s" % (x, x*x)
    return x*x
def apply_async():
    pool = mp.Pool()
    for i in range(100):
        pool.apply_async(worker, args = (i, ))
    pool.close() # no more jobs will be submitted
    pool.join() # wait for all workers to finish
if __name__ == '__main__':
    apply_async()
And the output looks like this:
x= 0, x squared = 0
x= 1, x squared = 1
x= 2, x squared = 4
x= 3, x squared = 9
x= 4, x squared = 16
x= 6, x squared = 36
x= 5, x squared = 25
x= 7, x squared = 49
x= 8, x squared = 64
x= 10, x squared = 100
x= 11, x squared = 121
x= 9, x squared = 81
x= 12, x squared = 144
As you can see, the numbers are not in order, as they are being executed asynchronously.

