Improve performance of combinations - python

Hey guys, I have a script that compares every possible pair of users and checks how similar their text is:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
        similarity_score = fuzz.ratio(a[1][0], b[1][0])
        if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
            highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])
This script takes around 15 minutes to run. The dataframe contains 120k users, so comparing every possible combination takes quite a bit of time; if I just write pass in the for loop, it takes 2 minutes to loop through all the values.
I tried using filter() and map() for the if statements and the fuzzy score, but the performance was worse. I have improved the script as much as I could, and I don't know how to improve it any further.
Would really appreciate some help!

It is slightly complicated to reason about the data since you have not attached it, but we can see multiple places that might provide an improvement:
First, let's rewrite the code in a way which is easier to reason about than using the indices:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    a_id, (a_text, a_set, a_compare_string) = a
    b_id, (b_text, b_set, b_compare_string) = b
    if (a_compare_string == b_compare_string
            and not a_set.isdisjoint(b_set)):
        similarity_score = fuzz.ratio(a_text, b_text)
        if ((similarity_score >= 95 and len(a_text) >= 10)
                or similarity_score == 100):
            highly_similar.append(
                [a_id, b_id, a_text, b_text, similarity_score])
You seem to only care about pairs having the same compare_string value. Therefore, and assuming this is not something that all pairs share, we can key by that value and cover far fewer pairs.
To put some numbers on it, let's say you have 120K inputs and 1K items for each distinct compare_string value - then instead of covering 120K * 120K = 14 * 10^9 combinations, you would have 120 bins of size 1K (where in each bin we'd need to check all pairs) = 120 * 1K * 1K = 120 * 10^6, which is about 120 times faster. And it would be even faster if each bin has fewer than 1K elements.
import collections

# Create a dictionary from compare_string to all items
# with the same compare_string
items_by_compare_string = collections.defaultdict(list)
for item in dictionary.items():
    compare_string = item[1][2]
    items_by_compare_string[compare_string].append(item)

highly_similar = []

# Iterate over each group of items that have the same
# compare string
for item_group in items_by_compare_string.values():
    # Check pairs only within that group
    for a, b in itertools.combinations(item_group, 2):
        a_id, (a_text, a_set, _) = a
        b_id, (b_text, b_set, _) = b
        # No need to compare the compare_strings!
        if not a_set.isdisjoint(b_set):
            similarity_score = fuzz.ratio(a_text, b_text)
            if ((similarity_score >= 95 and len(a_text) >= 10)
                    or similarity_score == 100):
                highly_similar.append(
                    [a_id, b_id, a_text, b_text, similarity_score])
But what if we want more speed? Let's look at the remaining operations:
- We have a check to find whether two sets share at least one item.
  This seems like an obvious candidate for optimization if we have any knowledge about these sets (to allow us to determine which pairs are even relevant to compare).
  Without additional knowledge, just looking at every pair and trying to speed this up, I doubt we can do much - this is probably highly optimized using internal details of Python sets, and I don't think it's likely we can optimize it further.
- We have a fuzz.ratio computation, which is some external function that I'm going to assume is heavy.
  If you are using this from the FuzzyWuzzy package, make sure to install python-Levenshtein to get the speedups detailed here.
- We have some comparisons, which we are unlikely to be able to speed up.
  We might be able to cache the length of a_text by nesting the two loops, but that's negligible.
- We have appends to a list, which run in average ("amortized") constant time per operation, so we can't really speed that up.
EDIT: As pointed out in other answers, you can obviously also run the code with multiprocessing. I assumed you were looking for an algorithmic change that would significantly reduce the number of operations, rather than just splitting the same work over more CPUs.
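If you do want to combine the two, here is a minimal sketch of the multiprocessing variant (my own example, not part of the original answer): it assumes items_by_compare_string was built as above and that fuzz.ratio comes from a FuzzyWuzzy/RapidFuzz-style package, and it simply hands each compare_string group to a worker process.

import itertools
import multiprocessing

def compare_group(item_group):
    # Compare all pairs inside one compare_string group (same logic as above).
    found = []
    for (a_id, (a_text, a_set, _)), (b_id, (b_text, b_set, _)) in itertools.combinations(item_group, 2):
        if not a_set.isdisjoint(b_set):
            score = fuzz.ratio(a_text, b_text)
            if (score >= 95 and len(a_text) >= 10) or score == 100:
                found.append([a_id, b_id, a_text, b_text, score])
    return found

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        # One task per group; flatten the per-group result lists at the end.
        highly_similar = [row
                          for group in pool.map(compare_group, items_by_compare_string.values())
                          for row in group]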

Essentially, from the Python programming side, I see two things that can improve your processing time: multi-threading and vectorized operations.
From the fuzzy-score side, here is a list of tips you can use to improve your processing time (open it in a new anonymous tab to avoid the paywall):
https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
Using multiple threads you can speed up your operation by up to N times, where N is the number of threads in your CPU. You can check it with:
import multiprocessing
multiprocessing.cpu_count()
Using vectorized operations you can process your data in parallel at a low level with SIMD (single instruction / multiple data) instructions, or with GPU tensor operations (like those in TensorFlow/PyTorch).
Here is a small comparison of results for each case:
import numpy as np
import time

A = [np.random.rand(512) for i in range(2000)]
B = [np.random.rand(512) for i in range(2000)]

high_similarity = []

def measure(i, j, a, b, high_similarity):
    d = ((a - b)**2).sum()
    if d > 12:
        high_similarity.append((i, j, d))

start_single_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            measure(i, j, A[i], B[j], high_similarity)
finish_single_thread = time.time()
print("single thread time:", finish_single_thread - start_single_thread)
out[0] single thread time: 147.64517450332642
Running on multiple threads:
from threading import Thread

high_similarity = []

def measure(a=None, b=None, high_similarity=None):
    d = ((a - b)**2).sum()
    if d > 12:
        high_similarity.append(d)

start_multi_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            thread = Thread(target=measure, kwargs={'a': A[i], 'b': B[j], 'high_similarity': high_similarity})
            thread.start()
            thread.join()
finish_multi_thread = time.time()
print("time to run on multi threads:", finish_multi_thread - start_multi_thread)
out[1] time to run on multi-threads: 11.946279764175415
A_array = np.array(A)
B_array = np.array(B)

start_vectorized = time.time()
for i in range(len(A_array)):
    # vectorized distance operation: every row of A against the currently aligned row of B
    dists = ((A_array - B_array)**2).sum(axis=1)
    high_similarity += dists[dists > 12].tolist()
    # rotate B_array by one position so the next pass compares different pairs
    # (np.delete/np.insert return new arrays, so the results must be reassigned)
    aux = B_array[-1]
    B_array = np.delete(B_array, -1, axis=0)
    B_array = np.insert(B_array, 0, aux, axis=0)
finish_vectorized = time.time()
print("time to run vectorized operations:", finish_vectorized - start_vectorized)
out[2] time to run vectorized operations: 2.302949905395508
Note that you can't guarantee any order of execution, so you will also need to store the index of each result. The snippets above are just meant to illustrate that you can use parallel processing, but I highly recommend using a pool of workers and dividing your dataset into N subsets, one per worker, then joining the final results (instead of creating a thread for each function call like I did); a sketch of that layout is shown below.
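To make that last recommendation concrete, here is a minimal sketch of the chunked-pool layout (my own example built on the same toy A/B data and threshold as above, not the answer's code): the row indices are split into one chunk per CPU, each worker computes its block of pairwise squared distances, and the partial results are merged at the end.

import numpy as np
from multiprocessing import Pool, cpu_count

A = np.random.rand(2000, 512)
B = np.random.rand(2000, 512)

def process_chunk(rows):
    # Each worker handles a contiguous block of row indices i and compares
    # A[i] against every B[j] with j > i, keeping the (i, j, distance) hits.
    found = []
    for i in rows:
        dists = ((A[i] - B[i + 1:]) ** 2).sum(axis=1)
        for offset in np.nonzero(dists > 12)[0]:
            found.append((i, i + 1 + offset, dists[offset]))
    return found

if __name__ == '__main__':
    chunks = np.array_split(np.arange(len(A)), cpu_count())
    with Pool() as pool:
        high_similarity = [hit for part in pool.map(process_chunk, chunks)
                           for hit in part]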


How to use multiprocessing in Python on for loop to generate nested dictionary?

I am currently generating a nested dictionary that saves some arrays by using a nested for loop. Unfortunately, it takes quite some time; I realized that the server I am working on has a few cores available, so I was wondering if Python's multiprocessing library could be helpful to speed up the creation of the dictionary.
The nested for loop looks something like this (the actual computation is heavier and more complex):
import numpy as np

data_dict = {}
for s in range(1, 5):
    data_dict[s] = {}
    for d in range(1, 5):
        if s * d > 4:
            data_dict[s][d] = np.zeros((s, d))
        else:
            data_dict[s][d] = np.ones((s, d))
So this is what I tried:
from multiprocessing import Pool
import numpy as np

data_dict = {}

def process():
    #sci=fits.open('{}.fits'.format(name))
    for s in range(1, 5):
        data_dict[s] = {}
        for d in range(1, 5):
            if s * d > 4:
                data_dict[s][d] = np.zeros((s, d))
            else:
                data_dict[s][d] = np.ones((s, d))

if __name__ == '__main__':
    pool = Pool()  # Create a multiprocessing Pool
    pool.map(process)
But pool.map (last line) seems to require an iterable, and I'm not sure what to pass there.
In my opinion, the real problem is what kind of processing is needed to compute the entries of the dictionary and how many entries there are.
The kind of processing is essential to understand whether multiprocessing can significantly speed up the creation of the dictionary. If your computation is I/O bound, you should use multithreading, while if it's CPU bound you should use multiprocessing. You can find more about this here.
Assuming that the value of each entry can be computed independently and that this computation is CPU bound, let's benchmark the difference between single process and multiprocess implementation (based on multiprocessing library).
The following code is used to test the two approaches in some scenarios, varying the complexity of the computation needed for each entry and the number of entries (for the multiprocess implementation, 7 processes were used).
import timeit
import numpy as np

def some_fun(s, d, n=1):
    """A function with an adaptable complexity"""
    a = s * np.ones(np.random.randint(1, 10, (2,))) / (d + 1)
    for _ in range(n):
        a += np.random.random(a.shape)
    return a

# Code to create dictionary with only one process
setup_simple = "from __main__ import some_fun, n_first_level, n_second_level, complexity"

code_simple = """
data_dict = {}
for s in range(n_first_level):
    data_dict[s] = {}
    for d in range(n_second_level):
        data_dict[s][d] = some_fun(s, d, n=complexity)
"""

# Code to create a dictionary with multiprocessing: we are going to use all the available cores except 1
setup_mp = """import numpy as np
import multiprocessing as mp
import itertools
from functools import partial
from __main__ import some_fun, n_first_level, n_second_level, complexity

n_processes = mp.cpu_count() - 1
# Uncomment if you want to know how many concurrent processes are you going to use
# print(f'{n_processes} concurrent processes')
"""

code_mp = """
with mp.Pool(processes=n_processes) as pool:
    dict_values = pool.starmap(partial(some_fun, n=complexity), itertools.product(range(n_first_level), range(n_second_level)))
data_dict = {
    k: dict(zip(range(n_second_level), dict_values[k * n_second_level: (k + 1) * n_second_level]))
    for k in range(n_first_level)
}
"""

# Time the code with different settings
print('Execution time on 10 repetitions: mean [std]')
for label, complexity, n_first_level, n_second_level in (
        ("TRIVIAL FUNCTION", 0, 10, 10),
        ("TRIVIAL FUNCTION", 0, 500, 500),
        ("SIMPLE FUNCTION", 5, 500, 500),
        ("COMPLEX FUNCTION", 50, 100, 100),
        ("HEAVY FUNCTION", 1000, 10, 10),
):
    print(f'\n{label}, {n_first_level * n_second_level} dictionary entries')
    for l, t in (
            ('Single process', timeit.repeat(stmt=code_simple, setup=setup_simple, number=1, repeat=10)),
            ('Multiprocess', timeit.repeat(stmt=code_mp, setup=setup_mp, number=1, repeat=10)),
    ):
        print(f'\t{l}: {np.mean(t):.3e} [{np.std(t):.3e}] seconds')
These are the results:
Execution time on 10 repetitions: mean [std]

TRIVIAL FUNCTION, 100 dictionary entries
    Single process: 7.752e-04 [7.494e-05] seconds
    Multiprocess:   1.163e-01 [2.024e-03] seconds

TRIVIAL FUNCTION, 250000 dictionary entries
    Single process: 7.077e+00 [7.098e-01] seconds
    Multiprocess:   1.383e+00 [7.752e-02] seconds

SIMPLE FUNCTION, 250000 dictionary entries
    Single process: 1.405e+01 [1.422e+00] seconds
    Multiprocess:   2.858e+00 [5.742e-01] seconds

COMPLEX FUNCTION, 10000 dictionary entries
    Single process: 1.557e+00 [4.330e-02] seconds
    Multiprocess:   5.383e-01 [5.330e-02] seconds

HEAVY FUNCTION, 100 dictionary entries
    Single process: 3.181e-01 [5.026e-03] seconds
    Multiprocess:   1.171e-01 [2.494e-03] seconds
As you can see, assuming that you have a CPU-bound computation, the multiprocess approach achieves better results in most of the scenarios. Only if you have a very light computation for each entry and/or a very limited number of entries should the single-process approach be preferred.
On the other hand, the improvement provided by multiprocessing comes with a cost: for example, if your computation for each entry uses a significant amount of memory, you could run into an out-of-memory error, meaning that you have to rework your code and make it more complex to avoid it, finding the right balance between memory occupation and decrease in execution time. If you look around, there are a lot of questions asking how to solve memory issues caused by a non-optimal use of multiprocessing. In other words, your code will become less easy to read and maintain.
To sum up, you should judge whether the improvement in execution time is worth it, even if it is possible.

Python for loop: Proper implementation of multiprocessing

The following for loop is part of an iterative simulation process and is the main bottleneck regarding computational time:
import numpy as np

class Simulation(object):
    def __init__(self, n_int):
        self.n_int = n_int

    def loop(self):
        for itr in range(self.n_int):
            #some preceding code which updates rows_list and diff with every itr
            cols_red_list = []
            rows_list = list(range(2500))  #row idx for diff where negative element is known to appear
            diff = np.random.uniform(-1.323, 3.780, (2500, 300))  #np.random.uniform is just used as toy example
            for row in rows_list:
                col = next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
                cols_red_list.append(col)
            # some subsequent code which uses the cols_red_list data

sim1 = Simulation(n_int=10)
sim1.loop()
Hence, I tried to parallelize it by using the multiprocessing package, in the hope of reducing the computation time:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial

def crossings(row, diff):
    return next(idx for idx, val in enumerate(diff[row,:]) if val < 0)

class Simulation(object):
    def __init__(self, n_int):
        self.n_int = n_int

    def loop(self):
        for itr in range(self.n_int):
            #some preceding code which updates rows_list and diff with every itr
            rows_list = list(range(2500))
            diff = np.random.uniform(-1, 1, (2500, 300))
            if __name__ == '__main__':
                num_of_workers = cpu_count()
                print('number of CPUs : ', num_of_workers)
                pool = Pool(num_of_workers)
                cols_red_list = pool.map(partial(crossings, diff=diff), rows_list)
                pool.close()
                print(len(cols_red_list))
            # some subsequent code which uses the cols_red_list data

sim1 = Simulation(n_int=10)
sim1.loop()
Unfortunately, the parallelization turns out to be much slower than the sequential piece of code.
Hence my question: did I use the multiprocessing package properly in that particular example? Are there alternative ways to parallelize the above-mentioned for loop?
Disclaimer: As you're trying to reduce the runtime of your code through parallelisation, this doesn't strictly answer your question but it might still be a good learning opportunity.
As a golden rule, before moving to multiprocessing to improve performance (execution time), one should first optimise the single-threaded case.
Your

rows_list = list(range(2500))

generates the numbers 0 to 2499 (that's the range) and stores them in memory (list), which requires time to allocate the required memory and to do the actual writes. You then only use these predictable values once each, by reading them from memory (which also takes time), in a predictable order:
for row in rows_list:
This is particularly relevant to the runtime of your loop function as you do it repeatedly (for itr in range(n_int):).
Instead, consider generating the number only when you need it, without an intermediate store (which conceptually removes any need to access RAM):
for row in range(2500):
Secondly, on top of sharing the same issue (unnecessary accesses to memory), the following:
diff = np.random.uniform(-1, 1, (2500, 300))
# ...
col = next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
seems to me to be optimisable at the level of math (or logic).
What you're trying to do is get a random variable (that col index) by defining it as "the first time I encounter a random variable in [-1;1] that is lower than 0". But notice that figuring out if a random variable with a uniform distribution over [-α;α] is negative, is the same as having a random variable over {0,1} (i.e. a bool).
Therefore, you're now working with bools instead of floats and you don't even have to do the comparison (val < 0), as you already have a bool. This potentially makes the code much faster. Using the same idea as for rows_list, you can generate that bool only when you need it, testing until it is True (or False, choose one; it obviously doesn't matter). By doing so, you only generate as many random bools as you need, not more and not less (by the way, what happens in your code if none of the 300 elements in the row is negative?):
for _ in range(n_int):
    cols_red_list = []
    for row in range(2500):
        col = next(i for i in itertools.count() if random.getrandbits(1))
        cols_red_list.append(col)

or, with a list comprehension:

cols_red_list = [next(i for i in count() if getrandbits(1))
                 for _ in range(2500)]
I'm sure that, through proper statistical analysis, you can even express that col random variable as a non-uniform variable over [0;limit[, allowing you to compute it much faster.
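For what it's worth, here is a hedged sketch of that statistical shortcut (my own assumption, to be checked against the real simulation): the 0-based index of the first "success" in a sequence of fair coin flips follows a geometric distribution, so the whole column list can be drawn in a single vectorised call. Note that this ignores the 300-column cap of the original diff array.

import numpy as np

rng = np.random.default_rng()
# geometric() counts the trials up to and including the first success,
# so subtract 1 to get the 0-based column index.
cols_red_list = (rng.geometric(p=0.5, size=2500) - 1).tolist()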
Please test the performance of an "optimized" version of your single-threaded implementation first. If the runtime is still not acceptable, you should then look into multithreading.
multiprocessing uses system processes (not threads!) for parallelization, which require expensive IPC (inter-process communication) to share data.
This bites you in two spots:
diff = np.random.uniform(-1, 1, (2500, 300)) creates a large matrix which is expensive to pickle/copy to another process
rows_list = list(range(2500)) creates a smaller list, but the same applies here.
To avoid this expensive IPC, you have one and a half choices:
If on a POSIX-compliant system, initialize your variables on the module level, that way each process gets a quick-and-dirty copy of the required data. This is not scalable as it requires POSIX, weird architecture (you probably don't want to put everything on the module level), and doesn't support sharing changes to that data.
Use shared memory. This mostly supports only primitive data types, but mp.Array should cover your needs.
The second problem is that setting up a pool is expensive, as num_cpu processes need to be started. Your workload is small enough to be negligible compared to this overhead. A good practice is to only create one pool and reuse it.
Here is a quick-and-dirty example of the POSIX only solution:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial

n_int = 10
rows_list = np.array(range(2500))
diff = np.random.uniform(-1, 1, (2500, 300))

def crossings(row, diff):
    return next(idx for idx, val in enumerate(diff[row,:]) if val < 0)

def workload(_):
    cols_red_list = [crossings(row, diff) for row in rows_list]
    print(len(cols_red_list))

class Simulation(object):
    def loop(self):
        num_of_workers = cpu_count()
        with Pool(num_of_workers) as pool:
            pool.map(workload, range(10))
            pool.close()

sim1 = Simulation()
sim1.loop()
For me (and my two cores) this is roughly twice as fast as the sequential version.
Update with shared memory:
import numpy as np
from multiprocessing import Pool, cpu_count, Array
from functools import partial

n_int = 10
ROW_COUNT = 2500

### WORKER

diff = None
result = None

def init_worker(*args):
    global diff, result
    (diff, result) = args

def crossings(i):
    result[i] = next(idx for idx, val in enumerate(diff[i*300:(i+1)*300]) if val < 0)

### MAIN

class Simulation():
    def loop(self):
        num_of_workers = cpu_count()
        diff = Array('d', range(ROW_COUNT*300), lock=False)
        result = Array('i', ROW_COUNT, lock=False)
        # Shared memory needs to be passed when workers are spawned
        pool = Pool(num_of_workers, initializer=init_worker, initargs=(diff, result))
        for i in range(n_int):
            # SLOW, I assume you use a different source of values anyway.
            diff[:] = np.random.uniform(-1, 1, ROW_COUNT*300)
            pool.map(partial(crossings), range(ROW_COUNT))
            print(len(result))
        pool.close()

sim1 = Simulation()
sim1.loop()
A few notes:
Shared memory needs to be set up at worker creation, so it's global anyway.
This still isn't faster than the sequential version, but that's mainly due to random.uniform needing to be copied entirely into shared memory. I assume these are just values for testing, and in reality you'd fill it differently anyway.
I only pass indices to the worker, and use them to read and write values to the shared memory.

High Memory Usage when python multiprocessing run in Windows

The code below is a contrived example that simulates an actual problem of mine, which uses multiprocessing to speed up the code. The code is run on Windows 10 64-bit, Python 3.7.5, and IPython 7.9.0.
The transformation functions (these functions will be used to transform arrays in main()):
from itertools import product
from functools import partial
from numba import njit, prange
import multiprocessing as mp
import numpy as np

@njit(parallel=True)
def transform_array_c(data, n):
    ar_len = len(data)
    sec_max1 = np.empty(ar_len, dtype=data.dtype)
    sec_max2 = np.empty(ar_len, dtype=data.dtype)
    for i in prange(n - 1):
        sec_max1[i] = np.nan
    for sec in prange(ar_len // n):
        s2_max = data[n*sec + n - 1]
        s1_max = data[n*sec + n]
        for i in range(n - 1, -1, -1):
            if data[n*sec + i] > s2_max:
                s2_max = data[n*sec + i]
            sec_max2[n*sec + i] = s2_max
        sec_max1[n*sec + n - 1] = sec_max2[n*sec]
        for i in range(n - 1):
            if n*sec + n + i < ar_len:
                if data[n*sec + n + i] > s1_max:
                    s1_max = data[n*sec + n + i]
                sec_max1[n*sec + n + i] = max(s1_max, sec_max2[n*sec + i + 1])
            else:
                break
    return sec_max1

@njit(error_model='numpy', cache=True)
def rt_mean_sq_dev(array1, array2, n):
    msd_temp = np.empty(array1.shape[0])
    K = array2[n - 1]
    rs_x = array1[0] - K
    rs_xsq = rs_x * rs_x
    msd_temp[0] = np.nan
    for i in range(1, n):
        rs_x += array1[i] - K
        rs_xsq += np.square(array1[i] - K)
        msd_temp[i] = np.nan
    y_i = array2[n - 1] - K
    msd_temp[n - 1] = np.sqrt(max(y_i*y_i + (rs_xsq - 2*y_i*rs_x)/n, 0))
    for i in range(n, array1.shape[0]):
        rs_x = array1[i] - array1[i - n] + rs_x
        rs_xsq = np.square(array1[i] - K) - np.square(array1[i - n] - K) + rs_xsq
        y_i = array2[i] - K
        msd_temp[i] = np.sqrt(max(y_i*y_i + (rs_xsq - 2*y_i*rs_x)/n, 0))
    return msd_temp

@njit(cache=True)
def transform_array_a(data, n):
    result = np.empty(data.shape[0], dtype=data.dtype)
    alpharev = 1. - 2 / (n + 1)
    alpharev_exp = alpharev
    e = data[0]
    w = 1.
    if n == 2:
        result[0] = e
    else:
        result[0] = np.nan
    for i in range(1, data.shape[0]):
        w += alpharev_exp
        e = e*alpharev + data[i]
        if i > n - 3:
            result[i] = e / w
        else:
            result[i] = np.nan
        if alpharev_exp > 3e-307:
            alpharev_exp *= alpharev
        else:
            alpharev_exp = 0.
    return result
The multiprocessing part
def func(tup, data):  # <------------- the function to be run among all
    a_temp = a[tup[2][0]]
    idx1 = a_temp > a[tup[2][1]]
    idx2 = a_temp < b[(tup[2][1], tup[1][1])]
    c_final = c[tup[0][1]][idx1 | idx2]
    data_final = data[idx1 | idx2]
    return (tup[0][0], tup[1][0], *tup[2]), c_final[-1] - data_final[-1]

def setup(a_dict, b_dict, c_dict):  # initialize the shared dictionaries
    global a, b, c
    a, b, c = a_dict, b_dict, c_dict

def main(a_arr, b_arr, c_arr, common_len):
    np.random.seed(0)
    data_array = np.random.normal(loc=24004, scale=500, size=common_len)
    a_size = a_arr[-1] + 1
    b_size = len(b_arr)
    c_size = len(c_arr)
    loop_combo = product(enumerate(c_arr),
                         enumerate(b_arr),
                         (n_tup for n_tup in product(np.arange(1, a_arr[-1]), a_arr) if n_tup[1] > n_tup[0])
                         )
    result = np.zeros((c_size, b_size, a_size - 1, a_size), dtype=np.float32)

    ###################################################
    # This part simulates the heavy computation in the actual problem
    a = {}
    b = {}
    c = {}
    for i in range(1, a_arr[-1] + 1):
        a[i] = transform_array_a(data_array, i)
        if i in a_arr:
            for j in b_arr:
                b[(i, j)] = rt_mean_sq_dev(data_array, a[i], i) / data_array * j
    for i in c_arr:
        c[i] = transform_array_c(data_array, i)
    ###################################################

    with mp.Pool(processes=mp.cpu_count() - 1,
                 initializer=setup,
                 initargs=[a, b, c]
                 ) as pool:
        mp_res = pool.imap_unordered(partial(func, data=data_array),
                                     loop_combo
                                     )
        for item in mp_res:
            result[item[0]] = item[1]
    return result

if __name__ == '__main__':
    mp.freeze_support()
    a_arr = np.arange(2, 44, 2)
    b_arr = np.arange(0.4, 0.8, 0.20)
    c_arr = np.arange(2, 42, 10)
    common_len = 440000
    final_res = main(a_arr, b_arr, c_arr, common_len)
For performance reasons, multiple shared "read-only" dictionaries are used among all processes to reduce redundant computations (in the actual problem, the total computation time is reduced by 40% after sharing the dictionaries among all processes). However, the RAM usage becomes absurdly high after using the shared dictionaries in my actual problem; memory usage on my 6C/12T Windows computer goes from (8.2 GB peak, 5.0 GB idle) to (23.9 GB peak, 5.0 GB idle), a little too high a cost to pay in order to gain a 40% speed-up.
Is the high RAM usage unavoidable when sharing multiple pieces of data among processes is a must? What can be done to my code to make it as fast as possible while using as little memory as possible?
Thank you in advance.
Note: I tried using imap_unordered() instead of map because I heard it is supposed to reduce memory usage when the input iterable is large, but I honestly can't see an improvement in RAM usage. Maybe I have done something wrong here?
EDIT: Following the feedback in the answers, I have already changed the heavy-computation part of the code so that it looks less like a dummy and more closely resembles the computation in the actual problem.
High Memory Usage when manipulating shared dictionaries in python multiprocessing run in Windows
It is fair to demystify the problem a bit before we move into details: there are no shared dictionaries in the original code, and even less do they get manipulated (yes, each of a, b, c did get "assigned" a reference to dict_a, dict_b, dict_c, yet none of them is shared - they just get replicated, as multiprocessing does on Windows-class operating systems). There are no writes "into" the dicts, just non-destructive reads from each of their replicas.
Similarly, np.memmap()-s make it possible to put some part of the originally proposed data onto disk space (at the cost of doing so, plus bearing some latency-masked random-read latency of ~10 ms instead of ~0.5 ns if smart-aligned, vectorised memory patterns were designed into the performance hot-spot), yet no dramatic change of paradigm ought to be expected here, as the "external iterator" almost prevents any smart-aligned cache re-use.
Q: What can be done to my code to make it as fast as possible while using as little memory as possible?
The first sin was using an 8-byte int64 to store one plain bit (no qubits here yet ~ all salutes to the Burnaby Quantum R&D teams):
for i in c_arr:                                    # ~~ np.arange( 2, 42, 10 )
    np.random.seed( i )                            # ~ yields a deterministic state
    c[i] = np.random.poisson( size = common_len )  # ~ 440.000 int64-s with {0|1}
This took 6 (processes) x 440000 x 8 B ~ 0.021 GB, "smuggled" into all copies of dictionary c, whereas each and every such value is deterministically known and could be generated as late as possible inside the respective target process, just from the value of i (indeed there is no need to pre-generate and replicate ~0.021 GB of data many times over).
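A minimal sketch of that point (my own illustration, assuming the 440000-element shape from the question): instead of pre-building and replicating dictionary c, each worker can regenerate the deterministic array on demand from the index i it already receives.

import numpy as np

COMMON_LEN = 440000  # assumed to match the question's common_len

def get_c(i):
    # Recreate the deterministic array inside the worker process,
    # instead of shipping a pre-built copy of dictionary c to it.
    np.random.seed(i)
    return np.random.poisson(size=COMMON_LEN)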
So far, Windows-class operating systems lack an os.fork() and thus make a full Python copy (yes, RAM..., yes, TIME) of as many replicated Python-interpreter sessions (plus importing the main module) as were requested by multiprocessing for process-based separation (doing that in order to avoid a GIL-lock-ordered, purely serial code execution).
The best next step: re-factor the code for both efficiency and performance.
Re-factor the code so as to minimise the "shallow" (and expensive) use of the 6 processes, which are "externally" commanded by a central iterator (the loop_combo "dictator" with ~18522 items that repeatedly calls a "remotely dispatched" func(tup, data) just to fetch back a simple tuple ((x, y, z), value) and store one value into a central-process result float32 array).
Try to increase the computing "density": re-factor the code in a divide-and-conquer manner, i.e. so that each of the mp.Pool processes computes, in one smooth block, a sizeable, dedicated sub-space of the parameter space (which is currently iterated over "from outside") and can easily reduce the returned blocks of results. Performance will only improve by doing this (best without any form of expensive sharing); a rough sketch is given after these notes.
This re-factoring will avoid the parameter pickle/unpickle costs (add-on overheads, both the one-time cost of passing the unique parameter-set values and the repetitive cost of roughly 18522 repeated memory allocations, build-ups and pickle/unpickle steps of a ~440000-element array, caused by a poor call-signature design).
All these steps will improve your processing efficiency and reduce the unnecessary RAM allocations.
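A rough sketch of that divide-and-conquer layout (my own example of how it could look, not the answer's code; it reuses func, loop_combo, data_array, a, b, c and result from the question): each worker receives one contiguous chunk of the parameter combinations and returns its whole block of results in a single round-trip, instead of ~18522 tiny ones.

import multiprocessing as mp

def init_worker(a_dict, b_dict, c_dict, data):
    # Hand the read-only inputs to each worker once, at start-up.
    global a, b, c, data_array
    a, b, c, data_array = a_dict, b_dict, c_dict, data

def process_chunk(chunk):
    # One smooth block of work per task: evaluate every combination in the
    # chunk locally and return all (index, value) pairs in a single reply.
    return [func(tup, data_array) for tup in chunk]

def chunked(items, n_chunks):
    step = -(-len(items) // n_chunks)  # ceiling division
    return [items[k:k + step] for k in range(0, len(items), step)]

# Inside main(), instead of dispatching ~18522 single calls:
combos = list(loop_combo)
with mp.Pool(processes=mp.cpu_count() - 1,
             initializer=init_worker, initargs=[a, b, c, data_array]) as pool:
    for block in pool.imap_unordered(process_chunk, chunked(combos, 4 * mp.cpu_count())):
        for key, value in block:
            result[key] = value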

Why does my Python script run slower than it should on my HeapSort implementation?

I've got an assignment to implement the heap sort algorithm in either Python or Java (or any other language). Since I'm not really that "fluent" in Python or Java, I decided to do both.
But here I ran into a problem: the running time of the program is way higher than it "should" be.
By that I mean that heap sort is supposed to run in O(n * log n), and for a current processor running at a clock rate of several GHz I didn't expect that algorithm to take over 2000 seconds for an array of size 320k.
So here is what I've done: I implemented the algorithm from the pseudo-code of this sort in Python and in Java (I also tried the code in Julia from Rosetta Code to see if the running time was similar; why Julia? Random pick).
So I checked the output for small input sizes, such as arrays of size 10, 20 and 30. It appears that the array is correctly sorted in both languages/implementations.
Then I used the heapq library, which implements this same algorithm, to check once again whether the running time was similar. It surprised me that it was actually the case... But after a few tries I tried one last thing, which was updating Python, and then the program using heapq ran much faster than the previous ones. It used to be around 2k seconds for the 320k array and now it's around 1.5 seconds or so.
I retried my algorithm and the problem was still there.
So here are the Heapsort class that I implemented:
class MaxHeap:
    heap = []

    def __init__(self, data=None):
        if data is not None:
            self.buildMaxHeap(data)

    @classmethod
    def toString(cls):
        return str(cls.heap)

    @classmethod
    def add(cls, elem):
        cls.heap.insert(len(cls.heap), elem)
        cls.buildMaxHeap(cls.heap)

    @classmethod
    def remove(cls, elem):
        try:
            cls.heap.pop(cls.heap.index(elem))
        except ValueError:
            print("The value you tried to remove is not in the heap")

    @classmethod
    def maxHeapify(cls, heap, i):
        left = 2 * i + 1
        right = 2 * i + 2
        largest = i
        n = len(heap)
        if left < n and heap[left] > heap[largest]:
            largest = left
        if right < n and heap[right] > heap[largest]:
            largest = right
        if largest != i:
            heap[i], heap[largest] = heap[largest], heap[i]
            cls.maxHeapify(heap, largest)

    @classmethod
    def buildMaxHeap(cls, heap):
        for i in range(len(heap) // 2, -1, -1):
            cls.maxHeapify(heap, i)
        cls.heap = heap

    @staticmethod
    def heapSort(table):
        heap = MaxHeap(table)
        output = []
        i = len(heap.heap) - 1
        while i >= 0:
            heap.heap[0], heap.heap[i] = heap.heap[i], heap.heap[0]
            output = [heap.heap[i]] + output
            heap.remove(heap.heap[i])
            heap.maxHeapify(heap.heap, 0)
            i -= 1
        return output
To log the runtime for each array size (10000 - 320000) I use this loop in the main function:
i = 10000
while i <= 320000:
    tab = [0] * i
    j = 0
    while j < i:
        tab[j] = randint(0, i)
        j += 1
    start = time()
    MaxHeap.heapSort(tab)
    end = time()
    pprint.pprint("Size of the array " + str(i))
    pprint.pprint("Total execution time: " + str(end - start) + "s")
    i *= 2
If you need the rest of the code to see where the error could be, don't hesitate to ask and I'll provide it. I just didn't want to share the whole file for no reason.
As said earlier, the running time I expected follows from the worst-case running time, O(n * log n).
With a modern architecture and a 2.6 GHz processor I would expect something around 1 second or even less (since the running time is asked for in nanoseconds, I suppose that even 1 second is still too long).
Here are the results:

Python (own)              Java (own)
Time         Size         Time         Size
593ms.       10k          243ms.       10k
2344ms.      20k          600ms.       20k
9558ms.      40k          1647ms.      40k
38999ms.     80k          6666ms.      80k
233811ms.    160k         62789ms.     160k
1724926ms.   320k         473177ms.    320k

Python (heapq)            Julia (Rosetta Code)
Time         Size         Time         Size
6ms.         10k          21ms.        10k
14ms.        20k          21ms.        20k
15ms.        40k          23ms.        40k
34ms.        80k          28ms.        80k
79ms.        160k         39ms.        160k
168ms.       320k         60ms.        320k
And according to the formula, O(n * log n) gives me:
  40000      10k
  86021      20k
 184082      40k
 392247      80k
 832659      160k
1761648      320k
I think that these results could be used to determine (theoretically) how much time it should take depending on the machine.
As you can see, the high running time comes from my algorithm, but I can't tell where in the code, and that's why I'm asking for help here. (It runs slowly in both Java and Python.) (I didn't try a heap sort from a Java library, if there is one, to see the difference with my implementation - my bad.)
Thanks a lot.
Edit: I forgot to add that I run this program on a MacBook Pro (latest macOS, i7 2.6 GHz), in case the problem could also come from something other than the code.
Edit 2: Here are the modifications I made to the algorithm, following the answer I received. The program runs approximately 200 times faster than before, so it now takes barely 2 seconds for the array of size 320k.
class MaxHeap:
    def __init__(self, data=None):
        self.heap = []
        self.size = 0
        if data is not None:
            self.size = len(data)
            self.buildMaxHeap(data)

    def toString(self):
        return str(self.heap)

    def add(self, elem):
        self.heap.insert(self.size, elem)
        self.size += 1
        self.buildMaxHeap(self.heap)

    def remove(self, elem):
        try:
            self.heap.pop(self.heap.index(elem))
        except ValueError:
            print("The value you tried to remove is not in the heap")

    def maxHeapify(self, heap, i):
        left = 2 * i + 1
        right = 2 * i + 2
        largest = i
        if left < self.size and heap[left] > heap[largest]:
            largest = left
        if right < self.size and heap[right] > heap[largest]:
            largest = right
        if largest != i:
            heap[i], heap[largest] = heap[largest], heap[i]
            self.maxHeapify(heap, largest)

    def buildMaxHeap(self, heap):
        for i in range(self.size // 2, -1, -1):
            self.maxHeapify(heap, i)
        self.heap = heap

    @staticmethod
    def heapSort(table):
        heap = MaxHeap(table)
        i = len(heap.heap) - 1
        while i >= 0:
            heap.heap[0], heap.heap[i] = heap.heap[i], heap.heap[0]
            heap.size -= 1
            heap.maxHeapify(heap.heap, 0)
            i -= 1
        return heap.heap
And it runs using the same main as given before
It's interesting that you posted the clock speed of your computer - you COULD calculate the actual number of steps your algorithm requires... but you would need to know an awful lot about the implementation. For example, in Python, every time an object is created or goes out of scope, the interpreter updates counters on the underlying object and frees the memory if those ref counts reach 0. Instead, you should look at the relative speed.
The third-party examples you posted show the speed as less than doubling when the input array length doubles. That doesn't seem right, does it? It turns out that for those examples the initial work of building the array probably dominates the time spent sorting it!
In your code, there is already one comment that calls out what I was going to say...
heap.remove(heap.heap[i])
This operation will go through your list (starting at index 0) looking for a value that matches, and then delete it. This is already bad (if it works as intended, you are doing up to 320k comparisons on that line!). But it gets worse: deleting an object from a list is not an in-place modification; every object after the deleted one has to be moved forward in the list. Finally, there is no guarantee that you are actually removing the last object there... duplicate values could exist!
Here is a useful website that lists the complexity of various operations in python - https://wiki.python.org/moin/TimeComplexity. In order to implement an algorithm as efficiently as possible, you need as many of your data structure operations to be O(1) as possible. Here is an example... here is some original code, presumably with heap.heap being a list...
output = [heap.heap[i]] + output
heap.remove(heap.heap[i])

doing

output.append(heap.heap.pop())

would avoid allocating a new list AND use a constant-time operation to mutate the old one. (It is much better to just use the output backwards than to use the O(n)-time insert(0) method! You could use a deque object for output to get the appendleft method if you really need the order; see the small illustration below.)
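A small illustration of the deque suggestion (my example, not the original code): appendleft on collections.deque is O(1), whereas inserting at the front of a list is O(n) because every existing element has to shift.

from collections import deque

output = deque()
for value in (3, 1, 2):
    output.appendleft(value)  # constant-time prepend
print(list(output))           # [2, 1, 3]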
If you posted your whole code there are probably lots of other little things we could help with. Hopefully this helped!

Python, multiprocessing.pool took about the same amount of time as a for loop

I am trying to use Python to process some large data sets from several data stations. My idea is to use multiprocessing.Pool to assign each CPU the data from a single station, since the data from each station are independent of each other.
However, my calculation time does not really go down compared to a single for loop.
Here is part of my code:
import multiprocessing as mp
import numpy as np

#function calculating the square of each data point, and taking the cumulative sum
def get_cumdd(data):
    #if not isinstance(data, list):
    #    data = [data]
    dd = np.zeros((len(data), 1))
    cum_dd = np.zeros((len(data), 1))
    for i in range(len(data)):
        dd[i] = data[i]**2
    cum_dd = np.cumsum(dd)
    return cum_dd

#parallelization between each station
if __name__ == '__main__':
    n_proc = np.min([mp.cpu_count(), nstation])  #nstation = 10
    p = mp.Pool(processes=int(n_proc))
    result = p.map(get_cumdd, data)
    p.close()
    p.join()
    cum_dd = np.zeros((nstation, len(data[0])))
    for i in range(nstation):
        cum_dd[i] = result[i].T
I do not use chunksize because cum_dd takes the summation of all the previous data^2. I am essentially dividing my data into 10 equal pieces because there is no communication between processes. I wonder if I missed anything here.
My data has 2 million points per station per day, and I need to process years of data.
This doesn't address your multiprocessing question directly, but (as Ugur MULUK and Iguananaut mention) I think your get_cumdd function is inefficient. Numpy provides np.cumsum. Reimplementing your function I get more than 1000x speedup for an array with 10k elements. With 100k elements it's about 7000x faster. With 2M elements I didn't bother to let it finish.
# your function
def cum_dd(data):
    #if not isinstance(data, list):
    #    data = [data]
    dd = np.zeros((len(data), 1))
    cum_dd = np.zeros((len(data), 1))
    for i in range(len(data)):
        dd[i] = data[i]**2
        cum_dd[i] = np.sum(dd[0:i])
    return cum_dd

# numpy implementation
def cum_dd2(data):
    # adding an axis to match the shape of the output of your cum_dd function
    return np.cumsum(data**2)[:, np.newaxis]
For 2e6 points this implementation takes ~11ms on my computer. I think that's about 30 seconds for 10 years of data for a single station.
NumPy already provides efficient, low-level implementations of these array operations on the CPU (and GPU tensor libraries offer a similar interface); the underlying routines can use Single Instruction Multiple Data (SIMD) instructions.
By pooling computations manually, you are reducing the efficiency. You can improve performance by vectorizing your explicit for loop.
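As a tiny illustration of that point (my own example, not the answer's code), the explicit per-element loop in get_cumdd collapses into a single NumPy expression, which runs the squaring and the running sum in optimised compiled code instead of the Python interpreter:

import numpy as np

data = np.random.rand(2_000_000)  # roughly one day of data for one station
cum_dd = np.cumsum(data ** 2)     # vectorised square + cumulative sum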
See the video below for more information about vectorization.
https://www.youtube.com/watch?v=qsIrQi0fzbY
If you are having difficulties, I will be around for updates or help. Good luck!
Thanks a lot for all the comments and answers! After applying vectorization and pooling, I reduced the calculation time from one hour to 3 seconds (10 * 1.7 million data points). I have my code here in case anyone is interested:
import multiprocessing as mp
import numpy as np

def get_cumdd(data):
    #if not isinstance(data, list):
    #    data = [data]
    dd = np.zeros((len(data), 1))
    for i in range(len(data)):
        dd[i] = data[i]**2
    cum_dd = np.cumsum(dd)
    return dd, cum_dd

if __name__ == '__main__':
    n_proc = np.min([mp.cpu_count(), nstation])
    p = mp.Pool(processes=int(n_proc))
    result = p.map(CC.get_cumdd, d)
    p.close()
    p.join()
I'm not using a shared-memory Queue because all my processes are independent of each other.
