Python Gensim how to make WMD similarity run faster with multiprocessing


I am trying to make gensim WMD similarity run faster. Typically, this is what is in the docs:
Example corpus:
my_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
my_query = 'Human and artificial intelligence software programs'
my_tokenized_query = ['human', 'artificial', 'intelligence', 'software', 'programs']
model is a word2vec model trained on about 100,000 documents similar to my_corpus:
from gensim.models import Word2Vec
from gensim.similarities import WmdSimilarity
model = Word2Vec.load(word2vec_model)
def init_instance(my_corpus, model, num_best):
    instance = WmdSimilarity(my_corpus, model, num_best=num_best)
    return instance
instance = init_instance(my_corpus, model, 1)
instance[my_tokenized_query]
The best matched document is "Human machine interface for lab abc computer applications", which is great.
However, the instance lookup above takes an extremely long time. So I thought of breaking up the corpus into N parts and running WMD on each with num_best = 1; at the end, the part with the max score should hold the most similar document.
from multiprocessing import Process, Queue, Manager
def main(my_query, global_jobs, process_tmp):
    process_query = gensim.utils.simple_preprocess(my_query)
    def worker(num, process_query, return_dict):
        instance = init_instance\
            (my_corpus[num*chunk+1:num*chunk+chunk], model, 1)
        x = instance[process_query][0][0]
        y = instance[process_query][0][1]
        return_dict[x] = y
    manager = Manager()
    return_dict = manager.dict()
    for num in range(num_workers):
        process_tmp = Process(target=worker, args=(num, process_query, return_dict))
        global_jobs.append(process_tmp)
        process_tmp.start()
    for proc in global_jobs:
        proc.join()
    return_dict = dict(return_dict)
    ind = max(return_dict.iteritems(), key=operator.itemgetter(1))[0]
    print corpus[ind]
>>> "Graph minors A survey"
The problem I have with this is that, even though it outputs something, it doesn't return a document from my corpus that is actually similar to the query, even though it takes the max similarity over all the parts.
Am I doing something wrong?

Comment: chunk is a static variable: e.g. chunk = 600 ...
If chunk is fixed, you have to compute num_workers from it:
10001 / 600 = 16.668..., rounded up to 17 workers
It is common to use no more processes than you have cores, so that is fine only if you have 17 cores.
The number of cores is what is fixed, so compute the chunk size from it instead:
num_workers = os.cpu_count()
chunk = chunksize(my_corpus, num_workers)
simple_preprocess did not give the same result, so I changed it to:
#process_query = gensim.utils.simple_preprocess(my_query)
process_query = my_tokenized_query
Every worker returns indices 0..n relative to its own chunk.
Therefore return_dict[x] can be overwritten by a later worker that returns the same index with a lower value. The index in return_dict is NOT the same as the index in my_corpus. Changed to:
#return_dict[x] = y
return_dict[(num * chunk) + x] = y
The +1 in the slice start skips the first document of every chunk.
I don't know how you compute chunk; consider this example:
def chunksize(iterable, num_workers):
    c_size, extra = divmod(len(iterable), num_workers)
    if extra:
        c_size += 1
    if len(iterable) == 0:
        c_size = 0
    return c_size
#Usage
chunk = chunksize(my_corpus, num_workers)
...
#my_corpus_chunk = my_corpus[num*chunk+1:num*chunk+chunk]
my_corpus_chunk = my_corpus[num * chunk:(num+1) * chunk]
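The index bookkeeping can be checked without gensim at all. The following is only a minimal sketch: the `scores` list is made-up data standing in for per-document similarities, and `chunk_bounds` and `best_in_chunk` are hypothetical helper names, not part of gensim or of the code above. It shows contiguous slices with no skipped document, and the shift from a chunk-local index back to an index into the full corpus.

```python
def chunk_bounds(n_items, n_workers):
    """Split range(n_items) into at most n_workers contiguous (start, stop) slices."""
    if n_items == 0:
        return []
    size, extra = divmod(n_items, n_workers)
    if extra:
        size += 1
    return [(start, min(start + size, n_items)) for start in range(0, n_items, size)]


def best_in_chunk(scores, start, stop):
    """Stand-in for one worker: best (global_index, score) within scores[start:stop]."""
    chunk = scores[start:stop]
    local_index = max(range(len(chunk)), key=chunk.__getitem__)
    # Translate the chunk-local index back into an index into the full list.
    return start + local_index, chunk[local_index]


# Hypothetical similarity scores for 9 documents (not real WMD output).
scores = [0.41, 0.10, 0.22, 0.38, 0.05, 0.12, 0.07, 0.09, 0.30]

bounds = chunk_bounds(len(scores), 2)
per_worker = [best_in_chunk(scores, start, stop) for start, stop in bounds]
best_index, best_score = max(per_worker, key=lambda pair: pair[1])
print(bounds, per_worker, best_index)
```

Because every worker reports a global index, two workers can never collide on the same dictionary key, and the overall winner is simply the maximum over the per-worker winners.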
Results of 10 runs, Tuple=(index from worker num=0, index from worker num=1)
With multiprocessing, with chunk=5:
02,09:(3, 8), 01,03:(3, 5):
System and human system engineering testing of EPS
04,06,07:(0, 8), 05,08:(0, 5), 10:(0, 7):
Human machine interface for lab abc computer applications
Without multiprocessing, with chunk=5:
01:(3, 6), 02:(3, 5), 05,08,10:(3, 7), 07,09:(3, 8):
System and human system engineering testing of EPS
03,04,06:(0, 5):
Human machine interface for lab abc computer applications
Without multiprocessing, without chunking:
01,02,03,04,06,07,08:(3, -1):
System and human system engineering testing of EPS
05,09,10:(0, -1):
Human machine interface for lab abc computer applications
Tested with Python: 3.4.2

Using Python 2.7:
I used threading instead of multiprocessing.
In the thread that creates the WMD instances, I do something like this:
wmd_instances = []
if wmd_instance_count > len(wmd_corpus):
    wmd_instance_count = len(wmd_corpus)
chunk_size = int(len(wmd_corpus) / wmd_instance_count)
for i in range(0, wmd_instance_count):
    if i == wmd_instance_count - 1:
        # the last instance takes the remainder of the corpus
        wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:], wmd_model, num_results)
    else:
        wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:(i+1)*chunk_size], wmd_model, num_results)
    wmd_instances.append(wmd_instance)
wmd_logic.setWMDInstances(wmd_instances, chunk_size)
'wmd_instance_count' is the number of threads to use for searching. I also remember the chunk size. Then, when I want to search for something, I start 'wmd_instance_count' threads to do the search, and they return the sims they found:
def perform_query_for_job_on_instance(wmd_logic, wmd_instances, query, jobID, instance):
    wmd_instance = wmd_instances[instance]
    sims = wmd_instance[query]
    wmd_logic.set_mt_thread_result(jobID, instance, sims)
'wmd_logic' is the instance of a class that then does this:
def set_mt_thread_result(self, jobID, instance, sims):
    res = []
    #
    # We need to scale the found ids back to our complete corpus size...
    #
    for sim in sims:
        aSim = (int(sim[0] + (instance * self.chunk_size)), sim[1])
        res.append(aSim)
I know the code isn't nice, but it works. It uses 'wmd_instance_count' threads to find results; I aggregate them and then choose the top 10 or something like that.
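The aggregation step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code from this answer: `search_chunk` is a hypothetical stand-in for `wmd_instance[query]`, the scores are invented, and the chunks are equally sized slices rather than the "last instance takes the remainder" layout used above.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_chunk(chunk_scores, num_best):
    """Stand-in for wmd_instance[query]: top (local_id, score) pairs of one chunk."""
    ranked = sorted(enumerate(chunk_scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:num_best]


def search_all(chunks, chunk_size, num_best):
    """Query every chunk in its own thread and merge the results into one top list."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as executor:
        per_chunk = list(executor.map(lambda c: search_chunk(c, num_best), chunks))
    merged = []
    for instance, sims in enumerate(per_chunk):
        # Shift chunk-local ids back to ids in the complete corpus.
        merged.extend((local_id + instance * chunk_size, score) for local_id, score in sims)
    return heapq.nlargest(num_best, merged, key=lambda pair: pair[1])


all_scores = [0.2, 0.9, 0.1, 0.4, 0.8, 0.3, 0.7]  # hypothetical scores
chunk_size = 3
chunks = [all_scores[i:i + chunk_size] for i in range(0, len(all_scores), chunk_size)]
print(search_all(chunks, chunk_size, num_best=2))
```

Each chunk has to return its own top `num_best` results, because the overall top list may come entirely from a single chunk.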
Hope this helps.

Related

High Memory Usage when python multiprocessing run in Windows

The code down below is a contrived example that simulates an actual problem I have that uses multiprocessing to speed up the code. The code is run on Windows 10 64-bit OS, python 3.7.5, and ipython 7.9.0
The transformation functions (these functions will be used to transform arrays in main()):
from itertools import product
from functools import partial
from numba import njit, prange
import multiprocessing as mp
import numpy as np
@njit(parallel= True)
def transform_array_c(data, n):
    ar_len= len(data)
    sec_max1= np.empty(ar_len, dtype = data.dtype)
    sec_max2= np.empty(ar_len, dtype = data.dtype)
    for i in prange(n-1):
        sec_max1[i]= np.nan
    for sec in prange(ar_len//n):
        s2_max= data[n*sec+ n-1]
        s1_max= data[n*sec+ n]
        for i in range(n-1,-1,-1):
            if data[n*sec+i] > s2_max:
                s2_max= data[n*sec+i]
            sec_max2[n*sec+i]= s2_max
        sec_max1[n*sec+ n-1]= sec_max2[n*sec]
        for i in range(n-1):
            if n*sec+n+i < ar_len:
                if data[n*sec+n+i] > s1_max:
                    s1_max= data[n*sec+n+i]
                sec_max1[n*sec+n+i]= max(s1_max, sec_max2[n*sec+i+1])
            else:
                break
    return sec_max1
@njit(error_model= 'numpy', cache= True)
def rt_mean_sq_dev(array1, array2, n):
    msd_temp = np.empty(array1.shape[0])
    K = array2[n-1]
    rs_x= array1[0] - K
    rs_xsq = rs_x *rs_x
    msd_temp[0] = np.nan
    for i in range(1,n):
        rs_x += array1[i] - K
        rs_xsq += np.square(array1[i] - K)
        msd_temp[i] = np.nan
    y_i = array2[n-1] - K
    msd_temp[n-1] = np.sqrt(max(y_i*y_i + (rs_xsq - 2*y_i*rs_x)/n, 0))
    for i in range(n, array1.shape[0]):
        rs_x = array1[i] - array1[i-n]+ rs_x
        rs_xsq = np.square(array1[i] - K) - np.square(array1[i-n] - K) + rs_xsq
        y_i = array2[i] - K
        msd_temp[i] = np.sqrt(max(y_i*y_i + (rs_xsq - 2*y_i*rs_x)/n, 0))
    return msd_temp
@njit(cache= True)
def transform_array_a(data, n):
    result = np.empty(data.shape[0], dtype= data.dtype)
    alpharev = 1. - 2 / (n + 1)
    alpharev_exp = alpharev
    e = data[0]
    w = 1.
    if n == 2: result[0] = e
    else:result[0] = np.nan
    for i in range(1, data.shape[0]):
        w += alpharev_exp
        e = e*alpharev + data[i]
        if i > n -3:result[i] = e / w
        else:result[i] = np.nan
        if alpharev_exp > 3e-307:alpharev_exp*= alpharev
        else:alpharev_exp=0.
    return result
The multiprocessing part
def func(tup, data):    #<-------------the function to be run among all
    a_temp= a[tup[2][0]]
    idx1 = a_temp > a[tup[2][1]]
    idx2= a_temp < b[(tup[2][1], tup[1][1])]
    c_final = c[tup[0][1]][idx1 | idx2]
    data_final= data[idx1 | idx2]
    return (tup[0][0], tup[1][0], *tup[2]), c_final[-1] - data_final[-1]
def setup(a_dict, b_dict, c_dict):    #initialize the shared dictionaries
    global a,b,c
    a,b,c = a_dict, b_dict, c_dict
def main(a_arr, b_arr, c_arr, common_len):
    np.random.seed(0)
    data_array= np.random.normal(loc= 24004, scale=500, size= common_len)
    a_size = a_arr[-1] + 1
    b_size = len(b_arr)
    c_size = len(c_arr)
    loop_combo = product(enumerate(c_arr),
                         enumerate(b_arr),
                         (n_tup for n_tup in product(np.arange(1,a_arr[-1]), a_arr) if n_tup[1] > n_tup[0])
                         )
    result = np.zeros((c_size, b_size, a_size -1 ,a_size), dtype = np.float32)
    ###################################################
    #This part simulates the heavy-computation in the actual problem
    a= {}
    b= {}
    c= {}
    for i in range(1, a_arr[-1]+1):
        a[i]= transform_array_a(data_array, i)
        if i in a_arr:
            for j in b_arr:
                b[(i,j)]= rt_mean_sq_dev(data_array, a[i], i)/data_array *j
    for i in c_arr:
        c[i]= transform_array_c(data_array, i)
    ###################################################
    with mp.Pool(processes= mp.cpu_count() - 1,
                 initializer= setup,
                 initargs= [a,b,c]
                 ) as pool:
        mp_res= pool.imap_unordered(partial(func, data= data_array),
                                    loop_combo
                                    )
        for item in mp_res:
            result[item[0]] =item[1]
    return result
if __name__ == '__main__':
    mp.freeze_support()
    a_arr= np.arange(2,44,2)
    b_arr= np.arange(0.4,0.8, 0.20)
    c_arr= np.arange(2,42,10)
    common_len= 440000
    final_res= main(a_arr, b_arr, c_arr, common_len)
For performance reasons, multiple shared "read-only" dictionaries are used among all processes to reduce redundant computation (in the actual problem, total computation time drops by 40% after sharing the dictionaries among all the processes). However, RAM usage becomes absurdly high once the shared dictionaries are used in my actual problem: memory usage on my 6C/12T Windows computer goes from (8.2GB peak, 5.0GB idle) to (23.9GB peak, 5.0GB idle), too high a cost to pay for a 40% speed-up.
Is the high RAM usage unavoidable when sharing multiple data structures among processes is a must? What can be done to my code to make it as fast as possible while using as little memory as possible?
Thank you in advance.
Note: I tried using imap_unordered() instead of map because I heard it is supposed to reduce memory usage when the input iterable is large, but I honestly can't see an improvement in RAM usage. Maybe I have done something wrong here?
EDIT: Following the feedback in the answers, I have changed the heavy-computation part of the code so that it looks less like a dummy and more closely resembles the computation in the actual problem.

High Memory Usage when manipulating shared dictionaries in python multiprocessing run in Windows
It is fair to demystify the problem a bit before we move into details: there are no shared dictionaries in the original code, let alone manipulated ones ( yes, each of a, b, c did get "assigned" a reference to a_dict, b_dict, c_dict, yet none of them is shared; they just get replicated, as multiprocessing does on Windows-class O/S-es ). There are no writes "into" the dict-s ( just non-destructive reads from any of their replicas ).
Similarly, np.memmap()-s can put some part of the originally proposed data onto disk-space ( at the cost of doing so, plus bearing some ( latency-masked ) random-read latency of ~ 10 [ms] instead of ~ 0.5 [ns] if smart-aligned vectorised memory-patterns were designed into the performance hot-spot ), yet no dramatic change of paradigm ought to be expected here, as the "external iterator" almost avoids any smart-aligned cache re-use.
Q : What can be done to my code in order to make it as fast as possible while using as low memory as possible?
The first sin was in using an 8B-int64 to store one plain bit ( no Qbits here yet ~ All salutes to Burnaby Quantum R&D Teams )
for i in c_arr:                                    # ~~ np.arange( 2, 42, 10 )
    np.random.seed( i )                            # ~ yields a deterministic state
    c[i] = np.random.poisson( size = common_len )  # ~ 440.000 int64-s with {0|1}
This took 6 (processes) x 440000 x 8B ~ 0.021 GB "smuggled" into all copies of dictionary c, whereas each and every such value is deterministically known and could be generated ALAP ( as late as possible ) inside the respective target process, just by knowing the value of i ( indeed, there is no need to pre-generate and replicate ~ 0.021 GB of data many times ).
So far, Windows-class O/S-es lack an os.fork() and thus do a full python copy ( yes, RAM ..., yes, TIME ) of as many replicated python-interpreter sessions ( plus importing the main module ) as were requested in multiprocessing for process-based separation ( done to avoid a GIL-lock ordered, pure-[SERIAL] code execution ).
The Best Next Step: re-factor the code for both efficiency and performance
The best next step is to refactor the code so as to minimise the "shallow" ( and expensive ) use of the 6 processes that are "externally" commanded by a central iterator ( the loop_combo "dictator" with ~ 18522 items to repeat the call to a "remotely-dispatched" func( tup, data ) so as to fetch a simple "DMA-tuple"-( (x,y,z), value ) to store one value into a central process result-float32-array ).
Try to increase the computing "density": re-factor the code in a divide-and-conquer manner, i.e. so that each of the mp.Pool processes computes, in one smooth block, some remarkably sized, dedicated sub-space of the parameter space ( covered here iteratively "from outside" ), and the returned blocks of results can then be reduced easily. Performance will only improve by doing this ( best without any form of expensive sharing ).
This re-factoring will avoid parameter pickle/unpickle costs ( add-on overheads ), both the one-time ones ( in passing the unique parameter-set values ) and the repetitive ones ( about ~ 18522 repeated memory allocations, build-ups and pickle/unpickle costs of an np.arange( 440000 ), due to a poor call-signature design / engineering ).
All these steps will improve your processing efficiency and reduce unnecessary RAM allocations.

Overhead of python multiprocessing initialization is worse than benefits

I want to use a trie search with Python 3.7 to match a string against some given words.
The trie search algorithm is actually quite fast, but I also want to use all of the cores my CPU has. Let's assume my PC has 8 cores and I want to use 7 of them.
So I split my word database into 7 equally sized lists and created a trie from each one. (That's the basic idea for parallelizing the code.)
However, when I call Process() from the multiprocessing module, the Process().start() method can take a couple of seconds on the real database (the search itself takes only microseconds).
To be honest, I'm not yet a professional programmer, which means I have probably made some major mistake in the code. Does anyone see why starting the process is so damn slow?
Please note that I tested the script with a much bigger database than the trie below. I also tested the script calling only 1 process each time, and that was also significantly slower.
I wanted to provide less code, but I think it's nice to see the problem running. I can also provide additional info if needed.
import string
import sys
import time
from multiprocessing import Process, Manager
from itertools import combinations_with_replacement
class TrieNode:
    def __init__(self):
        self.isString = False
        self.children = {}
    def insertString(self, word, root):
        currentNode = root
        for char in word:
            if char not in currentNode.children:
                currentNode.children[char] = TrieNode()
            currentNode = currentNode.children[char]
        currentNode.isString = True
    def findStrings(self, prefix, node, results):
        # Append the result when an end has been found
        if node.isString:
            results.append(prefix)
        for char in node.children:
            self.findStrings(prefix + char, node.children[char], results)
    def findSubStrings(self, start_prefix, root, results):
        currentNode = root
        for char in start_prefix:
            # Stop the loop on missing prefixes or their children
            if char not in currentNode.children:
                break
            # Otherwise move on to the children
            else:
                currentNode = currentNode.children[char]
        # Use findStrings recursively to find end nodes
        self.findStrings(start_prefix, currentNode, results)
        return results
def gen_word_list(num_words, min_word_len=4, max_word_len=10):
    wordList = []
    total_words = 0
    for long_word in combinations_with_replacement(string.ascii_lowercase, max_word_len):
        wordList.append(long_word)
        total_words += 1
        if total_words >= num_words:
            break
        for cut_length in range(1, max_word_len-min_word_len+1):
            wordList.append(long_word[:-cut_length])
            total_words += 1
            if total_words >= num_words:
                break
    return wordList
if __name__ == '__main__':
    # Sample word list
    wordList = gen_word_list(1.5 * 10**5)
    # Configs
    try:
        n_cores = int(sys.argv[-1] or 7)
    except ValueError:
        n_cores = 7
    # Repetitions to do in order to estimate the runtime of a single run
    num_repeats = 20
    real_num_repeats = n_cores * num_repeats
    # Creating Trie
    root = TrieNode()
    # Adding words
    for word in wordList:
        root.insertString(word, root)
    # Extending trie to use it on multiple cores at once
    multiroot = [root] * n_cores
    # Measure time
    print('Single process ...')
    t_0 = time.time()
    for i in range(real_num_repeats):
        r = []
        root.findSubStrings('he', root, r)
    single_proc_time = (time.time()-t_0)
    print(single_proc_time/real_num_repeats)
    # using multicore to speed up the process
    man = Manager()
    # Loop to test the multicore Solution
    # (Less repetitions are done to compare the timings to the single-core solution)
    print('\nMultiprocess ...')
    t_00 = time.time()
    p_init_time = 0
    procs_append_time = 0
    p_start_time = 0
    for i in range(num_repeats):
        # Create Share-able list
        res = man.list()
        procs = []
        for i in range(n_cores):
            t_0 = time.time()
            p = Process(target=multiroot[i].findSubStrings, args=('a', multiroot[i], res))
            t_1 = time.time()
            p_init_time += t_1 - t_0
            procs.append(p)
            t_2 = time.time()
            procs_append_time += t_2 - t_1
            p.start()
            p_start_time += time.time() - t_2
        for p in procs:
            p.join()
    multi_proc_time = time.time() - t_00
    print(multi_proc_time / real_num_repeats)
    init_overhead = p_init_time / single_proc_time
    append_overhead = procs_append_time / single_proc_time
    start_overhead = p_start_time / single_proc_time
    total_overhead = (multi_proc_time - single_proc_time) / single_proc_time
    print(f"Process(...) overhead: {init_overhead:.1%}")
    print(f"procs.append(p) overhead: {append_overhead:.1%}")
    print(f"p.start() overhead: {start_overhead:.1%}")
    print(f"Total overhead: {total_overhead:.1%}")
Single process ...
0.007229958261762347
Multiprocess ...
0.7615800397736686
Process(...) overhead: 0.9%
procs.append(p) overhead: 0.0%
p.start() overhead: 8.2%
Total overhead: 10573.8%

General idea
There are many things to consider, and most of them are already described in Multiprocessing > Programming guidelines. The most important thing to remember is that you are actually working with multiple processes, so there are 3 (or 4) ways in which variables are handled:
Synchronized wrappers over ctypes shared-state variables (like multiprocessing.Value). The actual variable is always "one object" in memory, and by default the wrapper uses "locking" to set/get the real value.
Proxies (like Manager().list()). These variables are similar to shared-state variables, but they live in a special "server process", and every operation on them sends pickled values between the manager process and the active process:
results.append(x) pickles x and sends it from the active process that makes the call to the manager process, where it is unpickled.
Any other access to results (like len(results) or iterating over results) involves the same pickling/sending/unpickling process.
So proxies are generally much slower than any other approach for common variables, and in many cases using a manager for "local" parallelization gives worse performance than even a single-process run. But a manager server can be used remotely, so it is reasonable to use one when you want to parallelize work across workers distributed over multiple machines.
Objects available when the subprocess is created. With the "fork" start method, all objects available at subprocess creation are still available and "not shared", so changing them only changes them "locally for the subprocess". But until they are changed, the processes really do "share" the memory of each such object, so:
If they are used "read-only", then nothing is copied or "communicated".
If they are changed, then they are copied inside the subprocess and the copy is changed. This is called Copy-On-Write or COW. Please note that making a new reference to an object, e.g. assigning a variable to reference it or appending it to a list, increases the object's reference count, and that is considered "a change".
Behavior may also vary depending on the start method: e.g. with "spawn"/"forkserver", changeable global variables are not really "the same objects", and the value seen by a subprocess may not be the same as in the parent process.
So the initial values of multiroot[i] (used in Process(target=..., args=(..., multiroot[i], ...))) are shared, but:
If you are not using the 'fork' start method (and Windows does not use it by default), then all the args are pickled at least once for each subprocess, so start may take a long time if multiroot[i].children is huge.
Even if you are using fork: initially multiroot[i] seems to be shared and not copied, but I'm not sure what happens when variables are assigned inside the findSubStrings method (e.g. currentNode = ...); maybe it causes copy-on-write (COW), so the whole TrieNode instance is copied.
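To get a feeling for the pickling cost mentioned above, one can measure how large the argument becomes when serialized, since that is roughly what has to be sent to every child that is not forked. The snippet is only a sketch: it uses a nested-dict trie with a `"$"` end marker as a hypothetical stand-in for the TrieNode class, and the word list is synthetic.

```python
import multiprocessing as mp
import pickle
import time


def build_trie(words):
    """Nested-dict trie; the '$' key marks the end of a word (illustrative layout)."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node["$"] = True
    return root


# Synthetic six-character "words"; a real database would be larger.
words = [f"{i:06d}" for i in range(20_000)]
trie = build_trie(words)

t0 = time.perf_counter()
payload = pickle.dumps(trie)
elapsed = time.perf_counter() - t0

print("start method:", mp.get_start_method())
print(f"pickled size: {len(payload)} bytes, pickling took {elapsed:.4f} s")
```

If the reported start method is not "fork", a cost of this kind is paid for every Process(...).start() that receives the trie in its args.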
What can be done to improve the situation:
If you are using the fork start method, make sure that the "database" objects (TrieNode instances) are truly read-only and don't even have methods with variable assignments in them. For example, you can move findSubStrings to another class, and make sure to call every instance.insertString before starting the subprocesses.
You are using a man.list() instance as the results argument to findSubStrings. This means that a different "wrapper" is created for each subprocess, and every results.append(prefix) pickles prefix and then sends it to the server process. If you are using a Pool with a limited number of processes, it's not a big deal. If you are spawning a huge number of subprocesses, it might affect performance. And I think that by default they all use "locking", so concurrent appends might be relatively slow. If the order of items in results does not matter (I'm not experienced with prefix trees and don't remember the theory behind them), then you can fully avoid any overhead related to concurrent results.append:
Create a new results list inside the findSubStrings method. Don't use res = man.list() at all.
To get the "final" results, iterate over every result object returned by pool.apply_async(), get the results, and merge them.
Using weak references
Using currentNode = root in findSubStrings will result in COW of root. That's why weak references (currentNodeRef = weakref.ref(root)) can give a little extra benefit.
Example
import string
import sys
import time
import weakref
from copy import deepcopy
from multiprocessing import Pool
from itertools import combinations_with_replacement
class TrieNode:
    def __init__(self):
        self.isString = False
        self.children = {}
    def insertString(self, word, root):
        current_node = root
        for char in word:
            if char not in current_node.children:
                current_node.children[char] = TrieNode()
            current_node = current_node.children[char]
        current_node.isString = True
# findStrings: not a method of TrieNode anymore, and works with reference to node.
def findStrings(prefix, node_ref, results):
    # Append the result when an end has been found
    if node_ref().isString:
        results.append(prefix)
    for char in node_ref().children:
        findStrings(prefix + char, weakref.ref(node_ref().children[char]), results)
# findSubStrings: not a method of TrieNode anymore, and works with reference to node.
def findSubStrings(start_prefix, node_ref, results=None):
    if results is None:
        results = []
    current_node_ref = node_ref
    for char in start_prefix:
        # Stop the loop on missing prefixes or their children
        if char not in current_node_ref().children:
            break
        # Otherwise move on to the children
        else:
            current_node_ref = weakref.ref(current_node_ref().children[char])
    # Use findStrings recursively to find end nodes
    findStrings(start_prefix, current_node_ref, results)
    return results
def gen_word_list(num_words, min_word_len=4, max_word_len=10):
    wordList = []
    total_words = 0
    for long_word in combinations_with_replacement(string.ascii_lowercase, max_word_len):
        wordList.append(long_word)
        total_words += 1
        if total_words >= num_words:
            break
        for cut_length in range(1, max_word_len-min_word_len+1):
            wordList.append(long_word[:-cut_length])
            total_words += 1
            if total_words >= num_words:
                break
    return wordList
if __name__ == '__main__':
    # Sample word list
    wordList = gen_word_list(1.5 * 10**5)
    # Configs
    try:
        n_cores = int(sys.argv[-1] or 7)
    except ValueError:
        n_cores = 7
    # Repetitions to do in order to estimate the runtime of a single run
    real_num_repeats = 420
    simulated_num_repeats = real_num_repeats // n_cores
    # Creating Trie
    root = TrieNode()
    # Adding words
    for word in wordList:
        root.insertString(word, root)
    # Create tries for subprocesses:
    multiroot = [deepcopy(root) for _ in range(n_cores)]
    # NOTE: actually all subprocesses can use the same `root`, but let's copy them to simulate
    # that we are using different tries when splitting job to sub-jobs
    # localFindSubStrings: defined after `multiroot`, so `multiroot` can be used as "shared" variable
    def localFindSubStrings(start_prefix, root_index=None, results=None):
        if root_index is None:
            root_ref = weakref.ref(root)
        else:
            root_ref = weakref.ref(multiroot[root_index])
        return findSubStrings(start_prefix, root_ref, results)
    # Measure time
    print('Single process ...')
    single_proc_num_results = None
    t_0 = time.time()
    for i in range(real_num_repeats):
        iteration_results = localFindSubStrings('help', )
        if single_proc_num_results is None:
            single_proc_num_results = len(iteration_results)
    single_proc_time = (time.time()-t_0)
    print(single_proc_time/real_num_repeats)
    # Loop to test the multicore Solution
    # (Less repetitions are done to compare the timings to the single-core solution)
    print('\nMultiprocess ...')
    p_init_time = 0
    apply_async_time = 0
    results_join_time = 0
    # Should processes be joined between repeats (simulate single job on multiple cores) or not (simulate multiple jobs running simultaneously)
    PARALLEL_REPEATS = True
    if PARALLEL_REPEATS:
        t_0 = time.time()
        pool = Pool(processes=n_cores)
        t_1 = time.time()
        p_init_time += t_1 - t_0
        async_results = []
    final_results = []
    t_00 = time.time()
    for repeat_num in range(simulated_num_repeats):
        final_result = []
        final_results.append(final_result)
        if not PARALLEL_REPEATS:
            t_0 = time.time()
            pool = Pool(processes=n_cores)
            t_1 = time.time()
            p_init_time += t_1 - t_0
            async_results = []
        else:
            t_1 = time.time()
        async_results.append(
            (
                final_result,
                pool.starmap_async(
                    localFindSubStrings,
                    [('help', core_num) for core_num in range(n_cores)],
                )
            )
        )
        t_2 = time.time()
        apply_async_time += t_2 - t_1
        if not PARALLEL_REPEATS:
            for _, a_res in async_results:
                for result_part in a_res.get():
                    t_3 = time.time()
                    final_result.extend(result_part)
                    results_join_time += time.time() - t_3
            pool.close()
            pool.join()
    if PARALLEL_REPEATS:
        for final_result, a_res in async_results:
            for result_part in a_res.get():
                t_3 = time.time()
                final_result.extend(result_part)
                results_join_time += time.time() - t_3
        pool.close()
        pool.join()
    multi_proc_time = time.time() - t_00
    # Work is not really parallelized, instead it's just 'duplicated' over cores,
    # and so we divide using `real_num_repeats` (not `simulated_num_repeats`)
    print(multi_proc_time / real_num_repeats)
    init_overhead = p_init_time / single_proc_time
    apply_async_overhead = apply_async_time / single_proc_time
    results_join_percent = results_join_time / single_proc_time
    total_overhead = (multi_proc_time - single_proc_time) / single_proc_time
    print(f"Pool(...) overhead: {init_overhead:.1%}")
    print(f"pool.starmap_async(...) overhead: {apply_async_overhead:.1%}")
    print(f"Results join time percent: {results_join_percent:.1%}")
    print(f"Total overhead: {total_overhead:.1%}")
    for iteration_results in final_results:
        num_results = len(iteration_results) / n_cores
        if num_results != single_proc_num_results:
            raise AssertionError(f'length of results should not change! {num_results} != {single_proc_num_results}')
NOTES:
PARALLEL_REPEATS=True simulates running multiple jobs (for example, each job would be started for a different prefix, but in the example I use the same prefix to have a consistent "load" for each run), with each job "parallelized" over all cores.
PARALLEL_REPEATS=False simulates running a single job parallelized over all cores, and it is slower than the single-process solution.
It seems that parallelism is only better when each worker in the pool is issued apply_async more than once.
Example output:
Single process ...
0.007109369550432477
Multiprocess ...
0.002928720201764788
Pool(...) overhead: 1.3%
pool.apply_async(...) overhead: 1.5%
Results join time percent: 1.8%
Total overhead: -58.8%

First, I want to thank everyone who participated, as every answer contributed to the solution.
As the first comments pointed out, creating a new process every time makes Python move the needed data into the process. This can take a couple of seconds and leads to an undesired delay.
What brought the ultimate solution for me was creating the processes (one per core) only once, during program startup, using the Process class of the multiprocessing library.
You can then communicate with each process using the Pipe class of the same module.
I found the ping-pong example here really helpful: https://www.youtube.com/watch?v=s1SkCYMnfbY&t=900s
It is still not optimal, as multiple pipes trying to talk to a process at the same time cause the process to crash.
However, I should be able to solve this issue using queues. If someone is interested in the solution, feel free to ask.
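For readers who want to try the queue variant mentioned above, here is one possible shape of it. This is a sketch under assumptions rather than the solution actually used: `build_index` and `find_with_prefix` are hypothetical stand-ins for building and searching a trie, each worker gets its own task queue because every query has to reach every partition, and `None` is used as the shutdown signal.

```python
import multiprocessing as mp


def build_index(words):
    """Stand-in for the expensive trie construction."""
    return sorted(words)


def find_with_prefix(index, prefix):
    """Stand-in for the trie lookup: all words starting with prefix."""
    return [word for word in index if word.startswith(prefix)]


def worker(words, tasks, results):
    index = build_index(words)  # expensive setup, done once per process
    for job_id, prefix in iter(tasks.get, None):  # None is the shutdown signal
        results.put((job_id, find_with_prefix(index, prefix)))


if __name__ == "__main__":
    word_parts = [["help", "hello", "world"], ["helmet", "apple"]]
    task_queues = [mp.Queue() for _ in word_parts]
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(part, queue, results))
             for part, queue in zip(word_parts, task_queues)]
    for proc in procs:
        proc.start()
    queries = ["hel", "a"]
    for job_id, prefix in enumerate(queries):
        for queue in task_queues:  # every partition has to see every query
            queue.put((job_id, prefix))
    merged = {}
    for _ in range(len(queries) * len(procs)):
        job_id, found = results.get()
        merged.setdefault(job_id, []).extend(found)
    for queue in task_queues:
        queue.put(None)
    for proc in procs:
        proc.join()
    print({job_id: sorted(found) for job_id, found in merged.items()})
```

The processes are started once and stay alive between queries, so the start-up cost discussed in this thread is paid only at program start.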

Dataset strings replace not speeding up with threads

I was recently getting into Natural Language Processing for a university project and, given a list of words, I wanted to try and delete all those words from a dataset of Strings.
My dataset looks like this, but much bigger:
data_set = ['Human machine interface for lab abc computer applications',
'A survey of user opinion of computer system response time',
'The EPS user interface management system',
'System and human system engineering testing of EPS',
'Relation of user perceived response time to error measurement',
'The generation of random binary unordered trees',
'The intersection graph of paths in trees',
'Graph minors IV Widths of trees and well quasi ordering',
'Graph minors A survey']
The list of words to delete looks like this, but again, much longer:
to_remove = ['abc', 'of', 'quasi', 'well']
Since in Python I didn't find any function to directly delete words from strings, I used the replace() function.
The program should take the data_set and, for each word in to_remove, it should call a replace() on a different string of the data_set. I was hoping that threads could speed things up, but unfortunately it takes almost the same time as the program without threads. Am I correctly implementing threads? Or did I miss something?
The code with threads is the following:
from multiprocessing.dummy import Pool as ThreadPool
def remove_words(params):
    changed_data_set = params[0]
    for elem in params[1]:
        changed_data_set = changed_data_set.replace(' ' + elem, ' ')
    return changed_data_set
def parallel_task(params, threads=2):
    pool = ThreadPool(threads)
    results = pool.map(remove_words, params)
    pool.close()
    pool.join()
    return results
parameters = []
for rows in data_set:
    parameters.append((rows, to_remove))
new_data_set = parallel_task(parameters, 8)
The code without threads is the following:
def remove_words(data_set, to_replace):
    # Use i as the index: naming the loop variable "len" would shadow the built-in len().
    for i in range(len(data_set)):
        for word in to_replace:
            data_set[i] = data_set[i].replace(' ' + word, ' ')
    return data_set
changed_data_set = remove_words(data_set, to_remove)
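Threads rarely speed up CPU-bound work like this in CPython, because the global interpreter lock lets only one thread run Python code at a time. A single-threaded alternative, offered only as a sketch and not as the asker's method, is to build one compiled regular expression for all the words. Note that the behaviour differs slightly from the replace() version: `\b` matches whole words only, and leftover double spaces are collapsed.

```python
import re


def remove_words(data_set, to_remove):
    """Return a new list with every whole word in to_remove deleted."""
    # One alternation matches any listed word; \b stops 'of' from
    # matching inside a longer word such as 'office'.
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, to_remove)) + r")\b"
    )
    cleaned = []
    for row in data_set:
        row = pattern.sub("", row)
        # Collapse the extra spaces left behind by the deleted words.
        cleaned.append(" ".join(row.split()))
    return cleaned


if __name__ == "__main__":
    sample = ["Graph minors IV Widths of trees and well quasi ordering"]
    print(remove_words(sample, ["abc", "of", "quasi", "well"]))
```

Each string is scanned once regardless of how many words are in the removal list, rather than once per word.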

Multiprocess or threading with huge data structure for RAM and speed issues. Python 2.7

I'm writing an application that runs an MST algorithm on a huge graph (around 100 to 150 million edges) in Python 2.7. The graph is stored as an adjacency list in an ordinary class with methods like:
def insertArcW(self, tail, head, weight):
    if head in self.nodes and tail in self.nodes:
        self.adj[tail].addAsLast(head)
        self.adj[tail].addAsLast(weight)
def insertNode(self, e):
    newnode = Node(self.nextId, e)
    self.nextId += 1
I'm also using a linked list (built on an array) and the queue from the Python standard library (version 2.7).
With this piece of code the insert is really fast (because there are far fewer nodes than edges):
n = []
for i in xrange(int(file_List[0])):
    n.append(G.insertNode(i))
The problem comes with inserting the edges:
for e in xrange(len(arc_List)):
    G.insertArcW(n[arc_List[e][0]].index, n[arc_List[e][1]].index,arc_List[e][2])
    G.insertArcW(n[arc_List[e][1]].index, n[arc_List[e][0]].index,arc_List[e][2])
It works fine with 1 million edges, but with more it eats all of my RAM (4 GB, 64-bit), although it doesn't freeze. It can build the graph, but it takes a long time. Given that CPU usage stays at 19 to 25% while doing this, is there a way to do this with multiprocessing or multithreading? For example, building the graph on two cores that do the same operation at the same time on different data: one core working on half of the edges and the other core on the other half.
I'm practically new to this area of programming, above all in Python.
EDIT: I use the following code to set up two lists, one for nodes and one for edges. I need to read the information from a ".txt" file. With the insertArcW and insertNode calls placed there, RAM usage oscillates between 2.4 GB and 2.6 GB. It is now stable (maybe because the two huge lists of edges and nodes get "deleted"), but the speed is the same. Code:
f = open(graph + '.txt','r')
v = f.read()
file_List = re.split('\s+',v)
arc_List = []
n = []
p = []
for x in xrange(0,int(file_List[1])):
    arc_List.append([0,0,0])
for i in xrange(int(file_List[0])):
    n.append(G.insertNode(i))
for weight in xrange(1,int(file_List[1])+1):
    p.append(weight)
random.shuffle(p)
i = 0
r = 0
while r < int(file_List[1]):
    for k in xrange(2,len(file_List),2):
        arc_List[r][0] = int(file_List[k])
        arc_List[r][1] = int(file_List[k+1])
        arc_List[r][2] = float(p[i])
        G.insertArcW(n[arc_List[r][0]].index, n[arc_List[r][1]].index,arc_List[r][2])
        G.insertArcW(n[arc_List[r][1]].index, n[arc_List[r][0]].index,arc_List[r][2])
        print r
        i+=1
        r+=1
f.close()
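The loader above keeps the whole file text, the token list, arc_List and the weight list in memory at the same time. One possible way to lower that peak, sketched here under assumptions, is to read the file lazily and yield one edge at a time. The file layout (node count, edge count, then tail/head pairs) is inferred from the code above, the name `read_graph` is hypothetical, and the sketch uses Python 3 syntax rather than 2.7.

```python
import io


def read_graph(stream):
    """Return (node_count, edge_count, edge_iterator) from a text stream.

    Assumed layout: node count, edge count, then whitespace-separated
    tail/head pairs. Edges are produced lazily, one pair at a time.
    """
    # Generator over tokens: only one line is held in memory at a time.
    tokens = (tok for line in stream for tok in line.split())
    node_count = int(next(tokens))
    edge_count = int(next(tokens))

    def edges():
        # zip(tokens, tokens) pulls two consecutive tokens per step.
        for tail, head in zip(tokens, tokens):
            yield int(tail), int(head)

    return node_count, edge_count, edges()


if __name__ == "__main__":
    sample = io.StringIO("3 2\n0 1\n1 2\n")
    nodes, count, edge_iter = read_graph(sample)
    for tail, head in edge_iter:
        print(tail, head)
```

Each yielded pair could be passed straight to insertArcW, so no intermediate edge list is needed. This addresses memory only; it does not make the insertion itself run in parallel.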

SimPy Resource of 3 where each has different characteristics

I am trying to simulate a situation where we have 5 machines arranged in a 1 -> 3 -> 1 layout, i.e. the 3 in the middle operate in parallel to reduce the effective time they take.
I can easily simulate this by creating a SimPy resource with a capacity of three, like this:
simpy.Resource(env, capacity=3)
However, in my situation each of the three resources operates slightly differently, and sometimes I want to be able to use any of them (when I'm operating) or book a specific one (when I want to clean). Basically, the three machines slowly foul up at different rates and operate slower. I want to be able to simulate this and also let a clean occur when a machine gets too dirty.
I have tried a few ways of simulating this but have run into problems every time.
In the first, when a process booked the resource it also set one of three global flags for the machines (A, B, C), plus a flag of its own to record which machine it was using. This works, but it's not clean, and the huge if statements everywhere make it really difficult to understand what is occurring.
The second was to model it as three separate resources and then try to wait for and request one of the 3 machines with something like:
reqA = A.res.request()
reqB = B.res.request()
reqC = C.res.request()
unitnumber = yield reqA | reqB | reqC
yield env.process(batch_op(env, name, machineA, machineB, machineC, unitnumber))
But this doesn't work, and I can't work out the best way to yield on one of several choices.
What would be the best way to simulate this scenario? For completeness, here is what I'm looking for:
Request any of 3 machines
Request a specific machine
Have each machine track its history
Have each machine's characteristics be different, i.e. one fouls up faster but works faster initially
Detect and schedule a clean based on the performance or an indicator
This is what I have so far in my latest version, which tries to model each machine as a separate resource:
import numpy as np
import simpy
class Machine(object):
    def __init__(self, env, cycletime, cleantime, k1foul, k2foul):
        self.env = env
        self.res = simpy.Resource(env, 1)
        self.cycletime = cycletime
        self.cleantime = cleantime
        self.k1foul = k1foul
        self.k2foul = k2foul
        self.batchessinceclean = 0
    def operate(self):
        # Cycle time grows with the number of batches since the last clean.
        self.cycletime = self.cycletime + self.k2foul * np.log(self.k1foul * self.batchessinceclean + 1)
        self.batchessinceclean += 1
        yield self.env.timeout(self.cycletime)
    def clean(self):
        # The original format string had two %s placeholders but only one value.
        print('begin cleaning at %s' % self.env.now)
        self.batchessinceclean = 0
        # Use self.env here: a bare "env" is not defined inside the method.
        yield self.env.timeout(self.cleantime)
        print('finished cleaning at %s' % self.env.now)

You should try (Filter)Store:
import simpy
def user(machine):
    m = yield machine.get()
    print(m)
    yield machine.put(m)
    m = yield machine.get(lambda m: m['id'] == 1)
    print(m)
    yield machine.put(m)
    m = yield machine.get(lambda m: m['health'] > 98)
    print(m)
    yield machine.put(m)
env = simpy.Environment()
machine = simpy.FilterStore(env, 3)
machine.put({'id': 0, 'health': 100})
machine.put({'id': 1, 'health': 95})
machine.put({'id': 2, 'health': 97.2})
env.process(user(machine))
env.run()
