I want to do some analysis on graphs to find all the possible simple paths between all pairs of nodes in a graph. With the help of the NetworkX library, I can use DFS to find all possible paths between 2 nodes with this function:
nx.all_simple_paths(G,source,target)
The code below runs quickly since my toy example contains only 6 nodes. However, in my real task my graph contains 5,213 nodes and 11,377,786 edges, and finding all possible simple paths in this graph is impossible with the solution below:
import networkx as nx

graph = nx.DiGraph()
graph.add_weighted_edges_from(final_edges_list)

list_of_nodes = list(graph.nodes())
paths = {}
for n1 in list_of_nodes:
    for n2 in list_of_nodes:
        if n1 != n2:
            all_simple_paths = list(nx.all_simple_paths(graph, n1, n2))
            paths[n1 + "-" + n2] = all_simple_paths
The "paths" dictionary holds the "n1-n2" (source node and target node respectively) as keys, and list of all simple paths as values.
The question is whether I can use multiprocessing in this scenario in order to run this code on my original problem or not. My knowledge of processors, threads, shared memory and CPU cores is very limited, and I am not sure whether I can really use concurrency (running my nested loops in parallel) for my task.
I use a Windows server with 128 GB of RAM and a 32-core CPU.
PS: Searching the net (mostly StackOverflow), I've found solutions that recommend threading and others that recommend multiprocessing. I am not sure I understand the distinction between these two :|
If you want to use threading, then use ThreadPoolExecutor to submit your function call to a thread. It will return a Future object. Future.result() will return the value returned by the call. If the call hasn't completed yet, this method will wait up to timeout seconds; if the call still hasn't completed by then, it will raise a TimeoutError.
from concurrent.futures import ThreadPoolExecutor

paths = {}  # graph and list_of_nodes as defined in the question
with ThreadPoolExecutor() as executor:
    for n1 in list_of_nodes:
        for n2 in list_of_nodes:
            if n1 != n2:
                all_simple_paths_futures = executor.submit(nx.all_simple_paths, graph, n1, n2)
                paths[n1 + "-" + n2] = all_simple_paths_futures

    try:
        for key in paths.keys():
            # get the results back from the threads
            future_obj = paths[key]
            paths[key] = list(future_obj.result())
    except Exception as e:
        print(e)
        raise e
For the difference between multiprocessing and threads, check this link: Multiprocessing vs Threading Python
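Since nx.all_simple_paths does CPU-bound work in pure Python, a thread pool may be limited by the GIL. For comparison, here is a minimal process-based sketch using ProcessPoolExecutor (my own addition, not part of the answer above; it reuses graph and list_of_nodes from the question and assumes that sending the graph to the worker processes is acceptable):

import networkx as nx
from concurrent.futures import ProcessPoolExecutor

def paths_for_pair(graph, n1, n2):
    # runs in a worker process; graph, n1 and n2 are pickled and sent to it,
    # which can be expensive for a very large graph
    return list(nx.all_simple_paths(graph, n1, n2))

if __name__ == '__main__':
    paths = {}
    with ProcessPoolExecutor(max_workers=32) as executor:
        futures = {}
        for n1 in list_of_nodes:
            for n2 in list_of_nodes:
                if n1 != n2:
                    futures[n1 + "-" + n2] = executor.submit(paths_for_pair, graph, n1, n2)
        for key, future in futures.items():
            paths[key] = future.result()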
Related
I am using Dask on an HPC cluster with 4 nodes, where each node has 12 cores.
My code is pure Python dealing with lists and sets, and does most of the computation in tight Python for loops. I read an answer here which suggests using more processes and fewer threads for such computation.
If I have
client = Client(n_workers=24, threads_per_worker=2)
will a computation on a Python list using .map() and .compute() split the work into 48 chunks in parallel? Wouldn't the GIL allow only one thread per worker process, and hence only 24 computations in parallel?
EDIT: How is it that, if I use the multiprocessing module and create a pool of threads on a single node, it is faster? Can I use Dask with 4 workers (1 worker per node) and a pool of 12 threads from the multiprocessing module?
My stripped down code looks like this:
import dask.bag as db

def my_func(x, g, k):
    # several for loops over the element x, using g and k
    return

b = db.from_sequence(some_list, npartitions=48).map(my_func, g, k)
m_op = b.compute()
The data g is a pretty huge list, and it gets duplicated if I use more processes, hence becoming the bottleneck. I also tried using
gx = dask.delayed(g) and passing gx to the function. This is also both memory- and time-consuming.
I understand (from the answers on StackOverflow) that I can use:
[future] = c.scatter([g])
but if all my workers randomly use the data g, I will have to broadcast it, and this will again be memory-consuming.
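Concretely, the broadcast variant I am referring to would look roughly like this (a sketch using the names from above: client, my_func, some_list, g and k; it relies on the distributed scheduler resolving the scattered future on the workers):

import dask.bag as db
from dask.distributed import Client

client = Client(n_workers=24, threads_per_worker=2)

# ship one copy of g to every worker up front, instead of once per task
[g_future] = client.scatter([g], broadcast=True)

b = db.from_sequence(some_list, npartitions=48).map(my_func, g_future, k)
m_op = b.compute()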
Please note that I am not modifying g in my function.
What is the right approach to tackle this?
One other minor observation/question about Dask:
my_func searches for something and returns a list of found elements. If a particular worker does not find an element, it returns an empty list. In the end, to concatenate the output, I have an ugly piece of code like the one below:
for sl in m_op:
    for item in sl:
        if item != []:
            nm_op.append(item)
Is there a better way to do this?
Thanks a lot for your time.
I have to run multiple simulations of the same model with varying parameters (or random number generator seeds). Previously I worked on a server with many cores, where I used the Python multiprocessing library with apply_async. This was very handy, as I could decide the maximum number of cores to occupy, and simulations would simply go into a queue.
As I understand from other questions, multiprocessing works on PBS clusters as long as you work on just one node, which is fine for now. However, my code doesn't always work.
To give you an idea of my kind of code:
import multiprocessing as mp
import numpy as np
import pandas as pd

import functions_library as L

if __name__ == "__main__":

    N = 100
    proc = 50
    pool = mp.Pool(processes=proc)

    seed = 342
    np.random.seed(seed)
    seeds = np.random.randint(low=1, high=100000, size=N)

    resul = []
    for SEED in seeds:
        SEED = int(SEED)
        resul.append(pool.apply_async(L.some_function, args=(some_args)))
        print(SEED)

    results = [p.get() for p in resul]

    database = pd.DataFrame(results)
    database.to_csv("prova.csv")
The function creates 3 networkx graphs of N=10000 and performs some computations on them, then returns a simple, short Python dictionary.
The weird thing I cannot debug is the following error message:
multiprocessing.pool.MaybeEncodingError: Error sending result: ''. Reason: 'RecursionError('maximum recursion depth exceeded while calling a Python object')'
What's strange is that I ran multiple instances of the code on different nodes. 3 times the code worked correctly, whereas most of the time it returns the previous error. I tried launching different numbers of parallel simulations, from 7 to 20 (the number of cores of the nodes), but there doesn't seem to be a pattern, so I guess it's not a memory issue.
In other questions a similar error seems to be related to pickling strange or big objects, but in this case the only thing that comes out of the function is a short dictionary, so it shouldn't be related to that.
I also tried increasing the allowed recursion depth with the sys library at the beginning of the script, but it didn't work even up to 15000.
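(i.e. something along these lines; my reconstruction of that attempt:)

import sys
sys.setrecursionlimit(15000)  # tried values up to 15000, still the same error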
Any idea how to solve, or at least understand, this behavior?
It turned out to be related to eigenvector_centrality() not converging.
When running outside of multiprocessing it correctly raises a networkx error, whereas inside multiprocessing only this recursion error is returned.
I don't know whether this is a weird, very function-specific behavior, or whether multiprocessing sometimes cannot handle certain library errors.
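In case it helps someone, here is a minimal sketch of how the real error can be surfaced through the pool: catch the NetworkX convergence exception inside the worker and send back plain, picklable data (safe_some_function is a hypothetical wrapper around the L.some_function used above):

import networkx as nx
import functions_library as L

def safe_some_function(*args):
    try:
        return L.some_function(*args)
    except nx.PowerIterationFailedConvergence as exc:
        # eigenvector_centrality() raises this when it does not converge;
        # return something small and picklable instead of the raw exception
        return {"error": "eigenvector_centrality did not converge", "detail": str(exc)}

# in the main block:
# resul.append(pool.apply_async(safe_some_function, args=(some_args)))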
I have to run my study in a parallel way to make it run much faster. I am new to the multiprocessing library in Python and could not yet make it run successfully.
Here, I am investigating whether each pair of (origin, target) remains at certain locations between various frames of my study. Several points:
It is one function that I want to run faster (it is not several separate processes).
The processing is performed sequentially; that is, each frame is compared with the previous one.
This code is a much simplified form of the original code. The code outputs a residence_list.
I am using Windows OS.
Can someone check the code (the multiprocessing section) and help me improve it to make it work? Thanks.
import numpy as np
from multiprocessing import Pool, freeze_support
def Main_Residence(total_frames, origin_list, target_list):
    Previous_List = {}
    residence_list = []

    for frame in range(total_frames):                      # Each frame
        Current_List = {}                                   # Dict of pairs and their residence for this frame
        for origin in range(origin_list):
            for target in range(target_list):
                Pair = (origin, target)                     # Each pair
                if Pair in Current_List.keys():             # If already considered, continue
                    continue
                else:
                    if origin == target:
                        if Pair in Previous_List.keys():    # If remained from the previous frame, add residence
                            print "Origin_Target remained: ", Pair
                            Current_List[Pair] = (Previous_List[Pair] + 1)
                        else:                               # If new, add it to the current
                            Current_List[Pair] = 1

        for pair in Previous_List.keys():                   # Add those that exited from residence to the list
            if pair not in Current_List.keys():
                residence_list.append(Previous_List[pair])

        Previous_List = Current_List
    return residence_list


if __name__ == '__main__':
    pool = Pool(processes=5)
    Residence_List = pool.apply_async(Main_Residence, args=(20, 50, 50))
    print Residence_List.get(timeout=1)
    pool.close()
    pool.join()
    freeze_support()
    Residence_List = np.array(Residence_List) * 5
Multiprocessing does not make sense in the context you are presenting here.
You are creating five subprocesses (and three threads belonging to the pool, managing workers, tasks and results) to execute one function once. All of this is coming at a cost, both in system resources and execution time, while four of your worker processes don't do anything at all. Multiprocessing does not speed up the execution of a function. The code in your specific example will always be slower than plainly executing Main_Residence(20, 50, 50) in the main process.
For multiprocessing to make sense in such a context, your work at hand would need to be broken down into a set of homogeneous tasks that can be processed in parallel, with their results potentially being merged later.
As an example (not necessarily a good one), if you want to calculate the largest prime factors for a sequence of numbers, you can delegate the task of calculating that factor for any specific number to a worker in a pool. Several workers would then do these individual calculations in parallel:
from datetime import datetime
from multiprocessing import Pool

def largest_prime_factor(n):
    p = n
    i = 2
    while i * i <= n:
        if n % i:
            i += 1
        else:
            n //= i
    return p, n

if __name__ == '__main__':
    pool = Pool(processes=3)

    start = datetime.now()
    # this delegates half a million individual tasks to the pool, i.e.
    # largest_prime_factor(0), largest_prime_factor(1), ..., largest_prime_factor(499999)
    pool.map(largest_prime_factor, range(500000))
    pool.close()
    pool.join()
    print "pool elapsed", datetime.now() - start

    start = datetime.now()
    # same work, just in the main process
    [largest_prime_factor(i) for i in range(500000)]
    print "single elapsed", datetime.now() - start
Output:
pool elapsed 0:00:04.664000
single elapsed 0:00:08.939000
(the largest_prime_factor function is taken from @Stefan in this answer)
As you can see, the pool is only roughly twice as fast as single-process execution of the same amount of work, even though it runs in three processes in parallel. That's due to the overhead introduced by multiprocessing and the pool.
So, you stated that the code in your example has been simplified. You'll have to analyse your original code to see if it can be broken down into homogeneous tasks that can be passed to your pool for processing. If that is possible, using multiprocessing might help you speed up your program. If not, multiprocessing will likely cost you time rather than save it.
Edit:
Since you asked for suggestions on the code: I can hardly say anything about your function. You said yourself that it is just a simplified example to provide an MCVE (much appreciated, by the way! Most people don't take the time to strip their code down to its bare minimum). Requests for a code review are in any case better suited over at Code Review.
Play around a bit with the available methods of task delegation. In my prime factor example, using apply_async came with a massive penalty: execution time increased ninefold compared to using map. But my example uses just a simple iterable, while yours needs three arguments per task. This could be a case for starmap, but that is only available as of Python 3.3. Anyway, the structure/nature of your task data basically determines the correct method to use.
I did some quick-and-dirty testing with multiprocessing your example function.
The input was defined like this:
inp = [(20, 50, 50)] * 5000 # that makes 5000 tasks against your Main_Residence
I ran that in Python 3.6 in three subprocesses, with your function unaltered except for the removal of the print statement (I/O is costly). I used starmap, apply, starmap_async and apply_async, and also iterated through the results each time, to account for the blocking get() on the async results.
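The delegation code looked roughly like this (a reconstructed sketch; Main_Residence is the function from your question, minus the print, and only two of the four variants are shown):

from datetime import datetime
from multiprocessing import Pool

if __name__ == '__main__':
    inp = [(20, 50, 50)] * 5000
    pool = Pool(processes=3)

    start = datetime.now()
    results = pool.starmap(Main_Residence, inp)      # blocks until all tasks are done
    print("starmap elapsed", datetime.now() - start)

    start = datetime.now()
    async_results = [pool.apply_async(Main_Residence, args=args) for args in inp]
    results = [res.get() for res in async_results]   # blocking get() per async result
    print("apply async elapsed", datetime.now() - start)

    pool.close()
    pool.join()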
Here's the output:
starmap elapsed 0:01:14.506600
apply elapsed 0:02:11.290600
starmap async elapsed 0:01:27.718800
apply async elapsed 0:01:12.571200
# btw: 5k calls to Main_Residence in the main process looks as bad
# as using apply for delegation
single elapsed 0:02:12.476800
As you can see, the execution times differ, although all four methods do the same amount of work; the apply_async you picked appears to be the fastest method.
Coding style: your code looks quite ... unconventional :) You use Capitalized_Words_With_Underscores for your names (both function and variable names); that's pretty much a no-no in Python. Also, assigning the name Previous_List to a dictionary is ... questionable. Have a look at PEP 8, especially the section Naming Conventions, to see the commonly accepted coding style for Python.
Judging by the way your print looks, you are still using Python 2. I know that in corporate or institutional environments that's sometimes all you have available. Still, keep in mind that the clock for Python 2 is ticking.
I have developed an algorithm that is kind of a variation of BFS on a tree, but it includes a probabilistic factor. To check whether a node is the one I am looking for, a statistical test is performed (I won't get into too much detail about this). If the test result is positive, the node is added to another queue (called tested). But when a node fails the test, the nodes in tested need to be tested again, so this queue is appended to the one with the nodes yet to be tested.
In Python, considering that the queue q starts with the root node:
...
tested = []
while q:
    curr = q.pop(0)
    p = statistical_test(curr)
    if p:
        tested.append(curr)
    else:
        q.extend(curr.children())
        q.extend(tested)
        tested = []
return tested
As the algorithm is probabilistic, more than one node might be in tested after the search, but that is expected. The problem I am facing is estimating this algorithm's complexity, because I can't simply use BFS's complexity, as q and tested will have variable lengths.
I don't need a closed and definitive answer for this. What I need are some insights on how to deal with this situation.
The worst case scenario is the following process:
All elements 1 : n-1 pass the test and are appended to the tested queue.
Element n fails the test, is removed from q, and n-1 elements from tested are pushed back into q.
Go back to step 1 with n = n-1
This is a classic O(n²) process.
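A quick tally of the statistical tests in that worst case (a toy count of my own, assuming one test per element popped from q): the rounds perform n + (n-1) + ... + 1 = n(n+1)/2 tests in total.

def worst_case_test_count(n):
    tests = 0
    remaining = n
    while remaining > 0:
        tests += remaining   # remaining - 1 elements pass, the last one fails
        remaining -= 1       # the failing element is dropped, the rest are re-queued
    return tests

# worst_case_test_count(n) == n * (n + 1) // 2, i.e. Theta(n^2)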
I have a node tree where every node has an id (node number), a list of children and a depth indicator. I am then given a list of nodes whose depth I have to find. To do this I use a recursive function.
This is all fine and dandy, but I want to speed the process up. I've been looking into multiprocessing, but every time I try it the calculation time goes up (the higher the process count, the longer the runtime) compared to using no extra processes at all.
My code looks like junk from trying to understand a lot of different examples, so I'll post this pseudocode instead.
class Node:
    id = int
    children = Node[]
    depth = int

function makeNodeTree() ...

function find(x, node):
    for c in node.children:
        if c.id == x:
            return c
        else:
            result = find(x, c)
            if result != None:
                return result
    return None

function main():
    search = [nodeid, nodeid, nodeid...]

    timerstart
    for x in search: find(x, rootNode)
    timerstop

    timerstart
    <split list over number of processes>
    <do some multiprocess magic>
    <get results>
    timerstop

    compare the two
I've tried all kinds of tree sizes to see if there is any gain at all, but I have yet to find such a case, which leads me to think I'm doing something wrong. I guess what I'm asking for is an example/way of doing this traversal with a performance gain, using multiprocessing.
I know there are plenty of ways to organize nodes to make this task easy, but I want to check the possible(?) performance boost, if it is possible at all.
Multiprocessing has overhead, because every time you add a process it takes time to set it up. Also, if you are using standard Python threads, you are unlikely to get any speedup, because the GIL lets only one thread execute Python code at a time. So, three thoughts: (1) is your tree really so big that you need to speed this up? (2) spawn subprocesses; (3) don't use parallelism at each node, just at the top few levels, to minimize overhead.
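To illustrate thought (3), here is a rough sketch (my own, untested against your tree): it reuses find, rootNode and search from the pseudocode above, assumes the Node objects are picklable, and parallelises only across the root's subtrees, so the recursion inside each subtree stays serial.

from multiprocessing import Pool

def search_subtree(args):
    # one task per top-level subtree; each worker runs the plain recursive
    # find over its subtree for every id in the search list
    subtree_root, ids = args
    return [find(x, subtree_root) for x in ids]

if __name__ == '__main__':
    tasks = [(child, search) for child in rootNode.children]
    with Pool(processes=4) as pool:
        # each subtree is pickled once per task, not once per node
        partial_results = pool.map(search_subtree, tasks)
    # keep only the ids that were actually found in some subtree
    found = [node for sub in partial_results for node in sub if node is not None]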