TL;DR: It is ten times faster to generate a list of static networks than it is to merge these static networks into a single dynamic network. Why is this so?
Following this answer, I attempt to generate a random dynamic graph using NetworkX and DyNetx.
The issue arises with mid-scale networks (approximately 1000 nodes and 1000 timestamps): the process crashes due to memory. Even at a smaller scale (about 100 nodes and 300 timestamps), the process is extremely slow. I believe I've identified the bottleneck, but I'm not sure how to deal with it.
The following is a simple example of code that generates a random temporal network:
import dynetx as dnx
import networkx as nx
import itertools
from random import random

def dynamic_random_graph(n, steps, up_rate, seed=42):
    # Create list of static graphs
    list_of_snapshots = list()
    for t in range(0, steps):
        G_t = nx.Graph()
        edges = itertools.combinations(range(n), 2)
        G_t.add_nodes_from(range(n))
        for e in edges:
            if random() < up_rate:
                G_t.add_edge(*e)
        list_of_snapshots.append(G_t)

    # Merge the static graphs into dynamic one
    dynamic_graph = dnx.DynGraph()
    for t, graph in enumerate(list_of_snapshots):
        dynamic_graph.add_interactions_from(graph.edges(data=False), t=t)

    return dynamic_graph
If we run the following command:
%timeit dynamic_random_graph(300, 100, 0.5) # Memory was crashed on larger networks.
>> 1 loop, best of 5: 15.1 s per loop
In contrast, if we run the code without merging the networks, we get significantly better results:
%timeit dynamic_random_graph_without_merge(300, 100, 0.5) # Ignore the merge part in the function
>> 1 loop, best of 5: ~1.5 s per loop (roughly ten times faster than with the merge)
If we skip the merge step, we can work on networks with 1000 nodes without a memory crash.
So, I'd like to look at the DyNetx source code and try to figure out what's wrong with the add_interactions_from method.
The function is short and simple, but I'm curious why it takes so much time and memory, and how I can improve it. What are your thoughts?
This is the source code:
def add_interactions_from(self, ebunch, t=None, e=None):
    """Add all the interaction in ebunch at time t.

    Parameters
    ----------
    ebunch : container of interaction
        Each interaction given in the container will be added to the
        graph. The interaction must be given as as 2-tuples (u,v) or
        3-tuples (u,v,d) where d is a dictionary containing interaction
        data.
    t : appearance snapshot id, mandatory
    e : vanishing snapshot id, optional

    See Also
    --------
    add_edge : add a single interaction

    Examples
    --------
    >>> import dynetx as dn
    >>> G = dn.DynGraph()
    >>> G.add_edges_from([(0,1),(1,2)], t=0)
    """
    # set up attribute dict
    if t is None:
        raise nx.NetworkXError("The t argument must be a specified.")

    # process ebunch
    for ed in ebunch:
        self.add_interaction(ed[0], ed[1], t, e)
I suppose the loop at the end is the source of all problems.
Link to the add_interaction implementation.
Just a few considerations:
it is completely normal that creating a snapshot list without the merging phase is less costly than merging them into a DynGraph: this is mostly because temporal information for replicated edges has to be compressed into edge attributes;
the random graphs you are generating are dense (50% of all possible edges are present, which is unrealistic in most real contexts), and this requires constant updates of the edge attributes. By reducing the number of edges you'll be able to scale up to bigger networks. Just as an example, for the ER model you are simulating, p = 1/N (where N is the number of nodes in the graph) already suffices to guarantee a supercritical regime (i.e., a single connected component);
DyNetx is built on top of NetworkX, which is not particularly scalable (both in memory consumption and execution time): when dealing with dense, heavily edge-attributed graphs such limitations are more evident than ever;
the way you are building the dynamic graph is likely the most time-consuming one available. You are adding interactions for each pair of nodes without leveraging knowledge of their effective duration. If the interaction (u, v) takes place k times from t to t+k, you can insert that edge just once, specifying its vanishing time, thus reducing the number of graph manipulation operations (see the sketch below).
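To illustrate the last point, here is a rough sketch (mine, not the library's recommended recipe) of compressing consecutive snapshots into intervals before inserting them. It assumes that add_interaction's e parameter marks the snapshot at which the interaction vanishes; check the DyNetx docs for the exact interval semantics.

import itertools
from collections import defaultdict
from random import random, seed

import dynetx as dnx

def dynamic_random_graph_compressed(n, steps, up_rate, rng_seed=42):
    seed(rng_seed)
    # Record, for every node pair, the snapshots in which the edge is active.
    presence = defaultdict(list)
    for t in range(steps):
        for u, v in itertools.combinations(range(n), 2):
            if random() < up_rate:
                presence[(u, v)].append(t)

    g = dnx.DynGraph()
    for (u, v), times in presence.items():
        # Collapse runs of consecutive snapshots into a single interval so the
        # edge is inserted once per interval instead of once per snapshot.
        start = prev = times[0]
        for t in times[1:]:
            if t == prev + 1:
                prev = t
            else:
                g.add_interaction(u, v, t=start, e=prev + 1)
                start = prev = t
        g.add_interaction(u, v, t=start, e=prev + 1)
    return g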
Indeed, DyNetx is not designed to handle particularly large graphs; however, we have leveraged it to analyze interaction networks built on top of online social network data several orders of magnitude larger (in terms of nodes) than the reported examples.
As I said before: real networks are sparser than the ones you are simulating. Moreover, (social) interactions usually happen in "bursts". Those two data characteristics often mitigate the library limitations.
Anyhow, we welcome every contribution to the library: anyone who would like to work on its scalability will have all our support!
Related
I have a very specific graph problem in networkx:
My directed graph has two different types of nodes (I will call them I and T), and it is built with edges only between I-T and T-I (so T does not connect with other T nodes, and the same holds for I).
Now I need to simulate a new graph with the same behavior: I have a certain number of I and T nodes, and an I-T edge exists with a certain probability (the same for T-I but with a different probability; let's call them p_i and p_o).
My problem is that I can't iterate with nested for loops over both I and T because both are quite big (the data I'm analyzing right now has 5000 T's and 5000 I's, but they will probably increase up to 300,000 each) and my PC can't handle that.
What is the best way to create a graph in this situation?
This is the solution that @ravenspoint's comment led me to.
For T = 5000 and I = 5000 it works: do one for loop over I and one over T, draw np.random.binomial(1, p_i, nI) and np.random.binomial(1, p_o, nO) from numpy (where nI and nO are the sizes of the I and T node sets in the real graph), and create an edge wherever the drawn array is 1.
If p_o = p_i (as happens in my example), @Stef's solution also works and you can use nx.bipartite.random_graph(countadd, countt, p, seed=None, directed=True).
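A minimal sketch of that approach, assuming p_i is the I→T probability and p_o the T→I probability (the function name and node labels are made up for illustration):

import numpy as np
import networkx as nx

def random_bipartite_digraph(n_i, n_t, p_i, p_o, seed=None):
    rng = np.random.default_rng(seed)
    g = nx.DiGraph()
    i_nodes = ["I%d" % k for k in range(n_i)]
    t_nodes = ["T%d" % k for k in range(n_t)]
    g.add_nodes_from(i_nodes)
    g.add_nodes_from(t_nodes)

    # One vectorised binomial draw per I node decides all its I -> T edges,
    # so the inner loop over T happens inside numpy instead of Python.
    for i_name in i_nodes:
        mask = rng.binomial(1, p_i, n_t).astype(bool)
        g.add_edges_from((i_name, t_nodes[j]) for j in np.flatnonzero(mask))

    # Same idea for the T -> I edges with probability p_o.
    for t_name in t_nodes:
        mask = rng.binomial(1, p_o, n_i).astype(bool)
        g.add_edges_from((t_name, i_nodes[k]) for k in np.flatnonzero(mask))

    return g

g = random_bipartite_digraph(5000, 5000, 0.001, 0.002, seed=42)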
Networkx appears to have a lot of random graph generators. Why are there so many and which should I choose?
fast_gnp_random_graph
gnp_random_graph
dense_gnm_random_graph
gnm_random_graph
erdos_renyi_graph
binomial_graph
https://networkx.github.io/documentation/stable/reference/generators.html
Some of them really are identical, i.e. just aliases provided for convenience.
E.g. gnp_random_graph = binomial_graph = erdos_renyi_graph.
They all generate the same type of graph but some use different algorithms that perform better or worse depending on the parameters/properties of your graph (size, density, ...). So there is no single best choice. (Even if there were, it may be of academic interest to some people to also have alternate algorithms available – e.g. for speed comparisons.)
Some also differ in the way you define/parameterize your graph. E.g. some use the number of nodes and the probability of an edge, while others are defined by the number of nodes and the number of edges.
Depending on your application one may be preferable to the other.
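For instance (the alias identities below hold in recent NetworkX versions to the best of my knowledge; verify against the one you have installed):

import networkx as nx

# These names refer to the very same function object.
assert nx.binomial_graph is nx.gnp_random_graph
assert nx.erdos_renyi_graph is nx.gnp_random_graph

# G(n, p): n nodes, each possible edge present independently with probability p.
g_np = nx.gnp_random_graph(1000, 0.01, seed=42)

# Same distribution, but O(n + m) expected running time -- better for sparse graphs.
g_fast = nx.fast_gnp_random_graph(1000, 0.01, seed=42)

# G(n, m): n nodes and exactly m edges, chosen uniformly at random.
g_nm = nx.gnm_random_graph(1000, 5000, seed=42)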
I'm trying to compute the matrix product Y = XX^T for a matrix X of size 10,000 * 800,000. The matrix X is stored on disk in an h5py file. The resulting Y should be a 10,000 * 10,000 matrix stored in the same h5py file. Here is a reproducible code sample.
import dask.array as da
from blaze import into

into("h5py:///tmp/dummy::/X", da.ones((10**4, 8*10**5), chunks=(10**4, 10**4)))
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**4, 10**4))
y = x.dot(x.T)
into("h5py:///tmp/dummy::/Y", y)
I expected this computation to go smoothly as each (10,000*10,000) chunk should be individually transposed, followed by a dot product and then summed up to the final result. However, running this computation fills both my RAM and swap memory until the process eventually gets killed.
Here is a sample of the computation graph plotted with dot_graph: [computation graph sample]
According to the scheduling docs (http://dask.pydata.org/en/latest/scheduling-policy.html), I would expect the upper tensordot intermediate results to be summed up one by one into the final sum result as soon as they have been individually computed. This would free the memory of those tensordot intermediates, so that we would not run into memory errors.
Playing around with a smaller toy example:
from dask.diagnostics import Profiler, CacheProfiler, ResourceProfiler

# Experiment on a (1,000 * 5,000) matrix X split into 500 chunks of size (1,000 * 10)
x = into(da.Array, "h5py:///tmp/dummy::/X", chunks=(10**3, 10))[:10**3, :5000]
y = x.T.dot(x)

with Profiler() as prof, CacheProfiler() as cprof, ResourceProfiler() as rprof:
    into("h5py:///tmp/dummy::/X", y)

rprof.visualize()
I get the following display: [resource profiler output]
The green bar represents the sum operation, while the yellow and purple bars represent the get_array and tensordot operations, respectively. This seems to indicate that the sum operation waits for all intermediate tensordot operations to be performed before summing them. This would also explain why my process runs out of memory and gets killed.
So my questions are:
Is this the normal behavior of the sum operation?
Is there a way to force it to compute intermediate sums before all the intermediate tensordot products are computed and kept in memory?
If not, is there a workaround that does not involve spilling to disk?
Any help much appreciated!
Generally speaking, performing a dense matrix-matrix multiply in small space is hard. This is because every intermediate chunk will be used by several of the output chunks.
According to the scheduling docs (http://dask.pydata.org/en/latest/scheduling-policy.html), I would expect the upper tensordot intermediate results to be summed up one by one into the final sum result as soon as they have been individually computed.
The graph that you have shown has many inputs to a sum function. Dask will wait until all of those inputs are complete before running the sum function. The task scheduler has no idea that sum is associative and can be run piece by piece. This lack of semantic information is the price you pay for using a general task scheduling system like Dask rather than a dedicated linear algebra library. If your goal is to perform dense linear algebra as efficiently as possible then you might want to look elsewhere; this is a well covered field.
So, as written, your memory requirements are at least 8e5 * 1e4 * dtype.itemsize bytes, assuming that Dask proceeds in exactly the right order (which it should mostly do); for float64 that is 8e9 elements * 8 bytes = 64 GB.
You might try the following:
Reduce the chunk size along the non-contracting dimension (see the sketch after this list)
Use a version of Dask later than 0.14.1 (0.14.2 should be released by May 5th, 2017), where we break down those large sum calls into many smaller ones explicitly in the graph.
Use the distributed scheduler, which handles writing data to disk more efficiently.
from dask.distributed import Client
client = Client(processes=False) # create a local cluster in this process
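For the first suggestion, here is a rough sketch of what narrower chunks along the non-contracted (row) axis could look like, reading X back with h5py and dask directly; the chunk size 10**3 and the dataset paths simply mirror the example above:

import dask.array as da
import h5py

f = h5py.File("/tmp/dummy", mode="a")

# Narrow chunks along the non-contracted (row) axis keep the tensordot
# intermediates feeding each output block of Y small.
x = da.from_array(f["/X"], chunks=(10**3, 10**4))
y = x.dot(x.T)

# Stream the result into an on-disk dataset instead of materialising it in RAM.
out = f.create_dataset("/Y", shape=y.shape, dtype=y.dtype)
da.store(y, out)
f.close()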
I am attempting to draw a very large networkx graph that has approximately 5000 nodes and 100000 edges. It represents the road network of a large city. I cannot determine whether the computer is hanging or whether it simply takes forever. The line of code that it seems to be hanging on is the following:
# a is my network
pos = networkx.spring_layout(a)
Is there perhaps a better method for plotting such a large network?
Here is the good news: it wasn't broken. It was working, and you wouldn't want to wait for it even if you could.
Check out my answer to this question to see what your end result would look like.
Drawing massive networkx graph: Array too big
I think the spring layout is an n^3 algorithm which would take 125,000,000,000 calculations to get the positions for your graph. The best thing for you is to choose a different layout type or plot the positions yourself.
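For example, if the node attributes already carry coordinates, you can skip the layout computation entirely; a minimal sketch (the attribute names 'x' and 'y' and the toy graph are made up, use whatever your data provides):

import networkx as nx
import matplotlib.pyplot as plt

# Stand-in for the real road network.
a = nx.Graph()
a.add_node(0, x=0.0, y=0.0)
a.add_node(1, x=1.0, y=0.5)
a.add_node(2, x=2.0, y=0.2)
a.add_edges_from([(0, 1), (1, 2)])

# Option 1: use the known positions directly instead of spring_layout.
pos = {n: (d["x"], d["y"]) for n, d in a.nodes(data=True)}

# Option 2: a cheap O(n) layout, just to get some picture quickly.
# pos = nx.random_layout(a, seed=42)

nx.draw(a, pos=pos, node_size=2, width=0.2)
plt.savefig("road_network.png", dpi=300)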
So another alternative is pulling out the relevant points yourself using a tool called Gephi.
As Aric said, if you know the locations, that's probably the best option.
If instead you just know distances, but don't have locations to plug in, there's a calculation you can do that will reproduce the locations pretty well (up to a rotation). If you do a principal component analysis of the distances and project into 2 dimensions, it will probably do a very good job of estimating the geographic locations. (It was an example I saw in a linear algebra class once.)
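A minimal sketch of that idea via classical multidimensional scaling (double-centre the squared distances, then keep the top two eigenvectors); the function name and the toy distance matrix are made up for illustration:

import numpy as np

def positions_from_distances(d):
    """Recover 2-D coordinates (up to rotation/reflection) from an
    n x n symmetric matrix of pairwise distances d."""
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # The two largest eigenpairs give the principal 2-D projection.
    vals, vecs = np.linalg.eigh(b)
    top = np.argsort(vals)[::-1][:2]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

# Toy example: three points on a line at x = 0, 1, 3.
d = np.array([[0., 1., 3.],
              [1., 0., 2.],
              [3., 2., 0.]])
print(positions_from_distances(d))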
I am trying to get the list of connected components in a graph with 100 million nodes. For smaller graphs, I usually use the connected_components function of the Networkx module in Python which does exactly that. However, loading a graph with 100 million nodes (and their edges) into memory with this module would require ca. 110GB of memory, which I don't have. An alternative would be to use a graph database which has a connected components function but I haven't found any in Python. It would seem that Dex (API: Java, .NET, C++) has this functionality but I'm not 100% sure. Ideally I'm looking for a solution in Python. Many thanks.
SciPy has a connected components algorithm. It expects as input the adjacency matrix of your graph in one of its sparse matrix formats and handles both the directed and undirected cases.
Building a sparse adjacency matrix from a sequence of (i, j) pairs adj_list where i and j are (zero-based) indices of nodes can be done with
import numpy as np
import scipy.sparse

i_indices, j_indices = zip(*adj_list)
adj_matrix = scipy.sparse.coo_matrix((np.ones(len(adj_list)),  # one entry per edge
                                      (i_indices, j_indices)),
                                     shape=(number_of_nodes, number_of_nodes))
You'll have to do some extra work for the undirected case.
This approach should be efficient if your graph is sparse enough.
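A minimal end-to-end sketch of the call itself (the tiny edge list is made up; directed=False is one way to cover the undirected case mentioned above):

import numpy as np
import scipy.sparse
from scipy.sparse.csgraph import connected_components

adj_list = [(0, 1), (1, 2), (3, 4)]   # hypothetical (i, j) pairs
number_of_nodes = 6                   # node 5 is isolated

i_indices, j_indices = zip(*adj_list)
adj_matrix = scipy.sparse.coo_matrix(
    (np.ones(len(adj_list)), (i_indices, j_indices)),
    shape=(number_of_nodes, number_of_nodes))

# directed=False treats each stored (i, j) entry as an undirected edge,
# so the matrix does not need to be symmetrised by hand.
n_components, labels = connected_components(adj_matrix, directed=False)
print(n_components)   # 3
print(labels)         # component id of each node, e.g. [0 0 0 1 1 2]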
https://graph-tool.skewed.de/performance
This tool, as you can see from the performance page, is very fast. It's written in C++, but the interface is in Python.
If this tool isn't good enough for you (though I think it will be), you can try Apache Giraph (http://giraph.apache.org/).