I have a very specific graph problem in networkx:
My directed graph has two different types of nodes (I will call them I and T), and it is built with edges only between I-T and T-I (so no T connects to another T, and likewise for I).
Now I need to simulate a new graph with the same structure: I have a certain number of I's and T's, and each I-T edge exists with a certain probability (likewise for T-I, but with a different probability; let's call them p_i and p_o).
So my problem is that I can't iterate over both I and T with nested for loops, because both sets are quite big (the data I'm analyzing right now has 5000 T's and 5000 I's, but they will probably increase up to 300000 each) and my PC can't handle that.
What is the best way to create a graph in this situation?
This is the solution that #ravenspoint's comment led me to.
For T=5000 and I=5000 it works with one for loop over I and one over T, using np.random.binomial(1, p_i, nT) and np.random.binomial(1, p_o, nI) from numpy, where nT and nI are the numbers of T's and I's in the real graph, and creating an edge wherever the sampled array is 1.
If p_o = p_i (as happens in my example), #Stef's solution also works, and you can use nx.bipartite.random_graph(countadd, countt, p, seed=None, directed=True).
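For reference, here is a minimal sketch of that vectorized approach (the function name and node labels are illustrative, not from the original comment):

import networkx as nx
import numpy as np

def bipartite_directed_random(nI, nT, p_i, p_o):
    G = nx.DiGraph()
    I_nodes = ["I%d" % k for k in range(nI)]
    T_nodes = ["T%d" % k for k in range(nT)]
    G.add_nodes_from(I_nodes)
    G.add_nodes_from(T_nodes)
    for i in I_nodes:
        # One Python loop per side; the per-target coin flips are vectorized.
        hits = np.random.binomial(1, p_i, nT)
        G.add_edges_from((i, t) for t, h in zip(T_nodes, hits) if h)
    for t in T_nodes:
        hits = np.random.binomial(1, p_o, nI)
        G.add_edges_from((t, i) for i, h in zip(I_nodes, hits) if h)
    return G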
TL;DR: It is ten times faster to generate a list of static networks than it is to merge these static networks into a single dynamic network. Why is this so?
Following this answer, I attempt to generate a random dynamic graph using NetworkX and DyNetx.
The issue arises when dealing with mid-scale networks (approximately 1000 nodes and 1000 timestamps): memory crashes. On a smaller scale (about 100 nodes and 300 timestamps) the process is extremely slow. I believe I've identified the bottleneck, but I'm not sure how to deal with it.
The following is a simple example of code that generates a random temporal network:
import dynetx as dnx
import networkx as nx
import itertools
from random import random
def dynamic_random_graph(n, steps, up_rate, seed=42):
    # Create a list of static graphs
    list_of_snapshots = list()
    for t in range(0, steps):
        G_t = nx.Graph()
        edges = itertools.combinations(range(n), 2)
        G_t.add_nodes_from(range(n))
        for e in edges:
            if random() < up_rate:
                G_t.add_edge(*e)
        list_of_snapshots.append(G_t)

    # Merge the static graphs into a dynamic one
    dynamic_graph = dnx.DynGraph()
    for t, graph in enumerate(list_of_snapshots):
        dynamic_graph.add_interactions_from(graph.edges(data=False), t=t)

    return dynamic_graph
If we run the following command:
%timeit dynamic_random_graph(300, 100, 0.5)  # Memory crashed on larger networks.
>> 1 loop, best of 5: 15.1 s per loop
In contrast, if we run the code without the network merge, we get significantly better results:
%timeit dynamic_random_graph_without_merge(300, 100, 0.5) # Ignore the merge part in the function
>> 1 loop, best of 5: 1.5 s per loop
We can work on networks with 1000 nodes without a memory crash if we run the function without the merge part.
So, I'd like to look at the DyNetx source code and try to figure out what's wrong with the add_interactions_from method.
The function is short and simple, but I'm curious why it takes so much time and memory, and how I can improve it. What are your thoughts?
This is the source code:
def add_interactions_from(self, ebunch, t=None, e=None):
    """Add all the interactions in ebunch at time t.

    Parameters
    ----------
    ebunch : container of interactions
        Each interaction given in the container will be added to the
        graph. The interactions must be given as 2-tuples (u,v) or
        3-tuples (u,v,d) where d is a dictionary containing interaction
        data.
    t : appearance snapshot id, mandatory
    e : vanishing snapshot id, optional

    See Also
    --------
    add_edge : add a single interaction

    Examples
    --------
    >>> import dynetx as dn
    >>> G = dn.DynGraph()
    >>> G.add_edges_from([(0,1),(1,2)], t=0)
    """
    # set up attribute dict
    if t is None:
        raise nx.NetworkXError("The t argument must be specified.")

    # process ebunch
    for ed in ebunch:
        self.add_interaction(ed[0], ed[1], t, e)
I suppose the loop at the end is the source of all problems.
Link to the add_interaction implementation.
Just a few considerations:
it is completely normal that creating a list of snapshots without the merging phase is less costly than merging them into a DynGraph: this is mostly because the temporal information of replicated edges has to be compressed into edge attributes;
the random graphs you are generating are dense (50% of all possible edges are present, something unrealistic in most real contexts), and this requires constant updates of the edges' attributes. By reducing the number of edges you'll be able to scale up to bigger networks. Just as an example, consider that for the ER model you are simulating, a p just above 1/N (where N is the number of nodes in the graph) suffices to guarantee a supercritical regime (i.e., a giant connected component);
DyNetx is built by extending NetworkX, which is not particularly scalable (both in terms of memory consumption and execution times): when dealing with dense, heavily edge-attributed graphs, such limitations are more evident than ever;
the way you are building the dynamic graph is likely the most time-consuming one available. You are adding interactions for each pair of nodes without leveraging the knowledge of their effective duration. If the interaction (u,v) takes place k times from t to t+k, you can insert that edge just once, specifying its vanishing time, thus reducing the graph manipulation operations (see the sketch below).
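A minimal illustration of that last point, using the add_interaction signature visible in the source quoted above:

import dynetx as dnx

g = dnx.DynGraph()
# One call covers the whole lifespan: the edge (0, 1) appears at
# snapshot 3 and vanishes at snapshot 8, instead of needing a separate
# add_interaction call for each intermediate snapshot.
g.add_interaction(0, 1, t=3, e=8)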
Indeed, DyNetx is not designed to handle particularly large graphs; however, we have leveraged it to analyze interaction networks, built on top of online social network data, several orders of magnitude larger (in terms of nodes) than the reported examples.
As I said before: real networks are sparser than the ones you are simulating. Moreover, (social) interactions usually happen in "bursts". Those two data characteristics often mitigate the library limitations.
Anyhow, we welcome every contribution to the library: anyone who would like to work on its scalability will have all our support!
What is the best algorithm for generating a reachability matrix from a given adjacency matrix? There is Warshall's algorithm, but it is not the best method. There are some other methods, but the procedures are more theoretical. Is there any module with which I can create a reachability matrix with ease? I am working in Python 2.7.
I don't think there is a way to do it faster than O(n³) in the general case for a directed graph.
That being said, you can try to use clever techniques to reduce the constant.
For example, you can do the following:
Convert your graph into a DAG by finding all strongly connected components and replacing each with a single vertex. This can be done in Θ(V + E) or Θ(V²).
On the new graph, run a DFS to calculate reachability for all vertices, but, when updating the reachability set of a vertex, do it in a fast vectorized way. This is technically Θ((V + E) · V) or Θ(V³), but the constant will be low (see below).
The proposed vectorized way is to represent the reachability set of every vertex as a bit vector residing on the GPU. This way, the union of two sets is computed in an extremely fast, parallelized manner on the GPU. You can use any GPU tensor library, for example tf.bitwise.
After you have calculated the reachability bit vectors for every vertex on the GPU, you can extract them into CPU memory in Θ(V²) time. A sketch of the first two steps follows.
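Here is a CPU sketch of steps 1 and 2, with numpy boolean vectors standing in for the GPU bit vectors (swapping in a GPU tensor library is then a local change):

import numpy as np
import networkx as nx

def reachability_matrix(G):
    # 1. Collapse strongly connected components into a DAG.
    C = nx.condensation(G)              # DAG whose nodes are SCC ids
    k = C.number_of_nodes()
    reach = np.eye(k, dtype=bool)       # every SCC reaches itself
    # 2. Process SCCs in reverse topological order, so each successor's
    #    reachability set is final before it is merged into its parents.
    for u in reversed(list(nx.topological_sort(C))):
        for v in C.successors(u):
            reach[u] |= reach[v]        # vectorized set union
    # 3. Expand the SCC-level matrix back to the original nodes.
    mapping = C.graph["mapping"]        # original node -> SCC id
    nodes = list(G)
    idx = np.array([mapping[n] for n in nodes])
    return reach[np.ix_(idx, idx)], nodes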
I am attempting to draw a very large networkx graph that has approximately 5000 nodes and 100000 edges. It represents the road network of a large city. I cannot determine whether the computer is hanging or whether it simply takes forever. The line of code that it seems to be hanging on is the following:
# a is my network
pos = networkx.spring_layout(a)
Is there perhaps a better method for plotting such a large network?
Here's the good news: it wasn't broken. It was working, but you wouldn't want to wait for it even if you could.
Check out my answer to this question to see what your end result would look like.
Drawing massive networkx graph: Array too big
I think the spring layout is roughly an n^3 algorithm, which would take about 125,000,000,000 calculations to get the positions for your graph. The best thing for you is to choose a different layout type or plot the positions yourself (see the sketch below).
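For a road network you usually already have coordinates, so a minimal sketch of "plot the positions yourself" (the 'x'/'y' attribute names are illustrative; use whatever your data provides):

import networkx as nx
import matplotlib.pyplot as plt

# a is the road network from the question
pos = {n: (d["x"], d["y"]) for n, d in a.nodes(data=True)}
nx.draw(a, pos, node_size=2, width=0.3, with_labels=False)
plt.show()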
Another alternative is pulling out the relevant points yourself using a tool called Gephi.
As Aric said, if you know the locations, that's probably the best option.
If instead you just know distances but don't have locations to plug in, there's a calculation you can do that will reproduce the locations pretty well (up to a rotation). If you do a principal component analysis of the distances and project into 2 dimensions, it will probably do a very good job of estimating the geographic locations. (It was an example I saw in a linear algebra class once.)
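That idea is essentially classical multidimensional scaling; a minimal sketch, assuming a dense (n, n) matrix D of pairwise distances:

import numpy as np

def mds_positions(D):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J.dot(D ** 2).dot(J)          # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    coords = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0))
    return {i: coords[i] for i in range(n)}  # networkx-style pos dict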
I'm trying to perform what seems to be a complicated and time-consuming multi-objective optimization on a large-ish graph.
Here's the problem: I want to find a graph of n vertices (n is constant at, say, 100) and m edges (m can change) where a set of metrics is optimized:
Metric A needs to be as high as possible
Metric B needs to be as low as possible
Metric C needs to be as high as possible
Metric D needs to be as low as possible
My best guess is to go with a GA. I am not very familiar with genetic algorithms, but I can spend a little time learning the basics. From what I've read so far, I need to proceed as follows:
1. Generate a population of graphs of n nodes randomly connected to each other by m = random[1, 2000] (for instance) edges
2. Run the metrics A, B, C, D on each graph
3. Is an optimal solution found (as defined in the problem)?
4. If yes, perfect. If not:
5. Select the best graphs
6. Crossover
7. Mutate (add or remove edges randomly?)
8. Go to 3.
Now, I usually use Python for my little experiments. Could DEAP (https://code.google.com/p/deap/) help me with this problem?
If so, I have many more questions (especially on the crossover and mutate steps), but in short: are the steps (in Python, using DEAP) easy enough to be explained or summarized here?
I can try and elaborate if needed. Cheers.
Disclaimer: I am one of DEAP's lead developers.
Your individual could be represented by a binary string. Each bit would indicate whether there is an edge between two vertices. Therefore, your individuals would be composed of n * (n - 1) / 2 bits, where n is the number of vertices. To evaluate your individual, you would simply need to build an adjacency matrix from the individual genotype. For an evaluation function example, see the following gist https://gist.github.com/cmd-ntrf/7816665.
Your fitness would be composed of 4 objectives, and based on what you said regarding the minimization and maximization of each objective, the fitness class would be created like this:
creator.create("Fitness", base.Fitness, weights=(1.0, -1.0, 1.0, -1.0))
The crossover and mutation operators could be the same as in the OneMax example.
http://deap.gel.ulaval.ca/doc/default/examples/ga_onemax_short.html
However, since you want to do multi-objective, you would need a multi-objective selection operator, either NSGA2 or SPEA2. Finally, the algorithm would have to be mu + lambda. For both multi-objective selection and mu + lambda algorithm usage, see the GA Knapsack example.
http://deap.gel.ulaval.ca/doc/default/examples/ga_knapsack.html
So essentially, to get up and running, you only have to merge part of the OneMax example with the Knapsack one while using the proposed evaluation function. A rough sketch follows.
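A minimal sketch of that merge (the placeholder metrics inside evaluate() are illustrative stand-ins for A-D, not from the linked gist):

import random
import numpy as np
from deap import algorithms, base, creator, tools

N = 100                       # number of vertices (fixed, per the question)
L = N * (N - 1) // 2          # one bit per potential edge

creator.create("Fitness", base.Fitness, weights=(1.0, -1.0, 1.0, -1.0))
creator.create("Individual", list, fitness=creator.Fitness)

def evaluate(ind):
    # Rebuild the adjacency matrix from the upper-triangular bit string,
    # then compute the four metrics on it (placeholders below).
    adj = np.zeros((N, N), dtype=int)
    adj[np.triu_indices(N, k=1)] = ind
    adj += adj.T
    return adj.sum(), adj.mean(), adj.trace(), adj.max()  # stand-ins for A, B, C, D

toolbox = base.Toolbox()
toolbox.register("bit", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.bit, L)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)          # as in OneMax
toolbox.register("mutate", tools.mutFlipBit, indpb=1.0 / L)
toolbox.register("select", tools.selNSGA2)          # multi-objective selection

pop = toolbox.population(n=50)
pop, _ = algorithms.eaMuPlusLambda(pop, toolbox, mu=50, lambda_=100,
                                   cxpb=0.6, mutpb=0.3, ngen=40, verbose=False)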
I suggest the excellent Pyevolve library: https://github.com/perone/Pyevolve. It will do most of the work for you; you will only have to define the fitness function and your representation (nodes/functions). You can specify the crossover and mutation rates as well.
I am trying to get the list of connected components in a graph with 100 million nodes. For smaller graphs, I usually use the connected_components function of the NetworkX module in Python, which does exactly that. However, loading a graph with 100 million nodes (and their edges) into memory with this module would require ca. 110 GB of memory, which I don't have. An alternative would be to use a graph database that has a connected-components function, but I haven't found any in Python. It would seem that Dex (API: Java, .NET, C++) has this functionality, but I'm not 100% sure. Ideally I'm looking for a solution in Python. Many thanks.
SciPy has a connected components algorithm. It expects as input the adjacency matrix of your graph in one of its sparse matrix formats and handles both the directed and undirected cases.
Building a sparse adjacency matrix from a sequence of (i, j) pairs adj_list, where i and j are (zero-based) indices of nodes, can be done with:
import numpy as np
import scipy.sparse

i_indices, j_indices = zip(*adj_list)
adj_matrix = scipy.sparse.coo_matrix((np.ones(len(adj_list)),
                                      (i_indices, j_indices)))
You'll have to do some extra work for the undirected case.
This approach should be efficient if your graph is sparse enough.
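To complete the pipeline, a short sketch of the SciPy call itself (directed=False treats the matrix as undirected, which also covers the "extra work" mentioned above):

from scipy.sparse.csgraph import connected_components

# labels[i] is the id of the component that node i belongs to
n_components, labels = connected_components(adj_matrix, directed=False)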
https://graph-tool.skewed.de/performance
As you can see from the performance page, this tool is very fast. It's written in C++, but the interface is in Python.
If this tool isn't good enough for you (which I think it will be), you can try Apache Giraph (http://giraph.apache.org/).