MapReduce: Finding the triangles in a network graph using MrJob (Python)

I have an application with a graph, and I need to count the number of triangles in it using MrJob (MapReduce in Python). However, I'm having trouble wrapping my head around the mapping and reducing steps needed.
What is the best MapReduce pipeline for computing the triangles of a network graph?
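For a concrete starting point, here is a minimal sketch of one possible two-round pipeline, assuming the input arrives as one whitespace-separated undirected edge "u v" per line. The class name, input format, and step layout are illustrative, not a canonical recipe:

from itertools import combinations
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRTriangleCount(MRJob):
    # Round 1: build each node's neighbour list, then emit every
    # "wedge" (two-edge path) plus every edge, keyed by node pair.
    # Round 2: a wedge whose endpoints are also joined by an edge
    # closes a triangle.
    def steps(self):
        return [
            MRStep(mapper=self.mapper_edges, reducer=self.reducer_wedges),
            MRStep(reducer=self.reducer_close_wedges),
            MRStep(reducer=self.reducer_total),
        ]

    def mapper_edges(self, _, line):
        u, v = line.split()
        if u != v:                      # drop self-loops
            yield u, v
            yield v, u

    def reducer_wedges(self, node, neighbors):
        nbrs = sorted(set(neighbors))
        for v in nbrs:
            if node < v:                # re-emit each edge exactly once
                yield (node, v), "edge"
        for a, b in combinations(nbrs, 2):
            yield (a, b), "wedge"       # path a-node-b, awaiting edge a-b

    def reducer_close_wedges(self, pair, kinds):
        kinds = list(kinds)
        if "edge" in kinds:
            yield None, kinds.count("wedge")

    def reducer_total(self, _, counts):
        # every triangle closes at each of its three edges, so divide by 3
        yield "triangles", sum(counts) // 3

if __name__ == "__main__":
    MRTriangleCount.run()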

It would help to have a bit more context before answering. Do you have a single graph, a large number of graphs, or perhaps a tree? How many nodes are we talking about?
In general, though, I would try to build a solution around the networkx package, with its triangles function at the core.
One issue you may face is filtering duplicates, since triangles are reported relative to each node.
So more context on the specifics would help narrow down the answer.
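To illustrate the duplicate-filtering point: nx.triangles reports a per-node count, and each triangle is counted once at each of its three vertices, so the global total needs a division by three. A minimal sketch (the karate club graph is just a stand-in for your data):

import networkx as nx

G = nx.karate_club_graph()           # stand-in graph
per_node = nx.triangles(G)           # {node: triangles containing node}
total = sum(per_node.values()) // 3  # each triangle counted at 3 vertices
print(total)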

Related

Model for generating and detecting communities in dense network

I have a complete undirected weighted graph. Think of a graph where people are nodes and an edge (u, v, w) indicates the kind of relationship between u and v with weight w. Here w takes one of three values: 1 (don't know each other, hence the completeness), 2 (acquaintances), or 3 (friends). These relationships naturally form clusters based on the edge weights.
My goal is to define a model that captures this phenomenon and from which I can sample graphs, so I can compare them with the behaviour observed in reality.
So far I've played with stochastic block models (https://graspy.neurodata.io/tutorials/simulations/sbm.html), since there are papers on using these generative models for community-detection tasks. However, I may be overlooking something, since I can't seem to fully represent what I need: g = sbm(list_of_params), where g is complete and there are discernible clusters among the nodes sharing weight 3.
At this point I am not even sure whether the SBM is the best approach for this task.
I am also assuming that everything graph-tool can do, graspy can also do; I read about both at the beginning, and that seemed to be the case.
Summarizing:
Is there a way to generate a stochastic block model in graspy that yields a complete undirected weighted graph?
Is the SBM the best model for the task, or should I be looking at a GMM?
Thanks
Is there a way to generate a stochastic block model in graspy that yields a complete undirected weighted graph?
Yes, but as pointed out in the comments above, that's a strange way to specify the model. If you want to benefit from the deep literature on community detection in social networks, you should not use a complete graph. Do what everyone else does: The presence (or absence) of an edge should indicate a relationship (or lack thereof), and an optional weight on the edge can indicate the strength of the relationship.
To generate graphs from SBM with weights, use this function:
https://graspy.neurodata.io/reference/simulations.html#graspologic.simulations.sbm
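For example, here is a minimal sketch using that function, assuming the n/p/wt/wtargs parameters documented at the link; the community sizes, block probabilities, and weight distribution are invented for illustration:

import numpy as np
from graspologic.simulations import sbm

n = [50, 50]                # two communities of 50 nodes each
p = [[0.9, 0.2],
     [0.2, 0.9]]            # block connection probabilities
# Poisson-distributed weights on the edges that are present.
A = sbm(n=n, p=p, wt=np.random.poisson, wtargs=dict(lam=3))
# Setting every entry of p to 1 would force the complete graph the
# question asks about, though the advice above is not to do that.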
I am also assuming that everything that graph-tool can do, graspy can also do.
This is not true. There are (at least) two different popular methods for inferring the parameters of an SBM. Unfortunately, the practitioners of each method seem to avoid citing each other in their papers and code.
graph-tool uses an MCMC statistical inference approach to find the optimal graph partitioning.
graspologic (formerly graspy) uses a trick related to spectral clustering to find the partitioning.
From what I can tell, the graph-tool approach offers more straightforward and principled model selection methods. It also has useful extensions, such as overlapping communities, nested (hierarchical) communities, layered graphs, and more.
I'm not as familiar with the graspologic (spectral) methods, but -- to me -- they seem more difficult to extend beyond merely seeking a point estimate for the ideal community partitioning. You should take my opinion with a hefty bit of skepticism, though. I'm not really an expert in this space.

"Transitivity works on simple graphs only" InternalError in igraph

I'm analysing a big graph, 30M nodes and 350M+ edges, using the Python interface of igraph. I can load the edges without any issue, but executing a function like transitivity_local_undirected to compute the clustering coefficient of each node returns the error "Transitivity works on simple graphs only, Invalid value".
I can't find anything online - any help would be much appreciated, thanks!
A simple graph is a graph with no self-loops or multiple edges; it sounds like igraph thinks your graph is non-simple for some reason.
Are you sure your graph has no self-loops or multiple edges between nodes?
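If that is the case, a sketch along these lines should confirm and fix it (the toy edge list just demonstrates a duplicate edge):

import igraph as ig

# Toy graph with a duplicated edge, so it is not simple.
g = ig.Graph(edges=[(0, 1), (0, 1), (1, 2), (2, 0)])
print(g.is_simple())                     # False
g.simplify(multiple=True, loops=True)    # collapse multi-edges, drop loops
cc = g.transitivity_local_undirected()   # now works
print(cc)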

Networkx shortest tree algorithm

I have an undirected weighted graph G with a set of nodes and weighted edges.
I want to know if there is a method implemented in networkx to find a minimum spanning tree in a graph between given nodes (e.g. nx.steiner_tree(G, ['Berlin', 'Kiel', 'Munster', 'Nurnberg'])) (apparently there is none?).
I don't have the reputation points to post images. A link to a similar image: Map (A3, C1, C5, E4).
What I'm thinking:
compute Dijkstra's shortest paths between all pairs of destination nodes;
put all the nodes (intermediate and destination) and edges into a new graph V;
compute the MST on V (to remove cycles by breaking the longest edges).
Maybe there are better ways (correctness- and computation-wise)? This approach does pretty badly with three destination nodes and gets better as the number of destinations grows. (A sketch of this plan appears below.)
P.S. My graph is planar (it can be drawn on paper so that edges do not intersect). So maybe some kind of spring/force algorithm (as in d3 visualisations) could help?
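A minimal sketch of the plan above, assuming edges carry a 'weight' attribute (the helper name steiner_approx is made up):

import itertools
import networkx as nx

def steiner_approx(G, terminals):
    # Union the pairwise shortest paths between terminals,
    # then prune cycles with an MST, as described above.
    V = nx.Graph()
    for u, v in itertools.combinations(terminals, 2):
        path = nx.dijkstra_path(G, u, v, weight="weight")
        for a, b in zip(path, path[1:]):
            V.add_edge(a, b, weight=G[a][b]["weight"])
    return nx.minimum_spanning_tree(V)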
As I understand your question, you're trying to find the lowest-weight connected subgraph that contains a given set of nodes. This is the Steiner tree problem in graphs, which is NP-complete, so you're probably best off using a heuristic suited to the specific case you are studying.
For two nodes, the approach is Dijkstra's algorithm; it's fastest if you expand around both nodes until the two shells intersect. For three nodes, I suspect a version of Dijkstra's algorithm that expands around each of the nodes will give good results, but I don't see how to guarantee you're getting the best answer.
I've found another question about this, which has several decent answers (the posted question had an unweighted graph, so it's different, but the algorithms given in the answers are appropriate for weighted graphs). There are some good ones beyond just the accepted answer.
networkx has a standard Kruskal algorithm implemented, taking an undirected weighted graph as input; the function is called minimum_spanning_tree.
I propose you build a subgraph containing the nodes you need and then run Kruskal's algorithm on it.
import networkx as nx
H = G.subgraph(['Berlin', 'Kiel', 'Konstanz'])  # induced subgraph on the chosen nodes
MST = nx.minimum_spanning_tree(H)
As pointed out already, this is the Steiner tree problem in graphs.
There is a Steiner tree algorithm in networkx:
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.approximation.steinertree.steiner_tree.html
However, it only gives you an approximate solution, and it is also rather slow. For state-of-the-art solvers, see the section "External Links" under:
https://en.wikipedia.org/wiki/Steiner_tree_problem
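A minimal sketch of calling that approximation (the toy city graph and its weights are invented for illustration):

import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.Graph()
G.add_weighted_edges_from([
    ("Berlin", "Kiel", 4), ("Kiel", "Munster", 3),
    ("Munster", "Nurnberg", 5), ("Berlin", "Nurnberg", 6),
    ("Berlin", "Munster", 2),
])
# Approximate lowest-weight tree connecting the four terminals.
T = steiner_tree(G, ["Berlin", "Kiel", "Munster", "Nurnberg"], weight="weight")
print(sorted(T.edges()))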

Graphing a very large graph in networkx

I am attempting to draw a very large networkx graph with approximately 5000 nodes and 100000 edges. It represents the road network of a large city. I cannot tell whether the computer is hanging or whether it simply takes forever. The line of code it seems to be hanging on is the following:
# a is my network
pos = networkx.spring_layout(a)
Is there perhaps a better method for plotting such a large network?
Here is the good news: it wasn't broken. It was working, and you wouldn't want to wait for the result even if you could.
Check out my answer to this question to see what your end result would look like.
Drawing massive networkx graph: Array too big
I think the spring layout is an n^3 algorithm, which would take about 125,000,000,000 calculations to get the positions for your graph. Your best bet is to choose a different layout type or supply the positions yourself.
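For a road network you often already have coordinates, in which case you can skip the layout computation entirely. A sketch, where coords is a hypothetical dict from your road data:

import networkx as nx
import matplotlib.pyplot as plt

# 'a' is the question's graph; 'coords' is a hypothetical dict
# mapping each node to its real-world (x, y) position.
pos = {node: coords[node] for node in a.nodes()}
nx.draw(a, pos=pos, node_size=2, width=0.2, with_labels=False)
plt.savefig("road_network.png", dpi=300)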
Another alternative is pulling out the relevant positions yourself using a tool called Gephi.
As Aric said, if you know the locations, that's probably the best option.
If instead you just know distances, but don't have locations to plug in, there's a calculation you can do that will reproduce locations pretty well (up to a rotation): do a principal component analysis of the distances and project into 2 dimensions, and it will probably do a very good job of estimating the geographic locations. (It was an example I saw in a linear algebra class once.)
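That calculation is classical multidimensional scaling; a minimal numpy sketch, assuming D is a symmetric matrix of pairwise distances:

import numpy as np

def mds_layout(D, dim=2):
    # Double-center the squared distance matrix, then take the top
    # eigenpairs: recovers coordinates up to rotation/reflection.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))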

What algorithms can I use to make inferences from a graph?

I edited the question to make it a bit more specific.
I'm not trying to base it on the content of the nodes, but solely on the structure of the directed graph.
For example, PageRank (at first) used only the link structure (a directed graph) to infer what was more relevant. I'm not totally sure, but I think Elo (the chess ranking system) does something similar to rank players (although it also incorporates scores).
I'm using Python's networkx package, but right now I just want to understand any algorithms that accomplish this.
Thanks!
Eigenvector centrality is a network metric that models the probability that a node will be encountered on a random walk. It factors in not only the number of edges a node has, but also the number of edges its neighbours have, the edges of their neighbours, and so on. It can be implemented with a random walk, which is how Google's PageRank algorithm works.
That said, the field of network analysis is broad and continues to develop with new and interesting research. The way you ask the question suggests you might have a different impression. Perhaps start by looking over the three links I included here, see if that gets you started, and then follow up with more specific questions.
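Both measures are one call in networkx; a small sketch on a toy directed graph:

import networkx as nx

G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
pr = nx.pagerank(G, alpha=0.85)                   # random-surfer scores
ec = nx.eigenvector_centrality(G, max_iter=1000)  # same structure, for comparison
print(pr)
print(ec)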
You should probably take a look at Markov random fields and conditional random fields. Perhaps the closest thing to what you're describing is a Bayesian network.
