I have a complete undirected weighted graph. Think of a graph where people are nodes and the edge (u, v, w) indicates the kind of relationship between u and v with weight w. w can take one of three values: 1 (they don't know each other - hence the completeness), 2 (acquaintances), or 3 (friends). These kinds of relationships naturally form clusters based on the edge weight.
My goal is to define a model that captures this phenomenon, so that I can sample graphs from it and compare them with the behaviour observed in reality.
So far I've played with stochastic block models (https://graspy.neurodata.io/tutorials/simulations/sbm.html), since there are papers about using these generative models for community-detection tasks. However, I may be overlooking something, because I can't seem to fully represent what I need: g = sbm(list_of_params), where g is complete and there are some discernible clusters among nodes sharing weight 3.
At this point I am not even sure whether an SBM is the best approach for this task.
I am also assuming that everything graph-tool can do, graspy can also do; from what I read about both at the outset, that seemed to be the case.
Summarizing:
Is there a way to generate a stochastic block model in graspy that yields a complete undirected weighted graph?
Is SBM the best model for the task, or should I be looking at a GMM?
Thanks
Is there a way to generate a stochastic block model in graspy that yields a complete undirected weighted graph?
Yes, but as pointed out in the comments above, that's a strange way to specify the model. If you want to benefit from the deep literature on community detection in social networks, you should not use a complete graph. Do what everyone else does: The presence (or absence) of an edge should indicate a relationship (or lack thereof), and an optional weight on the edge can indicate the strength of the relationship.
To generate weighted graphs from an SBM, use this function:
https://graspy.neurodata.io/reference/simulations.html#graspologic.simulations.sbm
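For what it's worth, here is a minimal sketch of how that could look, assuming graspologic's sbm accepts per-block weight functions via wt and wtargs as its documentation describes; the block sizes and weight distributions below are made up for illustration:

import numpy as np
from graspologic.simulations import sbm

n = [50, 50]                      # two communities of 50 nodes each
p = [[1.0, 1.0], [1.0, 1.0]]      # edge probability 1 everywhere -> complete graph

# One weight function (and its arguments) per pair of blocks: within-block
# edges tend to get weight 3 (friends), between-block edges weight 1
# (don't know each other), so clusters show up in the weights, not the topology.
wt = [[np.random.choice, np.random.choice],
      [np.random.choice, np.random.choice]]
wtargs = [[dict(a=[2, 3], p=[0.2, 0.8]), dict(a=[1, 2], p=[0.8, 0.2])],
          [dict(a=[1, 2], p=[0.8, 0.2]), dict(a=[2, 3], p=[0.2, 0.8])]]

A = sbm(n=n, p=p, wt=wt, wtargs=wtargs)   # weighted, symmetric adjacency matrix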
I am also assuming that everything that graph-tool can do, graspy can also do.
This is not true. There are (at least) two different popular methods for inferring the parameters of an SBM. Unfortunately, the practitioners of each method seem to avoid citing each other in their papers and code.
graph-tool uses an MCMC statistical inference approach to find the optimal graph partitioning.
graspologic (formerly graspy) uses a trick related to spectral clustering to find the partitioning.
From what I can tell, the graph-tool approach offers more straightforward and principled model selection methods. It also has useful extensions, such as overlapping communities, nested (hierarchical) communities, layered graphs, and more.
I'm not as familiar with the graspologic (spectral) methods, but -- to me -- they seem more difficult to extend beyond merely seeking a point estimate for the ideal community partitioning. You should take my opinion with a hefty bit of skepticism, though. I'm not really an expert in this space.
Is it possible to do clustering without providing any input apart from the data itself? The clustering method/algorithm should decide from the data how many logical groups the data can be divided into. It shouldn't even require me to input a threshold Euclidean distance on which the clusters are built; that also needs to be learned from the data.
Could you please suggest the closest solution to my problem?
Why not code your algorithm to build clusterings ranging from 1 to n clusters (where n could be defined in a config file, so that you avoid hard-coding and only have to set it once)?
Once that is done, compute the clusterings for 1 to n clusters and choose the value at which the mean squared error stops improving appreciably (the raw error shrinks monotonically as the number of clusters grows, so simply taking the smallest error would always pick n).
This would require some additional work by your machine to determine the optimal number of logical groups the data can be divided into (bounded between 1 and n).
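A minimal sketch of that loop, assuming scikit-learn's KMeans and a plain feature array X; the silhouette score is used as the quality measure here, since the raw within-cluster error always shrinks as the number of clusters grows:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)      # placeholder data; replace with your own
max_k = 10                      # the configurable upper bound "n"

scores = {}
for k in range(2, max_k + 1):   # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("number of clusters chosen from the data:", best_k)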
Clustering is an explorative technique.
This means it must always be able to produce different results, as desired by the user. Having many parameters is a feature. It means the method can be adapted easily to very different data, and to user preferences.
There will never be a generally useful parameter-free technique. At best, some parameters will have default values or heuristics (such as Euclidean distance, standardizing the input prior to clustering, or the gap statistic for choosing k) that may give a reasonable first try in 80% of cases. But after that first try, you'll need to understand the data and try other parameters to learn more about it.
Methods that claim to be "parameter free" usually just have some hidden parameters set so that they work on the few toy examples they were demonstrated on.
I am attempting to draw a very large networkx graph that has approximately 5000 nodes and 100000 edges. It represents the road network of a large city. I cannot determine if the computer is hanging or if it simply takes forever. The line of code that it seems to be hanging on is the following:
# a is my network
pos = networkx.spring_layout(a)
Is there perhaps a better method for plotting such a large network?
Here is the good news: it wasn't broken. It was working, and you wouldn't want to wait for it even if you could.
Check out my answer to this question to see what your end result would look like.
Drawing massive networkx graph: Array too big
I think the spring layout is roughly an n^3 algorithm, which for 5000 nodes would take on the order of 125,000,000,000 calculations to get the positions for your graph. The best thing for you is to choose a different layout type or plot the positions yourself.
So another alternative is pulling out the relevant points yourself using a tool called gephi.
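For illustration, a rough sketch of those alternatives on a synthetic graph of the same size as the one in the question (a real road network presumably already carries coordinates you could plug in directly):

import networkx
import matplotlib.pyplot as plt

a = networkx.gnm_random_graph(5000, 100000)   # stand-in for the road network

pos = networkx.random_layout(a)               # O(n), no force simulation
# If the nodes carry real coordinates, skip the layout entirely, e.g.:
# pos = {node: (data["x"], data["y"]) for node, data in a.nodes(data=True)}

networkx.draw(a, pos=pos, node_size=2, width=0.1, with_labels=False)
plt.savefig("network.png", dpi=200)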
As Aric said, if you know the locations, that's probably the best option.
If instead you just know distances, but don't have locations to plug in, there's some calculation you can do that will reproduce locations pretty well (up to a rotation). If you do a principal component analysis of the distances and project into 2 dimensions, it will probably do a very good job estimating the geographic locations. (It was an example I saw in a linear algebra class once)
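Here is a hedged sketch of that calculation (classical multidimensional scaling, which amounts to PCA on the double-centred squared distances), assuming you already have an all-pairs distance matrix D as a NumPy array:

import numpy as np

def classical_mds(D, dim=2):
    # Recover coordinates (up to rotation/reflection) from a distance matrix.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]         # keep the top `dim` eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Example use: positions keyed by node, given node_list and D in the same order
# pos = dict(zip(node_list, classical_mds(D)))
# networkx.draw(a, pos=pos, node_size=2)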
Edited question to make it a bit more specific.
Not trying to base it on the content of nodes, but solely on the structure of the directed graph.
For example, PageRank (at first) used solely the link structure (a directed graph) to make inferences about what was more relevant. I'm not totally sure, but I think Elo (the chess rating system) does something similar to rank players (although it adds scores as well).
I'm using python's networkx package but right now I just want to understand any algorithms that accomplish this.
Thanks!
Eigenvector centrality is a network metric that can be used to model the probability that a node will be encountered in a random walk. It factors in not only the number of edges a node has, but also the number of edges of its neighbours, and of their neighbours, and so on outward through the network. It can be implemented with a random walk, which is how Google's PageRank algorithm works.
That said, the field of network analysis is broad and continues to develop with new and interesting research. The way you ask the question implies that you might have a different impression. Perhaps start by looking over the three links I included here and see if that gets you started and then follow up with more specific questions.
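As a concrete starting point, here is a small sketch in networkx (which you already use); both metrics rank nodes purely from the directed link structure, and the graph below is a made-up stand-in:

import networkx as nx

G = nx.gnp_random_graph(50, 0.1, directed=True, seed=1)    # stand-in directed graph

pr = nx.pagerank(G, alpha=0.85)                    # random-surfer style ranking
ec = nx.eigenvector_centrality(G, max_iter=1000)   # may not converge on every digraph

print("top 5 by PageRank:", sorted(pr, key=pr.get, reverse=True)[:5])
print("top 5 by eigenvector centrality:", sorted(ec, key=ec.get, reverse=True)[:5])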
You should probably take a look at Markov random fields and conditional random fields. Perhaps the closest thing to what you're describing is a Bayesian network.
I need to be able to manipulate a large (10^7 nodes) graph in python. The data corresponding to each node/edge is minimal, say, a small number of strings. What is the most efficient, in terms of memory and speed, way of doing this?
A dict of dicts is more flexible and simpler to implement, but I intuitively expect a list of lists to be faster. The list option would also require that I keep the data separate from the structure, while dicts would allow for something of the sort:
graph[I][J]["Property"]="value"
What would you suggest?
Yes, I should have been a bit clearer on what I mean by efficiency. In this particular case I mean it in terms of random access retrieval.
Loading the data in to memory isn't a huge problem. That's done once and for all. The time consuming part is visiting the nodes so I can extract the information and measure the metrics I'm interested in.
I hadn't considered making each node a class (properties are the same for all nodes) but it seems like that would add an extra layer of overhead? I was hoping someone would have some direct experience with a similar case that they could share. After all, graphs are one of the most common abstractions in CS.
I would strongly advocate that you look at NetworkX. It's a battle-tested workhorse and the first tool most 'research' types reach for when they need to analyse network-based data. I have manipulated graphs with hundreds of thousands of edges without problems on a notebook. It's feature-rich and very easy to use. You will find yourself focusing more on the problem at hand rather than on the details of the underlying implementation.
Example of Erdős-Rényi random graph generation and analysis
"""
Create an G{n,m} random graph with n nodes and m edges
and report some properties.
This graph is sometimes called the Erd##[m~Qs-Rényi graph
but is different from G{n,p} or binomial_graph which is also
sometimes called the Erd##[m~Qs-Rényi graph.
"""
__author__ = """Aric Hagberg (hagberg#lanl.gov)"""
__credits__ = """"""
# Copyright (C) 2004-2006 by
# Aric Hagberg
# Dan Schult
# Pieter Swart
# Distributed under the terms of the GNU Lesser General Public License
# http://www.gnu.org/copyleft/lesser.html
from networkx import *
import sys
n=10 # 10 nodes
m=20 # 20 edges
G=gnm_random_graph(n,m)
# some properties
print "node degree clustering"
for v in nodes(G):
print v,degree(G,v),clustering(G,v)
# print the adjacency list to terminal
write_adjlist(G,sys.stdout)
Visualizations are also straightforward:
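A minimal sketch of such a plot, assuming matplotlib is installed (the wildcard import above brings draw into scope):

import matplotlib.pyplot as plt

draw(G, with_labels=True, node_color="lightblue")   # draw the graph G from above
plt.savefig("random_graph.png")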
More visualization: http://jonschull.blogspot.com/2008/08/graph-visualization.html
Even though this question is now quite old, I think it is worthwhile to mention my own Python module for graph manipulation, called graph-tool. It is very efficient, since the data structures and algorithms are implemented in C++ with template metaprogramming, using the Boost Graph Library. Therefore its performance (both in memory usage and runtime) is comparable to that of a pure C++ library, and can be orders of magnitude better than typical Python code, without sacrificing ease of use. I use it myself constantly to work with very large graphs.
As already mentioned, NetworkX is very good, with another option being igraph. Both modules have most (if not all) of the analysis tools you're likely to need, and both libraries are routinely used with large networks.
A dictionary may also carry overhead, depending on the actual implementation. A hash table usually allocates some prime number of slots to begin with, even though you might only use a couple of them.
Judging by your example, "Property", would you be better off with a class approach for the final level, with real properties? Or do the names of the properties change a lot from node to node?
I'd say that what "efficient" means depends on a lot of things, like:
speed of updates (insert, update, delete)
speed of random access retrieval
speed of sequential retrieval
memory used
I think you'll find that a data structure that is speedy will generally consume more memory than one that is slow. This isn't always the case, but most data structures seem to follow this pattern.
A dictionary might be easy to use and give you relatively uniform, fast access, but it will most likely use more memory than, as you suggest, lists. Lists, however, tend to incur more overhead when you insert data into them, unless they preallocate X nodes, in which case they will again use more memory.
My suggestion, in general, would be to just use the method that seems the most natural to you, and then do a "stress test" of the system, adding a substantial amount of data to it and see if it becomes a problem.
You might also consider adding a layer of abstraction to your system, so that you don't have to change the programming interface if you later on need to change the internal data structure.
As I understand it, random access is constant-time for both Python's dicts and lists; the difference is that with lists you can only do random access by integer index. I'm assuming you need to look up a node by its label, so you want a dict of dicts.
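A tiny sketch of that structure, with made-up labels, just to make the shape concrete:

from collections import defaultdict

graph = defaultdict(dict)                  # node label -> {neighbour label -> edge data}
graph["A"]["B"] = {"Property": "value"}    # i.e. graph[I][J]["Property"] = "value"
graph["B"]["A"] = graph["A"]["B"]          # mirror the entry for an undirected graph

print(graph["A"]["B"]["Property"])         # constant-time lookup by label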
However, on the performance front, loading it into memory may not be a problem, but if you use too much you'll end up swapping to disk, which will kill the performance of even Python's highly efficient dicts. Try to keep memory usage down as much as possible. Also, RAM is amazingly cheap right now; if you do this kind of thing a lot, there's no reason not to have at least 4GB.
If you'd like advice on keeping memory usage down, give some more information about the kind of information you're tracking for each node.
Making a class-based structure would probably add more overhead than the dict-based structure, since Python classes are themselves implemented using dicts (each instance stores its attributes in a dict).
No doubt NetworkX is the best data structure so far for graphs. It comes with utilities like helper functions, data structures and algorithms, random sequence generators, decorators, Cuthill-McKee ordering, and context managers.
NetworkX is great because it works for graphs, digraphs, and multigraphs. It can read and write graphs in multiple formats: adjacency list, multiline adjacency list,
edge list, GEXF, GML. It also works with Pickle, GraphML, JSON, SparseGraph6, etc.
It has implementations of various ready-made algorithms, including:
Approximation, Bipartite, Boundary, Centrality, Clique, Clustering, Coloring, Components, Connectivity, Cycles, Directed Acyclic Graphs,
Distance Measures, Dominating Sets, Eulerian, Isomorphism, Link Analysis, Link Prediction, Matching, Minimum Spanning Tree, Rich Club, Shortest Paths, Traversal, Tree.