Python: Graph a Network from Data - python

I have the following dataset:
firm_id_1 firm_id_2
1 2
1 4
1 5
2 1
2 3
3 2
3 6
4 1
4 5
4 6
5 4
5 7
6 3 ....
I would like to graph the network of firm_id = 1. In other words, I want to see a graph that shows that firm_id = 1 is directly connected to 2, 4 and 5, and indirectly connected to 3 via firm 2, to 6 via firm 4, and to 7 via firm 5. That is, I want to graph the shortest path to each node (firm_id) starting from firm_id = 1. There are 3000 nodes in my data and I know that firm 1 reaches all nodes in fewer than 9 vertices. How can I graph this in Python?

I would start with a library called NetworkX. I'm not sure I understand everything that you are looking for, but I think this should be close enough for you to modify it.
This program will load the data in from a text file graphdata.txt, split by whitespace, and add the pair as an edge.
It will then calculate the shortest paths to all nodes from 1, and then print if the distance is larger than 9... see the documentation for more details.
Lastly, it will render the graph using a spring layout to a file called mynetwork.png and to the screen.
Some optimization may / may not be needed for 3000 nodes.
Hope this helps!
import networkx as nx
import matplotlib.pyplot as plt
graph = nx.Graph()
with open('graphdata.txt') as f:
    for line in f:
        firm_id_1, firm_id_2 = line.split()
        graph.add_edge(firm_id_1, firm_id_2)
# Maps every reachable node to the list of nodes on its shortest path from "1"
paths_from_1 = nx.shortest_path(graph, "1")
for node in paths_from_1:
    # len() counts the nodes on the path, so the number of edges is one less
    if len(paths_from_1[node]) > 9:
        print("Shortest path from 1 to", node, "is longer than 9")
pos = nx.spring_layout(graph, iterations=200)
nx.draw(graph, pos)
plt.savefig("mynetwork.png")
plt.show()
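If only the shortest-path structure around firm 1 is wanted rather than the whole network, something like the following might be a starting point. It is a minimal sketch, assuming an undirected, unweighted graph, with the sample edges from the question typed in by hand instead of being read from graphdata.txt. nx.bfs_tree keeps one shortest route from firm 1 to every reachable firm, and nx.single_source_shortest_path_length gives the number of hops.

```python
import networkx as nx

# Sample edges copied from the question; a real run would read them from the file.
edges = [(1, 2), (1, 4), (1, 5), (2, 1), (2, 3), (3, 2), (3, 6),
         (4, 1), (4, 5), (4, 6), (5, 4), (5, 7), (6, 3)]
graph = nx.Graph(edges)

# Number of hops from firm 1 to every reachable firm.
distances = nx.single_source_shortest_path_length(graph, 1)

# Directed tree that keeps one shortest route from firm 1 to each firm.
tree = nx.bfs_tree(graph, 1)

for firm, hops in sorted(distances.items()):
    print(firm, hops)
```

The tree can be passed to nx.draw in place of the full graph, which should be much less cluttered for 3000 nodes.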

You can try the python-graph package. I am not sure about its scalability, but you can do something like...
from pygraph.classes.digraph import digraph
from pygraph.algorithms.minmax import shortest_path
gr = digraph()
gr.add_nodes(range(1, num_nodes))
for i in range(num_edges):
    # edge_start and edge_end are placeholders for the i-th pair in your data
    gr.add_edge((edge_start, edge_end))
# shortest path from the node 1 to all others
shortest_path(gr, 1)

Related

Ranking how direct spaCy dependencies are on tree

I have a SpaCy dependency tree made by this code:
from spacy import displacy
text = "We could say to them that if in fact that's all there is, then we could, Oh, we can do something."
print(displacy.render(nlp(text), style='dep', jupyter = True, options = {'distance': 120}))
That prints out this:
SpaCy determines that this entire string is connected in a dependency tree. What I am trying to figure out is how to discern how direct or indirect the connection is between a word and the next word. For example, looking at the first 3 words:
'We' is connected to the next word 'could', because it is directly connected to 'say', which is directly connected to 'could'. Therefore, it is 2 connection points away from the next word.
'could' is directly connected to 'say'. Therefore, it is 1 connection point away from the next word.
and so on.
Essentially, I want to make a df that would look like this:
word connection_points_to_next_word
We 2
could 1
say 1
...
I'm not sure how to achieve this. As SpaCy makes this graph, I'm sure there is some efficient way to calculate the number of vertices required to connect adjacent nodes, but all of SpaCy's tools I've found, such as:
token.lefts
token.rights
token.subtree
token.children
more here https://spacy.io/api/token
include connection information, but not how direct this connection is. Any ideas on how to approach this problem?
Using the networkx library, we can build an undirected graph from the edgelist of token-children relationships. I am using the index of the token in the document as a unique identifier so that repeat words are treated as separate nodes.
import spacy
import networkx as nx
nlp= spacy.load('en_core_web_lg')
text = "We could say to them that if in fact that's all there is, then we could, Oh, we can do something."
doc = nlp(text)
edges = []
for tok in doc:
    edges.extend([(tok.i, child.i) for child in tok.children])
# Build the undirected graph that the shortest-path query below runs on
graph = nx.Graph(edges)
The shortest path between neighboring tokens can be calculated as below:
for idx, _ in enumerate(doc):
    if idx < len(doc)-1:
        print(doc[idx], doc[idx+1], nx.shortest_path_length(graph,source=idx, target=idx+1))
Output:
We could 2
could say 1
say to 1
to them 1
them that 4
that if 3
if in 2
in fact 1
fact that 3
that 's 1
's all 1
all there 2
there is 1
is , 4
, then 2
then we 2
we could 2
could , 2
, Oh 2
Oh , 2
, we 2
we can 2
can do 1
do something 1
something . 3
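To get from this printout to the table in the question, the same calculation can be wrapped in a function that returns rows. This is only a sketch: adjacent_distances is a made-up helper name, and the (head, child) index pairs below are typed in by hand for the first five words rather than taken from spaCy, so the example runs without a language model.

```python
import networkx as nx

def adjacent_distances(words, edges):
    """Return (word, hops to the next word) for every word except the last."""
    graph = nx.Graph(edges)
    graph.add_nodes_from(range(len(words)))  # keep words that have no edge
    return [
        (words[i], nx.shortest_path_length(graph, source=i, target=i + 1))
        for i in range(len(words) - 1)
    ]

# Hypothetical (head index, child index) pairs for "We could say to them".
words = ["We", "could", "say", "to", "them"]
edges = [(2, 0), (2, 1), (2, 3), (3, 4)]

rows = adjacent_distances(words, edges)
print(rows)
```

With real data, words would be [tok.text for tok in doc] and edges the list built above; the rows can then be handed to pandas.DataFrame with the two column names from the question.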

How to find all connected subgraph of a graph in networkx?

I'm developing a Python application, and I want to list all possible connected subgraphs of any size, starting from every node, using NetworkX.
I tried using combinations() from the itertools library to find all possible combinations of nodes, but it is far too slow because it also checks node sets that are not connected:
for r in range(1, NumberOfNodes):  # start at 1: nx.is_connected raises on an empty graph
    for SG in (G.subgraph(s) for s in combinations(G, r)):
        if nx.is_connected(SG):
            nx.draw(SG, with_labels=True)
            plt.show()
The actual output is correct. But I need a faster way to do this, because the combinations of nodes for a graph of 50 nodes with 8 as LenghtTupleToFind number up to 1 billion (n! / r! / (n-r)!), and only a small fraction of them are connected subgraphs, which are the ones I am interested in. So, is there a function that does this?
Sorry for my english, thank you in advance
EDIT:
As an example:
so, the results i would like to have:
[0]
[0,1]
[0,2]
[0,3]
[0,1,4]
[0,2,5]
[0,2,5,4]
[0,1,4,5]
[0,1,2,4,5]
[0,1,2,3]
[0,1,2,3,5]
[0,1,2,3,4]
[0,1,2,3,4,5]
[0,3,2]
[0,3,1]
[0,3,2]
[0,1,4,2]
and all combination that generates a connected graph
I had the same requirements and ended up using this code, which is very close to what you were doing. This code yields the output you asked for.
import networkx as nx
import itertools
G = you_graph
all_connected_subgraphs = []
# here we ask for all connected subgraphs that have at least 2 nodes AND have less nodes than the input graph
for nb_nodes in range(2, G.number_of_nodes()):
    for SG in (G.subgraph(selected_nodes) for selected_nodes in itertools.combinations(G, nb_nodes)):
        if nx.is_connected(SG):
            print(SG.nodes)
            all_connected_subgraphs.append(SG)
I have modified Charly Empereur-mot's answer by using ego graph to make it faster:
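import networkx as nx
import itertools
G = you_graph.copy()
nb_nodes = 3  # size of the connected subgraphs to list (example value)
all_connected_subgraphs = []
# here we ask for all connected subgraphs that have nb_nodes
for n in you_graph.nodes():
    # every connected subgraph of nb_nodes nodes that contains n lies within this radius
    egoG = nx.generators.ego_graph(G, n, radius=nb_nodes-1)
    others = [m for m in egoG if m != n]  # n itself is added back below
    for SG in (G.subgraph(sn + (n,)) for sn in itertools.combinations(others, nb_nodes-1)):
        if nx.is_connected(SG):
            # copy, because SG is a view of G and G is modified below
            all_connected_subgraphs.append(SG.copy())
    # n is fully handled, so remove it to avoid listing its subgraphs again
    G.remove_node(n)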
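Another way to avoid testing disconnected combinations at all is to grow connected node sets one neighbour at a time. The following is a minimal sketch of that idea, not a NetworkX function: connected_node_sets is a made-up name, and it keeps every set of the current size in memory, which may itself become the bottleneck on dense graphs.

```python
import networkx as nx

def connected_node_sets(G, max_size):
    """Yield every node set of size <= max_size that induces a connected subgraph."""
    current = {frozenset([n]) for n in G}
    for size in range(1, max_size + 1):
        yield from current
        if size == max_size:
            break
        # Grow each connected set by one adjacent node; the set removes repeats.
        current = {
            nodes | {nbr}
            for nodes in current
            for n in nodes
            for nbr in G[n]
            if nbr not in nodes
        }

# Usage on a small path graph 0-1-2-3.
for nodes in connected_node_sets(nx.path_graph(4), 3):
    print(sorted(nodes))
```

Every connected set with more than one node has at least one node that can be removed without disconnecting it, so growing by single neighbours should reach all of them. G.subgraph(nodes) turns any yielded set back into a graph.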
You might want to look into connected_components function. It will return you all connected nodes, which you can then filter by size and node.
You can find all the connected components in O(n + m) time and O(n) memory for a graph with n nodes and m edges. Keep a seen boolean array, and run Depth First Search (DFS) or Breadth First Search (BFS) to find the connected components.
In my code, I used DFS to find the connected components.
seen = [False] * num_nodes
def dfs(node):
    component.append(node)
    seen[node] = True
    for neigh in G.neighbors(node):
        if not seen[neigh]:
            dfs(neigh)
all_subgraphs = []
# Assuming nodes are numbered 0, 1, ..., num_nodes - 1
for node in range(num_nodes):
    if seen[node]:
        continue  # already part of a component found earlier
    component = []
    dfs(node)
    # Here `component` contains nodes in a connected component of G
    plot_graph(component) # or do anything
    all_subgraphs.append(component)
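For comparison, the built-in connected_components function mentioned above does the same job without hand-written DFS. A small sketch on a made-up graph (the edges are invented for illustration):

```python
import networkx as nx

# Invented example: two components plus one isolated node.
G = nx.Graph([(0, 1), (1, 2), (3, 4)])
G.add_node(5)

# connected_components yields one set of nodes per component.
components = [sorted(c) for c in nx.connected_components(G)]

# Filter by size, e.g. keep only components with at least two nodes.
large = [c for c in components if len(c) >= 2]

print(components)
print(large)
```

Note that this lists the maximal connected pieces only, not every connected subgraph inside them.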

Writing graph object to dimacs file format

I have created a graph object using the networkx library with the following code.
import networkx as nx
#Convert snap dataset to graph object
g = nx.read_edgelist('com-amazon.ungraph.txt',create_using=nx.Graph(),nodetype = int)
print(g)  # nx.info(g) was removed in NetworkX 3.0; print(g) shows a similar summary
However I need to write the graph object to a dimacs file format which I believe networkx's functions do not include. Is there a way to do so?
The specification described on http://prolland.free.fr/works/research/dsat/dimacs.html is pretty simple, so you can just do something like this:
import networkx as nx
g = nx.house_x_graph() # stand-in graph since we don't have your data
dimacs_filename = "mygraph.dimacs"
with open(dimacs_filename, "w") as f:
    # write the header
    f.write("p EDGE {} {}\n".format(g.number_of_nodes(), g.number_of_edges()))
    # now write all edges
    for u, v in g.edges():
        f.write("e {} {}\n".format(u, v))
This generates the file "mygraph.dimacs":
p EDGE 5 8
e 0 1
e 0 2
e 0 3
e 1 2
e 1 3
e 2 3
e 2 4
e 3 4
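One thing to check before feeding this to a solver: some DIMACS readers expect vertices to be numbered from 1, while the output above starts at 0. If that applies to your tool (an assumption; the reader's documentation is the authority), a variant that renumbers the nodes could look like the sketch below. write_dimacs is a made-up helper name, and it writes to any file-like object so it can be tried in memory.

```python
import io
import networkx as nx

def write_dimacs(g, stream):
    """Write g as 'p EDGE n m' followed by 'e u v' lines, numbering nodes from 1."""
    # Map each node label to a consecutive integer starting at 1.
    index = {node: i for i, node in enumerate(g.nodes(), start=1)}
    stream.write("p EDGE {} {}\n".format(g.number_of_nodes(), g.number_of_edges()))
    for u, v in g.edges():
        stream.write("e {} {}\n".format(index[u], index[v]))

buffer = io.StringIO()
write_dimacs(nx.house_x_graph(), buffer)
print(buffer.getvalue())
```

The same mapping also handles graphs whose node labels are not integers at all.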

Parse a file to create a graph in python

I have a file in a format like this (the real file is bigger):
13 16 1
11 17 1
8 18 -1
11 19 1
11 20 -1
11 21 1
11 22 1
The first column is the starting vertex, the second column is the ending vertex and the third is the weight between the starting and ending vertex.
I tried to create a graph with networkx but I'm getting this error:
"Edge tuple %s must be a 2-tuple or 3-tuple." % (e,))
Here is my code:
import networkx as nx
file = open("network.txt","r")
lines = file.readlines()
start_vertex = []
end_vertex = []
sign = []
for x in lines:
    start_vertex.append(x.split('\t')[0])
    end_vertex.append(x.split('\t')[1])
    sign.append(x.split('\t')[2])
file.close()
G = nx.Graph()
for i in lines:
    G.add_nodes_from(start_vertex)
    G.add_nodes_from(end_vertex)
    G.add_edges_from([start_vertex, end_vertex, sign])
You should use networkx's read_edgelist command.
G=nx.read_edgelist('network.txt', delimiter = '  ', nodetype = int, data = (('weight', int),))
Notice that the delimiter I'm using is two spaces, because this appears to be what you've used in your input file.
If you want to stick to your code:
First, get rid of for i in lines.
The reason for your error is twofold. First, you want to use G.add_weighted_edges_from rather than G.add_edges_from.
Second, this expects a list (or similar object) whose entries are of the form (u,v,weight). So for example, G.add_weighted_edges_from([(13,16,1), (11,17,1)]) would add your first two edges. Instead, it sees the command G.add_weighted_edges_from([[13,11,8,11,...],[16,17,18,19,...],[1,1,-1,1,...]]) and thinks that [13,11,8,11,...] is the information for the first edge, [16,17,18,19,...] is the second edge and [1,1,-1,1,...] is the third edge. It can't do this.
You could do G.add_weighted_edges_from(zip(start_vertex,end_vertex,sign)). See this explanation of zip: https://stackoverflow.com/a/13704903/2966723
Finally, G.add_nodes_from(start_vertex) and G.add_nodes_from(end_vertex) are unneeded. If the nodes don't exist already when networkx tries to add an edge, it will add the nodes as well.
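Putting the zip suggestion together, a minimal sketch could look like this. It assumes the three columns are integers, and it uses a short hard-coded list in place of file.readlines() so it can be run on its own.

```python
import networkx as nx

lines = ["13 16 1", "11 17 1", "8 18 -1"]  # stand-in for file.readlines()

start_vertex, end_vertex, sign = [], [], []
for line in lines:
    u, v, w = line.split()  # no argument: splits on any run of tabs or spaces
    start_vertex.append(int(u))
    end_vertex.append(int(v))
    sign.append(int(w))

G = nx.Graph()
# zip pairs the i-th entries of the three lists into one (u, v, weight) tuple
G.add_weighted_edges_from(zip(start_vertex, end_vertex, sign))
print(list(G.edges(data="weight")))
```

Converting with int() also means the nodes can later be looked up as numbers rather than as strings.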
Use Python's networkx library (I am assuming Python 3.6).
The following code will read your file as is. You won't need the lines you have written above.
The print command that I have written is to help you check if the graph which has been read is correct or not.
Note: If your graph is not a directed graph then you can remove the create_using=nx.DiGraph() part written in the function.
import networkx as nx
g = nx.read_edgelist('network.txt', nodetype=int, data=(('weight', int),), create_using=nx.DiGraph(),delimiter=' ')
print(g)  # nx.info(g) was removed in NetworkX 3.0; print(g) shows a similar summary

Iterate variable for every node | Node Connectivity in Python Graph

I would like to find node connectivity between node 1 and rest of the nodes in a graph. The input text file format is as follows:
1 2 1
1 35 1
8 37 1
and so on for 167 lines. First column represents source node, second column represents destination node while the last column represents weight of the edge.
I'm trying to read the source, destination nodes from input file and forming an edge between them. I need to then find out if it is a connected network (only one component of graph and no sub-components). Here is the code
from numpy import *
import networkx as nx
G = nx.empty_graph()
for row in open('out40.txt'):
    row = row.split()
    src = row[0]
    dest = row[1]
    #print(src)
    G.add_edge(src, dest)
    print(src, dest)
for i in range(2, 41):
    if nx.bidirectional_dijkstra(G, 1, i): print("path exists from 1 to ", i)
manually adding the edges using
G.add_edge(1, 2)
works but is tedious and not suitable for large input files such as mine. The if loop condition works when I add edges manually but throws the following error for the above code:
in neighbors_iter
raise NetworkXError("The node %s is not in the graph."%(n,))
networkx.exception.NetworkXError: The node 2 is not in the graph.
Any help will be much appreciated!
In your code, you're adding nodes "1" and "2" et cetera (since reading from a file is going to give you strings unless you explicitly convert them).
However, you're then trying to refer to nodes 1 and 2. I'm guessing that networkx does not think that 2 == "2".
Try changing this...
G.add_edge(src, dest)
to this:
G.add_edge(int(src), int(dest))
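A tiny experiment makes the difference visible. This is just an illustrative sketch with two made-up graphs, one built from string labels and one from converted integers:

```python
import networkx as nx

# Labels exactly as they come out of row.split(): strings.
G = nx.Graph()
G.add_edge("1", "2")
string_lookup = G.has_node("2")   # the string label is present
int_lookup = G.has_node(2)        # the integer 2 is a different node label

# Same edge with the labels converted first.
H = nx.Graph()
H.add_edge(int("1"), int("2"))
path_exists = nx.has_path(H, 1, 2)

print(string_lookup, int_lookup, path_exists)
```

So the graph in the question does contain the edge, only under the labels "1" and "2".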
Not sure if that is an option for you, but are you aware of the built-in support of networkx for multiple graph text formats?
The edge list format seems to apply pretty well to your case. Specifically, the following method will read your input files without the need for custom code. Node labels are read as strings unless you pass nodetype=int:
G = nx.read_weighted_edgelist(filename, nodetype=int)
If you want to remove the weights (because you don't need them), you could subsequently do the following:
for u, v, attrs in G.edges(data=True):  # edges_iter() was removed in NetworkX 2.0
    attrs.clear()  # attrs is the 3rd element of the tuple, which
                   # contains the dictionary with edge attributes
From Networkx documentation:
for row in open('out40.txt'):
    row = row.split()
    src = row[0]
    dest = row[1]
    G.add_nodes_from([src, dest])
    #print(src)
    G.add_edge(src, dest)
    print(src, dest)
The error message says that the graph G does not contain the node you are asking about. Note that add_edge already creates missing nodes, so the lookup fails because the file yields string labels such as "2" rather than the integer 2.
You can also use "is_connected()" to make this a little simpler. e.g.
$ cat disconnected.edgelist
1 2 1
2 3 1
4 5 1
$ cat connected.edgelist
1 2 1
2 3 1
3 4 1
$ ipython
In [1]: import networkx as nx
In [2]: print(nx.is_connected(nx.read_weighted_edgelist('disconnected.edgelist')))
False
In [3]: print(nx.is_connected(nx.read_weighted_edgelist('connected.edgelist')))
True
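If the graph turns out not to be connected and you want to know which nodes node 1 can reach, nx.node_connected_component may help. The sketch below reuses the three lines of the disconnected example, passed as a list to nx.parse_edgelist so that no file is needed, and assumes integer node labels via nodetype=int.

```python
import networkx as nx

# Same content as disconnected.edgelist, given as a list of lines.
lines = ["1 2 1", "2 3 1", "4 5 1"]
G = nx.parse_edgelist(lines, nodetype=int, data=(("weight", float),))

# All nodes in the same component as node 1 (node 1 included).
reachable = nx.node_connected_component(G, 1)
unreachable = set(G) - reachable

print(sorted(reachable))
print(sorted(unreachable))
```

This does a single traversal from node 1 instead of one shortest-path query per target node.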
Another option is to load the file as a pandas dataframe and then use iterrows to iterate:
import pandas as pd
import networkx as nx
cols = ["src", "des", "wei"]
df = pd.read_csv('out40.txt', sep=" ", header=None, names=cols)
G = nx.empty_graph()
for index, row in df.iterrows():
    G.add_edge(row["src"], row["des"])
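The loop can probably be skipped as well, since networkx can build a graph straight from a dataframe with nx.from_pandas_edgelist. A short sketch, using an in-memory string in place of out40.txt and the same column names as above:

```python
import io
import pandas as pd
import networkx as nx

text = "1 2 1\n1 35 1\n8 37 1\n"  # stand-in for the contents of out40.txt
cols = ["src", "des", "wei"]
df = pd.read_csv(io.StringIO(text), sep=" ", header=None, names=cols)

# One call builds the graph; edge_attr keeps the weight column on each edge.
G = nx.from_pandas_edgelist(df, source="src", target="des", edge_attr="wei")
print(G.number_of_nodes(), G.number_of_edges())
```

Because read_csv parses the columns as integers here, node 1 can be looked up as the number 1.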
