Let’s suppose we have the below data (a sample of my whole dataset that counts thousands of rows):
Node Target
1 2
1 3
1 5
2 1
2 3
2 6
7 8
7 12
9 13
9 15
9 14
Clearly, if I plot this data as a graph, I get components that are disconnected from each other.
I am wondering how to isolate or remove one component from my network, e.g., the smallest one.
I would say that I should first identify the components, then filter out the component(s) that I am not interested in.
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
connected_com=[len(c) for c in sorted(nx.connected_components(G), key=len, reverse=True)]
Now I should create a network only with data from the largest component:
largest_cc = max(nx.connected_components(G), key=len)
This is easy when there are only two components. However, if I would like to keep two components and exclude one, how should I do it? This is my question.
In the example data you provided, 3 islands are obtained when plotting the graph with the code below:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
df=pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
nx.draw(G,with_labels=True)
And the graph looks like that:
Now if you want to only keep the biggest two islands you can use the nx.connected_components(G) function that you mentioned and store the two biggest components. Below is the code to do this:
N_subs = 2  # number of biggest islands you want to keep
# sort the components once, largest first
components = sorted(nx.connected_components(G), key=len, reverse=True)
G_sub = []
largest_components = []
for i in range(N_subs):
    largest_components.append(components[i])
    G_sub.append(G.subgraph(largest_components[i]))
You will then need to create a subgraph of G that is composed of both islands. You can use nx.compose_all to do that, and then just plot the result:
G_subgraphs=nx.compose_all(G_sub)
nx.draw(G_subgraphs,with_labels=True)
So overall the code looks like that:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
df=pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
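N_subs = 2  # number of biggest islands you want to keep
# sort the components once, largest first
components = sorted(nx.connected_components(G), key=len, reverse=True)
G_sub = []
largest_components = []
for i in range(N_subs):
    largest_components.append(components[i])
    G_sub.append(G.subgraph(largest_components[i]))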
G_subgraphs=nx.compose_all(G_sub)
nx.draw(G_subgraphs,with_labels=True)
And the output of this code gives:
Note: According to the documentation, nx.connected_components is only implemented for undirected graphs. nx.from_pandas_edgelist builds an undirected graph by default; if you build a directed graph instead (for example with create_using=nx.DiGraph), use nx.strongly_connected_components(G) or nx.weakly_connected_components(G).
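If the goal is to exclude one particular component rather than keep the N largest, one option is to remove its nodes from a copy of the graph. Below is a minimal sketch, assuming an undirected graph built from the sample edges in the question; the helper name drop_smallest_component is made up for illustration:

```python
import networkx as nx

# Sample edges from the question (Node, Target)
edges = [(1, 2), (1, 3), (1, 5), (2, 1), (2, 3), (2, 6),
         (7, 8), (7, 12), (9, 13), (9, 15), (9, 14)]
G = nx.Graph()
G.add_edges_from(edges)

def drop_smallest_component(graph):
    """Return a copy of graph without its smallest connected component (hypothetical helper)."""
    smallest = min(nx.connected_components(graph), key=len)
    pruned = graph.copy()
    pruned.remove_nodes_from(smallest)
    return pruned

H = drop_smallest_component(G)
remaining = sorted(sorted(c) for c in nx.connected_components(H))
print(remaining)
```

The same pattern works for any selection rule: pick the component(s) to discard from nx.connected_components(G) and pass their nodes to remove_nodes_from.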
Related
I have a dataset with the fields movie_id and genres. The total number of rows in the dataset is 103110. I am trying to construct a network graph of movies where two movies are connected if they belong to similar genres. Since it is a large dataset, it's taking too much time.
movie_id | genres
1 ab,bc,ca
5 dd,aa
20 ab,zz
22 aa,bb
33 cc,rr
In this example, movie 1 and movie 20 will have an edge; similarly, movie 5 and movie 22 will have an edge in the network. I am using the following code, but it is taking too much time. Please suggest a faster approach.
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
rdf = pd.read_csv("movies.csv")
g=nx.Graph()
g.add_nodes_from(rdf['movieId'].tolist())
eg=[]
for i,r in rdf.iterrows():
    for j,r1 in rdf.iterrows():
        if r['genres'] in r1['genres']:
            eg.append((r['movieId'],r1['movieId']))
g.add_edges_from(eg)
nx.write_adjlist(g,"movie_graph.csv")
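One possible way to avoid comparing every pair of rows is to group movies by genre first and only connect movies inside the same group. The sketch below is an assumption about what "similar genres" means (sharing at least one comma-separated genre, as in the example rows); the column names movie_id and genres follow the table above:

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx
import pandas as pd

# Sample rows from the question; the real data would come from pd.read_csv
rdf = pd.DataFrame({
    "movie_id": [1, 5, 20, 22, 33],
    "genres": ["ab,bc,ca", "dd,aa", "ab,zz", "aa,bb", "cc,rr"],
})

# Inverted index: genre -> movies tagged with that genre
movies_by_genre = defaultdict(list)
for movie_id, genres in zip(rdf["movie_id"], rdf["genres"]):
    for genre in genres.split(","):
        movies_by_genre[genre.strip()].append(movie_id)

g = nx.Graph()
g.add_nodes_from(rdf["movie_id"])
for movies in movies_by_genre.values():
    # Only movies that share this genre are paired
    g.add_edges_from(combinations(movies, 2))

print(sorted(g.edges()))
```

This still creates many edges when a single genre contains a large share of the movies, so it reduces wasted comparisons rather than the size of the resulting graph.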
I have a large dataset which compares products with a relatedness measure which looks like this:
product1 product2 relatedness
0101 0102 0.047619
0101 0103 0.023810
0101 0104 0.095238
0101 0105 0.214286
0101 0106 0.047619
... ... ...
I used the following code to feed the data into the NetworkX graphing tool and produce an MST diagram:
import networkx as nx
import matplotlib.pyplot as plt
products = (data['product1'])
products = list(dict.fromkeys(products))
products = sorted(products)
G = nx.Graph()
G.add_nodes_from(products)
print(G.number_of_nodes())
print(G.nodes())
row = 0
for c in data['product1']:
    p = data['product2'][row]
    w = data['relatedness'][row]
    if w > 0:
        G.add_edge(c,p, weight=w, with_labels=True)
    row = row + 1
nx.draw(nx.minimum_spanning_tree(G), with_labels=True)
plt.show()
The resulting diagram looks like this: https://i.imgur.com/pBbcPGc.jpg
However, when I re-run the code, with the same data and no modifications, the arrangement of the clusters appears to change, so it then looks different, example here: https://i.imgur.com/4phvFGz.jpg, second example here: https://i.imgur.com/f2YepVx.jpg. The clusters, edges, and weights do not appear to be changing, but the arrangement of them on the graph space is changing each time.
What causes the arrangement of the nodes to change each time without any changes to the code or data? How can I re-write this code to produce a network diagram with approximately the same arrangement of nodes and edges for the same data each time?
By default, the nx.draw method uses spring_layout. This layout implements the Fruchterman-Reingold force-directed algorithm, which starts from random initial positions. That randomness is the effect you see when you rerun the script.
If you want to "fix" the positions, then you should explicitly call the spring_layout function and specify the initial positions in the pos argument.
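Another way to get a repeatable layout is to fix the random seed. A minimal sketch, assuming a networkx version whose spring_layout accepts the seed keyword; the path graph stands in for the real minimum spanning tree:

```python
import networkx as nx

G = nx.path_graph(6)  # stand-in graph for illustration

# Same seed -> same random initial positions -> same final layout
pos_a = nx.spring_layout(G, seed=42)
pos_b = nx.spring_layout(G, seed=42)

same = all((pos_a[n] == pos_b[n]).all() for n in G.nodes())
print(same)

# The positions can then be passed to the drawing call, e.g.
# nx.draw(G, pos=pos_a, with_labels=True)
```

The seed value itself is arbitrary; what matters is that it stays the same between runs.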
For clarity, assign G = nx.minimum_spanning_tree(G). Then
nx.draw(G, with_labels=True)
is equivalent to
pos = nx.spring_layout(G)
nx.draw(G, pos=pos, with_labels=True)
Since you don't want pos to be recalculated randomly every time you run your script, one way to keep pos stable is to store it once and load it from a file on each rerun. You can put this script before nx.draw(G, pos=pos, with_labels=True) so that pos is computed only once:
import os, json

def store(pos):
    # convert numpy arrays to lists so the dictionary can be written as JSON
    return {k: v.tolist() for k, v in pos.items()}

def retrieve(pos):
    # JSON turns keys into strings; convert them back to the node type (float here)
    return {float(k): v for k, v in pos.items()}

if os.path.exists('pos.txt'):
    with open('pos.txt') as infile:
        pos = retrieve(json.load(infile))  # retrieving dictionary from file
    print('retrieve', pos)
else:
    pos = nx.spring_layout(G)  # calculates pos
    print('store', pos)
    with open('pos.txt', 'w') as outfile:
        json.dump(store(pos), outfile, indent=4)  # records pos dictionary into file
This is an ugly solution because it depends entirely on the data types used in the pos dictionary. It worked for me, but you might need to define your own conversions in store and retrieve.
I am using hierarchical clustering from seaborn.clustermap to cluster my data. This works fine to nicely visualize the clusters in a heatmap. However, now I would like to extract all row values that are assigned to the different clusters.
This is what my data looks like:
import pandas as pd
# load DataFrame
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
df
log_HU1 log_HU2
EEF1A1 13.439499 13.746856
HSPA8 13.169191 12.983910
FTH1 13.861164 13.511200
PABPC1 12.142340 11.885885
TFRC 11.261368 10.433607
RPL26 13.837205 13.934710
NPM1 12.381585 11.956855
RPS4X 13.359880 12.588574
EEF2 11.076926 11.379336
RPS11 13.212654 13.915813
RPS2 12.910164 13.009184
RPL11 13.498649 13.453234
CA1 9.060244 13.152061
RPS3 11.243343 11.431791
YBX1 12.135316 12.100374
ACTB 11.592359 12.108637
RPL4 12.168588 12.184330
HSP90AA1 10.776370 10.550427
HSP90AB1 11.200892 11.457365
NCL 11.366145 11.060236
Then I perform the clustering using seaborn as follows:
fig = sns.clustermap(df)
Which produces the following clustermap:
For this example I may be able to manually interpret the values belonging to each cluster (e.g. that TFRC and HSP90AA1 cluster together). However, I am planning to do this clustering analysis on much bigger data sets.
So my question is: does anyone know how to get the row values belonging to each cluster?
Thanks,
Using scipy.cluster.hierarchy module with fcluster allows cluster retrieval:
import pandas as pd
import seaborn as sns
import scipy.cluster.hierarchy as sch
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
# retrieve clusters using fcluster
d = sch.distance.pdist(df)
L = sch.linkage(d, method='complete')
# 0.2 can be modified to retrieve more stringent or relaxed clusters
clusters = sch.fcluster(L, 0.2*d.max(), 'distance')
# cluster indices correspond to the indices of the original df
for i,cluster in enumerate(clusters):
    print(df.index[i], cluster)
Out:
EEF1A1 2
HSPA8 1
FTH1 2
PABPC1 3
TFRC 5
RPL26 2
NPM1 3
RPS4X 1
EEF2 4
RPS11 2
RPS2 1
RPL11 2
CA1 6
RPS3 4
YBX1 3
ACTB 3
RPL4 3
HSP90AA1 5
HSP90AB1 4
NCL 4
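To go from one label per row to the list of rows in each cluster, the labels can be grouped into a dictionary. A minimal sketch using four rows from the table above; the variable name members is made up for illustration:

```python
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Four rows taken from the question's table
df = pd.DataFrame(
    {"log_HU1": [13.439499, 13.837205, 11.261368, 10.776370],
     "log_HU2": [13.746856, 13.934710, 10.433607, 10.550427]},
    index=["EEF1A1", "RPL26", "TFRC", "HSP90AA1"],
)

d = pdist(df.values)
L = sch.linkage(d, method="complete")
labels = sch.fcluster(L, 0.2 * d.max(), "distance")

# Map each cluster label to the row names assigned to it
members = {}
for name, label in zip(df.index, labels):
    members.setdefault(int(label), []).append(name)

print(members)
```

Note that sns.clustermap uses method='average' by default, so a linkage computed with method='complete' may not match the plotted dendrogram exactly. Passing the linkage stored on the returned ClusterGrid (dendrogram_row.linkage) to fcluster should keep the two consistent.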
Given the following example which is from: https://python-graph-gallery.com/404-dendrogram-with-heat-map/
It generates a dendrogram where I assume that it is based on scipy.
# Libraries
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
df = pd.read_csv(url)
df = df.set_index('model')
df.index.name = None  # del df.index.name raises an error in recent pandas versions
df
# Default plot
sns.clustermap(df)
Question: How can one get the dendrogram in non-graphical form?
Background information:
From the root of that dendrogram I want to cut it at the largest length. For example we have one edge from the root to a left cluster (L) and an edge to a right cluster (R) ...from those two I'd like to get their edge lengths and cut the whole dendrogram at the longest of these two edges.
Best regards
clustermap returns a handle to the ClusterGrid object, which includes child objects for each dendrogram, h.dendrogram_col and h.dendrogram_row. Inside these are the dendrograms themselves, which provide the dendrogram geometry in the form of the scipy.cluster.hierarchy.dendrogram return data, from which you can compute the length of a specific branch.
import numpy as np
h = sns.clustermap(df)
dgram = h.dendrogram_col.dendrogram
D = np.array(dgram['dcoord'])
I = np.array(dgram['icoord'])
# then the root node will be the last entry; each dcoord row is
# [left child height, merge height, merge height, right child height]
yy = D[-1]
lenL = yy[1]-yy[0]
lenR = yy[2]-yy[3]
The linkage matrix, the input used to compute the dendrogram, might also help:
h.dendrogram_col.linkage
h.dendrogram_row.linkage
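The branch lengths at the root can also be read directly from a linkage matrix, without going through the plot coordinates. The sketch below builds its own linkage from made-up toy points; with seaborn, the matrix from h.dendrogram_row.linkage could be used in place of Z:

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Toy data: two well-separated groups of points (made up for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0]])
Z = sch.linkage(X, method="average")

n = Z.shape[0] + 1  # number of original observations
left, right, root_height = int(Z[-1, 0]), int(Z[-1, 1]), Z[-1, 2]

def node_height(node_id):
    # Leaves (ids below n) sit at height 0; merged nodes store their height in column 2
    return 0.0 if node_id < n else Z[node_id - n, 2]

len_left = root_height - node_height(left)
len_right = root_height - node_height(right)

# Asking for two clusters splits the tree into the two branches below the root
top_two = sch.fcluster(Z, t=2, criterion="maxclust")
print(len_left, len_right, top_two)
```

Each row of the linkage matrix holds the two merged node ids, the merge height, and the size of the new cluster, so the last row describes the root.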
I imported my Facebook data onto my computer in the form of a .json file. The data is in the format:
{"nodes":[{"name":"Alan"},{"name":"Bob"}],"links":[{"source":0,"target":1}]}
Then, I use this function:
import json
import networkx as nx

def parse_graph(filename):
    """
    Returns networkx graph object of facebook
    social network in json format
    """
    G = nx.Graph()
    with open(filename) as json_data:
        data = json.load(json_data)
    # The nodes represent the names of the respective people
    # See networkx documentation for information on add_* functions
    G.add_nodes_from([n['name'] for n in data['nodes']])
    # Links refer to nodes by their position in the 'nodes' list
    G.add_edges_from([(data['nodes'][e['source']]['name'],data['nodes'][e['target']]['name']) for e in data['links']])
    return G
to make this .json file usable as a graph in NetworkX. To find the degree of the nodes, the only method I know how to use is:
degree = nx.degree(p)
Where p is the graph of all my friends. Now, I want to plot the graph such that the size of the node is the same as the degree of that node. How do I do this?
Using:
nx.draw(G,node_size=degree)
didn't work and I can't think of another method.
Update for those using networkx 2.x
The API has changed from v1.x to v2.x. networkx.degree no longer returns a dict but a DegreeView Object as per the documentation.
There is a guide for migrating from 1.x to 2.x here.
In this case it basically boils down to using dict(g.degree) instead of d = nx.degree(g).
The updated code looks like this:
import networkx as nx
import matplotlib.pyplot as plt
g = nx.Graph()
g.add_edges_from([(1,2), (2,3), (2,4), (3,4)])
d = dict(g.degree)
nx.draw(g, nodelist=d.keys(), node_size=[v * 100 for v in d.values()])
plt.show()
In networkx 1.x, nx.degree(p) returns a dict, while the node_size keyword argument needs a scalar or an array of sizes. You can use the dict nx.degree returns like this:
import networkx as nx
import matplotlib.pyplot as plt
g = nx.Graph()
g.add_edges_from([(1,2), (2,3), (2,4), (3,4)])
d = nx.degree(g)
nx.draw(g, nodelist=d.keys(), node_size=[v * 100 for v in d.values()])
plt.show()
@miles82 provided a great answer. However, if you've already added the nodes to your graph using something like G.add_nodes_from(nodes), then I found that d = nx.degree(G) may not return the degrees in the same order as your nodes.
Building off the previous answer, you can modify the solution slightly to ensure the degrees are in the correct order:
d = nx.degree(G)
d = [(d[node]+1) * 20 for node in G.nodes()]
Note the d[node]+1, which ensures that nodes of degree zero get a nonzero size and are still drawn.
Another method, if you still get the error 'DiDegreeView' object has no attribute 'keys':
1) First get the degree of each node as a list of tuples.
2) Build a node list from the first value of each tuple and a degree list from the second value.
3) Finally, draw the network with the node list and the degree list you created.
Here's the code:
list_degree=list(G.degree()) # returns a list of tuples, each tuple is (node, degree)
nodes , degree = map(list, zip(*list_degree)) # build a node list and the corresponding degree list
plt.figure(figsize=(20,10))
nx.draw(G, nodelist=nodes, node_size=[(v * 5)+1 for v in degree])
plt.show() # plotting the graph