Basic Network Analysis Python

I have an intermediate level of Python and have used it before to plot some pretty nice graphs for academic purposes. Recently I ended up with a nice DataFrame of agreements between regulators and want to create a network graph, but it seems a little more complicated than I thought.
Nodes = Parties
Edges = Agreements (by type)
The idea is to identify the centrality of the parties (John, for example, may have many agreements with different parties, while Mary has only one, but with two parties) and to display the different types of agreements in different colors.
My data frame is more or less like this:
YEAR | PARTIES | TYPE OF AGREEMENT
2005 | John, Ann | Complex Agreement
2010 | John, Mary, Rupert | Crossfunctional Agreement
... | ... | ...
Any ideas/suggestions?

This might get you going.
from itertools import combinations
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
df = pd.DataFrame(
    {
        "YEAR": [2005, 2010],
        "PARTIES": ["John, Ann", "John, Mary, Rupert"],
        "TYPE OF AGREEMENT": ["Complex Agreement", "Crossfunctional Agreement"],
    }
)
# Turn "John, Ann" into ["John", "Ann"]
df["PARTIES"] = df["PARTIES"].str.split(", ")
graph = nx.MultiGraph()
for ix, row in df.iterrows():
    # One edge per pair of parties in the same agreement
    for combo in combinations(row["PARTIES"], 2):
        graph.add_edge(*combo, year=row["YEAR"], type=row["TYPE OF AGREEMENT"])
nx.draw(graph, with_labels=True)
plt.savefig("graph.png")
This saves the drawing to graph.png.
You can refer to the NetworkX docs for, e.g., the centrality metrics and for how to style and label edges better.
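As a rough sketch of those two follow-ups, the snippet below computes degree centrality for each party and builds one color per edge from the agreement type. The color names in `color_by_type` are arbitrary choices made for illustration, and degree centrality is only one of several centrality measures NetworkX offers.

```python
from itertools import combinations

import networkx as nx

# Same sample rows as above: (year, parties, agreement type).
agreements = [
    (2005, ["John", "Ann"], "Complex Agreement"),
    (2010, ["John", "Mary", "Rupert"], "Crossfunctional Agreement"),
]

graph = nx.MultiGraph()
for year, parties, kind in agreements:
    for a, b in combinations(parties, 2):
        graph.add_edge(a, b, year=year, type=kind)

# Degree centrality: node degree divided by (number of nodes - 1).
# In a MultiGraph, parallel edges each count towards the degree.
centrality = nx.degree_centrality(graph)

# Hypothetical color choices, one per agreement type.
color_by_type = {
    "Complex Agreement": "tab:red",
    "Crossfunctional Agreement": "tab:blue",
}
# One color per edge, in the same order as graph.edges().
edge_colors = [color_by_type[data["type"]] for _, _, data in graph.edges(data=True)]
```

The `edge_colors` list can then be passed as `edge_color=edge_colors` to `nx.draw(graph, with_labels=True, ...)`. Parallel edges between the same two parties are drawn on top of each other by default, so only one of their colors may be visible.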

Related

Create clusters depending on scores performance

I have data from students who took a test with 2 sections: the first section tests their digital skills at level 2, and the second section tests their digital skills at level 3. I need to come up with 3 clusters of students based on their scores, to place them in 3 different skill levels (1, 2 and 3). Code sample below:
import pandas as pd
# initialize data of lists.
data = {'Name': ['Marc','Fay', 'Emile','bastian', 'Karine','kathia', 'John','moni'],
        'Scores_section1': [12,24,14,20,8,10,5,23],
        'Scores_section2': [20,4,1,0,18,9,12,10],
        'Sum_all_scores': [32,28,15,20,26,19,17,33]}
# Create DataFrame
df = pd.DataFrame(data)
# print dataframe.
df
I thought about using K-means clustering, but according to a tutorial I followed online, I'd need to use x, y coordinates. Should I use Scores_section1 as x and Scores_section2 as y, or vice versa, and why?
Many thanks in advance for your help!
Try it this way.
import pandas as pd
# initialize data of lists.
data = {'Name': ['Marc','Fay', 'Emile','bastian', 'Karine','kathia', 'John','moni'],
        'Scores_section1': [12,24,14,20,8,10,5,23],
        'Scores_section2': [20,4,1,0,18,9,12,10],
        'Sum_all_scores': [32,28,15,20,26,19,17,33]}
# Create DataFrame
df = pd.DataFrame(data)
# print dataframe.
df
# Import required module
from sklearn.cluster import KMeans
# Initialize the class object
kmeans = KMeans(n_clusters=3)
# Fit on the numeric score columns only and predict the cluster label of each row
features = df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']]
label = kmeans.fit_predict(features)
label
# Store the labels next to the original rows (the Name column is kept)
df['kmeans'] = label
df
# K-means clustering may be the most widely known clustering algorithm. It partitions
# an N-dimensional population into k sets by assigning examples to clusters so as to
# minimize the variance within each cluster.
# plot X & Y coordinates and color by cluster number
import plotly.express as px
fig = px.scatter(df, x="Scores_section1", y="Scores_section2", color="kmeans", size='Sum_all_scores', hover_data=['kmeans'])
fig.show()
Feel free to modify the code to suit your needs.
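One thing to keep in mind is that the cluster numbers k-means returns are arbitrary, so cluster 0 is not necessarily the weakest group. A minimal sketch of turning clusters into ordered skill levels is to rank them by their mean total score. The `kmeans` column below holds made-up labels standing in for the output of `fit_predict`, and only the total score is used for ranking, which is an assumption about what "level" should mean here.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Marc", "Fay", "Emile", "bastian", "Karine", "kathia", "John", "moni"],
    "Sum_all_scores": [32, 28, 15, 20, 26, 19, 17, 33],
    # Hypothetical cluster labels, standing in for kmeans.fit_predict(...).
    "kmeans": [0, 0, 1, 1, 2, 1, 1, 0],
})

# Mean total score of each cluster.
cluster_means = df.groupby("kmeans")["Sum_all_scores"].mean()

# Rank the clusters by that mean: lowest mean -> level 1, highest -> level 3.
level_of_cluster = cluster_means.rank(method="first").astype(int)

# Translate each student's cluster label into a skill level.
df["level"] = df["kmeans"].map(level_of_cluster)
```

With these labels, the cluster containing Marc, Fay and moni has the highest mean and becomes level 3, regardless of which number k-means happened to give it.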

Can you specify a bidirectional edge in a NetworkX digraph?

I'd like to be able to draw a NetworkX graph connecting characters from the movie "Love, Actually" (because it's that time of the year in this country), and specifying how each character "relates" to the other in the story.
Certain relationships between characters are unidirectional - e.g. Mark is in love with Juliet, but not the reverse. However, Mark is best friends with Peter, and Peter is best friends with Mark - this is a bidirectional relationship. Ditto Peter and Juliet being married to each other.
I'd like to specify both kinds of relationships. Using a NetworkX digraph in Python, I seem to have a problem: to specify a bidirectional edge between two nodes, I apparently have to provide the same link twice, which will subsequently create two arrows between two nodes.
What I'd really like is a single arrow connecting two nodes, with heads pointing both ways. I'm using NetworkX to create the graph, and pyvis.Network to render it in HTML.
Here is the code so far, which loads a CSV specifying the nodes and edges to create in the graph.
import networkx as nx
import csv
from pyvis.network import Network
dg = nx.DiGraph()
with open("rels.txt", "r") as fh:
    reader = csv.reader(fh)
    for row in reader:
        if len(row) != 3:
            continue # Quick check for malformed csv input
        dg.add_edge(row[0], row[1], label=row[2])
nt = Network('500px', '800px', directed=True)
nt.from_nx(dg)
nt.show('nx.html', True)
Here is the CSV, which can be read as "Node1", "Node2", "Edge label":
Mark,Juliet,in love with
Mark,Peter,best friends
Peter,Mark,best friends
Juliet,Peter,married
Peter,Juliet,married
And the resulting image:
Whereas what I'd really like the graph to look like is this:
(Thank you to this site for the wonderful graph tool for the above visualisation)
Is there a way to achieve the above visualisation using NetworkX and Pyvis? I wasn't able to find any documentation on ways to create bidirectional edges in a directed graph.
Read the CSV into pandas, create a digraph and plot it. NetworkX has quite comprehensive documentation on plotting. Here is what I came up with:
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
df = pd.DataFrame({
    'Source': ['Mark', 'Mark', 'Peter', 'Juliet', 'Peter'],
    'Target': ['Juliet', 'Peter', 'Mark', 'Peter', 'Juliet'],
    'Status': ['in love with', 'best friends', 'best friends', 'married', 'married'],
})
# Create graph
g = nx.from_pandas_edgelist(df, 'Source', 'Target', ['Status'], create_using=nx.DiGraph())
pos = nx.spring_layout(g)
nx.draw(g, pos, with_labels=True)
# Label each edge with its relationship
edge_labels = {(n1, n2): d['Status'] for n1, n2, d in g.edges(data=True)}
nx.draw_networkx_edge_labels(g,
                             pos, edge_labels=edge_labels,
                             label_pos=0.5,
                             font_color='red',
                             font_size=7,
                             font_weight='bold',
                             verticalalignment='bottom')
plt.show()
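For the single double-headed arrow asked about in the question, one possible approach is to merge each reciprocal pair into a single record before building the graph. The sketch below does only that merging step in plain Python. The helper name `collapse_reciprocal` is made up for illustration, and the `"to"` and `"to, from"` strings follow the vis.js `arrows` edge option; whether pyvis forwards an `arrows` edge attribute unchanged is an assumption to check against its documentation.

```python
def collapse_reciprocal(edges):
    """Merge A->B and B->A with the same label into one two-headed record."""
    seen = {}
    for source, target, label in edges:
        reverse_key = (target, source, label)
        if reverse_key in seen:
            # The opposite direction already exists: mark it as two-headed.
            seen[reverse_key] = "to, from"
        else:
            # First time this direction is seen: a normal one-headed arrow.
            seen.setdefault((source, target, label), "to")
    return [(s, t, lbl, arrows) for (s, t, lbl), arrows in seen.items()]


rels = [
    ("Mark", "Juliet", "in love with"),
    ("Mark", "Peter", "best friends"),
    ("Peter", "Mark", "best friends"),
    ("Juliet", "Peter", "married"),
    ("Peter", "Juliet", "married"),
]
merged = collapse_reciprocal(rels)
```

Each tuple in `merged` could then be added as one edge, for example with `dg.add_edge(s, t, label=lbl, arrows=arrows)`, so that the five rows of the CSV become three edges.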

Efficient and faster way of constructing network graph from a dataframe?

I have a dataset with fields movie_id and genres. The total number of rows in the dataset is 103110. I am trying to construct a network graph of movies where two movies are connected if they belong to similar genres. Since it is a large dataset, it's taking too much time.
movie_id | genres
1 | ab,bc,ca
5 | dd,aa
20 | ab,zz
22 | aa,bb
33 | cc,rr
In this example, movie 1 and movie 20 will have an edge; similarly, movie 5 and movie 22 will have an edge in the network. I am using the following code, but it is taking too much time. Please suggest a faster approach.
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
rdf = pd.read_csv("movies.csv")
g = nx.Graph()
g.add_nodes_from(rdf['movieId'].tolist())
eg = []
# Compare every row with every other row (quadratic in the number of movies)
for i, r in rdf.iterrows():
    for j, r1 in rdf.iterrows():
        if r['genres'] in r1['genres']:
            eg.append((r['movieId'], r1['movieId']))
g.add_edges_from(eg)
nx.write_adjlist(g, "movie_graph.csv")
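One possible way to avoid the nested row-by-row loop is to group movies by genre first and only pair up movies inside each group. The sketch below assumes that "similar genres" means sharing at least one genre, which matches the example above but is an interpretation. The function name `shared_genre_edges` is made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def shared_genre_edges(rows):
    """Return the set of movie pairs that share at least one genre."""
    # Inverted index: genre -> list of movies tagged with it.
    movies_by_genre = defaultdict(list)
    for movie_id, genres in rows:
        for genre in genres.split(","):
            movies_by_genre[genre.strip()].append(movie_id)

    # Pair up movies only within each genre group.
    edges = set()
    for members in movies_by_genre.values():
        for a, b in combinations(sorted(set(members)), 2):
            edges.add((a, b))
    return edges


rows = [(1, "ab,bc,ca"), (5, "dd,aa"), (20, "ab,zz"), (22, "aa,bb"), (33, "cc,rr")]
edges = shared_genre_edges(rows)
```

The resulting set can be passed to `g.add_edges_from(edges)`. Note that a genre shared by very many movies still produces a number of pairs that grows quadratically with the size of that group, so this helps most when genre groups are small compared with the whole dataset.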

How to remove specific components in a disconnected network

Let’s suppose we have the below data (a sample of my whole dataset that counts thousands of rows):
Node Target
1 2
1 3
1 5
2 1
2 3
2 6
7 8
7 12
9 13
9 15
9 14
Clearly, if I plot this data as a graph, I get components that are disconnected from each other.
I am wondering how to isolate or remove one component from my network, e.g., the smallest one.
I would say that I should first identify the components, then isolate or filter out the component(s) that I am not interested in.
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
connected_com=[len(c) for c in sorted(nx.connected_components(G), key=len, reverse=True)]
Now I should create a network only with data from the largest component:
largest_cc = max(nx.connected_components(G), key=len)
This is easy in the case of two components. However, if I would like to select two components and exclude one, how should I do it? This is my question.
In the example data you provided, 3 islands are obtained when plotting the graph with the code below:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
df=pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
nx.draw(G,with_labels=True)
And the graph looks like that:
Now if you want to only keep the biggest two islands you can use the nx.connected_components(G) function that you mentioned and store the two biggest components. Below is the code to do this:
N_subs = 2  # Number of biggest islands you want to keep
# Sort the components once, largest first, and keep the first N_subs
largest_components = sorted(nx.connected_components(G), key=len, reverse=True)[:N_subs]
# One subgraph per kept island
G_sub = [G.subgraph(component) for component in largest_components]
You will then need to create a single graph composed of both islands. You can use nx.compose_all to do that, and then just plot the result:
G_subgraphs=nx.compose_all(G_sub)
nx.draw(G_subgraphs,with_labels=True)
So overall the code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
df = pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
N_subs = 2  # Number of biggest islands you want to keep
# Sort the components once, largest first, and keep the first N_subs
largest_components = sorted(nx.connected_components(G), key=len, reverse=True)[:N_subs]
G_sub = [G.subgraph(component) for component in largest_components]
# Merge the kept islands into one graph and draw it
G_subgraphs = nx.compose_all(G_sub)
nx.draw(G_subgraphs, with_labels=True)
And the output of this code gives:
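Note: nx.connected_components is only implemented for undirected graphs. nx.from_pandas_edgelist builds an undirected graph by default, so the code above works as is. If you build a directed graph instead (for example with create_using=nx.DiGraph()), you might want to use nx.strongly_connected_components(G) or nx.weakly_connected_components(G).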
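As an alternative sketch, the kept islands can also be selected by taking the union of their node sets and asking for a single subgraph, which avoids composing several graphs. The edge list below is the sample data from the question typed in by hand rather than read from a file.

```python
import networkx as nx

# Sample edge list from the question (Node, Target).
edges = [(1, 2), (1, 3), (1, 5), (2, 1), (2, 3), (2, 6),
         (7, 8), (7, 12), (9, 13), (9, 15), (9, 14)]
G = nx.Graph(edges)

# Components as sets of nodes, largest first.
components = sorted(nx.connected_components(G), key=len, reverse=True)

# Keep the two largest islands, i.e. drop the smallest one.
keep_nodes = set().union(*components[:2])

# .copy() gives an independent graph instead of a read-only view of G.
G_kept = G.subgraph(keep_nodes).copy()
```

Any other selection rule works the same way: build the set of nodes to keep (or remove) from the chosen components and take `G.subgraph(...)` of it.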

Pandas dataframe: Frequency plot with hue based on different columns that share same string entries

I'm analysing this Kaggle dataset: https://www.kaggle.com/astronasko/transport-for-london-journey-information
I've created a DataFrame with all the completed journeys, where the start station ('StartStn') and end station ('EndStn') are not the same and there is information on each of them.
I've created a frequency plot of Start stations and a separate frequency plot of end stations (see images below):
Figure 1 code:
complete['StartStn'].value_counts()[:20].plot(kind='bar')
Figure 2 code:
complete['EndStn'].value_counts()[:20].plot(kind='bar')
Here is a sample of the dataframe, taking a subset of just these two columns:
IN:
complete[['StartStn','EndStn']].sample(10)
OUT:
         StartStn              EndStn
102417   Leytonstone           East Ham
995246   Walthamstow Central   Piccadilly Circus
1102327  Earls Court           Holborn
604323   Stratford             Shepherd's Bush Und
481718   Warren Street         Walthamstow Central
2344106  Marble Arch           Northolt
1234444  Colliers Wood         Holborn
1408620  Earls Court           Marble Arch
465436   Tottenham Court Rd    Mile End
1580309  Woodside Park         Hammersmith D
As you can see, many stations, such as 'Walthamstow Central', are in both columns.
Problem:
Using seaborn, matplotlib or pandas, how do I create a frequency plot for all stations that has a hue of StartStn vs EndStn (i.e. on the same axes)?
The best I can do is to create a frequency plot with all stations, combining frequencies in 'StartStn' and 'EndStn':
stations = pd.concat([complete['StartStn'],complete['EndStn']],axis=0)
stations.value_counts()[:10].plot(kind='bar')
Which gives me the following output:
[Figure: Most Popular Stations (Start or End)]
Would be very grateful for any suggestions!
Thanks a lot,
Beni
Hi Certiprince,
You can use countplot from seaborn and use StartStn and EndStn as the "hue", so that there are 2 bars per station.
Please find suitable code below. I have tried it with your sample of 10 items.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import OrderedDict
columns = ['StartStn','EndStn']
startstn = ['Leytonstone','Walthamstow Central','Earls Court','Stratford','Warren Street','Marble Arch','Colliers Wood',
            'Earls Court','Tottenham Court Rd','Woodside Park']
endstn = ['East Ham','Piccadilly Circus','Holborn',"Shepherd's Bush Und",'Walthamstow Central','Northolt',
          'Holborn','Marble Arch','Mile End','Hammersmith D']
df = pd.DataFrame(data={'StartStn':startstn,'EndStn':endstn})
print(df)
# Long-format copy of the start stations, tagged 'Start'
df['hue'] = 'Start'
df['Stations'] = df['StartStn']
df_start = df[['Stations','hue']]
# Long-format copy of the end stations, tagged 'End'
df['hue'] = 'End'
df['Stations'] = df['EndStn']
df_end = df[['Stations','hue']]
# Order the x axis by start-station frequency, then by end-station frequency
orderstart = df['StartStn'].value_counts()
startstnlist = orderstart.index.tolist()
orderend = df['EndStn'].value_counts()
endstnlist = orderend.index.tolist()
order = startstnlist+endstnlist
# Drop duplicates while keeping the order
order = list(OrderedDict.fromkeys(order))
df_concatenated = pd.concat([df_start,df_end],ignore_index=True)
sns.countplot(data=df_concatenated,x='Stations', order=order,hue='hue')
plt.show()
Edit:
I have included a piece of code so that the diagram is ordered, with the order given by the start station frequency.
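A shorter route to the same long format is `DataFrame.melt`, sketched below on a made-up four-row sample. The column names `Role` and `Station` are arbitrary names chosen for illustration.

```python
import pandas as pd

# Small made-up sample with the same two columns as the question.
complete = pd.DataFrame({
    "StartStn": ["Leytonstone", "Walthamstow Central", "Earls Court", "Earls Court"],
    "EndStn": ["East Ham", "Piccadilly Circus", "Holborn", "Walthamstow Central"],
})

# One row per (journey, role): Role is 'StartStn' or 'EndStn'.
long_form = complete.melt(
    value_vars=["StartStn", "EndStn"], var_name="Role", value_name="Station"
)

# Frequency table: one row per station, one column per role.
counts = pd.crosstab(long_form["Station"], long_form["Role"])
```

From here, `counts.plot(kind="bar")` draws the start and end counts side by side, and `long_form` can be passed to `sns.countplot(data=long_form, x="Station", hue="Role")` instead.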
