NetworkX Graph attributes in a single table - python

I am trying to create a table displaying the attributes of nodes in a NetworkX (Python) graph. The functions that compute these attributes return the nodes paired with their attribute values, and are called as follows:
import networkx as nx
from pprint import pprint

# degrees of nodes
pprint(g.degree())
# clustering coefficient
pprint(nx.clustering(g))
I would like to compile these into a table that gives an intuitive overview, in the following format:
node   degree   clustering coefficient
---------------------------------------
a      2        3
b      1        1
...
Any tips on how to do this? Perhaps using a dict and tabulate, but I am not quite sure... I would be grateful for any tips!
Edit
I found a way to do some simple printing, but it isn't formatted very nicely and doesn't make for nice syntax:
for n in g.nodes():
    print(n, g.degree(n), nx.clustering(g, n))
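A minimal sketch of the dict-and-tabulate idea mentioned above (this assumes the third-party tabulate package is installed; the graph here is only illustrative):

import networkx as nx
from tabulate import tabulate  # third-party package, assumed installed

g = nx.erdos_renyi_graph(5, 0.5, seed=1)  # example graph for illustration

# collect the per-node stats once, then print them as aligned rows
clustering = nx.clustering(g)
rows = [(n, d, clustering[n]) for n, d in g.degree()]
print(tabulate(rows, headers=["node", "degree", "clustering coefficient"]))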

I would use a pandas DataFrame (pd.DataFrame) to store this data, constructed from a list comprehension of dictionaries (each dictionary corresponding to a row in the final DataFrame). Here's how I would do that with two attributes, in_degree and out_degree, for each node:
import pandas as pd
import networkx as nx

g = nx.DiGraph()
# ... populate the DiGraph ...

def compute_indegree(node, digraph):
    return digraph.in_degree(node)

def compute_outdegree(node, digraph):
    return digraph.out_degree(node)

attr_dicts = [
    {'node': node,
     'in_degree': compute_indegree(node, g),
     'out_degree': compute_outdegree(node, g)}
    for node in g.nodes
]

dataframe = pd.DataFrame(attr_dicts)
dataframe.set_index('node', inplace=True)
print(dataframe)
The final print line neatly formats the resulting DataFrame.
If you modify the above functions, or add similar ones that return other stats such as your clustering coefficient, the table will populate in the same way.
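For instance, a minimal sketch of extending the row dictionaries with the clustering coefficient from the original question (the example graph and column names here are illustrative):

import pandas as pd
import networkx as nx

g = nx.karate_club_graph()  # example graph for illustration

clustering = nx.clustering(g)
attr_dicts = [
    {'node': node,
     'degree': g.degree(node),
     'clustering_coefficient': clustering[node]}
    for node in g.nodes
]

dataframe = pd.DataFrame(attr_dicts).set_index('node')
print(dataframe)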

Related

Using networkx with multiple dummy variables

I am trying to learn networkx and have run into an issue.
I have a dataset that has individuals and multiple columns of dummy variables.
I am trying to have each individual as a node and each dummy variable as a node with edges that connect individuals to each dummy variable they have a 1 for.
Code I have:
G = nx.Graph()
G = nx.from_pandas_edgelist(df, "NAME", dummy_variables)
nx.draw_shell(G, with_labels=True)
where dummy_variables is a list containing the dummy-variable columns from the dataframe that I am interested in.
Here is a sample of the dataframe I am working with:
[image: dummy variable dataframe]
Where HSSAFETYX and other variables are dummy variables. Name is the individual.
What I want to see is each row represented by a node with edges connecting them to the dummy variables they have a 1 for.
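A minimal sketch of one way to build that bipartite edge list, assuming the dataframe has a NAME column and 0/1 dummy columns (the data and column names here are illustrative):

import pandas as pd
import networkx as nx

# illustrative dataframe: one individual column plus 0/1 dummy columns
df = pd.DataFrame({
    'NAME': ['alice', 'bob'],
    'HSSAFETYX': [1, 0],
    'OTHERVAR': [1, 1],
})
dummy_variables = ['HSSAFETYX', 'OTHERVAR']

# reshape to long form: one row per (individual, dummy variable) pair
long_df = df.melt(id_vars='NAME', value_vars=dummy_variables)

# keep only the pairs where the dummy is 1, then build the edge list
edges = long_df[long_df['value'] == 1]
G = nx.from_pandas_edgelist(edges, 'NAME', 'variable')

nx.draw_shell(G, with_labels=True)  # drawing requires matplotlib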

NetworkX - Setting node attributes from dataframe

I'm having trouble figuring out how to add attributes to nodes in my network from columns in my dataframe.
I have provided an example of my dataframe below; there are around 10 columns in total, but I only use the 5 columns shown when creating my network.
Unfortunately, at the moment I can only get edge attributes working with my network; I am doing this as shown below:
g = nx.from_pandas_dataframe(df, 'node_from', 'node_to', edge_attr=['attribute1','attribute2','attribute3'])
The network will be a directed network. The attributes shown in the dataframe below are the attributes of the 'node_from' nodes. The 'node_to' nodes sometimes also appear as 'node_from' nodes. All the nodes that can possibly appear in the network, and their respective attributes, are listed in the df_attributes_only table.
df_relationship:
node_from  node_to  ...  attribute1  attribute2  attribute3
jim        john     ...  tall        red         fat
...
All of the columns have words as their values, not digits.
I also have another dataframe which has each possible node and their attributes:
df_attributes_only:
id    attribute1  attribute2  attribute3
jim   tall        red         fat
john  small       blue        fat
...
I essentially need to assign the above three attributes to their respective id, so that every node has its 3 attributes attached.
Any help on how I could get node attributes working with my network is greatly appreciated.
As of NetworkX 2.0, you can pass a dictionary of dictionaries to nx.set_node_attributes to set attributes for multiple nodes. This is a much more streamlined approach than iterating over each node manually. The outer dictionary's keys are the nodes, and the inner dictionaries' keys are the attributes you want to set for each node. Something like this:
attrs = {
    node0: {attr0: val00, attr1: val01},
    node1: {attr0: val10, attr1: val11},
    node2: {attr0: val20, attr1: val21},
}
nx.set_node_attributes(G, attrs)
You can find more detail in the documentation.
Using your example, assuming your index is id, you can convert your dataframe df_attributes_only of node attributes to this format and add to your graph:
df_attributes_only = pd.DataFrame(
    [['jim', 'tall', 'red', 'fat'], ['john', 'small', 'blue', 'fat']],
    columns=['id', 'attribute1', 'attribute2', 'attribute3']
)
node_attr = df_attributes_only.set_index('id').to_dict('index')
nx.set_node_attributes(g, node_attr)
g.nodes['jim']
>>> {'attribute1': 'tall', 'attribute2': 'red', 'attribute3': 'fat'}
nx.from_pandas_dataframe (and from_pandas_edgelist in the latest stable version, 2.2) conceptually converts an edge list to a graph, i.e., each row in the dataframe represents an edge, which is a pair of two different nodes.
Using this API it is not possible to read nodes' attributes. That makes sense, because each row contains two different nodes, and keeping dedicated columns for each node's attributes would be cumbersome and could cause discrepancies. For example, consider the following dataframe:
node_from  node_to  src_attr_1  tgt_attr_1
a          b        0           3
a          c        2           4
What should be the 'src_attr_1' value for node a? Is it 0 or 2? Moreover, we would need to keep two columns for each attribute (since it is a node attribute, both nodes in each edge should have it). In my opinion it would be bad design to support this, and I guess that's why the NetworkX API doesn't.
You can still read nodes' attributes, after converting the df to a graph, as follows:
import networkx as nx
import pandas as pd

# Build a sample dataframe with 2 edges (0 -> 1, 0 -> 2); node 0 has attr_1
# value 'a', node 1 has 'b', node 2 has 'c'
d = {'node_from': [0, 0], 'node_to': [1, 2], 'src_attr_1': ['a', 'a'], 'tgt_attr_1': ['b', 'c']}
df = pd.DataFrame(data=d)
G = nx.from_pandas_edgelist(df, 'node_from', 'node_to')

# Iterate over df rows and set the source and target nodes' attributes for each row:
for index, row in df.iterrows():
    G.nodes[row['node_from']]['attr_1'] = row['src_attr_1']
    G.nodes[row['node_to']]['attr_1'] = row['tgt_attr_1']

print(G.edges())
print(G.nodes(data=True))
Edit:
In case you want to set a larger list of attributes for the source node, you can extract the dictionary for those columns automatically, as follows:
# List of desired source attributes:
src_attributes = ['src_attr_1', 'src_attr_2', 'src_attr_3']

# Iterate over df rows and set source node attributes:
for index, row in df.iterrows():
    src_attr_dict = {k: row.to_dict()[k] for k in src_attributes}
    G.nodes[row['node_from']].update(src_attr_dict)
This builds on @zohar.kom's answer: there is a way to solve this problem without iteration, so that answer can be optimized. I'm assuming that the attributes describe node_from.
Start with a graph from an edge list (as in @zohar.kom's answer):
G = nx.from_pandas_edgelist(df, 'node_from', 'node_to')
You can add the nodes and attributes first.
# Create a mask with only the first records
mask = ~df['node_from'].duplicated()
# Get a list of nodes with attributes
nodes = df[mask][['node_from','attribute1','attribute2','attribute3']]
This method for adding nodes from a dataframe comes from this answer.
# Add the attributes one at a time.
attr_dict = nodes.set_index('node_from')['attribute1'].to_dict()
nx.set_node_attributes(G, attr_dict, 'attr1')

attr_dict = nodes.set_index('node_from')['attribute2'].to_dict()
nx.set_node_attributes(G, attr_dict, 'attr2')

attr_dict = nodes.set_index('node_from')['attribute3'].to_dict()
nx.set_node_attributes(G, attr_dict, 'attr3')
Similar result to @zohar.kom's, but with less iterating.
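A minimal sketch of the same idea with the three repeated blocks factored into a loop (continuing from the snippet above, so G and nodes are assumed to already exist):

# continues from the snippet above: G is the graph, nodes the masked dataframe
for i, col in enumerate(['attribute1', 'attribute2', 'attribute3'], start=1):
    attr_dict = nodes.set_index('node_from')[col].to_dict()
    nx.set_node_attributes(G, attr_dict, 'attr{}'.format(i))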
Answer:
Objective: From a dataframe object, generate a network with nodes, edges, and node attributes.
Let's consider that we want to generate a network with nodes and node attributes. Each node has 3 attributes, i.e., attr1, attr2, and attr3.
We are given a dataframe df whose 1st and 2nd columns are from_node and to_node respectively, and which has attribute columns named attr1, attr2, and attr3.
The code below will add the required edges, nodes, and node attributes from the dataframe.
#%%time
g = nx.Graph()

# Add edges
g = nx.from_pandas_edgelist(df, 'from_node', 'to_node')

# Iterate over df rows and set each source node's attributes from the attribute columns:
for index, row in df.iterrows():
    g.nodes[row['from_node']]['attr_dict'] = row.iloc[2:].to_dict()

list(g.edges())[0:5]
list(g.nodes(data=True))[0:5]

Find all the ancestors of leaf nodes in a tree with pandas

I have a table that has two columns, 'parent' and 'child'. This is a download from SAP (ERP) of the SETNODE table. I need to create a dataframe in Python that has each level as its own column with respect to its parent and all levels before it.
In Python 3+.
There is an unknown (or always changing) number of levels in the full relationship, so the max level can't always be defined. I would like to create a full dataframe that shows ALL parent/child relationships for all levels. Right now it's about 15 levels, but it can probably go up to 20 or more with other data I work with.
For example (example_df) of the two columns:
example_df = pd.DataFrame({'parent': ['a', 'a', 'b', 'c', 'c', 'f'], 'child': ['b', 'c', 'd', 'f', 'g', 'h']})
To give output dataframe (solution_example):
solution_example = pd.DataFrame({'child':['h','f','d'],'parent_1':['a','a','a'],'parent_2':['c','c','b'],'parent_3':['f', 'none', 'none']})
This can be solved using the networkx library. First, build a directed graph from the DataFrame, and then find all ancestors of the leaf nodes.
import networkx as nx
leaves = set(df.child).difference(df.parent)
g = nx.from_pandas_edgelist(df, 'parent', 'child', create_using=nx.DiGraph())
ancestors = {
    n: nx.algorithms.dag.ancestors(g, n) for n in leaves
}

(pd.DataFrame.from_dict(ancestors, orient='index')
   .rename(lambda x: 'parent_{}'.format(x + 1), axis=1)
   .rename_axis('child')
   .fillna(''))
      parent_1 parent_2 parent_3
child
h            a        c        f
g            a        c
d            a        b

Trajectory Clustering/ Aggregation with Python

I’m working with geo-located social media posts and clustering their locations (latitude/longitude) using DBSCAN. In my data set, I have many users who have posted multiple times, which allows me to derive their trajectory (a time ordered sequence of positions from place to place). Ex:
3945641 [[38.9875, -76.94], [38.91711157, -77.02435118], [38.8991, -77.029], [38.8991, -77.029], [38.88927534, -77.04858468]]
I have derived trajectories for my entire data set, and my next step is to cluster or aggregate the trajectories in order to identify areas with dense movements between locations. Any ideas on how to tackle trajectory clustering/aggregation in Python?
Here is some code I've been working with to create trajectories as line strings/JSON dicts:
import pandas as pd
import numpy as np
import ujson as json
import time
# Import Data
data = pd.read_csv('filepath.csv', delimiter=',', engine='python')
#print len(data),"rows"
#print data
# Create DataFrame
df = pd.DataFrame(data, columns=['user_id', 'timestamp', 'latitude', 'longitude', 'cluster_labels'])
#print data.head()
# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)
# Get the ordered (by timestamp) coordinates for each user_id
output = [[id,data.loc[data['user_id']==id].sort_values(by='timestamp')[['latitude','longitude']].values.tolist()] for id in uniqueIds]
# Save outputs as csv
outputs = pd.DataFrame(output)
#print outputs
outputs.to_csv('filepath_out.csv', index=False, header=False)
# Save outputs as JSON
#outputDict = {}
#for i in output:
#    outputDict[i[0]] = i[1]
#with open('filepath.json', 'w') as f:
#    json.dump(outputDict, f, sort_keys=True, indent=4, ensure_ascii=False)
EDIT
I've come across a Python package, NetworkX, and was debating the idea of creating a network graph from my clusters as opposed to clustering the trajectory lines/segments. Any opinions on clustering trajectories vs. turning clusters into a graph to identify densely clustered movements between locations?
Below is an example of what some of the clusters look like:
In an effort to answer my own 1+ year old question, I've come up with a couple of solutions that have solved this (and similar questions), albeit without Python (which was my hope). First, using a method I provided to a user on the GIS StackExchange, ArcGIS and a couple of built-in tools can carry out a line density analysis (https://gis.stackexchange.com/questions/42224/creating-polyline-based-heatmap-from-gps-tracks/270524#270524). This takes GPS points, creates lines, segments the lines, and then clusters them. The second method uses SQL (ST_MakeLine primarily) and a Postgres/GIS/CARTO database to create lines ordered by ascending timestamp and grouped by user (e.g. https://carto.com/blog/jets-and-datelines/). One can then count the number of line occurrences (assuming points are clustered with clearly defined centroids, similar to my initial question above) and treat this as a cluster (e.g. Python/NetworkX: Add Weights to Edges by Frequency of Edge Occurance, https://carto.com/blog/alteryx-and-carto-to-explore-london-bike-data/).
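For the NetworkX route mentioned in the edit, a minimal sketch of turning per-user cluster-label sequences into a graph whose edge weights count how often a movement between two clusters occurs (the trajectories here are illustrative, not real data):

import networkx as nx

# Illustrative time-ordered cluster-label sequences, one per user
trajectories = {
    'user_1': ['c0', 'c1', 'c2', 'c1'],
    'user_2': ['c0', 'c1', 'c1', 'c3'],
}

G = nx.DiGraph()
for user, labels in trajectories.items():
    # each consecutive pair of cluster labels is one movement
    for src, dst in zip(labels, labels[1:]):
        if src == dst:
            continue  # skip stays within the same cluster
        if G.has_edge(src, dst):
            G[src][dst]['weight'] += 1
        else:
            G.add_edge(src, dst, weight=1)

# Edges with the highest weights indicate dense movements between clusters
print(sorted(G.edges(data='weight'), key=lambda e: -e[2]))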

Convert multilevel dictionary into a network graph in Python

I am stuck with a strange issue. I am reading data from a CSV file and converting it into a multi-level dictionary.
CSV Format: I have a total of 1,500 rows in my CSV file, see the format below.
1-103rd Street,1-96th Street,2327.416174
1-116th Street–Columbia University,1-Cathedral Parkway–110th Street,2327.416174
1-125th Street,1-116th Street–Columbia University,2327.416174
1-137th Street–City College,1-125th Street,2327.416174
1-145th Street,1-137th Street–City College,2327.416174
1-14th Street,1-Christopher Street–Sheridan Square,2327.416174
In the above file, the first column denotes a source station, the second column denotes a destination station, and the third column provides the distance between them.
I will have to apply Dijkstra's Algorithm to find the shortest distance between two stations, and for that I need to convert the whole CSV file into a weighted graph, in which each station is a node and the distance between them is the weight of the edge.
My approach:
First I am reading each row from the CSV file and converting it into a multi-level dictionary. I am getting a proper dictionary for this. Below is my code.
import csv

my_dict = {}
with open('final_subway_data.csv') as f_input:
    for row in csv.reader(f_input):
        my_dict[row[0]] = {row[1]: row[2]}
Now I need to convert this newly created dictionary into a graph in order to apply Dijkstra's Algorithm. For that I am using this code:
G = nx.from_dict_of_dicts(my_dict)
But I am getting an error saying "TypeError: Input graph is not a networkx graph type".
Please help me. How can I convert the whole CSV file into a graph so I can apply Dijkstra's Algorithm to find a shortest distance between any two stations.
I'm not super familiar with NetworkX, but I'd do the following using pandas and nx.from_pandas_dataframe().
import pandas as pd
import networkx as nx
df = pd.read_csv('csvpath.csv', names=['origin', 'dest', 'dist'])
g = nx.from_pandas_dataframe(df, source='origin', target='dest', edge_attr='dist')
g['1-103rd Street']['1-96th Street']['dist']
# 2327.416174
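To then apply Dijkstra's algorithm, as the question asks, a minimal sketch using the 'dist' edge attribute as the weight (the station names are taken from the sample CSV above; from_pandas_edgelist is the newer name of from_pandas_dataframe in NetworkX 2.x):

import pandas as pd
import networkx as nx

df = pd.read_csv('csvpath.csv', names=['origin', 'dest', 'dist'])

# from_pandas_edgelist replaces from_pandas_dataframe in NetworkX >= 2.0
g = nx.from_pandas_edgelist(df, source='origin', target='dest', edge_attr='dist')

# Dijkstra's algorithm, weighting edges by the 'dist' attribute
path = nx.dijkstra_path(g, '1-145th Street', '1-Cathedral Parkway–110th Street', weight='dist')
length = nx.dijkstra_path_length(g, '1-145th Street', '1-Cathedral Parkway–110th Street', weight='dist')
print(path)
print(length)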
