Convert multilevel dictionary into a network graph in Python

I am stuck with a strange issue. I am reading data from a CSV file and converting it into a multi-level dictionary.
CSV Format: I have a total of 1,500 rows in my CSV file, see the format below.
1-103rd Street,1-96th Street,2327.416174
1-116th Street–Columbia University,1-Cathedral Parkway–110th Street,2327.416174
1-125th Street,1-116th Street–Columbia University,2327.416174
1-137th Street–City College,1-125th Street,2327.416174
1-145th Street,1-137th Street–City College,2327.416174
1-14th Street,1-Christopher Street–Sheridan Square,2327.416174
In the above file, the first column denotes a source station, the second column denotes a destination station, and the third column provides the distance between them.
I will have to apply Dijkstra's Algorithm to find the shortest distance between two stations, and for that I need to convert the whole CSV file into a weighted graph, in which each station is a node and the distance between them is the weight of the edge.
My approach:
First I am reading each row from the CSV file and converting it into a multi-level dictionary. I am getting a proper dictionary for this. Below is my code.
import csv

my_dict = {}
with open('final_subway_data.csv') as f_input:
    for row in csv.reader(f_input):
        my_dict[row[0]] = {row[1]: row[2]}
Now I need to convert this newly created dictionary into a graph in order to apply Dijkstra's Algorithm. For that I am using this code:
G = nx.from_dict_of_dicts(my_dict)
But I am getting an error saying "TypeError: Input graph is not a networkx graph type".
Please help me. How can I convert the whole CSV file into a graph so I can apply Dijkstra's Algorithm to find the shortest distance between any two stations?

I'm not super familiar with NetworkX, but I'd do the following using pandas and nx.from_pandas_dataframe().
import pandas as pd
import networkx as nx
df = pd.read_csv('csvpath.csv', names=['origin', 'dest', 'dist'])
g = nx.from_pandas_dataframe(df, source='origin', target='dest', edge_attr='dist')
g['1-103rd Street']['1-96th Street']['dist']
# 2327.416174
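From there, Dijkstra's algorithm is built into NetworkX. A minimal sketch (note that in NetworkX 2.x `from_pandas_dataframe` was renamed to `from_pandas_edgelist`; the second row of stations/distances below is made up to illustrate a two-hop path):

```python
import pandas as pd
import networkx as nx

# Toy edge list in the question's CSV format (second row is made up)
df = pd.DataFrame({
    'origin': ['1-103rd Street', '1-96th Street'],
    'dest':   ['1-96th Street', '1-86th Street'],
    'dist':   [2327.416174, 1500.0],
})

# NetworkX 2.x name; in 1.x this was nx.from_pandas_dataframe
g = nx.from_pandas_edgelist(df, source='origin', target='dest', edge_attr='dist')

# Shortest path and its length, weighting edges by the 'dist' attribute
path = nx.dijkstra_path(g, '1-103rd Street', '1-86th Street', weight='dist')
length = nx.dijkstra_path_length(g, '1-103rd Street', '1-86th Street', weight='dist')
```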

Related

NetworkX Graph attributes in a single table

I am trying to create a table displaying the attributes of nodes in a NetworkX (using Python) graph. The functions used to get these attributes generate a list of the nodes with their assigned attribute and are generated like in the following:
import networkx as nx
from pprint import pprint

# degrees of nodes
pprint(g.degree())
# clustering coefficient
pprint(nx.clustering(g))
I would like to be able to compile these into a table with an intuitive overview, in the following format:
node degree clustering coefficient
----------------------------------------
a 2 3
b 1 1
...
Any tips on how to do so? Perhaps using a dict and tabulate, but I am not quite sure. I would be grateful for any tips!
Edit
I found a way to do some simple printing, but it isn't formatted very nicely and doesn't make for nice syntax:
for n in g.nodes():
    print(n, g.degree(n), nx.clustering(g, n))
I would use a Pandas dataframe (pd.DataFrame) to store this data, which I would construct in a list comprehension of dictionaries (each dictionary corresponding to a row in the final data frame). Here's how I would do that with two attributes, in_degree and out_degree for each node:
import pandas as pd
import networkx as nx

g = nx.DiGraph()
# ... populate the DiGraph ...

def compute_indegree(node, digraph):
    return digraph.in_degree(node)

def compute_outdegree(node, digraph):
    return digraph.out_degree(node)

attr_dicts = [
    {'node': node,
     'in_degree': compute_indegree(node, g),
     'out_degree': compute_outdegree(node, g)}
    for node in g.nodes
]

dataframe = pd.DataFrame(attr_dicts)
dataframe.set_index('node', inplace=True)
print(dataframe)
The final print line neatly formats the resulting DataFrame. If you modify the functions compute_indegree and compute_outdegree (or add new ones) to return other stats, such as your clustering coefficient, the table will still populate as above.
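For instance, here is a sketch with a small made-up undirected graph, swapping the degree and clustering-coefficient stats from the question into the same list-comprehension pattern:

```python
import pandas as pd
import networkx as nx

# Small made-up graph standing in for g
g = nx.Graph([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')])

# One dict per node -> one row per node in the final table
attr_dicts = [
    {'node': node,
     'degree': g.degree(node),
     'clustering': nx.clustering(g, node)}
    for node in g.nodes
]

table = pd.DataFrame(attr_dicts).set_index('node')
print(table)
```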

adding weight list to already created edgelist from pandas dataframe and display the weight as a edge label in graph

I have my data in a pandas DataFrame; it has source, target, and weight columns.
H=nx.from_pandas_edgelist(links,source='source',target='target')
I used this to create my edge list, but this doesn't have an option to add weights. I have also kept my data in different forms apart from the pd.DataFrame:
edges_df = {'source': links['source'],
            'target': links['target'],
            'weights': links['value']}
edges_source = links['source']
edges_target = links['target']
weights = links['value']
These hold the same data in different structures. I tried using nx.set_edge_attributes, but it gave an error along the lines of the input not being iterable/hashable and not being in G[u][v][w].
The simple thing would be to use the edge_attr parameter when creating your graph like below.
H=nx.from_pandas_edgelist(links,source='source',target='target',edge_attr='weight')
If you already created the graph and want to add the attributes after the fact, you can use
for e in g.edges:
    # use .values[0] to ensure you get the scalar value, not an array or Series
    g.edges[e]['weight'] = df.loc[(df.source == e[0]) & (df.target == e[1]), 'weight'].values[0]
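Alternatively, the loop can be avoided with nx.set_edge_attributes, which expects a dict keyed by (u, v) tuples — the likely cause of the iterable/hashable error in the question. A sketch with a made-up `links` frame:

```python
import pandas as pd
import networkx as nx

# Made-up stand-in for `links`
links = pd.DataFrame({'source': ['a', 'b'],
                      'target': ['b', 'c'],
                      'value':  [3.0, 1.5]})

H = nx.from_pandas_edgelist(links, source='source', target='target')

# Build a dict keyed by (u, v) tuples, as nx.set_edge_attributes expects
weights = {(row.source, row.target): row.value for row in links.itertuples()}
nx.set_edge_attributes(H, weights, name='weight')

# Edge labels for drawing, e.g. with nx.draw_networkx_edge_labels(H, pos, edge_labels=labels)
labels = nx.get_edge_attributes(H, 'weight')
```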

Pandas data frame gets sorted differently when filtering on different columns

I am using Plotly Dash for a visual representation of data analysis that I have performed on an IPL database. I have a bunch of CSV files that I exported from SQL views.
I am now reading these CSVs with pandas and passing the retrieved data, based on my filters, to a Plotly graph.
The problem is that the data comes back sorted by different columns depending on which column I filter on: when I filter by season_id the data comes sorted by runs, and when I filter by team_bowling it comes sorted by match_id.
I am not able to understand this behavior of filtering on a pandas DataFrame.
Here is my code and the output.
stats = pd.read_csv('data_files/All_Season_Batsman_Runs.csv', delimiter=',')
kohli = stats[stats.Player_Name == 'V Kohli'][stats.Season_Id == 1]
print(kohli)
stats = pd.read_csv('data_files/All_Season_Batsman_Runs.csv', delimiter=',')
kohli = stats[stats.Player_Name == 'V Kohli'][stats.Team_Bowling == 1]
print(kohli)
I am using
Pandas => 0.23.4
Python => 3.7
Looking at the index numbers, the original file already has some sorting, possibly by season and runs. Nothing unexpected as far as I can tell.
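To make the result order independent of the file, combine the conditions into one boolean mask (the chained `[...][...]` indexing in the question also triggers a UserWarning in recent pandas) and sort explicitly. A sketch with a made-up frame, since the CSV isn't available:

```python
import pandas as pd

# Hypothetical frame with the columns named in the question
stats = pd.DataFrame({
    'Player_Name': ['V Kohli', 'V Kohli', 'MS Dhoni'],
    'Season_Id':   [1, 1, 1],
    'Match_Id':    [3, 1, 2],
    'Runs':        [45, 82, 60],
})

# One combined mask instead of chained indexing, then an explicit sort
kohli = stats[(stats.Player_Name == 'V Kohli') & (stats.Season_Id == 1)]
kohli = kohli.sort_values(by='Match_Id')
```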

Number of edges differ when converting pandas dataframe to Networkx object

I am using networkx to build an email network structure from a txt file where each row represents an "edge." I first loaded the txt file (3 columns: {'#Sender', 'Recipient', 'time'}) into Python and then converted it to a NetworkX object using the following code:
import networkx as nx
import pandas as pd
email_df = pd.read_csv('email_network.txt', delimiter = '->')
email = nx.from_pandas_dataframe(email_df, '#Sender', 'Recipient', edge_attr = 'time')
The email.txt data can be accessed here.
However, email_df (a pandas DataFrame object) has a length of 82927, while email (a Networkx object) has a length of 3251.
In [1]: len(email_df)
Out[1]: 82927

In [2]: len(email.edges())
Out[2]: 3251
I got really confused because even for rows containing the same two nodes in the first two columns of email_df with the same direction (say, '1' to '2'), the third column ('time', a timestamp) should distinguish them from each other, so no replicated edges would appear. Then why does the number of edges drop so dramatically, from 82927 to 3251, after I use nx.from_pandas_dataframe to read from `email_df`?
Would anyone help explain this to me?
Thank you.
Your line here says to take the Sender column as the source node, the Recipient column as the target, and add the time as an edge attribute. So you are only creating a single (directed) edge between each Sender and Recipient, and only the time of the last row will be added as an attribute of that edge.
email = nx.from_pandas_dataframe(email_df, '#Sender', 'Recipient', edge_attr = 'time')
You can only have one edge defined for a pair of nodes in a Graph or DiGraph - you could group the dataframe before constructing your network and use the count as the weight for each edge:
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight')
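If you instead want to keep every row as its own edge, a MultiDiGraph allows parallel edges. A sketch with a tiny made-up frame (using nx.from_pandas_edgelist, the NetworkX 2.x replacement for from_pandas_dataframe):

```python
import pandas as pd
import networkx as nx

# Made-up rows with a repeated (#Sender, Recipient) pair
email_df = pd.DataFrame({'#Sender':   [1, 1, 2],
                         'Recipient': [2, 2, 3],
                         'time':      [10, 20, 30]})

# create_using=nx.MultiDiGraph() keeps one edge per row instead of collapsing duplicates
email = nx.from_pandas_edgelist(email_df, '#Sender', 'Recipient',
                                edge_attr='time', create_using=nx.MultiDiGraph())
```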

Trajectory Clustering/ Aggregation with Python

I’m working with geo-located social media posts and clustering their locations (latitude/longitude) using DBSCAN. In my data set, I have many users who have posted multiple times, which allows me to derive their trajectory (a time ordered sequence of positions from place to place). Ex:
3945641 [[38.9875, -76.94], [38.91711157, -77.02435118], [38.8991, -77.029], [38.8991, -77.029], [38.88927534, -77.04858468]]
I have derived trajectories for my entire data set, and my next step is to cluster or aggregate the trajectories in order to identify areas with dense movements between locations. Any ideas on how to tackle trajectory clustering/aggregation in Python?
Here is some code I've been working with to create trajectories as line strings/JSON dicts:
import pandas as pd
import numpy as np
import ujson as json
import time

# Import data
data = pd.read_csv('filepath.csv', delimiter=',', engine='python')
#print len(data), "rows"
#print data

# Create data frame
df = pd.DataFrame(data, columns=['user_id', 'timestamp', 'latitude', 'longitude', 'cluster_labels'])
#print data.head()

# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)

# Get the ordered (by timestamp) coordinates for each user_id
output = [[id, data.loc[data['user_id'] == id].sort_values(by='timestamp')[['latitude', 'longitude']].values.tolist()]
          for id in uniqueIds]

# Save output as CSV
outputs = pd.DataFrame(output)
#print outputs
outputs.to_csv('filepath_out.csv', index=False, header=False)

# Save output as JSON
#outputDict = {}
#for i in output:
#    outputDict[i[0]] = i[1]
#with open('filepath.json', 'w') as f:
#    json.dump(outputDict, f, sort_keys=True, indent=4, ensure_ascii=False)
EDIT
I've come across a Python package, NetworkX, and was debating the idea of creating a network graph from my clusters as opposed to clustering the trajectory lines/segments. Any opinions on clustering trajectories vs. turning clusters into a graph to identify densely clustered movements between locations?
Below is an example of what some clusters look like:
In an effort to answer my own 1+ year old question, I've come up with a couple of solutions which have solved this (and similar questions), albeit without Python (which was my hope). First, a method I provided to a user on the GIS StackExchange, using ArcGIS and a couple of built-in tools to carry out a line density analysis (https://gis.stackexchange.com/questions/42224/creating-polyline-based-heatmap-from-gps-tracks/270524#270524). This takes GPS points, creates lines, segments the lines, and then clusters them. The second method uses SQL (primarily ST_MakeLine) and a Postgres/PostGIS/CARTO database to create lines ordered by ascending timestamp and then grouped by user (e.g. https://carto.com/blog/jets-and-datelines/). One can then count the number of line occurrences (assuming points are clustered with clearly defined centroids, similar to my initial question above) and treat this as a cluster weight (e.g. Python/NetworkX: Add Weights to Edges by Frequency of Edge Occurance, https://carto.com/blog/alteryx-and-carto-to-explore-london-bike-data/).
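For the NetworkX route mentioned above, here is a minimal sketch of weighting edges by movement frequency, assuming trajectories have already been reduced to time-ordered sequences of cluster labels (the labels below are made up):

```python
from collections import Counter
import networkx as nx

# Hypothetical trajectories as time-ordered sequences of cluster labels
trajectories = [['A', 'B', 'C'], ['A', 'B'], ['B', 'C']]

# Count each consecutive cluster-to-cluster movement across all trajectories
edge_counts = Counter()
for traj in trajectories:
    edge_counts.update(zip(traj, traj[1:]))

# Weighted digraph: weight = how often that movement occurs
G = nx.DiGraph()
for (u, v), count in edge_counts.items():
    G.add_edge(u, v, weight=count)
```

Dense movement corridors then show up as high-weight edges, which can be filtered or styled accordingly when drawing.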
