Homophily in a social network using Python

I am trying to determine the chance of homophily, then the homophily, of a dataset having nodes as keys and colors as values.
Example:
Node  Target  Colors
A     N       1
N     A       0
A     D       1
D     A       1
C     X       1
X     C       0
S     D       0
D     S       1
B             0
R     N       2
N     R       2
Colors are associated with the Node column and span from 0 to 2 (int).
The steps for calculating the chance of homophily on a characteristic z (in my case Color) are illustrated as follows:
c_list=df[['Node','Colors']].set_index('Node').T.to_dict('list')
print("\nChance of same color:", round(chance_homophily(c_list),2))
where chance_homophily is defined as follows:
from collections import Counter

# The function below takes a dictionary with nodes as keys and their characteristic (color) as values.
# It counts how often each characteristic occurs, then computes the chance homophily for it.
def chance_homophily(dataset):
    freq_dict = Counter([tuple(x) for x in dataset.values()])
    df_freq_counter = freq_dict
    c_list = list(df_freq_counter.values())
    chance_homophily = 0
    for class_count in c_list:
        chance_homophily += (class_count/sum(c_list))**2
    return chance_homophily
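As a sanity check on the formula, chance homophily is the sum, over all characteristic values, of the squared share of nodes carrying that value. A minimal sketch, assuming a hypothetical mapping with exactly one color per node (the `node_colors` dict and the `chance_homophily_simple` name are illustrative and not part of the original code):

```python
from collections import Counter

def chance_homophily_simple(node_colors):
    """Sum of squared color shares for a {node: color} mapping."""
    counts = Counter(node_colors.values())
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

# Hypothetical mapping: three nodes of color 1, four of color 0, one of color 2
node_colors = {"A": 1, "D": 1, "C": 1, "N": 0, "X": 0, "S": 0, "B": 0, "R": 2}
chance = chance_homophily_simple(node_colors)
print(round(chance, 2))  # (3/8)**2 + (4/8)**2 + (1/8)**2 = 26/64
```

Using one color per node avoids the ambiguity that arises when the same node appears in several rows with different colors.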
Then the homophily is calculated as follows:
def homophily(G, chars, IDs):
    """
    Given a network G, a dict of characteristics chars for node IDs,
    and dict of node IDs for each node in the network,
    find the homophily of the network.
    """
    num_same_ties = 0
    num_ties = 0
    for n1, n2 in G.edges():
        if IDs[n1] in chars and IDs[n2] in chars:
            if G.has_edge(n1, n2):
                num_ties+=1
                if chars[IDs[n1]] == chars[IDs[n2]]:
                    num_same_ties+=1
    return (num_same_ties / num_ties)
G should be built from my dataset above (so taking into account both node and target columns).
I am not totally familiar with this network property, but I think I have missed something in the implementation (e.g., does it correctly account for the relationships among nodes in the network?). In another example (with a different dataset) found on the web
https://campus.datacamp.com/courses/using-python-for-research/case-study-6-social-network-analysis?ex=1
the characteristic is also color (though it is a string, while I have a numeric variable). I do not know whether they take the relationships among nodes into account, perhaps via an adjacency matrix; this part has not been implemented in my code, where I am using
G = nx.from_pandas_edgelist(df, source='Node', target='Target')

Your code works fine. The only thing you are missing is the IDs dict, which would map the names of your nodes to the names of the nodes in the graph G. By creating the graph from a pandas edge list, you are already naming your nodes as they are in the data.
This makes the IDs dict unnecessary. Check out the example below, once without the IDs dict and once with a trivial dict so that the original function can be used:
import networkx as nx
import pandas as pd
from collections import Counter
df = pd.DataFrame({"Node":["A","N","A","D","C","X","S","D","B","R","N"],
"Target":["N","A","D","A","X","C","D","S","","N","R"],
"Colors":[1,0,1,1,1,0,0,1,0,2,2]})
c_list=df[['Node','Colors']].set_index('Node').T.to_dict('list')
G = nx.from_pandas_edgelist(df, source='Node', target='Target')
def homophily_without_ids(G, chars):
    """
    Given a network G and a dict of characteristics chars keyed by node name,
    find the homophily of the network.
    """
    num_same_ties = 0
    num_ties = 0
    for n1, n2 in G.edges():
        if n1 in chars and n2 in chars:
            if G.has_edge(n1, n2):
                num_ties+=1
                if chars[n1] == chars[n2]:
                    num_same_ties+=1
    return (num_same_ties / num_ties)
print(homophily_without_ids(G, c_list))
#create node ids map - trivial in this case
nodes_ids = {i:i for i in G.nodes()}
def homophily(G, chars, IDs):
    """
    Given a network G, a dict of characteristics chars for node IDs,
    and dict of node IDs for each node in the network,
    find the homophily of the network.
    """
    num_same_ties = 0
    num_ties = 0
    for n1, n2 in G.edges():
        if IDs[n1] in chars and IDs[n2] in chars:
            if G.has_edge(n1, n2):
                num_ties+=1
                if chars[IDs[n1]] == chars[IDs[n2]]:
                    num_same_ties+=1
    return (num_same_ties / num_ties)
print(homophily(G, c_list, nodes_ids))
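Because the edge list stores each tie in both directions and one row has no target, it can help to check the ratio by hand. A rough sketch in plain Python, assuming a hypothetical single color per node and treating ties as undirected (the names `observed_homophily`, `rows` and `node_colors` are illustrative, not from the original code):

```python
def observed_homophily(rows, node_colors):
    """Share of distinct undirected ties whose endpoints have the same color."""
    ties = set()
    for source, target in rows:
        if source and target:                      # skip rows without a target
            ties.add(frozenset((source, target)))  # A-N and N-A count once
    # Drop self-loops and ties with an endpoint of unknown color
    ties = [t for t in ties if len(t) == 2 and all(n in node_colors for n in t)]
    same = sum(1 for t in ties if len({node_colors[n] for n in t}) == 1)
    return same / len(ties)

rows = [("A", "N"), ("N", "A"), ("A", "D"), ("D", "A"), ("C", "X"), ("X", "C"),
        ("S", "D"), ("D", "S"), ("B", ""), ("R", "N"), ("N", "R")]
# Assumed colors, one per node (the original table lists some nodes with two colors)
node_colors = {"A": 1, "D": 1, "C": 1, "N": 0, "X": 0, "S": 0, "B": 0, "R": 2}
ratio = observed_homophily(rows, node_colors)
print(ratio)
```

Under this assumed mapping only the A-D tie joins two nodes of the same color, so the ratio is 1 in 5.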

Related

Python: Finding connected components in a graph presented as edge lists

I have an edge list in the form
start | end
a c
b d
e b
I have tens of millions of edges (approx 30 million) and I'm not able to read in the entire graph into memory - at least not using a library like networkx which is a bit memory intensive.
My goal is to find all connected components in the graph represented by that list that contain less than x number of nodes.
For example, I want to get all connected components with less than x=30 nodes. But I don't want to do this by building the entire graph and then doing a search for connected components (e.g. calling this networkx command: nx.connected_component_subgraphs(nxg)).
Is there a way I can search for connected components just using the edge list file and without having to build the entire graph?
Additional info: The node names are strings of 10-20 ASCII characters.
You would first need to reduce the footprint of your data by assigning shorter identifiers for your nodes. You could write a mapping between those shorter identifiers and original names to another file, so you can translate any solution to those names after running an algorithm.
Assuming you have short enough identifiers for your nodes, you could load everything in memory, in a dictionary keyed by node identifiers.
Then use the Union-Find structure & algorithm to identify the connected components.
Finally filter those by the maximum size they are allowed to have.
There are some libraries out there which provide Union-Find implementations, which could provide better performance. Here is a simple implementation of Union-Find:
class Node:
    def __init__(self, key):
        self.key = key
        self.parent = self
        self.size = 1

class UnionFind(dict):
    def find(self, key):
        node = self.get(key, None)
        if node is None:
            node = self[key] = Node(key)
        else:
            while node.parent != node:
                # walk up & perform path compression
                node.parent, node = node.parent.parent, node.parent
        return node

    def union(self, key_a, key_b):
        node_a = self.find(key_a)
        node_b = self.find(key_b)
        if node_a != node_b: # disjoint? -> join!
            if node_a.size < node_b.size:
                node_a.parent = node_b
                node_b.size += node_a.size
            else:
                node_b.parent = node_a
                node_a.size += node_b.size
Then, the following function would load that structure from an iterator, and return the components whose sizes meet the requirement:
from collections import defaultdict
def find_components(line_iterator, max_size):
    forest = UnionFind()
    for line in line_iterator:
        forest.union(*line.split())
    result = defaultdict(list)
    for key in forest.keys():
        root = forest.find(key)
        if root.size <= max_size:
            result[root.key].append(key)
    return list(result.values())
Here is a demo for the following graph:
data = """x d
c j
i e
f x
n z
a u
g r
w x
p l
u o
m g
k s
t q
y l
h m
n b
k v
e u
i o
r m
n c
x q
f q
j l
s v"""
results = find_components(data.splitlines(), 5)
print(results)
The output for this demo is:
[['i', 'e', 'a', 'u', 'o'], ['g', 'r', 'm', 'h'], ['k', 's', 'v']]
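The first step of the answer, replacing long node names with short identifiers, can be combined with a list-based Union-Find so that each node costs two list slots rather than one object. This is a minimal sketch under that assumption, not a benchmarked implementation; `small_components` and its helper names are hypothetical:

```python
def small_components(lines, max_size):
    """Return the components with at most max_size nodes from 'a b' edge lines."""
    ids = {}     # original node name -> compact integer id
    parent = []  # parent[i] is the parent of node i (a root is its own parent)
    size = []    # size[i] is the component size, meaningful only for roots

    def node_id(name):
        if name not in ids:
            ids[name] = len(parent)
            parent.append(len(parent))
            size.append(1)
        return ids[name]

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for line in lines:
        a, b = line.split()
        ra, rb = find(node_id(a)), find(node_id(b))
        if ra != rb:
            if size[ra] < size[rb]:
                ra, rb = rb, ra            # attach the smaller tree to the larger
            parent[rb] = ra
            size[ra] += size[rb]

    groups = {}
    for name, i in ids.items():
        root = find(i)
        if size[root] <= max_size:
            groups.setdefault(root, []).append(name)
    return list(groups.values())

print(small_components(["a c", "b d", "e b"], 2))
```

Since the function only iterates over `lines`, an open file object can be passed directly, so the edge list is never held in memory as a whole. Like the answer's code it keeps components of size up to and including `max_size`; use a strict comparison for "fewer than x".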

Distance map returned from shortest_distance function misses entries of certain vertices

I have a network present in a postgres database, where I can route with the pgrouting extension. I've read this into mem, and now want to calculate the distance of all nodes within 0.1 hours from a specific starting node:
dm = G.new_vp("double", np.inf)
gt.shortest_distance(G, source=nd[102481678], weights=wgts, dist_map = dm, max_dist=0.1)
where wgts is an EdgePropertyMap containing the weights per edge, and nd is a reverse mapping to get vertex index from the outside id.
In pgRouting this delivers 349 reachable nodes, using graph-tool only 328. The results are more or less the same (e.g. the furthest node is the same with the exact same cost, nodes present in both lists have same distance), but the graph-tool distance map just seems to miss certain nodes. The weird thing is that I found a cul-de-sac node labeled with a distance (second one from below), but the node connecting the cul-de-sac with the outside world is missing. Seems weird, because if the connecting node would not be reachable, the cul-de-sac would be unreachable as well.
I've compiled a MWE: https://gofile.io/d/YpgjSw
Below is the python code:
import graph_tool.all as gt
import numpy as np
import time
# construct list of source, target, edge-id (edge-id not really used in this example)
l = []
with open('nw.txt') as f:
    rows = f.readlines()
    for row in rows:
        id = int(row.split('\t')[0])
        source = int(row.split('\t')[1])
        target = int(row.split('\t')[2])
        l.append([source, target, id])
        l.append([target, source, id])
print(len(l))
# construct graph
G = gt.Graph(directed=True)
G.ep["edge_id"] = G.new_edge_property("int")
n = G.add_edge_list(l, hashed=True, eprops=G.ep["edge_id"])
# construct a dict for mapping outside node-id's to internal id's (node indexes)
nd = {}
i = 0
for x in n:
    nd[x] = i
    i = i + 1
# construct a dict for mapping (source, target) combis to a cost and reverse cost
db_wgts = {}
with open('costs.txt') as f:
    rows = f.readlines()
    for row in rows:
        source = int(row.split('\t')[0])
        target = int(row.split('\t')[1])
        cost = float(row.split('\t')[2])
        reverse_cost = float(row.split('\t')[3])
        db_wgts[(source, target)] = cost
        db_wgts[(target, source)] = reverse_cost
# construct an edge property and fill it according to previous dict
wgts = G.new_edge_property("double")
i = 0
for e in G.edges():
    i = i + 1
    print(i)
    print(e)
    s = n[int(e.source())]
    t = n[int(e.target())]
    try:
        wgts[e] = db_wgts[(s, t)]
    except KeyError:
        # this was necessary
        wgts[e] = 1000000
# calculate shortest distance to all nodes within 0.1 total cost from source-node with outside-id of 102481678
dm = G.new_vp("double", np.inf)
gt.shortest_distance(G, source=nd[102481678], weights=wgts, dist_map = dm, max_dist=0.1)
# some mumbo-jumbo for getting the result in a nice node-id: cost format
ar = dm.get_array()
idxs = np.where(dm.get_array() < 0.1)
vals = ar[ar < 0.1]
final_res = [(i, v) for (i,v) in zip(list(idxs[0]), list(vals))]
final_res.sort(key=lambda tup: tup[1])
for x in final_res:
    print(n[x[0]], x[1])
# output saved in result_missing_nodes.txt
# 328 records, should be 349
To illustrate (one of the) missing nodes:
>>> dm[nd[63447311]]
0.0696234786274957
>>> dm[nd[106448775]]
0.06165528930577409
>>> dm[nd[127601733]]
inf
>>> dm[nd[100428293]]
0.0819900275163846
>>>
This doesn't seem possible given the local layout of the network, in which the labels are the IDs referenced above.
This is a numerical precision problem. You have very low edge weights (1e-6) combined with very large values (1000000), which causes differences to be lost to finite precision. If you replace all values of 1000000 (which I assume mean infinite weight) with numpy.inf, you actually get a more stable calculation, and no missing nodes in your example.
An even better alternative is to actually remove the "infinite weight" edges using an edge filter:
u = gt.GraphView(G, efilt=wgts.fa < 1000000)
and compute the distances on that.
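The idea of dropping "infinite weight" edges instead of giving them a huge sentinel cost can be illustrated outside graph-tool. The following is a plain-Python Dijkstra sketch with a distance cutoff, not graph-tool's implementation; `bounded_distances` and the toy adjacency dict are hypothetical:

```python
import heapq
import math

def bounded_distances(adj, source, max_dist):
    """adj maps node -> list of (neighbor, weight); returns {node: dist} with dist <= max_dist."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue                      # stale heap entry
        for v, w in adj.get(u, []):
            if math.isinf(w):
                continue                  # treat infinite weight as "no edge"
            new_d = d + w
            if new_d <= max_dist and new_d < dist.get(v, math.inf):
                dist[v] = new_d
                heapq.heappush(heap, (new_d, v))
    return dist

# Toy network: the a->c edge is blocked, so c is reached through b instead
adj = {
    "a": [("b", 0.03), ("c", math.inf)],
    "b": [("c", 0.04), ("d", 0.2)],
    "c": [("e", 0.02)],
}
reached = bounded_distances(adj, "a", 0.1)
print(sorted(reached))
```

Skipping the blocked edges means no large constant ever enters the distance sums, which is the same effect the edge filter has.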

Constructing graph using nodes and vertices based on some data

I am working with directed graphs given by an adjacency representation. In other words a graph G will be represented by a dictionary whose keys are the vertices and whose values are dictionaries whose keys are the neighbors of a vertex, the values of which may be assigned to 1. Given two vertices u, v in a directed graph G there may be an edge from u to v but not vice versa. It is however possible that there is an edge in both directions.
I have created a function called reachable_vertices which will take a graph G and vertex v, as input and returns a list of all the vertices in G which can be reached from v. If a vertex w can be reached by v this means that there is a chain v → v1 → v2... → w where there is an edge from each vertex in the chain to the one immediately after it. The vertex v does not have to have a particular type such as int or string, it could be either of these, it need only be a key in the dictionary representing the graph G.
I have defined a function called cfb_graph which takes no arguments. I formed a directed graph from the file cfb2010.csv (Link Below) by considering the teams as vertices and creating an edge between team1 and team2 only if team1 defeated team2.
Data set link: https://drive.google.com/open?id=1ZgNjH_QE7if1xHMfRU2-ebd9bNpL2E3d
cfb_graph will return a dictionary giving this representation.
I was able to answer the following questions, and my code for them is attached below:
i. Which teams are not reachable from Auburn. Store them in a list.
ii. Which teams are reachable from Notre Dame. Store them in a list.
iii. Which teams are not reachable from Alabama. Store them in a list.
I am working on the following code:
def reachable(G, v, setA): # Recursively adds every vertex reachable from v to setA
    setA|={v}
    try:
        for w in set(G[v])-setA:
            reachable(G,w,setA)
    except KeyError:
        # v has no outgoing edges (it is not a key of G)
        pass
    return setA
## 2a ##
def reachable_vertices(G, v):
    setA=set()
    setA|={v}
    try:
        for n in set(G[v])-setA:
            reachable(G,n,setA)
    except KeyError:
        pass
    return setA
def cfb_graph():
    svertex = []
    evertex = []
    count= 0
    file = open("cfb2010.csv","r")
    for line in file:
        fields = line.split(",")
        if fields[5].replace("\n", "") == 'W':
            svertex.append(fields[1])
            evertex.append(fields[2])
        if count == 0:
            count = count +1
    graph = {}
    for i in range(len(svertex)):
        v = svertex[i]
        if v in graph:
            graph[v] |= set([evertex[i]])
        else:
            graph[v] = set([evertex[i]])
    for key, value in graph.items():
        graph[key] = dict.fromkeys(value,1)
    return(graph)
######Part 2 c############
auburn_answer = list(set(cfb_graph().keys()).difference(set(reachable_vertices(cfb_graph(), "Auburn"))))
notre_dame_answer = reachable_vertices(cfb_graph(), "Notre Dame")
alabama_answer = list(set(cfb_graph().keys()).difference(set(reachable_vertices(cfb_graph(), "Alabama"))))
In particular, for each vertex I want to return a dictionary where the keys are the reachable vertices and the values are as will now be described. If a vertex w is reachable from a vertex v, there is a path from v to w. The value corresponding to w in the returned dictionary will be the vertex which immediately precedes it in some path from v to w. If I use the queue approach, then the value of w would be the first vertex u in the while loop for which w is a neighbor of u.
Also, I want to define a function called path which will take as input a graph G and two vertices v and w. If w is reachable from v it will return a list of vertices whose first element is v and whose last element is w and the other vertices are those on a path from v to w in the order in which they are traversed. If there is no path I should return None. I will probably want to use the function defined above.
I suppose the graph processing library networkx will help you a lot. It has a large number of algorithms, so you do not need to implement them manually; you can just call a function in your code.
I constructed a small workflow that reproduces all your functionality and solves your problems:
# Imports
import networkx as nx
import csv
# Load CSV file and construct the directed graph
G = nx.DiGraph()
with open('cfb2010.csv', 'r') as f:
    sreader = csv.reader(f, delimiter=',')
    for line in sreader:
        if line[-1] != 'W':
            continue
        G.add_node(line[1])
        G.add_node(line[2])
        G.add_edge(line[1], line[2])
# Get all nodes
all_nodes = set(G.nodes())
# Get nodes that can be reached from the particular node
notredame_nodes = set(nx.bfs_tree(G, 'Notre Dame').nodes())
alabama_nodes = set(nx.bfs_tree(G, 'Alabama').nodes())
auburn_nodes = set(nx.bfs_tree(G, 'Auburn').nodes())
# Construct lists of nodes you need
print(all_nodes - alabama_nodes)
print(all_nodes - auburn_nodes)
print(notredame_nodes)
Networkx also has a function equivalent to your path function:
print(nx.shortest_path(G, 'Florida', 'Illinois'))
['Florida', 'Penn St', 'Michigan', 'Illinois']
P.S. The reachable-node construction uses the BFS algorithm.
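The predecessor dictionary and the path function described in the question can also be written directly against the dict-of-dicts representation. A minimal sketch using a queue-based BFS; the function name `reachable_with_predecessors` and the choice to map the start vertex to None are assumptions made for illustration:

```python
from collections import deque

def reachable_with_predecessors(G, v):
    """Map each vertex reachable from v to the vertex preceding it on a BFS path."""
    pred = {v: None}               # assumption: the start vertex has no predecessor
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in G.get(u, {}):     # vertices without outgoing edges may be missing as keys
            if w not in pred:
                pred[w] = u        # u is the first dequeued vertex that has w as a neighbor
                queue.append(w)
    return pred

def path(G, v, w):
    """Return a list of vertices from v to w, or None if w is not reachable."""
    pred = reachable_with_predecessors(G, v)
    if w not in pred:
        return None
    result = [w]
    while result[-1] != v:
        result.append(pred[result[-1]])
    result.reverse()
    return result

# Small hypothetical graph in the question's representation
G = {"a": {"b": 1, "c": 1}, "b": {"d": 1}, "c": {"d": 1}, "d": {}, "e": {"a": 1}}
print(path(G, "a", "d"))
print(path(G, "a", "e"))
```

Because BFS visits vertices in order of distance, the path recovered this way uses the fewest edges among all paths from v to w.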

Calculating the number of graphs created and the number of vertices in each graph from a list of edges

Given a list of edges such as edges = [[1,2],[2,3],[3,1],[4,5]],
I need to find how many graphs are created; by this I mean how many connected components these edges form. I then need the number of vertices in each component.
However, I am required to handle 10^5 edges, and I am currently having trouble completing the task for a large number of edges.
My algorithm currently takes the list edges = [[1,2],[2,3],[3,1],[4,5]] and merges the lists as sets whenever they intersect; this outputs a new list that contains the components, such as graphs = [[1,2,3],[4,5]].
There are two connected components : [1,2,3] are connected and [4,5] are connected as well.
I would like to know if there is a much better way of doing this task.
def mergeList(edges):
    sets = [set(x) for x in edges if x]
    m = 1
    while m:
        m = 0
        res = []
        while sets:
            common, r = sets[0], sets[1:]
            sets = []
            for x in r:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    m = 1
                    common |= x
            res.append(common)
        sets = res
    return sets
I would like to try doing this in a dictionary or something efficient, because this is toooo slow.
A basic iterative graph traversal in Python isn't too bad.
import collections
def connected_components(edges):
    # build the graph
    neighbors = collections.defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # traverse the graph
    sizes = []
    visited = set()
    for u in neighbors.keys():
        if u in visited:
            continue
        # visit the component that includes u
        size = 0
        agenda = {u}
        while agenda:
            v = agenda.pop()
            visited.add(v)
            size += 1
            agenda.update(neighbors[v] - visited)
        sizes.append(size)
    return sizes
Do you need to write your own algorithm? networkx already has algorithms for this.
To get the length of each component try
import networkx as nx
G = nx.Graph()
G.add_edges_from([[1,2],[2,3],[3,1],[4,5]])
components = []
for graph in nx.connected_components(G):
    components.append([graph, len(graph)])
components
# [[set([1, 2, 3]), 3], [set([4, 5]), 2]]
You could use Disjoint-set data structure:
edges = [[1,2],[2,3],[3,1],[4,5]]
parents = {}
size = {}
def get_ancestor(parents, item):
    # Returns ancestor for a given item and compresses path
    # Recursion would be easier but might blow stack
    stack = []
    while True:
        parent = parents.setdefault(item, item)
        if parent == item:
            break
        stack.append(item)
        item = parent
    for item in stack:
        parents[item] = parent
    return parent

for x, y in edges:
    x = get_ancestor(parents, x)
    y = get_ancestor(parents, y)
    if x == y:
        # already in the same component; skip so the size is not counted twice
        continue
    size_x = size.setdefault(x, 1)
    size_y = size.setdefault(y, 1)
    if size_x < size_y:
        parents[x] = y
        size[y] += size_x
    else:
        parents[y] = x
        size[x] += size_y

print(sum(1 for k, v in parents.items() if k == v)) # 2
In the code above, parents is a dict where vertices are keys and ancestors are values. If a given vertex doesn't have a parent, the value is the vertex itself. For every edge in the list, the ancestors of both vertices are set to be the same. Note that when the current ancestor is queried the path is compressed, so subsequent queries take amortized near-constant time. This gives the whole algorithm near-linear time complexity in the number of edges.
Update
In case the components themselves are required instead of just their number, the resulting dict can be iterated to produce them:
from collections import defaultdict
components = defaultdict(list)
for k in list(parents):
    # look up the root, since a stored parent is not necessarily the root
    components[get_ancestor(parents, k)].append(k)
print(components)
Output:
defaultdict(<class 'list'>, {1: [1, 2, 3], 4: [4, 5]})

Graph has unconnected nodes

So, I'm still new to Python, but I wrote this function to generate a graph that has certain hierarchical features. Sometimes, however, this function creates a graph with one or more unconnected nodes. This should not be happening, because every node should be fully connected with the other nodes below its "level". The characteristics of the hierarchy follow those of Siegel 2009: "The base is a hierarchy determined by the parameter Expansion Rate in which one individual is placed at the top, and each individual in the network is connected to a number of individuals below him equal to Expansion Rate, continuing until no more individuals are left in the population. Thus, while each level of the hierarchy before the last one contains a number of individuals equal to a power of Expansion Rate, the last level may have fewer than this if the total population does not divide appropriately. Each potential tie between individuals within the same level also has a probability equal to Level Connection of being made."
def heirarchy_graph(n,e,l):
    '''
    n is number of nodes in graph,
    e is the Expansion Rate. This is the number of people on each level
    If n/e has a remainder then the last level has that many
    people in it.
    l is the Level connection. It is the probability that a person is connected to someone
    within the level they belong to.
    '''
    G = nx.Graph()
    G.name="heirarchy_graph(%s,%s,%s)"%(n,e,l)
    r = (n-1)%e
    s = (n-r-1)/e
    h = s + 1
    #G = empty_graph(n=0)
    G.add_node(0, level=int(0))
    for i in range(s):
        list = range(1,(e+1))
        A = nx.Graph()
        #for item in list:
        #create e nodes with attribute level='i'
        A.add_nodes_from(list,level=int(i))
        # add edges between nodes with probability l
        names = A.nodes()
        for name in names:
            B = non_neighbors(A,name)
            for u in B:
                q = random.uniform(0,1)
                if q <= l:
                    A.add_edge(u,name)
        #return A
        #print(A)
        G = nx.disjoint_union(G,A)
    if r != 0:
        h = s+1
        list = range(1,(r+1))
        A = nx.Graph()
        #create e nodes with attribute level='i'
        A.add_nodes_from(list,level=int(h))
        # add edges between nodes with probability l
        names = A.nodes()
        for name in names:
            B = non_neighbors(A,name)
            for u in B:
                q = random.uniform(0,1)
                if q <= l:
                    A.add_edge(u,name)
        G = nx.disjoint_union(G,A)
    ## add edges between levels
    level = nx.get_node_attributes(G,'level')
    names = G.nodes()
    for name in names:
        levelname = level[name]
        B = non_neighbors(G,name)
        for u in B:
            levelneighbor = level[u]
            if levelname == (levelneighbor + 1):
                G.add_edge(u,name)
    return G
I ran this code multiple times with n=25,e=5 and l=.5 but oftentimes I end up with nodes 21,22,23 or 24 (or some combination of them) being unconnected (checked by pulling each node's degree centrality).
I would greatly appreciate any help with this. The code runs without errors, I just don't know why it is giving me unconnected nodes. These nodes should be connected with the level above them. Thank you in advance.
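One thing that can be checked without networkx is which level labels the function hands out. The sketch below mirrors only the labelling logic of the code above; it assumes integer division for s (as in Python 2) and that nx.disjoint_union numbers the nodes consecutively in insertion order. `level_labels` is a hypothetical helper, not part of the original function:

```python
def level_labels(n, e):
    """Return the 'level' attribute of nodes 0..n-1 as assigned by the code above."""
    r = (n - 1) % e
    s = (n - r - 1) // e
    labels = [0]                    # the single top node gets level 0
    for i in range(s):
        labels.extend([i] * e)      # full groups are labelled 0 .. s-1
    if r != 0:
        labels.extend([s + 1] * r)  # the last partial group is labelled h = s + 1
    return labels

labels = level_labels(25, 5)
present = set(labels)
# Levels that have no level directly above them to link to
orphan_levels = sorted(lv for lv in present if lv > 0 and lv - 1 not in present)
orphan_nodes = [node for node, lv in enumerate(labels) if lv in orphan_levels]
print(orphan_levels, orphan_nodes)
```

For n=25 and e=5 the full groups are labelled 0 to 3 while the last partial group is labelled 5, so the inter-level rule `levelname == (levelneighbor + 1)` never matches for nodes 21 to 24. Those nodes only receive the random within-level edges, which would explain why some of them occasionally end up with no edges at all.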
