Words instead of numbers in a co-occurrence matrix with sklearn


I have this code. It reads a list of sentences, and then uses sklearn's CountVectorizer to compute word co-occurrences.
from sklearn.feature_extraction.text import CountVectorizer
data = ['this is a sentence', 'this was a monkey', 'all this is nice']
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
X = count_model.fit_transform(data)
Xc = (X.T * X) # this is the co-occurrence matrix in sparse csr format
Xc.setdiag(0) # sometimes you want to set same-word co-occurrence to 0
matrix_dense = Xc.todense() # matrix in dense format
import networkx as nx
G=nx.from_numpy_matrix(matrix_dense)
If I do G.edges(data=True), it outputs this:
[(0, 1, {'weight': 1}),
(0, 3, {'weight': 1}),
(0, 5, {'weight': 1}),
(1, 3, {'weight': 1}),
(1, 4, {'weight': 1}),
(1, 5, {'weight': 2})
and so on. How can I get words instead of numbers as source, target?
EDIT:
This is an attempt:
labels = count_model.get_feature_names() # get the word labels
G=nx.from_numpy_matrix(matrix_dense) # create graph
for node, label in zip(G.nodes(), labels): # add labels to the graph
    G.node[node]['label'] = label

With networkx you can replace one set of node names with another. This is done with relabel_nodes.
Here is the example from the documentation. It creates a 3 node graph and then creates a copy of that graph with the new node names. You can also relabel G in place by setting the optional argument copy to False in the function call.
G = nx.path_graph(3)
sorted(G)
> [0, 1, 2]
mapping = {0: 'a', 1: 'b', 2: 'c'}
H = nx.relabel_nodes(G, mapping)
sorted(H)
> ['a', 'b', 'c']
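Applied to the co-occurrence graph from the question, the same idea might look like the sketch below. To keep it self-contained it does not call CountVectorizer; instead it builds the document-term matrix by hand, assuming the vectorizer's default behaviour of dropping one-letter tokens and sorting the vocabulary. The variable names docs and labels are illustrative, and nx.from_numpy_array is used in place of from_numpy_matrix, which newer networkx releases no longer provide.

```python
import networkx as nx
import numpy as np

data = ['this is a sentence', 'this was a monkey', 'all this is nice']

# Stand-in for CountVectorizer (assumption): keep tokens of 2+ characters
# and use the sorted vocabulary as the column order.
docs = [[w for w in s.split() if len(w) > 1] for s in data]
labels = sorted({w for doc in docs for w in doc})
X = np.array([[doc.count(w) for w in labels] for doc in docs])

# Word-by-word co-occurrence counts, with the diagonal zeroed out
Xc = X.T @ X
np.fill_diagonal(Xc, 0)

# Build the graph on integer nodes, then swap each index for its word
G = nx.from_numpy_array(Xc)
H = nx.relabel_nodes(G, dict(enumerate(labels)))

print(H['is']['this'])  # edge data between the words 'is' and 'this'
```

With a real CountVectorizer, the mapping would be built the same way from the list of feature names it returns.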

Related

How to load in graph from networkx into PyTorch geometric and set node features and labels?

Goal: I am trying to import a graph FROM networkx into PyTorch geometric and set labels and node features.
(This is in Python)
Question(s):
How do I do this [the conversion from networkx to PyTorch geometric]? (presumably by using the from_networkx function)
How do I transfer over node features and labels? (more important question)
I have seen some other/previous posts with this question but they weren't answered (correct me if I am wrong).
Attempt: (I have just used an unrealistic example below, as I cannot post anything real on here)
Let us imagine we are trying to do a graph learning task (e.g. node classification) on a group of cars (not very realistic as I said). That is, we have a group of cars, an adjacency matrix, and some features (e.g. price at the end of the year). We want to predict the node label (i.e. brand of the car).
I will be using the following adjacency matrix: (apologies, cannot use latex to format this)
A = [(0, 1, 0, 1, 1), (1, 0, 1, 1, 0), (0, 1, 0, 0, 1), (1, 1, 0, 0, 0), (1, 0, 1, 0, 0)]
Here is the code (for Google Colab environment):
!pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.10.0+cpu.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import torch
from torch_geometric.utils.convert import to_networkx, from_networkx
# Make the networkx graph
G = nx.Graph()
# Add some cars (just do 5 for now)
G.add_nodes_from([
(1, {'Brand': 'Ford'}),
(2, {'Brand': 'Audi'}),
(3, {'Brand': 'BMW'}),
(4, {'Brand': 'Peugot'}),
(5, {'Brand': 'Lexus'}),
])
# Add some edges
G.add_edges_from([
(1, 2), (1, 4), (1, 5),
(2, 3), (2, 4),
(3, 2), (3, 5),
(4, 1), (4, 2),
(5, 1), (5, 3)
])
# Convert the graph into PyTorch geometric
pyg_graph = from_networkx(G)
So this correctly converts the networkx graph to PyTorch Geometric. However, I still don't know how to properly set the labels.
The brand values for each node have been converted and are stored within:
pyg_graph.Brand
Below, I have just made some random numpy arrays of length 5 for each node (just pretend that these are realistic).
ford_prices = np.random.randint(100, size = 5)
lexus_prices = np.random.randint(100, size = 5)
audi_prices = np.random.randint(100, size = 5)
bmw_prices = np.random.randint(100, size = 5)
peugot_prices = np.random.randint(100, size = 5)
This brings me to the main question:
How do I set the prices to be the node features of this graph?
How do I set the labels of the nodes? (and will I need to remove the labels from pyg_graph.Brand when training the network?)
Thanks in advance and happy holidays.

The easiest way is to add all the information to the networkx graph and build it directly in the form you need. I guess you want to use a Graph Neural Network, in which case you want something like the code below.
Instead of text labels, you probably want a categorical representation, e.g. 1 stands for Ford.
If you want to match the usual convention, name your input features x and your labels/ground truth y.
The data is split into train and test sets via masks. The graph still contains all the information, but only part of it is used for training. Check the PyTorch Geometric introduction for an example, which uses the Cora dataset.
import networkx as nx
import numpy as np
import torch
from torch_geometric.utils.convert import from_networkx
# Make the networkx graph
G = nx.Graph()
# Add some cars (just do 5 for now)
G.add_nodes_from([
(1, {'y': 1, 'x': 0.5}),
(2, {'y': 2, 'x': 0.2}),
(3, {'y': 3, 'x': 0.3}),
(4, {'y': 4, 'x': 0.1}),
(5, {'y': 5, 'x': 0.2}),
])
# Add some edges
G.add_edges_from([
(1, 2), (1, 4), (1, 5),
(2, 3), (2, 4),
(3, 2), (3, 5),
(4, 1), (4, 2),
(5, 1), (5, 3)
])
# Convert the graph into PyTorch geometric
pyg_graph = from_networkx(G)
print(pyg_graph)
# Data(edge_index=[2, 12], x=[5], y=[5])
print(pyg_graph.x)
# tensor([0.5000, 0.2000, 0.3000, 0.1000, 0.2000])
print(pyg_graph.y)
# tensor([1, 2, 3, 4, 5])
print(pyg_graph.edge_index)
# tensor([[0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4],
# [1, 3, 4, 0, 2, 3, 1, 4, 0, 1, 0, 2]])
# Split the data
train_ratio = 0.2
num_nodes = pyg_graph.x.shape[0]
num_train = int(num_nodes * train_ratio)
idx = [i for i in range(num_nodes)]
np.random.shuffle(idx)
train_mask = torch.full_like(pyg_graph.y, False, dtype=bool)
train_mask[idx[:num_train]] = True
test_mask = torch.full_like(pyg_graph.y, False, dtype=bool)
test_mask[idx[num_train:]] = True
print(train_mask)
# tensor([ True, False, False, False, False])
print(test_mask)
# tensor([False, True, True, True, True])
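For the price vectors from the question, one possible way to prepare the graph is sketched below. It only uses networkx and numpy: each node gets a list of five placeholder prices as x and an integer class id as y. The brand_to_id mapping and the random prices are assumptions made for illustration, not part of the original post.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
brands = ['Ford', 'Audi', 'BMW', 'Peugot', 'Lexus']

# Hypothetical encoding: one integer class id per brand name
brand_to_id = {brand: i for i, brand in enumerate(sorted(set(brands)))}

G = nx.Graph()
for node, brand in enumerate(brands, start=1):
    prices = rng.integers(100, size=5)  # placeholder prices, as in the question
    # Store the feature vector as a plain list of floats and the label as an int
    G.add_node(node, x=prices.astype(float).tolist(), y=brand_to_id[brand])

G.add_edges_from([(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 5)])

for node, attrs in G.nodes(data=True):
    print(node, attrs['y'], attrs['x'])
```

Passing this graph to from_networkx, as in the answer above, should then produce one feature row per node, assuming the installed PyG version stacks list-valued node attributes. Because the brand is stored only as y, there is no separate Brand attribute left to remove before training.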

Question about Drawing a graph with networkx

I am using NetworkX to draw graphs. While searching the NetworkX documentation I found the code for the AntiGraph class, which confused me: there are some lines I can't understand. Please help me understand this code.
I attached this code:
import networkx as nx
from networkx.exception import NetworkXError
import matplotlib.pyplot as plt
class AntiGraph(nx.Graph):
    """
    Class for complement graphs.
    The main goal is to be able to work with big and dense graphs with
    a low memory footprint.
    In this class you add the edges that *do not exist* in the dense graph,
    the report methods of the class return the neighbors, the edges and
    the degree as if it was the dense graph. Thus it's possible to use
    an instance of this class with some of NetworkX functions.
    """
    all_edge_dict = {"weight": 1}
    def single_edge_dict(self):
        return self.all_edge_dict
    edge_attr_dict_factory = single_edge_dict
    def __getitem__(self, n):
        """Return a dict of neighbors of node n in the dense graph.
        Parameters
        ----------
        n : node
           A node in the graph.
        Returns
        -------
        adj_dict : dictionary
           The adjacency dictionary for nodes connected to n.
        """
        return {
            node: self.all_edge_dict for node in set(self.adj) - set(self.adj[n]) - {n}
        }
    def neighbors(self, n):
        """Return an iterator over all neighbors of node n in the
        dense graph.
        """
        try:
            return iter(set(self.adj) - set(self.adj[n]) - {n})
        except KeyError as e:
            raise NetworkXError(f"The node {n} is not in the graph.") from e
    def degree(self, nbunch=None, weight=None):
        """Return an iterator for (node, degree) in the dense graph.
        The node degree is the number of edges adjacent to the node.
        Parameters
        ----------
        nbunch : iterable container, optional (default=all nodes)
            A container of nodes. The container will be iterated
            through once.
        weight : string or None, optional (default=None)
            The edge attribute that holds the numerical value used
            as a weight. If None, then each edge has weight 1.
            The degree is the sum of the edge weights adjacent to the node.
        Returns
        -------
        nd_iter : iterator
            The iterator returns two-tuples of (node, degree).
        See Also
        --------
        degree
        Examples
        --------
        >>> G = nx.path_graph(4) # or DiGraph, MultiGraph, MultiDiGraph, etc
        >>> list(G.degree(0)) # node 0 with degree 1
        [(0, 1)]
        >>> list(G.degree([0, 1]))
        [(0, 1), (1, 2)]
        """
        if nbunch is None:
            nodes_nbrs = (
                (
                    n,
                    {
                        v: self.all_edge_dict
                        for v in set(self.adj) - set(self.adj[n]) - {n}
                    },
                )
                for n in self.nodes()
            )
        elif nbunch in self:
            nbrs = set(self.nodes()) - set(self.adj[nbunch]) - {nbunch}
            return len(nbrs)
        else:
            nodes_nbrs = (
                (
                    n,
                    {
                        v: self.all_edge_dict
                        for v in set(self.nodes()) - set(self.adj[n]) - {n}
                    },
                )
                for n in self.nbunch_iter(nbunch)
            )
        if weight is None:
            return ((n, len(nbrs)) for n, nbrs in nodes_nbrs)
        else:
            # AntiGraph is a ThinGraph so all edges have weight 1
            return (
                (n, sum((nbrs[nbr].get(weight, 1)) for nbr in nbrs))
                for n, nbrs in nodes_nbrs
            )
    def adjacency_iter(self):
        """Return an iterator of (node, adjacency set) tuples for all nodes
        in the dense graph.
        This is the fastest way to look at every edge.
        For directed graphs, only outgoing adjacencies are included.
        Returns
        -------
        adj_iter : iterator
           An iterator of (node, adjacency set) for all nodes in
           the graph.
        """
        for n in self.adj:
            yield (n, set(self.adj) - set(self.adj[n]) - {n})
# Build several pairs of graphs, a regular graph
# and the AntiGraph of it's complement, which behaves
# as if it were the original graph.
Gnp = nx.gnp_random_graph(20, 0.8, seed=42)
Anp = AntiGraph(nx.complement(Gnp))
Gd = nx.davis_southern_women_graph()
Ad = AntiGraph(nx.complement(Gd))
Gk = nx.karate_club_graph()
Ak = AntiGraph(nx.complement(Gk))
pairs = [(Gnp, Anp), (Gd, Ad), (Gk, Ak)]
# test connected components
for G, A in pairs:
    gc = [set(c) for c in nx.connected_components(G)]
    ac = [set(c) for c in nx.connected_components(A)]
    for comp in ac:
        assert comp in gc
# test biconnected components
for G, A in pairs:
    gc = [set(c) for c in nx.biconnected_components(G)]
    ac = [set(c) for c in nx.biconnected_components(A)]
    for comp in ac:
        assert comp in gc
# test degree
for G, A in pairs:
    node = list(G.nodes())[0]
    nodes = list(G.nodes())[1:4]
    assert G.degree(node) == A.degree(node)
    assert sum(d for n, d in G.degree()) == sum(d for n, d in A.degree())
    # AntiGraph is a ThinGraph, so all the weights are 1
    assert sum(d for n, d in A.degree()) == sum(d for n, d in A.degree(weight="weight"))
    assert sum(d for n, d in G.degree(nodes)) == sum(d for n, d in A.degree(nodes))
nx.draw(Gnp)
plt.show()
I can't understand these 2 lines:
(1) for v in set(self.adj) - set(self.adj[n]) - {n}
(2) nbrs = set(self.nodes()) - set(self.adj[nbunch]) - {nbunch}

To understand these lines, let's break down each term carefully. For the purpose of explanation, I will create the following Graph:
import networkx as nx
source = [1, 2, 3, 4, 2, 3]
dest = [2, 3, 4, 6, 5, 5]
edge_list = [(u, v) for u, v in zip(source, dest)]
G = nx.Graph()
G.add_edges_from(edge_list)
The Graph has the following edges:
print(G.edges())
# EdgeView([(1, 2), (2, 3), (2, 5), (3, 4), (3, 5), (4, 6)])
Now let's understand the terms in the above code:
set(self.adj)
If we print this out for our Graph G, we can see it is the set of nodes in the Graph:
print(set(G.adj))
# {1, 2, 3, 4, 5, 6}
set(self.adj[n])
This is the set of nodes adjacent to node n:
print(set(G.adj[2]))
# {1, 3, 5}
Now let's look at the first line that you asked about in your question:
for v in set(self.adj) - set(self.adj[n]) - {n}
This can be translated as follows:
for v in set of all nodes - set of nodes adjacent to node N - node N
So, set of all nodes - set of nodes adjacent to node N returns the set of nodes that are not adjacent to node N (this still includes node N itself), and subtracting {n} then removes node N. (Essentially, this gives the neighbors of node N in the complement of the Graph.)
Let's look at an example:
nodes_nbrs = (
    (
        n,
        {
            v: {'weight': 1}
            for v in set(G.adj) - set(G.adj[n]) - {n}
        },
    )
    for n in G.nodes()
)
This will have the following value:
Node 1: {3: {'weight': 1}, 4: {'weight': 1}, 5: {'weight': 1}, 6: {'weight': 1}}
Node 2: {4: {'weight': 1}, 6: {'weight': 1}}
Node 3: {1: {'weight': 1}, 6: {'weight': 1}}
Node 4: {1: {'weight': 1}, 2: {'weight': 1}, 5: {'weight': 1}}
Node 6: {1: {'weight': 1}, 2: {'weight': 1}, 3: {'weight': 1}, 5: {'weight': 1}}
Node 5: {1: {'weight': 1}, 4: {'weight': 1}, 6: {'weight': 1}}
So if you look closely, for each node we get the set of nodes that were not adjacent to that node.
For node 2, say, the calculation looks like this:
{1, 2, 3, 4, 5, 6} - {1, 3, 5} - {2} = {4, 6}
Now let's come to the second line:
nbrs = set(self.nodes()) - set(self.adj[nbunch]) - {nbunch}
This line sits in the elif nbunch in self: branch, so here nbunch is a single node that is in the graph, not a collection of nodes. set(self.adj[nbunch]) is therefore the set of nodes adjacent to that one node, exactly like set(self.adj[n]) in the first line, and set(self.nodes()) is again the set of all nodes.
So the expression can be translated as follows:
Set of all nodes - Set of nodes adjacent to node nbunch - node nbunch
This is the same as the first expression, applied to a single node. The method then returns len(nbrs), which is the degree of that node in the dense graph. When nbunch is a collection of nodes, the else branch handles it by iterating over self.nbunch_iter(nbunch) and applying the first expression to each node.
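The set arithmetic can also be checked against nx.complement on the small example Graph. This is only a sketch: the helper name non_neighbors_of is made up for illustration and is not part of the AntiGraph class.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 6), (2, 5), (3, 5)])

def non_neighbors_of(graph, n):
    """All nodes, minus the neighbors of n, minus n itself (hypothetical helper)."""
    return set(graph.adj) - set(graph.adj[n]) - {n}

# The result for every node should match its neighborhood in the complement graph
complement = nx.complement(G)
for n in G:
    assert non_neighbors_of(G, n) == set(complement.adj[n])

# Single-node case: the size of the set is the degree of node 2 in the complement
nbrs = non_neighbors_of(G, 2)
print(nbrs, len(nbrs), complement.degree(2))
```

This is why AntiGraph can store only the missing edges and still answer neighbor and degree queries as if it held the dense graph.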

Python/NetworkX: Add Weights to Edges by Frequency of Edge Occurrence

I have a MultiDiGraph created in networkx to which I am trying to add edge weights, after which I want to assign a new weight based on the frequency/count of each edge's occurrence. I used the following code to create the graph and add weights, but I'm not sure how to tackle reassigning weights based on count:
import csv
import networkx as nx
import pandas as pd
g = nx.MultiDiGraph()
df = pd.read_csv(r'G:\cluster_centroids.csv', delimiter=',')
df['pos'] = list(zip(df.longitude, df.latitude))
dict_pos = dict(zip(df.cluster_label, df.pos))
# print(dict_pos)
for row in csv.reader(open(r'G:\edges.csv', 'r')):
    if '[' in row[1]:  # only rows whose second field holds an edge list
        g.add_edges_from(eval(row[1]))
for u, v, d in g.edges(data=True):
    d['weight'] = 1
for u, v, d in g.edges(data=True):
    print(u, v, d)
Edit
I was able to successfully assign weights to each edge, first part of my original question, with the following:
for u, v, d in g.edges(data=True):
    d['weight'] = 1
for u, v, d in g.edges(data=True):
    print(u, v, d)
However, I am still unable to reassign weights based on the number of times an edge occurs (a single edge in my graph can occur multiple times)? I need to accomplish this in order to visualize edges with a higher count differently than edges with a lower count (using edge color or width). I'm not sure how to proceed with reassigning weights based on count, please advise. Below are sample data, and links to my full data set.
Data
Sample Centroids(nodes):
cluster_label,latitude,longitude
0,39.18193382,-77.51885109
1,39.18,-77.27
2,39.17917928,-76.6688633
3,39.1782,-77.2617
4,39.1765,-77.1927
5,39.1762375,-76.8675441
6,39.17468,-76.8204499
7,39.17457332,-77.2807235
8,39.17406072,-77.274685
9,39.1731621,-77.2716502
10,39.17,-77.27
Sample Edges:
user_id,edges
11011,"[[340, 269], [269, 340]]"
80973,"[[398, 279]]"
608473,"[[69, 28]]"
2139671,"[[382, 27], [27, 285]]"
3945641,"[[120, 422], [422, 217], [217, 340], [340, 340]]"
5820642,"[[458, 442]]"
6060732,"[[291, 431]]"
6912362,"[[68, 27]]"
7362602,"[[112, 269]]"
Full data:
Centroids(nodes):https://drive.google.com/open?id=0B1lvsCnLWydEdldYc3FQTmdQMmc
Edges: https://drive.google.com/open?id=0B1lvsCnLWydEdEtfM2E3eXViYkk
UPDATE
I was able to solve, at least temporarily, the issue of overly disproportional edge widths due to high edge weight by setting a minLineWidth and multiplying it by the weight:
minLineWidth = 0.25
for u, v, d in g.edges(data=True):
    d['weight'] = c[u, v]*minLineWidth
edges,weights = zip(*nx.get_edge_attributes(g,'weight').items())
and using width=[d['weight'] for u,v, d in g.edges(data=True)] in nx.draw_networkx_edges() as provided in the solution below.
Additionally, I was able to scale color using the following:
import matplotlib.cm as cmx
import matplotlib.colors as colors
import matplotlib.pyplot as plt
# Set Edge Color based on weight
values = range(7958) # this is based on the number of edges in the graph, use print(len(g.edges())) to determine this
jet = cm = plt.get_cmap('YlOrRd')
cNorm = colors.Normalize(vmin=0, vmax=values[-1])
scalarMap = cmx.ScalarMappable(norm=cNorm, cmap=jet)
colorList = []
for i in range(7958):
    colorVal = scalarMap.to_rgba(values[i])
    colorList.append(colorVal)
And then using the argument edge_color=colorList in nx.draw_networkx_edges().
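Note that the colour list above is indexed by edge position rather than by edge weight. If the intent is for the colour to follow the weight, a minimal sketch could normalise the weights themselves. The weights list here is made-up sample data standing in for [d['weight'] for u, v, d in g.edges(data=True)].

```python
from matplotlib import cm, colors

# Hypothetical edge weights, in the same order as g.edges()
weights = [1, 1, 2, 5, 3]

# Map the smallest weight to the start of the colormap and the largest to the end
norm = colors.Normalize(vmin=min(weights), vmax=max(weights))
mappable = cm.ScalarMappable(norm=norm, cmap='YlOrRd')

# One RGBA tuple per edge
edge_colors = [mappable.to_rgba(w) for w in weights]
print(edge_colors[0], edge_colors[3])
```

The resulting list can be passed as edge_color in nx.draw_networkx_edges(), provided it is built in the same order in which the edges are drawn.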

Try this on for size.
Note: I added a duplicate of an existing edge, just to show the behavior when there are repeats in your multigraph.
from collections import Counter
c = Counter(g.edges()) # Contains frequencies of each directed edge.
for u, v, d in g.edges(data=True):
    d['weight'] = c[u, v]
print(list(g.edges(data=True)))
#[(340, 269, {'weight': 1}),
# (340, 340, {'weight': 1}),
# (269, 340, {'weight': 1}),
# (398, 279, {'weight': 1}),
# (69, 28, {'weight': 1}),
# (382, 27, {'weight': 1}),
# (27, 285, {'weight': 2}),
# (27, 285, {'weight': 2}),
# (120, 422, {'weight': 1}),
# (422, 217, {'weight': 1}),
# (217, 340, {'weight': 1}),
# (458, 442, {'weight': 1}),
# (291, 431, {'weight': 1}),
# (68, 27, {'weight': 1}),
# (112, 269, {'weight': 1})]
Edit: To visualize the graph with edge weights as thicknesses, use this:
nx.draw_networkx(g, width=[d['weight'] for _, _, d in g.edges(data=True)])
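For a self-contained check of the counting step, here is a small sketch on a hand-made MultiDiGraph. It also shows one optional follow-up that is not in the answer above: collapsing the parallel edges into a plain DiGraph that carries the count as its weight, which avoids drawing the same edge several times. The edge list is sample data chosen for illustration.

```python
from collections import Counter

import networkx as nx

g = nx.MultiDiGraph()
# Sample edges; (27, 285) is deliberately repeated
g.add_edges_from([(382, 27), (27, 285), (27, 285), (340, 269), (269, 340)])

# Count how often each directed (u, v) pair occurs
counts = Counter(g.edges())

# Write the count onto every parallel edge, as in the answer
for u, v, d in g.edges(data=True):
    d['weight'] = counts[u, v]

# Optional: one edge per (u, v) pair, weighted by its frequency
h = nx.DiGraph()
for (u, v), n in counts.items():
    h.add_edge(u, v, weight=n)

print(list(h.edges(data=True)))
```

Because the edges are directed, (340, 269) and (269, 340) are counted separately.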

How to extract positive and negative subnetworks from a signed network

With this code I found the list of all subgraphs, and I am now trying to extract all positive and negative subnetworks, but I could not work out the logic for this. Can anyone help me?
import networkx as nx
from networkx.algorithms.components.connected import connected_components
import matplotlib.pyplot as plt
G = nx.read_edgelist('/home/suman/Desktop/dataset/CA-GrQc.txt', create_using = None, nodetype=int,edgetype=int)
H=nx.connected_component_subgraphs(G)
for i in H:
    print(list(i))
pos=nx.spring_layout(G)
nx.draw(G,pos=pos)
nx.draw_networkx_labels(G,pos=pos)
plt.show()

I think what you're after is to create the network made up of just negative edges and the network made up of just positive edges.
If so, here is some code to do that (edited to account for the fact that add_edges_from can handle weighted edges - I had misread the documentation):
G=nx.Graph()
G.add_edges_from([(1,3),(2,4),(3,5),(4,6)], weight = 1)
G.add_edges_from([(1,2),(2,3),(3,4),(4,5)], weight = -1)
pos_edges = [(u,v,w) for (u,v,w) in G.edges(data=True) if w['weight']>0]
neg_edges = [(u,v,w) for (u,v,w) in G.edges(data=True) if w['weight']<0]
Hpos = nx.Graph()
Hneg = nx.Graph()
Hpos.add_edges_from(pos_edges)
Hneg.add_edges_from(neg_edges)
Hneg.edges(data=True)
> [(1, 2, {'weight': -1}),
(2, 3, {'weight': -1}),
(3, 4, {'weight': -1}),
(4, 5, {'weight': -1})]
Hpos.edges(data=True)
> [(1, 3, {'weight': 1}),
(2, 4, {'weight': 1}),
(3, 5, {'weight': 1}),
(4, 6, {'weight': 1})]
Please let me know if this is what you're after. I have to go now so I can't give detailed explanation, but if you have some comments on what does/does not make sense, I will respond later.
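If the goal is the connected pieces of each sign, the same split can be followed by a component search. The sketch below uses the example graph from the answer, with Graph.edge_subgraph as an alternative way to build the two networks. Whether this matches the intended meaning of "subnetworks" is an assumption.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 3), (2, 4), (3, 5), (4, 6)], weight=1)
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5)], weight=-1)

# Split the edge list by the sign of the weight
pos_edges = [(u, v) for u, v, w in G.edges(data='weight') if w > 0]
neg_edges = [(u, v) for u, v, w in G.edges(data='weight') if w < 0]

# edge_subgraph keeps the edge attributes; copy() makes independent graphs
Hpos = G.edge_subgraph(pos_edges).copy()
Hneg = G.edge_subgraph(neg_edges).copy()

# Connected components within each signed network
pos_components = [set(c) for c in nx.connected_components(Hpos)]
neg_components = [set(c) for c in nx.connected_components(Hneg)]
print(pos_components)
print(neg_components)
```

Nodes that have no edge of a given sign, such as node 6 in the negative network here, do not appear in that network at all.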

Coordinates to graph

Is there a simpler, easier way to convert coordinates (long, lat) to a "networkx"-graph, than nested looping over those coordinates and adding weighted nodes/edges for each one?
for idx1, itm1 in enumerate(data):
    for idx2, itm2 in enumerate(data):
        pos1 = (itm1["lng"], itm1["lat"])
        pos2 = (itm2["lng"], itm2["lat"])
        distance = vincenty(pos1, pos2).meters #geopy distance
        # print(idx1, idx2, distance)
        graph.add_edge(idx1, idx2, weight=distance)
The goal is to represent the points as a graph so that several graph functions can be used on it.
Edit: Using an adjacency_matrix would still need a nested loop

You'll have to do some kind of loop. But if you are using an undirected graph you can eliminate half of the graph.add_edge() calls (you only need to add u-v and not v-u). Also, as @EdChum suggests, you can use graph.add_weighted_edges_from() to make it go faster.
Here is a nifty way to do it:
In [1]: from itertools import combinations
In [2]: import networkx as nx
In [3]: data = [10,20,30,40]
In [4]: edges = ( (s[0],t[0],s[1]+t[1]) for s,t in combinations(enumerate(data),2))
In [5]: G = nx.Graph()
In [6]: G.add_weighted_edges_from(edges)
In [7]: G.edges(data=True)
Out[7]:
[(0, 1, {'weight': 30}),
(0, 2, {'weight': 40}),
(0, 3, {'weight': 50}),
(1, 2, {'weight': 50}),
(1, 3, {'weight': 60}),
(2, 3, {'weight': 70})]
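Applied to coordinates, the same combinations pattern might look like the sketch below. To stay self-contained it uses a hand-written haversine distance instead of geopy, so the distances are spherical approximations rather than the ellipsoidal values from the question. The haversine_m helper and the sample points are assumptions for illustration.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

import networkx as nx

def haversine_m(p, q):
    """Great-circle distance in meters between two (lng, lat) pairs (hypothetical helper)."""
    lng1, lat1, lng2, lat2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))  # mean Earth radius of 6371 km

# Sample points in the same dict layout as the question's data
data = [
    {"lng": -77.27, "lat": 39.18},
    {"lng": -76.67, "lat": 39.18},
    {"lng": -77.19, "lat": 39.17},
]

graph = nx.Graph()
# combinations() yields each unordered pair once, so no v-u duplicates and no self-loops
graph.add_weighted_edges_from(
    (i, j, haversine_m((a["lng"], a["lat"]), (b["lng"], b["lat"])))
    for (i, a), (j, b) in combinations(enumerate(data), 2)
)
print(graph.edges(data=True))
```

For n points this adds n*(n-1)/2 edges, compared with n*n add_edge calls in the nested loop.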
