I have a dataset with the fields movie_id and genres; it has 103110 rows in total. I am trying to construct a network graph of movies in which two movies are connected if they share a genre. Because the dataset is large, this is taking too much time.
movie_id | genres
1 | ab,bc,ca
5 | dd,aa
20 | ab,zz
22 | aa,bb
33 | cc,rr
In this example, movie 1 and movie 20 will have an edge (both have ab); similarly, movie 5 and movie 22 will have an edge (both have aa). I am using the following code, but it is taking too much time. Please suggest a faster approach.
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd

rdf = pd.read_csv("movies.csv")
g = nx.Graph()
g.add_nodes_from(rdf['movieId'].tolist())
eg = []
# compare every row with every other row: O(n^2) iterations
for i, r in rdf.iterrows():
    for j, r1 in rdf.iterrows():
        # substring test on the raw genre strings
        if r['genres'] in r1['genres']:
            eg.append((r['movieId'], r1['movieId']))
g.add_edges_from(eg)
nx.write_adjlist(g, "movie_graph.csv")
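One possible way to avoid the nested loop, sketched here under the assumption that "connected" means "shares at least one genre" (as in the example above): index the movies by genre once, then connect only the movies that appear under the same genre. The column names follow the sample table (movie_id, genres), and the inline DataFrame stands in for the real file.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx
import pandas as pd

# Sample rows from the question; the real data would come from the CSV file.
rdf = pd.DataFrame(
    {
        "movie_id": [1, 5, 20, 22, 33],
        "genres": ["ab,bc,ca", "dd,aa", "ab,zz", "aa,bb", "cc,rr"],
    }
)

# Build an inverted index: genre -> list of movies tagged with that genre.
movies_by_genre = defaultdict(list)
for movie_id, genres in zip(rdf["movie_id"].tolist(), rdf["genres"].tolist()):
    for genre in genres.split(","):
        movies_by_genre[genre.strip()].append(movie_id)

g = nx.Graph()
g.add_nodes_from(rdf["movie_id"].tolist())

# Only movies listed under the same genre can share an edge.
for movies in movies_by_genre.values():
    g.add_edges_from(combinations(movies, 2))
```

This replaces the row-by-row comparison with one pass over the data plus one pass per genre. Note that a genre shared by m movies still produces m(m-1)/2 edges, so a very common genre can make the graph itself enormous regardless of how it is built.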
I have an intermediate level of Python and have used it before to plot some pretty nice graphs for academic purposes. Recently I ended up with a nice DataFrame of agreements between regulators and want to create a network graph from it, but it seems a little more complicated than I thought.
Party = Nodes
Edge = Agreements (type)
The idea is to identify the centrality of the parties (John, for example, may have many agreements with different parties, while Mary has only one, but with two parties) and to display the different types of agreement in different colors.
My data frame is more or less like this:
YEAR | PARTIES | TYPE OF AGREEMENT
2005 | John, Ann | Complex Agreement
2010 | John, Mary, Rupert | Crossfunctional Agreement
... | ... | ...
Any ideas/suggestions?
This might get you going.
from itertools import combinations

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

df = pd.DataFrame(
    {
        "YEAR": [2005, 2010],
        "PARTIES": ["John, Ann", "John, Mary, Rupert"],
        "TYPE OF AGREEMENT": ["Complex Agreement", "Crossfunctional Agreement"],
    }
)
df["PARTIES"] = df["PARTIES"].str.split(", ")

graph = nx.MultiGraph()
for ix, row in df.iterrows():
    for combo in combinations(row["PARTIES"], 2):
        graph.add_edge(*combo, year=row["YEAR"], type=row["TYPE OF AGREEMENT"])

nx.draw(graph, with_labels=True)
plt.savefig("graph.png")
saves a PNG like this:
You can refer to the NetworkX docs for, e.g., the centrality metrics and how to style and label edges better.
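As a rough illustration of the centrality part, the sketch below rebuilds the same two-row sample and computes degree centrality with nx.degree_centrality. Degree centrality is only one of several metrics NetworkX offers and is chosen here as an assumption about what "centrality" should mean for this data; the color names in the palette are likewise arbitrary.

```python
from itertools import combinations

import networkx as nx

# The two sample agreements from the question: (year, parties, type).
agreements = [
    (2005, ["John", "Ann"], "Complex Agreement"),
    (2010, ["John", "Mary", "Rupert"], "Crossfunctional Agreement"),
]

graph = nx.MultiGraph()
for year, parties, kind in agreements:
    for a, b in combinations(parties, 2):
        graph.add_edge(a, b, year=year, type=kind)

# Degree centrality: a node's degree divided by (number of nodes - 1).
centrality = nx.degree_centrality(graph)
most_central = max(centrality, key=centrality.get)

# One color per agreement type, in the same order as graph.edges().
palette = {"Complex Agreement": "tab:red", "Crossfunctional Agreement": "tab:blue"}
edge_colors = [palette[data["type"]] for _, _, data in graph.edges(data=True)]
```

A list like edge_colors can be passed to the drawing call through its edge_color argument so that each agreement type gets its own color.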
Let’s suppose we have the below data (a sample of my whole dataset that counts thousands of rows):
Node Target
1 2
1 3
1 5
2 1
2 3
2 6
7 8
7 12
9 13
9 15
9 14
Clearly, if I plot this data as a graph, it splits into components that are disconnected from each other.
I am wondering how to isolate or remove one component from my network, e.g., the smallest one.
I would say that I should first identify the components, then filter out the component(s) that I am not interested in.
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
connected_com=[len(c) for c in sorted(nx.connected_components(G), key=len, reverse=True)]
Now I should create a network only with data from the largest component:
largest_cc = max(nx.connected_components(G), key=len)
This is easy in the case of two components. However, if I would like to select two components and exclude one, how should I do it? This is my question.
In the example data you provided, 3 islands are obtained when plotting the graph with the code below:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
df=pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
nx.draw(G,with_labels=True)
And the graph looks like that:
Now if you want to only keep the biggest two islands you can use the nx.connected_components(G) function that you mentioned and store the two biggest components. Below is the code to do this:
N_subs = 2  # number of biggest islands you want to keep
# sort the components once, largest first
components = sorted(nx.connected_components(G), key=len, reverse=True)
largest_components = components[:N_subs]
G_sub = [G.subgraph(c) for c in largest_components]
You will then need to create a subgraph of G that is composed of both islands. You can use nx.compose_all to do that. And you can then just plot your subgraph
G_subgraphs=nx.compose_all(G_sub)
nx.draw(G_subgraphs,with_labels=True)
So overall the code looks like that:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx

df = pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')

N_subs = 2  # number of biggest islands you want to keep
# sort the components once, largest first
components = sorted(nx.connected_components(G), key=len, reverse=True)
largest_components = components[:N_subs]
G_sub = [G.subgraph(c) for c in largest_components]

G_subgraphs = nx.compose_all(G_sub)
nx.draw(G_subgraphs, with_labels=True)
And the output of this code gives:
Note: nx.connected_components is only defined for undirected graphs. nx.from_pandas_edgelist builds an undirected Graph by default, so the code above works as is. If you build a directed graph instead (for example with create_using=nx.DiGraph), use nx.strongly_connected_components(G) or nx.weakly_connected_components(G).
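For reference, the same selection can also be sketched without nx.compose_all: take the union of the node sets of the components to keep and ask for a single subgraph. The inline DataFrame reproduces the sample data from the question, and n_keep is a hypothetical name for the number of components to retain.

```python
import networkx as nx
import pandas as pd

# Sample edge list from the question.
df = pd.DataFrame(
    {
        "Node": [1, 1, 1, 2, 2, 2, 7, 7, 9, 9, 9],
        "Target": [2, 3, 5, 1, 3, 6, 8, 12, 13, 15, 14],
    }
)
G = nx.from_pandas_edgelist(df, "Node", "Target")

# Components as node sets, largest first.
components = sorted(nx.connected_components(G), key=len, reverse=True)

n_keep = 2  # number of largest components to retain
keep_nodes = set().union(*components[:n_keep])

# .copy() returns an independent graph instead of a read-only view of G.
G_kept = G.subgraph(keep_nodes).copy()
```

To keep an arbitrary selection rather than the largest ones, index into components directly, e.g. components[0] and components[2].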
Currently my data frame consists of both numerical and categorical values (mixed data types). It looks like this:
id age txn_duration Statename amount gender religion
1 27 275 bihar 110 m hindu
2 33 163 maharashtra 50 f muslim
3 53 63 delhi 50 f muslim
4 47 100 up 50 m hindu
5 39 263 punjab 100 m punjabi
6 41 303 delhi 50 m punjabi
There are 20 states (Statename) and 7 religions. I applied get_dummies to both Statename and religion but got a lot of noise. I have also run outlier detection. My questions are:
1. How do I find the optimum number of clusters for mixed data types?
2. I am currently using the k-means algorithm, but I am not getting good results with it. Can I use k-modes or any other method that would improve my results?
3. How do I interpret my cluster results? I have used
print (cluster_data[clmns].groupby(['clusters']).mean())
Is there any other way I can inspect or plot them? Please provide the code.
My code is -
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
#Importing libraries
import os
import matplotlib.pyplot as plt#visualization
from PIL import Image
%matplotlib inline
import seaborn as sns#visualization
import itertools
import warnings
warnings.filterwarnings("ignore")
import io
from scipy import stats
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes
cluster_data = pd.read_csv("cluster.csv")
cluster_data = pd.get_dummies(cluster_data, columns=['StateName'])
cluster_data = pd.get_dummies(cluster_data, columns=['gender'])
cluster_data = pd.get_dummies(cluster_data, columns=['religion'])
clmns = ['mobile', 'age', 'txn_duration', 'amount', 'StateName_Bihar',
'StateName_Delhi', 'StateName_Gujarat', 'StateName_Karnataka',
'StateName_Maharashtra', 'StateName_Punjab', 'StateName_Rajasthan',
'StateName_Telangana', 'StateName_Uttar Pradesh',
'StateName_West Bengal', 'gender_female',
'gender_male', 'religion_buddhist',
'religion_christian', 'religion_hindu',
'religion_jain', 'religion_muslim',
'religion_other', 'religion_sikh']
df_tr_std = stats.zscore(cluster_data[clmns])
#Cluster the data
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
#Glue back to original data
cluster_data['clusters'] = labels
clmns.extend(['clusters'])
#Lets analyze the clusters
print (cluster_data[clmns].groupby(['clusters']).mean())
You can run something like this code:
Look at the image attached, in that plot you can see that having more than 3 clusters (for the dataset it was run on) does not provide a significant decrease in distortion. So optimum cluster number would be 3 in that case (simple synthetic data). For noisy data the decision might be harder.
Reference: A. Mueller's scipy notes on sklearn
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is the numeric feature matrix to cluster
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
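When the elbow in the distortion plot is hard to read, one alternative (a different technique from the elbow method above) is to compare silhouette scores across candidate cluster counts. The sketch below uses synthetic blobs from make_blobs purely as stand-in data; the centers and the range of k are assumptions for illustration, and real, noisy data may not give such a clear answer.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated groups in two dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Silhouette score needs at least two clusters, so start at k=2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest mean silhouette is the suggested choice.
best_k = max(scores, key=scores.get)
```

The silhouette score lies between -1 and 1, and higher values indicate tighter, better separated clusters.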
Edit for ValueError:
KMeans needs numeric columns only, so you can drop the categorical ones like this:
df_numerics = df.drop(['Statename', 'gender', 'religion'], axis=1)
You can also drop other columns that you don't want included in clustering analysis.
With df_numerics, try the elbow method to find a good number of clusters.
Then, let's say you found out that 3 clusters was good, you can run:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
# cluster on the numeric-only frame built above
labels = kmeans.fit_predict(df_numerics)
labels contains the cluster number (0, 1, 2 for 3 clusters) for each row in your dataframe. You can also save this as a column in your dataframe:
df['cluster_labels'] = labels
Then, to visualize it, you can pick 2 columns (more than that is difficult to visualize). Let's say you picked 'txn_duration' and 'amount'; you can plot those columns and use the cluster labels as the color, like this:
import matplotlib.pyplot as plt
plt.scatter(df['txn_duration'],df['amount'], c=df['cluster_labels'])
I am using hierarchical clustering from seaborn.clustermap to cluster my data. This works fine to nicely visualize the clusters in a heatmap. However, now I would like to extract all row values that are assigned to the different clusters.
This is what my data looks like:
import pandas as pd
# load DataFrame
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
df
log_HU1 log_HU2
EEF1A1 13.439499 13.746856
HSPA8 13.169191 12.983910
FTH1 13.861164 13.511200
PABPC1 12.142340 11.885885
TFRC 11.261368 10.433607
RPL26 13.837205 13.934710
NPM1 12.381585 11.956855
RPS4X 13.359880 12.588574
EEF2 11.076926 11.379336
RPS11 13.212654 13.915813
RPS2 12.910164 13.009184
RPL11 13.498649 13.453234
CA1 9.060244 13.152061
RPS3 11.243343 11.431791
YBX1 12.135316 12.100374
ACTB 11.592359 12.108637
RPL4 12.168588 12.184330
HSP90AA1 10.776370 10.550427
HSP90AB1 11.200892 11.457365
NCL 11.366145 11.060236
Then I perform the clustering using seaborn as follows:
import seaborn as sns
fig = sns.clustermap(df)
Which produces the following clustermap:
For this example I may be able to manually interpret the values belonging to each cluster (e.g. that TFRC and HSP90AA1 cluster together). However, I am planning to do this kind of clustering analysis on much bigger data sets.
So my question is: does anyone know how to get the row values belonging to each cluster?
Thanks,
Using the scipy.cluster.hierarchy module with fcluster allows cluster retrieval:
import pandas as pd
import seaborn as sns
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)

# retrieve clusters using fcluster
d = pdist(df)
L = sch.linkage(d, method='complete')
# 0.2 can be modified to retrieve more stringent or relaxed clusters
clusters = sch.fcluster(L, 0.2*d.max(), 'distance')

# cluster indices correspond to the indices of the original df
for i, cluster in enumerate(clusters):
    print(df.index[i], cluster)
Out:
EEF1A1 2
HSPA8 1
FTH1 2
PABPC1 3
TFRC 5
RPL26 2
NPM1 3
RPS4X 1
EEF2 4
RPS11 2
RPS2 1
RPL11 2
CA1 6
RPS3 4
YBX1 3
ACTB 3
RPL4 3
HSP90AA1 5
HSP90AB1 4
NCL 4
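To go from one label per row to the list of rows in each cluster, the labels can be grouped by value. The sketch below is a minimal illustration on four genes whose values are rounded from the question's table; the 0.5 threshold factor and the names assignments and members are assumptions chosen for this small sample.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Four rows with values rounded from the question's data.
df = pd.DataFrame(
    {"log_HU1": [13.4, 13.8, 11.2, 10.7], "log_HU2": [13.7, 13.5, 10.4, 10.5]},
    index=["EEF1A1", "FTH1", "TFRC", "HSP90AA1"],
)

d = pdist(df)                        # condensed Euclidean distance matrix
L = linkage(d, method="complete")    # complete-linkage hierarchy
labels = fcluster(L, 0.5 * d.max(), criterion="distance")

# One cluster label per row, indexed by the row name.
assignments = pd.Series(labels, index=df.index, name="cluster")

# Map each cluster label to the list of row names it contains.
members = {
    label: list(rows)
    for label, rows in assignments.groupby(assignments).groups.items()
}
```

Each list in members can then be used to pull the matching rows, for example with df.loc[members[label]].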
A small snippet of my dataframe is given below.
UserID Recommendations
0 A001 [(B000OR5928, 5.671419620513916), (B000A1HU1G, 5.435572624206543), (B0039HBNMA, 5.4260640144348145), (B000EEGAJW, 5.502416133880615), (B001L8KE06, 5.508320331573486), (B0002ZO60I, 5.640686511993408), (B0002D0096, 5.543562412261963), (B0013PU75Y, 5.452023506164551), (B005M0TKL8, 5.481754302978516), (B001PGXHYO, 5.5017194747924805)]
1 A002 [(B000EEGAJW, 4.382242679595947), (B004ZKIHVU, 4.182255268096924), (B000CBE3GE, 4.242227077484131), (B000CCJP4I, 4.354374408721924), (B000VBC5CY, 4.342846393585205), (B0002KZHQA, 4.127199649810791), (B0026RB0G8, 4.246310234069824), (B0002D0CQC, 4.275753021240234), (B0002M6CVC, 4.679849624633789), (B0002D0KOG, 4.138158321380615)]
The dataframe contains two columns, UserID and Recommendations. The Recommendations column contains, as a list, the product IDs of the products recommended to that user along with their ratings.
What I want is that when I click on user A001, a graph is displayed. The y-axis of the graph should show the product IDs recommended to A001 and the x-axis should show the rating of each product. This should work for every UserID.
I know how to plot a graph with single values using matplotlib, but here each row holds a list of values. How can I go about it?
You can try this code to solve your problem:
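import matplotlib.pyplot as plt
import numpy as np

for i in df.UserID:
    ratings = []
    productsIDs = []
    # rows of the Recommendations column that belong to this user
    for points in df.Recommendations[np.where(df.UserID==i)[0]]:
        for point in points:  # point is a (productID, rating) tuple
            ratings.append(point[1])
            productsIDs.append(point[0])
    # ratings on the x-axis, product IDs on the y-axis
    plt.plot(ratings, productsIDs)
    plt.show()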
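A variant that plots one chosen user at a time could look like the sketch below. It assumes the Recommendations column holds real Python lists of (productID, rating) tuples rather than strings; the helper names recommendations_for and plot_user are hypothetical, and the ratings are rounded from the question's sample. It does not handle the click itself, only the per-user plot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Shortened sample in the same shape as the question's dataframe.
df = pd.DataFrame(
    {
        "UserID": ["A001", "A002"],
        "Recommendations": [
            [("B000OR5928", 5.67), ("B000A1HU1G", 5.44), ("B0039HBNMA", 5.43)],
            [("B000EEGAJW", 4.38), ("B004ZKIHVU", 4.18), ("B000CBE3GE", 4.24)],
        ],
    }
)


def recommendations_for(df, user_id):
    """Return (product_ids, ratings) for the first row matching user_id."""
    recs = df.loc[df["UserID"] == user_id, "Recommendations"].iloc[0]
    product_ids, ratings = zip(*recs)  # unzip the list of tuples
    return list(product_ids), list(ratings)


def plot_user(df, user_id):
    """Draw a horizontal bar chart: product IDs on y, ratings on x."""
    product_ids, ratings = recommendations_for(df, user_id)
    fig, ax = plt.subplots()
    ax.barh(product_ids, ratings)
    ax.set_xlabel("Rating")
    ax.set_ylabel("Product ID")
    ax.set_title(f"Recommendations for {user_id}")
    return fig
```

Calling plot_user(df, "A001") followed by plt.show() displays the chart for that user; a horizontal bar chart is used here because the product IDs are categories rather than numbers.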