Sklearn Nearest Neighbors and unseen data - python

I've used Nearest Neighbors to find closely related customers in order to recommend popular products to a target customer. I've fit a sparse matrix of training users to get the cosine distances. However, I cannot get indices and distances of new users on the fitted model because those users aren't in the original matrix. Is there a way around this, or do I have to refit the model each time new users are introduced?
Thanks!!
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

train_df = train.pivot(index='user', columns='product_id', values='rating').fillna(0)
test_df = test.pivot(index='user', columns='product_id', values='rating').fillna(0)
train_mat = csr_matrix(train_df.values)
test_mat = csr_matrix(test_df.values)

model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=30)
model_knn.fit(train_mat)

test_user = list(np.sort(test_df.index.unique()))  # users are the pivot index
list1 = []
query_index = np.random.choice(test_user)
distances, indices = model_knn.kneighbors(test_df.loc[query_index, :].values.reshape(1, -1))
for i in range(0, len(distances.flatten())):
    list1.append(test_df.index[indices.flatten()[i]])
Here is the error message:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 1605 while Y.shape[1] == 2724

In the documentation for kneighbors it states:
Returns:
    dist : array
        Array representing the lengths to points, only present if return_distance=True
So you can try:
distances, indices = model_knn.kneighbors(
test_df.loc[query_index,:].values.reshape(1, -1),
return_distance=True
)
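
Note that the ValueError above is caused by the test pivot having a different set of product_id columns than the training pivot (2724 vs. 1605 features), so the query vector does not match the shape the model was fitted on. A minimal sketch of one way to query without refitting, assuming the train_df, test_df and model_knn defined in the question: reindex the test row onto the training columns before calling kneighbors.

# Align the test pivot to the columns the model was fitted on:
# products unseen during training are dropped, missing ones are filled with 0.
test_aligned = test_df.reindex(columns=train_df.columns, fill_value=0)

query_vec = test_aligned.loc[query_index, :].values.reshape(1, -1)
distances, indices = model_knn.kneighbors(query_vec, return_distance=True)

# The returned indices are row positions in train_mat, i.e. in train_df.index.
similar_users = [train_df.index[i] for i in indices.flatten()]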

Related

Replacing for-loop with better alternatives in panda dataframes for similarity measurement

I am working on creating a function which will calculate the cosine similarity of each record in a dataset (MxK dimension) against records in another dataset (NxK dimension) where N is much smaller than M.
The code below does the job well when I test it on a tiny dataset (the iris dataset, for example). I am worried it might struggle when I have bigger datasets (100K records and 100+ variables).
I know for-loops are not advisable in such scenarios, and I have two for-loops in this case. I am wondering if anyone can suggest ways of improving this code.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def similarity_calculation(seed_data, pool_data):
    # Create an empty dataframe to store the similarity scores
    similarity_matrix = pd.DataFrame()
    for indexi, rowi in pool_data.iterrows():
        # Create an array to store the similarity score for each record in the pool data
        similarity_score_array = []
        for indexj, rowj in seed_data.iterrows():
            # Fetch a single record from the pool dataset
            pool = rowi.values.reshape(1, -1)
            # Fetch a single record from the seed dataset
            seed = rowj.values.reshape(1, -1)
            # Measure the similarity score between the two records
            similarity_score = cosine_similarity(pool, seed)[0][0]
            similarity_score_array.append(similarity_score)
        # Append the similarity score array as a new record to the similarity matrix
        similarity_matrix = similarity_matrix.append(pd.Series(similarity_score_array), ignore_index=True)
    return similarity_matrix
Edit 1: Sample data: the iris dataset is used as follows
iris_data = pd.read_csv("iris_data.csv", header=0)
# Split the data into seeds and pool sets, excluding the species details
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
The expected result is shown as a screenshot in the original post.
My new, more compact code (with a single for-loop) is as follows:
def similarity_calculation_compact(seed_data, pool_data):
    Array1 = pool_data.values
    Array2 = seed_data.values
    scores = []
    for i in range(Array1.shape[0]):
        scores.append(np.mean(cosine_similarity(Array1[None, i, :], Array2)))
    final_data = pool_data.copy()
    final_data['mean_similarity_score'] = scores
    final_data = final_data.sort_values(by='mean_similarity_score', ascending=False)
    return final_data
The output I am getting (shown as a screenshot in the original post) differs.
I was expecting identical results, as both functions are supposed to fetch the records from the pool data that are most similar (in terms of average cosine similarity) to the seed data.
There is no need for the for-loops: cosine_similarity takes as input two arrays of shapes (n_samples_X, n_features) and (n_samples_Y, n_features) and returns an array of shape (n_samples_X, n_samples_Y) containing the cosine similarity between each pair of rows from the two inputs.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
iris_data = pd.read_csv("iris.csv", header=0)
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
np.mean(cosine_similarity(pool_set, seed_set), axis=1)
Result (after sorting):
array([0.99952255, 0.99947777, 0.99947545, 0.99946886, 0.99946596, ...])
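
To get a ranked output comparable to similarity_calculation_compact, the mean scores can be attached to a copy of pool_set and sorted; a small follow-up sketch using the same seed_set and pool_set as above:

scores = np.mean(cosine_similarity(pool_set, seed_set), axis=1)

ranked = pool_set.copy()
ranked['mean_similarity_score'] = scores
ranked = ranked.sort_values(by='mean_similarity_score', ascending=False)
print(ranked.head())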

How do I obtain individual centroids of K mean cluster using nltk (python)

I have used nltk to perform k-means clustering, as I would like to change the distance metric to cosine distance. However, how do I obtain the centroids of all the clusters?
kclusterer = KMeansClusterer(8, distance = nltk.cluster.util.cosine_distance, repeats = 1)
predict = kclusterer.cluster(features, assign_clusters = True)
centroids = kclusterer._centroid
df_clustering['cluster'] = predict
#df_clustering['centroid'] = centroids[df_clustering['cluster'] - 1].tolist()
df_clustering['centroid'] = centroids
I am trying to perform k-means clustering on a pandas dataframe, and would like the coordinates of the centroid of each data point's cluster to be in the dataframe column 'centroid'.
Thank you in advance!
import nltk
from nltk.cluster import KMeansClusterer
import pandas as pd
import numpy as np

# Create a dummy dataframe with 3 features
df = pd.DataFrame([[1, 2, 3], [50, 51, 52], [2.0, 6.0, 8.5], [50.11, 53.78, 52]],
                  columns=['feature1', 'feature2', 'feature3'])
print(df)

obj = KMeansClusterer(2, distance=nltk.cluster.util.cosine_distance)  # number of clusters = 2
vectors = [np.array(f) for f in df.values]
df['predicted_cluster'] = obj.cluster(vectors, assign_clusters=True)
print(obj.means())
# Output:
[array([50.055, 52.39 , 52. ]), array([1.5 , 4. , 5.75])]  # the mean of the three features for each of the 2 clusters
# Now, if you want the cluster centroid in the pandas dataframe:
df['centroid'] = df['predicted_cluster'].apply(lambda x: obj.means()[x])
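
If the centroid list is reused elsewhere, it can be stored once instead of calling obj.means() inside the apply; a tiny variant of the last line above:

centroids = obj.means()  # one mean vector per cluster, indexed by cluster label
df['centroid'] = df['predicted_cluster'].apply(lambda x: centroids[x])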

How do I find the euclidean distances between rows of my test and train set efficiently?

I have test and train sets with the following dimensions with all features (i.e. columns) as integers.
X_train.shape
(990188L, 19L)
X_test.shape
(424367L, 19L)
I want to find the euclidean distances between all rows of the train set and all rows of the test set.
I also have to remove from the train set the rows that fall within a distance threshold of 0.005 of any test row.
I have the following linear code, which works but is too slow.
for a in range(X_test.shape[0]):
    a_test = np_Test[a]
    for b in range(X_train.shape[0]):
        a_train = np_Train[b]
        if(a != b):
            dst = distance.euclidean(a_test, a_train)
            if(dst <= 0.005):
                train.append(b)
where I note down the indexes of the rows that lie within the distance threshold.
Is there any way to parallelize this code?
I tried using from sklearn.metrics.pairwise import euclidean_distances, but as the dataset is huge, I am getting a memory error.
I tried to speed up the code by applying euclidean_distances in batches, but somehow I think the following code is not working correctly.
Please help me if there is any way to parallelize the code.
rows = X_train.shape[0]
rem = rows % 1000
no = rows / 1000
i = 0
while (i <= no * 1000):
    dst_mat = euclidean_distances(X_train[i:i+1000, :], X_test)
    condition = np.any(dst_mat <= 0.005, axis=1)
    index = np.where(condition == True)
    index = np.add(index, i)
    print(index)
    print(dst_mat)
    i += 1000
Use scipy.spatial.distance.cdist. This will calculate the pairwise distances.
Thanks to Warren Weckesser for pointing out this solution.
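
A minimal sketch of that approach, processing the training rows in batches so the full distance matrix never has to fit in memory at once; np_Train and np_Test are assumed to be the dense arrays from the question, and the 0.005 threshold is the one used there:

import numpy as np
from scipy.spatial.distance import cdist

def train_rows_within_threshold(np_train, np_test, threshold=0.005, batch_size=1000):
    """Return indices of training rows lying within `threshold` of any test row."""
    close_idx = []
    for start in range(0, np_train.shape[0], batch_size):
        batch = np_train[start:start + batch_size]
        # (batch_size, n_test) matrix of pairwise euclidean distances
        dst_mat = cdist(batch, np_test, metric='euclidean')
        hits = np.where(np.any(dst_mat <= threshold, axis=1))[0]
        close_idx.extend(hits + start)
    return np.array(close_idx)

rows_to_drop = train_rows_within_threshold(np_Train, np_Test)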

Select 5 data points closest to SVM hyperplane

I have written Python code using Sklearn to cluster my dataset:
af = AffinityPropagation().fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_= len(cluster_centers_indices)
I am exploring the use of query-by-clustering and so form an initial training dataset by:
td_title = []
td_abstract = []
td_y = []
for each in centers:
    td_title.append(title[each])
    td_abstract.append(abstract[each])
    td_y.append(y[each])
I then train my model (an SVM) on it by:
clf = svm.SVC()
clf.fit(X, data_y)
I wish to write a function that, given the centres, the model, the X values and the Y values, will append the 5 data points about which the model is most unsure, i.e. the data points closest to the hyperplane. How can I do this?
The first steps of your process aren't entirely clear to me, but here's a suggestion for "Select(ing) 5 data points closest to SVM hyperplane". The scikit documentation defines decision_function as the distance of the samples to the separating hyperplane. The method returns an array which can be sorted with argsort to find the "top/bottom N samples".
Following this basic scikit example, define a function closestN to return the samples closest to the hyperplane.
import numpy as np

def closestN(X_array, n):
    # array of signed sample distances to the hyperplane
    dists = clf.decision_function(X_array)
    # absolute distance to the hyperplane
    absdists = np.abs(dists)
    return absdists.argsort()[:n]
Add these two lines to the scikit example to see the function implemented:
closest_samples = closestN(X, 5)
plt.scatter(X[closest_samples][:, 0], X[closest_samples][:, 1], color='yellow')
(Figures in the original answer: the original scatter plot, and the same plot with the closest samples highlighted in yellow.)
If you need to append the samples to some list, you could somelist.append(closestN(X, 5)). If you needed the sample values you could do something like somelist.append(X[closestN(X, 5)]).
closestN(X, 5)
array([ 1, 20, 14, 31, 24])
X[closestN(X, 5)]
array([[-1.02126202,  0.2408932 ],
       [ 0.95144703,  0.57998206],
       [-0.46722079, -0.53064123],
       [ 1.18685372,  0.2737174 ],
       [ 0.38610215,  1.78725972]])

How to traverse a tree from sklearn AgglomerativeClustering?

I have a numpy text file array at: https://github.com/alvations/anythingyouwant/blob/master/WN_food.matrix
It's a distance matrix between terms and each other, my list of terms are as such: http://pastebin.com/2xGt7Xjh
I used the following code to generate a hierarchical clustering:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
matrix = np.loadtxt('WN_food.matrix')
n_clusters = 518
model = AgglomerativeClustering(n_clusters=n_clusters,
                                linkage="average", affinity="cosine")
model.fit(matrix)
To get the clusters for each term, I could have done:
for term, clusterid in enumerate(model.labels_):
    print term, clusterid
But how do I traverse the tree that the AgglomerativeClustering outputs?
Is it possible to convert it into a scipy dendrogram (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html)? And after that how do I traverse the dendrogram?
I've answered a similar question for sklearn.cluster.ward_tree:
How do you visualize a ward tree from sklearn.cluster.ward_tree?
AgglomerativeClustering outputs the tree in the same way, in the children_ attribute. Here's an adaptation of the code in the ward tree question for AgglomerativeClustering. It outputs the structure of the tree in the form (node_id, left_child, right_child) for each node of the tree.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import itertools
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
model = AgglomerativeClustering(linkage="average", affinity="cosine")
model.fit(X)
ii = itertools.count(X.shape[0])
[{'node_id': next(ii), 'left': x[0], 'right':x[1]} for x in model.children_]
https://stackoverflow.com/a/26152118
Adding to A.P.'s answer, here is code that will give you a dictionary of memberships: members[node_id] gives all the data point indices (zero to n) under that node.
on_split is a simple reformat of A.P.'s clusters that gives the two clusters formed when node_id is split.
up_merge tells which node_id a given node merges into, and which node_id it must be combined with to produce that merge.
import copy
import itertools

# data_x is the data array and fit_cluster the fitted AgglomerativeClustering
# model (X and model in the snippet above).
ii = itertools.count(data_x.shape[0])
clusters = [{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in fit_cluster.children_]

n_points = data_x.shape[0]
members = {i: [i] for i in range(n_points)}
for cluster in clusters:
    node_id = cluster["node_id"]
    members[node_id] = copy.deepcopy(members[cluster["left"]])
    members[node_id].extend(copy.deepcopy(members[cluster["right"]]))

on_split = {c["node_id"]: [c["left"], c["right"]] for c in clusters}
up_merge = {c["left"]: {"into": c["node_id"], "with": c["right"]} for c in clusters}
up_merge.update({c["right"]: {"into": c["node_id"], "with": c["left"]} for c in clusters})
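
As a quick usage sketch (reusing the data_x, fit_cluster, clusters, members, on_split and up_merge names from the snippet above), the root of the tree is the last merge, so the whole hierarchy can be walked down from there:

root_id = n_points + len(clusters) - 1   # id of the final merge, i.e. the tree root
print(members[root_id])    # every data point index, 0 .. n_points - 1
print(on_split[root_id])   # the two child nodes created when the root is split
print(up_merge[0])         # the node that leaf 0 first merges into, and its sibling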
