I am working on a task where I need to check the cosine similarity between two dataframe columns.
I am using two for loops to iterate over the two 'Summary' columns (one from each dataframe):
for i in range(len(input_df)):
    for j in range(len(data1)):
        # check similarity ratio
        similarity_score = cosine_sim(input_df['Summary'].iloc[i], data1['Summary'].iloc[j])
        print(similarity_score)
cosine_sim() is my own function that returns the similarity score.
How can I do this using a lambda (or some other vectorized approach) instead of the for loops? The nested loop is taking a lot of time.
There are other operations as well that I perform after checking the cosine similarity.
To compute the cosine similarity between two vectors (your two columns), you could make use of NumPy:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine_similarity(input_df['Summary'], data1['Summary'])
However, based on your code example, it seems that you want to compute the cosine similarity between each pair of elements of the two columns, so I'm not entirely sure the above is what you are looking for.
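If the Summary columns hold text and cosine_sim compares documents, one way to avoid the nested loops entirely is to embed every summary once and compute the whole similarity matrix in a single call. This is only a sketch: the TF-IDF representation below is an illustrative stand-in for whatever vectorization your cosine_sim() uses internally.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit one shared vocabulary over both columns (assumes they contain raw text).
vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([input_df['Summary'], data1['Summary']]))

A = vectorizer.transform(input_df['Summary'])  # shape (len(input_df), n_features)
B = vectorizer.transform(data1['Summary'])     # shape (len(data1), n_features)

# sim_matrix[i, j] plays the role of cosine_sim(input_df['Summary'].iloc[i],
# data1['Summary'].iloc[j]) from the nested loops above.
sim_matrix = cosine_similarity(A, B)

You then have all pairwise scores at once and can run your follow-up operations on the matrix instead of printing per pair.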
I have a problem where I am trying to compute the nearest strings using the Edit/Levenshtein distance.
I have a list containing about 250,000 unique strings, and for each item in the list, I need to return the index of the string in the list that is closest.
My problem is that I can't just use something like pdist, because that would generate a 250k^2/2 array and lead to memory problems. But if I do a row-by-row operation like
def closest(s):
    """
    Returns index of minimum Levenshtein distance
    """
    distances = [levenshtein_distance(s, X[i]) for i in range(len(X))]
    minimum_distance = min(i for i in distances if i > 0)
    return distances.index(minimum_distance)
this will also be super inefficient, as it isn't optimised the way pdist is and amounts to generating the dense matrix row by row anyway.
Would anyone have any suggestions? Many thanks!
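One practical compromise, sketched below under the assumption that the third-party rapidfuzz package (or any other fast Levenshtein implementation) is available, is to compute the distance matrix one chunk of rows at a time: memory stays bounded at chunk_size x 250k while the inner loops run in compiled, parallel code.

import numpy as np
from rapidfuzz.process import cdist
from rapidfuzz.distance import Levenshtein

def closest_indices(X, chunk_size=1000):
    """For each string in X, return the index of its nearest other string."""
    n = len(X)
    best_idx = np.empty(n, dtype=int)
    for start in range(0, n, chunk_size):
        chunk = X[start:start + chunk_size]
        # chunk_size x n block of Levenshtein distances, computed in parallel.
        block = cdist(chunk, X, scorer=Levenshtein.distance, workers=-1).astype(float)
        for k in range(len(chunk)):
            block[k, start + k] = np.inf  # ignore each string's distance to itself
        best_idx[start:start + len(chunk)] = block.argmin(axis=1)
    return best_idx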
I have a function like this which returns a result after comparing certain columns against the function arguments. For instance, P = [0.2, 5, 5] gets compared against the relevant columns. I have to do this over a range of values, and possibly run an optimization algorithm as well. Right now I'm using loops to change the values of P, but I was wondering if there is a way to vectorize this process, so that P[0], P[1] and P[2] can be given as vectors and the output is a vector for all the combinations.
TP = (np.sum(dsf[(dsf.SCORE > P[0]) & (dsf.CANDIDATE_FRAMES > P[1])]
             .groupby(['ELD_ID'])
             .size().reset_index(name='event_count')
             .query('event_count > @P[2]').ELD_ID.isin(annotator.ELD_ID)))
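The groupby pipeline itself is hard to vectorize across all three parameters at once, but a partial vectorization is possible: run the expensive filter/groupby once per (P[0], P[1]) pair and broadcast the event-count comparison against a whole vector of P[2] thresholds. The sketch below assumes the dsf and annotator frames from the snippet above; the grid values are made up for illustration.

import numpy as np
from itertools import product

score_grid = [0.1, 0.2, 0.3]       # candidate P[0] values (illustrative)
frame_grid = [3, 5, 7]             # candidate P[1] values (illustrative)
count_grid = np.array([2, 3, 5])   # candidate P[2] values (illustrative)

results = {}
for p0, p1 in product(score_grid, frame_grid):
    counts = (dsf[(dsf.SCORE > p0) & (dsf.CANDIDATE_FRAMES > p1)]
              .groupby('ELD_ID').size())
    in_annotator = counts.index.isin(annotator.ELD_ID)      # shape (n_ids,)
    passes = counts.values[:, None] > count_grid[None, :]   # shape (n_ids, n_thresholds)
    # TP for every P[2] threshold at once, for this (p0, p1) pair.
    tp = (passes & in_annotator[:, None]).sum(axis=0)
    for p2, value in zip(count_grid, tp):
        results[(p0, p1, p2)] = value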
I have two lists, list1 and list2, each of size 5000, with every entry of the lists being a numpy.array. I want to calculate the squares of the Euclidean distances between the elements of the lists in a fast and efficient way, i.e. I need to calculate sum((list1[i]-list2[j])**2) for every combination of i and j, which is 25,000,000 combinations in total. I currently do so by running a double loop and writing every result into a 2D numpy.array by means of
result[i,j] = sum((list1[i]-list2[j])**2)
but it still takes about 4 minutes on my computer. I was wondering whether any tricks could be used to further speed up the calculation.
If you insist on NumPy (assuming your inner arrays are 1-D), stack the lists into 2-D arrays and use broadcasting:
dist_mat = ((np.array(list1)[:, None, :] - np.array(list2)[None, :, :])**2).sum(2)
Note that, according to your definition of distance in the question, this is the square of the Euclidean distance. If you want the distance itself, simply take the square root of this.
Otherwise, I would prefer @Quang's comment:
from scipy.spatial import distance_matrix
dist_mat = distance_matrix(list1, list2)
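Since the question actually asks for the squared distances, scipy's cdist can also return those directly (a small illustrative sketch):

import numpy as np
from scipy.spatial.distance import cdist

# Stack the lists of 1-D arrays into 2-D arrays and get squared distances in one call.
sq_dist_mat = cdist(np.array(list1), np.array(list2), metric='sqeuclidean')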
I am using gensim's wmdistance for calculating the similarity between a reference sentence and 1000 other sentences.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)
reference_sentence = "it is a reference sentence"
other_sentences = [...]  # 1000 sentences

distance = [0.0] * len(other_sentences)
for index, sentence in enumerate(other_sentences):
    distance[index] = model.wmdistance(reference_sentence, sentence)
According to the gensim source code, model.wmdistance returns the following:
emd(d1, d2, distance_matrix)
where
d1 = the nBOW representation of reference_sentence
d2 = the nBOW representation of other_sentence (computed one by one)
distance_matrix = see the source code, as it is a bit too much to paste here.
This code is inefficient in two ways for my use case.
1) For the reference sentence, d1 is recalculated repeatedly (1000 times), once per call to emd(d1, d2, distance_matrix).
2) This distance function is called by multiple users from different entry points, which repeats the whole model.wmdistance(doc1, doc2) process for the same other_sentences, and it is computationally expensive: these 1000 comparisons take around 7-8 seconds.
Therefore, I would like to separate the two tasks: the final distance calculation, emd(d1, d2, distance_matrix), and the preparation of its inputs, d1, d2, and the distance matrix. Since the distance matrix depends on both documents, at least its input preparation should be isolated from the final calculation.
My initial plan is to create three customized functions:
d1 = prepared1(reference_sentence)
d2 = prepared2(other_sentence)
distance_matrix_inputs = prepare_inputs(...)
Is it possible to do this with this gensim function, or should I just go with my own customized version? Any ideas or solutions to deal with this problem in a better way?
You are right to observe that this code could be refactored & optimized to avoid doing repetitive operations, especially in the common case where one reference/query doc is evaluated against a larger set of documents. (Any such improvements would also be a welcome contribution back to gensim.)
Simply preparing single documents outside the calculation might not offer a big savings; in each case, all word-to-word distances between the two docs must be calculated. It might make sense to precalculate a larger distance_matrix (to the extent that the relevant vocabulary & system memory allows) that includes all words needed for many pairwise WMD calculations.
(As tempting as it might be to precalculate all word-to-word distances, with a vocabulary of 3 million words like the GoogleNews vector-set, and mere 4-byte float distances, storing them all would take at least 18TB. So calculating distances for relevant words, on manageable batches of documents, may make more sense.)
A possible way to start would be to create a variant of wmdistance() that explicitly works on one document versus a set-of-documents, and can thus combine the creation of histograms/distance-matrixes for many comparisons at once.
For the common case of not needing all WMD values, but just wanting the top-N nearest results, there's an optimization described in the original WMD paper where another faster calculation (called there 'RWMD') can be used to deduce when there's no chance a document could be in the top-N results, and thus skip the full WMD calculation entirely for those docs.
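In the meantime, one way to at least reuse the preparation of the fixed other_sentences across queries is to build a WMD index once and query it per user. This is only a sketch: the WmdSimilarity class ships with gensim 3.x and was removed in later releases, so check your version.

from gensim.similarities import WmdSimilarity  # gensim 3.x

# Tokenised corpus built once and shared by every query.
corpus = [sentence.lower().split() for sentence in other_sentences]
wmd_index = WmdSimilarity(corpus, model, num_best=10)

# Each incoming reference sentence only pays for its own query.
query = reference_sentence.lower().split()
top_matches = wmd_index[query]  # list of (corpus position, similarity) pairs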
I have an array of doubles, roughly 200,000 rows by 100 columns, and I'm looking for a fast algorithm to find the rows that contain sequences most similar to a given pattern (the pattern can be anywhere from 10 to 100 elements). I'm using python, so the brute force method (code below: looping over each row and starting column index, and computing the Euclidean distance at each point) takes around three minutes.
The numpy.correlate function promises to solve this problem much faster (running over the same dataset in less than 20 seconds). However, it simply computes a sliding dot product of the pattern over the full row, meaning that to compare similarity I'd have to normalize the results first. Normalizing the cross-correlation requires computing the standard deviation of each slice of the data, which instantly negates the speed improvement of using numpy.correlate in the first place.
Is it possible to compute normalized cross-correlation quickly in python? Or will I have to resort to coding the brute force method in C?
def norm_corr(x, y, mode='valid'):
    ya = np.array(y)
    slices = [x[pos:pos+len(y)] for pos in range(len(x)-len(y)+1)]
    return [np.linalg.norm(np.array(z)-ya) for z in slices]

similarities = [norm_corr(arr, pointarray) for arr in arraytable]
If your data is in a 2D Numpy array, you can take a 2D slice from it (200000 rows by len(pattern) columns) and compute the norm for all the rows at once. Then slide the window to the right in a for loop.
import numpy as np

ROWS = 200000
COLS = 100
PATLEN = 20

# random data for example's sake
a = np.random.rand(ROWS, COLS)
pattern = np.random.rand(PATLEN)

# one column per possible window start position (COLS - PATLEN + 1 of them)
tmp = np.empty([ROWS, COLS - PATLEN + 1])
for i in range(COLS - PATLEN + 1):
    window = a[:, i:i+PATLEN]
    tmp[:, i] = np.sum((window - pattern)**2, axis=1)

result = np.sqrt(tmp)
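To pull the best match out of result afterwards (a small follow-up, not part of the original answer):

# Row index and window-start offset of the slice closest to the pattern.
best_row, best_offset = np.unravel_index(np.argmin(result), result.shape)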