I have a similarity matrix of words and would like to apply an algorithm that can put the words in clusters.
Here's the example I have so far:
from Levenshtein import distance
import numpy as np

words = ['The Bachelor', 'The Bachelorette', 'The Bachelor Special', 'SportsCenter',
         'SportsCenter 8 PM', 'SportsCenter Sunday']
list1 = words
list2 = words
matrix = np.zeros((len(list1), len(list2)), dtype=int)
for i in range(len(list1)):
    for j in range(len(list2)):
        matrix[i, j] = distance(list1[i], list2[j])
Obviously this is a very simple dummy example, but I would expect the output to be 2 clusters: one with 'The Bachelor', 'The Bachelorette', 'The Bachelor Special', and the other with 'SportsCenter', 'SportsCenter 8 PM', 'SportsCenter Sunday'.
Can anyone help me with this?
You can loop through the matrix that your code generates, find the distance between each pair of words, and group the words that fall under a certain threshold. The code should look similar to the following:
from Levenshtein import distance
import numpy as np

words = ['The Bachelor', 'The Bachelorette', 'The Bachelor Special', 'SportsCenter',
         'SportsCenter 8 PM', 'SportsCenter Sunday']
list1 = words
list2 = words
matrix = np.zeros((len(list1), len(list2)), dtype=np.int64)
for i in range(len(list1)):
    for j in range(len(list2)):
        matrix[i, j] = distance(list1[i], list2[j])

# maximum distance two words can have and still be grouped together
difference_threshold = 10
clusters = []  # list of clusters

# loop through the matrix previously generated
for row in matrix:
    cluster = []  # cluster generated for this row
    for j in range(len(row)):
        if row[j] <= difference_threshold:
            cluster.append(words[j])
    # if the current cluster is NOT a duplicate, add it
    if cluster not in clusters:
        clusters.append(cluster)
The threshold variable must be adjusted depending on the desired sensitivity to similarity. The code assumes that words may be repeated in separate clusters as long as they meet the similarity threshold. I also recommend converting the matrix values to percentages so that word length has less impact on the sensitivity threshold. Finally, if the matrix does not need to be computed before the clusters are created, the clusters could be built as the Levenshtein distances are computed.
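As a rough sketch of the percentage idea (assuming you normalize each distance by the length of the longer word, which is just one possible choice of denominator):

from Levenshtein import distance
import numpy as np

words = ['The Bachelor', 'The Bachelorette', 'The Bachelor Special', 'SportsCenter',
         'SportsCenter 8 PM', 'SportsCenter Sunday']

# normalized distance: 0.0 means identical, 1.0 means completely different
norm_matrix = np.zeros((len(words), len(words)))
for i in range(len(words)):
    for j in range(len(words)):
        longest = max(len(words[i]), len(words[j]))
        norm_matrix[i, j] = distance(words[i], words[j]) / longest

# the threshold is now a fraction of the word length rather than a raw edit count
difference_threshold = 0.5
clusters = []
for row in norm_matrix:
    cluster = [words[j] for j in range(len(row)) if row[j] <= difference_threshold]
    if cluster not in clusters:
        clusters.append(cluster)

With the dummy data above, a threshold around 0.5 should separate the 'The Bachelor...' group from the 'SportsCenter...' group, but the exact value is something to tune.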
Happy Coding!
I have a corpus with m documents and n unique words.
Based on this corpus, I want to calculate a co-occurrence matrix for words and calculate their similarity.
To do so, I have created a NumPy array occurrences (m x n), which indicates which words are present in each document.
Based on the occurrences, I have created cooccurrences as follows:
cooccurrences = np.transpose(occurrences) @ occurrences
Furthermore, word_occurrences gives the sum per word in the corpus:
word_occurrences = occurrences.sum(axis=0)
Now, I want to calculate the similarity scores of words in cooccurrences based on the association strength.
I want to divide each cell i, j in cooccurrences, by word_occurrences[i] * word_occurrences[j].
Currently, I loop through cooccurrences to do this.
def calculate_association_strength(cooc, i, j, word_occurrences):
    return cooc / (word_occurrences[i] * word_occurrences[j])

for i in range(len(cooccurrences)):
    for j in range(len(cooccurrences)):
        if i != j:
            if cooccurrences[i, j] > 0:
                cooccurrences[i, j] = 1 - calculate_association_strength(cooccurrences[i, j], i, j, word_occurrences)
            else:
                cooccurrences[i, j] = 0
But with m > 30 000, this is very time-consuming. Is there a faster way to do this?
Here, they discuss mapping a function over an np.array. However, they don't use multiple variables derived from the array.
If I understand the problem here correctly, you could vectorise the whole operation, so the result would be:
cooccurrences / (word_occurrences.reshape(-1, 1) * word_occurrences)
This is pretty much guaranteed to run faster than looping through the array.
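If you need to reproduce the original loop exactly (the 1 - ... transformation, the zeroed-out cells, and the untouched diagonal), a sketch along these lines should do it, assuming cooccurrences and word_occurrences are the arrays defined in the question:

import numpy as np

# association strength for every pair at once
# (cells with zero co-occurrence are discarded below, but may still trigger
# divide-by-zero warnings if a word count is 0)
strength = cooccurrences / (word_occurrences.reshape(-1, 1) * word_occurrences)

# 1 - strength where a co-occurrence exists, 0 elsewhere
result = np.where(cooccurrences > 0, 1 - strength, 0)

# the original loop skips i == j, so restore the diagonal
np.fill_diagonal(result, np.diag(cooccurrences))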
I'm trying to solve a text mining problem in Python, which consists of:
Target: create a graph whose nodes are sentences, obtained by tokenizing a paragraph into sentences; the edges would be their similarity.
This isn't new at all, but the crux of the question is not well covered on the Internet. So, after getting the sentences from a paragraph, the interesting point would be to compute a matrix of similarity between sentences (all combinations) in order to draw the graph.
Is there any package to compute the similarity between several vectors in an easy way? Or even, given a list of strings, to build a graph of similarity?
A reproducible example:
# tokenize into sentences
>>> from nltk import tokenize
>>> p = "Help my code works. The graph isn't still connected. The code computes the relationship in a graph. "
>>> sentences=tokenize.sent_tokenize(p)
['Help my code works.', "The graph isn't still connected.", 'The code computes the relationship in a graph.']
>>> len (sentences)
3
# compute similarity with the dice coefficient
>>> def dice_coefficient(a, b):
...     """Dice coefficient: 2*nt / (na + nb).
...     Note: set(a) is a set of characters here, not bigrams, despite the variable names."""
...     a_bigrams = set(a)
...     b_bigrams = set(b)
...     overlap = len(a_bigrams & b_bigrams)
...     return overlap * 2.0 / (len(a_bigrams) + len(b_bigrams))
...
>>> dice_coefficient(sentences[1], sentences[2])
0.918918918918919
So, with this function I can compute the similarities manually and later make the graph with the nodes and the edges. But a global solution (for n sentences) would always be better.
Any suggestions?
The following list comprehension creates a list of tuples where the first two elements are indexes and the last one is the similarity:
edges = [(i, j, dice_coefficient(x, y))
         for i, x in enumerate(sentences)
         for j, y in enumerate(sentences) if i < j]
You can now remove the edges that are under a certain threshold, and convert the remaining edges into a graph with networkx:
import networkx as nx
G = nx.Graph()
G.add_edges_from((i,j) for i,j,sim in edges if sim >= THRESHOLD)
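If you also want to keep the similarity on each edge (for example to scale edge widths when drawing the graph), a small variation of the same idea; THRESHOLD is just an assumed cutoff you would tune:

import networkx as nx

THRESHOLD = 0.5  # assumed cutoff, tune for your data

G = nx.Graph()
# store the similarity as the standard 'weight' attribute on each edge
G.add_weighted_edges_from((i, j, sim) for i, j, sim in edges if sim >= THRESHOLD)

# the weight of an edge (i, j) that survived the cutoff is then G[i][j]['weight']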
The output of LDAModel.topicsMatrix() is unclear to me.
I think I understand the concept of LDA and that each topic is represented by a distribution over terms.
In the LDAModel.describeTopics() it is clear (I think):
The topic with the highest sum of likelihoods over the words of a sentence indicates how strongly that sentence (a tweet, in my case) belongs to that topic.
With n topics, the output of describeTopics() is an n times m matrix, where m stands for the size of the vocabulary. The values in this matrix are smaller than or equal to 1.
However in the LDAModel.topicsMatrix(), I have no idea what I am looking at. The same holds when reading the documentation.
The matrix is an m times n matrix, so the dimensions are transposed, and the values in it are non-negative but can exceed 1 (for example 2), so they are not probability values. What are these values? The occurrence count of each word in the topic, perhaps?
How do I use these values to calculate the distance of a sentence to a topic?
I think the matrix is m times n, where m is the number of words (the vocabulary size) and n is the number of topics.
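If the values are indeed unnormalized topic-term weights (the question's guess of "occurrence of this word in the topic" points that way), here is a rough sketch of how they could be used, assuming you have already pulled the matrix into NumPy (e.g. via topicsMatrix().toArray() in PySpark; the example values below are made up):

import numpy as np

# hypothetical m x n array standing in for topicsMatrix()
# (m = vocabulary size, n = number of topics)
topic_term = np.array([[2.0, 0.5],
                       [1.0, 3.0],
                       [0.5, 1.5]])

# normalize each topic column so it sums to 1 -> a probability distribution over terms,
# comparable in scale to the values reported by describeTopics()
topic_dist = topic_term / topic_term.sum(axis=0, keepdims=True)

# score a sentence against each topic by summing the probabilities of its words
sentence_word_ids = [0, 2]  # hypothetical vocabulary indices of the sentence's words
scores = topic_dist[sentence_word_ids].sum(axis=0)
closest_topic = int(np.argmax(scores))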
I have some documents and I'd like to find the k documents most similar to a selected document. For the sake of a reproducible example, let's say k is 1 and my documents are these
documents = ['Two roads diverged in a yellow wood,',
'And sorry I could not travel both',
'And be one traveler, long I stood',
'And looked down one as far as I could',
'To where it bent in the undergrowth']
Then I think what I want to do is the below. (I'm using CountVectorizer for transparency and simplicity, even though maybe later I'd want to use Tf-Idf and a hashing vectorizer.)
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(analyzer='word')
ft = vectorizer.fit_transform(documents)
one_doc = documents[1]
one_doc_code = vectorizer.transform([one_doc])
doc_match = np.matrix(ft) * np.matrix(one_doc_code.transpose())
and now doc_match is a column vector of weights that indicate closeness of match (0 = bad match, 1 = perfect match). But in order to do the multiplication (rather than element-wise multiplication), I converted, in desperation, to a numpy matrix, and now I have a CSR-format matrix that doesn't have a todense() member, so I can't even inspect it (not that that would scale beyond my tiny example).
What I think I want now (but haven't been able to figure out so far) is how to say "what are the indices of the top k elements of doc_match?" (even if k is not 1).
If all you want are the indices in doc_match that have the highest scores, you can do:
sorted_indices = np.argsort(doc_match)
doc_match_vals_sorted = doc_match[sorted_indices]
Note that np.argsort sorts in ascending order, so the indices of the k best matches are the last k entries of sorted_indices (or reverse it first).
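As a sketch of the whole top-k lookup that stays in sparse-matrix land (it keeps your CountVectorizer setup; the @ product, toarray, and the argsort slicing are the only additions):

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

documents = ['Two roads diverged in a yellow wood,',
             'And sorry I could not travel both',
             'And be one traveler, long I stood',
             'And looked down one as far as I could',
             'To where it bent in the undergrowth']

vectorizer = CountVectorizer(analyzer='word')
ft = vectorizer.fit_transform(documents)
one_doc_code = vectorizer.transform([documents[1]])

# sparse matrix product -> one score per document, no np.matrix conversion needed
scores = (ft @ one_doc_code.T).toarray().ravel()

k = 1
# indices of the k highest-scoring documents, excluding the query document itself
order = np.argsort(scores)[::-1]
top_k = [i for i in order if i != 1][:k]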
I am new to pandas and Python. I want to find common words in my data set, e.g. I have a list of companies ["Microsoft.com", "Microsoft", "Microsoft com", "apple", ...] etc. I have around 1M such company names and I want to calculate the correlation between them to find the relevance of the words, e.g. Microsoft.com, Microsoft, and Microsoft com share common words.
This is what I did, but it's very slow:
import hashlib
import pandas as pd
from pandas import DataFrame, Series

companies = pd.read_csv('/tmp/companies.csv', error_bad_lines=False)
unique_companies = companies.groupby(['company'])['company'].unique()
df = DataFrame()
for company in unique_companies:
    df[hashlib.md5(company).hexdigest()] = [{'name': company[0], 'code': [ord(c) for c in company[0]]}]
rows = df.unstack()
for company in rows:
    series1 = Series(company['code'])
    for word in rows:
        series2 = Series(word['code'])
        if series1.corr(series2) > 0.8:
            company['match'] = [word['name']]
Can anyone please guide me on how to find a correlation matrix for the words?
I don't think there's a corr function that will work for strings - only numerics.
If you can somehow compress your words into meaningful numeric values that preserve the "closeness" of one against another, you might then be able to "corr" them, but other options are available.
Hamming distance is one (basic) method, but slightly better is calculating the Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
It's tricky, but one way of trying this would be to build a matrix of m x n cells, where m is the number of unique words in your first wordlist and n is the number of unique words in the second wordlist, then calculate the Hamming or Levenshtein distances between the row/column identifiers.
There are python modules that package up the distance-algorithms for you -
e.g. https://pypi.python.org/pypi/python-Levenshtein/
Or you could write your own; I think the packaged ones are likely to be faster as they're implemented in C.
So, assuming the Levenshtein module (I don't know, as I have not used it) provides a function, say getLev(word1, word2), that generates a numeric score, you should be able to feed in the contents of two wordlists from sources 1 and 2. If you make sure your inputs are already filtered for uniqueness, and maybe sorted alphabetically, that would help too.
Feed them into a matrix generation function.
Here, I've imported numpy as np and am using that module for speed:
def genLevenshteinMatrix(wordlist1, wordlist2):
    x = len(wordlist1)
    y = len(wordlist2)
    l_matrix = np.zeros((x, y))
    for i in range(x):
        x_word = wordlist1[i]
        for j in range(y):
            y_word = wordlist2[j]
            l_matrix[i][j] = getLev(x_word, y_word)
    return l_matrix
Something like that should allow you to generate a matrix that stores a measure of which words are most like which other words.
Once that's created, you can interrogate it using a function like this:
def interrogate_Levenshtein_matrix(ndarray_x, wordlist1, wordlist2, float_threshold):
    l = []
    x = len(ndarray_x)
    y = len(ndarray_x[0])
    for i in range(x):
        for j in range(y):
            # keep pairs whose distance is at or below the threshold, i.e. similar words
            if ndarray_x[i][j] <= float_threshold:
                l.append([(wordlist1[i], wordlist2[j]), ndarray_x[i][j]])
    return l
And that will output, for every pair of words that are "close" (i.e. have a low distance) as measured by the Levenshtein function used earlier, a list containing the pair of similar words and their distance.
You might need to trim it down somehow, as I think you'll get all like-combinations twice, i.e. ['word','work'] as one return value, and ['work','word'] as another.
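One way to trim the duplicates, sketched below, is to walk only the upper triangle of the matrix (j > i) when both wordlists are the same, so each unordered pair is reported once:

def interrogate_upper_triangle(ndarray_x, wordlist, float_threshold):
    # only look at j > i, so each unordered pair is reported once
    pairs = []
    for i in range(len(ndarray_x)):
        for j in range(i + 1, len(ndarray_x[0])):
            if ndarray_x[i][j] <= float_threshold:
                pairs.append([(wordlist[i], wordlist[j]), ndarray_x[i][j]])
    return pairs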
As you develop the code, you could swap in different correlation functions and try different threshold values.