How to cluster a large number of strings based on a similarity matrix? - python

I need to cluster 500K+ strings based on their similarity.
I have calculated their pairwise Levenshtein distances and built a sparse similarity matrix. This matrix contains binary similarities: values for small distances are set to 1.0, all others to 0.0.
I don't know what kind of clustering is suitable for me. I don't know the number of clusters in advance, but it may be quite large because the similarity matrix is very sparse (about 0.1% of the values are non-zero).

Have you considered something like Soundex (https://en.wikipedia.org/wiki/Soundex)? The advantage of such algorithms is that similar words share the same canonical form. For example, both "Robert" and "Rupert" return the same string "R163". Your clustering then boils down to a map like:
clusters = { canonical_form: [list of similar words] }
Naturally, you can tweak the Soundex rules according to your domain.
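A minimal sketch of this idea, assuming the jellyfish package (not mentioned in the answer) is available for the Soundex encoding; any domain-specific canonicalisation function could be dropped in its place:

from collections import defaultdict

import jellyfish  # assumed dependency: pip install jellyfish


def cluster_by_soundex(strings):
    # Group strings whose Soundex canonical forms match.
    clusters = defaultdict(list)
    for s in strings:
        clusters[jellyfish.soundex(s)].append(s)
    return dict(clusters)


print(cluster_by_soundex(["Robert", "Rupert", "Rubin", "Ashcraft"]))
# e.g. {'R163': ['Robert', 'Rupert'], 'R150': ['Rubin'], ...}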

Related

Fast way to find strings within Hamming distance x of each other in a large array of random fixed-length strings

I have a large array with millions of DNA sequences which are all 24 characters long. The DNA sequences should be random and can only contain A,T,G,C,N. I am trying to find strings that are within a certain hamming distance of each other.
My first approach was calculating the Hamming distance between every pair of strings, but this would take far too long.
My second approach used a masking method to create all possible variations of the strings, store them in a dictionary, and then check whether a variation was found more than once. This worked pretty fast (about 20 minutes) for a Hamming distance of 1, but it is very memory intensive and would not be viable for a Hamming distance of 2 or 3.
Python 2.7 implementation of my second approach:
sequences = []  # the list of 24-character DNA strings
masks = {}
for sequence in sequences:
    # Replace each position in turn with a wildcard and group sequences by mask.
    for i in range(len(sequence)):
        try:
            masks[sequence[:i] + '?' + sequence[i + 1:]].append(sequence[i])
        except KeyError:
            masks[sequence[:i] + '?' + sequence[i + 1:]] = [sequence[i], ]

matches = {}
for mask in masks:
    # A mask shared by more than one sequence means a pair within Hamming distance 1.
    if len(masks[mask]) > 1:
        matches[mask] = masks[mask]
I am looking for a more efficient method. I came across Trie-trees, KD-trees, n-grams and indexing but I am lost as to what will be the best approach to this problem.
One approach is Locality-Sensitive Hashing (LSH).
First, note that this method does not necessarily return all the pairs; it returns all pairs with high probability (or most pairs).
Locality-Sensitive Hashing can be summarised as: data points that are located close to each other are mapped to similar hashes (into the same bucket with high probability). Check this link for more details.
Your problem can be recast mathematically as:
Given N vectors v ∈ R^24, with N << 5^24, and a maximum Hamming distance d, return all pairs with Hamming distance at most d.
The way you solve this is to randomly generate K planes {P_1, P_2, ..., P_K} in R^24, where K is a parameter you will have to experiment with. For every data point v, you define its hash as the tuple Hash(v) = (a_1, a_2, ..., a_K), where a_i ∈ {0, 1} denotes whether v is above or below plane P_i. One can prove (I'll omit the proof) that if the Hamming distance between two vectors is small, then the probability that their hashes are close is high.
So, for any given data point, rather than checking all the data points in the sequence list, you only check the data points in the bucket of "close" hashes.
Note that this is very heuristic and will require you to experiment with K and with how "close" to each hash you want to search. As K increases, the number of buckets grows exponentially, but the points that do land in the same bucket are more likely to be truly similar.
Judging by what you said, it looks like you have a gigantic dataset, so I thought I would throw this out for you to consider.
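A rough Python sketch of this scheme, under assumptions not stated in the answer: sequences are one-hot encoded (24 positions × 5 symbols, so the vectors live in R^120 rather than R^24, which makes Hamming distance correspond to angular distance), and K = 16 is an arbitrary illustrative choice.

from collections import defaultdict

import numpy as np

ALPHABET = "ATGCN"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}


def one_hot(seq):
    # Encode a sequence as a 0/1 vector with one block of len(ALPHABET) per position.
    v = np.zeros(len(seq) * len(ALPHABET))
    for pos, ch in enumerate(seq):
        v[pos * len(ALPHABET) + CHAR_INDEX[ch]] = 1.0
    return v


def lsh_buckets(sequences, K=16, seed=0):
    # Hash every sequence to a K-bit signature; similar sequences tend to collide.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(K, 24 * len(ALPHABET)))  # K random hyperplanes
    buckets = defaultdict(list)
    for seq in sequences:
        bits = tuple(int(b) for b in (planes @ one_hot(seq) > 0))
        buckets[bits].append(seq)
    return buckets

Candidate pairs are then verified only within each bucket; to improve recall, the whole procedure is typically repeated with several independent sets of planes and the candidate sets merged.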
Found my solution here: http://www.cs.princeton.edu/~rs/strings/
This uses ternary search trees and took only a couple of minutes and ~1 GB of RAM. I modified the demo.c file to work for my use case.
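For reference, here is a rough Python sketch of the idea (not a port of the linked demo.c): a ternary search tree over the sequences, searched with a Hamming-distance budget. All names are illustrative.

class Node(object):
    __slots__ = ("ch", "lo", "eq", "hi", "end")

    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.end = ch, None, None, None, False


def insert(node, s, i=0):
    # Standard ternary-search-tree insertion of a fixed-length string s.
    if node is None:
        node = Node(s[i])
    if s[i] < node.ch:
        node.lo = insert(node.lo, s, i)
    elif s[i] > node.ch:
        node.hi = insert(node.hi, s, i)
    elif i + 1 == len(s):
        node.end = True
    else:
        node.eq = insert(node.eq, s, i + 1)
    return node


def near(node, s, i, d, prefix, out):
    # Collect stored strings within Hamming distance d of s (all strings same length).
    if node is None or d < 0:
        return
    near(node.lo, s, i, d, prefix, out)   # try other characters at position i
    near(node.hi, s, i, d, prefix, out)
    cost = 0 if s[i] == node.ch else 1    # spend one unit of budget on a mismatch
    if d - cost >= 0:
        if i + 1 == len(s):
            if node.end:
                out.append(prefix + node.ch)
        else:
            near(node.eq, s, i + 1, d - cost, prefix + node.ch, out)

Usage: build the tree with repeated insert calls, then call near(root, query, 0, 2, "", hits) to gather every stored sequence within Hamming distance 2 of query.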

Semantically weighted mean of word embeddings

Given a list of word embedding vectors I'm trying to calculate an average word embedding where some words are more meaningful than others. In other words, I want to calculate a semantically weighted word embedding.
Everything I have found is about either just taking the mean vector (which is quite trivial, of course), representing the average meaning of the list, or some kind of weighted average of words for document representation; however, that is not what I want.
For example, given word vectors for ['sunglasses', 'jeans', 'hats'] I would like to calculate such a vector which represents the semantics of those words BUT with 'sunglasses' having a bigger semantic impact. So, when comparing similarity, the word 'glasses' should be more similar to the list than 'pants'.
I hope the question is clear and thank you very much in advance!
Averaging of word vectors can actually be done in two ways:
Mean of the word vectors without tf-idf weights.
Mean of the word vectors multiplied by their tf-idf weights.
The second option will solve your problem of word importance; a sketch follows below.
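A minimal sketch of the second option, assuming word_vectors is a dict-like mapping from word to NumPy array (e.g. loaded from fastText or word2vec) and a recent scikit-learn for the idf weights; the corpus here is purely illustrative.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["sunglasses jeans hats", "jeans hats shirts", "sunglasses hats"]
tfidf = TfidfVectorizer().fit(corpus)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))


def weighted_mean_embedding(words, word_vectors, idf):
    # Average the word vectors, weighting each word by its idf score.
    vecs, weights = [], []
    for w in words:
        if w in word_vectors:
            vecs.append(word_vectors[w])
            weights.append(idf.get(w, 1.0))
    if not vecs:
        return None
    return np.average(np.vstack(vecs), axis=0, weights=weights)

Rarer (and often more meaningful) words receive a larger idf weight than common ones, so they pull the averaged vector more strongly.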

How do I calculate the semantic similarity between two n-grams?

I'm trying to calculate the semantic similarity between two bi-grams and I need to use fasttext's pre-trained word vectors to accomplish this task.
For example, the bi-grams are Python lists of two elements:
[his, name] and [I, am]
They are two tuples and I need to calculate the similarity between these two tuples by any means necessary.
I'm hoping there's a score which can give me a good approximation of similarity.
For example, I would like a method that can tell me that [His, name] is more similar to [I, am] than to [An, apple].
Right now I have only made use of cosine similarity, which does not capture any semantic similarity.
Cosine similarity might be useful if you average both word vectors in each bi-gram first. Take the vectors for 'his' and 'name' and average them into one vector; then take the vectors for 'I' and 'am' and average them into another. Finally, compute the cosine similarity of the two resulting vectors, which should give you a rough semantic similarity.
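A short sketch of that averaging approach with plain NumPy, assuming a hypothetical vectors mapping (e.g. fastText vectors loaded elsewhere):

import numpy as np


def bigram_vector(bigram, vectors):
    # Average the word vectors of a two-word list into a single vector.
    return np.mean([vectors[w.lower()] for w in bigram], axis=0)


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# similarity = cosine(bigram_vector(["his", "name"], vectors),
#                     bigram_vector(["I", "am"], vectors))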

What is the output of Spark MLLIB LDA topicsmatrix?

The output of LDAModel.topicsMatrix() is unclear to me.
I think I understand the concept of LDA and that each topic is represented by a distribution over terms.
In the LDAModel.describeTopics() it is clear (I think):
The highest sum of likelihoods of the words of a sentence per topic indicates the evidence of this tweet belonging to that topic.
With n topics, the output of describeTopics() is an n × m matrix, where m stands for the size of the vocabulary. The values in this matrix are smaller than or equal to 1.
However in the LDAModel.topicsMatrix(), I have no idea what I am looking at. The same holds when reading the documentation.
That matrix is an m × n matrix, so the dimensions are transposed, and its values are larger than zero (they can even take values such as 2, which is not a probability). What are these values? The number of occurrences of this word in the topic, perhaps?
How do I use these values do calculate the distance of a sentence to a topic?
I think the matrix is m × n, where m is the number of words (the vocabulary size) and n is the number of topics.
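A hedged sketch of how to turn that matrix into per-topic word distributions, assuming model is the PySpark MLlib LDAModel from the question; since the entries of topicsMatrix() are not necessarily normalised (the question observes values above 1), each column is divided by its sum:

import numpy as np

m = model.topicsMatrix()  # vocabSize x k; may be an MLlib DenseMatrix or a NumPy array
topics = np.asarray(m.toArray() if hasattr(m, "toArray") else m)
topic_term_dist = topics / topics.sum(axis=0)  # each column now sums to 1

Each column of topic_term_dist is then a distribution over the vocabulary for one topic, which is a more convenient form for comparing sentences to topics.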

Algorithm to sort rows and cols by similarity

I came across a spreadsheet that explains a method to sort both the rows and the columns of a matrix containing binary data so that the number of changes between consecutive rows and columns is minimized.
For example, starting with:
After the 15 manual steps described in the tabs of the spreadsheet, the following table is obtained:
I would like to know:
What is the common name of this algorithm or method?
How can it be applied to larger tables (where 2^n would overflow...)?
How can it be generalized to non-binary data, for example using the Levenshtein distance?
Is there any link to code (Excel VBA, Python, ...) already implementing this? (Otherwise I'll write it...)
Thanks!
You can represent each row by a vector L = [1, 1, 0, ..., 1] and then define the distance between two rows, d(L0, L1), as the number of positions at which the corresponding elements differ. This is known as the binary Hamming distance. If you had non-binary data, you would just extend your definition of distance, and yes, the Levenshtein distance would be an option.
Once the distance is well defined, the rest of your problem is minimizing the distance between consecutive rows. This is exactly the Travelling Salesman Problem (TSP), which is known to be NP-hard (http://www.diku.dk/hjemmesider/ansatte/jyrki/Paper/EKP85.pdf).
The direct solution (visiting all permutations) is O(n!), but you can easily do better with dynamic programming, for example the Held–Karp algorithm. There are also approximate algorithms, such as the nearest-neighbour algorithm, which quickly computes a non-optimal solution.
Finally, for implementations you can easily google "traveling salesman excel/python" and find many tutorials and examples.
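As a quick illustration, here is a sketch of the nearest-neighbour heuristic mentioned above, applied to the rows of a binary matrix (apply it to the transpose to order the columns); being greedy, it generally gives a non-optimal ordering. The example data is made up.

import numpy as np


def hamming(a, b):
    return int(np.count_nonzero(a != b))


def nearest_neighbour_order(rows):
    # Greedily pick, at each step, the remaining row closest to the previous one.
    remaining = list(range(len(rows)))
    order = [remaining.pop(0)]  # arbitrary starting row
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(last, rows[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order


data = np.array([[1, 0, 1, 1],
                 [0, 0, 1, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 0]])
row_order = nearest_neighbour_order(data)
col_order = nearest_neighbour_order(data.T)
sorted_matrix = data[np.ix_(row_order, col_order)]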
