Faster similarity clustering in Python

I have a collection of a few thousand strings (DNA sequences). I want to trim this down to a couple of hundred (exact number not critical) by excluding sequences that are very similar.
I can do this via matching using the "Levenshtein" module. It works, but it's pretty slow, and I'm pretty sure there must be a much faster way. The code here is the same approach but applied to words, to make it more testable; for me with this cutoff it takes about 10 seconds and collects ~1000 words.
import Levenshtein as lev
import random
f = open("/usr/share/dict/words", 'r')
txt =
cutoff = .5
collected = []
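while len(txt) > 0:
    a = random.choice(txt)
    collected.append(a)  # keep one representative per round
    # drop everything too similar to it (including a itself, whose ratio is 1.0)
    txt = [b for b in txt if lev.ratio(a, b) < cutoff]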
I've tried a few different variations and some other matching modules (Jellyfish) without getting significantly faster.

Maybe you could use locality-sensitive hashing. You could hash all strings into buckets so that all strings in one bucket are very similar to each other. Then you can pick just one string from each bucket.
I've never used this personally (I don't know how well it works in your case), it's just an idea.
related: Python digest/hash for string similarity
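The bucket idea above can be sketched with MinHash signatures over k-mers, split into bands. This is only an illustration under stated assumptions: the function names, the k-mer length, the number of hashes and the band size are all made up for the example and would need tuning for real DNA data.

```python
import hashlib

def shingles(s, k=3):
    """Return the set of all length-k substrings (k-mers) of s."""
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def minhash_signature(s, num_hashes=8, k=3):
    """One minimum hash value per seeded hash function (seeded MD5 here)."""
    return tuple(
        min(hashlib.md5(f"{seed}:{sh}".encode()).hexdigest() for sh in shingles(s, k))
        for seed in range(num_hashes)
    )

def lsh_dedupe(strings, num_hashes=8, band_size=2, k=3):
    """Keep a string only if it shares no LSH bucket with an already kept one."""
    seen_buckets = set()
    kept = []
    for s in strings:
        sig = minhash_signature(s, num_hashes, k)
        # Each band of the signature is one bucket key.
        buckets = {(b, sig[b:b + band_size]) for b in range(0, num_hashes, band_size)}
        if buckets & seen_buckets:
            continue  # collides with a kept string: treat as a near-duplicate
        seen_buckets |= buckets
        kept.append(s)
    return kept
```

Each string is hashed once, so the cost grows roughly linearly with the number of strings instead of quadratically. The price is that similarity is only approximated.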

I'm not quite sure what your application aims at, but when working with sequences you might want to consider using Biopython. An alternative way to calculate the distance between two DNA snippets is to use an alignment score (which is related to Levenshtein distance through non-constant alignment weights). Using Biopython you could do a multiple sequence alignment and create a phylogenetic tree. If you'd like a faster solution, use BLAST, which is a heuristic approach. These would give consistent results, whereas yours depend on the random choice of input sequences.
Concerning your initial question, there's no 'simple' solution to speed up your code.


How to detect the amount of almost-repetition in a text file?

I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.
The script's output would be the number of similar sequences that repeat. Ultimately, I am only interested in a DRY metric (how well the code satisfies the DRY principle).
Naively I tried to do a simple autocorrelation but it would be hard to find the proper threshold.
import numpy as np
u = open("find.c").read()
v = [ord(x) for x in u]
y = np.correlate(v, v, mode="same")
y = y[: int(len(y) / 2)]
x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)
f = (y - z)[: -5]
So I am looking at different strategies... I also tried to compare the similarities between each line, each group of 2 lines, each group of 3 lines ...
import difflib
import numpy as np
lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]
n = 3
d = []
for i in range(len(lines)):
    a = lines[i:i+n]
    for j in range(len(lines)):
        b = lines[j:j+n]
        if i == j: continue # skip same line
        group_size = np.sum([len(x) for x in a])
        if group_size < 5: continue # skip short lines
        ratio = 0
        for u, v in zip(a, b):
            r = difflib.SequenceMatcher(None, u, v).ratio()
            ratio += r if r > 0.7 else 0
        d.append(ratio)  # one similarity score per (i, j) pair of line groups
dry = sum(d) / len(lines)
In the following, we can identify some repetition at a glance:
import matplotlib.pyplot as plt
w = int(len(d) / 100)
e = np.convolve(d, np.ones(w), "valid") / w * 10
plt.plot(range(len(d)), d, range(len(e)), e)
Why not use:
d = np.exp(np.array(d))
Thus, difflib module looks promising, the SequenceMatcher does some magic (Levenshtein?), but I would need some magic constants as well (0.7)... However, this code is > O(n^2) and runs very slowly for long files.
What is funny is that the amount of repetition is quite easily identified with attentive eyes (sorry to this student for having taken his code as a good bad example):
I am sure there is a more clever solution out there.
Any hint?
I would build a system based on compressibility, because that is essentially what things being repeated means. Modern compression algorithms are already looking for how to reduce repetition, so let's piggy back on that work.
Things that are similar will compress well under any reasonable compression algorithm, eg LZ. Under the hood a compression algo is a text with references to itself, which you might be able to pull out.
Write a program that feeds lines [0:n] into the compression algorithm, compare it to the output length with [0:n+1].
When you see the incremental length of the compressed output increases by a lot less than the incremental input, you note down that you potentially have a DRY candidate at that location, plus if you can figure out the format, you can see what previous text it was deemed similar to.
If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic, you can just pull out the references directly.
If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.
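A minimal sketch of the "size doesn't grow as much" heuristic described above, assuming zlib as the compressor. The function name is made up for the example, and recompressing the whole prefix for every line is quadratic, so this is for illustration rather than for large files.

```python
import zlib

def incremental_compressed_growth(lines):
    """For each line, return (raw_bytes_added, compressed_bytes_added)."""
    growth = []
    text = b""
    prev_size = len(zlib.compress(text))
    for line in lines:
        chunk = (line + "\n").encode()
        text += chunk
        size = len(zlib.compress(text))  # compress lines [0:n+1]
        growth.append((len(chunk), size - prev_size))
        prev_size = size
    return growth
```

A line whose compressed growth is much smaller than its raw length is a DRY candidate, because the compressor was able to encode it mostly as a reference to earlier text.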
Looks like you're choosing a long path. I wouldn't go there.
I would look into minifying the code before analyzing it, to completely remove any influence of variable names, extra spacing, formatting and even slight logic reshuffling.
Another approach would be comparing the byte-code of the students' programs. But that may not be a very good idea, since the result will likely need additional cleanup.
The dis module would be an interesting option.
I would most likely settle on comparing their ASTs. But an AST comparison is likely to give false positives for short functions, because their structures may be too similar, so consider checking short functions with something else, something trivial.
On top of that, I would consider using Levenshtein distance or something similar to numerically calculate the differences between the students' byte-codes/sources/ASTs/dis outputs. This would be what? Almost O(N^2)? Shouldn't matter.
Or, if needed, make it more complex and calculate the distance between each function of student A and each function of student B, highlighting cases when the distance is too short. It may be not needed though.
With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables, but the logic and maybe even improve the algo, then this student understands the code well enough to defend it and use it with no help in future. I guess, that's the kind of help a teacher would want to be exchanged between students.
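As a rough illustration of the AST idea, the sketch below replaces every identifier in a Python source with a placeholder and dumps the resulting tree, so two snippets that differ only in naming produce the same string. It handles Python sources only, and the class and function names are invented for the example.

```python
import ast

class _Normalize(ast.NodeTransformer):
    """Replace variable, argument and function names with a placeholder."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # normalize arguments and body first
        node.name = "_"
        return node

def structure(source):
    """Return a name-independent textual form of the source's AST."""
    tree = _Normalize().visit(ast.parse(source))
    return ast.dump(tree)  # line numbers are omitted by default
```

The resulting strings can be compared for equality, or fed to an edit-distance function if a numeric similarity is wanted.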
You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.
As such, we can write this trivial implementation:
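import operator
class Repeat:
    def __init__(self, l):
        self.l = list(l)
        self.memo = {}  # (m, n) -> (length, indices of the latest match)
    def __call__(self, m, n):
        l = self.l
        ret = self.memo.get((m, n))
        if not ret:
            if not m or not n: ret = 0, None
            elif m != n and l[m-1] == l[n-1]:  # critical change here!
                ret = self(m-1, n-1)[0] + 1, (m-1, n-1)
            else: ret = max(self(m-1, n), self(m, n-1), key=operator.itemgetter(0))
            self.memo[m, n] = ret
        return ret
    def go(self):
        n = len(self.l)
        v = self(n, n)[1]
        ret = []
        while v:  # follow the chain of matches back to the start
            ret.append(v)
            v = self(*v)[1]
        ret.reverse()
        return ret
def repeat(l): return Repeat(l).go()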
You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), removing comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, the symmetry should allow only cases with, say, m>=n to be treated.
Of course, there are also "real" answers and real research on this issue!
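The canonicalization suggested above might look like the following for C/C++-style sources. This is a simplified sketch: the function name is an assumption, it only strips `//` comments (and would also cut a `//` that appears inside a string literal), and it does not rename identifiers.

```python
import re

# Lines that carry no information on their own and only add noise.
TRIVIAL_LINES = {"", "{", "}", "};"}

def canonicalize(lines):
    """Normalize whitespace, drop // comments and skip trivial lines."""
    out = []
    for line in lines:
        line = re.sub(r"//.*$", "", line)               # strip line comments
        line = re.sub(r"\s+", " ", line).strip()        # collapse whitespace
        line = re.sub(r"\s*([^\w\s])\s*", r"\1", line)  # no spaces around punctuation
        if line in TRIVIAL_LINES:
            continue
        out.append(line)
    return out
```

Spaces between two words are kept, so `int x` does not collapse into `intx`.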
Frame challenge: I’m not sure you should do this
It’d be a fun programming challenge for yourself, but if you intend to use it as a teaching tool, I’m not sure I would. There’s not a good definition of “repeat” from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I’d say is basically “failure to properly abstract your code at an appropriate level, manifested via some type of repetition of code, whether repeating exact blocks, repeating the same idea over and over again, or somewhere in between”, isn’t something I think anyone will be able to get working well enough at this time to use as a tool that teaches good habits with respect to DRY without confusing the student or teaching bad habits too. For now I’d argue this is a job for humans, because it’s easy for us and hard for computers…
That said if you want to give it a try, first define for yourself requirements for what errors you want to catch, what they’ll look like, and what good code looks like, and then define acceptable false positive and false negative rates and test your code on a wide variety of representative inputs, validating your code against human judgement to see if it performs well enough for your intended use. But I’m guessing you’re really looking for more than simple repetition of tokens, and if you want to have a chance at succeeding I think you need to clearly define what you’re looking for and how you’ll measure success and then validate your code. A teaching tool can do great harm if it doesn’t actually teach the correct lesson. For example if your tool simply encourages students to obfuscate their code so it doesn’t get flagged as violating DRY, or if the tool doesn’t flag bad code so the student assumes it’s ok. Or if it flags code that is actually very well written.
More specifically, what types of repetition are ok and what aren’t? Is it good or bad to use “if” or “for” or other syntax repeatedly in code? Is it ok for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.?). How many times is repetition ok before abstraction should happen, and when it does what kind of abstraction is needed and at what level (e.g. a simple method, or a functor, or a whole other class, or a whole other module?). Is more abstraction always better or is perfect sometimes the enemy of on time on budget? This is a really interesting problem, but it’s also a very hard problem, and honestly I think a research problem, which is the reason for my frame challenge.
Or if you definitely want to try this anyway, you can make it a teaching tool--not necessarily as you may have intended, but rather by showing your students your adherence to DRY in the code you write when creating your tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment by being transparent with them about the limitations of your quality assessment tool. What I wouldn’t do is use it like some professors use plagiarism detection tools, as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good toward the students.
I suggest the following approach: let's say that repetitions should be at least 3 lines long. Then we hash every 3 lines. If hash repeats, then we write down the line number where it occurred. All is left is to join together adjacent duplicated line numbers to get longer sequences.
For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1:(100,200), HASH2:(101,201)} (lines 100-102 and 200-202 will produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). When you join the results, they produce the sequence (100,101,200,201). Splitting that into runs of consecutive numbers, you get ((100,101), (200,201)).
As no nested loops are used, the time complexity is linear (both hashing and dictionary insertion are O(1) per line on average, so O(n) overall).
1. read the text line by line
2. transform it by
   - removing blanks
   - removing empty lines, saving a mapping to the original line numbers for later
3. for each 3 consecutive transformed lines, join them and calculate a hash
4. keep the lines whose hashes occur more than once (these are repetitions of at least 3 lines)
5. find the longest sequences and present the repeated text
from itertools import groupby, cycle
import re
def sequences(l):
    # split a sorted list of indices into runs of consecutive numbers
    x2 = cycle(l)
    next(x2)  # shift by one so each element is compared with its successor
    grps = groupby(l, key=lambda j: j + 1 == next(x2))
    yield from (tuple(v) + (next((next(grps)[1])),) for k,v in grps if k)
with open('program.cpp') as fp:
    text = fp.readlines()
# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
    line = re.sub(r"\s+", "", line, flags=re.UNICODE)
    if line:
        processed.append(line)
        text_map[proc_ix] = ix
        proc_ix += 1
# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed)-2):
    h = hash(''.join(processed[ix:ix+3])) # join 3 lines
    hashes.append(h)
    hpos.setdefault(h, []).append(ix) # this list will reflect lines that are duplicated
# filter duplicated three liners
seqs = []
for k, v in hpos.items():
    if len(v) > 1:
        seqs += v
seqs = sorted(list(set(seqs)))
# find longer sequences
result = {}
for seq in sequences(seqs):
    # the end is exclusive; fall back to the end of the file for a trailing block
    end = text_map.get(seq[-1]+3, len(text))
    result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], end))
print('Duplicates found:')
for v in result.values():
    vbeg, vend = v[0]
    print(''.join(text[vbeg:vend]))  # show the repeated block once
    print(f'Found {len(v)} duplicates, lines')
    for line_numbers in v:
        print(f'{1+line_numbers[0]} : {line_numbers[1]}')

Optimize Word Generator

I am trying to build a program capable of finding the best word in a scrabble game. In the following code, I am trying to create a list of all the possible words given a set of 7 characters.
import csv
import itertools
with open('dictionary.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
def FindLegalWords(data):
    LegalWords = []
    for i in data:
        if len(i[0]) <= 15:
            LegalWords.append(i[0])
    return LegalWords
PossibleWords = []
def word_generator(chars, start_with, min_len, max_len):
    for i in range(min_len - 1, max_len):
        for s in itertools.product(chars, repeat=i):
            yield start_with + ''.join(s)
for word in word_generator('abcdefg', '', 2, 15):
    if word in FindLegalWords(data):
        PossibleWords.append(word)
I think it is clear that the aforementioned code will take days to find all the possible words. What would be a better approach to the problem? Personally, I thought of making each word a number and use NumPy to manipulate them because I have heard that NumPy is very quick. Would this solve the problem? Or it would not be enough? I will be happy to answer any questions that will arise about my code.
Thank you in advance
There are about 5_539 billion possibilities, and code working with strings is generally pretty slow (partially due to Unicode and allocations). This is huge. Generating a massive amount of data only to filter most of it out is not efficient. This algorithmic problem cannot be fixed using optimized libraries like NumPy. One solution is to directly generate a much smaller subset of all possible values that still satisfies FindLegalWords. I guess you probably do not want to generate words like "bfddgfbgfgd". Thus, you can generate pronounceable words by concatenating 2 pronounceable word parts. Doing this is a bit tricky though. A much better solution is to retrieve the possible words from an existing dictionary. You can find such lists online. There are also some dictionaries of pronounceable words that can be retrieved from free password databases. AFAIK, some tools like John-the-Ripper can generate such a list of words, which you can store in a text file and then read from your Python program. Note that since the list can be huge, it is better to compress the file and read it directly from the compressed source.
Some notes regarding the update:
Since FindLegalWords(data) is constant, you can compute it once and store it instead of recomputing it over and over. You can even compute set(FindLegalWords(data)) so that membership tests are faster. Still, the number of possibilities is the main problem, so this will not be enough.
PossibleWords will contain all possible subsets of all strings in FindLegalWords(data). Thus, you can generate it directly from data rather than using a brute-force approach combined with a check. This should be several orders of magnitude faster if data is small. Otherwise, the main problem will be that PossibleWords will be so big that your RAM will certainly not be big enough to contain it anyway...
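One way to generate the candidates directly from the dictionary, as suggested above, is to check for each dictionary word whether it can be spelled with the available letters. The sketch below assumes Scrabble-like rules where each tile is used at most once (unlike the itertools.product call in the question, which reuses letters); the function name is made up for the example.

```python
from collections import Counter

def playable_words(rack, dictionary):
    """Return the dictionary words that can be spelled with the rack's letters."""
    rack_count = Counter(rack)
    result = []
    for word in dictionary:
        # Counter subtraction keeps only positive counts, so an empty result
        # means every letter of the word is available often enough.
        if len(word) <= len(rack) and not (Counter(word) - rack_count):
            result.append(word)
    return result
```

This does one pass over the dictionary, so the cost depends on the dictionary size rather than on the number of letter combinations.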

Bloom Filters with the Kirsch Mitzenmacher optimization

I have recently started playing around with Bloom Filters, and I have a use case in which this calculator suggests to use 15 different and unrelated hash functions.
Hashing 15 times a string would be quite computing-intensive, so I started looking for a better solution.
Luckily, I encountered the Kirsch-Mitzenmacher optimization, which suggests the following: hash_i = hash1 + i * hash2, 0 ≤ i ≤ k - 1, where k is the number of suggested hash functions.
In my specific case, I am trying to create a 0.5MiB Bloom Filter that should eventually be able to efficiently store 200,000 (200k) pieces of information. With 15 hash functions, this would result in p = 0.000042214, or 1 in 23,689 (23k).
I tried implementing the optimization in python as follows:
import uuid
import hashlib
import mmh3
m = 4194304 # 0.5 MiB in bit
k = 15
bit_vector = [0] * m
data = str(uuid.uuid4()).encode()
hash1 = int(hashlib.sha256(data).hexdigest(), base=16)
hash2 = mmh3.hash(data)
for _ in range(0, k-1):
    hash1 += hash2
bit = hash1 % m
bit_vector[bit] = 1
But it seems to get far worse results when compared to creating a classic Bloom Filter with only two hashes (it performs 10x worse in terms of false positives).
What am I doing wrong here? Am I completely misinterpreting the optimization - hence using the formula in the wrong way - or am I just using the wrong "parts" of the hashes (here I am using the full hashes in the calculation, maybe I should be using the first 32 bits on one and the latter 32 bits of the other)? Or maybe the problem is me encoding the data before hashing it?
I also found this git repository that talks about Bloom Filters and the optimization, and their results when using it are quite good both in terms of computing time and in false positives quantity.
*In the example I am using python, as that's the programming language I'm most comfortable with to test things out, but in my project I am actually going to use JavaScript to perform the whole process. Help in any programming language of your choice is much appreciated regardless.
The Kirsch-Mitzenmacher optimization comes from a proof-of-concept paper, and as such it assumes a number of requirements on the table size and the hash functions themselves. Using it naively has tripped up others before you. There are a number of practical issues to consider. You can check out section "6.5.1 Double hashing" of P. C. Dillinger's thesis, linked in his long comment as well, in which he explains the issues and offers solutions. His solutions to RocksDB implementation issues may also be interesting.
Maybe start by trying the "enhanced double hashing":
hashes = []
for i in range(0, k-1):
    hash2 += i
    hash1 += hash2
    hashes.append(hash1)  # collect one derived hash per iteration
(I don't know why your example code only generates 1 hash as result, and assume you pasted simplified code. 15 hash functions should set 15 bits.)
If this doesn't work and you are tired of reading about and trying more solutions, say "it's not worth it" at this point and look for a library with a good implementation instead, comparing candidates for speed and accuracy. Unfortunately, I don't know any to recommend. Maybe the RocksDB implementation, since we know an expert has worked on it?
And even that may not be worth the effort; calling mmh3.hash(data, i) with 15 seeds may be fast enough.
You are indeed misinterpreting the Kirsch-Mitzenmacher optimization. You are combining the hash1 and hash2 functions, plus all your k functions, into what amounts to a single, more complicated hash function.
The point of the optimization is to use each of these as separate hash functions. You don't merge the k bit results together in any way, you set each of them separately in your bit vector. In other words, you should have something roughly along the lines of:
bit_vector[hash1] = 1
bit_vector[hash2] = 1
for i in range(0, k-1):
    bit_vector[hash1 + i * hash2] = 1
See how I'm setting a bit for every hash function? You're just combining them all. That ruins the whole point of using multiple different "independent" hash functions, which is really necessary for good result from a bloom filter.
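Putting the two answers together, a minimal sketch of double hashing that sets one bit per derived index could look like this. It assumes both base hashes are taken from a single SHA-256 digest instead of the SHA-256 plus mmh3 pair in the question, and the class and method names are illustrative only.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m bits, packed 8 per byte

    def _indexes(self, data):
        digest = hashlib.sha256(data).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Force h2 to be odd so it is never zero; when m is a power of two
        # this also makes the k indexes distinct.
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, data):
        for idx in self._indexes(data):  # one bit per derived hash
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, data):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(data))
```

With m = 4194304 and k = 15 as in the question, each added item sets 15 bits. Whether the false-positive rate matches the theoretical value still has to be measured on real data.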

Determine difference of 2 large binary files?

I need to compare and get the difference of 2 large binary files (up to 100 MB).
For ASCII format I can use this:
import difflib
file1 = open('large1.txt', 'r')
file2 = open('large2.txt', 'r')
diff = difflib.ndiff(file1.readlines(), file2.readlines())
difference = ''.join(x[2:] for x in diff if x.startswith('- '))
How would one make it work for binary files? Tried different encoding, binary read mode but nothing worked yet.
EDIT: I use .vcl binary files.
difflib will be extremely slow for large files, and 100 MB counts as very large...
Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.
If you can tolerate the slowness, try difflib.SequenceMatcher; it works for almost all types of data.
This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
Python Doc - class difflib.SequenceMatcher
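Since bytes objects are sequences of hashable integers, SequenceMatcher can compare them directly. A small sketch, assuming both files fit in memory and with a made-up function name; for inputs near 100 MB the quadratic behaviour quoted above will most likely make this impractical.

```python
import difflib

def binary_diff(a: bytes, b: bytes):
    """Return the non-equal opcodes (tag, i1, i2, j1, j2) between a and b."""
    # autojunk=False disables the popularity heuristic, which would otherwise
    # treat frequently occurring byte values as junk in long inputs.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]
```

Each opcode says that a[i1:i2] should be replaced by b[j1:j2], deleted, or that b[j1:j2] should be inserted. The files would be read with open(path, 'rb').read().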

Get the most relevant word (spell check) from 'enchant suggest()' in Python

I want to get the most relevant word from enchant suggest(). Is there a better way to do that? I feel my function is not efficient when it comes to checking large sets of words, in the range of 100k or more.
Problem with enchant suggest():
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.suggest("prfomnc")
['prominence', 'performance', 'preform', 'Provence', 'preferment', 'proforma']
My function to get the appropriate word from a set of suggested words:
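import enchant, difflib
d = enchant.Dict("en_US")
word = "prfomnc"
scores, best = {}, 0  # renamed from dict/max to avoid shadowing the builtins
a = set(d.suggest(word))
for b in a:
    tmp = difflib.SequenceMatcher(None, word, b).ratio()
    scores[tmp] = b
    if tmp > best:
        best = tmp
print(scores[best])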
Result: performance
If I get multiple keys, meaning the same difflib ratio() values, I use a multi-key dictionary, as explained here:
No magic bullet, I'm afraid... a few suggestions however.
I'm guessing that most of the time in the logic is spent in difflib's SequenceMatcher().ratio() call. This wouldn't be surprising, since this method uses a variation on the Ratcliff-Obershelp algorithm, which is relatively expensive CPU-wise (but the metric it produces is rather "on the mark" for locating close matches, and that is probably why you like it).
To be sure, you should profile this logic and confirm that SequenceMatcher() is indeed the hot spot. Maybe Enchant.suggest() is also a bit slow, but there would be little we could do, code-wise, to improve this (configuration-wise, there may be a few options, e.g. doing away with the personal dictionary to save the double look-up and merge, etc.).
Assuming that SequenceMatcher() is indeed the culprit, and assuming that you wish to stick with the Ratcliff-Obershelp similarity metric as the way to select the best match, you could do [some of] the following:
- Only compute the SequenceMatcher ratio value for the top (?) 5 items from Enchant. After all, Enchant.suggest() returns its suggestions in an ordered fashion with its best guesses first; therefore, while based on different heuristics, there's value in the Enchant order as well, and the chances of finding a high-ranking match probably diminish as we move down the list. Also, even though we may end up ignoring a few such high-ranking matches, by testing only the top few Enchant suggestions we somehow combine the "wisdom" found in Enchant's heuristics with that of the Ratcliff-Obershelp metric.
- Stop computing the SequenceMatcher ratio after a certain threshold has been reached.
  The idea is similar to the preceding: avoid calling SequenceMatcher once the odds of finding better are getting smaller (and once we have a decent if not best choice in hand).
- Filter out some of the words from Enchant with your own logic.
  The idea is to have a relatively quick/inexpensive test which may tell us that a given word is unlikely to score well on the SequenceMatcher ratio. For example, exclude words which do not have at least, say, length of user string minus two characters in common.
  BTW, you can maybe use some of the SequenceMatcher object's [quicker] functions to get some data for the filtering heuristics.
- Use SequenceMatcher's quick_ratio() function instead, at least in some cases.
- Only keep the best match, in a string, rather than using a dictionary.
  Apparently only the top choice matters, so except for test purposes you may not need the [relatively small] overhead of the dictionary.
- You may consider writing your own Ratcliff-Obershelp (or similar) method, introducing therein various early exits when the prospect of meeting the current max ratio is small. BEWARE, it would likely be difficult to produce a method as efficient as the C-language one of difflib, your interest in doing this is with the early exits...
HTH, good luck ;-)
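Several of the suggestions above (top few items only, cheap upper bounds before the full ratio, keeping only the best match) can be combined in a small sketch. The function name and the top_n default are assumptions; the suggestions list is passed in, so it works with whatever d.suggest(word) returns.

```python
import difflib

def best_match(word, suggestions, top_n=5):
    """Return (best_candidate, ratio) among the first top_n suggestions."""
    best, best_ratio = None, 0.0
    for candidate in suggestions[:top_n]:
        matcher = difflib.SequenceMatcher(None, word, candidate)
        # Both quick ratios are upper bounds on ratio(), so a candidate that
        # cannot beat the current best is skipped without the expensive call.
        if matcher.real_quick_ratio() <= best_ratio:
            continue
        if matcher.quick_ratio() <= best_ratio:
            continue
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best, best_ratio
```

Because the quick ratios never underestimate, the result is the same as calling ratio() on every one of the top_n candidates, except that a later candidate that merely ties the current best is skipped.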
You don't actually need to keep a dict if you are only interested in the best matches
>>> word = "prfomnc"
>>> best_words = []
>>> best_ratio = 0
>>> a = set(d.suggest(word))
>>> for b in a:
...     tmp = difflib.SequenceMatcher(None, word, b).ratio()
...     if tmp > best_ratio:
...         best_words = [b]
...         best_ratio = tmp
...     elif tmp == best_ratio:
...         best_words.append(b)
...
>>> best_words

