Efficient finding of string mismatches in large lists - python

I am iterating over large lists of strings to find similar strings (with several mismatches). The following code works but takes ~20 min, whereas my goal is for it to complete in under 5 min. Is there a more efficient way to do this? What part of this code is most limiting?
I have k=10, mism=3, seq is a string consisting of characters A, T, C, and G. Each pattern and kmer is k characters long.
I've generated a list of patterns of length 4**k (~1 million entries) and a list of kmers of length len(seq)-k+1 (~300 entries). frequent is a dictionary.
The test iteration takes less than a minute:
for i in range(0, 4**k):
    for j in range(0, len(kmers)):
        pass
Here's the real calculation that I need to make more efficient:
for pattern in patterns:
    for kmer in kmers:
        mism_counter = 0
        for j in range(0, k):
            if not kmer[j] == pattern[j]:
                mism_counter += 1
        if mism_counter <= mism:
            if pattern in frequent:
                frequent[pattern] += 1
            else:
                frequent[pattern] = 1
I tried Wikipedia's hamming_distance function instead of my per-character comparison, and I also tried removing the dictionary and just dumping patterns into a list for further processing. Neither improved the performance of the loops. Any help would be appreciated!

This should save half the time ;-)
for pattern in patterns:
    for kmer in kmers:
        mism_counter = 0
        for j in range(0, k):
            if kmer[j] != pattern[j]:
                mism_counter += 1
                if mism_counter > mism:
                    break
        else:
            # for/else: runs only if the inner loop finished without a break,
            # i.e. the number of mismatches stayed within the limit
            if pattern in frequent:
                frequent[pattern] += 1
            else:
                frequent[pattern] = 1
You have to do two things to make this really fast:
Compress the data so your program does less work. You don't have to represent G, T, A and C as ASCII letters (7 bits each); 2 bits per letter are enough (see the sketch below).
Build a search trie from the patterns to speed up the comparison. Your search with allowed mismatches basically blows up the number of patterns you have. You can build a trie with extra edges that allow a number of mismatches, but this essentially makes your search set huge.
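For instance, here is a minimal sketch of the 2-bit packing idea; the encode and mismatches helpers are made-up names for illustration, not part of the original code:

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def mismatches(a, b, k):
    """Count positions at which two packed k-mers differ."""
    diff = a ^ b
    count = 0
    for _ in range(k):
        if diff & 0b11:      # the two bits for this position differ
            count += 1
        diff >>= 2
    return count

print(mismatches(encode("ACGT"), encode("ACTT"), 4))   # -> 1

Integer XOR plus a 2-bit scan replaces the per-character inner loop, and the packed k-mers are also much cheaper to store.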

Related

Loop over long sequences in Python with minimum time complexity

I have two DNA sequences (long sequences) that I want to iterate over.
I tried a traditional for loop, but it takes linear time, which is too complex for me, so I want a way to loop over these two sequences and compare their characters without linear time. Is there any way to do that?
Update:
I want to compare each pair of characters in the two sequences.
Update 2:
This is the code I tried:
for i in range(0, len(seqA)):
    if seqA[i] == seqB[i]:
        print("similar")
    else:
        print("not similar")
and this is a sample of the DNA sequences I want to compare:
seqA = "AT-AC-TCT-GG--TTTCT--CT----TCA---G-A-T--C-G--C--AT-A----AATC-T-T--T-CG-CCTT-T-T---A---C--TA-A--A---G-ATTTCCGT-GGAG-AGG-A-AC---AACTCT-G-AG-T--CT---TA--AC-CCA---ATT-----T--T-TTG-AG--CCTTGCCTT-GGCAA-GGCT--A---"
seqB = "ATCGCTTCTCGGCCTTT-TGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAAT-ATCTGATACGTCC-TCTATCCGAGGACAATATATTAAATGGATTT---TTGGAGCAGGGAGA-TGGAA---TAGGAGCTTGCTCCGT-CCACTCCACGCA-TCGACCTGGTATTGCAGTACC-T-CC--AGG-AACGG-TGCACCC"
If what you want is a sequence that tells you, for each position, whether the two are equal at that point (assuming they are equally long):
result = "".join(
1 if a == seqB[i] else 0
for (i, a) in enumerate(seqA)
)
Not sure how useful result is. What if both sequences are similar, but one of them has deleted a letter? By this measure the sequences will then be very dissimilar.
The Levenshtein edit distance captures the intuition of similarity better, but is much more costly to compute.
https://en.wikipedia.org/wiki/Levenshtein_distance
You could chop the sequences into chunks of 100 letters and compute the Levenshtein distance for each pair of chunks.
This library has worked well for me (on ancient texts, not DNA; I did billions of pairs of chunks in a few hours):
https://github.com/ztane/python-Levenshtein
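As a rough sketch of that chunking idea, assuming the python-Levenshtein package linked above (imported as Levenshtein) and the chunk size of 100 suggested earlier:

import Levenshtein

def chunked_distances(seq_a, seq_b, size=100):
    # Levenshtein distance for each pair of aligned chunks.
    return [
        Levenshtein.distance(seq_a[i:i + size], seq_b[i:i + size])
        for i in range(0, min(len(seq_a), len(seq_b)), size)
    ]

print(chunked_distances(seqA, seqB))

A low distance for a chunk pair means the two regions are similar even if a few letters were inserted or deleted, which the position-by-position comparison above cannot detect.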

Trying to calculate algorithm time complexity

So last night I solved this LeetCode question. My solution is not great, quite slow. So I'm trying to calculate the complexity of my algorithm to compare with standard algorithms that LeetCode lists in the Solution section. Here's my solution:
class Solution:
    def longestCommonPrefix(self, strs: List[str]) -> str:
        # Get lengths of all strings in the list and get the minimum,
        # since the common prefix can't be longer than the shortest string.
        # Catch ValueError if the list is empty.
        try:
            min_len = min(len(i) for i in strs)
        except ValueError:
            return ''
        # Split strings into sets character-wise.
        foo = [set(list(zip(*strs))[k]) for k in range(min_len)]
        # Now go through the resulting list and check whether the resulting sets
        # have length 1. If true, add those characters to the prefix list.
        # Break as soon as we encounter a set of length > 1.
        prefix = []
        for i in foo:
            if len(i) == 1:
                x, = i
                prefix.append(x)
            else:
                break
        common_prefix = ''.join(prefix)
        return common_prefix
I'm struggling a bit with calculating complexity. First step - getting minimum length of strings - takes O(n) where n is number of strings in the list. Then the last step is also easy - it should take O(m) where m is the length of the shortest string.
But the middle bit is confusing. set(list(zip(*strs))) should hopefully take O(m) again and then we do it n times so O(mn). But then overall complexity is O(mn + m + n) which seems way too low for how slow the solution is.
The other option is that the middle step is O(m^2*n), which makes a bit more sense. What is the proper way to calculate complexity here?
Yes, the middle portion is O(mn), and the overall complexity is also O(mn), because that term dwarfs the O(m) and O(n) terms for large values of m and n.
Your solution has an ideal order of runtime complexity.
Optimize: Short-Circuit
However, you are probably dismayed that others have faster solutions. I suspect that others likely short-circuit on the first non-matching index.
Let's consider a test case of 26 strings (['a'*500, 'b'*500, 'c'*500, ...]). Your solution would proceed to create a list 500 entries long, with each entry containing a set of 26 elements. Meanwhile, if you short-circuited, you would only process the first index, i.e. one set of 26 characters.
Try changing your list into a generator. This might be all you need to short-circuit.
foo = (set(x) for x in zip(*strs))
You can skip the min_len check because zip by default iterates only as far as the shortest input.
Optimize: Generating Intermediate Results
I see that you append each letter to a list, then ''.join(lst). This is efficient, especially compared to the alternative of iteratively appending to a string.
However, we could just as easily keep a counter match_len. Then, when we detect the first mismatch, just:
return strs[0][:match_len]
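Putting the two suggestions together, a minimal sketch (keeping the same method signature) could look like this:

from typing import List

class Solution:
    def longestCommonPrefix(self, strs: List[str]) -> str:
        if not strs:
            return ''
        match_len = 0
        for column in zip(*strs):        # zip stops at the shortest string
            if len(set(column)) != 1:    # first mismatching position: stop
                break
            match_len += 1
        return strs[0][:match_len]

This never materializes the full list of column sets, so on the 26-string test case above it looks at a single column and returns immediately.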

Find K-most longest common suffix in a set of strings

I want to find the most frequent long common suffixes in a set of strings, to detect potentially important morphemes for my natural language processing project.
Given a frequency K >= 2, find the K-most common longest suffixes in a list of strings S1, S2, S3, ..., SN.
To simplify the problem, here are some examples:
Input1:
K=2
S=["fireman","woman","businessman","policeman","businesswoman"]
Output1:
["man","eman","woman"]
Explanation1:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
It would be appreciated if the output also tracked the frequency of each common suffix
{"man":4,"eman":2,"woman":2}
It is not guaranteed that every word shares at least a one-character common suffix; see the example below.
Input2:
K=2
S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]
Output2:
["man","eman","woman","ple","apple"]
Explanation2:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
"ple" occur 3 times, "apple" occur 2 times
Is there any efficient algorithm to solve this problem?
I have looked at the suffix tree and generalised suffix tree algorithms, but they don't seem to fit this problem well.
(BTW, I am working on a Chinese NLP project, which means there are far more distinct characters than in English.)
If you want to be really efficient, this paper could help. The algorithm gives you the most frequent (longest) substrings from a set of strings in O(n log n). The basic idea is the following:
Concatenate all strings with whitespace between them (the whitespace is used as a separator later) in order to form a single long string.
Build a suffix array SA over the long string
For each pair of adjacent entries in the suffix array, check how many of the leading characters of the two suffixes the entries refer to are equal (stop when you reach a whitespace). Put the count of equal characters in an additional array (called LCP). Each entry LCP[i] holds the count of equal characters for the suffixes starting at positions SA[i-1] and SA[i] in the long string.
Go over the LCP and find successive entries with a count larger than K (a rough sketch of these steps follows the example below).
Example from the paper for 3 strings:
s1 = bbabab
s2 = abacac
s3 = bbaaa
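Here is a naive sketch of those steps on the paper's example; the suffix array is built by plain sorting, which is O(n^2 log n) and only for illustration, not the paper's O(n log n) construction:

s1, s2, s3 = "bbabab", "abacac", "bbaaa"
text = " ".join([s1, s2, s3])                 # whitespace as separator

# Suffix array: start indices of all suffixes, sorted lexicographically.
sa = sorted(range(len(text)), key=lambda i: text[i:])

def lcp_len(a, b):
    # Length of the common prefix of two suffixes, stopping at a separator.
    n = 0
    for x, y in zip(a, b):
        if x != y or x == " ":
            break
        n += 1
    return n

# LCP[i] = common prefix length of the suffixes at SA[i-1] and SA[i].
lcp = [0] + [lcp_len(text[sa[i - 1]:], text[sa[i]:]) for i in range(1, len(sa))]

# Runs of adjacent LCP entries >= L mark a length-L substring shared by
# several suffixes; scan these runs against your frequency threshold K.
print(sa)
print(lcp)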
One idea would be to reverse the strings and then build a trie over them.
It's a simple trick from the book: a common suffix becomes a common prefix of the reversed strings.
Once a common prefix is found in the trie, it's just a matter of reversing it again.
E.g. the results from the trie would be
[nam, name, ...]
Reversing these gives you the final answer:
[man, eman, ...]
Time complexity is increased by only O(n) for the reversals.
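A rough sketch of that idea, with the trie kept as nested dicts and a per-node count; the function and key names are made up for illustration:

def common_suffixes(words, k_min):
    root = {"count": 0, "children": {}}
    # Insert each word reversed; every node counts how many words pass through it.
    for word in words:
        node = root
        for ch in reversed(word):
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1

    # Every node reached by at least k_min words corresponds to a common suffix
    # (stored reversed, so reverse the path back before reporting it).
    result = {}
    def walk(node, path):
        for ch, child in node["children"].items():
            if child["count"] >= k_min:
                result[(path + ch)[::-1]] = child["count"]
                walk(child, path + ch)
    walk(root, "")
    return result

# Note: this counts every word ending in the suffix, so "man" comes out as 5 here.
print(common_suffixes(["fireman", "woman", "businessman",
                       "policeman", "businesswoman"], 2))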

An efficient way to search similar words (with specified length) in two strings using python

My input is two strings of the same length and a number which represents the length of the common words I need to find in both strings. I wrote very straightforward code to do so, and it works, but it is extremely slow, given that each string is ~200K letters.
This is my code:
for i in range(len(X)):
    for j in range(len(Y)):
        if X[i] == Y[j]:
            for k in range(kmer):
                if X[i+k] == Y[j+k]:
                    count += 1
                else:
                    count = 0
            if count == int(kmer):
                loc = (i, j)
                pos.append(loc)
                count = 0
        if Xcmp[i] == Y[j]:
            for k in range(kmer):
                if Xcmp[i+k] == Y[j+k]:
                    count += 1
                else:
                    count = 0
            if count == int(kmer):
                loc = (i, j)
                pos.append(loc)
                count = 0
return pos
where the first sequence is X, the second is Y, and kmer is the length of the common words (and when I say word, I just mean a run of characters).
I was able to create an X-by-kmer matrix (rather than the huge X-by-Y one), but that's still very slow.
I also thought about using a trie, but thought that maybe it would take too long to populate?
At the end I only need the positions of those common subsequences.
Any ideas on how to improve my algorithm?
Thanks!! :)
Create a set of words like this:
words = {X[i:i+kmer] for i in range(len(X)-kmer+1)}

for i in range(len(Y)-kmer+1):
    if Y[i:i+kmer] in words:
        print(Y[i:i+kmer])
This is fairly efficient as long as kmer isn't so large that you'd run out of memory for the set. I assume it isn't since you were creating a matrix that size already.
For the positions, create a dict instead of a set, as Tim suggests:
from collections import defaultdict

wordmap = defaultdict(list)
for i in range(len(X)-kmer+1):
    wordmap[X[i:i+kmer]].append(i)

for i in range(len(Y)-kmer+1):
    word = Y[i:i+kmer]
    if word in wordmap:
        print(word, wordmap[word], i)
A triple-nested for loop gives you a runtime of O(n^3) because you're literally going through each entry. Consider using a rolling hash. It has a linear average runtime and an O(n^2) worst case. It's designed for finding substrings, which is more or less what you're doing. In this case you may be closer to O(n^2), but that's still much better than O(n^3).
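For illustration, here is a rough Rabin-Karp style sketch using the same X, Y and kmer names as the question; windows whose hashes match are re-checked directly to rule out collisions:

from collections import defaultdict

BASE, MOD = 256, (1 << 61) - 1

def window_hashes(s, k):
    # Yield (start, hash) for every length-k window of s, updating the hash
    # in O(1) as the window slides instead of rehashing from scratch.
    top = pow(BASE, k - 1, MOD)
    h = 0
    for j in range(k):
        h = (h * BASE + ord(s[j])) % MOD
    yield 0, h
    for i in range(1, len(s) - k + 1):
        h = ((h - ord(s[i - 1]) * top) * BASE + ord(s[i + k - 1])) % MOD
        yield i, h

def common_words(X, Y, kmer):
    starts = defaultdict(list)
    for i, h in window_hashes(X, kmer):
        starts[h].append(i)
    pos = []
    for j, h in window_hashes(Y, kmer):
        for i in starts.get(h, ()):
            if X[i:i + kmer] == Y[j:j + kmer]:   # verify to rule out collisions
                pos.append((i, j))
    return pos

The set/dict answers above achieve the same effect by hashing each substring from scratch; the rolling version only does O(1) work per window, which matters when kmer is large.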

Permute a string to print all possible words

The code that I have written seems to have a poor asymptotic running time and space usage.
I am getting
T(N) = T(N-1)*N + O((N-1)!*N), where N is the size of the input. I need advice on optimizing it.
Since it is an algorithm-based interview question, we are required to implement the logic in the most efficient way without using any libraries.
Here is my code:
def str_permutations(str_input, i):
    if len(str_input) == 1:
        return [str_input]
    comb_list = []
    while i < len(str_input):
        key = str_input[i]
        if i+1 != len(str_input):
            remaining_str = "".join((str_input[0:i], str_input[i+1:]))
        else:
            remaining_str = str_input[0:i]
        all_combinations = str_permutations(remaining_str, 0)
        for index, value in enumerate(all_combinations):
            all_combinations[index] = "".join((key, value))
        comb_list.extend(all_combinations)
        i = i+1
    return comb_list
As I mentioned in a comment to the question, in the general case you won't get below exponential complexity, since for n distinct characters there are n! permutations of the input string, and O(2^n) is a subset of O(n!).
Now the following won't improve the asymptotic complexity for the general case, but you can optimize the brute-force approach of producing all permutations for strings that have some characters with multiple occurrences. Take for example the string daedoid; if you blindly produce all permutations of it, you'll get every permutation 6 = 3! times, since you have three occurrences of d. You can avoid that by first eliminating multiple occurrences of the same letter and instead remembering how often to use each letter. So if there is a letter c that has k_c occurrences, you save a factor of k_c!. In total, this saves you a factor of the product of k_c! over all c.
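A minimal sketch of that optimization, recursing over a hand-built multiset of letter counts (no libraries, in keeping with the interview constraint) so that each distinct permutation is produced exactly once:

def unique_permutations(s):
    counts = {}
    for ch in s:                       # build the letter multiset by hand
        counts[ch] = counts.get(ch, 0) + 1

    def build(prefix, remaining):
        if remaining == 0:
            yield prefix
            return
        for ch in counts:
            if counts[ch]:
                counts[ch] -= 1        # use one occurrence of ch
                yield from build(prefix + ch, remaining - 1)
                counts[ch] += 1        # put it back (backtrack)

    return list(build("", len(s)))

print(len(unique_permutations("daedoid")))   # 7!/3! = 840 distinct permutations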
If you don't need to write your own, see itertools.permutations and combinations.
