Find the K-most common longest suffixes in a set of strings - Python

I want to find the longest common suffixes in a set of strings, to detect potentially important morphemes for my natural language processing project.
Given a frequency K >= 2, find the K-most common longest suffixes in a list of strings S1, S2, S3, ..., SN (i.e. the longest common suffixes that occur at least K times).
To simplify the problem, here are some examples:
Input1:
K=2
S=["fireman","woman","businessman","policeman","businesswoman"]
Output1:
["man","eman","woman"]
Explanation1:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
It would be appreciated if the output also tracked the frequency of each common suffix
{"man":4,"eman":2,"woman":2}
It is not guaranteed that every word shares a common suffix of length at least one; see the example below.
Input2:
K=2
S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]
Output2:
["man","eman","woman","ple","apple"]
Explanation2:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
"ple" occur 3 times, "apple" occur 2 times
Is there an efficient algorithm to solve this problem?
I have looked at suffix trees and the generalised suffix tree algorithm, but they do not seem to fit this problem well.
(BTW, I am working on a Chinese NLP project, which means there are far more distinct characters than in English.)

If you want to be really efficient, this paper could help. The algorithm gives you the most frequent (longest) substrings from a set of strings in O(n log n). The basic idea is the following:
Concatenate all strings with a whitespace character between them (these are used as separators later) in order to form a single long string.
Build a suffix array SA over the long string.
For each pair of adjacent entries in the suffix array, check how many of the starting characters of the two suffixes the entries refer to are equal (stop when you reach a whitespace). Put the count of equal characters in an additional array (called LCP). Each entry LCP[i] refers to the count of equal characters of the suffixes at positions SA[i-1] and SA[i] in the long string.
Go over the LCP array and find runs of successive entries with sufficiently large counts; a substring shared by K suffixes shows up as a run of K-1 adjacent entries whose LCP is at least that substring's length.
Example from the paper for 3 strings:
s1 = bbabab
s2 = abacac
s3 = bbaaa
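For illustration, here is a minimal Python sketch of that pipeline. It is only a sketch: it uses a naive sort-based suffix array rather than the O(n log n) construction from the paper, and the frequency accounting is simplified.

def frequent_substrings(strings, k=2):
    # Concatenate with whitespace separators and build a (naive) suffix array.
    text = " ".join(strings)
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp(i, j):
        # Count equal leading characters of the two suffixes, stopping at a separator.
        length = 0
        while (i + length < n and j + length < n
               and text[i + length] == text[j + length]
               and text[i + length] != " "):
            length += 1
        return length

    # A substring shared by m suffixes appears in m-1 adjacent LCP entries.
    pair_counts = {}
    for a in range(1, n):
        for m in range(1, lcp(sa[a - 1], sa[a]) + 1):
            sub = text[sa[a]:sa[a] + m]
            pair_counts[sub] = pair_counts.get(sub, 0) + 1
    return {s: c + 1 for s, c in pair_counts.items() if c + 1 >= k}

To adapt this to the suffix question above, you could reverse each input string first and keep only those suffixes of the long string that start at a word boundary, so that only whole-word suffixes are counted.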

One idea would be to reverse the strings and then build a trie over them.
It's a simple trick: a common suffix becomes a common prefix.
Once the common prefixes are collected, it is just a matter of reversing them again.
E.g. the results from the trie would be
[nam, name, ...]
Reversing these would give you the final answer:
[man, eman, ...]
Time complexity increases by only O(n).
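Here is a minimal sketch of that idea, using a plain dictionary of reversed prefixes in place of an explicit trie node structure (enough to count suffix frequencies). Note it returns every suffix occurring at least K times; pruning the result to only the "longest" ones, as in the expected output above, is left out.

from collections import defaultdict

def common_suffixes(words, k=2):
    counts = defaultdict(int)
    for w in words:
        rev = w[::-1]
        # every prefix of the reversed word is a suffix of the original word
        for i in range(1, len(rev) + 1):
            counts[rev[:i]] += 1
    # keep suffixes seen at least k times, reversed back to normal order
    return {p[::-1]: c for p, c in counts.items() if c >= k}

print(common_suffixes(["fireman", "woman", "businessman", "policeman", "businesswoman"], k=2))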

Related

Loop over long sequences in Python with minimum time complexity

I have two long DNA sequences that I want to iterate over.
I tried a traditional for loop, but it takes linear time, which is too much for me, so I want to find a way to loop over these two sequences and compare their characters without linear time. Is there any way to do that?
Update:
I want to compare each character of the two sequences.
Update 2:
this is the code I tried
for i in range(0, len(seqA)):
    if seqA[i] == seqB[i]:
        print("similar")
    else:
        print("not similar")
and this is a sample of the DNA sequences I want to compare:
seqA = "AT-AC-TCT-GG--TTTCT--CT----TCA---G-A-T--C-G--C--AT-A----AATC-T-T--T-CG-CCTT-T-T---A---C--TA-A--A---G-ATTTCCGT-GGAG-AGG-A-AC---AACTCT-G-AG-T--CT---TA--AC-CCA---ATT-----T--T-TTG-AG--CCTTGCCTT-GGCAA-GGCT--A---"
seqB = "ATCGCTTCTCGGCCTTT-TGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAAT-ATCTGATACGTCC-TCTATCCGAGGACAATATATTAAATGGATTT---TTGGAGCAGGGAGA-TGGAA---TAGGAGCTTGCTCCGT-CCACTCCACGCA-TCGACCTGGTATTGCAGTACC-T-CC--AGG-AACGG-TGCACCC"
If you want this: a sequence that tells you for each position whether the two are equal at that point (assuming they are equally long):
result = "".join(
    "1" if a == seqB[i] else "0"
    for (i, a) in enumerate(seqA)
)
Not sure how useful this result is. What if both sequences are similar, but one of them has a deleted letter? By this measure the sequences will then be very dissimilar.
The Levenshtein edit distance captures the intuition of similarity better, but is much more costly to compute.
https://en.wikipedia.org/wiki/Levenshtein_distance
You could chop up the sequences into chunks of 100 letters and compute the Levenshtein distance for each pair of chunks.
This library has worked well for me, working on ancient texts rather than DNA (I did billions of pairs of chunks in a few hours):
https://github.com/ztane/python-Levenshtein
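For example, a rough sketch of that chunking idea, assuming the python-Levenshtein package is installed (the positional chunk pairing here is naive and only meaningful if the sequences are already roughly aligned):

import Levenshtein  # pip install python-Levenshtein

def chunked_distances(seq_a, seq_b, chunk=100):
    # edit distance between corresponding 100-letter chunks
    return [Levenshtein.distance(seq_a[i:i + chunk], seq_b[i:i + chunk])
            for i in range(0, min(len(seq_a), len(seq_b)), chunk)]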

SequenceMatcher - finding the two most similar elements of two or more lists of data

I was trying to compare a set of strings to an already defined set of strings.
For example, you want to find the addressee of a letter whose text is digitized via OCR.
There is an array of addresses, which has dictionaries as elements.
Each element, which is unique, contains ID, Name, Street, ZIP code and City. This list will be 1000 entries long.
Since OCR-scanned text can be inaccurate, we need to find the best-matching candidate strings from the list that contains the addresses.
The text is 750 words long. We reduce the number of words by using an appropriate filter function, which first splits on whitespace, strips extra whitespace from each element, deletes all words less than 5 characters long and removes duplicates; the resulting list is 200 words long.
Since each addressee has 4 strings (Name, Street, ZIP code and City) and the remaining letter is 200 words long, my comparison has to run 4 * 1000 * 200 = 800,000 times.
I have used Python with moderate success. Matches have correctly been found. However, the algorithm takes a long time to process a lot of letters (up to 50 hrs per 1500 letters). List comprehension has been applied. Is there a way to correctly (and not unnecessarily) implement multithreading? What if this application needs to run on a low-spec server? My 6-core CPU does not complain about such tasks, but I do not know how much time it will take to process a lot of documents on a small AWS instance.
>> len(addressees)
1000
>> addressees[0]
{"Name": "John Doe", "Zip": 12345, "Street": "Boulevard of broken dreams 2", "City": "Stockholm"}
>> letter[:5] # already filtered
["Insurance", "Taxation", "Identification", "1592212", "St0ckhlm", "Mozart"]
>> from difflib import SequenceMatcher
>> def get_similarity_per_element(addressee, letter):
    """compare the similarity of each word in the letter with the addressee"""
    ratios = []
    for l in letter:
        for a in addressee.values():
            ratios.append(int(100 * SequenceMatcher(None, str(a), l).ratio()))  # using ints for faster arithmetic
    return max(ratios)
>> get_similarity_per_element(addressees[0], letter[:5]) # percentage of the most matching word in the letter with anything from the addressee
82
>> # then use this method to find all addressees with the max matching ratio
>> # if only one is greater than the others -> Done
>> # if more than one, but fewer than 3 are equal -> interactive prompt -> Done
>> # else -> mark as not sortable -> Done.
I expected faster processing for each document (1 minute max), not 50 hrs per 1500 letters. I am sure this is the bottleneck, since the other tasks work fast and flawlessly.
Is there a better (faster) way to do this?
A few quick tips:
1) Let me know how long it takes with quick_ratio() or real_quick_ratio() instead of ratio().
2) Invert the order of the loops and use set_seq2 and set_seq1 so that SequenceMatcher reuses information:
for a in addressee.values():
    s = SequenceMatcher()
    s.set_seq2(str(a))
    for l in letter:
        s.set_seq1(l)
        ratios.append(int(100 * s.ratio()))
But a better solution would be something like what J_H describes below.
You want to recognize inputs that are similar to dictionary words, e.g. "St0ckholm" -> "Stockholm". Transposition typos should be handled. Ok.
Possibly you would prefer to set autojunk=False. But a quadratic or cubic algorithm sounds like trouble if you're in a hurry.
Consider the Anagram Problem, where you're asked if an input word and a dictionary word are anagrams of one another. The straightforward solution is to compare the sorted strings for equality. Let's see if we can adapt that idea into a suitable data structure for your problem.
Pre-process your dictionary words into canonical keys that are easily looked up, and hang a list of one or more words off of each key. Use sorting to form the key. So for example we would have:
'dgo' -> ['dog', 'god']
Store this map sorted by key.
Given an input word, you want to know if exactly that word appears in the dictionary, or if a version within a limited edit distance appears in the dictionary. Sort the input word and probe the map for the first entry greater than or equal to that key. Retrieve the (very short) list of candidate words and evaluate the distance between each of them and your input word. Output the best match. This happens very quickly.
For fuzzier matching, use both the 1st and 2nd entries >= the target, plus the preceding entry, so you have a larger candidate set. Also, so far this approach is sensitive to deletion of "small" letters like "a" or "b", due to the ascending sort. So additionally form keys with a descending sort, and probe the map with both kinds of key.
If you're willing to pip install packages, consider import soundex, which deliberately discards information from words, or import fuzzywuzzy.
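A rough sketch of that keyed lookup (the helper names build_index and best_match are just made up for illustration; the final scoring on the tiny candidate list still uses difflib):

import bisect
from difflib import SequenceMatcher

def build_index(words):
    # canonical key = sorted letters, e.g. 'dog' and 'god' both map to 'dgo'
    index = {}
    for w in words:
        index.setdefault("".join(sorted(w.lower())), []).append(w)
    return index, sorted(index)

def best_match(word, index, keys):
    key = "".join(sorted(word.lower()))
    pos = bisect.bisect_left(keys, key)
    # take the preceding entry plus the 1st and 2nd entries >= the key
    candidates = []
    for k in keys[max(0, pos - 1):pos + 2]:
        candidates.extend(index[k])
    if not candidates:
        return None
    return max(candidates, key=lambda c: SequenceMatcher(None, c, word).ratio())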

How to piece together short reads of DNA? Matching base pairs in file of sequences

I am trying to piece together DNA short reads. I need to match around 3 base pairs to other short-read fragments (base pair = e.g. TCG, basically just 3 letters).
I have tried regular expressions, but when I try to read a file with a bunch of short reads I need to make the nucleotides variables, and I don't think regex does that. I have a file with a bunch of these short reads, and I need to match these base pairs to other short reads that share the same base-pair sequences.
ex. I have these two lines of short reads in a file:
AAAGGGTTTCCCGGGAAATCA
CCCGGGAAATCAGGGAAATTT
I need the outcome to be:
AAAGGGTTTCCCGGGAAATCAGGGAAATTT
How do I match the lines and paste them together so that I can combine them at the point of similarity?
You can just find the index of the match sequence in the second sequence and concatenate them:
seq1 = 'AAAGGGTTTCCCGGGAAATCA'
seq2 = 'CCCGGGAAATCAGGGAAATTT'
match_pair_count = 5
match_seq = seq1[-match_pair_count:]
match_index = seq2.rfind(match_seq)
combined_seq = seq1[:-match_pair_count] + seq2[match_index:]
NOTE: If you need to catch cases where the match sequence doesn't appear in the second sequence you will need to add code to handle match_index == -1.
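If you do need that, a minimal guard could look like the following (what to do on a miss is an assumption here; the question doesn't say):

if match_index == -1:
    combined_seq = seq1 + seq2  # no overlap found: just append, or skip the pair
else:
    combined_seq = seq1[:-match_pair_count] + seq2[match_index:]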
A straightforward solution: for each subsequence compute a 5-letter tail and a 5-letter head, then try all the combinations with depth-first search.
A more sophisticated way is to draw a graph whose directed edges are labeled with the number of coinciding letters (within, say, 4-9), then apply travelling-salesman solutions or another appropriate algorithm to find the shortest path through all vertices.
I am sure there are a lot of appropriate tools and techniques tailored for genome inference. https://www.youtube.com/watch?v=fGxx7TvQ3f4
To find the distance between two sequences, invert one and find the longest common prefix.
def joinifmatch(seq1, seq2, minlen=4):
    # look for a suffix of seq2 that matches a prefix of seq1, longest overlap first
    n = len(seq2)
    for i in range(n, minlen, -1):
        if seq1.startswith(seq2[n - i:]):
            return "%s%s" % (seq2[:n - i], seq1)
    return None

Efficient finding of string mismatches in large lists

I am iterating over large lists of strings to find similar strings (with several mismatches). The following code works but takes ~20 min, whereas my goal is for it to complete in under 5 min. Is there a more efficient way to do this? What part of this code is most limiting?
I have k=10, mism=3, and seq is a string consisting of the characters A, T, C, and G. Each pattern and kmer is k characters long.
I've generated a list of 4**k patterns (~1 million) and a list of len(seq)-k+1 kmers (~300). frequent is a dictionary.
The test iteration takes less than a minute:
for i in range(0, 4**k):
    for j in range(0, len(kmers)):
        pass
Here's the real calculation that I need to make more efficient:
for pattern in patterns:
    for kmer in kmers:
        mism_counter = 0
        for j in range(0, k):
            if not kmer[j] == pattern[j]: mism_counter += 1
        if mism_counter <= mism:
            if pattern in frequent:
                frequent[pattern] += 1
            else:
                frequent[pattern] = 1
I tried Wikipedia's hamming_distance function instead of my per-character comparison, and also tried to remove the dictionary and just dump patterns into a list for further processing. None of this improved the performance of the loops. Any help would be appreciated!
This should save half the time ;-)
for pattern in patterns:
    for kmer in kmers:
        mism_counter = 0
        for j in range(0, k):
            if kmer[j] != pattern[j]:
                mism_counter += 1
                if mism_counter > mism:
                    break
        else:
            # for-else: only count the pattern if the inner loop did not break
            if pattern in frequent:
                frequent[pattern] += 1
            else:
                frequent[pattern] = 1
You have to do two things to make this really fast:
Compress the data so your program does less work. You don't have to represent GTAC as ASCII letters (7 bits each); 2 bits per letter are enough.
Build a search trie from the patterns to speed up the comparison. Your search with allowed mismatches basically blows up the number of patterns you have. You can have a trie with extra edges that allow a number of mismatches, but this essentially makes your search set huge.
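As a sketch of the first point, each pattern or kmer can be packed into a single integer (assuming k stays small, e.g. k=10), and counting mismatches then becomes a few bit operations instead of a per-character loop:

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(s):
    # 2 bits per letter
    v = 0
    for ch in s:
        v = (v << 2) | CODE[ch]
    return v

def mismatches(a, b, k):
    x = a ^ b
    # mark each 2-bit group that differs, then count the marks
    diff = (x | (x >> 1)) & int("01" * k, 2)
    return bin(diff).count("1")

# e.g. mismatches(pack("ACGTACGTAC"), pack("ACGAACGTAC"), 10) == 1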

Longest substring (for sequences of triplets)

I am trying to compare strings of the format AAA-ABC-LAP-ASZ-ASK; basically, triplets of letters separated by dashes.
I'm trying to find, between 2 such sequences of arbitrary length (from 1 to 30 triplets), the longest sequence of common triplets.
For example, given AAA-BBB-CCC-AAA-DDD-EEE-BBB and BBB-AAA-DDD-EEE-BBB, you can find a common sequence of 5 (BBB-AAA-DDD-EEE-BBB, even though CCC is not present in the 2nd sequence).
The dashes should not be considered for comparison; they only serve to separate the triplets.
I'm using Python but just a general algorithm to achieve this should do :)
I think that you're looking for the Longest Common Subsequence algorithm, which can find this sequence extremely quickly (in O(n²) time). The algorithm is based on a simple dynamic programming recurrence, and there are many examples online of how you might implement it.
Intuitively, the algorithm works using the following recursive decomposition, which looks at the first triplet of each sequence:
If either sequence is empty, the longest common subsequence is the empty sequence.
Otherwise:
If the first triplets of each sequence match, the LCS is that element followed by the LCS of the remainders of the two sequences.
If not, the LCS is the longer of the following: the LCS of the first sequence and all but the first element of the second sequence, or the LCS of the second sequence and all but the first element of the first sequence.
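For instance, a compact version of that dynamic program over the triplet lists (split the strings on the dashes first) could look like this:

def lcs_length(a, b):
    # classic LCS dynamic-programming table, O(len(a) * len(b))
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

s1 = "AAA-BBB-CCC-AAA-DDD-EEE-BBB".split("-")
s2 = "BBB-AAA-DDD-EEE-BBB".split("-")
print(lcs_length(s1, s2))  # 5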
Hope this helps!
Sequence alignment algorithms, which are commonly used in bioinformatics, could be used here. They are mostly used to align single-character sequences, but they can be modified to accept n-character sequences. The Needleman–Wunsch algorithm is a fairly efficient one.
As a start, you can at least reduce the problem by computing the set symmetric difference to eliminate any triplets that do not occur in both sequences.
For the longest subsequence, the algorithm uses a dynamic programming approach. For each triplet, form the pair consisting of it and the following triplet, and check whether that pair occurs in both sequences. Loop over those pairs, trying to extend them by combining pairs. Keep extending until you have all the longest extensions for each triplet. Pick the longest of those:
ABQACBBA
ZBABBA
Eliminate symmetric difference
ABABBA and BABBA
Start with the first A in ABABBA.
It is followed by B, giving the elements [0,1]
Check to see if AB is in BABBA, and record a match at [1,2]
So, the first pair is ((0,1), (1,2))
Next, try the first B in ABABBA.
It is followed by an A giving the elements [1,2]
Check to see if BA is in BABBA and record a match at [0,1]
Continue with the rest of the letters in ABABBA.
Then, try extensions.
The first pair AB at [0,1] and [1,2] can be extended with the BA pair to ABA at [0,1,3] and [1,2,4]. Note that the latter group is all the way to the right, so it cannot be extended farther; ABA cannot be extended.
Continue until all sequences have been extended as far as possible.
Keep only the best of those.
