Longest substring (for sequences of triplets) - python

I am trying to compare strings of the format: AAA-ABC-LAP-ASZ-ASK; basically, triplets of letters separated by dashes.
Between two such sequences of arbitrary length (from 1 to 30 triplets), I'm trying to find the longest sequence of common triplets.
For example, between AAA-BBB-CCC-AAA-DDD-EEE-BBB and BBB-AAA-DDD-EEE-BBB, you can find a common sequence of 5 triplets (BBB-AAA-DDD-EEE-BBB, even though CCC is not present in the second sequence).
The dashes should not be considered for comparison; they only serve to separate the triplets.
I'm using Python but just a general algorithm to achieve this should do :)

I think that you're looking for the Longest Common Subsequence (LCS) algorithm, which can find this sequence quickly, in O(n²) time. The algorithm is based on a simple dynamic programming recurrence, and there are many examples online of how you might implement it.
Intuitively, the algorithm uses the following recursive decomposition, which looks at the first triplet of each sequence:
If either sequence is empty, the longest common subsequence is the empty sequence.
Otherwise:
If the first triplets of each sequence match, the LCS is that element followed by the LCS of the remainders of the two sequences.
If not, the LCS is the longer of the following: the LCS of the first sequence and all but the first element of the second sequence, or the LCS of the second sequence and all but the first element of the first sequence.
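As a rough sketch of that recurrence (computed bottom-up rather than recursively, splitting on the dashes first; the function and variable names here are just illustrative):

def lcs(a, b):
    # dp[i][j] holds the LCS of a[i:] and b[j:]
    dp = [[[] for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                # first triplets match: keep one, recurse on the remainders
                dp[i][j] = [a[i]] + dp[i + 1][j + 1]
            else:
                # otherwise take the better of dropping the first element
                # of either sequence
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1], key=len)
    return dp[0][0]

s1 = "AAA-BBB-CCC-AAA-DDD-EEE-BBB".split("-")
s2 = "BBB-AAA-DDD-EEE-BBB".split("-")
print("-".join(lcs(s1, s2)))  # BBB-AAA-DDD-EEE-BBB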
Hope this helps!

Sequence alignment algorithms, which are commonly used in bioinformatics, could be used here. They are mostly used to align one-character sequences, but they can be modified to accept n-character tokens. The Needleman–Wunsch algorithm is one that is fairly efficient.
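For illustration, a minimal Needleman–Wunsch-style global-alignment scoring sketch run over triplet tokens rather than characters (the scores and names here are placeholders, not a tuned implementation):

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    # dp[i][j] is the best global-alignment score of a[:i] against b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]

print(nw_score("AAA-BBB-CCC".split("-"), "AAA-CCC".split("-")))  # 1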

As a start, you can at least reduce the problem by computing the set symmetric difference to eliminate any triplets that do not occur in both sequences.
For the longest subsequence, use a dynamic programming approach. For each triplet, find the substrings of length two that occur in both sequences. Then loop over those pairs, trying to extend them by combining overlapping pairs. Keep extending until you have the longest extension for each starting triplet, and pick the longest of those:
ABQACBBA
ZBABBA
Eliminate symmetric difference
ABABBA and BABBA
Start with the first A in ABABBA.
It is followed by B, giving the elements [0,1]
Check to see if AB is in BABBA, and record a match at [1,2]
So, the first pair is ((0,1), (1,2))
Next, try the first B in ABABBA.
It is followed by an A giving the elements [1,2]
Check to see if BA is in BABBA and record a match at [0,1]
Continue with the rest of the letters in ABABBA.
Then, try extensions.
The first pair, AB at [0,1] and [1,2], can be extended by BA to ABA at [0,1,2] and [1,2,4]. Note that the latter group reaches all the way to the right, so ABA cannot be extended any farther.
Continue until all sequences have been extended as far as possible. Keep only the best of those.
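A minimal sketch of the reduction step on the example above (the pair-extension search itself is left out):

def reduce_to_common(seq1, seq2):
    # drop every element that does not occur in both sequences
    common = set(seq1) & set(seq2)
    return [t for t in seq1 if t in common], [t for t in seq2 if t in common]

a, b = reduce_to_common(list("ABQACBBA"), list("ZBABBA"))
print("".join(a), "".join(b))  # ABABBA BABBA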

Related

Loop over long sequences in Python with minimum time complexity

I have two DNA sequences (long sequences) that I want to iterate over.
I tried a traditional for loop, but it takes linear time, which is too much complexity for me, so I want to find a way to loop over these two sequences and compare the characters in less than linear time. Is there any way to do that?
update
I want to compare the two sequences character by character.
update 2 :
this is the code I tried:
for i in range(0, len(seqA)):
    if seqA[i] == seqB[i]:
        print("similar")
    else:
        print("not similar")
and this is a sample of the DNA sequences I want to compare:
seqA = "AT-AC-TCT-GG--TTTCT--CT----TCA---G-A-T--C-G--C--AT-A----AATC-T-T--T-CG-CCTT-T-T---A---C--TA-A--A---G-ATTTCCGT-GGAG-AGG-A-AC---AACTCT-G-AG-T--CT---TA--AC-CCA---ATT-----T--T-TTG-AG--CCTTGCCTT-GGCAA-GGCT--A---"
seqB = "ATCGCTTCTCGGCCTTT-TGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAAT-ATCTGATACGTCC-TCTATCCGAGGACAATATATTAAATGGATTT---TTGGAGCAGGGAGA-TGGAA---TAGGAGCTTGCTCCGT-CCACTCCACGCA-TCGACCTGGTATTGCAGTACC-T-CC--AGG-AACGG-TGCACCC"
If you want this, i.e. a sequence that tells you for each position whether the two are equal at that point (assuming they are equally long):
result = "".join(
    "1" if a == seqB[i] else "0"
    for (i, a) in enumerate(seqA)
)
Not sure how useful result is. What if both sequences are similar, but one of them has deleted a letter? By this measure the sequences will then be very dissimilar.
The Levenshtein edit distance captures the intuition of similarity better, but it is much more costly to compute:
https://en.wikipedia.org/wiki/Levenshtein_distance
You could chop up the sequences into chunks of 100 letters and perform Levenshtein on each pair of chunks.
This library has worked well for me, working on ancient texts, not DNA (I did billions of pairs of chunks in a few hours):
https://github.com/ztane/python-Levenshtein
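A minimal sketch of that chunking idea, assuming the package above is installed (chunked_distance is just an illustrative name):

import Levenshtein  # pip install python-Levenshtein

def chunked_distance(seq_a, seq_b, chunk=100):
    # sum the edit distances of aligned 100-letter chunks; much cheaper
    # than one Levenshtein call over the full sequences
    total = 0
    for i in range(0, max(len(seq_a), len(seq_b)), chunk):
        total += Levenshtein.distance(seq_a[i:i + chunk], seq_b[i:i + chunk])
    return total

# e.g. chunked_distance(seqA, seqB) with the sequences from the question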

Checking validity of permutations in python

Using Python, I would like to generate all possible permutations of 10 labels (for simplicity, I'll call them a, b, c, ...) and return all permutations that satisfy a list of conditions. These conditions have to do with the ordering of the different labels; for example, let's say I want to return all permutations in which a comes before b and d comes after e. Notably, none of the conditions pertain to any details of the labels themselves, only their relative orderings. I would like to know what the most suitable data structure and general approach is for dealing with these sorts of problems. For example, I can generate all possible permutations of elements within a list, but I can't see a simple way to verify whether a given permutation satisfies the conditions I want.
"The most suitable data structure and general approach" varies, depending on the actual problem. I can outline three basic approaches to the problem you give (generate all permutations of 10 labels a, b, c, etc. in which a comes before b and d comes after e).
First, generate all permutations of the labels using itertools.permutations, and remove/skip the ones where a comes after b or d comes before e. Given a particular permutation p (represented as a Python tuple) you can check for
p.index("a") < p.index("b") and p.index("d") > p.index("e")
This has the disadvantage that you reject three-fourths of the permutations that are initially generated, and that expression involves four passes through the tuple. But this is simple and short and most of the work is done in the fast code inside Python.
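A minimal sketch of this first method (a generator expression, so the 10!-sized stream is filtered lazily):

from itertools import permutations

labels = "abcdefghij"
valid = (
    p for p in permutations(labels)
    if p.index("a") < p.index("b") and p.index("d") > p.index("e")
)
print(next(valid))  # first qualifying permutation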
Second, generate all permutations of the locations 0 through 9. Consider these to represent the inverses of your desired permutations. In other words, the number at position 0 is not what will go to position 0 in the permutation, but rather shows where label a will go in the permutation. Then you can quickly and easily check for your requirements:
p[0] < p[1] and p[3] > p[4]
since a is the 0th label, etc. If the permutation passes this test, then find the inverse permutation and apply it to your labels. Finding the inverse involves one or two passes through the tuple, so it makes fewer passes than the first method. However, this is more complicated and does more work outside the innards of Python, so it is very doubtful that this will be faster than the first method.
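A sketch of this second method (inverting with one pass through the tuple):

from itertools import permutations

labels = "abcdefghij"
n = len(labels)
for q in permutations(range(n)):
    # q[i] is the position that labels[i] will occupy in the final permutation
    if q[0] < q[1] and q[3] > q[4]:
        perm = [None] * n
        for i, pos in enumerate(q):
            perm[pos] = labels[i]
        print("".join(perm))  # a valid permutation; demo: stop after the first
        break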
Third, generate only the permutations you need. This can be done with these steps.
3a. Note that there are four special positions in the permutations (those for a, b, d, and e). So use itertools.combinations to choose 4 positions out of the 10 total positions. Note I said positions, not labels, so choose 4 integers between 0 and 9.
3b. Use itertools.combinations again to choose 2 of those positions out of the 4 already chosen in step 3a. Place a in the first (smaller) of those 2 positions and b in the other. Place e in the first of the other 2 positions chosen in step 3a and place d in the other.
3c. Use itertools.permutations to choose the order of the other 6 labels.
3d. Interleave all that into one permutation. There are several ways to do that. You could make one pass through, placing everything as needed, or you could use slices to concatenate the various segments of the final permutation.
That third method generates only what you need, but the time involved in constructing each permutation is sizable. I do not know which of the methods would be fastest; you could test with smaller sizes of permutations. There are multiple possible variations for each of the methods, of course.
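A sketch of the third method, following steps 3a-3d (constrained_permutations is an illustrative name; note that combinations yields sorted tuples, which is what places a before b and e before d):

from itertools import combinations, permutations

labels = "abcdefghij"
others = [c for c in labels if c not in "abde"]  # the six unconstrained labels
positions = range(10)

def constrained_permutations():
    for special in combinations(positions, 4):            # 3a: positions for a, b, d, e
        for ab in combinations(special, 2):               # 3b: two of those for a and b
            a_pos, b_pos = ab                             # sorted, so a lands before b
            e_pos, d_pos = [p for p in special if p not in ab]  # e before d
            for rest in permutations(others):             # 3c: order the other six
                free = (p for p in positions if p not in special)
                perm = dict(zip(free, rest))
                perm.update({a_pos: "a", b_pos: "b", d_pos: "d", e_pos: "e"})
                yield "".join(perm[i] for i in positions)  # 3d: assemble

print(next(constrained_permutations()))  # e.g. abedcfghij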

SequenceMatcher - finding the two most similar elements of two or more lists of data

I was trying to compare a set of strings to an already defined set of strings.
For example, you want to find the addressee of a letter whose text is digitized via OCR.
There is an array of addresses, which has dictionaries as elements.
Each element, which is unique, contains ID, Name, Street, ZIP Code and City. This list will be 1000 entries long.
Since OCR-scanned text can be inaccurate, we need to find the best-matching candidate strings from the list that contains the addresses.
The text is 750 words long. We reduce the number of words by using an appropriate filter function, which first splits on whitespace, strips extra whitespace from each element, deletes all words shorter than 5 characters and removes duplicates; the resulting list is 200 words long.
Since each addressee has 4 strings (Name, Street, ZIP code and City) and the remaining letter is 200 words long, my comparison has to run 4 * 1000 * 200 = 800,000 times.
I have used Python with moderate success. Matches have correctly been found. However, the algorithm takes a long time to process many letters (up to 50 hours per 1500 letters). List comprehension has been applied. Is there a way to implement multithreading correctly (and not unnecessarily)? What if this application needs to run on a low-spec server? My 6-core CPU does not complain about such tasks; however, I do not know how much time it will take to process a lot of documents on a small AWS instance.
>>> len(addressees)
1000
>>> addressees[0]
{"Name": "John Doe", "Zip": 12345, "Street": "Boulevard of broken dreams 2", "City": "Stockholm"}
>>> letter[:5]  # already filtered
["Insurance", "Taxation", "Identification", "1592212", "St0ckhlm", "Mozart"]
>>> from difflib import SequenceMatcher
>>> def get_similarity_per_element(addressee, letter):
...     """compare the similarity of each word in the letter with one addressee"""
...     ratios = []
...     for l in letter:
...         for a in addressee.values():
...             ratios.append(int(100 * SequenceMatcher(None, a, l).ratio()))  # using ints for faster arithmetic
...     return max(ratios)
>>> get_similarity_per_element(addressees[0], letter[:5])  # percentage of the best-matching word in the letter against anything from the addressee
82
>>> # then use this method to find all addressees with the max matching ratio
>>> # if only one is greater than the others -> Done
>>> # if more than one, but fewer than 3 are equal -> Interactive Prompt -> Done
>>> # else -> mark as not sortable -> Done.
I expected faster processing for each document (1 minute max), not 50 hours per 1500 letters. I am sure this is the bottleneck, since the other tasks work fast and flawlessly.
Is there a better (faster) way to do this?
A few quick tips:
1) Let me know how long it takes with quick_ratio() or real_quick_ratio() instead of ratio().
2) Invert the order of the loops and use set_seq2 and set_seq1 so that SequenceMatcher reuses information:
for a in addressee.values():
    s = SequenceMatcher()
    s.set_seq2(a)  # SequenceMatcher caches details about seq2, so set it in the outer loop
    for l in letter:
        s.set_seq1(l)
        ratios.append(int(100 * s.ratio()))
But a better solution would be something like what J_H describes below.
You want to recognize inputs that are similar to dictionary words, e.g. "St0ckholm" -> "Stockholm". Transposition typos should be handled. Ok.
Possibly you would prefer to set autojunk=False. But a quadratic or cubic algorithm sounds like trouble if you're in a hurry.
Consider the Anagram Problem, where you're asked if an input word and a dictionary word are anagrams of one another. The straightforward solution is to compare the sorted strings for equality. Let's see if we can adapt that idea into a suitable data structure for your problem.
Pre-process your dictionary words into canonical keys that are easily looked up, and hang a list of one or more words off of each key. Use sorting to form the key. So for example we would have:
'dgo' -> ['dog', 'god']
Store this map sorted by key.
Given an input word, you want to know if exactly that word appears in the dictionary, or if a version with limited edit distance appears in the dictionary. Sort the input word and probe the map for 1st entry greater or equal to that. Retrieve the (very short) list of candidate words and evaluate the distance between each of them and your input word. Output the best match. This happens very quickly.
For fuzzier matching, use both the 1st and 2nd entries >= target, plus the preceding entry, so you have a larger candidate set. Also, so far this approach is sensitive to deletion of "small" letters like "a" or "b", due to ascending sorting. So additionally form keys with descending sort, and probe the map for both types of key.
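A minimal sketch of this keyed-candidate approach, using only the ascending-sort keys (build_key_map and best_match are illustrative names; SequenceMatcher stands in for whatever distance you evaluate on the candidates):

from bisect import bisect_left
from difflib import SequenceMatcher

def build_key_map(words):
    # canonical sorted-letter key -> list of dictionary words sharing it
    key_map = {}
    for w in words:
        key_map.setdefault("".join(sorted(w.lower())), []).append(w)
    return key_map, sorted(key_map)

def best_match(word, key_map, keys):
    probe = "".join(sorted(word.lower()))
    i = bisect_left(keys, probe)   # 1st entry >= probe
    candidates = []
    for j in (i - 1, i, i + 1):    # preceding entry plus 1st and 2nd >= probe
        if 0 <= j < len(keys):
            candidates.extend(key_map[keys[j]])
    return max(candidates,
               key=lambda c: SequenceMatcher(None, word, c).ratio(),
               default=None)

key_map, keys = build_key_map(["Stockholm", "Boulevard", "John", "Doe"])
print(best_match("St0ckhlm", key_map, keys))  # Stockholm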
If you're willing to pip install packages, consider import soundex, which deliberately discards information from words, or import fuzzywuzzy.

How to piece together short reads of DNA? Matching base pairs in file of sequences

I am trying to piece together DNA short reads. I need to match around 3 base pairs to other short-read fragments (a base pair here being, e.g., TCG: basically just 3 letters).
I have tried regular expressions, but when I read a file with a bunch of short reads I would need to make the nucleotides variables, and I don't think regex does that. I have a file with a bunch of these short reads, and I need to match their base-pair sequences against other short reads that share the same sequences.
ex. I have these two lines of short reads in a file:
AAAGGGTTTCCCGGGAAATCA
CCCGGGAAATCAGGGAAATTT
I need the outcome to be:
AAAGGGTTTCCCGGGAAATCAGGGAAATTT
How do I match the lines and paste the matched lines on top of the other lines, so that I can combine them at the point of similarity?
You can just find the index of the match sequence in the second sequence and concatenate them:
seq1 = 'AAAGGGTTTCCCGGGAAATCA'
seq2 = 'CCCGGGAAATCAGGGAAATTT'
match_pair_count = 5
match_seq = seq1[-match_pair_count:]
match_index = seq2.rfind(match_seq)
combined_seq = seq1[:-match_pair_count] + seq2[match_index:]
NOTE: If you need to catch cases where the match sequence doesn't appear in the second sequence you will need to add code to handle match_index == -1.
A straightforward solution: for each subsequence, compute the 5-letter tail and the 5-letter head, then try all the combinations with depth-first search.
A more sophisticated way is to build a graph whose directed edges are labeled with the number of coinciding letters (within, say, 4-9), then apply travelling-salesman solutions or another appropriate algorithm to find the shortest path through all vertices.
I am sure there are a lot of appropriate tools and techniques tailored for genome inference: https://www.youtube.com/watch?v=fGxx7TvQ3f4
To find the distance between two sequences, invert one and find the longest common prefix:
def joinifmatch(seq1, seq2, minlen=4):
    # join the reads as seq2 + seq1 when a suffix of seq2 of at least
    # minlen letters matches a prefix of seq1; longest overlap wins
    n = len(seq2)
    for i in range(min(len(seq1), n), minlen - 1, -1):
        if seq1.startswith(seq2[n - i:]):
            return "%s%s" % (seq2[:n - i], seq1)
    return None

Find K-most longest common suffix in a set of strings

I want to find the most common longest suffixes in a set of strings, to detect potentially important morphemes in my natural language processing project.
Given a frequency threshold K >= 2, find the longest common suffixes that occur at least K times in a list of strings S1, S2, S3, ..., SN.
To simplify the problem, here are some examples:
Input1:
K=2
S=["fireman","woman","businessman","policeman","businesswoman"]
Output1:
["man","eman","woman"]
Explanation1:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
It would be appreciated if the output also tracked the frequency of each common suffix
{"man":4,"eman":2,"woman":2}
It is not guaranteed that all words share a common suffix of at least length one; see the example below.
Input2:
K=2
S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]
Output2:
["man","eman","woman","ple","apple"]
Explanation2:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
"ple" occur 3 times, "apple" occur 2 times
Is there any efficient algorithm to solve this problem?
I have referred to the Suffix Tree and Generalised Suffix Tree algorithms, but they don't seem to fit this problem well.
(BTW, I am working on a Chinese NLP project, which means there are far more characters than in English.)
If you want to be really efficient, this paper could help. The algorithm gives you the most frequent (longest) substrings from a set of strings in O(n log n). The basic idea is the following:
Concatenate all strings with whitespace between them (the whitespace is used as a separator later) to form a single long string.
Build a suffix array SA over the long string.
For each pair of adjacent entries in the suffix array, check how many starting characters of the two referenced suffixes are equal (you should stop when you reach a whitespace). Put the count of equal characters in an additional array, called LCP. Each entry LCP[i] holds the count of equal characters of the suffixes at positions SA[i-1] and SA[i] in the long string.
Go over the LCP array and find successive entries with a count larger than K.
Example from the paper for 3 strings:
s1= bbabab
s2 = abacac
s3 = bbaaa
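A naive sketch of steps 1-3 on those strings (the suffix array here is built in O(n² log n) for brevity; the paper's construction is what achieves O(n log n)):

def suffix_array_with_lcp(strings):
    text = " ".join(strings)  # step 1: concatenate with separators
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # step 2: naive suffix array
    lcp = [0] * len(sa)  # step 3: LCP of adjacent suffixes
    for i in range(1, len(sa)):
        a, b = sa[i - 1], sa[i]
        n = 0
        while (a + n < len(text) and b + n < len(text)
               and text[a + n] == text[b + n] and text[a + n] != " "):
            n += 1
        lcp[i] = n
    return sa, lcp

sa, lcp = suffix_array_with_lcp(["bbabab", "abacac", "bbaaa"])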
One idea would be to reverse the strings and then run a trie on them.
Such a simple trick from the book.
Once a common prefix is found, it is just a matter of reversing again.
E.g., results from the trie would be
[nam, name, ...]
Reversing these would give you the final answer:
[man, eman, ...]
Time complexity is increased by only O(n).
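A crude sketch of the reverse-and-count idea, with a Counter over all suffixes standing in for the trie (a real trie shares storage between suffixes, and keeping only the maximal suffixes from the expected output would need an extra pruning pass):

from collections import Counter

def common_suffixes(strings, k=2):
    # every suffix of every word; a trie over reversed words counts these implicitly
    counts = Counter(s[i:] for s in strings for i in range(len(s)))
    return {suf: c for suf, c in counts.items() if c >= k}

S = ["fireman", "woman", "businessman", "policeman", "businesswoman"]
print(common_suffixes(S, k=2))  # includes 'man', 'eman' and 'woman' among others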

Categories

Resources