Finding palindromic sequences of length 18 in FASTA file? - python

I want to design guide RNAs, so I need to find palindromic sequences in a FASTA file. I want to write a Python script that finds all palindromic sequences of length 18 throughout my sequence. I have the logic in mind but I don't know how to put it into Python. My logic is:
1) If position i is [ATCG] and position i+17 is the complementary base [TAGC], then check:
2) if i+1 is [ATCG] and i+16 is [TAGC], then check:
3) if i+2 is [ATCG] and i+15 is [TAGC], then check:
.
.
.
10) if i+9 is [ATCG] and i+10 is [TAGC] and all the above are true,
then recognize the sequence from i to i+17 as a palindrome. But I need to make sure that if position i is an A, only a T at position i+17 counts (and likewise for the other base pairs).
Any idea how to write this logic in Python?
Thanks,

So you want to match A with T and G with C. We can use a dictionary for that, and then check whether opposite positions form a pair.
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
for i in range(len(sequence) - 18 + 1):
pal = True
for j in range(9):
if pairs[ sequence[i+j] ] != sequence[i+17-j]:
pal = False
break
if pal:
print(sequence[i : i+18])
For a palindrome of any length n (note that for odd n the middle base is never checked):
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
n=18
for i in range(len(sequence) - n + 1):
pal = True
for j in range(n//2):
if pairs[ sequence[i+j] ] != sequence[i-j+n-1]:
pal = False
break
if pal:
print(sequence[i : i+n])
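Since the input is a FASTA file, sequence has to be read in first. A minimal sketch using Biopython (this assumes Biopython is installed; the file name input.fasta is a placeholder):
# sketch only: read each record from the FASTA file and scan it
from Bio import SeqIO

for record in SeqIO.parse("input.fasta", "fasta"):
    sequence = str(record.seq).upper()
    # ... run the palindrome scan above on `sequence` ...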

Looping over strings character by character in Python is slow; the built-in string operations are much more efficient.
# create a random test sequence
import random
random.seed(1234)
seq = "".join(random.choices(["A", "T", "C", "G"], k=99))
n = 4  # not exactly 18, but good enough as a test case
print(seq)
>>>GTAGGCCAGAAGTCCAAAATGACTCACTCCTTAGTCACAATTACACAGGGATATGAAGAGATTTGTGTGGTGGTAATACGTGCCTCGAGTAGCGTATAT
# dictionary for base-pair translation
bp = {"A": "T", "T": "A", "G": "C", "C": "G"}

# checks whether the first half translates into the reversed second half;
# returns False otherwise, e.g., if the length ls of s is not an even number
def palin(s):
    ls = len(s)
    if ls % 2:
        return False
    return s[:ls // 2] == "".join([bp[i] for i in s[ls:ls // 2 - 1:-1]])

# now to the actual test, checking all substrings of length n in our test sequence seq;
# returns tuples of the index within seq and the found substring
res = [(i, seq[i:i + n]) for i in range(len(seq) - n + 1) if palin(seq[i:i + n])]
print(res)
>>>[(3, 'GGCC'), (38, 'AATT'), (50, 'ATAT'), (77, 'ACGT'), (84, 'TCGA'), (94, 'TATA'), (95, 'ATAT')]
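If speed matters even more, the base-pair dictionary can be replaced with str.maketrans/str.translate, which pushes the whole comparison into the built-in string machinery. A sketch, equivalent to palin for even-length input (not benchmarked here):
# sketch: s is a reverse-complement palindrome iff it equals its own reverse complement
COMP = str.maketrans("ATGC", "TACG")

def palin_translate(s):
    return len(s) % 2 == 0 and s == s.translate(COMP)[::-1]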


Find the minimum number of words (distance) between repeated occurrences of a search string in the input string

Here are test cases for the code:
string - 'Tim had been saying that he had been there'
search - 'had'
expected output - 4
string - 'he got what he got and what he wanted'
search - 'he'
expected output - 2
def return_distance(input, search):
    words = input.split()
    distance = None
    indx = []
    if not input or not search:
        return None
    else:
        if words.count(search) > 1:
            indx = [index for index, word in enumerate(words) if word == search]
            distance = indx[1] - indx[0]
            for i in range(len(indx) - 1):
                distance = min(distance, indx[i + 1] - indx[i]) - 1
    return distance
I am wondering how to optimize the code. I admit it is poorly written.
How about
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])])
This splits the input sentence, makes a list of every index that matches the target word, then iterates over that list to compute the differences between consecutive indices and returns the minimum difference.
Since the behavior is unspecified when the sentence doesn't contain the word, this raises an error, but you can handle that case and return a value of your choice using min's default parameter:
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
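For example, applying this to the test cases from the question gives the expected results:
print(min_distance_between_words('Tim had been saying that he had been there', 'had'))  # 4
print(min_distance_between_words('he got what he got and what he wanted', 'he'))  # 2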
As an aside, naming a variable input shadows a builtin, and return_distance is a rather ambiguous name for a function.
Adding a precondition check for None/empty parameters, as done with if not input or not search:, is not typically done in Python (we assume the caller will always pass in a string and adhere to the function's contract).
If you want to generalize this further, move the split() duty to the caller, which enables the function to operate on arbitrary iterables:
def min_distance_between_occurrences(it, target):
    idxes = [i for i, e in enumerate(it) if e == target]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
Call with:
min_distance_between_occurrences("a b c a".split(), "a")
min_distance_between_occurrences([(1, 2), (1, 3), (1, 2)], (1, 2))
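These return 2 and 1 respectively: two elements lie between the two occurrences of "a", and one element lies between the two occurrences of (1, 2).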
Refactoring aside, as pointed out in the comments, the original code isn't correct. Issues include:
search_str does not exist. You probably meant search.
distance and min_dist don't really work together. Pick one or the other and use it for all minimum calculations.
min(min_dist, indx[i+1] - indx[i])-1 subtracts 1 in the wrong place, throwing off the count.
Here's a potential fix for these issues:
def return_distance(input, search):
    words = input.split()
    distance = None
    if words.count(search) > 1:
        indx = [index for index, word in enumerate(words) if word == search]
        distance = indx[1] - indx[0] - 1
        #                             ^^^^
        for i in range(len(indx) - 1):
            distance = min(distance, indx[i+1] - indx[i] - 1)
            #                                            ^^^^
    return distance
One way is to use min with a list comprehension over indx:
min_dist = min([indx[i+1] - indx[i] - 1 for i in range(len(indx) - 1)])

String replacement combinations with fixed parts

Let's say I have the string abcixigea and I want to replace the first 'i', the 'e', and the second 'a' with '1', '3' and '4', getting all the combinations of those "progressive" replacements.
So, I need to get:
abc1xigea
abcixig3a
abcixig34
abc1xige4
...and so on.
I tried itertools.product, following the question "python string replacement, all possible combinations #2", but the result I get is not exactly what I need, and I can see why.
However, I'm stuck trying to use combinations while keeping parts of the string fixed (changing just some chars, as explained above).
from itertools import product

s = "abc{}xig{}{}"
for combo in product(("i", 1), ("e", 3), ("a", 4)):
    print(s.format(*combo))
produces
abcixigea
abcixige4
abcixig3a
abcixig34
abc1xigea
abc1xige4
abc1xig3a
abc1xig34
Edit: in a more general way, you want something like:
from itertools import product

def find_nth(s, char, n):
    """
    Return the offset of the nth occurrence of char in s,
    or -1 on failure.
    """
    assert len(char) == 1
    offs = -1
    for _ in range(n):
        offs = s.find(char, offs + 1)
        if offs == -1:
            break
    return offs

def gen_replacements(base_string, *replacement_values):
    """
    Generate all string combinations from base_string
    by replacing some characters according to replacement_values.
    Each replacement_value is a tuple of
    (original_char, occurrence, replacement_char).
    """
    assert len(replacement_values) > 0
    # find the location of each character to be replaced
    replacement_offsets = [
        (find_nth(base_string, orig, occ), orig, occ, (orig, repl))
        for orig, occ, repl in replacement_values
    ]
    # put them in ascending order
    replacement_offsets.sort()
    # make sure all replacements are actually possible
    if replacement_offsets[0][0] == -1:
        raise ValueError("'{}' occurs less than {} times".format(replacement_offsets[0][1], replacement_offsets[0][2]))
    # create the format string and argument list
    args = []
    for i, (offs, _, _, arg) in enumerate(replacement_offsets):
        # we are replacing one char with two, so we have to
        # increase the offset of each replacement by
        # the number of replacements already made
        base_string = base_string[:offs + i] + "{}" + base_string[offs + i + 1:]
        args.append(arg)
    # ... and we feed that into the original code from above:
    for combo in product(*args):
        yield base_string.format(*combo)

def main():
    s = "abcixigea"
    for result in gen_replacements(s, ("i", 1, "1"), ("e", 1, "3"), ("a", 2, "4")):
        print(result)

if __name__ == "__main__":
    main()
which produces exactly the same output as above.

Longest common sequence between many sub-sequences

Fancy title :)
I have a file that contains the following:
>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...
Then I have a function that returns a dictionary with the sequence names as keys and the strings (joined onto one line) as values. The sequence numbers range from 40 to 59.
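For reference, a minimal sketch of how such a dictionary can be built (the file name sequences.txt is a placeholder; my actual parsing function is not shown):
def read_sequences(path):
    seqs = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
                seqs[name] = ""
            elif name is not None:
                seqs[name] += line
    return seqs

dictionar = read_sequences("sequences.txt")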
I want to take a dictionary of sequences and return the longest common sub-sequence found in ALL the sequences. I managed to find some help here on Stack Overflow and wrote code that only compares the LAST TWO strings in that dictionary, not all of them :).
This is the code:
def longest_common_sequence(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

for i in range(40, 59):
    s1 = str(dictionar['sequence_' + str(i)])
    s2 = str(dictionar['sequence_' + str(i + 1)])
    longest_common_sequence(s1, s2)
How can I modify it to get the common subsequence among ALL sequences in dictionary? Thanks!
EDIT: As @lmcarreiro pointed out, there is a relevant difference between substrings (or subarrays or sublists) and subsequences. To my understanding we are all talking about substrings here, so I will use this term in my answer.
Guillaume's answer can be improved:
def eachPossibleSubstring(string):
    for size in range(len(string) + 1, 0, -1):
        for start in range(len(string) - size + 1):
            yield string[start:start + size]

def findLongestCommonSubstring(strings):
    shortestString = min(strings, key=len)
    for substring in eachPossibleSubstring(shortestString):
        if all(substring in string
               for string in strings if string != shortestString):
            return substring

print(findLongestCommonSubstring([
    'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
    'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
]))
This prints:
ACDBACDBACDBACD
This is faster because it returns the first match found and searches from the longest substring to the shortest.
The basic idea is this: take each possible substring of the shortest of your strings (in order from longest to shortest) and see if it can be found in all the other strings. If so, return it; otherwise try the next substring.
You need to understand generators. Try e.g. this:
for substring in eachPossibleSubstring('abcd'):
    print(substring)
or
print(list(eachPossibleSubstring('abcd')))
I'd start by defining a function to return all possible subsequences of a given sequence:
from itertools import combinations_with_replacement

def subsequences(sequence):
    "returns all possible subsequences of a given sequence"
    # range(len(sequence) + 1) so substrings ending at the last character are included
    for start, stop in combinations_with_replacement(range(len(sequence) + 1), 2):
        if start < stop:
            yield sequence[start:stop]
then I'd make another method to check if a given subsequence is present in all given sequences:
def is_common_subsequence(sub, sequences):
    "returns True if <sub> is a common subsequence in all <sequences>"
    return all(sub in sequence for sequence in sequences)
then using the 2 methods above it is pretty easy to get all common subsequences in a given set of sequences:
def common_sequences(sequences):
    "return all subsequences common in sequences"
    shortest_seq = min(sequences, key=len)
    return set(subsequence for subsequence in subsequences(shortest_seq)
               if is_common_subsequence(subsequence, sequences))
... and extracting the longest sequence:
def longuest_common_subsequence(sequences):
    "returns the longest subsequence in sequences"
    return max(common_sequences(sequences), key=len)
Result:
sequences = {
    41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    42: '123ABCDEFGHIJKLMNOPQRSTUVW',
    43: '123456ABCDEFGHIJKLMNOPQRST'
}
sequences2 = {
    0: 'ABCDEFGHIJ',
    1: 'DHSABCDFKDDSA',
    2: 'SGABCEIDEFJRNF'
}
print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST
print(longuest_common_subsequence(sequences2.values()))
>>> ABC
Here is a possible approach. First let's define a function that returns the longest common substring between two strings:
def longest_substring(s1, s2):
    t = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    l, xl = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                t[x][y] = t[x - 1][y - 1] + 1
                if t[x][y] > l:
                    l = t[x][y]
                    xl = x
            else:
                t[x][y] = 0
    return s1[xl - l: xl]
Now I'll create a random dict of sequences for the example:
import random
import string

d = {i: ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}
print(d)
{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}
Finally, we look for the longest common substring over all pairs of sequences:
import itertools

max([longest_substring(i, j) for i, j in itertools.combinations(d.values(), 2)], key=len)
Output:
'VIL'

Search and replace multiple specific sequences of elements in Python list/array

I currently have 6 separate for loops which iterate over a list of numbers looking to match specific sequences of numbers within larger sequences, and replace them like this:
[...0,1,0...] => [...0,0,0...]
[...0,1,1,0...] => [...0,0,0,0...]
[...0,1,1,1,0...] => [...0,0,0,0,0...]
And their inverse:
[...1,0,1...] => [...1,1,1...]
[...1,0,0,1...] => [...1,1,1,1...]
[...1,0,0,0,1...] => [...1,1,1,1,1...]
My existing code is like this:
for i in range(len(output_array) - 2):
    if output_array[i] == 0 and output_array[i+1] == 1 and output_array[i+2] == 0:
        output_array[i+1] = 0

for i in range(len(output_array) - 3):
    if output_array[i] == 0 and output_array[i+1] == 1 and output_array[i+2] == 1 and output_array[i+3] == 0:
        output_array[i+1] = output_array[i+2] = 0
In total I'm iterating over the same output_array 6 times, using brute force checking. Is there a faster method?
# I would create a map between the string searched for and the new one.
patterns = {}
patterns['010'] = '000'
patterns['0110'] = '0000'
patterns['01110'] = '00000'

# I would loop over the lists
lists = [[0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]]
for lista in lists:
    # join the list elements into a string
    string_list = ''.join(map(str, lista))
    # loop over the patterns; if a pattern is detected, replace it
    for pattern, value in patterns.items():
        string_list = string_list.replace(pattern, value)
    # convert back to a list of ints
    lista = [int(c) for c in string_list]
    print(lista)
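With the sample list above this should print [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], since every run of 1s bounded by zeros gets flattened; the inverse patterns ('101' -> '111', and so on) would presumably be added to patterns in the same way.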
While this question is related to the questions Here and Here, the OP's question is about fast searching for multiple sequences at once. While the accepted answer works well, we may not want to loop through all the search sequences for every sub-iteration of the base sequence.
Below is an algorithm that checks for a sequence of i ints only if its prefix of (i-1) ints is present in the base sequence.
# This is the driver function which takes in a) the search sequences and
# replacements as a dictionary and b) the full sequence list in which to search
def findSeqswithinSeq(searchSequences, baseSequence):
    seqkeys = [[int(i) for i in elem.split(",")] for elem in searchSequences]
    maxlen = max([len(elem) for elem in seqkeys])
    decisiontree = getdecisiontree(seqkeys)
    i = 0
    while i < len(baseSequence):
        (increment, replacement) = get_increment_replacement(decisiontree, baseSequence[i:i+maxlen])
        if replacement != -1:
            baseSequence[i:i+len(replacement)] = searchSequences[",".join(map(str, replacement))]
        i += increment
    return baseSequence

# the following function gives the dictionary of intermediate sequences allowed
def getdecisiontree(searchsequences):
    dtree = {}
    for elem in searchsequences:
        for i in range(len(elem)):
            if i + 1 == len(elem):
                dtree[",".join(map(str, elem[:i+1]))] = True
            else:
                dtree[",".join(map(str, elem[:i+1]))] = False
    return dtree

# the following function does most of the work, giving us a) how many
# positions we can skip in the search and b) whether the search seq was found
def get_increment_replacement(decisiontree, sequence):
    if str(sequence[0]) not in decisiontree:
        return (1, -1)
    for i in range(1, len(sequence)):
        key = ",".join(map(str, sequence[:i+1]))
        if key not in decisiontree:
            return (1, -1)
        elif decisiontree[key] == True:
            key = [int(i) for i in key.split(",")]
            return (len(key), key)
    return (1, -1)
You can test the above code with this snippet:
if __name__ == "__main__":
    inputlist = [5, 4, 0, 1, 1, 1, 0, 2, 0, 1, 0, 99, 15, 1, 0, 1]
    patternsandrepls = {'0,1,0': [0, 0, 0],
                        '0,1,1,0': [0, 0, 0, 0],
                        '0,1,1,1,0': [0, 0, 0, 0, 0],
                        '1,0,1': [1, 1, 1],
                        '1,0,0,1': [1, 1, 1, 1],
                        '1,0,0,0,1': [1, 1, 1, 1, 1]}
    print(findSeqswithinSeq(patternsandrepls, inputlist))
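For the input above this should print [5, 4, 0, 0, 0, 0, 0, 2, 0, 0, 0, 99, 15, 1, 1, 1]: the 0,1,1,1,0 and 0,1,0 runs are zeroed out and the trailing 1,0,1 becomes 1,1,1.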
The proposed solution represents the sequences to be searched as a decision tree.
By skipping many of the search positions, we should be able to do better than O(m*n) with this method (where m is the number of search sequences and n is the length of the base sequence).
EDIT: Changed answer based on more clarity in edited question.

Checking if word segmentation is possible

This is a follow-up question to this response and the pseudo-code algorithm that the user posted. I didn't comment on that question because of its age. I am only interested in validating whether or not a string can be split into words; the algorithm doesn't need to actually split the string. This is the response from the linked question:
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if
the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for
i = 2 to length(w) calculate
S[i] = isWord(w[1..i]) or (for any j in {2..i}: S[j-1] and isWord(w[j..i])).
I'm translating this algorithm into simple Python code, but I'm not sure if I'm understanding it properly. Code:
def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, str_len):
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S
I have two related questions. 1) Is this code a proper translation of the linked algorithm into Python, and if it is, 2) Now that I have S, how do I use it to tell if the string is only comprised of words? In this case, is_word is a function that simply looks a given word up in a list. I haven't implemented it as a trie yet.
UPDATE: After updating the code to include the suggested change, it doesn't work. This is the updated code:
def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, i):  # THIS LINE WAS UPDATED
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S
a_string = "carrotforever"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints FALSE
a_string = "hello"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints TRUE
It should return True for both of these.
Here is a modified version of your code that should return good results.
Notice that your mistake was simply in the translation from pseudocode array indexing (starting at 1) to Python array indexing (starting at 0); therefore S[0] and S[1] were populated with the same value while S[L-1] was actually never computed. You can easily trace this mistake by printing the whole of S. You will find that S[3] is set to True in the first example when it should be S[2], for the word "car".
Also, you could speed up the process by storing the indices of composite words found so far, instead of testing each position.
def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * (str_len)
    # I replaced the is_word function by a simple list lookup;
    # feel free to replace it with whatever function you use.
    # Tries or suffix trees are best for this.
    S[0] = (a_string[0] in dictionary)
    for i in range(1, str_len):
        check = a_string[0:i+1] in dictionary  # i+1 instead of i
        if (check):
            S[i] = check
        else:
            for j in range(0, i+1):  # i+1 instead of i
                if (S[j-1] and (a_string[j:i+1] in dictionary)):  # i+1 instead of i
                    S[i] = True
                    break
    return S
a_string = "carrotforever"
S = is_all_words(a_string, ["a","car","carrot","for","eve","forever"])
print(S[len(a_string)-1]) #prints TRUE
a_string = "helloworld"
S = is_all_words(a_string, ["hello","world"])
print(S[len(a_string)-1]) #prints TRUE
For a real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the recursive approach. By modifying the score function you can prioritize longer matches.
Installation is easy with pip:
$ pip install wordsegment
And segment returns a list of words:
>>> import wordsegment
>>> wordsegment.segment('carrotforever')
['carrot', 'forever']
1) At first glance it looks good. One thing: for j in range(1, str_len): should be for j in range(1, i): I think.
2) If S[str_len-1] == True, then the whole string should consist of dictionary words only.
After all, S[i] is true iff
the whole string from 0 to i consists of a single dictionary word,
OR there exists some j < i with S[j-1] == True and string[j:i] a single dictionary word,
so if S[str_len-1] is true, then the whole string is composed of dictionary words.
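Putting that together, a small wrapper around the corrected is_all_words above can answer the yes/no question directly (a sketch; can_be_segmented is just an illustrative name):
def can_be_segmented(a_string, dictionary):
    S = is_all_words(a_string, dictionary)
    return len(S) > 0 and S[-1]

print(can_be_segmented("carrotforever", ["a", "car", "carrot", "for", "eve", "forever"]))  # True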
