Multiple mismatches in DNA search sequence regex


I have written this barbaric script to create permutations of a string of characters that contain n (up to n=4) $'s in all possible combinations of positions within the string. I will eventually .replace('$','(\\w)') to use for mismatches in a DNA search sequence. Because of the way I wrote the script, some of the permutations have fewer than the requested number of $'s. I then wrote a script to remove them, but it doesn't seem to be effective: each time I run the removal script, it removes more of the unwanted permutations. In the code pasted below, you will see that I test the function with a simple sequence with 4 mismatches. I then run a series of removal scripts that count how many expressions are removed each time. In my experience, it takes about 8 runs to remove all expressions with fewer than 4 wildcard $'s. I have a few questions about this:
1. Is there a built-in function for searches with 'n' mismatches? Maybe even in Biopython? So far, I've seen the Paul_McGuire_regex function from "Search for string allowing for one mismatch in any location of the string", which seems to generate only 1 mismatch. I must admit, I don't fully understand all of the code in the remaining functions on that page, as I am a very new coder.
2. Since I see this as a good exercise for me, is there a better way to write this entire script? Can I iterate the Paul_McGuire_regex function as many times as I need?
3. Most perplexing to me, why won't the removal script work 100% the first time?
Thanks for any help you can provide!
def Mismatch(Search,n):
    List = []
    SearchL = list(Search)
    if n > 4:
        return("Error: Maximum of 4 mismatches")
    for i in range(0,len(Search)):
        if n == 1:
            SearchL_i = list(Search)
            SearchL_i[i] = '$'
            List.append(''.join(SearchL_i))
        if n > 1:
            for j in range (0,len(Search)):
                if n == 2:
                    SearchL_j = list(Search)
                    SearchL_j[i] = '$'
                    SearchL_j[j] = '$'
                    List.append(''.join(SearchL_j))
                if n > 2:
                    for k in range(0,len(Search)):
                        if n == 3:
                            SearchL_k = list(Search)
                            SearchL_k[i] = '$'
                            SearchL_k[j] = '$'
                            SearchL_k[k] = '$'
                            List.append(''.join(SearchL_k))
                        if n > 3:
                            for l in range(0,len(Search)):
                                if n ==4:
                                    SearchL_l = list(Search)
                                    SearchL_l[i] = '$'
                                    SearchL_l[j] = '$'
                                    SearchL_l[k] = '$'
                                    SearchL_l[l] = '$'
                                    List.append(''.join(SearchL_l))
    counter=0
    for el in List:
        if el.count('$') < n:
            counter+=1
            List.remove(el)
    return(List)

List_RE = Mismatch('abcde',4)
counter = 0
for el in List_RE:
    if el.count('$') < 4:
        List_RE.remove(el)
        counter+=1
print("Filter2="+str(counter))

We can do away with questions 2 and 3 by answering question 1, but understanding question 3 is important so I'll do that first and then show how you can avoid it entirely:
Question 3
As to question 3, it's because when you loop over a list in python and make changes to it within the loop, the list that you loop over changes.
From the python docs on control flow (for statement section):
It is not safe to modify the sequence being iterated over in the loop
(this can only happen for mutable sequence types, such as lists).
Say your list is [a,b,c,d] and you loop through it with for el in List.
Say el is currently a and you do List.remove(el).
Now, your list is [b,c,d]. However, the iterator points to the second element in the list (since it's done the first), which is now c.
In essence, you've skipped b. So the problem is that you are modifying the list you are iterating over.
There are a few ways to fix this: if your List is not expensive to duplicate, you could make a copy. So iterate over List[:] but remove from List.
But suppose it's expensive to make copies of List all the time.
Then what you do is iterate over it backwards. Note the reversed below:
    for el in reversed(List):
        if el.count('$') < n:
            counter+=1
            List.remove(el)
    return(List)
In the example above, suppose we iterate backwards over List.
The iterator starts at d, and then goes to c.
Suppose we remove c, so that List=[a,b,d].
Since the iterator is going backwards, it now points to element b, so we haven't skipped anything.
Basically, this avoids modifying bits of the list you have yet to iterate over.
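The skipping behaviour described above is easy to reproduce. The sketch below first shows elements being skipped when a list is modified during iteration, and then filters with a list comprehension, which builds a new list instead of mutating the one being looped over. The list comprehension is an alternative that this answer does not mention, and the sample patterns are made up for illustration.

```python
# Removing while iterating: the iterator advances by index, so every
# element that slides into a just-visited position is skipped.
items = ['a', 'b', 'c', 'd']
for el in items:
    items.remove(el)
print(items)  # ['b', 'd'] -- 'b' and 'd' were never visited

# Building a new list avoids the problem entirely.
# (Sample patterns are hypothetical, chosen only for this demo.)
patterns = ['$$$$e', 'a$c$$', '$$$d$', 'ab$$$']
n = 4
kept = [p for p in patterns if p.count('$') >= n]
print(kept)  # ['$$$$e', '$$$d$']
```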
Questions 1 & 2
If I understand your question correctly, you basically want to choose n out of m positions, where m is the length of the string (abcde), and place a '$' in each of these n positions.
In that case, you can use the itertools module to do that.
import itertools

def Mismatch(Search,n):
    SearchL = list(Search)
    List = []  # hold output
    # generate every combination of n indices to replace with '$'
    idxs = itertools.combinations(range(len(SearchL)),n)
    # for each combination `idx` in idxs, replace chars[i] with '$':
    for idx in idxs:
        chars = SearchL[:]  # make a copy (named `chars` to avoid shadowing the built-in `str`)
        for i in idx:
            chars[i]='$'
        List.append( ''.join(chars) )  # convert back to string
    return List
Let's look at how this works:
1. Turn the Search string into a list so that individual characters can be replaced (strings are immutable), and create an empty List to hold the results.
2. idxs = itertools.combinations(range(len(SearchL)),n) says "find all subsets of length n in the set [0,1,2,3,...,length-of-search-string - 1]". Try
    idxs = itertools.combinations(range(5),4)
    for idx in idxs:
        print(idx)
   to see what I mean.
3. Each element of idxs is a tuple of n indices from 0 to len(SearchL)-1 (e.g. (0,1,2,4)). Replace the i'th character of the copy of SearchL with a '$' for each i in the tuple.
4. Convert the result back into a string and add it to List.
As an example:
Mismatch('abcde',3)
['$$$de', '$$c$e', '$$cd$', '$b$$e', '$b$d$', '$bc$$', 'a$$$e', 'a$$d$', 'a$c$$', 'ab$$$']
Mismatch('abcde',4) # note, the code you had made lots of duplicates.
['$$$$e', '$$$d$', '$$c$$', '$b$$$', 'a$$$$']
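To connect this back to the original goal of searching a DNA sequence, here is a minimal sketch that turns the generated combinations into a single regular expression. The function names mismatch_patterns and find_with_mismatches, and the sample sequence, are assumptions made for illustration; the sketch writes the wildcard \w directly instead of going through the '$' placeholder.

```python
import itertools
import re

def mismatch_patterns(search, n):
    """Return one regex per way of choosing n wildcard positions in `search`."""
    patterns = []
    for idx in itertools.combinations(range(len(search)), n):
        chars = [re.escape(c) for c in search]  # escape in case of special characters
        for i in idx:
            chars[i] = r"\w"  # any word character is allowed at a mismatch position
        patterns.append("".join(chars))
    return patterns

def find_with_mismatches(search, n, text):
    """Return start offsets of non-overlapping matches with up to n mismatches."""
    regex = re.compile("|".join(mismatch_patterns(search, n)))
    return [m.start() for m in regex.finditer(text)]

# Hypothetical sample data: "ACCT" at offset 2 differs from "ACGT" in one position.
hits = find_with_mismatches("ACGT", 1, "TTACCTGGACGTAA")
print(hits)  # [2, 8]
```

Because \w also matches the original character, a pattern with n wildcards matches anything with n or fewer mismatches. Note that finditer reports non-overlapping matches only.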

Related

Execution Timed Out on codewars when finding the least used element in an array. Needs to be optimized but dont know how

Instructions on codewars:
There is an array with some numbers. All numbers are equal except for one. Try to find it!
find_uniq([ 1, 1, 1, 2, 1, 1 ]) == 2
find_uniq([ 0, 0, 0.55, 0, 0 ]) == 0.55
It’s guaranteed that array contains at least 3 numbers.
The tests contain some very huge arrays, so think about performance.
This is the code I wrote:
def find_uniq(arr):
    for n in arr:
        if arr.count(n) == 1:
            return n
            exit()
It works as follows:
For every character in the array, if that character appears only once, it returns said character and exits the code. If the character appears more than once, it does nothing
When attempting the code on codewars, I get the following error:
STDERR
Execution Timed Out (12000 ms)
I am a beginner so I have no idea how to further optimize the code in order for it to not time out
The first version of my code looked like this:
def find_uniq(arr):
    arr.sort()
    rep = str(arr)
    for character in arr:
        cantidad = arr.count(character)
        if cantidad > 1:
            rep = rep.replace(str(character), "")
    rep = rep.replace("[", "")
    rep = rep.replace("]", "")
    rep = rep.replace(",", "")
    rep = rep.replace(" ", "")
    rep = float(rep)
    n = rep
    return n
After getting timed out, I assumed it was due to the repetitive replace functions and the fact that the code had to go through every element even if it had already found the correct one, since the code was deleting the incorrect ones, instead of just returning the correct one
After some iterations that I didn't save we got to the current code, which checks if the character is only once in the array, returns that and exits
def find_uniq(arr):
    for n in arr:
        if arr.count(n) == 1:
            return n
            exit()
I have no clue how to further optimize this

.count() iterates over the entire array every time that you call it. If your array has n elements, it will iterate over the array n times, which is quite slow.
You can use collections.Counter as Unmitigated suggests, but if you're not familiar with the module, it might seem overkill for this problem. Since in this case you know that there are only two unique elements in the array, you can get all of the unique elements using set(), and then check the frequency of each unique element:
def find_uniq(arr):
    for n in set(arr):
        if arr.count(n) == 1:
            return n

You can use a dict or collections.Counter to get the frequency of each element with linear time complexity. Then return the element with a frequency of one.
def find_uniq(l):
    from collections import Counter
    return Counter(l).most_common()[-1][0]

Compare the first two numbers. If they match, find the one in the array that doesn't match (longest solution). Otherwise, return the one that doesn't match the third. Coded:
def find_uniq(arr):
    if arr[0]==arr[1]:
        target=arr[0]
        for i in range(2,len(arr)):
            if arr[i] != target:
                return arr[i]
    else:
        if arr[0]==arr[2]:
            return arr[1]
        else:
            return arr[0]

In your original code:
def find_uniq(arr):
    for n in arr:
        if arr.count(n) == 1:
            return n
            exit()  # note: this line does nothing because you already returned
you're calling arr.count once for each element in the array (assuming the worst case scenario where the unique element is at the very end). Each call to arr.count(n) scans through the entire array counting up n -- so you're iterating over the entire array of N elements N times, which makes this O(N^2) -- very slow if N is big!
The second version of your code has the same problem, but it adds a huge amount of extra complexity by turning the list into a string and then trying to parse the string -- don't do that!
The way to make this fast is to iterate over the entire list once and keep track of the count of each item as you go. This is easiest to do with the built in collections.Counter class:
from collections import Counter

def find_uniq(arr):
    return next(i for i, c in Counter(arr).items() if c == 1)
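If Counter feels opaque, the same "count as you go" idea can be written with a plain dict. This is only a sketch of roughly what Counter does for this problem, not how the library is implemented; the name find_uniq_dict is made up to avoid clashing with the versions above.

```python
def find_uniq_dict(arr):
    counts = {}
    # Single pass over the list: tally each value as it is seen.
    for item in arr:
        counts[item] = counts.get(item, 0) + 1
    # Second pass over the (much smaller) dict of distinct values.
    for item, count in counts.items():
        if count == 1:
            return item
    return None  # no value occurs exactly once

print(find_uniq_dict([1, 1, 1, 2, 1, 1]))    # 2
print(find_uniq_dict([0, 0, 0.55, 0, 0]))    # 0.55
```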
Given the constraint that there are only two different values in the array and exactly one of them is unique, you can make this more efficient (such that you don't even need to iterate over the entire array in all cases) by breaking it into two possibilities: either the first two items are identical and you just need to look for the item that's not equal to those, or they're different and you just need to return the one that's not equal to the third.
def find_uniq(arr):
    if arr[0] == arr[1]:
        # First two items are the same, iterate through
        # the rest of the array to find the unique one.
        return next(i for i in arr if i != arr[0])
    # Otherwise, either arr[0] or arr[1] is unique.
    return arr[0] if arr[1] == arr[2] else arr[1]
In this approach, you only ever need to iterate through the array as far as the unique item (or exactly one item past it in the case where it's one of the first two items). In the specific case where the unique item is toward the start of a very long array, this will be much faster than an approach that iterates over the entire array no matter what. In the worst case scenario, you will still have only a single iteration.

sub-sum from a list without loops

So i'm studying recursion and have to write some codes using no loops
For a part of my code I want to check if I can sum up a subset of a list to a specific number, and if so return the indexes of those numbers on the list.
For example, if the list is [5,40,20,20,20] and i send it with the number 60, i want my output to be [1,2] since 40+20=60.
In case I can't get to the number, the output should be an empty list.
I started with
def find_sum(num,lst,i,sub_lst_sum,index_lst):
    if num == sub_lst_sum:
        return index_lst
    if i == len(lst): ## finished going over the list without getting to the sum
        return []
    if sub_lst_sum+lst[i] > num:
        return find_sum(num,lst,i+1,sub_lst_sum,index_lst)
    return ?..

index_lst = find_sum(60,[5,40,20,20,20],0,0,[])
num is the number i want to sum up to,
lst is the list of numbers
the last return should go over both the option that I count the current number in the list and not counting it.. (otherwise in the example it will take the five and there will be no solution).
I'm not sure how to do this..

Here's a hint. Perhaps the simplest way to go about it is to consider the following inductive reasoning to guide your recursion.
If
index_list = find_sum(num,lst,i+1)
Then
index_list = find_sum(num,lst,i)
That is, if a list of indices can be used to construct a sum num using elements from position i+1 onwards, then it is also a solution when using elements from position i onwards. That much should be clear. The second piece of inductive reasoning is,
If
index_list = find_sum(num-lst[i],lst,i+1)
Then
[i]+index_list = find_sum(num,lst,i)
That is, if a list of indices can be used to return a sum num-lst[i] using elements from position i+1 onwards, then you can use it to build a list of indices whose respective elements sum to num by prepending i.
These two bits of inductive reasoning can be translated into two recursive calls to solve the problem. Also, the first one I wrote should be used for the second recursive call and not the first (question: why?).
You might also want to rethink using an empty list for the base case where there is no solution. That can work, but you're returning as a solution a list that is not a solution. In Python I think None would be the standard idiomatic choice (but you might want to double-check that with someone more well-versed in Python than me).
Fill in the blanks
def find_sum(num,lst,i):
    if num == 0 :
        return []
    elif i == len(lst) :
        return None
    else :
        ixs = find_sum(???,lst,i+1)
        if ixs != None :
            return ???
        else :
            return find_sum(???,lst,i+1)
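For readers who want to check their attempt, here is one possible way to fill in the blanks, following the two inductive steps described above. It is a sketch rather than the only valid completion; the default value i=0 is a convenience added here and is not part of the skeleton.

```python
def find_sum(num, lst, i=0):
    if num == 0:
        return []            # nothing left to make up: the empty selection works
    if i == len(lst):
        return None          # ran out of elements without reaching the sum
    # First try including lst[i]: the rest must then add up to num - lst[i].
    ixs = find_sum(num - lst[i], lst, i + 1)
    if ixs is not None:
        return [i] + ixs
    # Otherwise skip lst[i] and look for the full sum among later elements.
    return find_sum(num, lst, i + 1)

print(find_sum(60, [5, 40, 20, 20, 20]))  # [1, 2]
print(find_sum(7, [1, 2]))                # None
```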

Extracting the subsequence of maximum length from a sequence [PYTHON] [duplicate]

This question already has an answer here:
Longest increasing unique subsequence
(1 answer)
Closed 6 years ago.
I have a sequence of values [1,2,3,4,1,5,1,6,7], and I have to find the longest run of consecutive increasing values. That is, the function needs to stop counting once it reaches a number lower than the previous one. The answer for this sequence is therefore [1,2,3,4], as it has 4 values before being reset. How would I write the Python code for this?
Note: Finding the "longest increasing subsequence" seems to be a common challenge and so searching online I find a lot of solutions that would count for the entire length of the sequence, and return a subsequence of increasing values, ignoring any decrease, so in this case it would return [1,2,3,4,5,6,7]. That is not what I'm looking for.
It needs to count each subsequence, and reset the count upon reaching a number lower than the previous one. It then needs to compare all the subsequences counted, and return the longest one.
Thanks in advance.

Consider a function that generates all possible ascending subsequences: you would start with an empty list and add items until one element was less than (or equal to?) the previous one, at which point you save (yield) the subsequence and restart with a new subsequence.
One implementation using a generator could be this:
def all_ascending_subsequences(sequence):
    #use an iterator so we can pull out the first element without slicing
    seq = iter(sequence)
    try: #NOTE 1
        last = next(seq) # grab the first element from the sequence
    except StopIteration: # or if there are none just return
        #yield [] #NOTE 2
        return
    sub = [last]
    for value in seq:
        if value > last: #or check if their difference is exactly 1 etc.
            sub.append(value)
        else: #end of the subsequence, yield it and reset sub
            yield sub
            sub = [value]
        last = value
    #after the loop we send the final subsequence
    yield sub
Two notes about the handling of empty sequences:
NOTE 1: To finish a generator, a StopIteration needs to be raised, so we could just let the one from next(seq) propagate out. However, when from __future__ import generator_stop is in effect (the default behaviour since Python 3.7), that would cause a RuntimeError, so to be future compatible we need to catch it and explicitly return.
NOTE 2: As I've written it, passing an empty list to all_ascending_subsequences generates no values, which may not be the desired behaviour. Feel free to uncomment the yield [] to generate an empty list when passed an empty list.
Then you can just get the longest by calling max on the result with key=len
b = [1,2,3,4,1,5,1,6,7]
result = max(all_ascending_subsequences(b),key=len)
print("longest is", result)
#print(*all_ascending_subsequences(b))

b = [4,1,6,3,4,5,6,7,3,9,1,0]

def findsub(a):
    start = -1
    count = 0
    cur = a[0]
    for i, n in enumerate(a):
        if n == cur+1:  # compare values with ==, not `is`
            if start == -1:
                start = i - 2
                count=1
            count+=1
            cur = n
        if n < cur and count > 1:
            return [a[j] for j in range(start,start+count+1)]

print(findsub(b))
A somewhat sloppy algorithm, but I believe it does what you want. Usually i would not have simply shown you code, but I suspect that is what you wanted, and I hope you can learn from it, and create your own from what you learn.
a slightly better looking way because I didn't like that:
b = [1,2,0,1,2,3,4,5]

def findsub(a):
    answer = [a[0]]
    for i in a[1:]:
        if answer[-1] + 1 == i:  # compare values with ==, not `is`
            answer.append(i)
        if not len(answer) > 1:
            answer = [i]
        elif i < answer[-1] and len(answer) > 1:
            return answer
    return answer

print(findsub(b))

You need to do the following:
1. Create a function W that, given a list, returns the index of the first item that breaks the strictly increasing run starting at the beginning of the list (or the length of the list if no item does).
   For example, given the lists [1,2,3,4,1,2,3], [4,2,1], [5,6,7,8], it should return 4, 1, 4, respectively.
2. Create a variable maxim and set its value to 0.
3. Repeatedly do the following until your list is empty:
   - Call your function W on your list, and call the return value x.
   - If x is greater than maxim, set maxim to x. At this point, if you wish to store this sequence, you can use list-slice notation to get the portion of your list that contains it.
   - Delete that portion of your list from the list.
4. Finally, print maxim and, if you were storing the parts of your list containing the longest sequence, print the last one you stored.
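The steps above can be sketched as follows. The function names are made up for illustration (increasing_prefix_length plays the role of W), and the sketch stores the best run itself rather than a separate maxim counter, since its length carries the same information.

```python
def increasing_prefix_length(values):
    """Index of the first item that breaks the strictly increasing run
    from the start, or len(values) if nothing breaks it (the W function)."""
    for i in range(1, len(values)):
        if values[i] <= values[i - 1]:
            return i
    return len(values)

def longest_increasing_run(values):
    remaining = list(values)   # work on a copy so the caller's list is untouched
    best = []
    while remaining:
        x = increasing_prefix_length(remaining)
        if x > len(best):
            best = remaining[:x]      # keep the run via slicing
        remaining = remaining[x:]     # drop the run just examined
    return best

print(longest_increasing_run([1, 2, 3, 4, 1, 5, 1, 6, 7]))  # [1, 2, 3, 4]
```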

What is the purpose of the fourth line of code particularly "l[j][-1]" here? [duplicate]

This question already has answers here:
Understanding slicing
(38 answers)
Closed 7 years ago.
The following is code to find the longest increasing subsequence of an array/list, in this case d. The 4th line confuses me: what does l[j][-1] mean? Let's say i=1. What would l[j][-1] be?
d = [10,300,40,43,2,69]
l = []
for i in range(len(d)):
    l.append(max([l[j] for j in range(i) if l[j][-1] < d[i]] or [[]], key=len) + [d[i]])

When I hit something like this, I consider that either I'm facing a smart algorithm that's beyond my trivial understanding, or that I'm just seeing tight code written by somebody while their mind was well soaked with the problem at hand, and not thinking about the guy who'd try to read it later (that guy most often being the writer himself, six months later... unless he's an APL hacker...).
So I take the code and try to deobfuscate it. I copy/paste it into my text editor, and from there I rework it to split complex statements or expressions into smaller ones, and to assign intermediate values to variables with meaningful names. I do so iteratively, peeling off one layer of convolution at a time, until the result makes sense to me.
I did just that for your code. I believe you'll understand it easily by yourself in this shape. And me not being all too familiar with Python, I commented a few things for my own understanding at least as much as for yours.
This resulting code could use a bit of reordering and simplification, but I left it in the form that would map most easily to the original four-liner, being the direct result of its expansion. Here we go:
data = [10,300,40,43,2,69]
increasingSubsequences = []
for i in range(len(data)):
    # increasingSubsequences is a list of lists
    # increasingSubsequences[j][-1] is the last element of increasingSubsequences[j]
    candidatesForMoreIncrease = []
    for j in range(i):
        if data[i] > increasingSubsequences[j][-1]:
            candidatesForMoreIncrease.append( increasingSubsequences[j] )
    if len(candidatesForMoreIncrease) != 0:
        nonEmpty = candidatesForMoreIncrease
    else:
        nonEmpty = [[]]
    longestCandidate = max( nonEmpty, key=len ) # keep the LONGEST of the retained lists
                                                # with the empty list included as a bottom element
    # (stuff + [data[i]]) is like copying stuff and then appending data[i] to the copy
    increasingSubsequences.append( longestCandidate + [data[i]] )
print("All increasingSubsequences : ", increasingSubsequences)
print("The result you expected : ", max(increasingSubsequences, key=len))

This looks like python to me.
max(arg1, arg2, *args[, key])
Returns the largest item in an iterable or the largest of two or more arguments.
If one positional argument is provided, iterable must be a non-empty iterable (such as a non-empty string, tuple or list). The largest item in the iterable is returned. If two or more positional arguments are provided, the largest of the positional arguments is returned.
The optional key argument specifies a one-argument ordering function like that used for list.sort(). The key argument, if supplied, must be in keyword form (for example, max(a,b,c,key=func)).
In this case, each pass through the loop appends to l the longest list selected by max, with d[i] added to its end.
The first argument to max is the iterable:
[l[j] for j in range(i) if l[j][-1] < d[i]]
or
[[]]
In words: collect l[j] for every j in range(i) whose last element, l[j][-1], is less than d[i]. If no list qualifies, the comprehension is empty (and therefore falsy), so or falls back to [[]], a list containing one empty list.
The second argument is the key:
key=len
so max compares the candidate lists by length and returns the longest one.
Add [d[i]] to the list that max returns.
Append this result to l.
Append simply adds an item to the end of the list; equivalent to a[len(a):] = [x].
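To answer the "what would l[j][-1] be when i=1" part directly, the following sketch runs the original loop with the comprehension pulled out into a variable. The names candidates and last_of_first are introduced here for illustration only.

```python
d = [10, 300, 40, 43, 2, 69]
l = []
for i in range(len(d)):
    # Same comprehension as in the question, split out for inspection.
    candidates = [l[j] for j in range(i) if l[j][-1] < d[i]]
    if i == 1:
        # Only j = 0 exists at this point; l[0] is [10], so l[0][-1] is 10.
        last_of_first = l[0][-1]
    # An empty candidates list is falsy, so `or` substitutes [[]].
    l.append(max(candidates or [[]], key=len) + [d[i]])

print(last_of_first)     # 10
print(l)                 # [[10], [10, 300], [10, 40], [10, 40, 43], [2], [10, 40, 43, 69]]
print(max(l, key=len))   # [10, 40, 43, 69]
```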

Count the number of occurrences of a given item in a (sorted) list?

I'm asked to create a method that returns the number of occurrences of a given item in a list. I know how to write code to find a specific item, but how can I code it to where it counts the number of occurrences of a random item.
For example, if I have a list [4, 6, 4, 3, 6, 4, 9] and I type something like
s1.count(4), it should return 3, and s1.count(6) should return 2.
I'm not allowed to use any built-in functions though.
In a recent assignment, I was asked to count the number of occurrences that sub string "ou" appeared in a given string, and I coded it
def count_pattern(astr):
    if len(astr) < 2:
        return 0
    else:
        return (astr[:2] == "ou")+ count_pattern(astr[1:])
Would something like this work??
def count(self, item):
    num=0
    for i in self.s_list:
        if i in self.s_list:
            num[i] +=1

def __str__(self):
    return str(self.s_list)

If this list is already sorted, the "most efficient" method -- in terms of Big-O -- would be to perform a binary search with a count-forward/count-backward if the value was found.
However, for an unsorted list as in the example, then the only way to count the occurrences is to go through each item in turn (or sort it first ;-). Here is some pseudo-code, note that it is simpler than the code presented in the original post (there is no if x in list or count[x]):
set count to 0
for each element in the list:
    if the element is what we are looking for:
        add one to count
Happy coding.

If I told you to count the number of fours in the following list, how would you do it?
1 4 2 4 3 8 2 1 4 2 4 9 7 4
You would start by remembering no fours yet, and add 1 for each element that equals 4. To traverse a list, you can use a for statement. Given an element of the list el, you can check whether it is four like this:
if el == 4:
    # TODO: Add 1 to the counter here
In response to your edit:
You're currently testing if i in self.s_list:, which doesn't make any sense since i is an element of the list and therefore always present in it.
When adding to a number, you simply write num += 1. Brackets are only necessary if you want to access the values of a list or dictionary.
Also, don't forget to return num at the end of the function so that somebody calling it gets the result back.
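Putting those three corrections together, a count method might look like the sketch below. The class name SortedList and its constructor are assumptions, since the question only shows the count and __str__ methods and the s_list attribute.

```python
class SortedList:  # hypothetical class name; only s_list, count and __str__ appear in the question
    def __init__(self, items):
        self.s_list = items

    def count(self, item):
        num = 0
        for el in self.s_list:
            if el == item:   # compare each element with the item being counted
                num += 1     # num is a plain number, so no brackets
        return num           # hand the result back to the caller

    def __str__(self):
        return str(self.s_list)

s1 = SortedList([4, 6, 4, 3, 6, 4, 9])
print(s1.count(4))  # 3
print(s1.count(6))  # 2
```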

Actually, the most efficient method in terms of Big-O would be O(log n). @pst's method would result in O(log n + s), where s is the number of occurrences, which could become linear if the array is made up of equal elements.
The way to achieve O(log n) is to use 2 binary searches (which gives O(2 log n), but we discard constants, so it is still O(log n)) that are modified to not have an equality test, therefore making all searches unsuccessful. However, on an unsuccessful search (low > high) we return low.
In the first search, if the middle element is greater than your search term, recurse into the lower part of the array, else recurse into the higher part. In the second search, reverse the binary comparison.
The first search yields the right boundary of the run of equal elements and the second search yields the left boundary. Simply subtract to get the number of occurrences.
Based on algorithm described in Skiena.
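A minimal sketch of the two-search idea follows. It is written iteratively rather than recursively, and the function names are made up for illustration; it assumes the list is sorted in ascending order.

```python
def right_boundary(s, key):
    """Index of the first element greater than key (len(s) if none)."""
    low, high = 0, len(s) - 1
    while low <= high:
        mid = (low + high) // 2
        if s[mid] > key:
            high = mid - 1   # middle is too big: continue in the lower part
        else:
            low = mid + 1    # middle is <= key: continue in the higher part
    return low               # the search always "fails"; low marks the boundary

def left_boundary(s, key):
    """Index of the first element not less than key (len(s) if none)."""
    low, high = 0, len(s) - 1
    while low <= high:
        mid = (low + high) // 2
        if s[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return low

def count_sorted(s, key):
    return right_boundary(s, key) - left_boundary(s, key)

data = [3, 4, 4, 4, 6, 6, 9]   # the question's list, sorted
print(count_sorted(data, 4))   # 3
print(count_sorted(data, 6))   # 2
```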

This seems like a homework... anyways. Try list.count(item). That should do the job.
Third or fourth element here:
http://docs.python.org/tutorial/datastructures.html
Edit:
try something else like:
bukket = dict()
for elem in astr:
    if elem not in bukket.keys():
        bukket[elem] = 1
    else:
        bukket[elem] += 1
You can now get all the distinct elements of the list with bukket.keys() and the corresponding number of occurrences with bukket[key].
So you can test it:
import random
l = []
for i in range(0,200):
    l.append(random.randint(0,20))
print(l)
l.sort()
print(l)
bukket = dict()
for elem in l:
    if elem not in bukket.keys():
        bukket[elem] = 1
    else:
        bukket[elem] += 1
print(bukket)
