I am trying to randomly generate multiple short 5 base-pair DNA sequences. Among them, I want to pick the sequences that meet the following conditions:
If the first letter is A then the last letter cannot be T
If the first letter is T then the last letter cannot be A
If the first letter is C then the last letter cannot be G
If the first letter is G then the last letter cannot be C
The same requirements are repeated for the second and the second to the last letters.
I am currently using a very long If-Statement to make the first-last letter work, but I was wondering if there is a simple way to achieve the same result so I don't have to repeat the long statement for making the second-second-to-the-last letter work? If so, how should I change the code? Thank you.
import itertools
a = "ATCG"
for output in itertools.product(a, repeat=5):
if((output[0] == 'A') and (output[4] != 'T')) or ((output[0] == 'T') and (output[4] != 'A')) or ((output[0] == 'C') and (output[4] != 'G')) or ((output[0] == 'G') and (output[4] != "C")):
list = "".join(output)
print(list)
'''
Here is a dictionary containing the forbidden opposite:
forbidden = {
'A': 'T',
'T': 'A',
'C': 'G',
'G': 'C',
}
Now you can check that the character at index -1 - i is not the forbidden opposite of the one at i by doing a simple lookup. The trick is to loop only over the first half of the string:
def check(s):
for i in range(len(s) // 2):
if s[-1 - i] == forbidden[s[i]]:
return False
return True
Incidentally, this will work correctly on both even and odd string lengths.
for sequence in map(''.join, itertools.product(forbidden.keys(), repeat=5)):
if check(sequence):
print(sequence)
All that being as it may, it's a bit inefficient to generate a bunch of extra sequences when you only want ones matching a specific pattern. The pattern is that the first half of your string is constrained to 4 options, while the second half is to 3. You can therefore generate only matching patterns with something like this:
def generate(n=5):
first = random.choices('ATCG', k=(n + 1) // 2)
second = random.choices('ATC', k = n // 2)
second = ['G' if s == forbidden[f] else s for f, s in zip(first, second)]
return ''.join(first + second[::-1])
Given that only one character is forbidden, you can generate any three characters for the second half, and replace forbidden ones with the missing. The second half then gets reversed because of how you actually want to compare the halves.
Are you looking for something like this?
You can define regular expression to filter your outputs.
To learn more about regular expression: https://docs.python.org/3/library/re.html
import itertools
import re
a = "ATCG"
case1 = ["(^[A].{3}[^T]$)",
"(^[T].{3}[^A]$)",
"(^[C].{3}[^G]$)",
"(^[G].{3}[^C]$)"]
case2 = ["(^.[A].[^T].$)",
"(^.[T].[^A].$)",
"(^.[C].[^G].$)",
"(^.[G].[^C].$)"]
case1_filter = '|'.join(case1)
case2_filter = '|'.join(case2)
for output in itertools.product(a, repeat=5):
sequence = ''.join(output)
if re.match(case1_filter, sequence) and re.match(case2_filter, sequence):
print(''.join(output))
I'd use sets:
disallowed = [{'A', 'T'},
{'C', 'G'}]
for output in itertools.product(a, repeat=5):
first_last = {output[0], output[4]}
second_fourth = {output[1], output[3]}
pairs = (first_last, second_fourth)
if all(pair not in disallowed for pair in pairs):
sequence = "".join(output)
print(sequence)
We're using the all function which takes a Python sequence and will return True if all items in the sequence evaluate to True. This means it only evaluates enough to determine it: once one value evaluates False it stops comparing because the result will always then be False.
pairs is just a tuple of the two sets. This makes it easy to iterate over the two sets. Otherwise we would just have to write the comparison twice. I'd rather do it in a loop and write the comparison once.
The sequence we pass to the all function is each of the two pairs, and we check to see whether it is not in disallowed. If both pairs are not in disallowed then all returns True.
Related
I am trying to solve this problem on HackerRank and I am having a issue with my logic. I am confused and not able to think what I'm doing wrong, feels like I'm stuck in logic.
Question link: https://www.hackerrank.com/challenges/game-of-thrones/
I created a dictionary of alphabets with value 0. And then counting number of times the alphabet appears in the string. If there are more than 1 alphabet characters occurring 1 times in string, then obviously that string cannot become a palindrome. That's my logic, however it only pass 10/21 test cases.
Here's my code:
def gameOfThrones(s):
alpha_dict = {chr(x): 0 for x in range(97,123)}
counter = 0
for i in s:
if i in alpha_dict:
alpha_dict[i] += 1
for key in alpha_dict.values():
if key == 1:
counter += 1
if counter <= 1:
return 'YES'
else:
return 'NO'
Any idea where I'm going wrong?
Explanation
The issue is that the code doesn't really look for palindromes. Let's step through it with a sample text based on a valid one that they gave: aaabbbbb (the only difference between this and their example is that there is an extra b).
Your first for loop counts how many times the letters appear in the string. In this case, 3 a and 5 b with all the other characters showing up 0 times (quick aside, the end of the range function is exclusive so this would not count any z characters that might show up).
The next for loop counts how many character there are that show up only once in the string. This string is made up of multiple a and b characters, more than the check that you have for if key == 1 so it doesn't trigger it. Since the count is less than 1, it returns YES and exits. However aaabbbbb is not a palindrome unscrambled.
Suggestion
To fix it, I would suggest having more than just one function so you can break down exactly what you need. For example, you can have a function that would return a list of all the unscrambled possibilities.
def allUnscrambled(string)->list:
# find all possible iterations of the string
# if given 'aabb', return 'aabb', 'abab', 'abba', 'bbaa', 'baba', 'baab'
return lstOfStrings
After this, create a palindrome checker. You can use the one shown by Dmitriy or create your own.
def checkIfPalindrome(string)->bool:
# determine if the given string is a palindrome
return isOrNotPalindrome
Put the two together to get a function that will, given a list of strings, determine if at least one of them is a palindrome. If it is, that means the original string is an anagrammed palindrome.
def palindromeInList(lst)->bool:
# given the list of strings from allUnscrambled(str), is any of them a palindrome?
return isPalindromeInList
Your function gameOfThrones(s) can then call this palindromeInList( allUnscrambled(s) ) and then return YES or NO depending on the result. Breaking it up into smaller pieces and delegating tasks is usually a good way to handle these problems.
Corrected the logic in my solution. I was just comparing key == 1 and not with every odd element.
So the corrected code looks like:
for key in alpha_dict.values():
if key % 2 == 1:
counter += 1
It passes all the testcases on HackerRank website.
The property that you have to check on the input string is that the number of characters with odd repetitions must be less than 1. So, the main ingredients to cook you recipe are:
a counter for each character
an hash map to store the counters, having the characters as keys
iterate over the input string
A plain implementation could be:
def gameOfThrones(s):
counters = {}
for c in s:
counters[c] = counters.get(c, 0) + 1
n_odd_characters = sum(v % 2 for v in counters.values())
Using a functional approach, based on reduce from functools:
from functools import reduce
def gamesOfThrones(s):
return ['NO', 'YES'][len(reduce(
lambda x, y: (x | {y: 1}) if y not in x else (x.pop(y) and x),
s,
{}
)) <= 1]
If you want, you can use the Counter class from collections to make your code more concise:
def gamesOfThrones(s):
return ['NO', 'YES'][sum([v % 2 for v in Counter(s).values() ]) <= 1]
so i need to code a program which, for example if given the input 3[a]2[b], prints "aaabb" or when given 3[ab]2[c],prints "abababcc"(basicly prints that amount of that letter in the given order). i tried to use a for loop to iterate the first given input and then detect "[" letters in it so it'll know that to repeatedly print but i don't know how i can make it also understand where that string ends
also this is where i could get it to,which probably isnt too useful:
string=input()
string=string[::-1]
bulundu=6
for i in string:
if i!="]":
if i!="[":
lst.append(i)
if i=="[":
break
The approach I took is to remove the brackets, split the items into a list, then walk the list, and if the item is a number, add that many repeats of the next item to the result for output:
import re
data = "3[a]2[b]"
# Remove brackets and convert to a list
data = re.sub(r'[\[\]]', ' ', data).split()
result = []
for i, item in enumerate(data):
# If item is a number, print that many of the next item
if item.isdigit():
result.append(data[i+1] * int(item))
print(''.join(result))
# aaabb
A different approach, inspired by Subbu's use of re.findall. This approach finds all 'pairs' of numbers and letters using match groups, then multiplies them to produce the required text:
import re
data = "3[a]2[b]"
matches = re.findall('(\d+)\[([a-zA-Z]+)\]',data)
# [(3, 'a'), (2, 'b')]
for x in matches:
print(x[1] * int(x[0]), end='')
#aaabb
Lenghty and documented version using NO regex but simple string and list manipulation:
first split the input into parts that are numbers and texts
then recombinate them again
I opted to document with inline comments
This could be done like so:
# testcases are tuples of input and correct result
testcases = [ ("3[a]2[b]","aaabb"),
("3[ab]2[c]","abababcc"),
("5[12]6[c]","1212121212cccccc"),
("22[a]","a"*22)]
# now we use our algo for all those testcases
for inp,res in testcases:
split_inp = [] # list that takes the splitted values of the input
num = 0 # accumulator variable for more-then-1-digit numbers
in_text = False # bool that tells us if we are currently collecting letters
# go over all letters : O(n)
for c in inp:
# when a [ is reached our num is complete and we need to store it
# we collect all further letters until next ] in a list that we
# add at the end of your split_inp
if c == "[":
split_inp.append(num) # add the completed number
num = 0 # and reset it to 0
in_text = True # now in text
split_inp.append([]) # add a list to collect letters
# done collecting letters
elif c == "]":
in_text = False # no longer collecting, convert letters
split_inp[-1] = ''.join(split_inp[-1]) # to text
# between [ and ] ... simply add letter to list at end
elif in_text:
split_inp[-1].append(c) # add letter
# currently collecting numbers
else:
num *= 10 # increase current number by factor 10
num += int(c) # add newest number
print(repr(inp), split_inp, sep="\n") # debugging output for parsing part
# now we need to build the string from our parsed data
amount = 0
result = [] # intermediate list to join ['aaa','bb']
# iterate the list, if int remember it, it text, build composite
for part in split_inp:
if isinstance(part, int):
amount = part
else:
result.append(part*amount)
# join the parts
result = ''.join(result)
# check if all worked out
if result == res:
print("CORRECT: ", result + "\n")
else:
print (f"INCORRECT: should be '{res}' but is '{result}'\n")
Result:
'3[a]2[b]'
[3, 'a', 2, 'b']
CORRECT: aaabb
'3[ab]2[c]'
[3, 'ab', 2, 'c']
CORRECT: abababcc
'5[12]6[c]'
[5, '12', 6, 'c']
CORRECT: 1212121212cccccc
'22[a]'
[22, 'a']
CORRECT: aaaaaaaaaaaaaaaaaaaaaa
This will also handle cases of '5[12]' wich some of the other solutions wont.
You can capture both the number of repetitions n and the pattern to repeat v in one go using the described pattern. This essentially matches any sequence of digits - which is the first group we need to capture, reason why \d+ is between brackets (..) - followed by a [, followed by anything - this anything is the second pattern of interest, hence it is between backets (...) - which is then followed by a ].
findall will find all these matches in the passed line, then the first match - the number - will be cast to an int and used as a multiplier for the string pattern. The list of int(n) * v is then joined with an empty space. Malformed patterns may throw exceptions or return nothing.
Anyway, in code:
import re
pattern = re.compile("(\d+)\[(.*?)\]")
def func(x): return "".join([v*int(n) for n,v in pattern.findall(x)])
print(func("3[a]2[b]"))
print(func("3[ab]2[c]"))
OUTPUT
aaabb
abababcc
FOLLOW UP
Another solution which achieves the same result, without using regular expression (ok, not nice at all, I get it...):
def func(s): return "".join([int(x[0])*x[1] for x in map(lambda x:x.split("["), s.split("]")) if len(x) == 2])
I am not much more than a beginner and looking at the other answers, I thought understanding regex might be a challenge for a new contributor such as yourself since I myself haven't really dealt with regex.
The beginner friendly way to do this might be to loop through the input string and use string functions like isnumeric() and isalpha()
data = "3[a]2[b]"
chars = []
nums = []
substrings = []
for i, char in enumerate(data):
if char.isnumeric():
nums.append(char)
if char.isalpha():
chars.append(char)
for i, char in enumerate(chars):
substrings.append(char * int(nums[i]))
string = "".join(substrings)
print(string)
OUTPUT:
aaabb
And on trying different values for data:
data = "0[a]2[b]3[p]"
OUTPUT bbppp
data = "1[a]1[a]2[a]"
OUTPUT aaaa
NOTE: In case you're not familiar with the above functions, they are string functions, which are fairly self-explanatory. They are used as <your_string_here>.isalpha() which returns true if and only if the string is an alphabet (whitespace, numerics, and symbols return false
And, similarly for isnumeric()
For example,
"]".isnumeric() and "]".isalpha() return False
"a".isalpha() returns True
IF YOU NEED ANY CLARIFICATION ON A FUNCTION USED, PLEASE DO NOT HESITATE TO LEAVE A COMMENT
So I have a challenge I'm working on - find the longest string of alphabetical characters in a string. For example, "abcghiijkyxz" should result in "ghiijk" (Yes the i is doubled).
I've been doing quite a bit with loops to solve the problem - iterating over the entire string, then for each character, starting a second loop using lower and ord. No help needed writing that loop.
However, it was suggested to me that Regex would be great for this sort of thing. My regex is weak (I know how to grab a static set, my look-forwards knowledge extends to knowing they exist). How would I write a Regex to look forward, and check future characters for being next in alphabetical order? Or is the suggestion to use Regex not practical for this type of thing?
Edit: The general consensus seems to be that Regex is indeed terrible for this type of thing.
Just to demonstrate why regex is not practical for this sort of thing, here is a regex that would match ghiijk in your given example of abcghiijkyxz. Note it'll also match abc, y, x, z since they should technically be considered for longest string of alphabetical characters in order. Unfortunately, you can't determine which is the longest with regex alone, but this does give you all the possibilities. Please note that this regex works for PCRE and will not work with python's re module! Also, note that python's regex library does not currently support (*ACCEPT). Although I haven't tested, the pyre2 package (python wrapper for Google's re2 pyre2 using Cython) claims it supports the (*ACCEPT) control verb, so this may currently be possible using python.
See regex in use here
((?:a+(?(?!b)(*ACCEPT))|b+(?(?!c)(*ACCEPT))|c+(?(?!d)(*ACCEPT))|d+(?(?!e)(*ACCEPT))|e+(?(?!f)(*ACCEPT))|f+(?(?!g)(*ACCEPT))|g+(?(?!h)(*ACCEPT))|h+(?(?!i)(*ACCEPT))|i+(?(?!j)(*ACCEPT))|j+(?(?!k)(*ACCEPT))|k+(?(?!l)(*ACCEPT))|l+(?(?!m)(*ACCEPT))|m+(?(?!n)(*ACCEPT))|n+(?(?!o)(*ACCEPT))|o+(?(?!p)(*ACCEPT))|p+(?(?!q)(*ACCEPT))|q+(?(?!r)(*ACCEPT))|r+(?(?!s)(*ACCEPT))|s+(?(?!t)(*ACCEPT))|t+(?(?!u)(*ACCEPT))|u+(?(?!v)(*ACCEPT))|v+(?(?!w)(*ACCEPT))|w+(?(?!x)(*ACCEPT))|x+(?(?!y)(*ACCEPT))|y+(?(?!z)(*ACCEPT))|z+(?(?!$)(*ACCEPT)))+)
Results in:
abc
ghiijk
y
x
z
Explanation of a single option, i.e. a+(?(?!b)(*ACCEPT)):
a+ Matches a (literally) one or more times. This catches instances where several of the same characters are in sequence such as aa.
(?(?!b)(*ACCEPT)) If clause evaluating the condition.
(?!b) Condition for the if clause. Negative lookahead ensuring what follows is not b. This is because if it's not b, we want the following control verb to take effect.
(*ACCEPT) If the condition (above) is met, we accept the current solution. This control verb causes the regex to end successfully, skipping the rest of the pattern. Since this token is inside a capturing group, only that capturing group is ended successfully at that particular location, while the parent pattern continues to execute.
So what happens if the condition is not met? Well, that means that (?!b) evaluated to false. This means that the following character is, in fact, b and so we allow the matching (rather capturing in this instance) to continue. Note that the entire pattern is wrapped in (?:)+ which allows us to match consecutive options until the (*ACCEPT) control verb or end of line is met.
The only exception to this whole regular expression is that of z. Being that it's the last character in the English alphabet (which I presume is the target of this question), we don't care what comes after, so we can simply put z+(?(?!$)(*ACCEPT)), which will ensure nothing matches after z. If you, instead, want to match za (circular alphabetical order matching - idk if this is the proper terminology, but it sounds right to me) you can use z+(?(?!a)(*ACCEPT)))+ as seen here.
As mentioned, regex is not the best tool for this. Since you are interested in a continuous sequence, you can do this with a single for loop:
def LNDS(s):
start = 0
cur_len = 1
max_len = 1
for i in range(1,len(s)):
if ord(s[i]) in (ord(s[i-1]), ord(s[i-1])+1):
cur_len += 1
else:
if cur_len > max_len:
max_len = cur_len
start = i - cur_len
cur_len = 1
if cur_len > max_len:
max_len = cur_len
start = len(s) - cur_len
return s[start:start+max_len]
>>> LNDS('abcghiijkyxz')
'ghiijk'
We keep a running total of how many non-decreasing characters we have seen, and when the non-decreasing sequence ends we compare it to the longest non-decreasing sequence we saw previously, updating our "best seen so far" if it is longer.
Generate all the regex substrings like ^a+b+c+$ (longest to shortest).
Then match each of those regexs against all the substrings (longest to shortest) of "abcghiijkyxz" and stop at the first match.
def all_substrings(s):
n = len(s)
for i in xrange(n, 0, -1):
for j in xrange(n - i + 1):
yield s[j:j + i]
def longest_alphabetical_substring(s):
for t in all_substrings("abcdefghijklmnopqrstuvwxyz"):
r = re.compile("^" + "".join(map(lambda x: x + "+", t)) + "$")
for u in all_substrings(s):
if r.match(u):
return u
print longest_alphabetical_substring("abcghiijkyxz")
That prints "ghiijk".
Regex: char+ meaning a+b+c+...
Details:
+ Matches between one and unlimited times
Python code:
import re
def LNDS(text):
array = []
for y in range(97, 122): # a - z
st = r"%s+" % chr(y)
for x in range(y+1, 123): # b - z
st += r"%s+" % chr(x)
match = re.findall(st, text)
if match:
array.append(max(match, key=len))
else:
break
if array:
array = [max(array, key=len)]
return array
Output:
print(LNDS('abababababab abc')) >>> ['abc']
print(LNDS('abcghiijkyxz')) >>> ['ghiijk']
For string abcghiijkyxz regex pattern:
a+b+ i+j+k+l+
a+b+c+ j+k+
a+b+c+d+ j+k+l+
b+c+ k+l+
b+c+d+ l+m+
c+d+ m+n+
d+e+ n+o+
e+f+ o+p+
f+g+ p+q+
g+h+ q+r+
g+h+i+ r+s+
g+h+i+j+ s+t+
g+h+i+j+k+ t+u+
g+h+i+j+k+l+ u+v+
h+i+ v+w+
h+i+j+ w+x+
h+i+j+k+ x+y+
h+i+j+k+l+ y+z+
i+j+
i+j+k+
Code demo
To actually "solve" the problem, you could use
string = 'abcxyzghiijkl'
def sort_longest(string):
stack = []; result = [];
for idx, char in enumerate(string):
c = ord(char)
if idx == 0:
# initialize our stack
stack.append((char, c))
elif idx == len(string) - 1:
result.append(stack)
elif c == stack[-1][1] or c == stack[-1][1] + 1:
# compare it to the item before (a tuple)
stack.append((char, c))
else:
# append the stack to the overall result
# and reinitialize the stack
result.append(stack)
stack = []
stack.append((char, c))
return ["".join(item[0]
for item in sublst)
for sublst in sorted(result, key=len, reverse=True)]
print(sort_longest(string))
Which yields
['ghiijk', 'abc', 'xyz']
in this example.
The idea is to loop over the string and keep track of a stack variable which is filled by your requirements using ord().
It's really easy with regexps!
(Using trailing contexts here)
rexp=re.compile(
"".join(['(?:(?=.' + chr(ord(x)+1) + ')'+ x +')?'
for x in "abcdefghijklmnopqrstuvwxyz"])
+'[a-z]')
a = 'bcabhhjabjjbckjkjabckkjdefghiklmn90'
re.findall(rexp, a)
#Answer: ['bc', 'ab', 'h', 'h', 'j', 'ab', 'j', 'j', 'bc', 'k', 'jk', 'j', 'abc', 'k', 'k', 'j', 'defghi', 'klmn']
So I have a challenge I'm working on - find the longest string of alphabetical characters in a string. For example, "abcghiijkyxz" should result in "ghiijk" (Yes the i is doubled).
I've been doing quite a bit with loops to solve the problem - iterating over the entire string, then for each character, starting a second loop using lower and ord. No help needed writing that loop.
However, it was suggested to me that Regex would be great for this sort of thing. My regex is weak (I know how to grab a static set, my look-forwards knowledge extends to knowing they exist). How would I write a Regex to look forward, and check future characters for being next in alphabetical order? Or is the suggestion to use Regex not practical for this type of thing?
Edit: The general consensus seems to be that Regex is indeed terrible for this type of thing.
Just to demonstrate why regex is not practical for this sort of thing, here is a regex that would match ghiijk in your given example of abcghiijkyxz. Note it'll also match abc, y, x, z since they should technically be considered for longest string of alphabetical characters in order. Unfortunately, you can't determine which is the longest with regex alone, but this does give you all the possibilities. Please note that this regex works for PCRE and will not work with python's re module! Also, note that python's regex library does not currently support (*ACCEPT). Although I haven't tested, the pyre2 package (python wrapper for Google's re2 pyre2 using Cython) claims it supports the (*ACCEPT) control verb, so this may currently be possible using python.
See regex in use here
((?:a+(?(?!b)(*ACCEPT))|b+(?(?!c)(*ACCEPT))|c+(?(?!d)(*ACCEPT))|d+(?(?!e)(*ACCEPT))|e+(?(?!f)(*ACCEPT))|f+(?(?!g)(*ACCEPT))|g+(?(?!h)(*ACCEPT))|h+(?(?!i)(*ACCEPT))|i+(?(?!j)(*ACCEPT))|j+(?(?!k)(*ACCEPT))|k+(?(?!l)(*ACCEPT))|l+(?(?!m)(*ACCEPT))|m+(?(?!n)(*ACCEPT))|n+(?(?!o)(*ACCEPT))|o+(?(?!p)(*ACCEPT))|p+(?(?!q)(*ACCEPT))|q+(?(?!r)(*ACCEPT))|r+(?(?!s)(*ACCEPT))|s+(?(?!t)(*ACCEPT))|t+(?(?!u)(*ACCEPT))|u+(?(?!v)(*ACCEPT))|v+(?(?!w)(*ACCEPT))|w+(?(?!x)(*ACCEPT))|x+(?(?!y)(*ACCEPT))|y+(?(?!z)(*ACCEPT))|z+(?(?!$)(*ACCEPT)))+)
Results in:
abc
ghiijk
y
x
z
Explanation of a single option, i.e. a+(?(?!b)(*ACCEPT)):
a+ Matches a (literally) one or more times. This catches instances where several of the same characters are in sequence such as aa.
(?(?!b)(*ACCEPT)) If clause evaluating the condition.
(?!b) Condition for the if clause. Negative lookahead ensuring what follows is not b. This is because if it's not b, we want the following control verb to take effect.
(*ACCEPT) If the condition (above) is met, we accept the current solution. This control verb causes the regex to end successfully, skipping the rest of the pattern. Since this token is inside a capturing group, only that capturing group is ended successfully at that particular location, while the parent pattern continues to execute.
So what happens if the condition is not met? Well, that means that (?!b) evaluated to false. This means that the following character is, in fact, b and so we allow the matching (rather capturing in this instance) to continue. Note that the entire pattern is wrapped in (?:)+ which allows us to match consecutive options until the (*ACCEPT) control verb or end of line is met.
The only exception to this whole regular expression is that of z. Being that it's the last character in the English alphabet (which I presume is the target of this question), we don't care what comes after, so we can simply put z+(?(?!$)(*ACCEPT)), which will ensure nothing matches after z. If you, instead, want to match za (circular alphabetical order matching - idk if this is the proper terminology, but it sounds right to me) you can use z+(?(?!a)(*ACCEPT)))+ as seen here.
As mentioned, regex is not the best tool for this. Since you are interested in a continuous sequence, you can do this with a single for loop:
def LNDS(s):
start = 0
cur_len = 1
max_len = 1
for i in range(1,len(s)):
if ord(s[i]) in (ord(s[i-1]), ord(s[i-1])+1):
cur_len += 1
else:
if cur_len > max_len:
max_len = cur_len
start = i - cur_len
cur_len = 1
if cur_len > max_len:
max_len = cur_len
start = len(s) - cur_len
return s[start:start+max_len]
>>> LNDS('abcghiijkyxz')
'ghiijk'
We keep a running total of how many non-decreasing characters we have seen, and when the non-decreasing sequence ends we compare it to the longest non-decreasing sequence we saw previously, updating our "best seen so far" if it is longer.
Generate all the regex substrings like ^a+b+c+$ (longest to shortest).
Then match each of those regexs against all the substrings (longest to shortest) of "abcghiijkyxz" and stop at the first match.
def all_substrings(s):
n = len(s)
for i in xrange(n, 0, -1):
for j in xrange(n - i + 1):
yield s[j:j + i]
def longest_alphabetical_substring(s):
for t in all_substrings("abcdefghijklmnopqrstuvwxyz"):
r = re.compile("^" + "".join(map(lambda x: x + "+", t)) + "$")
for u in all_substrings(s):
if r.match(u):
return u
print longest_alphabetical_substring("abcghiijkyxz")
That prints "ghiijk".
Regex: char+ meaning a+b+c+...
Details:
+ Matches between one and unlimited times
Python code:
import re
def LNDS(text):
array = []
for y in range(97, 122): # a - z
st = r"%s+" % chr(y)
for x in range(y+1, 123): # b - z
st += r"%s+" % chr(x)
match = re.findall(st, text)
if match:
array.append(max(match, key=len))
else:
break
if array:
array = [max(array, key=len)]
return array
Output:
print(LNDS('abababababab abc')) >>> ['abc']
print(LNDS('abcghiijkyxz')) >>> ['ghiijk']
For string abcghiijkyxz regex pattern:
a+b+ i+j+k+l+
a+b+c+ j+k+
a+b+c+d+ j+k+l+
b+c+ k+l+
b+c+d+ l+m+
c+d+ m+n+
d+e+ n+o+
e+f+ o+p+
f+g+ p+q+
g+h+ q+r+
g+h+i+ r+s+
g+h+i+j+ s+t+
g+h+i+j+k+ t+u+
g+h+i+j+k+l+ u+v+
h+i+ v+w+
h+i+j+ w+x+
h+i+j+k+ x+y+
h+i+j+k+l+ y+z+
i+j+
i+j+k+
Code demo
To actually "solve" the problem, you could use
string = 'abcxyzghiijkl'
def sort_longest(string):
stack = []; result = [];
for idx, char in enumerate(string):
c = ord(char)
if idx == 0:
# initialize our stack
stack.append((char, c))
elif idx == len(string) - 1:
result.append(stack)
elif c == stack[-1][1] or c == stack[-1][1] + 1:
# compare it to the item before (a tuple)
stack.append((char, c))
else:
# append the stack to the overall result
# and reinitialize the stack
result.append(stack)
stack = []
stack.append((char, c))
return ["".join(item[0]
for item in sublst)
for sublst in sorted(result, key=len, reverse=True)]
print(sort_longest(string))
Which yields
['ghiijk', 'abc', 'xyz']
in this example.
The idea is to loop over the string and keep track of a stack variable which is filled by your requirements using ord().
It's really easy with regexps!
(Using trailing contexts here)
rexp=re.compile(
"".join(['(?:(?=.' + chr(ord(x)+1) + ')'+ x +')?'
for x in "abcdefghijklmnopqrstuvwxyz"])
+'[a-z]')
a = 'bcabhhjabjjbckjkjabckkjdefghiklmn90'
re.findall(rexp, a)
#Answer: ['bc', 'ab', 'h', 'h', 'j', 'ab', 'j', 'j', 'bc', 'k', 'jk', 'j', 'abc', 'k', 'k', 'j', 'defghi', 'klmn']
So, the program works just fine, but even after incorporating that last modification suggested by abarnert, it still won't make sure to generate a unique mutation.
This is what I've got so far. I'm sure it's not right, but I don't fully understand how python executes the code written by abarnert below.
a = open (scgenome, 'r')
codon = [ ]
for line in a:
data=line.split("\t")
codon.append(data[12])
import random
def string_replace(s,index,char):
return s[:index] + char + s[index+1:]
for x in range(1,1000):
index = random.randrange(3)
letter_to_replace = random.choice(list({"A", "G", "T", "C"} - {codon[index]}))
mutated_codon = [string_replace(codon[x], index, letter_to_replace)]
for c in mutated_codon:
codon_lookup[c]
I have also tried to write this your way without using range, although I like using the range function so that I can print out 10 or 100 codons and manually check if the output is correct, but then I get a Keyerror: 'r', which didn't occur before when i ran this program before trying to make sure that each substitution is unique:
def string_replace(s,index,char):
return s[:index] + char + s[index+1:]
def mutate_codon(codon):
index = random.randrange(3)
letter_to_replace = random.choice(list({"A", "G", "T", "C"} - {codon[index]}))
return string_replace(codon, index, letter_to_replace)
for codon in codons:
codons = mutate_codon(codon)
print codons
for c in codons:
codon_lookup[c]
if codon_lookup[c] == ref_aminoacid[x]:
print codons, "\t", codon_lookup[c]
else:
print codons, "\t", codon_lookup[c]
The function random.choice picks a random element from a sequence. So:
letter_to_replace = random.choice(['A', 'C', 'G', 'T'])
To pick a letter from a given codon, you really want to pick an index at random—0, 1, or 2. (After all, for the codon 'AAA', you presumably want to be able to replace any of the three 'A' characters, right?) For that, use random.randrange(3):
for codon in codons:
index_to_replace = random.randrange(3)
codon[index_to_replace] = letter_to_replace
Except that if each codon is a string, of course, you can't mutate it in-place, so you need a function like this:
def string_replace(s, index, char):
return s[:index] + char + s[index+1:]
What we're doing here is building a new string out of slices: s[:index] is all of the characters from the start to the indexth (remember that Python slicing is half-open: s[i:j] includes i, i+1, …, j-1, but not j), and s[index+1] is all of the characters from the index+1th to the end. So, this is everything before index, char in place of whatever was in index, and then everything after index. This is described in detail in the Strings section of the tutorial (with a bit of followup in the Lists section of the same chapter).
And while you're already doing things immutably:
codons = [string_replace(codon, random.randrange(3), letter_to_replace)
for codon in codons]
This uses a list comprehension: instead of modifying the list of codons in-place, we build a new list of codons. List Comprehensions in the tutorial explains how these work, but a simple example may help:
a = [1, 2, 3, 4]
b = [2 * element for element in a]
assert b == [2, 4, 6, 8]
c = []
for element in a:
c.append(2 * element)
assert c == b
You can also filter the list as you build it with if clauses, nest multiple for clauses together, build a set or dict, or a lazy generator, instead of a list… see the documentation for full details.
Here's how to put it all together, with a few other fixes (using a with to make sure the file gets closed, and some of the stuff I commented on the question):
# Read the codons into a list
with open(scgenome) as f:
codons = [line.split('\t')[12] for line in f]
# Create a new list of mutated codons
def string_replace(s, index, char):
return s[:index] + char + s[index+1:]
letter_to_replace = random.choice(['A', 'C', 'G', 'T'])
codons = [string_replace(codon, random.randrange(3), letter_to_replace)
for codon in codons]
If you want to guarantee a single point mutation in each codon, and you don't need each one to mutate to the same base, you need to rethink things a little. For each codon, pick one of the three positions. Then, instead of picking randomly from all four bases, pick from all of the bases except the one that's already there. So:
def string_replace(s, index, char):
return s[:index] + char + s[index+1:]
def mutate_codon(codon):
index = random.randrange(3)
new_base = random.choice(list({'A', 'C', 'T', 'G'} - {codon[index]}))
return string_replace(codon, index, new_base)
codons = [mutate_codon(codon) for codon in codons]
If that function line is confusing, let me explain: Sets have a nice - operator that computes the set difference—that is, all values in the left set that aren't also in the right set. {'A', 'C', 'T', 'G'} - {'T'} is {'A', 'C', 'G'}. So, I take the set of all four bases, subtract out the one that's already at codon[index], and randomly choose any of the other three. Since choice only works on sequences, I have to make a list out of the set.
You could, of course, rewrite this to use a list (or even str) in the first place, but then you have to write the "list difference" manually. Not a big deal:
new_base = random.choice([base for base in codon if base != codon[index]])