define a loop which returns different possibilities - python

Hello, I am pretty new to Python. I have the following problem:
I want to write a script that, given a (DNA) sequence with ambiguities, writes all possible sequences if there are fewer than 100 of them; if there are more than 100 possible sequences, an appropriate error message is printed.
For DNA nucleotide ambiguities: http://www.bioinformatics.org/sms/iupac.html
Example: for the sequence “AYGH” the script’s output would be “ACGA”, “ACGC”, “ACGT”, “ATGA”, “ATGC”, and “ATGT”. A, C, G and T are the default nucleotides. All others can have different values (see link).
So I wrote this:
def possible_sequences(seq):
    poss_seq = ''
    for i in seq:
        if i == 'A' or i == 'C' or i == 'G' or i == 'T':
            poss_seq += i
        else:
            if i == 'R':
                poss_seq += 'A'  # OR 'G', how should i implement this?
            elif i == 'Y':
                poss_seq += 'C'  # OR T
            elif i == 'S':
                poss_seq += 'G'  # OR C
            elif i == 'W':
                poss_seq += 'A'  # OR T
            elif i == 'K':
                poss_seq += 'G'  # OR T
            elif i == 'M':
                poss_seq += 'A'  # OR C
            elif i == 'B':
                poss_seq += 'C'  # OR G OR T
            elif i == 'D':
                poss_seq += 'A'  # OR G OR T
            elif i == 'H':
                poss_seq += 'A'  # OR C OR T
            elif i == 'V':
                poss_seq += 'A'  # OR C OR G
            elif i == 'N':
                poss_seq += 'A'  # OR C OR G OR T
            elif i == '-' or i == '.':
                poss_seq += ' '
    return poss_seq
When I test my function:
possible_sequences ('ATRY-C')
I got:
'ATAC C'
but I should have gotten:
'ATAC C'
'ATAT C'
'ATGC C'
'ATGT C'
Can somebody please help me? I understand that I have to go back and write a second poss_seq when there is an ambiguity present, but I don't know how...

You can use itertools.product to generate the possibilities:
from itertools import product
# List possible nucleotides for each possible item in sequence
MAP = {
'A': 'A',
'C': 'C',
'G': 'G',
'T': 'T',
'R': 'AG',
'Y': 'CT',
'S': 'GC',
'W': 'AT',
'K': 'GT',
'M': 'AC',
'B': 'CGT',
'D': 'AGT',
'H': 'ACT',
'V': 'ACG',
'N': 'ACGT',
'-': ' ',
'.': ' '
}
def possible_sequences(seq):
    return (''.join(c) for c in product(*(MAP[c] for c in seq)))
print(list(possible_sequences('AYGH')))
print(list(possible_sequences('ATRY-C')))
Output:
['ACGA', 'ACGC', 'ACGT', 'ATGA', 'ATGC', 'ATGT']
['ATAC C', 'ATAT C', 'ATGC C', 'ATGT C']
In the code above, we first iterate over the items in the given sequence and get the string of possible nucleotides for each item:
possibilities = [MAP[c] for c in 'ATRY-C']
print(possibilities)
# ['A', 'T', 'AG', 'CT', ' ', 'C']
Then the list is unpacked into separate arguments for product, which returns the Cartesian product:
products = list(product(*['A', 'T', 'AG', 'CT', ' ', 'C']))
print(products)
# [('A', 'T', 'A', 'C', ' ', 'C'), ('A', 'T', 'A', 'T', ' ', 'C'),
# ('A', 'T', 'G', 'C', ' ', 'C'), ('A', 'T', 'G', 'T', ' ', 'C')]
Finally, each of the products is turned into a string with join:
print(list(''.join(p) for p in products))
# ['ATAC C', 'ATAT C', 'ATGC C', 'ATGT C']
Note that possible_sequences returns a generator instead of constructing all the possible sequences at once, so you can easily stop the iteration whenever you want instead of having to wait for every sequence to be generated.
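The question also asks for an error when there are too many possibilities. A minimal sketch of one way to do that, counting the combinations before generating any of them. The names IUPAC, limited_sequences and limit are made up for this illustration, and treating exactly 100 sequences as still allowed is an assumption, since the question leaves that case open:

```python
from itertools import product
from math import prod

# Same table as MAP above, repeated so this block runs on its own.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'GC', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
    '-': ' ', '.': ' ',
}

def limited_sequences(seq, limit=100):
    """Return a generator of all sequences, or raise if there are too many."""
    options = [IUPAC[c] for c in seq]
    # The number of combinations is the product of the option counts,
    # so it can be computed without generating a single sequence.
    total = prod(len(o) for o in options)
    if total > limit:
        raise ValueError(
            f"{total} possible sequences, which is more than the limit of {limit}")
    return (''.join(p) for p in product(*options))

print(list(limited_sequences('AYGH')))
```

Because the check runs before the generator is created, a call such as limited_sequences('NNNN') (256 combinations) fails immediately rather than partway through the iteration.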

Related

Decrypt message with random shift of letters

I am writing a program to decrypt a message, given only the assumption that the most frequent letter of the decrypted message is "e". No shift number is given. The code below is what I have so far. I can only hardcode the shift number to decrypt the given message, so if the message changes my code no longer works.
from collections import Counter
import string
message = "Zyp cpxpxmpc ez wzzv fa le esp delcd lyo yze ozhy le jzfc qppe Ehz ypgpc rtgp fa hzcv Hzcv rtgpd jzf xplytyr lyo afcazdp lyo wtqp td pxaej hteszfe te Escpp tq jzf lcp wfnvj pyzfrs ez qtyo wzgp cpxpxmpc te td espcp lyo ozye esczh te lhlj Depaspy Slhvtyr"
#frequency of each letter
letter_counts = Counter(message)
print(letter_counts) # Print the count of each element in string
#find max letter
maxFreq = -1
maxLetter = None
letter_counts[' '] = 0 # Don't count spaces: set their count to zero
for letter, freq in letter_counts.items():
    print(letter, ":", freq)
maxLetter = max(letter_counts, key = letter_counts.get) # Find max freq letter in the string
print("Max Occurring Letter:", maxLetter)
#right shift for encrypting and left shift for decrypting.
#predict shift
#assume max letter is 'e'
letters = string.ascii_letters #contains 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
shift = 15 #COMPUTE SHIFT HERE (hardcode)
print("Predicted Shift:", shift)
totalLetters = 26
keys = {} #use dictionary for letter mapping
invkeys = {} #use dictionary for inverse letter mapping, you could use inverse search from original dict
for index, letter in enumerate(letters):
    # cypher setup
    if index < totalLetters: #lowercase
        # Dictionary for encryption
        letter = letters[index]
        keys[letter] = letters[(index + shift) % 26]
        # Dictionary for decryption
        invkeys = {val: key for key, val in keys.items()}
    else: #uppercase
        # Dictionary for encryption
        keys[letter] = letters[(index + shift) % 26 + 26]
        # Dictionary for decryption
        invkeys = {val: key for key, val in keys.items()}
print("Cypher Dict", keys)
#decrypt
decryptedMessage = []
for letter in message:
    if letter == ' ': #spaces
        decryptedMessage.append(letter)
    else:
        decryptedMessage.append(keys[letter])
print("Decrypted Message:", ''.join(decryptedMessage)) #join is used to put list into string
# Checking if message is the same as the encrypt message provided
#Encrypt
encryptedMessage = []
for letter in decryptedMessage:
    if letter == ' ': #spaces
        encryptedMessage.append(letter)
    else:
        encryptedMessage.append(invkeys[letter])
print("Encrypted Message:", ''.join(encryptedMessage)) #join is used to put list into string
The encryption part of the code is not strictly necessary; it is only there for checking. It would be great if someone could help modify my code or give me some hints for the shift prediction part. Thanks!
Output of the code:
Cypher Dict {'a': 'p', 'b': 'q', 'c': 'r', 'd': 's', 'e': 't', 'f': 'u', 'g': 'v', 'h': 'w', 'i': 'x', 'j': 'y', 'k': 'z', 'l': 'a', 'm': 'b', 'n': 'c', 'o': 'd', 'p': 'e', 'q': 'f', 'r': 'g', 's': 'h', 't': 'i', 'u': 'j', 'v': 'k', 'w': 'l', 'x': 'm', 'y': 'n', 'z': 'o', 'A': 'P', 'B': 'Q', 'C': 'R', 'D': 'S', 'E': 'T', 'F': 'U', 'G': 'V', 'H': 'W', 'I': 'X', 'J': 'Y', 'K': 'Z', 'L': 'A', 'M': 'B', 'N': 'C', 'O': 'D', 'P': 'E', 'Q': 'F', 'R': 'G', 'S': 'H', 'T': 'I', 'U': 'J', 'V': 'K', 'W': 'L', 'X': 'M', 'Y': 'N', 'Z': 'O'}
Decrypted Message: One remember to look up at the stars and not down at your feet Two never give up work Work gives you meaning and purpose and life is empty without it Three if you are lucky enough to find love remember it is there and dont throw it away Stephen Hawking
Encrypted Message: Zyp cpxpxmpc ez wzzv fa le esp delcd lyo yze ozhy le jzfc qppe Ehz ypgpc rtgp fa hzcv Hzcv rtgpd jzf xplytyr lyo afcazdp lyo wtqp td pxaej hteszfe te Escpp tq jzf lcp wfnvj pyzfrs ez qtyo wzgp cpxpxmpc te td espcp lyo ozye esczh te lhlj Depaspy Slhvtyr
This has three components:
Finding the character with max frequency:
test_str = "" # your string
counter = Counter(test_str)
keys = sorted(counter, key=counter.get, reverse=True)
res = keys[1] if keys[0] == " " else keys[0]
Calculating the shift:
shift = ord('e') - ord(res)
Encrypting/decrypting the string, which is trivial since you know the shift now
Something like this should allow you to calculate the shift based on the assumption that the letter in the original message with the highest frequency is 'e':
letter_counts = Counter(message)
e_encrypted = [k for k, v in letter_counts.items() if v == max(count for c, count in letter_counts.items() if c != ' ')][0]
shift = (ord('e') - ord(e_encrypted)) % 26
Or, to unroll the comprehensions for ease of understanding:
letter_counts = Counter(message)
e_encrypted, max_v = None, 0
for k, v in letter_counts.items():
    if v > max_v and k != ' ':
        e_encrypted, max_v = k, v
shift = (ord('e') - ord(e_encrypted)) % 26
It does the following:
take frequency counts of characters in message using the Counter class
find the maximum frequency, and the character with that maximum frequency
set the shift equal to the difference between the ascii value of that character and the letter 'e' (modulo 26)
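Putting those steps together, here is a rough end-to-end sketch. The helper names shift_text and predict_shift are invented for this example, and counting upper- and lowercase letters together is an assumption that goes slightly beyond the snippets above:

```python
from collections import Counter

def shift_text(text, shift):
    """Shift ASCII letters by `shift` positions, leaving other characters alone."""
    out = []
    for ch in text:
        if 'a' <= ch <= 'z':
            out.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('a')))
        elif 'A' <= ch <= 'Z':
            out.append(chr((ord(ch) - ord('A') + shift) % 26 + ord('A')))
        else:
            out.append(ch)
    return ''.join(out)

def predict_shift(message):
    """Guess the decrypting shift, assuming the most frequent letter stands for 'e'."""
    counts = Counter(ch for ch in message.lower() if 'a' <= ch <= 'z')
    most_common_letter, _ = counts.most_common(1)[0]
    return (ord('e') - ord(most_common_letter)) % 26

# Round trip on a small sample in which 'e' is clearly the most frequent letter.
cipher = shift_text("See the three trees", 11)
shift = predict_shift(cipher)
print(shift, shift_text(cipher, shift))
```

The guess is only as good as the assumption: on short messages, or text where 'e' is not the most common letter, the predicted shift will be wrong.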

In a dictionary, how to check that a key has exactly 1 value?

I have this dictionary:
{128: ['S', 'S', 'O', 'F'], 512: ['S', 'F']}
I would like to be sure that each key has exactly one value 'F' and one value 'S', and print a message if that is not the case.
I tried this, but it didn't seem to work:
it didn't print the message
for key in d:
    if not re.search(r"[F]{1}","".join(d[key])) or not re.search(r"[S].{1}","".join(d[key])):
        print(f"There is no start or end stop for the line {key}.")
Thanks
Your dictionary contains lists, not strings, so you shouldn't use a regex. You can check whether each list contains exactly the number of values you want using list.count(value).
for key in d:
    if not (d[key].count('F') == 1 and d[key].count('S') == 1):
        print(f"There is no start or end stop for the line {key}.")
You can check the count of those characters in each value. One approach is below:
sample = {128: ['S', 'S', 'O', 'F'], 512: ['S', 'F']}
wrong_lines = [index for index, value in sample.items() if not (value.count('S') == value.count('F') == 1)]
print(f"Wrong lines {wrong_lines}")
d = {128: ['S', 'S', 'O', 'F'], 512: ['S', 'F']}
for key in d:
    if not (d[key].count('F') == 1 and d[key].count('S') == 1):
        print(f"There is no start or end stop for the line {key}.")
You can use a Counter and get the counts of all letters simultaneously:
>>> from collections import Counter
>>> di={128: ['S', 'S', 'O', 'F'], 512: ['S', 'F']}
>>> {k:Counter(v) for k,v in di.items()}
{128: Counter({'S': 2, 'O': 1, 'F': 1}), 512: Counter({'S': 1, 'F': 1})}
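To turn those counts into the check the question asks for, something along these lines might work. The function name keys_without_single_start_and_end is made up for this sketch; it relies on a Counter returning 0 for elements it has not seen:

```python
from collections import Counter

def keys_without_single_start_and_end(d):
    """Return the keys whose list does not hold exactly one 'S' and one 'F'."""
    bad_keys = []
    for key, values in d.items():
        counts = Counter(values)
        # A missing letter counts as 0, so no KeyError handling is needed.
        if counts['S'] != 1 or counts['F'] != 1:
            bad_keys.append(key)
    return bad_keys

di = {128: ['S', 'S', 'O', 'F'], 512: ['S', 'F']}
for key in keys_without_single_start_and_end(di):
    print(f"There is no start or end stop for the line {key}.")
```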

Encoding a string using function

I created the following function to pair each letter in the alphabet with its corresponding encoded letter based on a given shift:
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
def build_cipher(shift):
    '''
    Description: takes in shift (an integer representing the amount the letter key in the dictionary is shifted from its corresponding letter) and returns a dictionary containing all letters and their corresponding letters after the shift. This is achieved by adding the shift to the index of the letter, modulo 26.
    >>> build_cipher(-3)
    {'a': 'x', 'b': 'y', 'c': 'z', 'd': 'a', 'e': 'b', 'f': 'c', 'g': 'd', 'h': 'e', 'i': 'f', 'j': 'g', 'k': 'h', 'l': 'i', 'm': 'j', 'n': 'k', 'o': 'l', 'p': 'm', 'q': 'n', 'r': 'o', 's': 'p', 't': 'q', 'u': 'r', 'v': 's', 'w': 't', 'x': 'u', 'y': 'v', 'z': 'w'}
    '''
    return {alphabet[i]: alphabet[(i + shift) % 26] for i in range(0, 26)}
Next I need to define a function encode that takes in text and shift, and returns the text encoded. I also need to use my build_cipher function in order to do this. So far I have:
def encode(text, shift):
    '''
    Description: takes in a text string and shift. Returns the text string encoded based on the shift.
    >>> encode('test', -4)
    >>> encode('code', 5)
    '''
    #return (text[(i + shift) % 26] for i in range(0,26))
    #return (build_cipher(shift) for text in alphabet)
    #return (build_cipher(shift) for text in range(0,26))
Each of my attempts at the return statement is in the comments at the bottom. None works correctly, and I am unsure exactly how to do this since build_cipher returns a dictionary. Any tips on how I could achieve this are appreciated.
You have already created your cipher-building function; now let's use it to create the cipher and apply it to each character in your text. I used dict.get here so that a character that is not in the cipher is kept unchanged.
def encode(text, shift):
    cipher = build_cipher(shift)
    return ''.join(cipher.get(c, c) for c in text)
Example:
>>> encode('good morning', 4)
'kssh qsvrmrk'
>>> encode('kssh qsvrmrk', -4)
'good morning'
NB. Your code currently does not handle capital letters; that's something you might want to fix ;)
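One possible way to cover capital letters, sketched under the assumption that uppercase letters should be shifted within the uppercase alphabet. The names build_cipher_with_caps and encode_with_caps are hypothetical and not part of the original code:

```python
import string

def build_cipher_with_caps(shift):
    """Build a shift cipher covering both lowercase and uppercase ASCII letters."""
    cipher = {}
    for letters in (string.ascii_lowercase, string.ascii_uppercase):
        for i, ch in enumerate(letters):
            # Each case wraps around inside its own 26-letter alphabet.
            cipher[ch] = letters[(i + shift) % 26]
    return cipher

def encode_with_caps(text, shift):
    """Encode text, leaving characters outside the cipher (spaces, digits) unchanged."""
    cipher = build_cipher_with_caps(shift)
    return ''.join(cipher.get(c, c) for c in text)

print(encode_with_caps('Good Morning', 4))
print(encode_with_caps('Kssh Qsvrmrk', -4))
```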

Pandas Series Conditionally Change String

I am trying to change the strings in a Pandas Series based on a condition. If the string is, say, 'A', it should become 'AA'. The code snippet below works, but it is inelegant and inefficient. I am passing a Pandas Series as the argument, as I said. Is there any other way to accomplish this?
def conditions(x):
    if x == 'A':
        return "AA"
    elif x == 'B':
        return "BB"
    elif x == 'C':
        return "CC"
    elif x == 'D':
        return "DD"
    elif x == 'E':
        return "EE"
    elif x == 'F':
        return "FF"
    elif x == 'G':
        return "GG"
    elif x == 'H':
        return "HH"
    elif x == 'I':
        return "I"
    elif x == 'J':
        return "JJ"
    elif x == 'K':
        return "KK"
    elif x == 'L':
        return 'LL'
func = np.vectorize(conditions)
test = func(rfqs["client"])
If you are just trying to repeat a given string, you can add the string to itself across all rows at once. If you have some other condition, you can specify that condition and add the string to itself only for rows that meet the criteria. See this toy example:
df = pd.DataFrame({'client': ['A', 'B', 'Z']})
df.loc[df['client'].str.contains('[A-L]'), 'client'] = df['client'] * 2
to get
  client
0     AA
1     BB
2      Z
You can use a dictionary to avoid all those if elses:
d = {i:2*i for i in ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L')}
test = rfqs["client"].apply(lambda x: d[x] if x in d else x)
Here is another way:
l = list('ABCDEFGHIJKL')
df['col'] = df['col'].mask(df['col'].isin(l),df['col'].str.repeat(2))
With np.where()
df['col'].mul(np.where(df['col'].isin(l),2,1))
With map()
df['A'].mul(df['A'].map({i:2 for i in l}).fillna(1).astype(int))

encoding using a random cipher

I'm trying to write a program that takes a long string of letters and characters, and creates a dictionary of {original character:random character}. It should remove characters that have already been assigned a random value.
This is what I have:
import random
all_chars='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,.!?'
def make_encoder(all_chars):
    all_chars=list(all_chars)
    encoder = {}
    for c in range (0,len(all_chars)):
        e = random.choice(all_chars)
        all_chars.remove(e)
        key = all_chars[c]
        encoder[key] = e
    return encoder
I keep getting index out of range: 33 on line 10 key = all_chars[c]
Here's my whole code, with the first problem fixed:
import random
all_chars='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,.!?'
def make_encoder(all_chars):
    list_chars= list(all_chars)
    all_chars= list(all_chars)
    encoder = {}
    i=0
    while len(encoder) < len(all_chars):
        e = random.choice(all_chars)
        key = all_chars[i]
        if key not in encoder.keys():
            encoder[key] = e
            i += 1
    return encoder
def encode_message(encoder,msg):
    encoded_msg = ""
    for x in msg:
        c = encoder[x]
        encoded_msg = encoded_msg + c
def make_decoder(encoder):
    decoder = {}
    for k in encoder:
        v = encoder[k]
        decoder[v] = k
    return decoder
def decode_message(decoder,msg):
    decoded_msg = ""
    for x in msg:
        c = decoder[x]
        decoded_msg = decoded_msg + c
def main():
    alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ,.!?"
    e = make_encoder(alphabet)
    d = make_decoder(e)
    print(e)
    print(d)
    phrase = input("enter a phrase")
    print(phrase)
    encoded = encode_message(e,phrase)
    print(encoded)
    decoded = decode_message(d,encoded)
    print(decoded)
I now get TypeError: iteration over non-sequence of type NoneType for the line for x in msg:
You are altering the list while looping. Rule of thumb: never alter a list while iterating over it.
The line for c in range (0,len(all_chars)): iterates up to the original length of the list, but at the same time you are removing elements, so the list gets shorter; that is why you get the index out of range error.
Try it like this:
import random
all_chars='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,.!?'
def make_encoder(all_chars):
    all_char = list(all_chars)
    encoder = {}
    i=0
    while len(encoder) < len(all_char):
        e = random.choice(all_char)
        key = all_char[i]
        if key not in encoder.keys():
            encoder[key] = e
            i += 1
    return encoder
output:
>>> make_encoder(all_chars)
{'!': '3', ',': 'l', '.': 'J', '1': 'y', '0': 'l', '3': 'G', '2': ',', '5': '6', '4': 'f', '7': 'f', '6': 'C', '9': 'F', '8': 'y', '?': 'S', 'A': 'm', 'C': 'z', 'B': 'b', 'E': 'J', 'D': '0', 'G': 'S', 'F': 'v', 'I': 'v', 'H': '?', 'K': 'd', 'J': 'X', 'M': 'o', 'L': 'O', 'O': 'Q', 'N': 'P', 'Q': 'Z', 'P': '8', 'S': 'r', 'R': 'h', 'U': 'o', 'T': 'M', 'W': 'l', 'V': '.', 'Y': 'R', 'X': 'C', 'Z': 'a', 'a': 's', 'c': 'Y', 'b': 'X', 'e': 's', 'd': 'd', 'g': 'L', 'f': 'G', 'i': 'm', 'h': 'k', 'k': 'f', 'j': '1', 'm': 'J', 'l': 'L', 'o': '2', 'n': 'N', 'q': 'n', 'p': 'l', 's': 'W', 'r': '7', 'u': 'y', 't': 'S', 'w': 'J', 'v': 'E', 'y': 'r', 'x': 'C', 'z': 'i'}
You're modifying the list as you iterate over it:
for c in range(0,len(all_chars)):
    e = random.choice(all_chars)
    all_chars.remove(e)
The range item range(0,len(all_chars)) is only generated when the for loop starts. That means it will always assume its length is what it started as.
After you remove a character, all_chars.remove(e), now the list is one item shorter than when the for loop started, leading to the eventual over-run.
How about this instead:
while all_chars: # While there are chars left in the list
    ...
You should never modify an iterable while you are iterating over it.
Think about it: you told Python to loop from 0 to the length of the list all_chars, which is 66 in the beginning. But you are constantly shrinking this length with all_chars.remove(e). So the loop still tries to run 66 times, but all_chars only has 66 items at the start of the first iteration. Afterwards, it has 65, then 64, then 63, etc.
Eventually, you will run into an IndexError once c is no longer smaller than the length of the list (which happens at c == 33, when only 32 items remain). Note that an index equal to the length is already out of range, because Python indexes start at 0:
>>> [1, 2, 3][3] # There is no index 3 because 0 is the first index
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> [1, 2, 3][2] # 2 is the greatest index
3
>>>
To fix the problem, you can either:
Stop removing elements from all_chars inside the loop. That way, its length will always be 66.
Use a while True: loop and break when all_chars is empty (you run out of characters).
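To see where the failure point comes from, the shrinking can be simulated on a plain list. This is only an illustration; first_failing_index is a made-up helper, and items.pop() stands in for all_chars.remove(e):

```python
def first_failing_index(n_items):
    """Return the loop index at which indexing a shrinking list first fails."""
    items = list(range(n_items))
    # range() is evaluated once, so the loop keeps counting up to the
    # original length even though the list loses one item per pass.
    for c in range(len(items)):
        items.pop()  # stands in for all_chars.remove(e)
        try:
            items[c]
        except IndexError:
            return c
    return None

print(first_failing_index(66))
```

With 66 starting items the index and the remaining length cross over halfway through, which matches the "index out of range: 33" reported in the question.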
I would recommend making two lists, or at least keeping the two collections separate.
import random
all_chars='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,.!?'
def make_encoder(all_chars):
    list_chars= list(all_chars)
    all_chars= list(all_chars) #<-------------EDIT
    encoder = {}
    for c in all_chars:
        e = random.choice(list_chars)
        list_chars.remove(e)
        key = c #<---------------EDIT
        encoder[key] = e
    return encoder #<--------EDIT, unindented this line.
That was your issue: you were removing items from the list you were iterating through. Making two lists, although a little messy, is the best way.
You don't have to remove it from the initial string (it's bad practice to change an item while iterating over it).
Just check that the item isn't already in the dictionary.
import random
all_chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,.!?'
encoder = {}
n = 0
while len(all_chars) != len(encoder):
    rand = random.choice(all_chars)
    if rand not in encoder.values():  # skip characters already used as a code
        encoder[all_chars[n]] = rand
        n += 1
for k, v in sorted(encoder.items()):
    print(k, v)
By the way, your encoder may work fine by doing this, but you have no way to decode it back since you are using a random factor to build the encoder. You can fix this by using random.seed('KEY').
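As a different technique from the loops above, the whole alphabet can be shuffled once and zipped against the original, which guarantees that every character gets a distinct code. This is a sketch, not the asker's method: the seed parameter and the translate helper are assumptions added for illustration. It also returns the translated string; encode_message and decode_message in the question have no return statement, so they return None, which would explain the TypeError on for x in msg:.

```python
import random

ALL_CHARS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ,.!?'

def make_encoder(all_chars, seed=None):
    """Map each character to a distinct, randomly chosen character."""
    rng = random.Random(seed)  # a fixed seed reproduces the same mapping
    originals = list(all_chars)
    shuffled = originals[:]
    rng.shuffle(shuffled)
    return dict(zip(originals, shuffled))

def make_decoder(encoder):
    """Invert the encoder; this is safe because every value is unique."""
    return {v: k for k, v in encoder.items()}

def translate(mapping, msg):
    """Apply a character mapping to a message and return the result."""
    return ''.join(mapping[ch] for ch in msg)

encoder = make_encoder(ALL_CHARS, seed='KEY')
decoder = make_decoder(encoder)
secret = translate(encoder, 'Hello, world!')
print(secret)
print(translate(decoder, secret))
```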
