Newbie at coding, doing this for university. I have written a dictionary which translates codons into single letter amino acids. However, my function can't find the keys in the dict and just adds an X to the list I've made. See code below:
codon_table = {('TTT', 'TTC'): 'F',
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'): 'L',
('ATT', 'ATC', 'ATA'): 'I',
('ATG'): 'M',
('GTT', 'GTC', 'GTA', 'GTG'): 'V',
('TCT', 'TCC', 'TCA', 'TCG'): 'S',
('CCT', 'CCC', 'CCA', 'CCG'): 'P',
('ACT', 'ACC', 'ACA', 'ACG'): 'T',
('GCT', 'GCC', 'GCA', 'GCG'): 'A',
('TAT', 'TAC'): 'Y',
('CAT', 'CAC'): 'H',
('CAA', 'CAG'): 'Q',
('AAT', 'AAC'): 'N',
('AAA', 'AAG'): 'K',
('GAT', 'GAC'): 'D',
('GAA', 'GAG'): 'E',
('TGT', 'TGC'): 'C',
('TGG'): 'W',
('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'): 'R',
('AGT', 'AGC'): 'S',
('GGT', 'GGC', 'GGA', 'GGG'): 'G',
('TAA', 'TAG', 'TGA'): '*',
}
AA_seq = []
input_DNA = str(input('Please input a DNA string: '))
def translate_dna():
    list(input_DNA)
    global AA_seq
    for codon in range(0, len(input_DNA), 3):
        if codon in codon_table:
            AA_seq = codon_table[codon]
            AA_seq.append(codon_table[codon])
        else:
            AA_seq.append('X')
    print(str(' '.join(AA_seq)).strip('[]').replace("'", ""))
translate_dna()
I input a DNA sequence, e.g. TGCATGCTACGTAGCGGACCTGG, and it only returns XXXXXXX. What I would expect is a string of single letters corresponding to the amino acids in the dict.
I've been staring at it for the best part of an hour, so I figured it's time to ask the experts. Thanks in advance.
You need a codon dictionary keyed on single codons.
Then you need to iterate over the input sequence in groups of 3.
You also need to decide what the output should look like if a triplet is not found in your lookup dictionary.
For example:
from functools import cache
codon_table = {('TTT', 'TTC'): 'F',
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'): 'L',
('ATT', 'ATC', 'ATA'): 'I',
('ATG',): 'M',
('GTT', 'GTC', 'GTA', 'GTG'): 'V',
('TCT', 'TCC', 'TCA', 'TCG'): 'S',
('CCT', 'CCC', 'CCA', 'CCG'): 'P',
('ACT', 'ACC', 'ACA', 'ACG'): 'T',
('GCT', 'GCC', 'GCA', 'GCG'): 'A',
('TAT', 'TAC'): 'Y',
('CAT', 'CAC'): 'H',
('CAA', 'CAG'): 'Q',
('AAT', 'AAC'): 'N',
('AAA', 'AAG'): 'K',
('GAT', 'GAC'): 'D',
('GAA', 'GAG'): 'E',
('TGT', 'TGC'): 'C',
('TGG',): 'W',
('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'): 'R',
('AGT', 'AGC'): 'S',
('GGT', 'GGC', 'GGA', 'GGG'): 'G',
('TAA', 'TAG', 'TGA'): '*',
}
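@cache  # memoize lookups so each distinct codon scans the table only once
def lookup(codon):
    # scan the tuple keys for one that contains this codon
    for k, v in codon_table.items():
        if codon in k:
            return v
    return '?'  # no key contains this triplet
sequence = 'TGCATGCTACGTAGCGGACCTGG'
AA_Seq = []
# step through the sequence three bases at a time
for i in range(0, len(sequence), 3):
    AA_Seq.append(lookup(sequence[i:i+3]))
print(AA_Seq)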
Output:
['C', 'M', 'L', 'R', 'S', 'G', 'P', '?']
Note:
The ? appears because the last item extracted from the input sequence is 'GG' which is not a valid codon.
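Also note that in your original codon_table the entry ('ATG'): 'M' is not a tuple/string pair. ('ATG') is just a string (the parentheses are irrelevant). Write it as ('ATG',): 'M' to make the key a 1-tuple. This matters for a lookup like the one above: with a string key, codon in k is a substring test rather than a tuple membership test, so a fragment such as 'GG' would match the string 'TGG'.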
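A quick check makes the difference between a parenthesised string and a 1-tuple concrete. This is only an illustrative sketch; the variable names not_a_tuple and one_tuple are made up for the example.

```python
# Parentheses alone do not make a tuple; the trailing comma does.
not_a_tuple = ('TGG')    # just the string 'TGG'
one_tuple = ('TGG',)     # a tuple with one element

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple

# With a string, `in` is a substring test; with a tuple it is a membership test.
print('GG' in not_a_tuple)  # True
print('GG' in one_tuple)    # False
```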
Your for loop goes through the input but never finds a match, so it appends "X" to AA_seq every time.
This is because:
your loop variable is an integer index (0, 3, 6, ...) produced by range(), not a three-letter slice of the input string
your dictionary keys are tuples, which means the string "TTT" is not the same thing as ("TTT",) and never equals a key
To fix this:
Restructure your dictionary so that each key is a single codon string instead of a tuple.
Slice the input inside the loop, e.g. input_DNA[i:i+3], to get a string of length three.
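The two fixes can be sketched together as follows. This is a minimal illustration, not the full table: grouped_table holds only a few entries, and the names grouped_table, flat_table and translate are hypothetical.

```python
# A few grouped entries in the style of the question (not the complete table).
grouped_table = {
    ('TTT', 'TTC'): 'F',
    ('ATG',): 'M',
    ('TGT', 'TGC'): 'C',
    ('TAA', 'TAG', 'TGA'): '*',
}

# Flatten to one key per codon so a plain dict lookup works.
flat_table = {codon: aa for codons, aa in grouped_table.items() for codon in codons}

def translate(dna, table, unknown='X'):
    """Translate dna three bases at a time; unknown triplets become `unknown`."""
    return ''.join(table.get(dna[i:i+3], unknown) for i in range(0, len(dna), 3))

print(translate('TGCATGTTTGG', flat_table))  # CMFX (the trailing 'GG' is not a codon)
```

Using table.get with a default keeps the decision about unknown triplets in one place.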
I want to write a really short script that will help me generate a random/nonsense word with the following qualities:
-Has 8 letters
-First letter is "A"
-Second and Fourth letters are random letters
-Fifth letter is a vowel
-Sixth and Seventh letters are random letters and are the same
-Eighth letter is a vowel that's not "a"
This is what I have tried so far (using all the info I could find and understand online)
firsts = 'A'
seconds = ['a','b','c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
thirds = ['a', 'e', 'i', 'o', 'u', 'y']
fourths = ['a','b','c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
fifths = ['a', 'e', 'i', 'o', 'u', 'y']
sixths = sevenths = ['a','b','c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
eighths = ['e', 'i', 'o', 'u', 'y']
print [''.join(first, second, third, fourth, fifth)
for first in firsts
for second in seconds
for third in thirds
for fourth in fourths
for fifth in fifths
for sixth in sixths
for seventh in sevenths
for eighth in eighths]
However it keeps showing a SyntaxError: invalid syntax after the for and now I have absolutely no idea how to make this work. If possible please look into this for me, thank you so much!
The function you need for picking a random letter is random.choice. Pass it a list and it returns a random element of that list. It also works with strings, because a string is a sequence of characters. To make your life easier, use the string module: string.ascii_lowercase is a string containing all the letters from a to z, so you don't have to type them out. Lastly, you don't need loops to join strings together. Keep it simple and just add them.
import string
from random import choice
first = 'A'
second = choice(string.ascii_lowercase)
third = choice(string.ascii_lowercase)
fourth = choice(string.ascii_lowercase)
fifth = choice("aeiou")
sixthSeventh = choice(string.ascii_lowercase)
eighth = choice("eiou")
word = first + second + third + fourth + fifth + sixthSeventh + sixthSeventh + eighth
print(word)
Try this:
import random
sixth=random.choice(sixths)
s='A'+random.choice(seconds)+random.choice(thirds)+random.choice(fourths)+random.choice(fifths)+sixth+sixth+random.choice(eighths)
print(s)
Output:
Awixonno
Ahiwojjy
etc
There are several things to consider. First, the str.join() method takes in an iterable (e.g. a list), not a bunch of individual elements. Doing
''.join([first, second, third, fourth, fifth])
fixes the program in this respect. If you are using Python 3, print() is a function, and so you should add parentheses around the entire list comprehension.
With the syntax out of the way, let's get to a more interesting problem: your program constructs every possible word (82,255,680 of them!). This takes a long time and a lot of memory. What you probably want is to pick just one. You could of course construct them all and then pick one at random, but it is far cheaper to pick one letter at random from each of firsts, seconds, etc. and join these. All together then:
import random
firsts = ['A']
seconds = ['a','b','c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
thirds = ['a', 'e', 'i', 'o', 'u', 'y']
fourths = ['a','b','c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
fifths = ['a', 'e', 'i', 'o', 'u', 'y']
sixths = sevenths = ['a','b','c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
eighths = ['e', 'i', 'o', 'u', 'y']
# The sixth and seventh letters must be identical, so draw that letter once.
sixth = random.choice(sixths)
result = ''.join([
    random.choice(firsts),
    random.choice(seconds),
    random.choice(thirds),
    random.choice(fourths),
    random.choice(fifths),
    sixth,
    sixth,
    random.choice(eighths),
])
print(result)
To improve the code from here, try to:
Find a way to generate the "data" in a neater way than writing it out explicitly. As an example:
import string
seconds = list(string.ascii_lowercase) # you don't even need list()!
Instead of having separate variables firsts, seconds, etc., collect them into a single list in which each original list becomes one str containing all of its characters.
This will implement what you describe. You can make the code neater by putting the choices into an overall list rather than have several different variables, but you will have to explicitly deal with the fact that the sixth and seventh letters are the same; they will not be guaranteed to be the same simply because there are the same choices available for each of them.
The list choices_list could contain sub-lists per your original code, but as you are choosing single characters it will work equally with strings when using random.choice and this also makes the code a bit neater.
import random
choices_list = [
'A',
'abcdefghijklmnopqrstuvwxyz',
'aeiouy',
'abcdefghijklmnopqrstuvwxyz',
'aeiouy',
'abcdefghijklmnopqrstuvwxyz',
'eiouy'
]
letters = [random.choice(choices) for choices in choices_list]
word = ''.join(letters[:6] + letters[5:]) # here the 6th letter gets repeated
print(word)
Some example outputs:
Alaeovve
Aievellu
Ategiwwo
Aeuzykko
Here's the syntax fix:
print(["".join([first, second, third])
for first in firsts
for second in seconds
for third in thirds])
This method might take up a lot of memory.
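To put a number on that memory concern, here is a rough sketch. It assumes the letter pools from the question (where the sixth and seventh letters vary independently, as in the original comprehension); the name pools is made up for the example. itertools.product yields the combinations lazily instead of building the whole list.

```python
import itertools
import math
import string

letters = string.ascii_lowercase
# One pool of candidate characters per position, following the question's lists.
pools = ['A', letters, 'aeiouy', letters, 'aeiouy', letters, letters, 'eiouy']

# Number of words the full list comprehension would have to hold in memory.
total = math.prod(len(pool) for pool in pools)
print(total)  # 82255680

# itertools.product produces combinations one at a time; take only the first three.
first_three = [''.join(combo) for combo in itertools.islice(itertools.product(*pools), 3)]
print(first_three)  # ['Aaaaaaae', 'Aaaaaaai', 'Aaaaaaao']
```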
I have a list as follows:
input1 = ['XS','S', 'M', 'L', 'XL', 'XXL', 'XXXL']
input2 = ['XS', 'S', 'M', 'L', 'XL', 'XS', 'S']
input3 = ['XS', 'S', 'M', 'L', 'S']
input4 = ['XS', 'S', 'M', 'L', 'XS', 'L']
etc.
As you see the list elements will change every time. I want to know how to find the largest element each time.
The standard order, from smallest to largest, is:
['XS','S', 'M', 'L', 'XL', 'XXL', 'XXXL']
This is what I have tried:
lst1 = {k:v for k, v in enumerate(['XS', 'S', 'M', 'L','XL','XXL', 'XXXL'])}
lst2 = ['XS', 'S', 'M', 'L', 'XL', 'XS', 'S']
tem = []
for i in lst2:
    for k,v in list(lst1.items()):
        if i == v:
            tem.append(k)
print(lst1[max(tem)])
But this is very complicated I guess. It should be much easier!
Just reverse the dictionary and use the get function as key to max:
lst1 = {v: k for k, v in enumerate(['XS', 'S', 'M', 'L','XL','XXL', 'XXXL']) }
lst2 = ['XS', 'S', 'M', 'L', 'XL', 'XS', 'S']
result = max(lst2, key=lst1.get)
print(result)
Output
XL
You don't need to overcomplicate stuff with dictionaries when simple string search would be enough.
lst1 = ['XS', 'S', 'M', 'L','XL','XXL', 'XXXL']
lst2 = ['XS', 'S', 'M', 'L', 'XL', 'XS', 'S']
result = next(x for x in lst1[::-1] if x in lst2)
If you don't want to reverse the list each time, just define it in the opposite (largest-first) order.
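Another variant along the same lines, offered only as a sketch: rank each size by its position in the reference list and let max() do the rest. SIZE_ORDER and largest_size are hypothetical names; note that list.index raises ValueError for a size that is not in the reference list.

```python
SIZE_ORDER = ['XS', 'S', 'M', 'L', 'XL', 'XXL', 'XXXL']

def largest_size(sizes, order=SIZE_ORDER):
    """Return the entry of `sizes` that appears latest in `order`."""
    return max(sizes, key=order.index)

print(largest_size(['XS', 'S', 'M', 'L', 'XL', 'XS', 'S']))  # XL
print(largest_size(['XS', 'S', 'M', 'L', 'S']))              # L
```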
I have been working on Rosalind exercises for Bioinformatics stronghold on RNA Splicing. I am currently using Python 3.6 version. It didn't tell me there is any error in my code, so I'm assuming my code is fine. However, there is no output produced, no error warning or whatsoever. Below is my code:
DNA_CODON_TABLE = {
'TTT': 'F', 'CTT': 'L', 'ATT': 'I', 'GTT': 'V',
'TTC': 'F', 'CTC': 'L', 'ATC': 'I', 'GTC': 'V',
'TTA': 'L', 'CTA': 'L', 'ATA': 'I', 'GTA': 'V',
'TTG': 'L', 'CTG': 'L', 'ATG': 'M', 'GTG': 'V',
'TCT': 'S', 'CCT': 'P', 'ACT': 'T', 'GCT': 'A',
'TCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A',
'TCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A',
'TCG': 'S', 'CCG': 'P', 'ACG': 'T', 'GCG': 'A',
'TAT': 'Y', 'CAT': 'H', 'AAT': 'N', 'GAT': 'D',
'TAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D',
'TAA': '-', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E',
'TAG': '-', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E',
'TGT': 'C', 'CGT': 'R', 'AGT': 'S', 'GGT': 'G',
'TGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G',
'TGA': '-', 'CGA': 'R', 'AGA': 'R', 'GGA': 'G',
'TGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G'
}
def result(s):
    result = ''
    lines = s.split()
    dna = lines[0]
    introns = lines[1:]
    for intron in introns:
        dna = dna.replace(intron, '')
    for i in range(0, len(dna), 3):
        codon = dna[i:i+3]
        protein = None
        if codon in DNA_CODON_TABLE:
            protein = DNA_CODON_TABLE[codon]
        if protein == '-':
            break
        if protein:
            result += protein
    return ''.join(list(result))
if __name__ == "__main__":
    """small_dataset = ' '"""
    large_dataset = open('rosalind_splc.txt').read().strip()
    print (result(large_dataset))
This is the content in rosalind_splc.txt text file:
>Rosalind_3363
ATGGGGCTGAGCCCATGTCTAAATGATATCTTGGTGCATTGCAATCTAACTATTTTTTCG
CAACCATGTTCCATCTGGCGCAAAATGGGCGTGTAGGGAGCTTCGCTATAGTCACTGAAG
AACATTCGCAACTTACAGCTCTCGAGAGGGTACAGCTGGACGGTGTTTGTTTGGTCTAAG
TCTGAGTCCAAAGTCGTTGAATGTCGAGCTAGGTTGACGTCATTCTTCGAGTTACGTCTT
CATTGATTCGCGGCGGCCGCCAGCATTTGATTGTACACATCCGACGTCTTTGGCAATCTA
CATAATTATATTGAGAGGGGCGCCATTACTCGAACCCATAACAAACAACTGTCCGTTTAC
AAGGTTATATTATCATGACCTAATGGTTGAGCTACGGAGTGGGGGGCCCTCGGCTACAGG
TGTTAAACTATCCTGCGGATGCGGATCTTAGCCCGATTTGCATGGCCCAGTAAGGCGCTG
ATTGTAAACCGCCTAGCATACATGTGCTTCTTACTCCAGGGTCCATTGCTACCAGTTCGC
TTCTGACGCCTCAATTGTACCTTCCTTTTTTGAATGGCAACCTGCAATAGCAGTCGACTG
ATGGGGCGTTACAGTATGAAGGCTATATTTACATTATCTCTAAACACACTGCTACCGCGA
AACCCCAACTCGGACCGGTCAGAGCGCTCGTGCTTTGTTCTTGGTCGCTAGCGACCAACA
GTGGATAGGTGGGCGCGGGCCTTGCACCTCCTAGAGCATCACGTGGAGTGGATGCAAACA
GTCTATGGTCCCCCGCTTCGGCTCACGGGTAACGTCTCTTGTGGTACTAGACCATAGGCA
TCCAGGTGAGGGCTACATCCGTATTTAATGAAACTGAGTTCCTCCAAAGCTCCTCGGGAC
GCAGGCAGGTTCATCCGCAGTCAGTAAGGGAGGGAAGAGCTTTCCCCGTTCCACCCAGAT
GCCCTGTGCACGGGAGAGAGATCCAGGTGGTAG
>Rosalind_0423
TCGCAACTTACAGCTCTCGAGAGGG
>Rosalind_5768
GCCCAGTAAGGCGCTGATTGTAAACCGCCTAGCATACAT
>Rosalind_6780
GTCTTCATTGATTCGCGGCGGCCGCCAGCA
>Rosalind_6441
GCAAACAGTCT
>Rosalind_3315
TTGGTCGCTAGCGACCAACAGTGGATAGGTGGGCGCGGGCCTTGCACCT
>Rosalind_7467
TTATCTCTAAACACACTGC
>Rosalind_3159
CGCAGTCAGTAAGGGAGG
>Rosalind_6420
TCTAAGTCTGAGTCCAAAGTCGTTGAATGTCGAGCTAGGTTGACGT
>Rosalind_8344
GGGGCGCCATTACTCGAACCCATAACAAACAACT
>Rosalind_2993
CCAGGTGAGGGCTACATCCGTAT
>Rosalind_0536
ATTATCATGACCTAATG
>Rosalind_3774
TCGCAACCATGTTCCAT
>Rosalind_7168
GGGCCCTCGGCTACAGGTGTTAAACTAT
>Rosalind_8059
CAATTGTACCTTCCTTTTTTGAATG
Since there is no output, I would like to know which part of my code needs to be fixed in order for the output to appear. Thanks.
To understand which part of your code you need to change, it helps to understand what goes wrong in your code. If you have a code editor with a debugger, it helps to step through the code. If you don't have one, you can use the online tool http://pythontutor.com. Here is a direct link to your code with the first few lines of your input.
Click on the forward button under the code. At step 20 you jump into your function result(). After step 24 your input is split on the newlines. You can see that lines is now:
lines = ['>Rosalind_3363',
'ATGGGGCTGAGCCCATGTCTAAATGATATCTTGGTGCATTGCAATCTAACTATTTTTTCG',
'CAACCATGTTCCATCTGGCGCAAAATGGGCGTGTAGGGAGCTTCGCTATAGTCACTGAAG',
'>Rosalind_0423',
'TCGCAACTTACAGCTCTCGAGAGGG',
'>Rosalind_5768',
'GCCCAGTAAGGCGCTGATTGTAAACCGCCTAGCATACAT']
In step 25, you assign the first item of lines to the variable dna. So dna is now equal to >Rosalind_3363. You assign the rest of the items in the list to the variable introns in the next step. So now we have
dna = '>Rosalind_3363'
introns = ['ATGGGGCTGAGCCCATGTCTAAATGATATCTTGGTGCATTGCAATCTAACTATTTTTTCG',
'CAACCATGTTCCATCTGGCGCAAAATGGGCGTGTAGGGAGCTTCGCTATAGTCACTGAAG',
'>Rosalind_0423',
'TCGCAACTTACAGCTCTCGAGAGGG',
'>Rosalind_5768',
'GCCCAGTAAGGCGCTGATTGTAAACCGCCTAGCATACAT']
Here the first signs of trouble are already apparent. You probably expect dna to contain a DNA sequence, but it contains the sequence header of the FASTA file. Similarly, introns should contain only DNA sequences, but here it also contains FASTA sequence headers (>Rosalind_0423, >Rosalind_5768).
So what happens in the next lines doesn't make any sense anymore with the data you have now.
In the lines
for intron in introns:
dna = dna.replace(intron, '')
you want to remove the introns from the DNA, but dna doesn't contain a DNA sequence string and introns contains other things than substrings of dna. So after this loop, dna still equals >Rosalind_3363. None of the three letter sequences of dna (>Ro, sal, ind, ...) are valid codons, so they are not found in DNA_CODON_TABLE. And hence, result() returns an empty string.
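The effect is easy to reproduce in isolation. The short sketch below only splits the header string into three-character chunks the way the loop in result() does; the name chunks is made up for the illustration.

```python
dna = '>Rosalind_3363'   # what dna actually holds after lines[0]

# Same slicing as the loop in result(): step through the string three characters at a time.
chunks = [dna[i:i+3] for i in range(0, len(dna), 3)]
print(chunks)  # ['>Ro', 'sal', 'ind', '_33', '63']

# None of these chunks consists only of A, C, G and T, so none can be a codon.
print(any(set(chunk) <= set('ACGT') for chunk in chunks))  # False
```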
Now my guess as to what happened. You lifted the code verbatim from the internet (it is exactly equal to the code here) without understanding what it does and without realizing that the original author had already preprocessed the input data.
So, what do you need to do to fix the code?
Parse the FASTA file, for example using Bio.SeqIO.parse().
If necessary, concatenate the DNA strings of the first sequence. This is what should end up in your dna variable.
The sequence strings that follow are what should end up in your introns variable.
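The steps above can be sketched without any third-party library. This is a minimal, assumption-laden illustration: parse_fasta is a hypothetical helper (it is not part of Biopython), and the input is a made-up miniature example rather than the Rosalind dataset.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            records.append([line[1:], ''])   # a header line starts a new record
        elif records:
            records[-1][1] += line           # a sequence may span several lines
    return [(header, seq) for header, seq in records]

# Made-up miniature input, not the Rosalind dataset.
sample = """>seq_main
ATGAAA
CCCTTTTAG
>seq_intron
CCC
"""

records = parse_fasta(sample)
dna = records[0][1]                         # first record: the full DNA string
introns = [seq for _, seq in records[1:]]   # remaining records: the introns
for intron in introns:
    dna = dna.replace(intron, '')

print(introns)  # ['CCC']
print(dna)      # ATGAAATTTTAG
```

With dna and introns filled this way, the translation loop from the question receives the kind of data it expects.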
I am working through the 'Rosalind' problems and I've become stuck on what the issue with my code is... The problem is:
Either strand of a DNA double helix can serve as the coding strand for
RNA transcription. Hence, a given DNA string implies six total reading
frames, or ways in which the same region of DNA can be translated into
amino acids: three reading frames result from reading the string
itself, whereas three more result from reading its reverse complement.
An open reading frame (ORF) is one which starts from the start codon
and ends by stop codon, without any other stop codons in between.
Thus, a candidate protein string is derived by translating an open
reading frame into amino acids until a stop codon is reached.
Given: A DNA string s of length at most 1 kbp in FASTA format.
Return: Every distinct candidate protein string that can be translated
from ORFs of s. Strings can be returned in any order.
Here is my code (Python):
DNA_Codons = {
'TTT': 'F', 'CTT': 'L', 'ATT': 'I', 'GTT': 'V',
'TTC': 'F', 'CTC': 'L', 'ATC': 'I', 'GTC': 'V',
'TTA': 'L', 'CTA': 'L', 'ATA': 'I', 'GTA': 'V',
'TTG': 'L', 'CTG': 'L', 'ATG': 'M', 'GTG': 'V',
'TCT': 'S', 'CCT': 'P', 'ACT': 'T', 'GCT': 'A',
'TCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A',
'TCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A',
'TCG': 'S', 'CCG': 'P', 'ACG': 'T', 'GCG': 'A',
'TAT': 'Y', 'CAT': 'H', 'AAT': 'N', 'GAT': 'D',
'TAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D',
'TAA': '-', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E',
'TAG': '-', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E',
'TGT': 'C', 'CGT': 'R', 'AGT': 'S', 'GGT': 'G',
'TGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G',
'TGA': '-', 'CGA': 'R', 'AGA': 'R', 'GGA': 'G',
'TGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G'
}
bases={"A":"T",
"T":"A",
"G":"C",
"C":"G"}
def Pro(DNA, start, Rev):
    #Calculates the Reverse compliment if using
    if Rev == True:
        reverse=DNA[::-1]
        compliment=[]
        for base in reverse:
            compliment+=bases[base]
        Seq="".join(compliment)
    elif Rev== False:
        Seq=DNA
    Protein=[]
    #Finds a start codon
    for i in range(start, len(Seq),3):
        codon=Seq[i:i+3]
        if codon=="ATG":
            #Starting from that start codon, returns a protein, breaks if stop codon
            #-2 included so that it's always in blocks of 3
            for j in range(i,len(Seq)-2,3):
                new_codon=Seq[j:j+3]
                if DNA_Codons[new_codon]!="-":
                    Protein+=[DNA_Codons[new_codon]]
                else:
                    #Adds in the '-' to split proteins that start within the same Reading Frame
                    Protein+=[DNA_Codons[new_codon]]
                    break
    return Protein
f = open('rosalind_orf.txt','r').read()
#Puts each FASTA String into an arrary
strings=f.split(">")
#removes the FASTA ID from the string in array and new line characters
for i in range(len(strings)):
    strings[i]=strings[i].strip("Rosalind_0123456789")
    strings[i]=strings[i].replace("\n","")
DNA=strings[1]
#Adds proteins from all Open Reading Frames
Proteins=[]
for i in range(len(DNA)):
    Proteins+="".join(Pro(DNA,i,False)).split('-')
    Proteins+="".join(Pro(DNA,i,True)).split('-')
#Mades a list of Unique Proteins and prints them
Unique_Proteins=[]
for p in Proteins:
    if (p not in Unique_Proteins and p!=""):
        Unique_Proteins+=[p]
        print p
Using the sample data:
Rosalind_99 AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
My code works fine, however for every question dataset I've been given it fails...
Here is one of the question datasets that I've failed on:
Rosalind_1485 GACCAGAATGCGTTAGTCGGCCTCAGAGCGCACAAAAACCAGTATTTACAAAGTGGGACG
TAGCGCCCCGCGGCGTCCTTTTGCCCTATCGAAAGTATAGGCATCAGCTTTTTACCACCT
TGTCATAGGTAAACTGCCCGACCCAGGTCCGGCCCTCAGCCCAACGCAGATAAACCAAGG
TTATAGATGTGGCCTGTAGGCATATTGCTCTTAATGTTATAAAGAGCGAAGCGTGGTCTC
GGTTTGTAAACATTAATCAAATTCCCAGGCACTAAGCCATGGTCGCCCCGGATTGGTTTT
CCGGTGTACGCATCGGTGGCAGCTGGAGGGGACAGTTTAGGTGCTGCAATTGAACATGAA
ACTGCACGAAAGGTGGGGTGGGCCGGATCTTGCGGGCCTCGAAAGGGTAGTGTTCCTCTG
CTATCTAGTCCAATTACCTGTAGTATATATGATCAGGCCGTCGGTTACTTAGCTAAGTAA
CCGACGGCCTGATCATCTCCTAGGAAATGGTCCTGAATGCGAACTAGGTTCCGTGGAATG
ATGGGGCCCAGAGGAAACCTGTACGCAATGGATCCCGGACAGATAGACCGGGAGGTCTTG
CAACCTCTTGTGGGAGTTACAGGCCGTACCTGAATTGCCCTCGTACCATTTGAAATGGTG
CGACGCCTGTACGCAACAATCGTTCGCCTGGATAATACAGACGGCCATTTCTGTAGGAAC
GATACCGTAACGCGACGTCAGGCATGACGTTAACTGCGTCACGTTTCATACCACTATGTG
AGGTACCCACTCCTTCATTTACCGCGAGATAAAGAGCCACCACCACCTTCTCTTGGTTTC
CATGCGCCGATCGGCTAAACGTGCATCACATTCAGGCGAAGAGTCAAATGGAAGCTCGCA
ATTTTAGGCCTTTATGGCGAATATCCCGCAAGCCTTAGGCGCGT
Obviously this code is nowhere near efficient and there's lot that could be improved upon, I'm just curious as to why it's not working.
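One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: the problem statement requires an ORF to end at a stop codon, so a run that starts at ATG but reaches the end of the strand without a stop should not be reported. In the sample string, for instance, the ATG near the end (ATGATCCGAGTAGCATCTCAG) is never followed by an in-frame stop. The sketch below only records a protein once a stop codon is actually reached. All names here (CODON_TABLE, reverse_complement, candidate_proteins) are hypothetical, and the table uses '*' for stop instead of the '-' used above.

```python
# Standard genetic code, built compactly in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def reverse_complement(dna):
    """Reverse the strand and swap A<->T and C<->G."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))

def candidate_proteins(dna):
    """Distinct proteins from every ATG that is followed by an in-frame stop."""
    found = set()
    for strand in (dna, reverse_complement(dna)):
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue
            protein = []
            for j in range(start, len(strand) - 2, 3):
                amino = CODON_TABLE[strand[j:j + 3]]
                if amino == "*":
                    found.add("".join(protein))  # keep only ORFs that reach a stop
                    break
                protein.append(amino)
    return found

sample = ("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGG"
          "AATAAGCCTGAATGATCCGAGTAGCATCTCAG")
proteins = candidate_proteins(sample)
for p in sorted(proteins):
    print(p)
```

Scanning every start position on both strands covers all six reading frames, and the set takes care of duplicates.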
This is a problem that could apply to any language, but I'll use python to show it.
Say you have a list of numbers, ls = [0,100,200,300,400]
You can insert an element at any index, but the elements must always stay in numerical order. Duplicates are not allowed.
For example, ls.insert(2, 150) results in ls = [0,100,150,200,300,400]. The elements are in the correct order, so this is correct.
However, ls.insert(3, 190) results in ls = [0,100,200,190,300,400]. This is incorrect.
For any index i, what is the best number x to use in ls.insert(i,x) to minimize the number of sorts?
My first intuition was to add half the difference between the previous and next numbers to the previous one. So to insert a number at index 3, x would equal 200 + (300-200)/2, or 250. However, this approaches the asymptote far too quickly. When the differences get too close to 0, I could restore the differences by looping through and changing each number to produce a larger difference. I want to choose the best number for x so as to minimize the number of times I need to reset.
EDIT
The specific problem I'm applying this to is an iOS app with a list view. The items in the list are represented in a Set, and each object has an attribute orderingValue. I can't use an Array to represent the list (due to issues with cache-server syncing), so I have to sort the set each time I display the list to the user. In order to do this, the orderingValue must be stored on the ListItem object.
One additional detail is, due to the nature of the UI, the user is probably more likely to add an item to the top or bottom of the list rather than insert it in the middle.
You can generate sort keys indefinitely if you use strings rather than integers. That's because a lexicographical ordering of strings puts an infinite number of values between any two strings (as long as the larger isn't the smaller one followed by "a").
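A small sketch of that ordering property, using made-up keys: Python compares strings lexicographically, so a longer string can always be slotted between two neighbours, except in the one case mentioned above.

```python
import string

# Lexicographic order: "sg" and "sm" both sort between "s" and "t".
keys = ["t", "m", "sm", "s", "sg"]
print(sorted(keys))  # ['m', 's', 'sg', 'sm', 't']

# The exception: no lowercase string fits strictly between "b" and "ba",
# because any extension of "b" would need a character smaller than "a".
candidates = ["b" + c for c in string.ascii_lowercase]
print([s for s in candidates if "b" < s < "ba"])  # []
```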
Here's a function to generate a lowercase string key between two other keys:
def get_key_str(low="a", high="z"):
    if low == "":
        low = "a"
    assert(low < high)
    for i, (a, b) in enumerate(zip(low, high)):
        if a < b:
            mid = chr((ord(a) + ord(b))//2) # get the character half-way between a and b
            if mid != a:
                return low[:i] + mid
            else:
                return low[:i+1] + get_key_str(low[i+1:], "z")
    return low + get_key_str("a", high[len(low):])
It always returns a string s such that "a" <= low < s < high <= "z". "a" and "z" are never used themselves as keys, they're special values to indicate the boundaries of the possible results.
You'd call it with get_key_str(lst[i-1], lst[i]) to get a value to insert before the value at index i. You can generate and insert a value in one go with lst.insert(i, get_key_str(lst[i-1], lst[i])). Obviously though, the ends of the list need special handling.
The default values are set so that you can omit an argument to get a value to insert at the start or the end. That is, call get_key_str(high=lst[0]) to get a value to put at the start of your list or get_key_str(lst[-1]) to get a value to append to at the end. You can also explicitly pass "a" as low or "z" as high, if that's easier. With no arguments, it will return "m", which is a reasonable first value to put in an empty list.
It's possible that you could tune this a bit to give shorter keys when you're mostly adding at the start or end, but that would be a bit more complicated. This version should have its keys grow roughly evenly if you're inserting randomly.
Here's an example of doing some random inserts:
>>> import random
>>> lst = []
>>> for _ in range(10):
        index = random.randint(0, len(lst))
        print("inserting at", index)
        if index == 0:
            low = "a"
        else:
            low = lst[index-1]
        if index == len(lst):
            high = "z"
        else:
            high = lst[index]
        lst.insert(index, get_key_str(low, high))
        print(lst)
inserting at 0
['m']
inserting at 1
['m', 's']
inserting at 2
['m', 's', 'v']
inserting at 2
['m', 's', 't', 'v']
inserting at 2
['m', 's', 'sm', 't', 'v']
inserting at 0
['g', 'm', 's', 'sm', 't', 'v']
inserting at 3
['g', 'm', 's', 'sg', 'sm', 't', 'v']
inserting at 2
['g', 'm', 'p', 's', 'sg', 'sm', 't', 'v']
inserting at 2
['g', 'm', 'n', 'p', 's', 'sg', 'sm', 't', 'v']
inserting at 3
['g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v']
And here's how it behaves if we then do a bunch of inserts at the start and end:
>>> for _ in range(10):
        lst.insert(0, get_key_str(high=lst[0])) # start
        lst.insert(len(lst), get_key_str(low=lst[-1])) # end
        print(lst)
['d', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x']
['b', 'd', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x', 'y']
['am', 'b', 'd', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x', 'y', 'ym']
['ag', 'am', 'b', 'd', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x', 'y', 'ym', 'ys']
['ad', 'ag', 'am', 'b', 'd', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x', 'y', 'ym', 'ys', 'yv']
['ab', 'ad', 'ag', 'am', 'b', 'd', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x', 'y', 'ym', 'ys', 'yv', 'yx']
['aam', 'ab', 'ad', 'ag', 'am', 'b', 'd', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x', 'y', 'ym', 'ys', 'yv', 'yx', 'yy']
['aag', 'aam', 'ab', 'ad', 'ag', 'am', 'b', 'd', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x', 'y', 'ym', 'ys', 'yv', 'yx', 'yy', 'yym']
['aad', 'aag', 'aam', 'ab', 'ad', 'ag', 'am', 'b', 'd', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x', 'y', 'ym', 'ys', 'yv', 'yx', 'yy', 'yym', 'yys']
['aab', 'aad', 'aag', 'aam', 'ab', 'ad', 'ag', 'am', 'b', 'd', 'g', 'm', 'n', 'o', 'p', 's', 'sg', 'sm', 't', 'v', 'x', 'y', 'ym', 'ys', 'yv', 'yx', 'yy', 'yym', 'yys', 'yyv']
So at the start you may end up with keys prefixed by a run of "a" characters, and at the end you'll get keys prefixed by a run of "y" characters.
As far as the 'best' value is concerned, it is always going to be halfway through the previous and the next element. And it is going to reach the asymptote.
One way to delay arrival at the asymptote if there are repeated insertions at a particular index is to decrement the previous and increment the next value (I'm assuming you are allowed to do this) every time you perform the insert.
So, for ls.insert(2,150), after insertion
ls[1] = ls[1] - (ls[1] - ls[0])/2
ls[3] = ls[3] + (ls[4] - ls[3])/2
For every other insertion, this rule will hold, and assuming insertions are at random indices, you would have a fair amount of time before you need to increase each number's value.
Also, the moment you encounter two adjacent numbers with a difference of 1, you would, of course, have to loop through the numbers and increase them.
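To see how quickly repeated insertion at one index uses up the gap, here is a rough sketch with integer keys. The helper names midpoint_key and renumber are hypothetical, and the fixed step of 100 used for the reset is just an assumption taken from the example list.

```python
def midpoint_key(prev_value, next_value):
    """Integer halfway between two neighbours, or None if no integer fits."""
    if next_value - prev_value < 2:
        return None
    return prev_value + (next_value - prev_value) // 2

def renumber(values, step=100):
    """Respread the keys evenly while keeping their order (the 'reset' step)."""
    return [i * step for i in range(len(values))]

ls = [0, 100, 200, 300, 400]
inserts = 0
while True:
    key = midpoint_key(ls[2], ls[3])   # always insert at index 3, right after 200
    if key is None:
        break                          # adjacent values differ by 1: time to reset
    ls.insert(3, key)
    inserts += 1

print(inserts)       # 6
print(ls)            # [0, 100, 200, 201, 203, 206, 212, 225, 250, 300, 400]
print(renumber(ls))  # [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
```

With a gap of 100, only about log2(100) insertions fit at the same spot before a reset is needed, which is why widening the gaps or switching to string keys helps.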