I am fairly new to Python (and programming in general; I started 2 months ago). I have been tasked with creating a program that takes a user's starting string (e.g. "11001100") and prints each generation based on a set of rules. It stops when it repeats the user's starting string. However, I am clueless as to where to even begin. I only vaguely understand the concept of cellular automata and am therefore at a loss as to how to implement it in a script.
Ideally, it would take the user's input string "11001100" (gen0), look at the rule set I created, and convert it so that "11001100" becomes "00110011" (gen1), then convert it again (gen2) and again (gen3) until it is back to the original input the user provided (gen0). My attempt and my rule set are below:
print("What is your starting string?")
SS = input()
gen = [SS]
while 1:
    for i in range(len(SS)):
        if gen[-1] in gen[:-2]:
            break
for g in gen:
    print(g)
newstate = {
    # This is used to convert the string. We break up the user's string into threes, i.e. if the user enters 11001100, we start with the leftmost digit "1" and look at its neighbors (x-1 and x+1), in this case "0" and "1". Using these three numbers we compare it to the chart below:
    '000': 1,
    '001': 1,
    '010': 0,
    '011': 0,
    '100': 1,
    '101': 1,
    '110': 0,
    '111': 0,
}
I would greatly appreciate any help or further explanation/dummy proof explanation of how to get this working.
Assuming that newstate is a valid dict whose key/value pairs correspond to your state replacement (if you want 100 to convert to 011, you would have newstate['100'] == '011'), you can build the new string by joining a generator expression over the previous string:
changed = ''.join(newstate[c] for c in prev)
where prev is your previous state string. For example:
>>> newstate = {'1':'0','0':'1'}
>>> ''.join(newstate[c] for c in '0100101')
'1011010'
You can then use the same expression to update a string by reassigning the result to the same name:
>>> changed = '1010101'
>>> changed = ''.join(newstate[c] for c in changed)
>>> changed
'0101010'
You have the basic flow down in your original code; you just need to refine it. The pseudocode would look something like:
newstate = dict with key/value mapping pairs
original = input
changed = original->after changing
while changed != original:
    changed = changed->after changing
    print changed
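As a minimal sketch of how the pseudocode above could be made runnable for the three-cell rule table in the question: the code below assumes the neighbors wrap around at both ends of the string (the question does not say how the edges are handled) and stores the rule outputs as strings so they can be joined directly. The names RULES, next_generation and generations, and the limit guard, are illustrative choices and not part of the original post.

```python
# Rule table from the question, with string values so results can be joined.
RULES = {
    '000': '1', '001': '1', '010': '0', '011': '0',
    '100': '1', '101': '1', '110': '0', '111': '0',
}

def next_generation(state, rules=RULES):
    """Apply the rule table to every cell; neighbors wrap around (an assumption)."""
    n = len(state)
    return ''.join(
        rules[state[i - 1] + state[i] + state[(i + 1) % n]]
        for i in range(n)
    )

def generations(start, rules=RULES, limit=1000):
    """Collect generations until the starting string reappears.

    The limit is a safety guard: some rule sets never return to the start.
    """
    result = [start]
    current = next_generation(start, rules)
    while current != start and len(result) < limit:
        result.append(current)
        current = next_generation(current, rules)
    return result

for g in generations("11001100"):
    print(g)
```

With this particular rule table each cell simply flips, so "11001100" becomes "00110011" and then returns to the starting string, which matches the gen1 value given in the question.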
The easiest way to do this would be with the re.sub() function in Python's regex module, re.
import re
def replace_rule(string, new, pattern):
    return re.sub(pattern, new, string)
def replace_example(string):
    pattern = r"100"
    replace_with = "1"
    return re.sub(pattern, replace_with, string)
replace_example("1009")
=> '19'
replace_example("1009100")
=> '191'
Regex is a way to match strings against regular patterns and perform operations on them, such as sub, which finds and replaces patterns in strings. See the documentation: https://docs.python.org/3/library/re.html
I have a really ugly command where I use many appended "replace()" methods to replace/substitute/scrub many different strings from an original string. For example:
newString = originalString.replace(' ', '').replace("\n", '').replace('()', '').replace('(Deployed)', '').replace('(BeingAssembled)', '').replace('ilo_', '').replace('ip_', '').replace('_ilop', '').replace('_ip', '').replace('backupnetwork', '').replace('_ilo', '').replace('prod-', '').replace('ilo-','').replace('(EndofLife)', '').replace('lctcvp0033-dup,', '').replace('newx-', '').replace('-ilo', '').replace('-prod', '').replace('na,', '')
As you can see, it's a very ugly statement and makes it very difficult to know what strings are in the long command. It also makes it hard to reuse.
What I'd like to do is define an input array of many replacement pairs, where a replacement pair looks like [<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>], and where the greater array looks something like:
replacementArray = [
[<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
[<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
[<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>],
[<ORIGINAL_SUBSTRING>, <NEW_SUBSTRING>]
]
AND, I'd like to pass that replacementArray, along with the original string that needs to be scrubbed to a function that has a structure something like:
def replaceAllSubStrings(originalString, replacementArray):
    newString = ''
    for each pair in replacementArray:
        perform the substitution
    return newString
MY QUESTION IS: What is the right way to write the function's code block to apply each pair in the replacementArray? Should I be using the "replace()" method? The "sub()" method? I'm confused as to how to restructure the original code into a nice clean function.
Thanks, in advance, for any help you can offer.
You have the right idea. Use sequence unpacking to iterate each pair of values:
def replaceAllSubStrings(originalString, replacementArray):
    for in_rep, out_rep in replacementArray:
        originalString = originalString.replace(in_rep, out_rep)
    return originalString
How about using re?
import re
def make_xlat(*args, **kwds):
    adict = dict(*args, **kwds)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    def xlat(text):
        return rx.sub(one_xlat, text)
    return xlat
replaces = {
    "a": "b",
    "well": "hello"
}
replacer = make_xlat(replaces)
replacer("a well?")
# b hello?
You can add as many items in replaces as you want.
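One caveat with the alternation approach: Python's regex alternation tries the alternatives left to right, so if one key is a prefix of another (such as 'ilo' and 'ilo_' in the question's list), the shorter key can match first and leave the rest behind. A possible variation, sketched here with a hypothetical function name and sample mapping, sorts the keys longest first before building the pattern. It assumes the mapping is not empty.

```python
import re

def make_xlat_longest_first(mapping):
    """Build a one-pass replacer that prefers longer keys over their prefixes."""
    # Longest keys first, so 'ilo_' is tried before 'ilo'.
    keys = sorted(mapping, key=len, reverse=True)
    rx = re.compile('|'.join(map(re.escape, keys)))

    def xlat(text):
        return rx.sub(lambda match: mapping[match.group(0)], text)

    return xlat

# Hypothetical sample pairs modelled on the question's replace() chain.
pairs = {'ilo': '', 'ilo_': '', '-prod': ''}
scrub = make_xlat_longest_first(pairs)
print(scrub('ilo_server-prod'))
```

Unlike chained replace() calls, this makes a single pass over the text, so the output of one replacement is never rescanned by another.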
I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
    if all_keys[i]!='Time':
        #print all_keys[i]
        pattern = re.compile(all_keys[i])
        for j in range(len(specie_name_and_initial_values)):
            print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify, my current code is below. It is used as a method within a class.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
    copasi_tool = MineParamEstTools()
    data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
    #uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time=='minutes':
        data['Time']=data['Time']*60
    elif time=='hour':
        data['Time']=data['Time']*3600
    elif time=='seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species=[]
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also, if you wanted to compare 'xyz' to 'xyz2', you would use == rather than in: 'xyz' == 'xyz2' correctly returns False, whereas 'xyz' in 'xyz2' is a substring test and returns True.
You can also rewrite your own code a lot more succinctly:
for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)
Based on your edit, your species names are strings wrapped in square brackets (for example '[Cyp26]') rather than lists, so they never equal the plain keys. One option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error, as lists are mutable and therefore not hashable.
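To go one step further and fetch the numerical value that belongs to each matched header, an exact-match dictionary lookup may be enough. This is only a sketch: the numeric values below are made up for illustration, and it assumes specie_name_and_initial_values is a list of [name, value] pairs as described in the question.

```python
# Hypothetical sample data; the real values come from the COPASI file.
specie_name_and_initial_values = [
    ['[Cyp26_G_R1]', 0.5],
    ['[Cyp26_G_R1R2]', 2.0],
    ['[Cyp26_G_rep1]', 1.5],
]
all_keys = ['Cyp26_G_R1', 'Cyp26_G_rep1', 'Time']

# Map each bracket-free species name to its initial value.
initial_values = {
    name.strip('[]'): value
    for name, value in specie_name_and_initial_values
}

# Dictionary membership is an exact match, so 'Cyp26_G_R1' does not
# pick up 'Cyp26_G_R1R2' the way a substring or regex search would.
matched = {key: initial_values[key] for key in all_keys if key in initial_values}
print(matched)
```

Each column of the data table could then be multiplied by matched[key] for the keys that were found.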
This is not a programming question but a question about IDLE. Is it possible to change the comment character from '#' to something else?
Here is the part that is not going to work:
array = []
y = array.append(str(words2)) <-- another part of the program
Hash = y.count(#) <-- part that won't work
print("There are", Hash, "#'s")
No, that isn't specific to IDLE; the '#' comment character is part of the language.
EDIT: I'm pretty sure you want to use
y.count('#') # note the quotes
Remember one of the strengths of Python is portability. Writing a program that would only work with your custom version of the interpreter would be removing the strengths of the language.
As a rule of thumb, any time you find yourself thinking that the solution is to rewrite part of the language, you might be heading in the wrong direction.
You need to call count on the string, not the list. Note also that append() returns None, so its result should not be assigned to y:
array = []
array.append(str(words2))  # another part of the program
Hash = array[0].count('#')  # note the quotes, and count is called on an element of the list, not the whole list
print("There are", Hash, "#'s")
For example, in the interpreter:
>>> l = []
>>> l.append('#$%^&###%$^^')
>>> l
['#$%^&###%$^^']
>>> l.count('#')
0
>>> l[0].count('#')
4
On a list, count looks for elements that are equal to its argument, and '#$%^&###%$^^' != '#'. You can use it on a list like so:
>>> l =[]
>>> l.append('#')
>>> l.append('?')
>>> l.append('#')
>>> l.append('<')
>>> l.count('#')
2
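If the list may hold several strings and the goal is the total number of '#' characters across all of them, one option is to sum str.count over the elements. The list contents here are made-up sample data, and the variable names are illustrative.

```python
# Hypothetical sample data standing in for the strings appended elsewhere.
words = ['#$%^&###%$^^', 'no hashes here', 'a#b#c']

# str.count counts occurrences inside each string; sum adds them up.
total = sum(word.count('#') for word in words)
print("There are", total, "#'s")
```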
I'm stuck on an exercise in Python where I need to convert a DNA sequence into its corresponding amino acids. So far, I have:
seq1 = "AATAGGCATAACTTCCTGTTCTGAACAGTTTGA"
for i in range(0, len(seq1), 3):
    print seq1[i:i+3]
I need to do this without using dictionaries, and I was going for replace, but it seems it's not advisable either. How can I achieve this?
And it's supposed to give something like this, for example:
>seq1_1_+
TQSLIVHLIY
>seq1_2_+
LNRSFTDSST
>seq1_3_+
SIADRSLTHLL
Update 2: OK, so I had to resort to functions, and as suggested, I have gotten the output I wanted. Now I have a series of functions which return a series of amino acid sequences, and I want to get an output file that looks like this, for example:
>seq1_1_+
iyyslrs-las-smrlssiv-m
>seq1_2_+
fiirydrs-ladrcgshrssk
>seq1_3_+
llfativas-lidaalidrl
>seq1_1_-
frrsmraasis-lativannkm
>seq1_2_-
lddr-ephrsas-lrs-riin
>seq1_3_-
-tidesridqlasydrse--m
For that, I'm using this:
for x in f1:
    x = x.strip()
    if x.count("seq"):
        f2.write((x)+("_1_+\n"))
        f2.write((x)+("_2_+\n"))
        f2.write((x)+("_3_+\n"))
        f2.write((x)+("_1_-\n"))
        f2.write((x)+("_2_-\n"))
        f2.write((x)+("_3_-\n"))
    else:
        f2.write((translate1(x))+("\n"))
        f2.write((translate2(x))+("\n"))
        f2.write((translate3(x))+("\n"))
        f2.write((translate1neg(x))+("\n"))
        f2.write((translate2neg(x))+("\n"))
        f2.write((translate3neg(x))+("\n"))
But instead of the expected output file shown above, I get this:
>seq1_1_+
>seq1_2_+
>seq1_3_+
>seq1_1_-
>seq1_2_-
>seq1_3_-
iyyslrs-las-smrlssiv-m
fiirydrs-ladrcgshrssk
llfativas-lidaalidrl
frrsmraasis-lativannkm
lddr-ephrsas-lrs-riin
-tidesridqlasydrse--m
So it's pretty much writing all the seq headers first and all the function results afterwards. I need to interleave them; the problem is how.
To translate you need a table of codons, so doing it without a dictionary or other data structure seems strange.
Maybe you can look into Biopython and see how they manage it?
You can also translate directly from the coding strand DNA sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You may take a look into
You cannot practically do this without either a function or a dictionary. Part 1, converting the sequence into three-character codons, is easy enough as you have already done it.
But Part 2, to convert these into amino acids, you will need to define a mapping, either:
mapping = {"NNN": "X", ...}
or
def mapping(codon):
    if codon in ("AGA", "AGG", "CGA", "CGC", "CGG", "CGT"):
        return "R"
    ...
or
for codon, acid in [("CAA", "Q"), ("CAG", "Q"), ...]:
I would favour the second of these as it has the least duplication (and therefore potential for error).
You got the amino acid output for the first codon only because you used 'return' inside the 'for loop'. Once the first amino acid is returned, the loop terminates, hence the second codon won't be tested at all.
You can create an empty list to keep the results for the translation of each codon, e.g.
aa = []
then, instead of using return, append the output to the list:
for x in range(0,len(seq1),3):
    nuc2= seq1[x:x+3]
    if nuc2 in ('GCT', 'GCC', 'GCA', 'GCG'):
        aa.append("a")
    elif nuc2 in ('TGT', 'TGC'):
        aa.append("c")
    ....
and finally, join the letters in the list and return the string from the function:
return "".join(aa)
or simply print it:
print("".join(aa))
You can convert the nucleotide bases into numbers (base 4) and then translate using an ordered string of amino acids:
def translate(seq,frame):
    BASES = 'ACGT'
    # standard code
    AA = 'KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF'
    # convert DNA sequence in numbers: A=0; C=1; G=2; T=3
    seqn = [str(BASES.find(i)) for i in seq.upper()]
    # list of all codons in all forward frames (i.e. 3 digit numbers in base 4)
    allframes = [''.join(seqn[x:x+3]) for x in range(len(seqn)) if x <= (len(seqn)-3)]
    # translate codons in frame taking aa in AA string using indexes (turned in base 10 from base 4) in allframes
    return ''.join([AA[int(i,4)] for i in allframes[(frame-1)::3]])
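For the interleaving problem in the question's second update, one possible approach is to remember the most recent header line and only write it out, once per frame, when the sequence line that follows it is translated. The sketch below uses a hypothetical helper name, assumes header lines start with ">", and uses stand-in translator functions in place of the real translate1, translate2 and so on, which are not shown in the question.

```python
def interleave_frames(lines, translators):
    """Return output lines with each header directly above its translation."""
    output = []
    header = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Remember the header; it is written later, once per frame.
            header = line
        elif line and header is not None:
            for label, translate in translators:
                output.append(header + "_" + label)
                output.append(translate(line))
    return output

# Stand-in translators for demonstration only; real ones would map
# codons to amino acids for each reading frame.
demo_translators = [
    ("1_+", lambda s: s.lower()),
    ("1_-", lambda s: s[::-1].lower()),
]

for out_line in interleave_frames([">seq1", "AATAGG"], demo_translators):
    print(out_line)
```

Writing each element of the returned list to f2 followed by "\n" would give the alternating header and sequence layout the question asks for.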
I have a list of email addresses with the following format:
name####email.com
But the number is not always present. For example: john45#email.com, bob#email.com, joe2#email.com, etc. I want to sort these names by the number, with those without a number coming first. I have come up with something that works, but being new to Python, I'm curious as to whether there's a better way of doing it. Here is my solution:
import re
def sortKey(name):
    m = re.search(r'(\d+)#', name)
    return int(m.expand(r'\1')) if m is not None else 0
names = [ ... a list of emails ... ]
for name in sorted(names, key = sortKey):
    print name
This is the only time in my script that I am ever using "sortKey", so I would prefer it to be a lambda function, but I'm not sure how to do that. I know this will work:
for name in sorted(names, key = lambda n: int(re.search(r'(\d+)#', n).expand(r'\1')) if re.search(r'(\d+)#', n) is not None else 0):
    print name
But I don't think I should need to call re.search twice to do this. What is the most elegant way of doing this in Python?
It is better to use re.findall: if no numbers are found, it returns an empty list, which sorts before a populated list. The sort key is a tuple of any numbers found (converted to ints) followed by the string itself:
emails = 'john45#email.com bob#email.com joe2#email.com'.split()
import re
print sorted(emails, key=lambda L: (map(int, re.findall('(\d+)#', L)), L))
# ['bob#email.com', 'joe2#email.com', 'john45#email.com']
And using john1 instead the output is: ['bob#email.com', 'john1#email.com', 'joe2#email.com'] which shows that although lexicographically after joe, the number has been taken into account first shifting john ahead.
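The snippet above relies on Python 2, where map returns a list. In Python 3, map returns an iterator, and iterators cannot be compared with each other, so the key would raise a TypeError. A minimal Python 3 adaptation of the same idea uses a list comprehension inside the key; the helper name number_then_name is an illustrative choice.

```python
import re

emails = 'john45#email.com bob#email.com joe2#email.com'.split()

def number_then_name(email):
    """Sort key: list of ints found before '#', then the address itself."""
    # An empty list (no number) sorts before any non-empty list.
    return ([int(digits) for digits in re.findall(r'(\d+)#', email)], email)

print(sorted(emails, key=number_then_name))
```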
There is a somewhat hackish way if you wanted to keep your existing method of using re.search in a one-liner (but yuck):
getattr(re.search('(\d+)#', s), 'groups', lambda: ('0',))()