error comparing sequences - string interpreted as number - python

I'm trying to do something similar to my previous question.
My purpose is to join all sequences that are equal, but this time instead of letters I have numbers.
The alignment file can be found here - phylip file
The problem is that when I try to do this:
records = list(SeqIO.parse(file(filename),'phylip'))
I get this error:
ValueError: Sequence 1 length 49, expected length 1001000000100000100000001000000000000000
I don't understand why, because this is the second file I'm creating and the first one worked perfectly.
Below is the code used to build the alignment file:
fl.write('\t')
fl.write(str(161))
fl.write('\t')
fl.write(str(size))
fl.write('\n')

for i in info_plex:
    if 'ref' in i[0]:
        i[0] = 'H37Rv'
    fl.write(str(i[0]))
    num = 10 - len(i[0])
    fl.write(' ' * num)
    for x in i[1:]:
        fl.write(str(x))
    fl.write('\n')
So it shouldn't interpret 1001000000100000100000001000000000000000 as a number, since it's a string.
Any ideas?
Thank you!

Your PHYLIP file is broken: the header says there are 161 sequences, but the file actually contains 166. After fixing that, the current version of Biopython loads your file fine. Maybe use len(info_plex) when creating the header line.
P.S. It would have been a good idea to include the version of Biopython in your question.
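For instance, a minimal sketch of that header fix, assuming fl, info_plex and size are the objects from the question's code:
# Derive the sequence count from the data instead of hard-coding 161,
# so the header always matches the number of rows actually written.
fl.write('\t')
fl.write(str(len(info_plex)))
fl.write('\t')
fl.write(str(size))
fl.write('\n')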

The code of Kevin Jacobs in your former question employs Biopython, which uses sequences of type Seq that
« are essentially strings of letters like AGTACACTGGT, which seems very
natural since this is the most common way that sequences are seen in
biological file formats. »
« There are two important differences between Seq objects and standard
Python strings. (...)
First of all, they have different methods. (...)
Secondly, the Seq object has an important
attribute, alphabet, which is an object describing what the individual
characters making up the sequence string “mean”, and how they should
be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a
protein sequence that happens to be rich in Alanines, Glycines,
Cysteines and Threonines?
The alphabet object is perhaps the important thing that makes the Seq
object more than just a string. The currently available alphabets for
Biopython are defined in the Bio.Alphabet module. »
http://biopython.org/DIST/docs/tutorial/Tutorial.html
The reason for your problem is simply that SeqIO.parse() can't create Seq objects from a file containing characters that no available alphabet is able to manage.
So you must use another method, rather than trying to force an ill-suited method onto a different problem. Here's my way:
from itertools import groupby
from operator import itemgetter
import re

regx = re.compile(r'^(\d+)[ \t]+([01]+)', re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))

print 'len(records) == %s\n' % len(records)
n = 0
for seq, equal in groupby(records, itemgetter(1)):
    ids = tuple(x[0] for x in equal)
    if len(ids) > 1:
        print '>%s :\n%s' % (','.join(ids), seq)
    else:
        n += 1
print '\nNumber of unique occurrences : %s' % n
result
len(records) == 165
>154995,168481 :
0000000000001000000010000100000001000000000000000
>123031,74772 :
0000000000001111000101100011100000100010000000000
>176816,178586,80016 :
0100000000000010010010000010110011100000000000000
>129575,45329 :
0100000000101101100000101110001000000100000000000
Number of unique occurrences : 156
Edit
I've understood MY problem: I had left 'fasta' instead of 'phylip' in my code.
'phylip' is a valid format argument for SeqIO.parse(), and with it the parsing works fine:
records = list(SeqIO.parse(file('pastie-2486250.rb'), 'phylip'))

def seq_getter(s):
    return str(s.seq)

records.sort(key=seq_getter)
ecr = []
for seq, equal in groupby(records, seq_getter):
    ids = tuple(s.id for s in equal)
    if len(ids) > 1:
        ecr.append('>%s\n%s' % (','.join(ids), seq))
print '\n'.join(ecr)
produces
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>154995,168481
0000000000001000000010000100000001000000000000000
>123031,74772
0000000000001111000101100011100000100010000000000
>176816,178586,80016
0100000000000010010010000010110011100000000000000
>129575,45329
0100000000101101100000101110001000000100000000000
There is an incredible run of , characters before the interesting data; I wonder what it is.
But my code isn't useless. See:
from time import clock
from itertools import groupby
from operator import itemgetter
import re
from Bio import SeqIO

def seq_getter(s):
    return str(s.seq)

t0 = clock()
with open('pastie-2486250.rb') as f:
    records = list(SeqIO.parse(f, 'phylip'))
records.sort(key=seq_getter)
print clock() - t0, 'seconds'

t0 = clock()
regx = re.compile(r'^(\d+)[ \t]+([01]+)', re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))
print clock() - t0, 'seconds'
result
12.4826178327 seconds
0.228640588399 seconds
ratio = 55!

Related

How to split a string into equal sized parts?

I have a string that contains a sequence of nucleotides. The string is 1191 nucleotides long.
How do I print the sequence in a format in which each line has only 100 nucleotides? Right now I have it hard-coded, but I would like it to work for any string of nucleotides. Here is the code I have now:
def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    # how do I make sure to only have 100 nucleotides per line?
    print(Sequence[0:100])
    print(Sequence[100:200])
    print(Sequence[200:300])
    print(Sequence[300:400])
    print(Sequence[400:500])
    print(Sequence[500:600])
    print(Sequence[600:700])
    print(Sequence[700:800])
    print(Sequence[800:900])
    print(Sequence[900:1000])
    print(Sequence[1000:1100])
    print(Sequence[1100:1191])

printinfasta(SeqName, Sequence, SeqDescription)
Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
You can use textwrap.wrap to split long strings into a list of strings:
import textwrap
seq = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
print('\n'.join(textwrap.wrap(seq, width=100)))
You can use itertools.zip_longest and some iterator magic to get this in one line:
from itertools import zip_longest
sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
output = [''.join(filter(None, s)) for s in zip_longest(*([iter(sequence)]*100))]
Or:
for s in zip_longest(*([iter(sequence)]*100)):
    print(''.join(filter(None, s)))
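The "iterator magic" is that [iter(sequence)]*100 is a list of 100 references to the same iterator, so zip_longest draws 100 consecutive characters for each tuple. A scaled-down illustration of the same trick with a chunk size of 3:
from itertools import zip_longest

# One iterator, referenced 3 times: each output tuple consumes 3 consecutive items.
it = iter("ABCDEFGH")
print(list(zip_longest(*[it] * 3)))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'H', None)]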
A possible solution is to use the re module.
import re

def splitstring(strg, leng):
    chunks = re.findall('.{1,%d}' % leng, strg)
    for i in chunks:
        print(i)

splitstring(strg=seq, leng=100)
The pattern '.{1,100}' matches consecutive runs of at most 100 characters, so the final chunk is simply shorter when the length is not a multiple of 100.
You can use a helper function based on itertools.zip_longest. The helper function has been designed to (also) handle cases where the sequence isn't an exact multiple of the size of the equal parts (the last group will have fewer elements than those before it).
from itertools import zip_longest

def grouper(n, iterable):
    """ s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ... """
    FILLER = object()  # Sentinel value that couldn't be in the data.
    for result in zip_longest(*[iter(iterable)]*n, fillvalue=FILLER):
        yield ''.join(v for v in result if v is not FILLER)

def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    for group in grouper(100, Sequence):
        print(group)

Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
printinfasta('Name', Sequence, 'Description')
Sample output:
Name Description
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTA
AATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCC
TAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTT
TGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACAT
TTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT
I assume that your sequence is in FASTA format. If this is the case, you can use any of a number of bioinformatics packages that provide FASTA sequence wrapping utilities. For example, you can use FASTX-Toolkit and its FASTA Formatter command-line utility to wrap FASTA sequences, for example to a max of 100 nucleotides per line:
fasta_formatter -i INFILE -o OUTFILE -w 100
You can install FASTX-Toolkit package using conda, for example:
conda install fastx_toolkit
or
conda create -n fastx_toolkit fastx_toolkit
Note that if you end up writing the (simple) code to wrap FASTA sequences from scratch, remember that the header lines (the lines starting with >) should not be wrapped. Wrap only the sequence lines.
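A minimal sketch of that from-scratch approach (the file names in.fasta and out.fasta are just placeholders):
import textwrap

# Wrap sequence lines to 100 characters; pass header lines through unchanged.
with open('in.fasta') as src, open('out.fasta', 'w') as dst:
    seq_parts = []
    def flush():
        if seq_parts:
            dst.write('\n'.join(textwrap.wrap(''.join(seq_parts), 100)) + '\n')
            del seq_parts[:]
    for line in src:
        line = line.rstrip('\n')
        if line.startswith('>'):
            flush()                 # finish the previous record's sequence
            dst.write(line + '\n')  # headers are never wrapped
        else:
            seq_parts.append(line)
    flush()                         # flush the last record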
SEE ALSO:
Convert single line fasta to multi line fasta
Package cytoolz (installable using pip install cytoolz) provides a function partition_all that can be used here:
#!/usr/bin/env python3
from cytoolz import partition_all

def printinfasta(name, seq, descr):
    header = f">{name} {descr}"
    print(header)
    print(*map("".join, partition_all(100, seq)), sep="\n")

printinfasta("test", 468 * "ACGTGA", "this is a test")
partition_all(100, seq) generates tuples of 100 letters each taken from seq, plus a final shorter one if the number of letters is not a multiple of 100.
The map("".join, ...) joins the letters of each such tuple into a single string.
The * in front of the map unpacks its results as separate arguments to print.
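The same pipeline scaled down to a chunk size of 3 on a 7-letter string, for illustration:
from cytoolz import partition_all

# partition_all yields ('A','B','C'), ('D','E','F'), ('G',);
# each tuple is joined into a string and printed on its own line.
print(*map("".join, partition_all(3, "ABCDEFG")), sep="\n")
# ABC
# DEF
# G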

How can I effectively pull out human readable strings/terms from code automatically?

I'm trying to determine the most common words, or "terms" (I think) as I iterate over many different files.
Example - For this line of code found in a file:
for w in sorted(strings, key=strings.get, reverse=True):
I'd want these unique strings/terms returned to my dictionary as keys:
for
w
in
sorted
strings
key
strings
get
reverse
True
However, I want this code to be tunable so that I can return strings with periods or other characters between them as well, because I just don't know what makes sense yet until I run the script and count up the "terms" a few times:
strings.get
How can I approach this problem? It would help to understand how I can do this one line at a time so I can loop it as I read my file's lines in. I've got the basic logic down, but I'm currently tallying by unique line instead of by "term":
strings = dict()
fname = '/tmp/bigfile.txt'

with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1

for w in sorted(strings, key=strings.get, reverse=True):
    print str(w).rstrip() + " : " + str(strings[w])
(Yes I used code from my little snippet here as the example at the top.)
If the only Python token you want to keep together is the object.attr construct, then all the tokens you are interested in would fit the regular expression
\w+\.?\w*
which basically means "one or more alphanumeric characters (including _), optionally followed by a . and then some more characters".
Note that this would also match number literals like 42 or 7.6, but those are easy enough to filter out afterwards.
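For instance, a small sketch of that filtering step (the sample input line comes from the question):
import re

token_re = re.compile(r"\w+\.?\w*")
number_re = re.compile(r"\d+\.?\d*$")  # pure number literals like 42 or 7.6

line = "for w in sorted(strings, key=strings.get, reverse=True):"
tokens = [t for t in token_re.findall(line) if not number_re.match(t)]
print(tokens)
# ['for', 'w', 'in', 'sorted', 'strings', 'key', 'strings.get', 'reverse', 'True']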
Then you can use collections.Counter to do the actual counting for you:
import collections
import re

pattern = re.compile(r"\w+\.?\w*")

# here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))

for token, count in tokens.most_common(5):  # show only the top 5
    print(token, count)
Running Python 3.6.0a1, the output is this:
self 226
def 173
return 170
self.data 129
if 102
which makes sense for the collections module, since it is full of classes that use self and define methods. It also shows that the pattern does capture self.data, which fits the construct you are interested in.

Filtering a FASTA file based on restriction-sequence with BioPython

I have a fasta file. From that file, I need to get only the sequences containing 'CCNNNGG' (where each 'N' represents a random nucleotide) and put them in a new fasta file.
Example (it should output the first sequence):
m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3312_3597
CGCGGCATCGAATTAATACGACTCACTATAGGTTTTTTTATTG*********CCTACGG***********GTATTTTCAGTTAGATTCTTTCTTCTTAGAGGGTACAGAGAAAGGGAGAAAATAGCTACAGACATGGGAGTGAAAGGTAGGAAGAAGAGCGAAGCAGACATTATTCA
m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3708_4657
CAACGGTTTTGCCACAAGATCAGGAACATAAGTCACCAGACTCAATTCATCCCCATAAGACCTCGGACCTCTCAATCCTCGAATTAGGATGTTCTCGTACGGTCTATCAGTATATAAACCTGACATACTATAAAAAAGTATACCAT
TCTTATCATGTACAGTAGGGTACAGTAGG
(*s added for highlighting)
And my code:
from Bio import SeqIO

my_sequences = []
for record in SeqIO.parse(open("example.fa", "rU"), "fasta"):
    if "CCTACGG" in record.seq:  # works fine with the literal CCTACGG
        my_sequences.append(record)

output_handle = open("my_seqs.fasta", "w")
SeqIO.write(my_sequences, output_handle, "fasta")
output_handle.close()
My problem is that I don't know how to express the random nucleotides: instead of writing "CCTACGG" after the if, I want to use 'CCNNNGG', where each N stands for any nucleotide ('C' or 'T' or 'G' or 'A').
You can use regular expressions to do this, via Python's re module:
import re

pattern = 'CCNNNGG'
regex = re.compile(pattern.replace('N', '[ACGT]'))

for record in SeqIO.parse(...):
    if re.search(regex, str(record.seq)) is not None:
        my_sequences.append(record)
This replaces every 'N' in your pattern with '[ACGT]', which will match any one of those four characters, then searches for the resulting pattern in each sequence (converted to a plain string, since re works on strings).
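For reference, the substitution produces this concrete pattern:
print('CCNNNGG'.replace('N', '[ACGT]'))  # CC[ACGT][ACGT][ACGT]GG
which is equivalent to the more compact regex CC[ACGT]{3}GG.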
Also, note that your examples aren't very good - the second one also matches that pattern (it contains 'CCCATGG') - see the results!

best way to compare sequence of letters inside file?

I have a file that has lots of sequences of letters.
Some of these sequences might be equal, so I would like to compare them, all against all.
I'm doing something like this, but it isn't exactly what I wanted:
for line in fl:
    line = line.split()
    for elem in line:
        if '>' in elem:
            pass
        else:
            for el in line:
                if elem == el:
                    print elem, el
example of the file:
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
So what I want is to know whether any sequence is totally equal to sequence 1, or to 2, and so on.
If the goal is simply to group like sequences together, then sorting the data will do the trick. Here is a solution that uses BioPython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:
from itertools import groupby
from Bio import SeqIO

records = list(SeqIO.parse(file('spoo.fa'), 'fasta'))

def seq_getter(s):
    return str(s.seq)

records.sort(key=seq_getter)
for seq, equal in groupby(records, seq_getter):
    ids = ','.join(s.id for s in equal)
    print '>%s' % ids
    print seq
Output:
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
In general for this type of work you may want to investigate Biopython, which has lots of functionality for parsing and otherwise dealing with sequences.
However, your particular problem can be solved using a dict, an example of which Manoj has given you.
Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare the hashes of the sequences. Python offers two built-in data types that use hashing: set and dict. It's best to use dict here, as we can store the line numbers of all the matches.
I've assumed the file has identifiers and sequences on alternate lines, so if we split the file text on newlines we can take one line as the id and the next as the sequence to match.
We then use a dict with the sequence as the key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict; if the key hasn't been used before, defaultdict will automatically create a value for us, in this case an empty list.
So when we've finished working through the file, the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.
from collections import defaultdict

lines = filetext.split("\n")
sequences = defaultdict(list)

while lines:
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)

results = [match for match in sequences.values() if len(match) > 1]
print results
The following script will return a count of sequences. It returns a dictionary with the individual, distinct sequences as keys and the line numbers where these sequences occur as values.
#!/usr/bin/python
import sys
from collections import defaultdict

def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result

if __name__ == '__main__':
    filename = sys.argv[1]
    for sequence, occurrences in count_sequences(filename).iteritems():
        print "%s: %s, found in %s" % (sequence, len(occurrences), occurrences)
Sample output:
etc#etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']
Update
Changed code to use defaultdict and a for loop. Thanks @KennyTM.
Update 2
Changed code to use append rather than +. Thanks @Dave Webb.

How to refactor this python code block to be more efficient

This code block works - it loops through a file that has a repeating number of sets of data
and extracts out each of the 5 pieces of information for each set.
But I know that the current factoring is not as efficient as it can be, since it loops through each key for each line found.
I'm wondering if some Python gurus can offer a better way to do this more efficiently.
def parse_params(num_of_params, lines):
    for line in lines:
        for p in range(1, num_of_params + 1):
            nam = "model.paramName " + str(p) + " "
            par = "model.paramValue " + str(p) + " "
            opt = "model.optimizeParam " + str(p) + " "
            low = "model.paramLowerBound " + str(p) + " "
            upp = "model.paramUpperBound " + str(p) + " "
            keys = [nam, par, opt, low, upp]
            for key in keys:
                if key in line:
                    a, val = line.split(key)
                    if key == nam: names.append(val.rstrip())
                    if key == par: params.append(val.rstrip())
                    if key == opt: optimize.append(val.rstrip())
                    if key == upp: upper.append(val.rstrip())
                    if key == low: lower.append(val.rstrip())
    print "Names = ", names
    print "Params = ", params
    print "Optimize = ", optimize
    print "Upper = ", upper
    print "Lower = ", lower
Though this doesn't answer your question (other answers are getting at that), something that has helped me a lot in doing things similar to what you're doing is list comprehensions. They allow you to build lists in a concise and (I think) easy-to-read way.
For instance, the code below builds a 2-dimensional array with the values you're trying to get at. some_funct here would be a little regex, if I were doing it, that takes the index of the last space in the key as a parameter, looks ahead to collect the value you're trying to get (the value which corresponds to the key currently being looked at), and appends it to the correct index in the seen_keys 2D array.
Wordy, yes, but if you get list comprehensions and you're able to construct the regex to do that, you've got a nice, concise solution.
keys = ["model.paramName ","model.paramValue ","model.optimizeParam ""model.paramLowerBound ","model.paramUpperBound "]
for line in lines:
seen_keys = [[],[],[],[],[]]
[seen_keys[keys.index(k)].some_funct(line.index(k) for k in keys if k in line]
It's not totally easy to see the expected format. From what I can see, the format is like:
lines = [
    "model.paramName 1 foo",
    "model.paramValue 2 bar",
    "model.optimizeParam 3 bat",
    "model.paramLowerBound 4 zip",
    "model.paramUpperBound 5 ech",
    "model.paramName 1 foo2",
    "model.paramValue 2 bar2",
    "model.optimizeParam 3 bat2",
    "model.paramLowerBound 4 zip2",
    "model.paramUpperBound 5 ech2",
]
I don't see the above code working if there is more than one value in each line, which means the digit is not really significant unless I'm missing something. In that case this works very easily:
import re

def parse_params(num_of_params, lines):
    key_to_collection = {
        "model.paramName": names,
        "model.paramValue": params,
        "model.optimizeParam": optimize,
        "model.paramLowerBound": lower,
        "model.paramUpperBound": upper,
    }
    reg = re.compile(r'(.+?) (\d) (.+)')
    for line in lines:
        m = reg.match(line)
        key, digit, value = m.group(1, 2, 3)
        key_to_collection[key].append(value)
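A quick usage sketch, assuming names, params, optimize, upper and lower are the module-level lists from the question:
names, params, optimize, upper, lower = [], [], [], [], []
parse_params(2, ["model.paramName 1 foo", "model.paramValue 1 bar"])
print names   # ['foo']
print params  # ['bar']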
It's not entirely obvious from your code, but it looks like each line can have one "hit" at most; if that's indeed the case, then something like:
import re

def parse_params(num_of_params, lines):
    sn = 'Names Params Optimize Upper Lower'.split()
    ks = '''paramName paramValue optimizeParam
            paramUpperBound paramLowerBound'''.split()
    vals = dict((k, []) for k in ks)
    are = re.compile(r'model\.(%s) (\d+) (.*)' % '|'.join(ks))
    for line in lines:
        mo = are.search(line)
        if not mo:
            continue
        p = int(mo.group(2))
        if p < 1 or p > num_of_params:
            continue
        vals[mo.group(1)].append(mo.group(3).rstrip())
    for k, s in zip(ks, sn):
        print '%-8s =' % s,
        print vals[k]
might work -- I exercised it with a little code as follows:
if __name__ == '__main__':
    lines = '''model.paramUpperBound 1 ZAP
model.paramLowerBound 1 zap
model.paramUpperBound 5 nope'''.splitlines()
    parse_params(2, lines)
and it emits
Names    = []
Params   = []
Optimize = []
Upper    = ['ZAP']
Lower    = ['zap']
which I think is what you want (if some details must differ, please indicate exactly what they are and let's see if we can fix it).
The two key ideas are: use a dict instead of lots of ifs; use a re to match "any of the following possibilities" with parenthesized groups in the re's pattern to catch the bits of interest (the keyword after model., the integer number after that, and the "value" which is the rest of the line) instead of lots of if x in y checks and string manipulation.
There is a lot of duplication there, and if you ever add another key or param, you're going to have to add it in many places, which leaves you ripe for errors. What you want to do is pare down all of the places you have repeated things and use some sort of data model, such as a dict.
Some others have provided some excellent examples, so I'll just leave my answer here to give you something to think about.
Are you sure that parse_params is the bottleneck? Have you profiled your app?
import re
from collections import defaultdict

names = ("paramName paramValue optimizeParam "
         "paramLowerBound paramUpperBound").split()

stmt_regex = re.compile(r'model\.(%s)\s+(\d+)\s+(.*)' % '|'.join(names))

def parse_params(num_of_params, lines):
    stmts = defaultdict(list)
    for m in (stmt_regex.match(s) for s in lines):
        if m and 1 <= int(m.group(2)) <= num_of_params:
            stmts[m.group(1)].append(m.group(3).rstrip())
    for k, v in stmts.iteritems():
        print "%s = %s" % (k, ' '.join(v))
The code given in the OP does multiple tests per line to try to match against the expected set of values, each of which is being constructed on the fly. Rather than construct paramValue1, paramValue2, etc. for each line, we can use a regular expression to try to do the matching in a cheaper (and more robust) manner.
Here's my code snippet, drawing from some ideas that have already been posted. This lets you add a new keyword to the key_to_collection dictionary and not have to change anything else.
import re

def parse_params(num_of_params, lines):
    pattern = re.compile(r"""
        model\.
        ([A-Za-z]+)   # keyword
        [ ]+          # whitespace
        (\d+)         # index to keyword
        [ ]+          # whitespace
        (.+)          # value
        """, re.VERBOSE)
    key_to_collection = {
        "paramName": names,
        "paramValue": params,
        "optimizeParam": optimize,
        "paramLowerBound": lower,
        "paramUpperBound": upper,
    }
    for line in lines:
        match = pattern.match(line)
        if not match:
            print "Invalid line: " + line
        elif match.group(1) not in key_to_collection:
            print "Invalid key: " + line
        # Not sure if you really care about enforcing this
        elif int(match.group(2)) > num_of_params:
            print "Invalid param: " + line
        else:
            key_to_collection[match.group(1)].append(match.group(3))
Full disclosure: I have not compiled/tested this.
It can certainly be made more efficient. But, to be honest, unless this function is called hundreds of times a second, or works on thousands of lines, is it necessary?
I would be more concerned about making it clear what is happening... currently, I'm far from clear on that aspect.
Just eyeballing it, the input seems to look like this:
model.paramName 1 A model.paramValue 1 B model.optimizeParam 1 C model.paramLowerBound 1 D model.paramUpperBound 1 E model.paramName 2 F model.paramValue 2 G model.optimizeParam 2 H model.paramLowerBound 2 I model.paramUpperBound 2 J
And your desired output seems to be something like:
Names = AF
Params = BG
etc...
Now, since my input certainly doesn't match yours, the output is likely off too, but I think I have the gist.
There are a few points. First, does it matter how many parameters are passed to the function? For example, if the input has two sets of parameters, do I just want to read both, or is it necessary to allow the function to only read one? For example, your code allows me to call parse_params(1,1) and have it only read parameters ending in a 1 from the same input. If that's not actually a requirement, you can skip a large chunk of the code.
Second, is it important to ONLY read the given parameters? If I, for example, have a parameter called 'paramFoo', is it bad if I read it? You can also simplify the procedure by just grabbing all parameters regardless of their name, and extracting their value.
import re

def parse_params(input):
    parameter_list = {}
    param = re.compile(r"model\.([^ ]+) [0-9]+ ([^ ]+)")
    each_parameter = param.finditer(input)
    for match in each_parameter:
        key = match.group(1)
        value = match.group(2)
        if key not in parameter_list:
            parameter_list[key] = []
        parameter_list[key].append(value)
    return parameter_list
The output, in this instance, will be something like this:
{'paramName': ['A', 'F'], 'paramValue': ['B', 'G'], 'optimizeParam': ['C', 'H'], ...}
Notes: I don't know Python well, I'm a Ruby guy, so my syntax may be off. Apologies.
