Parsing sequences from a FASTA file in Python

I have a text file:
>name_1
data_1
>name_2
data_2
>name_3
data_3
>name_4
data_4
>name_5
data_5
I want to store the headers (name_1, name_2, ...) in one list and the data (data_1, data_2, ...) in another list in a Python program.
def parse_fasta_file(fasta):
    desc=[]
    seq=[]
    seq_strings = fasta.strip().split('>')
    for s in seq_strings:
        if len(s):
            sects = s.split()
            k = sects[0]
            v = ''.join(sects[1:])
            desc.append(k)
            seq.append(v)

for l in sys.stdin:
    data = open('D:\python\input.txt').read().strip()
    parse_fasta_file(data)
    print seq
This is the code I have tried, but I am not able to get the answer.

The most fundamental error is trying to access a variable outside of its scope.
def function(stuff):
    seq = whatever

function('data')
print seq  ############ error
You cannot access seq outside of function. The usual way to do this is to have function return a value, and capture it in a variable within the caller.
def function(stuff):
    seq = whatever
    return seq

s = function('data')
print s
(I have deliberately used different variable names inside the function and outside. Inside function you cannot access s or data, and outside, you cannot access stuff or seq. Incidentally, it would be quite okay, but confusing to a beginner, to use a different variable with the same name seq in the mainline code.)
With that out of the way, we can attempt to write a function which returns a list of sequences and a list of descriptions for them.
def parse_fasta(lines):
    descs = []
    seqs = []
    data = ''
    for line in lines:
        if line.startswith('>'):
            if data:   # have collected a sequence, push to seqs
                seqs.append(data)
                data = ''
            descs.append(line[1:])  # Trim '>' from beginning
        else:
            data += line.rstrip('\r\n')
    # there will be yet one more to push when we run out
    seqs.append(data)
    return descs, seqs
This isn't particularly elegant, but should get you started. A better design would be to return a list of (description, data) tuples where the description and its data are closely coupled together.
descriptions, sequences = parse_fasta(open('file', 'r').read().split('\n'))
The sys.stdin loop in your code does not appear to do anything useful.
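For instance, the tuple-based design mentioned above could look roughly like this (a sketch only, reusing the same loop structure, not a tested drop-in):

def parse_fasta_pairs(lines):
    pairs = []
    desc = None
    data = ''
    for line in lines:
        if line.startswith('>'):
            if desc is not None:          # close off the previous record
                pairs.append((desc, data))
            desc = line[1:].rstrip('\r\n')
            data = ''
        else:
            data += line.rstrip('\r\n')
    if desc is not None:                  # push the final record
        pairs.append((desc, data))
    return pairs

# usage:
# for description, sequence in parse_fasta_pairs(open('file')):
#     print description, sequence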

Related

Python3 dictionary values being overwritten

I'm having a problem with a dictionary. I'm using Python 3. I'm sure there's something easy that I'm just not seeing.
I'm reading lines from a file to create a dictionary. The first 3 characters of each line are used as keys (they are unique). From there, I create a list from the information in the rest of the line: each 4 characters make up a member of the list. Once I've created the list, I write to the dictionary, with the list being the value and the first three characters of the line being the key.
The problem is, each time I add a new key:value pair to the dictionary, it seems to overlay (or update) the values in the previously written dictionary entries. The keys are fine, just the values are changed. So, in the end, all of the keys have a value equivalent to the list made from the last line in the file.
I hope this is clear. Any thoughts would be greatly appreciated.
A snippet of the code is below
formatDict = dict()
sectionList = list()
for usableLine in formatFileHandle:
    lineLen = len(usableLine)
    section = usableLine[:3]
    x = 3
    sectionList.clear()
    while x < lineLen:
        sectionList.append(usableLine[x:x+4])
        x += 4
    formatDict[section] = sectionList
for k, v in formatDict.items():
    print ("for key= ", k, "value =", v)
formatFileHandle.close()
You always clear, then append to, and then insert the very same sectionList object, so every dictionary entry ends up pointing at that one list; that's why the values appear to be overwritten - the program is doing exactly what you told it to.
Always remember: In Python assignment never makes a copy!
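A tiny illustration of that rule (my own example, not from the original question):

a = [1, 2]
b = a          # b is just another name for the same list object
b.append(3)
print(a)       # [1, 2, 3] - "a" changed too, because there is no copy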
Simple fix
Just insert a copy:
formatDict[section] = sectionList.copy() # changed here
Instead of inserting a reference:
formatDict[section] = sectionList
Complicated fix
There are lots of things going on here, and you could make the code "better" by using functions for subtasks such as the grouping. Files should also be opened with "with" so that the file is closed automatically even if an exception occurs, and "while" loops whose end is known in advance should be avoided.
Personally I would use code like this:
def groups(seq, width):
    """Group a sequence (seq) into width-sized blocks. The last block may be shorter."""
    length = len(seq)
    for i in range(0, length, width):  # range supports a step argument!
        yield seq[i:i+width]

# Printing the dictionary could be useful in other places as well -> so
# I also created a function for this.
def print_dict_line_by_line(dct):
    """Print a dictionary with each key-value pair on its own line."""
    for key, value in dct.items():
        print("for key =", key, "value =", value)

def mytask(filename):
    formatDict = {}
    with open(filename) as formatFileHandle:
        # I don't "strip" each line (remove leading and trailing whitespace/newlines)
        # but if you need that you could also use:
        #     for usableLine in (line.strip() for line in formatFileHandle):
        # instead.
        for usableLine in formatFileHandle:
            section = usableLine[:3]
            sectionList = list(groups(usableLine[3:], 4))  # groups of four characters
            formatDict[section] = sectionList
        # upon exiting the "with" scope the file is closed automatically!
    print_dict_line_by_line(formatDict)

if __name__ == '__main__':
    mytask('insert your filename here')
You could simplify your code here by using a with statement to auto close the file and chunk the remainder of the line into groups of four, avoiding the re-use of a single list.
from itertools import islice

with open('somefile') as fin:
    stripped = (line.strip() for line in fin)
    format_dict = {
        line[:3]: list(iter(lambda it=iter(line[3:]): ''.join(islice(it, 4)), ''))
        for line in stripped
    }

for key, value in format_dict.items():
    print('key=', key, 'value=', value)
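The iter(callable, sentinel) construct above keeps calling the lambda - which joins the next four characters of the line - until it returns the empty string sentinel. If that reads as too clever, the same 4-character chunking could be written as a plain helper (a hypothetical alternative, not part of the original answer):

def chunks_of_four(s):
    """Split a string into 4-character pieces; the last piece may be shorter."""
    return [s[i:i+4] for i in range(0, len(s), 4)]

# the dictionary comprehension above could then read:
# format_dict = {line[:3]: chunks_of_four(line[3:]) for line in stripped}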

Python Dynamic Data Structures

I am going to read the lines of a given text file and select several chunks of data whose format is (int, int\n). Every time the number of lines is different, so I need a dynamically sized data structure in Python. I would also like to store those chunks in a 2D data structure. If you are familiar with MATLAB programming, I'd like to have something like a structure A{n}, where n = number of chunks of data and each chunk includes several lines of the data mentioned above.
Which type of data structure would you recommend, and how would I implement it?
i.e. A{0} = ([1,2],[2,3],[3,4]) A{1} = ([1,1],[2,2],[5,5],[7,4]) and so on.
Thank you
A Python list can contain lists as well as any other data type.
l = []
l.append(2)      # l is now [2]
l.extend([3,2])  # l is now [2, 3, 2]
l.append([4,5])  # l is now [2, 3, 2, [4, 5]]
list.append adds whatever it is given as a single element at the end of the list, while list.extend appends each element of the given iterable, making them the tail of the list.
I guess your required list would appear somewhat like this:
l = [[[1,2],[2,3],[3,4]], [[1,1],[2,2],[5,5],[7,4]]]
PS: Here's a link to get you jump start learning python
https://learnxinyminutes.com/docs/python/
Just keep in mind that if you are reading data from a text file, the values are strings, so you need to use int() to convert them to integers.
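As a rough sketch of how the chunks themselves could be read (assuming, hypothetically, that chunks are separated by blank lines and each line looks like "1,2" - adjust the delimiters to your actual file format):

A = []          # list of chunks; each chunk is a list of [int, int] pairs
chunk = []
with open('data.txt') as f:
    for line in f:
        line = line.strip()
        if not line:                 # a blank line ends the current chunk
            if chunk:
                A.append(chunk)
                chunk = []
        else:
            x, y = line.split(',')
            chunk.append([int(x), int(y)])
if chunk:                            # don't lose the last chunk
    A.append(chunk)

# A[0] might now be [[1, 2], [2, 3], [3, 4]], A[1] the next chunk, and so on.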
The issue was resolved with two steps of appending to the list:
import numpy as np

filename = 'data.txt'
f = open(filename)
str2 = '.PEN_DOWN\n'   # marks the start of a chunk (these lines are simply skipped)
str3 = '.PEN_UP\n'     # marks the end of a chunk
A = []   # lines of the current chunk
B = []   # list of all chunks
for line in f.readlines():
    if line == str3:              # end of chunk: store it and start a new one
        B.append(A)
        A = []
    elif line[0].isdigit():       # data line: keep it without the trailing newline
        A.append(line[:-1])
B.append(A)                       # don't forget the last chunk
f.close()
print(np.shape(A))
print(np.shape(B))

enumerate column headers in CSV that belong to the same tag (key) in python

I am using the following set of generators to parse XML into CSV:
import xml.etree.cElementTree as ElementTree
from xml.etree.ElementTree import XMLParser
import csv

def flatten_list(aList, prefix=''):
    for i, element in enumerate(aList, 1):
        eprefix = "{}{}".format(prefix, i)
        if element:
            # treat like dict
            if len(element) == 1 or element[0].tag != element[1].tag:
                yield from flatten_dict(element, eprefix)
            # treat like list
            elif element[0].tag == element[1].tag:
                yield from flatten_list(element, eprefix)
        elif element.text:
            text = element.text.strip()
            if text:
                yield eprefix[:].rstrip('.'), element.text

def flatten_dict(parent_element, prefix=''):
    prefix = prefix + parent_element.tag
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = element.tag
        if element:
            # treat like dict - we assume that if the first two tags
            # in a series are different, then they are all different.
            if len(element) == 1 or element[0].tag != element[1].tag:
                yield from flatten_dict(element, prefix=prefix)
            # treat like list - we assume that if the first two tags
            # in a series are the same, then the rest are the same.
            else:
                # here, we put the list in dictionary; the key is the
                # tag name the list elements all share in common, and
                # the value is the list itself
                yield from flatten_list(element, prefix=eprefix)
            # if the tag has attributes, add those to the dict
            if element.items():
                for k, v in element.items():
                    yield eprefix+k
        # this assumes that if you've got an attribute in a tag,
        # you won't be having any text. This may or may not be a
        # good idea -- time will tell. It works for the way we are
        # currently doing XML configuration files...
        elif element.items():
            for k, v in element.items():
                yield eprefix+k
        # finally, if there are no child tags and no attributes, extract
        # the text
        else:
            yield eprefix, element.text

def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend((v,))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend(' ' for i in range(len(c), m))
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows

def main():
    with open('2-Response_duplicate.xml', 'r', encoding='utf-8') as f:
        xml_string = f.read()
    xml_string = xml_string.replace('&', '')  # optional, to remove ampersands.
    root = ElementTree.XML(xml_string)
    # for key, value in flatten_dict(root):
    #     key = key.rstrip('.').rsplit('.', 1)[-1]
    #     print(key, value)
    writer = csv.writer(open("try5.csv", 'wt'))
    writer.writerows(makerows(flatten_dict(root)))

if __name__ == "__main__":
    main()
One column of the CSV, when opened in Excel, looks like this:
ObjectGuid
2adeb916-cc43-4d73-8c90-579dd4aa050a
2e77c588-56e5-4f3f-b990-548b89c09acb
c8743bdd-04a6-4635-aedd-684a153f02f0
1cdc3d86-f9f4-4a22-81e1-2ecc20f5e558
2c19d69b-26d3-4df0-8df4-8e293201656f
6d235c85-6a3e-4cb3-9a28-9c37355c02db
c34e05de-0b0c-44ee-8572-c8efaea4a5ee
9b0fe8f5-8ec4-4f13-b797-961036f92f19
1d43d35f-61ef-4df2-bbd9-30bf014f7e10
9cb132e8-bc69-4e4f-8f29-c1f503b50018
24fd77da-030c-4cb7-94f7-040b165191ce
0a949d4f-4f4c-467e-b0a0-40c16fc95a79
801d3091-c28e-44d2-b9bd-3bad99b32547
7f355633-426d-464b-bab9-6a294e95c5d5
This is due to the fact that there are 14 tags with name ObjectGuid. For example, one of these tags looks like this:
<ObjectGuid>2adeb916-cc43-4d73-8c90-579dd4aa050a</ObjectGuid>
My question: is there an efficient method to enumerate the headers (the keys), so that each repeated key is numbered together with its corresponding value (the text in the XML data structure)? It would be displayed in Excel as follows:
ObjectGuid_1 ObjectGuid_2 ObjectGuid_3 etc.
Please let me know if there is any other information that you need from me (such as sample XML). Thank you for your help.
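One possible approach (a sketch of my own, not a tested solution) is to rename repeated keys on their way into makerows(), so that the n-th occurrence of ObjectGuid becomes ObjectGuid_n and therefore gets its own column:

from collections import Counter

def number_duplicate_keys(pairs):
    """Rename repeated keys as key_1, key_2, ... so each occurrence becomes its own column."""
    seen = Counter()
    for key, value in pairs:
        seen[key] += 1
        yield "{}_{}".format(key, seen[key]), value

# hypothetical usage inside main():
# writer.writerows(makerows(number_duplicate_keys(flatten_dict(root))))

Note that this numbers every key, including ones that occur only once; a first pass over the pairs could be added if that matters.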
It is a mistake to add an element, attribute, or annotative descriptor to the data set itself for the purpose of identity. Normalizing the data should only be done if you own that data and can guarantee that doing so will not have any negative effect on other consumers (ones relying on attribute order to manipulate the DOM). Also, what is the point of using a dict or nested dicts if the efficiency of the hash-table lookup is taken right back by making O(n) checks for this new attribute? The point of hashing is random lookup. If the data is simply structured as (key, value) pairs, which makes sense here, why not use some other contiguous data structure but treat it like a dictionary - say, a named tuple?
A second solution, if you want to add additional state, is to wrap your generator in a class.
class order:
    def __init__(self, lines):
        self.lines = lines
        self.order = []
    def __iter__(self):
        for i, line in enumerate(self.lines, 1):
            self.order.append((i, line))
            yield line

with open('somefile.csv') as f:
    lines = order(f)
Is messing with the data a harmless conversion? For example, suppose we create a conversion mapping (see below). That's fine - until one of the values is blank:
import csv

field_types = [('x', float),
               ('y', float)]

with open('some.csv') as f:
    for row in csv.DictReader(f):
        row.update((key, conversion(row[key]))
                   for key, conversion in field_types)
That gives something like {'x': ..., 'y': 2.2} - until there is an empty data point. Kaboom.
So my suggestion would be not to change or add to the data, but to change the algorithm that deals with it. If the problem is order, why not simply treat a tuple as a named tuple, similar to a dictionary; the caveat is immutability, which actually makes sense for uniform data.
I don't quite understand the nested dictionary - that is for the header values, yes? Values and order, key -> key -> (key: value)? Or you could just skip the first row:
rows = iter(lines)
next(rows)                 # skip the header row - problem solved
for line in rows:
    print(line, end='')
*** Notables
- To iterate over multiple sequences in parallel:
h = ['a', 'b', 'c']
x = [1, 2, 3]
for i in zip(h, x):
    print(i)
# ('a', 1)
# ('b', 2)
# ('c', 3)
- Chaining:
from itertools import chain
a = [1, 2, 3]
b = ['a', 'b', 'c']
for x in chain(a, b):
    print(x)

function using a list in python

I want a function that returns the value of the equation for every number in the list. I have a list of 24 parameters, and I need to solve an equation for each value in this list.
This is the way I get my list:
wlist=[]
def w(i):
    for i in range(24):
        Calctruesolar=((i*60/1440)*1440+eq_time()+4*long-60*timezone)%1440
        if Calctruesolar/4<0:
            Calcw=(Calctruesolar/4)+180
            wlist.append(Calcw)
            print(Calcw)
        else:
            Calcw=(Calctruesolar/4)-180
            wlist.append(Calcw)
            print(Calcw)
Then, the list is this one:
>>> wlist=
[166.24797550450222, -178.75202449549778, -163.75202449549778, -148.75202449549778, -133.75202449549778, -118.75202449549778, -103.75202449549778, -88.75202449549778, -73.75202449549778, -58.75202449549778, -43.75202449549778, -28.75202449549778, -13.752024495497778, 1.2479755045022216, 16.24797550450222, 31.24797550450222, 46.24797550450222, 61.24797550450222, 76.24797550450222, 91.24797550450222, 106.24797550450222, 121.24797550450222, 136.24797550450222, 151.24797550450222]
Now, I use the following function:
def hourly_radiation(wlist):
    for i in wlist:
        Calcrt=(math.pi/24)*(a()+b()*math.cos(math.radians(i)))*((math.cos(math.radians(i)))-math.cos(math.radians(wss())))/(math.sin(math.radians(wss()))-((math.pi*wss()/180)*math.cos(math.radians(wss()))))
        CalcI=Calcrt*radiation
        print(Calcrt,CalcI)
So, I want to receive Calcrt and CalcI for every value inside the list, but it doesn't work. I have been looking for information on the internet and in tutorials, but I didn't find anything.
Try this:
def hourly_radiation(wlist):
    rt_list = []
    I_list = []
    for i in wlist:
        Calcrt = (math.pi/24)*(a()+b()*math.cos(math.radians(i)))*((math.cos(math.radians(i)))-math.cos(math.radians(wss())))/(math.sin(math.radians(wss()))-((math.pi*wss()/180)*math.cos(math.radians(wss()))))
        CalcI = Calcrt*radiation
        rt_list.append(Calcrt)
        I_list.append(CalcI)
        print(Calcrt, CalcI)
    results = {}                 # avoid shadowing the built-in name "dict"
    results["Calcrt"] = rt_list
    results["CalcI"] = I_list
    return results
This would return the values as a dictionary containing two lists. You may use any data structure that matches your requirements.
You may also create a tuple in each loop run and append it to a list and return it, like:
def hourly_radiation(wlist):
    rt_list = []
    for i in wlist:
        Calcrt = (math.pi/24)*(a()+b()*math.cos(math.radians(i)))*((math.cos(math.radians(i)))-math.cos(math.radians(wss())))/(math.sin(math.radians(wss()))-((math.pi*wss()/180)*math.cos(math.radians(wss()))))
        CalcI = Calcrt*radiation
        data = (Calcrt, CalcI)
        print(Calcrt, CalcI)
        rt_list.append(data)
    return rt_list
I have not run or tested this code, but I hope it should work.
Please use this as a starting point and not as a copy-paste solution.
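For instance, the tuple-returning version could be consumed like this (a sketch; it assumes a(), b(), wss() and radiation are defined elsewhere in your module):

results = hourly_radiation(wlist)
for Calcrt, CalcI in results:
    print(Calcrt, CalcI)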

Smart filter with python

Hi,
I need to filter out all rows that don't contain symbols from a huge "necessary" list. Example code:
def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members
f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg

filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))
I have python 2.4, so I can't use built-in any().
The filtering takes a long time; is there some way to optimize it? For example, rows 1 and 4 both contain the "RED.." pattern; once we have found that the "RED.." pattern is OK, can we skip the search through the 10,000-member list when the same pattern shows up again in row 4?
Is there some another way to optimize filtering?
Thank you.
...edited...
UPD: See real example data in the comments to this post. I'm also interested in sorting the result by "fruits". Thanks!
...end edited...
If you organized the necessary list as a trie, then you could look in that trie to check if the fruit starts with a valid prefix. That should be faster than comparing the fruit against every prefix.
For example (only mildly tested):
import bisect
import re

class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']

trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    fruit = regexp.findall(line)[0]
    if trie.contains_prefix_of(fruit):
        filtered.append(line)
This changes your algorithm from O(N * k), where N is the number of elements of necessary and k is the length of fruit, to just O(k) (more or less). It does take more memory though, but that might be a worthwhile trade-off for your case.
I'm convinced Zach's answer is on the right track. Out of curiosity, I've implemented another version (incorporating Zach's comments about using a dict instead of bisect) and folded it into a solution that matches your example.
#!/usr/bin/env python
import re
from trieMatch import PrefixMatch # https://gist.github.com/736416

pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ]) # huge list of 10 000 members
# if the list is static, it might be worth pickling "pm" to avoid rebuilding it each time

f = open("huge_file.txt", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))
For brevity, implementation of PrefixMatch is published here.
If your list of necessary prefixes is static or changes infrequently, you can speed up subsequent runs by pickling and reusing the PrefixMatch object instead of rebuilding it each time.
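A rough sketch of that idea (the pickle file name is hypothetical, and it assumes the PrefixMatch object from the gist pickles cleanly):

import cPickle as pickle   # Python 2.4

try:
    pm = pickle.load(open('prefix_match.pickle', 'rb'))
except IOError:
    pm = PrefixMatch(necessary)
    pickle.dump(pm, open('prefix_match.pickle', 'wb'))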
update (on sorted results)
According to the changelog for Python 2.4:
key should be a single-parameter function that takes a list element and returns a comparison key for the element. The list is then sorted using the comparison keys.
also, in the source code, line 1792:
/* Special wrapper to support stable sorting using the decorate-sort-undecorate
pattern. Holds a key which is used for comparisons and the original record
which is returned during the undecorate phase. By exposing only the key
.... */
This means that your regex pattern is only evaluated once for each entry (not once for each compare), hence it should not be too expensive to do:
sorted_generator = sorted(filtered, key=lambda line: regexp.match(line).group(1))
I personally like your code as is, since you treat "fruit=COLOR" as a pattern, which the others do not. I think you are looking for something like memoization, which would let you skip the test for an already-solved value, but I guess that is not the case here.
from itertools import ifilter

def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

necessary = ['YELLOW', 'GREEN', 'RED', ...]
predicate = lambda line: any_it("fruit=" + color in line for color in necessary)
filtered = ifilter(predicate, open("testest"))
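If the fruit values in the file do repeat, the memoization idea mentioned above could be sketched roughly like this (my own illustration, not part of the original answer; it reuses any_it(), regexp, lines and necessary from the question):

cache = {}  # extracted fruit string -> True/False

def keep(line):
    fruit = regexp.findall(line)[0]
    if fruit not in cache:
        cache[fruit] = any_it(fruit.startswith(x) for x in necessary)
    return cache[fruit]

filtered = (line for line in lines if keep(line))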
Tested (but unbenchmarked) code:
import re
import fileinput

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ]

filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        key = regexp.match(line).group(1)
    except AttributeError:
        continue # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break

# "filtered" now holds your results
print "".join(filtered)
Diff to code in question:
We do not first load the whole file into memory (as is done when you use file.readlines()). Instead, we process each line as the file is read in. I use the fileinput module here for brevity, but one can also use line = file.readline() and a while line: loop.
We stop iterating through the necessary list once a match is found.
We modified the regex pattern and use re.match instead of re.findall. That's assuming that each line would only contain one "fruit=..." entry.
update
If the format of the input file is consistent, you can squeeze out a little more performance by getting rid of regex altogether.
try:
    # with line = "2 asdasd fruit=SOMETHING asdasd...."
    key = line.split(" ", 3)[2].split("=")[1]
except:
    continue # no match
filtered=[]
for line in open('huge_file'):
    found=regexp.findall(line)
    if found:
        fruit=found[0]
        for x in necessary:
            if fruit.startswith(x):
                filtered.append(line)
                break
Or maybe:
necessary=['fruit=%s'%x for x in necessary]

filtered=[]
for line in open('huge_file'):
    for x in necessary:
        if x in line:
            filtered.append(line)
            break
I'd make a simple list like ['fruit=RED', 'fruit=GREEN', ...] with ['fruit='+n for n in necessary], then use in rather than a regex to test them. I don't think there's any way to do it really quickly, though.
filtered = (line for line in f if any(a in line for a in necessary_simple))
(The any() function is doing the same thing as your any_it() function)
Oh, and get rid of file.readlines(), just iterate over the file.
Untested code:
filtered = []
for line in lines:
    value = line.split('=', 1)[1].split(' ', 1)[0]
    if value not in necessary:
        filtered.append(line)
That should be faster than pattern matching 10 000 patterns onto a line.
Possibly there are even faster ways. :)
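One further speed-up for this exact-match variant (my own suggestion, not part of the original answer) is to turn necessary into a set, so each membership test is a hash lookup instead of a scan of 10,000 strings:

necessary_set = set(necessary)

filtered = []
for line in lines:
    value = line.split('=', 1)[1].split(' ', 1)[0]
    if value not in necessary_set:
        filtered.append(line)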
It shouldn't take too long to iterate through 100,000 strings, but I see you have a 10,000-string list, which means you do 10,000 * 100,000 = 1,000,000,000 string checks, so I don't know what you expected...
As for your question: if you encounter a word from the list and you only need one or more matches (if you want exactly one, you need to iterate through the whole list), you can skip the rest; that should optimize the search operation.
