Extracting lines from multiline string with various variable length sections - python

I'm working with a pandas dataframe containing a large block of plain text for each row. The block of text has the following format:
Year 1
... (variable # of lines)
7. Stuff
... (variable # of lines, can be 0)
TOTAL Stuff
(single line, numeric)
... (variable # of lines)
Services
(single line)
... (variable # of lines)
Year 2
... (same format as prev)
<repeats for n years>
TOTAL
... (same format as years)
Justification
... (variable number of lines)
<repeat m times>
I'm trying to extract the plain text under the "7. Stuff" and "Justification" headings, as well as the numeric value under "TOTAL Stuff". My current code creates a list based on the line breaks and iterates through it, but I feel like this is not efficient. My current implementation also only works when there is one cycle of years -> TOTAL -> Justification (not m).
Here is my parse_text function. Any help on making it more 'pythonic' or just efficient in general is greatly appreciated.
def parse_budget_text(row):
    stuff_value = 0
    stuff_txt = ''
    justification_txt = ''
    cumulative_flag = False
    justification_flag = False
    #ensure text is not hidden within a list
    text = row['text_raw']
    #parse and sum equipment lines
    line_iter = iter([line.strip() for line in text.split("\n")])
    for line in line_iter:
        #find each yearly section
        if line.startswith("YEAR"):
            while not line.startswith("7. Stuff"):
                line = next(line_iter)
            line = next(line_iter)
            while not line.startswith("Services"):
                #is_number is a helper defined elsewhere
                if ("TOTAL Stuff" not in line) and (not is_number(line)) and (line[0] != "$"):
                    stuff_txt += line + '; '
                line = next(line_iter)
        #find total summary
        elif line.startswith("TOTAL"):
            cumulative_flag = True
            while not line.startswith("TOTAL Stuff"):
                line = next(line_iter)
            stuff_value += int(next(line_iter).replace(',', ''))
        #find Justification line
        elif line.startswith("Justification") and cumulative_flag:
            justification_flag = True
        #extract justification
        elif justification_flag:
            justification_txt += line
    return pd.Series({'raw_text': text, 'Stuff_val': stuff_value, 'Stuff_txt': stuff_txt,})
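One way to cut down the manual iteration is to let re carve out the sections directly. A rough sketch, untested against the real data: it assumes each heading sits on its own line as in the format above, and that the year headings match "Year " (adjust the patterns, e.g. to "YEAR", to fit your text):

import re

def parse_budget_text_re(text):
    # Everything between each "7. Stuff" heading and the next "Services" heading
    stuff_blocks = re.findall(r"7\. Stuff\n(.*?)\nServices", text, flags=re.S)
    stuff_txt = '; '.join(line.strip()
                          for block in stuff_blocks
                          for line in block.splitlines() if line.strip())
    # The single numeric line after each "TOTAL Stuff" heading; this sums every
    # occurrence, so narrow the pattern if only the final TOTAL section counts
    stuff_value = sum(int(v.replace(',', ''))
                      for v in re.findall(r"TOTAL Stuff\n\$?([\d,]+)", text))
    # Everything between "Justification" and the next "Year" heading (or the end)
    justification_txt = '\n'.join(
        re.findall(r"Justification\n(.*?)(?=\nYear |\Z)", text, flags=re.S))
    return stuff_value, stuff_txt, justification_txt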

Related

Reading from more than one line after keyword?

I have an output file which prints out a matrix of numeric data. I need to search through this file for the identifier at the start of each data set, which is:
GROUP 1 FIRST 1 LAST 163
Here GROUP 1 is the first column of the matrix, FIRST 1 is the first non-zero element of this matrix in position 1, and LAST 163 is the last non-zero element of the matrix in position 163. The matrix doesn't necessarily end at this LAST value - in this case there are 172 values.
I want to read this data into a simpler form to work with. Here is an example of the first two column results:
GROUP 1 FIRST 1 LAST 163
7.150814E-02 9.866657E-03 8.500540E-04 1.818338E-03 2.410691E-03 3.284499E-03 3.011986E-03 1.612432E-03
1.674247E-03 3.436244E-03 3.655873E-03 4.056876E-03 4.560725E-03 2.462454E-03 2.567764E-03 5.359393E-03
5.457415E-03 2.679373E-03 2.600020E-03 2.491592E-03 2.365089E-03 2.228494E-03 5.792616E-03 1.623274E-03
1.475062E-03 1.331820E-03 1.195052E-03 2.832699E-03 7.298341E-04 6.301271E-04 1.377459E-03 1.048925E-03
1.677453E-04 3.580640E-04 1.575301E-04 1.150545E-04 1.197719E-04 2.950028E-05 5.380539E-05 1.228784E-05
1.627659E-05 4.522051E-05 7.736908E-06 1.758838E-05 8.161204E-06 6.103670E-06 6.431876E-06 1.585671E-06
4.110246E-06 4.512924E-07 2.775227E-06 5.107739E-07 1.219448E-06 1.653674E-07 4.429047E-07 4.837661E-07
2.036820E-07 3.449548E-07 1.457648E-07 4.494116E-07 1.629392E-07 1.300509E-07 1.730199E-07 8.130338E-08
1.591993E-08 5.457638E-08 1.713141E-08 7.806754E-09 1.154869E-08 3.545961E-09 2.862203E-09 2.289470E-09
4.324002E-09 2.243199E-09 2.627165E-09 2.273119E-09 1.973867E-09 1.710714E-09 1.468845E-09 1.772236E-09
1.764492E-09 1.004393E-09 1.044698E-09 5.201382E-10 2.660613E-10 3.012732E-10 2.630323E-10 4.381052E-10
2.521794E-10 9.213524E-11 2.619283E-10 3.591906E-11 1.449830E-10 1.867363E-11 1.230445E-10 1.108149E-11
2.775004E-11 1.156249E-11 4.393752E-11 5.318751E-11 6.815569E-12 1.817489E-11 2.044674E-11 2.044673E-11
1.931080E-11 1.931076E-11 1.817484E-11 2.044668E-11 5.486837E-12 7.681572E-12 1.536314E-11 7.132886E-12
8.230253E-12 1.426577E-11 1.426577E-11 4.389468E-12 5.925780E-12 2.853153E-12 2.853153E-12 5.706307E-12
5.706307E-12 2.194733E-12 3.292099E-12 5.267358E-12 2.194733E-12 3.072626E-12 4.828412E-12 4.389466E-12
4.389465E-12 1.097366E-11 2.194732E-12 1.316839E-11 2.194732E-12 1.608784E-11 1.674222E-11 1.778860E-11
6.993074E-12 2.622402E-12 9.090994E-12 5.769285E-12 1.573441E-12 6.861030E-12 4.782885E-12 8.768619E-13
2.311727E-12 3.188589E-12 4.393636E-12 3.844430E-12 4.256331E-12 1.235709E-12 2.746020E-12 2.746020E-12
8.238059E-13 2.608719E-12 1.445203E-12 4.817344E-13 1.445203E-12 7.609642E-14 2.536547E-13 2.000924E-13
7.075681E-14 7.075681E-14 3.056704E-14
GROUP 2 FIRST 2 LAST 168
6.740271E-02 8.310813E-03 3.609403E-03 1.307012E-03 2.949375E-03 3.605043E-03 1.612647E-03 1.640960E-03
3.597806E-03 4.022993E-03 4.289805E-03 4.480576E-03 2.352539E-03 2.415121E-03 5.018262E-03 5.188098E-03
2.589224E-03 2.546116E-03 2.472462E-03 2.374431E-03 2.260519E-03 5.981164E-03 1.700972E-03 1.556116E-03
1.410140E-03 1.273499E-03 3.061941E-03 7.995844E-04 6.967963E-04 1.553994E-03 1.216266E-03 1.997540E-04
4.426460E-04 1.990445E-04 1.470610E-04 1.539762E-04 3.814900E-05 7.024764E-05 1.611156E-05 2.136422E-05
5.984886E-05 1.035646E-05 2.363444E-05 1.105747E-05 8.308678E-06 8.789299E-06 2.257693E-06 5.807418E-06
6.248625E-07 3.822327E-06 6.987942E-07 1.660586E-06 2.240283E-07 5.983062E-07 6.513773E-07 2.735403E-07
4.614998E-07 1.940877E-07 5.895136E-07 2.081549E-07 1.662117E-07 2.316650E-07 1.101916E-07 2.162701E-08
7.493990E-08 2.341661E-08 1.072330E-08 1.606536E-08 4.945307E-09 3.936301E-09 3.147244E-09 5.945972E-09
3.108514E-09 3.682241E-09 3.210760E-09 2.795020E-09 2.436545E-09 2.118219E-09 2.612622E-09 2.586657E-09
1.432507E-09 1.457386E-09 7.264341E-10 3.803348E-10 4.514677E-10 3.959518E-10 6.541553E-10 3.707172E-10
1.334816E-10 3.875547E-10 5.294296E-11 2.294557E-10 2.790137E-11 1.719152E-10 1.408339E-11 3.526731E-11
1.469469E-11 5.583990E-11 6.759567E-11 8.766360E-12 2.337697E-11 2.629908E-11 2.629908E-11 2.483802E-11
2.483802E-11 2.337697E-11 2.629908E-11 7.112706E-12 9.957791E-12 1.991557E-11 9.246516E-12 1.066906E-11
1.849303E-11 1.849303E-11 5.690165E-12 7.681722E-12 3.698607E-12 3.698607E-12 7.397214E-12 7.397214E-12
2.845082E-12 4.267624E-12 6.828199E-12 2.845082E-12 3.983115E-12 6.259180E-12 5.690165E-12 5.690165E-12
1.422541E-11 2.845082E-12 1.707049E-11 2.845082E-12 2.095991E-11 2.193285E-11 2.330364E-11 1.096642E-11
4.112407E-12 1.425635E-11 8.906802E-12 2.429128E-12 1.106603E-11 8.097092E-12 1.484468E-12 3.913596E-12
5.398063E-12 8.624785E-12 7.546689E-12 8.355261E-12 2.425721E-12 5.390492E-12 5.390492E-12 1.617147E-12
5.120967E-12 2.710198E-12 9.033993E-13 2.710198E-12 3.744092E-13 1.248030E-12 6.614939E-13 4.359798E-13
4.359798E-13 1.364861E-13 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15
What I have at the moment works, except it only reads in the first line after the GROUP keyword line. How can I make it continue reading the data in until it reaches the next GROUP keyword?
import re
import io

file_name = "test_data.txt"

group_pattern = re.compile(r"GROUP +\d+ FIRST +(?P<first>\d+) LAST +(?P<last>\d+)")

def read_data_from_file(file_name, start_identifier, end_identifier):
    results = []
    longest = 0
    with open(file_name) as file:
        t = file.read()
        t = t[t.find('MACRO'):]
        t = t[t.find(start_identifier) + len(start_identifier):t.find(end_identifier)]
        t = io.StringIO(t)
        for line in t:
            match = group_pattern.search(line)
            if match:
                first = int(match.group('first'))
                last = int(match.group('last'))
                data = [float(value) for value in next(t).split()]
                row = [0.0] * last
                for i, value in enumerate(data, start=first - 1):
                    row[i] = value
                longest = max(longest, len(row))
                results.append(row)
    for row in results:
        if len(row) < longest:
            row.extend([0.0] * (longest - len(row)))
    return results

start_identifier = "SCATTER MOMENT 1"
end_identifier = "SCATTER MOMENT 2"
results = read_data_from_file(file_name, start_identifier, end_identifier)
print(results)
What I want the code to produce is a matrix with just the numerical data. In this case it would be size [2x168] but my full data set is [172x172]. I want every GROUP to be read in as a row of the matrix, and zeroes filled into every element not specified in the output data. The current code does almost all of this, except that it only reads the first line of data after the GROUP keyword line.
So I took a look at the data you provided in your question. I found what I think is a better and simpler way of pulling those data points out of that file. However, I noticed that you have some other code that's looking for other things in the file as well, but those weren't in the test data you posted, so you may have to adapt this a little to work with your dataset.
def read_data_from_file(file_name):
    with open(file_name) as fp:
        index = -1
        matrices = []
        # Iterate over the file line by line to reduce memory usage
        for line in fp:
            # Since headers are always on their own line and data lines always begin
            # with two spaces, we can just look for lines that start with two spaces.
            # If we find a line without these spaces then it's a header line: add a new
            # list to matrices and add one to index
            if not line.startswith('  '):
                index += 1
                matrices.append([])
            else:
                # Slice the str at index 2 to ignore the first two spaces,
                # then split by two spaces to get each data point
                str_data_points = line[2:].split('  ')
                # Map the string data points to floats
                float_data_points = map(lambda s: float(s), str_data_points)
                # Add those float data points to the list in matrices via index
                matrices[index].extend(float_data_points)
    max_matrix_length = max(map(lambda matrix: len(matrix), matrices))
    for matrix in matrices:
        matrix.extend([0.0] * (max_matrix_length - len(matrix)))
    return matrices
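For example, assuming the sample above is saved as test_data.txt (and keeps the two-space indentation on data lines that this answer relies on), the result can go straight into numpy:

import numpy as np

matrices = read_data_from_file('test_data.txt')
mat = np.array(matrices)
print(mat.shape)  # (2, 168) for the sample data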
Here's my solution to read the data from the .txt file and produce a matrix-like output (0.0 padded at the end of each group)
import re

def read_data_from_file(file_path):
    GROUP_DATA = []
    MAX_ELEMENT_COUNT = 0
    with open(file_path) as f:
        for line in f.readlines():
            if 'GROUP' in line:
                GROUP_DATA.append([])
                MAX_ELEMENT_COUNT = max(MAX_ELEMENT_COUNT, int(re.findall(r'\d+', line)[-1]))
            else:
                values = line.split(' ')
                for value in values:
                    try:
                        GROUP_DATA[-1].append(float(value))
                    except ValueError:
                        pass
    for DATA in GROUP_DATA:
        if len(DATA) < MAX_ELEMENT_COUNT:
            DATA += [0.0] * (MAX_ELEMENT_COUNT - len(DATA))
    return GROUP_DATA
For the data in the given question saved into data.txt, the output would be as follows:
>>> import numpy as np ------------------------------> Just to check the output shape
>>> mat = read_data_from_file('data.txt')
>>> np.shape(mat)
(2, 168) <-------------------------------------------- Output shape as expected
The output matrix's size adapts to the given data.
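If numpy is in the mix anyway, the zero-padding can be delegated to a zero-initialized array. A sketch along the same lines (parse_groups is a hypothetical name; like the answers above, it appends values from position 0 and ignores the FIRST offset):

import re
import numpy as np

def parse_groups(file_path):
    # One row per GROUP header; LAST gives that row's declared width
    rows, declared = [], []
    with open(file_path) as f:
        for line in f:
            m = re.match(r"GROUP +\d+ FIRST +\d+ LAST +(\d+)", line)
            if m:
                rows.append([])
                declared.append(int(m.group(1)))
            elif rows:
                rows[-1].extend(float(v) for v in line.split())
    width = max(declared + [len(r) for r in rows])
    mat = np.zeros((len(rows), width))  # unfilled positions stay 0.0
    for i, r in enumerate(rows):
        mat[i, :len(r)] = r
    return mat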

Python splitting with string as delimiter

I have a file that looks something like this:
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC
And I need to split it into the "non-N" sequences, so two separate files like this:
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCT
AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGC
TATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACA
TGTGTTGC
What I currently have is this:
UMfile = open ("C:\Users\Manuel\Desktop\sequence.txt","r")
contignumber = 1
contigfile = open ("contig "+str(contignumber), "w")
DNA = UMfile.read()
DNAstring = str(DNA)
for s in DNAstring:
    DNAstring.split("NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN",1)
    contigfile.write(DNAstring)
    contigfile.close()
    contignumber = contignumber+1
    contigfile = open ("contig "+str(contignumber), "w")
The thing is, I realize there is a line break between the Ns, and that is why it is not splitting my file. But the file I'm showing is just a part of a much, much bigger one, so sometimes the Ns will look like "NNNNNN\n" and sometimes like "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n"; yet there is always a run of 1000 Ns between the sequences that I need to split.
So my question is: how do I tell Python to split, and write into different files, every 1000 Ns, knowing that there will be a different number of Ns on each line?
Thank you all very much, I really have no informatics background and my python skills are at best basic.
Just split your string on 'N' and then remove all the strings that are empty, or just contain a newline. Like this:
#!/usr/bin/env python
DNAstring = '''AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC'''
sequences = [u for u in DNAstring.split('N') if u and u != '\n']
for i, seq in enumerate(sequences):
    print i
    print seq.replace('\n', '') + '\n'
output
0
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACTTCTCAATGGGCAGTACATATCATCTCT
1
AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACATGTGTTGC
The code snippet above also removes newlines inside the sequences using .replace('\n', '').
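Since the goal is one file per sequence, the same list can be written out instead of printed; a minimal sketch using the question's "contig" naming:

# Write each non-N sequence to its own file: "contig 1", "contig 2", ...
for i, seq in enumerate(sequences, 1):
    with open('contig %d' % i, 'w') as out:
        out.write(seq.replace('\n', '') + '\n')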
Here are a few programs that you may find useful.
Firstly, a line buffer class. You initialise it with a file name and a line width. You can then feed it random length strings and it will automatically save them to the text file, line by line, with all lines (except possibly the last line) having the given length. You can use this class in other programs to make your output look neat.
Save this file as linebuffer.py to somewhere in your Python path; the simplest way is to save it wherever you save your Python programs and make that the current directory when you run the programs.
linebuffer.py
#! /usr/bin/env python
''' Text output buffer
Write fixed width lines to a text file
Written by PM 2Ring 2015.03.23
'''
class LineBuffer(object):
    ''' Text output buffer
    Write fixed width lines to file fname
    '''
    def __init__(self, fname, width):
        self.fh = open(fname, 'wt')
        self.width = width
        self.buff = []
        self.bufflen = 0

    def write(self, data):
        ''' Write a string to the buffer '''
        self.buff.append(data)
        self.bufflen += len(data)
        if self.bufflen >= self.width:
            self._save()

    def _save(self):
        ''' Write the buffer to the file '''
        buff = ''.join(self.buff)
        #Split buff into lines
        lines = []
        while len(buff) >= self.width:
            lines.append(buff[:self.width])
            buff = buff[self.width:]
        #Add an empty line so we get a trailing newline
        lines.append('')
        self.fh.write('\n'.join(lines))
        self.buff = [buff]
        self.bufflen = len(buff)

    def close(self):
        ''' Flush the buffer & close the file '''
        if self.bufflen > 0:
            self.fh.write(''.join(self.buff) + '\n')
        self.fh.close()


def testLB():
    alpha = 'abcdefghijklmnopqrstuvwxyz'
    fname = 'linebuffer_test.txt'
    lb = LineBuffer(fname, 27)
    for _ in xrange(30):
        lb.write(alpha)
    lb.write(' bye.')
    lb.close()

if __name__ == '__main__':
    testLB()
Here is a program that makes random DNA sequences of the form you described in your question. It uses linebuffer.py to handle the output. I wrote this so I could test my DNA sequence splitter properly.
Random_DNA0.py
#! /usr/bin/env python
''' Make random DNA sequences
Sequences consist of random subsequences of the letters 'ACGT'
as well as short sequences of 'N', of random length up to 200.
Exactly 1000 'N's separate sequence blocks.
All sequences may contain newlines chars
Takes approx 3 seconds per megabyte generated and saved
on a 2GHz CPU single core machine.
Written by PM 2Ring 2015.03.23
'''
import sys
import random
from linebuffer import LineBuffer
#Set seed to None to seed randomizer from system time
random.seed(37)
#Output line width
linewidth = 120
#Subsequence base length ranges
minsub, maxsub = 15, 300
#Subsequences per sequence ranges
minseq, maxseq = 5, 50
#random 'N' sequence ranges
minn, maxn = 5, 200
#Probability that a random 'N' sequence occurs after a subsequence
randn = 0.2
#Sequence separator
nsepblock = 'N' * 1000
def main():
    #Get number of sequences from the command line
    numsequences = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    outname = 'DNA_sequence.txt'
    lb = LineBuffer(outname, linewidth)
    for i in xrange(numsequences):
        #Write the 1000*'N' separator between sequences
        if i > 0:
            lb.write(nsepblock)
        for j in xrange(random.randint(minseq, maxseq)):
            #Possibly make a short run of 'N's in the sequence
            if j > 0 and random.random() < randn:
                lb.write(''.join('N' * random.randint(minn, maxn)))
            #Create a single subsequence
            r = xrange(random.randint(minsub, maxsub))
            lb.write(''.join([random.choice('ACGT') for _ in r]))
    lb.close()

if __name__ == '__main__':
    main()
Finally, we have a program that splits your random DNA sequences. Once again, it uses linebuffer.py to handle the output.
DNA_Splitter0.py
#! /usr/bin/env python
''' Split DNA sequences and save to separate files
Sequences consist of random subsequences of the letters 'ACGT'
as well as short sequences of 'N', of random length up to 200.
Exactly 1000 'N's separate sequence blocks.
All sequences may contain newlines chars
Written by PM 2Ring 2015.03.23
'''
import sys
from linebuffer import LineBuffer
#Output line width
linewidth = 120
#Sequence separator
nsepblock = 'N' * 1000
def main():
    iname = 'DNA_sequence.txt'
    outbase = 'contig'
    with open(iname, 'rt') as f:
        data = f.read()
    #Remove all newlines
    data = data.replace('\n', '')
    sequences = data.split(nsepblock)
    #Save each sequence to a series of files
    for i, seq in enumerate(sequences, 1):
        outname = '%s%05d' % (outbase, i)
        print outname
        #Write sequence data, with line breaks
        lb = LineBuffer(outname, linewidth)
        lb.write(seq)
        lb.close()

if __name__ == '__main__':
    main()
assuming you can read the whole file at once
s = DNAstring.replace("\n","") # first remove the nasty linebreaks
l = [x for x in s.split("N") if x] # split and drop empty lines
for x in l: # print in chunks
    while x:
        print x[:10]
        x = x[10:]
    print # extra linebreak between chunks
You could simply replace every N and \n with a space, and then split.
result = DNAstring.replace("\n", " ").replace("N", " ").split()
This will give you back a list of strings, and the 'ACGT' sequences will also be split with every new line.
If this is not your goal and you want to keep the \n in the 'ACGT' and not split along it, you can do the following:
result = DNAstring.replace("N\n", " ").replace("N", " ").split()
this will only remove the \n if it is in the middle of an N sequence.
To split your string exactly after 1000 Ns:
# 1/ Get rid of line breaks in the N sequence
result = DNAstring.replace("N\n", "N")
# 2/ split every 1000 Ns
result = result.split(1000*"N")
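A minimal end-to-end sketch combining those two steps and writing one file per chunk (it assumes the whole file fits in memory and drops all line breaks; file names follow the question's "contig" pattern):

with open('sequence.txt') as f:
    DNAstring = f.read()

# Remove line breaks everywhere, then split on the exact 1000-N separator
chunks = DNAstring.replace('\n', '').split(1000 * 'N')
for i, chunk in enumerate(chunks, 1):
    with open('contig %d' % i, 'w') as out:
        out.write(chunk + '\n')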

Python perform actions only at certain locations in text file

I have a text file which contains the data like this
AA 331
line1 ...
line2 ...
% information here
AA 332
line1 ...
line2 ...
line3 ...
%information here
AA 1021
line1 ...
line2 ...
% information here
AA 1022
line1 ...
% information here
AA 1023
line1 ...
line2 ...
% information here
I want to perform an action only for the "information" sections that come after the smallest integer in each group, i.e. after lines "AA 331" and "AA 1021", and not after lines "AA 332", "AA 1022" and "AA 1023".
P.S. This is just sample data from a large file.
The code below parses the text file and collects the integers that come after "AA" into a list, list1; the second function groups them and takes the minimal value of each group into list2. This returns integers like [331, 1021, ...]. So I thought of extracting the lines that come after "AA 331" and performing the action there, but I don't know how to proceed.
from itertools import groupby

def getlineindex(textfile):
    with open(textfile) as infile:
        list1 = []
        for line in infile:
            if line.startswith("AA"):
                intid = int(line[3:])
                list1.append(intid)
        return list1

def minimalinteger(list1):
    list2 = []
    for k, v in groupby(list1, key=lambda x: x // 10):
        minimalint = min(v)
        list2.append(minimalint)
    return list2

list2 contains the smallest integers which come after "AA": [331, 1021, ...]
You can use something like:
import re

matcher = re.compile(r"AA (\d+)")
already_was = []
good_block = False
with open(filename) as f:
    for line in f:
        m = matcher.match(line)
        if m:
            # group number: runs 330-339 share v == 33, 1020-1029 share v == 102
            v = int(m.group(1)) // 10
            if v not in already_was:
                # first (smallest) entry of a new group: process its block
                good_block = True
                already_was.append(v)
            else:
                good_block = False
        elif good_block:
            do_action()
This code works only if the first value in each group is the minimal one.
Okay, here's my solution. At a high level, I go line by line, watching for AA lines to know when I've found the start/end of a data block, and watch what I call the run number to know whether or not we should process the next block. Then, I have a subroutine that handles any given block, basically reading off all relevant lines and processing them if needed. That subroutine is what watches for the next AA line in order to know when it's done.
import re

runIdRegex = re.compile(r'AA (\d+)')

def processFile(fileHandle):
    lastNumber = None # Last run number, necessary so we know if there's been a gap or if we're in a new block of ten.
    line = fileHandle.next()
    while line is not None: # None is being used as a special value indicating we've hit the end of the file.
        processData = False
        match = runIdRegex.match(line)
        if match:
            runNumber = int(match.group(1))
            if lastNumber == None:
                # Startup/first iteration
                processData = True
            elif runNumber - lastNumber == 1:
                # Continuation, see if the tens are the same.
                lastNumberTens = lastNumber / 10
                runNumberTens = runNumber / 10
                if lastNumberTens != runNumberTens:
                    processData = True
            else:
                processData = True
            # Always remember where we were.
            lastNumber = runNumber
            # And grab and process data.
            line = dataBlock(fileHandle, process=processData)
        else:
            try:
                line = fileHandle.next()
            except StopIteration:
                line = None

def dataBlock(fileHandle, process=False):
    runData = []
    try:
        line = fileHandle.next()
        match = runIdRegex.match(line)
        while not match:
            runData.append(line)
            line = fileHandle.next()
            match = runIdRegex.match(line)
    except StopIteration:
        # Hit end of file
        line = None
    if process:
        # Data processing call here
        # processData(runData)
        pass
    # Return line so we don't lose it!
    return line
Some notes for you. First, I'm in agreement with Jimilian that you should use a regular expression to match AA lines.
Second, the logic we talked about with regard to when we should process data is in processFile. Specifically these lines:
processData = False
match = runIdRegex.match(line)
if match:
    runNumber = int(match.group(1))
    if lastNumber == None:
        # Startup/first iteration
        processData = True
    elif runNumber - lastNumber == 1:
        # Continuation, see if the tens are the same.
        lastNumberTens = lastNumber / 10
        runNumberTens = runNumber / 10
        if lastNumberTens != runNumberTens:
            processData = True
    else:
        processData = True
I assume we don't want to process data, then identify when we do. Logically speaking, you can do the inverse of this and assume you want to process data, then identify when you don't. Next, we need to store the last run's value in order to know whether or not we need to process this run's data. (and watch out for that first run edge case) We know we want to process data when the sequence is broken (the difference between two runs is greater than 1), which is handled by the else statement. We also know that we want to process data when the sequence increments the digit in the tens place, which is handled by my integer divide by 10.
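As a quick sanity check, the same decision rule can be run on the run numbers from the sample data (a standalone sketch, separate from the answer's code):

def should_process(last, current):
    if last is None:
        return True                      # first run seen
    if current - last != 1:
        return True                      # gap in the sequence
    return last // 10 != current // 10   # new block of ten

last = None
for run in [331, 332, 1021, 1022, 1023]:
    print("%d -> %s" % (run, should_process(last, run)))  # True only for 331 and 1021
    last = run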
Third, watch out for that return data from dataBlock. If you don't do that, you're going to lose the AA line that caused dataBlock to stop iterating, and processFile needs that line in order to know whether the next data block should be processed.
Last, I've opted to use fileHandle.next() and exception handling to identify when I get to the end of the file. But don't think this is the only way. :)
Let me know in comments if you have any questions.

Fastq parser not taking empty sequence (and other edge cases). Python

This is a continuation of Generator not working to split string by particular identifier. Python 2. However, I modified the code completely and it's not the same format at all. This is about edge cases.
Edge cases:
- when the sequence length is different from the number of quality values
- when there's an empty sequence or entry
- when the number of lines with quality values is more than one
I cannot figure out how to handle the edge cases above. If it's an empty data file, then I still want to output empty strings. I'm trying with these sequences right here for my input file. (Just a little background: IDs are set by # at the beginning of a line, sequence characters follow on the lines after until a line with + is reached, and the next lines will have quality values (value ~= chr(char)). This format is terrible and poorly thought out.)
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs
CTCTCTCATCACACACGAGGAGTGAAGAGAGAACCTCCTCTCCACACGTGGAGTGAGGAGATCCTCTCACACACGTGAGGTGTTGAGAGAGATACTCTCTCATCACCTCACGTGAGGAGTGAGAGAGAT
+
{~~~~~sXNL>>||~~fVM~jtu~&&(uxy~f8YHh=<gA5
''<O1A44N'`oK57(((G&&Q*Q66;"$$Df66E~Z\ZMO>^;%L}~~~~~Q.~~~~x~#-LF9>~MMqbV~ABBV=99mhIwGRR~
#different_number_of_seq_qual
ATCG
+
**!
#this_should_work
GGGG
+
****
For the ones with an error, I'm trying to replace the seq and qual strings with empty strings:
seq, qual = '', ''
Here's my code so far. These edge cases are so difficult for me to figure out; please help . . .
import sys

def read_fastq(input, offset):
    """
    Inputs a fastq file and reads each line at a time. 'offset' parameter can be set to 33 (phred+33 encoding
    fastq), and 64. Yields a tuple in the format (ID, comments for a sequence, sequence, [integer quality values])
    Capable of reading empty sequences and empty files.
    """
    ID, comment, seq, qual = None, '', '', ''
    step = 1 #step is a variable that organizes the order of fastq parsing
    #step = 1 scans for ID and comment line
    #step = 2 adds relevant lines to sequence string
    #step = 3 adds quality values to string
    for line in input:
        line = line.strip()
        if step == 1 and line.startswith('#'): #Step system from Nedda Saremi
            if ID is not None:
                qual = [ord(char) - offset for char in qual] #Converts from phred encoding to integer values
                sep = None
                if ' ' in ID: sep = ' '
                if sep is not None:
                    ID, comment = ID.split(sep, 1) #Separates ID and comment by ' '
                yield ID, comment, seq, qual
                ID, comment, seq, qual = None, '', '', '' #Resets variables for next sequence
            ID = line[1:]
            step = 2
            continue
        if step == 2 and not line.startswith('#') and not line.startswith('+'):
            seq = seq + line.strip()
            continue
        if step == 2 and line.startswith('+'):
            step = 3
            continue
        while step == 3:
            #process the quality data
            if len(qual) == len(seq):
                #once the length of the quality string and seq are the same, end gathering data
                step = 1
                continue
            if len(qual) < len(seq):
                qual = qual + line.strip()
            if len(qual) < len(seq):
                step = 3
                continue
            if len(qual) > len(seq):
                sys.stderr.write('\nError: ' + ID + ' sequence length not equal to quality values\n')
                comment, seq, qual = '', '', ''
                ID = line
                step = 1
                continue
            break
    if ID is not None:
        #Section reserved for last entry in file
        if len(qual) > 0:
            qual = [ord(char) - offset for char in qual]
            sep = None
            if ' ' in ID: sep = ' '
            if sep is not None:
                ID, comment = ID.split(sep, 1)
        if len(seq) == 0: ID, comment, seq, qual = '', '', '', ''
        yield ID, comment, seq, qual
My output is skipping the ID #m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs and adding #**! when it should not be in the output:
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
Error: different_number_of_seq_qual sequence length not equal to quality values
#**!
+
#this_should_work
GGGG
+
****
You probably should use BioPython.
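For example, with Biopython's SeqIO (a sketch; reads.fastq is a placeholder, and note your sample data uses # where standard FASTQ requires @, so the file would need the standard prefix first):

from Bio import SeqIO

# SeqIO.parse enforces the format, raising ValueError when a record's
# sequence and quality lengths differ, which covers the edge cases above.
for record in SeqIO.parse('reads.fastq', 'fastq'):
    print(record.id)
    print(str(record.seq))
    print(record.letter_annotations['phred_quality'])  # qualities as integers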
Your bug appears to be that the read that is skipped has 129 bases in its sequence but only 128 quality values, so your parser reads the next defline as a quality line, which then makes the quality string too long and prints the error.
Then your states don't account for the situation where you are in step 1 but don't see a defline, so you keep reading extra lines, overwriting the ID variable.
but if you really want to write your own parser:
I'll address your questions one at a time.
when sequence length is different than number of quality values
This is invalid. Each record in the fastq file must have an equal number of bases and qualities. Different records in the file can have different lengths from each other, but each record must have equal bases and qualities.
when there's an empty sequence or entry
An empty read will have blank lines for the sequence and quality lines like this:
#SOLEXA1_0007:1:9:610:1983#GATCAG/2
+SOLEXA1_0007:1:9:610:1983#GATCAG/2
#SOLEXA1_0007:2:13:163:254#GATCAG/2
CGTAGTACGATATACGCGCGTGTACTGCTACGTCTCACTTTCGCAAGATTGCTCAGCTCATTGATGCTCAATGCTGGGCCATATCTCTTTTCTTTTTTTC
+SOLEXA1_0007:2:13:163:254#GATCAG/2
HHHHGHHEHHHHHE=HAHCEGEGHAG>CHH>EG5#>5*ECE+>AEEECGG72B&A*)569B+03B72>5.A>+*A>E+7A#G<CAD?#############
when the number of lines with quality values is more than one
Due to the requirements from the first answer above, we know that the number of bases and qualities must match. Also, there will never be a + character in the sequence block, so we can keep parsing the sequence block until we see a line that starts with +; then we know we are done parsing sequence. Then we can keep parsing quality lines until we get the same number of qualities as there are bases in the sequence. We can't rely on looking for any special characters, because depending on the quality encoding, # could be a valid quality call.
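Putting that together, here is a minimal sketch of such a parse loop. Hedged: it keeps the # record prefix from the question's data rather than the standard @, skips the phred-offset conversion, and reports a record as invalid when its quality overruns its sequence; a record with too few quality values will absorb the next header line and be reported the same way, since (as noted above) no character can safely mark the end of a quality block.

def read_fastq_records(handle):
    # Yields (id, seq, qual): sequence lines until '+', then quality
    # lines until len(qual) == len(seq)
    ID, seq, qual, in_qual = None, '', '', False
    for line in handle:
        line = line.rstrip('\n')
        if in_qual:
            qual += line
            if len(qual) == len(seq):
                yield ID, seq, qual
                ID, seq, qual, in_qual = None, '', '', False
            elif len(qual) > len(seq):
                raise ValueError('%s: more quality values than bases' % ID)
        elif ID is None:
            if line.startswith('#'):   # '@' in standard FASTQ
                ID = line[1:]
        elif line.startswith('+'):
            in_qual = True
            if not seq:                # empty record: yield empty strings
                yield ID, '', ''
                ID, in_qual = None, False
        else:
            seq += line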
Also, as an aside, you appear to be splitting the sequence defline to parse out the optional comment. You have to be careful with the CASAVA 1.8 format, which stupidly has spaces, so you might need a regex to detect CASAVA 1.8 format and, in that case, not split on whitespace.
Have you considered using one of the robust Python packages that are available for dealing with this kind of data rather than writing a parser from scratch? In particular I'd recommend checking out HTSeq.

How do you make tables with previously stored strings?

So the question basically gives me 19 DNA sequences and wants me to make a basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third the number of "A"s, the 4th the number of "G"s, the 5th "C", the 6th "T", the 7th %GC, and the 8th whether or not the sequence contains "TGA". Then I get all these values and write the table to "dna_stats.txt".
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq = 0
alllines = fh.readlines()
for line in alllines:
    if line.startswith(">"):
        seq += 1
        continue
    Acount += line.count("A")
    Ccount += line.count("C")
    Gcount += line.count("G")
    Tcount += line.count("T")
    genomeSize = Acount + Gcount + Ccount + Tcount
    percentGC = (Gcount + Ccount) * 100.00 / genomeSize
    print "sequence", seq
    print "Length of Sequence", len(line)
    print Acount, Ccount, Gcount, Tcount
    print "Percent of GC", "%.2f" % (percentGC)
    if "TGA" in line:
        print "Yes"
    else:
        print "No"
    fh2 = open("dna_stats.txt","w")
    for line in alllines:
        splitlines = line.split()
        lenstr = str(len(line))
        seqstr = str(seq)
        fh2.write(seqstr + "\t" + lenstr + "\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 for the first column, when it should go 1,2,3,4,5,etc. to represent all of the sequences. I tried it with the other variables and it just got the total amounts of the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:
for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again(#2)..."
De-indent those things.
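A sketch of the corrected skeleton (the loop bodies are elided here):

fh = open("dna.fasta", "r")
alllines = fh.readlines()

for line in alllines:              # first pass: gather the counts
    pass                           # counting code goes here

fh2 = open("dna_stats.txt", "w")   # opened once, after the first loop
for line in alllines:              # second pass: write one row per sequence
    pass                           # table-writing code goes here
fh2.close()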
This puts the info in a dictionary as you go and allows for DNA sequences to go over multiple lines
from __future__ import division # ensure things like 1/2 is 0.5 rather than 0
from collections import defaultdict

fh = open("dna.fasta","r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt","w")
seq = 0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq += 1
        data[seq] = defaultdict(int) #default value will be zero if key is not present, hence we can do += 1 without first initializing to zero
        data[seq]['seq'] = seq
        previous_line_end = "" #TGA might be split across lines
        continue
    data[seq]['Acount'] += line.count("A")
    data[seq]['Ccount'] += line.count("C")
    data[seq]['Gcount'] += line.count("G")
    data[seq]['Tcount'] += line.count("T")
    line_over = previous_line_end + line[:3]
    data[seq]['hasTGA'] = data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = str.strip(line[-4:]) #save previous_line_end for the next line, removing the newline character
for seq in data.keys():
    #compute the totals once per sequence, from the final counts
    data[seq]['genomeSize'] = data[seq]['Acount'] + data[seq]['Ccount'] + data[seq]['Gcount'] + data[seq]['Tcount']
    data[seq]['percentGC'] = (data[seq]['Gcount'] + data[seq]['Ccount']) * 100.00 / data[seq]['genomeSize']
    s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Ccount)d, %(Gcount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s\n'
    fh2.write(s % data[seq])
fh.close()
fh2.close()
