I have a text file which contains the data like this
AA 331
line1 ...
line2 ...
% information here
AA 332
line1 ...
line2 ...
line3 ...
%information here
AA 1021
line1 ...
line2 ...
% information here
AA 1022
line1 ...
% information here
AA 1023
line1 ...
line2 ...
% information here
I want to perform action only for "informations" that comes after smallest integer that is after line "AA 331"and line "AA 1021" and not after lines "AA 332" , "AA 1022" and "AA 1023" .
P.s This is just a sample data of large file
The below code i try to parse the text file and get the integers which are after "AA" in a list "list1" and in second function i group them to get minimal value in "list2". This will return integers like [331,1021,...]. So i thought of extracting lines which comes after "AA 331" and perform action but i d'nt know how to proceed.
from itertools import groupby
def getlineindex(textfile):
with open(textfile) as infile:
list1 = []
for line in infile :
if line.startswith("AA"):
intid = line[3:]
list1.append(intid)
return list1
def minimalinteger(list1):
list2 = []
for k,v in groupby(list1,key=lambda x: x//10):
minimalint = min(v)
list2.append(minimalint)
return list2
list2 contains the smallest integers which comes after "AA" [331,1021,..]
You can use something like:
import re
matcher = re.compile("AA ([\d]+)")
already_was = []
good_block = False
with open(filename) as f:
for line in f:
m = matcher.match(line)
if m:
v = int(m.groups(0)) / 10
else:
v = None
if m and v not in already_was:
good_block = True
already_was.append(m)
if m and v in already_was:
good_block = False
if not m and good_block:
do_action()
These code works only if first value in group is minimal one.
Okay, here's my solution. At a high level, I go line by line, watching for AA lines to know when I've found the start/end of a data block, and watch what I call the run number to know whether or not we should process the next block. Then, I have a subroutine that handles any given block, basically reading off all relevant lines and processing them if needed. That subroutine is what watches for the next AA line in order to know when it's done.
import re
runIdRegex = re.compile(r'AA (\d+)')
def processFile(fileHandle):
lastNumber = None # Last run number, necessary so we know if there's been a gap or if we're in a new block of ten.
line = fileHandle.next()
while line is not None: # None is being used as a special value indicating we've hit the end of the file.
processData = False
match = runIdRegex.match(line)
if match:
runNumber = int(match.group(1))
if lastNumber == None:
# Startup/first iteration
processData = True
elif runNumber - lastNumber == 1:
# Continuation, see if the tenths are the same.
lastNumberTens = lastNumber / 10
runNumberTens = runNumber / 10
if lastNumberTens != runNumberTens:
processData = True
else:
processData = True
# Always remember where we were.
lastNumber = runNumber
# And grab and process data.
line = dataBlock(fileHandle, process=processData)
else:
try:
line = fileHandle.next()
except StopIteration:
line = None
def dataBlock(fileHandle, process=False):
runData = []
try:
line = fileHandle.next()
match = runIdRegex.match(line)
while not match:
runData.append(line)
line = fileHandle.next()
match = runIdRegex.match(line)
except StopIteration:
# Hit end of file
line = None
if process:
# Data processing call here
# processData(runData)
pass
# Return line so we don't lose it!
return line
Some notes for you. First, I'm in agreement with Jimilian that you should use a regular expression to match AA lines.
Second, the logic we talked about with regard to when we should process data is in processFile. Specifically these lines:
processData = False
match = runIdRegex.match(line)
if match:
runNumber = int(match.group(1))
if lastNumber == None:
# Startup/first iteration
processData = True
elif runNumber - lastNumber == 1:
# Continuation, see if the tenths are the same.
lastNumberTens = lastNumber / 10
runNumberTens = runNumber / 10
if lastNumberTens != runNumberTens:
processData = True
else:
processData = True
I assume we don't want to process data, then identify when we do. Logically speaking, you can do the inverse of this and assume you want to process data, then identify when you don't. Next, we need to store the last run's value in order to know whether or not we need to process this run's data. (and watch out for that first run edge case) We know we want to process data when the sequence is broken (the difference between two runs is greater than 1), which is handled by the else statement. We also know that we want to process data when the sequence increments the digit in the tens place, which is handled by my integer divide by 10.
Third, watch out for that return data from dataBlock. If you don't do that, you're going to lose the AA line that caused dataBlock to stop iterating, and processFile needs that line in order to know whether the next data block should be processed.
Last, I've opted to use fileHandle.next() and exception handling to identify when I get to the end of the file. But don't think this is the only way. :)
Let me know in comments if you have any questions.
Related
I have some structured data in a text file:
Parse.txt
name1
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
gggggggg
some of the detail4s do not have data and would be replaced by "-":
name2
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
-
How do i parse the data to get the elements below detail1, detail2 and detail3 of only the data with empty detail4s?
So far i have a partially working code but the problem is that it gets each item 40 times. Please help.
Code:
data = []
with open("parse.txt","r",encoding="utf-8") as text_file:
for line in text_file:
data.append(line)
det4li = []
finali= []
for elem,det4 in zip(data,data[1:]):
if "detail4" in elem:
det4li .append(det4)
if "-" in det4:
for elem1,det1,det2,det3 in zip(data,data[1:],data[3:],data[5:]):
if "detail1:" in elem1:
finali.append(det1.strip() + "," + det2.strip() + "," + det3)
Current Output: 40 records of dddddddd,eeeeeeee,ffffffff
Desired Output: dddddddd,eeeeeeee,ffffffff
Don't try to look ahead. Look behind, by storing preceding data:
final = []
with open("parse.txt","r",encoding="utf-8") as text_file:
section = {}
last_header = None
for line in text_file:
line = line.strip()
if line.startswith('detail'):
# detail line, record for later use
last_header = line.rstrip(':')
elif not last_header:
# name line, store as such
section['name'] = line
else:
section[last_header] = line
if last_header == 'detail4':
# section complete, process
if line == '-':
# A section we want to keep
final.append(section)
# reset section data
section, last_header = {}, None
This has the added advantage that you now don't need to read the whole file into memory. If you turn this into a generator (by putting it into a function and replacing the final.append(section) line with yield section), you can even process those matching sections as you read the file without sacrificing readability.
I'm working with a pandas dataframe containing a large block of plain text for each row. The block of text has the following format:
Year 1
... (variable # of lines)
7. Stuff
... (variable # of lines, can be 0)
TOTAL Stuff
(single line, numeric)
... (variable # of lines)
Services
(single line)
... (variable # of lines)
Year 2
... (same format as prev)
<repeats for n years>
TOTAL
... (same format as years)
Justification
... (variable number of lines)
<repeat m times>
I'm trying to extract the plain text under the "7. Stuff" and "Justification" headings as well a numerical values for "TOTAL Stuff". My current code creates a list based on the line breaks and iterates through them, but I feel like this is not efficient. It my current implementation also only works when there is one cycle of years -> Total -> Justification (not m).
Here is my parse_text function. Any help on making it more 'pythonic' or just efficient in general is greatly appreciated.
def parse_budget_text(row):
stuff_value = 0
stuff_text = ''
justification_txt = ''
#ensure text is not hidden within a list
text = row['text_raw']
#parse and sum equipment lines
line_iter = iter([line.strip() for line in text.split("\n")])
total_flag = False
justification_flag = False
for line in line_iter:
#find each yearly section
if line.startswith("YEAR"):
while not line.startswith("7. Stuff"):
line = next(line_iter)
line = next(line_iter)
while not line.startswith("Services"):
if ("TOTAL Stuff" not in line) and (not is_number(line)) and (line[0] != "$"):
stuff_txt += line+'; '
line = next(line_iter)
#find total summary
elif line.startswith("TOTAL"):
cumulative_flag = True
while not line.startswith("TOTAL Stuff"):
line =next(line_iter)
stuff_value += int(next(line_iter).replace(',',''))
#find Justification line
elif line.startswith("Justification") and cumulative_flag:
justification_flag = True
#extract justification
elif justification_flag == True:
justification_txt += line
return pd.Series({'raw_text': text, 'Stuff_val': stuff_value, 'Stuff_txt': stuff_txt,})
So the question basically gives me 19 DNA sequences and wants me to makea basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third is the number of "A"'s, 4th is "G"'s, 5th is "C", 6th is "T", 7th is %GC, 8th is whether or not it has "TGA" in the sequence. Then I get all these values and write a table to "dna_stats.txt"
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
if line.startswith(">"):
seq+=1
continue
Acount+=line.count("A")
Ccount+=line.count("C")
Gcount+=line.count("G")
Tcount+=line.count("T")
genomeSize=Acount+Gcount+Ccount+Tcount
percentGC=(Gcount+Ccount)*100.00/genomeSize
print "sequence", seq
print "Length of Sequence",len(line)
print Acount,Ccount,Gcount,Tcount
print "Percent of GC","%.2f"%(percentGC)
if "TGA" in line:
print "Yes"
else:
print "No"
fh2 = open("dna_stats.txt","w")
for line in alllines:
splitlines = line.split()
lenstr=str(len(line))
seqstr = str(seq)
fh2.write(seqstr+"\t"+lenstr+"\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 for the first column, when it should go 1,2,3,4,5,etc. to represent all of the sequences. I tried it with the other variables and it just got the total amounts of the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:
for line in alllines: #1
...
fh2 = open("dna_stats.txt","w")
for line in alllines: #2
....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again(#2)..."
De-indent those things.
This puts the info in a dictionary as you go and allows for DNA sequences to go over multiple lines
from __future__ import division # ensure things like 1/2 is 0.5 rather than 0
from collections import defaultdict
fh = open("dna.fasta","r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt","w")
seq=0
data = dict()
for line in alllines:
if line.startswith(">"):
seq+=1
data[seq]=defaultdict(int) #default value will be zero if key is not present hence we can do +=1 without originally initializing to zero
data[seq]['seq']=seq
previous_line_end = "" #TGA might be split accross line
continue
data[seq]['Acount']+=line.count("A")
data[seq]['Ccount']+=line.count("C")
data[seq]['Gcount']+=line.count("G")
data[seq]['Tcount']+=line.count("T")
data[seq]['genomeSize']+=data[seq]['Acount']+data[seq]['Gcount']+data[seq]['Ccount']+data[seq]['Tcount']
line_over = previous_line_end + line[:3]
data[seq]['hasTGA']= data[seq]['hasTGA'] or ("TGA" in line) or (TGA in line_over)
previous_line_end = str.strip(line[-4:]) #save previous_line_end for next line removing new line character.
for seq in data.keys():
data[seq]['percentGC']=(data[seq]['Gcount']+data[seq]['Ccount'])*100.00/data[seq]['genomeSize']
s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Ccount)d, %(Tcount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s'
fh2.write(s % data[seq])
fh.close()
fh2.close()
I want to write a python program, that invokes ipcs and uses its output to delete shared memory segments and semaphores.
I have a working solution, but feel there must be a better way to do this.
Here is my program:
import subprocess
def getid(ip):
ret=''
while (output[ip]==' '):
ip=ip+1
while((output[ip]).isdigit()):
ret=ret+output[ip]
ip=ip+1
return ret
print 'invoking ipcs'
output = subprocess.check_output(['ipcs'])
print output
for i in range (len(output)):
if (output[i]=='m'):
r=getid(i+1)
print r
if (r):
op = subprocess.check_output(['ipcrm','-m',r])
print op
elif (output[i]=='s'):
r=getid(i+1)
print r
if (r):
op = subprocess.check_output(['ipcrm','-s',r])
print op
print 'invoking ipcs'
output = subprocess.check_output(['ipcs'])
print output
In particular, is there a better way to write "getid"? ie instead of parsing it character by character,
can I parse it string by string?
This is what the output variable looks like (before parsing):
Message Queues:
T ID KEY MODE OWNER GROUP
Shared Memory:
T ID KEY MODE OWNER GROUP
m 262144 0 --rw-rw-rw- xyz None
m 262145 0 --rw-rw-rw- xyz None
m 262146 0 --rw-rw-rw- xyz None
m 196611 0 --rw-rw-rw- xyz None
m 196612 0 --rw-rw-rw- xyz None
m 262151 0 --rw-rw-rw- xyz None
Semaphores:
T ID KEY MODE OWNER GROUP
s 262144 0 --rw-rw-rw- xyz None
s 262145 0 --rw-rw-rw- xyz None
s 196610 0 --rw-rw-rw- xyz None
Thanks!
You can pipe in the output from ipcs line by line as it outputs it. Then I would use .strip().split() to parse each line, and something like a try except block to make sure the line fits your criteria. Parsing it as a stream of characters makes things more complicated, I wouldn't recommend it..
import subprocess
proc = subprocess.Popen(['ipcs'],stdout=subprocess.PIPE)
for line in iter(proc.stdout.readline,''):
line=line.strip().split()
try:
r = int(line[1])
except:
continue
if line[0] == "m":
op = subprocess.check_output(['ipcrm','-m',str(r)])
elif line[0] == "s":
op = subprocess.check_output(['ipcrm','-s',str(r)])
print op
proc.wait()
There is really no need to iterate over the output one char at a time.
For one, you should split the output string in lines and iterate over those, handling one at a time. That is done by using the splitlines method of strings (see the docs for details).
You could further split the lines on blanks, using split(), but given the regularity of your output, a regular expression fits the bill nicely. Basically, if the first character is either m or s, the next number of digits is your id, and whether m or s matches decides your next action.
You can use names to identify the groups of characters you identified, which makes for an easier reading of the regular expression and more comfortable handling of the result due to groupdict.
import re
pattern = re.compile('^((?P<mem>m)|(?P<sem>s))\s+(?P<id>\d+)')
for line in output.splitlines():
m = pattern.match(line)
if m:
groups = m.groupdict()
_id = groups['id']
if groups['mem']:
print 'handling a memory line'
pass # handle memory case
else:
print ' handling a semaphore line'
pass # handle semaphore case
You could use the string split method to split the string based on the spaces in it.
So you could use,
for line in output:
if (line.split(" ")[0] == 'm')):
id = line.split(" ")[2]
print id
I have text file as follows seq.txt
>S1
AACAAGAAGAAAGCCCGCCCGGAAGCAGCTCAATCAGGAGGCTGGGCTGGAATGACAGCG
CAGCGGGGCCTGAAACTATTTATATCCCAAAGCTCCTCTCAGATAAACACAAATGACTGC
GTTCTGCCTGCACTCGGGCTATTGCGAGGACAGAGAGCTGGTGCTCCATTGGCGTGAAGT
CTCCAGGGCCAGAAGGGGCCTTTGTCGCTTCCTCACAAGGCACAAGTTCCCCTTCTGCTT
CCCCGAGAAAGGTTTGGTAGGGGTGGTGGTTTAGTGCCTATAGAACAAGGCATTTCGCTT
CCTAGACGGTGAAATGAAAGGGAAAAAAAGGACACCTAATCTCCTACAAATGGTCTTTAG
TAAAGGAACCGTGTCTAAGCGCTAAGAACTGCGCAAAGTATAAATTATCAGCCGGAACGA
GCAAACAGACGGAGTTTTAAAAGATAAATACGCATTTTTTTCCGCCGTAGCTCCCAGGCC
AGCATTCCTGTGGGAAGCAAGTGGAAACCCTATAGCGCTCTCGCAGTTAGGAAGGAGGGG
TGGGGCTGTCCCTGGATTTCTTCTCGGTCTCTGCAGAGACAATCCAGAGGGAGACAGTGG
ATTCACTGCCCCCAATGCTTCTAAAACGGGGAGACAAAACAAAAAAAAACAAACTTCGGG
TTACCATCGGGGAACAGGACCGACGCCCAGGGCCACCAGCCCAGATCAAACAGCCCGCGT
CTCGGCGCTGCGGCTCAGCCCGACACACTCCCGCGCAAGCGCAGCCGCCCCCCCGCCCCG
GGGGCCCGCTGACTACCCCACACAGCCTCCGCCGCGCCCTCGGCGGGCTCAGGTGGCTGC
GACGCGCTCCGGCCCAGGTGGCGGCCGGCCGCCCAGCCTCCCCGCCTGCTGGCGGGAGAA
ACCATCTCCTCTGGCGGGGGTAGGGGCGGAGCTGGCGTCCGCCCACACCGGAAGAGGAAG
TCTAAGCGCCGGAAGTGGTGGGCATTCTGGGTAACGAGCTATTTACTTCCTGCGGGTGCA
CAGGCTGTGGTCGTCTATCTCCCTGTTGTTC
>S2
ACACGCATTCACTAAACATATTTACTATGTGCCAGGCACTGTTCTCAGTGCTGGGGATAT
AGCAGTGAAGAAACAGAAACCCTTGCACTCACTGAGCTCATATCTTAGGGTGAGAAACAG
TTATTAAGCAAGATCAGGATGGAAAACAGATGGTACGGTAGTGTGAAATGCTAAAGAGAA
AAATAACTACGGAAAAGGGATAGGAAGTGTGTGTATCGCAGTTGACTTATTTGTTCGCGT
TGTTTACCTGCGTTCTGTCTGCATCTCCCACTAAACTGTAAGCTCTACATCTCCCATCTG
TCTTATTTACCAATGCCAACCGGGGCTCAGCGCAGCGCCTGACACACAGCAGGCAGCTGA
CAGACAGGTGTTGAGCAAGGAGCAAAGGCGCATCTTCATTGCTCTGTCCTTGCTTCTAGG
AGGCGAATTGGGAAATCCAGAGGGAAAGGAAAAGCGAGGAAAGTGGCTCGCTTTTGGCGC
TGGGGAAGAGGTGTACAGTGAGCAGTCACGCTCAGAGCTGGCTTGGGGGACACTCTCACG
CTCAGGAGAGGGACAGAGCGACAGAGGCGCTCGCAGCAGCGCGCTGTACAGGTGCAACAG
CTTAGGCATTTCTATCCCTATTTTTACAGCGAGGGACACTGGGCCTCAGAAAGGGAAGTG
CCTTCCCAAGCTCCAACTGCTCATAAGCAGTCAACCTTGTCTAAGTCCAGGTCTGAAGTC
CTGGAGCGATTCTCCACCCACCACGACCACTCACCTACTCGCCTGCGCTTCACCTCACGT
GAGGATTTTCCAGGTTCCTCCCAGTCTCTGGGTAGGCGGGGAGCGCTTAGCAGGTATCAC
CTATAAGAAAATGAGAATGGGTTGGGGGCCGGTGCAAGACAAGAATATCCTGACTGTGAT
TGGTTGAATTGGCTGCCATTCCCAAAACGAGCTTTGGCGCCCGGTCTCATTCGTTCCCAG
CAGGCCCTGCGCGCGGCAACATGGCGGGGTCCAGGTGGAGGTCTTGAGGCTATCAGATCG
GTATGGCATTGGCGTCCGGGCCCGCAAGGCG
.
.
.
.
I have to count patterns in these sequences to achieve python script
import re
infile = open("seq.txt", 'r')
out = open("pat.txt", 'w')
pattern = re.compile("GAAAT", flags=re.IGNORECASE)
for line in infile:
line = line.strip("\n")
if line.startswith('>'):
name = line
else:
s = re.findall(pattern,line)
print '%s:%s' %(name,s)
out.write('%s:\t%s\n' %(name,len(s)))
But it is giving the wrong result. The script is reading line by line.
S1 : 0
S1 : 0
S1 : 0
S1 : 0
S2 : 0
S2 : 1
S2 : 0
S2 : 1
But I want output as follows:
S1 : 0
S2 : 2
Can anybody help?
Use a hit counter, zero it if line.startswith('>'). Increment by len(s) otherwise.
This code might be helpful for you:
import re
pattern = re.compile("GAAAT", flags=re.IGNORECASE)
with open('seq.txt') as f:
sections = f.read().split('\n\n')
for section in sections:
lines = section.split()
name = lines[0].lstrip('>')
data = ''.join(lines[1:])
print '{0}: {1}'.format(name, len(pattern.findall(data)))
Example output:
S1: 1
S2: 2
Notes:
It's assumed that two newline characters are used to separate every section as in the example.
It's assumed that every section name is preceded by a greater than (>) character as in the example.
If you already have a pattern, use pattern.findall(data) instead of re.findall(pattern, data)
You should gather input until you enter the next pattern. This would also solve the corner case of where your pattern crosses a line boundary (not sure if that "can" happen with your data, but it looks like it).
Use a counter. Also, have your print function inside the for loop, so it's going to iterate as many times as the else condition. Note that it's also not a good idea to use the variable line as both the iterator variable in the for loop and as another variable. It makes the code more confusing.
counter_dict = {}
for line in infile:
if line[0] == '>':
name = line[1:len(line) - 2]
counter_dict[name] = 0
else:
counter_dict[name] += len(re.findall(pattern,line))
for (key, val) in counter_dict.items():
print '%s:%s' %(key, val)
out.write('%s:\t%s\n' %(key, val)