How to write specific line lengths of a file? - python

I have these sequences (over 9000), like this:
>TsM_000224500
MTTKWPQTTVTVATLSWGMLRLSMPKVQTTYKVTQSRGPLLAPGICDSWSRCLVLRVYVDRRRPGGDGSLGRVAVTVVETGCFGSAASFSMWVFGLAFVVTIEEQLL
>TsM_000534500
MHSHIVTVFVALLLTTAVVYAHIGMHGEGCTTLQCQRHAFMMKEREKLNEMQLELMEMLMDIQTMNEQEAYYAGLHGAGMQQPLPMPIQ
>TsM_000355900
MESGEENEYPMSCNIEEEEDIKFEPENGKVAEHESGEKKESIFVKHDDAKWVGIGFAIGTAVAPAVLSGISSAAVQGIRQPIQAGRNNGETTEDLENLINSVEDDL
The lines starting with ">" are the IDs and the lines with the letters are the amino acid (aa) sequences. I need to delete (or move to another file) the sequences shorter than 40 aa or longer than 4000 aa.
The resulting file should then contain only the sequences within this range (>= 40 aa and <= 4000 aa).
I've tried writing the following script:
def read_seq(file_name):
    with open(file_name) as file:
        return file.read().split('\n')[0:]
ts = read_seq("/home/tiago/t_solium/ts_phtm0less.txt")
tsf = open("/home/tiago/t_solium/ts_secp-404k", 'w')
for x in range(len(ts)):
    if ([x][0:1] != '>'):
        if (len([x]) > 40 or len([x]) < 4000):
            tsf.write('%s\n'%(x))
tsf.close()
print "OK!"
I've made some modifications, but all I get are either empty files or files with all 9000+ sequences.

In your for loop, x is an iterating integer due to using range() (i.e, 0,1,2,3,4...). Try this instead:
for x in ts:
This will give you each element in ts as x
Also, you don't need the brackets around x; Python can iterate over the characters in strings on its own. When you put brackets around a string, you put it into a list, and thus if you tried, for example, to get the second character in x: [x][1], Python will try to get the second element in the list that you put x in, and will run into problems.
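To see concretely what the brackets do, here is a minimal sketch. The sample ID is taken from the question; the helper names (wrapped, list_slice, first_char) are only illustrative.

```python
x = ">TsM_000224500"

# Wrapping the string in brackets builds a one-element list.
wrapped = [x]

# Slicing the list returns a list, which never equals the string '>'.
list_slice = wrapped[0:1]

# Slicing the string itself returns its first character.
first_char = x[0:1]

# len() of the list is always 1, whatever the length of the string.
print(list_slice, first_char, len(wrapped), len(x))
```

This is why both tests in the original loop give the same answer for every line: the comparison with '>' is always unequal and the length is always 1.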
EDIT: To include the IDs, try this:
NOTE: I also changed if (len(x) > 40 or len(x) < 4000) to if (len(x) >= 40 and len(x) <= 4000). With or, the condition is true for every possible length, so every sequence gets written; and keeps only the lengths inside the range.
for i, x in enumerate(ts):  # NEW: enumerate ts to get the index of every iteration (stored as i)
    if x and x[0] != '>':  # skip empty lines and ID lines
        if len(x) >= 40 and len(x) <= 4000:
            tsf.write('%s\n' % ts[i-1])  # NEW: write the ID found on the preceding line
            tsf.write('%s\n' % x)
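Putting the pieces together, a self-contained version might look like the sketch below. It is not the asker's script: the function name filter_fasta, its parameters and the sample records are made up for illustration, and it additionally assumes that a sequence may be wrapped over several lines until the next ">" line.

```python
def filter_fasta(lines, min_len=40, max_len=4000):
    """Yield (header, sequence) pairs whose sequence length is in range.

    Assumes headers start with '>' and that a sequence may span
    several lines until the next header.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                seq = "".join(chunks)
                if min_len <= len(seq) <= max_len:
                    yield header, seq
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the last record, which has no header after it.
    if header is not None:
        seq = "".join(chunks)
        if min_len <= len(seq) <= max_len:
            yield header, seq


# Hypothetical sample: one record too short, one in range, one too long.
sample = [
    ">short\n", "MTTK\n",
    ">ok\n", "M" * 50 + "\n",
    ">long\n", "A" * 3000 + "\n", "A" * 2000 + "\n",
]
kept = list(filter_fasta(sample))
for header, seq in kept:
    print(header, len(seq))
```

Because the function only needs an iterable of lines, an open file object can be passed in directly, and each yielded pair can be written to the output file as two lines.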

Try this; it is simple and easy to understand. It does not load the entire file into memory; instead it iterates over the file line by line.
tsf = open('output.txt', 'w')  # open the output file
with open("yourfile", 'r') as ts:  # open the input file
    for line in ts:  # iterate over each line of the input file
        line = line.strip()  # remove whitespace at both ends, including the newline
        if not line or line[0] == '>':  # if the line is empty or an ID
            continue  # move to the next line
        else:  # otherwise
            if 40 <= len(line) <= 4000:  # if the length is in the required range
                tsf.write('%s\n' % line)  # write to the output file
tsf.close()  # done
print("OK!")
FYI, you could also use awk for a one line solution if working in unix environment:
cat yourinputfile.txt | grep -v '>' | awk 'length($0)>=40' | awk 'length($0)<=4000' > youroutputfile.txt


how to insert string line by line in list with python

I want to process every line of this string, extract only the drive letter (C, d, e) from each one, and then put the letters in a list.
The string:
1 C 1048576 30 GB IFS
2 d 1048576 30 GB IFS
1 e 1048576 30 GB IFS
I use this code, but the same value ends up in the list three times:
d = []
for line in data.split('\n')[1:]:
    dliver = ouput2[17]
    print('dliver',dliver)
    dliver1 = dliver+':'
    print(dliver1)
    d.append(dliver1)
    print('d',d)
Note: data is the variable that holds the string.
In this line
for line in data.split('\n')[1:]:
you get a variable called line that you can work with. But you never do.
If I understand the question correctly, you simply want
dliver = line[17]
instead of
dliver = ouput2[17]
First we create a list named output. Then we iterate over the lines of the string (stringer here). Each line is split on whitespace, and the second field of the line, which is the drive letter, is appended to output.
output = []
for i in stringer.split('\n'):
    splitted = i.split()  # split() with no argument already drops empty strings
    if len(splitted) > 1:
        output.append(splitted[1])
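As a worked example, here is a small sketch on the sample data. The header line is an assumption (the question's code skips the first line with [1:], so one is added here with invented column names), and drives is just an illustrative variable name.

```python
# Sample text from the question. The header line is hypothetical; it is
# included only because the question's loop skips the first line.
data = """Num Ltr Size Free FS
1 C 1048576 30 GB IFS
2 d 1048576 30 GB IFS
1 e 1048576 30 GB IFS"""

drives = []
for line in data.split("\n")[1:]:
    fields = line.split()       # split on any run of whitespace
    if len(fields) < 2:
        continue                # skip blank or malformed lines
    drives.append(fields[1] + ":")  # second field is the drive letter

print(drives)
```

Splitting on whitespace avoids depending on a fixed character position such as line[17], which breaks as soon as the column widths change.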

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
    # open all 4 files with a meaningful name
    file=[open(outputfile.txt","w")
    for line in a:
Without regex:
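for line in file:
    keep = []
    line = line.strip()
    # skip blank lines and the FRAME header
    if not line or line.startswith('FRAME'):
        continue
    # first field is the row number, second is "J=a,b"; ignore the rest
    first, second, *_ = line.split()
    keep.append(first)
    first, second = second.split('=')
    keep.extend(second.split(','))
    print(' '.join(keep))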
My advice? Since I don't write many regexes, I avoid writing big ones all at once. Since you've already done that, I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()
    for line in a.readlines():
        result = r.match(line)
        if result:
            print (result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
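Applied to the sample from the question, the pattern derived above could be used as in the following sketch. The sample text is embedded in a string so the block runs on its own; with a real file you would iterate over the open file instead. The names pattern, sample and rows are illustrative.

```python
import re

# Pattern from the explanation above, with optional leading whitespace.
pattern = re.compile(r"\s*(\d+)\s+J=(\d+),(\d+)")

sample = """FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0"""

rows = []
for line in sample.splitlines():
    match = pattern.match(line)
    if match:  # the FRAME header does not match and is skipped
        rows.append(" ".join(match.groups()))

print("\n".join(rows))
```

Since match() only anchors at the start of the line, the SEC, NSEG and ANG fields do not need to appear in the pattern at all.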

regular expressions in python using quotes

I am attempting to create a regular expression pattern for strings similar to the below which are stored in a file. The aim is to get any column for any row, the rows need not be on a single line. So for example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns, column 4 for entity 2 would be "column\"this is, a test\"4b", column 2 for entity 3 would be "column2c". Each column begins with a quote and closes with a quote, however you must be careful because some columns have escaped quotes. Thanks in advance!
You could do it like this:
Read the whole file.
Split the input on every newline character that is not preceded by a comma.
Iterate over the split elements and split each one again on every comma (and the optional newline character following it) that is preceded and followed by double quotes.
Code:
import re
with open(file) as f:
    fil = f.read()
m = re.split(r'(?<!,)\n', fil.strip())
for i in m:
    print(re.split(r'(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check..
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file name and f.py is the file-name which contains the python script.
Your problem is terribly similar to what I have to deal with thrice every month :) Except I'm not using Python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''
import re
# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column: a quoted string in which
# backslash-escaped characters (such as \") are allowed
check = re.compile(r'"(?:[^"\\]|\\.)*"')
# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print, if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print(matches)
        else:
            # use line.strip() if you need to trim whitespaces
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck if we get 4 columns
        # use line.strip() if you need to trim whitespaces
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete and print
        # We must not forget to empty buffer now that we got a whole line
        if len(matches) == columns:
            print(matches)
            buffer = ""
        # Optional; always good to have a safety backdoor though
        # If there is a problem with the csv itself like a weird unescaped
        # quote, you send it somewhere else
        elif len(matches) > columns:
            print("Error: cannot parse line:\n" + buffer)
            buffer = ""
ideone demo
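A variation on the same idea, sketched here under a few assumptions: the whole input fits in memory, every entity has exactly four columns, and line breaks inside a column are wrapping artifacts that can be dropped. Instead of buffering lines, it collects every quoted field in one pass and then groups the fields in fours. The names COLUMNS, field, fields and entities are illustrative.

```python
import re

# Sample from the question, without the "#entity" annotations.
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"'''

COLUMNS = 4  # columns per entity, as stated in the question

# A quoted field: any mix of ordinary characters (newlines included)
# and backslash-escaped characters, between two double quotes.
field = re.compile(r'"(?:[^"\\]|\\.)*"')

# Line breaks inside a field are removed, which assumes they are
# wrapping artifacts rather than data.
fields = [f.replace("\n", "") for f in field.findall(text)]

# Group the flat list of fields into entities of COLUMNS fields each.
entities = [fields[i:i + COLUMNS] for i in range(0, len(fields), COLUMNS)]

for entity in entities:
    print(entity)
```

With this layout, column 4 of entity 2 is entities[1][3]. A stray unescaped quote would shift every following field, so the line-by-line version above is easier to debug on dirty input.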

Fastq parser not taking empty sequence (and other edge cases). Python

This is a continuation of "Generator not working to split string by particular identifier. Python 2". However, I modified the code completely and it's not the same format at all. This question is about edge cases.
Edge Cases:
. when sequence length is different than number of quality values
. when there's an empty sequence or entry
. when the number of lines with quality values is more than one
I cannot figure out how to handle the edge cases above. If it's an empty data file, I still want to output empty strings. I'm trying with the sequences below as my input file. (A little background: IDs are marked by # at the beginning of a line, the sequence characters follow on the next lines until a line with + is reached, and the lines after that hold the quality values, where value = ord(char) - offset. This format is terrible and poorly thought out.)
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs
CTCTCTCATCACACACGAGGAGTGAAGAGAGAACCTCCTCTCCACACGTGGAGTGAGGAGATCCTCTCACACACGTGAGGTGTTGAGAGAGATACTCTCTCATCACCTCACGTGAGGAGTGAGAGAGAT
+
{~~~~~sXNL>>||~~fVM~jtu~&&(uxy~f8YHh=<gA5
''<O1A44N'`oK57(((G&&Q*Q66;"$$Df66E~Z\ZMO>^;%L}~~~~~Q.~~~~x~#-LF9>~MMqbV~ABBV=99mhIwGRR~
#different_number_of_seq_qual
ATCG
+
**!
#this_should_work
GGGG
+
****
For the records with an error, I'm trying to replace the seq and qual strings with empty strings:
seq,qual = '',''
Here's my code so far. These edge cases are so difficult for me to figure out please help . . .
def read_fastq(input, offset):
    """
    Inputs a fastq file and reads each line at a time. 'offset' parameter can be set to 33 (phred+33 encoding
    fastq), and 64. Yields a tuple in the format (ID, comments for a sequence, sequence, [integer quality values])
    Capable of reading empty sequences and empty files.
    """
    ID, comment, seq, qual = None,'','',''
    step = 1 #step is a variable that organizes the order fastq parsing
    #step= 1 scans for ID and comment line
    #step= 2 adds relevant lines to sequence string
    #step= 3 adds quality values to string
    for line in input:
        line = line.strip()
        if step == 1 and line.startswith('#'): #Step system from Nedda Saremi
            if ID is not None:
                qual = [ord(char)-offset for char in qual] #Converts from phred encoding to integer values
                sep = None
                if ' ' in ID: sep = ' '
                if sep is not None:
                    ID, comment = ID.split(sep,1) #Separates ID and comment by ' '
                yield ID, comment, seq, qual
                ID,comment,seq,qual = None,'','','' #Resets variable for next sequence
            ID = line[1:]
            step = 2
            continue
        if step==2 and not line.startswith('#') and not line.startswith('+'):
            seq = seq + line.strip()
            continue
        if step == 2 and line.startswith('+'):
            step = 3
            continue
        while step == 3:
            #process the quality data
            if len(qual) == len(seq):
                #once the length of the quality seq and seq are the same, end gathering data
                step = 1
                continue
            if len(qual) < len(seq):
                qual = qual + line.strip()
                if len(qual) < len(seq):
                    step = 3
                    continue
            if (len(qual) > len(seq)):
                sys.stderr.write('\nError: ' + ID + ' sequence length not equal to quality values\n')
                comment,seq,qual= '','',''
                ID = line
                step = 1
                continue
            break
    if ID is not None:
        #Section reserved for last entry in file
        if len(qual) > 0:
            qual = [ord(char)-offset for char in qual]
        sep = None
        if ' ' in ID: sep = ' '
        if sep is not None:
            ID, comment = ID.split(sep,1)
        if len(seq) == 0: ID,comment,seq,qual= '','','',''
        yield ID, comment, seq, qual
My output skips the ID #m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs and adds #**!, which should not be in the output:
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
Error: different_number_of_seq_qual sequence length not equal to quality values
#**!
+
#this_should_work
GGGG
+
****
You probably should use BioPython.
Your bug appears to be that the read that is skipped has 129 bases in its sequence but only 128 quality values. So your parser reads the next defline as a quality line, which makes the quality string too long, so it prints the error.
Also, your states don't account for the situation where you are in step 1 but don't see a defline. So you keep reading extra lines, overwriting the ID variable.
But if you really want to write your own parser:
I'll address your questions one at a time.
when sequence length is different than number of quality values
This is invalid. Each record in the fastq file must have an equal number of bases and qualities. Different records in the file can have different lengths from each other, but within each record the number of bases and the number of qualities must match.
when there's an empty sequence or entry
An empty read will have blank lines for the sequence and quality lines like this:
#SOLEXA1_0007:1:9:610:1983#GATCAG/2

+SOLEXA1_0007:1:9:610:1983#GATCAG/2

#SOLEXA1_0007:2:13:163:254#GATCAG/2
CGTAGTACGATATACGCGCGTGTACTGCTACGTCTCACTTTCGCAAGATTGCTCAGCTCATTGATGCTCAATGCTGGGCCATATCTCTTTTCTTTTTTTC
+SOLEXA1_0007:2:13:163:254#GATCAG/2
HHHHGHHEHHHHHE=HAHCEGEGHAG>CHH>EG5#>5*ECE+>AEEECGG72B&A*)569B+03B72>5.A>+*A>E+7A#G<CAD?#############
when the number of lines with quality values is more than one
Because of the requirement from the first answer above, we know that the number of bases and qualities must match. Also, there will never be a + character in the sequence block. So we can keep parsing the sequence block until we see a line that starts with +; then we know we are done parsing the sequence. After that we keep parsing quality lines until we have as many qualities as there are bases in the sequence. We can't rely on looking for any special characters because, depending on the quality encoding, # could be a valid quality call.
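The approach just described could be sketched as follows. This is not the asker's function and not a replacement for a tested library: the signature, the marker parameter (set to "#" only to match the sample data in the question) and the sample records are assumptions made for illustration. A record whose quality length does not match its sequence length is yielded with an empty sequence and an empty quality list, which is what the question asks for. Note that a record with too few quality characters cannot be detected reliably this way, because the parser will consume the following lines as quality data.

```python
def read_fastq(lines, marker="#", offset=33):
    """Yield (ID, sequence, qualities) for each record in *lines*.

    A record whose quality string does not match the sequence length is
    yielded with an empty sequence and an empty quality list.
    """
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line.startswith(marker):
            continue  # not a definition line; skip it
        ident = line[len(marker):]

        # Sequence block: everything up to the line starting with '+'.
        seq = ""
        for line in it:
            line = line.strip()
            if line.startswith("+"):
                break
            seq += line

        # Quality block: keep reading until there are as many quality
        # characters as bases. The marker is NOT treated specially here,
        # because it can be a valid quality character.
        qual = ""
        while len(qual) < len(seq):
            line = next(it, None)
            if line is None:  # input ended early
                break
            qual += line.strip()

        if len(qual) != len(seq):
            seq, qual = "", ""  # invalid record
        yield ident, seq, [ord(c) - offset for c in qual]


# Hypothetical sample: a wrapped record whose quality starts with the
# marker, an empty read, and a plain record.
sample = [
    "#read1\n", "ACGT\n", "AC\n", "+\n", "#II\n", "III\n",
    "#empty\n", "\n", "+\n", "\n",
    "#read3\n", "GGGG\n", "+\n", "****\n",
]
records = list(read_fastq(sample))
for record in records:
    print(record)
```

The inner loops share the same iterator as the outer loop, so each line is consumed exactly once and the outer loop resumes right after the quality block.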
Also as an aside, you appear to be splitting the sequence defline to parse out the optional comment. You have to be careful for CASAVA 1.8 format which stupidly has spaces. So you might need a regex to see if it's a CASAVA 1.8 format then don't split on whitespace etc.
Have you considered using one of the robust Python packages that are available for dealing with this kind of data rather than writing a parser from scratch? In particular I'd recommend checking out HTSeq.

Bash or Python to go backwards?

I have a text file which has a lot of random occurrences of the string #STRING_A, and I would be interested in writing a short script which removes only some of them. In particular, one that scans the file and, once it finds a line which starts with this string, like
#STRING_A
then checks if 3 lines backwards there is another occurrence of a line starting with the same string, like
#STRING_A
#STRING_A
and if it happens, to delete the occurrence 3 lines backward. I was thinking about bash, but I do not know how to "go backwards" with it. So I am sure that this is not possible with bash. I also thought about python, but then I should store all information in memory in order to go backwards and then, for long files it would be unfeasible.
What do you think? Is it possible to do it in bash or python?
Thanks
Funny that after all these hours nobody's yet given a solution to the problem as actually phrased (as John Machin points out in a comment): remove just the leading marker (if followed by another such marker 3 lines down), not the whole line containing it. It's not hard, of course. Here's a tiny mod as needed of truppo's fun solution, for example:
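from itertools import izip, chain
f = "foo.txt"
# pad the first iterator with three placeholders, so that `third`
# is always the line three positions before `line`
for third, line in izip(chain([""] * 3, open(f)), open(f)):
    if third.startswith("#STRING_A") and line.startswith("#STRING_A"):
        line = line[len("#STRING_A"):]
    print line,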
Of course, in real life, one would use itertools.tee instead of reading the file twice, have this code in a function, not repeat the marker constant endlessly, &c ;-).
Of course Python will work as well. Simply store the last three lines in an array and check if the first element in the array is the same as the value you are currently reading. Then delete the value and print out the current array. You would then move over your elements to make room for the new value and repeat. Of course when the array is filled you'd have to make sure to continue to move values out of the array and put in the newly read values, stopping to check each time to see if the first value in the array matches the value you are currently reading.
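The buffering idea described above can be sketched in Python 3 with a collections.deque. This is only an illustration under stated assumptions: the function name strip_earlier_marker and its parameters are hypothetical, and it removes just the leading marker from the earlier line (change the marked line to skip the yield if the whole line should go).

```python
from collections import deque


def strip_earlier_marker(lines, marker="#STRING_A", gap=3):
    """Yield lines, removing a leading *marker* from any line that is
    followed, exactly *gap* lines later, by another line starting with it."""
    window = deque()  # holds at most *gap* pending lines
    for line in lines:
        if len(window) == gap:
            old = window.popleft()  # the line *gap* positions back
            if old.startswith(marker) and line.startswith(marker):
                old = old[len(marker):]  # drop only the leading marker
            yield old
        window.append(line)
    # Lines still in the window have no line *gap* positions after them.
    yield from window


sample = ["#STRING_A one\n", "x\n", "y\n", "#STRING_A two\n", "z\n"]
result = list(strip_earlier_marker(sample))
print("".join(result), end="")
```

Only gap lines are held in memory at any time, so the size of the file does not matter; an open file object can be passed as lines.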
Here is a more fun solution, using two iterators with a three element offset :)
from itertools import izip, chain, tee
f1, f2 = tee(open("foo.txt"))
# three placeholders give the first iterator its three element offset
for third, line in izip(chain([""] * 3, f1), f2):
    if not (third.startswith("#STRING_A") and line.startswith("#STRING_A")):
        print line,
Why shouldn't it be possible in bash? You don't need to keep the whole file in memory, just the last three lines (if I understood correctly), and write what's appropriate to standard output. Redirect that into a temporary file, check that everything worked as expected, and overwrite the source file with the temporary one.
Same goes for Python.
I'd provide a script of my own, but that wouldn't be tested. ;-)
As AlbertoPL said, store lines in a fifo for later use--don't "go backwards". For this I would definitely use python over bash+sed/awk/whatever.
I took a few moments to code this snippet up:
from collections import deque
line_fifo = deque()
for line in open("test"):
    line_fifo.append(line)
    if len(line_fifo) == 4:
        # "look 3 lines backward"
        if line_fifo[0] == line_fifo[-1] == "#STRING_A\n":
            # get rid of that match
            line_fifo.popleft()
        else:
            # print out the top of the fifo
            print line_fifo.popleft(),
# don't forget to print out the fifo when the file ends
for line in line_fifo: print line,
This code will scan through the file and remove lines starting with the marker when another such line follows three lines later. It keeps only three lines in memory by default:
from collections import deque
def delete(fp, marker, gap=3):
    """Delete lines from *fp* if they start with *marker* and are followed
    by another line starting with *marker* *gap* lines after.
    """
    buf = deque()
    for line in fp:
        if len(buf) < gap:
            buf.append(line)
        else:
            old = buf.popleft()
            if not (line.startswith(marker) and old.startswith(marker)):
                yield old
            buf.append(line)
    for line in buf:
        yield line
I've tested it with:
>>> from StringIO import StringIO
>>> fp = StringIO('''a
... b
... xxx 1
... c
... xxx 2
... d
... e
... xxx 3
... f
... g
... h
... xxx 4
... i''')
>>> print ''.join(delete(fp, 'xxx'))
a
b
xxx 1
c
d
e
xxx 3
f
g
h
xxx 4
i
This "answer" is for lyrae ... I'll amend my previous comment: if the needle is in the first 3 lines of the file, your script will either cause an IndexError or access a line that it shouldn't be accessing, sometimes with interesting side-effects.
Example of your script causing IndexError:
>>> lines = "#string line 0\nblah blah\n".splitlines(True)
>>> needle = "#string "
>>> for i,line in enumerate(lines):
...     if line.startswith(needle) and lines[i-3].startswith(needle):
...         lines[i-3] = lines[i-3].replace(needle, "")
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IndexError: list index out of range
and this example shows not only that the Earth is round but also why your "fix" to the "don't delete the whole line" problem should have used .replace(needle, "", 1) or [len(needle):] instead of .replace(needle, "")
>>> lines = "NEEDLE x NEEDLE y\nnoddle\nnuddle\n".splitlines(True)
>>> needle = "NEEDLE"
>>> # Expected result: no change to the file
... for i,line in enumerate(lines):
...     if line.startswith(needle) and lines[i-3].startswith(needle):
...         lines[i-3] = lines[i-3].replace(needle, "")
...
>>> print ''.join(lines)
x y <<<=== whoops!
noddle
nuddle
<<<=== still got unwanted newline in here
>>>
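The difference between the three ways of removing the needle can be checked in isolation. The following is a small sketch using the same sample line as above; the variable names are illustrative.

```python
needle = "NEEDLE"
line = "NEEDLE x NEEDLE y\n"

all_removed = line.replace(needle, "")       # removes every occurrence
first_removed = line.replace(needle, "", 1)  # removes only the first one
sliced = line[len(needle):]                  # drops the leading needle

print(repr(all_removed))
print(repr(first_removed))
print(repr(sliced))
```

Slicing is only safe after line.startswith(needle) has been checked, since it drops the first len(needle) characters whatever they are.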
My awk-fu has never been that good... but the following may provide you what you're looking for in a bash-shell/shell-utility form:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
if(LAST!="" && LAST+3 >= NR) print LAST "d"
LAST = NR
}' test_file` test_file
Basically... awk is producing a command for sed to strip certain lines. I'm sure there's a relatively easy way to make awk do all of the processing, but this does seem to work.
The bad part? It does read the test_file twice.
The good part? It is a bash/shell-utility implementation.
Edit: Alex Martelli points out that the sample file above might have confused me. (my above code deletes the whole line, rather than the #STRING_A flag only)
This is easily remedied by adjusting the command to sed:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
if(LAST!="" && LAST+3 >= NR) print LAST "s/#STRING_A//"
LAST = NR
}' test_file` test_file
This may be what you're looking for?
lines = open('sample.txt').readlines()
needle = "#string "
for i,line in enumerate(lines):
    if line.startswith(needle) and lines[i-3].startswith(needle):
        lines[i-3] = lines[i-3].replace(needle, "")
print ''.join(lines)
this outputs:
string 0 extra text
string 1 extra text
string 2 extra text
string 3 extra text
--replaced -- 4 extra text
string 5 extra text
string 6 extra text
#string 7 extra text
string 8 extra text
string 9 extra text
string 10 extra text
In bash you can use tac filename (part of GNU coreutils) to read the file backwards.
lines=$(tac filename)
# now iterate through the lines and do your checking
I would consider using sed. GNU sed supports the definition of line ranges. If sed fails, there is another beast, awk, and I'm sure you can do it with awk.
O.K. I feel I should post my awk POC. I could not figure out how to use sed addresses. I have not tried a combination of awk+sed, but it seems to me it's overkill.
My awk script works as follows:
It reads lines and stores them in a 3-line buffer.
Once the desired pattern is found (/^data.*/ in my case), the 3-line buffer is searched to check whether the desired pattern has been seen three lines ago.
If the pattern has been seen, the 3 lines are scratched.
To be honest, I would probably go with Python as well, given that awk is really awkward.
The AWK code follows:
function max(a, b)
{
    if (a > b)
        return a;
    else
        return b;
}
BEGIN {
    w = 0; #write index
    r = 0; #read index
    buf[0, 1, 2]; #buffer
}
END {
    # flush buffer
    # start at read index and print out up to w index
    for (k = r % 3; k r - max(r - 3, 0); k--) {
        #search in 3 line history buf
        if (match(buf[k % 3], /^data.*/) != 0) {
            # found -> remove lines from history
            # by rewriting them -> adjust write index
            w -= max(r, 3);
        }
    }
    buf[w % 3] = $0;
    w++;
}
/^.*/ {
    # store line into buffer, if the history
    # is full, print out the oldest one.
    if (w > 2) {
        print buf[r % 3];
        r++;
        buf[w % 3] = $0;
    }
    else {
        buf[w] = $0;
    }
    w++;
}
