UCSC BLAT output python

Is there a way I can get the position number of the mismatch from the following BLAT result using Python?
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< |||||||||||||||||||||||||||  |||||||||||||||| <<<<<<<<
41629392 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41629348
As we can see, there are two mismatches in the above output. Can we get the position numbers of the mismatches/mutations using Python? This is also how the alignment appears in the page source, so I'm a little confused about how to proceed.
Thank you.

You can find the mismatches using the .find method of a string. Mismatches are indicated by a space (' ') in the middle line of the BLAT output, so that is what we search for. I don't know BLAT personally, so I'm not sure whether the output always comes in groups of three lines, but assuming it does, the following function returns a list of mismatch positions. Each entry is a tuple: the mismatch position in the top sequence, and the corresponding position in the bottom sequence.
blat_src = """00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< |||||||||||||||||||||||||||  |||||||||||||||| <<<<<<<<
41629392 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41629348"""
def find_mismatch(blat):
    # break the blat input into lines
    lines = blat.split("\n")
    # give some friendly names to the different lines
    seq_a = lines[0]
    seq_b = lines[2]
    # We're not interested in the '<' and '>' so we strip them out with a slice
    matchstr = lines[1][9:-9]
    # Get the integer values of the starts of each sequence segment
    pos_a = int(seq_a[:8])
    pos_b = int(seq_b[:8])
    results = []
    # find the index of the first space character, mmpos = mismatch position
    mmpos = matchstr.find(" ")
    # loop while a space exists (find returns -1 if none is found)
    while mmpos != -1:
        # the position of the mismatch is the start position of the
        # sequence plus the index within the segment
        results.append((pos_a + mmpos, pos_b + mmpos))
        # search the rest of the string (from mmpos+1 onwards)
        mmpos = matchstr.find(" ", mmpos + 1)
    return results
print(find_mismatch(blat_src))
Which produces
[(28, 41629419), (29, 41629420)]
This tells us that positions 28 and 29 (indexed according to the top sequence) or positions 41629419 and 41629420 (indexed according to the bottom sequence) are mismatched.
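One caveat: in this particular alignment the bottom coordinates run downward (41629392 on the left, 41629348 on the right), so adding the offset to the bottom start may not give the actual coordinate. Here is a minimal sketch, assuming every sequence line has the form "start sequence end" and that the end coordinate tells us which way to count. It compares the two sequences directly instead of relying on the match line. The name find_mismatches_directional is made up for illustration, and gapped alignments would need extra handling.

```python
def find_mismatches_directional(blat_text):
    """Return (top_pos, bottom_pos) for every column where the sequences differ."""
    lines = blat_text.splitlines()
    # Each sequence line is assumed to look like: "<start> <sequence> <end>"
    top_start, top_seq, top_end = lines[0].split()
    bot_start, bot_seq, bot_end = lines[2].split()
    top_start, top_end = int(top_start), int(top_end)
    bot_start, bot_end = int(bot_start), int(bot_end)
    # Count upward or downward depending on which way the coordinates run
    top_step = 1 if top_end >= top_start else -1
    bot_step = 1 if bot_end >= bot_start else -1
    return [
        (top_start + i * top_step, bot_start + i * bot_step)
        for i, (a, b) in enumerate(zip(top_seq, bot_seq))
        if a != b
    ]

blat_src = """00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< |||||||||||||||||||||||||||  |||||||||||||||| <<<<<<<<
41629392 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41629348"""

print(find_mismatches_directional(blat_src))
# [(28, 41629365), (29, 41629364)]
```

For the top sequence this agrees with the answer above; only the bottom coordinates differ, because they are counted down from 41629392.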

Reading from more than one line after keyword?

I have an output file which prints out a matrix of numeric data. I need to search through this file for the identifier at the start of each data set, which is:
GROUP 1 FIRST 1 LAST 163
Here GROUP 1 is the first column of the matrix, FIRST 1 is the first non-zero element of this matrix in position 1, and LAST 163 is the last non-zero element of the matrix in position 163. The matrix doesn't necessarily end at this LAST value - in this case there are 172 values.
I want to read this data into a simpler form to work with. Here is an example of the first two column results:
GROUP 1 FIRST 1 LAST 163
7.150814E-02 9.866657E-03 8.500540E-04 1.818338E-03 2.410691E-03 3.284499E-03 3.011986E-03 1.612432E-03
1.674247E-03 3.436244E-03 3.655873E-03 4.056876E-03 4.560725E-03 2.462454E-03 2.567764E-03 5.359393E-03
5.457415E-03 2.679373E-03 2.600020E-03 2.491592E-03 2.365089E-03 2.228494E-03 5.792616E-03 1.623274E-03
1.475062E-03 1.331820E-03 1.195052E-03 2.832699E-03 7.298341E-04 6.301271E-04 1.377459E-03 1.048925E-03
1.677453E-04 3.580640E-04 1.575301E-04 1.150545E-04 1.197719E-04 2.950028E-05 5.380539E-05 1.228784E-05
1.627659E-05 4.522051E-05 7.736908E-06 1.758838E-05 8.161204E-06 6.103670E-06 6.431876E-06 1.585671E-06
4.110246E-06 4.512924E-07 2.775227E-06 5.107739E-07 1.219448E-06 1.653674E-07 4.429047E-07 4.837661E-07
2.036820E-07 3.449548E-07 1.457648E-07 4.494116E-07 1.629392E-07 1.300509E-07 1.730199E-07 8.130338E-08
1.591993E-08 5.457638E-08 1.713141E-08 7.806754E-09 1.154869E-08 3.545961E-09 2.862203E-09 2.289470E-09
4.324002E-09 2.243199E-09 2.627165E-09 2.273119E-09 1.973867E-09 1.710714E-09 1.468845E-09 1.772236E-09
1.764492E-09 1.004393E-09 1.044698E-09 5.201382E-10 2.660613E-10 3.012732E-10 2.630323E-10 4.381052E-10
2.521794E-10 9.213524E-11 2.619283E-10 3.591906E-11 1.449830E-10 1.867363E-11 1.230445E-10 1.108149E-11
2.775004E-11 1.156249E-11 4.393752E-11 5.318751E-11 6.815569E-12 1.817489E-11 2.044674E-11 2.044673E-11
1.931080E-11 1.931076E-11 1.817484E-11 2.044668E-11 5.486837E-12 7.681572E-12 1.536314E-11 7.132886E-12
8.230253E-12 1.426577E-11 1.426577E-11 4.389468E-12 5.925780E-12 2.853153E-12 2.853153E-12 5.706307E-12
5.706307E-12 2.194733E-12 3.292099E-12 5.267358E-12 2.194733E-12 3.072626E-12 4.828412E-12 4.389466E-12
4.389465E-12 1.097366E-11 2.194732E-12 1.316839E-11 2.194732E-12 1.608784E-11 1.674222E-11 1.778860E-11
6.993074E-12 2.622402E-12 9.090994E-12 5.769285E-12 1.573441E-12 6.861030E-12 4.782885E-12 8.768619E-13
2.311727E-12 3.188589E-12 4.393636E-12 3.844430E-12 4.256331E-12 1.235709E-12 2.746020E-12 2.746020E-12
8.238059E-13 2.608719E-12 1.445203E-12 4.817344E-13 1.445203E-12 7.609642E-14 2.536547E-13 2.000924E-13
7.075681E-14 7.075681E-14 3.056704E-14
GROUP 2 FIRST 2 LAST 168
6.740271E-02 8.310813E-03 3.609403E-03 1.307012E-03 2.949375E-03 3.605043E-03 1.612647E-03 1.640960E-03
3.597806E-03 4.022993E-03 4.289805E-03 4.480576E-03 2.352539E-03 2.415121E-03 5.018262E-03 5.188098E-03
2.589224E-03 2.546116E-03 2.472462E-03 2.374431E-03 2.260519E-03 5.981164E-03 1.700972E-03 1.556116E-03
1.410140E-03 1.273499E-03 3.061941E-03 7.995844E-04 6.967963E-04 1.553994E-03 1.216266E-03 1.997540E-04
4.426460E-04 1.990445E-04 1.470610E-04 1.539762E-04 3.814900E-05 7.024764E-05 1.611156E-05 2.136422E-05
5.984886E-05 1.035646E-05 2.363444E-05 1.105747E-05 8.308678E-06 8.789299E-06 2.257693E-06 5.807418E-06
6.248625E-07 3.822327E-06 6.987942E-07 1.660586E-06 2.240283E-07 5.983062E-07 6.513773E-07 2.735403E-07
4.614998E-07 1.940877E-07 5.895136E-07 2.081549E-07 1.662117E-07 2.316650E-07 1.101916E-07 2.162701E-08
7.493990E-08 2.341661E-08 1.072330E-08 1.606536E-08 4.945307E-09 3.936301E-09 3.147244E-09 5.945972E-09
3.108514E-09 3.682241E-09 3.210760E-09 2.795020E-09 2.436545E-09 2.118219E-09 2.612622E-09 2.586657E-09
1.432507E-09 1.457386E-09 7.264341E-10 3.803348E-10 4.514677E-10 3.959518E-10 6.541553E-10 3.707172E-10
1.334816E-10 3.875547E-10 5.294296E-11 2.294557E-10 2.790137E-11 1.719152E-10 1.408339E-11 3.526731E-11
1.469469E-11 5.583990E-11 6.759567E-11 8.766360E-12 2.337697E-11 2.629908E-11 2.629908E-11 2.483802E-11
2.483802E-11 2.337697E-11 2.629908E-11 7.112706E-12 9.957791E-12 1.991557E-11 9.246516E-12 1.066906E-11
1.849303E-11 1.849303E-11 5.690165E-12 7.681722E-12 3.698607E-12 3.698607E-12 7.397214E-12 7.397214E-12
2.845082E-12 4.267624E-12 6.828199E-12 2.845082E-12 3.983115E-12 6.259180E-12 5.690165E-12 5.690165E-12
1.422541E-11 2.845082E-12 1.707049E-11 2.845082E-12 2.095991E-11 2.193285E-11 2.330364E-11 1.096642E-11
4.112407E-12 1.425635E-11 8.906802E-12 2.429128E-12 1.106603E-11 8.097092E-12 1.484468E-12 3.913596E-12
5.398063E-12 8.624785E-12 7.546689E-12 8.355261E-12 2.425721E-12 5.390492E-12 5.390492E-12 1.617147E-12
5.120967E-12 2.710198E-12 9.033993E-13 2.710198E-12 3.744092E-13 1.248030E-12 6.614939E-13 4.359798E-13
4.359798E-13 1.364861E-13 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15
What I have at the moment works, except it only reads in the first line after the GROUP keyword line. How can I make it continue reading the data in until it reaches the next GROUP keyword?
import re
import io
file_name = "test_data.txt"
group_pattern = re.compile(r"GROUP +\d+ FIRST +(?P<first>\d+) LAST +(?P<last>\d+)")
def read_data_from_file(file_name, start_identifier, end_identifier):
    results = []
    longest = 0
    with open(file_name) as file:
        t = file.read()
        t=t[t.find('MACRO'):]
        t=t[t.find(start_identifier)+len(start_identifier):t.find(end_identifier)]
        t=io.StringIO(t)
        for line in t:
            match = group_pattern.search(line)
            if match:
                first = int(match.group('first'))
                last = int(match.group('last'))
                data = [float(value) for value in next(t).split()]
                row = [0.0] * last
                for i, value in enumerate(data, start=first-1):
                    row[i] = value
                longest = max(longest, len(row))
                results.append(row)
    for row in results:
        if len(row) < longest:
            row.extend([0.0] * (longest-len(row)))
    return results
start_identifier = "SCATTER MOMENT 1"
end_identifier = "SCATTER MOMENT 2"
results = read_data_from_file(file_name, start_identifier, end_identifier)
print(results)
What I want the code to produce is a matrix with just the numerical data. In this case it would be size [2x168] but my full data set is [172x172]. I want every GROUP to be read in as a row of the matrix, and zeroes filled into every element not specified in the output data. The current code does almost all of this, except that it only reads the first line of data after the GROUP keyword line.
So I took a look at the data you provided in your question and found what I think is a better and simpler way of pulling those data points out of the file. However, I noticed that your code also looks for other things in the file that weren't in the test data you posted, so you may have to adapt this a little to work with your dataset.
def read_data_from_file(file_name):
    with open(file_name) as fp:
        index = -1
        matrices = []
        # Iterate over the file line by line instead of reading it all at once. Reduces memory usage
        for line in fp:
            # Since headers are always on their own line and data lines always begin with
            # two spaces, we can just look for lines that start with two spaces.
            # If we find a line without these spaces then it's the header line: add a new
            # list to matrices and add one to index
            if not line.startswith('  '):
                index += 1
                matrices.append([])
            else:
                # Slice the str at index 2 to ignore the first two spaces
                # Then split by two spaces to get each data point
                str_data_points = line[2:].split('  ')
                # Map the string data points to floats
                float_data_points = map(float, str_data_points)
                # Add those float data points to the list in matrices via index
                matrices[index].extend(float_data_points)
    # Pad every row with zeros up to the length of the longest row
    max_matrix_length = max(map(len, matrices))
    for matrix in matrices:
        matrix.extend([0.0] * (max_matrix_length - len(matrix)))
    return matrices
Here's my solution to read the data from the .txt file and produce a matrix-like output (0.0 padded at the end of each group)
import re
def read_data_from_file(filepath):
    GROUP_DATA = []
    MAX_ELEMENT_COUNT = 0
    with open(filepath) as f:
        for line in f:
            if 'GROUP' in line:
                # Start a new row; the last number on the header line is the LAST index
                GROUP_DATA.append([])
                MAX_ELEMENT_COUNT = max(MAX_ELEMENT_COUNT, int(re.findall(r'\d+', line)[-1]))
            else:
                values = line.split(' ')
                for value in values:
                    try:
                        GROUP_DATA[-1].append(float(value))
                    except ValueError:
                        # Skip empty strings and other non-numeric tokens
                        pass
    for DATA in GROUP_DATA:
        if len(DATA) < MAX_ELEMENT_COUNT:
            DATA += [0.0] * (MAX_ELEMENT_COUNT - len(DATA))
    return GROUP_DATA
For the data in the given question saved into data.txt, the output would be as follows:
>>> import numpy as np  # just to check the output shape
>>> mat = read_data_from_file('data.txt')
>>> np.shape(mat)
(2, 168)
The output shape is as expected, and the size of the output matrix adapts to the given data.
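Both answers above pad each row at the end only. If the values of a group are meant to start at column FIRST, as the question's own code assumes with enumerate(data, start=first-1), the row also needs leading zeros. The sample is consistent with that reading: GROUP 1 lists 163 values (positions 1 to 163) and GROUP 2 lists 167 values (positions 2 to 168). A minimal sketch under that assumption follows; read_groups and GROUP_RE are illustrative names, and the function takes any iterable of lines (for example an open file).

```python
import re

# Header lines look like: "GROUP 1 FIRST 1 LAST 163"
GROUP_RE = re.compile(r"GROUP\s+\d+\s+FIRST\s+(\d+)\s+LAST\s+(\d+)")

def read_groups(lines):
    """Build one zero-padded row per GROUP from an iterable of text lines."""
    rows = []
    for line in lines:
        match = GROUP_RE.search(line)
        if match:
            first = int(match.group(1))
            # Assumption: the first listed value belongs in column FIRST,
            # so the row starts with FIRST-1 zeros
            rows.append([0.0] * (first - 1))
        elif rows and line.strip():
            # Any other non-blank line continues the current group
            rows[-1].extend(float(value) for value in line.split())
    # Pad all rows on the right to the length of the longest one
    width = max((len(row) for row in rows), default=0)
    for row in rows:
        row.extend([0.0] * (width - len(row)))
    return rows

sample = [
    "GROUP 1 FIRST 1 LAST 3",
    "1.0 2.0",
    "3.0",
    "GROUP 2 FIRST 2 LAST 4",
    "5.0 6.0 7.0",
]
print(read_groups(sample))
# [[1.0, 2.0, 3.0, 0.0], [0.0, 5.0, 6.0, 7.0]]
```

With a real file this could be called as read_groups(open('data.txt')). Because every non-header line is appended to the current group, it keeps reading until the next GROUP line, which is what the question asks for.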

Grabbing parts of a line until a white space

I have a line of text that looks something like:
2018-05-22 00:00:00 STATUS ERROR_CODE /home/etm124/script.py ANOTHER_MSG
What I want to do is grab the script name. I cannot split on white space because the STATUS could be more than one word; however, the script value always starts at index 411 of the line. I am currently trying to do something like:
with open(my_log, 'r') as fp:
    for line in fp:
        if line[45] == '7': #ERROR_CODE
            print line[411: {white_space?}]
You could use str.find with an offset:
offset = 411
line[offset:line.find(" ",offset)]
It's fast (one slice only), but there is a problem: if there's no space after the offset, find returns -1, so the slice stops one character short of the end and you lose the last character.
The alternative is slicing and then splitting/partitioning (this works even if there's no space afterwards):
line[411:].split()[0]
Some more intricate code to handle the "missing space" and only perform 1 slice & 1 find would be:
offset = 411
spacepos = line.find(" ",offset)
line[offset:spacepos if spacepos != -1 else None]
So if find returns -1, the slice runs to the end of the string.
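To see the difference between the variants, here is a small sketch that wraps the last one in a helper. The name token_at is hypothetical, and the sample line is far shorter than a real log line, so the offset is looked up instead of hard-coded to 411.

```python
def token_at(line, offset):
    """Return the text from offset up to the next space (or the end of the line)."""
    end = line.find(" ", offset)
    # find returns -1 when there is no space after offset: slice to the end
    return line[offset:] if end == -1 else line[offset:end]

line = "2018-05-22 00:00:00 STATUS ERROR_CODE /home/etm124/script.py ANOTHER_MSG"
offset = line.index("/home")  # stand-in for the fixed offset 411 in the real log

print(token_at(line, offset))        # /home/etm124/script.py
print(token_at("abc /tmp/x.py", 4))  # /tmp/x.py (no trailing space, nothing lost)
```

The naive line[offset:line.find(" ", offset)] would return /tmp/x.p for the second call, which is the lost character mentioned above.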

Python DNA sequence slice gives \N as wrong content in slice result

I am surprised by this: I am using Python to slice a long DNA sequence (4699673 characters) into substrings of a fixed length. It works, but there is a problem in the result: after 71 good results, \n starts to appear in the slices for a few results, then the slices are correct again, and so on for the whole file.
The code:
import sys
filename = open("out_filePU.txt",'w')
sys.stdout = filename
my_file = open("GCF_000005845.2_ASM584v2_genomic_edited.fna")
st = my_file.read()
length = len(st)
print ( 'Sequence Length is, :' ,length)
for i in range(0,len(st[:-9])):
    print(st[i:i+9], i)
(A figure in the original post showed the error in the result file.)
Please, I need advice on this.
Your sequence file contains multiple lines, and at the end of each line there is a line break \n. You can remove them with st = my_file.read().replace("\n", "").
Try st = re.sub('\\s', '', my_file.read()) to replace any newlines or other whitespace (you'll need to add import re at the top of your script).
If you want non-overlapping slices, also use for i in range(0,len(st[:-9]),9): to step through your data in increments of nine characters. Otherwise you only advance by one character each time, which is why you see the diagonal patterns in your output.
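Putting the suggestions above together, here is a minimal sketch that strips the line breaks and then yields fixed-length slices. The function names are illustrative. Skipping lines that start with ">" is an assumption about the file: FASTA files normally begin with such a header line, although the "edited" file in the question may already have it removed.

```python
def read_sequence(lines):
    """Join sequence lines into one string, dropping newlines and FASTA headers."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))

def kmers(seq, k, step=1):
    """Yield (start_index, slice) for every slice of length k."""
    # len(seq) - k + 1 start positions, so the final full-length slice is included
    for i in range(0, len(seq) - k + 1, step):
        yield i, seq[i:i + k]

lines = [">header\n", "ACGTA\n", "CGTAC\n"]
seq = read_sequence(lines)
print(seq)                    # ACGTACGTAC
print(list(kmers(seq, 9)))    # [(0, 'ACGTACGTA'), (1, 'CGTACGTAC')]
print(list(kmers(seq, 9, 9))) # [(0, 'ACGTACGTA')]
```

With step=1 the slices overlap (a sliding window, as in the question's loop); with step=9 they do not. For the real file, read_sequence(open(path)) would take the place of my_file.read().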

generate string with length equal to length of time in file, with 1 label per second , python

I have a file like this:
https://gist.github.com/manbharae/70735d5a7b2bbbb5fdd99af477e224be
What I want to do is generate one label per second.
Since the file above is 160 seconds long, there should be 160 labels; in other words, I want to generate a string of length 160.
However, I end up with a string of length 166 instead of 160.
My code:
import os
filename = './test_file.txt'
ann = []
with open(filename, 'r') as f:
    for line in f:
        _, end, label = line.strip().split('\t')
        ann.append((int(float(end)), 'MIT' if label == 'MILAN' else 'not-MIT'))
str = ''
prev_value = 0
for s in ann:
    value = s[0]
    letter = 'M' if s[1] == 'MIT' else 'x'
    str += letter * (value - prev_value)
    print str
    prev_value = value
name_of_file, file_ext = os.path.splitext(os.path.basename(filename))
print "\n\nfile_name processed:", name_of_file
print str
print "length of string", len(str),"\n\n"
My final output:
xxxxxxxMxMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxMMMMMMMMMMMMMMMMMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
That is 166 characters, which is wrong. The string should be 160 characters long, one character per second, because the file is 160 seconds long.
There is a small bug somewhere that I am unable to find.
Please advise: what is going wrong here?
Thanks.
One thing I tried was adding an if condition to break out of the loop once a length of 160 is reached, like this:
if ann[len(ann)-1][0] == len(str):
    break
However, it didn't help. AFAIK, something goes wrong in the last iteration, because until then everything is fine.
I looked at: https://stackoverflow.com/a/14927311/4932791
https://stackoverflow.com/a/1424016/4932791
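The reason it doesn't add up is that on two occasions the end value is lower than the previous one, so the code asks for a negative number of letters:
(69, 'not-MIT')
(68, 'not-MIT')
(76, 'not-MIT')
(71, 'not-MIT')
Multiplying a string by a negative number gives an empty string, so nothing is removed, but prev_value still drops to the lower value. The next entry is then measured from 68 instead of 69 (1 extra character) and from 71 instead of 76 (5 extra characters), which accounts for the 6 extra characters.
For future reference: it's better not to name a variable 'str', because that shadows Python's built-in str type.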
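One possible way around this, sketched under the assumption that a row whose end time goes backwards should simply add nothing: measure each step from the current length of the output instead of from the previous end value. The function name labels_per_second and the three-row sample are made up for illustration; whether such rows should be skipped, sorted, or treated as an error depends on what the file is supposed to mean.

```python
def labels_per_second(annotations):
    """Build one label character per second from (end_second, label) pairs."""
    labels = []
    for end, label in annotations:
        letter = "M" if label == "MIT" else "x"
        # Measure from what has been written so far, not from the previous
        # end value, so an out-of-order end time cannot inflate the total
        missing = end - len(labels)
        if missing > 0:
            labels.extend(letter * missing)
    return "".join(labels)

ann = [(3, "not-MIT"), (2, "not-MIT"), (5, "MIT")]
result = labels_per_second(ann)
print(result)       # xxxMM
print(len(result))  # 5
```

The original loop would produce xxxMMM (6 characters) for the same input, because the last row is measured from 2 instead of 3.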

Fastq parser not taking empty sequence (and other edge cases). Python

This is a continuation of "Generator not working to split string by particular identifier. Python 2". However, I modified the code completely and it's not the same format at all. This question is about edge cases.
Edge Cases:
. when sequence length is different than number of quality values
. when there's an empty sequence or entry
. when the number of lines with quality values is more than one
I cannot figure out how to work with the edge cases above. If it's an empty data file, I still want to output empty strings. I'm trying the following sequences as my input file. (Just a little background: IDs are marked by # at the beginning of a line, the lines after that hold sequence characters until a line with + is reached, and the lines after that hold quality values (value ~= ord(char) - offset). This format is terrible and poorly thought out.)
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs
CTCTCTCATCACACACGAGGAGTGAAGAGAGAACCTCCTCTCCACACGTGGAGTGAGGAGATCCTCTCACACACGTGAGGTGTTGAGAGAGATACTCTCTCATCACCTCACGTGAGGAGTGAGAGAGAT
+
{~~~~~sXNL>>||~~fVM~jtu~&&(uxy~f8YHh=<gA5
''<O1A44N'`oK57(((G&&Q*Q66;"$$Df66E~Z\ZMO>^;%L}~~~~~Q.~~~~x~#-LF9>~MMqbV~ABBV=99mhIwGRR~
#different_number_of_seq_qual
ATCG
+
**!
#this_should_work
GGGG
+
****
For the records with an error, I'm trying to replace the seq and qual strings with empty strings:
seq,qual = '',''
Here's my code so far. These edge cases are difficult for me to figure out; please help.
import sys
def read_fastq(input, offset):
    """
    Inputs a fastq file and reads each line at a time. 'offset' parameter can be set to 33 (phred+33 encoding
    fastq), and 64. Yields a tuple in the format (ID, comments for a sequence, sequence, [integer quality values])
    Capable of reading empty sequences and empty files.
    """
    ID, comment, seq, qual = None,'','',''
    step = 1 #step is a variable that organizes the order fastq parsing
    #step= 1 scans for ID and comment line
    #step= 2 adds relevant lines to sequence string
    #step= 3 adds quality values to string
    for line in input:
        line = line.strip()
        if step == 1 and line.startswith('#'): #Step system from Nedda Saremi
            if ID is not None:
                qual = [ord(char)-offset for char in qual] #Converts from phred encoding to integer values
                sep = None
                if ' ' in ID: sep = ' '
                if sep is not None:
                    ID, comment = ID.split(sep,1) #Separates ID and comment by ' '
                yield ID, comment, seq, qual
                ID,comment,seq,qual = None,'','','' #Resets variable for next sequence
            ID = line[1:]
            step = 2
            continue
        if step==2 and not line.startswith('#') and not line.startswith('+'):
            seq = seq + line.strip()
            continue
        if step == 2 and line.startswith('+'):
            step = 3
            continue
        while step == 3:
            #process the quality data
            if len(qual) == len(seq):
                #once the length of the quality seq and seq are the same, end gathering data
                step = 1
                continue
            if len(qual) < len(seq):
                qual = qual + line.strip()
                if len(qual) < len(seq):
                    step = 3
                    continue
            if (len(qual) > len(seq)):
                sys.stderr.write('\nError: ' + ID + ' sequence length not equal to quality values\n')
                comment,seq,qual= '','',''
                ID = line
                step = 1
                continue
            break
    if ID is not None:
        #Section reserved for last entry in file
        if len(qual) > 0:
            qual = [ord(char)-offset for char in qual]
        sep = None
        if ' ' in ID: sep = ' '
        if sep is not None:
            ID, comment = ID.split(sep,1)
        if len(seq) == 0: ID,comment,seq,qual= '','','',''
        yield ID, comment, seq, qual
My output is skipping the ID #m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs and adding #**! when it should not be in the output:
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
Error: different_number_of_seq_qual sequence length not equal to quality values
#**!
+
#this_should_work
GGGG
+
****
You probably should use BioPython.
Your bug appears to be that the read that is skipped has 129 bases in its sequence but only 128 quality values. So your parser reads the next defline as a quality line, which then makes the quality string too long, so it prints the error.
Your states also don't account for the situation where you are in step 1 but don't see a defline, so you keep reading extra lines and overwriting the ID variable.
But if you really want to write your own parser, I'll address your questions one at a time.
when sequence length is different than number of quality values
This is invalid. Each record in the fastq file must have an equal number of bases and qualities. Different records in the file can have different lengths from each other, but within each record the number of bases and qualities must be equal.
when there's an empty sequence or entry
An empty read will have blank lines for the sequence and quality lines like this:
#SOLEXA1_0007:1:9:610:1983#GATCAG/2

+SOLEXA1_0007:1:9:610:1983#GATCAG/2

#SOLEXA1_0007:2:13:163:254#GATCAG/2
CGTAGTACGATATACGCGCGTGTACTGCTACGTCTCACTTTCGCAAGATTGCTCAGCTCATTGATGCTCAATGCTGGGCCATATCTCTTTTCTTTTTTTC
+SOLEXA1_0007:2:13:163:254#GATCAG/2
HHHHGHHEHHHHHE=HAHCEGEGHAG>CHH>EG5#>5*ECE+>AEEECGG72B&A*)569B+03B72>5.A>+*A>E+7A#G<CAD?#############
when the number of lines with quality values is more than one
Given the requirement from the first answer above, we know that the number of bases and qualities must match. Also, there will never be a + character in the sequence block. So we can keep parsing the sequence block until we see a line that starts with +; then we know we are done parsing the sequence. After that we keep parsing quality lines until we have as many qualities as there are bases in the sequence. We can't rely on looking for any special characters because, depending on the quality encoding, # could be a valid quality call.
Also, as an aside, you appear to be splitting the sequence defline to parse out the optional comment. Be careful with the CASAVA 1.8 format, whose deflines contain spaces. You might need a regex to detect a CASAVA 1.8 defline and, in that case, not split on whitespace.
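The approach described above (read sequence lines until +, then read quality lines until there are as many qualities as bases) could be sketched roughly as follows. This is only an illustration, not a replacement for a tested library. The marker defaults to # to match the question's files, whereas standard FASTQ uses @. The rule for recovering from a too-short quality block is a heuristic of my own, and mismatched records are yielded with an empty sequence and quality list, as the question asks.

```python
def read_fastq(lines, offset=33, marker="#"):
    """Yield (ID, comment, sequence, [quality ints]) for each record."""
    it = iter(lines)
    line = next(it, None)
    while line is not None:
        line = line.strip()
        if not line.startswith(marker):
            # Blank or stray line between records: skip it
            line = next(it, None)
            continue
        ident, _, comment = line[len(marker):].partition(" ")

        # Sequence block: everything up to the line starting with '+'
        seq = ""
        line = next(it, None)
        while line is not None and not line.startswith("+"):
            seq += line.strip()
            line = next(it, None)

        # Quality block: keep reading until there is one value per base
        qual = ""
        line = next(it, None)
        while line is not None and len(qual) < len(seq):
            chunk = line.strip()
            # Heuristic (assumption): a marker line that would overshoot
            # the sequence length is taken to be the next record's header
            if chunk.startswith(marker) and len(qual) + len(chunk) > len(seq):
                break
            qual += chunk
            line = next(it, None)

        if len(qual) != len(seq):
            # Invalid record: report it with empty sequence and qualities
            yield ident, comment, "", []
        else:
            yield ident, comment, seq, [ord(c) - offset for c in qual]

sample = [
    "#r1 note", "ACGTACGT", "+", "IIII", "IIII",  # quality on two lines
    "#bad", "ATCG", "+", "**!",                   # too few quality values
    "#ok", "GGGG", "+", "****",
    "#empty", "", "+", "",                        # empty read
]
for record in read_fastq(sample):
    print(record)
```

Because the length check drives the quality loop, a quality line that happens to start with the marker character is still accepted as long as it fits within the sequence length.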
Have you considered using one of the robust Python packages available for this kind of data rather than writing a parser from scratch? In particular, I'd recommend checking out HTSeq.
