QScintilla syntax highlighting with QsciLexerCustom - UTF-8 issue with german characters - python

Much like this related question, I found myself using QScintilla to create a syntax highlighter that has to deal with non-ASCII characters (é, ä, ß, etc.). I use the trick described in the comments of that question to solve the problem, styling the characters based on the length of their UTF-8 bytes rather than their Latin-1 bytes. When styling the entire document, it works fine.
However, my issue arises when using the start/end parameters to only style part of the document as there seems to be a mismatch between the start/end parameters and the actual length of the text being styled. I need to use this as I am dealing with large files that cause a 1-2 second input delay if I continuously style the entire document.
I have the following very simple example:
é
; Comment
When I open the file, which runs the highlighter from start to finish, it looks like this:
However, if I remove and re-type the comment, the colouring is always one letter off.
This effect stacks indefinitely: with every non-ASCII character, the colouring shifts by another letter until it is a mess.
I have provided my overridden styleText method below.
def styleText(self, start: int, end: int) -> None:
    if self.first_pass:
        self.startStyling(0)
        text = self.parent().text()
        self.first_pass = False
    else:
        self.startStyling(start)
        text = self.parent().text()[start:end]
    p = re.compile(r"[*]\/|\/[*]|\s+|\w+|\W")
    token_list = [(token, len(bytearray(token, "utf-8"))) for token in p.findall(text)]
    editor = self.parent()
    apply_until_linebreak = None
    if start > 0:
        previous_style_nr = editor.SendScintilla(editor.SCI_GETSTYLEAT, start - 1)
        if previous_style_nr in [2, 3]:
            apply_until_linebreak = previous_style_nr
    for i, token in enumerate(token_list):
        if apply_until_linebreak is not None:
            if "\n" in token[0]:
                apply_until_linebreak = None
                self.setStyling(token[1], 0)
            else:
                self.setStyling(token[1], apply_until_linebreak)
        else:
            if token[0].isdigit() or token[0] in ["%"]:
                self.setStyling(token[1], 1)
            elif token[0] == "#":
                apply_until_linebreak = 2
                self.setStyling(token[1], 2)
            elif token[0] in ["/", ";"]:
                apply_until_linebreak = 3
                self.setStyling(token[1], 3)
            elif token[0].lower() == "end":
                self.setStyling(token[1], 4)
            else:
                self.setStyling(token[1], 0)
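One possible explanation, offered as an assumption rather than a confirmed diagnosis: the start and end values passed to styleText are byte offsets into the UTF-8 document, while text()[start:end] slices by character, so every multi-byte character before the slice shifts it by one. A minimal sketch of slicing by byte offsets instead (slice_by_byte_offsets is a hypothetical helper, not part of QScintilla):

```python
def slice_by_byte_offsets(text: str, start: int, end: int) -> str:
    """Return the part of text that lies between two UTF-8 byte offsets."""
    encoded = text.encode("utf-8")
    # errors="replace" is a safety net in case an offset lands inside a multi-byte character.
    return encoded[start:end].decode("utf-8", errors="replace")


document = "é\n; Comment"
# "é" occupies two bytes, so the comment starts at byte 3, not at character 3.
by_chars = document[3:12]                           # character slicing drops the ";"
by_bytes = slice_by_byte_offsets(document, 3, 12)   # byte slicing keeps the whole comment
```

If that assumption holds, the else branch could use text = slice_by_byte_offsets(self.parent().text(), start, end) in place of the character slice.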

Related

Exclude escaped byte char from serial.read_until()

I'm writing code to communicate back and forth with a module over serial which returns specific byte values to indicate the start/end of its communication. The length of the data returned can vary as can all content between the start header and end footer.
In an ideal scenario, I'd be able to use the following code to receive all data from the module:
import serial
start = b'\x5a'
end = b'\x5b'
max_size = 1024
def get_from_serial(ser: serial.Serial) -> bytes:
    with ser:
        _ = ser.read_until(expected=start, size=max_size)
        data = ser.read_until(expected=end, size=max_size)
        return start + data
Unfortunately, there are circumstances where the data sent by the module includes bytes that match either the start or end byte values. In these instances, the module prepends an escape character to them:
valid_start = b'\x5a'
valid_end = b'\x5b'
escaped_start = b'\x5c\x5a'
escaped_end = b'\x5c\x5b'
A valid start/end byte can be preceded by ANY byte value other than an escape one:
good_result = b'\x5a\xff\x5c\x5b\xff\x5b'
bad_result = b'\x5a\xff\x5c\x5b' # missed b'\xff\x5b'
Is there a way to configure ser.read_until() to ignore any escaped instance of a start/end byte and only return when encountering a valid start/end byte?
There's probably a way to do this with a loop that checks if data[-2] == b'\x5c': each time ser.read_until() returns something, though I feel it could get complicated if the module returns multiple escaped start/end bytes scattered throughout the data.
Any thoughts or suggestions would be greatly appreciated.
Edit:
Starting to think this isn't actually possible to do from inside ser.read_until() so have added a check before returning the data.
import serial
start = b'\x5a'
end = b'\x5b'
escape = b'\x5c'
max_size = 1024
def get_from_serial(ser: serial.Serial) -> bytes:
    with ser:
        _ = ser.read_until(expected=start, size=max_size)
        data = ser.read_until(expected=end, size=max_size)
        # The first read_until() has already consumed the start byte,
        # so restore it before checking the header.
        packet = start + data
        if valid_packet(packet):
            return packet
        else:
            raise Exception("Invalid packet")
def valid_packet(packet: bytes) -> bool:
    header = packet[:1]
    footer = packet[-1:]
    escape_check = packet[-2:-1]
    valid_header = header == start
    valid_footer = footer == end
    not_escaped = escape_check != escape
    return all([
        valid_header,
        valid_footer,
        not_escaped
    ])
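The loop idea mentioned in the question can be sketched as follows: keep calling read_until() and only stop once the buffer ends with an end byte that is not preceded by the escape byte. read_frame is a hypothetical helper; it takes the read function as a parameter so it can be tried without hardware, and it relies on the statement above that a valid end byte is never preceded by the escape byte.

```python
from typing import Callable

END = b"\x5b"
ESCAPE = b"\x5c"


def read_frame(read_until: Callable[[bytes], bytes], max_size: int = 1024) -> bytes:
    """Accumulate chunks until the buffer ends with an unescaped END byte."""
    buf = b""
    while len(buf) < max_size:
        chunk = read_until(END)
        if not chunk:
            break  # nothing more arrived (for example a timeout)
        buf += chunk
        # An escaped end byte looks like ESCAPE + END: keep reading in that case.
        if buf.endswith(END) and not buf.endswith(ESCAPE + END):
            return buf
    raise ValueError("no unescaped end byte found")


# With a real port this might be called as (assuming pyserial's `expected` keyword):
#   payload = read_frame(lambda marker: ser.read_until(expected=marker))
```

The escaped bytes stay in the returned payload, so any unescaping would still have to happen afterwards.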

Add linebreaks after N special characters are observed

I have a CSV file whose data is in the wrong format. Based on the number of pipes, I need to add a newline character and make the data ready for consumption.
Can we count the number of pipes and add a newline (\n) character?
Example:
sadasd|asdasd|l||||0sds|sdsds|2||||0sdsd|asdasd|l||||0
Expected output:
sadasd|asdasd|l||||0
sds|sdsds|2||||0
sdsd|asdasd|l||||0
Something like this?
_in = "sadasd|asdasd|l||||0sds|sdsds|2||||0sdsd|asdasd|l||||0"
_out = ""
pipeCount = 0
for char in _in:
    if pipeCount == 6:
        _out = _out+char+"\n"
        pipeCount = 0
    else:
        _out = _out+char
    if char == "|":
        pipeCount += 1
print(_out)
I am not sure I understood the criterion for adding newline (See comments on question), but my output conforms with your expectation:
sadasd|asdasd|l||||0
sds|sdsds|2||||0
sdsd|asdasd|l||||0
Output is still a string, but you can just as easily make it a list of strings.
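A sketch of the list variant, under the same assumption as the loop above: each record has six pipes followed by exactly one more character. split_records is a hypothetical name.

```python
def split_records(line: str, pipes_per_record: int = 6) -> list:
    """Split a run-together line into records of pipes_per_record pipes plus one trailing character."""
    records = []
    current = []
    pipe_count = 0
    for char in line:
        current.append(char)
        if pipe_count == pipes_per_record:
            # This character is the one that follows the last pipe: close the record.
            records.append("".join(current))
            current = []
            pipe_count = 0
        elif char == "|":
            pipe_count += 1
    if current:
        # Keep any incomplete trailing record instead of silently dropping it.
        records.append("".join(current))
    return records


rows = split_records("sadasd|asdasd|l||||0sds|sdsds|2||||0sdsd|asdasd|l||||0")
```

Joining the result with "\n".join(rows) gives the same text as the string version, without a trailing newline.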

ID3v1 Null Byte parsing in python

I am writing a tool to parse ID3 tags from files and edit them in a GUI. Up until now everything is great. However, I am trying to remove the null byte terminators when displaying the info and then add them back when the user saves, to preserve the ID3v1 format. When I check for the null terminator, I get nothing.
This is the portion of the code related to the handling of the tag:
if(bytes.decode(check) == "TAG"):
    title = self.__clean(bytes.decode(f.read(30)))
    artist = self.__clean(bytes.decode(f.read(30)))
    album = self.__clean(bytes.decode(f.read(30)))
    year = bytes.decode(f.read(4))
    comment = self.__clean(bytes.decode(f.read(30)))
    tmp_gen = bytes.decode(f.read(1))
    genre = self.__clean(Utils.genreByteToString(tmp_gen))
    return TagV1(title, artist, album, year, comment, genre)
return None
The clean method is here:
def __clean(self, string):
    counter = 0
    for i in range(0, len(string)):
        w = string[i]
        if(not w.strip()) or b"\00" == w or w == b"00" or w == bytes.decode(b"\00"):
            counter+=1
        else:
            counter = 0
        if(counter == 2):
            return string[0:i-1]
    return string
I've tried every possible representation of the null byte I know of. Either not w or not w.split(); I even tried converting it to bytes and then looping through that looking for the null byte, but still nothing. My counter always stays 0 in the debugger. Also, when trying to copy the value from the debugger it appears as this which is an empty space. In the debugger it appears as an empty square. I would appreciate the input.
Using PyCharm 2017 1.4
I figured out that the only solution that works is to use
w == bytes.decode(b"\00") or rstrip("\0")
as denoted by Marteen
Everything else seems not to work. However, there are still some places where it doesn't work. For example, the comment in the file I am trying doesn't have null bytes until the last one.
Upon further inspection with a hex editor I have found some odd characters. The comment continues with the \20 character in hex until position 29, where there is a null character (denoting that a track indicator follows); the next character is a \01 for the track. Oddly, the genre indicator is 0C, which translates to (cannot paste it, it's a box with zeros in it).
EDIT: Using the __clean() method and checking for the decoded null terminator as well as w.isspace() seemed to fix the issue in both other cases.
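A sketch of an alternative to looping character by character, assuming the tag is a 128-byte block and the text fields are Latin-1 (both are assumptions about the files being read). parse_id3v1 and clean_field are hypothetical names. Each field is cut at the first null byte and the space padding is stripped; the genre is read as a number rather than decoded as text, which may be why 0C shows up as a box.

```python
import struct


def clean_field(raw: bytes) -> str:
    """Cut a fixed-width field at the first NUL byte and drop trailing space padding."""
    return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()


def parse_id3v1(block: bytes):
    """Parse a 128-byte ID3v1 block into a dict, or return None if it is not a tag."""
    if len(block) != 128 or block[:3] != b"TAG":
        return None
    _, title, artist, album, year, comment, genre = struct.unpack(
        "<3s30s30s30s4s30sB", block
    )
    track = None
    # ID3v1.1 layout: a zero byte followed by a non-zero byte at the end of the
    # comment means the last byte is a track number.
    if comment[28] == 0 and comment[29] != 0:
        track = comment[29]
        comment = comment[:28]
    return {
        "title": clean_field(title),
        "artist": clean_field(artist),
        "album": clean_field(album),
        "year": clean_field(year),
        "comment": clean_field(comment),
        "track": track,
        "genre": genre,  # numeric index into the genre table, not text
    }
```

When saving, each field could be padded back to its fixed width with bytes.ljust(30, b"\x00").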

Fastq parser not taking empty sequence (and other edge cases). Python

This is a continuation of "Generator not working to split string by particular identifier. Python 2". However, I modified the code completely and it's not the same format at all. This question is about edge cases.
Edge Cases:
. when sequence length is different than number of quality values
. when there's an empty sequence or entry
. when the number of lines with quality values is more than one
I cannot figure out how to handle the edge cases above. If it's an empty data file, I still want to output empty strings. I'm trying with the sequences below as my input file. Just a little background: IDs are set by # at the beginning of a line, and the sequence characters are on the following lines until a line with + is reached. The next lines hold the quality values (value ~= chr(char)). This format is terrible and poorly thought out.
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs
CTCTCTCATCACACACGAGGAGTGAAGAGAGAACCTCCTCTCCACACGTGGAGTGAGGAGATCCTCTCACACACGTGAGGTGTTGAGAGAGATACTCTCTCATCACCTCACGTGAGGAGTGAGAGAGAT
+
{~~~~~sXNL>>||~~fVM~jtu~&&(uxy~f8YHh=<gA5
''<O1A44N'`oK57(((G&&Q*Q66;"$$Df66E~Z\ZMO>^;%L}~~~~~Q.~~~~x~#-LF9>~MMqbV~ABBV=99mhIwGRR~
#different_number_of_seq_qual
ATCG
+
**!
#this_should_work
GGGG
+
****
The ones with an error, I'm trying to replace the seq and qual strings with empty strings
seq,qual = '',''
Here's my code so far. These edge cases are so difficult for me to figure out please help . . .
import sys
def read_fastq(input, offset):
    """
    Inputs a fastq file and reads each line at a time. 'offset' parameter can be set to 33 (phred+33 encoding
    fastq), and 64. Yields a tuple in the format (ID, comments for a sequence, sequence, [integer quality values])
    Capable of reading empty sequences and empty files.
    """
    ID, comment, seq, qual = None,'','',''
    step = 1 #step is a variable that organizes the order fastq parsing
    #step= 1 scans for ID and comment line
    #step= 2 adds relevant lines to sequence string
    #step= 3 adds quality values to string
    for line in input:
        line = line.strip()
        if step == 1 and line.startswith('#'): #Step system from Nedda Saremi
            if ID is not None:
                qual = [ord(char)-offset for char in qual] #Converts from phred encoding to integer values
                sep = None
                if ' ' in ID: sep = ' '
                if sep is not None:
                    ID, comment = ID.split(sep,1) #Separates ID and comment by ' '
                yield ID, comment, seq, qual
                ID,comment,seq,qual = None,'','','' #Resets variable for next sequence
            ID = line[1:]
            step = 2
            continue
        if step==2 and not line.startswith('#') and not line.startswith('+'):
            seq = seq + line.strip()
            continue
        if step == 2 and line.startswith('+'):
            step = 3
            continue
        while step == 3:
            #process the quality data
            if len(qual) == len(seq):
                #once the length of the quality seq and seq are the same, end gathering data
                step = 1
                continue
            if len(qual) < len(seq):
                qual = qual + line.strip()
                if len(qual) < len(seq):
                    step = 3
                    continue
            if (len(qual) > len(seq)):
                sys.stderr.write('\nError: ' + ID + ' sequence length not equal to quality values\n')
                comment,seq,qual= '','',''
                ID = line
                step = 1
                continue
            break
    if ID is not None:
        #Section reserved for last entry in file
        if len(qual) > 0:
            qual = [ord(char)-offset for char in qual]
        sep = None
        if ' ' in ID: sep = ' '
        if sep is not None:
            ID, comment = ID.split(sep,1)
        if len(seq) == 0: ID,comment,seq,qual= '','','',''
        yield ID, comment, seq, qual
My output is skipping the ID #m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs and adding #**! when it should not be in the output:
#m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
Error: different_number_of_seq_qual sequence length not equal to quality values
#**!
+
#this_should_work
GGGG
+
****
You probably should use BioPython.
Your bug appears to be that the read that is skipped has 129 bases in its sequence but only 128 quality values. So your parser reads the next defline as a quality line, which makes the quality string too long, so it prints the error.
Then your states don't account for the situation where you are in step 1 but don't see a defline. So you keep reading extra lines, overwriting the ID variable.
But if you really want to write your own parser:
I'll address your questions one at a time.
when sequence length is different than number of quality values
This is invalid. Each record in the FASTQ file must have an equal number of bases and qualities. Different records in the file can have different lengths from each other, but within each record the number of bases and qualities must be equal.
when there's an empty sequence or entry
An empty read will have blank lines for the sequence and quality lines like this:
#SOLEXA1_0007:1:9:610:1983#GATCAG/2
+SOLEXA1_0007:1:9:610:1983#GATCAG/2
#SOLEXA1_0007:2:13:163:254#GATCAG/2
CGTAGTACGATATACGCGCGTGTACTGCTACGTCTCACTTTCGCAAGATTGCTCAGCTCATTGATGCTCAATGCTGGGCCATATCTCTTTTCTTTTTTTC
+SOLEXA1_0007:2:13:163:254#GATCAG/2
HHHHGHHEHHHHHE=HAHCEGEGHAG>CHH>EG5#>5*ECE+>AEEECGG72B&A*)569B+03B72>5.A>+*A>E+7A#G<CAD?#############
when the number of lines with quality values is more than one
Because of the requirement from the first answer above, we know that the number of bases and qualities must match. Also, there will never be a + character in the sequence block. So we can keep parsing the sequence block until we see a line that starts with +; then we know we are done parsing the sequence. After that we can keep parsing quality lines until we have as many qualities as there are bases in the sequence. We can't rely on looking for any special characters because, depending on the quality encoding, # could be a valid quality call.
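The approach just described can be sketched like this. It is a minimal illustration, not a replacement for a tested library: read_fastq_records is a hypothetical name, the defline marker defaults to "#" only because that is what the question's data uses (standard FASTQ uses "@"), and a length mismatch simply raises an error.

```python
def read_fastq_records(lines, offset=33, marker="#"):
    """Yield (header, sequence, [quality ints]) for each record in lines."""
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line.startswith(marker):
            continue  # skip blank or stray lines until the next defline
        header = line[len(marker):]

        # Sequence block: everything up to the line that starts with "+".
        seq_parts = []
        for line in it:
            line = line.strip()
            if line.startswith("+"):
                break
            seq_parts.append(line)
        seq = "".join(seq_parts)

        # Quality block: read by length, never by looking for special characters.
        qual = ""
        while len(qual) < len(seq):
            nxt = next(it, None)
            if nxt is None:
                break  # file ended early
            qual += nxt.strip()

        if len(qual) != len(seq):
            raise ValueError(header + ": sequence length not equal to quality values")
        yield header, seq, [ord(char) - offset for char in qual]


example = ["#r1", "ACGT", "AC", "+", "IIII", "II",
           "#empty", "", "+", "",
           "#r2", "GGGG", "+", "****"]
records = list(read_fastq_records(example))
```

Because a record that is one quality value short makes the reader consume the next defline as quality data, a short record cannot be repaired reliably; raising is one way to surface it.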
Also, as an aside, you appear to be splitting the sequence defline to parse out the optional comment. You have to be careful with the CASAVA 1.8 format, which stupidly has spaces. So you might need a regex to detect the CASAVA 1.8 format and, in that case, not split on whitespace.
Have you considered using one of the robust Python packages that are available for dealing with this kind of data rather than writing a parser from scratch? In particular, I'd recommend checking out HTSeq.

Format individual characters differently within an Excel cell

I have a column in Excel 2013 containing letters and the digits 1,2,3 and 4 (representing pinyin pronunciations and tone values). They are all in the same font & format, but I would like to convert the numbers only to superscript. It does not seem that I can use any of Excel's built-in find-and-replace functionality to replace a single character in a cell with its superscript version: the entire cell format gets changed. I saw a thread Format individual characters in a single Excel cell with python which apparently holds a solution, but that was the first time I had heard of Python or xlwt.
Since I have never used Python and xlwt, can someone give me a basic step-by-step set of instructions to install those utilities, customize the script and run it?
Sample:
Li1Shi4
Qin3Fat1
Gon1Lin3
Den1Choi3
Xin1Nen3
Script from other thread:
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Sheet1')
font0 = xlwt.easyfont('')
font1 = xlwt.easyfont('bold true')
font2 = xlwt.easyfont('color_index red')
style = xlwt.easyxf('font: color_index blue')
seg1 = ('bold', font1)
seg2 = ('red', font2)
seg3 = ('plain', font0)
seg4 = ('boldagain', font1)
ws.write_rich_text(2, 5, (seg1, seg2, seg3, seg4))
ws.write_rich_text(4, 1, ('xyz', seg2, seg3, '123'), style)
wb.save('rich_text.xls')
What is the syntax that will achieve the "find numbers and replace with superscript"? Is it a font or a style? The code from the other thread seems to manually input "seg1" , "seg2" , "seg3" etc. Or am I misunderstanding the code?
Thanks in advance. I am using Windows 8, 64 bit, Excel 2013.
I'm bored and in a teaching mood, so, here's a long "answer" that also explains a little bit about how you can figure these things out for yourself in the future :)
I typed abc123def into a cell, and recorded a macro using the macro recorder.
This is where you should always start if you don't know what the correct syntax is.
In any case, I selected the numeric part of this cell, and right-clicked, format cell, change font to superscript.
This is what the macro recorder gives me. This is a lot of code. Fortunately, it's a lot of junk.
Sub Macro2()
    With ActiveCell.Characters(Start:=1, Length:=3).Font 'Applies to the first 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
    With ActiveCell.Characters(Start:=4, Length:=3).Font 'Applies to the middle 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = True
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
    With ActiveCell.Characters(Start:=7, Length:=3).Font 'Applies to the last 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
End Sub
What it represents is three blocks of formatting: the first is the first 3 characters that aren't changed, then the 3 that we applied superscript to, and then the last three characters.
Almost all of this is default properties, since I made no other changes, so I can revise it to this:
Sub Macro2()
    With ActiveCell.Characters(Start:=4, Length:=3).Font
        .Superscript = True
    End With
End Sub
Now we can see that there are two important parts to this. The first part is how to specify which characters to format. This is done by referring to a cell's .Characters:
ActiveCell.Characters(Start:=4, Length:=3).Font
So we can see that this macro refers to the characters in positions 4-6 of the string "abc123def", or "123".
The next, obvious part is to set the .Font.Superscript property to True.
Now you want to generalize this so that you can apply it anywhere. The above code hardcodes the Start and Length arguments. We need to make it dynamic. The easiest way to do this is to go one character at a time and check whether it's numeric; if so, apply the superscript.
Sub ApplySuperscriptToNumbers()
    Dim i As Long
    Dim str As String
    Dim rng As Range
    Dim cl As Range
    '## Generally should work on any contiguous "Selection" of cell(s)
    Set rng = Range(Selection.Address)
    '## Iterate over each cell in this selection
    For Each cl In rng.Cells
        str = cl.Value
        '## Iterate over each character in the cell
        For i = 1 To Len(str)
            '## Check if this character is numeric
            If IsNumeric(Mid(str, i, 1)) Then
                '## Apply superscript to this 1 character
                cl.Characters(Start:=i, Length:=1).Font.Superscript = True
            End If
        Next
    Next
End Sub
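For the Python route the question asked about, the same "find the digits" step can be sketched without any Excel library. split_runs is a hypothetical helper that breaks a cell value into runs of digits and non-digits; each digit run could then be paired with a superscript font and each other run left plain when building the segment tuple for write_rich_text, as in the quoted xlwt script. The exact xlwt font specification for superscript is not shown here and would need checking against the xlwt documentation.

```python
from itertools import groupby


def split_runs(text: str) -> list:
    """Split text into (substring, is_digit) runs, in order."""
    return [("".join(group), is_digit)
            for is_digit, group in groupby(text, key=str.isdigit)]


runs = split_runs("Li1Shi4")
# Each (substring, True) run is a candidate for superscript formatting;
# each (substring, False) run stays in the normal font.
```

This only decides which characters get the formatting; applying it is still up to the VBA macro above or a library such as xlwt.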
