I am following this tutorial (https://towardsdatascience.com/build-your-own-whatsapp-chat-analyzer-9590acca9014) to build a WhatsApp analyzer.
This is the entire code from his tutorial:
import re

def startsWithDateTime(s):
    pattern = '^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)(\d{2}|\d{4}), ([0-9][0-9]):([0-9][0-9]) -'
    result = re.match(pattern, s)
    if result:
        return True
    return False

def startsWithAuthor(s):
    patterns = [
        '([\w]+):',                        # First Name
        '([\w]+[\s]+[\w]+):',              # First Name + Last Name
        '([\w]+[\s]+[\w]+[\s]+[\w]+):',    # First Name + Middle Name + Last Name
        '([+]\d{2} \d{5} \d{5}):',         # Mobile Number (India)
        '([+]\d{2} \d{3} \d{3} \d{4}):',   # Mobile Number (US)
        '([+]\d{2} \d{4} \d{7})'           # Mobile Number (Europe)
    ]
    pattern = '^' + '|'.join(patterns)
    result = re.match(pattern, s)
    if result:
        return True
    return False
def getDataPoint(line):
    # line = 18/06/17, 22:47 - Loki: Why do you have 2 numbers, Banner?
    splitLine = line.split(' - ')      # splitLine = ['18/06/17, 22:47', 'Loki: Why do you have 2 numbers, Banner?']
    dateTime = splitLine[0]            # dateTime = '18/06/17, 22:47'
    date, time = dateTime.split(', ')  # date = '18/06/17'; time = '22:47'
    message = ' '.join(splitLine[1:])  # message = 'Loki: Why do you have 2 numbers, Banner?'
    if startsWithAuthor(message):      # True
        splitMessage = message.split(': ')    # splitMessage = ['Loki', 'Why do you have 2 numbers, Banner?']
        author = splitMessage[0]              # author = 'Loki'
        message = ' '.join(splitMessage[1:])  # message = 'Why do you have 2 numbers, Banner?'
    else:
        author = None
    return date, time, author, message
parsedData = []  # List to keep track of data so it can be used by a Pandas dataframe
conversationPath = 'chat.txt'
with open(conversationPath, encoding="utf-8") as fp:
    fp.readline()  # Skipping first line of the file (usually contains information about end-to-end encryption)
    messageBuffer = []  # Buffer to capture intermediate output for multi-line messages
    date, time, author = None, None, None  # Intermediate variables to keep track of the current message being processed
    while True:
        line = fp.readline()
        if not line:  # Stop reading further if end of file has been reached
            break
        line = line.strip()  # Guarding against erroneous leading and trailing whitespaces
        if startsWithDateTime(line):  # If a line starts with a Date Time pattern, then this indicates the beginning of a new message
            print('true')
            if len(messageBuffer) > 0:  # Check if the message buffer contains characters from previous iterations
                parsedData.append([date, time, author, ' '.join(messageBuffer)])  # Save the tokens from the previous message in parsedData
                messageBuffer.clear()  # Clear the message buffer so that it can be used for the next message
            date, time, author, message = getDataPoint(line)  # Identify and extract tokens from the line
            messageBuffer.append(message)  # Append message to buffer
        else:
            messageBuffer.append(line)  # If a line doesn't start with a Date Time pattern, then it is part of a multi-line message. So, just append to buffer
When I plug in my chat file, chat.txt, my list parsedData ends up empty. After going through his code, I noticed what might be responsible for the empty list.
In his tutorial, his chats are in this format (24-hour):
18/06/17, 22:47 - Loki: Why do you have 2 numbers, Banner?
but my chats are in this format (12-hour):
[4/19/20, 8:10:57 PM] Joe: How are you doing.
That is why the startsWithDateTime function is unable to match any date and time.
How do I change the regex in the startsWithDateTime function so that it matches my chat format?
There are a few problems here.
First, your regex is looking for a date in the format DD/MM/YYYY, but the format you give is M/DD/YY.
Second, the square brackets are not present in the first example you give, which succeeds, but are present in the second example.
Third, looking at your regex, the problem isn't the 12-hour vs. 24-hour time format per se, but the fact that your regex is searching for a strictly 2-digit hour. When using 24-hour format, it is common to include a leading zero for single-digit hours (e.g., 08:10 for 8:10 am), but in 12-hour format it is not (so your code would fail to find 8:10).
You can fix your regular expression by changing the relevant section from
([0-9][0-9]):([0-9][0-9])
to
([0-9]{1,2}):([0-9]{2})
The number in curly braces indicates how many occurrences of the preceding character class to look for, so in this case the expression will look for one or two digits, then a colon, then exactly two digits.
Then the final regex would have to be
^\[(\d{1,2})\/(\d{2})\/(\d{2}|\d{4}), ([0-9]{1,2}):([0-9]{2}):([0-9]{2})\ ([AP]M)\]
Example:
import re
a = '[4/19/20, 8:10:57 PM] Joe: How are you doing'
p = r'^\[(\d{1,2})\/(\d{2})\/(\d{2}|\d{4}), ([0-9]{1,2}):([0-9]{2}):([0-9]{2})\ ([AP]M)\]'
re.findall(p, a)
# [('4', '19', '20', '8', '10', '57', 'PM')]
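Putting it together, a sketch of how startsWithDateTime might look for your format (hedged: your export may vary slightly, and note that getDataPoint will also need adjusting, since your messages have no ' - ' separator between the timestamp and the author):
import re

def startsWithDateTime(s):
    # Matches e.g. '[4/19/20, 8:10:57 PM]' at the start of a line
    pattern = r'^\[(\d{1,2})\/(\d{1,2})\/(\d{2}|\d{4}), ([0-9]{1,2}):([0-9]{2}):([0-9]{2}) ([AP]M)\]'
    return re.match(pattern, s) is not None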
I have the following unstructured data in a text file, which is message log data from Discord.
[06-Nov-19 03:36 PM] Dyno#0000
{Embed}
Server
**Message deleted in #reddit-feed**
Author: ? | Message ID: 171111183099756545
[12-Nov-19 01:35 PM] Dyno#0000
{Embed}
Member Left
#Unknown User
ID: 171111183099756545
[16-Nov-19 11:25 PM] Dyno#0000
{Embed}
Member Joined
#User
ID: 171111183099756545
Essentially my goal is to parse the data, extract all the join and leave messages, and then plot the growth of members in the server. Some of the messages are irrelevant, and each message block has a varying number of rows.
Date Member-change
4/24/2020 2
4/25/2020 -1
4/26/2020 3
I've tried parsing the data in a loop, but because the data is unstructured and the blocks have varying numbers of rows, I'm confused about how to set it up. Is there a way to ignore all blocks without "Member Joined" or "Member Left"?
It is structured text, just not in the way you are expecting. A file can be structured if the text is written in a consistent format, even though normally we think of structured text as field-based.
Here the records are delimited by a date-based header, followed by the {Embed} keyword, followed by the event you are interested in.
#! /usr/bin/env python
# -*- coding: utf-8 -*-

import re
from itertools import count

# message_log is assumed to hold the full contents of the log file as one string
# Get rid of the newlines for convenience
message = message_log.replace("\n", " ")

# Use a regular expression to split the log file into records
rx = r"(\[\d{2}-\w{3}-\d{2})"
replaced = re.split(rx, message)

# re.split will leave a blank entry as the first entry
replaced.pop(0)

# Each record will be a separate entry in a list.
# Unfortunately the date component gets put in a different section of the list
# from the record it refers to and needs to be merged back together
merge_list = list()
for x in count(step=2):
    try:
        merge_list.append(replaced[x] + replaced[x + 1])
    except IndexError:
        break

# Now a nice clean record list exists, it is possible to get the user count
n = 0
for z in merge_list:
    # Split the record into date and context
    log_date = re.split(r"(\d{2}-\w{3}-\d{2})", z)
    # Work out whether the count should be incremented or decremented
    if "{Embed} Member Joined" in z:
        n = n + 1
    elif "{Embed} Member Left" in z:
        n = n - 1
    else:
        continue
    # log_date[1] is needed to get the date from the record
    print(log_date[1] + " " + str(n))
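To get something you can plot, read the export from disk and collect (date, running count) pairs instead of printing. A minimal sketch (discord.txt is a placeholder name for your exported log):
with open("discord.txt") as f:  # placeholder file name; adjust to your export
    message_log = f.read()

growth = []  # (date, running member count) pairs, ready for plotting
Then, in the loop above, replace the print() call with growth.append((log_date[1], n)) and hand the growth list to your plotting library.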
I'm new to Python & here is my question
Write a program to read through the mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon.
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Once you have accumulated the counts for each hour, print out the counts, sorted by hour as shown below.
Link of the file:
http://www.pythonlearn.com/code/mbox-short.txt
This is my code:
name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
counts = dict()
for line in handle:
    if not line.startswith("From "): continue
    #words = line.split()
    col = line.find(':')
    coll = col - 2
    print coll
    #zero = line.find('0')
    #one = line.find('1')
    #b = line[ zero or one : col ]
    #print b
    #hour = words[5:6]
    #print hour
    #for line in hour:
    #    hr = line.split(':')
    #    x = hr[1]
    for x in coll:
        counts[x] = counts.get(x,0) + 1
for key, value in sorted(counts.items()):
    print key, value
My first try was with list splitting (the commented lines), and it didn't work because it treated the 0 and the 1 as the first and second letters of the line, not as the digits 0 and 1.
My second try was with line.find(':'), which partially worked, but it matched the minutes rather than the hours as required!!
First question
Why, when I write line.find(':'), does it automatically take the 2 numbers after it?
Second question
Why, when I run the program now, does it give the error TypeError: 'int' object is not iterable on line 26?
Third question
Why did it treat the 0 & 1 as the first & second letters of the line, not as the digits 0 & 1?
Finally
If possible, please solve this problem for me with a little explanation (using the same code, to keep my learning sequence)
Thank you...
First question
Why, when I write line.find(':'), does it automatically take the 2 numbers after it?
str.find() returns the first index of the character you want to find. If your string is "From 00:00:00", it returns 7, as the first ':' is at index 7.
Second question
Why, when I run the program now, does it give the error TypeError: 'int' object is not iterable on line 26?
As said above, find() returns an int, which you cannot iterate over.
Third question
Why did it treat the 0 & 1 as the first & second letters of the line, not as the digits 0 & 1?
I don't really understand what you mean here. As I understand it, you try to find the first index at which '0' or '1' occurs and assume that is the start of the hour. What about 8-11 PM, where the hour (20-23) starts with a 2?
Finally
If possible, please solve this problem for me with a little explanation (using the same code, to keep my learning sequence)
Sure, it will be like this:
f = open("mbox-short.txt")
count = dict()
for line in f:
    if not line.startswith("From "): continue
    first_colon_index = line.find(":")
    if first_colon_index == -1:  # there is no ':'
        continue
    first_char_hour_index = first_colon_index - 2
    # string slicing:
    # [a:b] gets the substring from index a up to (but not including) index b
    hour = line[first_char_hour_index:first_char_hour_index+2]
    hour_int = int(hour)
    # if the key exists, increase by 1. If not, set to 1
    if hour_int in count:
        count[hour_int] += 1
    else:
        count[hour_int] = 1

# print hour & count, in sorted order
for hour in sorted(count):
    print hour, count[hour]
The part about string slicing can be confusing; you can read more about it in the Python docs.
And you have to be sure that there is no other ':' earlier in the line, or this method will fail, since the first ':' will not be the one between hour and minute.
To make sure it works, it's better to use a regex. Something like:
import re

f = open("mbox-short.txt")
count = dict()
for line in f:
    if not line.startswith("From"): continue
    match = re.search(r'^From.*?([0-9]{2}:[0-9]{2}:[0-9]{2})', line)
    if match:
        time = match.group(1)  # hh:mm:ss
        hh = int(time.split(":")[0])
        # if the key exists, increase by 1. If not, set to 1
        if hh in count:
            count[hh] += 1
        else:
            count[hh] = 1

# print hour & count, in sorted order
for hour in sorted(count):
    print hour, count[hour]
That's because str.find() returns the index of the found substring, not the string itself. Consequently, when you subtract 2 from it and then try to loop through it, Python will complain that you're trying to loop through an integer and raise a TypeError.
You can grab the whole time string as:
time_start = line.find(":")
if time_start == -1:  # not found
    continue
time_string = line[time_start-2:time_start+6]  # slice out the whole time string
You can then further split the time_string by ':' to get hours, minutes, and seconds (e.g. hours, minutes, seconds = time_string.split(":", 2); just keep in mind that those will be strings, not integers). Or, if you just want the hour:
hour = int(line[time_start-2:time_start])
You can take it from there - just increase your dict value and when you're done with parsing the file sort everything out.
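Putting those pieces together, a minimal sketch in the same Python 2 style as your code (assuming the default file name):
counts = dict()
for line in open("mbox-short.txt"):
    if not line.startswith("From "): continue
    time_start = line.find(":")
    if time_start == -1:  # no time on this line
        continue
    hour = int(line[time_start - 2:time_start])
    counts[hour] = counts.get(hour, 0) + 1

for key, value in sorted(counts.items()):
    print key, value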
I have a text file which contains the data like this
AA 331
line1 ...
line2 ...
% information here
AA 332
line1 ...
line2 ...
line3 ...
%information here
AA 1021
line1 ...
line2 ...
% information here
AA 1022
line1 ...
% information here
AA 1023
line1 ...
line2 ...
% information here
I want to perform an action only for the "information" that comes after the smallest integer of each group, that is, after the lines "AA 331" and "AA 1021", and not after the lines "AA 332", "AA 1022", and "AA 1023".
P.S. This is just a sample of a large file.
In the code below I try to parse the text file and collect the integers that come after "AA" into a list, list1; in the second function I group them to get the minimal value of each group into list2. This returns integers like [331, 1021, ...]. So I thought of extracting the lines which come after "AA 331" and performing the action, but I don't know how to proceed.
from itertools import groupby

def getlineindex(textfile):
    with open(textfile) as infile:
        list1 = []
        for line in infile:
            if line.startswith("AA"):
                intid = int(line[3:])  # convert to int so grouping and arithmetic work
                list1.append(intid)
        return list1

def minimalinteger(list1):
    list2 = []
    for k, v in groupby(list1, key=lambda x: x // 10):
        minimalint = min(v)
        list2.append(minimalint)
    return list2
list2 contains the smallest integer of each group that comes after "AA": [331, 1021, ...]
You can use something like:
import re

matcher = re.compile(r"AA (\d+)")
already_was = []
good_block = False

with open(filename) as f:
    for line in f:
        m = matcher.match(line)
        if m:
            v = int(m.group(1)) // 10
        else:
            v = None
        if m and v not in already_was:
            good_block = True
            already_was.append(v)
        elif m and v in already_was:
            good_block = False
        if not m and good_block:
            do_action()
This code works only if the first value in each group is the minimal one.
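One way around that limitation is to precompute the minimal id of each group and then stream the file a second time. A rough sketch, reusing getlineindex and minimalinteger from the question and matcher/do_action from above:
minimal_ids = set(minimalinteger(getlineindex(filename)))
good_block = False
with open(filename) as f:
    for line in f:
        m = matcher.match(line)
        if m:
            # Only blocks headed by a group's minimal id are processed
            good_block = int(m.group(1)) in minimal_ids
        elif good_block:
            do_action()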
Okay, here's my solution. At a high level, I go line by line, watching for AA lines to know when I've found the start/end of a data block, and watch what I call the run number to know whether or not we should process the next block. Then, I have a subroutine that handles any given block, basically reading off all relevant lines and processing them if needed. That subroutine is what watches for the next AA line in order to know when it's done.
import re

runIdRegex = re.compile(r'AA (\d+)')

def processFile(fileHandle):
    lastNumber = None  # Last run number, necessary so we know if there's been a gap or if we're in a new block of ten.
    line = fileHandle.next()
    while line is not None:  # None is being used as a special value indicating we've hit the end of the file.
        processData = False
        match = runIdRegex.match(line)
        if match:
            runNumber = int(match.group(1))
            if lastNumber is None:
                # Startup/first iteration
                processData = True
            elif runNumber - lastNumber == 1:
                # Continuation, see if the tens are the same.
                lastNumberTens = lastNumber / 10
                runNumberTens = runNumber / 10
                if lastNumberTens != runNumberTens:
                    processData = True
            else:
                processData = True
            # Always remember where we were.
            lastNumber = runNumber
            # And grab and process data.
            line = dataBlock(fileHandle, process=processData)
        else:
            try:
                line = fileHandle.next()
            except StopIteration:
                line = None

def dataBlock(fileHandle, process=False):
    runData = []
    try:
        line = fileHandle.next()
        match = runIdRegex.match(line)
        while not match:
            runData.append(line)
            line = fileHandle.next()
            match = runIdRegex.match(line)
    except StopIteration:
        # Hit end of file
        line = None
    if process:
        # Data processing call here
        # processData(runData)
        pass
    # Return line so we don't lose it!
    return line
Some notes for you. First, I'm in agreement with Jimilian that you should use a regular expression to match AA lines.
Second, the logic we talked about with regard to when we should process data is in processFile. Specifically these lines:
processData = False
match = runIdRegex.match(line)
if match:
    runNumber = int(match.group(1))
    if lastNumber is None:
        # Startup/first iteration
        processData = True
    elif runNumber - lastNumber == 1:
        # Continuation, see if the tens are the same.
        lastNumberTens = lastNumber / 10
        runNumberTens = runNumber / 10
        if lastNumberTens != runNumberTens:
            processData = True
    else:
        processData = True
I assume we don't want to process data, then identify when we do. Logically speaking, you can do the inverse: assume you want to process data, then identify when you don't. Next, we need to store the last run's number in order to know whether or not we need to process this run's data (and watch out for that first-run edge case). We know we want to process data when the sequence is broken (the difference between two runs is greater than 1), which is handled by the else statement. We also know that we want to process data when the sequence increments the digit in the tens place, which is handled by my integer divide by 10.
Third, watch out for the return value of dataBlock. If you don't keep it, you're going to lose the AA line that caused dataBlock to stop iterating, and processFile needs that line in order to know whether the next data block should be processed.
Last, I've opted to use fileHandle.next() and exception handling to identify when I get to the end of the file. But don't think this is the only way. :)
Let me know in comments if you have any questions.
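If you want to try this out, a minimal usage sketch (Python 2, since the code relies on fileHandle.next(); the file name is hypothetical):
# data.txt is a placeholder name for your input file
with open('data.txt') as fileHandle:
    processFile(fileHandle)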
This is a continuation of Generator not working to split string by particular identifier. Python 2. However, I modified the code completely, and it's not the same format at all. This is about edge cases.
Edge cases:
- when the sequence length is different from the number of quality values
- when there's an empty sequence or entry
- when the number of lines with quality values is more than one
I cannot figure out how to handle the edge cases above. If it's an empty data file, then I still want to output empty strings. I'm trying with these sequences right here for my input file. (Just a little background: IDs are set by @ at the beginning of a line, the sequence characters follow on the lines after until a line with + is reached, and the next lines have the quality values (value ~= chr(char)). This format is terrible and poorly thought out.)
@m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
@m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs
CTCTCTCATCACACACGAGGAGTGAAGAGAGAACCTCCTCTCCACACGTGGAGTGAGGAGATCCTCTCACACACGTGAGGTGTTGAGAGAGATACTCTCTCATCACCTCACGTGAGGAGTGAGAGAGAT
+
{~~~~~sXNL>>||~~fVM~jtu~&&(uxy~f8YHh=<gA5
''<O1A44N'`oK57(((G&&Q*Q66;"$$Df66E~Z\ZMO>^;%L}~~~~~Q.~~~~x~#-LF9>~MMqbV~ABBV=99mhIwGRR~
@different_number_of_seq_qual
ATCG
+
**!
@this_should_work
GGGG
+
****
For the entries with an error, I'm trying to replace the seq and qual strings with empty strings:
seq, qual = '', ''
Here's my code so far. These edge cases are so difficult for me to figure out, please help...
import sys

def read_fastq(input, offset):
    """
    Inputs a fastq file and reads each line at a time. 'offset' parameter can be set to 33 (phred+33 encoding
    fastq), and 64. Yields a tuple in the format (ID, comments for a sequence, sequence, [integer quality values])
    Capable of reading empty sequences and empty files.
    """
    ID, comment, seq, qual = None, '', '', ''
    step = 1  # step is a variable that organizes the order of fastq parsing
    # step = 1 scans for ID and comment line
    # step = 2 adds relevant lines to sequence string
    # step = 3 adds quality values to string
    for line in input:
        line = line.strip()
        if step == 1 and line.startswith('@'):  # Step system from Nedda Saremi
            if ID is not None:
                qual = [ord(char) - offset for char in qual]  # Converts from phred encoding to integer values
                sep = None
                if ' ' in ID: sep = ' '
                if sep is not None:
                    ID, comment = ID.split(sep, 1)  # Separates ID and comment by ' '
                yield ID, comment, seq, qual
                ID, comment, seq, qual = None, '', '', ''  # Resets variables for the next sequence
            ID = line[1:]
            step = 2
            continue
        if step == 2 and not line.startswith('@') and not line.startswith('+'):
            seq = seq + line.strip()
            continue
        if step == 2 and line.startswith('+'):
            step = 3
            continue
        while step == 3:
            # process the quality data
            if len(qual) == len(seq):
                # once the length of the quality string and seq are the same, end gathering data
                step = 1
                continue
            if len(qual) < len(seq):
                qual = qual + line.strip()
                if len(qual) < len(seq):
                    step = 3
                    continue
            if len(qual) > len(seq):
                sys.stderr.write('\nError: ' + ID + ' sequence length not equal to quality values\n')
                comment, seq, qual = '', '', ''
                ID = line
                step = 1
                continue
            break
    if ID is not None:
        # Section reserved for last entry in file
        if len(qual) > 0:
            qual = [ord(char) - offset for char in qual]
            sep = None
            if ' ' in ID: sep = ' '
            if sep is not None:
                ID, comment = ID.split(sep, 1)
        if len(seq) == 0: ID, comment, seq, qual = '', '', '', ''
        yield ID, comment, seq, qual
My output is skipping the ID @m120204_092117_richard_c100250832550000001523001204251233_s1_p0/904/ccs and adding @**! when it should not be in the output:
@m120204_092117_richard_c100250832550000001523001204251233_s1_p0/422/ccs
CTGTTGCGGATTGTTTGGCTATGGCTAAAACCGATGAAGAAAAAGGAAATGCCAAAACCGTTTATAGCGATTGATCCAAGAAATCCAAAATAAAAGGACACAAAACAAACAAAATCAATTGAGTAAAACAGAAAGGCCATCAAGCAAGCGAGTGCTTGATAACTTAGATGACCCTACTGATCAAGAGGCCATAGAGCAATGTTTAGAGGGCTTGAGCGATAGTGAAAGGGCGCTAATTCTAGGAATTCAAACGACAAGCTGATGAAGTGGATCTGATTTATAGCGATCTAAGAAACCGTAAAACCTTTGATAACATGGCGGCTAAAGGTTATCCGTTGTTACCAATGGATTTCAAAAATGGCGGCGATATTGCCACTATTAACCGCTACTAATGTTGATGCGGACAAATAGCTAGCAGATAATCCTATTTATGCTTCCATAGAGCCTGATATTACCAAGCATACGAAACAGAAAAAACCATTAAGGATAAGAATTTAGAAGCTAAATTGGCTAAGGCTTTAGGTGGCAATAAACAAATGACGATAAAGAAAAAAGTAAAAAACCCACAGCAGAAACTAAAGCAGAAAGCAATAAGATAGACAAAGATGTCGCAGAAACTGCCAAAAATATCAGCGAAATCGCTCTTAAGAACAAAAAAGAAAAGAGTGGGATTTTGTAGATGAAAATGGTAATCCCATTGATGATAAAAAGAAAGAAGAAAAACAAGATGAAACAAGCCCTGTCAAACAGGCCTTTATAGGCAAGAGTGATCCCACATTTGTTTTTAGCGCAATACACCCCCATTGAAATCACTCTGACTTCTAAAGTAGATGCCACTCTCACAGGTATAGTGAGTGGGGTTGTAGCCAAAGATGTATGGAACATGAACGGCACTATGATCTTATTAAGACAAACGGCCACTAAGGTGTATGGGAATTATCAAAGCGTGAAAGGTGGCCACGCCTATTATGACTCGTTTAATGATAGTCTTTACTAAAGCCATTACGCCTGATGGGGTGGTGATACCTCTAGCAAACGCTCAAGCAGCAGGCATGCTGGGTGAAGCAGGCGGTAGATGGCTATGTGAATAATCACTTCATGAAGCGTATAGGCTTTGCTGTGATAGCAAGCGTGGTTAATAGCTTCTTGCAAACTGCACCTATCATAGCTCTAGATAAACTCATAGGCCTTGGCAAAGGCAGAAGTGAAAGGACACCTGAATTTAATTACGCTTTGGGTCAAGCTATCAATGGTAGTATGCAAAGTTCAGCTCAGATGTCTAATCAAATTCTAGGGCAACTGATGAATATCCCCCAAGTTTTTACAAAAATGAGGGCGATAGTATTAAGATTCTCACCATGGACGATATTGATTTTAGTGGTGTGTATGATGTTAAAATTGACCAACAAATCTGTGGTAGATGAAATTATCAAACAAAGCACCAAAAACTTTGTCTAGAGAACATGAAGAAATCACCACAGCCCCAAAGGTGGCAATTGATTCAAGAGAAAGGATAAAATATATTCATGTTATTAAACTCGGTTCTTTACAAAATAAAAAGACAAACCAACCTAGGCTCTTCTAGAGGA
+
J(78=AEEC65HR+++*3327H00GD++++FF440.+-64444426ABAB<:=7888((/788P>>LAA8*+')3&++=<////==<4&<>EFHGGIJ66P;;;9;;FE34KHKHP<<11;HK:57678NJ990((&26>PDDJE,,JL>=##88,8,+>::J88ELF9.-5.45G+###NP==??<>455F((<BB===;;EE;3><<;M=>89PLLPP?>KP8+7699>A;ANO===J#'''B;.(...HP?E##AHGE77MNOO9=OO?>98?DLIMPOG>;=PRKB5H---3;MN&&&&&F?B>;99;8AA53)A<=;>777:<>;;8:LM==))6:#K..M?6?::7,/4444=JK>>HNN=//16#--F#K;9<:6449#BADD;>CD11JE55K;;;=&&%%,3644DL&=:<877..3>344:>>?44*+MN66PG==:;;?0./AGLKF99&&5?>+++JOP333333AC#EBBFBCJ>>HINPMNNCC>>++6:??3344>B=<89:/000::K>A=00#,+-/.,#(LL#>#I555K22221115666666477KML559-,333?GGGKCCP:::PPNPPNP??PPPLLMNOKKFOP2Q&&P7777PM<<<=<6<HPOPPP44?=#=:?BB=89:<<DHI777777645545PPO((((((((C3P??PM0000#NOPJPPFGGL<<<NNGNKGGGGGEELKB'''(((((L===L<<..*--MJ111?PO=788<8GG>>?JJL88,,1CF))??=?M6667PPKAKM&&&&&<?P43?OENPP''''&5579ICIFRPPPPOP>:>>>P888PLPAJDPCCDMMD;9=FBADDJFD7;ALL?,,,,06ID13..000DA4CFJC44,,->ED99;44CJK?42FAB?=CLNO''PJI999&77&&ERP><)))O==D677FP768PA=##HEE.::NM&&&>O''PO88H#A999P<:?IHL;;;GIIPPMMPPB7777PP>>>>KOPIIEEE<<CL%%5656AAAG<<DDFFGG%%N21778;M&&>>CCL::LKK6.711DGHHMIA#BAJ7>%6700;;=##?=;J55>>QP<<:>MF;;RPL==JMMPPPQR##P===;=BM99M>>PPOQGD44777PKKFP=<'''2215566>CG>>HH<<PLJI800CE<<PPPMGNOPMJ>>GG***LCCC777,,#AP>>AOPMFN99ENNMEPP>>>>>>CLPP??66OOKLLP=:>>KMBCPOPP#FKEI<<ML?>EAF>>>LDCD77JK=H>BN==:=<<<:==JN,,,659???8K<:==<4))))))P98>>>>;967777N66###AMKKKIKPMG;;AD88HN&&LMIGJOJMGHPC>#5D((((C?9--?8HGCDPNH7?9974;;AC&ABH''#%:=NP:,,9999=GJG>>=>JG21''':9>>>;;MP*****OKKKIE??55PPKJ21:K---///Q11//EN&';;;;:=;00011;IP##PP11?778JDDMM>>::KKLLKLNONOHDMPKLMIB>>?JP>9;KJL====;8;;;L)))))E#=$$$#.::,,BPJK76B;;F5<<J::K
Error: different_number_of_seq_qual sequence length not equal to quality values
@**!
+
@this_should_work
GGGG
+
****
You probably should use BioPython.
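For instance, a minimal sketch using Biopython's SeqIO (assuming Biopython is installed and your data is saved as sample.fastq, a hypothetical name):
from Bio import SeqIO

# SeqIO validates each record, so a sequence/quality length mismatch raises
# an error instead of silently corrupting the rest of the parse
for record in SeqIO.parse("sample.fastq", "fastq"):
    print record.id, len(record.seq), record.letter_annotations["phred_quality"][:5]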
Your bug appears to be that the read that is skipped has 129 bases in its sequence but only 128 quality values, so your parser reads the next defline as a quality line, which then makes the quality string too long and prints the error.
Also, your states don't account for the situation where you are in step 1 but don't see a defline, so you keep reading extra lines, overwriting the ID variable.
But if you really want to write your own parser, I'll address your questions one at a time.
when sequence length is different than number of quality values
This is invalid. Each record in the fastq file must have an equal number of bases and qualities. Different records in the file can have different lengths from each other, but within each record the counts must be equal.
when there's an empty sequence or entry
An empty read will have blank lines for the sequence and quality lines, like this:
@SOLEXA1_0007:1:9:610:1983#GATCAG/2

+SOLEXA1_0007:1:9:610:1983#GATCAG/2

@SOLEXA1_0007:2:13:163:254#GATCAG/2
CGTAGTACGATATACGCGCGTGTACTGCTACGTCTCACTTTCGCAAGATTGCTCAGCTCATTGATGCTCAATGCTGGGCCATATCTCTTTTCTTTTTTTC
+SOLEXA1_0007:2:13:163:254#GATCAG/2
HHHHGHHEHHHHHE=HAHCEGEGHAG>CHH>EG5#>5*ECE+>AEEECGG72B&A*)569B+03B72>5.A>+*A>E+7A#G<CAD?#############
when the number of lines with quality values is more than one
Due to the requirement from the first answer above, we know that the number of bases and qualities must match. Also, there will never be a line starting with + inside the sequence block, so we can keep parsing sequence lines until we see a line that starts with +; then we know we are done parsing sequence. Then we can keep parsing quality lines until we have accumulated the same number of qualities as there are bases in the sequence. We can't rely on looking for any special character to end the quality block because, depending on the quality encoding, @ could be a valid quality call.
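A rough sketch of that strategy (Python 2; it raises on a length mismatch, but you could catch the ValueError and yield empty strings per your spec instead):
def records(fh):
    # Read a defline, collect sequence lines until '+', then collect
    # quality lines until their combined length matches the sequence
    defline = fh.readline().strip()
    while defline:
        seq, qual = '', ''
        line = fh.readline()
        while line and not line.strip().startswith('+'):
            seq += line.strip()
            line = fh.readline()
        while len(qual) < len(seq):
            line = fh.readline()
            if not line:  # truncated file
                break
            qual += line.strip()
        if len(qual) != len(seq):
            raise ValueError(defline[1:] + ': sequence/quality length mismatch')
        yield defline[1:], seq, qual
        defline = fh.readline().strip()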
Also, as an aside, you appear to be splitting the sequence defline to parse out the optional comment. Be careful with the CASAVA 1.8 format, which (rather unhelpfully) has spaces, so you might need a regex to detect CASAVA 1.8 deflines and skip the whitespace split for them.
Have you considered using one of the robust Python packages that are available for dealing with this kind of data, rather than writing a parser from scratch? In particular I'd recommend checking out HTSeq.
I have a folder with about 50 .txt files containing data in the following format.
=== Predictions on test data ===
inst# actual predicted error distribution (OFTd1_OF_Latency)
1 1:S 2:R + 0.125,*0.875 (73.84)
I need to write a program that combines the following: my index number (i), the letter of the true class (R or S), the letter of the predicted class, and each of the distribution predictions (the decimals less than 1.0).
I would like it to look like the following when finished, but preferably as a .csv file.
ID True Pred S R
1 S R 0.125 0.875
2 R R 0.105 0.895
3 S S 0.945 0.055
. . . . .
. . . . .
. . . . .
n S S 0.900 0.100
I'm a beginner and a bit fuzzy on how to get all of that parsed and then concatenated and appended. Here's what I was thinking, but feel free to suggest another direction if that would be easier.
for i in range(1, n):
    s = str(i)
    readin = open('mydata/output/output' + s + 'out', 'r')
    # The files are all named the same but with different numbers associated
    output = open("mydata/summary.csv", "a")
    storage = []
    for line in readin:
        # data extraction/concatenation here
        if line.startswith('1'):
            id = i
            true = # split at the ':' and take the letter after it
            pred = # split at the second ':' and take the letter after it
            # some have error '+'s and some don't, so I'm not exactly sure what to do to get the distributions
            ds = # split at the ',' and take the string of 5 digits before it
            if pred == 'R':
                dr = # skip the character after the comma but take the five characters after
            else:
                # take the five characters after the comma
            lineholder = id + ' , ' + true + ' , ' + pred + ' , ' + ds + ' , ' + dr
        else: continue
        output.write(lineholder)
I think using the indexes would be another option, but it might complicate things if the spacing is off in any of the files and I haven't checked this for sure.
Thank you for your help!
Well, first of all, if you want CSV output, you should use the csv module that comes with Python. More about this module here: https://docs.python.org/2.7/library/csv.html I won't demonstrate how to use it, because it's pretty simple.
As for reading the input data, here's my suggestion for how to break down every line of the data itself. I assume that lines of data in the input file have their values separated by spaces, and that no value contains a space:
def process_line(id_, line):
    pieces = line.split()  # Now we have an array of values
    true = pieces[1].split(':')[1]  # split at the ':' and take the letter after it
    pred = pieces[2].split(':')[1]  # split at the second ':' and take the letter after it
    if len(pieces) == 6:  # There was an error, the '+' is there
        p4 = pieces[4]
    else:  # There was no '+', only spaces
        p4 = pieces[3]
    ds = p4.split(',')[0].lstrip('*')  # split at the ',' and take the digits before it (dropping any '*')
    if pred == 'R':
        dr = p4.split(',')[1][1:]  # skip the '*' after the comma and take the digits after it
    else:
        dr = p4.split(',')[1]  # take the digits after the comma
    return id_ + ' , ' + true + ' , ' + pred + ' , ' + ds + ' , ' + dr
What I mainly used here was the split function of strings (https://docs.python.org/2/library/stdtypes.html#str.split) and, in one place, the str[1:] slicing syntax to skip the first character of a string (strings are sequences, after all, so we can slice them).
Keep in mind that my function won't handle any errors or lines formatted differently than the one you posted as an example. If the values in every line are separated by tabs rather than spaces, you should replace the line pieces = line.split() with pieces = line.split('\t').
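That said, a rough sketch of wiring process_line into the csv module (the file names, the range bound n, and the startswith('1') check are taken from the question's sketch, so adjust them to your real layout):
import csv

with open('mydata/summary.csv', 'wb') as out:  # 'wb' for the Python 2 csv module
    writer = csv.writer(out)
    writer.writerow(['ID', 'True', 'Pred', 'S', 'R'])
    for i in range(1, n):  # n = number of files + 1, as in the question
        for line in open('mydata/output/output' + str(i) + 'out'):
            if line.startswith('1'):
                # process_line returns ' , '-joined fields; split them back into a row
                writer.writerow(process_line(str(i), line).split(' , '))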
I think you can separate the floats and then combine them with the strings with the help of the re module, as follows:
import re

with open('sample.txt', 'r') as file:
    lines = file.readlines()

strings = [re.findall(r'\d+\.\d+', line) for line in lines]
print strings

nums = [re.findall(r'\w+:\w+', line) for line in lines]
print nums

s = nums + strings
print s
This handles every line of the file; afterwards you can simply concatenate the pieces per line.
Contents of sample.txt:
1 1:S 2:R + 0.125,*0.875 (73.84)
2 1:S 2:R + 0.15,*0.85 (69.4)
When you run the program the result will be:
[['1:S', '2:R'], ['1:S', '2:R'], ['0.125', '0.875', '73.84'], ['0.15', '0.85', '69.4']]
This uses regular expressions and the CSV module.
import re
import csv

# Note: Python's re module does not support POSIX classes such as [[:blank:]],
# so plain whitespace matching is used instead
matcher = re.compile(r'\s*1.*:(.).*:(.).* ([^ ]*),[^0-9]?(.*) ')
filenametemplate = 'mydata/output/output%iout'
output = csv.writer(open('mydata/summary.csv', 'w'))
for i in range(1, n):  # n = number of files + 1, as in the question
    for line in open(filenametemplate % i):
        m = matcher.match(line)
        if m:
            output.writerow([i] + list(m.groups()))