I'm trying to read in files from the released Enron dataset for a data science project. My problem lies in how I'm reading in the files. Basically, the first 15 or so lines of every email are information about the email itself: To, From, Subject, etc. So you would think you could read in the first 15 lines and assign them to an array. The problem is that my algorithm relies on leading whitespace, and sometimes the "To" field alone can span some 50 lines.
Example of a (slightly truncated) troublesome email:
Message-ID: <29403111.1075855665483.JavaMail.evans@thyme>
Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)
From: rebecca.cantrell@enron.com
To: stephanie.miller@enron.com, ruth.concannon@enron.com, jane.tholt@enron.com,
	tori.kuykendall@enron.com, randall.gay@enron.com,
	phillip.allen@enron.com, timothy.hamilton@enron.com,
	robert.superty@enron.com, colleen.sullivan@enron.com,
	donna.greif@enron.com, julie.gomez@enron.com
Subject: Final Filed Version -- SDG&E Comments
My code:
def readEmailHead(username, emailNum):
    text = ""
    file = open(corpus_root + username + '/all_documents/' + emailNum)
    for line in file:
        text += line
    file.close()
    email = text.split('\n')
    count = 0
    for line in email:
        mem = []
        if line == '':
            pass
        else:
            if line[0].isspace():
                print(line, count)
                email[count-1] += line
                del email[count]
        count += 1
    return [email[:20]]
Right now it can handle emails with one extra line in the subject/to/from/etc., but no more than that. Any ideas?
No need to reinvent the wheel: the email.parser module can be your friend. I also include a more portable way of constructing the file name. To parse just the header, you could use the built-in header parser and write a function like:
import email.parser
import os.path

def read_email_header(username, email_number, corpus_root='~/tmp/data/enron'):
    corpus_root = os.path.expanduser(corpus_root)
    fname = os.path.join(corpus_root, username, 'all_documents', email_number)
    with open(fname, 'rb') as fd:
        header = email.parser.BytesHeaderParser().parse(fd)
    return header

mm = read_email_header('dasovich-j', '13078.')
print(mm.keys())
print(mm['Date'])
print(mm['From'])
print(mm['To'].split())
print(mm['Subject'])
Running this gives:
['Message-ID', 'Date', 'From', 'To', 'Subject', 'Mime-Version', 'Content-Type', 'Content-Transfer-Encoding', 'X-From', 'X-To', 'X-cc', 'X-bcc', 'X-Folder', 'X-Origin', 'X-FileName']
Fri, 25 May 2001 02:50:00 -0700 (PDT)
rebecca.cantrell@enron.com
['ray.alvarez@enron.com,', 'steve.walton@enron.com,', 'susan.mara@enron.com,', 'alan.comnes@enron.com,', 'leslie.lawner@enron.com,', 'donna.fulton@enron.com,', 'jeff.dasovich@enron.com,', 'christi.nicolay@enron.com,', 'james.steffes@enron.com,', 'jalexander@gibbs-bruns.com,', 'phillip.allen@enron.com,', 'linda.noske@enron.com,', 'dave.perrino@enron.com,', 'don.black@enron.com,', 'robert.frank@enron.com,', 'stephanie.miller@enron.com,', 'barry.tycholiz@enron.com']
Reuters -- FERC told Calif natgas to reach limit this summer
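One detail worth noting: mm['To'].split() in the demo above keeps the trailing comma on each address, as the output shows. A minimal cleanup sketch, assuming the addresses are separated only by commas:

recipients = [addr.strip() for addr in mm['To'].split(',') if addr.strip()]
print(recipients[:2])  # ['ray.alvarez@enron.com', 'steve.walton@enron.com']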
The easy way to approach problems like this (setting aside the good idea of using an existing parser) is to treat the transformation as being performed on one list of lines to yield another list, rather than trying to mutate an existing list while looping over it. Something like:
new = []
for l in old:
    if is_continuation(l):
        new[-1] += l
    else:
        new.append(l)
For all but the longest lists (where del old[i] is expensive anyway) this is quite efficient if most lines are not continuations since they can be reused in new as-is.
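For instance, a minimal sketch of is_continuation for this header format, assuming (per the usual header-folding convention) that a continuation line starts with whitespace; the sample lines are shortened from the question, and this variant strips each continuation before folding it in:

def is_continuation(line):
    # Folded header lines begin with a space or tab
    return line[:1].isspace()

old = [
    "From: rebecca.cantrell@enron.com",
    "To: stephanie.miller@enron.com, ruth.concannon@enron.com,",
    "\ttori.kuykendall@enron.com, randall.gay@enron.com",
    "Subject: Final Filed Version -- SDG&E Comments",
]

new = []
for l in old:
    if is_continuation(l):
        new[-1] += " " + l.strip()  # fold the continuation into the previous header line
    else:
        new.append(l)

print(new[1])  # a single "To:" line containing all four recipients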
Maybe use regular expressions according to your needs. For example, you can identify the sent-to email addresses as follows:
import re

sent_to = []

def collect_sent_to(text):  # pass in the text part from the file you want
    email = re.search(r'(.+@.+\..+)', text)  # regex pattern to match an email address
    if email:
        sent_to.append(list(email.groups()))  # adds the match to the sent_to list for each email
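A quick usage sketch (the header lines are shortened from the question's sample email; note the pattern is greedy, so each match is the whole tail of the line rather than a single address):

header_lines = [
    "To: stephanie.miller@enron.com, ruth.concannon@enron.com,",
    "\ttori.kuykendall@enron.com, randall.gay@enron.com",
]
for line in header_lines:
    collect_sent_to(line)
print(sent_to)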
I need to create a database, using Python and a .txt file. Creating new items is no problem; the inside of the database .txt file looks like this:
Index Objektname Objektplace Username
e.g.:
1 Pen Office Daniel
2 Saw Shed Nic
6 Shovel Shed Evelyn
4 Knife Room6 Evelyn
I get the index from a QR scanner (OpenCV) and the other information is gained via Tkinter entries. If an object is already saved in the database, you should be able to rewrite its Objektplace and Username.
My problems now are the following:
If I scan the code with the index 6, how do I navigate to that entry, even if it's not in line 6, without causing a problem with the Room6?
How do I, for example, only replace the "Shed" from index 4 when that object is moved to e.g. Room6?
Same goes for the usernames.
Up until now I've tried different methods, but nothing worked so far.
The last try looked something like this
def DBChange():
    #Removes unwanted bits from the scanned code
    data2 = data.replace("'", "")
    Index = data2.replace("b", "")
    #Gets the data from the Entry widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    #Adds a whitespace at the end of the entries to separate them
    Userlen = len(User)
    User2 = User.ljust(Userlen)
    Einlagerungsortlen = len(Einlagerungsort) + 1
    Einlagerungsort2 = Einlagerungsort.ljust(Einlagerungsortlen)
    #Navigate to the exact line of the scanned Index and replace the words
    #for the place and the user ONLY in this line
    file = open("Datenbank.txt", "r+")
    lines = file.readlines()
    for word in lines[Index].split():
        List.append(word)
    checkWords = (List[2], List[3])
    repWords = (Einlagerungsort2, User2)
    for line in file:
        for check, rep in zip(checkWords, repWords):
            line = line.replace(check, rep)
        file.write(line)
    file.close()
    Return()
Thanks in advance
I'd suggest using Pandas to read and write your text file. That way you can just use the index to select the appropriate line. And if there is no specific reason to keep your text format, I would switch to csv for ease of use.
import pandas as pd

def DBChange():
    #Removes unwanted bits from the scanned code
    # I haven't changed this part, since I guess you need this for some input data
    data2 = data.replace("'", "")
    Indexnr = int(data2.replace("b", ""))  # cast to int so it matches the Index column
    #Gets the data from the Entry widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    # I removed the ljust() lines here; they aren't necessary when using csv and Pandas
    # Read in the csv file, using the Index column as the dataframe index
    df = pd.read_csv("Datenbank.csv", index_col='Index')
    # Select the line with that index and replace the values
    df.loc[Indexnr, 'Username'] = User
    df.loc[Indexnr, 'Objektplace'] = Einlagerungsort
    # Write back to csv
    df.to_csv("Datenbank.csv")
    Return()
Since I can't reproduce your specific problem, I haven't tested it. But something like this should work.
Edit
To read and write the text file, use ' ' as the separator. (I assume the values do not contain spaces, and your text file currently uses one space between values.)
reading:
df = pd.read_csv('Datenbank.txt', sep=' ')
Writing:
df.to_csv('Datenbank.txt', sep=' ')
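One caveat: to_csv also writes the dataframe index back out, so when the Index column serves as the index (as in the function above), pass index_col='Index' on re-reading to keep the round trip stable. A small sketch using the sample data from the question:

import pandas as pd

df = pd.read_csv('Datenbank.txt', sep=' ', index_col='Index')
df.loc[4, 'Objektplace'] = 'Room6'   # move the knife (index 4) to Room6
df.to_csv('Datenbank.txt', sep=' ')  # the Index column comes back as the first column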
First of all, this is a terrible way to store data. My suggestion is not particularly well-written code either; don't do this in production!
newlines = []
for line in lines:
    entry = line.split()
    if entry[0] == Index:
        # entry now is the correct line
        # index 2 is the place, index 0 the ID, etc.
        entry[2] = Einlagerungsort2
        entry[3] = User2
    newlines.append(" ".join(entry))
# Now write newlines back to the file
with open("Datenbank.txt", "w") as f:
    f.write("\n".join(newlines))
I get hourly email alerts that tell me how much revenue the company has made in the last hour. I want to extract this information into a pandas dataframe so that I can run some analysis on it.
My problem is that I can't figure out how to extract data from the email body in a usable format. I think I need to use regular expressions but I'm not too familiar with them.
This is what I have so far:
import os
import pandas as pd
import datetime as dt
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items

# Empty lists
email_subject = []
email_date = []
email_content = []

# Find emails
for message in messages:
    if message.SenderEmailAddress == 'oracle@xyz.com' and message.Subject.startswith('Demand'):
        email_subject.append(message.Subject)
        email_date.append(message.senton.date())
        email_content.append(message.body)
The email_content list looks like this:
' \r\nDemand: $41,225 (-47%)\t \r\n \r\nOrders: 515 (-53%)\t \r\nUnits: 849 (-59%)\t \r\n \r\nAOV: $80 (12%) \r\nAUR: $49 (30%) \r\n \r\nOrders with Promo Code: 3% \r\nAverage Discount: 21% '
Can anyone tell me how I can split its contents so that I can get the int values of Demand, Orders and Units in separate columns?
Thanks!
You could use a combination of string.split() and string.strip() to first extract each line individually.
string = email_content[0]  # one email body from the list
lines = string.split('\r\n')
lines_stripped = []
for line in lines:
    line = line.strip()
    if line != '':
        lines_stripped.append(line)
This gives you an array like this:
['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Units: 849 (-59%)', 'AOV: $80 (12%)', 'AUR: $49 (30%)', 'Orders with Promo Code: 3%', 'Average Discount: 21%']
You can also achieve the same result in a more compact (pythonic) way:
lines_stripped = [line.strip() for line in string.split('\r\n') if line.strip() != '']
Once you have this array, you use regexes as you correctly guessed to extract the values. I recommend https://regexr.com/ to experiment with your regex expressions.
After some quick experimenting, r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?' should work.
Here is the code that produces a dict from your lines_stripped we created above:
import re

regex = r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?'
matched_dict = {}
for line in lines_stripped:
    match = re.match(regex, line)
    matched_dict[match.groups()[0]] = (match.groups()[1], match.groups()[2])
print(matched_dict)
This produces the following output:
{'AOV': ('$80', '12%)'),
'AUR': ('$49', '30%)'),
'Average Discount': ('21%', ''),
'Demand': ('$41,225', '-47%)'),
'Orders': ('515', '-53%)'),
'Orders with Promo Code': ('3%', ''),
'Units': ('849', '-59%)')}
You asked for Units, Orders and Demand, so here is the extraction:
# Remove the dollar sign before converting to int
# Replace , with empty string
demand_string = matched_dict['Demand'][0].strip('$').replace(',', '')
print(int(demand_string))
print(int(matched_dict['Orders'][0]))
print(int(matched_dict['Units'][0]))
As you can see, Demand is a little bit more complicated because it contains some extra characters that int() can't parse.
Here is the final output of those 3 prints:
41225
515
849
Hope I answered your question! If you have more questions about regex, I encourage you to experiment with regexr; it's very well built!
EDIT: Looks like there is a small issue in the regex causing the final ')' to be included in the last group. This does not affect your question though !
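For completeness, a sketch of one possible fix (a variant pattern, not the one used above): make the key group lazy and stop the last group before the closing parenthesis.

import re

regex = r'(.+?):\s*(\S+)\s*(?:\(([^)]*)\))?'
match = re.match(regex, 'Demand: $41,225 (-47%)')
print(match.groups())  # ('Demand', '$41,225', '-47%')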
So I'm new to Python, besides some experience with Tkinter (some GUI experiments).
I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So a Stefan, living in Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it in a .csv row with columns
"Name", "Adress", "Company", ...
Now I tried to cut and slice everything. For debugging I use "print" (IDE = KATE/KDE + terminal... :-D).
The problem is that the data contains multiple lines after keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
searchKeys = ['Name_OF_Person', 'Adress_HOME', 'Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0]  # only want text/plain.. I'll split right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow ('ID':idea,......) # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose + "\t" + pmt
                            sleep(1)
    csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
@Yaniv
Thank you, I am still trying to understand every step, but I just wanted to give a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The number of keywords in the emails is ~20 words. Additionally, my values are sometimes line-broken by "=".
I was thinking something like:
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there
should i replace ["="," = "] with "" ?
I would go for a "routine" parsing loop over the input lines, maintaining current_key and current_value variables, since a value for a certain key in your data might be "annoying" and spread across multiple lines.
I've demonstrated such a parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with whitespace, I assumed it must be one of those "annoying" values (spread across multiple lines). Such lines are concatenated into a single value, using a configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespace from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespace. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
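To get from those key/value pairs to the .csv row the question asks for, a minimal sketch, assuming the fieldnames from the question's script and a mapping from each search key to its column (the ID and Subject values, here idea and sub, would come from the question's loop over the mbox):

import csv

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
key_to_column = {"Name_OF_Person": "Name", "Adress_HOME": "Adress", "Company_NAME": "Company"}

row = {key_to_column[k]: v for k, v in parse_funky_text(text) if k in key_to_column}
row.update({"ID": idea, "Subject": sub})  # from the question's loop over the mbox

with open("test.csv", "w") as csvfile:
    writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow(row)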
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process string data that spans multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done:
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1] #take second element of the list
person = content.split("Adress_HOME:")[0] # take content before "Adress Home"
content = content.split("Adress_HOME:")[1] #take second element of the list
address = content.split("Company_NAME:")[0] # take content before
company = content.split("Adress_HOME:")[1] #take second element of the list (the remainder) which is company
Normally, I would suggest regex. (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make a regex pattern "cut" across multiple lines, you would use the re.DOTALL option, so that . also matches newlines. So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL)
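A minimal sketch of that idea, assuming the keywords always appear in this order (the non-greedy .*? stops at the next keyword, and split/join collapses the line breaks inside the value):

import re

content = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""

match = re.search(r'Name_OF_Person:(.*?)Adress_HOME:', content, re.DOTALL)
if match:
    person = ' '.join(match.group(1).split())  # collapse whitespace and line breaks
    print(person)  # Stefan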
I am trying to append all the bodies of the Enron emails into one file so that I can process their text with NLTK, eliminating stop words and splitting it into sentences.
My problem is with forwarded and replied messages; I am not sure how to clean them.
This is my code so far:
import os, email, sys, re, nltk, pprint
from email.parser import Parser

rootdir = '/Users/art/Desktop/maildir/lay-k/elizabeth'

# Function that appends all the body parts of emails
def email_analyse(inputfile, email_body):
    with open(inputfile, "r") as f:
        data = f.read()
    email = Parser().parsestr(data)
    email_body.append(email.get_payload())
# End of function

# Defining a list that will contain bodies
email_body = []

# Call the function email_analyse for every file in the directory
for directory, subdirectory, filenames in os.walk(rootdir):
    for filename in filenames:
        email_analyse(os.path.join(directory, filename), email_body)
# The stage where I clean the emails
with open("email_body.txt", "w") as f:
    for val in email_body:
        if val:
            val = val.replace("\n", "")
            val = val.replace("=01", "")
            # For some reason I had many ==20 and =01 sequences in my text
            val = val.replace("==20", "")
            f.write(val)
            f.write("\n")
This is the partial output:
Well, with the photographer and the band, I would say we've pretty much outdone our budget! Here's the information on the photographer. I have a feeling for some of the major packages we could negotiate at least a couple of hours at the rehearsal dinner. I have no idea how much this normally costs, but he isn't cheap!---------------------- Forwarded by Elizabeth Lay/HOU/AZURIX on 09/13/99 07:34 PM ---------------------------acollins@reggienet.com on 09/13/99 05:37:37 PMPlease respond to acollins@reggienet.com To: Elizabeth Lay/HOU/AZURIX@AZURIXcc: Subject: Denis Reggie Wedding PhotographyHello Elizabeth:Congratulations on your upcoming marriage! I am Ashley Collins, Mr.Reggie's Coordinator. Linda Kessler forwarded your e.mail address to me sothat I may provide you with information on photography coverage for Mr.Reggie's wedding photography.
So the result is not pure text at all. Any ideas on how to do it right?
You might want to look at regular expressions to parse the forwarded and reply text because the format should be consistent throughout the corpus.
For deleting the forwarded text, you could use a regex like this:
-{4,}(.*)(\d{2}:\d{2}:\d{2})\s*(PM|AM)
Which will match all the content between four or more hyphens and the time in the format XX:XX:XX PM. Matching 3 dashes would probably work fine, too. We just want to avoid matching hyphens and em-dashes in the email body. You can play around with this regex and write your own for matching To and Subject headers at this link: https://regex101.com/r/VGG4bu/1/
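A sketch of applying that pattern in Python (the sample string is condensed from the output above; re.DOTALL is passed so the match can span line breaks if any survive):

import re

body = ("Here's the information on the photographer."
        "---------------------- Forwarded by Elizabeth Lay/HOU/AZURIX "
        "on 09/13/99 07:34 PM ---------------------------"
        "acollins@reggienet.com on 09/13/99 05:37:37 PM")

# Drop everything from the forwarded-message marker through the timestamp
cleaned = re.sub(r'-{4,}(.*)(\d{2}:\d{2}:\d{2})\s*(PM|AM)', '', body, flags=re.DOTALL)
print(cleaned)  # Here's the information on the photographer.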
You can also look at section 3.4 of the NLTK book, which talks about regular expressions in Python: http://www.nltk.org/book/ch03.html
Good luck! This sounds like an interesting project.
If you're still interested in this problem, I've created a pre-processing script specifically for the Enron dataset. You'll notice that a new email always starts with the tag 'Subject:', so I implemented a cleanup which removes all text to the left of the last 'Subject:' tag, which strips away all the forwarded messages. Specific code:
# Cleaning content column
df['content'] = df['content'].str.rsplit('Subject: ').str[-1]
df['content'] = df['content'].str.rsplit(' --------------------------- ').str[-1]
Overall script, if interested:
# Importing the dataset, and defining columns
import pandas as pd
df = pd.read_csv('enron_05_17_2015_with_labels_v2.csv', usecols=[2,3,4,13], dtype={13:str})
# Building a count of how many people are included in an email
df['Included_In_Email'] = df.To.str.count(',')
df['Included_In_Email'] = df['Included_In_Email'].apply(lambda x: x+1)
# Dropping any NaN's, and emails with >15 recipients
df = df.dropna()
df = df[~(df['Included_In_Email'] >=15)]
# Separating remaining emails into a line-per-line format
df['To'] = df.To.str.split(',')
df2 = df.set_index(['From', 'Date', 'content', 'Included_In_Email'])['To'].apply(pd.Series).stack()
df2 = df2.reset_index()
# Renaming the new column, dropping the unneeded helper column, and changing indices
del df2['level_4']
df2 = df2.rename(columns={0: 'To'})
df2 = df2[['Date','From','To','content','Included_In_Email']]
del df
# Cleaning email addresses
df2['From'] = df2['From'].map(lambda x: x.lstrip("frozenset"))
df2['To'] = df2['To'].map(lambda x: x.lstrip("frozenset"))
df2['From'] = df2['From'].str.strip("<\>(/){?}[:]*, ")
df2['To'] = df2['To'].str.strip("<\>(/){?}[:]*, ")
df2['From'] = df2['From'].str.replace("'", "")
df2['To'] = df2['To'].str.replace("'", "")
df2['From'] = df2['From'].str.replace('"', "")
df2['To'] = df2['To'].str.replace('"', "")
# Accounting for users having different emails
email_dict = pd.read_csv('dict_email.csv')
df2['From'] = df2.From.replace(email_dict.set_index('Old')['New'])
df2['To'] = df2.To.replace(email_dict.set_index('Old')['New'])
del email_dict
# Removing emails not containing @enron
df2['Enron'] = df2.From.str.count('@enron')
df2['Enron'] = df2['Enron'] + df2.To.str.count('@enron')
df2 = df2[df2.Enron != 0]
df2 = df2[df2.Enron != 1]
del df2['Enron']
# Adding job roles which correspond to staff
import csv
with open('dict_role.csv') as f:
role_dict = dict(filter(None, csv.reader(f)))
df2['Sender_Role'] = df2['From'].map(role_dict)
df2['Receiver_Role'] = df2['To'].map(role_dict)
df2 = df2[['Date','From','To','Sender_Role','Receiver_Role','content','Included_In_Email']]
del role_dict
# Cleaning content column
df2['content'] = df2['content'].str.rsplit('Subject: ').str[-1]
df2['content'] = df2['content'].str.rsplit(' --------------------------- ').str[-1]
# Condensing records into one line per email exchange, adding weights
Weighted = df2.groupby(['From', 'To']).count()
# Adding weight column, removing redundant columns, splitting indexed column
Weighted['Weight'] = Weighted['Date']
Weighted = Weighted.drop(['Date', 'Sender_Role', 'Receiver_Role', 'content', 'Included_In_Email'], axis=1)
Weighted.reset_index(inplace=True)
# Re-adding job-roles to staff
with open('dict_role.csv') as f:
role_dict = dict(filter(None, csv.reader(f)))
Weighted['Sender_Role'] = Weighted['From'].map(role_dict)
del role_dict
# Dropping exchanges with a weight of <= x, or no identifiable role
Weighted2 = Weighted[~(Weighted['Weight'] <= 3)]
Weighted2 = Weighted2.dropna()
Two dictionaries are used in the script (for matching job-roles and changing multiple emails for the same person), and can be found here.
I am doing a small script in Python, but since I am quite new I got stuck on one part:
I need to get the timing and text from a .srt file. For example, from
1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org
I need to get:
00:00:01,000 --> 00:00:04,074
and
Subtitles downloaded from www.OpenSubtitles.org.
I have already managed to make the regex for the timing, but I am stuck on the text. I've tried to use a lookbehind where I use my regex for the timing:
( ?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+) )\w+
but with no effect. Personally, I think that using a lookbehind is the right way to solve this, but I am not sure how to write it correctly. Can anyone help me? Thanks.
Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:
an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line
... and repeat. Note the "one or more" part - you might have to capture 1, 2, or 20 lines of subtitle content after the time code.
So, just take advantage of the structure. In this way you can parse everything in just one pass, without needing to put more than one line into memory at a time and still keeping all the information for each subtitle together.
from itertools import groupby
# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]
For example, using the example on the SRT doc page, I get:
res
Out[60]:
[['1\n',
'00:02:17,440 --> 00:02:20,375\n',
"Senator, we're making\n",
'our final approach into Coruscant.\n'],
['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]
And I could further transform that into a list of meaningful objects:
from collections import namedtuple
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub  # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))
subs
Out[65]:
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
Disagree with @roippi. Regex is a very nice solution to text matching, and the regex for this solution is not tricky.
import re

f = open(yoursrtfile)
# Parse the file content
content = f.read()
# Find all results in content.
# The first group retrieves the timing, \s+ matches the whitespace in between,
# and the (.+) retrieves the text content after that.
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print(result)
Number: ^[0-9]+$
Time: ^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
String: .*[a-zA-Z]+.*
Hope this helps.
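A quick sketch of how these patterns could be applied line by line (assuming the cleaned-up String pattern above; 'movie.srt' is a placeholder file name):

import re

time_re = re.compile(r'^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> '
                     r'[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$')

with open('movie.srt') as f:
    for line in f:
        line = line.strip()
        if time_re.match(line):
            print('timing:', line)
        elif re.match(r'^[0-9]+$', line):
            print('cue number:', line)
        elif re.match(r'.*[a-zA-Z]+.*', line):
            print('text:', line)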
Thanks @roippi for this excellent parser.
It helped me a lot to write an srt-to-stl converter in less than 40 lines (in Python 2 though, as it has to fit in a larger project).
from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple

# prepare - adapt to your needs or use sys.argv
inputname = 'FR.srt'
outputname = 'FR.stl'

stlheader = """
$FontName = Arial
$FontSize = 34
$HorzAlign = Center
$VertAlign = Bottom
"""

def converttime(sttime):
    "convert from srt time format (0...999) to stl one (0...25)"
    st = sttime.split(',')
    return "%s:%02d" % (st[0], round(25 * float(st[1]) / 1000))
# load
with open(inputname, 'r') as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, content = sub[0], sub[1], sub[2:]  # py2 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

# write
with open(outputname, 'w') as F:
    F.write(stlheader)
    for sub in subs:
        F.write("%s , %s , %s\n" % (converttime(sub.start), converttime(sub.end), "|".join(sub.content)))
For the time:
pattern = r"(\d{2}:\d{2}:\d{2},\d{3}?.*)"
None of the pure regex solutions above worked for the real-life srt files.
Let's take a look at the following SRT-patterned text:
1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line
2
00:02:20,476 --> 00:02:22,501
as well as a single line
3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは
Take note that:
text may contain unicode characters;
text can consist of several lines;
every cue starts with an integer value and ends with a blank line, where both Unix-style and Windows-style (CR/LF) line endings are accepted.
Here is the working regex:
\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))
https://regex101.com/r/qICmEM/1
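A short usage sketch of that pattern (the file name is a placeholder; findall returns one tuple per cue, with the timing in the first group and the text block in the second):

import re

pattern = re.compile(
    r'\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))'
)

with open('movie.srt', encoding='utf-8') as f:
    for timing, text, _, _ in pattern.findall(f.read()):
        print(timing)
        print(text.strip())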