I get hourly email alerts that tell me how much revenue the company has made in the last hour. I want to extract this information into a pandas DataFrame so that I can run some analysis on it.
My problem is that I can't figure out how to extract data from the email body in a usable format. I think I need to use regular expressions, but I'm not too familiar with them.
This is what I have so far:
import os
import pandas as pd
import datetime as dt
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)  # 6 = olFolderInbox
messages = inbox.Items

# Empty lists
email_subject = []
email_date = []
email_content = []

# Find emails
for message in messages:
    if message.SenderEmailAddress == 'oracle@xyz.com' and message.Subject.startswith('Demand'):
        email_subject.append(message.Subject)
        email_date.append(message.SentOn.date())
        email_content.append(message.Body)
The email_content list looks like this:
' \r\nDemand: $41,225 (-47%)\t \r\n \r\nOrders: 515 (-53%)\t \r\nUnits: 849 (-59%)\t \r\n \r\nAOV: $80 (12%) \r\nAUR: $49 (30%) \r\n \r\nOrders with Promo Code: 3% \r\nAverage Discount: 21% '
Can anyone tell me how I can split its contents so that I can get the int values of Demand, Orders and Units in separate columns?
Thanks!
You could use a combination of string.split() and string.strip() to first extract each line individually.
string = email_content[0]  # one email body from the list
lines = string.split('\r\n')
lines_stripped = []
for line in lines:
    line = line.strip()
    if line != '':
        lines_stripped.append(line)
This gives you a list like this:
['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Units: 849 (-59%)', 'AOV: $80 (12%)', 'AUR: $49 (30%)', 'Orders with Promo Code: 3%', 'Average Discount: 21%']
You can also achieve the same result in a more compact (pythonic) way:
lines_stripped = [line.strip() for line in string.split('\r\n') if line.strip() != '']
Once you have this list, you can use regexes, as you correctly guessed, to extract the values. I recommend https://regexr.com/ for experimenting with your regex patterns.
After some quick experimenting, r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?' should work.
Here is the code that produces a dict from your lines_stripped we created above:
import re

regex = r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?'
matched_dict = {}
for line in lines_stripped:
    match = re.match(regex, line)
    matched_dict[match.groups()[0]] = (match.groups()[1], match.groups()[2])
print(matched_dict)
This produces the following output:
{'AOV': ('$80', '12%)'),
'AUR': ('$49', '30%)'),
'Average Discount': ('21%', ''),
'Demand': ('$41,225', '-47%)'),
'Orders': ('515', '-53%)'),
'Orders with Promo Code': ('3%', ''),
'Units': ('849', '-59%)')}
You asked for Units, Orders and Demand, so here is the extraction:
# Strip the dollar sign and remove the thousands separator before converting to int
demand_string = matched_dict['Demand'][0].strip('$').replace(',', '')
print(int(demand_string))
print(int(matched_dict['Orders'][0]))
print(int(matched_dict['Units'][0]))
As you can see, Demand is a little more complicated because it contains extra characters that Python can't handle when converting to int.
Here is the final output of those 3 prints:
41225
515
849
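If you want these in separate DataFrame columns as you asked, one way is to run the same parsing over each element of email_content. A sketch; the parse_email helper is my addition, and it assumes every alert parses cleanly (real alerts may need error handling):
def parse_email(body):
    lines = [line.strip() for line in body.split('\r\n') if line.strip() != '']
    matched = {}
    for line in lines:
        m = re.match(regex, line)
        matched[m.groups()[0]] = m.groups()[1]
    # Convert the three fields you asked about to int
    return {'Demand': int(matched['Demand'].strip('$').replace(',', '')),
            'Orders': int(matched['Orders']),
            'Units': int(matched['Units'])}

df = pd.DataFrame([parse_email(body) for body in email_content], index=email_date)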
Hope I answered your question! If you have more questions about regex, I encourage you to experiment with regexr; it's very well built!
EDIT: Looks like there is a small issue in the regex causing the final ')' to be included in the last group. This does not affect your question though !
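If you ever do need that percentage group cleanly, a slightly tightened pattern (my variant, only checked against the sample lines above) keeps the closing parenthesis out of the last group:
# '[^)]*' stops the last group before a closing parenthesis
regex = r'([\S\s]*):\s*(\S*)\s*\(?([^)]*)\)?'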
I am trying to write a for-loop that iterates through individual rows. It uses regex to find a specific date identified by name. It then strips the date name and saves the date itself as a list object for placement in an appropriate empty column.
My issue is that some rows have multiple dates of the same name (e.g. 'Exit Date: xx/xx/xxxx'), and the re.findall in my for-loop is only saving the first date that matches the pattern, instead of all of them.
My barebones regex test, which only works on a single row (37), finds all dates and prints them appropriately. However, the moment I expand the regex pattern to re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x) in the for-loop, it begins to return only a single date instead of all of them (if there happens to be more than one 'Exit Date: xx/xx/xxxx').
x = exit_note.loc[37, 'Exit Note']
match = re.findall(r'(\d{2}/\d{2}/\d{4})', x)
if match:
    print(match)
else:
    print('no match')
Prints out ['03/10/2020', '03/06/2020']
The actual for-loop code is as follows:
exit_note_date = []
for index, row in exit_note.iterrows():
    x = row['Exit Note']
    edmatch = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x)
    if len(edmatch) > 0:
        edstring = edmatch[0].strip('Exit Date: ')
        exit_note_date.append(edstring)
    else:
        exit_note_date.append('null')
    print(edstring)
exit_note['Exit Date note'] = pd.Series(exit_note_date)
The for-loop works, but re.findall only retrieves a single date per row before inserting it into the appropriate column.
Any ideas on how to make the for-loop enter the appropriate number of dates into the date column when more than one date exists in the row? I am new to Python, new to regex, and new to pandas, but my understanding is that re.findall should return each and every match, not just the first one it finds.
Thanks!
As an additional answer, you can use the ttp parser to get all the data. You can also use regex to capture the data, but as far as I understand you won't even need regex if you use the ttp option.
from ttp import ttp
import json

template_date = """
Exit Date: {{date1}}
"""

template_date2 = """
Exit Date: {{date1}} {{date2}}
"""

templates = [template_date, template_date2]

with open("text_text.txt") as f:
    data_to_parse = f.read()

def parsing(data_to_parse, ttp_template):
    parser = ttp(data=data_to_parse, template=ttp_template)
    parser.parse()
    # print result in JSON format
    results = parser.result(format='json')[0]
    #print(results)
    # converting str to json
    result = json.loads(results)
    print(result)

for ttp_template in templates:
    parsing(data_to_parse, ttp_template)
see the output:
[{'date1': '03/10/2020'}]
[{'date1': '03/10/2020', 'date2': '03/06/2020'}]
see text_text.txt file:
Exit Date: 03/10/2020
Exit Date: 03/10/2020 03/06/2020
Regards.
I was able to come up with a for-loop that does what I want. Thanks to everyone for your assistance!
exit_note_date = []
for index, row in exit_note.iterrows():
    x = row['Exit Note']
    edmatch = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x)
    if len(edmatch) > 0:
        # str.strip removes a *set* of characters, not a prefix; it is safe
        # here only because the dates contain nothing but digits and slashes
        edstring = [exit_date.strip('Exit Date: ') for exit_date in edmatch]
        exit_note_date.append(edstring)
    else:
        exit_note_date.append('null')
exit_note['Exit Date note'] = pd.Series(exit_note_date)
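As a side note, moving the capture group inside the pattern would make re.findall return just the dates, so the strip() step wouldn't be needed at all. A sketch of that variant:
# The group now captures only the date, so no post-processing is required
edmatch = re.findall(r'Exit Date:.*?(\d{2}/\d{2}/\d{4})', x)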
I need to read lines from a text file and extract the quoted person name and quoted text from each line.
Lines look similar to this:
"Am I ever!", Homer Simpson responded.
Remarks:
Hint: Use the object returned by the 'open' method to get the file handle. Each line you read is expected to end with a new-line character; remove it as follows: line_cln = line.strip()
Assume each line matches one of three options: the first set of patterns, where the person name appears before the quoted text; the second set of patterns, where the quoted text appears before the person; or an empty line.
Complete the transfer_raw_text_to_dataframe function to return a dataframe with the extracted person name and text as explained above. The information is expected to be extracted from the lines of the given 'filename' file.
The returned dataframe should include two columns:
person_name - containing the extracted person name for each line.
extracted_text - containing the extracted quoted text for each line.
The returned values:
dataframe - The dataframe with the extracted information as described above.
Important note: if a line does not contain any quotation pattern, no information should be saved in the corresponding row of the dataframe.
What I have so far (edited):
def transfer_raw_text_to_dataframe(filename):
    data = open(filename)
    quote_pattern = '"(.*)"'
    name_pattern = "\w+\s\w+"
    df = open(filename, encoding='utf8')
    lines = df.readlines()
    df.close()
    dataframe = pd.DataFrame(columns=('person_name', 'extracted_text'))
    i = 0
    for line in lines:
        quote = re.search(quote_pattern, line)
        extracted_quotation = quote.group(1)
        name = re.search(name_pattern, line)
        extracted_person_name = name.group(0)
        df2 = {'person_name': extracted_person_name, 'extracted_text': extracted_quotation}
        dataframe = dataframe.append(df2, ignore_index=True)
        dataframe.loc[i] = [person_name, extracted_text]
        i = i + 1
    return dataframe
The dataframe is created with the correct shape. The problem is that the person name in each row is 'Oh man' and the quote is "Oh man, that guy's tough to love." (in all of them), which is weird because that text isn't even in the txt file...
Can anyone help me fix this?
Edit: I need to extract from a simple txt file that contains only these lines:
"Am I ever!", Homer Simpson responded.
"Hmmm. So... is it okay if I go to the women's conference with Chloe?", Lisa Simpson answered.
"Really? Uh, sure.", Bart Simpson answered.
"Sounds great.", Bart Simpson replied.
Homer Simpson responded: "Danica Patrick in my thoughts!"
C. Montgomery Burns: "Trust me, he'll say it, or I'll bust him down to Thursday night vespers."
"Gimme that torch." Lisa Simpson said.
"No! No, I've got a lot more mothering left in me!", Marge Simpson said.
"Oh, Homie, I don't care if you're a billionaire. I love you just because you're..." Marge Simpson said.
"Damn you, e-Bay!" Homer Simpson answered.
Possibly in such a way:
import pandas as pd
import re

with open("12.txt", "r") as f:
    data = f.read()
# print(data)

# ########## findall text in quotes
m = re.findall(r'\"(.+)\"', data)
print("RESULT: \n", m)
df = pd.DataFrame({'rep': m})
print(df)

# ########## retrieve and replace text in quotes for nothing
m = re.sub(r'\"(.+)\"', r'', data)

# ########## get First Name & Last Name from the rest text in each line
regex = re.compile("([A-Z]{1}[a-z]+ [A-Z]{1}[a-z]+)")
mm = regex.findall(m)
df1 = pd.DataFrame({'author': mm})
print(df1)

# ########## join 2 dataframes
fin = pd.concat([df, df1], axis=1)
print(fin)
All the prints are just for checking (remove them for cleaner code).
Just "C. Montgomery Burns" is losing his first letter...
For a loop over a folder:
import glob

# All files according to the mask, ending with .txt
print(glob.glob("C:\\MyFolder\\*.txt"))
mylist = [ff for ff in glob.glob("C:\\MyFolder\\*.txt")]
print("file_list:\n", mylist)
for filepath in mylist:
    pass  # do smth with each filepath
To collect all the dfs you're getting from the files, something like this (e.g. reading csv files by mask):
import glob
import pandas as pd

def dfs_collect():
    mylist = [ff for ff in glob.glob("C:\\MyFolder\\*.txt")]  # all files by mask
    print("file_list:\n", mylist)
    dfa = pd.concat((pd.read_csv(file, sep=';', encoding='windows-1250', index_col=False) for file in mylist),
                    ignore_index=True)
    return dfa
But to get the content of your files, an example of the content is needed. Without an example of your txt file (with dummy info but its real structure kept), I doubt anybody will try to imagine how it should look.
I think the following does what you need. Please verify whether the output is accurate; I'll explain any line that is unclear.
import pandas as pd
import numpy as np
import nltk
from nltk.tree import ParentedTree
import typing as t  # This is optional

# Using `read_csv` to read in the text because I find it easier
data = pd.read_csv("dialog.txt", header=None, sep="~", quoting=3)
dialouges = data.squeeze()  # Getting a Series from the above DF with one column

def tag_sentence(tokenized: t.List[str]) -> t.List[t.Tuple[str, str]]:
    tagged = nltk.pos_tag(tokenized)
    tagged = [(token, tag) if tag not in {"``", "''"} else (token, "Q") for token, tag in tagged]
    keep = {"Q", "NNP"}
    renamed = [(token, "TEXT") if tag not in keep else (token, tag) for token, tag in tagged]
    return renamed

def get_parse_tree(tagged_sent):
    grammar = """
    NAME: {<NNP>+}
    WORDS: {<TEXT>+}
    DIALOUGE: {<Q><WORDS|NAME>+<Q>}
    """
    cp = nltk.RegexpParser(grammar)
    parse_tree = cp.parse(tagged_sent)
    return parse_tree

def extract_info(parse_tree):
    ptree = ParentedTree.convert(parse_tree)
    trees = list(ptree.subtrees())
    root = ptree.root()
    for subtree in trees[1:]:
        if subtree.parent() == root:
            if subtree.label() == "DIALOUGE":
                dialouge = ' '.join(word for word, _ in subtree.leaves()[1:-1])  # Skipping quotation marks
            if subtree.label() == "NAME":
                person = ' '.join(word for word, _ in subtree.leaves())
    return dialouge, person

def process_sentence(sentence):
    return extract_info(get_parse_tree(tag_sentence(nltk.word_tokenize(sentence))))

processed = [process_sentence(line) for line in dialouges]
result = pd.DataFrame(processed, columns=["extracted_text", "person_name"])
The resulting DataFrame has one row per input line, with the quoted text in extracted_text and the speaker in person_name.
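One practical note: on a fresh environment this code needs the NLTK models behind word_tokenize and pos_tag, which are downloaded separately:
import nltk

nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger models used by pos_tag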
I'm trying to read in files from the released Enron Dataset for a data science project. My problem lies in how I'm trying to read in my files. Basically, the first 15 or so lines of every email are information about the email itself: To, From, Subject, etc. So you would think to read in the first 15 lines and assign them to an array. The problem is that I'm trying to use whitespace in my algorithm, but sometimes there can be something like 50 lines for the To field alone.
Example of a (slightly truncated) troublesome email:
Message-ID: <29403111.1075855665483.JavaMail.evans@thyme>
Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)
From: rebecca.cantrell@enron.com
To: stephanie.miller@enron.com, ruth.concannon@enron.com, jane.tholt@enron.com,
	tori.kuykendall@enron.com, randall.gay@enron.com,
	phillip.allen@enron.com, timothy.hamilton@enron.com,
	robert.superty@enron.com, colleen.sullivan@enron.com,
	donna.greif@enron.com, julie.gomez@enron.com
Subject: Final Filed Version -- SDG&E Comments
My code:
def readEmailHead(username, emailNum):
    text = ""
    file = open(corpus_root + username + '/all_documents/' + emailNum)
    for line in file:
        text += line
    file.close()
    email = text.split('\n')
    count = 0
    for line in email:
        mem = []
        if line == '':
            pass
        else:
            if line[0].isspace():
                print(line, count)
                email[count-1] += line
                del email[count]
        count += 1
    return [email[:20]]
Right now it can handle emails with one extra line in the subject/to/from/etc., but no more than that. Any ideas?
No need to reinvent the wheel. The module email.parser can be your friend. I include a more portable way of constructing the file name, so to just parse the header you could use the built-in parser and write a function like:
import email.parser
import os.path

def read_email_header(username, email_number, corpus_root='~/tmp/data/enron'):
    corpus_root = os.path.expanduser(corpus_root)
    fname = os.path.join(corpus_root, username, 'all_documents', email_number)
    with open(fname, 'rb') as fd:
        header = email.parser.BytesHeaderParser().parse(fd)
    return header

mm = read_email_header('dasovich-j', '13078.')
print(mm.keys())
print(mm['Date'])
print(mm['From'])
print(mm['To'].split())
print(mm['Subject'])
Running this gives:
['Message-ID', 'Date', 'From', 'To', 'Subject', 'Mime-Version', 'Content-Type', 'Content-Transfer-Encoding', 'X-From', 'X-To', 'X-cc', 'X-bcc', 'X-Folder', 'X-Origin', 'X-FileName']
Fri, 25 May 2001 02:50:00 -0700 (PDT)
rebecca.cantrell@enron.com
['ray.alvarez@enron.com,', 'steve.walton@enron.com,', 'susan.mara@enron.com,', 'alan.comnes@enron.com,', 'leslie.lawner@enron.com,', 'donna.fulton@enron.com,', 'jeff.dasovich@enron.com,', 'christi.nicolay@enron.com,', 'james.steffes@enron.com,', 'jalexander@gibbs-bruns.com,', 'phillip.allen@enron.com,', 'linda.noske@enron.com,', 'dave.perrino@enron.com,', 'don.black@enron.com,', 'robert.frank@enron.com,', 'stephanie.miller@enron.com,', 'barry.tycholiz@enron.com']
Reuters -- FERC told Calif natgas to reach limit this summer
The easy way to approach problems like this (setting aside the good idea of using an existing parser) is to treat the transformation as being performed on one list of lines to yield another list, rather than trying to mutate an existing list while looping over it. Something like:
new = []
for l in old:
    if is_continuation(l):
        new[-1] += l
    else:
        new.append(l)
For all but the longest lists (where del old[i] is expensive anyway) this is quite efficient if most lines are not continuations since they can be reused in new as-is.
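Here, is_continuation is whatever test fits your data; for the Enron headers above, a continuation line starts with whitespace (that is how RFC 2822 folds long header fields), so a minimal version could be:
def is_continuation(line):
    # Folded header lines begin with a space or tab
    return line[:1].isspace()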
Maybe use regular expressions according to your needs. For example, you can identify the sent-to email addresses as follows:
import re

sent_to = []

def collect_sent_to(text):  # pass in the text part from the file you want
    email = re.search(r'(.+@.+\..+)', text)  # regex pattern to match an email
    if email:
        sent_to.append(list(email.groups()))  # adds the email to the sent_to list for each email
So I'm new to Python, apart from some experience with tkinter (some GUI experiments).
I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So a Stefan, living in Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it in a .csv row with the columns "Name", "Adress", "Company", ...
Now I tried to cut and slice everything. For debugging I use "print" (IDE = KATE/KDE + terminal... :-D).
Problem is that the data contains multiple lines after keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
searchKeys = ['Name_OF_Person', 'Adress_HOME', 'Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0]  # only want text/plain.. I'll split right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow({'ID': idea, ......})  # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose + "\t" + pmt
                            sleep(1)
    csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT 2:
@Yaniv Thank you, I am still trying to understand every step, but just wanted to give a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The amount of keywords in the emails is ~20 words. Additionally, my values are sometimes line-broken by "=".
I was thinking something like:
Search text for Keyword A,
if true:
    search text from Keyword A until Keyword B
    if true:
        copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there
should i replace ["="," = "] with "" ?
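As an aside, those "=20" and trailing "=" artifacts look like quoted-printable encoding (soft line breaks), not real content. If that is what your mbox contains, decoding it before any parsing may simplify everything (a sketch, assuming the payload really is quoted-printable):
import quopri

content = quopri.decodestring(content)  # undoes '=20' and soft '=' line breaks
# alternatively: message.get_payload(decode=True) honors the Content-Transfer-Encoding header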
I would go for a "routine" parsing loop over the input lines, and maintain a current_key and current_value variables, as a value for a certain key in your data might be "annoying", and spread across multiple lines.
I've demonstrated such parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be the case of such "annoying" value (spread across multiple lines). Such lines would be concatenated into a single value, using some configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespaces from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
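To plug that back into the CSV writing from the question, the returned pairs convert straight into the row dict that DictWriter expects. A sketch; the key-to-fieldname mapping is my assumption based on your fieldnames:
key_map = {'Name_OF_Person': 'Name', 'Adress_HOME': 'Adress', 'Company_NAME': 'Company'}
row = {key_map[k]: v for k, v in parse_funky_text(content) if k in key_map}
row.update({'ID': idea, 'Subject': sub})  # idea and sub as in your loop
writer.writerow(row)  # fields missing from row are filled with DictWriter's restval ('')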
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process the string data across multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done.
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1]  # take second element of the list
person = content.split("Adress_HOME:")[0]  # take content before "Adress_HOME"
content = content.split("Adress_HOME:")[1]  # take second element of the list
address = content.split("Company_NAME:")[0]  # take content before "Company_NAME"
company = content.split("Company_NAME:")[1]  # take second element of the list (the remainder), which is company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To let a pattern match across multiple lines, you would use the re.DOTALL flag, which makes . match newlines as well (re.MULTILINE only changes how ^ and $ behave). So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL)
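For instance, a sketch reusing the field names from the question (with a non-greedy .*? so the match stops at the first Adress_HOME):
import re

m = re.search(r'Name_OF_Person:(.*?)Adress_HOME:', content, re.DOTALL)
if m:
    person = m.group(1).strip()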
I am doing a small script in Python, but since I am quite new I got stuck on one part: I need to get the timing and text from a .srt file. For example, from
1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org
I need to get:
00:00:01,000 --> 00:00:04,074
and
Subtitles downloaded from www.OpenSubtitles.org.
I have already managed to make the regex for the timing, but I am stuck on the text. I've tried to use a lookbehind, where I use my regex for the timing:
( ?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+) )\w+
but with no effect. Personally, I think that a lookbehind is the right way to solve this, but I am not sure how to write it correctly. Can anyone help me? Thanks.
Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:
an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line
... and repeat. Note the third item: you might have to capture 1, 2, or 20 lines of subtitle content after the time code.
So, just take advantage of the structure. In this way you can parse everything in just one pass, without needing to put more than one line into memory at a time and still keeping all the information for each subtitle together.
from itertools import groupby

# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]
For example, using the example on the SRT doc page, I get:
res
Out[60]:
[['1\n',
'00:02:17,440 --> 00:02:20,375\n',
"Senator, we're making\n",
'our final approach into Coruscant.\n'],
['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]
And I could further transform that into a list of meaningful objects:
from collections import namedtuple

Subtitle = namedtuple('Subtitle', 'number start end content')

subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub  # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))
subs
Out[65]:
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
Disagree with @roippi. Regex is a very nice solution to text matching, and the regex for this solution is not tricky.
import re

# Open and read the file content
f = open(yoursrtfile)
content = f.read()
# Find all results in content.
# The first big group retrieves the timing, \s+ matches the whitespace in between,
# and (.+) retrieves the text content after that.
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print result
number:^[0-9]+$
Time:
^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
string: *[a-zA-Z]+*
Hope this helps.
Thanks @roippi for this excellent parser.
It helped me a lot to write an srt-to-stl converter in less than 40 lines (in Python 2 though, as it has to fit in a larger project).
from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple

# prepare - adapt to your needs or use sys.argv
inputname = 'FR.srt'
outputname = 'FR.stl'

stlheader = """
$FontName = Arial
$FontSize = 34
$HorzAlign = Center
$VertAlign = Bottom
"""

def converttime(sttime):
    "convert from srt time format (0...999) to stl one (0...25)"
    st = sttime.split(',')
    return "%s:%02d" % (st[0], round(25 * float(st[1]) / 1000))

# load
with open(inputname, 'r') as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, content = sub[0], sub[1], sub[2:]  # py2 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

# write
with open(outputname, 'w') as F:
    F.write(stlheader)
    for sub in subs:
        F.write("%s , %s , %s\n" % (converttime(sub.start), converttime(sub.end), "|".join(sub.content)))
For the time:
pattern = r"(\d{2}:\d{2}:\d{2},\d{3}?.*)"
None of the pure regex solutions above worked for real-life srt files.
Let's take a look at the following SRT-patterned text:
1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line
2
00:02:20,476 --> 00:02:22,501
as well as a single line
3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは
Take note that:
text may contain unicode characters;
text can consist of several lines;
every cue starts with an integer value and ends with a blank line, and both Unix-style and Windows-style CR/LF are accepted.
Here is the working regex :
\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))
https://regex101.com/r/qICmEM/1
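For reference, here is a sketch of how that pattern might be applied in Python (I made the inner groups non-capturing so findall returns clean (timing, text) pairs; the file name is a placeholder, and each cue's text is assumed to end with a newline):
import re

CUE = re.compile(r'\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]'
                 r'((?:.+\r?\n)+(?=(?:\r?\n)?))')

with open('subtitles.srt', encoding='utf-8') as f:  # placeholder file name
    srt_text = f.read()

for timing, text in CUE.findall(srt_text):
    print(timing, '|', ' '.join(text.split()))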