Python: If String with Dynamic Variables - python

Ok, This might be the wrong wording and I would love some one to correct me if so. I am trying to find out if a string contains a certain phrase, even if parts of that phrase us dynamic.
So for example, the string could be:
Hi there, Jordan has enrolled at St. Thomas on 10/02/19
Hi there, Lisa has enrolled at St. Thomas on 16/11/19
Hi there, Craig has enrolled at Sirius Academy on 12/10/19
In my python I have this:
if "Hi there, {$0} has enrolled at {$1} on ${2}" in email_body:
print("Someone new is arriving...")
However it does not fire. If I print email_body it shows me the email so the problem is with the if statement and the regex detection.
Weird Result Edit:
This is my code:
data = re.findall('Hi there, (.*?) has enrolled at (.*?) on (.*?)', message_body)[0]
print(data)
returns:
('Lisa', 'St. Thomas', '')
For some reason the third value is missing.
when i print(email_body) I am getting:
Hi there, Lisa has enrolled at St. Thomas on 16thSept2017

You can use re.findall:
import re
email_body = 'Hi there, Lisa has enrolled at St. Thomas on 16/11/19'
if re.findall('Hi there, [\w\W]+ has enrolled at [\w\W]+ on [\w\W]+', email_body):
print("Someone new is arriving...")
Regarding your recent comment, if you would like the entire line, you can just do this:
email_body = 'Hi there, Lisa has enrolled at St. Thomas on 16/11/19'
data = re.findall('Hi there, [\w\W]+ has enrolled at [\w\W]+ on [\w\W]+', email_body)
if data:
print(data[0])
Output:
'Hi there, Lisa has enrolled at St. Thomas on 16/11/19'
New Edit: More complex string
email_body1 = '53ewwffHi there, Lisa has enrolled at St. Thomas on 16/11/19\n \n dfdsg 45435'
email_body2 = "Hi there, Lisa has enrolled at St. Thomas on 16thSept2017"
data = re.findall('Hi there, (.*?) has enrolled at (.*?) on ([a-zA-Z0-9/]+)', email_body1)
data1 = re.findall('Hi there, (.*?) has enrolled at (.*?) on ([a-zA-Z0-9/]+)', email_body2)
print(data[0])
print(data1[0])
Output:
('Lisa', 'St. Thomas', '16/11/19')
('Lisa', 'St. Thomas', '16thSept2017')

You're correct that you'll want to use regular expressions here. For example:
>>> import re
>>> r = re.match(r'Hi there, (.+) has enrolled at (.+) on (.+)', 'Hi there, Jordan has enrolled at St. Thomas on 10/02/19')
>>> r.groups()
('Jordan', 'St. Thomas', '10/02/19')
To use them:
>>> person, place, day = r.groups()
>>> '{} / {} / {}'.format(person, place, day)
'Jordan / St. Thomas / 10/02/19'

Related

Capturing only specific sections/patterns of string with Regex

I have the following strings, which always follow a standard format:
'On 10/31/2018, Sally Brown picked 25 apples at the orchard.'
'On 11/01/2018, John Smith picked 12 peaches at the orchard.'
'On 09/15/2018, Jim Roe picked 10 pears at the orchard.'
I want to extract certain data fields into a series of lists:
['10/31/2018','Sally Brown','25','apples']
['11/01/2018','John Smith','12','peaches']
['09/15/2018','Jim Roe','10','pears']
As you can see, I need some of the sentence structure to be recognized, but not captured, so the program has context for where the data is located. The Regex that I thought would work is:
(?<=On\s)\d{2}\/\d{2}\/\d{4},\s(?=[A-Z][a-z]+\s[A-Z][a-z]+)\s.+?(?=\d+)\s(?=[a-z]+)\sat\sthe\sorchard\.
But of course, that is incorrect somehow.
This may be a simple question for someone, but I'm having trouble finding the answer. Thanks in advance, and someday when I'm more skilled I'll pay it forward on here.
use \w+ to match any word or [a-zA-Z0-9_]
import re
str = ''''On 10/31/2018, Sally Brown picked 25 apples at the orchard.'
'On 11/01/2018, John Smith picked 12 peaches at the orchard.'
'On 09/15/2018, Jim Roe picked 10 pears at the orchard.'''
arr = re.findall('On\s(.*?),\s(\w+\s\w+)\s\w+\s(\d+)\s(\w+)', str)
print arr
# [('10/31/2018', 'Sally Brown', '25', 'apples'),
# ('11/01/2018', 'John Smith', '12', 'peaches'),
# ('09/15/2018', 'Jim Roe', '10', 'pears')]

Parse a very large text file with Python?

So, the file has about 57,000 book titles, author names and a ETEXT No. I am trying to parse the file to only get the ETEXT NOs
The File is like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
And this is what I tried:
def search_by_etext():
fhand = open('GUTINDEX.ALL')
print("Search by ETEXT:")
for line in fhand:
if not line.startswith(" [") and not line.startswith("~"):
if not line.startswith(" ") and not line.startswith("TITLE"):
words = line.rstrip()
words = line.lstrip()
words = words[-7:]
print (words)
search_by_etext()
Well the code mostly works. However for some lines it gives me part of title or other things. Like:
This kind of output(), containing 'decott' which is a part of author name and shouldn't be here.
2
For this:
The Bashful Earthquake, by Oliver Herford                                56765
[Subtitle: and Other Fables and Verses]
The House of Orchids and Other Poems, by George Sterling                 56764
North Italian Folk, by Alice Vansittart Strettel Carr                    56763
 and Randolph Caldecott
[Subtitle: Sketches of Town and Country Life]
Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson 56762
[Subtitle: New Zealand Board of Science and Art, Manual No. 2]
Universal Brotherhood, Volume 13, No. 10, January 1899, by Various 56761
De drie steden: Lourdes, door Émile Zola 56760
[Language: Dutch]
Another example:
4
For
Rhandensche Jongens, door Jan Lens 56702
[Illustrator: Tjeerd Bottema]
[Language: Dutch]
The Story of The Woman's Party, by Inez Haynes Irwin 56701
Mormon Doctrine Plain and Simple, by Charles W. Penrose 56700
[Subtitle: Or Leaves from the Tree of Life]
The Stone Axe of Burkamukk, by Mary Grant Bruce 56699
[Illustrator: J. Macfarlane]
The Latter-Day Prophet, by George Q. Cannon 56698
[Subtitle: History of Joseph Smith Written for Young People]
Here: Life] shouldn't be there. Lines starting with blank space has been parsed out with this:
if not line.startswith(" [") and not line.startswith("~"):
But Still I am getting those off values in my output results.
Simple solution: regexps to the rescue !
import re
with open("etext.txt") as f:
for line in f:
match = re.search(r" (\d+)$", line.strip())
if match:
print(match.group(1))
the regular expression (\d+)$ will match "at least one space followed by 1 or more digits at the end of the string", and capture only the "one or more digits" group.
You can eventually improve the regexp - ie if you know all etext codes are exactly 5 digits long, you can change the regexp to (\d{5})$.
This works with the example text you posted. If it doesn't properly work on your own file then we need enough of the real data to find out what you really have.
It could be that those extra lines that are not being filtered out start with whitespace other than a " " char, like a tab for example. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space char?
To check for whitespace in general rather than a space char, you'll need to use regular expressions. Try if not re.match(r'^\s', line) and ...

Regex in Python: How to match a word pattern, if not preceded by another word of variable length?

I would like reconstruct full names from photo captions using Regex in Python, by appending last name back to the first name in patterns "FirstName1 and FirstName2 LastName". We can rely on names starting with capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I would need to avoid matching patterns like this: 'Jay Smith and Albert Diamond' and generate a non-existent name 'Smith Diamond'
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
s = 'John and Albert McDonald'
so = re.search('([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
print so.group(1) + ' ' + so.group(2).split()[1]
print so.group(2)
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex epression? Or is there a better way to do what I want? Thanks!
As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Live preview
Swapping out your current pattern, with this pattern should work with your current Python code. You do need to strip() the captured results though.
Which for your examples and current code would yield:
Input
First print
Second print
John and Albert McDonald
John McDonald
Albert McDonald
Stephen Stewart, John and Albert Diamond
John Diamond
Albert Diamond
It was a great day hanging out with John and Stephen Diamond.
John Diamond
Stephen Diamond

How can I split lines based on characters in python?

I've recently started working with Python 2.7 and I've got an assignment in which I get a text file containing data separated with space. I would need to split every line into strings containing only one type of data. Here's an example:
Bruce Wayne 10012-34321 2016.02.20. 231231
John Doe 10201-11021 2016.01.10. 2310456
Chris Taylor 10001-31021 2015.12.30. 524432
James Michael Kent 10210-41011 2016.02.03. 3235332
I want to separate them by name, id, date, balance but the only thing I know is split which I can't really use because the last given name has three parts instead of two. How can I split a line based on charactersWhat could be the solution in this case?
Any help is appreciated.
You'll want to use str.rsplit() and supply a max number of splits, like this:
>>> s = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
>>> s.rsplit(' ', 3)
['James Michael Kent', '10210-41011', '2016.02.03.', '3235332']
>>> s = 'Chris Taylor 10001-31021 2015.12.30. 524432'
>>> s.rsplit(' ', 3)
['Chris Taylor', '10001-31021', '2015.12.30.', '524432']
What you need is to look up in list created by split from last:
To get details
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[-3:]
['10210-41011', '2016.02.03.', '3235332']
ln = 'Bruce Wayne 10012-34321 2016.02.20. 231231'
ln.split()[-3:]
['10012-34321', '2016.02.20.', '231231']
To get names:
ln.split()[:-3]
['Bruce', 'Wayne']
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[:-3]
['James', 'Michael', 'Kent']

Parsing Text Document Lines in Python on Wildcards/Patterns

What I Have
I'm working on parsing a .txt file that contains scheduling information for who works when on a given day. The .txt file looks like this:
START PAGE 0
XYZ Schedule for: Saturday, March 30, 2013
Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech
3/28/2013 2:21:17 PM
END PAGE 0
So the first two and last two lines are not relevant but every other line in the middle is the schedule for one person.
What I Want
I want to parse out the pieces of each line so that I can write it to a .csv file. I can use line.partition(',')[0] to get the last name (the first piece on each line) but after that I'm at a loss. I need to communicate the following to Python:
The part after the , to a number is a section (first
name)
The part from the first number to either an a or a p
(for am or pm) is another section (start time)
The part from the
number just after that a or p to the next a or p is another
section (end time)
Finally, the remaining section is another
section (the type/position of the shift.)
A line in my resulting csv file might look like this:
Barnes,Michael,8:00a,10:00a,Tech
Things to Note
1) One person can have more than one shift during a day.
2) Some people have a nickname in parentheses but some don't.
3) If Python had wild cards like # for a number and * for anything I could see how I might be able to keep using partition and keep splitting the remaining pieces, something like this:
for line in input:
name = str(line.partition(',')[0]+','+str(line.partition(',')[2].split(#)[0]))
output.write("".join(x for x in name))
output.write("\r\n")
However, Python doesn't seem to use wildcards like that. Also, this seems like a very inelegant solution.
This should be enough to get you started:
import re
data = '''Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech'''
print re.findall(r'(.*?)(\d{1,2}:\d\d[ap])(\d{1,2}:\d\d[ap])(.*)', data)
prints
[('Barnes, Michael', '8:00a', '10:00a', 'Tech'),
('Collins, Jessica', '8:00a', '4:00p', 'Supervisor'),
('Hamilton, Patricia', '8:00a', '10:00a', 'Tech'),
('Smith, Jan', '8:00a', '10:00a', 'Tech'),
('Park, Kimberly', '8:00a', '10:00a', 'Tech'),
('Edwards, Terrell', '10:00a', '12:00p', 'Tech'),
('Green, Harrold', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '2:00p', '4:00p', 'Tech'),
('Hernandez, William (Monte)', '4:00p', '6:30p', 'Supervisor'),
('Tait, Chioma', '4:00p', '6:00p', 'Tech'),
('Hernandez, William (Monte)', '6:30p', '7:00p', 'Supervisor'),
('Hernandez, William (Monte)', '7:00p', '9:00p', 'Supervisor'),
('Tailor, Thomas (Jason)', '9:00p', '12:00a', 'Supervisor'),
('Jones, Deslynne', '10:00p', '12:00a', 'Tech')]
Read the documentation of the re module to understand the regular expression. You can parse the names as a separate step or expand the regex to be more specific. I recommend using the csv module to write to a csv file.
If you get stuck, post specific questions with code.
Assuming that you you know how to remove the first two and last two lines, and that the rest is in a string called s, here is how I would do what you want:
entries = [x.strip() for x in s.split('\n') if x]
for entry in entries:
ind = [i for i,x in enumerate(entry) if x.isdigit() and not entry[i-1].isdigit()]
name = entry[0:ind[0]]
name = name.split(',')
other = entry[ind[0]:]
ind = [-1]+[i for i,x in enumerate(other) if x in ('a', 'p') and other[i-1].isdigit()]
shifts = []
for i in xrange(1, len(ind)):
shifts.append(other[ind[i-1]+1:ind[i]+1])
position = other[ind[-1]+1:]
print(name, shifts, position)
This will work on an arbitrary number of shifts.
Output:
['Barnes', ' Michael'] ['8:00a', '10:00a'] Tech
['Collins', ' Jessica'] ['8:00a', '4:00p'] Supervisor
['Hamilton', ' Patricia'] ['8:00a', '10:00a'] Tech
['Smith', ' Jan'] ['8:00a', '10:00a'] Tech
['Park', ' Kimberly'] ['8:00a', '10:00a'] Tech
['Edwards', ' Terrell'] ['10:00a', '12:00p'] Tech
['Green', ' Harrold'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['2:00p', '4:00p'] Tech
['Hernandez', ' William (Monte)'] ['4:00p', '6:30p'] Supervisor
['Tait', ' Chioma'] ['4:00p', '6:00p'] Tech
['Hernandez', ' William (Monte)'] ['6:30p', '7:00p'] Supervisor
['Hernandez', ' William (Monte)'] ['7:00p', '9:00p'] Supervisor
['Tailor', ' Thomas (Jason)'] ['9:00p', '12:00a'] Supervisor
['Jones', ' Deslynne'] ['10:00p', '12:00a'] Tech

Categories

Resources