Capturing only specific sections/patterns of string with Regex

I have the following strings, which always follow a standard format:
'On 10/31/2018, Sally Brown picked 25 apples at the orchard.'
'On 11/01/2018, John Smith picked 12 peaches at the orchard.'
'On 09/15/2018, Jim Roe picked 10 pears at the orchard.'
I want to extract certain data fields into a series of lists:
['10/31/2018','Sally Brown','25','apples']
['11/01/2018','John Smith','12','peaches']
['09/15/2018','Jim Roe','10','pears']
As you can see, I need some of the sentence structure to be recognized, but not captured, so the program has context for where the data is located. The Regex that I thought would work is:
(?<=On\s)\d{2}\/\d{2}\/\d{4},\s(?=[A-Z][a-z]+\s[A-Z][a-z]+)\s.+?(?=\d+)\s(?=[a-z]+)\sat\sthe\sorchard\.
But of course, that is incorrect somehow.
This may be a simple question for someone, but I'm having trouble finding the answer. Thanks in advance, and someday when I'm more skilled I'll pay it forward on here.

Use \w+ to match any word (for ASCII text it is equivalent to [a-zA-Z0-9_]+) and put only the parts you want to keep inside capture groups:
import re
text = '''On 10/31/2018, Sally Brown picked 25 apples at the orchard.
On 11/01/2018, John Smith picked 12 peaches at the orchard.
On 09/15/2018, Jim Roe picked 10 pears at the orchard.'''
arr = re.findall(r'On\s(.*?),\s(\w+\s\w+)\s\w+\s(\d+)\s(\w+)', text)
print(arr)
# [('10/31/2018', 'Sally Brown', '25', 'apples'),
# ('11/01/2018', 'John Smith', '12', 'peaches'),
# ('09/15/2018', 'Jim Roe', '10', 'pears')]
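One likely reason the pattern in the question fails is that lookaheads such as (?=...) are zero-width: they check what follows without consuming it, so the \s written after each lookahead has to match exactly where the name or number begins. A minimal sketch that uses plain capture groups instead, leaving the fixed sentence text outside the groups, and returns lists rather than tuples (the names PATTERN, lines and rows are illustrative only):

```python
import re

# Fixed sentence text is matched but sits outside the parentheses,
# so only the four data fields are captured.
PATTERN = re.compile(
    r'On (\d{2}/\d{2}/\d{4}), ([A-Z][a-z]+ [A-Z][a-z]+) '
    r'picked (\d+) ([a-z]+) at the orchard\.'
)

lines = [
    'On 10/31/2018, Sally Brown picked 25 apples at the orchard.',
    'On 11/01/2018, John Smith picked 12 peaches at the orchard.',
    'On 09/15/2018, Jim Roe picked 10 pears at the orchard.',
]

# fullmatch returns None for a line that does not follow the format.
rows = [list(m.groups()) for m in map(PATTERN.fullmatch, lines) if m]

for row in rows:
    print(row)
# ['10/31/2018', 'Sally Brown', '25', 'apples']
# ['11/01/2018', 'John Smith', '12', 'peaches']
# ['09/15/2018', 'Jim Roe', '10', 'pears']
```

This version is stricter than \w+: it assumes every name is exactly two capitalized words, as in the three sample strings.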

Related

Python: If String with Dynamic Variables

OK, this might be the wrong wording, and I would love someone to correct me if so. I am trying to find out whether a string contains a certain phrase, even if parts of that phrase are dynamic.
So for example, the string could be:
Hi there, Jordan has enrolled at St. Thomas on 10/02/19
Hi there, Lisa has enrolled at St. Thomas on 16/11/19
Hi there, Craig has enrolled at Sirius Academy on 12/10/19
In my python I have this:
if "Hi there, {$0} has enrolled at {$1} on ${2}" in email_body:
    print("Someone new is arriving...")
However it does not fire. If I print email_body it shows me the email so the problem is with the if statement and the regex detection.
Weird Result Edit:
This is my code:
data = re.findall('Hi there, (.*?) has enrolled at (.*?) on (.*?)', message_body)[0]
print(data)
returns:
('Lisa', 'St. Thomas', '')
For some reason the third value is missing.
When I print(email_body), I get:
Hi there, Lisa has enrolled at St. Thomas on 16thSept2017

You can use re.findall:
import re
email_body = 'Hi there, Lisa has enrolled at St. Thomas on 16/11/19'
if re.findall(r'Hi there, [\w\W]+ has enrolled at [\w\W]+ on [\w\W]+', email_body):
    print("Someone new is arriving...")
Regarding your recent comment, if you would like the entire line, you can just do this:
email_body = 'Hi there, Lisa has enrolled at St. Thomas on 16/11/19'
data = re.findall(r'Hi there, [\w\W]+ has enrolled at [\w\W]+ on [\w\W]+', email_body)
if data:
    print(data[0])
Output:
Hi there, Lisa has enrolled at St. Thomas on 16/11/19
New Edit: More complex string
email_body1 = '53ewwffHi there, Lisa has enrolled at St. Thomas on 16/11/19\n \n dfdsg 45435'
email_body2 = "Hi there, Lisa has enrolled at St. Thomas on 16thSept2017"
data = re.findall('Hi there, (.*?) has enrolled at (.*?) on ([a-zA-Z0-9/]+)', email_body1)
data1 = re.findall('Hi there, (.*?) has enrolled at (.*?) on ([a-zA-Z0-9/]+)', email_body2)
print(data[0])
print(data1[0])
Output:
('Lisa', 'St. Thomas', '16/11/19')
('Lisa', 'St. Thomas', '16thSept2017')
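As for why the third value came back empty in the question's edit: a lazy group such as (.*?) matches as little as possible, and when nothing follows it in the pattern, "as little as possible" is the empty string. A small sketch of the effect, and of one possible fix that anchors the last group to the end of the line (the variable names here are illustrative):

```python
import re

body = 'Hi there, Lisa has enrolled at St. Thomas on 16thSept2017'

# The trailing lazy group has nothing after it, so it matches nothing.
lazy = re.findall(r'Hi there, (.*?) has enrolled at (.*?) on (.*?)', body)
print(lazy)      # [('Lisa', 'St. Thomas', '')]

# Anchoring with $ forces the last group to run to the end of the line.
anchored = re.findall(
    r'Hi there, (.*?) has enrolled at (.*?) on (.*?)$', body, flags=re.MULTILINE
)
print(anchored)  # [('Lisa', 'St. Thomas', '16thSept2017')]
```

The character class ([a-zA-Z0-9/]+) used above works for the same reason: it is greedy, so it consumes the whole date token.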

You're correct that you'll want to use regular expressions here. For example:
>>> import re
>>> r = re.match(r'Hi there, (.+) has enrolled at (.+) on (.+)', 'Hi there, Jordan has enrolled at St. Thomas on 10/02/19')
>>> r.groups()
('Jordan', 'St. Thomas', '10/02/19')
To use them:
>>> person, place, day = r.groups()
>>> '{} / {} / {}'.format(person, place, day)
'Jordan / St. Thomas / 10/02/19'

Python: replace \n \r \t in a list excluding those starting \n\n and ends with \n\r\n\t

My List:
['\n\r\n\tThis article is about sweet bananas. For the genus to which
banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas
used in cooking, see Cooking banana. For other uses, see Banana
(disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya
and Australia, and are likely to have been first domesticated in Papua
New Guinea.\n\r\n\tThey are grown in 135
countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between
"bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is
the largest herbaceous flowering plant.\n\r\n\tAll the above-ground
parts of a banana plant grow from a structure usually called a
"corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West
African origin, possibly from the Wolof word banaana, and passed into
English via Spanish or Portuguese.\n']
Example code:
import requests
from bs4 import BeautifulSoup
import re
# named resp (not re) so the imported re module is not overwritten
resp=requests.get('http://www.abcde.com/banana')
soup=BeautifulSoup(resp.text.encode('utf-8'), "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)
list=[]
for tag in soup.select('.page_article_content'):
    list.append(tag.text)
#list=([c.replace('\n', '') for c in list])
#list=([c.replace('\r', '') for c in list])
#list=([c.replace('\t', '') for c in list])
print(list)
After scraping a web page, I need to clean the data. I want to replace every "\r", "\n" and "\t" with "", but the text also contains subtitles, and if I do this the subtitles and the sentences get mixed together.
Every subtitle starts with \n\n and ends with \n\r\n\t. Is it possible to mark them in this list so they stay distinguishable, for example as \aEtymology\a? Replacing \n\n and \n\r\n\t separately with \a first will not work, because other parts contain overlapping sequences such as \n\n\r, which would become \a\r. Thanks in advance!

Approach
Replace each subtitle in the list with a custom placeholder string, <subtitles>
Remove the \n, \r and \t characters from the list
Replace each placeholder with the actual subtitle
Code
l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']
import re
regex=re.findall("\n\n.*.\n\r\n\t",l[0])
print(regex)
for x in regex:
    l = [r.replace(x,"<subtitles>") for r in l]
rep = ['\n','\t','\r']
for y in rep:
    l = [r.replace(y, '') for r in l]
for x in regex:
    l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)
Output
['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']
['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']

I think you can use a regex. Note that this strips every \n, \r and \t, including the ones around subtitles:
import re
print([re.sub(r'[\n\r\t]', '', c) for c in list])

You can do this by using regular expressions (li is your list of strings):
import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]
\g<1> is a backreference to the (\w+) group in the first regex. It lets you reuse whatever that group matched.
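To get all the way to the cleaned text, the marking step can be followed by a second pass that removes the remaining control characters. A minimal sketch under the question's assumption that every subtitle is a single word wrapped in \n\n and \n\r\n\t (the function name clean and the shortened sample string are made up for illustration):

```python
import re

SUBTITLE = re.compile(r'\n\n(\w+)\n\r\n\t')

def clean(text):
    # Step 1: wrap each subtitle in \a markers so it survives step 2.
    marked = SUBTITLE.sub(lambda m: '\a' + m.group(1) + '\a', text)
    # Step 2: drop every remaining newline, carriage return and tab.
    return re.sub(r'[\n\r\t]', '', marked)

sample = ('They are grown in 135 countries.'
          '\n\nDescription\n\r\n\tThe banana plant is large.\n')
print(repr(clean(sample)))
# 'They are grown in 135 countries.\x07Description\x07The banana plant is large.'
```

Because \a is not in the character class of the second pass, the markers are left in place. A subtitle containing spaces would need a wider group than (\w+).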

Regex in Python: How to match a word pattern, if not preceded by another word of variable length?

I would like to reconstruct full names from photo captions using Regex in Python, by appending the last name to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I would need to avoid matching patterns like this: 'Jay Smith and Albert Diamond' and generate a non-existent name 'Smith Diamond'
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
import re
s = 'John and Albert McDonald'
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex expression? Or is there a better way to do what I want? Thanks!

As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Swapping your current pattern for this one should work with your current Python code. You do need to strip() the captured results, though, because the groups include trailing whitespace.
Which for your examples and current code would yield:
Input | First print | Second print
John and Albert McDonald | John McDonald | Albert McDonald
Stephen Stewart, John and Albert Diamond | John Diamond | Albert Diamond
It was a great day hanging out with John and Stephen Diamond. | John Diamond | Stephen Diamond
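With this pattern, the first group captures every capitalized word directly before "and", so a caption like 'Jay Smith and Albert Diamond' yields 'Jay Smith ' rather than 'Smith'. One way to use that, sketched here as an assumption rather than as part of the answer above, is to treat a multi-word first group as an already complete name (the function name expand_names is hypothetical):

```python
import re

PATTERN = re.compile(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)')

def expand_names(caption):
    m = PATTERN.search(caption)
    if not m:
        return []
    first = m.group(1).split()   # capitalized words before "and"
    second = m.group(2).split()  # capitalized words after "and"
    if len(first) == 1 and len(second) >= 2:
        # Lone first name: borrow the last name(s) from the second person.
        return [first[0] + ' ' + ' '.join(second[1:]), ' '.join(second)]
    # Otherwise the first person already has a last name.
    return [' '.join(first), ' '.join(second)]

print(expand_names('John and Albert McDonald'))
# ['John McDonald', 'Albert McDonald']
print(expand_names('Jay Smith and Albert Diamond'))
# ['Jay Smith', 'Albert Diamond']
```

Splitting on whitespace also takes care of the strip() step. Hyphenated names are not covered, since \w does not match a hyphen.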

How can I split lines based on characters in python?

I've recently started working with Python 2.7 and I've got an assignment in which I get a text file containing data separated with space. I would need to split every line into strings containing only one type of data. Here's an example:
Bruce Wayne 10012-34321 2016.02.20. 231231
John Doe 10201-11021 2016.01.10. 2310456
Chris Taylor 10001-31021 2015.12.30. 524432
James Michael Kent 10210-41011 2016.02.03. 3235332
I want to separate them by name, id, date and balance, but the only thing I know is split, which I can't really use because the last given name has three parts instead of two. How can I split a line based on characters? What could be the solution in this case?
Any help is appreciated.

You'll want to use str.rsplit() and supply a max number of splits, like this:
>>> s = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
>>> s.rsplit(' ', 3)
['James Michael Kent', '10210-41011', '2016.02.03.', '3235332']
>>> s = 'Chris Taylor 10001-31021 2015.12.30. 524432'
>>> s.rsplit(' ', 3)
['Chris Taylor', '10001-31021', '2015.12.30.', '524432']
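Applied to every line of the file, the same idea could look like the sketch below. It assumes the last three fields never contain spaces, and it passes None as the separator so that runs of spaces or tabs also count as one delimiter; the names lines, records and account_id are illustrative, and in practice lines would come from the open file.

```python
lines = [
    'Bruce Wayne 10012-34321 2016.02.20. 231231',
    'John Doe 10201-11021 2016.01.10. 2310456',
    'Chris Taylor 10001-31021 2015.12.30. 524432',
    'James Michael Kent 10210-41011 2016.02.03. 3235332',
]

records = []
for line in lines:
    # Split from the right at most 3 times; whatever is left is the name.
    name, account_id, date, balance = line.rsplit(None, 3)
    records.append((name, account_id, date, int(balance)))

for record in records:
    print(record)
# ('Bruce Wayne', '10012-34321', '2016.02.20.', 231231)
# ('John Doe', '10201-11021', '2016.01.10.', 2310456)
# ('Chris Taylor', '10001-31021', '2015.12.30.', 524432)
# ('James Michael Kent', '10210-41011', '2016.02.03.', 3235332)
```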

Alternatively, split the line on whitespace and slice the resulting list from the end:
To get the details:
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[-3:]
['10210-41011', '2016.02.03.', '3235332']
ln = 'Bruce Wayne 10012-34321 2016.02.20. 231231'
ln.split()[-3:]
['10012-34321', '2016.02.20.', '231231']
To get names:
ln.split()[:-3]
['Bruce', 'Wayne']
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[:-3]
['James', 'Michael', 'Kent']

Parsing Text Document Lines in Python on Wildcards/Patterns

What I Have
I'm working on parsing a .txt file that contains scheduling information for who works when on a given day. The .txt file looks like this:
START PAGE 0
XYZ Schedule for: Saturday, March 30, 2013
Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech
3/28/2013 2:21:17 PM
END PAGE 0
So the first two and last two lines are not relevant but every other line in the middle is the schedule for one person.
What I Want
I want to parse out the pieces of each line so that I can write it to a .csv file. I can use line.partition(',')[0] to get the last name (the first piece on each line) but after that I'm at a loss. I need to communicate the following to Python:
The part after the , to a number is a section (first name)
The part from the first number to either an a or a p (for am or pm) is another section (start time)
The part from the number just after that a or p to the next a or p is another section (end time)
Finally, the remaining section is another section (the type/position of the shift.)
A line in my resulting csv file might look like this:
Barnes,Michael,8:00a,10:00a,Tech
Things to Note
1) One person can have more than one shift during a day.
2) Some people have a nickname in parentheses but some don't.
3) If Python had wild cards like # for a number and * for anything I could see how I might be able to keep using partition and keep splitting the remaining pieces, something like this:
for line in input:
    name = str(line.partition(',')[0]+','+str(line.partition(',')[2].split(#)[0]))
    output.write("".join(x for x in name))
    output.write("\r\n")
However, Python doesn't seem to use wildcards like that. Also, this seems like a very inelegant solution.

This should be enough to get you started:
import re
data = '''Barnes, Michael8:00a10:00aTech
Collins, Jessica8:00a4:00pSupervisor
Hamilton, Patricia8:00a10:00aTech
Smith, Jan8:00a10:00aTech
Park, Kimberly8:00a10:00aTech
Edwards, Terrell10:00a12:00pTech
Green, Harrold12:00p2:00pTech
Tait, Jessica12:00p2:00pTech
Tait, Jessica2:00p4:00pTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Tait, Chioma4:00p6:00pTech
Hernandez, William (Monte)6:30p7:00pSupervisor
Hernandez, William (Monte)7:00p9:00pSupervisor
Tailor, Thomas (Jason)9:00p12:00aSupervisor
Jones, Deslynne10:00p12:00aTech'''
print(re.findall(r'(.*?)(\d{1,2}:\d\d[ap])(\d{1,2}:\d\d[ap])(.*)', data))
prints
[('Barnes, Michael', '8:00a', '10:00a', 'Tech'),
('Collins, Jessica', '8:00a', '4:00p', 'Supervisor'),
('Hamilton, Patricia', '8:00a', '10:00a', 'Tech'),
('Smith, Jan', '8:00a', '10:00a', 'Tech'),
('Park, Kimberly', '8:00a', '10:00a', 'Tech'),
('Edwards, Terrell', '10:00a', '12:00p', 'Tech'),
('Green, Harrold', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '12:00p', '2:00p', 'Tech'),
('Tait, Jessica', '2:00p', '4:00p', 'Tech'),
('Hernandez, William (Monte)', '4:00p', '6:30p', 'Supervisor'),
('Tait, Chioma', '4:00p', '6:00p', 'Tech'),
('Hernandez, William (Monte)', '6:30p', '7:00p', 'Supervisor'),
('Hernandez, William (Monte)', '7:00p', '9:00p', 'Supervisor'),
('Tailor, Thomas (Jason)', '9:00p', '12:00a', 'Supervisor'),
('Jones, Deslynne', '10:00p', '12:00a', 'Tech')]
Read the documentation of the re module to understand the regular expression. You can parse the names as a separate step or expand the regex to be more specific. I recommend using the csv module to write to a csv file.
If you get stuck, post specific questions with code.
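As a rough sketch of the two follow-up steps mentioned above, the regex can be expanded to split the last and first name at the comma, and the csv module can write the rows. The names LINE, parse_schedule and sample are made up for this example, and the output goes to an in-memory buffer instead of a real .csv file to keep the snippet self-contained.

```python
import csv
import io
import re

# last name, first name (with optional nickname), start, end, position
LINE = re.compile(r'(.*?),\s*(.*?)(\d{1,2}:\d\d[ap])(\d{1,2}:\d\d[ap])(.*)')

def parse_schedule(text):
    """Return [last, first, start, end, position] for every shift line."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:  # the header and footer lines of the sample do not match
            rows.append([part.strip() for part in m.groups()])
    return rows

sample = '''START PAGE 0
XYZ Schedule for: Saturday, March 30, 2013
Barnes, Michael8:00a10:00aTech
Hernandez, William (Monte)4:00p6:30pSupervisor
Jones, Deslynne10:00p12:00aTech
3/28/2013 2:21:17 PM
END PAGE 0'''

rows = parse_schedule(sample)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
# Barnes,Michael,8:00a,10:00a,Tech
# Hernandez,William (Monte),4:00p,6:30p,Supervisor
# Jones,Deslynne,10:00p,12:00a,Tech
```

To write a real file, open it with open(path, 'w', newline='') and pass that file object to csv.writer in place of the buffer. Skipping the first two and last two lines by position, as the question describes, would be safer than relying on them failing to match.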

Assuming that you know how to remove the first two and last two lines, and that the rest is in a string called s, here is how I would do what you want:
entries = [x.strip() for x in s.split('\n') if x]
for entry in entries:
    # indices where a run of digits starts; the first one marks the end of the name
    ind = [i for i,x in enumerate(entry) if x.isdigit() and not entry[i-1].isdigit()]
    name = entry[0:ind[0]]
    name = name.split(',')
    other = entry[ind[0]:]
    # indices of each a/p that directly follows a digit, i.e. the end of each time
    ind = [-1]+[i for i,x in enumerate(other) if x in ('a', 'p') and other[i-1].isdigit()]
    shifts = []
    for i in range(1, len(ind)):
        shifts.append(other[ind[i-1]+1:ind[i]+1])
    position = other[ind[-1]+1:]
    print(name, shifts, position)
This will work on an arbitrary number of shifts.
Output:
['Barnes', ' Michael'] ['8:00a', '10:00a'] Tech
['Collins', ' Jessica'] ['8:00a', '4:00p'] Supervisor
['Hamilton', ' Patricia'] ['8:00a', '10:00a'] Tech
['Smith', ' Jan'] ['8:00a', '10:00a'] Tech
['Park', ' Kimberly'] ['8:00a', '10:00a'] Tech
['Edwards', ' Terrell'] ['10:00a', '12:00p'] Tech
['Green', ' Harrold'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['12:00p', '2:00p'] Tech
['Tait', ' Jessica'] ['2:00p', '4:00p'] Tech
['Hernandez', ' William (Monte)'] ['4:00p', '6:30p'] Supervisor
['Tait', ' Chioma'] ['4:00p', '6:00p'] Tech
['Hernandez', ' William (Monte)'] ['6:30p', '7:00p'] Supervisor
['Hernandez', ' William (Monte)'] ['7:00p', '9:00p'] Supervisor
['Tailor', ' Thomas (Jason)'] ['9:00p', '12:00a'] Supervisor
['Jones', ' Deslynne'] ['10:00p', '12:00a'] Tech
