Extracting required names from a given format - python

I have a text file containing data as shown below. I have to extract some required names from it. I am trying the below code but not getting the required results.
The file contains data as below:
Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432
The code which I am trying:
import re
pattern = re.compile(r'(Leader|Head\\Organiser|Captain|Vice Captain).*(\w+)',re.I)
matches=pattern.findall(line)
for match in matches:
    print(match)
Expected Output:
Tim Lee
Sam Mathews
Jocey David
Jacob Green

import re
line = r'''
Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''
pattern = re.compile(r'(?:Leader|Head(?:\\Organiser|\\Secretary)?|Captain|Vice Captain)\W+(\w+(?:\s+\w+)?)', re.I)
matches = pattern.findall(line)
for match in matches:
    print(match)
Explanation:
(?: : start non capture group
Leader : literally
| : OR
Head : literally
(?: : start non capture group
\\Organiser : literally
| : OR
\\Secretary : literally
)? : end group, optional
| : OR
Captain : literally
| : OR
Vice Captain : literally
) : end group
\W+ : 1 or more non word character
( : start group 1
\w+ : 1 or more word char
(?: : non capture group
\s+ : 1 or more spaces
\w+ : 1 or more word char
)? : end group, optional
) : end group 1
Result for given example:
Tim Lee
Sam Mathews
Alica Mills
Maya Hill
Jocey David
Jacob Green
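The breakdown above can also be written into the pattern itself. The following is a minimal sketch, assuming the same sample data, that uses re.VERBOSE (re.X) so each part carries its own comment. The names text and name_pattern are illustrative and not from the original answer. In verbose mode unescaped whitespace in the pattern is ignored, so the space in "Vice Captain" is written as [ ].

```python
import re

# Sample data copied from the question; the raw string keeps the backslashes literal.
text = r'''
Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''

name_pattern = re.compile(r'''
    (?:                  # title, not captured
        Leader
      | Head (?: \\Organiser | \\Secretary )?   # Head, optionally followed by a role
      | Captain
      | Vice[ ]Captain   # literal space written as [ ] because of verbose mode
    )
    \W+                  # one or more non-word characters (the separator)
    (                    # group 1: the name
        \w+              # first word
        (?: \s+ \w+ )?   # optional second word
    )
''', re.I | re.X)

names = name_pattern.findall(text)
print(names)
```

This is the same expression as above, only laid out differently, so it returns the same six names.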

Given:
s='''\
Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''
You can get the names like so:
>>> [e.rstrip() for e in re.findall(r'[:-]+[ \t]+(.*?)[;#]',s)]
['Tim Lee', 'Sam Mathews', 'Alica Mills', 'Maya Hill', 'Jocey David', 'Jacob Green']
Or, create a dict of the titles and associated names:
>>> {k:v.rstrip() for k,v in re.findall(r'^\s*(Leader|Head\\Organiser|Head|Head\\Secretary|Captain|Vice Captain)\s*[:-]+[ \t]+(.*?)[;#]',s, re.M)}
{'Leader': 'Tim Lee', 'Head\\Organiser': 'Sam Mathews', 'Head': 'Alica Mills', 'Head\\Secretary': 'Maya Hill', 'Captain': 'Jocey David', 'Vice Captain': 'Jacob Green'}
Which then can be restricted to the titles desired:
>>> {k:v.rstrip() for k,v in re.findall(r'^\s*(Leader|Head\\Organiser|Captain|Vice Captain)\s*[:-]+[ \t]+(.*?)[;#]',s, re.M)}
{'Leader': 'Tim Lee', 'Head\\Organiser': 'Sam Mathews', 'Captain': 'Jocey David', 'Vice Captain': 'Jacob Green'}
And if you just want the names (dicts preserve insertion order, guaranteed from Python 3.7 and an implementation detail of CPython 3.6, so they will be in string order):
>>> {k:v.rstrip() for k,v in re.findall(r'^\s*(Leader|Head\\Organiser|Captain|Vice Captain)\s*[:-]+[ \t]+(.*?)[;#]',s, re.M)}.values()
dict_values(['Tim Lee', 'Sam Mathews', 'Jocey David', 'Jacob Green'])
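If the list of wanted titles changes often, the alternation can be built from a list instead of being typed by hand. This is only a sketch under that assumption: the helper names_by_title and the variable names are hypothetical, and the rest of the pattern is the one used above. re.escape takes care of the backslash in Head\Organiser.

```python
import re

def names_by_title(text, titles):
    """Return {title: name} for lines that start with one of the given titles."""
    # Longest titles first, so a short title cannot shadow a longer one.
    ordered = sorted(titles, key=len, reverse=True)
    alternation = '|'.join(re.escape(t) for t in ordered)
    pattern = re.compile(
        r'^\s*(' + alternation + r')\s*[:-]+[ \t]+(.*?)[;#]', re.M)
    return {title: name.rstrip() for title, name in pattern.findall(text)}

# Sample data from the question, written as raw strings to keep the backslashes.
sample = '\n'.join([
    r'Leader : Tim Lee ; 34567',
    r'Head\Organiser: Sam Mathews; 11:53 am',
    r'Head: Alica Mills; 45612',
    r'Head\Secretary: Maya Hill; #53190',
    r'Captain- Jocey David # 45123',
    r'Vice Captain:- Jacob Green; -65432',
])

wanted = names_by_title(sample, ['Leader', 'Head\\Organiser', 'Captain', 'Vice Captain'])
print(list(wanted.values()))
```

With the four titles from the question this gives the names in the order of the expected output.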

Related

choosing certain movie characters from a string using Regex in python

superheroines = '''
batman
spiderman
ironman
wonderwoman
superman
captainamerica
blackpanther
joker
hulk
thor
'''
''' I only want spiderman, ironman, captain america, hulk, thor '''
''' I want to exclude batman, wonderwoman, superman, joker '''
import re
pattern = re.compile(r'[^batman]+')
matches = pattern.findall(superheroines)
for match in matches:
    print(f'Marvel characters : {match}')

Regex Text Cleaning on Multiple forms of text formats

I have a dataframe with multiple forms of names:
JOSEPH W. JASON
Ralph Landau
RAYMOND C ADAMS
ABD, SAMIR
ABDOU TCHOUSNOU, BOUBACAR
ABDL-ALI, OMAR R
For the first three, the surname is the last word. For the last three, or any name containing a comma, the surname comes before the comma. However, for a name like Abdou Tchousnou, I only want the last word, which is Tchousnou.
The expected output is
JASON
LANDAU
ADAMS
ABD
TCHOUSNOU
ABDL-ALI
Each output line corresponds, in order, to the input name in the same position above.
str.extract(r'(^(?=[^,]*,?$)[\w-]+|(?<=, )[\w-]+)', expand=False)
Is there any way to solve this? The current code only returns the first name instead of the surname, which is what I want.
Something like this would work:
(.+(?=,)|\S+$)
( - start capture group #1
.+(?=,) - get everything before a comma
| - or
\S+$ - get everything which is not a whitespace before the end of the line
) - end capture group #1
https://regex101.com/r/myvyS0/1
Python:
str.extract(r'(.+(?=,)|\S+$)', expand=False)
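Note that .+(?=,) returns everything before the comma, so ABDOU TCHOUSNOU, BOUBACAR yields both words. A possible variant, shown here as a sketch with plain re rather than pandas, takes only the word directly before the comma and upper-cases the result as in the expected output. The names surname_pattern and surnames are illustrative.

```python
import re

names = [
    'JOSEPH W. JASON', 'Ralph Landau', 'RAYMOND C ADAMS',
    'ABD, SAMIR', 'ABDOU TCHOUSNOU, BOUBACAR', 'ABDL-ALI, OMAR R',
]

# Either the word (word characters and hyphens) directly before a comma,
# or the last run of non-space characters in the string.
surname_pattern = re.compile(r'[\w-]+(?=,)|\S+$')

surnames = [surname_pattern.search(n).group(0).upper() for n in names]
print(surnames)
```

Wrapped in a capture group, the same pattern should also be usable with str.extract, followed by str.upper() on the result.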
You may use this regex to extract:
>>> print (df)
name
0 JOSEPH W. JASON
1 Ralph Landau
2 RAYMOND C ADAMS
3 ABD, SAMIR
4 ABDOU TCHOUSNOU, BOUBACAR
5 ABDL-ALI, OMAR R
>>> df['name'].str.extract(r'([^,]+(?=,)|\w+(?:-\w+)*(?=$))', expand=False)
0 JASON
1 Landau
2 ADAMS
3 ABD
4 ABDOU TCHOUSNOU
5 ABDL-ALI
RegEx Details:
(: Start capture group
[^,]+(?=,): Match 1+ non-comma characters that are followed by a comma
|: OR
\w+: Match 1+ word characters
(?:-\w+)*: Match - followed by 1+ word characters. Match 0 or more of this group
(?=$): Lookahead to assert that we are at the end of the line
): End capture group

How to get (in Python), part of a string in a column and transform it in a list?

I've just started web scraping and decided to give it a go on the classic IMDb dataset. One of my columns ('actors'), is supposed to contain the name of several actors. This is how it looks right now:
"Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
My goal is to exclude the Director part, and keep only the actors as a list (for some data analysis):
["Zooey Deschanel", "Joseph Gordon-Levitt", "Geoffrey Arend", "Chloë Grace Moretz"]
What is the best way to achieve this result on ALL of my rows by using Python? Thank you!
You can simply split() the string:
data = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
actors = [x.strip() for x in data.split('|')[1].split(':')[1].split(',')]
print(actors)
Output:
['Zooey Deschanel', 'Joseph Gordon-Levitt', 'Geoffrey Arend', 'Chloë Grace Moretz']
Suppose the string is stored as s = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz"; then you can split it step by step as below.
Splitting the string into actors:
my_list = s.split('|') : this splits the string at | and returns a list
Output : ['Director: Marc Webb ', ' Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz']
my_list = my_list[1].split(':')
Output : [' Stars', ' Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz']
actors = my_list[1].split(',')
Output : [' Zooey Deschanel', ' Joseph Gordon-Levitt', ' Geoffrey Arend', ' Chloe Grace Moretz']
Each name still carries a leading space, so strip() every element to get the desired list. Below is the complete code:
s = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz"
my_list = s.split('|')  # separate the director part from the stars part
# drop the "Stars" label, split on commas, and strip the surrounding spaces
actors = [name.strip() for name in my_list[1].split(':')[1].split(',')]
print(actors)
The above code prints only the actors, as a list.
Assuming you have an array of strings containing data like you have described in your question, then you might consider doing something like the following:
import pprint
imdb_actors_columns = [
"Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz",
"Director: Alfred Hitchcock | Stars: James Stewart, Kim Novak, Barbara Bel Geddes",
# ... etc...
]
def does_look_like_stars_col(possible_stars_column_val):
    if possible_stars_column_val:
        return possible_stars_column_val.strip().lower().startswith('stars:')
    return False
# Split apart strings into ['Director: ...', 'Stars: ...']
tokenized_columns = map(lambda s: s.split('|'), imdb_actors_columns)
# Run through generated lists of [
# ['Director:...', 'Stars:...'], ['Director:...', 'Stars:...'],
# ...
# ]
# And filter sublists so that we only retain the
# [['Stars: ...'], ['Stars: ...']]
# Then use mapping to extract the first 'Stars:...' entries to top-level like:
# ['Stars: ...', 'Stars: ...']
star_actor_columns = map(lambda a: a[0],
    filter(bool,
        map(lambda all_columns: list(filter(
                does_look_like_stars_col, all_columns)),
            tokenized_columns
        )
    )
)
# Loop through all the "Stars: Name1, Name2, ..." strings, get the "Name"
# portions, and then strip away any leading or trailing spaces so that the
# final result is [['Name1', 'Name2', ...], ['OtherName1', 'OtherName2', ...]]
all_stars = [list(map(
    lambda s: s.strip(),
    raw_star_list.strip().replace('Stars:', '').split(',')
)) for raw_star_list in star_actor_columns]
pprint.pprint(all_stars)
Execution produces the following results:
[['Zooey Deschanel',
  'Joseph Gordon-Levitt',
  'Geoffrey Arend',
  'Chloë Grace Moretz'],
 ['James Stewart', 'Kim Novak', 'Barbara Bel Geddes']]
You can check this solution out on IDEOne.
You may use
import re
string = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
pattern = re.compile(r'Stars:(.+)')
stars = [star.strip() for m in pattern.findall(string) for star in m.split(",")]
Which yields
['Zooey Deschanel', 'Joseph Gordon-Levitt', 'Geoffrey Arend', 'Chloë Grace Moretz']
Or - using the regex module and \G:
import regex as re
string = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
pattern = re.compile(r'(?:\G(?!\A),\s*|Stars:\s+)\K[^,]+')
stars = pattern.findall(string)
print(stars)
See a demo for the latter on regex101.com.
Or - just for the sake of academic learning - use a proper parser:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
string = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
class MovieVisitor(NodeVisitor):
    grammar = Grammar(r"""
        line = director sep stars
        director = ws? notsep colon notsep
        stars = ws? notsep colon star+
        star = notsep sep?
        colon = ":"
        sep = ~"[|,]"
        notsep = ~"[^|,:]+"
        ws = ~"\s+"
    """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_line(self, node, visited_children):
        director, _, stars = visited_children
        return {"director": director, "stars": stars}

    def visit_director(self, node, visited_children):
        *_, director = visited_children
        return director.text.strip()

    def visit_stars(self, node, visited_children):
        *_, stars = visited_children
        return stars

    def visit_star(self, node, visited_children):
        star, _ = visited_children
        return star.text.strip()

mv = MovieVisitor()
dct = mv.parse(string)
print(dct)
# {'director': 'Marc Webb', 'stars': ['Zooey Deschanel', 'Joseph Gordon-Levitt', 'Geoffrey Arend', 'Chloë Grace Moretz']}
string = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
# find the starting point of the part of the text that interests you
start = string.find("Stars: ") + len("Stars: ")
# slice the string so that only the relevant part remains
string = string[start:]
# split the remaining string with a delimiter, here: ", "
output = string.split(", ")
As others have pointed out, you can of course do this in one line as well:
output = string.rsplit(" Stars: ", 1)[-1].split(", ")
or
output = next(reversed(string.rsplit(" Stars: ", 1))).split(", ")
Which yields the same result but is less readable in my opinion.
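Since the question asks about ALL rows, here is a minimal sketch of a per-cell helper that could be applied to every row. The function name extract_stars and the third sample row (a cell without a "Stars:" part) are assumptions made for illustration. Unlike find(), which returns -1 when the marker is missing, partition() makes the missing case easy to detect.

```python
def extract_stars(cell):
    """Return the actor names from one 'Director: ... | Stars: ...' cell."""
    _, marker, tail = cell.partition('Stars:')
    if not marker:                 # no "Stars:" section in this cell
        return []
    tail = tail.split('|', 1)[0]   # ignore any later "|"-separated section
    return [name.strip() for name in tail.split(',') if name.strip()]

rows = [
    "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz",
    "Director: Alfred Hitchcock | Stars: James Stewart, Kim Novak, Barbara Bel Geddes",
    "Director: Unknown",           # hypothetical row without a Stars part
]

all_stars = [extract_stars(row) for row in rows]
print(all_stars)
```

With a pandas DataFrame, the same helper could presumably be applied with df['actors'].apply(extract_stars), assuming the column holds strings in this format.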

Capturing only specific sections/patterns of string with Regex

I have the following strings, which always follow a standard format:
'On 10/31/2018, Sally Brown picked 25 apples at the orchard.'
'On 11/01/2018, John Smith picked 12 peaches at the orchard.'
'On 09/15/2018, Jim Roe picked 10 pears at the orchard.'
I want to extract certain data fields into a series of lists:
['10/31/2018','Sally Brown','25','apples']
['11/01/2018','John Smith','12','peaches']
['09/15/2018','Jim Roe','10','pears']
As you can see, I need some of the sentence structure to be recognized, but not captured, so the program has context for where the data is located. The Regex that I thought would work is:
(?<=On\s)\d{2}\/\d{2}\/\d{4},\s(?=[A-Z][a-z]+\s[A-Z][a-z]+)\s.+?(?=\d+)\s(?=[a-z]+)\sat\sthe\sorchard\.
But of course, that is incorrect somehow.
This may be a simple question for someone, but I'm having trouble finding the answer. Thanks in advance, and someday when I'm more skilled I'll pay it forward on here.
Use \w+ to match a word; for ASCII text, \w matches the same characters as [a-zA-Z0-9_].
import re
# the sample sentences, one per line, quotes included
text = """'On 10/31/2018, Sally Brown picked 25 apples at the orchard.'
'On 11/01/2018, John Smith picked 12 peaches at the orchard.'
'On 09/15/2018, Jim Roe picked 10 pears at the orchard.'"""
arr = re.findall(r'On\s(.*?),\s(\w+\s\w+)\s\w+\s(\d+)\s(\w+)', text)
print(arr)
# [('10/31/2018', 'Sally Brown', '25', 'apples'),
# ('11/01/2018', 'John Smith', '12', 'peaches'),
# ('09/15/2018', 'Jim Roe', '10', 'pears')]
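The question asks for one list per sentence and for the surrounding words to be recognized without being captured. Lookarounds are not needed for that: anything written outside a capture group is matched but not returned. A stricter sketch along those lines, assuming the sentences arrive as separate strings (the names sentences, record_pattern and records are illustrative):

```python
import re

sentences = [
    'On 10/31/2018, Sally Brown picked 25 apples at the orchard.',
    'On 11/01/2018, John Smith picked 12 peaches at the orchard.',
    'On 09/15/2018, Jim Roe picked 10 pears at the orchard.',
]

record_pattern = re.compile(
    r'On (\d{2}/\d{2}/\d{4}), '      # group 1: date
    r'([A-Z][a-z]+ [A-Z][a-z]+) '    # group 2: first and last name
    r'picked (\d+) ([a-z]+) '        # groups 3 and 4: count and fruit
    r'at the orchard\.'              # fixed context, matched but not captured
)

records = []
for sentence in sentences:
    match = record_pattern.fullmatch(sentence)
    if match:                        # skip sentences that do not fit the format
        records.append(list(match.groups()))
print(records)
```

Because fullmatch() requires the whole sentence to fit, strings in a different format are skipped rather than partially matched.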

Regex in Python: How to match a word pattern, if not preceded by another word of variable length?

I would like to reconstruct full names from photo captions using Regex in Python, by appending the last name back to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I need to avoid matching patterns like 'Jay Smith and Albert Diamond', which would generate the non-existent name 'Smith Diamond'.
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
import re
s = 'John and Albert McDonald'
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex expression? Or is there a better way to do what I want? Thanks!
As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Swapping out your current pattern for this one should work with your current Python code. You do need to strip() the captured results, though.
Which for your examples and current code would yield:
| Input | First print | Second print |
| --- | --- | --- |
| John and Albert McDonald | John McDonald | Albert McDonald |
| Stephen Stewart, John and Albert Diamond | John Diamond | Albert Diamond |
| It was a great day hanging out with John and Stephen Diamond. | John Diamond | Stephen Diamond |
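For a caption like 'Jay Smith and Albert Diamond', the first group of this pattern captures both words ('Jay Smith'), which can be used to tell the two cases apart. The following is a sketch of one way to do that, not part of the original answer: the helper expand_shared_surname is hypothetical, and it only shares the surname when exactly one capitalized word precedes "and".

```python
import re

PAIR = re.compile(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)')

def expand_shared_surname(caption):
    """Return ['First1 Last', 'First2 Last'], or [] if no surname is shared."""
    match = PAIR.search(caption)
    if not match:
        return []
    before_and = match.group(1).split()   # capitalized words before "and"
    after_and = match.group(2).split()    # capitalized words after "and"
    # More than one word before "and" means that person already has a surname.
    if len(before_and) != 1 or len(after_and) < 2:
        return []
    surname = ' '.join(after_and[1:])
    return [f'{before_and[0]} {surname}', ' '.join(after_and)]

print(expand_shared_surname('John and Albert McDonald'))
print(expand_shared_surname('Stephen Stewart, John and Albert Diamond'))
print(expand_shared_surname('Jay Smith and Albert Diamond'))
```

The split() calls also take care of the trailing whitespace that the capture groups include.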
