I have a dataframe with multiple forms of names:
JOSEPH W. JASON
Ralph Landau
RAYMOND C ADAMS
ABD, SAMIR
ABDOU TCHOUSNOU, BOUBACAR
ABDL-ALI, OMAR R
For the first three, the surname is the last word. For the last three, or any name containing a comma, the surname is the part before the comma. However, for a name like ABDOU TCHOUSNOU, I only want the last word before the comma, which is TCHOUSNOU.
The expected output is
JASON
LANDAU
ADAMS
ABD
TCHOUSNOU
ABDL-ALI
The first list contains the names, and the second list is what I want to return.
str.extract(r'(^(?=[^,]*,?$)[\w-]+|(?<=, )[\w-]+)', expand=False)
Is there any way to solve this? The current code returns the first name instead of the surname, which is what I want.
Something like this would work:
(.+(?=,)|\S+$)
( - start capture group #1
.+(?=,) - get everything before a comma
| - or
\S+$ - match the run of non-whitespace characters at the end of the line
) - end capture group #1
https://regex101.com/r/myvyS0/1
Python:
str.extract(r'(.+(?=,)|\S+$)', expand=False)
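To check what this pattern returns on the sample names without pandas, here is a minimal sketch using the standard re module. The helper name extract_surname is made up for illustration. Note that the pattern keeps everything before the comma, so the compound surname comes back as ABDOU TCHOUSNOU rather than TCHOUSNOU.

```python
import re

# Pattern from the answer above: everything before a comma, or the last
# run of non-whitespace characters at the end of the string.
PATTERN = re.compile(r'(.+(?=,)|\S+$)')

def extract_surname(name):
    """Return the first match of PATTERN in name, or None (hypothetical helper)."""
    match = PATTERN.search(name)
    return match.group(1) if match else None

names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]

for name in names:
    print(name, "->", extract_surname(name))
```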
You may use this regex to extract:
>>> print (df)
name
0 JOSEPH W. JASON
1 Ralph Landau
2 RAYMOND C ADAMS
3 ABD, SAMIR
4 ABDOU TCHOUSNOU, BOUBACAR
5 ABDL-ALI, OMAR R
>>> df['name'].str.extract(r'([^,]+(?=,)|\w+(?:-\w+)*(?=$))', expand=False)
0 JASON
1 Landau
2 ADAMS
3 ABD
4 ABDOU TCHOUSNOU
5 ABDL-ALI
RegEx Details:
(: Start capture group
[^,]+(?=,): Match 1+ non-comma characters that are followed by a comma
|: OR
\w+: Match 1+ word characters
(?:-\w+)*: Match - followed by 1+ word characters. Match 0 or more of this group
(?=$): Lookahead to assert that we are at the end of the line
): End capture group
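The patterns above return ABDOU TCHOUSNOU for the compound surname, while the question asks for TCHOUSNOU in upper case. A possible variant, sketched here with the standard re module, takes only the last word before the comma. The name last_word_surname is hypothetical, and the upper-casing is an assumption based on the expected output in the question.

```python
import re

# Last word (letters, digits, underscore, hyphen) directly before a comma,
# otherwise the last such word at the end of the string.
SURNAME = re.compile(r'[\w-]+(?=,)|[\w-]+$')

def last_word_surname(name):
    """Return the upper-cased surname, or None if nothing matches (hypothetical helper)."""
    match = SURNAME.search(name.strip())
    return match.group(0).upper() if match else None

names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]

print([last_word_surname(name) for name in names])
```

With pandas, the same pattern should drop into df['name'].str.extract(r'([\w-]+(?=,)|[\w-]+$)', expand=False).str.upper().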
I've just started web scraping and decided to give it a go on the classic IMDb dataset. One of my columns ('actors'), is supposed to contain the name of several actors. This is how it looks right now:
"Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
My goal is to exclude the Director part, and keep only the actors as a list (for some data analysis):
["Zooey Deschanel", "Joseph Gordon-Levitt", "Geoffrey Arend", "Chloe Grace Moretz"]
What is the best way to achieve this result on ALL of my rows by using Python? Thank you!
You can simply split() the string:
data = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
actors = [x.strip() for x in data.split('|')[1].split(':')[1].split(',')]
print(actors)
Output:
['Zooey Deschanel', 'Joseph Gordon-Levitt', 'Geoffrey Arend', 'Chloë Grace Moretz']
Suppose the string is stored as s = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz"; then you can split it step by step as below:
Splitting the string into actors:
my_list = s.split('|') : This splits the string at | and returns a list
Output : ['Director: Marc Webb ', ' Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz']
my_list = my_list[1].split(':')
Output : [' Stars', ' Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz']
actors = my_list[1].split(',')
Output : [' Zooey Deschanel', ' Joseph Gordon-Levitt', ' Geoffrey Arend', ' Chloe Grace Moretz']
Now you have converted your string into the desired list format. Below is the complete code:
s = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz"
my_list = s.split('|') # <- this would separate director and stars
actors = my_list[1].split(':')[1].split(',') # <- this would split the stars part into one actor per element
print(actors)
The above code prints only the actors, as a list. Each name keeps its leading space, so call strip() on the elements if you need clean names.
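Since the question asks about ALL rows, the split approach can be wrapped in a small function that also copes with rows lacking a "Stars:" part. This is only a sketch: parse_actors is a made-up name, and the third sample row is invented to show the fallback.

```python
def parse_actors(cell):
    """Return the list of actor names found after 'Stars:' (hypothetical helper)."""
    # partition() returns (before, separator, after); separator is '' if absent.
    _, found, tail = cell.partition("Stars:")
    if not found:
        return []
    # Split on commas and drop surrounding whitespace and empty pieces.
    return [name.strip() for name in tail.split(",") if name.strip()]

rows = [
    "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz",
    "Director: Alfred Hitchcock | Stars: James Stewart, Kim Novak, Barbara Bel Geddes",
    "Director: Unknown",  # invented row without a Stars part
]

actors_per_row = [parse_actors(row) for row in rows]
print(actors_per_row)
```

If the data lives in a pandas column, something like df['actors'].apply(parse_actors) should apply the same function to every row.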
Assuming you have an array of strings containing data like you have described in your question, then you might consider doing something like the following:
import pprint
imdb_actors_columns = [
    "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz",
    "Director: Alfred Hitchcock | Stars: James Stewart, Kim Novak, Barbara Bel Geddes",
    # ... etc...
]
def does_look_like_stars_col(possible_stars_column_val):
    if possible_stars_column_val:
        return possible_stars_column_val.strip().lower().startswith('stars:')
    return False
# Split apart strings into ['Director: ...', 'Stars: ...']
tokenized_columns = map(lambda s: s.split('|'), imdb_actors_columns)
# Run through generated lists of [
#   ['Director:...', 'Stars:...'], ['Director:...', 'Stars:...'],
#   ...
# ]
# And filter sublists so that we only retain the
# [['Stars: ...'], ['Stars: ...']]
# Then use mapping to extract the first 'Stars:...' entries to top-level like:
# ['Stars: ...', 'Stars: ...']
star_actor_columns = map(lambda a: a[0],
    filter(bool,
        map(lambda all_columns: list(filter(
                does_look_like_stars_col, all_columns)),
            tokenized_columns
        )
    )
)
# Loop through all the "Stars: Name1, Name2, ..." strings, get the "Name"
# portions, and then strip away any leading or trailing spaces so that the
# final result is [['Name1', 'Name2', ...], ['OtherName1', 'OtherName2', ...]]
all_stars = [list(map(
    lambda s: s.strip(),
    raw_star_list.strip().replace('Stars:', '').split(',')
)) for raw_star_list in star_actor_columns]
pprint.pprint(all_stars)
Execution produces the following results:
[['Zooey Deschanel',
  'Joseph Gordon-Levitt',
  'Geoffrey Arend',
  'Chloë Grace Moretz'],
 ['James Stewart', 'Kim Novak', 'Barbara Bel Geddes']]
You can check this solution out on IDEOne.
You may use
import re
string = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
pattern = re.compile(r'Stars:(.+)')
stars = [star.strip() for m in pattern.findall(string) for star in m.split(",")]
Which yields
['Zooey Deschanel', 'Joseph Gordon-Levitt', 'Geoffrey Arend', 'Chloë Grace Moretz']
Or - using the regex module and \G:
import regex as re
string = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
pattern = re.compile(r'(?:\G(?!\A),\s*|Stars:\s+)\K[^,]+')
stars = pattern.findall(string)
print(stars)
See a demo for the latter on regex101.com.
Or - just for the sake of academic learning - use a proper parser:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
string = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
class MovieVisitor(NodeVisitor):
    grammar = Grammar(r"""
        line     = director sep stars
        director = ws? notsep colon notsep
        stars    = ws? notsep colon star+
        star     = notsep sep?
        colon    = ":"
        sep      = ~"[|,]"
        notsep   = ~"[^|,:]+"
        ws       = ~"\s+"
    """)
    def generic_visit(self, node, visited_children):
        return visited_children or node
    def visit_line(self, node, visited_children):
        director, _, stars = visited_children
        return {"director": director, "stars": stars}
    def visit_director(self, node, visited_children):
        *_, director = visited_children
        return director.text.strip()
    def visit_stars(self, node, visited_children):
        *_, stars = visited_children
        return stars
    def visit_star(self, node, visited_children):
        star, _ = visited_children
        return star.text.strip()
mv = MovieVisitor()
dct = mv.parse(string)
print(dct)
# {'director': 'Marc Webb', 'stars': ['Zooey Deschanel', 'Joseph Gordon-Levitt', 'Geoffrey Arend', 'Chloë Grace Moretz']}
string = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
#find the starting point for the part of the text which interests you.
start = string.find("Stars: ") + len("Stars: ")
#slice the string so that only the relevant part remains
string = string[start:]
#split the remaining string with a delimiter, here: ", "
output = string.split(", ")
As others have pointed out, you can of course do this in one line:
output = string.rsplit(" Stars: ", 1)[-1].split(", ")
or
output = next(reversed(string.rsplit(" Stars: ", 1))).split(", ")
Which yields the same result but is less readable in my opinion.
I have the following strings, which always follow a standard format:
'On 10/31/2018, Sally Brown picked 25 apples at the orchard.'
'On 11/01/2018, John Smith picked 12 peaches at the orchard.'
'On 09/15/2018, Jim Roe picked 10 pears at the orchard.'
I want to extract certain data fields into a series of lists:
['10/31/2018','Sally Brown','25','apples']
['11/01/2018','John Smith','12','peaches']
['09/15/2018','Jim Roe','10','pears']
As you can see, I need some of the sentence structure to be recognized, but not captured, so the program has context for where the data is located. The Regex that I thought would work is:
(?<=On\s)\d{2}\/\d{2}\/\d{4},\s(?=[A-Z][a-z]+\s[A-Z][a-z]+)\s.+?(?=\d+)\s(?=[a-z]+)\sat\sthe\sorchard\.
But of course, that is incorrect somehow.
This may be a simple question for someone, but I'm having trouble finding the answer. Thanks in advance, and someday when I'm more skilled I'll pay it forward on here.
Use \w+ to match a word (for ASCII text, \w is equivalent to [a-zA-Z0-9_]):
import re
text = """'On 10/31/2018, Sally Brown picked 25 apples at the orchard.'
'On 11/01/2018, John Smith picked 12 peaches at the orchard.'
'On 09/15/2018, Jim Roe picked 10 pears at the orchard.'"""
arr = re.findall(r'On\s(.*?),\s(\w+\s\w+)\s\w+\s(\d+)\s(\w+)', text)
print(arr)
# [('10/31/2018', 'Sally Brown', '25', 'apples'),
# ('11/01/2018', 'John Smith', '12', 'peaches'),
# ('09/15/2018', 'Jim Roe', '10', 'pears')]
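The question asks for sentence structure to be recognized but not captured. Plain literal text outside the capture groups already does that, so lookarounds are not needed. Below is a minimal sketch that matches one sentence at a time and returns a list, as the question requested. The names PICK_RE and parse_pick are made up for illustration, and the pattern assumes the fixed sentence format shown in the question.

```python
import re

# Literal text ("On ", " picked ", " at the orchard.") gives context;
# only the parenthesised groups end up in the result.
PICK_RE = re.compile(
    r"On (\d{2}/\d{2}/\d{4}), "        # date
    r"([A-Z][a-z]+ [A-Z][a-z]+) picked "  # first and last name
    r"(\d+) ([a-z]+) at the orchard\."     # count and fruit
)

def parse_pick(sentence):
    """Return [date, name, count, fruit], or None if the sentence does not fit."""
    match = PICK_RE.fullmatch(sentence)
    return list(match.groups()) if match else None

sentences = [
    "On 10/31/2018, Sally Brown picked 25 apples at the orchard.",
    "On 11/01/2018, John Smith picked 12 peaches at the orchard.",
    "On 09/15/2018, Jim Roe picked 10 pears at the orchard.",
]

for sentence in sentences:
    print(parse_pick(sentence))
```

Using fullmatch means a sentence that deviates from the format yields None instead of a partial match, which makes bad rows easy to spot.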
I would like to reconstruct full names from photo captions using regex in Python, by appending the last name to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I need to avoid matching patterns like 'Jay Smith and Albert Diamond', which would generate the non-existent name 'Smith Diamond'.
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
import re
s = 'John and Albert McDonald'
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex? Or is there a better way to do what I want? Thanks!
As you can rely on names starting with a capital letter, you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Swapping your current pattern for this one should work with your current Python code. You do need to strip() the captured results, though.
Which for your examples and current code would yield:
Input | First print | Second print
John and Albert McDonald | John McDonald | Albert McDonald
Stephen Stewart, John and Albert Diamond | John Diamond | Albert Diamond
It was a great day hanging out with John and Stephen Diamond. | John Diamond | Stephen Diamond
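With this pattern, the first group captures every capitalized word before "and", so 'Jay Smith and Albert Diamond' puts 'Jay Smith ' in group 1. One way to avoid the non-existent name is to expand the surname only when group 1 holds a single word. A rough sketch of that check follows; the function name expand_shared_surname is made up for illustration.

```python
import re

# Pattern from the answer above: capitalized words, "and", capitalized words.
PATTERN = re.compile(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)')

def expand_shared_surname(caption):
    """Return both full names for 'First1 and First2 Last', else an empty list."""
    match = PATTERN.search(caption)
    if not match:
        return []
    before = match.group(1).split()  # split() also strips the trailing space
    after = match.group(2).split()
    # Expand only when a lone first name precedes "and" and a full name follows.
    if len(before) != 1 or len(after) < 2:
        return []
    surname = after[-1]
    return [before[0] + ' ' + surname, ' '.join(after)]

captions = [
    'John and Albert McDonald',
    'Stephen Stewart, John and Albert Diamond',
    'It was a great day hanging out with John and Stephen Diamond.',
    'Jay Smith and Albert Diamond',
]

for caption in captions:
    print(caption, '->', expand_shared_surname(caption))
```

The last caption returns an empty list because two capitalized words precede "and", so no surname is borrowed.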