I'm working on a web scraping project where I am trying to extract names from a series of photo captions. I have the captions stored as a list of unicode strings such as:
Phil Collins, with Beth and Jerry Smith
I have been able to use foo = re.compile(r",? +with +|, +and +| +and +|, +") and re.split(foo, caption) to separate the captions into individual names such as:
['Phil Collins', 'Beth', 'Jerry Smith']
Unfortunately, I'm having trouble finding a way to split Beth and Jerry Smith (I'm new to regular expressions) in a way that can detect their shared surname and produce the output:
['Phil Collins', 'Beth Smith', 'Jerry Smith']
I am able to detect Beth and Jerry Smith using re.compile(r"[A-Z][a-z]+ +and +[A-Z][a-z]+ +[A-Z][a-z]+"), but I am not sure of the best way to process the match once it is detected.
The problem I am trying to tackle is that I need to iterate over the list of names, detect that 'Beth' is not a full name, read 'Jerry Smith', and finally read and append 'Smith' to 'Beth' giving me a complete list of: ['Phil Collins', 'Beth Smith', 'Jerry Smith']
Is there a method in re that can pipe the matching substring to a function so I can modify it to include Beth's surname? Or am I even approaching this problem the right way?
Instead of searching the names and the delimiters with a complex RegEx, you could split the text using re.split and a smaller RegEx of all possible delimiters.
Here, the delimiters I see are ", with" and "and" (with whitespace before and after). You can build the RegEx by joining the delimiters with "|".
import re
text = "Phil Collins, with Beth and Jerry Smith"
delimiters = [r",\s+with\s+", r"\s+and\s+"]
regex = "|".join(delimiters)
print(re.split(regex, text, flags=re.IGNORECASE))
# -> ['Phil Collins', 'Beth', 'Jerry Smith']
EDIT
To join "Beth" with "Smith" and "Jerry" with "Smith", you first need to split on the "with", and then split each part on the "and".
import re
text = "Phil Collins, with Beth and Jerry Smith"
for part in re.split(r",\s+with\s+", text):
    # Capture the first name(s) and the shared last name of this part.
    first, last = re.findall(r"(\w+(?:\s+and\s+\w+)?)\s+(\w+)",
                             part, flags=re.UNICODE)[0]
    names = re.split(r"\s+and\s+", first)
    result = [name + " " + last
              for name in names]
    print(result)
You get:
['Phil Collins']
['Beth Smith', 'Jerry Smith']
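On the question of passing the matched substring to a function: re.sub accepts a callable as its replacement argument and calls it with each match object. The following is a minimal sketch of that approach, not the answer's method; the names PAIR, DELIMITERS, add_surname and split_caption are made up for illustration, and it assumes every name is a single capitalised ASCII word.

```python
import re

# Hypothetical helper pattern: "First and First Last", e.g. "Beth and Jerry Smith".
PAIR = re.compile(r"([A-Z][a-z]+)\s+and\s+([A-Z][a-z]+)\s+([A-Z][a-z]+)")
# Delimiters between names; the optional comma is consumed together with the word.
DELIMITERS = re.compile(r",?\s+with\s+|,?\s+and\s+|,\s+")

def add_surname(match):
    """Called by re.sub for each match; returns the replacement text."""
    first_a, first_b, surname = match.groups()
    return "{} {} and {} {}".format(first_a, surname, first_b, surname)

def split_caption(caption):
    # Step 1: give the lone first name its surname.
    expanded = PAIR.sub(add_surname, caption)
    # Step 2: split on the delimiters as before.
    return DELIMITERS.split(expanded)

print(split_caption("Phil Collins, with Beth and Jerry Smith"))
# -> ['Phil Collins', 'Beth Smith', 'Jerry Smith']
```

Note that this sketch does not check what precedes the pair, so a caption such as "Jay Smith and Albert Diamond" would be expanded incorrectly; it only illustrates the callback mechanism.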
I can't understand how the re module works. I have made many attempts to get the entire name, whether there is only one name or several names (surnames).
This is the re.compile() pattern that I'm using to get the name when the string optionally contains the surname:
the_formmat = re.compile(r"Mr?s?\.?\s[A-Z][a-z]+\s[A-Z][a-z]+")
the_string = "this is Mr Samantha Rajapaksa and his wife Mrs. Chalani Rajapaksa. his fathers name is Mr Prabath and his mothers name is Mrs Karunarathnage Dayawathi Bandara Peiris "
print(the_formmat.findall(the_string))
I know the use case of the ? modifier but I don't know where to put it to get the surname if there is one or more.
From the above example I get this output:
['Mr Samantha Rajapaksa', 'Mrs. Chalani Rajapaksa', 'Mrs Karunarathnage Dayawathi']
The output that I want is:
['Mr Samantha Rajapaksa', 'Mrs. Chalani Rajapaksa', 'Mr Prabath', 'Mrs Karunarathnage Dayawathi Bandara Peiris']
Try this Regex:
/(?:Mr|Ms|Mrs)\.?(?: [A-Z][a-z]+)+/
Edited thanks to #treuss.
So change your the_formmat variable to:
the_formmat = re.compile(r"(?:Mr|Ms|Mrs)\.?(?: [A-Z][a-z]+)+")
What it does is check for Mr/Ms/Mrs (optionally followed by a period), and then keep matching a space followed by a word that starts with an uppercase letter until the text no longer matches.
You could check this RegExr link to learn more.
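As a quick check, here is a small sketch that runs the suggested pattern against the sample string from the question. The variable name title_name is made up for illustration; the pattern and the string are the ones shown above. The key point is that the + quantifier is applied to the whole "space plus capitalised word" group, so one or more name parts are matched.

```python
import re

# The quantifier sits on the whole group, so 1..n name parts are accepted.
title_name = re.compile(r"(?:Mr|Ms|Mrs)\.?(?: [A-Z][a-z]+)+")

the_string = ("this is Mr Samantha Rajapaksa and his wife Mrs. Chalani Rajapaksa. "
              "his fathers name is Mr Prabath and his mothers name is "
              "Mrs Karunarathnage Dayawathi Bandara Peiris ")

matches = title_name.findall(the_string)
for match in matches:
    print(match)
```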
I have over 20,000 first and last names, and I want to check whether a sentence contains any first name or last name from my dataset. This is my dataset:
l-name f-name
میلاد جورابلو
علی احمدی
امیر احمدی
This is the sample sentence:
sentence = 'امروز با میلاد احمدی رفتم بیرون'
The English version of the dataset:
l-name f-name
Smith John
Johnson Anthony
Williams Ethan
This is the sentence in the English version:
sentence = 'I am going out with John Williams today'
I want my output to be like this:
first_name = ['John']
last_name = ['Williams']
Just get lists of names from each column and check if a string contains any element from those lists.
import pandas as pd
names = [['John', 'Smith'], ['Anthony', 'Johnson'], ['Ethan', 'Williams']]
df = pd.DataFrame(names, columns = ['f_name', 'l_name'])
fname_list = df['f_name'].to_list()
lname_list = df['l_name'].to_list()
sentence = 'I am going out with John Williams today'
sentence = sentence.split()
fname_exist = [e for e in sentence if(e in fname_list)]
lname_exist = [e for e in sentence if(e in lname_list)]
if len(fname_exist) > 0 and len(lname_exist) > 0:
    print('first name: ' + fname_exist[0])
    print('last name: ' + lname_exist[0])
Output:
first name: John
last name: Williams
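With around 20,000 names, membership tests against a list scan the whole list for every word, whereas a set lookup takes constant time on average. The following is a minimal sketch of the same idea using plain Python sets instead of pandas; the function name find_names and the small sample sets are assumptions for illustration only.

```python
# Hypothetical sample data; in practice these would be built from the dataset columns.
first_names = {"John", "Anthony", "Ethan"}
last_names = {"Smith", "Johnson", "Williams"}

def find_names(sentence, first_names, last_names):
    """Return the words of the sentence found in each set, in sentence order."""
    tokens = sentence.split()
    found_first = [t for t in tokens if t in first_names]
    found_last = [t for t in tokens if t in last_names]
    return found_first, found_last

first_name, last_name = find_names(
    "I am going out with John Williams today", first_names, last_names)
print(first_name)  # ['John']
print(last_name)   # ['Williams']
```

Like the list-based version, this matches whole whitespace-separated words only, so punctuation attached to a name (for example "Williams,") would need to be stripped first.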
If you would like to approach this in a naive way you could consider regex, however this is based on the assumption that all first and last names are capitalised.
import re
sentence = 'I am going out with John Williams today'
name = re.search(r"[A-Z]{1}[a-z]+ [A-Z]{1}[a-z]+", sentence).group()
print(name) # Outputs: John Williams
This will search for a capital letter followed by one or more lower-case letters, then a space, then a repeat of the same pattern.
Outside of this, you could consider using Named Entity Recognition (NER) using pre-built libraries to identify names in text. Please see here for more details. https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/
Edit:
I should add that in the event where there are multiple names within the same sentence, you can apply re.findall():
sentence = 'I am going out with John Williams and William Smith today'
names = re.findall(r"[A-Z]{1}[a-z]+ [A-Z]{1}[a-z]+", sentence)
print(names) # Outputs: ['John Williams', 'William Smith']
I have a string with first and last names all separated with a space.
For example:
installers = "Joe Bloggs John Murphy Peter Smith"
I now need to replace every second space with ', ' (comma followed by a space) and output this as string.
The desired output is
print installers
Joe Bloggs, John Murphy, Peter Smith
You should be able to do this with a regex that finds each pair of spaces and replaces the second one:
import re
installers = "Joe Bloggs John Murphy Peter Smith"
re.sub(r'(\s\S*?)\s', r'\1, ',installers)
# 'Joe Bloggs, John Murphy, Peter Smith'
This says, find a space followed by some non-spaces followed by a space and replace it with the found space followed by some non-spaces and ", ". You could add installers.strip() if there's a possibility of trailing spaces on the string.
One way to do this is to split the string into a list of space-separated words, get an iterator for the list, then loop over the iterator in a for loop, collecting the first name and then advancing the iterator to get the second name too.
names = installers.split()
it = iter(names)
out = []
for name in it:
    next_name = next(it)
    full_name = '{} {}'.format(name, next_name)
    out.append(full_name)
fixed = ', '.join(out)
print(fixed)
# Joe Bloggs, John Murphy, Peter Smith
The one line version of this would be
>>> ', '.join(' '.join(s) for s in zip(*[iter(installers.split())]*2))
'Joe Bloggs, John Murphy, Peter Smith'
This works by creating a list that contains the same iterator twice, so the zip function returns both parts of the name. See also the grouper recipe from the itertools recipes.
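Both versions above assume an even number of words: the loop raises StopIteration on a dangling word, and zip silently drops it. As a hedged sketch in the spirit of the grouper recipe, itertools.zip_longest (Python 3; the function name pair_names is made up here) keeps a trailing unpaired word instead.

```python
from itertools import zip_longest

def pair_names(installers):
    """Join words two at a time with ', ' between pairs; keeps a trailing odd word."""
    it = iter(installers.split())
    # Passing the same iterator twice makes zip_longest consume words in pairs.
    pairs = zip_longest(it, it, fillvalue="")
    return ", ".join(" ".join(pair).strip() for pair in pairs)

print(pair_names("Joe Bloggs John Murphy Peter Smith"))
# Joe Bloggs, John Murphy, Peter Smith
```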
I would like to reconstruct full names from photo captions using regex in Python, by appending the last name back to the first name in patterns of the form "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I would need to avoid matching patterns like 'Jay Smith and Albert Diamond', which would generate a non-existent name 'Smith Diamond'.
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
import re
s = 'John and Albert McDonald'
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex expression? Or is there a better way to do what I want? Thanks!
As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Swapping out your current pattern for this one should work with your current Python code. You do need to strip() the captured results, though, because the first group includes trailing whitespace.
For your examples and current code, this would yield:
Input | First print | Second print
John and Albert McDonald | John McDonald | Albert McDonald
Stephen Stewart, John and Albert Diamond | John Diamond | Albert Diamond
It was a great day hanging out with John and Stephen Diamond. | John Diamond | Stephen Diamond
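Because the first group of this pattern captures every capitalised word before "and", its word count can be used to tell "John and Albert McDonald" apart from "Jay Smith and Albert Diamond". The following is a sketch of one way to use that, under the assumption that a single word before "and" means the surname is shared; the function name expand_shared_surname is made up for illustration.

```python
import re

PATTERN = re.compile(r"((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)")

def expand_shared_surname(caption):
    """Return the two full names around 'and', or [] if the pattern is absent."""
    match = PATTERN.search(caption)
    if not match:
        return []
    left = match.group(1).split()   # split() also drops the trailing whitespace
    right = match.group(2).split()
    if len(left) == 1 and len(right) >= 2:
        # Lone first name on the left: borrow everything after the right first name.
        surname = " ".join(right[1:])
        return [left[0] + " " + surname, " ".join(right)]
    # The left side already looks like a full name; leave both sides untouched.
    return [" ".join(left), " ".join(right)]

print(expand_shared_surname("John and Albert McDonald"))
# ['John McDonald', 'Albert McDonald']
print(expand_shared_surname("Jay Smith and Albert Diamond"))
# ['Jay Smith', 'Albert Diamond']
```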
I've recently started working with Python 2.7 and I've got an assignment in which I get a text file containing data separated with space. I would need to split every line into strings containing only one type of data. Here's an example:
Bruce Wayne 10012-34321 2016.02.20. 231231
John Doe 10201-11021 2016.01.10. 2310456
Chris Taylor 10001-31021 2015.12.30. 524432
James Michael Kent 10210-41011 2016.02.03. 3235332
I want to separate them by name, id, date, balance but the only thing I know is split, which I can't really use because the last given name has three parts instead of two. How can I split a line based on characters? What could be the solution in this case?
Any help is appreciated.
You'll want to use str.rsplit() and supply a max number of splits, like this:
>>> s = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
>>> s.rsplit(' ', 3)
['James Michael Kent', '10210-41011', '2016.02.03.', '3235332']
>>> s = 'Chris Taylor 10001-31021 2015.12.30. 524432'
>>> s.rsplit(' ', 3)
['Chris Taylor', '10001-31021', '2015.12.30.', '524432']
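Building on rsplit, here is a small sketch that unpacks each line into named fields. The function name parse_record and the dictionary keys are assumptions for illustration, and converting the balance to int assumes it is always a whole number. Passing None as the separator splits on any run of whitespace.

```python
def parse_record(line):
    """Split one line into name, id, date and balance, counting fields from the right."""
    name, account_id, date, balance = line.strip().rsplit(None, 3)
    return {"name": name, "id": account_id, "date": date, "balance": int(balance)}

record = parse_record("James Michael Kent 10210-41011 2016.02.03. 3235332")
print(record["name"])     # James Michael Kent
print(record["balance"])  # 3235332
```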
What you need is to index the list created by split, counting from the end:
To get the details:
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[-3:]
['10210-41011', '2016.02.03.', '3235332']
ln = 'Bruce Wayne 10012-34321 2016.02.20. 231231'
ln.split()[-3:]
['10012-34321', '2016.02.20.', '231231']
To get names:
ln.split()[:-3]
['Bruce', 'Wayne']
ln = 'James Michael Kent 10210-41011 2016.02.03. 3235332'
ln.split()[:-3]
['James', 'Michael', 'Kent']
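If the name is needed as a single string rather than a list of parts, the slices can be combined with str.join. A minimal sketch, with the function name split_record made up for illustration:

```python
def split_record(line):
    """Return (full name, [id, date, balance]) using slices counted from the end."""
    parts = line.split()
    return " ".join(parts[:-3]), parts[-3:]

name, details = split_record("James Michael Kent 10210-41011 2016.02.03. 3235332")
print(name)     # James Michael Kent
print(details)  # ['10210-41011', '2016.02.03.', '3235332']
```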