Extract names of a sentence with regex - python

I'm very new to regex syntax, though I have already read a bit about the library. I'm trying to extract names from a simple sentence but have run into trouble. Below is an example of what I've done.
x = 'Fred used to play with his brother, Billy, both are 10 and their parents Jude and Edde have two more kids.'
import re
re.findall('^[A-Za-z ]+$',x)
Can anyone explain what is wrong and how to proceed?

Use
re.findall(r'\b[A-Z]\w*', x)
It matches words that start with an uppercase letter, followed by any number of letters, digits, or underscores.

I think your regex has two problems.
You want to extract names from within the sentence, so you need to remove the ^ (start of line) and $ (end of line) anchors.
A name starts with an uppercase letter and does not contain spaces, so you should remove the space from the character class in your regex.
You could use the following regex.
\b[A-Z][A-Za-z]+\b
I also tested the result in Python.
x = 'Fred used to play with his brother, Billy, both are 10 and their parents Jude and Edde have two more kids.'
import re
result = re.findall('\\b[A-Z][A-Za-z]+\\b',x)
print(result)
Result.
['Fred', 'Billy', 'Jude', 'Edde']
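To make the difference concrete, here is a minimal sketch that runs the anchored pattern from the question next to the unanchored one from the answer. The variable names anchored and names are only illustrative.

```python
import re

x = ('Fred used to play with his brother, Billy, both are 10 and their '
     'parents Jude and Edde have two more kids.')

# Anchored pattern: the whole string must consist of letters and spaces.
# The commas, digits, and full stop break that, so nothing matches.
anchored = re.findall(r'^[A-Za-z ]+$', x)

# Unanchored pattern: each capitalised word is matched on its own.
names = re.findall(r'\b[A-Z][A-Za-z]+\b', x)

print(anchored)  # []
print(names)     # ['Fred', 'Billy', 'Jude', 'Edde']
```

Note that this picks up any capitalised word of two or more letters, so a sentence that starts with an ordinary word would contribute that word as well.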

Related

Regex for matching exact words that contain apostrophes in Python?

For this project I'm using exact regex patterns rather than more general ones. I'm counting occurrences of words from a list called vocabWords, which I import into my script from a text file; each word in the list is in the format \bword\b.
When I run my script, \bwhat\b picks up both "what" and "what's", but \bwhat's\b picks up nothing. If I switch the order so that the apostrophe word comes before the root word, the words are counted correctly. How can I change my regex list so the words are counted correctly? I understand the problem is the use of "\b", but I haven't been able to find out how to fix it. I cannot use a more general regex, and I have to include the words themselves in the regex pattern.
vocabWords:
\bwhat\b
\bwhat's\b
\biron\b
\biron's\b
My code:
matched = []
regex_all = re.compile('|'.join(vocabWords))
for row in df['test']:
matched.append(re.findall(regex_all, row))
There are at least two more solutions:
Test that the next character isn't an apostrophe: r"\bwhat(?!')\b"
Use a more general rule, r"\bwhat(?:'s)?\b", to catch both variants, with and without the apostrophe.
If you sort your wordlist by length before turning it into a regexp, longer words (like "what's") will precede shorter words (like "what"). This should do the trick.
regex_all = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))
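A small sketch of why the ordering matters, using the four patterns from the question. The sample text and the snake_case names are made up for illustration. Alternation tries the branches left to right, and \b matches between "t" and the apostrophe, so the shorter branch wins whenever it comes first.

```python
import re

# Patterns as they appear in the question's vocabWords list.
vocab_words = [r"\bwhat\b", r"\bwhat's\b", r"\biron\b", r"\biron's\b"]

# Hypothetical sample text, chosen to contain all four forms.
text = "what's that? what iron's made of iron"

# Original order: the short branch matches first and hides the apostrophe form.
unsorted_rx = re.compile('|'.join(vocab_words))

# Longest pattern first: the apostrophe forms get tried before their roots.
sorted_rx = re.compile('|'.join(sorted(vocab_words, key=len, reverse=True)))

print(unsorted_rx.findall(text))  # ['what', 'what', 'iron', 'iron']
print(sorted_rx.findall(text))    # ["what's", 'what', "iron's", 'iron']
```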

RegEx with Python: findall inside a boundary

I have a string that can be illustrated by the following (extra spaces intended):
"words that don't matter START some words one some words two some words three END words that don't matter"
To grab each substring between START and END, ['some words one', 'some words two', 'some words three'], I wrote the following code:
result = re.search(r'(?<=START).*?(?=END)', string, flags=re.S).group()
result = re.findall(r'(\(?\w+(?:\s\w+)*\)?)', result)
Is it possible to achieve this with one single regex?
In theory you could just wrap your second regex in ()* and put it into your first. That would capture all occurrences of your inner expression in the bounds. Unfortunately the Python implementation only retains the last match of a group that is matched multiple times. The only implementation that I know that retains all matches of a group is the .NET one. So unfortunately not a solution for you.
On the other hand why can't you simply keep the two step approach that you have?
Edit:
You can compare the behaviour I described using online regex tools.
Pattern: (\w+\s*)* Input: aaa bbb ccc
Try it for example with https://pythex.org/ and http://regexstorm.net/tester.
You will see that Python returns one match/group which is ccc while .NET returns $1 as three captures aaa, bbb, ccc.
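The Python half of that comparison can also be checked locally with the standard re module. This is only a sketch of the behaviour described above; the names whole, last, and each are illustrative.

```python
import re

m = re.match(r'(\w+\s*)*', 'aaa bbb ccc')

whole = m.group(0)  # the full match spans all three words
last = m.group(1)   # the repeated group keeps only its final iteration

# Getting every word needs a separate pass (or a second regex).
each = re.findall(r'\w+', 'aaa bbb ccc')

print(whole)  # aaa bbb ccc
print(last)   # ccc
print(each)   # ['aaa', 'bbb', 'ccc']
```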
Edit2: As @Jan says, there is also the newer regex module, which supports multiple captures. I had completely forgotten about that.
With the newer regex module, you can do it in one step:
(?:\G(?!\A)|START)\s*\K
(?!\bEND\b)
\w+\s+\w+\s+\w+
This looks complicated, but broken down, it says:
(?:\G(?!\A)|START) # look for START or the end of the last match
\s*\K # whitespaces, \K "forgets" all characters to the left
(?!\bEND\b) # neg. lookahead, do not overrun END
\w+\s+\w+\s+\w+ # your original expression
In Python this looks like:
import regex as re
rx = re.compile(r'''
(?:\G(?!\A)|START)\s*\K
(?!\bEND\b)
\w+\s+\w+\s+\w+''', re.VERBOSE)
string = "words that don't matter START some words one some words two some words three END words that don't matter"
print(rx.findall(string))
# ['some words one', 'some words two', 'some words three']
Additionally, see a demo on regex101.com.
This is an ideal situation where we could use re.split to circumvent the problem, which @PeterE mentioned, of having access only to the last captured group.
import re
s=r'"words that don\'t matter START some words one some words two some words three END words that don\'t matter" START abc a bc c END'
print('\n'.join(re.split(r'^.*?START\s+|\s+END.*?START\s+|\s+END.*?$|\s{2,}',s)[1:-1]))
Enable a re.MULTILINE/re.M flag as we are using ^ and $.
OUTPUT
some words one
some words two
some words three
abc
a bc c
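For comparison, the two-step approach can be wrapped in a small helper that uses only the standard re module. This is a sketch under one assumption: the items inside each START ... END block are separated by two or more whitespace characters, as the \s{2,} in the split pattern above suggests. The helper name between_markers and the sample string are hypothetical.

```python
import re

def between_markers(text):
    """Return the items found inside every START ... END block."""
    chunks = []
    # Step 1: grab the raw text of each block (non-greedy, may span lines).
    for block in re.findall(r'START(.*?)END', text, flags=re.S):
        # Step 2: split the block on runs of two or more whitespace characters.
        chunks.extend(p for p in re.split(r'\s{2,}', block.strip()) if p)
    return chunks

# Assumed input: items separated by at least two spaces.
sample = ("words that don't matter START some words one  some words two   "
          "some words three END words that don't matter START abc  a bc c END")

print(between_markers(sample))
# ['some words one', 'some words two', 'some words three', 'abc', 'a bc c']
```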

Regular expression: repeating patterns in the beginning of string

For example, consider the following string: "apple1: apple2: apple3: some random words here apple4:"
I want to match only apple1, apple2 and apple3 but not apple4. I am having a hard time figuring out how to achieve this.
Any help is appreciated.
Thanks.
If you are using .NET, you can match the pattern below and then use the Captures property of the group to get all the different apples matched along the way.
(?:(apple\d).*?){3}
If you only want to match the first one:
apple\d
Sweet and simple. Just call match on this once.
So, maybe something like this:
^([A-Za-z]+)[^A-Za-z]+(\1[^A-Za-z]+)+
http://regexr.com/38vvb
From your comment, it sounds like you want to match the occurrences of apple followed by a digit throughout the string except an occurrence of apple followed by a digit at the end of the string.
>>> import re
>>> text = 'apple1: apple2: apple3: some random words here apple4:'
>>> matches = re.findall(r'(\bapple\d+):(?!$)', text)
>>> matches
['apple1', 'apple2', 'apple3']
Sorry guys, I did not format my question properly, it wasn't clear.
I found the solution:
r'\s*((apple)\d+[ \:\,]*)+'
Thanks for all your help!
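One way to turn a pattern like that into the list of leading items is to match the repeated run at the start of the string first and then pull the individual items out of it. This is a sketch rather than the poster's exact code: the group is made non-capturing, and the names leading, prefix, and apples are illustrative.

```python
import re

text = 'apple1: apple2: apple3: some random words here apple4:'

# Step 1: match the run of "appleN" items at the very start of the string.
leading = re.match(r'\s*(?:apple\d+[ :,]*)+', text)
prefix = leading.group(0) if leading else ''

# Step 2: extract each item from that prefix only, so apple4 is never seen.
apples = re.findall(r'apple\d+', prefix)

print(apples)  # ['apple1', 'apple2', 'apple3']
```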

Using the .split() function based on conditions?

How would you be able to use the .split() function based on conditions?
Let's say I have the raw data:
Apples,Oranges,Strawberries Green beans,Tomatoes,Broccoli
My intended result is:
['Apples','Oranges','Strawberries','Green beans','Tomatoes','Broccoli']
Is it possible to split at commas and also at a space that is followed by a capital letter?
The literal interpretation of what you asked for, using re.split:
import re
pat = re.compile(r'\s(?=[A-Z])|,')
pat.split(my_str)
This is more simply done, in your case:
pat = re.compile(r'.(?=[A-Z])')
Basically, split on any character that is followed by a capital letter.
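Both patterns can be tried on the raw data from the question. In this sketch the variable raw stands in for my_str, which the answer leaves undefined.

```python
import re

raw = 'Apples,Oranges,Strawberries Green beans,Tomatoes,Broccoli'

# Literal reading: split on a comma, or on whitespace followed by a capital.
literal = re.split(r'\s(?=[A-Z])|,', raw)

# Shortcut: split on any single character that precedes a capital letter.
shortcut = re.split(r'.(?=[A-Z])', raw)

print(literal)
# ['Apples', 'Oranges', 'Strawberries', 'Green beans', 'Tomatoes', 'Broccoli']
print(shortcut == literal)  # True
```

The space in "Green beans" is followed by a lowercase letter, so neither pattern splits there. The shortcut only works for data like this, where every capital letter marks the start of a new item.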
Using regex will make the code simpler than a complicated split statement.
import re
...
re.findall(", [A-Z]",data)
Note that you asked for a split on a comma, space, and capital letter, but in your example there are no spaces after the commas.

Find words with capital letters not at start of a sentence with regex

Using Python and regex I am trying to find words in a piece of text that start with a capital letter but are not at the start of a sentence.
The best way I can think of is to check that the word is not preceded by a full stop then a space. I am pretty sure that I need to use negative lookbehind. This is what I have so far, it will run but always returns nothing:
(?<!\.\s)\b[A-Z][a-z]*\b
I think the problem might be with the use of [A-Z][a-z]* inside the word boundary \b but I am really not sure.
Thanks for the help.
Your regex appears to work:
In [6]: import re
In [7]: re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', 'lookbehind. This is what I have')
Out[7]: ['I']
Make sure you're using a raw string (r'...') when specifying the regex.
If you have some specific inputs on which the regex doesn't work, please add them to your question.
Although you asked specifically for a regex, it may be interesting to also consider a list comprehension. They're sometimes a bit more readable (although in this case, probably at the cost of efficiency). Here's one way to achieve this:
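import string
S = "T'was brillig, and the slithy Toves were gyring and gimbling in the " + \
    "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe."
LS = S.split(' ')
# Pair each word with the word before it; '.' stands in before the first word.
# string.ascii_uppercase is the Python 3 name for Python 2's string.uppercase.
words = [x for (pre,x) in zip(['.']+LS, LS+[' '])
         if (x[0] in string.ascii_uppercase) and (pre[-1] != '.')]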
Try and loop over your input with:
(?!^)\b([A-Z]\w+)
and capture the first group. As you can see, a negative lookahead can be used as well, since the position you want to match is everything but a beginning of line. A negative lookbehind would have the same effect.
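The lookbehind from the question and the (?!^) idea can be combined. The sketch below uses a made-up sample text and assumes sentences end with a full stop followed by a single whitespace character. It shows that the lookbehind alone still matches the very first word of the text, because nothing precedes it, while adding (?!^) excludes it.

```python
import re

# Hypothetical sample text with two sentences.
text = "The cat met Alice. Then Bob and Carol left for Paris."

# Lookbehind only: "Then" is skipped, but "The" at position 0 still matches.
lookbehind_only = re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', text)

# Adding (?!^) also rejects a match at the very start of the string.
combined = re.findall(r'(?!^)(?<!\.\s)\b[A-Z][a-z]*\b', text)

print(lookbehind_only)  # ['The', 'Alice', 'Bob', 'Carol', 'Paris']
print(combined)         # ['Alice', 'Bob', 'Carol', 'Paris']
```

Sentences ending in "!" or "?", or followed by more than one space, are not handled by this pattern.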
