I am trying to capture a phrase that starts with a capital letter and sits between two known phrases, say between "Known phrase, " and the word "The".
For example in the text below, the phrase I'm trying to capture is: Stuff TO CApture That always start with Capital letter but stop capturing when
Ignore Words Known phrase, ignore random phrase Stuff TO CApture That always start with Capital letter but stop capturing when The appears.
Regex I have tried: (?<=Known phrase, ).*(?= The) and Known phrase, (.*) The
These regexes also capture "ignore random phrase". How do I ignore this?
For your example data, you might use:
Known phrase, [a-z ]+([A-Z].*?) The
Explanation
Known phrase, Match literally
[a-z ]+ Match 1+ times a lowercase character or a space (Add to the character class what you would allow to match except an Uppercase character)
([A-Z].*?) Capture in a group an uppercase character followed by 0+ times any character except a newline, as few times as possible (non-greedy).
The Match literally
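As a quick check in Python, the pattern above can be run against the sample sentence from the question. This is only a minimal sketch; the variable names are illustrative.

```python
import re

text = ("Ignore Words Known phrase, ignore random phrase Stuff TO CApture "
        "That always start with Capital letter but stop capturing when "
        "The appears.")

# [a-z ]+ skips the lowercase words after the known phrase;
# the lazy .*? stops at the first " The" that follows.
pattern = re.compile(r"Known phrase, [a-z ]+([A-Z].*?) The")

match = pattern.search(text)
captured = match.group(1) if match else None
print(captured)
```

If the text to skip can contain digits or punctuation, those characters would have to be added to the [a-z ] class as well.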
I guess that, since the regex engine works from left to right and quantifiers are greedy, you should first consume everything that is not a capital letter.
Something like /Start[^A-Z]*(.*)stop/
([^A-Z] matches anything that is not a capital letter)
I'm not sure of what you are trying to do, but, trying to stick with your code, (?<=Known phrase, )([^A-Z]*)(.*)(?=The) should do the trick: the text you need is in the group 2.
If you need to match everything just change to (.*)(?<=Known phrase, )([^A-Z]*)(.*)(?=The)(.*) and get your text in group 3.
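A small Python sketch of the lookbehind version, run on the sample sentence from the question (variable names are illustrative). Note two things it makes visible: group 2 keeps the space before "The" because the lookahead does not include it, and the greedy (.*) would run up to the last "The" if the line contained several.

```python
import re

text = ("Ignore Words Known phrase, ignore random phrase Stuff TO CApture "
        "That always start with Capital letter but stop capturing when "
        "The appears.")

pattern = re.compile(r"(?<=Known phrase, )([^A-Z]*)(.*)(?=The)")
match = pattern.search(text)

skipped = match.group(1)  # lowercase text before the first capital letter
wanted = match.group(2)   # from the first capital letter up to "The"

print(repr(skipped))
print(repr(wanted.rstrip()))  # rstrip() drops the trailing space
```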
I have a text file of the type:
[...speech...]
NAME_OF_SPEAKER_1: [...speech...]
NAME_OF_SPEAKER_2: [...speech...]
My aim is to isolate the speeches of the various speakers. The speakers are clearly identified because each name is always written in uppercase letters (name + surname). However, the speeches can contain nouns (not people's names) that are also in uppercase, but only one of them is long enough to cause trouble (it has four letters; say it is 'ABCD'). I was thinking of identifying the position of each speaker's name (I assume every name is at least 3 letters long) with something like
re.search('[A-Z^(ABCD)]{3,}',text_to_search)
in order to exclude that specific (constant) word 'ABCD'. However, the command identifies that word instead of excluding it. Any ideas about how to overcome this problem?
In the pattern that you tried, you get partial matches, as there are no boundaries and [A-Z^(ABCD)]{3,} will match 3 or more times any of the listed characters.
A-Z will also match ABCD, so it could also be written as [A-Z^)(]{3,}
Instead of trying to exclude the word inside the character class, you could use a negative lookahead (?!...) to assert that a word consisting only of the uppercase chars A-Z does not contain ABCD:
\b(?![A-Z]*ABCD)[A-Z]{3,}\b
If the name should start with 3 uppercase char, and can contain also lowercase chars, an underscore or digits, you could add \w* after matching 3 uppercase chars:
\b(?![A-Z]*ABCD)[A-Z]{3}\w*\b
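To see the negative lookahead at work, here is a minimal Python sketch. The transcript and the speaker names in it are made up for the demo; only the word ABCD comes from the question.

```python
import re

# Hypothetical transcript: the names are invented for this example.
text = ("JOHN SMITH: We read the ABCD report yesterday.\n"
        "JANE DOE: The ABCD figures look fine.\n")

# A whole word of 3+ uppercase letters that does not contain ABCD.
name_word = re.compile(r"\b(?![A-Z]*ABCD)[A-Z]{3,}\b")

found = name_word.findall(text)
print(found)
```

Each uppercase word is matched separately here, so a name and a surname come back as two items.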
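Square brackets [] match a single character only, and round brackets () inside square brackets are just literal characters. That means:
[ABCD] is the same as [A-D], and [(ABCD)] is the same as [A-D()].
[^(ABCD)] matches any character that is not one of A-D, ( or ).
In [A-Z^(ABCD)] the ^ is not the first character of the class, so it does not negate anything; it is a literal ^.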
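How a character class is actually read can be checked with a couple of findall calls. This is only an illustrative sketch with made-up test strings.

```python
import re

# Parentheses inside [...] are ordinary characters, not a group.
paren_hits = re.findall(r"[(ABCD)]", "A(x)^D")
print(paren_hits)

# A ^ that is not the first character of the class is a literal ^,
# so this class still matches ABCD (and also "(", ")" and "^").
class_hits = re.findall(r"[A-Z^(ABCD)]{3,}", "see ABCD and (X^Y)")
print(class_hits)
```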
I would try something different:
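^[A-Z]*?: matches a word written in capital letters that starts at the beginning of a line and is followed by a colon (in Python, pass re.MULTILINE so that ^ matches at the start of every line; add a space or an underscore to the class if the names contain them)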
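A possible way to turn the line-start idea into code is sketched below. It assumes that speaker names may contain digits, underscores and spaces, as in the NAME_OF_SPEAKER_1 placeholder from the question, so the character class is wider than in the pattern above; the sample text is made up.

```python
import re

# Hypothetical input modelled on the layout in the question.
text = ("NAME_OF_SPEAKER_1: first speech mentioning ABCD\n"
        "NAME_OF_SPEAKER_2: second speech\n")

# Assumption: a speaker label is an uppercase run at the start of a line,
# possibly with digits, underscores or spaces, followed by a colon.
speaker = re.compile(r"^([A-Z][A-Z0-9_ ]*):[ \t]*(.*)$", re.MULTILINE)

turns = speaker.findall(text)
for name, speech in turns:
    print(name, "->", speech)
```

Because the label must be at the start of a line and end with a colon, an uppercase word such as ABCD inside a speech is not mistaken for a speaker.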
I'd like to define a regular expression in Python 3 that extracts words that start with a letter and end with a digit.
What I've been trying is r'^[a-z][A-Z].[0-9]$'
and it didn't return a single word.
Use
r'\b[A-Za-z]\w*[0-9]\b'
This matches words that begin with a letter, have any word characters after it, and end in a digit. Notice the word boundaries, which make it match whole words only.
As per the valuable comment below, consider an alternative:
r'\b[A-Za-z][A-Za-z0-9]*[0-9]\b'
The [A-Za-z0-9]* won't match underscores while \w will.
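The difference between the two patterns shows up as soon as a word contains an underscore. A minimal sketch with a made-up sample string:

```python
import re

text = "abc1 x9 9lives foo_bar2 plain A7b3"

# \w also accepts underscores inside the word.
with_w = re.findall(r"\b[A-Za-z]\w*[0-9]\b", text)

# Letters and digits only: foo_bar2 is rejected.
alnum_only = re.findall(r"\b[A-Za-z][A-Za-z0-9]*[0-9]\b", text)

print(with_w)
print(alnum_only)
```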
I have this piece of text from which I want to remove both occurrences of each of the names, "Remggrehte Sertrro" and "Perrhhfson Forrtdd". I tried applying this regex: ([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+) but it identifies "Remggrehte Sertrro We", "Perrhhfson Forrtdd If" and also "Mash Mush" which is inside the text.
Basically I want it to only identify first two capitalized words at the beginning of the line without touching the rest. I am no regex expert and I am not sure how to adapt it.
This is the text:
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
Thanks in advance.
You can use the pattern /^([A-Z]+.*? ){2}/m if you are certain that there are always exactly two capitalised terms and that they are always the first two terms on the line. Example working on regex101.com
You don't need the positive lookahead to match the first 2 capitalized words.
In your pattern, the part (?=\s[A-Z]) can be omitted, because you first assert it and then directly match the same thing.
You could match the first 2 words without a capturing group and assert a whitespace boundary (?!\S) at the right
^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)
Explanation
^ Start of the string (or of each line when the multiline flag is set)
[A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
[^\S\r\n] Match a whitespace char except a newline as \s could also match a newline and you want to match two consecutive capitalized words at the beginning of the line
[A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
(?!\S) Assert a whitespace boundary at the right
Note that [A-Z][a-z]+ matches only the chars A-Z and a-z. To match any word character you could use \w instead of [a-z].
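In Python the pattern needs the re.MULTILINE flag so that ^ applies to every line. A short sketch on the text from the question, listing what the pattern identifies:

```python
import re

text = """\
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""

# Two capitalized words at the start of a line, separated by one
# non-newline whitespace character.
pattern = re.compile(r"^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)", re.MULTILINE)

matches = pattern.findall(text)
print(matches)
```

"Mash Mush" is not reported because it does not start a line, and "We" and "If" are left alone because the pattern stops after the second word.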
You can remove the lines that contain only the names by using the re.MULTILINE flag and the following regex: r"^[A-Z]\w+[ \t]+[A-Z]\w+[ \t]*\n". This regex matches the two names only if they fill the whole line, and it consumes that line's newline so that no empty line is left behind.
Here is a demo:
import re
text = """\
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""
# Remove every line that consists of exactly two capitalized words.
print(re.sub(r"^[A-Z]\w+[ \t]+[A-Z]\w+[ \t]*\n", "", text, flags=re.MULTILINE))
You get:
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
I am trying to match the word that appears immediately after a number - in the sentence below, it is the word "meters".
The tower is 100 meters tall.
Here's the pattern that I tried which didn't work:
\d+\s*(\b.+\b)
But this one did:
\d+\s*(\w+)
The first, incorrect pattern captured "meters tall" in this sentence:
The tower is 100 meters tall.
I didn't want the word "tall" to be matched. I expected the following behavior:
\d+ match one or more occurrence of a digit
\s* match any or no spaces
( start new capturing group
\b find the word/non-word boundary
.+ match 1 or more of everything except new line
\b find the next word/non-word boundary
) stop capturing group
The problem is I don't know tiddly-twat about regex, and I am very much a noob as a noob can be. I am practicing by making my own problems and trying to solve them - this is one of them. Why didn't the match stop at the second break (\b)?
This is Python flavored
It didn't stop because + is greedy by default, you want +? for a non-greedy match.
A concise explanation — * and + are greedy quantifiers/operators meaning they will match as much as they can and still allow the remainder of the regular expression to match.
You need to follow these operators with ? for a non-greedy match: *? still means "zero or more" and +? still means "one or more", but in both cases as few as possible.
Also a word boundary \b matches positions where one side is a word character (letter, digit or underscore OR a unicode letter, digit or underscore in Python 3) and the other side is not a word character. I wouldn't use \b around the . if you're unclear what's in between the boundaries.
It matches both words because . matches (nearly) all characters, including the space character, and because + is greedy, so it matches as much as it can. If you use \w instead of . it works, because \w matches only word characters (a-zA-Z_0-9).
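The three variants discussed in this thread can be compared side by side in Python. A minimal sketch using the sentence from the question:

```python
import re

sentence = "The tower is 100 meters tall."

# Greedy .+ grabs as much as it can and backs off only to the last \b.
greedy = re.search(r"\d+\s*(\b.+\b)", sentence).group(1)

# Lazy .+? stops at the first \b it reaches.
lazy = re.search(r"\d+\s*(\b.+?\b)", sentence).group(1)

# \w+ cannot cross the space at all.
word = re.search(r"\d+\s*(\w+)", sentence).group(1)

print(repr(greedy), repr(lazy), repr(word))
```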
I want to match a set of patterns at "word boundary", but the patterns may have a prefix [##] which should get matched if present.
I'm using following regex pattern in python.
r"\b[##]?(abc|ef|ghij)\b"
Sample text is : #abc is a pattern which should match. also abc should match. And finally #ef
In this text only abc, abc and ef are matched, without the prefix, and not #abc and #ef as I want.
You need to put the word boundary after [##], which you made optional. In the #abc part, the position before # lies between the start of the line and # (a non-word character), so it is a non-word boundary \B, not a word boundary \b. Note that \b matches between a word character and a non-word character, or vice versa, while \B matches between two word characters or two non-word characters.
r"[##]?\b(abc|ef|ghij)\b"
If you put \b before [##], it would match strings like foo#abc or bar#abc, because there a word boundary does exist before the prefix character.
Example:
>>> s = "#abc is a pattern which should match. also abc should match. And finally #ef"
>>> re.findall(r'[##]?\b(?:abc|ef|ghij)\b', s)
['#abc', 'abc', '#ef']
#abc
^ ^
\B \b
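The effect of moving \b can be checked directly. The sketch below uses only # as the prefix character, which is an assumption based on the sample text; the strings foo#abc and #abc are the ones mentioned above.

```python
import re

s = "#abc is a pattern which should match. also abc should match. And finally #ef"

# \b first: the boundary sits between "#" and the letter, so "#" is left out.
boundary_first = re.findall(r"\b#?(?:abc|ef|ghij)\b", s)

# Optional prefix first, then \b: "#" becomes part of the match.
prefix_first = re.findall(r"#?\b(?:abc|ef|ghij)\b", s)

print(boundary_first)
print(prefix_first)

# A mandatory "#" after \b only matches when a word character precedes it.
glued = re.search(r"\b#abc\b", "foo#abc")
alone = re.search(r"\b#abc\b", "#abc")
print(glued.start() if glued else None, alone)
```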
The group (##)? is saying that the word may begin with "##". What you are looking for is [##]? which is saying the first character is # or #, but it is not required. If you need the match to be part of a group you could use (#|#)?.
I will also throw in my version of the fixed regex without capturing group (since you do not seem to be using them):
r'[##]?\b(?:abc|ef|ghij)\b'
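EXPLANATION: [##] are non-word characters and are optional due to ?. \b is not optional, and it is always zero-width. With \b placed first, the only position in #abc where it can match is between the prefix character and the following letter, so the prefix is left out of the match; with \b placed after [##]?, the prefix is matched first and becomes part of the result.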
Here are more details on \b from Regular-Expressions.info:
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
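These three cases can be listed for any string by collecting the zero-width matches of \b. The helper name boundary_positions is made up for this sketch.

```python
import re

def boundary_positions(s):
    # Each match of \b is empty: it marks a position, not a character.
    return [m.start() for m in re.finditer(r"\b", s)]

positions = boundary_positions("#abc ef")
no_word_chars = boundary_positions("##")

print(positions)
print(no_word_chars)
```

Position 0 is missing from the first result because the string starts with #, which is not a word character; the last position is included because the string ends with a word character.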