How to find all words that end with a colon (:) using regex - python

I am new to regex. Given the following text, I want to find each word, or run of consecutive words, that ends with a colon (:) using a regular expression.
Incarcerate: imprison or confine, Strike down: if someone is struck down, especially by an illness, they are killed or severely harmed by it, Accost: approach and address.
The output should be: Incarcerate:, Strike down:, Accost:. I have written the following regex, but it does not capture all of them.
My regex -> (\w+):+
It captures words like Incarcerate: and Accost:, but it does not capture Strike down:
Please help me.
I want to do it in both typescript and python.

You can optionally repeat a space and 1+ word chars. Note that the words are in group 1, and the : is outside of the group.
(\w+(?: \w+)*):
Regex demo
To include the : in the match:
\w+(?: \w+)*:
The pattern matches
\w+ Match 1 or more word characters
(?: \w+)* Repeat 0+ times matching a space and 1+ word characters
: Match a single :
Regex demo
Example in Python
import re
s = "Incarcerate: imprison or confine, Strike down: if someone is struck down, especially by an illness, they are killed or severely harmed by it, Accost: approach and address."
pattern = r"\w+(?: \w+)*:"
print(re.findall(pattern, s))
Output
['Incarcerate:', 'Strike down:', 'Accost:']
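If only the words are needed, without the trailing colon, the capturing-group variant can be used in the same way. A minimal sketch, relying on the fact that re.findall returns the contents of the group when the pattern has exactly one capturing group (the variable name words is just illustrative):

```python
import re

s = ("Incarcerate: imprison or confine, Strike down: if someone is struck down, "
     "especially by an illness, they are killed or severely harmed by it, "
     "Accost: approach and address.")

# The colon sits outside group 1, so findall returns only the words.
words = re.findall(r"(\w+(?: \w+)*):", s)
print(words)
```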

Related

Getting a correct regex for word starting and ending with different letters

I am quite new to regex and right now I have a problem formulating a regex to match a string whose first and last letters are different. I looked on the internet and found a regex that does just the opposite, i.e. it matches words that have the same starting and ending letter. Can anyone please help me understand whether I can negate this regex in some way, or create a new regex that matches my requirements? The regex that needs to be modified or changed is:
^\s|^[a-z]$|^([a-z]).*\1$
This matches these Strings :
aba,
a,
b,
c,
d,
" ",
cccbbbbbbac,
aaaaba
But I want it to match strings like:
aaabbcz,
zba,
ccb,
cbbbba
Can anyone please help me in this regard? Thank you.
Note: I will be using this with Python regex, so the regex should be compatible with Python.
You don't need a regex for this, just use
s[0] != s[-1]
where s is your string. If you must use a regex, you can use this:
^(.).*(?!\1).$
This looks for
^ : beginning of string
(.) : a character (captured in group 1)
.* : some number of characters
(?!\1). : a character which is not the character captured in group 1
$ : end of string
Regex demo on regex101
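To check both approaches against the strings from the question, a small sketch like the following could be used. The helper name differs and the two sample lists are illustrative and not part of the original answer; the empty-string guard is an added assumption, since s[0] would raise IndexError on an empty string.

```python
import re

pattern = re.compile(r"^(.).*(?!\1).$")

def differs(s):
    """Plain-Python check: first and last characters differ."""
    return bool(s) and s[0] != s[-1]

should_match = ["aaabbcz", "zba", "ccb", "cbbbba"]
should_not_match = ["aba", "a", "b", " ", "cccbbbbbbac", "aaaaba"]

# Print the regex result next to the plain-Python result for each sample.
for s in should_match + should_not_match:
    print(repr(s), bool(pattern.search(s)), differs(s))
```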
This part of your pattern ^([a-z]).*\1$ only accounts for chars a-z, but you also want to exclude " "
You can rewrite that pattern by putting the part after the capture group inside a negative lookahead.
^(.)(?!.*\1$).+
^ Start of string
(.) Capture a single char (including spaces) in group 1
(?!.*\1$) Negative lookahead, assert that the string does not end with the same character
.+ Match 1+ characters so that the string has a minimum of 2 characters
See a regex demo.
If the string should start and end with a non-whitespace character to prevent leading / trailing spaces, you can start the match with a non-whitespace character \S and also end the match with a non-whitespace character.
^(\S)(?!.*\1$).*\S$
See another regex demo.
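The difference between the two lookahead patterns shows up on input with leading or trailing spaces. A rough sketch (the names any_char and non_space and the sample strings are made up for illustration):

```python
import re

any_char = re.compile(r"^(.)(?!.*\1$).+")
non_space = re.compile(r"^(\S)(?!.*\1$).*\S$")

samples = ["zba", "aba", "a", " ab", "ab "]

# Map each sample to (matches any_char, matches non_space).
results = {s: (bool(any_char.search(s)), bool(non_space.search(s))) for s in samples}
for s, flags in results.items():
    print(repr(s), flags)
```

Strings such as " ab" and "ab " are accepted by the first pattern but rejected by the second, because \S refuses a space at either end.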

Python regex to identify two consecutive capitalized words at the beginning of the line

I have this piece of text from which I want to remove both occurrences of each of the names, "Remggrehte Sertrro" and "Perrhhfson Forrtdd". I tried applying this regex: ([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+) but it identifies "Remggrehte Sertrro We", "Perrhhfson Forrtdd If" and also "Mash Mush" which is inside the text.
Basically I want it to only identify first two capitalized words at the beginning of the line without touching the rest. I am no regex expert and I am not sure how to adapt it.
This is the text:
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
Thanks in advance.
You can use this pattern /^([A-Z]+.*? ){2}/m if you are always certain that you are getting only two terms with capitalised first letters and always in the first two terms inline. Example working on regex101.com
You don't need the positive lookahead to match the first 2 capitalized words.
In your pattern, the part (?=\s[A-Z]) can be omitted, as you first assert it and then directly match it.
You could match the first 2 words without a capturing group and assert a whitespace boundary (?!\S) at the right
^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)
Explanation
^ Start of string
[A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
[^\S\r\n] Match a whitespace char except a newline as \s could also match a newline and you want to match two consecutive capitalized words at the beginning of the line
[A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
(?!\S) Assert a whitespace boundary at the right
Regex demo
Note that [A-Z][a-z]+ matches only chars a-z. To match word characters you could use \w instead of [a-z] only.
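To actually remove the two leading names, this pattern could be passed to re.sub together with re.MULTILINE, so that ^ matches at the start of every line rather than only at the start of the string. This is a sketch using the text from the question; the variable names are illustrative, and the leftover leading space on the longer lines is left as-is here.

```python
import re

text = """\
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""

# re.MULTILINE makes ^ match at the beginning of each line.
pattern = re.compile(r"^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)", re.MULTILINE)

# Show what is matched at the start of each line.
names = pattern.findall(text)
print(names)

# Remove the matches; "Mash Mush" stays because it is not at a line start.
cleaned = pattern.sub("", text)
print(cleaned)
```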
You can remove the lines that contain only the names using the re.MULTILINE flag and the following regex: r"^[A-Z]\w+[^\S\n]+[A-Z]\w+[^\S\n]*\n". This regex matches a name only if nothing else follows it on the line, and it also consumes the line's newline so that no empty line is left behind. [^\S\n] matches any whitespace except a newline, so the match cannot run into the next line.
Here is a demo:
import re
text = """\
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""
# Delete every line that consists of exactly two capitalized words.
print(re.sub(r"^[A-Z]\w+[^\S\n]+[A-Z]\w+[^\S\n]*\n", "", text, flags=re.MULTILINE))
You get:
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.

Regex match for non hyphenated words

I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain at least 1 number
The expressions that I tried are:
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.
Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx 101 Explanation
/(?!.*-)(?=.*\d)^.+$/gm
Negative Lookahead (?!.*-)
Assert that the Regex below does not match
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
^ asserts position at start of a line
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
I came up with -
^[^-]*\d[^-]*$
so we need at LEAST one digit (\d)
We need the rest of the string to contain anything BUT a - ([^-])
We can have unlimited number of those characters, so [^-]*
but putting them together like [^-]*\d would wrongly accept aaa3-, because the - comes after a valid match. Let's make sure no dashes can sneak in before or after our match: ^[^-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again: ^[^-]*\d[^-]*$, which says: start, any number of chars that aren't dashes, a digit, any number of chars that aren't dashes, end.
Depending on style, we could also do ^([^-]*\d)+[^-]*$: since the order of the digits and the other characters doesn't matter, we can have as many of those groups as we want, followed by any remaining non-dash characters.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if ("-" not in text) and re.search(r"\d", text):
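As a quick sanity check, the regex ^[^-]*\d[^-]*$ and the plain-Python condition can be run side by side on the sample words from the question. This is only a sketch; the names NO_HYPHEN_WITH_DIGIT and is_valid are hypothetical.

```python
import re

NO_HYPHEN_WITH_DIGIT = re.compile(r"^[^-]*\d[^-]*$")

def is_valid(text):
    """Plain-Python version: no hyphen and at least one digit."""
    return "-" not in text and re.search(r"\d", text) is not None

samples = ["1DRIVE", "ID100", "W1RELESS", "STACK", "OVERFLOW", "Test-11", "24-hours"]

# Print the regex result next to the plain-Python result for each word.
for word in samples:
    print(word, bool(NO_HYPHEN_WITH_DIGIT.search(word)), is_valid(word))
```

Both checks should agree on every sample, which makes the plain-Python form a reasonable drop-in when readability matters more than having a single pattern.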

Regex condition after and before a known phrase

I am trying to capture a phrase that starts with a Capital letter between 2 known phrase. Let's say between "Known phrase, " and the word "The".
For example in the text below, the phrase I'm trying to capture is: Stuff TO CApture That always start with Capital letter but stop capturing when
Ignore Words Known phrase, ignore random phrase Stuff TO CApture That always start with Capital letter but stop capturing when The appears.
Regex I have tried: (?<=Known phrase, ).*(?= The) and Known phrase, (.*) The
These regex also captures ignore random phrase. How do I ignore this?
For your example data, you might use:
Known phrase, [a-z ]+([A-Z].*?) The
See the regex demo
Explanation
Known phrase, Match literally
[a-z ]+ Match 1+ times a lowercase character or a space (Add to the character class what you would allow to match except an Uppercase character)
([A-Z].*?) Capture in a group matching an uppercase character followed by 0+ times any character except a newline.
The Match literally
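Applied in Python to the sentence from the question, the pattern could be used roughly like this (a sketch; the variable names text and match are illustrative):

```python
import re

text = ("Ignore Words Known phrase, ignore random phrase Stuff TO CApture That "
        "always start with Capital letter but stop capturing when The appears.")

# [a-z ]+ skips the lowercase filler; group 1 starts at the first uppercase
# letter and lazily extends up to the first " The".
match = re.search(r"Known phrase, [a-z ]+([A-Z].*?) The", text)
if match:
    print(match.group(1))
```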
I guess that, since a regular expression is matched from the left and its quantifiers are greedy, you should first match everything that is not a capital letter.
Something like /Start[^A-Z]*(.*)stop/
([^A-Z] matches anything that is not capital letter)
regex101 demo
I'm not sure of what you are trying to do, but, trying to stick with your code, (?<=Known phrase, )([^A-Z]*)(.*)(?=The) should do the trick: the text you need is in the group 2.
If you need to match everything just change to (.*)(?<=Known phrase, )([^A-Z]*)(.*)(?=The)(.*) and get your text in group 3.

Two word boundaries (\b) to isolate a single word

I am trying to match the word that appears immediately after a number - in the sentence below, it is the word "meters".
The tower is 100 meters tall.
Here's the pattern that I tried which didn't work:
\d+\s*(\b.+\b)
But this one did:
\d+\s*(\w+)
The first incorrect pattern matched "100 meters tall" (capturing "meters tall") in:
The tower is 100 meters tall.
I didn't want the word "tall" to be matched. I expected the following behavior:
\d+ match one or more occurrence of a digit
\s* match any or no spaces
( start new capturing group
\b find the word/non-word boundary
.+ match 1 or more of everything except new line
\b find the next word/non-word boundary
) stop capturing group
The problem is I don't know tiddly-twat about regex, and I am very much a noob as a noob can be. I am practicing by making my own problems and trying to solve them - this is one of them. Why didn't the match stop at the second break (\b)?
This is Python flavored
Here's the regex101 test link of the above regex.
It didn't stop because + is greedy by default, you want +? for a non-greedy match.
A concise explanation — * and + are greedy quantifiers/operators meaning they will match as much as they can and still allow the remainder of the regular expression to match.
You need to follow these operators with ? for a non-greedy match: *? means "zero or more" and +? means "one or more", and in both cases "as few as possible".
Also a word boundary \b matches positions where one side is a word character (letter, digit or underscore OR a unicode letter, digit or underscore in Python 3) and the other side is not a word character. I wouldn't use \b around the . if you're unclear what's in between the boundaries.
It matches both words because . matches (nearly) all characters, including the space character, and because + is greedy, so it matches as much as it can. If you use \w instead of . it works (because \w matches only word characters: a-zA-Z0-9_).
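The three behaviours discussed in these answers can be compared directly on the sentence from the question. A small sketch (the variable names greedy, lazy and word_chars are illustrative):

```python
import re

sentence = "The tower is 100 meters tall."

# Greedy .+ runs to the end of the line and backtracks to the last word boundary.
greedy = re.search(r"\d+\s*(\b.+\b)", sentence)
# Lazy .+? stops at the first word boundary it can reach.
lazy = re.search(r"\d+\s*(\b.+?\b)", sentence)
# \w+ cannot cross the space, so it stops after the first word.
word_chars = re.search(r"\d+\s*(\w+)", sentence)

print(greedy.group(1))
print(lazy.group(1))
print(word_chars.group(1))
```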
