Two word boundaries (\b) to isolate a single word - python

I am trying to match the word that appears immediately after a number - in the sentence below, it is the word "meters".
The tower is 100 meters tall.
Here's the pattern that I tried which didn't work:
\d+\s*(\b.+\b)
But this one did:
\d+\s*(\w+)
The first incorrect pattern matched this:
The tower is 100 meters tall.
I didn't want the word "tall" to be matched. I expected the following behavior:
\d+ match one or more occurrence of a digit
\s* match any or no spaces
( start new capturing group
\b find the word/non-word boundary
.+ match 1 or more of everything except new line
\b find the next word/non-word boundary
) stop capturing group
The problem is I don't know tiddly-twat about regex, and I am very much a noob as a noob can be. I am practicing by making my own problems and trying to solve them - this is one of them. Why didn't the match stop at the second break (\b)?
This is Python flavored
Here's the regex101 test link of the above regex.

It didn't stop because + is greedy by default, you want +? for a non-greedy match.
A concise explanation — * and + are greedy quantifiers/operators meaning they will match as much as they can and still allow the remainder of the regular expression to match.
You need to follow these operators with ? for a non-greedy match, going in the above order it would be (*?) "zero or more" or (+?) "one or more" — but preferably "as few as possible".
Also a word boundary \b matches positions where one side is a word character (letter, digit or underscore OR a unicode letter, digit or underscore in Python 3) and the other side is not a word character. I wouldn't use \b around the . if you're unclear what's in between the boundaries.

It match both words because . match (nearly) all characters, so also space character, and because + is greedy, so it will match as much as it could. If you would use \w instead of . it would work (because \w match only word characters - a-zA-Z_0-9).

Related

Regex that captures a group with a positive lookahead but doesn't match a pattern

Using regex (Python) I want to capture a group \d-.+? that is immediately followed by another pattern \sLEFT|\sRIGHT|\sUP.
Here is my test set (from http://nflsavant.com/about.php):
(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).
(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).
(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).
(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).
My best attempt is (\d*-.+?)(?=\sLEFT|\sRIGHT|\sUP), which works unless other characters appear between a matching capture group and my positive lookahead. In the second line of my test set this expression captures "69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK." instead of the desired "33-D.COOK".
My inputs are also saved on regex101, here: https://regex101.com/r/tEyuiJ/1
How can I modify (or completely rewrite) my regex to only capture the group immediately followed by my exact positive lookahead with no extra characters between?
To prevent skipping over digits, use \D non-digit (upper is negated \d).
\b(\d+-\D+?)\s(?:LEFT|RIGHT|UP)
See this demo at regex101
Further added a word boundary and changed the lookahead to a group.
If you want a capture group without any lookarounds:
\b(\d+-\S*)\s(?:LEFT|RIGHT|UP)\b
Explanation
\b A word boundary to prevent a partial word match
(\d+-\S*) Capture group 1, match 1+ digits - and optional non whitespace characters
\s Match a single whitespace character
(?:LEFT|RIGHT|UP) Match any of the alternatives
\b A word boundary
See the capture group value on regex101.
This is why you should be careful about using . to match anything and everything unless it's absolutely necessary. From the example you provided, it appears that what you're actually wanting to capture contains no spaces, thus we could utilize a negative character class [^\s] or alternatively more precisely [\w.], with either case using a * quantifier.
Your end result would look like "(\d*-[\w.]*)(?=\sLEFT|\sRIGHT|\sUP)"gm. And of course, when . is within the character class it's treated as a literal string - so it's not required to be escaped.
See it live at regex101.com
Try this:
\b\d+-[^\r \n]+(?= +(?:LEFT|RIGHT|UP)\b)
\b\d+-[^\r \n]+
\b word boundary to ignore things like foo30-J.RICHARD
\d+ match one or more digit.
- match a literal -.
[^\r \n]+ match on or more character except \r, \n and a literal space . Excluding \r and \n helps us not to cross newlines, and that is why \s is not used(i.e., it matches \r and \n too)
(?= +(?:LEFT|RIGHT|UP)\b) Using positive lookahead.
+ Ensure there is one or more literal space .
(?:LEFT|RIGHT|UP)\b using non-caputring group, ensure our previous space followed by one of these words LEFT, RIGHT or UP. \b word boundary to ignore things like RIGHTfoo or LEFTbar.
See regex demo

Regex match for non hyphenated words

I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain atleast 1 number
The expressions that I tried are:=
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.
Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
RegEx 101 Explanation
/
(?!.*-)(?=.*\d)^.+$
/
gm
Negative Lookahead (?!.*-)
Assert that the Regex below does not match
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
^ asserts position at start of a line
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
I came up with -
^[^-]*\d[^-]*$
so we need at LEAST one digit (\d)
We need the rest of the string to contain anything BUT a - ([^-])
We can have unlimited number of those characters, so [^-]*
but putting them together like [^-]*\d would fail on aaa3- because the - comes after a valid match- lets make sure no dashes can sneak in before or after our match ^[-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again- ^[^-]*\d[^-]$ --- which says start - any number of chars that aren't dashes - a digit - any number of chars that aren't dashes - end
Depending on style, we could also do ^([^-]*\d)+$ since the order of the digits/numbers dont matter, we can have as many of those as we want.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if ("-" not in text) and re.search("\d", text):

Python regex to match words not having dot

I want to accept only those strings having the pattern 'wild.flower', 'pink.flower',...i.e any word preceding '.flower', but the word should not contain dot. For example, "pink.blue.flower" is unacceptable. Can anyone help how to do this in python using regex?
You are looking for "^\w+\.flower$".
Is this sufficient?
^\w+\.\w+$
Here is the regex for you. ^([^\.]*)\.flower$.
Example: https://regex101.com/r/cSL445/1.
Your case of pink.blue.flower is unclear. There are 2 possibilities:
Match only blue (cut off preceding dot and what was before).
Reject this case altogether (you want to match a word preceding .flower
only if it is not preceded with a dot).
In the first case accept other answers.
But if you want the second solution, use: \b(?<!\.)[a-z]+(?=\.flower).
Description:
\b - Start from a word boundary (but it allows the "after a dot" case).
(?<!\.) - Negative lookbehind - exclude the "after a dot" case.
[a-z]+ - Match a sequence of letters.
(?=\.flower) - Positive lookahead for .flower.
I assumed that you have only lower case letters, but if it is not the case,
then add i (case insensitive) option.
Another remark: Other answers include \w, which matches also digits and
_ or even [^\.] - any char other than a dot (including e.g. \n).
Are you happy with that? If you aren't, change to [a-z] (again, maybe
with i option).
To match any character except a newline or a dot you could use a negated character class [^.\r\n]+ and repeat that one or more times and use anchors to assert the start ^ and the end $ of the line.
^[^.\r\n]+\.flower$
Or you could specify in a character class which characters you would allow to match followed by a dot \. and flower.
^[a-z0-9]+\.flower$

regular expression pattern match excluding a list of cases

I wanna match the street name pattern which consists of several capital case words excluding some cases but I do not know how to do it.
The pattern is "([A-Z][a-z]+ {1,3})" (Let's assume the name of a street consists of 1-3 words) and a short version block list is ["Apt","West","East"] which denotes either direction or room number.
Any word that is in the list("West" for example) should not be in the match result. Words starting with those words in block list however("Westmoreland" for example), should be in the result. How am i gonna write this regular expression?
You may use
\b(?!(?:Apt|West|East)\b)[A-Z][a-z]+(?: (?!(?:Apt|West|East)\b)[A-Z][a-z]+){0,2}
See the regex demo
What I did:
Fixed your regex to actually match 1 to 3 words: [A-Z][a-z]+(?: [A-Z][a-z]+){0,2}
Added negative lookaheads to restrict the values matched by [A-Z][a-z]+ parts.
Expression details:
\b(?!(?:Apt|West|East)\b)[A-Z][a-z]+ - a capital ASCII letter ([A-Z]) followed with 1+ ASCII lowercase letters ([a-z] but I guess you can also use [a-zA-Z]+ or [a-zA-Z]* here) that are not a whole word Apt, West or East that is made possible with the negative lookahead anchored at the \b word boundary. The first \b is a leading word boundary, and then the negative lookahead makes sure there are no Apt, West or East right after the word boundary, and before a trailing \b word boundary (ensuring a whole word match)
(?: (?!(?:Apt|West|East)\b)[A-Z][a-z]+){0,2} - 0 to 2 occurrences of:
- a space
(?!(?:Apt|West|East)\b)[A-Z][a-z]+ - see above. You do not need a leading word boundary here as the Apt, West or East can only appear after a space here, which is a non-word char.
A lot of people would post a shorter solution like
(?: ?\b(?!(?:Apt|West|East)\b)[A-Z][a-z]+){1,3}
See the demo
However, the optional space at the start would also match this leading space. Morever, the regex does not match linearly now, and that affects performance. With small strings, it is OK, but still it is bad practice.

Regular expressions with \b and non-word characters (like '.')

Why does this regular expression:
r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$'
does not match J. F. Kennedy?
I have to remove \b in groups first_init and mid_init to match the words.
I am using Python. And for testing i am using https://regex101.com/
Thanks
You are over-applying the \b word breaks.
\b will only match if on one side there is a valid "word" character and on the other side not. Now you use this construction twice:
\b\w\.\b\s
.. and, rightly so, it does not match because on the left side you have a not-word character (a single full stop) and on the other side you also have a not-word character (a space).
Removing the \b between the full stop and \s is enough to make it work.
\b matches the empty string only at the beginning or end of a word. A word is a sequence of alphanumeric or underscore characters. The dot (.) cannot comprise part of the word.
>>> import re
# does not match when \. is within word boundary
>>> re.match(r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy')
# matches when \b is moved to left of \.
>>> re.match(r'^(?P<first_init>\b\w\b\.)\s(?P<mid_init>\b\w\b\.)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy') # matches
The . is not part of the word in this sense. See the docs here.
It does not match because of the \. (dot) character. A word boundary does not include the dot (it is not the same definition of word you perhaps would like). You can easily rewrite it without the need of \b. Read the documentation carefully.
Just remove the second boundary:
^(?P<first_init>\b\w\.)\s
(?P<mid_init>\b\w\.)\s
(?P<last_name>\b\w+\b)$
And see a demo on regex101.com.
Background is that the second \b is between a dot and a space, so it fails (remember that one of the sides needs to be a word character, ie one of a-zA-Z0-9_)
\b means border of a word.
Word here is defined like so:
A word ends, when there is a space character following it.
"J.", "F." and "Kennedy" are the words here.
You're example is trying to search for a space between the letter and the dot and it is searching for J . F . Kennedy.

Categories

Resources