Python Beginner Regular Expressions [duplicate] - python

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
What is the use of the = in the regex (?=.*?[A-Z]), and why are the ? and * in front of the [A-Z]? I saw in a book that they should appear after the word or expression they take effect on. And why are there two ?s?

This whole RegEx
(?=.*?[A-Z])
is called a lookahead assertion, which is one kind of lookaround.
It consists of three items:
(?= )
.*?
[A-Z]
The first one is the syntax for a lookahead assertion. The pattern to look for goes inside the parentheses, after the initial ?=.
The second one is a dot, which matches any character (except a newline by default), with the repetition modifier *?, where the asterisk means "zero or more matches" and the question mark means "match as few as possible" instead of being greedy.
The third one is a character class that matches any single uppercase letter from A to Z, which I suppose you already know.
A lookaround assertion restricts what surrounds a pattern without consuming (matching or capturing) any extra characters. For example:
a(?=b)
will match the letter a in ab, but not in ac. Note that it only matches the letter a; the letter b is only a restriction on where the letter a may be matched. By contrast, a(b) matches both letters in ab and captures the latter.
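The difference between a lookahead and a capturing group can be checked with a minimal sketch using Python's standard re module. The test strings "abcDef" and "abcdef" and the variable names are made up for illustration and are not from the question.

```python
import re

# a(?=b): match "a" only when it is immediately followed by "b".
m = re.search(r"a(?=b)", "ab")
print(m.group(), m.span())        # the match is just "a"; "b" is not consumed

no_match = re.search(r"a(?=b)", "ac")
print(no_match)                   # None: here "a" is followed by "c"

# a(b): "b" is part of the match and is captured as group 1.
m2 = re.search(r"a(b)", "ab")
print(m2.group(), m2.group(1))

# (?=.*?[A-Z]) tried at the start of the string: it succeeds if an uppercase
# letter appears anywhere ahead, but the match itself is empty (zero width).
has_upper = re.compile(r"(?=.*?[A-Z])")
with_upper = has_upper.match("abcDef")
without_upper = has_upper.match("abcdef")
print(with_upper.span(), without_upper)
```

Because the lookahead consumes nothing, it is typically combined with other patterns that do the actual matching from the same position.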

Related

How can I use \b boundary around special characters [duplicate]

This question already has answers here:
Word boundary with words starting or ending with special characters gives unexpected results
(2 answers)
What is a word boundary in regex?
(13 answers)
Closed 2 years ago.
\b✅\b does not match a single emoji: '✅'.
\b\u2B07\b does not match '⬇️'.
\b-\b does not match '-'.
\bfoo\b certainly matches 'foo'.
Why does that happen, and what's an alternative to ensure my emoji or any special character is not in the middle of a string?
playground: https://regex101.com/r/jRaQuJ/2
Edit: For the record, I think this question is still useful even if it is somewhat of a duplicate. The first duplicate is a specific and verbose question, while this one is simple, short and easy to find. The second duplicate is just the definition of the \b boundary, and someone with my problem would probably need something more specific.
You can use the pattern:
(?<!\w)✅(?!\w)
This uses negative lookarounds to match an emoji with no word characters on either side.
The reason for the results you asked about is that \b is a zero-width boundary where one side of the boundary is \w (a word character: [0-9A-Za-z_], plus Unicode letters and digits for str patterns in Python 3) and the other side is the beginning or end of the string or \W (a non-word character).
For example, consider the string "foo.":
start of string boundary (zero width)
|
|   non-word character
|   |
v   v
 foo.
 ^ ^
 | |
 word characters
The regex \bfoo\b finds a match here thanks to the boundary between the beginning of the string and the character f, and the boundary between the last o and the . character.
"foobar" does not match \bfoo\b because the second o and b don't satisfy the boundary condition, that is, b isn't a non-word character or the end of the string.
The pattern \b-\b does not match the string "-" because "-" isn't a word character, so there is no word character on either side of it to form a boundary. Likewise, emojis are non-word characters, so they don't form boundaries with the start or end of the string the way the word characters in \bfoo\b do.
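The boundary behaviour described above can be reproduced with a short sketch using Python's standard re module. The sample strings "a-b", "done ✅ ok" and "a✅b" and the variable names are assumptions added for illustration.

```python
import re

# \b requires a word character on exactly one side of the position.
foo_dot = re.search(r"\bfoo\b", "foo.")      # matches "foo"
foo_bar = re.search(r"\bfoo\b", "foobar")    # None: "o" and "b" are both word characters
dash_alone = re.search(r"\b-\b", "-")        # None: no word character next to "-"
dash_inside = re.search(r"\b-\b", "a-b")     # matches "-": a word character on each side
emoji_boundary = re.search(r"\b✅\b", "✅")   # None: the emoji is not a word character

# Negative lookarounds only require that no word character is adjacent.
emoji = re.compile(r"(?<!\w)✅(?!\w)")
standalone = emoji.search("✅")
spaced = emoji.search("done ✅ ok")
embedded = emoji.search("a✅b")               # None: word characters on both sides

print(foo_dot, foo_bar, dash_alone, dash_inside, emoji_boundary)
print(standalone, spaced, embedded)
```

Note that \b-\b does match inside "a-b", which shows that \b is about the neighbours of the character, not about the character being "special".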

Question about ".*" in match regex in Python [duplicate]

This question already has answers here:
Regular Expressions- Match Anything
(17 answers)
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 2 years ago.
Following is a simple piece of code about regex match:
import re
pattern = ".*"
s = "ab"
print(re.search(pattern, s))
output:
<_sre.SRE_Match object; span=(0, 2), match='ab'>
My confusion is that "." matches any single character, so here it can match "a" or "b". With a "*" after it, I would expect this combination to match "", "a", "aa", "aaa...", "b", "bb", "bbb...", or any other single character repeated several times.
But how come ".*" matches "ab" as a whole?
The comments more or less covered it, but to provide an answer: the pattern .* means to match any character (.) zero or more times (*). By default, quantifiers are greedy, so when presented with 'ab', even though '' or 'a' would also satisfy that rule, it will match the entire string, since matching all of it still meets the requirement.
It does not mean to match the same character zero or more times. Every character it matches can be a different character or the same as a previously matched one.
If instead you want to match any character, but match as many of that same character as possible, zero or more times, you can use:
(.)?\1*
See here https://regex101.com/r/FgvuX2/1 and here https://regex101.com/r/FgvuX2/2
What this effectively does is optionally match a single character, creating a backreference that can be used in the second part of the expression. Thus it captures any single character (if there is one) in group 1 and then matches that same character zero or more times, greedily.
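A small sketch with Python's standard re module makes the contrast concrete. The input "aaab" and the variable names are made up for illustration.

```python
import re

s = "ab"

# Each repetition of "." may match a different character, and * is greedy.
greedy = re.search(r".*", s).group()

# The lazy form *? stops as early as possible, here after zero characters.
lazy = re.search(r".*?", s).group()

# (.)?\1* captures one character and then repeats that same character.
same_run = re.search(r"(.)?\1*", "aaab").group()
same_ab = re.search(r"(.)?\1*", s).group()

print(repr(greedy), repr(lazy), repr(same_run), repr(same_ab))
```

Here the backreference pattern stops at the first character that differs from the captured one, while plain .* runs to the end of the line.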

regex lookahead assertion [duplicate]

This question already has answers here:
Regex plus vs star difference? [duplicate]
(9 answers)
Closed 5 years ago.
I'm new to python regex and am learning the lookahead assertion.
I found the following strange. Could someone tell me how it works?
import regex as re
re.search('(\d*)(?<=a)(\.)','1a.')
<regex.Match object; span=(2, 3), match='.'>
re.search('(\d+)(?<=a)(\.)','1a.')
outputs nothing
Why doesn't the second one match anything?
The first pattern:
re.search('(\d*)(?<=a)(\.)', '1a.')
says to find zero or more digits, followed by a dot. Right before the dot, it has a positive lookbehind, which asserts that the previous character was an a. In this case, Python will match zero digits, followed by a single dot. The lookbehind succeeds because the preceding character was in fact an a.
However, the second pattern:
re.search('(\d+)(?<=a)(\.)','1a.')
matches one or more digits, followed by the lookbehind and a matching dot. In this case, Python is compelled to match the digit 1. But then the lookbehind must fail: if the last character matched was a digit, it cannot be the letter a. So no match is possible in the second case. Even if we were to remove (?<=a) from the second pattern, it would still fail because we are not accounting for the letter a.
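The two cases can be reproduced with the sketch below. It uses the standard library re module rather than the third-party regex module from the question, on the assumption that both behave the same for these simple patterns. The third pattern, which adds a literal a, is not from the question and is only there to show one way of accounting for the letter.

```python
import re

text = "1a."

# \d* may match zero digits, so the match can start right before the dot.
star = re.search(r"(\d*)(?<=a)(\.)", text)
print(star.span(), repr(star.group(1)))

# \d+ must consume "1", and then the lookbehind sees a digit instead of "a".
plus = re.search(r"(\d+)(?<=a)(\.)", text)
print(plus)

# Consuming the letter explicitly lets the digits and the dot both match.
with_a = re.search(r"(\d+)a(\.)", text)
print(with_a.group())
```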

Python regular expression to match a pattern when preceded by either start of line or whitespace [duplicate]

This question already has answers here:
Python Regex Engine - "look-behind requires fixed-width pattern" Error
(3 answers)
Closed 4 years ago.
I would like to write a regex that matches the word hello but only when it either starts a line or is preceded by whitespace. I don't want to match the whitespace if it's there. I just need to know it (or the start of line) is there.
So I've tried:
r = re.compile('hello(?<=\s|^)')
but this throws:
error: look-behind requires fixed-width pattern
For the sake of an example, if my string to be searched is:
s = 'hello world hello thello'
then I would like my regex to match two times...at the locations in uppercase below:
'HELLO world HELLO thello'
where the first would match because it is preceded by the start of the line, while the second match would be because it is preceded by a space. The last 5 characters would not match because they are preceded by a t.
(?:(?<=\s)|^)hello is what you want. The lookbehind needs to come before hello in the regular expression, and it must indeed be of fixed width: \s is 1 character wide, whereas ^ is 0 characters wide, so you cannot combine them with | inside a single lookbehind. In this case we do not need to; we just alternate between (?<=\s) and ^.
Notice that this would still match the hello in hellooo; if this is not acceptable, you have to add \b at the end.
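As a quick check, the pattern can be run against the string from the question with Python's standard re module. The second test string "hello hellooo" and the variable names are made up for illustration.

```python
import re

s = "hello world hello thello"

# Alternate between "preceded by whitespace" and "at the start of the string".
pattern = re.compile(r"(?:(?<=\s)|^)hello")
starts = [m.start() for m in pattern.finditer(s)]
print(starts)   # start offsets of the two accepted matches

# Without a trailing \b the pattern also accepts the start of "hellooo".
loose = pattern.findall("hello hellooo")
strict = re.findall(r"(?:(?<=\s)|^)hello\b", "hello hellooo")
print(loose, strict)
```

Without the re.MULTILINE flag, ^ matches only at the start of the whole string; pass that flag if "start of line" should apply to every line of a multi-line input.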

Two word boundaries (\b) to isolate a single word

I am trying to match the word that appears immediately after a number - in the sentence below, it is the word "meters".
The tower is 100 meters tall.
Here's the pattern that I tried which didn't work:
\d+\s*(\b.+\b)
But this one did:
\d+\s*(\w+)
The first, incorrect pattern matched 100 meters tall in this sentence, capturing meters tall in the group:
The tower is 100 meters tall.
I didn't want the word "tall" to be matched. I expected the following behavior:
\d+ match one or more occurrence of a digit
\s* match any or no spaces
( start new capturing group
\b find the word/non-word boundary
.+ match 1 or more of everything except new line
\b find the next word/non-word boundary
) stop capturing group
The problem is I don't know tiddly-twat about regex, and I am very much a noob as a noob can be. I am practicing by making my own problems and trying to solve them - this is one of them. Why didn't the match stop at the second break (\b)?
This is Python flavored
Here's the regex101 test link of the above regex.
It didn't stop because + is greedy by default; you want +? for a non-greedy match.
A concise explanation: * and + are greedy quantifiers, meaning they will match as much as they can while still allowing the remainder of the regular expression to match.
You need to follow these quantifiers with ? for a non-greedy match. In the above order, that gives *? for "zero or more" and +? for "one or more", in both cases matching as few as possible.
Also, a word boundary \b matches positions where one side is a word character (a letter, digit or underscore, including Unicode letters and digits in Python 3) and the other side is not a word character. I wouldn't use \b around the . if you're unclear about what's in between the boundaries.
It matches both words because . matches (nearly) all characters, including the space character, and because + is greedy, so it will match as much as it can. If you use \w instead of . it works, because \w matches only word characters (a-zA-Z_0-9).
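The three variants discussed here can be compared side by side with Python's standard re module. This is a minimal sketch using the sentence from the question; the variable names are illustrative.

```python
import re

sentence = "The tower is 100 meters tall."

# Greedy .+ runs to the end of the line and backtracks only as far as the
# last word boundary, which sits between "tall" and the final ".".
greedy = re.search(r"\d+\s*(\b.+\b)", sentence).group(1)

# Lazy .+? stops at the first word boundary after at least one character.
lazy = re.search(r"\d+\s*(\b.+?\b)", sentence).group(1)

# \w+ cannot cross the space, so no boundaries are needed at all.
word = re.search(r"\d+\s*(\w+)", sentence).group(1)

print(repr(greedy), repr(lazy), repr(word))
```

Of the two working variants, \w+ states the intent (one word) most directly and does not depend on how the quantifier backtracks.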
