regex lookahead assertion [duplicate] - python

This question already has answers here:
Regex plus vs star difference? [duplicate]
(9 answers)
Closed 5 years ago.
I'm new to python regex and am learning the lookahead assertion.
I found the following strange. Could someone tell me how it works?
import regex as re
re.search('(\d*)(?<=a)(\.)','1a.')
<regex.Match object; span=(2, 3), match='.'>
re.search('(\d+)(?<=a)(\.)','1a.')
outputs nothing
Why doesn't the second one match anything?

The first pattern:
re.search('(\d*)(?<=a)(\.)', '1a.')
says to find zero or more digits, followed by a dot. Right before the dot there is a positive lookbehind, which asserts that the previous character was an a. In this case, Python matches zero digits (starting at the position just before the dot), followed by a single dot. The lookbehind succeeds, because the preceding character is in fact an a.
However, the second pattern:
re.search('(\d+)(?<=a)(\.)','1a.')
matches one or more digits, followed by the lookbehind and the matching dot. In this case, Python is compelled to match the number 1. But then the lookbehind must fail: if the last character matched was a digit, it cannot be the letter a. So, there is no match possible in the second case. Even if we were to remove (?<=a) from the second pattern, it would still fail because we are not accounting for the letter a between the digit and the dot.
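A quick way to check both patterns is sketched below. It uses the standard library re module rather than the third-party regex module from the question (the behaviour is the same for these patterns). The third pattern, which consumes the a explicitly, is an illustrative assumption and not part of the original answer:

```python
import re

text = '1a.'

# \d* may match zero digits, so the match can start right before the dot,
# where the lookbehind sees the preceding 'a'.
star = re.search(r'(\d*)(?<=a)(\.)', text)
print(star.span(), repr(star.group(1)))  # (2, 3) ''

# \d+ must consume the '1'; the lookbehind then sees '1', not 'a'.
plus = re.search(r'(\d+)(?<=a)(\.)', text)
print(plus)  # None

# Consuming the 'a' explicitly lets \d+ take part in the match.
fixed = re.search(r'(\d+)a(\.)', text)
print(fixed.group())  # 1a.
```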

Related

Question about ".*" in match regex in Python [duplicate]

This question already has answers here:
Regular Expressions- Match Anything
(17 answers)
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 2 years ago.
Following is a simple piece of code about regex match:
import re
pattern = ".*"
s = "ab"
print(re.search(pattern, s))
output:
<_sre.SRE_Match object; span=(0, 2), match='ab'>
My confusion is that "." matches any single character, so here it is able to match "a" or "b". With a "*" behind it, this combination should then be able to match "", "a", "aa", "aaa...", "b", "bb", "bbb...", or any other single character repeated several times.
But how come it (".*") matches "ab" as a whole?
The comments more or less covered it, but to provide an answer: the pattern .* means to match any character . zero or more times *. And by default, a regex is greedy so when presented with 'abc', even though '' would satisfy that rule, or 'a' would, etc., it will match the entire string, since matching all of it still meets the requirement.
It does not mean to match the same character zero or more times. Every character it matches can be a different character or the same as a previously matched one.
If instead you want to match any character, but match as many of that same character as possible, zero or more times, you can use:
(.)?\1*
See here https://regex101.com/r/FgvuX2/1 and here https://regex101.com/r/FgvuX2/2
What this effectively does is optionally match a single character, creating a backreference which can be used in the second part of the expression. Thus it captures any single character (if there is one) in group 1 and then greedily matches the contents of group 1 zero or more times.
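The difference can be seen in a short sketch with the standard re module. The lazy variant .*? and the input "aab" are added here only for comparison and are not from the original answer:

```python
import re

s = "ab"

# Greedy: each '.' may match a different character, so the whole string is taken.
greedy = re.search(r".*", s).group()
print(repr(greedy))  # 'ab'

# Lazy: as few characters as possible, which is zero.
lazy = re.search(r".*?", s).group()
print(repr(lazy))  # ''

# Backreference: only repeats of the first captured character are matched.
same_char = re.search(r"(.)?\1*", "aab").group()
print(repr(same_char))  # 'aa'
```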

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next thought was whether the following is possible in a regex, but I've been reading several Regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter, immediately followed by itself-plus-32-ascii, so, in other words, it should match an uppercase letter followed immediately by its lowercase counterpart. But, as far as I know, you cannot "add an ascii value" to a backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?
You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a positive lookahead that makes sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but are of different case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE and then, inside the second lookahead, (?i:\1) makes sure the fifth char from the end is the same letter (with the case-insensitive mode on), while the (?!\1) negative lookahead makes sure this letter is not identical to the one captured, so it must differ in case.
Before Python 3.6, the re module did not support inline modifier groups such as (?i:...), and it still does not support \p{L}, so this expression won't work with that module.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
    print("Testing {}...".format(s))
    if regex.search(rx, s):
        print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...
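Since the question notes that a regex is not strictly required, here is a minimal non-regex sketch that also covers the arbitrary-length case. The function name is_case_mirrored is hypothetical, and the sketch assumes letters whose case-swapped form has the same length (a few non-ASCII letters, such as ß, do not satisfy that and will simply not match):

```python
def is_case_mirrored(s: str) -> bool:
    """Return True if the second half of s is the first half with case flipped."""
    # Only even-length, all-letter strings can qualify.
    if len(s) % 2 != 0 or not s.isalpha():
        return False
    half = len(s) // 2
    first, second = s[:half], s[half:]
    # str.swapcase() inverts the case of every letter.
    return first.swapcase() == second


for s in ['abCDeABcdE', 'abcdeABCDE', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']:
    print(s, is_case_mirrored(s))
```

The first three strings print True and the last two print False, matching the regex demo above.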

Python Regex symbolize unlimited amount of characters [duplicate]

This question already has answers here:
Using regex to match any character except =
(4 answers)
Closed 4 years ago.
I'm trying to figure out how to represent the following regex in python:
Find the first occurrence of
{any character that isn't a letter}'{unlimited amount of any character including '}'{any character that isn't a letter}
For example:
She said 'Hello There!'.
`he Looked. 'I've been sick' and then...`
My question is how do I implement the middle part? How do I represent an unlimited amount of characters until the pattern in the end is found (`_)?
There are a few different ways you can represent an indefinite number of characters:
*: zero or more of the preceding character (greedy)
+: one or more of the preceding character (greedy)
*?: zero or more of the preceding character (non-greedy)
+?: one or more of the preceding character (non-greedy)
"Greedy" means that as many characters as possible will be matched. "Non-greedy" means that as few characters as possible will be matched. (For more explanation on greedy and non-greedy, see this answer.)
In your case, it sounds like you want to match one or more characters, and for the match to be non-greedy, so you need +?.
In Python code:
import re
my_regex = re.compile(r"\W'[^']+?'\W")
my_regex.search("She said 'Hello There!'.")
This regex won't match your second example, 'I've been sick' and then..., as there is no non-word character before the first '.
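The behaviour on both examples can be checked with the sketch below. The second pattern, which uses .+? so that an apostrophe may appear inside the quoted text, is an assumption added for illustration and is not part of the original answer:

```python
import re

strict = re.compile(r"\W'[^']+?'\W")  # pattern from the answer
loose = re.compile(r"\W'(.+?)'\W")    # assumed variant: allows ' inside the quote

first = "She said 'Hello There!'."
second = "he Looked. 'I've been sick' and then..."

print(repr(strict.search(first).group()))  # " 'Hello There!'."
print(strict.search(second))               # None

# The lazy .+? keeps extending until a ' followed by a non-word character.
print(loose.search(second).group(1))       # I've been sick
```

Note that the loose variant treats any ' followed by a non-word character as the closing quote, so it can still stop early on input such as 'the dogs' bowl'.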

Python Beginner Regular Expressions [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
What is the use of the = in the regex (?=.*?[A-Z]), and why are the ? and * in front of the [A-Z]? I saw in a book that they should appear behind the word or expression they take effect on. And why are there two ?
This whole RegEx
(?=.*?[A-Z])
is called a lookahead assertion, which is one kind of lookaround.
It consists of three items:
(?= )
.*?
[A-Z]
The first one is the syntax for a lookahead assertion. The pattern goes inside the parentheses, after the initial ?=.
The second one is a dot that matches any character, with a repetition modifier *?, where the asterisk means "zero or more matches" and the question mark means "match as few as possible" instead of being greedy. Both modifiers come after the dot, which is the element they apply to.
The third one is a character class that matches a single uppercase letter from A to Z, which I suppose you already know.
A lookaround assertion restricts the surroundings of a pattern without consuming (or capturing) any extra text. For example:
a(?=b)
will match the letter a in ab, but not in ac. Note that it only matches the letter a; the letter b is only a restriction on where the letter a may be matched. By contrast, a(b) matches both letters in ab and captures the latter.
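A short sketch with the standard re module shows the difference. The test strings "abcD" and "abcd" are made up for illustration:

```python
import re

# Lookahead: 'b' is required but is not part of the match.
print(re.search(r"a(?=b)", "ab").group())  # a
print(re.search(r"a(?=b)", "ac"))          # None

# Capturing group: 'b' is consumed and captured.
m = re.search(r"a(b)", "ab")
print(m.group(), m.group(1))               # ab b

# On its own, (?=.*?[A-Z]) consumes nothing: it only asserts that an
# uppercase letter occurs somewhere ahead of the current position.
has_upper = re.search(r"(?=.*?[A-Z])", "abcD")
print(has_upper.span())                    # (0, 0)
no_upper = re.search(r"(?=.*?[A-Z])", "abcd")
print(no_upper)                            # None
```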

Python regular expression to match a pattern when preceded by either start of line or whitespace [duplicate]

This question already has answers here:
Python Regex Engine - "look-behind requires fixed-width pattern" Error
(3 answers)
Closed 4 years ago.
I would like to write a regex that matches the word hello but only when it either starts a line or is preceded by whitespace. I don't want to match the whitespace if it's there... I just need to know it (or the start of line) is there.
So I've tried:
r = re.compile('hello(?<=\s|^)')
but this throws:
error: look-behind requires fixed-width pattern
For the sake of an example, if my string to be searched is:
s = 'hello world hello thello'
then I would like my regex to match two times...at the locations in uppercase below:
'HELLO world HELLO thello'
where the first would match because it is preceded by the start of the line, while the second match would be because it is preceded by a space. The last 5 characters would not match because they are preceded by a t.
(?:(?<=\s)|^)hello is what you want. The lookbehind needs to be at the beginning of the regular expression, before hello, and it must indeed be of fixed width: \s is 1 character wide, whereas ^ is 0 characters wide, so you cannot combine them with | inside a single lookbehind. In this case we do not need to; we just alternate (?<=\s) and ^ outside of it.
Notice that this would still match the hello in hellooo; if this is not acceptable, you have to add \b at the end.
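The pattern can be tried on the string from the question as sketched below. The negative-lookbehind variant (?<!\S)hello is an alternative added here as an assumption, not part of the original answer; it reads as "not preceded by a non-whitespace character", which is true both after whitespace and at the start of the string:

```python
import re

s = 'hello world hello thello'

pattern = re.compile(r'(?:(?<=\s)|^)hello')
spans = [m.span() for m in pattern.finditer(s)]
print(spans)  # [(0, 5), (12, 17)]

# Assumed alternative: a single fixed-width negative lookbehind.
alt = re.compile(r'(?<!\S)hello')
alt_spans = [m.span() for m in alt.finditer(s)]
print(alt_spans)  # [(0, 5), (12, 17)]

# With \b at the end, 'hellooo' is no longer matched.
bounded = re.compile(r'(?:(?<=\s)|^)hello\b')
print(bounded.search('hellooo'))           # None
print(bounded.search('say hello').span())  # (4, 9)
```

Without re.MULTILINE, ^ matches only at the start of the string; the (?<=\s) branch still covers a hello that follows a newline, because a newline counts as whitespace.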
