Python Regular Expressions - How is "+?" equivalent to "*"

* : 0 or more occurrences of the pattern to its left
+ : 1 or more occurrences of the pattern to its left
? : 0 or 1 occurrences of the pattern to its left
How is "+?" equivalent to "*" ?
Consider a search for any 3 letter word if it exists.
re1 = re.search(r'(\w\w\w)*', "abc")
In the case of re1, * tries to get 0 or more occurrences of the pattern to its left, which in this case is the group of 3 letters. So it will either find a 3 letter word or fail.
re2 = re.search(r'(\w\w\w)+?', "abc")
In the case of re2, it's supposed to give the same output, but I'm confused as to why "*" and "+?" are equivalent. Can you please explain this?

* and +? are not equivalent. The ? takes on a special meaning if it follows a quantifier, making that quantifier lazy.
Usually, quantifiers are greedy, meaning they will try to match as many repetitions as they can; lazy quantifiers match as few as they can. But a+? will still match at least one a.
In [1]: re.search("(a*)(.*)", "aaaaaa").groups()
Out[1]: ('aaaaaa', '')
In [2]: re.search("(a+?)(.*)", "aaaaaa").groups()
Out[2]: ('a', 'aaaaa')
In your example, both regexes happen to match the same text because both (\w\w\w)* and (\w\w\w)+? can match three letters, and there are exactly three letters in your input. But they will differ in other strings:
In [12]: re.search(r"(\w\w\w)+?", "abcdef")
Out[12]: <_sre.SRE_Match object; span=(0, 3), match='abc'>
In [13]: re.search(r"(\w\w\w)+?", "ab") # No match
In [14]: re.search(r"(\w\w\w)*", "abcdef")
Out[14]: <_sre.SRE_Match object; span=(0, 6), match='abcdef'>
In [15]: re.search(r"(\w\w\w)*", "ab")
Out[15]: <_sre.SRE_Match object; span=(0, 0), match=''>
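A lazy quantifier only stays short when nothing after it forces it to grow. The following is a minimal sketch of that behaviour; the trailing $ anchor and the variable names are illustrative additions, not part of the question:

```python
import re

text = "abcdef"

# Lazy: stops after the first group of three letters.
lazy = re.search(r"(\w\w\w)+?", text)

# Lazy but anchored: it has to keep repeating until the end of the string is reached.
lazy_anchored = re.search(r"(\w\w\w)+?$", text)

# Greedy star: takes every complete group of three it can find.
greedy = re.search(r"(\w\w\w)*", text)

print(lazy.group(0))           # abc
print(lazy_anchored.group(0))  # abcdef
print(greedy.group(0))         # abcdef

# On input too short for one group, +? fails while * matches the empty string.
too_short_lazy = re.search(r"(\w\w\w)+?", "ab")
too_short_star = re.search(r"(\w\w\w)*", "ab")
print(too_short_lazy)                 # None
print(repr(too_short_star.group(0)))  # ''
```

So +? and * can agree on some inputs, but they differ both in the minimum number of repetitions and in how much they prefer to consume.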

If you try a simpler expression, you will see that they are not the same:
>>> import re
>>> re.search("[0-9]*", "1")
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> re.search("[0-9]*", "")
<_sre.SRE_Match object; span=(0, 0), match=''>
>>> re.search("[0-9]+", "")
>>> re.search("[0-9]+", "1")
<_sre.SRE_Match object; span=(0, 1), match='1'>
The difference in your code is that (\w\w\w)+? needs one or more occurrences of the group, while (\w\w\w)* also accepts none.

Related

RegEx to validate for-loop header in Python

I took a string input from the user that is a for-loop header and need to validate it using regex. It verifies successfully up to the in keyword, but what should I do to match range(): or variable: at the end?
import re
loop_header = input("Enter the for loop header: ")
print(re.search(r'for[\s]+[a-zA-Z]+[\S][0-9]*[\s]+in[\s]+', loop_header))
I am not able to figure out how to validate the ending part of loop header.
With the regex pattern slightly modified (now taking variable number of non-space characters in the iterator, and ensuring a colon at the end), the following 4 different for-loop headers can be correctly matched:
import re
loop_headers = [
    "for i in [1, 2, 3]:",
    "for num in [1, 2, 3]:",
    "for i in (1, 2, 3):",
    "for word in enumerate(['and', 'now', 'for', 'something', 'completely', 'different']):",
]
regex = re.compile(r"for[\s]+[a-zA-Z]+[\S]*[0-9]*[\s]*in[\s]*(range\(.*\)|enumerate\(.*\)|zip\(.*\))?.*:$")
for loop_header in loop_headers:
    print(re.search(regex, loop_header))
Returning
<re.Match object; span=(0, 19), match='for i in [1, 2, 3]:'>
<re.Match object; span=(0, 21), match='for num in [1, 2, 3]:'>
<re.Match object; span=(0, 19), match='for i in (1, 2, 3):'>
<re.Match object; span=(0, 85), match="for word in enumerate(['and', 'now', 'for', 'some>
You could then additionally have a second regex to require range, enumerate, zip etc. around the iterable, e.g.:
regex_2 = re.compile(
r"for[\s]+[a-zA-Z]+[\S]*[0-9]*[\s]*in[\s]*(range\(.*\)|enumerate\(.*\)|zip\(.*\))+:$"
)
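As a rough check of what the stricter pattern accepts, here is a small sketch that runs regex_2 against a few headers. The sample headers and the results dictionary are made up for illustration:

```python
import re

# Same pattern as regex_2: the iterable must be wrapped in range(), enumerate() or zip().
regex_2 = re.compile(
    r"for[\s]+[a-zA-Z]+[\S]*[0-9]*[\s]*in[\s]*(range\(.*\)|enumerate\(.*\)|zip\(.*\))+:$"
)

# Hypothetical sample headers, chosen for illustration.
samples = [
    "for i in range(10):",    # wrapped in range() and ends with a colon
    "for a in zip(xs, ys):",  # wrapped in zip()
    "for i in [1, 2, 3]:",    # plain list, no wrapper
    "for i in range(10)",     # missing the trailing colon
]

results = {header: bool(regex_2.search(header)) for header in samples}
for header, ok in results.items():
    print(f"{header!r}: {'accepted' if ok else 'rejected'}")
```

With these inputs the first two headers are accepted and the last two are rejected. Note that the pattern only checks the overall shape of the header; it does not verify that the text inside the parentheses is valid Python.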

Finding a given substring in a string with Regular Expression in Python

I am trying to find all the occurrences of a substring in a string like below:
import re
S = 'aaadaa'
matches = re.finditer('(aa)', S)
if matches:
    #print(matches)
    for match in matches:
        print(match)
else:
    print("No match")
The current output is:
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(4, 6), match='aa'>
But I am expecting that it should give the values as:
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(1, 3), match='aa'>
<re.Match object; span=(4, 6), match='aa'>
Could someone please help me on this?
Taken from the answer I linked in the comments, here is the pattern you need:
(?=(aa)).
You'll have to access the matched substring using match_obj.group(1), and the match indices using match_obj.span(1).
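A minimal sketch of the lookahead approach with re.finditer, written here without the trailing dot (an assumption about the intended pattern; both forms find the same three positions for this input):

```python
import re

S = "aaadaa"

# The lookahead is zero-width, so the scan advances one character at a time
# and group 1 captures each overlapping "aa".
overlapping = [(m.group(1), m.span(1)) for m in re.finditer(r"(?=(aa))", S)]
print(overlapping)
# [('aa', (0, 2)), ('aa', (1, 3)), ('aa', (4, 6))]
```

Because the overall match is empty, the text and its position have to be read from group 1, not from group 0.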
The problem here is that once the re module matches a double aa, it will also consume both of the letters. But, you want overlapping matches. One trick you could use here would be to search for a(?=a):
S = 'aaadaa'
matches = re.findall(r'a(?=a)', S)
matches = [s + "a" for s in matches]
print(matches)
['aa', 'aa', 'aa']
Note that we tag on the second a to the output list, since only the first letter is actually matched at each step.

Python: I am trying to match number but it is matching something else

import re
list = []
string = "[50,40]"
print(string)
for line in string.split(","):
    print(line)
    match = re.search(r'\d[0-9]', line)
    print(match)
    if match:
        list.append(match)
print("list is", list)
list is:
[<_sre.SRE_Match object; span=(1, 3), match='50'>,
<_sre.SRE_Match object; span=(0, 2), match='40'>]
I want to match only 40 and 50 and not some other useless info like
[<_sre.SRE_Match object; span=(1, 3),
<_sre.SRE_Match object; span=(0, 2),]
How to avoid other things and match only 40 and 50
Use the re.findall function, which will "Return all non-overlapping matches of pattern in string, as a list of strings":
string = "[50,40]"
result = re.findall(r'\d+', string)
print(result)
The output:
['50', '40']
Your code is matching the numbers, but you need to extract the strings from the Match objects. You can do that with the .group method.
Here's a repaired version of your code. I've changed some of your names because you should not shadow the builtin list type, also string is the name of a standard module.
import re
lst = []
s = "[50,40]"
print(s)
for line in s.split(","):
    print(line)
    match = re.search(r'\d[0-9]', line)
    print(match)
    if match:
        lst.append(match.group(0))
print("list is", lst)
output
[50,40]
[50
<_sre.SRE_Match object; span=(1, 3), match='50'>
40]
<_sre.SRE_Match object; span=(0, 2), match='40'>
list is ['50', '40']
Your regex is a little strange: \d and [0-9] both match a single digit here, so \d[0-9] matches exactly two digits. If you only want to match 2 digit numbers you could use r'\d\d' or r'\d{2}'. If you want to match any (non-negative) integer you should use r'\d+'.
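To make the difference between the two patterns concrete, here is a short sketch on a made-up input containing numbers of different lengths (the input string and variable names are illustrative only):

```python
import re

s = "[5,40,123]"  # hypothetical input with numbers of different lengths

two_digits = re.findall(r"\d\d", s)  # exactly two digits per match
any_length = re.findall(r"\d+", s)   # one or more digits per match

print(two_digits)  # ['40', '12']
print(any_length)  # ['5', '40', '123']

# Convert to integers if numeric values are needed instead of strings.
numbers = [int(n) for n in any_length]
print(numbers)     # [5, 40, 123]
```

The two-digit pattern skips 5 and cuts 123 down to 12, which is why r'\d+' is usually the safer choice for whole numbers.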
And you really don't need to do that loop. Just use the re.findall function:
import re
s = "[50,40]"
lst = re.findall(r'\d+', s)
print("list is", lst)
If you intend to do lots of searches with the same pattern, it's a good idea to use a compiled regex. The re module compiles and caches all regexes anyway, but doing it explicitly is considered good style, and is a little more efficient.
import re
pat = re.compile(r'\d+')
s = "[50,40]"
lst = pat.findall(s)
print("list is", lst)

Regex : match a string containing only one alphabet

If there is a string
str= "S23#"
It should match and
if str="WS23%"
it should not match (because it contains 2 letters)
I used re.search("^[{A-Z}?0-9()*%#+?=:._<>,!/\-]*$", str) and it matches both strings
Just remove the part that matches letters from the character class and put it between two [0-9()%#+?=:._<>,!/-]* patterns, so that exactly one letter is required.
re.match(r"^[0-9()%#+?=:._<>,!/-]*[A-Za-z][0-9()%#+?=:._<>,!/-]*$", st)
Example:
>>> s= "S23#"
>>> s1 = "WS23%"
>>> re.match(r"^[0-9()%#+?=:._<>,!/-]*[A-Za-z][0-9()%#+?=:._<>,!/-]*$", s)
<_sre.SRE_Match object; span=(0, 4), match='S23#'>
>>> re.match(r"^[0-9()%#+?=:._<>,!/-]*[A-Za-z][0-9()%#+?=:._<>,!/-]*$", s1)
>>>
^(?!(.*?[A-Z]){2})[{A-Z}?0-9()%#+?=:._<>,!/-]+$
Try this. Also use re.match if you want to match the whole string. See the demo.
https://regex101.com/r/aI4rA5/2
re.match("^(?!(.*?[A-Z]){2})[{A-Z}?0-9()%#+?=:._<>,!/-]+$", str)
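The two answers are not interchangeable: the first pattern requires exactly one letter of either case, while the lookahead version allows zero or one uppercase letter and no lowercase letters. A small comparison sketch follows; the names exactly_one and at_most_one and the last two sample strings are made up for illustration:

```python
import re

# Pattern from the first answer: exactly one ASCII letter, upper or lower case.
exactly_one = re.compile(r"^[0-9()%#+?=:._<>,!/-]*[A-Za-z][0-9()%#+?=:._<>,!/-]*$")

# Pattern from the second answer: rejects strings with two or more uppercase letters.
at_most_one = re.compile(r"^(?!(.*?[A-Z]){2})[{A-Z}?0-9()%#+?=:._<>,!/-]+$")

# Only the first two strings come from the question; the others are hypothetical.
samples = ["S23#", "WS23%", "23#", "s23"]

results = {s: (bool(exactly_one.match(s)), bool(at_most_one.match(s))) for s in samples}
for s, (first, second) in results.items():
    print(f"{s!r}: first={first}, second={second}")
```

Both patterns agree on the question's two strings, but "23#" (no letter at all) passes only the second pattern and "s23" (one lowercase letter) passes only the first, so the choice depends on which of those cases should count as valid.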

Matching a Regex against a multiline string

I am trying to match a Regex against a multi-line string, but the match fails after the first line.
These expressions work as expected:
>>> import re
>>> r = re.compile("a")
>>> a = "a"
>>> r.match(a)
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> a = "a\n"
>>> r.match(a)
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>>
Whilst this expression does not work:
>>> a = "\na"
>>> r.match(a)
>>>
re.match was designed to match from the first character (the start) of the string. In the first two examples, the match works fine because a is the first character. In the last example however, the match fails because \n is the first character.
You need to use re.search in this case to have Python search for the a:
>>> import re
>>> r = re.compile("a")
>>> a = "\na"
>>> r.search(a)
<_sre.SRE_Match object; span=(1, 2), match='a'>
>>>
Also, just a note: if you are working with multi-line strings, then you will need to set the dot-all flag to have . match newlines. This can be done with re.DOTALL.
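The following sketch puts these pieces side by side: re.match versus re.search, plus the re.MULTILINE and re.DOTALL flags. The a.b pattern and the variable names are illustrative additions, not part of the question:

```python
import re

text = "\na"

# match only tries position 0, where the newline is.
at_start = re.match("a", text)
print(at_start)          # None

# search scans forward until it finds a match.
anywhere = re.search("a", text)
print(anywhere.span())   # (1, 2)

# Without re.MULTILINE, ^ only matches at the very start of the string.
caret_plain = re.search("^a", text)
caret_multi = re.search("^a", text, re.MULTILINE)
print(caret_plain)         # None
print(caret_multi.span())  # (1, 2)

# By default . does not match a newline; re.DOTALL changes that.
dot_plain = re.search("a.b", "a\nb")
dot_all = re.search("a.b", "a\nb", re.DOTALL)
print(dot_plain)       # None
print(dot_all.span())  # (0, 3)
```

re.MULTILINE changes where ^ and $ can match, while re.DOTALL changes what . can match, so the two flags solve different problems and can be combined when both are needed.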
Why doesn't match work?
match looks for the pattern only at the start of the string.
How to correct it?
Use search instead:
>>> import re
>>> pat=re.compile('a')
>>> pat.search('\na')
<_sre.SRE_Match object at 0x7faef636d440>
>>>
