Matching a Regex against a multiline string

I am trying to match a Regex against a multi-line string, but the match fails after the first line.
These expressions work as expected:
>>> import re
>>> r = re.compile("a")
>>> a = "a"
>>> r.match(a)
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> a = "a\n"
>>> r.match(a)
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>>
Whilst this expression does not work:
>>> a = "\na"
>>> r.match(a)
>>>

re.match was designed to match from the first character (the start) of the string. In the first two examples, the match works fine because a is the first character. In the last example however, the match fails because \n is the first character.
You need to use re.search in this case to have Python search for the a:
>>> import re
>>> r = re.compile("a")
>>> a = "\na"
>>> r.search(a)
<_sre.SRE_Match object; span=(1, 2), match='a'>
>>>
Also, a note: by default . does not match a newline. If a pattern needs . to span line breaks in a multi-line string, set the dot-all flag, re.DOTALL.
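To illustrate the dot-all note above, here is a minimal sketch. The sample text and variable names are made up for illustration and are not from the question.

```python
import re

text = "first line\nsecond line"

# Without re.DOTALL, "." stops at the newline.
plain = re.search(r"first.*", text)
print(plain.group(0))            # first line

# With re.DOTALL, "." also matches "\n", so ".*" runs to the end of the string.
dotall = re.search(r"first.*", text, re.DOTALL)
print(repr(dotall.group(0)))     # 'first line\nsecond line'
```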

Why doesn't match work?
match only looks for the pattern at the start of the string.
How to correct it?
Use search instead:
>>> import re
>>> pat=re.compile('a')
>>> pat.search('\na')
<_sre.SRE_Match object at 0x7faef636d440>
>>>
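A related point that may help: even with the re.MULTILINE flag, match still only tries the very start of the string, while search can find a line start after a newline. The pattern ^a below is an illustrative assumption, not part of the original question.

```python
import re

text = "\na"
pat = re.compile(r"^a", re.MULTILINE)

# match() only attempts the pattern at position 0, regardless of the flag.
at_start = pat.match(text)
print(at_start)            # None

# search() scans forward; in MULTILINE mode "^" also matches right after "\n".
found = pat.search(text)
print(found.span())        # (1, 2)
```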

Related

Finding a given substring in a string with Regular Expression in Python

I am trying to find all the occurrences of a substring in a string like below:
import re
S = 'aaadaa'
matches = re.finditer('(aa)', S)
if matches:
    #print(matches)
    for match in matches:
        print(match)
else:
    print("No match")
The current output is:
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(4, 6), match='aa'>
But I am expecting that it should give the values as:
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(1, 3), match='aa'>
<re.Match object; span=(4, 6), match='aa'>
Could someone please help me on this?

Taken from the answer I linked in the comments, here is the pattern you need:
(?=(aa)).
You’ll have to access the matched substring using match_obj.group(1), and the match indices using match_obj.span(1).
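As a rough sketch of how the lookahead approach plays out on the string from the question (this version uses the lookahead on its own, without the trailing dot, which yields the same groups here):

```python
import re

S = "aaadaa"

# The lookahead (?=(aa)) succeeds at a position without consuming any text,
# so the scan advances one character at a time and overlapping hits are found.
overlapping = [(m.group(1), m.span(1)) for m in re.finditer(r"(?=(aa))", S)]
print(overlapping)
# [('aa', (0, 2)), ('aa', (1, 3)), ('aa', (4, 6))]
```

The whole match is empty in each case, which is why the text and indices come from group 1 rather than group 0.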

The problem here is that once the re module matches aa, it consumes both letters, so the next match can only start after them. But you want overlapping matches. One trick you could use here would be to search for a(?=a):
S = 'aaadaa'
matches = re.findall(r'a(?=a)', S)
matches = [s + "a" for s in matches]
print(matches)
['aa', 'aa', 'aa']
Note that we tag on the second a to the output list, since only the first letter is actually matched at each step.

Substring[whole word] check using a string variable

In Python2.7, I am trying the following:
>>> import re
>>> text='0.0.0.0/0 172.36.128.214'
>>> far_end_ip="172.36.128.214"
>>>
>>>
>>> chk=re.search(r"\b172.36.128.214\b",text)
>>> chk
<_sre.SRE_Match object at 0x0000000002349578>
>>> chk=re.search(r"\b172.36.128.21\b",text)
>>> chk
>>> chk=re.search(r"\b"+far_end_ip+"\b",text)
>>>
>>> chk
>>>
Q: How can I make the search work when using the variable far_end_ip?

Two issues:
You need to write the last bit of the string as a raw string literal or escape the backslash, because "\b" in a normal string is a backspace character: ... + r"\b"
You should escape the dots in the text to find: ... + re.escape(far_end_ip)
So:
re.search(r"\b" + re.escape(far_end_ip) + r"\b",text)
See also "How to use a variable inside a regular expression?".
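A small sketch of both issues, using the values from the question. The look-alike string "172x36x128x214" is invented here purely to show why escaping the dots matters.

```python
import re

text = "0.0.0.0/0 172.36.128.214"
far_end_ip = "172.36.128.214"

# "\b" in a normal string is a backspace character (\x08), not a word boundary,
# so this pattern looks for a backspace after the address and finds nothing.
broken = re.search(r"\b" + far_end_ip + "\b", text)
print(broken)                    # None

# Raw strings for both boundaries, and re.escape so each "." is literal.
pattern = r"\b" + re.escape(far_end_ip) + r"\b"
fixed = re.search(pattern, text)
print(fixed.group(0))            # 172.36.128.214

# Without re.escape, "." matches any character, so a look-alike also matches.
lookalike = "172x36x128x214"
unescaped_hit = bool(re.search(r"\b" + far_end_ip + r"\b", lookalike))
escaped_hit = bool(re.search(pattern, lookalike))
print(unescaped_hit, escaped_hit)   # True False
```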

Python: I am trying to match number but it is matching something else

import re
list = []
string = "[50,40]"
print(string)
for line in string.split(","):
    print(line)
    match = re.search(r'\d[0-9]', line)
    print(match)
    if match:
        list.append(match)
print("list is", list)
list is:
[<_sre.SRE_Match object; span=(1, 3), match='50'>,
<_sre.SRE_Match object; span=(0, 2), match='40'>]
I want to match only 40 and 50 and not some other useless info like
[<_sre.SRE_Match object; span=(1, 3),
<_sre.SRE_Match object; span=(0, 2),]
How can I avoid the other things and get only 40 and 50?

Use the re.findall function. According to the documentation, it will:
Return all non-overlapping matches of pattern in string, as a list of strings
string = "[50,40]"
result = re.findall(r'\d+', string)
print(result)
The output:
['50', '40']

Your code is matching the numbers, but you need to extract the strings from the Match objects. You can do that with the .group method.
Here's a repaired version of your code. I've changed some of your names because you should not shadow the builtin list type, also string is the name of a standard module.
import re
lst = []
s = "[50,40]"
print(s)
for line in s.split(","):
    print(line)
    match = re.search(r'\d[0-9]', line)
    print(match)
    if match:
        lst.append(match.group(0))
print("list is", lst)
output
[50,40]
[50
<_sre.SRE_Match object; span=(1, 3), match='50'>
40]
<_sre.SRE_Match object; span=(0, 2), match='40'>
list is ['50', '40']
Your regex is a little strange. If you only want to match 2-digit numbers you could use r'\d\d' or r'\d{2}'. If you want to match any non-negative integer you should use r'\d+'.
And you really don't need to do that loop. Just use the re.findall method:
import re
s = "[50,40]"
lst = re.findall(r'\d+', s)
print("list is", lst)
If you intend to do lots of searches with the same pattern, it's a good idea to use a compiled regex. The re module compiles and caches regexes anyway, but doing it explicitly is considered good style, and is a little more efficient.
import re
pat = re.compile(r'\d+')
s = "[50,40]"
lst = pat.findall(s)
print("list is", lst)
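To compare the pattern choices discussed in this answer, here is a short sketch on a slightly different input. The string "[7,50,400]" is an assumption chosen to make the difference visible; it is not from the question.

```python
import re

s = "[7,50,400]"

# \d[0-9] and \d{2} both match exactly two digits, so "7" is skipped
# and "400" is cut down to "40".
two_a = re.findall(r"\d[0-9]", s)
two_b = re.findall(r"\d{2}", s)
print(two_a, two_b)              # ['50', '40'] ['50', '40']

# \d+ matches a whole run of digits, whatever its length.
any_len = re.findall(r"\d+", s)
print(any_len)                   # ['7', '50', '400']

# The results are strings; convert them if numbers are needed.
numbers = [int(n) for n in any_len]
print(numbers)                   # [7, 50, 400]
```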

Python Regular Expressions - How is "+?" equivalent to "*"

* : 0 or more occurrences of the pattern to its left
+ : 1 or more occurrences of the pattern to its left
? : 0 or 1 occurrences of the pattern to its left
How is "+?" equivalent to "*" ?
Consider a search for any 3 letter word if it exists.
re1 = re.search(r'(\w\w\w)*', "abc")
In the case of re1, * tries to get 0 or more occurrences of the pattern to its left, which in this case is the group of 3 letters. So it will either try to find a 3-letter word or fail.
re2 = re.search(r'(\w\w\w)+?', "abc")
In the case of re2, it's supposed to give the same output, but I'm confused as to why "*" and "+?" are equivalent. Can you please explain this?

* and +? are not equivalent. The ? takes on a special meaning if it follows a quantifier, making that quantifier lazy.
Usually, quantifiers are greedy, meaning they will try to match as many repetitions as they can; lazy quantifiers match as few as they can. But a+? will still match at least one a.
In [1]: re.search("(a*)(.*)", "aaaaaa").groups()
Out[1]: ('aaaaaa', '')
In [2]: re.search("(a+?)(.*)", "aaaaaa").groups()
Out[2]: ('a', 'aaaaa')
In your example, both regexes happen to match the same text because both (\w\w\w)* and (\w\w\w)+? can match three letters, and there are exactly three letters in your input. But they will differ in other strings:
In [12]: re.search(r"(\w\w\w)+?", "abcdef")
Out[12]: <_sre.SRE_Match object; span=(0, 3), match='abc'>
In [13]: re.search(r"(\w\w\w)+?", "ab") # No match
In [14]: re.search(r"(\w\w\w)*", "abcdef")
Out[14]: <_sre.SRE_Match object; span=(0, 6), match='abcdef'>
In [15]: re.search(r"(\w\w\w)*", "ab")
Out[15]: <_sre.SRE_Match object; span=(0, 0), match=''>
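One more illustration that may help: a lazy quantifier prefers fewer repetitions, but it will still take more when the rest of the pattern requires it. The anchored pattern and the *? variant below are additions for illustration, not from the question.

```python
import re

# On its own, the lazy group stops after one repetition.
lazy_alone = re.search(r"(\w\w\w)+?", "abcdef").group(0)
print(lazy_alone)        # abc

# Anchored with "$", the lazy quantifier has to keep repeating to reach the end.
lazy_anchored = re.search(r"(\w\w\w)+?$", "abcdef").group(0)
print(lazy_anchored)     # abcdef

# "*?" is the lazy form of "*": it prefers zero repetitions.
lazy_star = re.search(r"(\w\w\w)*?", "abcdef").span()
print(lazy_star)         # (0, 0)
```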

If you try a simpler expression you will see that they are not the same:
>>> import re
>>> re.search("[0-9]*", "1")
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> re.search("[0-9]*", "")
<_sre.SRE_Match object; span=(0, 0), match=''>
>>> re.search("[0-9]+", "")
>>> re.search("[0-9]+", "1")
<_sre.SRE_Match object; span=(0, 1), match='1'>
The point for your code is that (\w\w\w)+? still means one or more occurrences. The trailing ? only makes the + lazy; it does not allow zero occurrences.

Regex : match a string containing only one alphabet

If there is a string
str = "S23#"
it should match, and if
str = "WS23%"
it should not match (because it contains 2 letters).
I used re.search("^[{A-Z}?0-9()*%#+?=:._<>,!/\-]*$", str) and it matches both strings.

Just remove the part that matches uppercase letters from the character class, and put a single-letter class between two [0-9()%#+?=:._<>,!/-]* patterns.
re.match(r"^[0-9()%#+?=:._<>,!/-]*[A-Za-z][0-9()%#+?=:._<>,!/-]*$", st)
Example:
>>> s= "S23#"
>>> s1 = "WS23%"
>>> re.match(r"^[0-9()%#+?=:._<>,!/-]*[A-Za-z][0-9()%#+?=:._<>,!/-]*$", s)
<_sre.SRE_Match object; span=(0, 4), match='S23#'>
>>> re.match(r"^[0-9()%#+?=:._<>,!/-]*[A-Za-z][0-9()%#+?=:._<>,!/-]*$", s1)
>>>

^(?!(.*?[A-Z]){2})[{A-Z}?0-9()%#+?=:._<>,!/-]+$
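Try this. The negative lookahead rejects any string that contains two uppercase letters. Because the pattern is anchored with ^ and $, re.match checks the whole string. See demo.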
https://regex101.com/r/aI4rA5/2
re.match("^(?!(.*?[A-Z]){2})[{A-Z}?0-9()%#+?=:._<>,!/-]+$", str)
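The two answers behave differently on a string with no letter at all, which the following sketch tries to show. The names ALLOWED, exactly_one and at_most_one, and the third test string "23#", are hypothetical and chosen only for this comparison.

```python
import re

# Character class of permitted non-letter characters, taken from the first answer.
ALLOWED = r"[0-9()%#+?=:._<>,!/-]"

# First answer: exactly one letter, surrounded by allowed characters.
exactly_one = re.compile(rf"^{ALLOWED}*[A-Za-z]{ALLOWED}*$")

# Second answer: the lookahead forbids two uppercase letters, so zero or one is accepted.
at_most_one = re.compile(r"^(?!(.*?[A-Z]){2})[{A-Z}?0-9()%#+?=:._<>,!/-]+$")

results = {}
for candidate in ("S23#", "WS23%", "23#"):
    results[candidate] = (
        bool(exactly_one.match(candidate)),
        bool(at_most_one.match(candidate)),
    )
    print(candidate, results[candidate])
# S23# (True, True)
# WS23% (False, False)
# 23# (False, True)
```

Which behaviour is wanted depends on whether a string without any letter should count as a match.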
