Strange regex issue using findall() and search() [duplicate] - python

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 4 years ago.
I'm trying to use \w{2}\d/\d{1,2}(/\d{1,2})? in order to match the following two interfaces on a Cisco switch:
Gi1/0/1
Fa0/1
When I use re.search(), it returns the desired output.
import re
port = "Gi1/0/1 Fa0/1"
search = re.search(r'\w{2}\d/\d{1,2}(/\d{1,2})?', port)
print(search.group())
I get "Gi1/0/1" as the output.
When I use re.findall()
import re
port = "Gi1/0/1 Fa0/1"
search = re.findall(r'\w{2}\d/\d{1,2}(/\d{1,2})?', port)
print(search)
I get "['/1', '']" which is undesired.
Why doesn't findall() return ['Gi1/0/1','Fa0/1']?
Is that because I used (/\d{1,2})?, and findall() is supposed to return this part? Why is that?
How do we get ['Gi1/0/1','Fa0/1'] using findall()?

From the findall docs
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.
In your regex you have a capturing group: (/\d{1,2})?
You could make it a non-capturing group instead: (?:/\d{1,2})?
Your regex would look like:
\w{2}\d/\d{1,2}(?:/\d{1,2})?
import re
port = "Gi1/0/1 Fa0/1"
search = re.findall(r'\w{2}\d/\d{1,2}(?:/\d{1,2})?', port)
print(search)
Demo
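If the pattern has to stay as it is (for example because the capturing group is used elsewhere), one possible alternative is re.finditer, which yields Match objects, so the whole match is available through group(0). A minimal sketch, reusing the port string from the question:

```python
import re

port = "Gi1/0/1 Fa0/1"
pattern = r'\w{2}\d/\d{1,2}(/\d{1,2})?'  # original pattern, capturing group kept

# findall() returns only the contents of the single capturing group
groups_only = re.findall(pattern, port)

# finditer() yields Match objects; group(0) is always the whole match
full_matches = [m.group(0) for m in re.finditer(pattern, port)]

print(groups_only)
print(full_matches)
```

This keeps the group available through m.group(1) while still giving access to the full interface name.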

search.group() returns the entire match found by the regex \w{2}\d/\d{1,2}(/\d{1,2})?, regardless of any capturing groups; it is equivalent to search.group(0). search.group(1), by contrast, returns /1, the text matched by the first capturing group.
On the other hand, when the pattern contains capturing groups, re.findall returns the contents of those groups rather than the whole match. To get the expected result, your regex should be
(\w{2}\d/(?:\d{1,2}/)?\d{1,2})
Python Code
>>> re.findall(r'(\w{2}\d/(?:\d{1,2}/)?\d{1,2})', port)
['Gi1/0/1', 'Fa0/1']
Regex Breakdown
( #Start Capturing group
\w{2}\d/ #Match two characters in [A-Za-z0-9_] followed by a digit and slash
(?:\d{1,2}/)? #Optionally followed by one or two digits and a slash
\d{1,2} #Followed by one or two digits
) #End capturing group
P.S. From your question, I think you are matching only letters. In that case, use [A-Za-z] instead of \w.
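To illustrate the P.S., here is a small sketch comparing \w with [A-Za-z]. The third token in the sample string, 121/2, is a made-up value that is not an interface name; it is only there to show what \w lets through:

```python
import re

# "121/2" is a hypothetical non-interface token added for illustration
sample = "Gi1/0/1 Fa0/1 121/2"

# \w also accepts digits and underscores, so "121/2" is matched as well
with_word_chars = re.findall(r'\w{2}\d/\d{1,2}(?:/\d{1,2})?', sample)

# [A-Za-z] restricts the two-character prefix to ASCII letters
letters_only = re.findall(r'[A-Za-z]{2}\d/\d{1,2}(?:/\d{1,2})?', sample)

print(with_word_chars)
print(letters_only)
```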

If you want to do it with a regex, this will work:
search = re.findall(r'\w{2}\d/\d{1}(?:/\d{1})?', port)
You can also simply split the string on spaces:
>>> "Gi1/0/1 Fa0/1".split(' ')
['Gi1/0/1', 'Fa0/1']
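One caveat worth checking: \d{1} matches exactly one digit, so the pattern above would cut off two-digit port numbers. A quick sketch with hypothetical interface names (Gi1/0/12 and Fa0/24 are not from the question):

```python
import re

# Hypothetical interfaces with two-digit port numbers
ports = "Gi1/0/12 Fa0/24"

# \d{1} stops after a single digit, so the names are truncated
single_digit = re.findall(r'\w{2}\d/\d{1}(?:/\d{1})?', ports)

# \d{1,2} accepts one or two digits and keeps the full names
one_or_two = re.findall(r'\w{2}\d/\d{1,2}(?:/\d{1,2})?', ports)

print(single_digit)
print(one_or_two)
```

The split approach has no such issue, but it assumes the string contains nothing except interface names separated by spaces.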

Related

Getting the last occurrence of a match that is inside parenthesis using regular expressions

I want to use regular expressions to get the text inside parentheses in a sentence. But if the string has two or more occurrences, the pattern I am using matches everything in between. I googled it, and some sources tell me to use a negative lookahead and a backreference, but it is not working as expected.
An example of a string is:
s = "Para atuar no (GCA) do (CNPEM)"
What I want is to get just the last occurrence: "(CNPEM)"
The pattern I am using is:
pattern = "(\(.*\))(?!.*\1)"
But when I run it (using Python's re module) I get this:
output = (GCA) do (CNPEM)
How can I get just the last occurrence in this case?
You could use re.findall here, and then access the last match:
import re
s = "Para atuar no (GCA) do (CNPEM)"
last = re.findall(r'\(.*?\)', s)[-1]
print(last) # (CNPEM)
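The reason the original pattern returns too much is that .* is greedy and runs to the last closing parenthesis. The sketch below contrasts the greedy form with a negated character class, which is another common way (besides the lazy .*?) to stop at the first closing parenthesis. The guard for an empty result is an addition for illustration:

```python
import re

s = "Para atuar no (GCA) do (CNPEM)"

# Greedy: .* extends to the last ")" in the string, giving one long match
greedy = re.findall(r'\(.*\)', s)

# Negated class: [^()]* cannot cross a parenthesis, so each pair matches separately
pairs = re.findall(r'\([^()]*\)', s)

# Take the last pair, or None if the string has no parentheses at all
last = pairs[-1] if pairs else None

print(greedy)
print(pairs)
print(last)
```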

Python regex to extract number of processors [duplicate]

This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 3 years ago.
I have a string which contains the number of processors:
SQLDB_GP_Gen5_2
The number is after _Gen and before _ (the number 5). How can I extract this using python and regular expressions?
I am trying to do it like this but don't get a match:
re.match('_Gen(.*?)_', 'SQLDB_GP_Gen5_2')
I was also trying this using pandas:
x['SLO'].extract(pat = '(?<=_Gen).*?(?:(?!_).)')
But this also wasn't working. (x is a Series)
Can someone please also point me to a book/tutorial site where I can learn regex and how to use with Pandas.
Thanks,
Mick
re.match only matches at the beginning of the string. Use re.search instead, and retrieve the first capturing group:
>>> re.search(r'_Gen(\d+)_', 'SQLDB_GP_Gen5_2').group(1)
'5'
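The difference can be seen side by side in a short sketch: re.match returns None here because the string does not start with _Gen, while re.search scans the whole string. The None check at the end is an added safeguard, not part of the original answer:

```python
import re

value = 'SQLDB_GP_Gen5_2'
pattern = r'_Gen(\d+)_'

# match() anchors at position 0, where the string starts with "SQLDB", so no match
anchored = re.match(pattern, value)

# search() looks for the pattern anywhere in the string
found = re.search(pattern, value)

# Guard against None before calling group(), in case the pattern is absent
processors = found.group(1) if found else None

print(anchored)
print(processors)
```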
You need to use Series.str.extract with a pattern containing a capturing group:
x['SLO'].str.extract(r'_Gen(.*?)_', expand=False)
To only match a number, use r'_Gen(\d+)_'.
NOTES:
With Series.str.extract, you need to use a capturing group; the method only returns what the group captures.
r'_Gen(.*?)_' will match _Gen, then capture zero or more characters other than line breaks, as few as possible, and then match _. If you use \d+, it will only match one or more digits.
Using re:
re.findall(r'Gen(.*)_',text)[0]
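Note that .* in this last pattern is greedy, so it only gives the number when there is a single underscore after it. A sketch with a hypothetical longer name (SQLDB_GP_Gen5_2_x is made up for illustration) shows the difference from \d+:

```python
import re

original = 'SQLDB_GP_Gen5_2'
longer = 'SQLDB_GP_Gen5_2_x'  # hypothetical value with an extra underscore

# Greedy .* runs up to the last underscore in the string
greedy_original = re.findall(r'Gen(.*)_', original)[0]
greedy_longer = re.findall(r'Gen(.*)_', longer)[0]

# \d+ stops at the first non-digit, regardless of what follows
digits_longer = re.findall(r'Gen(\d+)_', longer)[0]

print(greedy_original, greedy_longer, digits_longer)
```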

Does this regex fail, or do I need to modify the regex to support "optional followed by"?

I am trying the following regex: https://regex101.com/r/5dlRZV/1/. I am aware that the demo uses \author and not \maketitle.
In python, I try the following:
import re
text = r'''
\author{
\small
}
\maketitle
'''
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
         re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
    for m in p.finditer(text):
        print(m.group())
Python freezes. I suspect that this has something to do with my pattern and that the SRE engine fails.
EDIT: Is there something wrong with my regex? Can it be improved to actually work? I still get the same results on my machine.
EDIT 2: Can this be fixed somehow so that the pattern supports an optional "followed by" using ?: or ?= lookaheads, so that one can capture both?
After reading the heading, "Parentheses Create Numbered Capturing Groups", on this site: https://www.regular-expressions.info/brackets.html, I managed to find the answer which is:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.
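In Python's re module, the quoted behaviour looks as follows. A group that did not take part in the match is reported as None by Match.group(), and as an empty string by re.findall(). A minimal sketch using the Set(Value)? example from the quote:

```python
import re

pattern = r'Set(Value)?'

# Group 1 does not participate when only "Set" is present
m_short = re.search(pattern, 'Set')
short_whole = m_short.group(0)
short_group = m_short.group(1)  # None, not an empty string

# Group 1 captures "Value" when it is present
m_long = re.search(pattern, 'SetValue')
long_group = m_long.group(1)

# findall() returns the group contents, with '' for the non-participating group
found = re.findall(pattern, 'Set SetValue')

print(short_whole, short_group, long_group, found)
```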

Are re.search() and re.findall() different in finding regex patterns (Python)? [duplicate]

This question already has answers here:
Difference between re.search() and re.findall()
(2 answers)
Closed 4 years ago.
the demonstration of my question is given below:
My code :
import re
p = 'goalgoalgoalgoalllllgoaloaloal'
print(re.search('g(oal){3}', p).group())
print(re.findall('g(oal){3}', p))
Output:
goaloaloal
['oal']
With the same regex pattern, re.search() finds the match to be 'goaloaloal' as I expected. However, re.findall() finds the match to be 'oal', which really surprises me. Could anyone please help to explain the cause of the difference? Thank you in advance:-)
Explanation: Sorry for the apparent duplicate. My original purpose with this question was to find the exact difference between the re.search() and re.findall() methods when dealing with parentheses in a regex pattern. I didn't even know the term "capture" before. More specifically, I wanted to know how to extract exactly the 'goaloaloal' pattern using the re.findall() method. Thanks @blhsing for the helpful answer!
This is because re.findall() returns only the substring in the capture group when there is one, while re.search() returns a Match object, and when you call the group() method of the Match object, it returns the substring that matches the entire regex regardless of capture groups.
If you want re.findall() to return the entire matching substring, you should use non-capturing groups instead:
re.findall('g(?:oal){3}', p) # returns ['goaloaloal']
It happens because of grouping. When the pattern contains groups, re.findall returns a list of the matched groups and leaves out group zero (the whole match). Groups are denoted by parentheses, so in your code you have one group, (oal). If you add a group that wraps the whole expression, you'll get the result:
import re
p = 'goalgoalgoalgoalllllgoaloaloal'
m = re.search('(g(oal){3})', p)
print(m.group()) # goaloaloal
m = re.findall('(g(oal){3})', p)
print(m) # [('goaloaloal', 'oal')]
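It may also help to see why the inner group yields a single 'oal' rather than three: a quantified capturing group keeps only the text from its last repetition. The sketch below uses a made-up string, 'goaloakoaz', so that the three repetitions differ and the last one is recognisable:

```python
import re

# Hypothetical string where each repetition of the group matches different text
sample = 'goaloakoaz'

m = re.search(r'g(oa.){3}', sample)
whole = m.group(0)      # the full match
last_rep = m.group(1)   # only the last of the three repetitions is kept

# findall() therefore also reports just the last repetition
found = re.findall(r'g(oa.){3}', sample)

print(whole, last_rep, found)
```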

How to make this a non capturing group so it does not return empty strings [duplicate]

According to the pattern match here, the matches are 213.239.250.131 and 014.10.26.06.
Yet when I run the generated Python code and print out the value of re.findall(p, test_str), I get:
[('', '', '213.239.250.131'), ('', '', '014.10.26.06')]
I could hack around the list and its tuples to get the values I'm looking for (the IP addresses), but (i) they might not always be in the same position in the tuples and (ii) I'd rather understand what's going on here so I can either tighten up the regex, or extract only IP addresses using Python's own re functionality.
Why do I get this list of tuples, why the apparent whitespace matches, and how do we ensure that only the IP addresses are returned?
Whenever you are using a capturing group, it always returns a submatch, even if it is empty/null. You have 3 capturing groups, so you will always have them in the findall result.
In regex101.com, you can see these non-participating groups by turning them on in Options.
You may tighten up your regex by removing capturing groups:
(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Or even (?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}(?:\.\d{1,3}){3}.
See a regex demo
And since the regex pattern does not contain capturing groups, re.findall will only return matches, not capturing group contents:
import re
p = re.compile(r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
test_str = "from mail.example.com (example.com. [213.239.250.131]) by\n mx.google.com with ESMTPS id xc4si15480310lbb.82.2014.10.26.06.16.58 for\n <alex#example.com> (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256\n bits=128/128); Sun, 26 Oct 2014 06:16:58 -0700 (PDT)"
print(re.findall(p, test_str))
Output of the online Python demo:
['213.239.250.131', '014.10.26.06']
These are the capturing groups.
If you use alternation (|), findall will return empty strings for the groups in the alternative that did not match.
(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
The first alternative has 2 groups:
(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})
and after the | there is the third:
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
To put it simply, each pair of parentheses defines a capturing group, which shows up in the result whether or not it matched.
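If the three-group pattern has to be kept unchanged, one option is to iterate with finditer and take group(0), which is the whole match no matter which alternative fired. The sketch below assumes a short made-up sample string rather than the mail header from the question:

```python
import re

# Three-group pattern from above, kept unchanged
pattern = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)

# Hypothetical sample with one dotted address and one colon-separated address
sample = "src 10.0.0.1 dst fe80:0:0:1:2"

# findall() returns one 3-tuple per match, with '' for groups that did not participate
tuples = pattern.findall(sample)

# group(0) is the whole match, independent of the capturing groups
addresses = [m.group(0) for m in pattern.finditer(sample)]

print(tuples)
print(addresses)
```

This avoids depending on the position of the non-empty item inside each tuple.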
