This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 3 years ago.
I have a string which contains the number of processors:
SQLDB_GP_Gen5_2
The number is after _Gen and before _ (the number 5). How can I extract this using python and regular expressions?
I am trying to do it like this but don't get a match:
re.match('_Gen(.*?)_', 'SQLDB_GP_Gen5_2')
I was also trying this using pandas:
x['SLO'].extract(pat = '(?<=_Gen).*?(?:(?!_).)')
But this also wasn't working. (x is a Series)
Can someone please also point me to a book/tutorial site where I can learn regex and how to use with Pandas.
Thanks,
Mick
re.match only tries to match at the beginning of the string. Use re.search instead, and retrieve the first capturing group:
>>> re.search(r'_Gen(\d+)_', 'SQLDB_GP_Gen5_2').group(1)
'5'
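As a small sketch of the difference, using the string from the question (the variable names are only illustrative):

```python
import re

s = 'SQLDB_GP_Gen5_2'
pattern = r'_Gen(\d+)_'

# re.match only tries position 0, where the string starts with 'SQLDB',
# so the pattern cannot match there.
at_start = re.match(pattern, s)
print(at_start)  # None

# re.search scans forward until the pattern fits somewhere in the string.
anywhere = re.search(pattern, s)
number = anywhere.group(1) if anywhere else None
print(number)  # 5
```

Checking the result for None before calling group() avoids an AttributeError when the pattern is absent.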
You need to use Series.str.extract with a pattern containing a capturing group:
x['SLO'].str.extract(r'_Gen(.*?)_', expand=False)
        ^^^^          ^^^^^^^^^^^
To only match a number, use r'_Gen(\d+)_'.
NOTES:
With Series.str.extract, you need to use a capturing group: the method returns only the text that the group captures.
r'_Gen(.*?)_' will match _Gen, then will capture any 0+ chars other than line break chars as few as possible, and then match _. If you use \d+, it will only match 1+ digits.
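To make the difference between the two patterns concrete, here is a rough sketch with plain re. Only the first sample string comes from the question; the other two are made-up values chosen to show where the patterns disagree.

```python
import re

lazy = re.compile(r'_Gen(.*?)_')     # any characters, as few as possible
digits = re.compile(r'_Gen(\d+)_')   # one or more digits only

# The second and third samples are hypothetical, for illustration only.
samples = ['SQLDB_GP_Gen5_2', 'SQLDB_GP_GenX_2', 'SQLDB_GP_Gen_2']

results = {}
for s in samples:
    a = lazy.search(s)
    b = digits.search(s)
    results[s] = (a.group(1) if a else None, b.group(1) if b else None)
    print(s, results[s])
```

The lazy pattern accepts a non-numeric or even empty capture, while the \d+ version simply fails to match in those cases.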
Using re:
re.findall(r'Gen(.*)_',text)[0]
This question already has answers here:
Extract part of a regex match
(11 answers)
Closed 3 years ago.
I'm trying to parse HTML in Python that has an inline script in it. I need to find a string inside of the script, then extract the value. I've been trying to do this in regex for the past few hours, but I'm still not convinced this is the correct approach.
Here is a sample:
['key_to_search_for']['post_date'] = '10 days ago';
The result I want to extract is: 10 days ago
This regex gets me part of the way, but I can't figure out the full match:
^\[\'key_to_search_for\'\]\[\'post_date\'\] = '(\d{1,2})+( \w)
Regex playground
However, even once I can match with regex, I'm not sure the best way to get only the value. I was thinking of just replacing the keys with blanks, like .replace('['key_to_search_for']['post_date'] = '',''), but that seems inefficient.
Should I be matching the regex then replacing? Is there a better way to handle this?
You can extract the value using a single capturing group and match the 2 words using a quantifier for \w+.
The value is in capture group 1.
^\['key_to_search_for'\]\['post_date'\] = '(\d{1,2} \w+ \w+)';$
Regex demo
Or use a negated character class that matches any character except a ':
^\['key_to_search_for'\]\['post_date'\] = '([^']+)';$
Regex demo
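In Python, the second pattern could be applied roughly like this. The script text below is a made-up stand-in for whatever was extracted from the HTML, and re.MULTILINE is an assumption so that ^ and $ can match at each line of a multi-line script.

```python
import re

# Hypothetical stand-in for the inline script text pulled out of the HTML.
script = """
var data = {};
['key_to_search_for']['post_date'] = '10 days ago';
['key_to_search_for']['author'] = 'someone';
"""

pattern = re.compile(
    r"^\['key_to_search_for'\]\['post_date'\] = '([^']+)';$",
    re.MULTILINE,  # let ^ and $ match at the start and end of each line
)

m = pattern.search(script)
post_date = m.group(1) if m else None  # group 1 holds only the quoted value
print(post_date)  # 10 days ago
```

Reading group 1 directly means there is no need for a separate replace step to strip the keys.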
This question already has answers here:
Difference between re.search() and re.findall()
(2 answers)
Closed 4 years ago.
the demonstration of my question is given below:
My code :
p = 'goalgoalgoalgoalllllgoaloaloal'
print(re.search('g(oal){3}',p).group())
re.findall('g(oal){3}',p)
Output:
goaloaloal
['oal']
With the same regex pattern, re.search() finds the match to be 'goaloaloal' as I expected. However, re.findall() finds the match to be 'oal', which really surprises me. Could anyone please help to explain the cause of the difference? Thank you in advance:-)
Explanation: Sorry for the seeming duplicate. My original purpose in asking was to find the exact difference between the re.search() and re.findall() methods when the regex pattern contains parentheses. I didn't even know the term "capture" before. More specifically, I wanted to know how to extract exactly the 'goaloaloal' pattern using the re.findall() method. Thanks @blhsing for the helpful answer!
This is because re.findall() returns only the substring in the capture group when there is one, while re.search() returns a Match object, and when you call the group() method of the Match object, it returns the substring that matches the entire regex regardless of capture groups.
If you want re.findall() to return the entire matching substring, you should use non-capturing groups instead:
re.findall('g(?:oal){3}', p) # returns ['goaloaloal']
It happens because of grouping. When the pattern contains groups, re.findall returns the matched groups rather than the whole match (group zero). Groups are denoted by round brackets, so in your code you have one group, (oal). If you add a group that wraps the whole expression, you'll get the result:
import re
p = 'goalgoalgoalgoalllllgoaloaloal'
m = re.search('(g(oal){3})', p)
print(m.group()) # goaloaloal
m = re.findall('(g(oal){3})', p)
print(m) # [('goaloaloal', 'oal')]
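Another option, sketched here with the same string, is re.finditer: it yields Match objects, so group(0) gives the whole match no matter how many capturing groups the pattern has.

```python
import re

p = 'goalgoalgoalgoalllllgoaloaloal'

# One capturing group: findall returns what the group captured.
with_group = re.findall(r'g(oal){3}', p)
print(with_group)     # ['oal']

# Non-capturing group: findall returns the whole match.
without_group = re.findall(r'g(?:oal){3}', p)
print(without_group)  # ['goaloaloal']

# finditer yields Match objects; group(0) is always the whole match.
whole = [m.group(0) for m in re.finditer(r'g(oal){3}', p)]
print(whole)          # ['goaloaloal']
```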
This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 4 years ago.
I'm trying to use \w{2}\d/\d{1,2}(/\d{1,2})? in order to match the following two interfaces on a Cisco switch:
Gi1/0/1
Fa0/1
When I use re.search(), it returns the desired output.
import re
port = "Gi1/0/1 Fa0/1"
search = re.search(r'\w{2}\d/\d{1,2}(/\d{1,2})?', port)
print(search.group())
I get "Gi1/0/1" as the output.
When I use re.findall()
import re
port = "Gi1/0/1 Fa0/1"
search = re.findall(r'\w{2}\d/\d{1,2}(/\d{1,2})?', port)
print(search)
I get "['/1', '']" which is undesired.
Why doesn't findall() return ['Gi1/0/1','Fa0/1']?
Is that because I used (/\d{1,2})?, and findall() is supposed to return this part? Why is that?
How do we get ['Gi1/0/1','Fa0/1'] using findall()?
From the findall docs
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.
In your regex you have a capturing group: (/\d{1,2})?
You could make it a non-capturing group instead: (?:/\d{1,2})?
Your regex would look like:
\w{2}\d/\d{1,2}(?:/\d{1,2})?
import re
port = "Gi1/0/1 Fa0/1"
search = re.findall(r'\w{2}\d/\d{1,2}(?:/\d{1,2})?', port)
print(search)  # ['Gi1/0/1', 'Fa0/1']
Demo
search.group() returns the entire match found by the regex \w{2}\d/\d{1,2}(/\d{1,2})?, regardless of capturing groups. It is equivalent to search.group(0). search.group(1), by contrast, returns /1, the result of the first capturing group.
On the other hand, when the pattern contains groups, re.findall returns what the groups matched. To get the expected result, your regex should be
(\w{2}\d/(?:\d{1,2}/)?\d{1,2})
Python Code
>>> re.findall(r'(\w{2}\d/(?:\d{1,2}/)?\d{1,2})', port)
['Gi1/0/1', 'Fa0/1']
Regex Breakdown
( #Start capturing group
\w{2}\d/ #Match two characters in [A-Za-z0-9_] followed by a digit and a slash
(?:\d{1,2}/)? #Optionally followed by one or two digits and a slash
\d{1,2} #Followed by one or two digits
) #End capturing group
P.S. From your question, it looks like you only need to match letters. In that case, use [A-Za-z] instead of \w.
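As a sketch of how the shape of findall's result depends on the number of capturing groups, using the port string from the question (the two-group pattern is only an illustration, not something the question asked for):

```python
import re

port = "Gi1/0/1 Fa0/1"

# No capturing groups: a list of whole matches.
no_groups = re.findall(r'[A-Za-z]{2}\d/\d{1,2}(?:/\d{1,2})?', port)
print(no_groups)   # ['Gi1/0/1', 'Fa0/1']

# One capturing group: a list of that group's text ('' when it did not take part).
one_group = re.findall(r'\w{2}\d/\d{1,2}(/\d{1,2})?', port)
print(one_group)   # ['/1', '']

# Two capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'([A-Za-z]{2})(\d/\d{1,2}(?:/\d{1,2})?)', port)
print(two_groups)  # [('Gi', '1/0/1'), ('Fa', '0/1')]
```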
If you want to do it with a regex, this will work for the ports shown:
search = re.findall(r'\w{2}\d/\d{1}(?:/\d{1})?', port)
You may do this too:
>>> "Gi1/0/1 Fa0/1".split(' ')
['Gi1/0/1', 'Fa0/1']
This question already has answers here:
Reference - What does this regex mean?
(1 answer)
What is a non-capturing group in regular expressions?
(18 answers)
Closed 5 years ago.
I'm trying to create a function to capture phone numbers written in a canonical form, (XXX)XXX-XXXX or XXX-XXX-XXXX, with additional conditions. This is my approach:
def parse_phone2(s):
    phone_number = re.compile(r'''^\s*\(?    # Beginning of string, ignore leading spaces
                       ([0-9]{3})            # Area code
                       \)?\s*|-?             # Match 0 or 1 ')' followed by 0 or more spaces or match a single hyphen
                       ([0-9]{3})            # Three digits
                       -?                    # Hyphen
                       ([0-9]{4})            # Four digits
                       \s*$                  # End of string, ignore trailing spaces''', re.VERBOSE)
    try:
        return (phone_number.match(s).groups())
    except AttributeError as e:
        raise ValueError
I was failing the test case ' (404) 555-1212 ', but another question on SO suggested replacing \)?\s*|-? with (?:\)?\s*|-?), and it works. The problem is that I don't understand the difference between the two, nor the purpose of (?:...) beyond creating non-capturing groups. The docs aren't clear enough for me either.
https://docs.python.org/3/library/re.html
Consider a simpler example:
re.compile(r'(?:a|b)*')
which simply matches a (possibly empty) string of as and bs. The only difference between this and
re.compile(r'(a|b)*')
is that the matching engine will capture the last character matched by the group for retrieval with the group method. Using a non-capture group is just an optimization to speed up the match (or at least save memory) when a capture group isn't needed.
You have an alternation token (|) in the part you replaced. Alternation matches either everything before the token or everything after it. Since splitting a regex across lines as you've done here doesn't count as grouping, the alternation applies not just to what's before or after the | on the same line, but to the lines before and after as well.
Grouping should instead be done by surrounding the group in parentheses, BUT by default this will also "capture" the group, meaning it will return the match as one of the groups when you call groups(). To specify that it should not, you need to add ?:.
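A minimal sketch of the effect, using one-line versions of the question's pattern (written without re.VERBOSE purely to keep the example short):

```python
import re

# Without grouping, | splits the WHOLE pattern into two alternatives:
#   ^\s*\(?([0-9]{3})\)?\s*      or      -?([0-9]{3})-?([0-9]{4})\s*$
ungrouped = re.compile(r'^\s*\(?([0-9]{3})\)?\s*|-?([0-9]{3})-?([0-9]{4})\s*$')

# With (?:...), the alternation is confined to the separator after the area code.
grouped = re.compile(r'^\s*\(?([0-9]{3})(?:\)?\s*|-?)([0-9]{3})-?([0-9]{4})\s*$')

s = ' (404) 555-1212 '
ungrouped_result = ungrouped.match(s).groups()
grouped_result = grouped.match(s).groups()
print(ungrouped_result)  # ('404', None, None)
print(grouped_result)    # ('404', '555', '1212')

dashed_result = grouped.match('404-555-1212').groups()
print(dashed_result)     # ('404', '555', '1212')
```

The ungrouped pattern still "matches", but only through its first alternative, which is why the last two groups come back as None.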
This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 1 year ago.
I have something like
store(s)
at the end of a line, like "1 store(s)".
I want to match it using Python regular expression.
I tried something like re.match('store\(s\)$', text)
but it's not working.
This is the code I tried:
import re
s = '1 store(s)'
if re.match('store\(s\)$', s):
    print('match')
In more or less direct reply to your comment
Try this
import re
s = '1 store(s)'
# re.search looks for the pattern anywhere in the string, not only at the start
if re.search(r'store\(s\)$', s):
    print('match')
The solution is to use re.search instead of re.match: the latter only tries to match the regexp at the beginning of the string, while the former looks for a substring anywhere in the string that matches the expression.
Python offers two different primitive
operations based on regular
expressions: match checks for a match
only at the beginning of the string,
while search checks for a match
anywhere in the string (this is what
Perl does by default)
Straight from the docs, but it does come up a lot.
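As a quick sketch of how the three entry points behave on the string from the question (re.fullmatch is included only for comparison; it was not mentioned above):

```python
import re

s = '1 store(s)'
pattern = r'store\(s\)$'

m_match = re.match(pattern, s)        # anchored at position 0: no match
m_search = re.search(pattern, s)      # scans the string: finds 'store(s)' at the end
m_full = re.fullmatch(pattern, s)     # must cover the whole string: no match

# fullmatch succeeds once the pattern also accounts for the leading '1 '.
m_full_ok = re.fullmatch(r'\d+ store\(s\)', s)

print(m_match, m_full)                     # None None
print(m_search.group(), m_full_ok.group())
```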
Have you considered re.match(r'(.*)store\(s\)$', text)?